Three Monte Carlo permutation tests
- 3. February 2023
- #statistics
Tests of significance in the randomized experiment have frequently been presented by way of normal law theory, whereas their validity stems from randomization theory. – O. Kempthorne, 1955.
In this article we’ll use permutation tests to investigate three different problems. The data are taken from the Handbook of Small Data Sets, more specifically from the GitHub repository Handbook-of-Small-Data-Sets.
The code presented in this article was used to demonstrate some basic statistical techniques to a group of software developers in a seminar. Software developers tend to prefer code and simulation over theory and equations. I chose to focus on Monte Carlo permutation tests in the seminar, since:
- They are easy to implement (roughly ten lines of Python code).
- No mathematical formulas are needed.
- The tests are quite intuitive and easy to understand.
- They work well in many cases, and make few assumptions about underlying probability distributions.
In general, Monte Carlo permutation tests proceed as follows:
- Determine a suitable test statistic \(T\). For instance, when testing whether two groups are different, we can use `np.mean(group_A) - np.mean(group_B)` as a test statistic.
- Determine the observed value of the test statistic \(T_{\operatorname{obs}}\). In the case of testing whether two groups are different we would simply compute `T_obs = np.mean(group_A) - np.mean(group_B)`.
- Determine a null hypothesis \(H_0\). Continuing the example with group means, the null hypothesis would be that the group labels \(A\) and \(B\) do not matter (labels are exchangeable), so switching the labels randomly would have no effect.
- Use Monte Carlo sampling to determine the distribution of \(T\) under \(H_0\), \(p(T | H_0 \text{ is true})\). In our example we would randomly assign `len(group_A)` observations the label \(A\) and the remaining observations the label \(B\), then compute \(T\) on this resampled data set. We repeat this simulation many times.
- Compare \(T_{\operatorname{obs}}\) with \(p(T | H_0 \text{ is true})\) to get a \(p\)-value.
If you’re a software developer, looking at the code below will make the pieces fall into place.
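To see the whole recipe in one place before the real examples, here is a minimal self-contained sketch of the two-group case. The function name and its arguments are placeholders of my own, not from any library:

import numpy as np

def permutation_test_means(group_A, group_B, num_simulations=9999, seed=0):
    # Observed value of the test statistic
    T_obs = np.mean(group_A) - np.mean(group_B)
    # Pool the data: under H_0 the group labels are exchangeable
    pooled = np.concatenate([group_A, group_B])
    rng = np.random.default_rng(seed)
    T_sim = np.empty(num_simulations)
    for i in range(num_simulations):
        permuted = rng.permutation(pooled)  # shuffle, then relabel
        T_sim[i] = np.mean(permuted[:len(group_A)]) - np.mean(permuted[len(group_A):])
    # One-sided p-value: fraction of simulations at least as extreme as T_obs
    return np.mean(T_sim >= T_obs)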
We’ll use the following Python environment:
Python implementation: CPython
Python version : 3.9.4
IPython version : 7.22.0
numpy : 1.23.5
scipy : 1.9.3
matplotlib: 3.6.2
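This listing looks like output from the watermark IPython extension; assuming that is what was used, magics along these lines reproduce it:

%load_ext watermark
%watermark -v           # Python and IPython versions
%watermark --iversions  # versions of imported packages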
We start off by importing packages.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import itertools
Permutation test for differences between group means
Consider the following problem, which is dataset number 242 in the Handbook of Small Data Sets. We wish to determine whether a restricted diet has an effect on the total lifespan of rats.
Here are the data:
rats_restricted = np.array([105,193,211,236,302,363,389,390,391,403,530,604,
605,630,716,718,727,731,749,769,770,789,804,810,
811,833,868,871,875,893,897,901,906,907,919,923,
931,940,957,958,961,962,974,979,982,1001,1008,
1010,1011,1012,1014,1017,1032,1039,1045,1046,
1047,1057,1063,1070,1073,1076,1085,1090,1094,
1099,1107,1119,1120,1128,1129,1131,1133,1136,
1138,1144,1149,1160,1166,1170,1173,1181,1183,
1188,1190,1203,1206,1209,1218,1220,1221,1228,
1230,1231,1233,1239,1244,1258,1268,1294,1316,
1327,1328,1369,1393,1435])
rats_free_eating = np.array([89,104,387,465,479,494,496,514,532,536,545,547,
548,582,606,609,619,620,621,630,635,639,648,652,
653,654,660,665,667,668,670,675,677,678,678,681,
684,688,694,695,697,698,702,704,710,711,712,715,
716,717,720,721,730,731,732,733,735,736,738,739,
741,743,746,749,751,753,764,765,768,770,773,777,
779,780,788,791,794,796,799,801,806,807,815,836,
838,850,859,894,963])
A visual summary of the data can be created with a histogram.
plt.figure(figsize=(6, 3))
plt.hist(rats_free_eating, alpha=0.5, density=True, label="Free eating", bins="fd")
plt.hist(rats_restricted, alpha=0.5, density=True, label="Restricted eating", bins="fd")
plt.legend()
plt.yticks([], [])
plt.xlabel("Weight")
plt.tight_layout()
To perform a permutation test, we assume that the group labels are exchangeable under the null hypothesis \(H_0\). In that case, we can randomly rearrange the labels to determine the distribution of the test statistic under \(H_0\). The following code defines a test statistic and simulates it under \(H_0\).
def test_statistic(group1, group2):
return np.mean(group1) - np.mean(group2)
def permute_groups(group1, group2):
generator = np.random.default_rng(1) # Random number generator
size_group1 = len(group1)
stacked_data = np.hstack((group1, group2))
while True:
# Randomly permute the data, yield the new groups
data_permutation = generator.permutation(stacked_data)
yield data_permutation[:size_group1], data_permutation[size_group1:]
# Observed test statistic
observed_diff = test_statistic(rats_free_eating, rats_restricted)
# Simulations of the test statistic under H_0
num_simulations = 9999
permutations = permute_groups(rats_free_eating, rats_restricted)
permutations_sliced = itertools.islice(permutations, num_simulations)
simulated_mean_diffs = np.array([test_statistic(g1, g2)
for (g1, g2) in permutations_sliced])
We use matplotlib to visualize the results:
# Create plot
plt.figure(figsize=(6, 3))
plt.title(f"p-value = {np.mean(simulated_mean_diffs >= observed_diff).round(4)}")
plt.hist(simulated_mean_diffs, bins="fd", label="Simulated")
plt.axvline(x=observed_diff, color="black", label="Observed")
plt.legend()
plt.yticks([], [])
plt.xlabel("Difference in group means")
plt.tight_layout()
The observed value is extreme, far beyond what we would expect if \(H_0\) were true. As a result we reject \(H_0\) and claim that the difference between the group means is statistically significant. Rats on a restricted diet really do live longer, and the probability that we observed these results due to chance is vanishingly small.
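As a sanity check, SciPy versions 1.8 and newer (including the 1.9.3 used here) ship scipy.stats.permutation_test, which automates this procedure. Assuming I have its API right, the equivalent call is:

def statistic(group1, group2):
    return np.mean(group1) - np.mean(group2)

result = stats.permutation_test(
    (rats_restricted, rats_free_eating),
    statistic,
    permutation_type="independent",  # shuffle observations between the groups
    n_resamples=9999,
    alternative="greater",  # H_1: restricted-diet rats live longer
)
print(result.pvalue)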
Permutation test for differences between paired observations
Next, consider the following data set, showing murder rates in \(30\) US cities in 1960 and 1970. We want to determine whether the murder rate increased from 1960 to 1970.
murder_rates_1960 = np.array([10.1,10.6,8.2,4.9,11.5,17.3,12.4,
11.1,8.6,10.0,4.4,13.0,9.3,11.7,
9.1,7.9,4.5,8.1,17.7,11.0,10.8,
12.5,8.9,4.4,6.4,3.8,14.2,6.6,6.2,3.3])
murder_rates_1970 = np.array([20.4,22.1,10.2,9.8,13.7,24.7,15.4,12.7,
13.3,18.4,3.9,14.0,11.1,16.9,16.2,8.2,
12.6,17.8,13.1,15.6,14.7,12.6,7.9,11.2,
14.9,10.5,15.3,11.4,5.5,6.6])
Again, we start by visualizing the data for the \(30\) cities using a scatter plot in matplotlib.
plt.figure(figsize=(6, 3))
plt.scatter(murder_rates_1960, murder_rates_1970, zorder=5)
plt.plot(np.linspace(3, 25), np.linspace(3, 25), color="k")
plt.xlabel("1960")
plt.ylabel("1970")
plt.grid(True, ls="--", zorder=0, alpha=0.33)
plt.tight_layout()
In \(26\) out of \(30\) cities the murder rate increased. Could this be due to chance? To find out, we will randomly permute the two observations within each city. The null hypothesis is that, for each city, \(x_{\text{before}}\) and \(x_{\text{after}}\) are exchangeable. The test statistic is the mean of the within-city differences.
Since permuting the two observations within a city either keeps that city's difference or flips its sign, i.e. \(x_{\text{before}} - x_{\text{after}} = -(x_{\text{after}} - x_{\text{before}})\), we can simply sample signs from \([1, -1]\) and multiply each difference by either \(1\) or \(-1\). My friend Jens noticed this, and it lets us implement a fast vectorized algorithm. We precompute the differences once, and then multiply the difference for each city by \(1\) or \(-1\) at random.
differences = murder_rates_1970 - murder_rates_1960
observed_mean_diff = np.mean(differences) # Observed value of test statistic
num_simulations = 9999
signs = np.random.choice([1, -1], size=(num_simulations, len(differences)))
simulated_diffs = np.mean(differences * signs, axis=1) # Test statistic under H_0
p_value = np.mean(simulated_diffs >= observed_mean_diff)
Visualizing the results, we obtain the following figure:
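The figure comes from essentially the same plotting code as in the first example; a minimal sketch, reusing p_value, simulated_diffs and observed_mean_diff from above:

plt.figure(figsize=(6, 3))
plt.title(f"p-value = {p_value.round(4)}")
plt.hist(simulated_diffs, bins="fd", label="Simulated")
plt.axvline(x=observed_mean_diff, color="black", label="Observed")
plt.legend()
plt.yticks([], [])
plt.xlabel("Mean difference in murder rates")
plt.tight_layout()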
We reject \(H_0\) and claim statistical significance.
Permutation test for correlation coefficient
The third dataset contains information about people: their heights and pulses (beats per minute). Is there a correlation between height and pulse?
Here are the data:
patient_height = np.array([160,172,167,185,162,163,175,177,185,165,162,
182,173,162,167,172,170,177,170,168,163,178,
158,182,157,167,160,170,170,160,177,182,166,
168,170,155,148,175,175,168,160,180,153,175,
185,145,165,170,165,175])
patient_pulse = np.array([68,116,80,80,84,95,80,80,80,76,80,100,92,88,92,
90,80,90,80,90,80,80,80,76,80,80,78,84,90,80,
80,80,72,80,80,80,82,104,76,80,84,68,70,84,80,
64,82,84,84,72])
Let’s visualize it.
plt.figure(figsize=(6, 3))
plt.scatter(patient_height, patient_pulse, zorder=5)
plt.grid(True, ls="--", zorder=0, alpha=0.33)
plt.xlabel("Height")
plt.ylabel("Pulse")
plt.tight_layout()
It’s not immediately clear whether there is a correlation. The Pearson correlation coefficient is \(0.2182\). Could this be due to chance?
We could use any sensible measure of correlation here, but let's stick to Pearson. The null hypothesis \(H_0\) is that there is no relationship between height and pulse. To simulate from \(H_0\) we permute either the pulse data or the height data (permuting both is superfluous).
def corr(group1, group2):
# return stats.spearmanr(group1, group2).correlation # <- an alternative
return stats.pearsonr(group1, group2).statistic
def permute_group(group1):
generator = np.random.default_rng(1) # Seed generator
while True:
yield generator.permutation(group1)
observed_correlation = corr(patient_height, patient_pulse)
# Simulations
num_simulations = 9999
permutations = permute_group(patient_height)
perms_sliced = itertools.islice(permutations, num_simulations)
simulated_correlations = np.array([corr(permuted_height, patient_pulse)
                                   for permuted_height in perms_sliced])
Visualizing the result, we obtain the following figure:
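It can be generated with the same plotting code as before; the one-sided \(p\)-value computation is included here since the next paragraph quotes it:

p_value = np.mean(simulated_correlations >= observed_correlation)

plt.figure(figsize=(6, 3))
plt.title(f"p-value = {p_value.round(4)}")
plt.hist(simulated_correlations, bins="fd", label="Simulated")
plt.axvline(x=observed_correlation, color="black", label="Observed")
plt.legend()
plt.yticks([], [])
plt.xlabel("Pearson correlation coefficient")
plt.tight_layout()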
The \(p\)-value is \(0.0608\) with \(9999\) simulations, which is not significant at the standard rejection threshold of \(0.05\). We therefore do not reject \(H_0\): there is not enough evidence of a correlation between height and pulse.
Summary and references
Monte Carlo permutation tests offer an intuitive non-parametric approach to hypothesis testing. A dash of creativity combined with common sense is required to determine test statistics and null hypotheses, but once that is done the computer does the rest.
Permutation tests are very flexible. In the first example, we could've used the difference between medians instead of the difference between means. In the second example, we could've used ratios instead of differences. In the third example, we could've used Spearman's rho or Kendall's tau instead of Pearson's linear correlation coefficient. Choosing the appropriate test statistic is a modeling choice, but running the code with an alternative test statistic requires only trivial code changes, as the sketch below shows.
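For instance, switching the first example to a median-based statistic only means redefining the test statistic; everything else runs unchanged:

def test_statistic(group1, group2):
    # Difference between group medians instead of group means
    return np.median(group1) - np.median(group2)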
Despite the virtues of permutation tests, traditional parametric approaches still have their place. We did not go into any theory behind permutation tests in this article, but interested readers can take a look at the book “Permutation, Parametric and Bootstrap Tests of Hypotheses.”