In The Design of Experiments, Fisher describes an experiment conducted by Charles Darwin to measure the effect of cross-fertilization on plant growth. In Darwin's experiment, pairs of plants were grown from the same batch of seed in the same pot under the same conditions, except that one was self-fertilized and the other was cross-fertilized.
The evident object of these precautions is to increase the sensitiveness of the experiment, by making such differences in growth rate as were to be observed as little as possible dependent from environmental circumstances, and as much as possible, therefore, from intrinsic differences due to their mode of origin.
The x and y arrays below record the heights of the cross-fertilized and self-fertilized plants, respectively.
import numpy as np
x = np.array([23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375, 18.25, 21.625, 23.25, 21, 22.125, 23, 12]) # (in) cross-fertilized
y = np.array([17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 15.25, 16.5, 18, 16.25, 18, 12.75, 15.5, 18]) # (in) self-fertilized
assert len(x) == len(y) # elements at corresponding positions form a pair
The null hypothesis was that the method of fertilization would have no effect, and Darwin's alternative hypothesis was that cross-fertilized plants would be taller. Fisher recommends the t-test, which is implemented for paired ("related") samples by stats.ttest_rel.
from scipy import stats
# alternative hypothesis 'greater': the mean of the distribution underlying
# `x` is greater than the mean of the distribution underlying `y`
res_t = stats.ttest_rel(x, y, alternative='greater')
res_t
Ttest_relResult(statistic=2.1479874613311205, pvalue=0.024851472010900447)
However, the t-test is derived under the assumption that the samples were drawn from a normal population. On this point Fisher remarks:
There has, however, in recent years, been a tendency for theoretical statisticians, not closely in touch with the requirements of experimental data, to stress the element of normality, in the hypothesis tested, as though it were a serious limitation to the test applied.
(The same is true today!) And so he proposes a different test that does not rely on the assumption of normality:
On the hypothesis that the two series of seeds are random samples from identical populations, and that their sites have been assigned to members of each pair independently at random, the 15 differences [in height between pairs] would each have occurred with equal frequency with a positive or with a negative sign...
Since ex hypothesi each of these $2^{15}$ combinations will occur by chance with equal frequency, a knowledge of how many of them are equal to or greater than the value actually observed affords a direct arithmetical test of the significance of this value.
Fisher proceeds to perform the following test.
from itertools import product
import matplotlib.pyplot as plt
def statistic(d):
    # `d` contains the differences in height between paired plants;
    # the statistic is the sum of these differences
    return np.sum(d)

# compute the statistic for all possible combinations of signs on the elements of `d`
def null_distribution(d):
    signs = (-1, 1)
    n = len(d)  # number of observations per sample
    null_distribution = []
    for dsigns in product(*[signs]*n):
        stat = statistic(d * dsigns)
        null_distribution.append(stat)
    return null_distribution
d = x - y
null_dist = null_distribution(d)
assert len(null_dist) == 2**15
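As an aside, the explicit loop over sign combinations can also be written with NumPy broadcasting. This is just a rearrangement of the computation above, not part of Fisher's description; `d` is redefined inline so the snippet is self-contained.

```python
import numpy as np
from itertools import product

# paired differences d = x - y from Darwin's data above
d = np.array([6.125, -8.375, 1.0, 2.0, 0.75, 2.875, 3.5, 5.125,
              1.75, 3.625, 7.0, 3.0, 9.375, 7.5, -6.0])

# all 2**15 sign vectors, one per row, as a (32768, 15) array of +/-1
signs = np.array(list(product((-1, 1), repeat=len(d))))
# one value of the statistic (the sum of signed differences) per row
null_dist_vec = (signs * d).sum(axis=1)

assert null_dist_vec.shape == (2**15,)
# 863 of the 32768 sums are at least as large as the observed sum
assert np.sum(null_dist_vec >= d.sum()) == 863
```

The broadcasting version trades the Python-level loop for a single array multiplication, which is considerably faster for this enumeration size.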
bins = np.unique(null_dist).tolist()
bins.append(np.max(null_dist)+1)
plt.hist(null_dist, density=True, bins=bins)
plt.xlabel("Value of test statistic")
plt.ylabel("Observed Frequency")
[Histogram of the null distribution of the test statistic]
In just 863 cases out of 32,768 the total deviation will have a positive value as great as or greater than that observed.
The $p$-value is simply the ratio of these two counts.
assert np.sum(null_dist >= statistic(d)) == 863
pvalue = np.sum(null_dist >= statistic(d)) / len(null_dist)
print(f"t-test p-value: {res_t.pvalue}")
print(f"permutation test p-value: {pvalue}")
t-test p-value: 0.024851472010900447
permutation test p-value: 0.026336669921875
The two tests agree remarkably well in this case, suggesting the applicability of the t-test even when the original data are not strictly normally distributed.
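It is natural to ask how far these data actually are from normality. One quick check, an aside that is not part of Fisher's analysis, is the Shapiro-Wilk test applied to the paired differences:

```python
import numpy as np
from scipy import stats

x = np.array([23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375, 18.25,
              21.625, 23.25, 21, 22.125, 23, 12])
y = np.array([17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 15.25,
              16.5, 18, 16.25, 18, 12.75, 15.5, 18])

# Shapiro-Wilk test of the null hypothesis that the paired
# differences are drawn from a normal distribution
res = stats.shapiro(x - y)
print(res.statistic, res.pvalue)  # a small p-value would cast doubt on normality
```

With only 15 observations such a test has limited power, so it should be read as a rough diagnostic rather than a verdict on normality.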
permutation_test minimizes the code required to perform the same procedure. The most essential difference here compared to the independent-sample tests above is that we pass permutation_type='samples' instead of permutation_type='independent'. Note also that we can simply pass np.sum as the statistic because it satisfies the required interface, even with vectorized=True.
# Note that the first argument of `permutation_test` is a sequence containing all the samples.
# We only have one sample here, but it still needs to be in a sequence.
res = stats.permutation_test((d,), np.sum, alternative='greater', permutation_type='samples', vectorized=True, n_resamples=np.inf)
assert res.pvalue == pvalue
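Enumerating all sign assignments is feasible here because $2^{15}$ is small, but the count doubles with each additional pair. For larger samples, a finite n_resamples makes permutation_test draw random sign assignments instead, yielding a Monte Carlo estimate of the exact $p$-value. The seed below is arbitrary, and the data are redefined so the snippet runs standalone.

```python
import numpy as np
from scipy import stats

x = np.array([23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375, 18.25,
              21.625, 23.25, 21, 22.125, 23, 12])
y = np.array([17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 15.25,
              16.5, 18, 16.25, 18, 12.75, 15.5, 18])
d = x - y

# draw 9999 random sign assignments instead of enumerating all 2**15;
# the p-value should land close to the exact value of ~0.02634
res = stats.permutation_test((d,), np.sum, alternative='greater',
                             permutation_type='samples', vectorized=True,
                             n_resamples=9999,
                             random_state=np.random.default_rng(314159))
```

The Monte Carlo estimate carries sampling error on the order of $\sqrt{p(1-p)/n_{\text{resamples}}}$, which is well under 0.01 here.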
The value of the permutation_type, "samples", comes from the fact that permuting the signs of the differences is equivalent to permuting the sample to which each paired observation is assigned. That is, we can obtain the same $p$-value by performing a two-sample test in which observations are exchanged between the two samples in all possible ways without breaking up pairs. This is what permutation_test does with permutation_type='samples' when there are two or more samples.
def statistic(x, y, axis=0):
    # observations will be permuted between samples `x` and `y`,
    # changing the sign of the corresponding element of `d`
    d = x - y
    return np.sum(d, axis=axis)
res = stats.permutation_test((x, y), statistic, alternative='greater', permutation_type='samples', n_resamples=np.inf)
assert res.pvalue == pvalue
Other common hypothesis tests can be performed as paired-sample permutation tests. We continue with several comparisons against scipy.stats.wilcoxon to further illustrate the usage of permutation_test with permutation_type='samples'.
wilcoxon is to ttest_rel as mannwhitneyu is to ttest_ind; it is a nonparametric permutation test of the null hypothesis that each observation in a pair is drawn from the same distribution. Because it is a paired-sample test, it computes the $p$-value by reassigning observations between the samples in all possible ways while maintaining the pairs.
rng = np.random.default_rng()
x = rng.random(7)
y = rng.random(7)
# Each element in `x` is paired with the corresponding element of `y`
stats.wilcoxon(x, y, alternative='greater')
WilcoxonResult(statistic=10.0, pvalue=0.765625)
Using wilcoxon to compute only the statistic, we can use permutation_test to calculate precisely the same $p$-value.
def statistic(x, y):
    return stats.wilcoxon(x, y, alternative='greater', method='approx').statistic
res = stats.permutation_test((x, y), statistic, alternative='greater', permutation_type='samples')
res.pvalue
C:\Users\matth\Desktop\scipy\scipy\stats\_morestats.py:3380: UserWarning: Sample size too small for normal approximation.
  warnings.warn("Sample size too small for normal approximation.")
0.765625
(The warning can be safely ignored because it is referring to the $p$-value returned by the test, whereas we are using only the statistic value.)
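If the warning is distracting, it can be filtered inside the statistic callable. This is a sketch using the standard warnings module; the data are regenerated here with an arbitrary seed, so the resulting $p$-value will differ from the run above.

```python
import warnings
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)  # arbitrary seed for reproducibility
x = rng.random(7)
y = rng.random(7)

def statistic(x, y):
    with warnings.catch_warnings():
        # the warning concerns the p-value of `wilcoxon`, which we
        # discard; only the statistic is used, so ignoring it is safe
        warnings.simplefilter("ignore", UserWarning)
        return stats.wilcoxon(x, y, alternative='greater',
                              method='approx').statistic

res = stats.permutation_test((x, y), statistic, alternative='greater',
                             permutation_type='samples')
```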
Again, the advantage of permutation_test is its ability to handle ties in the data; wilcoxon does not.
x = rng.integers(0, 5, size=7)
y = rng.integers(0, 5, size=7)
stats.wilcoxon(x, y, method='exact')
C:\Users\matth\Desktop\scipy\scipy\stats\_morestats.py:3366: UserWarning: Exact p-value calculation does not work if there are zeros. Switching to normal approximation.
  warnings.warn("Exact p-value calculation does not work if there are "
WilcoxonResult(statistic=4.5, pvalue=0.8539232992870668)
res = stats.permutation_test((x, y), statistic, permutation_type='samples')
res.statistic, res.pvalue
(4.5, 1.0)
The wilcoxon statistic does not depend on the specific values in x and y, only on the values of x - y. Instead of passing x and y into wilcoxon separately, wilcoxon can accept a single sample: the differences between paired observations.
x = np.array([209, 200, 177, 169, 159, 169, 187])
y = np.array([151, 168, 147, 164, 166, 163, 176])
stats.wilcoxon(x - y, alternative='greater', method='exact')
WilcoxonResult(statistic=25.0, pvalue=0.0390625)
As described above, this is relatively common in paired-sample tests (e.g. see also ttest_rel), so permutation_test also supports single samples as input and forms the null distribution by permuting the signs of each observation.
def statistic(d):
    # return just the Wilcoxon statistic
    return stats.wilcoxon(d, alternative='greater').statistic
res = stats.permutation_test((x-y,), statistic, alternative='greater', permutation_type='samples')
res.pvalue
0.0390625
Suppose that we wish to perform the Wilcoxon test with a two-sided alternative. We might expect that alternative='two-sided' should be specified everywhere the alternative parameter is accepted.
alternative = 'two-sided'
res1 = stats.wilcoxon(x - y, alternative=alternative)
def statistic(d):
    return stats.wilcoxon(d, alternative=alternative).statistic
res2 = stats.permutation_test((x-y,), statistic, alternative=alternative, permutation_type='samples')
print(res1.pvalue, res2.pvalue)
0.078125 0.15625
This guess was not correct; the pvalue returned by permutation_test is greater by a factor of two. Note that the documentation of wilcoxon states:
If alternative is "two-sided", [the statistic is] the sum of the ranks of the differences above or below zero, whichever is smaller.
The sign information that should be carried by the statistic ("above or below zero") is not preserved by statistic. This can be corrected by passing alternative='greater' or alternative='less' in the call to wilcoxon so that the statistic is always:
the sum of the ranks of the differences above zero.
def statistic(d):
    # `alternative='less'` or `alternative='greater'` will work
    return stats.wilcoxon(d, alternative='less').statistic
res2 = stats.permutation_test((x-y,), statistic, alternative=alternative, permutation_type='samples')
np.testing.assert_allclose(res2.pvalue, res1.pvalue, atol=1e-15)
permutation_test with permutation_type='samples' is a versatile tool for comparing paired samples. Provided only data and a statistic, it can produce the null distribution and replicate the $p$-value of similar tests in SciPy, and it may be more accurate than these existing implementations, especially for small samples and when there are ties.
In addition, permutation_test with permutation_type='samples' can be used to perform tests not yet implemented in SciPy.
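For instance, there is no dedicated SciPy function for a paired-sample test based on the median of the differences; the statistic below is chosen purely for illustration, reusing the height data from the wilcoxon example above.

```python
import numpy as np
from scipy import stats

x = np.array([209, 200, 177, 169, 159, 169, 187])
y = np.array([151, 168, 147, 164, 166, 163, 176])

def statistic(d, axis=-1):
    # the median of the paired differences; an arbitrary choice,
    # made only to show that any callable with the required
    # interface can serve as the statistic
    return np.median(d, axis=axis)

res = stats.permutation_test((x - y,), statistic, alternative='greater',
                             permutation_type='samples', vectorized=True,
                             n_resamples=np.inf)
print(res.statistic, res.pvalue)
```

With only seven pairs, n_resamples=np.inf enumerates all $2^7 = 128$ sign assignments, so the $p$-value is exact for this statistic.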
However, there are yet other types of permutation tests that assume neither that samples are independent nor that they are paired. We conclude the study of permutation_test with Correlated-Sample Tests.