In "Individual Comparisons by Ranking Methods", Wilcoxon considers two sprays designed to kill flying insects. A subset of the data, the percentage of flies killed in repeated trials of the two treatments, is recorded below.
import numpy as np
x = np.array([61, 62, 67, 63, 56, 58])
y = np.array([60, 68, 59, 72, 64])
In the paper, Wilcoxon describes a test, now commonly characterized as a nonparametric version of the independent-sample t-test, to assess whether the two samples are drawn from the same population; that is, a version of the t-test that does not make a normality assumption (or any other particular distributional assumption).
Given samples `x` and `y`, Wilcoxon introduces a statistic which is proportional to an empirical estimate of the probability that a random observation from the distribution underlying `x` will be less than a random observation from the distribution underlying `y`. Suppose we want to test the null hypothesis that the samples are drawn from the same distribution against the alternative that they are drawn from different distributions which tend to produce lower values of the statistic. Under certain assumptions, this can be taken as evidence that the location of the distribution underlying `x` is less than the location of the distribution underlying `y`. To perform this test, we pass the data into `scipy.stats.mannwhitneyu` with `alternative='less'`.
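To make the "empirical estimate" interpretation concrete, here is a minimal sketch (an illustrative addition, assuming SciPy's documented convention of returning the U statistic corresponding with the first sample): with no ties, the statistic counts the pairs in which an observation from `x` exceeds one from `y`, so it determines the empirical probability estimate exactly.
# Illustrative sketch: with no ties, U counts pairs (xi, yj) with xi > yj,
# so 1 - U / (nx * ny) is the empirical estimate of P(X < Y)
import numpy as np
from scipy import stats
x = np.array([61, 62, 67, 63, 56, 58])
y = np.array([60, 68, 59, 72, 64])
pairs_greater = (x[:, np.newaxis] > y).sum()  # 9 pairs with xi > yj
U = stats.mannwhitneyu(x, y, alternative='less').statistic
assert U == pairs_greater  # so P(X < Y) is estimated as 21/30 = 0.7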
from scipy import stats
_, pvalue = stats.mannwhitneyu(x, y, alternative='less')
pvalue # p-value is greater than our threshold; test is inconclusive
0.16450216450216448
Like the mean-comparison test in Efron's example from the previous tutorial on Permutation Tests, this is an example of an "independent sample" test of the null hypothesis that the group labels (`x`, `y`) are entirely random. In fact, because `mannwhitneyu` claims to produce an exact $p$-value, we would expect `permutation_test` to return precisely the same $p$-value (using `mannwhitneyu` only to compute the statistic).
def statistic(x, y):
# return just the Mann-Whitney U statistic
return stats.mannwhitneyu(x, y, alternative='less').statistic
# "independent" is the default `permutation type`, so we are not required to pass it here
# We pass `alternative='less'` because lesser values of the statistic are more extreme
res = stats.permutation_test((x, y), statistic, permutation_type='independent', alternative='less')
np.testing.assert_allclose(res.pvalue, pvalue, atol=1e-15)
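Why do the two $p$-values agree exactly rather than approximately? With samples of 6 and 5 observations, there are only $\binom{11}{5} = 462$ distinct ways to partition the pooled data, fewer than the default `n_resamples=9999`; in that case, `permutation_test` enumerates all of the partitions and performs an exact test. A quick check of the count:
# there are fewer distinct partitions than the default n_resamples=9999,
# so permutation_test enumerates them all and is exact here
from math import comb
comb(11, 5)  # 462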
Just as with `monte_carlo_test`, vectorizing the `statistic` function can greatly improve the speed of the test.
# Before
%timeit stats.permutation_test((x, y), statistic, alternative='less')
272 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# After
def statistic_vectorized(x, y, axis=0):
# return just the Mann-Whitney U statistic
return stats.mannwhitneyu(x, y, axis=axis, alternative='less').statistic
%timeit stats.permutation_test((x, y), statistic_vectorized, alternative='less', vectorized=True)
56.5 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
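The speedup comes from how `permutation_test` uses the statistic: with `vectorized=True`, it stacks many resamples into higher-dimensional arrays and calls the statistic once per batch, reducing along the `axis` keyword, rather than calling it once per resample. A quick illustration with a synthetic batch (the shapes below are hypothetical, chosen just for demonstration):
# A vectorized statistic must map a batch of resamples to a batch of statistics,
# e.g., inputs of shape (3, 6) and (3, 5) to an output of shape (3,)
rng = np.random.default_rng()
xb, yb = rng.random((3, 6)), rng.random((3, 5))
statistic_vectorized(xb, yb, axis=-1).shape  # (3,)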
Although `mannwhitneyu` provides an exact $p$-value for the data above, `permutation_test` comes in handy when there are ties in the samples. As the `mannwhitneyu` documentation states:

`'exact'`: computes the exact p-value by comparing the observed statistic against the exact distribution of the statistic under the null hypothesis. No correction is made for ties.
The complete data set in Wilcoxon's original paper had ties.
x = [60, 67, 61, 62, 67, 63, 56, 58]
y = [68, 68, 59, 72, 64, 67, 70, 74]
res1 = stats.mannwhitneyu(x, y, method='exact', alternative='two-sided')
# By default, only 9,999 random permutations are used.
# We pass n_resamples=np.inf to ensure that all 12,870 possible permutations are used
res2 = stats.permutation_test((x, y), statistic_vectorized, alternative='two-sided', vectorized=True, n_resamples=np.inf)
print(res1.pvalue, res2.pvalue)
0.014763014763014764 0.01351981351981352
The two $p$-values are similar despite the ties, but only `permutation_test` is truly "exact" in this case. Either way, our 1% threshold for statistical significance is not met, and the test is inconclusive.
`scipy.stats.kruskal` is a many-sample extension of the Mann-Whitney U test, but SciPy provides only an approximate (asymptotic) $p$-value. For very small samples, it is possible to perform an exact version of the test using `permutation_test`, and even a randomized test using a subset of the possible permutations may yield more accurate results than the approximation implemented by `kruskal`, especially if there are ties or the sample size is small. Using the (artificial) data for bottle cap production from Kruskal and Wallis' original paper, we have:
x = [340, 345, 330, 342, 338]
y = [339, 333, 344]
z = [347, 343, 349, 355]
stats.kruskal(x, y, z)
KruskalResult(statistic=5.656410256410254, pvalue=0.059118869289796136)
At the expense of some computation time, the exact $p$-value for this data is given by `permutation_test`.
def statistic(x, y, z, axis=0):
return stats.kruskal(x, y, z, axis=axis).statistic
res = stats.permutation_test((x, y, z), statistic, vectorized=True, alternative='greater', n_resamples=np.inf)
res.pvalue
0.048629148629148626
Note that we passed `alternative='greater'` into `permutation_test` but not into `kruskal`. This is because the `kruskal` test is inherently one-sided: data generated under the null hypothesis tend to produce small positive values of the statistic, and greater values are always more exceptional. This raises an important point: setting up a permutation test requires some study of both the underlying statistic and SciPy's implementation of it. Another example of this is shown in the next section.
Suppose that we wish to perform the two-sample Kolmogorov-Smirnov test of the null hypothesis that two samples were drawn from the same distribution against the alternative that the distribution $X$ underlying sample `x` is stochastically greater than the distribution $Y$ underlying sample `y`. Roughly speaking, this is the alternative that $X$ "tends to be" greater than $Y$.
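Formally (a standard definition, stated here for reference): $X$ is stochastically greater than $Y$ if $P(X > t) \geq P(Y > t)$ for every $t$; equivalently, the CDFs satisfy $F_X(t) \leq F_Y(t)$ for all $t$, with strict inequality for at least one $t$.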
Here, we'll use randomly generated data chosen to illustrate some confusing (but important) points. We choose the shapes of the samples so as to generate the `RuntimeWarning` reported in gh-14019.
from scipy import stats
import matplotlib.pyplot as plt
# Indeed, the distribution $X$ is stochastically greater than the distribution $Y$
X = stats.norm(loc=+0.2)
Y = stats.norm(loc=0)
x = X.rvs(size=801)
y = Y.rvs(size=399)
grid = np.linspace(-4, 4, 100)
plt.plot(grid, X.pdf(grid), 'C0')
plt.plot(grid, Y.pdf(grid), 'C1')
plt.hist(x, density=True, color='C0', bins=30, alpha=0.5)
plt.hist(y, density=True, color='C1', bins=30, alpha=0.5)
plt.title('Distribution PDFs and Sample Histograms')
plt.legend(['x', 'y'])
plt.show()
Our first difficulty is determining the correct value of `alternative` to pass into `ks_2samp`. From its documentation, we see that the alternatives are expressed not in terms of the values of the samples or the locations of the underlying distributions, but in terms of the cumulative distribution functions of the underlying distributions.
`two-sided`: The null hypothesis is that the two distributions are identical, $F(x)=G(x)$ for all $x$; the alternative is that they are not identical.
`less`: The null hypothesis is that $F(x) \geq G(x)$ for all $x$; the alternative is that $F(x) < G(x)$ for at least one $x$.
`greater`: The null hypothesis is that $F(x) \leq G(x)$ for all $x$; the alternative is that $F(x) > G(x)$ for at least one $x$.
Note that if a distribution $X$ tends to be greater than $Y$, we find that the cumulative distribution function of $X$ lies below the cumulative distribution function of $Y$.
plt.plot(grid, X.cdf(grid), 'C0')
plt.plot(grid, Y.cdf(grid), 'C1')
plt.title('Distribution CDFs')
plt.legend(['x', 'y'])
plt.show()
Therefore, to test the alternative that $X$ is stochastically greater than $Y$, we pass `alternative='less'` into `ks_2samp`.
res1 = stats.ks_2samp(x, y, alternative='less', method='exact')
print(res1)
KstestResult(statistic=0.12484394506866417, pvalue=0.00022195733373729093)
RuntimeWarning: ks_2samp: Exact calculation unsuccessful. Switching to method=asymp.
The $p$-value is tiny, confirming what we already know: the data are inconsistent with the null hypothesis, and we have evidence to reject it in favor of the alternative.
The warning states that `ks_2samp` was unable to compute an exact $p$-value, and that an asymptotic $p$-value was returned instead. To determine whether the asymptotic $p$-value is accurate for these sample sizes, we can perform a permutation test.
def statistic(x, y):
return stats.ks_2samp(x, y, alternative='less').statistic
# This would be extremely slow!
# res2 = stats.permutation_test((x, y), statistic, alternative='greater')
The calculation above would be extremely slow to run. Unfortunately, `ks_2samp` does not accept an `axis` argument, so we can't speed it up using vectorization without implementing the statistic ourselves. However, lack of vectorization is not the bottleneck here. Note that the call to `ks_2samp` is quite slow with the default parameters, even for 1D inputs.
# No need for the warning; we know the exact calculation is unsuccessful
import warnings
warnings.filterwarnings('ignore')
%timeit stats.ks_2samp(x, y, alternative='less', method='exact')
374 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
By default, `permutation_test` needs to call `ks_2samp` 9999 times, which would take about an hour. We can speed this up dramatically by noting that `permutation_test` only uses `ks_2samp` to compute the test statistic, so the `pvalue` attribute of the `ks_2samp` result object is not used at all. We can use `ks_2samp` to compute essentially the same value of the test statistic, but much faster, by specifying `method='asymp'`.
# method='asymp' and method='exact' result in the same statistic value
res1 = stats.ks_2samp(x, y, alternative='less', method='asymp')
res2 = stats.ks_2samp(x, y, alternative='less', method='exact')
np.testing.assert_allclose(res1.statistic, res2.statistic, atol=1e-15)
# but method='asymp' is much faster
%timeit stats.ks_2samp(x, y, alternative='less', method='asymp')
167 µs ± 947 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Now we can run a randomized `permutation_test` in a reasonable amount of time.
def statistic(x, y):
return stats.ks_2samp(x, y, alternative='less', method='asymp').statistic
res3 = stats.permutation_test((x, y), statistic, alternative='less')
print(res1.pvalue, res3.pvalue)
0.0002219573337372917 0.9999
This was much faster, but something is still wrong. Either the approximate $p$-value is wildly inaccurate, or we have set up our test incorrectly. The latter turns out to be the case: the value of `alternative` passed into `ks_2samp` changes the definition of the test statistic, but a greater statistic is always considered more extreme. Therefore, even if we wish to perform a test equivalent to `ks_2samp` with `alternative='less'`, we actually need to pass `alternative='greater'` into `permutation_test`!
# greater values of the statistic returned by `ks_2samp` are more extreme
res4 = stats.permutation_test((x, y), statistic, alternative='greater')
print(res1.pvalue, res4.pvalue)
0.0002219573337372917 0.0005
At last, `permutation_test` is invoked correctly. Indeed, the asymptotic $p$-value produced by `ks_2samp` appears to be reliable for these sample sizes.
As we can see, `permutation_test` with `permutation_type='independent'` is a versatile tool for comparing independent samples. Provided only the data and a statistic, it can produce the null distribution and replicate the $p$-value of many such tests in SciPy, and it may be more accurate than these existing implementations, especially for small samples and when there are ties:
`ttest_ind`
`cramervonmises_2samp`
`ks_2samp`
`epps_singleton_2samp`
`mannwhitneyu`
`kruskal`
`friedmanchisquare`
`brunnermunzel`
`ansari`
`bartlett`
`levene`
`anderson_ksamp`
`fligner`
`median_test`
`mood`
In addition, `permutation_test` with `permutation_type='independent'` can be used to perform tests not yet implemented in SciPy.
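For example, here is a minimal sketch of such a test (a hypothetical illustration, not an existing SciPy test): an exact permutation test based on the difference in sample medians, which is distinct from Mood's `median_test` above. We reuse the complete data from Wilcoxon's paper.
# hypothetical example: an exact permutation test on the difference in medians
def median_diff(x, y, axis=0):
    # greater values indicate that x tends to exceed y
    return np.median(x, axis=axis) - np.median(y, axis=axis)
x = [60, 67, 61, 62, 67, 63, 56, 58]
y = [68, 68, 59, 72, 64, 67, 70, 74]
res = stats.permutation_test((x, y), median_diff, vectorized=True,
                             alternative='two-sided', n_resamples=np.inf)
res.pvalue  # exact two-sided p-value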
However, there are other types of permutation tests that do not assume the samples are entirely independent. We continue the study of `permutation_test` with Paired-Sample Tests.