
Permutation tests based on ranks, and calibrating parametric tests using permutations¶

This notebook explores some common nonparametric tests based on ranks, and shows that (depending on the null hypothesis) parametric tests can be calibrated nonparametrically to have the correct significance level when the assumptions of the parametric tests fail.

Notation¶

Given a set of real numbers $\{x_j\}_{j=1}^N$, $x_{(i)}$ denotes the $i$th smallest element of the set. That is, \begin{equation} x_{(1)} \le x_{(2)} \le \ldots \le x_{(N)}. \end{equation}

The (mid-)rank $r_j$ of the $j$th observation, $x_j$, is $\#\{ k: x_k < x_j \} + (\#\{ k: x_k = x_j \}+1)/2$. This assigns tied observations the average of the ranks they would have had if they had been distinct.

For example, for the set $\{1, 4, 2, 6\}$ the corresponding (mid-)ranks are $1, 3, 2, 4$; for the set $\{1, 4, 2, 6, 2, 2\}$, the mid-ranks are $1, 5, 3, 6, 3, 3$; and for the set $\{1, 4, 2, 6, 2, 2, 2\}$, the mid-ranks are $1, 6, 3.5, 7, 3.5, 3.5, 3.5$.

The sum of the (mid-)ranks of $N$ observations is $N(N+1)/2$.
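The definition of the mid-rank can be checked numerically. The sketch below is only an illustration: the helper name `midranks` is hypothetical, and it is compared against `scipy.stats.rankdata` with `method='average'`, which also assigns tied values the average of their ranks.

```python
import numpy as np
from scipy.stats import rankdata

def midranks(x):
    """Mid-ranks from the definition: #{k: x_k < x_j} + (#{k: x_k = x_j} + 1)/2."""
    x = np.asarray(x)
    # entry [j, k] of each comparison matrix compares x_k with x_j
    less = (x[None, :] < x[:, None]).sum(axis=1)
    equal = (x[None, :] == x[:, None]).sum(axis=1)
    return less + (equal + 1) / 2

for data in ([1, 4, 2, 6], [1, 4, 2, 6, 2, 2], [1, 4, 2, 6, 2, 2, 2]):
    r = midranks(data)
    N = len(data)
    assert np.allclose(r, rankdata(data, method='average'))
    assert r.sum() == N * (N + 1) / 2   # the mid-ranks always sum to N(N+1)/2
    print(data, r)
```

The direct definition costs $O(N^2)$ comparisons; it is written that way here only to mirror the formula.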

When the data are random variables $\{X_j \}_{j=1}^N$, the (random) ranks will be written using uppercase letters, $\{R_j\}_{j=1}^N$. When there are only two groups, control and treatment, $\{S_j\}_{j=1}^n$ will denote the ranks of the responses in the treatment group.

Many of the methods described in this notebook can be performed using the original data or using the ranks of the data.

Why use ranks instead of the original data?¶

In some problems, only ranks are observed. This is common when it is possible to make comparative judgements about outcomes, but not absolute judgements. For example, a doctor might be able to rank a collection of patients according to the severity with which they suffer from some disease they share, but might not be able to rate the severity on an objective quantitative scale—and certainly not an equal-increment scale, e.g., a scale for which 5 is "better than" 4 by the same amount that 2 is "better than" 1. Similarly, a consumer might be able to order a collection of items by preference: she prefers A to B, B to C and C to D, for example. Yet she might not be able to quantify the strength of her preferences. Having techniques for dealing with ranks is then useful. The same issue occurs with measurements on Likert scales.
If the measurements are "contaminated" by outliers, using ranks can decrease the sensitivity to the outliers, potentially increasing power.
If the original data have no ties, the null distribution of the rank-based statistic is "universal": it depends only on the sample sizes, so it can be tabulated once and for all. (This is less important than it used to be because of increases in computing power.)
Often, the normal approximation to the null distribution of the rank-based statistic is more accurate than the normal approximation to the null distribution of the test statistic for raw data. (This is less important than it used to be because of increases in computing power.)

Why use the original data instead of ranks?¶

The original units are more meaningful for things like treatment effect, estimating shifts, etc.
Tests using the original data may have more power.

Notation review¶

As in the notebook on causal inference, there are $T$ possible treatments (or $T$ groups being compared), numbered 0 to $T-1$, and $W_j$ is the treatment that subject $j$ receives (or the group that subject $j$ belongs to). Subject $j$'s response if assigned to treatment $t$ is $x_{jt}$, $j=1, \ldots, N$, $t=0, \ldots, T-1$. The number of subjects assigned to treatment $t$ is $\sum_{j=1}^N 1(W_j = t)$. (When there are only two treatments, $t=0$ and $t=1$, $n :=\sum_j W_j$ is the number of subjects assigned to treatment 1.) The effect of treatment $t$ on subject $j$ (compared to control) is \begin{equation} \tau_{jt} := x_{jt} - x_{j0}, \;\;t=1, \ldots, T-1. \end{equation} The average effect of treatment $t$ (compared to control) is \begin{equation} \bar{\tau}_t := \frac{1}{N} \sum_{j=1}^N (x_{jt} - x_{j0}), \;\; t=1, \ldots, T-1. \end{equation}

The two-sample problem¶

We first specialize to the case of $T=2$ groups or treatments. There are $N$ items in all, $n$ in the "treatment" group and $m := N-n$ in the "control" group. Let $\{X_j\}_{j=1}^m$ denote the data for the control group and $\{Y_j\}_{j=1}^n$ denote the data for the treatment group.

In the *randomization* version of the two-sample problem, the two groups are provided by nature, and the question is whether the groups are "statistically distinguishable," in the sense that they don't look like a single group that was partitioned randomly. In the context of causal inference, the groups started as a single group of subjects, randomly partitioned into two groups that receive different treatments, and the question is whether their responses are statistically distinguishable.

There is also a *population* version of the two-sample problem, which asks whether the two groups look like IID samples from the same parent population, i.e., whether there is some $F$ for which $\{X_1, \ldots, X_m, Y_1, \ldots, Y_n \}$ are IID $F$. If the two groups are IID samples from the same distribution, then, conditional on the set of $N$ observed values, every subset of $n$ of the $N$ values is equally likely to be the group of size $n$ and every subset of $m$ of the $N$ values is equally likely to be the group of size $m$, so tests that work for the randomization version of the 2-sample problem also work for the population version of the 2-sample problem.

Lehmann (1998, pp. 64–65) identifies five models for five problems, all of which lead to the same (conditional) null distribution of ranks:

Randomization model for comparing two treatments. $N$ subjects are given and fixed; $n$ are assigned at random to treatment and $m= N-n$ to control.
Population model for comparing two treatments. $N$ subjects are drawn as a simple random sample from a much larger population; $n$ are assigned at random to treatment and $m = N-n$ to control. Condition on the configuration of ties if there are ties among the data.
Comparing two sub-populations using a sample from each. A simple random sample of $n$ subjects is drawn from one much larger population with many more than $N$ members, and a simple random sample of $m$ subjects is drawn from another population with many more than $m$ members. Condition on the configuration of ties if there are ties among the data.
Comparing two sub-populations using a sample from the pooled population. A simple random sample of $N$ subjects is drawn from the pooled population, giving random samples from the two populations but with random sample sizes. Condition on the sample sizes and, if there are ties, on the configuration of the ties.
Comparing two sets of measurements. Independent sets of $n$ and $m$ measurements come from two sources. Condition on the configuration of ties if there are ties among the data.

Testing for a difference in location¶

The Wilcoxon rank-sum test.¶

Define the sum of the ranks of the responses in the treatment group, group 1: \begin{equation} W_Y := \sum_j W_j R_j. \end{equation} (Note that $W$ is being used in two very different ways in the notation.) This is the Wilcoxon rank sum statistic. Under the null, every subset of $n$ of the responses is equally likely to be the treatment group. The expected value of $W_Y$ under the null is $n(N+1)/2$. Moreover, under the null, the distribution of $W_Y$ is symmetric. If treatment tends to increase responses, $W_Y$ will tend to be larger than its median under the null; if treatment tends to decrease responses, $W_Y$ will tend to be smaller than its median under the null.

To test against the alternative that treatment tends to increase responses, we would reject for large values of $W_Y$. To test against the alternative that treatment tends to decrease responses, we would reject for small values of $W_Y$. The critical value of the test is set using the probability distribution of $W_Y$ on the assumption that the strong null hypothesis is true. For a level-$\alpha$ test against the alternative that treatment increases responses, we would find the smallest $c$ such that, if the strong null is true, $\mathbb{P}(W_Y \ge c) \le \alpha$. We would then reject the strong null if the observed value of $W_Y$ is $c$ or greater. Recall that $1 + 2 + \cdots + k = k(k+1)/2$.

If the treated subjects have the smallest possible ranks, 1 to $n$, then $W_Y = 1 + 2 + \cdots + n = n(n+1)/2$. If the treated subjects have the largest possible ranks, $N - n+1$ to $N$, then \begin{eqnarray} W_Y & = & (N-n+1) + (N-n+2) + \cdots + N \\ & = & (N-n) + 1 + (N-n) + 2 + \cdots + (N-n) + n \\ & = & n(N-n) + (1 + 2 + \cdots + n) \\ & = & n(N-n) + n(n+1)/2. \end{eqnarray} If there are no ties, all the integers between $n(n+1)/2$ and $n(N-n) + n(n+1)/2$ are possible values of $W_Y$. The null distribution of $W_Y$ under the strong null hypothesis is symmetric about $n(N+1)/2$.

(In the randomization model, each subset of $n$ of the $N$ ranks is equally likely to be the ranks of the treatment group.
The probability that the treatment ranks are $\{1, 2, \ldots , n\}$ is equal to the probability that the treatment ranks are $\{N, N-1, \ldots, N-n+1\}$.
More generally, consider re-labeling the $j$th-ranked observation to be the $N-j+1$st-ranked observation. The labels of the observations would still be $\{1, 2, \ldots , N\}$, so the probability distribution of the sum of the labels on the treated subjects would be the same as the probability distribution of $W_Y$ under the strong null hypothesis.
However, if the sum of the treatment ranks was $W_Y$, the sum of the new labels would be $n(N+1)-W_Y$. Thus $\mathbb{P}\{W_Y = k\} = \mathbb{P}\{W_Y = n(N+1) - k \}$. That is, the probability distribution of $W_Y$ is symmetric about $n(N+1)/2$.
This argument is essentially that in Lehmann, E.L., 1998. Nonparametrics: Statistical Methods Based on Ranks. Upper Saddle River, N.J.: Prentice Hall, pp. 12–13.)

Thus the expected value of $W_Y$ under the null hypothesis is $n(N+1)/2$. (This also follows from the fact that the expected value of the sample sum of a simple random sample of size $n$ from a box of $N$ numbers is equal to $n$ times the mean of the $N$ numbers; the mean of the integers 1 to $N$ is $(N+1)/2$.) The variance of $W_Y$ under the strong null hypothesis is $mn(N+1)/12$. (This follows from the fact that the variance of the sample sum of a simple random sample of size $n$ from a list of $N$ numbers is $(N-n)n(\mbox{variance of list})/(N-1) = mn(\mbox{variance of list})/(N-1)$. The variance of the list of the integers 1 to $N$ is $(N^2-1)/12 = (N-1)(N+1)/12$, so the variance of $W_Y$ is $mn(N-1)(N+1)/(12(N-1)) = mn(N+1)/12$.)

Define $W_X$ to be the sum of the control ranks. Because the treatment ranks and control ranks together comprise all the ranks, $ W_X + W_Y = 1 + 2 + \cdots + N = N(N+1)/2$. Thus, $W_X = N(N+1)/2 - W_Y$.

It follows that the (null) expected value of $W_X$ is $N(N+1)/2 - n(N+1)/2 = m(N+1)/2$, and that the (null) variance of $W_X$ is also $mn(N+1)/12$. When $n$ and $m$ are both large, the normal approximations to the null distributions of $W_X$ and to $W_Y$ tend to be accurate.

The Wilcoxon rank-sum test is exactly what you get if you replace each observation by its rank, then do a permutation test using the sum of the (ranks of the) responses in the treatment group. Thus, the code in the previous chapter can be used to simulate the null distribution, by replacing the raw data with their ranks. However, the normal approximation to the null distribution of $W_Y$ can be better than the normal approximation to the null distribution of the sample sum of the responses to treatment, because the ranks are evenly spread out, whereas the raw responses can be highly skewed or multimodal.
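When $N$ is small and there are no ties, the null distribution of $W_Y$ can be enumerated exactly rather than simulated. The following is a minimal sketch under those assumptions; the function name `rank_sum_null` and the sizes $N=10$, $n=4$ are illustrative choices, not part of the original notebook. It checks the range, symmetry, mean, and variance derived above.

```python
from itertools import combinations
import numpy as np

def rank_sum_null(N, n):
    """Exact null distribution of W_Y when there are no ties:
    every subset of n of the ranks 1..N is equally likely to be the treatment ranks."""
    sums = np.array([sum(s) for s in combinations(range(1, N + 1), n)])
    values, counts = np.unique(sums, return_counts=True)
    return values, counts / counts.sum()

N, n = 10, 4            # illustrative sizes, small enough to enumerate
m = N - n
values, probs = rank_sum_null(N, n)
mean = np.sum(values * probs)
var = np.sum((values - mean) ** 2 * probs)

print(values.min(), n * (n + 1) // 2)                 # smallest possible rank sum
print(values.max(), n * (N - n) + n * (n + 1) // 2)   # largest possible rank sum
print(mean, n * (N + 1) / 2)                          # null mean vs. formula
print(var, m * n * (N + 1) / 12)                      # null variance vs. formula
```

Enumeration visits all ${N \choose n}$ subsets, so it is practical only for small samples; for larger samples, simulation or the normal approximation is the usual alternative.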

Mann-Whitney Statistics¶

Define $W_{XY} := W_Y - n(n+1)/2$. This is $W_Y$ minus its minimum possible value. Let $W_{YX} := W_X - m(m+1)/2$, $W_X$ minus its minimum possible value. Under the strong null hypothesis, the probability distribution of $W_{XY}$ is the same as the probability distribution of $W_{YX}$, a consequence of the symmetry of the probability distribution of $W_Y$. The statistics $W_{XY}$ and $W_{YX}$ are called the Mann-Whitney statistics.

Let $\{X_1, \ldots, X_m\}$ denote the $m$ control responses and let $\{Y_1, \ldots, Y_n\}$ denote the $n$ treatment responses. Consider $\#\{(i, j) : 1 \le i \le m, \; 1 \le j \le n \mbox{ and } X_i < Y_j \}$. This is the number of (control, treatment) pairs such that the control response is less than the treatment response.

Let $\{S_j: j = 1, \ldots , n\}$ be the ranks of the treatment responses. Let $\{S_{(j)}: j = 1, \ldots , n\}$ be those ranks in increasing order. Partition the set of (control, treatment) pairs for which the control response is less than the treatment response on the value of the treatment response: The total number of such pairs for which the control response is less than the treatment response is the number of pairs where the control response is less than the smallest treatment response, plus the number of pairs where the control response is less than the second-smallest treatment response, and so on. This will help us count the pairs.

How many responses are less than $S_{(1)}$, the rank of the smallest treatment response? By definition, there are $S_{(1)}-1$ of them, all of which are control responses. The number of response values that are less than $S_{(2)}$ is $S_{(2)}-1$, one of which is $S_{(1)}$, so the total number of control responses that are less than $S_{(2)}$ is $S_{(2)}-2$. The total number of control responses that are less than $S_{(j)}$ is $S_{(j)} - j$, so the total number of (control, treatment) pairs with the control response less than the treatment response is \begin{equation} S_{(1)}-1 + S_{(2)} - 2 + \cdots + S_{(n)} - n = W_Y - (1 + 2 + \cdots + n) = W_Y - n(n+1)/2 = W_{XY}. \end{equation} Thus \begin{equation} W_{XY} = \#\{(i, j) : 1 \le i \le m, \; 1 \le j \le n \mbox{ and } X_i < Y_j \}. \end{equation} That is, when there are no ties, the Mann-Whitney statistic $W_{XY}$ (and the Wilcoxon rank sum $W_Y$, up to an additive constant) counts the (control, treatment) pairs for which the treatment response is larger than the control response. The larger the positive effect of treatment, the larger the Mann-Whitney and Wilcoxon rank sum statistics tend to be.
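The counting identity can be checked numerically. This is a small sketch on simulated data (the normal samples, the seed, and the sizes are arbitrary assumptions); because the data are continuous, ties have probability zero.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(12345)
m, n = 7, 5
N = m + n
x = rng.normal(size=m)             # control responses
y = rng.normal(loc=0.5, size=n)    # treatment responses

ranks = rankdata(np.concatenate([x, y]))   # ranks in the pooled sample
W_X = ranks[:m].sum()                      # sum of the control ranks
W_Y = ranks[m:].sum()                      # sum of the treatment ranks
W_XY = W_Y - n * (n + 1) / 2               # Mann-Whitney statistic

# direct count of (control, treatment) pairs with X_i < Y_j
pairs = np.sum(x[:, None] < y[None, :])

assert W_XY == pairs
assert W_X + W_Y == N * (N + 1) / 2
print(W_Y, W_XY, pairs)
```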

Equivalent tests. Multiplying the test statistic by a (positive) constant produces an equivalent test (equivalent means that the tests reject the null for exactly the same data). Hence using $\bar{W}_Y := W_Y/n$ produces an equivalent test.

Since $\sum_j r_j = N(N+1)/2$, as $W_Y$ increases, the sum (and the mean) of the control ranks $W_X := \sum_j (1-W_j) r_j$ decreases monotonically, so the Wilcoxon rank-sum test is equivalent to a test based on the difference between the mean ranks for treatment and control, $W_Y/n - W_X/m$.

The Mann-Whitney test and the Wilcoxon rank-sum test are equivalent.

Permutation $t$-test.¶

Use the $t$ statistic as the test statistic, but calibrate it using the permutation distribution rather than Student's $t$ distribution. See the introduction to permutation tests.
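A minimal sketch of this calibration follows. The function names `t_stat` and `perm_t_test`, the simulated exponential data, and the number of replications are assumptions made for illustration; the $t$ statistic is the usual pooled-variance two-sample statistic, and the $P$-value is one-sided (against the alternative that treatment increases responses).

```python
import numpy as np

def t_stat(y, x):
    """Pooled-variance two-sample t statistic (treatment minus control)."""
    n, m = len(y), len(x)
    pooled_var = ((n - 1) * np.var(y, ddof=1) + (m - 1) * np.var(x, ddof=1)) / (n + m - 2)
    return (np.mean(y) - np.mean(x)) / np.sqrt(pooled_var * (1 / n + 1 / m))

def perm_t_test(x, y, reps=10_000, seed=None):
    """Simulated one-sided permutation P-value for the t statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(y)
    t_obs = t_stat(y, x)
    hits = 0
    for _ in range(reps):
        perm = rng.permutation(pooled)     # random re-assignment of the group labels
        hits += t_stat(perm[:n], perm[n:]) >= t_obs
    # counting the observed assignment itself keeps the simulated P-value conservative
    return t_obs, (hits + 1) / (reps + 1)

rng = np.random.default_rng(7)
x = rng.exponential(size=12)           # skewed "control" data
y = rng.exponential(size=10) + 0.5     # "treatment" data shifted by 0.5
t_obs, p = perm_t_test(x, y, seed=0)
print(t_obs, p)
```

Only the reference distribution changes relative to the parametric test: the statistic is the same, but its null distribution comes from re-randomizing the labels rather than from normal theory.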

Confidence bounds for a shift in the two-sample problem¶

The strong null is that each subject would have the same response regardless of the treatment assigned to the subject. One alternative hypothesis is that the effect of treatment is to shift each subject's response by the same amount, $\Delta$. In applications, there is generally little reason to believe that treatment will affect every subject by exactly the same amount: this is a very artificial alternative. But it is a strong alternative, in the sense that it lets us find the sampling distribution of any statistic from the observed data.

The approach here is based on the duality between tests and confidence sets. Suppose we have a family of level $\alpha$ tests of the null hypotheses $H_{d}$: $\Delta=d$ for all real $d$. Let $S(y)$ denote the set of values of $d$ for which the test of the corresponding hypothesis $\Delta = d$ would not reject the null hypothesis if the data were $y$ (here $y$ is generic; in the problems we have been considering, $y$ specifies all the treatment responses and all the control responses). Then $S(Y)$ is a $1-\alpha$ confidence set for $\Delta$.

Proof. If $d$ is the true value of $\Delta$, then the probability (under $d$) that the corresponding test rejects the hypothesis $\Delta = d$ is at most $\alpha$. The probability that the test does not reject is at least $1-\alpha$, so with probability at least $1-\alpha$ under $d$, the true value $d$ is in $S(Y)$. That is, \begin{equation} \mathbb{P}_d \{S(Y) \ni d \} \ge 1-\alpha. \end{equation} The shape and size of the confidence set are tied to the family of hypothesis tests that are being "inverted." No matter what, the procedure gives a $1-\alpha$ confidence set, but it can be large or bizarre (e.g., consist of disjoint pieces) according to the nature of the tests. The duality between tests and confidence sets can be exploited to produce confidence sets with desirable properties; see, for example, Benjamini and Stark (1996. Non-equivariant simultaneous confidence intervals less likely to contain zero, J. Amer. Stat. Assoc., 91, 329–337); Benjamini, Hochberg and Stark (1998. Confidence Intervals with more Power to determine the Sign: Two Ends constrain the Means, J. Amer. Stat. Assoc., 93, 309–317); or Evans, Hansen and Stark (2005. Minimax Expected Measure Confidence Sets for Restricted Location Parameters, Bernoulli, 11, 571–590).

To use this result to find a confidence set for $\Delta$, we need a family of level $\alpha$ tests of the hypotheses $\Delta = d$.

Consider the Mann-Whitney statistic $W_{XY}$. Under the strong null hypothesis, the expected value of $W_{XY}$ is $mn/2$. Suppose that the critical values for the two-sided test of the hypothesis that $\Delta=0$ are $mn/2-c$ and $mn/2+c$. Note that under the hypothesis that $\Delta = d$, the distribution of the Mann-Whitney statistic for the data $\{X_1, \ldots, X_m, Y_1-d, \ldots, Y_n-d \}$, which we denote $W_{XY-d}$, is the same as the distribution of $W_{XY}$ under the original strong null hypothesis. A two-sided level-$\alpha$ test of the hypothesis $\Delta=d$ thus would reject if $W_{XY-d} \le mn/2-c$ or $W_{XY-d} \ge mn/2+c$. This is a family of tests on which we can base the confidence set.

We can find a confidence interval for $\Delta$ using these tests as follows: Let $d$ be the estimated treatment effect, the median of the $mn$ differences $d_{ij} := Y_j - X_i$. Subtract $d$ from the treatment observations; compute the ranks of this new data set; perform a two-sided Wilcoxon rank sum test of the null hypothesis of no treatment effect using these new data. If the test does not reject, include $d$ in the confidence set. Increase $d$ systematically until the test rejects to get the upper endpoint of the confidence interval; decrease $d$ systematically until the test rejects to get the lower endpoint of the confidence interval.

The only values of $d$ that could be endpoints of the confidence interval are those that cause the ranks of the (shifted) treatment observations to change: only those values change the value of $W_{XY-d}$. We can exploit this to find a confidence interval for $\Delta$ with less trial-and-error searching.

To change the value of $W_{XY-d}$, we need to change $d$ by enough to change the number of pairs $(i, j)$ such that $X_i < Y_j-d$. The values of $d$ that change the ranks are precisely the differences $d_{ij} = Y_j-X_i$.

Let $d_{(1)}, \ldots, d_{(mn)}$ denote the order statistics of the $mn$ differences $\{d_{ij}\}$, and define $d_{(0)} := -\infty$ and $d_{(mn+1)} := \infty$. Then $d_{(k)} \le d < d_{(k+1)}$ if and only if $W_{XY-d} = mn-k$.

To see this, note that $W_{XY-d}$ is the number of pairs $(i, j)$ for which $X_i < Y_j - d$, that is, the number of differences $d_{ij} = Y_j - X_i$ that exceed $d$. If there are no ties among the differences, exactly $mn-k$ of them exceed $d$ if and only if $d_{(k)} \le d < d_{(k+1)}$. The distribution of $W_{XY-d}$ when $d$ is the true shift does not depend on $d$: it is the same as the distribution of $W_{XY}$ under the strong null hypothesis. Using the symmetry of the null distribution of $W_{XY}$, we see that for any $d$, \begin{equation} \mathbb{P}_d \{ d_{(k)} \le d < d_{(k+1)} \} = \mathbb{P}_0 \{W_{XY} = mn - k \} = \mathbb{P}_0 \{W_{XY} = k \}. \end{equation}

Thus the random differences $d_{ij}$ carve the real line into random intervals with known probabilities of containing the true value of $d$ (if the shift alternative is true!). To find a confidence interval for $d$, we can start with (one of) the intervals $[d_{(k)}, d_{(k+1)})$ with the largest chance of containing $d$, then append intervals in decreasing order of their probability of containing $d$ until the sum of their probabilities is at least $1-\alpha$. Because of the unimodality of the distribution, these intervals will be contiguous; let $[d_{(\ell)}, d_{(u)})$ denote their union. Alternatively, we could choose the intervals to try to have symmetric non-coverage probabilities; that is, so that $\mathbb{P}_d \{ d_{(\ell)} > d \}$ is as close as possible to $\mathbb{P}_d \{ d_{(u)} < d \}$. (I don't know whether these two approaches differ; I think the former corresponds to inverting the two-sided Wilcoxon test.) All the probability calculations are implicit in calculating the distribution of $W_{XY}$ under the strong null hypothesis. In fact, one need not even calculate the entire distribution; see Lehmann (1998, pp. 92ff) for shortcuts.
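The construction can be sketched in code for small samples. The sketch assumes no ties among the data or the differences, enumerates the null distribution of $W_{XY}$ by brute force, and uses a central set of values $\{c, \ldots, mn-c\}$, which has equal tail probabilities because the null distribution is symmetric. The names `mann_whitney_null` and `shift_interval` and the simulated data are hypothetical.

```python
from itertools import combinations
import numpy as np

def mann_whitney_null(m, n):
    """P_0{W_XY = k}, k = 0..mn, by enumerating the possible treatment rank sets (no ties)."""
    N = m + n
    counts = np.zeros(m * n + 1)
    for s in combinations(range(1, N + 1), n):
        counts[sum(s) - n * (n + 1) // 2] += 1
    return counts / counts.sum()

def shift_interval(x, y, alpha=0.05):
    """Interval [d_(c), d_(mn-c+1)) with coverage P_0{c <= W_XY <= mn-c} >= 1 - alpha."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, n = len(x), len(y)
    pmf = mann_whitney_null(m, n)
    d = np.sort((y[None, :] - x[:, None]).ravel())     # the mn differences Y_j - X_i
    d = np.concatenate([[-np.inf], d, [np.inf]])       # d[k] is d_(k), k = 0..mn+1
    c = 0
    # grow c while the smaller central set {c+1, ..., mn-c-1} still has mass >= 1 - alpha
    while 2 * (c + 1) <= m * n and pmf[c + 1 : m * n - c].sum() >= 1 - alpha:
        c += 1
    coverage = pmf[c : m * n - c + 1].sum()
    return d[c], d[m * n - c + 1], coverage

rng = np.random.default_rng(3)
x = rng.normal(size=6)              # control
y = rng.normal(size=5) + 2.0        # treatment, true shift 2 (for illustration)
lo, hi, coverage = shift_interval(x, y)
print(lo, hi, coverage)
```

The interval is half-open, $[d_{(c)}, d_{(mn-c+1)})$, and its nominal coverage is generally larger than $1-\alpha$ because the null distribution is discrete.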

Testing for a difference in dispersion: the Siegel-Tukey test¶

Assign "ranks" differently: the smallest observation has "rank" 1, the largest has "rank" 2, the second-largest has "rank" 3, the second-smallest has "rank" 4, the third-smallest has "rank" 5, and so on, alternating in pairs between the two ends of the ordered data. Then apply the Wilcoxon rank-sum test to these "ranks." The null distribution of the sum is the same, but this statistic has power against the alternative that treatment affects dispersion (and not location). In contrast, the Wilcoxon rank-sum test has power against the alternative that treatment affects location, not dispersion.

Testing for a difference in distribution: the Smirnov test¶

A common measure of the "distance" between two probability distributions $F$ and $G$ is the Smirnov or Kolmogorov-Smirnov distance, \begin{equation} \| F-G \|_\infty := \sup_{x \in \mathbb{R}} |F(x)-G(x)|. \end{equation} A two-sample test sensitive to general differences (rather than changes in location and/or spread) can be based on that distance.

Let $\hat{F}$ be the empirical CDF of the responses to treatment and $\hat{G}$ be the empirical CDF of the responses to control: \begin{equation} \hat{F}(x) := \frac{1}{n} \sum_{j=1}^n 1_{x \ge Y_j} \end{equation} \begin{equation} \hat{G}(x) := \frac{1}{m} \sum_{j=1}^m 1_{x \ge X_j}. \end{equation} The Smirnov test uses $\|\hat{F} - \hat{G}\|_\infty$ as the test statistic.

Note that the value of $\|\hat{F} - \hat{G}\|$ depends only on the relative ordering of the values $\{X_j\}$ and $\{Y_j\}$, not on their numerical values. Thus, replacing the original data by their ranks does not change the test statistic (or its distribution under the null hypothesis): the Smirnov test is a rank-based test. Moreover, if there are no ties, the null distribution of the test statistic depends only on $n$ and $m$.
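The rank invariance and a permutation calibration of the Smirnov statistic can be sketched as follows. The helper names `smirnov_stat` and `smirnov_perm_test`, the simulated data, and the number of replications are illustrative assumptions; `scipy.stats.ks_2samp` is used only to cross-check the value of the statistic.

```python
import numpy as np
from scipy.stats import ks_2samp, rankdata

def smirnov_stat(x, y):
    """sup_t |ecdf_x(t) - ecdf_y(t)|; the supremum is attained at one of the data values."""
    x, y = np.sort(x), np.sort(y)
    pooled = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, pooled, side='right') / len(x)
    cdf_y = np.searchsorted(y, pooled, side='right') / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

def smirnov_perm_test(x, y, reps=2000, seed=None):
    """Simulated permutation P-value for the Smirnov statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    m = len(x)
    obs = smirnov_stat(x, y)
    hits = 0
    for _ in range(reps):
        perm = rng.permutation(pooled)
        # small tolerance guards against floating-point differences between equal values
        hits += smirnov_stat(perm[:m], perm[m:]) >= obs - 1e-12
    return obs, (hits + 1) / (reps + 1)

rng = np.random.default_rng(2024)
x = rng.normal(size=30)             # "control"
y = rng.normal(loc=1.0, size=25)    # "treatment"
stat, p = smirnov_perm_test(x, y, seed=1)

ranks = rankdata(np.concatenate([x, y]))
assert np.isclose(stat, smirnov_stat(ranks[:30], ranks[30:]))   # same value from the ranks
assert np.isclose(stat, ks_2samp(x, y).statistic)               # agrees with scipy
print(stat, p)
```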

Here is a simulation of the Smirnov test statistic under the null hypothesis:

In [2]:

%matplotlib inline
import matplotlib.pyplot as plt
import math
import numpy as np
import scipy as sp
import scipy.stats
from scipy.stats import binom
import pandas as pd
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML

In [24]:

def ecdf(x):
    '''
       calculates the empirical cdf of data x
       returns the unique values of x in ascending order and the cumulative probability at those values
       Sorting (inside np.unique) dominates the cost, so this runs in O(n log n),
       where n is the length of x.
    '''
    # np.unique returns the distinct values in ascending order, with their multiplicities
    theVals, counts = np.unique(x, return_counts=True)
    theProbs = np.cumsum(counts)/float(len(x))
    if (theVals[0] > 0.0):
        # prepend the point (0, 0) so the plotted step function starts at zero
        theVals = np.append(0., theVals)
        theProbs = np.append(0., theProbs)
    return theVals, theProbs


def plotSmirnov(m, n):
    '''
    plot the ecdfs and the Smirnov statistic for two samples (of sizes n and m)
    drawn from the same uniform distribution, i.e., under the null hypothesis
    '''
    sam = np.random.uniform(size=n+m)
    y, pry = ecdf(sam[0:n])
    x, prx = ecdf(sam[n:])
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.step(y, pry, color='b', where='post', label=f'ecdf(y): {n} data')
    ax.step(x, prx, color='r', where='post', label=f'ecdf(x): {m} data')
    ax.legend(loc='best')
    # the largest gap between the two ecdfs is attained at one of the data values
    diff = [abs(np.sum(sam[n:] <= zi)/m - np.sum(sam[0:n] <= zi)/n) for zi in sam]
    dLoc = np.argmax(diff)
    zm = sam[dLoc]
    xd = np.sum(sam[n:] <= zm)/m
    yd = np.sum(sam[0:n] <= zm)/n
    ymin, ymax = min(xd, yd), max(xd, yd)
    # vlines uses data coordinates; axvline's ymin/ymax are fractions of the axes height
    ax.vlines(zm, ymin, ymax, colors='g', linewidth=2)
    ax.text(0.5, 0.1, f'Smirnov statistic: {round(ymax-ymin, 3)}', color='g', weight='heavy')
    plt.show()

interact(plotSmirnov, m=widgets.IntSlider(min=3, max=300, step=1, value=10), n=widgets.IntSlider(min=3, max=300, step=1, value=10))


Wilcoxon signed rank test¶

The signed (mid)-rank of an observation $x_k \in \{x_j\}_{j=1}^N$ is $\mathrm{sgn}(x_k) r_k$, where $r_k$ is the rank of $|x_k|$ in the set $\{|x_j|\}_{j=1}^N$, and \begin{equation} \mathrm{sgn}(x) := \left \{ \begin{array}{lr} -1, & x < 0 \cr 0, & x=0 \cr 1, & x > 0. \end{array} \right . \end{equation}

Suppose there are $N$ pairs of subjects; in each pair, one subject is assigned to treatment and one to control, at random, with equal probability. Look at the differences between the treated and control subject in each pair. Add the signed ranks of those differences. To test against the alternative that treatment increases (decreases) responses, reject if the sum is sufficiently large (small).
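A minimal sketch of the signed-rank sum and a simulated randomization test follows. Under the null hypothesis, random assignment within each pair makes the sign of each nonzero difference equally likely to be positive or negative, with the magnitudes fixed. The function names and the example differences are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def signed_rank_sum(d):
    """Sum of sgn(d_k) times the mid-rank of |d_k| among the absolute differences."""
    d = np.asarray(d, float)
    return np.sum(np.sign(d) * rankdata(np.abs(d)))

def signed_rank_test(d, reps=10_000, seed=None):
    """Simulated one-sided P-value: reject for large values of the signed-rank sum."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d, float)
    r = rankdata(np.abs(d)) * (d != 0)       # zero differences contribute 0 (sgn(0) = 0)
    obs = np.sum(np.sign(d) * r)
    signs = rng.choice([-1, 1], size=(reps, len(d)))   # equally likely signs under the null
    sims = signs @ r
    return obs, (np.sum(sims >= obs) + 1) / (reps + 1)

# illustrative treated-minus-control differences for 10 pairs
d = np.array([1.2, 0.4, -0.3, 2.1, 0.9, 1.7, -0.1, 0.8, 1.5, 0.6])
obs, p = signed_rank_test(d, seed=0)
print(obs, p)
```

For small $N$ the $2^N$ sign patterns could be enumerated exactly instead of sampled.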

[MORE TO COME]

The $k$-sample problem¶

(Calling this $k$, but we are using $T$ to denote the number of groups.) Size of group $t$ is $n_t$, $t=0, \ldots, T-1$.

Testing for differences in location¶

The Kruskal-Wallis test¶

\begin{equation} K := \frac{12}{N(N+1)} \sum_{t=0}^{T-1} n_t \left ( \bar{R}_t - (N+1)/2 \right )^2, \end{equation}

where $\bar{R}_t$ is the mean of the ranks of the observations in group $t$. (The mean of all $N$ (mid-)ranks is $(N+1)/2$.)

Recall that the $F$-statistic is

\begin{equation} F := \frac{\mbox{ "explained variance" }}{\mbox{"unexplained variance"}} = \frac{\sum_{t=0}^{T-1} n_t({\bar {Y}}_{t\cdot }-{\bar {Y}})^{2}/(T-1)}{\sum_{t=0}^{T-1}\sum _{j=1}^{n_t} \left( Y_{tj}-{\bar {Y}}_{t\cdot } \right)^{2}/(N-T)}. \end{equation}

Thus, $K$ is essentially the $F$-statistic of the ranks, since the overall sum of squared differences from the mean rank is constant.

The asymptotic null distribution (as all the group sizes approach infinity) is chi-square with $T-1$ degrees of freedom.
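The statistic $K$ can be computed directly from the formula above. In this sketch the function name `kruskal_wallis` and the simulated groups are assumptions; with continuous data there are no ties, so the value should agree with `scipy.stats.kruskal`, which additionally applies a correction when ties are present.

```python
import numpy as np
from scipy.stats import rankdata, kruskal

def kruskal_wallis(groups):
    """K = 12/(N(N+1)) * sum_t n_t (Rbar_t - (N+1)/2)^2, using ranks in the pooled data."""
    sizes = [len(g) for g in groups]
    N = sum(sizes)
    ranks = rankdata(np.concatenate(groups))
    total, start = 0.0, 0
    for n_t in sizes:
        rbar = ranks[start:start + n_t].mean()     # mean rank in group t
        total += n_t * (rbar - (N + 1) / 2) ** 2
        start += n_t
    return 12 / (N * (N + 1)) * total

rng = np.random.default_rng(11)
groups = [rng.normal(loc=mu, size=size) for mu, size in [(0.0, 8), (0.5, 10), (1.0, 7)]]
K = kruskal_wallis(groups)
assert np.isclose(K, kruskal(*groups).statistic)
print(K)
```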

The permutation $F$ test¶

Apply the usual $F$ test, but calibrate it using the permutation distribution.
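As a sketch of this calibration, the $F$ statistic can be recomputed after randomly re-assigning the pooled responses to groups of the original sizes. The function name `perm_F_test`, the simulated groups, and the number of replications are illustrative assumptions; `scipy.stats.f_oneway` supplies the $F$ statistic.

```python
import numpy as np
from scipy.stats import f_oneway

def perm_F_test(groups, reps=2000, seed=None):
    """Simulated permutation P-value for the one-way F statistic."""
    rng = np.random.default_rng(seed)
    cuts = np.cumsum([len(g) for g in groups])[:-1]   # where to split the pooled data
    pooled = np.concatenate(groups)
    f_obs = f_oneway(*groups).statistic
    hits = 0
    for _ in range(reps):
        perm = rng.permutation(pooled)                 # random re-assignment to groups
        hits += f_oneway(*np.split(perm, cuts)).statistic >= f_obs
    return f_obs, (hits + 1) / (reps + 1)

rng = np.random.default_rng(5)
groups = [rng.exponential(size=9), rng.exponential(size=11) + 0.3, rng.exponential(size=8) + 0.6]
f_obs, p = perm_F_test(groups, seed=0)
print(f_obs, p)
```

Replacing the raw responses by their ranks before calling the same function gives a permutation version of a rank-based test, since $K$ is a monotone function of the $F$ statistic of the ranks.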

Confidence bounds for quantiles¶

Suppose $\{X_j\}_{j=1}^N$ are real random variables that are IID $F$. We want a confidence bound for $F_\alpha$, the $\alpha$ quantile of $F$, i.e., the smallest value of $x$ for which $F(x) := \mathbb{P} \{X_j \le x \} \ge \alpha$.

The approach illustrated above for finding a confidence interval for the shift $d$ is very similar to a nonparametric approach to finding confidence intervals for a quantile of a probability distribution. Let $\{X_j\}_{j=1}^N$ be IID real-valued random variables with continuous cdf $F$. For $0 < q < 1$, let $x_q := F^{-1}(q) = \inf \{x: F(x) \ge q \}$ denote the $q$th quantile of the distribution. We shall find a confidence interval for $x_q$ that does not depend on any other assumptions about $F$.

If $x_q=x$, the probability that $X_j \le x$ is $q$. The data are independent, so $N(x) := \#\{j : X_j \le x\}$ has a binomial distribution with parameters $n=N$ and $p=q$. Let $p_{N,q}(k) := {{N} \choose {k}} q^k (1-q)^{N-k} = \mathbb{P}_{x_q=x}\{ N(x) = k \}$, $k=0, \ldots, N$.

Note that $\mathbb{P}_{x_q=x} \{ N(x) = k \}$ does not depend on $x_q$. The probability distribution of $N(x)$ is unimodal; we could build a level $\alpha$ test of the hypothesis that $x_q=x$ by assigning to the rejection region the values of $k$ for which $p_{N,q}(k)$ is smallest, subject to the constraint that the sum of the probabilities is at most $\alpha$.

To find a confidence interval for $x_q$, we could start with the $k$th order statistic $X_{(k)}$ as a trial value for $x$, where $k$ is as close as possible to $qN$, then work "outward" as before, increasing and decreasing $x$ until the test of the hypothesis $x_q = x$ rejects.

We can streamline the approach in the same way as we did to find a confidence interval for the treatment effect $\Delta$ in the shift model. The values of $x$ at which $N(x)$ changes are the order statistics $X_{(j)}$, so it suffices to consider them when searching for the endpoints of the confidence interval. Now $N(x) = k$ iff $X_{(k)} \le x < X_{(k+1)}$, $k = 0, 1, \ldots, N$, where $X_{(0)} := -\infty$ and $X_{(N+1)} := \infty$.

\begin{equation} \mathbb{P}_{x_q} \{ X_{(k)} \le x_q < X_{(k+1)} \} = {{N} \choose {k}} q^k (1-q)^{N-k}. \end{equation}

The order statistics carve the real line into intervals with known chances of containing the true value of $x_q$. Those chances are determined by the binomial distribution with parameters $N$ and $q$. We can find a confidence interval for $x_q$ by starting with the interval $[X_{(k)}, X_{(k+1)})$ with the largest chance of containing $x_q$, then appending intervals in decreasing order of their probability of containing $x_q$ until the sum of their probabilities is at least $1-\alpha$.
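The procedure can be sketched as follows, assuming the cdf is continuous (so ties have probability zero). The function name `quantile_interval` and the simulated data are hypothetical; `scipy.stats.binom.pmf` gives the probabilities $p_{N,q}(k)$.

```python
import numpy as np
from scipy.stats import binom

def quantile_interval(x, q, alpha=0.05):
    """Interval [X_(lo), X_(hi+1)) built by adding the cells [X_(k), X_(k+1))
    in decreasing order of binomial probability until the total is >= 1 - alpha."""
    x = np.sort(np.asarray(x, float))
    N = len(x)
    pmf = binom.pmf(np.arange(N + 1), N, q)       # P{X_(k) <= x_q < X_(k+1)}, k = 0..N
    order = np.argsort(-pmf, kind='stable')       # most probable cells first
    total, chosen = 0.0, []
    for k in order:
        chosen.append(k)
        total += pmf[k]
        if total >= 1 - alpha:
            break
    lo_k, hi_k = min(chosen), max(chosen)
    padded = np.concatenate([[-np.inf], x, [np.inf]])   # padded[k] is X_(k), k = 0..N+1
    coverage = pmf[lo_k:hi_k + 1].sum()                 # probability of the whole interval
    return padded[lo_k], padded[hi_k + 1], coverage

rng = np.random.default_rng(9)
sample = rng.exponential(size=40)
lo, hi, coverage = quantile_interval(sample, q=0.5)
print(lo, hi, coverage)
```

An endpoint of $\pm\infty$ signals that the sample is too small to bound the quantile on that side at the requested confidence level.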

Exercise. Extend this derivation to distributions that are not necessarily continuous to obtain a conservative confidence interval for $x_q$.

One-sample tests¶

Typically used to test whether data are a sample from a symmetric distribution or a distribution with a specified median.

When there is no random assignment¶

Imaginary randomization¶

In many situations in which hypothesis tests are used, there is no actual random assignment. This happens frequently in observational studies involving humans. For instance, the following characteristics are not really assigned at random:

gender
ethnicity
socio-economic status (SES)
educational attainment
whether people are housed, and if so, where they live

In such situations, the corresponding null hypothesis is typically that the characteristic/label is assigned "as if" at random, i.e., that the characteristic is an arbitrary label unconnected with anything else.

For more discussion, see

Freedman, D.A., and R. Berk, 2012. Statistical Assumptions as Empirical Commitments. https://www.cambridge.org/core/books/abs/statistical-models-and-causal-inference/statistical-assumptions-as-empirical-commitments/B1879A1919929D5948F6BB539F91D786
Freedman, D.A., and D. Lane, 1983. A Nonstochastic Interpretation of Reported Significance Levels, J. Business and Economic Statistics, 1, 292-298. https://www.tandfonline.com/doi/abs/10.1080/07350015.1983.10509354