# first set up inline plotting
%pylab inline
The problem occurs when you perform several statistical tests: by chance, some will be significant even if the data were generated under the null hypothesis.
Imagine you are doing $n$ t-tests, each with a type I error risk of $\alpha$, and denote by $t_\alpha$ the t-value such that, for a variable $t$ drawn from a Student distribution, we have $P(t > t_\alpha) = \alpha$. If the tests are independent, the probability of getting at least one significant test among these $n$ tests is:
$$P(\textrm{at least one significant}) = 1 - (1 - \alpha)^n$$
For $n = 100$ and $\alpha = 0.05$ this is $1 - 0.95^{100} \approx 0.994$: a false positive is almost guaranteed.
# code demo: a t-distribution
from scipy.stats import t as tdist
df = 14
n = 100
alph = .05
# Often easier to work with the frozen t distribution with df degrees of freedom: t14 = tdist(df)
# rvs : draw one or several random variables
ts = tdist(df).rvs(size=n)
# The threshold as a t value:
t_alph = tdist.isf(alph,df)
print "Nubmer of test significant at the %f level: %d " %(alph, sum(ts >= t_alph))
Nubmer of test significant at the 0.050000 level: 5
To correct for the number of tests with the Bonferroni correction (strictly speaking, the formula $1 - (1 - p)^n$ used below is the exact Šidák variant for independent tests; plain Bonferroni would use $\alpha/n$):
# if we have p values :
p = tdist(df).sf(ts)
p_corr = (1 - (1 - p)**n)
print "Number of test significant with uncorrected threshold: %d, corrected: %d ", \
sum(p <= alph), sum(p_corr <= alph)
# This can be put in a little function:
def p_corrected(ps, n):
    return 1 - (1 - ps)**n
# alternatively: compare the original p-values with: 1.0 - exp(1./n * log(1 - .05))
def alph_corr(alph, n):
    return 1.0 - exp(1./n * log(1 - alph))
print "Uncorrected threshold: %f, corrected threshold: %f" % (alph, alph_corr(alph,n))
Number of tests significant with uncorrected threshold: 5, corrected: 0
Uncorrected threshold: 0.050000, corrected threshold: 0.000513
On average, we would have to run the previous code about 20 times to get a (spurious) significant effect, since the corrected threshold brings the probability of at least one false positive per run back to about 0.05.
Note that when the tests are NOT INDEPENDENT, this procedure is conservative (on average you will get fewer false positives than the specified risk of error), and therefore not the most sensitive.
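To see this empirically, here is a minimal sketch (the shared-noise model and all the names in it are ours, for illustration only): each run draws strongly correlated z-values, applies the corrected threshold, and counts how often at least one test comes out (falsely) significant. The empirical family-wise error rate falls clearly below the nominal $\alpha$.
import numpy as np
import scipy.stats as st
n_tests, n_runs, a = 100, 1000, 0.05
thr = 1.0 - np.exp(np.log(1 - a) / n_tests)  # corrected threshold, as above
n_fp_runs = 0
for run in xrange(n_runs):
    shared = st.norm.rvs()  # common component shared by all tests
    # each z is still marginally N(0,1), but pairwise correlation is 0.64
    z = 0.8 * shared + 0.6 * st.norm.rvs(size=n_tests)
    p = st.norm.sf(z)
    n_fp_runs += (p < thr).any()
print "Empirical FWER under dependence: %.3f (nominal %.2f)" % (1.0 * n_fp_runs / n_runs, a)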
The false discovery rate was introduced by Benjamini and Hochberg (1995); see http://en.wikipedia.org/wiki/False_discovery_rate for a history of this measure.
Let's denote by $V$ the number of false positives (true null hypotheses declared significant), by $S$ the number of true positives, and by $R = V + S$ the total number of tests declared significant.
We want to control the procedure such that $Q = \frac{V}{S+V} = \frac{V}{R}$ is small (taking $Q = 0$ when $R = 0$).
The false discovery rate is the expectation of the false discovery proportion: $\textrm{FDR} = E(Q)$.
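Why control $Q$? A small Monte Carlo sketch (the mixture of 5 signal tests among 100, and the effect size of 3, are invented for illustration) shows that at the uncorrected threshold a large fraction of the rejected tests are false positives:
import numpy as np
import scipy.stats as st
n_tests, n_signal, n_runs, a = 100, 5, 1000, 0.05
fdp = []
for run in xrange(n_runs):
    z = st.norm.rvs(size=n_tests)
    z[:n_signal] += 3.0            # the first n_signal tests carry true signal
    rejected = st.norm.sf(z) < a   # uncorrected threshold
    V = rejected[n_signal:].sum()  # false positives: null tests rejected
    R = rejected.sum()             # total number of rejections
    fdp.append(1.0 * V / R if R > 0 else 0.0)
print "Estimated FDR at the uncorrected threshold: %.2f" % np.mean(fdp)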
The procedure, step by step:
First, we order the p-values in increasing order. We start by testing the smallest p-value (the highest t or z), and declare the test significant if its p-value is less than $\alpha/n$, for instance $0.05/n$. If this test is significant, we test the next smallest p-value; our new threshold to declare significance is $2\alpha/n$. This is iterated, comparing the $i$-th smallest p-value to $i\alpha/n$, until a test is not significant.
An intuition: $i \frac{\alpha}{n}$ is the expected number of successes if I draw $i$ times from a Bernoulli distribution with probability $\frac{\alpha}{n}$.
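A toy walk-through of these steps (the five p-values are made up for illustration):
import numpy as np
p_sorted = np.array([0.001, 0.008, 0.018, 0.041, 0.200])  # already sorted
n_p, a = len(p_sorted), 0.05
for i in range(1, n_p + 1):
    thr = i * a / n_p
    if p_sorted[i - 1] < thr:
        print "p(%d) = %.3f < %d*alpha/n = %.3f : significant" % (i, p_sorted[i - 1], i, thr)
    else:
        print "p(%d) = %.3f >= %d*alpha/n = %.3f : stop" % (i, p_sorted[i - 1], i, thr)
        break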
Let's implement this and compare to the Bonferroni-corrected and uncorrected thresholds:
import scipy.stats as st
norm01 = st.norm(0,1)
# adapted from nipy/algorithms/statistics/empirical_pvalue
def fdr(p_values=None, verbose=0):
    """Returns the FDR associated with each p-value

    Parameters
    ----------
    p_values : ndarray of shape (n)
        The p-values of the samples

    Returns
    -------
    q : array of shape (n)
        The corresponding fdr values
    """
    #p_values = check_p_values(p_values)
    n_samples = p_values.size
    order = p_values.argsort()
    sp_values = p_values[order]
    # compute q while in ascending order
    q = np.minimum(1, n_samples * sp_values / np.arange(1, n_samples + 1))
    # enforce monotonicity of q along the sorted order
    for i in range(n_samples - 1, 0, -1):
        q[i - 1] = min(q[i], q[i - 1])
    # reorder the results
    inverse_order = np.arange(n_samples)
    inverse_order[order] = np.arange(n_samples)
    q = q[inverse_order]
    if verbose:
        import matplotlib.pylab as mp
        mp.figure()
        mp.xlabel('Input p-value')
        mp.plot(p_values, q, '.')
        mp.ylabel('Associated fdr')
    return q
def fdr_threshold(p_values, alpha=0.05):
    """Return the FDR threshold given p-values

    Parameters
    ----------
    p_values : array of shape (n)
        The p-values of the samples
    alpha : float, optional
        The desired FDR significance

    Returns
    -------
    critical_p_value : float
        The p-value corresponding to the FDR alpha
    """
    n_samples = np.size(p_values)
    sp_values = np.sort(p_values)
    p_corr = alpha / n_samples
    i = np.arange(1, n_samples + 1)  # i from 1 to n_samples
    critical_set = sp_values[sp_values < p_corr * i]
    if len(critical_set) > 0:
        critical_p_value = critical_set.max()
    else:
        critical_p_value = p_corr
    return critical_p_value
n = 1000
unif = st.uniform()
p = unif.rvs(size=n)
q = []
alph = arange(0.01,0.1,0.01)
for a in alph:
    q.append(fdr_threshold(p, a))
#more elegant:
q = [fdr_threshold(p,a) for a in alph]
print size(alph), size(q)
plot(alph, q,'--')
9 9
n = 1000
z = norm01.rvs(size=n)
p = norm01.sf(z)
print fdr_threshold(p)
print sum(p < fdr_threshold(p))
q = fdr(p, verbose=1)
if len(z) <= 20:
    order = z.argsort()
    print "z \n", z
    print "p \n", p
    print "q \n", q
    print "sorted z \n", z[order]
    print "sorted fdr \n", q[order]
5e-05
0
alph = 0.05
df = 14
t14 = tdist(df)
n = 200
ts = t14.rvs(size=n)
# ts = np.sort(ts) # careful of ts = ts.sort() .... you get None !!!
ps = t14.sf(ts)
#uncorrected :
print 'Uncorrected: ',sum(ps < alph)
#Bonferroni :
print 'Bonferroni: ',sum(ps < alph_corr(alph,n))
# FDR :
fdr_alph = fdr_threshold(ps)
print 'FDR: ',sum(ps < fdr_alph)
Uncorrected:  9
Bonferroni:  0
FDR:  0
# Adding signal: first a quick exploratory check with z-values
# (this cell's variables are not reused below)
z = st.norm.rvs(0, 1, size=n)
print sum(z > 1.5)
z = z[z > 1.5] + 1
13
alph, df, n = 0.05, 14, 100
t14 = tdist(df)
ts = t14.rvs(size=n)
# add some signal :
print "number of true positive: ", sum(ts > 1.5)
ts[ts > 1.5] += 1.5
ps = t14.sf(ts)
#uncorrected :
print 'Uncorrected: ',alph, sum(ps < alph)
#Bonferroni :
print 'Bonferroni: ', alph_corr(alph,n), sum(ps < alph_corr(alph,n))
#FDR :
fdr_alph = fdr_threshold(ps)
q = fdr(ps, verbose=1)
print 'FDR: ',fdr_alph, sum(ps < fdr_alph), sum(q <= alph)
number of tests with true signal:  16
Uncorrected:  0.05 16
Bonferroni:  0.000512801416262 0
FDR:  0.00463758728239 15 16
The principle of the permutation test is simple: we construct the distribution of the statistic we are interested in under the assumption that there is no signal. For instance, let's take the one-sample t-test scenario.
#N subjects, z[i] = effects observed for subject i.
N = 15
mu, sigma = 0.75, 1.0
normH1 = st.norm(mu, sigma)
z = normH1.rvs(size=N) # a series of z values under H1
print z
[ 0.44980773 0.3049634 0.23894538 0.95284122 0.85829427 2.60215513 1.47310698 0.4147775 -0.09161498 0.37855162 1.17150395 0.69070465 3.3456306 0.89502638 0.68098054]
The idea here is the following: if the z values are drawn from a normal distribution with mean zero, then a negative value is as likely as a positive one (the distribution is symmetric). We can therefore arbitrarily swap the signs of our values and construct the distribution of their mean under this hypothesis.
psign = np.random.randint(2, size=len(z))
psign[psign==0] = -1
print psign
print z * psign
[ 1 -1 1 -1 -1 1 1 1 -1 1 1 1 1 1 1]
[ 0.44980773 -0.3049634 0.23894538 -0.95284122 -0.85829427 2.60215513 1.47310698 0.4147775 0.09161498 0.37855162 1.17150395 0.69070465 3.3456306 0.89502638 0.68098054]
def distrib_under_H0(x, stat, nperm=1000):
    """Return the sign-flip permutation distribution of stat(x) under H0."""
    dist = []
    for i in xrange(nperm):
        psign = np.random.randint(2, size=len(x))
        psign[psign == 0] = -1
        dist.append(stat(x * psign))
    # return an array so that comparisons like d_H0 > x work elementwise
    return np.array(dist)
def my_stat(x):
    return mean(x)
d_H0 = distrib_under_H0(z, my_stat)
size(d_H0)
hist(d_H0)
mean(z)
empirical_p = 1.0 * sum(d_H0 > mean(z))/size(d_H0)
print my_stat(z), empirical_p
0.957711624634 0.0
In this instance, we have a significant p-value at the 5 percent risk of error, because under the null hypothesis the observed statistic occurs less than 5 percent of the time. This framework is easily extended to the case of multiple tests:
We simply take the distribution of the maximum statistic across tests!
An example:
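Here is a sketch of this maximum-statistic permutation test (the data sizes, the effect size, and putting signal on the first variable only are our choices for illustration). The signs are flipped per subject, i.e. per row, so the correlation structure across tests is preserved; each permutation contributes only its maximum mean, and comparing each observed statistic to this distribution controls the family-wise error rate.
import numpy as np
N, n_tests, nperm = 15, 50, 1000
x = np.random.randn(N, n_tests)  # N subjects x n_tests variables, pure noise
x[:, 0] += 1.5                   # add signal to the first variable only
observed = x.mean(axis=0)        # one statistic (the mean) per test
max_dist = np.zeros(nperm)
for i in xrange(nperm):
    # flip the sign of each subject (whole row at once)
    psign = 2 * np.random.randint(2, size=(N, 1)) - 1
    max_dist[i] = (x * psign).mean(axis=0).max()
# corrected p-value: fraction of permutation maxima above each observed statistic
p_corr = (max_dist[:, None] >= observed).mean(axis=0)
print "tests significant at 0.05 (corrected): ", np.where(p_corr < 0.05)[0]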