Jonathan Taylor (Stanford)
Inference for Large Scale Data, April 20, 2015
http://statweb.stanford.edu/~jtaylo/talks/sfu2015, (notebook)
Selective inference
Running example: model selection with the LASSO (arxiv.org/1311.6238 )
A general framework for selective inference (arxiv.org/1410.2597 )
Further examples of selective inference.
There is often no hypothesis specified before collecting data:
- Screening in *omics
- Peak / bump hunting in neuroimaging
- Model selection in regression
Classical inference requires specifying hypotheses before collecting data.
Selective inference allows for valid inference after some exploration.
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test.
... confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
Today, I will focus on testing hypotheses suggested by the data.
The answer is parametric; interesting asymptotic questions remain.
# Design matrix
# Columns are site / amino acid pairs
X.shape
(633, 91)
# Variable names
NRTI_muts[:10], len(NRTI_muts)
(['P6D', 'P20R', 'P21I', 'P35I', 'P35M', 'P35T', 'P39A', 'P41L', 'P43E', 'P43N'], 91)
fig_3TC
Many coefficients seem small.
Use the LASSO to select variables:
lambda_theoretical
43.0
active_3TC
['P62V', 'P65R', 'P67N', 'P69i', 'P75I', 'P77L', 'P83K', 'P90I', 'P115F', 'P151M', 'P181C', 'P184V', 'P190A', 'P215F', 'P215Y', 'P219R']
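A minimal sketch of how such a fit might look with scikit-learn (the notebook's actual solver is not shown; I assume the talk's penalty is on the $\|y-X\beta\|^2_2/2 + \lambda\|\beta\|_1$ scale, so sklearn's alpha must be rescaled and the recovered active set may differ slightly):

# Hypothetical reproduction, not the notebook's own code: fit the LASSO
# at the fixed theoretical penalty and read off the active set.
# sklearn minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1, so a penalty
# lambda on the ||y - Xb||^2 / 2 + lambda * ||b||_1 scale maps to
# alpha = lambda / n.
import numpy as np
from sklearn.linear_model import Lasso

n = X.shape[0]
fit = Lasso(alpha=lambda_theoretical / n, fit_intercept=False)
fit.fit(X, np.asarray(Y, dtype=float))
active = [NRTI_muts[j] for j in np.nonzero(fit.coef_)[0]]
print(active)  # should roughly match active_3TC above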
fig_3TC
The LASSO selected $\hat{E} \subset$ NRTI_muts of size 16 at $\lambda \approx 43$.
What to report?
Naive inference after selection is wrong.
The reference distribution for the selected model is biased because of cherry-picking.
Why not fix it?
fig_select
The intervals are consistent with the data, having observed the active set.
fig_select
The intervals do, however, seem long.
Laid out formally in arxiv.org/1410.2597.
Data $y \sim F$. (We have no model for $F$ at this point!)
Set of questions ${\cal Q}$ we might ask about $F$.
Use some exploratory technique to generate questions:
- Solve LASSO at some fixed $\lambda$ and look at active set.
- Choose a model by BIC and forward stepwise or best subset.
- Marginal screening.
Instead of a fixed $\lambda$, we might look at the LASSO path.
%%R -i X,Y
library(lars)
plot(lars(X, as.numeric(Y), type='lar'))
Loaded lars 1.2
A sequential procedure might consider "event times" $\lambda_j$.
Might take $${\cal Q} = \{1, \ldots, p\}.$$
The selection procedure is $$j^*(y)= \widehat{\cal Q}(y) = \text{argmax}_{1 \leq j \leq p} |X_j^Ty|.$$
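In code this selection step is a single line; a sketch assuming the X and Y defined above:

import numpy as np

# The first question suggested by the data: the variable most
# correlated (in absolute value) with the response.
j_star = int(np.argmax(np.abs(X.T @ np.asarray(Y, dtype=float))))
print(NRTI_muts[j_star])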
Note: the functionals $\beta_{j|E}$ also appear in POSI.
Simultaneous inference over all questions $(j,E) \in {\cal Q}$.
Achieved by controlling FWER: find $K_{\alpha}$ s.t.
$$\mathbb{P}\left(\max_{(j,E) \in {\cal Q}} \frac{|X_{j|E}^T \epsilon|}{\sigma \|X_{j|E}\|_2} \leq K_{\alpha}\right) \geq 1 - \alpha,$$
where $\epsilon \sim N(0, \sigma^2 I)$ and $X_{j|E}$ is $X_j$ adjusted for the other columns in $E$.
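For a small design the POSI constant can be approximated by simulation. A minimal sketch (the helper name and the brute-force enumeration are mine, not the POSI software; the cost grows as $2^p$, so this is only feasible for small $p$):

import itertools
import numpy as np
from numpy.linalg import pinv

def posi_constant(X, alpha=0.05, nsim=2000, rng=np.random.default_rng(0)):
    # Monte Carlo approximation of K_alpha: the (1 - alpha) quantile of
    # max over (j, E) of |eta_{j|E}' eps| / ||eta_{j|E}||, eps ~ N(0, I).
    n, p = X.shape
    etas = []
    for k in range(1, p + 1):
        for E in itertools.combinations(range(p), k):
            for eta in pinv(X[:, list(E)]):   # rows recover betahat_{j|E}
                etas.append(eta / np.linalg.norm(eta))
    etas = np.array(etas)
    maxima = np.abs(etas @ rng.standard_normal((n, nsim))).max(axis=0)
    return np.quantile(maxima, 1 - alpha)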
We control coverage and type I error on selected questions:
$$\mathbb{P}\left(\text{reject } H_{0,j|E} \,\middle|\, (j,E) \text{ selected}\right) \leq \alpha,$$
where $$ \mathbb{P} = \mathbb{P}_{\mu} = N(\mu, \sigma^2 I). $$
We derive exact pivots for $\beta_{j|E}(\mu)$ under $\mathbb{Q}_{E,z_E}$, the law $\mathbb{P}_{\mu}$ conditioned on the LASSO selecting $(E, z_E)$.
Report intervals based on $\mathbb{Q}_{\hat{E}, z_{\hat{E}}}$.
The selection event is the polytope $\{y : A(E,z_E)\, y \leq b(E,z_E)\}$; $A(E,z_E)$ and $b(E,z_E)$ come from the KKT conditions.
Active block
(Credit Naftali Harris)
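A sketch of how these matrices could be assembled, following the construction in arxiv.org/1311.6238 (the helper name and NumPy details are mine):

import numpy as np

def lasso_selection_polytope(X, E, z, lam):
    # Affine representation {y : A y <= b} of the event that the LASSO
    # at penalty lam selects active set E with sign vector z.
    n, p = X.shape
    E = list(E); z = np.asarray(z, dtype=float)
    XE, XnE = X[:, E], X[:, np.setdiff1d(np.arange(p), E)]
    G = np.linalg.inv(XE.T @ XE)
    P = XE @ G @ XE.T                      # projection onto col(X_E)
    # Inactive block: |X_{-E}'(y - X_E betahat_E)| <= lam, affine in y.
    C = XnE.T @ (np.eye(n) - P)
    d = XnE.T @ XE @ G @ z
    # Active block: the signs of betahat_E agree with z.
    A = np.vstack([C, -C, -np.diag(z) @ G @ XE.T])
    b = np.concatenate([lam * (1 - d), lam * (1 + d),
                        -lam * np.diag(z) @ G @ z])
    return A, b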
Condition on the sufficient statistic $X_1^Ty$, allowing for the effect of $X_1$.
What remains is a one-parameter exponential family $$ \frac{f_{\theta}(z)}{f_0(z)} \propto \exp(\theta z) \cdot 1_{[{\cal V}^-(z^{\perp}),{\cal V}^+(z^{\perp})]}(z) $$ with $\theta = \beta_{3|\{1,3\}}(\mu)/\sigma^2$.
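Under $\theta = 0$ this gives an exact selective p-value in closed form: the survival function of a Gaussian truncated to $[{\cal V}^-, {\cal V}^+]$. A minimal sketch (function name mine), assuming the truncation limits have already been computed from the polytope:

from scipy.stats import norm

def truncated_gaussian_pvalue(z, vminus, vplus, sigma=1.0):
    # P(Z >= z | vminus <= Z <= vplus) for Z ~ N(0, sigma^2); an exact
    # pivot under theta = 0. (Naive evaluation is fragile far in the
    # tails, where the CDF differences underflow.)
    Fm, Fp = norm.cdf(vminus / sigma), norm.cdf(vplus / sigma)
    return (Fp - norm.cdf(z / sigma)) / (Fp - Fm)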
fig_select
Conditional control implies marginal control.
We report results $\phi_{(j| \hat{E})}(y), j \in \hat{E}$.
pvalue_table
| Mutation | Naive OLS | Selective |
|----------|-----------|-----------|
| P62V  | 0.137 | 0.369 |
| P65R  | 0.000 | 0.000 |
| P67N  | 0.000 | 0.000 |
| P69i  | 0.000 | 0.000 |
| P75I  | 0.484 | 0.553 |
| P77L  | 0.270 | 0.469 |
| P83K  | 0.003 | 0.051 |
| P90I  | 0.000 | 0.014 |
| P115F | 0.014 | 0.168 |
| P151M | 0.008 | 0.081 |
| P181C | 0.000 | 0.002 |
| P184V | 0.000 | 0.000 |
| P190A | 0.063 | 0.309 |
| P215F | 0.000 | 0.016 |
| P215Y | 0.000 | 0.000 |
| P219R | 0.001 | 0.099 |
Use some portion of the data to form a model $\hat{E}(y_1)$.
Perform the usual inference for $\beta_{j|\hat{E}(y_1)}$ using the held-out data $y_2$, as in the sketch below.
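A minimal data-splitting sketch (my own illustration, reusing X, Y and lambda_theoretical from above; the 50/50 split and the sklearn / statsmodels pairing are assumptions):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X1, X2, y1, y2 = train_test_split(X, np.asarray(Y, dtype=float),
                                  test_size=0.5, random_state=0)
# Explore on (X1, y1) only...
fit = Lasso(alpha=lambda_theoretical / X1.shape[0], fit_intercept=False)
E_hat = np.nonzero(fit.fit(X1, y1).coef_)[0]
# ...then textbook OLS inference on the untouched half is valid.
print(sm.OLS(y2, X2[:, E_hat]).fit().pvalues)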
Data carving: use only part of the data for exploration, but all of it for inference.
fig_carve
Data carving gives shorter intervals than data splitting.
fig_carve
Inference can no longer be reduced to a simple univariate problem.
carve_pvalue_table
| Mutation | Data splitting | Data carving |
|----------|----------------|--------------|
| P41L  | 0.155 | 0.048 |
| P62V  | 0.555 | 0.204 |
| P65R  | 0.261 | 0.000 |
| P67N  | 0.173 | 0.000 |
| P69i  | 0.479 | 0.000 |
| P77L  | 0.025 | 0.231 |
| P83K  | 0.876 | 0.013 |
| P115F | 0.068 | 0.066 |
| P116Y | 0.220 | 0.204 |
| P181C | 0.000 | 0.000 |
| P184V | 0.000 | 0.000 |
| P215F | 0.000 | 0.009 |
| P215Y | 0.284 | 0.000 |
| P219R | 0.090 | 0.035 |
Nothing here was tied specifically to the LASSO, so long as we can describe the selection event, i.e. the event on which the selected model became interesting.
Nothing was directly tied to the model $N(\mu, \sigma^2 I)$ either.
We can carry out inference with known (or unknown) $\sigma$ under the selected model $$ \mathbb{P}_{\beta_E} \overset{D}{=} N(X_E\beta_E, \sigma^2 I). $$
Split the data $y=(y_1,y_2)$, $X=(X_1,X_2)$.
Run LASSO on $(y_1,X_1,\lambda)$.
Selection event: affine constraints on $y_1$.
Inference for $\beta_{j|E}$: condition $\mathbb{Q}_{\beta_E;(E,z_E)}$ on $X_{E\setminus\{j\}}^Ty$.
Inference requires Monte Carlo sampling from a multivariate Gaussian subject to affine constraints, as sketched below.
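A minimal hit-and-run sketch of such a sampler (entirely my own illustration, not the selective-inference software; the starting point y0 must be feasible, e.g. the observed data vector):

import numpy as np
from scipy.stats import norm

def hit_and_run(A, b, mu, sigma, y0, nsample=1000,
                rng=np.random.default_rng(0)):
    # Sample N(mu, sigma^2 I) restricted to the polytope {y : A y <= b}.
    y, chain = y0.astype(float).copy(), []
    for _ in range(nsample):
        d = rng.standard_normal(y.shape)
        d /= np.linalg.norm(d)                 # random unit direction
        Ad, slack = A @ d, b - A @ y           # slack >= 0 by feasibility
        # The chord {y + t d : lo <= t <= hi} stays inside the polytope.
        lo = np.max(slack[Ad < 0] / Ad[Ad < 0], initial=-np.inf)
        hi = np.min(slack[Ad > 0] / Ad[Ad > 0], initial=np.inf)
        # Along the chord the target is N(d'(mu - y), sigma^2) truncated
        # to [lo, hi]; sample t by inverse-CDF (fragile in far tails).
        c = d @ (mu - y)
        Fa, Fb = norm.cdf((lo - c) / sigma), norm.cdf((hi - c) / sigma)
        t = c + sigma * norm.ppf(Fa + rng.uniform() * (Fb - Fa))
        y = y + t * d
        chain.append(y)
    return np.array(chain)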
Selective intervals (Benjamini & Yekutieli; Weinstein, Fithian and Benjamini; others)
Drop-the-losers binomial designs (Sampson & Sill)
$p$-values for maxima of random fields (Schwartzman & Chen)
Effect size estimation (Benjamini & Rosenblatt; Zhong & Prentice)
An asymptotic analysis of the first step of LAR / LASSO / forward stepwise was considered in covTest.
The selective framework provides an exact test of the global null.
The LAR sequence up to $k$ steps (tracking signs on entering) can be expressed as a set of affine inequalities (including AIC stopping).
This yields an exact extension of covTest beyond the first step.
The LAR approach does not generally allow grouped (i.e. categorical) variables.
There is an extension of covTest to groups.
Coming soon: inference for the complete forward stepwise (FS) path with grouped variables.
The tuning parameter is free of $\sigma$.
The selection event is no longer convex in general, though selective inference is still possible.
Also coming soon.
The selective distribution is used for hypothesis tests and intervals.
Selective pseudo-MLE:
- Mean estimation in an orthogonal design after BH.
- Provides an estimate of $\sigma^2$ in the $\sqrt{\text{LASSO}}$.
Inference so far is very parametric.
We have some partial results: arxiv.org/abs/1501.03588.
Ryan Tibshirani and others at CMU also have something coming.
What about GLMs? No explicit results yet.
Let $(T(x))_{x \in B}$ be an image of test statistics (an SPM, i.e. statistical parametric map).
Set ${\cal Q}=B$, and $\hat{\cal Q}(T)$ to be the set of local maxima / minima.
Report a p-value for each critical point in $\hat{\cal Q}(T)$.
Long history in brain imaging (work of Worsley, Friston, Benjamini).
The goal was simultaneous inference.
Simultaneous tools can be converted to selective tools (arxiv.org/1308.3020).
Similar approach can be used for testing in PCA (arxiv.org/1410.8260).
Recent work of Schwartzman & Chen (arxiv.org/1405.1400).
Selective distributions: Slepian models / Palm distributions.
Supported by NSF-DMS 1208857 and AFOSR-113039.
Many collaborators.