Jonathan Taylor (Stanford)
Inference for Large Scale Data, April 20, 2015
http://statweb.stanford.edu/~jtaylo/talks/sfu2015, (notebook)
Selective inference
Running example: model selection with the LASSO (arxiv.org/1311.6238 )
A general framework for selective inference (arxiv.org/1410.2597 )
Further examples of selective inference.
There is often no hypothesis specified before collecting data:
- Screening in *omics
- Peak / bump hunting in neuroimaging
- Model selection in regression
Classical inference requires specifying hypotheses before collecting data.
Selective inference allows for valid inference after some exploration.
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test.
... confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
Today, I will focus on testing hypotheses suggested by the data.
The answer is parametric; interesting asymptotic questions remain.
# Design matrix
# Columns are site / amino acid pairs
X.shape
(633, 91)
# Variable names
NRTI_muts[:10], len(NRTI_muts)
(['P6D', 'P20R', 'P21I', 'P35I', 'P35M', 'P35T', 'P39A', 'P41L', 'P43E', 'P43N'], 91)
fig_3TC
Many coefficients seem small.
Use the LASSO to select variables:
lambda_theoretical
43.0
active_3TC
['P62V', 'P65R', 'P67N', 'P69i', 'P75I', 'P77L', 'P83K', 'P90I', 'P115F', 'P151M', 'P181C', 'P184V', 'P190A', 'P215F', 'P215Y', 'P219R']
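A minimal sketch of how such a fit might look with scikit-learn (the notebook's actual solver is not shown; I assume the talk's penalty is on the $\|y-X\beta\|^2_2/2 + \lambda\|\beta\|_1$ scale, so sklearn's alpha must be rescaled and the recovered active set may differ slightly):

# Hypothetical reproduction, not the notebook's own code: fit the LASSO
# at the fixed theoretical penalty and read off the active set.
# sklearn minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1, so a penalty
# lambda on the ||y - Xb||^2 / 2 + lambda * ||b||_1 scale maps to
# alpha = lambda / n.
import numpy as np
from sklearn.linear_model import Lasso

n = X.shape[0]
fit = Lasso(alpha=lambda_theoretical / n, fit_intercept=False)
fit.fit(X, np.asarray(Y, dtype=float))
active = [NRTI_muts[j] for j in np.nonzero(fit.coef_)[0]]
print(active)  # should roughly match active_3TC above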
fig_3TC
The LASSO selected $\hat{E} \subset$ NRTI_muts of size 16 at $\lambda \approx 43$.
What to report?
Naive inference after selection is wrong.
The reference distribution for the selected model is biased because of cherry-picking.
Why not fix it?
fig_select
The intervals are consistent with the data, having observed the active set.
fig_select
The intervals do, however, seem long.
Laid out formally in arxiv.org/1410.2597.
Data $y \sim F$. (We have no model for $F$ at this point!)
Set of questions ${\cal Q}$ we might ask about $F$.
Use some exploratory technique to generate questions:
- Solve LASSO at some fixed $\lambda$ and look at active set.
- Choose a model by BIC and forward stepwise or best subset.
- Marginal screening.
Instead of a fixed $\lambda$, we might look at the LASSO path.
%%R -i X,Y
library(lars)
plot(lars(X, as.numeric(Y), type='lar'))
Loaded lars 1.2
A sequential procedure might consider "event times" $\lambda_j$.
Might take $${\cal Q} = \{1, \ldots, p\}.$$
The selection procedure is $$j^*(y)= \widehat{\cal Q}(y) = \text{argmax}_{1 \leq j \leq p} |X_j^Ty|.$$
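In code this selection step is a single line; a sketch assuming the X and Y defined above:

import numpy as np

# The first question suggested by the data: the variable most
# correlated (in absolute value) with the response.
j_star = int(np.argmax(np.abs(X.T @ np.asarray(Y, dtype=float))))
print(NRTI_muts[j_star])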
Note: the functionals $\beta_{j|E}$ also appear in POSI.
Simultaneous inference over all questions $(j,E) \in {\cal Q}$.
Achieved by controlling FWER: find $K_{\alpha}$ s.t.
$$\mathbb{P}\left(\max_{(j,E) \in {\cal Q}} \frac{|X_{j|E}^T \epsilon|}{\sigma \|X_{j|E}\|_2} \leq K_{\alpha}\right) \geq 1 - \alpha,$$
where $\epsilon \sim N(0, \sigma^2 I)$ and $X_{j|E}$ is $X_j$ adjusted for the other columns in $E$.
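For a small design the POSI constant can be approximated by simulation. A minimal sketch (the helper name and the brute-force enumeration are mine, not the POSI software; the cost grows as $2^p$, so this is only feasible for small $p$):

import itertools
import numpy as np
from numpy.linalg import pinv

def posi_constant(X, alpha=0.05, nsim=2000, rng=np.random.default_rng(0)):
    # Monte Carlo approximation of K_alpha: the (1 - alpha) quantile of
    # max over (j, E) of |eta_{j|E}' eps| / ||eta_{j|E}||, eps ~ N(0, I).
    n, p = X.shape
    etas = []
    for k in range(1, p + 1):
        for E in itertools.combinations(range(p), k):
            for eta in pinv(X[:, list(E)]):   # rows recover betahat_{j|E}
                etas.append(eta / np.linalg.norm(eta))
    etas = np.array(etas)
    maxima = np.abs(etas @ rng.standard_normal((n, nsim))).max(axis=0)
    return np.quantile(maxima, 1 - alpha)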
We control coverage and type I error on selected questions:
$$\mathbb{P}\left(\text{reject } H_{0,j|E} \,\middle|\, (j,E) \text{ selected}\right) \leq \alpha,$$
where $$ \mathbb{P} = \mathbb{P}_{\mu} = N(\mu, \sigma^2 I). $$
We derive exact pivots for $\beta_{j|E}(\mu)$ under $\mathbb{Q}_{E,z_E}$, the law $\mathbb{P}_{\mu}$ conditioned on the LASSO selecting $(E, z_E)$.
Report intervals based on $\mathbb{Q}_{\hat{E}, z_{\hat{E}}}$.
The selection event is the polytope $\{y : A(E,z_E)\, y \leq b(E,z_E)\}$; $A(E,z_E)$ and $b(E,z_E)$ come from the KKT conditions.
Active block
(Credit Naftali Harris)
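A sketch of how these matrices could be assembled, following the construction in arxiv.org/1311.6238 (the helper name and NumPy details are mine):

import numpy as np

def lasso_selection_polytope(X, E, z, lam):
    # Affine representation {y : A y <= b} of the event that the LASSO
    # at penalty lam selects active set E with sign vector z.
    n, p = X.shape
    E = list(E); z = np.asarray(z, dtype=float)
    XE, XnE = X[:, E], X[:, np.setdiff1d(np.arange(p), E)]
    G = np.linalg.inv(XE.T @ XE)
    P = XE @ G @ XE.T                      # projection onto col(X_E)
    # Inactive block: |X_{-E}'(y - X_E betahat_E)| <= lam, affine in y.
    C = XnE.T @ (np.eye(n) - P)
    d = XnE.T @ XE @ G @ z
    # Active block: the signs of betahat_E agree with z.
    A = np.vstack([C, -C, -np.diag(z) @ G @ XE.T])
    b = np.concatenate([lam * (1 - d), lam * (1 + d),
                        -lam * np.diag(z) @ G @ z])
    return A, b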
Condition on the sufficient statistic $X_1^Ty$, allowing for the effect of $X_1$.
What remains is a one-parameter exponential family $$ \frac{f_{\theta}(z)}{f_0(z)} \propto \exp(\theta z) \cdot 1_{[{\cal V}^-(z^{\perp}),{\cal V}^+(z^{\perp})]}(z) $$ with $\theta = \beta_{3|\{1,3\}}(\mu)/\sigma^2$.
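Under $\theta = 0$ this gives an exact selective p-value in closed form: the survival function of a Gaussian truncated to $[{\cal V}^-, {\cal V}^+]$. A minimal sketch (function name mine), assuming the truncation limits have already been computed from the polytope:

from scipy.stats import norm

def truncated_gaussian_pvalue(z, vminus, vplus, sigma=1.0):
    # P(Z >= z | vminus <= Z <= vplus) for Z ~ N(0, sigma^2); an exact
    # pivot under theta = 0. (Naive evaluation is fragile far in the
    # tails, where the CDF differences underflow.)
    Fm, Fp = norm.cdf(vminus / sigma), norm.cdf(vplus / sigma)
    return (Fp - norm.cdf(z / sigma)) / (Fp - Fm)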
fig_select
Conditional control implies marginal control.
We report results $\phi_{(j| \hat{E})}(y), j \in \hat{E}$.
pvalue_table
| Mutation | Naive OLS | Selective |
|----------|-----------|-----------|
| P62V  | 0.137 | 0.369 |
| P65R  | 0.000 | 0.000 |
| P67N  | 0.000 | 0.000 |
| P69i  | 0.000 | 0.000 |
| P75I  | 0.484 | 0.553 |
| P77L  | 0.270 | 0.469 |
| P83K  | 0.003 | 0.051 |
| P90I  | 0.000 | 0.014 |
| P115F | 0.014 | 0.168 |
| P151M | 0.008 | 0.081 |
| P181C | 0.000 | 0.002 |
| P184V | 0.000 | 0.000 |
| P190A | 0.063 | 0.309 |
| P215F | 0.000 | 0.016 |
| P215Y | 0.000 | 0.000 |
| P219R | 0.001 | 0.099 |
Use some portion of the data to form a model $\hat{E}(y_1)$.
Perform the usual inference for $\beta_{j|\hat{E}(y_1)}$ using the held-out data $y_2$, as in the sketch below.
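A minimal data-splitting sketch (my own illustration, reusing X, Y and lambda_theoretical from above; the 50/50 split and the sklearn / statsmodels pairing are assumptions):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X1, X2, y1, y2 = train_test_split(X, np.asarray(Y, dtype=float),
                                  test_size=0.5, random_state=0)
# Explore on (X1, y1) only...
fit = Lasso(alpha=lambda_theoretical / X1.shape[0], fit_intercept=False)
E_hat = np.nonzero(fit.fit(X1, y1).coef_)[0]
# ...then textbook OLS inference on the untouched half is valid.
print(sm.OLS(y2, X2[:, E_hat]).fit().pvalues)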
Data carving: use only part of the data for exploration, but all of it for inference.
fig_carve
Data carving gives shorter intervals than data splitting.
fig_carve
Inference can no longer be reduced to a simple univariate problem.
carve_pvalue_table
| Mutation | Data splitting | Data carving |
|----------|----------------|--------------|
| P41L  | 0.155 | 0.048 |
| P62V  | 0.555 | 0.204 |
| P65R  | 0.261 | 0.000 |
| P67N  | 0.173 | 0.000 |
| P69i  | 0.479 | 0.000 |
| P77L  | 0.025 | 0.231 |
| P83K  | 0.876 | 0.013 |
| P115F | 0.068 | 0.066 |
| P116Y | 0.220 | 0.204 |
| P181C | 0.000 | 0.000 |
| P184V | 0.000 | 0.000 |
| P215F | 0.000 | 0.009 |
| P215Y | 0.284 | 0.000 |
| P219R | 0.090 | 0.035 |
Nothing here was tied specifically to the LASSO, so long as we can describe the selection event, i.e. the event on which the selected model became interesting.
Nothing was directly tied to the model $N(\mu, \sigma^2 I)$ either.
We can carry out inference with known (or unknown) $\sigma$ under the selected model $$ \mathbb{P}_{\beta_E} \overset{D}{=} N(X_E\beta_E, \sigma^2 I). $$
Split the data $y=(y_1,y_2)$, $X=(X_1,X_2)$.
Run LASSO on $(y_1,X_1,\lambda)$.
Selection event: affine constraints on $y_1$.
Inference for $\beta_{j|E}$: condition $\mathbb{Q}_{\beta_E;(E,z_E)}$ on $X_{E\setminus\{j\}}^Ty$.
Inference requires Monte Carlo sampling from a multivariate Gaussian subject to affine constraints, as sketched below.
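A minimal hit-and-run sketch of such a sampler (entirely my own illustration, not the selective-inference software; the starting point y0 must be feasible, e.g. the observed data vector):

import numpy as np
from scipy.stats import norm

def hit_and_run(A, b, mu, sigma, y0, nsample=1000,
                rng=np.random.default_rng(0)):
    # Sample N(mu, sigma^2 I) restricted to the polytope {y : A y <= b}.
    y, chain = y0.astype(float).copy(), []
    for _ in range(nsample):
        d = rng.standard_normal(y.shape)
        d /= np.linalg.norm(d)                 # random unit direction
        Ad, slack = A @ d, b - A @ y           # slack >= 0 by feasibility
        # The chord {y + t d : lo <= t <= hi} stays inside the polytope.
        lo = np.max(slack[Ad < 0] / Ad[Ad < 0], initial=-np.inf)
        hi = np.min(slack[Ad > 0] / Ad[Ad > 0], initial=np.inf)
        # Along the chord the target is N(d'(mu - y), sigma^2) truncated
        # to [lo, hi]; sample t by inverse-CDF (fragile in far tails).
        c = d @ (mu - y)
        Fa, Fb = norm.cdf((lo - c) / sigma), norm.cdf((hi - c) / sigma)
        t = c + sigma * norm.ppf(Fa + rng.uniform() * (Fb - Fa))
        y = y + t * d
        chain.append(y)
    return np.array(chain)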
Selective intervals (Benjamini & Yekutieli; Weinstein, Fithian and Benjamini; others)
Drop-the-losers binomial designs (Sampson & Sill)
$p$-values for maxima of random fields (Schwartzman & Chen)
Effect size estimation (Benjamini & Rosenblatt; Zhong & Prentice)
An asymptotic analysis of the first step of LAR / LASSO / forward stepwise was considered in covTest.
The selective framework provides an exact test of the global null.
The LAR sequence up to $k$ steps (tracking signs on entering) can be expressed as a set of affine inequalities (including AIC stopping).
This yields an exact extension of covTest beyond the first step.
The LAR approach does not generally allow grouped (i.e. categorical) variables.
There is an extension of covTest to groups.
Coming soon: inference for the complete forward stepwise (FS) path with grouped variables.
The tuning parameter is free of $\sigma$.
The selection event is no longer convex in general, though selective inference is still possible.
Also coming soon.
The selective distribution is used for hypothesis tests and intervals.
Selective pseudo-MLE:
- Mean estimation in an orthogonal design after BH.
- Provides an estimate of $\sigma^2$ in the $\sqrt{\text{LASSO}}$.
Inference so far is very parametric.
We have some partial results: arxiv.org/abs/1501.03588.
Ryan Tibshirani and others at CMU also have something coming.
What about GLMs? No explicit results yet.
Let $(T(x))_{x \in B}$ be an image of test statistics (an SPM, i.e. statistical parametric map).
Set ${\cal Q}=B$, and $\hat{\cal Q}(T)$ to be the set of local maxima / minima.
Report a p-value for each critical point in $\hat{\cal Q}(T)$.
Long history in brain imaging (work of Worsley, Friston, Benjamini).
The goal was simultaneous inference.
Simultaneous tools can be converted to selective tools (arxiv.org/1308.3020).
Similar approach can be used for testing in PCA (arxiv.org/1410.8260).
Recent work of Schwartzman & Chen (arxiv.org/1405.1400).
Selective distributions: Slepian models / Palm distributions.
Supported by NSF-DMS 1208857 and AFOSR-113039.
Many collaborators.