We observe an IID sample of size $n$, $\{X_j\}_{j=1}^n$ IID $F$. Each observation is real-valued. We wish to estimate some parameter of the distribution $F$ that can be written as a functional of $F$, $T(F)$. Examples include the mean, $T(F) = \int x\,dF(x)$, other moments, etc.
The (unpenalized) nonparametric maximum likelihood estimator of $F$ from the data $\{X_j\}$ is just the empirical distribution $\hat{F}_n$, which assigns mass $1/n$ to each observation: $$\mathop{\mathrm{argmax}}_{\text{distributions } G} P_G\{X_j = x_j,\ j=1,\ldots,n\} = \hat{F}_n.$$
Using the general principle that the maximum likelihood estimator of a function of a parameter is that function of the maximum likelihood estimator of the parameter, we might be led to consider $T(\hat{F}_n)$ as an estimator of $T(F)$.
That is exactly what the sample mean does, as an estimator of the mean: $$T(\hat{F}_n) = \int x\,d\hat{F}_n(x) = \sum_{j=1}^n \frac{1}{n} X_j = \frac{1}{n}\sum_j X_j.$$
Similarly, the maximum likelihood estimator of $\mathrm{Var}(X) = T(F) = \int \left(x - \int x\,dF\right)^2 dF$ is the plug-in estimate $$T(\hat{F}_n) = \frac{1}{n}\sum_{j=1}^n (X_j - \bar{X})^2,$$ where $\bar{X} = \frac{1}{n}\sum_j X_j$.
What is often more interesting is to estimate a property of the sampling distribution of the estimator $T(\hat{F}_n)$, for example the variance of the estimator $T(\hat{F}_n)$. The bootstrap approximates the sampling distribution of $T(\hat{F}_n)$ by the sampling distribution of $T(\hat{F}^*_n)$, where $\hat{F}^*_n$ is the empirical distribution of a size-$n$ IID random sample drawn from $\hat{F}_n$. That is, the bootstrap approximates the sampling distribution of an estimator applied to the empirical distribution $\hat{F}_n$ of a random sample of size $n$ from a distribution $F$ by the sampling distribution of that estimator applied to the empirical distribution $\hat{F}^*_n$ of a random sample of size $n$ from a particular realization $\hat{F}_n$ of the empirical distribution of a sample of size $n$ from $F$.
When $T$ is the mean $\int x\,dF$, so $T(\hat{F}_n)$ is the sample mean, we could obtain the variance of the distribution of $T(\hat{F}^*_n)$ analytically: Let $\{X^*_j\}_{j=1}^n$ be an IID sample of size $n$ from $\hat{F}_n$. Then $$\mathrm{Var}_{\hat{F}_n}\left(\frac{1}{n}\sum_{j=1}^n X^*_j\right) = \frac{1}{n^2}\sum_{j=1}^n (X_j - \bar{X})^2,$$ which is $1/n$ times the plug-in estimate of $\mathrm{Var}(X)$.
The idea of the bootstrap is to approximate the distribution (under $F$) of an estimator $T(\hat{F}_n)$ by the distribution of the estimator under $\hat{F}_n$, and to approximate that distribution by using a computer to take a large number of pseudo-random samples of size $n$ from $\hat{F}_n$.
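To make this concrete, here is a minimal sketch in Python (the sample, the choice of $T$ as the median, and the number of replicates are illustrative choices, not prescribed by these notes) that approximates the sampling distribution of $T(\hat{F}_n)$ by resampling from $\hat{F}_n$:

```python
# A minimal sketch of the basic bootstrap: approximate the sampling
# distribution of T(F_hat_n) -- here T is the median -- by resampling
# from the empirical distribution (sampling {X_j} with replacement).
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=100)          # the observed sample {X_j}, n = 100
n = x.size
B = 10_000                           # number of bootstrap replicates

# Each row is a pseudo-random sample of size n drawn IID from F_hat_n.
boot_samples = rng.choice(x, size=(B, n), replace=True)
boot_medians = np.median(boot_samples, axis=1)

# The empirical distribution of the B replicates approximates the sampling
# distribution of the sample median under F_hat_n.
print("bootstrap estimate of Var(median):", boot_medians.var(ddof=1))
```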
This basic idea is quite flexible, and can be applied to a wide variety of
testing and estimation problems, including finding confidence sets for
functional parameters.
(It is not a panacea, though: we will see later how delicate it can be.)
It is related to some other "resampling" schemes in which
one re-weights the data to form other distributions.
Before doing more theory with the bootstrap, let's examine the jackknife.
The idea behind the jackknife, which is originally due to Quenouille and Tukey, is to form from the data $\{X_j\}_{j=1}^n$, $n$ sets of $n-1$ data, leaving each datum out in turn. The "distribution" of $T$ applied to these $n$ sets is used to approximate the distribution of $T(\hat{F}_n)$. Let $\hat{F}_{(i)}$ denote the empirical distribution of the data set with the $i$th value deleted; $T_{(i)} = T(\hat{F}_{(i)})$ is the corresponding estimate of $T(F)$. An estimate of the expected value of $T(\hat{F}_n)$ is $$T_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n T(\hat{F}_{(i)}).$$
Applying the jackknife estimate of bias to correct the plug-in estimate of variance reproduces the formula for the sample variance (with $1/(n-1)$) from the formula with $1/n$: define $\bar{X} = \frac{1}{n}\sum_{j=1}^n X_j$; working through the algebra shows that the bias-corrected estimate is $\frac{1}{n-1}\sum_{j=1}^n (X_j - \bar{X})^2$.
The jackknife also can be used to estimate other properties of an estimator, such as its variance. The jackknife estimate of the variance of $T(\hat{F}_n)$ is $$\widehat{\mathrm{Var}}(T) = \frac{n-1}{n}\sum_{j=1}^n \left(T_{(j)} - T_{(\cdot)}\right)^2.$$
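For concreteness, a short sketch implementing this formula (the choice of $T$ as the median and the sample are illustrative, not from the notes):

```python
# A minimal sketch of the jackknife variance estimate
# Var_hat(T) = ((n-1)/n) * sum_j (T_(j) - T_(.))^2,
# with T taken to be the sample median for illustration.
import numpy as np

def jackknife_variance(x, T):
    """Jackknife estimate of the variance of the estimator T applied to x."""
    n = x.size
    # T_(j): the estimator applied to the sample with the j-th datum deleted.
    T_leave_one_out = np.array([T(np.delete(x, j)) for j in range(n)])
    T_dot = T_leave_one_out.mean()
    return (n - 1) / n * np.sum((T_leave_one_out - T_dot) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
print("jackknife variance of the median:", jackknife_variance(x, np.median))
```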
It is convenient to think of distributions on data sets to compare the jackknife and the bootstrap. We shall follow the notation in Efron (1982). We condition on $\{X_i = x_i\}$ and treat the data as fixed in what follows. Let $S_n$ be the $n$-dimensional simplex $$S_n \equiv \left\{P^* = (P^*_i)_{i=1}^n \in \Re^n : P^*_i \ge 0 \mbox{ and } \sum_{i=1}^n P^*_i = 1\right\}.$$
The bootstrap estimate of variance tends to be better than the jackknife estimate of variance for nonlinear estimators because of the distance between the empirical measure and the resampled measures: $$\|P^* - P^0\| = O_P(n^{-1/2}),$$ while the jackknife measures are only $O(n^{-1})$ from $P^0 = (1/n, \ldots, 1/n)$, the weight vector corresponding to $\hat{F}_n$.
To see that the cdfs of the jackknife samples are $O(n^{-1})$ from the empirical cdf $\hat{F}_n$, note that for univariate real-valued data, the difference between $\hat{F}_n$ and the cdf of the jackknife data set that leaves out the $j$th ranked observation $X_{(j)}$ is largest either at $X_{(j-1)}$ or at $X_{(j)}$. For $j=1$ or $j=n$, the jackknife samples that omit the smallest or largest observation, the K-S distance between the jackknife cdf and the empirical cdf is exactly $1/n$. Consider the jackknife cdf $\hat{F}_{n,(j)}$, the cdf of the sample without $X_{(j)}$, $1 < j < n$. Then $\hat{F}_{n,(j)}(X_{(j)}) = (j-1)/(n-1)$ while $\hat{F}_n(X_{(j)}) = j/n$, a difference of $(n-j)/(n(n-1)) \le 1/n$; at $X_{(j-1)}$, the difference is $(j-1)/(n-1) - (j-1)/n = (j-1)/(n(n-1)) \le 1/n$. So the K-S distance is at most $1/n$.
The neighborhood that the bootstrap samples is larger, and is probabilistically of the right size to correspond to the uncertainty of the empirical distribution function as an estimator of the underlying distribution function $F$ (recall the Dvoretzky-Kiefer-Wolfowitz inequality: a K-S ball of radius $O(n^{-1/2})$ has fixed coverage probability). For linear functionals, this does not matter, but for strongly nonlinear functionals, the bootstrap estimate of the variability tends to be more accurate than the jackknife estimate of the variability.
Let us have a quick look at the distribution of the K-S distance between a continuous distribution and the empirical distribution of a sample $\{X_j\}_{j=1}^n$ IID $F$. The discussion follows Feller (1971, pp. 36ff). First we show that for continuous distributions $F$, the distribution of $\|\hat{F}_n - F\|_{KS}$ does not depend on $F$. To see this, note that $F(X_j) \sim U[0,1]$: Let $x_t \equiv \inf\{x \in \Re : F(x) = t\}$. Continuity of $F$ ensures that $x_t$ exists for all $t \in [0,1]$. Now the event $\{X_j \le x_t\}$ is equivalent to the event $\{F(X_j) \le F(x_t)\}$ up to a set of $F$-measure zero. Thus $$t = P_F\{X_j \le x_t\} = P_F\{F(X_j) \le F(x_t)\} = P_F\{F(X_j) \le t\}, \quad t \in [0,1],$$ so $F(X_j) \sim U[0,1]$.
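A quick simulation sketch of this distribution-free property (the particular distributions, sample size, and use of scipy's `kstest` are illustrative assumptions, not part of the notes): the simulated K-S distances for exponential and normal samples of the same size should have essentially the same distribution.

```python
# Simulation check that ||F_hat_n - F||_KS has the same distribution for any
# continuous F: compare simulated K-S distances for exponential and normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 5_000

def ks_distances(sampler, cdf):
    """Simulate the K-S distance between F_hat_n and F for `reps` samples."""
    return np.array([stats.kstest(sampler(n), cdf).statistic for _ in range(reps)])

d_exp = ks_distances(lambda n: rng.exponential(size=n), stats.expon.cdf)
d_norm = ks_distances(lambda n: rng.normal(size=n), stats.norm.cdf)

# The quantiles of the two simulated distributions should agree closely.
print(np.quantile(d_exp, [0.5, 0.9, 0.95]))
print(np.quantile(d_norm, [0.5, 0.9, 0.95]))
```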
Let $U$ be an index set (not necessarily countable). Recall that a collection $\{I_u\}_{u \in U}$ of confidence intervals for parameters $\{\theta_u\}_{u \in U}$ has simultaneous $1-\alpha$ coverage probability if $$P_\theta\left\{\cap_{u \in U} \{I_u \ni \theta_u\}\right\} \ge 1-\alpha.$$
Many of the procedures for forming joint confidence sets we have seen depend on pivots, which are functions of the data and the parameter(s) whose distribution is known (even though the parameter and the parent distribution are not). For example, the Scheffé method relies on the fact that (for samples from a multivariate Gaussian with independent components) the sum of squared differences between the data and the corresponding parameters, divided by the variance estimate, has an $F$ distribution, regardless of the parameter values. Similarly, Tukey's maximum modulus method relies on the fact that (again, for independent Gaussian data) the distribution of the maximum of the studentized absolute differences between the data and the corresponding parameters does not depend on the parameters. Both of those examples are parametric, but the idea is more general: the procedure we looked at for finding bounds on the density function subject to shape restrictions relied only on the fact that there are uniform bounds on the probability that the K-S distance between the empirical distribution and the true distribution exceeds some threshold.
Even in cases where there is no known exact pivot, one can sometimes show that some function of the data and parameters is asymptotically a pivot. Working out the distributions of the functions involved is not typically straightforward, and a general method of constructing (possibly simultaneous) confidence sets would be nice.
Efron gives several methods of basing confidence sets on the bootstrap. Those methods are substantially improved (in theory, and in my experience) by Beran's pre-pivoting approach, which leads to iterating the bootstrap.
Let $X_n$ denote a sample of size $n$ from $F$. Let $R_n(\theta) = R_n(X_n, \theta)$ have cdf $H_n$, and let $H_n^{-1}(\alpha)$ be the largest $\alpha$ quantile of the distribution of $R_n$. Then $$\{\gamma \in \Theta : R_n(\gamma) \le H_n^{-1}(1-\alpha)\}$$ is a confidence set for $\theta$ with coverage probability at least $1-\alpha$.
The idea of the percentile method is to use the empirical bootstrap percentiles of some quantity to approximate the true percentiles. Consider constructing a confidence interval for a single real parameter $\theta = T(F)$. We will estimate $\theta$ by $\hat\theta = T(\hat{F}_n)$. We would like to know the distribution function $H_n = H_n(\cdot, F)$ of $D_n(\theta) = T(\hat{F}_n) - \theta$. Suppose we did. Let $H_n^{-1}(\cdot) = H_n^{-1}(\cdot, F)$ be the inverse cdf of $D_n$. Then $$P_F\left\{H_n^{-1}(\alpha/2) \le T(\hat{F}_n) - \theta \le H_n^{-1}(1-\alpha/2)\right\} = 1-\alpha,$$ so $\left[T(\hat{F}_n) - H_n^{-1}(1-\alpha/2),\ T(\hat{F}_n) - H_n^{-1}(\alpha/2)\right]$ would be a $1-\alpha$ confidence interval for $\theta$.
The idea behind the percentile method is to approximate $H_n(\cdot, F)$ by $\hat{H}_n = H_n(\cdot, \hat{F}_n)$, the distribution of $D_n$ under resampling from $\hat{F}_n$ rather than $F$. (This tends not to be a very accurate approximation for constructing confidence sets, as we shall see.)
An alternative approach is to take $D_n(\theta) = |T(\hat{F}_n) - \theta|$; then $$P_F\left\{|T(\hat{F}_n) - \theta| \le H_n^{-1}(1-\alpha)\right\} = 1-\alpha,$$ so $\{\gamma : |T(\hat{F}_n) - \gamma| \le H_n^{-1}(1-\alpha)\}$ would be a $1-\alpha$ confidence interval for $\theta$.
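Here is a minimal sketch of the percentile construction for the root $D_n = T(\hat{F}_n) - \theta$ (the functional, sample, and number of replicates are illustrative; the quantiles of $D_n$ under $\hat{F}_n$ are approximated by Monte Carlo):

```python
# Percentile-method sketch for the root D_n = T(F_hat_n) - theta: approximate
# the quantiles of D_n under F by its quantiles under resampling from F_hat_n,
# then invert to get an approximate 1 - alpha interval for theta.
import numpy as np

def percentile_ci(x, T, alpha=0.05, B=10_000, rng=None):
    """Approximate 1-alpha confidence interval for theta = T(F) based on D_n."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    theta_hat = T(x)
    # Bootstrap replicates of D_n = T(F*_n) - T(F_hat_n).
    D = np.array([T(rng.choice(x, size=n, replace=True)) - theta_hat
                  for _ in range(B)])
    lo, hi = np.quantile(D, [alpha / 2, 1 - alpha / 2])
    # P{lo <= T(F_hat_n) - theta <= hi} ~ 1 - alpha  =>  invert for theta.
    return theta_hat - hi, theta_hat - lo

rng = np.random.default_rng(0)
x = rng.lognormal(size=100)
print(percentile_ci(x, np.mean, rng=rng))
```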
Beran proves a variety of results under the following condition:
Condition 1. (Beran, 1987)
For any sequence $\{F_n\}$ that converges to $F$ in a metric $d$ on cdfs, $H_n(\cdot, F_n)$ converges weakly to a continuous cdf $H = H(\cdot, F)$ that depends only on $F$, and not on the sequence $\{F_n\}$.
Suppose Condition 1 holds. Then because $\hat{F}_n$ is consistent for $F$, the estimate $\hat{H}_n$ converges in probability to $H$ in sup norm; moreover, the distribution of $\hat{H}_n(R_n(\theta))$ converges to $U[0,1]$.
Instead of $D_n$, consider $R_n(\theta) = |T(\hat{F}_n) - \theta|$ or some other (approximate) pivot. Let $\hat{H}_n = H_n(\cdot, \hat{F}_n)$ be the bootstrap estimate of the cdf of $R_n$. The set $$B_n = \{\gamma \in \Theta : \hat{H}_n(R_n(\gamma)) \le 1-\alpha\} = \{\gamma \in \Theta : R_n(\gamma) \le \hat{H}_n^{-1}(1-\alpha)\}$$ is an approximate $1-\alpha$ confidence set for $\theta$; under Condition 1, its coverage probability converges to $1-\alpha$.
The level of this set for finite samples tends to be inaccurate. It can be improved in the following way, due to Beran.
The original root, $R_n(\theta)$, whose limiting distribution depends on $F$, was transformed into a new root $R_{n,1}(\theta) = \hat{H}_n(R_n(\theta))$, whose limiting distribution is $U[0,1]$. The distribution of $R_{n,1}$ depends less strongly on $F$ than does that of $R_n$; Beran calls mapping $R_n$ into $R_{n,1}$ prepivoting. The confidence set acts as if the distribution of $R_{n,1}$ really is uniform, which is not generally true. One could instead treat $R_{n,1}$ itself as a root, and pivot to reduce the dependence on $F$.
Let $H_{n,1} = H_{n,1}(\cdot, F)$ be the cdf of the new root $R_{n,1}(\theta)$, estimate $H_{n,1}$ by $\hat{H}_{n,1} = H_{n,1}(\cdot, \hat{F}_n)$, and define $$B_{n,1} = \{\gamma \in \Theta : \hat{H}_{n,1}(R_{n,1}(\gamma)) \le 1-\alpha\} = \{\gamma \in \Theta : \hat{H}_{n,1}(\hat{H}_n(R_n(\gamma))) \le 1-\alpha\} = \{\gamma \in \Theta : R_n(\gamma) \le \hat{H}_n^{-1}(\hat{H}_{n,1}^{-1}(1-\alpha))\}.$$
I follow Beran's (1987) notation (mostly).
Let $x_n$ denote the "real" sample of size $n$. Let $x^*_n$ be a bootstrap sample of size $n$ drawn from the empirical cdf $\hat{F}_n$. The components of $x^*_n$ are conditionally IID given $x_n$. Let $\hat{F}^*_n$ denote the "empirical" cdf of the bootstrap sample $x^*_n$. Let $x^{**}_n$ denote a sample of size $n$ drawn from $\hat{F}^*_n$; the components of $x^{**}_n$ are conditionally IID given $x_n$ and $x^*_n$. Let $\hat\theta_n = T(\hat{F}_n)$ and $\hat\theta^*_n = T(\hat{F}^*_n)$. Then $$H_n(s, F) = P_F\{R_n(x_n, \theta) \le s\},$$ and $\hat{H}_n = H_n(\cdot, \hat{F}_n)$ is its bootstrap estimate.
The Monte Carlo approach is as follows:

1. Draw $M$ bootstrap samples $\{y^*_k\}_{k=1}^M$, each of size $n$, IID from $\hat{F}_n$, and compute the root $R_n(y^*_k, \hat\theta_n)$ for each. The ecdf of $\{R_n(y^*_k, \hat\theta_n)\}_{k=1}^M$ is an approximation to $\hat{H}_n$.

2. From each first-generation sample $y^*_k$, draw $N$ size-$n$ bootstrap samples $\{y^{**}_{k,\ell}\}_{\ell=1}^N$ from the ecdf of $y^*_k$. Let $\hat\theta^*_{n,k} = T(\hat{F}^*_{n,k})$. Let $Z_k$ be the fraction of the values $\{R_n(y^{**}_{k,\ell}, \hat\theta^*_{n,k})\}_{\ell=1}^N$ that are at most $R_n(y^*_k, \hat\theta_n)$.

3. The ecdf of $\{Z_k\}_{k=1}^M$ is an approximation to $\hat{H}_{n,1}$. The Monte Carlo approximations to $B_n$ and $B_{n,1}$ converge to those sets (in probability) as $M$ and $N$ grow.
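A sketch of these steps in code (taking the root to be $R_n = |\bar{X} - \theta|$ and particular values of $M$ and $N$; these choices are illustrative, not Beran's):

```python
# Monte Carlo prepivoting sketch: first-generation samples approximate H_hat_n;
# second-generation samples give the Z_k, whose ecdf approximates H_hat_{n,1}.
import numpy as np

def prepivot_ci(x, alpha=0.05, M=1000, N=1000, rng=None):
    """Prepivoted interval for the mean, using the root R_n = |mean - theta|."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    theta_hat = x.mean()

    R1 = np.empty(M)      # R_n(y*_k, theta_hat), first-generation roots
    Z = np.empty(M)       # Z_k = fraction of second-generation roots <= R1[k]
    for k in range(M):
        y_star = rng.choice(x, size=n, replace=True)
        theta_star = y_star.mean()
        R1[k] = abs(theta_star - theta_hat)
        # Second generation: N samples of size n from the ecdf of y*_k.
        y_star_star = rng.choice(y_star, size=(N, n), replace=True)
        R2 = np.abs(y_star_star.mean(axis=1) - theta_star)
        Z[k] = np.mean(R2 <= R1[k])

    # H_hat_{n,1}^{-1}(1 - alpha) from the ecdf of the Z_k, then H_hat_n^{-1}
    # of that value from the ecdf of the first-generation roots.
    z_cut = np.quantile(Z, 1 - alpha)
    r_cut = np.quantile(R1, z_cut)
    return theta_hat - r_cut, theta_hat + r_cut

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)
print(prepivot_ci(x, rng=rng))
```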
Note that this approach is extremely general.
Beran gives examples for confidence sets for directions, etc.
The pivot can in principle be
a function of any number of parameters, which can yield simultaneous confidence
sets for parameters of any dimension.
There are other ways of iterating the bootstrap to improve the level accuracy of bootstrap confidence sets. Efron suggests targeting a different nominal coverage probability, chosen so that the coverage attained in the second-generation samples is the desired coverage probability. That is, if one wants a 95% confidence set, one tries different percentiles until, in resampling from the sample, the attained coverage probability is 95%. Typically, the percentile one uses in the second generation will be higher than 95%. Here is a sketch of the Monte-Carlo approach:
1. Set a value of $\alpha^*$ (initially taking $\alpha^* = \alpha$ is reasonable).

2. From the sample, draw $M$ size-$n$ samples that are each IID $\hat{F}_n$. Denote the ecdfs of the samples by $\{\hat{F}^*_{n,j}\}$.

3. From each $\hat{F}^*_{n,j}$, construct a bootstrap percentile confidence interval with nominal level $1-\alpha^*$ for $T(\hat{F}_n)$. This gives $M$ confidence intervals; a fraction $1-\alpha'$ of them will cover $T(\hat{F}_n)$. Typically, $1-\alpha' \ne 1-\alpha$.

4. If $1-\alpha' > 1-\alpha$, increase $\alpha^*$ and return to the previous step; if $1-\alpha' < 1-\alpha$, decrease $\alpha^*$ and return to the previous step. If $1-\alpha' \approx 1-\alpha$ to the desired level of precision, go to the next step.

5. From the original sample, construct a bootstrap quantile confidence interval that has nominal $1-\alpha^*$ coverage probability.
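A rough sketch of this calibration (the bisection search over $\alpha^*$, the Monte Carlo sizes, and the use of the mean are illustrative choices, not prescribed above):

```python
# Efron-style calibration sketch: adjust the nominal level 1 - alpha* until
# percentile intervals built from first-generation bootstrap samples cover
# T(F_hat_n) a fraction 1 - alpha of the time.
import numpy as np

def percentile_interval(y, T, alpha_star, B, rng):
    """Simple bootstrap percentile interval for T, from the sample y."""
    reps = np.array([T(rng.choice(y, size=y.size, replace=True)) for _ in range(B)])
    return np.quantile(reps, [alpha_star / 2, 1 - alpha_star / 2])

def calibrated_alpha(x, T, alpha=0.05, M=200, B=200, tol=0.01, rng=None):
    """Find alpha* so that the attained coverage of T(F_hat_n) is about 1 - alpha."""
    rng = np.random.default_rng() if rng is None else rng
    theta_hat = T(x)
    lo, hi = 1e-4, 0.5                      # bisection bounds for alpha*
    for _ in range(20):
        alpha_star = 0.5 * (lo + hi)
        covered = 0
        for _ in range(M):
            y = rng.choice(x, size=x.size, replace=True)   # first generation
            l, u = percentile_interval(y, T, alpha_star, B, rng)
            covered += (l <= theta_hat <= u)
        coverage = covered / M
        if abs(coverage - (1 - alpha)) < tol:
            break
        if coverage > 1 - alpha:
            lo = alpha_star      # over-covering: increase alpha*
        else:
            hi = alpha_star      # under-covering: decrease alpha*
    return alpha_star

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)
a_star = calibrated_alpha(x, np.mean, rng=rng)
# Final interval: a percentile interval from the original sample at level 1 - a_star.
print(a_star, percentile_interval(x, np.mean, a_star, 2000, rng))
```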
An alternative approach to increasing the coverage probability by iterating the bootstrap is to use the same root, but to use a quantile (among second-generation bootstrap samples) of its $1-\alpha$ quantile rather than the quantile observed in the first generation. The heuristic justification is that we would ideally like to know the $1-\alpha$ quantile of the pivot under sampling from the true distribution $F$. We don't. The percentile method estimates the $1-\alpha$ quantile of the pivot under $F$ by the $1-\alpha$ quantile of the pivot under $\hat{F}_n$, but that estimate is subject to sampling variability. To try to be conservative, we could use the bootstrap a second time to find an (approximate) upper $1-\alpha^*$ confidence bound for the $1-\alpha$ quantile of the pivot.
Here is a sketch of the Monte-Carlo approach:
1. From the sample, compute the estimate $T(\hat{F}_n)$ of the parameter.
2. From the sample, draw $M$ size-$n$ samples that are each IID $\hat{F}_n$. Denote the ecdfs of the samples by $\{\hat{F}^*_{n,j}\}$.
3. From each $\hat{F}^*_{n,j}$, draw $N$ size-$n$ samples that are IID $\hat{F}^*_{n,j}$. Find the $1-\alpha$ quantile of the pivot among those second-generation samples. This gives $M$ values of the $1-\alpha$ quantile. Let $c$ be the $1-\alpha^*$ quantile of the $M$ $1-\alpha$ quantiles.
4. Form the confidence interval from the original sample by taking $c$ to be the estimate of the $1-\alpha$ quantile of the pivot.
In a variety of simulations, this tends to be more conservative than Beran's method, and more often attains at least the nominal coverage probability.
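A sketch of this scheme (again with the root $|\bar{X} - \theta|$ and illustrative values of $M$, $N$, and $\alpha^*$, which are not prescribed above):

```python
# Quantile-of-quantiles sketch: estimate the 1 - alpha quantile of the pivot in
# each first-generation sample by a second-generation bootstrap, then take the
# 1 - alpha* quantile c of those M estimates as a conservative cutoff.
import numpy as np

def quantile_of_quantile_ci(x, alpha=0.05, alpha_star=0.05, M=500, N=500, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    theta_hat = x.mean()
    q = np.empty(M)                        # M estimates of the 1 - alpha quantile
    for j in range(M):
        y = rng.choice(x, size=n, replace=True)            # first generation
        yy = rng.choice(y, size=(N, n), replace=True)      # second generation
        pivot = np.abs(yy.mean(axis=1) - y.mean())
        q[j] = np.quantile(pivot, 1 - alpha)
    c = np.quantile(q, 1 - alpha_star)     # upper 1 - alpha* point of the quantiles
    return theta_hat - c, theta_hat + c

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)
print(quantile_of_quantile_ci(x, rng=rng))
```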
Consider forming a two-sided 95% confidence interval for the mean $\theta$ of a distribution $F$ based on the sample mean, using $|\bar{X} - \theta|$ as a pivot.
Compare the three approaches described above (Beran's pre-pivoting, Efron's calibrated target percentile, and the percentile-of-percentile) for samples drawn from the following distributions: lognormal, Cauchy, mixtures of normals with the same mean but quite different variances (try different mixture coefficients), and mixtures of normals with different means and different variances (the means should differ enough that the result is bimodal). Use 1000 first-generation bootstrap samples. Which methods come close to the nominal coverage probability? Which are conservative? Try to provide some intuition about the circumstances under which each method fails, and the circumstances under which each method would be expected to perform well.
Warning: You might need to be clever in how you implement this to make it
a feasible calculation.
If you try to store all the intermediate results,
the memory requirement is huge. On the other hand, if you use too many loops, the
execution time will be long.
Beran (1995) discusses finding a confidence region for the mean vector $\theta \in \Re^q$, $q \ge 3$, from data $X \sim N(\theta, I)$. This is an example illustrating that what one bootstraps is important, and that naive plug-in bootstrapping doesn't always work.
The sets are spheres centered at the shrinkage (James-Stein) estimate $$\hat\theta_S = \left(1 - \frac{q-2}{\|X\|^2}\right) X,$$ with radius $\hat{d}$ calibrated so that the sphere has (approximately) the desired coverage probability.
This problem is parametric: $F$ is known up to the $q$-dimensional mean vector $\theta$. We can thus use a "parametric bootstrap" to generate data that are approximately from $F$, instead of drawing directly from $\hat{F}_n$: if we have an estimate $\hat\theta$ of $\theta$, we can generate artificial data distributed as $N(\hat\theta, I)$. If $\hat\theta$ is a good estimator, the artificial data will be distributed nearly as $F$. The issue is in what sense $\hat\theta$ needs to be good.
Beran shows (somewhat surprisingly) that resampling from $N(\hat\theta_S, I)$ or from $N(X, I)$ does not tend to work well in calibrating $\hat{d}$. The crucial thing in using the bootstrap to calibrate the radius of the confidence sphere seems to be to estimate $\|\theta\|$ well.
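To illustrate the machinery of parametric-bootstrap calibration (a sketch only: the function names are made up, and the centering plug-in used to generate the artificial data is left as an argument, since, as just noted, the choice matters; Beran's preferred estimator is not reproduced here):

```python
# Sketch of parametric-bootstrap calibration of the radius d_hat of a confidence
# sphere centered at the James-Stein estimate. `center` is a generic plug-in for
# theta used to generate artificial data; per the discussion above, using X or
# theta_hat_S here does not calibrate well.
import numpy as np

def james_stein(x):
    q = x.size
    return (1 - (q - 2) / np.dot(x, x)) * x

def calibrated_radius(center, alpha=0.05, B=5000, rng=None):
    """1-alpha quantile of ||theta_hat_S(X*) - center|| for X* ~ N(center, I)."""
    rng = np.random.default_rng() if rng is None else rng
    q = center.size
    x_star = center + rng.normal(size=(B, q))              # parametric bootstrap
    dist = np.array([np.linalg.norm(james_stein(row) - center) for row in x_star])
    return np.quantile(dist, 1 - alpha)

rng = np.random.default_rng(0)
q, theta = 20, np.full(20, 1.0)
x = theta + rng.normal(size=q)
d_hat = calibrated_radius(x, rng=rng)       # naive plug-in: center = X
print("radius:", d_hat,
      "covers theta?", np.linalg.norm(james_stein(x) - theta) <= d_hat)
```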
Definition. The geometrical risk of a confidence set $C$ for the parameter $\theta \in \Re^q$ is $$G_q(C, \theta) \equiv q^{-1/2} E_\theta \sup_{\gamma \in C} \|\gamma - \theta\|.$$
For confidence spheres $C = C(\hat\theta, \hat{d}) = \{\gamma \in \Re^q : \|\gamma - \hat\theta\| \le \hat{d}\}$, the supremum is attained at distance $\|\hat\theta - \theta\| + \hat{d}$ from $\theta$, so $$G_q(C, \theta) = q^{-1/2}\left(E_\theta\|\hat\theta - \theta\| + E_\theta \hat{d}\right).$$
Lemma
(Beran, 1995, Lemma 4.1). Define $$W_q(X, \gamma) \equiv \left(q^{-1/2}\left(\|X - \gamma\|^2 - q\right),\ q^{-1/2}\gamma'(X - \gamma)\right).$$ Suppose $\|\gamma_q\|^2/q \to a$ as $q \to \infty$. Then under $P_{\gamma_q}$, the first component of $W_q(X, \gamma_q)$ converges in distribution to $N(0, 2)$ and the second to $N(0, a)$.
Proof. Under $P_{\gamma_q}$, the distribution of $X - \gamma$ is rotationally invariant, so the distributions of the components of $W_q$ depend on $\gamma$ only through $\|\gamma\|$. Wlog, we may take each component of $\gamma_q$ to be $q^{-1/2}\|\gamma_q\|$. The distribution of the first component of $W_q$ is then that of the sum of squares of $q$ IID standard normals (a chi-square rv with $q$ df), minus the expected value of that sum, times $q^{-1/2}$. The standard deviation of a chi-square random variable with $q$ df is $\sqrt{2q}$, so the first component of $W_q$ is $\sqrt{2}$ times a standardized variable whose distribution is asymptotically (in $q$) normal. The second component of $W_q$ is a linear combination of IID standard normals; by symmetry (as argued above), its distribution is that of $$q^{-1/2}\sum_{j=1}^q q^{-1/2}\|\gamma_q\| Z_j = \frac{\|\gamma_q\|}{q}\sum_{j=1}^q Z_j \to a^{1/2} Z_2$$ in distribution, where the $\{Z_j\}$ are IID standard normal and $Z_2$ is standard normal.
Recall that the squared-error risk (normalized by $q^{-1}$) of the James-Stein estimator is $1 - q^{-1}E_\theta\{(q-2)^2/\|X\|^2\} < 1$. The difference between the loss of $\hat\theta_S$ and an unbiased estimate of its risk is $$D_q(X, \theta) = q^{-1/2}\left\{\|\hat\theta_S - \theta\|^2 - \left[q - (q-2)^2/\|X\|^2\right]\right\}.$$
Theorem.
(Beran, 1995, Theorem 3.1) Suppose $\{\gamma_q \in \Re^q\}$ satisfies the condition of the Lemma ($\|\gamma_q\|^2/q \to a$). Then $$H_q(\|\hat\theta_{CL}\|^2/q) \stackrel{W}{\to} N(0, \sigma^2(a)),$$ where $\sigma^2(a)$ depends only on $a$.
It follows that to estimate $H_q$ by the bootstrap consistently, one should use $$\hat{H}_B = H_q(\|\hat\theta_{CL}\|^2/q).$$
Proof. The Lemma implies that under the conditions of the theorem, $\|\hat\theta_{CL}\|^2/q \to a$, $\|X\|^2/q \to 1 + a$, and $\|\hat\theta_S\|^2/q \to a^2/(1+a)$.
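A quick numerical check of two of these limits (a simulation sketch; the values of $q$ and $a$ are arbitrary, and this is not part of Beran's proof):

```python
# Sanity check: for X ~ N(theta, I) with ||theta||^2/q = a,
# ||X||^2/q ~ 1 + a and ||theta_hat_S||^2/q ~ a^2/(1 + a) for large q.
import numpy as np

rng = np.random.default_rng(0)
q, a = 10_000, 2.0
theta = np.full(q, np.sqrt(a))          # ||theta||^2 / q = a
x = theta + rng.normal(size=q)
theta_S = (1 - (q - 2) / np.dot(x, x)) * x

print("||X||^2/q       =", np.dot(x, x) / q, " (expect", 1 + a, ")")
print("||theta_S||^2/q =", np.dot(theta_S, theta_S) / q, " (expect", a**2 / (1 + a), ")")
```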