We observe an IID sample of size $n$, $\{X_j\}_{j=1}^n$ IID $F$. Each observation is real-valued. We wish to estimate some parameter of the distribution $F$ that can be written as a functional of $F$, $T(F)$. Examples include the mean, $T(F) = \int x\,dF(x)$, other moments, etc.
The (unpenalized) nonparametric maximum likelihood estimator of $F$ from the data $\{X_j\}$ is just the empirical distribution $\hat{F}_n$, which assigns mass $1/n$ to each observation: $$\mathop{\mathrm{argmax}}_{\text{distributions } G} P_G\{X_j = x_j,\ j=1,\ldots,n\} = \hat{F}_n.$$
Using the general principle that the maximum likelihood estimator of a function of a parameter is that function of the maximum likelihood estimator of the parameter, we might be led to consider $T(\hat{F}_n)$ as an estimator of $T(F)$.
That is exactly what the sample mean does, as an estimator of the mean: $$T(\hat{F}_n) = \int x\,d\hat{F}_n(x) = \sum_{j=1}^n \frac{1}{n} X_j = \frac{1}{n}\sum_j X_j.$$
Similarly, the maximum likelihood estimator of $\mathrm{Var}(X) = T(F) = \int \left(x - \int x\,dF\right)^2 dF$ is the plug-in estimate $$T(\hat{F}_n) = \frac{1}{n}\sum_{j=1}^n (X_j - \bar{X})^2,$$ where $\bar{X} = \frac{1}{n}\sum_j X_j$.
What is often more interesting is to estimate a property of the sampling distribution of the estimator $T(\hat{F}_n)$, for example the variance of the estimator $T(\hat{F}_n)$. The bootstrap approximates the sampling distribution of $T(\hat{F}_n)$ by the sampling distribution of $T(\hat{F}^*_n)$, where $\hat{F}^*_n$ is the empirical distribution of a size-$n$ IID random sample drawn from $\hat{F}_n$. That is, the bootstrap approximates the sampling distribution of an estimator applied to the empirical distribution $\hat{F}_n$ of a random sample of size $n$ from a distribution $F$ by the sampling distribution of that estimator applied to the empirical distribution $\hat{F}^*_n$ of a random sample of size $n$ from a particular realization $\hat{F}_n$ of the empirical distribution of a sample of size $n$ from $F$.
When $T$ is the mean $\int x\,dF$, so $T(\hat{F}_n)$ is the sample mean, we could obtain the variance of the distribution of $T(\hat{F}^*_n)$ analytically: Let $\{X^*_j\}_{j=1}^n$ be an IID sample of size $n$ from $\hat{F}_n$. Then $$\mathrm{Var}_{\hat{F}_n}\left(\frac{1}{n}\sum_{j=1}^n X^*_j\right) = \frac{1}{n^2}\sum_{j=1}^n (X_j - \bar{X})^2,$$ which is $1/n$ times the plug-in estimate of $\mathrm{Var}(X)$.
The idea of the bootstrap is to approximate the distribution (under $F$) of an estimator $T(\hat{F}_n)$ by the distribution of the estimator under $\hat{F}_n$, and to approximate that distribution by using a computer to take a large number of pseudo-random samples of size $n$ from $\hat{F}_n$.
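To make this concrete, here is a minimal sketch in Python (the sample, the choice of $T$ as the median, and the number of replicates are illustrative choices, not prescribed by these notes) that approximates the sampling distribution of $T(\hat{F}_n)$ by resampling from $\hat{F}_n$:

```python
# A minimal sketch of the basic bootstrap: approximate the sampling
# distribution of T(F_hat_n) -- here T is the median -- by resampling
# from the empirical distribution (sampling {X_j} with replacement).
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=100)          # the observed sample {X_j}, n = 100
n = x.size
B = 10_000                           # number of bootstrap replicates

# Each row is a pseudo-random sample of size n drawn IID from F_hat_n.
boot_samples = rng.choice(x, size=(B, n), replace=True)
boot_medians = np.median(boot_samples, axis=1)

# The empirical distribution of the B replicates approximates the sampling
# distribution of the sample median under F_hat_n.
print("bootstrap estimate of Var(median):", boot_medians.var(ddof=1))
```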
This basic idea is quite flexible, and can be applied to a wide variety of
testing and estimation problems, including finding confidence sets for
functional parameters.
(It is not a panacea, though: we will see later how delicate it can be.)
It is related to some other "resampling" schemes in which
one re-weights the data to form other distributions.
Before doing more theory with the bootstrap, let's examine the jackknife.
The idea behind the jackknife, which is originally due to Quenouille and Tukey, is to form from the data $\{X_j\}_{j=1}^n$, $n$ sets of $n-1$ data, leaving each datum out in turn. The "distribution" of $T$ applied to these $n$ sets is used to approximate the distribution of $T(\hat{F}_n)$. Let $\hat{F}_{(i)}$ denote the empirical distribution of the data set with the $i$th value deleted; $T_{(i)} = T(\hat{F}_{(i)})$ is the corresponding estimate of $T(F)$. An estimate of the expected value of $T(\hat{F}_n)$ is $$T_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n T(\hat{F}_{(i)}).$$
Applying the jackknife estimate of bias to correct the plug-in estimate of variance reproduces the formula for the sample variance (with $1/(n-1)$) from the formula with $1/n$: define $\bar{X} = \frac{1}{n}\sum_{j=1}^n X_j$; working through the algebra shows that the bias-corrected estimate is $\frac{1}{n-1}\sum_{j=1}^n (X_j - \bar{X})^2$.
The jackknife also can be used to estimate other properties of an estimator, such as its variance. The jackknife estimate of the variance of $T(\hat{F}_n)$ is $$\widehat{\mathrm{Var}}(T) = \frac{n-1}{n}\sum_{j=1}^n \left(T_{(j)} - T_{(\cdot)}\right)^2.$$
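For concreteness, a short sketch implementing this formula (the choice of $T$ as the median and the sample are illustrative, not from the notes):

```python
# A minimal sketch of the jackknife variance estimate
# Var_hat(T) = ((n-1)/n) * sum_j (T_(j) - T_(.))^2,
# with T taken to be the sample median for illustration.
import numpy as np

def jackknife_variance(x, T):
    """Jackknife estimate of the variance of the estimator T applied to x."""
    n = x.size
    # T_(j): the estimator applied to the sample with the j-th datum deleted.
    T_leave_one_out = np.array([T(np.delete(x, j)) for j in range(n)])
    T_dot = T_leave_one_out.mean()
    return (n - 1) / n * np.sum((T_leave_one_out - T_dot) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
print("jackknife variance of the median:", jackknife_variance(x, np.median))
```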
It is convenient to think of distributions on data sets to compare the jackknife and the bootstrap. We shall follow the notation in Efron (1982). We condition on $\{X_i = x_i\}$ and treat the data as fixed in what follows. Let $S_n$ be the $n$-dimensional simplex $$S_n \equiv \left\{P^* = (P^*_i)_{i=1}^n \in \Re^n : P^*_i \ge 0 \mbox{ and } \sum_{i=1}^n P^*_i = 1\right\}.$$
The bootstrap estimate of variance tends to be better than the jackknife estimate of variance for nonlinear estimators because of the distance between the empirical measure and the resampled measures: $$\|P^* - P^0\| = O_P(n^{-1/2}),$$ while the jackknife measures are only $O(n^{-1})$ from $P^0 = (1/n, \ldots, 1/n)$, the weight vector corresponding to $\hat{F}_n$.
To see that the cdfs of the jackknife samples are $O(n^{-1})$ from the empirical cdf $\hat{F}_n$, note that for univariate real-valued data, the difference between $\hat{F}_n$ and the cdf of the jackknife data set that leaves out the $j$th ranked observation $X_{(j)}$ is largest either at $X_{(j-1)}$ or at $X_{(j)}$. For $j=1$ or $j=n$, the jackknife samples that omit the smallest or largest observation, the K-S distance between the jackknife cdf and the empirical cdf is exactly $1/n$. Consider the jackknife cdf $\hat{F}_{n,(j)}$, the cdf of the sample without $X_{(j)}$, $1 < j < n$. Then $\hat{F}_{n,(j)}(X_{(j)}) = (j-1)/(n-1)$ while $\hat{F}_n(X_{(j)}) = j/n$, a difference of $(n-j)/(n(n-1)) \le 1/n$; at $X_{(j-1)}$, the difference is $(j-1)/(n-1) - (j-1)/n = (j-1)/(n(n-1)) \le 1/n$. So the K-S distance is at most $1/n$.
The neighborhood that the bootstrap samples is larger, and is probabilistically of the right size to correspond to the uncertainty of the empirical distribution function as an estimator of the underlying distribution function $F$ (recall the Dvoretzky-Kiefer-Wolfowitz inequality: a K-S ball of radius $O(n^{-1/2})$ has fixed coverage probability). For linear functionals, this does not matter, but for strongly nonlinear functionals, the bootstrap estimate of the variability tends to be more accurate than the jackknife estimate of the variability.
Let us have a quick look at the distribution of the K-S distance between a continuous distribution and the empirical distribution of a sample $\{X_j\}_{j=1}^n$ IID $F$. The discussion follows Feller (1971, pp. 36ff). First we show that for continuous distributions $F$, the distribution of $\|\hat{F}_n - F\|_{KS}$ does not depend on $F$. To see this, note that $F(X_j) \sim U[0,1]$: Let $x_t \equiv \inf\{x \in \Re : F(x) = t\}$. Continuity of $F$ ensures that $x_t$ exists for all $t \in [0,1]$. Now the event $\{X_j \le x_t\}$ is equivalent to the event $\{F(X_j) \le F(x_t)\}$ up to a set of $F$-measure zero. Thus $$t = P_F\{X_j \le x_t\} = P_F\{F(X_j) \le F(x_t)\} = P_F\{F(X_j) \le t\}, \quad t \in [0,1],$$ so $F(X_j) \sim U[0,1]$.
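A quick simulation sketch of this distribution-free property (the particular distributions, sample size, and use of scipy's `kstest` are illustrative assumptions, not part of the notes): the simulated K-S distances for exponential and normal samples of the same size should have essentially the same distribution.

```python
# Simulation check that ||F_hat_n - F||_KS has the same distribution for any
# continuous F: compare simulated K-S distances for exponential and normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 5_000

def ks_distances(sampler, cdf):
    """Simulate the K-S distance between F_hat_n and F for `reps` samples."""
    return np.array([stats.kstest(sampler(n), cdf).statistic for _ in range(reps)])

d_exp = ks_distances(lambda n: rng.exponential(size=n), stats.expon.cdf)
d_norm = ks_distances(lambda n: rng.normal(size=n), stats.norm.cdf)

# The quantiles of the two simulated distributions should agree closely.
print(np.quantile(d_exp, [0.5, 0.9, 0.95]))
print(np.quantile(d_norm, [0.5, 0.9, 0.95]))
```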
Let $U$ be an index set (not necessarily countable). Recall that a collection $\{I_u\}_{u \in U}$ of confidence intervals for parameters $\{\theta_u\}_{u \in U}$ has simultaneous $1-\alpha$ coverage probability if $$P_\theta\left\{\cap_{u \in U} \{I_u \ni \theta_u\}\right\} \ge 1-\alpha.$$
Many of the procedures for forming joint confidence sets we have seen depend on pivots, which are functions of the data and the parameter(s) whose distribution is known (even though the parameter and the parent distribution are not). For example, the Scheffé method relies on the fact that (for samples from a multivariate Gaussian with independent components) the sum of squared differences between the data and the corresponding parameters, divided by the variance estimate, has an $F$ distribution, regardless of the parameter values. Similarly, Tukey's maximum modulus method relies on the fact that (again, for independent Gaussian data) the distribution of the maximum of the studentized absolute differences between the data and the corresponding parameters does not depend on the parameters. Both of those examples are parametric, but the idea is more general: the procedure we looked at for finding bounds on the density function subject to shape restrictions relied only on the fact that there are uniform bounds on the probability that the K-S distance between the empirical distribution and the true distribution exceeds some threshold.
Even in cases where there is no known exact pivot, one can sometimes show that some function of the data and parameters is asymptotically a pivot. Working out the distributions of the functions involved is not typically straightforward, and a general method of constructing (possibly simultaneous) confidence sets would be nice.
Efron gives several methods of basing confidence sets on the bootstrap. Those methods are substantially improved (in theory, and in my experience) by Beran's pre-pivoting approach, which leads to iterating the bootstrap.
Let $X_n$ denote a sample of size $n$ from $F$. Let $R_n(\theta) = R_n(X_n, \theta)$ have cdf $H_n$, and let $H_n^{-1}(\alpha)$ be the largest $\alpha$ quantile of the distribution of $R_n$. Then $$\{\gamma \in \Theta : R_n(\gamma) \le H_n^{-1}(1-\alpha)\}$$ is a confidence set for $\theta$ with coverage probability at least $1-\alpha$.
The idea of the percentile method is to use the empirical bootstrap percentiles of some quantity to approximate the true percentiles. Consider constructing a confidence interval for a single real parameter $\theta = T(F)$. We will estimate $\theta$ by $\hat\theta = T(\hat{F}_n)$. We would like to know the distribution function $H_n = H_n(\cdot, F)$ of $D_n(\theta) = T(\hat{F}_n) - \theta$. Suppose we did. Let $H_n^{-1}(\cdot) = H_n^{-1}(\cdot, F)$ be the inverse cdf of $D_n$. Then $$P_F\left\{H_n^{-1}(\alpha/2) \le T(\hat{F}_n) - \theta \le H_n^{-1}(1-\alpha/2)\right\} = 1-\alpha,$$ so $\left[T(\hat{F}_n) - H_n^{-1}(1-\alpha/2),\ T(\hat{F}_n) - H_n^{-1}(\alpha/2)\right]$ would be a $1-\alpha$ confidence interval for $\theta$.
The idea behind the percentile method is to approximate $H_n(\cdot, F)$ by $\hat{H}_n = H_n(\cdot, \hat{F}_n)$, the distribution of $D_n$ under resampling from $\hat{F}_n$ rather than $F$. (This tends not to be a very accurate approximation for constructing confidence sets, as we shall see.)
An alternative approach is to take $D_n(\theta) = |T(\hat{F}_n) - \theta|$; then $$P_F\left\{|T(\hat{F}_n) - \theta| \le H_n^{-1}(1-\alpha)\right\} = 1-\alpha,$$ so $\{\gamma : |T(\hat{F}_n) - \gamma| \le H_n^{-1}(1-\alpha)\}$ would be a $1-\alpha$ confidence interval for $\theta$.
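Here is a minimal sketch of the percentile construction for the root $D_n = T(\hat{F}_n) - \theta$ (the functional, sample, and number of replicates are illustrative; the quantiles of $D_n$ under $\hat{F}_n$ are approximated by Monte Carlo):

```python
# Percentile-method sketch for the root D_n = T(F_hat_n) - theta: approximate
# the quantiles of D_n under F by its quantiles under resampling from F_hat_n,
# then invert to get an approximate 1 - alpha interval for theta.
import numpy as np

def percentile_ci(x, T, alpha=0.05, B=10_000, rng=None):
    """Approximate 1-alpha confidence interval for theta = T(F) based on D_n."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    theta_hat = T(x)
    # Bootstrap replicates of D_n = T(F*_n) - T(F_hat_n).
    D = np.array([T(rng.choice(x, size=n, replace=True)) - theta_hat
                  for _ in range(B)])
    lo, hi = np.quantile(D, [alpha / 2, 1 - alpha / 2])
    # P{lo <= T(F_hat_n) - theta <= hi} ~ 1 - alpha  =>  invert for theta.
    return theta_hat - hi, theta_hat - lo

rng = np.random.default_rng(0)
x = rng.lognormal(size=100)
print(percentile_ci(x, np.mean, rng=rng))
```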
Beran proves a variety of results under the following condition:
Condition 1. (Beran, 1987)
For any sequence $\{F_n\}$ that converges to $F$ in a metric $d$ on cdfs, $H_n(\cdot, F_n)$ converges weakly to a continuous cdf $H = H(\cdot, F)$ that depends only on $F$, and not on the sequence $\{F_n\}$.
Suppose Condition 1 holds. Then because $\hat{F}_n$ is consistent for $F$, the estimate $\hat{H}_n$ converges in probability to $H$ in sup norm; moreover, the distribution of $\hat{H}_n(R_n(\theta))$ converges to $U[0,1]$.
Instead of $D_n$, consider $R_n(\theta) = |T(\hat{F}_n) - \theta|$ or some other (approximate) pivot. Let $\hat{H}_n = H_n(\cdot, \hat{F}_n)$ be the bootstrap estimate of the cdf of $R_n$. The set $$B_n = \{\gamma \in \Theta : \hat{H}_n(R_n(\gamma)) \le 1-\alpha\} = \{\gamma \in \Theta : R_n(\gamma) \le \hat{H}_n^{-1}(1-\alpha)\}$$ is an approximate $1-\alpha$ confidence set for $\theta$; under Condition 1, its coverage probability converges to $1-\alpha$.
The level of this set for finite samples tends to be inaccurate. It can be improved in the following way, due to Beran.
The original root, $R_n(\theta)$, whose limiting distribution depends on $F$, was transformed into a new root $R_{n,1}(\theta) = \hat{H}_n(R_n(\theta))$, whose limiting distribution is $U[0,1]$. The distribution of $R_{n,1}$ depends less strongly on $F$ than does that of $R_n$; Beran calls mapping $R_n$ into $R_{n,1}$ prepivoting. The confidence set acts as if the distribution of $R_{n,1}$ really is uniform, which is not generally true. One could instead treat $R_{n,1}$ itself as a root, and pivot to reduce the dependence on $F$.
Let $H_{n,1} = H_{n,1}(\cdot, F)$ be the cdf of the new root $R_{n,1}(\theta)$, estimate $H_{n,1}$ by $\hat{H}_{n,1} = H_{n,1}(\cdot, \hat{F}_n)$, and define $$B_{n,1} = \{\gamma \in \Theta : \hat{H}_{n,1}(R_{n,1}(\gamma)) \le 1-\alpha\} = \{\gamma \in \Theta : \hat{H}_{n,1}(\hat{H}_n(R_n(\gamma))) \le 1-\alpha\} = \{\gamma \in \Theta : R_n(\gamma) \le \hat{H}_n^{-1}(\hat{H}_{n,1}^{-1}(1-\alpha))\}.$$
I follow Beran's (1987) notation (mostly).
Let $x_n$ denote the "real" sample of size $n$. Let $x^*_n$ be a bootstrap sample of size $n$ drawn from the empirical cdf $\hat{F}_n$. The components of $x^*_n$ are conditionally IID given $x_n$. Let $\hat{F}^*_n$ denote the "empirical" cdf of the bootstrap sample $x^*_n$. Let $x^{**}_n$ denote a sample of size $n$ drawn from $\hat{F}^*_n$; the components of $x^{**}_n$ are conditionally IID given $x_n$ and $x^*_n$. Let $\hat\theta_n = T(\hat{F}_n)$ and $\hat\theta^*_n = T(\hat{F}^*_n)$. Then $$H_n(s, F) = P_F\{R_n(x_n, \theta) \le s\},$$ and $\hat{H}_n = H_n(\cdot, \hat{F}_n)$ is its bootstrap estimate.
The Monte Carlo approach is as follows:

1. Draw $M$ bootstrap samples $\{y^*_k\}_{k=1}^M$, each of size $n$, IID from $\hat{F}_n$, and compute the root $R_n(y^*_k, \hat\theta_n)$ for each. The ecdf of $\{R_n(y^*_k, \hat\theta_n)\}_{k=1}^M$ is an approximation to $\hat{H}_n$.

2. From each first-generation sample $y^*_k$, draw $N$ size-$n$ bootstrap samples $\{y^{**}_{k,\ell}\}_{\ell=1}^N$ from the ecdf of $y^*_k$. Let $\hat\theta^*_{n,k} = T(\hat{F}^*_{n,k})$. Let $Z_k$ be the fraction of the values $\{R_n(y^{**}_{k,\ell}, \hat\theta^*_{n,k})\}_{\ell=1}^N$ that are at most $R_n(y^*_k, \hat\theta_n)$.

3. The ecdf of $\{Z_k\}_{k=1}^M$ is an approximation to $\hat{H}_{n,1}$. The Monte Carlo approximations to $B_n$ and $B_{n,1}$ converge to those sets (in probability) as $M$ and $N$ grow.
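A sketch of these steps in code (taking the root to be $R_n = |\bar{X} - \theta|$ and particular values of $M$ and $N$; these choices are illustrative, not Beran's):

```python
# Monte Carlo prepivoting sketch: first-generation samples approximate H_hat_n;
# second-generation samples give the Z_k, whose ecdf approximates H_hat_{n,1}.
import numpy as np

def prepivot_ci(x, alpha=0.05, M=1000, N=1000, rng=None):
    """Prepivoted interval for the mean, using the root R_n = |mean - theta|."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    theta_hat = x.mean()

    R1 = np.empty(M)      # R_n(y*_k, theta_hat), first-generation roots
    Z = np.empty(M)       # Z_k = fraction of second-generation roots <= R1[k]
    for k in range(M):
        y_star = rng.choice(x, size=n, replace=True)
        theta_star = y_star.mean()
        R1[k] = abs(theta_star - theta_hat)
        # Second generation: N samples of size n from the ecdf of y*_k.
        y_star_star = rng.choice(y_star, size=(N, n), replace=True)
        R2 = np.abs(y_star_star.mean(axis=1) - theta_star)
        Z[k] = np.mean(R2 <= R1[k])

    # H_hat_{n,1}^{-1}(1 - alpha) from the ecdf of the Z_k, then H_hat_n^{-1}
    # of that value from the ecdf of the first-generation roots.
    z_cut = np.quantile(Z, 1 - alpha)
    r_cut = np.quantile(R1, z_cut)
    return theta_hat - r_cut, theta_hat + r_cut

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)
print(prepivot_ci(x, rng=rng))
```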
Note that this approach is extremely general.
Beran gives examples for confidence sets for directions, etc.
The pivot can in principle be
a function of any number of parameters, which can yield simultaneous confidence
sets for parameters of any dimension.
There are other ways of iterating the bootstrap to improve the level accuracy of bootstrap confidence sets. Efron suggests targeting a different nominal coverage probability, chosen so that the coverage attained in the second-generation samples is the desired coverage probability. That is, if one wants a 95% confidence set, one tries different percentiles until, in resampling from the sample, the attained coverage probability is 95%. Typically, the percentile one uses in the second generation will be higher than 95%. Here is a sketch of the Monte-Carlo approach:
1. Set a value of $\alpha^*$ (initially taking $\alpha^* = \alpha$ is reasonable).

2. From the sample, draw $M$ size-$n$ samples that are each IID $\hat{F}_n$. Denote the ecdfs of the samples by $\{\hat{F}^*_{n,j}\}$.

3. From each $\hat{F}^*_{n,j}$, construct a bootstrap percentile confidence interval with nominal level $1-\alpha^*$ for $T(\hat{F}_n)$. This gives $M$ confidence intervals; a fraction $1-\alpha'$ of them will cover $T(\hat{F}_n)$. Typically, $1-\alpha' \ne 1-\alpha$.

4. If $1-\alpha' > 1-\alpha$, increase $\alpha^*$ and return to the previous step; if $1-\alpha' < 1-\alpha$, decrease $\alpha^*$ and return to the previous step. If $1-\alpha' \approx 1-\alpha$ to the desired level of precision, go to the next step.

5. From the original sample, construct a bootstrap quantile confidence interval that has nominal $1-\alpha^*$ coverage probability.
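A rough sketch of this calibration (the bisection search over $\alpha^*$, the Monte Carlo sizes, and the use of the mean are illustrative choices, not prescribed above):

```python
# Efron-style calibration sketch: adjust the nominal level 1 - alpha* until
# percentile intervals built from first-generation bootstrap samples cover
# T(F_hat_n) a fraction 1 - alpha of the time.
import numpy as np

def percentile_interval(y, T, alpha_star, B, rng):
    """Simple bootstrap percentile interval for T, from the sample y."""
    reps = np.array([T(rng.choice(y, size=y.size, replace=True)) for _ in range(B)])
    return np.quantile(reps, [alpha_star / 2, 1 - alpha_star / 2])

def calibrated_alpha(x, T, alpha=0.05, M=200, B=200, tol=0.01, rng=None):
    """Find alpha* so that the attained coverage of T(F_hat_n) is about 1 - alpha."""
    rng = np.random.default_rng() if rng is None else rng
    theta_hat = T(x)
    lo, hi = 1e-4, 0.5                      # bisection bounds for alpha*
    for _ in range(20):
        alpha_star = 0.5 * (lo + hi)
        covered = 0
        for _ in range(M):
            y = rng.choice(x, size=x.size, replace=True)   # first generation
            l, u = percentile_interval(y, T, alpha_star, B, rng)
            covered += (l <= theta_hat <= u)
        coverage = covered / M
        if abs(coverage - (1 - alpha)) < tol:
            break
        if coverage > 1 - alpha:
            lo = alpha_star      # over-covering: increase alpha*
        else:
            hi = alpha_star      # under-covering: decrease alpha*
    return alpha_star

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)
a_star = calibrated_alpha(x, np.mean, rng=rng)
# Final interval: a percentile interval from the original sample at level 1 - a_star.
print(a_star, percentile_interval(x, np.mean, a_star, 2000, rng))
```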
An alternative approach to increasing the coverage probability by iterating the bootstrap is to use the same root, but to use a quantile (among second-generation bootstrap samples) of its $1-\alpha$ quantile rather than the quantile observed in the first generation. The heuristic justification is that we would ideally like to know the $1-\alpha$ quantile of the pivot under sampling from the true distribution $F$. We don't. The percentile method estimates the $1-\alpha$ quantile of the pivot under $F$ by the $1-\alpha$ quantile of the pivot under $\hat{F}_n$, but that estimate is subject to sampling variability. To try to be conservative, we could use the bootstrap a second time to find an (approximate) upper $1-\alpha^*$ confidence bound for the $1-\alpha$ quantile of the pivot.
Here is a sketch of the Monte-Carlo approach:
1. From the sample, compute the estimate $T(\hat{F}_n)$ of the parameter.
2. From the sample, draw $M$ size-$n$ samples that are each IID $\hat{F}_n$. Denote the ecdfs of the samples by $\{\hat{F}^*_{n,j}\}$.
3. From each $\hat{F}^*_{n,j}$, draw $N$ size-$n$ samples that are IID $\hat{F}^*_{n,j}$. Find the $1-\alpha$ quantile of the pivot among those second-generation samples. This gives $M$ values of the $1-\alpha$ quantile. Let $c$ be the $1-\alpha^*$ quantile of the $M$ $1-\alpha$ quantiles.
4. Form the confidence interval from the original sample by taking $c$ to be the estimate of the $1-\alpha$ quantile of the pivot.
In a variety of simulations, this tends to be more conservative than Beran's method, and more often attains at least the nominal coverage probability.
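A sketch of this scheme (again with the root $|\bar{X} - \theta|$ and illustrative values of $M$, $N$, and $\alpha^*$, which are not prescribed above):

```python
# Quantile-of-quantiles sketch: estimate the 1 - alpha quantile of the pivot in
# each first-generation sample by a second-generation bootstrap, then take the
# 1 - alpha* quantile c of those M estimates as a conservative cutoff.
import numpy as np

def quantile_of_quantile_ci(x, alpha=0.05, alpha_star=0.05, M=500, N=500, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    theta_hat = x.mean()
    q = np.empty(M)                        # M estimates of the 1 - alpha quantile
    for j in range(M):
        y = rng.choice(x, size=n, replace=True)            # first generation
        yy = rng.choice(y, size=(N, n), replace=True)      # second generation
        pivot = np.abs(yy.mean(axis=1) - y.mean())
        q[j] = np.quantile(pivot, 1 - alpha)
    c = np.quantile(q, 1 - alpha_star)     # upper 1 - alpha* point of the quantiles
    return theta_hat - c, theta_hat + c

rng = np.random.default_rng(0)
x = rng.lognormal(size=50)
print(quantile_of_quantile_ci(x, rng=rng))
```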
Consider forming a two-sided 95% confidence interval for the mean $\theta$ of a distribution $F$ based on the sample mean, using $|\bar{X} - \theta|$ as a pivot.
Compare the three approaches described above (Beran's pre-pivoting, Efron's calibrated target percentile, and the percentile-of-percentile) for samples drawn from the following distributions: lognormal, Cauchy, mixtures of normals with the same mean but quite different variances (try different mixture coefficients), and mixtures of normals with different means and different variances (the means should differ enough that the result is bimodal). Use 1000 first-generation bootstrap samples. Which methods come close to the nominal coverage probability? Which are conservative? Try to provide some intuition about the circumstances under which each method fails, and the circumstances under which each method would be expected to perform well.
Warning: You might need to be clever in how you implement this to make it
a feasible calculation.
If you try to store all the intermediate results,
the memory requirement is huge. On the other hand, if you use too many loops, the
execution time will be long.
Beran (1995) discusses finding a confidence region for the mean vector $\theta \in \Re^q$, $q \ge 3$, from data $X \sim N(\theta, I)$. This is an example illustrating that what one bootstraps is important, and that naive plug-in bootstrapping doesn't always work.
The sets are spheres centered at the shrinkage (James-Stein) estimate $$\hat\theta_S = \left(1 - \frac{q-2}{\|X\|^2}\right) X,$$ with radius $\hat{d}$ calibrated so that the sphere has (approximately) the desired coverage probability.
This problem is parametric: $F$ is known up to the $q$-dimensional mean vector $\theta$. We can thus use a "parametric bootstrap" to generate data that are approximately from $F$, instead of drawing directly from $\hat{F}_n$: if we have an estimate $\hat\theta$ of $\theta$, we can generate artificial data distributed as $N(\hat\theta, I)$. If $\hat\theta$ is a good estimator, the artificial data will be distributed nearly as $F$. The issue is in what sense $\hat\theta$ needs to be good.
Beran shows (somewhat surprisingly) that resampling from $N(\hat\theta_S, I)$ or from $N(X, I)$ does not tend to work well in calibrating $\hat{d}$. The crucial thing in using the bootstrap to calibrate the radius of the confidence sphere seems to be to estimate $\|\theta\|$ well.
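To illustrate the machinery of parametric-bootstrap calibration (a sketch only: the function names are made up, and the centering plug-in used to generate the artificial data is left as an argument, since, as just noted, the choice matters; Beran's preferred estimator is not reproduced here):

```python
# Sketch of parametric-bootstrap calibration of the radius d_hat of a confidence
# sphere centered at the James-Stein estimate. `center` is a generic plug-in for
# theta used to generate artificial data; per the discussion above, using X or
# theta_hat_S here does not calibrate well.
import numpy as np

def james_stein(x):
    q = x.size
    return (1 - (q - 2) / np.dot(x, x)) * x

def calibrated_radius(center, alpha=0.05, B=5000, rng=None):
    """1-alpha quantile of ||theta_hat_S(X*) - center|| for X* ~ N(center, I)."""
    rng = np.random.default_rng() if rng is None else rng
    q = center.size
    x_star = center + rng.normal(size=(B, q))              # parametric bootstrap
    dist = np.array([np.linalg.norm(james_stein(row) - center) for row in x_star])
    return np.quantile(dist, 1 - alpha)

rng = np.random.default_rng(0)
q, theta = 20, np.full(20, 1.0)
x = theta + rng.normal(size=q)
d_hat = calibrated_radius(x, rng=rng)       # naive plug-in: center = X
print("radius:", d_hat,
      "covers theta?", np.linalg.norm(james_stein(x) - theta) <= d_hat)
```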
Definition. The geometrical risk of a confidence set $C$ for the parameter $\theta \in \Re^q$ is $$G_q(C, \theta) \equiv q^{-1/2} E_\theta \sup_{\gamma \in C} \|\gamma - \theta\|.$$
For confidence spheres $C = C(\hat\theta, \hat{d}) = \{\gamma \in \Re^q : \|\gamma - \hat\theta\| \le \hat{d}\}$, the supremum is attained at distance $\|\hat\theta - \theta\| + \hat{d}$ from $\theta$, so $$G_q(C, \theta) = q^{-1/2}\left(E_\theta\|\hat\theta - \theta\| + E_\theta \hat{d}\right).$$
Lemma
(Beran, 1995, Lemma 4.1). Define $$W_q(X, \gamma) \equiv \left(q^{-1/2}\left(\|X - \gamma\|^2 - q\right),\ q^{-1/2}\gamma'(X - \gamma)\right).$$ Suppose $\|\gamma_q\|^2/q \to a$ as $q \to \infty$. Then under $P_{\gamma_q}$, the first component of $W_q(X, \gamma_q)$ converges in distribution to $N(0, 2)$ and the second to $N(0, a)$.
Proof. Under $P_{\gamma_q}$, the distribution of $X - \gamma$ is rotationally invariant, so the distributions of the components of $W_q$ depend on $\gamma$ only through $\|\gamma\|$. Wlog, we may take each component of $\gamma_q$ to be $q^{-1/2}\|\gamma_q\|$. The distribution of the first component of $W_q$ is then that of the sum of squares of $q$ IID standard normals (a chi-square rv with $q$ df), minus the expected value of that sum, times $q^{-1/2}$. The standard deviation of a chi-square random variable with $q$ df is $\sqrt{2q}$, so the first component of $W_q$ is $\sqrt{2}$ times a standardized variable whose distribution is asymptotically (in $q$) normal. The second component of $W_q$ is a linear combination of IID standard normals; by symmetry (as argued above), its distribution is that of $$q^{-1/2}\sum_{j=1}^q q^{-1/2}\|\gamma_q\| Z_j = \frac{\|\gamma_q\|}{q}\sum_{j=1}^q Z_j \to a^{1/2} Z_2$$ in distribution, where the $\{Z_j\}$ are IID standard normal and $Z_2$ is standard normal.
Recall that the squared-error risk (normalized by $q^{-1}$) of the James-Stein estimator is $1 - q^{-1}E_\theta\{(q-2)^2/\|X\|^2\} < 1$. The difference between the loss of $\hat\theta_S$ and an unbiased estimate of its risk is $$D_q(X, \theta) = q^{-1/2}\left\{\|\hat\theta_S - \theta\|^2 - \left[q - (q-2)^2/\|X\|^2\right]\right\}.$$
Theorem.
(Beran, 1995, Theorem 3.1) Suppose $\{\gamma_q \in \Re^q\}$ satisfies the condition of the Lemma ($\|\gamma_q\|^2/q \to a$). Then $$H_q(\|\hat\theta_{CL}\|^2/q) \stackrel{W}{\to} N(0, \sigma^2(a)),$$ where $\sigma^2(a)$ depends only on $a$.
It follows that to estimate $H_q$ by the bootstrap consistently, one should use $$\hat{H}_B = H_q(\|\hat\theta_{CL}\|^2/q).$$
Proof. The Lemma implies that under the conditions of the theorem, $\|\hat\theta_{CL}\|^2/q \to a$, $\|X\|^2/q \to 1 + a$, and $\|\hat\theta_S\|^2/q \to a^2/(1+a)$.
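A quick numerical check of two of these limits (a simulation sketch; the values of $q$ and $a$ are arbitrary, and this is not part of Beran's proof):

```python
# Sanity check: for X ~ N(theta, I) with ||theta||^2/q = a,
# ||X||^2/q ~ 1 + a and ||theta_hat_S||^2/q ~ a^2/(1 + a) for large q.
import numpy as np

rng = np.random.default_rng(0)
q, a = 10_000, 2.0
theta = np.full(q, np.sqrt(a))          # ||theta||^2 / q = a
x = theta + rng.normal(size=q)
theta_S = (1 - (q - 2) / np.dot(x, x)) * x

print("||X||^2/q       =", np.dot(x, x) / q, " (expect", 1 + a, ")")
print("||theta_S||^2/q =", np.dot(theta_S, theta_S) / q, " (expect", a**2 / (1 + a), ")")
```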