General Instructions
To do each example, just click the Submit button.
You do not have to type in any R instructions or specify a dataset.
That's already done for you.
Overview
The subject of this web page is the subsampling bootstrap, which is the subject of a book by Politis, Romano, and Wolf.
It is also the subject of a more detailed web page, which we will get to in a few weeks.
Basic Idea
The subsampling bootstrap samples without replacement at a subsample size b that is smaller than the original sample size n. The sampling without replacement has the consequence that the samples are from the true unknown population distribution.
- The Efron nonparametric bootstrap samples from the wrong distribution (the empirical) at the right sample size n.
- The Politis and Romano subsampling (nonparametric) bootstrap samples from the right distribution at the wrong sample size b.
Rate of Convergence
In order to use the subsampling bootstrap we must know the rate of convergence of the estimator we are using. We assume that if t_n is the estimator, θ is the parameter, and n is the sample size, then
$$ n^r (t_n - \theta) \xrightarrow{\mathcal{D}} \text{some distribution} $$
for some known r. We estimate this distribution by the distribution of
$$ b^r (t_b^* - t_n) $$
where b is the subsample size and t_b^* is the subsampling bootstrap estimator.
Often r is 1 ⁄ 2 (the "square root law" obeyed by most widely used estimators). Sometimes, as in the extreme values example below, it is not.
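The role of the rate in rescaling from subsample size b to sample size n can be made explicit with a short heuristic calculation. This is only a sketch: it assumes that the limiting distribution has a finite standard deviation and that the spread of the subsampling estimates around t_n mimics the spread of t_n around θ. The notation sd* (standard deviation over subsamples) is introduced here purely for illustration.

```latex
\begin{align*}
n^r (t_n - \theta) &\approx b^r (t_b^* - t_n) && \text{(in distribution)} \\
n^r \, \operatorname{sd}(t_n) &\approx b^r \, \operatorname{sd}^*(t_b^*) \\
\operatorname{sd}(t_n) &\approx \left( \frac{b}{n} \right)^{r} \operatorname{sd}^*(t_b^*)
\end{align*}
```

Under these assumptions the correction factor is sqrt(b / n) when r is 1 ⁄ 2 and b / n when r is 1.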
Stationary Process or IID Sampling
There are two ways to do subsampling.
One is essential for stationary time series and is demonstrated in the time series example below. In this method, the subsamples are all blocks of length b in the time series. There are not many such blocks (n − b + 1), but it is necessary to keep the blocks together to keep the dependence in the time series (at least the dependence that is present in blocks of length b).
The other method applies only to IID data and is demonstrated in the extreme values example below. In this method, the subsamples are samples of size b drawn without replacement from the original sample. This allows many more subsamples than the other method and hence a more accurate bootstrap.
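The two ways of forming subsamples can be sketched in a few lines of R. The data, the sizes n and b, and the names blocks, subsamples, and nboot are all made up for illustration. This is a minimal sketch, not the code used in the examples.

```r
# Hypothetical toy data: n = 10 values standing in for a real sample.
set.seed(42)
n <- 10
b <- 4
x <- rnorm(n)

# Method 1 (stationary time series): every block of b contiguous values.
# Row i of the matrix is x[i], ..., x[i + b - 1].
starts <- seq_len(n - b + 1)
blocks <- t(sapply(starts, function(i) x[i:(i + b - 1)]))
nrow(blocks)  # the number of blocks, n - b + 1

# Method 2 (IID data): random samples of size b without replacement.
nboot <- 1000  # illustrative number of subsamples
subsamples <- t(replicate(nboot, sample(x, b, replace = FALSE)))

# Method 2 can draw from choose(n, b) distinct subsets,
# compared with only n - b + 1 blocks in method 1.
choose(n, b)
```

Even for this tiny example the second method has far more distinct subsamples available, which is why it is preferred whenever the data really are IID.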
Time Series
Sections 8.5 and 8.6 in Efron and Tibshirani.
Comments
- As usual, library(bootstrap) says we are going to use code in the bootstrap library, which is not available without this command.
- The lutenhorm data is explained by its on-line help. Inspection of the lutenhorm dataset shows that column 4 is the data described in Table 8.1 in Efron and Tibshirani.
- The function acf (see its on-line help) calculates the so-called autocorrelation function of the time series. The height of the bar at lag k is the correlation of X_n and X_{n+k}, assuming the time series is stationary (so this correlation does not depend on n, only on k). The correlation at lag zero is one by definition (any random variable is perfectly correlated with itself).
  The blue dashed lines in the autocorrelation plot are 95% non-simultaneous large sample approximate critical values for testing whether the autocorrelations are non-zero. Autocorrelations that go outside the blue dashed lines are statistically significant. Here only the lag 1 autocorrelation is significant.
- The function foo calculates the estimator of the autoregressive coefficient described by Efron and Tibshirani.
  The vector z is the data supplied to the function with the mean subtracted off. The number m is the length of the data.
  The vector z[-1] is all the elements of z except the first, and the vector z[-m] is all the elements of z except the last.
  Thus the statement
  out <- lm(z[-1] ~ z[-m] + 0)
  regresses z_t on z_{t−1} with no intercept (the "+ 0" means "no intercept").
- For time series, the subsampling bootstrap uses only blocks of contiguous variables. For a series of length n and blocks of length b there are exactly n - b + 1 such blocks. Generally, we use them all. No need for random samples.
- Note that the bootstrap samples are correlated, as the time series plot for beta.star shows. However, this does not matter, so long as b is long enough so the samples are representative of the behavior of the whole series.
  As usual, Efron and Tibshirani are using a ridiculously small sample size in this toy problem. There is no reason to believe the subsampling bootstrap here. But it is reasonable for (much) larger data sets.
- The histogram of beta.star shows that the simple method of estimation being used here is badly biased. That's why this method is not recommended by time series books. We only use it here because it is easy to explain.
- The sqrt(b / n) in the last line adjusts for the relative sample sizes of the subsample and the whole series. Note that the sqrt here is only valid for estimators obeying the square root law. If the rate is not "root n", then a different function of b / n is needed, as in the following example.
Extreme Values and IID in General
This section is about using the subsampling bootstrap in situations where the data are IID (independent and identically distributed) but in which the ordinary nonparametric bootstrap (Efron's nonparametric bootstrap) does not work. The only difference from one such problem to another is that the rate might be different. In this example we have rate n. In another problem we might have rate n^{1/3}. In yet another we might have yet another rate.
Section 7.4 in Efron and Tibshirani.
Comments
- The vector theta.star stores max(x) for samples from the subsampling bootstrap.
- The vector theta.bogo stores max(x) for samples from the ordinary (Efron) bootstrap.
- Note that the sample statement is quite different for the regular (Efron) bootstrap and the (Politis and Romano) subsampling bootstrap.
  For the Efron bootstrap, we sample with replacement at the original sample size with something like
  x.star <- sample(x, replace = TRUE)
  The subsampling bootstrap samples without replacement at the much smaller sample size b with something like
  x.star <- sample(x, b, replace = FALSE)
  Both the size and the replace arguments of sample differ. (For the Efron bootstrap the size argument is missing so the default length(x) is used.)
- Since the asymptotic distribution is non-normal, it makes no sense to be calculating standard errors. What does make sense is a bootstrap percentile interval, but we will have to wait until we learn about that and revisit this issue.
- For now, we just show that the subsampling bootstrap has done the Right Thing (with a capital R and a capital T). The plot is a so-called Q-Q plot. The sorted values of the variable
  z.star <- b * (theta.hat - theta.star)
  which is supposed to have an Exp(1 / θ) distribution according to the theory, are plotted against the appropriate quantiles of this distribution. If the points lie near the line y = x, then z.star does indeed have the claimed distribution.
  We emphasize that we don't need to know the asymptotic distribution to use the bootstrap samples z.star to construct a confidence interval for θ. We can't do it yet because we haven't covered Chapters 12, 13, and 14 in Efron and Tibshirani. When we've done them, we can return to this example and finish it.
- For comparison, we also put the theta.bogo samples on the Q-Q plot, so it can be clearly seen they do the Wrong Thing (with a capital W and a capital T).
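The comparison described in these comments might be coded along the following lines. This is a sketch under stated assumptions: the data are simulated from a Uniform(0, θ) distribution with θ = 1, the values of n, b, and nboot are arbitrary, the name z.bogo is introduced here for the rescaled Efron samples, and the unknown θ in the exponential quantiles is replaced by theta.hat.

```r
# Simulated Uniform(0, theta) data. The values of theta, n, b, and nboot
# are illustrative assumptions.
set.seed(42)
theta <- 1
n <- 1000
b <- 25
nboot <- 2000
x <- runif(n, 0, theta)
theta.hat <- max(x)

theta.star <- double(nboot)  # subsampling bootstrap
theta.bogo <- double(nboot)  # ordinary (Efron) bootstrap
for (i in 1:nboot) {
    theta.star[i] <- max(sample(x, b, replace = FALSE))
    theta.bogo[i] <- max(sample(x, replace = TRUE))
}

# Rate n here, so scale by the sample size itself, not its square root.
z.star <- b * (theta.hat - theta.star)
z.bogo <- n * (theta.hat - theta.bogo)

# Fraction of bootstrap samples that reproduce max(x) exactly.
mean(z.bogo == 0)
mean(z.star == 0)

# Q-Q plot against quantiles of the exponential distribution with
# rate 1 / theta.hat (theta.hat plugged in for the unknown theta).
p <- ppoints(nboot)
q <- qexp(p, rate = 1 / theta.hat)
plot(q, sort(z.star), xlab = "exponential quantiles", ylab = "sorted values")
points(q, sort(z.bogo), col = "red")
abline(0, 1)
```

The two printed fractions show one symptom of the problem: an Efron bootstrap sample contains the original maximum more often than not, so z.bogo has a large atom at zero that the continuous limiting distribution does not have.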