Now that we have some idea of how (Efron style) bootstrap confidence intervals
work, we can return to the subsampling bootstrap and look at the kind of
confidence intervals Politis and Romano (1994) recommend for the subsampling
bootstrap.
Somewhat surprisingly, they recommend the kind of intervals that Efron and
Tibshirani disparage in Section 13.4, titled Is the Percentile Interval
Backwards?. That section is a response to the recommendations of Hall
(The Bootstrap and Edgeworth Expansion, Springer, 1992, and earlier
papers), who characterizes bootstrap percentile intervals as looking up
[in] the wrong statistical tables backwards.
We take no sides on the argument between Efron and Hall. There are arguments
on both sides, and either method performs well in some examples and badly in
others.
However, for the subsampling bootstrap, there seems to be no dispute.
The standard method for the subsampling bootstrap is very much like
what Hall recommends for the ordinary bootstrap.
This method is explained in a handout (which will be handed out in class).
For those who didn't get a copy in dead tree format, here it is
in PostScript and in PDF.
Everything down to the bottom of the for loop should be familiar,
just like what we did when we previously
visited this example, except that some plot commands have been removed.
The four lines below the bootstrap loop calculate the subsampling
bootstrap confidence interval. The example could have stopped there.
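For those who want to see the shape of such a calculation before looking at
the actual code, here is a rough sketch. It assumes theta.hat, theta.star
(the vector of nboot subsample estimates), the subsample size b, and the
data x are as in the code on this page, and it assumes the rate is sqrt(n);
the real four lines may differ in detail (and the real rate is whatever the
handout says it is).

conf.level <- 0.95
z.star <- sqrt(b) * (theta.star - theta.hat)    # rescaled subsampling distribution
crit <- quantile(z.star, c((1 - conf.level) / 2, (1 + conf.level) / 2))
## the "backwards" form: the upper critical value gives the lower endpoint
c(theta.hat - crit[2] / sqrt(length(x)), theta.hat - crit[1] / sqrt(length(x)))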
The histogram shows the distribution which the subsampling bootstrap uses
to calculate critical values. And the confidence interval below gives
what one could do if one wanted to assume the asymptotic distribution is
normal.
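In the same sketchy spirit, the normal-theory version might look like the
following (reusing conf.level and z.star from the sketch above; again only
a sketch, not the code from the page).

## normal-theory interval: use the standard deviation of the subsampling
## distribution in place of its quantiles
theta.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) * sd(z.star) / sqrt(length(x))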
Note that the histogram is quite skewed. Hence we get a confidence interval
that is not symmetric about theta.hat. Thus we do much better
by not assuming normality. We get a shorter confidence interval with better
coverage.
As usual, Efron and Tibshirani are using a ridiculously small sample size
in this toy problem. There is no reason to believe the subsampling bootstrap
here. But it is reasonable for (much) larger data sets.
(This comment is repeated from the previous discussion of the subsampling
bootstrap.)
Everything down to the bottom of the bootstrap loop should be familiar,
just like what we did when we previously
visited this example, except that the commands that also do the Efron
bootstrap have been removed.
The code immediately following the bootstrap loop calculates
a two-sided confidence interval as described in the handout and in
Politis and Romano (1994). That is what you would usually do.
In this problem we actually want a one-sided interval because we know
the estimator always misses the parameter on the low side. That's done
by the lines below the comment.
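As a rough sketch of what the one-sided calculation looks like (the rate
function tau below is a placeholder, not the actual rate this example uses,
and theta.hat, theta.star, b, and x are assumed to be as in the code on
the page):

tau <- function(n) sqrt(n)    # placeholder: substitute the example's actual rate
conf.level <- 0.95
z.star <- tau(b) * (theta.star - theta.hat)
crit <- quantile(z.star, 1 - conf.level)    # lower-tail critical value
## the estimator can only miss low, so theta.hat itself is the lower endpoint
c(theta.hat, theta.hat - crit / tau(length(x)))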
Now we do subsampling at four different subsample sizes defined by
the statement
b <- c(40, 60, 90, 135)
The number of sizes (here 4) and the sizes themselves are arbitrary.
You may change them as you see fit.
Now the bootstrap loop becomes a loop within a loop. In a minor
innovation of our own (not from Politis and Romano) we have used the
principle of common random numbers from Monte Carlo theory.
Our estimation will be as accurate as possible if we make the subsamples
of different sizes as similar as possible given the requirement that
they be samples without replacement from the original sample. The
obvious way to do that is to subsample the subsamples. For the sizes
we have here
we take a subsample of size 135 from the original sample,
then we take a subsample of size 90 from the subsample of size 135,
then we take a subsample of size 60 from the subsample of size 90,
then we take a subsample of size 40 from the subsample of size 60.
Of course all of these samples are without replacement as everywhere in
Politis-Romano subsampling. This guarantees each is a sample of the
required size from the true unknown distribution of the data.
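A sketch of how such a loop within a loop might be written is below
(my.estimator is a stand-in for whatever statistic this example actually
computes; everything else follows the description above).

nb <- length(b)    # b is sorted in increasing order
theta.star <- matrix(NA, nboot, nb)
for (i in 1:nboot) {
    x.star <- x    # start each outer iteration from the full sample
    for (j in nb:1) {    # j runs backwards: largest subsample first
        ## subsample the previous (larger) subsample: common random numbers
        x.star <- sample(x.star, b[j], replace = FALSE)
        theta.star[i, j] <- my.estimator(x.star)
    }
}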
Warning: Don't do
x.star <- sample(x.star, b[j], replace = FALSE)
(or anything like it with x.star appearing twice)
anywhere else in bootstrapping. The only reason this works is because
of the index j running backwards and because of the
assignment x.star <- x done just before the inner loop.
This is a clever trick for this situation only. Don't generalize
it out of context.
After the bootstrap loop within a loop, theta.star
is a matrix whose i-th column contains the nboot
subsampling estimates for subsample size b[i].
Then theta.hat is subtracted from each element of theta.star,
and the results are stuffed
into a list (zlist) and shown as parallel boxplots.
It is clear from the boxplots how the scale changes with
subsample size.
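A sketch of that step (the list names are our own choice here, not
necessarily what the code on the page uses):

zlist <- list()
for (j in seq(along = b))
    zlist[[paste("b =", b[j])]] <- theta.star[, j] - theta.hat
boxplot(zlist)    # parallel boxplots, one per subsample size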
Now we make another arbitrary definition
k <- (nboot + 1) * seq(0.05, 0.45, 0.05)
l <- (nboot + 1) * seq(0.55, 0.95, 0.05)
The vectors k and l can be any integer sequences
of the same length such that 1 <= k < l <= nboot.
The next step is to find quantile differences,
essentially sort(z.star)[l] - sort(z.star)[k],
although the code uses a partial sort to save time (whether the time
saved is worth the extra code complexity is unknown). These are stuffed
into another list qlist. Then we take logs and plot the
results as what some call a dot plot but
R calls stripchart. This shows how these log quantile
differences vary with b.
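In sketch form (using a full sort rather than the partial sort in the
actual code, and reusing zlist from the sketch above):

qlist <- lapply(zlist, function(z) sort(z)[l] - sort(z)[k])    # quantile differences
stripchart(lapply(qlist, log), vertical = TRUE)    # log differences versus b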
Finally we set y[i] to be the average log quantile differences
for the subsample with subsample size b[i]. Then we find
the estimate of beta by regressing y on
log(b).
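A sketch of that step follows. The sign convention assumes the rate is
written tau(n) = n^beta, so the quantile differences scale like b^(-beta)
and beta.hat is minus the fitted slope; the code on the page may organize
this differently.

y <- sapply(qlist, function(q) mean(log(q)))    # average log quantile difference for each b
beta.hat <- as.numeric(- coef(lm(y ~ log(b)))[2])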
That was the hard part. Now that the rate is estimated, everything is
just like the preceding example; we just use
b^beta.hat and n^beta.hat where rates are needed.
The only tricky bit is that we don't bother to go back and get another
bootstrap sample (we already have four!). We just use the m-th.
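A sketch of that final step, assuming the one-sided form from the
preceding example and (purely as an assumption here) taking m to index
the largest of the four subsample sizes:

m <- length(b)    # which of the four runs to use (an assumption, not from the page)
z.star <- b[m]^beta.hat * (theta.star[, m] - theta.hat)
crit <- quantile(z.star, 1 - conf.level)
c(theta.hat, theta.hat - crit / length(x)^beta.hat)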