General Instructions
To do each example, just click the Submit
button.
You do not have to type in any R instructions or specify a dataset.
That's already done for you.
Introduction
Now that we have some idea of how (Efron style) bootstrap confidence intervals work, we can return to the subsampling bootstrap and look at the kind of confidence intervals Politis and Romano (1994) recommend for the subsampling bootstrap.
Somewhat suprisingly, they recommend the kind of intervals that Efron and
Tibshirani disparage in Section 13.4 titles Is the Percentile Interval
Backwards?
. This section is a response to the recommendations of Hall
(The Bootstrap and Edgeworth Expansion, Springer, 1992, and earlier
papers) where he charactizes bootstrap percentile intervals as looking up
[in] the wrong statistical tables backwards
.
We take no sides on the argument between Efron and Hall. There are arguments on both sides, and either method performs well in some examples and badly in others.
However, for the subsampling bootstrap, there seems to be no dispute. The standard method for the subsampling bootstrap is very much like what Hall recommends for the ordinary bootstrap.
This method is explained in a handout (which will be handed out in class). For those who didn't get a copy in dead tree format, here is a PDF.
Time Series
Comments
- Everything down to the bottom of the for loop should be familiar,
just like what we did when we previously
visited this example except some plot commands have been removed.
-
The four lines below the bootstrap loop calculate the subsampling
bootstrap confidence interval. The example could have stopped there.
-
The histogram shows the distribution which the subsampling bootstrap uses
to calculate critical values. And the confidence interval below gives
what one could do if one wanted to assume the asymptotic distribution is
normal.
-
Note that the histogram is quite skewed. Hence we get a confidence interval
that is not symmetric about
theta.hat
. Thus we do much better by not assuming normality. We get a shorter confidence interval with better coverage. - As usual, Efron and Tibshirani are using a ridiculously small sample size in this toy problem. There is no reason to believe the subsampling bootstrap here. But it is reasonable for (much) larger data sets. (This comment repeated from the previous discussion of the subsampling bootstrap).
Time Series Revisited
In this section we look at a long AR(1) time series, so long that it is silly to use all blocks of length b as subsamples. We take a sample of the subsamples.
Also we use a sound method of estimating the AR coefficient.
The R function ar.burg
(on-line
help) estimates AR(k) time
series models for arbitrary k and also determines k.
Here we want the case k = 1, which makes the call a bit complicated
(the default is to estimate k). Another complication is that
this function returns a list, the ar
element of which is
the estimate. That's why the $ar
at the end of the function
call. It picks out that element of the list.
The command spc <- 100
says we will use every 100-th block
of length b. This complicates the specification of the subsamples.
The starting indices of the blocks we use are given by the
vector ii
and the i-th time through the loop
the subsample will be the block starting at ii[i]
and ending
at ii[i] + b - 1
. This is just like the code where we use
all the subsamples except we have replaced i
by ii[i]
at this point.
Everything else is the same as the preceding example.
Extreme Values
Comments
- Everything down to the bottom of the bootstrap loop should be familiar,
just like what we did when we previously
visited this example except some the commands to also do the Efron
bootstrap have been removed.
-
The code immediately following the bootstrap loop calculates
a two-sided confidence interval as described in the handout and in
Politis and Romano (1994). That is what you would usually do.
- In this problem we actually want a one-sided interval because we know the estimator always misses the parameter on the low side. That's done by the lines below the comment.
Estimating the Rate
Comments
- Now we do subsampling at four different sample sizes defined by
the statement
b <- c(40, 60, 90, 135)
The number of sizes (here 4) and the sizes themselves are arbitrary. You may change them as you see fit. - Now the bootstrap loop becomes a loop within a loop. In a minor
innovation of our own (not from Politis and Romano) we have used the
principle of common random numbers from Monte Carlo theory.
Our estimation will be as accurate as possible if we make the subsamples
of different sizes as similar as possible given the requirement that
they be samples without replacement from the original sample. The
obvious way to do that is to subsample the subsamples. For the sizes
we have here
- we take a subsample of size 135 from the original sample,
- then we take a subsample of size 90 from the subsample of size 135,
- then we take a subsample of size 60 from the subsample of size 90,
- then we take a subsample of size 40 from the subsample of size 60.
Warning: Don't do
x.star <- sample(x.star, b[j], replace = FALSE)
(or anything like it withx.star
appearing twice) anywhere else in bootstrapping. The only reason this works is because of the indexj
running backwards and because of the assignmentx.star <- x
done just before the inner loop. This is a clever trick for this situation only. Don't generalize it out of context. -
After the bootstrap loop within a loop,
theta.star
is a matrix, whosei
-th column containsnboot
bootstrap subsamples of sample sizeb[i]
. Thentheta.hat
is subtracted from eachtheta.star
and the results are stuffed into a list (zlist
) and shown as parallel boxplots.It is clear from the boxplots how the scale changes with subsample size.
Now we make another arbitrary definition
k <- (nboot + 1) * seq(0.05, 0.45, 0.05) l <- (nboot + 1) * seq(0.55, 0.95, 0.05)
The vectorsk
andl
can be any integer sequences of the same length such that0 <= k < l <= nboot
. -
The next step is to find quantile differences,
essentially
sort(z.star)[l] - sort(z.star)[k]
, although the code uses a partial sort to save time (whether the time saved is worth the extra code complexity is unknown). These are stuffed into another listqlist
>. Then we take logs and plot the results as what some call adot plot
but R callsstripchart
. This shows how these log quantile differences vary with b. -
Finally we set
y[i]
to be the average log quantile differences for the subsample with subsample sizeb[i]
. Then we find the estimate ofbeta
by regressingy
onlog(b)
. -
That was the hard part. Now that the rate is estimated, everything is
just like the the preceding example we just use
b^beta.hat
andb^beta.hat
where rates are needed.The only tricky bit is that we don't bother to go back and get another bootstrap sample (we already have four!). We just use the
m
-th.