Statistics 5601 (Geyer, Fall 2007) Subsampling Bootstrap Confidence Intervals

General Instructions

To do each example, just click the Submit button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Introduction

Now that we have some idea of how (Efron style) bootstrap confidence intervals work, we can return to the subsampling bootstrap and look at the kind of confidence intervals Politis and Romano (1994) recommend for the subsampling bootstrap.

Somewhat suprisingly, they recommend the kind of intervals that Efron and Tibshirani disparage in Section 13.4 titles Is the Percentile Interval Backwards?. This section is a response to the recommendations of Hall (The Bootstrap and Edgeworth Expansion, Springer, 1992, and earlier papers) where he charactizes bootstrap percentile intervals as looking up [in] the wrong statistical tables backwards.

We take no sides on the argument between Efron and Hall. There are arguments on both sides, and either method performs well in some examples and badly in others.

However, for the subsampling bootstrap, there seems to be no dispute. The standard method for the subsampling bootstrap is very much like what Hall recommends for the ordinary bootstrap.

This method is explained in a handout (which will be handed out in class). For those who didn't get a copy in dead tree format, here is a PDF.

Time Series

Comments

Everything down to the bottom of the for loop should be familiar, just like what we did when we previously visited this example except some plot commands have been removed.
The four lines below the bootstrap loop calculate the subsampling bootstrap confidence interval. The example could have stopped there.
The histogram shows the distribution which the subsampling bootstrap uses to calculate critical values. And the confidence interval below gives what one could do if one wanted to assume the asymptotic distribution is normal.
Note that the histogram is quite skewed. Hence we get a confidence interval that is not symmetric about theta.hat. Thus we do much better by not assuming normality. We get a shorter confidence interval with better coverage.
As usual, Efron and Tibshirani are using a ridiculously small sample size in this toy problem. There is no reason to believe the subsampling bootstrap here. But it is reasonable for (much) larger data sets. (This comment repeated from the previous discussion of the subsampling bootstrap).

Time Series Revisited

In this section we look at a long AR(1) time series, so long that it is silly to use all blocks of length b as subsamples. We take a sample of the subsamples.

Also we use a sound method of estimating the AR coefficient.

The R function ar.burg (on-line help) estimates AR(k) time series models for arbitrary k and also determines k. Here we want the case k = 1, which makes the call a bit complicated (the default is to estimate k). Another complication is that this function returns a list, the ar element of which is the estimate. That's why the $ar at the end of the function call. It picks out that element of the list.

The command spc <- 100 says we will use every 100-th block of length b. This complicates the specification of the subsamples. The starting indices of the blocks we use are given by the vector ii and the i-th time through the loop the subsample will be the block starting at ii[i] and ending at ii[i] + b - 1. This is just like the code where we use all the subsamples except we have replaced i by ii[i] at this point.

Everything else is the same as the preceding example.

Extreme Values

Comments

Everything down to the bottom of the bootstrap loop should be familiar, just like what we did when we previously visited this example except some the commands to also do the Efron bootstrap have been removed.
The code immediately following the bootstrap loop calculates a two-sided confidence interval as described in the handout and in Politis and Romano (1994). That is what you would usually do.
In this problem we actually want a one-sided interval because we know the estimator always misses the parameter on the low side. That's done by the lines below the comment.

Estimating the Rate

R statements print(n <- length(x)) print(theta.hat <- max(x)) nboot <- 2e4 - 1 b <- c(40, 60, 90, 135) b <- sort(b) theta.star <- matrix(NA, nboot, length(b)) for (i in 1:nboot) { x.star <- x for (j in length(b):1) { x.star <- sample(x.star, b[j], replace = FALSE) theta.star[i, j] <- max(x.star) } } zlist <- list() for (i in 1:length(b)) { zlist[[i]] <- theta.star[ , i] - theta.hat } names(zlist) <- b boxplot(zlist, xlab = "subsample size", ylab = expression(hat(theta)[b] - hat(theta)[n])) qlist <- list() k <- (nboot + 1) * seq(0.05, 0.45, 0.05) l <- (nboot + 1) * seq(0.55, 0.95, 0.05) for (i in 1:length(b)) { z.star <- zlist[[i]] sz.star <- sort(z.star, partial = c(k, l)) qlist[[i]] <- sz.star[l] - sz.star[k] } names(qlist) <- b lqlist <- lapply(qlist, log) stripchart(lqlist, xlab = "subsample size", ylab = "log(high quantile - low quantile)", vertical = TRUE) y <- sapply(lqlist, mean) print(beta <- cov(- y, log(b)) / var(log(b))) # confidence interval calculation m <- 3 b <- b[m] theta.star <- theta.star[ , m] alpha <- 0.05 z.star <- b^beta * (theta.star - theta.hat) # two-sided interval crit.val <- sort(z.star)[(nboot + 1) * c(1 - alpha / 2, alpha / 2)] theta.hat - crit.val / n^beta # histogram from which critical values are derived hist(z.star) abline(v = crit.val, lty = 2) # one-sided confidence interval actually makes more sense # since we know theta.hat < theta crit.val <- sort(z.star)[(nboot + 1) * alpha] theta.hat - c(0, crit.val) / n^beta cat("Calculation took", proc.time()[1], "seconds\n")

Dataset URL

Comments

Now we do subsampling at four different sample sizes defined by the statement
```
b <- c(40, 60, 90, 135)
```
The number of sizes (here 4) and the sizes themselves are arbitrary. You may change them as you see fit.
Now the bootstrap loop becomes a loop within a loop. In a minor innovation of our own (not from Politis and Romano) we have used the principle of common random numbers from Monte Carlo theory. Our estimation will be as accurate as possible if we make the subsamples of different sizes as similar as possible given the requirement that they be samples without replacement from the original sample. The obvious way to do that is to subsample the subsamples. For the sizes we have here
- we take a subsample of size 135 from the original sample,
- then we take a subsample of size 90 from the subsample of size 135,
- then we take a subsample of size 60 from the subsample of size 90,
- then we take a subsample of size 40 from the subsample of size 60.
Of course all of these samples are without replacement as everywhere in Politis-Romano subsampling. This guarantees each is a sample of the required size from the true unknown distribution of the data.
Warning: Don't do
```
x.star <- sample(x.star, b[j], replace = FALSE)
```
(or anything like it with x.star appearing twice) anywhere else in bootstrapping. The only reason this works is because of the index j running backwards and because of the assignment x.star <- x done just before the inner loop. This is a clever trick for this situation only. Don't generalize it out of context.
After the bootstrap loop within a loop, theta.star is a matrix, whose i-th column contains nboot bootstrap subsamples of sample size b[i]. Then theta.hat is subtracted from each theta.star and the results are stuffed into a list (zlist) and shown as parallel boxplots.
It is clear from the boxplots how the scale changes with subsample size.
Now we make another arbitrary definition
```
k <- (nboot + 1) * seq(0.05, 0.45, 0.05)
l <- (nboot + 1) * seq(0.55, 0.95, 0.05)
```
The vectors k and l can be any integer sequences of the same length such that 0 <= k < l <= nboot.
The next step is to find quantile differences, essentially sort(z.star)[l] - sort(z.star)[k], although the code uses a partial sort to save time (whether the time saved is worth the extra code complexity is unknown). These are stuffed into another list qlist>. Then we take logs and plot the results as what some call a dot plot but R calls stripchart. This shows how these log quantile differences vary with b.
Finally we set y[i] to be the average log quantile differences for the subsample with subsample size b[i]. Then we find the estimate of beta by regressing y on log(b).
That was the hard part. Now that the rate is estimated, everything is just like the the preceding example we just use b^beta.hat and b^beta.hat where rates are needed.
The only tricky bit is that we don't bother to go back and get another bootstrap sample (we already have four!). We just use the m-th.

Statistics 5601 (Geyer, Fall 2007) Subsampling Bootstrap Confidence Intervals

General Instructions

Introduction

Time Series

Comments

Time Series Revisited

Extreme Values

Comments

Estimating the Rate

Comments

Navigation

Contents