Now that we have some idea of how (Efron style) bootstrap confidence intervals
work, we can return to the subsampling bootstrap and look at the kind of
confidence intervals Politis and Romano (1994) recommend for the subsampling
bootstrap.
Somewhat surprisingly, they recommend the kind of intervals that Efron and
Tibshirani disparage in Section 13.4, titled Is the Percentile Interval
Backwards?. That section is a response to the recommendations of Hall
(The Bootstrap and Edgeworth Expansion, Springer, 1992, and earlier
papers), who characterizes bootstrap percentile intervals as looking up
[in] the wrong statistical tables backwards.
We take no sides on the argument between Efron and Hall. There are arguments
on both sides, and either method performs well in some examples and badly in
others.
However, for the subsampling bootstrap, there seems to be no dispute.
The standard method for the subsampling bootstrap is very much like
what Hall recommends for the ordinary bootstrap.
This method is explained in a handout (which will be handed out in class).
For those who didn't get a copy in dead tree format, here it is
in PostScript and in PDF.
Everything down to the bottom of the for loop should be familiar,
just like what we did when we previously
visited this example, except that some plot commands have been removed.
The four lines below the bootstrap loop calculate the subsampling
bootstrap confidence interval. The example could have stopped there.
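For those who want to see the shape of such a calculation before looking at
the actual code, here is a rough sketch. It assumes theta.hat, theta.star
(the vector of nboot subsample estimates), the subsample size b, and the
data x are as in the code on this page, and it assumes the rate is sqrt(n);
the real four lines may differ in detail (and the real rate is whatever the
handout says it is).

conf.level <- 0.95
z.star <- sqrt(b) * (theta.star - theta.hat)    # rescaled subsampling distribution
crit <- quantile(z.star, c((1 - conf.level) / 2, (1 + conf.level) / 2))
## the "backwards" form: the upper critical value gives the lower endpoint
c(theta.hat - crit[2] / sqrt(length(x)), theta.hat - crit[1] / sqrt(length(x)))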
The histogram shows the distribution which the subsampling bootstrap uses
to calculate critical values. And the confidence interval below gives
what one could do if one wanted to assume the asymptotic distribution is
normal.
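In the same sketchy spirit, the normal-theory version might look like the
following (reusing conf.level and z.star from the sketch above; again only
a sketch, not the code from the page).

## normal-theory interval: use the standard deviation of the subsampling
## distribution in place of its quantiles
theta.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) * sd(z.star) / sqrt(length(x))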
Note that the histogram is quite skewed. Hence we get a confidence interval
that is not symmetric about theta.hat. Thus we do much better
by not assuming normality. We get a shorter confidence interval with better
coverage.
As usual, Efron and Tibshirani are using a ridiculously small sample size
in this toy problem. There is no reason to believe the subsampling bootstrap
here. But it is reasonable for (much) larger data sets.
(This comment is repeated from the previous discussion of the subsampling
bootstrap.)
Everything down to the bottom of the bootstrap loop should be familiar,
just like what we did when we previously
visited this example, except that the commands that also do the Efron
bootstrap have been removed.
The code immediately following the bootstrap loop calculates
a two-sided confidence interval as described in the handout and in
Politis and Romano (1994). That is what you would usually do.
In this problem we actually want a one-sided interval because we know
the estimator always misses the parameter on the low side. That's done
by the lines below the comment.
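As a rough sketch of what the one-sided calculation looks like (the rate
function tau below is a placeholder, not the actual rate this example uses,
and theta.hat, theta.star, b, and x are assumed to be as in the code on
the page):

tau <- function(n) sqrt(n)    # placeholder: substitute the example's actual rate
conf.level <- 0.95
z.star <- tau(b) * (theta.star - theta.hat)
crit <- quantile(z.star, 1 - conf.level)    # lower-tail critical value
## the estimator can only miss low, so theta.hat itself is the lower endpoint
c(theta.hat, theta.hat - crit / tau(length(x)))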
Now we do subsampling at four different subsample sizes defined by
the statement
b <- c(40, 60, 90, 135)
The number of sizes (here 4) and the sizes themselves are arbitrary.
You may change them as you see fit.
Now the bootstrap loop becomes a loop within a loop. In a minor
innovation of our own (not from Politis and Romano) we have used the
principle of common random numbers from Monte Carlo theory.
Our estimation will be as accurate as possible if we make the subsamples
of different sizes as similar as possible given the requirement that
they be samples without replacement from the original sample. The
obvious way to do that is to subsample the subsamples. For the sizes
we have here
we take a subsample of size 135 from the original sample,
then we take a subsample of size 90 from the subsample of size 135,
then we take a subsample of size 60 from the subsample of size 90,
then we take a subsample of size 40 from the subsample of size 60.
Of course all of these samples are without replacement as everywhere in
Politis-Romano subsampling. This guarantees each is a sample of the
required size from the true unknown distribution of the data.
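A sketch of how such a loop within a loop might be written is below
(my.estimator is a stand-in for whatever statistic this example actually
computes; everything else follows the description above).

nb <- length(b)    # b is sorted in increasing order
theta.star <- matrix(NA, nboot, nb)
for (i in 1:nboot) {
    x.star <- x    # start each outer iteration from the full sample
    for (j in nb:1) {    # j runs backwards: largest subsample first
        ## subsample the previous (larger) subsample: common random numbers
        x.star <- sample(x.star, b[j], replace = FALSE)
        theta.star[i, j] <- my.estimator(x.star)
    }
}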
Warning: Don't do
x.star <- sample(x.star, b[j], replace = FALSE)
(or anything like it with x.star appearing twice)
anywhere else in bootstrapping. The only reason this works is because
of the index j running backwards and because of the
assignment x.star <- x done just before the inner loop.
This is a clever trick for this situation only. Don't generalize
it out of context.
After the bootstrap loop within a loop, theta.star
is a matrix whose i-th column contains the nboot
subsampling estimates for subsample size b[i].
Then theta.hat is subtracted from each element of theta.star,
and the results are stuffed
into a list (zlist) and shown as parallel boxplots.
It is clear from the boxplots how the scale changes with
subsample size.
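A sketch of that step (the list names are our own choice here, not
necessarily what the code on the page uses):

zlist <- list()
for (j in seq(along = b))
    zlist[[paste("b =", b[j])]] <- theta.star[, j] - theta.hat
boxplot(zlist)    # parallel boxplots, one per subsample size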
Now we make another arbitrary definition
k <- (nboot + 1) * seq(0.05, 0.45, 0.05)
l <- (nboot + 1) * seq(0.55, 0.95, 0.05)
The vectors k and l can be any integer sequences
of the same length such that 1 <= k < l <= nboot.
The next step is to find quantile differences,
essentially sort(z.star)[l] - sort(z.star)[k],
although the code uses a partial sort to save time (whether the time
saved is worth the extra code complexity is unknown). These are stuffed
into another list qlist. Then we take logs and plot the
results as what some call a dot plot but
R calls stripchart. This shows how these log quantile
differences vary with b.
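In sketch form (using a full sort rather than the partial sort in the
actual code, and reusing zlist from the sketch above):

qlist <- lapply(zlist, function(z) sort(z)[l] - sort(z)[k])    # quantile differences
stripchart(lapply(qlist, log), vertical = TRUE)    # log differences versus b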
Finally we set y[i] to be the average log quantile differences
for the subsample with subsample size b[i]. Then we find
the estimate of beta by regressing y on
log(b).
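A sketch of that step follows. The sign convention assumes the rate is
written tau(n) = n^beta, so the quantile differences scale like b^(-beta)
and beta.hat is minus the fitted slope; the code on the page may organize
this differently.

y <- sapply(qlist, function(q) mean(log(q)))    # average log quantile difference for each b
beta.hat <- as.numeric(- coef(lm(y ~ log(b)))[2])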
That was the hard part. Now that the rate is estimated, everything is
just like the preceding example; we just use
b^beta.hat and n^beta.hat where rates are needed.
The only tricky bit is that we don't bother to go back and get another
bootstrap sample (we already have four!). We just use the m-th.
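A sketch of that final step, assuming the one-sided form from the
preceding example and (purely as an assumption here) taking m to index
the largest of the four subsample sizes:

m <- length(b)    # which of the four runs to use (an assumption, not from the page)
z.star <- b[m]^beta.hat * (theta.star[, m] - theta.hat)
crit <- quantile(z.star, 1 - conf.level)
c(theta.hat, theta.hat - crit / length(x)^beta.hat)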