Statistics 5601 (Geyer, Fall 2003) Examples: Bootstrap Standard Errors

General Instructions
Mice
- Bootstrapping the Sample Mean
- Bootstrapping Something Else
Law (Correlation)
- Nonparametric Bootstrap
- Parametric Bootstrap
Testing (Students)

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Mice

Bootstrapping the Sample Mean

Chapter 2 in Efron and Tibshirani.

Comments

As usual, library(bootstrap) says we are going to use code in the bootstrap library, which is not available without this command.
All (I think) data sets in Efron and Tibshirani are in this library. The only issue is finding out what they are called, the on-line help for the bootstrap package may give a hint.
In this example, it is a fair guess that the name of the data set may have something to do with mice. The only ones on the list that look like that are mouse.c and mouse.t.
Since the actual help pages for the data sets are completely useless, saying only See Efron and Tibshirani (1993) for details on these datasets but not giving a page number or anything else of the slightest help in identifying which data set it is, the only way to find out whether these really are the data sets we want is to look at them. For example,
```
library(bootstrap)
data(mouse.c)
print(mouse.c)
```
prints mouse.c and, sure enough, these are the 9 numbers in the control group for the mouse data.
Before you can use (or even look at) any data set in the package, you must execute a data statement, as in the example.
The for loop makes nboot bootstrap samples x.star and calculates for each the statistical functional of interest theta.star[i], which in this example is the mean.
As Efron and Tibshirani say (somewhere, I can't find it now), it's always a good idea to look at a histogram of the bootstrap distribution of the estimator. That's what the code
```
hist(theta.star)
abline(v = theta.hat, lty = 2)
```
does. The vertical dotted line added by the second statement is at theta.hat.
This plot shows whether the distribution is normal or not and whether the estimator is biased or not.
For comparison, the last statement in the example gives the theoretical calculation that corresponds to the bootstrap estimate in the preceeding line.
Bogosity Warning. Bootstrapping sample sizes as small as this (n = 9) cannot be expected to give good results. The bootstrap is a large sample procedure. Anyone who thinks the bootstrap is a small sample procedure is making the most fundamental error in statistics: thinking the sample is the population.

Bootstrapping Something Else

This example differs from the preceding one only in the estimator being a 15% trimmed mean rather than the ordinary mean. Everything else is the same.

Chapter 2 in Efron and Tibshirani.

Comments

See the comments for the preceding section for anything not explained here. The two examples are almost exactly the same.
The mean function from the preceding example is replaced by the foo function, which calculates the 15% trimmed mean. Everything else is the same, except the last statement in the preceding example, which gives the theoretical standard error of the ordinary sample mean is irrelevant here, because even if we knew a lot of theory we still couldn't give a useful theoretical standard error for a 15% trimmed mean (or most other complicated estimators).

Law (Correlation)

Nonparametric Bootstrap

Section 6.3 in Efron and Tibshirani.

Comments

See the comments for the mice example about the library and data statements.
The law data set for this example is an R data frame. In the printout it looks like a matrix, and matrix notation can be used to extract the variables. The notation law[ , 2] indicates column 2 of the law data frame.
The sample statement in this example does something a little different from the mice example above.
The problem is that we don't have a univariate random variable to sample. We need to sample (with replacement from the original sample) pairs (X_k, Y_k).
We do that by sampling random indices. The vector k.star produced by the sample statement is a random sample with replacement from the numbers 1, 2, . . ., n. When we use such a vector as an index to another vector, it pulls out the elements indicated. The two statements
```
k.star <- sample(n, replace = TRUE)
x.star <- x[k.star]
```
does exactly the same thing as the single statement from the mice example
```
x.star <- sample(x, replace = TRUE)
```
But (a very big but) the three statements from this example
```
k.star <- sample(n, replace = TRUE)
x.star <- x[k.star]
y.star <- y[k.star]
```
do something very different from the two statements
```
x.star <- sample(x, replace = TRUE)
y.star <- sample(y, replace = TRUE)
```
that a naive person might think would be the way to resample with replacement from the data.
The point is that the wrong thing (using two sample statements) produces independent samples x.star and y.star. We don't have to use the bootstrap to know their correlation is exactly zero!
The right thing (using just one sample statement and an auxiliary index vector k.star) produces samples x.star and y.star that have the same pairing they had in the original data. A pair (x.star[i], y.star[i]) are a pair (x[j], y[j]) from the original data (for some j).

Parametric Bootstrap

Section 6.3 in Efron and Tibshirani.

Comments

And now for something completely different, the nonparametric bootstrap is very different from the parametric bootstrap.
No resample with replacement from the original data. Instead we simulate from the parametric model that is assumed for the data. This is still a bootstrap rather than pure simulation because the estimated parameter value is not the true unknown parameter value. So we still have a sample is not the population issue.
In this case the parametric model is bivariate normal. The mvrnorm function in the MASS library ( on-line help) simulates multivariate normal random vectors. A multivariate normal distribution is determined by two structured parameters, the mean vector and the variance matrix (also called covariance matrix, variance-covariance matrix, and dispersion matrix).
But the correlation coefficient is invariant under changes of location and scale. Thus the mean vector does not affect the distribution of the sample correlation coefficient. Nor do scale changes. Hence we can use the correlation matrix instead of the variance matrix without affecting the distribution of the sample correlation coefficient. The code
```
print(cor.mat <- cor(law))
```
calculates and prints the sample correlation matrix. It has ones on the diagonal and the sample correlation coefficient as its off-diagonal elements. That's why rho.hat is cor.mat[1, 2].
In the real world, the matrix cor.mat is the sample correlation matrix and its [1, 2] element is the plug-in estimate of the parameter of interest.
In the bootstrap world, the matrix cor.mat is (treated as) the population correlation matrix and its [1, 2] element is (treated as) the parameter of interest.
The statement
```
law.star <- mvrnorm(n, c(0, 0), cor.mat)
```
produces a random sample from the parametric model with parameter rho.hat. The two statements following calculate rho.star[i] from the (parametric) bootstrap data law.star in exactly the same way rho.hat is calculated from the original data law.
In calculating Fisher's z we don't need to do another bootstrap. We have rho.star stored. Just transform it to find the sampling distribution of z.
The transformation
```
z <- 0.5 * log((1 + rho) / (1 - rho))
```
has a name inverse hyperbolic tangent and a standard R function to calculate it
```
z <- atanh(rho)
```
which has an inverse function hyperbolic tangent
```
rho <- tanh(z)
```
Because of the skewness of the distribution of rho.hat (as shown by the skewness of the histogram of rho.star)
```
rho.hat + c(-1, 1) * qnorm(0.975) * sd(rho.star)
```
is not a very good 95% confidence interval for the true unknown parameter rho. (Point estimate plus or minus 1.96 standard errors of the point estimate assumes normality or at least approximate normality and here we aren't anywhere close to normality.)
Because of the approximate normality of the distribution of z.hat (as shown by the approximate normality of the histogram of z.star)
```
z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star)
```
is a pretty good 95% confidence interval for the true unknown parameter zeta = atanh(rho). Hence
```
tanh(z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star))
```
is a pretty good 95% confidence interval for the true unknown parameter rho.
The Moral of the Story. Work on a parameter that has an approximately normal estimator. If the original parameter of interest (here rho) doesn't have an approximately normal estimator, then change parameters to one (here zeta) that does. Then transform back to get a c. i. for the original parameter.
We will return to this theme when we discuss better bootstrap confidence intervals (Chapters 14 and 22 in Efron and Tibshirani).

library(bootstrap) data(scor) ### observed data ### n <- nrow(scor) fred <- eigen(var(scor) * (n - 1) / n, symmetric = TRUE) print(fred) eigenval.hat.1 <- fred$values[1] eigenval.hat.2 <- fred$values[2] eigenval.hat.sum <- sum(fred$values) eigenvec.hat.1 <- fred$vectors[ , 1] eigenvec.hat.2 <- fred$vectors[ , 2] ### bootstrap samples ### nboot <- 1000 eigenval.star.1 <- double(nboot) eigenval.star.2 <- double(nboot) eigenval.star.sum <- double(nboot) eigenvec.star.1 <- matrix(as.double(0), nboot, ncol(scor)) eigenvec.star.2 <- matrix(as.double(0), nboot, ncol(scor)) for (i in 1:nboot) { k.star <- sample(n, replace = TRUE) scor.star <- scor[k.star, ] fred <- eigen(var(scor.star) * (n - 1) / n, symmetric = TRUE) eigenval.star.1[i] <- fred$values[1] eigenval.star.2[i] <- fred$values[2] eigenval.star.sum[i] <- sum(fred$values) eigenvec.star.1[i, ] <- fred$vectors[ , 1] eigenvec.star.2[i, ] <- fred$vectors[ , 2] } # patch up signs of eigenvectors. Make agree. sally <- sign(eigenvec.star.1 %*% eigenvec.hat.1) eigenvec.star.1[sally < 0, ] <- (- eigenvec.star.1[sally < 0, ]) sally <- sign(eigenvec.star.2 %*% eigenvec.hat.2) eigenvec.star.2[sally < 0, ] <- (- eigenvec.star.2[sally < 0, ]) ### theta ### print(theta.hat <- eigenval.hat.1 / eigenval.hat.sum) theta.star <- eigenval.star.1 / eigenval.star.sum hist(theta.star) abline(v = theta.hat, lty = 2) sd(theta.star) ### eigenvector variability by boxplot ### boxplot(data.frame(eigenvec.star.1)) boxplot(data.frame(eigenvec.star.2)) ### eigenvector variability by variance matrices ### eout1 <- eigen(var(eigenvec.star.1), symmetric = TRUE) eout2 <- eigen(var(eigenvec.star.2), symmetric = TRUE) round(eout1$values, 6) round(eout2$values, 6)

Comments

See the comments for the mice example about the library and data statements.
Note that there is only one for loop and we save all the junk we want to know about in five data structures
- eigenval.1
- eigenval.2
- eigenval.sum
- eigenvec.1
- eigenvec.2
Of course, it it hard to imagine this whole mess at once. You build up the program in stages. First write the loop and the statement or statements to generate the bootstrap samples. Then add statements to calculate various functions of the bootstrap samples and store them for future reference.
Since the data structure here is complicated, a matrix scor, we do the sampling in two steps
```
k.star <- sample(n, replace = TRUE)
scor.star <- scor[k.star, ]
```
First we generate the vector k.star which contains the indices (subscripts) of the data vectors that go into the bootstrap sample. Then we use the R subscripting operations to generate the corresponding data structure.
This is analogous to the way we did the nonparametric bootstrap for the correlation coefficient and for a similar reason. Whenever we have a complicated data structure, we will need this trick of resampling indices rather than data.
scor is a matrix and so is scor.star. Every row of scor.star is a row of scor. The i-th row of scor.star is the k.star[i]-th row of scor. And that's exactly what we want since k.star[i] is a random integer in the range of allowed row indices of scor.
We replace var(...) with var(...) * (n - 1) / n because Efron and Tibshirani use the variance of the empirical distribution (divide by n) rather than the so-called sample variance (divide by n - 1) which is what the R var function calculates.
The code just after the comment patch up signs . . . . Solves the problem that Efron and Tibshirani moan about on the bottom half of p. 69 the right way rather than the wrong way. If the sign is arbitrary, fix the signs to be consistent rather than just throwing out the ones with the wrong signs, which may bias the bootstrap calculation.
Here we calculate the inner product with the corresponding eigenvector for the original data and adjust the signs of the bootstrap eigenvectors so the inner products are all positive.
The R boxplot function wants a list of vectors rather than a matrix. The data.frame converts a matrix into a list of vectors which are the columns of the matrix. That's why we use that.
You may be wondering how anyone is supposed to come up with that. Simple. The on-line help for the boxplot function has an example illustrating this trick. I just copied the example.
A more traditional way to look at the variability of a random vector (such as one of these eigenvectors) is to look at its variance matrix. So we also do that.
var(eigenvec.1) and var(eigenvec.2) are those variance matrices.
But it's not easy to interpret a five dimensional variance matrix. So we take a hint from the nature of the problem. If eigenvalues and eigenvectors are in general a good way to look at (symmetric) matrices, then they are in particular a good way to look at var(eigenvec.1) and var(eigenvec.2) and var(eigenvec.2) - var(eigenvec.1).
Hence we look at the eigenvalues and eigenvectors (of the variance matrix of the bootstrap sampling distribution of an eigenvector of the data). (eigenvectors of . . . an eigenvector! Woof!)
It is clear that all of the eigenvalues for the variance of the first eigenvector are much smaller than the corresponding eigenvalues for the second eigenvector. Hence the first eigenvector is much less variable than the second.

Statistics 5601 (Geyer, Fall 2003) Examples: Bootstrap Standard Errors

Contents

General Instructions

Mice

Bootstrapping the Sample Mean

Chapter 2 in Efron and Tibshirani.

Comments

Bootstrapping Something Else

Chapter 2 in Efron and Tibshirani.

Comments

Law (Correlation)

Nonparametric Bootstrap

Section 6.3 in Efron and Tibshirani.

Comments

Parametric Bootstrap

Section 6.3 in Efron and Tibshirani.

Comments

Testing (Students)

Section 7.2 in Efron and Tibshirani.

Comments