To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a data set. That's already done for you.
Bootstrapping the Sample Mean
Chapter 2 in Efron and Tibshirani.
The statement

library(bootstrap)

says we are going to use functions and/or data in the bootstrap library, which are not available before this command is executed. All (I think) data sets in Efron and Tibshirani are in this library. The only issue is finding out what they are called; the on-line help for the bootstrap package may give a hint. In this example, it is a fair guess that the name of the data set has something to do with mice, and the only ones on the list that look like that are mouse.c and mouse.t. This web page used to complain that the help for these data sets was useless (and two years ago it was), but the help is now as it should be; see, for example, the on-line help for the mouse data. To be really sure, just print the data (as we do above) and compare it with the book.
The for loop generates the bootstrap samples x.star and calculates for each one the statistical functional of interest, theta.star[i], which in this example is the mean.
As Efron and Tibshirani say (somewhere, I can't find it now), it's always a good idea to look at a histogram of the bootstrap distribution of the estimator. That's what the code

hist(theta.star)
abline(v = theta.hat, lty = 2)

does. The vertical dotted line added by the second statement is at theta.hat, the value of the estimator for the original data.
This plot shows whether the bootstrap distribution looks normal and whether the estimator looks biased.
For comparison, the last statement in the example gives the theoretical calculation that corresponds to the bootstrap estimate of the standard error.
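Putting these pieces together, here is a minimal sketch of the whole example. It assumes the data set is the control-group mouse data mouse.c (n = 9) in the bootstrap library; the variable names follow the discussion above but may not match the example exactly.

library(bootstrap)                   # data sets from Efron and Tibshirani
x <- mouse.c                         # assumed data set (n = 9)
n <- length(x)
theta.hat <- mean(x)                 # the estimate for the original data

nboot <- 1000                        # number of bootstrap samples (our choice)
theta.star <- double(nboot)
for (i in 1:nboot) {
    x.star <- sample(x, replace = TRUE)   # resample with replacement
    theta.star[i] <- mean(x.star)         # the statistical functional
}

hist(theta.star)
abline(v = theta.hat, lty = 2)       # dotted line at the point estimate

sd(theta.star)                       # bootstrap standard error
sd(x) / sqrt(n)                      # theoretical standard error of the mean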
- Bogosity Warning. Bootstrapping sample sizes as small as this (n = 9) cannot be expected to give good results. The bootstrap is a large sample procedure. Anyone who thinks the bootstrap is a small sample procedure is making the most fundamental error in statistics: thinking the sample is the population.
Bootstrapping Something Else
This example differs from the preceding one only in the estimator being a 15% trimmed mean rather than the ordinary mean. Everything else is the same.
Chapter 2 in Efron and Tibshirani.
See the comments for the preceding section for anything not explained here. The two examples are almost exactly the same: the mean function from the preceding example is replaced by the foo function, which calculates the 15% trimmed mean. Everything else is the same, except that the last statement in the preceding example, which gives the theoretical standard error of the ordinary sample mean, is irrelevant here, because even if we knew a lot of theory we still couldn't give a useful theoretical standard error for a 15% trimmed mean (or for most other complicated estimators).
The foo function is defined using the R function function (on-line help), that is, the R function whose name is function. It defines a function object (an R object that is a function). The function defined has one argument named x, because the defining expression function(x) has one argument named x. When the defined function foo is called, R executes the function body [the expression following function(x)], replacing each reference to the argument x by the actual object in the function call. That is, either of

foo(sally)
foo(x = sally)

has exactly the same effect as executing the function body with x replaced by sally, namely

mean(sally, trim = 0.15)
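To see this concretely, here is a minimal sketch (sally here is just a hypothetical placeholder vector, not data from the book):

foo <- function(x) mean(x, trim = 0.15)

sally <- rnorm(20)           # hypothetical data, for illustration only
foo(sally)                   # these three calls all compute the same
foo(x = sally)               # number: the 15% trimmed mean of sally
mean(sally, trim = 0.15)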
The reason for introducing such a simple function is good coding practice: use variables rather than constants. We often have something like

conf.level <- 0.95

at the top of an example to make the thingy (here the confidence level) easy to change. You know that if you change the definition of conf.level here, you don't need to change it anywhere else. If you don't use a variable, then you need to search through the code looking for instances of 0.95 to perhaps change. We say perhaps because you have to look carefully at each instance to see how the number is being used. The same principle applies to defining a function foo at the top of an example. We know we only need to make a change in one place.
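For example, a sketch of the practice (not code from the example itself), with theta.hat and theta.star as in the mice example:

conf.level <- 0.95                     # change the level in one place only
crit <- qnorm((1 + conf.level) / 2)    # 1.96 when conf.level is 0.95
theta.hat + c(-1, 1) * crit * sd(theta.star)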
- The explanation of the function function above is a bit oversimplified. See the section on writing your own functions in our R intro page.
Section 6.3 in Efron and Tibshirani.
- See the comments for the mice example for anything not explained here.

The law data set for this example is an R data frame. Here the statement

attach(law)

does just what the attach(X) at the top of any Rweb invocation that uses external data entry does: it makes the variables in the data frame (here LSAT and GPA) available as ordinary variables. The data set law is explained by its on-line help.
The sample statement in this example does something a little different from the mice example above. The problem is that we don't have a univariate random variable to sample: we need to sample pairs (X_k, Y_k) with replacement from the original sample.
We do that by sampling random indices. The vector k.star produced by the sample statement is a random sample with replacement from the numbers 1, 2, ..., n. When we use such a vector as an index to another vector, it pulls out the elements indicated. The two statements

k.star <- sample(n, replace = TRUE)
LSAT.star <- LSAT[k.star]

do exactly the same thing as the single statement analogous to the mice example

LSAT.star <- sample(LSAT, replace = TRUE)

But (a very big but) the three statements from this example

k.star <- sample(n, replace = TRUE)
LSAT.star <- LSAT[k.star]
GPA.star <- GPA[k.star]

do something very different from the two statements

LSAT.star <- sample(LSAT, replace = TRUE)
GPA.star <- sample(GPA, replace = TRUE)

that a naive person might think would be the way to resample with replacement from the data.
The point is that the wrong thing (using two sample statements) produces independent samples LSAT.star and GPA.star. We don't have to use the bootstrap to know their (population) correlation is exactly zero! The right thing (using just one sample statement and an auxiliary index vector k.star) produces samples LSAT.star and GPA.star that have the same pairing they had in the original data: a pair (LSAT.star[i], GPA.star[i]) is a pair (LSAT[j], GPA[j]) from the original data (for some j).
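Putting the right thing together, here is a minimal sketch of the whole nonparametric bootstrap of the correlation coefficient. The loop statements are from the discussion above; names like nboot and rho.star are assumptions that may not match the example exactly.

library(bootstrap)           # for the law data
attach(law)                  # makes LSAT and GPA available
n <- nrow(law)
rho.hat <- cor(LSAT, GPA)    # the estimate for the original data

nboot <- 1000                # our choice
rho.star <- double(nboot)
for (i in 1:nboot) {
    k.star <- sample(n, replace = TRUE)   # resample indices, not values
    LSAT.star <- LSAT[k.star]             # preserves the (LSAT, GPA) pairing
    GPA.star <- GPA[k.star]
    rho.star[i] <- cor(LSAT.star, GPA.star)
}

sd(rho.star)                 # bootstrap standard error of rho.hat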
Section 6.5 in Efron and Tibshirani.
- And now for something completely different: the parametric bootstrap, which is very different from the nonparametric bootstrap. We do not resample with replacement from the original data. Instead we simulate from the parametric model that is assumed for the data. This is still a bootstrap rather than pure simulation because the estimated parameter value is not the true unknown parameter value. So we still have a "sample is not the population" issue.
In this case the parametric model is bivariate normal.
The mvrnorm function in the MASS library (on-line help) simulates multivariate normal random vectors. A multivariate normal distribution is determined by two structured parameters, the mean vector and the variance matrix (also called the covariance matrix).
But the correlation coefficient is invariant under changes of location and scale. Thus the mean vector does not affect the distribution of the sample correlation coefficient, nor do scale changes. Hence we can use the zero vector for the mean vector, and we can use the correlation matrix for the variance matrix without affecting the distribution of the sample correlation coefficient. The code
cor.mat <- cor(law)

calculates the sample correlation matrix, which has ones on the diagonal and the sample correlation coefficient rho.hat as its off-diagonal elements.
The code

rho.hat <- cor(law[,1], law[,2])

does exactly the same thing (as can be observed by running the examples) as the code

rho.hat <- cor(LSAT, GPA)

in the preceding example. The former treats law as a matrix: law[,1] is its first column and law[,2] is its second column. The latter treats law as an attached data frame in which LSAT is the first column and GPA is the second column. The reason we use the former is that the matrix law.star we simulate in the loop is not a data frame, and the latter syntax wouldn't work for it (without several more statements to convert it into a data frame).
The statement

law.star <- mvrnorm(n, c(0, 0), cor.mat)

produces a random sample from the parametric model with parameter rho.hat. The following statement calculates rho.star[i] from the (parametric) bootstrap data law.star in exactly the same way rho.hat is calculated from the original data law.
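A minimal sketch of the parametric bootstrap loop, under the same naming assumptions as before:

library(bootstrap)           # for the law data
library(MASS)                # for mvrnorm
n <- nrow(law)
cor.mat <- cor(law)          # 2 x 2 sample correlation matrix
rho.hat <- cor(law[,1], law[,2])

nboot <- 1000                # our choice
rho.star <- double(nboot)
for (i in 1:nboot) {
    # simulate n bivariate normal pairs from the fitted model
    law.star <- mvrnorm(n, c(0, 0), cor.mat)
    rho.star[i] <- cor(law.star[,1], law.star[,2])
}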
In calculating Fisher's z, we don't need to do another bootstrap. We have rho.star stored; just transform it to find the sampling distribution of z. The transformation

z <- 0.5 * log((1 + rho) / (1 - rho))

has a name, the inverse hyperbolic tangent, and a standard R function to calculate it,

z <- atanh(rho)

which has an inverse function, the hyperbolic tangent,

rho <- tanh(z)
Because of the skewness of the distribution of rho.hat (as shown by the skewness of the histogram of rho.star),

rho.hat + c(-1, 1) * qnorm(0.975) * sd(rho.star)

is not a very good 95% confidence interval for the true unknown parameter rho. (Point estimate plus or minus 1.96 standard errors of the point estimate assumes normality, or at least approximate normality, and here we aren't anywhere close to normality.)
Because of the approximate normality of the distribution of z.hat (as shown by the approximate normality of the histogram of z.star),

z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star)

is a pretty good 95% confidence interval for the true unknown parameter zeta = atanh(rho). Hence

tanh(z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star))

is a pretty good 95% confidence interval for the true unknown parameter rho.
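Collecting the pieces (a sketch; rho.hat and rho.star are from the parametric bootstrap above):

z.hat <- atanh(rho.hat)      # Fisher's z for the original estimate
z.star <- atanh(rho.star)    # transform the stored bootstrap sample

# not very good: normal-theory interval on the rho scale
rho.hat + c(-1, 1) * qnorm(0.975) * sd(rho.star)

# pretty good: interval on the z scale, mapped back to the rho scale
tanh(z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star))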
The Moral of the Story. Work on a parameter that has an approximately normal estimator. If the original parameter of interest (here rho) doesn't have an approximately normal estimator, then change parameters to one (here zeta) that does. Then transform back to get a confidence interval for the original parameter.
We will return to this theme when we discuss "better" bootstrap confidence intervals (Chapters 14 and 22 in Efron and Tibshirani).
Nonparametric Bootstrap Revisited
Section 6.3 in Efron and Tibshirani.
Having stolen the variance stabilizing transformation trick (the inverse hyperbolic tangent, in this case) from the theoreticians, we can apply it to the nonparametric bootstrap as well.
It is still true that

tanh(z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star))

is a better bootstrap confidence interval for rho than the naive interval

rho.hat + c(-1, 1) * qnorm(0.975) * sd(rho.star)
The inverse hyperbolic tangent transformation is no longer exactly variance stabilizing; that depended on the population being normal. Nevertheless, it still does approximately the right thing, as can be seen from the histograms.
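So the only new code, given rho.hat and the nonparametric rho.star from the earlier example, is the transformation (a sketch):

z.hat <- atanh(rho.hat)      # transform the estimate ...
z.star <- atanh(rho.star)    # ... and the nonparametric bootstrap sample
tanh(z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star))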
Section 7.2 in Efron and Tibshirani.
The data set scor is explained by its on-line help.
- See the comments for the mice example for anything not explained here.

Note that there is only one for loop, and we save all the junk we want to know about in five data structures.
Since the data structure here is complicated, a matrix scor, we do the sampling in two steps:

k.star <- sample(n, replace = TRUE)
scor.star <- scor[k.star, ]

First we generate the vector k.star, which contains the indices (subscripts) of the data vectors that go into the bootstrap sample. Then we use the R subscripting operations to generate the corresponding data structure.
This is analogous to the way we did the nonparametric bootstrap for the correlation coefficient and for a similar reason. Whenever we have a complicated data structure, we will need this trick of resampling indices rather than data.
Note that scor is a matrix (after we do scor <- as.matrix(scor)) and so is scor.star. Every row of scor.star is a row of scor: the i-th row of scor.star is the k.star[i]-th row of scor. And that's exactly what we want, since k.star[i] is a random integer in the range of allowed row indices of scor.
We calculate

var(...) * (n - 1) / n

because Efron and Tibshirani use the variance of the empirical distribution (divide by n) rather than the so-called sample variance (divide by n - 1), which is what the R function var calculates.
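A minimal sketch of such a loop, keeping just the first eigenvalue and eigenvector (the names eigenval.1 and eigenvec.1 are ours; the example saves five such structures):

library(bootstrap)           # for the scor data
scor <- as.matrix(scor)
n <- nrow(scor)
p <- ncol(scor)

nboot <- 1000                # our choice
eigenval.1 <- double(nboot)
eigenvec.1 <- matrix(NA, nboot, p)
for (i in 1:nboot) {
    k.star <- sample(n, replace = TRUE)
    scor.star <- scor[k.star, ]           # resample whole rows (cases)
    e <- eigen(var(scor.star) * (n - 1) / n, symmetric = TRUE)
    eigenval.1[i] <- e$values[1]
    eigenvec.1[i, ] <- e$vectors[, 1]
}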
- The code just after the comment patch up signs solves the problem that Efron and Tibshirani moan about on the bottom half of p. 69 the right way rather than the wrong way. If the sign is arbitrary, fix the signs to be consistent rather than just throwing out the ones with the "wrong" signs, which may bias the bootstrap calculation.
Here we calculate the inner product with the corresponding eigenvector for the original data and adjust the signs of the bootstrap eigenvectors so the inner products are all positive.
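A sketch of that sign patch-up, using the names from the sketch above:

# eigendecomposition of the original data, for reference directions
e.hat <- eigen(var(scor) * (n - 1) / n, symmetric = TRUE)
v.hat <- e.hat$vectors[, 1]

for (i in 1:nboot) {
    if (sum(eigenvec.1[i, ] * v.hat) < 0)      # negative inner product:
        eigenvec.1[i, ] <- - eigenvec.1[i, ]   # flip the arbitrary sign
}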
The boxplot function wants a list of vectors rather than a matrix. The data.frame function converts a matrix into a list of vectors, which are the columns of the matrix. That's why we use it.
You may be wondering how anyone is supposed to come up with that. Simple. The on-line help for the boxplot function has an example illustrating this trick. I just copied the example.
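So the boxplots are produced by something like this (a sketch, with our assumed name eigenvec.1):

boxplot(data.frame(eigenvec.1))   # one box per column (per coordinate)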
A more traditional way to look at the variability of a random vector (such as one of these eigenvectors) is to look at its variance matrix, so we also do that:

var(eigenvec.1)
var(eigenvec.2)

are those variance matrices.
But it's not easy to interpret a five-dimensional variance matrix. So we take a hint from the nature of the problem: if eigenvalues and eigenvectors are in general a good way to look at variance matrices, then they are in particular a good way to look at these variance matrices. Hence we look at the eigenvalues and eigenvectors (of the variance matrix of the bootstrap sampling distribution of an eigenvector of the data). (Eigenvectors of ... an eigenvector ...! Woof!)
It is clear that all of the eigenvalues for the variance of the first eigenvector are much smaller than the corresponding eigenvalues for the second eigenvector. Hence the first eigenvector is much less variable than the second.
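A sketch of that comparison, with eigenvec.2 the analogous matrix for the second eigenvector (an assumed name):

eigen(var(eigenvec.1))    # eigenvalues and eigenvectors of the variance
eigen(var(eigenvec.2))    # matrix of each bootstrap eigenvector sample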
The Moral of the Story
The bootstrap is not just for simple problems. Although a lot of what the book does, and what we do, is simple for pedagogical reasons (to keep the main issues clear), the bootstrap works in very complicated situations where there is no other way to approach the analysis.
The bootstrap keeps going after theory poops out.