To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.
The statement

library(bootstrap)

says we are going to use functions and/or data in the bootstrap library, which are not available before this command is executed.
In this example, it is a fair guess that the name of the data set may have something to do with mice. The only ones on the list that look like that are mouse.c and mouse.t.
This web page used to complain that the help for these data sets was useless (and two years ago it was), but the help is now as it should be; see, for example, the on-line help for the mouse data.
To be really sure, just (as we do above) print the data and compare it with the book.
The for loop makes nboot bootstrap samples x.star and calculates for each the statistical functional of interest, theta.star[i], which in this example is the mean.
The histogram of the bootstrap distribution is what

hist(theta.star)
abline(v = theta.hat, lty = 2)

does. The vertical dotted line added by the second statement is at theta.hat.
This plot shows whether the distribution is normal or not and whether the estimator is biased or not.
This example differs from the preceding one only in the estimator being a 15% trimmed mean rather than the ordinary mean. Everything else is the same.
The mean function from the preceding example is replaced by the foo function, which calculates the 15% trimmed mean. Everything else is the same, except that the last statement in the preceding example, which gives the theoretical standard error of the ordinary sample mean, is irrelevant here, because even if we knew a lot of theory we still couldn't give a useful theoretical standard error for a 15% trimmed mean (or most other complicated estimators).
The foo function is defined using the R function function (on-line help), that is, the R function whose name is function. The function foo has one argument, x, because the defining expression function(x) has one argument named x.
When foo is called, R executes the function body [the expression following function(x)], replacing each reference to the argument x by the actual object in the function call. That is, either of

foo(sally)
foo(x = sally)

has exactly the same effect as executing the function body with x replaced by sally, namely

mean(sally, trim = 0.15)
It is good practice to define a variable like

conf.level <- 0.95

at the top of an example to make the thingy (here the confidence level) easy to change. You know that if you change the definition of conf.level here, you don't need to change it anywhere else. If you don't use a variable, then you need to search through the code looking for instances of 0.95 to perhaps change. We say perhaps because you have to look carefully at each instance to see how the number is being used. The same principle applies to defining a function foo at the top of an example: we know we only need to make a change in one place.
The explanation of the R function function above is a bit oversimplified. See the newly rewritten and expanded section on writing functions in our R intro page.
The library statement is the same as before. The law data set for this example is an R data frame.
Here the attach(law) statement does the same job as the attach(X) at the top of any Rweb invocation that uses external data entry: it makes the variables in the data frame (here LSAT and GPA) available as ordinary variables.
The law data set is explained by its on-line help.
The sample statement in this example does something a little different from the mice example above. The problem is that we don't have a univariate random variable to sample. We need to sample (with replacement from the original sample) the pairs (X_k, Y_k). We do that by sampling random indices. The vector k.star produced by the sample statement is a random sample with replacement from the numbers 1, 2, ..., n. When we use such a vector as an index to another vector, it pulls out the elements indicated.
The two statements

k.star <- sample(n, replace = TRUE)
LSAT.star <- LSAT[k.star]

do exactly the same thing as the single statement analogous to the mice example

LSAT.star <- sample(LSAT, replace = TRUE)

But (a very big but) the three statements from this example

k.star <- sample(n, replace = TRUE)
LSAT.star <- LSAT[k.star]
GPA.star <- GPA[k.star]

do something very different from the two statements

LSAT.star <- sample(LSAT, replace = TRUE)
GPA.star <- sample(GPA, replace = TRUE)

that a naive person might think would be the way to resample with replacement from the data.
The point is that the wrong thing (using two sample statements) produces independent samples LSAT.star and GPA.star. We don't have to use the bootstrap to know that their true correlation is exactly zero!
The right thing (using just one sample statement and an auxiliary index vector k.star) produces samples LSAT.star and GPA.star that have the same pairing they had in the original data. A pair (LSAT.star[i], GPA.star[i]) is a pair (LSAT[j], GPA[j]) from the original data (for some j).
Here we do not resample with replacement from the original data. Instead we simulate from the parametric model that is assumed for the data. This is still a bootstrap rather than a pure simulation because the estimated parameter value is not the true unknown parameter value. So we still have the "sample is not the population" issue.
The mvrnorm function in the MASS library (on-line help) simulates multivariate normal random vectors. A multivariate normal distribution is determined by two structured parameters, the mean vector and the variance matrix (also called the covariance matrix, variance-covariance matrix, or dispersion matrix).
But the correlation coefficient is invariant under changes of location and scale. Thus the mean vector does not affect the distribution of the sample correlation coefficient, nor do scale changes. Hence we can use the zero vector for the mean vector, and we can use the correlation matrix for the variance matrix without affecting the distribution of the sample correlation coefficient. The code
The code

cor.mat <- cor(law)

calculates the sample correlation matrix, which the example then prints. It has ones on the diagonal and the sample correlation coefficient rho.hat as its off-diagonal elements.
The code

rho.hat <- cor(law[,1], law[,2])

does exactly the same thing (as can be observed by running the examples) as the code

rho.hat <- cor(LSAT, GPA)

in the preceding example. The former treats law as a matrix: law[,1] is its first column and law[,2] is its second column. The latter treats law as an attached data frame in which LSAT is the first column and GPA is the second. The reason we use the former is that the matrix law.star we simulate in the loop is not a data frame, and the latter syntax wouldn't work for it (without several more statements to convert it into a data frame).
The statement

law.star <- mvrnorm(n, c(0, 0), cor.mat)

produces a random sample from the parametric model with parameter rho.hat.
The following statement calculates rho.star[i] from the (parametric) bootstrap data law.star in exactly the same way rho.hat is calculated from the original data law.
We already have the bootstrap distribution rho.star stored. Just transform it to find the sampling distribution of z. The transformation

z <- 0.5 * log((1 + rho) / (1 - rho))

has a name, the inverse hyperbolic tangent, and there is a standard R function to calculate it,

z <- atanh(rho)

which has an inverse function, the hyperbolic tangent,

rho <- tanh(z)
Because of the skewness of the distribution of rho.hat (as shown by the skewness of the histogram of rho.star),

rho.hat + c(-1, 1) * qnorm(0.975) * sd(rho.star)

is not a very good 95% confidence interval for the true unknown parameter rho. (Point estimate plus or minus 1.96 standard errors of the point estimate assumes normality, or at least approximate normality, and here we aren't anywhere close to normality.)
Because of the approximate normality of the distribution of z.hat (as shown by the approximate normality of the histogram of z.star),

z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star)

is a pretty good 95% confidence interval for the true unknown parameter zeta = atanh(rho). Hence

tanh(z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star))

is a pretty good 95% confidence interval for the true unknown parameter rho.
The Moral of the Story. Work on a parameter that has an approximately normal estimator. If the original parameter of interest (here rho) doesn't have an approximately normal estimator, then change parameters to one (here zeta) that does. Then transform back to get a confidence interval for the original parameter.
We will return to this theme when we discuss better bootstrap confidence intervals (Chapters 14 and 22 in Efron and Tibshirani).
Having stolen the variance stabilizing transformation trick (hyperbolic tangent, in this case) from the theoreticians, we can apply it to the nonparametric bootstrap as well.
It is still true that

tanh(z.hat + c(-1, 1) * qnorm(0.975) * sd(z.star))

is a better bootstrap confidence interval for rho than the naive interval

rho.hat + c(-1, 1) * qnorm(0.975) * sd(rho.star)
The hyperbolic tangent transformation is no longer exactly variance stabilizing. That depended on the population being normal. Nevertheless, it still does approximately the right thing, as can be seen from the histograms.
The scor data set is explained by its on-line help.
The library statement is the same as before. Again there is a bootstrap for loop, and we save all the junk we want to know about in five data structures: eigenval.1, eigenval.2, eigenval.sum, eigenvec.1, and eigenvec.2.
Because the data set scor is multivariate, we do the sampling in two steps

k.star <- sample(n, replace = TRUE)
scor.star <- scor[k.star, ]

First we generate the vector k.star, which contains the indices (subscripts) of the data vectors that go into the bootstrap sample. Then we use the R subscripting operations to generate the corresponding data structure.
This is analogous to the way we did the nonparametric bootstrap for the correlation coefficient and for a similar reason. Whenever we have a complicated data structure, we will need this trick of resampling indices rather than data.
The data set scor is a matrix (after we do scor <- as.matrix(scor)) and so is scor.star. Every row of scor.star is a row of scor: the i-th row of scor.star is the k.star[i]-th row of scor. And that's exactly what we want, since k.star[i] is a random integer in the range of allowed row indices of scor.
We replace var(...) with var(...) * (n - 1) / n because Efron and Tibshirani use the variance of the empirical distribution (divide by n) rather than the so-called sample variance (divide by n - 1), which is what the R var function calculates.
The section of code marked patch up signs . . . . solves the problem that Efron and Tibshirani moan about on the bottom half of p. 69 the right way rather than the wrong way. If the sign is arbitrary, fix the signs to be consistent rather than just throwing out the ones with the "wrong" signs, which may bias the bootstrap calculation. Here we calculate the inner product with the corresponding eigenvector for the original data and adjust the signs of the bootstrap eigenvectors so the inner products are all positive.
The boxplot function wants a list of vectors rather than a matrix. The data.frame function converts a matrix into a list of vectors, which are the columns of the matrix. That's why we use it.
You may be wondering how anyone is supposed to come up with that. Simple. The on-line help for the boxplot function has an example illustrating this trick. I just copied the example.
The matrices var(eigenvec.1) and var(eigenvec.2) are those variance matrices. But it's not easy to interpret a five-dimensional variance matrix, so we take a hint from the nature of the problem. If eigenvalues and eigenvectors are in general a good way to look at variance matrices, then they are in particular a good way to look at var(eigenvec.1) and var(eigenvec.2).
Hence we look at the eigenvalues and eigenvectors (of the variance matrix of the bootstrap sampling distribution of an eigenvector of the data). (eigenvectors of … an eigenvector …! Woof!)
It is clear that all of the eigenvalues for the variance of the first eigenvector are much smaller than the corresponding eigenvalues for the second eigenvector. Hence the first eigenvector is much less variable than the second.
The bootstrap is not just for simple problems. For pedagogical reasons, a lot of what the book does (and what we do here) is simple, to keep the main issues clear, but the bootstrap works in very complicated situations where there is no other way to approach the analysis.
The bootstrap keeps going after theory poops out.