General Instructions
To do each example, just click the Submit
button.
You do not have to type in any R instructions or specify a dataset.
That's already done for you.
Bootstrapping Cases
Sections 9.6 and 9.7 in Efron and Tibshirani.
Comments
- The xlim and ylim arguments to the plot function make sure that the point (0, 0) is on the plot, so we can see that the estimated regression functions do pass through (0, 0).
- The on-line help for the function lmsreg is clear as mud. The name seems to suggest that it does least median of squares regression, but the documentation doesn't say that anywhere. However, if we decipher the jargon, it says:
  - The function lmsreg is a compatibility wrapper for the function lqs.
  - Using lmsreg forces method = "lms" in lqs.
  - In the Details section it says the function minimizes some function of the sorted squared residuals, for method = "lms" the quantile squared residual. And the default for quantile (for this method) is floor((n+1)/2). In short, a median.
  - Woof!
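To see the wrapper relationship concretely, here is a minimal sketch (the data are made up for illustration, not the dose-response data) showing that lmsreg and lqs with method = "lms" produce the same fit. Both functions are in the MASS package.

```r
library(MASS)  # lqs and its compatibility wrapper lmsreg live here

# Made-up data purely for illustration
set.seed(42)
x <- runif(20)
y <- 2 * x + rnorm(20, sd = 0.1)

# Reset the RNG seed before each call, in case lqs resorts to random
# subset sampling rather than an exhaustive search
set.seed(1)
fit.wrapper <- lmsreg(y ~ x)
set.seed(1)
fit.direct <- lqs(y ~ x, method = "lms")

all.equal(coef(fit.wrapper), coef(fit.direct))  # TRUE
```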
- The 0 + in the R formula expression means omit the intercept, as Efron and Tibshirani discuss near the bottom of p. 116. This seems not to be mentioned in the on-line help for lqs, but see the on-line help for lm (which does ordinary least squares regression) or the on-line help for formula.
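A tiny made-up example showing the effect of 0 + in a formula:

```r
# Same model with and without the 0 +
x <- 1:5
y <- 2 * x + rnorm(5, sd = 0.01)

with.intercept <- lm(y ~ x)
no.intercept <- lm(y ~ 0 + x)  # the 0 + omits the intercept

names(coef(with.intercept))  # "(Intercept)" "x"
names(coef(no.intercept))    # "x"
```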
- Our LMS regression coefficients

      Rweb:> coefficients(fred)
            dose  I(dose^2)
      -0.6683283 -0.0258682

  don't agree with the ones reported in Efron and Tibshirani. LMS regression is notoriously computationally difficult. The algorithm must look at all choose(n, p + 1) subsets of the cases to find the regression that exactly fits p + 1 points and minimizes the median of the squared residuals (where p is the number of nonconstant predictors). Of course, in this toy data set, choose(14, 3) = 364 is not very large, so the algorithm does find the exact solution (according to the documentation). But many older LMS regression algorithms only sample subsets of size p + 1 and hence do not find the exact solution (lmsreg doesn't either when choose(n, p + 1) is greater than 5000). Then LMS presumably stands for low (somewhat) rather than least median of squares. Maybe Efron and Tibshirani used a buggy LMS regression routine (or maybe we are), or maybe theirs just used sampling and didn't find the right answer. Or maybe I'm not doing this right. Our picture looks sensible, though.
- The code

      points(x, y, pch = 16)
      curve(predict(fred, data.frame(dose = x)), add = TRUE, lwd = 2)

  puts the original data and the LMS regression curve back on the plot after they have been painted over in color "plum" by all of the bootstrap LMS regression lines drawn by the curve statement in the loop. The pch = 16 and lwd = 2 make heavier points and lines. If you don't want this to take so long, making nboot smaller draws the picture faster (though, of course, it makes the bootstrap standard errors less accurate).
- Efron and Tibshirani don't plot bootstrap regression lines in this section, but they should. The bizarreness of the picture casts doubt on the sensibleness of the whole procedure (how's that for academic weasel wording).
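For readers who want the shape of the whole procedure, here is a hedged sketch of bootstrapping cases with LMS regression. The data are stand-ins (the real analysis uses the dose-response data from Efron and Tibshirani), and the variable names (dose, y, nboot, the name fred.star) just mimic the ones used above.

```r
library(MASS)  # for lmsreg

# Stand-in data for illustration; the real analysis uses the
# dose-response data from Efron and Tibshirani
set.seed(17)
n <- 14
dose <- runif(n, 0, 3)
y <- -0.7 * dose - 0.03 * dose^2 + rnorm(n, sd = 0.1)

nboot <- 100
beta.star <- matrix(NA, nboot, 2)
for (iboot in 1:nboot) {
    k.star <- sample(n, replace = TRUE)  # resample whole cases (x, y pairs)
    x.star <- dose[k.star]
    y.star <- y[k.star]
    # no-intercept quadratic LMS fit, as in the text
    fred.star <- lmsreg(y.star ~ 0 + x.star + I(x.star^2))
    beta.star[iboot, ] <- coefficients(fred.star)
}
apply(beta.star, 2, sd)  # bootstrap standard errors of the coefficients
```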
Bootstrapping Residuals
Sections 9.6 and 9.7 in Efron and Tibshirani.
Comments
- Everything is the same as above except the generation of the bootstrap
data. So see the comments above about what is not changed.
- The residuals and predicted values for the original data are assigned to the vectors resid and pred for use in the bootstrap.
- The bootstrap data, which were x.star and y.star above (when bootstrapping cases), are now x and y.star, with

      k.star <- sample(n, replace = TRUE)
      y.star <- pred + resid[k.star]

  Note that the X values don't change. This bootstrap attempts to simulate the conditional distribution of Y given X, so, as always with conditional probability, the variables we are conditioning on (here X) are treated as if constant. Hence the bootstrap leaves them alone. We construct the new Y by attaching random residuals (a bootstrap sample of residuals) to the original predicted values. The two lines of code just above could be replaced by the one line

      y.star <- pred + sample(resid, replace = TRUE)

  without changing the algorithm. Since we are only sampling a single vector (resid), we don't need the k.star trick.
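The equivalence of the two-line and one-line versions can be checked directly. This sketch uses stand-in residuals and predicted values (not the ones from the actual fit):

```r
# Stand-in residuals and predicted values from a hypothetical fit
set.seed(2)
resid <- rnorm(10)
pred <- rnorm(10)

set.seed(99)
k.star <- sample(10, replace = TRUE)
y.star.two <- pred + resid[k.star]  # the two-line version

set.seed(99)
y.star.one <- pred + sample(resid, replace = TRUE)  # the one-line version

# Both versions consume the same random index draws, so with the same
# seed they produce identical bootstrap data
identical(y.star.two, y.star.one)  # TRUE
```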
- Note that the picture is much less bizarre than when bootstrapping cases. Exercise for the reader: why? Hint: it has something to do with the engineering details, also called nuts and bolts, of the procedures.