To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.
lmsreg is clear as mud. The name seems to suggest that it does least median of squares regression, but the documentation doesn't say that anywhere. However, if we decipher the jargon, it says: lmsreg is a compatibility wrapper for the function lqs. lmsreg forces method = "lms" in lqs.
In the Details section it says the function minimizes some function of the sorted squared residuals: for method = "lms", the quantile-th squared residual. And the default for quantile (for this method) is floor((n+1)/2). In short, a median.
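Putting the deciphered pieces together, the fit in this example presumably amounts to something like the following sketch (the data vectors x and y, the quadratic formula, and the name fred are taken from the example; the exact call on the Rweb form may differ):

library(MASS)                    # provides lmsreg and lqs
fred <- lmsreg(y ~ x + I(x^2))   # same as lqs(y ~ x + I(x^2), method = "lms")
## with n = 14 cases, the "median" actually minimized is the
## floor((14 + 1)/2) = 7th smallest squared residual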
Rweb:> coefficients(fred)
(Intercept)           x      I(x^2)
 0.64916667 -1.20893617  0.04478648

These coefficients don't agree with the ones reported in Efron and Tibshirani.
LMS regression is notoriously computationally difficult. The algorithm must look at all choose(n, p + 1) subsets of the cases to find the regression that exactly fits p + 1 points and minimizes the median of the squared residuals (where p is the number of nonconstant predictors). Of course, in this toy data set, choose(14, 3) = 364 is not very large. So the algorithm does find the exact solution (according to the documentation).
But many older LMS regression algorithms only sample subsets of size p + 1 and hence do not find the exact solution (as does the lqs function when choose(n, p + 1) is greater than 5000). Then LMS presumably stands for low (somewhat) rather than least median of squares.
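As a small sketch of how the numbers work out, and of how one could insist on an exhaustive search (nsamp is an argument documented for lqs; the example itself presumably just relies on the default):

choose(14, 3)    # 364, well under the 5000 limit
library(MASS)
## the default nsamp = "best" enumerates all choose(n, p + 1) subsets only
## when there are at most 5000 of them, and samples subsets otherwise;
## nsamp = "exact" forces exhaustive enumeration regardless of the count
lqs(y ~ x + I(x^2), method = "lms", nsamp = "exact")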
Maybe Efron and Tibshirani used a buggy LMS regression routine (or maybe we are using one), or maybe theirs just used sampling and didn't find the right answer. Or maybe I'm not doing this right. Our picture looks sensible, though.
The code

points(x, y, pch = 16)
curve(predict(fred, data.frame(x = x)), add = TRUE, lwd = 2)

puts the original data and LMS regression curve back on the plot after they have been painted over in red by all of the bootstrap LMS regression lines drawn by the curve statement in the loop. The pch = 16 and lwd = 2 make heavier points and lines.
If you don't want this to take so long, making nboot smaller draws the picture faster (though, of course, it makes the bootstrap standard errors less accurate).
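For reference, the kind of loop being described here (bootstrapping cases, with each bootstrap LMS fit drawn in red) looks roughly like the following sketch. The names nboot, x, y, x.star, y.star, and fred come from the example; the value of nboot and other details of the actual Rweb code may differ.

library(MASS)                       # for lqs
nboot <- 200                        # placeholder value; the example sets its own nboot
n <- length(x)
coef.star <- matrix(NA, nboot, 3)   # to hold bootstrap regression coefficients
plot(x, y)
for (i in 1:nboot) {
    k.star <- sample(n, replace = TRUE)   # resample cases with replacement
    x.star <- x[k.star]
    y.star <- y[k.star]
    fred.star <- lqs(y.star ~ x.star + I(x.star^2), method = "lms")
    coef.star[i, ] <- coef(fred.star)
    curve(predict(fred.star, data.frame(x.star = x)), add = TRUE, col = "red")
}
points(x, y, pch = 16)              # redraw the data over the red curves
curve(predict(fred, data.frame(x = x)), add = TRUE, lwd = 2)
apply(coef.star, 2, sd)             # bootstrap standard errors of the coefficients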
The residuals and predicted values from the original fit are saved in resid and pred for use in the bootstrap.
What were x.star and y.star above (when bootstrapping cases) are now x and y.star, with

k.star <- sample(n, replace = TRUE)
y.star <- pred + resid[k.star]

Note that the X values don't change. This bootstrap attempts to simulate the conditional distribution of Y given X, so, as always with conditional probability, the variables (here X) we are conditioning on are treated as if constant. Hence the bootstrap leaves them alone. We construct the new Y by attaching random residuals (a bootstrap sample of residuals) to the original predicted values.
The two lines of code just above could be replaced by the one line

y.star <- pred + sample(resid, replace = TRUE)

without changing the algorithm. Since we are only sampling a single vector (resid) we don't need the k.star trick.
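Putting the residual bootstrap together, the loop would look something like this sketch (again using the example's names; how the example actually computes pred and resid and draws its picture may differ):

library(MASS)                      # for lqs
pred <- fitted(fred)               # predicted values from the original fit
resid <- residuals(fred)           # residuals from the original fit (same as y - pred)
for (i in 1:nboot) {
    y.star <- pred + sample(resid, replace = TRUE)   # new responses, same old x
    fred.star <- lqs(y.star ~ x + I(x^2), method = "lms")
    curve(predict(fred.star, data.frame(x = x)), add = TRUE, col = "red")
}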
These are the engineering details, also called the nuts and bolts, of the procedures.