Stat 5601 (Geyer) Examples (Bootstrapping Regression)

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Bootstrapping Cases

Sections 9.6 and 9.7 in Efron and Tibshirani.

External Data Entry


• The on-line help for the function `lmsreg` is clear as mud. The name seems to suggest that it does least median of squares regression, but the documentation doesn't say that anywhere.

However, if we decipher the jargon, it says:

• The function `lmsreg` is a compatibility wrapper for the function `lqs`.

• Using `lmsreg` forces `method = "lms"` in `lqs`.

• In the Details section it says the function minimizes some function of the sorted squared residuals; for `method = "lms"` it minimizes the `quantile`-th sorted squared residual, and the default for `quantile` (for this method) is `floor((n + 1) / 2)`. In short, a median.

• Woof!
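Deciphered, the relationship between the two functions can be sketched like this (a minimal sketch on made-up data; `x` and `y` here are hypothetical, not the example dataset):

```
library(MASS)  # provides both lqs and its compatibility wrapper lmsreg

# made-up toy data, just to illustrate the calls
set.seed(42)
x <- runif(14)
y <- 1 - x + rnorm(14, sd = 0.1)

# these two calls fit the same model: lmsreg forces method = "lms"
fit1 <- lmsreg(y ~ x)
fit2 <- lqs(y ~ x, method = "lms")
```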

• Our LMS regression coefficients
```
Rweb:> coefficients(fred)
(Intercept)           x      I(x^2)
 0.64916667 -1.20893617  0.04478648
```
don't agree with the ones reported in Efron and Tibshirani.

LMS regression is notoriously computationally difficult. The exact algorithm must look at all `choose(n, p + 1)` subsets of the cases, fit the regression that exactly interpolates each subset of `p + 1` points, and pick the fit that minimizes the median squared residual (where `p` is the number of nonconstant predictors).

Of course, in this toy data set, `choose(14, 3) = 364` is not very large. So the algorithm does find the exact solution (according to the documentation).

But many older LMS regression algorithms only sample subsets of size `p + 1` and hence do not find the exact solution (the `lqs` function also does this when `choose(n, p + 1)` is greater than 5000). Then LMS presumably stands for low (somewhat) rather than least median of squares.
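The sizes involved are easy to check directly; `choose(50, 3)` below is just a hypothetical larger problem for contrast with the toy data set:

```
choose(14, 3)  # 364: small enough for an exhaustive search
choose(50, 3)  # 19600: over 5000, so lqs would fall back to sampling subsets
```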

Maybe Efron and Tibshirani used a buggy LMS regression routine (or maybe we are), or maybe theirs just used sampling and didn't find the right answer. Or maybe I'm not doing this right. Our picture looks sensible, though.

• The code
```
points(x, y, pch = 16)
curve(predict(fred, data.frame(x = x)),
      add = TRUE, lwd = 2)
```
puts the original data and LMS regression curve back on the plot after they have been painted over in red by all of the bootstrap LMS regression lines drawn by the `curve` statement in the loop. The `pch = 16` and `lwd = 2` make heavier points and lines.

If you don't want this to take so long, making `nboot` smaller draws the picture faster (though, of course, it makes the bootstrap standard errors less accurate).
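For context, the loop being described might look something like the following sketch (assuming vectors `x` and `y`, sample size `n`, and the original fit `fred`; the variable names are guesses at the actual code on this page, not the actual code):

```
library(MASS)

nboot <- 200
for (iboot in 1:nboot) {
    k.star <- sample(n, replace = TRUE)     # resample cases with replacement
    x.star <- x[k.star]
    y.star <- y[k.star]
    fred.star <- lqs(y.star ~ x.star + I(x.star^2), method = "lms")
    curve(predict(fred.star, data.frame(x.star = x)),
          add = TRUE, col = "red")          # one bootstrap regression curve
}
points(x, y, pch = 16)                      # restore the original data
curve(predict(fred, data.frame(x = x)), add = TRUE, lwd = 2)
```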

• Efron and Tibshirani don't plot bootstrap regression lines in this section, but they should. The bizarreness of the picture casts doubt on the sensibleness of the whole procedure (how's that for academic weasel wording).

I changed the color specifications on this page to character strings specifying color names. If you don't like them you can change them.

```
colors()
```
prints all the color names that R knows about, 657 of them. To see just the varieties of green, search this vector for names containing "green".
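One way to do the search (this `grep` call is my suggestion, not necessarily the one on the original page):

```
colors()[grep("green", colors())]
```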

If you want gray instead of a color, use some variant of gray (searching for `"gray"` instead of `"green"` in the output of `colors()` shows all the available shades), or just pick a number: `"gray0"`, `"gray100"`, and gray with all the numbers in between are valid color names.

Bootstrapping Residuals

Sections 9.6 and 9.7 in Efron and Tibshirani.

External Data Entry


• Everything is the same as above except the generation of the bootstrap data. So see the comments above about what is not changed.

• The residuals and predicted values for the original data are assigned to the vectors `resid` and `pred` for use in the bootstrap.
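In code, that setup amounts to something like the following (a sketch, assuming the fit to the original data is stored as `fred`):

```
resid <- residuals(fred)  # residuals of the original fit
pred <- predict(fred)     # fitted (predicted) values of the original fit
```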

• The bootstrap data, which were `x.star` and `y.star` above (when bootstrapping cases) are now `x` and `y.star` with
```
k.star <- sample(n, replace = TRUE)
y.star <- pred + resid[k.star]
```
Note that the X values don't change. This bootstrap attempts to simulate the conditional distribution of Y given X, so, as always with conditional probability, the variables (here X) we are conditioning on are treated as if constant. Hence the bootstrap leaves them alone. We construct the new Y by attaching random residuals (a bootstrap sample of residuals) to the original predicted values.

The two lines of code just above could be replaced by the one line

```
y.star <- pred + sample(resid, replace = TRUE)
```
without changing the algorithm. Since we are only sampling a single vector (`resid`) we don't need the `k.star` trick.
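To see that the two forms agree, note that `sample(resid, replace = TRUE)` draws indices internally and subsets `resid` with them, exactly as the `k.star` version does. A small self-contained check with a made-up residual vector:

```
resid <- c(0.3, -1.2, 0.5, 0.1, -0.7)
n <- length(resid)

set.seed(7)
a <- resid[sample(n, replace = TRUE)]  # two-line (k.star) version
set.seed(7)
b <- sample(resid, replace = TRUE)     # one-line version
identical(a, b)                        # same seed, same draws
```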

• Note that the picture is much less bizarre than when bootstrapping cases. Exercise for the reader: why? Hint: it has something to do with the engineering details, also called nuts and bolts, of the procedures.