The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructor.

You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:

- For simple computer commands, you may just write down the command
you used and the result it gave on your exam solution.
- For complicated commands or plots, make a printout and attach
the printout to your exam solution.

**No credit** for numbers with no indication of where
they came from!

The data for this problem are at the URL

With that URL given to Rweb, two variables `x` and `y` are loaded.
(If you are doing this problem in R rather than Rweb, see the footnote
about reading this data into R.)

This is regression data. We assume the standard model that is nonparametric about the regression function

Y_{i} = g(x_{i}) + σ Z_{i}

where `g` is an unknown smooth function (infinite-dimensional
parameter), σ is an unknown constant (scalar parameter), and the
`Z`_{i} are IID standard normal.

Use the R function `smooth.spline`
to fit a regression function ĝ to these data. Use optimal smoothing,
where optimal is defined by this function's default method
(generalized cross-validation).
[This was fixed Friday; the original statement said ordinary cross-validation.
Either will be accepted.]

Hand in a scatterplot with the smoothing spline regression estimate shown.
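For concreteness, a minimal sketch of one way to do this (assuming the variables `x` and `y` have been loaded as described above):

```r
## Sketch only: assumes x and y are already loaded.
out <- smooth.spline(x, y)    # default criterion is generalized cross-validation
plot(x, y)                    # scatterplot of the data
lines(out, col = "blue")      # add the fitted spline to the plot
```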

This problem continues the analysis started in Question 1 and uses the same data and the same model assumptions.

In this problem we are going to do nonparametric regression the old-fashioned
way, assuming we never heard of smoothing splines, kernel smoothers or the
like. Instead we are going to take a large class of parametric models for the
regression function `g`, and then do model selection

to
figure out which is best.

The point of the exercise is to see that the old-fashioned way is a direct competitor of modern methods. The question could conclude with an essay part where you give your opinion about how old-fashioned and modern methods compare. We won't actually ask the essay question, but you should think about the issue nevertheless.

The particular models we will fit will be polynomials. The R function `poly`
helps with them. For example

out <- lm(y ~ poly(x, 10))
summary(out)
curve(predict(out, newdata = data.frame(x = x)), add = TRUE, col = "red", n = 1001)

adds the best 10-th degree polynomial regression function to the scatterplot
you made in Question 1. The use of `curve` follows the examples
on the page about color and the page about bandwidth selection in smoothing.
The optional argument `n = 1001` makes `curve`
draw a smoother curve, not necessary for 10-th degree polynomials,
but necessary to get nice plots for higher degree polynomials.

We select a polynomial by generalized cross validation. In the context of multiple regression, the formula (30) in the notes on smoothing reduces to

gcv(`k`) = asr(`k`) / [1 − (`k` + 1) / `n`]^2

where asr(`k`) is the average squared residual for the polynomial
regression of degree `k`, for which the trace of the hat matrix equals
the number of regression coefficients, which is `k` + 1 for a
polynomial of degree `k`, and `n` is the length of the data
(which has no relation to the argument `n` of the `curve` function).
If `out` is the output of the `lm` function, then

mean(residuals(out)^2)

is the average squared residual.

Find the `k` in the range from 1 to 40 that gives
the smallest gcv(`k`) for these data.

Hand in a scatterplot with the smoothing spline regression estimate from
Question 1 and the optimal polynomial from this question both shown.

**Hint:** An R technique for finding the index at which a vector `sally`
is minimized is

seq(along = sally)[sally == min(sally)]
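Putting these pieces together, one possible sketch (assuming `x` and `y` are loaded; the degree range 1 to 40 and the minimization idiom come from the problem statement and hint):

```r
## Sketch only: compute gcv(k) for polynomial degrees 1 to 40.
n <- length(x)
kmax <- 40
gcv <- double(kmax)
for (k in 1:kmax) {
    out <- lm(y ~ poly(x, k))
    asr <- mean(residuals(out)^2)           # average squared residual
    gcv[k] <- asr / (1 - (k + 1) / n)^2     # formula above
}
k.best <- seq(along = gcv)[gcv == min(gcv)] # degree minimizing gcv
```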

This problem continues the analysis started in Question 1 and continued in Question 2 and uses the same data and the same model assumptions.

Suppose in the plot that is the answer to Question 1 we want
a confidence interval for the value of the population regression function
`g`(`x`) at `x` = 6.5.

If `out` is the output of the `smooth.spline` function, then

predict(out, 6.5)$y

gives the point estimate ĝ(6.5).

Run an (Efron) nonparametric bootstrap to estimate the sampling
distribution of this estimator of `g`(6.5).

Calculate the 95% bootstrap percentile confidence interval obtained from this sampling distribution.

Hand in a histogram of your bootstrap estimate of the sampling distribution of the estimator showing the endpoints of the confidence interval on the histogram.

[Added Friday] This could be done bootstrapping either cases or residuals and the original statement did not specify. The solution will bootstrap residuals. Either will be accepted.
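A residual-bootstrap sketch along these lines (the number of bootstrap replications, 999, is an arbitrary illustrative choice; `x` and `y` are assumed loaded):

```r
## Sketch only: bootstrap residuals from the smoothing spline fit.
out <- smooth.spline(x, y)
yhat <- predict(out, x)$y            # fitted values at the data points
resid <- y - yhat
nboot <- 999
ghat.star <- double(nboot)
for (i in 1:nboot) {
    y.star <- yhat + sample(resid, replace = TRUE)
    out.star <- smooth.spline(x, y.star)
    ghat.star[i] <- predict(out.star, 6.5)$y
}
ci <- quantile(ghat.star, c(0.025, 0.975))  # percentile interval
hist(ghat.star)
abline(v = ci, lty = 2)                     # mark the endpoints
```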

The data for this problem are at the URL

With that URL given to Rweb, one variable `x` is loaded.
(If you are doing this problem in R rather than Rweb, see the footnote
about reading this data into R.)

This is time series data. We assume the same AR(1) model that was assumed for the data in our first example for bootstrapping time series and in Section 8.5 in Efron and Tibshirani

X_{i} = β X_{i−1} + Z_{i}

where the `Z`_{i} are IID standard normal.

We want to proceed exactly as in our careful example for bootstrapping time series, except that we will use a good estimator

beta.hat <- ar(x, aic = FALSE, order.max = 1)$ar

and we will use a much larger subsample size `b` = 250
(of course the time series length `n` must be, and is, much
larger than that).
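One common subsampling scheme along these lines might look like the sketch below (the exact centering, scaling, and interval construction should follow the careful example; `x` is assumed loaded):

```r
## Sketch only: subsampling bootstrap for the AR(1) coefficient.
b <- 250                              # subsample (block) length
n <- length(x)
beta.hat <- ar(x, aic = FALSE, order.max = 1)$ar
nsub <- n - b + 1                     # number of overlapping subsamples
beta.star <- double(nsub)
for (i in 1:nsub) {
    x.sub <- x[i:(i + b - 1)]
    beta.star[i] <- ar(x.sub, aic = FALSE, order.max = 1)$ar
}
z.star <- sqrt(b) * (beta.star - beta.hat)   # subsampling distribution
crit <- quantile(z.star, c(0.975, 0.025))
beta.hat - crit / sqrt(n)                    # 95% confidence interval
hist(z.star)
abline(v = quantile(z.star, c(0.025, 0.975)), lty = 2)
```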

Hand in the 95% confidence interval calculated as described in our careful example for bootstrapping time series and also hand in a histogram with relevant quantiles marked as done in that example.

If you are doing this problem in R rather than Rweb, you will have to duplicate what Rweb does reading in a URL at the beginning. So all together, you must do for problem 1, for example,

X <- read.table(url("http://www.stat.umn.edu/geyer/s06/5601/mydata/phred.txt"), header = TRUE)
names(X)
attach(X)

to produce the variables `x` and `y` needed for your analysis.

Of course, you read different data files for different problems
that use external data entry, and the variables in those files
may have names other than `x` and `y`.
Everything else stays the same.