Statistics 5601 (Geyer, Spring 2006) Final Exam


General Instructions

The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructor.

You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:

For simple computer commands, you may just write down the command you used and the result it gave on your exam solution.
For complicated commands or plots, make a printout and attach the printout to your exam solution.

No credit for numbers with no indication of where they came from!

Question 1 [25 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/s06/5601/mydata/phred.txt

With that URL given to Rweb, two variables x and y are loaded. (If you are doing this problem in R rather than Rweb, see the footnote about reading these data into R.)

This is regression data. We assume the standard model that is nonparametric about the regression function

Y_i = g(X_i) + σ Z_i

where g is an unknown smooth function (infinite-dimensional parameter), σ is an unknown constant (scalar parameter), and the Z_i are IID standard normal.

Use the R function smooth.spline (on-line help) to fit a regression function (g hat) to these data. Use optimal smoothing, where optimal is defined by this function's default method (generalized cross-validation). [This was fixed Friday; the original statement said ordinary cross-validation. Either will be accepted.]

Hand in a scatterplot with the smoothing spline regression estimate shown.
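A minimal sketch of such a fit, run on simulated data rather than on phred.txt, might look like the following. The sine-curve regression function, the sample size, the noise level, and the names fit and xx are all assumptions made for illustration only.

```r
# Simulated stand-in for the exam data (an assumption, not phred.txt)
set.seed(42)
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + 0.3 * rnorm(n)

# With the default cv = FALSE the smoothing parameter is chosen by
# generalized cross-validation; cv = TRUE would use ordinary
# leave-one-out cross-validation instead.
fit <- smooth.spline(x, y)

# Scatterplot with the fitted curve evaluated on a fine grid
plot(x, y)
xx <- seq(min(x), max(x), length.out = 1001)
lines(predict(fit, xx), col = "blue")
```

Here predict returns a list with components x and y, which lines accepts directly.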

Question 2 [25 pts.]

This problem continues the analysis started in Question 1 and uses the same data and the same model assumptions.

In this problem we are going to do nonparametric regression the old-fashioned way, assuming we never heard of smoothing splines, kernel smoothers or the like. Instead we are going to take a large class of parametric models for the regression function g, and then do model selection to figure out which is best.

The point of the exercise is to see that the old-fashioned way is a direct competitor of modern methods. The question could conclude with an essay part where you give your opinion about how old-fashioned and modern methods compare. We won't actually ask the essay question, but you should think about the issue nevertheless.

The particular models we will fit will be polynomials. The R poly (on-line help) function helps with them. For example

out <- lm(y ~ poly(x, 10))
summary(out)
curve(predict(out, newdata = data.frame(x = x)),
    add = TRUE, col = "red", n = 1001)

adds the best 10-th degree polynomial regression function to the scatterplot you made in Question 1. The use of curve (on-line help) follows the examples on the page about color and the page about bandwidth selection in smoothing. The optional argument n = 1001 makes curve draw a smoother curve, not necessary for 10-th degree polynomials, but necessary to get nice plots for higher degree polynomials.

We select a polynomial by generalized cross validation. In the context of multiple regression, the formula (30) in the notes on smoothing reduces to

gcv(k) = asr(k) ⁄ [1 − (k + 1) ⁄ n]²

where asr(k) is the average squared residual for the polynomial regression of degree k and n is the number of observations (which has no relation to the argument n to the curve function). The k + 1 in the denominator is the trace of the hat matrix, which for a linear model is the number of regression coefficients, and that is k + 1 for a polynomial of degree k. If out is the output of the lm function (on-line help), then

mean(residuals(out)^2)

is the average squared residual.

Find the k in the range from 1 to 40 that gives the smallest gcv(k) for these data.

Hand in a scatterplot with the smoothing spline regression estimate from Question 1 and the optimal polynomial from this question both shown.

Hint: An R technique for finding the index at which a vector sally is minimized is

seq(along = sally)[sally == min(sally)]
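The pieces above (the gcv formula, the average squared residual, and the minimization idiom) can be combined roughly as follows. This is only a sketch on simulated data: the data-generating model, the function name gcv, the variable names degrees and k.best, and the shorter range of degrees (1 to 15 rather than 1 to 40, to keep the sketch small) are assumptions for illustration.

```r
# Simulated stand-in for the exam data (an assumption, not phred.txt)
set.seed(42)
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + 0.3 * rnorm(n)

# gcv(k) = asr(k) / [1 - (k + 1) / n]^2 for a polynomial of degree k
gcv <- function(k) {
    out <- lm(y ~ poly(x, k))
    asr <- mean(residuals(out)^2)   # average squared residual
    asr / (1 - (k + 1) / n)^2
}

degrees <- 1:15   # the exam asks for 1:40; shortened for this sketch
sally <- sapply(degrees, gcv)

# Index of the minimum, using the idiom from the hint
k.best <- degrees[seq(along = sally)[sally == min(sally)]]
k.best
```

Because degrees starts at 1, the index returned by the idiom is also the degree itself; with a different range the lookup through degrees is what keeps the two in step.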

Question 3 [25 pts.]

This problem continues the analysis started in Question 1 and continued in Question 2 and uses the same data and the same model assumptions.

Suppose in the plot that is the answer to Question 1 we want a confidence interval for the value of the population regression function g(x) at x = 6.5.

If out is the output of the smooth.spline function (on-line help), then

predict(out, 6.5)$y

gives that prediction.

Run an (Efron) nonparametric bootstrap to estimate the sampling distribution of this estimator of g(6.5).

Calculate the 95% bootstrap percentile confidence interval obtained from this sampling distribution.

Hand in a histogram of your bootstrap estimate of the sampling distribution of the estimator showing the endpoints of the confidence interval on the histogram.

[Added Friday] This could be done bootstrapping either cases or residuals and the original statement did not specify. The solution will bootstrap residuals. Either will be accepted.
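One way to bootstrap residuals for this estimator is sketched below, again on simulated data. The data-generating model, the number of bootstrap replicates nboot, and the names x0, y.hat, resid, theta.star, and ci are assumptions for illustration; this is not claimed to match the official solution.

```r
# Simulated stand-in for the exam data (an assumption, not phred.txt)
set.seed(42)
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + 0.3 * rnorm(n)

x0 <- 6.5
fit <- smooth.spline(x, y)
y.hat <- predict(fit, x)$y       # fitted values at the observed x
resid <- y - y.hat               # residuals to be resampled
theta.hat <- predict(fit, x0)$y  # estimate of g(6.5)

nboot <- 200                     # small, to keep the sketch fast
theta.star <- double(nboot)
for (i in 1:nboot) {
    # Resample residuals with replacement, keeping x fixed
    y.star <- y.hat + sample(resid, replace = TRUE)
    fit.star <- smooth.spline(x, y.star)   # smoothing re-chosen by GCV
    theta.star[i] <- predict(fit.star, x0)$y
}

# 95% percentile interval: 2.5% and 97.5% quantiles of theta.star
ci <- quantile(theta.star, c(0.025, 0.975))

hist(theta.star)
abline(v = ci, lty = 2)
```

Bootstrapping cases would instead resample the (x, y) pairs together, which can produce tied x values in the bootstrap sample.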

Question 4 [25 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/s06/5601/mydata/rhino.txt

With that URL given to Rweb, one variable x is loaded. (If you are doing this problem in R rather than Rweb, see the footnote about reading these data into R.)

This is time series data. We assume the same AR(1) model that was assumed for the data in our first example for bootstrapping time series and in Section 8.5 in Efron and Tibshirani

X_i = μ + Y_i

and

Y_i = β Y_{i − 1} + σ Z_i

where the Z_i are IID standard normal.

We want to do exactly as in our careful example for bootstrapping time series except that we will use a good estimator

beta.hat <- ar(x, aic = FALSE, order.max = 1)$ar

and we will use a much larger subsample size b = 250 (of course the time series length n must be, and is, much larger than that).

Hand in the 95% confidence interval calculated as described in our careful example for bootstrapping time series and also hand in a histogram with relevant quantiles marked as done in that example.
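The general shape of a subsampling bootstrap for this estimator is sketched below on a simulated AR(1) series. Everything beyond the estimator and b = 250 is an assumption: the simulated series, the use of all n − b + 1 overlapping blocks, the square-root rate of convergence, and the form of the interval are the usual subsampling recipe and are not confirmed by this exam to be what the careful example does.

```r
# Simulated AR(1) stand-in for the exam data (an assumption, not rhino.txt)
set.seed(42)
n <- 2000
x <- 10 + arima.sim(list(ar = 0.5), n = n)

estimator <- function(x) as.numeric(ar(x, aic = FALSE, order.max = 1)$ar)
beta.hat <- estimator(x)

b <- 250                  # subsample size
nsub <- n - b + 1         # number of overlapping blocks of length b
beta.star <- double(nsub)
for (i in 1:nsub) {
    beta.star[i] <- estimator(x[i:(i + b - 1)])
}

# Assumed root-n rate: sqrt(b) * (beta.star - beta.hat) approximates
# the sampling distribution of sqrt(n) * (beta.hat - beta)
z.star <- sqrt(b) * (beta.star - beta.hat)
crit <- unname(quantile(z.star, c(0.975, 0.025)))
ci <- beta.hat - crit / sqrt(n)   # lower endpoint first

hist(z.star)
abline(v = crit, lty = 2)
```

Subtracting the upper quantile gives the lower endpoint and vice versa, which is why the quantiles are requested in the order 0.975, 0.025.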

Footnote about Reading Data for Problem 1 into R

If you are doing this problem in R rather than Rweb, you will have to duplicate what Rweb does when it reads in a URL at the beginning. For problem 1, for example, you must do

X <- read.table(url("http://www.stat.umn.edu/geyer/s06/5601/mydata/phred.txt"),
    header = TRUE)
names(X)
attach(X)

to produce the variables x and y needed for your analysis.

Of course, you read different data files for different problems that use external data entry, and the variables in those files may have names other than x and y. Everything else stays the same.