The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructor.
You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:
No credit for numbers with no indication of where they came from!
The data for this problem are at the URL

With that URL given to Rweb, two variables x and y are loaded. (If you are doing this problem in R rather than Rweb, see the footnote about reading this data into R.)
This is regression data. We assume the standard model that is nonparametric about the regression function

    Y_i = g(x_i) + σ Z_i,    i = 1, ..., n,

where g is an unknown smooth function (an infinite-dimensional parameter), σ is an unknown constant (a scalar parameter), and the Z_i are IID standard normal.
Use the R function smooth.spline (on-line help) to fit a regression function ĝ to these data. Use optimal smoothing, where optimal is defined by this function's default method (generalized cross-validation). [This was fixed Friday; the original statement said ordinary cross-validation. Either will be accepted.]
Hand in a scatterplot with the smoothing spline regression estimate shown.
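A minimal sketch of what is being asked for (assuming x and y are already loaded; smooth.spline with its default cv = FALSE uses generalized cross-validation):

```r
## sketch: fit by GCV and plot, assuming x and y are loaded
out <- smooth.spline(x, y)   # default cv = FALSE means GCV
plot(x, y)                   # scatterplot of the data
lines(out, col = "blue")     # add the fitted smoothing spline
```

The fitted object has x and y components, so lines() can draw it directly on the scatterplot.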
This problem continues the analysis started in Question 1 and uses the same data and the same model assumptions.
In this problem we are going to do nonparametric regression the old-fashioned way, assuming we never heard of smoothing splines, kernel smoothers, or the like. Instead we are going to take a large class of parametric models for the regression function g and then do model selection to figure out which is best.
The point of the exercise is to see that the old-fashioned way is a direct competitor of modern methods. The question could conclude with an essay part where you give your opinion about how old-fashioned and modern methods compare. We won't actually ask the essay question, but you should think about the issue nevertheless.
The particular models we will fit will be polynomials. The R poly function (on-line help) helps with them. For example

out <- lm(y ~ poly(x, 10))
summary(out)
curve(predict(out, newdata = data.frame(x = x)), add = TRUE, col = "red", n = 1001)
adds the best 10-th degree polynomial regression function to the scatterplot
you made in Question 1. The use of curve (on-line help) follows the examples on the page about color and the page about bandwidth selection in smoothing. The optional argument n = 1001 makes curve draw a smoother curve; this is not necessary for 10-th degree polynomials, but it is necessary to get nice plots for higher-degree polynomials.
We select a polynomial by generalized cross-validation. In the context of multiple regression, formula (30) in the notes on smoothing reduces to

    gcv(k) = asr(k) / (1 − (k + 1) / n)^2

where asr(k) is the average squared residual for the polynomial regression of degree k, for which the trace of the hat matrix is the number of regression coefficients, which is k + 1 for a polynomial of degree k, and n is the length of the data (which has no relation to the argument n to the curve function).
If out is the output of the lm function (on-line help), then
mean(residuals(out)^2)
is the average squared residual.
Find the k in the range from 1 to 40 that gives the smallest gcv(k) for these data.
Hand in a scatterplot with the smoothing spline regression estimate from Question 1 and the optimal polynomial from this question both shown.
Hint: An R technique for finding the index at which a vector sally is minimized is
seq(along = sally)[sally == min(sally)]
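Putting the pieces above together, one possible sketch (the variable names gcv, asr, and k.best are my own choices, not part of the problem statement):

```r
## sketch: compute gcv(k) for polynomial degrees k = 1, ..., 40
n <- length(x)
kmax <- 40
gcv <- double(kmax)
for (k in 1:kmax) {
    out <- lm(y ~ poly(x, k))
    asr <- mean(residuals(out)^2)        # average squared residual
    gcv[k] <- asr / (1 - (k + 1) / n)^2  # formula (30) specialized
}
k.best <- seq(along = gcv)[gcv == min(gcv)]
```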
This problem continues the analysis started in Question 1 and continued in Question 2 and uses the same data and the same model assumptions.
Suppose in the plot that is the answer to Question 1 we want a confidence interval for the value of the population regression function g(x) at x = 6.5.
If out is the output of the smooth.spline function (on-line help), then
predict(out, 6.5)$y
gives that prediction.
Run an (Efron) nonparametric bootstrap to estimate the sampling distribution of this estimator of g(6.5).
Calculate the 95% bootstrap percentile confidence interval obtained from this sampling distribution.
Hand in a histogram of your bootstrap estimate of the sampling distribution of the estimator showing the endpoints of the confidence interval on the histogram.
[Added Friday] This could be done bootstrapping either cases or residuals and the original statement did not specify. The solution will bootstrap residuals. Either will be accepted.
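One way to set up the residual bootstrap is sketched below (the number of bootstrap iterations nboot and all variable names are my own choices; the problem does not specify them):

```r
## sketch: nonparametric bootstrap of residuals for g-hat(6.5)
out <- smooth.spline(x, y)
g.hat <- predict(out, x)$y            # fitted values at the data points
resid <- y - g.hat                    # residuals from the spline fit
nboot <- 999                          # my choice of bootstrap sample size
theta.star <- double(nboot)
for (i in 1:nboot) {
    y.star <- g.hat + sample(resid, replace = TRUE)
    out.star <- smooth.spline(x, y.star)
    theta.star[i] <- predict(out.star, 6.5)$y
}
ci <- quantile(theta.star, c(0.025, 0.975))   # percentile interval
hist(theta.star)
abline(v = ci, lty = 2)               # mark the interval endpoints
```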
The data for this problem are at the URL

With that URL given to Rweb, one variable x is loaded. (If you are doing this problem in R rather than Rweb, see the footnote about reading this data into R.)
This is time series data. We assume the same AR(1) model that was assumed for the data in our first example for bootstrapping time series and in Section 8.5 in Efron and Tibshirani,

    X_i = β X_{i−1} + Z_i

where the Z_i are IID standard normal.
We want to do exactly as in our careful example for bootstrapping time series except that we will use a good estimator
beta.hat <- ar(x, aic = FALSE, order.max = 1)$ar
and we will use a much larger subsample size b = 250 (of course the time series length n must be, and is, much larger than that).
Hand in the 95% confidence interval calculated as described in our careful example for bootstrapping time series and also hand in a histogram with relevant quantiles marked as done in that example.
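I do not reproduce the careful example here, but the subsampling step itself might be sketched as follows, assuming the interval is based on quantiles of √b (β*_b − β̂_n) as in Politis–Romano subsampling; check this against the careful example before relying on it:

```r
## sketch: subsampling bootstrap for the AR(1) coefficient
n <- length(x)
b <- 250                                  # subsample size from the problem
nsub <- n - b + 1                         # all contiguous subsamples
beta.hat <- ar(x, aic = FALSE, order.max = 1)$ar
beta.star <- double(nsub)
for (i in 1:nsub) {
    x.sub <- x[i:(i + b - 1)]             # contiguous block of length b
    beta.star[i] <- ar(x.sub, aic = FALSE, order.max = 1)$ar
}
z.star <- sqrt(b) * (beta.star - beta.hat)
crit <- quantile(z.star, c(0.025, 0.975))
ci <- beta.hat - rev(crit) / sqrt(n)      # subsampling 95% interval
```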
If you are doing this problem in R rather than Rweb, you will have to duplicate what Rweb does reading in a URL at the beginning. So all together, you must do for problem 1, for example,
X <- read.table(url("http://www.stat.umn.edu/geyer/s06/5601/mydata/phred.txt"),
    header = TRUE)
names(X)
attach(X)

to produce the variables x and y needed for your analysis.
Of course, you read different data files for different problems that use external data entry, and the variables in those files may have names other than x and y. Everything else stays the same.