General Instructions

The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructors (Geyer, via e-mail, or Chatterjee).

You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:

No credit for numbers with no indication of where they came from!

Question 1 [25 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/two-thirds.txt

(and were used in homework). With that URL given to Rweb, one variable x is loaded. (If you are using R at home, see the footnote about reading this data into R).

In the homework we used the sample mean as an estimator of location, but for these data it has a slow rate of convergence, n^(1/3). If we use a more robust estimator, say the Hodges-Lehmann estimator associated with the Wilcoxon signed rank test, call it the sample pseudomedian, then we get the usual n^(1/2) rate of convergence (the pseudomedian obeys the square root law). A sketch of one possible R approach follows the list below.

  1. Calculate the sample pseudomedian (Hodges-Lehmann estimator associated with the Wilcoxon signed rank test) of these data.

  2. Estimate the standard error of this estimate (considered as a point estimate of the location parameter) using Efron's nonparametric bootstrap. Use at least 1000 bootstrap samples to calculate your estimate.

  3. Make a histogram of the bootstrap distribution of the sample pseudomedian showing the point which is the sample pseudomedian of the original data.
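For concreteness, here is a minimal sketch, in R, of one way to do all three parts. It assumes the variable x has already been loaded as described above; the helper function pseudomedian is not part of R, it is defined here, computing the median of the Walsh averages (wilcox.test(x, conf.int = TRUE)$estimate gives the same number). This is a sketch, not the only acceptable approach.

pseudomedian <- function(x) {
    # Hodges-Lehmann estimator: median of the Walsh averages (x[i] + x[j]) / 2
    walsh <- outer(x, x, "+") / 2
    median(walsh[lower.tri(walsh, diag = TRUE)])
}
theta.hat <- pseudomedian(x)
theta.hat

# Efron's nonparametric bootstrap, 1000 resamples
theta.star <- replicate(1000, pseudomedian(sample(x, replace = TRUE)))
sd(theta.star)                  # bootstrap standard error

hist(theta.star)
abline(v = theta.hat, lty = 2)  # sample pseudomedian of the original data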

Question 2 [25 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/sine.txt

With that URL given to Rweb, two variables x and y are loaded. (If you are using R at home, see the footnote about reading this data into R).

We wish to fit a regression model with x as the predictor and y as the response. These data, except for outliers, appear to fit a fourth-degree polynomial, which is specified in the R formula mini-language by y ~ poly(x, 4) (on-line help). Since these data have some outliers, we will use a robust regression program, ltsreg, to do the regression (recall that this program was used in homework). A sketch of one possible approach follows the list below.

  1. Fit the model specified by the R formula y ~ poly(x, 4) to these data. Report the regression coefficients.

  2. Estimate the standard errors of all five regression coefficients (considered as point estimates of the population regression coefficients) using Efron's nonparametric bootstrap. Bootstrap residuals, not cases. Use at least 250 bootstrap samples to calculate your estimates. (You should use more than 250, but rweb.stat.umn.edu is slow, at least when running ltsreg.)

  3. Produce a plot, like those on the bootstrapping regression web page, showing not only the sample regression function but also the bootstrap regression functions.
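Here is a minimal sketch of one way to carry this out, assuming x and y are already loaded and that ltsreg is available via library(MASS) (on some installations it lives in the lqs package instead). The residual bootstrap resamples residuals and adds them back to the fitted values.

library(MASS)                       # for ltsreg

out <- ltsreg(y ~ poly(x, 4))
coef(out)                           # the five sample regression coefficients
y.hat <- fitted(out)
r <- residuals(out)

nboot <- 250                        # use more if time allows
beta.star <- matrix(NA, nboot, length(coef(out)))
plot(x, y)
for (i in 1:nboot) {
    y.star <- y.hat + sample(r, replace = TRUE)   # bootstrap residuals, not cases
    out.star <- ltsreg(y.star ~ poly(x, 4))
    beta.star[i, ] <- coef(out.star)
    lines(sort(x), fitted(out.star)[order(x)], col = "gray")
}
apply(beta.star, 2, sd)                    # bootstrap standard errors
lines(sort(x), y.hat[order(x)], lwd = 2)   # sample regression function on top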

Question 3 [25 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t2p3.txt

With that URL given to Rweb, one variable y is loaded. (If you are using R at home, see the footnote about reading this data into R).

These data are actually a simulated stationary time series. We would like to fit an AR(2) (autoregressive of order 2) model, which has the form

X_n = β_1 X_{n-1} + β_2 X_{n-2} + Z_n

where the betas are autoregressive coefficients (the two parameters of interest) and the Z's are IID mean zero innovations. Then the observed data are

Y_n = μ + X_n

where the X's are as defined above. So there are four unknown parameters (the betas, mu, and the innovations variance), but we are only interested in two.

The R statement

ar.burg(y, order.max = 2, aic = FALSE)$ar

(on-line help) produces a vector of length 2 that estimates the betas. Note that there is no need to subtract off the sample mean from the series as we did in the subsampling bootstrap for time series example; that is part of what ar.burg does. A sketch of one possible approach follows the list below.

  1. Do a subsampling bootstrap to obtain bootstrap distributions of the two autoregressive coefficients (betas). Use subsampling bootstrap sample size 15, and assume a root-n rate of convergence. (No answers from this part, just code.)

  2. Give (subsampling) bootstrap estimates of the standard errors of the betas.

  3. Give a scatterplot of β_1^* versus β_2^* showing (the subsampling bootstrap estimate of) their joint sampling distribution. Put horizontal and vertical lines on the plot to show the corresponding sample estimates, the beta hats (can't make hats in HTML).

    Note: the h and v arguments to the abline function (on-line help) add such lines to plots and the lty argument makes different line types.
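Here is a minimal sketch of one way to do all three parts, assuming y is already loaded and using all overlapping subseries of length b = 15 (other subsampling schemes are possible). Under the assumed root-n rate, the standard deviation of the subsample estimates is rescaled by sqrt(b / n) to estimate the standard errors.

n <- length(y)
b <- 15                             # subsampling bootstrap sample size
beta.hat <- ar.burg(y, order.max = 2, aic = FALSE)$ar

nsub <- n - b + 1                   # number of overlapping subseries
beta.star <- matrix(NA, nsub, 2)
for (i in 1:nsub) {
    y.sub <- y[i:(i + b - 1)]
    beta.star[i, ] <- ar.burg(y.sub, order.max = 2, aic = FALSE)$ar
}

# root-n rate: SE(beta.hat) is approximately sd(beta.star) * sqrt(b / n)
apply(beta.star, 2, sd) * sqrt(b / n)

plot(beta.star[, 1], beta.star[, 2],
    xlab = expression(beta[1]^"*"), ylab = expression(beta[2]^"*"))
abline(v = beta.hat[1], lty = 2)    # vertical line at the sample estimate of beta_1
abline(h = beta.hat[2], lty = 2)    # horizontal line at the sample estimate of beta_2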

Question 4 [25 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p4.txt

With that URL given to Rweb, two variables fruit and seeds are loaded. (If you are using R at home, see the footnote about reading this data into R).

These are the same data that were used for question 4 on the first midterm. Note that the URL ends in t1p4.txt, not t2p4.txt.

  1. Calculate the Pearson correlation coefficient for these data.

  2. Since the data are highly non-normal, the usual assumptions that go with the Pearson correlation coefficient are badly violated. Use the (Efron, nonparametric) bootstrap to approximate the sampling distribution of the Pearson correlation coefficient. Use 1000 bootstrap samples. (No answers from this part, just code; one way to set it up is sketched after this list.)

  3. Report the bootstrap standard error of the sample Pearson correlation coefficient.

  4. Make a histogram of the bootstrap distribution of the sample Pearson correlation coefficient showing the point which is the sample Pearson correlation coefficient of the original data.
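Here is a minimal sketch of one way to do all four parts, assuming fruit and seeds are already loaded. Since correlation is a function of pairs, cases are resampled so that the pairs stay together.

r.hat <- cor(fruit, seeds)          # sample Pearson correlation coefficient
r.hat

n <- length(fruit)
r.star <- replicate(1000, {
    k <- sample(n, replace = TRUE)  # resample cases, keeping (fruit, seeds) pairs together
    cor(fruit[k], seeds[k])
})
sd(r.star)                          # bootstrap standard error

hist(r.star)
abline(v = r.hat, lty = 2)          # sample correlation of the original data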

Footnote about Reading Data for Problem 1 into R

If you are doing this problem in R rather than Rweb, you will have to duplicate what Rweb does in reading the URL at the beginning. So, all together, what you must do for problem 1 is, for example,

X <- read.table(url("http://www.stat.umn.edu/geyer/f06/5601/mydata/two-thirds.txt"),
    header = TRUE)
names(X)
attach(X)

to produce the variable x needed for your analysis.

Of course, you read different data files for different problems that use external data entry. Everything else stays the same.