
Statistics 5601, Fall 2001, Prof. Geyer, Homework Assignments

Go to assignment:     1     2     3     4     5     6

Homework Assignment 1, Due Fri, Sep 24, 2001

For the data in Table 3.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-3.txt

  (a) Do a Wilcoxon signed rank test of the hypotheses described in Problem 3.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
  (b) Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the signed rank test, with confidence as close to 95% as you can get.
  (c) Do a sign test of the same hypotheses as in part (a).
  (d) Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the sign test, with confidence as close to 95% as you can get.
  (e) Comment on the differences between the two procedures, including any differences in required assumptions and in theoretical properties such as asymptotic relative efficiency.

The problem just above corresponds to Problems 3.1, 3.19, 3.27, 3.63, and more in (H & W).
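
A minimal sketch of R code for parts (a) through (d), not the official solution. It assumes the data file has a header line and that the paired columns are named x and y, which is a guess; check the file and adjust the names and the alternative hypothesis to match Problem 3.1.

foo <- read.table("http://www.stat.umn.edu/geyer/5601/hwdata/t3-3.txt",
    header = TRUE)
d <- foo$x - foo$y       # paired differences (column names are guesses)

# parts (a) and (b): signed rank test plus Hodges-Lehmann estimate and CI
wilcox.test(d, conf.int = TRUE, conf.level = 0.95)

# part (c): sign test, counting positive differences and dropping zeros
binom.test(sum(d > 0), sum(d != 0))

# part (d): the point estimate that goes with the sign test is the sample
# median of the differences; the associated CI uses order statistics of the
# sorted differences (see H & W)
median(d)
sort(d)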

For the data in Table 3.6 in (H & W), also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-6.txt, repeat parts (a) and (b) above, reading Problem 3.43 in place of Problem 3.1 in part (a).

The problem just above corresponds to Problem 3.12 and more in (H & W).

Problem 3.54 in (H & W).

Problem 3.87 in (H & W). The data are in Table 3.9 in (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-9.txt.

Note: answers are now here.

Homework Assignment 2, Due Fri, Oct 12, 2001

For the data in Table 4.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t4-3.txt

  (a) Do a Wilcoxon rank sum test of the hypotheses described in Problem 4.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
  (b) Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the rank sum test, with confidence as close to 95% as you can get.
  (c) Do a permutation test of the same hypotheses as in part (a) using the difference of sample means as your test statistic.
  (d) Do a two-sample Kolmogorov-Smirnov test (testing whether there is any difference between the two populations).

This corresponds to Problems 4.1, 4.15, 4.27, and more.
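
A minimal sketch of R code for parts (a) through (d), not the official solution, with the two samples in vectors x and y (read them in from the data file however it is laid out; the names are assumptions). The default two-sided alternative is shown; adjust to match Problem 4.1.

# parts (a) and (b): rank sum test plus Hodges-Lehmann estimate and CI
wilcox.test(x, y, conf.int = TRUE, conf.level = 0.95)

# part (c): permutation test with the difference of sample means as the test
# statistic; the samples are small enough to enumerate all splits of the
# pooled data
z <- c(x, y)
m <- length(x)
splits <- combn(length(z), m)
tstar <- apply(splits, 2, function(i) mean(z[i]) - mean(z[-i]))
tobs <- mean(x) - mean(y)
mean(abs(tstar) >= abs(tobs))     # two-tailed permutation P-value

# part (d): two-sample Kolmogorov-Smirnov test
ks.test(x, y)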

Problem 11.29. The data are at http://www.stat.umn.edu/geyer/5601/hwdata/t11-15.txt

For the data in Table 4.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t4-4.txt

  (a) Do a Wilcoxon rank sum test of the hypotheses described in Problem 4.5 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
  (b) Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the rank sum test, with confidence as close to 99% as you can get (note: not 95%).
  (c) Do a permutation test of the same hypotheses as in part (a) using the difference of sample means as your test statistic. Since the sample sizes are so large, this will have to be a Monte Carlo test.

This corresponds to Problems 4.5, 4.19, 4.34, and more.
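
For part (b) the only change from the sketch above is conf.level = 0.99 in the wilcox.test call. For part (c) there are too many splits to enumerate, so sample random permutations instead; a minimal sketch, again with the two samples in vectors x and y (names assumed):

nsim <- 1000                       # number of random permutations
z <- c(x, y)
m <- length(x)
tobs <- mean(x) - mean(y)
tstar <- double(nsim)
for (i in 1:nsim) {
    zstar <- sample(z)             # random permutation of the pooled data
    tstar[i] <- mean(zstar[1:m]) - mean(zstar[-(1:m)])
}
mean(abs(tstar) >= abs(tobs))      # two-tailed Monte Carlo P-value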

Note: answers are now here.

Homework Assignment 3, Due Mon, Oct 29, 2001

For the data in Table 6.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t6-4.txt

  1. Do a Kruskal-Wallis test of the hypotheses described in Problem 6.8 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
  2. (Optional) Also do a conventional ANOVA on the same data.
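
A minimal sketch, not the official solution; the column names y (response) and g (group) are guesses, so check the header line of the data file and adjust.

foo <- read.table("http://www.stat.umn.edu/geyer/5601/hwdata/t6-4.txt",
    header = TRUE)

# part 1: Kruskal-Wallis test of no difference among the groups
kruskal.test(y ~ g, data = foo)

# part 2 (optional): conventional one-way ANOVA on the same data
summary(aov(y ~ factor(g), data = foo))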

For the data in Table 6.7 in Hollander and Wolfe (H & W), also at http://www.stat.umn.edu/geyer/5601/hwdata/t6-7.txt, do a Jonckheere-Terpstra test of the hypotheses described in Problem 6.19 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".

For the data in Table 7.2 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t7-2.txt

  1. Do a Friedman test of the hypotheses described in Problem 7.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
  2. (Optional) Also do a conventional ANOVA on the same data.
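
A minimal sketch, not the official solution; the column names y (response), treatment, and block are guesses, so check the header line of the data file and adjust.

foo <- read.table("http://www.stat.umn.edu/geyer/5601/hwdata/t7-2.txt",
    header = TRUE)

# part 1: Friedman test for a randomized complete block design
friedman.test(y ~ treatment | block, data = foo)

# part 2 (optional): conventional two-way ANOVA, treatments plus blocks
summary(aov(y ~ factor(treatment) + factor(block), data = foo))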

For the data in Table 8.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t8-3.txt

  1. Do a test using Kendall's tau as the test statistic of the hypotheses described in Problem 8.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
  2. Also find the associated point estimate and an approximate 95% confidence interval.
  3. (Optional) Also do the competing normal-theory parametric procedures on the same data.

This corresponds to Problems 8.1, 8.20, and, except for a change of confidence level, 8.27.
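
A minimal sketch for parts 1 and 2, not the official solution; the column names x and y are guesses. Note that cor.test reports the point estimate of tau but no confidence interval, so the approximate 95% interval in part 2 has to come from the large-sample theory in H & W.

foo <- read.table("http://www.stat.umn.edu/geyer/5601/hwdata/t8-3.txt",
    header = TRUE)

# parts 1 and 2: test of independence based on Kendall's tau, which is also
# printed as the point estimate
cor.test(foo$x, foo$y, method = "kendall")

# part 3 (optional): the competing normal-theory (Pearson) procedure
cor.test(foo$x, foo$y)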

Note: answers are here.

Homework Assignment 4, Due Fri, Nov 9, 2001

Questions From Efron and Tibshirani

4.1, 4.2, 4.3 (a), 5.2, and 5.4

Additional Question

For the data used in Question 2 on the first midterm, at http://www.stat.umn.edu/geyer/5601/mydata/sally.txt
  (a) For the variable x compute

    (i) the sample mean
    (ii) the sample median
    (iii) the sample 25% trimmed mean, for which the R statement is mean(x, trim=0.25)

    And for each of these three point estimates calculate a bootstrap standard error (use bootstrap sample size 1000 at least); a sketch of this calculation follows the list.

  (b) Repeat part (a) for the variable y.

  (c) Using the fact (revealed in the answer key to the midterm) that x is actually a random sample from a normal distribution, we actually know the asymptotic relative efficiency (ARE) of estimators (i) and (ii) in part (a). What is it, and how close is the ratio of bootstrap standard errors?

  (d) (Optional) Using the fact (revealed in the answer key to the midterm) that y is actually a random sample from a Cauchy distribution, those who have had a theory course should know the asymptotic relative efficiency (ARE) of estimators (i) and (ii) in part (b). What is it, and how close is the ratio of bootstrap standard errors?
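
A minimal sketch of the bootstrap standard error calculation for the sample mean in part (a), not the official solution; the same loop handles the median and the trimmed mean (and the variable y in part (b)) by changing the estimator. It assumes the data file has a header line with columns named x and y.

foo <- read.table("http://www.stat.umn.edu/geyer/5601/mydata/sally.txt",
    header = TRUE)
x <- foo$x

nboot <- 1000                        # number of bootstrap resamples
theta.star <- double(nboot)
for (i in 1:nboot) {
    x.star <- sample(x, replace = TRUE)    # nonparametric bootstrap resample
    theta.star[i] <- mean(x.star)          # or median(x.star), or
                                           # mean(x.star, trim = 0.25)
}
sd(theta.star)                       # bootstrap standard error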

Note: answers to the additional question are here.

Homework Assignment 5, Due Fri, Nov 16, 2001

Question 1

The data set LakeHuron in the ts package gives annual measurements of the level, in feet, of Lake Huron for 1875-1972. As with all data sets included in R, the usage is

library(ts)
data(LakeHuron)
Then the name of the time series is LakeHuron, for example,
plot(LakeHuron)
does a time series plot.

The average water level over this period was

> mean(LakeHuron)
[1] 579.0041
Obtain a standard error for this estimate using the subsampling bootstrap with subsample length b = 10. Assume that the sample mean obeys the square root law (that is, the rate is square root of n).
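
One way to organize the calculation, a sketch only and not necessarily the recipe from the class notes: take all overlapping subsamples of length b = 10, compute their means, and rescale their standard deviation using the square root law.

x <- as.vector(LakeHuron)
n <- length(x)                       # 98 annual observations
b <- 10                              # subsample length

# means of all n - b + 1 overlapping subsamples of length b
theta.star <- double(n - b + 1)
for (i in 1:(n - b + 1))
    theta.star[i] <- mean(x[i:(i + b - 1)])

# a subsample mean has standard deviation about sigma / sqrt(b), the full
# sample mean about sigma / sqrt(n), hence the rescaling factor sqrt(b / n)
sd(theta.star) * sqrt(b / n)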

Question 2

The documentation for the R function lmsreg says

There seems no reason other than historical to use the lms and lqs options. LMS estimation is of low efficiency (converging at rate n^{-1/3}) whereas LTS has the same asymptotic efficiency as an M estimator with trimming at the quartiles (Marazzi, 1993, p.201). LQS and LTS have the same maximal breakdown value of (floor((n-p)/2) + 1)/n attained if floor((n+p)/2) <= quantile <= floor((n+p+1)/2). The only drawback mentioned of LTS is greater computation, as a sort was thought to be required (Marazzi, 1993, p.201) but this is not true as a partial sort can be used (and is used in this implementation).

Thus it seems that LMS regression is the Wrong Thing (with a capital W and a capital T), something that survives only for historico-sociological reasons: it was invented first, and most people who have heard of robust regression at all have only heard of it. To be fair to Efron and Tibshirani, the literature cited in the documentation for lmsreg is the same vintage as their book. So maybe they thought they were using the latest and greatest.

Anyway, the problem is to redo the LMS examples using LTS.

What changes? Does LTS really work better here than LMS? Describe the differences you see and why you think these differences indicate LTS is better (or worse, if that is what you think) than LMS.

Note: LTS does take longer, so a smaller nboot might be advisable. Also I got some warning messages, which perhaps (???) can be ignored.
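
For what it is worth, ltsreg has the same calling sequence as lmsreg (both are front ends to the lqs function), so the only essential change in the LMS examples is the fitting call. A sketch with a generic formula y ~ x standing in for whatever model the examples actually fit:

library(lqs)             # or library(MASS) in later versions of R
ltsreg(y ~ x)            # least trimmed squares, in place of lmsreg(y ~ x)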

Homework Assignment 6, Due Fri, Dec 7, 2001

Question 1

For the data used in Question 1 on the second midterm, at http://www.stat.umn.edu/geyer/5601/mydata/gamma.txt, we calculated the sample coefficient of skewness using the function

skew <- function(x) {
    xbar <- mean(x)
    mu2.hat <- mean((x - xbar)^2)    # second sample central moment
    mu3.hat <- mean((x - xbar)^3)    # third sample central moment
    mu3.hat / sqrt(mu2.hat)^3        # skewness = mu3 / mu2^(3/2)
}

  1. Find a 95% confidence interval for the true unknown population coefficient of skewness that is just the sample coefficient of skewness plus or minus 1.96 bootstrap standard errors.

  2. Find a 95% confidence interval for the true unknown population coefficient of skewness having the second order accuracy property using the boott function.

    Note: Since you have no idea how to write an sdfun that will variance stabilize the coefficient of skewness, you will have to use one of the other two methods described on the bootstrap t page.

  3. Find a 95% confidence interval for the true unknown population coefficient of skewness using the bootstrap percentile method.

  4. Find a 95% confidence interval for the true unknown population coefficient of skewness using the BCa (alphabet soup, type 1) method.

  5. Find a 95% confidence interval for the true unknown population coefficient of skewness using the ABC (alphabet soup, type 2) method.

    Note: this will require that you write a quite different skew function, starting

    skew <- function(p, x) {
    
    (and you have to fill in the rest of the details, which should, I hope, be clear enough from the discussion of our ABC example).
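
A minimal sketch of parts 1 and 3 only (the interval based on the bootstrap standard error and the percentile interval), not the official solution; it assumes the data file has a header line with a column named x. Parts 2, 4, and 5 use the boott function and the BCa and ABC functions from the bootstrap software discussed in class, and those details, in particular the weighted skew function for ABC, are left to you.

foo <- read.table("http://www.stat.umn.edu/geyer/5601/mydata/gamma.txt",
    header = TRUE)
x <- foo$x

nboot <- 1000
theta.hat <- skew(x)
theta.star <- double(nboot)
for (i in 1:nboot)
    theta.star[i] <- skew(sample(x, replace = TRUE))

# part 1: sample skewness plus or minus 1.96 bootstrap standard errors
theta.hat + c(-1, 1) * 1.96 * sd(theta.star)

# part 3: bootstrap percentile interval
quantile(theta.star, c(0.025, 0.975))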

Question 2

For the data used in Question 2 on the second midterm, at http://www.stat.umn.edu/geyer/5601/mydata/ar1.txt, we calculated the sample mean (and the sample standard deviation, but we'll ignore the latter for this problem).

Note: The test question forgot to say this explicitly, but this time series does obey the square root law. The proper rate to use is the square root of n.

  1. Find a 95% confidence interval for the true unknown population mean that is just the sample mean plus or minus 1.96 bootstrap standard errors.

  2. Find a 95% confidence interval for the true unknown population mean using the method of Politis and Romano described in the handout.

As on the test, use subsample size 50 for both parts (you can use the same samples).
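
A minimal sketch of both parts, not the official solution; it assumes the data file has a header line with a column named x and uses all overlapping subsamples of length 50. Check the Politis and Romano handout for the exact form of the interval in part 2.

foo <- read.table("http://www.stat.umn.edu/geyer/5601/mydata/ar1.txt",
    header = TRUE)
x <- foo$x
n <- length(x)
b <- 50                              # subsample length, as on the test
theta.hat <- mean(x)

# means of all overlapping subsamples of length b
theta.star <- double(n - b + 1)
for (i in 1:(n - b + 1))
    theta.star[i] <- mean(x[i:(i + b - 1)])

# part 1: plus or minus 1.96 subsampling standard errors (square root law)
se <- sd(theta.star) * sqrt(b / n)
theta.hat + c(-1, 1) * 1.96 * se

# part 2: subsampling (Politis-Romano style) interval: quantiles of the
# centered and rescaled subsample means serve as critical values
z.star <- sqrt(b) * (theta.star - theta.hat)
theta.hat - rev(quantile(z.star, c(0.025, 0.975))) / sqrt(n)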

Note: answers are here.