# Statistics 5601 (Geyer, Fall 2003) Homework Assignments


## Homework Assignment 1, Due Wed, Sep 24, 2003

For the data in Table 3.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-3.txt

1. Do a Wilcoxon signed rank test of the hypotheses described in Problem 3.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
2. Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the signed rank test with as close as you can get to 95% confidence. (The phrase "as close as you can get" means either above or below 95%, whichever is closer. Note that this is different from the example, which always is above.)
3. Do a sign test of the same hypotheses as in part 1.
4. Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the sign test with as close as you can get to 95% confidence.
5. Comment on the differences between the two procedures, including any differences in required assumptions, and in theoretical properties such as asymptotic relative efficiency.

The problem just above is Problems 3.1, 3.19, 3.27, 3.63, and more in H & W.
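The phrase "as close as you can get" can be made concrete by tabulating the achieved confidence level of each candidate interval and picking the one nearest 95%. The following is a minimal sketch under stated assumptions, not the course's worked example: the vector `z` is simulated and merely stands in for the paired differences, and all variable names are made up for illustration.

```r
# Hypothetical paired differences, simulated only for illustration
set.seed(42)
z <- rnorm(12, mean = 0.5)
n <- length(z)

# Signed rank procedure: intervals are order statistics of the Walsh averages
w <- outer(z, z, "+") / 2
walsh <- sort(w[lower.tri(w, diag = TRUE)])
m <- length(walsh)                       # n * (n + 1) / 2
hl <- median(walsh)                      # Hodges-Lehmann estimator
k <- 1:floor(m / 2)
level <- 1 - 2 * psignrank(k - 1, n)     # level of (walsh[k], walsh[m + 1 - k])
k.sr <- k[which.min(abs(level - 0.95))]  # closest to 95%, above or below
ci.signrank <- c(walsh[k.sr], walsh[m + 1 - k.sr])
level.signrank <- level[k.sr]

# Sign test procedure: intervals are order statistics of the data
zs <- sort(z)
j <- 1:floor(n / 2)
level.j <- 1 - 2 * pbinom(j - 1, n, 1 / 2)   # level of (zs[j], zs[n + 1 - j])
j.s <- j[which.min(abs(level.j - 0.95))]
ci.sign <- c(zs[j.s], zs[n + 1 - j.s])
level.sign <- level.j[j.s]

print(c(signed.rank = level.signrank, sign = level.sign))
```

The achieved levels depend only on the sample size, not on the data, so the choice of `k.sr` and `j.s` can be made before looking at the observations.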

For the data in Table 3.6 in H & W also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-6.txt repeat parts 1 and 2 above. Read 3.43 in place of 3.1 in part 1.

The problem just above is Problem 3.12 and more in H & W.

Problem 3.54 in H & W. The data are in the problem statement also at http://www.stat.umn.edu/geyer/5601/hwdata/p3-54.txt.

Problem 3.87 in H & W. The data are in Table 3.9 in H & W also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-9.txt.

Note: answers are now here.

## Homework Assignment 2, Due Fri, Oct 10, 2003

For the data in Table 4.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t4-3.txt

1. Do a Wilcoxon rank sum test of the hypotheses described in Problem 4.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
2. Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the rank sum test with as close as you can get to 95% confidence.
3. Do a two-sample Kolmogorov-Smirnov test
   1. testing whether there is any difference in the two populations, and
   2. testing whether allergics have higher histamine levels than nonallergics.

This is Problems 4.1, 4.15, 4.27, and more.
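The direction of the one-sided Kolmogorov-Smirnov alternative in R is easy to get backwards, because `ks.test` states it in terms of distribution functions rather than in terms of which sample tends to be larger. A minimal sketch, assuming simulated data (the vectors `x` and `y` below are hypothetical and are not the histamine data):

```r
# Hypothetical data, simulated only for illustration
set.seed(1)
x <- rnorm(15, mean = 1)   # stands in for the sample suspected to be larger
y <- rnorm(12)

# Two-sided: any difference between the two distributions
ks.two <- ks.test(x, y)

# One-sided: alternative = "less" says the distribution function of x lies
# below that of y, which is the direction in which x tends to be larger
ks.less <- ks.test(x, y, alternative = "less")

print(ks.two$p.value)
print(ks.less$p.value)
```

Swapping the roles of `x` and `y` requires swapping `"less"` and `"greater"`, so it is worth checking the choice against a plot of the two empirical distribution functions.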

Problem 11.29. The data are at http://www.stat.umn.edu/geyer/5601/hwdata/t11-15.txt

For the data in Table 4.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t4-4.txt

1. Do a Wilcoxon rank sum test of the hypotheses described in Problem 4.5 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
2. Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the rank sum test with as close as you can get to 99% confidence (note not 95%).

This is Problems 4.5, 4.19, 4.34, and more.
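For the rank sum procedure, the same "closest achieved level" idea applies to the ordered pairwise differences between the two samples. This is a sketch on simulated data (the samples `x` and `y` and all variable names are assumptions made for illustration), targeting 99% as in the problem above.

```r
# Hypothetical samples, simulated only for illustration
set.seed(2)
x <- rnorm(8)
y <- rnorm(10, mean = 1)
m <- length(x)
n <- length(y)

# All m * n pairwise differences y[j] - x[i], sorted
d <- sort(as.vector(outer(y, x, "-")))
hl <- median(d)                          # Hodges-Lehmann estimator of the shift

# Achieved level of the interval (d[k], d[m * n + 1 - k])
k <- 1:floor(m * n / 2)
level <- 1 - 2 * pwilcox(k - 1, m, n)
k.best <- k[which.min(abs(level - 0.99))]
ci <- c(d[k.best], d[m * n + 1 - k.best])

# Cross-check of the point estimate against wilcox.test
fit <- wilcox.test(y, x, conf.int = TRUE, conf.level = 0.99)
print(c(by.hand = hl, wilcox.test = unname(fit$estimate)))
print(c(k = k.best, level = level[k.best]))
```

Note that the interval reported by `wilcox.test` is chosen to have at least the requested level, so it need not agree with `ci` when the closest achieved level is below 99%.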

Note: answers are now here.

## Homework Assignment 3, Due Fri, Oct 17, 2003

For the data in Table 6.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t6-4.txt

1. Do a Kruskal-Wallis test of the hypotheses described in Problem 6.8 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
2. Also do a conventional ANOVA on the same data.
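
A minimal sketch of the two analyses side by side, assuming a one-way layout stored as a response and a grouping factor. The data frame `dat` is simulated and hypothetical, not Table 6.4.

```r
# Hypothetical one-way layout, simulated only for illustration
set.seed(3)
dat <- data.frame(
    response = c(rnorm(6, 10), rnorm(6, 11), rnorm(6, 12)),
    group = factor(rep(c("A", "B", "C"), each = 6))
)

# Rank-based test of equal locations
kw <- kruskal.test(response ~ group, data = dat)

# Conventional normal-theory one-way ANOVA on the same data
aov.tab <- anova(lm(response ~ group, data = dat))

print(kw$p.value)
print(aov.tab[["Pr(>F)"]][1])
```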

For the data in Table 6.7 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t6-7.txt

1. Do a Jonckheere-Terpstra test of the hypotheses described in Problem 6.19 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
2. Also do a conventional normal theory test of the same hypotheses on the same data.
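
Base R has no Jonckheere-Terpstra function, but the statistic is a sum of Mann-Whitney counts over ordered pairs of groups and is short to compute by hand. The sketch below assumes no ties and uses the large-sample normal approximation. The list `groups` is simulated and hypothetical, with its elements in the order given by the alternative hypothesis.

```r
# Hypothetical ordered groups, simulated only for illustration
set.seed(4)
groups <- list(rnorm(5, 0), rnorm(6, 0.5), rnorm(7, 1))

# J = sum over pairs u < v of the number of (a, b) with a < b,
# a from group u and b from group v
jt.stat <- 0
for (u in 1:(length(groups) - 1)) {
    for (v in (u + 1):length(groups)) {
        jt.stat <- jt.stat + sum(outer(groups[[u]], groups[[v]], "<"))
    }
}

# Null mean and variance of J (no ties)
ni <- sapply(groups, length)
N <- sum(ni)
jt.mean <- (N^2 - sum(ni^2)) / 4
jt.var <- (N^2 * (2 * N + 3) - sum(ni^2 * (2 * ni + 3))) / 72

# Upper-tailed test against the ordered alternative
z <- (jt.stat - jt.mean) / sqrt(jt.var)
p.value <- pnorm(z, lower.tail = FALSE)
print(c(J = jt.stat, z = z, p.value = p.value))
```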

For the data in Table 7.2 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t7-2.txt

1. Do a Friedman test of the hypotheses described in Problem 7.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
2. Also do a conventional ANOVA on the same data.
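
For a two-way layout with one observation per cell, `friedman.test` accepts a matrix whose rows are blocks and whose columns are treatments, while the conventional ANOVA needs the same numbers in long form. A sketch on simulated data (the matrix `y` is hypothetical, not Table 7.2):

```r
# Hypothetical layout: 8 blocks (rows) by 3 treatments (columns)
set.seed(5)
y <- matrix(rnorm(8 * 3), 8, 3) + rep(c(0, 0.5, 1), each = 8)

# Rank-based test
fr <- friedman.test(y)

# Same data in long form for the normal-theory two-way ANOVA
long <- data.frame(
    response = as.vector(y),
    block = factor(rep(1:8, times = 3)),
    treatment = factor(rep(1:3, each = 8))
)
aov.tab <- anova(lm(response ~ block + treatment, data = long))

print(fr$p.value)
print(aov.tab["treatment", "Pr(>F)"])
```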

For the data in Table 8.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t8-3.txt

1. Do a test using Kendall's tau as the test statistic of the hypotheses described in Problem 8.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
2. Also find the associated point estimate and an approximate 95% confidence interval.
3. Also do the competing normal-theory parametric procedures on the same data.

This is problems 8.1, 8.20, and, except for a change of confidence level, 8.27.
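
A sketch of the test and point estimate for Kendall's tau alongside the normal-theory competitor, on simulated data (the vectors `x` and `y` are hypothetical). Note that `cor.test` reports a confidence interval only for the Pearson method, so the approximate interval for tau asked for in part 2 is not produced by this code and has to be computed separately.

```r
# Hypothetical bivariate data, simulated only for illustration
set.seed(6)
x <- rnorm(15)
y <- x + rnorm(15)

# Test of independence based on Kendall's tau, with the point estimate
kt <- cor.test(x, y, method = "kendall")
tau.hat <- unname(kt$estimate)

# Normal-theory competitor: Pearson correlation test and interval
pt <- cor.test(x, y, method = "pearson")

print(c(tau.hat = tau.hat, p.value = kt$p.value))
print(pt$conf.int)
```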

## Homework Assignment 4, Due Fri, Oct 31, 2003

### Questions From Hollander and Wolfe

11.34 and 11.36. The data are in Table 11.18 in Hollander and Wolfe and also at

### Questions From Efron and Tibshirani

4.1, 4.2, 5.2, and 5.4

The file

contains two variables `x`, which is a random sample from a standard normal distribution, and `y`, which is a random sample from a standard Cauchy distribution.

1. For the variable `x` compute

   1. the sample mean
   2. the sample median
   3. the sample 25% trimmed mean, for which the R statement is `mean(x, trim=0.25)`

   For each of these three point estimates, calculate a bootstrap standard error (use a bootstrap sample size of at least 1000).

2. Repeat part 1 for the variable `y`

3. Using the fact that `x` is actually a random sample from a normal distribution, we know the asymptotic relative efficiency (ARE) of the first two estimators in part 1 (the sample mean and the sample median). (See the efficiency page for details.) What is it, and how close to it is the square of the ratio of bootstrap standard errors?

4. Using the fact that `y` is actually a random sample from a Cauchy distribution, also denoted t(1) because it is the Student t distribution for one degree of freedom, we know the asymptotic relative efficiency (ARE) of the first two estimators in part 2 (the sample mean and the sample median). (See the efficiency page for details.) What is it, and how close to it is the square of the ratio of bootstrap standard errors?
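
The bootstrap standard errors for all three estimators can be obtained from one set of resamples, which also makes the ratio of standard errors less noisy. A minimal sketch, assuming a simulated normal sample in place of the course data file (the vector `x` and the list `estimators` are hypothetical names):

```r
# Hypothetical stand-in for the course data
set.seed(7)
x <- rnorm(50)
nboot <- 1000

estimators <- list(
    mean = mean,
    median = median,
    trimmed = function(v) mean(v, trim = 0.25)
)

# Each column holds the three estimates from one bootstrap resample
boot.est <- replicate(nboot, {
    x.star <- sample(x, replace = TRUE)
    sapply(estimators, function(f) f(x.star))
})

# Bootstrap standard error of each estimator
boot.se <- apply(boot.est, 1, sd)
print(boot.se)

# Squared ratio of standard errors, to compare with the known ARE
ratio.sq <- unname((boot.se["mean"] / boot.se["median"])^2)
print(ratio.sq)
```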

Note: answers to Problem 11.36 in H&W and the additional question are here.

## Homework Assignment 5, Due Mon, Nov 10, 2003

### Question 1

The data set `LakeHuron` in the `ts` package gives annual measurements of the level, in feet, of Lake Huron 1875-1972. As with all data sets included in R the usage is

```
library(ts)
data(LakeHuron)
```

Then the name of the time series is `LakeHuron`, for example,

```
plot(LakeHuron)
```

does a time series plot.

The average water level over this period was

```
> mean(LakeHuron)
[1] 579.0041
```

Obtain a standard error for this estimate using the subsampling bootstrap with subsample length b = 10. Assume that the sample mean obeys the square root law (that is, the rate is square root of n).
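
A minimal sketch of the calculation, assuming overlapping blocks of consecutive observations as the subsamples and the square root law for the rescaling. The variable names are made up for illustration, and this is not presented as the official solution.

```r
# In current versions of R, LakeHuron is available without loading a package
x <- as.numeric(LakeHuron)
n <- length(x)
b <- 10
theta.hat <- mean(x)

# One subsample mean for each block of b consecutive years
starts <- 1:(n - b + 1)
theta.star <- sapply(starts, function(i) mean(x[i:(i + b - 1)]))

# Under the square root law, sd(theta.star) estimates the standard error
# at sample size b, so rescale by sqrt(b / n) to get sample size n
se.hat <- sqrt(b / n) * sd(theta.star)
print(se.hat)
```

Blocks of consecutive observations are used, rather than random subsets, because the subsamples must preserve the serial dependence of the time series.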

### Question 2

There seems no reason other than historical to use the `lms` and `lqs` options. LMS estimation is of low efficiency (converging at rate `n^(-1/3)`) whereas LTS has the same asymptotic efficiency as an M estimator with trimming at the quartiles (Marazzi, 1993, p. 201). LQS and LTS have the same maximal breakdown value of `(floor((n-p)/2) + 1)/n` attained if `floor((n+p)/2) <= quantile <= floor((n+p+1)/2)`. The only drawback mentioned of LTS is greater computation, as a sort was thought to be required (Marazzi, 1993, p. 201) but this is not true as a partial sort can be used (and is used in this implementation).

Thus it seems that LMS regression is the Wrong Thing (with a capital W and a capital T), something that survives only for historico-sociological reasons: it was invented first and most people that have heard of robust regression at all have only heard of it. To be fair to Efron and Tibshirani, the literature cited in the documentation for `lmsreg` is the same vintage as their book. So maybe they thought they were using the latest and greatest.

Anyway, the problem is to redo the LMS examples using LTS.

What changes? Does LTS really work better here than LMS? Describe the differences you see and why you think these differences indicate LTS is better (or worse, if that is what you think) than LMS.

Note: LTS does take longer (about 45 seconds for `nboot = 1000`), so a smaller `nboot` might be justifiable.
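
The switch from LMS to LTS is a one-argument change in `lqs` from the `MASS` package. The sketch below is only an illustration under stated assumptions: the data frame `dat` is simulated with a few planted outliers, it is not the data from the LMS examples, and `nboot` is kept small so that it runs quickly.

```r
library(MASS)

# Hypothetical regression data with a few gross outliers
set.seed(8)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40)
y[1:4] <- y[1:4] + 30
dat <- data.frame(x = x, y = y)

# The same call with the two estimation methods
fit.lts <- lqs(y ~ x, data = dat, method = "lts")
fit.lms <- lqs(y ~ x, data = dat, method = "lms")
print(rbind(lts = coef(fit.lts), lms = coef(fit.lms)))

# Bootstrap standard error of the slope, resampling cases
nboot <- 200
boot.slope <- function(method) {
    replicate(nboot, {
        k.star <- sample(nrow(dat), replace = TRUE)
        coef(lqs(y ~ x, data = dat[k.star, ], method = method))[2]
    })
}
se <- c(lts = sd(boot.slope("lts")), lms = sd(boot.slope("lms")))
print(se)
```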

### Question 3

The file

contains a vector `x` of data from a heavy tailed distribution such that the sample mean has rate of convergence `n^(1 / 3)`, that is

`n^(1 / 3) * (theta.hat - theta)`

has nontrivial asymptotics (nontrivial here meaning it doesn't converge to zero in probability and also is bounded in probability, so `n^(1 / 3)` is the right rate) where `theta.hat` is the sample mean and `theta` is the true unknown population mean.

When the sample mean behaves as badly as this, the sample variance behaves even worse (it converges in probability to infinity), but robust measures of scale make sense, for example, the interquartile range (calculated by the `IQR` function in R, note that the capital letters are not a mistake).

1. Using subsample size b = 20, do a subsampling bootstrap to estimate the distribution of the sample mean.

2. Make a histogram of `theta.star` marking the point `theta.hat`.

3. Make a histogram of `b^(1 / 3) * (theta.star - theta.hat)` which is the analog in the bootstrap world of the distribution of the quantity having nontrivial asymptotics displayed above.

4. Calculate the subsampling bootstrap estimate of the IQR of `theta.hat`, rescaling by the ratio of rates in the appropriate fashion.
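
The steps above can be sketched as follows, assuming independent observations so that subsamples are drawn at random without replacement. The data are simulated: symmetrized Pareto-type draws with tail index 1.5 are an assumption chosen only because they give a sample mean with rate `n^(1 / 3)`, and they are not claimed to be the distribution behind the course data file.

```r
# Hypothetical heavy tailed sample, simulated only for illustration
set.seed(9)
n <- 500
x <- sample(c(-1, 1), n, replace = TRUE) * runif(n)^(-1 / 1.5)

b <- 20
nboot <- 2000
theta.hat <- mean(x)

# Subsampling: samples of size b WITHOUT replacement
theta.star <- replicate(nboot, mean(sample(x, b, replace = FALSE)))

# Bootstrap-world analog of n^(1 / 3) * (theta.hat - theta)
z.star <- b^(1 / 3) * (theta.star - theta.hat)

# IQR(theta.star) is on the scale of sample size b; rescale by the
# ratio of rates to get the scale of sample size n
iqr.hat <- (b / n)^(1 / 3) * IQR(theta.star)
print(iqr.hat)
```

The two histograms asked for are then `hist(theta.star)` followed by `abline(v = theta.hat)`, and `hist(z.star)`.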

## Homework Assignment 6, Due Mon, Nov 24, 2003

### Question 1

The file

contains one variable `x`, which is a random sample from a gamma distribution.

The coefficient of skewness of a distribution is the third central moment divided by the cube of the standard deviation (this gives a dimensionless quantity, that is zero for any symmetric distribution, positive for distributions with long right tail, and negative for distributions with long left tail).

It can be calculated by the R function defined by

```
skew <- function(x) {
    xbar <- mean(x)
    mu2.hat <- mean((x - xbar)^2)   # second central moment
    mu3.hat <- mean((x - xbar)^3)   # third central moment
    mu3.hat / sqrt(mu2.hat)^3
}
```

1. Find a 95% confidence interval for the true unknown population coefficient of skewness that is just the sample coefficient of skewness plus or minus 1.96 bootstrap standard errors.

2. Find a 95% confidence interval for the true unknown population coefficient of skewness having the second order accuracy property using the `boott` function.

Note: Since you have no idea about how to write an `sdfun` that will variance stabilize the coefficient of skewness, you will have to use one of the other two methods described on the bootstrap t page.

3. Find a 95% confidence interval for the true unknown population coefficient of skewness using the bootstrap percentile method.

4. Find a 95% confidence interval for the true unknown population coefficient of skewness using the BCa (alphabet soup, type 1) method.

5. Find a 95% confidence interval for the true unknown population coefficient of skewness using the ABC (alphabet soup, type 2) method.

Note: this will require that you write a quite different `skew` function, starting

```
skew <- function(p, x) {
```
(and you have to fill in the rest of the details, which should, I hope, be clear enough from the discussion of our ABC example).
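
The two simplest intervals in this question (parts 1 and 3) come from the same set of bootstrap replicates. A minimal sketch, assuming a simulated gamma sample in place of the course data file. The shape parameter and sample size below are arbitrary choices for illustration. The intervals from `boott`, BCa, and ABC are not covered here.

```r
skew <- function(x) {
    xbar <- mean(x)
    mu2.hat <- mean((x - xbar)^2)
    mu3.hat <- mean((x - xbar)^3)
    mu3.hat / sqrt(mu2.hat)^3
}

# Hypothetical stand-in for the course data
set.seed(10)
x <- rgamma(60, shape = 2)
nboot <- 2000

theta.hat <- skew(x)
theta.star <- replicate(nboot, skew(sample(x, replace = TRUE)))

# Part 1: point estimate plus or minus 1.96 bootstrap standard errors
ci.normal <- theta.hat + c(-1, 1) * 1.96 * sd(theta.star)

# Part 3: bootstrap percentile interval
ci.percentile <- unname(quantile(theta.star, c(0.025, 0.975)))

print(rbind(normal = ci.normal, percentile = ci.percentile))
```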

### Question 2

The file

contains one variable `x`, which is a random realization of an AR(1) time series.

The sample mean of the time series obeys the square root law, that is,

`sqrt(n) * (theta.hat - theta)`

is asymptotically normal, where `theta.hat` is the sample mean for sample size `n` and `theta` is the true unknown population mean.

1. Find a 95% confidence interval for the true unknown population mean that is just the sample mean plus or minus 1.96 bootstrap standard errors.

2. Find a 95% confidence interval for the true unknown population mean using the method of Politis and Romano described in the handout and on the second web page on subsampling.

Use subsample size 50 for both parts (you can use the same samples).
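
Both intervals can be computed from the same subsample means. The sketch below is an illustration under stated assumptions: the series is simulated with `arima.sim` (the AR coefficient 0.5 and the mean 5 are arbitrary), subsamples are overlapping blocks of consecutive observations, and the second interval uses the quantiles of the rescaled subsampling distribution, which is one common way to write the Politis and Romano interval and may differ in detail from the handout.

```r
# Hypothetical AR(1) series, simulated only for illustration
set.seed(11)
n <- 1000
x <- as.numeric(arima.sim(list(ar = 0.5), n = n)) + 5

b <- 50
theta.hat <- mean(x)
starts <- 1:(n - b + 1)
theta.star <- sapply(starts, function(i) mean(x[i:(i + b - 1)]))

# Part 1: sample mean plus or minus 1.96 subsampling standard errors
se.hat <- sqrt(b / n) * sd(theta.star)
ci.se <- theta.hat + c(-1, 1) * 1.96 * se.hat

# Part 2: quantiles of sqrt(b) * (theta.star - theta.hat) estimate those of
# sqrt(n) * (theta.hat - theta); invert to get an interval for theta
z.star <- sqrt(b) * (theta.star - theta.hat)
ci.pr <- unname(theta.hat - quantile(z.star, c(0.975, 0.025)) / sqrt(n))

print(rbind(plus.minus = ci.se, politis.romano = ci.pr))
```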

### Question 3

The file

contains a vector `x` of data that are a random sample from a heavy tailed distribution such that the sample mean has rate of convergence `n^(1 / 3)`, that is

`n^(1 / 3) * (theta.hat - theta)`

has nontrivial asymptotics (nontrivial here meaning it doesn't converge to zero in probability and also is bounded in probability, so `n^(1 / 3)` is the right rate) where `theta.hat` is the sample mean for sample size `n` and `theta` is the true unknown population mean.

This is the same distribution as was used for Homework 5, Problem 3, but a much larger sample.

1. Find a 95% confidence interval for the true unknown population mean using the sample mean as the point estimator and using the subsampling bootstrap with subsample size b = 100 by the method of Politis and Romano using the known rate `n^(1 / 3)`.

2. Find a 95% confidence interval for the true unknown population mean using the sample mean as the point estimator and using the subsampling bootstrap with subsample size b = 100 by the method of Politis and Romano estimating the rate (pretending you have been told nothing about the rate). Report both your rate estimate and your confidence interval.

## Homework Assignment 7, Due Wed, Dec 10, 2003

### Question 1

The file

contains two variables `x` and `y`, which are made up regression data.

1. Run each of the four smoothers done by the R functions
• `locpoly`
• `smooth.spline`
• `gam`
• `sm.regression`
using automatic selection of the smoothing parameter, as described on the bandwidth selection web page.

2. Report the smoothing parameter used by each method and what this smoothing parameter purports to be (you may have to grovel around in the on-line help for these functions; follow the links on the bandwidth selection web page).
3. Hand in either (1) four plots showing each of the four smooths or (2) one plot showing all four smooths.
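
As an illustration of what "reporting the smoothing parameter" looks like for one of the four smoothers, the sketch below fits `smooth.spline`, which is in the standard `stats` package, with its two automatic selection criteria. The data are simulated and hypothetical. The other three functions live in add-on packages and are not shown.

```r
# Hypothetical regression data, simulated only for illustration
set.seed(12)
x <- sort(runif(100, 0, 2 * pi))
y <- sin(x) + rnorm(100, sd = 0.3)

# Default: smoothing parameter chosen by generalized cross-validation
fit.gcv <- smooth.spline(x, y)

# Alternative: ordinary leave-one-out cross-validation
fit.cv <- smooth.spline(x, y, cv = TRUE)

# The selected smoothing parameter on its different scales
print(c(lambda = fit.gcv$lambda, spar = fit.gcv$spar, df = fit.gcv$df))
print(c(lambda = fit.cv$lambda, spar = fit.cv$spar, df = fit.cv$df))

# Fitted curve at the observed x values, ready for plotting
y.hat <- predict(fit.gcv, x)$y
```

Here `lambda` is the penalty multiplier, `spar` is a monotone reparametrization of it, and `df` is the equivalent degrees of freedom of the fit, so the same choice can be reported in three ways.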

### Question 2

The file

contains a two by two contingency table of data from a study of whether snoring and disturbing dreams are associated. The row labels of the table are values of a categorical variable indicating the presence or absence of snoring. The column labels of the table are values of a categorical variable measuring the frequency of nightmares.

Because of the way Rweb reads data, it does not get read in a sensible form. The following R statements will fix it up.

```
x <- as.matrix(X[ , -1])
dimnames(x)[[1]] <- X[ , 1]
print(x)
```

(the last statement just showing us what we have in `x`).

We want to do a chi-square test of independence, which is done by the R command

```
chisq.test(x)
```

But as the command itself says in its printout

Chi-squared approximation may be incorrect in: `chisq.test(x)`

So your job is to get a valid P-value using the parametric bootstrap. Use at least nboot = 100,000 bootstrap samples for your bootstrap. Also interpret the P-value.
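
A minimal sketch of one way to set up the parametric bootstrap, under stated assumptions: the table `x` below is made up and is not the snoring data, the null model is multinomial sampling with cell probabilities given by the product of the estimated margins, and `nboot` is reduced so the example runs quickly. The helper `chisq.stat` is a hypothetical name.

```r
# Hypothetical 2 by 2 table with small counts
set.seed(13)
x <- matrix(c(11, 3, 4, 5), 2, 2)
n <- sum(x)

# Pearson chi-square statistic without continuity correction;
# cells with zero expected count contribute zero
chisq.stat <- function(tab) {
    e <- outer(rowSums(tab), colSums(tab)) / sum(tab)
    sum(ifelse(e > 0, (tab - e)^2 / e, 0))
}
stat.hat <- chisq.stat(x)

# Estimated cell probabilities under independence
p.hat <- outer(rowSums(x), colSums(x)) / n^2

# Simulate tables from the fitted null model and recompute the statistic
nboot <- 1e4
tabs <- rmultinom(nboot, n, as.vector(p.hat))
stat.star <- apply(tabs, 2, function(v) chisq.stat(matrix(v, 2, 2)))

# Bootstrap P-value (small tolerance guards against rounding in ties)
p.value <- mean(stat.star >= stat.hat - 1e-8)
print(p.value)
```

This differs from `chisq.test(x, simulate.p.value = TRUE)`, which simulates tables with both margins held fixed rather than from the fitted multinomial model.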