## General Instructions

The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructor.

You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:

• For simple computer commands, you may just write down the command you used and the result it gave on your exam solution.

• For complicated commands or plots, make a printout and attach the printout to your exam solution.

No credit for numbers with no indication of where they came from!

## Question 1 [20 pts.]

The data for this problem are at the URL

With that URL given to Rweb, two variables `before` and `after` are loaded. (If you are using R at home, see the footnote about reading these data into R.)

These data come from a matched pairs experiment. Before and after values in the same row are for the same individual (hence may be correlated). Data in different rows are for different individuals and are assumed independent and identically distributed.

Individuals are students, and pairs are scores on similar tests taken before and after a coaching session by each student. The scientists want to test the hypothesis that the coaching session increases test score.

1. Do an appropriate sign test for these data. Get an exact P-value. Interpret the P-value, stating whether the test indicates the coaching does or does not increase scores.
2. Find the point estimate for the difference before and after coaching that is the Hodges-Lehmann estimator associated with the sign test.
3. Find the fuzzy P-value corresponding to the exact P-value you obtained in part 1. Interpret the fuzzy P-value and compare it with the exact one.
4. Would these data be appropriate for a Wilcoxon signed rank test? Make a histogram or a stem-and-leaf plot and explain what features of your plot indicate that a signed rank test would or would not be appropriate (as the case may be). The R function `hist` (on-line help) makes histograms. The R function `stem` (on-line help) makes stem-and-leaf plots.
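The mechanics of the sign test and its associated point estimate can be sketched in base R as follows. This is only an illustration under stated assumptions: the `before` and `after` values are made up (they are not the exam data), zero differences are simply dropped before the test, and the fuzzy P-value of part 3 is not covered.

```r
# Hypothetical matched-pairs scores, NOT the exam data.
before <- c(62, 70, 55, 81, 66, 74, 59, 68, 77, 63)
after  <- c(66, 69, 61, 85, 66, 80, 64, 67, 83, 70)

z <- after - before          # paired differences
z_nonzero <- z[z != 0]       # assumption: zero differences are dropped

# Upper-tailed sign test: under the null hypothesis the number of
# positive differences is Binomial(n, 1/2).
n_pos <- sum(z_nonzero > 0)
sign_test <- binom.test(n_pos, length(z_nonzero), p = 0.5,
                        alternative = "greater")
p_value <- sign_test$p.value

# The Hodges-Lehmann estimator associated with the sign test is the
# sample median of the differences.
hl_estimate <- median(z)

# Plots for judging symmetry of the differences.
stem(z)
hist(z, main = "Differences (after - before)", xlab = "Difference")
```

A roughly symmetric plot of the differences is what the signed rank test assumes; strong skewness or a long tail on one side would argue against it.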

## Question 2 [20 pts.]

The data for this problem are at the URL

With that URL given to Rweb, two variables `treatment` and `control` are loaded. (If you are using R at home, see the footnote about reading these data into R.)

These data are records of hours of sleep for two groups. You may assume all the measurements are independent and identically distributed.

The treatment is advice on how to get more sleep. The control group got no such advice but was otherwise treated the same as the other group.

1. Describe the assumptions required for these data to be appropriate for using the confidence interval associated with the Wilcoxon rank sum test.
2. Describe the parameter that this confidence interval estimates.
3. Compute this confidence interval, making the confidence level as close to 95% as you can make it (either above or below, whichever is closer). State the exact confidence level of your interval.
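One way to see where the achievable confidence levels come from is to build the interval directly from the sorted pairwise differences. The sketch below uses made-up sleep values (not the exam data), assumes no ties, and assumes the usual shift model; the names `diffs`, `level`, and `best` are illustrative only.

```r
# Hypothetical hours of sleep, NOT the exam data.
treatment <- c(7.6, 8.1, 6.9, 7.4, 8.4, 7.9, 7.2, 8.8)
control   <- c(6.5, 7.0, 6.1, 7.3, 6.8, 7.7, 6.4, 7.1)

m <- length(treatment)
n <- length(control)

# All m * n pairwise differences, sorted.
diffs <- sort(outer(treatment, control, "-"))

# The interval from the k-th smallest to the k-th largest difference
# has exact coverage 1 - 2 * P(U <= k - 1), where U has the null
# distribution of the Mann-Whitney statistic.
k <- 1:floor(m * n / 2)
level <- 1 - 2 * pwilcox(k - 1, m, n)

# Pick the k whose exact level is closest to 95%, above or below.
best <- k[which.min(abs(level - 0.95))]
ci <- c(diffs[best], diffs[m * n + 1 - best])
exact_level <- level[best]
```

Note that `wilcox.test(treatment, control, conf.int = TRUE)` chooses the level at or above the nominal one, which is not necessarily the closest achievable level.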

## Question 3 [20 pts.]

The data for this problem are at the URL

With that URL given to Rweb, one variable `weight` is loaded. (If you are using R at home, see the footnote about reading these data into R.)

These data are a random sample of weights in kilograms of men at the university (so one may assume they are independent and identically distributed). We wish to do a nonparametric test of whether these data have a normal distribution. Carry out the test. State the P-value. Interpret the P-value. Make it clear whether or not you think the test indicates these data are normally distributed.
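One possible approach, sketched here under assumptions, is a Lilliefors-type test: a Kolmogorov-Smirnov statistic computed after standardizing by the sample mean and standard deviation, with its null distribution approximated by simulation. The `weight` values are made up (not the exam data), the helper `ks_stat` and the simulation size `nsim` are illustrative choices, and this may not be the procedure the course intends.

```r
set.seed(5601)  # arbitrary seed so the simulation is reproducible

# Hypothetical weights in kilograms, NOT the exam data.
weight <- c(68.2, 74.5, 81.3, 59.8, 77.1, 90.4, 72.6, 66.9,
            85.0, 70.3, 79.7, 63.4, 95.2, 73.8, 69.1)

# Kolmogorov-Smirnov distance between the standardized sample and the
# standard normal. Because mean and sd are estimated from the data, the
# usual ks.test P-value is not valid, so only the statistic is used.
ks_stat <- function(x) {
  z <- (x - mean(x)) / sd(x)
  unname(ks.test(z, "pnorm")$statistic)
}

d_obs <- ks_stat(weight)

# Simulate the null distribution of the statistic under normality.
nsim <- 2000
d_sim <- replicate(nsim, ks_stat(rnorm(length(weight))))

# Monte Carlo P-value, counting the observed statistic as one draw.
p_value <- (sum(d_sim >= d_obs) + 1) / (nsim + 1)
```

A large P-value means only that the test found no evidence against normality, not that normality has been demonstrated.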

## Question 4 [20 pts.]

The data for this problem are at the URL

With that URL given to Rweb, two variables `practice` and `race` are loaded. (If you are using R at home, see the footnote about reading these data into R.)

The data in each row are for one individual: `practice` is practice time in hours and `race` is race time in minutes. Data on different individuals can be assumed independent and identically distributed.

The question of scientific interest is whether the variables `practice` and `race` have a significant association and, if so, what a confidence interval for the strength of that association is.

1. Calculate Kendall's tau for these data.

2. Perform the test based on Kendall's tau of the null hypothesis of independence of `practice` and `race` versus the alternative of negative dependence (practice should lower race times). Find the P-value and interpret it.

3. What kind of association does Kendall's tau measure? Does this influence your interpretation of the test?
4. Find an approximate 95% confidence interval for the true unknown parameter value (the population tau).
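The first two parts can be sketched in base R with `cor` and `cor.test`. The values below are made up (not the exam data) and contain no ties; the confidence interval of part 4 is deliberately left out of this sketch.

```r
# Hypothetical practice times (hours) and race times (minutes),
# NOT the exam data.
practice <- c(2, 5, 1, 8, 4, 10, 3, 7, 6, 9)
race     <- c(31.2, 28.9, 32.5, 26.1, 29.8, 25.4, 30.1, 27.7, 28.0, 26.5)

# Sample Kendall's tau: (concordant pairs - discordant pairs) divided
# by the total number of pairs, when there are no ties.
tau_hat <- cor(practice, race, method = "kendall")

# Lower-tailed test: the alternative is negative dependence,
# that is, more practice goes with shorter race times.
kt <- cor.test(practice, race, method = "kendall", alternative = "less")
p_value <- kt$p.value
```

Kendall's tau depends only on the ordering of the pairs, so it measures monotone association rather than linear association.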

## Question 5 [20 pts.]

The data for this problem are at the URL

With that URL given to Rweb, three variables `time`, `event`, and `arm` are loaded. (If you are using R at home, see the footnote about reading these data into R.)

The data (made up) are from a clinical trial of radiation therapy for cancer. The trial had three arms (three different levels of radiation treatment, labeled light, moderate, and severe). The variable `time` is the time (in days) to relapse. The variable `event` is the censoring indicator: 1 indicates that the relapse was observed at the time given by `time`, and 0 indicates that the patient was lost to follow-up at that time, before any relapse was observed.

1. Plot the Kaplan-Meier estimator of the survival curve for each arm of the trial. You may do them all on the same plot or one plot per arm. Either way make clear which curve goes with which arm.

2. Are any of the differences among the three arms statistically significant? (Do a hypothesis test that addresses this issue and explain the result.)

3. Find a point estimate and a 95% confidence interval for the probability of relapse-free survival for at least 1826 days (5 years).
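A minimal sketch of these three steps, assuming the `survival` package (a recommended package shipped with standard R installations) is available. The data frame `dat` is made up for illustration and is not the exam data; the sketch reports the 1826-day estimate separately for each arm, which is one reading of part 3.

```r
library(survival)

# Hypothetical trial data, NOT the exam data.
dat <- data.frame(
  time  = c(310, 720, 1100, 1500, 1900, 2300,
            250, 600,  900, 1300, 1850, 2100,
            150, 400,  700, 1000, 1400, 2000),
  event = c(1, 0, 1, 1, 0, 1,
            1, 1, 0, 1, 1, 0,
            1, 1, 1, 0, 1, 1),
  arm   = rep(c("light", "moderate", "severe"), each = 6)
)

# Kaplan-Meier estimate for each arm, all drawn on one plot.
fit <- survfit(Surv(time, event) ~ arm, data = dat)
plot(fit, lty = 1:3, xlab = "Days", ylab = "Relapse-free survival")
legend("topright", legend = names(fit$strata), lty = 1:3)

# Log-rank test of the null hypothesis that all arms share one
# survival curve; chi-square with (number of arms - 1) degrees of freedom.
lr <- survdiff(Surv(time, event) ~ arm, data = dat)
p_logrank <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)

# Point estimate and 95% confidence limits at 1826 days, per arm.
est <- summary(fit, times = 1826)
five_year <- data.frame(arm = est$strata, surv = est$surv,
                        lower = est$lower, upper = est$upper)
```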

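## Footnote

If you are doing Problem 1 in R rather than Rweb, you will have to duplicate what Rweb does when it reads in the URL at the beginning. So, all together, you must do

```
# Read the data at the URL into a data frame; the first line holds the variable names.
X <- read.table(url("http://www.stat.umn.edu/geyer/5601/mydata/t1p1.txt"),
    header = TRUE)
# Make the columns of X available as ordinary variables.
attach(X)
# List the variable names that were loaded.
names(X)
```

to produce the two variables `before` and `after` needed for your analysis. Do the same for the other problems.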
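As an alternative to `attach`, the columns can be used straight from the data frame. The sketch below reads a small inline table through the `text` argument of `read.table` so that it runs without a network connection; the three rows are made up and stand in for the real file.

```r
# Hypothetical inline data standing in for the file at the URL.
txt <- "before after
62 66
70 69
55 61"

X <- read.table(text = txt, header = TRUE)
names(X)

# Use the columns without attaching the data frame.
z <- with(X, after - before)
```

This avoids leaving attached variables on the search path, which can mask variables of the same name from another problem.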