General Instructions

The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructor.

You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:

No credit for numbers with no indication of where they came from!

Question 1 [20 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/5601/mydata/t1p1.txt

With that URL given to Rweb, two variables before and after are loaded. (If you are using R at home, see the footnote about reading this data into R).

These data come from a matched pairs experiment. Before and after values in the same row are for the same individual (hence may be correlated). Data in different rows are for different individuals and are assumed independent and identically distributed.

Individuals are students, and pairs are scores on similar tests taken before and after a coaching session by each student. The scientists want to test the hypothesis that the coaching session increases test score.

  1. Do an appropriate sign test for these data. Get an exact P-value. Interpret the P-value, stating whether the test indicates the coaching does or does not increase scores.
  2. Find the point estimate for the difference before and after coaching that is the Hodges-Lehmann estimator associated with the sign test.
  3. Find the fuzzy P-value corresponding to the exact P-value you obtained in part 1. Interpret the fuzzy P-value and compare it with the exact one.
  4. Would these data be appropriate for a Wilcoxon signed rank test? Make a histogram or a stem-and-leaf plot and explain what features of your plot indicate that a signed rank test would or would not be appropriate (as the case may be). The R function hist makes histograms and the R function stem makes stem-and-leaf plots; see the on-line help for each.
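
The mechanics behind parts 1 and 2 can be sketched roughly as follows. The scores below are made up for illustration and are not the exam data, and dropping zero differences is only one conventional way of handling them. The sign test reduces to a binomial calculation on the number of positive differences, and the associated Hodges-Lehmann estimator is the median of the differences.

```r
# Hypothetical scores, NOT the exam data
before <- c(61, 72, 55, 80, 67, 74, 59, 68, 70, 63)
after  <- c(66, 71, 60, 85, 70, 79, 58, 75, 74, 69)

z <- after - before          # positive means the score went up
z <- z[z != 0]               # assumption: zero differences are dropped
n.pos <- sum(z > 0)
n <- length(z)

# Upper-tailed exact sign test: P(B >= n.pos) for B ~ Binomial(n, 1/2)
p.exact <- pbinom(n.pos - 1, n, 1/2, lower.tail = FALSE)

# The same test through binom.test
p.check <- binom.test(n.pos, n, p = 1/2, alternative = "greater")$p.value

# Hodges-Lehmann estimator associated with the sign test
hl <- median(after - before)
```

The upper tail is used because the alternative of interest is that coaching increases scores.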

Question 2 [20 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/5601/mydata/t1p2.txt

With that URL given to Rweb, two variables treatment and control are loaded. (If you are using R at home, see the footnote about reading this data into R).

These data are records of hours of sleep for two groups. You may assume all the measurements are independent and identically distributed.

The treatment is advice on how to get more sleep. The control group got no such advice but was otherwise treated the same as the other group.

  1. Describe the assumptions required for these data to be appropriate for using the confidence interval associated with the Wilcoxon rank sum test.
  2. Describe the parameter that this confidence interval estimates.
  3. Compute this confidence interval, making the confidence level as close to 95% as you can make it (either above or below, whichever is closer). State the exact confidence level of your interval.
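
A minimal sketch of how such an interval can be obtained, assuming made-up sleep data with no ties (these vectors are hypothetical, not the exam data). The interval endpoints are order statistics of the pairwise differences, and the level achieved by using the k-th smallest and k-th largest difference is 1 - 2 P(W <= k - 1), where W has the null distribution of the rank sum (Mann-Whitney) statistic.

```r
# Hypothetical hours of sleep, NOT the exam data
treatment <- c(7.9, 8.4, 6.8, 7.5, 8.8, 7.2, 8.1)
control   <- c(6.5, 7.0, 6.1, 7.4, 6.9, 5.8)
m <- length(treatment)
n <- length(control)

# All pairwise differences treatment - control, sorted
d <- sort(as.vector(outer(treatment, control, "-")))

# Point estimate: median of the pairwise differences
hl <- median(d)

# Achieved confidence level of the interval (d[k], d[m * n + 1 - k])
k <- 1:floor(m * n / 2)
level <- 1 - 2 * pwilcox(k - 1, m, n)

# Choose the k whose achieved level is closest to 95%
k.best <- k[which.min(abs(level - 0.95))]
interval <- c(d[k.best], d[m * n + 1 - k.best])
level.best <- level[k.best]

# For comparison: wilcox.test reports the same point estimate
out <- wilcox.test(treatment, control, conf.int = TRUE, conf.level = 0.95)
```

Note that wilcox.test is asked for a nominal 95% level, so its interval need not coincide with the closest-to-95% interval computed by hand.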

Question 3 [20 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/5601/mydata/t1p3.txt

With that URL given to Rweb, one variable weight is loaded. (If you are using R at home, see the footnote about reading this data into R).

These data are a random sample of weights in kilograms of men at the university (so one may assume they are independent and identically distributed). We wish to do a nonparametric test of whether these data have a normal distribution. Carry out the test. State the P-value. Interpret the P-value. Make it clear whether or not you think the test indicates these data are normally distributed.
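
One possible approach, sketched here under assumptions, is a Lilliefors-type test: use the Kolmogorov-Smirnov distance between the data and a normal distribution with estimated mean and standard deviation, and get the P-value by simulation, since the usual Kolmogorov-Smirnov P-value is not valid when parameters are estimated. The weights below are made up (not the exam data), the helper name ks.dist is hypothetical, and the course may intend a different procedure.

```r
set.seed(42)                 # arbitrary seed, for reproducibility only

# Hypothetical weights in kilograms, NOT the exam data
weight <- c(68.2, 74.5, 81.3, 77.9, 70.4, 92.6, 85.1, 73.8, 79.0, 66.7,
            88.4, 76.2, 71.5, 83.7, 95.3)

# Kolmogorov-Smirnov distance to the normal with estimated mean and sd
ks.dist <- function(x) {
    unname(ks.test(x, "pnorm", mean(x), sd(x))$statistic)
}

t.obs <- ks.dist(weight)

# The distance is location-scale invariant, so its null distribution
# can be simulated from standard normal samples of the same size
nsim <- 999
t.sim <- replicate(nsim, ks.dist(rnorm(length(weight))))

# Monte Carlo P-value (the observed data count as one simulation)
p.value <- (sum(t.sim >= t.obs) + 1) / (nsim + 1)
```

A large P-value here would mean only that the test fails to detect non-normality, not that normality has been established.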

Question 4 [20 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/5601/mydata/t1p4.txt

With that URL given to Rweb, two variables practice and race are loaded. (If you are using R at home, see the footnote about reading this data into R).

The data in each row are data for one individual: practice is practice time in hours and race is race time in minutes. Data on different individuals can be assumed independent and identically distributed.

The question of scientific interest is whether the variables practice and race have significant association and, if so, what is a confidence interval for the strength of that association?

  1. Calculate Kendall's tau for these data.

  2. Perform the test based on Kendall's tau of the null hypothesis of independence of practice and race versus the alternative of negative dependence (practice should lower race times). Find the P-value and interpret it.

  3. What kind of association does Kendall's tau measure? Does this influence your interpretation of the test?
  4. Find an approximate 95% confidence interval for the true unknown parameter value (the population tau).
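
The calculations in parts 1 and 2 can be sketched as follows, assuming made-up data with no ties (hypothetical values, not the exam data). Kendall's tau is computed both with cor and directly from its definition as the number of concordant pairs minus the number of discordant pairs, divided by the number of pairs.

```r
# Hypothetical data, NOT the exam data
practice <- c(2.0, 3.5, 1.0, 5.0, 4.0, 6.5, 2.5, 7.0)
race     <- c(31.2, 29.8, 33.0, 27.5, 30.1, 26.4, 30.9, 26.9)

# Kendall's tau from the built-in function
tau <- cor(practice, race, method = "kendall")

# One-sided test: the alternative is negative association
out <- cor.test(practice, race, method = "kendall", alternative = "less")
p.value <- out$p.value

# Tau from its definition: (concordant - discordant) / number of pairs
n <- length(practice)
s <- 0
for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
        s <- s + sign(practice[i] - practice[j]) * sign(race[i] - race[j])
    }
}
tau.by.hand <- s / choose(n, 2)
```

The alternative "less" corresponds to tau being negative, that is, more practice going with lower race times.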

Question 5 [20 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/5601/mydata/t1p5.txt

With that URL given to Rweb, three variables time, event, and arm are loaded. (If you are using R at home, see the footnote about reading this data into R).

The data (made up) are from a clinical trial of radiation therapy for cancer. The trial had three arms (three different levels of radiation treatment, labeled light, moderate, and severe). The variable time is the time (in days) to relapse or to loss to follow-up. The variable event is the censoring indicator: 1 indicates that relapse was observed at the time given by time, and 0 indicates that the patient was lost to follow-up at the time given by time, before any relapse was observed.

  1. Plot the Kaplan-Meier estimator of the survival curve for each arm of the trial. You may do them all on the same plot or one plot per arm. Either way make clear which curve goes with which arm.

  2. Are any of the differences among the three arms statistically significant? (Do a hypothesis test that addresses this issue and explain the result.)

  3. Find a point estimate and a 95% confidence interval for the probability of relapse-free survival for at least 1826 days (5 years).
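
A sketch of the three steps using the survival package, which is assumed to be installed (it normally ships with R). The data below are made up for illustration and are not the exam data, and the log-rank test is one reasonable choice of test rather than the only one.

```r
library(survival)            # assumed available; normally ships with R

# Hypothetical data, NOT the exam data
time  <- c(310, 820, 1400, 1900, 2300, 2600,
           250, 600, 1100, 1700, 2000, 2400,
           150, 400, 700, 1200, 1850, 2100)
event <- c(1, 0, 1, 0, 1, 0,
           1, 1, 0, 1, 1, 0,
           1, 1, 1, 0, 1, 1)
arm   <- rep(c("light", "moderate", "severe"), each = 6)

# Kaplan-Meier estimate, one curve per arm
fit <- survfit(Surv(time, event) ~ arm)
plot(fit, lty = 1:3, xlab = "days", ylab = "relapse-free survival")
legend("topright", legend = names(fit$strata), lty = 1:3)

# Log-rank test of equality of the three survival curves
lr <- survdiff(Surv(time, event) ~ arm)
p.value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)

# Estimate and 95% interval for survival at 1826 days, by arm
at5 <- summary(fit, times = 1826)
cbind(surv = at5$surv, lower = at5$lower, upper = at5$upper)
```

The line types in the legend identify which curve goes with which arm, and the last table has one row per arm.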

If you are doing Question 1 in R rather than Rweb, you will have to duplicate what Rweb does when it reads in the URL at the beginning. So, all together, you must do

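# Read the data from the course URL into a data frame
X <- read.table(url("http://www.stat.umn.edu/geyer/5601/mydata/t1p1.txt"),
    header = TRUE)
# Make the columns available as variables, then list their names
attach(X)
names(X)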

to produce the two variables before and after needed for your analysis. Do likewise for the other questions, changing the file name in the URL.