General Instructions

The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructor.

You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:

No credit for numbers with no indication of where they came from!

Question 1 [20 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p1.txt

With that URL given to Rweb, two variables treatment and control are loaded. (If you are using R at home, see the footnote about reading this data into R).

Suppose these data come from a matched pairs experiment. Treatment and control values in the same row are for the same individual (hence may be correlated). Data in different rows are for different individuals and are assumed independent and identically distributed. For concreteness you may assume these data come from a crossover trial, where each individual is given both the treatment and the control (placebo) at different times. In the first half of the trial some individuals get treatment and others control. In the second half they switch. Assume high values of the response are better, so if the treatment numbers are generally larger than the control numbers, that is what the scientists want to see. No one is interested in the possibility that treatment is worse than placebo.

  a. Do an appropriate sign test for these data. Get an exact P-value. Interpret the P-value.
  b. Find the point estimate for difference in location of the two groups that is the Hodges-Lehmann estimator associated with the sign test.
  c. Find the fuzzy P-value corresponding to the exact P-value you obtained in part (a). Interpret the fuzzy P-value and compare it with the exact one.
  d. Same as part (a) except substitute signed rank test for sign test.
  e. Same as part (b) except substitute signed rank test for sign test.
  f. Same as part (c) except substitute signed rank test for sign test.
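
As a reminder of the mechanics only, here is a minimal sketch of an upper-tailed sign test and the two Hodges-Lehmann estimators in R. The data are simulated (they are not the exam data), and the names d, z, n, b, p.exact, hl.sign, and hl.signrank are made up for this illustration. Dropping zero differences is one conventional way of handling them; the fuzzy P-value is a different treatment and is not shown here.

```r
# Simulated matched pairs (NOT the exam data); variable names mirror the problem.
set.seed(42)
control <- rnorm(20)
treatment <- control + rnorm(20, mean = 0.5)

d <- treatment - control       # paired differences
z <- d[d != 0]                 # drop zero differences (one conventional choice)
n <- length(z)
b <- sum(z > 0)                # number of positive differences

# Upper-tailed exact P-value: P(B >= b) where B is Binomial(n, 1/2)
p.exact <- pbinom(b - 1, n, 1 / 2, lower.tail = FALSE)
p.exact

# Hodges-Lehmann estimator associated with the sign test:
# the median of the differences
hl.sign <- median(d)

# Hodges-Lehmann estimator associated with the signed rank test:
# the median of the Walsh averages (d[i] + d[j]) / 2 for i <= j
w <- outer(d, d, "+") / 2
hl.signrank <- median(w[lower.tri(w, diag = TRUE)])

c(sign = hl.sign, signrank = hl.signrank)

# Signed rank test with the same one-sided alternative
wilcox.test(treatment, control, paired = TRUE, alternative = "greater")
```

The one-sided alternative matches the statement that only treatment better than placebo is of interest; a two-sided problem would need a different tail calculation.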

Question 2 [20 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p2.txt

With that URL given to Rweb, two variables x and y are loaded. (If you are using R at home, see the footnote about reading this data into R).

You may consider x and y to be independent random samples from two different populations. The question of interest is whether the distribution of (whatever was measured) is the same in both populations or different. The variable y is padded with NA values to be the same length as x. You may need to remove these to do some procedures.

  a. Use the appropriate Wilcoxon test for this situation. Since the functions psignrank and pwilcox can crash R when used with large sample sizes, they should not be used for this problem. Let wilcox.test or wilcox.exact use an asymptotic, large sample approximation. Get an approximate P-value. Interpret the P-value.
  b. Same as part (a) except replace Wilcoxon test with Kolmogorov-Smirnov test.
  c. Write a short discussion of why the results of (a) and (b) are so different.
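
The NA padding mentioned above can be handled as in the following sketch. The data are simulated and are not meant to resemble the exam data; the names wout and kout are made up for this illustration. The argument exact = FALSE asks wilcox.test for the large sample normal approximation.

```r
# Simulated two-sample data (NOT the exam data)
set.seed(1)
x <- rnorm(60)
y <- c(rnorm(40, mean = 0.5), rep(NA, 20))   # padded with NA to the length of x

y <- y[!is.na(y)]              # remove the padding before testing

# Two-sample Wilcoxon (rank sum) test using the normal approximation
wout <- wilcox.test(x, y, exact = FALSE)
wout$p.value

# Two-sample Kolmogorov-Smirnov test
kout <- ks.test(x, y)
kout$p.value
```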

Question 3 [15 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p3.txt

With that URL given to Rweb, one variable x is loaded. (If you are using R at home, see the footnote about reading this data into R).

You may consider these data a random sample from some population. Calculate the confidence interval for the population median associated with the sign test. Report the actual achieved confidence level. Use the interval that comes the closest to 95% confidence (the one just above 95% or just below 95%, whichever is closer).
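
A minimal sketch of the calculation, on simulated data rather than the exam data: for a sample of size n from a continuous distribution, the interval between the k-th smallest and k-th largest order statistics has confidence level 1 - 2 P(B <= k - 1), where B is Binomial(n, 1/2). The names k, level, k.best, and xs are made up for this illustration.

```r
# Simulated sample (NOT the exam data)
set.seed(7)
x <- rnorm(25)
n <- length(x)

# Achieved confidence level of (X_(k), X_(n - k + 1)) for each candidate k
k <- seq_len(floor(n / 2))
level <- 1 - 2 * pbinom(k - 1, n, 1 / 2)

# The candidate whose achieved level is closest to 95%
k.best <- k[which.min(abs(level - 0.95))]

xs <- sort(x)
c(lower = xs[k.best], upper = xs[n - k.best + 1], level = level[k.best])
```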

Question 4 [15 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p4.txt

With that URL given to Rweb, two variables fruit and seeds are loaded. (If you are using R at home, see the footnote about reading this data into R).

The data in each row are measurements on one plant: fruit is the number of fruits produced in one year and seeds is the number of seeds found in a random sample of three fruits (individuals with fewer than three fruits to count were removed from the data). Data in different rows are on different plants and can be assumed independent and identically distributed.

The question of scientific interest is whether the variables fruit and seeds have significant association. The scientists would like to model them as independent variables. Is this justified?

  a. Calculate Kendall's tau for these data.
  b. Perform the test based on Kendall's tau of the null hypothesis of independence of fruit and seeds versus the alternative of dependence. Find the P-value and interpret it.
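
A sketch of both steps on simulated counts (not the exam data). Count data usually contain ties, so this example sets exact = FALSE to request the normal approximation; that choice, and the name kout, are assumptions of the illustration rather than requirements of the problem.

```r
# Simulated count data (NOT the exam data); names mirror the problem.
set.seed(3)
fruit <- rpois(30, 20)
seeds <- rpois(30, 50)

# Kendall's tau and the associated two-sided test of independence
kout <- cor.test(fruit, seeds, method = "kendall", exact = FALSE)
kout$estimate
kout$p.value
```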

Question 5 [15 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p5.txt

With that URL given to Rweb, two variables treatment, which is categorical, and yield, which is numeric, are loaded. (If you are using R at home, see the footnote about reading this data into R).

There are six treatments. The six groups are supposed to be independent random samples from six populations. The question of scientific interest is whether there are any treatment effects. The null hypothesis is no treatment effects, and the alternative is some.

  a. Perform a Kruskal-Wallis test of whether there are any treatment effects. Report and interpret the P-value.
  b. Actually, the treatment effects, if they exist, should increase in the order of the numbers in the names of the treatment groups. Perform a Jonckheere-Terpstra test of no treatment effects versus an increasing sequence of treatment effects. Report and interpret the P-value.
  c. Explain the differences between the results of parts (a) and (b).
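
The two tests can be sketched as follows on simulated data (not the exam data). Base R has kruskal.test but no Jonckheere-Terpstra function, so this sketch computes the Jonckheere-Terpstra statistic by hand and substitutes a Monte Carlo permutation P-value for the exact or asymptotic one a dedicated function would give. The function jt.stat, the group names T1 to T6, and the other names are hypothetical, and the factor levels are assumed to sort into the intended treatment order.

```r
# Simulated data with six ordered groups (NOT the exam data)
set.seed(5)
treatment <- factor(rep(paste0("T", 1:6), each = 8))
yield <- rnorm(48, mean = 0.3 * as.integer(treatment))

# Kruskal-Wallis test: unordered alternative
kw <- kruskal.test(yield ~ treatment)
kw$p.value

# Jonckheere-Terpstra statistic: sum over group pairs i < j of the
# Mann-Whitney count of (a in group i, b in group j) with a < b,
# counting ties as 1/2. Assumes factor levels are in treatment order.
jt.stat <- function(y, g) {
    g <- as.integer(g)
    total <- 0
    for (i in 1:(max(g) - 1)) {
        for (j in (i + 1):max(g)) {
            diffs <- outer(y[g == j], y[g == i], "-")
            total <- total + sum(diffs > 0) + sum(diffs == 0) / 2
        }
    }
    total
}

# Monte Carlo permutation P-value for the increasing alternative:
# permute the responses among groups and recompute the statistic
jt.obs <- jt.stat(yield, treatment)
jt.sim <- replicate(999, jt.stat(sample(yield), treatment))
p.jt <- (1 + sum(jt.sim >= jt.obs)) / (1 + length(jt.sim))
p.jt
```

Because the P-value is simulated, it changes slightly with the random seed and the number of permutations.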

Question 6 [15 pts.]

The data for this problem are at the URL

http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p6.txt

With that URL given to Rweb, two variables event.time and failed are loaded. (If you are using R at home, see the footnote about reading this data into R).

The data are from a reliability analysis of light bulb lifetime. The variable event.time is the time (in days) that a light bulb was on test. At time event.time[i] the i-th light bulb either burnt out (failed) or was censored (because the test was stopped). The variable failed[i] is 1 if the i-th light bulb failed and 0 if it was censored.

  a. Plot the Kaplan-Meier estimator of the survival curve for the light bulb failure time distribution.
  b. Find the four point estimates for probability that a light bulb lasts 500, 1000, 1500, or 2000 days of use (that its failure time is greater than 500, 1000, 1500, or 2000 days, respectively).
  c. Find the four 95% confidence intervals for the parameters estimated in part (b).
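
A sketch of the workflow using the survival package, assuming it is installed (it is distributed with R as a recommended package). The data are simulated, not the exam data, and the names life, sout, and ssum are made up for this illustration. The confidence intervals are whatever type survfit produces by default; other types can be requested through its conf.type argument.

```r
library(survival)

# Simulated lifetimes, censored when the test stops at 2500 days
# (NOT the exam data); variable names mirror the problem.
set.seed(11)
life <- rexp(50, rate = 1 / 1500)
event.time <- pmin(life, 2500)
failed <- as.integer(life <= 2500)   # 1 = failed, 0 = censored

# Kaplan-Meier estimate with pointwise 95% confidence intervals
sout <- survfit(Surv(event.time, failed) ~ 1)
plot(sout, xlab = "days on test", ylab = "survival probability")

# Point estimates and confidence limits at the requested times
ssum <- summary(sout, times = c(500, 1000, 1500, 2000))
cbind(time = ssum$time, surv = ssum$surv,
      lower = ssum$lower, upper = ssum$upper)
```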

Footnote about Reading Data into R

If you are doing Problem 1 in R rather than Rweb, you will have to duplicate what Rweb does when it reads in the URL at the beginning. So, all together, you must do

X <- read.table(url("http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p1.txt"),
    header = TRUE)
names(X)
attach(X)

This produces the two variables treatment and control needed for your analysis. Do likewise for the other problems, using the URL given in each.