General Instructions
The exam is open book, open notes, open web pages. Do not discuss this exam with anyone except the instructor.
You may use the computer, a calculator, or pencil and paper to get answers, but it is expected that you will use the computer. Show all your work:
- For simple computer commands, you may just write down the command you used and the result it gave on your exam solution.
- For complicated commands or plots, make a printout and attach the printout to your exam solution.
No credit for numbers with no indication of where they came from!
Question 1 [20 pts.]
The data for this problem are at the URL
With that URL given to Rweb, two variables treatment and control are loaded. (If you are using R at home, see the footnote about reading this data into R.)
Suppose these data come from a matched pairs experiment. Treatment and control values in the same row are for the same individual (hence may be correlated). Data in different rows are for different individuals and are assumed independent and identically distributed. For concreteness you may assume these data come from a crossover trial, where each individual is given both the treatment and the control (placebo) at different times. In the first half of the trial some individuals get treatment and others control. In the second half they switch. Assume high values of the response are better, so if the treatment numbers are generally larger than the control numbers, that is what the scientists want to see. No one is interested in the possibility that treatment is worse than placebo.
- Do an appropriate sign test for these data. Get an exact P-value. Interpret the P-value.
- Find the point estimate for difference in location of the two groups that is the Hodges-Lehmann estimator associated with the sign test.
- Find the fuzzy P-value corresponding to the exact P-value you obtained in part (a). Interpret the fuzzy P-value and compare it with the exact one.
- Same as part (a) except substitute signed rank test for sign test.
- Same as part (b) except substitute signed rank test for sign test.
- Same as part (c) except substitute signed rank test for sign test.
Question 2 [20 pts.]
The data for this problem are at the URL
With that URL given to Rweb, two variables x and y are loaded. (If you are using R at home, see the footnote about reading this data into R.)
You may consider x and y to be independent random samples from two different populations. The question of interest is whether the distribution of (whatever was measured) is the same in each population or different. The variable y is padded with NA values to be the same length as x. You may need to remove these to do some procedures.
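The padding can be dropped with ordinary logical indexing. The following is only a sketch: the vectors below are made-up stand-ins, not the exam data, and the name y.clean is a hypothetical choice.

```r
# Hypothetical stand-in data; the real x and y come from the exam URL.
x <- c(1.2, 3.4, 2.2, 5.1, 4.0)
y <- c(2.3, 1.9, 4.4, NA, NA)   # padded with NA to the length of x

# Keep only the non-missing entries of y.
y.clean <- y[!is.na(y)]
length(y.clean)
```

Some R functions drop missing values on their own, but removing them explicitly makes the sample size used in each procedure unambiguous.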
- Use the appropriate Wilcoxon test for this situation. Since the functions psignrank and pwilcox can crash R when used with large sample sizes, they should not be used for this problem. Let wilcox.test or wilcox.exact use an asymptotic, large sample approximation. Get an approximate P-value. Interpret the P-value.
- Same as part (a) except replace Wilcoxon test with Kolmogorov-Smirnov test.
- Write a short discussion of why the results of (a) and (b) are so different.
Question 3 [15 pts.]
The data for this problem are at the URL
With that URL given to Rweb, one variable x is loaded. (If you are using R at home, see the footnote about reading this data into R.)
You may consider these data a random sample from some population. Calculate the confidence interval for the population median associated with the sign test. Report the actual achieved confidence level. Use the interval that comes the closest to 95% confidence (the one just above 95% or just below 95%, whichever is closer).
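As a reminder of where the achieved confidence level comes from: assuming a continuous population, the interval between the order statistics X_(k+1) and X_(n-k) covers the median with probability 1 - 2 P(B <= k), where B is binomial with n trials and success probability 1/2. A minimal sketch of that calculation follows, using a made-up sample size of 20 rather than the exam data; the names k, level, and best are hypothetical.

```r
# Hypothetical sample size; with the exam data, n would be length(x).
n <- 20

# Candidate numbers of order statistics trimmed from each end.
k <- 0:(n %/% 2 - 1)

# Achieved confidence level of the interval (X_(k+1), X_(n-k)).
level <- 1 - 2 * pbinom(k, n, 1 / 2)

# Index of the candidate whose level is closest to 95%.
best <- which.min(abs(level - 0.95))
k[best]
level[best]

# With sorted data xs <- sort(x), the interval endpoints would be
# xs[k[best] + 1] and xs[n - k[best]].
```

Because the binomial distribution is discrete, only a few levels are achievable for a given n, which is why the question asks for the achieved level rather than exactly 95%.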
Question 4 [15 pts.]
The data for this problem are at the URL
With that URL given to Rweb, two variables fruit and seeds are loaded. (If you are using R at home, see the footnote about reading this data into R.)
The data in each row are measurements on one plant: fruit is the number of fruits produced in one year and seeds is the number of seeds found in a random sample of three fruits (individuals with fewer than three fruits to count were removed from the data). Data in different rows are on different plants and can be assumed independent and identically distributed.
The question of scientific interest is whether the variables fruit and seeds have significant association. The scientists would like to model them as independent variables. Is this justified?
- Calculate Kendall's tau for these data.
- Perform the test based on Kendall's tau of the null hypothesis of independence of fruit and seeds versus the alternative of dependence. Find the P-value and interpret it.
Question 5 [15 pts.]
The data for this problem are at the URL
With that URL given to Rweb, two variables treatment, which is categorical, and yield, which is numeric, are loaded. (If you are using R at home, see the footnote about reading this data into R.)
There are six treatments. The six groups are supposed to be independent random samples from six populations. The question of scientific interest is whether there are any treatment effects. The null hypothesis is no treatment effects, and the alternative is some.
- Perform a Kruskal-Wallis test of whether there are any treatment effects. Report and interpret the P-value.
- Actually, the treatment effects, if they exist, should be in increasing order with the numbers in the names of the treatment groups. Perform a Jonckheere-Terpstra test of no treatment effects versus an increasing sequence of treatment effects. Report and interpret the P-value.
- Explain the differences between the results of parts (a) and (b).
Question 6 [15 pts.]
The data for this problem are at the URL
With that URL given to Rweb, two variables event.time and failed are loaded. (If you are using R at home, see the footnote about reading this data into R.)
The data are from a reliability analysis of light bulb lifetime. The variable event.time is the time (in days) that a light bulb was on test. At time event.time[i] the i-th light bulb either burnt out (failed) or was censored (because the test was stopped). The variable failed[i] is 1 if the i-th light bulb failed and 0 if it was censored.
- Plot the Kaplan-Meier estimator of the survival curve for the light bulb failure time distribution.
- Find the four point estimates for the probability that a light bulb lasts 500, 1000, 1500, or 2000 days of use (that its failure time is greater than 500, 1000, 1500, or 2000 days, respectively).
- Find the four 95% confidence intervals for the parameters estimated in part (b).
Footnote about Reading Data into R
If you are doing Problem 1 in R rather than Rweb, you will have to duplicate what Rweb does when it reads in the URL at the beginning. So, all together, you must do
X <- read.table(url("http://www.stat.umn.edu/geyer/f06/5601/mydata/t1p1.txt"), header = TRUE)
names(X)
attach(X)
to produce the two variables treatment and control needed for your analysis. Do similarly for the other problems.
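To see what these commands do without a network connection, here is a small sketch that reads the same kind of table from in-memory text. The numbers are invented for illustration and are not the exam data; read.table with header = TRUE treats the first line as the variable names.

```r
# Hypothetical file contents, standing in for a downloaded data file.
txt <- "treatment control
5.1 4.8
6.3 6.0
4.9 5.2"

# header = TRUE takes the column names from the first line.
X <- read.table(text = txt, header = TRUE)
names(X)

# Extracting columns explicitly is an alternative to attach(X).
treatment <- X$treatment
control <- X$control
```

Either attach(X) or explicit extraction with $ gives variables named as in the problem statements; the explicit form avoids masking other objects of the same name.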