Homework Assignment 1, Due Fri, Sep 29, 2006
First Problem
For the data in Table 3.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t3-3.txt
- Do a Wilcoxon signed rank test of the hypotheses described in Problem 3.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
- Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the signed rank test with as close as you can get to 95% confidence. (The phrase "as close as you can get" means either above or below, whichever is closer. Note this is different from the example, which always is above.) (Note: The R function wilcox.test seems broken for the confidence interval. Don't use it.)
- Do a sign test of the same hypotheses as in part (a).
- Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the sign test with as close as you can get to 95% confidence.
- Comment on the differences between the two procedures, including any differences in required assumptions, and in theoretical properties such as asymptotic relative efficiency.
- Repeat parts (a), (b), (c), and (d) above except use fuzzy P-values and fuzzy confidence intervals done by the R package fuzzyRankTests, except for the point estimates (there are fuzzy tests and fuzzy confidence intervals, but no fuzzy point estimates).
(The problem just above is Problems 3.1, 3.19, 3.27, 3.63 in H & W and more.)
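One way to get the Hodges-Lehmann estimate and an interval "as close as you can get" to 95% without wilcox.test is to work directly with the Walsh averages. The following is only a sketch under stated assumptions: the helper name hl_signed_rank and the data are made up, and it assumes no ties and no zeros among the differences.

```r
# Hypothetical helper (not from the course pages): Hodges-Lehmann estimate
# and Walsh-average confidence interval associated with the signed rank test.
hl_signed_rank <- function(z, level = 0.95) {
  n <- length(z)
  # Walsh averages (z[i] + z[j]) / 2 for i <= j, sorted
  w <- outer(z, z, "+") / 2
  w <- sort(w[lower.tri(w, diag = TRUE)])
  M <- length(w)                          # M = n (n + 1) / 2
  # the interval (w[k], w[M + 1 - k]) has confidence 1 - 2 P(T+ <= k - 1)
  k <- seq_len(floor(M / 2))
  conf <- 1 - 2 * psignrank(k - 1, n)
  best <- which.min(abs(conf - level))    # closest, whether above or below
  list(estimate = median(w),
       conf.int = c(w[k[best]], w[M + 1 - k[best]]),
       achieved = conf[best])
}

z <- c(1.2, -0.4, 2.5, 0.9, 3.1, 1.7, -0.2, 2.2)   # made-up paired differences
hl_signed_rank(z)
```

The achieved component reports the actual confidence level of the chosen interval, which is what should be quoted alongside it.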
Second Problem
For the data in Table 3.6 in H & W also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t3-6.txt repeat parts (a) and (b) above. Read 3.43 in place of 3.1 in part (a). Also produce a fuzzy P-value for the test, call this part (c). Make a plot, either the PDF or the CDF of the fuzzy P-value (your choice). (You do not need to produce a fuzzy confidence interval.)
(The problem just above is Problem 3.12 in H & W and more. Oddly, most of the problem statement is in 3.43. Only the test to do is stated in 3.12.)
Third Problem
Problem 3.54 in H & W. The data are in the problem statement also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/p3-54.txt. Since this is all about ties (zeroes), also produce the fuzzy P-value.
Note: This problem is about the sign test. You can tell because it is in a section (3.4) of the textbook about the sign test (and no other way).
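For orientation, the conventional (non-fuzzy) sign test that discards zeros can be sketched as below. The helper name sign_test and the data are hypothetical; the fuzzy P-value the problem asks for treats the zeros differently and is not shown here.

```r
# Hypothetical helper: conventional two-sided sign test of median zero
sign_test <- function(z) {
  z <- z[z != 0]                 # conventional treatment: discard zeros
  n <- length(z)
  npos <- sum(z > 0)
  lower <- pbinom(npos, n, 0.5)                           # P(B <= npos)
  upper <- pbinom(npos - 1, n, 0.5, lower.tail = FALSE)   # P(B >= npos)
  list(n = n, npos = npos, p.value = min(1, 2 * min(lower, upper)))
}

z <- c(3, 0, 1, -2, 4, 0, 5, 2, 6, 1, -1, 7)   # made-up differences
sign_test(z)
```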
Fourth Problem
Problem 3.87 in H & W. The data are in Table 3.9 in H & W also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t3-9.txt.
Note: This problem is about the signed rank test. You can tell because it is in a section (3.7) of the textbook about the signed rank test.
Answers
Answers in the back of the book are here.
Homework Assignment 2, Due Fri, Oct 13, 2006
First Problem
For the data in Table 4.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t4-3.txt
- Do a Wilcoxon rank sum test of the hypotheses described in Problem 4.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
- Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the rank sum test with as close as you can get to 95% confidence.
- Do a two-sample Kolmogorov-Smirnov test
  - testing whether there is any difference in the two populations, and
  - testing whether allergics have higher histamine levels than nonallergics.
(The problem just above is Problems 4.1, 4.15, 4.27 in H & W and more.)
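The two Kolmogorov-Smirnov tests differ only in which empirical CDF difference they look at, which is easy to get backwards. The sketch below computes the three statistics directly from the empirical CDFs; ks_stats is a hypothetical helper and the data are simulated, not Table 4.3. Which one-sided statistic matches "allergics have higher levels" depends on which sample is called x, so check the direction against the ks.test help before relying on its alternative argument.

```r
# Hypothetical helper: two-sample Kolmogorov-Smirnov statistics from ECDFs
ks_stats <- function(x, y) {
  grid <- sort(c(x, y))            # the ECDFs only jump at data points
  d <- ecdf(x)(grid) - ecdf(y)(grid)
  c(two.sided = max(abs(d)),
    x.cdf.above = max(d),          # large when x tends to be smaller than y
    x.cdf.below = max(-d))         # large when x tends to be larger than y
}

set.seed(1)
x <- rnorm(15)                     # made-up samples
y <- rnorm(12, mean = 1)
ks_stats(x, y)
ks.test(x, y)$statistic            # should agree with two.sided
```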
Second Problem
Problem 11.29. The data are at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t11-15.txt
Third Problem
For the data in Table 4.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t4-4.txt
- Do a Wilcoxon rank sum test of the hypotheses described in Problem 4.5 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
- Also compute the Hodges-Lehmann estimator and confidence interval for the shift parameter that go with the rank sum test with as close as you can get to 99% confidence (note not 95%).
- Redo part (a) using a fuzzy P-value. (Don't bother with a fuzzy confidence interval. It hardly makes a difference at this sample size.)
(The problem just above is Problems 4.5, 4.19, 4.34 in H & W and more.)
Answers
Answers in the back of the book are here.
Homework Assignment 3, Due Fri, Oct 20, 2006
First Problem
Problems 11.34 and 11.36. The data are in Table 11.18 in Hollander and Wolfe and also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t11-18.txt
Second Problem
For the data in Table 6.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t6-4.txt
- Do a Kruskal-Wallis test of the hypotheses described in Problem 6.8 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
- Also do a conventional ANOVA on the same data.
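As a rough illustration of the two analyses side by side, the sketch below runs kruskal.test and aov on a tiny made-up data set (three groups of three observations, not Table 6.4). The variable names y and g are arbitrary.

```r
# Made-up data: three groups of three observations
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
g <- factor(rep(c("a", "b", "c"), each = 3))

kw <- kruskal.test(y, g)     # rank-based test, chi-square approximation
kw$statistic
kw$p.value

summary(aov(y ~ g))          # conventional one-way ANOVA F test
```

With samples this small the chi-square approximation used by kruskal.test is crude, which is worth keeping in mind when describing significance.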
Third Problem
For the data in Table 6.7 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t6-7.txt
- Do a Jonckheere-Terpstra test of the hypotheses described in Problem 6.19 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
- Also do a conventional normal theory test of the same hypotheses on the same data.
Fourth Problem
For the data in Table 7.2 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t7-2.txt
- Do a Friedman test of the hypotheses described in Problem 7.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
- Also do a conventional ANOVA on the same data.
Fifth Problem
For the data in Table 8.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/f06/5601/hwdata/t8-3.txt
- Do a test using Kendall's tau as the test statistic of the hypotheses described in Problem 8.1 in Hollander and Wolfe. Describe the result of the test in terms of "statistical significance".
- Also find the associated point estimate and an approximate 95% confidence interval.
- Also do the competing normal-theory parametric procedures on the same data.
(The problem just above is Problems 8.1, 8.20, and, except for a change of confidence level, 8.27 in H & W and more.)
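A minimal sketch of the pieces R provides directly, on made-up data (not Table 8.3): cor gives the point estimate of Kendall's tau, and cor.test gives the test and, for the Pearson version, the normal-theory competitor. As far as can be told from its help page, cor.test does not report a confidence interval for tau, so that part has to be computed separately.

```r
x <- 1:5                     # made-up paired data
y <- c(1, 2, 3, 5, 4)

tau.hat <- cor(x, y, method = "kendall")      # point estimate of tau
kt <- cor.test(x, y, method = "kendall")      # test based on Kendall's tau
pt <- cor.test(x, y, method = "pearson")      # normal-theory competitor

tau.hat
kt$p.value
pt$conf.int                  # interval for the Pearson correlation only
```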
Answers
Answers in the back of the book are here.
Homework Assignment 4, Due Fri, Nov 03, 2006
Problems From Efron and Tibshirani
4.1, 4.2, 5.2, and 5.4
Additional Problem
The file contains two variables x, which is a random sample from a standard normal distribution, and y, which is a random sample from a standard Cauchy distribution.
- For the variable x compute
  - the sample mean
  - the sample median
  - the sample 25% trimmed mean, for which the R statement is mean(x, trim=0.25)
  And for each of these three point estimates calculate a bootstrap standard error (use bootstrap sample size 1000 at least).
- Repeat part (a) for the variable y.
- Using the fact that x is actually a random sample from a normal distribution, we actually know the asymptotic relative efficiency (ARE) of estimators (i) and (ii) in part (a). (See the efficiency page for details.) What is it, and how close is the square of the ratio of bootstrap standard errors?
- Using the fact that y is actually a random sample from a Cauchy distribution, also denoted t(1) because it is the Student t distribution for one degree of freedom, we actually know the asymptotic relative efficiency (ARE) of estimators (i) and (ii) in part (b). (See the efficiency page for details.) What is it, and how close is the square of the ratio of bootstrap standard errors?
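The bootstrap standard errors asked for above can be sketched as follows. This is only an illustration: boot_se is a hypothetical helper, and the data are simulated as a stand-in because the assignment's file is not used here.

```r
set.seed(42)
x <- rnorm(100)              # stand-in data, not the assignment's file

# Hypothetical helper: nonparametric bootstrap standard error of an estimator
boot_se <- function(x, estimator, nboot = 1000) {
  theta.star <- replicate(nboot, estimator(sample(x, replace = TRUE)))
  sd(theta.star)
}

se.mean <- boot_se(x, mean)
se.median <- boot_se(x, median)
se.trim <- boot_se(x, function(u) mean(u, trim = 0.25))

c(mean = se.mean, median = se.median, trimmed = se.trim)
(se.mean / se.median)^2      # squared ratio, to set beside the ARE
```

Which standard error goes in the numerator depends on the convention used for the ARE, so check it against the efficiency page before comparing.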
Answers in the back of the book for the additional problem are here.
Homework Assignment 5, Due Mon Nov 13, 2006
Question 1
The data set LakeHuron included in R (on-line help) gives annual measurements of the level, in feet, of Lake Huron 1875-1972. The name of the time series is LakeHuron, for example,
plot(LakeHuron)
does a time series plot.
The average water level over this period was
> mean(LakeHuron)
[1] 579.0041
Obtain a standard error for this estimate (i. e., mean(LakeHuron)) using the subsampling bootstrap with subsample length b = 10. Assume that the sample mean obeys the square root law (that is, the rate is square root of n), and assume the time series is stationary.
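A sketch of one way the computation might be organized, assuming that for a stationary time series the subsamples are the n - b + 1 blocks of b consecutive observations. The helper name subsample_se is made up, and using sd (rather than a root mean square about theta.hat) for the spread is a choice, not something the assignment specifies.

```r
# Hypothetical helper: subsampling bootstrap standard error of the mean
# for a stationary time series, assuming the square root law.
subsample_se <- function(x, b) {
  x <- as.numeric(x)
  n <- length(x)
  theta.hat <- mean(x)
  starts <- seq_len(n - b + 1)                 # all blocks of length b
  theta.star <- sapply(starts, function(i) mean(x[i:(i + b - 1)]))
  z.star <- sqrt(b) * (theta.star - theta.hat) # analog of sqrt(n) * (theta.hat - theta)
  sd(z.star) / sqrt(n)                         # rescale from b to n
}

subsample_se(LakeHuron, b = 10)
```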
Question 2
The documentation for the R function lmsreg says
"There seems no reason other than historical to use the lms and lqs options. LMS estimation is of low efficiency (converging at rate n^(-1 / 3)) whereas LTS has the same asymptotic efficiency as an M estimator with trimming at the quartiles (Marazzi, 1993, p. 201). LQS and LTS have the same maximal breakdown value of (floor((n-p)/2) + 1)/n attained if floor((n+p)/2) <= quantile <= floor((n+p+1)/2). The only drawback mentioned of LTS is greater computation, as a sort was thought to be required (Marazzi, 1993, p. 201) but this is not true as a partial sort can be used (and is used in this implementation)."
Thus it seems that LMS regression is the
Wrong Thing (with a capital W and a capital T), something that survives
only for historico-sociological reasons: it was invented first and most
people that have heard of robust regression at all have only heard of it.
To be fair to Efron and Tibshirani, the literature cited in the documentation
for lmsreg
is the same vintage as their book. So maybe they
thought they were using the latest and greatest.
Anyway, the problem is to redo the LMS examples using LTS.
What changes? Does LTS really work better here than LMS? Describe the differences you see and why you think these differences indicate LTS is better (or worse, if that is what you think) than LMS.
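For reference, switching between the two estimators can be as small as changing one argument. The sketch below assumes the MASS package (where lqs lives) is installed and uses made-up data with a few gross outliers; it is not the course example.

```r
library(MASS)                       # lqs() is in MASS

set.seed(7)
x <- 1:30                           # made-up regression data
y <- 2 + 3 * x + rnorm(30, sd = 0.5)
y[26:30] <- 0                       # made-up gross outliers

fit.lms <- lqs(y ~ x, method = "lms")   # least median of squares
fit.lts <- lqs(y ~ x, method = "lts")   # least trimmed squares
fit.ls <- lm(y ~ x)                     # ordinary least squares, for contrast

rbind(lms = coef(fit.lms), lts = coef(fit.lts), ls = coef(fit.ls))
```

On a single data set the two robust fits may look nearly identical; the efficiency difference shows up in the sampling variability, which is what the bootstrap comparison is meant to expose.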
Question 3
The file contains a vector x of data from a heavy tailed distribution such that the sample mean has rate of convergence n^(1 / 3), that is
n^(1 / 3) * (theta.hat - theta)
has nontrivial asymptotics ("nontrivial" here meaning it doesn't converge to zero in probability and also is bounded in probability, so n^(1 / 3) is the right rate) where theta.hat is the sample mean and theta is the true unknown population mean.
When the sample mean behaves as badly as this, the sample variance behaves even worse (it converges in probability to infinity), but robust measures of scale make sense, for example, the interquartile range (calculated by the IQR function in R, note that the capital letters are not a mistake).
- Using subsample size b = 20, do a subsampling bootstrap to estimate the distribution of the sample mean.
- Make a histogram of theta.star marking the point theta.hat.
- Make a histogram of b^(1 / 3) * (theta.star - theta.hat), which is the analog in the "bootstrap world" of the distribution of the quantity having nontrivial asymptotics displayed above.
- Calculate the subsampling bootstrap estimate of the IQR of theta.hat, rescaling by the ratio of rates in the appropriate fashion.
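The rescaling step is the part most easily gotten wrong, so here is a sketch of it. Everything about the data is an assumption: a Student t distribution with 1.5 degrees of freedom is used as a stand-in because its tail index should give the n^(1 / 3) rate, and it is not the distribution in the assignment's file.

```r
set.seed(3)
x <- rt(500, df = 1.5)       # stand-in heavy-tailed data
n <- length(x)
b <- 20
nboot <- 1000
theta.hat <- mean(x)

# IID data: a subsample is a draw of size b WITHOUT replacement
theta.star <- replicate(nboot, mean(sample(x, b)))

# analog of n^(1 / 3) * (theta.hat - theta) in the bootstrap world
z.star <- b^(1 / 3) * (theta.star - theta.hat)

# IQR of theta.hat: divide the IQR of z.star by the rate at sample size n
iqr.hat <- IQR(z.star) / n^(1 / 3)
iqr.hat
```

Equivalently, iqr.hat is IQR(theta.star) multiplied by (b / n)^(1 / 3), the ratio of the two rates.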
Answers in the back of the book are here.
Homework Assignment 6, Due Wed Dec 6, 2006
Question 1
The file contains one variable x, which is a random sample from a gamma distribution.
The coefficient of skewness of a distribution is the third central moment divided by the cube of the standard deviation (this gives a dimensionless quantity that is zero for any symmetric distribution, positive for distributions with a long right tail, and negative for distributions with a long left tail).
It can be calculated by the R function defined by
skew <- function(x) {
    xbar <- mean(x)
    mu2.hat <- mean((x - xbar)^2)
    mu3.hat <- mean((x - xbar)^3)
    mu3.hat / sqrt(mu2.hat)^3
}
- Find a 95% confidence interval for the true unknown population coefficient of skewness that is just the sample coefficient of skewness plus or minus 1.96 bootstrap standard errors.
- Find a 95% confidence interval for the true unknown population coefficient of skewness having the second order accuracy property using the boott function. Note: Since you have no idea about how to write an sdfun that will variance stabilize the coefficient of skewness, you will have to use one of the other two methods described on the bootstrap t page.
- Find a 95% confidence interval for the true unknown population coefficient of skewness using the bootstrap percentile method.
- Find a 95% confidence interval for the true unknown population coefficient of skewness using the BCa (alphabet soup, type 1) method.
- Find a 95% confidence interval for the true unknown population coefficient of skewness using the ABC (alphabet soup, type 2) method. Note: this will require that you write a quite different skew function, starting
  skew <- function(p, x) {
  (and you have to fill in the rest of the details, which should, I hope, be clear enough from the discussion of our ABC example).
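The two simplest intervals in the list (plus or minus 1.96 bootstrap standard errors, and the percentile method) share the same bootstrap replicates, as the sketch below shows. The gamma sample is simulated as a stand-in for the assignment's file, and the shape parameter and sample size are arbitrary choices. The boott, BCa, and ABC intervals need the functions described on the course pages and are not shown.

```r
skew <- function(x) {
    xbar <- mean(x)
    mu2.hat <- mean((x - xbar)^2)
    mu3.hat <- mean((x - xbar)^3)
    mu3.hat / sqrt(mu2.hat)^3
}

set.seed(1)
x <- rgamma(200, shape = 2)          # stand-in data, not the assignment's file
nboot <- 2000
skew.star <- replicate(nboot, skew(sample(x, replace = TRUE)))

# sample skewness plus or minus 1.96 bootstrap standard errors
ci.se <- skew(x) + c(-1, 1) * 1.96 * sd(skew.star)

# bootstrap percentile interval
ci.perc <- quantile(skew.star, c(0.025, 0.975), names = FALSE)

rbind(se = ci.se, percentile = ci.perc)
```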
Question 2
The file contains one variable x, which is a random realization of an AR(1) time series.
The sample mean of the time series obeys the square root law, that is,
sqrt(n) * (theta.hat - theta)
is asymptotically normal, where theta.hat is the sample mean for sample size n and theta is the true unknown population mean.
- Find a 95% confidence interval for the true unknown population
mean that is just the sample mean
plus or minus 1.96 bootstrap standard errors.
- Find a 95% confidence interval for the true unknown population mean using the method of Politis and Romano described in the handout and on the second web page on subsampling.
Use subsample size 50 for both parts (you can use the same samples).
Question 3
The file contains a vector x of data that are a random sample from a heavy tailed distribution such that the sample mean has rate of convergence n^(1 / 3), that is
n^(1 / 3) * (theta.hat - theta)
has nontrivial asymptotics ("nontrivial" here meaning it doesn't converge to zero in probability and also is bounded in probability, so n^(1 / 3) is the right rate) where theta.hat is the sample mean for sample size n and theta is the true unknown population mean.
This is the same distribution as was used for Homework 5, Problem 3, but a much larger sample.
- Find a 95% confidence interval for the true unknown population mean using the sample mean as the point estimator and using the subsampling bootstrap with subsample size b = 100 by the method of Politis and Romano using the known rate n^(1 / 3).
- This part is cancelled. Find a 95% confidence interval for the true unknown population mean using the sample mean as the point estimator and using the subsampling bootstrap with subsample size b = 100 by the method of Politis and Romano estimating the rate (pretending you have been told nothing about the rate). Report both your rate estimate and your confidence interval.
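For the known-rate part, the subsampling interval comes from the quantiles of the rescaled subsample estimates, with the upper quantile giving the lower endpoint. The sketch below is an illustration under assumptions: subsample_ci is a hypothetical helper, the rate is passed in as a function, and the data are a simulated stand-in (Student t with 1.5 degrees of freedom), not the assignment's file.

```r
# Hypothetical helper: subsampling confidence interval for the mean of IID
# data with a known rate of convergence.
subsample_ci <- function(x, b, rate = function(n) n^(1 / 3),
                         nboot = 1000, level = 0.95) {
  n <- length(x)
  theta.hat <- mean(x)
  theta.star <- replicate(nboot, mean(sample(x, b)))   # without replacement
  z.star <- rate(b) * (theta.star - theta.hat)
  alpha <- 1 - level
  # upper quantile first, so the result is c(lower, upper)
  q <- quantile(z.star, c(1 - alpha / 2, alpha / 2), names = FALSE)
  theta.hat - q / rate(n)
}

set.seed(5)
x <- rt(5000, df = 1.5)      # stand-in heavy-tailed data
subsample_ci(x, b = 100)
```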
Answers
Answers in the back of the book are here.
Homework Assignment 7, Due Wed, Dec 13, 2006
Question 1
The file contains two variables x and y, which are made up regression data.
Fit the regression function for these data using a kernel smoother with gaussian kernel and bandwidth = 2.
Then repeat with bandwidth = 1.
Then repeat with a bandwidth of your choice. Choose a bandwidth that gives a picture that makes sense to you.
Hand in either (1) three plots each showing one of the three smooths and the scatterplot of points, being sure to adequately identify each plot, or (2) one plot showing all three smooths against the scatterplot, again adequately identifying each smooth.
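One plot with several identified smooths might be put together as in the sketch below, which uses the ksmooth function from base R on simulated data (not the assignment's file). Note that ksmooth has its own convention for what bandwidth means (its help page describes the scaling in terms of kernel quartiles), so the same number may not mean the same thing in another smoother.

```r
set.seed(11)
x <- sort(runif(100, 0, 10))                  # made-up regression data
y <- sin(x) + rnorm(100, sd = 0.3)

fit2 <- ksmooth(x, y, kernel = "normal", bandwidth = 2)
fit1 <- ksmooth(x, y, kernel = "normal", bandwidth = 1)

plot(x, y)
lines(fit2, lty = 1)
lines(fit1, lty = 2)
legend("topright", legend = c("bandwidth = 2", "bandwidth = 1"), lty = 1:2)
```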
Question 2
The file contains two variables x and y, which are made up regression data.
- Run each of the four smoothers done by the R functions
  - locpoly
  - smooth.spline
  - gam
  - sm.regression
- Report the smoothing parameter used by each method and what this smoothing parameter purports to be (you may have to grovel around in the on-line help for these functions; follow the links on the bandwidth selection web page).
- Hand in either (1) four plots showing each of the four smooths on the scatterplot or (2) one plot showing all four smooths on the scatterplot.
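As an example of pulling out the smoothing parameter a method chose, the sketch below does it for smooth.spline only, on simulated data. The other three functions live in add-on packages and report their smoothing parameters differently, so they are not shown.

```r
set.seed(12)
x <- sort(runif(100, 0, 10))                  # made-up regression data
y <- sin(x) + rnorm(100, sd = 0.3)

fit <- smooth.spline(x, y)    # smoothing parameter chosen automatically
c(lambda = fit$lambda, df = fit$df, spar = fit$spar)

plot(x, y)
lines(predict(fit, x))
```

The fitted object carries several descriptions of the same amount of smoothing (lambda, df, spar), so part of the report is saying which one is being quoted.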
Question 3
This question is cancelled. The example about parametric bootstrap of logistic regression was about a test of model comparison. This problem uses the same data but is about a confidence interval for a regression coefficient.
In particular, in the logistic regression that is the big model
in the model comparison
Rweb:> out <- glm(kyphosis ~ age + I(age^2) + number + start,
+ family = "binomial")
Rweb:> summary(out)
Call:
glm(formula = kyphosis ~ age + I(age^2) + number + start, family = "binomial")
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.23573  -0.51241  -0.24509  -0.06108   2.35495
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.3835660  2.0548871  -2.133  0.03291 *
age          0.0816412  0.0345292   2.364  0.01806 *
I(age^2)    -0.0003965  0.0001905  -2.082  0.03737 *
number       0.4268659  0.2365134   1.805  0.07110 .
start       -0.2038421  0.0706936  -2.883  0.00393 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 54.428 on 76 degrees of freedom
AIC: 64.428
Number of Fisher Scoring iterations: 6
we are interested in the coefficient for I(age^2), which is highlighted along with its asymptotic standard error (calculated from Fisher information). A normal-theory, large-sample confidence interval for the unknown true population regression coefficient would be −0.0003965 ± 1.96 × 0.0001905.
This coefficient itself is
Rweb:> coefficients(out)[3]
     I(age^2)
-0.0003964918
R does not make it easy to get the standard error it calculates.
Rweb:> summary(out)$coefficients[3, 2]
[1] 0.0001904622
Nevertheless, if we did
theta.hat <- coefficients(out)[3]
sd.hat <- summary(out)$coefficients[3, 2]
then (theta.hat − θ) ⁄ sd.hat, where θ is the unknown population regression coefficient, should be standard normal if the sample size is sufficiently large. Is it?
- Do a parametric bootstrap simulation of the standardized quantity described above. Plot its histogram.
- Calculate 0.025 and 0.975 quantiles of the simulation distribution done in part (a).
- Calculate a parametric bootstrap 95% confidence interval for θ using these quantiles, theta.hat and sd.hat.
Caution: This problem has nothing to do with the model that results in out2 in the example.
Clarification: You may get warning messages "fitted probabilities numerically 0 or 1 occurred". These do mean regression coefficients (not necessarily the one we are interested in) are theoretically at plus or minus infinity, although R will just make them some very large number. Consider this part of the problem that the bootstrap is supposed to solve.
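The structure of such a simulation can be sketched on a small made-up logistic regression (one predictor, simulated data, not the kyphosis data). The variable names z.star, p.hat, and nboot are arbitrary, and nboot is kept small only so the sketch runs quickly.

```r
set.seed(21)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1 * x))    # made-up data, not kyphosis

out <- glm(y ~ x, family = binomial)
theta.hat <- coefficients(out)[2]
sd.hat <- summary(out)$coefficients[2, 2]
p.hat <- fitted(out)                       # fitted success probabilities

nboot <- 200
z.star <- replicate(nboot, {
  y.star <- rbinom(n, 1, p.hat)            # simulate from the fitted model
  out.star <- suppressWarnings(glm(y.star ~ x, family = binomial))
  sd.star <- summary(out.star)$coefficients[2, 2]
  (coefficients(out.star)[2] - theta.hat) / sd.star
})

hist(z.star)
# upper quantile first, so the interval comes out as c(lower, upper)
q <- quantile(z.star, c(0.975, 0.025), names = FALSE)
ci <- unname(theta.hat - q * sd.hat)
ci
```

The quantiles swap ends because the simulated quantity stands in for (theta.hat − θ) ⁄ sd.hat, and solving for θ reverses the inequality.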
Answers
Answers in the back of the book are here.