For the data in Table 3.3 in Hollander and Wolfe (H & W), also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-3.txt
with confidence level as close as you can get. (As close as you can get means either above or below, whichever is closer. Note this is different from the example, which is always above.)
The problem just above corresponds to Problems 3.1, 3.19, 3.27, 3.63, and more in H & W.
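A minimal sketch of how a sign test and its associated order-statistic confidence interval might be done in R. The data, the hypothesized median mu0, and the order-statistic index k below are illustrative assumptions, not the Table 3.3 data.

```r
# made-up data standing in for the Table 3.3 data
x <- c(1.8, 3.2, 2.5, 4.1, 2.9, 3.7, 1.5, 2.2)
mu0 <- 2  # hypothesized median (an assumption for illustration)

# sign test: count observations above mu0, drop ties, use the binomial
b <- sum(x > mu0)
n <- sum(x != mu0)
p.value <- binom.test(b, n)$p.value

# confidence interval (x_(k), x_(n+1-k)) with achievable confidence level;
# only certain levels are achievable, hence "as close as you can get"
k <- 2
conf <- 1 - 2 * pbinom(k - 1, n, 0.5)
ci <- sort(x)[c(k, n + 1 - k)]
```

The achievable confidence levels jump discretely as k changes, which is why the problem asks for the level closest to the nominal one rather than exactly equal to it.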
For the data in Table 3.6 in H & W also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-6.txt repeat parts (a) and (b) above. Read 3.43 in place of 3.1 in part (a).
The problem just above corresponds to Problem 3.12 and more in H & W.
Problem 3.54 in H & W. The data are in the problem statement also at http://www.stat.umn.edu/geyer/5601/hwdata/p3-54.txt.
Problem 3.87 in H & W. The data are in Table 3.9 in H & W also at http://www.stat.umn.edu/geyer/5601/hwdata/t3-9.txt.
Note: answers are now here.
For the data in Table 4.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t4-3.txt
This is Problems 4.1, 4.15, 4.27, and more.
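These Chapter 4 problems exercise the two-sample Wilcoxon rank sum machinery. A minimal sketch of the test, the Hodges-Lehmann estimate of shift, and the associated confidence interval in R, using made-up data rather than the Table 4.3 data:

```r
# made-up two-sample data standing in for the Table 4.3 data
x <- c(0.8, 1.2, 2.1, 0.5, 1.7)
y <- c(1.9, 2.4, 3.1, 2.8, 1.6)

# rank sum test plus Hodges-Lehmann estimate and confidence interval
out <- wilcox.test(x, y, conf.int = TRUE)
out$p.value    # exact P-value for small samples with no ties
out$estimate   # Hodges-Lehmann estimate of the location shift
out$conf.int   # associated confidence interval
```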
Problem 11.29. The data are at http://www.stat.umn.edu/geyer/5601/hwdata/t11-15.txt
For the data in Table 4.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t4-4.txt
This is Problems 4.5, 4.19, 4.34, and more.
Note: answers are now here.
For the data in Table 6.4 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t6-4.txt
For the data in Table 6.7 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t6-7.txt
For the data in Table 7.2 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t7-2.txt
For the data in Table 8.3 in Hollander and Wolfe (H & W) also at http://www.stat.umn.edu/geyer/5601/hwdata/t8-3.txt
This is Problems 8.1, 8.20, and, except for a change of confidence level, 8.27.
Note: answers are here.
Problems 11.34 and 11.36. The data are in Table 11.18 in Hollander and Wolfe and also at
4.1, 4.2, 5.2, and 5.4
The file contains two variables: x, which is a random sample from a standard normal distribution, and y, which is a random sample from a standard Cauchy distribution.

For the variable x compute

mean(x, trim = 0.25)

and for each of these three point estimates calculate a bootstrap standard error (use bootstrap sample size 1000 at least). Do the same for the variable y.
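A sketch of one such bootstrap standard error calculation, here for the 25% trimmed mean. The data are simulated stand-ins, not the course data file, and the bootstrap sample size 1000 follows the problem statement.

```r
# simulated stand-in for the variable x from the course data file
set.seed(42)
x <- rnorm(100)

# nonparametric bootstrap: resample with replacement, recompute the
# estimator, and take the standard deviation of the bootstrap replicates
nboot <- 1000
theta.star <- replicate(nboot,
    mean(sample(x, replace = TRUE), trim = 0.25))
se <- sd(theta.star)
```

The same pattern applies to the other estimators: only the line recomputing the estimator changes.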
Since x is actually a random sample from a normal distribution, we actually know the asymptotic relative efficiency (ARE) of estimators (i) and (ii) in part (a) (see the efficiency page for details). What is it, and how close to it is the square of the ratio of bootstrap standard errors?
Since y is actually a random sample from a Cauchy distribution, also denoted t(1) because it is the Student t distribution for one degree of freedom, we actually know the asymptotic relative efficiency (ARE) of estimators (i) and (ii) in part (b) (see the efficiency page for details). What is it, and how close to it is the square of the ratio of bootstrap standard errors?
Note: answers to Problem 11.36 in H&W and the additional question are here.
The data set LakeHuron in the ts package gives annual measurements of the level, in feet, of Lake Huron 1875-1972. As with all data sets included in R the usage is

library(ts)
data(LakeHuron)

Then the name of the time series is LakeHuron, for example,

plot(LakeHuron)

does a time series plot.
The average water level over this period was

> mean(LakeHuron)
[1] 579.0041
Obtain a standard error for this estimate using the subsampling bootstrap with subsample length b = 10. Assume that the sample mean obeys the square root law (that is, the rate is the square root of n).
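A minimal sketch of the subsampling bootstrap standard error under the square root law, using all overlapping subsamples of length b (one reasonable choice for time series, since it preserves the dependence within each subsample). In modern R the LakeHuron data set lives in the datasets package, loaded by default, so library(ts) may be unnecessary.

```r
# LakeHuron is in the datasets package (auto-loaded) in modern R
data(LakeHuron)
x <- as.vector(LakeHuron)
n <- length(x)
b <- 10                     # subsample length from the problem

theta.hat <- mean(x)

# all n - b + 1 overlapping subsamples of length b, keeping time order
theta.star <- sapply(1:(n - b + 1),
                     function(i) mean(x[i:(i + b - 1)]))

# square root law: sqrt(b) (theta.star - theta.hat) mimics
# sqrt(n) (theta.hat - theta), so rescale by sqrt(b / n)
se <- sqrt(b / n) * sd(theta.star)
```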
The documentation for the R function lmsreg says

There seems no reason other than historical to use the lms and lqs options. LMS estimation is of low efficiency (converging at rate n^{-1/3}) whereas LTS has the same asymptotic efficiency as an M estimator with trimming at the quartiles (Marazzi, 1993, p. 201). LQS and LTS have the same maximal breakdown value of (floor((n-p)/2) + 1)/n attained if floor((n+p)/2) <= quantile <= floor((n+p+1)/2). The only drawback mentioned of LTS is greater computation, as a sort was thought to be required (Marazzi, 1993, p. 201) but this is not true as a partial sort can be used (and is used in this implementation).
Thus it seems that LMS regression is the
Wrong Thing (with a capital W and a capital T), something that survives
only for historico-sociological reasons: it was invented first and most
people that have heard of robust regression at all have only heard of it.
To be fair to Efron and Tibshirani, the literature cited in the documentation
for lmsreg
is the same vintage as their book. So maybe they
thought they were using the latest and greatest.
Anyway, the problem is to redo the LMS examples using LTS.
What changes? Does LTS really work better here than LMS? Describe the differences you see and why you think these differences indicate LTS is better (or worse, if that is what you think) than LMS.
Note: LTS does take longer (about 45 seconds for nboot = 1000), so a smaller nboot might be justifiable.
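A sketch of the switch from LMS to LTS using the MASS package's lqs function (in MASS, lmsreg and ltsreg are wrappers around lqs with method = "lms" and method = "lts" respectively). The contaminated data below are made up for illustration; the original examples' data are not reproduced here.

```r
library(MASS)  # provides lqs, lmsreg, ltsreg

# made-up regression data with a few gross outliers
set.seed(42)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
y[1:5] <- y[1:5] + 15      # contaminate five observations

# the two robust fits differ only in the method argument
fit.lms <- lqs(y ~ x, method = "lms")
fit.lts <- lqs(y ~ x, method = "lts")
coef(fit.lms)
coef(fit.lts)
```

Both fits should resist the outliers; the point of the problem is to compare their bootstrap behavior, where the n^{-1/3} rate of LMS shows up as more erratic replicates.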
The file
contains a vector x
of
data from a heavy tailed distribution such that the
sample mean has rate of convergence n^{1 / 3},
that is
n^(1 / 3) * (theta.hat - theta)
has nontrivial asymptotics (nontrivial
here meaning it doesn't
converge to zero in probability and also is bounded in probability,
so n^{1 / 3} is the right rate) where
theta.hat
is the sample mean and theta
is the true unknown population mean.
When the sample mean behaves as badly as this, the sample variance
behaves even worse (it converges in probability to infinity), but
robust measures of scale make sense, for example, the interquartile
range (calculated by the IQR
function in R, note that
the capital letters are not a mistake).
Make a histogram of the bootstrap distribution of theta.star, marking the point theta.hat. Also look at the distribution of

b^(1 / 3) * (theta.star - theta.hat)

which is the analog in the bootstrap world of the distribution of the quantity having nontrivial asymptotics displayed above. Then calculate a standard error for theta.hat, rescaling by the ratio of rates in the appropriate fashion.
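A sketch of the rescaling step under the n^(1/3) rate. The data here are a simulated heavy-tailed stand-in (Cauchy draws), not the course data file, and the subsample size b is an illustrative choice.

```r
# simulated heavy-tailed stand-in for the course data file
set.seed(42)
x <- rt(500, df = 1)       # t(1) = Cauchy: very heavy tails
n <- length(x)
b <- 50                    # subsample size (illustrative choice)

theta.hat <- mean(x)

# subsampling: resample WITHOUT replacement at size b < n
nboot <- 1000
theta.star <- replicate(nboot, mean(sample(x, b, replace = FALSE)))

# b^(1/3) (theta.star - theta.hat) mimics n^(1/3) (theta.hat - theta),
# so the standard error rescales by the ratio of rates (b / n)^(1/3)
se <- (b / n)^(1/3) * sd(theta.star)
```

Note the contrast with the square root law: the rescaling factor is (b/n)^(1/3), not sqrt(b/n), and using the wrong rate gives a badly biased standard error.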
Note: answers are here.
The file contains one variable x, which is a random sample from a gamma distribution.
The coefficient of skewness of a distribution is the third central moment divided by the cube of the standard deviation (this gives a dimensionless quantity that is zero for any symmetric distribution, positive for distributions with a long right tail, and negative for distributions with a long left tail).
It can be calculated by the R function defined by

skew <- function(x) {
    xbar <- mean(x)
    mu2.hat <- mean((x - xbar)^2)
    mu3.hat <- mean((x - xbar)^3)
    mu3.hat / sqrt(mu2.hat)^3
}
second order accuracy property using the boott function.
Note: Since you have no idea how to write an sdfun that will variance stabilize the coefficient of skewness, you will have to use one of the other two methods described on the bootstrap t page.
Note: this will require that you write a quite different skew function, starting

skew <- function(p, x) {

(and you have to fill in the rest of the details, which should, I hope, be clear enough from the discussion of our ABC example).
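For concreteness, a sketch of the bootstrap t done by hand without a variance stabilizing sdfun, using an inner (nested) bootstrap to estimate the standard error at each bootstrap sample. This mirrors one of the alternatives to sdfun; the data, replication counts, and confidence level are illustrative assumptions, and the counts are kept small for speed.

```r
# the skew function from the problem statement
skew <- function(x) {
    xbar <- mean(x)
    mu2.hat <- mean((x - xbar)^2)
    mu3.hat <- mean((x - xbar)^3)
    mu3.hat / sqrt(mu2.hat)^3
}

# simulated gamma stand-in for the course data file
set.seed(42)
x <- rgamma(100, shape = 2)

# inner bootstrap standard error of the skewness
boot.se <- function(x, nboot = 25)
    sd(replicate(nboot, skew(sample(x, replace = TRUE))))

theta.hat <- skew(x)
se.hat <- boot.se(x, 100)

# outer bootstrap of the studentized quantity
nboot <- 200    # small for speed; use far more in practice
z.star <- replicate(nboot, {
    x.star <- sample(x, replace = TRUE)
    (skew(x.star) - theta.hat) / boot.se(x.star)
})

# bootstrap t interval: invert the quantiles of z.star
ci <- theta.hat - quantile(z.star, c(0.975, 0.025)) * se.hat
```

The double bootstrap is what makes this slow: every outer replicate costs an inner bootstrap, which is exactly the tradeoff boott's nbootsd argument controls.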
The file
contains one variable x
, which is a random realization of
an AR(1) time series.
The sample mean of the time series obeys the square root law, that is,
sqrt(n) * (theta.hat - theta)
is asymptotically normal, where theta.hat
is the sample
mean for sample size n
and theta
is the true
unknown population mean.
Use subsample size 50 for both parts (you can use the same samples).
The file
contains a vector x
of
data that are a random sample from a heavy tailed distribution such that the
sample mean has rate of convergence n^{1 / 3},
that is
n^(1 / 3) * (theta.hat - theta)
has nontrivial asymptotics (nontrivial
here meaning it doesn't
converge to zero in probability and also is bounded in probability,
so n^{1 / 3} is the right rate) where
theta.hat
is the sample mean for sample size n
and theta
is the true unknown population mean.
This is the same distribution as was used for Homework 5, Problem 3, but a much larger sample.
n^(1 / 3).
Note: answers are here.
The file contains two variables x and y, which are made-up regression data.
locpoly
smooth.spline
gam
sm.regression
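A sketch of one of the listed smoothers, smooth.spline, which ships with base R (the others live in add-on packages: locpoly in KernSmooth, gam in its own package, sm.regression in sm). The data are made up, not the course file.

```r
# made-up regression data standing in for the course file
set.seed(42)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

# smoothing spline with smoothing parameter chosen by cross-validation
fit <- smooth.spline(x, y)

# scatterplot with the fitted smooth overlaid
plot(x, y)
lines(fit)
```

Each of the four smoothers has its own way of choosing the amount of smoothing, which is the interesting thing to compare across them.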
The file contains a two by two contingency table giving data from a study of whether snoring and disturbing dreams are associated. The row labels of the table are values of a categorical variable indicating the presence or absence of snoring. The column labels of the table are values of a categorical variable measuring the frequency of nightmares.
Because of the way Rweb reads data, it does not get read in a sensible form. The following R statements will fix it up.
x <- as.matrix(X[ , -1])
dimnames(x)[[1]] <- X[ , 1]
print(x)
(the last statement just showing us what we have in x
).
We want to do a chi-square test of independence. The R command

chisq.test(x)

does this. But as the command itself says in its printout

Chi-squared approximation may be incorrect in: chisq.test(x)
So your job is to get a valid P-value using the parametric bootstrap. Use at least n_{boot} = 100,000 bootstrap samples for your bootstrap. Also interpret the P-value.
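One way such a parametric bootstrap might be sketched: estimate the cell probabilities under the null hypothesis of independence (products of the marginal proportions), simulate multinomial tables from that fitted null model, and compare the observed chi-square statistic to its simulated distribution. The 2x2 table below is made up, not the course data, and the replication count is kept small for speed (the assignment asks for at least 100,000).

```r
set.seed(42)
# made-up 2x2 table standing in for the course data
x <- matrix(c(30, 40, 20, 60), nrow = 2)
n <- sum(x)

# MLE of cell probabilities under independence: products of margins
p.hat <- outer(rowSums(x), colSums(x)) / n^2

# Pearson chi-square statistic computed from a table
stat <- function(tab) {
    e <- outer(rowSums(tab), colSums(tab)) / sum(tab)
    sum((tab - e)^2 / e)
}
t.obs <- stat(x)

# parametric bootstrap: simulate tables from the fitted null model
nboot <- 1e4    # use at least 1e5 for the actual assignment
t.star <- replicate(nboot, {
    x.star <- matrix(rmultinom(1, n, p.hat), nrow = 2)
    stat(x.star)
})

# bootstrap P-value: fraction of simulated statistics at least as large
pval <- mean(t.star >= t.obs)
```

An alternative worth knowing about: chisq.test itself has a simulate.p.value = TRUE option with a B argument, though that simulation conditions on both margins rather than fitting the multinomial null model as above.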
Note: answers are here.