Due Date
Due Wed, Dec 11, 2013.
First Problem
The file contains two variables x and y, which are made-up regression data.
Fit the regression function for these data using a kernel smoother with a Gaussian kernel and bandwidth = 2.
Then repeat with bandwidth = 1.
Then repeat with a bandwidth of your choice. Choose a bandwidth that gives a picture that makes sense to you.
Hand in either (1) three plots each showing one of the three smooths and the scatterplot of points, being sure to adequately identify each plot, or (2) one plot showing all three smooths against the scatterplot, again adequately identifying each smooth.
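A minimal sketch of one way to do this, using R's ksmooth function. Since the course data file is not named on this page, the sketch simulates stand-in data; with the real file, the loaded x and y would replace the simulated ones. Note also that ksmooth scales its bandwidth argument so the kernel quartiles fall at ±0.25 × bandwidth, which may not match the problem's bandwidth convention exactly.

```r
## Sketch only: simulated data stand in for the (unnamed) course file.
set.seed(42)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.5)

bandwidths <- c(2, 1, 0.5)   # 0.5 is an arbitrary "your choice" value
plot(x, y, main = "Kernel smooths, Gaussian kernel")
for (i in seq_along(bandwidths)) {
    ## kernel = "normal" is ksmooth's Gaussian kernel
    fit <- ksmooth(x, y, kernel = "normal", bandwidth = bandwidths[i])
    lines(fit$x, fit$y, lty = i)
}
legend("topright", legend = paste("bandwidth =", bandwidths),
    lty = seq_along(bandwidths))
```

This produces option (2) of the hand-in: one plot with all three smooths over the scatterplot, identified by line type in the legend.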
Second Problem
The file contains two variables x and y, which are made-up regression data.
- Run each of the four smoothers done by the R functions
  - locpoly
  - smooth.spline
  - gam
  - sm.regression
- Report the smoothing parameter used by each method and what this
smoothing parameter purports to be (you may have to grovel around in
the on-line help for these functions; follow the links on the
bandwidth selection web page).
- Hand in either (1) four plots showing each of the four smooths on the scatterplot or (2) one plot showing all four smooths on the scatterplot.
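One possible sketch of the four fits, assuming the KernSmooth, mgcv, and sm packages are installed (smooth.spline is in base R's stats package), and again using simulated data in place of the unnamed course file. The cat lines report each method's smoothing parameter for the write-up.

```r
## Sketch only: simulated data stand in for the (unnamed) course file.
library(KernSmooth)   # locpoly, dpill
library(mgcv)         # gam
library(sm)           # sm.regression

set.seed(1)
x <- sort(runif(150, 0, 10))
y <- cos(x) + rnorm(150, sd = 0.3)

plot(x, y, main = "Four smoothers")
## locpoly: local polynomial regression; needs a bandwidth,
## dpill() gives a plug-in choice
h <- dpill(x, y)
fit1 <- locpoly(x, y, bandwidth = h)
lines(fit1, col = 1)
## smooth.spline: smoothing parameter chosen by cross-validation
fit2 <- smooth.spline(x, y)
lines(fit2, col = 2)
## gam: penalized regression spline, smoothness chosen automatically
fit3 <- gam(y ~ s(x))
lines(x, fitted(fit3), col = 3)
## sm.regression: kernel smoother with its own bandwidth h
fit4 <- sm.regression(x, y, display = "none")
lines(fit4$eval.points, fit4$estimate, col = 4)
legend("topright", col = 1:4, lty = 1,
    legend = c("locpoly", "smooth.spline", "gam", "sm.regression"))

cat("locpoly bandwidth (dpill plug-in):", h, "\n")
cat("smooth.spline spar:", fit2$spar, "lambda:", fit2$lambda, "\n")
cat("gam effective df for s(x):", summary(fit3)$edf, "\n")
cat("sm.regression h:", fit4$h, "\n")
```

Each smoother parameterizes "smoothness" differently (a kernel bandwidth for locpoly and sm.regression, a penalty for smooth.spline and gam), which is exactly what the report should explain.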
Third Problem
The example about parametric bootstrap of logistic regression was about a test of model comparison. This problem uses the same data but is about a confidence interval for a regression coefficient.
In particular, in the logistic regression that is the big model
in the model comparison
Rweb:> out <- glm(kyphosis ~ age + I(age^2) + number + start,
+ family = "binomial")
Rweb:> summary(out)
Call:
glm(formula = kyphosis ~ age + I(age^2) + number + start, family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.23573 -0.51241 -0.24509 -0.06108 2.35495
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.3835660 2.0548871 -2.133 0.03291 *
age 0.0816412 0.0345292 2.364 0.01806 *
I(age^2) -0.0003965 0.0001905 -2.082 0.03737 *
number 0.4268659 0.2365134 1.805 0.07110 .
start -0.2038421 0.0706936 -2.883 0.00393 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 54.428 on 76 degrees of freedom
AIC: 64.428
Number of Fisher Scoring iterations: 6
we are interested in the coefficient for I(age^2),
which is shown above along with its asymptotic standard error
(calculated from Fisher information). A normal-theory, large-sample
confidence interval for the unknown true population regression coefficient
would be −0.0003965 ± 1.96 × 0.0001905.
This coefficient itself is
Rweb:> coefficients(out)[3]
     I(age^2)
-0.0003964918
R does not make it easy to get the standard error it calculates.
Rweb:> summary(out)$coefficients[3, 2]
[1] 0.0001904622
Nevertheless, if we did
theta.hat <- coefficients(out)[3]
sd.hat <- summary(out)$coefficients[3, 2]
then (theta.hat − θ) / sd.hat,
where θ is the unknown population regression coefficient, should be
standard normal if the sample size is sufficiently large. Is it?
- Do a parametric bootstrap simulation of the standardized quantity described above. Plot its histogram.
- Calculate 0.025 and 0.975 quantiles of the simulation distribution done in part (a).
- Calculate a parametric bootstrap 95% confidence interval for θ using these quantiles, theta.hat, and sd.hat.
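The three parts can be sketched as follows. Since the kyphosis data set is not reproduced on this page, the sketch simulates a logistic-regression data set of the same shape; with the real data, the glm call is exactly the one shown in the example. The interval in part (c) is the bootstrap-t interval, theta.hat − sd.hat × (simulation quantiles, swapped).

```r
## Sketch only: simulated data of the same shape as the kyphosis example,
## since the course data set is not reproduced here.
set.seed(13)
n <- 81
age <- runif(n, 1, 200)
number <- sample(2:10, n, replace = TRUE)
start <- sample(1:18, n, replace = TRUE)
eta <- -4 + 0.08 * age - 0.0004 * age^2 + 0.4 * number - 0.2 * start
kyphosis <- rbinom(n, 1, 1 / (1 + exp(- eta)))

out <- glm(kyphosis ~ age + I(age^2) + number + start, family = "binomial")
theta.hat <- coefficients(out)[3]
sd.hat <- summary(out)$coefficients[3, 2]
p.hat <- fitted(out)

nboot <- 200   # use more (e.g. 1e4) for a real answer
z.star <- double(nboot)
for (i in 1:nboot) {
    y.star <- rbinom(n, 1, p.hat)   # simulate from the fitted model
    out.star <- glm(y.star ~ age + I(age^2) + number + start,
        family = "binomial")
    ## studentize with the bootstrap replicate's own standard error
    z.star[i] <- (coefficients(out.star)[3] - theta.hat) /
        summary(out.star)$coefficients[3, 2]
}
hist(z.star)                                   # part (a)
crit <- quantile(z.star, c(0.025, 0.975))      # part (b)
ci <- theta.hat - sd.hat * rev(crit)           # part (c): bootstrap-t interval
```

Some bootstrap replicates may throw the "fitted probabilities numerically 0 or 1 occurred" warning mentioned in the Clarification below; the loop runs through them regardless, which is the point of using the bootstrap distribution rather than the normal approximation.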
Caution: This problem has nothing to do with the model
that results in out2
in the example.
Clarification: You may get warning messages
fitted probabilities numerically 0 or 1 occurred
….
These warnings do mean that some regression coefficients (not necessarily
the one we are interested in) are theoretically at plus or minus infinity,
although R will just make them some very large number. Consider this to be
part of the problem that the bootstrap is supposed to solve.
Answers
Answers in the back of the book are here.