Due Date
Due Wed, Dec 11, 2013.
First Problem
The file contains two variables x and y, which are made-up regression data.
Fit the regression function for these data using a kernel smoother with a Gaussian kernel and bandwidth = 2.
Then repeat with bandwidth = 1.
Then repeat with a bandwidth of your choice. Choose a bandwidth that gives a picture that makes sense to you.
Hand in either (1) three plots each showing one of the three smooths and the scatterplot of points, being sure to adequately identify each plot, or (2) one plot showing all three smooths against the scatterplot, again adequately identifying each smooth.
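A minimal sketch of one way to do this, using R's ksmooth function. Since the course data file is not named on this page, the sketch simulates stand-in data; with the real file, the loaded x and y would replace the simulated ones. Note also that ksmooth scales its bandwidth argument so the kernel quartiles fall at ±0.25 × bandwidth, which may not match the problem's bandwidth convention exactly.

```r
## Sketch only: simulated data stand in for the (unnamed) course file.
set.seed(42)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.5)

bandwidths <- c(2, 1, 0.5)   # 0.5 is an arbitrary "your choice" value
plot(x, y, main = "Kernel smooths, Gaussian kernel")
for (i in seq_along(bandwidths)) {
    ## kernel = "normal" is ksmooth's Gaussian kernel
    fit <- ksmooth(x, y, kernel = "normal", bandwidth = bandwidths[i])
    lines(fit$x, fit$y, lty = i)
}
legend("topright", legend = paste("bandwidth =", bandwidths),
    lty = seq_along(bandwidths))
```

This produces option (2) of the hand-in: one plot with all three smooths over the scatterplot, identified by line type in the legend.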
Second Problem
The file contains two variables x and y, which are made-up regression data.
- Run each of the four smoothers done by the R functions
  - locpoly
  - smooth.spline
  - gam
  - sm.regression
- Report the smoothing parameter used by each method and what this
smoothing parameter purports to be (you may have to grovel around in
the on-line help for these functions; follow the links on the
bandwidth selection web page).
- Hand in either (1) four plots showing each of the four smooths on the scatterplot or (2) one plot showing all four smooths on the scatterplot.
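One possible sketch of the four fits, assuming the KernSmooth, mgcv, and sm packages are installed (smooth.spline is in base R's stats package), and again using simulated data in place of the unnamed course file. The cat lines report each method's smoothing parameter for the write-up.

```r
## Sketch only: simulated data stand in for the (unnamed) course file.
library(KernSmooth)   # locpoly, dpill
library(mgcv)         # gam
library(sm)           # sm.regression

set.seed(1)
x <- sort(runif(150, 0, 10))
y <- cos(x) + rnorm(150, sd = 0.3)

plot(x, y, main = "Four smoothers")
## locpoly: local polynomial regression; needs a bandwidth,
## dpill() gives a plug-in choice
h <- dpill(x, y)
fit1 <- locpoly(x, y, bandwidth = h)
lines(fit1, col = 1)
## smooth.spline: smoothing parameter chosen by cross-validation
fit2 <- smooth.spline(x, y)
lines(fit2, col = 2)
## gam: penalized regression spline, smoothness chosen automatically
fit3 <- gam(y ~ s(x))
lines(x, fitted(fit3), col = 3)
## sm.regression: kernel smoother with its own bandwidth h
fit4 <- sm.regression(x, y, display = "none")
lines(fit4$eval.points, fit4$estimate, col = 4)
legend("topright", col = 1:4, lty = 1,
    legend = c("locpoly", "smooth.spline", "gam", "sm.regression"))

cat("locpoly bandwidth (dpill plug-in):", h, "\n")
cat("smooth.spline spar:", fit2$spar, "lambda:", fit2$lambda, "\n")
cat("gam effective df for s(x):", summary(fit3)$edf, "\n")
cat("sm.regression h:", fit4$h, "\n")
```

Each smoother parameterizes "smoothness" differently (a kernel bandwidth for locpoly and sm.regression, a penalty for smooth.spline and gam), which is exactly what the report should explain.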
Third Problem
The example about parametric bootstrap of logistic regression was about a test of model comparison. This problem uses the same data but is about a confidence interval for a regression coefficient.
In particular, in the logistic regression that is the big model
in the model comparison
Rweb:> out <- glm(kyphosis ~ age + I(age^2) + number + start,
+ family = "binomial")
Rweb:> summary(out)
Call:
glm(formula = kyphosis ~ age + I(age^2) + number + start, family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.23573 -0.51241 -0.24509 -0.06108 2.35495
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.3835660 2.0548871 -2.133 0.03291 *
age 0.0816412 0.0345292 2.364 0.01806 *
I(age^2) -0.0003965 0.0001905 -2.082 0.03737 *
number 0.4268659 0.2365134 1.805 0.07110 .
start -0.2038421 0.0706936 -2.883 0.00393 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 54.428 on 76 degrees of freedom
AIC: 64.428
Number of Fisher Scoring iterations: 6
we are interested in the coefficient for I(age^2),
which is shown above along with its asymptotic standard error
(calculated from Fisher information). A normal-theory, large-sample
confidence interval for the unknown true population regression coefficient
would be −0.0003965 ± 1.96 × 0.0001905.
This coefficient itself is
Rweb:> coefficients(out)[3]
     I(age^2)
-0.0003964918
R does not make it easy to get the standard error it calculates.
Rweb:> summary(out)$coefficients[3, 2]
[1] 0.0001904622
Nevertheless, if we did
theta.hat <- coefficients(out)[3]
sd.hat <- summary(out)$coefficients[3, 2]
then (theta.hat − θ) / sd.hat,
where θ is the unknown population regression coefficient, should be
standard normal if the sample size is sufficiently large. Is it?
- Do a parametric bootstrap simulation of the standardized quantity described above. Plot its histogram.
- Calculate 0.025 and 0.975 quantiles of the simulation distribution done in part (a).
- Calculate a parametric bootstrap 95% confidence interval for θ using these quantiles, theta.hat, and sd.hat.
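The three parts can be sketched as follows. Since the kyphosis data set is not reproduced on this page, the sketch simulates a logistic-regression data set of the same shape; with the real data, the glm call is exactly the one shown in the example. The interval in part (c) is the bootstrap-t interval, theta.hat − sd.hat × (simulation quantiles, swapped).

```r
## Sketch only: simulated data of the same shape as the kyphosis example,
## since the course data set is not reproduced here.
set.seed(13)
n <- 81
age <- runif(n, 1, 200)
number <- sample(2:10, n, replace = TRUE)
start <- sample(1:18, n, replace = TRUE)
eta <- -4 + 0.08 * age - 0.0004 * age^2 + 0.4 * number - 0.2 * start
kyphosis <- rbinom(n, 1, 1 / (1 + exp(- eta)))

out <- glm(kyphosis ~ age + I(age^2) + number + start, family = "binomial")
theta.hat <- coefficients(out)[3]
sd.hat <- summary(out)$coefficients[3, 2]
p.hat <- fitted(out)

nboot <- 200   # use more (e.g. 1e4) for a real answer
z.star <- double(nboot)
for (i in 1:nboot) {
    y.star <- rbinom(n, 1, p.hat)   # simulate from the fitted model
    out.star <- glm(y.star ~ age + I(age^2) + number + start,
        family = "binomial")
    ## studentize with the bootstrap replicate's own standard error
    z.star[i] <- (coefficients(out.star)[3] - theta.hat) /
        summary(out.star)$coefficients[3, 2]
}
hist(z.star)                                   # part (a)
crit <- quantile(z.star, c(0.025, 0.975))      # part (b)
ci <- theta.hat - sd.hat * rev(crit)           # part (c): bootstrap-t interval
```

Some bootstrap replicates may throw the "fitted probabilities numerically 0 or 1 occurred" warning mentioned in the Clarification below; the loop runs through them regardless, which is the point of using the bootstrap distribution rather than the normal approximation.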
Caution: This problem has nothing to do with the model
that results in out2
in the example.
Clarification: You may get warning messages
fitted probabilities numerically 0 or 1 occurred
….
These warnings do mean that some regression coefficients (not necessarily
the one we are interested in) are theoretically at plus or minus infinity,
although R will just make them some very large number. Consider this to be
part of the problem that the bootstrap is supposed to solve.
Answers
Answers in the back of the book are here.