Rules
See the Section about Rules for Quizzes and Homeworks on the General Info page.
Your work handed into Canvas should be an Rmarkdown file with text and code chunks that can be run to produce what you did. We do not take your word for what the output is. We may run it ourselves. But we also want the output.
Homeworks must be uploaded before midnight on the day they are due. Here is the link for uploading this homework: https://canvas.umn.edu/courses/330843/assignments/2881005.
Each homework includes the preceding quiz. You may either redo the quiz questions for homework or not redo them if you are satisfied with your quiz answers. In either case the quiz questions also count as homework questions (so quiz questions count twice, once on the quiz and once on the homework, whether redone or not). If you don't submit anything for problems 1–3 (the quiz questions), then we assume you liked the answers you already submitted.
Quiz 6
Problem 1
This is a simulation study (course notes about simulation).
We are going to study estimators for the Cauchy location family, for which we did maximum likelihood in Section 5.3.3 of the second course notes on models.
R function rcauchy simulates random variates from this distribution. R function dcauchy evaluates the probability density function of this distribution (illustrated in the section of the notes linked just above).

In all cases, we are going to assume the scale parameter is known and equal to the default value (1.0). The unknown parameter to be estimated is the location parameter (called location by the R functions mentioned above).
The estimators we want to compare are

- the maximum likelihood estimator (MLE), calculated as in the section of the notes linked above,
- the sample median, calculated by R function median,
- the sample mean, calculated by R function mean,
- the 10% trimmed sample mean, calculated by R function mean with optional argument trim = 0.1.
Note: Theory says the sample mean is a very bad estimator. But that's OK. That's one thing a simulation can show.
Use sample size n = 15.

Use simulation truth parameter value θ = 0. (In any location family, and in this one in particular, the variance of an equivariant estimator, which all of these are, does not depend on the true unknown parameter value, so the results should be the same regardless of what θ we choose.)

Use at least 10^4 simulations.
Like in the course notes, make a table comparing the MSE of these estimators (with MCSE as appropriate). You do not have to format these nicely in a table using Rmarkdown. As long as you print each number (one estimate of MSE and one estimate of MCSE of MSE for each estimator, which is eight numbers) with a clear indication (maybe a comment, maybe a row or column label in a matrix) of what it is, that's enough.
The R code that makes the tables at the ends of Section 5.5 of those course notes and Section 5.6 of those course notes is hidden (not in the HTML) but can be seen in the Rmarkdown source of those course notes.
Be sure to use the principle of common random numbers.
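The structure of this simulation study might be sketched as follows. This is a minimal sketch, not a complete solution: it assumes θ = 0 and n = 15 as stated above, uses a reduced number of simulations for speed, and computes the MLE with nlm started at the sample median (one reasonable choice; the course notes may use a different starting point).

```r
set.seed(42)
n <- 15
theta <- 0
nsim <- 1e3   # use at least 1e4 for the homework

# minus log likelihood for the Cauchy location family, scale known = 1
mlogl <- function(mu, x) sum(- dcauchy(x, location = mu, log = TRUE))

est <- matrix(NA_real_, nrow = nsim, ncol = 4,
    dimnames = list(NULL, c("mle", "median", "mean", "trim10")))
for (i in 1:nsim) {
    # one simulated sample feeds all four estimators
    # (the principle of common random numbers)
    x <- rcauchy(n, location = theta)
    est[i, "mle"] <- nlm(mlogl, median(x), x = x)$estimate
    est[i, "median"] <- median(x)
    est[i, "mean"] <- mean(x)
    est[i, "trim10"] <- mean(x, trim = 0.1)
}
sqerr <- (est - theta)^2
mse <- colMeans(sqerr)
# MCSE of each MSE estimate: SD of the squared errors over sqrt(nsim)
mcse <- apply(sqerr, 2, sd) / sqrt(nsim)
rbind(mse = mse, mcse = mcse)
```

Because the sample mean of Cauchy data is itself Cauchy distributed, expect its estimated MSE to be wildly unstable, which is exactly what the problem says a simulation can show.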
Problem 2
The R command

    foo <- read.table(url("http://www.stat.umn.edu/geyer/3701/data/2022/q6p2.txt"),
        header = TRUE)

assigns one R object foo, which is a data frame containing one variable x, which is the data for this problem.
The estimator we are going to use in this problem is the 25% trimmed mean (calculated by mean(x, trim = 0.25) if the data are x). What parameter does this estimate? It estimates the 25% trimmed mean of the true unknown distribution of the data. Call that unknown parameter θ.
Calculate a 95% nonparametric bootstrap t confidence interval using these data and R function boott in R package bootstrap, using no sdfun (because we don't know how to make one), following the example in the course notes on the bootstrap.

Use at least 10^3 bootstrap samples.

Use at least 200 bootstrap samples for the inner bootstrap that determines the SD function (argument nbootsd).
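The nested-bootstrap logic that boott automates when no sdfun is supplied can be sketched in base R. This is only an illustration of the idea, using synthetic stand-in data and reduced bootstrap sample sizes; for the homework use the real q6p2.txt data and R function boott with the sample sizes stated above.

```r
set.seed(42)
x <- rcauchy(30)          # stand-in for foo$x from q6p2.txt
theta <- function(x) mean(x, trim = 0.25)
nboot <- 200              # outer bootstrap; use at least 1e3 for the homework
nbootsd <- 100            # inner bootstrap; use at least 200 for the homework

thetahat <- theta(x)
# inner bootstrap: estimate the SD of the trimmed mean for a given sample
sdboot <- function(x)
    sd(replicate(nbootsd, theta(sample(x, replace = TRUE))))
# outer bootstrap: studentized statistic for each resample
tstat <- replicate(nboot, {
    xstar <- sample(x, replace = TRUE)
    (theta(xstar) - thetahat) / sdboot(xstar)
})
sdhat <- sdboot(x)
crit <- quantile(tstat, c(0.025, 0.975))
# bootstrap t interval: the quantiles swap ends
ci <- thetahat - rev(crit) * sdhat
ci
```

The swap of quantile ends in the last step is the defining feature of the bootstrap t interval: the upper quantile of the studentized statistic determines the lower endpoint, and vice versa.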
Problem 3
This problem is about parallelization.
We are going to parallelize one of the simulations in Problem 1: the one for the sample median.
Parallelize this simulation using R function mclapply in R package parallel, following the example in Section 7.1 of the course notes on parallel computing. If you are on Microsoft Windows, you will have to use mc.cores = 1 as explained in those notes.

Do everything as in Problem 1 except parallelize using mclapply and use only the sample median, not the other estimators.
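A minimal sketch of the parallelized version, assuming θ = 0 and n = 15 as in Problem 1. The mc.cores = 1 setting works everywhere (including Windows); on Mac or Linux you could try detectCores(). The L'Ecuyer RNG is the standard choice for reproducible parallel streams.

```r
library(parallel)

n <- 15
theta <- 0
nsim <- 1e4
RNGkind("L'Ecuyer-CMRG")   # RNG suitable for parallel streams
set.seed(42)

# each list element is one simulated sample median
medstar <- unlist(mclapply(1:nsim,
    function(i) median(rcauchy(n, location = theta)),
    mc.cores = 1))         # use more cores on Mac or Linux

sqerr <- (medstar - theta)^2
mse <- mean(sqerr)
mcse <- sd(sqerr) / sqrt(nsim)
c(mse = mse, mcse = mcse)
```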
Homework 6
Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.
Problem 4
The R commands

    foo <- read.csv(url("https://www.stat.umn.edu/geyer/3701/data/q6p4.csv"))
    gout1 <- glm(y ~ x1 + x2, family = binomial, data = foo)
    summary(gout1)
    gout2 <- glm(y ~ poly(x1, x2, degree = 2), family = binomial, data = foo)
    summary(gout2)
    anova(gout1, gout2, test = "Chisq")

do one test of model comparison for a logistic regression, comparing linear and quadratic models.
Not trusting the asymptotics, we want to do a parametric bootstrap calculation of the P-value.
The code

    aout <- anova(gout1, gout2, test = "Chisq")
    aout[2, "Pr(>Chi)"]

obtains that P-value.
The code

    ystar <- simulate(gout1)$sim_1

simulates a new response vector ystar under the null hypothesis. (The R function simulate is a generic function that package authors may provide to simulate models.)
Do at least 10^4 simulations of this model (the MLE for the null hypothesis). Fit the null (linear) and alternative (quadratic) models to each, obtain the P-value, and store it. Suppose you call the vector in which you store all these bootstrap P-values pstar.

Suppose you call the P-value for the actual data phat.
A P-value is a statistic — a function of the data that is not a function of the parameters — so if we think of it as just a test statistic that we don't know the distribution of (which is why we are bootstrapping it), then we use the bootstrap distribution as its sampling distribution. Since we reject the null hypothesis when the P-value is low, the way to calculate the P-value from this simulation is
    mean(c(phat, pstar) <= phat)

Note that we include the P-value for the actual data in with all the bootstrap P-values. This is valid because under the null hypothesis it has almost the same distribution. It also prevents getting P = 0 by just doing too few simulations.
Ideally, the R functions glm and anova should always calculate the P-value accurately, using care about computer arithmetic (like we do). Unfortunately, they do not. You may get warnings from glm.fit, the R function that glm calls to do the actual model fitting. And you may get NA or NaN for a bootstrap P-value. This would not happen if the R functions glm and anova were coded with more attention to computer arithmetic. But they aren't.
- Ignore the warnings.
- If you get NA or NaN for any bootstrap P-value, replace it with 1.0. This means we do not reject the null hypothesis at any α level. If you cannot compute a P-value, then you have to assume there is no evidence at all against the null hypothesis, so this is the correct thing to do.
Problem 5
This problem is almost the same as the preceding one. We are going to do the same calculation but think of it as a simulation study rather than as a bootstrap.
With the same vector pstar computed in the preceding problem, check whether it gives valid tests. Under the null hypothesis a test rejects the null hypothesis at level α with probability α. Using a P-value to conduct the test, the test rejects at level α when P ≤ α. This says, if the test statistic is continuous, that the P-value is uniformly distributed on the interval (0, 1), that is,

    Pr(P ≤ α) = α,    for all 0 < α < 1.

We do not want to calculate this for an infinite number of α, and that wouldn't make sense even if we could do it because nboot is finite. Moreover, we are only interested in low values of α.
Thus, here is what the problem requires.
- Estimate Pr(P ≤ α) for α = 0.01, 0.02, …, 0.10.
- Also provide MCSE of these estimates.
So the problem requires you to calculate 20 numbers (10 estimates, 10 MCSE).
Hint: recall that MCSE of zero-or-one-valued random variables can be calculated using the formulas for standard errors of sample proportions from STAT 3011, as discussed in Section 6.1.5 of the course notes on the bootstrap.
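The calculation can be sketched as follows. To keep the sketch self-contained, a synthetic uniform sample stands in for the pstar vector computed in Problem 4.

```r
set.seed(42)
nboot <- 1e4
pstar <- runif(nboot)   # stand-in: use the pstar from Problem 4 instead

alpha <- seq(0.01, 0.10, by = 0.01)
# estimated Pr(P <= alpha) for each alpha
prob <- sapply(alpha, function(a) mean(pstar <= a))
# the indicators (pstar <= a) are zero-or-one valued, so the
# standard error of a sample proportion applies
mcse <- sqrt(prob * (1 - prob) / nboot)
cbind(alpha, prob, mcse)
```

That final matrix contains the 20 required numbers: 10 estimates and 10 MCSE.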
Problem 6
In this problem we are going to bootstrap a likelihood-based confidence interval. In particular, we are going to bootstrap the 95% likelihood interval for the scale parameter σ calculated in Section 9.3.3 of the course notes on models, part II, and reported in the summary section following it as (2.86, 6.65).
The theory on which this test is based is in Section 7.1.1 of those notes.
That theory says, in other words than in that section of the notes, that twice the difference of log likelihoods, the first maximized over both parameters and the second maximized with the parameter of interest (that's σ in this problem) held fixed, has an asymptotic distribution that is chi-square on one degree of freedom when the null hypothesis is true, that is, when the value σ0 at which we fix σ is the true unknown parameter value.
If we don't believe the asymptotics and want to do a parametric bootstrap instead, we should derive a critical value from simulation of the null distribution of the test statistic.
Hence we do the following steps in each iteration of bootstrap simulation.
Use bootstrap sample size nboot = 1e4.
- Simulate data from the MLE distribution. The parameter estimates were reported to be 9.266664 for μ and 4.362070 for σ in Section 9.2 of those notes. The R function rcauchy will simulate such data. The data sample size was never shown in those notes but was n = 45.
- Calculate the maximum of the log likelihood, optimizing over both parameters, and also the maximum optimizing over μ alone while holding σ fixed at its true unknown parameter value in the bootstrap world, which is the value you are using for the simulation distribution.
- Calculate the bootstrap test statistic (twice the difference in log likelihoods: big model minus little model). Since we are using minus the log likelihood for the convenience of the R function nlm, it will be little model minus big model (because minus times minus is plus). You can check that you got it the right way around because the test statistic has to be positive (the unconstrained maximum has to be at least as large as the constrained maximum).
- Store each value of the bootstrap test statistic.
The result of this simulation is (the bootstrap estimate of) the sampling distribution of the test statistic under the null hypothesis. The critical value for making a 95% likelihood-based-confidence interval is the 0.95 quantile of this distribution. Find that critical value.
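The steps above can be sketched as below. This is a minimal sketch, not the notes' code: it uses the reported estimates and n = 45, a reduced nboot for speed, nlm started at the sample median and half the IQR (one reasonable choice), and a guard returning Inf for nonpositive σ so the optimizer stays in the parameter space.

```r
set.seed(42)
mu.hat <- 9.266664
sigma.hat <- 4.362070
n <- 45
nboot <- 200   # use nboot = 1e4 for the homework

# minus log likelihood, both parameters free
mlogl2 <- function(theta, x) {
    if (theta[2] <= 0) return(Inf)   # keep sigma positive
    sum(- dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
}
# minus log likelihood, sigma fixed at the simulation truth
mlogl1 <- function(mu, x)
    sum(- dcauchy(x, location = mu, scale = sigma.hat, log = TRUE))

tstar <- double(nboot)
for (i in 1:nboot) {
    xstar <- rcauchy(n, location = mu.hat, scale = sigma.hat)
    big <- nlm(mlogl2, c(median(xstar), IQR(xstar) / 2), x = xstar)
    little <- nlm(mlogl1, median(xstar), x = xstar)
    # little minus big because these are MINUS log likelihoods
    tstar[i] <- 2 * (little$minimum - big$minimum)
}
crit <- quantile(tstar, 0.95)
crit
```

If the theory is good for this sample size, crit should be near the asymptotic value qchisq(0.95, df = 1), about 3.84.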
We won't bother actually making the confidence interval. It would be done exactly the same as in Section 9.3.3 of those notes, except with crit.chisq defined as we calculate in this problem rather than

    > qchisq(0.95, df = 1)
    [1] 3.841459

which is derived from asymptotic theory and used in those notes. (But you do not have to actually do that.)
Hint: An R function that calculates minus the log likelihood for the two-parameter Cauchy distribution is found in Section 9.2 of those notes. But this function needs to be modified for this problem.

- In maximizing over μ and σ, one needs to
  - make the data an argument, because the simulated data change in each bootstrap iteration,
  - redefine the function each time through the loop with the data name inside the function the same as outside,
  - or define the function only once with the data name inside the function the same as outside.
- In maximizing over μ holding σ fixed, one needs to make a version of mlogl that has
  - only the arguments mu and x, if you are making the data an argument,
  - or only the argument mu, if you are not.
Problem 7
This problem is about Bayesian inference via Markov chain Monte Carlo (MCMC), which was covered in the course notes on that subject.
The statistical model is Cauchy location-scale, the same model that was analyzed by likelihood methods in Section 9 of the course notes on statistical models, Part II.
For the prior we are going to use what the course notes on Bayesian inference call the "method of made-up data" (which in this problem is a special case of the well known and widely used method of conjugate priors, which we did not explain in the course notes, and will not explain here).
We take the prior to be the likelihood for made-up data { −1, 0, 1 }, so the unnormalized posterior is the same as the likelihood for the real data times the likelihood for these made-up data, or, what is the same thing, the likelihood for the concatenation of the real and made-up data.
It is not obvious that this prior is proper, but I checked it by integrating it in Mathematica, and it is proper.
Simulate the posterior distribution of the parameters μ and σ using the R function metrop in the CRAN package mcmc.

Do a simulation in which there is no batching (argument blen = 1 to the R function metrop) so the result returned by metrop has a component batch whose columns are simulations of μ and σ rather than batch means.
From this make 95% equal-tailed Bayesian credible intervals for μ and σ, which is something not covered in the course notes, but is very simple. The credible interval for μ has endpoints that are the 0.025 and 0.975 quantiles of the simulations of μ and similarly for σ.
Do not forget the restriction σ > 0. Also do not forget that the starting state for the Metropolis algorithm must also satisfy this constraint.
For data, use the same data as were used in the course notes:

    x <- read.csv(url("http://www.stat.umn.edu/geyer/5102/data/prob7-1.txt"))$x
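The random-walk Metropolis algorithm that metrop implements can be sketched in base R to show the moving parts: the log unnormalized posterior built from the concatenation of real and made-up data, the σ > 0 constraint enforced by returning -Inf, and the equal-tailed credible intervals from quantiles. This is only an illustration; it uses synthetic stand-in data and an untuned proposal scale, and for the homework you should use the real data and metrop.

```r
set.seed(42)
x <- rcauchy(30, location = 9, scale = 4)   # stand-in for the course data
made.up <- c(-1, 0, 1)                      # the made-up prior data

# log unnormalized posterior = Cauchy log likelihood of
# real data concatenated with made-up data
lupost <- function(theta) {
    mu <- theta[1]
    sigma <- theta[2]
    if (sigma <= 0) return(-Inf)   # enforce the constraint sigma > 0
    sum(dcauchy(c(x, made.up), location = mu, scale = sigma, log = TRUE))
}

niter <- 1e4
prop.scale <- 1                        # would need tuning in practice
theta <- c(median(x), IQR(x) / 2)      # starting state satisfies sigma > 0
batch <- matrix(NA_real_, niter, 2)
lu <- lupost(theta)
for (i in 1:niter) {
    prop <- theta + prop.scale * rnorm(2)
    lu.prop <- lupost(prop)
    # Metropolis accept/reject (proposals with sigma <= 0 always rejected)
    if (log(runif(1)) < lu.prop - lu) {
        theta <- prop
        lu <- lu.prop
    }
    batch[i, ] <- theta
}
# 95% equal-tailed credible intervals: columns are mu and sigma
ci <- apply(batch, 2, quantile, probs = c(0.025, 0.975))
ci
```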