Rules
See the Section about Rules for Quizzes and Homeworks on the General Info page.
Your work handed into Canvas should be an Rmarkdown file with text and code chunks that can be run to produce what you did. We do not take your word for what the output is. We may run it ourselves. But we also want the output.
Homeworks must be uploaded before midnight on the day they are due. Here is the link for uploading this homework: https://canvas.umn.edu/courses/330843/assignments/2881005.
Each homework includes the preceding quiz. You may either redo the quiz questions for homework or not redo them if you are satisfied with your quiz answers. In either case the quiz questions also count as homework questions (so quiz questions count twice, once on the quiz and once on the homework, whether redone or not). If you don't submit anything for problems 1–3 (the quiz questions), then we assume you liked the answers you already submitted.
Quiz 6
Problem 1
This is a simulation study (course notes about simulation).
We are going to study estimators for the Cauchy location family, for which we did maximum likelihood in Section 5.3.3 of the second course notes on models.
R function rcauchy simulates random variates from this distribution. R function dcauchy evaluates the probability density function of this distribution (illustrated in the section of the notes linked just above).

In all cases, we are going to assume the scale parameter is known and equal to the default value (1.0). The unknown parameter to be estimated is the location parameter (called location by the R functions mentioned above).
The estimators we want to compare are

- the maximum likelihood estimator (MLE), calculated as in the section of the notes linked above,
- the sample median, calculated by R function median,
- the sample mean, calculated by R function mean,
- the 10% trimmed sample mean, calculated by R function mean with optional argument trim = 0.1.
Note: Theory says the sample mean is a very bad estimator. But that's OK. That's one thing a simulation can show.
Use sample size n = 15.

Use simulation truth parameter value θ = 0. (In any location family, and in this one in particular, the variance of an equivariant estimator, which all of these are, does not depend on the true unknown parameter value, so the results should be the same regardless of what θ we choose.)

Use at least 10^4 simulations.
Like in the course notes, make a table comparing the MSE of these estimators (with MCSE as appropriate). You do not have to format these nicely in a table using Rmarkdown. As long as you print each number (one estimate of MSE and one estimate of MCSE of MSE for each estimator, which is eight numbers) with a clear indication (maybe a comment, maybe a row or column label in a matrix) of what it is, that's enough.
The R code that makes the tables at the ends of Section 5.5 of those course notes and Section 5.6 of those course notes is hidden (not in the HTML) but can be seen in the Rmarkdown source of those course notes.
Be sure to use the principle of common random numbers.
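The structure of this simulation study might be sketched as follows. This is a minimal sketch, not a complete solution: it assumes θ = 0 and n = 15 as stated above, uses a reduced number of simulations for speed, and computes the MLE with nlm started at the sample median (one reasonable choice; the course notes may use a different starting point).

```r
set.seed(42)
n <- 15
theta <- 0
nsim <- 1e3   # use at least 1e4 for the homework

# minus log likelihood for the Cauchy location family, scale known = 1
mlogl <- function(mu, x) sum(- dcauchy(x, location = mu, log = TRUE))

est <- matrix(NA_real_, nrow = nsim, ncol = 4,
    dimnames = list(NULL, c("mle", "median", "mean", "trim10")))
for (i in 1:nsim) {
    # one simulated sample feeds all four estimators
    # (the principle of common random numbers)
    x <- rcauchy(n, location = theta)
    est[i, "mle"] <- nlm(mlogl, median(x), x = x)$estimate
    est[i, "median"] <- median(x)
    est[i, "mean"] <- mean(x)
    est[i, "trim10"] <- mean(x, trim = 0.1)
}
sqerr <- (est - theta)^2
mse <- colMeans(sqerr)
# MCSE of each MSE estimate: SD of the squared errors over sqrt(nsim)
mcse <- apply(sqerr, 2, sd) / sqrt(nsim)
rbind(mse = mse, mcse = mcse)
```

Because the sample mean of Cauchy data is itself Cauchy distributed, expect its estimated MSE to be wildly unstable, which is exactly what the problem says a simulation can show.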
Problem 2
The R command

    foo <- read.table(url("http://www.stat.umn.edu/geyer/3701/data/2022/q6p2.txt"),
        header = TRUE)

assigns one R object foo, which is a data frame containing one variable x, which is the data for this problem.
The estimator we are going to use in this problem is the 25% trimmed mean (calculated by mean(x, trim = 0.25) if the data are x). What parameter does this estimate? It estimates the 25% trimmed mean of the true unknown distribution of the data. Call that unknown parameter θ.
Calculate a 95% nonparametric bootstrap t confidence interval using these data and R function boott in R package bootstrap, using no sdfun (because we don't know how to make one), following the example in the course notes on the bootstrap.

Use at least 10^3 bootstrap samples.

Use at least 200 bootstrap samples for the inner bootstrap that determines the SD function (argument nbootsd).
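The nested-bootstrap logic that boott automates when no sdfun is supplied can be sketched in base R. This is only an illustration of the idea, using synthetic stand-in data and reduced bootstrap sample sizes; for the homework use the real q6p2.txt data and R function boott with the sample sizes stated above.

```r
set.seed(42)
x <- rcauchy(30)          # stand-in for foo$x from q6p2.txt
theta <- function(x) mean(x, trim = 0.25)
nboot <- 200              # outer bootstrap; use at least 1e3 for the homework
nbootsd <- 100            # inner bootstrap; use at least 200 for the homework

thetahat <- theta(x)
# inner bootstrap: estimate the SD of the trimmed mean for a given sample
sdboot <- function(x)
    sd(replicate(nbootsd, theta(sample(x, replace = TRUE))))
# outer bootstrap: studentized statistic for each resample
tstat <- replicate(nboot, {
    xstar <- sample(x, replace = TRUE)
    (theta(xstar) - thetahat) / sdboot(xstar)
})
sdhat <- sdboot(x)
crit <- quantile(tstat, c(0.025, 0.975))
# bootstrap t interval: the quantiles swap ends
ci <- thetahat - rev(crit) * sdhat
ci
```

The swap of quantile ends in the last step is the defining feature of the bootstrap t interval: the upper quantile of the studentized statistic determines the lower endpoint, and vice versa.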
Problem 3
This problem is about parallelization.
We are going to parallelize one of the simulations in Problem 1: the one for the sample median.
Parallelize this simulation using R function mclapply in R package parallel, following the example in Section 7.1 of the course notes on parallel computing. If you are on Microsoft Windows, you will have to use mc.cores = 1 as explained in those notes.

Do everything as in Problem 1 except parallelize using mclapply and use only the sample median, not the other estimators.
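A minimal sketch of the parallelized version, assuming θ = 0 and n = 15 as in Problem 1. The mc.cores = 1 setting works everywhere (including Windows); on Mac or Linux you could try detectCores(). The L'Ecuyer RNG is the standard choice for reproducible parallel streams.

```r
library(parallel)

n <- 15
theta <- 0
nsim <- 1e4
RNGkind("L'Ecuyer-CMRG")   # RNG suitable for parallel streams
set.seed(42)

# each list element is one simulated sample median
medstar <- unlist(mclapply(1:nsim,
    function(i) median(rcauchy(n, location = theta)),
    mc.cores = 1))         # use more cores on Mac or Linux

sqerr <- (medstar - theta)^2
mse <- mean(sqerr)
mcse <- sd(sqerr) / sqrt(nsim)
c(mse = mse, mcse = mcse)
```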
Homework 6
Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.
Problem 4
The R commands

    foo <- read.csv(url("https://www.stat.umn.edu/geyer/3701/data/q6p4.csv"))
    gout1 <- glm(y ~ x1 + x2, family = binomial, data = foo)
    summary(gout1)
    gout2 <- glm(y ~ poly(x1, x2, degree = 2), family = binomial, data = foo)
    summary(gout2)
    anova(gout1, gout2, test = "Chisq")

do one test of model comparison for a logistic regression, comparing linear and quadratic models.
Not trusting the asymptotics, we want to do a parametric bootstrap calculation of the P-value.
The code

    aout <- anova(gout1, gout2, test = "Chisq")
    aout[2, "Pr(>Chi)"]

obtains that P-value.
The code

    ystar <- simulate(gout1)$sim_1

simulates a new response vector ystar under the null hypothesis. (The R function simulate is a generic function that package authors may provide to simulate models.)
Do at least 10^4 simulations of this model (the MLE for the null hypothesis). Fit the null (linear) and alternative (quadratic) models to each, obtain the P-value, and store it. Suppose you call the vector in which you store all these bootstrap P-values pstar.

Suppose you call the P-value for the actual data phat.
A P-value is a statistic — a function of the data that is not a function of the parameters — so if we think of it as just a test statistic that we don't know the distribution of (which is why we are bootstrapping it), then we use the bootstrap distribution as its sampling distribution. Since we reject the null hypothesis when the P-value is low, the way to calculate the P-value from this simulation is
    mean(c(phat, pstar) <= phat)

Note that we include the P-value for the actual data in with all the bootstrap P-values. This is valid because under the null hypothesis it has almost the same distribution. It also prevents getting P = 0 by just doing too few simulations.
Ideally, the R functions glm and anova should always calculate the P-value accurately, using care about computer arithmetic (like we do). Unfortunately, they do not. You may get warnings from glm.fit, the R function that glm calls to do the actual model fitting. And you may get NA or NaN for a bootstrap P-value. This would not happen if the R functions glm and anova were coded with more attention to computer arithmetic. But they aren't.
- Ignore the warnings.
- If you get NA or NaN for any bootstrap P-value, replace it with 1.0. This means we do not reject the null hypothesis at any α level. If you cannot compute a P-value, then you have to assume there is no evidence at all against the null hypothesis, so this is the correct thing to do.
Problem 5
This problem is almost the same as the preceding one. We are going to do the same calculation but think of it as a simulation study rather than as a bootstrap.
With the same vector pstar computed in the preceding problem, check whether it gives valid tests. Under the null hypothesis a test rejects the null hypothesis at level α with probability α. Using a P-value to conduct the test, the test rejects at level α when P ≤ α. This says, if the test statistic is continuous, that the P-value is uniformly distributed on the interval (0, 1), that is,

    Pr(P ≤ α) = α,    for all 0 < α < 1.

We do not want to calculate this for an infinite number of α, and that wouldn't make sense even if we could do it because nboot is finite. Moreover, we are only interested in low values of α.
Thus, here is what the problem requires.
- Estimate Pr(P ≤ α) for α = 0.01, 0.02, …, 0.10.
- Also provide MCSE of these estimates.
So the problem requires you to calculate 20 numbers (10 estimates, 10 MCSE).
Hint: recall that MCSE of zero-or-one-valued random variables can be calculated using the formulas for standard errors of sample proportions from STAT 3011, as discussed in Section 6.1.5 of the course notes on the bootstrap.
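The calculation can be sketched as follows. To keep the sketch self-contained, a synthetic uniform sample stands in for the pstar vector computed in Problem 4.

```r
set.seed(42)
nboot <- 1e4
pstar <- runif(nboot)   # stand-in: use the pstar from Problem 4 instead

alpha <- seq(0.01, 0.10, by = 0.01)
# estimated Pr(P <= alpha) for each alpha
prob <- sapply(alpha, function(a) mean(pstar <= a))
# the indicators (pstar <= a) are zero-or-one valued, so the
# standard error of a sample proportion applies
mcse <- sqrt(prob * (1 - prob) / nboot)
cbind(alpha, prob, mcse)
```

That final matrix contains the 20 required numbers: 10 estimates and 10 MCSE.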
Problem 6
In this problem we are going to bootstrap a likelihood-based confidence interval. In particular, we are going to bootstrap the 95% likelihood interval for the scale parameter σ calculated in Section 9.3.3 of the course notes on models, part II, and reported in the summary section following it as (2.86, 6.65).
The theory on which this test is based is in Section 7.1.1 of those notes.
That theory says, in other words than in that section of the notes, that twice the difference of log likelihoods, the first maximized over both parameters and the second maximized with the parameter of interest (that's σ in this problem) held fixed, has an asymptotic distribution that is chi-square on one degree of freedom when the null hypothesis is true, that is, when the value σ0 at which we fix σ is the true unknown parameter value.
If we don't believe the asymptotics and want to do a parametric bootstrap instead, we should derive a critical value from simulation of the null distribution of the test statistic.
Hence we do the following steps in each iteration of bootstrap simulation.
Use bootstrap sample size nboot = 1e4.
- Simulate data from the MLE distribution. The parameter estimates were reported to be 9.266664 for μ and 4.362070 for σ in Section 9.2 of those notes. The R function rcauchy will simulate such data. The data sample size was never shown in those notes but was n = 45.
- Calculate the maximum of the log likelihood, optimizing over both parameters, and also the maximum optimizing over μ alone while holding σ fixed at its true unknown parameter value in the bootstrap world, which is the value you are using for the simulation distribution.
- Calculate the bootstrap test statistic (twice the difference in log likelihoods: big model minus little model). Since we are using minus the log likelihood for the convenience of the R function nlm, it will be little model minus big model (because minus times minus is plus). You can check that you got it the right way around because the test statistic has to be positive (the unconstrained maximum has to be at least as large as the constrained maximum).
- Store each value of the bootstrap test statistic.
The result of this simulation is (the bootstrap estimate of) the sampling distribution of the test statistic under the null hypothesis. The critical value for making a 95% likelihood-based-confidence interval is the 0.95 quantile of this distribution. Find that critical value.
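The steps above can be sketched as below. This is a minimal sketch, not the notes' code: it uses the reported estimates and n = 45, a reduced nboot for speed, nlm started at the sample median and half the IQR (one reasonable choice), and a guard returning Inf for nonpositive σ so the optimizer stays in the parameter space.

```r
set.seed(42)
mu.hat <- 9.266664
sigma.hat <- 4.362070
n <- 45
nboot <- 200   # use nboot = 1e4 for the homework

# minus log likelihood, both parameters free
mlogl2 <- function(theta, x) {
    if (theta[2] <= 0) return(Inf)   # keep sigma positive
    sum(- dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
}
# minus log likelihood, sigma fixed at the simulation truth
mlogl1 <- function(mu, x)
    sum(- dcauchy(x, location = mu, scale = sigma.hat, log = TRUE))

tstar <- double(nboot)
for (i in 1:nboot) {
    xstar <- rcauchy(n, location = mu.hat, scale = sigma.hat)
    big <- nlm(mlogl2, c(median(xstar), IQR(xstar) / 2), x = xstar)
    little <- nlm(mlogl1, median(xstar), x = xstar)
    # little minus big because these are MINUS log likelihoods
    tstar[i] <- 2 * (little$minimum - big$minimum)
}
crit <- quantile(tstar, 0.95)
crit
```

If the theory is good for this sample size, crit should be near the asymptotic value qchisq(0.95, df = 1), about 3.84.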
We won't bother actually making the confidence interval. It would be done exactly the same as in Section 9.3.3 of those notes, except with crit.chisq defined as we calculate in this problem rather than

    > qchisq(0.95, df = 1)
    [1] 3.841459

which is derived from asymptotic theory and used in those notes. (But you do not have to actually do that.)
Hint: An R function that calculates minus the log likelihood for the two-parameter Cauchy distribution is found in Section 9.2 of those notes. But this function needs to be modified for this problem.

- In maximizing over μ and σ, one needs to
  - make the data an argument, because the simulated data change in each bootstrap iteration,
  - redefine the function each time through the loop with the data name inside the function the same as outside,
  - or define the function only once with the data name inside the function the same as outside.
- In maximizing over μ holding σ fixed, one needs to make a version of mlogl that has
  - only the arguments mu and x, if you are making the data an argument,
  - or only the argument mu, if you are not.
Problem 7
This problem is about Bayesian inference via Markov chain Monte Carlo (MCMC), which was covered in the course notes on that subject.
The statistical model is Cauchy location-scale, the same model that was analyzed by likelihood methods in Section 9 of the course notes on statistical models, Part II.
For the prior we are going to use what the course notes on Bayesian inference call the "method of made-up data" (which in this problem is a special case of the well known and widely used method of conjugate priors, which we did not explain in the course notes, and will not explain here).
We take the prior to be the likelihood for made-up data { −1, 0, 1 }, so the unnormalized posterior is the same as the likelihood for the real data times the likelihood for these made-up data, or, what is the same thing, the likelihood for the concatenation of the real and made-up data.
It is not obvious that this prior is proper, but I checked it by integrating it in Mathematica, and it is proper.
Simulate the posterior distribution of the parameters μ and σ using the R function metrop in the CRAN package mcmc.

Do a simulation in which there is no batching (argument blen = 1 to the R function metrop) so the result returned by metrop has a component batch whose columns are simulations of μ and σ rather than batch means.
From this make 95% equal-tailed Bayesian credible intervals for μ and σ, which is something not covered in the course notes, but is very simple. The credible interval for μ has endpoints that are the 0.025 and 0.975 quantiles of the simulations of μ and similarly for σ.
Do not forget the restriction σ > 0. Also do not forget that the starting state for the Metropolis algorithm must also satisfy this constraint.
For data, use the same data as were used in the course notes:

    x <- read.csv(url("http://www.stat.umn.edu/geyer/5102/data/prob7-1.txt"))$x
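The random-walk Metropolis algorithm that metrop implements can be sketched in base R to show the moving parts: the log unnormalized posterior built from the concatenation of real and made-up data, the σ > 0 constraint enforced by returning -Inf, and the equal-tailed credible intervals from quantiles. This is only an illustration; it uses synthetic stand-in data and an untuned proposal scale, and for the homework you should use the real data and metrop.

```r
set.seed(42)
x <- rcauchy(30, location = 9, scale = 4)   # stand-in for the course data
made.up <- c(-1, 0, 1)                      # the made-up prior data

# log unnormalized posterior = Cauchy log likelihood of
# real data concatenated with made-up data
lupost <- function(theta) {
    mu <- theta[1]
    sigma <- theta[2]
    if (sigma <= 0) return(-Inf)   # enforce the constraint sigma > 0
    sum(dcauchy(c(x, made.up), location = mu, scale = sigma, log = TRUE))
}

niter <- 1e4
prop.scale <- 1                        # would need tuning in practice
theta <- c(median(x), IQR(x) / 2)      # starting state satisfies sigma > 0
batch <- matrix(NA_real_, niter, 2)
lu <- lupost(theta)
for (i in 1:niter) {
    prop <- theta + prop.scale * rnorm(2)
    lu.prop <- lupost(prop)
    # Metropolis accept/reject (proposals with sigma <= 0 always rejected)
    if (log(runif(1)) < lu.prop - lu) {
        theta <- prop
        lu <- lu.prop
    }
    batch[i, ] <- theta
}
# 95% equal-tailed credible intervals: columns are mu and sigma
ci <- apply(batch, 2, quantile, probs = c(0.025, 0.975))
ci
```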