--- title: "Stat 5421 Lecture Notes: Likelihood Computation" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML pdf_document: latex_engine: xelatex number_sections: true md_extensions: -tex_math_single_backslash --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. * The version of the `knitr` package used to make this document is `r packageVersion("knitr")`. * The version of the `numDeriv` package used to make this document is `r packageVersion("numDeriv")`. * The version of the `alabama` package used to make this document is `r packageVersion("alabama")`. # Statistical Model As an example, we are going to use the ABO blood group system. As you all know, you can have blood type A, B, AB, or O. And this governs what blood types you can be safely transfused with. What this is about is red blood cell surface antigens, which are proteins that stick out from the surface of red blood cells and can be recognized by antibodies (antigens are things recognized by antibodies) causing an immune response if antigens different from one's own are introduced by blood transfusion (or any other way). Genetically, these proteins are coded for by DNA and each person has two copies of these genes, one from Mom and one from Dad. And each such gene can be in one of three states (alleles): code for the A protein, code for the B protein, or code for no functional protein (that's the type O). Let the probability of the A, B, and O alleles be $\alpha$, $\beta$, and $\gamma$, respectively. 
We are now going to assume that the genes inherited from the two parents
are independent.  This is called Hardy-Weinberg equilibrium in genetics.
It is exactly true for random mating populations, and approximately true
in populations that have had large population sizes for many generations.
Whether or not it holds for any particular population, we are going to
assume it.

The Hardy-Weinberg assumption (statistical independence) says probabilities
multiply.  This gives us the following table.

+:-:+:---------------:+:--------------:+:---------------:+
|   | A               | B              | O               |
+---+-----------------+----------------+-----------------+
| A | $\alpha^2$      | $\alpha \beta$ | $\alpha \gamma$ |
+---+-----------------+----------------+-----------------+
| B | $\beta \alpha$  | $\beta^2$      | $\beta \gamma$  |
+---+-----------------+----------------+-----------------+
| O | $\gamma \alpha$ | $\gamma \beta$ | $\gamma^2$      |
+---+-----------------+----------------+-----------------+

The row labels are genes inherited from Mom, the column labels genes
inherited from Dad.

What can be observed in the lab is just the presence or absence of A or B
or both.
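Since independence says the genotype probabilities are products of allele
probabilities, the whole table is just the outer product of the allele
probability vector with itself, and its entries must sum to one.  A quick
sketch (the allele probabilities here are made-up illustrative values, not
estimates from any data):
```{r hwe.sketch}
p <- c(A = 0.3, B = 0.1, O = 0.6)  # hypothetical allele probabilities
geno <- outer(p, p)  # rows: allele from Mom, columns: allele from Dad
geno
sum(geno)  # (alpha + beta + gamma)^2 = 1
```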
Thus the observable data have the following table.

+:----------------------------:+:--------------------------:+:----------------:+:----------------:+
| type A                       | type B                     | type O           | type AB          |
+------------------------------+----------------------------+------------------+------------------+
| $\alpha^2 + 2 \alpha \gamma$ | $\beta^2 + 2 \beta \gamma$ | $\gamma^2$       | $2 \alpha \beta$ |
+------------------------------+----------------------------+------------------+------------------+

If $n_A$, $n_B$, $n_O$, and $n_{AB}$ are the observed counts of individuals
of each blood type, then the log likelihood is
\begin{equation}
l(\alpha, \beta, \gamma)
=
n_A \log(\alpha^2 + 2 \alpha \gamma)
+
n_B \log(\beta^2 + 2 \beta \gamma)
+
n_O \log(\gamma^2)
+
n_{AB} \log(2 \alpha \beta)
\end{equation}

Since probabilities must add to one, we have $\alpha + \beta + \gamma = 1$,
so there are only two freely variable parameters.  We can choose which one
to define in terms of the others.

# Data

Some data found on the web, attributed to Clarke et al. (1959, *British
Medical Journal*, 1, 603--607).
```{r data}
nA <- 186
nB <- 38
nAB <- 36
nO <- 284
```

# Coding the Log Likelihood, Try I

We are going to write a function that evaluates minus the log likelihood
because some R optimizers only minimize (not maximize).  Minimizing minus
the log likelihood is the same as maximizing the log likelihood.

We are going to start with the simplest possible R function that evaluates
the log likelihood.  This is not actually recommended, but is OK for
one-time projects producing throw-away code (no one will ever use it
again, even your future self).  For the Right Thing (TRT), see
[Section 9 below](#fixing-up-our-r-function).

R optimizers expect the argument of the function to optimize to be a
vector.  So the argument of our function will be
$\theta = (\alpha, \beta)$.  Recall that $\gamma = 1 - \alpha - \beta$.
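Before coding it up, we can sanity-check that the four cell probabilities
appearing in the log likelihood sum to one for any valid $(\alpha, \beta)$,
since their sum is $(\alpha + \beta + \gamma)^2$.  A sketch with made-up
values (not estimates from the data):
```{r cell.prob.check}
alpha <- 0.3
beta <- 0.1  # hypothetical values
gamma <- 1 - alpha - beta
(alpha^2 + 2 * alpha * gamma) + (beta^2 + 2 * beta * gamma) +
    gamma^2 + 2 * alpha * beta  # equals one
```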
```{r mlogl.try.i}
mlogl <- function(theta) {
    alpha <- theta[1]
    beta <- theta[2]
    gamma <- 1 - alpha - beta
    - (nA * log(alpha^2 + 2 * alpha * gamma) +
        nB * log(beta^2 + 2 * beta * gamma) +
        nO * log(gamma^2) + nAB * log(2 * alpha * beta))
}
```

# Starting Values for Likelihood Maximization

Likelihood theory says we find the efficient likelihood estimator (ELE) if
we start at a good starting place (so-called root-$n$-consistent estimates)
and go uphill from there.  Method of moments estimators (which set
expectations or probabilities equal to their observed values in the data)
are always root-$n$-consistent.

The obvious expectations are those of the cell counts.  These give four
possible equations to solve for two unknowns.
\begin{align*}
\frac{n_A}{n_A + n_B + n_O + n_{AB}} & = \alpha^2 + 2 \alpha \gamma \\
\frac{n_B}{n_A + n_B + n_O + n_{AB}} & = \beta^2 + 2 \beta \gamma \\
\frac{n_O}{n_A + n_B + n_O + n_{AB}} & = \gamma^2 \\
\frac{n_{AB}}{n_A + n_B + n_O + n_{AB}} & = 2 \alpha \beta
\end{align*}

We start by using the third of these to estimate $\gamma$.
```{r gamma.twiddle}
gamma.twiddle <- sqrt(nO / (nA + nB + nO + nAB))
gamma.twiddle
```

Now with $\gamma$ estimated, we can use the first two equations to estimate
$\alpha$ and $\beta$.  Let the left-hand sides of those equations be
denoted $p_A$ and $p_B$.  Then we have the quadratic equation
$$
   \alpha^2 + 2 \alpha \gamma - p_A = 0
$$
which has solution
\begin{align*}
   \alpha
   & =
   \frac{- 2 \gamma \pm \sqrt{4 \gamma^2 + 4 p_A}}{2}
   \\
   & =
   - \gamma \pm \sqrt{\gamma^2 + p_A}
\end{align*}
Because $\sqrt{\gamma^2 + p_A}$ is greater than $\gamma$, choosing the $-$
in the $\pm$ gives a negative estimate of the probability, which makes no
sense.
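As a quick numerical sanity check that the $+$ root does solve the
quadratic (the values of $\gamma$ and $p_A$ here are made up for
illustration, not the estimates):
```{r quadratic.root.check}
gam <- 0.5
pA.test <- 0.4  # hypothetical values
alpha.root <- - gam + sqrt(gam^2 + pA.test)
alpha.root^2 + 2 * alpha.root * gam - pA.test  # zero up to rounding error
```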
Hence our estimates are
```{r other.twiddle}
pA <- nA / (nA + nB + nO + nAB)
pB <- nB / (nA + nB + nO + nAB)
alpha.twiddle <- (- gamma.twiddle + sqrt(gamma.twiddle^2 + pA))
beta.twiddle <- (- gamma.twiddle + sqrt(gamma.twiddle^2 + pB))
alpha.twiddle
beta.twiddle
```

We check whether these sum to one
```{r twiddle.sum.to.one}
alpha.twiddle + beta.twiddle + gamma.twiddle
```
They don't.  There is nothing in the method of moments that forces
estimators to be reasonable in this way.  Of course, since these are
consistent estimators, this sum would get closer and closer to one as the
sample size goes to infinity.  But it does not have to be exactly one for
any finite sample size.

Anyway, these are just starting values for optimization.  They are good
enough (so says the "usual" theory of maximum likelihood estimation).
```{r theta.twiddle}
theta.start <- c(alpha.twiddle, beta.twiddle)
```

We will later learn that in a regular full exponential family statistical
model we do not need good starting points,
[any starting point will do](expfam.html#thm:om-mani-padme-hum).
And we will also learn that the
[multinomial family of distributions is a regular full exponential family](expfam.html#the-multinomial-distribution).
But this model is not the full multinomial model, which has no restrictions
on the cell probabilities.  It is called a *curved exponential family*
because it is a smooth submodel of a regular full exponential family but
is not itself one.  And as such, it does not satisfy the conditions of the
theorem that says any starting point will do.  Here we need good starting
points (like we did with the Cauchy examples in the
[other likelihood notes](likelihood.html)).

# Maximum Likelihood Estimation

We will use R function `nlm` for the optimization.
```{r nlm.try.i, error=TRUE}
nout <- nlm(mlogl, theta.start)
nout$code <= 2  # should be TRUE for solution
nout$estimate
```

R function `nlm` claims it has found a solution even though there were
warnings.
The issue is that `nlm` does not know that the parameters have to be
nonnegative and sum to less than or equal to one.  So it steps outside the
parameter space, and `NaN` values of our R function `mlogl` result.
Eventually, it gets back into the parameter space and finds the solution.

But we don't like the warnings, so we redo, starting at the putative
solution.
```{r nlm.try.i.too, error=TRUE}
nout <- nlm(mlogl, nout$estimate)
nout$code <= 2  # should be TRUE for solution
nout$estimate
theta.start
```
Now no warnings, and no change in the estimates.  It appears the code
worked the first time and we didn't need the redo.

We repeat showing the value of `theta.start` to show that the optimization
did something.  The maximum likelihood estimate (MLE, `nout$estimate`) is
different from the method of moments estimate (MOME, `theta.start`).  And
we know from theory that the MLE is a better estimator than any other
(asymptotically, for large sample sizes), hence, in particular, better
than the MOME.

So here are the MLE and MOME
```{r theta.hat}
theta.hat <- c(nout$estimate, 1 - sum(nout$estimate))
names(theta.hat) <- c("alpha", "beta", "gamma")
theta.hat
theta.twiddle <- c(alpha.twiddle, beta.twiddle, gamma.twiddle)
names(theta.twiddle) <- c("alpha", "beta", "gamma")
theta.twiddle
```
Again, theory says the MLE is better than the MOME.  We show both only to
show that they are different.

# Likelihood Inference

## Wald Intervals

We need Fisher information.  Here we will use observed Fisher information
and let the computer do it for us.  We will need to redo the optimization
because we forgot to ask the computer to calculate second derivatives.
```{r nlm.try.i.hessian, error=TRUE}
nout <- nlm(mlogl, nout$estimate, hessian = TRUE)
nout$code <= 2  # should be TRUE for solution
nout$estimate
nout$hessian
```
`nout$hessian` is the observed Fisher information matrix.
It is minus the second derivative matrix of the log likelihood evaluated
at the MLE (no minus sign is needed in the code because our function
already evaluates minus the log likelihood).

As an alternative, we could use R package `numDeriv`, which calculates
second derivative (Hessian) matrices using a more accurate algorithm.
```{r numDeriv.hessian}
library(numDeriv)
hessian(mlogl, nout$estimate)
```
We'll use this one, but the other is OK too.
```{r fisher.info.alpha.beta}
fisher <- hessian(mlogl, nout$estimate)
```

The asymptotic (large sample approximation) variance of the parameter
vector estimate is inverse Fisher information.  R function `solve` inverts
matrices.
```{r inverse.fisher}
fisher.inverse <- solve(fisher)
fisher.inverse
```

Thus 95% Wald confidence intervals for $\alpha$ and $\beta$ are given by
```{r wald}
conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
crit
theta.hat[1] + c(-1, 1) * crit * sqrt(fisher.inverse[1, 1])
theta.hat[2] + c(-1, 1) * crit * sqrt(fisher.inverse[2, 2])
```

To calculate a confidence interval for $\gamma$ we use the fact, proved in
theory classes and
[discussed in the notes that accompanied Chapter 1 in Agresti](ch1.html#delta-method),
that if $a$ is a nonrandom vector, $B$ is a nonrandom matrix, and $X$ is a
random vector having dimensions such that $a + B X$ makes sense, then
$$
   \text{var}(a + B X) = B \text{var}(X) B^T
$$

What we want here are the $a$ and $B$ that turn the estimate of
$(\alpha, \beta)$ into the estimate of $(\alpha, \beta, \gamma)$.  That is
\begin{align*}
   a & = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
   \\
   B & = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{pmatrix}
\end{align*}
Let's check.
```{r check.linear}
a <- c(0, 0, 1)
b <- matrix(c(1, 0, -1, 0, 1, -1), nrow = 3)
a
b
as.numeric(a + b %*% nout$estimate)
theta.hat
```

So
```{r assvar}
asymp.var.theta.three.dimensional <- b %*% fisher.inverse %*% t(b)
asymp.var.theta.three.dimensional
```

So our 95% confidence interval for $\gamma$ is
```{r wald.gamma}
theta.hat[3] + c(-1, 1) * crit *
    sqrt(asymp.var.theta.three.dimensional[3, 3])
```

## Likelihood Intervals

Now our critical value is based on the chi-square distribution with one
degree of freedom (one because we are doing confidence intervals for one
parameter at a time and *are not trying for simultaneous coverage*).
```{r crit.like}
crit <- qchisq(conf.level, df = 1)
crit
sqrt(crit)
```

To find the likelihood-based confidence interval we maximize and minimize
the parameter of interest subject to the constraint that twice the drop in
log likelihood from its maximum is no more than `crit`.
```{r likelihood.based}
library(alabama)
# alpha lower
confun <- function(theta) crit - 2 * (mlogl(theta) - nout$minimum)
aout <- auglag(nout$estimate, fn = function(theta) theta[1], hin = confun)
aout$convergence == 0
aout$value
```

All of that blather was most annoying.  Tell it to STFU.
```{r likelihood.based.ii}
# alpha lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun))
aout$convergence == 0
aout$value
```

Need to tell it STFU more forcefully.
```{r likelihood.based.iii}
# alpha lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
aout$value
```

Great!  Now for the rest.
```{r likelihood.based.iv}
# alpha lower
alpha.lower <- aout$value
# alpha upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
aout$convergence == 0
aout$value
alpha.upper <- aout$value
```
The `control.optim = list(fnscale = -1)` is what tells R function `auglag`
to maximize rather than minimize.  This was particularly hard to find in
the documentation because it is on the help page for R function `optim`,
which is referred to from the help page for R function `auglag`.

And so forth.
```{r likelihood.based.v}
# beta lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
aout$value
beta.lower <- aout$value
# beta upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
aout$convergence == 0
aout$value
beta.upper <- aout$value
# gamma lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
aout$value
gamma.lower <- aout$value
# gamma upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
aout$convergence == 0
aout$value
gamma.upper <- aout$value
```

## Rao (Score) Intervals

These turned out to be just too horrible to do.  If we try to derive the
Rao pivotal quantity analytically, the calculus is horribly messy.  If we
try to let the computer do the calculus using R package `numDeriv`, that
doesn't work either, because R function `auglag` in R package `alabama`
needs derivatives of the nonlinear constraint function, which it tries to
calculate by finite-difference approximation.
So that means we either need everything done analytically, or we need to
calculate, by finite-difference approximation, derivatives of functions
that are themselves defined by finite-difference approximations.  And the
latter does not work.  Computer arithmetic is too sloppy.

However much one may like Rao intervals for simple problems, they are just
not worth the effort in complicated problems.  And this problem is already
too complicated for them, even though it only involves a contingency table
with four cells and a curved exponential family with two parameters.

## Summary

### Wald Intervals {#summary-wald-nonsimultaneous}

```{r wald.interval.summary, echo=FALSE}
# have to redefine because have clobbered above
crit <- qnorm((1 + conf.level) / 2)
foompter <- cbind(
    theta.hat - crit * sqrt(diag(asymp.var.theta.three.dimensional)),
    theta.hat + crit * sqrt(diag(asymp.var.theta.three.dimensional)))
colnames(foompter) <- c("lower", "upper")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

### Likelihood Intervals {#summary-likelihood-nonsimultaneous}

```{r likelihood.interval.summary, echo=FALSE}
foompter <- rbind(c(alpha.lower, alpha.upper),
    c(beta.lower, beta.upper), c(gamma.lower, gamma.upper))
colnames(foompter) <- c("lower", "upper")
rownames(foompter) <- c("alpha", "beta", "gamma")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

These intervals are **not** corrected for simultaneous coverage.  See
[Section 10 below](#simultaneous-coverage) for that.

# Fixing Up Our R Function

## Error Messages

The first improvement to make to any function is to make it follow the
policy of "garbage in, error messages out".  And we want the error
messages to be understandable and informative.
So
```{r mlogl.try.ii}
mlogl <- function(theta) {
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    stopifnot(length(theta) == 2)
    alpha <- theta[1]
    beta <- theta[2]
    gamma <- 1 - alpha - beta
    - (nA * log(alpha^2 + 2 * alpha * gamma) +
        nB * log(beta^2 + 2 * beta * gamma) +
        nO * log(gamma^2) + nAB * log(2 * alpha * beta))
}
```
We should check everything we reasonably can about our arguments.

One might think we should also add
```
stopifnot(theta >= 0)
stopifnot(sum(theta) <= 1)
```
But, as we saw, R function `nlm` will call the function supplied to it
(which is our function `mlogl` in this handout) with parameter values
outside the parameter space (because there is no way to tell `nlm` what
the parameter space is).  So we do not check these, but rather just
suppress the warnings, as shown above, if we don't want to see them.

We had a similar problem with R function `auglag`, because its constraint
function (argument `hin`) also calls our function `mlogl`, and `auglag`
also does not know what the allowed parameter values are.

## Using a Factory Function

Global variables are evil, as every course on computer programming says.
If the project is one-off and not to be (ever, under any circumstances)
used as an example (even for one's own future work), then there is nothing
wrong with global variables.  But code tends to be re-used, even if the
re-use is not planned.  So avoid global variables.

To do that we don't rewrite our function `mlogl`.  Instead we write an R
function to write it.  A function that makes functions is called a
*function factory*.  So that is what we are using.

R functions remember the environment in which they were created.  That is
how our current version of the function finds the variables `nA`, `nB`,
`nO`, and `nAB`.  It was created in the R global environment, and that is
where those variables are.  If one clobbers or removes those variables,
our function will not work.
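The environment lookup that makes this work can be seen in a tiny example.
This is just an illustration (the names `make.adder` and `add3` are made
up, not part of the analysis):
```{r closure.sketch}
make.adder <- function(k) function(x) x + k  # k lives in the factory's environment
add3 <- make.adder(3)
add3(10)  # returns 13
environment(add3)$k  # the 3 is stored in the returned function's environment
```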
When we use a function factory and put those variables in the execution
environment of the function factory, they are where users do not think to
mess with them.  Most users don't even know functions have environments.

So here is how that looks.
```{r factory}
mlogl.factory <- function(nA, nB, nO, nAB) {
    stopifnot(is.numeric(nA))
    stopifnot(is.finite(nA))
    stopifnot(round(nA) == nA)  # checks integer value
    stopifnot(nA >= 0)
    stopifnot(is.numeric(nB))
    stopifnot(is.finite(nB))
    stopifnot(round(nB) == nB)  # checks integer value
    stopifnot(nB >= 0)
    stopifnot(is.numeric(nO))
    stopifnot(is.finite(nO))
    stopifnot(round(nO) == nO)  # checks integer value
    stopifnot(nO >= 0)
    stopifnot(is.numeric(nAB))
    stopifnot(is.finite(nAB))
    stopifnot(round(nAB) == nAB)  # checks integer value
    stopifnot(nAB >= 0)
    function(theta) {
        stopifnot(is.numeric(theta))
        stopifnot(is.finite(theta))
        stopifnot(length(theta) == 2)
        alpha <- theta[1]
        beta <- theta[2]
        gamma <- 1 - alpha - beta
        - (nA * log(alpha^2 + 2 * alpha * gamma) +
            nB * log(beta^2 + 2 * beta * gamma) +
            nO * log(gamma^2) + nAB * log(2 * alpha * beta))
    }
}
```
An R function returns the value of the last expression if there is no
explicit call to R function `return`.  The last expression of
`mlogl.factory` is the definition of a function that is identical to our
function `mlogl` copied from the code above.

Now we use the factory to make a new `mlogl`.
```{r make.mlogl}
mlogl <- mlogl.factory(nA, nB, nO, nAB)
```

And now we don't need the global variables.
```{r rm}
rm(nA, nB, nO, nAB)
```

We check that our new function `mlogl` works just like the old one.
```{r nlm.try.ii, error=TRUE}
nout.estimate.save <- nout$estimate
nout <- suppressWarnings(nlm(mlogl, theta.start))
nout$code <= 2  # should be TRUE for solution
all.equal(nout$estimate, nout.estimate.save)
```

```{r likelihood.based.vi}
# have to redefine because have clobbered above
# global variables are evil
crit <- qchisq(conf.level, df = 1)
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
all.equal(aout$value, alpha.lower)
```

How does this work?  If we removed the data, where are they?
```{r where.is.waldo}
ls(envir = environment(mlogl))
```

# Simultaneous Coverage

## Confidence Regions

A confidence interval is a random interval that covers the true unknown
(scalar) parameter value with a specified probability.  A confidence
*region* is a random *region* that covers the true unknown (vector)
parameter value with a specified probability.

Confidence regions are not much used, not even mentioned in intro stats
books.  We are primarily interested in them as a tool for constructing
simultaneous confidence intervals.  But we develop the general theory here.

Like confidence intervals, confidence regions can be constructed by
inverting hypothesis tests.  The confidence region is the set of
$\theta_0$ that are not rejected by the hypothesis test with hypotheses
\begin{align*}
   H_0 & : \theta = \theta_0
   \\
   H_1 & : \theta \neq \theta_0
\end{align*}

Let $c$ be the appropriate chi-square critical value (the degrees of
freedom is the dimension of the parameter space).  Then inverting the
likelihood ratio test gives the confidence region
$$
   \{\, \theta : 2 [ l(\hat{\theta}) - l(\theta) ] \le c \,\}
$$
where $\hat{\theta}$ is the MLE for the alternative hypothesis (the
unrestricted MLE).
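In code, testing whether a given $\theta$ lies in this region is one line,
reusing `mlogl` and `nout` from above (this helper function is just for
illustration, not used in the rest of the analysis):
```{r region.membership}
# 2 * (l(theta.hat) - l(theta)) = 2 * (mlogl(theta) - mlogl(theta.hat))
in.region <- function(theta, crit) 2 * (mlogl(theta) - nout$minimum) <= crit
in.region(nout$estimate, qchisq(0.95, df = 2))  # the MLE is always in the region
```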
In the handout titled "Likelihood Inference" the formula for the Wald test
statistic using observed Fisher information is
$$
   W_n' = g(\hat{\theta}_n)^T
   \bigl[ \nabla g(\hat{\theta}_n) J_n(\hat{\theta}_n)^{-1}
   \bigl( \nabla g(\hat{\theta}_n) \bigr)^T \bigr]^{-1}
   g(\hat{\theta}_n)
$$
Here we use the special case where $g(\theta) = \theta - \theta_0$, the
first derivative of which is the identity matrix.  Then we have
$$
   W_n' = (\hat{\theta}_n - \theta_0)^T J_n(\hat{\theta}_n)
   (\hat{\theta}_n - \theta_0)
$$
and the Wald confidence region is
$$
   \{\, \theta : (\hat{\theta}_n - \theta)^T J_n(\hat{\theta}_n)
   (\hat{\theta}_n - \theta) \le c \,\}
$$
We have to solve a multivariate quadratic equation to find the boundary of
the region.

## Simultaneous Confidence Intervals from Confidence Regions

If $R$ is a confidence region and $g$ is any scalar-valued function
whatsoever, then
$$
   \inf_{\theta \in R} g(\theta)
   \le g(\theta) \le
   \sup_{\theta \in R} g(\theta)
$$
is a confidence interval for the parameter $g(\theta)$.  Moreover, its
coverage probability is at least the coverage probability for $R$, and
this holds simultaneously for all possible functions $g$.  This is how we
are going to derive simultaneous confidence intervals here.

## Likelihood Intervals

To fix up our likelihood intervals for simultaneous coverage, all we need
is to use a critical value for two degrees of freedom.  Everything else is
the same.

Why two?  Because there are two independently adjustable parameters.  The
degrees of freedom for simultaneous coverage is always the number of
independently adjustable parameters.

We won't show the work, just the result.  First do
```{r crit.again}
crit
crit <- qchisq(conf.level, df = 2)
crit
```
and then redo the likelihood intervals shown above (work not shown, but it
is in the source (Rmd) file).
```{r simultaneous.likelihood, echo=FALSE}
# alpha lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE)))
stopifnot(aout$convergence == 0)
alpha.lower <- aout$value
# alpha upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
stopifnot(aout$convergence == 0)
alpha.upper <- aout$value
# beta lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE)))
stopifnot(aout$convergence == 0)
beta.lower <- aout$value
# beta upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
stopifnot(aout$convergence == 0)
beta.upper <- aout$value
# gamma lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE)))
stopifnot(aout$convergence == 0)
gamma.lower <- aout$value
# gamma upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
stopifnot(aout$convergence == 0)
gamma.upper <- aout$value
# printout
foompter <- rbind(c(alpha.lower, alpha.upper),
    c(beta.lower, beta.upper), c(gamma.lower, gamma.upper))
colnames(foompter) <- c("lower", "upper")
rownames(foompter) <- c("alpha", "beta", "gamma")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

Comparing with
[Section 8.3.2 above](#summary-likelihood-nonsimultaneous)
we see the intervals aren't actually that much wider.

## Wald Intervals

The same trick works for Wald intervals.  This is similar to what is
called Scheffé's method for linear models, but the same idea works in
general.
We just use the same critical value as for the likelihood intervals,
except that for Wald intervals we need to take its square root.

All along, the critical value we have been using for Wald intervals has
been the square root of the critical value for the likelihood intervals;
we just didn't note that explicitly.  A chi-square random variable with
one degree of freedom is the square of a standard normal random variable,
and, more generally, a chi-square random variable with $k$ degrees of
freedom is the sum of squares of $k$ independent standard normal random
variables.

So we do
```{r crit.too.too}
crit <- sqrt(crit)
crit
```
and redo all of our Wald interval calculations (again work not shown but
in the Rmd file).
```{r simultaneous.wald, echo=FALSE}
foompter <- cbind(
    theta.hat - crit * sqrt(diag(asymp.var.theta.three.dimensional)),
    theta.hat + crit * sqrt(diag(asymp.var.theta.three.dimensional)))
colnames(foompter) <- c("lower", "upper")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

Comparing with
[Section 8.3.1 above](#summary-wald-nonsimultaneous)
we see the intervals aren't actually that much wider.
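The relation between the normal and chi-square critical values asserted
above is easy to check directly (this chunk is just a sanity check, not
part of the analysis):
```{r crit.relation.check}
conf.level <- 0.95
all.equal(qnorm((1 + conf.level) / 2)^2, qchisq(conf.level, df = 1))
```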