--- title: "Stat 5421 Lecture Notes: Likelihood Computation" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML pdf_document: latex_engine: xelatex number_sections: true md_extensions: -tex_math_single_backslash --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. * The version of the `knitr` package used to make this document is `r packageVersion("knitr")`. * The version of the `numDeriv` package used to make this document is `r packageVersion("numDeriv")`. * The version of the `alabama` package used to make this document is `r packageVersion("alabama")`. # Statistical Model As an example, we are going to use the ABO blood group system. As you all know, you can have blood type A, B, AB, or O. And this governs what blood types you can be safely transfused with. What this is about is red blood cell surface antigens, which are proteins that stick out from the surface of red blood cells and can be recognized by antibodies (antigens are things recognized by antibodies) causing an immune response if antigens different from one's own are introduced by blood transfusion (or any other way). Genetically, these proteins are coded for by DNA and each person has two copies of these genes, one from Mom and one from Dad. And each such gene can be in one of three states (alleles): code for the A protein, code for the B protein, or code for no functional protein (that's the type O). Let the probability of the A, B, and O alleles be $\alpha$, $\beta$, and $\gamma$, respectively. 
We are now going to assume that the genes inherited from the two parents
are independent.  This is called Hardy-Weinberg equilibrium in genetics.
It is exactly true for random mating populations, and approximately true
in populations that have had large population sizes for many generations.
Whether or not it holds for any particular population, we are going to
assume it.

The Hardy-Weinberg assumption (statistical independence) says probabilities
multiply.  This gives us the following table.

+:-:+:---------------:+:--------------:+:---------------:+
|   | A               | B              | O               |
+---+-----------------+----------------+-----------------+
| A | $\alpha^2$      | $\alpha \beta$ | $\alpha \gamma$ |
+---+-----------------+----------------+-----------------+
| B | $\beta \alpha$  | $\beta^2$      | $\beta \gamma$  |
+---+-----------------+----------------+-----------------+
| O | $\gamma \alpha$ | $\gamma \beta$ | $\gamma^2$      |
+---+-----------------+----------------+-----------------+

The row labels are genes inherited from Mom, the column labels genes
inherited from Dad.

What can be observed in the lab is just the presence or absence of A or B
or both.
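Since independence says the genotype probabilities are products of allele
probabilities, the whole table is just the outer product of the allele
probability vector with itself, and its entries must sum to one.  A quick
sketch (the allele probabilities here are made-up illustrative values, not
estimates from any data):
```{r hwe.sketch}
p <- c(A = 0.3, B = 0.1, O = 0.6)  # hypothetical allele probabilities
geno <- outer(p, p)  # rows: allele from Mom, columns: allele from Dad
geno
sum(geno)  # (alpha + beta + gamma)^2 = 1
```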
Thus the observable data have the following table.

+:----------------------------:+:--------------------------:+:----------------:+:----------------:+
| type A                       | type B                     | type O           | type AB          |
+------------------------------+----------------------------+------------------+------------------+
| $\alpha^2 + 2 \alpha \gamma$ | $\beta^2 + 2 \beta \gamma$ | $\gamma^2$       | $2 \alpha \beta$ |
+------------------------------+----------------------------+------------------+------------------+

If $n_A$, $n_B$, $n_O$, and $n_{AB}$ are the observed counts of individuals
of each blood type, then the log likelihood is
\begin{equation}
l(\alpha, \beta, \gamma)
=
n_A \log(\alpha^2 + 2 \alpha \gamma)
+
n_B \log(\beta^2 + 2 \beta \gamma)
+
n_O \log(\gamma^2)
+
n_{AB} \log(2 \alpha \beta)
\end{equation}

Since probabilities must add to one, we have $\alpha + \beta + \gamma = 1$,
so there are only two freely variable parameters.  We can choose which one
to define in terms of the others.

# Data

Some data found on the web, attributed to Clarke et al. (1959, *British
Medical Journal*, 1, 603--607).
```{r data}
nA <- 186
nB <- 38
nAB <- 36
nO <- 284
```

# Coding the Log Likelihood, Try I

We are going to write a function that evaluates minus the log likelihood
because some R optimizers only minimize (not maximize).  Minimizing minus
the log likelihood is the same as maximizing the log likelihood.

We are going to start with the simplest possible R function that evaluates
the log likelihood.  This is not actually recommended, but is OK for
one-time projects producing throw-away code (no one will ever use it
again, even your future self).  For the Right Thing (TRT), see
[Section 9 below](#fixing-up-our-r-function).

R optimizers expect the argument of the function to optimize to be a
vector.  So the argument of our function will be
$\theta = (\alpha, \beta)$.  Recall that $\gamma = 1 - \alpha - \beta$.
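Before coding it up, we can sanity-check that the four cell probabilities
appearing in the log likelihood sum to one for any valid $(\alpha, \beta)$,
since their sum is $(\alpha + \beta + \gamma)^2$.  A sketch with made-up
values (not estimates from the data):
```{r cell.prob.check}
alpha <- 0.3
beta <- 0.1  # hypothetical values
gamma <- 1 - alpha - beta
(alpha^2 + 2 * alpha * gamma) + (beta^2 + 2 * beta * gamma) +
    gamma^2 + 2 * alpha * beta  # equals one
```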
```{r mlogl.try.i}
mlogl <- function(theta) {
    alpha <- theta[1]
    beta <- theta[2]
    gamma <- 1 - alpha - beta
    - (nA * log(alpha^2 + 2 * alpha * gamma) +
        nB * log(beta^2 + 2 * beta * gamma) +
        nO * log(gamma^2) + nAB * log(2 * alpha * beta))
}
```

# Starting Values for Likelihood Maximization

Likelihood theory says we find the efficient likelihood estimator (ELE) if
we start at a good starting place (so-called root-$n$-consistent estimates)
and go uphill from there.  Method of moments estimators (which set
expectations or probabilities equal to their observed values in the data)
are always root-$n$-consistent.

The obvious expectations are those of the cell counts.  These give four
possible equations to solve for two unknowns.
\begin{align*}
\frac{n_A}{n_A + n_B + n_O + n_{AB}} & = \alpha^2 + 2 \alpha \gamma \\
\frac{n_B}{n_A + n_B + n_O + n_{AB}} & = \beta^2 + 2 \beta \gamma \\
\frac{n_O}{n_A + n_B + n_O + n_{AB}} & = \gamma^2 \\
\frac{n_{AB}}{n_A + n_B + n_O + n_{AB}} & = 2 \alpha \beta
\end{align*}

We start by using the third of these to estimate $\gamma$.
```{r gamma.twiddle}
gamma.twiddle <- sqrt(nO / (nA + nB + nO + nAB))
gamma.twiddle
```

Now with $\gamma$ estimated, we can use the first two equations to estimate
$\alpha$ and $\beta$.  Let the left-hand sides of those equations be
denoted $p_A$ and $p_B$.  Then we have the quadratic equation
$$
   \alpha^2 + 2 \alpha \gamma - p_A = 0
$$
which has solution
\begin{align*}
   \alpha
   & =
   \frac{- 2 \gamma \pm \sqrt{4 \gamma^2 + 4 p_A}}{2}
   \\
   & =
   - \gamma \pm \sqrt{\gamma^2 + p_A}
\end{align*}
Because $\sqrt{\gamma^2 + p_A}$ is greater than $\gamma$, choosing the $-$
in the $\pm$ gives a negative estimate of the probability, which makes no
sense.
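As a quick numerical sanity check that the $+$ root does solve the
quadratic (the values of $\gamma$ and $p_A$ here are made up for
illustration, not the estimates):
```{r quadratic.root.check}
gam <- 0.5
pA.test <- 0.4  # hypothetical values
alpha.root <- - gam + sqrt(gam^2 + pA.test)
alpha.root^2 + 2 * alpha.root * gam - pA.test  # zero up to rounding error
```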
Hence our estimates are
```{r other.twiddle}
pA <- nA / (nA + nB + nO + nAB)
pB <- nB / (nA + nB + nO + nAB)
alpha.twiddle <- (- gamma.twiddle + sqrt(gamma.twiddle^2 + pA))
beta.twiddle <- (- gamma.twiddle + sqrt(gamma.twiddle^2 + pB))
alpha.twiddle
beta.twiddle
```

We check whether these sum to one
```{r twiddle.sum.to.one}
alpha.twiddle + beta.twiddle + gamma.twiddle
```
They don't.  There is nothing in the method of moments that forces
estimators to be reasonable in this way.  Of course, since these are
consistent estimators, this sum would get closer and closer to one as the
sample size goes to infinity.  But it does not have to be exactly one for
any finite sample size.

Anyway, these are just starting values for optimization.  They are good
enough (so says the "usual" theory of maximum likelihood estimation).
```{r theta.twiddle}
theta.start <- c(alpha.twiddle, beta.twiddle)
```

We will later learn that in a regular full exponential family statistical
model we do not need good starting points,
[any starting point will do](expfam.html#thm:om-mani-padme-hum).
And we will also learn that the
[multinomial family of distributions is a regular full exponential family](expfam.html#the-multinomial-distribution).
But this model is not the full multinomial model, which has no restrictions
on the cell probabilities.  It is called a *curved exponential family*
because it is a smooth submodel of a regular full exponential family but
is not itself one.  And as such, it does not satisfy the conditions of the
theorem that says any starting point will do.  Here we need good starting
points (like we did with the Cauchy examples in the
[other likelihood notes](likelihood.html)).

# Maximum Likelihood Estimation

We will use R function `nlm` for the optimization.
```{r nlm.try.i, error=TRUE}
nout <- nlm(mlogl, theta.start)
nout$code <= 2  # should be TRUE for solution
nout$estimate
```

R function `nlm` claims it has found a solution even though there were
warnings.
The issue is that `nlm` does not know that the parameters have to be
nonnegative and sum to less than or equal to one.  So it steps outside the
parameter space, and `NaN` values of our R function `mlogl` result.
Eventually, it gets back into the parameter space and finds the solution.

But we don't like the warnings, so we redo, starting at the putative
solution.
```{r nlm.try.i.too, error=TRUE}
nout <- nlm(mlogl, nout$estimate)
nout$code <= 2  # should be TRUE for solution
nout$estimate
theta.start
```
Now no warnings, and no change in the estimates.  It appears the code
worked the first time and we didn't need the redo.

We repeat showing the value of `theta.start` to show that the optimization
did something.  The maximum likelihood estimate (MLE, `nout$estimate`) is
different from the method of moments estimate (MOME, `theta.start`).  And
we know from theory that the MLE is a better estimator than any other
(asymptotically, for large sample sizes), hence, in particular, better
than the MOME.

So here are the MLE and MOME
```{r theta.hat}
theta.hat <- c(nout$estimate, 1 - sum(nout$estimate))
names(theta.hat) <- c("alpha", "beta", "gamma")
theta.hat
theta.twiddle <- c(alpha.twiddle, beta.twiddle, gamma.twiddle)
names(theta.twiddle) <- c("alpha", "beta", "gamma")
theta.twiddle
```
Again, theory says the MLE is better than the MOME.  We show both only to
show that they are different.

# Likelihood Inference

## Wald Intervals

We need Fisher information.  Here we will use observed Fisher information
and let the computer do it for us.  We will need to redo the optimization
because we forgot to ask the computer to calculate second derivatives.
```{r nlm.try.i.hessian, error=TRUE}
nout <- nlm(mlogl, nout$estimate, hessian = TRUE)
nout$code <= 2  # should be TRUE for solution
nout$estimate
nout$hessian
```
`nout$hessian` is the observed Fisher information matrix.
It is minus the second derivative matrix of the log likelihood evaluated
at the MLE (no minus sign is needed in the code because our function
already evaluates minus the log likelihood).

As an alternative, we could use R package `numDeriv`, which calculates
second derivative (Hessian) matrices using a more accurate algorithm.
```{r numDeriv.hessian}
library(numDeriv)
hessian(mlogl, nout$estimate)
```
We'll use this one, but the other is OK too.
```{r fisher.info.alpha.beta}
fisher <- hessian(mlogl, nout$estimate)
```

The asymptotic (large sample approximation) variance of the parameter
vector estimate is inverse Fisher information.  R function `solve` inverts
matrices.
```{r inverse.fisher}
fisher.inverse <- solve(fisher)
fisher.inverse
```

Thus 95% Wald confidence intervals for $\alpha$ and $\beta$ are given by
```{r wald}
conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
crit
theta.hat[1] + c(-1, 1) * crit * sqrt(fisher.inverse[1, 1])
theta.hat[2] + c(-1, 1) * crit * sqrt(fisher.inverse[2, 2])
```

To calculate a confidence interval for $\gamma$ we use the fact, proved in
theory classes and
[discussed in the notes that accompanied Chapter 1 in Agresti](ch1.html#delta-method),
that if $a$ is a nonrandom vector, $B$ is a nonrandom matrix, and $X$ is a
random vector having dimensions such that $a + B X$ makes sense, then
$$
   \text{var}(a + B X) = B \text{var}(X) B^T
$$

What we want here are the $a$ and $B$ that turn the estimate of
$(\alpha, \beta)$ into the estimate of $(\alpha, \beta, \gamma)$.  That is
\begin{align*}
   a & = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
   \\
   B & = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{pmatrix}
\end{align*}
Let's check.
```{r check.linear}
a <- c(0, 0, 1)
b <- matrix(c(1, 0, -1, 0, 1, -1), nrow = 3)
a
b
as.numeric(a + b %*% nout$estimate)
theta.hat
```

So
```{r assvar}
asymp.var.theta.three.dimensional <- b %*% fisher.inverse %*% t(b)
asymp.var.theta.three.dimensional
```

So our 95% confidence interval for $\gamma$ is
```{r wald.gamma}
theta.hat[3] + c(-1, 1) * crit *
    sqrt(asymp.var.theta.three.dimensional[3, 3])
```

## Likelihood Intervals

Now our critical value is based on the chi-square distribution with one
degree of freedom (one because we are doing confidence intervals for one
parameter at a time and *are not trying for simultaneous coverage*).
```{r crit.like}
crit <- qchisq(conf.level, df = 1)
crit
sqrt(crit)
```

To find the likelihood-based confidence interval we maximize and minimize
the parameter of interest subject to the constraint that twice the drop in
log likelihood from its maximum is no more than `crit`.
```{r likelihood.based}
library(alabama)
# alpha lower
confun <- function(theta) crit - 2 * (mlogl(theta) - nout$minimum)
aout <- auglag(nout$estimate, fn = function(theta) theta[1], hin = confun)
aout$convergence == 0
aout$value
```

All of that blather was most annoying.  Tell it to STFU.
```{r likelihood.based.ii}
# alpha lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun))
aout$convergence == 0
aout$value
```

Need to tell it STFU more forcefully.
```{r likelihood.based.iii}
# alpha lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
aout$value
```

Great!  Now for the rest.
```{r likelihood.based.iv}
# alpha lower
alpha.lower <- aout$value
# alpha upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
aout$convergence == 0
aout$value
alpha.upper <- aout$value
```
The `control.optim = list(fnscale = -1)` is what tells R function `auglag`
to maximize rather than minimize.  This was particularly hard to find in
the documentation because it is on the help page for R function `optim`,
which is referred to from the help page for R function `auglag`.

And so forth.
```{r likelihood.based.v}
# beta lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
aout$value
beta.lower <- aout$value
# beta upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
aout$convergence == 0
aout$value
beta.upper <- aout$value
# gamma lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
aout$value
gamma.lower <- aout$value
# gamma upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
aout$convergence == 0
aout$value
gamma.upper <- aout$value
```

## Rao (Score) Intervals

These turned out to be just too horrible to do.  If we try to derive the
Rao pivotal quantity analytically, the calculus is horribly messy.  If we
try to let the computer do the calculus using R package `numDeriv`, that
doesn't work either, because R function `auglag` in R package `alabama`
needs derivatives of the nonlinear constraint function, which it tries to
calculate by finite-difference approximation.
So that means we either need everything done analytically, or we need to
calculate, by finite-difference approximation, derivatives of functions
that are themselves defined by finite-difference approximations.  And the
latter does not work.  Computer arithmetic is too sloppy.

However much one may like Rao intervals for simple problems, they are just
not worth the effort in complicated problems.  And this problem is already
too complicated for them, even though it only involves a contingency table
with four cells and a curved exponential family with two parameters.

## Summary

### Wald Intervals {#summary-wald-nonsimultaneous}

```{r wald.interval.summary, echo=FALSE}
# have to redefine because have clobbered above
crit <- qnorm((1 + conf.level) / 2)
foompter <- cbind(
    theta.hat - crit * sqrt(diag(asymp.var.theta.three.dimensional)),
    theta.hat + crit * sqrt(diag(asymp.var.theta.three.dimensional)))
colnames(foompter) <- c("lower", "upper")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

### Likelihood Intervals {#summary-likelihood-nonsimultaneous}

```{r likelihood.interval.summary, echo=FALSE}
foompter <- rbind(c(alpha.lower, alpha.upper),
    c(beta.lower, beta.upper), c(gamma.lower, gamma.upper))
colnames(foompter) <- c("lower", "upper")
rownames(foompter) <- c("alpha", "beta", "gamma")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

These intervals are **not** corrected for simultaneous coverage.  See
[Section 10 below](#simultaneous-coverage) for that.

# Fixing Up Our R Function

## Error Messages

The first improvement to make to any function is to make it follow the
policy of "garbage in, error messages out".  And we want the error
messages to be understandable and informative.
So
```{r mlogl.try.ii}
mlogl <- function(theta) {
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    stopifnot(length(theta) == 2)
    alpha <- theta[1]
    beta <- theta[2]
    gamma <- 1 - alpha - beta
    - (nA * log(alpha^2 + 2 * alpha * gamma) +
        nB * log(beta^2 + 2 * beta * gamma) +
        nO * log(gamma^2) + nAB * log(2 * alpha * beta))
}
```
We should check everything we reasonably can about our arguments.

One might think we should also add
```
stopifnot(theta >= 0)
stopifnot(sum(theta) <= 1)
```
But, as we saw, R function `nlm` will call the function supplied to it
(which is our function `mlogl` in this handout) with parameter values
outside the parameter space (because there is no way to tell `nlm` what
the parameter space is).  So we do not check these, but rather just
suppress the warnings, as shown above, if we don't want to see them.

We had a similar problem with R function `auglag`, because its constraint
function (argument `hin`) also calls our function `mlogl`, and `auglag`
also does not know what the allowed parameter values are.

## Using a Factory Function

Global variables are evil, as every course on computer programming says.
If the project is one-off and not to be (ever, under any circumstances)
used as an example (even for one's own future work), then there is nothing
wrong with global variables.  But code tends to be re-used, even if the
re-use is not planned.  So avoid global variables.

To do that we don't rewrite our function `mlogl`.  Instead we write an R
function to write it.  A function that makes functions is called a
*function factory*.  So that is what we are using.

R functions remember the environment in which they were created.  That is
how our current version of the function finds the variables `nA`, `nB`,
`nO`, and `nAB`.  It was created in the R global environment, and that is
where those variables are.  If one clobbers or removes those variables,
our function will not work.
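The environment lookup that makes this work can be seen in a tiny example.
This is just an illustration (the names `make.adder` and `add3` are made
up, not part of the analysis):
```{r closure.sketch}
make.adder <- function(k) function(x) x + k  # k lives in the factory's environment
add3 <- make.adder(3)
add3(10)  # returns 13
environment(add3)$k  # the 3 is stored in the returned function's environment
```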
When we use a function factory and put those variables in the execution
environment of the function factory, they are where users do not think to
mess with them.  Most users don't even know functions have environments.

So here is how that looks.
```{r factory}
mlogl.factory <- function(nA, nB, nO, nAB) {
    stopifnot(is.numeric(nA))
    stopifnot(is.finite(nA))
    stopifnot(round(nA) == nA)  # checks integer value
    stopifnot(nA >= 0)
    stopifnot(is.numeric(nB))
    stopifnot(is.finite(nB))
    stopifnot(round(nB) == nB)  # checks integer value
    stopifnot(nB >= 0)
    stopifnot(is.numeric(nO))
    stopifnot(is.finite(nO))
    stopifnot(round(nO) == nO)  # checks integer value
    stopifnot(nO >= 0)
    stopifnot(is.numeric(nAB))
    stopifnot(is.finite(nAB))
    stopifnot(round(nAB) == nAB)  # checks integer value
    stopifnot(nAB >= 0)
    function(theta) {
        stopifnot(is.numeric(theta))
        stopifnot(is.finite(theta))
        stopifnot(length(theta) == 2)
        alpha <- theta[1]
        beta <- theta[2]
        gamma <- 1 - alpha - beta
        - (nA * log(alpha^2 + 2 * alpha * gamma) +
            nB * log(beta^2 + 2 * beta * gamma) +
            nO * log(gamma^2) + nAB * log(2 * alpha * beta))
    }
}
```
An R function returns the value of the last expression if there is no
explicit call to R function `return`.  The last expression of
`mlogl.factory` is the definition of a function that is identical to our
function `mlogl` copied from the code above.

Now we use the factory to make a new `mlogl`.
```{r make.mlogl}
mlogl <- mlogl.factory(nA, nB, nO, nAB)
```

And now we don't need the global variables.
```{r rm}
rm(nA, nB, nO, nAB)
```

We check that our new function `mlogl` works just like the old one.
```{r nlm.try.ii, error=TRUE}
nout.estimate.save <- nout$estimate
nout <- suppressWarnings(nlm(mlogl, theta.start))
nout$code <= 2  # should be TRUE for solution
all.equal(nout$estimate, nout.estimate.save)
```

```{r likelihood.based.vi}
# have to redefine because have clobbered above
# global variables are evil
crit <- qchisq(conf.level, df = 1)
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE)))
aout$convergence == 0
all.equal(aout$value, alpha.lower)
```

How does this work?  If we removed the data, where are they?
```{r where.is.waldo}
ls(envir = environment(mlogl))
```

# Simultaneous Coverage

## Confidence Regions

A confidence interval is a random interval that covers the true unknown
(scalar) parameter value with a specified probability.  A confidence
*region* is a random *region* that covers the true unknown (vector)
parameter value with a specified probability.

Confidence regions are not much used, not even mentioned in intro stats
books.  We are primarily interested in them as a tool for constructing
simultaneous confidence intervals.  But we develop the general theory here.

Like confidence intervals, confidence regions can be constructed by
inverting hypothesis tests.  The confidence region is the set of
$\theta_0$ that are not rejected by the hypothesis test with hypotheses
\begin{align*}
   H_0 & : \theta = \theta_0
   \\
   H_1 & : \theta \neq \theta_0
\end{align*}

Let $c$ be the appropriate chi-square critical value (the degrees of
freedom is the dimension of the parameter space).  Then inverting the
likelihood ratio test gives the confidence region
$$
   \{\, \theta : 2 [ l(\hat{\theta}) - l(\theta) ] \le c \,\}
$$
where $\hat{\theta}$ is the MLE for the alternative hypothesis (the
unrestricted MLE).
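In code, testing whether a given $\theta$ lies in this region is one line,
reusing `mlogl` and `nout` from above (this helper function is just for
illustration, not used in the rest of the analysis):
```{r region.membership}
# 2 * (l(theta.hat) - l(theta)) = 2 * (mlogl(theta) - mlogl(theta.hat))
in.region <- function(theta, crit) 2 * (mlogl(theta) - nout$minimum) <= crit
in.region(nout$estimate, qchisq(0.95, df = 2))  # the MLE is always in the region
```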
In the handout titled "Likelihood Inference" the formula for the Wald test
statistic using observed Fisher information is
$$
   W_n' = g(\hat{\theta}_n)^T
   \bigl[ \nabla g(\hat{\theta}_n) J_n(\hat{\theta}_n)^{-1}
   \bigl( \nabla g(\hat{\theta}_n) \bigr)^T \bigr]^{-1}
   g(\hat{\theta}_n)
$$
Here we use the special case where $g(\theta) = \theta - \theta_0$, the
first derivative of which is the identity matrix.  Then we have
$$
   W_n' = (\hat{\theta}_n - \theta_0)^T J_n(\hat{\theta}_n)
   (\hat{\theta}_n - \theta_0)
$$
and the Wald confidence region is
$$
   \{\, \theta : (\hat{\theta}_n - \theta)^T J_n(\hat{\theta}_n)
   (\hat{\theta}_n - \theta) \le c \,\}
$$
We have to solve a multivariate quadratic equation to find the boundary of
the region.

## Simultaneous Confidence Intervals from Confidence Regions

If $R$ is a confidence region and $g$ is any scalar-valued function
whatsoever, then
$$
   \inf_{\theta \in R} g(\theta)
   \le g(\theta) \le
   \sup_{\theta \in R} g(\theta)
$$
is a confidence interval for the parameter $g(\theta)$.  Moreover, its
coverage probability is at least the coverage probability for $R$, and
this holds simultaneously for all possible functions $g$.  This is how we
are going to derive simultaneous confidence intervals here.

## Likelihood Intervals

To fix up our likelihood intervals for simultaneous coverage, all we need
is to use a critical value for two degrees of freedom.  Everything else is
the same.

Why two?  Because there are two independently adjustable parameters.  The
degrees of freedom for simultaneous coverage is always the number of
independently adjustable parameters.

We won't show the work, just the result.  First do
```{r crit.again}
crit
crit <- qchisq(conf.level, df = 2)
crit
```
and then redo the likelihood intervals shown above (work not shown, but it
is in the source (Rmd) file).
```{r simultaneous.likelihood, echo=FALSE}
# alpha lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE)))
stopifnot(aout$convergence == 0)
alpha.lower <- aout$value
# alpha upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[1], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
stopifnot(aout$convergence == 0)
alpha.upper <- aout$value
# beta lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE)))
stopifnot(aout$convergence == 0)
beta.lower <- aout$value
# beta upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) theta[2], hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
stopifnot(aout$convergence == 0)
beta.upper <- aout$value
# gamma lower
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE)))
stopifnot(aout$convergence == 0)
gamma.lower <- aout$value
# gamma upper
aout <- suppressWarnings(auglag(nout$estimate,
    fn = function(theta) 1 - sum(theta), hin = confun,
    control.outer = list(trace = FALSE),
    control.optim = list(fnscale = -1)))
stopifnot(aout$convergence == 0)
gamma.upper <- aout$value
# printout
foompter <- rbind(c(alpha.lower, alpha.upper),
    c(beta.lower, beta.upper), c(gamma.lower, gamma.upper))
colnames(foompter) <- c("lower", "upper")
rownames(foompter) <- c("alpha", "beta", "gamma")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

Comparing with
[Section 8.3.2 above](#summary-likelihood-nonsimultaneous)
we see the intervals aren't actually that much wider.

## Wald Intervals

The same trick works for Wald intervals.  This is similar to what is
called Scheffé's method for linear models, but the same idea works in
general.
We just use the same critical value as for the likelihood intervals,
except that for Wald intervals we need to take its square root.

All along, the critical value we have been using for Wald intervals has
been the square root of the critical value for the likelihood intervals;
we just didn't note that explicitly.  A chi-square random variable with
one degree of freedom is the square of a standard normal random variable,
and, more generally, a chi-square random variable with $k$ degrees of
freedom is the sum of squares of $k$ independent standard normal random
variables.

So we do
```{r crit.too.too}
crit <- sqrt(crit)
crit
```
and redo all of our Wald interval calculations (again work not shown but
in the Rmd file).
```{r simultaneous.wald, echo=FALSE}
foompter <- cbind(
    theta.hat - crit * sqrt(diag(asymp.var.theta.three.dimensional)),
    theta.hat + crit * sqrt(diag(asymp.var.theta.three.dimensional)))
colnames(foompter) <- c("lower", "upper")
knitr::kable(foompter, digits = c(4, 4), align = "lcc")
```

Comparing with
[Section 8.3.1 above](#summary-wald-nonsimultaneous)
we see the intervals aren't actually that much wider.
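The relation between the normal and chi-square critical values asserted
above is easy to check directly (this chunk is just a sanity check, not
part of the analysis):
```{r crit.relation.check}
conf.level <- 0.95
all.equal(qnorm((1 + conf.level) / 2)^2, qchisq(conf.level, df = 1))
```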