--- title: "Stat 5421 Lecture Notes: Completion of Exponential Families" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" nocite: | @alabama-package, @numderiv-package, @sfsmisc-package, @mass-book, @mass-package, @rmarkdown-book, @rmarkdown-package, @rmarkdown-cookbook @nnet-package glmbb-package output: bookdown::html_document2: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML toc: true toc_depth: 2 bookdown::pdf_document2: latex_engine: xelatex number_sections: true md_extensions: -tex_math_single_backslash toc: true toc_depth: 2 linkcolor: blue urlcolor: blue bibliography: expfam.bib csl: journal-of-the-royal-statistical-society.csl link-citations: true --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R ```{r libraries} library(alabama) library(rcdd) library(llmdr, lib.loc = "~/Software/llmdr/package/llmdr.Rcheck") library(sfsmisc) library(MASS) library(glmbb) library(nnet) library(CatDataAnalysis) ``` * The version of R used to make this document is `r getRversion()`. * The version of the `alabama` package used to make this document is `r packageVersion("alabama")`. * The version of the `numDeriv` package used to make this document is `r packageVersion("numDeriv")`. * The version of the `rcdd` package used to make this document is `r packageVersion("rcdd")`. * The version of the `llmdr` package used to make this document is `r packageVersion("llmdr")`. * The version of the `sfsmisc` package used to make this document is `r packageVersion("sfsmisc")`. * The version of the `MASS` package used to make this document is `r packageVersion("MASS")`. * The version of the `glmbb` package used to make this document is `r packageVersion("glmbb")`. * The version of the `nnet` package used to make this document is `r packageVersion("nnet")`. * The version of the `CatDataAnalysis` package used to make this document is `r packageVersion("CatDataAnalysis")`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. At present, R package `llmdr` is not publicly available. That is why it is loaded from a nonstandard location on the author's computer. It will be available soon. # History of Theory Maximum likelihood estimates (MLE) do not have to exist. Theoretically, the problem of when MLE exist in regular full exponential families of distributions has been completely understood for a long time. For exponential families having finite support, this theory is due to @barndorff-nielsen, Theorem 9.15 and the surrounding discussion. Finite support means there are only a finite number of possible values of the canonical statistic vector, as happens in logistic regression and log-linear models for categorical data when multinomial or product-multinomial sampling is assumed. @brown, Chapter 6, Section "Aggregate Exponential Families" extended Barndorff-Nielsen's theory to exponential families with countable support under certain regularity conditions. Countable support means we have a discrete probability distribution, and Poisson regression with log link and log-linear models for categoriccal data when Poisson sampling is assumed satisfy Brown's regularity conditions. 
@geyer-gdor reworked this theory to use the concepts

* direction of recession (DOR)

* direction of constancy (DOC)

* generic direction of recession (GDOR)

# Directions in the Canonical Parameter Space

Following @geyer-gdor we are going to be sloppy about the mathematical meaning of the word *direction*. Strictly speaking, it is the direction in which a vector points, so every positive scalar multiple of a vector points in the same direction. So a direction is not a vector. Vectors have length, directions do not. Also the zero vector does not point in any direction. But we are just going to call vectors directions, and we are going to allow the zero vector. There is a reason for this in the correspondence between DOR and normal vectors (@geyer-gdor, Theorem 3), which we do not want to discuss. Mostly this does not cause confusion. Occasionally it does lead to some clumsy language.

When the MLE does not exist in the original model (OM), this is because there is a DOR that is not a DOC (@geyer-gdor, Theorems 1, 3, and 4).

Here are the definitions of these concepts that give them their names.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOR if the log likelihood function $l$ is nondecreasing in that direction, meaning for any $\theta$ in the canonical parameter space and any positive real number $s$ we have $l(\theta) \le l(\theta + s \delta)$.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOC if the log likelihood function $l$ is constant in that direction, meaning for any $\theta$ in the canonical parameter space and any real number $s$ we have $l(\theta) = l(\theta + s \delta)$.

Every DOC is a DOR, but not vice versa.

And here are some equivalent characterizations. In them $Y$ denotes the canonical statistic vector, thought of as a random vector having some probability distribution in the family, and $y$ denotes its observed value.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOC if and only if for any real number $s$ and any $\theta$ in the canonical parameter space, $\theta$ and $\theta + s \delta$ correspond to the same distribution.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOC if and only if $\langle Y, \delta \rangle$ is a constant random variable.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOR if and only if $\langle Y, \delta \rangle \le \langle y, \delta \rangle$ with probability one.

In summary,

* $\delta$ is a DOC if and only if $\langle Y, \delta \rangle$ is constant,

* $\delta$ is a DOR if and only if $\langle y, \delta \rangle$ is extreme, the upper endpoint of the range of $\langle Y, \delta \rangle$, and

* $\delta$ is a DOC if and only if $\theta$ and $\theta + \delta$ correspond to the same distribution.

The last bullet point here says that DOC are about identifiability. If there is a DOC, then we need to drop some parameters to gain identifiability. Usually, the software can do this for us.

The other two bullet points here are about maximum likelihood. The MLE exists in the OM if and only if every DOR is a DOC (@geyer-gdor, Theorem 4). What happens when the MLE does not exist in the OM is the subject of the following section.
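Before moving on, here is a minimal sketch (with made-up data and a made-up candidate direction, shown but not evaluated when this document is built) that checks the last two bullet points by brute force for a saturated model with four independent Bernoulli components, so that the canonical statistic $Y$ has only $2^4$ possible values.

```{r doc-dor-toy-check, eval = FALSE}
yobs <- c(0, 0, 1, 1)     # observed value of the canonical statistic (made up)
delta <- c(-1, 0, 0, 1)   # made-up candidate direction
# all 2^4 possible values of the canonical statistic vector Y
ysamp <- as.matrix(expand.grid(rep(list(0:1), length(yobs))))
# DOR check: <Y, delta> <= <y, delta> for every possible Y
all(ysamp %*% delta <= sum(yobs * delta))
# DOC check: <Y, delta> is constant over all possible Y
length(unique(drop(ysamp %*% delta))) == 1
```

For this toy example the first check succeeds and the second fails, so this $\delta$ is a DOR that is not a DOC, and the MLE does not exist in that toy saturated model.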
# Limits

If we take straight line limits in the direction of a DOR $\delta$, then all of the probability goes as far as it can in that direction (@geyer-gdor, Theorem 6, but similar theorems are found in @barndorff-nielsen, @brown, and @geyer-thesis). Probability piles up on the subset of the support
\begin{equation}
   H = \{\, Y : \langle Y - y, \delta \rangle = 0 \,\}
   (\#eq:h)
\end{equation}
The probability distribution for $\theta + s \delta$ converges as $s \to \infty$ to the conditional probability distribution for $\theta$ of $Y$ given $Y \in H$. This family of conditional distributions (for all $\theta$) is an exponential family of distributions that @geyer-gdor calls a *limiting conditional model* (LCM).

It is wrong to think of distributions in the LCM only as conditional distributions or only as limit distributions. They are both.

Since the log likelihood is nondecreasing along a DOR, the log likelihood for an LCM lies on or above the log likelihood for the OM. Hence maximizing the likelihood in the LCM maximizes the likelihood in the OM too.

The canonical parameters do not converge as we take these limits. They go off to infinity. But the probability distributions converge. So the MLE *distribution* in the LCM (if one exists, more on this presently) can be thought of as the MLE.

Moreover, not only distributions converge. Under the regularity conditions of Brown, which hold for all statistical models studied in this course, moments of all orders converge (@eck-geyer, Theorem 7); for example, the mean of the distribution in the OM for parameter $\theta + s \delta$ converges to the mean of the distribution for the parameter $\theta$ in the LCM as $s \to \infty$. And the same goes for the variance and other moments.

Thus *mean value parameters* are well behaved. When we take limits to complete exponential families, means converge.

Every LCM is itself an exponential family, just a different exponential family from the OM. The LCM and the OM have the same canonical statistic and the same canonical parameter. The LCM just restricts the canonical statistic to the event $H$ described above.

If $\delta$ is a GDOR of the OM, then $\delta$ is a DOC of the LCM. This is clear from the fact that the LCM conditions on the event $Y \in H$ with $H$ given by \@ref(eq:h). This is the reason we have to allow DOC. Otherwise, we cannot talk about the OM and LCM using the same parameters.

Since the LCM is an exponential family, we can tell when the MLE exists in it. Just apply the theory described above to it. The MLE will exist in the LCM if and only if every DOR of the LCM (not OM) is a DOC of the LCM (not OM). And @geyer-gdor, Section 3.6, shows this happens whenever the DOR of the OM that we used to form the LCM was *generic*, not in any way special. This we call a *generic direction of recession* (GDOR). A more precise characterization of GDOR will be given at the end of [Section 8](#canonical-affine-submodels) below.

When we find a GDOR and its corresponding LCM, then the MLE in the LCM is the MLE *distribution* and the mean value of this distribution is the MLE *mean value*. And this MLE satisfies the observed-equals-expected property, because it is the MLE for an exponential family (the LCM). Hence the observed-equals-expected property always holds. Similarly, the sufficient dimension reduction property and maximum entropy properties always hold. It does not matter whether the MLE is found in the OM or an LCM.

One might ask, when the MLE is in an LCM, what is the statistical model?
The model cannot be the OM, because if that is so, then there is no MLE. We have to say that the model is the OM and all of its limits. @geyer-gdor calls that model the *Barndorff-Nielsen completion* of the OM.

The Barndorff-Nielsen completion has no canonical parameterization, because when you take limits canonical parameters do not converge (they go off to infinity). But it is a family of distributions (the OM and all of its LCM). Also (under the conditions of Brown) it has a mean value parameterization. Every distribution in the Barndorff-Nielsen completion has a mean vector, and mean vectors do converge as limits are taken to get LCM.

Each LCM is an exponential family that has a canonical parameterization, but the Barndorff-Nielsen completion as a whole does not have a canonical parameterization. The canonical parameterizations of the LCM do not fit together to make one parameterization.

# History of Computing

The thesis of your humble author (@geyer-thesis, Chapter 2) included an algorithm using repeated linear programming to discover DOR that are not DOC for exponential families. With software available at the time it was very clumsy but was actually used to find MLE in LCM for some very complicated statistical models [@geyer-thompson].

@geyer-gdor used various functions in R package `rcdd` [@rcdd-package] to find GDOR. This software was also somewhat clumsy to use (there was no model fitting function comparable to R function `glm`), but only a dozen or so lines of code were enough to find out whether the MLE existed in the OM or otherwise a GDOR and its LCM and the MLE in that LCM. But this code was slow on big problems. @eck-geyer have an example (dataset `bigcategorical` in R package `llmdr`), a contingency table with 1024 cells and 781 parameters, that took a little over 3 days and 4 hours of computing time to do with the methods of this paragraph.

@eck-geyer used R function `glm` to maximize the likelihood, getting canonical parameter estimates close to infinity by using small tolerances, and then looking for directions in which the Fisher information matrix is nearly singular (approximate null eigenvectors). This algorithm was fast but had problems with inexactness of computer arithmetic. So it is difficult to know exactly when an approximate null eigenvector of the Fisher information matrix is close to being a GDOR.

R package `llmdr` [@llmdr-package] uses repeated linear programming but different linear programs from those used by @geyer-thesis and @geyer-gdor. It also uses fast linear programming (R package `glpkAPI`, @glpkapi-package) that scales to large problems (it uses sparse matrix arithmetic). But `llmdr` can also use infinite-precision rational arithmetic from R packages `rcdd` [@rcdd-package] and `gmp` [@gmp-package] to prove, if requested, that its results are correct. The big categorical example of @eck-geyer, mentioned above, that took over 3 days to do with the methods of @geyer-gdor took only 150 milliseconds to do with R function `llmdr` in R package `llmdr`. If proofs were requested, they took 25 minutes (slow, but still an improvement over the methods of @geyer-gdor).

So we finally have, in this fourth generation algorithm, methods that are fast, easy, and safe enough for ordinary users to use.

# The Binomial Family of Distributions

The binomial family of distributions as usually taught has usual parameter $\pi$ and usual parameter space $0 \le \pi \le 1$.
The binomial family of distributions [thought of as an exponential family](expfam.html#the-binomial-distribution) has canonical parameter $\theta = \mathop{\rm logit}(\pi)$ and has canonical parameter space $- \infty < \theta < + \infty$, and this corresponds to usual parameter space $0 < \pi < 1$. We have lost the distributions for $\pi = 0$ and $\pi = 1$.

The reason why these distributions cannot be in the exponential family is [all distributions in an exponential family have the same support](expfam.html#non-degeneracy). There is also the problem that $\mathop{\rm logit}(0)$ and $\mathop{\rm logit}(1)$ are undefined, so there are no canonical parameter values for those distributions to have (canonical parameters have to be real numbers or real vectors, infinite values are not allowed).

This is a one-parameter family of distributions so its canonical parameter space is a one-dimensional vector space. In a one-dimensional vector space, there are only two directions: toward $+ \infty$ and toward $- \infty$. There are an infinite number of vectors, but only two directions those vectors can point in.

The [log likelihood for the binomial distribution](ch1.html#modifications) is
$$
   l(\pi) = y \log(\pi) + (n - y) \log(1 - \pi)
$$
When $y = 0$, this becomes
$$
   l(\pi) = n \log(1 - \pi)
$$
and this is a strictly decreasing function of $\pi$, hence the log likelihood increases as $\theta$ moves in the minus direction (toward $- \infty$). Because $\mathop{\rm logit}$ is a monotone function, as $\theta$ decreases, $\pi$ also decreases, $1 - \pi$ increases, and $\log(1 - \pi)$ increases. So $-1$ (or any negative number, thought of as a vector in the one-dimensional vector space $\mathbb{R}$) is a DOR.

And what happens to the probabilities as we take limits? There are two cases. For $y = 0$
$$
   f(y) = (1 - \pi)^n \to 1,
   \qquad \text{as $\pi \to 0$ and $\theta \to - \infty$}.
$$
And for $y > 0$
$$
   f(y) = \pi^y (1 - \pi)^{n - y} \to 0,
   \qquad \text{as $\pi \to 0$ and $\theta \to - \infty$}.
$$
Thus binomial distributions with usual parameter $\pi$ and canonical parameter $\theta$ converge to the completely degenerate distribution concentrated at zero as $\pi \to 0$ and $\theta \to - \infty$.

Now the theory described above says the MLE does not exist in the OM (because $- 1$ is a DOR that is not a DOC) and the LCM is the distribution concentrated at $Y = 0$. This is the only distribution in the LCM, so the LCM has no identifiable parameters, and this only distribution is the MLE distribution.

This distribution has mean zero and variance zero, which are the limits of the mean $n \pi$ and variance $n \pi (1 - \pi)$ as $\pi \to 0$ and $\theta \to - \infty$.

This seems like a crazy amount of theory to throw at such a simple problem. And it is overkill for a one-dimensional problem like this. But, as we shall soon see, even two-dimensional problems can be hard for humans to figure out. So we need the computer to use the theory above to find the answers.

By the way, a similar story could be told about when the data are $y = n$.

# Canonical Affine Submodels

[Canonical affine submodels](expfam.html#casm-sdr) of regular full exponential families are again regular full exponential families. If $a$ is the offset vector and $M$ is the model matrix, then
$$
   \theta = a + M \beta
$$
is the relationship between

* the saturated model canonical parameter vector $\theta$ and

* the canonical affine submodel canonical parameter vector $\beta$.
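To make the notation concrete, here is a minimal sketch (made-up covariate and responses, zero offset, and an arbitrary $\beta$; shown but not evaluated when this document is built) of this map from the submodel canonical parameter to the saturated model canonical parameter, along with the submodel canonical statistic that appears just below.

```{r casm-toy-sketch, eval = FALSE}
x <- c(10, 20, 30, 40)                  # made-up covariate
y <- c(0, 0, 1, 1)                      # made-up Bernoulli responses
m <- cbind("(Intercept)" = 1, x = x)    # model matrix M
a <- rep(0, nrow(m))                    # offset vector (zero here)
beta <- c(-2, 0.05)                     # arbitrary submodel canonical parameter
a + drop(m %*% beta)                    # saturated model canonical parameter theta
drop(t(m) %*% y)                        # submodel canonical statistic M^T y
```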
And
$$
   \tau = M^T \mu
$$
is the relationship between

* the saturated model mean value parameter vector $\mu$ and

* the canonical affine submodel mean value parameter vector $\tau$.

And

* the saturated model canonical statistic vector is $y$ and

* the canonical affine submodel canonical statistic vector is $M^T y$.

So the only difference between saturated models and canonical affine submodels is that we replace $y$ with $M^T y$.

* A vector $\delta$ in the canonical parameter space of a canonical affine submodel is a DOC if and only if $\langle M^T (Y - y), \delta \rangle = \langle Y - y, M \delta \rangle$ is zero with probability one.

* A vector $\delta$ in the canonical parameter space of a canonical affine submodel is a DOR if and only if $\langle M^T (Y - y), \delta \rangle = \langle Y - y, M \delta \rangle$ is nonpositive with probability one.

A vector $\delta$ is a DOR for the submodel if and only if $\eta = M \delta$ is a DOR for the saturated model. Thus

* for Poisson sampling $\eta$ is a DOR if and only if $\eta_i \le 0$ for all $i$ (because the Poisson distribution has no upper bound) and $\eta_i < 0$ implies $y_i = 0$ (because zero is the lower bound),

* for binomial sampling $\eta$ is a DOR if and only if $\eta_i < 0$ implies $y_i = 0$ (because zero is the lower bound) and $\eta_i > 0$ implies $y_i = n_i$ (because $n_i$ is the upper bound).

Multinomial and product multinomial sampling is considerably more complicated, so we [defer discussion of that](#dor-multinomial).

But already we can see why this problem is hard. We have to find a $\delta$ such that $\eta = M \delta$ satisfies the above conditions. And for complicated models that is complicated. Fortunately, the computer can do it for us.

When we have found a DOR $\eta$, the LCM says

* if $\eta_i \neq 0$, then the LCM conditions on $Y_i = y_i$ and

* if $\eta_i = 0$, then the LCM makes no restrictions on $Y_i$.

Hence the LCM is found by fitting the model to the cases with $\eta_i = 0$. When we lose cases like this, we may also have more non-identifiable parameters, but the computer can figure that out.

Now we can be a bit more precise about what a GDOR is. For Poisson and binomial sampling, a vector $\delta$ is a GDOR if it is a DOR such that $\eta = M \delta$ has the largest possible number of nonzero components, so the LCM conditions on the largest possible number of components of the response vector. Of course, you have no way to guess what subset of components is "largest possible" but, again, the computer can figure it out.

# Separation versus Degeneracy

@agresti, Section 6.5.1, uses the terminology of *complete separation* and *quasi-complete separation* to discuss the issues in this area. We prefer the terminology of *complete degeneracy* and *partial degeneracy* to describe these issues. Agresti's terms only apply to logistic regression. Our terms apply to all exponential families.

*Complete separation* occurs if all cases are plotted in regressor space, the successes and failures are labeled with different colors, and there is a hyperplane in predictor space that separates them, all of one color on one side and all of the other color on the other side. An example of such a plot is Figure \@ref(fig:agresti-complete-complete-separation) below. This corresponds to an LCM that is *completely degenerate*, concentrated at a single point. The MLE distribution says the only possible value of the response vector was the observed value.
*Quasi-complete separation* occurs if all cases are plotted in predictor space, the successes and failures are labeled with different colors, and there is a hyperplane in predictor space that weakly separates them, all of one color on one side or in the hyperplane and all of the other color on the other side or in the hyperplane. An example of such a plot is Figure \@ref(fig:agresti-quasi-complete-separation) below. This corresponds to an LCM that is *partially degenerate*, not concentrated at a single point but having support smaller than the OM. The MLE distribution is not completely degenerate but does give probability zero to some outcomes that have positive probability in the OM. # Complete Separation Example of Agresti ## Data Section 6.5.1 of @agresti introduces the notion of complete separation with the following example. ```{r example.i} x <- seq(10, 90, 10) x <- x[x != 50] y <- as.numeric(x > 50) data.frame(x, y) ``` The following figure shows these data. ```{r label="one", echo=FALSE, fig.align = "center", fig.cap = "Logistic Regression Data for Complete Separation Example of Agresti"} plot(x, y) ``` ## Attempt to Fit Model Suppose we want to do "simple" logistic regression (one predictor $x$ plus intercept, so the model is two-dimensional). Let's try to do it naively. ```{r naive, error = TRUE} gout <- glm(y ~ x, family = binomial) summary(gout) ``` R function `glm` does give a warning. But what are you supposed to do about it? If the output of the R function `summary` is to be taken seriously, you cannot tell whether either regression coefficient is nonzero. As we shall see, that is complete nonsense. Both parameters have to go to infinity (one to plus infinity, the other to minus infinity) in order to maximize the likelihood. In fact, these data are not analyzable by the R function `glm` and its associated functions (generic functions having methods for class `"glm"`). At least it did give a (not very clear) warning that its results are absolute rubbish. But that is no consolation because * it doesn't give you any options to do the [Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html) and * it gives both false negative and false positives on these warnings, so they are nonsense too. You cannot rely on anything R function `glm` does. ## Log-Linear Models Done Right So we switch to R function `llmdr` which stands for log-linear models done right (a take-off on the [name of a well-known linear algebra textbook](https://www.amazon.com/Linear-Algebra-Right-Undergraduate-Mathematics/dp/3319110799/)) in R package `llmdr` [@llmdr-package]. ```{r llmdr-agresti-complete} lout <- llmdr(y ~ x, family = "binomial") summary(lout) ``` The fact that we have `NA` for every parameter estimate means all parameters have been dropped to obtain identifiability in the LCM. Thus the LCM is completely degenerate (or we have complete separation in Agresti's terminology). The GDOR given is the direction the parameters go to infinity to get to the LCM. Since no parameters have been estimated, everything else is also `NA`. Admittedly this is not easy to understand. But of course [canonical parameters are never easy to understand](expfam.html#alice). But mean value parameters are much easier to understand. ```{r agresti-complete-mu} estimates(lout, type = "mu") ``` This says the MLE of the saturated model mean value parameter vector $\mu$ is equal to the observed data. 
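We can also check this against the sign conditions for a binomial DOR given in the section on canonical affine submodels above. Here is a minimal sketch (assuming the objects `lout`, `x`, and `y` from the chunks above; shown but not evaluated when this document is built) that maps the GDOR $\delta$ to the saturated model scale, $\eta = M \delta$.

```{r agresti-complete-eta-sign-check, eval = FALSE}
# GDOR mapped to the saturated model canonical parameter scale
eta <- drop(as.matrix(lout$x) %*% lout$gdor)
cbind(x, y, eta = zapsmall(eta))
# binomial DOR sign conditions: eta_i < 0 only where y_i = 0 and
# eta_i > 0 only where y_i = n_i (here n_i = 1)
all(eta[y == 0] < 0) && all(eta[y == 1] > 0)
# no component of eta is zero, so the LCM conditions on every Y_i,
# that is, the LCM is concentrated at the observed response vector
all(zapsmall(eta) != 0)
```

Either way we reach the same conclusion: the MLE mean values shown above are just the observed data.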
And since all of those values are on the boundary of the range ($y_i$ can only be zero or one for these data), this means we have a completely degenerate LCM (something we already knew from the Estimates column in the summary being all `NA`). ## So Now What? @agresti, Section 6.5.3 gives several pieces of advice about what to do in this situation. We now think all of it is bad, although we ourselves have given similar advice in the past. More on this below. None of this section is intended to be critical of @agresti. That book is an authoritative compendium of conventional wisdom about the subject. The fact that conventional wisdom sometimes disagrees with long established theory, albeit fairly esoteric theory, is not unique to categorical data analysis. The first thing we can do is wonder if some simpler model fits the data. If so, then we might not have to worry about this OM not having MLE. So let's try that. ```{r agresti-complete-anova} lout.null <- llmdr(y ~ 1, family = "binomial") anova(lout.null, lout) ``` We are getting a bit ahead of ourselves here (we will explain how hypothesis tests work in this situation in Section \@ref(theory-of-hypothesis-tests) below). But the $P$-value seems fairly strong. The response does depend on $x$. So we cannot drop $x$ from the model. Save these data for future use ([Section about AIC below](#aic)) ```{r agresti-complete-save} save.complete <- list(null = lout.null, alternative = lout, anova = anova(lout.null, lout), data = data.frame(x = x, y = y)) ``` By the way, we could have gotten the same result from R function `glm` and friends. ```{r agresti-complete-anova-glm} gout.null <- glm(y ~ 1, family = "binomial") anova(gout.null, gout, test = "LRT") ``` This illustrates a general principle. When the MLE for the null hypothesis exists in the OM (of the null hypothesis), then the usual theory about likelihood ratio tests and Rao tests is valid. It doesn't matter whether or not the MLE exists in the OM of the alternative hypothesis (Section \@ref(theory-of-hypothesis-tests) below). So R function `glm` and friends are doing the Right Thing here. But it gives a warning in fitting the alternative hypothesis that we have to be sophisticated enough to ignore (for the purposes of this hypothesis test). Another thing Agresti recommends is to do likelihood-based confidence intervals. These at least will be automatically one sided. But R function `confint.glm` in R package `MASS` messes up one-sided confidence intervals. So we have to [do that the hard way](ch4.html#likelihood-intervals). 
```{r agresti-complete-logl-likelihood-function} logl.factory <- function(modmat, response) { stopifnot(is.numeric(modmat)) stopifnot(is.finite(modmat)) stopifnot(is.matrix(modmat)) stopifnot(is.numeric(response)) stopifnot(is.finite(response)) stopifnot(round(response) == response) # check for integer-valued stopifnot(response >= 0) stopifnot(length(response) == nrow(modmat)) logpq <- function(theta) ifelse(theta > 0, - log1p(exp(- theta)), theta - log1p(exp(theta))) function(beta) { stopifnot(is.numeric(beta)) stopifnot(is.finite(beta)) stopifnot(length(beta) == ncol(modmat)) theta <- drop(modmat %*% beta) logp <- logpq(theta) logq <- logpq(- theta) return(sum(response * logp + (1 - response) * logq)) } } # need as.matrix because lout$x is sparse matrix class logl <- logl.factory(as.matrix(lout$x), lout$y) logl.gradient.factory <- function(modmat, response) { stopifnot(is.numeric(modmat)) stopifnot(is.finite(modmat)) stopifnot(is.matrix(modmat)) stopifnot(is.numeric(response)) stopifnot(is.finite(response)) stopifnot(round(response) == response) # check for integer-valued stopifnot(response >= 0) stopifnot(length(response) == nrow(modmat)) pq <- function(theta) ifelse(theta > 0, 1 / (1 + exp(- theta)), exp(theta) / (1 + exp(theta))) function(beta) { stopifnot(is.numeric(beta)) stopifnot(is.finite(beta)) stopifnot(length(beta) == ncol(modmat)) theta <- drop(modmat %*% beta) p <- pq(theta) q <- pq(- theta) return(rbind(drop(crossprod(modmat, response * q - (1 - response) * p)))) } } logl.gradient <- logl.gradient.factory(as.matrix(lout$x), lout$y) beta.try <- rnorm(2) logl.gradient(beta.try) grad(logl, beta.try) # check that maximum of logl is zero logl(100 * lout$gdor) logl(300 * lout$gdor) ``` So now we are going to do something really tricky that does not generalize to problems with more than two parameters but does really show what is going on in this example. We are going to draw the boundary of the likelihood-based confidence region for $\beta$. Because we are going to compare this with some other regions, we write a function to do it. 
```{r agresti-complete-curve-follower}
do.curve <- function(hin, hin.jac, eps = 0.1, cutoff = -10) {
    # find one point on curve
    beta.start <- c(0, 0)
    gdor <- lout$gdor
    aout <- auglag(beta.start,
        fn = function(beta) sum(gdor * beta),
        gr = function(beta) gdor,
        hin = hin, hin.jac = hin.jac,
        control.outer = list(trace = FALSE))
    stopifnot(aout$convergence == 0)
    beta.save <- aout$par
    # now step along the curve in the counterclockwise direction
    # using R function auglag to put us back on the curve
    for (ii in 1:1000) {
        foo <- logl.gradient(aout$par)
        # rotate clockwise 90 degrees
        bar <- rev(foo) * c(1, -1)
        # normalize
        bar <- bar / sqrt(sum(bar^2))
        # slowly increase step size as we go along
        beta.try <- aout$par + eps * bar * (1 + ii / 10)
        aout <- auglag(beta.try,
            fn = function(beta) sum((beta - beta.try)^2) / 2,
            gr = function(beta) beta - beta.try,
            hin = hin, hin.jac = hin.jac,
            control.outer = list(trace = FALSE))
        stopifnot(aout$convergence == 0)
        beta.save <- rbind(beta.save, aout$par)
        if (aout$par[1] < cutoff) break
    }
    # go back to the first point
    aout <- list(par = beta.save[1, ])
    # and go the other way
    for (ii in 1:1000) {
        foo <- logl.gradient(aout$par)
        # rotate counterclockwise 90 degrees
        bar <- rev(foo) * c(-1, 1)
        # normalize
        bar <- bar / sqrt(sum(bar^2))
        # slowly increase step size as we go along
        beta.try <- aout$par + eps * bar * (1 + ii / 10)
        aout <- auglag(beta.try,
            fn = function(beta) sum((beta - beta.try)^2) / 2,
            gr = function(beta) beta - beta.try,
            hin = hin, hin.jac = hin.jac,
            control.outer = list(trace = FALSE))
        stopifnot(aout$convergence == 0)
        beta.save <- rbind(aout$par, beta.save)
        if (aout$par[1] < cutoff) break
    }
    beta.save
}
```

So that is a pretty complicated function, but now that we have it, it is fairly simple to use. Its arguments are

* `hin` the function whose zero set is the curve,

* `hin.jac` the gradient of that function,

* `eps` how far apart we want points along the curve to be, and

* `cutoff` where to stop. The curve, of course, goes off to infinity in the direction of recession at both ends, so we have to stop somewhere. Having made this plot in previous lecture notes (not in the same way), we know that the curve goes off the left side of the plot, so we stop when $\beta_1$ equals the cutoff.

This code (especially `cutoff`) is very specific to this particular problem, but since nothing of the sort works for more than two parameters, we won't worry about that.

```{r agresti-complete-likelihood-simultaneous,cache=TRUE}
level <- 0.95
# divide by 2 because our function is log likelihood not twice log likelihood
crit <- qchisq(level, df = length(lout$coefficients)) / 2
crit
beta.likelihood <- do.curve(function(beta) crit + logl(beta), logl.gradient)
```

```{r agresti-complete-likelihood-simultaneous-plot,fig.align="center",fig.cap="Likelihood-based Confidence Region for Beta"}
foo <- beta.likelihood[,1] > -8
plot(beta.likelihood[,1], beta.likelihood[,2],
    xlab = expression(beta[1]), ylab = expression(beta[2]),
    xlim = range(beta.likelihood[foo,1]),
    ylim = range(beta.likelihood[foo,2]), type = "l")
```

Now we can make likelihood-based confidence intervals for any parameter (any function of $\beta$) by maximizing and minimizing it over this region. And all of these intervals will have simultaneous coverage because they are all based on the same confidence region.

If we want non-simultaneous coverage, we do the same thing except we use the critical value for one degree of freedom.
```{r agresti-complete-likelihood-non-simultaneous,cache=TRUE} # divide by 2 because our function is log likelihood not twice log likelihood crit <- qchisq(level, df = 1) / 2 crit beta.likelihood.one.df <- do.curve(function(beta) crit + logl(beta), logl.gradient) ``` ```{r agresti-complete-likelihood-non-simultaneous-plot,fig.align="center",fig.cap="Level Sets of the Log Likelihood. Black, chi square critical value for 2 df. Red, chi square critical value for 1 df."} plot(beta.likelihood[,1], beta.likelihood[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.likelihood[foo,1]), ylim = range(beta.likelihood[foo,2]), type = "l") lines(beta.likelihood.one.df[,1], beta.likelihood.one.df[,2], col = "red") ``` But there is a problem with what we are doing. It isn't justified by theory. Let's go back to the binomial distribution. We know that for large $n$ and $\pi$ not close to the boundary of its parameter space that we have good normal approximation, which is what these likelihood based confidence intervals assume. But we also know that when $n$ is large and $\pi$ is close to the boundary so either $n \pi$ or $n (1 - \pi)$ is not large, we have good Poisson approximation not normal approximation. So that says that when the mean value parameter is close to the boundary, the usual asymptotics of maximum likelihood do not work. So there is every reason to believe these intervals just do not work. Another way to explain the same thing is to note that the multivariate normal distribution has no boundary. Thus a sampling distribution is only approximately multivariate normal when the data for the distribution being approximated are almost never on the boundary. Thus the mere fact that the MLE distribution is on the boundary is sufficient reason to believe the asymptotics do not work. ## Valid Region for Beta Thus we introduce another sort of confidence region, proposed by @geyer-gdor, Section 3.16. These, at least, do have some theoretical justification. We won't explain that theory now, although it is quite simple (see Section \@ref(theory-confidence) below). We will just put it too on our plot. ```{r agresti-complete-geyer,cache=TRUE} alpha <- 1 - level alpha crit <- (- log(alpha)) crit beta.geyer <- do.curve(function(beta) crit + logl(beta), logl.gradient, cutoff = -1000) ``` ```{r agresti-complete-geyer-plot,fig.align="center",fig.cap="Level Sets of the Log Likelihood. Black, chi square critical value for 2 df. Red, chi square critical value for 1 df. Blue, minus log significance level. Arrow is GDOR. Blue curve is on top of black curve."} plot(beta.likelihood[,1], beta.likelihood[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.likelihood[foo,1]), ylim = range(beta.likelihood[foo,2]), type = "l") lines(beta.likelihood.one.df[,1], beta.likelihood.one.df[,2], col = "red") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) ``` The blue curve in the plot is exactly on top of the black curve. This is an accidental feature of this problem. It happens because the LCM is completely degenerate (so both the blue curve and the black curve use the same constraint function) and this problem has 2 parameters (which makes the critical values for the blue curve and the black curve the same). The blue and black curves would be different if the LCM were only partially degenerate or if the problem had number of parameters not equal to 2. 
If the number of parameters was much greater than 2, then the black curve would be far outside the blue curve, and even the red curve might be far outside the blue curve. So we will henceforth discard the likelihood-based intervals as a bad idea (when near the boundary in the mean value parameter space and near infinity in the canonical parameter space; likelihood-intervals are one of the great ideas of statistics when far from the boundary) and use the proposal of @geyer-gdor. ## GDOR Need Not Be Unique In this problem there are an infinite number of directions of recession. R package `rcdd` can find all of them. ```{r agresti-complete-rcdd} tanv <- as.matrix(lout$x) tanv[lout$y == 1, ] <- (- tanv[lout$y == 1, ]) vrep <- makeV(rays = tanv) hrep <- scdd(vrep)$output norv <- (- hrep[ , -(1:2), drop = FALSE]) norv ``` The rows of this matrix are the extreme directions of recession. They are not generic. Every vector pointing in a direction between these vectors (every vector that is a linear combination of these vectors with strictly positive coefficients) is a GDOR. Let's put these on our plot. ```{r agresti-complete-dor-plot,fig.align="center",fig.cap="Level Set of Log Likelihood and Directions of Recession. Blue curve is boundary of level set (same as blue curve in Figure \\@ref(fig:agresti-complete-geyer-plot)). Black arrow is GDOR. Green arrows are DOR that are not generic."} plot(beta.likelihood[,1], beta.likelihood[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.likelihood[foo,1]), ylim = range(beta.likelihood[foo,2]), type = "n") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) norv <- norv / norv[1, 1] * lout$gdor[1] arrows(-5, 0.125, -5 + 2 * norv[1, 1], 0.125 + 2 * norv[1, 2], col = "green3") norv <- norv / norv[2, 1] * lout$gdor[1] arrows(-5, 0.125, -5 + 2 * norv[2, 1], 0.125 + 2 * norv[2, 2], col = "green3") ``` Any arrow pointing between the green arrows is a GDOR, so GDOR are not unique in this example (they are unique in a [slightly different example](#quasi-complete-separation-example-of-agresti) below). There are two reasons why we don't try to find all GDOR rather than just one GDOR. * One GDOR is enough to do all the statistics. * All the GDOR take a lot longer to find, hours, days, or worse for large examples. ## Mathematical Proofs The arrow in Figure \@ref(fig:agresti-complete-geyer-plot) or the arrows in Figure \@ref(fig:agresti-complete-dor-plot) show that the log likelihood really does go uphill all the way to infinity in the GDOR direction. Of course, just eyeballing a plot isn't a proof, but the computer can do a rigorous proof. ```{r agresti-complete-proof} lout <- llmdr(y ~ x, family = "binomial", proofs = 2) lout$proofs ``` The argument `proofs = 2` to R function `llmdr` asks it to do 2 proofs (more precisely 2 parts of one proof). The result having `lout$proofs == 2` indicates that both proofs were successful. The proofs have been done. What is reported as a GDOR really is one. (Or for a problem in which a GDOR does not exist so the MLE exists in the OM, that has been proved.) By reporting it has done the proofs, the computer says that, although it found the answer using inexact computer arithmetic, it went back and did enough work in infinite-precision rational arithmetic to prove that what it says is a GDOR really is one. 
The only reason R function `llmdr` does not do these proofs by default is that inexact computer arithmetic usually gives the right answer and these proofs can take many minutes to do on large problems. ## Intervals for Beta ```{r agresti-complete-beta-intervals} beta <- beta.geyer colnames(beta) <- names(lout$coefficients) beta <- apply(beta, 2, range) rownames(beta) <- c("lower", "upper") beta <- t(beta) beta ``` Except we know this is wrong. As the arrow in Figure \@ref(fig:agresti-complete-geyer-plot) shows, $\beta_1$ goes to $- \infty$ as we travel in the GDOR direction, and $\beta_2$ goes to $+ \infty$ as we travel in the GDOR direction. We just don't get that far on the plot. So fix that up. ```{r agresti-complete-beta-intervals-too} beta[1, "lower"] <- -Inf beta[2, "upper"] <- Inf beta ``` We can make these intervals if a user asks for them, but they are not very informative. They provide much less information than the whole region whose boundary is the blue curve in Figure \@ref(fig:agresti-complete-geyer-plot). ## Intervals for Theta and Mu Larry Brown (the author of @brown) told your humble author that nobody would ever understand the intervals for $\beta$ so focus on $\mu$. Your humble author mostly agrees. The only reasons we bothered with $\beta$ is because * everything starts there, we can only do intervals for other parameters by mapping intervals for $\beta$ to them, * when writing an R package, you don't know what users will want or what applications they will be doing, so you have to provide all the possibilities you can if you want to be useful. First we do sloppy intervals for $\theta$. ```{r agresti-complete-theta-sloppy} modmat <- as.matrix(lout$x) theta.geyer <- modmat %*% t(beta.geyer) theta.intervals <- apply(theta.geyer, 1, range) theta.intervals <- t(theta.intervals) colnames(theta.intervals) <- c("lower", "upper") theta.intervals ``` But we can do a lot better than this by running the optimizer again. First we know that we have one-sided intervals. If $y_i = 0$, then we know the lower bound for $\theta_i$ is $- \infty$. If $y_i = 1$, then we know the upper bound for $\theta_i$ is $+ \infty$. So we can just set them without any calculation. Then we find the $\beta$ that optimizes each $\theta_i$ and use it as a starting point for optimization. And we write a function to do that. 
```{r agresti-complete-theta-function} do.theta.intervals <- function(x) { foo <- cbind(1, x) %*% t(beta.geyer) gdor <- lout$gdor lower <- rep(-Inf, length(x)) upper <- rep(+Inf, length(x)) for (i in seq(along = x)) { eta <- sum(c(1, x[i]) * gdor) eta <- zapsmall(eta) if (eta < 0) { # need upper bound idx <- which(foo[i, ] == max(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.geyer[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = function(beta) crit + logl(beta), hin.jac = logl.gradient, control.outer = list(trace = FALSE), control.optim = list(fnscale = -1)) stopifnot(aout$convergence == 0) upper[i] <- aout$value } else { # need lower bound idx <- which(foo[i, ] == min(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.geyer[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = function(beta) crit + logl(beta), hin.jac = logl.gradient, control.outer = list(trace = FALSE)) stopifnot(aout$convergence == 0) lower[i] <- aout$value } } cbind(lower, upper) } theta <- do.theta.intervals(x) theta ``` And then we transform these to the mean value parameter scale. ```{r agresti-complete-mu-intervals} mu <- 1 / (1 + exp(- theta)) mu ``` And then we plot them. ```{r agresti-complete-mu-plot,fig.align="center",fig.cap="Simultaneous 95% Confidence intervals for Mean Value Parameters. Hollow circles are MLE."} errbar(x, y, mu[ , 2], mu[ , 1], xlab = "x", ylab = "probability") ``` This plot is fairly simple. The hollow dots are the MLE saturated model mean value parameter vector $\hat{\mu}$. They are on the boundary. The MLE distribution is completely degenerate and says the observed value of the response vector is the only possible value. But $\hat{\mu}$ is not $\mu$. Estimates are not the parameters they estimate. This is the second most fundamental concept in statistics. (The most fundamental is anecdotes are not data. The third most fundamental is it is called a 95% confidence interval because it is wrong 5% of the time.) So we see that we do not need to fear MLE on the boundary. We just need a proper theory of confidence intervals that applies in the situation. Note that the error bars say almost nothing about the probabilities near the jump in the regression function. Not surprising in hindsight. If one of the Bernoulli random variables for $x = 40$ or $x = 60$ had been different, we would have inferred the jump was at a different point. But the farther we get from the jump the better the estimation. So even in this toy problem with so little data, statistics can say something of consequence. ## Submodel Canonical Statistic Theory tells us that where the action is really happening is with the canonical statistic vector $M^T y$ of the canonical affine submodel. In most problems we can't visualize that, but in this toy problem, where $M^T y$ is two-dimensional, we can. Another reason why we can't do anything like this section in non-toy problems is that the computing we do in this section is exponential in the size of the data, so the computing time would be horrendously long for non-toy data. There are $2^n$ possible values of the response vector where $n$ is the length of the response vector because each component of $y$ can be either zero or one. The following code makes all of those vectors. 
```{r example.i.all.y}
yy <- matrix(0:1, nrow = 2, ncol = length(x))
colnames(yy) <- paste0("y", x)
yy <- expand.grid(as.data.frame(yy))
head(yy)
dim(yy)
```

But there are not so many distinct values of the submodel canonical statistic.

```{r example.i.all.m.transpose.y}
m <- cbind(1, x)
mtyy <- t(m) %*% t(yy)
t1 <- mtyy[1, ]
t2 <- mtyy[2, ]
t1.obs <- sum(y)
t2.obs <- sum(x * y)
```

```{r agresti-complete-submodel-sufficient,fig.align="center",fig.cap="Possible values of the submodel canonical statistic vector $M^T y$ for the data shown in Figure \\@ref(fig:one). Solid dot is the observed value. Line is hyperplane \\@ref(eq:h) determined by the GDOR."}
t.key <- paste(t1, t2, sep = ":")
t.idx <- match(unique(t.key), t.key)
t1.uniq <- t1[t.idx]
t2.uniq <- t2[t.idx]
par(mar = c(5, 5, 1, 0) + 0.1)
plot(t1.uniq, t2.uniq,
    xlab = expression(t[1] == sum(y)),
    ylab = expression(t[2] == sum(x * y)))
points(t1.obs, t2.obs, pch = 19)
# add H
gdor <- lout$gdor
# rotate 90 degrees to convert normal vector to tangent vector
tanv <- rev(gdor) * c(-1, 1)
tanv
# now convert to slope and intercept of line
slope <- tanv[2] / tanv[1]
intercept <- t2.obs - t1.obs * slope
abline(intercept, slope)
```

Having the hyperplane $H$ defined by \@ref(eq:h) on the plot makes it clear that the observed value of the canonical statistic really is extreme. It lies on the line, and all other possible (in the OM) values lie on one side of $H$. The fact that only one point lies in $H$ means the LCM is concentrated at this one point, so is completely degenerate.

## Complete Separation

Just for completeness, we do show the "complete separation" plot that Agresti discusses.

```{r agresti-complete-complete-separation,fig.align="center",fig.cap="Complete Separation of Successes and Failures. Black dots successes. White dots failures. Black line separating hyperplane"}
dim(modmat)
plot(modmat[ , 1], modmat[ , 2],
    xlab = colnames(modmat)[1], ylab = colnames(modmat)[2])
points(modmat[y == 1, 1], modmat[y == 1, 2], pch = 19)
intercept <- 50 - 1 * slope
abline(intercept, slope)
```

The separating hyperplane in Figure \@ref(fig:agresti-complete-complete-separation) is one whose normal vector is the GDOR found by R function `llmdr`. It is clear that there are a lot of separating hyperplanes, so this plot, although easier to do than Figure \@ref(fig:agresti-complete-submodel-sufficient), doesn't provide the same information. This will become even more clear in the following section.

Figure \@ref(fig:agresti-complete-complete-separation) is a little odd in that the point cloud isn't really two-dimensional. All of the points have the same value of the regressor `(Intercept)`. But of course that always happens. That is what "intercept" means. So it is odd, but also extremely common. Most models have an intercept.

## Region for Tau

We can map the confidence region for $\beta$ that is the region bounded by the blue curve in Figure \@ref(fig:agresti-complete-geyer-plot) to values for $\tau$, thus obtaining a confidence region for $\tau$.
```{r agresti-complete-tau-region,fig.align="center",fig.cap="Same as Figure \\@ref(fig:agresti-complete-submodel-sufficient) Except Gray Region is 95% Confidence Region for Submodel Mean Value Parameter"} theta <- tcrossprod(modmat, beta.geyer) mu <- 1 / (1 + exp(- theta)) tau <- crossprod(modmat, mu) # add observed value to these points tau <- cbind(c(t1.obs, t2.obs), tau) par(mar = c(5, 6, 1, 0) + 0.1) plot(t1.uniq, t2.uniq, xlab = expression(t[1] == sum(y)), ylab = expression(t[2] == sum(x * y)), type = "n") polygon(tau[1, ], tau[2, ], border = NA, col = "gray") points(t1.uniq, t2.uniq) points(t1.obs, t2.obs, pch = 19) intercept <- t2.obs - t1.obs * slope abline(intercept, slope) mtext(expression(tau[1]), side = 1, line = 4) mtext(expression(tau[2]), side = 2, line = 5) ``` From this figure it is clear that the confidence region is fairly restrictive, and most of the submodel mean value parameter space is excluded by the confidence region. ```{r restore-par-mar,echo=FALSE} par(mar = c(5, 4, 4, 1) + 0.1) ``` ## Confidence Interval for Closeness to the Boundary We can make confidence intervals for any function of parameters we can think of. Here is one that is especially interesting. The parameter $\langle \tau, \delta \rangle$ measures how close a distribution is to the boundary (to the line in the figure). ```{r agresti-boundary-interval} sum(c(t1.obs, t2.obs) * gdor) - rev(range(crossprod(tau, gdor))) ``` Admittedly, this is impossible to interpret by itself because the length of the GDOR is arbitrary. But we can compare it to the range of the same GDOR applied to the whole sample space. ```{r agresti-boundary-interval-compare} ttt <- rbind(t1.uniq, t2.uniq) sum(c(t1.obs, t2.obs) * gdor) - rev(range(crossprod(ttt, gdor))) ``` These data, few though they may be, do mush up the probability fairly close to the boundary. ## Confidence Intervals for Hypothetical Individuals We can also make confidence intervals for hypothetical individuals, what R generic function `predict` does when its `newdata` argument is used. Here we treat $x$ as a continuous variable and do intervals for $\mu$ for all possible $x$ values. ```{r agresti-complete-hypothetical,cache=TRUE} xx <- seq(0, 100, 0.1) # be especially careful at jump xx <- c(xx, 39.99, 40.01, 59.99, 60.01) xx <- sort(xx) theta <- do.theta.intervals(xx) mu <- 1 / (1 + exp(- theta)) # have to do by hand since optimization isn't good enough mu[xx < 60, 1] <- 0 mu[xx > 40, 2] <- 1 ``` ```{r agresti-complete-hypothetical-plot,fig.align="center",fig.cap="Like Figure \\@ref(fig:agresti-complete-mu-plot) Except with Confidence Intervals for Every $x$ value"} plot(x, y, type = "n", xlab = expression(x), ylab = expression(mu(x))) xxx <- c(xx, rev(xx)) yyy <- c(mu[ , 1], rev(mu[ , 2])) length(xxx); length(yyy) polygon(xxx, yyy, border = NA, col = "gray") points(x, y) ``` That finishes our work on this toy problem. Admittedly, it is hard to know what you are supposed to learn from a toy problem. But toy problems do have the advantage of at least being somewhat understandable. # A Clinical Trial Example of Agresti ```{r agresti-clinical-clean,echo=FALSE} rm(list = setdiff(ls(), c("level", "do.curve", "logl.factory", "logl.gradient.factory", "save.complete"))) ``` But analyses involving DOR are not always so complicated. Sometimes the issue and analysis are really quite simple. Section 6.5.2 in @agresti looks at the following data for a clinical trial. 
```{r agresti-clinical-data} center <- rep(1:5, each = 2) center <- as.factor(center) treatment <- rep(c("active_drug", "placebo"), times = 5) success <- c(0, 0, 1, 0, 0, 0, 6, 2, 5, 2) failure <- c(5, 9, 12, 10, 7, 5, 3, 6, 9, 12) y <- cbind(success, failure) ``` If we dump that into R function `llmdr` we find the MLE does not exist in the OM. ```{r agresti-clinical-fit} lout <- llmdr(y ~ center + treatment, family = "binomial") summary(lout) ``` ```{r agresti-clinical-wald-do-summary,echo=FALSE} sout <- summary(lout) ``` Furthermore, we cannot drop `treatment` because it is statistically significant (`r round(sout$coefficients["treatmentplacebo", "Pr(>|z|)"], 4)`) by the Wald test done in the summary, and we cannot drop `center` because ```{r agresti-clinical-anova-center} lout.null <- llmdr(y ~ treatment, family = "binomial") anova(lout.null, lout) ``` So the model `y ~ center + treatment` is the only model under consideration that fits the data. [As usual](expfam.html#alice) it is very hard to figure out what canonical parameters mean. So we look at mean value parameters. ```{r agresti-clinical-mu} data.frame(center, treatment, estimates(lout, "mu")) ``` Oh! There were no successes in either treatment group in centers 1 and 3. And the MLE mean values also say no successes in those centers. If you were to decide to throw away the data from centers 1 and 3, just to make the data easier for you to analyze, then you could be severely criticized for that. If you did it and didn't admit it, that would be fraud. If you did it and did admit it, then that would be sufficient reason for everybody to ignore everything you say on the subject. But if statistics does the same thing, following the theory of maximum likelihood estimation in exponential families, then there is nothing to criticize. The MLE in the LCM is also the MLE in the completion of the OM, arrived at by taking limits of sequences of distributions in the OM. There is nothing wrong with that. If taking limits is wrong, then everything involving calculus has been wrong for over 400 years. So that argument is a non-starter. So having arrived at this conclusion, we mention that this is another situation in which the theory in these notes matches conventional practice. If a conclusion can be drawn solely from the fit in the LCM alone (with no reference to the OM), then it is valid (assuming the usual asymptotics of maximum likelihood apply, that is, assuming the sample size of the LCM is large enough). So that means the coefficient for the treatment effect and its standard error in the summary above are valid. So we can turn that into a valid Wald confidence interval. ```{r agresti-clinical-wald-interval} sout <- summary(lout) fit <- sout$coefficients["treatmentplacebo", "Estimate"] se.fit <- sout$coefficients["treatmentplacebo", "Std. Error"] fit + c(-1, 1) * qnorm((1 + level) / 2) * se.fit ``` And the hypothesis test is valid too. We have a statistically significant treatment effect (with aforementioned $P$-value $P = `r round(sout$coefficients["treatmentplacebo", "Pr(>|z|)"], 4)`$). And this agrees with the corresponding confidence interval not containing zero (as it has to by the duality of hypothesis tests and confidence intervals). Of course this does ignore the data from centers 1 and 3. But that is what maximum likelihood does with these data. Once we decide that `center` needs to be in the model, then we have to let statistics do what it does. 
Of course, if we were interested in confidence intervals for centers 1 and 3, then we would have to do calculations like those in the preceding section. But as long as we are happy with what has been done in this section, then we're good. By the way ```{r agresti-clinical-try-glm} gout <- glm(y ~ center + treatment, family = "binomial") summary(gout) ``` gives a false negative by failing to warn `fitted probabilities numerically 0 or 1 occurred`. It does agree with our $P$-value for the treatment effect. But that is an accident of having orthogonal predictors. If `treatment` and `center` were correlated predictors, then we wouldn't get the same answers. # Theory of Confidence Intervals {#theory-confidence} ## Confidence Intervals for OM when there is a GDOR These are the intervals based on the region bounded by the blue curve in Figure \@ref(fig:agresti-complete-geyer-plot). We have already met them. Now for the details. Suppose the GDOR $\delta$ were known in advance of obtaining the data (this is not true; the GDOR depends on the data), and suppose we chose $\langle Y, \delta \rangle$ as a test statistic for a hypothesis test with a point null hypothesis $\beta$. We observe the largest possible value of this test statistic $\langle y, \delta \rangle$. Thus the $P$-value of this test is $$ \mathop{\rm pr}\nolimits_\beta\bigl( \langle Y, \delta \rangle = \langle y, \delta \rangle \bigr) = \mathop{\rm pr}\nolimits_\beta( Y \in H ) $$ where $H$ is given by \@ref(eq:h). If we think decision theoretically, we reject the null hypothesis when this $P$-value is less than the significance level, call that $\alpha$. When we invert a hypothesis test to get a confidence region, the confidence region is the set of parameter values considered as null hypotheses that are not rejected by the hypothesis test. Of course, the probability that the null hypothesis is not rejected is $1 - \alpha$, so that is the coverage probability of the confidence region. In short, our confidence region is the set of $\beta$ such that \begin{equation} \mathop{\rm pr}\nolimits_\beta( Y \in H ) \ge \alpha (\#eq:confreg-geyer) \end{equation} It makes no sense to use confidence regions of this kind when the MLE exists in the OM. Then this confidence region does nothing at all because saying there is no GDOR is equivalent to saying $\delta = 0$, hence $H$ is the whole vector space where the canonical statistic lives, hence the region satisfying \@ref(eq:confreg-geyer) is the whole parameter space. ## Confidence Intervals for LCM when it is Not Completely Degenerate When the LCM is not completely degenerate, it is an exponential family for which the MLE exists and the usual asymptotics may hold (if there is enough data after conditioning to make the LCM). So we use that. Let $l_\text{LCM}$ be the log likelihood function for the LCM. Then we make a confidence region that is the set of $\beta$ such that \begin{equation} 2 \bigl[ l_\text{LCM}(\hat{\beta}) - l_\text{LCM}(\beta) \bigr] \le \text{crit} (\#eq:confreg-lcm) \end{equation} where $\text{crit}$ is chosen to be an appropriate chi-square critical value with degrees of freedom chosen to give the desired coverage, and where in case the MLE exists in the OM we say LCM = OM, so the log likelihood in \@ref(eq:confreg-lcm) is the log likelihood of the OM. One uses degrees of freedom equal to the number of identifiable parameters in the LCM to get simultaneous coverage. One uses degrees of freedom equal to one if one doesn't care about simultaneous coverage. So much, so conventional. 
But things get a little stranger when we use the same parameterization for both of these kinds of confidence regions. If we use the parameter $\beta$ of the OM for both, then the region satisfying \@ref(eq:confreg-lcm) is also unbounded. The GDOR of the OM, at least, is a DOC of the LCM, so the region satisfying \@ref(eq:confreg-lcm) is unbounded in the direction of the GDOR and also in the opposite direction, and perhaps other directions (the LCM can have more DOC than just the GDOR of the OM).

It makes no sense to use confidence regions of this kind when the LCM is completely degenerate. Then this confidence region does nothing at all because the LCM contains only one distribution, so every direction is a DOC of the LCM and $l_\text{LCM}$ is a constant function of $\beta$, hence the region satisfying \@ref(eq:confreg-lcm) is the whole parameter space.

## The Combination of the Two

To use both kinds of information, we use the confidence region consisting of $\beta$ that satisfy both \@ref(eq:confreg-geyer) and \@ref(eq:confreg-lcm), except when the MLE exists in the OM, when we take the region satisfying \@ref(eq:confreg-lcm) only, or except when the LCM is completely degenerate, when we take the region satisfying \@ref(eq:confreg-geyer) only.

In case the LCM is partially degenerate and we are using both kinds of confidence regions, we have to correct for multiple testing. Because \@ref(eq:confreg-geyer) only involves the probability of the event $Y \in H$ and \@ref(eq:confreg-lcm) only involves the probability distribution of $Y$ given the event $Y \in H$, probabilities multiply (joint equals marginal times conditional). Thus if the region satisfying \@ref(eq:confreg-lcm) has confidence level $1 - \alpha_1$ and the region satisfying \@ref(eq:confreg-geyer) has confidence level $1 - \alpha_2$, then the confidence level of the combination is $(1 - \alpha_1) (1 - \alpha_2)$. And we need to adjust $\alpha_1$ and $\alpha_2$ to get the desired level. Usually we just take both $1 - \alpha_1$ and $1 - \alpha_2$ to be the square root of the desired level, except in case the MLE exists in the OM and we take $1 - \alpha_1$ to be the desired level or in case the LCM is completely degenerate and we take $1 - \alpha_2$ to be the desired level.

## Confidence Regions to Confidence Intervals

For any function $g(\beta)$ we can form a confidence interval for this parameter by minimizing and maximizing $g(\beta)$ over the region satisfying both \@ref(eq:confreg-geyer) and \@ref(eq:confreg-lcm).

# Quasi-Complete Separation Example of Agresti

```{r agresti-quasi-clean,echo=FALSE}
rm(list = setdiff(ls(), c("level", "do.curve", "logl.factory", "logl.gradient.factory", "save.complete")))
```

## Data {#data-quasi}

Section 6.5.1 of @agresti introduces the notion of quasi-complete separation with the following example.

```{r agresti-quasi-data}
x <- seq(10, 90, 10)
x <- c(x, 50)
y <- as.numeric(x > 50)
y[x == 50] <- c(0, 1)
data.frame(x, y)
```

Figure \@ref(fig:agresti-quasi-data-plot) shows these data.

```{r agresti-quasi-data-plot, echo = FALSE, fig.align = "center", fig.cap = "Logistic Regression Data for Quasi-Complete Separation Example of Agresti"}
plot(x, y)
```

## Fit Model

```{r agresti-quasi-llmdr}
lout <- llmdr(y ~ x, family = "binomial")
summary(lout)
```

Now we see that the model is not completely degenerate (because the `Estimates` column of the printout is not all `NA`) but is partially degenerate (because the printout has a `GDOR` column).
That's what Agresti calls quasi-complete separation. As usual mean value parameters are easier to interpret. ```{r agresti-quasi-means} data.frame(x, y, mu = estimates(lout, "mu")) ``` All of the components of the MLE $\hat{\mu}$ are on the boundary except those corresponding to $x = 50$ where we have one success and one failure and predict $\hat{\mu}_i = 0.5$. ## Valid Region for Beta {#valid-quasi} ```{r agresti-quasi-geyer,cache=TRUE} # names here are odd, these are the model matrix and response vector # for the data that are fixed at their observed values in the LCM modmat.lcm <- as.matrix(lout$x[x != 50, ]) y.lcm <- y[x != 50] logl <- logl.factory(modmat.lcm, y.lcm) logl.gradient <- logl.gradient.factory(modmat.lcm, y.lcm) level1 <- level2 <- sqrt(level) level1 alpha <- 1 - level2 crit <- (- log(alpha)) crit beta.geyer <- do.curve(function(beta) crit + logl(beta), logl.gradient, cutoff = -1000) ``` ```{r agresti-quasi-geyer-plot,fig.align="center",fig.cap="Confidence Region Consisting of $\\beta$ satisfying \\@ref(eq:confreg-geyer). Arrow is GDOR"} foo <- beta.geyer[,1] > -8 plot(beta.geyer[,1], beta.geyer[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.geyer[foo,1]), ylim = range(beta.geyer[foo,2]), type = "n") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) ``` The blue curves in Figures \@ref(fig:agresti-complete-geyer-plot) and \@ref(fig:agresti-quasi-geyer-plot) are different because the confidence levels are different. Now we want to add the region of $\beta$ satisfying \@ref(eq:confreg-lcm). First we get the likelihood-based confidence region for the LCM. Here we can do that the easy way using R function `confint` from R package `MASS`. ```{r agrest-quasi-mass-confint} gout <- glm(c(0, 1) ~ 1, family = "binomial") summary(gout) cout <- confint(gout, level = level1) cout ``` So now we know the likelihood-based confidence region for $\beta_1$ is $`r round(cout[1], 2)` < \beta_1 < `r round(cout[2], 2)`$ when we constrain $\beta_2 = 0$. (R function `summary` says `NA` for coefficients it constrains to be zero.) Now we know that the DOR for the OM is the only DOC for the LCM because we have only one non-identifiable parameter for the LCM ($\beta_2$) so only one DOC. Thus our confidence region of $\beta$ satisfying \@ref(eq:confreg-lcm) is the set of all $\beta$ of the form $$ (\beta_1, 0) + s \delta $$ with $`r round(cout[1], 2)` < \beta_1 < `r round(cout[2], 2)`$ and $s$ any real number and $\delta$ the GDOR. Its boundaries are the straight lines $$ (\pm `r round(cout[2], 2)`, 0) + s \delta $$ So add them to our plot. ```{r agresti-quasi-both-plot,fig.align="center",fig.cap="Confidence Regions. Blue: $\\beta$ satisfying \\@ref(eq:confreg-geyer). Pink: $\\beta$ satisfying \\@ref(eq:confreg-lcm). 
Arrow is GDOR"} plot(beta.geyer[,1], beta.geyer[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.geyer[foo,1]), ylim = range(beta.geyer[foo,2]), type = "n") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) gdor <- lout$gdor slope <- gdor[2] / gdor[1] intercept.lower <- (- slope * cout[1]) intercept.upper <- (- slope * cout[2]) abline(intercept.lower, slope, col = "pink3") abline(intercept.upper, slope, col = "pink3") ``` Our confidence region for $\beta$ is the intersection of the two regions, the region inside the blue curve and between the region between the pink lines in Figure \@ref(fig:agresti-quasi-both-plot). The lower pink line does eventually intersect the blue curve, but that is off the plot to the left. ## Intervals for Beta {#beta-quasi} So confidence intervals for $\beta$ are ```{r agresti-quasi-beta} beta <- matrix(NA_real_, 2, 2) colnames(beta) <- c("lower", "upper") rownames(beta) <- names(gdor) beta[1, 1] <- -Inf beta[2, 2] <- Inf beta[1, 2] <- max(beta.geyer[ , 1]) beta[2, 1] <- min(beta.geyer[ , 2]) beta ``` We won't bother to make these intervals more precise using optimization. [They are almost uninterpretable anyway](expfam.html#alice). Clearly they present a lot less information than Figure \@ref(fig:agresti-quasi-both-plot). ## Intervals for Theta {#theta-quasi} First we need to find the intersection of our two confidence regions. ```{r agresti-quasi-beta-both} nrow(beta.geyer) beta.upper <- beta.geyer[ , 1] * slope + intercept.upper is.up <- beta.geyer[ , 2] > beta.upper range(which(is.up)) beta.lower <- beta.geyer[ , 1] * slope + intercept.lower is.dn <- beta.geyer[ , 2] < beta.lower range(which(is.dn)) beta.both <- beta.geyer beta.both[is.up, 2] <- beta.geyer[is.up, 1] * slope + intercept.upper beta.both[is.dn, 2] <- beta.geyer[is.dn, 1] * slope + intercept.lower ``` Then we map to $\theta$ ```{r agresti-quasi-theta-both} modmat <- as.matrix(lout$x) theta.both <- tcrossprod(modmat, beta.both) theta <- apply(theta.both, 1, range) theta <- t(theta) colnames(theta) <- c("lower", "upper") theta # fix up infinite by hand theta[x < 50, "lower"] <- -Inf theta[x > 50, "upper"] <- Inf theta ``` We might as well polish these up with optimization because we need the R function `do.theta.intervals` defined in Section \@ref(intervals-for-theta-and-mu) above modified to do this problem in order to do the rest of our analysis of this example. ```{r agresti-quasi-theta-better} # names here are odd, these are the model matrix and response vector # for the data that are free in the LCM modmat.lcm <- as.matrix(lout$x[lout$lcm.support, ]) y.lcm <- y[lout$lcm.support] modmat.lcm y.lcm logl.lcm <- logl.factory(modmat.lcm, y.lcm) logl.lcm.gradient <- logl.gradient.factory(modmat.lcm, y.lcm) beta.hat.lcm <- c(0, 0) crit2 <- crit crit1 <- qchisq(level1, df = sum(! 
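# the degrees of freedom here (the number of non-NA coefficients) are the
# number of identifiable parameters of the LCM; crit1 is half the chi-square
# critical value, so the constraint logl.lcm(beta) - logl.lcm(beta.hat.lcm) +
# crit1 >= 0 used below is the likelihood region (confreg-lcm), while
# crit2 = - log(alpha) makes logl(beta) + crit2 >= 0 the same as
# pr(Y in H) >= alpha, the region (confreg-geyer)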
is.na(lout$coefficients))) / 2
crit1
crit2
# now constraint function for optimization has both constraints
hin <- function(beta) c(logl(beta) + crit2, logl.lcm(beta) - logl.lcm(beta.hat.lcm) + crit1)
hin.jac <- function(beta) rbind(logl.gradient(beta), logl.lcm.gradient(beta))
i <- sample(which(is.up), 1)
beta.both[i, ]; hin(beta.both[i, ]); hin.jac(beta.both[i, ])
jacobian(hin, beta.both[i, ])
a.little.up <- c(0, 1e-4)
a.little.dn <- c(0, -1e-4)
hin(beta.both[i, ] + a.little.up); hin(beta.both[i, ] + a.little.dn)
```

It seems a bit odd that neither of the constraint values returned by R function `hin` is exactly zero and that the closest is $`r hin(beta.both[i, ])[2]`$, but that is what inexact computer arithmetic and the default convergence tolerances for R function `auglag` do. If we take that number to correspond to a zero using infinite precision arithmetic, then this is what we expect. We are on the pink boundary where the second constraint is zero.

Oh. The inexactness here might be in where the pink lines are drawn, which might be the fault of R function `confint.glm` in R package `MASS`. We won't sort out who is to blame, because we will now be using R function `hin` to say when we are in the confidence region for $\beta$. We won't use the pink lines any more.

The last two lines show that if we go a little up, then we are outside the confidence region (one component of constraint function negative), whereas if we go a little down, then we are inside (both components positive).

```{r agresti-quasi-theta-better-too}
i <- sample(which(is.dn), 1)
beta.both[i, ]; hin(beta.both[i, ]); hin.jac(beta.both[i, ])
jacobian(hin, beta.both[i, ])
hin(beta.both[i, ] + a.little.up); hin(beta.both[i, ] + a.little.dn)
```

Again, looks OK, except for inexactness of something or other. Now a little up is inside and a little down is outside.

```{r agresti-quasi-theta-better-too-too}
i <- sample(which((! is.dn) & (! is.up)), 1)
beta.both[i, ]; hin(beta.both[i, ]); hin.jac(beta.both[i, ])
jacobian(hin, beta.both[i, ])
```

Now this is a point on the blue curve, and here the first constraint is nearly zero, and the jacobian looks OK here too. So it looks like our constraint functions and their gradients are OK.
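As an additional check, not done in the original analysis, one could compare the analytic Jacobian `hin.jac` with the numerical one from R function `jacobian` (R package `numDeriv`) at every point of `beta.both` at once and look at the worst discrepancy.

```{r agresti-quasi-jacobian-check, eval = FALSE}
# worst discrepancy between analytic and numerical Jacobians over all rows
# of beta.both (a sketch, not evaluated here)
max(apply(beta.both, 1, function(beta)
    max(abs(hin.jac(beta) - jacobian(hin, beta)))))
```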
```{r agresti-quasi-theta-better-too-too-too} do.theta.intervals <- function(x) { foo <- cbind(1, x) %*% t(beta.both) gdor <- lout$gdor lower <- rep(-Inf, length(x)) upper <- rep(+Inf, length(x)) zapsmall <- function(x) if (abs(x) < sqrt(.Machine$double.eps)) 0 else x for (i in seq(along = x)) { eta <- sum(c(1, x[i]) * gdor) eta <- zapsmall(eta) if (eta <= 0) { # need upper bound idx <- which(foo[i, ] == max(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.both[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = hin, hin.jac = hin.jac, control.outer = list(trace = FALSE), control.optim = list(fnscale = -1)) # looks like we have to accept 9, which is # Convergence due to lack of progress in parameter updates stopifnot(aout$convergence %in% c(0, 9)) upper[i] <- aout$value } if (eta >= 0) { # need lower bound idx <- which(foo[i, ] == min(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.both[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = hin, hin.jac = hin.jac, control.outer = list(trace = FALSE)) # looks like we have to accept 9, which is # Convergence due to lack of progress in parameter updates stopifnot(aout$convergence %in% c(0, 9)) lower[i] <- aout$value } } cbind(lower, upper) } ``` ```{r agresti-quasi-theta-better-too-too-too-too} theta <- do.theta.intervals(sort(unique(x))) rownames(theta) <- sort(unique(x)) theta ``` ## Intervals for Mu {#mu-quasi} And we can map these to the mean value parameter. ```{r agresti-quasi-mu-both} mu <- 1 / (1 + exp(- theta)) mu ``` And plot them. ```{r agresti-quasi-mu-plot,fig.align="center",fig.cap="Confidence intervals for Mean Value Parameters. Hollow circles are MLE."} errbar(sort(unique(x)), as.numeric(sort(unique(x)) > 50), mu[ , 2], mu[ , 1], xlab = "x", ylab = "probability") # add the other data point for x = 50 points(50, 1) ``` ## Submodel Canonical Statistic {#statistic-quasi} This is like Section \@ref(submodel-canonical-statistic) except for the difference in the data. ```{r agresti-quasi-all-possible-y} yy <- matrix(0:1, nrow = 2, ncol = length(x)) colnames(yy) <- paste0("y", seq(along = x)) yy <- expand.grid(as.data.frame(yy)) head(yy) dim(yy) ``` Get distinct possible values of the submodel canonical statistic and the observed values. ```{r agresti-quasi-all-possible-m-transpose-y} m <- cbind(1, x) mtyy <- t(m) %*% t(yy) t1 <- mtyy[1, ] t2 <- mtyy[2, ] t1.obs <- sum(y) t2.obs <- sum(x * y) ``` And make the plot. ```{r agresti-quasi-submodel-sufficient,fig.align="center",fig.cap="Possible values of the submodel canonical statistic vector $M^T y$ for the data shown in Figure \\@ref(fig:agresti-quasi-data-plot). Solid dot is the observed value. 
Line is hyperplane \\@ref(eq:h) determined by the GDOR."} t.key <- paste(t1, t2, sep = ":") t.idx <- match(unique(t.key), t.key) t1.uniq <- t1[t.idx] t2.uniq <- t2[t.idx] par(mar = c(5, 5, 1, 0) + 0.1) plot(t1.uniq, t2.uniq, xlab = expression(t[1] == sum(y)), ylab = expression(t[2] == sum(x * y))) points(t1.obs, t2.obs, pch = 19) # add H gdor <- lout$gdor # rotate 90 degrees to convert normal vector to tangent vector tanv <- rev(gdor) * c(-1, 1) tanv # now convert to slope and intercept of line slope <- tanv[2] / tanv[1] intercept <- t2.obs - t1.obs * slope abline(intercept, slope) ``` This differs from Figure \@ref(fig:agresti-complete-submodel-sufficient) in that now the support of the LCM, the points that lie on the hyperplane $H$, which is the line in the figure, has three points rather than one. Hence the LCM is not completely degenerate (concentrated at one point) but rather only *partially degenerate*. But otherwise the picture is much the same. The MLE does not exist in the OM because the observed data (for the submodel canonical sufficient statistic) is on the boundary of the convex hull of the set of its possible values. ## Region for Tau {#tau-quasi} This is like Section \@ref(region-for-tau) except for the difference in data. ```{r agresti-quasi-tau-region,fig.align="center",fig.cap="Same as Figure \\@ref(fig:agresti-quasi-submodel-sufficient) Except Gray Region is 95% Confidence Region for Submodel Mean Value Parameter"} theta <- tcrossprod(modmat, beta.both) mu <- 1 / (1 + exp(- theta)) tau <- crossprod(modmat, mu) # add observed value to these points tau <- cbind(c(t1.obs, t2.obs), tau) par(mar = c(5, 6, 1, 0) + 0.1) plot(t1.uniq, t2.uniq, xlab = expression(t[1] == sum(y)), ylab = expression(t[2] == sum(x * y)), type = "n") polygon(tau[1, ], tau[2, ], border = NA, col = "gray") points(t1.uniq, t2.uniq) points(t1.obs, t2.obs, pch = 19) abline(intercept, slope) mtext(expression(tau[1]), side = 1, line = 4) mtext(expression(tau[2]), side = 2, line = 5) ``` ## Quasi-Complete Separation {#separation-quasi} This is like Section \@ref(complete-separation) except for the difference in data. ```{r agresti-quasi-complete-separation,fig.align="center",fig.cap="Quasi-Complete Separation of Successes and Failures. Black dots successes. White dots failures. Gray dot, one success and one failure. Black line separating hyperplane"} dim(modmat) plot(modmat[ , 1], modmat[ , 2], xlab = colnames(modmat)[1], ylab = colnames(modmat)[2]) intercept <- 50 - 1 * slope abline(intercept, slope) points(modmat[y == 1, 1], modmat[y == 1, 2], pch = 19) points(modmat[x == 50, 1], modmat[x == 50, 2], pch = 19, col = "gray") ``` ## Confidence Intervals for Hypothetical Individuals {#hypothetical-quasi} This is like Section \@ref(confidence-intervals-for-hypothetical-individuals) except for the difference in data. 
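One detail is worth making explicit before the computation (this check is not in the original analysis): which side of the interval is unbounded for a hypothetical individual with predictor value $x$ is determined by the sign of $\langle (1, x), \delta \rangle$, where $\delta$ is the GDOR. Negative means the lower end is $-\infty$ on the logit scale (zero on the probability scale), positive means the upper end is $+\infty$ (one on the probability scale), and zero, which happens only at $x = 50$ up to rounding, means both ends are finite. A minimal sketch, using a few made-up $x$ values:

```{r agresti-quasi-hypothetical-sides, eval = FALSE}
# sign of <(1, x), gdor> for a few hypothetical individuals (a sketch)
x.hyp <- c(25, 50, 75)
cbind(x = x.hyp, eta.gdor = as.numeric(cbind(1, x.hyp) %*% lout$gdor))
```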
```{r agresti-quasi-hypothetical,cache=TRUE} xx <- seq(0, 100, 0.1) # be especially careful at jump xx <- c(xx, 49.99, 50.01) xx <- sort(xx) theta <- do.theta.intervals(xx) mu <- 1 / (1 + exp(- theta)) # have to do by hand since optimization isn't good enough mu[xx < 50, 1] <- 0 mu[xx > 50, 2] <- 1 ``` ```{r agresti-quasi-hypothetical-plot,fig.align="center",fig.cap="Like Figure \\@ref(fig:agresti-quasi-mu-plot) Except with Confidence Intervals for Every $x$ value"} plot(x, y, type = "n", xlab = expression(x), ylab = expression(mu(x))) xxx <- c(xx, rev(xx)) yyy <- c(mu[ , 1], rev(mu[ , 2])) length(xxx); length(yyy) polygon(xxx, yyy, border = NA, col = "gray") points(x, y) ``` That finishes our work on this toy problem. # Theory of Hypothesis Tests The theory of hypothesis tests propounded in @geyer-gdor is very simple, once you understand it. @geyer-gdor attributes it to Stephen Fienberg (who outlined this theory in answer to a question from your humble author asked in the questions after a talk Fienberg was giving). The main point is that in the theory of hypothesis testing, the distribution of the data used for calculations is the distribution *under the null hypothesis*. (This ignores power calculations.) Thus the LCM for the alternative hypothesis is irrelevant. (And it would be irrelevant even if we were doing power calculations, because those only involve alternatives near the null hypothesis.) The next most important point is that the MLE distribution *under the null hypothesis* is a conditional distribution (if you include conditioning on nothing, which is the same as not conditioning, in case no GDOR exists and the MLE is in the OM). Thus we have to fit the null and alternative hypotheses *using the same conditioning* for each. Everything else is straightforward. It is just the usual theory of hypothesis tests associated with maximum likelihood estimation (Wilks, Rao, Wald) applied to this conditional model. Actually, since Wald tests only use the MLE for the alternative hypothesis, they would only tell us about LCM for the alternative, so they don't make much sense except in the output of R function `summary`. It follows that we do not even need the MLE calculated by R function `llmdr` for the *alternative hypothesis* (for Wilks or Rao). But we can use the deviance it calculates. It follows that we have to refit the *alternative hypothesis* using the conditioning for the *null hypothesis*. The method of R generic function `anova` for objects of class `llmdr` does this. The appropriate degrees of freedom for the *alternative hypothesis* will be the degrees of freedom of this refitted model. And these degrees of freedom will not be either of the degrees of freedom in the object returned by R function `llmdr` when it calculates the MLE for the *alternative hypothesis*. Thus a model has many possible degrees of freedom * One is the number of identifiable parameters of the OM. This is the degrees of freedom one should use for AIC and BIC (maybe, for a counterargument, see the [section about this below](#aic-and-bic-and-lcm)). * Another is the number of identifiable parameters of the LCM. * Yet another is the number of identifiable parameters of the model when considered as an alternative hypothesis. This depends on what the null hypothesis is. In general, it can be anything between the two degrees of freedom listed above. In summary, our theory of hypothesis testing is fairly conventional. The test statistic (Wilks, Rao, or Wald) still has an asymptotic chi-square distribution. 
But, unless the MLE for the null hypothesis is in its OM, the *degrees of freedom are different* from conventional. So we just have to remember that the degrees of freedom for the alternative hypothesis are different from conventional and depend on what the null hypothesis is.

# Categorical Data Example of Geyer

```{r geyer-categorical-cleanup,echo=FALSE}
rm(list = setdiff(ls(), "save.complete"))
```

This is the example in Section 2.3 of @geyer-gdor. It is a seven-dimensional contingency table, each dimension having two values, so the number of cells of the table is $2^7 = `r 2^7`$. These data are found in dataset `catrec` in R package `llmdr`

```{r geyer-categorical-data}
data(catrec)
class(catrec)
dim(catrec)
names(catrec)
```

The response (vector of cell counts) is `y`. The other variables are the indicators for each dimension.

@geyer-gdor fits the model with all possible three-way interactions and no higher interactions

```{r geyer-categorical-fit}
lout <- llmdr(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^3,
    family = "poisson", data = catrec, proofs = 2)
length(lout$coefficients)
sum(is.na(lout$coefficients))
```

We don't print out the summary, because it is too long and is about [meaningless parameters](expfam.html#alice) anyway.

We notice that the dimension of the LCM is one less than the dimension of the OM, which means the GDOR points in a unique direction (every GDOR for this problem points in the same direction). This GDOR has nonnegative components

```{r geyer-categorical-show-gdor}
# did proofs = 2 above to get GDOR with exactly correct zeros
cbind(lout$gdor[lout$gdor != 0])
```

This agrees with Table 1 in @geyer-gdor.

@geyer-gdor considers the following hypothesis tests.

```{r geyer-categorical-anova}
lout0 <- llmdr(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^2,
    family = "poisson", data = catrec)
lout2 <- llmdr(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^4,
    family = "poisson", data = catrec)
anova(lout0, lout, lout2)
```

This says the model with all three-way interactions and no higher-order interactions, which has $\text{OM} \neq \text{LCM}$, fits the data much better than the model with only two-way interactions. And this model fits just as well as the model with all four-way interactions.

Note that in line two of the printout, the degrees of freedom are different for the three-way interactions model, depending on whether it has the role of null hypothesis or alternative hypothesis.

There are, of course, many, many more models to consider. But, as this is being written, R function `glmbb` in R package `glmbb` [@glmbb-package] uses up all memory the computer has and then crashes when applied to this problem, and R functions `add1`, `drop1`, and `step` do not have methods for objects returned by R function `llmdr`. So we could do whatever hypothesis tests we can think of, but we will have to wait until R package `llmdr` is more complete or R package `glmbb` is fixed to use less memory in order to search among the very many models for these data.

# Multinomial and Product Multinomial Sampling

Unlike R function `glm`, the new function `llmdr` also does multinomial and product multinomial sampling. Let's see how that works. This is a competitor for "response" in scare quotes, which we discussed in the [notes accompanying Agresti, Chapter 8](ch8.html).

## Alligators

```{r geyer-alligators-cleanup,echo=FALSE}
rm(list = setdiff(ls(), "save.complete"))
```

Let's redo an [example previously done in those notes](ch8.html#multinomial-response).
```{r alligators,error = TRUE}
data(table_8.1)
alligator <- transform(table_8.1,
    lake = factor(lake, labels = c("Hancock", "Oklawaha", "Trafford", "George")),
    gender = factor(gender, labels = c("Male", "Female")),
    size = factor(size, labels = c("<=2.3", ">2.3")),
    food = factor(food, labels = c("Fish", "Invertebrate", "Reptile", "Bird", "Other")))
names(alligator)
lout <- llmdr(count ~ food * (size + gender + lake),
    strata = ~ size * lake * gender,
    family = "multinomial", data = alligator)
summary(lout)
```

This is the Right Thing. It does not show any coefficients for regressors in `strata` but does condition on them. The estimates and their standard errors are those for product multinomial sampling.

But this doesn't have anything to do with the subject of these notes except that we have proved that the MLE exists in the OM (something R function `glm` cannot do).

But when we ran R function `glmbb` on this, we had warnings.

```{r alligators-glmbb}
gout <- glmbb(big = count ~ lake * gender * size * food,
    little = ~ lake * gender * size + food,
    family = poisson, data = alligator)
```

What were the models for which it should have given warnings?

```{r alligators-warnings}
foo <- gout$envir
rm(min.crit, envir = foo)
bar <- eapply(foo, function(x) x$formula)
lout <- lapply(bar, function(x) llmdr(x, family = "poisson", data = alligator))
is.lcm <- sapply(lout, function(x) x$is.lcm)
sum(is.lcm)
```

(We use `family = "poisson"` here to match the formulas `glmbb` uses, and then we don't need argument `strata`.)

R function `glm` should have given `r sum(is.lcm)` warnings rather than two! So this is proof that its idea of warnings is completely inadequate. It gives lots of false negatives. It can also give false positives, but we haven't shown an example of that yet.

Here are the models with LCM

```{r alligator-lcm-show}
sapply(bar[is.lcm], tidy.formula.hierarchical)
```

Eventually we need to make R package `glmbb` use R function `llmdr` to fit models rather than R function `glm`. But that is not done yet.

## A Digression about AIC and BIC and LCM {#aic-and-bic-and-lcm}

### AIC

To understand the issues here we need to do a little bit of theory. The mathematical argument that leads to AIC assumes the "usual asymptotics of maximum likelihood". An MLE at infinity (for canonical parameters, on the boundary for mean value parameters) means those assumptions are wrong. On the other hand, those "usual asymptotics of maximum likelihood" may apply to the LCM if the conditioning that produces the LCM leaves enough components of the response vector random that the sample size can be considered "large" and the parameter estimates in the LCM are approximately normal.

When the LCM is completely degenerate, so it has no parameters to estimate, the asymptotic argument does still apply if we consider the chi-square distribution on zero degrees of freedom to be the distribution of the constant random variable always equal to zero (which R does)

```{r aic-chi-square-zero}
pchisq(0, df = 0)
pchisq(1e-16, df = 0)
```

It follows that the correct $p$ to use in the definition
$$
   \text{AIC} = \text{deviance} + 2 p
$$
is what R function `llmdr` calls the degrees of freedom for the LCM, component `df` of the result it returns.

There really cannot be any argument about this if you follow the theory. AIC is trying to be an unbiased estimate of Kullback-Leibler information (times two).
On or near the boundary of the mean value parameter space the Kullback-Leibler information function is flat or nearly flat in the same directions as the log likelihood function; there are only `df` dimensions in which it is strongly curved and can be overestimated. For an explicit example where this is clearly correct, let us redo the [complete separation example of Agresti](#complete-separation-example-of-agresti) just doing AIC. Recall this analysis of deviance table ```{r aic-agresti-complete-anova} save.complete$anova ``` Now we want AIC. We claim that the correct AIC are ```{r aic-agresti-complete} with(save.complete$null, deviance + 2 * df) with(save.complete$alternative, deviance + 2 * df) ``` This is not what R function `glm` and friends (methods for generic functions that handle objects of class `"glm"`) calculates ```{r aic-agresti-complete-glm,error=TRUE} with(save.complete, AIC(glm(null$formula, family = "binomial", data = data))) with(save.complete, AIC(glm(alternative$formula, family = "binomial", data = data))) ``` Of course either calculation says the model with formula `r deparse(save.complete$alternative$formula)` is best, but that's not the point. The point is that we have issues with how to calculate AIC. This section recommends one thing. R function `glm` and friends do another. There is no question that R function `glm` and friends are theoretically incorrect. The AIC argument says that it is a bias-corrected estimate of Kullback-Leibler information (times two). It corrects for overfitting. But the model with formula `r deparse(save.complete$alternative$formula)` cannot overfit. All randomness is gone in a completely degenerate LCM. But I can imagine the counterarguments. Those arguments ignore the math behind AIC and only pay attention to the way it is usually used. Sparsity fans will argue that we are really looking for sparsity and AIC is a penalty that rewards sparsity, and if we redefine AIC the way we recommend here, that won't any longer be rewarding sparsity. And that was what was always good about AIC (never mind what theory says). So I understand that a lot of people might not like what this section says (mathematically correct or no). They will continue to use their grandfather's version of AIC. ### BIC #### n BIC has the same issue as AIC, in the definition $$ \text{BIC} = \text{deviance} + \log(n) \cdot p $$ $p$ has to be the dimension of the LCM to make the mathematical argument behind BIC correct (with the same caveat that there has to be enough data left in the LCM for estimates in the LCM to be approximately normally distributed, except when the LCM is completely degenerate and has no data). BIC also has the other issue about $n$, which is discussed in the help page for R function `glmbb` which we quote > It is unclear what the sample size, the $n$ in the BIC penalty > $n \log(p)$ should be. Before version 0.4 of this package the BIC was > taken to be the result of applying R generic function `BIC` to the > fitted object produced by R function `glm`. This is generally wrong > whenever we think we are doing categorical data analysis > [@raftery; @kass-raftery]. Whether we consider the sampling scheme > to be Poisson, multinomial, or product multinomial (and binomial is > a special case of product multinomial) the sample size is the total > number of individuals classified and is the only thing that is > considered as going to infinity in the usual asymptotics for > categorical data analysis. 
Thus the option `BIC.option = "sum"` > should always be used for categorical data analysis. In more detail, if you are doing purely categorical data analysis (categorical response, all predictors categorical too, so the data can be reformatted as a contingency table) then you are not imagining the number of cells of the contingency table `length(y)` to be going to infinity, you are imagining the number of individuals classified `sum(y)` to be going to infinity. Thus `sum(y)` is the only choice for $n$ in the BIC formula that makes sense. So the cited literature says. Once we start thinking along these lines, it is unclear that `length(y)` which is what R function `glm` and friends do (with no option) ever makes sense. BIC was originally defined [@schwarz] for independent and identically distributed (IID) data. For that, it is clear what $n$ means: it is the number of IID random variables in the data. For categorical data analysis, that is the number of individuals classified (the individuals are IID if they are a random sample). BIC can be extended to non-IID data. @cavanaugh-neath give all the gory details. But they introduce $n$ in a completely artificial way in their assumptions. They imagine a sequence of statistical models with log likelihoods $l_n$ all having the same parameter $\theta$. Then they assume $$ \frac{1}{n} l_n(\theta) $$ converges (for all $\theta$, this is their assumption (5)). So how does that make sense for regression if we are assuming the dimension of the response vector is going to infinity, so the dimension of the predictor vectors must be going to infinity to match? That convergence assumption implies a very restrictive condition on model matrices of the models in the sequence of models. It is not clear that this story ever makes sense. This is the reason why your humble author wrote a paper [@geyer-no-n] that does asymptotics without $n$. But if you take the $n$ out of asymptotics, then BIC actually makes no sense because there is no $n$ to go into the formula. In summary, it is very unclear what $n$ should be in the BIC formula when we do not have IID data. In particular, it is not clear that what R function `glm` and friends do (in particular, R generic function `BIC` does when applied to R objects of class `"glm"`, using `length(y)` rather than `sum(y)` as $n$) ever makes sense. #### p The case is even worse for $p$. The derivation of BIC [@neath-cavanaugh] can use many different priors, but one possibility is flat within-model priors (as an approximation to Bayes factors, it does not depend on between model priors). And we know from our [notes on priors for exponential families](prior.html) that the posterior (and hence the Bayes factor) does not exist when the MLE does not exist (in the conventional sense) and a flat prior is used. Thus Bayes factors make no sense when one model being compared has solutions at infinity (the subject of this document). Or we might say that any model with solutions at infinity is *infinitely bad* when *flat priors are used*. But such priors are *improper* and make no philosophical sense. So this is not a problem with the models but rather with the whole idea of BIC. ## AIC for Alligators So back to alligators. We now think R is calculating AIC wrong for those models that have LCM. R does not use the same definitions of AIC and BIC that we do. It adds a constant to our definitions, that is, $$ \text{AIC} = \text{deviance} + \text{constant} + 2 p $$ and similarly for BIC. The constant doesn't matter. 
It drops out of comparisons and calculation of weights. But we will use the same constant so we don't have to change all of the AIC values we already have.

So we recalculate AIC for these models. Because we used different formulas for R function `llmdr` we need to change these AIC values by hand (eventually this should be automated). The whole issue is that we have different degrees of freedom output by R functions `glm` and `llmdr`, and the latter are correct (so we say). Here are those degrees of freedom.

```{r alligator-aic-sout}
df.glm <- unlist(eapply(foo, function(x) x$rank))
df.llmdr <- sapply(lout, function(x) x$df)
identical(names(df.glm), names(df.llmdr))
aic.glm <- unlist(eapply(foo, AIC))
aic.llmdr <- aic.glm + 2 * (df.llmdr - df.glm)
```

So now we have a problem: which models are within the cutoff (10, the default) of the minimum AIC differs between the two definitions, so we would need a different summary for each, but we only have a function to do one.

```{r min.aic}
min(aic.glm)
min(aic.llmdr)
```

We see that, since the new minimum AIC is more than the cutoff below the old minimum AIC, none of the models that were supposedly best under the old criterion are even on our new list. Because we did not use R function `llmdr` as the model fitting function for R function `glmbb`, we do not know that we didn't miss some models that have low AIC under the correct definition. So we don't bother to make a new table. That will have to wait until we fix R function `glmbb`.

It should come as a shock to fans of the old AIC that what is considered the best model changes, so it would change the model selection done by AIC. Of course, model selection is a [mug's game](https://www.merriam-webster.com/dictionary/mug%27s%20game), but, even if we do model averaging, the weights change, and not just a little bit but a whole lot.

## Other Ways to Specify Multinomial Data

R function `multinom` in R package `nnet` specifies multinomial response data in a different way. It has the response vector actually be a matrix or a factor. Then it doesn't need a `strata` argument. R function `llmdr` allows this too. We will illustrate the response being a factor in the next section. Here we show the response as a matrix.

```{r alligator-response-matrix-what}
head(alligator$food)
```

It looks like in these data food varies fastest, so we can reshape it into a matrix as follows.

```{r alligator-response-matrix}
food <- alligator$count
food <- matrix(food, byrow = TRUE, ncol = nlevels(alligator$food))
colnames(food) <- levels(alligator$food)
head(food)
```

But we have to check that this is actually right, that the other variables agree for each level of this new response matrix.

```{r alligator-predictors}
alligator.predictors <- alligator
alligator.predictors$count <- NULL
alligator.predictors$food <- NULL
names(alligator.predictors)
for (i in levels(alligator$food)) {
    foob <- subset(alligator.predictors, alligator$food == "Bird")
    fooi <- subset(alligator.predictors, alligator$food == i)
    rownames(foob) <- NULL
    rownames(fooi) <- NULL
    print(identical(foob, fooi))
}
alligator.predictors <- subset(alligator.predictors, alligator$food == "Bird")
```

```{r alligator-wide-too,echo=FALSE}
opt.save <- options(width=200)
```

Great! We now have the data we need.
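As one more sanity check (not in the original analysis), the row sums of `food` should be the product multinomial sample sizes, that is, the number of alligators classified in each lake-size-gender combination.

```{r alligator-response-matrix-rowsums, eval = FALSE}
# hypothetical extra check: row sums of the response matrix are the
# per-stratum sample sizes n_A
cbind(alligator.predictors, n = rowSums(food))
```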
```{r alligator-response-matrix-llmdr} lout <- llmdr(food ~ lake * size, family = "multinomial", data = alligator.predictors) summary(lout) ``` ```{r alligator-wide-too-restore,echo=FALSE} options(opt.save) ``` When you get the data in the right format, the analysis is a lot simpler, but getting the data in the right format can be a bit of a struggle. ## A Simple Multinomial Model with Completely Degenerate LCM ```{r multinomial-complete-cleanup,echo=FALSE} rm(list = ls()) ``` ```{r multinomial-complete-data} x <- seq(10, 90, 10) y <- rep(c("red", "green", "blue"), each = 3) y <- factor(y, levels = c("red", "green", "blue")) data.frame(x, y) ``` We fit these data as follows. ```{r multinomial-complete-mle} lout <- llmdr(y ~ x, family = "multinomial") summary(lout) ``` As usual, it is [hard to make sense of canonical parameters](expfam.html#alice), but from the fact that the `Estimates` column is all `NA` we see the LCM is completely degenerate. As usual, mean value parameters make more sense. ```{r multinomial-complete-mu} cbind(x, estimates(lout, type = "mu")) ``` The low values of `x` are all red, the middle values all green, and the high values all blue. So the model can perfectly predict these data. ## A Simple Multinomial Model with Partially Degenerate LCM We change the data a little bit. ```{r multinomial-quasi-data} y[6] <- "blue" y[7] <- "green" y <- factor(y, levels = c("red", "green", "blue")) data.frame(x, y) ``` And refit ```{r multinomial-quasi-mle} lout <- llmdr(y ~ x, family = "multinomial") summary(lout) cbind(x, estimates(lout, type = "mu")) ``` Now we see that the LCM is only partially degenerate because the `Estimate` column is not all `NA` but the actual numbers in the `Estimate` and `GDOR` columns don't make much sense. But from the mean values it is clear what is going on. Now `x` fails to perfectly predict green and blue but still perfectly predicts red. The probability of green decreases with `x` and the probability of blue increases. Of course, it looks like with such a small amount of data that the parameters of the LCM are not statistically significant, either of them. But to make a conclusion about dropping two parameters we should actually fit both models and compare. But there doesn't seem to be any way to do that with the data in this form. It wants to use the same model matrix for each stratum, so we cannot keep `red` and `x:red` as predictors while dropping the rest. We can do that if we put the data into data frame form and use explicit strata. ```{r multinomial-quasi-redo} xx <- rep(x, times = 3) color <- rep(levels(y), each = length(x)) yy <- double(length(xx)) for (i in 1:length(x)) yy[xx == x[i] & color == y[i]] <- 1 data.frame(x = xx, color, y = yy) ``` Now the models we want to compare are ```{r multinomial-quasi-test-alt} lout.alternative <- llmdr(yy ~ xx * color, family = "multinomial", strata = as.factor(xx)) summary(lout.alternative) data.frame(x = xx, color, y = yy, mu = estimates(lout.alternative, "mu")) ``` Same mean values, so same model, just expressed differently. Now for the null model, which does have red, but not the other colors. ```{r multinomial-quasi-test-null} red <- as.numeric(color == "red") lout.null <- llmdr(yy ~ xx * red, family = "multinomial", strata = as.factor(xx)) summary(lout.null) data.frame(x = xx, color, y = yy, mu = estimates(lout.null, "mu")) ``` Right. 
Predicts red perfectly, says nothing about blue or green, where "says nothing" means zero on the logit scale, or $1/2$ on the probability scale, and this really does not say anything about blue or green, they are 50-50 given $x \ge 40$. And then we have the really, really null hypothesis ```{r multinomial-quasi-test-null.null} lout.null.null <- llmdr(yy ~ red, family = "multinomial", strata = as.factor(xx)) summary(lout.null.null) data.frame(x = xx, color, y = yy, mu = estimates(lout.null.null, "mu")) ``` So finally the test. ```{r multinomial-quasi-test} anova(lout.null.null, lout.null, lout.alternative) ``` These data really don't say anything about blue and green other than red versus non-red is predicted perfectly. And we cannot drop $x$ entirely. ## On the Baseline-Category Logit Parameterization R function `multinom` in R package `nnet` uses the baseline-category logit parameterization ```{r multinomial-quasi-nnet} mout <- multinom(y ~ x) summary(mout) round(predict(mout, type = "probs"), 7) estimates(lout, type = "mu") ``` Not too different. The differences are that R function `multinom` does not understand generic directions of recession, so can't do the right thing, and isn't even trying to do maximum likelihood, but rather what some neural net thinks is somewhat optimal. If you make allowances for extreme sloppiness, then it has the right answer, at least for the probabilities. But what we wanted to talk about is what @agresti, first page of Chapter 8, calls the baseline-category logit parameterization of the multinomial. In this example, it makes the canonical parameters $$ \log \frac{\pi_\text{color}(x)}{\pi_\text{red}(x)} $$ which makes the canonical parameters for anything red zero, hence they are dropped, which is why `summary(mout)` above does not mention red. This is what the exponential families notes call the [try II parameterization](expfam.html#try-ii). But R function `llmdr` does not use this parameterization. Instead it uses the symmetric parameterization, what the exponential families notes call the [try III parameterization](expfam.html#try-iii). Unlike baseline-category logit, this parameterization does not constrain any canonical parameters to be zero. Thus it is not identifiable. It is the parameterization you get when you start with Poisson sampling, and condition to get product multinomial sampling *without changing the canonical parameters*. Why? The baseline-category logit parameterization seems easy enough to understand, and is identifiable. There is the issue of arbitrariness of which category is chosen to be "baseline", but we ignore that. But in hindsight, considered in light of directions of recession, that parameterization seems to be a really dumb idea. * It amounts to constraining equal to zero parameters that need to go to infinity to maximize the likelihood. * Moreover, even if we move directions of recession to the baseline-category logit parameterization, that parameterization need not make the LCM identifiable. We see the latter in our toy problem. R function `nnet` chooses red to be "baseline" for arbitrary reasons. But in the LCM we see that red has been conditioned out of the LCM. The LCM only has data for blue or green and $x \ge 40$. Now trying to say that red is "baseline" seems almost maximally stupid. So OK. We cannot choose a baseline category until the LCM has already been determined. But we can then, right? No. 
It is possible to construct models having LCM in which no choice of multinomial response category to be "baseline" will make the LCM identifiable. So why this "baseline" category stuff anyway? R functions `lm` and `glm` and `llmdr` (which calls `glm.fit` to fit LCM) have no trouble figuring out which coefficients to "drop" to get an identifiable model. They don't need to be given an identifiable parameterization. So why fuss about a particular one that will only lead to insoluble problems in some instances? Other issues with the baseline-category logit parameterization are illustrated in the next section. ## Mapping Canonical to Mean Value Parameters For product multinomial data with response vector $y$ and strata $\mathcal{A}$, this means the subvectors $y_A$ for $A \in \mathcal{A}$ are the independent multinomial parts of $y$, the saturated model usual parameters are given in terms of the canonical parameters by $$ \pi_i = \frac{e^{\theta_i}}{\sum_{j \in A} e^{\theta_j}}, \qquad i \in A \in \mathcal{A} $$ (and there is always exactly one $A$ satisfying the condition of this equation because $\mathcal{A}$ partitions the index set of $y$). Then the mean value parameters are $$ \mu_i = n_A \pi_i, \qquad i \in A \in \mathcal{A}. $$ So to keep track of what is going on with the various parameterizations, we have to keep this non-identifiability of $\beta$ and $\theta$ in mind. ```{r multinomial-quasi-theta} theta <- estimates(lout, "theta") theta ``` If we really, really wanted red to be the baseline category, then we could do that. Then we would have zero for the whole red column. And the blue column would be `Inf` for $x \ge 40$. And then the green column would have to be `Inf + `r theta[4,2]``, `Inf + `r theta[5,2]``, and so forth for $x \ge 40$. What does that mean? It means these parameters go off to infinity at the same rate, in lockstep, keeping a constant difference, \begin{align*} \theta_{40\ \text{green}} & = \theta_{40\ \text{blue}} + `r theta[4, 2]` \\ \theta_{50\ \text{green}} & = \theta_{50\ \text{blue}} + `r theta[5, 2]` \\ \theta_{60\ \text{green}} & = \theta_{60\ \text{blue}} + `r theta[6, 2]` \\ \theta_{70\ \text{green}} & = \theta_{70\ \text{blue}} - `r -theta[7, 2]` \end{align*} and so forth. See what I mean when I say that the baseline-category logit parameterization causes far more pain than enlightenment? But if we keep to the symmetric parameterization that R function `llmdr` is using, then we have to know how to use it. We go from $\theta$ to $\pi$ in two steps. ```{r multinomial-quasi-theta-to-pi} foo <- exp(theta) foo ``` Then we have to make the probabilities in each stratum sum to one. Here the strata are the rows of the matrix. ```{r multinomial-quasi-theta-to-pi-too} bar <- rowSums(foo) mu <- sweep(foo, 1, bar, "/") mu ``` Of course, R function `estimates` does this for us, so we don't have to understand this to get the numbers, but this is what it is doing to get them. ```{r multinomial-quasi-theta-to-pi-too-too} estimates(lout, "pi") ``` ## DOR, DOC, and GDOR for Multinomial {#dor-multinomial} Usually `family = "multinomial"` means product multinomial, so we first need to start with the symmetric parameterization for product multinomial sampling [described above](#mapping-canonical-to-mean-value-parameters). 
We use a theorem described in Section 3.17 of @geyer-gdor, which says, if you can find DOR or GDOR for Poisson sampling, then they also work for multinomial or product multinomial sampling (actually @geyer-gdor leaves the product multinomial case as an exercise for the reader, but it is fully worked out in the design document for R package `llmdr` [@llmdr-package], which is found in every installation of the package). This theorem needs one idea. The indicator vectors for strata are vectors $u$ having coordinates $$ u_i = \begin{cases} 1, & i \in A \\ 0, & i \notin A \end{cases} $$ where $A \in \mathcal{A}$ are the sets of indices for strata, as in the [section on multinomial parameterization](#mapping-canonical-to-mean-value-parameters) above. Then the exact statement of the theorem is that the model matrix for Poisson sampling has to have all the columns of the model matrix for product multinomial sampling plus all the indicator vectors for multinomial strata as columns. Then every MLE for the Poisson model is also an MLE for the product multinomial model (when both use the same parameterization, that is, the product multinomial is using the symmetric parameterization) and every DOR for the Poisson model is a DOR for the product multinomial model. The only difference is that the parameterization for Poisson sampling is (usually) identifiable, and the parameterization for product multinomial sampling is never identifiable. Its directions of constancy are exactly the indicator vectors for strata described above. Thus we can immediately drop all components of the Poisson parameter vector that correspond to indicator vectors of strata, and similarly for components of DOR. They are not identifiable. This leaves us with MLE for $\beta$ and GDOR for product multinomial sampling. Thus we only know how to solve multinomial problems indirectly, by converting them to Poisson problems, and then converting back at the end. A much more complete characterization of product multinomial DOR, DOC, and GDOR is found in the design document for R package `llmdr` cited above. # Bibliography