--- title: "Stat 5421 Lecture Notes: Completion of Exponential Families" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" nocite: | @alabama-package, @numderiv-package, @sfsmisc-package, @mass-book, @mass-package, @rmarkdown-book, @rmarkdown-package, @rmarkdown-cookbook @nnet-package glmbb-package output: bookdown::html_document2: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML toc: true toc_depth: 2 bookdown::pdf_document2: latex_engine: xelatex number_sections: true md_extensions: -tex_math_single_backslash toc: true toc_depth: 2 linkcolor: blue urlcolor: blue bibliography: expfam.bib csl: journal-of-the-royal-statistical-society.csl link-citations: true --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R ```{r libraries} library(alabama) library(rcdd) library(llmdr, lib.loc = "~/Software/llmdr/package/llmdr.Rcheck") library(sfsmisc) library(MASS) library(glmbb) library(nnet) library(CatDataAnalysis) ``` * The version of R used to make this document is `r getRversion()`. * The version of the `alabama` package used to make this document is `r packageVersion("alabama")`. * The version of the `numDeriv` package used to make this document is `r packageVersion("numDeriv")`. * The version of the `rcdd` package used to make this document is `r packageVersion("rcdd")`. * The version of the `llmdr` package used to make this document is `r packageVersion("llmdr")`. * The version of the `sfsmisc` package used to make this document is `r packageVersion("sfsmisc")`. * The version of the `MASS` package used to make this document is `r packageVersion("MASS")`. * The version of the `glmbb` package used to make this document is `r packageVersion("glmbb")`. * The version of the `nnet` package used to make this document is `r packageVersion("nnet")`. * The version of the `CatDataAnalysis` package used to make this document is `r packageVersion("CatDataAnalysis")`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. At present, R package `llmdr` is not publicly available. That is why it is loaded from a nonstandard location on the author's computer. It will be available soon. # History of Theory Maximum likelihood estimates (MLE) do not have to exist. Theoretically, the problem of when MLE exist in regular full exponential families of distributions has been completely understood for a long time. For exponential families having finite support, this theory is due to @barndorff-nielsen, Theorem 9.15 and the surrounding discussion. Finite support means there are only a finite number of possible values of the canonical statistic vector, as happens in logistic regression and log-linear models for categorical data when multinomial or product-multinomial sampling is assumed. @brown, Chapter 6, Section "Aggregate Exponential Families" extended Barndorff-Nielsen's theory to exponential families with countable support under certain regularity conditions. Countable support means we have a discrete probability distribution, and Poisson regression with log link and log-linear models for categoriccal data when Poisson sampling is assumed satisfy Brown's regularity conditions. 
@geyer-gdor reworked this theory to use the concepts

* direction of recession (DOR)

* direction of constancy (DOC)

* generic direction of recession (GDOR)

# Directions in the Canonical Parameter Space

Following @geyer-gdor we are going to be sloppy about the mathematical meaning of the word *direction*. Strictly speaking, it is the direction in which a vector points, so every positive scalar multiple of a vector points in the same direction. So a direction is not a vector. Vectors have length, directions do not. Also the zero vector does not point in any direction. But we are just going to call vectors directions, and we are going to allow the zero vector. There is a reason for this in the correspondence between DOR and normal vectors (@geyer-gdor, Theorem 3), which we do not want to discuss. Mostly this does not cause confusion. Occasionally it does lead to some clumsy language.

When the MLE does not exist in the original model (OM), this is because there is a DOR that is not a DOC (@geyer-gdor, Theorems 1, 3, and 4).

Here are the definitions of these concepts that give them their names.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOR if the log likelihood function $l$ is nondecreasing in that direction, meaning for any $\theta$ in the canonical parameter space and any positive real number $s$ we have $l(\theta) \le l(\theta + s \delta)$.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOC if the log likelihood function $l$ is constant in that direction, meaning for any $\theta$ in the canonical parameter space and any real number $s$ we have $l(\theta) = l(\theta + s \delta)$.

Every DOC is a DOR, but not vice versa.

And here are some equivalent characterizations. In them $Y$ denotes the canonical statistic vector, thought of as a random vector having some probability distribution in the family, and $y$ denotes its observed value.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOC if and only if for any real number $s$ and any $\theta$ in the canonical parameter space, $\theta$ and $\theta + s \delta$ correspond to the same distribution.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOC if and only if $\langle Y, \delta \rangle$ is a constant random variable.

* A vector $\delta$ in the canonical parameter space of a regular full exponential family is a DOR if and only if $\langle Y, \delta \rangle \le \langle y, \delta \rangle$ with probability one.

In summary,

* $\delta$ is a DOC if and only if $\langle Y, \delta \rangle$ is constant,

* $\delta$ is a DOR if and only if $\langle y, \delta \rangle$ is extreme, the upper endpoint of the range of $\langle Y, \delta \rangle$, and

* $\delta$ is a DOC if and only if $\theta$ and $\theta + \delta$ correspond to the same distribution.

The last bullet point here says that DOC are about identifiability. If there is a DOC, then we need to drop some parameters to gain identifiability. Usually, the software can do this for us.

The other two bullet points here are about maximum likelihood. The MLE exists in the OM if and only if every DOR is a DOC (@geyer-gdor, Theorem 4). What happens when the MLE does not exist in the OM is the subject of the following section.
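Before moving on, here is a minimal sketch (with made-up data and a made-up candidate direction, shown but not evaluated when this document is built) that checks the last two bullet points by brute force for a saturated model with four independent Bernoulli components, so that the canonical statistic $Y$ has only $2^4$ possible values.

```{r doc-dor-toy-check, eval = FALSE}
yobs <- c(0, 0, 1, 1)     # observed value of the canonical statistic (made up)
delta <- c(-1, 0, 0, 1)   # made-up candidate direction
# all 2^4 possible values of the canonical statistic vector Y
ysamp <- as.matrix(expand.grid(rep(list(0:1), length(yobs))))
# DOR check: <Y, delta> <= <y, delta> for every possible Y
all(ysamp %*% delta <= sum(yobs * delta))
# DOC check: <Y, delta> is constant over all possible Y
length(unique(drop(ysamp %*% delta))) == 1
```

For this toy example the first check succeeds and the second fails, so this $\delta$ is a DOR that is not a DOC, and the MLE does not exist in that toy saturated model.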
# Limits

If we take straight line limits in the direction of a DOR $\delta$, then all of the probability goes as far as it can in that direction (@geyer-gdor, Theorem 6, but similar theorems are found in @barndorff-nielsen, @brown, and @geyer-thesis). Probability piles up on the subset of the support
\begin{equation}
   H = \{\, Y : \langle Y - y, \delta \rangle = 0 \,\}
   (\#eq:h)
\end{equation}
The probability distribution for $\theta + s \delta$ converges as $s \to \infty$ to the conditional probability distribution for $\theta$ of $Y$ given $Y \in H$. This family of conditional distributions (for all $\theta$) is an exponential family of distributions that @geyer-gdor calls a *limiting conditional model* (LCM).

It is wrong to think of distributions in the LCM only as conditional distributions or only as limit distributions. They are both.

Since the log likelihood is nondecreasing along a DOR, the log likelihood for an LCM lies on or above the log likelihood for the OM. Hence maximizing the likelihood in the LCM maximizes the likelihood in the OM too.

The canonical parameters do not converge as we take these limits. They go off to infinity. But the probability distributions converge. So the MLE *distribution* in the LCM (if one exists, more on this presently) can be thought of as the MLE.

Moreover, not only distributions converge. Under the regularity conditions of Brown, which hold for all statistical models studied in this course, moments of all orders converge (@eck-geyer, Theorem 7); for example, the mean of the distribution in the OM for parameter $\theta + s \delta$ converges to the mean of the distribution for the parameter $\theta$ in the LCM as $s \to \infty$. And the same goes for the variance and other moments.

Thus *mean value parameters* are well behaved. When we take limits to complete exponential families, means converge.

Every LCM is itself an exponential family, just a different exponential family from the OM. The LCM and the OM have the same canonical statistic and the same canonical parameter. The LCM just restricts the canonical statistic to the event $H$ described above.

If $\delta$ is a GDOR of the OM, then $\delta$ is a DOC of the LCM. This is clear from the fact that the LCM conditions on the event $Y \in H$ with $H$ given by \@ref(eq:h). This is the reason we have to allow DOC. Otherwise, we cannot talk about the OM and LCM using the same parameters.

Since the LCM is an exponential family, we can tell when the MLE exists in it. Just apply the theory described above to it. The MLE will exist in the LCM if and only if every DOR of the LCM (not OM) is a DOC of the LCM (not OM). And @geyer-gdor, Section 3.6, shows this happens whenever the DOR of the OM that we used to form the LCM was *generic*, not in any way special. This we call a *generic direction of recession* (GDOR). A more precise characterization of GDOR will be given at the end of [Section 8](#canonical-affine-submodels) below.

When we find a GDOR and its corresponding LCM, then the MLE in the LCM is the MLE *distribution* and the mean value of this distribution is the MLE *mean value*. And this MLE satisfies the observed-equals-expected property, because it is the MLE for an exponential family (the LCM). Hence the observed-equals-expected property always holds. Similarly, the sufficient dimension reduction property and maximum entropy properties always hold. It does not matter whether the MLE is found in the OM or an LCM.

One might ask, when the MLE is in an LCM, what is the statistical model?
The model cannot be the OM, because if that is so, then there is no MLE. We have to say that the model is the OM and all of its limits. @geyer-gdor calls that model the *Barndorff-Nielsen completion* of the OM.

The Barndorff-Nielsen completion has no canonical parameterization, because when you take limits canonical parameters do not converge (they go off to infinity). But it is a family of distributions (the OM and all of its LCM). Also (under the conditions of Brown) it has a mean value parameterization. Every distribution in the Barndorff-Nielsen completion has a mean vector, and mean vectors do converge as limits are taken to get LCM.

Each LCM is an exponential family that has a canonical parameterization, but the Barndorff-Nielsen completion as a whole does not have a canonical parameterization. The canonical parameterizations of the LCM do not fit together to make one parameterization.

# History of Computing

The thesis of your humble author (@geyer-thesis, Chapter 2) included an algorithm using repeated linear programming to discover DOR that are not DOC for exponential families. With software available at the time it was very clumsy but was actually used to find MLE in LCM for some very complicated statistical models [@geyer-thompson].

@geyer-gdor used various functions in R package `rcdd` [@rcdd-package] to find GDOR. This software was also somewhat clumsy to use (there was no model fitting function comparable to R function `glm`), but only a dozen or so lines of code were enough to find out whether the MLE existed in the OM or otherwise a GDOR and its LCM and the MLE in that LCM. But this code was slow on big problems. @eck-geyer have an example (dataset `bigcategorical` in R package `llmdr`), a contingency table with 1024 cells and 781 parameters, that took a little over 3 days and 4 hours of computing time to do with the methods of this paragraph.

@eck-geyer used R function `glm` to maximize the likelihood, getting canonical parameter estimates close to infinity by using small tolerances, and then looking for directions in which the Fisher information matrix is nearly singular (approximate null eigenvectors). This algorithm was fast but had problems with inexactness of computer arithmetic. So it is difficult to know exactly when an approximate null eigenvector of the Fisher information matrix is close to being a GDOR.

R package `llmdr` [@llmdr-package] uses repeated linear programming but different linear programs from those used by @geyer-thesis and @geyer-gdor. It also uses fast linear programming (R package `glpkAPI`, @glpkapi-package) that scales to large problems (it uses sparse matrix arithmetic). But `llmdr` can also use infinite-precision rational arithmetic from R packages `rcdd` [@rcdd-package] and `gmp` [@gmp-package] to prove, if requested, that its results are correct. The big categorical example of @eck-geyer, mentioned above, that took over 3 days to do with the methods of @geyer-gdor took only 150 milliseconds to do with R function `llmdr` in R package `llmdr`. If proofs were requested, they took 25 minutes (slow, but still an improvement over the methods of @geyer-gdor).

So we finally have, in this fourth generation algorithm, methods that are fast, easy, and safe enough for ordinary users to use.

# The Binomial Family of Distributions

The binomial family of distributions as usually taught has usual parameter $\pi$ and usual parameter space $0 \le \pi \le 1$.
The binomial family of distributions [thought of as an exponential family](expfam.html#the-binomial-distribution) has canonical parameter $\theta = \mathop{\rm logit}(\pi)$ and has canonical parameter space $- \infty < \theta < + \infty$, and this corresponds to usual parameter space $0 < \pi < 1$. We have lost the distributions for $\pi = 0$ and $\pi = 1$.

The reason why these distributions cannot be in the exponential family is [all distributions in an exponential family have the same support](expfam.html#non-degeneracy). There is also the problem that $\mathop{\rm logit}(0)$ and $\mathop{\rm logit}(1)$ are undefined, so there are no canonical parameter values for those distributions to have (canonical parameters have to be real numbers or real vectors, infinite values are not allowed).

This is a one-parameter family of distributions so its canonical parameter space is a one-dimensional vector space. In a one-dimensional vector space, there are only two directions: toward $+ \infty$ and toward $- \infty$. There are an infinite number of vectors, but only two directions those vectors can point in.

The [log likelihood for the binomial distribution](ch1.html#modifications) is
$$
   l(\pi) = y \log(\pi) + (n - y) \log(1 - \pi)
$$
When $y = 0$, this becomes
$$
   l(\pi) = n \log(1 - \pi)
$$
and this is a strictly decreasing function of $\pi$, hence the log likelihood increases as $\theta$ moves in the minus direction (toward $- \infty$). Because $\mathop{\rm logit}$ is a monotone function, as $\theta$ decreases, $\pi$ also decreases, $1 - \pi$ increases, and $\log(1 - \pi)$ increases. So $-1$ (or any negative number, thought of as a vector in the one-dimensional vector space $\mathbb{R}$) is a DOR.

And what happens to the probabilities as we take limits? There are two cases. For $y = 0$
$$
   f(y) = (1 - \pi)^n \to 1,
   \qquad \text{as $\pi \to 0$ and $\theta \to - \infty$}.
$$
And for $y > 0$
$$
   f(y) = \pi^y (1 - \pi)^{n - y} \to 0,
   \qquad \text{as $\pi \to 0$ and $\theta \to - \infty$}.
$$
Thus binomial distributions with usual parameter $\pi$ and canonical parameter $\theta$ converge to the completely degenerate distribution concentrated at zero as $\pi \to 0$ and $\theta \to - \infty$.

Now the theory described above says the MLE does not exist in the OM (because $- 1$ is a DOR that is not a DOC) and the LCM is the distribution concentrated at $Y = 0$. This is the only distribution in the LCM, so the LCM has no identifiable parameters, and this only distribution is the MLE distribution.

This distribution has mean zero and variance zero, which are the limits of the mean $n \pi$ and variance $n \pi (1 - \pi)$ as $\pi \to 0$ and $\theta \to - \infty$.

This seems like a crazy amount of theory to throw at such a simple problem. And it is overkill for a one-dimensional problem like this. But, as we shall soon see, even two-dimensional problems can be hard for humans to figure out. So we need the computer to use the theory above to find the answers.

By the way, a similar story could be told about when the data are $y = n$.

# Canonical Affine Submodels

[Canonical affine submodels](expfam.html#casm-sdr) of regular full exponential families are again regular full exponential families. If $a$ is the offset vector and $M$ is the model matrix, then
$$
   \theta = a + M \beta
$$
is the relationship between

* the saturated model canonical parameter vector $\theta$ and

* the canonical affine submodel canonical parameter vector $\beta$.
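To make the notation concrete, here is a minimal sketch (made-up covariate and responses, zero offset, and an arbitrary $\beta$; shown but not evaluated when this document is built) of this map from the submodel canonical parameter to the saturated model canonical parameter, along with the submodel canonical statistic that appears just below.

```{r casm-toy-sketch, eval = FALSE}
x <- c(10, 20, 30, 40)                  # made-up covariate
y <- c(0, 0, 1, 1)                      # made-up Bernoulli responses
m <- cbind("(Intercept)" = 1, x = x)    # model matrix M
a <- rep(0, nrow(m))                    # offset vector (zero here)
beta <- c(-2, 0.05)                     # arbitrary submodel canonical parameter
a + drop(m %*% beta)                    # saturated model canonical parameter theta
drop(t(m) %*% y)                        # submodel canonical statistic M^T y
```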
And
$$
   \tau = M^T \mu
$$
is the relationship between

* the saturated model mean value parameter vector $\mu$ and

* the canonical affine submodel mean value parameter vector $\tau$.

And

* the saturated model canonical statistic vector is $y$ and

* the canonical affine submodel canonical statistic vector is $M^T y$.

So the only difference between saturated models and canonical affine submodels is that we replace $y$ with $M^T y$.

* A vector $\delta$ in the canonical parameter space of a canonical affine submodel is a DOC if and only if $\langle M^T (Y - y), \delta \rangle = \langle Y - y, M \delta \rangle$ is zero with probability one.

* A vector $\delta$ in the canonical parameter space of a canonical affine submodel is a DOR if and only if $\langle M^T (Y - y), \delta \rangle = \langle Y - y, M \delta \rangle$ is nonpositive with probability one.

A vector $\delta$ is a DOR for the submodel if and only if $\eta = M \delta$ is a DOR for the saturated model. Thus

* for Poisson sampling $\eta$ is a DOR if and only if $\eta_i \le 0$ for all $i$ (because the Poisson distribution has no upper bound) and $\eta_i < 0$ implies $y_i = 0$ (because zero is the lower bound),

* for binomial sampling $\eta$ is a DOR if and only if $\eta_i < 0$ implies $y_i = 0$ (because zero is the lower bound) and $\eta_i > 0$ implies $y_i = n_i$ (because $n_i$ is the upper bound).

Multinomial and product multinomial sampling is considerably more complicated, so we [defer discussion of that](#dor-multinomial).

But already we can see why this problem is hard. We have to find a $\delta$ such that $\eta = M \delta$ satisfies the above conditions. And for complicated models that is complicated. Fortunately, the computer can do it for us.

When we have found a DOR $\eta$, the LCM says

* if $\eta_i \neq 0$, then the LCM conditions on $Y_i = y_i$ and

* if $\eta_i = 0$, then the LCM makes no restrictions on $Y_i$.

Hence the LCM is found by fitting the model to the cases with $\eta_i = 0$. When we lose cases like this, we may also have more non-identifiable parameters, but the computer can figure that out.

Now we can be a bit more precise about what a GDOR is. For Poisson and binomial sampling, a vector $\delta$ is a GDOR if it is a DOR such that $\eta = M \delta$ has the largest possible number of nonzero components, so the LCM conditions on the largest possible number of components of the response vector. Of course, you have no way to guess what subset of components is "largest possible" but, again, the computer can figure it out.

# Separation versus Degeneracy

@agresti, Section 6.5.1, uses the terminology of *complete separation* and *quasi-complete separation* to discuss the issues in this area. We prefer the terminology of *complete degeneracy* and *partial degeneracy* to describe these issues. Agresti's terms only apply to logistic regression. Our terms apply to all exponential families.

*Complete separation* occurs if all cases are plotted in regressor space, the successes and failures are labeled with different colors, and there is a hyperplane in predictor space that separates them, all of one color on one side and all of the other color on the other side. An example of such a plot is Figure \@ref(fig:agresti-complete-complete-separation) below. This corresponds to an LCM that is *completely degenerate*, concentrated at a single point. The MLE distribution says the only possible value of the response vector was the observed value.
*Quasi-complete separation* occurs if all cases are plotted in predictor space, the successes and failures are labeled with different colors, and there is a hyperplane in predictor space that weakly separates them, all of one color on one side or in the hyperplane and all of the other color on the other side or in the hyperplane. An example of such a plot is Figure \@ref(fig:agresti-quasi-complete-separation) below. This corresponds to an LCM that is *partially degenerate*, not concentrated at a single point but having support smaller than the OM. The MLE distribution is not completely degenerate but does give probability zero to some outcomes that have positive probability in the OM. # Complete Separation Example of Agresti ## Data Section 6.5.1 of @agresti introduces the notion of complete separation with the following example. ```{r example.i} x <- seq(10, 90, 10) x <- x[x != 50] y <- as.numeric(x > 50) data.frame(x, y) ``` The following figure shows these data. ```{r label="one", echo=FALSE, fig.align = "center", fig.cap = "Logistic Regression Data for Complete Separation Example of Agresti"} plot(x, y) ``` ## Attempt to Fit Model Suppose we want to do "simple" logistic regression (one predictor $x$ plus intercept, so the model is two-dimensional). Let's try to do it naively. ```{r naive, error = TRUE} gout <- glm(y ~ x, family = binomial) summary(gout) ``` R function `glm` does give a warning. But what are you supposed to do about it? If the output of the R function `summary` is to be taken seriously, you cannot tell whether either regression coefficient is nonzero. As we shall see, that is complete nonsense. Both parameters have to go to infinity (one to plus infinity, the other to minus infinity) in order to maximize the likelihood. In fact, these data are not analyzable by the R function `glm` and its associated functions (generic functions having methods for class `"glm"`). At least it did give a (not very clear) warning that its results are absolute rubbish. But that is no consolation because * it doesn't give you any options to do the [Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html) and * it gives both false negative and false positives on these warnings, so they are nonsense too. You cannot rely on anything R function `glm` does. ## Log-Linear Models Done Right So we switch to R function `llmdr` which stands for log-linear models done right (a take-off on the [name of a well-known linear algebra textbook](https://www.amazon.com/Linear-Algebra-Right-Undergraduate-Mathematics/dp/3319110799/)) in R package `llmdr` [@llmdr-package]. ```{r llmdr-agresti-complete} lout <- llmdr(y ~ x, family = "binomial") summary(lout) ``` The fact that we have `NA` for every parameter estimate means all parameters have been dropped to obtain identifiability in the LCM. Thus the LCM is completely degenerate (or we have complete separation in Agresti's terminology). The GDOR given is the direction the parameters go to infinity to get to the LCM. Since no parameters have been estimated, everything else is also `NA`. Admittedly this is not easy to understand. But of course [canonical parameters are never easy to understand](expfam.html#alice). But mean value parameters are much easier to understand. ```{r agresti-complete-mu} estimates(lout, type = "mu") ``` This says the MLE of the saturated model mean value parameter vector $\mu$ is equal to the observed data. 
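We can also check this against the sign conditions for a binomial DOR given in the section on canonical affine submodels above. Here is a minimal sketch (assuming the objects `lout`, `x`, and `y` from the chunks above; shown but not evaluated when this document is built) that maps the GDOR $\delta$ to the saturated model scale, $\eta = M \delta$.

```{r agresti-complete-eta-sign-check, eval = FALSE}
# GDOR mapped to the saturated model canonical parameter scale
eta <- drop(as.matrix(lout$x) %*% lout$gdor)
cbind(x, y, eta = zapsmall(eta))
# binomial DOR sign conditions: eta_i < 0 only where y_i = 0 and
# eta_i > 0 only where y_i = n_i (here n_i = 1)
all(eta[y == 0] < 0) && all(eta[y == 1] > 0)
# no component of eta is zero, so the LCM conditions on every Y_i,
# that is, the LCM is concentrated at the observed response vector
all(zapsmall(eta) != 0)
```

Either way we reach the same conclusion: the MLE mean values shown above are just the observed data.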
And since all of those values are on the boundary of the range ($y_i$ can only be zero or one for these data), this means we have a completely degenerate LCM (something we already knew from the Estimates column in the summary being all `NA`). ## So Now What? @agresti, Section 6.5.3 gives several pieces of advice about what to do in this situation. We now think all of it is bad, although we ourselves have given similar advice in the past. More on this below. None of this section is intended to be critical of @agresti. That book is an authoritative compendium of conventional wisdom about the subject. The fact that conventional wisdom sometimes disagrees with long established theory, albeit fairly esoteric theory, is not unique to categorical data analysis. The first thing we can do is wonder if some simpler model fits the data. If so, then we might not have to worry about this OM not having MLE. So let's try that. ```{r agresti-complete-anova} lout.null <- llmdr(y ~ 1, family = "binomial") anova(lout.null, lout) ``` We are getting a bit ahead of ourselves here (we will explain how hypothesis tests work in this situation in Section \@ref(theory-of-hypothesis-tests) below). But the $P$-value seems fairly strong. The response does depend on $x$. So we cannot drop $x$ from the model. Save these data for future use ([Section about AIC below](#aic)) ```{r agresti-complete-save} save.complete <- list(null = lout.null, alternative = lout, anova = anova(lout.null, lout), data = data.frame(x = x, y = y)) ``` By the way, we could have gotten the same result from R function `glm` and friends. ```{r agresti-complete-anova-glm} gout.null <- glm(y ~ 1, family = "binomial") anova(gout.null, gout, test = "LRT") ``` This illustrates a general principle. When the MLE for the null hypothesis exists in the OM (of the null hypothesis), then the usual theory about likelihood ratio tests and Rao tests is valid. It doesn't matter whether or not the MLE exists in the OM of the alternative hypothesis (Section \@ref(theory-of-hypothesis-tests) below). So R function `glm` and friends are doing the Right Thing here. But it gives a warning in fitting the alternative hypothesis that we have to be sophisticated enough to ignore (for the purposes of this hypothesis test). Another thing Agresti recommends is to do likelihood-based confidence intervals. These at least will be automatically one sided. But R function `confint.glm` in R package `MASS` messes up one-sided confidence intervals. So we have to [do that the hard way](ch4.html#likelihood-intervals). 
```{r agresti-complete-logl-likelihood-function} logl.factory <- function(modmat, response) { stopifnot(is.numeric(modmat)) stopifnot(is.finite(modmat)) stopifnot(is.matrix(modmat)) stopifnot(is.numeric(response)) stopifnot(is.finite(response)) stopifnot(round(response) == response) # check for integer-valued stopifnot(response >= 0) stopifnot(length(response) == nrow(modmat)) logpq <- function(theta) ifelse(theta > 0, - log1p(exp(- theta)), theta - log1p(exp(theta))) function(beta) { stopifnot(is.numeric(beta)) stopifnot(is.finite(beta)) stopifnot(length(beta) == ncol(modmat)) theta <- drop(modmat %*% beta) logp <- logpq(theta) logq <- logpq(- theta) return(sum(response * logp + (1 - response) * logq)) } } # need as.matrix because lout$x is sparse matrix class logl <- logl.factory(as.matrix(lout$x), lout$y) logl.gradient.factory <- function(modmat, response) { stopifnot(is.numeric(modmat)) stopifnot(is.finite(modmat)) stopifnot(is.matrix(modmat)) stopifnot(is.numeric(response)) stopifnot(is.finite(response)) stopifnot(round(response) == response) # check for integer-valued stopifnot(response >= 0) stopifnot(length(response) == nrow(modmat)) pq <- function(theta) ifelse(theta > 0, 1 / (1 + exp(- theta)), exp(theta) / (1 + exp(theta))) function(beta) { stopifnot(is.numeric(beta)) stopifnot(is.finite(beta)) stopifnot(length(beta) == ncol(modmat)) theta <- drop(modmat %*% beta) p <- pq(theta) q <- pq(- theta) return(rbind(drop(crossprod(modmat, response * q - (1 - response) * p)))) } } logl.gradient <- logl.gradient.factory(as.matrix(lout$x), lout$y) beta.try <- rnorm(2) logl.gradient(beta.try) grad(logl, beta.try) # check that maximum of logl is zero logl(100 * lout$gdor) logl(300 * lout$gdor) ``` So now we are going to do something really tricky that does not generalize to problems with more than two parameters but does really show what is going on in this example. We are going to draw the boundary of the likelihood-based confidence region for $\beta$. Because we are going to compare this with some other regions, we write a function to do it. 
```{r agresti-complete-curve-follower}
do.curve <- function(hin, hin.jac, eps = 0.1, cutoff = -10) {
    # find one point on curve
    beta.start <- c(0, 0)
    gdor <- lout$gdor
    aout <- auglag(beta.start,
        fn = function(beta) sum(gdor * beta),
        gr = function(beta) gdor,
        hin = hin, hin.jac = hin.jac,
        control.outer = list(trace = FALSE))
    stopifnot(aout$convergence == 0)
    beta.save <- aout$par
    # now step along the curve in the counterclockwise direction
    # using R function auglag to put us back on the curve
    for (ii in 1:1000) {
        foo <- logl.gradient(aout$par)
        # rotate clockwise 90 degrees
        bar <- rev(foo) * c(1, -1)
        # normalize
        bar <- bar / sqrt(sum(bar^2))
        # slowly increase step size as we go along
        beta.try <- aout$par + eps * bar * (1 + ii / 10)
        aout <- auglag(beta.try,
            fn = function(beta) sum((beta - beta.try)^2) / 2,
            gr = function(beta) beta - beta.try,
            hin = hin, hin.jac = hin.jac,
            control.outer = list(trace = FALSE))
        stopifnot(aout$convergence == 0)
        beta.save <- rbind(beta.save, aout$par)
        if (aout$par[1] < cutoff) break
    }
    # go back to the first point
    aout <- list(par = beta.save[1, ])
    # and go the other way
    for (ii in 1:1000) {
        foo <- logl.gradient(aout$par)
        # rotate counterclockwise 90 degrees
        bar <- rev(foo) * c(-1, 1)
        # normalize
        bar <- bar / sqrt(sum(bar^2))
        # slowly increase step size as we go along
        beta.try <- aout$par + eps * bar * (1 + ii / 10)
        aout <- auglag(beta.try,
            fn = function(beta) sum((beta - beta.try)^2) / 2,
            gr = function(beta) beta - beta.try,
            hin = hin, hin.jac = hin.jac,
            control.outer = list(trace = FALSE))
        stopifnot(aout$convergence == 0)
        beta.save <- rbind(aout$par, beta.save)
        if (aout$par[1] < cutoff) break
    }
    beta.save
}
```

So that is a pretty complicated function, but now that we have it, it is fairly simple to use. Its arguments are

* `hin` the function whose zero set is the curve,

* `hin.jac` the gradient of that function,

* `eps` how far apart we want points along the curve to be, and

* `cutoff` where to stop. The curve, of course, goes off to infinity in the direction of recession at both ends, so we have to stop somewhere. Having made this plot in previous lecture notes (not in the same way), we know that the curve goes off the left side of the plot, so we stop when $\beta_1$ equals the cutoff.

This code (especially `cutoff`) is very specific to this particular problem, but since nothing of the sort works for more than two parameters, we won't worry about that.

```{r agresti-complete-likelihood-simultaneous,cache=TRUE}
level <- 0.95
# divide by 2 because our function is log likelihood not twice log likelihood
crit <- qchisq(level, df = length(lout$coefficients)) / 2
crit
beta.likelihood <- do.curve(function(beta) crit + logl(beta), logl.gradient)
```

```{r agresti-complete-likelihood-simultaneous-plot,fig.align="center",fig.cap="Likelihood-based Confidence Region for Beta"}
foo <- beta.likelihood[,1] > -8
plot(beta.likelihood[,1], beta.likelihood[,2],
    xlab = expression(beta[1]), ylab = expression(beta[2]),
    xlim = range(beta.likelihood[foo,1]),
    ylim = range(beta.likelihood[foo,2]), type = "l")
```

Now we can make likelihood-based confidence intervals for any parameter (any function of $\beta$) by maximizing and minimizing it over this region. And all of these intervals will have simultaneous coverage because they are all based on the same confidence region.

If we want non-simultaneous coverage, we do the same thing except we use the critical value for one degree of freedom.
```{r agresti-complete-likelihood-non-simultaneous,cache=TRUE} # divide by 2 because our function is log likelihood not twice log likelihood crit <- qchisq(level, df = 1) / 2 crit beta.likelihood.one.df <- do.curve(function(beta) crit + logl(beta), logl.gradient) ``` ```{r agresti-complete-likelihood-non-simultaneous-plot,fig.align="center",fig.cap="Level Sets of the Log Likelihood. Black, chi square critical value for 2 df. Red, chi square critical value for 1 df."} plot(beta.likelihood[,1], beta.likelihood[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.likelihood[foo,1]), ylim = range(beta.likelihood[foo,2]), type = "l") lines(beta.likelihood.one.df[,1], beta.likelihood.one.df[,2], col = "red") ``` But there is a problem with what we are doing. It isn't justified by theory. Let's go back to the binomial distribution. We know that for large $n$ and $\pi$ not close to the boundary of its parameter space that we have good normal approximation, which is what these likelihood based confidence intervals assume. But we also know that when $n$ is large and $\pi$ is close to the boundary so either $n \pi$ or $n (1 - \pi)$ is not large, we have good Poisson approximation not normal approximation. So that says that when the mean value parameter is close to the boundary, the usual asymptotics of maximum likelihood do not work. So there is every reason to believe these intervals just do not work. Another way to explain the same thing is to note that the multivariate normal distribution has no boundary. Thus a sampling distribution is only approximately multivariate normal when the data for the distribution being approximated are almost never on the boundary. Thus the mere fact that the MLE distribution is on the boundary is sufficient reason to believe the asymptotics do not work. ## Valid Region for Beta Thus we introduce another sort of confidence region, proposed by @geyer-gdor, Section 3.16. These, at least, do have some theoretical justification. We won't explain that theory now, although it is quite simple (see Section \@ref(theory-confidence) below). We will just put it too on our plot. ```{r agresti-complete-geyer,cache=TRUE} alpha <- 1 - level alpha crit <- (- log(alpha)) crit beta.geyer <- do.curve(function(beta) crit + logl(beta), logl.gradient, cutoff = -1000) ``` ```{r agresti-complete-geyer-plot,fig.align="center",fig.cap="Level Sets of the Log Likelihood. Black, chi square critical value for 2 df. Red, chi square critical value for 1 df. Blue, minus log significance level. Arrow is GDOR. Blue curve is on top of black curve."} plot(beta.likelihood[,1], beta.likelihood[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.likelihood[foo,1]), ylim = range(beta.likelihood[foo,2]), type = "l") lines(beta.likelihood.one.df[,1], beta.likelihood.one.df[,2], col = "red") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) ``` The blue curve in the plot is exactly on top of the black curve. This is an accidental feature of this problem. It happens because the LCM is completely degenerate (so both the blue curve and the black curve use the same constraint function) and this problem has 2 parameters (which makes the critical values for the blue curve and the black curve the same). The blue and black curves would be different if the LCM were only partially degenerate or if the problem had number of parameters not equal to 2. 
If the number of parameters was much greater than 2, then the black curve would be far outside the blue curve, and even the red curve might be far outside the blue curve. So we will henceforth discard the likelihood-based intervals as a bad idea (when near the boundary in the mean value parameter space and near infinity in the canonical parameter space; likelihood-intervals are one of the great ideas of statistics when far from the boundary) and use the proposal of @geyer-gdor. ## GDOR Need Not Be Unique In this problem there are an infinite number of directions of recession. R package `rcdd` can find all of them. ```{r agresti-complete-rcdd} tanv <- as.matrix(lout$x) tanv[lout$y == 1, ] <- (- tanv[lout$y == 1, ]) vrep <- makeV(rays = tanv) hrep <- scdd(vrep)$output norv <- (- hrep[ , -(1:2), drop = FALSE]) norv ``` The rows of this matrix are the extreme directions of recession. They are not generic. Every vector pointing in a direction between these vectors (every vector that is a linear combination of these vectors with strictly positive coefficients) is a GDOR. Let's put these on our plot. ```{r agresti-complete-dor-plot,fig.align="center",fig.cap="Level Set of Log Likelihood and Directions of Recession. Blue curve is boundary of level set (same as blue curve in Figure \\@ref(fig:agresti-complete-geyer-plot)). Black arrow is GDOR. Green arrows are DOR that are not generic."} plot(beta.likelihood[,1], beta.likelihood[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.likelihood[foo,1]), ylim = range(beta.likelihood[foo,2]), type = "n") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) norv <- norv / norv[1, 1] * lout$gdor[1] arrows(-5, 0.125, -5 + 2 * norv[1, 1], 0.125 + 2 * norv[1, 2], col = "green3") norv <- norv / norv[2, 1] * lout$gdor[1] arrows(-5, 0.125, -5 + 2 * norv[2, 1], 0.125 + 2 * norv[2, 2], col = "green3") ``` Any arrow pointing between the green arrows is a GDOR, so GDOR are not unique in this example (they are unique in a [slightly different example](#quasi-complete-separation-example-of-agresti) below). There are two reasons why we don't try to find all GDOR rather than just one GDOR. * One GDOR is enough to do all the statistics. * All the GDOR take a lot longer to find, hours, days, or worse for large examples. ## Mathematical Proofs The arrow in Figure \@ref(fig:agresti-complete-geyer-plot) or the arrows in Figure \@ref(fig:agresti-complete-dor-plot) show that the log likelihood really does go uphill all the way to infinity in the GDOR direction. Of course, just eyeballing a plot isn't a proof, but the computer can do a rigorous proof. ```{r agresti-complete-proof} lout <- llmdr(y ~ x, family = "binomial", proofs = 2) lout$proofs ``` The argument `proofs = 2` to R function `llmdr` asks it to do 2 proofs (more precisely 2 parts of one proof). The result having `lout$proofs == 2` indicates that both proofs were successful. The proofs have been done. What is reported as a GDOR really is one. (Or for a problem in which a GDOR does not exist so the MLE exists in the OM, that has been proved.) By reporting it has done the proofs, the computer says that, although it found the answer using inexact computer arithmetic, it went back and did enough work in infinite-precision rational arithmetic to prove that what it says is a GDOR really is one. 
The only reason R function `llmdr` does not do these proofs by default is that inexact computer arithmetic usually gives the right answer and these proofs can take many minutes to do on large problems. ## Intervals for Beta ```{r agresti-complete-beta-intervals} beta <- beta.geyer colnames(beta) <- names(lout$coefficients) beta <- apply(beta, 2, range) rownames(beta) <- c("lower", "upper") beta <- t(beta) beta ``` Except we know this is wrong. As the arrow in Figure \@ref(fig:agresti-complete-geyer-plot) shows, $\beta_1$ goes to $- \infty$ as we travel in the GDOR direction, and $\beta_2$ goes to $+ \infty$ as we travel in the GDOR direction. We just don't get that far on the plot. So fix that up. ```{r agresti-complete-beta-intervals-too} beta[1, "lower"] <- -Inf beta[2, "upper"] <- Inf beta ``` We can make these intervals if a user asks for them, but they are not very informative. They provide much less information than the whole region whose boundary is the blue curve in Figure \@ref(fig:agresti-complete-geyer-plot). ## Intervals for Theta and Mu Larry Brown (the author of @brown) told your humble author that nobody would ever understand the intervals for $\beta$ so focus on $\mu$. Your humble author mostly agrees. The only reasons we bothered with $\beta$ is because * everything starts there, we can only do intervals for other parameters by mapping intervals for $\beta$ to them, * when writing an R package, you don't know what users will want or what applications they will be doing, so you have to provide all the possibilities you can if you want to be useful. First we do sloppy intervals for $\theta$. ```{r agresti-complete-theta-sloppy} modmat <- as.matrix(lout$x) theta.geyer <- modmat %*% t(beta.geyer) theta.intervals <- apply(theta.geyer, 1, range) theta.intervals <- t(theta.intervals) colnames(theta.intervals) <- c("lower", "upper") theta.intervals ``` But we can do a lot better than this by running the optimizer again. First we know that we have one-sided intervals. If $y_i = 0$, then we know the lower bound for $\theta_i$ is $- \infty$. If $y_i = 1$, then we know the upper bound for $\theta_i$ is $+ \infty$. So we can just set them without any calculation. Then we find the $\beta$ that optimizes each $\theta_i$ and use it as a starting point for optimization. And we write a function to do that. 
```{r agresti-complete-theta-function} do.theta.intervals <- function(x) { foo <- cbind(1, x) %*% t(beta.geyer) gdor <- lout$gdor lower <- rep(-Inf, length(x)) upper <- rep(+Inf, length(x)) for (i in seq(along = x)) { eta <- sum(c(1, x[i]) * gdor) eta <- zapsmall(eta) if (eta < 0) { # need upper bound idx <- which(foo[i, ] == max(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.geyer[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = function(beta) crit + logl(beta), hin.jac = logl.gradient, control.outer = list(trace = FALSE), control.optim = list(fnscale = -1)) stopifnot(aout$convergence == 0) upper[i] <- aout$value } else { # need lower bound idx <- which(foo[i, ] == min(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.geyer[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = function(beta) crit + logl(beta), hin.jac = logl.gradient, control.outer = list(trace = FALSE)) stopifnot(aout$convergence == 0) lower[i] <- aout$value } } cbind(lower, upper) } theta <- do.theta.intervals(x) theta ``` And then we transform these to the mean value parameter scale. ```{r agresti-complete-mu-intervals} mu <- 1 / (1 + exp(- theta)) mu ``` And then we plot them. ```{r agresti-complete-mu-plot,fig.align="center",fig.cap="Simultaneous 95% Confidence intervals for Mean Value Parameters. Hollow circles are MLE."} errbar(x, y, mu[ , 2], mu[ , 1], xlab = "x", ylab = "probability") ``` This plot is fairly simple. The hollow dots are the MLE saturated model mean value parameter vector $\hat{\mu}$. They are on the boundary. The MLE distribution is completely degenerate and says the observed value of the response vector is the only possible value. But $\hat{\mu}$ is not $\mu$. Estimates are not the parameters they estimate. This is the second most fundamental concept in statistics. (The most fundamental is anecdotes are not data. The third most fundamental is it is called a 95% confidence interval because it is wrong 5% of the time.) So we see that we do not need to fear MLE on the boundary. We just need a proper theory of confidence intervals that applies in the situation. Note that the error bars say almost nothing about the probabilities near the jump in the regression function. Not surprising in hindsight. If one of the Bernoulli random variables for $x = 40$ or $x = 60$ had been different, we would have inferred the jump was at a different point. But the farther we get from the jump the better the estimation. So even in this toy problem with so little data, statistics can say something of consequence. ## Submodel Canonical Statistic Theory tells us that where the action is really happening is with the canonical statistic vector $M^T y$ of the canonical affine submodel. In most problems we can't visualize that, but in this toy problem, where $M^T y$ is two-dimensional, we can. Another reason why we can't do anything like this section in non-toy problems is that the computing we do in this section is exponential in the size of the data, so the computing time would be horrendously long for non-toy data. There are $2^n$ possible values of the response vector where $n$ is the length of the response vector because each component of $y$ can be either zero or one. The following code makes all of those vectors. 
```{r example.i.all.y}
yy <- matrix(0:1, nrow = 2, ncol = length(x))
colnames(yy) <- paste0("y", x)
yy <- expand.grid(as.data.frame(yy))
head(yy)
dim(yy)
```

But there are not so many distinct values of the submodel canonical statistic.

```{r example.i.all.m.transpose.y}
m <- cbind(1, x)
mtyy <- t(m) %*% t(yy)
t1 <- mtyy[1, ]
t2 <- mtyy[2, ]
t1.obs <- sum(y)
t2.obs <- sum(x * y)
```

```{r agresti-complete-submodel-sufficient,fig.align="center",fig.cap="Possible values of the submodel canonical statistic vector $M^T y$ for the data shown in Figure \\@ref(fig:one). Solid dot is the observed value. Line is hyperplane \\@ref(eq:h) determined by the GDOR."}
t.key <- paste(t1, t2, sep = ":")
t.idx <- match(unique(t.key), t.key)
t1.uniq <- t1[t.idx]
t2.uniq <- t2[t.idx]
par(mar = c(5, 5, 1, 0) + 0.1)
plot(t1.uniq, t2.uniq,
    xlab = expression(t[1] == sum(y)),
    ylab = expression(t[2] == sum(x * y)))
points(t1.obs, t2.obs, pch = 19)
# add H
gdor <- lout$gdor
# rotate 90 degrees to convert normal vector to tangent vector
tanv <- rev(gdor) * c(-1, 1)
tanv
# now convert to slope and intercept of line
slope <- tanv[2] / tanv[1]
intercept <- t2.obs - t1.obs * slope
abline(intercept, slope)
```

Having the hyperplane $H$ defined by \@ref(eq:h) on the plot makes it clear that the observed value of the canonical statistic really is extreme. It lies on the line, and all other possible (in the OM) values lie on one side of $H$. The fact that only one point lies in $H$ means the LCM is concentrated at this one point, so is completely degenerate.

## Complete Separation

Just for completeness, we do show the "complete separation" plot that Agresti discusses.

```{r agresti-complete-complete-separation,fig.align="center",fig.cap="Complete Separation of Successes and Failures. Black dots successes. White dots failures. Black line separating hyperplane"}
dim(modmat)
plot(modmat[ , 1], modmat[ , 2],
    xlab = colnames(modmat)[1], ylab = colnames(modmat)[2])
points(modmat[y == 1, 1], modmat[y == 1, 2], pch = 19)
intercept <- 50 - 1 * slope
abline(intercept, slope)
```

The separating hyperplane in Figure \@ref(fig:agresti-complete-complete-separation) is one whose normal vector is the GDOR found by R function `llmdr`. It is clear that there are a lot of separating hyperplanes, so this plot, although easier to do than Figure \@ref(fig:agresti-complete-submodel-sufficient), doesn't provide the same information. This will become even more clear in the following section.

Figure \@ref(fig:agresti-complete-complete-separation) is a little odd in that the point cloud isn't really two-dimensional. All of the points have the same value of the regressor `(Intercept)`. But of course that always happens. That is what "intercept" means. So it is odd, but also extremely common. Most models have an intercept.

## Region for Tau

We can map the confidence region for $\beta$ that is the region bounded by the blue curve in Figure \@ref(fig:agresti-complete-geyer-plot) to values for $\tau$, thus obtaining a confidence region for $\tau$.
```{r agresti-complete-tau-region,fig.align="center",fig.cap="Same as Figure \\@ref(fig:agresti-complete-submodel-sufficient) Except Gray Region is 95% Confidence Region for Submodel Mean Value Parameter"} theta <- tcrossprod(modmat, beta.geyer) mu <- 1 / (1 + exp(- theta)) tau <- crossprod(modmat, mu) # add observed value to these points tau <- cbind(c(t1.obs, t2.obs), tau) par(mar = c(5, 6, 1, 0) + 0.1) plot(t1.uniq, t2.uniq, xlab = expression(t[1] == sum(y)), ylab = expression(t[2] == sum(x * y)), type = "n") polygon(tau[1, ], tau[2, ], border = NA, col = "gray") points(t1.uniq, t2.uniq) points(t1.obs, t2.obs, pch = 19) intercept <- t2.obs - t1.obs * slope abline(intercept, slope) mtext(expression(tau[1]), side = 1, line = 4) mtext(expression(tau[2]), side = 2, line = 5) ``` From this figure it is clear that the confidence region is fairly restrictive, and most of the submodel mean value parameter space is excluded by the confidence region. ```{r restore-par-mar,echo=FALSE} par(mar = c(5, 4, 4, 1) + 0.1) ``` ## Confidence Interval for Closeness to the Boundary We can make confidence intervals for any function of parameters we can think of. Here is one that is especially interesting. The parameter $\langle \tau, \delta \rangle$ measures how close a distribution is to the boundary (to the line in the figure). ```{r agresti-boundary-interval} sum(c(t1.obs, t2.obs) * gdor) - rev(range(crossprod(tau, gdor))) ``` Admittedly, this is impossible to interpret by itself because the length of the GDOR is arbitrary. But we can compare it to the range of the same GDOR applied to the whole sample space. ```{r agresti-boundary-interval-compare} ttt <- rbind(t1.uniq, t2.uniq) sum(c(t1.obs, t2.obs) * gdor) - rev(range(crossprod(ttt, gdor))) ``` These data, few though they may be, do mush up the probability fairly close to the boundary. ## Confidence Intervals for Hypothetical Individuals We can also make confidence intervals for hypothetical individuals, what R generic function `predict` does when its `newdata` argument is used. Here we treat $x$ as a continuous variable and do intervals for $\mu$ for all possible $x$ values. ```{r agresti-complete-hypothetical,cache=TRUE} xx <- seq(0, 100, 0.1) # be especially careful at jump xx <- c(xx, 39.99, 40.01, 59.99, 60.01) xx <- sort(xx) theta <- do.theta.intervals(xx) mu <- 1 / (1 + exp(- theta)) # have to do by hand since optimization isn't good enough mu[xx < 60, 1] <- 0 mu[xx > 40, 2] <- 1 ``` ```{r agresti-complete-hypothetical-plot,fig.align="center",fig.cap="Like Figure \\@ref(fig:agresti-complete-mu-plot) Except with Confidence Intervals for Every $x$ value"} plot(x, y, type = "n", xlab = expression(x), ylab = expression(mu(x))) xxx <- c(xx, rev(xx)) yyy <- c(mu[ , 1], rev(mu[ , 2])) length(xxx); length(yyy) polygon(xxx, yyy, border = NA, col = "gray") points(x, y) ``` That finishes our work on this toy problem. Admittedly, it is hard to know what you are supposed to learn from a toy problem. But toy problems do have the advantage of at least being somewhat understandable. # A Clinical Trial Example of Agresti ```{r agresti-clinical-clean,echo=FALSE} rm(list = setdiff(ls(), c("level", "do.curve", "logl.factory", "logl.gradient.factory", "save.complete"))) ``` But analyses involving DOR are not always so complicated. Sometimes the issue and analysis are really quite simple. Section 6.5.2 in @agresti looks at the following data for a clinical trial. 
```{r agresti-clinical-data} center <- rep(1:5, each = 2) center <- as.factor(center) treatment <- rep(c("active_drug", "placebo"), times = 5) success <- c(0, 0, 1, 0, 0, 0, 6, 2, 5, 2) failure <- c(5, 9, 12, 10, 7, 5, 3, 6, 9, 12) y <- cbind(success, failure) ``` If we dump that into R function `llmdr` we find the MLE does not exist in the OM. ```{r agresti-clinical-fit} lout <- llmdr(y ~ center + treatment, family = "binomial") summary(lout) ``` ```{r agresti-clinical-wald-do-summary,echo=FALSE} sout <- summary(lout) ``` Furthermore, we cannot drop `treatment` because it is statistically significant (`r round(sout$coefficients["treatmentplacebo", "Pr(>|z|)"], 4)`) by the Wald test done in the summary, and we cannot drop `center` because ```{r agresti-clinical-anova-center} lout.null <- llmdr(y ~ treatment, family = "binomial") anova(lout.null, lout) ``` So the model `y ~ center + treatment` is the only model under consideration that fits the data. [As usual](expfam.html#alice) it is very hard to figure out what canonical parameters mean. So we look at mean value parameters. ```{r agresti-clinical-mu} data.frame(center, treatment, estimates(lout, "mu")) ``` Oh! There were no successes in either treatment group in centers 1 and 3. And the MLE mean values also say no successes in those centers. If you were to decide to throw away the data from centers 1 and 3, just to make the data easier for you to analyze, then you could be severely criticized for that. If you did it and didn't admit it, that would be fraud. If you did it and did admit it, then that would be sufficient reason for everybody to ignore everything you say on the subject. But if statistics does the same thing, following the theory of maximum likelihood estimation in exponential families, then there is nothing to criticize. The MLE in the LCM is also the MLE in the completion of the OM, arrived at by taking limits of sequences of distributions in the OM. There is nothing wrong with that. If taking limits is wrong, then everything involving calculus has been wrong for over 400 years. So that argument is a non-starter. So having arrived at this conclusion, we mention that this is another situation in which the theory in these notes matches conventional practice. If a conclusion can be drawn solely from the fit in the LCM alone (with no reference to the OM), then it is valid (assuming the usual asymptotics of maximum likelihood apply, that is, assuming the sample size of the LCM is large enough). So that means the coefficient for the treatment effect and its standard error in the summary above are valid. So we can turn that into a valid Wald confidence interval. ```{r agresti-clinical-wald-interval} sout <- summary(lout) fit <- sout$coefficients["treatmentplacebo", "Estimate"] se.fit <- sout$coefficients["treatmentplacebo", "Std. Error"] fit + c(-1, 1) * qnorm((1 + level) / 2) * se.fit ``` And the hypothesis test is valid too. We have a statistically significant treatment effect (with aforementioned $P$-value $P = `r round(sout$coefficients["treatmentplacebo", "Pr(>|z|)"], 4)`$). And this agrees with the corresponding confidence interval not containing zero (as it has to by the duality of hypothesis tests and confidence intervals). Of course this does ignore the data from centers 1 and 3. But that is what maximum likelihood does with these data. Once we decide that `center` needs to be in the model, then we have to let statistics do what it does. 
Of course, if we were interested in confidence intervals for centers 1 and 3, then we would have to do calculations like those in the preceding section. But as long as we are happy with what has been done in this section, then we're good. By the way ```{r agresti-clinical-try-glm} gout <- glm(y ~ center + treatment, family = "binomial") summary(gout) ``` gives a false negative by failing to warn `fitted probabilities numerically 0 or 1 occurred`. It does agree with our $P$-value for the treatment effect. But that is an accident of having orthogonal predictors. If `treatment` and `center` were correlated predictors, then we wouldn't get the same answers. # Theory of Confidence Intervals {#theory-confidence} ## Confidence Intervals for OM when there is a GDOR These are the intervals based on the region bounded by the blue curve in Figure \@ref(fig:agresti-complete-geyer-plot). We have already met them. Now for the details. Suppose the GDOR $\delta$ were known in advance of obtaining the data (this is not true; the GDOR depends on the data), and suppose we chose $\langle Y, \delta \rangle$ as a test statistic for a hypothesis test with a point null hypothesis $\beta$. We observe the largest possible value of this test statistic $\langle y, \delta \rangle$. Thus the $P$-value of this test is $$ \mathop{\rm pr}\nolimits_\beta\bigl( \langle Y, \delta \rangle = \langle y, \delta \rangle \bigr) = \mathop{\rm pr}\nolimits_\beta( Y \in H ) $$ where $H$ is given by \@ref(eq:h). If we think decision theoretically, we reject the null hypothesis when this $P$-value is less than the significance level, call that $\alpha$. When we invert a hypothesis test to get a confidence region, the confidence region is the set of parameter values considered as null hypotheses that are not rejected by the hypothesis test. Of course, the probability that the null hypothesis is not rejected is $1 - \alpha$, so that is the coverage probability of the confidence region. In short, our confidence region is the set of $\beta$ such that \begin{equation} \mathop{\rm pr}\nolimits_\beta( Y \in H ) \ge \alpha (\#eq:confreg-geyer) \end{equation} It makes no sense to use confidence regions of this kind when the MLE exists in the OM. Then this confidence region does nothing at all because saying there is no GDOR is equivalent to saying $\delta = 0$, hence $H$ is the whole vector space where the canonical statistic lives, hence the region satisfying \@ref(eq:confreg-geyer) is the whole parameter space. ## Confidence Intervals for LCM when it is Not Completely Degenerate When the LCM is not completely degenerate, it is an exponential family for which the MLE exists and the usual asymptotics may hold (if there is enough data after conditioning to make the LCM). So we use that. Let $l_\text{LCM}$ be the log likelihood function for the LCM. Then we make a confidence region that is the set of $\beta$ such that \begin{equation} 2 \bigl[ l_\text{LCM}(\hat{\beta}) - l_\text{LCM}(\beta) \bigr] \le \text{crit} (\#eq:confreg-lcm) \end{equation} where $\text{crit}$ is chosen to be an appropriate chi-square critical value with degrees of freedom chosen to give the desired coverage, and where in case the MLE exists in the OM we say LCM = OM, so the log likelihood in \@ref(eq:confreg-lcm) is the log likelihood of the OM. One uses degrees of freedom equal to the number of identifiable parameters in the LCM to get simultaneous coverage. One uses degrees of freedom equal to one if one doesn't care about simultaneous coverage. So much, so conventional. 
But things get a little stranger when we use the same parameterization for both of these kinds of confidence regions. If we use the parameter $\beta$ of the OM for both, then the region satisfying \@ref(eq:confreg-lcm) is also unbounded. The GDOR of the OM, at least, is a DOC of the LCM, so the region satisfying \@ref(eq:confreg-lcm) is unbounded in the direction of the GDOR and also in the opposite direction, and perhaps other directions (the LCM can have more DOC than just the GDOR of the OM).

It makes no sense to use confidence regions of this kind when the LCM is completely degenerate. Then this confidence region does nothing at all because the LCM contains only one distribution, so every direction is a DOC of the LCM and $l_\text{LCM}$ is a constant function of $\beta$, hence the region satisfying \@ref(eq:confreg-lcm) is the whole parameter space.

## The Combination of the Two

To use both kinds of information, we use the confidence region consisting of $\beta$ that satisfy both \@ref(eq:confreg-geyer) and \@ref(eq:confreg-lcm), except when the MLE exists in the OM, when we take the region satisfying \@ref(eq:confreg-lcm) only, or except when the LCM is completely degenerate, when we take the region satisfying \@ref(eq:confreg-geyer) only.

In case the LCM is partially degenerate and we are using both kinds of confidence regions, we have to correct for multiple testing. Because \@ref(eq:confreg-geyer) only involves the probability of the event $Y \in H$ and \@ref(eq:confreg-lcm) only involves the probability distribution of $Y$ given the event $Y \in H$, probabilities multiply (joint equals marginal times conditional). Thus if the region satisfying \@ref(eq:confreg-lcm) has confidence level $1 - \alpha_1$ and the region satisfying \@ref(eq:confreg-geyer) has confidence level $1 - \alpha_2$, then the confidence level of the combination is $(1 - \alpha_1) (1 - \alpha_2)$. And we need to adjust $\alpha_1$ and $\alpha_2$ to get the desired level. Usually we just take both $1 - \alpha_1$ and $1 - \alpha_2$ to be the square root of the desired level, except in case the MLE exists in the OM and we take $1 - \alpha_1$ to be the desired level or in case the LCM is completely degenerate and we take $1 - \alpha_2$ to be the desired level.

## Confidence Regions to Confidence Intervals

For any function $g(\beta)$ we can form a confidence interval for this parameter by minimizing and maximizing $g(\beta)$ over the region satisfying both \@ref(eq:confreg-geyer) and \@ref(eq:confreg-lcm).

# Quasi-Complete Separation Example of Agresti

```{r agresti-quasi-clean,echo=FALSE}
rm(list = setdiff(ls(), c("level", "do.curve", "logl.factory", "logl.gradient.factory", "save.complete")))
```

## Data {#data-quasi}

Section 6.5.1 of @agresti introduces the notion of quasi-complete separation with the following example.

```{r agresti-quasi-data}
x <- seq(10, 90, 10)
x <- c(x, 50)
y <- as.numeric(x > 50)
y[x == 50] <- c(0, 1)
data.frame(x, y)
```

Figure \@ref(fig:agresti-quasi-data-plot) shows these data.

```{r agresti-quasi-data-plot, echo = FALSE, fig.align = "center", fig.cap = "Logistic Regression Data for Quasi-Complete Separation Example of Agresti"}
plot(x, y)
```

## Fit Model

```{r agresti-quasi-llmdr}
lout <- llmdr(y ~ x, family = "binomial")
summary(lout)
```

Now we see that the model is not completely degenerate (because the `Estimates` column of the printout is not all `NA`) but is partially degenerate (because the printout has a `GDOR` column).
That's what Agresti calls quasi-complete separation. As usual mean value parameters are easier to interpret. ```{r agresti-quasi-means} data.frame(x, y, mu = estimates(lout, "mu")) ``` All of the components of the MLE $\hat{\mu}$ are on the boundary except those corresponding to $x = 50$ where we have one success and one failure and predict $\hat{\mu}_i = 0.5$. ## Valid Region for Beta {#valid-quasi} ```{r agresti-quasi-geyer,cache=TRUE} # names here are odd, these are the model matrix and response vector # for the data that are fixed at their observed values in the LCM modmat.lcm <- as.matrix(lout$x[x != 50, ]) y.lcm <- y[x != 50] logl <- logl.factory(modmat.lcm, y.lcm) logl.gradient <- logl.gradient.factory(modmat.lcm, y.lcm) level1 <- level2 <- sqrt(level) level1 alpha <- 1 - level2 crit <- (- log(alpha)) crit beta.geyer <- do.curve(function(beta) crit + logl(beta), logl.gradient, cutoff = -1000) ``` ```{r agresti-quasi-geyer-plot,fig.align="center",fig.cap="Confidence Region Consisting of $\\beta$ satisfying \\@ref(eq:confreg-geyer). Arrow is GDOR"} foo <- beta.geyer[,1] > -8 plot(beta.geyer[,1], beta.geyer[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.geyer[foo,1]), ylim = range(beta.geyer[foo,2]), type = "n") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) ``` The blue curves in Figures \@ref(fig:agresti-complete-geyer-plot) and \@ref(fig:agresti-quasi-geyer-plot) are different because the confidence levels are different. Now we want to add the region of $\beta$ satisfying \@ref(eq:confreg-lcm). First we get the likelihood-based confidence region for the LCM. Here we can do that the easy way using R function `confint` from R package `MASS`. ```{r agrest-quasi-mass-confint} gout <- glm(c(0, 1) ~ 1, family = "binomial") summary(gout) cout <- confint(gout, level = level1) cout ``` So now we know the likelihood-based confidence region for $\beta_1$ is $`r round(cout[1], 2)` < \beta_1 < `r round(cout[2], 2)`$ when we constrain $\beta_2 = 0$. (R function `summary` says `NA` for coefficients it constrains to be zero.) Now we know that the DOR for the OM is the only DOC for the LCM because we have only one non-identifiable parameter for the LCM ($\beta_2$) so only one DOC. Thus our confidence region of $\beta$ satisfying \@ref(eq:confreg-lcm) is the set of all $\beta$ of the form $$ (\beta_1, 0) + s \delta $$ with $`r round(cout[1], 2)` < \beta_1 < `r round(cout[2], 2)`$ and $s$ any real number and $\delta$ the GDOR. Its boundaries are the straight lines $$ (\pm `r round(cout[2], 2)`, 0) + s \delta $$ So add them to our plot. ```{r agresti-quasi-both-plot,fig.align="center",fig.cap="Confidence Regions. Blue: $\\beta$ satisfying \\@ref(eq:confreg-geyer). Pink: $\\beta$ satisfying \\@ref(eq:confreg-lcm). 
Arrow is GDOR"} plot(beta.geyer[,1], beta.geyer[,2], xlab = expression(beta[1]), ylab = expression(beta[2]), xlim = range(beta.geyer[foo,1]), ylim = range(beta.geyer[foo,2]), type = "n") lines(beta.geyer[,1], beta.geyer[,2], col = "blue") arrows(-5, 0.125, -5 + 2 * lout$gdor[1], 0.125 + 2 * lout$gdor[2]) gdor <- lout$gdor slope <- gdor[2] / gdor[1] intercept.lower <- (- slope * cout[1]) intercept.upper <- (- slope * cout[2]) abline(intercept.lower, slope, col = "pink3") abline(intercept.upper, slope, col = "pink3") ``` Our confidence region for $\beta$ is the intersection of the two regions, the region inside the blue curve and between the region between the pink lines in Figure \@ref(fig:agresti-quasi-both-plot). The lower pink line does eventually intersect the blue curve, but that is off the plot to the left. ## Intervals for Beta {#beta-quasi} So confidence intervals for $\beta$ are ```{r agresti-quasi-beta} beta <- matrix(NA_real_, 2, 2) colnames(beta) <- c("lower", "upper") rownames(beta) <- names(gdor) beta[1, 1] <- -Inf beta[2, 2] <- Inf beta[1, 2] <- max(beta.geyer[ , 1]) beta[2, 1] <- min(beta.geyer[ , 2]) beta ``` We won't bother to make these intervals more precise using optimization. [They are almost uninterpretable anyway](expfam.html#alice). Clearly they present a lot less information than Figure \@ref(fig:agresti-quasi-both-plot). ## Intervals for Theta {#theta-quasi} First we need to find the intersection of our two confidence regions. ```{r agresti-quasi-beta-both} nrow(beta.geyer) beta.upper <- beta.geyer[ , 1] * slope + intercept.upper is.up <- beta.geyer[ , 2] > beta.upper range(which(is.up)) beta.lower <- beta.geyer[ , 1] * slope + intercept.lower is.dn <- beta.geyer[ , 2] < beta.lower range(which(is.dn)) beta.both <- beta.geyer beta.both[is.up, 2] <- beta.geyer[is.up, 1] * slope + intercept.upper beta.both[is.dn, 2] <- beta.geyer[is.dn, 1] * slope + intercept.lower ``` Then we map to $\theta$ ```{r agresti-quasi-theta-both} modmat <- as.matrix(lout$x) theta.both <- tcrossprod(modmat, beta.both) theta <- apply(theta.both, 1, range) theta <- t(theta) colnames(theta) <- c("lower", "upper") theta # fix up infinite by hand theta[x < 50, "lower"] <- -Inf theta[x > 50, "upper"] <- Inf theta ``` We might as well polish these up with optimization because we need the R function `do.theta.intervals` defined in Section \@ref(intervals-for-theta-and-mu) above modified to do this problem in order to do the rest of our analysis of this example. ```{r agresti-quasi-theta-better} # names here are odd, these are the model matrix and response vector # for the data that are free in the LCM modmat.lcm <- as.matrix(lout$x[lout$lcm.support, ]) y.lcm <- y[lout$lcm.support] modmat.lcm y.lcm logl.lcm <- logl.factory(modmat.lcm, y.lcm) logl.lcm.gradient <- logl.gradient.factory(modmat.lcm, y.lcm) beta.hat.lcm <- c(0, 0) crit2 <- crit crit1 <- qchisq(level1, df = sum(! 
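# the degrees of freedom here (the number of non-NA coefficients) are the
# number of identifiable parameters of the LCM; crit1 is half the chi-square
# critical value, so the constraint logl.lcm(beta) - logl.lcm(beta.hat.lcm) +
# crit1 >= 0 used below is the likelihood region (confreg-lcm), while
# crit2 = - log(alpha) makes logl(beta) + crit2 >= 0 the same as
# pr(Y in H) >= alpha, the region (confreg-geyer)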
is.na(lout$coefficients))) / 2
crit1
crit2
# now constraint function for optimization has both constraints
hin <- function(beta) c(logl(beta) + crit2, logl.lcm(beta) - logl.lcm(beta.hat.lcm) + crit1)
hin.jac <- function(beta) rbind(logl.gradient(beta), logl.lcm.gradient(beta))
i <- sample(which(is.up), 1)
beta.both[i, ]; hin(beta.both[i, ]); hin.jac(beta.both[i, ])
jacobian(hin, beta.both[i, ])
a.little.up <- c(0, 1e-4)
a.little.dn <- c(0, -1e-4)
hin(beta.both[i, ] + a.little.up); hin(beta.both[i, ] + a.little.dn)
```

It seems a bit odd that neither of the constraint values returned by R function `hin` is exactly zero and that the closest is $`r hin(beta.both[i, ])[2]`$, but that is what inexact computer arithmetic and the default convergence tolerances for R function `auglag` do. If we take that number to correspond to a zero using infinite precision arithmetic, then this is what we expect. We are on the pink boundary where the second constraint is zero.

Oh. The inexactness here might be in where the pink lines are drawn, which might be the fault of R function `confint.glm` in R package `MASS`. We won't sort out who is to blame, because we will now be using R function `hin` to say when we are in the confidence region for $\beta$. We won't use the pink lines any more.

The last two lines show that if we go a little up, then we are outside the confidence region (one component of constraint function negative), whereas if we go a little down, then we are inside (both components positive).

```{r agresti-quasi-theta-better-too}
i <- sample(which(is.dn), 1)
beta.both[i, ]; hin(beta.both[i, ]); hin.jac(beta.both[i, ])
jacobian(hin, beta.both[i, ])
hin(beta.both[i, ] + a.little.up); hin(beta.both[i, ] + a.little.dn)
```

Again, looks OK, except for inexactness of something or other. Now a little up is inside and a little down is outside.

```{r agresti-quasi-theta-better-too-too}
i <- sample(which((! is.dn) & (! is.up)), 1)
beta.both[i, ]; hin(beta.both[i, ]); hin.jac(beta.both[i, ])
jacobian(hin, beta.both[i, ])
```

Now this is a point on the blue curve, and here the first constraint is nearly zero, and the jacobian looks OK here too. So it looks like our constraint functions and their gradients are OK.
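As an additional check, not done in the original analysis, one could compare the analytic Jacobian `hin.jac` with the numerical one from R function `jacobian` (R package `numDeriv`) at every point of `beta.both` at once and look at the worst discrepancy.

```{r agresti-quasi-jacobian-check, eval = FALSE}
# worst discrepancy between analytic and numerical Jacobians over all rows
# of beta.both (a sketch, not evaluated here)
max(apply(beta.both, 1, function(beta)
    max(abs(hin.jac(beta) - jacobian(hin, beta)))))
```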
```{r agresti-quasi-theta-better-too-too-too} do.theta.intervals <- function(x) { foo <- cbind(1, x) %*% t(beta.both) gdor <- lout$gdor lower <- rep(-Inf, length(x)) upper <- rep(+Inf, length(x)) zapsmall <- function(x) if (abs(x) < sqrt(.Machine$double.eps)) 0 else x for (i in seq(along = x)) { eta <- sum(c(1, x[i]) * gdor) eta <- zapsmall(eta) if (eta <= 0) { # need upper bound idx <- which(foo[i, ] == max(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.both[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = hin, hin.jac = hin.jac, control.outer = list(trace = FALSE), control.optim = list(fnscale = -1)) # looks like we have to accept 9, which is # Convergence due to lack of progress in parameter updates stopifnot(aout$convergence %in% c(0, 9)) upper[i] <- aout$value } if (eta >= 0) { # need lower bound idx <- which(foo[i, ] == min(foo[i, ])) idx <- idx[1] # just in case of ties beta.start <- beta.both[idx, ] aout <- auglag(beta.start, fn = function(beta) sum(c(1, x[i]) * beta), gr = function(beta) c(1, x[i]), hin = hin, hin.jac = hin.jac, control.outer = list(trace = FALSE)) # looks like we have to accept 9, which is # Convergence due to lack of progress in parameter updates stopifnot(aout$convergence %in% c(0, 9)) lower[i] <- aout$value } } cbind(lower, upper) } ``` ```{r agresti-quasi-theta-better-too-too-too-too} theta <- do.theta.intervals(sort(unique(x))) rownames(theta) <- sort(unique(x)) theta ``` ## Intervals for Mu {#mu-quasi} And we can map these to the mean value parameter. ```{r agresti-quasi-mu-both} mu <- 1 / (1 + exp(- theta)) mu ``` And plot them. ```{r agresti-quasi-mu-plot,fig.align="center",fig.cap="Confidence intervals for Mean Value Parameters. Hollow circles are MLE."} errbar(sort(unique(x)), as.numeric(sort(unique(x)) > 50), mu[ , 2], mu[ , 1], xlab = "x", ylab = "probability") # add the other data point for x = 50 points(50, 1) ``` ## Submodel Canonical Statistic {#statistic-quasi} This is like Section \@ref(submodel-canonical-statistic) except for the difference in the data. ```{r agresti-quasi-all-possible-y} yy <- matrix(0:1, nrow = 2, ncol = length(x)) colnames(yy) <- paste0("y", seq(along = x)) yy <- expand.grid(as.data.frame(yy)) head(yy) dim(yy) ``` Get distinct possible values of the submodel canonical statistic and the observed values. ```{r agresti-quasi-all-possible-m-transpose-y} m <- cbind(1, x) mtyy <- t(m) %*% t(yy) t1 <- mtyy[1, ] t2 <- mtyy[2, ] t1.obs <- sum(y) t2.obs <- sum(x * y) ``` And make the plot. ```{r agresti-quasi-submodel-sufficient,fig.align="center",fig.cap="Possible values of the submodel canonical statistic vector $M^T y$ for the data shown in Figure \\@ref(fig:agresti-quasi-data-plot). Solid dot is the observed value. 
Line is hyperplane \\@ref(eq:h) determined by the GDOR."} t.key <- paste(t1, t2, sep = ":") t.idx <- match(unique(t.key), t.key) t1.uniq <- t1[t.idx] t2.uniq <- t2[t.idx] par(mar = c(5, 5, 1, 0) + 0.1) plot(t1.uniq, t2.uniq, xlab = expression(t[1] == sum(y)), ylab = expression(t[2] == sum(x * y))) points(t1.obs, t2.obs, pch = 19) # add H gdor <- lout$gdor # rotate 90 degrees to convert normal vector to tangent vector tanv <- rev(gdor) * c(-1, 1) tanv # now convert to slope and intercept of line slope <- tanv[2] / tanv[1] intercept <- t2.obs - t1.obs * slope abline(intercept, slope) ``` This differs from Figure \@ref(fig:agresti-complete-submodel-sufficient) in that now the support of the LCM, the points that lie on the hyperplane $H$, which is the line in the figure, has three points rather than one. Hence the LCM is not completely degenerate (concentrated at one point) but rather only *partially degenerate*. But otherwise the picture is much the same. The MLE does not exist in the OM because the observed data (for the submodel canonical sufficient statistic) is on the boundary of the convex hull of the set of its possible values. ## Region for Tau {#tau-quasi} This is like Section \@ref(region-for-tau) except for the difference in data. ```{r agresti-quasi-tau-region,fig.align="center",fig.cap="Same as Figure \\@ref(fig:agresti-quasi-submodel-sufficient) Except Gray Region is 95% Confidence Region for Submodel Mean Value Parameter"} theta <- tcrossprod(modmat, beta.both) mu <- 1 / (1 + exp(- theta)) tau <- crossprod(modmat, mu) # add observed value to these points tau <- cbind(c(t1.obs, t2.obs), tau) par(mar = c(5, 6, 1, 0) + 0.1) plot(t1.uniq, t2.uniq, xlab = expression(t[1] == sum(y)), ylab = expression(t[2] == sum(x * y)), type = "n") polygon(tau[1, ], tau[2, ], border = NA, col = "gray") points(t1.uniq, t2.uniq) points(t1.obs, t2.obs, pch = 19) abline(intercept, slope) mtext(expression(tau[1]), side = 1, line = 4) mtext(expression(tau[2]), side = 2, line = 5) ``` ## Quasi-Complete Separation {#separation-quasi} This is like Section \@ref(complete-separation) except for the difference in data. ```{r agresti-quasi-complete-separation,fig.align="center",fig.cap="Quasi-Complete Separation of Successes and Failures. Black dots successes. White dots failures. Gray dot, one success and one failure. Black line separating hyperplane"} dim(modmat) plot(modmat[ , 1], modmat[ , 2], xlab = colnames(modmat)[1], ylab = colnames(modmat)[2]) intercept <- 50 - 1 * slope abline(intercept, slope) points(modmat[y == 1, 1], modmat[y == 1, 2], pch = 19) points(modmat[x == 50, 1], modmat[x == 50, 2], pch = 19, col = "gray") ``` ## Confidence Intervals for Hypothetical Individuals {#hypothetical-quasi} This is like Section \@ref(confidence-intervals-for-hypothetical-individuals) except for the difference in data. 
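One detail is worth making explicit before the computation (this check is not in the original analysis): which side of the interval is unbounded for a hypothetical individual with predictor value $x$ is determined by the sign of $\langle (1, x), \delta \rangle$, where $\delta$ is the GDOR. Negative means the lower end is $-\infty$ on the logit scale (zero on the probability scale), positive means the upper end is $+\infty$ (one on the probability scale), and zero, which happens only at $x = 50$ up to rounding, means both ends are finite. A minimal sketch, using a few made-up $x$ values:

```{r agresti-quasi-hypothetical-sides, eval = FALSE}
# sign of <(1, x), gdor> for a few hypothetical individuals (a sketch)
x.hyp <- c(25, 50, 75)
cbind(x = x.hyp, eta.gdor = as.numeric(cbind(1, x.hyp) %*% lout$gdor))
```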
```{r agresti-quasi-hypothetical,cache=TRUE} xx <- seq(0, 100, 0.1) # be especially careful at jump xx <- c(xx, 49.99, 50.01) xx <- sort(xx) theta <- do.theta.intervals(xx) mu <- 1 / (1 + exp(- theta)) # have to do by hand since optimization isn't good enough mu[xx < 50, 1] <- 0 mu[xx > 50, 2] <- 1 ``` ```{r agresti-quasi-hypothetical-plot,fig.align="center",fig.cap="Like Figure \\@ref(fig:agresti-quasi-mu-plot) Except with Confidence Intervals for Every $x$ value"} plot(x, y, type = "n", xlab = expression(x), ylab = expression(mu(x))) xxx <- c(xx, rev(xx)) yyy <- c(mu[ , 1], rev(mu[ , 2])) length(xxx); length(yyy) polygon(xxx, yyy, border = NA, col = "gray") points(x, y) ``` That finishes our work on this toy problem. # Theory of Hypothesis Tests The theory of hypothesis tests propounded in @geyer-gdor is very simple, once you understand it. @geyer-gdor attributes it to Stephen Fienberg (who outlined this theory in answer to a question from your humble author asked in the questions after a talk Fienberg was giving). The main point is that in the theory of hypothesis testing, the distribution of the data used for calculations is the distribution *under the null hypothesis*. (This ignores power calculations.) Thus the LCM for the alternative hypothesis is irrelevant. (And it would be irrelevant even if we were doing power calculations, because those only involve alternatives near the null hypothesis.) The next most important point is that the MLE distribution *under the null hypothesis* is a conditional distribution (if you include conditioning on nothing, which is the same as not conditioning, in case no GDOR exists and the MLE is in the OM). Thus we have to fit the null and alternative hypotheses *using the same conditioning* for each. Everything else is straightforward. It is just the usual theory of hypothesis tests associated with maximum likelihood estimation (Wilks, Rao, Wald) applied to this conditional model. Actually, since Wald tests only use the MLE for the alternative hypothesis, they would only tell us about LCM for the alternative, so they don't make much sense except in the output of R function `summary`. It follows that we do not even need the MLE calculated by R function `llmdr` for the *alternative hypothesis* (for Wilks or Rao). But we can use the deviance it calculates. It follows that we have to refit the *alternative hypothesis* using the conditioning for the *null hypothesis*. The method of R generic function `anova` for objects of class `llmdr` does this. The appropriate degrees of freedom for the *alternative hypothesis* will be the degrees of freedom of this refitted model. And these degrees of freedom will not be either of the degrees of freedom in the object returned by R function `llmdr` when it calculates the MLE for the *alternative hypothesis*. Thus a model has many possible degrees of freedom * One is the number of identifiable parameters of the OM. This is the degrees of freedom one should use for AIC and BIC (maybe, for a counterargument, see the [section about this below](#aic-and-bic-and-lcm)). * Another is the number of identifiable parameters of the LCM. * Yet another is the number of identifiable parameters of the model when considered as an alternative hypothesis. This depends on what the null hypothesis is. In general, it can be anything between the two degrees of freedom listed above. In summary, our theory of hypothesis testing is fairly conventional. The test statistic (Wilks, Rao, or Wald) still has an asymptotic chi-square distribution. 
But, unless the MLE for the null hypothesis is in its OM, the *degrees of freedom are different* from conventional. So we just have to remember that the degrees of freedom for the alternative hypothesis are different from conventional and depend on what the null hypothesis is.

# Categorical Data Example of Geyer

```{r geyer-categorical-cleanup,echo=FALSE}
rm(list = setdiff(ls(), "save.complete"))
```

This is the example in Section 2.3 of @geyer-gdor. It is a seven-dimensional contingency table, each dimension having two values, so the number of cells of the table is $2^7 = `r 2^7`$. These data are found in dataset `catrec` in R package `llmdr`

```{r geyer-categorical-data}
data(catrec)
class(catrec)
dim(catrec)
names(catrec)
```

The response (vector of cell counts) is `y`. The other variables are the indicators for each dimension.

@geyer-gdor fits the model with all possible three-way interactions and no higher interactions

```{r geyer-categorical-fit}
lout <- llmdr(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^3,
    family = "poisson", data = catrec, proofs = 2)
length(lout$coefficients)
sum(is.na(lout$coefficients))
```

We don't print out the summary, because it is too long and is about [meaningless parameters](expfam.html#alice) anyway.

We notice that the dimension of the LCM is one less than the dimension of the OM, which means the GDOR points in a unique direction (every GDOR for this problem points in the same direction). This GDOR has nonnegative components

```{r geyer-categorical-show-gdor}
# did proofs = 2 above to get GDOR with exactly correct zeros
cbind(lout$gdor[lout$gdor != 0])
```

This agrees with Table 1 in @geyer-gdor.

@geyer-gdor considers the following hypothesis tests.

```{r geyer-categorical-anova}
lout0 <- llmdr(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^2,
    family = "poisson", data = catrec)
lout2 <- llmdr(y ~ (v1 + v2 + v3 + v4 + v5 + v6 + v7)^4,
    family = "poisson", data = catrec)
anova(lout0, lout, lout2)
```

This says the model with all three-way interactions and no higher-order interactions, which has $\text{OM} \neq \text{LCM}$, fits the data much better than the model with only two-way interactions. And this model fits just as well as the model with all four-way interactions.

Note that in line two of the printout, the degrees of freedom are different for the three-way interactions model, depending on whether it has the role of null hypothesis or alternative hypothesis.

There are, of course, many, many more models to consider. But, as this is being written, R function `glmbb` in R package `glmbb` [@glmbb-package] uses up all memory the computer has and then crashes when applied to this problem, and R functions `add1`, `drop1`, and `step` do not have methods for objects returned by R function `llmdr`. So we could do whatever hypothesis tests we can think of, but we will have to wait until R package `llmdr` is more complete or R package `glmbb` is fixed to use less memory in order to search among the very many models for these data.

# Multinomial and Product Multinomial Sampling

Unlike R function `glm`, the new function `llmdr` also does multinomial and product multinomial sampling. Let's see how that works. This is a competitor for "response" in scare quotes, which we discussed in the [notes accompanying Agresti, Chapter 8](ch8.html).

## Alligators

```{r geyer-alligators-cleanup,echo=FALSE}
rm(list = setdiff(ls(), "save.complete"))
```

Let's redo an [example previously done in those notes](ch8.html#multinomial-response).
```{r alligators,error = TRUE}
data(table_8.1)
alligator <- transform(table_8.1,
    lake = factor(lake, labels = c("Hancock", "Oklawaha", "Trafford", "George")),
    gender = factor(gender, labels = c("Male", "Female")),
    size = factor(size, labels = c("<=2.3", ">2.3")),
    food = factor(food, labels = c("Fish", "Invertebrate", "Reptile", "Bird", "Other")))
names(alligator)
lout <- llmdr(count ~ food * (size + gender + lake),
    strata = ~ size * lake * gender,
    family = "multinomial", data = alligator)
summary(lout)
```

This is the Right Thing. It does not show any coefficients for regressors in `strata` but does condition on them. The estimates and their standard errors are those for product multinomial sampling.

But this doesn't have anything to do with the subject of these notes except that we have proved that the MLE exists in the OM (something R function `glm` cannot do).

But when we ran R function `glmbb` on this, we had warnings.

```{r alligators-glmbb}
gout <- glmbb(big = count ~ lake * gender * size * food,
    little = ~ lake * gender * size + food,
    family = poisson, data = alligator)
```

What were the models for which it should have given warnings?

```{r alligators-warnings}
foo <- gout$envir
rm(min.crit, envir = foo)
bar <- eapply(foo, function(x) x$formula)
lout <- lapply(bar, function(x) llmdr(x, family = "poisson", data = alligator))
is.lcm <- sapply(lout, function(x) x$is.lcm)
sum(is.lcm)
```

(We use `family = "poisson"` here to match the formulas `glmbb` uses, and then we don't need argument `strata`.)

R function `glm` should have given `r sum(is.lcm)` warnings rather than two! So this is proof that its idea of warnings is completely inadequate. It gives lots of false negatives. It can also give false positives, but we haven't shown an example of that yet.

Here are the models with LCM

```{r alligator-lcm-show}
sapply(bar[is.lcm], tidy.formula.hierarchical)
```

Eventually we need to make R package `glmbb` use R function `llmdr` to fit models rather than R function `glm`. But that is not done yet.

## A Digression about AIC and BIC and LCM {#aic-and-bic-and-lcm}

### AIC

To understand the issues here we need to do a little bit of theory. The mathematical argument that leads to AIC assumes the "usual asymptotics of maximum likelihood". An MLE at infinity (for canonical parameters, on the boundary for mean value parameters) means those assumptions are wrong. On the other hand, those "usual asymptotics of maximum likelihood" may apply to the LCM if the conditioning that produces the LCM leaves enough components of the response vector random that the sample size can be considered "large" and the parameter estimates in the LCM are approximately normal.

When the LCM is completely degenerate, so it has no parameters to estimate, the asymptotic argument does still apply if we consider the chi-square distribution on zero degrees of freedom to be the distribution of the constant random variable always equal to zero (which R does)

```{r aic-chi-square-zero}
pchisq(0, df = 0)
pchisq(1e-16, df = 0)
```

It follows that the correct $p$ to use in the definition
$$
   \text{AIC} = \text{deviance} + 2 p
$$
is what R function `llmdr` calls the degrees of freedom for the LCM, component `df` of the result it returns.

There really cannot be any argument about this if you follow the theory. AIC is trying to be an unbiased estimate of Kullback-Leibler information (times two).
On or near the boundary of the mean value parameter space the Kullback-Leibler information function is flat or nearly flat in the same directions as the log likelihood function; there are only `df` dimensions in which it is strongly curved and can be overestimated. For an explicit example where this is clearly correct, let us redo the [complete separation example of Agresti](#complete-separation-example-of-agresti) just doing AIC. Recall this analysis of deviance table ```{r aic-agresti-complete-anova} save.complete$anova ``` Now we want AIC. We claim that the correct AIC are ```{r aic-agresti-complete} with(save.complete$null, deviance + 2 * df) with(save.complete$alternative, deviance + 2 * df) ``` This is not what R function `glm` and friends (methods for generic functions that handle objects of class `"glm"`) calculates ```{r aic-agresti-complete-glm,error=TRUE} with(save.complete, AIC(glm(null$formula, family = "binomial", data = data))) with(save.complete, AIC(glm(alternative$formula, family = "binomial", data = data))) ``` Of course either calculation says the model with formula `r deparse(save.complete$alternative$formula)` is best, but that's not the point. The point is that we have issues with how to calculate AIC. This section recommends one thing. R function `glm` and friends do another. There is no question that R function `glm` and friends are theoretically incorrect. The AIC argument says that it is a bias-corrected estimate of Kullback-Leibler information (times two). It corrects for overfitting. But the model with formula `r deparse(save.complete$alternative$formula)` cannot overfit. All randomness is gone in a completely degenerate LCM. But I can imagine the counterarguments. Those arguments ignore the math behind AIC and only pay attention to the way it is usually used. Sparsity fans will argue that we are really looking for sparsity and AIC is a penalty that rewards sparsity, and if we redefine AIC the way we recommend here, that won't any longer be rewarding sparsity. And that was what was always good about AIC (never mind what theory says). So I understand that a lot of people might not like what this section says (mathematically correct or no). They will continue to use their grandfather's version of AIC. ### BIC #### n BIC has the same issue as AIC, in the definition $$ \text{BIC} = \text{deviance} + \log(n) \cdot p $$ $p$ has to be the dimension of the LCM to make the mathematical argument behind BIC correct (with the same caveat that there has to be enough data left in the LCM for estimates in the LCM to be approximately normally distributed, except when the LCM is completely degenerate and has no data). BIC also has the other issue about $n$, which is discussed in the help page for R function `glmbb` which we quote > It is unclear what the sample size, the $n$ in the BIC penalty > $n \log(p)$ should be. Before version 0.4 of this package the BIC was > taken to be the result of applying R generic function `BIC` to the > fitted object produced by R function `glm`. This is generally wrong > whenever we think we are doing categorical data analysis > [@raftery; @kass-raftery]. Whether we consider the sampling scheme > to be Poisson, multinomial, or product multinomial (and binomial is > a special case of product multinomial) the sample size is the total > number of individuals classified and is the only thing that is > considered as going to infinity in the usual asymptotics for > categorical data analysis. 
Thus the option `BIC.option = "sum"` > should always be used for categorical data analysis. In more detail, if you are doing purely categorical data analysis (categorical response, all predictors categorical too, so the data can be reformatted as a contingency table) then you are not imagining the number of cells of the contingency table `length(y)` to be going to infinity, you are imagining the number of individuals classified `sum(y)` to be going to infinity. Thus `sum(y)` is the only choice for $n$ in the BIC formula that makes sense. So the cited literature says. Once we start thinking along these lines, it is unclear that `length(y)` which is what R function `glm` and friends do (with no option) ever makes sense. BIC was originally defined [@schwarz] for independent and identically distributed (IID) data. For that, it is clear what $n$ means: it is the number of IID random variables in the data. For categorical data analysis, that is the number of individuals classified (the individuals are IID if they are a random sample). BIC can be extended to non-IID data. @cavanaugh-neath give all the gory details. But they introduce $n$ in a completely artificial way in their assumptions. They imagine a sequence of statistical models with log likelihoods $l_n$ all having the same parameter $\theta$. Then they assume $$ \frac{1}{n} l_n(\theta) $$ converges (for all $\theta$, this is their assumption (5)). So how does that make sense for regression if we are assuming the dimension of the response vector is going to infinity, so the dimension of the predictor vectors must be going to infinity to match? That convergence assumption implies a very restrictive condition on model matrices of the models in the sequence of models. It is not clear that this story ever makes sense. This is the reason why your humble author wrote a paper [@geyer-no-n] that does asymptotics without $n$. But if you take the $n$ out of asymptotics, then BIC actually makes no sense because there is no $n$ to go into the formula. In summary, it is very unclear what $n$ should be in the BIC formula when we do not have IID data. In particular, it is not clear that what R function `glm` and friends do (in particular, R generic function `BIC` does when applied to R objects of class `"glm"`, using `length(y)` rather than `sum(y)` as $n$) ever makes sense. #### p The case is even worse for $p$. The derivation of BIC [@neath-cavanaugh] can use many different priors, but one possibility is flat within-model priors (as an approximation to Bayes factors, it does not depend on between model priors). And we know from our [notes on priors for exponential families](prior.html) that the posterior (and hence the Bayes factor) does not exist when the MLE does not exist (in the conventional sense) and a flat prior is used. Thus Bayes factors make no sense when one model being compared has solutions at infinity (the subject of this document). Or we might say that any model with solutions at infinity is *infinitely bad* when *flat priors are used*. But such priors are *improper* and make no philosophical sense. So this is not a problem with the models but rather with the whole idea of BIC. ## AIC for Alligators So back to alligators. We now think R is calculating AIC wrong for those models that have LCM. R does not use the same definitions of AIC and BIC that we do. It adds a constant to our definitions, that is, $$ \text{AIC} = \text{deviance} + \text{constant} + 2 p $$ and similarly for BIC. The constant doesn't matter. 
It drops out of comparisons and calculation of weights. But we will use the same constant so we don't have to change all of the AIC values we already have.

So we recalculate AIC for these models. Because we used different formulas for R function `llmdr` we need to change these AIC values by hand (eventually this should be automated). The whole issue is that we have different degrees of freedom output by R functions `glm` and `llmdr`, and the latter are correct (so we say). Here are those degrees of freedom.

```{r alligator-aic-sout}
df.glm <- unlist(eapply(foo, function(x) x$rank))
df.llmdr <- sapply(lout, function(x) x$df)
identical(names(df.glm), names(df.llmdr))
aic.glm <- unlist(eapply(foo, AIC))
aic.llmdr <- aic.glm + 2 * (df.llmdr - df.glm)
```

So now we have a problem: which models are within the cutoff (10, the default) of the minimum AIC differs between the two definitions, so we would need a different summary for each, but we only have a function to do one.

```{r min.aic}
min(aic.glm)
min(aic.llmdr)
```

We see that, since the new minimum AIC is more than the cutoff below the old minimum AIC, none of the models that were supposedly best under the old criterion are even on our new list. Because we did not use R function `llmdr` as the model fitting function for R function `glmbb`, we do not know that we didn't miss some models that have low AIC under the correct definition. So we don't bother to make a new table. That will have to wait until we fix R function `glmbb`.

It should come as a shock to fans of the old AIC that what is considered the best model changes, so it would change the model selection done by AIC. Of course, model selection is a [mug's game](https://www.merriam-webster.com/dictionary/mug%27s%20game), but, even if we do model averaging, the weights change, and not just a little bit but a whole lot.

## Other Ways to Specify Multinomial Data

R function `multinom` in R package `nnet` specifies multinomial response data in a different way. It has the response vector actually be a matrix or a factor. Then it doesn't need a `strata` argument. R function `llmdr` allows this too. We will illustrate the response being a factor in the next section. Here we show the response as a matrix.

```{r alligator-response-matrix-what}
head(alligator$food)
```

It looks like in these data food varies fastest, so we can reshape it into a matrix as follows.

```{r alligator-response-matrix}
food <- alligator$count
food <- matrix(food, byrow = TRUE, ncol = nlevels(alligator$food))
colnames(food) <- levels(alligator$food)
head(food)
```

But we have to check that this is actually right, that the other variables agree for each level of this new response matrix.

```{r alligator-predictors}
alligator.predictors <- alligator
alligator.predictors$count <- NULL
alligator.predictors$food <- NULL
names(alligator.predictors)
for (i in levels(alligator$food)) {
    foob <- subset(alligator.predictors, alligator$food == "Bird")
    fooi <- subset(alligator.predictors, alligator$food == i)
    rownames(foob) <- NULL
    rownames(fooi) <- NULL
    print(identical(foob, fooi))
}
alligator.predictors <- subset(alligator.predictors, alligator$food == "Bird")
```

```{r alligator-wide-too,echo=FALSE}
opt.save <- options(width=200)
```

Great! We now have the data we need.
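As one more sanity check (not in the original analysis), the row sums of `food` should be the product multinomial sample sizes, that is, the number of alligators classified in each lake-size-gender combination.

```{r alligator-response-matrix-rowsums, eval = FALSE}
# hypothetical extra check: row sums of the response matrix are the
# per-stratum sample sizes n_A
cbind(alligator.predictors, n = rowSums(food))
```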
```{r alligator-response-matrix-llmdr} lout <- llmdr(food ~ lake * size, family = "multinomial", data = alligator.predictors) summary(lout) ``` ```{r alligator-wide-too-restore,echo=FALSE} options(opt.save) ``` When you get the data in the right format, the analysis is a lot simpler, but getting the data in the right format can be a bit of a struggle. ## A Simple Multinomial Model with Completely Degenerate LCM ```{r multinomial-complete-cleanup,echo=FALSE} rm(list = ls()) ``` ```{r multinomial-complete-data} x <- seq(10, 90, 10) y <- rep(c("red", "green", "blue"), each = 3) y <- factor(y, levels = c("red", "green", "blue")) data.frame(x, y) ``` We fit these data as follows. ```{r multinomial-complete-mle} lout <- llmdr(y ~ x, family = "multinomial") summary(lout) ``` As usual, it is [hard to make sense of canonical parameters](expfam.html#alice), but from the fact that the `Estimates` column is all `NA` we see the LCM is completely degenerate. As usual, mean value parameters make more sense. ```{r multinomial-complete-mu} cbind(x, estimates(lout, type = "mu")) ``` The low values of `x` are all red, the middle values all green, and the high values all blue. So the model can perfectly predict these data. ## A Simple Multinomial Model with Partially Degenerate LCM We change the data a little bit. ```{r multinomial-quasi-data} y[6] <- "blue" y[7] <- "green" y <- factor(y, levels = c("red", "green", "blue")) data.frame(x, y) ``` And refit ```{r multinomial-quasi-mle} lout <- llmdr(y ~ x, family = "multinomial") summary(lout) cbind(x, estimates(lout, type = "mu")) ``` Now we see that the LCM is only partially degenerate because the `Estimate` column is not all `NA` but the actual numbers in the `Estimate` and `GDOR` columns don't make much sense. But from the mean values it is clear what is going on. Now `x` fails to perfectly predict green and blue but still perfectly predicts red. The probability of green decreases with `x` and the probability of blue increases. Of course, it looks like with such a small amount of data that the parameters of the LCM are not statistically significant, either of them. But to make a conclusion about dropping two parameters we should actually fit both models and compare. But there doesn't seem to be any way to do that with the data in this form. It wants to use the same model matrix for each stratum, so we cannot keep `red` and `x:red` as predictors while dropping the rest. We can do that if we put the data into data frame form and use explicit strata. ```{r multinomial-quasi-redo} xx <- rep(x, times = 3) color <- rep(levels(y), each = length(x)) yy <- double(length(xx)) for (i in 1:length(x)) yy[xx == x[i] & color == y[i]] <- 1 data.frame(x = xx, color, y = yy) ``` Now the models we want to compare are ```{r multinomial-quasi-test-alt} lout.alternative <- llmdr(yy ~ xx * color, family = "multinomial", strata = as.factor(xx)) summary(lout.alternative) data.frame(x = xx, color, y = yy, mu = estimates(lout.alternative, "mu")) ``` Same mean values, so same model, just expressed differently. Now for the null model, which does have red, but not the other colors. ```{r multinomial-quasi-test-null} red <- as.numeric(color == "red") lout.null <- llmdr(yy ~ xx * red, family = "multinomial", strata = as.factor(xx)) summary(lout.null) data.frame(x = xx, color, y = yy, mu = estimates(lout.null, "mu")) ``` Right. 
Predicts red perfectly, says nothing about blue or green, where "says nothing" means zero on the logit scale, or $1/2$ on the probability scale, and this really does not say anything about blue or green, they are 50-50 given $x \ge 40$. And then we have the really, really null hypothesis ```{r multinomial-quasi-test-null.null} lout.null.null <- llmdr(yy ~ red, family = "multinomial", strata = as.factor(xx)) summary(lout.null.null) data.frame(x = xx, color, y = yy, mu = estimates(lout.null.null, "mu")) ``` So finally the test. ```{r multinomial-quasi-test} anova(lout.null.null, lout.null, lout.alternative) ``` These data really don't say anything about blue and green other than red versus non-red is predicted perfectly. And we cannot drop $x$ entirely. ## On the Baseline-Category Logit Parameterization R function `multinom` in R package `nnet` uses the baseline-category logit parameterization ```{r multinomial-quasi-nnet} mout <- multinom(y ~ x) summary(mout) round(predict(mout, type = "probs"), 7) estimates(lout, type = "mu") ``` Not too different. The differences are that R function `multinom` does not understand generic directions of recession, so can't do the right thing, and isn't even trying to do maximum likelihood, but rather what some neural net thinks is somewhat optimal. If you make allowances for extreme sloppiness, then it has the right answer, at least for the probabilities. But what we wanted to talk about is what @agresti, first page of Chapter 8, calls the baseline-category logit parameterization of the multinomial. In this example, it makes the canonical parameters $$ \log \frac{\pi_\text{color}(x)}{\pi_\text{red}(x)} $$ which makes the canonical parameters for anything red zero, hence they are dropped, which is why `summary(mout)` above does not mention red. This is what the exponential families notes call the [try II parameterization](expfam.html#try-ii). But R function `llmdr` does not use this parameterization. Instead it uses the symmetric parameterization, what the exponential families notes call the [try III parameterization](expfam.html#try-iii). Unlike baseline-category logit, this parameterization does not constrain any canonical parameters to be zero. Thus it is not identifiable. It is the parameterization you get when you start with Poisson sampling, and condition to get product multinomial sampling *without changing the canonical parameters*. Why? The baseline-category logit parameterization seems easy enough to understand, and is identifiable. There is the issue of arbitrariness of which category is chosen to be "baseline", but we ignore that. But in hindsight, considered in light of directions of recession, that parameterization seems to be a really dumb idea. * It amounts to constraining equal to zero parameters that need to go to infinity to maximize the likelihood. * Moreover, even if we move directions of recession to the baseline-category logit parameterization, that parameterization need not make the LCM identifiable. We see the latter in our toy problem. R function `nnet` chooses red to be "baseline" for arbitrary reasons. But in the LCM we see that red has been conditioned out of the LCM. The LCM only has data for blue or green and $x \ge 40$. Now trying to say that red is "baseline" seems almost maximally stupid. So OK. We cannot choose a baseline category until the LCM has already been determined. But we can then, right? No. 
It is possible to construct models having LCM in which no choice of multinomial response category to be "baseline" will make the LCM identifiable. So why this "baseline" category stuff anyway? R functions `lm` and `glm` and `llmdr` (which calls `glm.fit` to fit LCM) have no trouble figuring out which coefficients to "drop" to get an identifiable model. They don't need to be given an identifiable parameterization. So why fuss about a particular one that will only lead to insoluble problems in some instances? Other issues with the baseline-category logit parameterization are illustrated in the next section. ## Mapping Canonical to Mean Value Parameters For product multinomial data with response vector $y$ and strata $\mathcal{A}$, this means the subvectors $y_A$ for $A \in \mathcal{A}$ are the independent multinomial parts of $y$, the saturated model usual parameters are given in terms of the canonical parameters by $$ \pi_i = \frac{e^{\theta_i}}{\sum_{j \in A} e^{\theta_j}}, \qquad i \in A \in \mathcal{A} $$ (and there is always exactly one $A$ satisfying the condition of this equation because $\mathcal{A}$ partitions the index set of $y$). Then the mean value parameters are $$ \mu_i = n_A \pi_i, \qquad i \in A \in \mathcal{A}. $$ So to keep track of what is going on with the various parameterizations, we have to keep this non-identifiability of $\beta$ and $\theta$ in mind. ```{r multinomial-quasi-theta} theta <- estimates(lout, "theta") theta ``` If we really, really wanted red to be the baseline category, then we could do that. Then we would have zero for the whole red column. And the blue column would be `Inf` for $x \ge 40$. And then the green column would have to be `Inf + `r theta[4,2]``, `Inf + `r theta[5,2]``, and so forth for $x \ge 40$. What does that mean? It means these parameters go off to infinity at the same rate, in lockstep, keeping a constant difference, \begin{align*} \theta_{40\ \text{green}} & = \theta_{40\ \text{blue}} + `r theta[4, 2]` \\ \theta_{50\ \text{green}} & = \theta_{50\ \text{blue}} + `r theta[5, 2]` \\ \theta_{60\ \text{green}} & = \theta_{60\ \text{blue}} + `r theta[6, 2]` \\ \theta_{70\ \text{green}} & = \theta_{70\ \text{blue}} - `r -theta[7, 2]` \end{align*} and so forth. See what I mean when I say that the baseline-category logit parameterization causes far more pain than enlightenment? But if we keep to the symmetric parameterization that R function `llmdr` is using, then we have to know how to use it. We go from $\theta$ to $\pi$ in two steps. ```{r multinomial-quasi-theta-to-pi} foo <- exp(theta) foo ``` Then we have to make the probabilities in each stratum sum to one. Here the strata are the rows of the matrix. ```{r multinomial-quasi-theta-to-pi-too} bar <- rowSums(foo) mu <- sweep(foo, 1, bar, "/") mu ``` Of course, R function `estimates` does this for us, so we don't have to understand this to get the numbers, but this is what it is doing to get them. ```{r multinomial-quasi-theta-to-pi-too-too} estimates(lout, "pi") ``` ## DOR, DOC, and GDOR for Multinomial {#dor-multinomial} Usually `family = "multinomial"` means product multinomial, so we first need to start with the symmetric parameterization for product multinomial sampling [described above](#mapping-canonical-to-mean-value-parameters). 
We use a theorem described in Section 3.17 of @geyer-gdor, which says, if you can find DOR or GDOR for Poisson sampling, then they also work for multinomial or product multinomial sampling (actually @geyer-gdor leaves the product multinomial case as an exercise for the reader, but it is fully worked out in the design document for R package `llmdr` [@llmdr-package], which is found in every installation of the package). This theorem needs one idea. The indicator vectors for strata are vectors $u$ having coordinates $$ u_i = \begin{cases} 1, & i \in A \\ 0, & i \notin A \end{cases} $$ where $A \in \mathcal{A}$ are the sets of indices for strata, as in the [section on multinomial parameterization](#mapping-canonical-to-mean-value-parameters) above. Then the exact statement of the theorem is that the model matrix for Poisson sampling has to have all the columns of the model matrix for product multinomial sampling plus all the indicator vectors for multinomial strata as columns. Then every MLE for the Poisson model is also an MLE for the product multinomial model (when both use the same parameterization, that is, the product multinomial is using the symmetric parameterization) and every DOR for the Poisson model is a DOR for the product multinomial model. The only difference is that the parameterization for Poisson sampling is (usually) identifiable, and the parameterization for product multinomial sampling is never identifiable. Its directions of constancy are exactly the indicator vectors for strata described above. Thus we can immediately drop all components of the Poisson parameter vector that correspond to indicator vectors of strata, and similarly for components of DOR. They are not identifiable. This leaves us with MLE for $\beta$ and GDOR for product multinomial sampling. Thus we only know how to solve multinomial problems indirectly, by converting them to Poisson problems, and then converting back at the end. A much more complete characterization of product multinomial DOR, DOC, and GDOR is found in the design document for R package `llmdr` cited above. # Bibliography