--- title: "Stat 8054 Lecture Notes: Exponential Families" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: bookdown::html_document2: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML bookdown::pdf_document2: number_sections: true md_extensions: -tex_math_single_backslash linkcolor: blue urlcolor: blue --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. # Non-Reading This web page is the condensation of a lot of material I have put in courses and papers over the years. I record it all here, but I don't expect students to read any of it. Skip to Section \@ref(expfam) below to get past this bibliography. * [Geyer, 2010](http://hdl.handle.net/11299/57163). An article only "published" as a technical report titled *A Philosophical Look at Aster Models* which is a first version of much of this document. (Aster models, Geyer, Wagenius, and Shaw, 2007, cited below, are exponential family models complicated enough so that only the general theory of exponential families suffices.) * [Stat 8112 lecture notes](http://www.stat.umn.edu/geyer/8112/notes/expfam.pdf) about the "usual" asymptotics of maximum likelihood, which always hold for exponential families. * [Geyer, 2013, in the Eaton Festschrift](https://projecteuclid.org/euclid.imsc/1379942045) an article that says the "usual" asymptotics of maximum likelihood always hold for exponential families, even those with complicated dependence, like those used in spatial statistics and statistical genetics. * [Stat 9831, Fall 2016](http://www.stat.umn.edu/geyer/8931expfam/) a whole semester special topics course on exponential families. * [Stat 9831, Fall 2018](http://www.stat.umn.edu/geyer/8931aster/) a whole semester special topics course on aster models, which are exponential families with complicated dependence used in life history analysis (biology). * [Stat 8501, Fall 2018](http://www.stat.umn.edu/geyer/8501/) the stochastic process course, with two handouts on exponential families that are stochastic processes: * one on [spatial point processes](http://www.stat.umn.edu/geyer/8501/points.pdf) and * one on [spatial lattice processes](http://www.stat.umn.edu/geyer/8501/lattice.pdf). * [Geyer, 2009](https://projecteuclid.org/euclid.ejs/1239716414) an article about maximum likelihood in exponential families when the MLE does not exist in the classical sense, but does exist as a limit of probability distributions. * [Stat 5421, Fall 2016](http://www.stat.umn.edu/geyer/5421/) a master's level service course about categorical data analysis, which, of course, is all exponential family models * [notes on exponential families, part I](http://www.stat.umn.edu/geyer/5421/notes/expfam.pdf), * [notes on conjugate priors for exponential families](http://www.stat.umn.edu/geyer/5421/notes/prior.pdf), * [notes on exponential families, part II](http://www.stat.umn.edu/geyer/5421/notes/infinity.pdf), (MLE as limit of probability distributions). * [Stat 5102, Fall 2016](http://www.stat.umn.edu/geyer/5102/) master's level theoretical statistics. 
* [Geyer, 1990, PhD Thesis](http://hdl.handle.net/11299/56330), titled *Likelihood and Exponential Families*, has three papers' worth of stuff about exponential families.

    * Geyer (2009), cited above, improves theory in Chapters 2, 3, and 4 of the thesis.

    * Geyer and Thompson (1992), cited below, is Chapters 5 and 6 of the thesis.

    * Geyer (1991), cited below, is Chapter 7 of the thesis.

    * Geyer (1994), cited below, improves theory in Chapter 5 of the thesis.

* [Geyer and Thompson (1992)](https://doi.org/10.1111/j.2517-6161.1992.tb01443.x), an article that proposes Markov chain Monte Carlo maximum likelihood for exponential families without closed-form expressions for likelihoods and also expounds the maximum entropy argument.

* [Geyer (1994)](https://doi.org/10.1111/j.2517-6161.1994.tb01976.x), an article that improves the convergence theory in Chapter 5 of the thesis and in Geyer and Thompson (1992).

* [Geyer (1991)](https://doi.org/10.1080/01621459.1991.10475100), an article about inference with inequality constraints (like isotonic regression), exemplified by convex logistic regression.

* [Geyer and Møller (1994)](https://www.jstor.org/stable/4616323), an article containing a non-regular exponential family (the Strauss process).

* [Geyer, Wagenius, and Shaw (2007)](https://doi.org/10.1093/biomet/asm030), an article about a very complicated class of exponential family models called *aster models* that arise in life history analysis.

* [Shaw and Geyer (2010)](https://doi.org/10.1111/j.1558-5646.2010.01010.x), an article that uses multivariate monotonicity of the transformation from canonical to mean value parameters in a nontrivial way.

* [Barndorff-Nielsen (1978)](https://doi.org/10.1002/9781118857281), the fundamental reference book on the theory of exponential families.

* [Brown (1986)](https://projecteuclid.org/euclid.lnms/1215466757), another fundamental reference book on the theory of exponential families.

* [Rockafellar and Wets (1998)](https://doi.org/10.1007/978-3-642-02431-3), the fundamental reference on the theory of convex analysis and nonsmooth analysis. It supersedes Rockafellar (1970), which is the basis of most of the math underlying Barndorff-Nielsen (1978) and Brown (1986) and Geyer (PhD thesis). Your humble author took a special topics course based on draft chapters of this book from Terry Rockafellar in 1990 and mined it for years, using material from those draft chapters in Chapter 5 of the PhD thesis, in Geyer (1994, cited above), and in another Geyer (1994, [Annals of Statistics](https://projecteuclid.org/euclid.aos/1176325768)) that is about inequality-constrained inference like that in Geyer (1991, cited above). Corrected printings of this book contain corrections, simplifications, and additional comments, and should always be used. The latest is the third corrected printing, 2010.

# Exponential Families {#expfam}

We will use the following definition from Geyer (2009, cited above). A statistical model is an *exponential family of distributions* if it has a log likelihood of the form
\begin{equation}
   l(\theta) = \langle y, \theta \rangle - c(\theta)
   (\#eq:logl)
\end{equation}
where

* $y$ is a vector-valued statistic, which is called the *canonical statistic*,

* $\theta$ is a vector-valued parameter, which is called the *canonical parameter*, and

* $c$ is a real-valued function, which is called the *cumulant function*.
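As a concrete illustration (this example is mine, not part of the original development), here is a minimal sketch showing that the Poisson distribution has a log likelihood of this form, with canonical statistic $y$, canonical parameter $\theta = \log(\text{mean})$, and cumulant function $c(\theta) = e^\theta$. The chunk name and the particular numbers are arbitrary.

```{r poisson-canonical-form}
# Poisson log density for one observation y is
#    y * theta - exp(theta) - log(y!),   theta = log(mean)
# the last term does not involve theta, so it may be dropped, leaving
# <y, theta> - c(theta) with c(theta) = exp(theta)
y <- 3
theta <- seq(-1, 2, 0.5)
logl.expfam <- y * theta - exp(theta)
logl.dpois <- dpois(y, lambda = exp(theta), log = TRUE)
# difference is constant in theta (it is - log(y!))
all.equal(logl.dpois - logl.expfam, rep(- lfactorial(y), length(theta)))
```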
The notation $\langle \,\cdot\,, \,\cdot\, \rangle$ denotes a bilinear form that places the vector space where $y$ takes values and the vector space where $\theta$ takes values in duality. In equation \@ref(eq:logl) we have used the rule that additive terms in the log likelihood that do not contain the parameter may be dropped. Such terms have been dropped in \@ref(eq:logl).

You may object to the angle brackets notation as unfamiliar and not what you saw in some class and prefer some notation like $(y, \theta)$ or $y \cdot \theta$ or $y^T \theta$ or $\theta^T y$ or one of the latter with little t or prime for transpose. In your humble author's opinion, the angle brackets are superior because they make it clear that $\langle y, y \rangle$ or $\langle \theta, \theta \rangle$ is always obviously wrong, whereas $y^T y$ or $\theta^T \theta$ or the same in any other notation is not obviously wrong. The angle brackets notation comes from functional analysis.

Although we usually say "the" canonical statistic, "the" canonical parameter, and "the" cumulant function, these are not uniquely defined:

* any one-to-one affine function of a canonical statistic vector is another canonical statistic vector,

* any one-to-one affine function of a canonical parameter vector is another canonical parameter vector, and

* any real-valued affine function plus a cumulant function is another cumulant function

(see Section \@ref(casm) below for the definition of affine function). These possible changes of statistic, parameter, or cumulant function are not algebraically independent. Changes to one may require changes to the others to keep a log likelihood of the form \@ref(eq:logl) above. Usually no fuss is made about this nonuniqueness. One fixes a choice of canonical statistic, canonical parameter, and cumulant function and leaves it at that.

The cumulant function may not be defined by \@ref(eq:logl) above on the whole vector space where $\theta$ takes values. In that case it can be extended to this whole vector space by
\begin{equation}
   c(\theta) = c(\psi) + \log\left\{ E_\psi\bigl( e^{\langle y, \theta - \psi\rangle} \bigr) \right\}
   (\#eq:cumfun)
\end{equation}
where $\theta$ varies while $\psi$ is fixed at a possible canonical parameter value, and the expectation and hence $c(\theta)$ are assigned the value $\infty$ for $\theta$ such that the expectation does not exist.

The family is *full* if its canonical parameter space is
\begin{equation}
   \Theta = \{\, \theta : c(\theta) < \infty \,\}
   (\#eq:full)
\end{equation}
and a full family is *regular* if its canonical parameter space is an open subset of the vector space where $\theta$ takes values.

Almost all exponential families used in real applications are full and regular. So-called *curved exponential families* (smooth non-affine submodels of full exponential families) are not full. Constrained exponential families (Geyer, 1991, cited above) are not full. A few exponential families used in spatial statistics are full but not regular (Geyer and Møller, 1994, cited above).

Many people use "natural" everywhere this document uses "canonical". In this we are following Barndorff-Nielsen (1978, cited above). Many people also use an older terminology that says a statistical model is *in the* exponential family, where we say a statistical model is *an* exponential family. Thus the older terminology says *the* exponential family is the collection of all of what the newer terminology calls exponential families.
The older terminology names a useless mathematical object, a heterogeneous collection of statistical models not used in any application. The newer terminology names an important property of statistical models. If a statistical model is a regular full exponential family, then it has all of the properties discussed here. If a statistical model is an exponential family (not necessarily full or regular), then it has many of the properties discussed here. Presumably, that is the reason for the newer terminology. In this we are again following Barndorff-Nielsen (1978, cited above).

# Mean Value Parameters

The reason why the cumulant function has the name it has is that it is related to the cumulant generating function (CGF). A cumulant generating function is the logarithm of a moment generating function (MGF). Derivatives of an MGF evaluated at zero give moments. Derivatives of a CGF evaluated at zero give [cumulants](https://en.wikipedia.org/wiki/Cumulant). Cumulants are polynomial functions of moments and vice versa.

Using \@ref(eq:cumfun), the MGF for an exponential family with log likelihood \@ref(eq:logl) is given by
$$
   M_\theta(t) = e^{c(\theta + t) - c(\theta)}
$$
provided this formula defines an MGF, which it does if and only if it is finite for $t$ in a neighborhood of zero, which happens if and only if $\theta$ is in the interior of the full canonical parameter space \@ref(eq:full). So the cumulant generating function is
$$
   K_\theta(t) = c(\theta + t) - c(\theta)
$$
provided $\theta$ is in the interior of $\Theta$. It is easy to see that derivatives of $K_\theta$ evaluated at zero are derivatives of $c$ evaluated at $\theta$. So derivatives of $c$ evaluated at $\theta$ are cumulants. We will only be interested in the first two cumulants
\begin{align}
   E_\theta(y) & = \nabla c(\theta)
   (\#eq:cumfun-first-derivative)
   \\
   \mathop{\rm var}\nolimits_\theta(y) & = \nabla^2 c(\theta)
   (\#eq:cumfun-second-derivative)
\end{align}
(Barndorff-Nielsen, 1978, cited above, Theorem 8.1).

Hence for any $\theta$ in the interior of $\Theta$, the corresponding probability distribution has moments and cumulants of all orders. In particular, every distribution in a regular full exponential family has moments and cumulants of all orders, and the mean and variance are given by the formulas above. Conversely, any distribution whose canonical parameter value is on the boundary of the full canonical parameter space does not have a moment generating function or a cumulant generating function, and no moments or cumulants need exist.

The canonical parameterization is not always identifiable. It is identifiable if and only if the canonical statistic vector is not concentrated on a hyperplane in the vector space where it takes values (Geyer, 2009, cited above, Theorem 1). It is always possible to choose an identifiable canonical parameterization but not always convenient (Geyer, 2009, cited above).

The means of distributions in a regular full exponential family always constitute an identifiable parameterization (Theorem \@ref(thm:identifiable) below). The mean value parameterization of a regular full exponential family is just as good as the canonical parameterization. Since cumulant functions are infinitely differentiable, the transformation from canonical to mean value parameters is infinitely differentiable. If the canonical parameterization is chosen so it is identifiable, then the inverse function theorem of real analysis says the inverse mapping is infinitely differentiable too.
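As a quick numerical check (not in the original notes; the family, numbers, and chunk name are my choices), here is a minimal sketch of \@ref(eq:cumfun-first-derivative) and \@ref(eq:cumfun-second-derivative) for the binomial distribution, whose cumulant function is $c(\theta) = n \log(1 + e^\theta)$ with $\theta$ the log odds.

```{r cumulant-derivatives-check}
# binomial(n, p) with canonical parameter theta = logit(p) has
# cumulant function c(theta) = n * log(1 + exp(theta))
n <- 10
cfun <- function(theta) n * log(1 + exp(theta))
theta <- 1.5
p <- 1 / (1 + exp(- theta))
eps <- 1e-4
# central difference approximations to first and second derivatives of c
c.prime <- (cfun(theta + eps) - cfun(theta - eps)) / (2 * eps)
c.double.prime <- (cfun(theta + eps) - 2 * cfun(theta) + cfun(theta - eps)) / eps^2
all.equal(c.prime, n * p, tolerance = 1e-4)                  # mean n p
all.equal(c.double.prime, n * p * (1 - p), tolerance = 1e-4) # variance n p (1 - p)
```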
(See also Section \@ref(multivariate-monotonicity) below.)

In elementary applications mean value parameters are preferred. When we introduce the binomial and Poisson distributions we use mean value parameters. It is only when we get to generalized linear models with binomial or Poisson response that we need canonical parameters. When we introduce the multinomial distribution we use mean value parameters. It is only when we get to hierarchical log-linear models for categorical data that we need canonical parameters.

# Sufficient Dimension Reduction

Nowadays, there is much interest in [sufficient dimension reduction in regression](https://en.wikipedia.org/wiki/Sufficient_dimension_reduction) that does not fit into the exponential family paradigm described in Section \@ref(casm) below. But exponential families were there first.

## Canonical Statistics are Sufficient

Since the likelihood only depends on the data through the canonical statistic, the canonical statistic is always a (vector) *sufficient statistic*. This is one direction of the [Neyman-Fisher factorization theorem](https://en.wikipedia.org/wiki/Sufficient_statistic#Fisher%E2%80%93Neyman_factorization_theorem).

## Independent and Identically Distributed Sampling {#iid}

Suppose $y_1$, $y_2$, $\ldots,$ $y_n$ are independent and identically distributed (IID) random variables from an exponential family with log likelihood for sample size one \@ref(eq:logl). Then the log likelihood for sample size $n$ is
\begin{align*}
   l_n(\theta)
   & =
   \sum_{i = 1}^n \bigl[ \langle y_i, \theta \rangle - c(\theta) \bigr]
   \\
   & =
   \left\langle \sum_{i = 1}^n y_i, \theta \right\rangle - n c(\theta)
\end{align*}
From this it follows that IID sampling converts one exponential family into another exponential family with

* canonical statistic vector $\sum_i y_i$, which is the sum of the canonical statistic vectors for the samples,

* canonical parameter vector $\theta$, which is the same as the canonical parameter vector for the samples,

* cumulant function $n c(\,\cdot\,)$, which is $n$ times the cumulant function for the samples,

* mean value parameter vector $n \mu$, which is $n$ times the mean value parameter $\mu$ for the samples.

Many familiar "addition rules" for [brand name distributions](http://www.stat.umn.edu/geyer/5101/notes/brand.pdf) are special cases of this:

* sum of IID binomial is binomial,

* sum of IID Poisson is Poisson,

* sum of IID negative binomial is negative binomial,

* sum of IID gamma is gamma,

* sum of IID multinomial is multinomial, and

* sum of IID multivariate normal is multivariate normal.

The point is that the dimension reduction from $y_1$, $y_2$, $\ldots,$ $y_n$ to $\sum_i y_i$ is a *sufficient dimension reduction*. It loses no information about the parameters assuming the model is correct.

## Canonical Affine Submodels {#casm}

Suppose we parameterize a submodel of our exponential family with parameter transformation
\begin{equation}
   \theta = a + M \beta
   (\#eq:affine)
\end{equation}
where

* $a$ is a known vector, usually called the *offset vector*,

* $M$ is a known matrix, usually called the *model matrix* (also called the *design matrix*),

* $\theta$ is the original parameter, and

* $\beta$ is the *canonical affine submodel canonical parameter* (also called *coefficients vector*).

The terms *offset vector*, *model matrix*, and *coefficients* are those used by R functions `lm` and `glm`.
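To make the correspondence with R concrete (this example is not in the original notes; the data are simulated and the chunk name is mine), here is a minimal sketch of where the offset vector, model matrix, and coefficients show up in a Poisson regression fit by `glm`.

```{r casm-terms-in-glm}
set.seed(42)
x <- rnorm(20)
expos <- runif(20, 1, 5)            # made-up exposure times
y <- rpois(20, lambda = expos * exp(0.5 + 0.3 * x))
gout <- glm(y ~ x, family = poisson, offset = log(expos))
coef(gout)                 # the coefficients vector beta
head(model.matrix(gout))   # the model matrix M (intercept column and x column)
head(gout$offset)          # the offset vector a, here log exposure
```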
The term *design matrix* is widely used although it doesn't make much sense for data that do not come from a designed experiment (but language doesn't have to make sense and often doesn't). The terminology *canonical affine submodel* is from Geyer, Wagenius, and Shaw (2007, cited above).

For a linear model (fit by R function `lm`) $\theta$ is the mean vector $\theta = E_\theta(y)$. For a generalized linear model (fit by R function `glm`) $\theta$ is the so-called *linear predictor*, and it is not usually called a parameter, even though it is a parameter.

The transformation \@ref(eq:affine) is usually called "linear", but Geyer, Wagenius, and Shaw (2007, cited above) decided to call it "affine". The issue is that there are two meanings of [linear](https://en.wikipedia.org/wiki/Linear_function):

* in calculus and all mathematics below that level (including before college) a linear function is a function whose graph is a straight line, but

* in linear algebra and all mathematics above that level (including real analysis and functional analysis, which are just advanced calculus with another name) a linear function is a function that preserves vector addition and scalar multiplication; in particular, if $f$ is a linear function, then $f(0) = 0$.

In linear algebra and all mathematics above that level, if we need to refer to the other notion of linear function we call it an *affine function*. An affine function is a linear function plus a constant function, where "linear" here means the notion from linear algebra and above. All of this extends to arbitrary transformations between vector spaces. An affine function from one vector space to another is a linear function plus a constant function.

So \@ref(eq:affine) is an *affine change of parameter* in the language of linear algebra and above and a *linear change of parameter* in the language of calculus and below. It is also a *linear change of parameter* in the language of linear algebra and above in the special case $a = 0$ (no offset). The fact that \@ref(eq:affine) is almost always used with $a = 0$ (offsets are very rarely used) may contribute to the tendency to call this parameter transformation "linear". It is not just in applications that offsets rarely appear. Even theoreticians who pride themselves on their knowledge of advanced mathematics usually ignore offsets. The familiar formula $\hat{\beta} = (M^T M)^{- 1} M^T y$ for the least squares estimator is missing an offset $a$.

Another reason for confusion between the two notions of "linear" is that for simple linear regression (R command `lm(y ~ x)`), the *regression function* is affine (linear in the lower-level notion). But this is applying "linear" in the wrong context.

> It's called "linear regression" because it's linear in the
> regression coefficients, not because it is linear in $x$.
>
> --- Werner Stuetzle

If we change the model to quadratic regression (R command `lm(y ~ x + I(x^2))`), then the regression function is quadratic (nonlinear) but the model is still a *linear model* fit by R function `lm`. Another way of saying this is that some people think of simple linear regression as being linear in the lower-level sense because it has an intercept, but, as sophisticated statisticians, we know that having an intercept does not put an $a$ in equation \@ref(eq:affine), it adds a column to $M$ (all of whose components are ones).

Statisticians generally ignore this confusion in terminology.
Most clients of statistics, including most scientists, do not take linear algebra and math classes beyond that, so we statisticians use "linear" in the lower-level sense when talking to clients. I myself use "linear" in the lower-level sense in Stat 5101 and 5102, a master's level service course in theoretical probability and statistics. Except we are inconsistent. When we say "linear model" we usually mean \@ref(eq:affine) with $a = 0$ so $(M^T M)^{- 1} M^T y$ makes sense, and that is the higher-level sense of "linear".

Hence Geyer, Wagenius, and Shaw (2007, cited above) decided to introduce the term *canonical affine submodel* for what was already familiar but either had no name or was named with confusing terminology.

In the list following \@ref(eq:affine), "known" means nonrandom. In regression analysis we allow $M$ to depend on covariate data, and saying $M$ is nonrandom means we are treating covariate data as fixed. If the covariate data happen to be random, we say we are doing the analysis conditional on the observed values of the covariate data (which is the same as treating these data as fixed and nonrandom). In other words, the statistical model is for the conditional distribution of the response $y$ given the covariate data, and the (marginal) distribution of the covariate data *is not modeled*. Thus to be fussily pedantic we should write
$$
   E_\theta(y \mid \text{the part of the covariate data that is random, if any})
$$
everywhere instead of $E_\theta(y)$, and similarly for $\mathop{\rm var}\nolimits_\theta(y)$ and so forth. But we are not going to do that, and almost no one does that. We can also allow $a$ to depend on covariate data (but almost no one does that).

Now we come to the point of this section. The log likelihood for the canonical affine submodel is
\begin{align*}
   l(\beta)
   & =
   \langle y, a + M \beta \rangle - c(a + M \beta)
   \\
   & =
   \langle y, a \rangle + \langle y, M \beta \rangle - c(a + M \beta)
\end{align*}
and we may drop the term $\langle y, a \rangle$ from the log likelihood because it does not contain the parameter $\beta$, giving
$$
   l(\beta) = \langle y, M \beta \rangle - c(a + M \beta)
$$
and because
$$
   \langle y, M \beta \rangle = y^T M \beta = (M^T y)^T \beta = \langle M^T y, \beta \rangle
$$
we finally get the log likelihood for the canonical affine submodel
\begin{equation}
   l(\beta) = \langle M^T y, \beta \rangle - c_\text{sub}(\beta)
   (\#eq:logl-casm)
\end{equation}
where
$$
   c_\text{sub}(\beta) = c(a + M \beta)
$$
From this it follows that the change of parameter \@ref(eq:affine) converts one exponential family into another exponential family with

* canonical statistic vector $M^T y$,

* canonical parameter vector $\beta$,

* cumulant function $c_\text{sub}$, and

* mean value parameter $\tau = M^T \mu$, where $\mu$ is the mean value parameter of the original model.

If $\Theta$ is the canonical parameter space of the original model, then
$$
   B = \{\, \beta : a + M \beta \in \Theta \,\}
$$
is the canonical parameter space of the canonical affine submodel. If the original model is full, then so is the canonical affine submodel. If the original model is regular full, then so is the canonical affine submodel.

There are many points to this section. It is written the way it is because of aster models (Geyer, Wagenius, and Shaw, 2007, cited above), but it applies to linear models, generalized linear models, and log-linear models for categorical data too, hence to the majority of applied statistics.
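To see \@ref(eq:logl-casm) in action (this check is not in the original notes; the simulated data and chunk name are mine), here is a minimal sketch that maximizes the submodel log likelihood directly for Poisson regression, where $c(\theta) = \sum_i e^{\theta_i}$ and there is no offset, and compares the result to what R function `glm` computes.

```{r casm-logl-check}
set.seed(42)
x <- rnorm(30)
y <- rpois(30, lambda = exp(1 + 0.5 * x))
M <- cbind(1, x)   # model matrix (intercept plus slope), no offset, a = 0
# submodel log likelihood <M^T y, beta> - c_sub(beta), c(theta) = sum(exp(theta))
logl <- function(beta) sum((t(M) %*% y) * beta) - sum(exp(M %*% beta))
# its gradient, M^T y - M^T E_beta(y)
grad <- function(beta) as.vector(t(M) %*% (y - exp(M %*% beta)))
oout <- optim(c(0, 0), logl, grad, method = "BFGS",
    control = list(fnscale = -1, reltol = 1e-12))
gout <- glm(y ~ x, family = poisson)
all.equal(oout$par, as.vector(coef(gout)), tolerance = 1e-4)
```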
But the point of this section that gets it put in its supersection is that the dimension reduction from $y$ to $M^T y$ is a *sufficient dimension reduction*. It loses no information about $\beta$, assuming this submodel is correct.

## The Pitman–Koopman–Darmois Theorem

Nowadays, exponential families are so important in so many parts of theoretical statistics that their origin has been forgotten. They were invented to be the class of statistical models described by the [Pitman–Koopman–Darmois theorem](https://en.wikipedia.org/wiki/Exponential_family#Classical_estimation:_sufficiency), which says (roughly) that the only statistical models having the sufficient dimension reduction property in IID sampling described in Section \@ref(iid) above are exponential families. In effect, it turns that section from if to if and only if.

But the reason for the "roughly" is that as just stated, with no conditions, the theorem is false. Here is a counterexample. For an IID sample from the $\text{Uniform}(0, \theta)$ distribution, the maximum likelihood estimator (MLE) is $\hat{\theta}_n = \max\{y_1, \ldots, y_n\}$ and this is easily seen to be a sufficient statistic (the Neyman-Fisher factorization theorem again), but $\text{Uniform}(0, \theta)$ is not an exponential family. So there have to be side conditions to make the theorem true.

Pitman, Koopman, and Darmois independently (not as co-authors) in 1935 and 1936 published theorems that said that under two side conditions,

* the distribution of the canonical statistic is continuous and

* the support of the distribution of the canonical statistic does not depend on the parameter,

any statistical model with the sufficient dimension reduction property in IID sampling is an exponential family. (Obviously, $\text{Uniform}(0, \theta)$ violates the second side condition.) Later, other authors published theorems with more side conditions that covered the discrete case. But the side conditions for those are really messy, so those theorems are not so interesting.

Nowadays exponential families are so important for so many reasons (not all of them mentioned in this document) that no one any longer cares about these theorems. We only want the if part of the if and only if, which is covered in Section \@ref(iid) above.

# Observed Equals Expected {#observed-equals-expected}

The usual procedure for maximizing the log likelihood is to differentiate it, set the derivative equal to zero, and solve for the parameter. The derivative of \@ref(eq:logl) is
\begin{equation}
   \nabla l(\theta) = y - \nabla c(\theta) = y - E_\theta(Y)
   (\#eq:cumfun-first-deriv)
\end{equation}
So the MLE $\hat{\theta}$ is characterized by
\begin{equation}
   y = E_{\hat{\theta}}(Y)
   (\#eq:observed-equals-expected)
\end{equation}

We can say a lot more than this. Cumulant functions are lower semicontinuous convex functions (Barndorff-Nielsen, 1978, cited above, Theorem 7.1). Hence log likelihoods of regular full exponential families are upper semicontinuous concave functions that are differentiable everywhere in the canonical parameter space. Hence \@ref(eq:cumfun-first-deriv) being equal to zero is a necessary and sufficient condition for $\theta$ to maximize \@ref(eq:logl). Hence \@ref(eq:observed-equals-expected) is a necessary and sufficient condition for $\hat{\theta}$ to be an MLE.

The MLE need not exist and need not be unique.
The MLE does not exist if the observed value of the canonical statistic is on the boundary of its support in the following sense: there exists a vector $\delta \neq 0$ such that $\langle Y - y, \delta \rangle \le 0$ holds almost surely and $\langle Y - y, \delta \rangle = 0$ does not hold almost surely (Geyer, 2009, cited above, Theorems 1, 3, and 4; here $Y$ is a random vector having the distribution of the canonical statistic and $y$ is the observed value of the canonical statistic). When the MLE does not exist in the classical sense, it may exist in an extended sense as a limit of distributions in the original family, but that is a story for another time (see Geyer, 2009, cited above, and Geyer, Stat 8931 lecture notes, Fall 2016, cited above, for more on this subject).

The MLE is not unique if there exists a $\delta \neq 0$ such that $\langle Y - y, \delta \rangle = 0$ holds almost surely (Geyer, 2009, cited above, Theorem 1, where $Y$ and $y$ are the same as before). In theory, nonuniqueness of the MLE for a full exponential family is not a problem because every MLE corresponds to the same probability distribution (Geyer, 2009, cited above, Theorem 1 and Corollary 2), so the MLE canonical parameter vector is not unique (if any MLE exists) but the MLE probability distribution is unique (if it exists).

It is always possible to arrange for uniqueness of the MLE. Simply arrange that the distribution of the canonical statistic have full dimension (not be concentrated on a hyperplane). But one does not want to do this too early in the data analysis process. In the [homework on MCMC and football](http://www.stat.umn.edu/geyer/8054/hw/nfl.html) we want to use a nonidentifiable canonical parameterization for the trinomial distribution of each game in which ties can occur and for the binomial distribution of each game in which ties cannot occur; we only want an identifiable submodel canonical parameterization for what we now understand is a canonical affine submodel (Section \@ref(casm) above). But we automatically get an identifiable parameterization if we follow the recipe in the homework assignment. So we don't care that we were nonidentifiable somewhere in the middle as long as we came out identifiable in the end.

This sort of data provides simple examples of when the MLE does not exist in the classical sense. If we allow for ties and no ties actually occur, then the MLE does not exist in the classical sense because the coefficient for ties wants to go to minus infinity to make the probability of ties as small as possible but can never get there because minus infinity is not a possible parameter value. Similarly, if one team lost all its games, then the coefficient for that team wants to be minus infinity, as shown in the [example on MCMC and volleyball](http://www.stat.umn.edu/geyer/3701/notes/mcmc-bayes.html#try-2). More complications arise when some group of teams wins all of its games against all of the rest of the teams, as shown in the example of Section 2.4 of Geyer (2009, cited above).

But all of this about nonexistence and nonuniqueness of the MLE is not the main point of this section. The main point is that \@ref(eq:observed-equals-expected) characterizes the MLE when it exists, whether or not it is unique.

* The MLE in an exponential family satisfies "observed equals expected". The MLE for the mean value parameter vector satisfies $\hat{\mu} = y$ or $y = E_{\hat{\theta}}(y)$.
More precisely, we should say the observed value of the *canonical statistic vector* equals its expected value under the MLE distribution (which is unique if it exists). It is not the observed value of anything whatsoever that equals its MLE expected value. The observed-equals-expected property is one of the keys to interpreting MLE's for exponential families.

Strangely, this is considered very important in some areas of statistics and not mentioned at all in other areas. In categorical data analysis, it is considered key. The MLE for a hierarchical log-linear model for categorical data satisfies observed equals expected: the marginals of the table corresponding to the terms in the model are equal to their MLE expected values, and those marginals are the canonical sufficient statistics. So this gives a complete characterization of maximum likelihood for these models, and hence a complete understanding in a sense. (See also Section \@ref(maximum-entropy) below.)

In regression analysis, it is ignored. The most widely used linear and generalized linear models are exponential families: linear models, logistic regression, and Poisson regression with log link. Thus maximum likelihood satisfies the observed-equals-expected property.

* The MLE for a canonical affine submodel satisfies "observed equals expected". If $y$ is the response vector and $M$ is the model matrix, then the MLE for the submodel mean value parameter vector satisfies $\hat{\tau} = M^T y$ or $M^T y = M^T E_{\hat{\beta}}(y)$.

More precisely, we should say the observed value of the *submodel canonical statistic vector* equals its expected value under the MLE distribution (which is unique if it exists). It is not the observed value of anything whatsoever that equals its MLE expected value.

So this tells us that the submodel canonical statistic vector $M^T y$ is crucial to understanding linear and generalized linear models (and aster models, Geyer, Wagenius, and Shaw, 2007, cited above), just like it is for hierarchical log-linear models for categorical data. But do regression books even mention this? Not that your humble author knows of.

Let's check this for linear models with no offset, where we have original model mean value parameter
$$
   \mu = E(y) = M \beta
$$
and MLE for the submodel canonical parameter $\beta$
$$
   \hat{\beta} = (M^T M)^{- 1} M^T y
$$
and consequently
$$
   M^T M \hat{\beta} = M^T y
$$
and by invariance of maximum likelihood ([Geyer, 2016, Stat 5102 lecture notes, deck 3, slides 100 ff.](http://www.stat.umn.edu/geyer/5102/slides/s3.pdf#page=100))
$$
   \hat{\mu} = M \hat{\beta}
$$
so
$$
   M^T \hat{\mu} = M^T y
$$
which we claim is the key to understanding linear models. But regression textbooks never mention it. So who is right, the authors of regression textbooks or the authors of categorical data analysis textbooks? Our answer is the latter. $M^T y$ is important.

Another way people sometimes say this is that every MLE in a regular full exponential family is a method of moments estimator, but not just any old method of moments estimator. It is the method of moments estimator that sets the expectation of the *canonical statistic vector* equal to its observed value and solves for the parameter. For example, for linear models, the method of moments estimator we want sets
$$
   M^T M \beta = M^T y
$$
and solves for $\beta$. And being precise, we need the method of moments estimator that sets the *canonical statistic vector for the model being analyzed* equal to its expected value.
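Here is a minimal numerical check of the observed-equals-expected property (not part of the original notes; the simulated data and chunk name are mine), both for a linear model and for logistic regression, which has the canonical (logit) link.

```{r observed-equals-expected-check}
set.seed(42)
x <- rnorm(25)
M <- cbind(1, x)   # model matrix for both fits below (intercept and slope)
# linear model: M^T y equals M^T mu-hat exactly (up to rounding error)
y.lm <- 1 + 2 * x + rnorm(25)
lout <- lm(y.lm ~ x)
all.equal(as.vector(t(M) %*% y.lm), as.vector(t(M) %*% fitted(lout)))
# logistic regression: same property, up to the convergence tolerance
# of the iteratively reweighted least squares algorithm
y.glm <- rbinom(25, 1, prob = 1 / (1 + exp(- x)))
gout <- glm(y.glm ~ x, family = binomial)
all.equal(as.vector(t(M) %*% y.glm), as.vector(t(M) %*% fitted(gout)),
    tolerance = 1e-6)
```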
For a canonical affine submodel, that is the *submodel* canonical statistic vector $M^T y$.

But there is nothing special here about linear models except that they have a closed-form expression for $\hat{\beta}$. In general, we can only determine $\hat{\beta}$ as a function of $M^T y$ by numerically maximizing the likelihood using a computer optimization function. But we always have "observed equals expected" up to the inaccuracy of computer arithmetic. And usually "observed equals expected" is the only simple equality we know about maximum likelihood in regular full exponential families.

# Maximum Entropy {#maximum-entropy}

Many scientists in the early part of the nineteenth century invented the science of thermodynamics, in which some of the key concepts are *energy* and *entropy*. Entropy was initially [defined physically](https://en.wikipedia.org/wiki/Entropy#Classical_thermodynamics) as
$$
   d S = \frac{d Q}{T}
$$
where $S$ is entropy and $d S$ its differential, $Q$ is heat and $d Q$ its differential, and $T$ is temperature, so to calculate entropy in most situations you have to do an integral (the details here don't matter --- the point is that entropy defined this way is a physical quantity measured in physical ways).

The [first law of thermodynamics](https://en.wikipedia.org/wiki/First_law_of_thermodynamics) says energy is conserved in any closed physical system. Energy can change form from motion to heat to chemical energy and to other forms. But the total is conserved.

The [second law of thermodynamics](https://en.wikipedia.org/wiki/Second_law_of_thermodynamics) says entropy is nondecreasing in any closed physical system. But there are many other equivalent formulations. One is that heat always flows spontaneously from hot to cold, never the reverse. Another is that there is a [maximum efficiency](https://en.wikipedia.org/wiki/Second_law_of_thermodynamics#Carnot_theorem) of a heat engine or a refrigerator (a heat engine operated in reverse) that depends only on the temperature difference that powers it (or that the refrigerator produces).

So, somewhat facetiously, the first law says you can't win, and the second law says you can't break even.

Near the end of the nineteenth century and the beginning of the twentieth century, thermodynamics was extended to chemistry. And it was found [chemistry too obeys the laws of thermodynamics](https://en.wikipedia.org/wiki/Chemical_thermodynamics). Every chemical reaction in your body is all the time obeying the laws of thermodynamics. No animal can convert all of the energy of food to useful work. There must be waste heat, and this is a consequence of the second law of thermodynamics.

Also near the end of the nineteenth century [Ludwig Boltzmann](https://en.wikipedia.org/wiki/Ludwig_Boltzmann) discovered the relationship between entropy and probability. He was so pleased with this discovery that he had
$$
   S = k \log W
$$
engraved on his tombstone. Here $S$ is again entropy, $k$ is a physical constant now known as Boltzmann's constant, and $W$ is probability (*Wahrscheinlichkeit* in German). Of course, this is not probability in general, but probability in certain physical systems. Along with this came the interpretation that entropy does not always increase. Physical systems necessarily spend more time in more probable states and less time in less probable states. Increase of entropy is just the inevitable move from less probable to more probable on average.
At the microscopic level entropy fluctuates as the system moves through each state according to its probability.

In the mid twentieth century [Claude Shannon](https://en.wikipedia.org/wiki/Claude_Shannon) recognized the relation between entropy and information. The same formulas that define entropy statistically define information as negative entropy (so minus a constant times log probability). He used this to bound how much signal could be put through a noisy communications channel.

A little later [Kullback and Leibler](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) imported Shannon's idea into statistics, defining what we now call Kullback-Leibler information. What does maximum likelihood try to do theoretically? It tries to maximize the expectation of the log likelihood function, which is the Kullback-Leibler information function, that maximum being the true unknown parameter value if the model is identifiable ([Wald, 1949](http://www.jstor.org/stable/2236315)). There is also a connection between [Kullback-Leibler information and Fisher information](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Fisher_information_metric).

A little later [Edwin Jaynes](https://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes) recognized the connection between entropy, or negative Kullback-Leibler information, and exponential families. Exponential families maximize entropy subject to constraints. Fix a probability distribution $Q$ and a random vector $Y$ on the probability space of that distribution. Then for each vector $\mu$ find the probability distribution $P$ that maximizes entropy (minimizes Kullback-Leibler information) with respect to $Q$ subject to $E_P(Y) = \mu$. If the maximum exists, call it $P_\mu$. Then the collection of all such $P_\mu$ is a full exponential family having canonical statistic $Y$ and mean value parameter $\mu$ (for a proof see my [Stat 8931 lecture notes, Fall 2018](http://www.stat.umn.edu/geyer/8931aster/slides/s2.pdf#page=176)).

Jaynes is not popular among statisticians because his maximum entropy idea became linked with so-called [maxent modeling](http://www.cs.cmu.edu/~aberger/maxent.html), which statisticians for the most part have ignored. But in the context of exponential families, maximum entropy is powerful. It says you start with the canonical statistic. If you start with a canonical statistic that is an affine function of the original canonical statistic of an exponential family, then the canonical affine submodel maximizes entropy subject to the distributions in the canonical affine submodel having the mean value parameters they do. Every other aspect of those distributions is just randomness in the sense of maximum entropy or minimum Kullback-Leibler information. Thus the (submodel) mean value parameter tells you everything interesting about a canonical affine submodel.

When connected with observed equals expected (Section \@ref(observed-equals-expected) above), this is a very powerful principle. Observed equals expected says maximum likelihood estimation matches the submodel canonical statistic vector exactly to its observed value. Maximum entropy says nothing else matters; everything important, all the *information* about the parameter, is in the MLE. All else is randomness (in the sense of maximum entropy).

Admittedly, the one time I have made this argument in a published article (Geyer and Thompson, 1992, cited above) it was not warmly received. But it was a minor point of that article. Perhaps this section makes a better case.
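Here is a small numerical illustration of this characterization (not in the original notes; the base distribution, target mean, and chunk name are arbitrary choices of mine). Among distributions on $\{0, 1, \ldots, 10\}$ with a fixed mean, the one minimizing Kullback-Leibler divergence from a binomial base distribution is an exponential tilt of that base distribution, hence a member of a full exponential family with canonical statistic $y$; perturbing it within the constraint set only increases the divergence.

```{r maxent-tilt-check}
# base distribution Q: binomial(10, 1/2); canonical statistic is y itself
yvals <- 0:10
q <- dbinom(yvals, 10, 0.5)
mu <- 6.5                        # target mean, arbitrary
# exponential tilt: p_theta(y) proportional to q(y) exp(theta * y)
tilt <- function(theta) {
    p <- q * exp(theta * yvals)
    p / sum(p)
}
# choose theta so the mean constraint E_P(Y) = mu holds
theta.hat <- uniroot(function(theta) sum(yvals * tilt(theta)) - mu,
    lower = -10, upper = 10, tol = 1e-10)$root
p.tilt <- tilt(theta.hat)
sum(yvals * p.tilt)              # matches mu
# Kullback-Leibler divergence of a distribution p from the base q
kl <- function(p) sum(p * log(p / q))
# perturb p.tilt in directions that keep the sum and the mean fixed and
# check that every such perturbation increases the divergence
set.seed(42)
worse <- replicate(100, {
    z <- rnorm(length(yvals))
    z <- residuals(lm(z ~ yvals))    # orthogonal to constant and to yvals
    p.alt <- p.tilt + 1e-6 * z / sqrt(sum(z^2))
    kl(p.alt) > kl(p.tilt)
})
all(worse)
```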
The reason why the model for the [homework on MCMC and football](http://www.stat.umn.edu/geyer/8054/hw/nfl.html) is what it is is that the National Football League considers the canonical statistics of that model the correct way to evaluate teams. It is true that we add two statistics the league does not use, total number of ties and total number of home wins, because we need something in the model to determine the probability of ties, and we want home field advantage in the model because it obviously exists (it is highly statistically significant every year in every sport), and leaving out home field advantage would inflate the variance of estimates.

Thus it is amazing (to me) that this procedure fully and correctly adjusts sports standings for strength of schedule. It is strange (to me) that only one sport, college ice hockey, uses this procedure, which in that context they call [KRACH](https://www.uscho.com/rankings/krach/d-i-men/), and they do not use it alone, but as just one factor in a mess of procedures that have no statistical justification.

# Multivariate Monotonicity {#multivariate-monotonicity}

A link function, which goes componentwise from mean value parameters to canonical parameters for a generalized linear model that is an exponential family (linear models, logistic regression, Poisson regression with log link), is *univariate monotone*. This does not generalize to exponential families with dependence among components of the response vector like aster models (Geyer, Wagenius, and Shaw, 2007, cited above), Markov spatial point processes (Geyer and Møller, 1994, cited above, and Stat 8501 lecture notes on spatial point processes, cited above), and Markov spatial lattice processes (Stat 8501 lecture notes on spatial lattice processes, cited above).

Instead we have *multivariate monotonicity*. This is not a concept statisticians are familiar with. It does not appear in real analysis, functional analysis, or probability theory. It comes from convex analysis. Rockafellar and Wets (1998, cited above) have a whole chapter on the subject (Chapter 12). There are many equivalent characterizations. We will only discuss two of them.

A function $f$ from one vector space to another is *multivariate monotone* if
$$
   \langle f(x) - f(y), x - y \rangle \ge 0, \qquad \text{for all $x$ and $y$}
$$
and *strictly multivariate monotone* if
$$
   \langle f(x) - f(y), x - y \rangle > 0, \qquad \text{for all $x$ and $y$ such that $x \neq y$}
$$
(Rockafellar and Wets, 1998, Definition 12.1).

The reason this is important to us is that the gradient mapping of a convex function is multivariate monotone (Rockafellar and Wets, 1998, Theorem 12.17; indeed a proper lower semicontinuous function is convex if and only if its subgradient mapping is multivariate monotone). We have proper lower semicontinuous convex functions in play: cumulant functions (Barndorff-Nielsen, 1978, Theorem 7.1; "proper" means the function never takes the value $-\infty$ and somewhere takes a finite value, and cumulant functions satisfy this, PhD thesis, cited above, Theorem 2.1). Also cumulant functions of regular full exponential families are differentiable at points where they are finite, which constitute their canonical parameter spaces \@ref(eq:full).
So define $h$ by
$$
   h(\theta) = \nabla c(\theta), \qquad \theta \in \Theta
$$
This maps the canonical parameter space into the space where the canonical statistic $y$ and the mean value parameter $\mu$ take values, and it defines the mapping from canonical to mean value parameters $\mu = h(\theta)$. Then $h$ is multivariate monotone. Hence if $\theta_1$ and $\theta_2$ are canonical parameter values and $\mu_1$ and $\mu_2$ are the corresponding mean value parameter values,
$$
   \langle \mu_1 - \mu_2, \theta_1 - \theta_2 \rangle \ge 0
$$
Moreover, if the canonical parameterization is identifiable, $h$ is *strictly multivariate monotone*
$$
   \langle \mu_1 - \mu_2, \theta_1 - \theta_2 \rangle > 0, \qquad \theta_1 \neq \theta_2
$$
(Barndorff-Nielsen, 1978, Theorem 7.1; Geyer, 2009, Theorem 1; Rockafellar and Wets, Theorem 12.17, all cited above).

We can see from the way the canonical and mean value parameters enter symmetrically that, when the canonical parameterization is identifiable so $h$ is invertible (Geyer, Stat 8112 lecture notes, cited above, Lemma 9), the inverse $h^{- 1}$ is also *strictly multivariate monotone*
$$
   \langle \mu_1 - \mu_2, \theta_1 - \theta_2 \rangle > 0, \qquad \mu_1 \neq \mu_2
$$

One final characterization: a differentiable function is strictly multivariate monotone if and only if the restriction to every line segment in the domain is strictly univariate monotone (obvious from the way the definitions above only deal with two points in the domain at a time).

Thus we have a "dumbed down" version of strict multivariate monotonicity: increasing one component of the canonical parameter vector increases the corresponding component of the mean value parameter vector, if the canonical parameterization is identifiable. The other components of $\mu$ also change, but they can go any which way.

When specialized to canonical affine submodels (Section \@ref(casm) above) strict multivariate monotonicity becomes
$$
   \langle \tau_1 - \tau_2, \beta_1 - \beta_2 \rangle > 0, \qquad \beta_1 \neq \beta_2
$$
where $\tau_1$ and $\tau_2$ are the submodel mean value parameters corresponding to the submodel canonical parameters $\beta_1$ and $\beta_2$. When "dumbed down" this becomes: increasing one component of the submodel canonical parameter vector $\beta$ increases the corresponding component of the submodel mean value parameter vector $\tau = M^T \mu$, if the submodel canonical parameterization is identifiable. The other components of $\tau$ and the components of $\mu$ also change, but they can go any which way.

Again we see the key importance of the sufficient dimension reduction map $y \mapsto M^T y$ and the corresponding original model to canonical affine submodel mean value parameter mapping $\mu \mapsto M^T \mu$, that is, the importance of thinking of $M^T$ as (the matrix representing) a linear transformation.

These "dumbed down" characterizations say that strict multivariate monotonicity implies strict univariate monotonicity of the restrictions of the function $h$ to line segments in the domain *parallel to the coordinate axes* (so only one component of the vector changes). Compare this with our last (not dumbed down) characterization: strict multivariate monotonicity holds *if and only if* all restrictions of the function $h$ to line segments in the domain are strictly univariate monotone (not just line segments parallel to the coordinate axes, *all* line segments).
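Here is a minimal numerical illustration of the submodel version (not from the original notes; the model matrix, the number of replications, and the chunk name are my choices): for logistic regression with an identifiable parameterization, randomly chosen pairs of coefficient vectors always give a positive inner product $\langle \tau_1 - \tau_2, \beta_1 - \beta_2 \rangle$.

```{r multivariate-monotone-check}
set.seed(42)
n <- 50
M <- cbind(1, rnorm(n), rnorm(n))   # model matrix, full column rank
# logistic regression: mu(beta) = 1 / (1 + exp(- M beta)), tau(beta) = M^T mu(beta)
tau <- function(beta) as.vector(t(M) %*% (1 / (1 + exp(- M %*% beta))))
check <- replicate(1000, {
    beta1 <- rnorm(3)
    beta2 <- rnorm(3)
    sum((tau(beta1) - tau(beta2)) * (beta1 - beta2))
})
all(check > 0)
```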
So the "dumbed down" version only varies one component of the canonical parameter at a time, whereas the non-dumbed-down version varies all components. The "dumbed down" version can be useful when talking to people who have never heard of multivariate monotonicity. But sometimes the non-dumbed-down concept is needed (Shaw and Geyer, 2010, cited above, Appendix A). There is no substitute for understanding this concept. It should be in the toolbox of every statistician. # Regression Coefficients are Meaningless The title of this section comes from my [Stat 5102 lecture notes, deck 5, slide 19](http://www.stat.umn.edu/geyer/5102/slides/s5.pdf#page=19). It is stated the way it is for shock value. All of the students in that class have previously taken courses where they were told how to interpret regression coefficients. So this phrase is intended to shock them into thinking they have been mistaught! Although shocking, it refers to something everyone knows. Even in the context of linear models (which those 5102 notes are) the same model can be specified by different formulas or different model matrices. ## Example: Polynomial Regression For example ```{r foo-one} foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-1.txt", header = TRUE) lout1 <- lm(y ~ poly(x, 2), data = foo) summary(lout1) ``` and ```{r foo-too} lout2 <- lm(y ~ poly(x, 2, raw = TRUE), data = foo) summary(lout2) ``` have different fitted regression coefficients. But they fit the same model ```{r foo-same} all.equal(fitted(lout1), fitted(lout2)) ``` ## Example: Categorical Predictors For another example, when there are categorical predictors we must "drop" one category from each predictor to get an identifiable model and which one we drop is arbitrary. Thus ```{r bar-one} bar <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-4.txt", header = TRUE, stringsAsFactors = TRUE) levels(bar$color) lout1 <- lm(y ~ x + color, data = bar) summary(lout1) ``` and ```{r bar-too} bar <- transform(bar, color = relevel(color, ref = "red")) lout2 <- lm(y ~ x + color, data = bar) summary(lout2) ``` have different fitted regression coefficients. But they fit the same model ```{r bar-same} all.equal(fitted(lout1), fitted(lout2)) ``` ## Example: Collinearity Even in the presence of collinearity, where some coefficients must be dropped to obtain identifiability (and which one(s) are dropped is arbitrary) the mean values are unique, hence the fitted model is unique. ```{r baz-one, error=TRUE} baz <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-3.txt", header = TRUE, stringsAsFactors = TRUE) x3 <- with(baz, x1 + x2) lout1 <- lm(y ~ x1 + x2 + x3, data = baz) summary(lout1) ``` and ```{r baz-too, error=TRUE} lout2 <- lm(y ~ x3 + x2 + x1, data = baz) summary(lout2) ``` have different fitted regression coefficients. But they fit the same model ```{r baz-same} all.equal(fitted(lout1), fitted(lout2)) ``` ## Alice in Wonderland {#alice} After several iterations, this shocking advice became the following ([Stat 8931 Aster models lecture notes, cited above, deck 2, slide 41](http://www.stat.umn.edu/geyer/8931aster/slides/s2.pdf#page=41)) > A quote from my master’s level theory notes. > >> Parameters are meaningless quantities. >> Only probabilities and expectations are meaningful. > > Of course, some parameters are probabilities and expectations, > but most exponential family canonical parameters are not. 
>
> A quote from *Alice in Wonderland*
>
> > 'If there's no meaning in it,' said the King,
> > 'that saves a world of trouble, you know, as we needn't try to find any.'
>
> Realizing that canonical parameters are meaningless quantities
> "saves a world of trouble". We "needn't try to find any".

Thinking sophisticatedly and theoretically, of course parameters are meaningless. A statistical model is a family $\mathcal{P}$ of probability distributions. How this family is parameterized (indexed) is meaningless. If
$$
   \mathcal{P}
   =
   \{\, P_\theta : \theta \in \Theta \,\}
   =
   \{\, P_\beta : \beta \in B \,\}
   =
   \{\, P_\varphi : \varphi \in \Phi \,\}
$$
are three different parameterizations for the same model, then they are all for the same model (duh!). The fact that parameter estimates in one parameterization tell us nothing about estimates in another parameterization tells us nothing.

But probabilities and expectations are meaningful. For $P \in \mathcal{P}$, both $P(A)$ and $E_P\{g(Y)\}$ depend only on $P$, not what parameter value is deemed to index it. And this does not depend on what $P$ means, whether we specify $P$ with a probability mass function, a probability density function, a distribution function, or a probability measure, the same holds: probabilities and expectations only depend on the distribution, not how it is described.

Even if we limit the discussion to regular full exponential families, any one-to-one affine function of a canonical parameter vector is another canonical parameter vector (copied from Section \@ref(expfam) above). That's a lot of parameterizations, and which one you choose (or the computer chooses for you) is meaningless.

Hence we agree with the King of Hearts in *Alice in Wonderland*. It "saves a world of trouble" if we don't try to interpret canonical parameters.

It doesn't help those wanting to interpret canonical parameters that sometimes the map from canonical to mean value parameters has no closed-form expression (this happens in the spatial point and lattice processes discussed in the Stat 8501 handouts cited above; the log likelihood and its derivatives can only be approximated by MCMC using the scheme in Geyer and Thompson, 1992, and Geyer, 1994, both cited above) or has a closed-form expression, but it is so fiendishly complicated that people have no clue what is going on, although the computer chugs through the calculation effortlessly (this happens with aster models, Geyer, Wagenius, and Shaw, 2007, cited above).

# Interpreting Exponential Family Model Fits

We take up the points made above in turn, stressing their impact on how users can interpret exponential family model fits.

## Observed Equals Expected {#observed-equals-expected-in-review}

The simplest and most important property is the observed-equals-expected property (Section \@ref(observed-equals-expected) above). The MLE for the submodel mean value parameter vector $\hat{\tau} = M^T \hat{\mu}$ is exactly equal to the submodel canonical statistic vector $M^T y$. That's what maximum likelihood in a regular full exponential family *does*. So understanding $M^T y$ is the most important thing in understanding the model. If $M^T y$ is scientifically (business analytically, sports analytically, whatever) interpretable, then the model is interpretable. Otherwise, not!

## Sufficient Dimension Reduction {#sufficient-dimension-reduction-in-review}

The next most important property is sufficient dimension reduction (Section \@ref(casm) above). The submodel canonical statistic vector $M^T y$ is *sufficient*.
It contains all the information about the parameters that there is in the data, assuming the model is correct. Since $M^T y$ determines the MLE for the coefficients vector $\hat{\beta}$ (Section \@ref(casm) above, assuming $\beta$ is identifiable), and the MLE for every other parameter vector is a one-to-one function of $\hat{\beta}$, the MLE's for all parameter vectors ($\hat{\beta}$, $\hat{\theta}$, $\hat{\mu}$, and $\hat{\tau}$) are sufficient statistic vectors. The MLE for each parameter vector contains all the information about parameters that there is in the data, assuming the model is correct.

## Maximum Entropy {#maximum-entropy-in-review}

And nothing else matters for interpretation. Everything else about the model other than what the MLE's say is as random as possible (maximum entropy, Section \@ref(maximum-entropy) above) and contains no information (sufficiency, just discussed).

## Regression Coefficients are Meaningless {#regression-coefficients-are-meaningless-in-review}

In particular, it "saves a world of trouble" if we realize "we needn't try to find any" meaning in the coefficients vector $\hat{\beta}$ (Section \@ref(alice) above).

## Multivariate Monotonicity {#multivariate-monotonicity-in-review}

But if we do have to say something about the coefficients vector $\hat{\beta}$, we do have the multivariate-monotonicity property available (Section \@ref(multivariate-monotonicity) above).

## The Model Equation

Most statistics courses that discuss regression models teach students to woof about the *model equation* \@ref(eq:affine). In lower-level courses where students are not expected to understand matrices, students are taught to woof about the same thing in other terms,
$$
   y_i = \beta_1 + \beta_2 x_i + \text{error}
$$
and the like. That is, they are taught to think about the model matrix as a linear operator $\beta \mapsto M \beta$ or the same thing in other terms. And another way of saying this is that they are taught to focus on the *rows* of $M$.

The view taken here is that this woof is all meaningless because it is about meaningless parameters ($\beta$ and $\theta$). The important linear operator to understand is the sufficient dimension reduction operator $y \mapsto M^T y$ or, what is the same thing described in other language, the original model to submodel mean value transformation operator $\mu \mapsto M^T \mu$. And another way of saying this is that we should focus on the *columns* of $M$.

It is not when you woof about $M \beta$ that you understand and explain the model, it is when you woof about $M^T y$ that you understand and explain the model.

# Asymptotics

A story: when I was a first-year graduate student I answered a question about asymptotics with "because it's an exponential family", but the teacher didn't think that was quite enough explanation. It is enough, but textbooks and courses don't emphasize this.

The "usual" asymptotics of maximum likelihood (asymptotically normal, with variance inverse Fisher information) hold for every regular full exponential family; no other regularity conditions are necessary (all other conditions are implied by regular full exponential family). The "usual" asymptotics also hold for all curved exponential families by smoothness.

For a proof of this using the usual IID sampling and $n$ goes to infinity story see the Stat 8112 lecture notes (cited above).
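Here is a minimal sketch of what these "usual" asymptotics deliver in practice (not part of the original notes; the simulated data and chunk name are mine): for a Poisson regression, which is a regular full exponential family, the estimated inverse Fisher information reported by `vcov` gives asymptotically valid Wald confidence intervals.

```{r wald-interval-sketch}
set.seed(42)
x <- rnorm(100)
y <- rpois(100, lambda = exp(1 + 0.5 * x))
gout <- glm(y ~ x, family = poisson)
# estimated inverse Fisher information, evaluated at the MLE
vcov(gout)
# Wald 95% confidence intervals for the submodel canonical parameters
cbind(coef(gout) - qnorm(0.975) * sqrt(diag(vcov(gout))),
      coef(gout) + qnorm(0.975) * sqrt(diag(vcov(gout))))
```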
In fact, these same "usual" asymptotics hold when there is complicated dependence and no IID in sight and either no $n$ goes to infinity story makes sense or whatever $n$ goes to infinity story can be concocted yields an intractable problem. For that see Geyer (2013, cited above, Sections 1.4 and 1.5). And these justify all the hypothesis tests and confidence intervals based on these "usual" asymptotics, for example, those for generalized linear models that are exponential families and for log-linear models for categorical data and for aster models. # (APPENDIX) Appendix {-} # Identifiability of the Mean Value Parameter This is not a theory handout, but we have a theorem because it is important and I cannot find it elsewhere. ```{theorem, identifiable} The mean value parameterization of a regular full exponential family is always identifiable, regardless of whether the canonical parameterization is identifiable. ``` Another way to say this is: in a regular full exponential family different distributions have different means (of the canonical statistic vector). ```{proof} Let $\theta_1$ and $\theta_2$ be canonical parameter values corresponding to mean value parameter values $\mu_1$ and $\mu_2$, and let $Y$ denote the canonical statistic. Now Theorem 1 in Geyer (2009, cited above, parts (d) and (f)) says $\theta_1$ and $\theta_2$ correspond to the same probability distribution if and only if $\langle Y, \theta_1 - \theta_2 \rangle$ is constant almost surely. We now have two cases. Case I: the distribution of $\langle Y, \theta_1 - \theta_2 \rangle$ is constant almost surely. Then $\theta_1$ and $\theta_2$ correspond to the same probability distribution, which implies $\mu_1 = \mu_2$. So this case is irrelevant to identifiability. Case II: the distribution of $\langle Y, \theta_1 - \theta_2 \rangle$ is not constant almost surely. Then $\theta_1$ and $\theta_2$ do not correspond to the same probability distribution, and $\mu_1$ and $\mu_2$ do not correspond to the same probability distribution. ``` So this theorem was almost in the literature. It should have been stated and proved in Geyer (2009, cited above) but wasn't. This theorem is new only in the "regardless of whether the canonical parameterization is identifiable" part. The mean value parameterization of regular full exponential families has long been recognized, and Barndorff-Nielsen (1978) and Brown (1986) have theorems that say, if the canonical parameterization is identifiable, then the mean values also parameterize the family.