This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).
The R markdown source for this document is https://www.stat.umn.edu/geyer/5421/notes/expfam.Rmd.
The version of R used to make this document is 4.4.1.
The version of the bookdown package used to make this document is 0.40. The version of the rmarkdown package used to make this document is 2.28.
A statistical model is an exponential family of distributions if it has a log likelihood of the form \[\begin{equation} l(\theta) = \langle y, \theta \rangle - c(\theta) \tag{3.1} \end{equation}\] where
\(y\) is a vector-valued statistic, which is called the canonical statistic,
\(\theta\) is a vector-valued parameter, which is called the canonical parameter,
\(c\) is a real-valued function, which is called the cumulant function,
and \(\langle \,\cdot\,, \,\cdot\, \rangle\) denotes a bilinear form that places the vector space where \(y\) takes values and the vector space where \(\theta\) takes values in duality.
In equation (3.1) we have used the rule that additive terms in the log likelihood that do not contain the parameter may be dropped. Any such terms have been dropped in (3.1).
The angle brackets notation comes from functional analysis (which is what very advanced, PhD level, calculus is called). It has the virtue of not treating \(y\) and \(\theta\) as being vectors in the same space, but rather in dual vector spaces (a concept from linear algebra). Of course, we know that statistics and parameters are different, but it is nice to have our notation reflect that. Nevertheless, we will sometimes use other notation \[ \textstyle \langle y, \theta \rangle = y^T \theta = \theta^T y = \sum_i y_i \theta_i \]
Often the canonical statistic and canonical parameter are not the usual statistic or usual parameter. One has to do a change of parameter or change of statistic or both to get the log likelihood into the form (3.1). More on this in Section 7 below.
Many people use “natural” everywhere this document uses “canonical”. In this choice we are following Barndorff-Nielsen (1978).
Many people also use an older terminology that says a statistical model is in the exponential family where we say a statistical model is an exponential family. Thus the older terminology says the exponential family is the collection of all of what the newer terminology calls exponential families. The older terminology names a useless mathematical object, a heterogeneous collection of statistical models not used in any application. The newer terminology names an important property of statistical models. Presumably, that is the reason for the newer terminology. In this choice we are also following Barndorff-Nielsen (1978).
All of the models used in this course, Poisson, multinomial, product multinomial, univariate and multivariate normal, are exponential families. So are logistic regression, Poisson regression with log link, and multinomial response regression. So are hierarchical and graphical log-linear models. But not the chi-square distribution (because it has no parameters).
More about these examples in Sections 7 and 9.3 below.
Also many models outside of this course are exponential families. Linear regression is. So are many complicated models in spatial statistics (Baddeley et al., 2024; Geyer, 1990, 1999; Geyer and Møller, 1994), biology (Geyer et al., 1993, 2007; Geyer and Thompson, 1992), and networks (Handcock et al., 2023; Okabayashi, 2011).
The point of the theory of exponential families in general and of these notes in particular is to bring all of these disparate statistical models under one theoretical concept so we can learn general principles that apply to all of them.
Although we usually say “the” canonical statistic, “the” canonical parameter, and “the” cumulant function, these are not uniquely defined:
any one-to-one affine function of a canonical statistic vector is another canonical statistic vector,
any one-to-one affine function of a canonical parameter vector is another canonical parameter vector, and
any real-valued affine function plus a cumulant function is another cumulant function.
(see Section 9.3.2 below for the definition of affine function).
These possible changes of statistic, parameter, or cumulant function are not algebraically independent. Changes to one may require changes to the others to keep a log likelihood of the form (3.1) above.
Usually no fuss is made about this nonuniqueness. One fixes a choice of canonical statistic, canonical parameter, and cumulant function and leaves it at that.
More about this nonuniqueness in Sections 9.3.5 and 14 below.
The cumulant function may not be defined by (3.1) above on the whole vector space where \(\theta\) takes values. In that case it can be extended to this whole vector space by \[\begin{equation} c(\theta) = c(\psi) + \log\left\{ E_\psi\bigl( e^{\langle Y, \theta - \psi\rangle} \bigr) \right\} \tag{3.2} \end{equation}\] where \(\theta\) varies while \(\psi\) is fixed at a possible canonical parameter value, and the expectation and hence \(c(\theta)\) are assigned the value \(\infty\) for \(\theta\) such that the expectation does not exist (our equation (3.2) is equation (5) in Geyer (2009)).
An exponential family is full if its canonical parameter space is \[\begin{equation} \Theta = \{\, \theta : c(\theta) < \infty \,\} \tag{3.3} \end{equation}\] (where the cumulant function is defined by (3.2)), and a full family is regular if its canonical parameter space is an open subset of the vector space where \(\theta\) takes values.
Almost all exponential families used in real applications are full and regular. So-called curved exponential families (smooth non-affine submodels of full exponential families) are not full (an example is the ABO blood group example in our notes on likelihood computation). Constrained exponential families (Geyer, 1991) are not full. A few exponential families used in spatial statistics are full but not regular (Geyer, 1999; Geyer and Møller, 1994).
An example where the canonical parameter space of a full family (3.3) is not a whole vector space is the negative binomial distribution (Section 7.3 below).
Our definition (Section 3.1 above) simplifies many arguments but it does not tell us exactly what the probability mass functions (PMF) or probability density functions (PDF) are.
We can get PMF or PDF as the case may be (we say PMDF to include both) by realizing that the log likelihood (3.1) is log PMDF plus perhaps a term that may depend on the data but does not contain parameters. Thus the PMDF of the distribution having canonical parameter vector \(\theta\) has the form \[\begin{equation} f_\theta(w) = e^{\langle Y(w), \theta \rangle - c(\theta) + h(w)} \tag{3.4} \end{equation}\] where \(w\) is the whole data (\(Y\) may be only a statistic, that is, a function of the whole data) and \(h(w)\) is some term not containing the parameters that is dropped to get the log likelihood to have the form (3.1).
The reason why the cumulant function has the name it has is because it is related to the cumulant generating function (CGF), which is the logarithm of a moment generating function (MGF). Derivatives of an MGF evaluated at zero give moments (expectations of powers of a random variable). Derivatives of a CGF evaluated at zero give cumulants. Cumulants are polynomial functions of moments and vice versa.
Using (3.2), the MGF of the distribution of the canonical (vector) statistic for canonical (vector) parameter value \(\theta\) in an exponential family with log likelihood (3.1) is given by \[ M_\theta(t) = E_\theta(e^{\langle Y, t \rangle}) = e^{c(\theta + t) - c(\theta)} \] provided this formula defines an MGF, which it does if and only if it is finite for \(t\) in a neighborhood of zero, which happens if and only if \(\theta\) is in the interior of the full canonical parameter space (3.3).
So the cumulant generating function is \[ K_\theta(t) = c(\theta + t) - c(\theta) \] provided \(\theta\) is in the interior of \(\Theta\).
It is easy to see (chain rule from calculus) that derivatives of \(K_\theta\) evaluated at zero are derivatives of \(c\) evaluated at \(\theta\). So derivatives of \(c\) evaluated at \(\theta\) are cumulants.
We will be only interested in the first two cumulants \[\begin{align} E_\theta(Y) & = \nabla c(\theta) \tag{4.1} \\ \mathop{\rm var}\nolimits_\theta(Y) & = \nabla^2 c(\theta) \tag{4.2} \end{align}\] (Barndorff-Nielsen, 1978, Theorem 8.1).
Equation (4.1) is a vector equation. On the left-hand side we have the mean vector of the canonical statistic vector. Its components are the expectations \(E_\theta(Y_i)\). On the right-hand side we have the first derivative vector (also called gradient vector) of the cumulant function. Its components are the partial derivatives \(\partial c(\theta) / \partial \theta_i\).
Equation (4.2) is a matrix equation. On the left-hand side we have the variance matrix of the canonical statistic vector. Its components are the covariances \(\mathop{\rm cov}_\theta(Y_i, Y_j)\). On the right-hand side we have the second derivative matrix (also called Hessian matrix) of the cumulant function. Its components are the second partial derivatives \(\partial^2 c(\theta) / \partial \theta_i \partial \theta_j\).
Hence for any \(\theta\) in the interior of \(\Theta\), the corresponding probability distribution has moments and cumulants of all orders. In particular, every distribution in a regular full exponential family has moments and cumulants of all orders, and the mean and variance are given by the formulas above.
Conversely, any distribution whose canonical parameter value is on the boundary of the full canonical parameter space does not have a moment generating function or a cumulant generating function, and no moments or cumulants need exist.
Theorem 4.1 In a regular full exponential family (4.2) is the Fisher information matrix (both observed and expected) for the canonical parameter vector \(\theta\).
The proof of this theorem (and all other theorems in these notes) is given in Appendix A.
Define a mapping from the canonical parameter space \(\Theta\) to the vector space where the canonical statistic vector and mean value parameter vector live by \[\begin{equation} h(\theta) = \nabla c(\theta) = E_\theta(Y) \tag{5.1} \end{equation}\] using equation (4.1).
Theorem 5.1 The mean values \(\mu = E_\theta(Y)\) parameterize a regular full exponential family. Different distributions in the family have different mean vectors.
We call \(\mu\) as defined in the theorem and in equation (5.1) the mean value parameter vector.
In elementary applications mean value parameters are preferred. When we introduce the Poisson distribution we use the mean value parameter \(\mu\). When we introduce the binomial and multinomial distributions we use a parameter proportional to the mean value parameter: the usual parameter is \(\pi\) and the mean value parameter is \(\mu = n \pi\).
It is only when we get to generalized linear models with Poisson, binomial, or multinomial response and log-linear models for contingency tables that we need canonical parameters. But for these models mean value parameters are also useful. We often reason using both.
A parameterization of a statistical model is identifiable if no two distinct (vector) parameter values correspond to the same probability distribution.
The canonical parameterization of a regular full exponential family need not be identifiable. It is identifiable if and only if the canonical statistic vector is not concentrated on a hyperplane in the vector space where it takes values (Geyer, 2009, Theorem 1). This condition is equivalent to the variance matrix of the canonical statistic vector (4.2) being invertible. It is always possible to choose an identifiable canonical parameterization but not always convenient (Geyer, 2009).
More on this in Section 7 below.
The mean value parameterization of a regular full exponential family is always identifiable. No distribution can have two different mean vectors.
Since cumulant functions are infinitely differentiable, the transformation from canonical to mean value parameters described above is infinitely differentiable.
If the canonical parameterization is identifiable, then the transformation is one-to-one and onto, so the inverse transformation exists. It is also infinitely differentiable by the inverse function theorem of real analysis, because the derivative of the map from canonical to mean value parameters, which is the variance of the canonical statistic vector by (4.1) and (4.2), is invertible whenever the canonical parameterization is identifiable. But we often do not have a symbolic expression for this inverse transformation.
Theorem 6.1 Suppose a regular full exponential family, and define the function \[\begin{equation} q(\theta) = \langle \mu, \theta \rangle - c(\theta) \tag{6.1} \end{equation}\] If \(\mu\) is a possible value of the mean value parameter vector, then any \(\theta\) that maximizes \(q\) is a corresponding value of the canonical parameter vector. Moreover the maximum always exists.
Observe that (6.1) is the same as the log likelihood (3.1) except that \(y\) in (3.1) is replaced by \(\mu\) in (6.1), where \(\mu = E_\theta(Y)\). That is, we replace the observed value of the canonical statistic vector by a possible mean value of it.
Theorem 6.1 allows us to compute \(\theta\) in terms of \(\mu\) but does not give us a symbolic expression for the transformation from one parameterization to the other (no such symbolic expression need exist).
The theorem does not assert that the maximizer of (6.1) is unique. It must be unique if the canonical parameterization is identifiable. It need not be unique if the canonical parameterization is not identifiable. In fact, any canonical parameter vector corresponding to the same distribution as \(\mu\) maximizes (6.1).
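Here is a minimal numerical sketch of Theorem 6.1 in R. It recovers the canonical parameter corresponding to a given mean value parameter for the binomial family, whose cumulant function \(c(\theta) = n \log(1 + e^\theta)\) is derived in Section 7.1 below, by maximizing (6.1) with R function `optimize`. The values of \(n\) and \(\mu\) are made up for illustration.

```r
# Recover theta from mu for the binomial family by maximizing (6.1).
# The cumulant function c(theta) = n log(1 + exp(theta)) is derived in
# Section 7.1 below; n and mu here are made-up illustrative values.
n <- 10
mu <- 3                                # a possible mean value, 0 < mu < n
q <- function(theta) mu * theta - n * log1p(exp(theta))
theta.hat <- optimize(q, interval = c(-20, 20), maximum = TRUE)$maximum
theta.hat                              # agrees with the exact answer below
log((mu / n) / (1 - mu / n))           # exact answer: logit(mu / n)
```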
Theorem 6.2 In a regular full exponential family having identifiable canonical parameterization, any algorithm that always goes uphill on \(q\) defined by (6.1) until the derivative of \(q\) goes to zero converges to the unique maximizer of \(q\).
The point of the theorem is that maximization of upper semicontinuous strictly concave functions having a unique maximizer is easy (which includes log likelihoods of regular full exponential families and also \(q\) defined by (6.1) when the canonical parameterization is identifiable). Any algorithm, no matter what the starting point, and no matter how inefficient, can find the unique solution, so long as it is not so stupid as to be going downhill when it should be going uphill.
Theorem 6.3 Suppose a regular full exponential family, suppose the observed value of the canonical statistic vector \(y\) is a possible value of the mean value parameter vector, and suppose the canonical parameterization is identifiable. Then any algorithm that always goes uphill on \(l\) defined by (3.1) until the derivative of \(l\) goes to zero converges to the unique MLE.
The case where the observed value of the canonical statistic vector is not a possible value of the mean value parameter vector, so the MLE does not exist, is discussed in other lecture notes.
More on maps between the two parameterizations in Section 13 below.
The binomial distribution has log likelihood \[ l(\pi) = x \log(\pi) + (n - x) \log(1 - \pi) \] where \(x\) is the “usual” data, \(\pi\) is the “usual” parameter, and \(n\) is neither data nor parameter but rather a constant. To get this into the form (3.1), assuming the family is one-dimensional (which having only one parameter suggests), we need to write this as a function of \(x\) (which will be the canonical statistic \(y\)) times a function of \(\pi\) (which will be the canonical parameter \(\theta\)) minus a function of \(\theta\) (the cumulant function). So isolate \(x\) \[\begin{align*} l(\pi) & = x \bigl[ \log(\pi) - \log(1 - \pi) \bigr] + n \log(1 - \pi) \\ & = x \log \left( \frac{\pi}{1 - \pi} \right) + n \log(1 - \pi) \end{align*}\] and this tells us
the canonical statistic is \(x\),
the canonical parameter is \[\begin{equation} \theta = \log \left( \frac{\pi}{1 - \pi} \right), \tag{7.1} \end{equation}\] and
the cumulant function is \[\begin{equation} c(\theta) = - n \log(1 - \pi). \tag{7.2} \end{equation}\]
Of course (7.2) doesn’t really make sense because it has \(\theta\) on one side and \(\pi\) on the other. We need to solve (7.1) for \(\pi\) obtaining \[\begin{equation} \pi = \frac{e^{\theta}}{1 + e^\theta} \tag{7.3} \end{equation}\] and plug this back into (7.2) obtaining \[\begin{align*} c(\theta) & = - n \log(1 - \pi) \\ & = - n \log\left(1 - \frac{e^{\theta}}{1 + e^\theta}\right) \\ & = - n \log\left(\frac{1}{1 + e^\theta}\right) \\ & = n \log\left(1 + e^\theta\right) \end{align*}\] Since this formula is valid for all \(\theta\), we see the binomial family is a regular full exponential family whose canonical parameter space is the whole of Euclidean space (one-dimensional Euclidean space in this case).
This change-of-parameter is so important that statisticians give it a name \[\begin{equation} \mathop{\rm logit}(\pi) = \log \left( \frac{\pi}{1 - \pi} \right) \qquad 0 < \pi < 1. \tag{7.4} \end{equation}\] (\(\mathop{\rm logit}\) is pronounced “low-jit” with a soft “g”).
Note that 0 and 1 are not possible arguments of the \(\mathop{\rm logit}\) function. We get \(\log(0)\) or division by zero, either of which is undefined. The \(\mathop{\rm logit}\) function maps \((0, 1) \to (- \infty, + \infty)\). So the binomial family of distributions, considered as an exponential family, has “usual” parameter space \(0 < \pi < 1\) and canonical parameter space \(- \infty < \theta < + \infty\).
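As an aside, the logit function and its inverse are available in R as the quantile and distribution functions of the standard logistic distribution, so there is no need to hand code them.

```r
# logit and inverse logit via the standard logistic distribution;
# p here plays the role of pi and is a made-up illustrative value
p <- 0.3
theta <- qlogis(p)        # same as log(p / (1 - p)), equation (7.4)
theta
plogis(theta)             # same as exp(theta) / (1 + exp(theta)), equation (7.3), recovers p
```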
This is a little surprising because the usual “usual” parameter space is \(0 \le \pi \le 1\) because otherwise we have no parameter values to be maximum likelihood estimates when we observe \(x = 0\) or \(x = n\). So exponential family theory causes some trouble here. Much more on this subject later.
If we think about it a bit more though, this is just the non-degeneracy property of exponential families (Section 3.8 above) at work. This is why the degenerate distributions concentrated at \(x = 0\) and \(x = n\) (for \(\pi = 0\) and \(\pi = 1\)) cannot be included in the exponential family.
Theorem 7.1 \[\begin{align*} c'(\theta) & = n \pi \\ c''(\theta) & = n \pi (1 - \pi) \end{align*}\]
Thus this theory gives us back the known mean and variance of the binomial distribution and the map from canonical parameter to mean value parameter and Fisher information.
For the binomial distribution
the canonical statistic is \(x\),
the canonical parameter is \(\theta = \mathop{\rm logit}(\pi)\),
the mean value parameter is \(\mu = n \pi\),
the canonical parameter space of the full family is \(- \infty < \theta < + \infty\),
the mean value parameter space of the full family is \(0 < \mu < n\),
the cumulant function is \[\begin{equation} c(\theta) = n \log\bigl(1 + e^\theta\bigr), \tag{7.5} \end{equation}\] and
Fisher information for \(\theta\) is \(I(\theta) = n \pi (1 - \pi)\).
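Here is a minimal numerical check of Theorem 7.1 and the summary above, using finite differences of the cumulant function (7.5) rather than symbolic differentiation. The values of \(n\) and \(\pi\) are made up for illustration.

```r
# Finite-difference check that c'(theta) = n pi and c''(theta) = n pi (1 - pi)
# for the binomial cumulant function (7.5); n and p (playing the role of pi)
# are made-up illustrative values.
n <- 10
p <- 0.3
theta <- qlogis(p)
cfun <- function(theta) n * log1p(exp(theta))
eps <- 1e-4
(cfun(theta + eps) - cfun(theta - eps)) / (2 * eps)                # approx n * p
(cfun(theta + eps) - 2 * cfun(theta) + cfun(theta - eps)) / eps^2  # approx n * p * (1 - p)
c(n * p, n * p * (1 - p))                                          # exact values
```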
The Poisson distribution has log likelihood \[ l(\mu) = x \log(\mu) - \mu \] where \(x\) is the “usual” data and \(\mu\) is the “usual” parameter. To get this into the form (3.1), assuming the family is one-dimensional (which having only one parameter suggests), we need to write this as a function of \(x\) (which will be the canonical statistic \(y\)) times a function of \(\mu\) (which will be the canonical parameter \(\theta\)) minus a function of \(\theta\) (the cumulant function). Obviously,
the canonical statistic is \(x\),
the canonical parameter is \[ \theta = \log(\mu), \] which has inverse function \[ \mu = e^\theta, \] and
the cumulant function is \[ c(\theta) = \mu = e^\theta. \]
Note that 0 is not a possible argument of the \(\log\) function. The \(\log\) function maps \((0, + \infty) \to (- \infty, + \infty)\). So the Poisson family of distributions, considered as an exponential family, has “usual” parameter space \(0 < \mu < + \infty\) and canonical parameter space \(- \infty < \theta < + \infty\).
The Poisson distribution for \(\mu = 0\) is degenerate, concentrated at zero, so by the non-degeneracy property (Section 3.8 above) it cannot be included in the exponential family.
Theorem 7.2 \[\begin{align*} c'(\theta) & = \mu \\ c''(\theta) & = \mu \end{align*}\]
For the Poisson distribution
the canonical statistic is \(x\),
the canonical parameter is \(\theta = \log(\mu)\),
the mean value parameter is \(\mu\) (this is also the “usual” parameter),
the canonical parameter space of the full family is \(- \infty < \theta < + \infty\),
the mean value parameter space of the full family is \(0 < \mu < + \infty\), and
the cumulant function is \[\begin{equation} c(\theta) = e^\theta, \tag{7.6} \end{equation}\] and
Fisher information for \(\theta\) is \(I(\theta) = \mu\).
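Here is a minimal numerical check that the Poisson PMF really has the exponential family form (3.4) with this canonical statistic, canonical parameter, and cumulant function, taking \(h(x) = - \log(x !)\). The value of \(\mu\) is made up for illustration.

```r
# Check that the Poisson PMF matches the exponential family form (3.4)
# with theta = log(mu), c(theta) = exp(theta), h(x) = - log(x!).
# The value of mu is made up for illustration.
mu <- 2.5
theta <- log(mu)
x <- 0:10
all.equal(dpois(x, mu), exp(x * theta - exp(theta) - lfactorial(x)))
```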
The negative binomial distribution has PMF \[\begin{equation} f(x) = \binom{r + x - 1}{x} \pi^r (1 - \pi)^x, \qquad x = 0, 1, \ldots \tag{7.7} \end{equation}\] where \(x\) is the “usual” data and \(r\) and \(\pi\) are the “usual” parameters and \[ \binom{r}{x} = \frac{r \cdot (r - 1) \cdots (r - x + 1)}{x !}. \]
If we consider both \(r\) and \(\pi\) to be unknown parameters, this is not an exponential family. If we consider \(r\) to be known, so \(\pi\) is the only unknown parameter, then it is an exponential family. So we consider \(r\) as known.
Then the log likelihood is \[ l(\pi) = r \log(\pi) + x \log(1 - \pi). \] Obviously,
the canonical statistic is \(x\),
the canonical parameter is \[ \theta = \log(1 - \pi), \] which has inverse function \[ \pi = 1 - e^\theta, \] and
the cumulant function is \[\begin{equation} c(\theta) = - r \log(\pi) = - r \log\bigl(1 - e^\theta\bigr). \tag{7.8} \end{equation}\]
The negative binomial distribution for \(\pi = 0\) does not exist (\(\pi = 0\) is not an allowed parameter value). The negative binomial distribution for \(\pi = 1\) does exist but is degenerate, concentrated at zero. So this degenerate distribution cannot be included in the exponential family (by Section 3.8 above).
The \(\log\) function maps \((0, 1) \to (- \infty, 0)\). So the negative binomial family of distributions, considered as an exponential family, has “usual” parameter space \(0 < \pi < 1\) and canonical parameter space \(- \infty < \theta < 0\).
We are tempted to use (3.2) to extend the family, but if we try, we find that (3.2) gives us the same cumulant function we already have, so the negative binomial family as we have described it is already a regular full exponential family.
Theorem 7.3 \[\begin{align*} c'(\theta) & = \frac{r (1 - \pi)}{\pi} \\ c''(\theta) & = \frac{r (1 - \pi)}{\pi^2} \end{align*}\]
For the negative binomial distribution
the canonical statistic is \(x\),
the canonical parameter is \(\theta = \log(1 - \pi)\),
the mean value parameter is \(\mu = r (1 - \pi) / \pi\),
the canonical parameter space of the full family is \(- \infty < \theta < 0\),
the mean value parameter space of the full family is \(0 < \mu < \infty\),
the cumulant function is \[\begin{equation} c(\theta) = - r \log(1 - e^\theta), \tag{7.9} \end{equation}\] and
Fisher information for \(\theta\) is \(I(\theta) = r (1 - \pi) / \pi^2\).
This provides an example of a regular full exponential family whose canonical parameter space (3.3) is not a whole Euclidean space (in this case one-dimensional Euclidean space).
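Here is a minimal numerical check of Theorem 7.3, comparing finite differences of the cumulant function (7.9) with the mean and variance formulas above. The parameterization matches that of R function `dnbinom` with `size` equal to \(r\) and `prob` equal to \(\pi\); the values of \(r\) and \(\pi\) are made up for illustration.

```r
# Finite-difference check of Theorem 7.3 for the negative binomial
# cumulant function (7.9); r and p (playing the role of pi) are made-up
# values, matching the dnbinom(size = r, prob = p) parameterization.
r <- 4
p <- 0.3
theta <- log(1 - p)
cfun <- function(theta) - r * log1p(- exp(theta))
eps <- 1e-4
(cfun(theta + eps) - cfun(theta - eps)) / (2 * eps)                # approx r (1 - p) / p
(cfun(theta + eps) - 2 * cfun(theta) + cfun(theta - eps)) / eps^2  # approx r (1 - p) / p^2
c(r * (1 - p) / p, r * (1 - p) / p^2)                              # exact values
```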
The multinomial PMF is \[ f(x) = \binom{n}{x} \prod_{i = 1}^k \pi_i^{x_i} \] where \(x = (x_1, \ldots, x_k)\) is the vector of cell counts of a contingency table (the “usual” data vector), where \(\pi = (\pi_1, \ldots, \pi_k)\) is the vector of cell probabilities (the “usual” parameter vector), where \(n\) is the sample size (the sum of the cell counts), and where \(\binom{n}{x}\) is a multinomial coefficient. The log likelihood is \[ l(\pi) = \sum_{i = 1}^k x_i \log(\pi_i) \]
It looks like we have exponential family form with the canonical statistic vector \(x\), canonical parameter vector \(\theta\) having components \(\theta_i = \log(\pi_i)\), and cumulant function the zero function.
As we shall eventually see, this is correct but misleading. If we use (4.1) and (4.2) on the zero function we get that \(y\) has mean vector zero and variance matrix zero, which is not the mean vector and variance matrix of the multinomial distribution. But we are not allowed to use those formulas because the components of \(\theta\), as we have defined them in this section, are not separately variable: the components of \(\pi\) are constrained to sum to one. And that translates to a complicated nonlinear constraint on the canonical parameters \[\begin{equation} \sum_{i = 1}^k e^{\theta_i} = 1. \tag{7.10} \end{equation}\] That looks annoying. So we try something else.
If we eliminate one of the components, say \(\pi_k\), we get log likelihood \[ l(\pi_1, \ldots, \pi_{k - 1}) = x_k \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \log(\pi_i) \] But since we are trying to put this in exponential family form in which the canonical statistic vector and canonical parameter vector must have the same dimension, we have to also eliminate the corresponding component of \(x\). We can do that because the components of \(x\) are constrained to sum to \(n\). If we use that fact to eliminate \(x_k\) we get \[\begin{align*} l(\pi_1, \ldots, \pi_{k - 1}) & = \left(n - \sum_{i = 1}^{k - 1} x_i\right) \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \log(\pi_i) \\ & = n \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \left[ \log(\pi_i) - \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) \right] \\ & = n \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \log\left( \frac{\pi_i}{ 1 - \sum_{i = 1}^{k - 1} \pi_i} \right) \end{align*}\] and this has exponential family form with
canonical statistic vector \((x_1, \ldots, x_{k - 1})\),
canonical parameter vector \((\theta_1, \ldots, \theta_{k - 1})\), where \[\begin{equation} \theta_i = \log\left( \frac{\pi_i}{1 - \sum_{i = 1}^{k - 1} \pi_i} \right) \tag{7.11} \end{equation}\] and
cumulant function \[\begin{equation} c(\theta_1, \ldots, \theta_{k - 1}) = - n \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right), \tag{7.12} \end{equation}\]
except that (as usual) this doesn’t make sense with \(\theta\)’s on the left-hand side and \(\pi\)’s on the right-hand side, so we have to solve for the \(\pi\)’s in terms of the \(\theta\)’s and plug in. Reintroduce \(\pi_k = 1 - \sum_{i = 1}^{k - 1} \pi_i\) and plug into (7.11) obtaining \[ \theta_i = \log\left( \frac{\pi_i}{\pi_k} \right) \] or \[ e^{\theta_i} = \frac{\pi_i}{\pi_k} \] or \[\begin{equation} \pi_i = \pi_k e^{\theta_i} \tag{7.13} \end{equation}\] and this also holds for \(i = k\) if we define \(\theta_k = 0\) (which we may do because there was no definition of \(\theta_k\) before). Since the \(\pi\)’s must sum to \(1\), summing (7.13) gives \[ 1 = \pi_k \sum_{i = 1}^k e^{\theta_i} = \pi_k \left( 1 + \sum_{i = 1}^{k - 1} e^{\theta_i} \right) \] and \[ \pi_k = \frac{1}{1 + \sum_{i = 1}^{k - 1} e^{\theta_i}} \] and (plugging this back into (7.13)) \[\begin{equation} \pi_i = \frac{e^{\theta_i}}{1 + \sum_{i = 1}^{k - 1} e^{\theta_i}} \tag{7.14} \end{equation}\] and (plugging this back into (7.12)) \[\begin{align*} c(\theta_1, \ldots, \theta_{k - 1}) & = - n \log\left(1 - \sum_{i = 1}^{k - 1} \frac{e^{\theta_i}}{1 + \sum_{i = 1}^{k - 1} e^{\theta_i}} \right) \\ & = n \log \left( 1 + \sum_{i = 1}^{k - 1} e^{\theta_i} \right) \end{align*}\]
I don’t know what you think of this, but I think it is horribly messy. Also arbitrary. Clearly, which \(\pi_i\) we choose to eliminate is arbitrary. But we also know (Section 3.4 above) that there is a lot more arbitrariness than that in choosing canonical statistic, canonical parameter, and cumulant function.
Nevertheless, this messy parameterization is widely used. Agresti (2013) uses it in Chapter 8.
Let’s try again, this time not eliminating parameters or statistics, and using (3.2) to define the cumulant function. Let \(\psi\) in (3.2) be the canonical parameter vector for the multinomial distribution having probability vector \(p = (p_1, \ldots, p_k)\). Let \(S\) denote the sample space of the multinomial distribution: the set of all vectors \(x\) having nonnegative integer components that sum to \(n\). Then (3.2) says \[\begin{align*} c(\theta) & = c(\psi) + \log \sum_{x \in S} e^{\langle x, \theta - \psi \rangle} \cdot \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} \\ & = c(\psi) + \log \sum_{x \in S} e^{\sum_{i = 1}^k x_i (\theta_i - \psi_i)} \cdot \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} \\ & = c(\psi) + \log \sum_{x \in S} \left[ \prod_{i = 1}^k e^{x_i (\theta_i - \psi_i)} \right] \cdot \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} \\ & = c(\psi) + \log \sum_{x \in S} \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} e^{x_i (\theta_i - \psi_i)} \\ & = c(\psi) + \log \sum_{x \in S} \binom{n}{x} \prod_{i = 1}^k \bigl[ p_i e^{\theta_i - \psi_i} \bigr]^{x_i} \\ & = c(\psi) + \log \left( \sum_{i = 1}^k p_i e^{\theta_i - \psi_i} \right)^n \\ & = c(\psi) + n \log \left( \sum_{i = 1}^k p_i e^{\theta_i - \psi_i} \right) \end{align*}\] We use the multinomial theorem to evaluate the sum over the sample space. If we take \(\psi\) to be the vector with all components zero and \(p\) to be the vector with all components equal to \(1 / k\), which we are free to do because \(\psi\) and \(p\) were arbitrary, we get \[\begin{align*} c(\theta) & = c(\psi) + n \log \left( \frac{1}{k} \sum_{i = 1}^k e^{\theta_i} \right) \\ & = c(\psi) - n \log(k) + n \log \left( \sum_{i = 1}^k e^{\theta_i} \right) \end{align*}\] and, since we are free to choose \(c(\psi)\), we can choose it to cancel the \(- n \log(k)\), finally obtaining \[ c(\theta) = n \log \left( \sum_{i = 1}^k e^{\theta_i} \right) \]
Theorem 7.4 The map from canonical parameters to usual parameters is \[\begin{equation} \pi_j = \frac{e^{\theta_j}}{\sum_{i = 1}^k e^{\theta_i}} \tag{7.15} \end{equation}\] and \[\begin{align*} \partial c(\theta) / \partial \theta_i & = n \pi_i \\ \partial^2 c(\theta) / \partial \theta_i^2 & = n \pi_i (1 - \pi_i) \\ \partial^2 c(\theta) / \partial \theta_i \partial \theta_j & = - n \pi_i \pi_j, \qquad i \neq j \end{align*}\]
So the Fisher information matrix for \(\theta\), denoted \(I(\theta)\), has \(i, i\) components \(n \pi_i (1 - \pi_i)\) and \(i, j\) components \(- n \pi_i \pi_j\) for \(i \neq j\).
A multinomial distribution having usual parameter vector \(\pi\) with any component \(\pi_i\) equal to zero is partially degenerate, with the distribution of \(y_i\) concentrated at zero, so by the non-degeneracy property (Section 3.8 above) it cannot be included in the exponential family. In light of this it is not surprising that (7.15) is never zero. Similarly for the try II parameterization: (7.14) is never zero.
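Here is a minimal numerical check of Theorem 7.4, using finite differences of the cumulant function derived just above (displayed again as (7.18) below). The gradient should be \(n \pi\) with \(\pi\) given by (7.15), and the Hessian should be the Fisher information matrix just described. The values of \(n\) and \(\theta\) are made up for illustration.

```r
# Finite-difference check of Theorem 7.4: the gradient of the "try III"
# cumulant function c(theta) = n log(sum(exp(theta))) is n p with p given
# by (7.15), and the Hessian is the Fisher information matrix
# n (diag(p) - p p^T).  The values of n and theta are made up.
n <- 20
theta <- c(0.5, -1, 0, 2)
k <- length(theta)
p <- exp(theta) / sum(exp(theta))            # equation (7.15)
cfun <- function(theta) n * log(sum(exp(theta)))
eps <- 1e-4
grad <- sapply(1:k, function(i) {
    e <- replace(numeric(k), i, eps)
    (cfun(theta + e) - cfun(theta - e)) / (2 * eps)
})
all.equal(grad, n * p, tolerance = 1e-6)     # mean value parameter, equation (4.1)
hess <- matrix(NA, k, k)
for (i in 1:k) for (j in 1:k) {
    ei <- replace(numeric(k), i, eps)
    ej <- replace(numeric(k), j, eps)
    hess[i, j] <- (cfun(theta + ei + ej) - cfun(theta + ei - ej) -
        cfun(theta - ei + ej) + cfun(theta - ei - ej)) / (4 * eps^2)
}
all.equal(hess, n * (diag(p) - outer(p, p)), tolerance = 1e-4)  # equation (4.2)
```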
For the multinomial family of dimension \(k\) and sample size \(n\)
the canonical statistic is the vector \((x_1, \ldots, x_{k - 1})\) whose components are all but one of the category counts,
the canonical parameter vector is \((\theta_1, \ldots, \theta_{k - 1})\), where \[\begin{equation} \theta_i = \log\left( \frac{\pi_i}{\pi_k} \right), \tag{7.16} \end{equation}\]
the mean value parameter vector is \((\mu_1, \ldots, \mu_{k - 1})\), where \(\mu_i = n \pi_i\),
the canonical parameter space of the full family is the whole of \((k - 1)\)-dimensional Euclidean space,
the mean value parameter space is the set of mean value parameter vectors that satisfy \[\begin{gather} \mu_i > 0, \qquad i = 1, \ldots, k - 1, \\ \sum_{i = 1}^{k - 1} \mu_i < n, \end{gather}\] and
the cumulant function is \[\begin{equation} c(\theta) = n \log \left( 1 + \sum_{i = 1}^{k - 1} e^{\theta_i} \right) \tag{7.17} \end{equation}\]
The binomial distribution is the \(k = 2\) case of the multinomial distribution when the “try II” parameterization is used. If \(x_1\) is the number of successes and \(x_2\) is the number of failures, then we eliminate \(x_2\) from the canonical statistic vector, just leaving the scalar variable \(x_1\). Because the \(\pi\)’s sum to one we have \(\pi_2 = 1 - \pi_1\) and plugging this into (7.16) gives the analogous equation for the binomial distribution (7.1). Also (7.17) is the same as the analogous equation for the binomial distribution (7.5).
For the multinomial family of dimension \(k\) and sample size \(n\)
the canonical statistic is the vector \(x\) of category counts
the usual parameter is the vector \(\pi\),
the canonical parameter is the vector \(\theta\),
the mean value parameter vector is \(n \pi\), where the components of \(\pi\) are given by (7.15),
the canonical parameter space of the full family is the whole of \(k\)-dimensional Euclidean space,
the mean value parameter space of the full family is the set of all vectors \(\mu\) whose components are strictly positive and sum to \(n\),
the cumulant function is \[\begin{equation} c(\theta) = n \log \left( \sum_{i = 1}^k e^{\theta_i} \right) \tag{7.18} \end{equation}\]
the canonical statistic vector satisfies the linear equality \(x_1 + x_2 + \cdots + x_k = n\), hence (Section 6.1 above) the canonical parameterization is not identifiable, and
if \(\theta\) having components \(\theta_i\) is a possible value of the canonical parameter vector, then so is \(\theta + r\) having components \(\theta_i + r\) for any real number \(r\).
This provides an example where we have three different ways to put the distribution in exponential family form and we can choose the one we like.
Try I seemed simple at first but is complicated by the nonlinear constraint on the parameters (7.10). Try II is ugly but simple to use. Try III is elegant but a little more complicated.
Try I is the special case of Try III where we impose the constraint (7.10). Try II is the special case of Try III where we impose the constraint \(\theta_k = 0\) and then don’t consider \(\theta_k\) part of the canonical parameter vector and don’t consider \(x_k\) part of the canonical statistic vector.
Try III is complicated by the fact that the mapping between canonical parameters and usual parameters (7.15) is not a one-to-one mapping (the “try III” parameterization is not identifiable). In order to get a one-to-one relationship we need to impose a constraint on the canonical parameters. We can choose the constraint to be anything convenient. We can use the ones already mentioned, (7.10) and \(\theta_k = 0\). But we can also use others.
Try III is important because it keeps all the cell counts as components of the canonical statistic vector. This makes it much easier to reason about models. If we want to directly compare Poisson, multinomial, and product multinomial models, then we need to use the “try III” parameterization for the multinomial and product multinomial models in order for them to have the same canonical parameter space as the Poisson model. In the complicated exponential family models that arise in aster models (Geyer, 2023; Geyer et al., 2007), it is essential that the canonical statistic vector be the full vector of counts. Thus we have to use the “try III” parameterization.
We can explain the difference between try II and try III as follows. In try II we think that getting a one-to-one correspondence between usual parameters and canonical parameters is so important that we do it right away and make all the math and other reasoning about models very messy. In try III we think that getting a one-to-one correspondence is not important, and we do not allow it to mess up the math and other reasoning. We can impose a constraint to get one-to-one correspondence whenever we want.
When we consider maximum likelihood estimates in the completion of a full exponential family and then consider the special case of multinomial response models in that context, we will see that any attempt to use the “try II” parameterization makes a complete mess of the mathematics (it still works but becomes wildly unintuitive).
This section can be omitted. Its only purpose is to show an exponential family in which the “usual” statistic is not canonical.
The PDF of the univariate normal distribution is \[ f_{\mu, \sigma}(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{- (x - \mu)^2 / (2 \sigma^2)} \] from which we get log likelihood \[\begin{align*} l(\mu, \sigma) & = - \log(\sigma) - \frac{(x - \mu)^2}{2 \sigma^2} \\ & = - \log(\sigma) - \frac{x^2}{2 \sigma^2} + \frac{x \mu}{\sigma^2} - \frac{\mu^2}{2 \sigma^2} \end{align*}\] and we see this does have the exponential family form (3.1) with
canonical statistic vector \((x, x^2)\),
canonical parameter vector \(\theta\) with \[\begin{align*} \theta_1 & = \frac{\mu}{\sigma^2} \\ \theta_2 & = - \frac{1}{2 \sigma^2} \end{align*}\] and
cumulant function \[\begin{align*} c(\theta) & = \log(\sigma) + \frac{\mu^2}{2 \sigma^2} \end{align*}\] where we still have to solve for \(\mu\) and \(\sigma\) in terms of \(\theta_1\) and \(\theta_2\) to have this usable.
We won’t bother to go any further because we are not going to use this theory of the normal distribution as an exponential family.
The only point was to have an example where the usual statistic \(x\) was not the canonical statistic vector \((x, x^2)\). It is surprising that the usual statistic (a scalar) and the canonical statistic (a vector) don’t even have the same dimension. Considered as an exponential family the univariate normal distribution isn’t univariate. Its canonical statistic vector is two-dimensional. But then we see that it shouldn’t be surprising because the canonical statistic vector and the canonical parameter vector always have the same dimension. So if we are going to have a two-dimensional canonical parameter vector (and we do need that because this is a two-parameter family of distributions), then we also must have a two-dimensional canonical statistic vector. This also makes more sense when we consider samples of size \(n\) (Section 9.1 below).
The multivariate normal distribution is also an exponential family, but the equations are even more complicated.
Fisher (1922) invented the concept of sufficiency and sufficient statistics. A scalar or vector statistic (function of the data that does not depend on parameters) is sufficient if the conditional distribution of the whole data given that statistic does not depend on parameters; in notation, if \(w\) is the whole data and \(y\) is the statistic, then \(y\) is sufficient if the conditional distribution of \(w\) given \(y\) does not depend on the parameters.
Fisher argued that the sufficient statistic extracts all of the information in the data relevant to estimating the parameters. If \(y\) is sufficient and \(w\) is the whole data and we write joint equals marginal times conditional \[ f_\theta(w, y) = f(w \mid y) f_\theta(y), \] only the marginal depends on the parameters. So Fisher argued that we should use that marginal to estimate parameters, because the conditional does not involve the parameters and hence has nothing to do with them. Fisher’s argument was later dubbed the sufficiency principle: statistical inference should only depend on the data through the sufficient statistics.
The likelihood may be written \[ L(\theta) = f_\theta(y) \] (as usual we may drop multiplicative terms from the likelihood that do not contain parameters). Thus likelihood inference and Bayesian inference automatically obey the sufficiency principle. It is only ad hoc methods of estimation that may violate it.
Sufficiency is a very abstract concept. It says for a particular statistical model a particular statistic (perhaps a vector statistic) contains all of the information there is about the parameters of that particular model. If one does not have a clear understanding of what a parametric statistical model is, then one cannot understand sufficiency.
The definition of sufficient statistics is hard to apply (you have to factor the joint distribution into marginal times conditional), but the Neyman-Fisher factorization criterion (Fisher, 1922; Halmos and Savage, 1949; Neyman, 1935) is much easier to apply. This says a vector statistic is sufficient if and only if there is a version of the likelihood (possibly obtained by dropping some multiplicative factors that do not contain the parameters) or log likelihood (possibly obtained by dropping some additive terms that do not contain the parameters) that depends on the whole data only through that statistic.
From (3.1) and the Neyman-Fisher factorization criterion we see that the canonical statistic vector of an exponential family is always a sufficient statistic vector.
Nowadays, there is much interest in sufficient dimension reduction in regression that does not fit into the exponential family paradigm described in Section 9.3 below. But exponential families were there first.
Suppose \(y_1\), \(y_2\), \(\ldots,\) \(y_n\) are independent and identically distributed (IID) random variables from an exponential family with log likelihood for sample size one (3.1). Then the log likelihood for sample size \(n\) is \[\begin{align*} l_n(\theta) & = \sum_{i = 1}^n \bigl[ \langle y_i, \theta \rangle - c(\theta) \bigr] \\ & = \left\langle \sum_{i = 1}^n y_i, \theta \right\rangle - n c(\theta) \end{align*}\] From this it follows that IID sampling converts one exponential family into another exponential family with
canonical statistic vector \(\sum_i y_i\), which is the sum of the canonical statistic vectors for the samples,
canonical parameter vector \(\theta\), which is the same as the canonical parameter vector for the samples,
cumulant function \(n c(\,\cdot\,)\), which is \(n\) times the cumulant function for the samples,
mean value parameter vector \(n \mu\), which is \(n\) times the mean value parameter \(\mu\) for the samples.
Many familiar “addition rules” for brand name distributions are special cases of this
sum of IID binomial is binomial,
sum of IID Poisson is Poisson,
sum of IID negative binomial is negative binomial,
sum of IID gamma is gamma,
sum of IID multinomial is multinomial, and
sum of IID multivariate normal is multivariate normal.
The point is that the dimension reduction from \(y_1\), \(y_2\), \(\ldots,\) \(y_n\) to \(\sum_i y_i\) is a sufficient dimension reduction. It loses no information about the parameters assuming the model is correct.
In the case of the univariate normal distribution (Section 7.5 above) we found that the canonical statistic is the two-dimensional vector \(y = (x, x^2)\) where \(x\) is the normal random variable. This seemed odd at the time. But now we see the canonical statistic for a sample of size \(n\) is \[\begin{equation*} y = \textstyle \left( \sum_{i = 1}^n x_i, \sum_{i = 1}^n x_i^2 \right) \end{equation*}\] and the components of this two-dimensional vector are just what is needed to compute maximum likelihood estimates of the usual parameters \[\begin{align*} \hat{\mu}_n & = \frac{1}{n} \sum_{i = 1}^n x_i \\ \hat{\sigma}^2_n & = \frac{1}{n} \sum_{i = 1}^n x_i^2 - \hat{\mu}_n^2 \end{align*}\] (or any other parameters, because each identifiable parameterization is a function of any other parameterization). So the two-dimensionality now makes sense.
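Here is a minimal sketch of this sufficient dimension reduction in R, using made-up (simulated) data: the MLEs are computed from the two-dimensional canonical statistic alone and agree with the usual formulas.

```r
# The canonical statistic for an IID normal sample is sufficient:
# the MLEs are functions of (sum(x), sum(x^2)) alone.
# The data are made up (simulated) for illustration.
set.seed(42)
n <- 100
x <- rnorm(n, mean = 3, sd = 2)
y <- c(sum(x), sum(x^2))                    # canonical statistic vector
mu.hat <- y[1] / n
sigma2.hat <- y[2] / n - mu.hat^2
all.equal(mu.hat, mean(x))
all.equal(sigma2.hat, mean((x - mean(x))^2))
```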
What about independent but not identically distributed? Models covered in this course having this property are Poisson sampling (the joint distribution is the product of Poisson distributions with perhaps different means), product multinomial sampling, Poisson regression with log link (a special case of Poisson sampling), and logistic regression (a special case of product multinomial).
Suppose we have a product model in which the total data vector \(y\) has components \(y_A\) for \(A\) in some partition \(\mathcal{A}\) of the index set of \(y\). And the \(y_A\), \(A \in \mathcal{A}\) are independent random vectors. And for each \(A \in \mathcal{A}\) the subvector \(y_A\) is the canonical statistic vector of an exponential family model with canonical parameter vector \(\theta_A\) and cumulant function \(c_A\). Then the log likelihood for the whole model is \[ \sum_{A \in \mathcal{A}} [ \langle y_A, \theta_A \rangle - c_A(\theta_A) ] = \langle y, \theta \rangle - c(\theta) \] where \[ c(\theta) = \sum_{A \in \mathcal{A}} c_A(\theta_A) \] So not only are the marginal distributions of the \(y_A\) exponential family, so is the joint distribution of \(y\).
If each of the exponential families for the \(y_A\) are regular and full, then so is the exponential family for \(y\).
There is no sufficient dimension reduction here. The sufficient statistic vector is \(y\), which is sometimes the whole data. But we combine this idea with the idea in the following section.
Suppose we parameterize a submodel of our exponential family with parameter transformation \[\begin{equation} \theta = a + M \beta \tag{9.1} \end{equation}\] where
\(a\) is a known vector, usually called the offset vector,
\(M\) is a known matrix, usually called the model matrix (also called the design matrix),
\(\theta\) is the canonical parameter vector of the original family, and
\(\beta\) is the canonical parameter vector of the canonical affine submodel (also called coefficients vector).
The terms offset vector, model matrix, and coefficients are those used by R functions `lm` and `glm`. The term design matrix is widely used although it doesn’t make much sense for data that do not come from a designed experiment (but language doesn’t have to make sense and often doesn’t). The terminology canonical affine submodel is from Geyer et al. (2007), but before that Section 8.2 of Barndorff-Nielsen (1978) used “affine hypothesis” for this concept.
For a linear model (fit by R function `lm`) \(\theta\) is the mean vector \(\theta = E_\theta(y)\).
For a generalized linear model (fit by R function `glm`) \(\theta\) is the so-called linear predictor, which is usually not called a parameter even though it is one.
The transformation (9.1) is usually called “linear”, but Geyer et al. (2007) decided to call it “affine”.
The issue is that there are two meanings of linear:
in calculus and all mathematics below that level (including before college) a linear function is a function whose graph is a straight line, but
in linear algebra and all mathematics above that level (including real analysis and functional analysis, which are just advanced calculus with another name) a linear function is a function that preserves vector addition and scalar multiplication, in particular, if \(f\) is a linear function, then \(f(0) = 0\).
In linear algebra and all mathematics above that level, if we need to refer to the other notion of linear function we call it an affine function. An affine function is a linear function plus a constant function, where “linear” here means the notion from linear algebra and above.
All of this extends to arbitrary transformations between vector spaces. An affine function from one vector space to another is a linear function plus a constant function. Going the other way, a linear function is an affine function \(f\) that satisfies \(f(0) = 0\).
So (9.1) is an affine change of parameter in the language of linear algebra and above and a linear change of parameter in the language of calculus and below. It is also a linear change of parameter in the language of linear algebra and above in the special case \(a = 0\) (no offset). The fact that (9.1) is almost always used when \(a = 0\) (offsets are very rarely used) may contribute to the tendency to call this parameter transformation “linear”.
It is not just in applications that offsets rarely appear. Even theoreticians who pride themselves on their knowledge of advanced mathematics usually ignore offsets. The familiar formula \(\hat{\beta} = (M^T M)^{- 1} M^T y\) for the least squares estimator is missing an offset \(a\). Compare with equation (10.5) below.
Another reason for confusion between the two notions of “linear” is that for simple linear regression (R command `lm(y ~ x)`), the regression function is affine (linear in the lower-level notion). But this is applying “linear” in the wrong context.
It’s called “linear regression” because it’s linear in the regression coefficients, not because it’s linear in \(x\).
— Werner Stutzle
If we change the model to quadratic regression (R command `lm(y ~ x + I(x^2))`), then the regression function is quadratic (nonlinear) but the model is still a linear model fit by R function `lm`.
Another way of saying this is some people think of simple linear regression as being linear in the lower-level sense because it has an intercept, but, as sophisticated statisticians, we know that having an intercept does not put an \(a\) in equation (9.1), it adds a column to \(M\) (all of whose components are equal to one) because the “intercept” is a parameter to estimate (a component of \(\beta\)), not a known constant.
Statisticians generally ignore this confusion in terminology. Most clients of statistics, including most scientists, do not take linear algebra and math classes beyond that, so we statisticians use “linear” in the lower-level sense when talking to clients. I myself use “linear” in the lower-level sense in Stat 5101 and 5102, a master’s level service course in theoretical probability and statistics. Except we are inconsistent. When we say “linear model” we usually mean (9.1) with \(a = 0\) so \((M^T M)^{- 1} M^T y\) makes sense, and that is the higher-level sense of “linear”.
Hence Geyer et al. (2007) decided to introduce the term canonical affine submodel for what was already familiar but either had no name or was named with confusing terminology.
In the list following (9.1), “known” means nonrandom. In regression analysis we allow \(M\) to depend on covariate data, and saying \(M\) is nonrandom means we are treating covariate data as fixed. If the covariate data happen to be random, we say we are doing the analysis conditional on the observed values of the covariate data (which is the same as treating these data as fixed and nonrandom). In other words, the statistical model is for the conditional distribution of the response \(y\) given the covariate data, and the (marginal) distribution of the covariate data is not modeled. Thus to be fussily pedantic we should write \[ E_\theta(y \mid \text{the part of the covariate data that is random, if any}) \] everywhere instead of \(E_\theta(y)\), and similarly for \(\mathop{\rm var}\nolimits_\theta(y)\) and so forth. But we are not going to do that, and almost no one does that. We can also allow \(a\) to depend on covariate data (but almost no one does that).
Now we come to the point of this section. The log likelihood for the canonical affine submodel is \[\begin{align*} l(\beta) & = \langle y, a + M \beta \rangle - c(a + M \beta) \\ & = \langle y, a \rangle + \langle y, M \beta \rangle - c(a + M \beta) \end{align*}\] and we may drop the term \(\langle y, a \rangle\) from the log likelihood because it does not contain the parameter \(\beta\) giving \[ l(\beta) = \langle y, M \beta \rangle - c(a + M \beta) \] and because \[ \langle y, M \beta \rangle = y^T M \beta = (M^T y)^T \beta = \langle M^T y, \beta \rangle \] we finally get log likelihood for the canonical affine submodel put into exponential family form \[\begin{equation} l(\beta) = \langle M^T y, \beta \rangle - c_\text{sub}(\beta) \tag{9.2} \end{equation}\] where \[ c_\text{sub}(\beta) = c(a + M \beta) \] From this it follows that the change of parameter (9.1) converts one exponential family into another exponential family with
canonical statistic vector \(M^T y\),
canonical parameter vector \(\beta\),
cumulant function \(c_\text{sub}\), and
mean value parameter \(\tau = M^T \mu\), where \(\mu\) is the mean value parameter of the original model.
If \(\Theta\) is the canonical parameter space of the original model, then \[ B = \{\, \beta : a + M \beta \in \Theta \,\} \] is the canonical parameter space of the canonical affine submodel. If the original model is full, then so is the canonical affine submodel. If the original model is regular full, then so is the canonical affine submodel.
There are many points to this section. It is written the way it is because of aster models (Geyer et al., 2007), but it applies to linear models, generalized linear models, and log-linear models for categorical data too, hence to the majority of applied statistics. But the point of this section that gets it put in its supersection is that the dimension reduction from \(y\) to \(M^T y\) is a sufficient dimension reduction. It loses no information about \(\beta\) assuming this submodel is correct.
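Here is a minimal numerical sketch of the claim that the submodel mean value parameter is \(\tau = M^T \mu\), assuming a Poisson original family so that \(c(\theta) = \sum_i e^{\theta_i}\) and \(\mu\) has components \(e^{a_i + (M \beta)_i}\). The model matrix, offset vector, and coefficients vector here are made up for illustration.

```r
# Numerical check that the gradient of the submodel cumulant function
# c_sub(beta) = c(a + M beta) is the submodel mean value parameter
# tau = M^T mu, assuming a Poisson original family.  The model matrix M,
# offset a, and coefficients beta are made up for illustration.
nobs <- 6
M <- cbind(1, seq(0, 1, length.out = nobs))      # intercept and one covariate
a <- rep(0, nobs)                                # no offset
beta <- c(0.2, -0.5)
csub <- function(beta) sum(exp(a + M %*% beta))  # submodel cumulant function
mu <- exp(a + M %*% beta)                        # mean value parameter of original model
eps <- 1e-6
grad <- sapply(1:2, function(j) {
    e <- replace(numeric(2), j, eps)
    (csub(beta + e) - csub(beta - e)) / (2 * eps)
})
all.equal(grad, as.vector(t(M) %*% mu))          # tau = M^T mu
```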
The offset vector and model matrix that determine a canonical affine submodel are not unique.
Any model matrix that has the same column space, the vector subspace spanned by the columns of \(M\), specifies the same statistical model. This is because that subspace is, by definition, \[ V = \{\, M \beta : \beta \in \mathbb{R}^p \,\} \] where \(p\) is the dimension of \(\beta\). So different model matrices with the same column space specify the same set of \(\theta\) via (9.1).
Similarly, offset vectors whose difference lies in this subspace \(V\) do not specify different statistical models, again because they specify the same set of \(\theta\) via (9.1).
For examples of different model matrices specifying the same model, see Section 14 below.
Nowadays, exponential families are so important in so many parts of theoretical statistics that their origin has been forgotten. They were invented to be the class of statistical models described by the Pitman-Koopman-Darmois theorem, which says (roughly) that the only statistical models having the sufficient dimension reduction property in IID sampling described in Section 9.1 above are exponential families. In effect, it turns that section from if to if and only if.
But the reason for the “roughly” is that as just stated with no conditions, the theorem is false. Here is a counterexample. For an IID sample from the \(\text{Uniform}(0, \theta)\) distribution, the maximum likelihood estimator (MLE) is \(\hat{\theta}_n = \max\{y_1, \ldots, y_n\}\) and this is easily seen to be a sufficient statistic (the Neyman-Fisher factorization theorem again) but \(\text{Uniform}(0, \theta)\) is not an exponential family.
So there have to be side conditions to make the theorem true. Pitman, Koopman, and Darmois independently (not as co-authors) in 1935 and 1936 published theorems that said under two side conditions,
the distribution of the canonical statistic is continuous and
the support of the distribution of the canonical statistic does not depend on the parameter,
then any statistical model with the sufficient dimension reduction property in IID sampling is an exponential family. (Obviously, \(\text{Uniform}(0, \theta)\) violates the second side condition).
Later, other authors published theorems with more side conditions that covered the discrete case. But the side conditions for those are really messy so those theorems are not so interesting.
Nowadays exponential families are so important for so many reasons (not all of them mentioned in this document) that no one any longer cares about these theorems. We only want the if part of the if and only if, which is covered in Section 9.1 above.
The usual procedure for maximizing the log likelihood is to differentiate it, set it equal to zero, and solve for the parameter. The derivative of (3.1) is \[\begin{equation} \nabla l(\theta) = y - \nabla c(\theta) = y - E_\theta(Y) \tag{10.1} \end{equation}\] (Here and throughout this section \(Y\) is a random vector having the distribution of the canonical statistic and \(y\) is the observed value of the canonical statistic.) So the MLE \(\hat{\theta}\) is characterized by \[\begin{equation} y = E_{\hat{\theta}}(Y) \tag{10.2} \end{equation}\]
We can say a lot more than this. Cumulant functions are convex functions (Barndorff-Nielsen, 1978, Theorem 7.1). Hence log likelihoods of regular full exponential families are concave functions that are differentiable everywhere in the canonical parameter space. Hence (10.1) being equal to zero is a necessary and sufficient condition for \(\theta\) to maximize (3.1) (Rockafellar and Wets, 1998, part (b) of Theorem 2.14). Hence (10.2) is a necessary and sufficient condition for \(\hat{\theta}\) to be an MLE.
The MLE need not exist and need not be unique.
The MLE does not exist if the observed value of the canonical statistic is on the boundary of its support in the following sense, there exists a vector \(\delta \neq 0\) such that \(\langle Y - y, \delta \rangle \le 0\) holds almost surely and \(\langle Y - y, \delta \rangle = 0\) does not hold almost surely (Geyer, 2009, Theorems 1, 3, and 4). When the MLE does not exist in the classical sense, it may exist in an extended sense as a limit of distributions in the original family, but that is a story for another time (see Geyer (2009) and Geyer (2024b) for more on this subject).
The MLE is not unique if there exists a \(\delta \neq 0\) such that \(\langle Y - y, \delta \rangle = 0\) holds almost surely (Geyer, 2009). This is the same phenomenon described in Section 6.1 above. In theory, nonuniqueness of the MLE for a full exponential family is not a problem because every MLE corresponds to the same probability distribution (Geyer, 2009, Theorem 1 and Corollary 2), so the MLE canonical parameter vector is not unique (if any MLE exists), but the MLE probability distribution is unique (if it exists) and the MLE mean value parameter is unique (if it exists).
It is always possible to arrange for uniqueness of the MLE. Simply arrange that the distribution of the canonical statistic have full dimension (not be concentrated on a hyperplane).
But one does not want to do this too early in the data analysis process.
In the lecture notes about Bayesian inference and Markov chain Monte Carlo (Geyer, 2024a) we used a nonidentifiable parameterization for the example (volleyball standings) for the frequentist analysis, and this caused no problems because the computer (R function glm) knew how to deal with it. This same nonidentifiable parameterization, when used in the Bayesian analysis, allowed us to easily specify a prior distribution that did not favor any team.
This sort of data provides simple examples of when the MLE does not exist in the classical sense. In the example cited above one team did not win any matches, and the MLE for that team consequently does not exist. We set its MLE to be minus infinity, but that is a mathematical fiction. Minus infinity is not an allowed parameter value for an exponential family (not a real number).
But all of this about nonexistence and nonuniqueness of the MLE is not the main point of this section. The main point is that (10.2) characterizes the MLE when it exists and whether or not it is unique.
More precisely, we should say the observed value of the canonical statistic vector equals its expected value under the MLE distribution (which is unique if it exists). It is not the observed value of anything whatsoever that equals its MLE expected value.
The observed-equals-expected property is one of the keys to interpreting MLE’s for exponential families. Strangely, this is considered very important in some areas of statistics and not mentioned at all in other areas.
In categorical data analysis, it is considered key. The MLE for a hierarchical log-linear model for categorical data satisfies observed equals expected: the marginals of the table corresponding to the terms in the model are equal to their MLE expected values, and those marginals are the canonical sufficient statistics. So this gives a complete characterization of maximum likelihood for these models, and hence a complete understanding in a sense. (See also Section 12 below.)
In regression analysis, it is ignored. The most widely used linear and generalized linear models are exponential families: linear models, logistic regression, and Poisson regression with log link. Thus maximum likelihood satisfies the observed-equals-expected property.
More precisely, we should say the observed value of the submodel canonical statistic vector equals its expected value under the MLE distribution (which is unique if it exists). It is not the observed value of anything whatsoever that equals its MLE expected value.
So this tells us that the submodel canonical statistic vector \(M^T y\) is crucial to understanding linear and generalized linear models (and aster models, Geyer et al. (2007)) just like it is for hierarchical log-linear models for categorical data. But do regression books even mention this? Not that your humble author knows of (except for indirectly using it in the proof of Theorem 17.1 below, which some regression textbooks do mention).
Let's check this for linear models with offset, where the original model mean value parameter is \[ \mu = E(y) = a + M \beta \] so “observed equals expected” says \[\begin{equation} M^T y = M^T (a + M \beta) \tag{10.3} \end{equation}\] or \[\begin{equation} M^T (y - a) = M^T M \beta \tag{10.4} \end{equation}\] so the MLE for the submodel canonical parameter \(\beta\) is given by \[\begin{equation} \hat{\beta} = (M^T M)^{- 1} M^T (y - a) \tag{10.5} \end{equation}\] provided \(M^T M\) is an invertible matrix (which is the same as not having collinearity). We claim (10.3) is the key to understanding linear models. But regression textbooks never mention it. So who is right, the authors of regression textbooks or the authors of categorical data analysis textbooks? Our answer is the latter. \(M^T y\) is important.
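As a quick numerical check of (10.5), here is a minimal sketch with made-up data and a made-up offset vector; it compares the explicit formula to the coefficients computed by R function lm.
# made-up data for a linear model with offset
set.seed(42)
n <- 20
x <- seq(1, 10, length = n)
a <- rep(3, n)                       # known offset vector
y <- a + 2 + 0.5 * x + rnorm(n)
M <- cbind(1, x)                     # model matrix with intercept
beta.hat.formula <- solve(t(M) %*% M, t(M) %*% (y - a))
beta.hat.lm <- coef(lm(y ~ x, offset = a))
# the two agree (up to the inaccuracy of computer arithmetic)
all.equal(as.vector(beta.hat.formula), as.vector(beta.hat.lm))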
Another way people sometimes say this is that every MLE in a regular full exponential family is a method of moments estimator, but not just any old method of moments estimator. It is the method of moments estimator that sets the expectation of the canonical statistic vector equal to its observed value and solves for the parameter. For example, for linear models, the method of moments estimator we want solves (10.4) for \(\beta\). And being precise, we need the method of moments estimator that sets the canonical statistic vector for the model being analyzed equal to its expected value. For a canonical affine submodel, that is the submodel canonical statistic vector \(M^T y\).
But there is nothing special here about linear models except that they have a symbolic expression for \(\hat{\beta}\). In general, we can only determine \(\hat{\beta}\) as a function of \(M^T y\) by numerically maximizing the likelihood using a computer optimization function. But we always have “observed equals expected” up to the inaccuracy of computer arithmetic.
And usually “observed equals expected” is the only simple equality we know about maximum likelihood in regular full exponential families.
When the maximum likelihood estimate exists and is unique, Theorem 6.3 above says that any algorithm that always goes uphill on the log likelihood when it can will converge to the MLE. This includes many algorithms, among them most of the optimization algorithms in core R.
We do not need to worry about multiple local maxima, so we never need to use the default method of R function optim or method = "SANN" of that function. The other methods of R function optim always go uphill if they can. So do R functions nlm and nlminb, except that R function nlm has no option for doing maximization rather than minimization, so with it one must supply the negative log likelihood as the function to minimize.
The point of our Theorem 6.3 is that one does not need a good starting point for the optimization (as one does for maximum likelihood estimation in models that are not regular full exponential families). For regular full exponential families, any starting point works.
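Here is a minimal sketch (with made-up data) illustrating this for logistic regression: R function optim, started from a poor starting point and supplied with the log likelihood and its gradient (the submodel score, \(M^T\) times observed minus expected), finds the same MLE as R function glm.
# made-up data for a small logistic regression
set.seed(42)
x <- rnorm(50)
y <- rbinom(50, 1, plogis(1 - 2 * x))
M <- cbind(1, x)
# log likelihood and score for the submodel canonical parameter beta
logl <- function(beta) {
    theta <- as.vector(M %*% beta)
    sum(y * theta) - sum(log1p(exp(theta)))
}
score <- function(beta) {
    theta <- as.vector(M %*% beta)
    as.vector(t(M) %*% (y - plogis(theta)))   # observed minus expected
}
# start far from the solution; any starting point works
oout <- optim(c(10, -10), logl, gr = score, method = "BFGS",
    control = list(fnscale = -1, reltol = 1e-12))
gout <- glm(y ~ x, family = binomial)
# the two agree up to the convergence tolerances of the two algorithms
rbind(optim = oout$par, glm = as.vector(coef(gout)))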
Many scientists in the early part of the nineteenth century invented the science of thermodynamics, in which some of the key concepts are energy and entropy. Entropy was initially defined physically as \[ d S = \frac{d Q}{T} \] where \(S\) is entropy and \(d S\) its differential, \(Q\) is heat and \(d Q\) its differential, and \(T\) is temperature, so to calculate entropy in most situations you have to do an integral (the details here don’t matter — the point is that entropy defined this way is a physical quantity measured in physical ways).
The first law of thermodynamics says energy is conserved in any closed physical system. Energy can change form from motion to heat to chemical energy and to other forms. But the total is conserved.
The second law of thermodynamics says entropy is nondecreasing in any closed physical system. But there are many other equivalent formulations. One is that heat always flows spontaneously from hot to cold, never the reverse. Another is that there is a maximum efficiency of a heat engine or a refrigerator (a heat engine operated in reverse) that depends only on the temperature difference that powers it (or that the refrigerator produces).
So, somewhat facetiously, the first law says you can’t win, and the second law says you can’t break even.
Near the end of the nineteenth century and the beginning of the twentieth century, thermodynamics was extended to chemistry. And it was found that chemistry too obeys the laws of thermodynamics. Every chemical reaction in your body is all the time obeying the laws of thermodynamics. No animal can convert all of the energy of food to useful work. There must be waste heat, and this is a consequence of the second law of thermodynamics.
Also near the end of the nineteenth century Ludwig Boltzmann discovered the relationship between entropy and probability. He was so pleased with this discovery that he had \[ S = k \log W \] engraved on his tombstone. Here \(S\) is again entropy, \(k\) is a physical constant now known as Boltzmann’s constant, and \(W\) is probability (Wahrscheinlichkeit in German). Of course, this is not probability in general, but probability in certain physical systems.
Along with this came the interpretation that entropy does not always increase. Physical systems necessarily spend more time in more probable states and less time in less probable states. Increase of entropy is just the inevitable move from less probable to more probable on average. At the microscopic level entropy fluctuates as the system moves through each state according to its probability.
In mid twentieth century Claude Shannon recognized the relation between entropy and information. The same formulas that define entropy statistically define information as negative entropy (so minus a constant times log probability). He used this to bound how much signal could be put through a noisy communications channel.
A little later Kullback and Leibler imported Shannon’s idea into statistics, defining what we now call Kullback-Leibler information.
What does maximum likelihood try to do theoretically? It tries to maximize the expectation of the log likelihood function, which, up to an additive constant, is minus the Kullback-Leibler information function, and that maximum is attained at the true unknown parameter value if the model is identifiable (Wald, 1949). There is also a connection between Kullback-Leibler information and Fisher information.
A little later Edwin Jaynes recognized the connection between entropy or negative Kullback-Leibler information and exponential families. Exponential families maximize entropy subject to constraints. Fix a probability distribution \(Q\) and a random vector \(Y\) on the probability space of that distribution. Then for each vector \(\mu\) find the probability distribution \(P\) that maximizes entropy (minimizes Kullback-Leibler information) with respect to \(Q\) subject to \(E_P(Y) = \mu\). If the maximum exists, call it \(P_\mu\). Then the collection of all such \(P_\mu\) is a full exponential family having canonical statistic \(Y\) and mean value parameter vector \(\mu\) (for a proof see Geyer (2018) Deck 2, Slides 176 ff.).
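Here is a minimal numerical sketch of this characterization in the simplest possible case (everything here is made up for illustration): \(Q\) is uniform on \(\{0, 1, \ldots, 5\}\), \(Y\) is the identity, and the constraint fixes the mean. The entropy-maximizing distribution is an exponential tilt of \(Q\), that is, a member of the exponential family with canonical statistic \(Y\), and perturbing it in any direction that preserves the mean increases the Kullback-Leibler information.
# base distribution Q: uniform on {0, 1, ..., 5}; canonical statistic Y = identity
yvals <- 0:5
q <- rep(1 / 6, 6)
mu <- 3.7                                   # the constraint E_P(Y) = mu
# exponential tilt of q with canonical parameter theta
tilt <- function(theta) {
    p <- q * exp(theta * yvals)
    p / sum(p)
}
# solve for the theta whose tilt has mean mu
theta.hat <- uniroot(function(theta) sum(yvals * tilt(theta)) - mu,
    interval = c(-10, 10))$root
p.hat <- tilt(theta.hat)
sum(yvals * p.hat)                          # equals mu
# Kullback-Leibler information of P with respect to Q
kl <- function(p) sum(p * log(p / q))
# perturb p.hat in a direction that preserves total probability and the mean
d <- c(1, -2, 1, 0, 0, 0)
sapply(c(-0.01, 0, 0.01), function(eps) kl(p.hat + eps * d))
# the middle value (no perturbation) is the smallest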
Jaynes is not popular among statisticians because his maximum entropy idea became linked with so-called maxent modeling which statisticians for the most part have ignored.
But in the context of exponential families, maximum entropy is powerful. It says you start with the canonical statistic. If you start with a canonical statistic that is an affine function of the original canonical statistic of an exponential family, then the canonical affine submodel maximizes entropy subject to the distributions in the canonical affine submodel having the mean value parameters they do. Every other aspect of those distributions is just randomness in the sense of maximum entropy or minimum Kullback-Leibler information. Thus the (submodel) mean value parameter tells you everything interesting about a canonical affine submodel.
When connected with observed equals expected (Section 10 above), this is a very powerful principle. Observed equals expected says maximum likelihood estimation matches exactly the submodel canonical statistic vector to its observed value. Maximum entropy says nothing else matters, everything important, all the information about the parameter is in the MLE. All else is randomness (in the sense of maximum entropy).
The sufficiency principle (Section 8.2 above) also says nothing else matters, all the information about the parameter is in the MLE, but the maximum entropy principle says more: the distributions of the maximum entropy submodel are completely determined by the maximum entropy principle.
Admittedly the one time I have made this argument in print (Geyer and Thompson, 1992) it was not warmly received. But it was a minor point of that article. Perhaps this section makes a better case.
This is the reason why the model for the MCMC and volleyball example is what it is. The official Big Ten conference standings only pay attention to team wins. Thus we use them as canonical statistics. We do add one statistic the conference does not use, the total number of home wins, because we want home field advantage in the model: it obviously exists (it is highly statistically significant every year in every sport), and leaving it out would inflate the variance of the estimates.
It is amazing (to me) that this procedure fully and correctly adjusts sports standings for strength of schedule. It is strange (to me) that only one sport, college ice hockey, uses this procedure, which in that context they call KRACH, and they do not use it alone, but as just one factor in a mess of procedures that have no statistical justification.
A link function, which goes componentwise from mean value parameters to canonical parameters for a generalized linear model that is an exponential family (linear models, logistic regression, Poisson regression with log link), is univariate monotone.
This does not generalize to exponential families with dependence among components of the response vector like aster models (Geyer et al., 2007), Markov spatial point processes (Geyer and Møller (1994) and Geyer (2024c)), Markov spatial lattice processes (Geyer, 2020) or even to log-linear models for contingency tables when multinomial or product multinomial sampling is assumed.
Instead we have multivariate monotonicity. This is not a concept statisticians are familiar with. It does not appear in real analysis, functional analysis, or probability theory. It comes from convex analysis. Rockafellar and Wets (1998) have a whole chapter on the subject (their Chapter 12). There are many equivalent characterizations. We will only discuss two of them.
A function \(f\) from one vector space to another is multivariate monotone if \[ \langle f(x) - f(y), x - y \rangle \ge 0, \qquad \text{for all $x$ and $y$} \] and strictly multivariate monotone if \[ \langle f(x) - f(y), x - y \rangle > 0, \qquad \text{for all $x$ and $y$ such that $x \neq y$} \] (Rockafellar and Wets (1998), Definition 12.1).
The reason this is important to us is that the gradient mapping of a convex function is multivariate monotone (Rockafellar and Wets, 1998, Theorem 12.17; indeed a differentiable function is convex if and only if its gradient mapping is multivariate monotone). We have differentiable convex functions in play: cumulant functions (Barndorff-Nielsen, 1978, Theorem 7.1). Also cumulant functions of regular full exponential families are differentiable everywhere in their canonical parameter spaces (3.3).
So define \(h\) by (5.1), the function that maps canonical parameter vector values to mean value parameter vector values. Then \(h\) is multivariate monotone. Hence if \(\theta_1\) and \(\theta_2\) are canonical parameter values and \(\mu_1\) and \(\mu_2\) are the corresponding mean value parameter values \[ \langle \mu_1 - \mu_2, \theta_1 - \theta_2 \rangle \ge 0 \] Moreover if the canonical parameterization is identifiable, \(h\) is strictly multivariate monotone \[ \langle \mu_1 - \mu_2, \theta_1 - \theta_2 \rangle > 0, \qquad \theta_1 \neq \theta_2 \] (Barndorff-Nielsen (1978), Theorem 7.1; Geyer (2009), Theorem 1; Rockafellar and Wets (1998), Theorem 12.17).
We can see from the way the canonical and mean value parameters enter symmetrically, that when the canonical parameterization is identifiable so \(h\) is invertible (Geyer, 2013a, Lemma 9) the inverse \(h^{- 1}\) is also strictly multivariate monotone \[ \langle \mu_1 - \mu_2, \theta_1 - \theta_2 \rangle > 0, \qquad \mu_1 \neq \mu_2 \]
One final characterization: a differentiable function is strictly multivariate monotone if and only if the restriction to every line segment in the domain is strictly univariate monotone (obvious from the way the definitions above only deal with two points in the domain at a time).
Thus we have a “dumbed down” version of strict multivariate monotonicity: increasing one component of the canonical parameter vector increases the corresponding component of the mean value parameter vector, if the canonical parameterization is identifiable. The other components of \(\mu\) also change but can go any which way.
When specialized to canonical affine submodels (Section 9.3 above) strict multivariate monotonicity becomes \[ \langle \tau_1 - \tau_2, \beta_1 - \beta_2 \rangle > 0, \qquad \beta_1 \neq \beta_2 \] where \(\tau_1\) and \(\tau_2\) are the submodel mean value parameters corresponding to the submodel canonical parameters \(\beta_1\) and \(\beta_2\). When “dumbed down” this becomes: increasing one component of the submodel canonical parameter vector \(\beta\) increases the corresponding component of the submodel mean value parameter vector \(\tau = M^T \mu\), if the submodel canonical parameterization is identifiable. The other components of \(\tau\) and components of \(\mu\) also change but can go any which way.
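Here is a minimal numerical sketch (model matrix and parameter values made up) checking this inequality for Bernoulli regression with logit link: two submodel canonical parameter vectors and the corresponding submodel mean value parameter vectors \(\tau = M^T \mu\).
# made-up model matrix for a small Bernoulli (logistic) regression
set.seed(42)
M <- cbind(1, rnorm(30), rnorm(30))
beta1 <- c(0.5, 1.0, -1.5)
beta2 <- c(-0.3, 2.0, 0.7)
# saturated model mean value parameters (Bernoulli success probabilities)
mu1 <- plogis(as.vector(M %*% beta1))
mu2 <- plogis(as.vector(M %*% beta2))
# submodel mean value parameters
tau1 <- as.vector(t(M) %*% mu1)
tau2 <- as.vector(t(M) %*% mu2)
# strict multivariate monotonicity: this inner product is strictly positive
sum((tau1 - tau2) * (beta1 - beta2))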
Again we see the key importance of the sufficient dimension reduction map \(y \mapsto M^T y\) and the corresponding original model to canonical affine submodel mean value parameter mapping \(\mu \mapsto M^T \mu\), that is, the importance of thinking of \(M^T\) as (the matrix representing) a linear transformation.
These “dumbed down” characterizations say that strict multivariate monotonicity implies strict univariate monotonicity of the restrictions of the function \(h\) to line segments in the domain parallel to the coordinate axes (so only one component of the vector changes).
Compare this with our last (not dumbed down) characterization: strict multivariate monotonicity holds if and only if all restrictions of the function \(h\) to line segments in the domain are strictly univariate monotone (not just line segments parallel to the coordinate axes, all line segments).
So the “dumbed down” version only varies one component of the canonical parameter at a time, whereas the non-dumbed-down version varies all components. The “dumbed down” version can be useful when talking to people who have never heard of multivariate monotonicity. But sometimes the non-dumbed-down concept is needed (Shaw and Geyer, 2010, Appendix A). There is no substitute for understanding this concept. It should be in the toolbox of every statistician.
The title of this section comes from my Stat 5102 lecture notes (Geyer, 2016), deck 5, slide 19. It is stated the way it is for shock value. All of the students in that class have previously taken courses where they were told how to interpret regression coefficients. So this phrase is intended to shock them into thinking they have been mistaught!
Although shocking, it refers to something everyone knows. Even in the context of linear models (which those 5102 notes are) the same model can be specified by different formulas or different model matrices.
For example
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-1.txt",
header = TRUE)
lout1 <- lm(y ~ poly(x, 2), data = foo)
summary(lout1)
##
## Call:
## lm(formula = y ~ poly(x, 2), data = foo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2677 -2.7246 0.4333 3.6335 9.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.762 0.846 24.54 < 2e-16 ***
## poly(x, 2)1 74.620 5.351 13.95 2.62e-16 ***
## poly(x, 2)2 -7.065 5.351 -1.32 0.195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.351 on 37 degrees of freedom
## Multiple R-squared: 0.8414, Adjusted R-squared: 0.8328
## F-statistic: 98.11 on 2 and 37 DF, p-value: 1.613e-15
and
lout2 <- lm(y ~ poly(x, 2, raw = TRUE), data = foo)
summary(lout2)
##
## Call:
## lm(formula = y ~ poly(x, 2, raw = TRUE), data = foo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2677 -2.7246 0.4333 3.6335 9.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.883215 2.670473 -1.080 0.287
## poly(x, 2, raw = TRUE)1 1.406719 0.300391 4.683 3.74e-05 ***
## poly(x, 2, raw = TRUE)2 -0.009381 0.007105 -1.320 0.195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.351 on 37 degrees of freedom
## Multiple R-squared: 0.8414, Adjusted R-squared: 0.8328
## F-statistic: 98.11 on 2 and 37 DF, p-value: 1.613e-15
have different fitted regression coefficients. But they fit the same model
all.equal(fitted(lout1), fitted(lout2))
## [1] TRUE
For another example, when there are categorical predictors, we must “drop” one category from each predictor to get an identifiable model, and which one we drop is arbitrary. Thus
bar <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-4.txt",
header = TRUE, stringsAsFactors = TRUE)
levels(bar$color)
## [1] "blue" "green" "red"
lout1 <- lm(y ~ x + color, data = bar)
summary(lout1)
##
## Call:
## lm(formula = y ~ x + color, data = bar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2398 -2.9939 0.1725 3.5555 11.9747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.16989 1.01710 12.948 < 2e-16 ***
## x 1.00344 0.02848 35.227 < 2e-16 ***
## colorgreen 2.12586 1.00688 2.111 0.0364 *
## colorred 6.60586 1.00688 6.561 8.7e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.034 on 146 degrees of freedom
## Multiple R-squared: 0.898, Adjusted R-squared: 0.8959
## F-statistic: 428.6 on 3 and 146 DF, p-value: < 2.2e-16
and
bar <- transform(bar, color = relevel(color, ref = "red"))
lout2 <- lm(y ~ x + color, data = bar)
summary(lout2)
##
## Call:
## lm(formula = y ~ x + color, data = bar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2398 -2.9939 0.1725 3.5555 11.9747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.77575 1.01710 19.443 < 2e-16 ***
## x 1.00344 0.02848 35.227 < 2e-16 ***
## colorblue -6.60586 1.00688 -6.561 8.7e-10 ***
## colorgreen -4.48000 1.00688 -4.449 1.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.034 on 146 degrees of freedom
## Multiple R-squared: 0.898, Adjusted R-squared: 0.8959
## F-statistic: 428.6 on 3 and 146 DF, p-value: < 2.2e-16
have different fitted regression coefficients (even the names of the regression coefficients are different). But they fit the same model
all.equal(fitted(lout1), fitted(lout2))
## [1] TRUE
Even in the presence of collinearity, where some coefficients must be dropped to obtain identifiability (and which one(s) are dropped is arbitrary), the mean values are unique; hence the fitted model is unique.
baz <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-3.txt",
header = TRUE, stringsAsFactors = TRUE)
x3 <- with(baz, x1 + x2)
lout1 <- lm(y ~ x1 + x2 + x3, data = baz)
summary(lout1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = baz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89036 -0.67352 -0.05958 0.69110 2.06976
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4342 0.9094 0.477 0.635232
## x1 1.4179 0.1719 8.247 1.09e-10 ***
## x2 0.6743 0.1688 3.993 0.000227 ***
## x3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9722 on 47 degrees of freedom
## Multiple R-squared: 0.7388, Adjusted R-squared: 0.7277
## F-statistic: 66.48 on 2 and 47 DF, p-value: 1.987e-14
and
lout2 <- lm(y ~ x3 + x2 + x1, data = baz)
summary(lout2)
##
## Call:
## lm(formula = y ~ x3 + x2 + x1, data = baz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89036 -0.67352 -0.05958 0.69110 2.06976
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4342 0.9094 0.477 0.6352
## x3 1.4179 0.1719 8.247 1.09e-10 ***
## x2 -0.7436 0.2859 -2.601 0.0124 *
## x1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9722 on 47 degrees of freedom
## Multiple R-squared: 0.7388, Adjusted R-squared: 0.7277
## F-statistic: 66.48 on 2 and 47 DF, p-value: 1.987e-14
have different fitted regression coefficients. But they fit the same model
all.equal(fitted(lout1), fitted(lout2))
## [1] TRUE
After several iterations, this shocking advice became the following (Geyer (2018), deck 2, slide 41)
A quote from my master’s level theory notes.
Parameters are meaningless quantities. Only probabilities and expectations are meaningful.
Of course, some parameters are probabilities and expectations, but most exponential family canonical parameters are not.
A quote from Alice in Wonderland
‘If there’s no meaning in it,’ said the King, ‘that saves a world of trouble, you know, as we needn’t try to find any.’
Realizing that canonical parameters are meaningless quantities “saves a world of trouble”. We “needn’t try to find any”.
Thinking sophisticatedly and theoretically, of course parameters are meaningless. A statistical model is a family \(\mathcal{P}\) of probability distributions. How this family is parameterized (indexed) is meaningless. If \[ \mathcal{P} = \{\, P_\theta : \theta \in \Theta \,\} = \{\, P_\beta : \beta \in B \,\} = \{\, P_\varphi : \varphi \in \Phi \,\} \] are three different parameterizations for the same model, then they are all the same model (duh!). The fact that the parameter estimates in one parameterization look nothing like the estimates in another parameterization tells us nothing.
But probabilities and expectations are meaningful. For \(P \in \mathcal{P}\), both \(P(A)\) and \(E_P\{g(Y)\}\) depend only on \(P\) not what parameter value is deemed to index it. And this does not depend on what \(P\) means, whether we specify \(P\) with a probability mass function, a probability density function, a distribution function, or a probability measure, the same holds: probabilities and expectations only depend on the distribution, not how it is described.
Even if we limit the discussion to regular full exponential families, any one-to-one affine function of a canonical parameter vector is another canonical parameter vector (copied from Section 3 above). That’s a lot of parameterizations, and which one you choose (or the computer chooses for you) is meaningless.
Hence we agree with the King of Hearts in Alice in Wonderland. It “saves a world of trouble” if we don’t try to interpret canonical parameters.
It doesn't help those wanting to interpret canonical parameters that sometimes the map from canonical to mean value parameters has no symbolic expression; this happens in spatial point and lattice processes (Geyer, 2020, 2024c), where the log likelihood and its derivatives can only be approximated by MCMC using the scheme in Geyer (1990), Geyer and Thompson (1992), and Geyer (1994). Or the map may have a symbolic expression, but one so fiendishly complicated that people have no clue what is going on, although the computer chugs through the calculation effortlessly; this happens with aster models (Geyer et al., 2007).
For canonical affine submodels, the mapping from mean value parameters to canonical parameters rarely has a symbolic expression. We only have Theorem 6.1 above, which says calculating this mapping is of the same difficulty as doing maximum likelihood estimation.
For yet another reason canonical parameters are meaningless, see the section about baseline-category logit parameterization in the notes about completion of exponential families.
We take up the points made above in turn, stressing their impact on how users can interpret exponential family model fits.
This section just reiterates the points of Sections 5.2, 6.2, 9.1, and 9.3.4 above. The mean value parameter vectors \(\tau = M^T \mu\), which are the expectations of the canonical affine submodel canonical statistic vectors \(M^T y\), parameterize those submodels.
Formula (3.4) gives probabilities (or probability densities) in terms of canonical statistics and parameters. There is no formula that gives probabilities (or probability densities) in terms of canonical statistics and mean value parameters (all we have is Theorem 6.1 above). But we only use large-sample approximations (normal and chi-square) to approximate probabilities anyway (or, if we want to be more accurate, the parametric bootstrap). We only use the PMDF (3.4) to derive the theoretical properties of exponential families of distributions. We don’t use them to do calculations in applied statistics.
So mean value parameters are as good as canonical parameters for interpretation, and they are a lot more understandable.
Admittedly, different model matrices specifying the same model (Sections 9.3.5 and 14 above) specify different submodel canonical statistics \(M^T y\) and hence different submodel mean value parameters. Hence there is some arbitrariness in what we take these to be. But this arbitrariness is just part of the arbitrariness of canonical statistics (Section 3.4 above) and sufficient statistics (Section 8 above).
For more on dealing with this nonuniqueness and interpreting mean value parameters, see Section 17 below.
The simplest and most important property is the observed-equals-expected property (Section 10 above).
The MLE for the submodel mean value parameter vector \(\hat{\tau} = M^T \hat{\mu}\) is exactly equal to the submodel canonical statistic vector \(M^T y\). That’s what maximum likelihood in a regular full exponential family does.
So understanding \(M^T y\) is the most important thing in understanding the model. If \(M^T y\) is scientifically (business analytically, sports analytically, whatever) interpretable, then the model is interpretable. Otherwise, not!
The next most important property is sufficient dimension reduction (Section 9.3 above).
The submodel canonical statistic vector \(M^T y\) is sufficient. It contains all the information about the parameters that there is in the data, assuming the model is correct.
Since \(M^T y\) determines the MLE for the coefficients vector \(\hat{\beta}\) (Section 9.3 above, assuming \(\beta\) is identifiable), and the MLE for every other parameter vector is a one-to-one function of \(\hat{\beta}\), the MLE’s for all parameter vectors (\(\hat{\beta}\), \(\hat{\theta}\), \(\hat{\mu}\), and \(\hat{\tau}\)) are sufficient statistic vectors. The MLE for each parameter vector contains all the information about parameters that there is in the data, assuming the model is correct.
And nothing else matters for interpretation.
Everything else about the model other than what the MLE’s say is as random as possible (maximum entropy, Section 12 above) and contains no information (sufficiency, just discussed).
In particular it “saves a world of trouble” if we realize “we needn’t try to find any” meaning in coefficients vector \(\hat{\beta}\) (Section 14.4 above).
But if we do have to say something about the coefficients vector \(\hat{\beta}\) we do have the multivariate-monotonicity property available (Section 13 above).
Most statistics courses that discuss the regression models teach students to woof about the model equation (9.1). In lower-level courses where students are not expected to understand matrices, students are taught to woof about the same thing in other terms \[ y_i = \beta_1 + \beta_2 x_i + \text{error} \] and the like. That is, they are taught to think about the model matrix as a linear operator \(\beta \mapsto M \beta\) or the same thing in other terms. And another way of saying this is that they are taught to focus on the rows of \(M\) (because the components of \(M \beta\) are dot products of rows of \(M\) with \(\beta\)).
The view taken here is that this woof is all meaningless because it is about meaningless parameters (\(\beta\) and \(\theta\)). The important linear operator to understand is the sufficient dimension reduction operator \(y \mapsto M^T y\) or, what is the same thing described in other language, the original model to submodel mean value transformation operator \(\mu \mapsto M^T \mu\). And another way of saying this is that we should focus on the columns of \(M\) (because the components of \(M^T \mu\) are dot products of columns of \(M\) with \(\mu\)).
It is not when you woof about \(M \beta\) that you understand and explain the model, it is when you woof about \(M^T y\) that you understand and explain the model.
A story: when I was a first year graduate student I answered a homework question about asymptotics with “because it’s an exponential family” but the teacher didn’t think that was quite enough explanation. It is enough, but textbooks and courses don’t emphasize this.
The “usual” asymptotics of maximum likelihood (consistency and asymptotic normality, with asymptotic variance equal to inverse Fisher information) hold for every regular full exponential family; no other regularity conditions are necessary (all the other conditions are implied by regular full exponential family). The “usual” asymptotics also hold for all curved exponential families by smoothness. For a proof using the usual IID sampling, \(n\)-goes-to-infinity story, see Geyer (2013a). In fact, these same “usual” asymptotics hold when there is complicated dependence, no IID in sight, and any \(n\)-goes-to-infinity story that can be concocted either makes no sense or yields an intractable problem. For that see Sections 1.4 and 1.5 of Geyer (2013b).
And these asymptotics justify all the hypothesis tests and confidence intervals based on them, for example, those for generalized linear models that are exponential families, for log-linear models for categorical data, and for aster models.
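For instance, here is a minimal sketch with simulated (made-up) data: the Wald confidence intervals R computes for a Poisson regression are exactly estimate plus or minus a normal quantile times a standard error taken from inverse Fisher information.
# made-up data for a Poisson regression
set.seed(42)
x <- rnorm(100)
y <- rpois(100, exp(1 + 0.5 * x))
gout <- glm(y ~ x, family = poisson)
vcov(gout)             # inverse Fisher information for the coefficients vector
confint.default(gout)  # Wald intervals based on the asymptotic normality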
By “contingency tables” we mean data in which all variables are categorical. For an example of this let us revisit the seat belt use data.
# clean up R global environment
rm(list = ls())
count <- c(7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
996, 759, 973, 757, 812, 380, 1084, 513)
injury <- gl(2, 8, 16, labels = c("No", "Yes"))
gender <- gl(2, 4, 16, labels = c("Female", "Male"))
location <- gl(2, 2, 16, labels = c("Urban", "Rural"))
seat.belt <- gl(2, 1, 16, labels = c("No", "Yes"))
library(glmbb)
out <- glmbb(count ~ seat.belt * injury * location * gender,
family = "poisson")
summary(out)
##
## Results of search for hierarchical models with lowest AIC.
## Search was for all models with AIC no larger than min(AIC) + 10
## These are shown below.
##
## criterion weight formula
## 182.8 0.24105 count ~ seat.belt*injury*location + seat.belt*location*gender + injury*location*gender
## 183.1 0.21546 count ~ injury*gender + seat.belt*injury*location + seat.belt*location*gender
## 184.0 0.13742 count ~ seat.belt*injury + seat.belt*location*gender + injury*location*gender
## 184.8 0.09055 count ~ seat.belt*injury*location + seat.belt*injury*gender + seat.belt*location*gender + injury*location*gender
## 184.9 0.08446 count ~ seat.belt*injury + injury*location + injury*gender + seat.belt*location*gender
## 185.0 0.08042 count ~ seat.belt*injury*location + seat.belt*injury*gender + seat.belt*location*gender
## 185.5 0.06462 count ~ seat.belt*injury*location*gender
## 185.8 0.05365 count ~ seat.belt*injury*gender + seat.belt*location*gender + injury*location*gender
## 186.8 0.03237 count ~ injury*location + seat.belt*injury*gender + seat.belt*location*gender
We will just look at the best model according to AIC having formula
f <- summary(out)$results$formula[1]
f
## [1] "count ~ seat.belt*injury*location + seat.belt*location*gender + injury*location*gender"
With four variables there are four possible three-way interactions, but we only have three of the four in this model.
Refit the model.
out.best <- glm(as.formula(f), family = poisson)
observed <- xtabs(count ~ seat.belt + injury + location + gender)
expected <- predict(out.best, type = "response")
expected <- xtabs(expected ~ seat.belt + injury + location + gender)
observed.minus.expected <- observed - expected
names(dimnames(observed))
## [1] "seat.belt" "injury" "location" "gender"
apply(observed.minus.expected, 1:3, sum)
## , , location = Urban
##
## injury
## seat.belt No Yes
## No 8.003553e-11 6.025402e-12
## Yes 3.819878e-11 1.875833e-12
##
## , , location = Rural
##
## injury
## seat.belt No Yes
## No 3.001333e-11 5.229595e-12
## Yes 9.094947e-13 5.229595e-12
apply(observed.minus.expected, c(1, 2, 4), sum)
## , , gender = Female
##
## injury
## seat.belt No Yes
## No 3.646043 -3.646043
## Yes -3.646043 3.646043
##
## , , gender = Male
##
## injury
## seat.belt No Yes
## No -3.646043 3.646043
## Yes 3.646043 -3.646043
apply(observed.minus.expected, c(1, 3, 4), sum)
## , , gender = Female
##
## location
## seat.belt Urban Rural
## No 6.309619e-11 1.728040e-11
## Yes 3.160494e-11 1.045919e-11
##
## , , gender = Male
##
## location
## seat.belt Urban Rural
## No 2.296474e-11 1.796252e-11
## Yes 8.469669e-12 -4.320100e-12
apply(observed.minus.expected, 2:4, sum)
## , , gender = Female
##
## location
## injury Urban Rural
## No 8.913048e-11 2.182787e-11
## Yes 5.570655e-12 5.911716e-12
##
## , , gender = Male
##
## location
## injury Urban Rural
## No 2.910383e-11 9.094947e-12
## Yes 2.330580e-12 4.547474e-12
Indeed we have observed equals expected (up to the accuracy of computer arithmetic) for three of the four three-way margins. Of course, we also have observed equals expected for lower order margins of these margins.
apply(observed.minus.expected, 1:2, sum)
## injury
## seat.belt No Yes
## No 1.100489e-10 1.125500e-11
## Yes 3.910827e-11 7.105427e-12
apply(observed.minus.expected, c(1, 3), sum)
## location
## seat.belt Urban Rural
## No 8.606094e-11 3.524292e-11
## Yes 4.007461e-11 6.139089e-12
apply(observed.minus.expected, c(1, 4), sum)
## gender
## seat.belt Female Male
## No 8.037659e-11 4.092726e-11
## Yes 4.206413e-11 4.149570e-12
apply(observed.minus.expected, 2:3, sum)
## location
## injury Urban Rural
## No 1.182343e-10 3.092282e-11
## Yes 7.901235e-12 1.045919e-11
apply(observed.minus.expected, c(2, 4), sum)
## gender
## injury Female Male
## No 1.109584e-10 3.819878e-11
## Yes 1.148237e-11 6.878054e-12
apply(observed.minus.expected, 3:4, sum)
## gender
## location Female Male
## Urban 9.470114e-11 3.143441e-11
## Rural 2.773959e-11 1.364242e-11
apply(observed.minus.expected, 1, sum)
## No Yes
## 1.213039e-10 4.621370e-11
apply(observed.minus.expected, 2, sum)
## No Yes
## 1.491571e-10 1.836042e-11
apply(observed.minus.expected, 3, sum)
## Urban Rural
## 1.261355e-10 4.138201e-11
apply(observed.minus.expected, 4, sum)
## Female Male
## 1.224407e-10 4.507683e-11
For an example of observed equals expected in a more general context, we revisit the time of day data.
# clean up R global environment
rm(list = ls())
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex6-4.txt",
header = TRUE)
count <- foo$count
w <- foo$hour / 24 * 2 * pi
out <- glm(count ~ I(sin(w)) + I(cos(w)) + I(sin(2 * w)) + I(cos(2 * w)),
family = poisson, x = TRUE)
summary(out)
##
## Call:
## glm(formula = count ~ I(sin(w)) + I(cos(w)) + I(sin(2 * w)) +
## I(cos(2 * w)), family = poisson, x = TRUE)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.65917 0.02494 66.516 < 2e-16 ***
## I(sin(w)) -0.13916 0.03128 -4.448 8.66e-06 ***
## I(cos(w)) -0.28510 0.03661 -7.787 6.86e-15 ***
## I(sin(2 * w)) -0.42974 0.03385 -12.696 < 2e-16 ***
## I(cos(2 * w)) -0.30846 0.03346 -9.219 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 704.27 on 335 degrees of freedom
## Residual deviance: 399.58 on 331 degrees of freedom
## AIC: 1535.7
##
## Number of Fisher Scoring iterations: 5
Check observed equals expected.
m <- out$x
e <- predict(out, type = "response")
t(m) %*% (count - e)
## [,1]
## (Intercept) -3.508305e-12
## I(sin(w)) 1.313778e-14
## I(cos(w)) -3.095302e-13
## I(sin(2 * w)) 8.115363e-13
## I(cos(2 * w)) 2.584599e-13
OK. It checks. But what does it mean?
Let \(z\) denote any regressor (column of the model matrix). Let \(y\) denote the response vector (“observed”) and \(\hat{\mu}\) its MLE prediction (“expected”).
For the (Intercept) column, \(z\) has all components equal to one. For the I(sin(w)) column, \(z\) has \(i\)-th component \(\sin(w_i)\). And so forth.
Now observed equals expected says \[\begin{equation} z^T y = z^T \hat{\mu}, \qquad \text{for each regressor $z$}. \tag{17.1} \end{equation}\]
Theorem 17.1 Suppose we are working with a canonical affine submodel of a regular full exponential family, and this submodel has an “intercept”. Then the residual vector \(y - \hat{\mu}\) is empirically uncorrelated with each regressor (column of the model matrix).
The condition of the theorem statement holds whenever the constant vector is in the column space of the model matrix. In particular, it always holds when there are any categorical predictors, because then the model is the same whether or not an intercept term is explicitly included.
The term “empirically uncorrelated” in the theorem statement means \[\begin{equation} \sum_i (y_i - \hat{\mu}_i) (z_i - \bar{z}) = 0 \tag{17.2} \end{equation}\] where \(\bar{z}\) is the mean of the components of the regressor vector \(z\).
This theorem is a simple consequence of the observed-equals-expected property. It is often discussed in the context of linear models because it expresses an important fact about residuals-versus-predictor plots. But the theorem is not much use for generalized linear models and log-linear models for categorical data, because residual analysis is not usually discussed for them (residual analysis is very complicated when the response vector is not normally distributed, and we will not mention it again in this course).
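We can check (17.2) directly, assuming the objects m, count, and e from the fit above are still in the R session.
# residuals are empirically uncorrelated with every regressor
# (up to the accuracy of computer arithmetic)
r <- count - e
apply(m, 2, function(z) sum(r * (z - mean(z))))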
Here is a plot that shows how Theorem 17.1 works.
z <- m[ , "I(sin(w))"]
plot(z, count - e, ylab = "residuals",
xlab = expression(sin(w)))
lm(count - e ~ z) |> abline()
There is a pattern in the plot that comes from the response vector y being nonnegative-valued. And, of course, it makes no sense to do least squares regression of residuals on this regressor z. But doing so does allow us to draw the least squares regression line, which is flat (by the theorem). In the sense of this scatterplot, the generalized linear model fit by R function glm has already accounted completely for the effect of \(\sin(w)\) on the response. Of course, the same is true for all other regressors.
There are many ways to rephrase “observed equals expected”. The fundamental meaning is (17.1), but many other statements can be derived from that.
Proof (of Theorem 4.1). The first two derivatives of the log likelihood (3.1) are \[\begin{align} \nabla l(\theta) & = y - \nabla c(\theta) \\ \nabla^2 l(\theta) & = - \nabla^2 c(\theta) \end{align}\] Observed Fisher information for \(\theta\) is \[ - \nabla^2 l(\theta) = \nabla^2 c(\theta) \] which is equal to (4.2). Expected Fisher information is the expectation of observed Fisher information. Since we have just seen that observed Fisher information is nonrandom, the expectation is trivial (expectation of a constant random variable is that constant).
Proof (of Theorem 5.1). Consider two canonical parameter values \(\theta_1\) and \(\theta_2\) and the corresponding mean value parameter values \(\mu_1\) and \(\mu_2\) (so \(\mu_i = \nabla c(\theta_i)\)). We assume \(\theta_1\) and \(\theta_2\) do not correspond to the same distribution. Then Theorem 1 part (a) in Geyer (2009) says that \[ g(s) = l\bigl(\theta_1 + s (\theta_2 - \theta_1)\bigr) = \langle y, \theta_1 \rangle + s \langle y, \theta_2 - \theta_1 \rangle - c\bigl(\theta_1 + s (\theta_2 - \theta_1)\bigr) \] is a strictly concave function on the interval on which it is finite, which has \(0\) and \(1\) as interior points because of the assumption of regular full exponential family. Hence, by Theorem 2.13 part (a) in Rockafellar and Wets (1998) and the chain rule from calculus, its derivative \[\begin{align*} g'(s) & = \frac{d}{d s} l\bigl(\theta_1 + s (\theta_2 - \theta_1)\bigr) \\ & = \langle y, \theta_2 - \theta_1 \rangle - \bigl\langle \nabla c\bigl(\theta_1 + s (\theta_2 - \theta_1)\bigr), \theta_2 - \theta_1 \bigr\rangle \\ & = \langle y, \theta_2 - \theta_1 \rangle - \langle E_{\theta_1 + s (\theta_2 - \theta_1)}(Y), \theta_2 - \theta_1 \rangle \end{align*}\] is a strictly decreasing function. Thus \[ g'(0) = \langle y - \mu_1, \theta_2 - \theta_1 \rangle > g'(1) = \langle y - \mu_2, \theta_2 - \theta_1 \rangle \] Hence \[ g'(0) - g'(1) = \langle \mu_2 - \mu_1, \theta_2 - \theta_1 \rangle \] is strictly greater than zero. Hence \(\mu_1 \neq \mu_2\).
Proof (of Theorem 6.1). We try to maximize \(q\) given by (6.1) by setting the first derivative equal to zero and solving for \(\theta\). The first derivative is \[ q'(\theta) = \mu - \nabla c(\theta) = \mu - E_\theta(Y) \] and a \(\theta\) at which this is zero does exist by assumption, because \(\mu\) is assumed to be equal to \(E_\theta(Y)\) for some \(\theta\). By Theorem 2.14 part (b) in Rockafellar and Wets (1998), this solution is a global maximizer of \(q\).
Proof (of Theorem 6.2). Any cumulant function is lower semicontinuous and convex (Barndorff-Nielsen, 1978, Theorem 7.1). Hence (6.1) is an upper semicontinuous concave function. The rest of the theorem is Theorem 3.2 in Okabayashi and Geyer (2012).
Proof (of Theorem 6.3). In case \(y = \mu\) for some mean value parameter vector \(\mu\), this is a special case of Theorem 6.2.
Proof (of Theorem 7.1). \[\begin{align*} c'(\theta) & = \frac{d}{d \theta} n \log(1 + e^\theta) \\ & = \frac{n e^\theta}{1 + e^\theta} \\ & = n \pi \\ c''(\theta) & = \frac{d}{d \theta} \frac{n e^\theta}{1 + e^\theta} \\ & = \frac{n e^\theta}{1 + e^\theta} - \frac{n (e^\theta)^2}{(1 + e^\theta)^2} \\ & = n \frac{e^\theta}{1 + e^\theta} \bigl( 1 - \frac{e^\theta}{1 + e^\theta} \bigr) \\ & = n \pi (1 - \pi) \end{align*}\]
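As a numerical sanity check of these formulas (a sketch with made-up values of \(n\) and \(\theta\)), central finite differences of the cumulant function reproduce the binomial mean and variance.
n <- 10
theta <- 0.3
cfun <- function(theta) n * log1p(exp(theta))
p <- exp(theta) / (1 + exp(theta))
h <- 1e-4
d1 <- (cfun(theta + h) - cfun(theta - h)) / (2 * h)                 # approximates c'(theta)
d2 <- (cfun(theta + h) - 2 * cfun(theta) + cfun(theta - h)) / h^2   # approximates c''(theta)
c(d1, n * p)             # both approximately the binomial mean
c(d2, n * p * (1 - p))   # both approximately the binomial variance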
Proof (of Theorem 7.2). \[\begin{align*} c'(\theta) & = \frac{d}{d \theta} e^\theta \\ & = e^\theta \\ & = \mu \\ c''(\theta) & = \frac{d}{d \theta} e^\theta \\ & = \mu \end{align*}\]
Proof (of Theorem 7.3). \[\begin{align*} c'(\theta) & = \frac{d}{d \theta} \left( - r \log (1 - e^\theta) \right) \\ & = \frac{r e^\theta}{1 - e^\theta} \\ & = \frac{r (1 - \pi)}{\pi} \\ c''(\theta) & = \frac{d}{d \theta} \frac{r e^\theta}{1 - e^\theta} \\ & = \frac{r e^\theta}{1 - e^\theta} + \frac{r (e^\theta)^2}{(1 - e^\theta)^2} \\ & = \frac{r e^\theta}{1 - e^\theta} \left( 1 + \frac{e^\theta}{1 - e^\theta} \right) \\ & = \frac{r e^\theta}{1 - e^\theta} \frac{1}{1 - e^\theta} \\ & = \frac{r (1 - \pi)}{\pi^2} \end{align*}\]
Proof (of Theorem 7.4). To simplify notation, let \(I\) denote the index set \(\{1, 2, \ldots, k\}\) \[\begin{align*} \frac{\partial c(\theta)}{\partial \theta_i} & = \frac{\partial}{\partial \theta_i} n \log\left(\sum_{j \in I} e^{\theta_j}\right) \\ & = \frac{n e^{\theta_i}}{\sum_{j \in I} e^{\theta_j}} \end{align*}\] and from (4.2) we know this must be equal to the mean of the multinomial \(n \pi_i\). And cancelling the \(n\) on both sides gives (7.15). And \[\begin{align*} \frac{\partial^2 c(\theta)}{\partial \theta_i^2} & = \frac{\partial}{\partial \theta_i} \frac{n e^{\theta_i}}{\sum_{j \in I} e^{\theta_j}} \\ & = \frac{n e^{\theta_i}}{\sum_{j \in I} e^{\theta_j}} - \frac{n (e^{\theta_i})^2}{\left(\sum_{j \in I} e^{\theta_j}\right)^2} \\ & = \frac{n e^{\theta_i}}{\sum_{j \in I} e^{\theta_j}} \left(1 - \frac{e^{\theta_i}}{\sum_{j \in I} e^{\theta_j}} \right) \\ & = n \pi_i (1 - \pi_i) \end{align*}\] and for \(i \neq j\) \[\begin{align*} \frac{\partial^2 c(\theta)}{\partial \theta_i \partial \theta_j} & = \frac{\partial}{\partial \theta_j} \frac{n e^{\theta_i}}{\sum_{k \in I} e^{\theta_k}} \\ & = - \frac{n e^{\theta_i} e^{\theta_j}} {\left(\sum_{k \in I} e^{\theta_k}\right)^2} \\ & = - n \pi_i \pi_j \end{align*}\]
Proof (of Theorem 17.1). The special case of (17.1) with \(z\) the “intercept” regressor (all of whose components are equal to one) says \(\sum_i y_i = \sum_i \hat{\mu}_i\) or \[\begin{equation} \sum_i (y_i - \hat{\mu}_i) = 0 \tag{A.1} \end{equation}\] Rearranging (17.1) gives \[\begin{equation} \sum_i (y_i - \hat{\mu}_i) z_i = 0 \tag{A.2} \end{equation}\] for any regressor \(z\). Multiplying (A.1) by a constant gives \[\begin{equation} \sum_i (y_i - \hat{\mu}_i) \bar{z} = 0 \tag{A.3} \end{equation}\] and subtracting (A.3) from (A.2) gives (17.2).
log likelihood (equation (3.1))
canonical (vector) statistic
canonical parameter vector
mean value parameter vector
cumulant function, see also equation (3.2) and Sections 9.1 and 9.3.4.
binomial distribution as exponential family
Poisson distribution as exponential family
negative binomial distribution as exponential family
multinomial distribution as exponential family
univariate normal distribution as exponential family
coefficients vector also called canonical affine submodel canonical parameter vector
affine function, see also Section 3.4
observed equals expected property, see also Section 17
maximum entropy property
multivariate monotonicity property
Agresti, A. (2013) Categorical Data Analysis. Hoboken, NJ: John Wiley & Sons.
Baddeley, A., Turner, R. and Rubak, E. (2024) R Package spatstat: Spatial Point Pattern Analysis, Model-Fitting, Simulation, Tests, Version 3.2-1. Available at: https://cran.r-project.org/package=spatstat.
Barndorff-Nielsen, O. E. (1978) Information and Exponential Families. Chichester, England: Wiley.
Fisher, R. A. (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309–368.
Geyer, C. J. (1990) Likelihood and exponential families. PhD thesis. University of Washington. Available at: https://hdl.handle.net/11299/56330.
Geyer, C. J. (1991) Constrained maximum likelihood exemplified by isotonic convex logistic regression. Journal of the American Statistical Association, 86, 717–724. DOI: 10.1080/01621459.1991.10475100.
Geyer, C. J. (1994) On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B, 56, 261–274. DOI: 10.1111/j.2517-6161.1994.tb01976.x.
Geyer, C. J. (1999) Likelihood inference for spatial point processes. In Stochastic Geometry: Likelihood and Computation (eds O. E. Barndorff-Nielsen, W. S. Kendall, and M. N. M. van Lieshout), pp. 79–140. Boca Raton, FL: Chapman & Hall/CRC.
Geyer, C. J. (2009) Likelihood inference in exponential families and directions of recession. Electronic Journal of Statistics, 3, 259–289. DOI: 10.1214/08-EJS349.
Geyer, C. J. (2013a) Asymptotics of exponential families. Available at: https://www.stat.umn.edu/geyer/8112/notes/expfam.pdf.
Geyer, C. J. (2013b) Asymptotics of maximum likelihood without the LLN or CLT or sample size going to infinity. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton (eds G. L. Jones and X. Shen), pp. 1–24. Hayward, CA: Institute of Mathematical Statistics. DOI: 10.1214/12-IMSCOLL1001.
Geyer, C. J. (2016) Statistics 5102 (Geyer, Fall 2016) slides. Available at: https://www.stat.umn.edu/geyer/5102/slides/.
Geyer, C. J. (2018) Statistics 8931 (Geyer, Fall 2018) slides. Available at: https://www.stat.umn.edu/geyer/8931aster/slides/.
Geyer, C. J. (2020) Stat 8501 lecture notes: Spatial lattice processes. Available at: https://www.stat.umn.edu/geyer/8501/lattice.pdf.
Geyer, C. J. (2023) R Package aster: Aster Models, Version 1.1-3. Available at: https://cran.r-project.org/package=aster.
Geyer, C. J. (2024a) Stat 3701 lecture notes: Bayesian inference via Markov chain Monte Carlo (MCMC). Available at: https://www.stat.umn.edu/geyer/3701/notes/mcmc-bayes.html.
Geyer, C. J. (2024b) Stat 5421 lecture notes: Completion of exponential families. Available at: https://www.stat.umn.edu/geyer/5421/notes/infinity.html.
Geyer, C. J. (2024c) Stat 8501 lecture notes: Spatial point processes. Available at: https://www.stat.umn.edu/geyer/8501/points.pdf.
Geyer, C. J. and Møller, J. (1994) Simulation procedures and likelihood inference for spatial point processes. Scandinavian Journal of Statistics, 21, 359–373. Available at: https://www.jstor.org/stable/4616323.
Geyer, C. J. and Thompson, E. A. (1992) Constrained Monte Carlo maximum likelihood for dependent data (with discussion). Journal of the Royal Statistical Society, Series B, 54, 657–699. DOI: 10.1111/j.2517-6161.1992.tb01443.x.
Geyer, C. J., Ryder, O. A., Chemnick, L. G., et al. (1993) Analysis of relatedness in the California condors from DNA fingerprints. Molecular Biology and Evolution, 10, 571–589. DOI: 10.1093/oxfordjournals.molbev.a040035.
Geyer, C. J., Wagenius, S. and Shaw, R. G. (2007) Aster models for life history analysis. Biometrika, 94, 415–426. DOI: 10.1093/biomet/asm030.
Halmos, P. R. and Savage, L. J. (1949) Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Annals of Mathematical Statistics, 20, 225–241.
Handcock, M. S., Hunter, D. R., Butts, C. T., et al. (2023) R Package ergm: Fit, Simulate and Diagnose Exponential-Family Models for Networks, Version 4.6.0. Available at: https://cran.r-project.org/package=ergm.
Neyman, J. (1935) Sur un teorema concernente le cosiddette statistiche sufficienti. Giornale dell’Istituto Italiano degli Attuari, 6, 320–334.
Okabayashi, S. (2011) Parameter estimation in social network models. PhD thesis. University of Minnesota. Available at: https://hdl.handle.net/11299/108208.
Okabayashi, S. and Geyer, C. J. (2012) Long range search for maximum likelihood in exponential families. Electronic Journal of Statistics, 6, 123–147. DOI: 10.1214/11-EJS664.
Rockafellar, R. T. and Wets, R. J.-B. (1998) Variational Analysis. Berlin: Springer-Verlag.
Shaw, R. G. and Geyer, C. J. (2010) Inferring fitness landscapes. Evolution, 64, 2510–2520. DOI: 10.1111/j.1558-5646.2010.01010.x.
Wald, A. (1949) Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20, 595–601. Available at: https://www.jstor.org/stable/2236315.