\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amsthm}
% \usepackage[multiple]{footmisc}
\usepackage{indentfirst}
\usepackage[utf8]{inputenc}
\usepackage{natbib}
\usepackage{url}
\newcommand{\inner}[1]{\langle #1 \rangle}
\newcommand{\abs}[1]{\lvert #1 \rvert}
\newcommand{\set}[1]{\{\, #1 \,\}}
\newcommand{\real}{\mathbb{R}}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\pr}{pr}
\newtheorem{theorem}{Theorem}
\begin{document}
\begin{raggedright}
\Large Stat 5421 Lecture Notes \\
\textbf{Exponential Families, Part I} \\
Charles J. Geyer \\
\today
\par
\end{raggedright}
\tableofcontents
<<>>=
options(keep.source = TRUE, width = 60)
@
\section{Introduction}
\subsection{Definition}
An \emph{exponential family of distributions} is a statistical model that has log likelihood of the form
\begin{equation} \label{eq:logl}
l(\theta) = \inner{y, \theta} - c(\theta),
\end{equation}
where $y$ is a vector statistic and $\theta$ is a vector parameter, both having the same dimension, and
$$ \textstyle \inner{y, \theta} = y^T \theta = \theta^T y = y \cdot \theta = \sum_i y_i \theta_i. $$
Your humble author has a reason for preferring this angle bracket notation to either matrix multiplication or to the ``dot product'' or ``scalar product'' notation, but the reason isn't important for what we are going to do.
This is not the usual definition \citep[Chapter~8]{barndorff-nielsen} but it defines the same concept as the usual definition. Our definition comes from \citet{geyer-gdor}.
Recall from the likelihood handout that additive terms that do not contain the parameter can be dropped from log likelihoods, and they may need to be dropped to get the form \eqref{eq:logl}.
Also recall from introductory statistics that a statistic is a function of the data that does not depend on the parameter. So $y$ need not be ``the data'' (in its usual form); rather it may be a function of the data.
Thus if $\omega$ is the original data, the probability mass functions (PMF) or probability density functions (PDF), as the case may be (PMF for discrete data or PDF for continuous data), have the form
\begin{equation} \label{eq:densities}
f_\theta(\omega) = e^{\inner{y(\omega), \theta} - c(\theta) + h(\omega)}
\end{equation}
where $h(\omega)$ is a term not depending on the parameter that is dropped from the log likelihood to obtain the form \eqref{eq:logl}.
I want to say recall from introductory statistics that any one-to-one function of the parameter is another parameter, but I am aware that most introductory statistics courses go out of their way to never mention more than one parameterization of any particular statistical model. So we will just say it: any one-to-one function of the (vector) parameter of a statistical model is another (vector) parameter, and one parameterization is just as good as another: they all serve to distinguish the probability distributions in the model.
Many well known and widely used statistical models are exponential families (including Bernoulli, binomial, Poisson, geometric, negative binomial when the shape parameter is fixed, exponential, gamma, beta, normal, multinomial, and multivariate normal). In this course, of course, we are mostly interested in the discrete ones.
But even if you have had the theory sequence (mathematical probability and statistics) you may not see a connection between the PMF and PDF you know for these families of distributions and \eqref{eq:logl}.
It may be that the usual data are not the $y$ and the usual parameter is not the $\theta$ in \eqref{eq:logl}, so one needs to figure out what $y$ and $\theta$ are to get the log likelihood into exponential family form (examples below, Sections~\ref{sec:examples-i} and~\ref{sec:examples-ii}). A statistic $y$ and parameter $\theta$ that give a log likelihood of the form \eqref{eq:logl} are called \emph{canonical} or \emph{natural}. The function $c$ is called the \emph{cumulant function} of the family. See Section~\ref{sec:cumfun} below for more on cumulant functions. \subsection{Terminology} Following \citet{barndorff-nielsen}, we are using modern terminology about exponential families. An older terminology says ``the exponential family'' for the collection of all of what the newer terminology calls exponential families. \begin{itemize} \item The old terminology says, for example, the family of Poisson distributions is \emph{in the} exponential family. \item The new terminology says, for example, the family of Poisson distributions is \emph{an} exponential family. \end{itemize} The old terminology has nothing to recommend it. It makes the primary term --- exponential family --- refer to a heterogeneous collection of statistical models of no interest in any application. The new terminology describes a property of statistical models that implies many other properties. It is a key concept of theoretical statistics. \subsection{Affine Functions} An affine function is what most people, including most statisticians, call a linear function, but which mathematicians, at least when being pedantic, do not. Starting in courses called ``linear algebra'' a linear function (also called linear transformation) is a function $f$ between vector spaces that preserves the vector space operations, that is, \begin{subequations} \begin{alignat}{2} f(x + y) & = f(x) + f(y), & \qquad & \text{for all vectors $x$ and $y$} \label{eq:linear-a} \\ f(a x) & = a f(x), && \text{for all scalars $a$ and vectors $x$} \label{eq:linear-b} \end{alignat} \end{subequations} The special case of \eqref{eq:linear-b} when $a = 0$ says $f(0) = 0$, which most people think linear functions do not have to satisfy. What most people mean by linear function is a function whose graph is flat. They are the scalar-to-scalar functions of the form $$ f(x) = a + b x, $$ where $a$ and $b$ are scalars, and vector-to-vector functions of the form $$ f(x) = a + B x, $$ where $a$ is a vector and $B$ is a matrix. Mathematicians also need to discuss these kinds of functions, so they give them a name: affine functions. The relations between the two kinds of functions are (using linear in the pedantic sense here) \begin{itemize} \item an affine function is a linear function plus a constant function, and \item a linear function is an affine function $f$ satisfying $f(0) = 0$. \end{itemize} Your humble author likes to use linear in the pedantic sense and affine for linear in the sloppy sense. Apologies for the pedantry. \subsection{Non-Uniqueness} \label{sec:non-unique} The canonical statistic, canonical parameter, and cumulant function of an exponential family are not unique. \begin{itemize} \item Any one-to-one affine function of a canonical statistic is another canonical statistic. \item Any one-to-one affine function of a canonical parameter is another canonical parameter. \item Any scalar-valued affine function of the canonical parameter added to the cumulant function gives another cumulant function. \end{itemize} These changes cannot be done separately. 
Changing one requires changes in the others to keep the form \eqref{eq:logl}. We usually do not worry about this. We fix one choice of canonical statistic, canonical parameter, and cumulant function and say ``the'' canonical statistic, ``the'' canonical parameter, and ``the'' cumulant function. \subsection{Non-Degeneracy} \label{sec:non-degenerate} The PMF or PDF \eqref{eq:densities} of an exponential family are never zero. This means all distributions in the family have the same support (more on this later). \subsection{Cumulant Functions} \label{sec:cumfun} Cumulant functions satisfy \begin{equation} \label{eq:cumfun} c(\theta) = c(\psi) + \log E_\psi\left\{ e^{\inner{y, \theta - \psi}} \right\} \end{equation} \citep[equation (5) in][]{geyer-gdor}. Whenever $\theta$ is an interior point of their domain, they also satisfy \begin{align} E_\theta(y) & = \nabla c(\theta) \label{eq:bart-i} \\ \var_\theta(y) & = \nabla^2 c(\theta) \label{eq:bart-ii} \end{align} \citep[Theorem~8.1]{barndorff-nielsen}, where, as in the likelihood handout, \begin{itemize} \item $E(y)$ denotes the vector having components $E(y_i)$, \item $\nabla c(\theta)$ denotes the vector having components $\partial c(\theta) / \partial \theta_i$, \item $\var(y)$ denotes the matrix having components $\cov(y_i, y_j)$, \item and $\nabla^2 c(\theta)$ denotes the matrix having components $\partial^2 c(\theta) / \partial \theta_i \partial \theta_j$. \end{itemize} It is part of the assertion of Theorem~8.1 in \citet{barndorff-nielsen} that the derivatives exist whenever $\theta$ is an interior point of the domain. If there is an equality constraint on the parameter vector, so its components are not separately variable, the derivatives in \eqref{eq:bart-i} and \eqref{eq:bart-ii} do not exist anywhere, but then we can sometimes use \eqref{eq:cumfun} to extend the domain of the cumulant function, as described in the next section, so that derivatives do exist. \subsection{Fullness and Regularity} In \eqref{eq:cumfun} consider $\psi$ as fixed and $\theta$ as variable, and let $\theta$ vary over the entire Euclidean space that contains the canonical parameter space, letting $c(\theta)$ have the value $+ \infty$ at points $\theta$ such that the expectation on the right-hand side does not exist. Then define \begin{equation} \label{eq:full} \Theta = \set{ \theta : c(\theta) < \infty }. \end{equation} This is called the \emph{effective domain} of the cumulant function. We can extend the exponential family to \eqref{eq:full} by defining the PMF or PDF by \begin{equation} \label{eq:pmdf-full} f_\theta(\omega) = f_\psi(\omega) e^{\inner{y(\omega), \theta - \psi} - c(\theta) + c(\psi)} \end{equation} at points $\theta$ that were not in the originally given canonical parameter space. We say an exponential family is \emph{full} if its canonical parameter space is \eqref{eq:full}. We say a full exponential family is \emph{regular} if \eqref{eq:full} is an open subset of the Euclidean space where $\theta$ takes values. Another name for the set \eqref{eq:full} is the canonical parameter space of the full exponential family. For many exponential families \eqref{eq:full} is the whole Euclidean space. But not for all. Geometric, negative binomial, and two-parameter normal are examples where \eqref{eq:full} is not a whole Euclidean space. Regularity is very important. Many of the desirable properties of exponential families hold only for regular full exponential families. 
For example, if the full family is regular, this means every canonical parameter value $\theta$ is an interior point of the full canonical parameter space \eqref{eq:full}, and, consequently, every distribution in the family has mean vector and variance matrix given by derivatives of the cumulant function \eqref{eq:bart-i} and \eqref{eq:bart-ii}.
Fortunately, almost all exponential families that arise in applications are full and regular. The only examples I know of that don't are some models for spatial point processes \citep{geyer-moller,geyer-spatial}. Every exponential family we will look at in this course is regular.
\section{Examples I} \label{sec:examples-i}
\subsection{Binomial} \label{sec:binomial}
The binomial distribution has log likelihood
$$ l(\pi) = x \log(\pi) + (n - x) \log(1 - \pi) $$
where $x$ is the ``usual'' data, $\pi$ is the ``usual'' parameter, and $n$ is neither data nor parameter but rather a constant. To get this into the form \eqref{eq:logl}, assuming the family is one-dimensional (which having only one parameter suggests), we need to write this as a function of $x$ (which will be the canonical statistic $y$) times a function of $\pi$ (which will be the canonical parameter $\theta$) minus a function of $\theta$ (the cumulant function). So isolate $x$
\begin{align*}
l(\pi) & = x \bigl[ \log(\pi) - \log(1 - \pi) \bigr] + n \log(1 - \pi) \\
& = x \log \left( \frac{\pi}{1 - \pi} \right) + n \log(1 - \pi)
\end{align*}
and this tells us
\begin{subequations}
\begin{itemize}
\item the canonical statistic is $x$,
\item the canonical parameter is
\begin{equation} \label{eq:fooa}
\theta = \log \left( \frac{\pi}{1 - \pi} \right),
\end{equation}
\item and the cumulant function is
\begin{equation} \label{eq:foob}
c(\theta) = - n \log(1 - \pi).
\end{equation}
\end{itemize}
\end{subequations}
Of course \eqref{eq:foob} doesn't really make sense because it has $\theta$ on one side and $\pi$ on the other. We need to solve \eqref{eq:fooa} for $\pi$ obtaining
\begin{equation} \label{eq:inv-logit}
\pi = \frac{e^{\theta}}{1 + e^\theta}
\end{equation}
and plug this back into \eqref{eq:foob} obtaining
\begin{align*}
c(\theta) & = - n \log(1 - \pi) \\
& = - n \log\left(1 - \frac{e^{\theta}}{1 + e^\theta}\right) \\
& = - n \log\left(\frac{1}{1 + e^\theta}\right) \\
& = n \log\left(1 + e^\theta\right)
\end{align*}
Since this formula is valid for all $\theta$, we see the binomial family is a regular full exponential family whose canonical parameter space is the whole of Euclidean space (one-dimensional Euclidean space in this case).
This change-of-parameter is so important that statisticians give it a name
\begin{equation} \label{eq:logit}
\logit(\pi) = \log \left( \frac{\pi}{1 - \pi} \right) \qquad 0 < \pi < 1.
\end{equation}
($\logit$ is pronounced ``low-jit'' with a soft ``g'').
Note that 0 and 1 are not possible arguments of the $\logit$ function. We get $\log(0)$ or division by zero, either of which is undefined. The $\logit$ function maps $(0, 1) \to (- \infty, + \infty)$. So the binomial family of distributions, considered as an exponential family, has ``usual'' parameter space $0 < \pi < 1$ and canonical parameter space $- \infty < \theta < + \infty$.
This is a little surprising because the usual ``usual'' parameter space is $0 \le \pi \le 1$ because otherwise we have no parameter values to be maximum likelihood estimates when we observe $x = 0$ or $x = n$. So exponential family theory causes some trouble here. Much more on this subject later.
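Before we worry about that, here is a quick sanity check of this change-of-parameter done in R. This check is not part of the development; the particular numbers are arbitrary, made up just for illustration (and we use \texttt{p} for $\pi$ because \texttt{pi} already means something in R).
<<>>=
n <- 10
p <- 0.3
x <- 7
theta <- log(p / (1 - p))    # logit(p)
# inverse logit recovers p
all.equal(p, exp(theta) / (1 + exp(theta)))
# the two expressions for the cumulant function agree
all.equal(- n * log(1 - p), n * log(1 + exp(theta)))
# the log likelihood in the usual and canonical forms agree
all.equal(x * log(p) + (n - x) * log(1 - p),
    x * theta - n * log(1 + exp(theta)))
@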
If we think about it a bit more, though, we see that excluding $\pi = 0$ and $\pi = 1$ is just the non-degeneracy property of exponential families (Section~\ref{sec:non-degenerate} above) at work. This is why the degenerate distributions concentrated at $x = 0$ and $x = n$ (for $\pi = 0$ and $\pi = 1$) cannot be included in the exponential family.
\paragraph{Summary} For the binomial distribution
\begin{itemize}
\item the canonical statistic is $x$,
\item the canonical parameter is $\theta = \logit(\pi)$,
\item the canonical parameter space of the full family is $- \infty < \theta < + \infty$,
\item and the cumulant function is
\begin{equation} \label{eq:cumfun-binomial}
c(\theta) = n \log\bigl(1 + e^\theta\bigr).
\end{equation}
\end{itemize}
% c[theta_] = n Log[1 + Exp[theta]]
% m = D[c[theta], theta]
% v = D[m, theta]
% theta = Log[p / (1 - p)]
% Simplify[m]
% Simplify[v]
\subsection{Poisson}
The Poisson distribution has log likelihood
$$ l(\mu) = x \log(\mu) - \mu $$
where $x$ is the ``usual'' data and $\mu$ is the ``usual'' parameter. To get this into the form \eqref{eq:logl}, assuming the family is one-dimensional (which having only one parameter suggests), we need to write this as a function of $x$ (which will be the canonical statistic $y$) times a function of $\mu$ (which will be the canonical parameter $\theta$) minus a function of $\theta$ (the cumulant function). Obviously,
\begin{itemize}
\item the canonical statistic is $x$,
\item the canonical parameter is
$$ \theta = \log(\mu), $$
which has inverse function
$$ \mu = e^\theta, $$
\item and the cumulant function is
$$ c(\theta) = \mu = e^\theta. $$
\end{itemize}
Note that 0 is not a possible argument of the $\log$ function. The $\log$ function maps $(0, + \infty) \to (- \infty, + \infty)$. So the Poisson family of distributions, considered as an exponential family, has ``usual'' parameter space $0 < \mu < + \infty$ and canonical parameter space $- \infty < \theta < + \infty$.
The Poisson distribution for $\mu = 0$ is degenerate, concentrated at zero, so by the non-degeneracy property (Section~\ref{sec:non-degenerate} above) it cannot be included in the exponential family.
\paragraph{Summary} For the Poisson distribution
\begin{itemize}
\item the canonical statistic is $x$,
\item the canonical parameter is $\theta = \log(\mu)$,
\item the canonical parameter space of the full family is $- \infty < \theta < + \infty$,
\item and the cumulant function is
\begin{equation} \label{eq:cumfun-poisson}
c(\theta) = e^\theta.
\end{equation}
\end{itemize}
\subsection{Negative Binomial}
The negative binomial distribution has PMF
\begin{equation} \label{eq:nbinom-pmf}
f(x) = \binom{r + x - 1}{x} p^r (1 - p)^x, \qquad x = 0, 1, \ldots
\end{equation}
where $x$ is the ``usual'' data and $r$ and $p$ are the ``usual'' parameters and
$$ \binom{r}{x} = \frac{r \cdot (r - 1) \cdots (r - x + 1)}{x !}. $$
If we consider both $r$ and $p$ to be unknown parameters, this is not an exponential family. If we consider $r$ to be known, so $p$ is the only unknown parameter, then it is an exponential family. So we consider $r$ as known. Then the log likelihood is
$$ l(p) = r \log(p) + x \log(1 - p). $$
Obviously,
\begin{itemize}
\item the canonical statistic is $x$,
\item the canonical parameter is
$$ \theta = \log(1 - p), $$
which has inverse function
$$ p = 1 - e^\theta, $$
\item and the cumulant function is
\begin{equation} \label{eq:cumfun-negative-binomial-the-first}
c(\theta) = - r \log(p) = - r \log\bigl(1 - e^\theta\bigr).
\end{equation}
\end{itemize}
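As with the binomial, we can check this algebra numerically in R. Again this is only an illustration with made-up numbers, relying on the fact that the \texttt{prob} argument of the R function \texttt{dnbinom} is the $p$ in \eqref{eq:nbinom-pmf}.
<<>>=
r <- 3
p <- 0.4
x <- 5
theta <- log(1 - p)
# inverse map recovers p
all.equal(p, 1 - exp(theta))
# canonical form of the log likelihood agrees with the log PMF after
# dropping the additive term lchoose(r + x - 1, x), which does not
# contain the parameter
all.equal(x * theta + r * log(1 - exp(theta)),
    dnbinom(x, size = r, prob = p, log = TRUE) - lchoose(r + x - 1, x))
@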
The negative binomial distribution for $p = 0$ does not exist ($p = 0$ is not an allowed parameter value). The negative binomial distribution for $p = 1$ does exist but is degenerate, concentrated at zero. So this degenerate distribution cannot be included in the exponential family (by Section~\ref{sec:non-degenerate} above). The $\log$ function maps $(0, 1) \to (- \infty, 0)$. So the negative binomial family of distributions, considered as an exponential family, has ``usual'' parameter space $0 < p < 1$ and canonical parameter space $- \infty < \theta < 0$.
We are tempted to use \eqref{eq:cumfun} to extend the family, but if we try, we find that \eqref{eq:cumfun} gives us the same cumulant function we already have, so the negative binomial family as we have described it is already a regular full exponential family.
\paragraph{Summary} For the negative binomial distribution
\begin{itemize}
\item the canonical statistic is $x$,
\item the canonical parameter is $\theta = \log(1 - p)$,
\item the canonical parameter space of the full family is $- \infty < \theta < 0$,
\item and the cumulant function is
\begin{equation} \label{eq:cumfun-negative-binomial}
c(\theta) = - r \log(1 - e^\theta).
\end{equation}
\end{itemize}
% c[theta_] = - r Log[1 - Exp[theta]]
% m = D[c[theta], theta]
% v = D[m, theta]
% theta = Log[1 - p]
% Simplify[m]
% Simplify[v]
\paragraph{Commentary} This provides an example where the canonical parameter space of the full family is not the whole Euclidean space (in this case one-dimensional Euclidean space).
\subsection{Multinomial} \label{sec:multinomial}
The multinomial PMF is
$$ f(x) = \binom{n}{x} \prod_{i = 1}^k \pi_i^{x_i} $$
where $x = (x_1, \ldots, x_k)$ is the vector of cell counts of a contingency table (the ``usual'' data vector), where $\pi = (\pi_1, \ldots, \pi_k)$ is the vector of cell probabilities (the ``usual'' parameter vector), where $n$ is the sample size (the sum of the cell counts), and where $\binom{n}{x}$ is a multinomial coefficient. The log likelihood is
$$ l(\pi) = \sum_{i = 1}^k x_i \log(\pi_i) $$
\subsubsection{Try I}
It looks like we have exponential family form with the canonical statistic vector $x$, canonical parameter vector $\theta$ having components $\theta_i = \log(\pi_i)$, and cumulant function the zero function. As we shall eventually see, this is correct but misleading. If we use \eqref{eq:bart-i} and \eqref{eq:bart-ii} on the zero function we get that $x$ has mean vector zero and variance matrix zero, which is incorrect. But we are not allowed to use those formulas because the components of $\theta$, as we have defined them in this section, are not separately variable: the components of $\pi$ are constrained to sum to one. And that translates to a complicated nonlinear constraint on the canonical parameters
\begin{equation} \label{eq:try-i-constraint}
\sum_{i = 1}^k e^{\theta_i} = 1.
\end{equation}
That looks annoying. So we try something else.
\subsubsection{Try II} \label{sec:try-ii}
If we eliminate one of the components, say $\pi_k$, we get log likelihood
$$ l(\pi_1, \ldots, \pi_{k - 1}) = x_k \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \log(\pi_i) $$
But since we are trying to put this in exponential family form in which the canonical statistic vector and canonical parameter vector must have the same dimension, we have to also eliminate the corresponding component of $x$. We can do that because the components of $x$ are constrained to sum to $n$.
If we use that fact to eliminate $x_k$ we get
\begin{align*}
l(\pi_1, \ldots, \pi_{k - 1}) & = \left(n - \sum_{i = 1}^{k - 1} x_i\right) \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \log(\pi_i) \\
& = n \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \left[ \log(\pi_i) - \log\left(1 - \sum_{j = 1}^{k - 1} \pi_j\right) \right] \\
& = n \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right) + \sum_{i = 1}^{k - 1} x_i \log\left( \frac{\pi_i}{ 1 - \sum_{j = 1}^{k - 1} \pi_j} \right)
\end{align*}
and this has exponential family form with
\begin{itemize}
\item canonical statistic vector $(x_1, \ldots, x_{k - 1})$,
\item canonical parameter vector $(\theta_1, \ldots, \theta_{k - 1})$, where
\begin{equation} \label{eq:kumquat}
\theta_i = \log\left( \frac{\pi_i}{1 - \sum_{j = 1}^{k - 1} \pi_j} \right)
\end{equation}
\item and cumulant function
\begin{equation} \label{eq:cumfun-multi-foobar}
c(\theta_1, \ldots, \theta_{k - 1}) = - n \log\left(1 - \sum_{i = 1}^{k - 1} \pi_i\right),
\end{equation}
\end{itemize}
except that (as usual) this doesn't make sense with $\theta$'s on the left-hand side and $\pi$'s on the right-hand side, so we have to solve for the $\pi$'s in terms of the $\theta$'s and plug in. Reintroduce $\pi_k = 1 - \sum_{i = 1}^{k - 1} \pi_i$ and plug into \eqref{eq:kumquat} obtaining
$$ \theta_i = \log\left( \frac{\pi_i}{\pi_k} \right) $$
or
$$ e^{\theta_i} = \frac{\pi_i}{\pi_k} $$
or
\begin{equation} \label{eq:quince}
\pi_i = \pi_k e^{\theta_i}
\end{equation}
and this also holds for $i = k$ if we define $\theta_k = 0$ (which we may do because there was no definition of $\theta_k$ before). Since the $\pi$'s must sum to $1$, summing \eqref{eq:quince} gives
$$ 1 = \pi_k \sum_{i = 1}^k e^{\theta_i} = \pi_k \left( 1 + \sum_{i = 1}^{k - 1} e^{\theta_i} \right) $$
and
$$ \pi_k = \frac{1}{1 + \sum_{i = 1}^{k - 1} e^{\theta_i}} $$
and (plugging this back into \eqref{eq:quince})
$$ \pi_i = \frac{e^{\theta_i}}{1 + \sum_{j = 1}^{k - 1} e^{\theta_j}} $$
and (plugging this back into \eqref{eq:cumfun-multi-foobar})
\begin{align*}
c(\theta_1, \ldots, \theta_{k - 1}) & = - n \log\left(1 - \sum_{i = 1}^{k - 1} \frac{e^{\theta_i}}{1 + \sum_{j = 1}^{k - 1} e^{\theta_j}} \right) \\
& = - n \log\left( \frac{1}{1 + \sum_{i = 1}^{k - 1} e^{\theta_i}} \right) \\
& = n \log \left( 1 + \sum_{i = 1}^{k - 1} e^{\theta_i} \right)
\end{align*}
I don't know what you think of this, but I think it is horribly messy. Also arbitrary. Clearly, which $\pi_i$ we choose to eliminate is arbitrary. But we also know (Section~\ref{sec:non-unique} above) that there is a lot more arbitrariness than that in choosing canonical statistic, canonical parameter, and cumulant function.
\subsubsection{Try III} \label{sec:try-iii}
Let's try again, this time not eliminating parameters or statistics, and using \eqref{eq:cumfun} to define the cumulant function. Let $\psi$ in \eqref{eq:cumfun} be the canonical parameter vector for the multinomial distribution having probability vector $p = (p_1, \ldots, p_k)$. Let $S$ denote the sample space of the multinomial distribution: the set of all vectors $x$ having nonnegative integer components that sum to $n$.
Then \eqref{eq:cumfun} says
\begin{align*}
c(\theta) & = c(\psi) + \log \sum_{x \in S} e^{\inner{x, \theta - \psi}} \cdot \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} \\
& = c(\psi) + \log \sum_{x \in S} e^{\sum_{i = 1}^k x_i (\theta_i - \psi_i)} \cdot \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} \\
& = c(\psi) + \log \sum_{x \in S} \left[ \prod_{i = 1}^k e^{x_i (\theta_i - \psi_i)} \right] \cdot \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} \\
& = c(\psi) + \log \sum_{x \in S} \binom{n}{x} \prod_{i = 1}^k p_i^{x_i} e^{x_i (\theta_i - \psi_i)} \\
& = c(\psi) + \log \sum_{x \in S} \binom{n}{x} \prod_{i = 1}^k \bigl[ p_i e^{\theta_i - \psi_i} \bigr]^{x_i} \\
& = c(\psi) + \log \left( \sum_{i = 1}^k p_i e^{\theta_i - \psi_i} \right)^n \\
& = c(\psi) + n \log \left( \sum_{i = 1}^k p_i e^{\theta_i - \psi_i} \right)
\end{align*}
We use the multinomial theorem to evaluate the sum over the sample space. If we take $\psi$ to be the vector with all components zero and $p$ to be the vector with all components equal to $1 / k$, which we are free to do because $\psi$ and $p$ were arbitrary, we get
\begin{align*}
c(\theta) & = c(\psi) + n \log \left( \frac{1}{k} \sum_{i = 1}^k e^{\theta_i} \right) \\
& = c(\psi) - n \log(k) + n \log \left( \sum_{i = 1}^k e^{\theta_i} \right)
\end{align*}
and, since we are free to choose $c(\psi)$, we can choose it to cancel the $- n \log(k)$, finally obtaining
$$ c(\theta) = n \log \left( \sum_{i = 1}^k e^{\theta_i} \right) $$
But now we have lost track of what the usual parameters are in this parameterization. We use \eqref{eq:bart-i} to find them again.
$$ E(x_j) = n \pi_j = \frac{\partial c(\theta)}{\partial \theta_j} = \frac{n e^{\theta_j}}{\sum_{i = 1}^k e^{\theta_i}} $$
so
\begin{equation} \label{eq:try-iii-mean-value}
\pi_j = \frac{e^{\theta_j}}{\sum_{i = 1}^k e^{\theta_i}}
\end{equation}
\subsubsection{Summary}
\paragraph{Try II} For the multinomial family of dimension $k$ and sample size $n$
\begin{itemize}
\item the canonical statistic is the vector $(x_1, \ldots, x_{k - 1})$ whose components are all but one of the category counts,
\item the canonical parameter is the vector $(\theta_1, \ldots, \theta_{k - 1})$, where
\begin{subequations}
\begin{equation} \label{eq:try-ii-linkfun}
\theta_i = \log\left( \frac{\pi_i}{\pi_k} \right),
\end{equation}
\item the canonical parameter space of the full family is the whole of $(k - 1)$-dimensional Euclidean space,
\item and the cumulant function is
\begin{equation} \label{eq:try-ii-cumfun}
c(\theta) = n \log \left( 1 + \sum_{i = 1}^{k - 1} e^{\theta_i} \right)
\end{equation}
\end{subequations}
\end{itemize}
\paragraph{Binomial Distribution} The binomial distribution is the $k = 2$ case of the multinomial distribution when the ``try II'' parameterization is used. If $x_1$ is the number of successes and $x_2$ is the number of failures, then we eliminate $x_2$ from the canonical statistic vector, just leaving the scalar variable $x_1$. Because the $\pi$'s sum to one we have $\pi_2 = 1 - \pi_1$ and plugging this into \eqref{eq:try-ii-linkfun} gives the analogous equation for the binomial distribution \eqref{eq:fooa}. Also \eqref{eq:try-ii-cumfun} is the same as the analogous equation for the binomial distribution \eqref{eq:cumfun-binomial}.
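Before summarizing ``try III'', here is a small numerical illustration in R of the mean-value map \eqref{eq:try-iii-mean-value} and of its relation to ``try II'' (made-up numbers, not part of the development). It also anticipates the commentary below: adding the same constant to every component of $\theta$ does not change the cell probabilities.
<<>>=
theta <- c(1.2, -0.5, 0.3)    # arbitrary canonical parameter, k = 3
p <- exp(theta) / sum(exp(theta))    # cell probabilities pi_i
sum(p)    # cell probabilities sum to one
# adding the same constant to every component of theta changes nothing
all.equal(p, exp(theta + 7) / sum(exp(theta + 7)))
# imposing the constraint theta_k = 0 recovers the "try II"
# canonical parameters log(pi_i / pi_k)
all.equal((theta - theta[3])[1:2], log(p[1:2] / p[3]))
@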
\paragraph{Summary of Try III} For the multinomial family of dimension $k$ and sample size $n$ \begin{itemize} \item the canonical statistic is the vector $x$ of category counts \item the canonical parameter is the vector $\theta$, \item the canonical parameter space of the full family is the whole of $k$-dimensional Euclidean space, \item the cumulant function is \begin{equation} \label{eq:cumfun-multinomial-try-iii} c(\theta) = n \log \left( \sum_{i = 1}^k e^{\theta_i} \right) \end{equation} \end{itemize} \paragraph{Commentary} This provides an example where we have three different ways to put the distribution in exponential family form and we can choose the one we like. Try I seemed simple at first but is complicated by the nonlinear constraint on the parameters \eqref{eq:try-i-constraint}. Try II is ugly but simple to use. Try III is elegant but a little more complicated. Try I is the special case of Try III where we impose the constraint \eqref{eq:try-i-constraint}. Try II is the special case of Try III where we impose the constraint $\theta_k = 0$ and then don't consider $\theta_k$ part of the canonical parameter vector and don't consider $x_k$ part of the canonical statistic vector. Try III is complicated by the fact that the mapping between canonical parameters and usual parameters \eqref{eq:try-iii-mean-value} is not a one-to-one mapping. If one adds the same constant to all of the $\theta_i$ on the right-hand side of \eqref{eq:try-iii-mean-value}, then it does not change the left-hand side. In order to get a one-to-one relationship we need to impose a constraint on the canonical parameters. We can choose the constraint to be anything convenient. We can use the ones already mentioned, \eqref{eq:try-i-constraint} and $\theta_k = 0$. But we can also use others. Try III is important because it keeps all the cell counts as components of the canonical statistic vector. This makes it much easier to reason about models. In the complicated exponential family models that arise in aster models \citep*{geyer-wagenius-shaw,aster-package}, it is essential that the canonical statistic vector be the full vector of counts. Thus we have to use the ``try III'' parameterization. We can explain the difference between try II and try III as follows. In try II we think that getting a one-to-one correspondence between usual parameters and canonical parameters is so important that we do it right away and make all the math and other reasoning about models very messy. In try III we think that getting a one-to-one correspondence is not important, and we do not allow it to mess up the math and other reasoning. We can impose a constraint to get one-to-one correspondence whenever we want. \section{Independent and Identically Distributed} \label{sec:iid} If we observe data independent and identically distributed (IID) from any statistical model, the joint distribution is the product of marginal distributions $$ f_\theta(x_1, \ldots, x_n) = \prod_{i = 1}^n f_\theta(x_i) $$ so the likelihood is also a product $$ L_n(\theta) = \prod_{i = 1}^n f_\theta(x_i) $$ and the log likelihood is a sum $$ l_n(\theta) = \sum_{i = 1}^n \log f_\theta(x_i). 
$$
When we apply this to exponential families, the log likelihood is \eqref{eq:logl} with $y$ changed to $y_i$ and summed
\begin{align*}
l(\theta) & = \sum_{i = 1}^n \bigl[ \inner{y_i, \theta} - c(\theta) \bigr] \\
& = \left\langle \sum_{i = 1}^n y_i, \theta \right\rangle - n c(\theta)
\end{align*}
Thus we see that IID sampling produces a new exponential family and
\begin{itemize}
\item the canonical statistic is $\sum_{i = 1}^n y_i$,
\item the canonical parameter is $\theta$,
\item and the cumulant function is $n c(\theta)$.
\end{itemize}
or
\begin{itemize}
\item the canonical statistic for sample size $n$ is the sum of the canonical statistics for the sampled individuals,
\item the canonical parameter is the same for all sample sizes,
\item and the cumulant function for sample size $n$ is $n$ times the cumulant function for sample size one.
\end{itemize}
This explains many ``addition rules'' for distributions (sum of IID binomial is binomial, sum of IID Poisson is Poisson, sum of IID normal is normal, and so forth for other exponential families).
\section{Sufficiency}
\subsection{Definition}
\Citet{fisher1922} invented the concept of sufficiency and sufficient statistics. A scalar or vector statistic (function of the data that does not depend on parameters) is \emph{sufficient} if the conditional distribution of the whole data given that statistic does not depend on parameters. In notation, if $\omega$ is the whole data and $y$ is the statistic, then $y$ is sufficient if the distribution of $\omega$ given $y$ does not depend on the parameters.
\subsection{The Sufficiency Principle}
Fisher argued that the sufficient statistic extracts all of the information in the data relevant to estimating the parameters. If $y$ is sufficient and $\omega$ is the whole data and we write joint equals marginal times conditional
$$ f_\theta(\omega, y) = f(\omega \mid y) f_\theta(y), $$
only the marginal depends on the parameters. So Fisher argued that we should use that marginal to estimate parameters, because the conditional does not involve the parameters and hence has nothing to do with them. Fisher's argument was later dubbed the \emph{sufficiency principle:} statistical inference should only depend on the data through the sufficient statistics. The likelihood may be written
$$ L(\theta) = f_\theta(y) $$
(as usual we may drop multiplicative terms from the likelihood that do not contain parameters). Thus likelihood inference and Bayesian inference automatically obey the sufficiency principle. It is only ad hoc methods of estimation that may violate it.
\subsection{The Neyman-Fisher Factorization Criterion}
The definition of sufficient statistics is hard to apply (you have to factor the joint distribution into marginal times conditional), but the \emph{Neyman-Fisher factorization criterion} \citep{fisher1922,neyman,halmos-savage} is much easier to apply. This says a vector statistic is sufficient if and only if there is a version of the likelihood (possibly obtained by dropping some multiplicative factors that do not contain the parameters) or log likelihood (possibly obtained by dropping some additive terms that do not contain the parameters) that depends on the whole data only through that statistic. From \eqref{eq:logl} and the Neyman-Fisher factorization criterion we see that the canonical statistic vector of an exponential family is always a sufficient statistic vector.
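To illustrate the factorization criterion, consider IID Poisson sampling, where the canonical statistic is $\sum_i x_i$ (Section~\ref{sec:iid} above applied to the Poisson family). Here is a small numerical demonstration in R, with made-up data, not part of the development: two samples with the same value of the canonical statistic have log likelihoods that differ only by an additive constant not containing the parameter, so they lead to exactly the same likelihood inference.
<<>>=
x1 <- c(2, 0, 5, 1)
x2 <- c(3, 3, 1, 1)    # same canonical statistic: sum(x2) == sum(x1)
logl <- function(mu, x) sum(dpois(x, mu, log = TRUE))
mu <- seq(0.5, 5, 0.5)
l1 <- sapply(mu, logl, x = x1)
l2 <- sapply(mu, logl, x = x2)
# the difference is constant in mu (it is the dropped additive term)
range(l1 - l2)
@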
From Section~\ref{sec:iid} we see that the dimension of the sufficient statistic vector does not change as the sample size changes. This is not automatic. There are statistical models that have no sufficient statistic vector of smaller dimension than the whole data (the Cauchy location family, for example). \subsection{The Pitman-Koopman-Darmois Theorem} The so-called Pitman-Koopman-Darmois theorem \citep{pitman,koopman,darmois} says this property exponential families have that in IID sampling the dimension of the sufficient statistic vector does not change as the sample size changes actually characterizes exponential families under certain side conditions. Without side conditions the theorem simply isn't true. The uniform $(0,\theta)$ statistical model has sufficient statistic $\max(x_1, \ldots, x_n)$, which is one-dimensional for all $n$, but this model is not an exponential family. The theorem rules this out by a side condition that requires the support of the distributions in the family to not depend on the parameter. Pitman, Koopman, and Darmois each independently proved that under this side condition the only continuous statistical models that have sufficient statistics whose dimension does not depend on the sample size in IID sampling are exponential families. There are analogs of the Pitman-Koopman-Darmois theorem for discrete data but they add ugly and unmotivated side conditions to get the theorem. They are what my brother-in-law's thesis advisor called ``ham and eggs theorems'' (if we had some ham, we'd have ham and eggs, if we had some eggs). Nevertheless, Pitman-Koopman-Darmois theory did kick off the subject of exponential families. \subsection{Linear Models} To a statistician, the term \emph{linear model} (LM) means simple linear regression or multiple linear regression with the usual assumptions (IID normal errors having mean zero). This includes analysis of variance (ANOVA), which is just multiple linear regression when all predictors are categorical. More precisely, these usual assumptions are \begin{equation} \label{eq:regression} y_i = \beta_1 + \beta_2 x_{i 1} + \beta_3 x_{i 2} + \cdots + \beta_p x_{i, p - 1} + e_i \end{equation} where the observed data for the $i$-th individual comprises \begin{itemize} \item $y_i$ (response variable) and \item $x_{i 1}$, $\ldots,$ $x_{i p}$ (predictor variables), \end{itemize} where \begin{itemize} \item $e_i$ is an unobserved random variable (error), the error variables for all individuals being assumed IID normal with mean zero, \end{itemize} and where \begin{itemize} \item $\beta_1$, $\ldots,$ $\beta_p$ are unknown parameters, \item and the error variance is another unknown parameter. \end{itemize} In the original theory \citep{galton,galton-dickson} that gave the methodology its name ``regression'' and the phenomenon ``regression towards the mean'' the individuals were assumed IID with $(y_i, x_{i 1}, \ldots, x_{i p})$ a multivariate normal random vector. Then the conditional distribution of $y_i$ given $x_{i 1}$, $\ldots,$ $x_{i p}$ is normal with mean $$ \beta_1 + \beta_2 x_{i 1} + \beta_3 x_{i 2} + \cdots + \beta_p x_{i, p - 1} $$ (for some parameters $\beta_1$, $\ldots,$ $\beta_p$) and variance that does not depend on $x_{i 1}$, $\ldots,$ $x_{i p}$, that is, the regression equation \eqref{eq:regression} holds. \citet{galton-dickson} showed this only for the $p = 1$ case, but this is now known to be a theorem about the multivariate normal distribution \citep[Theorem~2.5.1]{anderson} that holds for general $p$. 
But later authors realized that only the conditional distribution mattered. It does not matter what the marginal distribution of $x_{i 1}$, $\ldots,$ $x_{i p}$ is if statistical inference is done conditionally. This also allows for the case where some or all of the predictor variables are not random but fixed by experimental design. For example, in ANOVA $x_{i 1}$, $\ldots,$ $x_{i p}$ are all dummy variables that indicate categories, and they aren't random. So modern practice is to assume nothing about the marginal distribution of $x_{i 1}$, $\ldots,$ $x_{i p}$ or even that they are random. The only assumptions are that \eqref{eq:regression} holds with the $e_i$ IID mean-zero normal. It is important to understand that it is called a linear model because \eqref{eq:regression} is linear in the $\beta$'s not because it is linear in the $x$'s. In fact, when nonlinear functions of the data are used for predictors, as in polynomial regression when $$ y_i = \beta_1 + \beta_2 x_i + \beta_3 x_i^2 + \cdots + \beta_p x_i^{p - 1} + e_i $$ it is still a linear model (because it is linear in the $\beta$'s). In fact, one can (as in polynomial regression) make up new predictor variables that are functions of the originally given predictor variables. We do this all the time (dummy variables corresponding to categorical predictors, ``interaction'' terms, and lots more). So $x_{i 1}$, $\ldots,$ $x_{i p}$ don't have to be the originally given data, they can be any made-up functions of the originally given data. But \eqref{eq:regression} is very unwieldy notation. The notation becomes much simpler and also much more powerful if we shift to matrix notation. We do this in two steps. First we notice that in \eqref{eq:regression} we need not treat $\beta_1$ as special, because \eqref{eq:regression} is the special case of \begin{equation} \label{eq:regression-general} y_i = \beta_1 x_{i 1} + \beta_2 x_{i 2} + \beta_3 x_{i 3} + \cdots + \beta_p x_{i p} + e_i \end{equation} where $x_{i 1} = 1$ for all $i$. Since \eqref{eq:regression-general} is more general than \eqref{eq:regression} we shift to that. Now we shift to matrix notation. If \begin{itemize} \item $y$ is the random vector having components $y_i$, \item $M$ is the non-random matrix having components $x_{i j}$ (nonrandom because the inference is conditional on the $x_{i j}$), \item $\beta$ is the non-random (parameter) vector having components $\beta_j$, \item $e$ is the random vector having components $e_i$, \end{itemize} then \eqref{eq:regression-general} can be written \begin{equation} \label{eq:regression-matrix} y = M \beta + e. \end{equation} As an example of the power of matrix notation the maximum likelihood estimate (MLE) of $\beta$ can be written \begin{equation} \label{eq:linear-model-mle} \hat{\beta} = (M^T M)^{- 1} M^T y \end{equation} but we won't even give a reference for that (it can be found in any book on linear regression) because it is irrelevant to this course. For us, linear models are just a step on the way to generalized linear models (GLM) and exponential family canonical affine models (about which more below). The only point (for us) of \eqref{eq:linear-model-mle} is that you couldn't express it without matrix notation. Other advantages of matrix notation will arise as we go along. The matrix $M$ is called the \emph{model matrix} by some statisticians and the \emph{design matrix} by other statisticians. 
In a designed experiment in which none of the predictors are random, rather they are all fixed by the experimental design, $M$ incorporates everything about the design that is relevant to statistical data analysis (hence the name ``design matrix''). In an observational study in which some of the predictors may be random, but are treated as fixed in the statistical analysis (because conditioning on a variable is the same as treating it as fixed) and nothing was ``designed,'' the name ``model matrix'' seems more appropriate. Your humble author being a computer nerd, always says ``model matrix'' because that is what R calls it (the R function \texttt{model.matrix} is what turns R formulas into model matrices; ordinary users never call this function, but it is called inside R functions like \texttt{lm} and \texttt{glm} to do this job). \subsection{Canonical Affine Submodels} \label{sec:submodel} The notion of canonical affine submodels of exponential families was only proposed recently \citep{geyer-wagenius-shaw} but in hindsight is only a generalization of generalized linear models like logistic regression \citep[Chapter~5]{agresti} and Poisson regression with log link \citep[Section~4.3]{agresti} and of log-linear models for contingency tables \citep[Chapters~9 and~10]{agresti} that had long been used. \citet{geyer-wagenius-shaw} is about a class of exponential family models they call ``aster models'' that are like generalized linear models except that they allow \begin{itemize} \item different components of the response vector are allowed to have different families, some Bernoulli, some Poisson, some normal, etc., \item and components of the response vector that are measurements on the same individual can be dependent with the dependency structure specified by a graphical model. \end{itemize} But we don't want to talk about aster models now (maybe another handout if there is time). We just wanted to explain why generalized linear models had to be generalized. A \emph{canonical affine submodel} of an exponential family arises when the canonical parameter vector of an exponential family is modeled affinely \begin{equation} \label{eq:canonical-affine-submodel} \theta = a + M \beta \end{equation} where \begin{itemize} \item $\theta$ is the canonical parameter of the full model, \item $a$ is a non-random vector called the \emph{offset vector}, \item $M$ is a non-random matrix called the \emph{model matrix}, \item and $\beta$ is the submodel parameter vector. \end{itemize} The name comes from $\theta$ being the canonical parameter and it being an affine function of $\beta$. When $a = 0$, which is the usual case, we say \emph{canonical linear submodel} because then $\theta$ is a linear function of $\beta$. The R functions \texttt{lm} and \texttt{glm} have an \texttt{offset} argument that, when used, would change the name to ``affine model'' or ``generalized affine model'' if one wanted to be pedantic, but nobody says this. Since offsets are very rarely used, this doesn't matter much. But the ``linear'' in LM and GLM is pedantically correct for models without offsets and sloppily correct (using ``linear'' in the sense most people use it) even with offsets. Now we come to the theory of canonical affine submodels, which is very simple. 
Plugging \eqref{eq:canonical-affine-submodel} into \eqref{eq:logl} we get the submodel log likelihood \begin{align*} l(\beta) & = \inner{y, a + M \beta} - c(a + M \beta) \\ & = y^T (a + M \beta) - c(a + M \beta) \\ & = y^T a + y^T M \beta - c(a + M \beta) \\ & = y^T a + \beta^T M^T y - c(a + M \beta) \\ & = y^T a + (M^T y)^T \beta - c(a + M \beta) \\ & = \inner{y, a} + \inner{M^T y, \beta} - c(a + M \beta) \end{align*} and we are allowed to drop the additive term that does not contain the parameter $\beta$ obtaining \begin{equation} \label{eq:logl-cam} l(\beta) = \inner{M^T y, \beta} - c(a + M \beta) \end{equation} which we see again has exponential family form, hence the submodel is itself an exponential family with \begin{itemize} \item canonical statistic vector $M^T y$, \item canonical parameter vector $\beta$, \item and cumulant function defined by \begin{equation} \label{eq:cumfun-sub} c_{\text{sub}}(\beta) = c(a + M \beta). \end{equation} \end{itemize} The reader may be thinking, so what? But as we shall see, exponential family theory is very powerful. It has a lot of consequences we haven't even hinted at yet. For now all we know is that $M^T y$, being the submodel canonical statistic vector, is a sufficient statistic vector for the submodel. (It contains all of the information in the whole data that is useful for estimating $\beta$). \subsection{Sufficient Dimension Reduction} There is a lot of local interest in the School of Statistics at the University of Minnesota in the ``sufficient dimension reduction'' program \citep{cook-weisberg,cook,cook-adragni}. This is not the place to explain that. We just want to remember that the original sufficient dimension reduction was Fisher's notion of a sufficient statistic (which Cook and co-authors are also using) and Koopman-Pitman-Darmois theory which says that exponential families have the sufficient dimension reduction property in IID sampling. In exponential family contexts, the fact that the submodel canonical statistic vector is sufficient for a canonical affine submodel explains why linear models, some generalized linear models (logistic regression and Poisson regression with log link, for example), and log-linear models for categorical data also do ``sufficient dimension reduction.'' Of course, the point of the sufficient dimension reduction program of Cook and co-authors is to do sufficient dimension reduction nonparametrically without assuming a particular parametric submodel is correct. So these notions are a bit different. But they are very closely related. \section{Examples II} \label{sec:examples-ii} \subsection{Logistic Regression} \label{sec:logistic-regression} Suppose we have regression-like data, each individual has a response $y_i$ and a bunch of predictors $x_{i 1}$, $\ldots,$ $x_{i p}$ (possibly some originally given and some made up) and the $y_i$ are all Bernoulli (zero-or-one-valued). As with linear models, we collect the $y_i$ into a vector $y$ and collect the predictors into a model matrix $M$. Now, however, the ``response equals mean function plus error'' formalism used in linear models (equations \eqref{eq:regression}, \eqref{eq:regression-general}, and \eqref{eq:regression-matrix} above) makes no sense for two reasons. \begin{itemize} \item Modeling means linearly makes no sense because they are bounded. Write $\mu_i = E(y_i)$. Then $0 \le \mu_i \le 1$. But if we collect the means into a vector $\mu$ and model them as $\mu = M \beta$, we will get values of $\mu$ that violate the constraints. 
One might think there would be some mathematical magic that could make this idea work (constrained optimization perhaps), but there is none found in the literature, which suggests it is a hopeless idea.
\item The ``means plus error'' idea is also bad because the ``errors'' $y_i - \mu_i$ do not have any distribution with well studied properties.
\end{itemize}
In short, to make progress, we need to ditch two key ideas of linear models
\begin{itemize}
\item response equals means plus error,
\item and means are linear functions of the parameters.
\end{itemize}
What to do? Exponential families and sufficient dimension reduction to the rescue. The distribution of each component $y_i$ of the response vector is exponential family (Bernoulli is binomial with sample size one, and we have seen that binomial is an exponential family, Section~\ref{sec:binomial}). If we assume (as in LM) that components of the response are independent (or conditionally independent given the predictors if we are fussy about thinking of the predictors as possibly random variables that we are treating as fixed by conditioning on them), then the joint distribution of the $y_i$ is the product of the marginals and the distribution of the whole vector $y$ is exponential family. Introducing the canonical parameters
$$ \theta_i = \logit(\mu_i), \qquad \text{for all $i$}, $$
the log likelihood is
$$ l(\theta) = \sum_{i = 1}^n \bigl[ y_i \theta_i - c(\theta_i) \bigr] $$
where $c$ is the cumulant function for the Bernoulli family (the case $n = 1$ of equation \eqref{eq:cumfun-binomial}), and this has exponential family form
$$ l(\theta) = \inner{y, \theta} - c(\theta) $$
where we collect the $\theta_i$ into a vector $\theta$ and define
$$ c(\theta) = \sum_{i = 1}^n c(\theta_i) $$
using what mathematicians call ``abuse of notation'' in that we are using the same letter $c$ for two different functions; on the left-hand side it is a function of the vector $\theta$, but on the right-hand side it is a function of the scalar $\theta_i$.
Canonical linear models $\theta = M \beta$ (when there is no offset, the usual case) or canonical affine models $\theta = a + M \beta$ (when there is an offset) have the sufficient dimension reduction property.
This idea is what is usually called ``logistic regression'' for reasons that will be explained when we get to GLM theory. This is a really good idea. These models work great and are widely used.
\subsection{Poisson Regression with Log Link} \label{sec:poisson-regression}
Apply the same ideas to Poisson data. Now we assume the $y_i$ are independent (or conditionally independent given the predictors if being pedantic) Poisson. We get exactly the same story as in the preceding section except the canonical parameters for the Poisson
\begin{equation} \label{eq:poisson-regression-log-link}
\theta_i = \log(\mu_i), \qquad \text{for all $i$},
\end{equation}
are different from those for the Bernoulli, as are the cumulant functions (equation \eqref{eq:cumfun-poisson}). Everything else is the same as in the preceding section. Canonical linear submodels and canonical affine submodels have the sufficient dimension reduction property. And this too is a really good idea. These models work great and are widely used.
The term ``log link'' won't make sense until we get to GLM. The ``log'' refers to the $\log$ in \eqref{eq:poisson-regression-log-link}. The ``link'' is a term of GLM theory.
\section{Maximum Likelihood Estimation}
Maximum likelihood estimation is much nicer in exponential families than in general statistical models.
The reason is that cumulant functions are convex functions \citep[Theorem~7.1]{barndorff-nielsen} and consequently log likelihoods are concave functions. Moreover, they are strictly convex and strictly concave (respectively) if the canonical statistic vector is not concentrated on a hyperplane, that is, if the components of the canonical statistic vector do not satisfy a linear constraint that holds with probability one \citep[again][Theorem~7.1]{barndorff-nielsen}. This is not the place to define convexity and concavity. There are many equivalent characterizations \citep[Theorems~2.13 and~2.14]{rockafellar-wets}. We will just explain the important consequences. \subsection{Identifiability} A parameterization is \emph{identifiable} if each different parameter value corresponds to a different distribution. Every parameterization we have talked about in this document is identifiable except for one (the ``try III'' parameterization for the multinomial). Identifiability in full exponential families is fairly simple, described by Theorem~1 in \citet{geyer-gdor}, which we condense. \begin{theorem} \label{th:directions-of-constancy} For a full exponential family with canonical statistic vector $y$, canonical parameter vector $\theta$, canonical parameter space $\Theta$, and log likelihood $l$, \begin{enumerate} \item[\normalfont (a)] the canonical parameterization fails to be identifiable if and only if there exists a nonzero vector $\delta$ such that $\inner{y, \delta}$ is constant, \item[\normalfont (b)] and this happens if and only if $\theta$ and $\theta + r \delta$ correspond to the same distribution for all real $r$ and all $\theta \in \Theta$, \item[\normalfont (c)] and this happens if and only if $l(\theta + r \delta)$ is a constant function of $r$ for all real $r$ and all $\theta \in \Theta$. \end{enumerate} \end{theorem} Any vector $\delta$ satisfying these conditions is called a \emph{direction of constancy} of the family by \citet{geyer-gdor}. Every parameterization discussed in this document is identifiable except one. The try III parameterization of the multinomial distribution is not identifiable by part (a) of the theorem because of the constraint $$ \sum_{i = 1}^k y_i = n, $$ which can be written in vector notation $$ \inner{y, \delta} = n, $$ where $\delta = (1, 1, \ldots, 1)$. So this $\delta$ is a direction of constancy of the multinomial family of distributions with the ``try III'' canonical parameterization. \subsection{Mean-Value Parameters} For a regular full exponential family, use \eqref{eq:bart-i} to define \begin{equation} \label{eq:canonical-to-mean-value} \mu = E_\theta(y) = \nabla c(\theta). \end{equation} The vector $\mu$ is called the \emph{mean-value parameter vector}. \begin{theorem} The mean value parameterization of a regular full exponential family is always identifiable: every distribution in the family has a mean vector, and different distributions have different mean vectors. \end{theorem} This is explained by \citet[p.~121]{barndorff-nielsen}. What this says is that if \begin{align*} \mu_1 & = \nabla c(\theta_1) \\ \mu_2 & = \nabla c(\theta_2) \end{align*} and $\mu_1 = \mu_2$, then $\theta_1 - \theta_2$ is a direction of constancy. Let us look at mean-value parameter for canonical affine submodels (Section~\ref{sec:submodel}). In this context, it helps to introduce the name \emph{saturated model} for the original model that the submodels are submodels of. 
Recall that the saturated model canonical parameter vector $\theta$ and the submodel canonical parameter $\beta$ are related by \eqref{eq:canonical-affine-submodel}. By the multivariable chain rule
\begin{equation} \label{eq:multivariable-chain-rule-one}
\frac{\partial c_{\text{sub}}(\beta)}{\partial \beta_j} = \frac{\partial c(\theta)}{\partial \beta_j} = \sum_i \frac{\partial c(\theta)}{\partial \theta_i} \frac{\partial \theta_i}{\partial \beta_j} = \sum_i \mu_i m_{i j}
\end{equation}
where $m_{i j}$ are the components of the model matrix $M$. And this can be written in matrix notation
$$ \nabla c_{\text{sub}}(\beta) = M^T \mu. $$
Of course, this has to turn out this way. We know the canonical statistic of the submodel is $M^T y$, so its mean has to be $M^T \mu$ by linearity of expectation. Let us denote the submodel mean-value parameter by $\tau$. Then the relations between the submodel and saturated model parameters are
\begin{align*}
\theta & = a + M \beta \\
\tau & = M^T \mu
\end{align*}
\subsection{Canonical versus Mean-Value Parameters}
A quote from my master's level theory notes
\begin{quote}
Parameters are meaningless quantities. Only probabilities and expectations are meaningful.
\end{quote}
Of course, some parameters are probabilities and expectations, but most exponential family canonical parameters are not. A quote from \emph{Alice in Wonderland}
\begin{quote}
`If there's no meaning in it,' said the King, `that saves a world of trouble, you know, as we needn't try to find any.'
\end{quote}
Realizing that canonical parameters are meaningless quantities ``saves a world of trouble.'' We ``needn't try to find any.'' We have to have canonical parameters because the PMF and canonical affine submodels are specified in terms of them. But we want our interpretations to be in terms of mean-value parameters, since those are the ones directly related to probabilities and expectations.
This is even more obvious when we specify canonical linear submodels using the R formula mini-language and we have categorical predictors. There are a great many arbitrary choices for the canonical parameters of a canonical linear submodel (Section~\ref{sec:arbitrary} below). But expected cell counts (mean-value parameters) do not change so long as the submodel is the same.
\subsection{Maximum Likelihood}
\begin{theorem}
For a regular full exponential family, MLE need not exist, but if they do they are points where the first derivative vector of the log likelihood is zero. Moreover, every point where the first derivative vector of the log likelihood is zero is an MLE\@. If the canonical parameterization is identifiable, then the MLE for the canonical parameter vector is unique. MLE for the mean-value parameter are always unique. If the canonical parameterization is not identifiable and $\hat{\theta}_1$ and $\hat{\theta}_2$ are distinct MLE for the canonical parameter, then $\hat{\theta}_1 - \hat{\theta}_2$ is a direction of constancy. Consequently, different MLE always correspond to the same probability distribution.
\end{theorem}
This is Corollary~2 in \citet{geyer-gdor}. With the log likelihood given by \eqref{eq:logl}, the first derivative vector is
\begin{align*}
\nabla l(\theta) & = y - \nabla c(\theta) \\
& = y - E_\theta(y) \\
& = y - \mu
\end{align*}
where in the last line we need to think of the mean-value parameter as a function of the canonical parameter.
Setting this equal to zero and solving gives
\begin{equation} \label{eq:observed-equal-expected-one}
\hat{\mu} = y
\end{equation}
This says
\begin{quote}
the MLE for the mean-value parameter ($\hat{\mu}$) is the observed value of the canonical statistic vector $y$
\end{quote}
or, for short,
\begin{quote}
observed equals expected.
\end{quote}
This is the most important property of maximum likelihood. It is the key to interpreting MLE for regular full exponential families.
Let us look at maximum likelihood for canonical affine submodels (Section~\ref{sec:submodel}). By \eqref{eq:multivariable-chain-rule-one}
\begin{equation} \label{eq:score-submodel}
\nabla l_{\text{sub}}(\beta) = M^T (y - \mu) = M^T y - \tau.
\end{equation}
Again the MLE is a point where the first derivative vector of the log likelihood is zero, so now we have
\begin{equation} \label{eq:observed-equal-expected-sub}
\hat{\tau} = M^T \hat{\mu} = M^T y
\end{equation}
and this is
\begin{quote}
a canonical affine submodel of a regular full exponential family is itself a regular full exponential family, so the MLE for its mean-value parameter ($M^T \hat{\mu}$) is the observed value of its canonical statistic vector $M^T y$
\end{quote}
and we still say ``observed equals expected'' for short.
Equation \eqref{eq:observed-equal-expected-sub} is not useful for computation. The PMF of the family are defined in terms of the canonical parameters \eqref{eq:pmdf-full}. So we need MLE for canonical parameters, and the only way to find them is, in general, to maximize the log likelihood. But ``observed equals expected'' is the key to \emph{interpreting} MLE for regular full exponential families. It is the only simple property the MLE have. In regard to interpreting MLE, the mean-value parameters are meaningful and the canonical parameters meaningless.
\subsection{Fisher Information}
Applying the multivariable chain rule again to \eqref{eq:multivariable-chain-rule-one} gives
\begin{align*}
\frac{\partial^2 c_{\text{sub}}(\beta)}{\partial \beta_j \partial \beta_k} & = \sum_i \sum_r \frac{\partial^2 c(\theta)}{\partial \theta_i \partial \theta_r} \frac{\partial \theta_i}{\partial \beta_j} \frac{\partial \theta_r}{\partial \beta_k} + \sum_i \frac{\partial c(\theta)}{\partial \theta_i} \frac{\partial^2 \theta_i}{\partial \beta_j \partial \beta_k} \\
& = \sum_i \sum_r \frac{\partial^2 c(\theta)}{\partial \theta_i \partial \theta_r} m_{i j} m_{r k}
\end{align*}
where the second term in the first line vanishes because \eqref{eq:canonical-affine-submodel} is affine, so the second partial derivatives of $\theta$ with respect to $\beta$ are zero. And this can be written in matrix notation
$$
\nabla^2 c_{\text{sub}}(\beta) = M^T \bigl[ \nabla^2 c(\theta) \bigr] M
$$
Remembering the formula for Fisher information
$$
I(\theta) = - E \{ \nabla^2 l(\theta) \}
$$
we see that for exponential families we have
\begin{align*}
\nabla l(\theta) & = y - \nabla c(\theta) \\
\nabla^2 l(\theta) & = - \nabla^2 c(\theta)
\end{align*}
and the expectation of a constant is that constant, so
$$
I(\theta) = \nabla^2 c(\theta)
$$
that is, the Fisher information matrix for the canonical parameter is the second derivative matrix of the cumulant function.
Using our work above on submodels, we see that
\begin{equation} \label{eq:fisher-submodel}
I(\beta) = M^T I(\theta) M
\end{equation}
where we are indulging in ``abuse of notation'' in using the same letter for two different functions. Here $I(\beta)$ is the Fisher information matrix for the submodel canonical parameter and $I(\theta)$ is the Fisher information matrix for the saturated model canonical parameter.
Perhaps it would be better to write $$ I_{\text{sub}}(\beta) = M^T I_{\text{sat}}(\theta) M $$ and we are also assuming that $\theta$ and $\beta$ are related by \eqref{eq:canonical-affine-submodel}, that $\theta$ and $\beta$ are values of the saturated model and submodel canonical parameters that correspond to the same distribution. For the canonical parameter vector of a regular full exponential family there is no difference between observed and expected Fisher information, because observed Fisher information is already nonrandom. Now we (or the computer) are ready to do any form of likelihood inference. We know the log likelihood and Fisher information, and everything we want to do is a function of them (Wilks, Rao, Wald hypothesis tests and confidence intervals). \subsection{Sufficient Dimension Reduction Revisited} Any one-to-one function of a sufficient statistic vector is another sufficient statistic vector. Thus if we are using identifiable parameterizations, MLE of all parameters are sufficient statistic vectors. So are other one-to-one functions of any of the sufficient statistic vectors we know. \subsection{Examples III} \label{sec:examples-iii} \subsubsection{Data} We use the example data from Table~2.6 in \citet{agresti} which is about death penalty verdicts in court cases cross-classified by race of victim and race of defendant. Rather than put the data in a table the way \citeauthor{agresti} does, we put them in the form that the R function \texttt{glm} likes. For binomial data, the response is specified ``as a two-column matrix with the columns giving the numbers of successes and failures'' (quoted from \texttt{help(glm)}). Here we, perhaps insensitively, have coded the first column in \citeauthor{agresti}'s table as ``success'' and that happens to be death penalty = \texttt{"yes"}. Mathematically, it doesn't matter which we call success and which failure so long as we keep track of which is which. The categorical predictors \texttt{victim} and \texttt{defendant} are vectors of class \texttt{factor}, as is usual for regression on categorical variables. <>= victim <- factor(rep(c("white", "black"), each = 2)) defendant <- factor(rep(c("white", "black"), times = 2)) deathpenalty <- matrix(c(53, 11, 0, 4, 414, 37, 16, 139), ncol = 2, dimnames = list(NULL, c("yes", "no"))) @ \pagebreak[3] We check that our data matches the data in \citet{agresti}. <>= data.frame(victim, defendant, deathpenalty) @ \subsubsection{Fit} Now we can try a logistic regression. <>= gout <- glm(deathpenalty ~ victim + defendant, family = binomial, x = TRUE) gout.summary <- summary(gout) gout.summary @ This is an example of logistic regression with response vector having dimension four and model matrix <>= modmat <- gout$x modmat @ \subsubsection{Observed Equals Expected} Now we want to check ``observed equals expected'' (it must check, of course, unless the R function \texttt{glm} is buggy, which we don't expect --- we just want to see for ourselves). <>= y <- deathpenalty[ , "yes"] n <- rowSums(deathpenalty) mu <- predict(gout, type = "response") y n mu @ What the R function \texttt{glm} is calling mean values obviously aren't. They appear to be probabilities rather than means. That's right. Quoting from \texttt{help(predict.glm)} \begin{center} \begin{tabular}{rp{4in}} type: & the type of prediction required. The default is on the scale of the linear predictors; the alternative \texttt{"response"} is on the scale of the response variable. 
Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and \texttt{type = "response"} gives the predicted probabilities.
\end{tabular}
\end{center}
So we need to multiply the probabilities by the sample sizes, since $\mu = n p$,
<>=
mu <- n * mu
@
and now we check
<>=
all.equal(t(modmat) %*% mu, t(modmat) %*% y)
@
We use the R function \texttt{all.equal}, which allows for small error due to inexactness of computer arithmetic, to compare ``real'' numbers in the computer (understanding that the computer's ``real'' numbers are not real real numbers).
We see the fits from the R function \texttt{glm} with \texttt{family = binomial} and the default link, which is \texttt{"logit"}, do indeed satisfy the observed equals expected property.
And what are the elements of the submodel canonical statistic $M^T y$? Obviously
\begin{itemize}
\item total number of death-penalty verdicts
\item total number of death-penalty verdicts when victim was white
\item total number of death-penalty verdicts when defendant was white
\end{itemize}
Subtracting either of the latter two from the first gives
\begin{itemize}
\item total number of death-penalty verdicts when victim was black
\item total number of death-penalty verdicts when defendant was black
\end{itemize}
So observed and (MLE estimates of) expected for these quantities are also equal.
And subtracting any of these from the corresponding sample sizes (which are considered non-random for binomial GLM) gives
\begin{itemize}
\item total number of non-death-penalty verdicts
\item total number of non-death-penalty verdicts when victim was white
\item total number of non-death-penalty verdicts when defendant was white
\item total number of non-death-penalty verdicts when victim was black
\item total number of non-death-penalty verdicts when defendant was black
\end{itemize}
So observed equals expected for these quantities too.
\subsubsection{Wald Tests}
<>=
pval <- gout.summary$coefficients[ , "Pr(>|z|)"]
pval.exponent <- floor(log10(pval))
pval.base <- pval / 10^pval.exponent
@
The printout of the R function \texttt{summary.glm} gives the $P$-values for (two-tailed) Wald tests of null hypotheses that the various parameters are zero. The Wald test comparing this model (alternative hypothesis) with its submodel that sets the submodel canonical parameter named \texttt{victimwhite} to zero says this parameter is highly statistically significant ($P = \Sexpr{round(pval.base["victimwhite"], 3)} \times 10^{\Sexpr{pval.exponent["victimwhite"]}}$) and the analogous test for the submodel canonical parameter named \texttt{defendantwhite} says this parameter is statistically significant ($P = \Sexpr{round(pval["defendantwhite"], 4)}$). The other parameter is uninteresting.
Of course, as sophisticated statisticians, we all know that correlation is not causation and regression isn't either. Since this is not data from a controlled experiment, it only shows association not causation. But what the $P$-values do say is that if we dropped the predictor variable \texttt{victim} from the model, that model would not fit the data as well. And similarly for dropping \texttt{defendant}.
Why there were statistically significantly more death penalty verdicts in these data when the victim was white and statistically significantly fewer death penalty verdicts in these data when the defendant was white is something these data cannot answer. Given American history, one suspects racial prejudice, but by whom, when, or where is not something these data say anything about.
Nor is it even certain, from these data, that that is the causal explanation. Correlation is not causation even when you don't have a story to explain why not.
We should also comment on something that is, again, obvious to us as sophisticated statisticians: these procedures are approximate, and asymptotic distributions are nearly correct for large $n$ in terms of \emph{absolute error} ($\abs{\text{approximate} - \text{exact}}$) not \emph{relative error} ($\abs{\text{approximate} - \text{exact}} / \abs{\text{exact}}$). A number like $P = \Sexpr{round(pval.base["victimwhite"], 3)} \times 10^{\Sexpr{pval.exponent["victimwhite"]}}$ is only guaranteed to have small absolute error. The correct interpretation is that the true $P$-value is very small, less than $0.001$, say, if $n$ is large enough. There is no justification for taking even the order of magnitude of the asymptotic $P$-value seriously. That would be assuming small relative error, not small absolute error. No matter how small the asymptotic $P$-values are, say $10^{- 12}$, the interpretation is still the same: very small, less than 0.001, say, if $n$ is large enough.
The farther the observed value of the test statistic is out in the tail of its asymptotic distribution under the null hypothesis, the worse the asymptotic approximation is in terms of relative error. Here, with the sample sizes
<>=
n
@
not humongous, maybe we shouldn't be sure that $P \approx \Sexpr{round(pval.base["victimwhite"], 3)} \times 10^{\Sexpr{pval.exponent["victimwhite"]}}$ even means $P < 0.001$.
\subsubsection{Wald Confidence Intervals}
We extract the estimates and standard errors from the output of \texttt{summary.glm}
<>=
beta <- gout.summary$coefficients[ , "Estimate"]
beta.stderr <- gout.summary$coefficients[ , "Std. Error"]
@
and then the Wald confidence intervals for confidence level
<>=
conf.level <- 0.95
@
are
<>=
crit <- qnorm((1 + conf.level) / 2)
crit
beta["victimwhite"] + c(-1, 1) * crit * beta.stderr["victimwhite"]
beta["defendantwhite"] + c(-1, 1) * crit * beta.stderr["defendantwhite"]
@
\subsubsection{Likelihood Ratio Tests} \label{sec:lrt}
To do likelihood ratio (Wilks) tests, we need to actually fit the models that drop these parameters.
<>=
drop1(gout, ~ victim, test = "LRT")
@
Since your humble author does not regularly use the R function \texttt{drop1} and wants to check that this is indeed doing what he thinks it is doing, we check it.
<>=
gout.no.victim <- glm(deathpenalty ~ defendant, family = binomial)
anova(gout.no.victim, gout, test = "Chisq")
@
We do indeed get the same test statistic and $P$-value both ways, so the following must be correct too.
<>=
drop1(gout, ~ defendant, test = "LRT")
@
Note that the Wald and Wilks tests do not agree exactly, although they do agree approximately.
\subsubsection{Likelihood Ratio Confidence Intervals}
The R recommended package \texttt{MASS} \citep{MASS-package}, which goes with the book \citet{venables-ripley}, has a function \texttt{confint.glm} which purports to do likelihood-based confidence intervals (and we saw that it agreed with clearly correct calculations in the one-dimensional case in a previous handout). Let's try that
<>=
library(MASS)
confint(gout)
@
Again, these agree approximately but not exactly with the Wald intervals.
\subsubsection{Rao Tests}
The R function \texttt{drop1} also has an option that purports to do Rao tests.
<>=
drop1(gout, ~ victim, test = "Rao")
drop1(gout, ~ defendant, test = "Rao")
@
Again, these agree approximately but not exactly with the Wald and Wilks tests.
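As a small check of the theory, \eqref{eq:fisher-submodel} says the submodel Fisher information is $M^T I(\theta) M$, and the Wald standard errors above come from the inverse of this matrix. We can compare it with the variance matrix that \texttt{glm} computed (a sketch; it uses the standard fact, not derived in these notes, that for binomial sampling with the canonical logit link the saturated model Fisher information is diagonal with entries $n_i p_i (1 - p_i)$).
<<fisher-information-check>>=
phat <- gout$fitted.values    # fitted success probabilities
mmat <- gout$x                # model matrix (we asked glm for x = TRUE)
# saturated model Fisher information for the canonical (logit) parameters
info.sat <- diag(n * phat * (1 - phat))
# submodel Fisher information via the formula M^T I(theta) M
info.sub <- t(mmat) %*% info.sat %*% mmat
# glm reports the inverse of this matrix as the variance of the MLE,
# so the two should agree up to the convergence tolerance of the fit
all.equal(info.sub, solve(vcov(gout)), check.attributes = FALSE,
    tolerance = 1e-6)
@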
\subsubsection{Summary} <>= foompter <- drop1(gout, ~ victim, test = "LRT") tstat.wilks <- foompter[["LRT"]][2] pval.wilks <- foompter[["Pr(>Chi)"]][2] foompter <- drop1(gout, ~ defendant, test = "LRT") tstat.wilks <- c(tstat.wilks, foompter[["LRT"]][2]) pval.wilks <- c(pval.wilks, foompter[["Pr(>Chi)"]][2]) foompter <- drop1(gout, ~ victim, test = "Rao") tstat.rao <- foompter[["Rao score"]][2] pval.rao <- foompter[["Pr(>Chi)"]][2] foompter <- drop1(gout, ~ defendant, test = "Rao") tstat.rao <- c(tstat.rao, foompter[["Rao score"]][2]) pval.rao <- c(pval.rao, foompter[["Pr(>Chi)"]][2]) pval.wilks.exponent <- floor(log10(pval.wilks)) pval.wilks.base <- pval.wilks / 10^pval.wilks.exponent pval.rao.exponent <- floor(log10(pval.rao)) pval.rao.base <- pval.rao / 10^pval.rao.exponent pval <- pval[-1] pval.base <- pval.base[-1] pval.exponent <- pval.exponent[-1] tstat <- gout.summary$coefficients[ , "z value"] tstat <- tstat^2 tstat <- tstat[-1] wald.vic.ci <- beta["victimwhite"] + c(-1, 1) * crit * beta.stderr["victimwhite"] wald.def.ci <- beta["defendantwhite"] + c(-1, 1) * crit * beta.stderr["defendantwhite"] foompter <- confint(gout) foompter <- foompter[-1, ] @ Test statistics and asymptotic $P$-values for test of dropping \texttt{victim} from the model \begin{center} \begin{tabular}{l|cc} & Test Statistic & $P$-value \\ \hline Wald & \Sexpr{formatC(tstat[1], format = "f", digits = 2)} & $\Sexpr{formatC(pval.base[1], format = "f", digits = 2)} \times 10^{\Sexpr{pval.exponent[1]}}$ \\ Wilks & \Sexpr{formatC(tstat.wilks[1], format = "f", digits = 2)} & $\Sexpr{formatC(pval.wilks.base[1], format = "f", digits = 2)} \times 10^{\Sexpr{pval.wilks.exponent[1]}}$ \\ Rao & \Sexpr{formatC(tstat.rao[1], format = "f", digits = 2)} & $\Sexpr{formatC(pval.rao.base[1], format = "f", digits = 2)} \times 10^{\Sexpr{pval.rao.exponent[1]}}$ \end{tabular} \end{center} Test statistics and asymptotic $P$-values for test of dropping \texttt{defendant} from the model \begin{center} \begin{tabular}{l|cc} & Test Statistic & $P$-value \\ \hline Wald & \Sexpr{formatC(tstat[2], format = "f", digits = 2)} & \Sexpr{formatC(pval[2], format = "f", digits = 4)} \\ Wilks & \Sexpr{formatC(tstat.wilks[2], format = "f", digits = 2)} & \Sexpr{formatC(pval.wilks[2], format = "f", digits = 4)} \\ Rao & \Sexpr{formatC(tstat.rao[2], format = "f", digits = 2)} & \Sexpr{formatC(pval.rao[2], format = "f", digits = 4)} \end{tabular} \end{center} 95\% confidence intervals for submodel canonical parameter named \texttt{victimwhite} \begin{center} \begin{tabular}{l|cc} & lower & upper \\ \hline Wald & \Sexpr{formatC(wald.vic.ci[1], format = "f", digits = 2)} & \Sexpr{formatC(wald.vic.ci[2], format = "f", digits = 2)} \\ Wilks & \Sexpr{formatC(foompter[1, 1], format = "f", digits = 2)} & \Sexpr{formatC(foompter[1, 2], format = "f", digits = 2)} \end{tabular} \end{center} 95\% confidence intervals for submodel canonical parameter named \texttt{defendantwhite} \begin{center} \begin{tabular}{l|cc} & lower & upper \\ \hline Wald & $\Sexpr{formatC(wald.def.ci[1], format = "f", digits = 2)}$ & $\Sexpr{formatC(wald.def.ci[2], format = "f", digits = 2)}$ \\ Wilks & $\Sexpr{formatC(foompter[2, 1], format = "f", digits = 2)}$ & $\Sexpr{formatC(foompter[2, 2], format = "f", digits = 2)}$ \end{tabular} \end{center} There does not seem to be any convenient way in R to do Rao intervals (a.~k.~a.\ score intervals). 
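A note on the Wald column in the tables above: \texttt{summary.glm} reports $z$ statistics, while \texttt{drop1} reports chi-square statistics on one degree of freedom, so the code above squares the $z$ values to put all three tests on the same scale. The $P$-values are unaffected, as this little check (using objects already computed above) shows.
<<wald-z-squared>>=
z <- gout.summary$coefficients["victimwhite", "z value"]
# two-tailed normal P-value for z equals upper-tail chi-square P-value for z^2
all.equal(2 * pnorm(- abs(z)), pchisq(z^2, df = 1, lower.tail = FALSE))
@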
\subsubsection{Wald Intervals for Mean-Value Parameters}
The R function \texttt{predict.glm} also does confidence intervals for mean-value parameters.
<>=
pout <- predict(gout, se.fit = TRUE, type = "response")
low <- pout$fit - crit * pout$se.fit
hig <- pout$fit + crit * pout$se.fit
cbind(low, hig)
@
As always, there is nothing that keeps these confidence intervals in the parameter space. And we have gone out. These are confidence intervals for cell probabilities, and one of the endpoints is negative.
We can avoid this behavior by making Wald intervals for the saturated model canonical parameters and mapping them to the mean-value parameter scale.
<>=
pout <- predict(gout, se.fit = TRUE)
low <- pout$fit - crit * pout$se.fit
hig <- pout$fit + crit * pout$se.fit
invlogit <- function(theta) 1 / (1 + exp(- theta))
low <- invlogit(low)
hig <- invlogit(hig)
cbind(low, hig)
@
The latter seems better for small $n$. Of course, the two methods are asymptotically equivalent.
Note: these are confidence intervals for cell probabilities rather than expected cell counts. Each interval would have to be multiplied by the corresponding sample size to get the latter.
\subsubsection{Arbitrariness of Canonical Parameterization} \label{sec:arbitrary}
This section is a short riff on the meaninglessness of canonical parameters. It shows various ways in which specifying the model differently would have led to very different canonical parameters for \emph{exactly the same model}.
First principle: when there are any categorical predictors, models with and without intercept are the same
<>=
gout.foo <- glm(deathpenalty ~ 0 + victim + defendant, family = binomial)
coefficients(gout)
coefficients(gout.foo)
mu.foo <- n * predict(gout.foo, type = "response")
all.equal(mu.foo, mu)
@
The submodel canonical parameters don't even have the same names, and for some that do have the same names, the values have changed. But the mean values are the same, so the MLE probability distribution is the same.
Second principle: when there are multiple categorical predictors, the order of terms in the formula can matter
<>=
gout.bar <- glm(deathpenalty ~ 0 + defendant + victim, family = binomial)
coefficients(gout.foo)
coefficients(gout.bar)
mu.bar <- n * predict(gout.bar, type = "response")
all.equal(mu.bar, mu)
@
One number is the same in the canonical parameter estimates, but for parameters that are named differently! The parameters that are named the same have different numbers! But the model is still the same.
Third principle: when R sets up a factor, the default order of factor levels is alphabetical, but it doesn't have to be. Then when setting up model matrices it drops the dummy variable for the first factor level. But which was first was arbitrary.
<>=
victim.save <- victim
victim <- factor(as.character(victim), levels = rev(levels(victim)))
victim.save
victim
gout.baz <- glm(deathpenalty ~ victim + defendant, family = binomial)
coefficients(gout)
coefficients(gout.baz)
mu.baz <- n * predict(gout.baz, type = "response")
all.equal(mu.baz, mu)
@
Again, the names of the canonical parameters change. And the values of some parameters with the same name stay the same while the values of other parameters with the same name are different. But the model remains the same.
\pagebreak[3]
Put \texttt{victim} back the way it was.
<>=
victim <- victim.save
@
Fourth principle: the coding of response labels is arbitrary: which of the \texttt{deathpenalty} values is considered ``success'' and which ``failure'' is arbitrary. Reverse them.
<>=
deathpenalty.save <- deathpenalty
deathpenalty <- deathpenalty[ , 2:1]
deathpenalty
gout.qux <- glm(deathpenalty ~ victim + defendant, family = binomial)
coefficients(gout)
coefficients(gout.qux)
mu.qux <- n * predict(gout.qux, type = "response")
all.equal(n - mu.qux, mu)
@
The canonical parameters all have the same names, but all now have opposite signs. In showing that the means are the same so the model is the same, we have to take into account that \texttt{mu.qux} is predicted ``successes'' by the second definition (death penalty ``no'') and that $\texttt{mu}$ is predicted ``successes'' by the first definition (death penalty ``yes'') so one is \texttt{n} minus the other.
All of the above, and even more ways that could be illustrated with more complicated models (we could play lots of games with ``interactions'' if we had them), is what we mean when we say canonical parameters are meaningless.
\pagebreak[3]
Put \texttt{deathpenalty} back the way it was.
<>=
deathpenalty <- deathpenalty.save
@
For more examples of the arbitrariness of canonical parameterization, see Section~\ref{sec:dropping} below.
\section{Different Sampling Schemes}
A very important point about categorical data analysis (analysis of contingency table data) is that one gets the same results regardless of whether the assumed sampling scheme is ``Poisson,'' ``multinomial,'' or ``product multinomial.'' First we have to explain what these are.
\subsection{Sampling Schemes}
A \emph{partition} of a set $S$ is a family $\mathcal{A}$ of subsets of $S$ having the property that for every $x \in S$ there is exactly one $A \in \mathcal{A}$ such that $x \in A$. If $\mathcal{A}$ and $\mathcal{B}$ are two partitions of $S$, we say that $\mathcal{A}$ is \emph{finer} than $\mathcal{B}$ or that $\mathcal{B}$ is \emph{coarser} than $\mathcal{A}$ if for every $A \in \mathcal{A}$ there is exactly one $B \in \mathcal{B}$ such that $A \subset B$.
Suppose we have a contingency table with $k$ categories. Let $I$ denote the index set
$$
I = \{ 1, 2, \ldots, k \}.
$$
Here we are interested in partitions of $I$.
We can think of vectors as functions $I \to \real$. A vector $y$ is the function $i \mapsto y_i$. For any subset $A$ of $I$, let $y_A$ denote the ``subvector'' of $y$ whose components have indices in $A$. Considered as functions, $y$ and $y_A$ have the same rule $i \mapsto y_i$ but different domains: $y$ has domain $I$ and $y_A$ has domain $A$.
\label{pg:subvector}
The point is that components of $y_A$ know which components of $y$ they are equal to. If $A = \{ 3, 7, 10 \}$, then the components of $y_A$ are $y_3$, $y_7$, and $y_{10}$. We don't renumber them to have the indices be 1, 2, 3 out of some misguided notion that components of three-dimensional vectors have to be numbered 1, 2, 3 or it will make us cry (or whatever). If we don't renumber then the notation ``remembers'' which components of $y$ correspond to which components of $y_A$. If we insist on renumbering then we have lost the correspondence between components of $y$ and components of $y_A$ and have to keep track of it by some additional, and probably very clumsy, notation. What was called a ``misguided notion'' that vectors are indexed by consecutive numbers starting with one is responsible for more clumsy notation in statistics than any other idea.
The set of all such functions is denoted $\real^I$ in set theory. So $y$ is an element of $\real^I$ and $y_A$ is an element of $\real^A$.
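R's named vectors behave like this index-remembering notation: subsetting keeps the names, so the components of a ``subvector'' still know which components of the original vector they are (a small illustration with made-up numbers).
<<subvector-sketch>>=
y.toy <- c(y1 = 5, y2 = 0, y3 = 7, y4 = 2, y5 = 9,
    y6 = 1, y7 = 3, y8 = 8, y9 = 4, y10 = 6)
A <- c(3, 7, 10)
# the subvector y_A keeps the original names, hence the original indices
y.toy[A]
@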
In set theory, everything is a set. The number $d$ is the set $\{ 0, 1, 2, \ldots, d - 1 \}$. So in set theory $\real^d$ means $\real^S$ where $S = \{ 0, 1, 2, \ldots, d - 1 \}$. Thus our new notation for finite-dimensional vector spaces $\real^S$ for some finite set $S$ is essentially the same as the conventional notation $\real^d$ for some nonnegative integer $d$. The only difference is that set theory starts counting at zero instead of at one.
The kinds of sampling schemes commonly assumed for contingency tables are the following.
\begin{description}
\item[Poisson] The cell counts are independent Poisson random variables.
\item[multinomial] The cell counts are components of a multinomial random vector.
\item[product multinomial] The index set $I$ has a partition $\mathcal{A}$, and the vector $y$ of cell counts is similarly partitioned. The subvectors $y_A$ for $A \in \mathcal{A}$ are independent multinomial random vectors.
\end{description}
It is called ``product multinomial'' because independence implies the joint PMF is the product of the marginal PMF's
$$
f(y) = \prod_{A \in \mathcal{A}} f(y_A)
$$
(abusing notation by denoting different functions by the same letter $f$).
\subsection{Sampling Schemes and Conditioning}
These sampling schemes are related as follows.
\begin{itemize}
\item Multinomial is Poisson conditioned on the sample size. If $f$ is the PMF for the Poisson scheme, then the PMF for the multinomial scheme is
\begin{equation} \label{eq:multinomial-as-conditional}
\textstyle
f\left( y \,\middle|\, \sum_{i \in I} y_i = n \right)
\end{equation}
where $n$ is the sample size.
\item Product multinomial is Poisson conditioned on the sample sizes for the elements of the partition. If $f$ is the PMF for the Poisson scheme, then the PMF for the product multinomial scheme for partition $\mathcal{A}$ is
\begin{equation} \label{eq:product-multinomial-as-conditional}
\textstyle
\prod_{A \in \mathcal{A}} f\left( y_A \,\middle|\, \sum_{i \in A} y_i = n_A \right)
\end{equation}
where $n_A$ is the sample size for subvector $y_A$.
\item Product multinomial is also multinomial conditioned on the sample sizes for the elements of the partition. If $f$ is the PMF for the multinomial scheme, then the PMF for the product multinomial scheme for partition $\mathcal{A}$ is again given by \eqref{eq:product-multinomial-as-conditional}.
\item Product multinomial is also product multinomial for a coarser partition conditioned on the sample sizes for the elements of a finer partition. Suppose $\mathcal{A}$ and $\mathcal{B}$ are partitions of the index set $I$ with $\mathcal{A}$ finer than $\mathcal{B}$. If $f$ is the PMF for the product multinomial scheme for partition $\mathcal{B}$, then the PMF for the product multinomial scheme for partition $\mathcal{A}$ is again given by \eqref{eq:product-multinomial-as-conditional}.
\end{itemize}
\subsection{Sampling Schemes and Maximum Likelihood Estimates}
\subsubsection{Poisson versus Multinomial}
The key is to understand Poisson versus multinomial. When we later consider product multinomial, the principles remain the same; the math gets messier but not harder.
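Before comparing the schemes, here is a quick numerical check of the first conditioning relation above, \eqref{eq:multinomial-as-conditional}: dividing the Poisson PMF of the table by the Poisson PMF of its total gives the multinomial PMF with cell probabilities proportional to the Poisson means (a sketch with made-up numbers, not part of the example data).
<<conditioning-sketch>>=
mu.pois <- c(2.2, 5.1, 0.7, 3.0)    # made-up Poisson means
y.cells <- c(3, 4, 1, 2)            # made-up table of cell counts
n.cells <- sum(y.cells)
# Poisson PMF of the table divided by Poisson PMF of the total ...
cond <- prod(dpois(y.cells, mu.pois)) / dpois(n.cells, sum(mu.pois))
# ... equals the multinomial PMF (dmultinom normalizes prob to sum to one)
all.equal(cond, dmultinom(y.cells, size = n.cells, prob = mu.pois))
@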
Suppose we assume Poisson sampling, have a canonical affine submodel with parameterization \eqref{eq:canonical-affine-submodel}, observed data $y$, cumulant function $c$ given by \begin{equation} \label{eq:cumfun-poisson-regression} c(\theta) = \sum_{i \in I} e^{\theta_i}, \end{equation} and MLE's \begin{itemize} \item $\hat{\beta}$ (for the submodel canonical parameter $\beta$), \item $\hat{\theta}$ (for the saturated model canonical parameter $\theta$), \item $\hat{\mu}$, (for the saturated model mean-value parameter $\mu$), and \item $\hat{\tau}$, (for the submodel mean-value parameter $\tau$), \end{itemize} which, of course, are related by \begin{subequations} \begin{align} \hat{\theta} & = a + M \hat{\beta} \label{eq:beta-hat-to-theta-hat} \\ \hat{\mu} & = \nabla c(\hat{\theta}) \label{eq:theta-hat-to-mu-hat} \\ \hat{\tau} & = M^T \hat{\mu} \label{eq:mu-hat-to-tau-hat} \end{align} \end{subequations} and are all determined by the observed equals expected property \begin{equation} \label{eq:oee-schemes} M^T y = \hat{\tau}. \end{equation} If the submodel parameterization is identifiable, then $\hat{\beta}$ is the unique submodel canonical parameter value that makes \eqref{eq:beta-hat-to-theta-hat}, \eqref{eq:theta-hat-to-mu-hat}, and \eqref{eq:mu-hat-to-tau-hat} lead to \eqref{eq:oee-schemes}. If the submodel parameterization is not identifiable, then $\hat{\beta}$ is a (non-unique) submodel canonical parameter value that makes \eqref{eq:beta-hat-to-theta-hat}, \eqref{eq:theta-hat-to-mu-hat}, and \eqref{eq:mu-hat-to-tau-hat} lead to \eqref{eq:oee-schemes}. In the latter case there is a direction $\delta$ in the submodel parameter space that is a direction of constancy, $$ \inner{M^T y, \delta} = y^T M \delta $$ is a constant random variable, and $\hat{\beta} + r \delta$ is also an MLE for $\beta$ for any real number $r$. Now we consider the multinomial sampling scheme. Since we want to compare Poisson and multinomial, we use the same saturated model canonical statistic vector $y$ for both. This means we cannot have an identifiable parameterization for the saturated multinomial. Instead we have what we called the ``try III'' parameterization in Section~\ref{sec:multinomial} above. Then we repeat everything we said above about the Poisson for the multinomial and everything is the same except that the Poisson cumulant function is \eqref{eq:cumfun-poisson-regression} whereas the multinomial cumulant function is \eqref{eq:cumfun-multinomial-try-iii}. Now we make a key assumption. The model for the Poisson sampling scheme must include an ``intercept'' term or be equivalent to one that does. We saw in the preceding section that if there are any categorical predictors, then leaving out the ``intercept'' does not change the model (hence the scare quotes around ``intercept''). Formally the requirement is that the vector $\delta_I = (1, 1, \ldots, 1)$ \label{pg:delta-i} be in the column space of the model matrix $M$, that is, some linear combination of the columns of $M$ has all components equal to one. \begin{theorem} \label{th:poisson-and-multinomial} Suppose we have a Poisson regression model with the parameterization of Section~\ref{sec:poisson-regression} and a multinomial model with the parameterization of Section~\ref{sec:try-iii}. Suppose we use the same canonical affine submodel for both with parameterization \eqref{eq:canonical-affine-submodel}. And suppose the vector with all components equal to one is in the column space of the model matrix. 
Then MLE for the models agree in the sense that, for the same data, any MLE for the Poisson model for any of the four parameter vectors, submodel canonical, submodel mean-value, saturated model canonical, or saturated model mean-value, is also an MLE for the corresponding parameter vector of the multinomial model. \end{theorem} Since only part of this theorem is usually stated in textbooks, we give a proof. Observe that it is crucial that we use the nonidentifiable ``try III'' parameterization of the multinomial to make canonical parameters match. \begin{proof} Since both models have the same canonical statistic vector and the same model matrix, they have the same ``observed equals expected'' equation \eqref{eq:oee-schemes} determining they have the same MLE for the submodel mean-value parameter. In fact, we see that the Poisson and multinomial sampling schemes not only agree on \eqref{eq:oee-schemes}, they also agree on \eqref{eq:beta-hat-to-theta-hat} and \eqref{eq:mu-hat-to-tau-hat}. They only disagree on \eqref{eq:theta-hat-to-mu-hat}, where for the Poisson scheme we differentiate \eqref{eq:cumfun-poisson-regression} obtaining \begin{equation} \label{eq:fred-poisson} \hat{\mu}_i = e^{\hat{\theta}_i} \end{equation} and for the multinomial scheme we differentiate \eqref{eq:cumfun-multinomial-try-iii} obtaining \begin{equation} \label{eq:fred-multinomial} \hat{\mu}_i = \frac{n e^{\hat{\theta}_i}}{\sum_{j \in I} e^{\hat{\theta}_j}} \end{equation} Suppose $\hat{\theta}$ is the Poisson MLE. Then \eqref{eq:fred-poisson} holds. Because of the key assumption that $\delta_I$ is in the column space of the model matrix, we have, by ``observed equals expected'' \begin{equation} \label{eq:fred-oee} \delta_I^T \hat{\mu} = \sum_{i \in I} \hat{\mu}_i = \delta_I^T y = \sum_{i \in I} y_i. \end{equation} Introduce an abbreviation \begin{equation} \label{eq:fred-n} n = \sum_{i \in I} y_i. \end{equation} (For the multinomial scheme $n$ is the multinomial sample size, but here, for the Poisson scheme, $n$ isn't anything in particular, it is just an abbreviation for the right-hand side of \eqref{eq:fred-n}). Plugging \eqref{eq:fred-n} into \eqref{eq:fred-oee} and then using \eqref{eq:fred-poisson} gives \begin{equation} \label{eq:fred-poisson-oee} \sum_{i \in I} \hat{\mu}_i = \sum_{i \in I} e^{\hat{\theta}_i} = n. \end{equation} All of this holds for the Poisson MLE's. Now we want to show that \eqref{eq:fred-multinomial} also holds for the Poisson MLE's. Plug \eqref{eq:fred-poisson-oee} into \eqref{eq:fred-multinomial} obtaining $$ \hat{\mu}_i = \frac{n e^{\hat{\theta}_i}}{\sum_{j \in I} e^{\hat{\theta}_j}} = \frac{n e^{\hat{\theta}_i}}{n} = e^{\hat{\theta}_i} $$ and, reading end to end, this is just \eqref{eq:fred-poisson} which characterizes the Poisson MLE. Hence we conclude that if $\hat{\theta}$ is the Poisson MLE, then \eqref{eq:fred-poisson} and \eqref{eq:fred-multinomial} both hold. But \eqref{eq:fred-multinomial} characterizes multinomial MLE. Hence the Poisson MLE is also a multinomial MLE. As we said before, since \eqref{eq:oee-schemes}, \eqref{eq:beta-hat-to-theta-hat}, and \eqref{eq:mu-hat-to-tau-hat} obviously agree for both schemes, and we have just shown that \eqref{eq:theta-hat-to-mu-hat} also agrees for both schemes, all MLE for all parameterizations, $\hat{\beta}$, $\hat{\theta}$, $\hat{\mu}$, and $\hat{\tau}$ agree for both schemes. 
\end{proof} \subsubsection{Poisson versus Product Multinomial} \begin{theorem} \label{th:poisson-and-product-multinomial} Suppose we have a Poisson regression model with the parameterization of Section~\ref{sec:poisson-regression} and a product multinomial model for partition $\mathcal{A}$, each multinomial of the product having the parameterization of Section~\ref{sec:try-iii}. Suppose we use the same canonical affine submodel for both with parameterization \eqref{eq:canonical-affine-submodel}. Define for each $A \in \mathcal{A}$ the indicator vectors $\delta_A$, the $i$-th component of which is zero \label{pg:delta-a} if $i \notin A$ and one if $i \in A$. And suppose that each of these $\delta_A$ is in the column space of the model matrix. Then MLE for the models agree in the sense that, for the same data, any MLE for the Poisson model for any of the four parameter vectors, submodel canonical, submodel mean-value, saturated model canonical, or saturated model mean-value, is also an MLE for the corresponding parameter vector of the product multinomial model. \end{theorem} \begin{proof} The proof is just like the proof of Theorem~\ref{th:poisson-and-multinomial} except that the multinomial cumulant function \eqref{eq:cumfun-multinomial-try-iii} is replaced by the product multinomial cumulant function \begin{equation} \label{eq:cumfun-product-multinomial} c(\theta) = \sum_{A \in \mathcal{A}} n_A \log\left( \sum_{i \in A} e^{\theta_i} \right). \end{equation} Differentiating and plugging in $\hat{\theta}$ for $\theta$ gives \begin{equation} \label{eq:fred-product-multinomial} \hat{\mu}_i = \frac{n_A e^{\hat{\theta}_i}}{\sum_{j \in A} e^{\hat{\theta}_j}}, \qquad (i \in A). \end{equation} So in this proof \eqref{eq:fred-product-multinomial} plays the role of \eqref{eq:fred-multinomial} in the proof of Theorem~\ref{th:poisson-and-multinomial}. Then, because of the assumption that $\delta_A$ is in the column space of the model matrix, we have \begin{equation} \label{eq:fred-oee-product} \delta_A^T \hat{\mu} = \sum_{i \in A} \hat{\mu}_i = \delta_A^T y = \sum_{i \in A} y_i. \end{equation} holding for the Poisson MLE's. This replaces \eqref{eq:fred-oee} in the other proof. Then we just follow the pattern of the other proof, introducing \begin{equation} \label{eq:fred-n-product} n_A = \sum_{i \in A} y_i \end{equation} we see that \eqref{eq:fred-oee-product} implies, still for Poisson MLE's, that \begin{equation} \label{eq:fred-plug-poisson} \sum_{i \in A} e^{\hat{\theta}_i} = n_A \end{equation} and plugging this into \eqref{eq:fred-product-multinomial} we see that \eqref{eq:fred-product-multinomial} becomes \eqref{eq:fred-poisson}. So the equation that characterizes product multinomial MLE's actually holds for Poisson MLE's. The rest of the two proofs is the same. \end{proof} \subsubsection{Multinomial versus Product Multinomial} \begin{theorem} \label{th:multinomial-and-product-multinomial} Suppose we have a multinomial regression model and a product multinomial model for partition $\mathcal{A}$, each multinomial having the parameterization of Section~\ref{sec:try-iii}. Suppose we use the same canonical affine submodel for both with parameterization \eqref{eq:canonical-affine-submodel}. Define $\delta_A$ as in Theorem~\ref{th:poisson-and-product-multinomial}, and suppose that each of these $\delta_A$ is in the column space of the model matrix. 
Then MLE for the models agree in the sense that, for the same data, any MLE for the multinomial model for any of the four parameter vectors, submodel canonical, submodel mean-value, saturated model canonical, or saturated model mean-value, is also an MLE for the corresponding parameter vector of the product multinomial model.
\end{theorem}
\begin{proof}
The proof is just like the proofs of Theorems~\ref{th:poisson-and-multinomial} and~\ref{th:poisson-and-product-multinomial} except that now we start with multinomial MLE's satisfying \eqref{eq:fred-multinomial} and need to show that they also satisfy \eqref{eq:fred-product-multinomial}.
Again, because of the assumption that $\delta_A$ is in the column space of the model matrix, we have \eqref{eq:fred-oee-product}, and, introducing the abbreviation \eqref{eq:fred-n-product}, we have
\begin{equation} \label{eq:fred-plug-multinomial}
n_A = \sum_{i \in A} \hat{\mu}_i = \sum_{i \in A} \frac{n e^{\hat{\theta}_i}} {\sum_{j \in I} e^{\hat{\theta}_j}} = \frac{n \sum_{i \in A} e^{\hat{\theta}_i}} {\sum_{j \in I} e^{\hat{\theta}_j}}.
\end{equation}
This differs from \eqref{eq:fred-plug-poisson} because we plugged in the formula \eqref{eq:fred-multinomial} for multinomial means for $\hat{\mu}_i$ rather than the formula \eqref{eq:fred-poisson} for Poisson means.
Now we want to show that if the MLE's are for the multinomial scheme, then \eqref{eq:fred-product-multinomial} also holds. Rewrite \eqref{eq:fred-plug-multinomial} as
$$
n_A \sum_{j \in I} e^{\hat{\theta}_j} = n \sum_{i \in A} e^{\hat{\theta}_i}
$$
and plug into \eqref{eq:fred-product-multinomial} obtaining (when $i \in A$)
$$
\hat{\mu}_i = \frac{n_A e^{\hat{\theta}_i}}{\sum_{j \in A} e^{\hat{\theta}_j}} = \frac{n_A e^{\hat{\theta}_i}} {\frac{n_A}{n} \sum_{j \in I} e^{\hat{\theta}_j}} = \frac{n e^{\hat{\theta}_i}} {\sum_{j \in I} e^{\hat{\theta}_j}}
$$
and, reading end to end, this is \eqref{eq:fred-multinomial}. So the equation that characterizes product multinomial MLE's actually holds for multinomial MLE's. The rest of the proof is the same as for the two preceding proofs.
\end{proof}
By now the pattern should be obvious. We won't bother to state or prove the analogous theorem for product multinomials with coarser and finer partitions.
\subsubsection{Example}
Instead we move to an example, the same example we have been doing all along. But now we use Poisson regression to fit the model. We consider the logistic regression we have been doing to be an instance of product multinomial sampling where there are four multinomials (in this case binomials) in the product.
To do Poisson regression, we reformat the data as follows.
<>=
pdata <- data.frame(verdicts = as.vector(deathpenalty),
    victim = rep(victim, times = 2),
    defendant = rep(defendant, times = 2),
    deathpenalty = factor(rep(colnames(deathpenalty), each = 4)))
@
We had to stretch the data out into a vector \texttt{verdicts}, and correspondingly had to make the predictors \texttt{victim} and \texttt{defendant} twice as long to match and also had to make up a new predictor \texttt{deathpenalty} to indicate which elements of the new response vector \texttt{verdicts} corresponded to which columns of the old response matrix \texttt{deathpenalty} (having a response matrix rather than a response vector being a curiosity of the way the R function \texttt{glm} specifies binomial GLM's).
\pagebreak[3]
The data are now
<>=
pdata
@
To fit this model by Poisson regression and have it be equal to the corresponding product binomial model (logistic regression), we need to have what Theorem~\ref{th:poisson-and-product-multinomial} calls the $\delta_A$ in the model, so we need to include a term \verb@victim * defendant@ in the R formula, because the dummy variables it produces are the $\delta_A$. (The four parts of the product binomial are for the four victim-race-defendant-race combinations).
We also have a term \verb@deathpenalty : (victim + defendant)@ in the model because we want to know how well \verb@victim + defendant@ does in ``predicting'' whether the death penalty is imposed or not. We did not need the \verb@deathpenalty :@ when doing logistic regressions because this was implicit in what we called the ``curiosity of the way the R function \texttt{glm} specifies binomial GLM's'' above (having a response matrix rather than a response vector).
<>=
options(width = 80)
@
So do it
<>=
gout.poisson <- glm(verdicts ~ victim * defendant + deathpenalty : (victim + defendant),
    family = poisson, data = pdata, x = TRUE)
summary(gout.poisson)
@
<>=
options(width = 60)
@
And we check that we have the same mean vectors for both models, taking into account what we are now beginning to see as the weirdness of the way binomial GLM's work in R.
<>=
mu.binomial <- predict(gout, type = "response")
mu.binomial <- n * mu.binomial
mu.binomial <- c(mu.binomial, n - mu.binomial)
mu.binomial <- as.vector(mu.binomial)
mu.poisson <- predict(gout.poisson, type = "response")
mu.poisson <- as.vector(mu.poisson)
all.equal(mu.binomial, mu.poisson)
@
We needed the two \texttt{as.vector} commands because the names of the components are wrong (again because of the ``weirdness of the way binomial GLM's work in R''). (The R function \texttt{all.equal} checks attributes, including names, as well as values. It reports a mismatch if the names don't match.)
Now that we know the mean vectors match we know that, since mean vectors parameterize regular full exponential family models, we have matching MLE probability distributions.
\pagebreak[3]
Now look at saturated model canonical parameters.
<>=
theta.binomial <- predict(gout)
theta.binomial <- c(theta.binomial, rep(0, length(theta.binomial)))
theta.binomial <- as.vector(theta.binomial)
theta.poisson <- predict(gout.poisson)
theta.poisson <- as.vector(theta.poisson)
matrix(theta.binomial - theta.poisson, ncol = 2)
@
We have formatted the difference of the two saturated model MLE canonical parameter vectors in the same way the R function \texttt{glm} requires binomial response vectors be formatted (in a matrix with two columns). Each row corresponds to an independent binomial in the product binomial (logistic regression model). And we know that the vector with all components equal to one is a direction of constancy for a multinomial (with ``try III'' parameterization, Section~\ref{sec:try-iii}). Hence from Theorem~\ref{th:directions-of-constancy} so is any vector with all components equal to each other, because if $\inner{y, \delta}$ is almost surely constant, then so is $\inner{y, a \delta} = a \inner{y, \delta}$ for any constant $a$.
So we see that the difference of the two saturated model MLE canonical parameter vectors is a direction of constancy of the product binomial model, as it must be from the theory of exponential families.
Now look at submodel canonical parameters, and we get in trouble.
The R formula mini-language doesn't give us anything comparable; the coefficient vectors for the two fits do not even have the same length
<>=
names(coefficients(gout))
names(coefficients(gout.poisson))
@
We could get a closer match if we used different formulas to specify the models (so-called intercepts behave really weirdly in complicated situations, get rid of them, so do \verb@*@'s for interactions, get rid of them too).
<>=
gout.poisson.alternate <- glm(verdicts ~ 0 + victim : defendant + deathpenalty : (victim + defendant),
    family = poisson, data = pdata, x = TRUE)
all.equal(predict(gout.poisson, type = "response"),
    predict(gout.poisson.alternate, type = "response"))
gout.binomial.alternate <- glm(deathpenalty ~ 0 + victim + defendant,
    family = binomial, x = TRUE)
all.equal(predict(gout, type = "response"),
    predict(gout.binomial.alternate, type = "response"))
@
Now the $\hat{\beta}$'s match more closely
<>=
coefficients(gout.binomial.alternate)
coefficients(gout.poisson.alternate)
all.equal(as.vector(coefficients(gout.binomial.alternate)),
    as.vector(coefficients(gout.poisson.alternate))[5:7])
@
The first four components of \texttt{coefficients(gout.poisson.alternate)} are the coefficients for what Theorem~\ref{th:poisson-and-product-multinomial} calls the $\delta_A$, the dummy variables for the sums that are fixed (multinomial sample sizes) in the product binomial model. The remaining three components of that $\hat{\beta}$ vector correspond to the three components of the $\hat{\beta}$ vector for the product binomial (logistic regression) model. The R formula mini-language doesn't name them exactly the same, but they are essentially the same parameters.
One last point before we leave this issue. Theorems~\ref{th:poisson-and-multinomial}, \ref{th:poisson-and-product-multinomial}, and~\ref{th:multinomial-and-product-multinomial} are about using the same model matrix for both models. But here we clearly didn't do that. The reason is identifiability. The extra columns of the model matrix for \texttt{gout.poisson.alternate} correspond to directions of constancy for the other model. We need to leave them out for identifiability. We could do exactly what Theorem~\ref{th:poisson-and-product-multinomial} says except that the weirdness of the way binomial GLM's work in R defeats us. So we won't bother to match that up.
\subsection{Sampling Schemes and Likelihood Ratio Tests}
Suppose we want to test nested canonical affine submodels with the same offset vector and model matrices $M_1$ and $M_2$. The nesting hypothesis means that the column space of $M_1$ is a subspace of the column space of $M_2$. Usually, but not always, this means that every column of $M_1$ is also a column of $M_2$. When we think of models specified by formulas, this usually means that every term in the formula for the smaller model is also a term in the formula for the bigger model.
\begin{theorem} \label{th:wilks-schemes}
Not only do MLE agree for Poisson, multinomial, and product multinomial sampling schemes, so do likelihood ratio test statistics.
\end{theorem}
\begin{proof}
We have already seen that the MLE $\hat{\beta}$, $\hat{\theta}$, $\hat{\mu}$, and $\hat{\tau}$ can be considered the same for all the sampling schemes. For the Poisson scheme, the maximized log likelihood is
\begin{equation} \label{eq:wilks-poisson}
l(\hat{\beta}) = \inner{y, \hat{\theta}} - \sum_{i \in I} e^{\hat{\theta}_i}.
\end{equation}
For the multinomial scheme, it is
\begin{equation} \label{eq:wilks-multinomial}
l(\hat{\beta}) = \inner{y, \hat{\theta}} - n \log\left( \sum_{i \in I} e^{\hat{\theta}_i} \right).
\end{equation}
For the product multinomial scheme for partition $\mathcal{A}$, it is
\begin{equation} \label{eq:wilks-product-multinomial}
l(\hat{\beta}) = \inner{y, \hat{\theta}} - \sum_{A \in \mathcal{A}} n_A \log\left(\sum_{i \in A} e^{\hat{\theta}_i} \right)
\end{equation}
and we need to show that these are the same or at least give the same results when used in constructing test statistics.
Using \eqref{eq:fred-n} and \eqref{eq:fred-n-product}, equations \eqref{eq:wilks-poisson}, \eqref{eq:wilks-multinomial}, and \eqref{eq:wilks-product-multinomial} become (respectively)
\begin{align*}
l(\hat{\beta}) & = \inner{y, \hat{\theta}} - n \\
l(\hat{\beta}) & = \inner{y, \hat{\theta}} - n \log(n) \\
l(\hat{\beta}) & = \inner{y, \hat{\theta}} - \sum_{A \in \mathcal{A}} n_A \log(n_A)
\end{align*}
and these agree except for constants that cancel when we calculate likelihood ratio test statistics (which involve differences of log likelihoods).
\end{proof}
Because likelihoods are invariant under reparameterization, Theorem~\ref{th:wilks-schemes} holds for any parameterization.
\subsubsection{Example Continued: Wilks Tests} \label{sec:examp-wilks}
Recall that we did the following likelihood ratio test back in Section~\ref{sec:lrt}
<>=
anova(gout.no.victim, gout, test = "Chisq")
@
Recall also that we put the data back in the form for logistic (product binomial) regression. To redo this test under the Poisson sampling scheme we use the data in the form for Poisson regression (the data frame \texttt{pdata} made above) and fit the model with no victim effect.
<>=
gout.poisson.no.victim <- glm(verdicts ~ victim * defendant + deathpenalty : defendant,
    family = poisson, data = pdata, x = TRUE)
anova(gout.poisson.no.victim, gout.poisson, test = "Chisq")
@
The same.
\subsubsection{Warning}
The R function \texttt{drop1} does not do the right thing with the example of the preceding section.
<>=
drop1(gout.poisson, ~ victim, test = "LRT")
@
The reason is that it tries to drop everything involving the ``predictor'' \texttt{victim} (scare quotes around ``predictor'' because it isn't data but just an indicator variable that tells which cells of the contingency table we are talking about). If there is a way to get \texttt{drop1} to do the right thing here, your humble author cannot figure it out.
The lesson here is that \texttt{anova} is a lot safer than \texttt{drop1}. With \texttt{anova} you have R objects for both models being compared. You can do \texttt{summary} on them, look at their coefficients, and see that they are actually the models you want to compare. With \texttt{drop1} you can only look at the big model and can only hope that \texttt{drop1}'s futzing with the R formula does the right thing. And sometimes, like here, it doesn't.
\subsection{Dropping Variables to Gain Identifiability} \label{sec:dropping}
The R functions \texttt{lm} and \texttt{glm} that fit linear models (LM) and generalized linear models (GLM) automatically drop variables to gain identifiability (or avoid so-called collinearity). Because the saturated model for an LM or GLM is always identifiable, non-identifiability for a canonical affine submodel can only arise due to the model matrix $M$ being such that $\theta = M \beta$ is not a one-to-one change of parameter.
Then different $\beta$'s (submodel canonical parameter values) correspond to the same $\theta$ (saturated model canonical parameter value), so the model is obviously not identifiable.
What \texttt{lm} and \texttt{glm} (and other R functions that fit regression models and copy their behavior) do in this case is to find a maximal set of linearly independent columns of $M$ and use the matrix $M_2$ composed of these columns as the model matrix. In the language of linear algebra, the columns of $M_2$ are a basis for the column space of $M$. That means that any vector $\theta = M \beta$ can also be expressed as $\theta = M \beta_2$ for some vector $\beta_2$. And this means that $M$ and $M_2$ are different model matrices for the same statistical model.
Because \texttt{lm} and \texttt{glm} always pick the columns of $M_2$ to be a subset of the columns of $M$, we always have $M_2$ a submatrix of $M$. Our subvector notation (p.~\pageref{pg:subvector} above) can be extended to matrices to cover this case. Matrices are vectors because they can be added and multiplied by scalars. We can think of a matrix $M$ having components $m_{i j}$ as the function $(i, j) \mapsto m_{i j}$. If we let $I$ and $J$ denote the sets over which $i$ and $j$ range, then our matrix is an element of the vector space $\real^{I \times J}$. An element of $I \times J$ is a pair $(i, j)$, so an element of $\real^{I \times J}$ is a function $(i, j) \mapsto m_{i j}$. Then if $A \subset J$, our subvector notation says $M_{I \times A}$ is the function $(i, j) \mapsto m_{i j}$ defined on $I \times A$. So we are just restricting $j$ to be in $A$. So this is a submatrix that has some of the columns of $M$ but not all of them.
So suppose $M_{I \times A}$ is what we were calling $M_2$ before. Then $M_{I \times A} \beta_A = M \beta$ if $\beta_j = 0$ for $j \notin A$. In short, using the model matrix $M_2$ is like constraining the components of $\beta$ that correspond to columns of $M$ that are not in $M_2$ to be zero. The R functions \texttt{lm} and \texttt{glm} do not say they have constrained these components of $\beta$ to be zero; instead they put in \texttt{NA}.
Here is a made-up example.
<>=
color <- c("red", "orange", "orange", "yellow", "yellow", "green", "green")
fruit <- c("apple", "orange", "kumquat", "apple", "lemon", "apple", "lime")
count <- rpois(length(color), 10)
foof <- data.frame(color = color, fruit = fruit, count = count)
foof
gout.foof <- glm(count ~ color + fruit, family = poisson, data = foof)
summary(gout.foof)
@
What happened? These R functions know they should drop one dummy variable for each factor if they include an intercept. So here \texttt{glm} automatically drops \texttt{colorgreen} and \texttt{fruitapple}. But it automatically keeps the rest. But here if you know that \texttt{colororange} = 1 and \texttt{fruitkumquat} = 0, then you know \texttt{fruitorange} = 1, because the only two orange fruits in these data are oranges and kumquats. That's non-identifiability or collinearity. So we have to drop one of these three dummy variables, and which one doesn't matter. We get the same statistical model either way, one more example of how regression coefficients are meaningless.
The R function \texttt{glm} decides to drop \texttt{fruitorange}. IMHO this function should print \texttt{0} rather than \texttt{NA} for the estimate for \texttt{fruitorange}. Instead it leaves it to us to know that \texttt{NA} here means the same thing as constraining this regression coefficient to be zero.
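One way to see the collinearity directly is to look at the rank of the model matrix that the formula generates: it has more columns than its rank, so one coefficient has to go (a sketch using the fit above; \texttt{model.matrix} rebuilds the model matrix from the fitted object).
<<collinearity-check>>=
mm.foof <- model.matrix(gout.foof)
# more columns than the rank means the columns are linearly dependent
c(columns = ncol(mm.foof), rank = qr(mm.foof)$rank)
# the aliased (dropped) coefficient is the one reported as NA
names(coefficients(gout.foof))[is.na(coefficients(gout.foof))]
@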
If we change the order of the levels of the factors, so R will drop different dummy variables, we get a different collinearity.
<>=
color <- factor(color, levels = sort(unique(color), TRUE))
fruit <- factor(fruit, levels = sort(unique(fruit), TRUE))
foof <- data.frame(color = color, fruit = fruit, count = count)
foof
gout.foof <- glm(count ~ color + fruit, family = poisson, data = foof,
    x = TRUE)
coefficients(gout.foof)
@
Now it drops \texttt{fruitapple} because
<>=
modmat <- gout.foof$x
modmat.good <- modmat[ , colnames(modmat) != "fruitapple"]
modmat.bad <- modmat[ , colnames(modmat) == "fruitapple"]
solve(modmat.good, modmat.bad)
@
which says
$$
\texttt{fruitapple} = \texttt{(Intercept)} - \texttt{colororange} - \texttt{fruitlime} - \texttt{fruitlemon}
$$
(in these data a fruit that is not colored orange and is not a lime or a lemon is an apple). This is one more example of how regression coefficients are meaningless.
If we change the order of terms in the formula, it will drop something different.
<>=
gout.foof <- glm(count ~ fruit + color, family = poisson, data = foof,
    x = TRUE)
coefficients(gout.foof)
@
It drops \texttt{colororange} because
<>=
modmat <- gout.foof$x
modmat.good <- modmat[ , colnames(modmat) != "colororange"]
modmat.bad <- modmat[ , colnames(modmat) == "colororange"]
solve(modmat.good, modmat.bad)
@
which says
$$
\texttt{colororange} = \texttt{(Intercept)} - \texttt{fruitlime} - \texttt{fruitlemon} - \texttt{fruitapple}
$$
(in these data a fruit that is not a lime, lemon, or apple is colored orange). This is one more example of how regression coefficients are meaningless.
But none of the above is what this section has to do with sampling schemes. That is the following.
If the saturated model has directions of constancy, as it does for the multinomial and product multinomial sampling schemes, to get identifiability one must also drop columns of the model matrix $M$ that are in the constancy space of the saturated model.
If the saturated model is multinomial, then one must drop columns of $M$ that are scalar multiples of what was called $\delta_I$ above and is labeled \verb@"(Intercept)"@ when R constructs model matrices (the vector having all components equal to one).
If the saturated model is product multinomial for partition $\mathcal{A}$, then one must drop columns of $M$ that are linear combinations of what was called $\delta_A$, $A \in \mathcal{A}$. These are the dummy variables that are put into the model to force
$$
\sum_{i \in A} y_i = \sum_{i \in A} \hat{\mu}_i, \qquad A \in \mathcal{A},
$$
to be among the ``observed equals expected'' equations.
In either case, one drops all of these columns of $M$ when fitting the multinomial or product multinomial (as the case may be) model. And this is necessary to obtain identifiability.
\subsection{Sampling Schemes and Rao Tests}
\begin{center}
\LARGE REVISED DOWN TO HERE \\
\normalsize this section is actually wrong
\end{center}
For the Rao test, we do need to be clear about the big and little models with model matrices $M_2$ and $M_1$, respectively. (We didn't need to do this for the Wilks test because both tests use the same maximized log likelihood.) The Rao test statistic is given by equations (11) or (12) in the likelihood handout, which are the same for exponential families because observed and expected Fisher information for the canonical parameter are the same.
We repeat equation (11) of the likelihood handout here
$$
   R_n = \bigl(\nabla l_n(\tilde{\theta}_n)\bigr)^T
   I_n(\tilde{\theta}_n)^{-1} \nabla l_n(\tilde{\theta}_n).
$$
Here $l_n$ is the log likelihood for the big model, $I_n(\theta)$ is Fisher information for the canonical parameter $\theta$ of the big model for sample size $n$, which is $n$ times Fisher information for sample size one, but $\tilde{\theta}_n$ is the MLE for the little model. For our purposes here we drop the $n$'s, obtaining
\begin{equation} \label{eq:rao-test-statistic}
   R = \bigl(\nabla l(\tilde{\theta})\bigr)^T I(\tilde{\theta})^{-1}
   \nabla l(\tilde{\theta}).
\end{equation}
If the sample size is ``large enough,'' we have a good asymptotic approximation; if not, then we don't. But either way, this is our formula for the Rao test statistic. In \eqref{eq:rao-test-statistic} we still have that $l$ is the log likelihood for the big model, $I(\theta)$ is Fisher information for the canonical parameter of the big model, and $\tilde{\theta}$ is the MLE for the little model.

We are going to make a few simplifying assumptions about the model matrices. These are made ``without loss of generality'' as the saying goes. We could always make the model matrices have this form (as will be explained as we go along). The R formula mini-language may not construct model matrices of this form, but there exist model matrices for the same models that do have this form.

We assume that both models are identifiable. What this means for $M_1$ and $M_2$ depends on what the saturated model is. Let $V_i$ denote the column space of $M_i$. Each $V_i$ is a vector subspace of $\real^J$. Then the nested model condition is $V_1 \subset V_2$.

If the saturated model is Poisson, hence identifiable, then this just says that the ranks of $M_1$ and $M_2$ are equal to their column dimensions. Alternatively, it says that the columns of $M_1$ are a linearly independent set of vectors, a basis for $V_1$, and similarly for the columns of $M_2$.

If the saturated model is multinomial, hence not identifiable, having a one-dimensional constancy space, whose one basis vector can be chosen to be $\delta_I$ (defined above, p.~\pageref{pg:delta-i}, to be the vector having all components equal to one), then this says that the columns of $M_1$ are a linearly independent set of vectors, a basis for $V_1$, and that $\delta_I \notin V_1$, and similarly for the columns of $M_2$. The reason for the condition $\delta_I \notin V_i$ is that if this were false, there would exist a $\beta$ such that $\delta_I = M_i \beta$. But then this $\beta$ would be a direction of constancy for the $i$-th model, and that would mean the model would not be identifiable, which we don't want.

If the saturated model is product multinomial for partition $\mathcal{A}$, hence not identifiable, having a constancy space whose dimension is the number of elements of $\mathcal{A}$, a basis for which can be taken to be the vectors $\delta_A$, $A \in \mathcal{A}$ (defined above, p.~\pageref{pg:delta-a}, a $\delta_A$ is the indicator function of the set $A$), then this says that the columns of $M_1$ are a linearly independent set of vectors, a basis for $V_1$, and that $\delta_A \notin V_1$ for all $A \in \mathcal{A}$, and similarly for the columns of $M_2$. The reason for the condition $\delta_A \notin V_i$ for all $A \in \mathcal{A}$ is similar to the reason for the corresponding condition for the multinomial scheme.
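As an aside, conditions like $\delta_I \notin V_i$ or $\delta_A \notin V_i$ can be checked numerically: project the vector in question onto the column space of the model matrix and see whether the residual is (numerically) zero. The following minimal sketch is purely illustrative; the function name and the toy matrix are ours, not part of any example in these notes.
<>=
## sketch: does the vector v lie in the column space of the matrix M?
in.column.space <- function(v, M, tol = 1e-8) {
    resid <- qr.resid(qr(M), v)  # residual of projecting v onto columns of M
    max(abs(resid)) < tol
}
## toy check: a matrix with an intercept column certainly contains delta_I
M.toy <- cbind(1, rnorm(5), rnorm(5))
in.column.space(rep(1, nrow(M.toy)), M.toy)
@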
We are also going to insist that $M_2$ have the form
$$
   M_2 = \begin{pmatrix} M_1 & N_2 \end{pmatrix}
$$
(a partitioned matrix), so all of the columns of $M_1$ are also columns of $M_2$, and the columns of $M_1$ appear in the same order in $M_2$ and before any other columns of $M_2$. The matrix $N_2$ consists of the leftover columns of $M_2$, those that are not columns of $M_1$.

The reason why there is ``no loss of generality'' here is that $V_1 \subset V_2$, and it is a theorem of linear algebra that any basis for $V_1$ can be extended to be a basis for $V_2$. Whatever $M_2$ we started with we can throw away and define a new $M_2$ as follows. Take the columns of $M_1$ to be a basis for $V_1$. Then extend this basis to be a basis for $V_2$, letting $N_2$ be the matrix whose columns are the vectors added to the basis for $V_1$ to get a basis for $V_2$. The old and new $M_2$ determine the same model because both have the same column space $V_2$.

Now with all of that setup we are finally ready to get back to the Rao test. By \eqref{eq:score-submodel} and \eqref{eq:fisher-submodel} we have
\begin{align*}
   \nabla l_2(\theta) & = M_2^T (y - \mu)
   \\
   I_2(\theta) & = M_2^T I(\theta) M_2
\end{align*}
where quantities with subscripts 2 refer to the big model (still a submodel of the saturated model), and quantities without subscripts refer to the saturated model. Now into these formulas we want to plug in the MLE $\tilde{\theta}$ for the little model, whose corresponding mean value $\tilde{\mu}$ is characterized by the ``observed equals expected'' equation for the little model, $M_1^T \tilde{\mu} = M_1^T y$. Doing that plug-in gives
\begin{align*}
   \nabla l_2(\tilde{\theta}) & = M_2^T (y - \tilde{\mu})
   \\
   & = \begin{pmatrix} M_1^T \\ N_2^T \end{pmatrix} (y - \tilde{\mu})
   \\
   & = \begin{pmatrix} M_1^T (y - \tilde{\mu}) \\ N_2^T (y - \tilde{\mu}) \end{pmatrix}
   \\
   & = \begin{pmatrix} 0 \\ N_2^T (y - \tilde{\mu}) \end{pmatrix}
\end{align*}

\subsubsection{Example Continued: Rao Tests} \label{sec:examp-rao}

Recall that in Section~\ref{sec:examp-wilks} we compared via Wilks's test the R objects \texttt{gout.poisson.no.victim} (little model) and \texttt{gout.poisson} (big model) produced by Poisson regression. We now know that the MLE's don't depend on the sampling scheme. The ones we want are for the little model
<>=
theta.twiddle <- predict(gout.poisson.no.victim)
mu.twiddle <- predict(gout.poisson.no.victim, type = "response")
@
Fisher information for the saturated model evaluated at the MLE for the little model is
<>=
fish.sat <- diag(mu.twiddle)
@
The corresponding Fisher information for the big model is
<>=
modmat <- gout.poisson$x
fish.big <- t(modmat) %*% fish.sat %*% modmat
fish.big.inverse <- solve(fish.big)
@
The score (log likelihood gradient) for the big model evaluated at the MLE for the little model is
<>=
y.poisson <- pdata$verdicts
score.big <- t(modmat) %*% (y.poisson - mu.twiddle)
score.big <- zapsmall(score.big)
score.big
@
So the Rao test statistic is
<>=
rao.poisson <- t(score.big) %*% fish.big.inverse %*% score.big
rao.poisson <- as.vector(rao.poisson)
rao.poisson
@
Now we want to calculate the same thing assuming product binomial (logistic regression) sampling.
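Before doing that, note as an aside that R can compute a Rao (score) test for nested GLM fits directly: the \texttt{anova} method for \texttt{glm} objects accepts \texttt{test = "Rao"}. The chunk below is only a sketch of that cross-check, assuming the fitted objects \texttt{gout.poisson.no.victim} and \texttt{gout.poisson} from the Wilks example are still in the workspace; the \texttt{Rao} column of its output can be compared with \texttt{rao.poisson} computed above.
<>=
## sketch: Rao (score) test via anova.glm, as a check on the
## hand computation of rao.poisson above
anova(gout.poisson.no.victim, gout.poisson, test = "Rao")
@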
MLE's
<>=
theta.twiddle <- predict(gout.no.victim)
p.twiddle <- predict(gout.no.victim, type = "response")
mu.twiddle <- n * p.twiddle
@
Fisher information for the saturated model evaluated at the MLE for the little model
<>=
fish.sat <- diag(n * p.twiddle * (1 - p.twiddle))
@
Corresponding Fisher information for the big model
<>=
modmat <- gout$x
fish.big <- t(modmat) %*% fish.sat %*% modmat
fish.big.inverse <- solve(fish.big)
@
Score for the big model evaluated at the MLE for the little model
<>=
score.big <- t(modmat) %*% (y - mu.twiddle)
score.big <- zapsmall(score.big)
score.big
@
Rao test statistic
<>=
rao.binomial <- t(score.big) %*% fish.big.inverse %*% score.big
rao.binomial <- as.vector(rao.binomial)
rao.binomial
all.equal(rao.binomial, rao.poisson)
@

\begin{center} \LARGE REVISED DOWN TO HERE \end{center}

\section{Multivariate Monotonicity}

The important mapping \eqref{eq:canonical-to-mean-value} between canonical and mean value parameters of a regular full exponential family has an important property called \emph{multivariate monotonicity}. If $\theta_1$ and $\theta_2$ are possible values of the canonical parameter vector and
\begin{align*}
   \mu_1 & = \nabla c(\theta_1)
   \\
   \mu_2 & = \nabla c(\theta_2)
\end{align*}
are the corresponding values of the mean-value parameter vector, then
\begin{equation} \label{eq:monotone-saturated}
   \inner{\mu_1 - \mu_2, \theta_1 - \theta_2} \ge 0,
\end{equation}
and if the canonical parameterization is identifiable
\begin{equation} \label{eq:strictly-monotone-saturated}
   \inner{\mu_1 - \mu_2, \theta_1 - \theta_2} > 0,
   \qquad \theta_1 \neq \theta_2.
\end{equation}

Property \eqref{eq:monotone-saturated} holding for all $\theta_1$ and $\theta_2$ in the full canonical parameter space is called \emph{multivariate monotonicity} of the mapping from $\theta$ to $\mu$ \citep[Chapter~12]{rockafellar-wets}. It follows from the fact that the derivative matrix of this mapping is a variance matrix (our equation \eqref{eq:bart-ii}), hence a positive semi-definite matrix, and from Proposition~12.3 in \citet{rockafellar-wets}.

Property \eqref{eq:strictly-monotone-saturated} holding for all $\theta_1$ and $\theta_2$ in the full canonical parameter space with $\theta_1 \neq \theta_2$ is called \emph{strict multivariate monotonicity} of the mapping from $\theta$ to $\mu$ \citep[Chapter~12]{rockafellar-wets}. When the canonical parameterization is identifiable, it follows from our Theorem~\ref{th:directions-of-constancy} that there is no nonzero vector $\delta$ such that $\inner{y, \delta}$ is almost surely constant, which implies both sides of \eqref{eq:bart-ii} are positive definite matrices. And this implies strict multivariate monotonicity by Proposition~12.3 in \citet{rockafellar-wets}.

Since canonical affine submodels of regular full exponential families are themselves regular full exponential families, the same theory discussed above applies to them.
If $\beta_1$ and $\beta_2$ are possible values of the submodel canonical parameter vector and
\begin{align*}
   \tau_1 & = \nabla c_{\text{sub}}(\beta_1) = M^T \nabla c(a + M \beta_1)
   \\
   \tau_2 & = \nabla c_{\text{sub}}(\beta_2) = M^T \nabla c(a + M \beta_2)
\end{align*}
are the corresponding values of the submodel mean-value parameter vector, then
\begin{equation} \label{eq:monotone-submodel}
   \inner{\tau_1 - \tau_2, \beta_1 - \beta_2} \ge 0,
\end{equation}
and if the canonical parameterization is identifiable
\begin{equation} \label{eq:strictly-monotone-submodel}
   \inner{\tau_1 - \tau_2, \beta_1 - \beta_2} > 0,
   \qquad \beta_1 \neq \beta_2.
\end{equation}

So much for the theory of multivariate monotonicity. What is the point? Why is it even called that?

In the case where there is just one parameter (so $\theta$ and $\mu$ are scalars rather than vectors) equation \eqref{eq:monotone-saturated} becomes
$$
   (\mu_1 - \mu_2) (\theta_1 - \theta_2) \ge 0,
$$
so $\mu_1 - \mu_2$ and $\theta_1 - \theta_2$ are either both nonnegative or both nonpositive. Hence $\theta_1 \le \theta_2$ implies $\mu_1 \le \mu_2$, and the mapping from $\theta$ to $\mu$ is a nondecreasing function. The term monotone applied to scalar-to-scalar functions usually means either nondecreasing or nonincreasing. If a scalar-to-scalar function is multivariate monotone, then it is nondecreasing. So the match is not perfect, but we can see why the name.

In the case where there is just one parameter equation \eqref{eq:strictly-monotone-saturated} becomes
$$
   (\mu_1 - \mu_2) (\theta_1 - \theta_2) > 0,
$$
so $\mu_1 - \mu_2$ and $\theta_1 - \theta_2$ are either both negative or both positive. Hence $\theta_1 < \theta_2$ implies $\mu_1 < \mu_2$, and the mapping from $\theta$ to $\mu$ is an increasing function. The term strictly monotone applied to scalar-to-scalar functions usually means either decreasing or increasing. If a scalar-to-scalar function is strictly multivariate monotone, then it is increasing. So again the match is not perfect, but we can see why the name.

So why is this the multivariate analog of univariate monotonicity? Consider the mapping restricted to a line. Pick two distinct points $\theta_1$ and $\theta_2$ in the canonical parameter space and consider the line determined by these two points
$$
   r \theta_1 + (1 - r) \theta_2, \qquad r \in \real.
$$
Then we consider the part of this line that is in the full canonical parameter space $\Theta$ given by \eqref{eq:full}. Define
\begin{align*}
   r_{\text{min}} & = \inf \set{ r \in \real : r \theta_1 + (1 - r) \theta_2 \in \Theta }
   \\
   r_{\text{max}} & = \sup \set{ r \in \real : r \theta_1 + (1 - r) \theta_2 \in \Theta }
\end{align*}
($r_{\text{min}} = - \infty$ and $r_{\text{max}} = + \infty$ are possible values). Then
$$
   r \theta_1 + (1 - r) \theta_2, \qquad r_{\text{min}} < r < r_{\text{max}}
$$
is the part of this line in the full canonical parameter space. The map from canonical to mean-value parameters is nonlinear, so the corresponding mean values
$$
   \mu_r = \nabla c\bigl( r \theta_1 + (1 - r) \theta_2 \bigr)
$$
do not lie on a line but rather on a smooth curve.

What does the multivariate monotonicity property say here? Consider $r_1$ and $r_2$ with $r_{\text{min}} < r_1 < r_2 < r_{\text{max}}$.
Then
$$
   \bigl[ r_1 \theta_1 + (1 - r_1) \theta_2 \bigr]
   - \bigl[ r_2 \theta_1 + (1 - r_2) \theta_2 \bigr]
   =
   (r_1 - r_2) (\theta_1 - \theta_2)
$$
and the multivariate monotonicity property becomes
$$
   0 \le \inner{ \mu_{r_1} - \mu_{r_2}, (r_1 - r_2) (\theta_1 - \theta_2) }
   = (r_1 - r_2) \inner{ \mu_{r_1} - \mu_{r_2}, \theta_1 - \theta_2 }
$$
so (because $r_1 - r_2 < 0$)
$$
   \inner{ \mu_{r_1} - \mu_{r_2}, \theta_1 - \theta_2 } \le 0
$$
and
$$
   \inner{ \mu_{r_1}, \theta_1 - \theta_2 } \le \inner{ \mu_{r_2}, \theta_1 - \theta_2 }
$$
Thus the multivariate monotonicity property says
\begin{equation} \label{eq:woof-woof}
   r \mapsto \inner{ \mu_r, \theta_1 - \theta_2 }
\end{equation}
is a nondecreasing function. And following the same logic for the strict multivariate monotonicity property would show that \eqref{eq:woof-woof} is an increasing function.

If we apply what we just learned to points that change only one coordinate of $\theta$, so that $\theta_1 - \theta_2$ points along a coordinate axis, we get the following, which your humble author calls the ``dumbed down'' version of multivariate monotonicity. (As always $\theta$ is the canonical parameter vector and $\mu$ is the mean-value parameter vector and $\theta_i$ and $\mu_i$ are their components.)
\begin{itemize}
\item Increase $\theta_i$ holding the other components of $\theta$ fixed, and $\mu_i$ either increases or stays the same. Other components of $\mu$ can go any which way.
\item If the canonical parameterization is identifiable, increasing $\theta_i$ holding the other components of $\theta$ fixed increases $\mu_i$.
\end{itemize}
The reason we call this ``dumbed down'' is that it is not enough information to capture the full multivariate monotonicity phenomenon. For example, \citet[Appendix~A]{shaw-geyer} found that an important bit of scientific reasoning about aster models (which are exponential family models) required the full force of \eqref{eq:strictly-monotone-submodel}. Thinking one coordinate at a time didn't do the job. Nevertheless, the ``dumbed down'' version is a lot easier to explain to people and so is often used when the full force of the technical definition is not needed.

In GLM, independence of components of the response vector means that the model is a product exponential family, each component of the response having an exponential family distribution that does not involve the other components. This means that the mapping $\theta_i \mapsto \mu_i$ is scalar-to-scalar, hence an increasing function (we never use a nonidentifiable canonical parameterization for GLM saturated models, because for a one-parameter family that would make the whole one-dimensional canonical parameter space a constancy space and the mean-value parameter space zero-dimensional, which would give us a degenerate model with only one distribution, and that would be pointless). So for GLM saturated models, we can reason componentwise. Each $\mu_i$ is a function of $\theta_i$ only, and this function is increasing.

But this does not work for GLM canonical affine submodels. Changing one component of $\beta$ (the ``intercept'' component, for example) can change all the components of the submodel mean-value parameter vector $\tau$ (and also all the components of $\theta$ and $\mu$). For other exponential family models, including multinomial models, we need multivariate monotonicity. There is no available univariate monotonicity.

\section{Maximum Entropy}

\section{Summary}

\begin{thebibliography}{}

\bibitem[Agresti(2013)]{agresti}
Agresti, A. (2013).
\newblock \emph{Categorical Data Analysis}, third edition.
\newblock John Wiley \& Sons, Hoboken.
\bibitem[Anderson(2003)]{anderson}
Anderson, T.~W. (2003).
\newblock \emph{An Introduction to Multivariate Statistical Analysis}, third edition.
\newblock John Wiley \& Sons, Hoboken.

\bibitem[Barndorff-Nielsen(1978)]{barndorff-nielsen}
Barndorff-Nielsen, O.~E. (1978).
\newblock \emph{Information and Exponential Families}.
\newblock John Wiley \& Sons, Chichester.

% \bibitem[Cram\'{e}r(1951)]{cramer}
% Cram\'{e}r, H. (1951).
% \newblock \emph{Mathematical Methods of Statistics}.
% \newblock Princeton University Press, Princeton.

\bibitem[Cook(1998)]{cook}
Cook, R.~D. (1998).
\newblock \emph{Regression Graphics: Ideas for Studying Regressions through Graphics}.
\newblock John Wiley \& Sons, New York.

\bibitem[Cook and Adragni(2009)]{cook-adragni}
Cook, R.~D., and Adragni, K.~P. (2009).
\newblock Sufficient dimension reduction and prediction in regression.
\newblock \emph{Philosophical Transactions of the Royal Society, Series A}, \textbf{367}, 4385--4405.

\bibitem[Cook and Weisberg(1991)]{cook-weisberg}
Cook, R.~D., and Weisberg, S. (1991).
\newblock Sliced inverse regression for dimension reduction: Comment.
\newblock \emph{Journal of the American Statistical Association}, \textbf{86}, 328--332.
\newblock Discussion of \citet{li}.

\bibitem[Darmois(1935)]{darmois}
Darmois, G. (1935).
\newblock Sur les lois de probabilit\'{e}s \`{a} estimation exhaustive.
\newblock \emph{Comptes Rendus de l'Acad\'{e}mie des Sciences}, \textbf{200}, 1265--1266.

\bibitem[Fisher(1922)]{fisher1922}
Fisher, R.~A. (1922).
\newblock On the mathematical foundations of theoretical statistics.
\newblock \emph{Philosophical Transactions of the Royal Society of London, Series A}, \textbf{222}, 309--368.

\bibitem[Galton(1886)]{galton}
Galton, F. (1886).
\newblock Regression towards mediocrity in hereditary stature.
\newblock \emph{Journal of the Anthropological Institute}, \textbf{15}, 246--263.

\bibitem[Galton and Dickson(1886)]{galton-dickson}
Galton, F., and Dickson, J.~D.~H. (1886).
\newblock Family likeness in stature.
\newblock \emph{Proceedings of the Royal Society of London}, \textbf{40}, 42--73.
\newblock Dickson supplied an appendix to Galton's paper.

\bibitem[Geyer(1999)]{geyer-spatial}
Geyer, C.~J. (1999).
\newblock Likelihood inference for spatial point processes.
\newblock In \emph{Stochastic Geometry: Likelihood and Computation}, eds.\ W.\ Kendall, O.\ Barndorff-Nielsen and M.~N.~M.\ van Lieshout.
\newblock Chapman \& Hall/CRC, London.

\bibitem[Geyer(2009)]{geyer-gdor}
Geyer, C.~J. (2009).
\newblock Likelihood inference in exponential families and directions of recession.
\newblock \emph{Electronic Journal of Statistics}, \textbf{3}, 259--289.

\bibitem[Geyer(2015)]{aster-package}
Geyer, C.~J. (2015).
\newblock R package \texttt{aster} (Aster Models), version 0.8-31.
\newblock \url{http://cran.r-project.org/package=aster}.

\bibitem[Geyer and M{\o}ller(1994)]{geyer-moller}
Geyer, C.~J., and M{\o}ller, J. (1994).
\newblock Simulation procedures and likelihood inference for spatial point processes.
\newblock \emph{Scandinavian Journal of Statistics}, \textbf{21}, 359--373.

\bibitem[Geyer, et~al.(2007)Geyer, Wagenius, and Shaw]{geyer-wagenius-shaw}
Geyer, C.~J., Wagenius, S., and Shaw, R.~G. (2007).
\newblock Aster models for life history analysis.
\newblock \emph{Biometrika}, \textbf{94}, 415--426.

\bibitem[Halmos and Savage(1949)]{halmos-savage}
Halmos, P.~R., and Savage, L.~J. (1949).
\newblock Application of the Radon-Nikodym theorem to the theory of sufficient statistics.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{20}, 225--241.

\bibitem[Koopman(1936)]{koopman}
Koopman, B.~O. (1936).
\newblock On distributions admitting a sufficient statistic.
\newblock \emph{Transactions of the American Mathematical Society}, \textbf{39}, 399--409.

\bibitem[Li(1991)]{li}
Li, K.-C. (1991).
\newblock Sliced inverse regression for dimension reduction (with discussion).
\newblock \emph{Journal of the American Statistical Association}, \textbf{86}, 316--342.

\bibitem[Neyman(1935)]{neyman}
Neyman, J. (1935).
\newblock Su un teorema concernente le cosiddette statistiche sufficienti.
\newblock \emph{Giornale dell'Istituto Italiano degli Attuari}, \textbf{6}, 320--334.
\newblock The title translates to ``On a theorem concerning so-called sufficient statistics.''

\bibitem[Pitman(1936)]{pitman}
Pitman, E.~J.~G. (1936).
\newblock Sufficient statistics and intrinsic accuracy.
\newblock \emph{Proceedings of the Cambridge Philosophical Society}, \textbf{32}, 567--579.

\bibitem[Ripley, et al.(2015)Ripley, Venables, Bates, Hornik, Gebhardt, and Firth]{MASS-package}
Ripley, B., Venables, B., Bates, D.~M., Hornik, K., Gebhardt, A., and Firth, D. (2015).
\newblock R package \texttt{MASS} (support functions and datasets for Venables and Ripley's MASS), version 7.3-45.
\newblock This is an R recommended package, installed by default in every installation of R.

\bibitem[Rockafellar and Wets(1998)]{rockafellar-wets}
Rockafellar, R.~T., and Wets, R.~J.-B. (1998).
\newblock \emph{Variational Analysis}.
\newblock Springer-Verlag, Berlin.
\newblock (The corrected printings contain extensive changes. We used the third corrected printing, 2010.)

\bibitem[Shaw and Geyer(2010)]{shaw-geyer}
Shaw, R.~G., and Geyer, C.~J. (2010).
\newblock Inferring fitness landscapes.
\newblock \emph{Evolution}, \textbf{64}, 2510--2520.

\bibitem[Venables and Ripley(2002)]{venables-ripley}
Venables, W.~N., and Ripley, B.~D. (2002).
\newblock \emph{Modern Applied Statistics with S}, fourth edition.
\newblock Springer, New York.

\end{thebibliography}

\end{document}