\documentclass{article}

\usepackage{indentfirst}
% \usepackage{verbatim}

\RequirePackage{amsmath}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\logit}{logit}

\newcommand{\boldbeta}{\boldsymbol{\beta}}
\newcommand{\boldeta}{\boldsymbol{\eta}}
\newcommand{\boldtheta}{\boldsymbol{\theta}}
\newcommand{\boldmu}{\boldsymbol{\mu}}
\newcommand{\boldp}{\mathbf{p}}
\newcommand{\boldx}{\mathbf{x}}
\newcommand{\boldD}{\mathbf{D}}
\newcommand{\boldI}{\mathbf{I}}
\newcommand{\boldX}{\mathbf{X}}

\newcommand{\mdot}{\,\cdot\,}

\newcommand{\Ybar}{Y{\mkern -14.0 mu}\overline{\phantom{\text{Y}}}}

\RequirePackage{amsfonts}
\newcommand{\real}{\mathbb{R}}

% \newcommand{\centerfig}[2]{\relax}

\RequirePackage{amsthm}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}

\newtheoremstyle{example}% name
     {\topsep}%      Space above
     {\topsep}%      Space below
     {}%         Body font
     {}%         Indent amount (empty = no indent, \parindent = para indent)
     {\bfseries}% Thm head font
     {.}%        Punctuation after thm head
     {\newline}%     Space after thm head: " " = normal interword space;
     {}%         Thm head spec
{\theoremstyle{example} \newtheorem{example}{Example}[section]}

\newenvironment{slogan}{\begin{quote}\em}{\end{quote}}

\begin{document}

\title{Generalized Linear Models in R}
\author{Charles J. Geyer}
\maketitle

This used to be a section of my master's level theory notes.
It is a bit overly theoretical for this R course.  Just think
of it as an example of literate programming in R using
the \verb@Sweave@ function.  You don't have to absorb all the
theory, although it is there for your perusal if you are interested.

\section{Bernoulli Regression}

We start with a slogan
\begin{itemize}
\item Categorical \emph{predictors} are no problem for linear regression.
Just use ``dummy variables'' and proceed normally.
\end{itemize}
but
\begin{itemize}
\item Categorical \emph{responses} do present a problem.
Linear regression assumes
normally distributed responses.  Categorical variables can't be normally
distributed.
\end{itemize}
So now we learn how to deal with at least one kind of categorical response,
the simplest, which is Bernoulli.

Suppose the responses are
\begin{equation} \label{eq:berp-assume}
   Y_i \sim \text{Bernoulli}(p_i)
\end{equation}
contrast this with the assumptions for linear regression
\begin{equation} \label{eq:gauss-assume}
   Y_i \sim \text{Normal}(\mu_i, \sigma^2)
\end{equation}
and
\begin{equation} \label{eq:linear-assume}
   \boldmu = \boldX \boldbeta
\end{equation}

The analogy between \eqref{eq:berp-assume} and \eqref{eq:gauss-assume} should
be clear.  Both assume the data are independent, but not identically
distributed.  The responses $Y_i$ have distributions in the same family, but
not the same parameter values.  So all we need to finish the specification
of a regression-like model for Bernoulli is an equation that takes the place
of \eqref{eq:linear-assume}.

\subsection{A Dumb Idea (Identity Link)}

We could use \eqref{eq:linear-assume} with the Bernoulli model, although
we have to change the symbol for the parameter from $\boldmu$ to $\boldp$
$$
   \boldp = \boldX \boldbeta.
$$
This means, for example, in the ``simple'' linear regression model
(with one constant and one non-constant predictor $x_i$)
\begin{equation} \label{eq:ber-linear-assume}
   p_i = \alpha + \beta x_i.
\end{equation}
Before we further explain this, we caution that this is universally recognized
to be a dumb idea, so don't get too excited about it.

Now nothing is normal,
so least squares, $t$ and $F$ tests, and so forth make no sense.  But maximum
likelihood, the asymptotics of maximum likelihood estimates, and likelihood
ratio tests do make sense.

Hence we write down the log likelihood
$$
   l(\alpha, \beta)
   =
   \sum_{i = 1}^n \bigl[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \bigr]
$$
and its derivatives
\begin{align*}
   \frac{\partial l(\alpha, \beta)}{\partial \alpha}
   & =
   \sum_{i = 1}^n \left[ \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \right]
   \\
   \frac{\partial l(\alpha, \beta)}{\partial \beta}
   & =
   \sum_{i = 1}^n \left[ \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \right] x_i
\end{align*}
and set equal to zero to solve for the MLE's.  Fortunately, even for this
dumb idea, R knows how to do the problem.

\begin{example}[Bernoulli Regression, Identity Link]
We \label{ex:dumb} use the data in the file \verb@ex12.8.1.dat@ in
this directory, which is read by the following
<<read-binom>>=
X <- read.table("ex12.8.1.dat", header = TRUE)
names(X)
attach(X)
@
and has three variables \texttt{x}, \texttt{y}, and \texttt{z}.  For now
we will just use the first two.

The response \texttt{y} is Bernoulli.
We will do a Bernoulli regression using the model
assumptions described above.  The following code
does the regression and prints out a summary.
<<glm-quasi>>=
out.quasi <- glm(y ~ x, family = quasi(variance = "mu(1-mu)"),
    start = c(0.5, 0))
summary(out.quasi, dispersion=1)
@

We have to apologize for the rather esoteric syntax, which results from
our choice of introducing Bernoulli regression via this rather dumb example.

As usual, our main interest is in the table labeled \texttt{Coefficients:},
which says the estimated regression coefficients (the MLE's) are
$\hat{\alpha} = -0.34750$ and $\hat{\beta} = 0.01585$.  This table also
gives standard errors, test statistics (``$z$ values'') and $P$-values
for the two-tailed test of whether the true value of the coefficient is zero.

The scatter plot with regression line for this regression is somewhat unusual
looking.  It is produced by the code
<<label=fig1plot,include=FALSE>>=
plot(x, y)
curve(predict(out.quasi, data.frame(x = x)), add=TRUE)
@
and is shown in Figure~\ref{fig:dumb}.
\begin{figure}
\begin{center}
<<label=fig1,fig=TRUE,echo=FALSE>>=
<<fig1plot>>
@
\end{center}
\caption{Scatter plot and regression line for Example~\ref{ex:dumb}
(Bernoulli regression with an identity link function).}
\label{fig:dumb}
\end{figure}
The response values are, of course, being Bernoulli, either zero or one,
which makes the scatter plot almost impossible to interpret (it is clear
that there are more ones for high $x$ values than for low, but it's impossible
to see much else, much less to visualize the correct regression line).
\end{example}

That finishes our discussion of the example.  So why is it ``dumb''?
One reason is that nothing keeps the parameters in the required range.
The $p_i$, being probabilities must be between zero and one.  The right hand
side of \eqref{eq:ber-linear-assume}, being a linear function may take any
values between $- \infty$ and $+ \infty$.  For the data set used in the
example, it just happened that the MLE's wound up in $(0, 1)$ without
constraining them to do so.  In general that won't happen.  What then?
R being semi-sensible will just crash (produce error messages rather than
estimates).

There are various ad-hoc ways one could think to patch up this problem.
One could, for example, truncate the linear function at zero and one.
But that makes a nondifferentiable log likelihood and ruins the asymptotic
theory.  The only simple solution is to realize that linearity is no longer
simple and give up linearity.

\subsection{Logistic Regression (Logit Link)}

What we need is an assumption about the $p_i$ that will always keep them
between zero and one.  A great deal of thought by many smart people came
up with the following general solution to the problem.  Replace the assumption
\eqref{eq:linear-assume} for linear regression with the following two
assumptions
\begin{equation} \label{eq:linear-predictor}
   \boldeta = \boldX \boldbeta
\end{equation}
and
\begin{equation} \label{eq:inverse-link}
   p_i = h(\eta_i)
\end{equation}
where $h$ is a smooth invertible function that maps $\real$ into $(0, 1)$ so
the $p_i$ are always in the required range.  We now stop for some important
terminology.
\begin{itemize}
\item The vector $\boldeta$ in \eqref{eq:linear-predictor} is called the
\emph{linear predictor}.
\item The function $h$ is called the \emph{inverse link function} and its
inverse $g = h^{-1}$ is called the \emph{link function}.
\end{itemize}

The most widely used (though not the only) link function for Bernoulli
regression is the \emph{logit} link defined by
\begin{subequations}
\begin{align}
   g(p) & = \logit(p) = \log\left(\frac{p}{1 - p}\right)
   \label{eq:logit}
   \\
   h(\eta) & = g^{-1}(\eta) = \frac{e^\eta}{e^\eta + 1}
   = \frac{1}{1 + e^{- \eta}}
   \label{eq:inverse-logit}
\end{align}
\end{subequations}
Equation \eqref{eq:logit} defines the so-called \emph{logit}
function, and, of course, equation \eqref{eq:inverse-logit}
defines the inverse logit function.

For generality, we will not at first use the explicit form of the link function
writing the log likelihood
$$
   l(\boldbeta)
   =
   \sum_{i = 1}^n \bigl[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \bigr]
$$
where we are implicitly using
\eqref{eq:linear-predictor} and \eqref{eq:inverse-link} as part of the
definition.  Then
$$
   \frac{\partial l(\boldbeta)}{\partial \beta_j}
   =
   \sum_{i = 1}^n
   \left[ \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \right]
   \frac{\partial p_i}{\partial \eta_i}
   \frac{\partial \eta_i}{\partial \beta_j}
$$
where the two partial derivatives on the right arise from the chain rule
and are explicitly
\begin{align*}
   \frac{\partial p_i}{\partial \eta_i} & = h'(\eta_i) \\
   \frac{\partial \eta_i}{\partial \beta_j} & = x_{i j}
\end{align*}
where $x_{i j}$ denotes the $i, j$ element of the design matrix $\boldX$
(the value of the $j$-th predictor for the $i$-th individual).
Putting everything together
$$
   \frac{\partial l(\boldbeta)}{\partial \beta_j}
   =
   \sum_{i = 1}^n
   \left[ \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \right]
   h'(\eta_i) x_{i j}
$$
These equations also do not have a closed form solution, but are easily
solved numerically by R
\begin{example}[Bernoulli Regression, Logit Link]
We \label{ex:logit} use the same data in Example~\ref{ex:dumb}.  The R
commands for logistic regression are
<<glm-logit>>=
out.logit <- glm(y ~ x, family = binomial)
summary(out.logit)
@
Note that the syntax is a lot cleaner for this (logit link) than for the
``dumb'' way (identity link).
The regression function for this ``logistic regression'' is shown in
Figure~\ref{fig:three}, which appears later, after we have done another example.
\end{example}

\subsection{Probit Regression (Probit Link)}

Another widely used link function for Bernoulli regression is the \emph{probit}
function, which is just another name for the standard normal inverse c.~d.~f.
That is,
the link function is $g(p) = \Phi^{-1}(p)$ and the inverse link function is
$g^{-1}(\eta) = \Phi(\eta)$.  The fact that we do not have closed-form
expressions for these functions and must use table look-up or computer programs
to evaluate them is no problem.  We need computers to solve the likelihood
equations anyway.

\begin{example}[Bernoulli Regression, Probit Link]
We \label{ex:probit} use the same data in Example~\ref{ex:dumb}.  The R
commands for probit regression are
<<glm-probit>>=
out.probit <- glm(y ~ x, family = binomial(link = "probit"))
summary(out.probit)
@

Note that there is a huge difference in the regression coefficients for
our three examples, but this should be no surprise because the coefficients
for the three regressions are not comparable.  Because the regressions
involve different link functions, the \emph{meaning} of the regression
coefficients are not the same.  Comparing them is like comparing apples
and oranges, as the saying goes.  Thus Bernoulli regression in particular
and generalized linear models in general give us yet another reason why
\emph{regression coefficients are meaningless}.  Note that
Figure~\ref{fig:three} shows that the estimated regression functions
$E(Y \mid X)$ are almost identical for the logit and probit regressions
despite the regression coefficients being wildly different.  Even the linear
regression function used in our first example is not so different, at least
in the middle of the range of the data, from the other two.
\begin{slogan}
Regression functions (response predictions) have a direct
probabilistic interpretation $E(Y \mid X)$.

Regression coefficients don't.
\end{slogan}

The regression function $E(Y \mid X)$ for all three of our Bernoulli
regression examples, including this one, are shown in Figure~\ref{fig:three},
which was made by the following code
<<label=fig2plot,include=FALSE>>=
plot(x, y)
curve(predict(out.logit, data.frame(x=x), type="response"),
   add=TRUE, lty=1)
curve(predict(out.probit, data.frame(x=x), type="response"),
   add=TRUE, lty=2)
curve(predict(out.quasi, data.frame(x=x)), add=TRUE, lty=3)
@
The \texttt{type="response"} argument says we want the predicted mean values
$g(\boldeta)$, the default being the linear predictor values $\boldeta$.
The reason why this argument is not needed for the last case is because
there is no difference with an identity link.
\begin{figure}
\begin{center}
<<label=fig2,fig=TRUE,echo=FALSE>>=
<<fig2plot>>
@
\end{center}
\caption{Scatter plot and regression functions for Examples~\ref{ex:dumb},
\ref{ex:logit}, and~\ref{ex:probit}.  Solid line: regression function for
logistic regression (logit link).  Dashed line: regression function for
probit regression (probit link).  Dotted line: regression function for
no-name regression (identity link).}
\label{fig:three}
\end{figure}
\end{example}

\section{Generalized Linear Models}

A \emph{generalized linear model} (GLM) is a rather general (duh!) form
of model that includes ordinary linear regression, logistic and probit
regression, and lots more.  We keep the regression-like association
\eqref{eq:linear-predictor} between the regression coefficient vector
$\boldbeta$ and the \emph{linear predictor} vector $\boldeta$ that we
used in Bernoulli regression.  But now we generalize the probability model
greatly.  We assume the responses $Y_i$ are independent but not identically
distributed with densities of the form
\begin{equation} \label{eq:def-glm-dens}
   f(y \mid \theta, \phi, w)
   =
   \exp\left( \frac{y \theta - b(\theta)}{\phi / w} - c(y, \phi) \right)
\end{equation}
We assume
$$
   Y_i \sim f(\mdot \mid \theta_i, \phi, w_i),
$$
that is, the \emph{canonical parameter} $\theta_i$ is different for each case
and is determined (in a way yet to be specified) by the linear predictor
$\eta_i$ but the so-called \emph{dispersion parameter} $\phi$ is the same
for all $Y_i$.  The \emph{weight} $w_i$ is a known positive constant,
not a parameter.  Also $\phi > 0$ is assumed ($\phi < 0$ would just change
the sign of some equations with only trivial effect).
The function $b$ is a smooth function but otherwise arbitrary.
Given $b$ the function $c$ is determined by the requirement that $f$ integrate
to one (like any other probability density).

The log likelihood is thus
\begin{equation} \label{eq:glm-logl}
   l(\boldbeta)
   =
   \sum_{i = 1}^n
   \left( \frac{y_i \theta_i - b(\theta_i)}{\phi / w_i} - c(y_i, \phi) \right)
\end{equation}
Before we proceed to the likelihood equations, let us first look at what the
identities derived from differentiating under the integral sign
\begin{equation} \label{bart1}
   E_\theta \{ l_n'(\theta) \} = 0
\end{equation}
and
\begin{equation} \label{olof}
   E_\theta\{ l_n''(\theta) \} = - \var_\theta\{ l_n'(\theta) \}
\end{equation}
and their multiparameter analogs
\begin{equation} \label{bart1-multi}
   E_{\boldtheta}\{ \nabla l_n(\boldtheta) \} = 0
\end{equation}
and
\begin{equation} \label{olof-multi}
   E_{\boldtheta}\{ \nabla^2 l_n(\boldtheta) \}
   =
   - \var_{\boldtheta}\{ \nabla l_n(\boldtheta) \}
\end{equation}
tell us about this model.  Note that these identities
are exact, not asymptotic, and so can be applied to sample size one and to any
parameterization.  So let us differentiate one term
of \eqref{eq:glm-logl} with respect to its $\theta$ parameter
\begin{align*}
   l(\theta, \phi)
   & =
   \frac{y \theta - b(\theta)}{\phi / w} - c(y, \phi)
   \\
   \frac{\partial l(\theta, \phi)}{\partial \theta}
   & =
   \frac{y - b'(\theta)}{\phi / w}
   \\
   \frac{\partial^2 l(\theta, \phi)}{\partial \theta^2}
   & =
   - \frac{b''(\theta)}{\phi / w}
\end{align*}
Applied to this particular situation, the identities from differentiating
under the integral sign are
\begin{align*}
   E_{\theta, \phi} \left\{
   \frac{\partial l(\theta, \phi)}{\partial \theta} \right\}
   & =
   0
   \\
   \var_{\theta, \phi} \left\{
   \frac{\partial l(\theta, \phi)}{\partial \theta} \right\}
   & =
   -
   E_{\theta, \phi} \left\{
   \frac{\partial^2 l(\theta, \phi)}{\partial \theta^2} \right\}
\end{align*}
or
\begin{align*}
   E_{\theta, \phi} \left\{ \frac{Y - b'(\theta)}{\phi / w} \right\}
   & =
   0
   \\
   \var_{\theta, \phi} \left\{ \frac{Y - b'(\theta)}{\phi / w} \right\}
   & =
   \frac{b''(\theta)}{\phi / w}
\end{align*}
From which we obtain
\begin{subequations}
\begin{align}
   E_{\theta, \phi} (Y) & = b'(\theta)
   \label{eq:glm-bp}
   \\
   \var_{\theta, \phi} (Y) & = 
   b''(\theta) \frac{\phi}{w}
   \label{eq:glm-bpp}
\end{align}
\end{subequations}
From this we derive the following lemma.
\begin{lemma} \label{lem:expo}
The function $b$ in \eqref{eq:def-glm-dens} has the following properties
\begin{itemize}
\item[\upshape (i)] $b$ is strictly convex,
\item[\upshape (ii)] $b'$ is strictly increasing,
\item[\upshape (iii)] $b''$ is strictly positive,
\end{itemize}
unless $b''(\theta) = 0$ for all $\theta$ and the distribution of $Y$ is
concentrated at one point for all parameter values.
\end{lemma}
\begin{proof}
Just by ordinary calculus (iii) implies (ii) implies (i), so we need only prove
(iii).  Equation \eqref{eq:glm-bpp} and the assumptions $\phi > 0$ and $w > 0$
imply $b''(\theta) \ge 0$.  So the only thing left to prove is that if
$b''(\theta^\ast) = 0$ for any one $\theta^\ast$, then actually
$b''(\theta) = 0$ for all $\theta$.
By \eqref{eq:glm-bpp} $b''(\theta^\ast) = 0$ implies
$\var_{\theta^\ast, \phi} (Y) = 0$, so the distribution of $Y$ for the
parameter values $\theta^\ast$ and $\phi$ is concentrated at one point.
But now we apply a trick using the distribution at $\theta^\ast$ to calculate
for other $\theta$
\begin{align*}
   f(y \mid \theta, \phi, w)
   & = 
   \frac{f(y \mid \theta, \phi, w)}{f(y \mid \theta^\ast, \phi, w)}
   f(y \mid \theta^\ast, \phi, w)
   \\
   & = 
   \exp \left(
   \frac{y \theta - b(\theta)}{\phi / w_i}
   -
   \frac{y \theta^\ast - b(\theta^\ast)}{\phi / w_i}
   \right)
   f(y \mid \theta^\ast, \phi, w)
\end{align*}
The exponential term is strictly positive, so the only way the distribution
of $Y$ can be concentrated at one point and have variance zero for
$\theta = \theta^\ast$ is if the distribution is concentrated at the same
point and hence has variance zero for all other $\theta$.  And using
\eqref{eq:glm-bpp} again, this would imply $b''(\theta) = 0$ for all $\theta$.
\end{proof}

The ``unless'' case in the lemma is uninteresting.  We never use probability
models for data having distributions concentrated at one point (that is,
constant random variables).  Thus (i), (ii), and (iii) of the lemma hold
for any GLM we would actually want to use.  The most important of these is (ii)
for a reason that will be explained when we return to the general theory
after the following example.

\begin{example}[Binomial Regression]
We \label{ex:binom-glm}
generalize Bernoulli regression just a bit by allowing more than
one Bernoulli variable to go with each predictor value $\boldx_i$.  Adding
those Bernoullis gives a binomial response, that is, we assume
$$
   Y_i \sim \text{Binomial}(m_i, p_i)
$$
where $m_i$ is the number of Bernoulli variables with predictor
vector $\boldx_i$.
The density for $Y_i$ is
$$
   f(y_i \mid p_i) = \binom{m_i}{y_i} p_i^{y_i} (1 - p_i)^{m_i - y_i}
$$
we try to match this up with the GLM form.  So first we write the density
as an exponential
\begin{align*}
   f(y_i \mid p_i)
   & =
   \exp \left[ y_i \log(p_i) + (m_i - y_i) \log(1 - p_i) + \log
   \binom{m_i}{y_i} \right]
   \\
   & =
   \exp \left[ y_i \log\left(\frac{p_i}{1 - p_i}\right) + m_i \log(1 - p_i)
   + \log \binom{m_i}{y_i} \right]
   \\
   & =
   \exp \left\{ m_i \bigl[ \bar{y}_i \theta_i - b(\theta_i) \bigr]
   + \log \binom{m_i}{y_i} \right\}
\end{align*}
where we have defined
\begin{align*}
   \bar{y}_i & = y_i / m_i
   \\
   \theta_i & = \logit(p_i)
   \\
   b(\theta_i) & = - \log(1 - p_i)
\end{align*}
So we see that
\begin{itemize}
\item
The \emph{canonical parameter} for the binomial model is
$\theta = \logit(p)$.  That explains why the logit link is popular.
\item
The \emph{weight} $w_i$ in the GLM density turns out to be the number
of Bernoullis $m_i$ associated with the $i$-th predictor value.  So
we see that the weight allows for grouped data like this.
\item
There is nothing like a dispersion parameter here.  For the binomial
family the dispersion is known; $\phi = 1$.
\end{itemize}
\end{example}

Returning to the general GLM model (a doubly redundant redundancy), we first
define yet another parameter, the \emph{mean value parameter}
$$
   \mu_i = E_{\theta_i, \phi}(Y_i) = b'(\theta_i).
$$
By (ii) of Lemma~\ref{lem:expo} $b'$ is a strictly increasing function,
hence an invertible function.  Thus the mapping between the canonical
parameter $\theta$ and the mean value parameter $\mu$ is an invertible
change of parameter.
Then by definition of ``link function'' the relation between the mean
value parameter $\mu_i$
and the linear predictor $\eta_i$ is given by the link function
$$
   \eta_i = g(\mu_i).
$$
The link function $g$ is required to be a strictly increasing function,
hence an invertible change of parameter.

If, as in logistic regression we take the linear predictor to be the canonical
parameter, that determines the link function, because $\eta_i = \theta_i$
implies $g^{-1}(\theta) = b'(\theta)$.  In general, as is the case in probit
regression, the link function $g$ and the function $b'$ that connects the
canonical and mean value parameters are unrelated.

It is traditional in GLM theory to make primary use of the mean value parameter
and not use the canonical parameter (unless it happens to be the same as the
linear predictor).  For that reason we want to write the variance as a function
of $\mu$ rather than $\theta$
\begin{equation} \label{eq:varmint}
   \var_{\theta_i, \phi} (Y_i) = \frac{\phi}{w} V(\mu_i)
\end{equation}
where
$$
   V(\mu) = b''(\theta) \qquad \text{when} \qquad \mu = b'(\theta)
$$
This definition of the function $V$ makes sense because the function $b'$
is an invertible mapping between mean value and canonical parameters.
The function $V$ is called the \emph{variance function} even though it
is only proportional to the variance, the complete variance being
$\phi V(\mu) / w$.

\subsection{Parameter Estimation}

Now we can write out the log likelihood derivatives
\begin{align*}
   \frac{\partial l(\boldbeta)}{\partial \beta_j}
   & =
   \sum_{i = 1}^n
   \left( \frac{y_i - b'(\theta_i)}{\phi / w_i} \right)
   \frac{\partial \theta_i}{\partial \beta_j}
   \\
   & =
   \sum_{i = 1}^n
   \left( \frac{y_i - \mu_i}{\phi / w_i} \right)
   \frac{\partial \theta_i}{\partial \beta_j}
\end{align*}
In order to completely eliminate $\theta_i$ we need to calculate the partial
derivative.  First note that
$$
   \frac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i)
$$
so by the inverse function theorem
$$
   \frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{b''(\theta_i)}
   = \frac{1}{V(\mu_i)}
$$
Now we can write
\begin{equation} \label{eq:ch-ch-chain}
   \frac{\partial \theta_i}{\partial \beta_j}
   =
   \frac{\partial \theta_i}{\partial \mu_i}
   \frac{\partial \mu_i}{\partial \eta_i}
   \frac{\partial \eta_i}{\partial \beta_j}
   =
   \frac{1}{V(\mu_i)} h'(\eta_i) x_{i j}
\end{equation}
where $h = g^{-1}$ is the inverse link function.  And we finally arrive
at the likelihood equations expressed in terms of the mean value parameter
and the linear predictor
$$
   \frac{\partial l(\boldbeta)}{\partial \beta_j}
   =
   \frac{1}{\phi}
   \sum_{i = 1}^n
   \left( \frac{y_i - \mu_i}{V(\mu_i)} \right) w_i h'(\eta_i) x_{i j}
$$
These are the equations the computer sets equal to zero and solves to find
the regression coefficients.  Note that the dispersion parameter $\phi$
appears only multiplicatively.  So it cancels when the partial derivatives
are set equal to zero.  Thus the regression coefficients can be estimated
without estimating the dispersion (just as in linear regression).

Also as in linear regression, the dispersion parameter is not estimated
by maximum likelihood but by the method of moments.  By \eqref{eq:varmint}
$$
   E \left\{ \frac{w_i (Y_i - \mu_i)^2}{V(\mu_i)} \right\}
   =
   \frac{w_i}{V(\mu_i)} \var(Y_i)
   =
   \phi
$$
Thus
$$
   \frac{1}{n} \sum_{i = 1}^n \frac{w_i (y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}
$$
would seem to be an approximately unbiased estimate of $\phi$.  Actually
it is not because $\hat{\boldmu}$ is not $\boldmu$, and
$$
   \hat{\phi} = \frac{1}{n - p} \sum_{i = 1}^n
   \frac{w_i (y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}
$$
is closer to unbiased where $p$ is the rank of the design matrix $\boldX$.
We won't bother to prove this.  The argument is
analogous to the reason for $n - p$ in linear regression.

\subsection{Fisher Information, Tests and Confidence Intervals}

The log likelihood second derivatives are
\begin{align*}
   \frac{\partial^2 l(\boldbeta)}{\partial \beta_j \partial \beta_k}
   & =
   \sum_{i = 1}^n
   \left( \frac{y_i - b'(\theta_i)}{\phi / w_i} \right)
   \frac{\partial^2 \theta_i}{\partial \beta_j \partial \beta_k}
   -
   \sum_{i = 1}^n
   \left( \frac{b''(\theta_i)}{\phi / w_i} \right)
   \frac{\partial \theta_i}{\partial \beta_j}
   \frac{\partial \theta_i}{\partial \beta_k}
\end{align*}
This is rather a mess, but because of \eqref{eq:glm-bp} the expectation
of the first sum is zero.  Thus the $j,k$ term of the expected Fisher
information is, using \eqref{eq:ch-ch-chain} and $b'' = V$,
\begin{align*}
   - E\left\{
   \frac{\partial^2 l(\boldbeta)}{\partial \beta_j \partial \beta_k} \right\}
   & =
   \sum_{i = 1}^n
   \left( \frac{b''(\theta_i)}{\phi / w_i} \right)
   \frac{\partial \theta_i}{\partial \beta_j}
   \frac{\partial \theta_i}{\partial \beta_k}
   \\
   & =
   \sum_{i = 1}^n
   \left( \frac{V(\mu_i)}{\phi / w_i} \right)
   \frac{1}{V(\mu_i)} h'(\eta_i) x_{i j}
   \frac{1}{V(\mu_i)} h'(\eta_i) x_{i k}
   \\
   & =
   \frac{1}{\phi} \sum_{i = 1}^n
   \left( \frac{w_i h'(\eta_i)^2}{V(\mu_i)} \right) x_{i j} x_{i k}
\end{align*}
We can write this as a matrix equation if we define $\boldD$ to be the
diagonal matrix with $i,i$ element
$$
   d_{i i} =
   \frac{1}{\phi} \frac{w_i h'(\eta_i)^2}{V(\mu_i)}
$$
Then
$$
   \boldI(\boldbeta) = \boldX' \boldD \boldX
$$
is the expected Fisher information matrix.  From this standard errors
for the parameter estimates, confidence intervals, test statistics,
and so forth can be derived using the usual likelihood theory.
Fortunately, we do not have to do all of this by hand.  R knows all
the formulas and computes them for us.

\section{Poisson Regression}

The Poisson model is also a GLM.  We assume responses
$$
   Y_i \sim \text{Poisson}(\mu_i)
$$
and connection between the linear predictor and regression coefficients,
as always, of the form \eqref{eq:linear-predictor}.  We only need to identify
the link and variance functions to get going.  It turns out that the
canonical link function is the log function (left as an exercise for the
reader).
The Poisson distribution distribution has the relation
$$
   \var(Y) = E(Y) = \mu
$$
connecting the mean, variance, and mean value parameter.  Thus the variance
function is $V(\mu) = \mu$, the dispersion parameter is known ($\phi = 1$),
and the weight is also unity ($w = 1$).

\begin{example}[Poisson Regression]
The \label{ex:pois} data set in the file \verb@ex12.10.1.dat@
is read by
<<read-poisson>>=
X <- read.table("ex12.10.1.dat", header = TRUE)
names(X)
attach(X)
@
simulates the hourly counts from a not necessarily homogeneous Poisson process.
The variables are \texttt{hour} and \texttt{count}, the first counting hours
sequentially throughout a 14-day period
(running from 1 to $14 \times 24 = 336$) and the second giving the count
for that hour.

The idea of the regression is to get a handle on the mean as a function of time
if it is not constant.  Many time series have a daily cycle.  If we pool
the counts for the same hour of the day over the 14 days of the series,
we see a clear pattern in the histogram.  The R \verb@hist@ function
won't do this, but we can construct the histogram ourselves
(Figure~\ref{fig:hour}) using \verb@barplot@ with the commands
<<label=fig3plot,include=FALSE>>=
hourofday <- (hour - 1) %% 24 + 1
foo <- split(count, hourofday)
foo <- sapply(foo, sum)
barplot(foo, space = 0, xlab = "hour of the day",
    ylab = "total count", names = as.character(1:24), col = 0)
@
\begin{figure}
\begin{center}
<<label=fig3,fig=TRUE,echo=FALSE>>=
<<fig3plot>>
@
\end{center}
\caption{Histogram of the total count in each hour of the day for the data for
Example~\ref{ex:pois}.}
\label{fig:hour}
\end{figure}

In contrast, if we pool the counts for each day of the week, the histogram
is fairly even (not shown).  Thus it seems to make sense to model the mean
function as being periodic with period 24 hours, and the obvious way to do
that is to use trigonometric functions.  Let us do a bunch of fits
<<glm-trig>>=
w <- hour / 24 * 2 * pi
out1 <- glm(count ~ I(sin(w)) + I(cos(w)), family = poisson)
summary(out1)
out2 <- update(out1, . ~ . + I(sin(2 * w)) + I(cos(2 * w)))
summary(out2)
out3 <- update(out2, . ~ . + I(sin(3 * w)) + I(cos(3 * w)))
summary(out3)
@
It seems from the pattern
of ``stars'' that maybe it is time to stop.
A clearer indication is given by the so-called
\emph{analysis of deviance} table, ``deviance'' being another name for
the likelihood ratio test statistic (twice the log likelihood difference
between big and small models), which has an asymptotic chi-square distribution
by standard likelihood theory.
<<trig-anova>>=
anova(out1, out2, out3, test="Chisq")
@
The approximate $P$-value for the likelihood ratio test comparing
models 1 and 2 is $P \approx 0$, which clearly indicates that model 1 should
be rejected.
The approximate $P$-value for the likelihood ratio test comparing
models 2 and 3 is $P = 0.17$, which fairly clearly indicates that model 2
should be accepted and that model 3 is unnecessary.  $P = 0.17$
indicates exceedingly weak evidence favoring the larger model.
Thus we choose model 2.

The following code
<<label=fig4plot,include=FALSE>>=
plot(hourofday, count, xlab = "hour of the day")
curve(predict(out2, data.frame(w = x / 24 * 2 * pi),
    type = "response"), add = TRUE)
@
draws the scatter plot and estimated regression function for model 2
(Figure~\ref{fig:fish}).
\begin{figure}
\begin{center}
<<label=fig4,fig=TRUE,echo=FALSE>>=
<<fig4plot>>
@
\end{center}
\caption{Scatter plot and regression curve for Example~\ref{ex:pois}
(Poisson regression with log link function).  The regression function
is trigonometric on the scale of the linear predictor with terms up to
the frequency 2 per day.}
\label{fig:fish}
\end{figure}

I hope all readers are impressed by how magically statistics works in this
example.  A glance at Figure~\ref{fig:fish} shows
\begin{itemize}
\item Poisson regression is obviously doing more or less the right thing,
\item there is no way one could put in a sensible regression function without
using theoretical statistics.  The situation is just too complicated.
\end{itemize}
\end{example}

\section{Overdispersion}

So far we have seen only models with unit dispersion
parameter ($\phi = 1)$.  This section gives an example with $\phi \neq 1$
so we can see the point of the dispersion parameter.

The reason $\phi = 1$ for binomial regression is that the mean value parameter
$p = \mu$ determines the variance $m p (1 - p) = m \mu (1 - \mu)$.  Thus the
variance function is
\begin{equation} \label{eq:bin-var-fun}
   V(\mu) = \mu (1 - \mu)
\end{equation}
and the weights are $w_i = m_i$, the sample size for each binomial variable
(this was worked out in detail in Example~\ref{ex:binom-glm}).

But what if the model is wrong?  Here is another model.  Suppose
$$
   Y_i \mid W_i \sim \text{Binomial}(m_i, W_i)
$$
where the $W_i$ are IID random variables with mean $\mu$ and variance
$\tau^2$.  Then by the usual rules for conditional probability
$$
   E(Y_i) = E\{ E(Y_i \mid W_i) \} = E(m_i W_i) = m_i \mu
$$
and
\begin{align*}
   \var(Y_i)
   & =
   E\{ \var(Y_i \mid W_i) \} + \var\{ E(Y_i \mid W_i) \}
   \\
   & =
   E\{ m_i W_i (1 - W_i) \} + \var\{ m_i W_i \}
   \\
   & =
   m_i \mu - m_i E(W_i^2) + m_i^2 \tau^2
   \\
   & =
   m_i \mu - m_i (\tau^2 + \mu^2) + m_i^2 \tau^2
   \\
   & =
   m_i \mu (1 - \mu) + m_i (m_i - 1) \tau^2
\end{align*}
This is clearly larger than the formula $m_i \mu (1 - \mu)$ one would
have for the binomial model.  Since the variance is always larger than
one would have under the binomial model.

So we know that if our response variables $Y_i$ are the sum of a random mixture
of Bernoullis rather than IID Bernoullis, we will have overdispersion.
But how to model the overdispersion?  The GLM model offers a simple solution.
Allow for general $\phi$ so we have, defining $\Ybar_i = Y_i / m_i$
\begin{align*}
   E(\Ybar_i) & = \mu_i
   \\
   \var(\Ybar_i) & = \frac{\phi}{m_i} \mu_i (1 - \mu_i)
   \\
   & = \frac{\phi}{m_i} V(\mu_i)
\end{align*}
where $V$ is the usual binomial variance function \eqref{eq:bin-var-fun}.

\begin{example}[Overdispersed Binomial Regression]
The \label{ex:overdispersed} data set in the file \verb@ex12.11.1.dat@
is read by
<<read-binom-too>>=
X <- read.table("ex12.11.1.dat", header = TRUE)
names(X)
attach(X)
@
contains some data for an overdispersed binomial model.  The commands
<<glm-over>>==
y <- cbind(succ, fail)
out.binom <- glm(y ~ x, family = binomial)
summary(out.binom)
out.quasi <- glm(y ~ x, family = quasibinomial)
summary(out.quasi)
@
fit both the binomial model (logit link and $\phi = 1$) and
the ``quasi-binomial'' (logit link again but $\phi$ is estimated with
the method of moments estimator as explained in the text).  Both models
have exactly the same maximum likelihood regression coefficients, but
because the dispersions differ, the standard errors, $z$-values, and $P$-values
differ.

Your humble author finds this a bit unsatisfactory.  If the data are really
overdispersed, then the standard errors and so forth from the latter output
are the right ones to use.  But since the dispersion was not estimated by
maximum likelihood, there is no likelihood ratio test for comparing the two
models.  Nor could your author find any other test in a brief examination of
the literature.  Apparently, if one is worried about overdispersion, one
should use the model that allows for it.  And if not, not.  But that's not
the way we operate in the rest of statistics.  I suppose I need to find out
more about overdispersion (this was first written three years ago and I
still haven't investigated this further).
\end{example}

\end{document}

\section*{Problems}

\begin{problem}
Show that \eqref{eq:logit} and \eqref{eq:inverse-logit} do indeed define
a pair of inverse functions.
\end{problem}

\begin{problem}
Do calculations similar to Example~\ref{ex:binom-glm} for the normal problem
$$
   Y_i \sim \text{Normal}(\mu_i, \sigma^2)
$$
identifying (a) the canonical parameter $\theta$,
the dispersion parameter $\phi$, and the weight $w_i$.
\end{problem}

\begin{problem}
The data set
\begin{verbatim}
http://www.stat.umn.edu/geyer/5102/ex12.8.1.dat
\end{verbatim}
\noindent
contains another predictor vector \texttt{z} besides the ones we used in
Examples~\ref{ex:dumb}, \ref{ex:logit}, and \ref{ex:probit}.  Perform the
logistic regression of \texttt{y} on \texttt{x} and \texttt{z}.  Perform
a test comparing this new model and the one fit in Example~\ref{ex:logit}
giving the $P$-value for the test and the conclusion as to which model the
test accepts.
\end{problem}

\begin{problem}
Do \label{prob:fish-cannon} calculations similar to Example~\ref{ex:binom-glm}
for the Poisson model,
showing that the canonical parameter for the Poisson distribution
is $\theta = \log(\mu)$.
\end{problem}


\end{document}