\documentclass[11pt]{article}

\usepackage{amsmath}
\usepackage[multiple]{footmisc}
\usepackage{indentfirst}
\usepackage{natbib}

\begin{document}

\begin{raggedright}
\Large Stat 5421 Lecture Notes \\
\textbf{Likelihood Inference} \\
Charles J. Geyer \\
\today
\par
\end{raggedright}

\section{Likelihood}

In statistics, the word ``likelihood'' has a technical meaning, given it
by \citet{fisher1912}.  The likelihood for a statistical model is the
probability mass function (PMF) or probability density function (PDF)
thought of as a function of parameters rather than as a function of data
$$
   L_x(\theta) = f_\theta(x).
$$
The right-hand side is a function of $x$ that is nonnegative and sums
(for a PMF) or integrates (for a PDF) to one.  The left-hand side is a
function of $\theta$ that doesn't sum or integrate to anything in
particular (and need not sum or integrate to anything finite).

Except that is not the most general notion of likelihood.  We are allowed
to drop multiplicative terms that do not contain the parameter from the
likelihood
$$
   L_x(\theta) = \frac{f_\theta(x)}{\text{term that does not contain $\theta$}}.
$$

Likelihood plays a key role in frequentist statistical inference through
the method of maximum likelihood (the rest of this document) and in
Bayesian inference (perhaps some other document).  And in neither does it
matter whether we drop multiplicative terms that do not contain the
parameter from the likelihood.  We get the same point estimates,
hypothesis tests, and confidence intervals (for frequentist inference)
and the same posterior distributions (for Bayesian inference) whether we
drop such terms or not.

In this course we are interested in statistical models for discrete data,
so the likelihood will always be a PMF, perhaps with multiplicative terms
that do not contain the parameters dropped.

For the binomial distribution the PMF is
$$
   f_\pi(x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}
$$
and the likelihood is either the same right-hand side or
$$
   L_x(\pi) = \pi^x (1 - \pi)^{n - x}
$$
(dropping a multiplicative term that does not contain the parameter $\pi$).

For the Poisson distribution the PMF is
$$
   f_\mu(x) = \frac{\mu^x}{x !} e^{- \mu}
$$
and the likelihood is either the same right-hand side or
$$
   L_x(\mu) = \mu^x e^{- \mu}
$$
(dropping a multiplicative term that does not contain the parameter $\mu$).

\section{The Method of Maximum Likelihood}

\Citet{fisher1912} proposed the method of maximum likelihood, which takes
the parameter value with the highest likelihood (for the observed data)
to be the point estimator of the unknown parameter.
\Citet{fisher1922}, the paper that really starts mathematical statistics,
giving the subject its first coherent theory, gave heuristic arguments
for why the maximum likelihood estimator (MLE) is consistent and
asymptotically normal (CAN) and efficient, where
\begin{itemize}
\item consistent means the estimator converges to the true unknown
    parameter value as sample size goes to infinity,
\item asymptotically normal means the distribution of the estimator
    converges to a normal distribution as sample size goes to infinity, and
\item efficient means the estimator is best possible in the sense of
    having the smallest possible variance of its asymptotic normal
    distribution.
\end{itemize}
Fisher's heuristic arguments were made completely rigorous by many later
authors, although some qualifications and regularity conditions were
necessary.

\subsection{Local and Global}

In theory, we distinguish two kinds of MLE.
We can seek the global maximizer of the likelihood or be satisfied with a
mere local maximizer.  In one dimension, we can graph the function and
just see where the global maximizer is.  In higher dimensions, global
optimization is hard.  Neither we nor our computers know how to do it
except in very special cases (perhaps some other document).

We know from calculus that, if a function $f$ has a maximum at a point
$x$ where it is twice differentiable, then $f'(x) = 0$ and $f''(x) \le 0$.
So these are necessary conditions for $x$ to be a maximizer of $f$.
We also know from calculus that $f'(x) = 0$ and $f''(x) < 0$ is a
sufficient condition for $x$ to be at least a local maximizer of $f$.
A \emph{local maximizer} $x$ of $f$ maximizes $f$ over a (perhaps very
small) neighborhood of $x$; every point $y$ sufficiently close to $x$
has $f(y) \le f(x)$.

There is analogous theory from multivariable calculus.  This requires
that we know what multivariable calculus considers first and second
derivatives.  We write $\nabla f(x)$ to denote the vector of first
partial derivatives, which has $i$-th component
$$
   \frac{\partial f(x)}{\partial x_i}
$$
(here $x$ is a vector having components $x_i$), and we write
$\nabla^2 f(x)$ to denote the matrix of second partial derivatives,
which has $(i, j)$-th component
$$
   \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.
$$
It also requires that we know what the terms positive semidefinite,
positive definite, negative semidefinite, and negative definite mean
as descriptions of symmetric matrices.  A symmetric matrix is
\begin{itemize}
\item positive semidefinite if all its eigenvalues are nonnegative,
\item positive definite if all its eigenvalues are positive,
\item negative semidefinite if all its eigenvalues are nonpositive, and
\item negative definite if all its eigenvalues are negative.
\end{itemize}
The necessary condition for a maximum of $f$ at a point $x$ where $f$ is
twice differentiable is $\nabla f(x) = 0$ and $\nabla^2 f(x)$ is a
negative semidefinite matrix.  The sufficient condition for a local
maximum of $f$ at a point $x$ where $f$ is twice differentiable is
$\nabla f(x) = 0$ and $\nabla^2 f(x)$ is a negative definite matrix.
<<>>=
set.seed(42)
x <- matrix(rnorm(1000), ncol = 10)
m <- (- var(x))
@
The R function \texttt{eigen} calculates eigenvalues and eigenvectors.
Suppose we have already defined a symmetric matrix \texttt{m}.  Then
<<>>=
v <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
all(v < 0)
@
says \texttt{TRUE} if and only if \texttt{m} is negative definite
(modulo inexactness of computer arithmetic --- if \texttt{m} has
eigenvalues very close to zero, it may give the wrong answer).
If we wanted to check for negative semidefinite, we would replace
\verb@all(v < 0)@ with \verb@all(v <= 0)@.

So we and our computers know how to find local maximizers (the computer
goes uphill on $f$ until it finds a point $x$ that satisfies the
sufficient condition to within the inexactness of computer arithmetic
or gives up and emits an error if it cannot find such a point).

\subsection{Global}

Under the ``usual regularity conditions'' for consistency of MLE
\citep{wald,kiefer-wolfowitz,wang} the global MLE is
consistent.\footnote{Statisticians say ``usual regularity conditions''
to refer to some unspecified regularity conditions for some theorem in
some paper or textbook that they may have seen in some theory class or
just in their reading or only have heard about without getting into the
gory details.
The reason is that there are no simple necessary and sufficient
conditions for either consistency or asymptotic normality of MLE, so
every theorem in every paper or textbook has somewhat different
conditions.  The conditions that are easier to understand are not as
sharp; the sharper conditions are harder to understand; so authors have
produced a lot of variants.  (No conditions are easy to understand,
which is why we said ``easier'' above.)  All tend to be a long laundry
list of unmotivated technicalities that are assumed just to make a
particular method of proof work.  So no such regularity conditions are
easily motivated or explained.}\footnote{The cited papers are difficult
to read.  \citet[Chapter~17]{ferguson} gives a simpler theorem, though
still one appropriate for Stat~8112, which is several levels above this
course, and not as sharp as the theory in the cited papers, where ``not
as sharp'' means \citeauthor{ferguson}'s regularity conditions do not
apply to as many statistical models as those of the papers cited.}

So this lends support to the idea that MLE should mean the \emph{global}
maximizer of the likelihood.  It also corresponds to the naive notion:
maximum means maximum (not some technicality like local maximum).
It also corresponds to \citet{fisher1912,fisher1922}, who did not discuss
local and global maximizers and thus led people to believe he meant
global.

Under the ``usual regularity conditions'' for asymptotic normality of MLE
(\citealp{cramer}, Chapters~32 and~33;
\citealp{ferguson}, Chapters~16--18;
\citealp{van-der-vaart}, Chapters~5--7;
\citealp{lecam-yang}, Chapters~6--7;
\citealp{geyer-simple})
the global MLE, if consistent,\footnote{That is, if the statistical model
also satisfies the other set of ``usual regularity conditions,'' the ones
for consistency.} is asymptotically normal.  Moreover, it is efficient
(\citealp{lecam1953}; \citealp{van-der-vaart}, Section~8.6).  In short,
the MLE is consistent, asymptotically normal, and efficient.

But these theorems also assert that even if the global MLE is not
consistent (this can happen) or does not even exist (this can also
happen), a ``good'' local maximizer of the likelihood (a local MLE) will
also be consistent, asymptotically normal, and efficient.  So the
question becomes how to find such a ``good'' local maximizer.  Theory
says start at a ``good enough'' estimator (more on this below) and then
go uphill to a local maximum\footnote{Or even do not go all the way to
the local maximum.  The theory says that, starting from a ``good enough''
estimator, doing one step of Newton's method to maximize the likelihood
is enough to produce an estimator that is consistent, asymptotically
normal, and efficient (\citealp{van-der-vaart}, Section~5.7;
\citealp{geyer-simple}).} and this will produce an estimator (a local
MLE) that is consistent, asymptotically normal, and efficient.
An estimator is a ``good enough'' starting point for finding a local MLE
if it is what the jargon calls $\sqrt{n}$-consistent, that is, if it
obeys what a textbook formerly used for Stat 1001 \citep{freedman-et-al}
calls the ``square root law'' (statistical precision varies inversely
with sample size).\footnote{Technically, an estimator $\tilde{\theta}_n$
is $\sqrt{n}$-consistent if $\sqrt{n} (\tilde{\theta}_n - \theta)$ is
bounded in probability, where $n$ is the sample size and $\theta$ is the
true unknown parameter value, and where bounded in probability means for
every $\varepsilon > 0$ there exists an $r < \infty$ such that
$\Pr\{ | \sqrt{n} (\tilde{\theta}_n - \theta) | > r \} < \varepsilon$
for all sample sizes $n$.  If $\sqrt{n} (\tilde{\theta}_n - \theta)$
converges in distribution to anything whatsoever (normality is not
necessary), then it is bounded in probability.}
<<>>=
x <- rcauchy(30, location = pi, scale = exp(1))
@
Here is an example.  Suppose we have a vector of data \texttt{x} that is
supposed to be an independent and identically distributed (IID) sample
from a Cauchy location model (Cauchy distribution centered at an
arbitrary and unknown, to be estimated, point).  A ``good enough''
starting point is the sample median, which is consistent and
asymptotically normal but not efficient \citep[Chapter~13]{ferguson}.
Starting there and going uphill on the likelihood produces the efficient
local MLE.  Here's how to do that in R.
<<>>=
mlogl <- function(mu) sum(- dcauchy(x, location = mu, log = TRUE))
nout <- nlm(mlogl, median(x))
nout$code <= 2
nout$estimate
@
If the optimizer converges (if \verb@nout$code@ is $\le 2$), then
\verb@nout$estimate@ is the efficient local MLE.  It is crucial that we
supply \texttt{median(x)} as the starting point, not some point that may
be far away from the true unknown parameter value.  We wrote
\texttt{mlogl} to calculate minus the log likelihood because \texttt{nlm}
only minimizes functions (minimizing minus the log likelihood is the same
as maximizing the log likelihood, and maximizing the log likelihood is
the same as maximizing the likelihood).

Similar code to that above works if we have multiple parameters and make
\texttt{mlogl} a function of a vector argument (more on this later).

\subsection{Both}

Since global and local MLE have the same asymptotic distribution under
the ``usual regularity conditions,'' we can ignore the distinction when
talking about this asymptotic distribution.  If $\hat{\theta}_n$ is the
MLE for sample size $n$ and $\theta$ is the true unknown parameter value,
then
\begin{equation} \label{eq:ass-mle}
   \hat{\theta}_n \approx \text{Normal}\bigl(\theta, I_n(\theta)^{- 1}\bigr)
\end{equation}
where the double wiggle sign means ``approximately distributed as'' and
\begin{equation} \label{eq:expected-information}
   I_n(\theta) = - E_\theta\{ \nabla^2 l_n(\theta) \},
\end{equation}
where
$$
   l_n(\theta) = \log L_n(\theta),
$$
and $L_n$ denotes the likelihood for sample size $n$ (what was $L_x$
before).  The matrix \eqref{eq:expected-information} is called the
\emph{Fisher information matrix}.

\subsection{Plug-In}

The fundamental equation \eqref{eq:ass-mle}, although theoretically
important, is practically useless because we don't know $\theta$.
To use it we have to ``plug in'' a consistent estimator for $\theta$
in the asymptotic variance, and the obvious estimator is the MLE.
This gives
\begin{equation} \label{eq:ass-mle-plug-in-expected}
   \hat{\theta}_n \approx \text{Normal}\bigl(\theta, I_n(\hat{\theta}_n)^{- 1}\bigr),
\end{equation}
and hence, if $\theta$ is a scalar parameter,
\begin{equation} \label{eq:ass-mle-plug-in-expected-c-i}
   \hat{\theta}_n \pm 1.96 \sqrt{ I_n(\hat{\theta}_n)^{- 1} }
\end{equation}
is a 95\% approximate large-sample confidence interval for $\theta$ and
\begin{equation} \label{eq:ass-mle-plug-in-expected-pivot}
   \frac{\hat{\theta}_n - \theta}{\sqrt{ I_n(\hat{\theta}_n)^{- 1}}}
\end{equation}
is an approximately standard normal quantity that can be used as a test
statistic for constructing hypothesis tests.

Sometimes calculating the expectations in the definition of Fisher
information \eqref{eq:expected-information} is onerous.  Then we use
another consistent estimator of asymptotic variance.  For large $n$,
under the ``usual regularity conditions'' for maximum likelihood
$$
   I_n(\hat{\theta}_n) \approx J_n(\hat{\theta}_n)
$$
where
\begin{equation} \label{eq:observed-information}
   J_n(\theta) = - \nabla^2 l_n(\theta)
\end{equation}
(that is, \eqref{eq:expected-information} and \eqref{eq:observed-information}
are the same except that in \eqref{eq:observed-information} we don't take
the expectation).  The matrix \eqref{eq:observed-information} is called
the \emph{observed information matrix}.  Some people, including your
humble author, call \eqref{eq:expected-information} the \emph{expected
Fisher information matrix} and call \eqref{eq:observed-information} the
\emph{observed Fisher information matrix}.

Thus we can replace \eqref{eq:ass-mle-plug-in-expected},
\eqref{eq:ass-mle-plug-in-expected-c-i}, and
\eqref{eq:ass-mle-plug-in-expected-pivot} with
\begin{equation} \label{eq:ass-mle-plug-in-observed}
   \hat{\theta}_n \approx \text{Normal}\bigl(\theta, J_n(\hat{\theta}_n)^{- 1}\bigr),
\end{equation}
\begin{equation} \label{eq:ass-mle-plug-in-observed-c-i}
   \hat{\theta}_n \pm 1.96 \sqrt{ J_n(\hat{\theta}_n)^{- 1} },
\end{equation}
and
\begin{equation} \label{eq:ass-mle-plug-in-observed-pivot}
   \frac{\hat{\theta}_n - \theta}{\sqrt{ J_n(\hat{\theta}_n)^{- 1}}}.
\end{equation}

\section{Two-Parameter Example}

We redo our Cauchy example with two parameters, that is, we assume both
location and scale are unknown parameters.
<<>>=
mlogl <- function(theta) {
    stopifnot(length(theta) == 2)
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    sum(- dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
}
theta.twiddle <- c(median(x), IQR(x) / 2)
nout <- nlm(mlogl, theta.twiddle)
nout$code <= 2
nout$estimate
@
The MLE is now two-dimensional, first component location and second
component scale.  As before, we use the sample median as a ``good
enough'' estimator of the location parameter.  Since the interquartile
range (IQR) of the Cauchy distribution is twice the scale parameter,
IQR / 2 is a consistent and ``good enough'' estimator of the scale.

If we ask nicely, the computer will also calculate observed Fisher
information.
<<>>=
nout <- nlm(mlogl, theta.twiddle, hessian = TRUE)
nout$code <= 2
nout$hessian
@
Standard errors for the estimators are square roots of the diagonal
elements of inverse (observed) Fisher information.
<<>>=
se <- sqrt(diag(solve(nout$hessian)))
foo <- cbind(nout$estimate, se)
rownames(foo) <- c("location", "scale")
colnames(foo) <- c("estimate", "std. err.")
round(foo, 4)
@

\section{Wald, Wilks, Rao}

\subsection{Nested Models}

There are three kinds of tests of model comparison associated with
maximum likelihood estimation.
These tests have three different test statistics but are asymptotically
equivalent (a concept explained below).  The idea is that we have two
nested models to compare, ``nested'' meaning one is a submodel of the
other.

The simplest situation in which we get nested models is when some of the
parameters in the bigger model are set to zero (constrained to be zero)
in the smaller model.  When models are specified by regression formulas,
as when using R functions like \texttt{lm} and \texttt{glm}, this happens
when all of the terms in the formula for the smaller model are in the
formula for the bigger model, but not vice versa.

More generally, for regression-like models (linear and generalized linear
models, for example) models are nested if the column space of the model
matrix of the bigger model contains the column space of the model matrix
of the smaller model, but not vice versa.

More generally, for general statistical models (not necessarily
regression-like), models are nested if every probability distribution in
the smaller model is also in the bigger model, but not vice versa.  This
happens for parametric statistical models when bigger and smaller models
are parameterized the same way and the parameter space of the bigger
model contains the parameter space of the smaller model but not vice
versa.

It is not enough for Wald, Wilks, and Rao tests that the models are
nested; they must be nested smoothly, meaning the parameter space of the
bigger model is a manifold (a possibly curved hypersurface in
$n$-dimensional space) and the parameter space of the smaller model is a
submanifold of the parameter space of the bigger model.  We won't even
try to explain that condition, but will be content to explain the special
case when some of the parameters of the bigger model are constrained to
be equal to zero in the smaller model.  Then the submanifold requirement
is that each parameter constrained to be equal to zero in the smaller
model can take any value in an open interval that includes zero in the
bigger model and can do this regardless of the values of the other
parameters.  That is, we are doing two-sided rather than one-sided tests
of the constrained parameters (collectively).  If we were doing the
multiparameter analog of one-sided tests in which some parameters are
equal to zero in the smaller model but greater than or equal to zero in
the bigger model (variance components in random effects models, for
example), then the theory for this is known
\citep{chernoff,lecam1970,self-liang,geyer-constrained} but not widely
understood and hard to use.

\subsection{The Tests}

We also need the ``usual regularity conditions'' for asymptotic normality
of maximum likelihood estimates in the bigger and smaller models, which
we denote $\hat{\theta}_n$ and $\tilde{\theta}_n$, respectively.

Then the Wilks test statistic, also called the likelihood ratio test
statistic, is
\begin{equation} \label{eq:wilks}
   T_n = 2 \bigl[ l_n(\hat{\theta}_n) - l_n(\tilde{\theta}_n) \bigr]
\end{equation}
where $l_n$ is the log likelihood for the bigger model, which means that
$\hat{\theta}_n$ and $\tilde{\theta}_n$ have the same dimension (but some
of the components of $\tilde{\theta}_n$ are equal to zero).  And Wilks's
theorem says (assuming the ``usual regularity conditions'' and the
submanifold nesting condition)
$$
   T_n \approx \text{ChiSq}(p - d),
$$
where $p$ is the dimension of the parameter space of the bigger model and
$d$ is the dimension of the parameter space of the smaller model.
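
For example, here is a minimal sketch of a likelihood ratio test for
nested Poisson regression models fit by the R function \texttt{glm}
(the data and the predictors \texttt{x1} and \texttt{x2} are simulated
purely for illustration and are not part of any real example).  The
statistic \eqref{eq:wilks} can be computed directly from the fitted
models, or the whole test can be done by the R generic function
\texttt{anova} (more on \texttt{anova} below).
<<>>=
# minimal sketch: Wilks (likelihood ratio) test for nested Poisson regressions
# simulated data; the predictors x1 and x2 are hypothetical, for illustration
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- rpois(100, lambda = exp(0.5 + 0.3 * x1))
gout.small <- glm(y ~ x1, family = poisson)
gout.big <- glm(y ~ x1 + x2, family = poisson)
# Wilks statistic computed directly from the two log likelihoods
tn <- 2 * as.numeric(logLik(gout.big) - logLik(gout.small))
tn
# p - d = 3 - 2 = 1 degree of freedom
pchisq(tn, df = 1, lower.tail = FALSE)
# the same test done by the anova method for objects of class "glm"
anova(gout.small, gout.big, test = "LRT")
@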
It is sometimes easy to calculate one of $\hat{\theta}_n$ and
$\tilde{\theta}_n$ and difficult or impossible to calculate the other.
This motivates two other procedures that are asymptotically equivalent
to the Wilks test.

The Rao test statistic is
\begin{equation} \label{eq:rao}
   R_n = \bigl(\nabla l_n(\tilde{\theta}_n)\bigr)^T
   I_n(\tilde{\theta}_n)^{-1} \nabla l_n(\tilde{\theta}_n),
\end{equation}
where $I_n(\theta)$ is expected Fisher information for sample size $n$
given by \eqref{eq:expected-information}.  We can also replace expected
Fisher information by observed Fisher information
\begin{equation} \label{eq:rao-obs}
   R_n' = \bigl(\nabla l_n(\tilde{\theta}_n)\bigr)^T
   J_n(\tilde{\theta}_n)^{-1} \nabla l_n(\tilde{\theta}_n),
\end{equation}
where $J_n(\theta)$ is observed Fisher information for sample size $n$
given by \eqref{eq:observed-information}.  We call either \eqref{eq:rao}
or \eqref{eq:rao-obs} the Rao test statistic.

Under the conditions for the Wilks theorem, the Rao test statistic is
asymptotically equivalent to the Wilks test statistic.  Both $T_n - R_n$
and $T_n - R_n'$ converge in distribution to zero as $n$ goes to
infinity, so the differences between the three test statistics $T_n$,
$R_n$, and $R_n'$ are negligible compared to their values for large
sample sizes $n$.

The test using the statistic $R_n$ is called the \emph{Rao test} or the
\emph{score test} or the \emph{Lagrange multiplier test}.  An important
point about the Rao test statistic is that, unlike the likelihood ratio
test statistic, it only depends on the MLE for the null hypothesis
$\tilde{\theta}_n$.

The Wald test statistic is
\begin{equation} \label{eq:wald}
   W_n = g(\hat{\theta}_n)^T
   \bigl[ \nabla g(\hat{\theta}_n) I_n(\hat{\theta}_n)^{-1}
   \bigl(\nabla g(\hat{\theta}_n) \bigr)^T \bigr]^{-1}
   g(\hat{\theta}_n),
\end{equation}
or
\begin{equation} \label{eq:wald-obs}
   W_n' = g(\hat{\theta}_n)^T
   \bigl[ \nabla g(\hat{\theta}_n) J_n(\hat{\theta}_n)^{-1}
   \bigl(\nabla g(\hat{\theta}_n) \bigr)^T \bigr]^{-1}
   g(\hat{\theta}_n),
\end{equation}
where, as with the Rao statistic, $I_n(\theta)$ and $J_n(\theta)$ are
expected and observed Fisher information given by
\eqref{eq:expected-information} and \eqref{eq:observed-information} and
where $g$ is a vector-to-vector constraint function such that the
submodel is the set of $\theta$ such that $g(\theta) = 0$.  In the case
of interest to us, where the smaller model sets some of the parameters of
the bigger model to zero, it is usual to take $g(\theta)$ to be the
vector of those constrained parameters, that is, $g(\theta)$ is the
vector of values of the parameters that are constrained to be zero in the
smaller model.

Under the conditions for the Wilks theorem, the Wald test statistic is
asymptotically equivalent to the Wilks test statistic.  Both $T_n - W_n$
and $T_n - W_n'$ converge in distribution to zero as $n$ goes to
infinity, so the differences between the five test statistics $T_n$,
$R_n$, $R_n'$, $W_n$, and $W_n'$ are negligible compared to their values
for large sample sizes $n$.

An important point about the Wald test statistic is that, unlike the
likelihood ratio test statistic, it only depends on the MLE for the
alternative hypothesis $\hat{\theta}_n$.

\subsection{Applications}

All five of these test statistics are widely used.  Here are some cases
where $T_n$, $R_n$, and $W_n$ are widely used.

$T_n$ is the hardest to calculate, since either $\hat{\theta}_n$ or
$\tilde{\theta}_n$ or both may require numerical optimization (like we
saw with the Cauchy examples).
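
For instance, continuing the two-parameter Cauchy example, here is a
minimal sketch of a Wilks test of the (purely illustrative, hypothetical)
null hypothesis that the location parameter is zero.  Both the MLE in the
bigger model and the MLE in the smaller model have to be found by
numerical optimization.
<<>>=
# minimal sketch: Wilks test that the Cauchy location parameter is zero
# (a hypothetical null hypothesis, used only for illustration)
mlogl.big <- function(theta)
    sum(- dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
mlogl.small <- function(sigma)
    sum(- dcauchy(x, location = 0, scale = sigma, log = TRUE))
nout.big <- nlm(mlogl.big, c(median(x), IQR(x) / 2))
nout.small <- nlm(mlogl.small, IQR(x) / 2)
nout.big$code <= 2 && nout.small$code <= 2
# nlm minimizes minus the log likelihood, so the minimum components
# are minus the maximized log likelihoods
tn <- 2 * (nout.small$minimum - nout.big$minimum)
tn
pchisq(tn, df = 1, lower.tail = FALSE)
@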
But the Wilks statistic is nevertheless widely used when such
optimization has to be done anyway.  The R generic function \texttt{anova}
when applied to objects produced by the R function \texttt{glm} does
likelihood ratio tests (which it calls analysis of deviance, ``deviance''
being another name for the likelihood ratio test statistic
\eqref{eq:wilks}).  So everyone who does tests of model comparison using
these R functions is doing likelihood ratio tests.

The Pearson chi-square test is a special case of the Rao test.  So
everyone who does tests of model comparison using the Pearson chi-square
test is doing Rao tests (also called score tests).  Since this is a very
old and widely understood procedure, many users use it (in all
statistical software, not just R).

The test statistics and $P$-values in the output of the R generic
function \texttt{summary} applied to objects produced by the R function
\texttt{glm} are Wald tests.  This is obvious from the fact that only the
MLE in the big model is computed.  The relevant smaller models are those
that drop the predictor for the line of the output having a particular
test statistic and $P$-value.  Those smaller models do not need to be
fitted in order to print the output of the \texttt{summary} function.
So everyone who pays attention to test statistics and $P$-values in the
output of the R function \texttt{summary} is doing Wald tests.

An example where $R_n'$ or $W_n'$ would be used is when we use the
computer to calculate Fisher information (the \texttt{hessian} component
of the output of the R function \texttt{nlm} as in the Cauchy example).
That is observed Fisher information because the computer only knows how
to approximate derivatives but does not know how to do expectations.
If we use such information to make confidence intervals (as we did in the
Cauchy example), those are Wald confidence intervals (obtained by
inverting Wald tests).

\subsection{Binomial Examples}

Recall that the log likelihood for a binomial distribution with sample
size $n$ and parameter $\pi$ is
$$
   l_n(\pi) = x \log(\pi) + (n - x) \log(1 - \pi),
$$
the score (first derivative) is
\begin{align*}
   l_n'(\pi)
   & = \frac{x}{\pi} - \frac{n - x}{1 - \pi}
   \\
   & = \frac{x - n \pi}{\pi (1 - \pi)},
\end{align*}
the MLE is
$$
   \hat{\pi}_n = x / n,
$$
observed Fisher information is
$$
   J_n(\pi) = \frac{x}{\pi^2} + \frac{n - x}{(1 - \pi)^2},
$$
and expected Fisher information is
$$
   I_n(\pi) = \frac{n}{\pi (1 - \pi)}.
$$

We consider two-tailed tests with hypotheses
\begin{align*}
   H_0 : & \pi = \pi_0
   \\
   H_1 : & \pi \neq \pi_0
\end{align*}
The MLE in the smaller model is $\pi_0$ because there are no free
parameters and nothing to estimate: the parameter is constrained to have
this value.
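
As a quick sanity check, we can verify these formulas numerically (a
minimal sketch; the particular values of $n$, $x$, and the parameter
below are arbitrary, chosen only for illustration).  The score and
observed Fisher information agree with finite-difference approximations
to the derivatives of the log likelihood, and numerical optimization of
the log likelihood recovers the MLE $x / n$.
<<>>=
# arbitrary values, for checking the formulas only
n <- 200
x <- 53
pi.try <- 0.4
logl <- function(p) x * log(p) + (n - x) * log(1 - p)
# score and observed Fisher information from the formulas above
score <- (x - n * pi.try) / (pi.try * (1 - pi.try))
obs.info <- x / pi.try^2 + (n - x) / (1 - pi.try)^2
# compare with central finite-difference approximations
eps <- 1e-5
all.equal(score, (logl(pi.try + eps) - logl(pi.try - eps)) / (2 * eps))
all.equal(obs.info,
    - (logl(pi.try + eps) - 2 * logl(pi.try) + logl(pi.try - eps)) / eps^2,
    tolerance = 1e-4)
# loose tolerance above because finite differencing of second derivatives
# is inexact in computer arithmetic
# the MLE found by numerical optimization agrees (approximately) with x / n
optimize(logl, interval = c(0.001, 0.999), maximum = TRUE)$maximum
x / n
@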
\subsubsection{Wilks}

The Wilks (likelihood ratio) test statistic is
\begin{align*}
   T_n
   & =
   2 \bigl[ x \log(\hat{\pi}_n) + (n - x) \log(1 - \hat{\pi}_n) \bigr]
   - 2 \bigl[ x \log(\pi_0) + (n - x) \log(1 - \pi_0) \bigr]
   \\
   & =
   2 \left[ x \log\left(\frac{\hat{\pi}_n}{\pi_0}\right)
   + (n - x) \log\left(\frac{1 - \hat{\pi}_n}{1 - \pi_0}\right) \right]
\end{align*}

\subsubsection{Rao}

Observed and expected Fisher information are different after plug-in of
the MLE for the smaller model
\begin{align*}
   I_n(\pi_0)
   & =
   \frac{n}{\pi_0 (1 - \pi_0)}
   \\
   J_n(\pi_0)
   & =
   \frac{x}{\pi_0^2} + \frac{n - x}{(1 - \pi_0)^2}
   \\
   & =
   \frac{n \hat{\pi}_n}{\pi_0^2} + \frac{n (1 - \hat{\pi}_n)}{(1 - \pi_0)^2}
\end{align*}
so $R_n$ and $R_n'$ are different
\begin{align*}
   R_n
   & =
   \left( \frac{x - n \pi_0}{\pi_0 (1 - \pi_0)} \right)^2
   \frac{\pi_0 (1 - \pi_0)}{n}
   \\
   & =
   \frac{(x - n \pi_0)^2}{n \pi_0 (1 - \pi_0)}
   \\
   R_n'
   & =
   \left( \frac{x - n \pi_0}{\pi_0 (1 - \pi_0)} \right)^2
   \left( \frac{n \hat{\pi}_n}{\pi_0^2}
   + \frac{n (1 - \hat{\pi}_n)}{(1 - \pi_0)^2} \right)^{- 1}
   \\
   & =
   \frac{(x - n \pi_0)^2}{n \hat{\pi}_n (1 - \pi_0)^2 + n (1 - \hat{\pi}_n) \pi_0^2}
\end{align*}
% ((x - n pi0) / (pi0 (1 - pi0)))^2 *
%     ((n pihat) / pi0^2 + n (1 - pihat) / (1 - pi0)^2)^{-1}
%
% (x - n pi0)^2 / (n pihat (1 - pi0)^2 + n (1 - pihat) pi0^2)

\subsubsection{Wald}

Observed and expected Fisher information are the same after plug-in of
the MLE for the bigger model
\begin{align*}
   I_n(\hat{\pi}_n)
   & =
   \frac{n}{\hat{\pi}_n (1 - \hat{\pi}_n)}
   \\
   J_n(\hat{\pi}_n)
   & =
   \frac{x}{\hat{\pi}_n^2} + \frac{n - x}{(1 - \hat{\pi}_n)^2}
   \\
   & =
   \frac{n \hat{\pi}_n}{\hat{\pi}_n^2} + \frac{n (1 - \hat{\pi}_n)}{(1 - \hat{\pi}_n)^2}
   \\
   & =
   \frac{n}{\hat{\pi}_n (1 - \hat{\pi}_n)}
\end{align*}
so $W_n$ and $W_n'$ are the same.  The function $g$ we want to use for
this application of Wald tests is
$$
   g(\pi) = \pi - \pi_0,
$$
which has derivative
$$
   \frac{d g(\pi)}{d \pi} = 1,
$$
so
\begin{align*}
   W_n
   & =
   (\hat{\pi}_n - \pi_0)^2 \left(I_n(\hat{\pi}_n)^{- 1}\right)^{- 1}
   \\
   & =
   \frac{n (\hat{\pi}_n - \pi_0)^2}{\hat{\pi}_n (1 - \hat{\pi}_n)}
\end{align*}
<<>>=
one <- 1
three <- 3
pi0 <- one / three
n <- 200
x <- 53
pi.hat <- x / n
tn <- 2 * (x * log(pi.hat / pi0) + (n - x) * log((1 - pi.hat) / (1 - pi0)))
tn.p <- pchisq(tn, 1, lower.tail = FALSE)
rn <- (x - n * pi0)^2 / (n * pi0 * (1 - pi0))
rn.p <- pchisq(rn, 1, lower.tail = FALSE)
rnp <- (x - n * pi0)^2 / (n * (pi.hat * (1 - pi0)^2 + (1 - pi.hat) * pi0^2))
rnp.p <- pchisq(rnp, 1, lower.tail = FALSE)
wn <- n * (pi.hat - pi0)^2 / (pi.hat * (1 - pi.hat))
wn.p <- pchisq(wn, 1, lower.tail = FALSE)
@
If we try these out with $\pi_0 = \Sexpr{one} / \Sexpr{three}$,
$n = \Sexpr{n}$, and $x = \Sexpr{x}$, we obtain
\begin{alignat*}{2}
   T_n & = \Sexpr{round(tn, 4)} & \qquad & (P = \Sexpr{round(tn.p, 4)})
   \\
   R_n & = \Sexpr{round(rn, 4)} & \qquad & (P = \Sexpr{round(rn.p, 4)})
   \\
   R_n' & = \Sexpr{round(rnp, 4)} & \qquad & (P = \Sexpr{round(rnp.p, 4)})
   \\
   W_n & = \Sexpr{round(wn, 4)} & \qquad & (P = \Sexpr{round(wn.p, 4)})
\end{alignat*}
We see the asymptotic equivalence in the fact that all four tests give
nearly the same results.

The test recommended by introductory statistics books has test statistic
$$
   Z_n = \frac{\hat{\pi}_n - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}
$$
which has an asymptotic standard normal distribution.
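
Computing $Z_n$ from the quantities above and squaring it reproduces the
Rao test statistic exactly (up to the inexactness of computer
arithmetic); a minimal check:
<<>>=
zn <- (pi.hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
zn
all.equal(zn^2, rn)
@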
Since $R_n = Z_n^2$ and the square of a standard normal random variable
has a chi-square distribution with one degree of freedom, which is the
reference distribution for the Wilks, Rao, and Wald tests in this
situation (dimension zero for the smaller model and dimension one for
the bigger model), the two tests are the same.  So the test recommended
by introductory statistics books for this situation is the Rao test.

\begin{thebibliography}{}

\bibitem[Chernoff(1954)]{chernoff}
Chernoff, H. (1954).
\newblock On the distribution of the likelihood ratio.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{25}, 573--578.

\bibitem[Cram\'{e}r(1946)]{cramer}
Cram\'{e}r,~H. (1946).
\newblock \emph{Mathematical Methods of Statistics}.
\newblock Princeton University Press, Princeton.

\bibitem[Ferguson(1996)]{ferguson}
Ferguson,~T.~S. (1996).
\newblock \emph{A Course in Large Sample Theory}.
\newblock Chapman \& Hall, London.

\bibitem[Fisher(1912)]{fisher1912}
Fisher, R.~A. (1912).
\newblock On an absolute criterion for fitting frequency curves.
\newblock \emph{Messenger of Mathematics}, \textbf{41}, 155--160.

\bibitem[Fisher(1922)]{fisher1922}
Fisher, R.~A. (1922).
\newblock On the mathematical foundations of theoretical statistics.
\newblock \emph{Philosophical Transactions of the Royal Society of
    London, Series A}, \textbf{222}, 309--368.

\bibitem[Freedman et al.(2007)Freedman, Pisani, and Purves]{freedman-et-al}
Freedman, D., Pisani, R., and Purves, R. (2007).
\newblock \emph{Statistics}, 4th ed.
\newblock W. W. Norton \& Company, New York.

\bibitem[Geyer(1994)]{geyer-constrained}
Geyer, C.~J. (1994).
\newblock On the asymptotics of constrained $M$-estimation.
\newblock \emph{Annals of Statistics}, \textbf{22}, 1993--2010.

\bibitem[Geyer(2013)]{geyer-simple}
Geyer, C.~J. (2013).
\newblock Asymptotics of maximum likelihood without the LLN or CLT or
    sample size going to infinity.
\newblock In \emph{Advances in Modern Statistical Theory and Applications:
    A Festschrift in honor of Morris L. Eaton}, G.~L. Jones and X. Shen,
    eds., IMS Collections, Vol. 10, pp. 1--24.
\newblock Institute of Mathematical Statistics: Hayward, CA.

\bibitem[Kiefer and Wolfowitz(1956)]{kiefer-wolfowitz}
Kiefer, J. and Wolfowitz, J. (1956).
\newblock Consistency of the maximum likelihood estimator in the presence
    of infinitely many incidental parameters.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{27}, 887--906.

\bibitem[Le Cam(1953)]{lecam1953}
Le Cam, L. (1953).
\newblock On some asymptotic properties of maximum likelihood estimates
    and related Bayes estimates.
\newblock \emph{University of California Publications in Statistics},
    \textbf{1}, 277--330.

\bibitem[Le Cam(1970)]{lecam1970}
Le Cam, L. (1970).
\newblock On the assumptions used to prove asymptotic normality of
    maximum likelihood estimates.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{41}, 802--828.

\bibitem[Le~Cam and Yang(2000)]{lecam-yang}
Le~Cam,~L., and Yang, G.~L. (2000).
\newblock \emph{Asymptotics in Statistics: Some Basic Concepts}, second ed.
\newblock Springer-Verlag, New York.

\bibitem[Self and Liang(1987)]{self-liang}
Self, S.~G., and Liang, K.-Y. (1987).
\newblock Asymptotic properties of maximum likelihood estimators and
    likelihood ratio tests under nonstandard conditions.
\newblock \emph{Journal of the American Statistical Association},
    \textbf{82}, 605--610.

\bibitem[van~der Vaart(2000)]{van-der-vaart}
van~der Vaart,~A.~W. (2000).
\newblock \emph{Asymptotic Statistics}.
\newblock Cambridge University Press, Cambridge.
\bibitem[Wald(1949)]{wald}
Wald, A. (1949).
\newblock Note on the consistency of the maximum likelihood estimate.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{20}, 595--601.

\bibitem[Wang(1985)]{wang}
Wang, J.-L. (1985).
\newblock Strong consistency of approximate maximum likelihood estimators
    with applications in nonparametrics.
\newblock \emph{Annals of Statistics}, \textbf{13}, 932--946.

\end{thebibliography}

\end{document}