\documentclass[11pt]{article}

\usepackage{amsmath}
\usepackage[multiple]{footmisc}
\usepackage{indentfirst}
\usepackage{natbib}

\begin{document}

\begin{raggedright}
\Large Stat 5421 Lecture Notes \\
\textbf{Likelihood Inference} \\
Charles J. Geyer \\
\today
\par
\end{raggedright}

\section{Likelihood}

In statistics, the word ``likelihood'' has a technical meaning, given it
by \citet{fisher1912}.  The likelihood for a statistical model is the
probability mass function (PMF) or probability density function (PDF)
thought of as a function of parameters rather than as a function of data
$$
   L_x(\theta) = f_\theta(x).
$$
The right-hand side is a function of $x$ that is nonnegative and sums
(for a PMF) or integrates (for a PDF) to one.  The left-hand side is a
function of $\theta$ that doesn't sum or integrate to anything in
particular (and need not sum or integrate to anything finite).

Except that is not the most general notion of likelihood.  We are allowed
to drop multiplicative terms that do not contain the parameter from the
likelihood
$$
   L_x(\theta) = \frac{f_\theta(x)}{\text{term that does not contain $\theta$}}.
$$

Likelihood plays a key role in frequentist statistical inference through
the method of maximum likelihood (the rest of this document) and in
Bayesian inference (perhaps some other document).  And in neither does it
matter whether we drop multiplicative terms that do not contain the
parameter from the likelihood.  We get the same point estimates,
hypothesis tests, and confidence intervals (for frequentist inference)
and the same posterior distributions (for Bayesian inference) whether we
drop such terms or not.

In this course we are interested in statistical models for discrete data,
so the likelihood will always be a PMF, perhaps with multiplicative terms
that do not contain the parameters dropped.

For the binomial distribution the PMF is
$$
   f_\pi(x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}
$$
and the likelihood is either the same right-hand side or
$$
   L_x(\pi) = \pi^x (1 - \pi)^{n - x}
$$
(dropping a multiplicative term that does not contain the parameter $\pi$).

For the Poisson distribution the PMF is
$$
   f_\mu(x) = \frac{\mu^x}{x !} e^{- \mu}
$$
and the likelihood is either the same right-hand side or
$$
   L_x(\mu) = \mu^x e^{- \mu}
$$
(dropping a multiplicative term that does not contain the parameter $\mu$).

\section{The Method of Maximum Likelihood}

\Citet{fisher1912} proposed the method of maximum likelihood, which takes
the parameter value with the highest likelihood (for the observed data)
to be the point estimator of the unknown parameter.
\Citet{fisher1922}, the paper that really starts mathematical statistics,
giving the subject its first coherent theory, gave heuristic arguments
for why the maximum likelihood estimator (MLE) is consistent and
asymptotically normal (CAN) and efficient, where
\begin{itemize}
\item consistent means the estimator converges to the true unknown
    parameter value as sample size goes to infinity,
\item asymptotically normal means the distribution of the estimator
    converges to a normal distribution as sample size goes to infinity, and
\item efficient means the estimator is best possible in the sense of
    having the smallest possible variance of its asymptotic normal
    distribution.
\end{itemize}
Fisher's heuristic arguments were made completely rigorous by many later
authors, although some qualifications and regularity conditions were
necessary.

\subsection{Local and Global}

In theory, we distinguish two kinds of MLE.
We can seek the global maximizer of the likelihood or be satisfied with a
mere local maximizer.  In one dimension, we can graph the function and
just see where the global maximizer is.  In higher dimensions, global
optimization is hard.  Neither we nor our computers know how to do it
except in very special cases (perhaps some other document).

We know from calculus that, if a function $f$ has a maximum at a point
$x$ where it is twice differentiable, then $f'(x) = 0$ and $f''(x) \le 0$.
So these are necessary conditions for $x$ to be a maximizer of $f$.
We also know from calculus that $f'(x) = 0$ and $f''(x) < 0$ is a
sufficient condition for $x$ to be at least a local maximizer of $f$.
A \emph{local maximizer} $x$ of $f$ maximizes $f$ over a (perhaps very
small) neighborhood of $x$; every point $y$ sufficiently close to $x$
has $f(y) \le f(x)$.

There is analogous theory from multivariable calculus.  This requires
that we know what multivariable calculus considers first and second
derivatives.  We write $\nabla f(x)$ to denote the vector of first
partial derivatives, which has $i$-th component
$$
   \frac{\partial f(x)}{\partial x_i}
$$
(here $x$ is a vector having components $x_i$), and we write
$\nabla^2 f(x)$ to denote the matrix of second partial derivatives,
which has $(i, j)$-th component
$$
   \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.
$$
It also requires that we know what the terms positive semidefinite,
positive definite, negative semidefinite, and negative definite mean
as descriptions of symmetric matrices.  A symmetric matrix is
\begin{itemize}
\item positive semidefinite if all its eigenvalues are nonnegative,
\item positive definite if all its eigenvalues are positive,
\item negative semidefinite if all its eigenvalues are nonpositive, and
\item negative definite if all its eigenvalues are negative.
\end{itemize}
The necessary condition for a maximum of $f$ at a point $x$ where $f$ is
twice differentiable is $\nabla f(x) = 0$ and $\nabla^2 f(x)$ is a
negative semidefinite matrix.  The sufficient condition for a local
maximum of $f$ at a point $x$ where $f$ is twice differentiable is
$\nabla f(x) = 0$ and $\nabla^2 f(x)$ is a negative definite matrix.
<<>>=
set.seed(42)
x <- matrix(rnorm(1000), ncol = 10)
m <- (- var(x))
@
The R function \texttt{eigen} calculates eigenvalues and eigenvectors.
Suppose we have already defined a symmetric matrix \texttt{m}.  Then
<<>>=
v <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
all(v < 0)
@
says \texttt{TRUE} if and only if \texttt{m} is negative definite
(modulo inexactness of computer arithmetic --- if \texttt{m} has
eigenvalues very close to zero, it may give the wrong answer).
If we wanted to check for negative semidefinite, we would replace
\verb@all(v < 0)@ with \verb@all(v <= 0)@.

So we and our computers know how to find local maximizers (the computer
goes uphill on $f$ until it finds a point $x$ that satisfies the
sufficient condition to within the inexactness of computer arithmetic
or gives up and emits an error if it cannot find such a point).

\subsection{Global}

Under the ``usual regularity conditions'' for consistency of MLE
\citep{wald,kiefer-wolfowitz,wang} the global MLE is
consistent.\footnote{Statisticians say ``usual regularity conditions''
to refer to some unspecified regularity conditions for some theorem in
some paper or textbook that they may have seen in some theory class or
just in their reading or only have heard about without getting into the
gory details.
The reason is that there are no simple necessary and sufficient
conditions for either consistency or asymptotic normality of MLE, so
every theorem in every paper or textbook has somewhat different
conditions.  The conditions that are easier to understand are not as
sharp; the sharper conditions are harder to understand; so authors have
produced a lot of variants.  (No conditions are easy to understand,
which is why we said ``easier'' above.)  All tend to be a long laundry
list of unmotivated technicalities that are assumed just to make a
particular method of proof work.  So no such regularity conditions are
easily motivated or explained.}\footnote{The cited papers are difficult
to read.  \citet[Chapter~17]{ferguson} gives a simpler theorem, though
still one appropriate for Stat~8112, which is several levels above this
course, and not as sharp as the theory in the cited papers, where ``not
as sharp'' means \citeauthor{ferguson}'s regularity conditions do not
apply to as many statistical models as those of the papers cited.}

So this lends support to the idea that MLE should mean the \emph{global}
maximizer of the likelihood.  It also corresponds to the naive notion:
maximum means maximum (not some technicality like local maximum).
It also corresponds to \citet{fisher1912,fisher1922}, who did not discuss
local and global maximizers and thus led people to believe he meant
global.

Under the ``usual regularity conditions'' for asymptotic normality of MLE
(\citealp{cramer}, Chapters~32 and~33;
\citealp{ferguson}, Chapters~16--18;
\citealp{van-der-vaart}, Chapters~5--7;
\citealp{lecam-yang}, Chapters~6--7;
\citealp{geyer-simple})
the global MLE, if consistent,\footnote{That is, if the statistical model
also satisfies the other set of ``usual regularity conditions,'' the ones
for consistency.} is asymptotically normal.  Moreover, it is efficient
(\citealp{lecam1953}; \citealp{van-der-vaart}, Section~8.6).  In short,
the MLE is consistent, asymptotically normal, and efficient.

But these theorems also assert that even if the global MLE is not
consistent (this can happen) or does not even exist (this can also
happen), a ``good'' local maximizer of the likelihood (a local MLE) will
also be consistent, asymptotically normal, and efficient.  So the
question becomes how to find such a ``good'' local maximizer.  Theory
says start at a ``good enough'' estimator (more on this below) and then
go uphill to a local maximum\footnote{Or even do not go all the way to
the local maximum.  The theory says that, starting from a ``good enough''
estimator, doing one step of Newton's method to maximize the likelihood
is enough to produce an estimator that is consistent, asymptotically
normal, and efficient (\citealp{van-der-vaart}, Section~5.7;
\citealp{geyer-simple}).} and this will produce an estimator (a local
MLE) that is consistent, asymptotically normal, and efficient.
An estimator is a ``good enough'' starting point for finding a local MLE
if it is what the jargon calls $\sqrt{n}$-consistent, that is, if it
obeys what a textbook formerly used for Stat 1001 \citep{freedman-et-al}
calls the ``square root law'' (statistical precision varies inversely
with sample size).\footnote{Technically, an estimator $\tilde{\theta}_n$
is $\sqrt{n}$-consistent if $\sqrt{n} (\tilde{\theta}_n - \theta)$ is
bounded in probability, where $n$ is the sample size and $\theta$ is the
true unknown parameter value, and where bounded in probability means for
every $\varepsilon > 0$ there exists an $r < \infty$ such that
$\Pr\{ | \sqrt{n} (\tilde{\theta}_n - \theta) | > r \} < \varepsilon$
for all sample sizes $n$.  If $\sqrt{n} (\tilde{\theta}_n - \theta)$
converges in distribution to anything whatsoever (normality is not
necessary), then it is bounded in probability.}
<<>>=
x <- rcauchy(30, location = pi, scale = exp(1))
@
Here is an example.  Suppose we have a vector of data \texttt{x} that is
supposed to be an independent and identically distributed (IID) sample
from a Cauchy location model (Cauchy distribution centered at an
arbitrary and unknown, to be estimated, point).  A ``good enough''
starting point is the sample median, which is consistent and
asymptotically normal but not efficient \citep[Chapter~13]{ferguson}.
Starting there and going uphill on the likelihood produces the efficient
local MLE.  Here's how to do that in R.
<<>>=
mlogl <- function(mu) sum(- dcauchy(x, location = mu, log = TRUE))
nout <- nlm(mlogl, median(x))
nout$code <= 2
nout$estimate
@
If the optimizer converges (if \verb@nout$code@ is $\le 2$), then
\verb@nout$estimate@ is the efficient local MLE.  It is crucial that we
supply \texttt{median(x)} as the starting point, not some point that may
be far away from the true unknown parameter value.  We wrote
\texttt{mlogl} to calculate minus the log likelihood because \texttt{nlm}
only minimizes functions (minimizing minus the log likelihood is the same
as maximizing the log likelihood, and maximizing the log likelihood is
the same as maximizing the likelihood).

Similar code to that above works if we have multiple parameters and make
\texttt{mlogl} a function of a vector argument (more on this later).

\subsection{Both}

Since global and local MLE have the same asymptotic distribution under
the ``usual regularity conditions,'' we can ignore the distinction when
talking about this asymptotic distribution.  If $\hat{\theta}_n$ is the
MLE for sample size $n$ and $\theta$ is the true unknown parameter value,
then
\begin{equation} \label{eq:ass-mle}
   \hat{\theta}_n \approx \text{Normal}\bigl(\theta, I_n(\theta)^{- 1}\bigr)
\end{equation}
where the double wiggle sign means ``approximately distributed as'' and
\begin{equation} \label{eq:expected-information}
   I_n(\theta) = - E_\theta\{ \nabla^2 l_n(\theta) \},
\end{equation}
where
$$
   l_n(\theta) = \log L_n(\theta),
$$
and $L_n$ denotes the likelihood for sample size $n$ (what was $L_x$
before).  The matrix \eqref{eq:expected-information} is called the
\emph{Fisher information matrix}.

\subsection{Plug-In}

The fundamental equation \eqref{eq:ass-mle}, although theoretically
important, is practically useless because we don't know $\theta$.
To use it we have to ``plug in'' a consistent estimator for $\theta$
in the asymptotic variance, and the obvious estimator is the MLE.
This gives
\begin{equation} \label{eq:ass-mle-plug-in-expected}
   \hat{\theta}_n \approx \text{Normal}\bigl(\theta, I_n(\hat{\theta}_n)^{- 1}\bigr),
\end{equation}
and hence, if $\theta$ is a scalar parameter,
\begin{equation} \label{eq:ass-mle-plug-in-expected-c-i}
   \hat{\theta}_n \pm 1.96 \sqrt{ I_n(\hat{\theta}_n)^{- 1} }
\end{equation}
is a 95\% approximate large-sample confidence interval for $\theta$ and
\begin{equation} \label{eq:ass-mle-plug-in-expected-pivot}
   \frac{\hat{\theta}_n - \theta}{\sqrt{ I_n(\hat{\theta}_n)^{- 1}}}
\end{equation}
is an approximately standard normal quantity that can be used as a test
statistic for constructing hypothesis tests.

Sometimes calculating the expectations in the definition of Fisher
information \eqref{eq:expected-information} is onerous.  Then we use
another consistent estimator of asymptotic variance.  For large $n$,
under the ``usual regularity conditions'' for maximum likelihood
$$
   I_n(\hat{\theta}_n) \approx J_n(\hat{\theta}_n)
$$
where
\begin{equation} \label{eq:observed-information}
   J_n(\theta) = - \nabla^2 l_n(\theta)
\end{equation}
(that is, \eqref{eq:expected-information} and \eqref{eq:observed-information}
are the same except that in \eqref{eq:observed-information} we don't take
the expectation).  The matrix \eqref{eq:observed-information} is called
the \emph{observed information matrix}.  Some people, including your
humble author, call \eqref{eq:expected-information} the \emph{expected
Fisher information matrix} and call \eqref{eq:observed-information} the
\emph{observed Fisher information matrix}.

Thus we can replace \eqref{eq:ass-mle-plug-in-expected},
\eqref{eq:ass-mle-plug-in-expected-c-i}, and
\eqref{eq:ass-mle-plug-in-expected-pivot} with
\begin{equation} \label{eq:ass-mle-plug-in-observed}
   \hat{\theta}_n \approx \text{Normal}\bigl(\theta, J_n(\hat{\theta}_n)^{- 1}\bigr),
\end{equation}
\begin{equation} \label{eq:ass-mle-plug-in-observed-c-i}
   \hat{\theta}_n \pm 1.96 \sqrt{ J_n(\hat{\theta}_n)^{- 1} },
\end{equation}
and
\begin{equation} \label{eq:ass-mle-plug-in-observed-pivot}
   \frac{\hat{\theta}_n - \theta}{\sqrt{ J_n(\hat{\theta}_n)^{- 1}}}.
\end{equation}

\section{Two-Parameter Example}

We redo our Cauchy example with two parameters, that is, we assume both
location and scale are unknown parameters.
<<>>=
mlogl <- function(theta) {
    stopifnot(length(theta) == 2)
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    sum(- dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
}
theta.twiddle <- c(median(x), IQR(x) / 2)
nout <- nlm(mlogl, theta.twiddle)
nout$code <= 2
nout$estimate
@
The MLE is now two-dimensional, first component location and second
component scale.  As before, we use the sample median as a ``good
enough'' estimator of the location parameter.  Since the interquartile
range (IQR) of the Cauchy distribution is twice the scale parameter,
IQR / 2 is a consistent and ``good enough'' estimator of the scale.

If we ask nicely, the computer will also calculate observed Fisher
information.
<<>>=
nout <- nlm(mlogl, theta.twiddle, hessian = TRUE)
nout$code <= 2
nout$hessian
@
Standard errors for the estimators are square roots of the diagonal
elements of inverse (observed) Fisher information.
<<>>=
se <- sqrt(diag(solve(nout$hessian)))
foo <- cbind(nout$estimate, se)
rownames(foo) <- c("location", "scale")
colnames(foo) <- c("estimate", "std. err.")
round(foo, 4)
@

\section{Wald, Wilks, Rao}

\subsection{Nested Models}

There are three kinds of tests of model comparison associated with
maximum likelihood estimation.
These tests have three different test statistics but are asymptotically
equivalent (a concept explained below).  The idea is that we have two
nested models to compare, ``nested'' meaning one is a submodel of the
other.

The simplest situation in which we get nested models is when some of the
parameters in the bigger model are set to zero (constrained to be zero)
in the smaller model.  When models are specified by regression formulas,
as when using R functions like \texttt{lm} and \texttt{glm}, this happens
when all of the terms in the formula for the smaller model are in the
formula for the bigger model, but not vice versa.

More generally, for regression-like models (linear and generalized linear
models, for example) models are nested if the column space of the model
matrix of the bigger model contains the column space of the model matrix
of the smaller model, but not vice versa.

More generally, for general statistical models (not necessarily
regression-like), models are nested if every probability distribution in
the smaller model is also in the bigger model, but not vice versa.  This
happens for parametric statistical models when bigger and smaller models
are parameterized the same way and the parameter space of the bigger
model contains the parameter space of the smaller model but not vice
versa.

It is not enough for Wald, Wilks, and Rao tests that the models are
nested; they must be nested smoothly, meaning the parameter space of the
bigger model is a manifold (a possibly curved hypersurface in
$n$-dimensional space) and the parameter space of the smaller model is a
submanifold of the parameter space of the bigger model.  We won't even
try to explain that condition, but will be content to explain the special
case when some of the parameters of the bigger model are constrained to
be equal to zero in the smaller model.  Then the submanifold requirement
is that each parameter constrained to be equal to zero in the smaller
model can take any value in an open interval that includes zero in the
bigger model and can do this regardless of the values of the other
parameters.  That is, we are doing two-sided rather than one-sided tests
of the constrained parameters (collectively).  If we were doing the
multiparameter analog of one-sided tests in which some parameters are
equal to zero in the smaller model but greater than or equal to zero in
the bigger model (variance components in random effects models, for
example), then the theory for this is known
\citep{chernoff,lecam1970,self-liang,geyer-constrained} but not widely
understood and hard to use.

\subsection{The Tests}

We also need the ``usual regularity conditions'' for asymptotic normality
of maximum likelihood estimates in the bigger and smaller models, which
we denote $\hat{\theta}_n$ and $\tilde{\theta}_n$, respectively.

Then the Wilks test statistic, also called the likelihood ratio test
statistic, is
\begin{equation} \label{eq:wilks}
   T_n = 2 \bigl[ l_n(\hat{\theta}_n) - l_n(\tilde{\theta}_n) \bigr]
\end{equation}
where $l_n$ is the log likelihood for the bigger model, which means that
$\hat{\theta}_n$ and $\tilde{\theta}_n$ have the same dimension (but some
of the components of $\tilde{\theta}_n$ are equal to zero).  And Wilks's
theorem says (assuming the ``usual regularity conditions'' and the
submanifold nesting condition)
$$
   T_n \approx \text{ChiSq}(p - d),
$$
where $p$ is the dimension of the parameter space of the bigger model and
$d$ is the dimension of the parameter space of the smaller model.
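
For example, here is a minimal sketch of a likelihood ratio test for
nested Poisson regression models fit by the R function \texttt{glm}
(the data and the predictors \texttt{x1} and \texttt{x2} are simulated
purely for illustration and are not part of any real example).  The
statistic \eqref{eq:wilks} can be computed directly from the fitted
models, or the whole test can be done by the R generic function
\texttt{anova} (more on \texttt{anova} below).
<<>>=
# minimal sketch: Wilks (likelihood ratio) test for nested Poisson regressions
# simulated data; the predictors x1 and x2 are hypothetical, for illustration
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- rpois(100, lambda = exp(0.5 + 0.3 * x1))
gout.small <- glm(y ~ x1, family = poisson)
gout.big <- glm(y ~ x1 + x2, family = poisson)
# Wilks statistic computed directly from the two log likelihoods
tn <- 2 * as.numeric(logLik(gout.big) - logLik(gout.small))
tn
# p - d = 3 - 2 = 1 degree of freedom
pchisq(tn, df = 1, lower.tail = FALSE)
# the same test done by the anova method for objects of class "glm"
anova(gout.small, gout.big, test = "LRT")
@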
It is sometimes easy to calculate one of $\hat{\theta}_n$ and
$\tilde{\theta}_n$ and difficult or impossible to calculate the other.
This motivates two other procedures that are asymptotically equivalent
to the Wilks test.

The Rao test statistic is
\begin{equation} \label{eq:rao}
   R_n = \bigl(\nabla l_n(\tilde{\theta}_n)\bigr)^T
   I_n(\tilde{\theta}_n)^{-1} \nabla l_n(\tilde{\theta}_n),
\end{equation}
where $I_n(\theta)$ is expected Fisher information for sample size $n$
given by \eqref{eq:expected-information}.  We can also replace expected
Fisher information by observed Fisher information
\begin{equation} \label{eq:rao-obs}
   R_n' = \bigl(\nabla l_n(\tilde{\theta}_n)\bigr)^T
   J_n(\tilde{\theta}_n)^{-1} \nabla l_n(\tilde{\theta}_n),
\end{equation}
where $J_n(\theta)$ is observed Fisher information for sample size $n$
given by \eqref{eq:observed-information}.  We call either \eqref{eq:rao}
or \eqref{eq:rao-obs} the Rao test statistic.

Under the conditions for the Wilks theorem, the Rao test statistic is
asymptotically equivalent to the Wilks test statistic.  Both $T_n - R_n$
and $T_n - R_n'$ converge in distribution to zero as $n$ goes to
infinity, so the differences between the three test statistics $T_n$,
$R_n$, and $R_n'$ are negligible compared to their values for large
sample sizes $n$.

The test using the statistic $R_n$ is called the \emph{Rao test} or the
\emph{score test} or the \emph{Lagrange multiplier test}.  An important
point about the Rao test statistic is that, unlike the likelihood ratio
test statistic, it only depends on the MLE for the null hypothesis
$\tilde{\theta}_n$.

The Wald test statistic is
\begin{equation} \label{eq:wald}
   W_n = g(\hat{\theta}_n)^T
   \bigl[ \nabla g(\hat{\theta}_n) I_n(\hat{\theta}_n)^{-1}
   \bigl(\nabla g(\hat{\theta}_n) \bigr)^T \bigr]^{-1}
   g(\hat{\theta}_n),
\end{equation}
or
\begin{equation} \label{eq:wald-obs}
   W_n' = g(\hat{\theta}_n)^T
   \bigl[ \nabla g(\hat{\theta}_n) J_n(\hat{\theta}_n)^{-1}
   \bigl(\nabla g(\hat{\theta}_n) \bigr)^T \bigr]^{-1}
   g(\hat{\theta}_n),
\end{equation}
where, as with the Rao statistic, $I_n(\theta)$ and $J_n(\theta)$ are
expected and observed Fisher information given by
\eqref{eq:expected-information} and \eqref{eq:observed-information} and
where $g$ is a vector-to-vector constraint function such that the
submodel is the set of $\theta$ such that $g(\theta) = 0$.  In the case
of interest to us, where the smaller model sets some of the parameters of
the bigger model to zero, it is usual to take $g(\theta)$ to be the
vector of those constrained parameters, that is, $g(\theta)$ is the
vector of values of the parameters that are constrained to be zero in the
smaller model.

Under the conditions for the Wilks theorem, the Wald test statistic is
asymptotically equivalent to the Wilks test statistic.  Both $T_n - W_n$
and $T_n - W_n'$ converge in distribution to zero as $n$ goes to
infinity, so the differences between the five test statistics $T_n$,
$R_n$, $R_n'$, $W_n$, and $W_n'$ are negligible compared to their values
for large sample sizes $n$.

An important point about the Wald test statistic is that, unlike the
likelihood ratio test statistic, it only depends on the MLE for the
alternative hypothesis $\hat{\theta}_n$.

\subsection{Applications}

All five of these test statistics are widely used.  Here are some cases
where $T_n$, $R_n$, and $W_n$ are widely used.

$T_n$ is the hardest to calculate, since either $\hat{\theta}_n$ or
$\tilde{\theta}_n$ or both may require numerical optimization (like we
saw with the Cauchy examples).
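
For instance, continuing the two-parameter Cauchy example, here is a
minimal sketch of a Wilks test of the (purely illustrative, hypothetical)
null hypothesis that the location parameter is zero.  Both the MLE in the
bigger model and the MLE in the smaller model have to be found by
numerical optimization.
<<>>=
# minimal sketch: Wilks test that the Cauchy location parameter is zero
# (a hypothetical null hypothesis, used only for illustration)
mlogl.big <- function(theta)
    sum(- dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
mlogl.small <- function(sigma)
    sum(- dcauchy(x, location = 0, scale = sigma, log = TRUE))
nout.big <- nlm(mlogl.big, c(median(x), IQR(x) / 2))
nout.small <- nlm(mlogl.small, IQR(x) / 2)
nout.big$code <= 2 && nout.small$code <= 2
# nlm minimizes minus the log likelihood, so the minimum components
# are minus the maximized log likelihoods
tn <- 2 * (nout.small$minimum - nout.big$minimum)
tn
pchisq(tn, df = 1, lower.tail = FALSE)
@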
But the Wilks statistic is nevertheless widely used when such
optimization has to be done anyway.  The R generic function \texttt{anova}
when applied to objects produced by the R function \texttt{glm} does
likelihood ratio tests (which it calls analysis of deviance, ``deviance''
being another name for the likelihood ratio test statistic
\eqref{eq:wilks}).  So everyone who does tests of model comparison using
these R functions is doing likelihood ratio tests.

The Pearson chi-square test is a special case of the Rao test.  So
everyone who does tests of model comparison using the Pearson chi-square
test is doing Rao tests (also called score tests).  Since this is a very
old and widely understood procedure, many users use it (in all
statistical software, not just R).

The test statistics and $P$-values in the output of the R generic
function \texttt{summary} applied to objects produced by the R function
\texttt{glm} are Wald tests.  This is obvious from the fact that only the
MLE in the big model is computed.  The relevant smaller models are those
that drop the predictor for the line of the output having a particular
test statistic and $P$-value.  Those smaller models do not need to be
fitted in order to print the output of the \texttt{summary} function.
So everyone who pays attention to test statistics and $P$-values in the
output of the R function \texttt{summary} is doing Wald tests.

An example where $R_n'$ or $W_n'$ would be used is when we use the
computer to calculate Fisher information (the \texttt{hessian} component
of the output of the R function \texttt{nlm} as in the Cauchy example).
That is observed Fisher information because the computer only knows how
to approximate derivatives but does not know how to do expectations.
If we use such information to make confidence intervals (as we did in the
Cauchy example), those are Wald confidence intervals (obtained by
inverting Wald tests).

\subsection{Binomial Examples}

Recall that the log likelihood for a binomial distribution with sample
size $n$ and parameter $\pi$ is
$$
   l_n(\pi) = x \log(\pi) + (n - x) \log(1 - \pi),
$$
the score (first derivative) is
\begin{align*}
   l_n'(\pi)
   & = \frac{x}{\pi} - \frac{n - x}{1 - \pi}
   \\
   & = \frac{x - n \pi}{\pi (1 - \pi)},
\end{align*}
the MLE is
$$
   \hat{\pi}_n = x / n,
$$
observed Fisher information is
$$
   J_n(\pi) = \frac{x}{\pi^2} + \frac{n - x}{(1 - \pi)^2},
$$
and expected Fisher information is
$$
   I_n(\pi) = \frac{n}{\pi (1 - \pi)}.
$$

We consider two-tailed tests with hypotheses
\begin{align*}
   H_0 : & \pi = \pi_0
   \\
   H_1 : & \pi \neq \pi_0
\end{align*}
The MLE in the smaller model is $\pi_0$ because there are no free
parameters and nothing to estimate: the parameter is constrained to have
this value.
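
As a quick sanity check, we can verify these formulas numerically (a
minimal sketch; the particular values of $n$, $x$, and the parameter
below are arbitrary, chosen only for illustration).  The score and
observed Fisher information agree with finite-difference approximations
to the derivatives of the log likelihood, and numerical optimization of
the log likelihood recovers the MLE $x / n$.
<<>>=
# arbitrary values, for checking the formulas only
n <- 200
x <- 53
pi.try <- 0.4
logl <- function(p) x * log(p) + (n - x) * log(1 - p)
# score and observed Fisher information from the formulas above
score <- (x - n * pi.try) / (pi.try * (1 - pi.try))
obs.info <- x / pi.try^2 + (n - x) / (1 - pi.try)^2
# compare with central finite-difference approximations
eps <- 1e-5
all.equal(score, (logl(pi.try + eps) - logl(pi.try - eps)) / (2 * eps))
all.equal(obs.info,
    - (logl(pi.try + eps) - 2 * logl(pi.try) + logl(pi.try - eps)) / eps^2,
    tolerance = 1e-4)
# loose tolerance above because finite differencing of second derivatives
# is inexact in computer arithmetic
# the MLE found by numerical optimization agrees (approximately) with x / n
optimize(logl, interval = c(0.001, 0.999), maximum = TRUE)$maximum
x / n
@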
\subsubsection{Wilks}

The Wilks (likelihood ratio) test statistic is
\begin{align*}
   T_n
   & =
   2 \bigl[ x \log(\hat{\pi}_n) + (n - x) \log(1 - \hat{\pi}_n) \bigr]
   - 2 \bigl[ x \log(\pi_0) + (n - x) \log(1 - \pi_0) \bigr]
   \\
   & =
   2 \left[ x \log\left(\frac{\hat{\pi}_n}{\pi_0}\right)
   + (n - x) \log\left(\frac{1 - \hat{\pi}_n}{1 - \pi_0}\right) \right]
\end{align*}

\subsubsection{Rao}

Observed and expected Fisher information are different after plug-in of
the MLE for the smaller model
\begin{align*}
   I_n(\pi_0)
   & =
   \frac{n}{\pi_0 (1 - \pi_0)}
   \\
   J_n(\pi_0)
   & =
   \frac{x}{\pi_0^2} + \frac{n - x}{(1 - \pi_0)^2}
   \\
   & =
   \frac{n \hat{\pi}_n}{\pi_0^2} + \frac{n (1 - \hat{\pi}_n)}{(1 - \pi_0)^2}
\end{align*}
so $R_n$ and $R_n'$ are different
\begin{align*}
   R_n
   & =
   \left( \frac{x - n \pi_0}{\pi_0 (1 - \pi_0)} \right)^2
   \frac{\pi_0 (1 - \pi_0)}{n}
   \\
   & =
   \frac{(x - n \pi_0)^2}{n \pi_0 (1 - \pi_0)}
   \\
   R_n'
   & =
   \left( \frac{x - n \pi_0}{\pi_0 (1 - \pi_0)} \right)^2
   \left( \frac{n \hat{\pi}_n}{\pi_0^2}
   + \frac{n (1 - \hat{\pi}_n)}{(1 - \pi_0)^2} \right)^{- 1}
   \\
   & =
   \frac{(x - n \pi_0)^2}{n \hat{\pi}_n (1 - \pi_0)^2 + n (1 - \hat{\pi}_n) \pi_0^2}
\end{align*}
% ((x - n pi0) / (pi0 (1 - pi0)))^2 *
%     ((n pihat) / pi0^2 + n (1 - pihat) / (1 - pi0)^2)^{-1}
%
% (x - n pi0)^2 / (n pihat (1 - pi0)^2 + n (1 - pihat) pi0^2)

\subsubsection{Wald}

Observed and expected Fisher information are the same after plug-in of
the MLE for the bigger model
\begin{align*}
   I_n(\hat{\pi}_n)
   & =
   \frac{n}{\hat{\pi}_n (1 - \hat{\pi}_n)}
   \\
   J_n(\hat{\pi}_n)
   & =
   \frac{x}{\hat{\pi}_n^2} + \frac{n - x}{(1 - \hat{\pi}_n)^2}
   \\
   & =
   \frac{n \hat{\pi}_n}{\hat{\pi}_n^2} + \frac{n (1 - \hat{\pi}_n)}{(1 - \hat{\pi}_n)^2}
   \\
   & =
   \frac{n}{\hat{\pi}_n (1 - \hat{\pi}_n)}
\end{align*}
so $W_n$ and $W_n'$ are the same.  The function $g$ we want to use for
this application of Wald tests is
$$
   g(\pi) = \pi - \pi_0,
$$
which has derivative
$$
   \frac{d g(\pi)}{d \pi} = 1,
$$
so
\begin{align*}
   W_n
   & =
   (\hat{\pi}_n - \pi_0)^2 \left(I_n(\hat{\pi}_n)^{- 1}\right)^{- 1}
   \\
   & =
   \frac{n (\hat{\pi}_n - \pi_0)^2}{\hat{\pi}_n (1 - \hat{\pi}_n)}
\end{align*}
<<>>=
one <- 1
three <- 3
pi0 <- one / three
n <- 200
x <- 53
pi.hat <- x / n
tn <- 2 * (x * log(pi.hat / pi0) + (n - x) * log((1 - pi.hat) / (1 - pi0)))
tn.p <- pchisq(tn, 1, lower.tail = FALSE)
rn <- (x - n * pi0)^2 / (n * pi0 * (1 - pi0))
rn.p <- pchisq(rn, 1, lower.tail = FALSE)
rnp <- (x - n * pi0)^2 / (n * (pi.hat * (1 - pi0)^2 + (1 - pi.hat) * pi0^2))
rnp.p <- pchisq(rnp, 1, lower.tail = FALSE)
wn <- n * (pi.hat - pi0)^2 / (pi.hat * (1 - pi.hat))
wn.p <- pchisq(wn, 1, lower.tail = FALSE)
@
If we try these out with $\pi_0 = \Sexpr{one} / \Sexpr{three}$,
$n = \Sexpr{n}$, and $x = \Sexpr{x}$, we obtain
\begin{alignat*}{2}
   T_n & = \Sexpr{round(tn, 4)} & \qquad & (P = \Sexpr{round(tn.p, 4)})
   \\
   R_n & = \Sexpr{round(rn, 4)} & \qquad & (P = \Sexpr{round(rn.p, 4)})
   \\
   R_n' & = \Sexpr{round(rnp, 4)} & \qquad & (P = \Sexpr{round(rnp.p, 4)})
   \\
   W_n & = \Sexpr{round(wn, 4)} & \qquad & (P = \Sexpr{round(wn.p, 4)})
\end{alignat*}
We see the asymptotic equivalence in the fact that all four tests give
nearly the same results.

The test recommended by introductory statistics books has test statistic
$$
   Z_n = \frac{\hat{\pi}_n - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}
$$
which has an asymptotic standard normal distribution.
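
Computing $Z_n$ from the quantities above and squaring it reproduces the
Rao test statistic exactly (up to the inexactness of computer
arithmetic); a minimal check:
<<>>=
zn <- (pi.hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
zn
all.equal(zn^2, rn)
@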
Since $R_n = Z_n^2$ and the square of a standard normal random variable
has a chi-square distribution with one degree of freedom, which is the
reference distribution for the Wilks, Rao, and Wald tests in this
situation (dimension zero for the smaller model and dimension one for
the bigger model), the two tests are the same.  So the test recommended
by introductory statistics books for this situation is the Rao test.

\begin{thebibliography}{}

\bibitem[Chernoff(1954)]{chernoff}
Chernoff, H. (1954).
\newblock On the distribution of the likelihood ratio.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{25}, 573--578.

\bibitem[Cram\'{e}r(1946)]{cramer}
Cram\'{e}r,~H. (1946).
\newblock \emph{Mathematical Methods of Statistics}.
\newblock Princeton University Press, Princeton.

\bibitem[Ferguson(1996)]{ferguson}
Ferguson,~T.~S. (1996).
\newblock \emph{A Course in Large Sample Theory}.
\newblock Chapman \& Hall, London.

\bibitem[Fisher(1912)]{fisher1912}
Fisher, R.~A. (1912).
\newblock On an absolute criterion for fitting frequency curves.
\newblock \emph{Messenger of Mathematics}, \textbf{41}, 155--160.

\bibitem[Fisher(1922)]{fisher1922}
Fisher, R.~A. (1922).
\newblock On the mathematical foundations of theoretical statistics.
\newblock \emph{Philosophical Transactions of the Royal Society of
    London, Series A}, \textbf{222}, 309--368.

\bibitem[Freedman et al.(2007)Freedman, Pisani, and Purves]{freedman-et-al}
Freedman, D., Pisani, R., and Purves, R. (2007).
\newblock \emph{Statistics}, 4th ed.
\newblock W. W. Norton \& Company, New York.

\bibitem[Geyer(1994)]{geyer-constrained}
Geyer, C.~J. (1994).
\newblock On the asymptotics of constrained $M$-estimation.
\newblock \emph{Annals of Statistics}, \textbf{22}, 1993--2010.

\bibitem[Geyer(2013)]{geyer-simple}
Geyer, C.~J. (2013).
\newblock Asymptotics of maximum likelihood without the LLN or CLT or
    sample size going to infinity.
\newblock In \emph{Advances in Modern Statistical Theory and Applications:
    A Festschrift in honor of Morris L. Eaton}, G.~L. Jones and X. Shen,
    eds., IMS Collections, Vol. 10, pp. 1--24.
\newblock Institute of Mathematical Statistics: Hayward, CA.

\bibitem[Kiefer and Wolfowitz(1956)]{kiefer-wolfowitz}
Kiefer, J. and Wolfowitz, J. (1956).
\newblock Consistency of the maximum likelihood estimator in the presence
    of infinitely many incidental parameters.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{27}, 887--906.

\bibitem[Le Cam(1953)]{lecam1953}
Le Cam, L. (1953).
\newblock On some asymptotic properties of maximum likelihood estimates
    and related Bayes estimates.
\newblock \emph{University of California Publications in Statistics},
    \textbf{1}, 277--330.

\bibitem[Le Cam(1970)]{lecam1970}
Le Cam, L. (1970).
\newblock On the assumptions used to prove asymptotic normality of
    maximum likelihood estimates.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{41}, 802--828.

\bibitem[Le~Cam and Yang(2000)]{lecam-yang}
Le~Cam,~L., and Yang, G.~L. (2000).
\newblock \emph{Asymptotics in Statistics: Some Basic Concepts}, second ed.
\newblock Springer-Verlag, New York.

\bibitem[Self and Liang(1987)]{self-liang}
Self, S.~G., and Liang, K.-Y. (1987).
\newblock Asymptotic properties of maximum likelihood estimators and
    likelihood ratio tests under nonstandard conditions.
\newblock \emph{Journal of the American Statistical Association},
    \textbf{82}, 605--610.

\bibitem[van~der Vaart(2000)]{van-der-vaart}
van~der Vaart,~A.~W. (2000).
\newblock \emph{Asymptotic Statistics}.
\newblock Cambridge University Press, Cambridge.
\bibitem[Wald(1949)]{wald}
Wald, A. (1949).
\newblock Note on the consistency of the maximum likelihood estimate.
\newblock \emph{Annals of Mathematical Statistics}, \textbf{20}, 595--601.

\bibitem[Wang(1985)]{wang}
Wang, J.-L. (1985).
\newblock Strong consistency of approximate maximum likelihood estimators
    with applications in nonparametrics.
\newblock \emph{Annals of Statistics}, \textbf{13}, 932--946.

\end{thebibliography}

\end{document}