\documentclass[11pt]{article}
\usepackage{indentfirst}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{amscd}
\usepackage{natbib}
\usepackage{url}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\newcommand{\real}{\mathbb{R}}
\newcommand{\inner}[1]{\langle #1 \rangle}
\newcommand{\set}[1]{\{\, #1 \,\}}
\newcommand{\bigset}[1]{\left\{\, #1 \,\right\}}
\newcommand{\fatdot}{\,\cdot\,}
\newcommand{\opand}{\mathbin{\rm and}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\rint}{rint}
\DeclareMathOperator{\con}{con}
\DeclareMathOperator{\pos}{pos}
\DeclareMathOperator{\cl}{cl}
\DeclareMathOperator{\spam}{span}
\DeclareMathOperator{\rc}{rc}
\DeclareMathOperator{\myPr}{Pr}
\DeclareMathOperator{\supp}{supp}
\DeclareMathOperator{\rank}{rank}
\DeclareMathOperator{\logit}{logit}
\setlength{\textheight}{\paperheight}
\addtolength{\textheight}{- 2 in}
\setlength{\topmargin}{0.25 pt}
\setlength{\headheight}{0 pt}
\setlength{\headsep}{0 pt}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \footskip}
\setlength{\textwidth}{\paperwidth}
\addtolength{\textwidth}{- 2 in}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{0.25 in}
\addtolength{\textwidth}{- \oddsidemargin}
\addtolength{\textwidth}{- \evensidemargin}
\begin{document}
\vspace*{0.9375in}
\begin{center}
{\bfseries Supporting Theory and Data Analysis for ``Likelihood Inference
in Exponential Families and Directions of Recession''} \\
By \\
Charles J. Geyer \\
Technical Report No.~672 \\
School of Statistics \\
University of Minnesota \\
% April 20, 2005 \\
\today
\end{center}
\thispagestyle{empty}
\cleardoublepage
\setcounter{page}{1}
\thispagestyle{empty}
\begin{abstract}
When in a full exponential family the maximum likelihood estimate (MLE)
does not exist, the MLE may exist in the Barndorff-Nielsen completion of the
family \citep{barndorff,brown,thesis}.
A practical algorithm for finding the MLE in the completion using repeated
linear programming was proposed in the author's unpublished thesis
\citep{thesis} and used in \citet{gt}.
Now we propose a slightly different method, also using
repeated linear programming with the R contributed package \texttt{rcdd}
\citep{rcdd},
which makes straightforward the calculation of the MLE in the Barndorff-Nielsen
completion for any models satisfying a condition of \citet{brown} and for which
some R function can calculate the MLE when it does exist, for example,
generalized linear models (GLM) and aster models \citep{gws,aster}.
In this technical report we give details of two GLM examples.
Likelihood ratio tests of model comparison are almost unchanged from
the usual case.
Only the degrees of freedom need be adjusted when the MLE for the null
hypothesis lies in the completion rather than the original family.
Confidence intervals are changed much more. When the MLE
for the natural parameter does not exist, it can be thought of as having
gone to infinity in a certain direction, which we call a generic direction
of recession. Here we propose a new kind of one-sided confidence
interval, not involving asymptotic approximation, for how close to infinity
the true unknown natural parameter value may be. This maps to a one-sided
confidence interval for the mean value parameter showing how close to the
boundary of its support it may be.
\end{abstract}
<<>>=
options(keep.source = TRUE, width = 65)
ps.options(pointsize = 15)
pdf.options(pointsize = 15)
@
\section{R Package Rcdd}
We use the R statistical computing environment \citep{rcore} in our analysis.
It is free software and can be obtained from
\url{http://cran.r-project.org}. Precompiled binaries
are available for Windows, Macintosh, and popular Linux distributions.
We use the contributed package \verb@rcdd@.
If R has been installed, but this package has
not yet been installed, do
\begin{verbatim}
install.packages("rcdd")
\end{verbatim}
from the R command line
(or do the equivalent using the GUI menus if on Macintosh
or Windows). This may require root or administrator privileges.
If the \verb@rcdd@ package has been installed, we load it
<<>>=
library(rcdd)
@
<<>>=
baz <- library(help = "rcdd")
baz <- baz$info[[1]]
baz <- baz[grep("Version", baz)]
baz <- sub("^Version: *", "", baz)
bazzer <- paste(R.version$major, R.version$minor, sep = ".")
@
The version of the package used to make this document
is \Sexpr{baz}.
The version of R used to make this document is \Sexpr{bazzer}.
This entire document and all of the calculations shown were made using
the R command \texttt{Sweave} and hence are exactly reproducible by
anyone who has R and the R noweb (RNW) file from which it was created.
Both the RNW file and the PDF document produced from it are available at
\url{http://www.stat.umn.edu/geyer/gdor}.
\section{Introduction}
When in a full exponential family the maximum likelihood estimate (MLE)
does not exist, the MLE may exist in the Barndorff-Nielsen completion of the
family \citep[and references cited at the beginning of Chapter~2 therein]
{barndorff,brown,thesis}. In this case the MLE in the completion
can be thought of as a pair $(\hat{\theta}, \delta)$ satisfying
\begin{equation} \label{eq:char-one}
\lim_{s \to \infty} l(\hat{\theta} + s \delta)
=
\sup_{\theta \in \Theta} l(\theta).
\end{equation}
Although $\hat{\theta}$ and $\delta$ satisfying \eqref{eq:char-one} are
not necessarily unique, the distribution which is the limit as $s \to \infty$
of distributions having parameter values $\hat{\theta} + s \delta$ is unique.
A more complicated but also more interesting characterization of $\delta$
is that it is an element of the relative interior of the normal cone
$N_C(y)$ of the convex support $C$ at the point $y$, where $C$ is
the smallest closed convex set that contains the natural statistic $Y$
almost surely for one of the distributions in the family (and hence
for all of them), and $y$ is the observed value of $Y$. This makes determination
of $\delta$ an exercise in computational geometry. When $C$ is a polyhedral
convex set, $\delta$ can be determined using the R functions \texttt{linearity}
and \texttt{lpcdd} in the \texttt{rcdd} contributed package \citep{rcdd}
for the R statistical computing environment \citep{rcore}.
A more complicated but also more interesting characterization of $\hat{\theta}$
is that it is the MLE in the limiting conditional model, which is the family
of distributions that are limits in distribution as $s \to \infty$ of
distributions having parameters $\theta + s \delta$ for all $\theta \in \Theta$.
This limiting conditional model, described in Section~\ref{sec:limits} below,
is an exponential family, the given exponential family conditioned on the event
$\inner{Y - y, \delta} = 0$, where $Y$ is the natural statistic,
$y$ is its observed value, and $\inner{\fatdot, \fatdot}$ is the bilinear
form given by \eqref{eq:inner} below.
Thus computation
of $\hat{\theta}$ is maximum likelihood estimation
in an exponential family and
can often be done using available software.
Some regularity conditions are necessary for the above theory to be correct.
They are given in Section~\ref{sec:assume} below. These regularity conditions
are not restrictive, being satisfied by all applications known to us
involving full exponential families.
\citet{thesis} has algorithms that work
on non-full but convex families, meaning the set of natural parameter values
is a convex subset of the natural parameter space that is closed relative
to it, and \citet{gt} have an application of such families,
but we do not discuss them here.
So what is new here? \citet{thesis} provided a method of finding the MLE
in the Barndorff-Nielsen completion that was effective and, like the methods
recommended here, was based on repeated linear programming. However, that
work was never published,
partly because it depended on high-quality linear programming
software, which was not easy for statisticians to use and certainly not
available in widely used statistical computing environments. With the
advent of the \texttt{rcdd} package \citep{rcdd}, high quality linear
programming is now available in R, so it was time
to revisit the issue. Some new ideas have been
added, so we do not closely follow \citet{thesis}. In particular,
the algorithms presented in Sections~\ref{sec:linearity} and~\ref{sec:gdor}
below, which are the heart of our methodology, differ from those
of \citet{thesis}.
Finally, the hypothesis tests and confidence intervals proposed in
Sections~\ref{sec:tests} and~\ref{sec:confidence} below are new,
although we cannot claim priority for our proposal for testing,
which was suggested by S. Fienberg (personal communication).
Section~\ref{sec:example} below contains explicit examples that show
how all calculations related to our methodology are carried out in R.
\section{Theory}
\subsection{Exponential Families}
An exponential family of distributions \citep{barndorff,brown,thesis} is
a statistical model having log likelihood
\begin{equation} \label{eq:logl}
l(\theta) = \inner{y, \theta} - c(\theta),
\end{equation}
where $y$ is a vector statistic, $\theta$ is a vector parameter,
and $(y, \theta) \mapsto \inner{y, \theta}$ is a bilinear form.
In all computations, we will assume $y$ and $\theta$ are elements of $\real^p$
and
\begin{equation} \label{eq:inner}
\inner{y, \theta} = \sum_{i = 1}^p y_i \theta_i,
\end{equation}
but in theory any bilinear form works.
A statistic $y$ and parameter $\theta$ that give a log likelihood of this
form are called \emph{natural} or \emph{canonical}.
We will say \emph{natural}.
The function $c$ is called the \emph{cumulant function} of the family.
The distribution with parameter value $\theta$ has a density with respect
to the distribution with parameter value $\psi$ of the form
\begin{equation} \label{eq:density}
f_\theta(\omega)
=
e^{\inner{Y(\omega), \theta - \psi} - c(\theta) + c(\psi)}.
\end{equation}
The requirement that this integrate to one determines
the function $c$ up to an additive constant
\begin{equation} \label{eq:cumulant}
c(\theta) = c(\psi) + \log E_\psi \bigl( e^{\inner{Y, \theta - \psi}} \bigr).
\end{equation}
We take \eqref{eq:cumulant} to be valid for all $\theta$ in $\real^p$,
defining $c(\theta) = \infty$ for $\theta$ such that the expectation
in \eqref{eq:cumulant} does not exist. Since the argument of the expectation
operator in \eqref{eq:cumulant} is strictly positive, so is the expectation;
hence the cumulant function takes values that are either real or $+ \infty$ and
the log likelihood function takes values that are either real or $- \infty$.
Define
\begin{equation} \label{eq:space}
\Theta = \set{ \theta \in \real^p : c(\theta) < \infty }.
\end{equation}
The exponential family
is \emph{full} if its natural parameter space is \eqref{eq:space}.
We shall be interested only in full exponential families.
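For a concrete example (standard, and relevant to the logistic
regression examples later), consider the Bernoulli distribution with
success probability $p$ and one observation $y \in \{0, 1\}$.
Writing $\theta = \logit(p)$, the log likelihood is
\begin{equation*}
l(\theta) = y \theta - \log \bigl( 1 + e^\theta \bigr),
\end{equation*}
which has the form \eqref{eq:logl} with cumulant function
$c(\theta) = \log(1 + e^\theta)$. Since $c$ is finite for every
real $\theta$, the natural parameter space \eqref{eq:space} is all
of $\real$, and the family is full.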
% \begin{theorem} \label{th:concave}
% The cumulant function of a full exponential family is
% lower semicontinuous and convex. The log likelihood function
% is upper semicontinuous and concave.
% \end{theorem}
\subsection{Convex Support, Tangent Cone, and Normal Cone}
The \emph{convex support} of an exponential family is the smallest closed
convex set that contains the natural statistic with probability one under
some distribution in the family \citep[p.~90]{barndorff},
in which case this is true for all
distributions in the family, because the distributions are mutually absolutely
continuous because the densities \eqref{eq:density} are everywhere nonzero.
The \emph{tangent cone} of a convex set $C$ at a point $y \in C$ is
\begin{equation} \label{eq:tangent-cone}
T_C(y) = \cl \set{ s (w - y) : \text{$w \in C$ and $s \ge 0$} },
\end{equation}
where $\cl$ denotes the closure operation \citep[Theorem~6.9]{raw}.
The \emph{normal cone} of a convex set $C$ in $\real^p$ at a point $y \in C$ is
\begin{equation} \label{eq:normal-cone}
N_C(y)
=
\set{ \delta \in \real^p :
\text{$\inner{w - y, \delta} \le 0$ for all $w \in C$} }
\end{equation}
\citep[Theorem~6.9]{raw}.
Tangent and normal cones are polars of each other
\citep[Theorem~6.9 and~Corollary~6.30]{raw}.
Each determines the other.
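A toy example (ours, for illustration): let $C$ be the nonnegative
quadrant $[0, \infty)^2$ in $\real^2$. At the point $y = (0, 0)$
we have $T_C(y) = C$ and
\begin{equation*}
N_C(y) = \set{ \delta \in \real^2 : \delta_1 \le 0 \opand \delta_2 \le 0 },
\end{equation*}
while at any interior point $y$ of $C$ we have $T_C(y) = \real^2$
and $N_C(y) = \{0\}$, as one checks directly from
\eqref{eq:tangent-cone} and \eqref{eq:normal-cone}.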
\subsection{Directions of Recession and Constancy}
Directions of recession and constancy of convex and concave functions
are defined by \citet[p.~69]{rocky}. We apply these notions to log
likelihoods of full exponential families.
Proofs of all theorems are given in Section~\ref{sec:proofs}.
\begin{theorem} \label{th:constancy}
For some vector $\delta$ and
for a full exponential family with log likelihood \eqref{eq:logl},
natural parameter space $\Theta$, convex support $C$, natural statistic $Y$,
and observed value
of the natural statistic $y$ such that $y \in C$, the following are equivalent.
\begin{enumerate}
\item[\upshape (a)] There exists a $\theta \in \Theta$
such that $s \mapsto l(\theta + s \delta)$ is
not a strictly concave function on the interval where it is finite.
\item[\upshape (b)] For all $\theta \in \Theta$ the function
$s \mapsto l(\theta + s \delta)$ is constant on $\real$.
\item[\upshape (c)] The parameter values $\theta$ and $\theta + s \delta$
correspond to the same probability distribution for
some $\theta \in \Theta$ and some $s \neq 0$.
\item[\upshape (d)] The parameter values $\theta$ and $\theta + s \delta$
correspond to the same probability distribution for
all $\theta \in \Theta$ and all real $s$.
\item[\upshape (e)] $\inner{Y - y, \delta} = 0$ almost surely for some
distribution in the family.
\item[\upshape (f)] $\inner{Y - y, \delta} = 0$ almost surely for all
distributions in the family.
\item[\upshape (g)] $\delta \in N_C(y)$ and $- \delta \in N_C(y)$.
\item[\upshape (h)] $\inner{w, \delta} = 0$, for all $w \in T_C(y)$.
\end{enumerate}
\end{theorem}
Any vector $\delta$ that satisfies any one of the conditions
of the theorem (and hence all of them) is called
a \emph{direction of constancy} of the log likelihood.
The set of all directions of constancy is called the \emph{constancy space}
of the log likelihood.
It is clear from (e) or (h) of the theorem that the constancy space is
a vector subspace.
\begin{corollary} \label{cor:non-uniqueness}
For a full exponential family, suppose $\hat{\theta}_1$ and\/
$\hat{\theta}_2$ are maximum likelihood estimates. Then
$\hat{\theta}_1 - \hat{\theta}_2$ is a direction of constancy.
\end{corollary}
From the corollary and (d) of the theorem, we see that directions of
constancy do not cause any problem for statistical inference, because
all maximum likelihood estimates correspond to the same probability
distribution. Thus we have uniqueness where it is important.
Nonuniqueness of the MLE for the natural parameter is, at worst,
merely a computational nuisance.
A family is said to be \emph{minimal} if it has no directions of constancy.
This can always be arranged by reparametrization
(\citealp[pp.~111--116]{barndorff}; \citealp[pp.~13--16]{brown}; see also
\citealp[Section~1.5]{thesis}).
The R function \texttt{glm} always uses a minimal parametrization,
dropping predictors to obtain a full rank model matrix. We take the view,
however, that minimality is not necessary and that insisting on minimality
can complicate other theoretical issues. Thus we never insist on minimality
and do allow for directions of constancy.
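A toy illustration of the preceding point (the variable names are
ours, and the data are random): when given an exactly collinear
predictor, which induces a direction of constancy, the R function
\texttt{glm} drops the predictor, reporting \texttt{NA} for its
coefficient.
\begin{verbatim}
x1 <- rnorm(20)
x2 <- x1                       # exactly collinear with x1
y <- rbinom(20, 1, 1 / 2)
gout <- glm(y ~ x1 + x2, family = binomial)
coef(gout)                     # coefficient for x2 is NA
\end{verbatim}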
\begin{theorem} \label{th:recession}
For some vector $\delta$ and
for a full exponential family with log likelihood \eqref{eq:logl},
natural parameter space $\Theta$, convex support $C$, natural statistic $Y$,
and observed value
of the natural statistic $y$ such that $y \in C$, the following are equivalent.
\begin{enumerate}
\item[\upshape (a)] There exists a $\theta \in \Theta$ such that the function
$s \mapsto l(\theta + s \delta)$ is nondecreasing on $\real$.
\item[\upshape (b)] For all $\theta \in \Theta$ the function
$s \mapsto l(\theta + s \delta)$ is nondecreasing on $\real$.
\item[\upshape (c)] $\inner{Y - y, \delta} \le 0$ almost surely for some
distribution in the family.
\item[\upshape (d)] $\inner{Y - y, \delta} \le 0$ almost surely for all
distributions in the family.
\item[\upshape (e)] $\delta \in N_C(y)$.
\item[\upshape (f)] $\inner{w, \delta} \le 0$, for all $w \in T_C(y)$.
\end{enumerate}
\end{theorem}
Any vector $\delta$ that satisfies any one of the conditions
of the theorem (and hence all of them) is called
a \emph{direction of recession} of the log likelihood.
From now on we will simply say direction of recession or constancy
to refer to directions of recession or constancy of the log likelihood
(since we will not be interested in directions of recession or constancy
of any other function). Note that every direction of constancy is a direction
of recession.
In light of (e) of the theorem, we could also call directions of recession
normal vectors (of the convex support at the observed
value of the natural statistic). Since the word ``normal'' is already overused
in statistics, we prefer the term that cannot be confused with other
statistical notions.
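The simplest example (standard) of a direction of recession that is
not a direction of constancy is Bernoulli with success probability
$p$, natural parameter $\theta = \logit(p)$, and observed value
$y = 0$ of the natural statistic. The log likelihood
\begin{equation*}
l(\theta) = - \log \bigl( 1 + e^\theta \bigr)
\end{equation*}
is strictly decreasing in $\theta$, so $\delta = - 1$ is a direction
of recession that is not a direction of constancy. Here the convex
support is $C = [0, 1]$ and $N_C(0) = (- \infty, 0]$, in agreement
with (e) of Theorem~\ref{th:recession}. The supremum of the log
likelihood, zero, is not attained, and the distributions with
parameter $\theta + s \delta$ converge as $s \to \infty$ to the
distribution degenerate at zero, corresponding to $\hat{p} = 0$
on the boundary of $C$.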
\begin{theorem} \label{th:exist}
For a full exponential family with convex support $C$ and observed value
of the natural statistic $y$ such that $y \in C$, the following are equivalent.
\begin{enumerate}
\item[\upshape (a)] The MLE exists.
\item[\upshape (b)] Every direction of recession is a direction of constancy.
\item[\upshape (c)] $N_C(y)$ is a vector subspace.
\item[\upshape (d)] $T_C(y)$ is a vector subspace.
\end{enumerate}
\end{theorem}
This theorem provides a complete geometric solution to the problem
of when the MLE exists in a full exponential family.
\begin{corollary} \label{cor:strictly-increasing}
For a full exponential family with log likelihood \eqref{eq:logl},
natural parameter space $\Theta$, convex support $C$, and observed value
of the natural statistic $y$ such that $y \in C$,
if $\delta$ is a direction of recession
that is not a direction of constancy, then for all $\theta \in \Theta$
the function $s \mapsto l(\theta + s \delta)$ is strictly increasing
on the interval where it is finite.
\end{corollary}
\subsection{Limits in Directions of Recession} \label{sec:limits}
\begin{theorem} \label{th:limits}
For a full exponential family having log likelihood \eqref{eq:logl},
densities \eqref{eq:density}, natural statistic $Y$,
observed value of the natural statistic $y$
such that $y$ is in the convex support,
and natural parameter space $\Theta$,
if $\delta$ is a direction of recession that is not a direction of constancy,
\begin{equation} \label{eq:h2}
H = \set{ w \in \real^p : \inner{w - y, \delta} = 0 },
\end{equation}
and $\Pr(Y \in H) > 0$ for some distribution in the family, and hence for all,
then for all $\theta \in \Theta$
\begin{equation} \label{eq:limits}
\lim_{s \to \infty} f_{\theta + s \delta}(\omega)
=
\begin{cases}
0, & \inner{Y(\omega) - y, \delta} < 0
\\
f_\theta(\omega) / \Pr_\theta(Y \in H), & \inner{Y(\omega) - y, \delta} = 0
\\
+ \infty, & \inner{Y(\omega) - y, \delta} > 0
\end{cases}
\end{equation}
Moreover $s \mapsto \Pr_{\theta + s \delta}(Y \in H)$ is continuous and strictly
increasing, and $\Pr_{\theta + s \delta}(Y \in H) \to 1$ as $s \to \infty$.
\end{theorem}
We note three things about the right-hand side of \eqref{eq:limits}.
First, it is a probability density with respect to the distribution having
parameter value $\psi$. The set where it is $+\infty$ has probability zero
by Theorem~\ref{th:recession}~(d),
so this is not a problem.
Second, it is the density of the conditional distribution given the event
$Y \in H$ of the distribution having parameter value $\theta$.
Third, by Scheff\'{e}'s lemma \citep[p.~351]{lehmann-tsh} pointwise convergence
of densities implies convergence in total variation, which implies convergence
in distribution.
Denote the right-hand side of \eqref{eq:limits} by $f_\theta(\omega \mid Y \in H)$.
It is clear that the family
\begin{equation} \label{eq:conditional-family}
\set{ f_\theta(\fatdot \mid Y \in H) : \theta \in \Theta }
\end{equation}
is an exponential family with the same natural statistic and natural parameter
as the original family. Moreover, it is clear
that the log likelihood for
this conditional family
$$
l_H(\theta) = \inner{y, \theta} - c(\theta) - \log \myPr_\theta(Y \in H)
$$
satisfies
$$
l(\theta) < l_H(\theta), \qquad \theta \in \Theta.
$$
Thus, if an MLE exists for
the conditional family \eqref{eq:conditional-family},
then it maximizes the likelihood in the family that is the union
of \eqref{eq:conditional-family} and the original family.
When this happens, we say we have found an MLE in the Barndorff-Nielsen
completion of the original exponential family.
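For logistic regression, where the components of $Y$ are independent
Bernoulli random variables, the conditioning event has a simple form:
$\inner{Y - y, \delta} \le 0$ almost surely forces
$\delta_i (Y_i - y_i) \le 0$ almost surely for each $i$, so
conditioning on $Y \in H$ fixes at their observed values exactly
those components $Y_i$ for which $\delta_i \neq 0$. Hence the MLE in
the limiting conditional family can often be computed by an ordinary
\texttt{glm} call on the remaining cases. A sketch (all variable
names here are hypothetical):
\begin{verbatim}
# delta is the direction of recession, y the observed response;
# components with nonzero delta are fixed in the limiting
# conditional model, so refit using only the other cases
fixed <- delta != 0   # in practice compare with a tolerance
gout <- glm(y ~ x1 + x2, family = binomial, subset = !fixed)
\end{verbatim}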
If the conditional family \eqref{eq:conditional-family}
has a direction of recession that is not a direction of constancy,
we can take limits again, and continue until we have a limiting family
that has no directions of recession that are not directions of constancy,
which must eventually occur since
the dimension of the convex support of the limiting family must
decrease in each limiting operation and can go no lower than zero,
in which case the convex support contains a single point and the conditional
family contains a single distribution which is trivially the MLE\@.
This iterated limiting process
is the one followed in \citet{thesis}. It is actually much more general
than the one we follow here.
Up to this point, we have been closely following \citet{thesis},
but now we part company, imposing conditions that assure $\delta$ can be chosen
so that one limit is enough. So far the only condition we have imposed
is $\Pr(Y \in H) > 0$, which is required for the existence of the limiting conditional
family \eqref{eq:conditional-family}.
We will impose other conditions as we go along.
All of the conditions we assume will be collected in one place in
Section~\ref{sec:assume}. Most of these conditions are not new, having been
proposed by \citet[pp.~193 and~197]{brown}.
Although these conditions limit the applicability of our procedure,
they do apply in most applications, and they make
construction of confidence intervals much easier.
The limiting conditional family \eqref{eq:conditional-family} need not be full;
the natural parameter space of the full family containing
\eqref{eq:conditional-family} is at least
as large as
\begin{equation} \label{eq:larger}
\Theta + \Gamma_{\text{lim}}
=
\set{ \theta + \gamma :
\theta \in \Theta \opand \gamma \in \Gamma_{\text{lim}} },
\end{equation}
where $\Theta$ is the natural parameter space of the original family
and $\Gamma_{\text{lim}}$ is the constancy space of
\eqref{eq:conditional-family}.
We will assume that \eqref{eq:larger} is the natural parameter space
of the full family containing \eqref{eq:conditional-family}.
Although pathological examples can be constructed for which this assumption
fails \citep[Example~2.1]{thesis},
we know of no realistic applications for which it fails.
\subsection{Convex Polyhedra}
One part of the condition of \citet[p.~197]{brown} mentioned above is that
the convex support is polyhedral \citep[Section~19]{rocky}.
Since most applications satisfy this condition, it entails
little loss of generality.
A set $C$ is a \emph{convex polyhedron} if it is
the intersection of a finite collection of closed half-spaces
\citep[Section~19]{rocky}, that is,
\begin{equation} \label{eq:polyhedron}
C
=
\set{ w \in \real^d : \text{$\inner{w, \alpha_i} = b_i$, $i \in E$,
and $\inner{w, \alpha_i} \le b_i$, $i \in I$} },
\end{equation}
where the $\alpha_i$ are nonzero vectors, the $b_i$ are scalars,
and $E$ and $I$ are disjoint finite sets.
There is an alternative characterization of convex polyhedra; they are
convex hulls of finite sets of points and directions, that is,
sets of all linear combinations
\begin{equation} \label{eq:linear-combination}
\sum_{i \in E \cup I} b_i \alpha_i,
\end{equation}
where the $\alpha_i$ are vectors, the $b_i$ are scalars, $E$ and $I$
are disjoint finite sets, and the $b_i$ satisfy
\begin{subequations}
\begin{equation} \label{eq:non-neg}
b_i \ge 0, \qquad i \in E \cup I
\end{equation}
and if $I$ is nonempty
\begin{equation} \label{eq:sum-to-one}
\sum_{i \in I} b_i = 1.
\end{equation}
\end{subequations}
The vectors $\alpha_i$, $i \in E \cup I$, are also called the \emph{generators}
of $C$, and $C$ is the set \emph{generated} by the \emph{points}
$\alpha_i$, $i \in I$, and the \emph{directions} $\alpha_i$, $i \in E$.
The equivalence of these two
characterizations is called the Minkowski-Weyl theorem
\citep[Theorem~19.1]{rocky}. The R function \texttt{scdd} in the
contributed package \texttt{rcdd} \citep{rcdd} converts between these
two representations of a convex polyhedron, which it calls the H-representation
and V-representation, respectively.
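For example (a minimal sketch; see the package documentation for the
exact layout of the representation matrices), the unit square in
$\real^2$ can be converted from H-representation to V-representation
as follows.
\begin{verbatim}
library(rcdd)
# inequalities a1 %*% w <= b1 describing 0 <= w_i <= 1
a1 <- rbind(c(-1, 0), c(0, -1), c(1, 0), c(0, 1))
b1 <- c(0, 0, 1, 1)
hrep <- makeH(a1, b1)
vrep <- scdd(hrep)$output
vrep   # one row per generator, here the four vertices
\end{verbatim}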
Let $P$ denote the set $\set{\alpha_i : i \in I }$ of points
and $D$ denote the set $\set{\alpha_i : i \in E }$ of directions.
When $P \neq \varnothing$, we write
$$
C = \con(P) + \con(\pos D)
$$
to denote the relationship between the convex polyhedron $C$ and its sets
of generators $P$ and $D$.
When $P = \varnothing$, we write
$$
C = \con(\pos D)
$$
to denote the relationship between $C$ and $D$.
This notation follows \citet[Sections~2E and~3G]{raw}.
When $C$ is a convex polyhedron, $N_C(y)$ and $T_C(y)$ are also convex
polyhedra for each $y \in C$ and are given in terms of the H-representation
of $C$ by simple formulas \citep[Theorem~6.46]{raw}. Moreover,
the closure operation in \eqref{eq:tangent-cone} is unnecessary when
$C$ is a convex polyhedron.
When $T_C(y)$ or any convex cone is polyhedral, its V-representation
can consist of directions only, so can be of the form
$\con(\pos V)$ for some finite set $V$.
\subsection{Generic Directions of Recession}
The \emph{relative interior} of a convex set $C$, denoted $\rint C$,
is its interior relative to its affine hull \citep[Chapter 6]{rocky}.
Every nonempty convex set has a nonempty relative interior
\citep[Theorem~6.2]{rocky}.
We say a vector $\delta$ is a \emph{generic direction of recession}
(GDOR) if $\delta \in \rint N_C(y)$ and $N_C(y)$ is not a vector subspace,
where $C$ is the convex support and $y$ an observed value of the natural
statistic such that $y \in C$.
Since the relative interior is always nonempty, a GDOR exists if and only if
none of the conditions of Theorem~\ref{th:exist} hold.
\begin{theorem} \label{th:generic}
For a full exponential family having polyhedral convex support $C$
and observed value of the natural statistic $y$ such that $y \in C$,
let $T_C(y) = \con(\pos V)$, and define
$$
L = \set{ v \in V : - v \in T_C(y) }.
$$
Then a generic direction of recession exists if and only if $L \neq V$,
in which case a vector $\delta$ is a generic direction of recession
if and only if
\begin{subequations}
\begin{align}
\inner{w, \delta} = 0, & \qquad w \in L
\label{eq:calc-equality}
\\
\inner{w, \delta} < 0, & \qquad w \in V \setminus L
\label{eq:calc-inequality}
\end{align}
\end{subequations}
\end{theorem}
\begin{corollary} \label{cor:generic}
Under the assumptions of the theorem,
a generic direction of recession is not a direction of constancy.
\end{corollary}
If $B$ is a set of vectors, let $\spam B$ denote the smallest
vector subspace containing $B$. Also for any vector $x$, let
$x + \spam B = \set{ x + v : v \in \spam B }$.
\begin{corollary} \label{cor:generic-limit}
Under the assumptions of the theorem,
suppose $\delta$ is a generic
direction of recession, and $H$ is defined by \eqref{eq:h2}.
Then $T_{C \cap H}(y) = \spam L$, and $C \cap H = C \cap (y + \spam L)$.
\end{corollary}
The theorem and corollaries explain the purpose
of generic directions of recession.
By Corollary~\ref{cor:generic}, the existence of a GDOR implies that the
MLE does not exist in the conventional sense, so we seek it in the
limiting family described in Section~\ref{sec:limits}.
Suppose that $C \cap H$ is the convex support of the limiting family.
This is another part of the conditions of \citet[p.~193]{brown} referred to
above.
Then $T_{C \cap H}(y)$ being a vector subspace implies that the MLE
in the limiting family exists by Theorem~\ref{th:exist}~(d).
Thus finding one GDOR allows us to find the MLE in the Barndorff-Nielsen
completion.
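One way to compute a vector satisfying \eqref{eq:calc-equality}
and \eqref{eq:calc-inequality} is by linear programming (this is a
sketch of ours; the algorithms actually used are given in
Sections~\ref{sec:linearity} and~\ref{sec:gdor}):
\begin{equation*}
\begin{aligned}
\text{maximize} \quad & t
\\
\text{subject to} \quad & \inner{w, \delta} = 0, \qquad w \in L,
\\
& \inner{w, \delta} \le - t, \qquad w \in V \setminus L,
\\
& t \le 1.
\end{aligned}
\end{equation*}
By the theorem, when $L \neq V$ the optimal value is one, and the
$\delta$ part of any optimal solution is then a generic direction
of recession.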
At this point our theory is essentially complete; only computational issues
remain. So we take some time to explain the connection between this theory
and the pre-existing theory of Barndorff-Nielsen completion
\citep{barndorff,brown,thesis}. The pre-existing theory says
the MLE lies in the limiting conditional family whose convex support,
what we are calling $C \cap H$, is the unique face of $C$ containing $y$
in its relative interior \citep[Chapter~4 generalizes this]{thesis}.
Thus the pre-existing approach makes it clear that the limiting conditional
family containing the MLE is unique and does not depend on the GDOR,
which is in general not
unique. In our approach, uniqueness comes from the assertion
$C \cap H = C \cap (y + \spam L)$ in Corollary~\ref{cor:generic-limit}.
This makes it clear
that, although the hyperplane $H$ does depend on the GDOR $\delta$ used to
define it, the convex support $C \cap H$ of the limiting distribution does
not depend on $H$, hence does not depend on $\delta$. Since we do not need
the pre-existing theory, we shall not bother with a proof that $C \cap H$
is the face of $C$ containing $y$ in its relative interior (the proof is
almost immediate from Lemma~A.1 in \citealp{thesis}), and merely assure
the reader that this is indeed the case, and our new theory describes the
same mathematical structure as the pre-existing theory. We have not rewritten
the theory of Barndorff-Nielsen completion merely for amusement.
As we shall see, the
new theory based on the GDOR concept is much better suited for computation
and statistical inference than the pre-existing theory based on faces
of a convex set.
Before moving to computational issues, we record one interesting
point. Corollary~\ref{cor:generic-limit} would be false
without the assumption that $C$ is polyhedral. Consider the following
example. $C$ is the set in $\real^2$ consisting of the closed unit disk
and all points with one coordinate negative and the other less than or equal
to one. This is a closed convex set, but not polyhedral. The points
$(0, 1)$ and $(1, 0)$ are peculiar in that they are faces of $C$ that are not
exposed in the terminology of \citet[pp.~162--163]{rocky}.
Suppose $y = (1, 0)$.
Then $N_C(y) = \set{ (s, 0) : s \ge 0 }$. Since this normal cone is not
a vector subspace, there is a generic direction of recession;
one is $\delta = (1, 0)$. Then $H = \set{ (1, s) : s \in \real }$
and $C \cap H = \set{ (1, s) : s \le 0 }$.
Then $T_{C \cap H}(y) = \set{ (0, s) : s \le 0 }$ is not a vector subspace.
In this sort of situation the more general theory of \citet{thesis} may apply,
but our theory based on generic directions of recession cannot. The condition
that $C$ be polyhedral could clearly be weakened to $C$ being locally
polyhedral (every point has a convex polyhedral neighborhood $B$
such that $B \cap C$ is polyhedral), but we know of no application of such
a condition, hence do not use it.
\subsection{Assumptions} \label{sec:assume}
We summarize the assumptions we have made above. We deal with a full
exponential family.
If every direction of recession is a direction of constancy, then
we need no further assumptions. Otherwise, let $\delta$ be a generic direction
of recession, let $C$ be the convex support, let $Y$ be the natural statistic,
let $y$ be an observed value of the natural statistic satisfying $y \in C$,
and let $H$ be defined by \eqref{eq:h2}.
We assume the event $Y \in H$ has positive probability so the
limiting conditional family defined in Section~\ref{sec:limits} exists.
We further assume that $C \cap H$ is the convex support of this limiting
conditional family, so that by Corollary~\ref{cor:generic-limit} the MLE in this
limiting conditional family exists.
We further assume that the natural parameter space of the full family
containing the limiting conditional
family is given by \eqref{eq:larger}, so that the confidence interval
construction in Section~\ref{sec:confidence} below is valid.
Finally, we assume that $C$ is a convex polyhedron.
We need no other assumptions.
\subsection{Natural Affine Submodels}
In most applications of exponential family theory, we start with
a very large exponential family, which we call \emph{saturated} and which
has too many parameters to estimate well.
Then we consider \emph{natural affine submodels}, parametrized by
$$
\theta = a + M \beta,
$$
where $\theta$ is the natural parameter of the saturated model, $\beta$ is
the natural parameter of the natural affine submodel, $a$ is a known vector,
and $M$ is a known matrix. In the terminology of the R function \texttt{glm},
$a$ is called the \emph{offset vector}
and $M$ is called the \emph{model matrix}.
Observe that
$$
\inner{y, a + M \beta} = \inner{y, a} + \inner{M^T y, \beta},
$$
where the two bilinear forms on the right-hand side have different dimensions.
Since the first term on the right-hand side does not contain the parameter
and can be dropped from the log likelihood, the submodel is itself
an exponential family with natural statistic $M^T y$
and natural parameter $\beta$.
Thus everything said above applies to natural affine submodels; we
just work with the convex support of $M^T Y$ rather than that of $Y$.
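The identity above is easy to check numerically. The following base R sketch,
with made-up $a$, $M$, $\beta$, and $y$ (hypothetical, not from any example
in this report), confirms that the log likelihood depends on the data only
through $M^T y$:

```r
## Numerical check, with arbitrary made-up quantities, that the inner
## product splits as <y, a + M beta> = <y, a> + <M^T y, beta>, so the
## submodel natural statistic is M^T y.
set.seed(42)
p <- 5; q <- 2
y <- rbinom(p, 1, 0.5)      # natural statistic of the saturated model
a <- rnorm(p)               # offset vector
M <- cbind(1, rnorm(p))     # model matrix (intercept plus one covariate)
beta <- rnorm(q)            # submodel natural parameter
lhs <- sum(y * (a + M %*% beta))
rhs <- sum(y * a) + sum(t(M) %*% y * beta)
all.equal(lhs, rhs)         # equal up to rounding error
```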
\subsection{Relating Tangent Cones of Models and Affine Submodels}
\label{sec:relate}
Let $C_{\text{sat}}$ denote the convex support of the saturated model
and $C_{\text{sub}}$ that of the natural affine submodel.
By Theorems~6.43 and~6.46 in \citet{raw},
\begin{equation} \label{eq:tc-map}
T_{C_{\text{sub}}}(M^T y) = \cl \set{ M^T w : w \in T_{C_{\text{sat}}}(y) }
\end{equation}
and the closure operation is not necessary if $C_{\text{sat}}$
is polyhedral. Moreover, it is clear that
if $T_{C_{\text{sat}}}(y) = \con(\pos V_{\text{sat}})$,
then $T_{C_{\text{sub}}}(M^T y) = \con(\pos V_{\text{sub}})$, where
\begin{equation} \label{eq:vee-map}
V_{\text{sub}}
=
\set{ M^T w : w \in V_{\text{sat}} }.
\end{equation}
The induced mapping for normal cones is not so simple, requiring linear
programming. This explains our starting with tangent vectors
and V-representations rather than normal vectors and H-representations.
\subsection{Tangent Cones of Saturated Models}
In saturated families the convex support is often easy to calculate.
In logistic regression, each component of the response vector is Bernoulli
and $C_{\text{sat}} = [0, 1]^p$.
In Poisson regression, each component of the response vector is Poisson
and $C_{\text{sat}} = [0, \infty)^p$.
In categorical data analysis with Poisson response,
$C_{\text{sat}} = [0, \infty)^p$, just as in Poisson regression.
In categorical data analysis with multinomial or product-multinomial response,
the model can be derived from the Poisson model by conditioning on an affine
subspace, hence the convex support is the intersection of $[0, \infty)^p$
with this affine subspace.
As we shall see (Section~\ref{sec:multi} below),
this allows us to use the solution in the Poisson response
problem to calculate the solutions in the other problems.
When $C_{\text{sat}}$ is a Cartesian product,
as in the examples mentioned, $T_{C_{\text{sat}}}(y)$ can be calculated
coordinatewise \citep[Proposition~6.41]{raw}.
This is the only situation we will use for our examples.
Let $e_i$ denote the unit vector in the $i$-th coordinate direction
(every coordinate is zero except for the $i$-th, which is one).
Then $e_i$ is a tangent
vector at $y$ if $y_i$ is not at the upper bound of its range,
and $- e_i$ is a tangent vector at $y$ if $y_i$ is not at the lower bound.
In this case, the set $V_{\text{sat}}$ of the preceding section
contains $e_i$ or $- e_i$ or both for each $i$.
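As a small illustration of the coordinatewise construction, the following
base R sketch builds a generating set $V_{\text{sat}}$ for the Poisson case
$C_{\text{sat}} = [0, \infty)^p$ and maps it to $V_{\text{sub}}$ as in
\eqref{eq:vee-map}; the data vector and model matrix here are made up for
illustration only.

```r
## Coordinatewise tangent vectors for a Cartesian product support,
## here the Poisson case C_sat = [0, Inf)^p with made-up data:
## e_i is always a tangent vector (no upper bound), and -e_i is a
## tangent vector only when y_i > 0 (not at the lower bound).
y <- c(3, 0, 1, 0)                 # hypothetical Poisson responses
p <- length(y)
id <- diag(p)
V.sat <- rbind(id,                 # e_i for every i
    -id[y > 0, , drop = FALSE])    # -e_i only when y_i > 0
M <- cbind(1, 1:p)                 # hypothetical model matrix
V.sub <- V.sat %*% M               # each row is M^T w for a row w of V.sat
nrow(V.sat)                        # p plus the number of nonzero y_i
```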
We will not do any examples where $C_{\text{sat}}$ fails to be a Cartesian
product, except for categorical data analysis with multinomial or product
multinomial response, where we derive the solution from the Poisson response
case, in which $C_{\text{sat}}$ is a Cartesian product.
An application where the convex
support is not a Cartesian product is provided by unconditional aster models
\citep{gws}. We merely note that all of the theory developed in this technical
report applies to the case where $C_{\text{sat}}$ is not a Cartesian product. Moreover,
most of the computational procedures described below also apply to this case.
Only at the beginning and the end of this process
does the non-Cartesian-product case present
additional issues. At the beginning we need to determine
a set $V_{\text{sat}}$ that generates $T_{C_\text{sat}}(y)$, and this may
be more difficult than in the Cartesian product case.
At the end (Section~\ref{sec:completion} below), we need to determine
the convex support $C_{\text{sat}} \cap H_{\text{sat}}$ of
the limiting conditional model and compute
the MLE in this model, and this may also be more difficult than in the
Cartesian product case.
\subsection{Calculating the Linearity} \label{sec:linearity}
Next we determine the \emph{linearity} of $V_{\text{sub}}$
\begin{equation} \label{eq:linearity}
L_{\text{sub}}
=
\set{ w \in V_{\text{sub}} : - w \in \con(\pos V_{\text{sub}}) }.
\end{equation}
This sounds like a complicated operation, and it is, but the \texttt{rcdd}
package has a function \texttt{linearity} that does it by repeated linear
programming.
Having found the linearity, we have solved the problem of when the MLE
exists in the original family.
It exists if and only if $L_{\text{sub}} = V_{\text{sub}}$.
All functions in the
\texttt{rcdd} package use two forms of arithmetic. One is the
default computer arithmetic used by all other R functions.
Answers produced using that arithmetic are inexact, so one is uncertain
whether the $L_{\text{sub}}$ produced is actually correct.
The other form of arithmetic is exact, infinite-precision, rational arithmetic.
Answers produced using that arithmetic are exact, so one is certain
that the $L_{\text{sub}}$ produced is actually correct, but only
if the vectors in $V_{\text{sub}}$ are also produced exactly using
either integer arithmetic or rational arithmetic.
For readers curious about how the \texttt{linearity} function works
we give the following description. Others should skip the rest of
this section. According to comments in the source code
(starting at line 3062 of the file \texttt{cddlp.c} of the source
code for the \texttt{rcdd} package, version~1.1-1, which comes from
the \texttt{cddlib} library, version~0.94f, written by K. Fukuda)
for each $w \in V_{\text{sub}}$ it solves the linear programming problem
\begin{equation} \label{eq:linprog}
\begin{split}
& \text{maximize} \\
& \qquad \inner{w, \delta} \\
& \text{subject to} \\
& \qquad
\begin{aligned}
% \inner{w, \delta} & \ge - 1 \\
\inner{v, \delta} & \ge 0,
& & \qquad v \in V_{\text{sub}} \setminus \{ w \}
\end{aligned}
\end{split}
\end{equation}
where $\delta$ is the state vector of the linear programming problem.
\begin{theorem} \label{th:linprog}
A vector $w$ is in the linearity \eqref{eq:linearity}
if and only if the optimal value of the linear program \eqref{eq:linprog}
is nonpositive.
\end{theorem}
It may be the case that some $w \in V_{\text{sub}}$ are known a priori
to be in $L_{\text{sub}}$. This can be specified in the input to the
\texttt{linearity} function, so the work verifying this need not be done.
\subsection{Calculating Generic Directions of Recession} \label{sec:gdor}
If $L_{\text{sub}} \neq V_{\text{sub}}$, then $T_{C_{\text{sub}}}(M^T y)$
is not a vector subspace, hence there exists
a generic direction of recession. By Theorem~\ref{th:generic}, $\delta$
is a GDOR if and only if
\begin{subequations}
\begin{align}
\inner{w, \delta} = 0, & \qquad w \in L_{\text{\upshape sub}}
\label{eq:linear}
\\
\inner{w, \delta} < 0, & \qquad w \in V_{\text{\upshape sub}} \setminus L_{\text{\upshape sub}}
\label{eq:nonlinear}
\end{align}
\end{subequations}
Hence we can find one such $\delta$
by solving the following linear programming problem
\begin{align*}
& \text{maximize} \\
& \qquad \epsilon \\
& \text{subject to} \\
& \qquad
\begin{aligned}
\epsilon & \le 1 \\
\inner{v, \delta} & = 0, & & \qquad v \in L_{\text{sub}} \\
\inner{v, \delta} & \le - \epsilon,
& & \qquad v \in V_{\text{sub}} \setminus L_{\text{sub}}
\end{aligned}
\end{align*}
where $\delta$ is a $q$-vector, $q$ being the dimension of the submodel,
$\epsilon$ is a scalar, and
$(\delta, \epsilon)$ is the state vector of the linear programming problem
(so the dimension is $q + 1$). The $\delta$ part of the solution
is a generic direction of recession. The $\epsilon$ part does not matter.
The idea for using this particular linear program came from the documentation
for the \verb@dd_ExistsRestrictedFace2@ function in the \texttt{cddlib} library,
which is the computational geometry library (written by K. Fukuda)
to which \texttt{rcdd} provides an incomplete interface.
The \texttt{rcdd} package does not provide an interface to this \texttt{cddlib}
function, but it does provide a function \texttt{lpcdd} that does linear
programming and can be used to solve this linear program.
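We do not reproduce the linear programming itself here (the \texttt{lpcdd}
function does that); the following base R sketch merely checks the defining
conditions \eqref{eq:linear} and \eqref{eq:nonlinear} for a candidate
$\delta$ found by inspection in a made-up quasi-separated logistic regression
example.

```r
## Toy quasi-separated logistic data: y = 0 for x <= 3, y = 1 for x >= 3,
## with both values observed at x = 3.  We verify that the candidate
## direction delta = (-3, 1), found by inspection, satisfies the GDOR
## conditions: <w, delta> = 0 for w in the linearity, < 0 otherwise.
x <- c(1, 2, 3, 3, 4, 5)
y <- c(0, 0, 0, 1, 1, 1)
M <- cbind(1, x)                     # model matrix
## tangent vectors of [0,1]^p at binary y map to the submodel as
## rows M_i when y_i = 0 and -M_i when y_i = 1
V.sub <- M * ifelse(y == 0, 1, -1)
## rows 3 and 4 are (1, 3) and -(1, 3); both signs occur, so these
## two rows form the linearity L.sub
lin <- c(3, 4)
delta <- c(-3, 1)
ip <- drop(V.sub %*% delta)
all(ip[lin] == 0)                    # the linearity condition holds
all(ip[-lin] < 0)                    # strict inequality off the linearity
```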
\subsection{Calculating Maximum Likelihood Estimates} \label{sec:mle}
\subsubsection{In the Original Family} \label{sec:original}
There is little to be said about calculating the MLE in the original family.
When we have found that $L_{\text{sub}} = V_{\text{sub}}$, then we know the MLE
exists and can use available software to find it. We will use the R function
\texttt{glm} for our examples.
There is one issue worth mentioning. If the model is non-identifiable,
so the MLE is non-unique, the R function \texttt{glm} is smart enough to
drop enough predictors to produce an identifiable model. However, its
method of doing so is not guaranteed to be correct because of the inexactness
of the default computer arithmetic.
We can use the \texttt{redundant} function in the \texttt{rcdd}
package applied to the columns of $M$ to reduce to a linearly independent
set. If $M$ was calculated using integer or rational arithmetic, and
\texttt{redundant} uses rational arithmetic, then this operation will
be exact.
\subsubsection{In the Completion} \label{sec:completion}
When we have found that $L_{\text{sub}} \neq V_{\text{sub}}$ and have found
a generic direction of recession $\delta$, we still need to characterize
the support of the limiting conditional family. We can characterize this
support in two ways, using either of
\begin{align*}
H_{\text{sub}} & = \set{ w \in \real^q : \inner{w - M^T y, \delta} = 0 }
\\
H_{\text{sat}} & = \set{ w \in \real^p : \inner{w - y, M \delta} = 0 }
\end{align*}
where $p$ and $q$ are the dimensions of the saturated model
and affine submodel, respectively. Then the limiting conditional model
conditions on the event $M^T Y \in H_{\text{sub}}$ or $Y \in H_{\text{sat}}$,
which is the same event characterized in two different ways, the latter
usually being the simpler.
\begin{theorem} \label{th:support}
In the setup of Sections~\ref{sec:linearity} through~\ref{sec:mle},
define
$$
L_{\text{\upshape sat}} = \set{ v \in V_{\text{\upshape sat}} : M^T v \in L_{\text{\upshape sub}} }.
$$
Then the support of the limiting conditional family is
\begin{equation} \label{eq:support-sat}
C_{\text{\upshape sat}} \cap H_{\text{\upshape sat}}
=
C_{\text{\upshape sat}} \cap (y + \spam L_{\text{\upshape sat}})
\end{equation}
when referred to the saturated model.
\end{theorem}
In the case where the convex support of the saturated model is a Cartesian
product, the support of the limiting model simply constrains $Y_i = y_i$
for $i$ such that $e_i \notin L_{\text{sat}}$, that is, the $i$-th component of
the response is unconstrained if $e_i \in L_{\text{sat}}$ and is constrained
to be equal to its observed value if $e_i \notin L_{\text{sat}}$.
This finishes our analysis of maximum likelihood estimation in the
Barndorff-Nielsen completion. At least in the Cartesian product case,
the maximum likelihood problem in the completion is of the same form
as the original problem. The only difference is that we constrain
certain components of the response vector to their observed values.
This can be achieved by removing those components from the response vector
and proceeding as if the resulting subvector were the entire response vector.
If, for example, we are using the R function \texttt{glm} to fit models,
we merely delete certain elements of the response vector and the corresponding
rows of the model matrix (or the data frame containing the data if we are
using a formula to specify the model) and proceed normally. This conditional
model (with some components deleted) will always be non-identifiable,
because $\delta$ will always be a direction of constancy and there may be
other directions of constancy. The \texttt{glm} function, however, can
deal with this issue. Furthermore, even in the rare case when
the \texttt{glm} function may be confused, we can find a full rank
model matrix having
the same column space as the original model matrix
using the function \texttt{redundant}
in the \texttt{rcdd} package, as described in the preceding section.
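The following base R sketch carries out this row-deletion recipe on a made-up
quasi-separated logistic regression example, in which a GDOR is
$\delta = (-3, 1)$ (found by inspection, not by the linear programming
described above).

```r
## Toy quasi-separated logistic data.  The GDOR delta = (-3, 1) gives
## eta_i = <M_i, delta> = 0 exactly for the cases with x = 3, so the
## limiting conditional model conditions on Y_i = y_i for all other
## cases.  We fit it by deleting those cases and calling glm on the rest.
x <- c(1, 2, 3, 3, 4, 5)
y <- c(0, 0, 0, 1, 1, 1)
delta <- c(-3, 1)
eta <- drop(cbind(1, x) %*% delta)
keep <- eta == 0                     # components not conditioned on
fit <- glm(y ~ x, family = binomial, subset = keep)
## delta is a direction of constancy in this conditional model, so the
## model is non-identifiable; glm drops the aliased coefficient for x,
## and the intercept MLE is logit(1/2), essentially zero
coef(fit)
```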
\subsection{Phase I and Phase II}
\citet{thesis} coined the term ``phase~I maximum likelihood problem'' to
refer to the process of determining
whether the likelihood function has any local or global maxima and if not
what to do about it. This term was also used by \citet{gt}.
It was coined by analogy with the phase~I linear
programming problem, which is to find a feasible point, if any exists, or
determine that none exist. If one exists, then this is used
to start the phase~II problem, finding optimal values.
Little, if anything, is known about the phase~I maximum likelihood problem
except in one special case: exponential families, where it is completely
understood theoretically. As we have seen, the phase~I problem for full
exponential families satisfying the condition of \citet{brown} consists
entirely in determining the set $L_{\text{sat}}$ and a GDOR $\delta$,
and this is done by repeated
linear programming using one call to the function \texttt{linearity} in the
\texttt{rcdd} package and one call to the function \texttt{lpcdd} in the
same package. The phase~I algorithm can be done using exact infinite-precision
rational arithmetic, in which case the result is exact.
The rest of the maximum likelihood problem can be called the phase~II problem:
finding the MLE\@. In Section~\ref{sec:original} we saw that,
when the phase~I problem has established $L_{\text{sat}} = V_{\text{sat}}$,
the MLE exists in the usual sense and is found using the usual algorithm.
In Section~\ref{sec:completion} we saw that,
when the phase~I problem
has established $L_{\text{sat}} \neq V_{\text{sat}}$,
the MLE exists in the Barndorff-Nielsen completion and is found using
the usual algorithm applied to modified data, the modification being determined
by $L_{\text{sat}}$.
\subsection{Likelihood Ratio Tests} \label{sec:tests}
Given two nested natural affine submodels, the maximum value of the
log likelihood can be calculated for each submodel
even without solving the phase~I
problem. Available software, such as the R function \texttt{glm}, will
go uphill on the log likelihood until reaching a point where the log likelihood
is nearly flat, in which case the value of the log likelihood is nearly the
maximum. If the MLE does not exist in the conventional sense,
then the natural parameter estimates will be large but not infinite,
and the \texttt{glm} function may or may not give a warning about lack
of convergence. The natural parameter estimates are then infinitely wrong,
but the value of the maximized log likelihood is nearly correct.
Thus we can correctly calculate
the likelihood ratio test statistic without solving the phase~I problem.
This does no good, however, because the usual asymptotics of the likelihood
ratio test (Wilks' theorem) do not hold in the case where the MLE for the null
model does not exist in the conventional sense.
In this case, the following simple correction,
suggested by S. Fienberg (personal communication), seems reasonable.
Solve the phase~I problem for the null model,
determining $L_{\text{sat}} \neq V_{\text{sat}}$.
Let $M_0$ and $M_1$ be the model matrices for the null and alternative
natural affine submodels ($M_0$ was used in the phase~I calculation).
If we apply Wilks' theorem to the limiting conditional model for the null
hypothesis, we obtain the result that the deviance (twice the log likelihood
ratio) is approximately chi-squared with degrees of freedom equal to the
difference between the dimensions of $M_1^T (\spam L_{\text{sat}})$
and $M_0^T (\spam L_{\text{sat}})$. Assuming that $M_0$, $M_1$,
and $L_{\text{sat}}$ were
determined exactly using rational arithmetic, the degrees of freedom can
be determined exactly by applying the \texttt{redundant} function in the
\texttt{rcdd} package to the sets $\set{ M_i^T w : w \in L_{\text{sat}} }$,
$i = 0, 1$.
This asymptotic approximation may or may not hold depending on the sample
size and on how close the observed value of the natural statistic is to the
boundary of the convex support of the limiting conditional model.
By construction, it cannot be on the boundary, but if it is close the
asymptotic approximation can be bad. If one is worried about the validity
of the asymptotic approximation, one can always do a parametric bootstrap
calculation based on the limiting conditional model for the null hypothesis.
\subsection{Confidence Intervals} \label{sec:confidence}
Confidence intervals are more complicated than hypothesis tests.
Confidence intervals for both natural and mean value parameters are
of interest. The R function \texttt{predict.glm} provides either, depending
on the value of its \texttt{type} argument. We will also provide either.
Before getting into details, we should first note that confidence intervals
are often inappropriate from a theoretical point of view.
When there is not a single scalar parameter
of interest, a confidence region for the vector parameter of interest should,
theoretically,
be provided. However, high-dimensional confidence regions are unvisualizable,
hence uninterpretable and of no interest to users. Thus statisticians usually
provide something users think they can interpret, which is multiple confidence
intervals, often not adjusted for simultaneous coverage. That is what,
for example, \texttt{predict.glm} provides. We will follow the usual practice,
providing multiple confidence intervals in our examples and restricting our
treatment of confidence regions to a few comments.
In the case where the MLE does not exist in the conventional sense
but is found in the Barndorff-Nielsen completion, we have two natural parameter
spaces in play, the natural parameter space $\Theta$ of the original natural
affine submodel and the natural parameter space $\Theta_{\text{lim}}$ of
the full family containing the limiting conditional model, which by assumption
is given by \eqref{eq:larger}.
The MLE is in the latter, but confidence intervals or confidence regions must
be in the former. Thus we need to fully understand the relationship between
the two. Both are subsets of $\real^p$ and considered as such
$\Theta \subset \Theta_{\text{lim}}$, but we should not consider them as such
because a point $\theta$ in both sets corresponds to different distributions
in the two models. Let $\delta$ be a GDOR, and let $\Gamma_{\text{lim}}$
denote the constancy space of the limiting conditional model.
Then we know $\delta \in \Gamma_{\text{lim}} \subset \Theta_{\text{lim}}$.
We also know that in the limiting conditional model the parameter is not
identifiable: for every $\gamma \in \Gamma_{\text{lim}}$ the parameter
values $\theta$ and $\theta + \gamma$ correspond to the same distribution.
Thus distributions do not correspond to parameter points $\theta$ but to
equivalence classes of points
$$
\theta + \Gamma_{\text{lim}}
= \set{ \theta + \gamma : \gamma \in \Gamma_{\text{lim}} }.
$$
Of course, the same issue applies to the original model. If its constancy
space is $\Gamma$, then equivalence classes $\theta + \Gamma$ correspond
to distributions in the original model.
% However, we can, and usually do,
% arrange so that there are no directions of constancy for the original model,
% that is, $\Gamma = \{ 0 \}$.
Since $\Gamma_{\text{lim}}$ contains $\delta$, it is
a nontrivial subspace. We cannot reparametrize to make the limiting conditional
model identifiable without losing the connection between the two models.
Points $\theta$ and $\theta + \gamma$ with $\gamma \in \Gamma_{\text{lim}}$
correspond to the same distribution in the limiting conditional model but
may correspond to different distributions in the original model (do correspond
to different distributions unless $\gamma \in \Gamma$).
The relationship between the two models is that the distributions in
the original model corresponding to $\theta + s \delta$
and $\theta + \gamma + s \delta$ converge as $s \to \infty$ to the
(single) distribution in the limiting conditional model corresponding to
both parameter values $\theta$ and $\theta + \gamma$.
\subsubsection{In the Limiting Conditional Model}
In the limiting conditional model we have the ``usual asymptotics'' of maximum
likelihood. The MLE $\hat{\theta}$ is asymptotically normal with variance
inverse Fisher information, or would be except for two issues: the Fisher
information matrix is singular with null space $\Gamma_{\text{lim}}$ and
we may not believe any distribution in the limiting conditional model is
correct, so a confidence region in the limiting conditional model may
be nonsense. Nevertheless, such confidence regions may be useful as a tool
for constructing confidence regions in the original model that do make sense.
Moreover, the non-identifiability issue in the limiting conditional model
can be dealt with either by using a pseudo-inverse for Fisher information or
by constraining $\hat{\theta}$ to lie in a subspace of $\real^p$ such that
the limiting conditional model is identifiable when this constraint is imposed.
We usually take the latter approach, since the R function \texttt{glm}
does this automatically, dropping parameters to obtain identifiability.
Suppose we have such a confidence region $R_{\text{lim}}$ for $\theta$ in
the limiting conditional model. Because we used constraints, we do
not automatically get $R_{\text{lim}} \supset \Gamma_{\text{lim}}$.
Hence, if we wish to relate this confidence region to the original model,
which we do, then we need to consider
$R_{\text{lim}} + \Gamma_{\text{lim}} = \set{ \theta + \gamma :
\theta \in R_{\text{lim}}, \ \gamma \in \Gamma_{\text{lim}} }$,
the ``actual'' confidence region.
\subsubsection{In the Original Model}
Now for each $\theta \in R_{\text{lim}}$
and each $\gamma \in \Gamma_{\text{lim}}$, the distribution in the limiting
conditional model corresponding to $\theta + \gamma$ is the limit
as $s \to \infty$ of the distributions in the original model corresponding
to $\theta + \gamma + s \delta$. Thus it remains to be decided how large
$s$ may be, that is, we need a confidence interval for $s$, which will
necessarily be one-sided, of the form $(\hat{s}_{\theta + \gamma}, \infty)$.
As our notation suggests, we make one such confidence interval for each
point $\theta + \gamma$ we need to consider.
We base our interval on the statistic
$\inner{Y, \delta}$, whose observed value $\inner{y, \delta}$ is its
maximum possible value. Since $\inner{Y, \delta}$ is discrete where it counts
--- the value $\inner{y, \delta}$ is assumed to have positive probability so
that the limiting conditional model described in Section~\ref{sec:limits}
exists --- it is not possible to get an exact confidence interval unless one
uses randomized or fuzzy intervals, described by \citet{fuzzy}.
It follows immediately from the theory of uniformly most powerful tests
\citep{lehmann-tsh} and the definitions in \citet{fuzzy},
that the exact $1 - \alpha$ fuzzy confidence interval in this case
has membership function
\begin{equation} \label{eq:fuzzy}
I(s) = \max[ 0, 1 - \alpha /
\myPr_{\theta + \gamma + s \delta}(Y \in H) ],
\end{equation}
which has a direct interpretation as a confidence statement: $I(s)$ gives
the degree to which $s$ should be considered to be in the confidence interval
\citep{fuzzy}. Also \eqref{eq:fuzzy} can be used to construct randomized
confidence intervals; if $U$ is a $\text{Uniform}(0, 1)$ random variate
independent of the data, then
$$
\vphantom{I(s)}^{U}I(s) = \set{ s : I(s) \ge U }
$$
is an exact $1 - \alpha$ randomized confidence interval
\citep[Section~2.1]{fuzzy}. The main point of this technical report
is to advocate
the use of generic directions of recession to calculate maximum likelihood
estimates in the Barndorff-Nielsen completion and related hypothesis tests
and confidence intervals. Advocacy of fuzzy confidence intervals is not
our purpose (see \citealp{fuzzy}, for that). However, if one wants an
exact procedure, one must use the fuzzy confidence interval, so we have
mentioned it.
For those who want a conventional confidence interval, the \emph{support}
of the fuzzy interval
\begin{equation} \label{eq:support}
(\hat{s}_{\theta + \gamma}, \infty) = \supp I = \set{ s : I(s) > 0 }
\end{equation}
\citep[Section~1.3]{fuzzy} is a conservative $1 - \alpha$ confidence interval
for $s$. Clearly,
$$
\hat{s}_{\theta + \gamma}
=
\inf \set{ s \in \real : \theta + \gamma + s \delta \in \Theta
\opand \myPr_{\theta + \gamma + s \delta}(Y \in H) > \alpha }.
$$
Since $s \mapsto \myPr_{\theta + \gamma + s \delta}(Y \in H)$ is continuous
and strictly increasing by Theorem~\ref{th:limits},
usually $\hat{s}_{\theta + \gamma}$ is the unique $s$ such that
$\myPr_{\theta + \gamma + s \delta}(Y \in H) = \alpha$.
Only when the data
have very little information about the parameter would it be the case that
$\myPr_{\theta + \gamma + s \delta}(Y \in H) > \alpha$ for all $s$ in
the allowed range, in which case $\hat{s}_{\theta + \gamma}$ would be
the lower endpoint of this range.
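The simplest possible illustration of this computation is a single Bernoulli
observation with $y = 1$, taking $\theta + \gamma = 0$ and $\delta = 1$, so
that $H$ is the event $Y = 1$ and $\myPr_{s \delta}(Y \in H)$ is the inverse
logit of $s$, continuous and strictly increasing. The following base R sketch
(a hypothetical toy, not one of the examples below) finds
$\hat{s}_{\theta + \gamma}$ with \texttt{uniroot}.

```r
## Single Bernoulli observation with y = 1, natural parameter s:
## Pr_s(Y in H) = plogis(s), so the endpoint shat of the one-sided
## interval (shat, Inf) solves plogis(s) = alpha.
alpha <- 0.05
shat <- uniroot(function(s) plogis(s) - alpha,
    lower = -100, upper = 100, tol = 1e-10)$root
all.equal(shat, qlogis(alpha))   # agrees with the closed-form solution
```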
We should remark that there is a simple argument leading directly to
these conventional confidence intervals \eqref{eq:support} without going
through fuzzy confidence intervals. The conventional
conservative $P$-value for the upper-tailed test
having test statistic $\inner{Y, \delta}$,
null hypothesis $\theta + \gamma + s \delta$,
and observed data $y$ is
\begin{equation} \label{eq:pval}
\myPr_{\theta + \gamma + s \delta}(\inner{Y - y, \delta} \ge 0).
\end{equation}
A conventional level $\alpha$ test rejects when \eqref{eq:pval} is less than
or equal to $\alpha$. A conservative $1 - \alpha$ confidence interval consists
of the set of $s$ values that are not rejected at level $\alpha$. In the
only case of interest to us, where $\inner{y, \delta}$ is the largest possible
value of $\inner{Y, \delta}$ so the event $\inner{Y - y, \delta} \ge 0$ is
the same as $Y \in H$ up to a set of measure zero, this comes to the same interval
as \eqref{eq:support}.
\subsubsection{A Combination of the Two}
Because all of our one-sided confidence intervals for $s$, one for each
parameter point $\theta + \gamma$, $\theta \in R_{\text{lim}}$,
$\gamma \in \Gamma_{\text{lim}}$, use the same
test statistic $\inner{Y, \delta}$, it follows that they have simultaneous
coverage probability at least $1 - \alpha$. Thus under the assumption that
$R_{\text{lim}}$ was a $1 - \alpha^*$ confidence region in the conditional
limiting model, we see that
$$
\set{ \theta + \gamma + s \delta : \theta \in R_{\text{lim}},
\ \gamma \in \Gamma_{\text{lim}}, \ s \ge \hat{s}_{\theta + \gamma} }
$$
is a $1 - \alpha - \alpha^*$ confidence region for the natural parameter
of the original model.
However, we are rarely interested in providing a confidence region.
Thus we will instead consider
$$
\set{ \hat{\theta} + \gamma + s \delta :
\ \gamma \in \Gamma_{\text{lim}}, \ s \ge \hat{s}_{\hat{\theta} + \gamma} }
$$
to be a $1 - \alpha$ confidence region that captures the part
of the uncertainty relating to ``how close to infinity'' the natural parameter
may be. This must be combined with $R_{\text{lim}}$ or with non-simultaneous
confidence intervals based on the limiting conditional model. Simultaneous
coverage will not be achieved unless Bonferroni or other correction is applied.
\subsubsection{The Cartesian Product Case}
In the case where the convex support of the saturated model is a Cartesian
product, all of this simplifies somewhat. We know that a confidence region
for the mean value parameters of the limiting conditional model only involves
the response variables that are not constrained to be at their observed values
in the limiting conditional model. We take such a confidence region or
separate confidence intervals, such as those provided by the R function
\texttt{predict.glm} applied to the limiting conditional model, to be adequate
for describing those components of the response.
Our one-sided intervals come into play in computing one-sided confidence
intervals for the mean value parameters of the other components of the
response. Since the MLE of their mean value parameters are on the boundary,
one-sided intervals are the only kind that make sense. We distinguish two
cases.
The first case is where $\Gamma_{\text{lim}} = \set{ s \delta : s \in \real }$.
In this case, all intervals $(\hat{s}_{\theta + \gamma}, \infty)$ for the
same $\theta$ but different $\gamma \in \Gamma_{\text{lim}}$ correspond
to the same natural parameter values. Hence we can ignore $\gamma$. We
can hope that one interval $(\hat{s}_{\hat{\theta}}, \infty)$ adequately
describes the variability in mean value parameters for components of the
response having MLE at the boundary.
The second case is where $\Gamma_{\text{lim}}$ contains vectors
not proportional to $\delta$. Then we use our general formula, but can
still hope that the intervals $(\hat{s}_{\hat{\theta} + \gamma}, \infty)$,
$\gamma \in \Gamma_{\text{lim}}$ adequately
describe the variability in mean value parameters for components of the
response having MLE at the boundary.
In both cases we only use the intervals corresponding to $\hat{\theta}$
rather than all $\theta \in R_{\text{lim}}$.
This is clearly not exactly correct; however, we do have the following
argument. The idea is that we are separating variation into two components,
one along the direction of recession and the other across the direction of
recession. To a first approximation, our one-sided intervals are about
variation along the direction of recession and conventional intervals for
the limiting conditional model are about variation across the direction of
recession. In the spirit of sloppiness that allows us to provide separate
confidence intervals rather than confidence regions, we consider this division
not too bad. From the discussion about the distributions
for $\theta + s \delta$ and $\theta + \gamma + s \delta$ converging to the
same limiting conditional distribution, we see that this division cannot
be exactly correct, but it may do for practical purposes.
In any event, statisticians have made do up to now with no tools whatsoever
for handling this issue.
Approximately correct tools will be a great improvement.
\subsubsection{Calculating the Constancy Space} \label{sec:gamma}
By Theorems~\ref{th:constancy} and~\ref{th:exist} the constancy space
is $N_C(y)$ in the case where every direction of recession is a direction
of constancy. By Corollary~\ref{cor:generic-limit} the tangent space
$T_{C \cap H}(y)$ in the limiting conditional model is $\spam L$.
Hence the constancy space is
$$
N_{C \cap H}(y)
=
\set{ \delta \in \real^p : \inner{v, \delta} = 0, \ v \in L }.
$$
In the case of a natural affine submodel this becomes
\begin{equation} \label{eq:gamma}
\Gamma_{\text{lim}}
=
N_{C_{\text{sub}} \cap H_{\text{sub}}}(M^T y)
=
\set{ \delta \in \real^q :
\inner{v, \delta} = 0, \ v \in L_{\text{sub}} },
\end{equation}
where $q$ is the dimension of the submodel.
We have already seen in Section~\ref{sec:linearity} how to calculate
$L_{\text{sub}}$. Now we merely note that $L_{\text{sub}}$ is
a V-representation of its span, and a call to the function \texttt{scdd}
in the \texttt{rcdd} package will compute an H-representation of this span,
which is also a V-representation of \eqref{eq:gamma}, that is, a basis
for the constancy space of the limiting conditional model. The examples
that follow do not exhibit a need for this operation, but the need would
arise in other examples.
When calculating \eqref{eq:gamma} for the purpose of generating one-sided
confidence intervals it is clear that we do not need to let $\gamma$ range
over the whole constancy space \eqref{eq:gamma} because points $\gamma$
and $\gamma + s \delta$ lead to the same one-sided confidence intervals.
Hence it is enough to use the subspace of $\Gamma_{\text{lim}}$ orthogonal
to $\delta$, which is calculated by feeding $L_{\text{sub}} \cup \{ \delta \}$
to the R function \texttt{scdd} for conversion to an H-representation,
whose rows give a basis of the desired subspace.
\section{Examples} \label{sec:example}
\subsection{A Logistic Regression Example} \label{sec:logit}
We start with a logistic regression.
Suppose we observe a vector $y$ whose
components are Bernoulli with means forming a vector $p$. The natural
parameter is $\theta = \logit(p)$, where logit operates componentwise,
$\theta_i = \logit(p_i)$. Suppose we also have one covariate vector $x$
and we want to fit a quadratic model
$$
\theta_i = \beta_1 + \beta_2 x_i + \beta_3 x_i^2.
$$
Finally, suppose the $x_i$ take the values 1, $\ldots$, 30, and $y_i = 0$
for $x_i \le 12$ or $x_i \ge 24$ and $y_i = 1$ otherwise.
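For reference, the log likelihood for this submodel (standard logistic
regression algebra, not spelled out in the code below) is
$$
l(\beta)
=
\inner{M^T y, \beta} - \sum_{i = 1}^{30} \log \bigl( 1 + e^{(M \beta)_i} \bigr),
$$
where $M$ is the model matrix having rows $(1, x_i, x_i^2)$; when the MLE
does not exist, the supremum of this function is approached only in the limit
along a direction of recession.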
The following R statements create the data and attempt to fit the model.
<<>>=
x <- 1:30
y <- c(rep(0, 12), rep(1, 11), rep(0, 7))
out1 <- suppressWarnings(glm(y ~ x + I(x^2), family = binomial, x = TRUE))
@
It seems difficult if not impossible to capture warning messages in a
Sweave file, so we have suppressed them. The warnings given by R version 2.7.0
are
\begin{verbatim}
algorithm did not converge
fitted probabilities numerically 0 or 1 occurred
\end{verbatim}
(two separate warnings). This is the \texttt{glm} function's way of indicating
that the MLE may not exist. However, because we gave the argument
\verb@x = TRUE@, we have obtained the model matrix
<<>>=
M <- out1$x
@
Now we are ready to carry out the computation of Section~\ref{sec:linearity},
determining $L_{\text{sub}}$.
<<>>=
tanv <- M
tanv[y == 1, ] <- (- tanv[y == 1, ])
vrep <- cbind(0, 0, tanv)
lout <- linearity(vrep, rep = "V")
lout
@
The rows of the matrix \texttt{tanv} are the elements of $V_{\text{sub}}$.
The matrix \texttt{vrep} is the form in which the \texttt{rcdd} package
encodes the V-representation
of $\con(\pos V_{\text{sub}}) = T_{C_{\text{sub}}}(M^Ty)$.
The vector \texttt{lout} gives the indices of the elements of $V_{\text{sub}}$
that are in $L_{\text{sub}}$. That it has length zero indicates that
$L_{\text{sub}} = \varnothing$, and this in turn indicates
by Corollary~\ref{cor:generic-limit} that the
support of the limiting conditional model is $C_{\text{sat}} \cap H_{\text{sat}} = \{ y \}$.
Hence there is only one distribution in the limiting conditional model,
which is the distribution concentrated at $y$, and this trivially must be
the MLE in the limiting conditional model. This also implies that
every direction is a direction of constancy in the limiting conditional model,
that is, $\Gamma_{\text{lim}} = \real^3$.
Now we are ready to carry out the computation of Section~\ref{sec:gdor},
calculating $\delta$.
<<>>=
hrep <- cbind(- vrep, - 1)
hrep <- rbind(hrep, c(0, 1, rep(0, 3), - 1))
objv <- c(rep(0, 3), 1)
pout <- lpcdd(hrep, objv, minimize = FALSE)
names(pout)
pout$solution.type
@
The list \texttt{pout} is the solution of the linear programming problem
described in Section~\ref{sec:gdor}.
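If we unpack the \texttt{rcdd} conventions (a row $(\ell, b, a)$ of an
H-representation with $\ell = 0$ encodes the inequality
$b + \inner{a, x} \ge 0$), this linear program in the variables
$(\delta, \epsilon)$ is
$$
\text{maximize } \epsilon
\quad \text{subject to} \quad
\inner{v, \delta} + \epsilon \le 0, \quad v \in V_{\text{sub}},
\qquad \epsilon \le 1.
$$
An optimal $\epsilon > 0$ yields a $\delta$ satisfying
$\inner{v, \delta} < 0$ for all $v \in V_{\text{sub}}$, which is the
validity check made below.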
We obtain the GDOR and check its validity.
<<>>=
gdor <- pout$primal.solution[- length(pout$primal.solution)]
all(tanv %*% gdor < 0)
gdor
@
All of the computation to this point has used ordinary inexact computer
arithmetic. We also illustrate exact infinite-precision rational arithmetic
computations.
<<>>=
pout.exact <- lpcdd(d2q(hrep), d2q(objv), minimize = FALSE)
gdor.exact <- pout.exact$primal.solution[- length(pout.exact$primal.solution)]
gdor.exact
q2d(gdor.exact)
@
We are now ready to find one-sided confidence intervals. We use the
R function \texttt{uniroot} to solve equations involved in this process.
We start by defining a function that evaluates $\myPr_{\beta + s \delta}(Y \in H)$
for any $\beta$ and $s$, the GDOR $\delta$ being fixed.
<<>>=
invlogit <- function(x) 1 / (1 + exp(- x))
prob.face <- function(beta, s) {
    moo <- M %*% (beta + s * gdor)
    moo <- as.vector(moo)
    prod(ifelse(y == 1, invlogit(moo), invlogit(- moo)))
}
@
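In symbols, \texttt{prob.face} computes
$$
\myPr_{\beta + s \delta}(Y = y)
=
\prod_{i = 1}^{30} \frac{e^{y_i \theta_i(s)}}{1 + e^{\theta_i(s)}},
\qquad
\theta(s) = M (\beta + s \, \delta),
$$
which is the same as $\myPr_{\beta + s \delta}(Y \in H)$ because the face
here satisfies $C_{\text{sat}} \cap H_{\text{sat}} = \{ y \}$.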
Now we try it out on the point $\beta = 0$.
<<>>=
alpha <- 0.05
beta.hat <- rep(0, length(gdor))
foo <- function(s) prob.face(beta.hat, s) - alpha
fred <- uniroot(foo, lower = -10, upper = 10)
lowbnd <- fred$root
c(lowbnd, Inf)
fred$estim.prec
@
The meaning of this confidence interval is not obvious just from looking
at the numbers. Hence we make a plot showing the mean values that correspond
to $s \delta$ for $s$ in the interval.
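For readers who want to check the logic outside R, here is a minimal Python
sketch of the same root-finding step. The direction \texttt{delta} below is
hand-constructed, proportional to $-(x - 12.5)(x - 23.5)$ and hence positive
exactly where $y_i = 1$; it satisfies the sign conditions for a direction of
recession but is not the vector \texttt{lpcdd} returned, so the resulting
bound is not comparable to the Sweave output above.

```python
import math

# data as in the example above
x = list(range(1, 31))
y = [0] * 12 + [1] * 11 + [0] * 7

# hand-constructed direction of recession (hypothetical, not the lpcdd output):
# theta_i = -(x_i - 12.5) * (x_i - 23.5) = -293.75 + 36 x_i - x_i^2 is
# positive exactly where y_i = 1 and negative where y_i = 0
delta = (-293.75, 36.0, -1.0)

def invlogit(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    return math.exp(t) / (1.0 + math.exp(t))

def prob_face(beta, s):
    """Pr_{beta + s * delta}(Y = y) for the quadratic logistic model."""
    p = 1.0
    for xi, yi in zip(x, y):
        m = (1.0, float(xi), float(xi * xi))  # model matrix row (1, x, x^2)
        theta = sum((b + s * d) * mj for b, d, mj in zip(beta, delta, m))
        p *= invlogit(theta) if yi == 1 else invlogit(-theta)
    return p

def lower_bound(beta, alpha=0.05, lo=0.0, hi=10.0, tol=1e-9):
    """Bisect for the s solving prob_face(beta, s) = alpha.

    Assumes prob_face(beta, .) increases through alpha on [lo, hi],
    which holds here since every factor is increasing in s.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_face(beta, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s_low = lower_bound((0.0, 0.0, 0.0))  # analogue of fred$root for beta = 0
```

The one-sided interval in this $s$ parametrization is then
$(s_{\text{low}}, \infty)$, exactly as in the \texttt{uniroot} call above.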
<