Course Announcement
- Inequality-Constrained Statistical Inference
- Stat 8931
- Spring 2008
- 9:05–9:55 MWF
- FordH 170
Instructor: Charles Geyer (5-8511, charlie@stat.umn.edu)
Inequality-constrained statistical inference involves situations where there are inequality constraints on parameters. One such situation is well-known: classical one-tailed tests. Other inequality-constrained problems are not covered in the core courses here at Minnesota (or for that matter in most other programs). This course gives a thorough introduction.
There will be no textbook for the course, but some material will be taken from the following books: Fletcher (1987), Robertson, Wright, and Dykstra (1988), and Rockafellar and Wets (1998), which are listed in the bibliography below.
Course Content
Optimality Conditions
If you are like the usual statistics student, you think the way to find maxima of functions is to find a point where the first derivative is zero. If you learned to be fussy about this issue, you may also think to plot the function and check whether the maximum occurs on the boundary of the domain.
But what about functions of many variables? How do you check for solutions on the boundary then? Methods for doing this are well known. One formulation of the optimality conditions is the so-called Kuhn-Tucker conditions. We will study these and others.
We will also study second-order optimality conditions, the constrained analog of the criterion that a negative second derivative and zero first derivative indicates a local maximum.
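As a concrete illustration (the problem below is mine, not from the course materials), the Kuhn-Tucker conditions can be checked numerically at a candidate solution of a toy problem: minimize (x1 − 2)² + (x2 − 2)² subject to x1 + x2 ≤ 2.

```python
import numpy as np

# Toy problem (illustrative, not from the course):
# minimize f(x) = (x1 - 2)^2 + (x2 - 2)^2
# subject to g(x) = x1 + x2 - 2 <= 0.
# Candidate solution x* = (1, 1) with Lagrange multiplier mu = 2.

def grad_f(x):
    return 2.0 * (x - 2.0)

def g(x):
    return x[0] + x[1] - 2.0

grad_g = np.array([1.0, 1.0])

x_star = np.array([1.0, 1.0])
mu = 2.0

# Kuhn-Tucker conditions for a minimum:
stationarity = grad_f(x_star) + mu * grad_g   # should be the zero vector
primal_feas = g(x_star) <= 0                  # x* satisfies the constraint
dual_feas = mu >= 0                           # multiplier is nonnegative
comp_slack = mu * g(x_star)                   # should be zero (slackness)

print(stationarity, primal_feas, dual_feas, comp_slack)
```

All four conditions hold at x* = (1, 1), which is the projection of the unconstrained minimizer (2, 2) onto the constraint set.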
Optimization Algorithms
When statisticians think of optimization algorithms, they either think of Newton-Raphson or they think of one of the algorithms with a peculiarly statistical flavor, such as iteratively reweighted least squares (IRLS), iterative proportional fitting (IPF), or the EM algorithm. Except for EM, none of these are useful for constrained problems, nor are they good algorithms even for unconstrained problems.
Good optimization algorithms are efficient and always go uphill.
Newton-Raphson is only good if modified by various safeguarding
procedures that prevent downhill steps. It can be modified to allow
inequality constraints, in which case it is called
sequential quadratic programming
(SQP). Any algorithm that is efficient
is asymptotically equivalent to Newton or SQP (the Dennis-Moré theorem).
IRLS and IPF are not efficient. They are bad algorithms that only a
statistician could like.
EM is peculiar in that it does a magic trick: it optimizes a function (the missing data log likelihood) that it never evaluates. When that function cannot be evaluated, EM is great. But EM is often recommended in cases, such as random effects models, in which the missing data log likelihood can be evaluated but is a bit of a bother. Then EM is another dumb algorithm that only a statistician could like.
We will learn about good general optimization algorithms, including
Levenberg-Marquardt, trust region methods, active set methods, and SQP.
We will also study an interesting algorithm invented by statisticians
called pool adjacent violators
(PAVA) that is a good algorithm
for a certain class of inequality-constrained problems called
isotonic regression
problems.
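A bare-bones sketch of PAVA (this particular implementation is mine, not course code): it scans the data left to right, pooling adjacent blocks by weighted averaging whenever they violate the monotonicity constraint.

```python
import numpy as np

def pava(y, w=None):
    """Pool adjacent violators: least-squares fit to y subject to
    the fitted values being nondecreasing (isotonic regression)."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Each block holds [weighted mean, total weight, block length].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool adjacent blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

print(pava([1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```

The violating pair (3, 2) is pooled to its mean 2.5, and the result is the closest nondecreasing fit in the least-squares sense.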
Asymptotics
In order to do statistics with inequality-constrained problems we need
sampling distributions of estimators and test statistics. Or perhaps
I should have said frequentist
inference. Bayesian inequality-constrained
inference is trivial. You just turn the Bayesian crank. These problems
typically have no closed-form solutions for posterior distributions and
must be done by Monte Carlo, but that is true of almost all Bayesian
problems, constrained or not.
Asymptotics under inequality constraints has been well understood
for a long time, in the usual meaning of well understood,
which is that a handful of people understand it. Two very famous papers by Chernoff (1954)
and Le Cam (1970) laid out the correct asymptotics for most such problems.
Geyer (1994) added useful theoretical tools and weakened the regularity
conditions a bit.
Although Chernoff (1954) is widely cited as the source of the fact that
consistency implies root-n consistency, a minor technical point of the
paper, its main result, the asymptotic distribution of the likelihood
ratio test statistic for arbitrary hypotheses, is unknown to the vast
majority of statisticians. Statisticians often woof about testing in a very general
framework, starting the discussion with something like
let H0 and H1
be disjoint subsets of the parameter space and consider testing
H0 versus H1,
but they rarely have a clue as to how
to proceed except when
H0 and H1
are nested manifolds
so the asymptotic distribution is chi-squared. Similarly, Le Cam (1970)
is often cited as the source of weak regularity conditions for maximum
likelihood based on the notion of differentiability in quadratic mean.
That paper also gives the asymptotic distribution of the MLE whenever the
parameter space is convex, but almost no one knows that.
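A small Monte Carlo experiment (my sketch, not course code) makes the simplest boundary case concrete. Testing H0: θ = 0 against θ ≥ 0 with N(θ, 1) data, the likelihood ratio statistic is n·max(x̄, 0)², whose null distribution is the mixture ½δ₀ + ½χ²₁, not χ²₁; using the naive χ²₁ critical value gives a test at half the nominal level.

```python
import numpy as np
from scipy.stats import chi2

# X_i ~ N(theta, 1), testing H0: theta = 0 vs. H1: theta >= 0.
# LRT statistic: 2(l(theta_hat) - l(0)) = n * max(xbar, 0)^2.
rng = np.random.default_rng(42)
n, nsim = 100, 100_000
xbar = rng.standard_normal((nsim, n)).mean(axis=1)
lrt = n * np.maximum(xbar, 0.0) ** 2

print((lrt == 0).mean())   # about 0.5: point mass at zero

c = 2.706                  # naive chi^2_1 critical value for level 0.10
print((lrt > c).mean())    # about 0.05, half the nominal 0.10 level
```

The upper tail satisfies P(LRT > c) = ½ P(χ²₁ > c), which is exactly the Chernoff-style mixture result for a one-dimensional boundary hypothesis.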
Although this asymptotic theory can be discussed using treatments much resembling classical asymptotic theory (Chernoff and Le Cam used such methods), the theory is simpler, more elegant, and more powerful if we use the notion of epiconvergence of functions, an important tool in optimization theory. The key idea, first presented in Geyer (1994), is to do the asymptotics on objective functions (such as log likelihoods) rather than on solutions (such as MLEs). Once we obtain epiconvergence in law of objective functions, the convergence in law results for solutions (MLEs) and optimal values (likelihood ratio test statistics) follow as simple corollaries. We can also get theorems about level sets (likelihood-based confidence intervals) and profiles (profile likelihoods) as simple corollaries. These theorems are entirely new, something statisticians never thought of, and they just drop out of the new approach.
Solution Path Algorithms
Another name for inequality-constrained optimization is nonsmooth
optimization. Minimizing a nonsmooth function is equivalent to
inequality-constrained minimization and vice versa.
So nonsmooth optimization fits into this course.
An example of nonsmooth optimization of much recent interest is the so-called lasso (Tibshirani, 1996). If l(θ) is minus a log likelihood or, more generally, any loss function, then the lasso objective function is l(θ) + λ ∑j |θj|, the sum running over the components of θ, where λ ≥ 0 is a tuning constant.
Let θ(λ) denote the minimizer of the lasso objective function (assuming it is unique) for a fixed value of λ. A solution path algorithm (Efron et al., 2004) efficiently determines θ(λ) for all λ at once. It turns out that in many cases of interest, such as when l(θ) is quadratic (least squares regression), the solution path is continuous and piecewise linear. But many open questions remain about related situations.
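In the simplest case, an orthonormal design, the lasso solution is available in closed form as soft thresholding, and the piecewise-linear dependence on λ can be seen directly. A minimal sketch (my illustration, not course code):

```python
import numpy as np

# Orthonormal-design lasso: with z = X'y the solution is coordinatewise
# soft thresholding, theta_j(lambda) = sign(z_j) * max(|z_j| - lambda, 0).
# Each coordinate path is continuous and piecewise linear in lambda,
# with a kink where the coordinate leaves the model.
def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -1.5, 0.5])  # illustrative least-squares coefficients
for lam in (0.0, 1.0, 2.0, 4.0):
    print(lam, soft_threshold(z, lam))
```

At λ = 0 the path starts at the least-squares solution z; by λ = 4 every coordinate has been shrunk exactly to zero, the third one dropping out first at λ = 0.5.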
Bibliography
- Chernoff, H. (1954). On the distribution of the likelihood ratio. Ann. Math. Statist., 25, 573–578.
- Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Ann. Statist., 32, 407–499.
- Fletcher, R. (1987). Practical Methods of Optimization (2nd ed.). Chichester and New York: John Wiley.
- Geyer, C. J. (1994). On the asymptotics of constrained M-estimation. Ann. Statist., 22, 1993–2010.
- Le Cam, L. (1970). On the assumptions used to prove asymptotic normality of maximum likelihood estimates. Ann. Math. Statist., 41, 802–828.
- Robertson, T., Wright, F. T., and Dykstra, R. L. (1988). Order Restricted Statistical Inference. New York: John Wiley.
- Rockafellar, R. T. and Wets, R. J.-B. (1998). Variational Analysis. Berlin: Springer-Verlag.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58, 267–288.