Course Announcement

Instructor: Charles Geyer (5-8511, charlie@stat.umn.edu)

Inequality-constrained statistical inference involves situations where there are inequality constraints on parameters. One such situation is well-known: classical one-tailed tests. Other inequality-constrained problems are not covered in the core courses here at Minnesota (or for that matter in most other programs). This course gives a thorough introduction.

There will be no textbook for the course, but some material will be taken from the following books: Fletcher (1987), Robertson, Wright, and Dykstra (1988), and Rockafellar and Wets (1998), which are listed in the bibliography below.

Course Content

Optimality Conditions

If you are like the usual statistics student, you think the way to find maxima of functions is to find a point where the first derivative is zero. If you learned to be fussy about this issue, you may also think to plot the function and check whether the maximum occurs on the boundary of the domain.

But what about functions of many variables? How do you check for solutions on the boundary then? Methods for doing this are well known. One formulation of the optimality conditions is the so-called Kuhn-Tucker conditions. We will study these and others.

We will also study second-order optimality conditions, the constrained analog of the criterion that a zero first derivative and a negative second derivative indicate a local maximum.
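For concreteness, here is one standard statement of the Kuhn-Tucker conditions (a sketch, not necessarily the exact form we will use in the course): for minimizing f(x) subject to constraints g_i(x) ≤ 0, a first-order necessary condition at x (under a constraint qualification) is the existence of Lagrange multipliers λ_i such that

```latex
\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) = 0, \qquad
\lambda_i \ge 0, \qquad
g_i(x) \le 0, \qquad
\lambda_i g_i(x) = 0 \quad \text{for all } i.
```

The last condition, complementary slackness, says a multiplier can be nonzero only for a constraint that is active (holds with equality) at x.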

Optimization Algorithms

When statisticians think of optimization algorithms, they either think of Newton-Raphson or they think of one of the algorithms with a peculiarly statistical flavor, such as iteratively reweighted least squares (IRLS), iterative proportional fitting (IPF), or the EM algorithm. Except for EM, none of these are useful for constrained problems, nor are they good algorithms even for unconstrained problems.

Good optimization algorithms are efficient and always go uphill. Newton-Raphson is only good if modified by various safeguarding procedures that prevent downhill steps. It can be modified to allow inequality constraints, in which case it is called sequential quadratic programming (SQP). Any algorithm that is efficient is asymptotically equivalent to Newton or SQP (the Dennis-Moré theorem). IRLS and IPF are not efficient. They are bad algorithms that only a statistician could like.
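As a minimal one-dimensional illustration of safeguarding (a sketch of the idea, not SQP itself; the function and variable names are mine), the raw Newton step can be modified so the algorithm always goes uphill: fall back to steepest ascent when the second derivative has the wrong sign, and backtrack until the objective actually increases.

```python
def safeguarded_newton(f, grad, hess, x0, tol=1e-10, max_iter=100):
    """One-dimensional Newton maximization with two safeguards:
    use a steepest-ascent step if the second derivative is not
    negative, and halve the step until f actually increases."""
    x = x0
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        if abs(g) < tol:
            break
        # Raw Newton step -g/h is an ascent direction only if h < 0;
        # otherwise fall back to the gradient direction.
        step = -g / h if h < 0 else g
        t = 1.0
        # Backtracking: never accept a downhill step.
        while f(x + t * step) <= f(x) and t > 1e-16:
            t /= 2.0
        x = x + t * step
    return x

# Maximize f(x) = -(x - 2)^2; the maximizer is x = 2.
xhat = safeguarded_newton(
    lambda x: -(x - 2.0) ** 2,   # objective
    lambda x: -2.0 * (x - 2.0),  # first derivative
    lambda x: -2.0,              # second derivative
    0.0,
)
```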

EM is peculiar in that it does a magic trick. It optimizes a function (the missing data log likelihood) that it doesn't ever evaluate. When that function cannot be evaluated, EM is great. But EM is often recommended in cases, such as random effects models, in which the missing data log likelihood can be evaluated but is a bit of a bother. Then EM is just another dumb algorithm that only a statistician could like.

We will learn about good general optimization algorithms, including Levenberg-Marquardt, trust region methods, active set methods, and SQP. We will also study an interesting algorithm invented by statisticians called pool adjacent violators (PAVA) that is a good algorithm for a certain class of inequality-constrained problems called isotonic regression problems.
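A sketch of PAVA for the simplest isotonic regression problem, weighted least squares subject to a monotone ordering (the function and variable names here are mine, not standard):

```python
def pava(y, w=None):
    """Pool adjacent violators: minimize sum_i w_i (x_i - y_i)^2
    subject to x_1 <= x_2 <= ... <= x_n."""
    if w is None:
        w = [1.0] * len(y)
    # Each block is [fitted value, total weight, count].
    # Append points one at a time, pooling adjacent blocks
    # while the monotone order is violated.
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v1, w1, n1 = blocks.pop()
            v0, w0, n0 = blocks.pop()
            wt = w0 + w1
            # Pooled blocks get their weighted average as fitted value.
            blocks.append([(w0 * v0 + w1 * v1) / wt, wt, n0 + n1])
    # Expand blocks back out to one fitted value per observation.
    x = []
    for v, _, n in blocks:
        x.extend([v] * n)
    return x

fit = pava([1, 3, 2, 4])  # the violating pair 3, 2 is pooled to 2.5
```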

Asymptotics

In order to do statistics with inequality-constrained problems we need sampling distributions of estimators and test statistics. Or perhaps I should have said frequentist inference. Bayesian inequality-constrained inference is trivial. You just turn the Bayesian crank. These problems typically have no closed-form solutions for posterior distributions and must be done by Monte Carlo, but that is true of almost all Bayesian problems, constrained or not.

Asymptotics under inequality constraints has been well understood for a long time, in the usual meaning of well understood, which is that a handful of people understand it. Two very famous papers by Chernoff (1954) and Le Cam (1970) laid out the correct asymptotics for most such problems. Geyer (1994) added useful theoretical tools and weakened the regularity conditions a bit.

Although Chernoff (1954) is very much cited as the source of "consistency implies root n consistency," a minor technical point of that paper, its main point, the asymptotic distribution of the likelihood ratio test statistic for arbitrary hypotheses, is unknown to the vast majority of statisticians. Statisticians often woof about testing in a very general framework, starting the discussion with something like "let H0 and H1 be disjoint subsets of the parameter space and consider testing H0 versus H1," but they rarely have a clue how to proceed except when H0 and H1 are nested manifolds, so the asymptotic distribution is chi-squared. Similarly, Le Cam (1970) is often cited as the source of weak regularity conditions for maximum likelihood based on the notion of differentiability in quadratic mean. That paper also gives the asymptotic distribution of the MLE whenever the parameter space is convex, but almost no one knows that.
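For orientation, here is a rough statement of the Chernoff-type result in the nested case, with regularity conditions omitted and the Fisher-information standardization suppressed: if the hypotheses are approximated at the true parameter by nested cones C₀ ⊂ C₁, then

```latex
-2 \log \Lambda_n \;\xrightarrow{d}\;
\| Z - \Pi_{C_0} Z \|^2 - \| Z - \Pi_{C_1} Z \|^2,
\qquad Z \sim N(0, I),
```

where Π_C denotes Euclidean projection onto the cone C. When C₀ and C₁ are subspaces this reduces to the familiar chi-squared distribution; for general cones it is a mixture of chi-squareds.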

Although this asymptotic theory can be discussed using treatments much resembling classical asymptotic theory (Chernoff and Le Cam used such methods), the theory is much simpler, more elegant, and more powerful if we use the notion of epiconvergence of functions, an important tool in optimization theory. The key idea, first presented in Geyer (1994), is to do the asymptotics on objective functions (such as log likelihoods) rather than on solutions (such as MLEs). Once we obtain epiconvergence in law of objective functions, the convergence in law results for solutions (MLEs) and optimal values (likelihood ratio test statistics) follow as simple corollaries. We can also get theorems about level sets (likelihood-based confidence intervals) and profiles (profile likelihoods) as simple corollaries. These theorems are entirely new, something statisticians never thought of, and they just drop out of the new approach.

Solution Path Algorithms

Another name for inequality-constrained optimization is nonsmooth optimization. Minimizing a nonsmooth function is equivalent to inequality-constrained minimization and vice versa. So nonsmooth optimization fits into this course.

An example of nonsmooth optimization of much recent interest is the so-called lasso (Tibshirani, 1996). If l(θ) is a minus log likelihood or, more generally, any loss function, then the lasso objective function is

l(θ) + λ ||θ||
where ||⋅|| denotes the L1 norm
||θ|| = ∑ᵢ |θᵢ|

Let θ(λ) denote the minimizer of the lasso objective function (assuming it is unique) for a fixed value of λ. A solution path algorithm (Efron et al., 2004) efficiently determines the entire solution path λ ↦ θ(λ). It turns out that in many cases of interest, as when l(θ) is quadratic (least squares regression), the solution path is continuous and piecewise linear. But many open questions remain about related situations.
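A minimal sketch of the piecewise-linear phenomenon in the very simplest case (function names are mine): when l(θ) = ½||θ − z||², as in least squares with an orthonormal design, the coordinates separate and the lasso minimizer is given componentwise by soft thresholding, so each coordinate of the path λ ↦ θ(λ) is continuous and piecewise linear in λ.

```python
def soft_threshold(z, lam):
    """Lasso solution when l(theta) = (1/2)||theta - z||^2:
    theta_i(lambda) = sign(z_i) * max(|z_i| - lambda, 0).
    Each coordinate shrinks linearly toward zero as lambda
    grows and then sticks at zero: piecewise linear in lambda."""
    return [(1.0 if zi > 0 else -1.0) * max(abs(zi) - lam, 0.0)
            for zi in z]

# At lambda = 1, the coordinate 3.0 shrinks to 2.0 and the
# small coordinates -1.0 and 0.5 are set exactly to zero.
path_point = soft_threshold([3.0, -1.0, 0.5], 1.0)
```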

Bibliography