Statistics 5102 (Geyer, Spring 2003) MLE

Coding the Likelihood Function

In order to do maximum likelihood estimation (MLE) using the computer we need to write the likelihood function or log likelihood function (usually the latter) as a function in the computer language we are using.

In this course we are using R and Rweb. So we need to know how to write the log likelihood as an R function.

For an example we will use the gamma distribution with known scale parameter, which we assume to be 1.0. This is Example 6.6.2 in DeGroot and Schervish.

Some made-up data actually having a gamma distribution (they are computer simulated gamma random variables) are shown below. The print(x) statement prints the whole data vector (30 numbers) and the hist(x) statement makes a histogram.

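On the course web page the data were entered through an Rweb form. A minimal stand-alone sketch of the same steps (the URL here is a placeholder, not the actual course data file):

# read the 30 simulated gamma observations from a file on the web
# (hypothetical URL; substitute the location of your own data)
x <- scan(url("http://www.example.com/gamma.txt"))
print(x)   # print the whole data vector
hist(x)    # histogram of the data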

The R function dgamma (see the on-line help) calculates the density of the gamma distribution. As with most R functions the calculation is vectorized, so

dgamma(x, shape = alpha)

calculates a vector of 30 numbers: the value of the gamma density with shape parameter α at each of the 30 data points.

Another optional argument, log (described on the on-line help for the dgamma function), tells the function to return the log of the probability density instead of the probability density itself.

dgamma(x, shape = alpha, log = TRUE)

Finally the sum function (see the on-line help) calculates the sum of these logs. Hence the code

logl <- function(alpha, x)
    return(sum(dgamma(x, shape = alpha, log = TRUE)))

This function can be slightly improved by inserting checks that alpha is a scalar (a vector of length 1 to R) rather than a longer vector, which doesn't make sense here, and that alpha is positive.

logl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(sum(dgamma(x, shape = alpha, log = TRUE)))
}

Of course, when there is no mistake (no bugs in the code) both functions do exactly the same thing. But when there is a mistake, the function with the error check will make it clear what the problem is.
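For example, if we accidentally hand the checked version a vector for alpha, the error message pinpoints the problem (an illustrative call, assuming the data vector x is defined):

Rweb:> logl(c(1, 2), x)
Error in logl(c(1, 2), x) : alpha must be scalar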

Plotting the Log Likelihood Function

If you want to look at the log likelihood, the following R statements seem to do the job.

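Something along the following lines does the job (a sketch, assuming the logl function and the data vector x defined above; the grid endpoints are arbitrary choices for this example):

# evaluate the log likelihood on a grid of alpha values
# (sapply is used because our logl insists on scalar alpha)
alpha.grid <- seq(0.5, 3.5, length = 301)
logl.values <- sapply(alpha.grid, logl, x = x)
plot(alpha.grid, logl.values, type = "l",
    xlab = "alpha", ylab = "log likelihood")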

Maximizing the Likelihood Function

R has several functions that optimize functions. The one we will explain here is the nlm function (see the on-line help). It has a huge number of arguments, most of which can be ignored. The only two arguments that must be supplied are f, the function to be minimized, and p, the starting value for the optimization.

We won't at this point discuss any of the optional arguments described on the on-line help except for the somewhat mysterious ... arguments, which nlm passes on to the function f it is minimizing. This is how the data get to our log likelihood function.

Since nlm minimizes rather than maximizes, we need to write an f function that calculates minus the log likelihood rather than the log likelihood (stand on your head and maximization becomes minimization).

Good starting values are hard to find, in general. In our particular problem, maximum likelihood for the shape parameter of the gamma distribution, a good estimate of the shape parameter α is the sample mean, since the theoretical mean of the gamma distribution is α / β where β is the rate parameter, here assumed to be known to be β = 1.0. Hence the population mean is α and the sample mean is a reasonable estimate (by the law of large numbers).

This gives us the following first attempt at maximum likelihood for our example.

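A sketch of that first attempt, assuming the data vector x from above (mlogl is our name for minus the log likelihood, which is what nlm needs):

# minus the log likelihood, since nlm minimizes
mlogl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(- sum(dgamma(x, shape = alpha, log = TRUE)))
}
alpha.start <- mean(x)   # method of moments starting value (rate known, beta = 1)
out <- nlm(mlogl, alpha.start, x = x)   # x is passed to mlogl via the ... arguments
print(out)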

Using More Options

A few more options of nlm can be helpful.

We can use the hessian, which is part of the list returned by the nlm function, to tell whether the solution is a local minimum. More importantly, we can use it to tell us about the statistical properties of the MLE (Section 7.8 in DeGroot and Schervish).

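A sketch, assuming the mlogl function and starting value from the preceding section; the only change is the hessian option:

# hessian = TRUE asks nlm to also return the matrix of second
# derivatives of minus the log likelihood at the solution
out <- nlm(mlogl, alpha.start, x = x, hessian = TRUE)
print(out)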

Fisher Information and Confidence Intervals

The hessian returned by the nlm function is the second derivative of minus the log likelihood, that is, what is referred to in the handout on Fisher Information and Confidence Intervals as observed Fisher information. More precisely, it is a finite difference approximation to the observed Fisher information, that is, to equation (1.2) in the handout with the MLE plugged in for the parameter.

Thus the following code calculates an asymptotic confidence interval, with confidence coefficient conf.level, for the unknown shape parameter in the example we have been doing.

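A sketch of that calculation, assuming the nlm fit out (with hessian = TRUE) from the preceding section; the 0.95 is an arbitrary choice of confidence coefficient:

conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)   # standard normal critical value
alpha.hat <- out$estimate             # the MLE
# observed Fisher information is the hessian of minus the log likelihood
se <- 1 / sqrt(as.numeric(out$hessian))
print(alpha.hat + c(-1, 1) * crit * se)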

Maximizing the Likelihood Function (Multiparameter Case)

Method of Moments Estimators

We need good starting points for our optimization algorithm. We stick with the gamma distribution, but now switch to the two-parameter gamma distribution to get a multiparameter problem. The simplest estimators to use as starting points are the so-called method of moments estimators.

This is a traditional theory topic that students who have had a course like this are expected to know (on preliminary exams, actuarial exams, and so forth), so it won't hurt to cover it, even though DeGroot and Schervish ignore it.

The first two moments of the Gamma(α, β) distribution are

E(X) = α / β

and

Var(X) = α / β²

The so-called method of moments says the following: set the theoretical moments equal to the corresponding sample moments (here the sample mean and sample variance), solve for the parameters, and use the solutions, with the sample moments plugged in, as the estimators.

For the gamma distribution, solving for the parameters gives

β = E(X) / Var(X)

and

α = β E(X) = E(X)² / Var(X)

For the plug-in step I am handicapped by various sucky web browsers (especially Microsoft's) not implementing web standards for mathematics on the web, so I can't make an x bar for the sample mean. The R statements for these estimators, with the sample mean and sample variance plugged in for E(X) and Var(X), are

alpha.start <- mean(x)^2 / var(x)
beta.start <- mean(x) / var(x)

The method of moments isn't always applicable (the solve-for-the-parameters step may be impossible), and it doesn't necessarily produce good estimators. Maximum likelihood estimators are asymptotically efficient; method of moments estimators generally aren't. But they do provide good enough starting points for maximum likelihood.

Coding the Likelihood Function

Coding the log likelihood (really minus the log likelihood, which is what we need to hand to nlm) is much the same as in the uniparameter case. The main difference is that the argument to the function must be a vector of parameters.

We usually want to take this vector apart into its scalar components if, as in the gamma example, the different parameters do different things.

With that in mind, our mlogl function looks something like this

mlogl <- function(theta, x) {
    if (length(theta) != 2) stop("theta must be vector of length 2")
    alpha <- theta[1]
    lambda <- theta[2]
    if (alpha <= 0) stop("theta[1] must be positive")
    if (lambda <= 0) stop("theta[2] must be positive")
    return(- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE)))
}

(Note: We've changed the name of the rate parameter from β, which DeGroot and Schervish use, to λ, which many other theory books use, because the name beta conflicts with an R function name. The R function beta calculates the beta function, the normalizing constant of the beta distribution.)

Maximum Likelihood

So let's try it.

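A sketch of the fit, assuming the two-parameter mlogl above, the data vector x, and the method of moments starting values:

theta.start <- c(mean(x)^2 / var(x), mean(x) / var(x))   # (alpha, lambda)
out <- nlm(mlogl, theta.start, x = x, hessian = TRUE)
print(out)
print(theta.start)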

The important bits here are the MLE

$estimate 
[1] 1.680531 1.009529 

the observed Fisher information matrix

$hessian 
          [,1]      [,2] 
[1,]  24.15181 -29.71534 
[2,] -29.71534  49.45877 

and, for comparison with the MLE, the method of moments estimators

Rweb:> print(theta.start) 
[1] 1.5398404 0.9250142 

The method of moments estimators seem fairly close to the MLEs, but we can't really tell whether they are close in the statistical sense until we calculate confidence intervals.

Confidence Intervals

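A sketch of the interval calculation, assuming the fit out from the preceding section. The asymptotic variances of the MLEs are the diagonal elements of the inverse of the observed Fisher information matrix:

conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
theta.hat <- out$estimate
inv.fish <- solve(out$hessian)   # invert the observed Fisher information matrix
for (i in 1:2)
    print(theta.hat[i] + c(-1, 1) * crit * sqrt(inv.fish[i, i]))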

The important output is right at the end. The confidence intervals

[1] 0.899571 2.461491
[1] 0.4637939 1.5552641

for the shape parameter and the rate parameter (in that order).

An important comparison is with the confidence interval for the shape parameter we got when estimating only the shape (assuming we knew the rate parameter was 1.0), which was

[1] 1.271798 2.065813

Note that this interval is much narrower: (1.27, 2.07) when the rate parameter is known versus (0.90, 2.46) when the rate parameter is unknown and must also be estimated.

This is an important lesson.

Confidence intervals for parameters of interest are wider when nuisance parameters are estimated.