Statistics 5102 (Geyer, Spring 2007) Examples: Maximum Likelihood Estimation

One Parameter

A Gamma Example

In order to do maximum likelihood estimation (MLE) using the computer we need to write the likelihood function or log likelihood function (usually the latter) as a function in the computer language we are using.

In this course we are using R and Rweb. So we need to know how to write the log likelihood as an R function.

For an example we will use the gamma distribution with known scale parameter, which we assume to be 1.0. This is Example 6.6.2 in DeGroot and Schervish.

Some made-up data actually having a gamma distribution (they are computer simulated gamma random variables) is shown below. The print(x) statement prints the whole data vector (30 numbers) and the hist(x) statement makes a histogram.

Coding the Likelihood Function

The R function dgamma (on-line help) calculates the density of the gamma distribution. As with most R functions the calculation is vectorized, so

dgamma(x, shape = alpha)

calculates a vector of 30 numbers, the values f_α(x_i) for each of the 30 data points x_i

Another optional argument (found on the on-line help for the dgamma function) tells the function to return log probability density instead of probability density

dgamma(x, shape = alpha, log = TRUE)

Finally the sum function (on-line help) calculates the sum of these logs. Hence the code

logl <- function(alpha, x)
    return(sum(dgamma(x, shape = alpha, log = TRUE)))

This function can be slightly improved by inserting a check that alpha is a single variable (a vector of length 1 to R) rather than a vector, which doesn't make sense,

logl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(sum(dgamma(x, shape = alpha, log = TRUE)))
}

Of course, when there is no mistake (no bugs in the code) both functions do exactly the same thing. But when there is a mistake, the function with the error check will make it clear what the problem is.

Plotting the Likelihood Function

If you want to look at the log likelihood, the following R statements seem to do the job.

Maximizing the Likelihood Function

R has several functions that optimize functions. The one we will explain here is the nlm function (on-line help). Another optimizer optim will be briefly demonstrated in the last section of this page.

The nlm function has a huge number of arguments, most of which can be ignored. The only arguments that must be supplied are two

f the function to be minimized.
p a starting value for the variable. The dimension of p tells nlm the dimension of the space in which f takes values. So p must be supplied to specify the dimension, if for no other reason.
But it helps if we can specify a starting value reasonably close to the solution.

We won't at this point discuss any of the optional arguments described on the on-line help except for the somewhat mysterious ... arguments

... the on-line help for the nlm function describes this as additional arguments to f. What this means is that if we supply an argument whose name is the name of one of the arguments to the function supplied as the argument f and whose name is not one of the named arguments to nlm, then this argument will be forwarded to the function supplied as the argument f.
In particular, if we write our log likelihood function to have an additional argument x which is the data (as we did above), then since nlm doesn't have an argument named x, this will be passed to our log likelihood function.

Since nlm minimizes rather than maximizes we need to write an f function that calculates the minus the log likelihood rather than the log likelihood (stand on your head and maximization becomes minimization).

Good starting values are hard to find, in general. In our particular problem, maximum likelihood for the shape parameter of the gamma distribution, a good estimate of the shape parameter α is the sample mean, since the theoretical mean of the gamma distribution is α / β where β is the rate parameter, here assumed to be known to be β = 1.0. Hence the population mean is α and the sample mean is a reasonable estimate (by the law of large numbers).

This gives us the following first attempt at maximum likelihood for our example.

Using More Options

A few more options of nlm can be helpful.

hessian returns the second derivative (an approximation calculated by finite differences) of the objective function. This will be a k × k matrix if the dimension of the argument p (which must match the dimension of the first argument to the function f) is k.
fscale helps nlm know when it has converged to the solution. It should give roughly the size of the objective function at the solution. Often fscale = length(x) is about right.
print.level tells nlm to blather about what is is doing. print.level = 2 gives maximum verbosity.

We can use the hessian, which is part of the list returned by the nlm function to tell whether the solution is a local minimum. More importantly, we can use it to tells about the statistical properties of the MLE (Section 7.8 in DeGroot and Schervish).

Fisher Information and Confidence Intervals

The hessian returned by the nlm function is the second derivative of minus the log likelihood, that is, what is referred to in the handout on Fisher Information and Confidence Intervals as observed Fisher information. More precisely, it is a finite difference approximation to Fisher information, that is, equation (1.2) in the handout with the MLE plugged in for the parameter.

Thus the following code calculates an asymptotic conf.level confidence interval for the unknown scale parameter in the example we have been doing

Several Parameters

A Two-Parameter Gamma Example

For our first example of two-parameter maximum likelihood estimation, we use the two-parameter gamma distribution and the same data as above.

Coding the Likelihood Function

Coding the log likelihood (really minus the log likelihood is what we need to hand to nlm) is much the same as coding the uniparameter case. The main difference is that the argument to the function must be a vector of parameters.

We usually want to take this vector apart into its scalar components if, as in the gamma example, the different parameters do different things.

With that in mind our mlogl function looks something like this

mlogl <- function(theta, x) {
    if (length(theta) != 2) stop("theta must be vector of length 2")
    alpha <- theta[1]
    lambda <- theta[2]
    if (alpha <= 0) stop("theta[1] must be positive")
    if (lambda <= 0) stop("theta[2] must be positive")
    return(- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE)))
}

Note: We've changed the name of the rate parameter from β (which DeGroot and Schervish use) to λ (which many other theory books use) because it conflicts with an R function name (the R function beta calculates the normalizing constant of the beta distribution, which is called the beta function).

A Starting Point for Optimization

We need good starting points for our optimization algorithm, and the simplest estimators for the two-parameter gamma distribution are so-called method of moments estimators.

This is a traditional theory topic that students that have had a course like this are expected to know (on preliminary exams, actuarial exams, and so forth) so it won't hurt to cover it, even though DeGroot and Schervish ignore it.

The first two moments of the Gamma(α, β) distribution are

E(X) = α / β

and

Var(X) = α / β²

The so-called method of moments says the following

Solve your moment equations for the parameters (if you can).
Replace theoretical moments by sample moments to estimate parameters.

For the gamma distribution, solving for the parameters gives

β = E(X) / Var(X)

and

α = β E(X) = E(X)² / Var(X)

For the plug-in I am handicaped by various sucky web browsers (especially Microsoft) not implementing web standards for mathematics on the web. I can't make an x bar for sample mean. The R statements for these estimators are

alpha.start <- mean(x)^2 / var(x)
lambda.start <- mean(x) / var(x)

The method of moments isn't always applicable (the solve for parameters step may be impossible), and it doesn't necessarily produce good estimators. Maximum likelihood estimators are asymptotically efficient. Method of moment estimators generally aren't. But they do provide good enough starting points for maximum likelihood.

Maximum Likelihood

So let's try it.

The important bits here are the MLE

$estimate 
[1] 1.680531 1.009529

the observed Fisher information matrix

$hessian 
          [,1]      [,2] 
[1,]  24.15181 -29.71534 
[2,] -29.71534  49.45877

and, for comparison with the MLE, the method of moments estimators

Rweb:> print(theta.start) 
[1] 1.5398404 0.9250142

The method of moments estimators seem fairly close to the MLEs, but we can't really tell whether they are close in the statistical sense until we calculate confidence intervals.

The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.

Confidence Intervals

The R solve function (on-line help) solves linear equations and also inverts matrices.

The important output is right at the end. The confidence intervals

[1] 0.899571 2.461491
[1] 0.4637939 1.5552641

for the shape parameter and the rate parameter (in that order).

An important comparison is with the confidence interval for the shape parameter we got when only estimating the shape (assuming we knew the scale parameter was 1.0), which was

[1] 1.271798 2.065813

Note that this interval is much narrower: (1.27, 2.07) when the shape parameter is known versus (0.90, 2.46) when the shape parameter is unknown and must also be estimated.

This is an important lesson.

Confidence intervals for parameters of interest are wider when nuisance parameters are estimated.

A Five-Parameter Normal Mixture Example

For our second example of multi-parameter maximum likelihood estimation, we use the five-parameter, two-component normal mixture distribution. This is a generalization of Example 6.5.8 in DeGroot and Schervish in which we do not assume the two components of the mixture have equal probability, but rather an arbitrary probability p, and we also assume that the mean and variance of both components is unknown. That makes five parameters, mean and variance for the first component, mean and variance for the second component, and the mixture probability p.

The same argument given in Example 6.5.8 in DeGroot and Schervish applies to this generalization and shows that global maximizers of the likelihood function do not exist. However, this is no problem, good local maximizers do exist and have all the desirable properties of maximum likelihood estimates.

The data x₁, x₂, … x_n are assumed to be independent and identically distributed from the distribution with density

f(x | μ₁, σ²₁, μ₂, σ²₂, p) = p φ(x | μ₁, σ²₁) + (1 − p) φ(x | μ₂, σ²₂)

where φ(x | μ, σ²) denotes the normal density with mean μ and variance σ².

Some made-up data actually having a mixture of two normal distributions (they are computer simulated) is shown below.

Coding the Likelihood Function

Coding the log likelihood (really minus the log likelihood is what we need to hand to nlm) is much the same as for the one-parameter example and two-parameter example above. Of course, now the density is completely different, and there are five parameters, so the argument to the function must be a vector of length five.

With that in mind our mlogl function looks something like this

mlogl <- function(theta) {
    stopifnot(is.numeric(theta))
    stopifnot(length(theta) == 5)
    mu1 <- theta[1]
    mu2 <- theta[2]
    v1 <- theta[3]
    v2 <- theta[4]
    p <- theta[5]
    logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) +
        (1 - p) * dnorm(x, mu2, sqrt(v2))))
    return(- logl)
}

A Starting Point for Optimization

We need good starting points for our optimization algorithm, now there are no nice simple estimators. Nor are there even any messy but well known estimators. Maximum likelihood is the only well-known method that is not computer intensive.

Hence we make do with a very crude method. We take p = 1⁄2 as the starting value. Then we divide the data into upper and lower halves and take the sample mean and variance of each as the starting values for the mean and variance of one component.

We do not claim anything for this method other than that it is a thing to do (TTD). We shall see whether it works.

Maximum Likelihood

So let's try it.

R statements mlogl <- function(theta) { stopifnot(is.numeric(theta)) stopifnot(length(theta) == 5) mu1 <- theta[1] mu2 <- theta[2] v1 <- theta[3] v2 <- theta[4] p <- theta[5] logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) + (1 - p) * dnorm(x, mu2, sqrt(v2)))) return(- logl) } n <- length(x) p.start <- 1 / 2 mu1.start <- mean(sort(x)[seq(along = x) <= n / 2]) mu2.start <- mean(sort(x)[seq(along = x) > n / 2]) v1.start <- var(sort(x)[seq(along = x) <= n / 2]) v2.start <- var(sort(x)[seq(along = x) > n / 2]) theta.start <- c(mu1.start, mu2.start, v1.start, v2.start, p.start) out <- nlm(mlogl, theta.start, print.level = 2, fscale = length(x)) out <- nlm(mlogl, out$estimate, print.level = 2, fscale = length(x), hessian = TRUE) print(out) print(theta.start) mu1.hat <- out$estimate[1] mu2.hat <- out$estimate[2] sigma1.hat <- sqrt(out$estimate[3]) sigma2.hat <- sqrt(out$estimate[4]) p.hat <- out$estimate[5] fred <- function(x) p.hat * dnorm(x, mu1.hat, sigma1.hat) + (1 - p.hat) * dnorm(x, mu2.hat, sigma2.hat) hist(x, freq = FALSE) curve(fred, add = TRUE) hist(x, freq = FALSE, ylim = c(0, fred(mu1.hat))) curve(fred, add = TRUE) eigen(out$hessian, symmetric = TRUE)

Dataset URL

The reason why we run nlm twice is to check that it has arrived at a solution even though it gave warnings on the first run.

The important bits here are the MLE

$estimate 
1] 0.005267126 3.010454061 1.006854276 1.038172545 0.676241202

the observed Fisher information matrix

hessian 
            [,1]       [,2]       [,3]      [,4]       [,5] 
[1,]  158.807404 -24.139843 -32.853702  5.586309 -103.42268 
[2,]  -24.139843  63.447291  -9.839708 18.661232  -84.43795 
[3,]  -32.853702  -9.839708  68.430927 -2.357617  -65.80344 
[4,]    5.586309  18.661232  -2.357617 28.203454   38.00752 
[5,] -103.422678 -84.437955 -65.803437 38.007516 1079.07185

and, for comparison with the MLE, our starting point was

Rweb:> print(theta.start) 
[1] -0.4345200  2.3909647  0.5340638  1.4761858  0.5000000

The plots show histograms of the data with the MLE mixture-of-normals p. d. f. overlaid for comparison.

The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.

Confidence Intervals

R statements conf.level <- 0.95 mlogl <- function(theta) { stopifnot(is.numeric(theta)) stopifnot(length(theta) == 5) mu1 <- theta[1] mu2 <- theta[2] v1 <- theta[3] v2 <- theta[4] p <- theta[5] logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) + (1 - p) * dnorm(x, mu2, sqrt(v2)))) return(- logl) } n <- length(x) p.start <- 1 / 2 mu1.start <- mean(sort(x)[seq(along = x) <= n / 2]) mu2.start <- mean(sort(x)[seq(along = x) > n / 2]) v1.start <- var(sort(x)[seq(along = x) <= n / 2]) v2.start <- var(sort(x)[seq(along = x) > n / 2]) theta.start <- c(mu1.start, mu2.start, v1.start, v2.start, p.start) out <- nlm(mlogl, theta.start, fscale = length(x)) out <- nlm(mlogl, out$estimate, fscale = length(x), hessian = TRUE) crit.val <- qnorm((1 + conf.level) / 2) inv.fish.info <- solve(out$hessian) for (i in 1:length(out$estimate)) print(out$estimate[i] + c(-1, 1) * crit.val * sqrt(inv.fish.info[i, i]))

Dataset URL

The important output is right at the end. The confidence intervals

[1] -0.2571063  0.2676405 
[1] 2.527202 3.493707 
[1] 0.6547243 1.3589842 
[1] 0.4432608 1.6330843 
[1] 0.5663284 0.7861540

for the two means, the two variances, and the mixing proportion (in that order).

Note that the confidence intervals are fairly wide, even though the sample size was fairly large according to intuition developed for simpler problems. Nevertheless maximum likelihood does work.

Another Optimizer

It is annoying that the nlm function gives warnings. The problem is that is does not know that variances must be nonnegative and probabilities must be between zero and one. Another optimizer the optim function (on-line help) does allow bounds specification for one of its methods.

Dataset URL

Statistics 5102 (Geyer, Spring 2007) Examples: Maximum Likelihood Estimation

One Parameter

A Gamma Example

Coding the Likelihood Function

Plotting the Likelihood Function

Maximizing the Likelihood Function

Using More Options

Fisher Information and Confidence Intervals

Several Parameters

A Two-Parameter Gamma Example

Coding the Likelihood Function

A Starting Point for Optimization

Maximum Likelihood

Confidence Intervals

A Five-Parameter Normal Mixture Example

Coding the Likelihood Function

A Starting Point for Optimization

Maximum Likelihood

Confidence Intervals

Another Optimizer

Navigation

Contents