Statistics 5102 (Geyer, Fall 2016) Examples: Maximum Likelihood Estimation

One Parameter

Gamma Distribution, Unknown Shape

In order to do maximum likelihood estimation (MLE) using the computer we need to write the likelihood function or log likelihood function (usually the latter) as a function in the computer language we are using.

In this course we are using R and Rweb. So we need to know how to write the log likelihood as an R function.

For an example we will use the gamma distribution with unknown shape parameter α and known rate parameter λ = 1.0.

Some made-up data actually having a gamma distribution (they are computer simulated gamma random variables) is shown below. The print(x) statement prints the whole data vector (30 numbers) and the hist(x) statement makes a histogram.

Coding the Likelihood Function

The R function dgamma (on-line help) calculates the density of the gamma distribution. As with most R functions the calculation is vectorized, so

dgamma(x, shape = alpha)

calculates a vector of 30 numbers, the values f_α(x_i) for each of the 30 data points x_i

Another optional argument tells the function to return log probability density instead of probability density

dgamma(x, shape = alpha, log = TRUE)

Finally the sum function (on-line help) calculates the sum of these logs. Hence the code

logl <- function(alpha, x)
    return(sum(dgamma(x, shape = alpha, log = TRUE)))

defines an R function that calculates the log likelihood function using the R function named function (on-line help) creates R functions (see also the section on functions in the introduction to R and Rweb page).

This function can be slightly improved by inserting a check that alpha is a single variable (a vector of length 1 to R) rather than a vector, which doesn't make sense,

logl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(sum(dgamma(x, shape = alpha, log = TRUE)))
}

Of course, when there is no mistake (no bugs in the code) both functions do exactly the same thing. But when there is a mistake, the function with the error check will make it clear what the problem is.

Plotting the Likelihood Function

If you want to look at the log likelihood, the following R statements seem to do the job.

Maximizing the Likelihood Function

R has several functions that optimize functions. The one we will explain here is the nlm function (on-line help). Another optimizer optim will be briefly demonstrated in the last section of this page.

The nlm function has a huge number of arguments, most of which can be ignored. The only arguments that must be supplied are two

f the function to be minimized.
p a starting value for the variable. The dimension of p tells nlm the dimension of the space in which f takes values. So p must be supplied to specify the dimension, if for no other reason.
But it helps if we can specify a starting value reasonably close to the solution.

We won't at this point discuss any of the optional arguments described on the on-line help except for the somewhat mysterious ... argument, which is describe as additional arguments to f. What this means is that if we supply an argument whose name is the name of one of the arguments to the function supplied as the argument f and whose name is not one of the named arguments to nlm, then this argument will be forwarded to the function supplied as the argument f.

In particular, if we write our log likelihood function to have an additional argument x which is the data (as we did above), then since nlm doesn't have an argument named x, this will be passed to our log likelihood function.

Since nlm minimizes rather than maximizes we need to write an f function that calculates the minus the log likelihood rather than the log likelihood (stand on your head and maximization becomes minimization).

Good starting values are hard to find, in general. In our particular problem, maximum likelihood for the shape parameter of the gamma distribution, a good estimate of the shape parameter α is the sample mean, which is the method of moments estimator of α when λ = 1.0 is known.

This gives us the following first attempt at maximum likelihood for our example.

The MLE as estimated by the computer is the estimate component of the returned object out, which is 1.668806.

Using More Options

A few more options of nlm can be helpful.

hessian returns the second derivative (an approximation calculated by finite differences) of the objective function. This will be a k × k matrix if the dimension of the parameter space is k.
fscale helps nlm know when it has converged to the solution. It should give roughly the size of the objective function at the solution. Often fscale = length(x) is about right.
print.level tells nlm to blather about what is is doing. print.level = 2 gives maximum verbosity.

We can use the hessian, which is part of the list returned by the nlm function to tell whether the solution is a local minimum. More importantly, we can use it as the plug-in estimate of observed Fisher information.

Fisher Information and Confidence Intervals

The following code calculates an asymptotic conf.level confidence interval for the unknown scale parameter in the example we have been doing

The expression for Fisher information comes from slide 57, deck 3.

Several Parameters

A Two-Parameter Gamma Example

For our first example of two-parameter maximum likelihood estimation, we use the two-parameter gamma distribution and the same data as above.

Coding the Likelihood Function

Coding the log likelihood (really minus the log likelihood is what we need to hand to nlm) is much the same as coding the uniparameter case. The main difference is that the argument to the function must be a vector of parameters.

We usually want to take this vector apart into its scalar components if, as in the gamma example, the different parameters do different things.

With that in mind our mlogl function looks something like this

mlogl <- function(theta, x) {
    if (length(theta) != 2) stop("theta must be vector of length 2")
    alpha <- theta[1]
    lambda <- theta[2]
    if (alpha <= 0) stop("theta[1] must be positive")
    if (lambda <= 0) stop("theta[2] must be positive")
    return(- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE)))
}

A Starting Point for Optimization

We need good starting points for our optimization algorithm, and the simplest estimators for the two-parameter gamma distribution are the method of moments estimators (slides 66–67, deck 2).

The R statements for these estimators are

alpha.start <- mean(x)^2 / var(x)
lambda.start <- mean(x) / var(x)

The method of moments isn't always applicable, and it doesn't necessarily produce good estimators. Maximum likelihood estimators are asymptotically efficient. Method of moment estimators generally aren't. But they do provide good enough starting points for maximum likelihood.

Maximum Likelihood

So let's try it.

R statements mlogl <- function(theta, x) { if (length(theta) != 2) stop("theta must be vector of length 2") alpha <- theta[1] lambda <- theta[2] if (alpha <= 0) stop("theta[1] must be positive") if (lambda <= 0) stop("theta[2] must be positive") return(- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE))) } alpha.start <- mean(x)^2 / var(x) lambda.start <- mean(x) / var(x) theta.start <- c(alpha.start, lambda.start) out <- nlm(mlogl, theta.start, x = x, hessian = TRUE, fscale = length(x)) print(out) print(theta.start) eigen(out$hessian, symmetric = TRUE, only.values = TRUE) # theoretical Fisher information n <- length(x) alpha.hat <- out$estimate[1] lambda.hat <- out$estimate[2] fish <- n * matrix(c(trigamma(alpha.hat), - 1 / lambda.hat, - 1 / lambda.hat, alpha.hat / lambda.hat^2), nrow = 2) fish

Dataset URL

The important bits here are the MLE

$estimate 
[1] 1.680531 1.009529

the observed Fisher information matrix

$hessian 
          [,1]      [,2] 
[1,]  24.15181 -29.71534 
[2,] -29.71534  49.45877

and, for comparison with the MLE, the method of moments estimators

Rweb:> print(theta.start) 
[1] 1.5398404 0.9250142

The method of moments estimators seem fairly close to the MLEs, but we can't really tell whether they are close in the statistical sense until we calculate confidence intervals.

The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.

Also we compare the Fisher information matrix derived by theory (slide 96, deck 3) with that computed by finite differences by the function nlm, that is, fish and out$hessian. They are nearly the same.

Confidence Intervals

The R solve function (on-line help) solves linear equations and also inverts matrices.

The important output is right at the end. The confidence intervals

[1] 0.899571 2.461491
[1] 0.4637939 1.5552641

for the shape parameter and the rate parameter (in that order).

An important comparison is with the confidence interval for the shape parameter we got when only estimating the shape (assuming we knew the scale parameter was 1.0), which was

[1] 1.271798 2.065813

Note that this interval is much narrower: (1.27, 2.07) when the shape parameter is known versus (0.90, 2.46) when the shape parameter is unknown and must also be estimated.

This is an important lesson.

Confidence intervals for parameters of interest are wider when nuisance parameters are estimated.

An exception is when the parameter estimates are asymptotically independent, but this rarely happens.

A Five-Parameter Normal Mixture Example

For our second example of multi-parameter maximum likelihood estimation, we use the five-parameter, two-component normal mixture distribution. The five parameters are mean and variance for the first component, mean and variance for the second component, and the mixture probability p.

Global maximizers of the likelihood function do not exist, but this is no problem, good local maximizers do exist and have all the desirable properties of maximum likelihood estimates.

The data x₁, x₂, … x_n are assumed to be independent and identically distributed from the distribution with density

f(x | μ₁, σ²₁, μ₂, σ²₂, p) = p φ(x | μ₁, σ²₁) + (1 − p) φ(x | μ₂, σ²₂)

where φ(x | μ, σ²) denotes the normal density with mean μ and variance σ².

Some made-up data actually having a mixture of two normal distributions (they are computer simulated) is shown below.

Coding the Likelihood Function

Coding the log likelihood (really minus the log likelihood is what we need to hand to nlm) is much the same as for the one-parameter example and two-parameter example above. Of course, now the density is completely different, and there are five parameters, so the argument to the function must be a vector of length five.

With that in mind our mlogl function looks something like this

mlogl <- function(theta) {
    stopifnot(is.numeric(theta))
    stopifnot(length(theta) == 5)
    mu1 <- theta[1]
    mu2 <- theta[2]
    v1 <- theta[3]
    v2 <- theta[4]
    p <- theta[5]
    logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) +
        (1 - p) * dnorm(x, mu2, sqrt(v2))))
    return(- logl)
}

A Starting Point for Optimization

We need good starting points for our optimization algorithm, now there are no nice simple estimators. Nor are there even any messy but well known estimators. Maximum likelihood is the only well-known method that is not computer intensive.

Hence we make do with a very crude method. We take p = 1⁄2 as the starting value. Then we divide the data into upper and lower halves and take the sample mean and variance of each as the starting values for the mean and variance of one component.

We do not claim anything for this method other than that it is a thing to do (TTD). We shall see whether it works.

Maximum Likelihood

So let's try it.

R statements mlogl <- function(theta) { stopifnot(is.numeric(theta)) stopifnot(length(theta) == 5) mu1 <- theta[1] mu2 <- theta[2] v1 <- theta[3] v2 <- theta[4] p <- theta[5] logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) + (1 - p) * dnorm(x, mu2, sqrt(v2)))) return(- logl) } n <- length(x) p.start <- 1 / 2 mu1.start <- mean(sort(x)[seq(along = x) <= n / 2]) mu2.start <- mean(sort(x)[seq(along = x) > n / 2]) v1.start <- var(sort(x)[seq(along = x) <= n / 2]) v2.start <- var(sort(x)[seq(along = x) > n / 2]) theta.start <- c(mu1.start, mu2.start, v1.start, v2.start, p.start) out <- nlm(mlogl, theta.start, print.level = 2, fscale = length(x)) out <- nlm(mlogl, out$estimate, print.level = 2, fscale = length(x), hessian = TRUE) print(out) print(theta.start) mu1.hat <- out$estimate[1] mu2.hat <- out$estimate[2] sigma1.hat <- sqrt(out$estimate[3]) sigma2.hat <- sqrt(out$estimate[4]) p.hat <- out$estimate[5] fred <- function(x) p.hat * dnorm(x, mu1.hat, sigma1.hat) + (1 - p.hat) * dnorm(x, mu2.hat, sigma2.hat) hist(x, freq = FALSE) curve(fred, add = TRUE) hist(x, freq = FALSE, ylim = range(0, fred(mu1.hat), fred(mu2.hat))) curve(fred, add = TRUE) curve(p.hat * dnorm(x, mu1.hat, sigma1.hat), add = TRUE, col = "red") curve((1 - p.hat) * dnorm(x, mu2.hat, sigma2.hat), add = TRUE, col = "red") eigen(out$hessian, symmetric = TRUE)

Dataset URL

The reason why we run nlm twice is to check that it has arrived at a solution even though it gave warnings on the first run.

The important bits here are the MLE

$estimate 
1] 0.005267126 3.010454061 1.006854276 1.038172545 0.676241202

the observed Fisher information matrix

hessian 
            [,1]       [,2]       [,3]      [,4]       [,5] 
[1,]  158.807404 -24.139843 -32.853702  5.586309 -103.42268 
[2,]  -24.139843  63.447291  -9.839708 18.661232  -84.43795 
[3,]  -32.853702  -9.839708  68.430927 -2.357617  -65.80344 
[4,]    5.586309  18.661232  -2.357617 28.203454   38.00752 
[5,] -103.422678 -84.437955 -65.803437 38.007516 1079.07185

and, for comparison with the MLE, our starting point was

Rweb:> print(theta.start) 
[1] -0.4345200  2.3909647  0.5340638  1.4761858  0.5000000

The plots show histograms of the data with the MLE mixture-of-normals p. d. f. overlaid for comparison. The red curves in the second plot are the two normal distributions of which the mixture is formed.

The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.

Confidence Intervals

R statements conf.level <- 0.95 mlogl <- function(theta) { stopifnot(is.numeric(theta)) stopifnot(length(theta) == 5) mu1 <- theta[1] mu2 <- theta[2] v1 <- theta[3] v2 <- theta[4] p <- theta[5] logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) + (1 - p) * dnorm(x, mu2, sqrt(v2)))) return(- logl) } n <- length(x) p.start <- 1 / 2 mu1.start <- mean(sort(x)[seq(along = x) <= n / 2]) mu2.start <- mean(sort(x)[seq(along = x) > n / 2]) v1.start <- var(sort(x)[seq(along = x) <= n / 2]) v2.start <- var(sort(x)[seq(along = x) > n / 2]) theta.start <- c(mu1.start, mu2.start, v1.start, v2.start, p.start) out <- nlm(mlogl, theta.start, fscale = length(x)) out <- nlm(mlogl, out$estimate, fscale = length(x), hessian = TRUE) crit.val <- qnorm((1 + conf.level) / 2) inv.fish.info <- solve(out$hessian) for (i in 1:length(out$estimate)) print(out$estimate[i] + c(-1, 1) * crit.val * sqrt(inv.fish.info[i, i]))

Dataset URL

The important output is right at the end. The confidence intervals

[1] -0.2571063  0.2676405 
[1] 2.527202 3.493707 
[1] 0.6547243 1.3589842 
[1] 0.4432608 1.6330843 
[1] 0.5663284 0.7861540

for the two means, the two variances, and the mixing proportion (in that order).

Note that the confidence intervals are fairly wide, even though the sample size was fairly large according to intuition developed for simpler problems. Nevertheless maximum likelihood does work.

Statistics 5102 (Geyer, Fall 2016) Examples: Maximum Likelihood Estimation

One Parameter

Gamma Distribution, Unknown Shape

Coding the Likelihood Function

Plotting the Likelihood Function

Maximizing the Likelihood Function

Using More Options

Fisher Information and Confidence Intervals

Several Parameters

A Two-Parameter Gamma Example

Coding the Likelihood Function

A Starting Point for Optimization

Maximum Likelihood

Confidence Intervals

A Five-Parameter Normal Mixture Example

Coding the Likelihood Function

A Starting Point for Optimization

Maximum Likelihood

Confidence Intervals

Navigation

Contents