## One Parameter

### Gamma Distribution, Unknown Shape

In order to do maximum likelihood estimation (MLE) using the computer we need to write the likelihood function or log likelihood function (usually the latter) as a function in the computer language we are using.

In this course we are using R and Rweb. So we need to know how to write the log likelihood as an R function.

For an example we will use the gamma distribution with unknown shape parameter α and known rate parameter λ = 1.0.

Some made-up data actually having a gamma distribution (they are computer
simulated gamma random variables) is shown below. The `print(x)`

statement prints the whole data vector (30 numbers) and the
`hist(x)`

statement makes a histogram.

### Coding the Likelihood Function

The R function `dgamma`

(on-line
help) calculates the density of the gamma
distribution. As with most R functions the calculation is vectorized, so

dgamma(x, shape = alpha)

calculates a vector of 30 numbers, the values
`f`_{α}(`x`_{i})
for each of the
30 data points `x`_{i}

Another optional argument tells the function to return log probability density instead of probability density

dgamma(x, shape = alpha, log = TRUE)

Finally the `sum`

function
(on-line
help) calculates the sum of these logs. Hence the code

logl <- function(alpha, x) return(sum(dgamma(x, shape = alpha, log = TRUE)))

defines an R function that calculates the log likelihood function
using the R function named `function`

(on-line
help) creates R functions (see also the
section on functions in the
introduction to R and Rweb page).

This function can be slightly improved by inserting a check that
`alpha`

is a single variable (a vector of length 1 to R)
rather than a vector, which doesn't make sense,

logl <- function(alpha, x) { if (length(alpha) > 1) stop("alpha must be scalar") if (alpha <= 0) stop("alpha must be positive") return(sum(dgamma(x, shape = alpha, log = TRUE))) }

Of course, when there is no mistake (no bugs

in the code)
both functions do exactly the same thing. But when there *is*
a mistake, the function with the error check will make it clear what
the problem is.

### Plotting the Likelihood Function

If you want to look at the log likelihood, the following R statements seem to do the job.

### Maximizing the Likelihood Function

R has several functions that optimize functions. The one we will explain
here is the `nlm`

function
(on-line
help). Another optimizer `optim`

will be briefly demonstrated
in the last section of this page.

The `nlm`

function has a huge number of arguments,
most of which can be ignored.
The only arguments that must be supplied are two

`f`

the function to be minimized.`p`

a starting value for the variable. The dimension of`p`

tells`nlm`

the dimension of the space in which`f`

takes values. So`p`

must be supplied to specify the dimension, if for no other reason.But it helps if we can specify a starting value reasonably close to the solution.

We won't at this point discuss any of the optional arguments described
on the on-line help
except for the somewhat mysterious `...`

argument, which is describe as
additional arguments to

. What this means is that
if we supply an argument whose name is the name of one of the arguments
to the function supplied as the argument `f`

`f`

and whose name
is not one of the named arguments to `nlm`

, then this argument
will be forwarded

to the function supplied as
the argument `f`

.

In particular, if we write our log likelihood function to have an additional
argument `x`

which is the data (as we did above), then since
`nlm`

doesn't have an argument named `x`

, this will
be passed to our log likelihood function.

Since `nlm`

minimizes rather than maximizes we need to write
an `f`

function that calculates the *minus* the
log likelihood rather than the log likelihood (stand on your head and
maximization becomes minimization).

Good starting values are hard to find, in general. In our particular problem, maximum likelihood for the shape parameter of the gamma distribution, a good estimate of the shape parameter α is the sample mean, which is the method of moments estimator of α when λ = 1.0 is known.

This gives us the following first attempt at maximum likelihood for our example.

The MLE as estimated by the computer is the `estimate`

component
of the returned object `out`

, which is 1.668806.

#### Using More Options

A few more options of `nlm`

can be helpful.

`hessian`

returns the second derivative (an approximation calculated by finite differences) of the objective function. This will be a`k`×`k`matrix if the dimension of the parameter space is`k`.`fscale`

helps`nlm`

know when it has converged to the solution. It should give roughly the size of the objective function at the solution. Often`fscale = length(x)`

is about right.`print.level`

tells`nlm`

to blather about what is is doing.`print.level = 2`

gives maximum verbosity.

We can use the `hessian`

, which is part of the list returned
by the `nlm`

function to tell whether the solution is a local
minimum. More importantly, we can use it as the plug-in estimate of observed
Fisher information.

### Fisher Information and Confidence Intervals

The following code calculates an asymptotic `conf.level`

confidence interval for the unknown scale parameter in the example
we have been doing

The expression for Fisher information comes from slide 57, deck 3.

## Several Parameters

### A Two-Parameter Gamma Example

For our first example of two-parameter maximum likelihood estimation, we use the two-parameter gamma distribution and the same data as above.

### Coding the Likelihood Function

Coding the log likelihood (really *minus* the log likelihood is
what we need to hand to `nlm`

) is much the same as
coding the uniparameter case. The main difference
is that the argument to the function must be a vector of parameters.

We usually want to take this vector apart into its scalar components if, as in the gamma example, the different parameters do different things.

With that in mind our `mlogl`

function looks something like this

mlogl <- function(theta, x) { if (length(theta) != 2) stop("theta must be vector of length 2") alpha <- theta[1] lambda <- theta[2] if (alpha <= 0) stop("theta[1] must be positive") if (lambda <= 0) stop("theta[2] must be positive") return(- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE))) }

### A Starting Point for Optimization

We need good starting points for our optimization algorithm, and the simplest estimators for the two-parameter gamma distribution are the method of moments estimators (slides 66–67, deck 2).

The R statements for these estimators are

alpha.start <- mean(x)^2 / var(x) lambda.start <- mean(x) / var(x)

The method of moments isn't always applicable, and it doesn't necessarily produce good estimators. Maximum likelihood estimators are asymptotically efficient. Method of moment estimators generally aren't. But they do provide good enough starting points for maximum likelihood.

### Maximum Likelihood

So let's try it.

The important bits here are the MLE

$estimate [1] 1.680531 1.009529

the observed Fisher information matrix

$hessian [,1] [,2] [1,] 24.15181 -29.71534 [2,] -29.71534 49.45877

and, for comparison with the MLE, the method of moments estimators

Rweb:> print(theta.start) [1] 1.5398404 0.9250142

The method of moments estimators seem fairly close to the MLEs, but we can't really tell whether they are close in the statistical sense until we calculate confidence intervals.

The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.

Also we compare the Fisher information matrix derived by theory
(slide 96, deck 3) with that computed by finite differences
by the function `nlm`

, that is, `fish`

and `out$hessian`

. They are nearly the same.

### Confidence Intervals

The R `solve`

function
(on-line
help) solves linear equations and also inverts matrices.

The important output is right at the end. The confidence intervals

[1] 0.899571 2.461491 [1] 0.4637939 1.5552641

for the shape parameter and the rate parameter (in that order).

An important comparison is with the confidence interval for the shape parameter we got when only estimating the shape (assuming we knew the scale parameter was 1.0), which was

[1] 1.271798 2.065813

Note that this interval is much narrower: (1.27, 2.07) when the shape parameter is known versus (0.90, 2.46) when the shape parameter is unknown and must also be estimated.

This is an important lesson.

An exception is when the parameter estimates are asymptotically independent, but this rarely happens.

Confidence intervals for parameters of interest are wider when nuisance parameters are estimated.

## A Five-Parameter Normal Mixture Example

For our second example of multi-parameter maximum likelihood estimation,
we use the five-parameter, two-component normal mixture distribution.
The five parameters are mean and variance for the first component,
mean and variance for the second component, and the mixture probability
`p`.

Global maximizers of the likelihood function do not exist, but this is no problem, good local maximizers do exist and have all the desirable properties of maximum likelihood estimates.

The data
`x`_{1},
`x`_{2}, …
`x`_{n}
are assumed to be independent and identically distributed from
the distribution with density

`f`(

`x`| μ

_{1}, σ

^{2}

_{1}, μ

_{2}, σ

^{2}

_{2},

`p`) =

`p`φ(x | μ

_{1}, σ

^{2}

_{1}) + (1 −

`p`) φ(x | μ

_{2}, σ

^{2}

_{2})

where φ(x | μ, σ^{2}) denotes the normal density
with mean μ and variance σ^{2}.

Some made-up data actually having a mixture of two normal distributions (they are computer simulated) is shown below.

### Coding the Likelihood Function

Coding the log likelihood (really *minus* the log likelihood is
what we need to hand to `nlm`

) is much the same as
for the one-parameter example
and two-parameter example above.
Of course, now the density is completely different, and there are five
parameters, so the argument to the function must be a vector of length five.

With that in mind our `mlogl`

function looks something like this

mlogl <- function(theta) { stopifnot(is.numeric(theta)) stopifnot(length(theta) == 5) mu1 <- theta[1] mu2 <- theta[2] v1 <- theta[3] v2 <- theta[4] p <- theta[5] logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) + (1 - p) * dnorm(x, mu2, sqrt(v2)))) return(- logl) }

### A Starting Point for Optimization

We need good starting points for our optimization algorithm, now there are no nice simple estimators. Nor are there even any messy but well known estimators. Maximum likelihood is the only well-known method that is not computer intensive.

Hence we make do with a very crude method. We take `p` = 1⁄2
as the starting value. Then we divide the data into upper and lower halves
and take the sample mean and variance of each as the starting values for
the mean and variance of one component.

We do not claim anything for this method other than that it is a thing to do (TTD). We shall see whether it works.

### Maximum Likelihood

So let's try it.

The reason why we run `nlm`

twice is to check that it has
arrived at a solution even though it gave warnings on the first run.

The important bits here are the MLE

$estimate 1] 0.005267126 3.010454061 1.006854276 1.038172545 0.676241202

the observed Fisher information matrix

hessian [,1] [,2] [,3] [,4] [,5] [1,] 158.807404 -24.139843 -32.853702 5.586309 -103.42268 [2,] -24.139843 63.447291 -9.839708 18.661232 -84.43795 [3,] -32.853702 -9.839708 68.430927 -2.357617 -65.80344 [4,] 5.586309 18.661232 -2.357617 28.203454 38.00752 [5,] -103.422678 -84.437955 -65.803437 38.007516 1079.07185

and, for comparison with the MLE, our starting point was

Rweb:> print(theta.start) [1] -0.4345200 2.3909647 0.5340638 1.4761858 0.5000000

The plots show histograms of the data with the MLE mixture-of-normals p. d. f. overlaid for comparison. The red curves in the second plot are the two normal distributions of which the mixture is formed.

The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.

### Confidence Intervals

The important output is right at the end. The confidence intervals

[1] -0.2571063 0.2676405 [1] 2.527202 3.493707 [1] 0.6547243 1.3589842 [1] 0.4432608 1.6330843 [1] 0.5663284 0.7861540

for the two means, the two variances, and the mixing proportion (in that order).

Note that the confidence intervals are fairly wide, even though the sample size was fairly large according to intuition developed for simpler problems. Nevertheless maximum likelihood does work.