One Parameter
A Gamma Example
In order to do maximum likelihood estimation (MLE) using the computer we need to write the likelihood function or log likelihood function (usually the latter) as a function in the computer language we are using.
In this course we are using R and Rweb. So we need to know how to write the log likelihood as an R function.
For an example we will use the gamma distribution with known scale parameter, which we assume to be 1.0. This is Example 6.6.2 in DeGroot and Schervish.
Some made-up data actually having a gamma distribution (they are computer
simulated gamma random variables) is shown below. The print(x)
statement prints the whole data vector (30 numbers) and the
hist(x)
statement makes a histogram.
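The data set itself is not reproduced in this copy of the page. A minimal sketch that generates comparable data is shown below; the seed and the true shape value are assumptions, not the values behind the original data.
set.seed(42)                      # assumed seed, not the original
x <- rgamma(30, shape = 1.5)      # 30 simulated gamma variables, rate = 1 by default
print(x)                          # print the whole data vector
hist(x)                           # histogram of the data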
Coding the Likelihood Function
The R function dgamma
(on-line help)
calculates the density of the gamma
distribution. As with most R functions the calculation is vectorized, so
dgamma(x, shape = alpha)
calculates a vector of 30 numbers, the values fα(xi) for each of the 30 data points xi.
Another optional argument
(found on the
on-line
help for the dgamma
function) tells the function
to return log probability density instead of probability density
dgamma(x, shape = alpha, log = TRUE)
Finally the sum
function
(on-line
help) calculates the sum of these logs. Hence the code
logl <- function(alpha, x) return(sum(dgamma(x, shape = alpha, log = TRUE)))
This function can be slightly improved by inserting checks that
alpha
is a single number (a vector of length 1 to R)
rather than a longer vector, which doesn't make sense, and that it is positive:
logl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(sum(dgamma(x, shape = alpha, log = TRUE)))
}
Of course, when there is no mistake (no bugs
in the code)
both functions do exactly the same thing. But when there is
a mistake, the function with the error check will make it clear what
the problem is.
Plotting the Likelihood Function
If you want to look at the log likelihood, the following R statements seem to do the job.
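Those statements are not reproduced in this copy of the page. A minimal sketch, in which the grid of alpha values is an assumption, is
alpha.grid <- seq(0.5, 3.5, length.out = 101)     # assumed range of alpha values
logl.grid <- sapply(alpha.grid, logl, x = x)      # log likelihood at each grid point
plot(alpha.grid, logl.grid, type = "l",
    xlab = "alpha", ylab = "log likelihood")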
Maximizing the Likelihood Function
R has several functions that optimize functions. The one we will explain
here is the nlm
function
(on-line
help). Another optimizer, optim, will be briefly demonstrated
in the last section of this page.
The nlm
function has a huge number of arguments,
most of which can be ignored.
The only arguments that must be supplied are two:
- f: the function to be minimized.
- p: a starting value for the variable. The dimension of p tells nlm the dimension of the space over which f is minimized, so p must be supplied to specify the dimension, if for no other reason. But it helps if we can specify a starting value reasonably close to the solution.
We won't at this point discuss any of the optional arguments described
on the on-line help except for the somewhat mysterious ... argument.
The on-line help for the nlm function describes this as "additional arguments to f".
What this means is that if we supply an argument whose name is the name of one
of the arguments to the function supplied as the argument f, and whose name is
not one of the named arguments to nlm, then this argument will be forwarded
to the function supplied as the argument f.
In particular, if we write our log likelihood function to have an additional argument
x, which is the data (as we did above), then since nlm doesn't have an argument named
x, this will be passed to our log likelihood function.
Since nlm
minimizes rather than maximizes we need to write
an f
function that calculates minus the
log likelihood rather than the log likelihood (stand on your head and
maximization becomes minimization).
Good starting values are hard to find, in general. In our particular problem,
maximum likelihood for the shape parameter of the gamma distribution,
a good estimate of the shape parameter α is the sample mean,
since the theoretical mean of the gamma distribution is α / β
where β is the rate
parameter, here assumed to be known to
be β = 1.0. Hence the population mean is α and the sample
mean is a reasonable estimate (by the law of large numbers).
This gives us the following first attempt at maximum likelihood for our example.
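The Rweb form containing the code is not reproduced here. A minimal sketch of the same steps, using the names from the discussion above, is
mlogl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(- sum(dgamma(x, shape = alpha, log = TRUE)))   # minus the log likelihood
}
alpha.start <- mean(x)                # sample mean as starting value
out <- nlm(mlogl, alpha.start, x = x)
print(out$estimate)                   # the MLE of alpha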
Using More Options
A few more options of nlm
can be helpful.
- hessian: asks nlm to also return the second derivative matrix (an approximation calculated by finite differences) of the objective function at the solution. This will be a k × k matrix if the dimension of the argument p (which must match the dimension of the first argument to the function f) is k.
- fscale: helps nlm know when it has converged to the solution. It should give roughly the size of the objective function at the solution. Often fscale = length(x) is about right.
- print.level: tells nlm to blather about what it is doing. print.level = 2 gives maximum verbosity.
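A sketch of such a call for the one-parameter example follows; the particular option values here are plausible choices, not necessarily the ones used on the original page.
out <- nlm(mlogl, alpha.start, x = x,
    hessian = TRUE, fscale = length(x), print.level = 2)
print(out$estimate)    # the MLE
print(out$hessian)     # observed Fisher information (a 1 x 1 matrix here)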
We can use the hessian, which is part of the list returned
by the nlm function, to tell whether the solution is a local
minimum. More importantly, we can use it to learn about the statistical
properties of the MLE (Section 7.8 in DeGroot and Schervish).
Fisher Information and Confidence Intervals
The hessian
returned by the nlm
function is
the second derivative of minus the log likelihood, that is, what is
referred to in the handout
on Fisher Information and Confidence Intervals
as
observed Fisher information. More precisely, it is a finite
difference approximation to Fisher information, that is, equation (1.2)
in the handout with the MLE plugged in for the parameter.
Thus the following code calculates an asymptotic conf.level
confidence interval for the unknown shape parameter in the example
we have been doing.
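The code itself is not reproduced in this copy of the page. A minimal sketch, with conf.level = 0.95 as an assumed confidence level, is
conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)     # standard normal critical value
fish <- as.vector(out$hessian)          # observed Fisher information
se <- sqrt(1 / fish)                    # asymptotic standard error of the MLE
out$estimate + c(-1, 1) * crit * se     # the confidence interval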
Several Parameters
A Two-Parameter Gamma Example
For our first example of two-parameter maximum likelihood estimation, we use the two-parameter gamma distribution and the same data as above.
Coding the Likelihood Function
Coding the log likelihood (really minus the log likelihood is
what we need to hand to nlm
) is much the same as
coding the uniparameter case. The main difference
is that the argument to the function must be a vector of parameters.
We usually want to take this vector apart into its scalar components if, as in the gamma example, the different parameters do different things.
With that in mind our mlogl
function looks something like this
mlogl <- function(theta, x) {
    if (length(theta) != 2) stop("theta must be vector of length 2")
    alpha <- theta[1]
    lambda <- theta[2]
    if (alpha <= 0) stop("theta[1] must be positive")
    if (lambda <= 0) stop("theta[2] must be positive")
    return(- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE)))
}
Note: We've changed the name of the rate parameter
from β (which DeGroot and Schervish use) to λ (which many
other theory books use) because it conflicts with an R function name
(the R function beta
calculates the normalizing constant
of the beta distribution, which is called the beta function).
A Starting Point for Optimization
We need good starting points for our optimization algorithm, and the simplest estimators for the two-parameter gamma distribution are so-called method of moments estimators.
This is a traditional theory topic that students who have had a course like this are expected to know (on preliminary exams, actuarial exams, and so forth), so it won't hurt to cover it, even though DeGroot and Schervish ignore it.
The first two moments of the Gamma(α, β) distribution are
μ1 = E(X) = α / β
and
μ2 = E(X²) = α (α + 1) / β²
The so-called method of moments says the following
- Solve your moment equations for the parameters (if you can).
- Replace theoretical moments by sample moments to estimate parameters.
For the gamma distribution, solving for the parameters gives
α = μ1² / (μ2 − μ1²)
and
β = μ1 / (μ2 − μ1²)
where μ2 − μ1² is the variance, so α is the mean squared divided by the variance and β is the mean divided by the variance.
For the plug-in I am handicapped by various sucky web browsers
(especially Microsoft's) not implementing web standards for mathematics
on the web. I can't make an x-bar for the sample mean.
The R statements for these estimators are
alpha.start <- mean(x)^2 / var(x)
lambda.start <- mean(x) / var(x)
The method of moments isn't always applicable (the solve for parameters
step may be impossible), and it doesn't necessarily produce good estimators.
Maximum likelihood estimators are asymptotically efficient.
Method of moment estimators generally aren't.
But they do provide good enough starting points for maximum likelihood.
Maximum Likelihood
So let's try it.
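The Rweb run is not reproduced here. A minimal sketch of the fit, with variable names following the text, is
theta.start <- c(alpha.start, lambda.start)
out <- nlm(mlogl, theta.start, x = x, hessian = TRUE, fscale = length(x))
print(out$estimate)    # the MLE of (alpha, lambda)
print(out$hessian)     # observed Fisher information matrix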
The important bits here are the MLE
$estimate
[1] 1.680531 1.009529
the observed Fisher information matrix
$hessian
          [,1]      [,2]
[1,]  24.15181 -29.71534
[2,] -29.71534  49.45877
and, for comparison with the MLE, the method of moments estimators
Rweb:> print(theta.start)
[1] 1.5398404 0.9250142
The method of moments estimators seem fairly close to the MLEs, but we can't really tell whether they are close in the statistical sense until we calculate confidence intervals.
The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.
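A quick way to check this, assuming out is the list returned by nlm, is
eigen(out$hessian, symmetric = TRUE, only.values = TRUE)$values
# all positive means minus the log likelihood has a local minimum at the MLE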
Confidence Intervals
The R solve
function
(on-line
help) solves linear equations and also inverts matrices.
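A sketch of the interval calculation follows; conf.level = 0.95 is an assumption, and the inverse of the observed Fisher information matrix gives the asymptotic variance matrix of the MLE.
conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
inv.fish <- solve(out$hessian)       # inverse observed Fisher information
se <- sqrt(diag(inv.fish))           # asymptotic standard errors
for (i in 1:2)
    print(out$estimate[i] + c(-1, 1) * crit * se[i])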
The important output is right at the end. The confidence intervals
[1] 0.899571 2.461491
[1] 0.4637939 1.5552641
for the shape parameter and the rate parameter (in that order).
An important comparison is with the confidence interval for the shape parameter we got when only estimating the shape (assuming we knew the scale parameter was 1.0), which was
[1] 1.271798 2.065813
Note that this interval is much narrower: (1.27, 2.07) when the rate parameter is known versus (0.90, 2.46) when the rate parameter is unknown and must also be estimated.
This is an important lesson.
Confidence intervals for parameters of interest are wider when nuisance parameters are estimated.
A Five-Parameter Normal Mixture Example
For our second example of multi-parameter maximum likelihood estimation, we use the five-parameter, two-component normal mixture distribution. This is a generalization of Example 6.5.8 in DeGroot and Schervish in which we do not assume the two components of the mixture have equal probability, but rather an arbitrary probability p, and we also assume that the means and variances of both components are unknown. That makes five parameters: mean and variance for the first component, mean and variance for the second component, and the mixture probability p.
The same argument given in Example 6.5.8 in DeGroot and Schervish applies to this generalization and shows that global maximizers of the likelihood function do not exist. However, this is no problem; good local maximizers do exist and have all the desirable properties of maximum likelihood estimates.
The data x1, x2, …, xn are assumed to be independent and identically distributed from the distribution with density
f(x) = p φ(x | μ1, σ1²) + (1 − p) φ(x | μ2, σ2²)
where φ(x | μ, σ²) denotes the normal density with mean μ and variance σ².
Some made-up data actually having a mixture of two normal distributions (they are computer simulated) is shown below.
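The data set is not reproduced in this copy of the page. A sketch that simulates comparable data follows; the seed, sample size, and true parameter values are assumptions.
set.seed(42)                                          # assumed seed
n <- 200                                              # assumed sample size
z <- rbinom(n, 1, 0.7)                                # component indicators
x <- ifelse(z == 1, rnorm(n, 0, 1), rnorm(n, 3, 1))   # two-component normal mixture
hist(x)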
Coding the Likelihood Function
Coding the log likelihood (really minus the log likelihood is
what we need to hand to nlm
) is much the same as
for the one-parameter example
and two-parameter example above.
Of course, now the density is completely different, and there are five
parameters, so the argument to the function must be a vector of length five.
With that in mind our mlogl
function looks something like this
mlogl <- function(theta) {
    stopifnot(is.numeric(theta))
    stopifnot(length(theta) == 5)
    mu1 <- theta[1]
    mu2 <- theta[2]
    v1 <- theta[3]
    v2 <- theta[4]
    p <- theta[5]
    logl <- sum(log(p * dnorm(x, mu1, sqrt(v1)) +
        (1 - p) * dnorm(x, mu2, sqrt(v2))))
    return(- logl)
}
A Starting Point for Optimization
We need good starting points for our optimization algorithm, but now there are no nice simple estimators. Nor are there even any messy but well-known estimators. Maximum likelihood is the only well-known method, and it is computer intensive.
Hence we make do with a very crude method. We take p = 1⁄2 as the starting value. Then we divide the data into upper and lower halves and take the sample mean and variance of each as the starting values for the mean and variance of one component.
We do not claim anything for this method other than that it is a thing to do (TTD). We shall see whether it works.
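A sketch of these starting values follows; the variable names are assumptions.
lower <- x[x <= median(x)]            # lower half of the data
upper <- x[x > median(x)]             # upper half of the data
theta.start <- c(mean(lower), mean(upper), var(lower), var(upper), 1 / 2)
print(theta.start)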
Maximum Likelihood
So let's try it.
The reason why we run nlm
twice is to check that it has
arrived at a solution even though it gave warnings on the first run.
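A sketch of the two runs follows; the second run restarts at the first run's solution, and the option values are assumptions.
out <- nlm(mlogl, theta.start, fscale = length(x))
out <- nlm(mlogl, out$estimate, fscale = length(x), hessian = TRUE)
print(out$estimate)
print(out$hessian)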
The important bits here are the MLE
$estimate
[1] 0.005267126 3.010454061 1.006854276 1.038172545 0.676241202
the observed Fisher information matrix
$hessian
            [,1]       [,2]       [,3]      [,4]       [,5]
[1,]  158.807404 -24.139843 -32.853702  5.586309 -103.42268
[2,]  -24.139843  63.447291  -9.839708 18.661232  -84.43795
[3,]  -32.853702  -9.839708  68.430927 -2.357617  -65.80344
[4,]    5.586309  18.661232  -2.357617 28.203454   38.00752
[5,] -103.422678 -84.437955 -65.803437 38.007516 1079.07185
and, for comparison with the MLE, our starting point was
Rweb:> print(theta.start)
[1] -0.4345200 2.3909647 0.5340638 1.4761858 0.5000000
The plots show histograms of the data with the MLE mixture-of-normals p. d. f. overlaid for comparison.
The fact that all the eigenvalues of the Hessian of minus the log likelihood (observed Fisher information) are positive indicates that our MLE is a local maximum of the log likelihood.
Confidence Intervals
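As in the two-parameter example, a sketch of the calculation (with conf.level = 0.95 as an assumed confidence level) is
conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
se <- sqrt(diag(solve(out$hessian)))   # asymptotic standard errors
for (i in 1:5)
    print(out$estimate[i] + c(-1, 1) * crit * se[i])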
The important output is right at the end. The confidence intervals
[1] -0.2571063 0.2676405
[1] 2.527202 3.493707
[1] 0.6547243 1.3589842
[1] 0.4432608 1.6330843
[1] 0.5663284 0.7861540
for the two means, the two variances, and the mixing proportion (in that order).
Note that the confidence intervals are fairly wide, even though the sample size was fairly large according to intuition developed for simpler problems. Nevertheless maximum likelihood does work.
Another Optimizer
It is annoying that the nlm
function gives warnings.
The problem is that it does not know that variances must be nonnegative
and probabilities must be between zero and one. Another optimizer,
the optim function
(on-line
help), does allow bounds specification for one of its methods.
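A sketch using optim with its "L-BFGS-B" method, which accepts box constraints, follows; the particular bound values are assumptions.
out <- optim(theta.start, mlogl, method = "L-BFGS-B",
    lower = c(-Inf, -Inf, 1e-6, 1e-6, 1e-6),
    upper = c(Inf, Inf, Inf, Inf, 1 - 1e-6),
    hessian = TRUE)
print(out$par)        # the MLE
print(out$hessian)    # observed Fisher information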