In order to do maximum likelihood estimation (MLE) using the computer we need to write the likelihood function or log likelihood function (usually the latter) as a function in the computer language we are using.
In this course we are using R and Rweb. So we need to know how to write the log likelihood as an R function.
For an example we will use the gamma distribution with known scale parameter, which we assume to be 1.0. This is Example 6.6.2 in DeGroot and Schervish.
Some made-up data actually having a gamma distribution (they are computer-simulated gamma random variables) is shown below. The print(x) statement prints the whole data vector (30 numbers) and the hist(x) statement makes a histogram.
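For example, data like these can be simulated as follows (a sketch; the shape value 1.5 and the seed are purely illustrative assumptions, not the values behind the actual data):

set.seed(42)                  # illustrative seed, not the one used for the actual data
x <- rgamma(30, shape = 1.5)  # 30 simulated gamma random variables, rate = 1 by default
print(x)                      # print the whole data vector
hist(x)                       # make a histogram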
The R function dgamma (see the on-line help) calculates the density of the gamma distribution. As with most R functions the calculation is vectorized, so

dgamma(x, shape = alpha)

calculates a vector of 30 numbers, the values f_α(x_i) for each of the 30 data points x_i.
Another optional argument (described on the on-line help for the dgamma function) tells the function to return the log probability density instead of the probability density:

dgamma(x, shape = alpha, log = TRUE)
Finally, the sum function (see the on-line help) calculates the sum of these logs. Hence the code

logl <- function(alpha, x) return(sum(dgamma(x, shape = alpha, log = TRUE)))
This function can be slightly improved by inserting checks that alpha is a single number (a vector of length 1 to R) rather than a longer vector, which doesn't make sense, and that it is positive:

logl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(sum(dgamma(x, shape = alpha, log = TRUE)))
}
Of course, when there is no mistake (no bugs in the code) both functions do exactly the same thing. But when there is a mistake, the function with the error checks will make it clear what the problem is.
If you want to look at the log likelihood, the following R statements seem to do the job.
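For example (a sketch, assuming the logl function and data vector x defined above; the grid endpoints 0.5 and 3.5 are arbitrary choices):

alpha.seq <- seq(0.5, 3.5, length = 101)    # grid of alpha values to plot over
logl.seq <- sapply(alpha.seq, logl, x = x)  # logl requires scalar alpha, so loop with sapply
plot(alpha.seq, logl.seq, type = "l", xlab = "alpha", ylab = "log likelihood")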
R has several functions that optimize functions. The one we will explain here is the nlm function (see the on-line help). It has a huge number of arguments, most of which can be ignored. The only arguments that must be supplied are two:

f
the function to be minimized.

p
a starting value for the variable. The dimension of p tells nlm the dimension of the space over which f is minimized. So p must be supplied to specify the dimension, if for no other reason. But it helps if we can specify a starting value reasonably close to the solution.
We won't at this point discuss any of the optional arguments described on the on-line help except for the somewhat mysterious ... argument, which the on-line help for the nlm function describes as "additional arguments to f". What this means is that if we supply an argument whose name is the name of one of the arguments to the function supplied as the argument f, and whose name is not one of the named arguments to nlm, then this argument will be forwarded to the function supplied as the argument f. In particular, if we write our log likelihood function to have an additional argument x, which is the data (as we did above), then since nlm doesn't have an argument named x, this will be passed to our log likelihood function.
Since nlm minimizes rather than maximizes, we need to write an f function that calculates minus the log likelihood rather than the log likelihood (stand on your head and maximization becomes minimization).
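For example, a minus log likelihood for our one-parameter problem, parallel to the logl function above, might look like this (we call it mlogl, matching the name used for the two-parameter problem below):

mlogl <- function(alpha, x) {
    if (length(alpha) > 1) stop("alpha must be scalar")
    if (alpha <= 0) stop("alpha must be positive")
    return(- sum(dgamma(x, shape = alpha, log = TRUE)))
}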
Good starting values are hard to find, in general. In our particular problem, maximum likelihood for the shape parameter of the gamma distribution, a good estimate of the shape parameter α is the sample mean, since the theoretical mean of the gamma distribution is α / β, where β is the rate parameter, here assumed known to be β = 1.0. Hence the population mean is α, and the sample mean is a reasonable estimate of it (by the law of large numbers).
This gives us the following first attempt at maximum likelihood for our example.
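A minimal sketch of such a first attempt (assuming the mlogl function and data x above):

out <- nlm(mlogl, mean(x), x = x)  # sample mean as starting value; x is forwarded to mlogl
print(out$estimate)                # the MLE of alpha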
A few more options of nlm can be helpful.

hessian
if TRUE, nlm also returns the second derivative (an approximation calculated by finite differences) of the objective function at the solution. This will be a k × k matrix if the dimension of the argument p (which must match the dimension of the first argument of the function f) is k.

fscale
helps nlm know when it has converged to the solution. It should give roughly the size of the objective function at the solution. Often fscale = length(x) is about right.

print.level
tells nlm to blather about what it is doing. print.level = 2 gives maximum verbosity.
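Putting these options together, a call might look like this sketch (again assuming mlogl and x from above):

out <- nlm(mlogl, mean(x), hessian = TRUE, fscale = length(x), print.level = 2, x = x)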
We can use the hessian, which is part of the list returned by the nlm function, to tell whether the solution is a local minimum. More importantly, we can use it to learn about the statistical properties of the MLE (Section 7.8 in DeGroot and Schervish).
The hessian
returned by the nlm
function is
the second derivative of minus the log likelihood, that is, what is
referred to in the handout
on Fisher Information and Confidence Intervals
as
observed Fisher information. More precisely, it is a finite
difference approximation to Fisher information, that is, equation (1.2)
in the handout with the MLE plugged in for the parameter.
Thus the following code calculates an asymptotic confidence interval, with coverage probability conf.level, for the unknown shape parameter in the example we have been doing.
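A minimal sketch of such code (assuming the result out of an nlm call with hessian = TRUE, as above):

conf.level <- 0.95                   # desired coverage probability
crit <- qnorm((1 + conf.level) / 2)  # standard normal critical value
theta.hat <- out$estimate            # the MLE
se <- sqrt(1 / out$hessian[1, 1])    # inverse observed Fisher information gives asymptotic variance
theta.hat + c(-1, 1) * crit * se     # the confidence interval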
We need good starting points for our optimization algorithm. If we stick with the gamma distribution, but now switch to the two-parameter gamma distribution to get a multiparameter problem, the simplest estimators are the so-called method of moments estimators.
This is a traditional theory topic that students who have had a course like this are expected to know (on preliminary exams, actuarial exams, and so forth), so it won't hurt to cover it, even though DeGroot and Schervish ignore it.
The first two moments of the Gamma(α, β) distribution are

E(X) = α / β

and

E(X²) = α (α + 1) / β²

so the variance is E(X²) − E(X)² = α / β². The so-called method of moments says the following: set the theoretical moments (equivalently, the mean and variance) equal to the corresponding sample quantities and solve for the parameters. For the gamma distribution, setting x̄ = α / β and s² = α / β², where x̄ is the sample mean and s² is the sample variance, and solving for the parameters gives

α = x̄² / s²

and

β = x̄ / s²
The R statements for these estimators are

alpha.start <- mean(x)^2 / var(x)
beta.start <- mean(x) / var(x)
The method of moments isn't always applicable (the solve-for-the-parameters step may be impossible), and it doesn't necessarily produce good estimators. Maximum likelihood estimators are asymptotically efficient; method of moments estimators generally aren't. But they do provide good enough starting points for maximum likelihood.
Coding the log likelihood (really minus the log likelihood, which is what we need to hand to nlm) is much the same as coding the uniparameter case. The main difference is that the argument to the function must be a vector of parameters. We usually want to take this vector apart into its scalar components if, as in the gamma example, the different parameters do different things. With that in mind our mlogl function looks something like this:
mlogl <- function(theta, x) {
    if (length(theta) != 2) stop("theta must be vector of length 2")
    alpha <- theta[1]
    lambda <- theta[2]
    if (alpha <= 0) stop("theta[1] must be positive")
    if (lambda <= 0) stop("theta[2] must be positive")
    return(- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE)))
}
(Note: We've changed the name of the rate parameter from β, which DeGroot and Schervish use, to λ, which many other theory books use, because beta conflicts with an R function name: the R function beta calculates the normalizing constant of the beta distribution, which is called the beta function.)
So let's try it.
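A minimal sketch of the run (assuming the data x and the mlogl function above, with the method of moments estimators as starting values):

theta.start <- c(mean(x)^2 / var(x), mean(x) / var(x))  # method of moments estimators
out <- nlm(mlogl, theta.start, hessian = TRUE, fscale = length(x), x = x)
print(out)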
The important bits here are the MLE
$estimate
[1] 1.680531 1.009529
the observed Fisher information matrix
$hessian
          [,1]      [,2]
[1,]  24.15181 -29.71534
[2,] -29.71534  49.45877
and, for comparison with the MLE, the method of moments estimators
Rweb:> print(theta.start)
[1] 1.5398404 0.9250142
The method of moments estimators seem fairly close to the MLEs, but we can't really tell whether they are close in the statistical sense until we calculate confidence intervals.
The important output is right at the end. The confidence intervals
[1] 0.899571 2.461491
[1] 0.4637939 1.5552641
for the shape parameter and the rate parameter (in that order).
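A sketch of how such intervals can be computed from the nlm output (assuming out from the two-parameter fit above):

conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
theta.hat <- out$estimate
inv.fish <- solve(out$hessian)                         # inverse observed Fisher information
theta.hat[1] + c(-1, 1) * crit * sqrt(inv.fish[1, 1])  # interval for the shape parameter
theta.hat[2] + c(-1, 1) * crit * sqrt(inv.fish[2, 2])  # interval for the rate parameter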
An important comparison is with the confidence interval for the shape parameter we got when only estimating the shape (assuming we knew the rate parameter was 1.0), which was

[1] 1.271798 2.065813

Note that the interval for the shape parameter is much narrower when the rate parameter is known: (1.27, 2.07) when the rate is known versus (0.90, 2.46) when the rate is unknown and must also be estimated.
This is an important lesson.
Confidence intervals for parameters of interest are wider when nuisance parameters are estimated.