This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).
The version of R used to make this document is 3.6.2.
The version of the rmarkdown package used to make this document is 2.1.
The version of the numDeriv package used to make this document is 2016.8.1.1.
What is called IEEE arithmetic for short is a standard for floating point arithmetic implemented in nearly all currently manufactured computers.
What you need to know about IEEE arithmetic is that there are several kinds of floating point numbers. In C and C++ the types are
float, about 6.9 decimal digits precision,
double, about 15.6 decimal digits precision, and
long double, which can be anything, often the same as double.
In R only double is used.
IEEE arithmetic also represents values that are not ordinary floating point numbers. In R these are printed
NaN, “meaning” not a number,
Inf, “meaning” \(+ \infty\), and
-Inf, “meaning” \(- \infty\).
In all three cases the scare quotes around “meaning” mean the meaning is more complicated than first appears, as we shall see as we go along.
These follow obvious rules of arithmetic \[\begin{align*} \texttt{NaN} + x & = \texttt{NaN} \\ \texttt{NaN} * x & = \texttt{NaN} \\ \texttt{Inf} + x & = \texttt{Inf}, \qquad x > \texttt{-Inf} \\ \texttt{Inf} + \texttt{-Inf} & = \texttt{NaN} \\ \texttt{Inf} * x & = \texttt{Inf}, \qquad x > 0 \\ \texttt{Inf} * 0 & = \texttt{NaN} \\ x / 0 & = \texttt{Inf}, \qquad x > 0 \\ 0 / 0 & = \texttt{NaN} \end{align*}\]
In R the function is.finite tests that numbers are not any of NA, NaN, Inf, or -Inf.
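Here is a quick check of these rules and of is.finite (a small illustration added here, not one of the original examples).
Inf + 1                               # Inf
Inf + (-Inf)                          # NaN
Inf * 0                               # NaN
1 / 0                                 # Inf
0 / 0                                 # NaN
is.finite(c(1, NA, NaN, Inf, -Inf))   # TRUE FALSE FALSE FALSE FALSE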
It can happen that all(is.finite(x)) is TRUE but sum(x) or prod(x) is Inf. This is called overflow.
Overflow must be avoided if at all possible. It loses all significant figures.
Example:
log(exp(710))
## [1] Inf
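One standard way to sidestep this kind of overflow (a sketch added here, not one of the original examples) is to work on the log scale, so that a product of many large numbers becomes a sum of moderate ones.
x <- rep(1e200, 10)
prod(x)       # Inf: overflow, all significant figures lost
sum(log(x))   # about 4605.17, the log of the exact product, no overflow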
An IEEE arithmetic result can be zero, when the exact infinite-precision result would be positive but smaller than the smallest positive number representable in IEEE arithmetic. This is called underflow.
Example:
log(exp(-746))
## [1] -Inf
Underflow is not a worry if the result is later added to a large number.
Example:
1 + exp(-746)
## [1] 1
is very close to correct, as close to correct as the computer can represent.
Between the smallest positive number representable with full (15.6 decimal digit) precision and zero are numbers representable with less precision, called denormalized (or subnormal) numbers.
Example:
log(exp(-743))
## [1] -743.0538
Theoretically, since log and exp are inverses of each other, we should get \(- 743\) as the answer. But exp(-743) is a denormalized number with less than full precision, so we only get close but not very close to the correct result.
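The boundary of the denormalized range can be seen in .Machine (a small added illustration; the comments give approximate values, assuming the usual IEEE doubles).
.Machine$double.xmin   # smallest positive double with full precision, about 2.2e-308
exp(-743)              # far below double.xmin, hence denormalized, with reduced precision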
We say “catastrophic cancellation” occurs when subtracting two nearly equal positive numbers gives a number with much less precision.
Example \[ 1.020567 - 1.020554 = 1.3 \times 10^{-5} \] Both operands have 7 decimal digits of precision. The result has 2.
That’s if we are assuming decimal arithmetic. Computers, of course, use binary arithmetic, but the principle is the same.
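The binary version is easy to see in R (a sketch added here, not one of the original examples). The operands 1 + 1e-10 and 1 agree in their leading digits, so subtracting them magnifies the tiny rounding error made when 1 + 1e-10 was computed, and only about half of the roughly 16 significant digits survive.
print((1 + 1e-10) - 1, digits = 17)   # not exactly 1e-10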
What I call the “complement rule” is the simplest fact of probability theory \[
\Pr(\mathop{\rm not} A) = 1 - \Pr(A), \qquad \text{for any event $A$}.
\] But it assumes real real numbers, not the computer’s sorta-kinda real numbers (doubles).
The complement rule doesn’t work in the upper tail of probability distributions where probabilities are nearly equal to one. R, being (unlike C and C++) a computer language highly concerned with numerical accuracy, provides a workaround. All of the “p” and “q” functions like pnorm and qnorm have a lower.tail argument to work around this issue.
pnorm(2.5, lower.tail = FALSE)
## [1] 0.006209665
1 - pnorm(2.5)
## [1] 0.006209665
Same thing, right? But the latter, shorter and simpler though it may seem, suffers from catastrophic cancellation.
x <- 0:20
data.frame(x, p1 = pnorm(x, lower.tail = FALSE), p2 = 1 - pnorm(x))
## x p1 p2
## 1 0 5.000000e-01 5.000000e-01
## 2 1 1.586553e-01 1.586553e-01
## 3 2 2.275013e-02 2.275013e-02
## 4 3 1.349898e-03 1.349898e-03
## 5 4 3.167124e-05 3.167124e-05
## 6 5 2.866516e-07 2.866516e-07
## 7 6 9.865876e-10 9.865877e-10
## 8 7 1.279813e-12 1.279865e-12
## 9 8 6.220961e-16 6.661338e-16
## 10 9 1.128588e-19 0.000000e+00
## 11 10 7.619853e-24 0.000000e+00
## 12 11 1.910660e-28 0.000000e+00
## 13 12 1.776482e-33 0.000000e+00
## 14 13 6.117164e-39 0.000000e+00
## 15 14 7.793537e-45 0.000000e+00
## 16 15 3.670966e-51 0.000000e+00
## 17 16 6.388754e-58 0.000000e+00
## 18 17 4.105996e-65 0.000000e+00
## 19 18 9.740949e-73 0.000000e+00
## 20 19 8.527224e-81 0.000000e+00
## 21 20 2.753624e-89 0.000000e+00
Of course, we can use the symmetry of the normal distribution to compute these without catastrophic cancellation and without lower.tail = FALSE
x <- 7:12
data.frame(x, p1 = pnorm(x, lower.tail = FALSE), p2 = pnorm(- x))
## x p1 p2
## 1 7 1.279813e-12 1.279813e-12
## 2 8 6.220961e-16 6.220961e-16
## 3 9 1.128588e-19 1.128588e-19
## 4 10 7.619853e-24 7.619853e-24
## 5 11 1.910660e-28 1.910660e-28
## 6 12 1.776482e-33 1.776482e-33
but for nonsymmetric distributions, lower.tail = FALSE is essential for avoiding catastrophic cancellation for upper tail probabilities.
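For example (an added illustration using the exponential distribution, which is not one of the original examples), there is no symmetry trick available, and the complement rule fails long before lower.tail = FALSE does.
pexp(50, lower.tail = FALSE)   # exp(-50), about 1.9e-22
1 - pexp(50)                   # 0: catastrophic cancellation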
The lower.tail argument works the same way for quantiles.
p <- 10^(-(1:20))
cbind(p = p, q1 = qnorm(p, lower.tail = FALSE), q2 = qnorm(1 - p))
## p q1 q2
## [1,] 1e-01 1.281552 1.281552
## [2,] 1e-02 2.326348 2.326348
## [3,] 1e-03 3.090232 3.090232
## [4,] 1e-04 3.719016 3.719016
## [5,] 1e-05 4.264891 4.264891
## [6,] 1e-06 4.753424 4.753424
## [7,] 1e-07 5.199338 5.199338
## [8,] 1e-08 5.612001 5.612001
## [9,] 1e-09 5.997807 5.997807
## [10,] 1e-10 6.361341 6.361341
## [11,] 1e-11 6.706023 6.706023
## [12,] 1e-12 7.034484 7.034487
## [13,] 1e-13 7.348796 7.348755
## [14,] 1e-14 7.650628 7.650731
## [15,] 1e-15 7.941345 7.941444
## [16,] 1e-16 8.222082 8.209536
## [17,] 1e-17 8.493793 Inf
## [18,] 1e-18 8.757290 Inf
## [19,] 1e-19 9.013271 Inf
## [20,] 1e-20 9.262340 Inf
With real real numbers for every \(\varepsilon > 0\) we have \(1 + \varepsilon > 1\). Not so with computer arithmetic.
foo <- 1 + 1e-100
identical(foo, 1)
## [1] TRUE
According to ?.Machine
.Machine$double.eps
## [1] 2.220446e-16
is “the smallest positive floating-point number x such that 1 + x != 1”. According to the Wikipedia page for “machine epsilon”, definitions of this concept vary among different authorities, but the one R uses is widely used and is also the definition used by C and C++.
The C program
#include <float.h>
#include <stdio.h>
int main(void)
{
    printf("machine epsilon: %e\n", DBL_EPSILON);
    return 0;
}
and the C++ program
#include <limits>
#include <iostream>
using namespace std;
int main()
{
    cout << "machine epsilon:" <<
        std::numeric_limits<double>::epsilon() << endl;
    return 0;
}
print the same number as R does above.
Is the definition correct?
epsilon <- .Machine$double.eps
1 + epsilon == 1
## [1] FALSE
1 + 0.9 * epsilon == 1
## [1] FALSE
1 + 0.8 * epsilon == 1
## [1] FALSE
1 + 0.7 * epsilon == 1
## [1] FALSE
1 + 0.6 * epsilon == 1
## [1] FALSE
1 + 0.5 * epsilon == 1
## [1] TRUE
1 + 0.4 * epsilon == 1
## [1] TRUE
Hmmmmmmmmmmm. It appears that the “definition” in the R documentation is actually wrong. Perhaps they are using one of the other definitions that Wikipedia mentions. Oh. The C11 standard says DBL_EPSILON is “the difference between 1 and the least value greater than 1 that is representable in the given floating point type, \(b^{1 - p}\).” I guess that means that DBL_EPSILON (hence the rest too) has to be a power of 2.
log2(epsilon)
## [1] -52
So that seems right.
Anyway, all of these technicalities aside, the machine epsilon is more or less the relative precision of computer arithmetic.
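To see what relative precision means here (a small added illustration): the gap between adjacent representable numbers near \(x\) is roughly \(x\) times the machine epsilon.
epsilon <- .Machine$double.eps
(1 + epsilon) - 1             # epsilon itself: the gap just above 1
(1e6 + 1e6 * epsilon) - 1e6   # roughly 1e6 * epsilon: the gap scales with magnitude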
R uses it to define things like tolerances
args(all.equal.numeric)
## function (target, current, tolerance = sqrt(.Machine$double.eps),
## scale = NULL, countEQ = FALSE, formatFUN = function(err,
## what) format(err), ..., check.attributes = TRUE)
## NULL
And you should too. You should also follow this example by making the tolerance an argument of any of your functions that need one, so the user can override your default.
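A minimal sketch of that pattern (the function near.zero and its purpose are made up for illustration; only the shape of the default matters):
near.zero <- function(x, tolerance = sqrt(.Machine$double.eps)) {
    abs(x) < tolerance
}
near.zero(1e-9)                      # TRUE with the default tolerance (about 1.5e-8)
near.zero(1e-9, tolerance = 1e-12)   # FALSE with a stricter user-supplied tolerance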
Also, returning to the preceding section, we see that the machine epsilon is where the complement rule starts to fail.
What some intro stats books call the “short-cut” formula for variance \[ \mathop{\rm var}(X) = E(X^2) - E(X)^2 \] is a mathematical identity when using real real numbers. It is an invitation to catastrophic cancellation when using computer arithmetic.
Always use the two-pass algorithm \[\begin{align*} \bar{x}_n & = \frac{1}{n} \sum_{i = 1}^n x_i \\ v_n & = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}_n)^2 \end{align*}\]
x <- 1:10
# short cut
mean(x^2) - mean(x)^2
## [1] 8.25
# two pass
moo <- mean(x)
mean((x - moo)^2)
## [1] 8.25
Looks OK. What’s the problem? But
x <- x + 1e9
# short cut
mean(x^2) - mean(x)^2
## [1] 0
# two pass
moo <- mean(x)
mean((x - moo)^2)
## [1] 8.25
Catastrophic cancellation!
There is also a sophisticated one-pass algorithm (Chan, Golub, and LeVeque (1983), “Algorithms for computing the sample variance: Analysis and recommendations”, American Statistician, 37, 242-247), but it is not efficient in R (it can be used when you are programming in C or C++).
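For the curious, here is a sketch of one well-known stable one-pass updating scheme (Welford’s algorithm). It is only an illustration of the idea, not necessarily the exact algorithm of the cited paper, and, as just noted, an explicit loop like this is slow in R and would normally be written in C or C++.
one.pass.var <- function(x) {
    # update a running mean and a running sum of squared deviations,
    # one observation at a time
    mean.so.far <- 0
    sum.sq <- 0
    for (i in seq_along(x)) {
        delta <- x[i] - mean.so.far
        mean.so.far <- mean.so.far + delta / i
        sum.sq <- sum.sq + delta * (x[i] - mean.so.far)
    }
    sum.sq / length(x)   # divide by n to match the two-pass formula above
}
one.pass.var((1:10) + 1e9)   # 8.25 (or very nearly so), no catastrophic cancellation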
Some commonly used mathematical operations invite catastrophic cancellation. R and C and C++ provide special functions to do these right.
The R function log1p calculates \(\log(1 + x)\) in a way that avoids catastrophic cancellation when \(x\) is nearly zero. We know from calculus (Taylor series) that \(\log(1 + x) \approx x\) for small \(x\).
log1p(1 / pi)
## [1] 0.2763505
log(1 + 1 / pi)
## [1] 0.2763505
not much difference, but
foo <- 1e-20
log1p(foo)
## [1] 1e-20
log(1 + foo)
## [1] 0
catastrophic cancellation!
The R function expm1 calculates \(e^x - 1\) in a way that avoids catastrophic cancellation when \(x\) is nearly zero. We know from calculus (Taylor series) that \(e^x - 1 \approx x\) for small \(x\).
expm1(1 / pi)
## [1] 0.3748022
exp(1 / pi) - 1
## [1] 0.3748022
not much difference, but
foo <- 1e-20
expm1(foo)
## [1] 1e-20
exp(foo) - 1
## [1] 0
catastrophic cancellation!
C and C++ also have log1p and expm1. In fact, R is just calling the C functions to do them.
New in R-3.1.0 are functions cospi(x), sinpi(x), and tanpi(x), which compute cos(pi*x), sin(pi*x), and tan(pi*x).
These functions are also in C and C++.
x <- (0:20) / 2
data.frame(sin = sin(pi * x), sinpi = sinpi(x))
## sin sinpi
## 1 0.000000e+00 0
## 2 1.000000e+00 1
## 3 1.224647e-16 0
## 4 -1.000000e+00 -1
## 5 -2.449294e-16 0
## 6 1.000000e+00 1
## 7 3.673940e-16 0
## 8 -1.000000e+00 -1
## 9 -4.898587e-16 0
## 10 1.000000e+00 1
## 11 6.123234e-16 0
## 12 -1.000000e+00 -1
## 13 -7.347881e-16 0
## 14 1.000000e+00 1
## 15 8.572528e-16 0
## 16 -1.000000e+00 -1
## 17 -9.797174e-16 0
## 18 1.000000e+00 1
## 19 1.102182e-15 0
## 20 -1.000000e+00 -1
## 21 -1.224647e-15 0
The log likelihood for the usual parameter \(p\) for the binomial distribution with observed data \(x\) and sample size \(n\) is \[ l(p) = x \log(p) + n \log(1 - p) \] In terms of the “natural” parameter \[ \theta = \mathop{\rm logit}(p) = \log(p) - \log(1 - p) \] the log likelihood is \[ l(\theta) = x \theta - n \log(1 + e^\theta) \] The function going the other way between \(p\) and \(\theta\) is \[ p = \frac{e^\theta}{1 + e^\theta} = \frac{1}{e^{- \theta} + 1} \]
The first derivative of the log likelihood is \[ l'(\theta) = x - n \frac{e^\theta}{1 + e^\theta} = x - n p \] and the second derivative is \[ l''(\theta) = - n \frac{e^\theta}{1 + e^\theta} + n \frac{(e^\theta)^2}{(1 + e^\theta)^2} = - n p (1 - p) \] we want an R function that evaluates this log likelihood and its derivatives.
The first problem we have to deal with is overflow. We never want \(e^\theta\) or \(e^{- \theta}\) to overflow. We see that we can write \(p\) in terms of either \(e^\theta\) or \(e^{- \theta}\), so we want to pick the expression that cannot overflow. Similarly the expression for the log likelihood itself can be rewritten in terms of \(e^{- \theta}\) \[ l(\theta) = x \theta - n \log[ e^\theta (e^{- \theta} + 1) ] = x \theta - n \theta - n \log(e^{- \theta} + 1) = - (n - x) \theta - n \log(e^{- \theta} + 1) \] and here too we want to pick the expression that cannot overflow.
The second problem we want to deal with is catastrophic cancellation. We never want to evaluate \(1 - p\) by subtracting \(p\) from 1. Instead use algebra to rewrite it so there is no subtraction \[ q = 1 - p = 1 - \frac{e^\theta}{1 + e^\theta} = \frac{1}{1 + e^\theta} = \frac{e^{- \theta}}{e^{- \theta} + 1} \] so now there is no catastrophic cancellation here and no overflow either if we choose the expression that does not overflow.
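To make the idea concrete before writing the full function, here is a sketch of the stable inverse logit (the helper name invlogit is made up for illustration): pick whichever algebraically equivalent expression never exponentiates a large positive number.
invlogit <- function(theta) {
    # exp of a large negative number merely underflows to zero, which is harmless here
    if (theta < 0) exp(theta) / (1 + exp(theta)) else 1 / (exp(- theta) + 1)
}
invlogit(800)    # 1 (the true value differs from 1 by far less than machine epsilon)
invlogit(-800)   # 0 (underflow, but 0 is the closest representable double)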
logl <- function(theta, x, n, deriv = 2) {
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    stopifnot(length(theta) == 1)
    stopifnot(is.numeric(x))
    stopifnot(is.finite(x))
    stopifnot(length(x) == 1)
    if (x != round(x)) stop("x must be integer")
    stopifnot(is.numeric(n))
    stopifnot(is.finite(n))
    stopifnot(length(n) == 1)
    if (n != round(n)) stop("n must be integer")
    stopifnot(x <= n)
    stopifnot(length(deriv) == 1)
    stopifnot(deriv %in% 0:2)
    val <- if (theta < 0) x * theta - n * log1p(exp(theta)) else
        - (n - x) * theta - n * log1p(exp(- theta))
}
Note that we use log1p in the obvious places to avoid catastrophic cancellation.
For once we won’t test that every error message works as it is supposed to. We leave that as an exercise for the reader.
Our function doesn’t do derivatives yet, but we want to get to testing right away.
thetas <- seq(-10, 10)
x <- 0
n <- 10
log.thetas <- Map(function(theta) logl(theta, x, n), thetas)
log.thetas.too <- Map(function(theta) dbinom(x, n,
1 / (exp(- theta) + 1), log = TRUE), thetas)
all.equal(log.thetas, log.thetas.too)
## [1] TRUE
The first derivative is simple, but we worry about catastrophic cancellation in \(x - n p\). We special-case one case: when \(x = n\) we have \[ l'(\theta) = n (1 - p) = n q \] and we want to be sure to evaluate \(q\) without catastrophic cancellation.
But for the general case, there does not seem to be any way to avoid cancellation (maybe we shouldn’t call it “catastrophic” here) if it occurs. We have to compare \(x\) and \(n p\) somehow, and comparing “real” (double) numbers is always fraught with danger (or at least inaccuracy).
logl <- function(theta, x, n, deriv = 2) {
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    stopifnot(length(theta) == 1)
    stopifnot(is.numeric(x))
    stopifnot(is.finite(x))
    stopifnot(length(x) == 1)
    if (x != round(x)) stop("x must be integer")
    stopifnot(is.numeric(n))
    stopifnot(is.finite(n))
    stopifnot(length(n) == 1)
    if (n != round(n)) stop("n must be integer")
    stopifnot(x <= n)
    stopifnot(length(deriv) == 1)
    stopifnot(deriv %in% 0:2)
    val <- if (theta < 0) x * theta - n * log1p(exp(theta)) else
        - (n - x) * theta - n * log1p(exp(- theta))
    result <- list(value = val)
    if (deriv == 0) return
    pp <- if (theta < 0) exp(theta) / (1 + exp(theta)) else
        1 / (exp(- theta) + 1)
    qq <- if (theta < 0) 1 / (1 + exp(theta)) else
        exp(- theta) / (exp(- theta) + 1)
    grad <- if (x < n) x - n * pp else n * qq
    result$gradient <- grad
    result
}
Now that we know the first part of our function (log likelihood calculation) is correct, we can trust it while we are testing whether the derivative is correct.
I can think of two obvious methods of testing derivatives. One is to use R’s knowledge of calculus, which is primitive but good enough for this problem.
d1 <- D(expression(x * theta - n * log(1 + exp(theta))), "theta")
d1
## x - n * (exp(theta)/(1 + exp(theta)))
mygrad <- function(theta, x, n) eval(d1)
g0 <- Map(function(theta) logl(theta, x, n)$gradient, thetas)
g1 <- Map(function(theta) mygrad(theta, x, n), thetas)
all.equal(g0, g1)
## [1] TRUE
Even when R does not know how to check derivatives, they can still be approximated numerically. The simplest way is to use finite differences \[
f'(x) \approx \frac{f(x + h) - f(x)}{h}, \qquad \text{for small $h$}
\] but there is a CRAN package numDeriv that does a lot more sophisticated calculations.
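Just to illustrate the displayed formula (an added sketch; the step size h = 1e-6 is an arbitrary choice, and numDeriv, used below, is more sophisticated):
h <- 1e-6
theta <- 1.1
(logl(theta + h, x, n)$value - logl(theta, x, n)$value) / h   # crude approximation
logl(theta, x, n)$gradient                                    # hand-coded derivative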
library(numDeriv)
numgrad <- function(theta) grad(function(theta) logl(theta, x, n)$value, theta)
g2 <- Map(numgrad, thetas)
all.equal(g0, g2)
## [1] TRUE
The definition of numgrad above may seem confusing: too many thetas! First, numgrad is itself a function of theta. It is supposed to calculate the first derivative of the log likelihood \(l'(\theta)\). We calculate that using the R function grad in the R package numDeriv. It wants a function as its first argument (the function to differentiate). That function, too, we think of as a function of theta, and we define that function right there as an anonymous expression
function(theta) logl(theta, x, n)$value
and theta in this expression has nothing whatsoever to do with theta outside this expression (just like any argument of any function). In this expression theta is the argument of this anonymous function. It might help readability to rewrite our definition of numgrad as
function(theta) grad(function(theta.too) logl(theta.too, x, n)$value, theta)
so we can tell our thetas apart, but R has no trouble with the way it was written first.
The last theta in the definition of numgrad is the point where grad is to evaluate the derivative.
We also need to test the special case x == n, and while we are at it, it wouldn’t hurt to test the special case x == 0.
g0 <- Map(function(theta) logl(theta, n, n)$gradient, thetas)
g1 <- Map(function(theta) mygrad(theta, n, n), thetas)
all.equal(g0, g1)
## [1] TRUE
numgrad <- function(theta) grad(function(theta) logl(theta, n, n)$value, theta)
g2 <- Map(numgrad, thetas)
all.equal(g0, g2)
## [1] TRUE
g0 <- Map(function(theta) logl(theta, 0, n)$gradient, thetas)
g1 <- Map(function(theta) mygrad(theta, 0, n), thetas)
all.equal(g0, g1)
## [1] TRUE
numgrad <- function(theta) grad(function(theta) logl(theta, 0, n)$value, theta)
g2 <- Map(numgrad, thetas)
all.equal(g0, g2)
## [1] TRUE
It is a bit ugly that our tests have to redefine numgrad each time, but it doesn’t matter because no one has to use the tests, just the function logl that we are testing.
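One way to tidy this up (an added suggestion, not run as part of the original analysis) is to make the data arguments of the helper explicit, so it need not be redefined for each test case.
numgrad <- function(theta, x, n)
    grad(function(theta) logl(theta, x, n)$value, theta)
# the three cases tested above then become, for example,
# Map(function(theta) numgrad(theta, x, n), thetas)
# Map(function(theta) numgrad(theta, n, n), thetas)
# Map(function(theta) numgrad(theta, 0, n), thetas)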
The second derivative is even simpler, \[ l''(\theta) = - n p q \] so long as we calculate \(p\) and \(q\) without catastrophic cancellation, which we know how to do.
logl <- function(theta, x, n, deriv = 2) {
    # argument checking
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    stopifnot(length(theta) == 1)
    stopifnot(is.numeric(x))
    stopifnot(is.finite(x))
    stopifnot(length(x) == 1)
    if (x != round(x)) stop("x must be integer")
    stopifnot(is.numeric(n))
    stopifnot(is.finite(n))
    stopifnot(length(n) == 1)
    if (n != round(n)) stop("n must be integer")
    stopifnot(0 <= x)
    stopifnot(x <= n)
    stopifnot(length(deriv) == 1)
    stopifnot(deriv %in% 0:2)
    # log likelihood, using whichever expression cannot overflow
    val <- if (theta < 0) x * theta - n * log1p(exp(theta)) else
        - (n - x) * theta - n * log1p(exp(- theta))
    result <- list(value = val)
    if (deriv == 0) return(result)
    # p and q, computed without overflow and without subtracting from one
    pp <- if (theta < 0) exp(theta) / (1 + exp(theta)) else
        1 / (exp(- theta) + 1)
    qq <- if (theta < 0) 1 / (1 + exp(theta)) else
        exp(- theta) / (exp(- theta) + 1)
    grad <- if (x < n) x - n * pp else n * qq
    result$gradient <- grad
    if (deriv == 1) return(result)
    result$hessian <- (- n * pp * qq)
    return(result)
}
I noticed in this re-re-implementation that our re-implementation was completely broken in a way that was not tested. It did not return the right thing in case deriv = 0. Now this is fixed, but we should be sure to test it this time.
Much later (during class) I noticed that I was missing the test that 0 <= x, so that has been added also.
logl(1.1, x, n, 0)
## $value
## [1] -13.87335
logl(1.1, x, n, 1)
## $value
## [1] -13.87335
##
## $gradient
## [1] -7.502601
logl(1.1, x, n, 2)
## $value
## [1] -13.87335
##
## $gradient
## [1] -7.502601
##
## $hessian
## [1] -1.873699
logl(1.1, x, n, 3)
## Error in logl(1.1, x, n, 3): deriv %in% 0:2 is not TRUE
So we see the deriv argument (now) works correctly.
We still have to test the second derivative. We do this just like we tested the first derivative.
d2 <- D(d1, "theta")
d2
## -(n * (exp(theta)/(1 + exp(theta)) - exp(theta) * exp(theta)/(1 +
## exp(theta))^2))
myhess <- function(theta, x, n) eval(d2)
h0 <- Map(function(theta) logl(theta, x, n)$hessian, thetas)
h1 <- Map(function(theta) myhess(theta, x, n), thetas)
all.equal(h0, h1)
## [1] TRUE
numhess <- function(theta)
grad(function(theta) logl(theta, x, n)$gradient, theta)
h2 <- Map(numhess, thetas)
all.equal(h0, h2)
## [1] TRUE
Everything looks good.
We could replace the test function
function(theta, x) dbinom(x, 20, prob = 1 / (1 + exp(- theta)), log = TRUE)
that appears in problem 7 on homework 1 with our new improved version
function(theta, x) logl(theta, x, 20, deriv = 0)$value
(but I haven’t actually tested that, so I’m not 100% certain of that).
Suppose we have a probability density function (PDF) or probability mass function (PMF) of the form \[ f_\theta(x) = a(\theta) b(x) e^{x \theta} \] (in which case this is called an exponential family of distributions), and we do not know how to calculate the function \(a\), but we do know how to simulate random variables having this distribution.
This may seem crazy, but there is a general methodology for simulating probability distributions known only up to an unknown normalizing constant called the Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953), “Equation of state calculations by fast computing machines”, Journal of Chemical Physics, 21, 1087-1092).
Geyer and Thompson (1992, “Constrained Monte Carlo maximum likelihood for dependent data (with discussion)”, Journal of the Royal Statistical Society, Series B, 54, 657-699) show that the following method approximates the log likelihood of this distribution, when \(x_\text{obs}\) is the observed data and \(x\) is a vector of simulations of the distribution for parameter value \(\psi\), \[ l(\theta) = x_\text{obs} \theta - \log\left( \sum_{i = 1}^n e^{x_i (\theta - \psi)} \right) \] or in R
logl <- function(theta) xobs * theta - log(sum(exp(x * (theta - psi))))
except that won’t work well because the exponentials are likely to overflow or underflow.
Our problem is to rewrite this so none of exponentials overflow and at least some of the exponentials do not underflow.
The key idea is to add and subtract a constant from each exponential. \[\begin{align*} \log\left( \sum_{i = 1}^n e^{x_i (\theta - \psi)} \right) & = \log\left( \sum_{i = 1}^n e^{x_i (\theta - \psi) + c - c} \right) \\ & = \log\left( \sum_{i = 1}^n e^c e^{x_i (\theta - \psi) - c} \right) \\ & = \log\left( e^c \sum_{i = 1}^n e^{x_i (\theta - \psi) - c} \right) \\ & = \log(e^c) + \log \left( \sum_{i = 1}^n e^{x_i (\theta - \psi) - c} \right) \\ & = c + \log \left( \sum_{i = 1}^n e^{x_i (\theta - \psi) - c} \right) \end{align*}\]This is true for any real number \(c\), but we need to choose \(c\) so we know the exponentials cannot overflow. An obvious choice is to choose \(c\) to be the largest of the terms \(x_i (\theta - \psi)\).
This will make the largest term in the sum equal to one, so not all of the exponentials underflow (and those that do make negligible contribution to the sum).
Having decided to make one term in the sum equal to one, we now have an opportunity to use log1p to calculate the log. And we should, to avoid catastrophic cancellation.
Before we start this problem, we clean up the R global environment
rm(list = ls())
We write the log likelihood as
logl <- function(theta) {
    foo <- x * (theta - psi)
    foomax <- max(foo)
    i <- which(foo == foomax)
    i <- i[1] # just in case there was more than one largest term
    foo <- foo[-i]
    bar <- foomax + log1p(sum(exp(foo - foomax)))
    xobs * theta - bar
}
For once we dispense with GIEMO and write the function using global variables as explained in Section 7.4.2 of the “Basics” handout.
We happen to have some appropriate data for this problem.
load(url("http://www.stat.umn.edu/geyer/3701/data/ising.rda"))
ls()
## [1] "logl" "psi" "x" "xobs"
Note that it is necessary to use the url function here, whereas it is unnecessary when reading from a URL with scan, read.table, or read.csv, because the “read” functions do extra trickery to recognize URLs and do the right thing, and load doesn’t bother.
What the model actually is, we won’t bother to explain. It is irrelevant to the present discussion (avoiding overflow and catastrophic cancellation).
It turns out that this function, which was tricky enough to write, is even trickier to test: any other method I can think of to calculate it fails because of either overflow or catastrophic cancellation.
So we just plot the function and see that it makes sense.
thetas <- seq(psi / 1.005, psi * 1.005, length = 101)
l0 <- Map(logl, thetas)
plot(thetas, unlist(l0), xlab=expression(theta), ylab=expression(l(theta)))
Figure: Log Likelihood Function (plotted by the code above).
Theory says that this function should be concave and asymptotically linear, that is, bends downward and looks like a linear function for very large (positive or negative) values of the argument. At least it looks like that.