1 License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).

The R markdown source for this document is https://www.stat.umn.edu/geyer/5421/notes/poisson.Rmd.

2 R

The version of R used to make this document is 4.5.1.

The version of the rmarkdown package used to make this document is 2.29.

3 Summary

This web page is just a copy of the web page about inference for the binomial distribution, changing what needs to be changed to work for the Poisson distribution instead.

4 Data

In our example the observed count is 17.

x <- 17

5 Maximum Likelihood Estimator

The usual estimator of the parameter \(\mu\) is \(\hat{\mu} = x\). This turns out to also be the maximum likelihood estimator (MLE).

mu.hat <- x

If we had IID data from a Poisson distribution, the MLE would be the sample mean. But here we have sample size n = 1.
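To see the IID case in action, here is a quick numerical check, with a made-up sample (not data from this document), that the sample log likelihood is maximized at the sample mean.

```r
# hypothetical IID Poisson sample (made up for illustration,
# not data from these notes)
y <- c(12, 17, 15, 20, 14)
# log likelihood for the whole sample, dropping terms not containing mu
logl.iid <- function(mu) sum(y) * log(mu) - length(y) * mu
# maximize numerically and compare with the sample mean
opt <- optimize(logl.iid, interval = c(0.01, 100), maximum = TRUE)
opt$maximum
mean(y)
```

The two numbers agree up to the tolerance of R function optimize.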

6 Log Likelihood

We can plot the log likelihood function using the following code

logl <- function(mu) {
    result <- x * log(mu) - mu
    # when x = 0 the term x * log(mu) gives NaN at mu = 0, but the
    # correct value of that term is 0, so the log likelihood is - mu
    result[is.nan(result)] <- - mu[is.nan(result)]
    return(result)
}
curve(logl, from = 0.0001, to = 50, n = 1001,
    xlab = expression(mu), ylab = "log likelihood",
    ylim = logl(mu.hat) + c(-10, 0))
abline(v = mu.hat, lty = "dashed")
Log Likelihood

The dashed vertical line shows where the MLE is, and it does appear to be where the log likelihood is maximized. The log likelihood goes to minus infinity as \(\mu \to \infty\), and, except when \(x = 0\), also as \(\mu \to 0\) (when \(x = 0\) the log likelihood is \(- \mu\), which increases to 0 as \(\mu \to 0\)). Thus we cannot draw the curve from zero to infinity, but rather from a little bit above zero to some large number. But that gives us a plot in which it is hard to see what is going on.

Hence the ylim optional argument to R function curve. It will turn out that the only interesting part of the log likelihood is the region near the maximum. Hence we include in our plot only the part of the curve in which the log likelihood is within 10 of the maximum. The 10 was pulled out of the air. We don’t really want to get scientific about this yet (but do in the section on likelihood-based confidence intervals below). Also it really doesn’t matter, since we are just using this plot to get some idea what is going on. We don’t use this plot for statistical inference.

7 Confidence Intervals

7.1 Confidence Level

Set the confidence level. It is best programming practice to never hard code numbers like this; that is, the number 0.95 should occur in your document only once, where it is used to initialize a variable. Then use that variable elsewhere.

If there is ever a need to change this (to use say 0.90 for the confidence level), then it only need be changed in one place and will be consistently used everywhere.

conf.level <- 0.95

7.2 Wald Confidence Interval

This interval is \[ \hat{\mu} \pm z_{\alpha / 2} \cdot \sqrt{\hat{\mu}} \] where \(z_{\alpha / 2}\) is the \(1 - \alpha / 2\) quantile of the standard normal distribution, because the variance of the distribution is \(\mu\).

mu.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) * sqrt(mu.hat)
## [1]  8.918861 25.081139

7.3 Wald Interval Fixed to Handle Data Equal Zero

As with the binomial distribution, the Wald interval for the Poisson distribution is zero width, hence ridiculous, when \(\hat{\mu} = 0\).

The Wald interval can be repaired by using a different procedure (Geyer, 2009, DOI: 10.1214/08-EJS349). For the Poisson distribution, this recipe uses the interval \([0, - \log(\alpha)]\) for coverage \(1 - \alpha\). Thus when we observe \(x = 0\) and want 95% confidence, the interval is

alpha <- 1 - conf.level
c(0, - log(alpha))
## [1] 0.000000 2.995732

The same fix can be used for any confidence interval, not just the Wald interval.
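The fix is easy to package as a function. The following is a minimal sketch for the Wald interval (the function name and organization are ours, not from the cited paper).

```r
# sketch of the repaired Wald interval; the function name and
# structure are ours, not from the cited paper
wald.fixed <- function(x, conf.level = 0.95) {
    alpha <- 1 - conf.level
    # the repair: a nondegenerate interval when x == 0
    if (x == 0) return(c(0, - log(alpha)))
    # otherwise the usual Wald interval
    x + c(-1, 1) * qnorm(1 - alpha / 2) * sqrt(x)
}
wald.fixed(0)    # the repaired interval for x = 0
wald.fixed(17)   # agrees with the Wald interval computed above
```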

7.4 Score Interval (also called Rao)

To derive this interval we invert the score test. The test statistic is \[ \frac{x - \mu}{\sqrt{\mu}} \] so the interval is the set of \(\mu\) such that \[ \left\lvert \frac{x - \mu}{\sqrt{\mu}} \right\rvert < z_{\alpha / 2} \] or \[ x^2 - 2 x \mu + \mu^2 < \mu z_{\alpha / 2}^2 \] or \[ \mu^2 - (2 x + z_{\alpha / 2}^2) \mu + x^2 < 0 \] From the quadratic formula, this interval has endpoints \[ \frac{2 x + z_{\alpha / 2}^2 \pm \sqrt{(2 x + z_{\alpha / 2}^2)^2 - 4 x^2}}{2} \] or \[ x + \frac{z_{\alpha / 2}^2}{2} \pm z_{\alpha / 2} \sqrt{x + \frac{z_{\alpha / 2}^2}{4}} \]

z <- qnorm((1 + conf.level) / 2)
x + z^2 / 2 + c(-1, 1) * z * sqrt(x + z^2 / 4)
## [1] 10.61447 27.22699
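As a sanity check, plugging these endpoints back into the score statistic should give exactly \(\pm z_{\alpha / 2}\), since the endpoints are the roots of the quadratic above.

```r
x <- 17                # as above
conf.level <- 0.95     # as above
z <- qnorm((1 + conf.level) / 2)
endpoints <- x + z^2 / 2 + c(-1, 1) * z * sqrt(x + z^2 / 4)
# the score statistic evaluated at the endpoints: z and -z
(x - endpoints) / sqrt(endpoints)
```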

7.5 Likelihood Interval (also called Wilks)

This interval is the set of \(\mu\) such that \[ 2 [ l(\hat{\mu}) - l(\mu) ] < c \] where \(c\) is the \(1 - \alpha\) quantile of the chi-square distribution with one degree of freedom. We find the endpoints numerically using R function uniroot.

crit <- qchisq(conf.level, df = 1)
fred <- function(mu) 2 * (logl(mu.hat) - logl(mu)) - crit
tol <- sqrt(.Machine$double.eps)
if (mu.hat == 0) {
    low <- 0
} else {
    low <- uniroot(fred, lower = 0, upper = mu.hat, tol = tol)$root
}
hig <- uniroot(fred, lower = mu.hat, upper = 2 * mu.hat,
    extendInt = "upX", tol = tol)$root
c(low, hig)
## [1] 10.14533 26.40924

8 Hypothesis Tests

8.1 Hypotheses

8.1.1 Upper-Tailed

\[\begin{align*} H_0: & \mu = \mu_0 \\ H_1: & \mu > \mu_0 \end{align*}\]

8.1.2 Lower-Tailed

\[\begin{align*} H_0: & \mu = \mu_0 \\ H_1: & \mu < \mu_0 \end{align*}\]

8.1.3 Two-Tailed

\[\begin{align*} H_0: & \mu = \mu_0 \\ H_1: & \mu \neq \mu_0 \end{align*}\]

Suppose we have

mu.zero <- 13

8.2 Score Test (also called Rao)

The test statistic is (repeating what was said in the section on the score interval above) \[ t = \frac{x - \mu}{\sqrt{\mu}} \] where in both places \(\mu\) is now the value of \(\mu\) hypothesized under the null hypothesis. Call that \(\mu_0\) and we have \[ t = \frac{x - \mu_0}{\sqrt{\mu_0}} \]

tstat <- (x - mu.zero) / sqrt(mu.zero)
tstat
## [1] 1.1094
# P-values
# upper-tailed
pnorm(tstat, lower.tail = FALSE)
## [1] 0.1336287
# lower-tailed
pnorm(tstat)
## [1] 0.8663713
# two-tailed
2 * pnorm(abs(tstat), lower.tail = FALSE)
## [1] 0.2672575

8.3 Likelihood Ratio Test (also called Wilks)

As always the test statistic for a two-tailed test is \[ t = 2 [ l(\hat{\mu}) - l(\mu_0) ] \] and has a chi-square distribution on one degree of freedom under the null hypothesis.

This can be converted into an asymptotically standard normal test statistic by taking the square root and giving it the sign of \(\hat{\mu} - \mu_0\), because the square of a standard normal random variable is chi-square with one degree of freedom.

And we can also use this signed likelihood ratio test statistic for the two-tailed test because its square is the original likelihood ratio test statistic.

tstat <- sign(mu.hat - mu.zero) * sqrt(2 * (logl(mu.hat) - logl(mu.zero)))
tstat
## [1] 1.058761
# P-values
# upper-tailed
pnorm(tstat, lower.tail = FALSE)
## [1] 0.1448542
# lower-tailed
pnorm(tstat)
## [1] 0.8551458
# two-tailed
2 * pnorm(abs(tstat), lower.tail = FALSE)
## [1] 0.2897085

8.4 Wald Test

This is just like the score test except that the variance is estimated under the alternative rather than under the null. \[ t = \frac{x - \mu_0}{\sqrt{\hat{\mu}}} \] and this has the problem of blowing up when \(\hat{\mu} = 0\). It also doesn’t make intuitive sense to assume \(\mu_0\) is the true unknown parameter value and then not use \(\sqrt{\mu_0}\) for the standard deviation.

So nobody recommends this (for Poisson), and we won’t even illustrate it.

8.5 Exact Test

Here the test statistic is just \(x\).

# P-values
# upper-tailed
ppois(x - 1, mu.zero, lower.tail = FALSE)
## [1] 0.1645069
# lower-tailed
ppois(x, mu.zero)
## [1] 0.890465

The reason for the x - 1 is the discreteness of the Poisson distribution: lower.tail = FALSE gives \(\Pr(X > q)\) rather than \(\Pr(X \ge q)\), so to get the upper-tailed P-value \(\Pr(X \ge x)\) we must use \(q = x - 1\).
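This can be checked by summing the Poisson density directly.

```r
x <- 17          # as above
mu.zero <- 13    # as above
# lower.tail = FALSE gives P(X > q), so P(X >= x) needs q = x - 1
ppois(x - 1, mu.zero, lower.tail = FALSE)
# the same thing by brute force (1000 is just an arbitrary large cutoff)
sum(dpois(x:1000, mu.zero))
```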

There is no exact two-tailed test because the exact (Poisson) distribution is not symmetric, so there is no reason to use \(\lvert X - \mu_0 \rvert\) as a test statistic. One could also say there are many possible two-tailed exact tests and no reason to prefer any one of them.

8.6 Fuzzy Test

There is a fuzzy test, but there is no computer implementation available. So we don’t cover that either.

9 Bayesian Inference

Here we just follow Bayesian inference for the binomial distribution changing what needs to be changed in going from binomial to Poisson.

9.1 Conjugate Distribution

For the binomial distribution we were told (this is derived in theory courses) that the conjugate prior distribution to the binomial data model is the beta distribution.

Now we tell you (also derived in theory courses) that the conjugate prior distribution to the Poisson data model is the gamma distribution. And we will rely on R’s knowledge of this distribution for the most part.

The gamma distribution has two parameters: \(\alpha\) the shape parameter and \(\lambda\) the rate parameter. Like the beta distribution, it has qualitatively different behavior for different values of the shape parameter \[ \lim_{x \to 0} f_{\alpha, \lambda}(x) = \begin{cases} \infty, & 0 < \alpha < 1 \\ \lambda, & \alpha = 1 \\ 0, & 1 < \alpha < \infty \end{cases} \] But (unlike the beta distribution) the upper bound of the sample space is infinity, as it must be because the support of the prior and posterior must be the parameter space of the data distribution, which is \(0 < \mu < \infty\) for the parameter of the Poisson distribution, and \[ \lim_{x \to \infty} f_{\alpha, \lambda}(x) = 0 \] as must be the case for any probability density function.
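These limiting cases can be seen directly using R function dgamma (the rate 2 here is arbitrary).

```r
lambda <- 2   # arbitrary rate, for illustration
# shape < 1: the density blows up as x -> 0
dgamma(1e-8, shape = 0.5, rate = lambda)
# shape = 1 (the exponential distribution): the density at zero is lambda
dgamma(0, shape = 1, rate = lambda)
# shape > 1: the density at zero is zero
dgamma(0, shape = 2, rate = lambda)
```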

So unlike the beta distribution, there is no special case of the gamma distribution that is flat (uniformly distributed, putatively uninformative).

As always in Bayesian inference, we take the parameter of the data distribution

  • \(\mu\) for the Poisson distribution

to be the variable of the prior or posterior distribution (here these are gamma distributions) and the parameters of those distributions we call hyperparameters:

  • \(\alpha\) the shape parameter of the gamma distribution and

  • \(\lambda\) the rate parameter of the gamma distribution.

In short, \(\mu\) is the parameter, and \(\alpha\) and \(\lambda\) are the hyperparameters.

It may help in choosing the hyperparameters of the prior distribution that

  • the mean of the gamma distribution is \(\alpha / \lambda\), and

  • the variance of the gamma distribution is \(\alpha / \lambda^2\), so

  • the standard deviation of the gamma distribution is \(\sqrt{\alpha} / \lambda\).
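So to get a prior with a given mean \(m\) and standard deviation \(s\), one can solve these equations: \(\lambda = m / s^2\) and \(\alpha = m^2 / s^2\). A quick check (the numbers here are arbitrary, for illustration only).

```r
# back out gamma hyperparameters from a desired prior mean m and
# standard deviation s (m and s here are arbitrary illustrative numbers)
m <- 9
s <- 3
lambda <- m / s^2
alpha <- m^2 / s^2
# check: recover the desired mean and standard deviation
c(mean = alpha / lambda, sd = sqrt(alpha) / lambda)
```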

9.2 Bayes Rule

Now we tell you (also derived in theory courses) that what Bayes rule says for Poisson data and gamma prior is

  • if the hyperparameters of the prior are \(\alpha\) and \(\lambda\),

  • then the hyperparameters of the posterior are \(\alpha + x\) and \(\lambda + 1\).
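One can check this update numerically: Bayes' rule says the posterior density is proportional to likelihood times prior, so renormalizing that product should recover dgamma with the updated hyperparameters. A sketch, using the same prior hyperparameters and data as the example below.

```r
# numerical check of the conjugate update
alpha <- 2.7
lambda <- 0.3
x <- 17
# unnormalized posterior: likelihood times prior
unnormalized <- function(mu) dpois(x, mu) * dgamma(mu, alpha, rate = lambda)
const <- integrate(unnormalized, 0, Inf)$value
mu <- c(5, 10, 15, 20)
# the two columns agree (up to numerical integration error)
cbind(renormalized = unnormalized(mu) / const,
    conjugate = dgamma(mu, alpha + x, rate = lambda + 1))
```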

We make a plot for an example, like we did for binomial data,

alpha <- 2.7
lambda <- 0.3
curve(dgamma(mu, x + alpha, rate = lambda + 1), from = 0, to = 30,
    xname = "mu", xlab = expression(mu), ylab = "probability density")
curve(dgamma(mu, alpha, rate = lambda), from = 0, to = 30,
    xname = "mu", add = TRUE, lty = "dashed")
Poisson data x = 17. Solid curve posterior, dashed curve prior.

Note that, as always in Bayesian inference, the posterior is closer to the observed data than the prior is, but not exactly at the observed data, because the posterior incorporates both data and prior information in exactly the way Bayes’ rule says it should.

9.2.1 Point Estimates

One can also make point estimates based on the posterior. For example, with our current posterior, we have the following.

# update hyperparameters to be those for posterior
alpha <- x + alpha
lambda <- lambda + 1
# posterior mean
alpha / lambda
## [1] 15.15385
# posterior median
qgamma(0.5, alpha, rate = lambda)
## [1] 14.89822
# posterior mode
max((alpha - 1) / lambda, 0)
## [1] 14.38462
# posterior standard deviation
sqrt(alpha) / lambda
## [1] 3.414206

9.2.2 Interval Estimates

One can also make interval estimates based on the posterior. For example, with our current posterior, an equal-tailed 95% Bayesian credible interval is

# hyperparameters have already been updated to be those for posterior
crit <- c((1 - conf.level) / 2, (1 + conf.level) / 2)
names(crit) <- crit
qgamma(crit, alpha, rate = lambda)
##     0.025     0.975 
##  9.217489 22.541946

Bayesians say “credible interval” rather than “confidence interval” to emphasize that they are doing something different from the frequentists.

9.3 The Bayesian Learning Paradigm

As we saw with the binomial distribution, the Bayesian learning paradigm applies. If we get more data from the same data distribution, then the posterior for the previous analysis becomes the prior for the next analysis.

So, if \(X_1\), \(X_2\), \(\ldots\) are IID Poisson random variables and we do a Bayesian analysis with \(\mu\) as the parameter and \(\alpha\) and \(\lambda\) as hyperparameters,

  • the prior before the 2nd data point is observed is \[ \text{Gamma}(\alpha + x_1, \lambda + 1) \]
  • and the posterior after the 2nd data point is observed is \[ \text{Gamma}(\alpha + x_1 + x_2, \lambda + 2) \] and, in general, the posterior after \(n\) data points have been observed is \[ \text{Gamma}(\alpha + x_1 + x_2 + \cdots + x_n, \lambda + n) \]
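This can be verified in code: updating one observation at a time ends with the same posterior hyperparameters as updating with all the data at once. A sketch with a made-up prior and data.

```r
# sequential versus batch conjugate updating (prior and data made up)
alpha <- 2.7
lambda <- 0.3
y <- c(17, 12, 20, 15)
# one observation at a time
a <- alpha
l <- lambda
for (yi in y) {
    a <- a + yi    # shape picks up each new count
    l <- l + 1     # rate picks up each new sample size
}
c(a, l)
# all at once: the same hyperparameters
c(alpha + sum(y), lambda + length(y))
```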

9.4 Bayesian Asymptotics

Also note that, as always in Bayesian inference, the posterior is more concentrated (less spread out) than the prior. In fact, it obeys the square root law (as we saw for the binomial) because the posterior mean is approximately \[ \frac{\alpha_\text{posterior}}{\lambda_\text{posterior}} = \frac{\alpha + x_1 + x_2 + \cdots + x_n}{\lambda + n} \approx \bar{x}_n \] and the posterior standard deviation is approximately \[ \frac{\sqrt{\alpha_\text{posterior}}}{\lambda_\text{posterior}} = \frac{\sqrt{\alpha + x_1 + x_2 + \cdots + x_n}}{\lambda + n} \approx \sqrt{\frac{\bar{x}_n}{n}} \] which agrees with the frequentist asymptotics and obeys the square root law as the Bernstein–von Mises theorem says it must.
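A quick simulation (made-up true mean and prior hyperparameters) illustrating these approximations.

```r
# simulated data; the true mean and prior hyperparameters are arbitrary
set.seed(42)
alpha <- 2.7
lambda <- 0.3
n <- 1000
y <- rpois(n, 10)
a.post <- alpha + sum(y)
l.post <- lambda + n
# posterior mean versus the sample mean
c(posterior.mean = a.post / l.post, xbar = mean(y))
# posterior standard deviation versus the square root law
c(posterior.sd = sqrt(a.post) / l.post, root.law = sqrt(mean(y) / n))
```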

So, as often (but not always, because the Bernstein–von Mises theorem has conditions that the binomial and Poisson distributions satisfy but not all data distributions do), the Bayesians agree quantitatively (but not philosophically) with the frequentists for large sample sizes.