This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).
The R markdown source for this document is https://www.stat.umn.edu/geyer/5421/notes/poisson.Rmd.
The version of R used to make this document is 4.5.1.
The version of the rmarkdown package used to make this
document is 2.29.
This web page is just a copy of the web page about inference for the binomial distribution changing what needs to be changed to work for the Poisson distribution instead.
In our example the observed count is 17.
x <- 17
The usual estimator of the parameter \(\mu\) is \(\hat{\mu} = x\). This turns out to also be the maximum likelihood estimator (MLE).
mu.hat <- x
If we had IID data from a Poisson distribution, the MLE would be the sample mean. But here we have sample size n = 1.
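As a quick sanity check, we can simulate some IID Poisson data (the data y, the sample size 10, and the mean 7.5 here are all made up for illustration) and verify numerically that the sample mean maximizes the log likelihood.
set.seed(42)
y <- rpois(10, lambda = 7.5)   # made-up IID Poisson data
loglik <- function(mu) sum(dpois(y, mu, log = TRUE))
# the maximizer should agree with the sample mean
optimize(loglik, interval = c(0.01, 30), maximum = TRUE)$maximum
mean(y)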
We can plot the log likelihood function using the following code.
logl <- function(mu) {
    # Poisson log likelihood, dropping the additive constant - log(x!)
    result <- x * log(mu) - mu
    # when x = 0 and mu = 0, R evaluates 0 * log(0) as NaN, but the
    # correct value of the log likelihood there is - mu = 0
    result[is.nan(result)] <- - mu[is.nan(result)]
    return(result)
}
curve(logl, from = 0.0001, to = 50, n = 1001,
    xlab = expression(mu), ylab = "log likelihood",
    ylim = logl(mu.hat) + c(-10, 0))
abline(v = mu.hat, lty = "dashed")
Log Likelihood
The dashed vertical line shows where the MLE is, and it does appear to be where the log likelihood is maximized. The log likelihood goes to minus infinity as \(\mu \to \infty\) and, except when \(x = 0\), also as \(\mu \to 0\). (When \(x = 0\) the log likelihood is \(-\mu\), which increases to 0 as \(\mu \to 0\).) Thus we cannot draw the curve from zero to infinity but rather from a little bit above zero to some large number. But that gives a plot in which it is hard to see what is going on.
Hence the ylim optional argument to R function curve. It will turn out that the only interesting part of the log likelihood is the region near the maximum, so we include in our plot only the part of the curve where the log likelihood is within 10 of the maximum. The 10 was pulled out of the air; we don't want to get scientific about this yet (but we do in the section on likelihood-based confidence intervals below). It also doesn't really matter, since we are just using this plot to get some idea of what is going on. We don't use this plot for statistical inference.
Set the confidence level. It is best programming practice never to hard code numbers like this; that is, the number 0.95 should occur in your document only once, where it is used to initialize a variable. Then use that variable elsewhere. If there is ever a need to change it (say to 0.90 for a different confidence level), it need only be changed in one place and will be used consistently everywhere.
conf.level <- 0.95
The Wald interval is \[ \hat{\mu} \pm z_{\alpha / 2} \cdot \sqrt{\hat{\mu}} \] where \(z_{\alpha / 2}\) is the \(1 - \alpha / 2\) quantile of the standard normal distribution, because the variance of the Poisson distribution is \(\mu\), which the Wald procedure estimates by \(\hat{\mu}\).
mu.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) * sqrt(mu.hat)
## [1] 8.918861 25.081139
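We can also check by simulation roughly how well the Wald interval performs (this little simulation, with its made-up names mu.true, xsim, and zz, is just a sketch): simulate many Poisson counts with a known mean and see how often the interval covers the truth.
set.seed(42)
mu.true <- 17   # made-up true mean for the simulation
xsim <- rpois(1e4, mu.true)
zz <- qnorm((1 + conf.level) / 2)
# proportion of simulated Wald intervals covering the true mean,
# which should be near conf.level
mean(abs(xsim - mu.true) <= zz * sqrt(xsim))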
As with the binomial distribution, the Wald interval for the Poisson distribution is zero width, hence ridiculous, when \(\hat{\mu} = 0\).
The Wald interval can be repaired by using a different procedure (Geyer, 2009, DOI: 10.1214/08-EJS349). For the Poisson distribution, this recipe uses the interval \([0, - \log(\alpha)]\) for coverage \(1 - \alpha\). Thus when we observe \(x = 0\) and want 95% confidence, the interval is
alpha <- 1 - conf.level
c(0, - log(alpha))
## [1] 0.000000 2.995732
The same fix can be used for any confidence interval, not just the Wald interval.
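A quick check (ours) of where the upper endpoint comes from: \(\mu = -\log(\alpha)\) is exactly the value at which observing \(x = 0\) has probability \(e^{\log(\alpha)} = \alpha\).
# P(X = 0) at the upper endpoint equals alpha
dpois(0, - log(alpha))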
To derive the score interval we invert the score test. The test statistic is \[ \frac{x - \mu}{\sqrt{\mu}} \] so the interval is the set of \(\mu\) such that \[ \left\lvert \frac{x - \mu}{\sqrt{\mu}} \right\rvert < z_{\alpha / 2} \] or \[ x^2 - 2 x \mu + \mu^2 < \mu z_{\alpha / 2}^2 \] or \[ \mu^2 - (2 x + z_{\alpha / 2}^2) \mu + x^2 < 0. \] From the quadratic formula this interval has endpoints \[ \frac{2 x + z_{\alpha / 2}^2 \pm \sqrt{(2 x + z_{\alpha / 2}^2)^2 - 4 x^2}}{2} \] or, simplifying, \[ x + \frac{z_{\alpha / 2}^2}{2} \pm z_{\alpha / 2} \sqrt{x + \frac{z_{\alpha / 2}^2}{4}} \]
z <- qnorm((1 + conf.level) / 2)
x + z^2 / 2 + c(-1, 1) * z * sqrt(x + z^2 / 4)
## [1] 10.61447 27.22699
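We can check (our check, with the made-up name endpoints) that these endpoints do what the derivation says: the score statistic evaluated at them should equal \(\pm z_{\alpha / 2}\).
endpoints <- x + z^2 / 2 + c(-1, 1) * z * sqrt(x + z^2 / 4)
# should be + z at the lower endpoint and - z at the upper endpoint
(x - endpoints) / sqrt(endpoints)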
The likelihood-based confidence interval is the set of \(\mu\) such that \(2 [ l(\hat{\mu}) - l(\mu) ] < \text{crit}\), where crit is the \(1 - \alpha\) quantile of the chi-square distribution on one degree of freedom. We find its endpoints numerically with R function uniroot.
crit <- qchisq(conf.level, df = 1)
# fred is zero exactly at the endpoints of the likelihood interval
fred <- function(mu) 2 * (logl(mu.hat) - logl(mu)) - crit
tol <- sqrt(.Machine$double.eps)
# when x = 0 the log likelihood is maximized at zero,
# so the lower endpoint is zero
if (mu.hat == 0) {
    low <- 0
} else {
    low <- uniroot(fred, lower = 0, upper = mu.hat, tol = tol)$root
}
hig <- uniroot(fred, lower = mu.hat, upper = 2 * mu.hat,
    extendInt = "upX", tol = tol)$root
c(low, hig)
c(low, hig)
## [1] 10.14533 26.40924
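As a check (ours), the log likelihood at these endpoints should be exactly crit / 2 below the maximum.
# both components should equal crit = qchisq(conf.level, df = 1)
2 * (logl(mu.hat) - logl(c(low, hig)))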
We now consider hypothesis tests about \(\mu\). The upper-tailed test has hypotheses \[\begin{align*} H_0: & \mu = \mu_0 \\ H_1: & \mu > \mu_0 \end{align*}\] the lower-tailed test has hypotheses \[\begin{align*} H_0: & \mu = \mu_0 \\ H_1: & \mu < \mu_0 \end{align*}\] and the two-tailed test has hypotheses \[\begin{align*} H_0: & \mu = \mu_0 \\ H_1: & \mu \neq \mu_0 \end{align*}\]
Suppose we have
mu.zero <- 13
The test statistic is (repeating what was said in the section on the score interval above) \[ t = \frac{x - \mu}{\sqrt{\mu}} \] where in both places \(\mu\) is now the value of \(\mu\) hypothesized under the null hypothesis. Call that \(\mu_0\) and we have \[ t = \frac{x - \mu_0}{\sqrt{\mu_0}} \]
tstat <- (x - mu.zero) / sqrt(mu.zero)
tstat
## [1] 1.1094
# P-values
# upper-tailed
pnorm(tstat, lower.tail = FALSE)
## [1] 0.1336287
# lower-tailed
pnorm(tstat)
## [1] 0.8663713
# two-tailed
2 * pnorm(abs(tstat), lower.tail = FALSE)
## [1] 0.2672575
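Note the duality between tests and confidence intervals: the two-tailed score test accepts \(\mu_0\) at level \(\alpha\) exactly when \(\mu_0\) is inside the score interval computed above. We can check that here (our check), since \(\mu_0 = 13\) is inside that interval.
# TRUE: mu.zero is not rejected at level alpha
abs(tstat) < qnorm((1 + conf.level) / 2)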
As always, the likelihood ratio test statistic for the two-tailed test is \[ t = 2 [ l(\hat{\mu}) - l(\mu_0) ] \] and it has an asymptotic chi-square distribution on one degree of freedom under the null hypothesis.
This can be converted into an asymptotically standard normal test statistic, usable for one-tailed tests, by taking the square root and attaching the sign of \(\hat{\mu} - \mu_0\), because the square of a standard normal random variable is chi-square with one degree of freedom.
We can also use this signed likelihood ratio test statistic for the two-tailed test, because its square is the original likelihood ratio test statistic.
tstat <- sign(mu.hat - mu.zero) * sqrt(2 * (logl(mu.hat) - logl(mu.zero)))
tstat
## [1] 1.058761
# P-values
# upper-tailed
pnorm(tstat, lower.tail = FALSE)
## [1] 0.1448542
# lower-tailed
pnorm(tstat)
## [1] 0.8551458
# two-tailed
2 * pnorm(abs(tstat), lower.tail = FALSE)
## [1] 0.2897085
The Wald test is just like the score test except that the variance is estimated under the alternative rather than under the null, \[ t = \frac{x - \mu_0}{\sqrt{\hat{\mu}}} \] and this has the problem of blowing up when \(\hat{\mu} = 0\). It also doesn't make intuitive sense to assume \(\mu_0\) is the true unknown parameter value and then not use \(\sqrt{\mu_0}\) for the standard deviation.
So nobody recommends this (for Poisson), and we won’t even illustrate it.
For the exact test the test statistic is just \(x\), and P-values come from its exact Poisson null distribution.
# P-values
# upper-tailed
ppois(x - 1, mu.zero, lower.tail = FALSE)
## [1] 0.1645069
# lower-tailed
ppois(x, mu.zero)
## [1] 0.890465
The reason for the x - 1 is the discreteness of the Poisson distribution: with lower.tail = FALSE, R function ppois computes \(\Pr(X > q)\) rather than \(\Pr(X \ge q)\), so the upper-tailed P-value \(\Pr(X \ge x)\) is ppois(x - 1, mu.zero, lower.tail = FALSE).
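To see this concretely (our check), the upper-tailed P-value \(\Pr(X \ge x)\) can also be computed as one minus the lower tail or as an explicit sum, truncated where the terms are negligible.
1 - ppois(x - 1, mu.zero)    # same as lower.tail = FALSE above
sum(dpois(x:200, mu.zero))   # explicit sum, truncated at 200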
There is no exact two-tailed test because the exact (Poisson) distribution is not symmetric, so there is no reason to use \(\lvert X - \mu_0 \rvert\) as a test statistic. One could also say there are many possible two-tailed exact tests and no reason to prefer any one of them.
There is a fuzzy test, but there is no computer implementation available. So we don’t cover that either.
Here we just follow Bayesian inference for the binomial distribution changing what needs to be changed in going from binomial to Poisson.
For the binomial distribution we were told (this is derived in theory courses) that the conjugate prior distribution to the binomial data model is the beta distribution.
Now we tell you (also derived in theory courses) that the conjugate prior distribution to the Poisson data model is the gamma distribution. And we will rely on R’s knowledge of this distribution for the most part.
The gamma distribution has two parameters: \(\alpha\), the shape parameter, and \(\lambda\), the rate parameter, and, like the beta distribution, it has qualitatively different behavior for different values of the shape parameter \[ \lim_{x \to 0} f_{\alpha, \lambda}(x) = \begin{cases} \infty, & 0 < \alpha < 1 \\ \lambda, & \alpha = 1 \\ 0, & 1 < \alpha < \infty \end{cases} \] but (unlike the beta distribution) the upper bound of the sample space is infinity (as it must be because the support of the prior and posterior must be the parameter space of the data distribution, which is \(0 < \mu < \infty\) for the parameter of the Poisson distribution) and \[ \lim_{x \to \infty} f_{\alpha, \lambda}(x) = 0 \] as must be the case for any probability density function.
So unlike the beta distribution, there is no special case of the gamma distribution that is flat (uniformly distributed, putatively uninformative).
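Here is a quick plot (ours, with an arbitrarily chosen rate \(\lambda = 1\)) showing the three regimes of behavior of the gamma density near zero.
# shape 0.5: density blows up at zero (clipped by ylim)
curve(dgamma(x, shape = 0.5, rate = 1), from = 0.001, to = 5, n = 1001,
    ylim = c(0, 1.5), ylab = "probability density")
# shape 1 (exponential): density is lambda = 1 at zero
curve(dgamma(x, shape = 1, rate = 1), add = TRUE, lty = "dashed")
# shape 2: density is zero at zero
curve(dgamma(x, shape = 2, rate = 1), add = TRUE, lty = "dotted")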
As always in Bayesian inference, we take the parameter of the data distribution
to be the variable of the prior or posterior distribution (here these are gamma distributions) and the parameters of those distributions we call hyperparameters:
\(\alpha\) the shape parameter of the gamma distribution and
\(\lambda\) the rate parameter of the gamma distribution.
In short, \(\mu\) is the parameter, and \(\alpha\) and \(\lambda\) are the hyperparameters.
It may help in choosing the hyperparameters of the prior distribution that
the mean of the gamma distribution is \(\alpha / \lambda\), and
the variance of the gamma distribution is \(\alpha / \lambda^2\), so
the standard deviation of the gamma distribution is \(\sqrt{\alpha} / \lambda\).
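For example (the prior guesses here are made up, as are the names prior.mean, prior.sd, alpha0, and lambda0), to get a prior with mean 9 and standard deviation 5, solve these two equations for the hyperparameters.
prior.mean <- 9   # made-up prior guess at mu
prior.sd <- 5     # made-up prior uncertainty
lambda0 <- prior.mean / prior.sd^2   # from var = alpha / lambda^2
alpha0 <- prior.mean * lambda0       # from mean = alpha / lambda
c(alpha0, lambda0)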
Now we tell you (also derived in theory courses) what Bayes' rule says for Poisson data and a gamma prior:
if the hyperparameters of the prior are \(\alpha\) and \(\lambda\),
then the hyperparameters of the posterior are \(\alpha + x\) and \(\lambda + 1\).
We make a plot for an example, like we did for binomial data.
alpha <- 2.7
lambda <- 0.3
curve(dgamma(mu, x + alpha, rate = lambda + 1), from = 0, to = 30,
    xname = "mu", xlab = expression(mu), ylab = "probability density")
curve(dgamma(mu, alpha, rate = lambda), from = 0, to = 30,
    xname = "mu", add = TRUE, lty = "dashed")
Poisson data x = 17. Solid curve posterior, dashed curve prior.
Note that, as always in Bayesian inference, the posterior is closer to the observed data than the prior is, but not exactly at the observed data, because the posterior combines data and prior information in exactly the way Bayes' rule says it should.
One can also make point estimates based on the posterior, for example, with our current posterior, we have the following.
# update hyperparameters to be those for posterior
alpha <- x + alpha
lambda <- lambda + 1
# posterior mean
alpha / lambda
## [1] 15.15385
# posterior median
qgamma(0.5, alpha, rate = lambda)
## [1] 14.89822
# posterior mode, which is (alpha - 1) / lambda when alpha >= 1
# and zero otherwise
max((alpha - 1) / lambda, 0)
## [1] 14.38462
# posterior standard deviation
sqrt(alpha) / lambda
## [1] 3.414206
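We can check these against simulation (our check, with the made-up name mu.sim): draws from the posterior should reproduce the exact summaries above up to Monte Carlo error.
set.seed(42)
mu.sim <- rgamma(1e6, alpha, rate = lambda)
mean(mu.sim)     # compare to exact posterior mean
median(mu.sim)   # compare to exact posterior median
sd(mu.sim)       # compare to exact posterior standard deviation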
One can also make interval estimates based on the posterior, for example, with our current posterior, an equal-tailed 95% Bayesian credible interval is
# hyperparameters have already been updated to be those for posterior
crit <- c((1 - conf.level) / 2, (1 + conf.level) / 2)
names(crit) <- crit
qgamma(crit, alpha, rate = lambda)
## 0.025 0.975
## 9.217489 22.541946
Bayesians say “credible interval” rather than “confidence interval” to emphasize that they are doing something different from the frequentists.
As we saw with the binomial distribution, the Bayesian learning paradigm applies. If we get more data from the same data distribution, then the posterior for the previous analysis becomes the prior for the next analysis.
So, if \(X_1\), \(X_2\), \(\ldots\) are IID Poisson random variables and we do a Bayesian analysis with \(\mu\) as the parameter and \(\alpha\) and \(\lambda\) as the prior hyperparameters, then after observing \(x_1, x_2, \ldots, x_n\) the posterior is the gamma distribution with hyperparameters \(\alpha + x_1 + x_2 + \cdots + x_n\) and \(\lambda + n\), whether we process the data all at once or one observation at a time.
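In code (a sketch with made-up data y, reusing the prior hyperparameters from the example above under the made-up names alpha0 and lambda0), updating one observation at a time gives the same posterior as updating all at once.
y <- c(17, 12, 15)   # made-up IID Poisson data
alpha0 <- 2.7        # prior hyperparameters from the example above
lambda0 <- 0.3
a <- alpha0
lam <- lambda0
for (yi in y) {      # one conjugate update per observation
    a <- a + yi
    lam <- lam + 1
}
c(a, lam)
c(alpha0 + sum(y), lambda0 + length(y))   # batch update, same answer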
Also note that, as always in Bayesian inference, the posterior is more concentrated (less spread out) than the prior. In fact, it obeys the square root law (as we saw for the binomial) because the posterior mean is approximately \[ \frac{\alpha_\text{posterior}}{\lambda_\text{posterior}} = \frac{\alpha + x_1 + x_2 + \cdots + x_n}{\lambda + n} \approx \bar{x}_n \] and the posterior standard deviation is approximately \[ \frac{\sqrt{\alpha_\text{posterior}}}{\lambda_\text{posterior}} = \frac{\sqrt{\alpha + x_1 + x_2 + \cdots + x_n}}{\lambda + n} \approx \sqrt{\frac{\bar{x}_n}{n}} \] which agrees with the frequentist asymptotics and obeys the square root law as the Bernstein–von Mises theorem says it must.
So, as often happens (but not always, because the Bernstein–von Mises theorem has conditions that the binomial and Poisson distributions satisfy but not all data distributions do), the Bayesians agree quantitatively (though not philosophically) with the frequentists for large sample sizes.