---
title: "Stat 5421 Lecture Notes: Statistical Inference for the Binomial Distribution"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
number_sections: true
md_extensions: -tex_math_single_backslash
mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
pdf_document:
number_sections: true
latex_engine: xelatex
---
# License
This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License
(http://creativecommons.org/licenses/by-sa/4.0/).
# Data
In our example there are two successes in 25 trials.
```{r data}
x <- 2
n <- 25
```
# Maximum Likelihood Estimator
The usual estimator of the parameter $\pi$ is $\hat{\pi} = x / n$.
This turns out to also be the maximum likelihood estimator.
```{r mle}
pi.hat <- x / n
```
# Log Likelihood
We can plot the log likelihood function using the following code
```{r label="logl.plot", fig.cap="Log Likelihood", fig.align='center'}
logl <- function(pi) {
result <- x * log(pi) + (n - x) * log(1 - pi)
result[is.nan(result)] <- 0
return(result)
}
curve(logl, from = 0, to = 1, n = 10001,
xlab = expression(pi), ylab = "log likelihood",
ylim = logl(pi.hat) + c(-10, 0))
abline(v = pi.hat, lty = "dashed")
```
The dashed vertical line shows where the MLE is, and it does appear to
be where the log likelihood is maximized. The log likelihood goes to
minus infinity as $\pi \to 0$ or $\pi \to 1$. Except when $x = 0$ the
log likelihood increases to 0 as $\pi \to 0$, and when $x = n$ it
increases to 0 as $\pi \to 1$.
Hence we need a `ylim` optional argument to R function `curve`
because otherwise the curve would go to minus infinity and nothing could
be seen.
It will turn out that the only interesting part of the
log likelihood is the region near the maximum. Hence we include in our
plot only the part of the curve in which the log likelihood is within 10
of the maximum. The 10 was pulled out of the air. We don't really want
to get scientific about this yet (but do in [the section on likelihood-based
confidence intervals below](#likelihood-interval)).
Also it really doesn't matter, since
we are just using this plot to get some idea what is going on.
We don't use this plot for statistical inference.
# Confidence Level
Set the confidence level. It is best programming practice to never
hard code numbers like this, that is, the number 0.95 should only occur
in your document once where it is used to initialize a variable.
Then use that variable elsewhere.
If there is ever a need to change this (to use say 0.90 for the confidence
level), then it only need be changed in one place and will be consistently
used everywhere.
```{r conf.level}
conf.level <- 0.95
```
This is an example of using the DRY/SPOT rule (Wikipedia pages
Don't Repeat
Yourself and
Single Point
of Truth).
# Wald, Wilks, Rao
Section 1.3.3 in Agresti discusses the three main strategies
for constructing hypothesis
tests. And each kind of hypothesis goes with a confidence interval that is
derived by inverting the test. So there are the same three strategies for
confidence intervals.
The three strategies are
* Wald test
* Wilks test, also called likelihood ratio test
* Rao test, also called score test and Lagrange multiplier test (this
last name is used mostly by economists).
All of these procedures are asymptotically equivalent under the usual
asymptotics of maximum likelihood. Hence none is better than the others
for sufficiently large sample size.
But they may be quite different for small sample sizes.
# Confidence Intervals
## Wald Confidence Interval
This used to be the standard taught in intro stats, maybe it still is in
many such courses. These are the only intervals of the form
$$
\text{point estimate} \pm \text{critical value} \times \text{standard error}
$$
```{r wald}
pi.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) * sqrt(pi.hat * (1 - pi.hat) / n)
```
We know $0 \le \pi \le 1$ but this interval
makes no use of that information, giving a lower bound that is negative.
This may look ridiculous, but is not wrong. If we know that $0 \le \pi$,
then it is true that
$`r pi.hat - qnorm((1 + conf.level) / 2) * sqrt(pi.hat * (1 - pi.hat) / n)` \le \pi$
too.
Agresti (Section 1.4) points out that this interval gives ridiculous
zero-width confidence intervals when $\hat{\pi}$ is equal to zero or one
(when the data $x$ is equal to zero or $n$). However, the Wald interval
can be repaired by using a different procedure (Geyer, 2009,
*Electronic Journal of Statistics*, **3**, 259--289) that was illustrated
on the [web page discussing coverage of confidence
intervals](http://www.stat.umn.edu/geyer/5102/examp/coverage.html#musual).
So that is not a reason not to use Wald intervals so long as one knows
how to use this modification when the MLE is on the boundary of the
parameter space.
## Score Interval (also called Rao)
Some intro stats books now teach this.
In particular, [Agresti's intro stats book](https://www.amazon.com/Statistics-Art-Science-Learning-Data/dp/0321997832/) teaches this. Some teach a dumbed down approximation
to this. Since we are using R, we may as well do the right thing, not the
dumbed down version suitable for hand calculation.
```{r score}
prop.test(x, n, conf.level = conf.level, correct = FALSE)
```
The "`correct = FALSE`" is just bizarre.
Does that mean we don't want the correct answer? No.
The R statement `help(prop.test)` explains that
it means we do not want to use "continuity correction". And we don't.
No authority recommends what `prop.test` does by default.
Agresti, Section 1.4.2, recommends `correct = FALSE`.
Brown, Cai and DasGupta (2005,
[DOI: 10.1214/088342305000000395](https://doi.org/10.1214/088342305000000395))
criticize Geyer and Meeden (2005,
[DOI: 10.1214/088342305000000340](https://doi.org/10.1214/088342305000000340))
for using `prop.test` with `correct = TRUE`, providing plots of coverage
probability for with `correct = FALSE` and `correct = TRUE` to show this.
AFAIK no authority defends `correct = TRUE`, even `help(prop.test)` cites
no such authority. They do cite
Newcombe (1998,
[DOI: 10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E](https://doi.org/10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E))
where `correct = FALSE`
is his method 3 and `correct = TRUE` is (apparently) his method 4, but
Newcombe's simulations do not show that `correct = TRUE` is better than
`correct = FALSE` and his conclusion does not recommend `correct = TRUE`
over `correct = FALSE` (but he does not pick a particular method to recommend).
For a full derivation of the score interval and a check that `correct = FALSE`
actually calculates it see
[http://www.stat.umn.edu/geyer/5102/slides/s2.pdf](http://www.stat.umn.edu/geyer/5102/slides/s2.pdf#page=113), slides 113--116 and 123.
(Hmmmm. That does not actually seem to check. We'll check here.)
```{r check.score}
crit <- qnorm((1 + conf.level) / 2)
foo <- (pi.hat + crit^2 / (2 * n) + c(-1, 1) * crit *
sqrt(pi.hat * (1 - pi.hat) / n + crit^2 / (4 * n^2))) / (1 + crit^2 / n)
foo
bar <- prop.test(x, n, conf.level = conf.level, correct = FALSE)
names(bar)
bar$conf.int
all.equal(bar$conf.int, foo, check.attributes = FALSE)
```
## Likelihood Interval (also called Wilks) {#likelihood-interval}
Confidence interval that is a level set of the log likelihood.
This is too esoteric for intro stats.
```{r logl}
crit <- qchisq(conf.level, df = 1)
fred <- function(pi) 2 * (logl(pi.hat) - logl(pi)) - crit
tol <- sqrt(.Machine$double.eps)
if (pi.hat == 0) {
low <- 0
} else {
low <- uniroot(fred, lower = 0, upper = pi.hat, tol = tol)$root
}
if (pi.hat == 1) {
hig <- 1
} else {
hig <- uniroot(fred, lower = pi.hat, upper = 1, tol = tol)$root
}
c(low, hig)
```
We can check we have done the right thing by redoing our log likelihood
plot.
```{r label="logl.ci.plot", fig.cap="Likelihood-Based Confidence Interval", fig.align='center'}
sally <- function(pi) 2 * logl(pi)
curve(sally, from = 0.0001, to = 0.99, n = 1001,
xlab = expression(pi), ylab = "two times log likelihood",
ylim = sally(pi.hat) + c(-6, 0))
abline(h = sally(pi.hat) - crit, lty = "dashed")
abline(v = low, lty = "dashed")
abline(v = hig, lty = "dashed")
```
We can see that the vertical dashed lines, which are the endpoints
of the likelihood-based confidence interval, do indeed intersect the
graph of two times the log likelihood `crit` down from the maximum,
where `crit` is the critical value derived from the chi-squared distribution.
Note that this figure has even less of a vertical range than the preceding
one. This does say that the interesting part of the picture is captured
by both graphs where now "interesting" means the part relevant to a 95%
confidence interval.
## Likelihood Interval using MASS Package
We can also do this with R function `confint` in R package `MASS`.
Except this function botches the calculation when $x = 0$ or $x = n$.
Our calculation above always does the right thing.
```{r confint}
library(MASS)
gout <- glm(rbind(c(x, n - x)) ~ 1, family = binomial)
cout <- suppressMessages(confint(gout, level = conf.level))
cout
fred <- 1 / (1 + exp(- cout))
fred
all.equal(c(low, hig), fred, check.attributes = FALSE, tolerance = 4e-4)
```
## Five Intervals Compared
```{r three}
pi.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) * sqrt(pi.hat * (1 - pi.hat) / n)
as.vector(bar$conf.int)
c(low, hig)
```
As can be seen, the intervals are rather different.
If we were to go back to the top and change the data to
$x = 200$ and $n = 2500$ (both 100 times what they were before),
then the intervals are quite close to each other (not to mention
a lot shorter).
The [web page discussing coverage of confidence
intervals](http://www.stat.umn.edu/geyer/5102/examp/coverage.html)
discusses two more intervals
```{r two-more}
# Clopper-Pearson Binomial Confidence Intervals
binom.test(x, n, conf.level = conf.level)
# Variance Stabilized Binomial Confidence Intervals
# http://www.stat.umn.edu/geyer/5102/slides/s2.pdf#page=120
thetahat <- asin(2 * pi.hat - 1)
crit.val <- qnorm((1 + conf.level) / 2)
(1 + sin(thetahat + c(-1,1) * crit.val / sqrt(n))) / 2
```
All five of these intervals are asymptotically equivalent.
All are correct and say almost the same thing for sufficiently
large sample sizes. No theory says that one is better than another
for small sample sizes with one exception.
When the estimators are on the boundary $\hat{\pi} = 0$ or $\hat{\pi} = 1$,
the Wald and arcsine intervals have abysmally bad performance (so bad that
a wild guess is better).
But, as discussed in the section about the Wald interval, which refers to
the [web page discussing coverage of confidence
intervals](http://www.stat.umn.edu/geyer/5102/examp/coverage.html#musual).
this abysmally bad performance can be fixed.
So there is no reason to prefer one of these intervals (modified, if necessary,
to fix bad behavior at the endpoints) over another.
# Hypothesis Tests (One-Tailed)
## Data
For some reason, we are going to use different data in the hypothesis
tests section, presumably because with the small sample size before
there was no power to reject almost all null hypotheses.
```{r data-too}
x <- 1300
n <- 2500
```
## MLE
```{r mle-too}
pi.hat <- x / n
```
## Null Hypothesis
```{r null}
pi.0 <- 1 / 2
```
## Score Test
Now this is the standard test for proportions taught in intro stats
(regardless of which interval is taught. We will do an upper-tail test.
Test statistic and $P$-value.
```{r score.test}
tstat <- (x - n * pi.0) / sqrt(n * pi.0 * (1 - pi.0))
tstat
pnorm(tstat, lower.tail = FALSE)
```
This is an asymptotic procedure, only approximately correct for large
sample sizes.
## Exact Test
Here is an "exact" test (really only conservative-exact,
the $P$-value is guaranteed to understate the statistical significance).
P-value for exact test
```{r exact}
pbinom(x - 1, n, pi.0, lower.tail = FALSE)
```
This should perhaps be standard in intro stats.
## Fuzzy P-Value
This comes from Geyer and Meeden (*Statistical Science*,
2005, **20**, 358--387).
Here the $P$-value is considered to be uniformly distributed on interval
```{r fuzzy}
pbinom(c(x, x - 1), n, pi.0, lower.tail = FALSE)
```
This test is truly exact (exact-exact rather than conservative-exact)
in the sense that the probability $P \le \alpha$ is equal to $\alpha$
for $0 \le \alpha \le 1$.
This test is what is actually comparable to an exact test with a continuous
test statistic (like a $t$-test, for example).
## Likelihood Ratio Test
For a one-tailed test we have to use the signed likelihood ratio test
statistic.
Here are the test statistic and P-value for this test.
```{r lrt}
tstat <- 2 * (logl(pi.hat) - logl(pi.0))
tstat <- sqrt(tstat)
tstat <- tstat * sign(pi.hat - pi.0)
tstat
pnorm(tstat, lower.tail = FALSE)
```
This too is an asymptotic procedure, only approximately correct for large
sample sizes. It is asymptotically equivalent to the score test. Both
the test statistic and the $P$-value for these two tests (and the Wald test,
which is next) will be nearly equal for large sample sizes.
## Wald Test
Test statistic and $P$-value
```{r wald.test}
tstat <- (x - n * pi.0) / sqrt(n * pi.hat * (1 - pi.hat))
tstat
pnorm(tstat, lower.tail = FALSE)
```
This too is an asymptotic procedure, only approximately correct for large
sample sizes. It is asymptotically equivalent to the score test and the
likelihood ratio test.
The test statistic and the $P$-value for these three tests
will be nearly equal for large sample sizes.
This seems intuitively wrong. The $P$-value is calculated assuming
$\pi_0$ is the true unknown parameter value (in general, assuming the
null hypothesis is true). So if you know $\pi = \pi_0$, why not use that
fact in doing the test? The score test and likelihood ratio test do;
the Wald test doesn't.
(This is related to the Wald test not needing the MLE in the null hypothesis.
Here we have a point null hypothesis, so the MLE in the null hypothesis
is $\pi_0$.) Hence no intro text recommends the Wald test for the binomial
distribution.
Of course many textbooks recommend Wald tests in other situations,
for example, those output by the R generic function `summary`.
# Two-Tailed Tests
Now we illustrate two-tailed tests for the same data.
## Score Test
```{r score.test.two.tail}
tstat <- (x - n * pi.0) / sqrt(n * pi.0 * (1 - pi.0))
tstat
pnorm(abs(tstat), lower.tail = FALSE) * 2
pnorm(- abs(tstat)) * 2
pnorm(abs(tstat), lower.tail = FALSE) + pnorm(- abs(tstat))
```
As can be seen the last three commands are three equivalent ways to calculate
the $P$-value for the two-tailed test using the symmetry of the standard
normal distribution.
## Exact Test {#exact-two-tailed}
Oops! There is no exact non-fuzzy two-tailed test.
Except if the null hypothesis is $\pi_0 = 1 / 2$, then the test statistic
does have a symmetric distribution, so
two tails is twice one tail. But this **does not work** for other $\pi_0$.
## Fuzzy Test
Fuzzy P-value for exact test (Geyer and Meeden, *Statistical Science*,
2005, **20**, 358--387). Now this is not simple, but there is an
R function to do it in R package `ump`.
```{r fuzzy.test.two.tailed}
library(ump)
print(arpv.binom(x, n, pi.0, plot = FALSE))
```
Here because the value of the cumulative distribution is so low
at the middle knot, this is almost a uniform distribution. The
fuzzy $P$-value is approximately uniformly distributed on the interval
```{r fuzzy.test.two.tailed.interval}
aout <- arpv.binom(x, n, pi.0, plot = FALSE)
aout$alpha[2:3]
```
This behavior is not the way this test always works. Sometimes the
distribution of the fuzzy $P$-value is quite complicated. Here's an example,
which is Figure 3 in Geyer and Meeden (*Statistical Science*,
2005, vol 20, pp. 358--387).
```{r fuzzy.other}
aout <- arpv.binom(10, 10, 0.7, plot = FALSE)
aout
```
We make the plot that is the aforementioned Figure 3 as follows.
```{r label="fuzzy.plot", fig.cap="PDF of Fuzzy P-Value", fig.align='center'}
k <- length(aout$alpha)
alpha <- aout$alpha
pdf <- diff(aout$phi) / diff(aout$alpha)
plot(alpha, c(0, pdf), type = "n", xlab = expression(alpha),
ylab = "probability density")
segments(alpha[-k], pdf, alpha[-1], pdf)
segments(alpha, c(0, pdf), alpha, c(pdf, 0), lty = "dashed")
```
The fuzzy $P$-value can be considered to be a random variable
having the probability density function shown in the figure.
## Likelihood Ratio Test
Test statistic and $P$-value
```{r lrt.two.tailed}
tstat <- 2 * (logl(pi.hat) - logl(pi.0))
tstat
pchisq(tstat, df = 1, lower.tail = FALSE)
```
## Wald Test
Test statistic and $P$-value
```{r wald.two.tailed}
tstat <- (x - n * pi.0) / sqrt(n * pi.hat * (1 - pi.hat))
tstat
pnorm(abs(tstat), lower.tail = FALSE) * 2
```