--- title: "Stat 5421 Lecture Notes: Statistical Inference for the Binomial Distribution" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML pdf_document: number_sections: true latex_engine: xelatex --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # Data In our example there are two successes in 25 trials. ```{r data} x <- 2 n <- 25 ``` # Maximum Likelihood Estimator The usual estimator of the parameter $\pi$ is $\hat{\pi} = x / n$. This turns out to also be the maximum likelihood estimator. ```{r mle} pi.hat <- x / n ``` # Log Likelihood We can plot the log likelihood function using the following code ```{r label="logl.plot", fig.cap="Log Likelihood", fig.align='center'} logl <- function(pi) { result <- x * log(pi) + (n - x) * log(1 - pi) result[is.nan(result)] <- 0 return(result) } curve(logl, from = 0, to = 1, n = 10001, xlab = expression(pi), ylab = "log likelihood", ylim = logl(pi.hat) + c(-10, 0)) abline(v = pi.hat, lty = "dashed") ``` The dashed vertical line shows where the MLE is, and it does appear to be where the log likelihood is maximized. The log likelihood goes to minus infinity as $\pi \to 0$ or $\pi \to 1$. Except when $x = 0$ the log likelihood increases to 0 as $\pi \to 0$, and when $x = n$ it increases to 0 as $\pi \to 1$. Hence we need a `ylim` optional argument to R function `curve` because otherwise the curve would go to minus infinity and nothing could be seen. It will turn out that the only interesting part of the log likelihood is the region near the maximum. Hence we include in our plot only the part of the curve in which the log likelihood is within 10 of the maximum. The 10 was pulled out of the air. We don't really want to get scientific about this yet (but do in [the section on likelihood-based confidence intervals below](#likelihood-interval)). Also it really doesn't matter, since we are just using this plot to get some idea what is going on. We don't use this plot for statistical inference. # Confidence Level Set the confidence level. It is best programming practice to never hard code numbers like this, that is, the number 0.95 should only occur in your document once where it is used to initialize a variable. Then use that variable elsewhere. If there is ever a need to change this (to use say 0.90 for the confidence level), then it only need be changed in one place and will be consistently used everywhere. ```{r conf.level} conf.level <- 0.95 ``` This is an example of using the DRY/SPOT rule (Wikipedia pages Don't Repeat Yourself and Single Point of Truth). # Wald, Wilks, Rao Section 1.3.3 in Agresti discusses the three main strategies for constructing hypothesis tests. And each kind of hypothesis goes with a confidence interval that is derived by inverting the test. So there are the same three strategies for confidence intervals. The three strategies are * Wald test * Wilks test, also called likelihood ratio test * Rao test, also called score test and Lagrange multiplier test (this last name is used mostly by economists). All of these procedures are asymptotically equivalent under the usual asymptotics of maximum likelihood. Hence none is better than the others for sufficiently large sample size. 
# Confidence Level

Set the confidence level. It is best programming practice never to hard code
numbers like this: the number 0.95 should occur in your document only once,
where it is used to initialize a variable. Then use that variable elsewhere.
If there is ever a need to change it (to use, say, 0.90 for the confidence
level), then it need only be changed in one place and will be used
consistently everywhere.

```{r conf.level}
conf.level <- 0.95
```

This is an example of using the DRY/SPOT rule (Wikipedia pages Don't Repeat
Yourself and Single Point of Truth).

# Wald, Wilks, Rao

Section 1.3.3 in Agresti discusses the three main strategies for
constructing hypothesis tests. Each kind of test goes with a confidence
interval that is derived by inverting the test. So there are the same three
strategies for confidence intervals. The three strategies are

 * Wald test

 * Wilks test, also called likelihood ratio test

 * Rao test, also called score test and Lagrange multiplier test
   (this last name is used mostly by economists).

All of these procedures are asymptotically equivalent under the usual
asymptotics of maximum likelihood. Hence none is better than the others for
sufficiently large sample sizes. But they may be quite different for small
sample sizes.

# Confidence Intervals

## Wald Confidence Interval

This used to be the standard taught in intro stats; maybe it still is in
many such courses. These are the only intervals of the form
$$
   \text{point estimate} \pm \text{critical value} \times \text{standard error}
$$

```{r wald}
pi.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) *
    sqrt(pi.hat * (1 - pi.hat) / n)
```

We know $0 \le \pi \le 1$, but this interval makes no use of that
information, giving a lower bound that is negative. This may look
ridiculous, but it is not wrong. If we know that $0 \le \pi$, then it is
true that
$`r pi.hat - qnorm((1 + conf.level) / 2) * sqrt(pi.hat * (1 - pi.hat) / n)` \le \pi$
too.

Agresti (Section 1.4) points out that this interval gives ridiculous
zero-width confidence intervals when $\hat{\pi}$ is equal to zero or one
(when the data $x$ is equal to zero or $n$).

However, the Wald interval can be repaired by using a different procedure
(Geyer, 2009, *Electronic Journal of Statistics*, **3**, 259--289) that was
illustrated on the
[web page discussing coverage of confidence intervals](http://www.stat.umn.edu/geyer/5102/examp/coverage.html#musual).
So that is not a reason to avoid Wald intervals, so long as one knows how to
use this modification when the MLE is on the boundary of the parameter space.

## Score Interval (also called Rao)

Some intro stats books now teach this. In particular,
[Agresti's intro stats book](https://www.amazon.com/Statistics-Art-Science-Learning-Data/dp/0321997832/)
teaches this. Some teach a dumbed down approximation to this. Since we are
using R, we may as well do the right thing, not the dumbed down version
suitable for hand calculation.

```{r score}
prop.test(x, n, conf.level = conf.level, correct = FALSE)
```

The "`correct = FALSE`" is just bizarre. Does that mean we don't want the
correct answer? No. The R statement `help(prop.test)` explains that it means
we do not want to use "continuity correction". And we don't. No authority
recommends what `prop.test` does by default.

Agresti, Section 1.4.2, recommends `correct = FALSE`.

Brown, Cai and DasGupta (2005,
[DOI: 10.1214/088342305000000395](https://doi.org/10.1214/088342305000000395))
criticize Geyer and Meeden (2005,
[DOI: 10.1214/088342305000000340](https://doi.org/10.1214/088342305000000340))
for using `prop.test` with `correct = TRUE`, providing plots of coverage
probability for both `correct = FALSE` and `correct = TRUE` to show this.

AFAIK no authority defends `correct = TRUE`; even `help(prop.test)` cites no
such authority. It does cite Newcombe (1998,
[DOI: 10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E](https://doi.org/10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E)),
where `correct = FALSE` is his method 3 and `correct = TRUE` is (apparently)
his method 4, but Newcombe's simulations do not show that `correct = TRUE`
is better than `correct = FALSE`, and his conclusion does not recommend
`correct = TRUE` over `correct = FALSE` (although he does not pick a
particular method to recommend).

For a full derivation of the score interval and a check that
`correct = FALSE` actually calculates it, see
[http://www.stat.umn.edu/geyer/5102/slides/s2.pdf](http://www.stat.umn.edu/geyer/5102/slides/s2.pdf#page=113),
slides 113--116 and 123. (Hmmmm. That does not actually seem to check.
We'll check here.)
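For reference (this display is added here, writing $z$ for the
$(1 + \text{confidence level}) / 2$ standard normal quantile, called `crit`
in the code below), the score interval that we are checking has endpoints
$$
   \frac{\hat{\pi} + \frac{z^2}{2 n}
      \pm z \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n} + \frac{z^2}{4 n^2}}}
   {1 + \frac{z^2}{n}}
$$
which is exactly what the following chunk computes.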
```{r check.score}
crit <- qnorm((1 + conf.level) / 2)
foo <- (pi.hat + crit^2 / (2 * n) + c(-1, 1) * crit *
    sqrt(pi.hat * (1 - pi.hat) / n + crit^2 / (4 * n^2))) / (1 + crit^2 / n)
foo
bar <- prop.test(x, n, conf.level = conf.level, correct = FALSE)
names(bar)
bar$conf.int
all.equal(bar$conf.int, foo, check.attributes = FALSE)
```

## Likelihood Interval (also called Wilks) {#likelihood-interval}

This is the confidence interval that is a level set of the log likelihood.
It is too esoteric for intro stats.

```{r logl}
crit <- qchisq(conf.level, df = 1)
fred <- function(pi) 2 * (logl(pi.hat) - logl(pi)) - crit
tol <- sqrt(.Machine$double.eps)
if (pi.hat == 0) {
    low <- 0
} else {
    low <- uniroot(fred, lower = 0, upper = pi.hat, tol = tol)$root
}
if (pi.hat == 1) {
    hig <- 1
} else {
    hig <- uniroot(fred, lower = pi.hat, upper = 1, tol = tol)$root
}
c(low, hig)
```

We can check we have done the right thing by redoing our log likelihood plot.

```{r label="logl.ci.plot", fig.cap="Likelihood-Based Confidence Interval", fig.align='center'}
sally <- function(pi) 2 * logl(pi)
curve(sally, from = 0.0001, to = 0.99, n = 1001,
    xlab = expression(pi), ylab = "two times log likelihood",
    ylim = sally(pi.hat) + c(-6, 0))
abline(h = sally(pi.hat) - crit, lty = "dashed")
abline(v = low, lty = "dashed")
abline(v = hig, lty = "dashed")
```

We can see that the vertical dashed lines, which are the endpoints of the
likelihood-based confidence interval, do indeed intersect the graph of two
times the log likelihood `crit` down from the maximum, where `crit` is the
critical value derived from the chi-squared distribution.

Note that this figure has even less vertical range than the preceding one.
This does say that the interesting part of the picture is captured by both
graphs, where now "interesting" means the part relevant to a 95% confidence
interval.

## Likelihood Interval using MASS Package

We can also do this with R function `confint` in R package `MASS`, except
that this function botches the calculation when $x = 0$ or $x = n$. Our
calculation above always does the right thing.

```{r confint}
library(MASS)
gout <- glm(rbind(c(x, n - x)) ~ 1, family = binomial)
cout <- suppressMessages(confint(gout, level = conf.level))
cout
fred <- 1 / (1 + exp(- cout))
fred
all.equal(c(low, hig), fred, check.attributes = FALSE, tolerance = 4e-4)
```

## Five Intervals Compared

First, here are the Wald, score, and likelihood intervals computed above.

```{r three}
pi.hat + c(-1, 1) * qnorm((1 + conf.level) / 2) *
    sqrt(pi.hat * (1 - pi.hat) / n)
as.vector(bar$conf.int)
c(low, hig)
```

As can be seen, the intervals are rather different. If we were to go back to
the top and change the data to $x = 200$ and $n = 2500$ (both 100 times what
they were before), then the intervals would be quite close to each other
(not to mention a lot shorter).

The [web page discussing coverage of confidence intervals](http://www.stat.umn.edu/geyer/5102/examp/coverage.html)
discusses two more intervals.

```{r two-more}
# Clopper-Pearson Binomial Confidence Intervals
binom.test(x, n, conf.level = conf.level)
# Variance Stabilized Binomial Confidence Intervals
# http://www.stat.umn.edu/geyer/5102/slides/s2.pdf#page=120
thetahat <- asin(2 * pi.hat - 1)
crit.val <- qnorm((1 + conf.level) / 2)
(1 + sin(thetahat + c(-1,1) * crit.val / sqrt(n))) / 2
```
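For a side-by-side view, here is a sketch (added here; the object name
`intervals` and the row labels are ours) collecting all five intervals into
one matrix, using quantities already computed above.

```{r five-compared}
# rows: the five intervals discussed above; columns: endpoints
# ("arcsine" is the variance stabilized interval)
intervals <- rbind(
    wald = pi.hat + c(-1, 1) * crit.val * sqrt(pi.hat * (1 - pi.hat) / n),
    score = as.vector(bar$conf.int),
    likelihood = c(low, hig),
    clopper.pearson = as.vector(binom.test(x, n,
        conf.level = conf.level)$conf.int),
    arcsine = (1 + sin(thetahat + c(-1, 1) * crit.val / sqrt(n))) / 2)
colnames(intervals) <- c("lower", "upper")
intervals
```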
All five of these intervals are asymptotically equivalent. All are correct
and say almost the same thing for sufficiently large sample sizes. No theory
says that one is better than another for small sample sizes, with one
exception. When the estimators are on the boundary ($\hat{\pi} = 0$ or
$\hat{\pi} = 1$), the Wald and arcsine intervals have abysmally bad
performance (so bad that a wild guess is better). But, as discussed in the
section about the Wald interval above (which refers to the
[web page discussing coverage of confidence intervals](http://www.stat.umn.edu/geyer/5102/examp/coverage.html#musual)),
this abysmally bad performance can be fixed. So there is no reason to prefer
one of these intervals (modified, if necessary, to fix bad behavior at the
endpoints) over another.

# Hypothesis Tests (One-Tailed)

## Data

For some reason we are going to use different data in the hypothesis tests
section, presumably because with the small sample size used before there
would be almost no power to reject most null hypotheses.

```{r data-too}
x <- 1300
n <- 2500
```

## MLE

```{r mle-too}
pi.hat <- x / n
```

## Null Hypothesis

```{r null}
pi.0 <- 1 / 2
```

## Score Test

This is now the standard test for proportions taught in intro stats
(regardless of which interval is taught). We will do an upper-tail test.
Test statistic and $P$-value:

```{r score.test}
tstat <- (x - n * pi.0) / sqrt(n * pi.0 * (1 - pi.0))
tstat
pnorm(tstat, lower.tail = FALSE)
```

This is an asymptotic procedure, only approximately correct for large sample
sizes.

## Exact Test

Here is an "exact" test (really only conservative-exact: the $P$-value is
guaranteed to understate the statistical significance). The $P$-value for
the exact test is

```{r exact}
pbinom(x - 1, n, pi.0, lower.tail = FALSE)
```

This should perhaps be standard in intro stats.

## Fuzzy P-Value

This comes from Geyer and Meeden (*Statistical Science*, 2005, **20**,
358--387). Here the $P$-value is considered to be uniformly distributed on
the interval

```{r fuzzy}
pbinom(c(x, x - 1), n, pi.0, lower.tail = FALSE)
```

This test is truly exact (exact-exact rather than conservative-exact) in the
sense that the probability $P \le \alpha$ is equal to $\alpha$ for
$0 \le \alpha \le 1$. This test is what is actually comparable to an exact
test with a continuous test statistic (like a $t$-test, for example).

## Likelihood Ratio Test

For a one-tailed test we have to use the signed likelihood ratio test
statistic. Here are the test statistic and $P$-value for this test.

```{r lrt}
tstat <- 2 * (logl(pi.hat) - logl(pi.0))
tstat <- sqrt(tstat)
tstat <- tstat * sign(pi.hat - pi.0)
tstat
pnorm(tstat, lower.tail = FALSE)
```

This too is an asymptotic procedure, only approximately correct for large
sample sizes. It is asymptotically equivalent to the score test. Both the
test statistic and the $P$-value for these two tests (and the Wald test,
which is next) will be nearly equal for large sample sizes.

## Wald Test

Test statistic and $P$-value:

```{r wald.test}
tstat <- (x - n * pi.0) / sqrt(n * pi.hat * (1 - pi.hat))
tstat
pnorm(tstat, lower.tail = FALSE)
```

This too is an asymptotic procedure, only approximately correct for large
sample sizes. It is asymptotically equivalent to the score test and the
likelihood ratio test. The test statistic and the $P$-value for these three
tests will be nearly equal for large sample sizes.
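Here is a sketch (added here; the object names are ours) that computes all
three one-tailed test statistics and $P$-values side by side, illustrating
this near equality for these data.

```{r three-tests-compared}
# the score, signed likelihood ratio, and Wald statistics from above
score.stat <- (x - n * pi.0) / sqrt(n * pi.0 * (1 - pi.0))
lrt.stat <- sign(pi.hat - pi.0) * sqrt(2 * (logl(pi.hat) - logl(pi.0)))
wald.stat <- (x - n * pi.0) / sqrt(n * pi.hat * (1 - pi.hat))
tstats <- c(score = score.stat, likelihood = lrt.stat, wald = wald.stat)
rbind(statistic = tstats, p.value = pnorm(tstats, lower.tail = FALSE))
```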
This seems intuitively wrong. The $P$-value is calculated assuming $\pi_0$
is the true unknown parameter value (in general, assuming the null
hypothesis is true). So if you know $\pi = \pi_0$, why not use that fact in
doing the test? The score test and likelihood ratio test do; the Wald test
doesn't. (This is related to the Wald test not needing the MLE in the null
hypothesis. Here we have a point null hypothesis, so the MLE in the null
hypothesis is $\pi_0$.) Hence no intro text recommends the Wald test for the
binomial distribution. Of course, many textbooks recommend Wald tests in
other situations, for example, those output by the R generic function
`summary`.

# Two-Tailed Tests

Now we illustrate two-tailed tests for the same data.

## Score Test

```{r score.test.two.tail}
tstat <- (x - n * pi.0) / sqrt(n * pi.0 * (1 - pi.0))
tstat
pnorm(abs(tstat), lower.tail = FALSE) * 2
pnorm(- abs(tstat)) * 2
pnorm(abs(tstat), lower.tail = FALSE) + pnorm(- abs(tstat))
```

As can be seen, the last three commands are three equivalent ways to
calculate the $P$-value for the two-tailed test using the symmetry of the
standard normal distribution.

## Exact Test {#exact-two-tailed}

Oops! There is no exact non-fuzzy two-tailed test. Except that if the null
hypothesis is $\pi_0 = 1 / 2$, then the test statistic does have a symmetric
distribution, so two tails is twice one tail. But this **does not work** for
other $\pi_0$.

## Fuzzy Test

Here is the fuzzy $P$-value for the exact test (Geyer and Meeden,
*Statistical Science*, 2005, **20**, 358--387). This is not simple, but
there is a function to do it in R package `ump`.

```{r fuzzy.test.two.tailed}
library(ump)
print(arpv.binom(x, n, pi.0, plot = FALSE))
```

Here, because the value of the cumulative distribution function is so low at
the middle knot, this is almost a uniform distribution. The fuzzy $P$-value
is approximately uniformly distributed on the interval

```{r fuzzy.test.two.tailed.interval}
aout <- arpv.binom(x, n, pi.0, plot = FALSE)
aout$alpha[2:3]
```

This behavior is not the way this test always works. Sometimes the
distribution of the fuzzy $P$-value is quite complicated. Here's an example,
which is Figure 3 in Geyer and Meeden (*Statistical Science*, 2005, **20**,
358--387).

```{r fuzzy.other}
aout <- arpv.binom(10, 10, 0.7, plot = FALSE)
aout
```

We make the plot that is the aforementioned Figure 3 as follows.

```{r label="fuzzy.plot", fig.cap="PDF of Fuzzy P-Value", fig.align='center'}
k <- length(aout$alpha)
alpha <- aout$alpha
pdf <- diff(aout$phi) / diff(aout$alpha)
plot(alpha, c(0, pdf), type = "n",
    xlab = expression(alpha), ylab = "probability density")
segments(alpha[-k], pdf, alpha[-1], pdf)
segments(alpha, c(0, pdf), alpha, c(pdf, 0), lty = "dashed")
```

The fuzzy $P$-value can be considered to be a random variable having the
probability density function shown in the figure.

## Likelihood Ratio Test

Test statistic and $P$-value:

```{r lrt.two.tailed}
tstat <- 2 * (logl(pi.hat) - logl(pi.0))
tstat
pchisq(tstat, df = 1, lower.tail = FALSE)
```

## Wald Test

Test statistic and $P$-value:

```{r wald.two.tailed}
tstat <- (x - n * pi.0) / sqrt(n * pi.hat * (1 - pi.hat))
tstat
pnorm(abs(tstat), lower.tail = FALSE) * 2
```
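As a final note (added here, not in the original notes), the two-tailed Wald
$P$-value can equivalently be computed on the chi-squared scale, because the
square of a standard normal random variable is chi-squared on one degree of
freedom. This makes the two-tailed Wald test directly comparable to the
likelihood ratio test above.

```{r wald.two.tailed.chisq}
# same P-value as pnorm(abs(tstat), lower.tail = FALSE) * 2 above
pchisq(tstat^2, df = 1, lower.tail = FALSE)
```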