--- title: "Stat 5421 Lecture Notes: Overdispersion" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: bookdown::html_document2: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML bookdown::pdf_document2: latex_engine: xelatex number_sections: true md_extensions: -tex_math_single_backslash linkcolor: blue urlcolor: blue --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. # Quasilikelihood and Estimating Equations ## Variance Functions In linear models (fit by R function `lm`) we assume the components of the response vector are independent, normally distributed, and same variance (homoscedastic). The variance is unrelated to the mean. The mean of each component can be any real number. The common variance of all the components can be any positive real number. In generalized linear models (fit by R function `glm`) none of these assumptions hold except the components of the response vector are independent. * There are constraints on the means. * There is only one parameter for binomial and Poisson models, so the variance and mean cannot be separately varied. For binomial regression, * components of the response vector are independent $\text{Binomial}(n_i, \pi_i)$ distributed. * Means $\mu_i = n_i \pi_i$ satisfy $0 \le \mu_i \le n_i$. * Variances are a function of means $$ \mathop{\rm var}(y_i) = n_i V(\pi_i) = n_i \pi_i (1 - \pi_i) $$ where $$ V(\pi) = \pi (1 - \pi) $$ For Poisson regression, * components of the response vector are independent $\text{Poisson}(\mu_i)$ distributed. * Means $\mu_i$ satisfy $0 \le \mu_i < \infty$. * Variances are a function of means $$ \mathop{\rm var}(y_i) = V(\mu_i) = \mu_i $$ where $$ V(\mu) = \mu $$ ## Modeling without Models Everything we have done so far relied on statistical models (families of probability distributions). Now we are going to do something different: statistics without models. We are only going to make assumptions about means and variances, not about whole probability distributions. In generalized linear model theory, this approach is called quasilikelihood, although the approach can be explained without even defining this term, and we shall do so. We assume the means are given by the same function of the parameters ("coefficients") as in binomial and Poisson regression. Thus we will have the same maximum likelihood estimates (MLE) of "coefficients" and means with and without overdispersion. Except since we don't have a model, we don't have a likelihood, thus these cannot be maximum *likelihood* estimates. We say they are maximum *quasilikelihood* estimates, or just estimates. But they are the *same* estimates as the MLE for ordinary binomial or Poisson regression. The difference is that we do not assume the same variance function as ordinary binomial or Poisson regression. The variance function could be anything, but for simplicity we (like everybody else) assume the variance function is a constant times the ordinary variance function * The variance is $\phi n_i V(\pi_i)$ for binomial. * The variance is $\phi V(\mu_i)$ for Poisson. 
Because we assumed the estimates of coefficients and means are the same as
for ordinary binomial or Poisson regression, we can estimate the means
without knowing $\phi$. Call those estimates $\hat{\mu}_i$.

## Estimating the Dispersion Parameter

For Poisson, by assumption,
$$
   \frac{(y_i - \mu_i)^2}{\phi V(\mu_i)}
$$
has expectation one. Thus
$$
   \sum_{i = 1}^n \frac{(y_i - \mu_i)^2}{\phi V(\mu_i)}
$$
has expectation $n$. Hence
$$
   \hat{\phi} = \frac{1}{n} \sum_{i = 1}^n \frac{(y_i - \mu_i)^2}{V(\mu_i)}
$$
would be a good estimate of $\phi$ except that we don't know the $\mu_i$.
So we plug in estimated values
$$
   \hat{\phi} = \frac{1}{n - p} \sum_{i = 1}^n
   \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}
$$
and, as usual, divide by $n - p$ instead of $n$, where $p$ is the number of
"coefficients" (the number of parameters needed to specify the
$\hat{\mu}_i$).

This division by $n - p$ has no exact theory justifying it. We know that in
linear models dividing by $n - p$ gives an unbiased estimate of variance
(or so we are told; this is proved in the theory class
[5102 Slides 31–38, Deck 5](http://www.stat.umn.edu/geyer/5102/slides/s5.pdf#page=31)).
But we are not doing linear models, so dividing by $n - p$ is just an
analogy, not real math. Nevertheless, it is the conventional thing to do.
(Of course, it is correct for large $n$, but for large $n$ the difference
between $n$ and $n - p$ is inconsequential.)

For binomial, the corresponding estimate is
$$
   \hat{\phi} = \frac{1}{n - p} \sum_{i = 1}^n
   \frac{(y_i - \hat{\mu}_i)^2}{n_i V(\hat{\mu}_i / n_i)}
$$

# Example: Agresti Section 4.7.4

```{r teratology.data}
library(CatDataAnalysis)
data(table_4.7)
names(table_4.7)
sapply(table_4.7, class)
```

R function `glm` wants binomial data with sample sizes greater than one
presented as a two-column matrix whose columns are successes and failures,
so we have to make that. Instead we are given `y`, which is apparently the
number of successes, and `n`, which is apparently the total (successes plus
failures).

```{r teratology.data.wtf}
with(table_4.7, all(0 <= y & y <= n))
with(table_4.7, range(n))
resp <- with(table_4.7, cbind(dead = y, alive = n - y))
resp
```

The variable `litter` just contains sequence numbers.

```{r teratology.data.litter}
with(table_4.7, identical(litter, seq(along = litter)))
```

The variable `group` is numeric, but we need to make it a factor.

```{r teratology.data.group}
dat <- transform(table_4.7, group = as.factor(group))
```

Now, just for comparison, we fit a logistic regression.

```{r teratology.logistic}
gout.logistic <- glm(resp ~ group, data = dat, family = binomial)
summary(gout.logistic)
```

Now allow for overdispersion.

```{r teratology.over}
gout.over <- glm(resp ~ group, data = dat, family = quasibinomial)
summary(gout.over)
```

We see that we do have exactly the same coefficients

```{r teratology.coef.equal}
all.equal(coef(gout.logistic), coef(gout.over))
```

and exactly the same mean values.

```{r teratology.mean.equal}
all.equal(fitted(gout.logistic), fitted(gout.over))
```

Now try predicted values.

```{r teratology.over.predict}
pout.over <- predict(gout.over, newdata = data.frame(group = factor(1:4)),
    type = "response", se.fit = TRUE)
pout.over
```

Are these the same as the predicted values for logistic regression?

```{r teratology.logistic.predict}
pout.logistic <- predict(gout.logistic, newdata = data.frame(group = factor(1:4)),
    type = "response", se.fit = TRUE)
pout.logistic
all.equal(pout.logistic$fit, pout.over$fit)
pout.over$se.fit / pout.logistic$se.fit
```

The fitted means agree, but the standard errors do not. What is that ratio?

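As an aside, before looking at R's answer in the next chunk: the binomial
version of $\hat{\phi}$ above can be computed by hand from the ordinary
logistic regression fit. This is a minimal sketch (not part of the original
analysis), assuming the objects `dat` and `gout.logistic` created above;
the name `phi.hat.byhand` is made up for the illustration. Since R also
uses a Pearson-type estimate for `family = quasibinomial`, this should
agree with the dispersion that `summary` reports below.

```{r teratology.dispersion.byhand}
y <- dat$y                           # observed counts (first response column)
n <- dat$n                           # litter sizes n_i
pi.hat <- fitted(gout.logistic)      # estimated probabilities pi.hat_i
p <- length(coef(gout.logistic))     # number of "coefficients"
# sum of (y_i - n_i pi.hat_i)^2 / (n_i V(pi.hat_i)), divided by n - p
phi.hat.byhand <-
    sum((y - n * pi.hat)^2 / (n * pi.hat * (1 - pi.hat))) / (length(y) - p)
phi.hat.byhand
sqrt(phi.hat.byhand)
```
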
```{r teratology.dispersion,error=TRUE}
phi.hat <- summary(gout.over)$dispersion
phi.hat
sqrt(phi.hat)
```

Indeed, we are just inflating the estimated variances by a factor of
`r round(phi.hat, 6)` and the estimated standard errors by a factor of
`r round(sqrt(phi.hat), 6)`.

## Testing for Overdispersion

One might think that, since we have no model, we cannot do hypothesis tests
about the dispersion. But hypothesis tests only need a distribution under
the null hypothesis, and we do have that. The null hypothesis is ordinary
binomial; the alternative hypothesis is overdispersed binomial.

We can do the test by simulation, as we did in the
[section on the parametric bootstrap in the notes on Chapter 9](ch9.html#parametric-bootstrap).

```{r teratology.boot}
# set random number generator seed for reproducibility
set.seed(42)
n <- dat$n
nboot <- 999
# fitted values of the binomial fit are the estimated success probabilities
mu.hat <- fitted(gout.logistic)
phi.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate from the fitted ordinary binomial model (the null hypothesis)
    y.star <- rbinom(length(n), n, mu.hat)
    resp.star <- cbind(dead = y.star, alive = n - y.star)
    # refit, using a new name so we do not clobber the fit to the real data
    gout.star <- glm(resp.star ~ group, data = dat, family = quasibinomial)
    phi.star[iboot] <- summary(gout.star)$dispersion
}
all.phi.values <- c(phi.star, phi.hat)
mean(all.phi.values >= phi.hat)
```

None of the simulated dispersion values are as large as the observed value,
so the simulated $P$-value is $1 / 1000 = 0.001$. This is very strong
evidence of overdispersion.

## On Not Testing for Overdispersion

But the natural idea of using `family = quasibinomial` only when there
seems to be evidence for it, using the test in the preceding section, is
actually the [Wrong Thing](http://www.catb.org/jargon/html/W/Wrong-Thing.html).

When one does a composite procedure having two or more steps, one must
consider the composite procedure as the procedure. This seems obvious: what
one does is what one does. But that is not how most people think about
composite procedures they invent.

In this example, the idea is to use ordinary binomial if the test of
overdispersion accepts the null hypothesis and to use overdispersed
binomial if the test of overdispersion rejects the null hypothesis, where
"use ordinary binomial" or "use overdispersed binomial" means *pretend no
pretest had been done*. This is clearly the Wrong Thing. The composite
procedure is not the same as the simple procedure that is its second step.

Hence the conclusion is that if you want to use overdispersion, then use
it. *Always.* Don't use a composite procedure that is the Wrong Thing.

This is a general principle of statistics. This seemingly reasonable idea
— many naive users of statistics invent it — has zero theoretical
justification, is recommended by zero textbooks, and is obviously wrong
once you think about it in the right way. Every multistage procedure
invented by newbies has this problem. When you do stage one and then
proceed with stage two *pretending that stage one never happened and that
its results satisfy the assumptions for stage two*, and so on (maybe stages
three and four), you are in the realm of complete
[bogosity](http://www.catb.org/jargon/html/B/bogosity.html).

Of course, if you [bootstrap](ch9.html#parametric-bootstrap) the whole
multistage procedure, then that is the
[Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html), but that
is not what we are talking about here.

# Example: Agresti Section 4.7.2

We refit the horseshoe crab data from Section 4.3.2, allowing for
overdispersion.

```{r crab.data}
# clean up R global environment
rm(list = ls())
data(table_4.3)
names(table_4.3)
```

We found in homework 3 that the apparently best-fitting model, when we did
not use overdispersion, was

```{r crab.no.over}
gout.no.over <- glm(satell ~ color + weight, family = poisson,
    data = table_4.3)
summary(gout.no.over)
```

If we allow for overdispersion

```{r crab.over}
gout.over <- glm(satell ~ color + weight, family = quasipoisson,
    data = table_4.3)
summary(gout.over)
```

we see that the coefficients and fitted means are the same, but the
standard errors are larger, because we are allowing for a lot of
overdispersion.

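As a check (a sketch, not in Agresti or the original analysis), the
dispersion reported by `summary` above should again be the Pearson-type
estimate from the quasilikelihood section, now with $V(\mu) = \mu$, and the
quasi-Poisson standard errors should be the Poisson standard errors
inflated by $\sqrt{\hat{\phi}}$, just as in the teratology example. This
assumes the objects `gout.no.over` and `gout.over` from the chunks above.

```{r crab.dispersion.byhand}
# dispersion by hand: Pearson-type estimate with V(mu) = mu
y <- table_4.3$satell
mu.hat <- fitted(gout.no.over)       # same fitted means as gout.over
p <- length(coef(gout.no.over))      # number of "coefficients"
sum((y - mu.hat)^2 / mu.hat) / (length(y) - p)
summary(gout.over)$dispersion

# ratio of standard errors should be sqrt(phi.hat) for every coefficient
se.no.over <- summary(gout.no.over)$coefficients[, "Std. Error"]
se.over <- summary(gout.over)$coefficients[, "Std. Error"]
se.over / se.no.over
sqrt(summary(gout.over)$dispersion)
```
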