--- title: "Stat 5421 Lecture Notes: Overdispersion" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: bookdown::html_document2: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML bookdown::pdf_document2: latex_engine: xelatex number_sections: true md_extensions: -tex_math_single_backslash linkcolor: blue urlcolor: blue --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. # Quasilikelihood and Estimating Equations ## Variance Functions In linear models (fit by R function `lm`) we assume the components of the response vector are independent, normally distributed, and same variance (homoscedastic). The variance is unrelated to the mean. The mean of each component can be any real number. The common variance of all the components can be any positive real number. In generalized linear models (fit by R function `glm`) none of these assumptions hold except the components of the response vector are independent. * There are constraints on the means. * There is only one parameter for binomial and Poisson models, so the variance and mean cannot be separately varied. For binomial regression, * components of the response vector are independent $\text{Binomial}(n_i, \pi_i)$ distributed. * Means $\mu_i = n_i \pi_i$ satisfy $0 \le \mu_i \le n_i$. * Variances are a function of means $$ \mathop{\rm var}(y_i) = n_i V(\pi_i) = n_i \pi_i (1 - \pi_i) $$ where $$ V(\pi) = \pi (1 - \pi) $$ For Poisson regression, * components of the response vector are independent $\text{Poisson}(\mu_i)$ distributed. * Means $\mu_i$ satisfy $0 \le \mu_i < \infty$. * Variances are a function of means $$ \mathop{\rm var}(y_i) = V(\mu_i) = \mu_i $$ where $$ V(\mu) = \mu $$ ## Modeling without Models Everything we have done so far relied on statistical models (families of probability distributions). Now we are going to do something different: statistics without models. We are only going to make assumptions about means and variances, not about whole probability distributions. In generalized linear model theory, this approach is called quasilikelihood, although the approach can be explained without even defining this term, and we shall do so. We assume the means are given by the same function of the parameters ("coefficients") as in binomial and Poisson regression. Thus we will have the same maximum likelihood estimates (MLE) of "coefficients" and means with and without overdispersion. Except since we don't have a model, we don't have a likelihood, thus these cannot be maximum *likelihood* estimates. We say they are maximum *quasilikelihood* estimates, or just estimates. But they are the *same* estimates as the MLE for ordinary binomial or Poisson regression. The difference is that we do not assume the same variance function as ordinary binomial or Poisson regression. The variance function could be anything, but for simplicity we (like everybody else) assume the variance function is a constant times the ordinary variance function * The variance is $\phi n_i V(\pi_i)$ for binomial. * The variance is $\phi V(\mu_i)$ for Poisson. 
Because we assumed the estimates of coefficients and means are the same as
for ordinary binomial or Poisson regression, we can estimate the means
without knowing $\phi$. Call those estimates $\hat{\mu}_i$.

## Estimating the Dispersion Parameter

For Poisson, by assumption,
$$
   \frac{(y_i - \mu_i)^2}{\phi V(\mu_i)}
$$
has expectation one. Thus
$$
   \sum_{i = 1}^n \frac{(y_i - \mu_i)^2}{\phi V(\mu_i)}
$$
has expectation $n$. Hence
$$
   \hat{\phi} = \frac{1}{n} \sum_{i = 1}^n \frac{(y_i - \mu_i)^2}{V(\mu_i)}
$$
would be a good estimate of $\phi$ except that we don't know the $\mu_i$.
So we plug in estimated values
$$
   \hat{\phi} = \frac{1}{n - p} \sum_{i = 1}^n
   \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}
$$
and, as usual, divide by $n - p$ instead of $n$, where $p$ is the number of
"coefficients" (the number of parameters needed to specify the
$\hat{\mu}_i$).

This division by $n - p$ has no exact theory justifying it. We know that in
linear models dividing by $n - p$ gives an unbiased estimate of variance
(or so we are told; this is proved in the theory class
[5102 Slides 31–38, Deck 5](http://www.stat.umn.edu/geyer/5102/slides/s5.pdf#page=31)).
But we are not doing linear models, so dividing by $n - p$ is just an
analogy, not real math. Nevertheless, it is the conventional thing to do.
(Of course, it is correct for large $n$, but for large $n$ the difference
between $n$ and $n - p$ is inconsequential.)

For binomial, the corresponding estimate is
$$
   \hat{\phi} = \frac{1}{n - p} \sum_{i = 1}^n
   \frac{(y_i - \hat{\mu}_i)^2}{n_i V(\hat{\mu}_i / n_i)}
$$

# Example: Agresti Section 4.7.4

```{r teratology.data}
library(CatDataAnalysis)
data(table_4.7)
names(table_4.7)
sapply(table_4.7, class)
```

R function `glm` wants binomial data with sample sizes greater than one
presented as a two-column matrix whose columns are successes and failures,
so we have to make that. Instead we are given `y`, which is apparently the
number of successes, and `n`, which is apparently the total (successes plus
failures).

```{r teratology.data.wtf}
with(table_4.7, all(0 <= y & y <= n))
with(table_4.7, range(n))
resp <- with(table_4.7, cbind(dead = y, alive = n - y))
resp
```

The variable `litter` just contains sequence numbers.

```{r teratology.data.litter}
with(table_4.7, identical(litter, seq(along = litter)))
```

The variable `group` is numeric, but we need to make it a factor.

```{r teratology.data.group}
dat <- transform(table_4.7, group = as.factor(group))
```

Now, just for comparison, we fit a logistic regression.

```{r teratology.logistic}
gout.logistic <- glm(resp ~ group, data = dat, family = binomial)
summary(gout.logistic)
```

Now allow for overdispersion.

```{r teratology.over}
gout.over <- glm(resp ~ group, data = dat, family = quasibinomial)
summary(gout.over)
```

We see that we do have exactly the same coefficients

```{r teratology.coef.equal}
all.equal(coef(gout.logistic), coef(gout.over))
```

and exactly the same mean values.

```{r teratology.mean.equal}
all.equal(fitted(gout.logistic), fitted(gout.over))
```

Now try predicted values.

```{r teratology.over.predict}
pout.over <- predict(gout.over, newdata = data.frame(group = factor(1:4)),
    type = "response", se.fit = TRUE)
pout.over
```

Are these the same as the predicted values for logistic regression?

```{r teratology.logistic.predict}
pout.logistic <- predict(gout.logistic, newdata = data.frame(group = factor(1:4)),
    type = "response", se.fit = TRUE)
pout.logistic
all.equal(pout.logistic$fit, pout.over$fit)
pout.over$se.fit / pout.logistic$se.fit
```

The fitted means agree, but the standard errors do not. What is that ratio?

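As an aside, before looking at R's answer in the next chunk: the binomial
version of $\hat{\phi}$ above can be computed by hand from the ordinary
logistic regression fit. This is a minimal sketch (not part of the original
analysis), assuming the objects `dat` and `gout.logistic` created above;
the name `phi.hat.byhand` is made up for the illustration. Since R also
uses a Pearson-type estimate for `family = quasibinomial`, this should
agree with the dispersion that `summary` reports below.

```{r teratology.dispersion.byhand}
y <- dat$y                           # observed counts (first response column)
n <- dat$n                           # litter sizes n_i
pi.hat <- fitted(gout.logistic)      # estimated probabilities pi.hat_i
p <- length(coef(gout.logistic))     # number of "coefficients"
# sum of (y_i - n_i pi.hat_i)^2 / (n_i V(pi.hat_i)), divided by n - p
phi.hat.byhand <-
    sum((y - n * pi.hat)^2 / (n * pi.hat * (1 - pi.hat))) / (length(y) - p)
phi.hat.byhand
sqrt(phi.hat.byhand)
```
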
```{r teratology.dispersion,error=TRUE}
phi.hat <- summary(gout.over)$dispersion
phi.hat
sqrt(phi.hat)
```

Indeed, we are just inflating the estimated variances by a factor of
`r round(phi.hat, 6)` and the estimated standard errors by a factor of
`r round(sqrt(phi.hat), 6)`.

## Testing for Overdispersion

One might think that, since we have no model, we cannot do hypothesis tests
about the dispersion. But hypothesis tests only need a distribution under
the null hypothesis, and we do have that. The null hypothesis is ordinary
binomial; the alternative hypothesis is overdispersed binomial.

We can do the test by simulation, as we did in the
[section on the parametric bootstrap in the notes on Chapter 9](ch9.html#parametric-bootstrap).

```{r teratology.boot}
# set random number generator seed for reproducibility
set.seed(42)
n <- dat$n
nboot <- 999
# fitted values of the binomial fit are the estimated success probabilities
mu.hat <- fitted(gout.logistic)
phi.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate from the fitted ordinary binomial model (the null hypothesis)
    y.star <- rbinom(length(n), n, mu.hat)
    resp.star <- cbind(dead = y.star, alive = n - y.star)
    # refit, using a new name so we do not clobber the fit to the real data
    gout.star <- glm(resp.star ~ group, data = dat, family = quasibinomial)
    phi.star[iboot] <- summary(gout.star)$dispersion
}
all.phi.values <- c(phi.star, phi.hat)
mean(all.phi.values >= phi.hat)
```

None of the simulated dispersion values are as large as the observed value,
so the simulated $P$-value is $1 / 1000 = 0.001$. This is very strong
evidence of overdispersion.

## On Not Testing for Overdispersion

But the natural idea of using `family = quasibinomial` only when there
seems to be evidence for it, using the test in the preceding section, is
actually the [Wrong Thing](http://www.catb.org/jargon/html/W/Wrong-Thing.html).

When one does a composite procedure having two or more steps, one must
consider the composite procedure as the procedure. This seems obvious: what
one does is what one does. But that is not how most people think about
composite procedures they invent.

In this example, the idea is to use ordinary binomial if the test of
overdispersion accepts the null hypothesis and to use overdispersed
binomial if the test of overdispersion rejects the null hypothesis, where
"use ordinary binomial" or "use overdispersed binomial" means *pretend no
pretest had been done*. This is clearly the Wrong Thing. The composite
procedure is not the same as the simple procedure that is its second step.

Hence the conclusion is that if you want to use overdispersion, then use
it. *Always.* Don't use a composite procedure that is the Wrong Thing.

This is a general principle of statistics. This seemingly reasonable idea
— many naive users of statistics invent it — has zero theoretical
justification, is recommended by zero textbooks, and is obviously wrong
once you think about it in the right way. Every multistage procedure
invented by newbies has this problem. When you do stage one and then
proceed with stage two *pretending that stage one never happened and that
its results satisfy the assumptions for stage two*, and so on (maybe stages
three and four), you are in the realm of complete
[bogosity](http://www.catb.org/jargon/html/B/bogosity.html).

Of course, if you [bootstrap](ch9.html#parametric-bootstrap) the whole
multistage procedure, then that is the
[Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html), but that
is not what we are talking about here.

# Example: Agresti Section 4.7.2

We refit the horseshoe crab data from Section 4.3.2, allowing for
overdispersion.

```{r crab.data}
# clean up R global environment
rm(list = ls())
data(table_4.3)
names(table_4.3)
```

We found in homework 3 that the apparently best-fitting model, when we did
not use overdispersion, was

```{r crab.no.over}
gout.no.over <- glm(satell ~ color + weight, family = poisson,
    data = table_4.3)
summary(gout.no.over)
```

If we allow for overdispersion

```{r crab.over}
gout.over <- glm(satell ~ color + weight, family = quasipoisson,
    data = table_4.3)
summary(gout.over)
```

we see that the coefficients and fitted means are the same, but the
standard errors are larger, because we are allowing for a lot of
overdispersion.

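As a check (a sketch, not in Agresti or the original analysis), the
dispersion reported by `summary` above should again be the Pearson-type
estimate from the quasilikelihood section, now with $V(\mu) = \mu$, and the
quasi-Poisson standard errors should be the Poisson standard errors
inflated by $\sqrt{\hat{\phi}}$, just as in the teratology example. This
assumes the objects `gout.no.over` and `gout.over` from the chunks above.

```{r crab.dispersion.byhand}
# dispersion by hand: Pearson-type estimate with V(mu) = mu
y <- table_4.3$satell
mu.hat <- fitted(gout.no.over)       # same fitted means as gout.over
p <- length(coef(gout.no.over))      # number of "coefficients"
sum((y - mu.hat)^2 / mu.hat) / (length(y) - p)
summary(gout.over)$dispersion

# ratio of standard errors should be sqrt(phi.hat) for every coefficient
se.no.over <- summary(gout.no.over)$coefficients[, "Std. Error"]
se.over <- summary(gout.over)$coefficients[, "Std. Error"]
se.over / se.no.over
sqrt(summary(gout.over)$dispersion)
```
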