---
title: "Stat 5421 Lecture Notes: To Accompany Agresti Ch 9"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  bookdown::html_document2:
    number_sections: true
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: true
    md_extensions: -tex_math_single_backslash
linkcolor: blue
urlcolor: blue
bibliography: ch9.bib
csl: journal-of-the-royal-statistical-society.csl
link-citations: true
---

# Section 9.1 Two-Way Tables

## Section 9.1.1 Only Main Effects

The *independence* or *homogeneity of proportions* model has mean-value parameters
$$
   \mu_{i j} = \alpha_i \beta_j
$$
or canonical parameters
\begin{equation}
   \theta_{i j} = \alpha_i + \beta_j
   (\#eq:two-way-independence)
\end{equation}
where the alphas and betas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters).

## Section 9.1.3 Interactions

The *saturated* model has completely arbitrary parameters. The mean value parameters $\mu_{i j}$ have no specified structure. The canonical parameters $\theta_{i j}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters.

## Section 9.1.5 Hierarchical Models

The hierarchical model principle says that, if you have an interaction term of a certain order, then you also have all lower-order terms (main effects and interactions) involving the same variables. For the saturated model for a two-way table this says
\begin{equation}
   \theta_{i j} = \alpha_i + \beta_j + \gamma_{i j}
   (\#eq:two-way-saturated)
\end{equation}
(but this just says $\theta_{i j}$ can be anything). This is not an identifiable parameterization.

## Identifiability

A statistical model (in general, not just in categorical data analysis) is *identifiable* if each probability distribution in the model corresponds to only one (vector) parameter value.

In exponential family models ([notes on exponential families, section on identifiability](expfam.html#identifiability)) every different mean vector corresponds to a different distribution, but different canonical parameter vectors can correspond to the same distribution.

If we are assuming *Poisson sampling* ([notes on Agresti Chapter 1, section on sampling schemes](ch1.html#statistical-models-for-categorical-data) and also [notes on sampling schemes](sampling.html)) then the canonical parameters are identifiable if and only if the mean-value parameters are (because the link function (componentwise log) is invertible).

Clearly the specification for the two-way independence model \@ref(eq:two-way-independence) is not identifiable because one can add a constant to all of the alphas and subtract the same constant from all the betas and get the same result
$$
   (\alpha_i + c) + (\beta_j - c) = \alpha_i + \beta_j
$$
Hence \@ref(eq:two-way-saturated) is also not identifiable. But we don't need to worry about identifiability when fitting models because the computer automatically takes care of it.

For example, consider the data in Table 3.8 in Agresti
```{r data.fetal.alcohol.syndrome}
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
data.frame(drinks, malformation, counts)
```
We can do the test of independence using R function `chisq.test`, which wants the data as a contingency table, so first we make one using R function `xtabs`
```{r foo}
foo <- xtabs(counts ~ malformation + drinks)
foo
```
That's ugly.
Can we force `xtabs` to keep the `drinks` variable in the right order?
```{r fugly}
drinks <- factor(drinks, levels = c("0", "< 1", "1-2", "3-5", ">= 6"))
foo <- xtabs(counts ~ malformation + drinks)
foo
```
Better. Now back to statistics
```{r foo.chisq}
chisq.test(foo)
```
But it warns "Chi-squared approximation may be incorrect", so try
```{r foo.bar}
chisq.test(foo, simulate.p.value = TRUE)
```
But we can also use R function `glm` and its helper functions to do the job.
```{r glm}
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
summary(gout.indep)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
summary(gout.sat)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
```{r glm.hide}
p.value.lrt <- anova(gout.indep, gout.sat, test = "LRT")[2, "Pr(>Chi)"]
p.value.rao <- anova(gout.indep, gout.sat, test = "Rao")[2, "Pr(>Chi)"]
p.value.chisq <- suppressWarnings(chisq.test(foo)$p.value)
p.value.lrt
p.value.rao
p.value.chisq
```
We notice a number of things.

* The $P$-value for the Rao test (`r round(p.value.rao, 5)`) is exactly equal to the one output by the chi-square test because the Pearson chi-square test is a special case of the Rao test.

* The likelihood ratio test disagrees quite strongly with the Rao test. Of course, these tests are asymptotically equivalent, but here the sample size is not "large". The total number of subjects ($`r sum(as.integer(foo))`$) is large, but the expected numbers in some cells of the contingency table are quite small, a lot less than what the rule of thumb says is necessary for valid asymptotics (at least five in each cell), so maybe that accounts for the difference.
```{r foo.expected}
suppressWarnings(chisq.test(foo)$expected)
```

* The computer (R function `glm`) knows how to deal with identifiability. It adds an intercept, so it is really using the formula
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j
$$
for the independence model and
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j + \delta_{i j}
$$
for the saturated model. Then it "drops" (sets equal to zero) one of the betas, one of the gammas, and all interaction terms corresponding to the dropped beta and the dropped gamma.
```{r gout.indep.coef}
names(coef(gout.indep))
```
It keeps the "intercept" ($\alpha$ in our notation just above). It drops one of the coefficients for the factor `malformation` (and keeps the other one). It drops one of the coefficients for the factor `drinks` (and keeps the other four). And this gives it an identifiable model.
```{r gout.sat.coef}
names(coef(gout.sat))
```
It has all the parameters of the independence model plus a lot of "interaction" parameters (the ones whose names contain colons). But it drops all the "interaction" parameters that involve parameters that were "dropped" from the independence model (it contains no "interaction" terms containing `malformationAbsent` or `drinks0`). And this gives it an identifiable model.
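As a sanity check (this code is not in Agresti, just an illustration), we can verify that the arbitrariness in the canonical parameterization does not affect the mean-value parameters: refitting the independence model with a different reference level for `drinks` changes the coefficients but not the fitted values.
```{r identifiability.check}
# refit with a different dropped (reference) level for drinks;
# the coefficients change but the mean-value parameters do not
gout.indep.relevel <- glm(counts ~ malformation + relevel(drinks, ">= 6"),
    family = poisson)
all.equal(fitted(gout.indep), fitted(gout.indep.relevel))
```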
# Section 9.2 Three-Way Tables

Three-way tables are just like two-way tables except the data have three subscripts for the three dimensions of the table. Our models are also analogs of those for two-way tables.

## Section 9.2.1 Only Main Effects

The *independence* or *homogeneity of proportions* model has mean-value parameters
$$
   \mu_{i j k} = \alpha_i \beta_j \gamma_k
$$
or canonical parameters
\begin{equation}
   \theta_{i j k} = \alpha_i + \beta_j + \gamma_k
   (\#eq:three-way-independence)
\end{equation}
where the alphas, betas, and gammas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters).

## The Saturated Model

As before, the *saturated* model has completely arbitrary parameters. The mean value parameters $\mu_{i j k}$ have no specified structure. The canonical parameters $\theta_{i j k}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters.

## Example: High School Student Survey

This is Table 9.3 in Agresti and can be found in R package `CatDataAnalysis`.
```{r high.school}
# clean up R global environment
rm(list = ls())
library(CatDataAnalysis)
data(table_9.3)
names(table_9.3)
foo <- xtabs(count ~ a + c + m, data = table_9.3)
foo
```
Since `a`, `c`, and `m` are categorical, we should perhaps make them factors. But since they each have only two values, it does not matter whether we make them factors or not. If they had more than two values, then it would be essential to make them factors.

We can make these data look nicer by copying the names out of the book
```{r high.school.dimnames}
dimnames(foo) <- list(alcohol = c("yes", "no"),
    cigarettes = c("yes", "no"),
    marijuana = c("yes", "no"))
aperm(foo, c(2, 3, 1))
```
We needed R function `aperm`, which permutes the dimensions of an array, to present the data in the same order as in the book. Of course, the order doesn't matter and neither do the `dimnames`. That is just to match the book.

Now R function `chisq.test` is useless (it only handles two-way tables), but R function `glm` has no problems. Our test of independence or homogeneity of proportions is the same, just with three "predictors" rather than two (the "predictors" are in scare quotes because they are just the `dimnames`)
```{r high.school.test}
gout.indep <- glm(count ~ a + c + m, data = table_9.3, family = poisson)
gout.sat <- glm(count ~ a * c * m, data = table_9.3, family = poisson)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
Clearly, the independence model does not fit the data. So statistics says these variables (alcohol use, cigarette use, and marijuana use) have some association. No real surprise, but the data do confirm what just about everybody assumes.

## Interactions

Our saturated model has the three-way interaction `a * c * m`, but there can also be two-way interactions. The model with all two-way interactions would have canonical parameter
$$
   \theta_{i j k} = \alpha_{i j} + \beta_{i k} + \gamma_{j k}
$$
(as before, the terms with alphas, betas, and gammas need not match up with previous uses of these Greek letters). And people who insist on the hierarchical principle would write
$$
   \theta_{i j k} = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k}
$$
but both formulas specify the same model and neither is identifiable.

So let's look at that model.
```{r high.school.all.two.way}
gout.all.two.way <- glm(count ~ (a + c + m)^2, data = table_9.3,
    family = poisson)
anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
```
```{r high.school.p.values}
foompter <- anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
foompter.p.values <- foompter[-1, "Pr(>Chi)"]
foompter.p.values
```
This says the two-way interactions model (model 2) fits as well as the three-way interactions model (model 3, the saturated model) because $P = `r round(foompter.p.values[2], 4)`$ is not statistically significant (not even close).

And it says the two-way interactions model (model 2) fits much better than the main effects only model (model 1, the independence model) because $P < 2 \times 10^{-16}$ is highly statistically significant (by anyone's standards).

We should not be overly impressed by very, very small $P$-values because asymptotic approximation gives only small absolute errors rather than small relative errors. All this very, very small $P$-value says is that an exact calculation (if we could do it, which we cannot) would be something small, say $P < 0.001$. But there is no reason to believe the $P < 2 \times 10^{-16}$ that R prints is correct to within even several orders of magnitude.

## Many Models

There are many hierarchical models here, because we need not include all of the main effects or all two-way interactions. The hierarchical principle says that if we have a two-way interaction, then we must have the corresponding main effects, but that leaves
\begin{align*}
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k} + \iota_{i j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k} + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \varepsilon_{i j}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k + \zeta_{i k}
   \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j
   \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k
   \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k
   \\
   \theta_{i j k} & = \alpha + \beta_i
   \\
   \theta_{i j k} & = \alpha + \gamma_j
   \\
   \theta_{i j k} & = \alpha + \delta_k
   \\
   \theta_{i j k} & = \alpha
\end{align*}
(the three-way interaction term in the first line is written $\iota_{i j k}$ to avoid clashing with the canonical parameter $\theta_{i j k}$ on the left-hand side). Of course, we already know many of these models don't fit the data. We saw that the main effects only model doesn't fit. So none of its submodels can fit either. But this is a complete listing of all hierarchical models, from the saturated model (the first line) down to the intercept-only model (the last line).

How can we choose among all these models? There is a methodology for that, but this is getting a bit ahead of ourselves. There will be a handout for that. But for now, we just show it without much explanation.
```{r high.school.glmbb}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson)
summary(out)
```
This says that none of the models that do not have all three two-way interactions fit the data according to the criterion: none have AIC less than the AIC for the best model ("best" according to AIC) plus 10.

We don't know what AIC is yet, but we will learn in a later handout. For now, just consider it a criterion of goodness of fit.

If we want to see how bad some of those non-fitting models were, we can increase the cutoff.
```{r high.school.glmbb.too}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson,
    cutoff = 90)
summary(out)
```
That is a huge jump up to the next best fitting model, which (take my word for it for now) does not fit the data well at all.

# Section 9.4 Four-Way and Beyond

And so on and so forth. Higher order tables are just more complicated, but present no new issues.

## Example: Seat-Belt Use

These data are given in Table 9.8 in Agresti. They are apparently not in R package `CatDataAnalysis`.
```{r seat.belt,error=TRUE}
# clean up R global environment
rm(list = ls())
count <- c(7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
    996, 759, 973, 757, 812, 380, 1084, 513)
injury <- gl(2, 8, 16, labels = c("No", "Yes"))
gender <- gl(2, 4, 16, labels = c("Female", "Male"))
location <- gl(2, 2, 16, labels = c("Urban", "Rural"))
seat.belt <- gl(2, 1, 16, labels = c("No", "Yes"))
data.frame(gender, location, seat.belt, injury, count)
xtabs(count ~ seat.belt + injury + location + gender)
```
```{r options.width,echo=FALSE}
opt.save <- options(width = 250)
```
Looks OK.
```{r seat.belt.glmbb}
out <- glmbb(count ~ seat.belt * injury * location * gender,
    family = "poisson")
summary(out)
```
Now there are several models that fit these data fairly well. We will leave it at that for now.
```{r options.width.too,echo=FALSE}
options(opt.save)
```

# The Parametric Bootstrap {#bootstrap}

## Relevant and Irrelevant Simulation

### Irrelevant

Most statisticians think a statistics paper isn't really a statistics paper or a statistics talk isn't really a statistics talk if it doesn't have simulations demonstrating that the methods proposed work great (at least on some toy problems).

IMHO, this is nonsense. Simulations of the kind most statisticians do *prove* nothing. The toy problems used are often very special and do not stress the methods at all. In fact, they may be (consciously or unconsciously) chosen to make the methods look good.

In scientific experiments, we know how to use randomization, blinding, and other techniques to avoid biasing the results. AFAIK, analogous things are never done with simulations.

When all of the toy problems simulated are *very different* from the statistical model you intend to use for your data, what could the simulation study possibly tell you that is relevant? Nothing.

Hence, for short, your humble author calls all of these millions of simulation studies statisticians have done *irrelevant simulation*.

### Relevant

But there is a well-known methodology of *relevant simulation*, except that it isn't called that. It is called the *bootstrap*. Its idea is that, for each statistical model and each data set to which it is applied, one should do a simulation study of this model on data of this form.

But there is a problem: the fundamental problem of statistics, that $\hat{\theta}$ is not $\theta$.
To be truly relevant we should simulate from the true unknown distribution of the data, but we don't know what that is. (If we did, we wouldn't need statistics.)

So as a second-best choice we have to simulate from our best estimate of the true unknown distribution, the one corresponding to the parameter value $\hat{\theta}$ if that is the best estimator we know.

But we know that is the [Wrong Thing](http://www.catb.org/jargon/html/W/Wrong-Thing.html). So we have to be sophisticated about this. We have to arrange what we do with our simulations to come as close to the [Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html) as possible.

And bootstrap theory and methods are extraordinarily sophisticated with many different methods of coming very close to the [Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html).

## R Packages and Textbooks

There are two well-known R packages concerned with the bootstrap. They go with two well-known textbooks.

* R package `boot` [@boot-package] is an R recommended package that is installed by default in every installation of R. As the package description says, it goes with the textbook @davison-hinkley.

* The CRAN package `bootstrap` [@bootstrap-package] goes with, as its package description says, the textbook @efron-tibshirani.

  The package description also says that "new projects should preferentially use the recommended package '`boot`'". But I do not agree. The package maintainer is neither Efron nor Tibshirani, and I do not think they would agree. Whatever the politics of the R core team that make the `boot` package "recommended", they have nothing to do with the quality of the packages or of the textbooks they go with. If you like Efron and Tibshirani (1993), you should be using the R package `bootstrap` that goes with it.

These authors range from moderately famous (for a statistician) to very, very famous (for a statistician). Efron is the inventor of the term *bootstrap* in its statistical meaning.

## Kinds of Bootstrap

Almost all of the bootstrap literature is about the *nonparametric bootstrap* that does not assume any parametric statistical model for the data. Most of the content of the textbooks cited above, and both of the R packages cited above, are about this methodology.

But if one is using parametric models, as we do for everything in this course, then what one is doing is inherently parametric, even if one uses the nonparametric bootstrap. Thus there is no benefit to using the nonparametric bootstrap for anything in this course.

A small minority of the bootstrap literature is about the *parametric bootstrap* that does use the parametric statistical model for its simulations. So this is all we will discuss here. For more on the nonparametric bootstrap, see my [Stat 3701 lecture notes](https://www.stat.umn.edu/geyer/3701/notes/bootstrap.html), some of which is repeated here.

## The Bootstrap Analogy

### The Name of the Game

The term "bootstrap" recalls the English idiom ["pull oneself up by one's bootstraps"](https://en.wiktionary.org/wiki/bootstrap).

The literal meaning of "bootstrap" in non-technical language is leather loops at the top of boots used to pull them on. So the literal meaning of "pull oneself up by one's bootstraps" is to reach down, grab your boots, and lift yourself off the ground, a physical impossibility. But, idiomatically, it doesn't mean do the physically impossible; it means something like "succeed by one's own efforts", especially when this is difficult.
The technical meaning in statistics plays off this idiom. It means to get a good approximation to the sampling distribution of an estimator without using any theory. (At least not using any theory in the computation. A great deal of very technical theory may be used in justifying the bootstrap in certain situations.)

### The Real World and the Bootstrap World

The discussion in this section is stolen from @efron-tibshirani, their Figure 8.1 and the surrounding text.

To understand the bootstrap you have to understand a simple analogy. Otherwise it is quite mysterious. I recall being mystified about it when I was a graduate student. I hope the students I teach are much less mystified because of this analogy. This appears to the untutored to be impossible or magical. But it isn't really. It is sound statistical methodology.

Let $P_\theta$ denote the true unknown probability distribution that we assume the data are a sample from. The bootstrap makes an analogy between the real world and a mythical bootstrap world.

|                               | real world                                       | bootstrap world                                  |
|:------------------------------|:------------------------------------------------:|:------------------------------------------------:|
| true unknown distribution     | $P_\theta$                                       | $P_{\hat{\theta}_n}$                             |
| true unknown parameter        | $\theta$                                         | $\hat{\theta}_n$                                 |
| data                          | $Y$ has distribution $P_\theta$                  | $Y^*$ has distribution $P_{\hat{\theta}_n}$      |
| estimator                     | $\hat{\theta}_n = t(Y)$                          | $\theta^*_n = t(Y^*)$                            |
| estimated standard error      | $s(\hat{\theta}_n)$                              | $s(\theta^*_n)$                                  |
| approximate pivotal quantity  | $(\hat{\theta}_n - \theta) / s(\hat{\theta}_n)$  | $(\theta^*_n - \hat{\theta}_n) / s(\theta^*_n)$  |

The explanation.

* In the real world we have the true unknown distribution of the data $P_\theta$. In the bootstrap world we have the "true" pretend unknown distribution of the data $P_{\hat{\theta}_n}$. Actually the distribution $P_{\hat{\theta}_n}$ is known, and that's a good thing, because it allows us to simulate data from it. But we pretend it is unknown when we are reasoning in the bootstrap world. It is the analog in the bootstrap world of the true unknown distribution $P_\theta$ in the real world.

* In the real world we have the true unknown parameter $\theta$. It is what we are trying to estimate. In the bootstrap world we have the "true" pretend unknown parameter $\hat{\theta}_n$. Actually the parameter $\hat{\theta}_n$ is known, and that's a good thing, because it allows us to see how close simulated estimators come to it. But we pretend it is unknown when we are reasoning in the bootstrap world. It is the analog in the bootstrap world of the true unknown parameter $\theta$ in the real world.

* In the real world we have data $Y$ that is assumed to have distribution $P_\theta$. In the bootstrap world we simulate data $Y^*$ that are known to have distribution $P_{\hat{\theta}_n}$.

  In the real world we have only one data set $Y$. We have to imagine its randomness – that, if the data could be collected again, it would turn out different. In the bootstrap world we can simulate as many different $Y^*$ as we like. We can actually see the distribution by making plots and compute things about the distribution by averaging over the simulations.

  The way we simulate depends on what parametric model we are assuming. Core R has functions to simulate Poisson data (`rpois`) and multinomial data (`rmultinom`). It has no methods for product multinomial, but this can easily be done using `rmultinom` and a loop (see the sketch at the end of this section).
* We have some estimator of $\theta$, which must be a *statistic*, that is, some function of the data that does not depend on the unknown parameter. In order to have the correct analogy in the bootstrap world, our estimate there must be the *same function* of the *bootstrap data*.

* Many procedures require some estimate of standard error of $\hat{\theta}_n$. Call that $\hat{s}_n$. It too must be a *statistic*, that is, some function of the data that does not depend on the unknown parameter. In order to have the correct analogy in the bootstrap world, our estimate there must be the *same function* of the *bootstrap data*.

* Many procedures use so-called *pivotal quantities*, either exact or approximate.

  An exact pivotal quantity is a function of the data and the parameter of interest whose distribution *does not depend on any parameters*. The prototypical example is the $t$ statistic
  $$
     \frac{\overline{X}_n - \mu}{s_n / \sqrt{n}}
  $$
  which has, when the data are assumed to be exactly normal, an exact $t$ distribution on $n - 1$ degrees of freedom (which does not depend on the unknown parameters $\mu$ and $\sigma$ of the distribution of the data). Note that the pivotal quantity is *a function of $\mu$* but the sampling distribution of the pivotal quantity *does not depend on $\mu$ or $\sigma$*: the $t$ distribution with $n - 1$ degrees of freedom does not have any unknown parameters.

  An asymptotic pivotal quantity is a function of the data and the parameter of interest whose asymptotic distribution *does not depend on any parameters*. The prototypical example is the $z$ statistic
  $$
     \frac{\overline{X}_n - \mu}{s_n / \sqrt{n}}
  $$
  (actually the same function of data and parameters as the $t$ statistic discussed above), which has, when the data are assumed to have any distribution with finite variance, an asymptotic standard normal distribution (which does not depend on the unknown distribution of the data). Note that the pivotal quantity is *a function of $\mu$* but the sampling distribution of the pivotal quantity *does not depend on the unknown distribution of the data*: the *standard* normal distribution does not have any unknown parameters.

  An approximate pivotal quantity is a function of the data and the parameter of interest whose sampling distribution does not depend on the unknown distribution of the data, at least not very much. Often such quantities are made by standardizing in a manner similar to those discussed above. Any time we have some purported standard errors of estimators, we can use them to make approximate pivotal quantities, like
  $$
     \frac{\hat{\theta}_n - \theta}{\hat{s}_n}
  $$
  as in the bottom left cell of the table above.

The importance of pivotal quantities in (frequentist) statistics cannot be overemphasized. They are what allow valid exact or approximate inference. When we invert the pivotal quantity to make confidence intervals, for example,
$$
   \hat{\theta}_n \pm 1.96 \cdot \hat{s}_n
$$
this is (exactly or approximately) valid *because the sampling distribution of the pivotal quantity does not depend on the true unknown distribution of the data, at least not much*. If it did depend strongly on the true distribution of the data, then our coverage could be way off, because our estimated sampling distribution of the pivotal quantity might be far from its correct sampling distribution.

More on pivotal quantities below. There are many pivotal quantities that one can use, not just those having the form of the bottom line of the table.
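Returning to the simulation point made in the list above: base R has no ready-made simulator for product multinomial sampling, but it is easy to code with `rmultinom` and a loop. Here is a minimal sketch (the stratum sample sizes and cell probabilities below are hypothetical, not data from this document).
```{r product.multinomial.sketch}
# product multinomial = one independent multinomial per stratum
nstrata <- 2
n <- c(100, 200)                  # fixed sample size for each stratum
p <- list(c(0.2, 0.3, 0.5),       # cell probabilities, stratum 1
    c(0.1, 0.4, 0.5))             # cell probabilities, stratum 2
y <- vector("list", nstrata)
for (i in 1:nstrata)
    y[[i]] <- rmultinom(1, size = n[i], prob = p[[i]])
y
```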
### Caution: Use the Correct Analogies

In the bottom right cell of the table above there is a strong tendency for naive users to replace $\hat{\theta}_n$ with $\theta$. But this is clearly incorrect. What plays the role of true unknown parameter value in the bootstrap world is $\hat{\theta}_n$ not $\theta$. The distribution that bootstrap data are simulated from has true known parameter value $\hat{\theta}_n$.

There is a lesser tendency for naive users to replace $s(\theta^*_n)$ with $s(\hat{\theta}_n)$. But this is clearly incorrect. In the real world we know that $s(\hat{\theta}_n)$ is a random quantity. We are trying to account for this randomness in the same way a $t$ distribution accounts for the randomness in the sample standard deviation. So we must use the quantity $s(\theta^*_n)$ in the bootstrap world that has similar randomness.

What one does in the bootstrap world must always perfectly mimic what one does in the real world with the sole exception that $\hat{\theta}$ is not $\theta$.

### Pivotal Quantities and Infinite Regress

Frequentist statistics has a problem of infinite regress.

* Suppose you are trying to estimate the population mean $\mu$.
* Naturally, you use the sample mean $\bar{x}_n$ as an estimator.
* If you want to know how good an estimator it is, the standard deviation of the estimator, which is $\sigma / \sqrt{n}$, tells you.
* But you don't know the population standard deviation $\sigma$ either, so that's no help.
* So we estimate $\sigma$ by the sample standard deviation $\hat{\sigma}_n$.
* But we still have a problem. We don't know how far $\hat{\sigma}_n$ is from $\sigma$.
* We can ask about the standard deviation of $\hat{\sigma}_n$, but theory can't tell us much about that. Too complicated. Theory can tell us the [asymptotic variance of $\hat{\sigma}_n^2$](https://www.stat.umn.edu/geyer/5102/slides/s1.pdf#page=100)
  $$
     \sqrt{n}(\hat{\sigma}_n^2 - \sigma^2) \stackrel{\mathcal{D}}{\longrightarrow} \text{Normal}(0, \mu_4 - \sigma^4)
  $$
  where $\mu_4$ is the population fourth central moment.
* But you don't know $\mu_4$ or $\sigma$ either, so that's no help.
* So we estimate $\mu_4$ by the sample fourth central moment
  $$
     \hat{\mu}_{4, n} = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}_n)^4
  $$
* But we still have a problem. We don't know how far $\hat{\mu}_{4, n}$ is from $\mu_4$.
* [Theory can tell us](https://www.stat.umn.edu/geyer/5102/slides/s1.pdf#page=100)
  $$
     \sqrt{n} (\hat{\mu}_{4, n} - \mu_4) \stackrel{\mathcal{D}}{\longrightarrow} \text{Normal}(0, \mu_8 - \mu_4^2 - 8 \mu_3 \mu_5 + 16 \mu_3^2 \mu_2 )
  $$
* But now we have a bunch more unknown parameters to estimate.
* And this process goes on without end.

The method of pivotal quantities and asymptotic pivotal quantities is the idea that short-circuits this infinite regress. If we just [standardize the quantity we are trying to get an asymptotic distribution of](https://www.stat.umn.edu/geyer/5102/slides/s2.pdf#page=81), then the infinite regress goes away
$$
   \frac{\bar{x}_n - \mu}{\hat{\sigma}_n / \sqrt{n}} \stackrel{\mathcal{D}}{\longrightarrow} \text{Normal}(0, 1)
$$
and the asymptotic distribution has no unknown parameters to estimate (that's what asymptotic pivotal quantity means).

But asymptotic pivotal quantities don't have to come from standardization. We know that the Wald, Wilks, and Rao test statistics are all asymptotically chi-square, and the chi-square distribution does not have any unknown parameters to estimate. So those are other asymptotically pivotal quantities.
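To make the standardization idea concrete, here is a toy simulation (not from Agresti; exponential data are just a convenient skewed example) checking that the standardized sample mean is approximately standard normal, which is what being asymptotically pivotal means in practice.
```{r pivotal.toy}
# toy check: for skewed (exponential, mean 1) data the z statistic
# is still approximately standard normal for moderately large n
set.seed(42)
n <- 100
z <- replicate(1000, {
    x <- rexp(n)
    (mean(x) - 1) / (sd(x) / sqrt(n))
})
round(c(mean = mean(z), sd = sd(z)), 3)   # should be near 0 and 1
```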
### How the Parametric Bootstrap Works

#### Hypothesis Tests

So the first thing we have to do is select an estimator of the parameter under the null hypothesis (because that is the distribution we are supposed to use when computing a hypothesis test). So let $\hat{\theta}_n$ be the MLE under the null hypothesis (or any other good estimate).

The next thing we have to do is select an asymptotically pivotal quantity, say the likelihood ratio test statistic.

Now we simulate many datasets $Y^*$ from the distribution of the data under $\hat{\theta}_n$. And we use these simulations to approximate the sampling distribution of the test statistic. We do not use the chi-square distribution! That is only good when $n$ is nearly infinite. Instead we use the bootstrap distribution (simulation distribution).

If the test statistic is $t(Y)$ and is asymptotically pivotal, then our bootstrap $P$-value is $\Pr\{ t(Y^*) \ge t(Y) \}$, and this probability is approximated by averaging over our bootstrap simulations.

Actually we don't recommend the naive bootstrap $P$-value just described but rather a slightly more sophisticated bootstrap $P$-value that takes into account that, asymptotically at least, $t(Y^*)$ and $t(Y)$ have the same distribution (which is the whole basis of the procedure). So if we have $n_\text{boot}$ simulated test statistics $t(Y^*_1)$, $t(Y^*_2)$, $\ldots,$ $t(Y^*_{n_\text{boot}})$, then we also have one more $t(Y)$, and we should use
$$
   \frac{\#\{ t(Y^*) \ge t(Y) \} + 1}{n_\text{boot} + 1}
$$
for the bootstrap $P$-value, where $\#$ in the numerator means number of times the inequality holds. This corrects for $n_\text{boot}$ being too small. The smallest a bootstrap $P$-value can be is $1 / (n_\text{boot} + 1)$, so you have to do a lot of bootstrap simulations to get a really small bootstrap $P$-value.

#### Confidence Intervals

There are many different schemes for nonparametric bootstrap confidence intervals. There are fewer for the parametric bootstrap, and discussions of the parametric bootstrap in textbooks are so sketchy that we do not get clear recommendations. Hence we offer the following, which is the analogue for the parametric bootstrap of what are called bootstrap $t$ procedures for the nonparametric bootstrap.

The first thing we have to do is select an estimator of the parameter. Call that $\hat{\theta}_n$.

The next thing we have to do is select an estimator of the asymptotic standard deviation of the parameter estimator. Like $\hat{\theta}_n$ this estimator depends on the data and hence is a random variable. If we are doing Wald confidence intervals, and $\hat{\theta}_n$ is the MLE, then the asymptotic variance of the MLE is inverse Fisher information. If $\hat{\theta}_n$ is a vector parameter, then inverse Fisher information is a matrix. But then the diagonal elements of this matrix are the estimated variances of the individual components of $\hat{\theta}_n$. We saw this in the [notes on likelihood computation](likelihood-compute.html#wald-intervals).

Since the estimated asymptotic standard deviation depends on the parameter, let us denote it $s(\theta)$. Then we have our estimate for the actual data $s(\hat{\theta}_n)$ and our estimate for bootstrap data is $s(\theta^*_i)$.

Now, as always in the parametric bootstrap, we simulate $n_\text{boot}$ IID realizations of the data from the distribution with parameter $\hat{\theta}_n$. The parameter estimates for these simulated data sets are, as we said above, $\theta^*_i$, $i = 1$, $\ldots,$ $n_\text{boot}$.
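In R, for Poisson log-linear models like those in this document, such estimates and estimated asymptotic standard errors are easy to extract from a fitted model object. Here is a toy sketch (the data are hypothetical, not one of our examples).
```{r se.fit.toy}
# toy sketch: estimates and estimated asymptotic standard errors
y.toy <- c(13, 17, 29, 31)
f.toy <- gl(2, 2)
gout.toy <- glm(y.toy ~ f.toy, family = poisson)
# canonical parameter scale: vcov estimates inverse Fisher information
sqrt(diag(vcov(gout.toy)))
# mean-value parameter scale: predict with se.fit = TRUE
pout.toy <- predict(gout.toy, type = "response", se.fit = TRUE)
rbind(estimate = pout.toy$fit, std.err = pout.toy$se.fit)
```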
Now we can form $n_\text{boot}$ simulations of the asymptotically pivotal quantity
$$
   z^*_i = \frac{\theta^*_i - \hat{\theta}_n}{s(\theta^*_i)}
$$
just like in the bottom row of our table about the real world and the bootstrap world.

Now comes the *tour de force*. We don't use the normal distribution to determine the critical value; in fact, we don't even have one critical value but rather two. We use the bootstrap distribution of the asymptotic pivotal quantity rather than its asymptotic distribution. We determine *from the bootstrap simulations* numbers $c_1$ and $c_2$ such that
$$
   \text{$c_1 \le z^*_i \le c_2$ exactly $1 - \alpha$ of the time}
$$
to get coverage $1 - \alpha$. For example, to get an equal-tailed 95% confidence interval, we choose $c_1$ to be the 0.025 quantile of the bootstrap distribution of the $z^*_i$ and choose $c_2$ to be the 0.975 quantile. The point is that we use the bootstrap rather than the asymptotics to find critical values.

Then our bootstrap confidence interval is
\begin{equation}
   \bigl( \hat{\theta}_n - c_2 s(\hat{\theta}_n), \hat{\theta}_n - c_1 s(\hat{\theta}_n) \bigr)
   (\#eq:boot-conf)
\end{equation}
That is, we apply the critical values (which are exact in the bootstrap world) to the real world (where they are only the closest we can do to the Right Thing).

Note that our bootstrap confidence intervals do not have the form
$$
   \text{point estimate} \pm \text{critical value} \times \text{standard error}
$$
which some intro stats books teach is the only kind of confidence interval that exists. Of course we have already seen that likelihood confidence intervals and score (Rao) confidence intervals don't have this form either. But now we see that bootstrap Wald intervals also don't have this form.

If the bootstrap distribution of the $z^*_i$ is skewed, then our bootstrap $t$ confidence intervals are skewed to account for that. But even if the bootstrap distribution of the $z^*_i$ is not skewed, our bootstrap $t$ confidence intervals account for the variability of estimating the asymptotic standard deviation.

Even if "Student" ([W. S. Gosset](https://en.wikipedia.org/wiki/William_Sealy_Gosset)), or anyone else, had never invented the $t$ distribution, this bootstrap $t$ procedure would invent it. When the population distribution is normal and we do a parametric bootstrap, the bootstrap distribution of the $z^*_i$ is Student's $t$ distribution on $n - 1$ degrees of freedom. So we do $t$ tests without knowing any theory. The bootstrap does all the theory for us.

Moreover, when the true unknown population distribution is not normal, the bootstrap figures out the analogue of the "Student" $t$ test for that population distribution, something theoretical statistics cannot do. At least, it does all the theory relevant to the true unknown population distribution. As we have seen, the description of the bootstrap does involve some theory.

Actually we don't recommend the bootstrap confidence intervals just described, or rather we recommend them but only using a very specific method of computing bootstrap critical values. Suppose we choose $n_\text{boot}$ so that the quantiles we want when multiplied by $n_\text{boot} + 1$ are integers. Then we take
\begin{align*}
   c_1 & = z^*_{(n_\text{boot} + 1) \alpha / 2}
   \\
   c_2 & = z^*_{(n_\text{boot} + 1) (1 - \alpha / 2)}
\end{align*}
where $z^*_{(k)}$ denotes the $k$-th order statistic (the $k$-th smallest of the $z^*_i$). Like the procedure we actually recommend for bootstrap $P$-values, this procedure has $+ 1$ to correct for $n_\text{boot}$ being small. But the argument is different.
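Here is a minimal sketch of this recipe, with standard normal random numbers standing in for real bootstrap pivotal quantities (in a real application the $z^*_i$ come from the bootstrap simulation, as in the example at the end of these notes).
```{r critical.values.toy}
# sketch: equal-tailed critical values as order statistics;
# nboot is chosen so that the indices below are integers
nboot.toy <- 999
z.star.toy <- rnorm(nboot.toy)   # stand-ins for bootstrap pivots
alpha <- 0.05
c1 <- sort(z.star.toy)[(nboot.toy + 1) * alpha / 2]        # 25th smallest
c2 <- sort(z.star.toy)[(nboot.toy + 1) * (1 - alpha / 2)]  # 975th smallest
c(c1, c2)
```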
The main virtue of this recommendation is that it avoids arguments about quantile estimation. R function `quantile` has nine different procedures (selected by its optional argument `type`). None of them are as simple or as good as the procedure recommended here. Of course our procedure works only for specially chosen $n_\text{boot}$ whereas R function `quantile` has to work for all sample sizes. So this procedure cannot replace R function `quantile` except in this particular application.

### Another Infinite Regress

The theory that says the bootstrap works is itself asymptotic. But we don't want to use the parametric bootstrap only when the sample size is infinite! So we still have an infinite regress, but a more tractable one.

* We decide to use the bootstrap because we are worried that the asymptotics might not "work" for the actual sample size of our actual data.
* So what if we are worried about whether the bootstrap works? Bootstrap the bootstrap! This is called a double bootstrap.
* So what if we are worried about whether the double bootstrap works? Bootstrap the double bootstrap! This is called a triple bootstrap.
* And so forth.

The double bootstrap can be useful [@geyer-constrained; @bootstrap-recycle] but is rather complicated. It is easiest to explain for hypothesis tests.

When we are worried about the bootstrap, we are worried that the bootstrap distribution of the test statistic is not exactly the same as the real world distribution of the test statistic. So our bootstrap $P$-values are not exactly correct (even if we have humongously large $n_\text{boot}$). So our bootstrap $P$-values get demoted to being just another test statistic. We take them as evidence against the null hypothesis when they are near zero, but we no longer believe the bootstrap distributions for them.

But what do we do when we doubt the distribution of the test statistic? Bootstrap it. So now we get double bootstrap $P$-values
$$
   \frac{\#\{ t(Y^{{*}{*}}) \ge t(Y^*) \} + 1}{n_\text{boot} + 1}
$$
the same as our bootstrap $P$-values except one level up. For each bootstrap estimate $\theta^*$ we are doing a bootstrap inside a bootstrap to simulate many realizations $Y^{{*}{*}}$ of the data from the distribution having parameter value $\theta^*$ and using them to calculate this double bootstrap $P$-value. @geyer-constrained gives details.

@geyer-constrained makes another important point. The bootstrap not only provides better, more accurate hypothesis tests and confidence intervals, but also provides a diagnostic tool.

* If the bootstrap gives more or less the same answers as the asymptotics, then you didn't need the bootstrap.
* If the double bootstrap gives more or less the same answers as the single bootstrap, then you didn't need the double bootstrap.

Except you need the bootstrap or double bootstrap to reach these conclusions.

I don't know if anyone has actually done a triple bootstrap. @bootstrap-recycle show how to speed up the calculation of a double bootstrap. But we leave this subject here. We won't actually do any examples of the double bootstrap.

### Parametric versus Nonparametric Bootstrap

The nonparametric bootstrap has trouble doing hypothesis tests for any data; it is mostly for confidence intervals. The nonparametric bootstrap has trouble doing any analysis of regression data. Both of these are explained in my [Stat 3701 lecture notes](https://www.stat.umn.edu/geyer/3701/notes/bootstrap.html).

The parametric bootstrap has neither of these problems. This gives another reason for always using the parametric bootstrap for parametric statistical models.
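Although, as said above, we won't do any real-data examples of the double bootstrap, the following toy sketch (hypothetical $2 \times 2$ table under Poisson sampling, with `nboot` kept unrealistically small only to keep the computing time down) may make the nested structure of the recipe concrete before we return to our example.
```{r double.bootstrap.sketch,cache=TRUE}
set.seed(42)
counts.toy <- c(10, 20, 30, 40)   # hypothetical data
f1 <- gl(2, 2)
f2 <- gl(2, 1, 4)
nboot <- 99                       # far too small for real work

# likelihood ratio test statistic for independence versus saturated
lrt.stat <- function(y) {
    gout0 <- glm(y ~ f1 + f2, family = poisson)
    gout1 <- glm(y ~ f1 * f2, family = poisson)
    anova(gout0, gout1, test = "LRT")[2, "Deviance"]
}
# single (parametric) bootstrap P-value, as described above
boot.p.value <- function(y) {
    mu.hat <- fitted(glm(y ~ f1 + f2, family = poisson))
    t.star <- replicate(nboot,
        lrt.stat(rpois(length(mu.hat), mu.hat)))
    (sum(t.star >= lrt.stat(y)) + 1) / (nboot + 1)
}
p.hat <- boot.p.value(counts.toy)
p.hat
# double bootstrap: treat the single bootstrap P-value as just another
# test statistic, small values being evidence against the null hypothesis
mu.hat <- fitted(glm(counts.toy ~ f1 + f2, family = poisson))
p.star <- replicate(nboot, boot.p.value(rpois(length(mu.hat), mu.hat)))
(sum(p.star <= p.hat) + 1) / (nboot + 1)
```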
## Return to Our Example

### Bootstrap Hypothesis Tests

#### Assuming Poisson Sampling

The method of R function `anova` for objects of class `"glm"` produced by R function `glm` has no optional argument `simulate.p.value`. So here is how we could do that for our first example.
```{r parametric.bootstrap}
# clean up R global environment
rm(list = ls())
# re-establish stuff done in first section
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
aout <- anova(gout.indep, gout.sat, test = "LRT")
```
This gets us back to where we were before we removed everything produced in our original analysis of these data.

Now we need the estimated mean values (expected cell counts) and the $P$-value for the test
```{r parametric.bootstrap.estimates}
p.value.hat <- aout[2, "Pr(>Chi)"]
p.value.hat
mu.hat <- predict(gout.indep, type = "response")
mu.hat
```
Now we can do the simulation
```{r parametric.bootstrap.bootstrap.estimates,cache=TRUE}
# set random number generator seed for reproducibility
set.seed(42)
nboot <- 999
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rpois(length(mu.hat), lambda = mu.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The code above is a bit tricky, so here is the explanation.

* Each time through the loop we simulate new data based on our best estimate of the model under the null hypothesis. We are assuming Poisson sampling, so we use R function `rpois` to simulate the data. This function vectorizes, so it works with `mu.hat` being a vector.
* The next three statements do exactly the same as the original analysis except that we use the simulated data rather than the real data.
* Then we extract the $P$-value from the result of R function `anova` and store it for future use.
* After the loop terminates, we have `nboot` simulations from the distribution of the $P$-value under the null hypothesis. We also have one more (not simulated) $P$-value that has the same distribution under the null hypothesis as the simulated $P$-values. This is `p.value.hat`, the value for the real data.
* The estimated $P$-value is then the fraction of the time the bootstrap simulated $P$-values `p.value.star` are at least as extreme as the $P$-value for the actual data `p.value.hat`.
* The last statement calculates the Monte Carlo standard error, how far off our simulation is from what we would get if we did an infinite number of simulations.

Here we use the $P$-values rather than the test statistics as asymptotically pivotal quantities, but otherwise this is as described above.

#### Assuming Multinomial Sampling

When we simulate, we are not using asymptotics. And we are using exact sampling distributions. Thus Poisson sampling and multinomial sampling are not exactly equivalent (their equivalence is an asymptotic result). So we can do the same simulation under multinomial sampling.
```{r parametric.bootstrap.bootstrap.estimates.too,cache=TRUE}
n <- sum(counts)
pi.hat <- mu.hat / n
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rmultinom(1, n, prob = pi.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The only difference between this simulation and the one before is the line
```
counts.star <- rmultinom(1, n, prob = pi.hat)
```
and the lines defining `n` to be the multinomial sample size and `pi.hat` to be the estimated multinomial parameter vector so we can use it in the statement just above.

#### Assuming Product Multinomial Sampling

For these data the Poisson sampling model seems to be correct, but just for the sake of showing what to do if the sample sizes for `malformation == "Absent"` and `malformation == "Present"` were fixed in advance, here is the same simulation under product multinomial sampling.
```{r parametric.bootstrap.bootstrap.estimates.too.too,cache=TRUE}
# Note mu.hat the same regardless of Poisson or product multinomial sampling
nAbsent <- sum(counts[malformation == "Absent"])
nPresent <- sum(counts[malformation == "Present"])
pi.hat.Absent <- mu.hat[malformation == "Absent"] / nAbsent
pi.hat.Present <- mu.hat[malformation == "Present"] / nPresent
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- double(length(pi.hat))
    counts.star[malformation == "Absent"] <-
        rmultinom(1, nAbsent, prob = pi.hat.Absent)
    counts.star[malformation == "Present"] <-
        rmultinom(1, nPresent, prob = pi.hat.Present)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```

### Bootstrap Confidence Intervals

```{r}
conf.level <- 0.95
```
We won't do all three sampling schemes, just Poisson. The changes for multinomial or product multinomial should be obvious from the above. We will do Wald intervals for the cell means (expected cell counts) in the bottom row of the table (for `malformation == "Present"`).
```{r bootstrap.confidence,cache=TRUE}
ilow <- (nboot + 1) * (1 - conf.level) / 2
ihig <- (nboot + 1) * (1 - (1 - conf.level) / 2)
c(ilow, ihig)
imalf <- malformation == "Present"
imalf
# room for all simulations
z.star <- matrix(NaN, nrow = nboot, ncol = sum(imalf))
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rpois(length(mu.hat), lambda = mu.hat)
    # fit model to bootstrap data
    gout.star <- glm(counts.star ~ malformation + drinks, family = poisson)
    # get estimates and standard errors
    pout.star <- predict(gout.star, type = "response", se.fit = TRUE)
    z.star[iboot, ] <- (pout.star$fit[imalf] - mu.hat[imalf]) /
        pout.star$se.fit[imalf]
}
```
At this point each column of the matrix `z.star` is the bootstrap distribution of the asymptotic pivotal quantity for one of our estimates. Let's look at the first of these distributions.
```{r bootstrap.hist}
hist(z.star[ , 1], probability = TRUE)
curve(dnorm, add = TRUE)
```
Hard to tell whether this is different from standard normal or not. Let's look at the last one.
```{r bootstrap.hist.too}
hist(z.star[ , 5], probability = TRUE)
curve(dnorm, add = TRUE)
```
This is more obviously skewed.

Now extract the bootstrap critical values.
```{r bootstrap.critical}
crit <- apply(z.star, 2, function(x)
    sort.int(x, partial = c(ilow, ihig))[c(ilow, ihig)])
colnames(crit) <- c("0", "< 1", "1-2", "3-5", ">= 6")
rownames(crit) <- c("lower", "upper")
crit
```
The fact that none of these are $\pm 1.96$ means the bootstrap is making some adjustment for skewness and perhaps for heavy tails, whether or not we can detect skewness in the histogram. So it seems the parametric bootstrap is working better than asymptotics (which would replace all of these critical values with $\pm 1.96$).

So now we can make the confidence intervals
```{r bootstrap.confidence.intervals.done}
# fit model to real data
gout <- glm(counts ~ malformation + drinks, family = poisson)
# get estimates and standard errors
pout <- predict(gout, type = "response", se.fit = TRUE)
foo <- rbind(lower = pout$fit[imalf] - crit[2, ] * pout$se.fit[imalf],
    upper = pout$fit[imalf] - crit[1, ] * pout$se.fit[imalf])
colnames(foo) <- colnames(crit)
foo
```
For comparison, here are the asymptotic intervals.
```{r ass.conf.int}
foo <- outer(c(-1, 1) * qnorm(0.975), pout$se.fit[imalf])
colnames(foo) <- c("0", "< 1", "1-2", "3-5", ">= 6")
rownames(foo) <- c("lower", "upper")
foo <- sweep(foo, 2, pout$fit[imalf], "+")
foo
```
The bootstrap doesn't make a big difference. But it is better than the asymptotics. Not a lot different, but different.

# Bibliography {-}