---
title: "Stat 5421 Lecture Notes: To Accompany Agresti Ch 9"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  bookdown::html_document2:
    number_sections: true
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: true
    md_extensions: -tex_math_single_backslash
linkcolor: blue
urlcolor: blue
bibliography: ch9.bib
csl: journal-of-the-royal-statistical-society.csl
link-citations: true
---

# Section 9.1 Two-Way Tables

## Section 9.1.1 Only Main Effects

The *independence* or *homogeneity of proportions* model has mean-value parameters
$$
   \mu_{i j} = \alpha_i \beta_j
$$
or canonical parameters
\begin{equation}
   \theta_{i j} = \alpha_i + \beta_j
   (\#eq:two-way-independence)
\end{equation}
where the alphas and betas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters).

## Section 9.1.3 Interactions

The *saturated* model has completely arbitrary parameters. The mean value parameters $\mu_{i j}$ have no specified structure. The canonical parameters $\theta_{i j}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters.

## Section 9.1.5 Hierarchical Models

The hierarchical model principle says that, if you have an interaction term of a certain order, then you also have all lower-order terms (main effects and interactions) involving the same variables. For the saturated model for a two-way table this says
\begin{equation}
   \theta_{i j} = \alpha_i + \beta_j + \gamma_{i j}
   (\#eq:two-way-saturated)
\end{equation}
(but this just says $\theta_{i j}$ can be anything). This is not an identifiable parameterization.

## Identifiability

A statistical model (in general, not just in categorical data analysis) is *identifiable* if each probability distribution in the model corresponds to only one (vector) parameter value.

In exponential family models ([notes on exponential families, section on identifiability](expfam.html#identifiability)) every different mean vector corresponds to a different distribution, but different canonical parameter vectors can correspond to the same distribution.

If we are assuming *Poisson sampling* ([notes on Agresti Chapter 1, section on sampling schemes](ch1.html#statistical-models-for-categorical-data) and also [notes on sampling schemes](sampling.html)) then the canonical parameters are identifiable if and only if the mean-value parameters are (because the link function (componentwise log) is invertible).

Clearly the specification for the two-way independence model \@ref(eq:two-way-independence) is not identifiable because one can add a constant to all of the alphas and subtract the same constant from all the betas and get the same result
$$
   (\alpha_i + c) + (\beta_j - c) = \alpha_i + \beta_j
$$
Hence \@ref(eq:two-way-saturated) is also not identifiable. But we don't need to worry about identifiability when fitting models because the computer automatically takes care of it.

For example, consider the data in Table 3.8 in Agresti
```{r data.fetal.alcohol.syndrome}
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
data.frame(drinks, malformation, counts)
```
We can do the test of independence using R function `chisq.test`, which wants the data as a contingency table, so first we make one using R function `xtabs`
```{r foo}
foo <- xtabs(counts ~ malformation + drinks)
foo
```
That's ugly.
Can we force `xtabs` to keep the `drinks` variable in the right order?
```{r fugly}
drinks <- factor(drinks, levels = c("0", "< 1", "1-2", "3-5", ">= 6"))
foo <- xtabs(counts ~ malformation + drinks)
foo
```
Better. Now back to statistics
```{r foo.chisq}
chisq.test(foo)
```
But it warns "Chi-squared approximation may be incorrect", so try
```{r foo.bar}
chisq.test(foo, simulate.p.value = TRUE)
```
But we can also use R function `glm` and its helper functions to do the job.
```{r glm}
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
summary(gout.indep)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
summary(gout.sat)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
```{r glm.hide}
p.value.lrt <- anova(gout.indep, gout.sat, test = "LRT")[2, "Pr(>Chi)"]
p.value.rao <- anova(gout.indep, gout.sat, test = "Rao")[2, "Pr(>Chi)"]
p.value.chisq <- suppressWarnings(chisq.test(foo)$p.value)
p.value.lrt
p.value.rao
p.value.chisq
```
We notice a number of things.

* The $P$-value for the Rao test (`r round(p.value.rao, 5)`) is exactly equal to the one output by the chi-square test because the Pearson chi-square test is a special case of the Rao test.

* The likelihood ratio test disagrees quite strongly with the Rao test. Of course, these tests are asymptotically equivalent, but here the sample size is not "large". The total number of subjects ($`r sum(as.integer(foo))`$) is large, but the expected numbers in some cells of the contingency table are quite small, a lot less than what the rule of thumb says is necessary for valid asymptotics (at least five in each cell), so maybe that accounts for the difference.
```{r foo.expected}
suppressWarnings(chisq.test(foo)$expected)
```

* The computer (R function `glm`) knows how to deal with identifiability. It adds an intercept, so it is really using the formula
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j
$$
for the independence model and
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j + \delta_{i j}
$$
for the saturated model. Then it "drops" (sets equal to zero) one of the betas, one of the gammas, and all interaction terms corresponding to the dropped beta and the dropped gamma.
```{r gout.indep.coef}
names(coef(gout.indep))
```
It keeps the "intercept" ($\alpha$ in our notation just above). It drops one of the coefficients for the factor `malformation` (and keeps the other one). It drops one of the coefficients for the factor `drinks` (and keeps the other four). And this gives it an identifiable model.
```{r gout.sat.coef}
names(coef(gout.sat))
```
It has all the parameters of the independence model plus a lot of "interaction" parameters (the ones whose names contain colons). But it drops all the "interaction" parameters that involve parameters that were "dropped" from the independence model (it contains no "interaction" terms containing `malformationAbsent` or `drinks0`). And this gives it an identifiable model.
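As a sanity check (this code is not in Agresti, just an illustration), we can verify that the arbitrariness in the canonical parameterization does not affect the mean-value parameters: refitting the independence model with a different reference level for `drinks` changes the coefficients but not the fitted values.
```{r identifiability.check}
# refit with a different dropped (reference) level for drinks;
# the coefficients change but the mean-value parameters do not
gout.indep.relevel <- glm(counts ~ malformation + relevel(drinks, ">= 6"),
    family = poisson)
all.equal(fitted(gout.indep), fitted(gout.indep.relevel))
```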
# Section 9.2 Three-Way Tables

Three-way tables are just like two-way tables except the data have three subscripts for the three dimensions of the table. Our models are also analogs of those for two-way tables.

## Section 9.2.1 Only Main Effects

The *independence* or *homogeneity of proportions* model has mean-value parameters
$$
   \mu_{i j k} = \alpha_i \beta_j \gamma_k
$$
or canonical parameters
\begin{equation}
   \theta_{i j k} = \alpha_i + \beta_j + \gamma_k
   (\#eq:three-way-independence)
\end{equation}
where the alphas, betas, and gammas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters).

## The Saturated Model

As before, the *saturated* model has completely arbitrary parameters. The mean value parameters $\mu_{i j k}$ have no specified structure. The canonical parameters $\theta_{i j k}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters.

## Example: High School Student Survey

This is Table 9.3 in Agresti and can be found in R package `CatDataAnalysis`.
```{r high.school}
# clean up R global environment
rm(list = ls())
library(CatDataAnalysis)
data(table_9.3)
names(table_9.3)
foo <- xtabs(count ~ a + c + m, data = table_9.3)
foo
```
Since `a`, `c`, and `m` are categorical, we should perhaps make them factors. But since they each have only two values, it does not matter whether we make them factors or not. If they had more than two values, then it would be essential to make them factors.

We can make these data look nicer by copying the names out of the book
```{r high.school.dimnames}
dimnames(foo) <- list(alcohol = c("yes", "no"),
    cigarettes = c("yes", "no"),
    marijuana = c("yes", "no"))
aperm(foo, c(2, 3, 1))
```
We needed R function `aperm`, which permutes the dimensions of an array, to present the data in the same order as in the book. Of course, the order doesn't matter and neither do the `dimnames`. That is just to match the book.

Now R function `chisq.test` is useless (it only handles two-way tables), but R function `glm` has no problems. Our test of independence or homogeneity of proportions is the same, just with three "predictors" rather than two (the "predictors" are in scare quotes because they are just the `dimnames`)
```{r high.school.test}
gout.indep <- glm(count ~ a + c + m, data = table_9.3, family = poisson)
gout.sat <- glm(count ~ a * c * m, data = table_9.3, family = poisson)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
Clearly, the independence model does not fit the data. So statistics says these variables (alcohol use, cigarette use, and marijuana use) have some association. No real surprise, but the data do confirm what just about everybody assumes.

## Interactions

Our saturated model has the three-way interaction `a * c * m`, but there can also be two-way interactions. The model with all two-way interactions would have canonical parameter
$$
   \theta_{i j k} = \alpha_{i j} + \beta_{i k} + \gamma_{j k}
$$
(as before, the terms with alphas, betas, and gammas need not match up with previous uses of these Greek letters). And people who insist on the hierarchical principle would write
$$
   \theta_{i j k} = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k}
$$
but both formulas specify the same model and neither is identifiable.

So let's look at that model.
```{r high.school.all.two.way}
gout.all.two.way <- glm(count ~ (a + c + m)^2, data = table_9.3,
    family = poisson)
anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
```
```{r high.school.p.values}
foompter <- anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
foompter.p.values <- foompter[-1, "Pr(>Chi)"]
foompter.p.values
```
This says the two-way interactions model (model 2) fits as well as the three-way interactions model (model 3, the saturated model) because $P = `r round(foompter.p.values[2], 4)`$ is not statistically significant (not even close).

And it says the two-way interactions model (model 2) fits much better than the main effects only model (model 1, the independence model) because $P < 2 \times 10^{-16}$ is highly statistically significant (by anyone's standards).

We should not be overly impressed by very, very small $P$-values because asymptotic approximation gives only small absolute errors rather than small relative errors. All this very, very small $P$-value says is that an exact calculation (if we could do it, which we cannot) would be something small, say $P < 0.001$. But there is no reason to believe the $P < 2 \times 10^{-16}$ that R prints is correct to within even several orders of magnitude.

## Many Models

There are many hierarchical models here, because we need not include all of the main effects or all two-way interactions. The hierarchical principle says that if we have a two-way interaction, then we must have the corresponding main effects, but that leaves
\begin{align*}
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k} + \iota_{i j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k} + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \varepsilon_{i j}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k + \zeta_{i k}
   \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k + \eta_{j k}
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k
   \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j
   \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k
   \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k
   \\
   \theta_{i j k} & = \alpha + \beta_i
   \\
   \theta_{i j k} & = \alpha + \gamma_j
   \\
   \theta_{i j k} & = \alpha + \delta_k
   \\
   \theta_{i j k} & = \alpha
\end{align*}
(the three-way interaction term in the first line is written $\iota_{i j k}$ to avoid clashing with the canonical parameter $\theta_{i j k}$ on the left-hand side). Of course, we already know many of these models don't fit the data. We saw that the main effects only model doesn't fit. So none of its submodels can fit either. But this is a complete listing of all hierarchical models, from the saturated model (the first line) down to the intercept-only model (the last line).

How can we choose among all these models? There is a methodology for that, but this is getting a bit ahead of ourselves. There will be a handout for that. But for now, we just show it without much explanation.
```{r high.school.glmbb}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson)
summary(out)
```
This says that none of the models that do not have all three two-way interactions fit the data according to the criterion: none have AIC less than the AIC for the best model ("best" according to AIC) plus 10.

We don't know what AIC is yet, but we will learn in a later handout. For now, just consider it a criterion of goodness of fit.

If we want to see how bad some of those non-fitting models were, we can increase the cutoff.
```{r high.school.glmbb.too}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson,
    cutoff = 90)
summary(out)
```
That is a huge jump up to the next best fitting model, which (take my word for it for now) does not fit the data well at all.

# Section 9.4 Four-Way and Beyond

And so on and so forth. Higher order tables are just more complicated, but present no new issues.

## Example: Seat-Belt Use

These data are given in Table 9.8 in Agresti. They are apparently not in R package `CatDataAnalysis`.
```{r seat.belt,error=TRUE}
# clean up R global environment
rm(list = ls())
count <- c(7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
    996, 759, 973, 757, 812, 380, 1084, 513)
injury <- gl(2, 8, 16, labels = c("No", "Yes"))
gender <- gl(2, 4, 16, labels = c("Female", "Male"))
location <- gl(2, 2, 16, labels = c("Urban", "Rural"))
seat.belt <- gl(2, 1, 16, labels = c("No", "Yes"))
data.frame(gender, location, seat.belt, injury, count)
xtabs(count ~ seat.belt + injury + location + gender)
```
```{r options.width,echo=FALSE}
opt.save <- options(width = 250)
```
Looks OK.
```{r seat.belt.glmbb}
out <- glmbb(count ~ seat.belt * injury * location * gender,
    family = "poisson")
summary(out)
```
Now there are several models that fit these data fairly well. We will leave it at that for now.
```{r options.width.too,echo=FALSE}
options(opt.save)
```

# The Parametric Bootstrap {#bootstrap}

## Relevant and Irrelevant Simulation

### Irrelevant

Most statisticians think a statistics paper isn't really a statistics paper or a statistics talk isn't really a statistics talk if it doesn't have simulations demonstrating that the methods proposed work great (at least on some toy problems).

IMHO, this is nonsense. Simulations of the kind most statisticians do *prove* nothing. The toy problems used are often very special and do not stress the methods at all. In fact, they may be (consciously or unconsciously) chosen to make the methods look good.

In scientific experiments, we know how to use randomization, blinding, and other techniques to avoid biasing the results. AFAIK, analogous things are never done with simulations.

When all of the toy problems simulated are *very different* from the statistical model you intend to use for your data, what could the simulation study possibly tell you that is relevant? Nothing.

Hence, for short, your humble author calls all of these millions of simulation studies statisticians have done *irrelevant simulation*.

### Relevant

But there is a well-known methodology of *relevant simulation*, except that it isn't called that. It is called the *bootstrap*. Its idea is that, for each statistical model and each data set to which it is applied, one should do a simulation study of this model on data of this form.

But there is a problem: the fundamental problem of statistics, that $\hat{\theta}$ is not $\theta$.
To be truly relevant we should simulate from the true unknown distribution of the data, but we don't know what that is. (If we did, we wouldn't need statistics.)

So as a second-best choice we have to simulate from our best estimate of the true unknown distribution, the one corresponding to the parameter value $\hat{\theta}$ if that is the best estimator we know.

But we know that is the [Wrong Thing](http://www.catb.org/jargon/html/W/Wrong-Thing.html). So we have to be sophisticated about this. We have to arrange what we do with our simulations to come as close to the [Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html) as possible.

And bootstrap theory and methods are extraordinarily sophisticated with many different methods of coming very close to the [Right Thing](http://www.catb.org/jargon/html/R/Right-Thing.html).

## R Packages and Textbooks

There are two well-known R packages concerned with the bootstrap. They go with two well-known textbooks.

* R package `boot` [@boot-package] is an R recommended package that is installed by default in every installation of R. As the package description says, it goes with the textbook @davison-hinkley.

* The CRAN package `bootstrap` [@bootstrap-package] goes with, as its package description says, the textbook @efron-tibshirani.

  The package description also says that "new projects should preferentially use the recommended package '`boot`'". But I do not agree. The package maintainer is neither Efron nor Tibshirani, and I do not think they would agree. Whatever the politics of the R core team that make the `boot` package "recommended", they have nothing to do with the quality of the packages or of the textbooks they go with. If you like Efron and Tibshirani (1993), you should be using the R package `bootstrap` that goes with it.

These authors range from moderately famous (for a statistician) to very, very famous (for a statistician). Efron is the inventor of the term *bootstrap* in its statistical meaning.

## Kinds of Bootstrap

Almost all of the bootstrap literature is about the *nonparametric bootstrap* that does not assume any parametric statistical model for the data. Most of the content of the textbooks cited above, and both of the R packages cited above, are about this methodology.

But if one is using parametric models, as we do for everything in this course, then what one is doing is inherently parametric, even if one uses the nonparametric bootstrap. Thus there is no benefit to using the nonparametric bootstrap for anything in this course.

A small minority of the bootstrap literature is about the *parametric bootstrap* that does use the parametric statistical model for its simulations. So this is all we will discuss here. For more on the nonparametric bootstrap, see my [Stat 3701 lecture notes](https://www.stat.umn.edu/geyer/3701/notes/bootstrap.html), some of which is repeated here.

## The Bootstrap Analogy

### The Name of the Game

The term "bootstrap" recalls the English idiom ["pull oneself up by one's bootstraps"](https://en.wiktionary.org/wiki/bootstrap).

The literal meaning of "bootstrap" in non-technical language is leather loops at the top of boots used to pull them on. So the literal meaning of "pull oneself up by one's bootstraps" is to reach down, grab your boots, and lift yourself off the ground, a physical impossibility. But, idiomatically, it doesn't mean do the physically impossible; it means something like "succeed by one's own efforts", especially when this is difficult.
The technical meaning in statistics plays off this idiom. It means to get a good approximation to the sampling distribution of an estimator without using any theory. (At least not using any theory in the computation. A great deal of very technical theory may be used in justifying the bootstrap in certain situations.)

### The Real World and the Bootstrap World

The discussion in this section is stolen from @efron-tibshirani, their Figure 8.1 and the surrounding text.

To understand the bootstrap you have to understand a simple analogy. Otherwise it is quite mysterious. I recall being mystified about it when I was a graduate student. I hope the students I teach are much less mystified because of this analogy. This appears to the untutored to be impossible or magical. But it isn't really. It is sound statistical methodology.

Let $P_\theta$ denote the true unknown probability distribution that we assume the data are a sample from. The bootstrap makes an analogy between the real world and a mythical bootstrap world.

|                               | real world                                       | bootstrap world                                  |
|:------------------------------|:------------------------------------------------:|:------------------------------------------------:|
| true unknown distribution     | $P_\theta$                                       | $P_{\hat{\theta}_n}$                             |
| true unknown parameter        | $\theta$                                         | $\hat{\theta}_n$                                 |
| data                          | $Y$ has distribution $P_\theta$                  | $Y^*$ has distribution $P_{\hat{\theta}_n}$      |
| estimator                     | $\hat{\theta}_n = t(Y)$                          | $\theta^*_n = t(Y^*)$                            |
| estimated standard error      | $s(\hat{\theta}_n)$                              | $s(\theta^*_n)$                                  |
| approximate pivotal quantity  | $(\hat{\theta}_n - \theta) / s(\hat{\theta}_n)$  | $(\theta^*_n - \hat{\theta}_n) / s(\theta^*_n)$  |

The explanation.

* In the real world we have the true unknown distribution of the data $P_\theta$. In the bootstrap world we have the "true" pretend unknown distribution of the data $P_{\hat{\theta}_n}$. Actually the distribution $P_{\hat{\theta}_n}$ is known, and that's a good thing, because it allows us to simulate data from it. But we pretend it is unknown when we are reasoning in the bootstrap world. It is the analog in the bootstrap world of the true unknown distribution $P_\theta$ in the real world.

* In the real world we have the true unknown parameter $\theta$. It is what we are trying to estimate. In the bootstrap world we have the "true" pretend unknown parameter $\hat{\theta}_n$. Actually the parameter $\hat{\theta}_n$ is known, and that's a good thing, because it allows us to see how close simulated estimators come to it. But we pretend it is unknown when we are reasoning in the bootstrap world. It is the analog in the bootstrap world of the true unknown parameter $\theta$ in the real world.

* In the real world we have data $Y$ that is assumed to have distribution $P_\theta$. In the bootstrap world we simulate data $Y^*$ that are known to have distribution $P_{\hat{\theta}_n}$.

  In the real world we have only one data set $Y$. We have to imagine its randomness – that, if the data could be collected again, it would turn out different. In the bootstrap world we can simulate as many different $Y^*$ as we like. We can actually see the distribution by making plots and compute things about the distribution by averaging over the simulations.

  The way we simulate depends on what parametric model we are assuming. Core R has functions to simulate Poisson data (`rpois`) and multinomial data (`rmultinom`). It has no methods for product multinomial, but this can easily be done using `rmultinom` and a loop (see the sketch at the end of this section).
* We have some estimator of $\theta$, which must be a *statistic*, that is, some function of the data that does not depend on the unknown parameter. In order to have the correct analogy in the bootstrap world, our estimate there must be the *same function* of the *bootstrap data*.

* Many procedures require some estimate of standard error of $\hat{\theta}_n$. Call that $\hat{s}_n$. It too must be a *statistic*, that is, some function of the data that does not depend on the unknown parameter. In order to have the correct analogy in the bootstrap world, our estimate there must be the *same function* of the *bootstrap data*.

* Many procedures use so-called *pivotal quantities*, either exact or approximate.

  An exact pivotal quantity is a function of the data and the parameter of interest whose distribution *does not depend on any parameters*. The prototypical example is the $t$ statistic
  $$
     \frac{\overline{X}_n - \mu}{s_n / \sqrt{n}}
  $$
  which has, when the data are assumed to be exactly normal, an exact $t$ distribution on $n - 1$ degrees of freedom (which does not depend on the unknown parameters $\mu$ and $\sigma$ of the distribution of the data). Note that the pivotal quantity is *a function of $\mu$* but the sampling distribution of the pivotal quantity *does not depend on $\mu$ or $\sigma$*: the $t$ distribution with $n - 1$ degrees of freedom does not have any unknown parameters.

  An asymptotic pivotal quantity is a function of the data and the parameter of interest whose asymptotic distribution *does not depend on any parameters*. The prototypical example is the $z$ statistic
  $$
     \frac{\overline{X}_n - \mu}{s_n / \sqrt{n}}
  $$
  (actually the same function of data and parameters as the $t$ statistic discussed above), which has, when the data are assumed to have any distribution with finite variance, an asymptotic standard normal distribution (which does not depend on the unknown distribution of the data). Note that the pivotal quantity is *a function of $\mu$* but the sampling distribution of the pivotal quantity *does not depend on the unknown distribution of the data*: the *standard* normal distribution does not have any unknown parameters.

  An approximate pivotal quantity is a function of the data and the parameter of interest whose sampling distribution does not depend on the unknown distribution of the data, at least not very much. Often such quantities are made by standardizing in a manner similar to those discussed above. Any time we have some purported standard errors of estimators, we can use them to make approximate pivotal quantities, like
  $$
     \frac{\hat{\theta}_n - \theta}{\hat{s}_n}
  $$
  as in the bottom left cell of the table above.

The importance of pivotal quantities in (frequentist) statistics cannot be overemphasized. They are what allow valid exact or approximate inference. When we invert the pivotal quantity to make confidence intervals, for example,
$$
   \hat{\theta}_n \pm 1.96 \cdot \hat{s}_n
$$
this is (exactly or approximately) valid *because the sampling distribution of the pivotal quantity does not depend on the true unknown distribution of the data, at least not much*. If it did depend strongly on the true distribution of the data, then our coverage could be way off, because our estimated sampling distribution of the pivotal quantity might be far from its correct sampling distribution.

More on pivotal quantities below. There are many pivotal quantities that one can use, not just those having the form of the bottom line of the table.
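Returning to the simulation point made in the list above: base R has no ready-made simulator for product multinomial sampling, but it is easy to code with `rmultinom` and a loop. Here is a minimal sketch (the stratum sample sizes and cell probabilities below are hypothetical, not data from this document).
```{r product.multinomial.sketch}
# product multinomial = one independent multinomial per stratum
nstrata <- 2
n <- c(100, 200)                  # fixed sample size for each stratum
p <- list(c(0.2, 0.3, 0.5),       # cell probabilities, stratum 1
    c(0.1, 0.4, 0.5))             # cell probabilities, stratum 2
y <- vector("list", nstrata)
for (i in 1:nstrata)
    y[[i]] <- rmultinom(1, size = n[i], prob = p[[i]])
y
```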
### Caution: Use the Correct Analogies

In the bottom right cell of the table above there is a strong tendency for naive users to replace $\hat{\theta}_n$ with $\theta$. But this is clearly incorrect. What plays the role of true unknown parameter value in the bootstrap world is $\hat{\theta}_n$ not $\theta$. The distribution that bootstrap data are simulated from has true known parameter value $\hat{\theta}_n$.

There is a lesser tendency for naive users to replace $s(\theta^*_n)$ with $s(\hat{\theta}_n)$. But this is clearly incorrect. In the real world we know that $s(\hat{\theta}_n)$ is a random quantity. We are trying to account for this randomness in the same way a $t$ distribution accounts for the randomness in the sample standard deviation. So we must use the quantity $s(\theta^*_n)$ in the bootstrap world that has similar randomness.

What one does in the bootstrap world must always perfectly mimic what one does in the real world with the sole exception that $\hat{\theta}$ is not $\theta$.

### Pivotal Quantities and Infinite Regress

Frequentist statistics has a problem of infinite regress.

* Suppose you are trying to estimate the population mean $\mu$.
* Naturally, you use the sample mean $\bar{x}_n$ as an estimator.
* If you want to know how good an estimator it is, the standard deviation of the estimator, which is $\sigma / \sqrt{n}$, tells you.
* But you don't know the population standard deviation $\sigma$ either, so that's no help.
* So we estimate $\sigma$ by the sample standard deviation $\hat{\sigma}_n$.
* But we still have a problem. We don't know how far $\hat{\sigma}_n$ is from $\sigma$.
* We can ask about the standard deviation of $\hat{\sigma}_n$, but theory can't tell us much about that. Too complicated. Theory can tell us the [asymptotic variance of $\hat{\sigma}_n^2$](https://www.stat.umn.edu/geyer/5102/slides/s1.pdf#page=100)
  $$
     \sqrt{n}(\hat{\sigma}_n^2 - \sigma^2) \stackrel{\mathcal{D}}{\longrightarrow} \text{Normal}(0, \mu_4 - \sigma^4)
  $$
  where $\mu_4$ is the population fourth central moment.
* But you don't know $\mu_4$ or $\sigma$ either, so that's no help.
* So we estimate $\mu_4$ by the sample fourth central moment
  $$
     \hat{\mu}_{4, n} = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}_n)^4
  $$
* But we still have a problem. We don't know how far $\hat{\mu}_{4, n}$ is from $\mu_4$.
* [Theory can tell us](https://www.stat.umn.edu/geyer/5102/slides/s1.pdf#page=100)
  $$
     \sqrt{n} (\hat{\mu}_{4, n} - \mu_4) \stackrel{\mathcal{D}}{\longrightarrow} \text{Normal}(0, \mu_8 - \mu_4^2 - 8 \mu_3 \mu_5 + 16 \mu_3^2 \mu_2 )
  $$
* But now we have a bunch more unknown parameters to estimate.
* And this process goes on without end.

The method of pivotal quantities and asymptotic pivotal quantities is the idea that short-circuits this infinite regress. If we just [standardize the quantity we are trying to get an asymptotic distribution of](https://www.stat.umn.edu/geyer/5102/slides/s2.pdf#page=81), then the infinite regress goes away
$$
   \frac{\bar{x}_n - \mu}{\hat{\sigma}_n / \sqrt{n}} \stackrel{\mathcal{D}}{\longrightarrow} \text{Normal}(0, 1)
$$
and the asymptotic distribution has no unknown parameters to estimate (that's what asymptotic pivotal quantity means).

But asymptotic pivotal quantities don't have to come from standardization. We know that the Wald, Wilks, and Rao test statistics are all asymptotically chi-square, and the chi-square distribution does not have any unknown parameters to estimate. So those are other asymptotically pivotal quantities.
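To make the standardization idea concrete, here is a toy simulation (not from Agresti; exponential data are just a convenient skewed example) checking that the standardized sample mean is approximately standard normal, which is what being asymptotically pivotal means in practice.
```{r pivotal.toy}
# toy check: for skewed (exponential, mean 1) data the z statistic
# is still approximately standard normal for moderately large n
set.seed(42)
n <- 100
z <- replicate(1000, {
    x <- rexp(n)
    (mean(x) - 1) / (sd(x) / sqrt(n))
})
round(c(mean = mean(z), sd = sd(z)), 3)   # should be near 0 and 1
```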
### How the Parametric Bootstrap Works

#### Hypothesis Tests

So the first thing we have to do is select an estimator of the parameter under the null hypothesis (because that is the distribution we are supposed to use when computing a hypothesis test). So let $\hat{\theta}_n$ be the MLE under the null hypothesis (or any other good estimate).

The next thing we have to do is select an asymptotically pivotal quantity, say the likelihood ratio test statistic.

Now we simulate many datasets $Y^*$ from the distribution of the data under $\hat{\theta}_n$. And we use these simulations to approximate the sampling distribution of the test statistic. We do not use the chi-square distribution! That is only good when $n$ is nearly infinite. Instead we use the bootstrap distribution (simulation distribution).

If the test statistic is $t(Y)$ and is asymptotically pivotal, then our bootstrap $P$-value is $\Pr\{ t(Y^*) \ge t(Y) \}$, and this probability is approximated by averaging over our bootstrap simulations.

Actually we don't recommend the naive bootstrap $P$-value just described but rather a slightly more sophisticated bootstrap $P$-value that takes into account that, asymptotically at least, $t(Y^*)$ and $t(Y)$ have the same distribution (which is the whole basis of the procedure). So if we have $n_\text{boot}$ simulated test statistics $t(Y^*_1)$, $t(Y^*_2)$, $\ldots,$ $t(Y^*_{n_\text{boot}})$, then we also have one more $t(Y)$, and we should use
$$
   \frac{\#\{ t(Y^*) \ge t(Y) \} + 1}{n_\text{boot} + 1}
$$
for the bootstrap $P$-value, where $\#$ in the numerator means number of times the inequality holds. This corrects for $n_\text{boot}$ being too small. The smallest a bootstrap $P$-value can be is $1 / (n_\text{boot} + 1)$, so you have to do a lot of bootstrap simulations to get a really small bootstrap $P$-value.

#### Confidence Intervals

There are many different schemes for nonparametric bootstrap confidence intervals. There are fewer for the parametric bootstrap, and discussions of the parametric bootstrap in textbooks are so sketchy that we do not get clear recommendations. Hence we offer the following, which is the analogue for the parametric bootstrap of what are called bootstrap $t$ procedures for the nonparametric bootstrap.

The first thing we have to do is select an estimator of the parameter. Call that $\hat{\theta}_n$.

The next thing we have to do is select an estimator of the asymptotic standard deviation of the parameter estimator. Like $\hat{\theta}_n$ this estimator depends on the data and hence is a random variable. If we are doing Wald confidence intervals, and $\hat{\theta}_n$ is the MLE, then the asymptotic variance of the MLE is inverse Fisher information. If $\hat{\theta}_n$ is a vector parameter, then inverse Fisher information is a matrix. But then the diagonal elements of this matrix are the estimated variances of the individual components of $\hat{\theta}_n$. We saw this in the [notes on likelihood computation](likelihood-compute.html#wald-intervals).

Since the estimated asymptotic standard deviation depends on the parameter, let us denote it $s(\theta)$. Then we have our estimate for the actual data $s(\hat{\theta}_n)$ and our estimate for bootstrap data is $s(\theta^*_i)$.

Now, as always in the parametric bootstrap, we simulate $n_\text{boot}$ IID realizations of the data from the distribution with parameter $\hat{\theta}_n$. The parameter estimates for these simulated data sets are, as we said above, $\theta^*_i$, $i = 1$, $\ldots,$ $n_\text{boot}$.
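In R, for Poisson log-linear models like those in this document, such estimates and estimated asymptotic standard errors are easy to extract from a fitted model object. Here is a toy sketch (the data are hypothetical, not one of our examples).
```{r se.fit.toy}
# toy sketch: estimates and estimated asymptotic standard errors
y.toy <- c(13, 17, 29, 31)
f.toy <- gl(2, 2)
gout.toy <- glm(y.toy ~ f.toy, family = poisson)
# canonical parameter scale: vcov estimates inverse Fisher information
sqrt(diag(vcov(gout.toy)))
# mean-value parameter scale: predict with se.fit = TRUE
pout.toy <- predict(gout.toy, type = "response", se.fit = TRUE)
rbind(estimate = pout.toy$fit, std.err = pout.toy$se.fit)
```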
Now we can form $n_\text{boot}$ simulations of the asymptotically pivotal quantity
$$
   z^*_i = \frac{\theta^*_i - \hat{\theta}_n}{s(\theta^*_i)}
$$
just like in the bottom row of our table about the real world and the bootstrap world.

Now comes the *tour de force*. We don't use the normal distribution to determine the critical value; in fact, we don't even have one critical value but rather two. We use the bootstrap distribution of the asymptotic pivotal quantity rather than its asymptotic distribution. We determine *from the bootstrap simulations* numbers $c_1$ and $c_2$ such that
$$
   \text{$c_1 \le z^*_i \le c_2$ exactly $1 - \alpha$ of the time}
$$
to get coverage $1 - \alpha$. For example, to get an equal-tailed 95% confidence interval, we choose $c_1$ to be the 0.025 quantile of the bootstrap distribution of the $z^*_i$ and choose $c_2$ to be the 0.975 quantile. The point is that we use the bootstrap rather than the asymptotics to find critical values.

Then our bootstrap confidence interval is
\begin{equation}
   \bigl( \hat{\theta}_n - c_2 s(\hat{\theta}_n), \hat{\theta}_n - c_1 s(\hat{\theta}_n) \bigr)
   (\#eq:boot-conf)
\end{equation}
That is, we apply the critical values (which are exact in the bootstrap world) to the real world (where they are only the closest we can do to the Right Thing).

Note that our bootstrap confidence intervals do not have the form
$$
   \text{point estimate} \pm \text{critical value} \times \text{standard error}
$$
which some intro stats books teach is the only kind of confidence interval that exists. Of course we have already seen that likelihood confidence intervals and score (Rao) confidence intervals don't have this form either. But now we see that bootstrap Wald intervals also don't have this form.

If the bootstrap distribution of the $z^*_i$ is skewed, then our bootstrap $t$ confidence intervals are skewed to account for that. But even if the bootstrap distribution of the $z^*_i$ is not skewed, our bootstrap $t$ confidence intervals account for the variability of estimating the asymptotic standard deviation.

Even if "Student" ([W. S. Gosset](https://en.wikipedia.org/wiki/William_Sealy_Gosset)), or anyone else, had never invented the $t$ distribution, this bootstrap $t$ procedure would invent it. When the population distribution is normal and we do a parametric bootstrap, the bootstrap distribution of the $z^*_i$ is Student's $t$ distribution on $n - 1$ degrees of freedom. So we do $t$ tests without knowing any theory. The bootstrap does all the theory for us.

Moreover, when the true unknown population distribution is not normal, the bootstrap figures out the analogue of the "Student" $t$ test for that population distribution, something theoretical statistics cannot do. At least, it does all the theory relevant to the true unknown population distribution. As we have seen, the description of the bootstrap does involve some theory.

Actually we don't recommend the bootstrap confidence intervals just described, or rather we recommend them but only using a very specific method of computing bootstrap critical values. Suppose we choose $n_\text{boot}$ so that the quantiles we want when multiplied by $n_\text{boot} + 1$ are integers. Then we take
\begin{align*}
   c_1 & = z^*_{(n_\text{boot} + 1) \alpha / 2}
   \\
   c_2 & = z^*_{(n_\text{boot} + 1) (1 - \alpha / 2)}
\end{align*}
where $z^*_{(k)}$ denotes the $k$-th order statistic (the $k$-th smallest of the $z^*_i$). Like the procedure we actually recommend for bootstrap $P$-values, this procedure has $+ 1$ to correct for $n_\text{boot}$ being small. But the argument is different.
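Here is a minimal sketch of this recipe, with standard normal random numbers standing in for real bootstrap pivotal quantities (in a real application the $z^*_i$ come from the bootstrap simulation, as in the example at the end of these notes).
```{r critical.values.toy}
# sketch: equal-tailed critical values as order statistics;
# nboot is chosen so that the indices below are integers
nboot.toy <- 999
z.star.toy <- rnorm(nboot.toy)   # stand-ins for bootstrap pivots
alpha <- 0.05
c1 <- sort(z.star.toy)[(nboot.toy + 1) * alpha / 2]        # 25th smallest
c2 <- sort(z.star.toy)[(nboot.toy + 1) * (1 - alpha / 2)]  # 975th smallest
c(c1, c2)
```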
The main virtue of this recommendation is that it avoids arguments about quantile estimation. R function `quantile` has nine different procedures (selected by its optional argument `type`). None of them are as simple or as good as the procedure recommended here. Of course our procedure works only for specially chosen $n_\text{boot}$ whereas R function `quantile` has to work for all sample sizes. So this procedure cannot replace R function `quantile` except in this particular application.

### Another Infinite Regress

The theory that says the bootstrap works is itself asymptotic. But we don't want to use the parametric bootstrap only when the sample size is infinite! So we still have an infinite regress, but a more tractable one.

* We decide to use the bootstrap because we are worried that the asymptotics might not "work" for the actual sample size of our actual data.
* So what if we are worried about whether the bootstrap works? Bootstrap the bootstrap! This is called a double bootstrap.
* So what if we are worried about whether the double bootstrap works? Bootstrap the double bootstrap! This is called a triple bootstrap.
* And so forth.

The double bootstrap can be useful [@geyer-constrained; @bootstrap-recycle] but is rather complicated. It is easiest to explain for hypothesis tests.

When we are worried about the bootstrap, we are worried that the bootstrap distribution of the test statistic is not exactly the same as the real world distribution of the test statistic. So our bootstrap $P$-values are not exactly correct (even if we have humongously large $n_\text{boot}$). So our bootstrap $P$-values get demoted to being just another test statistic. We take them as evidence against the null hypothesis when they are near zero, but we no longer believe the bootstrap distributions for them.

But what do we do when we doubt the distribution of the test statistic? Bootstrap it. So now we get double bootstrap $P$-values
$$
   \frac{\#\{ t(Y^{{*}{*}}) \ge t(Y^*) \} + 1}{n_\text{boot} + 1}
$$
the same as our bootstrap $P$-values except one level up. For each bootstrap estimate $\theta^*$ we are doing a bootstrap inside a bootstrap to simulate many realizations $Y^{{*}{*}}$ of the data from the distribution having parameter value $\theta^*$ and using them to calculate this double bootstrap $P$-value. @geyer-constrained gives details.

@geyer-constrained makes another important point. The bootstrap not only provides better, more accurate hypothesis tests and confidence intervals, but also provides a diagnostic tool.

* If the bootstrap gives more or less the same answers as the asymptotics, then you didn't need the bootstrap.
* If the double bootstrap gives more or less the same answers as the single bootstrap, then you didn't need the double bootstrap.

Except you need the bootstrap or double bootstrap to reach these conclusions.

I don't know if anyone has actually done a triple bootstrap. @bootstrap-recycle show how to speed up the calculation of a double bootstrap. But we leave this subject here. We won't actually do any examples of the double bootstrap.

### Parametric versus Nonparametric Bootstrap

The nonparametric bootstrap has trouble doing hypothesis tests for any data; it is mostly for confidence intervals. The nonparametric bootstrap has trouble doing any analysis of regression data. Both of these are explained in my [Stat 3701 lecture notes](https://www.stat.umn.edu/geyer/3701/notes/bootstrap.html).

The parametric bootstrap has neither of these problems. This gives another reason for always using the parametric bootstrap for parametric statistical models.
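Although, as said above, we won't do any real-data examples of the double bootstrap, the following toy sketch (hypothetical $2 \times 2$ table under Poisson sampling, with `nboot` kept unrealistically small only to keep the computing time down) may make the nested structure of the recipe concrete before we return to our example.
```{r double.bootstrap.sketch,cache=TRUE}
set.seed(42)
counts.toy <- c(10, 20, 30, 40)   # hypothetical data
f1 <- gl(2, 2)
f2 <- gl(2, 1, 4)
nboot <- 99                       # far too small for real work

# likelihood ratio test statistic for independence versus saturated
lrt.stat <- function(y) {
    gout0 <- glm(y ~ f1 + f2, family = poisson)
    gout1 <- glm(y ~ f1 * f2, family = poisson)
    anova(gout0, gout1, test = "LRT")[2, "Deviance"]
}
# single (parametric) bootstrap P-value, as described above
boot.p.value <- function(y) {
    mu.hat <- fitted(glm(y ~ f1 + f2, family = poisson))
    t.star <- replicate(nboot,
        lrt.stat(rpois(length(mu.hat), mu.hat)))
    (sum(t.star >= lrt.stat(y)) + 1) / (nboot + 1)
}
p.hat <- boot.p.value(counts.toy)
p.hat
# double bootstrap: treat the single bootstrap P-value as just another
# test statistic, small values being evidence against the null hypothesis
mu.hat <- fitted(glm(counts.toy ~ f1 + f2, family = poisson))
p.star <- replicate(nboot, boot.p.value(rpois(length(mu.hat), mu.hat)))
(sum(p.star <= p.hat) + 1) / (nboot + 1)
```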
## Return to Our Example

### Bootstrap Hypothesis Tests

#### Assuming Poisson Sampling

The method of R function `anova` for objects of class `"glm"` produced by R function `glm` has no optional argument `simulate.p.value`. So here is how we could do that for our first example.
```{r parametric.bootstrap}
# clean up R global environment
rm(list = ls())
# re-establish stuff done in first section
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
aout <- anova(gout.indep, gout.sat, test = "LRT")
```
This gets us back to where we were before we removed everything produced in our original analysis of these data.

Now we need the estimated mean values (expected cell counts) and the $P$-value for the test
```{r parametric.bootstrap.estimates}
p.value.hat <- aout[2, "Pr(>Chi)"]
p.value.hat
mu.hat <- predict(gout.indep, type = "response")
mu.hat
```
Now we can do the simulation
```{r parametric.bootstrap.bootstrap.estimates,cache=TRUE}
# set random number generator seed for reproducibility
set.seed(42)
nboot <- 999
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rpois(length(mu.hat), lambda = mu.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The code above is a bit tricky, so here is the explanation.

* Each time through the loop we simulate new data based on our best estimate of the model under the null hypothesis. We are assuming Poisson sampling, so we use R function `rpois` to simulate the data. This function vectorizes, so it works with `mu.hat` being a vector.
* The next three statements do exactly the same as the original analysis except that we use the simulated data rather than the real data.
* Then we extract the $P$-value from the result of R function `anova` and store it for future use.
* After the loop terminates, we have `nboot` simulations from the distribution of the $P$-value under the null hypothesis. We also have one more (not simulated) $P$-value that has the same distribution under the null hypothesis as the simulated $P$-values. This is `p.value.hat`, the value for the real data.
* The estimated $P$-value is then the fraction of the time the bootstrap simulated $P$-values `p.value.star` are at least as extreme as the $P$-value for the actual data `p.value.hat`.
* The last statement calculates the Monte Carlo standard error, how far off our simulation is from what we would get if we did an infinite number of simulations.

Here we use the $P$-values rather than the test statistics as asymptotically pivotal quantities, but otherwise this is as described above.

#### Assuming Multinomial Sampling

When we simulate, we are not using asymptotics. And we are using exact sampling distributions. Thus Poisson sampling and multinomial sampling are not exactly equivalent (their equivalence is an asymptotic result). So we can do the same simulation under multinomial sampling.
```{r parametric.bootstrap.bootstrap.estimates.too,cache=TRUE}
n <- sum(counts)
pi.hat <- mu.hat / n
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rmultinom(1, n, prob = pi.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The only difference between this simulation and the one before is the line
```
counts.star <- rmultinom(1, n, prob = pi.hat)
```
and the lines defining `n` to be the multinomial sample size and `pi.hat` to be the estimated multinomial parameter vector so we can use it in the statement just above.

#### Assuming Product Multinomial Sampling

For these data the Poisson sampling model seems to be correct, but just for the sake of showing what to do if the sample sizes for `malformation == "Absent"` and `malformation == "Present"` were fixed in advance, here is the same simulation under product multinomial sampling.
```{r parametric.bootstrap.bootstrap.estimates.too.too,cache=TRUE}
# Note mu.hat the same regardless of Poisson or product multinomial sampling
nAbsent <- sum(counts[malformation == "Absent"])
nPresent <- sum(counts[malformation == "Present"])
pi.hat.Absent <- mu.hat[malformation == "Absent"] / nAbsent
pi.hat.Present <- mu.hat[malformation == "Present"] / nPresent
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- double(length(pi.hat))
    counts.star[malformation == "Absent"] <-
        rmultinom(1, nAbsent, prob = pi.hat.Absent)
    counts.star[malformation == "Present"] <-
        rmultinom(1, nPresent, prob = pi.hat.Present)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```

### Bootstrap Confidence Intervals

```{r}
conf.level <- 0.95
```
We won't do all three sampling schemes, just Poisson. The changes for multinomial or product multinomial should be obvious from the above. We will do Wald intervals for the cell means (expected cell counts) in the bottom row of the table (for `malformation == "Present"`).
```{r bootstrap.confidence,cache=TRUE}
ilow <- (nboot + 1) * (1 - conf.level) / 2
ihig <- (nboot + 1) * (1 - (1 - conf.level) / 2)
c(ilow, ihig)
imalf <- malformation == "Present"
imalf
# room for all simulations
z.star <- matrix(NaN, nrow = nboot, ncol = sum(imalf))
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rpois(length(mu.hat), lambda = mu.hat)
    # fit model to bootstrap data
    gout.star <- glm(counts.star ~ malformation + drinks, family = poisson)
    # get estimates and standard errors
    pout.star <- predict(gout.star, type = "response", se.fit = TRUE)
    z.star[iboot, ] <- (pout.star$fit[imalf] - mu.hat[imalf]) /
        pout.star$se.fit[imalf]
}
```
At this point each column of the matrix `z.star` is the bootstrap distribution of the asymptotic pivotal quantity for one of our estimates. Let's look at the first of these distributions.
```{r bootstrap.hist}
hist(z.star[ , 1], probability = TRUE)
curve(dnorm, add = TRUE)
```
Hard to tell whether this is different from standard normal or not. Let's look at the last one.
```{r bootstrap.hist.too}
hist(z.star[ , 5], probability = TRUE)
curve(dnorm, add = TRUE)
```
This is more obviously skewed.

Now extract the bootstrap critical values.
```{r bootstrap.critical}
crit <- apply(z.star, 2, function(x)
    sort.int(x, partial = c(ilow, ihig))[c(ilow, ihig)])
colnames(crit) <- c("0", "< 1", "1-2", "3-5", ">= 6")
rownames(crit) <- c("lower", "upper")
crit
```
The fact that none of these are $\pm 1.96$ means the bootstrap is making some adjustment for skewness and perhaps for heavy tails, whether or not we can detect skewness in the histogram. So it seems the parametric bootstrap is working better than asymptotics (which would replace all of these critical values with $\pm 1.96$).

So now we can make the confidence intervals
```{r bootstrap.confidence.intervals.done}
# fit model to real data
gout <- glm(counts ~ malformation + drinks, family = poisson)
# get estimates and standard errors
pout <- predict(gout, type = "response", se.fit = TRUE)
foo <- rbind(lower = pout$fit[imalf] - crit[2, ] * pout$se.fit[imalf],
    upper = pout$fit[imalf] - crit[1, ] * pout$se.fit[imalf])
colnames(foo) <- colnames(crit)
foo
```
For comparison, here are the asymptotic intervals.
```{r ass.conf.int}
foo <- outer(c(-1, 1) * qnorm(0.975), pout$se.fit[imalf])
colnames(foo) <- c("0", "< 1", "1-2", "3-5", ">= 6")
rownames(foo) <- c("lower", "upper")
foo <- sweep(foo, 2, pout$fit[imalf], "+")
foo
```
The bootstrap doesn't make a big difference. But it is better than the asymptotics. Not a lot different, but different.

# Bibliography {-}