\documentclass{article}

\usepackage{indentfirst}
\usepackage{amstext}

\begin{document}

\title{A GLM Example}
\author{Charles J. Geyer \and Ruth G. Shaw \and Stuart Wagenius}
\maketitle

As part of a research program to assess the evolutionary consequences of
extreme population fragmentation, Stuart Wagenius has conducted a field
experiment to study seedling recruitment in \emph{Echinacea angustifolia}
(purple coneflower).  The experiment was designed to test the effect of
different vegetation types and burn treatments on recruitment.  Interactions
between these factors were also of interest.

Approximately 100 seeds were sown into each plot.  In order to establish the
viability of each seed lot, germination trials were conducted in the lab on
randomly chosen lots of a known number of seeds (also around 100).  It was of
interest to take into account the results of the germination trials in the
analysis of the data from the field experiment.  The experiment was conducted
for three years, with new lots of seeds sown into a separate set of plots in
the fall of each year and seedling establishment monitored in the spring.

The data for the year 2003 can be read into R and inspected by
<<read-data>>=
mydata <- read.table("seeds.txt")
summary(mydata)
@
from which we see there are six variables in the data set: \verb@vegtype@,
\verb@burn01@, \verb@burn02@, \verb@burn03@, \verb@totalseeds@, and
\verb@seedlings@.

Working backwards from the end, \verb@seedlings@ is the number of seedlings
that sprouted.  This is the response and is assumed to be
$\text{Poisson}(n \lambda)$, where $n$ is the number of seeds sown, the
variable \verb@totalseeds@ in the data set, and $\lambda$ is some function of
the other covariates (\verb@vegtype@ and the three \verb@burn@ variables).

The variable \verb@vegtype@ indicates the type of field.  The \verb@oldfield@
part of a value (in both \verb@oldfieldcool@ and \verb@oldfieldwarm@)
indicates that the field was once used for agriculture.  The \verb@warm@ part
of a value (in both \verb@oldfieldwarm@ and \verb@plantwarm@) indicates that
warm weather grasses were growing in the field, and the \verb@cool@ part of a
value indicates cool weather grasses.

The value \verb@yes@ for the variables \verb@burn01@, \verb@burn02@, and
\verb@burn03@ indicates that a field was burned in 2001, 2002, or 2003,
respectively.  Owing to a mistake in the experimental design, \verb@burn03@
is completely confounded with \verb@vegtype@ and hence has been omitted from
the analysis (this isn't right, but it is not clear what else to do).  For
the value \verb@lab@ of these predictor variables, see Section~\ref{lab}
below.

\section{Fitting Poisson Regression Models}

The way R fits a model like this is, for example,
<<fit-additive>>=
out <- glm(seedlings ~ vegtype + burn01 + burn02 + offset(log(totalseeds)),
    data = mydata, family = poisson)
summary(out)
@

The way this works is that the mean value parameter is $n \lambda$.  The link
function we are using is ``log'' (the default for the \verb@poisson@ family),
which makes the linear predictor the same as the canonical parameter
$$
   \eta = \log(n) + \log(\lambda)
$$
Thus we see that $\log(n)$ is just a \emph{known} constant additive term in
the linear predictor.  The way R handles such a term in the linear predictor
that does \emph{not} contain an unknown parameter to fit is as an ``offset''.
Since the variable $n$ in the math formula is the variable \verb@totalseeds@
in R, the ``offset'' is \verb@offset(log(totalseeds))@.
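Because the offset merely shifts the linear predictor by the known constant
$\log(n_i)$, the fitted mean for each plot is $n_i \hat{\lambda}_i$, so the
estimated per-seed recruitment rate $\hat{\lambda}_i$ can be recovered by
dividing out the number of seeds sown.  The following is a minimal sketch of
this (not part of the original analysis; the name \verb@lambda.hat@ is just
an illustrative choice).
<<lambda-hat>>=
## fitted.values for this fit is the fitted Poisson mean n * lambda-hat,
## so dividing by totalseeds recovers the estimated per-seed rate
lambda.hat <- out$fitted.values / mydata$totalseeds
summary(lambda.hat)
@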
To summarize, the model being fit says the linear predictor for the $i$-th
case is
$$
   \eta_i = \log(n_i) + \sum_{j = 1}^k d_{i j} \beta_j
$$
where the $\beta_j$ are the regression coefficients and $d_{i j}$ is the
value of the $j$-th dummy variable for the $i$-th case.  You don't create the
dummy variables.  R does that for itself.  Clearly, from the regression
printout, it makes $k = 7$ of them for this fit.

\subsection{More on Dummy Variables} \label{lab}

There is something funny about this count of dummy variables.  Generally, if
there isn't something funny going on, there is one dummy variable for the
overall level (``intercept''), plus one less than the number of categories
for each categorical predictor.  From the summary, \verb@vegtype@ has five
categories, and \verb@burn01@ and \verb@burn02@ have three categories each.
Hence, naively, we should have $1 + 4 + 2 + 2 = 9$ dummy variables.  But
instead R makes only 7.  Why?

The 15 cases that have the value \verb@lab@ for any one of these predictor
variables have the same value for all of them
<<check-lab>>=
attach(mydata)
identical(vegtype == "lab", burn01 == "lab")
identical(vegtype == "lab", burn02 == "lab")
@
Thus there is really only one ``lab'' dummy variable rather than three (one
for each predictor).  The naive count is off by 2, and R is right.

Generally, R always does the right thing about dummy variables.  It creates
all the dummy variables and then drops variables until it gets down to a
linearly independent set (but see Section~\ref{fubar} below where it messes
this up).

The scientific reason the variables are coded this way is that for the
``lab'' cases, there are no values of these covariates, which refer to the
outdoor environment of the ``field'' cases.

\subsection{More Model Fitting}

We fit some more models and compare them.
<<fit-noveg>>=
out.noveg <- glm(seedlings ~ burn02 + burn01 + offset(log(totalseeds)),
    data = mydata, family = poisson)
summary(out.noveg)
anova(out.noveg, out, test = "Chisq")
@

From the analysis of deviance test (also called the likelihood ratio test)
we cannot drop \verb@vegtype@.  Similar analyses (not shown) show that we
cannot drop \verb@burn01@ or \verb@burn02@ either.  In each case, the fit of
the model
\begin{verbatim}
seedlings ~ vegtype + burn01 + burn02 + offset(log(totalseeds))
\end{verbatim}
is highly statistically significantly better than the model obtained by
dropping one of the predictors ($P < 10^{-5}$ in each test).
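The single-term deletion tests just mentioned (the analyses ``not shown'')
can also be obtained in one call with \verb@drop1@, which refits the model
with each term deleted in turn and does the corresponding analysis of
deviance test.  This is only a sketch of an alternative way to do those
tests, not part of the original analysis.
<<drop1-sketch>>=
## likelihood ratio (analysis of deviance) tests for dropping each of
## vegtype, burn01, and burn02 separately from the fitted model `out'
drop1(out, test = "Chisq")
@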
\subsection{R Messes Up} \label{fubar}

So how about some bigger models?  With interaction terms?  The first one the
scientists analyzing the data tried was
<<fit-fubar>>=
out.fubar <- glm(seedlings ~ burn01 + vegtype * burn02 + offset(log(totalseeds)),
    data = mydata, family = poisson)
summary(out.fubar)
@
but the scientists, on looking at the regression coefficients, thought there
was something funny about them.  Where were the interaction dummy variables?
So they asked the statistician, and after a great deal of groveling in source
code and experimenting to see what R does to different inputs, found that the
code
<<fit-interaction-try>>=
out.ok <- glm(seedlings ~ vegtype * burn02 + burn01 + offset(log(totalseeds)),
    data = mydata, family = poisson)
summary(out.ok)
@
does the right thing.

Oops!  No it doesn't.  Ack.  There shouldn't be a regression coefficient for
\verb@offset(log(totalseeds))@.  Yet another try.
<<fit-offset-arg>>=
out <- glm(seedlings ~ vegtype + burn01 + burn02,
    offset = log(totalseeds), data = mydata, family = poisson)
out.ok <- glm(seedlings ~ vegtype * burn02 + burn01,
    offset = log(totalseeds), data = mydata, family = poisson)
summary(out)
summary(out.ok)
anova(out, out.ok, test = "Chisq")
@

What a struggle!  R messed up in two different ways.  It looks like including
the ``offset'' in the ``formula'' is dangerous.  Better to include it in the
separate \verb@offset@ argument.

A lesson for programmers: having several different ways to do something is
only a good idea if you can guarantee that each is absolutely bug free!  The
R programmers would have been better off if they had never thought to make
the ``offset'' part of the ``formula'' if they couldn't get it right in all
cases.

\section{The Parametric Bootstrap}

To summarize where we are scientifically, we repeat the last command
<<repeat-anova>>=
anova(out, out.ok, test = "Chisq")
@
The big model with the \verb@vegtype * burn02@ interaction clearly fits
better according to the analysis of deviance test ($P \approx 10^{-7}$).

But can we trust the asymptotics?  Are we in ``asymptopia'' where all normal
approximations are valid?  The only way to tell whether the normal
approximation is valid is to simulate, that is, to do a \emph{parametric
bootstrap}.  We simulate data that are $\text{Poisson}(\mu)$ with $\mu$ given
by the predictions of the \emph{small model}.
<<mu-null>>=
mu.0 <- out$fitted.values
@
These are, of course, not the true unknown population parameter values, which
is what we would use to do a perfect simulation.  It is because we are using
estimated rather than true parameter values that we say we are doing a
``parametric bootstrap'' rather than a ``simulation study.''

Simulating Poisson data is trivial.  The R statement
\begin{verbatim}
rpois(length(mu.0), lambda = mu.0)
\end{verbatim}
does it.  So here's how the bootstrap goes.
<<bootstrap>>=
set.seed(42)
nboot <- 1000
p.star <- double(nboot)
for (iboot in 1:nboot) {
    y.star <- rpois(length(mu.0), lambda = mu.0)
    out.star.0 <- glm(y.star ~ vegtype + burn02 + burn01,
        offset = log(totalseeds), data = mydata, family = poisson)
    out.star.1 <- glm(y.star ~ vegtype * burn02 + burn01,
        offset = log(totalseeds), data = mydata, family = poisson)
    p.star[iboot] <- anova(out.star.0, out.star.1,
        test = "Chisq")[["P(>|Chi|)"]][2]
}
p.hat <- anova(out, out.ok, test = "Chisq")[["P(>|Chi|)"]][2]
@
This creates \verb@p.hat@, the $P$-value for the actual data, and
\verb@p.star@, the bootstrap simulation of the sampling distribution of this
$P$-value.  (In recent versions of R the $P$-value column of the analysis of
deviance table is named \verb@"Pr(>Chi)"@ rather than \verb@"P(>|Chi|)"@, so
the subscripting may need to be adjusted.)

If we are in ``asymptopia'' so that the parametric bootstrap is unnecessary,
the $P$-value is a $\text{Uniform}(0, 1)$ random variable (because the
probability that $P \le \alpha$ is supposed to be $\alpha$ for all $\alpha$).
Let's check if it is.  The following commands
<<diagplot,eval=FALSE>>=
par(mar = c(5, 4, 1, 1) + 0.1)
plot(seq(1, nboot) / (nboot + 1), sort(p.star),
    xlab = expression(alpha),
    ylab = "sorted values of the bootstrap P-value")
abline(0, 1)
@
make the diagnostic plot shown in Figure~\ref{fig:diag}.
\begin{figure}
\begin{center}
<<diagfig,fig=TRUE,echo=FALSE>>=
<<diagplot>>
@
\end{center}
\caption{Parametric Bootstrap Diagnostic Plot.  If the parametric bootstrap
is unnecessary, the points lie near the line.  Otherwise, the asymptotic
$P$-value is bogus.  (Parametric bootstrap for comparison of the small model
\texttt{vegtype + burn02 + burn01} versus the big model
\texttt{vegtype * burn02 + burn01}).}
\label{fig:diag}
\end{figure}

From Figure~\ref{fig:diag} it seems the parametric bootstrap was unnecessary.
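As a numerical complement to the eyeball check in Figure~\ref{fig:diag}, one
could also test the uniformity of the bootstrap $P$-values directly, for
example with a Kolmogorov--Smirnov test.  This is only a sketch of an extra
check, not part of the original analysis.
<<ks-check>>=
## if we are in "asymptopia" the bootstrap P-values should look
## Uniform(0, 1); a large P-value here is consistent with that
ks.test(p.star, "punif")
@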
At least for this test, we do indeed seem to be in ``asymptopia'' and can
believe that $P < 0.05$ actually means $P < 0.05$.

Just for the record, the parametric bootstrap corrected $P$-value is
<<boot-p-value>>=
mean(p.star <= p.hat)
@
Of course, this doesn't mean the $P$-value is \emph{exactly} zero.  It just
means we didn't take \verb@nboot@ large enough to get any \verb@p.star@
values smaller than \verb@p.hat@.  Given how small \verb@p.hat@ is, this is
not surprising.  All our simulation really tells us is that the true
$P$-value appears to be less than \verb@1 / nboot@, that is, $P < 10^{-3}$.

\end{document}