--- title: "Stat 5421 Lecture Notes: To Accompany Agresti Ch 8" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: bookdown::html_document2: number_sections: false md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML bookdown::pdf_document2: latex_engine: xelatex number_sections: false md_extensions: -tex_math_single_backslash linkcolor: blue urlcolor: blue --- # Multinomial Response ## Response (in Scare Quotes) To say we are treating one dimension of a contingency table as a "response" is the same as saying we want a product multinomial model with all of the multinomials being for the "response" variable. This means two things. * In order for product multinomial and Poisson to be the same, we must have the interaction of all other factors in the model formula. * In order to have any other term on the right-hand side of the formula, it must be "interacted with" the "response" variable. The latter is because we want to know about the effect of other terms on the "response". ## Using R Function Loglm We do the example on alligator food choice (Section 8.1.2 in Agresti) ```{r alligator.data} library(CatDataAnalysis) data(table_8.1) names(table_8.1) sapply(table_8.1, class) lapply(table_8.1, unique) ``` These data are given in really inconvenient coding. Let's recode using the variable names in Table 8.1, which we hope are given in the correct order. (We'll check that.) ```{r alligator.data.recode} foo <- transform(table_8.1, lake = factor(lake, labels = c("Hancock", "Oklawaha", "Trafford", "George")), gender = factor(gender, labels = c("Male", "Female")), size = factor(size, labels = c("<=2.3", ">2.3")), food = factor(food, labels = c("Fish", "Invertebrate", "Reptile", "Bird", "Other"))) bar <- xtabs(count ~ size + food + gender + lake, data = foo) bar ``` That matches Table 8.1 in Agresti. Note that we had to write the formula `count ~ size + food + gender + lake` for R function `xtabs` with the terms in that order to get our printout to match the table in Agresti. Now we want to consider `food` to be the "response" variable. So we fit all of the models in the first three columns of Table 8.2 in Agresti using R function `loglm` in R package `MASS` because it gives both Pearson Chi-Square and likelihood ratio test statistics. (We could also use R function `glm` but it does not give the Pearson statistics.) ```{r alligator.fits} names(dimnames(bar)) library(MASS) lout <- list() lout[["empty"]] <- loglm(~ size * gender * lake + food, bar) lout[["gender"]] <- loglm(~ size * gender * lake + food * gender, bar) lout[["size"]] <- loglm(~ size * gender * lake + food * size, bar) lout[["lake"]] <- loglm(~ size * gender * lake + food * lake, bar) lout[["size.lake"]] <- loglm(~ size * gender * lake + food * (size + lake), bar) lout[["size.gender.lake"]] <- loglm(~ size * gender * lake + food * (size + gender + lake), bar) woof <- lapply(lout, function(x) c(lrt = x$lrt, pearson = x$pearson, df = x$df)) woof <- as.data.frame(woof) woof <- as.matrix(woof) woof <- t(woof) round(woof, 1) ``` This agrees with the left half of Table 8.2 in Agresti. Agresti says LRT is more reliable than Pearson, but since the sample size is not "large" with many small cell counts, neither is accurate. If we really wanted accurate $P$-values, we would need to parametric bootstrap, [as discussed in the notes on Chapter 9](ch9.html#bootstrap). 
## Using R Function Glmbb

R function `glmbb` in R package `glmbb` fits models using R function `glm`.
It considers all models in a range of models, from a smallest model
(`little`) to a largest model (`big`).
[There will be a handout for that](glmbb.html).
But for now, we just show it without much explanation.

```{r options.too,echo=FALSE}
options(width = 120)
```

```{r alligator.fits.glmbb}
library(glmbb)
bout <- glmbb(big = count ~ lake * gender * size * food,
    little = ~ lake * gender * size + food,
    family = poisson, data = foo)
summary(bout)
```

This says that by far the best model is the one Agresti calls $L + S$.
As we have seen, the formula for this model, considered as a GLM or as a
log-linear model, is ``r summary(bout)$results$formula[1]``.

```{r options.too.too,echo=FALSE}
options(width = 80)
```

The models having AIC within 10 of the minimum AIC are, in Agresti's
notation,
\begin{gather*}
   L + S \\
   G + L + S \\
   L * S \\
   G * L + S \\
   G + L * S
\end{gather*}
In the notation of R function `glmbb` these are the models between
```{r bout.out}
bout$little
bout$big
```
(inclusive) that have AIC no larger than min(AIC) + 10.

# Ordered Categorical Data

This topic is Section 8.2 in Agresti.

We are going to use R function `polr` in R package `MASS` to do what it
calls *proportional odds logistic regression* (POLR).  Because its
definitions may differ from Agresti's, we will take its help as
authoritative.  In the "details" section of the help for this function,
it says

> This model is what Agresti [they cite the second edition of our textbook,
> whereas we are using the third edition] calls a *cumulative link* model.
> The basic interpretation is as a *coarsened* version of a latent
> variable $Y_i$ which has a logistic or normal or extreme-value or
> Cauchy distribution with scale parameter one and a linear model
> for the mean.  The ordered factor which is observed is which bin
> $Y_i$ falls into with breakpoints
> $$ \zeta_0 = -\infty < \zeta_1 < \cdots < \zeta_K = \infty $$
> This leads to the model
> $$ \mbox{logit} P(Y \le k | x) = \zeta_k - \eta $$
> with *logit* replaced by *probit* for a normal latent variable,
> and $\eta$ being the linear predictor, a linear function of the
> explanatory variables (with no intercept).  Note that it is quite
> common for other software to use the opposite sign for $\eta$ (and
> hence the coefficients `beta`).
>
> In the logistic case, the left-hand side of the last display is the
> log odds of category $k$ or less, and since these are log odds
> which differ only by a constant for different $k$, the odds are
> proportional.  Hence the term *proportional odds logistic regression*.

Comparing with equation (8.5) in the third edition of Agresti (our
textbook), we see that R function `polr` does indeed use the opposite sign
for regression coefficients from what Agresti uses (and apparently also
from what the software he uses for his examples uses).  So we had better
stick with what R does.  But we also see that, other than a possible change
of sign of regression coefficients, R function `polr` agrees with the model
discussed in Section 8.2 of Agresti.

As an example, we do Example 8.2.4 in Agresti.

```{r happiness.data}
# clean up R global environment
rm(list = ls())
data(table_8.5)
names(table_8.5)
sapply(table_8.5, class)
lapply(table_8.5, unique)
```

It is annoying that Agresti has coded these data numerically when they
should be categorical.  Let's fix that like we did for the alligator data.
```{r happiness.data.recode}
foo <- transform(table_8.5,
    race = factor(race, labels = c("White", "Black")),
    happy = ordered(happy,
        labels = c("Very.happy", "Pretty.happy", "Not too happy")))
```

Now we are ready to invoke R function `polr`.

```{r happiness.fit}
library(MASS)
pout <- polr(happy ~ race + trauma, data = foo)
summary(pout)
```

We see that the "coefficients" agree with those in the SAS output shown in
Agresti Table 8.6 except for sign, which may be due to R dropping a
different dummy variable for `race` than SAS does or to R using a different
sign convention (discussed in the quote from the help page above).  So we
must be fitting the same model.

If we want to be sure what the signs of the coefficients mean, we can look
at predictions.

```{r happiness.predict}
predict(pout, type = "probs",
    newdata = data.frame(race = c("White", "Black"), trauma = rep(2, 2)))
predict(pout, type = "probs",
    newdata = data.frame(race = rep("White", 6), trauma = 0:5))
```

The first says that, keeping `trauma` fixed at 2, there are more
`Very happy` whites and more `Not too happy` blacks.  The second says that,
keeping `race` fixed at `"White"`, more `trauma` makes for fewer
`Very happy` and more `Not too happy`.  So in both cases a positive
"coefficient" means a worse outcome, in a problem like this where later in
the order of the ordered categorical response means worse.
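As a check that we have the sign convention straight, here is a minimal
sketch (this chunk is ours; it assumes the fitted object `pout` above and
the model from the help page quote,
$\mbox{logit} P(Y \le k | x) = \zeta_k - \eta$) that recomputes the
predicted probabilities for a white respondent with `trauma = 2` by hand.

```{r happiness.by.hand}
# recompute predict(pout, type = "probs") for race = "White", trauma = 2
# using logit P(Y <= k | x) = zeta_k - eta
eta <- sum(coef(pout) * c(0, 2))     # dummy raceBlack = 0, trauma = 2
cum.prob <- plogis(pout$zeta - eta)  # P(Y <= k) at the two cutpoints
diff(c(0, cum.prob, 1))              # the three category probabilities
```

This should agree with the first row (race `"White"`) of the first
`predict` call above.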