\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage{indentfirst}
\usepackage{url}
\usepackage[utf8]{inputenc}
\begin{document}

\begin{raggedright}
\Large Stat 5421 Lecture Notes \\
\textbf{Simple Chi-Square Tests for Contingency Tables} \\
Charles J. Geyer \\
\today
\par
\end{raggedright}

<>=
options(keep.source = TRUE, width = 60)
@

\section{One-Way Contingency Table}

The data set read in by the R function \texttt{read.table} below simulates 6000 rolls of a fair die (singular of dice). We test the null hypothesis that all six cells of the contingency table have the same probability versus the alternative hypothesis that the cell probabilities are not all equal.

<>=
foo <- read.table(
    url("http://www.stat.umn.edu/geyer/5421/mydata/multi-simple-1.txt"),
    header = TRUE)
print(foo)
@

\subsection{Pearson Chi-Squared Test Statistic}

<>=
out <- chisq.test(foo$y)
print(out)
@

The default test statistic for the R function \texttt{chisq.test} is the Pearson chi-squared test statistic
\begin{subequations}
\begin{equation} \label{eq:xsq-one}
X^2 = \sum_\text{all cells of table} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}.
\end{equation}
We write this in more mathematical notation as follows. Let $I$ denote the index set for the contingency table, which is one-dimensional in this example but may have any dimension in general; letting the subscript $i$ range over an abstract set rather than a range of numbers means we do not need a different formula for every dimension of table. Let $x_i$ denote the observed count for cell $i$, let $\tilde{\pi}_i$ denote the probability for cell $i$ under the null hypothesis, and let $n$ denote the sample size (so the random vector having components $x_i$ is multinomial with sample size $n$ and parameter vector having components $\tilde{\pi}_i$). Then the Pearson test statistic is
\begin{equation} \label{eq:xsq-two}
X^2 = \sum_{i \in I} \frac{(x_i - n \tilde{\pi}_i)^2}{n \tilde{\pi}_i}.
\end{equation}
\end{subequations}

The Pearson chi-square test statistic is a special case of the Rao test statistic (also called score test and Lagrange multiplier test; see the likelihood handout for its definition). Consequently, its asymptotic distribution is chi-square with degrees of freedom equal to the difference of the dimensions of the models being compared. Here the null model is completely specified (no adjustable parameter), so it has dimension zero, and the multinomial model has dimension $k - 1$ if the table has $k$ cells, because the probabilities must sum to one, that is, $\sum_{i \in I} \pi_i = 1$, so specifying $k - 1$ parameters determines the last one. Hence the degrees of freedom of the asymptotic chi-square distribution is $k - 1$. Here $k = 6$, so the degrees of freedom is $5$, which is what the printout of the \texttt{chisq.test} function says.

\subsection{Likelihood Ratio Test Statistic}

We can also do a Wilks test (also called the likelihood ratio test).

<>=
Gsq <- 2 * sum(out$observed * log(out$observed / out$expected))
print(Gsq)
Gsq.pval <- pchisq(Gsq, out$parameter, lower.tail = FALSE)
print(Gsq.pval)
@

The probability mass function (PMF) for the multinomial distribution is
$$
f_{\boldsymbol{\pi}}(\mathbf{x}) = \binom{n}{\mathbf{x}} \prod_{i \in I} \pi_i^{x_i},
$$
where the thingy with the round brackets is a multinomial coefficient. Since that does not contain the parameter, we can drop it in the likelihood
$$
L(\boldsymbol{\pi}) = \prod_{i \in I} \pi_i^{x_i},
$$
so the log likelihood is
$$
l(\boldsymbol{\pi}) = \sum_{i \in I} x_i \log(\pi_i)
$$
(in this handout we are using the convention that boldface denotes a vector and normal font its components, so $\boldsymbol{\pi}$ is the vector having components $\pi_i$).
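The arithmetic done in the R chunk above is simple enough to sanity-check in any language. Here is a minimal sketch in Python; the counts are made up for illustration (they are not the data read in by \texttt{read.table} above), and the equal-probability null is assumed.

```python
import math

# hypothetical counts for 6000 rolls of a die (illustration only,
# not the data read in by read.table above)
x = [1025, 991, 987, 1012, 1008, 977]
n = sum(x)
expected = [n / 6] * 6  # n * pi-tilde under the equal-probability null

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
Xsq = sum((o - e) ** 2 / e for o, e in zip(x, expected))

# Wilks (likelihood ratio) statistic: 2 * sum of observed * log(observed / expected)
Gsq = 2 * sum(o * math.log(o / e) for o, e in zip(x, expected))

print(Xsq)  # 1.612 for these counts
print(Gsq)  # close to Xsq: the two tests are asymptotically equivalent
```

For counts this close to their expected values the two statistics agree to about three decimal places, which previews the comparison done with the real data below.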
If $\hat{\boldsymbol{\pi}}$ and $\tilde{\boldsymbol{\pi}}$ are the maximum likelihood estimates (MLE) in the alternative and null hypotheses, respectively, then the Wilks statistic is
\begin{equation} \label{eq:multi-lrt}
G^2 = 2 \bigl[ l(\hat{\boldsymbol{\pi}}) - l(\tilde{\boldsymbol{\pi}}) \bigr] = 2 \sum_{i \in I} x_i \log\left(\frac{\hat{\pi}_i}{\tilde{\pi}_i}\right).
\end{equation}
We already know the MLE in the null hypothesis (completely specified, all cell probabilities the same). To determine the MLE in the alternative hypothesis, we need to solve the likelihood equations for the multinomial distribution, and this is a bit tricky because of the degeneracy of the multinomial. We have to write one parameter as a function of the others to get down to $k - 1$ freely adjustable variables. Fix $i^* \in I$ and eliminate that variable, writing $I^* = I \setminus \{ i^* \}$, so
$$
l(\boldsymbol{\pi}) = \sum_{i \in I^*} x_i \log(\pi_i) + x_{i^*} \log\left( 1 - \sum_{i \in I^*} \pi_i \right)
$$
so for $i \in I^*$
$$
\frac{\partial l(\boldsymbol{\pi})}{\partial \pi_i} = \frac{x_i}{\pi_i} - \frac{x_{i^*}}{1 - \sum_{i \in I^*} \pi_i} = \frac{x_i}{\pi_i} - \frac{x_{i^*}}{\pi_{i^*}}.
$$
Setting the derivatives to zero and solving for the parameters gives
\begin{equation} \label{eq:pie-one}
\pi_i = x_i \cdot \frac{\pi_{i^*}}{x_{i^*}}, \qquad i \in I^*.
\end{equation}
In order to make progress we have to figure out what $\pi_{i^*}$ is. The only facts we have at our disposal to do that are that probabilities sum to one and cell counts sum to $n$.
So we first note that \eqref{eq:pie-one} trivially holds when $i = i^*$, so summing both sides of \eqref{eq:pie-one} over all $i$ gives
$$
\sum_{i \in I} \pi_i = \frac{\pi_{i^*}}{x_{i^*}} \sum_{i \in I} x_i
$$
or
$$
1 = \frac{\pi_{i^*}}{x_{i^*}} \cdot n
$$
or
$$
\pi_{i^*} = \frac{x_{i^*}}{n}
$$
and plugging this into \eqref{eq:pie-one} and writing $\hat{\pi}_i$ rather than $\pi_i$, because our solution is the MLE, gives
\begin{equation} \label{eq:pie-two}
\hat{\pi}_i = \frac{x_i}{n}, \qquad i \in I.
\end{equation}
So we have found that the unrestricted MLE for the multinomial distribution is just the vector of observed cell counts divided by the sample size. This is, of course, the obvious estimator, but we didn't know it was the MLE until we did the math.

We now also rewrite \eqref{eq:multi-lrt}, gratuitously inserting $n$'s in the numerator and denominator so it uses the quantities used in the Pearson statistic
\begin{equation} \label{eq:multi-lrt-too}
\begin{split}
G^2 & = 2 \sum_{i \in I} x_i \log\left(\frac{n \hat{\pi}_i}{n \tilde{\pi}_i}\right) \\
& = 2 \sum_\text{all cells of table} \text{observed} \cdot \log\left(\frac{\text{observed}}{\text{expected}}\right)
\end{split}
\end{equation}
And that explains the calculation done in R. The tests are asymptotically equivalent, so it is no surprise that the test statistics are very similar (\Sexpr{round(out$statistic, 4)} and \Sexpr{round(Gsq, 4)}), as are the $P$-values (\Sexpr{round(out$p.value, 4)} and \Sexpr{round(Gsq.pval, 4)}). Since the $P$-values are not small, the null hypothesis is not rejected. The die seems fair.

\section{One-Way Contingency Table with Parameters Estimated}

The data set read in by the R function \texttt{read.table} below simulates 6000 rolls of an unfair die of the type known as six-ace flats. The one and six faces, which are on opposite sides of the die, are shaved slightly, so the other faces have smaller area and smaller probability.
<>=
foo <- read.table(
    url("http://www.stat.umn.edu/geyer/5421/mydata/multi-simple-2.txt"),
    header = TRUE)
foo
@

\subsection{Naive Tests}

We start by testing the hypothesis that all six cells of the contingency table have the same probability (just like in the preceding section).

<>=
y <- foo$y
### Pearson chi-square test statistic
out <- chisq.test(y)
print(out)
### likelihood ratio test statistic
Gsq <- 2 * sum(out$observed * log(out$observed / out$expected))
print(Gsq)
pchisq(Gsq, out$parameter, lower.tail = FALSE)
@

The $P$-values are lower than before, so perhaps slightly suggestive that the null hypothesis is false, but still much larger than 0.05 (not that your humble author is suggesting that 0.05 is the dividing line between statistical significance and statistical non-significance, but these are not very low $P$-values). Hence we could again conclude that the die seems fair. But now that would be wrong, because we haven't even fit the six-ace hypothesis yet!

\subsection{Likelihood Ratio Test}

A calculation similar to the one in the preceding section (which we will mercifully not do) says that the MLE for the six-ace hypothesis estimates the cell probabilities for the six and ace cells to be the average of the observed cell frequencies for these two cells
$$
\hat{\pi}_1 = \hat{\pi}_6 = \frac{x_1 + x_6}{2 n}
$$
(because they are assumed to have the same area) and similarly for the other cells
$$
\hat{\pi}_2 = \hat{\pi}_3 = \hat{\pi}_4 = \hat{\pi}_5 = \frac{x_2 + x_3 + x_4 + x_5}{4 n}
$$
(because they are assumed to have the same area). This is the MLE in the alternative hypothesis. This model has dimension one, because if we know $\hat{\pi}_1$, then we can figure out all of the other cell probabilities from the constraint that probabilities sum to one and the equality constraints above. As before, the null hypothesis is the completely specified (hence dimension zero) hypothesis that all cells have equal probability.
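The pooled-cell MLE displayed above is easy to cross-check in another language. Here is a sketch in Python; the counts are hypothetical (an illustration of the formulas, not the course data), with the six and ace cells made deliberately heavy.

```python
import math

# hypothetical counts, heavier on the six and ace cells (illustration only)
x = [1050, 970, 980, 975, 985, 1040]
n = sum(x)

# MLE under the six-ace hypothesis: pool the six and ace cells,
# and pool the other four cells, as in the displayed formulas
pihat = [0.0] * 6
pihat[0] = pihat[5] = (x[0] + x[5]) / (2 * n)
for i in (1, 2, 3, 4):
    pihat[i] = (x[1] + x[2] + x[3] + x[4]) / (4 * n)

# MLE under the null hypothesis (fair die)
pitilde = [1 / 6] * 6

# likelihood ratio statistic, which is asymptotically chi-square on
# one degree of freedom (model dimensions 1 and 0)
Gsq = 2 * sum(xi * math.log(ph / pt) for xi, ph, pt in zip(x, pihat, pitilde))

print(sum(pihat))  # 1.0 up to rounding: the estimate is a probability vector
print(Gsq)
```

The statistic is compared to a chi-square distribution on one degree of freedom, exactly as the R chunk below does with the real data.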
So the test statistics (Wald, Wilks, and Rao) are approximately chi-square on one degree of freedom. The likelihood ratio test is <>= nrolls <- sum(y) phat0 <- rep(1 / 6, 6) phat1 <- rep(NA, 6) phat1[c(1, 6)] <- sum(y[c(1, 6)]) / 2 / nrolls phat1[- c(1, 6)] <- sum(y[- c(1, 6)]) / 4 / nrolls print(phat1) ### likelihood ratio test statistic Gsq <- 2 * sum(y * log(phat1 / phat0)) print(Gsq) Gsq.pval <- pchisq(Gsq, 1, lower.tail = FALSE) print(Gsq.pval) @ Now $P = \Sexpr{round(Gsq.pval, 4)}$ is moderately strong evidence that the null hypothesis (fair die) is false and the alternative (six-ace flat) is true. The moral of the story: you cannot conclude anything about a hypothesis (model) until you have actually fit that hypothesis (model). \subsection{Rao Test} The Rao test depends on what we take to be the parameter of the six-ace hypothesis, say the probability $\alpha$ for the six-ace cells. If the cell probabilities are $(\alpha, \beta, \beta, \beta, \beta, \alpha)$, then $$ \beta = \frac{1 - 2 \alpha}{4} $$ and $$ l(\alpha) = (x_1 + x_6) \log(\alpha) + (x_2 + x_3 + x_4 + x_5) \log(\beta) $$ so \begin{align*} l'(\alpha) & = \frac{x_1 + x_6}{\alpha} + \frac{x_2 + x_3 + x_4 + x_5}{\beta} \cdot \frac{d \beta}{d \alpha} \\ & = \frac{x_1 + x_6}{\alpha} - \frac{x_2 + x_3 + x_4 + x_5}{2 \beta} \end{align*} and \begin{align*} l''(\alpha) & = - \frac{x_1 + x_6}{\alpha^2} - \frac{x_2 + x_3 + x_4 + x_5}{\beta^2} \cdot \left(\frac{d \beta}{d \alpha}\right)^2 \\ & = - \frac{x_1 + x_6}{\alpha^2} - \frac{x_2 + x_3 + x_4 + x_5}{4 \beta^2} \end{align*} so observed Fisher information is $$ J(\alpha) = \frac{x_1 + x_6}{\alpha^2} + \frac{x_2 + x_3 + x_4 + x_5}{4 \beta^2} $$ and expected Fisher information is \begin{align*} I(\alpha) & = \frac{n \alpha + n \alpha}{\alpha^2} + \frac{n \beta + n \beta + n \beta + n \beta}{4 \beta^2} \\ & = \frac{2 n}{\alpha} + \frac{n}{\beta} \end{align*} Hence the Rao test statistic is $$ R = l'(\tilde{\alpha})^2 / I(\tilde{\alpha}) $$ where 
$\tilde{\alpha}$ is the value specified by the null hypothesis (which is 1 / 6). <>= p0 <- phat0[1] rao <- (((y[1] + y[6]) - sum(y[2:5]) / 2) / p0)^2 / (3 * sum(y) / p0) print(rao) rao.pval <- pchisq(rao, 1, lower.tail = FALSE) print(rao.pval) @ \subsection{Wald Test} The Wald test depends on what we take to be the constraint function, say $$ g(\alpha) = \alpha - \tilde{\alpha} $$ (using the notation of the preceding section). Then the Wald test statistic is $$ W = (\hat{\alpha} - \tilde{\alpha})^2 I(\hat{\alpha}) $$ <>= alpha.hat <- phat1[1] alpha.twiddle <- phat0[1] beta.hat <- (1 - 2 * alpha.hat) / 4 wald <- (alpha.hat - alpha.twiddle)^2 * sum(y) * (2 / alpha.hat + 1 / beta.hat) print(wald) wald.pval <- pchisq(wald, 1, lower.tail = FALSE) print(wald.pval) @ \subsection{Summary} <>= fred <- matrix(c(Gsq, rao, wald, Gsq.pval, rao.pval, wald.pval), ncol = 2) rownames(fred) <- c("Wilks", "Rao", "Wald") colnames(fred) <- c("statistic", "p-value") round(fred, 4) @ IMHO the likelihood ratio test is easier here. But your choice. They all say approximately the same thing. \section{Two-Dimensional Tables} The data set read in by the R function \texttt{read.table} below <>= foo <- read.table( url("http://www.stat.umn.edu/geyer/5421/mydata/multi-simple-3.txt"), header = TRUE) foo @ has a count ``response'' variable and two categorical ``predictor'' variables that can be used for a regression-like formula. But these data can also be turned into a two-dimensional contingency table as follows. <>= yarray <- xtabs(y ~ color + opinion, data = foo) print(yarray) @ We want to do the simplest hypothesis test for two-dimensional contingency tables, which is the only one covered in intro statistics books. 
It is called the test of independence (of the random variables whose values are the row and column labels) if both margins are random (that is, we just collect individuals, and the number in each cell is random with no constraints), or it is called the test of homogeneity of proportions if either the row margin or the column margin (but not both) is fixed in advance. In the jargon of categorical data analysis, the first (independence) is multinomial sampling, and the second (homogeneity) is product multinomial sampling. The latter name comes from the fact that if we fix (condition on) the row totals, then each row is a multinomial random vector that is independent of the others (with sample size that is the row total), and when things are independent probabilities multiply (hence ``product'').

We are also going to use a third sampling scheme, that the cells of the table are independent Poisson. One goes from Poisson to multinomial by conditioning on the sum over all cells. One goes from multinomial to product multinomial by conditioning on the row totals (or the column totals, but not both).

It turns out (this is a mysterious fact that will be cleared up in some other handout) that the MLE expected cell counts are the same for all three sampling schemes, hence the likelihood ratio test statistics, the Pearson chi-square statistics, and the degrees of freedom of their asymptotic chi-square distributions are \emph{exactly the same} for all three sampling schemes. So it doesn't matter which one we use for calculations.
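For the test of independence the MLE expected cell counts work out to (row total)(column total)$/ n$, and the degrees of freedom are $(r - 1)(c - 1)$ for an $r \times c$ table. A minimal sketch in Python with a made-up $2 \times 2$ table (an illustration only, not the data read in above):

```python
# hypothetical 2 x 2 table (illustration only)
table = [[30, 20],
         [20, 30]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(table[i][j] for i in range(2)) for j in range(2)]

# MLE expected counts under independence: (row total)(column total) / n
expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
            for i in range(2)]

# Pearson chi-square statistic and its degrees of freedom (r - 1)(c - 1)
Xsq = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
          for i in range(2) for j in range(2))
df = (2 - 1) * (2 - 1)

print(expected)  # every cell is 25.0 for this table
print(Xsq)       # 4.0
print(df)        # 1
```

The same expected counts would come out of any of the three sampling schemes, which is the point of the preceding paragraph.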
\subsection{Likelihood Ratio Test} First use the Poisson sampling model and R functions \texttt{glm} and \texttt{anova} <>= out1 <- glm(y ~ color + opinion, family = poisson, data = foo) out2 <- glm(y ~ color * opinion, family = poisson, data = foo) aout <- anova(out1, out2, test = "Chisq") print(aout) Gsq <- aout[["Deviance"]][2] Gsq.pval <- aout[["Pr(>Chi)"]][2] Gsq.df <- aout[["Df"]][2] @ \subsection{Pearson Chi-Square Test} Second use the multinomial sampling and R function \texttt{chisq.test} <>= cout <- chisq.test(yarray) print(cout) names(cout) Xsq <- cout$statistic Xsq.pval <- cout$p.value Xsq.df <- cout$parameter @ \subsection{Summary} <>= fred <- matrix(c(Gsq, Xsq, Gsq.pval, Xsq.pval), ncol = 2) rownames(fred) <- c("Wilks", "Pearson") colnames(fred) <- c("statistic", "p-value") round(fred, 4) @ Again, your choice. \subsection{Check} Let us check some of the assertions that these do exactly the same thing (asymptotically). First check that \texttt{anova.glm} and \texttt{chisq.test} agree on degrees of freedom. <>= Xsq.df == Gsq.df @ Second check that \texttt{anova.glm} and \texttt{chisq.test} agree on expected values for all cells of the contingency table. <>= glm.expected <- predict(out1, type = "response") glm.expected <- xtabs(glm.expected ~ color + opinion, data = foo) all.equal(as.vector(glm.expected), as.vector(cout$expected)) @ Third check that if we calculated the likelihood ratio test statistic from the result of \texttt{chisq.test} it would agree with what was produced by \texttt{anova.glm}. <>= all.equal(Gsq, 2 * sum(cout$observed * log(cout$observed / cout$expected))) @ Finally check that if we calculated the Pearson $X^2$ statistic from the result of \texttt{glm} it would agree with what was produced by \texttt{chisq.test}. <>= # redo glm.expected so it is in same order as in glm glm.expected <- predict(out1, type = "response") all.equal(as.numeric(Xsq), sum((foo$y - glm.expected)^2 / glm.expected)) @ Everything checks. 
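The last check needed the expected counts put back in the same order as the observed ones because both statistics are sums over cells: as long as observed and expected stay paired cell by cell, the order in which the cells are listed is irrelevant. A quick sketch in Python (hypothetical numbers, nothing to do with the data above):

```python
import math

# hypothetical observed counts and matching expected counts for six cells
observed = [30.0, 20.0, 25.0, 20.0, 30.0, 25.0]
expected = [25.0, 25.0, 25.0, 25.0, 25.0, 25.0]

def gsq(obs, exp):
    # Wilks statistic: 2 * sum of observed * log(observed / expected)
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp))

# reversing both vectors together permutes the cells but keeps the pairing
Gsq1 = gsq(observed, expected)
Gsq2 = gsq(observed[::-1], expected[::-1])

print(abs(Gsq1 - Gsq2))  # essentially zero: cell order does not matter
```

This is why \texttt{glm.expected} had to be recomputed in the order \texttt{glm} uses before the final check, but the value of the statistic is the same either way.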
The R functions \texttt{glm} and \texttt{chisq.test} use two different algorithms to fit the data, so the checks above would not agree exactly (would not agree if we replaced \texttt{all.equal} with \texttt{identical} in the checks above). But they do agree to within the inaccuracy of computer arithmetic when we make each use the same test statistic. \end{document}