---
title: "Stat 5421 Notes: Sampling Schemes"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  bookdown::html_document2:
    number_sections: false
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: false
    md_extensions: -tex_math_single_backslash
linkcolor: blue
urlcolor: blue
bibliography: sampling.bib
csl: journal-of-the-royal-statistical-society.csl
link-citations: true
---

# Introduction

These notes back up what is said about sampling schemes in our
[Chapter Zero](ch0.html#sampling-dist) and also
[notes to accompany Agresti Chapter 1](ch1.html#the-relationship-of-the-various-sampling-schemes).

For once, they are not titled "lecture notes". We will not lecture on them,
because the details are not important for most applied work. They are just
here for reference. Also, some of what is done here requires theoretical
sophistication that goes beyond some of the prerequisites of this course
(although not beyond any theory course, even Stat 4101). So this is not
required reading for this course, just background.

# Three Lemmas About Conditional Probability

:::{.lemma #marginalization}
Repeated marginalization gives consistent results.
If $X$, $Y$, and $Z$ are random vectors, then calculating the marginal of
$X$ and $Y$ and then calculating the marginal of $X$ from that (in two steps)
gives the same result as calculating the marginal of $X$ directly (in one step).
:::

:::{.proof}
If $X$, $Y$, and $Z$ are discrete having joint PMF $f$, then what must be shown is
\begin{equation}
\sum_{(y, z) \in S(x)} f(x, y, z)
=
\sum_{y \in T(x)} \sum_{z \in S(x, y)} f(x, y, z)
(\#eq:marginal-of-marginal)
\end{equation}
where $S$ is the domain of $f$ and
\begin{align*}
S(x) & = \{\, (y, z) : (x, y, z) \in S \,\}
\\
S(x, y) & = \{\, z : (x, y, z) \in S \,\}
\\
T(x) & = \{\, y : (x, y, z) \in S \text{ for some } z \,\}
\end{align*}
and the sums on the two sides of \@ref(eq:marginal-of-marginal) are the same
because the sets $\{y\} \times S(x, y)$ for $y \in T(x)$ partition $S(x)$.

If any of these variables are continuous, some of the sums are replaced by
integrals, but otherwise the proof is the same.
:::

:::{.lemma #marginalization-conditionalization}
Marginalization and conditionalization can be interchanged.
If $X$, $Y$, and $Z$ are random vectors, then calculating the marginal of $X$
and $Y$ and then calculating the conditional of $X$ given $Y$ from that
(in two steps) gives the same result as calculating the conditional of $X$
and $Z$ given $Y$ and then calculating from that conditional the marginal
that is the conditional of $X$ given $Y$ (the same two steps, but in reverse
order).
:::

:::{.proof}
Using the notation of the preceding proof, what must be shown is
$$
\frac{f_{X, Y}(x, y)}{f_Y(y)}
=
\sum_{z \in S(x, y)} \frac{f(x, y, z)}{f_Y(y)}
$$
where $f_Y$ is the marginal of $Y$ and $f_{X, Y}$ is the marginal of $X$ and
$Y$. But this is obvious because $f_Y(y)$ does not depend on $z$ and hence can
be moved outside the sum on the right-hand side.

If $z$ is continuous, the sum is replaced by an integral, but otherwise the
proof is the same.
:::

:::{.lemma #conditionalization}
Repeated conditionalization gives consistent results.
If $X$, $Y$, and $Z$ are random vectors, then calculating the conditional of
$X$ and $Y$ given $Z$ and then calculating the conditional of $X$ given $Y$
and $Z$ from that (in two steps) gives the same result as calculating the
conditional of $X$ given $Y$ and $Z$ directly (in one step).
:::

:::{.proof}
Using the notation of the preceding two proofs, what must be shown is
$$
\frac{\frac{f(x, y, z)}{f_Z(z)}}{f_{Y \mid Z}(y \mid z)}
=
\frac{f(x, y, z)}{f_{Y, Z}(y, z)}
$$
but this is obvious because
$$
f_{Y \mid Z}(y \mid z) = \frac{f_{Y, Z}(y, z)}{f_Z(z)}
$$
:::

# Subvectors

In order to describe the product multinomial sampling scheme, we need the
notion of subvectors. If $y$ is a vector having index set $I$ and thus
components $y_i$ for $i \in I$, and $A$ is a subset of $I$, then we say $y_A$
is a *subvector* of $y$ having index set $A$ and components $y_i$ for $i \in A$.

This is rather odd, because convention requires that the index set of a vector
be $\{ 1, 2, \ldots, d \}$ for some positive integer $d$. But here we are
allowing arbitrary index sets. For example, we could have
$$
I = \{\, \text{cabbage}, \text{dog food}, \text{kumquats} \,\}
$$
and then the components of a vector $y$ having index set $I$ are
$y_\text{cabbage}$, $y_\text{dog food}$, and $y_\text{kumquats}$.

R caters to this idea by allowing character string indexing.
```{r}
foo <- rnorm(3)
names(foo) <- c("cabbage", "dog food", "kumquats")
foo["dog food"]
```

This is useful in categorical data analysis because we can have the index sets
consist of actual category names rather than arbitrary numbers. But it is even
more important in subvector theory because it allows us to match up components
of a vector $y$ and its subvector $y_A$. If $i \in A$, then $y_i$ means the
same thing as a component of $y$ and as a component of $y_A$.

This trick of using arbitrary index sets is not widely used but leads to much
more elegant mathematics in books such as @rockafellar and @lauritzen.

But what are subvectors if they aren't the usual notion? In advanced math
vectors are functions. More precisely, if $V$ is a vector space and $S$ is an
arbitrary set, then $V^S$ denotes the set of all functions $S \to V$, and these
functions can be considered vectors with vector addition $h = f + g$ meaning
$$
h(x) = f(x) + g(x), \qquad x \in S,
$$
and scalar multiplication $h = a f$ meaning
$$
h(x) = a f(x), \qquad x \in S,
$$
where $f$, $g$, and $h$ are elements of the vector space $V^S$ and $a$ is a
scalar (an element of the field of scalars of $V$, a real number for the
vector spaces used in statistics). This is the reason that the study of
infinite-dimensional topological vector spaces is usually called
*functional analysis*.

In this vectors-are-functions view, a vector $y$ having index set $I$ is a
function that is an element of the vector space $\mathbb{R}^I$. And this is a
finite-dimensional vector space if and only if $I$ is a finite set. We continue
to write $y_i$ for components of $y$ just to look like conventional notation.
But this is really function evaluation: $y_i$ means the same thing as $y(i)$,
the value of the function $y$ at the point $i$ in its domain (the index set is
the domain of the vector considered as a function).

In this vectors-are-functions view, a vector $y$ is a function and a subvector
$y_A$ is the restriction of this function to the subset $A$ of its domain.
Both $y$ and $y_A$ have the same rule $i \mapsto y_i$.
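Continuing the toy example above, here is a minimal sketch of a subvector in R
(nothing beyond base R subsetting of the named vector `foo` defined in the
preceding chunk): subsetting by a vector of names restricts `foo` to a subset
of its index set.
```{r subvector}
A <- c("cabbage", "kumquats")  # a subset A of the index set I
foo[A]                         # the subvector foo_A: same rule, restricted to A
names(foo[A])                  # the domain (index set) of the subvector is A
```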
But they have different domains (index sets); $y$ has domain (index set) $I$,
and $y_A$ has domain (index set) $A$, and $A \subset I$.

So that takes care of the mathematical formalities, but if you don't bother to
think of vectors and subvectors as functions but rather just as vectors with
arbitrary sets as index sets, that is OK too.

One oddity. The empty set is a possible index set. This gives us the subvector
$y_\varnothing$, which is the one and only element of the vector space
$\mathbb{R}^\varnothing$. In the vectors-are-functions view, $y_\varnothing$
is the empty function $\varnothing \to \mathbb{R}$, which has no allowed values
of its argument and hence no values. Considered as a vector, it has no
components. But it is a mathematical object.

From linear algebra we know there is only one real vector space with a finite
number of elements, and that is the zero vector space whose only element is
the zero vector. Every vector space must contain a zero vector, and the vector
space having only the zero vector does satisfy all the axioms for a vector
space. Thus $\mathbb{R}^\varnothing$ must be another notation for the zero
vector space (also called the trivial vector space). And $y_\varnothing$ must
be another notation for the zero vector (regardless of what $y$ is). Hence if
$Y$ is a random vector, $Y_\varnothing$ is a constant random vector always
equal to the zero vector of the trivial vector space $\mathbb{R}^\varnothing$.

We usually think a zero vector is one all of whose components are zero. This
is also true of $y_\varnothing$ because it does not have any components.

R does have vectors of length zero.
```{r zero}
double(0)
double(0) |> length()
```
You can think of that as the unique element of the vector space
$\mathbb{R}^\varnothing$.

# Sampling Schemes

This repeats [what is said about sampling schemes in our notes to accompany
Agresti Chapter 1](ch1.html#the-relationship-of-the-various-sampling-schemes).

 * In *Poisson sampling* the cell counts in a contingency table are assumed
   to be independent Poisson random variables.

 * In *multinomial sampling* the cell counts in a contingency table are
   assumed to be components of a multinomial random vector.

 * In *product multinomial sampling* the cell counts in a contingency table
   are components of a random vector $Y$ whose index set has a
   [partition](https://en.wikipedia.org/wiki/Partition_of_a_set) $\mathcal{A}$,
   and the subvectors $Y_A$, $A \in \mathcal{A}$, are assumed to be independent
   multinomial random vectors.

We now comment on these definitions.

In Poisson sampling the cell counts are assumed independent but not
identically distributed. The vector of mean values $\mu = E(Y)$ is the
[mean value parameter vector](expfam.html#mean-value-parameters) of the
exponential family statistical model which is this sampling scheme. Hence
different cells of the contingency table can all have different means, which
is the whole point of the model.

In multinomial sampling, the cell counts are not independent because the total
number of individuals in all cells (called the *sample size*) is not random
but rather specified in the design of the experiment (survey, whatever).
Again, this is the whole point of the model. The vector of mean values
$\mu = E(Y)$ is the
[mean value parameter vector](expfam.html#mean-value-parameters) of the
exponential family statistical model which is this sampling scheme.
But now we know that these mean values have the multinomial form
$\mu_i = n \pi_i$, where $n$ is the multinomial sample size and $\pi_i$ is the
probability of individuals being classified into cell $i$ of the contingency
table. (The probabilities sum to one because the classification is mutually
exclusive and exhaustive: every individual goes in exactly one category.)

In product multinomial sampling the elements of the partition $\mathcal{A}$
can be called *strata*, a term taken from
[stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling) in
sampling theory (this is a Latin word, singular *stratum*, plural *strata*).
In many applications the strata are all the same size (same number of cells of
the contingency table) but they do not have to be. Our notation applies to
arbitrary strata.

In product multinomial sampling, the subvectors $Y_A$ are independent, as the
definition says. Because of the
[multiplication rule for independence](ch1.html#independent-and-identically-distributed)
the joint distribution of all the cell counts factors as a product of
multinomial distributions
\begin{equation}
\begin{split}
f(y) & = \prod_{A \in \mathcal{A}} f_A(y_A)
\\
& = \prod_{A \in \mathcal{A}} \binom{n_A}{y_A} \prod_{i \in A} \pi_i^{y_i}
\end{split}
(\#eq:product-multinomial-pmf)
\end{equation}
where $\pi$ is the vector of cell probabilities for the contingency table
($\pi_i$ is the probability that an individual in stratum $A$ is classified in
cell $i$, assuming $i \in A$), and
\begin{equation}
n_A = \sum_{i \in A} Y_i
(\#eq:product-multinomial-sample-sizes)
\end{equation}
is the sample size for the multinomial random vector $Y_A$, and
$$
\binom{n_A}{y_A} = \frac{n_A !}{\prod_{i \in A} y_i !}
$$
is a multinomial coefficient.

In this sampling scheme the sample sizes $n_A$, $A \in \mathcal{A}$, are not
random but rather specified in the design of the experiment (survey,
whatever). Again, this is the whole point of the model.

The vector of mean values $\mu = E(Y)$ is the
[mean value parameter vector](expfam.html#mean-value-parameters) of the
exponential family statistical model which is this sampling scheme. But now we
know that these mean values have the product multinomial form
$\mu_i = n_A \pi_i$, where $n_A$ is the multinomial sample size and $\pi_i$ is
the probability of individuals being classified into cell $i$ of the
contingency table, where $i \in A$. Note that the probabilities do not sum to
one over the whole table but rather within strata
$$
\sum_{i \in A} \pi_i = 1, \qquad A \in \mathcal{A}.
$$

Our notation also elegantly applies to contingency tables of any dimension.
If we are working with a three-dimensional contingency table with conventional
indices $j$, $k$, $l$ (and this word is also Latin, singular *index*, plural
*indices*) we can define our index set $I$ to be a set of triples $(j, k, l)$
and then our notation works for three-dimensional tables. Or any-dimensional
tables in the same way. And it also works if we
[put the data in a data frame rather than a contingency table](ch0.html#data).
Is the power of the vectors-are-functions view becoming apparent?

Our notation also has the consequence that we really only have two sampling
schemes. The multinomial sampling scheme is a special case of the product
multinomial sampling scheme when we have the trivial partition which contains
only one element, which must be the original index set, that is,
$\mathcal{A} = \{ I \}$. But we have many product multinomial sampling schemes,
one for each partition.
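As a small numerical check of the product multinomial mean values
$\mu_i = n_A \pi_i$, here is a simulation sketch (the two strata, the cell
probabilities, and the sample sizes below are made up purely for illustration).
```{r product-multinomial-means}
set.seed(42)
# hypothetical 2 x 3 table with the rows as the strata,
# so one independent multinomial per row
pi.mat <- rbind(c(0.2, 0.3, 0.5),
                c(0.6, 0.3, 0.1))
n.strata <- c(100, 200)  # the fixed sample sizes n_A, one per stratum
nsim <- 1e4
# simulate each stratum independently and average the simulated cell counts
mu.sim <- t(sapply(1:2, function(a)
    rowMeans(rmultinom(nsim, size = n.strata[a], prob = pi.mat[a, ]))))
mu.sim                           # simulated cell means
sweep(pi.mat, 1, n.strata, "*")  # theoretical means n_A * pi_i
```
The two matrices should agree up to simulation error.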
In talking about the many different product multinomial sampling schemes, the
following terminology is useful. Consider two partitions $\mathcal{A}$ and
$\mathcal{B}$. We say $\mathcal{A}$ is *finer* than $\mathcal{B}$ if every
$A \in \mathcal{A}$ is contained in some $B \in \mathcal{B}$. (And by the
nature of partitions, each $A \in \mathcal{A}$ is then contained in a unique
$B \in \mathcal{B}$.) This same relation can be indicated by saying that
$\mathcal{B}$ is *coarser* than $\mathcal{A}$.

# Theorems about Sampling Schemes and Conditioning

:::{.theorem #poisson-to-product-multinomial}
Let $Y$ be the random vector of a Poisson sampling model having mean vector
$\mu$ and index set $I$. Let $\mathcal{A}$ be a partition of $I$. Define
product multinomial sample sizes $n_A$, $A \in \mathcal{A}$. Then the
distribution of the product multinomial sampling scheme arises by conditioning
the (Poisson sampling scheme) distribution of $Y$ on the events
$\sum_{i \in A} Y_i = n_A$, $A \in \mathcal{A}$. And the relationship between
the usual parameter vectors of these sampling schemes is $\mu_i = n_A \pi_i$,
$i \in A \in \mathcal{A}$.
:::

:::{.proof}
We know from the
[addition rule for independent Poisson random variables](ch1.html#the-poisson-distribution)
that $\sum_{i \in A} Y_i$ is again Poisson with mean $\sum_{i \in A} \mu_i$.
Hence the conditional distribution is joint over marginal
\begin{align*}
f\left(Y \,\middle|\, \sum_{i \in A} Y_i = n_A, A \in \mathcal{A}\right)
& =
\frac{\prod_{i \in I} \mu_i^{y_i}
\left. \exp(- \mu_i) \middle/ y_i ! \right.}
{\prod_{A \in \mathcal{A}} \left(\sum_{j \in A} \mu_j\right)^{\sum_{j \in A} y_j}
\left. \exp\left(- \sum_{j \in A} \mu_j\right)
\middle/ \left(\sum_{j \in A} y_j\right) ! \right.}
\\
& =
\prod_{A \in \mathcal{A}}
\frac{\prod_{i \in A} \mu_i^{y_i}
\left. \exp(- \mu_i) \middle/ y_i ! \right.}
{\left(\sum_{j \in A} \mu_j\right)^{\sum_{j \in A} y_j}
\left. \exp\left(- \sum_{j \in A} \mu_j\right)
\middle/ \left(\sum_{j \in A} y_j\right) ! \right.}
\\
& =
\prod_{A \in \mathcal{A}} \binom{n_A}{y_A}
\prod_{i \in A} \left( \frac{\mu_i}{\sum_{j \in A} \mu_j} \right)^{y_i}
\end{align*}
and the last line is the PMF of the product multinomial distribution with
success probabilities
$$
\pi_i = \frac{\mu_i}{\sum_{j \in A} \mu_j},
\qquad i \in A \in \mathcal{A}.
$$
But since $\sum_{i \in A} Y_i = n_A$ implies
$$
E\left(\sum_{i \in A} Y_i\right) = n_A = \sum_{i \in A} \mu_i
$$
we do have $n_A \pi_i = \mu_i$ as the theorem asserts.
:::

:::{.corollary #poisson-to-multinomial}
Let $Y$ be the random vector of a Poisson sampling model having mean vector
$\mu$ and index set $I$. Then the distribution of the multinomial sampling
scheme arises by conditioning the (Poisson sampling scheme) distribution of
$Y$ on the event $\sum_{i \in I} Y_i = n$, where $n$ is the multinomial sample
size. And the relationship between the usual parameter vectors of these
sampling schemes is $\mu_i = n \pi_i$, $i \in I$.
:::

:::{.proof}
This is just the special case of the theorem where $\mathcal{A}$ is the
trivial partition $\{ I \}$.
:::

:::{.theorem #product-multinomial-to-product-multinomial}
Let $Y$ be a random vector having index set $I$. Let $\mathcal{A}$ and
$\mathcal{B}$ be partitions of $I$ with $\mathcal{A}$ finer than $\mathcal{B}$.
Define the product multinomial sample sizes $n_A$, $A \in \mathcal{A}$, and
$n_B$, $B \in \mathcal{B}$, satisfying
$$
\sum_{\substack{A \in \mathcal{A} \\ A \subset B}} n_A = n_B,
\qquad B \in \mathcal{B}.
$$
Let $Y$ have the product multinomial distribution with partition $\mathcal{B}$
and usual parameter vector $\beta$. Then the conditional distribution of $Y$
given the events $\sum_{i \in A} Y_i = n_A$, $A \in \mathcal{A}$, is product
multinomial with partition $\mathcal{A}$ and usual parameter vector $\alpha$,
with $n_A \alpha_i = n_B \beta_i$, when $A \in \mathcal{A}$,
$B \in \mathcal{B}$, and $i \in A \subset B$.
:::

:::{.proof}
Define the random variables $N_A = \sum_{i \in A} Y_i$, so the conditioning in
the theorem statement is $N_A = n_A$ for $A \in \mathcal{A}$. It is obvious
that collapsing some categories of a multinomial random vector gives another
multinomial random vector with fewer categories. Hence the marginal
distribution of the $N_A$ is product multinomial with PMF
$$
\prod_{B \in \mathcal{B}}
\frac{n_B !}{\prod_{\substack{A \in \mathcal{A} \\ A \subset B}} N_A !}
\prod_{\substack{A \in \mathcal{A} \\ A \subset B}}
\left( \sum_{i \in A} \beta_i\right)^{N_A}
$$
Hence the conditional distribution is joint over marginal
\begin{align*}
f\left(Y \,\middle|\, N_A = n_A, A \in \mathcal{A}\right)
& =
\frac{
\prod_{B \in \mathcal{B}} \binom{n_B}{y_B} \prod_{i \in B} \beta_i^{y_i}
}{
\prod_{B \in \mathcal{B}}
\frac{n_B !}{\prod_{\substack{A \in \mathcal{A} \\ A \subset B}} N_A !}
\prod_{\substack{A \in \mathcal{A} \\ A \subset B}}
\left( \sum_{i \in A} \beta_i\right)^{N_A}
}
\\
& =
\prod_{A \in \mathcal{A}} \binom{n_A}{y_A}
\prod_{i \in A} \left( \frac{\beta_i}{\sum_{j \in A} \beta_j} \right)^{y_i}
\end{align*}
and the last line is the PMF of the product multinomial distribution with
success probabilities
$$
\alpha_i = \frac{\beta_i}{\sum_{j \in A} \beta_j},
\qquad i \in A \in \mathcal{A}.
$$
By the
[iterated expectation theorem](https://www.stat.umn.edu/geyer/5101/slides/s5.pdf#page=47)
we get the same unconditional expectation of $Y$ whether we use its
unconditional or conditional distribution, hence
$$
E\left(\sum_{i \in A} Y_i\right) = n_A = \sum_{i \in A} n_B \beta_i,
\qquad A \subset B \in \mathcal{B}
$$
and we do have $n_A \alpha_i = n_B \beta_i$ as the theorem asserts.
:::

:::{.corollary #multinomial-to-product-multinomial}
Let $Y$ be the random vector of a multinomial sampling model having sample
size $n$, parameter vector $\pi$, and index set $I$. Let $\mathcal{A}$ be a
partition of $I$. Then the conditional distribution of $Y$ given the events
$\sum_{i \in A} Y_i = n_A$, $A \in \mathcal{A}$, is product multinomial having
PMF given by \@ref(eq:product-multinomial-pmf).
:::

:::{.proof}
This is just the special case of the theorem where $\mathcal{B}$ is the
trivial partition $\{ I \}$.
:::

All of the theorems and corollaries in this section have obvious converses
where we rearrange
$$
\text{conditional} = \frac{\text{joint}}{\text{marginal}}
$$
as
$$
\text{joint} = \text{conditional} \cdot \text{marginal}
$$
and go from conditional to joint rather than the other way. The relevant
marginals are found in the proofs of the theorems stated here.

# Maximum Likelihood Estimates

In the next theorem we need the following terminology. Consider a conditioning
event of the form
$$
\sum_{i \in A} Y_i = n_A
$$
Then we say the *dummy variable associated with this conditioning event* is
the vector $u_A$ having zero-or-one-valued components such that
$$
\sum_{i \in A} Y_i = u_A^T Y
$$
(clearly the $i$-th component of $u_A$ is equal to one when $i \in A$ and zero
otherwise).
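As a concrete sketch of these dummy variables (with a hypothetical
$2 \times 3$ table whose rows are the strata and whose cell counts are stored
in a vector `y`), the $u_A$ are just indicator columns, and $u_A^T y$ recovers
the stratum totals.
```{r dummy-variables}
y <- c(12, 18, 20, 35, 10, 25)                   # hypothetical cell counts
stratum <- factor(rep(c("A1", "A2"), each = 3))  # which stratum each cell is in
u <- model.matrix(~ 0 + stratum)  # columns are the dummy variables u_A
u
t(u) %*% y                        # u_A^T y, the stratum totals n_A
tapply(y, stratum, sum)           # the same totals computed directly
```
The theorem below assumes that every such column appears in the model matrix $M$.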
:::{.theorem #mle}
Suppose we have two sampling schemes: the one with less conditioning is
Poisson, multinomial, or product multinomial, and the one with more
conditioning is multinomial or product multinomial; if the one with less
conditioning is product multinomial, then the partition for the one with more
conditioning must be finer than the partition for the one with less
conditioning. We use [canonical affine submodels](expfam.html#casm) for both
sampling schemes having the same offset vectors and model matrices, and we
assume every dummy variable associated with a conditioning event for the model
with more conditioning is a column of the model matrix.

Then the MLE's of the mean value parameter vector for the two sampling schemes
are equal, and any (possibly but not necessarily unique) MLE of the canonical
parameter vector for the one with less conditioning is also a (necessarily not
unique) MLE of the canonical parameter vector for the one with more
conditioning.
:::

:::{.proof}
Suppose the model with less conditioning is Poisson and the one with more
conditioning is product multinomial with partition $\mathcal{A}$ and usual
parameter vector $\pi$. (This includes the possibility that
$\mathcal{A} = \{ I \}$ so product multinomial is actually multinomial.) Let
$a$ denote the offset vector, $M$ the model matrix, and $u_A$,
$A \in \mathcal{A}$, the dummy variables for conditioning events.

By the [observed-equals-expected](expfam.html#observed-equals-expected)
principle the likelihood equations determining the MLE for the Poisson model
are
\begin{equation}
\sum_{i \in I} x_i y_i = \sum_{i \in I} x_i e^{\theta_i}
(\#eq:mle-less)
\end{equation}
where $x$ is any column of $M$ and $\theta = a + M \beta$. Similarly, the
likelihood equations determining the MLE for the product multinomial model are
\begin{equation}
\sum_{i \in I} x_i y_i
=
\sum_{A \in \mathcal{A}} n_A
\frac{\sum_{i \in A} x_i e^{\theta_i}}{\sum_{j \in A} e^{\theta_j}}
(\#eq:mle-more)
\end{equation}
where $x$ and $\theta$ are as in \@ref(eq:mle-less).

What is to be shown is that every $\beta$ that is a solution to
\@ref(eq:mle-less) is also a solution to \@ref(eq:mle-more). The special case
of \@ref(eq:mle-less) where $x = u_A$ gives
\begin{equation}
n_A = \sum_{i \in A} y_i = \sum_{i \in A} e^{\theta_i}
(\#eq:mle-na)
\end{equation}
and together \@ref(eq:mle-na) and \@ref(eq:mle-less) imply \@ref(eq:mle-more).
Hence any $\theta$ that satisfies \@ref(eq:mle-less) also satisfies
\@ref(eq:mle-more), in particular $\theta$ of the form $\theta = a + M \beta$.
That proves the assertions about canonical parameters.

Now for any $\theta = a + M \beta$ that satisfies \@ref(eq:mle-less) the mean
value parameter vector for the Poisson sampling model has $i$-th component
$e^{\theta_i}$. And the mean value parameter vector for the product
multinomial sampling model has $i$-th component
$$
\frac{n_A e^{\theta_i}}{\sum_{j \in A} e^{\theta_j}}
$$
and \@ref(eq:mle-na) shows these are the same. That proves the assertion about
mean value parameter vectors.

Now we have to redo the whole proof with the model with less conditioning
being product multinomial with a partition $\mathcal{B}$ that is coarser than
$\mathcal{A}$. (This includes the possibility that $\mathcal{B} = \{ I \}$ so
product multinomial is actually multinomial.) The proof is almost the same.
Now instead of \@ref(eq:mle-less) we need \@ref(eq:mle-more) with $A$ and
$\mathcal{A}$ replaced by $B$ and $\mathcal{B}$, respectively, that is,
\begin{equation}
\sum_{i \in I} x_i y_i
=
\sum_{B \in \mathcal{B}} n_B
\frac{\sum_{i \in B} x_i e^{\theta_i}}{\sum_{j \in B} e^{\theta_j}}
(\#eq:mle-less-too)
\end{equation}
Now taking $x = u_A$ in \@ref(eq:mle-less-too) gives
$$
n_A = \sum_{i \in A} y_i
= \frac{n_B \sum_{i \in A} e^{\theta_i}}{\sum_{j \in B} e^{\theta_j}},
\qquad A \subset B \in \mathcal{B}
$$
which we can also write as
\begin{equation}
\frac{n_B}{\sum_{j \in B} e^{\theta_j}}
=
\frac{n_A}{\sum_{i \in A} e^{\theta_i}},
\qquad A \subset B \in \mathcal{B}
(\#eq:mle-na-too)
\end{equation}
And \@ref(eq:mle-less-too) and \@ref(eq:mle-na-too) imply \@ref(eq:mle-more).
And the rest of this case is the same as before.
:::

:::{.corollary #mle}
Suppose we have models as in the theorem. Then the MLE's of the mean value
parameter vector for the two sampling schemes are equal, regardless of the
canonical parameterizations used.
:::

A statistical model is a family of probability distributions. By same model,
we mean the same family of probability distributions. Since the
[mean value parameterization](expfam.html#the-mean-value-parameterization) is
a parameterization, same model means the same mean value parameter space. So
the assumption of the corollary is that we have two models as described in the
theorem, regardless of whether the canonical parameterizations are as
described in the theorem.

The mean value parameter space of the model with more conditioning is derived
from the mean value parameter space with less conditioning by the
conditioning. If $M$ is the mean value parameter space of the model with less
conditioning and the model with more conditioning is product multinomial with
partition $\mathcal{A}$ and sample sizes $n_A$, then the mean value parameter
space of the model with more conditioning is
$$
\left\{\, \mu \in M : \sum_{i \in A} \mu_i = n_A, \ A \in \mathcal{A} \,\right\}
$$

:::{.proof}
This is because mean value parameterizations are unique. $E(Y)$ has the same
meaning in both models, even if the canonical parameterizations have nothing
to do with each other.
:::

# Likelihood Ratio Tests

:::{.theorem #lrt}
Suppose we are comparing nested submodels for categorical data. The likelihood
ratio test statistic does not depend on either the parameterization of the
submodels or the sampling scheme, so long as all models satisfy the conditions
of Corollary \@ref(cor:mle).
:::

:::{.proof}
A version of the log likelihood for the mean value parameter for the Poisson
sampling scheme is
$$
l_\text{pois}(\mu) = \sum_{i \in I} \left( y_i \log(\mu_i) - \mu_i \right)
$$
(different versions of the log likelihood differ by additive terms that do not
depend on the parameter).
A version of the log likelihood for the mean value parameter for the product
multinomial sampling scheme is
\begin{align*}
l_\text{multi}(\mu)
& =
\sum_{A \in \mathcal{A}} \sum_{i \in A} y_i \log\left(\frac{\mu_i}{n_A}\right)
\\
& =
\left( \sum_{i \in I} y_i \log(\mu_i) \right)
-
\left( \sum_{A \in \mathcal{A}} \log(n_A) \sum_{i \in A} y_i \right)
\end{align*}
and we may drop the term that does not contain parameters giving us a
different version
\begin{equation}
l_\text{multi}(\mu) = \sum_{i \in I} y_i \log(\mu_i)
(\#eq:logl-multi-version)
\end{equation}
Define the total sample size
$$
n = \sum_{A \in \mathcal{A}} n_A
$$
Then the conditions of Corollary \@ref(cor:mle) guarantee that
$$
\sum_{i \in I} \hat{\mu}_i = n
$$
for the Poisson sampling scheme. Hence if $\hat{\mu}$ and $\tilde{\mu}$ are
maximum likelihood estimators for different models being compared
\begin{align*}
l_\text{pois}(\hat{\mu}) - l_\text{pois}(\tilde{\mu})
& =
\sum_{i \in I} y_i \log\left(\frac{\hat{\mu}_i}{\tilde{\mu}_i}\right)
\\
& =
l_\text{multi}(\hat{\mu}) - l_\text{multi}(\tilde{\mu})
\end{align*}
(maximum likelihood estimators for the same model but different sampling
schemes are equal by Corollary \@ref(cor:mle); moreover, different versions of
the log likelihood have the same log likelihood differences because the
additive terms not containing parameters by which the versions differ are the
same for both terms in a log likelihood difference).

Log likelihoods are invariant under change of parameters, that is, if
$\hat{\theta}$ is a different parameter corresponding to $\hat{\mu}$ and
$\tilde{\theta}$ is a different parameter corresponding to $\tilde{\mu}$, then
$$
l(\hat{\mu}) - l(\tilde{\mu}) = l(\hat{\theta}) - l(\tilde{\theta})
$$
and this is true for any log likelihood (any model, any sampling scheme).

It is even clear (although we will not fuss about details of the proof) that
the same conclusion holds even when MLE's do not exist: if
$\Theta_\text{null}$ and $\Theta_\text{alt}$ are the two parameter spaces of
nested models being compared, then
$$
\left( \sup_{\theta \in \Theta_\text{alt}} l_\text{pois}(\theta) \right)
-
\left( \sup_{\theta \in \Theta_\text{null}} l_\text{pois}(\theta) \right)
=
\left( \sup_{\theta \in \Theta_\text{alt}} l_\text{multi}(\theta) \right)
-
\left( \sup_{\theta \in \Theta_\text{null}} l_\text{multi}(\theta) \right)
$$

It is also clear (although we will not fuss about details of the proof) that
we get a similar conclusion when the sampling schemes being compared are
product multinomial with two partitions, one finer than the other (obvious
from the fact that the log likelihood \@ref(eq:logl-multi-version) does not
involve the product multinomial sample sizes).
:::

:::{.theorem #df}
Suppose we are comparing nested submodels for categorical data. The degrees of
freedom for the asymptotic distribution of the likelihood ratio test statistic
do not depend on either the parameterization of the submodels or the sampling
scheme, so long as all models satisfy the conditions of Corollary
\@ref(cor:mle).
:::

:::{.proof}
Corollary \@ref(cor:mle) refers back to Theorem \@ref(thm:mle), so we may
assume the conditions of the latter. So first consider the situation as in
that theorem. We are using the same offset vector $a$ and model matrix $M$ for
both sampling schemes and assuming that every $u_A$, $A \in \mathcal{A}$, is a
column of $M$, where $\mathcal{A}$ is the partition for the sampling scheme
with more conditioning. Now we also assume $M$ has full column rank.
This can always be achieved by dropping some columns other than the $u_A$,
because the $u_A$ are linearly independent vectors.

Suppose the sampling scheme with less conditioning is Poisson and the sampling
scheme with more conditioning is product multinomial with partition
$\mathcal{A}$. Then the degrees of freedom (DF) for the former is the number
of columns of $M$, call that $d$, because the Poisson sampling scheme has no
directions of constancy. And the DF for the latter is
$d - \mathop{\rm card}(\mathcal{A})$, where $\mathop{\rm card}(S)$ denotes the
cardinality of (number of elements in) a set $S$, because every $u_A$ is a
[direction of constancy](infinity.html#doc-dor) for this sampling scheme and
must be dropped to obtain an identifiable canonical parameterization. So the
difference in DF of the two sampling schemes is
$\mathop{\rm card}(\mathcal{A})$.

Now suppose the sampling scheme with less conditioning is product multinomial
with partition $\mathcal{B}$, the sampling scheme with more conditioning is
product multinomial with partition $\mathcal{A}$, and $\mathcal{B}$ is coarser
than $\mathcal{A}$. Then the DF for the former is
$d - \mathop{\rm card}(\mathcal{B})$ and for the latter is
$d - \mathop{\rm card}(\mathcal{A})$ and the difference is
$\mathop{\rm card}(\mathcal{A}) - \mathop{\rm card}(\mathcal{B})$.

Since this analysis applies to both the null and alternative models, the
difference in DF between the alternative and null models is
$d_\text{alternative} - d_\text{null}$ in all cases, where
$d_\text{alternative}$ and $d_\text{null}$ are what $d$ was in our preceding
analysis now applied to the alternative and null hypotheses (still assuming
their model matrices have full column rank and the conditions of Theorem
\@ref(thm:mle) hold). Since this is the correct way to count DF regardless of
whether or not the model matrices originally had full column rank, we are
done.
:::

# Pearson Chi-Squared Tests

:::{.theorem #pearson}
The Pearson chi-squared test statistic does not depend on either the
parameterization of the model or the sampling scheme, so long as all models
satisfy the conditions of Corollary \@ref(cor:mle).
:::

:::{.proof}
This is obvious from the fact that the form of the test statistic
$$
\sum_\text{all cells} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
$$
only depends on the mean value parameter "expected" and Corollary
\@ref(cor:mle) says the MLE's of the mean value parameters are the same.
:::

# Wald and Rao Tests

This is as far as we can go. Wald and Rao tests do depend on the sampling
scheme. Of course, from the asymptotic equivalence of Wald, Wilks, and Rao
tests, the differences between these test statistics go to zero in probability
as the sample size goes to infinity. Thus they will be close to the same for
large sample sizes but not exactly the same (unlike what we had for the
likelihood ratio test).

# Bibliography {-}