---
title: "Stat 5421 Lecture Notes: Graphical Models"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  bookdown::html_document2:
    number_sections: true
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: true
    md_extensions: -tex_math_single_backslash
linkcolor: blue
urlcolor: blue
bibliography: graph.bib
csl: journal-of-the-royal-statistical-society.csl
link-citations: true
---

# License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License (http://creativecommons.org/licenses/by-sa/4.0/).

# R

 * The version of R used to make this document is `r getRversion()`.

 * The version of the `rmarkdown` package used to make this document is
   `r packageVersion("rmarkdown")`.

 * The version of the `glmbb` package used to make this document is
   `r packageVersion("glmbb")`.

 * The version of the `CatDataAnalysis` package used to make this document is
   `r packageVersion("CatDataAnalysis")`.

 * The version of the `network` package used to make this document is
   `r packageVersion("network")`.

 * The version of the `utils` package used to make this document is
   `r packageVersion("utils")`.

```{r libraries}
library(glmbb)
library(CatDataAnalysis)
library(network)
library(utils)
```

# Introduction

Graphical models come in many kinds. There are graphical models where all the
variables are categorical (@lauritzen; Chapter 4). There are graphical models
where the variables are jointly multivariate normal (@lauritzen; Chapter 5).
There are graphical models where some of the variables are categorical and the
rest are conditionally jointly multivariate normal given the categorical ones
(@lauritzen; Chapter 6). Since this is a course in categorical data, we will
only be interested in graphical models having only categorical variables.

# Undirected Graphs

An *undirected graph* consists of a set whose elements are called *nodes* and
a set whose elements are called *edges* and which are unordered pairs of
nodes. We write $G = (N, E)$ where $G$ is the graph, $N$ is the node set, and
$E$ is the edge set.

```{r one,echo=FALSE,fig.align="center",fig.cap="An Undirected Graph",fig.asp=0.5}
plot(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "")
text(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), c("U", "V", "W", "X", "Z"))
e <- 0.1
segments(x0 = c(2 + e, 3 + e, 4, 4 - sqrt(1 / 2) * e),
    y0 = c(1, 1, 1 - e, 0 + sqrt(1 / 2) * e),
    x1 = c(3 - e, 4 - e, 4, 3 + sqrt(1 / 2) * e),
    y1 = c(1, 1, 0 + e, 1 - sqrt(1 / 2) * e))
```

Figure \@ref(fig:one) is an example. The nodes are $U$, $V$, $W$, $X$, and
$Z$ and the edges are the line segments.

A graph is called *simple* if it has no repeats of its edges, which our
insistence that the edges are a set (which cannot have repeats) rather than a
multiset (which can) already rules out, and if it has no loops (edges of the
form $(n, n)$). We are only interested in simple graphs. The graph in
Figure \@ref(fig:one) is simple.

A graph $(N_1, E_1)$ is a *subgraph* of $(N_2, E_2)$ if $N_1 \subset N_2$ and
$E_1 \subset E_2$, where $\subset$ means subset (we do not use $\subseteq$ to
mean subset, so $A \subset B$ includes the case $A = B$).

A graph is *complete* if every pair of nodes is connected by an edge
(@lauritzen; Section 2.1.1).
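Completeness is easy to check by computer. Here is a minimal sketch (the helper function `is.complete` is our own invention, not from any package) that stores the graph in Figure \@ref(fig:one) as an adjacency matrix, a representation we will meet again in the section on drawing graphs below, and checks whether a given node set induces a complete subgraph.

```{r is-complete-sketch}
# adjacency matrix for the graph in the figure above:
# entry is one if there is an edge between two nodes, zero otherwise
nodes <- c("U", "V", "W", "X", "Z")
adj <- matrix(0, 5, 5, dimnames = list(nodes, nodes))
adj["V", "W"] <- adj["W", "X"] <- adj["W", "Z"] <- adj["X", "Z"] <- 1
adj <- adj + t(adj)
# a node set induces a complete subgraph if every pair of its nodes
# is joined by an edge
is.complete <- function(adj, S) {
    sub <- adj[S, S, drop = FALSE]
    all(sub[upper.tri(sub)] == 1)
}
is.complete(adj, c("W", "X", "Z"))  # TRUE, a complete subgraph
is.complete(adj, c("V", "W", "X"))  # FALSE, no edge between V and X
```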
A *clique* in a graph is the node set of a maximal complete subgraph, one
that is not a subgraph of another complete subgraph (@lauritzen;
Section 2.1.1). The cliques in Figure \@ref(fig:one) are $\{U\}$,
$\{V, W\}$, and $\{W, X, Z\}$.

A *path* in an undirected graph is a sequence of edges $(n_i, n_{i + 1})$,
$i = 1$, $\ldots,$ $k$. For example,

```{r path,echo=FALSE,fig.align="center",fig.asp=0.5}
plot(c(2, 3, 4, 4), c(1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "")
text(c(2, 3, 4, 4), c(1, 1, 1, 0), c("V", "W", "X", "Z"))
e <- 0.1
segments(x0 = c(2 + e, 3 + e, 4), y0 = c(1, 1, 1 - e),
    x1 = c(3 - e, 4 - e, 4), y1 = c(1, 1, 0 + e))
```

is a path in the graph in Figure \@ref(fig:one). We can say that this path
goes from $V$ to $Z$ or from $Z$ to $V$. Since the edges are undirected,
there is no implied direction.

If $A$, $B$, and $C$ are sets of nodes of a graph, then we say that $C$
*separates* $A$ and $B$ if every path from a node in $A$ to a node in $B$
goes through $C$ (has an edge, one of whose nodes is in $C$). In the graph in
Figure \@ref(fig:one), the node set $\{W\}$ separates the node sets
$\{U, V\}$ and $\{X, Z\}$.

# Undirected Graphs and Probability {#sec:undirected-probability}

A *log-linear model* for a contingency table is any model having one of the
"sampling schemes" (Poisson, multinomial, product multinomial) described in
the [Chapter 1 notes](ch1.html#statistical-models-for-categorical-data) and
in much more detail in the [notes on sampling schemes](sampling.html). Such a
model is *hierarchical* if for every interaction term in the model all main
effects and lower-order interactions involving variables in that term are
also in the model.

The *interaction graph* for a hierarchical model is an undirected graph that
has nodes that are the variables in the model and an edge for every pair of
variables that appear in the same interaction term (@lauritzen;
Section 4.3.3).

A hierarchical model is *graphical* if its terms correspond to the cliques of
its interaction graph (@lauritzen; Section 4.3.3). Repeating what we said in
the preceding section, the cliques of the graph in Figure \@ref(fig:one) are
$\{U\}$, $\{V, W\}$, and $\{W, X, Z\}$. So the graphical model having this
graph has formula
\begin{equation*}
   \sim U + V * W + W * X * Z.
\end{equation*}

Not every hierarchical model is graphical. For example, the hierarchical
model with formula
\begin{equation*}
   \sim U + V * W + W * X + X * Z + Z * W
\end{equation*}
has the same interaction graph. But it is not graphical because it does not
have the term $W * X * Z$.

A *Markov property* for a graph having random variables for nodes tells us
something about conditional independence or factorization into marginals and
conditionals. The relevant Markov property for log-linear models is the
following, which is stated in Section 4.3.3 in @lauritzen and said to follow
from Theorems 3.7 and 3.9 in @lauritzen.

:::{.theorem #separation}
If $C$ separates $A$ and $B$ in the interaction graph, then the variables in
$A$ are conditionally independent of those in $B$ given those in $C$.
:::

The conclusion of the theorem is often written
$$
   A \perp\!\!\!\perp B \mid C.
$$
It means, of course, that the conditional distribution factorizes
$$
   f(y_{A \cup B} \mid y_C) = f(y_A \mid y_C) f(y_B \mid y_C)
$$
where we have the usual abuse of notation that the three $f$'s denote three
different functions, the conditional PMF's of the indicated sets of
variables, and where we are now using the notation $y_S$ for subvectors of
the response vector $y$ of the probability model (described in the
[section on subvectors in the notes on sampling](sampling.html#subvectors)).

Repeating what we said in the preceding section, in the graph in
Figure \@ref(fig:one), $\{W\}$ separates $\{U, V\}$ and $\{X, Z\}$. So if
Figure \@ref(fig:one) is the interaction graph of a graphical model, we have
$$
   U, V \perp\!\!\!\perp X, Z \mid W.
$$

Perhaps not quite so obviously, in the same graph the empty set separates $U$
from all the other nodes: since there are no paths from $U$ to any other
node, every path from $U$ to another node (there are none of them) vacuously
goes through the empty set. Thus we can also say
$$
   U \perp\!\!\!\perp V, W, X, Z \mid \varnothing.
$$
But since conditioning on an empty set of variables is the same as not
conditioning, we can also say
$$
   U \perp\!\!\!\perp V, W, X, Z
$$
meaning $U$ is (unconditionally) independent of the rest of the variables.

# R Package `glmbb`

R package `glmbb` has some functions for dealing with graphical models.
R function `isGraphical` says whether a formula is for a graphical model,
and R function `isHierarchical` says whether it is for a hierarchical model.

```{r is.graphical}
isGraphical(~ U + V * W + W * X * Z)
isGraphical(~ U + V * W + W * X + X * Z + Z * W)
isHierarchical(~ U + V * W + W * X + X * Z + Z * W)
```

These opinions of the computer agree with our analysis above.

There is also an option `graphical = TRUE` to R function `glmbb` to tell it
to only consider graphical models.

```{r options,echo=FALSE}
saveopt <- options(width = 500)
```

```{r glmbb,cache=TRUE}
data(exercise_6.28)
sapply(exercise_6.28, class)
lapply(exercise_6.28, levels)
# all hierarchical models
gout <- glmbb(counts ~ (Occupational_aspirations + Socioeconomic_status +
    IQ + Residence + Gender)^5, data = exercise_6.28)
summary(gout)
# now just the graphical models
gout.gr <- glmbb(counts ~ (Occupational_aspirations + Socioeconomic_status +
    IQ + Residence + Gender)^5, data = exercise_6.28, graphical = TRUE)
summary(gout.gr)
```

```{r options.too,echo=FALSE}
options(saveopt)
```

There are more hierarchical models than graphical models

```{r glmbb.compare, cache=TRUE}
results.hier <- summary(gout)$results
results.graph <- summary(gout.gr)$results
nrow(results.hier)
nrow(results.graph)
```

And

```{r glmbb.compare.too}
min(which(cumsum(results.hier$weight) >= 0.9))
min(which(cumsum(results.graph$weight) >= 0.9))
```

`r min(which(cumsum(results.hier$weight) >= 0.9))` models account for 90% of
the weight among all hierarchical models within the cutoff.
`r min(which(cumsum(results.graph$weight) >= 0.9))` models account for 90% of
the weight among all graphical models within the cutoff.

We will try to interpret some of these graphs
[after we learn how to draw graphs](#drawing-the-graph-for-a-formula).

# Directed Graphs

A *directed graph* consists of a set whose elements are called *nodes* and a
set whose elements are called *edges* and which are ordered pairs of nodes.
When drawing a directed graph, the edges are drawn as arrows to indicate the
direction.
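In code, directed edges can be stored as ordered pairs, say the rows of a two-column matrix. Here is a minimal sketch (this representation is our own choice for illustration, not from any package) using the edges of the directed graph drawn below in Figure \@ref(fig:two).

```{r directed-edges-sketch}
# directed edges as ordered pairs (from, to)
edges <- rbind(c("V", "W"), c("W", "X"), c("W", "Z"), c("X", "Z"))
colnames(edges) <- c("from", "to")
edges
# for a directed graph the order matters: (V, W) is an edge here,
# but (W, V) is not
any(edges[, "from"] == "W" & edges[, "to"] == "V")
```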
Figure \@ref(fig:two) shows a directed graph.

```{r two,echo=FALSE,fig.align="center",fig.cap="A Directed Acyclic Graph",fig.asp=0.5}
plot(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "", ylim = c(-0.1, 1.1))
text(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), c("U", "V", "W", "X", "Z"))
e <- 0.1
arrows(x0 = c(2 + e, 3 + e, 4, 3 + sqrt(1 / 2) * e),
    y0 = c(1, 1, 1 - e, 1 - sqrt(1 / 2) * e),
    x1 = c(3 - e, 4 - e, 4, 4 - sqrt(1 / 2) * e),
    y1 = c(1, 1, 0 + e, 0 + sqrt(1 / 2) * e))
```

A *path* in a directed graph is a sequence of edges $(n_i, n_{i + 1})$,
$i = 1$, $\ldots,$ $k$. Since the edges are directed, there is an implied
direction. We say the path goes from $n_1$ to $n_{k + 1}$. For example,

```{r path.directed,echo=FALSE,fig.align="center",fig.asp=0.5}
plot(c(2, 3, 4, 4), c(1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "", ylim = c(-0.1, 1.1))
text(c(2, 3, 4, 4), c(1, 1, 1, 0), c("V", "W", "X", "Z"))
e <- 0.1
arrows(x0 = c(2 + e, 3 + e, 4), y0 = c(1, 1, 1 - e),
    x1 = c(3 - e, 4 - e, 4), y1 = c(1, 1, 0 + e))
```

is a path from $V$ to $Z$ in the graph in Figure \@ref(fig:two). (The path
must go in the same direction as the arrows.)

A directed graph is called *acyclic* if it has no cycles, which are paths
that return to their starting point. The graph in Figure \@ref(fig:two) is
acyclic. Directed acyclic graphs are so important that they have a well-known
TLA (three-letter acronym): DAG.

In a DAG the *parents* of a node $u$ are the nodes from which edges go to
$u$. The set of parents of $u$ is denoted $\mathop{\rm pa}(u)$. In
Figure \@ref(fig:two) we have
\begin{align*}
   \mathop{\rm pa}(U) & = \varnothing
   \\
   \mathop{\rm pa}(V) & = \varnothing
   \\
   \mathop{\rm pa}(W) & = \{ V \}
   \\
   \mathop{\rm pa}(X) & = \{ W \}
   \\
   \mathop{\rm pa}(Z) & = \{ W, X \}
\end{align*}

# Directed Acyclic Graphs and Probability

Every DAG that has nodes that are the variables in a statistical model is
associated with a factorization of the joint distribution for that model
into a product of marginal and conditional distributions
$$
   f(y) = \prod_{i \in N} f(y_i \mid y_{\mathop{\rm pa}(i)})
$$
(@lauritzen; Section 4.5.1). Using the parent sets we found in the preceding
section, the graph in Figure \@ref(fig:two) has the factorization
\begin{equation}
   f(u, v, w, x, z) = f(z \mid w, x) f(x \mid w) f(w \mid v) f(v) f(u).
   (\#eq:factorize-fig-two)
\end{equation}
This agrees with the separation properties of the corresponding undirected
graph discussed in Section \@ref(sec:undirected-probability) above.

A general factorization, valid for any probability model, would be
\begin{align*}
   f(u, v, w, x, z)
   & = f(z \mid u, v, w, x) f(u, v, w, x)
   \\
   & = f(z \mid u, v, w, x) f(x \mid u, v, w) f(u, v, w)
   \\
   & = f(z \mid u, v, w, x) f(x \mid u, v, w) f(w \mid u, v) f(u, v)
   \\
   & = f(z \mid u, v, w, x) f(x \mid u, v, w) f(w \mid u, v) f(v \mid u) f(u)
\end{align*}
Comparing this general factorization with \@ref(eq:factorize-fig-two) we see
that we have
\begin{align*}
   f(z \mid u, v, w, x) & = f(z \mid w, x)
   \\
   f(x \mid u, v, w) & = f(x \mid w)
   \\
   f(w \mid u, v) & = f(w \mid v)
   \\
   f(v \mid u) & = f(v)
\end{align*}
which imply
\begin{gather*}
   Z \perp\!\!\!\perp U, V \mid W, X
   \\
   X \perp\!\!\!\perp U, V \mid W
   \\
   W \perp\!\!\!\perp U \mid V
   \\
   V \perp\!\!\!\perp U
\end{gather*}
all of which are Markov properties that can be read off the undirected graph
in Figure \@ref(fig:one).
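To see the factorization in action, here is a small simulation sketch. Everything in it is made up for illustration (the conditional probabilities come from no data set): we simulate binary variables according to the factorization \@ref(eq:factorize-fig-two) and then check one of the implied Markov properties, $V \perp\!\!\!\perp U$, with a chi-squared test.

```{r dag-simulate-sketch}
set.seed(42)
n <- 10000
# simulate binary variables following the factorization
# f(u) f(v) f(w | v) f(x | w) f(z | w, x), with made-up probabilities
u <- rbinom(n, 1, 0.5)
v <- rbinom(n, 1, 0.3)
w <- rbinom(n, 1, ifelse(v == 1, 0.8, 0.2))
x <- rbinom(n, 1, ifelse(w == 1, 0.7, 0.4))
z <- rbinom(n, 1, 0.1 + 0.4 * w + 0.3 * x)
# u enters none of the conditionals above, so the test should
# (up to sampling error) fail to reject independence of u and v
chisq.test(table(u, v))
```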
# Directed Acyclic Graphs and Causality

DAG's are also widely used to indicate causal relationships
[@pearl; @spirtes-glymour-scheines]. The idea is that a directed edge

```{r edge.directed,echo=FALSE,fig.align="center",fig.asp=0.35}
plot(c(2, 3), c(1, 1), axes = FALSE, type = "n",
    xlab = "", ylab = "", ylim = c(0.9, 1.1))
text(c(2, 3), c(1, 1), c("V", "W"))
e <- 0.1
arrows(x0 = 2 + e, y0 = 1, x1 = 3 - e, y1 = 1)
```

indicates that changes in $V$ cause changes in $W$.

Authorities on causal inference [@pearl; @spirtes-glymour-scheines]
understand that correlation is not causation. But they do insist that
independence implies lack of causation. (Strictly speaking, they do not even
insist on that, because there are tricky examples where there is independence
but still causation. But generally independence is taken to imply lack of
causation.)

If two variables are independent, then there cannot be a causal relationship
between them (because that would induce dependence). If two variables are
conditionally independent given a set $S$ of other variables, then there
cannot be a direct causal relationship between them. Any causal link must be
indirect, with the path from one to the other going through $S$.

Direction of causality cannot be inferred from independence, conditional or
unconditional. If $U$ and $V$ are dependent, then changes in $U$ may cause
changes in $V$ or changes in $V$ may cause changes in $U$ or changes in some
other variable $W$ may cause changes in both $U$ and $V$.

There are two ways one can infer direction of causality. One is what is
called "intervention" in the causal inference literature. It is the same
thing that statisticians are talking about when they refer to controlled
experiments. The other way is to simply assume that only one direction for
edges makes sense (scientific sense, business sense, whatever).

In short, statistics can give you an undirected graph, but only controlled
experiments or assumptions can give direction and causal interpretation of
the edges.

# Drawing Graphs

This section is about drawing graphs using R. By graphs we don't mean plots,
but rather figures like Figures \@ref(fig:one) and \@ref(fig:two) above.

[TIMTOWTDI](https://en.wikipedia.org/wiki/There%27s_more_than_one_way_to_do_it).
There is an
[R task view on graphical models](https://cloud.r-project.org/web/views/gR.html)
that lists 8 CRAN packages and 2 Bioconductor packages for dealing with
graphical models. In addition to all of these packages one can just use base
R graphics, which is how Figures \@ref(fig:one) and \@ref(fig:two) were done.
But the packages are perhaps easier to use.

We will just illustrate using R package `network`. We will redraw the graphs
that are Figures \@ref(fig:one) and \@ref(fig:two) above.

R function `network` in R package `network` makes objects of class
`"network"` that can then be plotted. It wants the graph presented as an
*adjacency matrix*, which is a matrix whose row and column indices are nodes
of the graph and whose entries are zero if there is no edge from one node to
the other and one otherwise. We can use the same matrix for both directed and
undirected graphs. So we set up the matrix for Figure \@ref(fig:two).

```{r make.graph}
nodeset <- LETTERS["U" <= LETTERS & "Y" != LETTERS]
nodeset
foo <- matrix(0, nrow = length(nodeset), ncol = length(nodeset))
rownames(foo) <- colnames(foo) <- nodeset
foo["V", "W"] <- foo["W", "X"] <- foo["W", "Z"] <- foo["X", "Z"] <- 1
foo
```

Now we can plot it.

```{r directed,fig.align="center",fig.cap="A Directed Graph"}
bar <- network(foo)
plot(bar, displaylabels = TRUE)
```
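Objects of class `"network"` can be queried as well as plotted. Here is a quick look (these accessor functions are from R package `network`; see its documentation for many more):

```{r network-accessors-sketch}
is.directed(bar)           # TRUE: network() makes a directed graph by default
network.vertex.names(bar)  # the node set
as.matrix(bar)             # recover the adjacency matrix
```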
If we want an undirected graph, we have to remake the network object.

```{r undirected,fig.align="center",fig.cap="An Undirected Graph"}
bar <- network(foo, directed = FALSE)
plot(bar, displaylabels = TRUE)
```

Every time you redo your plots, the graph nodes will be in different
positions. But this doesn't matter. The nodes don't really have positions,
only connections (edges).

On the other hand, one can just draw the graph in some drawing application
and include it in the Rmarkdown document.

![Undirected Graph](undirected.png)

# Drawing the Graph for a Formula

```{r options.too.too,echo=FALSE}
saveopt <- options(width = 500)
```

```{r glmbb.compare.too.too}
idx <- min(which(cumsum(results.graph$weight) >= 0.9))
results.graph[1:idx, ]
foo <- results.graph$formula[1:idx]
```

```{r options.too.too.restore,echo=FALSE}
options(saveopt)
```

So let's look at one of these formulas.

```{r formula}
foo[1]
```

Now we are going to see some real magic of R! R is not an ordinary computing
language, in that it knows a lot about itself. Everything in R is an object,
including R expressions. So R can take apart this formula. First we turn this
character string into a formula, and then we call R function `terms` to
interpret it.

```{r formula.take.apart}
foo[1]
ff <- as.formula(foo[1])
tt <- terms(ff)
names(attributes(tt))
```

Looks like we want `term.labels`.

```{r formula.take.apart.too}
ll <- attr(tt, "term.labels")
ll
```

This is something of a mess because of hierarchicality. We really only need
the highest-order terms, but the others don't hurt. First use R function
`strsplit` to get the variables.

```{r formula.take.apart.too.too}
ss <- lapply(ll, strsplit, split = ":", fixed = TRUE)
head(ss)
ss <- lapply(ss, unlist)
ss
```

Now we want all pairs of components of the same interaction. There's an R
function for that: `combn` in R package `utils`.

```{r formula.take.apart.too.too.too,error=TRUE}
cc <- lapply(ss, combn, m = 2)
```

But it doesn't like vectors of length less than 2 (main effect terms involve
only one variable). So remove them and retry.

```{r formula.take.apart.too.too.too.too}
len.ss <- sapply(ss, length)
ss <- ss[len.ss >= 2]
ss
cc <- lapply(ss, combn, m = 2)
cc
```

Getting there. Now put this whole thing in one big two-row matrix,

```{r formula.take.apart.too.too.too.too.too}
rr <- Reduce(cbind, cc, init = NULL)
rr
```

Now there is no honest way to do the last step without a loop AFAICS.

```{r formula.take.apart.too.too.too.too.too.too,fig.align="center"}
nn <- sort(unique(as.vector(rr)))
gr <- matrix(0, length(nn), length(nn))
rownames(gr) <- colnames(gr) <- nn
# put a one in the matrix for each pair appearing in some interaction
for (i in seq_len(ncol(rr)))
    gr[rr[1, i], rr[2, i]] <- 1
# symmetrize
gr <- gr + t(gr)
bar <- network(gr, directed = FALSE)
plot(bar, displaylabels = TRUE)
```

That's pretty easy to interpret. `IQ` is clearly separated from `Residence`
and `Gender` by `Occupational_aspirations` and `Socioeconomic_status`. Using
single letters for the variable names,
$$
   I \perp\!\!\!\perp R, G \mid O, S.
$$
But in that big clique of size four, nothing separates anything in it from
anything else. That's the way cliques work.

Try it again. This time we systematize our calculations as a function.
```{r formula-graph}
formula.to.graph <- function(f) {
    stopifnot(inherits(f, "formula"))
    tt <- terms(f)
    ll <- attr(tt, "term.labels")
    ss <- lapply(ll, strsplit, split = ":", fixed = TRUE)
    ss <- lapply(ss, unlist)
    len.ss <- sapply(ss, length)
    ss <- ss[len.ss >= 2]
    cc <- lapply(ss, combn, m = 2)
    rr <- Reduce(cbind, cc, init = NULL)
    nn <- sort(unique(as.vector(rr)))
    gr <- matrix(0, length(nn), length(nn))
    rownames(gr) <- colnames(gr) <- nn
    for (i in seq_len(ncol(rr)))
        gr[rr[1, i], rr[2, i]] <- 1
    gr <- gr + t(gr)
    network(gr, directed = FALSE)
}
```

And give it a try-out.

```{r formula-graph-plot,fig.align="center"}
plot(formula.to.graph(as.formula(foo[[2]])), displaylabels = TRUE)
```

This is a little more interesting. Now we have `IQ`, `Residence`, and
`Gender` all conditionally independent given the other two variables. In math
\begin{align*}
   I \perp\!\!\!\perp R, G & \mid O, S
   \\
   R \perp\!\!\!\perp I, G & \mid O, S
   \\
   G \perp\!\!\!\perp I, R & \mid O, S
\end{align*}

It is a bit annoying that the $\perp\!\!\!\perp$ symbol isn't powerful enough
to say what we want in one statement. Independence is not a pairwise
property. So when three variables are conditionally independent given some
others, it takes more than one use of the symbol to say that. Let's look at
factorizations
\begin{align*}
   f(i, r, g, o, s)
   & = f(i \mid r, g, o, s) f(r, g, o, s)
   \\
   & = f(i \mid o, s) f(r, g, o, s)
   \\
   & = f(i \mid o, s) f(r \mid g, o, s) f(g, o, s)
   \\
   & = f(i \mid o, s) f(r \mid o, s) f(g, o, s)
   \\
   & = f(i \mid o, s) f(r \mid o, s) f(g \mid o, s) f(o, s)
\end{align*}
where the equalities on odd-numbered lines are just joint equals conditional
times marginal and the equalities on even-numbered lines are the conditional
independence properties we read off the graph. Another way to write this is
$$
   f(i, r, g \mid o, s) = f(i \mid o, s) f(r \mid o, s) f(g \mid o, s)
$$
which is what we said in words right at the beginning of our analysis of this
graph: $I$, $R$, and $G$ are conditionally independent given $O$ and $S$.

# Bibliography