---
title: "Stat 5421 Lecture Notes: Graphical Models"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  bookdown::html_document2:
    number_sections: true
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: true
    md_extensions: -tex_math_single_backslash
linkcolor: blue
urlcolor: blue
bibliography: graph.bib
csl: journal-of-the-royal-statistical-society.csl
link-citations: true
---

# License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License (http://creativecommons.org/licenses/by-sa/4.0/).

# R

 * The version of R used to make this document is `r getRversion()`.

 * The version of the `rmarkdown` package used to make this document is
   `r packageVersion("rmarkdown")`.

 * The version of the `glmbb` package used to make this document is
   `r packageVersion("glmbb")`.

 * The version of the `CatDataAnalysis` package used to make this document is
   `r packageVersion("CatDataAnalysis")`.

 * The version of the `network` package used to make this document is
   `r packageVersion("network")`.

 * The version of the `utils` package used to make this document is
   `r packageVersion("utils")`.

```{r libraries}
library(glmbb)
library(CatDataAnalysis)
library(network)
library(utils)
```

# Introduction

Graphical models come in many kinds. There are graphical models where all the
variables are categorical (@lauritzen; Chapter 4). There are graphical models
where the variables are jointly multivariate normal (@lauritzen; Chapter 5).
There are graphical models where some of the variables are categorical and the
rest are conditionally jointly multivariate normal given the categorical ones
(@lauritzen; Chapter 6). Since this is a course in categorical data, we will
only be interested in graphical models having only categorical variables.

# Undirected Graphs

An *undirected graph* consists of a set whose elements are called *nodes* and
a set whose elements are called *edges* and which are unordered pairs of
nodes. We write $G = (N, E)$ where $G$ is the graph, $N$ is the node set, and
$E$ is the edge set.

```{r one,echo=FALSE,fig.align="center",fig.cap="An Undirected Graph",fig.asp=0.5}
plot(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "")
text(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), c("U", "V", "W", "X", "Z"))
e <- 0.1
segments(x0 = c(2 + e, 3 + e, 4, 4 - sqrt(1 / 2) * e),
    y0 = c(1, 1, 1 - e, 0 + sqrt(1 / 2) * e),
    x1 = c(3 - e, 4 - e, 4, 3 + sqrt(1 / 2) * e),
    y1 = c(1, 1, 0 + e, 1 - sqrt(1 / 2) * e))
```

Figure \@ref(fig:one) is an example. The nodes are $U$, $V$, $W$, $X$, and
$Z$ and the edges are the line segments.

A graph is called *simple* if it has no repeats of its edges, which our
insistence that the edges are a set (which cannot have repeats) rather than a
multiset (which can) already rules out, and if it has no loops (edges of the
form $(n, n)$). We are only interested in simple graphs. The graph in
Figure \@ref(fig:one) is simple.

A graph $(N_1, E_1)$ is a *subgraph* of $(N_2, E_2)$ if $N_1 \subset N_2$ and
$E_1 \subset E_2$, where $\subset$ means subset (we do not use $\subseteq$ to
mean subset, so $A \subset B$ includes the case $A = B$).

A graph is *complete* if every pair of nodes is connected by an edge
(@lauritzen; Section 2.1.1).
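Completeness is easy to check by computer. Here is a minimal sketch (the helper function `is.complete` is our own invention, not from any package) that stores the graph in Figure \@ref(fig:one) as an adjacency matrix, a representation we will meet again in the section on drawing graphs below, and checks whether a given node set induces a complete subgraph.

```{r is-complete-sketch}
# adjacency matrix for the graph in the figure above:
# entry is one if there is an edge between two nodes, zero otherwise
nodes <- c("U", "V", "W", "X", "Z")
adj <- matrix(0, 5, 5, dimnames = list(nodes, nodes))
adj["V", "W"] <- adj["W", "X"] <- adj["W", "Z"] <- adj["X", "Z"] <- 1
adj <- adj + t(adj)
# a node set induces a complete subgraph if every pair of its nodes
# is joined by an edge
is.complete <- function(adj, S) {
    sub <- adj[S, S, drop = FALSE]
    all(sub[upper.tri(sub)] == 1)
}
is.complete(adj, c("W", "X", "Z"))  # TRUE, a complete subgraph
is.complete(adj, c("V", "W", "X"))  # FALSE, no edge between V and X
```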
A *clique* in a graph is the node set of a maximal complete subgraph, one
that is not a subgraph of another complete subgraph (@lauritzen;
Section 2.1.1). The cliques in Figure \@ref(fig:one) are $\{U\}$,
$\{V, W\}$, and $\{W, X, Z\}$.

A *path* in an undirected graph is a sequence of edges $(n_i, n_{i + 1})$,
$i = 1$, $\ldots,$ $k$. For example,

```{r path,echo=FALSE,fig.align="center",fig.asp=0.5}
plot(c(2, 3, 4, 4), c(1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "")
text(c(2, 3, 4, 4), c(1, 1, 1, 0), c("V", "W", "X", "Z"))
e <- 0.1
segments(x0 = c(2 + e, 3 + e, 4), y0 = c(1, 1, 1 - e),
    x1 = c(3 - e, 4 - e, 4), y1 = c(1, 1, 0 + e))
```

is a path in the graph in Figure \@ref(fig:one). We can say that this path
goes from $V$ to $Z$ or from $Z$ to $V$. Since the edges are undirected,
there is no implied direction.

If $A$, $B$, and $C$ are sets of nodes of a graph, then we say that $C$
*separates* $A$ and $B$ if every path from a node in $A$ to a node in $B$
goes through $C$ (has an edge, one of whose nodes is in $C$). In the graph in
Figure \@ref(fig:one), the node set $\{W\}$ separates the node sets
$\{U, V\}$ and $\{X, Z\}$.

# Undirected Graphs and Probability {#sec:undirected-probability}

A *log-linear model* for a contingency table is any model having one of the
"sampling schemes" (Poisson, multinomial, product multinomial) described in
the [Chapter 1 notes](ch1.html#statistical-models-for-categorical-data) and
in much more detail in the [notes on sampling schemes](sampling.html). Such a
model is *hierarchical* if for every interaction term in the model all main
effects and lower-order interactions involving variables in that term are
also in the model.

The *interaction graph* for a hierarchical model is an undirected graph that
has nodes that are the variables in the model and an edge for every pair of
variables that appear in the same interaction term (@lauritzen;
Section 4.3.3).

A hierarchical model is *graphical* if its terms correspond to the cliques of
its interaction graph (@lauritzen; Section 4.3.3). Repeating what we said in
the preceding section, the cliques of the graph in Figure \@ref(fig:one) are
$\{U\}$, $\{V, W\}$, and $\{W, X, Z\}$. So the graphical model having this
graph has formula
\begin{equation*}
   \sim U + V * W + W * X * Z.
\end{equation*}

Not every hierarchical model is graphical. For example, the hierarchical
model with formula
\begin{equation*}
   \sim U + V * W + W * X + X * Z + Z * W
\end{equation*}
has the same interaction graph. But it is not graphical because it does not
have the term $W * X * Z$.

A *Markov property* for a graph having random variables for nodes tells us
something about conditional independence or factorization into marginals and
conditionals. The relevant Markov property for log-linear models is the
following, which is stated in Section 4.3.3 in @lauritzen and said to follow
from Theorems 3.7 and 3.9 in @lauritzen.

:::{.theorem #separation}
If $C$ separates $A$ and $B$ in the interaction graph, then the variables in
$A$ are conditionally independent of those in $B$ given those in $C$.
:::

The conclusion of the theorem is often written
$$
   A \perp\!\!\!\perp B \mid C.
$$
It means, of course, that the conditional distribution factorizes
$$
   f(y_{A \cup B} \mid y_C) = f(y_A \mid y_C) f(y_B \mid y_C)
$$
where we have the usual abuse of notation that the three $f$'s denote three
different functions, the conditional PMF's of the indicated sets of
variables, and where we are now using the notation $y_S$ for subvectors of
the response vector $y$ of the probability model (described in the
[section on subvectors in the notes on sampling](sampling.html#subvectors)).

Repeating what we said in the preceding section, in the graph in
Figure \@ref(fig:one), $\{W\}$ separates $\{U, V\}$ and $\{X, Z\}$. So if
Figure \@ref(fig:one) is the interaction graph of a graphical model, we have
$$
   U, V \perp\!\!\!\perp X, Z \mid W.
$$

Perhaps not quite so obviously, in the same graph the empty set separates $U$
from all the other nodes: since there are no paths from $U$ to any other
node, every path from $U$ to another node (there are none of them) vacuously
goes through the empty set. Thus we can also say
$$
   U \perp\!\!\!\perp V, W, X, Z \mid \varnothing.
$$
But since conditioning on an empty set of variables is the same as not
conditioning, we can also say
$$
   U \perp\!\!\!\perp V, W, X, Z
$$
meaning $U$ is (unconditionally) independent of the rest of the variables.

# R Package `glmbb`

R package `glmbb` has some functions for dealing with graphical models.
R function `isGraphical` says whether a formula is for a graphical model,
and R function `isHierarchical` says whether it is for a hierarchical model.

```{r is.graphical}
isGraphical(~ U + V * W + W * X * Z)
isGraphical(~ U + V * W + W * X + X * Z + Z * W)
isHierarchical(~ U + V * W + W * X + X * Z + Z * W)
```

These opinions of the computer agree with our analysis above.

There is also an option `graphical = TRUE` to R function `glmbb` to tell it
to only consider graphical models.

```{r options,echo=FALSE}
saveopt <- options(width = 500)
```

```{r glmbb,cache=TRUE}
data(exercise_6.28)
sapply(exercise_6.28, class)
lapply(exercise_6.28, levels)
# all hierarchical models
gout <- glmbb(counts ~ (Occupational_aspirations + Socioeconomic_status +
    IQ + Residence + Gender)^5, data = exercise_6.28)
summary(gout)
# now just the graphical models
gout.gr <- glmbb(counts ~ (Occupational_aspirations + Socioeconomic_status +
    IQ + Residence + Gender)^5, data = exercise_6.28, graphical = TRUE)
summary(gout.gr)
```

```{r options.too,echo=FALSE}
options(saveopt)
```

There are more hierarchical models than graphical models

```{r glmbb.compare, cache=TRUE}
results.hier <- summary(gout)$results
results.graph <- summary(gout.gr)$results
nrow(results.hier)
nrow(results.graph)
```

And

```{r glmbb.compare.too}
min(which(cumsum(results.hier$weight) >= 0.9))
min(which(cumsum(results.graph$weight) >= 0.9))
```

`r min(which(cumsum(results.hier$weight) >= 0.9))` models account for 90% of
the weight among all hierarchical models within the cutoff.
`r min(which(cumsum(results.graph$weight) >= 0.9))` models account for 90% of
the weight among all graphical models within the cutoff.

We will try to interpret some of these graphs
[after we learn how to draw graphs](#drawing-the-graph-for-a-formula).

# Directed Graphs

A *directed graph* consists of a set whose elements are called *nodes* and a
set whose elements are called *edges* and which are ordered pairs of nodes.
When drawing a directed graph, the edges are drawn as arrows to indicate the
direction.
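In code, directed edges can be stored as ordered pairs, say the rows of a two-column matrix. Here is a minimal sketch (this representation is our own choice for illustration, not from any package) using the edges of the directed graph drawn below in Figure \@ref(fig:two).

```{r directed-edges-sketch}
# directed edges as ordered pairs (from, to)
edges <- rbind(c("V", "W"), c("W", "X"), c("W", "Z"), c("X", "Z"))
colnames(edges) <- c("from", "to")
edges
# for a directed graph the order matters: (V, W) is an edge here,
# but (W, V) is not
any(edges[, "from"] == "W" & edges[, "to"] == "V")
```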
Figure \@ref(fig:two) shows a directed graph.

```{r two,echo=FALSE,fig.align="center",fig.cap="A Directed Acyclic Graph",fig.asp=0.5}
plot(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "", ylim = c(-0.1, 1.1))
text(c(1, 2, 3, 4, 4), c(1, 1, 1, 1, 0), c("U", "V", "W", "X", "Z"))
e <- 0.1
arrows(x0 = c(2 + e, 3 + e, 4, 3 + sqrt(1 / 2) * e),
    y0 = c(1, 1, 1 - e, 1 - sqrt(1 / 2) * e),
    x1 = c(3 - e, 4 - e, 4, 4 - sqrt(1 / 2) * e),
    y1 = c(1, 1, 0 + e, 0 + sqrt(1 / 2) * e))
```

A *path* in a directed graph is a sequence of edges $(n_i, n_{i + 1})$,
$i = 1$, $\ldots,$ $k$. Since the edges are directed, there is an implied
direction. We say the path goes from $n_1$ to $n_{k + 1}$. For example,

```{r path.directed,echo=FALSE,fig.align="center",fig.asp=0.5}
plot(c(2, 3, 4, 4), c(1, 1, 1, 0), axes = FALSE, type = "n",
    xlab = "", ylab = "", ylim = c(-0.1, 1.1))
text(c(2, 3, 4, 4), c(1, 1, 1, 0), c("V", "W", "X", "Z"))
e <- 0.1
arrows(x0 = c(2 + e, 3 + e, 4), y0 = c(1, 1, 1 - e),
    x1 = c(3 - e, 4 - e, 4), y1 = c(1, 1, 0 + e))
```

is a path from $V$ to $Z$ in the graph in Figure \@ref(fig:two). (The path
must go in the same direction as the arrows.)

A directed graph is called *acyclic* if it has no cycles, which are paths
that return to their starting point. The graph in Figure \@ref(fig:two) is
acyclic. Directed acyclic graphs are so important that they have a well-known
TLA (three-letter acronym): DAG.

In a DAG the *parents* of a node $u$ are the nodes from which edges go to
$u$. The set of parents of $u$ is denoted $\mathop{\rm pa}(u)$. In
Figure \@ref(fig:two) we have
\begin{align*}
   \mathop{\rm pa}(U) & = \varnothing
   \\
   \mathop{\rm pa}(V) & = \varnothing
   \\
   \mathop{\rm pa}(W) & = \{ V \}
   \\
   \mathop{\rm pa}(X) & = \{ W \}
   \\
   \mathop{\rm pa}(Z) & = \{ W, X \}
\end{align*}

# Directed Acyclic Graphs and Probability

Every DAG that has nodes that are the variables in a statistical model is
associated with a factorization of the joint distribution for that model
into a product of marginal and conditional distributions
$$
   f(y) = \prod_{i \in N} f(y_i \mid y_{\mathop{\rm pa}(i)})
$$
(@lauritzen; Section 4.5.1). Using the parent sets we found in the preceding
section, the graph in Figure \@ref(fig:two) has the factorization
\begin{equation}
   f(u, v, w, x, z) = f(z \mid w, x) f(x \mid w) f(w \mid v) f(v) f(u).
   (\#eq:factorize-fig-two)
\end{equation}
This agrees with the separation properties of the corresponding undirected
graph discussed in Section \@ref(sec:undirected-probability) above.

A general factorization, valid for any probability model, would be
\begin{align*}
   f(u, v, w, x, z)
   & = f(z \mid u, v, w, x) f(u, v, w, x)
   \\
   & = f(z \mid u, v, w, x) f(x \mid u, v, w) f(u, v, w)
   \\
   & = f(z \mid u, v, w, x) f(x \mid u, v, w) f(w \mid u, v) f(u, v)
   \\
   & = f(z \mid u, v, w, x) f(x \mid u, v, w) f(w \mid u, v) f(v \mid u) f(u)
\end{align*}
Comparing this general factorization with \@ref(eq:factorize-fig-two) we see
that we have
\begin{align*}
   f(z \mid u, v, w, x) & = f(z \mid w, x)
   \\
   f(x \mid u, v, w) & = f(x \mid w)
   \\
   f(w \mid u, v) & = f(w \mid v)
   \\
   f(v \mid u) & = f(v)
\end{align*}
which imply
\begin{gather*}
   Z \perp\!\!\!\perp U, V \mid W, X
   \\
   X \perp\!\!\!\perp U, V \mid W
   \\
   W \perp\!\!\!\perp U \mid V
   \\
   V \perp\!\!\!\perp U
\end{gather*}
all of which are Markov properties that can be read off the undirected graph
in Figure \@ref(fig:one).
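To see the factorization in action, here is a small simulation sketch. Everything in it is made up for illustration (the conditional probabilities come from no data set): we simulate binary variables according to the factorization \@ref(eq:factorize-fig-two) and then check one of the implied Markov properties, $V \perp\!\!\!\perp U$, with a chi-squared test.

```{r dag-simulate-sketch}
set.seed(42)
n <- 10000
# simulate binary variables following the factorization
# f(u) f(v) f(w | v) f(x | w) f(z | w, x), with made-up probabilities
u <- rbinom(n, 1, 0.5)
v <- rbinom(n, 1, 0.3)
w <- rbinom(n, 1, ifelse(v == 1, 0.8, 0.2))
x <- rbinom(n, 1, ifelse(w == 1, 0.7, 0.4))
z <- rbinom(n, 1, 0.1 + 0.4 * w + 0.3 * x)
# u enters none of the conditionals above, so the test should
# (up to sampling error) fail to reject independence of u and v
chisq.test(table(u, v))
```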
# Directed Acyclic Graphs and Causality

DAG's are also widely used to indicate causal relationships
[@pearl; @spirtes-glymour-scheines]. The idea is that a directed edge

```{r edge.directed,echo=FALSE,fig.align="center",fig.asp=0.35}
plot(c(2, 3), c(1, 1), axes = FALSE, type = "n",
    xlab = "", ylab = "", ylim = c(0.9, 1.1))
text(c(2, 3), c(1, 1), c("V", "W"))
e <- 0.1
arrows(x0 = 2 + e, y0 = 1, x1 = 3 - e, y1 = 1)
```

indicates that changes in $V$ cause changes in $W$.

Authorities on causal inference [@pearl; @spirtes-glymour-scheines]
understand that correlation is not causation. But they do insist that
independence implies lack of causation. (Strictly speaking, they do not even
insist on that, because there are tricky examples where there is independence
but still causation. But generally independence is taken to imply lack of
causation.)

If two variables are independent, then there cannot be a causal relationship
between them (because that would induce dependence). If two variables are
conditionally independent given a set $S$ of other variables, then there
cannot be a direct causal relationship between them. Any causal link must be
indirect, with the path from one to the other going through $S$.

Direction of causality cannot be inferred from independence, conditional or
unconditional. If $U$ and $V$ are dependent, then changes in $U$ may cause
changes in $V$ or changes in $V$ may cause changes in $U$ or changes in some
other variable $W$ may cause changes in both $U$ and $V$.

There are two ways one can infer direction of causality. One is what is
called "intervention" in the causal inference literature. It is the same
thing that statisticians are talking about when they refer to controlled
experiments. The other way is to simply assume that only one direction for
edges makes sense (scientific sense, business sense, whatever).

In short, statistics can give you an undirected graph, but only controlled
experiments or assumptions can give direction and causal interpretation of
the edges.

# Drawing Graphs

This section is about drawing graphs using R. By graphs we don't mean plots,
but rather figures like Figures \@ref(fig:one) and \@ref(fig:two) above.

[TIMTOWTDI](https://en.wikipedia.org/wiki/There%27s_more_than_one_way_to_do_it).
There is an
[R task view on graphical models](https://cloud.r-project.org/web/views/gR.html)
that lists 8 CRAN packages and 2 Bioconductor packages for dealing with
graphical models. In addition to all of these packages one can just use base
R graphics, which is how Figures \@ref(fig:one) and \@ref(fig:two) were done.
But the packages are perhaps easier to use.

We will just illustrate using R package `network`. We will redraw the graphs
that are Figures \@ref(fig:one) and \@ref(fig:two) above.

R function `network` in R package `network` makes objects of class
`"network"` that can then be plotted. It wants the graph presented as an
*adjacency matrix*, which is a matrix whose row and column indices are nodes
of the graph and whose entries are zero if there is no edge from one node to
the other and one otherwise. We can use the same matrix for both directed and
undirected graphs. So we set up the matrix for Figure \@ref(fig:two).

```{r make.graph}
nodeset <- LETTERS["U" <= LETTERS & "Y" != LETTERS]
nodeset
foo <- matrix(0, nrow = length(nodeset), ncol = length(nodeset))
rownames(foo) <- colnames(foo) <- nodeset
foo["V", "W"] <- foo["W", "X"] <- foo["W", "Z"] <- foo["X", "Z"] <- 1
foo
```

Now we can plot it.

```{r directed,fig.align="center",fig.cap="A Directed Graph"}
bar <- network(foo)
plot(bar, displaylabels = TRUE)
```
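Objects of class `"network"` can be queried as well as plotted. Here is a quick look (these accessor functions are from R package `network`; see its documentation for many more):

```{r network-accessors-sketch}
is.directed(bar)           # TRUE: network() makes a directed graph by default
network.vertex.names(bar)  # the node set
as.matrix(bar)             # recover the adjacency matrix
```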
If we want an undirected graph, we have to remake the network object.

```{r undirected,fig.align="center",fig.cap="An Undirected Graph"}
bar <- network(foo, directed = FALSE)
plot(bar, displaylabels = TRUE)
```

Every time you redo your plots, the graph nodes will be in different
positions. But this doesn't matter. The nodes don't really have positions,
only connections (edges).

On the other hand, one can just draw the graph in some drawing application
and include it in the Rmarkdown document.

![Undirected Graph](undirected.png)

# Drawing the Graph for a Formula

```{r options.too.too,echo=FALSE}
saveopt <- options(width = 500)
```

```{r glmbb.compare.too.too}
idx <- min(which(cumsum(results.graph$weight) >= 0.9))
results.graph[1:idx, ]
foo <- results.graph$formula[1:idx]
```

```{r options.too.too.restore,echo=FALSE}
options(saveopt)
```

So let's look at one of these formulas.

```{r formula}
foo[1]
```

Now we are going to see some real magic of R! R is not an ordinary computing
language, in that it knows a lot about itself. Everything in R is an object,
including R expressions. So R can take apart this formula. First we turn this
character string into a formula, and then we call R function `terms` to
interpret it.

```{r formula.take.apart}
foo[1]
ff <- as.formula(foo[1])
tt <- terms(ff)
names(attributes(tt))
```

Looks like we want `term.labels`.

```{r formula.take.apart.too}
ll <- attr(tt, "term.labels")
ll
```

This is something of a mess because of hierarchicality. We really only need
the highest-order terms, but the others don't hurt. First use R function
`strsplit` to get the variables.

```{r formula.take.apart.too.too}
ss <- lapply(ll, strsplit, split = ":", fixed = TRUE)
head(ss)
ss <- lapply(ss, unlist)
ss
```

Now we want all pairs of components of the same interaction. There's an R
function for that: `combn` in R package `utils`.

```{r formula.take.apart.too.too.too,error=TRUE}
cc <- lapply(ss, combn, m = 2)
```

But it doesn't like vectors of length less than 2 (main effect terms involve
only one variable). So remove them and retry.

```{r formula.take.apart.too.too.too.too}
len.ss <- sapply(ss, length)
ss <- ss[len.ss >= 2]
ss
cc <- lapply(ss, combn, m = 2)
cc
```

Getting there. Now put this whole thing in one big two-row matrix,

```{r formula.take.apart.too.too.too.too.too}
rr <- Reduce(cbind, cc, init = NULL)
rr
```

Now there is no honest way to do the last step without a loop AFAICS.

```{r formula.take.apart.too.too.too.too.too.too,fig.align="center"}
nn <- sort(unique(as.vector(rr)))
gr <- matrix(0, length(nn), length(nn))
rownames(gr) <- colnames(gr) <- nn
# put a one in the matrix for each pair appearing in some interaction
for (i in seq_len(ncol(rr)))
    gr[rr[1, i], rr[2, i]] <- 1
# symmetrize
gr <- gr + t(gr)
bar <- network(gr, directed = FALSE)
plot(bar, displaylabels = TRUE)
```

That's pretty easy to interpret. `IQ` is clearly separated from `Residence`
and `Gender` by `Occupational_aspirations` and `Socioeconomic_status`. Using
single letters for the variable names,
$$
   I \perp\!\!\!\perp R, G \mid O, S.
$$
But in that big clique of size four, nothing separates anything in it from
anything else. That's the way cliques work.

Try it again. This time we systematize our calculations as a function.
```{r formula-graph}
formula.to.graph <- function(f) {
    stopifnot(inherits(f, "formula"))
    tt <- terms(f)
    ll <- attr(tt, "term.labels")
    ss <- lapply(ll, strsplit, split = ":", fixed = TRUE)
    ss <- lapply(ss, unlist)
    len.ss <- sapply(ss, length)
    ss <- ss[len.ss >= 2]
    cc <- lapply(ss, combn, m = 2)
    rr <- Reduce(cbind, cc, init = NULL)
    nn <- sort(unique(as.vector(rr)))
    gr <- matrix(0, length(nn), length(nn))
    rownames(gr) <- colnames(gr) <- nn
    for (i in seq_len(ncol(rr)))
        gr[rr[1, i], rr[2, i]] <- 1
    gr <- gr + t(gr)
    network(gr, directed = FALSE)
}
```

And give it a try-out.

```{r formula-graph-plot,fig.align="center"}
plot(formula.to.graph(as.formula(foo[[2]])), displaylabels = TRUE)
```

This is a little more interesting. Now we have `IQ`, `Residence`, and
`Gender` all conditionally independent given the other two variables. In math
\begin{align*}
   I \perp\!\!\!\perp R, G & \mid O, S
   \\
   R \perp\!\!\!\perp I, G & \mid O, S
   \\
   G \perp\!\!\!\perp I, R & \mid O, S
\end{align*}

It is a bit annoying that the $\perp\!\!\!\perp$ symbol isn't powerful enough
to say what we want in one statement. Independence is not a pairwise
property. So when three variables are conditionally independent given some
others, it takes more than one use of the symbol to say that. Let's look at
factorizations
\begin{align*}
   f(i, r, g, o, s)
   & = f(i \mid r, g, o, s) f(r, g, o, s)
   \\
   & = f(i \mid o, s) f(r, g, o, s)
   \\
   & = f(i \mid o, s) f(r \mid g, o, s) f(g, o, s)
   \\
   & = f(i \mid o, s) f(r \mid o, s) f(g, o, s)
   \\
   & = f(i \mid o, s) f(r \mid o, s) f(g \mid o, s) f(o, s)
\end{align*}
where the equalities on odd-numbered lines are just joint equals conditional
times marginal and the equalities on even-numbered lines are the conditional
independence properties we read off the graph. Another way to write this is
$$
   f(i, r, g \mid o, s) = f(i \mid o, s) f(r \mid o, s) f(g \mid o, s)
$$
which is what we said in words right at the beginning of our analysis of this
graph: $I$, $R$, and $G$ are conditionally independent given $O$ and $S$.

# Bibliography