--- title: "Stat 8054 Lecture Notes: Autocovariances in MCMC" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML css: "../bar.css" pdf_document: number_sections: true md_extensions: -tex_math_single_backslash --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. * The version of the `mcmc` package used to make this document is `r packageVersion("mcmc")`. # Functionals of a Markov Chain If $X_1$, $X_2$, $\ldots$ is a Markov chain and $g$ is a real-valued function on the state space, then $g(X_1)$, $g(X_2)$, $\ldots$ is called a *functional* of the Markov chain ([Geyer, 2011, Introduction to MCMC, in *Handbook of Markov Chain Monte Carlo*](http://mcmchandbook.net/HandbookChapter1.pdf), hereinafter referred to as the "Intro", Section 1.6) # Estimates of Expectations of Functionals of a Markov Chain If we cannot calculate, either by hand or by using a [computer algebra system](https://en.wikipedia.org/wiki/Computer_algebra_system), an expectation $$ \mu = E\{ g(X) \} $$ for some random object $X$ and some real-valued function $g$ on the state space (the set where $X$ takes values), then we estimate it by simulating a Markov chain $X_1$, $X_2$, $\ldots,$ whose unique equilibrium distribution is the distribution of $X$, and using $$ \hat{\mu}_n = \frac{1}{n} \sum_{i = 1}^n g(X_i) $$ as our estimator (these are equations (1.1) and (1.2) in the [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)). # Asymptotic Variance and Autocovariances The Markov chain central limit theorem says that $\hat{\mu}_n$ is a consistent and asymptotically normal estimator of $\mu$ with asymptotic variance $\sigma^2 / n$, where $$ \sigma^2 = \gamma_0 + 2 \sum_{k = 1}^\infty \gamma_k $$ where $$ \gamma_k = \mathop{\rm cov}\{ g(X_i), g(X_{i + k}) \} $$ (these are equations (1.9) and (1.8) in the Intro [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)), where here the covariances refer to the stationary Markov chain having the same transition probabilities as the Markov chain actually being run so that there is no dependence on $i$ as is required for this to be a good definition (more on this in Section 1.8 of the intro). The natural estimator of the "little gammas" $\gamma_k$ is $$ \hat{\gamma}_k = \frac{1}{n} \sum_{i = 1}^{n - k} \{ [ g(X_i) - \hat{\mu}_n ] [ g(X_{i + k}) - \hat{\mu}_n ]\} $$ (this is equation (1.10) in the [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)). Each $\gamma_k$ is called an *autocovariance* (of this functional of the Markov chain) and the function $k \mapsto \gamma_k$ is called the *autocovariance function*. R function `acf` estimates autocovariance functions. # Behavior of the Autocovariance Estimates The behavior of the autocovariance estimates is extremely bad. For large $k$, when $\gamma_k$ is nearly zero, the variance of $\hat{\gamma}_k$ does not depend on $k$ and hence does not go to zero. Similarly, the covariance of $\hat{\gamma}_k$ and $\hat{\gamma}_{k + m}$ does not depend on $k$, and hence does not decrease with $k$. 
Thus the part of the autocovariance function that corresponds to $k$ such
that $\gamma_k$ is nearly zero looks like a stationary time series.  Geyer
([1992, Practical Markov Chain Monte Carlo, *Statistical Science*](https://www.jstor.org/stable/2246094),
Section 3.1) discusses this, not deriving anything new but rather citing
earlier work by Bartlett and Priestley.

As a simple example, we take the Markov chain from the examples section of
the help page for R function `initseq`.

```{r label=series}
n <- 2e4
rho <- 0.99
x <- arima.sim(model = list(ar = rho), n = n)
```

This is an AR(1) time series, which is also a Markov chain (Section 1.9 of
the [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)).  Unlike in the
[Intro](http://mcmchandbook.net/HandbookChapter1.pdf), we look at the boring
part of the estimated autocovariance function (where the true autocovariance
function is nearly zero).  The following code

```{r label=show, eval=FALSE}
aout <- acf(x, type = "covariance", lag.max = n / 4, plot = FALSE)
bound <- max(abs(aout$acf[aout$lag >= 400]))
plot(aout, xlim = c(400, max(aout$lag)), ylim = c(-1, 1) * bound)
```

produces the following figure.

```{r label=plot, fig.cap="Estimated Autocovariance Function", echo = FALSE, fig.align='center'}
aout <- acf(x, type = "covariance", lag.max = n / 4, plot = FALSE)
bound <- max(abs(aout$acf[aout$lag >= 400]))
plot(aout, xlim = c(400, max(aout$lag)), ylim = c(-1, 1) * bound)
```

Although we think the true autocovariance function should go to zero as $k$
goes to infinity (more on this later), the estimated autocovariance function
does not do this.  Both the figure and the theory discussed above show this.

Thus the naive estimator of $\sigma^2$ that just adds up the estimated
autocovariances,
$$
   \hat{\gamma}_0 + 2 \sum_{k = 1}^{n - 1} \hat{\gamma}_k,
$$
is a *terrible* estimator!  It isn't even consistent.  Its variance is
$O(1)$, not the $O(1 / n)$ one would expect for a good estimator
([Geyer, 1992](https://www.jstor.org/stable/2246094), Section 3.1, cited
above).  A numerical illustration appears at the end of this section.

Incidentally, this explains why we divided by $n$ rather than by $n - k$ in
the expression for $\hat{\gamma}_k$.  If the estimator is already terrible
for large $k$, then dividing by $n - k$ rather than $n$ would only make it
worse.  Also, our definition agrees with what R function `acf` computes and
with most of the time series literature.

This phenomenon was well understood in the time series literature long before
most statisticians became interested in MCMC.  Thus many methods of dealing
with the problem have been proposed for stationary time series (that are not
necessarily functionals of Markov chains).  But none is so good as to be
obviously better than the others, and none is anywhere near as good for
functionals of reversible Markov chains as the *initial sequence methods*
proposed by Geyer ([1992](https://www.jstor.org/stable/2246094), Section 3.3,
cited above) and implemented in R function `initseq` in CRAN package `mcmc`.
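To put a number on how bad the naive estimator is, here is a quick sketch for
the AR(1) example above, summing the estimated autocovariances out to lag
$n / 4$ (an arbitrary cutoff, far past where the true autocovariances are
negligible).  The benchmark is the true asymptotic variance of an AR(1) chain
with autoregressive coefficient $\rho$ and unit innovation variance,
$1 / (1 - \rho)^2$, a standard fact we do not derive here.  The naive sum
typically lands far from this value and changes wildly from one rendering of
this document to the next.

```{r label=naive}
# naive estimator of sigma^2: add up the estimated autocovariances out to lag n / 4
aout <- acf(x, type = "covariance", lag.max = n / 4, plot = FALSE)
gamma.hat <- drop(aout$acf)
gamma.hat[1] + 2 * sum(gamma.hat[-1])
# true asymptotic variance for this AR(1) chain, for comparison
1 / (1 - rho)^2
```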
# Theory of Initial Sequence Estimators

The infinite-dimensional analog of eigenvalues and eigenvectors, which is
called the *spectral decomposition of a bounded operator on a Hilbert space*,
is used by Geyer ([1992](https://www.jstor.org/stable/2246094), cited above,
Section 3.3) to prove things about the behavior of the sequence of "big
gammas"
$$
   \Gamma_k = \gamma_{2 k} + \gamma_{2 k + 1}.
$$
But first we look at the little gammas
$$
   \gamma_k = \int \lambda^k \, E_g(d \lambda)
$$
where $E_g$ is some positive measure concentrated on the interval $(-1, 1)$
of the real line if the Markov chain is irreducible and aperiodic, which it
must be in order to be useful in MCMC
([Chan and Geyer, 1994, Discussion of the paper by Tierney, *Annals of Statistics*](https://www.jstor.org/stable/2242481),
Lemma 6 and the surrounding discussion).

In this equation the $\lambda$ are the infinite-dimensional analog of
eigenvalues.  There are infinitely many of them.  The infinite-dimensional
analog of eigenvectors is hidden in the measure $E_g$.  The details don't
matter because we cannot calculate $E_g$.  Nor do we even need to know what
bounded operator on a Hilbert space this comes from.  It is the transition
operator of the Markov chain
([Chan and Geyer, 1994](https://www.jstor.org/stable/2242481), cited above,
pp. 1750 ff.), but that doesn't help us do any calculations.

Nor does this representation tell us much about the autocovariance function,
but when we use this representation on the big gammas we get
$$
   \Gamma_k = \int \lambda^{2 k} (1 + \lambda) \, E_g(d \lambda)
$$
from which we immediately see that, unlike the little gammas, the big gammas
are strictly positive (because even powers are positive and $1 + \lambda$ is
positive because $E_g$ is concentrated on the interval $(-1, 1)$).

Actually, a point that Geyer ([1992](https://www.jstor.org/stable/2246094))
missed is that $E_g$ could be concentrated at zero, but this can only happen
when the Markov chain is actually independent and identically distributed,
which is not interesting (no source for this, it comes from a paper I am
working on --- a triviality, but it needs to be mentioned in order to make
the theorem about initial sequence estimators correct).

It is also obvious from this representation that the big gammas are strictly
decreasing in $k$, because $\lambda^{2 k}$ is decreasing in $k$ when $\lambda$
is less than one in absolute value, which it is almost everywhere with
respect to $E_g$.  It is a little less obvious that the big gamma sequence is
a strictly convex function of $k$, but Geyer
([1992](https://www.jstor.org/stable/2246094), cited above) proves this in
just a few lines.

Much less obvious is that
\begin{align*}
   \sigma^2
   & =
   - \gamma_0 + 2 \sum_{k = 0}^\infty \Gamma_k
   \\
   & =
   - \gamma_0 + 2 \sum_{k = 0}^\infty \gamma_k
   \\
   & =
   \lim_{n \to \infty} \int
   \left( - 1 + 2 \sum_{k = 0}^n \lambda^k \right) \, E_g(d \lambda)
   \\
   & =
   \lim_{n \to \infty} \int
   \left( - 1 + 2 \frac{1 - \lambda^{n + 1}}{1 - \lambda} \right)
   \, E_g(d \lambda)
   \\
   & =
   \lim_{n \to \infty} \int
   \frac{1 + \lambda - 2 \lambda^{n + 1}}{1 - \lambda} \, E_g(d \lambda)
   \\
   & =
   \int \frac{1 + \lambda}{1 - \lambda} \, E_g(d \lambda)
\end{align*}
by monotone and dominated convergence.
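We can check these claims numerically for the AR(1) example, using the
standard fact (not derived here) that its theoretical autocovariance function
is $\gamma_k = \rho^k / (1 - \rho^2)$ for unit innovation variance.  The
following sketch computes the theoretical big gammas out to an arbitrary
large lag and confirms that they are strictly positive, strictly decreasing,
and strictly convex, and that $- \gamma_0 + 2 \sum_k \Gamma_k$ recovers
$\sigma^2 = 1 / (1 - \rho)^2$ up to truncation of the sum.

```{r label=big.gamma}
# theoretical little gammas for the AR(1) example, k = 0, ..., 9999
little.gamma <- rho^(0:9999) / (1 - rho^2)
# big gammas: Gamma_k = gamma_{2k} + gamma_{2k+1}
big.Gamma <- little.gamma[c(TRUE, FALSE)] + little.gamma[c(FALSE, TRUE)]
all(big.Gamma > 0)                # strictly positive
all(diff(big.Gamma) < 0)          # strictly decreasing
all(diff(diff(big.Gamma)) > 0)    # strictly convex
- little.gamma[1] + 2 * sum(big.Gamma)   # close to sigma^2 = 1 / (1 - rho)^2 = 1e4
```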
The Kipnis-Varadhan Markov chain central limit theorem for functionals of
reversible Markov chains
([Kipnis and Varadhan, 1986, Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions, *Communications in Mathematical Physics*](https://link.springer.com/article/10.1007/BF01210789))
says the central limit theorem holds if the integral
$\int \frac{1 + \lambda}{1 - \lambda} \, E_g(d \lambda)$ above is finite.
So that tells us the $\Gamma_k$ must decrease to zero as $k$ goes to infinity
if there is to be a central limit theorem.

# Example of Initial Sequence Estimators

## Initial Positive Sequence Estimator

So let's look at the initial sequence estimators for our example.  The
following code

```{r label=show.pos, eval=FALSE}
library("mcmc")
iout <- initseq(x)
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
```

produces the following figure.

```{r label=plot.pos, fig.cap="Initial Positive Sequence Estimator", echo = FALSE, fig.align='center'}
library("mcmc")
iout <- initseq(x)
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
```

(This figure and the other figures in this handout change every time the
document is processed by R function `render` in R package `rmarkdown`,
because we do not set random number generator seeds anywhere in this
document.)

This is already a very good estimator.  We first estimate all of the big
gammas by the obvious
$$
   \widehat{\Gamma}_k = \hat{\gamma}_{2 k} + \hat{\gamma}_{2 k + 1}
$$
and then set to zero every $\widehat{\Gamma}_k$ that is negative, as well as
every $\widehat{\Gamma}_k$ that comes after one that is negative.  That is
what is shown in the figure.  Already we have gotten rid of all the spurious
wiggles of the autocovariance function for large lag (by setting them to zero
as just described).  Summing these gives a pretty good estimator.

## Initial Decreasing Sequence Estimator

We can do a little better by imposing monotonicity.  The following code

```{r label=show.dec, eval=FALSE}
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    col = "blue", lwd = 3)
```

produces the following figure.

```{r label=plot.dec, fig.cap="Initial Decreasing Sequence Estimator.  Dots are initial positive sequence estimator.  Blue line is initial decreasing sequence estimator", echo = FALSE, fig.align='center'}
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    col = "blue", lwd = 3)
```

The improvement cannot even be seen on the plot, but there is some.

```{r label=pos.dec}
iout$var.pos
iout$var.dec
iout$var.pos - iout$var.dec
(iout$var.pos - iout$var.dec) / iout$var.dec
```

This is about a `r round(100 * (iout$var.pos - iout$var.dec) / iout$var.dec, 1)` percent decrease in the estimated asymptotic variance.

## Initial Convex Sequence Estimator

We can do a little better by imposing convexity.  The following code

```{r label=show.con, eval=FALSE}
plot(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.con) - 1, iout$Gamma.con,
    col = "red", lwd = 3)
```

produces the following figure.

```{r label=plot.con, fig.cap="Initial Convex Sequence Estimator.  Dots are initial decreasing sequence estimator.  Red line is initial convex sequence estimator", echo = FALSE, fig.align='center'}
plot(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.con) - 1, iout$Gamma.con,
    col = "red", lwd = 3)
```
The improvement cannot even be seen on the plot, but there is some.

```{r label=dec.con}
iout$var.dec
iout$var.con
iout$var.dec - iout$var.con
(iout$var.dec - iout$var.con) / iout$var.con
```

This is about a `r round(100 * (iout$var.dec - iout$var.con) / iout$var.con, 1)` percent decrease in the estimated asymptotic variance.

## Some More Examples

Let's just try this a bunch of times.

```{r label=sim}
nsim <- 20
result <- NULL
for (i in 1:nsim) {
    x <- arima.sim(model = list(ar = rho), n = n)
    iout <- initseq(x)
    tmp <- iout[c("var.pos", "var.dec", "var.con")]
    tmp$imp.dec <- with(iout, 100 * (var.pos - var.dec) / var.dec)
    tmp$imp.con <- with(iout, 100 * (var.dec - var.con) / var.con)
    result <- rbind(result, unlist(tmp))
}
round(result, 2)
```

We can see that the amount of improvement (fourth and fifth columns, relative
improvements in percent) is quite variable: sometimes negligible, sometimes
appreciable.  Most of the time the initial positive sequence estimator is
just as good as the others, but on rare occasions the initial decreasing
sequence estimator or the initial convex sequence estimator produces a big
win.

We can see that the estimates themselves (first, second, and third columns)
are also quite variable.  That is because of the small Monte Carlo sample
size.  Increasing `n` would improve the estimates (of course).
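As a quick check of that last remark (the factor of ten and the five
replications are arbitrary choices), here are initial convex sequence
estimates from runs ten times as long.  They should scatter around the true
value $1 / (1 - \rho)^2 = 10^4$ for this AR(1) chain noticeably more tightly
than the estimates in the table above.

```{r label=bigger.n}
# initial convex sequence estimates from runs ten times as long as before;
# the spread should be noticeably smaller than in the table above
replicate(5, initseq(arima.sim(model = list(ar = rho), n = 10 * n))$var.con)
# true asymptotic variance for this AR(1) chain, for comparison
1 / (1 - rho)^2
```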