--- title: "Stat 8054 Lecture Notes: Autocovariances in MCMC" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML css: "../bar.css" pdf_document: number_sections: true md_extensions: -tex_math_single_backslash --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. * The version of the `mcmc` package used to make this document is `r packageVersion("mcmc")`. # Functionals of a Markov Chain If $X_1$, $X_2$, $\ldots$ is a Markov chain and $g$ is a real-valued function on the state space, then $g(X_1)$, $g(X_2)$, $\ldots$ is called a *functional* of the Markov chain ([Geyer, 2011, Introduction to MCMC, in *Handbook of Markov Chain Monte Carlo*](http://mcmchandbook.net/HandbookChapter1.pdf), hereinafter referred to as the "Intro", Section 1.6) # Estimates of Expectations of Functionals of a Markov Chain If we cannot calculate, either by hand or by using a [computer algebra system](https://en.wikipedia.org/wiki/Computer_algebra_system), an expectation $$ \mu = E\{ g(X) \} $$ for some random object $X$ and some real-valued function $g$ on the state space (the set where $X$ takes values), then we estimate it by simulating a Markov chain $X_1$, $X_2$, $\ldots,$ whose unique equilibrium distribution is the distribution of $X$, and using $$ \hat{\mu}_n = \frac{1}{n} \sum_{i = 1}^n g(X_i) $$ as our estimator (these are equations (1.1) and (1.2) in the [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)). # Asymptotic Variance and Autocovariances The Markov chain central limit theorem says that $\hat{\mu}_n$ is a consistent and asymptotically normal estimator of $\mu$ with asymptotic variance $\sigma^2 / n$, where $$ \sigma^2 = \gamma_0 + 2 \sum_{k = 1}^\infty \gamma_k $$ where $$ \gamma_k = \mathop{\rm cov}\{ g(X_i), g(X_{i + k}) \} $$ (these are equations (1.9) and (1.8) in the Intro [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)), where here the covariances refer to the stationary Markov chain having the same transition probabilities as the Markov chain actually being run so that there is no dependence on $i$ as is required for this to be a good definition (more on this in Section 1.8 of the intro). The natural estimator of the "little gammas" $\gamma_k$ is $$ \hat{\gamma}_k = \frac{1}{n} \sum_{i = 1}^{n - k} \{ [ g(X_i) - \hat{\mu}_n ] [ g(X_{i + k}) - \hat{\mu}_n ]\} $$ (this is equation (1.10) in the [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)). Each $\gamma_k$ is called an *autocovariance* (of this functional of the Markov chain) and the function $k \mapsto \gamma_k$ is called the *autocovariance function*. R function `acf` estimates autocovariance functions. # Behavior of the Autocovariance Estimates The behavior of the autocovariance estimates is extremely bad. For large $k$, when $\gamma_k$ is nearly zero, the variance of $\hat{\gamma}_k$ does not depend on $k$ and hence does not go to zero. Similarly, the covariance of $\hat{\gamma}_k$ and $\hat{\gamma}_{k + m}$ does not depend on $k$, and hence does not decrease with $k$. 
Thus the part of the autocovariance function that corresponds to $k$ such
that $\gamma_k$ is nearly zero looks like a stationary time series.  Geyer
([1992, Practical Markov Chain Monte Carlo, *Statistical Science*](https://www.jstor.org/stable/2246094),
Section 3.1) discusses this, not deriving anything new but rather citing
earlier work by Bartlett and Priestley.

As a simple example, we take the Markov chain from the examples section of
the help page for R function `initseq`.

```{r label=series}
n <- 2e4
rho <- 0.99
x <- arima.sim(model = list(ar = rho), n = n)
```

This is an AR(1) time series, which is also a Markov chain (Section 1.9 of
the [Intro](http://mcmchandbook.net/HandbookChapter1.pdf)).  Unlike in the
[Intro](http://mcmchandbook.net/HandbookChapter1.pdf), we look at the boring
part of the estimated autocovariance function (where the true autocovariance
function is nearly zero).  The following code

```{r label=show, eval=FALSE}
aout <- acf(x, type = "covariance", lag.max = n / 4, plot = FALSE)
bound <- max(abs(aout$acf[aout$lag >= 400]))
plot(aout, xlim = c(400, max(aout$lag)), ylim = c(-1, 1) * bound)
```

produces the following figure.

```{r label=plot, fig.cap="Estimated Autocovariance Function", echo = FALSE, fig.align='center'}
aout <- acf(x, type = "covariance", lag.max = n / 4, plot = FALSE)
bound <- max(abs(aout$acf[aout$lag >= 400]))
plot(aout, xlim = c(400, max(aout$lag)), ylim = c(-1, 1) * bound)
```

Although we think the true autocovariance function should go to zero as $k$
goes to infinity (more on this later), the estimated autocovariance function
does not do this.  Both the figure and the theory discussed above show this.

Thus the naive estimator of $\sigma^2$ that just adds up the estimated
autocovariances,
$$
   \hat{\gamma}_0 + 2 \sum_{k = 1}^{n - 1} \hat{\gamma}_k,
$$
is a *terrible* estimator!  It isn't even consistent.  Its variance is
$O(1)$, not the $O(1 / n)$ one would expect for a good estimator
([Geyer, 1992](https://www.jstor.org/stable/2246094), Section 3.1, cited
above).  A numerical illustration appears at the end of this section.

Incidentally, this explains why we divided by $n$ rather than by $n - k$ in
the expression for $\hat{\gamma}_k$.  If the estimator is already terrible
for large $k$, then dividing by $n - k$ rather than $n$ would only make it
worse.  Also, our definition agrees with what R function `acf` computes and
with most of the time series literature.

This phenomenon was well understood in the time series literature long before
most statisticians became interested in MCMC.  Thus many methods of dealing
with the problem have been proposed for stationary time series (that are not
necessarily functionals of Markov chains).  But none is so good as to be
obviously better than the others, and none is anywhere near as good for
functionals of reversible Markov chains as the *initial sequence methods*
proposed by Geyer ([1992](https://www.jstor.org/stable/2246094), Section 3.3,
cited above) and implemented in R function `initseq` in CRAN package `mcmc`.
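To put a number on how bad the naive estimator is, here is a quick sketch for
the AR(1) example above, summing the estimated autocovariances out to lag
$n / 4$ (an arbitrary cutoff, far past where the true autocovariances are
negligible).  The benchmark is the true asymptotic variance of an AR(1) chain
with autoregressive coefficient $\rho$ and unit innovation variance,
$1 / (1 - \rho)^2$, a standard fact we do not derive here.  The naive sum
typically lands far from this value and changes wildly from one rendering of
this document to the next.

```{r label=naive}
# naive estimator of sigma^2: add up the estimated autocovariances out to lag n / 4
aout <- acf(x, type = "covariance", lag.max = n / 4, plot = FALSE)
gamma.hat <- drop(aout$acf)
gamma.hat[1] + 2 * sum(gamma.hat[-1])
# true asymptotic variance for this AR(1) chain, for comparison
1 / (1 - rho)^2
```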
# Theory of Initial Sequence Estimators

The infinite-dimensional analog of eigenvalues and eigenvectors, which is
called the *spectral decomposition of a bounded operator on a Hilbert space*,
is used by Geyer ([1992](https://www.jstor.org/stable/2246094), cited above,
Section 3.3) to prove things about the behavior of the sequence of "big
gammas"
$$
   \Gamma_k = \gamma_{2 k} + \gamma_{2 k + 1}.
$$
But first we look at the little gammas
$$
   \gamma_k = \int \lambda^k \, E_g(d \lambda)
$$
where $E_g$ is some positive measure concentrated on the interval $(-1, 1)$
of the real line if the Markov chain is irreducible and aperiodic, which it
must be in order to be useful in MCMC
([Chan and Geyer, 1994, Discussion of the paper by Tierney, *Annals of Statistics*](https://www.jstor.org/stable/2242481),
Lemma 6 and the surrounding discussion).

In this equation the $\lambda$ are the infinite-dimensional analog of
eigenvalues.  There are infinitely many of them.  The infinite-dimensional
analog of eigenvectors is hidden in the measure $E_g$.  The details don't
matter because we cannot calculate $E_g$.  Nor do we even need to know what
bounded operator on a Hilbert space this comes from.  It is the transition
operator of the Markov chain
([Chan and Geyer, 1994](https://www.jstor.org/stable/2242481), cited above,
pp. 1750 ff.), but that doesn't help us do any calculations.

Nor does this representation tell us much about the autocovariance function,
but when we use this representation on the big gammas we get
$$
   \Gamma_k = \int \lambda^{2 k} (1 + \lambda) \, E_g(d \lambda)
$$
from which we immediately see that, unlike the little gammas, the big gammas
are strictly positive (because even powers are positive and $1 + \lambda$ is
positive because $E_g$ is concentrated on the interval $(-1, 1)$).

Actually, a point that Geyer ([1992](https://www.jstor.org/stable/2246094))
missed is that $E_g$ could be concentrated at zero, but this can only happen
when the Markov chain is actually independent and identically distributed,
which is not interesting (no source for this, it comes from a paper I am
working on --- a triviality, but it needs to be mentioned in order to make
the theorem about initial sequence estimators correct).

It is also obvious from this representation that the big gammas are strictly
decreasing in $k$, because $\lambda^{2 k}$ is decreasing in $k$ when $\lambda$
is less than one in absolute value, which it is almost everywhere with
respect to $E_g$.  It is a little less obvious that the big gamma sequence is
a strictly convex function of $k$, but Geyer
([1992](https://www.jstor.org/stable/2246094), cited above) proves this in
just a few lines.

Much less obvious is that
\begin{align*}
   \sigma^2
   & =
   - \gamma_0 + 2 \sum_{k = 0}^\infty \Gamma_k
   \\
   & =
   - \gamma_0 + 2 \sum_{k = 0}^\infty \gamma_k
   \\
   & =
   \lim_{n \to \infty} \int
   \left( - 1 + 2 \sum_{k = 0}^n \lambda^k \right) \, E_g(d \lambda)
   \\
   & =
   \lim_{n \to \infty} \int
   \left( - 1 + 2 \frac{1 - \lambda^{n + 1}}{1 - \lambda} \right)
   \, E_g(d \lambda)
   \\
   & =
   \lim_{n \to \infty} \int
   \frac{1 + \lambda - 2 \lambda^{n + 1}}{1 - \lambda} \, E_g(d \lambda)
   \\
   & =
   \int \frac{1 + \lambda}{1 - \lambda} \, E_g(d \lambda)
\end{align*}
by monotone and dominated convergence.
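We can check these claims numerically for the AR(1) example, using the
standard fact (not derived here) that its theoretical autocovariance function
is $\gamma_k = \rho^k / (1 - \rho^2)$ for unit innovation variance.  The
following sketch computes the theoretical big gammas out to an arbitrary
large lag and confirms that they are strictly positive, strictly decreasing,
and strictly convex, and that $- \gamma_0 + 2 \sum_k \Gamma_k$ recovers
$\sigma^2 = 1 / (1 - \rho)^2$ up to truncation of the sum.

```{r label=big.gamma}
# theoretical little gammas for the AR(1) example, k = 0, ..., 9999
little.gamma <- rho^(0:9999) / (1 - rho^2)
# big gammas: Gamma_k = gamma_{2k} + gamma_{2k+1}
big.Gamma <- little.gamma[c(TRUE, FALSE)] + little.gamma[c(FALSE, TRUE)]
all(big.Gamma > 0)                # strictly positive
all(diff(big.Gamma) < 0)          # strictly decreasing
all(diff(diff(big.Gamma)) > 0)    # strictly convex
- little.gamma[1] + 2 * sum(big.Gamma)   # close to sigma^2 = 1 / (1 - rho)^2 = 1e4
```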
The Kipnis-Varadhan Markov chain central limit theorem for functionals of
reversible Markov chains
([Kipnis and Varadhan, 1986, Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions, *Communications in Mathematical Physics*](https://link.springer.com/article/10.1007/BF01210789))
says the central limit theorem holds if the integral
$\int \frac{1 + \lambda}{1 - \lambda} \, E_g(d \lambda)$ above is finite.
So that tells us the $\Gamma_k$ must decrease to zero as $k$ goes to infinity
if there is to be a central limit theorem.

# Example of Initial Sequence Estimators

## Initial Positive Sequence Estimator

So let's look at the initial sequence estimators for our example.  The
following code

```{r label=show.pos, eval=FALSE}
library("mcmc")
iout <- initseq(x)
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
```

produces the following figure.

```{r label=plot.pos, fig.cap="Initial Positive Sequence Estimator", echo = FALSE, fig.align='center'}
library("mcmc")
iout <- initseq(x)
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
```

(This figure and the other figures in this handout change every time the
document is processed by R function `render` in R package `rmarkdown`,
because we do not set random number generator seeds anywhere in this
document.)

This is already a very good estimator.  We first estimate all of the big
gammas by the obvious
$$
   \widehat{\Gamma}_k = \hat{\gamma}_{2 k} + \hat{\gamma}_{2 k + 1}
$$
and then set to zero every $\widehat{\Gamma}_k$ that is negative, as well as
every $\widehat{\Gamma}_k$ that comes after one that is negative.  That is
what is shown in the figure.  Already we have gotten rid of all the spurious
wiggles of the autocovariance function for large lag (by setting them to zero
as just described).  Summing these gives a pretty good estimator.

## Initial Decreasing Sequence Estimator

We can do a little better by imposing monotonicity.  The following code

```{r label=show.dec, eval=FALSE}
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    col = "blue", lwd = 3)
```

produces the following figure.

```{r label=plot.dec, fig.cap="Initial Decreasing Sequence Estimator.  Dots are initial positive sequence estimator.  Blue line is initial decreasing sequence estimator", echo = FALSE, fig.align='center'}
plot(seq(along = iout$Gamma.pos) - 1, iout$Gamma.pos,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    col = "blue", lwd = 3)
```

The improvement cannot even be seen on the plot, but there is some.

```{r label=pos.dec}
iout$var.pos
iout$var.dec
iout$var.pos - iout$var.dec
(iout$var.pos - iout$var.dec) / iout$var.dec
```

This is about a `r round(100 * (iout$var.pos - iout$var.dec) / iout$var.dec, 1)` percent decrease in the estimated asymptotic variance.

## Initial Convex Sequence Estimator

We can do a little better by imposing convexity.  The following code

```{r label=show.con, eval=FALSE}
plot(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.con) - 1, iout$Gamma.con,
    col = "red", lwd = 3)
```

produces the following figure.

```{r label=plot.con, fig.cap="Initial Convex Sequence Estimator.  Dots are initial decreasing sequence estimator.  Red line is initial convex sequence estimator", echo = FALSE, fig.align='center'}
plot(seq(along = iout$Gamma.dec) - 1, iout$Gamma.dec,
    ylab = expression(Gamma[k]), xlab = expression(k))
abline(h = 0)
lines(seq(along = iout$Gamma.con) - 1, iout$Gamma.con,
    col = "red", lwd = 3)
```
The improvement cannot even be seen on the plot, but there is some.

```{r label=dec.con}
iout$var.dec
iout$var.con
iout$var.dec - iout$var.con
(iout$var.dec - iout$var.con) / iout$var.con
```

This is about a `r round(100 * (iout$var.dec - iout$var.con) / iout$var.con, 1)` percent decrease in the estimated asymptotic variance.

## Some More Examples

Let's just try this a bunch of times.

```{r label=sim}
nsim <- 20
result <- NULL
for (i in 1:nsim) {
    x <- arima.sim(model = list(ar = rho), n = n)
    iout <- initseq(x)
    tmp <- iout[c("var.pos", "var.dec", "var.con")]
    tmp$imp.dec <- with(iout, 100 * (var.pos - var.dec) / var.dec)
    tmp$imp.con <- with(iout, 100 * (var.dec - var.con) / var.con)
    result <- rbind(result, unlist(tmp))
}
round(result, 2)
```

We can see that the amount of improvement (fourth and fifth columns, relative
improvements in percent) is quite variable: sometimes negligible, sometimes
appreciable.  Most of the time the initial positive sequence estimator is
just as good as the others, but on rare occasions the initial decreasing
sequence estimator or the initial convex sequence estimator produces a big
win.

We can see that the estimates themselves (first, second, and third columns)
are also quite variable.  That is because of the small Monte Carlo sample
size.  Increasing `n` would improve the estimates (of course).
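As a quick check of that last remark (the factor of ten and the five
replications are arbitrary choices), here are initial convex sequence
estimates from runs ten times as long.  They should scatter around the true
value $1 / (1 - \rho)^2 = 10^4$ for this AR(1) chain noticeably more tightly
than the estimates in the table above.

```{r label=bigger.n}
# initial convex sequence estimates from runs ten times as long as before;
# the spread should be noticeably smaller than in the table above
replicate(5, initseq(arima.sim(model = list(ar = rho), n = 10 * n))$var.con)
# true asymptotic variance for this AR(1) chain, for comparison
1 / (1 - rho)^2
```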