---
title: "Stat 8054 Lecture Notes: Geometric Ergodicity of Metropolis-Hastings"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  bookdown::html_document2:
    number_sections: true
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
    css: "../bar.css"
  bookdown::pdf_document2:
    latex_engine: xelatex
    number_sections: true
    md_extensions: -tex_math_single_backslash
linkcolor: blue
urlcolor: blue
bibliography: mcmc.bib
csl: journal-of-the-royal-statistical-society.csl
link-citations: true
---

# License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License (http://creativecommons.org/licenses/by-sa/4.0/).

# R

 * The version of R used to make this document is `r getRversion()`.

 * The version of the `rmarkdown` package used to make this document is
   `r packageVersion("rmarkdown")`.

# Geometric Ergodicity and the Markov Chain CLT

This is not a theory course, but some theory came up in discussion of MCMC.

A Markov chain is *geometrically ergodic* [@meyn-tweedie, Chapter 15] if
\begin{equation}
   \lVert P^n(x, \,\cdot\,) - \pi(\,\cdot\,) \rVert
   \le
   M(x) \rho^n
   (\#eq:geom)
\end{equation}
for some function $M$ and some constant $\rho < 1$, where the left-hand side
is the total variation distance between the distribution of the Markov chain
at time $n$ when started at $x$ and the equilibrium distribution.  Thus this
distance decreases exponentially fast (also called geometrically fast because
of the geometric series).

Geometric ergodicity is important because it is part of the sufficient
conditions for the Markov chain central limit theorem (CLT) to hold.
@chan-geyer, Theorem 2, showed that if the functional of the Markov chain
under discussion has $2 + \varepsilon$ moments for some $\varepsilon > 0$
and the Markov chain is geometrically ergodic, then the CLT holds.
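As a toy illustration of this CLT (not part of the argument above, and with
every particular choice, the AR(1) chain, the coefficient, the batch length,
made up purely for illustration), consider a Gaussian AR(1) chain, which is
reversible and geometrically ergodic, so its sample means obey the Markov
chain CLT.  The method of batch means estimates the asymptotic variance,
which for this chain is $(1 + \rho) / (1 - \rho)$ times the stationary
variance.

```r
# toy illustration: Gaussian AR(1) chain with stationary distribution N(0, 1)
set.seed(42)
rho <- 0.5
n <- 1e5
x <- double(n)
x[1] <- rnorm(1)                      # start at stationarity
for (t in 2:n)
    x[t] <- rho * x[t - 1] + sqrt(1 - rho^2) * rnorm(1)
# theoretical asymptotic variance in the CLT: (1 + rho) / (1 - rho) = 3
b <- floor(sqrt(n))                   # batch length
nbatch <- floor(n / b)
batch.means <- colMeans(matrix(x[1:(b * nbatch)], nrow = b))
sigma2.hat <- b * var(batch.means)    # batch means estimate
mean(x)
sigma2.hat
```

Here `sigma2.hat` should be near 3, which is what the square root of $n$
times the Monte Carlo error of `mean(x)` looks like in repeated sampling.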
@roberts-rosenthal, Theorem 2.1, showed that if the functional of the Markov
chain under discussion has second moments and the Markov chain is *reversible*
and geometrically ergodic, then the CLT holds.

Proving geometric ergodicity is hard.  That is why @mengersen-tweedie,
@roberts-tweedie, and @jarner-hansen exist.  But even the theorems in those
papers are hard to apply to non-toy problems.

# Lessons from Mengersen and Tweedie

## Independence Metropolis-Hastings is Just Like Rejection Sampling Except for Minor Details

@tierney coined the name *independence* Metropolis-Hastings for a
Metropolis-Hastings sampler for which the proposal distribution does not
depend on the current state.  Thus it is just a probability density function
(PDF) $q$.

@mengersen-tweedie, Theorem 2.1, says such a Markov chain is geometrically
ergodic only if the importance weights $\pi(x) / q(x)$ are bounded, where
$\pi$ is the target distribution, in which case the sampler is actually
uniformly ergodic (which means that $M(x)$ in \@ref(eq:geom) does not
actually depend on $x$).

This is exactly the same condition needed for ordinary Monte Carlo (OMC)
rejection sampling to work, where $q$ is the proposal distribution for that.
The only differences are that

 * in OMC rejection sampling one needs to know the bound (it is used in
   the OMC rejection step), but

 * in Metropolis-Hastings rejection one does not need to know the bound
   (it is not used in the Metropolis-Hastings rejection step),

and, of course,

 * OMC rejection sampling produces independent and identically distributed
   samples from the target distribution, but

 * the Metropolis-Hastings algorithm produces a Markov chain having the
   target distribution as its equilibrium distribution.

## OMC Importance Sampling is Maybe Better

For the CLT to hold for OMC importance sampling, one only needs that the
quantities being averaged
$$
   \frac{g(X) \pi(X)}{q(X)}
$$
have finite variance, and this is different from the importance weights
being bounded.
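The importance-sampling estimator just described can be sketched in R.  The
target, proposal, and $g$ below are hypothetical choices made purely for
illustration (target $\pi$ standard normal, proposal $q$ a $t$ distribution
with 3 degrees of freedom, $g(x) = x^2$), chosen so the weights happen to be
bounded and the true value of the integral is known.

```r
# OMC importance sampling: average g(X) pi(X) / q(X) with X simulated from q
# illustrative choices: pi = N(0, 1), q = t with 3 df, g(x) = x^2,
# so the estimand is E(X^2) = 1 under the target
set.seed(42)
nsim <- 1e5
x <- rt(nsim, df = 3)
w <- dnorm(x) / dt(x, df = 3)    # importance weights (bounded here)
est <- mean(x^2 * w)
est
```

The estimate `est` should be close to 1, and because the quantities being
averaged have finite variance, the ordinary (IID) CLT gives its Monte Carlo
error.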
Although, if one wants this to hold for every $g$ that has second moments
with respect to $\pi$, then this is equivalent to the importance weights
being bounded.

## Random Walk Metropolis-Hastings

@tierney coined the name *random walk* Metropolis (RWM) for a Metropolis
sampler for which the proposal distribution has the form
$$
   q(x, y) = q(y - x)
$$
where $y$ is the proposal and $x$ is the current state.

@mengersen-tweedie provide some counterintuitive results for RWM.  Their
Theorem 3.3 says that, provided the proposal distribution has an absolute
first moment, the sampler cannot be geometrically ergodic unless the target
distribution has a moment generating function.

@jarner-tweedie considerably improve this result.  They drop the condition
on the proposal distribution and extend to the multivariate case.  Their
Theorem 2.2 says an RWM sampler is geometrically ergodic only if the target
distribution has a moment generating function.  If it does not, then no
proposal distribution will work.

Then @mengersen-tweedie provide a sufficient condition for geometric
ergodicity.  For the multivariate case @roberts-tweedie and @jarner-hansen
prove some sufficient conditions.  All of these sufficient conditions are
far from necessary.  A student asked in class about distributions falling
in between these necessary conditions and sufficient conditions.

# A Lesson from Roberts and Tweedie

Theorem 5.1 of @roberts-tweedie says that no Metropolis-Hastings algorithm
that has
\begin{equation}
   \mathop{\rm ess\,sup} P(x, \{x\}) = 1
   (\#eq:reject)
\end{equation}
can be geometrically ergodic.  Here $P(x, \{x\})$ is the probability that
the (continuous) proposal is (Metropolis-Hastings) rejected when the current
position is $x$.  If that rejection probability is not bounded away from
one, then we cannot have geometric ergodicity.

# A Toy Problem

What that student asked about is a target distribution for which the PDF
does not go to zero at infinity.  The PDF is spiky.
Such a distribution satisfies none of the sufficient conditions of
@mengersen-tweedie or @roberts-tweedie or @jarner-hansen.  So we look at
one such toy problem.

The toy problem we will analyze has a continuous uniform distribution on a
certain set but is really a smeared-out geometric distribution.  The PDF is
$$
   f(x)
   =
   \begin{cases}
   1, & \text{$n \in \mathbb{N}$ and $n < x < n + p (1 - p)^n$} \\
   0, & \text{otherwise}
   \end{cases}
$$
(here $\mathbb{N}$ is the set of natural numbers (nonnegative integers)).
This PDF clearly integrates to one, and the lim sup of $f(x)$ as
$x \to \infty$ is one.

We propose to sample it via RWM with any continuous unimodal proposal
distribution.  Examples are a mean-zero normal distribution or a uniform
distribution on an interval symmetric about zero.

Clearly, for $x$ very far out in the tails of the distribution, the spikes
of the target PDF are so narrow that the proposal almost never lands in one,
so the per-$x$ rejection probability, the probability in \@ref(eq:reject),
goes to one as $x$ goes to infinity.  Thus the sampler is not geometrically
ergodic.

It is hard to know what lessons to draw from a toy problem.  It is clear
that there is no way to modify the example so that it is geometrically
ergodic unless

 * we modify the proposal distribution so it "knows" where the spikes are or

 * we modify the target PDF so it isn't so spiky.

But this example also shows there is something weird about asymptotics.
The only problem is way out in the tails.  The farther out in the tails the
sampler gets, the longer it takes to get back to the center of the
distribution.  But we only have a problem far out in the tails.  So as long
as we are interested in getting a good idea about the part of the
distribution that is not very far out in the tails, we are OK.  For example,
if we conditioned on the event that $X < 10^{100000}$, then we would have a
geometrically ergodic sampler and the Markov chain CLT would hold.  But we
would never see samples outside this event in any run we would actually do
anyway!  So Markov chain asymptotics can be somewhat misleading.
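The claim about the rejection probability can be checked numerically.  The
following R sketch uses arbitrary illustrative choices ($p = 1/2$, standard
normal proposals) and estimates by simulation the RWM acceptance probability
$1 - P(x, \{x\})$ for a current state in spike $n$.  Since $f$ is either 0
or 1 and the current state has $f(x) = 1$, the Metropolis acceptance
probability at $x$ is just the probability that the proposal lands in the
support.

```r
# target: f(x) = 1 on the union of intervals (n, n + p (1 - p)^n), n >= 0
p <- 0.5
in.support <- function(x) {
    n <- floor(x)
    n >= 0 & x > n & x < n + p * (1 - p)^n
}
# midpoint of spike n, a valid current state
spike.point <- function(n) n + p * (1 - p)^n / 2
# Monte Carlo estimate of the acceptance probability at x
# with standard normal proposals
accept.prob <- function(x, nsim = 1e5) {
    set.seed(42)
    mean(in.support(x + rnorm(nsim)))
}
accept.prob(spike.point(1))     # appreciable
accept.prob(spike.point(30))    # nearly zero: rejection probability near one
```

The acceptance probability collapses as $n$ grows because the total width of
the spikes near spike $n$ is of order $(1 - p)^n$, which is why the
probability in \@ref(eq:reject) approaches one.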
# Bibliography {-}