\documentclass[11pt]{article}

\usepackage{graphics}
\usepackage{amsmath}
\usepackage{indentfirst}
% \usepackage[utf8]{inputenc}
\usepackage{url}

\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\trigamma}{trigamma}
\DeclareMathOperator{\NormalDis}{\mathcal{N}}
\DeclareMathOperator{\UniformDis}{Unif}

\newcommand{\boldtheta}{\boldsymbol{\theta}}
\newcommand{\boldI}{\mathbf{I}}
\newcommand{\boldM}{\mathbf{M}}
\newcommand{\boldX}{\mathbf{X}}
\newcommand{\boldY}{\mathbf{Y}}
\newcommand{\weakto}{\stackrel{\mathcal{D}}{\longrightarrow}}

\begin{document}

<>=
options(keep.source = TRUE, width = 60)
@

\title{Stat 5102 Notes: Markov Chain Monte Carlo and Bayesian Inference}
\author{Charles J. Geyer}
\maketitle

\section{The Problem}

This is an example of an application of Bayes' rule that requires some
form of computer analysis.  We will use Markov chain Monte Carlo (MCMC).

The problem is the same one that was done by maximum likelihood on the
computer examples web pages
(\url{http://www.stat.umn.edu/geyer/5102/examp/like.html}).
The data model is gamma.  We will use the Jeffreys prior.

\subsection{Data}

The data are loaded by the R command
<>=
foo <- read.table(url("http://www.stat.umn.edu/geyer/5102/data/ex3-1.txt"),
    header = TRUE)
x <- foo$x
@

\subsection{R Package}

We load the R contributed package \texttt{mcmc}, which is available from CRAN.
<>=
library(mcmc)
@
If this does not work, then install the package using the package menu in R.

\subsection{Random Number Generator Seed}

In order to get the same results every time, we set the seed of the
random number generator.
<>=
set.seed(42)
@
To get different results, change the seed or simply omit this statement.

\subsection{Prior}

We have not done the Fisher information matrix for the two-parameter
gamma distribution.  To calculate Fisher information, it is enough to
have the log likelihood for sample size one.  The PDF is
$$
   f(x \mid \alpha, \lambda)
   =
   \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1}
   \exp\left( - \lambda x \right)
$$
The log likelihood is
$$
   l(\alpha, \lambda)
   =
   \log f(x \mid \alpha, \lambda)
   =
   \alpha \log(\lambda) - \log \Gamma(\alpha)
   + (\alpha - 1) \log(x) - \lambda x
$$
which has derivatives
\begin{align*}
   \frac{\partial l(\alpha, \lambda)}{\partial \alpha}
   & =
   \log(\lambda) - \frac{d}{d \alpha} \log \Gamma(\alpha) + \log(x)
   \\
   \frac{\partial l(\alpha, \lambda)}{\partial \lambda}
   & =
   \frac{\alpha}{\lambda} - x
   \\
   \frac{\partial^2 l(\alpha, \lambda)}{\partial \alpha^2}
   & =
   - \frac{d^2}{d \alpha^2} \log \Gamma(\alpha)
   \\
   \frac{\partial^2 l(\alpha, \lambda)}{\partial \alpha \partial \lambda}
   & =
   \frac{1}{\lambda}
   \\
   \frac{\partial^2 l(\alpha, \lambda)}{\partial \lambda^2}
   & =
   - \frac{\alpha}{\lambda^2}
\end{align*}
% foo = alpha Log[lambda] - Log[Gamma[alpha]] + (alpha - 1) Log[x] - lambda x
% D[ D[ foo, alpha ], alpha ]
% D[ D[ foo, alpha ], lambda ]
% D[ D[ foo, lambda ], lambda ]
% info = - { { D[ D[ foo, alpha ], alpha ], D[ D[ foo, alpha ], lambda ] },
%   { D[ D[ foo, alpha ], lambda ], D[ D[ foo, lambda ], lambda ] } }
% Det[ info ]
Recall that $d^2 \log \Gamma(\alpha) / d \alpha^2$ is called the
trigamma function.
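In R the trigamma function is available as \texttt{trigamma}, so nothing
special is needed to evaluate it.  As a quick sanity check (this chunk is
only an illustration and is not used in the analysis; the evaluation point
$\alpha = 2.5$ and the step size are arbitrary choices), we can compare
\texttt{trigamma} with a finite-difference second derivative of
\texttt{lgamma}.
<<eval = FALSE>>=
# sanity check (illustration only): trigamma(alpha) should agree with a
# numerical second derivative of log Gamma at an arbitrarily chosen point
alpha <- 2.5
h <- 1e-4
(lgamma(alpha + h) - 2 * lgamma(alpha) + lgamma(alpha - h)) / h^2
trigamma(alpha)
@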
So the Fisher information matrix is
$$
   I(\boldtheta)
   =
   \begin{pmatrix}
   \trigamma(\alpha) & - 1 / \lambda
   \\
   - 1 / \lambda & \alpha / \lambda^2
   \end{pmatrix}
$$
Its determinant is
$$
   \lvert I(\boldtheta) \rvert
   =
   \begin{vmatrix}
   \trigamma(\alpha) & - 1 / \lambda
   \\
   - 1 / \lambda & \alpha / \lambda^2
   \end{vmatrix}
   =
   \frac{\alpha \trigamma(\alpha) - 1}{\lambda^2}
$$
and the Jeffreys prior is
$$
   g(\alpha, \lambda)
   =
   \frac{\sqrt{\alpha \trigamma(\alpha) - 1}}{\lambda}
$$

\section{The Markov Chain Monte Carlo}

\subsection{Ordinary Monte Carlo}

The ``Monte Carlo method'' refers to the theory and practice of learning
about probability distributions by simulation rather than calculus.
In ordinary Monte Carlo (OMC) we use IID simulations from the distribution
of interest.  Suppose $X_1$, $X_2$, $\ldots$ are IID simulations from some
distribution, and suppose we want to know an expectation
$$
   \mu = E \{ g(X_i) \}.
$$
Then the law of large numbers (LLN) says
$$
   \hat{\mu}_n = \frac{1}{n} \sum_{i = 1}^n g(X_i)
$$
converges in probability to $\mu$, and the central limit theorem (CLT) says
$$
   \sqrt{n} (\hat{\mu}_n - \mu) \weakto \NormalDis(0, \sigma^2)
$$
where
$$
   \sigma^2 = \var\{ g(X_i) \}
$$
which can be estimated by the empirical variance
$$
   \hat{\sigma}_n^2
   =
   \frac{1}{n} \sum_{i = 1}^n \bigl( g(X_i) - \hat{\mu}_n \bigr)^2
$$
and this allows us to say everything we want to say about $\mu$, for example,
an asymptotic 95\% confidence interval for $\mu$ is
$$
   \hat{\mu}_n \pm 1.96 \frac{\hat{\sigma}_n}{\sqrt{n}}
$$

The theory of OMC is just the theory of frequentist statistical inference.
The only differences are that
\begin{itemize}
\item the ``data'' $X_1$, $\ldots$, $X_n$ are computer simulations rather
    than measurements on objects in the real world,
\item the ``sample size'' is the number of computer simulations rather
    than the size of some real world data, and
\item the unknown parameter $\mu$ is in principle completely known, given
    by some integral, which, unfortunately, we are unable to do.
\end{itemize}
Often we say that $n$ is the \emph{Monte Carlo sample size} to distinguish
it from anything else that may be called a sample size.  Often we say that
$\hat{\mu}_n$ is the \emph{Monte Carlo estimate} or \emph{Monte Carlo
approximation} or \emph{Monte Carlo calculation} of $\mu$ to distinguish
it from anything else that may be called an estimate.  Often we say that
$\hat{\sigma}_n / \sqrt{n}$ is the \emph{Monte Carlo standard error} of
$\hat{\mu}_n$ to distinguish it from anything else that may be called a
standard error.

Great!  The only problem is that it is very difficult to produce IID
simulations of random variables or random vectors whose distribution is
not a brand name distribution.

\section{Markov Chains}

A \emph{Markov chain} is a sequence of dependent random variables
$X_1$, $X_2$, $\ldots$ having the property that the conditional
distribution of the future given the past depends only on the present:
the conditional distribution of $X_{n + 1}$ given $X_1$, $\ldots$, $X_n$
depends only on $X_n$.

We say the Markov chain has \emph{stationary transition probabilities}
if the conditional distribution of $X_{n + 1}$ given $X_n$ is the same
for all $n$.  This property is so important that it is often assumed
without comment.  Every Markov chain used in MCMC has stationary
transition probabilities, but this goes without saying.
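A familiar toy example of a Markov chain with stationary transition
probabilities is a normal AR(1) process, in which $X_{n + 1}$ given $X_n$
is normal with mean $\rho X_n$ and variance one.  The following sketch
(not part of the gamma example; the value of $\rho$ and the run length are
arbitrary choices) simulates such a chain in R.
<<eval = FALSE>>=
# toy Markov chain: normal AR(1), next state = rho * current state + noise
rho <- 0.9
n <- 1e4
x.chain <- numeric(n)
x.chain[1] <- 0
for (i in 2:n)
    x.chain[i] <- rho * x.chain[i - 1] + rnorm(1)
@
For $\lvert \rho \rvert < 1$ this chain has the
$\NormalDis(0, 1 / (1 - \rho^2))$ distribution as an equilibrium
distribution, in the sense defined next.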
We say the Markov chain has an \emph{invariant distribution}, or
\emph{stationary distribution}, or \emph{equilibrium distribution} if
whenever $X_n$ has this equilibrium distribution and the conditional
distribution of $X_{n + 1}$ given $X_n$ is that of the transition
probabilities of the Markov chain, then $X_{n + 1}$ also has this
equilibrium distribution (same as $X_n$).  Thus if $X_1$ has the
equilibrium distribution, then so does $X_n$ for all $n$.

\section{The Metropolis Algorithm}

It turns out that there is an algorithm for simulating a Markov chain
having any equilibrium distribution for which one has an unnormalized
PDF, called the \emph{Metropolis-Hastings-Green algorithm}.  The R
contributed package \texttt{mcmc} has a function \texttt{metrop} that
does this using the most basic version, called the \emph{normal random
walk Metropolis algorithm}.  It works as follows.  Suppose the current
state of the Markov chain is $X_n$ and suppose the unnormalized density
of the desired equilibrium distribution is $h$.  Then the next step of
the Markov chain is simulated as follows.
\begin{enumerate}
\item Simulate $Y_n$ having a $\NormalDis(0, \tau^2)$ distribution.
\item Calculate
$$
   r = \frac{h(X_n + Y_n)}{h(X_n)}
$$
\item Simulate $U_n$ having a $\UniformDis(0, 1)$ distribution.
\item If $U_n < r$, then set $X_{n + 1} = X_n + Y_n$,
    otherwise set $X_{n + 1} = X_n$.
\end{enumerate}

The algorithm works just as well when the state of the Markov chain is
a vector.  We just replace normal font with bold face, and the variance
$\tau^2$ with a variance matrix $\boldM$.  Suppose the current state of
the Markov chain is $\boldX_n$ and suppose the unnormalized density of
the desired equilibrium distribution is $h$.  Then the next step of the
Markov chain is simulated as follows.
\begin{enumerate}
\item Simulate $\boldY_n$ having a $\NormalDis(0, \boldM)$ distribution.
\item Calculate
$$
   r = \frac{h(\boldX_n + \boldY_n)}{h(\boldX_n)}
$$
\item Simulate $U_n$ having a $\UniformDis(0, 1)$ distribution.
\item If $U_n < r$, then set $\boldX_{n + 1} = \boldX_n + \boldY_n$,
    otherwise set $\boldX_{n + 1} = \boldX_n$.
\end{enumerate}

The only thing we can adjust, and must adjust, in the algorithm is the
\emph{proposal variance}, $\tau^2$ in the scalar version and $\boldM$
in the vector version.  If we make the variance too small, then the
chain only takes little baby steps and takes a very long time to
equilibrate.  If we make the variance too big, then the algorithm
proposes very large steps $Y_n$ or $\boldY_n$, but most of those
proposals are rejected in step 4 because the $r$ calculated in step 2
is too small.

\section{The Markov Chain Monte Carlo}

MCMC is much like OMC.  Most Markov chains used in MCMC obey the LLN
and the CLT.  These are the Markov chain LLN and Markov chain CLT and
are not quite the same as the IID LLN and CLT.  However, they serve the
same purpose.

Suppose $X_1$, $X_2$, $\ldots$ is a Markov chain whose initial
distribution is its equilibrium distribution, and suppose we want to
know an expectation
$$
   \mu = E \{ g(X_i) \}.
$$
Then the law of large numbers (LLN) says
$$
   \hat{\mu}_n = \frac{1}{n} \sum_{i = 1}^n g(X_i)
$$
converges in probability to $\mu$, and the central limit theorem (CLT) says
$$
   \sqrt{n} (\hat{\mu}_n - \mu) \weakto \NormalDis(0, \sigma^2)
$$
where
$$
   \sigma^2 = \var\{ g(X_i) \}
   + 2 \sum_{k = 1}^\infty \cov\{ g(X_i), g(X_{i + k}) \}
$$
which can be estimated by the method of batch means (more on this later).
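To make the scalar recipe above concrete, here is a minimal hand-rolled
random walk Metropolis sampler written directly from the four steps,
rather than calling \texttt{metrop}.  It is a sketch for illustration
only: the standard normal target, the proposal standard deviation
$\tau = 1$, and the run length are arbitrary choices, not part of the
gamma example.
<<eval = FALSE>>=
# hand-rolled scalar random walk Metropolis (illustration only)
h <- function(x) exp(- x^2 / 2)    # unnormalized standard normal density
tau <- 1                           # proposal standard deviation (arbitrary)
niter <- 1e4
chain <- numeric(niter)
chain[1] <- 0
for (n in 1:(niter - 1)) {
    y <- rnorm(1, mean = 0, sd = tau)           # step 1: proposal increment
    r <- h(chain[n] + y) / h(chain[n])          # step 2: density ratio
    u <- runif(1)                               # step 3: uniform variate
    chain[n + 1] <- if (u < r) chain[n] + y else chain[n]   # step 4
}
mean(chain < 1.96)   # should be near pnorm(1.96) if the chain has mixed
@
Averages over such a chain are exactly the $\hat{\mu}_n$ to which the
Markov chain LLN and CLT above apply.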
These results (the Markov chain LLN and CLT) are stated in terms of a
Markov chain started at equilibrium, which usually cannot be done.
However, they also hold regardless of the initial distribution, meaning
$\hat{\mu}_n$ is calculated from a Markov chain started in any
distribution, but $E \{ g(X_i) \}$, $\var\{ g(X_i) \}$, and
$\cov\{ g(X_i), g(X_{i + k}) \}$ refer to a Markov chain with the same
transition probabilities and stationary initial distribution.  Thus MCMC
works just like OMC, except the variance in the CLT is more complicated
and must be estimated differently.

\section{Batch Means}

The Markov chain CLT says that $\hat{\mu}_n$ is approximately normal and
centered at the quantity $\mu$ that we are trying to calculate
approximately.  All we need to do is estimate the variance, which the
CLT says obeys the square root law.  It is of the form $\sigma^2 / n$
for some constant $\sigma^2$, which we don't know.  However, the
estimate $\hat{\mu}_b$ for a subsection of the chain of length $b$
should also be approximately normal with mean $\mu$ and variance
$\sigma^2 / b$, so long as $b$ is large enough.

Suppose $b$ evenly divides $n$ and we form the means
$$
   \hat{\mu}_{b, k} = \frac{1}{b} \sum_{i = b (k - 1) + 1}^{b k} g(X_i)
$$
for $k = 1$, $\ldots$, $m = n / b$.  Each of these ``batch means'' is
approximately normal with mean approximately $\mu$ and variance
approximately $\sigma^2 / b$.  Thus their empirical variance
$$
   \frac{1}{m} \sum_{k = 1}^m (\hat{\mu}_{b, k} - \hat{\mu}_n)^2
$$
estimates $\sigma^2 / b$.  And $b / n$ times this estimates
$\sigma^2 / n$, the asymptotic variance of $\hat{\mu}_n$.  Alternatively,
we can just make a $t$ confidence interval using the $m$ batch means as
data.

\section{Trying it Out}

\subsection{Log Unnormalized Posterior}

We need to define an R function that calculates the log unnormalized
posterior.
<>=
lupost <- function(theta) {
    stopifnot(is.numeric(theta))
    stopifnot(is.finite(theta))
    stopifnot(length(theta) == 2)
    alpha <- theta[1]
    lambda <- theta[2]
    if (alpha <= 0) return(- Inf)
    if (lambda <= 0) return(- Inf)
    logl <- sum(dgamma(x, shape = alpha, rate = lambda, log = TRUE))
    lpri <- (1 / 2) * log(alpha * trigamma(alpha) - 1) - log(lambda)
    return(logl + lpri)
}
@
The only tricky bit is that we define this function to be $- \infty$ off
of the parameter space.  This corresponds to an unnormalized posterior
value of zero, and results in the Metropolis algorithm never taking a
step into this region.

\subsection{Try 1}

For a first try we use the default settings and just look at the
acceptance rate and the running time.
<>=
out <- metrop(lupost, initial = c(1, 1), nbatch = 1e4)
print(out$accept)
print(out$time)
@

\subsection{Try 2}

Now we need to adjust the scale.  We set the variance matrix $\boldM$ in
the Metropolis algorithm to be $\tau^2 \boldI$, where $\boldI$ is the
identity matrix.  Try $\tau = 0.5$.
<>=
out <- metrop(out, scale = 0.5)
print(out$accept)
print(out$time)
@
The idea is to get the acceptance rate (the fraction of iterations in
which $U_n < r$ in step 4 of the Metropolis algorithm, so that
$X_{n + 1} \neq X_n$) to be roughly 20\%.  Now we have done that.

\section{Diagnostics}

\subsection{Time Series Plots}

Figure~\ref{fig:fig1} (page~\pageref{fig:fig1}) shows the time series
plot made by the R statement
<>=
plot(ts(out$batch, names = c("alpha", "lambda")))
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Time series plot of MCMC output.}
\label{fig:fig1}
\end{figure}

What we are looking for is no obvious trend.  If the beginning of the
run looks very different from the end, then the simulated Markov chain
is nowhere near stationarity and we need to run longer.  This looks
o.~k.
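A crude numerical version of the same eyeball check (this is an informal
sketch, not part of the original analysis) is to compare the averages
over the first and second halves of the run stored in \texttt{out\$batch}.
<<eval = FALSE>>=
# informal stationarity check: compare first and second halves of the run
half <- nrow(out$batch) %/% 2
colMeans(out$batch[1:half, ])
colMeans(out$batch[(half + 1):nrow(out$batch), ])
@
Differences of the order of the Monte Carlo standard error are to be
expected; a gross discrepancy would suggest the run is too short.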
\subsection{Autocorrelation Plots}

Figure~\ref{fig:fig2} (page~\pageref{fig:fig2}) shows the autocorrelation
plot made by the R statement
<>=
acf(ts(out$batch, names = c("alpha", "lambda")), lag.max = 50)
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Autocorrelation plot of MCMC output.}
\label{fig:fig2}
\end{figure}

The diagonal plots show the autocorrelation function
$$
   k \mapsto \frac{\cov\{ g(X_i), g(X_{i + k}) \}}{\var\{ g(X_i) \}}
$$
where $g(X_i)$ is the first or second coordinate of the state vector.
The off-diagonal plots show the cross-correlation function
$$
   k \mapsto \frac{\cov\{ g_1(X_i), g_2(X_{i + k}) \}}
   {\sd\{ g_1(X_i) \} \sd\{ g_2(X_i) \}}
$$
where $g_1(X_i)$ is the first coordinate and $g_2(X_i)$ is the second
coordinate of the state vector.

What we are looking for is the lag at which the autocorrelations
decrease to being not significantly different from zero (within the blue
confidence band).  It looks like about lag 40.  We want batches for the
method of batch means larger than that, perhaps much larger.

\section{More Tries}

\subsection{Try 3}

Accordingly, we now do a run with batches of length 200.
<>=
out <- metrop(out, blen = 200, nbatch = 500)
print(out$accept)
print(out$time)
@
From this we can make frequentist confidence intervals for the posterior
means of $\alpha$ and $\lambda$.
<>=
alpha <- out$batch[ , 1]
lambda <- out$batch[ , 2]
t.test(alpha)
t.test(lambda)
@
The confidence intervals are fairly tight.  We are done with our MCMC
calculation if all we wanted to do was to estimate the posterior means.
The package vignette for the \texttt{mcmc} package also illustrates
calculating posterior variances and posterior standard deviations.

\subsection{Try 4}

Let us also do a run without batches, so we can make histograms of the
marginal distributions.
<>=
out4 <- metrop(out, blen = 1, nbatch = 1e5)
print(out4$accept)
print(out4$time)
@

\section{More Diagnostics}

\subsection{Time Series Plots}

Figure~\ref{fig:fig3} (page~\pageref{fig:fig3}) shows the time series
plot made by the R statement
<>=
plot(ts(out$batch, names = c("alpha", "lambda")))
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Time series plot of MCMC output.  Batches of length \Sexpr{out$blen}.}
\label{fig:fig3}
\end{figure}

What we are looking for is no obvious trend.  If the beginning of the
run looks very different from the end, then the simulated Markov chain
is nowhere near stationarity and we need to run longer.  This looks
o.~k.

\subsection{Autocorrelation Plots}

Figure~\ref{fig:fig4} (page~\pageref{fig:fig4}) shows the autocorrelation
plot made by the R statement
<>=
acf(ts(out$batch, names = c("alpha", "lambda")))
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Autocorrelation plot of MCMC output.  Batches of length \Sexpr{out$blen}.}
\label{fig:fig4}
\end{figure}

What we are looking for is that the autocorrelations of the batches are
not significant at all lags other than lag zero (where the
autocorrelation is defined to be one for the diagonal plots).  Looks
o.~k.
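Since the batches look essentially uncorrelated, the batch means standard
errors used by \texttt{t.test} above can be trusted.  For the record,
here is a sketch (not in the original analysis) of how the Monte Carlo
standard errors of the posterior mean estimates can be computed directly
from the batch means in \texttt{out\$batch}; this is the same standard
error that \texttt{t.test} uses.
<<eval = FALSE>>=
# Monte Carlo standard errors from the batch means (same as t.test uses)
colMeans(out$batch)                                # posterior mean estimates
apply(out$batch, 2, sd) / sqrt(nrow(out$batch))    # their Monte Carlo SEs
@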
\section{Marginal Distributions}

\subsection{Histograms}

Figure~\ref{fig:fig5} (page~\pageref{fig:fig5}) shows the histogram made
by the R statement
<>=
alpha <- out4$batch[ , 1]
hist(alpha, freq = FALSE)
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Histogram of marginal posterior for parameter $\alpha$.}
\label{fig:fig5}
\end{figure}

Figure~\ref{fig:fig6} (page~\pageref{fig:fig6}) shows the histogram made
by the R statement
<>=
lambda <- out4$batch[ , 2]
hist(lambda, freq = FALSE)
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Histogram of marginal posterior for parameter $\lambda$.}
\label{fig:fig6}
\end{figure}

\subsection{Smooth Density Plots}

Figure~\ref{fig:fig7} (page~\pageref{fig:fig7}) shows the smooth density
plot made by the R statement
<>=
out <- density(alpha)
plot(out)
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Smooth density plot of marginal posterior for parameter $\alpha$.}
\label{fig:fig7}
\end{figure}

Figure~\ref{fig:fig8} (page~\pageref{fig:fig8}) shows the smooth density
plot made by the R statement
<>=
out <- density(lambda)
plot(out)
@
\begin{figure}
\begin{center}
<>=
<>
@
\end{center}
\caption{Smooth density plot of marginal posterior for parameter $\lambda$.}
\label{fig:fig8}
\end{figure}

\end{document}