On the Bogosity of MCMC Diagnostics


A few years ago I wrote a couple of web pages (about one long run and burn-in) that were an attempt to clarify some of the issues about so-called MCMC diagnostics. But I must admit that those pages do not address the issue directly. This page does.

Contents

Why are you so hard on MCMC diagnostics?

Isn't bogosity, despite its humor, a bit strong? We know this is ha ha, only serious.

And aren't you being inconsistent? Where is your rant about regression diagnostics?

And why are you the only one out of step? If everyone else likes MCMC diagnostics, despite their problems, what's the matter with you?

A Digression about Regression Diagnostics

I don't mind regression diagnostics. I've never taught a regression course, but I do recommend simple regression diagnostics (plot of residuals versus fitted values, Q-Q plot of residuals) in intro stats.

But I also tell the students about their limited usefulness.

Regression diagnostics don't even claim to reliably diagnose all problems. Whatever diagnostic you use will miss some problems. That's why leave-k-out diagnostics and other complicated diagnostics are interesting.

So what I tell students in intro stats is that the purpose of diagnostics is not to find all problems.

The purpose of regression diagnostics is to find obvious, gross, embarrassing problems that jump out of simple plots.

Back in the dinosaur era, when plots were made out of lots of asterisks on ugly green-striped fanfold paper and there wasn't any graphics software (you made plots with FORTRAN print statements), people did lots of ridiculously bogus regressions because they couldn't see how bogus they were.

Nowadays, there is no excuse for that. The diagnostics only take seconds to do.

But here is an interesting question to ponder. How bad does heteroscedasticity have to be before you can diagnose it? This depends on the sample size, so say the sample size is 100. My answer, derived from student performance on exam questions, is that you need a factor of 3 (at least!) in error standard deviation from one side of the plot to the other; a factor of 2 just isn't enough.

Let's try it (courtesy of rweb.stat.umn.edu/Rweb). Click the Submit button to see an example. (This only works if you are a student, faculty, or staff at the University of Minnesota. If you are not, then you have to cut and paste the R commands into R on your own computer.)
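The Rweb form itself is not reproduced here, but here is a sketch of the kind of thing it does (the particular regression function and numbers are made up for illustration): the error standard deviation grows by a factor of het from one end of the predictor range to the other, and we look at the plot of residuals versus fitted values.

n <- 100
het <- 3                       # try 2 or 1.5 and see if you can still tell

x <- seq(0, 1, length = n)
sigma <- 1 + (het - 1) * x     # error sd goes from 1 up to het
y <- 3 + 2 * x + rnorm(n, sd = sigma)

out <- lm(y ~ x)
plot(fitted(out), residuals(out),
    xlab = "fitted values", ylab = "residuals")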

Not so easy to see even with het <- 3. Try changing het to lower values (just edit the text in the web form and resubmit).

Back to MCMC Diagnostics

Seat-of-the-Pants Diagnostics

If MCMC diagnostics were similar, with similarly limited claims, there would be nothing to object to.

I don't mind seat-of-the-pants MCMC diagnostics, such as time series plots, acf (autocorrelation function) plots, or Q-Q plots of batch means.

I've always used these myself, and I recommend them to students; for example, see the package vignette from my MCMC package for R.
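For concreteness, here is a sketch of those plots applied to output of the metrop function in the mcmc package (this is not the vignette's example; the toy target, a standard normal, is made up for illustration):

library(mcmc)

# toy log unnormalized density (standard normal), purely for illustration
lupost <- function(x) - sum(x^2) / 2

out <- metrop(lupost, initial = 0, nbatch = 100, blen = 100)

plot(ts(out$batch))   # time series plot of the batch means
acf(out$batch)        # autocorrelation function plot
qqnorm(out$batch)     # Q-Q plot of the batch means
qqline(out$batch)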

There's no problem with these so long as it is understood that

these diagnostics only find obvious, gross, embarrassing problems that jump out of simple plots.

They are worthless for finding subtle problems.

MCMC as a Black Box

Consider MCMC as a black box (see Wikipedia and Webopedia entries). We have software that runs a Markov chain having a specified equilibrium distribution. We don't know anything other than that.

This may seem extreme. How often do you know absolutely nothing about your MCMC algorithm and its equilibrium distribution?

On the other hand, how many users of MCMC ever use theorems about MCMC convergence? The user may know something, but not enough to mathematically prove anything.

Thus the black box view is not extreme. It reflects the situation most MCMC users find themselves in.

Now what MCMC theory applies to black box MCMC? None of it!

And what MCMC diagnostics are useful? None of them!

The reason is obvious. Suppose there is an event B having high probability under the equilibrium distribution, and also suppose there exist bad starting points from which it takes the MCMC sampler software a very long time to reach B (say longer than the age of the universe even if Moore's Law continues to hold until then). What chance do you have to diagnose this?

None whatsoever, unless you can somehow guarantee that you start at a good starting point. But this is precisely what the black box view assumes you cannot do!
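To make the point concrete, here is a made-up example (not a real application): a target that is an equal mixture of two unit-variance normals with means 0 and 100, sampled by random-walk Metropolis (again metrop from the mcmc package) started in the mode at 0. For all practical purposes the chain never finds the other mode, and every seat-of-the-pants diagnostic looks fine.

library(mcmc)

# made-up target: equal mixture of N(0, 1) and N(100, 1)
lupost <- function(x) log(exp(- x^2 / 2) + exp(- (x - 100)^2 / 2))

out <- metrop(lupost, initial = 0, nbatch = 1e4, scale = 1)

range(out$batch)      # never gets anywhere near the mode at 100
plot(ts(out$batch))   # looks like a perfectly healthy sampler
acf(out$batch)        # nothing alarming here either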

Is MCMC like Regression?

In a word, no.

To go off on a somewhat unrelated rant, MCMC isn't even statistics, it is a tool. It calculates (approximates, estimates, whatever) by Monte Carlo the probabilities and expectations that you cannot do analytically (either with pencil and paper or with a computer algebra system). The problems it is applied to need have nothing to do with probability and statistics.

The empirical analysis of the Markov chain, as in the package vignette from my MCMC package for R, does involve statistics. But such analyses come with no more solid guarantee than diagnostics. If your chain works, then the empirical analysis gives accurate Monte Carlo standard errors. If your chain doesn't work, then the empirical analysis is garbage in, garbage out.
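For example, here is a minimal sketch of a batch means calculation like the one in the vignette (the toy target is again made up for illustration); this is the whole empirical analysis, and it is only trustworthy if the chain actually works:

library(mcmc)

# toy target (standard normal), purely for illustration
lupost <- function(x) - sum(x^2) / 2
out <- metrop(lupost, initial = 0, nbatch = 100, blen = 100)

mu.hat <- mean(out$batch)                       # Monte Carlo estimate of the expectation
mcse <- sd(out$batch) / sqrt(nrow(out$batch))   # batch means Monte Carlo standard error
mu.hat + c(-1, 1) * qnorm(0.975) * mcse         # asymptotic 95 percent confidence interval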

So MCMC has something to do with statistics, sort of, but not really. Fundamentally, it has nothing to do with statistics.

You have an expectation you want to calculate. It is a well-defined number, no more random than ∫₀¹ x³ dx.
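For example (ordinary independent-sample Monte Carlo here, no Markov chain needed for anything this easy), ∫₀¹ x³ dx = 1/4 exactly, and a Monte Carlo approximation of it is

x <- runif(1e6)                # ordinary Monte Carlo sample
mean(x^3)                      # approximates the integral, about 0.25
sd(x^3) / sqrt(length(x))      # its Monte Carlo standard error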

If you think like a statistician about MCMC, this expectation, call it θ, is an unknown quantity, so call it a parameter, and your MCMC answer is a statistical point estimate of this parameter.

And in this way of thinking MCMC is exactly like regression. If you get an unlucky sample, there's nothing you can do. Better luck next time!

But nobody, not even statisticians, actually thinks about MCMC this way! No one is satisfied with better luck next time. For one thing, better luck may take longer than the age of the universe to happen. And for another thing, most people don't think of the expectation you want to calculate as an unknown parameter and the MCMC sample as a given, so that if the sample is unlucky there is no remedy.

If Worried about Convergence, Get a Better Sampler

Not only is the sample not given, neither is the sampler. There are zillions of samplers with the same equilibrium distribution.

My favorite way to improve samplers is simulated tempering. But there are lots of other ways to improve samplers. If you haven't tried hard to improve your sampler, then you can't expect any sympathy about your convergence problems.
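Simulated tempering itself is too involved to sketch in a few lines, but the idea it exploits (that a heated, flattened version of the target mixes far better than the target itself) can be illustrated with the made-up two-mode target from above. Of course the hot chain by itself has the wrong equilibrium distribution; simulated tempering runs hot and cold components within one chain so that the cold component still has the right one.

library(mcmc)

# the made-up two-mode target again, written to avoid underflow
lupost <- function(x) {
    a <- - x^2 / 2
    b <- - (x - 100)^2 / 2
    m <- max(a, b)
    m + log(exp(a - m) + exp(b - m))
}

# the same target "heated": log unnormalized density divided by a temperature
temp <- 1000
lupost.hot <- function(x) lupost(x) / temp

out.cold <- metrop(lupost, initial = 0, nbatch = 1e4, scale = 5)
out.hot <- metrop(lupost.hot, initial = 0, nbatch = 1e4, scale = 5)

range(out.cold$batch)   # stuck in the mode at 0
range(out.hot$batch)    # visits both modes and everything in between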

One Long Run

But after you have tried hard to improve your sampler, after you have the best sampler you can devise, what then?

In the black box view, all samplers with the same equilibrium distribution are exactly alike!

But there is one obvious consequence of the black box view:

To find out anything, you have to run the sampler! The longer the run the better!

If you don't know any good starting points (and the black box view assumes you don't), then restarting the sampler at many bad starting points is (as we used to say in the sixties) part of the problem, not part of the solution.

And this issue is not merely theoretical. People who have done really hard problems with MCMC and have worked really hard on validation (worrying not only about convergence but also about code correctness) have stories where problems didn't show up except in a really long run taking weeks of computer time.

It is a sad fact about scholarly literature that it is foolishly optimistic. Everything must be given highly positive spin. If it isn't the referees will stomp all over it. Thus the literature has a file drawer problem much larger than is generally recognized, extending far beyond P > 0.05.

That is why horror stories about weeks of MCMC running being necessary to diagnose problems do not appear in the literature.

Toy Problems Teach Bad Habits

What does appear in the literature (even in my own papers) are toy problems. Many statisticians find this term offensive. They take great pride in having real problems as examples. But these real data turn out to be very small, with only a few variables, and to be only a small subset of the data originally collected; and the questions addressed by the analysis turn out to have nothing to do with the actual scientific (business, whatever) questions the data were collected to address.

I sometimes call this Pooh-Bah data after the character in Gilbert and Sullivan's Mikado who has the line

Merely corroborative detail, intended to give artistic verisimilitude to an otherwise bald and unconvincing narrative.

I understand how hard it is to do justice to real data in a statistical paper or textbook. Neither students nor referees nor other readers have any patience with it. Often the only thing that can be learned from real data is that it is very complicated, too messy for any simple analysis to work.

So we use toy data instead.

But understandable as it may be, this use of toy data teaches some very bad habits.

It's hard to know what lessons to learn from toy examples.

When are toy data too simplistic? When have they been chosen (consciously or unconsciously) to avoid problematic features of the method being illustrated? Does the method (consciously or unconsciously) use features of the toy problem that are not analogous to real applications?

Honest Cheating

I coined the term honest cheating for statistical cheating that is done right out in the open with nothing hidden from the reader, so that by the canons of scientific publication it is completely honest. The classic example is multiple testing without correction. It's bogus, but knowledgeable readers are given enough information to see exactly how bogus it is and dismiss the claims of the paper. Naive readers are fooled.
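A toy illustration of the multiple testing version (made up here, not taken from any particular paper): do twenty tests of null hypotheses that are all true, apply no correction, and report the smallest P-value as if it were the only test done.

# twenty t tests of a true null hypothesis (mean zero), no correction
pvals <- replicate(20, t.test(rnorm(30))$p.value)

min(pvals)          # report this alone and it looks like a discovery
sum(pvals < 0.05)   # about one "significant" result is expected by chance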

Similar honest cheating goes on in the MCMC diagnostics literature.

It's bogus, but knowledgeable readers are given enough information to see exactly how bogus it is and dismiss the claims of the paper. Naive readers are fooled.

Conclusions

So I'm not really saying anything so different from what the other MCMC experts say (a bit ruder perhaps).


Last modified: October 15, 2012 (fixed broken links).

Last modified before that: January 8, 2006.