Chapter 10 Basic Bayesian Procedures
The key philosophical aspect of Bayesian procedures is that we express our prior knowledge about parameters and other unknowns using prior (probability) distributions and update those distributions to posterior (probability) distributions after observing the data. The key practical difference with Bayesian procedures is that there are generally no simple computations. Instead, we get approximate samples from the posterior distributions via Monte Carlo simulation, and then make statements about the posteriors based on the approximate samples.
There are several tools for doing the Monte Carlo in R, including `jags` (which does Gibbs sampling), `rstan` (which does Hamiltonian sampling), `brms`, and others. We will illustrate using the `bglmm()` function in `bcfcdae`, which is a front end for `rstan`.
We will demonstrate `bglmm()` using the data from the `RunStitch` example in `cfcdae`, which we have used before. Each worker run stitches collars using two different setups: the conventional setup and an ergonomic setup. The two runs are made in random order for each worker, and the interest is in any difference in average speed between the two setups.
Load the runstitch data from the package and create differences between Standard and Ergonomic.
> data(RunStitch)
> differences <- RunStitch[,"Standard"] - RunStitch[,"Ergonomic"]
10.1 Prior Distributions
General Bayesian packages such as `rstan` and `jags` allow you to set up practically arbitrary prior distributions for parameters. `bglmm()` has a much narrower set of priors available, but they still allow us to accomplish a lot in the context of the models for which `bglmm()` was designed.
10.1.1 Gamma distributions
A gamma distribution is a common distribution for positive
random variables; for example, a chi-squared distribution
is a special case of a gamma.
`bglmm()` parameterizes gamma distributions via the expected value \(\tau\) and shape parameter \(\nu\). A large value of the shape parameter means that you believe that the random variable is near \(\tau\), whereas a small shape value indicates uncertainty.
> shape <- c(1, 1.5, 2:10, 15, 20, 30, 50, 100, 200, 400, 1000, 20000)
> ratio <- qgamma(.995, shape) / qgamma(.005, shape)
> data.frame(shape, ratio)
shape ratio
1 1.0 1057.012101
2 1.5 178.999426
3 2.0 71.792473
4 3.0 27.448349
5 4.0 16.330513
6 5.0 11.683607
7 6.0 9.206618
8 7.0 7.686343
9 8.0 6.663908
10 9.0 5.930983
11 10.0 5.380372
12 15.0 3.893019
13 20.0 3.224391
14 30.0 2.587675
15 50.0 2.081903
16 100.0 1.676711
17 200.0 1.440322
18 400.0 1.294069
19 1000.0 1.176992
20 20000.0 1.037100
With a shape of just 1, the plausible range of values for the random variable extends over three orders of magnitude. By the time you get to shape 5, the range is down to just one order of magnitude. You need a huge shape parameter to indicate that the random variable must be very near \(\tau\).
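To connect this mean-and-shape parameterization to R's built-in shape-and-rate parameterization (a standard identity, not anything specific to `bglmm()`): a gamma with mean \(\tau\) and shape \(\nu\) has rate \(\nu/\tau\).

```r
# A gamma with mean tau and shape nu has rate nu / tau,
# so draws made this way should average out near tau.
tau <- 2; nu <- 5
x <- rgamma(100000, shape = nu, rate = nu / tau)
mean(x)   # close to tau = 2
```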
10.1.2 Priors for terms in the linear predictor
Terms in the linear predictor for a model could be group means, differences of group means, regression slopes, or other parameters depending on the complexity of the model. The prior for these parameters is normal with mean zero and standard deviation \(\sigma_i\) for parameters in the ith term of the model. Parameters from different terms are assumed to be independent. Parameters within the same term might not be independent.
Note that these priors are all centered at zero. This is sensible when the prior presumption is that the term/effect/coefficient is most likely to be small, but need not be small. For example, we might have a prior belief that the difference between treatment means might be near zero. However, this is often a dubious assumption for a parameter that describes the overall mean of the data. For parameters like that, the standard deviation of the prior distribution might need to be very large to be reasonable.
10.1.3 Priors for standard deviations
Many models assume that responses are normally distributed with a certain set of means and a certain standard deviation (sometimes more than one standard deviation). Call this standard deviation \(\sigma_0\). Robust alternatives to normal distributions have an analogous scaling factor. `bglmm()` uses a gamma prior distribution for \(\sigma_0\), with shape parameter `sigma0.shape` and expected value `sigma0.mean`. A large value of `sigma0.shape` means that you are very certain that \(\sigma_0\) is near `sigma0.mean`, whereas a small value indicates uncertainty.
Parameters in the i-th term of the linear predictor are assumed to be normal with mean zero and some standard deviation \(\sigma_i\). We usually don’t know \(\sigma_i\), so we put a second-level prior on \(\sigma_i\).
By default, `bglmm()` assumes that \(\sigma_i\) follows a gamma distribution with expected value `sigma.mean` and shape parameter `sigma.shape`. You may specify `sigma.mean` and `sigma.shape` as vectors, with one value for each term in the model, or you may specify them as scalars, in which case the same mean and shape are used for all the terms. (What really happens is that any values you specify get recycled enough times to fill out the full vectors of means and shapes.)
If you do not specify the shape parameter for a gamma prior on a standard deviation, `bglmm()` uses 2, which is rather dispersed. If you do not specify the prior expected value for a standard deviation, `bglmm()` defaults to a value that is based on a frequentist fit (definitely not a very Bayesian thing to do).
An alternative is available for the \(\sigma_i\) when `usecauchy=TRUE` is used as an argument to `bglmm()`.
In that case, we assume that \(\log(\sigma_i/\sigma_0)\)
follows a standard Cauchy distribution, independently across
the different \(\sigma_i\) parameters. This prior encourages shrinkage
to zero when there is relatively little evidence that the
parameters are non-zero, and it does little shrinkage toward zero
when there is evidence that the parameters are far from zero.
This has proven to be fairly unstable numerically in practice.
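The heavy tails of the Cauchy are what allow an occasional \(\sigma_i\) to sit far from \(\sigma_0\) while the rest shrink toward it; a quick quantile comparison in plain R (independent of `bglmm()`) shows how much farther out the Cauchy reaches than the normal:

```r
# 99.5th percentiles of the standard normal and standard Cauchy:
qnorm(.995)     # about 2.58
qcauchy(.995)   # about 63.66
```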
10.2 Other `bglmm()` arguments
There is only one required argument for `bglmm()`, and that is the formula for the model. There are (hopefully reasonable) defaults for all other arguments.
As with `lm()`, there are `data=` and `subset=` arguments that let you tell R where to look for data and which data to use.
`bglmm()` uses the `rstan` package to do Markov chain Monte Carlo. There are several optional arguments that control the MCMC. Here are some that might be useful now; others can wait.
- `n.samples` says how long each Markov chain should be.
- `n.chains` says how many parallel chains to run.
- `warmup` says how many of the `n.samples` should be discarded as warm up runs.
- `thin` controls how the MCMC results are retained. For example, `thin=5` retains every fifth value.
- `quiet=TRUE` suppresses the printing of progress information.
- `adapt_delta` and `max_treedepth` control how the MCMC is run. You may receive diagnostic information telling you to increase their values to improve the MCMC run.
- `seed` is an optional integer seed for the random number generator. You need this to make your results repeatable.
Examples
Suppose that we have the `differences` for paired data, and we just want to fit the mean. Using the defaults and the gamma priors gives us:
library(cfcdae)
library(bcfcdae)
library(rstan)
fit <- bglmm(differences ~ 1)
If we wanted to assume that the prior mean was .5, we could do:
fit <- bglmm(differences-.5 ~ 1)
This would tell us how much the mean deviates from .5. That is, you would need to add .5 back into the posterior values to get the estimate of the mean with the prior mean being .5.
If we wanted to specify that the standard deviation (around 0) of the prior distribution for the mean was probably around .1, but we were not really certain, we could do:
fit <- bglmm(differences-.5 ~ 1, sigma.mean=.1, sigma.shape=20)
If we were really sure that the prior standard deviation should be .1, we could do:
fit <- bglmm(differences-.5 ~ 1, sigma.mean=.1,sigma.shape=2000)
Suppose that we had the notched boards data and wished to estimate the difference in means in the notched boards data. Using defaults, we would do
fit <- bglmm(strength ~ shape,data=NotchedBoards)
This model fits the overall mean and the deviation from the overall mean for each shape.
If we wanted to use a large standard deviation for the intercept (overall mean) parameter and a smaller one for the difference, we could use:
fit <- bglmm(strength ~ shape,data=NotchedBoards,sigma.mean.int=300)
If we wanted to fit the means individually we could do:
fit.notch.indiv <- bglmm(strength ~ shape-1,data=NotchedBoards,
seed=10001, # for repeatability
quiet=TRUE, # to suppress progress information
sigma.mean=300,sigma.shape=3)
10.3 Compilation
One drawback of using bglmm()
is that R will need to compile the model specification
into an executable format. It only needs to do that once for each type of model,
but it will need to do it every time you start a new R session. This has two
regrettable consequences:
- It’s slow.
- You need to have the right software or packages to do the compilation. R will try to download what is needed, but student experience has been that this is not always a failure-free process.
10.4 Looking at the results
There are both numerical and graphical summaries available. Begin with the `summary()` function. For the output of `bglmm()`, `summary()` provides a matrix with ten columns and one row for each parameter. The parameters include the obvious parameters in the model such as \(\sigma_0\) and the parameters in the model for the mean, but also included are the prior standard deviations of the parameters in the model for the mean. For example, for the fit `bglmm(strength~shape-1)` the parameters are the means for the two levels of shape (called `shape1` and `shape2`), `sigma0`, and `sigma.shape`.
Here is the summary:
> summary(fit.notch.indiv)
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
shape1 289.0 0.270 17.60 255.0 277.0 289.0 300.0 323 4280 1
shape2 296.0 0.311 17.70 261.0 285.0 296.0 307.0 331 3250 1
sigma0 63.6 0.152 9.33 48.3 56.9 62.5 69.1 85 3750 1
sigma.shape 330.0 2.300 126.00 155.0 239.0 305.0 393.0 639 3000 1
The first column of the summary is the estimated posterior mean (for all intents and purposes, this is the usual Bayesian estimate) and the third column is the estimated posterior standard deviation. The second column (labeled `se_mean`) is a numerical assessment of how accurately we have computed the posterior mean; it is usually not of much direct interest.
Summary columns 4 through 8 give percentiles of the posterior distribution for each parameter; these are the percentiles that you use to form a posterior credible interval (the analog of a confidence interval). By default, the 2.5, 25, 50, 75, and 97.5 percentiles are reported. You can change those by including a `probs` argument, for example, `probs=c(.005,.025,.975,.995)`.

You can also choose which parameters to summarize by including a `pars` argument, for example, `pars=c("shape1","shape2","sigma0")`.
Plotting a `bglmm()` fit gives two plots: the Pearson residuals against fitted values, and a normal probability plot of the Pearson residuals.
> par(mfrow=c(1,2))
> plot(fit.notch.indiv)

Figure 10.1: Residual plots for notched boards
> par(mfrow=c(1,1))
> plot(fit.notch.indiv,"histogram",pars=c("shape1","shape2"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 10.2: Histograms of posterior densities
> plot(fit.notch.indiv,"interval",pars=c("shape1","shape2"))
ci_level: 0.8 (80% intervals)
outer_level: 0.95 (95% intervals)

Figure 10.3: Interval estimates of shape means
> plot(fit.notch.indiv,"pointwise")

Figure 10.4: Predictive distributions for data
NULL
An outlier would be out in the tail of its predictive distribution.
10.5 Assessing the MCMC
An MCMC works well when there is little correlation between successive samples and when the Markov chain fully explores the posterior distribution. It would be nice if MCMC always worked well, but problems are not infrequent. Here are some good, but not foolproof, ways to check your fit. If one of these says that the chain is behaving badly, then it is behaving badly. However, a chain can be behaving badly without these methods catching the problem.
The last two columns of the summary of a `bglmm()` fit provide some assessment. `n_eff` is the effective sample size; here, bigger is better. By default, `bglmm()` runs four chains of 2,000 iterations each, with the first half discarded. Thus the default reported sample size is 4,000. If there is a lot of correlation in the iterations, the effective sample size will be much smaller.
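The default count works out as follows (plain arithmetic, shown in R for concreteness):

```r
# Four chains of 2000 iterations, first half of each discarded as warmup:
n.chains <- 4; n.samples <- 2000; warmup <- n.samples / 2
n.chains * (n.samples - warmup)   # 4000 retained draws
```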
`Rhat` is a non-convergence diagnostic. (This is in some dispute, but the builders of `rstan`, which `bglmm()` uses, like it and use it.) If `Rhat` is big, say bigger than 1.1 or 1.2, then you know that the MCMC is not mixing well, and the estimates you get are not very reliable. Sadly, an `Rhat` down near 1 does not mean that the MCMC is mixing well; it just means that you have not discovered it mixing poorly.
You may need to load the `rstan` package via `library(rstan)` prior to making some of these plots.
> plot(fit.notch.indiv,"autocor",pars=c("shape1","shape2"))

Figure 10.5: Autocorrelation for shape means
Here the autocorrelations drop to 0 quickly.
We can also plot the iterates for a parameter for each chain as a trace, and then overlay the traces. Ideally, these plots should look like featureless noise, with no patterns over time and no separation between the multiple chains.

> plot(fit.notch.indiv,"trace",pars=c("shape1","shape2"))

Figure 10.6: Trace plots for shape means
These trace plots look fine.
10.6 Divergent transitions
One problem that the MCMC can encounter is “divergent transitions.” This is an ominous warning message that you sometimes receive after running `bglmm()`. Divergent transitions do not always signify a problem, but they are not to be encouraged.
Here are three things you can do to try to reduce divergent transitions:
- Increase `adapt_delta` above its default value of .8. Anything below 1 is allowable. Thus you might try adding the argument `adapt_delta=.95` or higher.
- Increase `max_treedepth` above its default value of 10. Thus you might try adding the argument `max_treedepth=15`.
- Try a tighter prior distribution. This might involve setting `sigma.shape` to a larger value, for example `sigma.shape=5`.
One other thing to note is that divergent transitions can arise when the model you’re using simply doesn’t fit the data very well. This could happen when you put a tight prior near 0 on a parameter that the data would suggest are rather farther from 0.
10.7 Comparing Bayesian fits
Suppose that we have two Bayesian models for data and we want to compare them with the idea of choosing the better model. We have two approaches. The first computes an information criterion for each model and selects the model with the lower IC. We will use the Leave One Out Information Criterion, or `loo` for short. The second approach computes the Bayes factor of model 1 to model 2. Factors greater than 1 favor model 1. The Bayes factor is fairly difficult to compute, and we typically only get an approximation. `loo` is easier to compute, but still non-trivial.
> fit.single <- bglmm(strength ~ 1,data=NotchedBoards,
+ seed=10003, # for repeatability
+ sigma.mean=300,sigma.shape=3,quiet=TRUE)
Now we use the `loo()` function in the `loo` package to compute the criteria:
> loo::loo(fit.notch.indiv)
[1] 291.7324
> loo::loo(fit.single)
[1] 289.7221
`fit.single` has the lower value, so it is preferred, but it is only lower by 2, which gives a reasonable, but not overwhelming, preference.
> bayes_factor(fit.single,fit.notch.indiv)
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Estimated Bayes factor in favor of bridge1 over bridge2: 19.88565
The Bayes factor prefers the model with a single mean. A Bayes factor in the range of 3 to 20 is considered a positive preference for `fit.single`, and 19.9 is very close to being a strong preference.