R has many functions for simulating random variables with specified distributions. We will just use these without explaining how they work. The algorithms they use are very sophisticated and complicated; anything simple enough to explain is too simple to use. So we will just take these functions as given.
function | help page |
.Random.seed | Random Number Generation |
RNG | Random Number Generation |
RNGkind | Random Number Generation |
set.seed | Random Number Generation |
rbeta | The Beta Distribution |
rbinom | The Binomial Distribution |
rcauchy | The Cauchy Distribution |
rchisq | The (non-central) Chi-Squared Distribution |
rexp | The Exponential Distribution |
rf | The F Distribution |
rgamma | The Gamma Distribution |
rgeom | The Geometric Distribution |
rhyper | The Hypergeometric Distribution |
rlnorm | The Log Normal Distribution |
rlogis | The Logistic Distribution |
rnbinom | The Negative Binomial Distribution |
rnorm | The Normal Distribution |
rpois | The Poisson Distribution |
rsignrank | Distribution of the Wilcoxon Signed Rank Statistic |
rt | The Student t Distribution |
runif | The Uniform Distribution |
rweibull | The Weibull Distribution |
rwilcox | Distribution of the Wilcoxon Rank Sum Statistic |
mvrnorm | Simulate from a Multivariate Normal Distribution |
sample | Random Samples and Permutations |
Just a few comments.

The object .Random.seed and the R functions RNGkind and set.seed (all documented under the Random Number Generation help topic, RNG) are used for control of the random number generator system as a whole. The only ones of these we will use are .Random.seed and set.seed, which tell the random number generator system where to start and so guarantee repeatability. For example, set.seed(1234) produces a valid state of the random number generator, and every time set.seed(1234) is executed the random number generator is reset to exactly the same state.

To save the current state of the random number generator (which may be very complicated), do

saveseed <- .Random.seed

(note that the name .Random.seed begins with a dot) and then later, to restore the state, do

.Random.seed <- saveseed
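For instance, a minimal sketch of how these are used (the seed 1234 and the particular r functions here are just illustrative choices):

set.seed(1234)             # put the generator in a known state
x1 <- runif(5)
set.seed(1234)             # reset to exactly the same state
x2 <- runif(5)
identical(x1, x2)          # TRUE: same seed, same random numbers

saveseed <- .Random.seed   # save the current state
y1 <- rnorm(3)
.Random.seed <- saveseed   # restore the saved state
y2 <- rnorm(3)
identical(y1, y2)          # TRUE: restored state, same random numbers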
The functions in the table whose names begin with the letter r generate independent, identically distributed random samples from the specified distribution. Or, for the fussy, we say that these computer-generated numbers, if not truly philosophically random, behave as if they are according to all statistical tests.
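They all follow the same calling convention: the first argument is the number of random variates wanted, and the remaining arguments are the parameters of the distribution. A couple of illustrative calls (the particular numbers are made up):

rnorm(5)                           # five standard normal random variates
rnorm(5, mean = 10, sd = 2)        # five normal random variates, mean 10, s.d. 2
rbinom(5, size = 20, prob = 0.3)   # five binomial(20, 0.3) random variates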
The function mvrnorm generates random samples from multivariate normal distributions. To follow the convention its name should begin with the letter r but, because it was written by someone who doesn't care about consistency of function names, it doesn't. Note that this function is part of the MASS library, so you must say library(MASS) before using it.
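For example (a made-up bivariate illustration):

library(MASS)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)   # covariance matrix
x <- mvrnorm(100, mu, Sigma)                   # 100 rows, one per random vector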
The function sample generates random samples from finite populations, both with and without replacement, and also generates independent and identically distributed random variates from any distribution on a finite sample space.
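For example (again, made-up illustrations):

sample(1:10, 3)                     # 3 draws without replacement from 1, ..., 10
sample(1:10)                        # a random permutation of 1, ..., 10
sample(c(0, 1), 20, replace = TRUE, prob = c(0.7, 0.3))
                                    # i.i.d. draws from a distribution on {0, 1}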
Example 11.1.1 in DeGroot and Schervish looks at the mean square error (MSE) of an estimator. Since the population distribution is symmetric and the estimator is unbiased, MSE is the same thing as variance in this example.
The R statements below do nsim simulations of data of sample size n from the Cauchy distribution (using the function rcauchy) and for each simulation store the estimator (calculated by the function median).
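A minimal sketch of such a simulation, assuming n = 10 and nsim = 10000 (the values referred to in the discussion below):

n <- 10          # data sample size
nsim <- 10000    # Monte Carlo sample size

theta.hat <- double(nsim)
for (i in 1:nsim) {
    x <- rcauchy(n)              # one sample of size n from the standard Cauchy
    theta.hat[i] <- median(x)    # the estimator for that sample
}

mean(theta.hat^2)    # Monte Carlo estimate of the MSE (the true parameter is zero)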
The interpretation couldn't be simpler. The vector theta.hat
contains independent and identically distributed random variates of the
estimator under investigation (sample median of a Cauchy distribution).
The statement
mean(theta.hat^2)
calculates (more precisely, estimates) the MSE. We use this instead
of var(theta.hat)
because we know the true value of the
parameter: the distribution of the simulated data x
is
the standard Cauchy distribution centered at zero.
This "knowing the truth" is the only way simulation differs from doing statistics. We actually know the truth, but pretend we don't, and estimate it by theta.hat, over and over, thus learning about the sampling error of the estimator.
It's always a good idea to look at a histogram (or some other picture)
of the samples, in case something funny is going on. Here we see the
sampling distribution of the sample median is symmetric and unimodal,
but isn't very well approximated by the normal distribution having
variance mean(theta.hat^2). The problem isn't so much that
the distribution isn't normal, it's that the distribution is heavy tailed
and variance isn't a very good parameter to use.
The other normal distribution (red) is the theoretical asymptotic distribution
of the estimator. That fits pretty well, even though n = 10 isn't really large.
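One way to draw such a picture (the breaks and xlim values are arbitrary choices, and the red curve uses the standard asymptotic variance formula pi^2 / (4 * n) for the median of a standard Cauchy sample):

hist(theta.hat, probability = TRUE, breaks = 100, xlim = c(-3, 3))
# normal density with variance equal to the estimated MSE
curve(dnorm(x, sd = sqrt(mean(theta.hat^2))), add = TRUE)
# theoretical asymptotic distribution of the sample median (red)
curve(dnorm(x, sd = sqrt(pi^2 / (4 * n))), add = TRUE, col = "red")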
The really annoying thing about simulation
is that it doesn't produce exact
answers. Worse, the inexact answers are random, so you don't get the same
answer every time (unless you set the random number generator seed
to be the same before each run).
For example, five different repetitions of the preceding example give
0.3402748 0.3409837 0.3419415 0.3388291 0.3517652
Most annoying. How do you report this stuff?
The really nice thing about simulation (if you already know statistics) is that the theory of simulation is just statistics. You already know all the theory you need to know to understand simulation.
All you have to do is keep the different levels
of randomness straight.
Here we have two levels.

- At the lower level, nsim samples x, each of size n, from the Cauchy distribution.
- At the upper level, one sample theta.hat of size nsim from the sampling distribution of the sample median of a sample of size n from a standard Cauchy distribution.
For that reason, it is very important to keep the levels clear, to be very specific about "Which mean do you mean?" as DeGroot and Schervish put the issue in a section heading on p. 702.
We will use some common terminology to keep the levels straight.

The first level, where the x are the samples, we will call sampling, as this mimics what we do in statistics. The only difference is that in statistics we have only one sample x (from the unknown population distribution). In simulation, we repeat the statistical situation over and over, getting nsim samples (from the known, but pretend unknown, population distribution).
The second level, where theta.hat
is the sample, we will call
Monte Carlo
. This is a cutesy name that has been used so long
(a half-century) that the joke is forgotten and people use it with a straight
face. The city of Monte Carlo in the small country of Monaco on the
Mediterranean Sea has the most famous gambling casino in the world.
Gambling is random, and some now forgotten someone decided to use it
as a cute name for simulation. Nowadays, doing Monte Carlo
is just another term for doing simulation
, and Monte Carlo error
is just another term for simulation error
.
Now here's the key idea of simulation or Monte Carlo, whatever you want to call it. We could get an idea about the Monte Carlo error two ways.

- The dumb way: repeat the whole simulation many times (say, with yet another for loop) and see what the distribution of the resulting numbers is. Our table of the five Monte Carlo estimates above is a start on this dumb method.
- The smart way: realize that the numbers theta.hat are a random sample of size nsim from an unknown distribution (the sampling distribution of the estimator) and that the mean of the distribution of theta.hat^2 is the exact answer we are trying to calculate.
If we think of that exact answer as an unknown parameter, then the usual theory about a confidence interval for a sample mean tells us that either
mean(theta.hat^2) + c(-1, 1) * qnorm(0.975) * sd(theta.hat^2) / sqrt(nsim)
or, more simply,
t.test(theta.hat^2)
gives us a 95% confidence interval for the unknown exact answer.
The only tricky thing, when we use the first form, is that we must remember
to use the Monte Carlo sample size nsim
not the data sample
size n
. When we use the second form, the right thing happens
automatically as t.test
takes the sample size from
length(theta.hat^2)
, which is the Monte Carlo sample size.
Note that there's no problem with the applicability of large sample asymptotic theory to Monte Carlo. The Monte Carlo sample size nsim = 10000 is certainly large.
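Putting the two forms together, a short sketch of the calculation (using theta.hat and nsim from the simulation above):

mse.hat <- mean(theta.hat^2)                 # Monte Carlo estimate of the MSE
mc.se <- sd(theta.hat^2) / sqrt(nsim)        # Monte Carlo standard error
mse.hat + c(-1, 1) * qnorm(0.975) * mc.se    # 95% CI, done by hand
t.test(theta.hat^2)$conf.int                 # the same thing, letting t.test do the work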
It may or may not help to solidify your thinking about this upper level (the Monte Carlo level) if we compare it with its counterpart in ordinary statistics. The difference is that in statistics the population distribution is imaginary and unknown, whereas in simulation it is concrete and known. So except for imaginary/concrete and unknown/known, simulation mimics statistics perfectly. So let's do it!
Because we're smart and because we know statistics, we don't need a third level. We can estimate the accuracy of a Monte Carlo
estimator using the Monte Carlo samples themselves.
As long as we can keep the levels straight in our minds, there's no problem. It's easy.
The principle of importance sampling is rather tricky. Most elementary expositions completely miss the reason why importance sampling is important. DeGroot and Schervish (their Section 11.3) are no exception.
There is a story and a set of toy problems that serve as examples that
explain the name importance sampling
. There is another story
and a set of different toy problems that serve as examples that explain
why importance sampling is important. Sometimes the toy problems are
almost the same. We will use only a very slightly modified version of
Example 11.3.2 in DeGroot and Schervish to make the point.
What makes importance sampling important is the following principle:
You can use samples from one distribution to learn about many distributions, even infinitely many distributions!
Suppose we have a parametric family of probability distributions with densities fθ(x), and suppose that g(x) is some other density. Further suppose that we can simulate random variables from the latter, so we have X1, X2, ..., Xn independent and identically distributed with density g(x).
Now observe that, for any function h for which the expectations exist, as a simple matter of algebra

Eθ{h(X)} = ∫ h(x) fθ(x) dx = ∫ [h(x) fθ(x) / g(x)] g(x) dx = Eg{h(X) fθ(X) / g(X)}

so long as the importance weights fθ(x) / g(x) are well defined (no divide by zero) for all x. Thus we can use the Monte Carlo approximation to the right hand side of this formula,

(1 / n) Σi h(Xi) fθ(Xi) / g(Xi),

as a Monte Carlo approximation to the left hand side, Eθ{h(X)}.
Since our Monte Carlo approximation is a simple average, we calculate confidence intervals in the usual way.
Note that one sample works for all θ.
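As a toy illustration of the principle (not the DeGroot and Schervish example), here is a sketch that uses one sample from the standard exponential density g to estimate Eθ(X) for a whole range of Gamma(θ, 1) densities fθ; the particular densities, the function h(x) = x, and the range of θ are made-up choices:

n <- 1e4
x <- rexp(n)                    # one sample from g(x) = exp(-x), x > 0
g <- dexp(x)                    # g evaluated at the sample points

theta <- seq(0.5, 3, 0.1)       # a whole range of parameter values
est <- se <- double(length(theta))
for (j in seq_along(theta)) {
    w <- dgamma(x, shape = theta[j]) / g   # importance weights f_theta(x) / g(x)
    y <- x * w                             # h(x) times the weights, with h(x) = x
    est[j] <- mean(y)                      # Monte Carlo estimate of E_theta(X), which is theta
    se[j] <- sd(y) / sqrt(n)               # Monte Carlo standard error
}

plot(theta, est, type = "l")                      # point estimate (solid)
lines(theta, est + qnorm(0.975) * se, lty = 2)    # pointwise 95% confidence
lines(theta, est - qnorm(0.975) * se, lty = 2)    # band (dashed)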
Example 11.3.2 in DeGroot and Schervish, done for a continuous range of α instead of just α = 5.
The dashed lines in the figure give the endpoints of the (pointwise, not simultaneous) 95% confidence intervals for the true curve. The solid line is the point estimate.