Statistics 5102 (Geyer, Spring 2003) Simulation in R

Simulating Random Variables in R

R has many functions for simulating random variables with specified distributions. We will just use these without explaining how they work. The algorithms they use are sophisticated and complicated; anything simple enough to explain is too simple to use. So we will just take these functions as given.

.Random.seed Random Number Generation
RNG Random Number Generation
RNGkind Random Number Generation
set.seed Random Number Generation
rbeta The Beta Distribution
rbinom The Binomial Distribution
rcauchy The Cauchy Distribution
rchisq The (non-central) Chi-Squared Distribution
rexp The Exponential Distribution
rf The F Distribution
rgamma The Gamma Distribution
rgeom The Geometric Distribution
rhyper The Hypergeometric Distribution
rlnorm The Log Normal Distribution
rlogis The Logistic Distribution
rnbinom The Negative Binomial Distribution
rnorm The Normal Distribution
rpois The Poisson Distribution
rsignrank Distribution of the Wilcoxon Signed Rank Statistic
rt The Student t Distribution
runif The Uniform Distribution
rweibull The Weibull Distribution
rwilcox Distribution of the Wilcoxon Rank Sum Statistic
mvrnorm Simulate from a Multivariate Normal Distribution
sample Random Samples and Permutations
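
Most of these functions take the number of random variates wanted as their first argument, followed by the parameters of the distribution. For example (the particular calls here are only illustrative):

    set.seed(42)                      # fix the random number generator seed for reproducibility
    runif(5)                          # five Uniform(0, 1) random variates
    rnorm(5, mean = 10, sd = 2)       # five normal variates with mean 10 and standard deviation 2
    rbinom(5, size = 20, prob = 0.3)  # five Binomial(20, 0.3) random variates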

Just a few comments: all of these functions come with base R except mvrnorm, which is in the recommended package MASS, so you must do library(MASS) before using it.

The MSE of an Estimator

Example 11.1.1 in DeGroot and Schervish looks at the mean squared error (MSE) of an estimator. Since the population distribution is symmetric, the sample median is an unbiased estimator, and because MSE is variance plus squared bias, MSE is the same thing as variance in this example.

The R statements below perform nsim simulations, each consisting of a data set x of sample size n from the standard Cauchy distribution (generated by the function rcauchy), and for each simulated data set store the value of the estimator (calculated by median) in the vector theta.hat.
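
Something like the following will do the job (the value of nsim here is only illustrative; n = 10 is the sample size used in the discussion below).

    n <- 10                        # sample size
    nsim <- 10000                  # number of simulations (illustrative value)
    theta.hat <- double(nsim)
    for (i in 1:nsim) {
        x <- rcauchy(n)            # one simulated data set from the standard Cauchy
        theta.hat[i] <- median(x)  # the estimator for that data set
    }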

The interpretation couldn't be simpler. The vector theta.hat contains independent and identically distributed realizations of the estimator under investigation (the sample median of a sample from a standard Cauchy distribution).

The statement

mean(theta.hat^2)

calculates (more precisely, estimates) the MSE. We use this instead of var(theta.hat) because we know the true value of the parameter: the simulated data x come from the standard Cauchy distribution, which is centered at zero, so the true parameter value is zero, and the MSE (the expected squared difference between the estimator and the truth) is just the expected square of the estimator.

Knowing the truth is the only way simulation differs from doing statistics. We actually know the truth, but pretend we don't and estimate it by theta.hat, over and over, thus learning about the sampling error of the estimator.

It's always a good idea to look at a histogram (or some other picture) of the samples, in case something funny is going on. Here we see the sampling distribution of the sample median is symmetric and unimodal, but isn't very well approximated by the normal distribution having variance mean(theta.hat^2). The problem isn't so much that the distribution isn't normal as that it is heavy tailed, so the variance isn't a very good parameter to use.

The other normal distribution (red) is the theoretical asymptotic distribution of the estimator. That fits pretty well, even though n = 10 isn't really large.
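
A minimal sketch of how such a picture can be drawn, using the standard large-sample result that the asymptotic variance of the sample median is 1 / (4 n f(0)^2), which for the standard Cauchy, with f(0) = 1/π, is π^2 / (4 n):

    hist(theta.hat, freq = FALSE, breaks = 50)
    # normal density with variance equal to the estimated MSE
    curve(dnorm(x, sd = sqrt(mean(theta.hat^2))), add = TRUE)
    # theoretical asymptotic distribution of the sample median (red)
    curve(dnorm(x, sd = sqrt(pi^2 / (4 * n))), add = TRUE, col = "red")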

The Monte Carlo Error of a Monte Carlo Estimator

Theory

The really annoying thing about simulation is that it doesn't produce exact answers. Worse, the inexact answers are random, so you don't get the same answer every time (unless you set the random number generator seed to be the same before each run).

For example, five different repetitions of the preceding example give

0.3402748
0.3409837
0.3419415
0.3388291
0.3517652
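
A sketch of how such repetitions can be produced, wrapping the whole simulation above in a function (the name do.sim is made up for illustration):

    do.sim <- function(n, nsim) {
        theta.hat <- replicate(nsim, median(rcauchy(n)))
        mean(theta.hat^2)   # the Monte Carlo estimate of the MSE
    }
    replicate(5, do.sim(10, 10000))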

Most annoying. How do you report this stuff?

The really nice thing about simulation (if you already know statistics) is that the theory of simulation is just statistics. You already know all the theory you need to know to understand simulation.

All you have to do is keep the different levels of randomness straight. Here we have two levels.

  1. We have nsim samples x of size n from the Cauchy distribution.
  2. We have one sample theta.hat of size nsim from the sampling distribution of the sample median of a sample of size n from a standard Cauchy distribution.

For that reason, it is very important to keep the levels clear, to be very specific about "Which mean do you mean?" as DeGroot and Schervish put the issue in a section heading on p. 702.

We will use some common terminology to keep the levels straight.

The first level, where the x are the samples, we will call sampling, as this mimics what we do in statistics. The only differences are that the data are produced by the computer's random number generator rather than collected from the real world, and that we know the true distribution the data come from.

The second level, where theta.hat is the sample, we will call Monte Carlo. This is a cutesy name that has been used so long (a half-century) that the joke is forgotten and people use it with a straight face. The city of Monte Carlo in the small country of Monaco on the Mediterranean Sea has the most famous gambling casino in the world. Gambling is random, and some now forgotten someone decided to use it as a cute name for simulation. Nowadays, doing Monte Carlo is just another term for doing simulation, and Monte Carlo error is just another term for simulation error.

Now here's the key idea of simulation or Monte Carlo, whatever you want to call it. We could get an idea about the Monte Carlo error two ways: by repeating the whole simulation many times and seeing how much the answers vary (as in the five repetitions above), or by applying ordinary statistical theory to the one Monte Carlo sample we already have.

It may or may not help to solidify your thinking about this upper level (the Monte Carlo level) to compare it with its counterpart in ordinary statistics: the Monte Carlo sample theta.hat plays the role of the data, the Monte Carlo sample size nsim plays the role of the sample size, and the sampling distribution of the estimator plays the role of the population distribution.

So the differences between simulation and statistics come down to two contrasts: in statistics the sampling distribution of the estimator is an imaginary, theoretical object and the true parameter value is unknown, while in simulation we have a concrete sample theta.hat from that sampling distribution and we know the truth.

So except for imaginary/concrete and unknown/known, simulation mimics statistics perfectly.

Practice

So let's do it!

Because we're smart and because we know statistics, we don't need a third level. We can estimate the accuracy of a Monte Carlo estimator using the Monte Carlo samples themselves.

As long as we can keep the levels straight in our minds, there's no problem. It's easy.
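
For example, continuing with the notation above, the Monte Carlo estimate mean(theta.hat^2) is just a sample mean of the nsim values theta.hat^2, so the usual standard error of a sample mean applies (a minimal sketch):

    mse.hat <- mean(theta.hat^2)              # the Monte Carlo estimate of the MSE
    mcse <- sd(theta.hat^2) / sqrt(nsim)      # its Monte Carlo standard error
    mse.hat + c(-1, 1) * qnorm(0.975) * mcse  # approximate 95% confidence interval for the true MSE

Reporting the estimate together with its Monte Carlo standard error answers the "how do you report this stuff?" question above.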

Importance Sampling

The principle of importance sampling is rather tricky. Most elementary expositions completely miss the reason why importance sampling is important. DeGroot and Schervish (their Section 11.3) are no exception.

There is a story and a set of toy problems that serve as examples that explain the name importance sampling. There is another story and a set of different toy problems that serve as examples that explain why importance sampling is important. Sometimes the toy problems are almost the same. We will use only a very slightly modified version of Example 11.3.2 in DeGroot and Schervish to make the point.

Theory

What makes importance sampling important is the following principle:

You can use samples from one distribution to learn about many distributions, even infinitely many distributions!

Suppose we have a parametric family of probability distributions with densities f_θ(x). And suppose that g(x) is some other density. Further suppose that we can simulate random variables from the latter, so we have X_1, X_2, ..., X_n independent and identically distributed with density g(x).

Now observe that as a simple matter of algebra

E_θ{ h(X) } = E_g{ h(X) f_θ(X) / g(X) }

so long as the importance weights f_θ(x) / g(x) are well defined (no divide by zero) for all x.

Thus we can use the Monte Carlo approximation to the right hand side of this formula

(1 / n) ∑_{i = 1}^{n} h(X_i) f_θ(X_i) / g(X_i)

as a Monte Carlo approximation to the left hand side.

Since our Monte Carlo approximation is a simple average, we calculate confidence intervals in the usual way.

Note that one sample works for all θ.
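
A minimal sketch of the idea (not the textbook example): estimate E_θ{ h(X) } for X normal with mean θ and variance one, over a whole range of θ, from a single sample drawn from the standard normal density g. The integrand h and the grid of θ values are arbitrary choices for illustration.

    h <- function(x) x^2                # an arbitrary integrand, just for illustration
    nsamp <- 10000
    x <- rnorm(nsamp)                   # one sample from g, the standard normal density
    theta <- seq(-1, 1, by = 0.1)       # many parameter values, one sample
    est <- double(length(theta))
    se <- double(length(theta))
    for (j in seq_along(theta)) {
        w <- dnorm(x, mean = theta[j]) / dnorm(x)  # importance weights f_theta(x) / g(x)
        y <- h(x) * w
        est[j] <- mean(y)                          # Monte Carlo approximation of E_theta{ h(X) }
        se[j] <- sd(y) / sqrt(nsamp)               # its Monte Carlo standard error
    }

Here est should be close to theta^2 + 1, the exact value of E_θ{ X^2 }, and est plus or minus 1.96 times se gives the usual (pointwise) 95% confidence intervals.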

Practice

Example 11.3.2 in DeGroot and Schervish done for a continuous range of α instead of just α = 5.

The dashed lines in the figure give the endpoints of the (pointwise, not simultaneous) 95% confidence intervals for the true curve. The solid line is the point estimate.
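
Assuming vectors alpha, est, and se holding the grid of α values, the point estimates, and the Monte Carlo standard errors (analogous to the sketch in the Theory section), such a figure can be drawn like this:

    plot(alpha, est, type = "l", xlab = "alpha", ylab = "estimate")
    lines(alpha, est + qnorm(0.975) * se, lty = 2)  # upper 95% pointwise confidence limit
    lines(alpha, est - qnorm(0.975) * se, lty = 2)  # lower 95% pointwise confidence limit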