Simulating Random Variables in R
R has many functions for simulating random variables with specified distributions. We will just use these without explaining how they work. The algorithms they use are very sophisticated and complicated; anything simple enough to explain is too simple to use. So we will just take these functions as given.
Function or object | Help page title
------------------ | ----------------
.Random.seed | Random Number Generation
RNG | Random Number Generation
RNGkind | Random Number Generation
set.seed | Random Number Generation
rbeta | The Beta Distribution
rbinom | The Binomial Distribution
rcauchy | The Cauchy Distribution
rchisq | The (non-central) Chi-Squared Distribution
rexp | The Exponential Distribution
rf | The F Distribution
rgamma | The Gamma Distribution
rgeom | The Geometric Distribution
rhyper | The Hypergeometric Distribution
rlnorm | The Log Normal Distribution
rlogis | The Logistic Distribution
rnbinom | The Negative Binomial Distribution
rnorm | The Normal Distribution
rpois | The Poisson Distribution
rsignrank | Distribution of the Wilcoxon Signed Rank Statistic
rt | The Student t Distribution
runif | The Uniform Distribution
rweibull | The Weibull Distribution
rwilcox | Distribution of the Wilcoxon Rank Sum Statistic
mvrnorm | Simulate from a Multivariate Normal Distribution
sample | Random Samples and Permutations
Just a few comments.
- The first four items in the table (the R object .Random.seed and the R functions RNG, RNGkind, and set.seed) are used for control of the random number generator system as a whole. The only ones of these we will use are .Random.seed and set.seed, which tell the random number generator system where to start and so guarantee repeatability. For example,
  set.seed(1234)
  produces a valid state of the random number generator, and every time set.seed(1234) is called, the random number generator is reset to exactly the same state. To save the current state of the random number generator (which may be very complicated), do
  saveseed <- .Random.seed
  (note that the name .Random.seed begins with a dot) and then later, to restore the state, do
  .Random.seed <- saveseed
- The next bunch of items in the table (all the functions that start with a little r) generate independent, identically distributed random samples from the specified distribution. Or, for the fussy, we say that these computer-generated numbers, if not truly philosophically random, behave as if they are according to all statistical tests.
- The next to last item (mvrnorm) generates random samples from multivariate normal distributions. To follow the convention its name should begin with the letter r, but, because it was written by someone who doesn't care about consistency of function names, it doesn't. Note that this function is part of the MASS library, so you must say library(MASS) before using it.
- The last item (sample) generates random samples from finite populations, both with and without replacement, and also generates independent and identically distributed random variates from any distribution on a finite sample space.
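The seed and state mechanics in the first comment can be sketched as follows (the particular numbers produced depend on R's default generator, but the reproducibility does not):

```r
set.seed(1234)
x1 <- rnorm(5)            # five standard normal variates

set.seed(1234)            # reset to exactly the same state
x2 <- rnorm(5)
identical(x1, x2)         # TRUE: same seed, same numbers

saveseed <- .Random.seed  # save the current generator state
x3 <- rnorm(5)
.Random.seed <- saveseed  # restore the saved state
x4 <- rnorm(5)
identical(x3, x4)         # TRUE: restored state reproduces the stream
```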
The MSE of an Estimator
Example 11.1.1 in DeGroot and Schervish looks at the mean square error (MSE) of an estimator. Since the population distribution is symmetric and the estimator is unbiased, MSE is the same thing as variance in this example.
The R statements below simulate nsim data sets of sample size n from the Cauchy distribution (using the function rcauchy) and for each simulation store the estimator (calculated by median) in the vector theta.hat.
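A minimal sketch of that simulation, using nsim = 10000 and n = 10 (the values used elsewhere in these notes); the seed is an arbitrary choice made here for repeatability:

```r
set.seed(42)                      # arbitrary seed, for repeatability
nsim <- 10000
n <- 10
theta.hat <- double(nsim)
for (i in 1:nsim) {
    x <- rcauchy(n)               # one simulated data set
    theta.hat[i] <- median(x)     # the estimator for that data set
}
mean(theta.hat^2)                 # Monte Carlo estimate of the MSE
```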
The interpretation couldn't be simpler. The vector theta.hat
contains independent and identically distributed random variates of the
estimator under investigation (sample median of a Cauchy distribution).
The statement
mean(theta.hat^2)
calculates (more precisely, estimates) the MSE. We use this instead
of var(theta.hat)
because we know the true value of the
parameter: the distribution of the simulated data x
is
the standard Cauchy distribution centered at zero.
This knowing the truth
is the only way simulations differ from
doing statistics. We actually know the truth, but pretend we don't,
and estimate it by theta.hat
, over and over, thus learning
about the sampling error of the estimator.
It's always a good idea to look at a histogram (or some other picture) of the samples, in case something funny is going on. Here we see the sampling distribution of the sample median is symmetric and unimodal, but isn't very well approximated by the normal distribution having variance mean(theta.hat^2). The problem isn't so much that the distribution isn't normal, it's that the distribution is heavy tailed and variance isn't a very good parameter to use.
The other normal distribution (red) is the theoretical asymptotic distribution of the estimator. That fits pretty well, even though n = 10 isn't really large.
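A sketch of that picture, redoing the simulation so the code is self-contained. The red curve uses the asymptotic variance of the sample median of a standard Cauchy sample, 1 / (4 n f(0)^2) = pi^2 / (4 n), since the standard Cauchy density has f(0) = 1 / pi.

```r
set.seed(42)                      # arbitrary seed, for repeatability
nsim <- 10000
n <- 10
theta.hat <- replicate(nsim, median(rcauchy(n)))

hist(theta.hat, breaks = 50, freq = FALSE,
    main = "sampling distribution of the sample median")
# normal density with variance equal to the MSE estimate
curve(dnorm(x, sd = sqrt(mean(theta.hat^2))), add = TRUE)
# theoretical asymptotic distribution, variance pi^2 / (4 n), in red
curve(dnorm(x, sd = sqrt(pi^2 / (4 * n))), add = TRUE, col = "red")
```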
The Monte Carlo Error of a Monte Carlo Estimator
Theory
The really annoying thing about simulation
is that it doesn't produce exact
answers. Worse, the inexact answers are random, so you don't get the same
answer every time (unless you set the random number generator seed
to be the same before each run).
For example, five different repetitions of the preceding example give
0.3402748 0.3409837 0.3419415 0.3388291 0.3517652
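A sketch of how such a table arises, using nsim = 10000 and n = 10 as in the preceding example (no seed is set, so the values differ from run to run, which is exactly the point):

```r
# repeat the whole simulation five times; each repetition gives a
# slightly different Monte Carlo estimate of the MSE
mse <- replicate(5, {
    theta.hat <- replicate(10000, median(rcauchy(10)))
    mean(theta.hat^2)
})
mse   # five similar, but not identical, numbers
```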
Most annoying. How do you report this stuff?
The really nice thing about simulation (if you already know statistics), is that the theory of simulation is just statistics. You already know all the theory you need to know to understand simulation.
All you have to do is keep the different levels
of randomness straight.
Here we have two levels.
- We have nsim samples x of size n from the Cauchy distribution.
- We have one sample theta.hat of size nsim from the sampling distribution of the sample median of a sample of size n from a standard Cauchy distribution.
For that reason, it is very important to keep the levels clear, to be very specific about "Which mean do you mean?" as DeGroot and Schervish put the issue in a section heading on p. 702.
We will use some common terminology to keep the levels straight.
The first level, where x
are the samples, we will call
sampling
as this mimics what we do in statistics. The only
differences are that
- In statistics, we don't know the true parameter values. In simulation, we actually do know the true parameter values. We just pretend we don't when required to mimic the statistical situation.
- In statistics, we just have one sample x (from the unknown population distribution). In simulation, we repeat the statistical situation over and over, getting nsim samples (from the known, but pretend unknown, population distribution).
The second level, where theta.hat is the sample, we will call Monte Carlo. This is a cutesy name that has been used so long (a half-century) that the joke is forgotten and people use it with a straight face. The city of Monte Carlo in the small country of Monaco on the Mediterranean Sea has the most famous gambling casino in the world. Gambling is random, and some now forgotten someone decided to use it as a cute name for simulation. Nowadays, doing Monte Carlo is just another term for doing simulation, and Monte Carlo error is just another term for simulation error.
Now here's the key idea of simulation or Monte Carlo, whatever you want to call it. We could get an idea about the Monte Carlo error two ways.
- There's the dumb way, that ignores everything we know about statistics. We could just do the preceding example over and over (or add a third level, another for loop) and see what the distribution of the numbers is. Our table of the five Monte Carlo estimates above is a start on this dumb method.
- There's the smart way, that uses what we know about statistics. The elements of the vector theta.hat are a random sample of size nsim from an unknown distribution (the sampling distribution of the estimator), and the mean of the distribution of theta.hat^2 is the exact answer we are trying to calculate. If we think of that exact answer as an unknown parameter, then the usual theory about a confidence interval for a sample mean tells us that either
  mean(theta.hat^2) + c(-1, 1) * qnorm(0.975) * sd(theta.hat^2) / sqrt(nsim)
  or, more simply,
  t.test(theta.hat^2)
  gives us a 95% confidence interval for the unknown exact answer. The only tricky thing, when we use the first form, is that we must remember to use the Monte Carlo sample size nsim, not the data sample size n. When we use the second form, the right thing happens automatically, as t.test takes the sample size from length(theta.hat^2), which is the Monte Carlo sample size. Note that there's no problem with the applicability of large sample asymptotic theory to Monte Carlo: the Monte Carlo sample size nsim = 10000 is certainly large.
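Both forms of the confidence interval can be sketched in a few lines, redoing the simulation so the code stands alone (nsim = 10000, n = 10; the seed is an arbitrary choice):

```r
set.seed(42)                      # arbitrary seed, for repeatability
nsim <- 10000
n <- 10
theta.hat <- replicate(nsim, median(rcauchy(n)))

# by hand: note the divisor is the Monte Carlo sample size nsim, not n
ci <- mean(theta.hat^2) +
    c(-1, 1) * qnorm(0.975) * sd(theta.hat^2) / sqrt(nsim)

# or let t.test get the sample size right automatically
out <- t.test(theta.hat^2)

ci            # hand-made 95% confidence interval for the MSE
out$conf.int  # essentially the same interval (t rather than z quantile)
```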
It may or may not help to solidify your thinking about this upper level (the Monte Carlo level) if we compare with its counterpart in ordinary statistics.
- In statistics, the sampling distribution of the estimator is an entirely hypothetical-theoretical construct. We talk about it a lot, but we never see it. All we see is one realization from it: the actual estimator for the actual data set. We are supposed to imagine the sampling variability of this estimator. Instead of just one number, we could have gotten many different numbers if we had repeated the experiment over and over.
- In simulation, we actually do a simulated experiment over and over. So the sampling distribution of the estimator is very concrete. We can actually look at its histogram.
So the differences between simulation and statistics are:
- The sampling distribution of the estimator is imaginary in statistics but concrete in simulation.
- The sampling distribution of the estimator is derived from the unknown population distribution in statistics but from some known distribution in simulation.
So except for imaginary/concrete and unknown/known, simulation mimics statistics perfectly.
Practice
So let's do it!
Because we're smart and because we know statistics, we don't need a third level. We can estimate the accuracy of a Monte Carlo estimator using the Monte Carlo samples themselves.
As long as we can keep the levels straight in our minds, there's no problem. It's easy.
Importance Sampling
The principle of importance sampling is rather tricky. Most elementary expositions completely miss the reason why importance sampling is important. DeGroot and Schervish (their Section 11.3) are no exception.
There is a story and a set of toy problems that serve as examples that
explain the name importance sampling
. There is another story
and a set of different toy problems that serve as examples that explain
why importance sampling is important. Sometimes the toy problems are
almost the same. We will use only a very slightly modified version of
Example 11.3.2 in DeGroot and Schervish to make the point.
Theory
What makes importance sampling important is the following principle
You can use samples from one distribution to learn about many distributions, even infinitely many distributions!
Suppose we have a parametric family of probability distributions with densities fθ(x). And suppose that g(x) is some other density. Further suppose that we can simulate random variables from the latter so we have X1, X2, ..., Xn independent and identically distributed with density g(x).
Suppose also that w(x) is a function whose expectation

Eθ{w(X)} = ∫ w(x) fθ(x) dx

we want to calculate. Now observe that, as a simple matter of algebra,

∫ w(x) fθ(x) dx = ∫ w(x) [fθ(x) / g(x)] g(x) dx = Eg{w(X) fθ(X) / g(X)}

so long as the importance weights

fθ(x) / g(x)

are well defined (no divide by zero) for all x. Thus we can use the Monte Carlo approximation to the right-hand side of this formula,

(1 / n) Σi w(Xi) fθ(Xi) / g(Xi),

as a Monte Carlo approximation to the left-hand side.
Since our Monte Carlo approximation is a simple average, we calculate confidence intervals in the usual way.
Note that one sample works for all θ.
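A sketch of that principle, using an illustrative choice of distributions (not the textbook's example): one sample from the Exponential(1) density g estimates Eα(X) under the Gamma(α, 1) densities fα for every α at once. Both densities are positive on the whole support, so the weights are well defined, and here the exact answer Eα(X) = α is known, so we can check.

```r
set.seed(42)                      # arbitrary seed, for repeatability
nsim <- 10000
x <- rexp(nsim)                   # one sample from g = Exponential(1)
alpha <- seq(1, 3, by = 0.5)
est <- se <- numeric(length(alpha))
for (i in seq_along(alpha)) {
    # importance weights f_alpha(x) / g(x)
    w <- dgamma(x, shape = alpha[i]) / dexp(x)
    est[i] <- mean(w * x)                # estimates E_alpha(X) = alpha
    se[i] <- sd(w * x) / sqrt(nsim)      # Monte Carlo standard error
}
cbind(alpha, est, se)             # est should be close to alpha
```

The same vector x is reused for every α; only the weights change. That is exactly what "one sample works for all θ" means.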
Practice
Example 11.3.2 in DeGroot and Schervish, done for a continuous range of α instead of just α = 5.
The dashed lines in the figure give the endpoints of the (pointwise, not simultaneous) 95% confidence intervals for the true curve. The solid line is the point estimate.