R has many functions for simulating random variables with specified distributions. We will just use these without explaining how they work. The algorithms they use are very sophisticated and complicated; anything simple enough to explain is too simple to use. So we will just take these functions as given.
function | help page |
.Random.seed | Random Number Generation |
RNG | Random Number Generation |
RNGkind | Random Number Generation |
set.seed | Random Number Generation |
rbeta | The Beta Distribution |
rbinom | The Binomial Distribution |
rcauchy | The Cauchy Distribution |
rchisq | The (non-central) Chi-Squared Distribution |
rexp | The Exponential Distribution |
rf | The F Distribution |
rgamma | The Gamma Distribution |
rgeom | The Geometric Distribution |
rhyper | The Hypergeometric Distribution |
rlnorm | The Log Normal Distribution |
rlogis | The Logistic Distribution |
rnbinom | The Negative Binomial Distribution |
rnorm | The Normal Distribution |
rpois | The Poisson Distribution |
rsignrank | Distribution of the Wilcoxon Signed Rank Statistic |
rt | The Student t Distribution |
runif | The Uniform Distribution |
rweibull | The Weibull Distribution |
rwilcox | Distribution of the Wilcoxon Rank Sum Statistic |
mvrnorm | Simulate from a Multivariate Normal Distribution |
sample | Random Samples and Permutations |
Just a few comments.

The object .Random.seed and the R functions RNGkind and set.seed (all documented under the Random Number Generation help topic, RNG) are used for control of the random number generator system as a whole. The only ones of these we will use are .Random.seed and set.seed, which tell the random number generator system where to start and so guarantee repeatability. For example, set.seed(1234) produces a valid state of the random number generator, and every time set.seed(1234) is executed the random number generator is reset to exactly the same state.

To save the current state of the random number generator (which may be very complicated), do

saveseed <- .Random.seed

(note that the name .Random.seed begins with a dot) and then later, to restore the state, do

.Random.seed <- saveseed
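For instance, a minimal sketch of how these are used (the seed 1234 and the particular r functions here are just illustrative choices):

set.seed(1234)             # put the generator in a known state
x1 <- runif(5)
set.seed(1234)             # reset to exactly the same state
x2 <- runif(5)
identical(x1, x2)          # TRUE: same seed, same random numbers

saveseed <- .Random.seed   # save the current state
y1 <- rnorm(3)
.Random.seed <- saveseed   # restore the saved state
y2 <- rnorm(3)
identical(y1, y2)          # TRUE: restored state, same random numbers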
The functions in the table whose names begin with the letter r generate independent, identically distributed random samples from the specified distribution. Or, for the fussy, we say that these computer-generated numbers, if not truly philosophically random, behave as if they are according to all statistical tests.
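They all follow the same calling convention: the first argument is the number of random variates wanted, and the remaining arguments are the parameters of the distribution. A couple of illustrative calls (the particular numbers are made up):

rnorm(5)                           # five standard normal random variates
rnorm(5, mean = 10, sd = 2)        # five normal random variates, mean 10, s.d. 2
rbinom(5, size = 20, prob = 0.3)   # five binomial(20, 0.3) random variates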
The function mvrnorm generates random samples from multivariate normal distributions. To follow the convention its name should begin with the letter r but, because it was written by someone who doesn't care about consistency of function names, it doesn't. Note that this function is part of the MASS library, so you must say library(MASS) before using it.
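For example (a made-up bivariate illustration):

library(MASS)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)   # covariance matrix
x <- mvrnorm(100, mu, Sigma)                   # 100 rows, one per random vector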
The function sample generates random samples from finite populations, both with and without replacement, and also generates independent and identically distributed random variates from any distribution on a finite sample space.
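For example (again, made-up illustrations):

sample(1:10, 3)                     # 3 draws without replacement from 1, ..., 10
sample(1:10)                        # a random permutation of 1, ..., 10
sample(c(0, 1), 20, replace = TRUE, prob = c(0.7, 0.3))
                                    # i.i.d. draws from a distribution on {0, 1}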
Example 11.1.1 in DeGroot and Schervish looks at the mean square error (MSE) of an estimator. Since the population distribution is symmetric and the estimator is unbiased, MSE is the same thing as variance in this example.
The R statements below do nsim simulations of data of sample size n from the Cauchy distribution (using the function rcauchy) and for each simulation store the estimator (calculated by the function median).
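A minimal sketch of such a simulation, assuming n = 10 and nsim = 10000 (the values referred to in the discussion below):

n <- 10          # data sample size
nsim <- 10000    # Monte Carlo sample size

theta.hat <- double(nsim)
for (i in 1:nsim) {
    x <- rcauchy(n)              # one sample of size n from the standard Cauchy
    theta.hat[i] <- median(x)    # the estimator for that sample
}

mean(theta.hat^2)    # Monte Carlo estimate of the MSE (the true parameter is zero)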
The interpretation couldn't be simpler. The vector theta.hat
contains independent and identically distributed random variates of the
estimator under investigation (sample median of a Cauchy distribution).
The statement
mean(theta.hat^2)
calculates (more precisely, estimates) the MSE. We use this instead
of var(theta.hat)
because we know the true value of the
parameter: the distribution of the simulated data x
is
the standard Cauchy distribution centered at zero.
This "knowing the truth" is the only way simulation differs from doing statistics. We actually know the truth, but pretend we don't, and estimate it by theta.hat, over and over, thus learning about the sampling error of the estimator.
It's always a good idea to look at a histogram (or some other picture)
of the samples, in case something funny is going on. Here we see the
sampling distribution of the sample median is symmetric and unimodal,
but isn't very well approximated by the normal distribution having
variance mean(theta.hat^2). The problem isn't so much that
the distribution isn't normal, it's that the distribution is heavy tailed
and variance isn't a very good parameter to use.
The other normal distribution (red) is the theoretical asymptotic distribution
of the estimator. That fits pretty well, even though n = 10 isn't really large.
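One way to draw such a picture (the breaks and xlim values are arbitrary choices, and the red curve uses the standard asymptotic variance formula pi^2 / (4 * n) for the median of a standard Cauchy sample):

hist(theta.hat, probability = TRUE, breaks = 100, xlim = c(-3, 3))
# normal density with variance equal to the estimated MSE
curve(dnorm(x, sd = sqrt(mean(theta.hat^2))), add = TRUE)
# theoretical asymptotic distribution of the sample median (red)
curve(dnorm(x, sd = sqrt(pi^2 / (4 * n))), add = TRUE, col = "red")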
The really annoying thing about simulation
is that it doesn't produce exact
answers. Worse, the inexact answers are random, so you don't get the same
answer every time (unless you set the random number generator seed
to be the same before each run).
For example, five different repetitions of the preceding example give
0.3402748 0.3409837 0.3419415 0.3388291 0.3517652
Most annoying. How do you report this stuff?
The really nice thing about simulation (if you already know statistics) is that the theory of simulation is just statistics. You already know all the theory you need to know to understand simulation.
All you have to do is keep the different levels
of randomness straight.
Here we have two levels.

- At the lower level, nsim samples x, each of size n, from the Cauchy distribution.
- At the upper level, one sample theta.hat of size nsim from the sampling distribution of the sample median of a sample of size n from a standard Cauchy distribution.
For that reason, it is very important to keep the levels clear, to be very specific about "Which mean do you mean?" as DeGroot and Schervish put the issue in a section heading on p. 702.
We will use some common terminology to keep the levels straight.

The first level, where the x are the samples, we will call sampling, as this mimics what we do in statistics. The only difference is that in statistics we have only one sample x (from the unknown population distribution). In simulation, we repeat the statistical situation over and over, getting nsim samples (from the known, but pretend unknown, population distribution).
The second level, where theta.hat
is the sample, we will call
Monte Carlo
. This is a cutesy name that has been used so long
(a half-century) that the joke is forgotten and people use it with a straight
face. The city of Monte Carlo in the small country of Monaco on the
Mediterranean Sea has the most famous gambling casino in the world.
Gambling is random, and some now forgotten someone decided to use it
as a cute name for simulation. Nowadays, doing Monte Carlo
is just another term for doing simulation
, and Monte Carlo error
is just another term for simulation error
.
Now here's the key idea of simulation or Monte Carlo, whatever you want to call it. We could get an idea about the Monte Carlo error two ways.

- The dumb way: repeat the whole simulation many times (say, with yet another for loop) and see what the distribution of the resulting numbers is. Our table of the five Monte Carlo estimates above is a start on this dumb method.
- The smart way: realize that the numbers theta.hat are a random sample of size nsim from an unknown distribution (the sampling distribution of the estimator) and that the mean of the distribution of theta.hat^2 is the exact answer we are trying to calculate.
If we think of that exact answer as an unknown parameter, then the usual theory about a confidence interval for a sample mean tells us that either
mean(theta.hat^2) + c(-1, 1) * qnorm(0.975) * sd(theta.hat^2) / sqrt(nsim)
or, more simply,
t.test(theta.hat^2)
gives us a 95% confidence interval for the unknown exact answer.
The only tricky thing, when we use the first form, is that we must remember
to use the Monte Carlo sample size nsim
not the data sample
size n
. When we use the second form, the right thing happens
automatically as t.test
takes the sample size from
length(theta.hat^2)
, which is the Monte Carlo sample size.
Note that there's no problem with the applicability of large sample asymptotic theory to Monte Carlo. The Monte Carlo sample size nsim = 10000 is certainly large.
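Putting the two forms together, a short sketch of the calculation (using theta.hat and nsim from the simulation above):

mse.hat <- mean(theta.hat^2)                 # Monte Carlo estimate of the MSE
mc.se <- sd(theta.hat^2) / sqrt(nsim)        # Monte Carlo standard error
mse.hat + c(-1, 1) * qnorm(0.975) * mc.se    # 95% CI, done by hand
t.test(theta.hat^2)$conf.int                 # the same thing, letting t.test do the work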
It may or may not help to solidify your thinking about this upper level (the Monte Carlo level) if we compare it with its counterpart in ordinary statistics. The difference is that in statistics the population distribution is imaginary and unknown, whereas in simulation it is concrete and known. So except for imaginary/concrete and unknown/known, simulation mimics statistics perfectly. So let's do it!
Because we're smart and because we know statistics, we don't need a third level. We can estimate the accuracy of a Monte Carlo
estimator using the Monte Carlo samples themselves.
As long as we can keep the levels straight in our minds, there's no problem. It's easy.
The principle of importance sampling is rather tricky. Most elementary expositions completely miss the reason why importance sampling is important. DeGroot and Schervish (their Section 11.3) are no exception.
There is a story and a set of toy problems that serve as examples that
explain the name importance sampling
. There is another story
and a set of different toy problems that serve as examples that explain
why importance sampling is important. Sometimes the toy problems are
almost the same. We will use only a very slightly modified version of
Example 11.3.2 in DeGroot and Schervish to make the point.
What makes importance sampling important is the following principle:
You can use samples from one distribution to learn about many distributions, even infinitely many distributions!
Suppose we have a parametric family of probability distributions with densities fθ(x), and suppose that g(x) is some other density. Further suppose that we can simulate random variables from the latter, so we have X1, X2, ..., Xn independent and identically distributed with density g(x).
Now observe that, for any function h for which the expectations exist, as a simple matter of algebra

Eθ{h(X)} = ∫ h(x) fθ(x) dx = ∫ [h(x) fθ(x) / g(x)] g(x) dx = Eg{h(X) fθ(X) / g(X)}

so long as the importance weights fθ(x) / g(x) are well defined (no divide by zero) for all x. Thus we can use the Monte Carlo approximation to the right hand side of this formula,

(1 / n) Σi h(Xi) fθ(Xi) / g(Xi),

as a Monte Carlo approximation to the left hand side, Eθ{h(X)}.
Since our Monte Carlo approximation is a simple average, we calculate confidence intervals in the usual way.
Note that one sample works for all θ.
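As a toy illustration of the principle (not the DeGroot and Schervish example), here is a sketch that uses one sample from the standard exponential density g to estimate Eθ(X) for a whole range of Gamma(θ, 1) densities fθ; the particular densities, the function h(x) = x, and the range of θ are made-up choices:

n <- 1e4
x <- rexp(n)                    # one sample from g(x) = exp(-x), x > 0
g <- dexp(x)                    # g evaluated at the sample points

theta <- seq(0.5, 3, 0.1)       # a whole range of parameter values
est <- se <- double(length(theta))
for (j in seq_along(theta)) {
    w <- dgamma(x, shape = theta[j]) / g   # importance weights f_theta(x) / g(x)
    y <- x * w                             # h(x) times the weights, with h(x) = x
    est[j] <- mean(y)                      # Monte Carlo estimate of E_theta(X), which is theta
    se[j] <- sd(y) / sqrt(n)               # Monte Carlo standard error
}

plot(theta, est, type = "l")                      # point estimate (solid)
lines(theta, est + qnorm(0.975) * se, lty = 2)    # pointwise 95% confidence
lines(theta, est - qnorm(0.975) * se, lty = 2)    # band (dashed)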
Example 11.3.2 in DeGroot and Schervish, done for a continuous range of α instead of just α = 5.
The dashed lines in the figure give the endpoints of the (pointwise, not simultaneous) 95% confidence intervals for the true curve. The solid line is the point estimate.