This one web page explains the big idea in statistics. All of statistics is about this idea (sampling distributions) in one way or another.
If you understand this web page, you understand statistics. This is the key idea. Everything else follows from this one idea.
It may not bring world peace, but it is the only really good idea anyone ever had for dealing with data from processes that you cannot make give the same result every time — that have inherent variability.
The law of large numbers says the sample mean is close to the population mean, with high probability, for sufficiently large sample sizes.
It does not say the sample mean is close to the population mean for any particular sample size (just for some, perhaps very, very large sample sizes). Nor does it give any sharp bound on the difference between sample and population mean: no matter how large the sample size is, very large differences may be possible (with very small probability).
The law of large numbers is a generalization of what we saw for coin flips. The point is that the law of large numbers works for simple random samples (SRS) from any population or for independent and identically distributed (IID) samples from any probability distribution. It's not just for coin flips.
Here we use a population distribution that is very skewed, called the
exponential distribution, which is of no particular interest. We use
it only because it is easy to simulate.
The R function rexp
(on-line help)
does that.
The population mean for this distribution happens to be exactly 1, which is the horizontal line in the plot. The plot shows the sample mean getting closer and closer (in probability) as the sample size increases.
Every time you resubmit, there will be a different picture.
The random sample simulated by the rexp
function will
be different every time.
You can change the sample size n
to see what happens.
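The kind of simulation described above can be sketched in a few lines of R (a minimal sketch; the page's actual code and variable names may differ):

```r
# Law of large numbers for an Exponential(1) population:
# the running sample mean approaches the population mean, exactly 1.
n <- 1000                        # sample size; change this to experiment
x <- rexp(n)                     # IID Exponential(1) sample
xbar <- cumsum(x) / seq_len(n)   # sample mean of the first k observations
plot(seq_len(n), xbar, type = "l",
     xlab = "sample size", ylab = "sample mean")
abline(h = 1)                    # horizontal line at the population mean
```

Rerunning gives a different picture every time because rexp produces a fresh random sample on each call.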
The section immediately following and the section using the exponential distribution in the central limit theorem show the density curve for this population distribution.
In the preceding section we saw that sample means are close to population means for large sample sizes.
A sample proportion is a special case of a sample mean.
If X_{1}, X_{2}, … X_{n} is a sample and A is an event, define Y_{i} to be 1 if X_{i} is in A and 0 otherwise. Then
Ȳ_{n} = (Y_{1} + Y_{2} + ⋯ + Y_{n}) ⁄ n

is the sample proportion for the event A, the number of X_{i} in A divided by the sample size. In this case, the law of large numbers says Ȳ_{n} is close to P(A) for sufficiently large sample sizes. The coin flip process is an example of this (the sample mean for any zero-or-one-valued random variable is a sample proportion).

When applied to several events, this says that the area of each bar of a histogram is close to the corresponding area under the density curve for the population distribution (for large n), and since the bar and the corresponding region under the curve have the same width, they must have nearly the same height.
Thus the law of large numbers says the histogram for an SRS or IID sample is close to the population density curve for sufficiently large sample sizes. Let's try that out. Again we use the exponential distribution.
An exponential random variable is strictly positive. The density curve has a jump exactly at zero: it is zero to the left of zero and 1 just to the right of zero. This is a very skewed distribution.
Increase n
to see the histogram (black) get closer to the
population density curve (red).
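A sketch of that comparison in R (illustrative, not necessarily the page's exact code):

```r
# Compare the histogram of a large IID sample with the population
# density curve of the Exponential(1) distribution.
n <- 1000                                # increase n to see convergence
x <- rexp(n)
hist(x, freq = FALSE, breaks = 30)       # freq = FALSE puts density on y-axis
curve(dexp(x), add = TRUE, col = "red")  # population density curve
```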
The sample mean is a random variable (it is a function of random data, a random sample).
Every random variable has a probability distribution. (We may not know what the distribution is, but, in principle, it has one.)
Therefore the sample mean has a probability distribution, called its
sampling distribution. A sampling distribution
is just a probability distribution like any other; the "sampling" just reminds you it has something to do with a random sample.
This is confusing for several reasons. For one, there are two probability distributions now in play. The random sample X_{1}, X_{2}, … X_{n} is an SRS or is IID from one probability distribution, call it the population distribution. This is the probability distribution of each X_{i} (they all have the same distribution). Now we also have the sampling distribution of the sample mean X̄_{n}. The two distributions can be very different (as will be seen below).

The other reason for confusion is that the population distribution is very visible but the sampling distribution of X̄_{n} is invisible. Make a histogram (or smooth density estimate) of the sample. We know the sample is not the population, so the sample histogram is not the population density curve, but they will be close if the sample size is large (a consequence of the law of large numbers, explained in the preceding section). But for X̄_{n} we have only one number. There is no way to make a distribution, even an estimate of a distribution, from it.
We have to imagine the sampling distribution of X̄_{n} one level up. Not only are the data IID within
a sample, but each sample is identically distributed to all the others
and independent of all the others. For each sample there is a different
sample mean. Imagine every student in the class has taken a different random
sample from the same population. Then each has a different sample mean.
And those sample means constitute an IID sample from
the sampling distribution of the sample mean.
In real life, doing statistics, you have only one sample and hence only one sample mean, one number. To see the distribution, we need either a lot of theory (beyond the scope of this course) or a computer simulation.
The computer can rapidly generate many random samples, calculating the sample
mean of each.
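In R, such a simulation might look like this (a sketch; n and nsim are the names used elsewhere on the page):

```r
# Simulate the sampling distribution of the sample mean:
# draw nsim independent samples of size n, keep each sample mean.
n <- 10        # size of each sample
nsim <- 1000   # number of samples
xbar <- replicate(nsim, mean(rexp(n)))
hist(xbar, freq = FALSE)   # estimates the sampling distribution
```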
There are two distributions under discussion: the population distribution and the sampling distribution of the sample mean. They are not the same.
In symbols we can write

E(X̄_{n}) = μ

meaning the mean of the sampling distribution of X̄_{n} is the mean of the distribution of each X_{i}, which is the population mean μ, and

sd(X̄_{n}) = σ ⁄ √n

meaning the standard deviation of the sampling distribution of X̄_{n} is the standard deviation of the distribution of each X_{i} divided by the square root of n, which is the population standard deviation σ divided by √n.

The two distributions have the same mean, different standard deviations, and, in general, different shape.
The relation between the standard deviations is called the square root law.
distribution | mean | standard deviation
---|---|---
population distribution | μ | σ
sampling distribution of X̄_{n} | μ | σ ⁄ √n
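These facts can be checked by simulation; a sketch assuming the Exponential(1) population used earlier, for which μ = 1 and σ = 1:

```r
# Check the square root law by simulation.
n <- 25
nsim <- 10000
xbar <- replicate(nsim, mean(rexp(n)))
mean(xbar)   # should be close to mu = 1
sd(xbar)     # should be close to sigma / sqrt(n) = 1 / 5 = 0.2
```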
Nowhere on this page do we mention the sample standard deviation s. Because the sample is not the population, the sample standard deviation s is not the population standard deviation σ (though they will be close for large sample sizes). And it certainly is not the standard deviation of the sampling distribution of the sample mean, which is σ ⁄ √n.
Although the two distributions under discussion, the population distribution and the sampling distribution of the sample mean, do not in general have the same shape, we can say something about the shape of the latter.
When the population distribution is exactly normal, then the sampling distribution of the sample mean is exactly normal (regardless of sample size).
When the sample size is large, then the sampling distribution of the sample mean is close to a normal distribution (regardless of the shape of the population distribution).
The rules in the preceding section say which normal distribution: the one with mean μ and standard deviation σ ⁄ √n.
If we have a random sample of size n from a normally distributed population, we know the sampling distribution of the sample mean is exactly normal with mean and standard deviation given in the section about mean and standard deviation of the sample mean.
The simulation below makes a random sample of size n from a normal population and calculates the sample mean. It does this repeatedly nsim times, thus obtaining a random sample from the sampling distribution of the sample mean. The histogram of sample mean values is plotted with a superimposed normal density curve that is the theoretical sampling distribution of the sample mean.
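A sketch of that simulation in R (a standard normal population is assumed here for illustration; the page's code may use other parameter values):

```r
# Sampling distribution of the sample mean for a Normal(mu, sigma)
# population: exactly Normal(mu, sigma / sqrt(n)) for every n.
mu <- 0; sigma <- 1   # population mean and standard deviation (assumed)
n <- 10               # size of each sample
nsim <- 1000          # number of samples
xbar <- replicate(nsim, mean(rnorm(n, mu, sigma)))
hist(xbar, freq = FALSE)
# superimpose the exact sampling distribution
curve(dnorm(x, mu, sigma / sqrt(n)), add = TRUE, col = "red")
```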
If we have a random sample of size n from a non-normally distributed population, we know the sampling distribution of the sample mean is not exactly normal, only approximately normal for large sample sizes, but we do know the mean and standard deviation are exactly as given in the section about mean and standard deviation of the sample mean.
The simulation below makes a random sample of size n from an exponential population and calculates the sample mean. It does this repeatedly nsim times, thus obtaining a random sample from the sampling distribution of the sample mean. The histogram of sample mean values is plotted with a superimposed non-normal density curve that is the theoretical sampling distribution of the sample mean and the normal density curve (red) that is the approximate sampling distribution for large sample sizes.
The simulation below makes a random sample of size n from a bimodal skewed population and calculates the sample mean. It does this repeatedly nsim times, thus obtaining a random sample from the sampling distribution of the sample mean. The histogram of sample mean values is plotted with a superimposed non-normal density curve that is the theoretical sampling distribution of the sample mean and the normal density curve (red) that is the approximate sampling distribution for large sample sizes.
If you change mu
to any number between zero and one,
you make the population distribution more or less skewed.
mu <- 1 / 2
makes a symmetric bimodal population.
If you change sigma.normal
to any positive number,
you make the peaks of the population wider or narrower.
Every calculation that can be made about a normal distribution can be made about the large-n normal approximation to the sampling distribution of X̄_{n}. See the normal distributions page for examples. They can be direct or inverse lookup. They can involve any sort of event: lower tail area, upper tail area, area over an interval.

The only thing you have to be careful about is to use σ ⁄ √n instead of σ. If you goof that up, you will be nowhere near the right answer. (So if you get an answer that seems ridiculous, check that you got the √n in the right place.)
The random variable of interest is the time required to service an air conditioner. We are given μ = 1 hour, σ = 1 hour, and n = 70, and we are to do a direct lookup to get an upper tail area (area to the right) past 50 minutes, which is 50 ⁄ 60 hour. (Be careful to keep all numbers in the same units.)
To do by computer
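A sketch of the computer calculation, assuming the example's givens are μ = 1 hour, σ = 1 hour, and n = 70 (n appears in the inverse problem below; μ and σ are consistent with the answer quoted there):

```r
mu <- 1       # population mean, in hours (assumed given)
sigma <- 1    # population standard deviation, in hours (assumed given)
n <- 70       # sample size
# upper tail area of the sampling distribution past 50 minutes = 50/60 hour
pnorm(50 / 60, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)
```

The argument lower.tail = FALSE gives the area to the right; 1 - pnorm(...) would do the same thing.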
To do by hand, see the textbook (p. 261).
There are two ways to botch this: miss the 50 / 60
issue
or forget sqrt(n)
.
Otherwise it is just a general normal distribution, direct lookup,
area to the right problem.
Suppose instead of the preceding problem we are asked an inverse or backward lookup problem.
What is the x such that 90% of the sampling distribution of the sample mean of the n = 70 air conditioner preventative maintenance times lies below x?
Except for using σ ⁄ √n instead of σ, this problem is just a general normal distribution inverse lookup problem.
We use qnorm
rather than pnorm
.
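A sketch of the calculation (same givens as the direct lookup problem: μ = 1, σ = 1, n = 70):

```r
mu <- 1; sigma <- 1; n <- 70
# x such that 90% of the sampling distribution of the sample mean lies below x
qnorm(0.90, mean = mu, sd = sigma / sqrt(n))
```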
The answer is 1.153175 hours.
If we want to convert to minutes, that's one hour and 9.2 minutes. (But there's no need to do that.)
There are three steps: (0) figure out which normal distribution, (1) backward lookup, (2) unstandardize.
Which Normal Distribution. A problem about the sampling distribution of X̄_{n} is about the normal distribution with mean μ and standard deviation σ ⁄ √n, where μ and σ are the population mean and standard deviation, respectively. In this problem the mean is μ = 1 hour and the standard deviation is σ ⁄ √n = 1 ⁄ √70 ≈ 0.1195 hour.
The calculation here is no different from a forward problem and is done in the textbook's Example 10.7 and also in our redo of this example.
Backward Look Up. Now we need to find numbers bracketing 0.90 in the table. They are found in the following row.
.00 .01 .02 .03 .04 .05 .06 .07 .08 .09
1.2 | .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
So the result of the backward lookup is somewhere between 1.28 and 1.29, closer to the former, say 1.281.
Unstandardize.
Important! Where we used σ in Chapter 3
problems, we use σ ⁄ √n in Chapter 10
problems. Otherwise, no difference. But this is a very important difference.
If you miss the √n, you get nowhere near the right answer.
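The unstandardize step amounts to x = μ + z σ ⁄ √n; with the table value z = 1.281 and the givens μ = 1, σ = 1, n = 70:

```r
z <- 1.281                    # from the backward table lookup
mu <- 1; sigma <- 1; n <- 70
mu + z * sigma / sqrt(n)      # about 1.1531 hours, agreeing with qnorm
```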
Regardless of what the population distribution is, no matter how skewed, or how non-normal looking, the sampling distribution of the sample mean is nearly normal for sufficiently large sample size.
There is no reason to expect the sampling distribution of X̄_{n} to be anything like normal for small sample sizes, unless the population distribution is nearly normal.

The more skewed the population distribution, the larger the sample size needed for the central limit theorem to give a decent approximation.
There is no magic sample size where the central limit theorem kicks in
,
contrary to the impression given by many introductory textbooks (but not ours).
It all depends on the population distribution, in particular how skewed and
how heavy tailed it is.