Statistics 3011 (Geyer and Jones, Spring 2006) Examples: Sampling Distributions

Motivation

This one web page explains the big idea in statistics. All of statistics is about this idea (sampling distributions) in one way or another.

If you understand this web page, you understand statistics. This is the key idea. Everything else follows from this one idea.

It may not bring world peace, but it is the only really good idea anyone ever had for dealing with data from processes that you cannot make give the same result every time — that have inherent variability.

The Law of Large Numbers

The law of large numbers says the sample mean is close to the population mean, with high probability, for sufficiently large sample sizes.

It does not say the sample mean is close to the population mean for any particular sample size (just for some, perhaps very, very large sample sizes). Nor does it give any sharp bound on the difference between sample and population mean: no matter how large the sample size is, very large differences may be possible (with very small probability).

The law of large numbers is a generalization of what we saw for coin flips. The point is that the law of large numbers works for simple random samples (SRS) from any population or for independent and identically distributed (IID) samples from any probability distribution. It's not just for coin flips.

Averages

Here we use a population distribution that is very skewed, called the exponential distribution, which is of no particular interest in itself; we use it only because it is easy to simulate. The R function rexp (see its on-line help) does that.

The population mean for this distribution happens to be exactly 1, which is the horizontal line in the plot. The plot shows the sample mean getting closer and closer (in probability) as the sample size increases.

Every time you resubmit, there will be a different picture. The random sample simulated by the rexp function will be different every time.

You can change the sample size n to see what happens.
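Something like the following R code makes such a picture (the particular n and the plotting choices here are illustrative, not necessarily what the original form used).

    # IID sample of size n from the exponential distribution
    n <- 1000
    x <- rexp(n)
    # running sample means: mean of the first k observations, k = 1, ..., n
    xbar <- cumsum(x) / (1:n)
    plot(1:n, xbar, type = "l", xlab = "sample size", ylab = "sample mean")
    # horizontal line at the population mean, which is exactly 1
    abline(h = 1, lty = 2)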

The section immediately following and the section using the exponential distribution in the central limit theorem show the density curve for this population distribution.

Proportions and Histograms

In the preceding section we saw that sample means are close to population means for large sample sizes.

A sample proportion is a special case of a sample mean.

If X1, X2, … Xn is a sample and A is an event, define Yi to be 1 if Xi is in A and 0 otherwise. Then Ȳ, the sample mean of the Yi, is the sample proportion for the event A: the number of Xi in A divided by the sample size.

In this case, the law of large numbers says Ȳ is close to P(A) for sufficiently large sample sizes. The coin flip process is an example of this (the sample mean for any zero-or-one-valued random variable is a sample proportion).
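A small R illustration, with the event A = { x : x ≤ 1 } chosen arbitrarily:

    x <- rexp(1000)
    # indicator variables: 1 if the observation is in A, 0 otherwise
    y <- as.numeric(x <= 1)
    mean(y)   # the sample proportion for A
    pexp(1)   # P(A) for the exponential distribution, about 0.632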

When applied to several events, the law of large numbers says that the area of each bar of a histogram is close to the corresponding area under the density curve for the population distribution (for large n), and since the bar and the region under the curve have the same width, they must have nearly the same height.

Thus the law of large numbers says the histogram for an SRS or IID sample is close to the population density curve for sufficiently large sample sizes. Let's try that out. Again we use the exponential distribution.

An exponential random variable is strictly positive. The density curve has a jump exactly at zero. It is zero to the left of zero and 1 just to the right of zero. This is a very skewed distribution.

Increase n to see the histogram (black) get closer to the population density curve (red).
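Something like the following R code makes such a picture (again, the particular n is illustrative).

    n <- 1000
    x <- rexp(n)
    # probability = TRUE puts the histogram on the density scale
    hist(x, probability = TRUE)
    # superimpose the population density curve (red)
    curve(dexp(x), add = TRUE, col = "red")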

Levels of Explanation

The sample mean is a random variable (it is a function of random data, a random sample).

Every random variable has a probability distribution. (We may not know what the distribution is, but, in principle, it has one.)

Therefore the sample mean has a probability distribution, called its sampling distribution. A sampling distribution is just a probability distribution like any other; the word sampling just reminds you that it has something to do with a random sample.

This is confusing for several reasons. For one, there are two probability distributions now in play. The random sample X1, X2, … Xn is an SRS or is IID from one probability distribution, call it the population distribution. This is the probability distribution of each Xi (they all have the same distribution). Now we also have the sampling distribution of the sample mean X̄. The two distributions can be very different (as will be seen below).

The other reason for confusion is that the population distribution is very visible but the sampling distribution of X̄ is invisible. You can make a histogram (or smooth density estimate) of the sample. We know the sample is not the population, so the sample histogram is not the population density curve, but they will be close if the sample size is large (a consequence of the law of large numbers, explained in the preceding section). But for X̄ we have only one number. There is no way to make a distribution, or even an estimate of a distribution, from it.

We have to imagine the sampling distribution of X̄. We have to imagine many different samples, which are IID one level up. Not only are the data IID within a sample, but each sample is identically distributed to all the others and independent of all the others. For each sample there is a different sample mean. Imagine every student in the class has taken a different random sample from the same population. Then each has a different sample mean. And those sample means constitute an IID sample from the sampling distribution of the sample mean.

In real life, doing statistics, you have only one number X̄. It requires an act of imagination to think how it would be different if you had a different random sample. To see the distribution, we need either a lot of theory (beyond the scope of this course) or a computer simulation. The computer can rapidly generate many random samples, calculating the sample mean of each.
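In R the act of imagination takes only a couple of lines; a minimal sketch (the sample size 50 and the 1000 repetitions are arbitrary choices):

    # 1000 samples, each of size 50; record each sample mean
    xbar <- replicate(1000, mean(rexp(50)))
    # xbar is now an IID sample from the sampling distribution of the sample mean
    hist(xbar)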

The Central Limit Theorem

Mean and Standard Deviation of the Sample Mean

There are two distributions under discussion: the population distribution and the sampling distribution of the sample mean. They are not the same.

In symbols we can write

mean(X̄) = mean(Xi) = μ

meaning the mean of the distribution of X̄ is the mean of the distribution of each Xi, which is the population mean μ, and

sd(X̄) = sd(Xi) ⁄ √n = σ ⁄ √n

meaning the standard deviation of the distribution of X̄ is the standard deviation of the distribution of each Xi divided by the square root of n, which is the population standard deviation σ divided by √n.

The distributions have the same mean, different standard deviations, and, in general, different shape.

The relation between the standard deviations is called the square root law.

Summary

    distribution                    mean    standard deviation
    population distribution         μ       σ
    sampling distribution of X̄      μ       σ ⁄ √n
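These facts are easy to check by simulation. A sketch, again using the exponential distribution, for which μ = 1 and σ = 1:

    n <- 25
    xbar <- replicate(10000, mean(rexp(n)))
    mean(xbar)   # close to the population mean, 1
    sd(xbar)     # close to sigma / sqrt(n) = 1 / 5 = 0.2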

Caution

Nowhere on this page do we use the sample standard deviation s. Because the sample is not the population, the sample standard deviation s is not the population standard deviation σ (though they will be close for large sample sizes). And it certainly is not the standard deviation of the sampling distribution of the sample mean, which is σ ⁄ √n.

Sampling Distribution of the Sample Mean

Although the two distributions under discussion, the population distribution and the sampling distribution of the sample mean, do not, in general, have the same shape, we can say something about the shape of the latter.

When the population distribution is exactly normal, then the sampling distribution of the sample mean is exactly normal (regardless of sample size).

When the sample size is large, then the sampling distribution of the sample mean is close to a normal distribution (regardless of the shape of the population distribution).

The rules in the preceding section say which normal distribution: the one with mean μ and standard deviation σ ⁄ √n.

Normal Population Distribution

If we have a random sample of size n from a normally distributed population, we know the sampling distribution of the sample mean is exactly normal with mean and standard deviation given in the section about mean and standard deviation of the sample mean.

The simulation below makes a random sample of size n from a normal population and calculates the sample mean. It does this repeatedly nsim times, thus obtaining a random sample from the sampling distribution of the sample mean. The histogram of sample mean values is plotted with a superimposed normal density curve that is the theoretical sampling distribution of the sample mean.
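A sketch of such a simulation (the particular μ, σ, n, and nsim are illustrative choices, not necessarily the original form's defaults):

    n <- 10
    nsim <- 1000
    mu <- 0
    sigma <- 1
    xbar <- replicate(nsim, mean(rnorm(n, mu, sigma)))
    hist(xbar, probability = TRUE)
    # exact sampling distribution: normal, mean mu, sd sigma / sqrt(n)
    curve(dnorm(x, mu, sigma / sqrt(n)), add = TRUE)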

Exponential Population Distribution

If we have a random sample of size n from a non-normally distributed population, we know the sampling distribution of the sample mean is not exactly normal, only approximately normal for large sample sizes, but we do know the mean and standard deviation are exactly as given in the section about mean and standard deviation of the sample mean.

The simulation below makes a random sample of size n from an exponential population and calculates the sample mean. It does this repeatedly nsim times, thus obtaining a random sample from the sampling distribution of the sample mean. The histogram of sample mean values is plotted with a superimposed non-normal density curve that is the theoretical sampling distribution of the sample mean and the normal density curve (red) that is the approximate sampling distribution for large sample sizes.
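A sketch of such a simulation. One standard fact not stated above: the mean of n IID standard exponential variables has exactly a gamma distribution with shape n and rate n, which is what the theoretical (black) curve uses.

    n <- 10
    nsim <- 1000
    xbar <- replicate(nsim, mean(rexp(n)))
    hist(xbar, probability = TRUE)
    # exact sampling distribution: gamma, shape n, rate n (black)
    curve(dgamma(x, shape = n, rate = n), add = TRUE)
    # large-sample normal approximation, mean 1, sd 1 / sqrt(n) (red)
    curve(dnorm(x, 1, 1 / sqrt(n)), add = TRUE, col = "red")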

Bimodal Skewed Population Distribution

If we have a random sample of size n from a non-normally distributed population, we know the sampling distribution of the sample mean is not exactly normal, only approximately normal for large sample sizes, but we do know the mean and standard deviation are exactly as given in the section about mean and standard deviation of the sample mean.

The simulation below makes a random sample of size n from a bimodal skewed population and calculates the sample mean. It does this repeatedly nsim times, thus obtaining a random sample from the sampling distribution of the sample mean. The histogram of sample mean values is plotted with a superimposed non-normal density curve that is the theoretical sampling distribution of the sample mean and the normal density curve (red) that is the approximate sampling distribution for large sample sizes.
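The page does not record which bimodal skewed population the original simulation used, so the sketch below assumes, purely for illustration, a two-component normal mixture. Only the normal approximation (red) is superimposed, because the exact sampling distribution of the mean of a mixture is not simple.

    n <- 10
    nsim <- 1000
    # hypothetical population: N(0, 1) with probability 0.7, N(4, 0.5) with probability 0.3
    rpop <- function(n) {
        k <- rbinom(n, 1, 0.3)
        rnorm(n, mean = 4 * k, sd = ifelse(k == 1, 0.5, 1))
    }
    mu <- 0.3 * 4                                         # population mean, 1.2
    sigma <- sqrt(0.7 * 1 + 0.3 * (0.5^2 + 4^2) - mu^2)   # population sd, about 2.03
    xbar <- replicate(nsim, mean(rpop(n)))
    hist(xbar, probability = TRUE)
    curve(dnorm(x, mu, sigma / sqrt(n)), add = TRUE, col = "red")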

Example Calculations

Every calculation that can be made about a normal distribution can be made about the large n normal approximation to the sampling distribution of X̄. See the normal distributions page for examples. They can be direct or inverse lookup. They can involve any sort of event: lower tail area, upper tail area, area over an interval.

The only thing you have to be careful about is to use σ ⁄ √n instead of σ. If you goof that up, you will be nowhere near the right answer. (So if you get an answer that seems ridiculous, check that you got the √n in the right place.)

Example 10.7 from the Book

The random variable of interest is time required to service an air conditioner. We are given

μ = 1 (hour)
σ = 1 (hour)
n = 70

and we are to do a direct lookup to get an upper tail area (area to right) past 50 minutes, which is 50 / 60 hours. (Be careful to keep all numbers in the same units).

To do by computer, we use pnorm.
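A sketch, with the values given above (keeping everything in hours):

    mu <- 1
    sigma <- 1
    n <- 70
    # P(sample mean > 50 minutes = 50 / 60 hour), upper tail area
    pnorm(50 / 60, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)

This gives about 0.918.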

To do by hand, see the textbook (p. 261).

There are two ways to botch this: miss the 50 / 60 issue or forget sqrt(n). Otherwise it is just a general normal distribution, direct lookup, area to the right problem.

An Inverse Problem Related to Example 10.7 from the Book

Suppose instead of the preceding problem we are asked an inverse or backward lookup problem.

What is the x such that 90% of the sampling distribution of the sample mean of the n = 70 air conditioner preventative maintenance times lies below x?

By Computer

Except for using σ ⁄ √n instead of σ, this problem is just a general normal distribution inverse lookup problem.

We use qnorm rather than pnorm.
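A sketch of the computation, with the values given above:

    mu <- 1
    sigma <- 1
    n <- 70
    # x such that 90% of the sampling distribution of the sample mean lies below x
    qnorm(0.90, mean = mu, sd = sigma / sqrt(n))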

The answer is 1.153175 hours.

If we want to convert to minutes, that's one hour and 9.2 minutes. (But there's no need to do that.)

By Hand

There are three steps: (0) figure out which normal distribution, (1) backward look up, (2) unstandardize.

Which Normal Distribution. A problem about the sampling distribution of X̄ is about the normal distribution with mean μ and standard deviation σ ⁄ √n, where μ and σ are the population mean and standard deviation, respectively. In this problem

μ = 1 (hour)

and the standard deviation is

σ ⁄ √n = 1 ⁄ √70 = 0.12 (hour)

The calculation here is no different from a forward problem and is done in the textbook's Example 10.7 and also in our redo of this example.

Backward Look Up. Now we need to find numbers bracketing 0.90 in the table. They are found in the following row.

         .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
  1.2 | .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015

So the result of the backward lookup is somewhere between 1.28 and 1.29, closer to the former, say 1.281.

Unstandardize.

x = μ + (σ ⁄ √n) z = 1 + 0.12 × 1.281 = 1.154

Important! Where we used σ in Chapter 3 problems, we use σ ⁄ √n in Chapter 10 problems. Otherwise, no difference. But this is a very important difference. If you miss the √n, you get nowhere near the right answer.

Summary

Regardless of what the population distribution is, no matter how skewed, or how non-normal looking, the sampling distribution of the sample mean is nearly normal for sufficiently large sample size.

There is no reason to expect the sampling distribution of X̄ to be anything like normal for small sample sizes, unless the population distribution is nearly normal.

The more skewed the population distribution, the larger the sample size needed for the central limit theorem to be a decent approximation.

There is no magic sample size where the central limit theorem kicks in, contrary to the impression given by many introductory textbooks (but not ours). It all depends on the population distribution, in particular how skewed and how heavy tailed it is.