# Stat 3011 (Geyer) In-Class Examples (Chapter 6)

## General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions (that's already done for you). You do not have to select a dataset (that's already done for you).

## Standardized Histograms (Sec 6.1.1 in Wild and Seber)

To make what Wild and Seber call a "standardized histogram" in R or Rweb, add the optional argument `probability=TRUE` or `freq=FALSE` (for some strange reason, known only to the author of the `hist` function, either one works and both do exactly the same thing).

For an example we will use the data in the file gundeath.txt because the data Wild and Seber use in their Section 6.1.1 isn't on-line.

That the area under the histogram is indeed equal to one can be seen by getting the raw numbers used to plot the histogram with the `plot=FALSE` optional argument to the `hist` function.

The `breaks` component of the result gives the bin boundaries, hence `diff(out$breaks)` gives the box widths. The `density` component of the result (called `intensities` in older versions of R) gives the box heights. Multiplying gives the box areas. Summing gives the area under the whole histogram.
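The Rweb form for this calculation is not reproduced here. A minimal sketch of the steps just described, using simulated stand-in data rather than gundeath.txt (the variable `x` and the gamma parameters are assumptions made only for illustration):

```r
# Stand-in data (assumption): any numeric vector works in place of the
# gundeath.txt variable, which is not reproduced in this section.
x <- rgamma(200, shape = 3, rate = 1)

# plot = FALSE returns the numbers behind the histogram without drawing it.
out <- hist(x, plot = FALSE)

widths  <- diff(out$breaks)  # box widths
heights <- out$density       # box heights on the probability scale
                             # (older versions of R also called this `intensities`)
area <- sum(widths * heights)
print(area)
```

The box heights are computed whether or not the histogram is drawn, so `probability=TRUE` is not needed when `plot=FALSE` is used.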

Note: You don't have to know how to do this. The result would always be one. So there's no point in calculating it. The only point of the example is to show that the area under the histogram really is exactly 1.000.

Now we return to the histogram, and add an "approximating curve" with the function `sm.density`. First let's see what this function does to the data. It draws a smooth approximation to the histogram.

Putting both the histogram and its smooth approximation on the same plot is a bit tricky. It uses the `add=TRUE` optional argument to the `sm.density` function.

Hmmmmm. Not the "smooth approximation" I would draw. Looks oversmoothed to me. But who am I to argue with the computer? Anyway, I don't understand the details of the `sm.density` function enough to make it draw a better smooth approximation.
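For readers without the add-on package that provides `sm.density`, a similar combined plot can be sketched with the base R function `density`. This is a swapped-in technique, not the method used above, and the data are again a simulated stand-in (an assumption). The `adjust` argument is one way to ask for a rougher or smoother curve.

```r
# Stand-in data (assumption), not the gundeath.txt variable.
x <- rgamma(200, shape = 3, rate = 1)

# Histogram on the probability scale, so that its total area is one.
hist(x, probability = TRUE)

# Kernel density estimate from base R, drawn on top of the histogram.
d <- density(x)
lines(d)

# adjust < 1 shrinks the bandwidth and gives a less smooth curve.
lines(density(x, adjust = 0.5), lty = 2)
```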

## Probability Models Defined by Density Curves (Sec. 6.1.2 in Wild and Seber)

Now we do the same trick as in the last section, a histogram and a smooth curve on the same plot, but with a twist.

• Now the smooth curve defines a theoretical probability model.
• The data are simulated by the computer so that the probability of seeing a point in any particular interval is the area under the curve over that interval.

The actual probability model doesn't matter. I just picked one that looks like the one Wild and Seber used in their Figure 6.1.2 (p. 235), that is, unimodal and skewed. We won't actually learn anything about this particular model.

If you rerun this several times, you will get a different picture each time, because the computer's simulated data really behaves as if it were random.

Another thing to try is to change the data size assigned in the first statement.
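The Rweb form is not reproduced here, and the model it used is not named in this section. As a sketch only, the following assumes a gamma distribution with shape 3 and rate 1, which is unimodal and skewed; any similar model would serve.

```r
n <- 100  # data size; change this and rerun

# Assumed model: gamma with shape 3 and rate 1 (unimodal, right-skewed).
x <- rgamma(n, shape = 3, rate = 1)

# Histogram of the simulated data on the probability scale.
hist(x, probability = TRUE)

# Theoretical density curve of the same model, added to the same plot.
curve(dgamma(x, shape = 3, rate = 1), add = TRUE)
```

With a larger `n` the histogram should hug the curve more closely.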

## The Normal Distribution (Sec. 6.2 in Wild and Seber)

Now we do the same trick as in the last section, but with the normal distribution. In this example we will use the normal distribution with mean 10 and standard deviation 3.

If you rerun this several times, you will get a different picture each time, because the computer's simulated data really behaves as if it were random.

Another thing to try is to change the data size assigned in the first statement.
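A sketch of the same plot for the normal distribution with mean 10 and standard deviation 3. The data size `n` is an assumption, since the original form is not reproduced here.

```r
n <- 100  # data size; change this and rerun

# Simulated data from the normal distribution with mean 10 and sd 3.
x <- rnorm(n, mean = 10, sd = 3)

# Histogram on the probability scale with the normal density curve on top.
hist(x, probability = TRUE)
curve(dnorm(x, mean = 10, sd = 3), add = TRUE)
```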

### Calculating Probabilities for the Normal Distribution (Sec. 6.2.2 in Wild and Seber)

#### Lower-Tail Probabilities

The function

$$F(x) = \mathrm{pr}(X \le x)$$

is called the cumulative distribution function (CDF) of a probability model. It gives lower-tail probabilities.

The R function that gives the CDF of a normal distribution is `pnorm`.

• By default `pnorm` calculates the CDF of the standard normal distribution (mean 0.0 and sd 1.0).
• To calculate the CDF of a different normal distribution, use the optional arguments `mean` and `sd` to supply the mean and standard deviation.

For example, we do the problem in Figure 6.2.3 on p. 240 in Wild and Seber. Either of the following lines calculates the answer (0.8194424). The first line uses the names of the optional arguments. The second line uses the fact that the mean is the second argument and the sd the third if no names are used. Either works. Your choice.
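The numbers for this problem are not restated in this section. As an assumption, the sketch below borrows the mean 174 and standard deviation 6.57 from the Figure 6.2.4 problem later in this section and evaluates the CDF at 180. These inputs agree with the answer quoted above, but they are a guess rather than values taken from Figure 6.2.3.

```r
# Assumed inputs (see the note above): x = 180, mean = 174, sd = 6.57.

# Named optional arguments.
p1 <- pnorm(180, mean = 174, sd = 6.57)

# Positional arguments: mean is second, sd is third.
p2 <- pnorm(180, 174, 6.57)

print(p1)
print(p2)
```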

#### Upper-Tail Probabilities

By the complement rule

$$1 - F(x) = \mathrm{pr}(X > x)$$

so upper-tail probabilities can also be easily obtained using the CDF (just subtract the lower tail from one).

The problem in Figure 6.2.5 in Wild and Seber (p. 243): what is pr(X > 25) if X has a normal distribution with mean 27.3 and standard deviation of 4.1?

Answer: (calculated by Rweb) 0.7125929.
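A sketch of the calculation, using the complement rule as described above. The `lower.tail = FALSE` form is an equivalent alternative offered by `pnorm`; it is not mentioned in the text and is shown only for comparison.

```r
# Upper tail by the complement rule: one minus the lower tail.
p <- 1 - pnorm(25, mean = 27.3, sd = 4.1)
print(p)

# The same upper-tail probability requested directly from pnorm.
p_direct <- pnorm(25, mean = 27.3, sd = 4.1, lower.tail = FALSE)
print(p_direct)
```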

#### Probabilities of Intervals

If a < b, then the events $X \le a$ and $a < X \le b$ are mutually exclusive (X can't be both at most a and greater than a). Thus by the addition rule,

$$\mathrm{pr}(X \le b) = \mathrm{pr}(X \le a) + \mathrm{pr}(a < X \le b)$$

or, moving one term from one side of the equation to the other,

$$\mathrm{pr}(a < X \le b) = \mathrm{pr}(X \le b) - \mathrm{pr}(X \le a) = F(b) - F(a)$$

Thus the probability of an interval is the difference of the values of the CDF at the endpoints.

The problem in Figure 6.2.4 in Wild and Seber (p. 242): what is pr(160 < X < 180) if X has a normal distribution with mean 174 and standard deviation of 6.57?

Answer: (calculated by Rweb) 0.8028936.
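A sketch of the calculation as a difference of two CDF values. The variable names are chosen here for illustration.

```r
mu    <- 174   # mean
sigma <- 6.57  # standard deviation

# pr(160 < X < 180) = F(180) - F(160)
p_interval <- pnorm(180, mu, sigma) - pnorm(160, mu, sigma)
print(p_interval)
```

Because the normal distribution is continuous, it makes no difference whether the endpoints are included in the interval.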

### The Inverse Problem: Percentiles and Quantiles (Sec. 6.2.3 in Wild and Seber)

#### Quantiles

Recall that the CDF of a probability distribution

$$F(x) = \mathrm{pr}(X \le x)$$

gives lower-tail probabilities.

The inverse CDF is the function that goes the other way. If

$$F(x) = p$$

then

$$F^{-1}(p) = x$$

The R function that gives the inverse CDF of a normal distribution is `qnorm`. As with `pnorm`, the optional arguments `mean` and `sd` supply the mean and standard deviation. (And the defaults are `mean=0` and `sd=1`.)

Another way to think of what `qnorm` does is that, given a p between zero and one, it solves

$$F(x) = p$$

for x. The solution is called the p-th quantile of the normal distribution. It is the point x such that

$$\mathrm{pr}(X < x) = p$$

and

$$\mathrm{pr}(X > x) = 1 - p$$

Question (from Figure 6.2.6 in Wild and Seber): What is the 0.8 quantile of women's heights, assuming the heights follow a normal distribution with mean 162.7 (centimeters) and standard deviation 6.2 (centimeters)?

Solution:

Answer: (calculated by Rweb) 167.9181.
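A sketch of the calculation, followed by a check that `pnorm` takes the quantile back to the probability we started with. The variable name `q80` is chosen here for illustration.

```r
# The 0.8 quantile of a normal distribution with mean 162.7 and sd 6.2.
q80 <- qnorm(0.8, mean = 162.7, sd = 6.2)
print(q80)

# Going the other way: the lower-tail probability at that point.
print(pnorm(q80, mean = 162.7, sd = 6.2))
```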

Note that

• the result of `pnorm` is always between zero and one because it is a probability.
• the (first) argument of `qnorm` is always between zero and one because it is a probability.

#### Percentiles

A percentile is just a quantile for a probability expressed as a percent rather than as a number between zero and one.

• The 25-th percentile is the 0.25 quantile (the lower quartile).
• The 50-th percentile is the 0.50 quantile (the median).
• The 85-th percentile is the 0.85 quantile.

And so forth. See the Quantiles section above.

#### Central Ranges

Question (from Figure 6.2.8 in Wild and Seber): What is the range of heights that contains the central 50% of women's heights, assuming the heights follow a normal distribution with mean 162.7 (centimeters) and standard deviation 6.2 (centimeters)?

Solution: The first step in the analysis is to see that we want 25% of heights below the range we are to find and 25% above (leaving 50% in the middle). Thus the lower end of the range is the 25-th percentile and the upper end is the 75-th percentile. (Why 75-th? Because 25% plus 50% is below this point.)

Answer: 158.5182 (25-th percentile) and 166.8818 (75-th percentile).
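A sketch of the calculation. Since `qnorm` accepts a vector of probabilities, both ends of the range can be found in one call; the variable name `central` is chosen here for illustration.

```r
# 25-th and 75-th percentiles of a normal distribution
# with mean 162.7 and sd 6.2.
central <- qnorm(c(0.25, 0.75), mean = 162.7, sd = 6.2)
print(central)

# Check: the probability between the two ends should be one half.
print(diff(pnorm(central, mean = 162.7, sd = 6.2)))
```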

## Sums and Differences of Random Quantities (Section 6.4 in Wild and Seber)

This section illustrates the behavior of sums and differences of random variables discussed in Section 6.4 of Wild and Seber.

In the first five lines below, the computer simulates two datasets `x` and `y` of size `n`. They are both normally distributed with mean `mu` and standard deviation `sigma` and are statistically independent.

The rest of the lines, except the last, calculate various means and standard deviations for comparison with the formulas given in the book.

The last line gives one of those theoretical values. What is it? What are the other theoretical values? That is, what theoretical values are `mean(x + y)` and `mean(x - y)` supposed to be near?
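The Rweb form itself is not reproduced in this section. The following is a sketch consistent with the description above; the particular values of `n`, `mu`, and `sigma` are assumptions.

```r
n <- 10000   # data size (assumed value)
mu <- 10     # common mean (assumed value)
sigma <- 3   # common standard deviation (assumed value)
x <- rnorm(n, mu, sigma)
y <- rnorm(n, mu, sigma)  # generated separately, so independent of x

# Means of the sum and the difference.
mean(x + y)
mean(x - y)

# Standard deviations of each variable, the sum, and the difference.
sd(x)
sd(y)
sd(x + y)
sd(x - y)

# A theoretical value for comparison.
sqrt(2) * sigma
```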

Now the same thing, but with pictures rather than numbers.

## Review of Normal Distribution Calculations

### Direct (Forward) Problems

Direct or forward problems look up probabilities (areas under the density curve) using the CDF (cumulative distribution function) of the probability model

$$F(x) = \mathrm{pr}(X \le x)$$

The R function that evaluates the normal CDF is `pnorm`.

#### Lower-Tail Probabilities

The area to the left of a, that is, $\mathrm{pr}(X < a)$, is

`pnorm(a, ...)`

where the `...` stands for the other arguments that specify the mean and standard deviation of the normal distribution in question. For example, if we want the area to the left of 100 for a normal distribution with mean 50 and standard deviation 20, we calculate

`pnorm(100, 50, 20)`

#### Upper-Tail Probabilities

Upper-tail probabilities are related to lower-tail probabilities by the complement rule

$$\mathrm{pr}(X > a) = 1 - \mathrm{pr}(X < a)$$

So the area to the right of a, that is, $\mathrm{pr}(X > a)$, is

`1 - pnorm(a, ...)`

where the `...` stands for the other arguments that specify the mean and standard deviation of the normal distribution in question. For example, if we want the area to the right of 100 for a normal distribution with mean 50 and standard deviation 20, we calculate

`1 - pnorm(100, 50, 20)`

#### Probabilities of Intervals

Probabilities of intervals are calculated by the rule

$$\mathrm{pr}(a < X < b) = F(b) - F(a)$$

where, as usual, F(x) denotes the CDF. So this is calculated in R by

`pnorm(b, ...) - pnorm(a, ...)`

For example, if we want the area between 75 and 100 for a normal distribution with mean 50 and standard deviation 20, then we calculate

`pnorm(100, 50, 20) - pnorm(75, 50, 20)`

### Inverse (Backward) Problems

Inverse or backward problems look up values of the variable corresponding to specific probabilities (areas under the density curve) using the inverse CDF of the probability model, which is the function that goes in the direction opposite to the CDF. If

$$F(x) = p$$

then

$$F^{-1}(p) = x$$

The R function that evaluates the normal inverse CDF is `qnorm`.

#### Playing Jeopardy

The relation of an inverse problem to a direct problem is like the relation of Jeopardy to an ordinary game show.

The function `pnorm` answers the question

Question: What is the lower tail area to the left of a?
Answer: `pnorm(a, ...)`
where the . . . indicates the other arguments that specify the mean and standard deviation of the normal distribution in question.

Conversely, the function `qnorm` finds the question that has a specified probability as the answer.

Question: What is a such that the lower tail area to the left of a is equal to p (the p-th quantile of the distribution)?
Answer: `qnorm(p, ...)`
where the . . . indicates the other arguments that specify the mean and standard deviation of the normal distribution in question.

#### Quantiles and Percentiles

All of the following questions have the same answer.

Question: What is a such that the lower tail area to the left of a is equal to p?
Question: What is a such that the upper tail area to the right of a is equal to 1 - p?
Question: What is the p-th quantile?
Question: What is the 100p-th percentile?
Answer: `qnorm(p, ...)`

For example, for a normal distribution with mean 50 and standard deviation 20,

Question: What is a such that the lower tail area to the left of a is equal to 0.65?
Question: What is a such that the upper tail area to the right of a is equal to 0.35?
Question: What is the 0.65 quantile?
Question: What is the 65-th percentile?
Answer: `qnorm(0.65, 50, 20)`

Note especially that in forward problems you first do the look-up and then apply the complement rule, `1 - pnorm(...)`. But in backward problems you first apply the complement rule and then do the look-up:

Question: What is a such that the upper tail area to the right of a is equal to 0.35?
Answer: `qnorm(1 - 0.35, 50, 20)`

If you find yourself doing `1 - qnorm(...)`, stop! That never makes sense. The argument of `qnorm` is a probability, so it makes sense to supply one minus a probability as the argument. The result of `qnorm` is not a probability, so it never makes sense to calculate one minus the result.
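The backward-problem rule can be checked with a short sketch for the mean 50, standard deviation 20 example. The variable names are chosen here for illustration, and the `lower.tail = FALSE` form is an equivalent alternative offered by `qnorm` that the text above does not use.

```r
mu    <- 50
sigma <- 20

a1 <- qnorm(0.65, mu, sigma)                      # lower-tail area 0.65
a2 <- qnorm(1 - 0.35, mu, sigma)                  # complement rule applied first
a3 <- qnorm(0.35, mu, sigma, lower.tail = FALSE)  # upper-tail area given directly
print(c(a1, a2, a3))

# Forward check: areas to the left and to the right of that point.
print(pnorm(a1, mu, sigma))
print(1 - pnorm(a1, mu, sigma))
```

All three calls should agree, and the forward check should recover the two tail areas.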