University of Minnesota, Twin Cities School of Statistics Stat 3011 Rweb Textbook (Wild and Seber)
To make what Wild and Seber call a "standardized histogram" in R or Rweb, add the optional argument probability=TRUE or freq=FALSE (for some strange reason, known only to the author of the hist function, either one works and both do exactly the same thing).
For an example we will use the data in the file gundeath.txt because the data Wild and Seber use in their Section 6.1.1 isn't on-line.
That the area under the histogram is indeed equal to one can be seen by getting the raw numbers used to plot the histogram with the plot=FALSE optional argument to the hist function.
The breaks component of the result gives the bin boundaries, hence diff(out$breaks) gives the box widths (here out is the object returned by hist). The intensities component of the result gives box heights (current R documentation calls this component density). Multiplying gives the box areas. Summing gives the area under the whole histogram.
Note: You don't have to know how to do this. The result would always be one. So there's no point in calculating it. The only point of the example is to show that the area under the histogram really is exactly 1.000.
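The area check described above might look something like the following sketch. The data here are simulated stand-ins (an assumption, since the contents of gundeath.txt are not shown), and the sketch uses the density component for the box heights.

```r
# Stand-in data: simulated values, NOT the gundeath.txt data (assumption)
set.seed(1)
x <- rexp(200, rate = 0.1)

# Get the numbers behind the histogram without drawing it
out <- hist(x, plot = FALSE)

widths  <- diff(out$breaks)   # box widths, from the bin boundaries
heights <- out$density        # box heights on the probability scale
areas   <- widths * heights   # one area per box

total_area <- sum(areas)      # area under the whole histogram
print(total_area)
```

Whatever data are supplied, the printed total should be one up to floating-point rounding.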
Now we return to the histogram, and add an "approximating curve" with the function sm.density. First let's see what this function does to the data. It draws a smooth approximation to the histogram.
Putting both the histogram and its smooth approximation on the same plot is a bit tricky. It uses the add=TRUE optional argument to the sm.density function.
Hmmmmm. Not the "smooth approximation" I would draw. Looks oversmoothed to me. But who am I to argue with the computer? Anyway, I don't understand the details of the sm.density function enough to make it draw a better smooth approximation.
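One way to experiment with the amount of smoothing is sketched here with base R's density function rather than sm.density. This is a swapped-in technique, not what the example above uses. The data are simulated stand-ins, and the choice adjust = 0.5 is arbitrary.

```r
# Stand-in data: simulated values, NOT the gundeath.txt data (assumption)
set.seed(1)
x <- rexp(200, rate = 0.1)

hist(x, probability = TRUE)            # standardized histogram

d_default <- density(x)                # kernel estimate, default bandwidth
d_rougher <- density(x, adjust = 0.5)  # half the default bandwidth: less smoothing

lines(d_default)                       # solid curve
lines(d_rougher, lty = 2)              # dashed curve, follows the bars more closely
```

Values of adjust below one give a rougher curve and values above one a smoother curve, so it may take a few tries to find a curve that looks right.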
Now we do the same trick as in the last section, a histogram and a smooth curve on the same plot, but with a twist.
If you rerun this several times, you will get a different picture each time, because the computer's simulated data really behaves as if it were random.
Another thing to try is to change the data size assigned in the first statement.
Now we do the same trick as in the last section, but with the normal distribution. In this example we will use the normal distribution with mean 10 and standard deviation 3.
If you rerun this several times, you will get a different picture each time, because the computer's simulated data really behaves as if it were random.
Another thing to try is to change the data size assigned in the first statement.
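The code for this example is not shown here, so the following is only a sketch of what it might look like. It assumes the smooth curve is the theoretical normal density, drawn with curve and dnorm, and the data size n = 100 is an arbitrary choice.

```r
n <- 100                                        # data size; try other values
x <- rnorm(n, mean = 10, sd = 3)                # simulated normal data

hist(x, probability = TRUE)                     # standardized histogram
curve(dnorm(x, mean = 10, sd = 3), add = TRUE)  # theoretical density on top
```

No random seed is set, so each run draws new data and a somewhat different histogram, while the curve stays the same.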
The function
F(x) = pr(X <= x)
is called the cumulative distribution function (CDF) of a probability model. It gives lower-tail probabilities.
The R function that gives the CDF of a normal distribution is pnorm. By default, pnorm calculates the CDF of the standard normal distribution (mean 0.0 and sd 1.0). Use the optional arguments mean and sd to supply the mean and standard deviation.
For example, we do the problem in Figure 6.2.3 on p. 240 in Wild and Seber. Either of the following lines calculates the answer (0.8194424). The first line uses the names of the optional arguments. The second line uses the fact that the mean is the second argument and the sd the third if no names are used. Either works. Your choice.
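A sketch of the two calling styles follows. The inputs are an assumption: the problem is not restated here, but pr(X <= 180) for a normal distribution with mean 174 and standard deviation 6.57 (the distribution of the Figure 6.2.4 problem later in this section) reproduces the quoted answer.

```r
# Assumed inputs: x = 180, mean = 174, sd = 6.57 (not stated in the text)
p_named      <- pnorm(180, mean = 174, sd = 6.57)  # arguments matched by name
p_positional <- pnorm(180, 174, 6.57)              # mean second, sd third

print(p_named)
print(p_positional)
```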
By the complement rule
1 - F(x) = pr(X > x)
so upper-tail probabilities can also be easily obtained using the CDF (just subtract the lower tail from one).
The problem in Figure 6.2.5 in Wild and Seber (p. 243): what is pr(X > 25) if X has a normal distribution with mean 27.3 and standard deviation of 4.1?
Answer: (calculated by Rweb) 0.7125929.
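The calculation behind this answer can be sketched as follows. The second form uses the lower.tail argument of pnorm, which is an alternative not mentioned above.

```r
# Upper tail by the complement rule: look up the lower tail, then subtract
p_upper <- 1 - pnorm(25, mean = 27.3, sd = 4.1)
print(p_upper)

# Same probability, asking pnorm for the upper tail directly
p_upper_alt <- pnorm(25, mean = 27.3, sd = 4.1, lower.tail = FALSE)
print(p_upper_alt)
```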
If a < b, then the events X <= a and a < X <= b are mutually exclusive (X can't be both below a and above a). Thus by the addition rule,
pr(X <= b) = pr(X <= a) + pr(a < X <= b)
or, moving one term from one side of the equation to the other,
pr(a < X <= b) = pr(X <= b) - pr(X <= a) = F(b) - F(a)
Thus the probability of an interval is the difference of the values of the CDF at the endpoints.
The problem in Figure 6.2.4 in Wild and Seber (p. 242): what is pr(160 < X < 180) if X has a normal distribution with mean 174 and standard deviation of 6.57?
Answer: (calculated by Rweb) 0.8028936.
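A sketch of this interval calculation, written so that the CDF is evaluated at both endpoints in a single call and diff takes F(b) - F(a):

```r
endpoints  <- c(160, 180)                            # a and b
cdf_values <- pnorm(endpoints, mean = 174, sd = 6.57)  # F(a) and F(b)
p_interval <- diff(cdf_values)                        # F(b) - F(a)
print(p_interval)
```

Writing pnorm(180, 174, 6.57) - pnorm(160, 174, 6.57) gives the same number.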
Recall that the CDF of a probability distribution
F(x) = pr(X <= x)
gives lower-tail probabilities.
The inverse CDF is the function that goes the other way. If
F(x) = p
then
F^{-1}(p) = x
The R function that gives the inverse CDF of a normal distribution is qnorm. As with pnorm, the optional arguments mean and sd supply the mean and standard deviation. (And the defaults are mean=0 and sd=1.)
Another way to think of what qnorm does is that, given a p between zero and one, it solves
F(x) = p
for x. The solution is called the p-th quantile of the normal distribution. It is the point x such that
pr(X < x) = p
and
pr(X > x) = 1 - p
Question (from Figure 6.2.6 in Wild and Seber): What is the 0.8 quantile of women's heights, assuming the heights follow a normal distribution with mean 162.7 (centimeters) and standard deviation 6.2 (centimeters)?
Solution:
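A sketch of the calculation, followed by a round-trip check that feeding the quantile back into pnorm recovers the probability 0.8:

```r
# 0.8 quantile of a normal distribution with mean 162.7 and sd 6.2
q80 <- qnorm(0.8, mean = 162.7, sd = 6.2)
print(q80)

# Round trip: the lower-tail area to the left of q80 should be 0.8 again
p_back <- pnorm(q80, mean = 162.7, sd = 6.2)
print(p_back)
```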
Answer: (calculated by Rweb) 167.9181.
Note that the result of pnorm is always between zero and one because it is a probability. The first argument of qnorm is always between zero and one because it is a probability.
Question (from Figure 6.2.8 in Wild and Seber): What is the range of heights that contains the central 50% of women's heights, assuming the heights follow a normal distribution with mean 162.7 (centimeters) and standard deviation 6.2 (centimeters)?
Solution: The first step in the analysis is to see that we want 25% of heights below the range we are to find and 25% above (leaving 50% in the middle). Thus the lower end of the range is the 25-th percentile and the upper end is the 75-th percentile. (Why 75-th? Because 25% plus 50% is below this point.)
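Both percentiles can be found in one call, since qnorm accepts a vector of probabilities. A sketch, with a check that the resulting range really contains half the probability:

```r
# 25-th and 75-th percentiles of the height distribution
central50 <- qnorm(c(0.25, 0.75), mean = 162.7, sd = 6.2)
print(central50)

# Check: probability between the two endpoints
p_middle <- diff(pnorm(central50, mean = 162.7, sd = 6.2))
print(p_middle)
```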
Answer: 158.5182 (25-th percentile) and 166.8818 (75-th percentile).
This section illustrates the behavior of sums and differences of random variables discussed in Section 6.4 of Wild and Seber.
In the first five lines below, the computer simulates two datasets x and y of size n. They are both normally distributed with mean mu and standard deviation sigma and are statistically independent.
The rest of the lines, except the last, calculate various means and standard deviations for comparison with the formulas given in the book.
The last line gives one of those theoretical values. What is it?
What are the other theoretical values? That is, what theoretical values are mean(x + y) and mean(x - y) supposed to be near?
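The code for this example is not reproduced here, so the following is a reconstruction from the description above, not the original. The values n = 1000, mu = 10, and sigma = 3 are assumptions, and so is the choice of which theoretical value appears on the last line.

```r
n <- 1000                             # data size (assumed)
mu <- 10                              # common mean (assumed)
sigma <- 3                            # common standard deviation (assumed)
x <- rnorm(n, mean = mu, sd = sigma)  # first simulated dataset
y <- rnorm(n, mean = mu, sd = sigma)  # second dataset, independent of x

mean(x + y)        # sample mean of the sum
mean(x - y)        # sample mean of the difference
sd(x + y)          # sample standard deviation of the sum
sd(x - y)          # sample standard deviation of the difference

sqrt(2) * sigma    # a theoretical value; which results above should be near it?
```

Because the data are simulated afresh on each run, the sample values will only be near the theoretical ones, and closer for larger n.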
Now the same thing, but with pictures rather than numbers.
Direct or forward problems look up probabilities (areas under the density curve) using the CDF (cumulative distribution function) of the probability model
F(x) = pr(X <= x)
The R function that evaluates the normal CDF is pnorm.
Area to the left of a, that is, pr(X < a)
pnorm(a, . . . )
where the . . . indicates the other arguments that specify the mean and standard deviation of the normal distribution in question.
For example, if we want the area to the left of 100 for a normal distribution with mean 50 and standard deviation 20
pnorm(100, 50, 20)
Upper-tail probabilities are related to lower-tail probabilities by the complement rule
pr(X > a) = 1 - pr(X < a)
So the area to the right of a, that is, pr(X > a), is
1 - pnorm(a, . . . )
where the . . . indicates the other arguments that specify the mean and standard deviation of the normal distribution in question.
For example, if we want the area to the right of 100 for a normal distribution with mean 50 and standard deviation 20
1 - pnorm(100, 50, 20)
Probabilities of intervals are calculated by the rule
pr(a < X < b) = F(b) - F(a)
where, as usual, F(x) denotes the CDF. So this is calculated in R by
pnorm(b, . . . ) - pnorm(a, . . . )
For example, if we want the area between 75 and 100 for a normal distribution with mean 50 and standard deviation 20, then we calculate
pnorm(100, 50, 20) - pnorm(75, 50, 20)
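The three kinds of forward problem for the normal distribution with mean 50 and standard deviation 20 can be run together, as in this sketch (the variable names are only for illustration):

```r
left    <- pnorm(100, 50, 20)                      # area to the left of 100
right   <- 1 - pnorm(100, 50, 20)                  # area to the right of 100
between <- pnorm(100, 50, 20) - pnorm(75, 50, 20)  # area between 75 and 100

print(c(left = left, right = right, between = between))
```

The left and right areas add up to one, as the complement rule says they must.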
Inverse or backward problems look up values of the variable corresponding to specific probabilities (areas under the density curve) using the inverse CDF of the probability model, which is the function that goes in the direction opposite to the CDF. If
F(x) = p
then
F^{-1}(p) = x
The R function that evaluates the normal inverse CDF is qnorm.
The relation of an inverse problem to a direct problem is like the relation of Jeopardy to an ordinary game show.
The function pnorm answers the question
Question: What is the lower tail area to the left of a?
Answer: pnorm(a, . . . )
where the . . . indicates the other arguments that specify the mean and standard deviation of the normal distribution in question.
Conversely, the function qnorm finds the question that has a specified probability as the answer.
Question: What is a such that the lower tail area to the left of a is equal to p (the p-th quantile of the distribution)?
Answer: qnorm(p, . . . )
where the . . . indicates the other arguments that specify the mean and standard deviation of the normal distribution in question.
All of the following questions have the same answer.
Question: What is a such that the lower tail area to the left of a is equal to p?
Question: What is a such that the upper tail area to the right of a is equal to 1 - p?
Question: What is the p-th quantile?
Question: What is the 100p-th percentile?
Answer: qnorm(p, . . . )
For example, for a normal distribution with mean 50 and standard deviation 20,
Question: What is a such that the lower tail area to the left of a is equal to 0.65?
Question: What is a such that the upper tail area to the right of a is equal to 0.35?
Question: What is the 0.65 quantile?
Question: What is the 65-th percentile?
Answer: qnorm(0.65, 50, 20)
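A sketch that computes this answer and then checks it against the first two phrasings of the question, by looking up the lower-tail and upper-tail areas at the result:

```r
a <- qnorm(0.65, 50, 20)          # the 0.65 quantile (65-th percentile)
print(a)

lower_area <- pnorm(a, 50, 20)      # area to the left of a
upper_area <- 1 - pnorm(a, 50, 20)  # area to the right of a
print(c(lower = lower_area, upper = upper_area))
```

The two areas should come out as 0.65 and 0.35, up to rounding.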
Note especially that in forward problems first you do the look-up and then you apply the complement rule 1 - pnorm( . . . ). But in backward problems first you apply the complement rule and then you do the look-up.
Question: What is a such that the upper tail area to the right of a is equal to 0.35?
Answer: qnorm(1 - 0.35, 50, 20)
If you find yourself doing 1 - qnorm( . . .), stop! That never makes sense. The argument of qnorm is a probability. It makes sense to supply one minus the argument. The result of qnorm is not a probability. It never makes sense to calculate one minus the result.
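The difference shows up clearly when the numbers are run. This sketch compares the right way with the mistake for the upper-tail question above. The lower.tail argument of qnorm is an alternative not mentioned in the text, and the variable names are only for illustration.

```r
# Right: apply the complement rule to the probability, then look up
right_way <- qnorm(1 - 0.35, 50, 20)

# Also right: ask qnorm to treat 0.35 as an upper-tail area
same_thing <- qnorm(0.35, 50, 20, lower.tail = FALSE)

# Wrong: one minus a quantile is not a meaningful quantity
wrong_way <- 1 - qnorm(0.35, 50, 20)

print(c(right = right_way, same = same_thing, wrong = wrong_way))

# Check the right answer: the area to its right should be 0.35
check_area <- pnorm(right_way, 50, 20, lower.tail = FALSE)
print(check_area)
```

The wrong version comes out negative, far from the value that actually has 35% of the area above it.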