Statistics 3011 (Geyer and Jones, Spring 2006) Examples: Mean, etc. and Boxplots

Statistics

In statistics (the subject of this course) the word statistic (in the singular) is used as a technical term, meaning any number that can be calculated from a sample (p. 250 in the textbook).

Mean

For our example we will use the data from Example 2.1 in the textbook (Moore), which is available at the URL

http://www.stat.umn.edu/geyer/3011/mdata/chap02/eg02-01.dat

so we can use the URL external data entry method.

External Data Entry

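Here is a minimal sketch of the kind of R statements one would submit here. The variable name x and the assumption that the file has a header row and a single quantitative column are ours; the actual layout of eg02-01.dat may differ.

  foo <- read.table("http://www.stat.umn.edu/geyer/3011/mdata/chap02/eg02-01.dat",
      header = TRUE)
  x <- foo[[1]]   # the first (assumed only) column of the data frame
  mean(x)         # sample mean
  median(x)       # sample median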

Median

See above.

Trimmed Mean

Our textbook doesn't discuss trimmed means, although many others do. To calculate a 10% trimmed mean

  1. Sort the data.
  2. Throw away the upper 10% and the lower 10%.
  3. Calculate the mean (average) of the rest.

And similarly for a 5%, 15%, or any other trimmed mean.

Note that a 10% trimmed mean trims 10% from each end, not 10% in all (it trims 20% in all).

Also note that you must sort the data first. You throw away the 10% of the data having the largest values and the 10% having the smallest values, not the first and last 10% in the order presented (unless the data are given to you sorted).

The point of trimmed means is that they lie between the mean and the median in behavior. The ordinary mean is a 0% trimmed mean, and the median is essentially a 50% trimmed mean. More can be said on this subject, but we'll leave it for later.

The example above clearly shows this behavior. The less trimming, the closer to the ordinary mean. The more trimming, the closer to the median.

Note that to specify a 10% trimmed mean you give the argument trim = 0.1 to the mean function. You give the fraction 0.1, not the percent 10.
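Continuing the sketch above (x is our assumed variable name), trimmed means are just the mean function with the trim argument:

  mean(x)               # ordinary mean, i.e., 0% trimmed mean
  mean(x, trim = 0.05)  # 5% trimmed mean
  mean(x, trim = 0.1)   # 10% trimmed mean
  median(x)             # essentially the 50% trimmed mean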

Comment

The statistics discussed so far are competitors. They all measure, in various ways, more or less the same abstract concept, which the book calls center. If the distribution of the data is symmetric, then they all (mean, median, and all trimmed means) measure the same thing, the center of symmetry. Otherwise they all measure something different, but each of the things they measure has as much right to be called center as any other.

Quantiles (a. k. a. percentiles)

Our textbook also doesn't discuss quantiles and percentiles, although many others do.

Abstractly, the p-th quantile of a distribution is the point x such that a fraction p of the data are below x and a fraction 1 - p of the data are above x. Here p is a number between zero and one.

When p is multiplied by 100 to convert it to a percent, then the same point x is called a percentile, which should be familiar from the way standardized tests are reported. Your score was at the 95-th percentile if 95% of all other scores were below yours and 5% of all other scores were above yours.

For theoretical reasons, statisticians prefer numbers between zero and one (which can be interpreted as probabilities) rather than percents. So if your score was at the 95-th percentile, a statistician would prefer to say 0.95 quantile. Same point, different name.

Our textbook does discuss two particular quantiles, the 0.25 and 0.75 quantiles (a. k. a. the 25-th and 75-th percentiles). These are also called the first quartile and third quartile (the one in between, the second quartile, is the 50-th percentile, which is the median).

External Data Entry

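A sketch of statements of the kind one would submit here; the first three use the default quantile definition and the last uses type = 1 (x is the assumed variable from the sketch in the Mean section).

  quantile(x, probs = 0.25)                  # first quartile, default definition
  quantile(x, probs = 0.75)                  # third quartile, default definition
  quantile(x, probs = c(0.25, 0.50, 0.75))   # all three quartiles at once
  quantile(x, probs = c(0.25, 0.50, 0.75), type = 1)  # type 1 always returns a data point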

Comment

When applied to a large population, the notion of quantile is clear, but when applied to a small sample it is nowhere near as simple as we have made it sound. The R quantile function implements 9 different quantile definitions that have appeared in the literature.

Above we look at two of them: the default method (used when type is not specified) appears in the first three statements, while the last statement uses type = 1, which always chooses a data point as the quantile, so different quantiles may have the same value.

It does not appear that any of the 9 methods implemented by the quantile function agrees with the method of finding quartiles given in the textbook (pp. 37-38), which is simplified to make hand calculation easy, a simplification that is pointless when a computer is available.

Standard Deviation

External Data Entry

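A sketch of the statements one would submit here (again using our assumed variable x); these cover the standard deviation, interquartile range, and five-number summary discussed below.

  sd(x)        # sample standard deviation
  IQR(x)       # interquartile range
  summary(x)   # min, first quartile, median, mean, third quartile, max (six numbers)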

Interquartile Range

See above for example.

Comment

The s. d. and the IQR both measure something that can loosely be called spread, but do not measure the same thing.

They are competitors in the sense that you should use one or the other but not both. They are not precise competitors, though, in that they measure different aspects of a distribution.

Five-Number Summary (R Does Six)

See above.

Boxplots

For our example we will use the data that comes with R in the built-in dataset ToothGrowth (on-line help), which is about

the effect of vitamin C on tooth growth in guinea pigs.

The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).

In order to make this more like the problems from the book, where we load the data from a URL, we have written these data out of R to a file at the URL

http://www.stat.umn.edu/geyer/3011/rdata/ToothGrowth.dat

so we can use the URL external data entry method.
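Exporting a built-in dataset like this can be done with write.table; the exact options used to make the course file are an assumption on our part.

  write.table(ToothGrowth, file = "ToothGrowth.dat", quote = FALSE, row.names = FALSE)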

External Data Entry

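A minimal sketch of reading the data back in, assuming the file at the URL above is whitespace-delimited with a header row:

  ToothGrowth <- read.table("http://www.stat.umn.edu/geyer/3011/rdata/ToothGrowth.dat",
      header = TRUE)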

To make a simple boxplot, just do boxplot(variable). In this dataset the quantitative variable is len.
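For example, with the data frame read in above:

  boxplot(ToothGrowth$len)   # simple boxplot of the quantitative variable len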

To make parallel boxplots, we use the R formula mini-language, the expressions

len ~ supp
len ~ supp + dose

used in the two parallel-boxplot calls shown below. This first use of the formula mini-language won't be the last time we see it; all regression in R also uses it.

The variable to the left of the twiddle (~) is the response and the variables to the right are predictors. The response is the quantitative variable we want boxplot(s) of. The predictor(s) in this case are categorical and define the category or categories for which we want parallel boxplots.
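A sketch of the two parallel-boxplot calls, using the data frame read in above:

  boxplot(len ~ supp, data = ToothGrowth)          # one box per delivery method
  boxplot(len ~ supp + dose, data = ToothGrowth)   # one box per supp-dose combination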

In the first parallel boxplot, we get only the categories specified by supp, which are OJ (orange juice) and VC (ascorbic acid, i.e., straight vitamin C). We see that OJ is better: at least the median is much higher, although the range for VC is wider, so VC sometimes beats OJ, just not most of the time.

In the second parallel boxplot, we get the categories specified by both supp and dose, the latter being 0.5, 1.0, or 2.0 milligrams. We see that OJ is far better in the 1.0 milligram groups, with the first quartile for OJ above the maximum for VC. There is a similar but not quite as strong advantage for OJ in the 0.5 milligram groups. But in the high-dose (2.0 milligram) groups there is not much difference.