In statistics (the subject of this course) the word statistic (in the singular) is used as a technical term, meaning any number that can be calculated from a sample (p. 250 in the textbook).
For our example we will use the data from Example 2.1 in the Textbook (Moore) which is in the URL
so we can use the URL external data entry method.
See above.
Our textbook doesn't discuss trimmed means, although many others do. To calculate a 10% trimmed mean, you throw away the 10% of the data having the largest values and the 10% having the smallest values, then take the ordinary mean of what remains. The same recipe gives a 5%, 15%, or any other trimmed mean.
Note that a 10% trimmed mean trims 10% from each end, not 10% in all (it trims 20% in all).
Also note that the trimming is done on the sorted data. You throw away the smallest and largest values, not the first and last 10% in the order presented (unless the data are given to you sorted).
The point of trimmed means is that they lie between the mean and the median in behavior. The ordinary mean is a 0% trimmed mean, and the median is essentially a 50% trimmed mean. More can be said on this subject, but we'll leave it for later.
The example above clearly shows this behavior. The less trimming, the closer to the ordinary mean. The more trimming, the closer to the median.
Note that to specify a 10% trimmed mean you give the argument trim = 0.1 to the mean function. You give the fraction 0.1, not the percent 10.
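A short sketch of all of this, using a made-up data vector (the vector here is just for illustration, not data from the textbook):

```r
x <- c(2, 5, 7, 8, 9, 12, 14, 18, 21, 45)   # hypothetical data, already sorted

mean(x)                # ordinary (0% trimmed) mean
mean(x, trim = 0.1)    # 10% trimmed mean: fraction 0.1, not percent 10
median(x)              # essentially a 50% trimmed mean

# the same 10% trimmed mean by hand: sort, drop floor(0.1 * n) values
# from each end, and average the rest
n <- length(x)
k <- floor(0.1 * n)
mean(sort(x)[(k + 1):(n - k)])   # agrees with mean(x, trim = 0.1)
```

With these (right-skewed) data the trimmed mean lands between the median and the ordinary mean, illustrating the in-between behavior described above.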
The statistics discussed so far are competitors. They all measure, in various ways, more or less the same abstract concept, which the book calls center. If the distribution of the data is symmetric, then they all (mean, median, and all trimmed means) measure the same thing, the center of symmetry. Otherwise they all measure something different, but each of the things they measure has as much right to be called center as the others.
Our textbook also doesn't discuss quantiles and percentiles, although many others do.
Abstractly, the p-th quantile of a distribution is the point x such that a fraction p of the data are below x and a fraction 1 - p of the data are above x. Here p is a number between zero and one.
When p is multiplied by 100 to convert it to a percent, then the same point x is called a percentile, which should be familiar from the way standardized tests are reported. Your score was at the 95-th percentile if 95% of all other scores were below yours and 5% of all other scores were above yours.
For theoretical reasons, statisticians prefer numbers between zero and one (which can be interpreted as probabilities) rather than percents. So if your score was at the 95-th percentile, a statistician would prefer to say 0.95 quantile. Same point, different name.
Our textbook does discuss two particular quantiles, the 0.25 and 0.75 quantiles (a. k. a. the 25-th and 75-th percentiles). These are also called the first quartile and third quartile (the second quartile, in between, is the 50-th percentile, which is the median).
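A quick sketch with a made-up data vector (any numeric vector works the same way): asking the quantile function for the 0.25, 0.50, and 0.75 quantiles gives the three quartiles.

```r
x <- c(2, 5, 7, 8, 9, 12, 14, 18, 21, 45)   # hypothetical data

# first quartile, median (second quartile), third quartile
quantile(x, probs = c(0.25, 0.50, 0.75))
```

The 0.50 quantile returned here agrees with median(x), as it should.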
When applied to a large population, the notion of quantile is clear, but when applied to a small sample it is nowhere near as simple as we have made it sound. The R quantile function implements 9 different quantile definitions that have appeared in the literature.
Above we look at two of them. The default method (used when the type argument is not specified) appears in the first three statements. The last statement uses type = 1, which always chooses a data point as the quantile, in which case different quantiles may have the same value.
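To see the difference between the two methods, here is a small sketch with a made-up data vector:

```r
x <- c(2, 5, 7, 8, 9, 12, 14, 18, 21, 45)   # hypothetical data

quantile(x, 0.25)             # default (type = 7) interpolates between data points
quantile(x, 0.25, type = 1)   # type = 1 always returns an actual data point
```

The default method returns 7.25, which is not one of the data values; type = 1 returns 7, which is.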
It does not appear that any of the 9 methods implemented by the quantile function agrees with the method of finding quartiles given in the textbook (pp. 37-38), which is simplified to make hand calculation easy, a simplification that is pointless when a computer is available.
See above for an example.
The s. d. and the IQR both measure something that can loosely be called spread, but they do not measure the same thing.
They are competitors in the sense that you should use one or the other but not both. They are not precise competitors in that they measure different aspects of a distribution.
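Both are one-liners in R. A quick sketch with a made-up data vector:

```r
x <- c(2, 5, 7, 8, 9, 12, 14, 18, 21, 45)   # hypothetical data

sd(x)    # sample standard deviation
IQR(x)   # interquartile range: 0.75 quantile minus 0.25 quantile
```

Note that IQR uses the default quantile method, so it equals quantile(x, 0.75) - quantile(x, 0.25).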
See above.
For our example we will use the data that comes with R
in the built-in dataset ToothGrowth
(on-line help)
which is about
the effect of vitamin C on tooth growth in guinea pigs.
The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).
In order to make this more like the problems from the book where we load the data from a URL, we have written this data out of R to the URL
so we can use the URL external data entry method.
To make a simple boxplot, just do boxplot(variable). In this dataset the quantitative variable is len.
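In R that is just:

```r
data(ToothGrowth)          # built-in dataset that ships with R
boxplot(ToothGrowth$len)   # simple boxplot of the quantitative variable len
```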
To make parallel boxplots, we use the R formula mini-language, the expressions len ~ supp and len ~ supp + dose seen in the last two statements in the form. This first use of the mini-language won't be the last time we see it. All regression in R also uses it.
The variable to the left of the twiddle (~) is the response and the variables to the right are predictors. The response is the quantitative variable we want boxplot(s) of. The predictor(s) in this case are categorical and define the category or categories for which we want parallel boxplots.
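The two parallel boxplots described below can be sketched like this (boxplot accepts a formula plus a data argument naming the data frame):

```r
data(ToothGrowth)   # built-in dataset

boxplot(len ~ supp, data = ToothGrowth)          # two boxes: OJ and VC
boxplot(len ~ supp + dose, data = ToothGrowth)   # six boxes: supp-by-dose combinations
```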
In the first parallel boxplot plot, we get only the categories specified
by supp
, which are OJ (orange juice) and VC (straight vitamin C,
the help doesn't say exactly how delivered). We see that OJ is better
(at least the median is much higher, although the range for VC is wider,
so sometimes VC is better than OJ but not most times).
In the second parallel boxplot plot, we get the categories specified
by both supp
and dose
, the latter being 0.5, 1.0,
or 2.0 milligrams.
We see that OJ is way better in the 1.0 milligram groups, with the first quartile for OJ being above the maximum for VC.
There is a similar but not quite as strong advantage for OJ in the 0.5
milligram groups.
But in the high dose groups there is not much difference.