Chapter 5 t-based Procedures

The standard methods for making inference about the mean of a population are based on the t-distribution, giving us t-tests and t-confidence intervals. There are also t-tests and t-confidence intervals for the difference of the means of two populations. All of these procedures are based on the assumption that the population, or populations, follow a normal distribution. One version of the two-sample procedures also assumes that the variances of the two populations are the same. Fortunately, these procedures also work pretty well for some non-normally distributed populations, with the accuracy of the results being better for larger sample sizes and/or populations that are not too non-normal (and the more non-normal the population is the larger the sample size needs to be).

All of these are addressed using the R function t.test().
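As a preview, the main usages look like this (a sketch with made-up data; each call is explained in the sections below):

```r
set.seed(17)                      # arbitrary seed so the sketch is reproducible
x <- rnorm(20, mean = 5, sd = 2)  # sample from one population
y <- rnorm(20, mean = 6, sd = 2)  # sample from a second population

t.test(x)                    # one-sample: null mean 0, 95% confidence interval
t.test(x, mu = 5)            # one-sample with a nonzero null mean
t.test(x, y)                 # two-sample (Welch, unpooled variances)
t.test(x, y, paired = TRUE)  # paired: equivalent to t.test(x - y)
```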

5.1 One-sample procedures

Consider the cfcdae::Thermocouples data set, which contains the difference in temperature at two places in an oven over 64 consecutive time periods. One concern might be whether the temperature in the oven is consistent. If so, the mean difference would be zero, and we might want to test whether the data are consistent with the null hypothesis of a zero mean. A second approach might be to assume that there is non-uniformity in the temperatures and, rather than testing, estimate the difference with a margin of error. These lead us to the t-test and t-confidence interval.

Begin with a quick look at the data. This histogram shows us that the data are clustered between 3.05 and 3.20, with most between 3.10 and 3.15. The data seem a little skewed to the left.
> temp.diff <- cfcdae::Thermocouples$temp.diff
> hist(temp.diff,main="Temperature Differences")

Figure 5.1: Histogram of Temperature Differences

We can do a better job of assessing normality by using a “normal quantile-quantile” plot (also called a “normal probability plot”) via the qqnorm() function. This should look roughly like a straight line if the data follow a normal distribution. The granularity in the plot reflects that the values have been rounded to the nearest hundredth, and the lowest three values are a bit more extreme than we would expect from normally distributed data.
> qqnorm(temp.diff,main="NPP of Differences")

Figure 5.2: Normal Probability Plot of Temperature Differences
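It can be easier to judge straightness with a reference line on the plot; base R’s qqline() adds a line through the first and third quartiles of the data:

```r
qqnorm(temp.diff, main = "NPP of Differences")
qqline(temp.diff)  # reference line through the quartiles
```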

The function t.test() yields both the test and the confidence interval. By default it tests the null hypothesis that the mean is zero and produces a 95% confidence interval.
> t.test(temp.diff)

    One Sample t-test

data:  temp.diff
t = 1001.1, df = 63, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.134979 3.147521
sample estimates:
mean of x 
  3.14125 

The t-statistic is over 1000, and the p-value is infinitesimal! Thus there is extremely strong evidence against the null hypothesis of 0 mean (this should be no surprise given that the data ranged from 3.05 to 3.20, with no data close to, or on either side of, 0). The confidence interval is from 3.135 to 3.148.
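These are the standard one-sample quantities, which we can reproduce by hand from \(t = (\bar{x} - \mu_0)/(s/\sqrt{n})\) and \(\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}\); a sketch:

```r
n    <- length(temp.diff)
xbar <- mean(temp.diff)
s    <- sd(temp.diff)
se   <- s / sqrt(n)                            # standard error of the mean

t.stat <- (xbar - 0) / se                      # should match t = 1001.1
p.val  <- 2 * pt(-abs(t.stat), df = n - 1)     # two-sided p-value
ci <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se  # should match 3.135 to 3.148
```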

You can change the mean value in the null hypothesis as well as the coverage of the interval. Suppose that we want more coverage and that a great deal of historical data indicates that the mean difference should be 3.125; are these data consistent with the historical value?
> t.test(temp.diff,mu=3.125,conf.level=.999)

    One Sample t-test

data:  temp.diff
t = 5.1787, df = 63, p-value = 2.493e-06
alternative hypothesis: true mean is not equal to 3.125
99.9 percent confidence interval:
 3.130419 3.152081
sample estimates:
mean of x 
  3.14125 

The p-value is less than \(10^{-5}\), which is pretty strong evidence that the mean of these data differs from the historical value.

5.2 Two-sample procedures

Two-sample procedures refer to making inference about two populations using samples from the two populations. Typically, we are making inference about the group means: Is there evidence that they are not equal? Is there evidence one is greater? What is a range of values for the difference of means that is consistent with the data?

Consider the data on breaking strength for notched and unnotched boards, data set cfcdae::NotchedBoards. We would like to investigate the null hypothesis that unnotched boards of thickness .625 inch have the same strength as boards of thickness .75 inch with a 1 inch wide notch cut in the center to thickness .625 inch.

First get the NotchedBoards data from cfcdae into your current work space.
> data(NotchedBoards) # creates variable NotchedBoards
Now you have a choice. You can create two vectors for the two different groups, or you can work with the data frame directly.
> unnotched <- NotchedBoards$strength[NotchedBoards$shape == "uniform"]
> notched <- NotchedBoards$strength[NotchedBoards$shape == "notched"]
> unnotched
 [1] 243 229 305 395 210 311 289 269 282 399 222 331 369
> notched
 [1] 215 202 273 292 253 247 350 246 352 398 267 331 342
Before we do any inference, let’s just look at the data. Here we do boxplots of the two different groups. They are nearly the same with lots of overlap. This suggests that we will find no significant differences.
> # boxplot(notched, unnotched) # most obvious version
> # boxplot(list(Notched=notched, Unnotched=unnotched)) # provides better labels
> boxplot(strength ~ shape, data=NotchedBoards) # formula version

Figure 5.3: Boxplots of board strengths

You can just give several data vectors as arguments to boxplot, you can (should) give them better labels, or you can use a formula of the form response ~ groupings. The latter is easiest if you have many groups or your data come in a data frame.

The two-sample t-test is the typical method used to do tests regarding the means of two groups. In R, this is the t.test(x,y) function. This command does a two-sample t-test between the sets of data in x and y. The confidence interval it generates is for the mean of x minus the mean of y.

There is also a “formula” version of t.test(). The formula takes the form of response ~ predictor, where in our case the predictor is a factor (grouping variable) with two levels. You get the same results, and it’s a little less fuss if your data come from a data frame. Note that the formula version takes the difference in the order of the factor levels (here notched minus uniform), so the signs of the t statistic and the interval endpoints are flipped relative to t.test(unnotched, notched).

> t.test(unnotched, notched)

    Welch Two Sample t-test

data:  unnotched and notched
t = 0.27353, df = 23.911, p-value = 0.7868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -43.31049  56.54126
sample estimates:
mean of x mean of y 
 296.4615  289.8462 
> t.test(strength ~ shape, data=NotchedBoards)

    Welch Two Sample t-test

data:  strength by shape
t = -0.27353, df = 23.911, p-value = 0.7868
alternative hypothesis: true difference in means between group notched and group uniform is not equal to 0
95 percent confidence interval:
 -56.54126  43.31049
sample estimates:
mean in group notched mean in group uniform 
             289.8462              296.4615 

Note that by default R uses an unpooled estimate of variance (the Welch version with fractional degrees of freedom), a two-sided alternative, and produces a confidence interval with 95% coverage. You can also get a pooled estimate of variance and/or upper or lower alternatives (i.e., x has greater mean or lesser mean) and/or change the confidence level by using the appropriate optional arguments.

The unpooled (Welch) version is generally the better option in typical two-sample use, because it is almost as good as the pooled version when the population variances are equal and is much better when the population variances differ. However, Analysis of Variance, which generalizes the t-test to multiple groups and to more complicated settings, is a generalization of the pooled version.

Here we do the test with the option that treats the group variances as equal and a 99.5% coverage level. Then we jump ahead to an Analysis of Variance approach just to show that for two groups its p-value agrees with the equal-variances t-test (the F value is the square of the t). The p-values in both cases are large, providing no evidence against the null hypothesis of equal means.
> # pooled is nearly identical to unpooled for these data
> t.test(unnotched, notched, var.equal=TRUE, conf.level=.995) 

    Two Sample t-test

data:  unnotched and notched
t = 0.27353, df = 24, p-value = 0.7868
alternative hypothesis: true difference in means is not equal to 0
99.5 percent confidence interval:
 -68.12963  81.36040
sample estimates:
mean of x mean of y 
 296.4615  289.8462 
> anova(lm(strength~shape, data=NotchedBoards)) # preview
Analysis of Variance Table

Response: strength
          Df Sum Sq Mean Sq F value Pr(>F)
shape      1    284   284.5  0.0748 0.7868
Residuals 24  91249  3802.0               
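For two groups, the equal-variances t-test and the ANOVA F-test are equivalent; squaring the t statistic reproduces the F value:

```r
t.stat <- 0.27353  # t from the pooled two-sample test above
t.stat^2           # about 0.0748, matching the F value in the ANOVA table
```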
One reasonable belief might be that the notched boards would be at least as strong as the unnotched boards: while they have the same minimum thickness as the unnotched boards, their average thickness is greater. We can look for evidence against this belief using a one-sided test with the alternative that the unnotched mean is greater than the notched mean. The p-value is smaller than for the two-sided test, but it is still quite large.
> t.test(unnotched, notched, alternative="greater")

    Welch Two Sample t-test

data:  unnotched and notched
t = 0.27353, df = 23.911, p-value = 0.3934
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -34.76902       Inf
sample estimates:
mean of x mean of y 
 296.4615  289.8462 

5.3 Paired t-procedures

Sometimes we get two measurements on a subject under different circumstances, or perhaps we treat every experimental unit with two different treatments and get the responses for the two treatments for each unit. These data are paired, because we expect the two responses for the same subject or unit to both be high or both be low. They are correlated with each other because they share some aspect of the subject or unit that causes the responses to be similar.

When working with paired data you must take that correlation into account. (If you treat the data as unpaired, you get a very inefficient analysis. For example, your confidence intervals will typically be much wider than necessary.) Generally speaking, this can be done by taking the difference between the two measurements for each subject or unit and then doing your analysis on the differences. Because the mean of the differences is the same as the difference of the means, you get inference about the same thing: the difference of means under the two conditions.

We will demonstrate using the data cfcdae::RunStitch. Each worker run-stitches collars using two different setups: the conventional setup and an ergonomic setup. The two runs are made in random order for each worker, and the interest is in any difference in average speed between the two setups.

Load the RunStitch data from the package and look at the first few values.
> data(RunStitch)
> head(RunStitch)
  Standard Ergonomic
1     4.90      3.87
2     4.50      4.54
3     4.86      4.60
4     5.57      5.27
5     4.62      5.59
6     4.65      4.61
The columns are named Standard and Ergonomic. The most obvious way to work with these data is to compute differences and then work with the differences Standard minus Ergonomic.
> differences <- RunStitch[,"Standard"] - RunStitch[,"Ergonomic"]
> differences
 [1]  1.03 -0.04  0.26  0.30 -0.97  0.04 -0.57  1.75  0.01  0.42  0.45 -0.80
[13]  0.39  0.25  0.18  0.95 -0.18  0.71  0.42  0.43 -0.48 -1.08 -0.57  1.10
[25]  0.27 -0.45  0.62  0.21 -0.21  0.82
It’s almost always a good idea to begin with simple summary statistics. The data lean somewhat toward positive differences, although there are plenty of data on both sides of 0.
> summary(differences)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.0800 -0.2025  0.2550  0.1753  0.4450  1.7500 
> hist(differences, freq=FALSE)

Figure 5.4: Histogram of runstitch differences

The paired t-test (confidence interval) just does a one-sample test (confidence interval) on the differences. For these data, the p-value is .15, so there is little evidence that the means differ.
> t.test(differences)

    One Sample t-test

data:  differences
t = 1.49, df = 29, p-value = 0.147
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.06532811  0.41599478
sample estimates:
mean of x 
0.1753333 

Should this be a one- or two-sided alternative? That is a good question without an obvious answer. I could hypothesize that the employees will be faster with the standard setup because they are used to it, or I could hypothesize that they will be faster with the ergonomic setup, because it’s just better. I don’t know which is reasonable, so it is probably best to use a two-sided alternative that will check for either.

You can also do the paired test using the two sets of responses as for a two-sample setting but using the paired=TRUE argument. Results are the same as above.

> t.test(RunStitch$Standard, RunStitch$Ergonomic, paired=TRUE)

    Paired t-test

data:  RunStitch$Standard and RunStitch$Ergonomic
t = 1.49, df = 29, p-value = 0.147
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.06532811  0.41599478
sample estimates:
mean difference 
      0.1753333 
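To see the cost of ignoring the pairing, compare the interval above with the one from an (incorrect) unpaired analysis of the same data; when the two responses on a worker are positively correlated, the unpaired interval will be wider:

```r
# Correct paired analysis: interval for the mean difference
t.test(RunStitch$Standard, RunStitch$Ergonomic, paired = TRUE)$conf.int
# Incorrect unpaired analysis: typically a wider interval, because
# worker-to-worker variation inflates the standard error
t.test(RunStitch$Standard, RunStitch$Ergonomic)$conf.int
```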
Finally, you can choose a null mean other than 0.
> t.test(differences, mu=.5)

    One Sample t-test

data:  differences
t = -2.7591, df = 29, p-value = 0.009934
alternative hypothesis: true mean is not equal to 0.5
95 percent confidence interval:
 -0.06532811  0.41599478
sample estimates:
mean of x 
0.1753333