Chapter 29 Fixing Problems

29.1 Data sets

We will illustrate fixing problems with several data sets, most of which we have used before. Three old friends are the resin lifetime, cloud seeding, and oven data sets.

> library(cfcdae)
> data(ResinLifetimes)
> lifetimes.fit <- lm(Time~temp,data=ResinLifetimes)
> data(CloudSeeding)
> cloud.fit <- lm(rainfall~seeding,data=CloudSeeding)
> data(Thermocouples)
> Oven.fit <- lm(temp.diff~1,data=Thermocouples)
Another data set we will explore is TissueTearing. Here weight is added to a facial tissue stretched across an open jar until the tissue tears. There are four treatments: tissues of brand A or brand B, used either dry or wet.
> data(TissueTearing)
> tearing.fit <- lm(tear.wt~treatment,data=TissueTearing)
A fifth data set we will explore is TissueVaporization. For these data, a laser scalpel is drawn across a rat liver at a rate of .01 inches/second. We measure the width of the vaporized tissue that results as we use the laser at four different power settings.
> data(TissueVaporization)
> vaporization.fit <- lm(width~power,data=TissueVaporization)
A sixth data set we will explore is PeaGermination. Each experimental unit was a group of 20 seeds. Each group was pre-soaked either 12 or 24 hours and then placed on either potting soil or paper towel. The response is the number of germinated seeds after 5 days.
> data(PeaGermination)
> peas.fit <- lm(germinated~treatment,data=PeaGermination)
A seventh data set is the number of Copepoda found in wetland soils 14 days after artificial snow melt. The units are randomly assigned to either neutral or acidic artificial snow melt.
> data(Copepoda)
> Copepoda.fit <- lm(count~pH,data=Copepoda)

Note that the responses in the PeaGermination and Copepoda data sets are counts. Thus we might already be skeptical of models that assume normal distributions for the responses.

Our last data set is CloudBackup, which gives the time in seconds to save a 50MB file to a cloud backup service over the internet and then recover it. Each of three services is used ten times, with services assigned to the 30 runs at random. The runs are done consecutively, and the data are listed in time order.
> data(CloudBackup)
> backup.fit <- lm(updowntime~service,data=CloudBackup)

A First Course in Design and Analysis of Experiments discusses non-normality, non-constant variance, and dependence one at a time, treating them as if they were unrelated problems. It also treats the various approaches to handling violations of assumptions completely separately. In practice, you will be assessing all of the assumptions for any given data set and potentially dealing with more than one problem at a time. In addition, you may well try more than one approach to dealing with violations of assumptions in a single data set.

In an attempt to maintain a parallel presentation here and in A First Course in Design and Analysis of Experiments, we present the examples in a technique by technique order. In practice, you are working with one data set and (probably) will be using more than one method on it.

29.2 Power family transformations

A power family transformation replaces the data with the data raised to some power. The Box-Cox method helps us choose a good power. We will use the boxCox() function from package car for this purpose. This function will produce a plot that shows the likelihood as a function of the transformation power. The power at the peak is the estimated value, and the plot shows a confidence interval for the power as well. Any power corresponding to a log-likelihood above the “95%” line is a reasonable power to consider.

There are several optional parameters. The most useful are lambda, which can be used to select a vector of powers to evaluate, and plotit, which if set to FALSE will prevent the plot and return a list with the powers and likelihoods.
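
For example, here is a minimal sketch of extracting the estimated power numerically (the name bc is just our choice):
> bc <- car::boxCox(lifetimes.fit,plotit=FALSE)
> bc$x[which.max(bc$y)]  # power with the largest log-likelihood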

Which power should you choose? It’s a compromise among stabilizing the variability, interpretability, and sensibility. In the end, the transformation has to work: transform the data and check whether the transformed data have stable variance. Interpretability is often a high bar: logs and reciprocals are sometimes interpretable, other powers less so. Sensibility means we round off to something reasonable; we don’t use the .19223 power, we’d use .2 or .25 or something else simple.

29.2.1 Example 6.6 Resin lifetime data

The resin lifetime data on the original scale showed considerable increasing variance (a right-opening megaphone pattern). Let’s try Box-Cox to choose a transformation.
> car::boxCox(lifetimes.fit)
Let’s zoom that plot in to look near the peak:
> car::boxCox(lifetimes.fit,lambda=seq(-.5,.75,.05))

The peak is very near \(\lambda=.2\), and values between about -.1 and .5 are plausible. We chose power 0 (the logarithm) for simplicity and interpretability.
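
To double check that choice, we can refit on the log scale (the data set already contains a logTime variable) and confirm that the variance has stabilized; this is just a quick verification sketch.
> lifetimes.logfit <- lm(logTime~temp,data=ResinLifetimes)
> plot(lifetimes.logfit,which=3)  # scale-location plot on the log scale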

29.2.2 Example 6.7 Cloud seeding

The cloud seeding data have at least two problems: the residuals are not normally distributed and there is evidence of non-constant variance. There might be outliers, but the apparent outliers could be the result of the non-normal distribution.
> plot(cloud.fit,which=c(2,3))
We can do a quantile-quantile plot of the two sets of data (here on the log scale). Because this plot is roughly linear, it suggests that the shapes of the two data sets are similar, even if they are non-normal. This gives us hope that the same transformation might normalize both.
> attach(CloudSeeding)
> seeded <- rainfall[seeding=="seeded"]
> nonseeded <- rainfall[seeding=="nonseeded"]
> qqplot(seeded,nonseeded,log="xy")
The Box-Cox plot helps us find the correct transformation. The peak of this curve is near 0, and 0 is well within the range of acceptable powers (those between the dashed vertical lines). This suggests doing a log transformation of the data.
> car::boxCox(cloud.fit)
> cloud.logfit <- lm(log(rainfall)~seeding,data=CloudSeeding)
After transformation, the residuals look much better.
> plot(cloud.logfit,which=1:3)

The normal plot shows a little bit of a long left tail, and the scale-location plot is just the tiniest bit decreasing. Neither of these issues is sufficient to warrant using some other transformation. The fact that 0 was toward the left-hand end of the acceptable range of powers means that we probably went a little past the optimal transformation, but logged data are a lot easier to understand than the .05 power.

There were outliers on the original scale, but transforming to logged data dealt with that issue.
> car::outlierTest(cloud.logfit)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
    rstudent unadjusted p-value Bonferroni p
26 -2.658736           0.010565       0.5494
An analysis of variance on the log scale shows a p-value of .01, which is small enough to take notice of but not really overwhelming evidence against the null.
> anova(cloud.logfit)
Analysis of Variance Table

Response: log(rainfall)
          Df  Sum Sq Mean Sq F value  Pr(>F)  
seeding    1  17.007 17.0071  6.4738 0.01408 *
Residuals 50 131.353  2.6271                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> linear.contrast(cloud.logfit,seeding,c(-1,1))
  estimates        se  t-value    p-value lower-ci upper-ci
1  1.143781 0.4495342 2.544369 0.01408266 0.240865 2.046697
attr(,"class")
[1] "linear.contrast"

What did seeding do? Look at a contrast between the nonseeded and seeded treatments. The estimated difference on the log scale is 1.14 with a 95% confidence interval from .24 to 2.05. Back on the original scale, this translates into the median rainfall being a factor of \(e^{1.14} = 3.14\) higher in the seeded treatment, with a 95% confidence interval from \(e^{.241}=1.27\) to \(e^{2.047}= 7.74\). So the data are consistent with seeding increasing the rainfall by just over a quarter, but they are also consistent with seeding increasing the rainfall by a factor of almost 8.
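
Those back-transformations are easy to reproduce from the printed contrast output (values copied from above):
> exp(c(estimate=1.143781,lower=0.240865,upper=2.046697))  # about 3.14, 1.27, and 7.74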

29.2.3 Example 6.8 Tearing Tissues

We are looking at how much mass we can place on taut facial tissue before it tears. Tissues can be either of two brands, and tissues can be wet or dry. The p-value for the F-test is quite small, but there is strong evidence of non-constant variance. What looks like long tails in the normal plot is likely the result of mixing normals with different variances; thus fixing non-constant variance will probably fix non-normality as well.
> anova(tearing.fit)
Analysis of Variance Table

Response: tear.wt
          Df Sum Sq Mean Sq F value    Pr(>F)    
treatment  3  32400 10800.1  1518.7 < 2.2e-16 ***
Residuals 92    654     7.1                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(tearing.fit,which=1:3)
The first thing to try is a power family transformation. However, there could be a problem with that, as the ratio of the maximum to the minimum values is only about 2.4; it could take a pretty extreme transformation to have much effect when the dynamic range is that small.
> range(TissueTearing$tear.wt)
[1]  45 108
> car::boxCox(tearing.fit)

Box-Cox suggests the power \(-.5\), with \(-1\) and 0 both being right at the edge of the interval.

We make the -.5 power transformation, but note that we multiply by -1. When you use a negative power, the big values become the small values and vice versa; multiplying by -1 means that the big values remain the big values. After this transformation, the residuals look much better and we can proceed with analysis of variance and so on. The overall F-test is highly significant and all the treatments are significantly different from each other.
> tearing.fitmp5 <- lm(-1/sqrt(tear.wt)~treatment,data=TissueTearing)
> plot(tearing.fitmp5,which=1:3)

> anova(tearing.fitmp5)
Analysis of Variance Table

Response: -1/sqrt(tear.wt)
          Df   Sum Sq   Mean Sq F value    Pr(>F)    
treatment  3 0.024963 0.0083209  2312.8 < 2.2e-16 ***
Residuals 92 0.000331 0.0000036                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> pairwise(tearing.fitmp5,treatment,confidence=.999)
                                          
Pairwise comparisons ( hsd ) of treatment 
                    estimate signif diff        lower        upper
* A_dry - A_wet  0.017871202 0.002140712  0.015730491  0.020011914
* A_dry - B_dry  0.012741368 0.002140712  0.010600657  0.014882080
* A_dry - B_wet  0.044277986 0.002140712  0.042137275  0.046418698
* A_wet - B_dry -0.005129834 0.002140712 -0.007270546 -0.002989123
* A_wet - B_wet  0.026406784 0.002140712  0.024266072  0.028547495
* B_dry - B_wet  0.031536618 0.002140712  0.029395906  0.033677329
This was pretty successful, but suppose that the data values had all been 100 more than they actually were. Then the ratio of the largest to smallest would only be about 1.4, and it would take a much stronger transformation to do any good. The power -3 actually does a pretty good job at fixing non-constant variance in this case, but we usually try to avoid such extreme transformations.
> tearing.fit100 <- lm(tear.wt+100~treatment,data=TissueTearing)
> car::boxCox(tearing.fit100,lambda=seq(-5,0,.05))
One alternative to using an extreme transformation is to shift the data by some amount prior to doing the Box-Cox approach. Here, we subtract 25 from the tearing weight before trying to transform, and Box-Cox then indicates that a logarithm is sufficient (rather than the -.5 power).
> car::boxCox(lm(tear.wt-25~treatment,data=TissueTearing))

Note that some “extreme” transformations really make a lot of sense. If your response is miles per gallon and you take the reciprocal transformation (power -1, kind of extreme), then you’re really just doing an analysis on gallons per mile, which makes perfect sense.

Now step back: which transformation should we use? Let’s look at the residuals versus fitted values plots for the -.5 power, the logarithm, and the logarithm of the data minus 25.
> plot(tearing.fitmp5,which=1)

> plot(lm(log(tear.wt)~treatment,data=TissueTearing),which=1)

> plot(lm(log(tear.wt-25)~treatment,data=TissueTearing),which=1)

There is not a lot to choose among here. It looks like the log transformation doesn’t quite get rid of the non-constant variance (although it’s not all that bad). The other two seem OK; it’s just that neither of them is very interpretable.

29.3 Removing outliers

Outliers are data points that would be unusual or unlikely under the current model and the patterns established by the other data. One standard approach to dealing with outliers is to analyze the data with, and without, the outliers. If your inference doesn’t really change, then you are OK. If your inference does change, then you are in a situation where your analysis and conclusions depend strongly on just a few of the data points. That is not good.

29.3.1 Example 7.9 Width of Vaporized Tissue

A laser scalpel cuts by vaporizing tissue. This experiment looks at how wide the cut is based on the power setting of the laser. If you just look at the analysis of variance, it does not look like the power level matters much.
> anova(vaporization.fit)
Analysis of Variance Table

Response: width
          Df Sum Sq Mean Sq F value  Pr(>F)  
power      3 1.3067 0.43558  3.1446 0.08666 .
Residuals  8 1.1081 0.13852                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But do not believe the inference until you check on the model assumptions. The model assumptions plots might be interpreted as non-constant variance, non-normality, or an outlier.
> plot(vaporization.fit,which=1:3)
Simply looking at the data shows us that there is an outlier. For the most part, widths increase linearly with power, but one of the widths at the highest power is 1.12, when we would expect it to be more than 2 based on the other data.
> with(TissueVaporization,plot(power.z,width))

> car::outlierTest(vaporization.fit)
    rstudent unadjusted p-value Bonferroni p
10 -11.24597         9.8168e-06    0.0001178

That tenth value certainly fails to follow the pattern set by the rest of the data. Maybe it is an incorrect value. Perhaps it was mistyped and should have been 2.12 instead of 1.12. But maybe some liver tissues are just tough and resistant to vaporization. We need the original investigator to go back to lab notebooks and other original sources and determine whether we have the correct value.

In the meantime, we can try refitting with the outlier removed to see if it changes our conclusions.
> out10 <- (1:12) != 10
> vaporization.fit.out10 <- lm(width~power,
+        data=TissueVaporization,subset=out10)
> plot(vaporization.fit.out10,which=1:3)

There is a hint of decreasing variance now, but if you look at Box-Cox, it does not recommend making any transformation.
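
We can confirm that by running Box-Cox on the outlier-removed fit; this is just the same diagnostic applied to the reduced data.
> car::boxCox(vaporization.fit.out10)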

Now look at the test of the single mean null:
> anova(vaporization.fit.out10)
Analysis of Variance Table

Response: width
          Df  Sum Sq Mean Sq F value    Pr(>F)    
power      3 2.05797 0.68599  82.626 7.907e-06 ***
Residuals  7 0.05812 0.00830                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing the outlier, the treatment differences are highly significant.

Compare the fitted values with and without the outlier:
> emmeans::emmeans(vaporization.fit,~power)
 power emmean    SE df lower.CL upper.CL
 5       1.14 0.215  8    0.644     1.64
 10      1.56 0.215  8    1.061     2.05
 15      1.92 0.215  8    1.424     2.42
 19      1.96 0.215  8    1.461     2.45

Confidence level used: 0.95 
> emmeans::emmeans(vaporization.fit.out10,~power)
 power emmean     SE df lower.CL upper.CL
 5       1.14 0.0526  7     1.02     1.26
 10      1.56 0.0526  7     1.43     1.68
 15      1.92 0.0526  7     1.80     2.04
 19      2.38 0.0644  7     2.22     2.53

Confidence level used: 0.95 

The first three means are the same, and the fourth mean changes a lot when the outlier is excluded. Furthermore, the standard errors shrink by a factor of 4.

With the outlier removed, there is no evidence against our assumptions. With the outlier removed, laser power is highly significant, which is what we would have anticipated from the plot of the data.

We get vastly different results based on whether a single data point is included or excluded. It is difficult to make any conclusions until we know the true status of that point.

Proceeding for the moment under the assumption that point 10 is bad, we could also explore the linearity that appears to be present in the data plot.
> vaporization.fit.out10.poly <- lm(width~power.z+I(power.z^2)+I(power.z^3),
+                                   data=TissueVaporization,subset=out10)
> anova(vaporization.fit.out10.poly)
Analysis of Variance Table

Response: width
             Df  Sum Sq Mean Sq  F value    Pr(>F)    
power.z       1 2.04093 2.04093 245.8243 1.039e-06 ***
I(power.z^2)  1 0.00784 0.00784   0.9438    0.3637    
I(power.z^3)  1 0.00921 0.00921   1.1094    0.3272    
Residuals     7 0.05812 0.00830                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The linear term is highly significant (with the outlier removed). If we refit with just the linear term, we see that the width of the cut increases about .085 mm per watt of increase in power.
> vaporization.fit.out10.linear <- lm(width~power.z,
+                                   data=TissueVaporization,subset=out10)
> coef(vaporization.fit.out10.linear)
(Intercept)     power.z 
 0.70193050  0.08499035 

29.3.2 Finding a typo

Suppose we were typing in the temperatures and made a typo in the resin lifetime data (the temperature for case 11 was entered as 175 instead of 194). What might we find? Well, the typo certainly stands out in the box plots.

> temp.oops <- ResinLifetimes$temp
> temp.oops[11] <- 175
> with(ResinLifetimes,boxplot(Time~temp.oops))
Figure 29.1: Box plot for resin lifetimes with typo

We would still see nonconstant variance.
> separate.oops <- lm(Time~temp.oops,data=ResinLifetimes)
> plot(separate.oops,which=1)
Figure 29.2: Residual plot for resin lifetimes with typo

Box-Cox would still choose the log.
> car::boxCox(separate.oops)
Figure 29.3: Box-Cox plot for resin lifetimes with typo

Refitting with logged data we would see more constant variance, although the outlier is still visible.
> separate.oops.log <- lm(logTime~temp.oops,data=ResinLifetimes)
> plot(separate.oops.log,which=1)
Figure 29.4: Residual plot for logged resin lifetimes with typo

The outlier test agrees that it’s an outlier.
> car::outlierTest(separate.oops.log)
    rstudent unadjusted p-value Bonferroni p
11 -5.447548         5.9509e-06   0.00022018

At this stage we would examine our data and discover the typo, fix it, and then go about our business.

29.4 Modified ANOVA

The Welch version of the two-sample t-test (and its confidence intervals) is well known to work well when the two groups have unequal variances. There are a couple of methods that generalize the Welch test to more than two groups. The good news is that these tests have much more accurate p-values than ordinary ANOVA when the data have non-constant variance. The bad news is that the procedures do not generalize easily or handle inference beyond the test of the single mean null hypothesis.

The brown.forsythe.test() function in cfcdae implements the Brown-Forsythe test. The oneway.test() function in base R implements a different approximation to the p-value.

29.4.1 Example 6.10 Resin Lifetimes

Let’s illustrate these alternative forms of ANOVA using the resin lifetime data. First, recall what we obtained using standard methods:
> anova(lifetimes.fit)
Analysis of Variance Table

Response: Time
          Df  Sum Sq Mean Sq F value    Pr(>F)    
temp       4 28027.9  7007.0  101.81 < 2.2e-16 ***
Residuals 32  2202.4    68.8                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now with Brown-Forsythe:
> brown.forsythe.test(Time~temp,data=ResinLifetimes)

    One-way analysis of means, Brown-Forsythe (unequal variances)

data:  Time and temp
F = 111.74, num df = 4.000, denom df = 18.298, p-value = 1.36e-12
and the modified Welch:
> oneway.test(Time~temp,data=ResinLifetimes)

    One-way analysis of means (not assuming equal variances)

data:  Time and temp
F = 68.943, num df = 4.000, denom df = 14.373, p-value = 3.273e-09

All the methods have very low p-values. One noticeable difference is that the alternative methods use far fewer denominator degrees of freedom.

29.5 Robust methods

There is a class of statistical techniques called robust methods that assume that the errors arise from a distribution that is symmetric but has longer tails than the normal distribution and is thus outlier-prone (robust methods also retain the independence and constant variability assumptions). Roughly speaking, robust methods automatically and internally determine which points appear to be outliers and downweight those points.

The MASS package includes a function rlm() that does robust linear models. In addition to estimating the parameters in the model for the means, it also produces a robust estimate of the scale of the statistical errors. Inferential difficulties arise because we don’t know an equivalent degrees of freedom for that estimate of error.

The summary() function works fine for an rlm() fit. The built-in anova() runs, but it doesn’t estimate an error degrees of freedom, so it doesn’t produce useful output. Instead, use Anova() in the car package. car::Anova() uses the error degrees of freedom from the standard linear model; this is probably still an overestimate of the error degrees of freedom. The functions in package emmeans also work with rlm() fits. These functions generally default to infinite degrees of freedom for confidence intervals, but you can specify some other error degrees of freedom.

29.5.1 Example 6.11 Tissue Vaporization

Use MASS::rlm() to fit the separate means model:
> vaporization.rfit <- MASS::rlm(width~power,data=TissueVaporization)
The residual plots include the outlier, but it is actually much more obvious as an outlier in the robust fit than in the ordinary fit.
> plot(vaporization.rfit,which=1:3)
The usual anova() function is not helpful; instead, use car::Anova():
> anova(vaporization.rfit)
Analysis of Variance Table

Response: width
          Df Sum Sq Mean Sq F value Pr(>F)
power      3 1.9657 0.65523               
Residuals    1.5379                       
> car::Anova(vaporization.rfit)
Analysis of Deviance Table (Type II tests)

Response: width
          Df     F    Pr(>F)    
power      3 79.32 2.718e-06 ***
Residuals  8                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

car::Anova() is using the same error degrees of freedom as the full-data linear model. This is probably a little too big. In any case, the p-value is quite small and similar to what we obtained after deleting the outlier.

The fitted values in the individual means model are very similar to what we observed above after deleting the outlier:
> emmeans::emmeans(vaporization.rfit,~power)
 power emmean     SE df asymp.LCL asymp.UCL
 5       1.14 0.0573 NA      1.03      1.25
 10      1.55 0.0573 NA      1.44      1.66
 15      1.92 0.0573 NA      1.81      2.03
 19      2.34 0.0573 NA      2.22      2.45

Confidence level used: 0.95 
Note that the degrees of freedom are listed as missing; that means it is using the large sample approximation. We can tell it to use fewer degrees of freedom (say 8).
> emmeans::emmeans(vaporization.rfit,~power,df=8)
 power emmean     SE df lower.CL upper.CL
 5       1.14 0.0573  8     1.01     1.27
 10      1.55 0.0573  8     1.42     1.68
 15      1.92 0.0573  8     1.79     2.05
 19      2.34 0.0573  8     2.20     2.47

Degrees-of-freedom method: user-specified 
Confidence level used: 0.95 

Now the intervals are wider.

We can retrieve the weights that rlm() used to downweight the outlier:
> vaporization.rfit$w 
 [1] 1.00000000 1.00000000 1.00000000 0.49883606 0.56931858 1.00000000 1.00000000 1.00000000
 [9] 1.00000000 0.06564224 1.00000000 1.00000000

Somewhat surprisingly, cases 4 and 5 are also downweighted. This makes some sense, as the responses for the second power level are more spread out than the other non-outlier responses.

If we fit a linear model using case weights that are what rlm() gives us, we will get the same thing as the rlm() fit. Let’s tabulate the coefficients for the rlm() fit, the weighted lm() fit, the lm() fit with the outlier removed, and the lm() fit with the full data set.
> mytable <- rbind(vaporization.rfit$coefficients,
+   lm(width~power,data=TissueVaporization,weights=vaporization.rfit$w)$coefficients,
+   vaporization.fit.out10$coefficients,
+   vaporization.fit$coefficients)
> rownames(mytable) <- c("rlm","wtd","removed","orig")
> mytable
        (Intercept)     power1      power2    power3
rlm        1.736293 -0.5962928 -0.18624002 0.1837072
wtd        1.736293 -0.5962928 -0.18624002 0.1837072
removed    1.747917 -0.6079167 -0.19125000 0.1720833
orig       1.643333 -0.5033333 -0.08666667 0.2766667

The first two are exactly the same; the coefficients with the outlier removed are very similar (with the differences being a small fraction of a standard error of the coefficients); and the linear model including all the data is very different.

29.6 Modeling the variance

The gls() function in package nlme can fit models with different variance structures via its weights argument. The value assigned to weights is called a variance function, and there are several options in nlme. The ones we will consider are:

  • weights=varIdent(form=~1|foo) where foo is a factor. This says to fit a different variance for each level of factor foo.
  • weights=varPower(form=~fitted(.)). This says to make the standard deviation be proportional to a power of the fitted values, with the power to be estimated.
  • weights=varPower(form=~fitted(.),fixed=delta). This says to make the standard deviation be proportional to the fitted values raised to the fixed power delta (a brief sketch follows this list).
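
The fixed-power variant does not appear in the examples below, so here is a minimal sketch; the model name is ours, and we arbitrarily fix the power at 1, making the standard deviation proportional to the fitted mean.
> library(nlme)
> lifetimes.varfixed <- gls(Time~temp,data=ResinLifetimes,
+                           weights=varPower(form=~fitted(.),fixed=1))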

29.6.1 Example 6.12 Resin Lifetimes

Let’s fit the two variance models to the resin lifetime data.
> library(nlme)

Attaching package: 'nlme'
The following object is masked from 'package:bcfcdae':

    fixef
> lifetimes.varsep <- gls(Time~temp,data=ResinLifetimes,weights=varIdent(form=~1|temp))
> lifetimes.varpower <- gls(Time~temp,data=ResinLifetimes,weights=varPower(form=~fitted(.)))
The plot() method plots standardized residuals against fitted values. For the separate variances model, we see an even spread all the way across.
> plot(lifetimes.varsep)
For the power variance model, we see residual spreads much more stable than for the original model, but not as stable as for the separate variances model.
> plot(lifetimes.varpower)
Normality is a little questionable for the standardized residuals in the separate variances model (type="pearson" divides the residuals by their estimated standard deviations):
> qqnorm(residuals(lifetimes.varsep,type="pearson"))
Does one fit better than the other? Compare their AIC values.
> AIC(lifetimes.varsep);AIC(lifetimes.varpower)
[1] 237.2421
[1] 237.1706

There is really nothing to choose between these in terms of AIC. The separate variances model has a higher likelihood but uses more parameters, and they balance out very evenly.

Note: both lifetimes.varsep and lifetimes.varpower were fit using a method denoted REML (you will see that in the summary() below). REML AIC values can be used to compare two models with the same mean structure and different variance structures, so the comparison above is OK. You cannot compare REML AIC values with non-REML AIC values. Because models fit by lm() do not produce a REML likelihood, you cannot compare lifetimes.varsep directly to lifetimes.fit. Instead, you must either refit the constant variance model using gls() (which will default to REML) instead of lm(), or you must refit the gls() models using a method="ML" argument.
> lifetimes.varsep.ML <- gls(Time~temp,data=ResinLifetimes,
+                            weights=varIdent(form=~1|temp),
+                            method="ML")
> AIC(lifetimes.varsep.ML)
[1] 249.94
> AIC(lifetimes.fit)
[1] 268.1971

Clearly, fitting separate variances is better than fitting a single variance.

The summary for the GLS fit now contains additional information about how the variance was modeled.
> summary(lifetimes.varsep)
Generalized least squares fit by REML
  Model: Time ~ temp 
  Data: ResinLifetimes 
       AIC      BIC    logLik
  237.2421 251.8995 -108.6211

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | temp 
 Parameter estimates:
      175       194       213       231       250 
1.0000000 0.7339797 0.4918990 0.1329593 0.2839912 

Coefficients:
                Value Std.Error    t-value p-value
(Intercept)  36.41900  1.272706  28.615413  0.0000
temp1        50.00519  3.788275  13.199988  0.0000
temp2         7.14083  2.911776   2.452398  0.0198
temp3       -11.89934  2.168016  -5.488587  0.0000
temp4       -20.70035  1.370035 -15.109355  0.0000

 Correlation: 
      (Intr) temp1  temp2  temp3 
temp1  0.544                     
temp2  0.180 -0.445              
temp3 -0.215 -0.445 -0.268       
temp4 -0.880 -0.522 -0.189  0.171

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-1.8045963 -0.6400941 -0.1057086  0.3666410  1.9257397 

Residual standard error: 13.02881 
Degrees of freedom: 37 total; 32 residual

gls() estimates a residual standard error for the first stratum (shown near the bottom) and then provides multipliers for the standard deviations of the other strata relative to that first stratum.

In terms of testing the null hypothesis of a single mean, we can use anova() or car::Anova(). anova() forms tests using the same denominator degrees of freedom as the lm() model; this is likely optimistic (that is, too large). car::Anova() recognizes that it doesn’t know the denominator degrees of freedom and uses infinite denominator degrees of freedom; this is also too large, but at least it tells us it is doing the large sample approximation by indicating a chi-squared test rather than an F-test.
> anova(lifetimes.varsep)
Denom. DF: 32 
            numDF  F-value p-value
(Intercept)     1 959.0707  <.0001
temp            4  78.5368  <.0001
> car::Anova(lifetimes.varsep)
Analysis of Deviance Table (Type II tests)

Response: Time
     Df  Chisq Pr(>Chisq)    
temp  4 314.15  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To illustrate the degrees of freedom problem, let’s redo the analysis using just temperatures 175 and 250 with the separate variances model.
> use <- with(ResinLifetimes,(temp==175 | temp==250))
> lifetimes.firstlast <- gls(Time~temp,data=ResinLifetimes,
+         subset=use,
+         weights= varIdent(form=~1|temp))
> summary(lifetimes.firstlast)
Generalized least squares fit by REML
  Model: Time ~ temp 
  Data: ResinLifetimes 
  Subset: use 
       AIC      BIC    logLik
  96.33582 98.27545 -44.16791

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | temp 
 Parameter estimates:
      175       250 
1.0000000 0.2839908 

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 49.14843  2.423867 20.27687       0
temp1       37.27576  2.423867 15.37863       0

 Correlation: 
      (Intr)
temp1 0.806 

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-1.3815526 -0.5011049 -0.1774938  0.3040356  1.7824815 

Residual standard error: 13.02882 
Degrees of freedom: 14 total; 12 residual

We get a t-statistic of 15.38 for the group differences, and when you do anova() it’s going to use 12 degrees of freedom for error.

Alternatively, we could have used the Welch version of the t-test. We know that it works really well for two groups.
> with(ResinLifetimes,t.test(Time[1:8],Time[32:37]))

    Welch Two Sample t-test

data:  Time[1:8] and Time[32:37]
t = 15.379, df = 8.4496, p-value = 1.795e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 63.47533 85.62771
sample estimates:
mean of x mean of y 
 86.42419  11.87267 

We get the same t-statistic, but the t-test is using 8.5 degrees of freedom. This is much more realistic, because the variance of the difference is dominated by the larger variance, and it is estimated with only 7 degrees of freedom. For this test, the difference in degrees of freedom doesn’t matter, because the p-values are tiny either way, but it could matter for some data sets.

The Welch t-test uses the Satterthwaite approximation to get degrees of freedom. Functions in emmeans also use the Satterthwaite approximation for more general contrasts. This gives us more confidence that the reported p-values and confidence levels will be realistic.
> library(emmeans)
> lifetimes.emmsep <- emmeans::emmeans(lifetimes.varsep,~temp)
> lifetimes.emmsep
 temp emmean    SE   df lower.CL upper.CL
 175    86.4 4.606 6.99    75.53     97.3
 194    43.6 3.381 6.99    35.56     51.6
 213    24.5 2.266 7.00    19.16     29.9
 231    15.7 0.655 6.00    14.12     17.3
 250    11.9 1.511 5.00     7.99     15.8

Degrees-of-freedom method: satterthwaite 
Confidence level used: 0.95 

When we just estimate group means in the separate variances model, the degrees of freedom in each group are \(n_i-1\), as we would expect.

Things are much more interesting if we estimate treatment means using the power model of variances:
> emmeans::emmeans(lifetimes.varpower,~temp)
 temp emmean   SE   df lower.CL upper.CL
 175    86.4 5.14  9.2    74.84     98.0
 194    43.6 3.03 24.5    37.32     49.8
 213    24.5 1.94 29.3    20.55     28.5
 231    15.7 1.47 16.5    12.60     18.8
 250    11.9 1.28 10.9     9.05     14.7

Degrees-of-freedom method: satterthwaite 
Confidence level used: 0.95 

The standard errors are a little different, particularly at temperature 231, but the really interesting thing is the equivalent degrees of freedom, which peak in the middle and are lower at either end. This reflects the fact that when we assume the variance is proportional to some power of the fitted values, the variance is better estimated (with more equivalent degrees of freedom) in the middle of the range.

More general contrasts give us a variety of degrees of freedom:
> contrast(emmeans(lifetimes.varsep,~temp),method="pairwise",adjust="holm")
 contrast          estimate   SE    df t.ratio p.value
 temp175 - temp194    42.86 5.71 12.84   7.502  <.0001
 temp175 - temp213    61.90 5.13 10.19  12.059  <.0001
 temp175 - temp231    70.71 4.65  7.27  15.197  <.0001
 temp175 - temp250    74.55 4.85  8.44  15.379  <.0001
 temp194 - temp213    19.04 4.07 12.22   4.678  0.0020
 temp194 - temp231    27.84 3.44  7.52   8.084  0.0003
 temp194 - temp250    31.69 3.70  9.53   8.557  0.0001
 temp213 - temp231     8.80 2.36  8.15   3.731  0.0112
 temp213 - temp250    12.65 2.72 11.44   4.644  0.0020
 temp231 - temp250     3.85 1.65  6.85   2.336  0.0529

Degrees-of-freedom method: satterthwaite 
P value adjustment: holm method for 10 tests 
> contrast(emmeans(lifetimes.varsep,~temp),list(temp=c(0,0,1,-2,1)))
 contrast estimate   SE   df t.ratio p.value
 temp         4.96 3.02 15.7   1.640  0.1209

Degrees-of-freedom method: satterthwaite 

No evidence of curvature among the last three levels of temperature.

29.6.2 Example 6.13 Tearing Tissues

Let’s go back to the tearing tissues data. There are four treatments:
> levels(TissueTearing$treatment)
[1] "A_dry" "A_wet" "B_dry" "B_wet"
Suppose that we are interested in the contrasts A versus B, wet versus dry, and whether the brand difference is the same for wet and dry tissues. The following contrast coefficients will do the trick:
> tissues.cfs <- cbind(c(.5,.5,-.5,-.5),c(.5,-.5,.5,-.5),c(.5,-.5,-.5,.5))
> tissues.cfs
     [,1] [,2] [,3]
[1,]  0.5  0.5  0.5
[2,]  0.5 -0.5 -0.5
[3,] -0.5  0.5 -0.5
[4,] -0.5 -0.5  0.5
We want estimates on the original scale, so our previous transformations to stabilize the variance will not do what we need. Instead, we model the variance separately in each group.
> tissues.varsep <- gls(tear.wt~treatment,data=TissueTearing,weights=varIdent(form=~1|treatment))
The separate variances model gives us a good standardized residuals plot.
> plot(tissues.varsep)
The separate variances model is strongly preferred over the constant variance model.
> AIC(tissues.varsep)
[1] 441.5931
> AIC(gls(tear.wt~treatment,data=TissueTearing)) # refit using REML
[1] 467.046
When we look at the parameter estimates for the standard deviations by stratum, we see that the standard deviations are larger for A than B, and larger for dry than wet.
> summary(tissues.varsep)
Generalized least squares fit by REML
  Model: tear.wt ~ treatment 
  Data: TissueTearing 
       AIC      BIC    logLik
  441.5931 461.7674 -212.7965

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | treatment 
 Parameter estimates:
    A_dry     A_wet     B_dry     B_wet 
1.0000000 0.4269964 0.6777834 0.3354123 

Coefficients:
               Value Std.Error   t-value p-value
(Intercept) 74.18750 0.2721713 272.57649       0
treatment1  25.10417 0.6417957  39.11551       0
treatment2  -2.68750 0.3683360  -7.29633       0
treatment3   3.97917 0.4788220   8.31032       0

 Correlation: 
           (Intr) trtmn1 trtmn2
treatment1  0.543              
treatment2 -0.432 -0.531       
treatment3  0.027 -0.561 -0.195

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.2594011 -0.7938436 -0.0610649  0.8723702  2.1625633 

Residual standard error: 4.026857 
Degrees of freedom: 96 total; 92 residual
OK, now look at the contrasts:
> tissues.emm <- emmeans(tissues.varsep,~treatment)
> contrast(tissues.emm,list(treatment=tissues.cfs))
 contrast    estimate    SE   df t.ratio p.value
 treatment.1    22.42 0.544 56.3  41.181  <.0001
 treatment.2    29.08 0.544 56.3  53.428  <.0001
 treatment.3    -1.29 0.544 56.3  -2.373  0.0211

Degrees-of-freedom method: satterthwaite 

Brand A has a response about 22.42 higher than brand B, and dry has a response about 29.08 higher than wet; there is strong evidence that both of these differences are real. On the other hand, there is only marginal evidence that the brand difference changes between wet and dry.

29.7 Modeling dependence

There are many models of dependence, but we will only look at the simplest model for time dependence. The autoregressive model of order 1 (AR1) says that data values k time steps apart have a correlation of \(\rho^k\). It is also possible to have blocks of data that have temporal dependence within the block but independence between blocks.
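
For instance, with a hypothetical \(\rho=.5\), four consecutive observations would have correlation .5 at lag 1, .25 at lag 2, and .125 at lag 3:
> rho <- .5  # hypothetical autocorrelation
> rho^abs(outer(1:4,1:4,FUN='-'))  # implied AR1 correlation matrix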

Please note, a model with AR1 dependence only makes sense if there is a time order (or, potentially, some other linear order that is indicative of correlation) to the data. Do not try to use AR1 models on data that lack a time order.

In R, you can fit autocorrelated data using the gls function from package nlme. You must set the optional correlation argument to indicate the type of correlation present. We only use the simple AR1 type correlation.

You need some variable x that indicates time. Setting

correlation=corAR1(form=~x)

says to fit the autocorrelation across time x. You can also indicate groups, where the data are independent between groups but have AR1 correlation within groups using

correlation=corAR1(form=~x|group)

where group is a factor that indicates the different groups.

29.7.1 Example 6.14 Thermocouples

The thermocouples data consists of 64 consecutive measurements of a temperature difference. Plotting the data in time order we saw that high values tend to clump and low values tend to clump. That is a visual indication of autocorrelation, but you can also use Durbin-Watson or other tests to quantify it.

We just want to fit the mean, so ignoring correlation we can do:
> Oven.ols <- lm(temp.diff~1,data=Thermocouples)
We have 64 consecutive data points, so we could set the time variable via
> mytime <- 1:64
To fit the AR1 model,
> Oven.ar1 <- gls(temp.diff~1,data=Thermocouples,cor=corAR1(form=~mytime))

That’s all there is to it.

AIC prefers the AR1 model, but the difference is not overwhelming.
> AIC(Oven.ar1)
[1] -278.9285
> AIC(gls(temp.diff~1,data=Thermocouples)) # for REML
[1] -277.3358
What did we gain? First, look at the fit where we ignored autocorrelation.
> summary(Oven.ols)

Call:
lm(formula = temp.diff ~ 1, data = Thermocouples)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.09125 -0.01125  0.00375  0.00875  0.04875 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.141250   0.003138    1001   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0251 on 63 degrees of freedom

The standard error of the estimated mean is .00314.

Now look at the AR1 fit:
> summary(Oven.ar1)
Generalized least squares fit by REML
  Model: temp.diff ~ 1 
  Data: Thermocouples 
        AIC       BIC   logLik
  -278.9285 -272.4991 142.4643

Correlation Structure: AR(1)
 Formula: ~mytime 
 Parameter estimate(s):
     Phi 
0.243579 

Coefficients:
               Value   Std.Error  t-value p-value
(Intercept) 3.141487 0.004027105 780.0855       0

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-3.6228911 -0.4548726  0.1391309  0.3371320  1.9211413 

Residual standard error: 0.02525238 
Degrees of freedom: 64 total; 63 residual

Now we see that the standard error of the estimated mean is .00403, or about 28% larger. That means we need longer confidence intervals to get the correct coverage; the shorter confidence intervals coming from the fit assuming independence will not have the desired coverage.

When the data are dependent, each additional data point does not give a full unit’s worth of new information. The ratio of variances is
> (.003138/.004027)^2
[1] 0.6072152

Thus, accounting for correlation, it is as if we had only \(.61 \times 64 \approx 39\) data points.

29.7.2 Example 6.15 Cloud Backup Times

The cloud backup data show non-constant variance, and Box-Cox suggests a log transformation.
> plot(backup.fit,which=1)

> car::boxCox(backup.fit)
Variance looks better on the log scale:
> backup.logfit <- lm(log(updowntime)~service,data=CloudBackup)
> plot(backup.logfit,which=1)
but there is evidence of autocorrelation:
> plot(residuals(backup.logfit),type='b')
> abline(h=0)
This autocorrelation is a manifestation of fast and slow times on my home internet connection. The Durbin-Watson test confirms that the autocorrelation is statistically significant and large enough to worry about (DW below 1.5).
> car::durbinWatsonTest(backup.logfit)
 lag Autocorrelation D-W Statistic p-value
   1         0.44276       1.06326   0.002
 Alternative hypothesis: rho != 0
Let’s fit the AR1 model. The data are in time order.
> mytime <- 1:30
> backup.logar1 <- gls(log(updowntime)~service,data=CloudBackup,
+                      correlation=corAR1(form=~mytime))
AIC strongly prefers the AR1 model fit:
> AIC(backup.logar1)
[1] -26.18125
> AIC(gls(log(updowntime)~service,data=CloudBackup))
[1] -19.74796
The estimate of \(\rho\) is .56, which is rather large.
> summary(backup.logar1)
Generalized least squares fit by REML
  Model: log(updowntime) ~ service 
  Data: CloudBackup 
        AIC       BIC   logLik
  -26.18125 -19.70206 18.09062

Correlation Structure: AR(1)
 Formula: ~mytime 
 Parameter estimate(s):
     Phi 
0.562057 

Coefficients:
                Value  Std.Error   t-value p-value
(Intercept)  5.579809 0.04224382 132.08581       0
service1    -0.483829 0.02136900 -22.64164       0
service2    -0.138539 0.02482215  -5.58126       0

 Correlation: 
         (Intr) servc1
service1 -0.012       
service2  0.023 -0.522

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.0350429 -0.8235312  0.1645706  0.8880758  1.4037069 

Residual standard error: 0.1276129 
Degrees of freedom: 30 total; 27 residual
Here are the estimated treatment means with standard errors fit assuming independence and AR1:
> emmeans(backup.logfit,~service)
 service emmean     SE df lower.CL upper.CL
 1         5.07 0.0387 27     4.99     5.15
 2         5.45 0.0387 27     5.37     5.53
 3         6.22 0.0387 27     6.14     6.30

Results are given on the log (not the response) scale. 
Confidence level used: 0.95 
> emmeans(backup.logar1,~service)
 service emmean     SE   df lower.CL upper.CL
 1         5.10 0.0471 5.01     4.97     5.22
 2         5.44 0.0495 5.61     5.32     5.56
 3         6.20 0.0477 5.24     6.08     6.32

Degrees-of-freedom method: satterthwaite 
Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

The estimated means are not exactly the same in the two models, and the standard errors for the estimates are greater (and with fewer degrees of freedom) in the AR1 model and also not all the same.

The pattern of the treatments affects the standard errors of estimated means and differences of means. Here the treatments were randomized, so the three different services are intermingled over time. The positive autocorrelation increases the standard error of means, with more increase if the observations are closer in time and less increase if the observations are farther apart in time. Here a typical ratio of standard errors is \(.0477/.0387=1.23\), or a 23% increase in the standard error. If we had not randomized and had instead run each service for ten consecutive time slots, then the ratio of AR1 model standard errors of means to independence model standard errors of means would be 1.72:
> sqrt(sum(.56^abs(outer(1:10,1:10,FUN='-')))/10)
[1] 1.722991

(No, I don’t expect you to understand where that came from.)

Because the treatments are intermingled and the autocorrelation is positive, the standard errors for differences of means are smaller under the AR1 model than they are under the independence model:
> confint(pairs(emmeans(backup.logar1,~service)))
 contrast            estimate     SE   df lower.CL upper.CL
 service1 - service2   -0.345 0.0403 21.1   -0.447   -0.244
 service1 - service3   -1.106 0.0365 19.2   -1.199   -1.013
 service2 - service3   -0.761 0.0426 22.0   -0.868   -0.654

Degrees-of-freedom method: satterthwaite 
Results are given on the log (not the response) scale. 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates 
> confint(pairs(emmeans(backup.logfit,~service)))
 contrast            estimate     SE df lower.CL upper.CL
 service1 - service2   -0.383 0.0547 27   -0.518   -0.247
 service1 - service3   -1.156 0.0547 27   -1.292   -1.021
 service2 - service3   -0.773 0.0547 27   -0.909   -0.638

Results are given on the log (not the response) scale. 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates 

The differences are highly significant under both models.

The ratio of backup times for service 1 and service 2 is \(\exp(-.345) = .71\), with a confidence interval from \(\exp(-.447)\) to \(\exp(-.244)\), or (.64, .78). The difference between service 1 and service 3 is even greater.

Having estimates of the ratios of backup times is probably adequate, but suppose we wanted results on the original (time) scale. We could instead fit a model on the original scale with different variances by treatment group and AR1 correlation.
> backup.separ1 <- gls(updowntime~service,data=CloudBackup,
+                      correlation=corAR1(form=~mytime),
+                      weights=varIdent(form=~1|service))
> confint(pairs(emmeans(backup.separ1,~service)))
 contrast            estimate   SE    df lower.CL upper.CL
 service1 - service2    -65.5  7.4 12.84    -85.1    -45.9
 service1 - service3   -332.4 19.9  3.93   -403.7   -261.0
 service2 - service3   -266.9 20.1  4.76   -333.4   -200.4

Degrees-of-freedom method: satterthwaite 
Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates 

On the original scale, we estimate service 1 to be 65.5 seconds faster than service 2 (confidence interval (45.9 to 85.1)).

29.8 Using other distributions: Generalized Linear Models

Generalized linear models allow us to fit mean structures to data from non-normal distributions such as binomial, Poisson, or gamma. The mean of data point \(i\) is \(\mu_i\). There is a function \(g()\) called the link function, and we assume that \(g(\mu_i)\) can be described with some kind of linear model such as separate means, polynomial, etc. For each distribution there is a canonical link function; using the canonical link makes some things simpler, but it is not required.

Many distributions have a dispersion parameter \(\phi\); for example, the dispersion of the normal distribution is its variance, and the dispersion of the gamma distribution is the reciprocal of its shape parameter. For the binomial and Poisson, the variability is determined by the mean and we take \(\phi=1\).

The deviance for a model is twice the difference in log likelihood between a model of interest and the saturated model (the saturated model has a rich enough structure to compute a separate mean for each response). For models where \(\phi\) must be fit, the deviance is twice the difference in log likelihood between a model of interest and the saturated model multiplied by \(\phi\). We estimate \(\phi\) as the residual deviance divided by the residual degrees of freedom.
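
In R terms, a sketch of that estimate for a fitted GLM (fit is a placeholder name):
> deviance(fit)/df.residual(fit)  # residual deviance over residual degrees of freedom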

The analysis of deviance partitions the deviance in analogy with how ANOVA partitions sums of squares. Each time we make the mean structure richer, we use additional parameters and we reduce the deviance. In models where \(\phi=1\) (binomial and Poisson), terms can be tested by treating the incremental deviance for the term as being distributed under the null as chi-squared with degrees of freedom equal to the number of incremental parameters for the term. In fact, this is the likelihood ratio test. When \(\phi\) must be estimated, form a test statistic for a term by dividing its incremental deviance by its incremental degrees of freedom, and then dividing that ratio by our estimate of \(\phi\). Treat that ratio as having an F distribution under the null, with numerator and denominator degrees of freedom taken as the incremental degrees of freedom for the term and the residual degrees of freedom used to estimate \(\phi\).

In R, the glm() function fits generalized linear models. The family argument determines the type of distribution to use. By default, R will use the canonical link, but you can specify other link functions. The summary() function works as you would expect, and anova() applied to a GLM produces an analysis of deviance.

29.8.1 Example 6.16 Pea Germination

There are eight batches of 20 seeds. Each batch is randomly assigned to one of four treatments, and the response is the number of seeds in a batch that germinate. This can reasonably be modeled as binomial with \(N=20\) and \(p\) depending on the treatments.

With a binomial family, you specify the response as a two column matrix, with the first column being the number of successes and the second column being the number of failures. Thus
> peas.glm <- glm(cbind(germinated,nongerminated)~treatment,
+                 data=PeaGermination,family=binomial)
Here is what we get in the summary:
> summary(peas.glm)

Call:
glm(formula = cbind(germinated, nongerminated) ~ treatment, family = binomial, 
    data = PeaGermination)

Deviance Residuals: 
      1        2        3        4        5        6        7        8  
 0.3292  -0.7039  -0.3292   0.4966  -0.3487   0.7329   0.3487  -0.4809  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.3945     0.2077   1.900   0.0575 .  
treatment1    0.3363     0.3164   1.063   0.2878    
treatment2    0.4528     0.3204   1.413   0.1576    
treatment3   -2.3405     0.3968  -5.899 3.66e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 58.5202  on 7  degrees of freedom
Residual deviance:  1.9703  on 4  degrees of freedom
AIC: 33.856

Number of Fisher Scoring iterations: 4

The linear predictor for the success probability is on the logit scale: \(logit(p) = \log(p/(1-p))\). You can invert the logit via \(p = \exp(logit)/(1+\exp(logit))\). An increase in the linear predictor of \(\alpha\) means that the odds \(p/(1-p)\) have increased by a factor of \(\exp(\alpha)\). Thus the odds for success in treatment 1 are \(\exp(.34 - (-2.34)) = 14.6\) times as large as for treatment 3.
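
We can reproduce those computations from the coefficients above; plogis() is R’s built-in inverse logit.
> plogis(0.3945)           # estimated success probability at the intercept
> exp(0.3363 - (-2.3405))  # odds ratio of treatment 1 relative to treatment 3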

The deviance residuals say something about how far each observation is from the fitted value. The sum of squares of the deviance residuals will be the residual deviance, here, 1.97. residuals(fit,type="deviance") also produces the deviance residuals.
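
We can check that directly:
> sum(residuals(peas.glm,type="deviance")^2)  # reproduces the residual deviance, 1.97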

We get the analysis of deviance via anova():
> anova(peas.glm)
Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(germinated, nongerminated)

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                          7      58.52
treatment  3    56.55         4       1.97
The change of deviance is 56.5 with 3 degrees of freedom. The (asymptotically approximate) p-value is
> pchisq(56.55,3,lower.tail = FALSE)
[1] 3.206016e-12

so the treatment effects are highly significant.
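
Equivalently, anova() will append the likelihood-ratio p-value itself if we ask for a chi-squared test:
> anova(peas.glm,test="Chisq")  # adds a Pr(>Chi) column matching the pchisq() value above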

We can get estimated treatment means on the linear predictor scale in the usual way:
> emmeans(peas.glm,~treatment)
 treatment emmean    SE  df asymp.LCL asymp.UCL
 soil_12    0.731 0.338 Inf    0.0692      1.39
 soil_24    0.847 0.345 Inf    0.1710      1.52
 towel_12  -1.946 0.478 Inf   -2.8830     -1.01
 towel_24   1.946 0.478 Inf    1.0089      2.88

Results are given on the logit (not the response) scale. 
Confidence level used: 0.95 
as well as the usual such as pairwise comparisons:
> pairs(emmeans(peas.glm,~treatment))
 contrast            estimate    SE  df z.ratio p.value
 soil_12 - soil_24     -0.116 0.483 Inf  -0.241  0.9951
 soil_12 - towel_12     2.677 0.585 Inf   4.574  <.0001
 soil_12 - towel_24    -1.215 0.585 Inf  -2.076  0.1609
 soil_24 - towel_12     2.793 0.590 Inf   4.738  <.0001
 soil_24 - towel_24    -1.099 0.590 Inf  -1.863  0.2440
 towel_12 - towel_24   -3.892 0.676 Inf  -5.756  <.0001

Results are given on the log odds ratio (not the response) scale. 
P value adjustment: tukey method for comparing a family of 4 estimates 

This shows that the towel-12 treatment is lower than the other three, which cannot really be distinguished. The “infinite” degrees of freedom indicate that we are using an asymptotic approximate distribution for critical values.

We can back transform them using regrid():
> regrid(emmeans(peas.glm,~treatment))
 treatment  prob     SE  df asymp.LCL asymp.UCL
 soil_12   0.675 0.0741 Inf    0.5299     0.820
 soil_24   0.700 0.0725 Inf    0.5580     0.842
 towel_12  0.125 0.0523 Inf    0.0225     0.227
 towel_24  0.875 0.0523 Inf    0.7725     0.977

Confidence level used: 0.95 

By default, regrid() back transforms based on the link function.

29.8.2 Example 6.17 Cloud Seeding, continued

The rainfall amounts in the cloud seeding data were asymmetrically distributed with a long tail to the right. In a previous example we used a log transformation of the data; that both equalized the variances and made the data more normally distributed. Suppose instead that we want to do the inference on the rainfall scale rather than the log of rainfall scale. We will need a technique that handles both the non-normal distribution and the non-constant variance.

Gamma distributions are distributions for positive data that have long tails to the right. For gamma distributions with the same shape parameter, the variance is proportional to the square of the mean. If the variance divided by the square of the mean is (roughly) the same in the seeded and unseeded clouds, that suggests that they might both follow a gamma distribution with the same shape. That is approximately the case for these data:
> var(nonseeded)/mean(nonseeded)^2
[1] 2.861687
> var(seeded)/mean(seeded)^2
[1] 2.168022
We can fit a generalized linear model specifying a gamma distribution instead of a normal distribution. The default link function for a gamma GLM is the reciprocal, but with only two groups we can also work with an identity link:
> cloud.glmfit <- glm(rainfall~seeding,data=CloudSeeding,
+                     family=Gamma(link="identity"))
> summary(cloud.glmfit)

Call:
glm(formula = rainfall ~ seeding, family = Gamma(link = "identity"), 
    data = CloudSeeding)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.86689  -1.42422  -0.73570  -0.02294   2.93868  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   303.29      73.34   4.135 0.000135 ***
seeding1     -138.70      73.34  -1.891 0.064406 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 2.514854)

    Null deviance: 119.85  on 51  degrees of freedom
Residual deviance: 107.65  on 50  degrees of freedom
AIC: 682.33

Number of Fisher Scoring iterations: 3

We can get an estimate of the difference in means as \(2 \times -138.7 = -277.4\) with a standard error of \(2 \times 73.34 = 146.68\). (With the sum-to-zero parameterization used here, the seeding1 coefficient is the first group's deviation from the grand mean, so the difference between the two group means is twice the coefficient.) An approximate 95% confidence interval for the difference runs from \(-277.4 - 2 \times 146.68 = -570.8\) up to \(-277.4 + 2 \times 146.68 = 16.0\). The reported p-value is .064; technically, this is for the “Wald test,” based on the estimate divided by its standard error.
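Carrying out that arithmetic in R (just a check using the rounded values above, not new output from the model):
> -277.4 + c(-2,2)*146.68
[1] -570.76   15.96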

A generally more accurate p-value comes from the analysis of deviance:
> anova(cloud.glmfit)
Analysis of Deviance Table

Model: Gamma, link: identity

Response: rainfall

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev
NULL                       51     119.85
seeding  1   12.201        50     107.65
The incremental deviance is 12.2 with 1 degree of freedom. To form the test statistic, divide the incremental deviance by its degrees of freedom and then by the dispersion parameter; treat the result as F-distributed with degrees of freedom taken from the incremental and residual deviances:
> pf(12.2/1/2.51,1,50,lower=FALSE)
[1] 0.03210672

The p-value of .03 should be a little more accurate than the .06 from the Wald test.

As a diagnostic, we should look at how well the data conform to the gamma distribution. The seeded and unseeded clouds have different means, so we should rescale them before combining them.
> rescaled <- c(nonseeded/mean(nonseeded),seeded/mean(seeded))
The dispersion parameter was estimated as 2.51, so the shape parameter for the gamma is \(1/2.51 \approx .4\). If the rescaled data follow a gamma distribution with this shape, then a qqplot of the rescaled data against the quantiles of such a gamma should form a straight line. Because a gamma with shape .4 has mean .4, we multiply the gamma quantiles by \(1/.4 = 2.5\) to put them on the same mean-1 scale as the rescaled data.
> qqplot(rescaled,2.5*qgamma((1:52)/53,.4))
> abline(0,1)

The plot is not perfectly straight, but it is a lot straighter than the normal plots of the residuals were, and it gives us some confidence in the results.

29.8.3 Example 6.18 Copepoda

Artificial wetlands are treated with snowmelt water having either neutral or reduced (acidic) pH. After treatment, we observed the Copepoda counts in each of the six wetlands. Given that these are counts, a Poisson model is a reasonable starting point.
> copepoda.poisson <- glm(count~pH,data=Copepoda,family=poisson())

By default, a Poisson model uses a log link function, but you can change that in the poisson() function if you want to.
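For example, a square-root link could be specified as follows (just a sketch — the name copepoda.sqrt is arbitrary, and we stay with the default log link below):
> copepoda.sqrt <- glm(count~pH,data=Copepoda,
+                      family=poisson(link="sqrt"))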

Here is a summary of the model fit:
> summary(copepoda.poisson)

Call:
glm(formula = count ~ pH, family = poisson(), data = Copepoda)

Deviance Residuals: 
     1       2       3       4       5       6  
 4.698  -2.173  -2.952  -8.436  -1.605   8.067  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.09496    0.03212 158.638  < 2e-16 ***
pH1         -0.14148    0.03212  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 193.93  on 5  degrees of freedom
Residual deviance: 174.32  on 4  degrees of freedom
AIC: 219.29

Number of Fisher Scoring iterations: 5

The deviance residuals ought to be of roughly the magnitude of standardized residuals; instead, they are very large. For a Poisson model, the residual deviance divided by residual degrees of freedom ought to be roughly 1; instead, it is \(174.32/4 = 43.58\). These data have much more variability than we would expect from a Poisson distribution.
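The deviance-based estimate of the dispersion is \(174.32/4 = 43.58\); the quasi-Poisson fit below reports the closely related Pearson-based estimate, which we can compute directly from the Poisson fit:
> # Pearson chi-squared statistic divided by residual degrees of freedom
> sum(residuals(copepoda.poisson,type="pearson")^2)/
+     df.residual(copepoda.poisson)
[1] 43.42048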

One approach to handling this hyper-Poisson variation (overdispersion) is to use a quasi-likelihood model, in this case quasipoisson().
> copepoda.quasip <- glm(count~pH,data=Copepoda,family=quasipoisson())
The summary now shows the same estimated coefficients, but the estimated standard errors are larger by a factor of \(\sqrt{43.42}\), the square root of the estimated dispersion.
> summary(copepoda.quasip)

Call:
glm(formula = count ~ pH, family = quasipoisson(), data = Copepoda)

Deviance Residuals: 
     1       2       3       4       5       6  
 4.698  -2.173  -2.952  -8.436  -1.605   8.067  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0950     0.2116  24.075 1.77e-05 ***
pH1          -0.1415     0.2116  -0.669     0.54    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 43.42048)

    Null deviance: 193.93  on 5  degrees of freedom
Residual deviance: 174.32  on 4  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

The pH effect is no longer significant.

We can also test via the deviance:
> anova(copepoda.quasip)
Analysis of Deviance Table

Model: quasipoisson, link: log

Response: count

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev
NULL                     5     193.93
pH    1   19.601         4     174.32

The F-statistic is \((19.601/1)/43.42 = .45\) with 1 and 4 degrees of freedom. Note that the square of the t-statistic above, \((-.669)^2 = .45\), is essentially this F-statistic.
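Carrying out that computation in R (the result agrees closely with the .54 p-value for the t-statistic above):
> round(pf((19.601/1)/43.42,1,4,lower.tail=FALSE),3)
[1] 0.538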

An alternative to a quasi-Poisson model is to fit a negative binomial model. Negative binomial distributions have more variance than a Poisson distribution with the same mean; an additional parameter controls the amount of extra variance. The glm.nb() function in the MASS package will fit a GLM with a negative binomial response.
> copepoda.negbin <- MASS::glm.nb(count~pH,data=Copepoda)
> summary(copepoda.negbin)

Call:
MASS::glm.nb(formula = count ~ pH, data = Copepoda, init.theta = 5.077672042, 
    link = log)

Deviance Residuals: 
      1        2        3        4        5        6  
 0.7246  -0.3621  -0.4971  -1.8268  -0.3054   1.3708  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   5.0950     0.1840  27.690   <2e-16 ***
pH1          -0.1415     0.1840  -0.769    0.442    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(5.0777) family taken to be 1)

    Null deviance: 6.8023  on 5  degrees of freedom
Residual deviance: 6.2128  on 4  degrees of freedom
AIC: 73.788

Number of Fisher Scoring iterations: 1

              Theta:  5.08 
          Std. Err.:  2.95 

 2 x log-likelihood:  -67.788 

The estimated coefficients are the same as in the Poisson and quasi-Poisson models, and the standard errors for the coefficients in the negative binomial model are similar to those in the quasi-Poisson model.
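For reference, glm.nb() parameterizes the negative binomial so that \(\mathrm{Var}(Y) = \mu + \mu^2/\theta\); larger values of \(\theta\) bring the model closer to the Poisson. As a rough check with the rounded estimates above (mu and theta here are simply values read off the summary):
> mu <- exp(5.095)   # estimated grand mean count, about 163
> theta <- 5.08      # estimated shape parameter
> mu + mu^2/theta    # implied variance, vastly larger than the Poisson value mu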

We can also do an analysis of deviance:
> anova(copepoda.negbin)
Warning in anova.negbin(copepoda.negbin): tests made without re-estimating 'theta'
Analysis of Deviance Table

Model: Negative Binomial(5.0777), link: log

Response: count

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                     5     6.8023         
pH    1  0.58948         4     6.2128   0.4426

Overall, the quasi-Poisson and negative binomial approaches are producing very similar inferences.

29.9 Nonparametric methods

Nonparametric methods do not assume a parametric distributional shape; instead, they typically derive their inferences from the other potential outcomes of the randomization. Examples include randomization (permutation) tests and rank tests.

29.9.1 Example 6.19 Cloud Seeding, continued

Function permTS() in the perm package does a randomization analogue of the two-sample t-test. It computes the mean in the first group minus the mean in the second group and then sees where the observed difference of means falls in the distribution of differences of means across all potential randomizations.
> perm::permTS(rainfall~seeding,data=CloudSeeding)

    Permutation Test using Asymptotic Approximation

data:  rainfall by seeding
Z = -1.9421, p-value = 0.05213
alternative hypothesis: true mean seeding=nonseeded - mean seeding=seeded is not equal to 0
sample estimates:
mean seeding=nonseeded - mean seeding=seeded 
                                   -277.3962 

Here we see a p-value of .052, which is similar to the p-value of the t-test. This p-value is for a two-sided alternative; permTS() has an optional argument that allows you to specify a one-sided alternative. Note also that permutation tests sometimes compute the complete permutation distribution (particularly for small data sets), while at other times the permutation distribution is sampled (particularly for large data sets). When sampling is used, you may see a slightly different p-value in repeated applications of the test.
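For illustration, here is a minimal hand-rolled version of the sampled approach; the p-value should land near .05 but will vary a bit from run to run:
> set.seed(123)   # arbitrary seed, for reproducibility
> obs <- diff(tapply(CloudSeeding$rainfall,CloudSeeding$seeding,mean))
> perms <- replicate(10000, {
+     shuffled <- sample(CloudSeeding$seeding)   # random relabeling of units
+     diff(tapply(CloudSeeding$rainfall,shuffled,mean))
+ })
> mean(abs(perms) >= abs(obs))   # two-sided randomization p-value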

Working with log data we get a different p-value:
> perm::permTS(log(rainfall)~seeding,data=CloudSeeding)

    Permutation Test using Asymptotic Approximation

data:  log(rainfall) by seeding
Z = -2.4179, p-value = 0.01561
alternative hypothesis: true mean seeding=nonseeded - mean seeding=seeded is not equal to 0
sample estimates:
mean seeding=nonseeded - mean seeding=seeded 
                                   -1.143781 

This has a smaller p-value, which is again similar to what we obtained with the t-test on the logged data.

The difference of the means of the logs is the same as the log of the ratio of the geometric means. Thus working with the logged data is like working with the original data but summarizing each randomization with a different statistic in place of the difference of group means.
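We can verify that identity directly; gm() here is just a helper function we define, and the value matches the estimate permTS() reported for the logged data:
> gm <- function(x) exp(mean(log(x)))   # geometric mean
> log(gm(nonseeded)/gm(seeded))
[1] -1.143781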

For a rank-based test we can do:
> wilcox.test(rainfall~seeding,data=CloudSeeding)
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot compute exact p-value
with ties

    Wilcoxon rank sum test with continuity correction

data:  rainfall by seeding
W = 203, p-value = 0.01383
alternative hypothesis: true location shift is not equal to 0

This p-value is close to what we get from a t-test on the logged data. Note that since ranks are unchanged by a monotone transformation such as the log, logging the data would not change the rank test at all.
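To see this invariance, rerun the test on the logged data; it should reproduce exactly the same statistic (W = 203) and p-value:
> wilcox.test(log(rainfall)~seeding,data=CloudSeeding)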

When the assumptions of the parametric test (e.g., the t-test) are met, the randomization, rank, and parametric procedures will generally agree.