Chapter 28 Assessing Assumptions

28.1 Residuals

Our statistical assumptions about the data are about the \(\epsilon_{ij}\)s: independence, constant variance, and normality. But we don’t get to observe the \(\epsilon_{ij}\)s. The best that we can do is try to approximate them by the residuals in our data. We will see that there are several kinds of residuals, based on how much post-processing we do. For most purposes, any of the types of residuals will suffice, but there are situations where one or another is advantageous.

When assessing assumptions, be aware that a problem of one sort (say, non-constant variance) can make a problem of another sort (say, non-normality) appear to be present. Similarly, a non-normal distribution for the errors could lead one to think that there are outliers, when what is really present is general non-normality. Thus it is important to assess the model assumptions in multiple ways before deciding too quickly what is going on in the data.

Because we want to explore assessing assumptions in data, we will look at data sets that violate the assumptions. In earlier examples we looked at the logarithm of lifetime as a response in the resin lifetime data. Now we will examine these data on the original scale (before taking logs).
> library(cfcdae)
> data(ResinLifetimes)
> lifetimes.separate.means <- lm(Time~temp,data=ResinLifetimes)
Here is the ANOVA. It’s still highly significant, at least nominally.
> anova(lifetimes.separate.means)
Analysis of Variance Table

Response: Time
          Df  Sum Sq Mean Sq F value    Pr(>F)    
temp       4 28027.9  7007.0  101.81 < 2.2e-16 ***
Residuals 32  2202.4    68.8                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are several kinds of residuals and corresponding functions in R to extract them: residuals(), rstandard(), and rstudent().

  • Raw residuals are just the difference between the observed and predicted values.
  • Standardized residuals divide each residual by its estimated standard deviation. These are also called “internally studentized.”
  • Studentized residuals are computed by refitting the model without the data point in question, predicting that point from the reduced fit, and then standardizing the resulting residual. These are also called “externally studentized”.
We round to make a cleaner output. Note that the raw residuals are on a very different scale from the standardized and studentized residuals, which in turn differ somewhat from each other.
> round(residuals(lifetimes.separate.means),2)
     1      2      3      4      5      6      7      8      9     10     11     12     13     14 
 23.22  -5.14  13.58  -3.25 -15.63   4.78 -10.57  -6.99   2.15   7.73 -17.26  13.98   2.15  -2.82 
    15     16     17     18     19     20     21     22     23     24     25     26     27     28 
 -8.08   2.15   9.36  10.15  -0.53  -4.10  -2.13  -5.90  -6.32  -0.53  -1.59   0.88  -0.93  -1.26 
    29     30     31     32     33     34     35     36     37 
  0.50   3.34  -0.93   6.32  -5.11   0.15  -1.40   0.43  -0.39 
> round(rstandard(lifetimes.separate.means),2)
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16 
 2.99 -0.66  1.75 -0.42 -2.01  0.62 -1.36 -0.90  0.28  1.00 -2.22  1.80  0.28 -0.36 -1.04  0.28 
   17    18    19    20    21    22    23    24    25    26    27    28    29    30    31    32 
 1.21  1.31 -0.07 -0.53 -0.27 -0.76 -0.81 -0.07 -0.21  0.11 -0.12 -0.16  0.07  0.43 -0.12  0.84 
   33    34    35    36    37 
-0.67  0.02 -0.19  0.06 -0.05 
> round(rstudent(lifetimes.separate.means),2)
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16 
 3.47 -0.66  1.81 -0.41 -2.12  0.61 -1.38 -0.90  0.27  1.00 -2.38  1.87  0.27 -0.36 -1.04  0.27 
   17    18    19    20    21    22    23    24    25    26    27    28    29    30    31    32 
 1.22  1.32 -0.07 -0.52 -0.27 -0.76 -0.81 -0.07 -0.20  0.11 -0.12 -0.16  0.06  0.43 -0.12  0.83 
   33    34    35    36    37 
-0.67  0.02 -0.18  0.06 -0.05 
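As a check on the definitions above, here is a minimal sketch (not part of the original analysis) that reproduces the standardized and studentized residuals by hand from the raw residuals, the leverages, and the residual degrees of freedom.
> # Sketch: reproduce rstandard() and rstudent() by hand.
> sigma.hat <- summary(lifetimes.separate.means)$sigma    # root mean squared error
> h <- hatvalues(lifetimes.separate.means)                # leverages
> r.std <- residuals(lifetimes.separate.means)/(sigma.hat*sqrt(1-h))
> all.equal(unname(r.std), unname(rstandard(lifetimes.separate.means)))  # should be TRUE
> df <- df.residual(lifetimes.separate.means)             # N - g = 32 here
> r.stu <- r.std*sqrt((df-1)/(df-r.std^2))
> all.equal(unname(r.stu), unname(rstudent(lifetimes.separate.means)))   # should be TRUE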
A second data set we consider is the cloud seeding data. There are two treatments, seeded and unseeded clouds, with the response being the amount of rain that falls from the clouds. The p-value of the test of the null hypothesis of equal group means in the original data is just about .05.
> data(CloudSeeding)
> cloud.fit <- lm(rainfall~seeding,data=CloudSeeding)
> anova(cloud.fit)
Analysis of Variance Table

Response: rainfall
          Df   Sum Sq Mean Sq F value  Pr(>F)  
seeding    1  1000332 1000332   3.993 0.05114 .
Residuals 50 12526130  250523                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are some quite large studentized residuals for these data:
> round(rstudent(cloud.fit),2)
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16 
 2.19  1.37  0.42  0.37  0.32  0.16  0.00 -0.03 -0.14 -0.16 -0.17 -0.19 -0.24 -0.25 -0.26 -0.27 
   17    18    19    20    21    22    23    24    25    26    27    28    29    30    31    32 
-0.27 -0.28 -0.28 -0.28 -0.29 -0.30 -0.31 -0.32 -0.32 -0.33  6.21  2.72  2.61  1.09  0.53  0.10 
   33    34    35    36    37    38    39    40    41    42    43    44    45    46    47    48 
-0.02 -0.22 -0.28 -0.34 -0.34 -0.38 -0.40 -0.49 -0.49 -0.63 -0.65 -0.66 -0.66 -0.71 -0.82 -0.83 
   49    50    51    52 
-0.83 -0.86 -0.88 -0.89 
A third data set is the oven data, which gives the differences over time between the readings of two thermocouples placed in an oven. There is no treatment per se, so we’ll just fit the overall mean to the data. But these results were collected in time order, which could lead to dependence between data points.
> data(Thermocouples)
> Oven.fit <- lm(temp.diff~1,data=Thermocouples)
> summary(Oven.fit)

Call:
lm(formula = temp.diff ~ 1, data = Thermocouples)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.09125 -0.01125  0.00375  0.00875  0.04875 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.141250   0.003138    1001   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0251 on 63 degrees of freedom

28.2 Nonconstant variance

Our principal tool for assessing assumptions is to plot residuals. The old standby, frequently called the residual plot, has one point for each data point: the horizontal location is the predicted value and the vertical location is the residual (possibly standardized or studentized). A more recent variation is the scale-location plot. Again, there is one point for each data point; the horizontal location is again the predicted value, and the vertical location is the square root of the absolute value of the standardized residual.

For the residual plot you get vertical streaks of points, with one vertical streak for each predicted value. The vertical streaks should be symmetric around 0, and the vertical extent should be consistent across the plot. The scale-location plot also has vertical streaks of points, and the typical height of these streaks should be the same across the plot.

If some streaks are longer than others in the residual plot, or some streaks are higher than others in the scale-location plot, then we have evidence of non-constant variance. Of course, the data are random, so the lengths and heights will not be exactly the same, even if our model is a perfect representation of our data.

In R, plot(modelfit,which=1) produces a residual plot, and plot(modelfit,which=3) produces the scale-location plot. plot(modelfit) produces the residual plot, the normal probability plot, the scale-location plot, and a residuals-versus-leverage plot that is generally of less interest in designed experiments (where balance, or near balance, ensures that the leverages are all nearly the same).
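As a rough sketch of what plot(modelfit,which=3) is doing, you could build a scale-location plot by hand (plot.lm adds point labels and a smooth curve that are omitted here):
> # Sketch: scale-location plot built by hand.
> plot(fitted(lifetimes.separate.means),
+      sqrt(abs(rstandard(lifetimes.separate.means))),
+      xlab="Fitted values", ylab="sqrt(|standardized residuals|)")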

We want to see no pattern in the residual plot and the scale-location plot. For the resin lifetime data on the original scale, we see that the variability is higher when the predicted value is higher. This type of residual plot is common and is often referred to as a right-opening megaphone.

> plot(lifetimes.separate.means,which=1)

Figure 28.1: Residuals vs fitted for original resin lifetimes

> plot(lifetimes.separate.means,which=3)

Figure 28.2: Scale-location plot for original resin lifetimes

The car package has a residual plotting command that will plot boxplots of the residuals by treatment group as well as residuals against fitted values. In this case, they just look like mirror images of each other, because the means are in order of the treatments.

> car::residualPlots(lifetimes.separate.means)
Warning in residualPlots.default(model, ...): No possible lack-of-fit tests

Figure 28.3: Residual plots from car

We get analogous pictures of increasing variance from the cloud seeding data.
> plot(cloud.fit,which=1)

> plot(cloud.fit,which=3)

Comparing variances is more difficult, in a statistical sense, than comparing means. This does not mean the calculation is more complex; it means that the statistical properties of tests for comparing variances are not as good as we would like. For example, most of the classical tests of equality of variance are so sensitive to non-normality that they are essentially useless in practice. I usually say don’t test variances, but if you have to, use Levene’s test. Here we see that even with this glaringly obvious non-constant variance, Levene’s test is not significant.

> car::leveneTest(lifetimes.separate.means)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)  
group  4  2.3746 0.07286 .
      32                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
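For reference, Levene’s test with median centering (the car default, as shown above) is just a one-way ANOVA on the absolute deviations of the responses from their group medians. This minimal sketch should reproduce the F statistic above:
> # Sketch: Levene's test by hand via ANOVA on absolute deviations from group medians.
> abs.dev <- with(ResinLifetimes, abs(Time - ave(Time, temp, FUN=median)))
> anova(lm(abs.dev ~ temp, data=ResinLifetimes))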

To be honest, this would all have been obvious to us if we had done something simple like making boxplots of the original data by treatment group.

> boxplot(Time~temp,data=ResinLifetimes,main="(Non-log) Resin Lifetimes")

Figure 28.4: Boxplots of original resin lifetimes

Redoing the plots with a log y-axis is analogous to working with logged data. Group variability is much more similar now.

> boxplot(Time~temp,data=ResinLifetimes,main="Log Resin Lifetimes",log='y')

Figure 28.5: Boxplots of resin lifetimes on a log scale
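Equivalently (a minimal sketch, not in the original analysis), we could transform the response explicitly before plotting; only the axis labeling changes:
> # Sketch: boxplots of the log-transformed response.
> boxplot(log(Time)~temp,data=ResinLifetimes,main="Log Resin Lifetimes")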

Levene’s test also cannot detect nonconstant variance in the cloud seeding data.
> car::leveneTest(cloud.fit)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)  
group  1  2.8628 0.09687 .
      50                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

28.3 Normality/outliers

28.3.1 Outliers

An outlier is a data point that does not follow the pattern established by the rest of the data. Outliers can be bad data. Examples include transcription errors when copying or typing data and procedural problems (such as Joe sneezed in the test tube; check those lab notebooks). Outliers can also be the most important data in your data set (such as the trace of a Higgs Boson in the mountain of results at CERN).

Truly bad data should be removed from your analysis and, ideally, replaced with corrected data.

More common is that you have an outlier, but there is no reason to suspect that it is bad data. What that means is that your model is not good enough to fit the data properly. However, it is often quite challenging to establish a new and improved model that can accommodate all of the data. In such situations, it is common to analyze the data twice, once including and once excluding the outlier(s). If the conclusions agree, then lucky you: your inference does not depend on just a few values. If the conclusions disagree, then you either need to look for a new model/analysis method, or you need to report both conclusions noting how the results depend on just a few data points.

Choosing to retain or exclude data points based on which one gives you your desired answer is research fraud.

How do we locate those unusual data points? If our assumptions are correct, then the studentized residuals follow a t-distribution with N-g-1 degrees of freedom. One simple approach is to find the largest (absolute) studentized residual and get its t-tail area (with N-g-1 df). You can do it by hand, or there is a function in the car package that does it all at once. If you look at all of the studentized residuals, you need to multiply the tail areas by N as a Bonferroni correction for multiple comparisons.

Here we do the outlier test for the resin lifetime data on the original scale. The first data value has the largest studentized residual at 3.47. Its Bonferroni p-value is .057. That’s not strong evidence that an outlier is present, but it is getting close to the range where we might worry.
> max(abs(rstudent(lifetimes.separate.means)))
[1] 3.471028
> 2*pt(-max(abs(rstudent(lifetimes.separate.means))),31)*37
[1] 0.05729818
The outlierTest() function in the car package automates the process.
> car::outlierTest(lifetimes.separate.means) # easier way
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
  rstudent unadjusted p-value Bonferroni p
1 3.471028          0.0015486     0.057298
For the cloud seeding data, the outlier is much more prominent.
> car::outlierTest(cloud.fit)
   rstudent unadjusted p-value Bonferroni p
27 6.212291         1.1009e-07   5.7245e-06
There are 52 data points, and point 27 looks like an outlier. Here we refit with the subset of the data that excludes point 27. Now two more points look like outliers!
> use.these <- rep(TRUE,52)
> use.these[27] <- FALSE
> cloud.fit.subset1 <- lm(rainfall~seeding,data=CloudSeeding,subset=use.these)
> car::outlierTest(cloud.fit.subset1)
   rstudent unadjusted p-value Bonferroni p
28 4.214633         0.00010977    0.0055984
29 4.038566         0.00019293    0.0098392
We could pull them out and refit.
> use.these[28:29] <- FALSE
> cloud.fit.subset2 <- lm(rainfall~seeding,data=CloudSeeding,subset=use.these)
> car::outlierTest(cloud.fit.subset2)
  rstudent unadjusted p-value Bonferroni p
1 5.005484           8.63e-06   0.00042287

Yet another outlier! In fact, we can keep stripping outliers for some time.

The problem with the cloud seeding data is not really outliers. But if you look for outliers first, you will find them.

You need to consider other issues such as non-constant variance and more general non-normality before pursuing outlier removal.

28.3.2 Non-Normality

A graphical way to assess normality is a normal probability plot of (some version of) the residuals. In fact, if you try to plot a fitted linear model, R will produce a normal probability plot along with three other plots; plot(model,which=2) gives you the normal probability plot of the standardized residuals.
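If you prefer to build the plot yourself, a minimal sketch with qqnorm() and qqline() on the standardized residuals gives essentially the same picture (without the labeling that plot.lm adds):
> # Sketch: normal probability plot of the standardized residuals by hand.
> qqnorm(rstandard(lifetimes.separate.means))
> qqline(rstandard(lifetimes.separate.means))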

For the resin lifetime data on the original scale, we see long tails. In fact, these “long tails” come mostly from combining normals with different variances; such a mixture looks long-tailed.

> plot(lifetimes.separate.means,which=2)

Figure 28.6: NPP of original resin lifetimes

In comparison, the cloud seeding data are very asymmetric, with a long tail to the right.
> plot(cloud.fit,which=2)

28.4 Temporal dependence

When data are collected in some time order, it is not uncommon for \(\epsilon_{ij}\)s that are close in time to be more similar to each other than we would expect under independence. This is positive serial correlation, or positive temporal dependence. The reverse, negative serial correlation, where errors that are close in time are less similar than we would expect under independence, can happen, but it is much less common than positive serial correlation.

The obvious graphical approach to detecting serial correlation is to plot the residuals in time order and look to see if there are patterns. For example, stretches of positive residuals followed by stretches of negative residuals would indicate positive serial correlation.
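To see what such a pattern can look like, here is a small simulation sketch; the AR(1) model and its coefficient are illustrative assumptions, not part of the oven data analysis:
> # Sketch: simulate positively serially correlated errors and plot them in time order.
> set.seed(1)
> e <- as.numeric(arima.sim(list(ar=0.7), n=64))
> plot(e, type="b"); abline(h=0)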

The oven data are in time order in the data set, so we can just plot them to look for serial correlation. The type="b" argument in plot() says to connect the dots, which often makes patterns easier to see. The abline() call puts a horizontal line at 0.
> plot(residuals(Oven.fit),type="b")
> abline(h=0)

Figure 28.7: Residuals from oven data in time order

We see that points above the line tend to cluster together, and points below the line tend to cluster together. This is positive serial correlation.

We can perform a formal test using the Durbin-Watson test. Here the statistic is in the “uh oh” range (below 1.5 or above 2.5), but the p-value is only about .06.
> car::durbinWatsonTest(Oven.fit)
 lag Autocorrelation D-W Statistic p-value
   1       0.2131219      1.513854    0.06
 Alternative hypothesis: rho != 0
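The Durbin-Watson statistic itself is easy to compute by hand; this minimal sketch should reproduce the value reported above:
> # Sketch: D-W statistic = sum of squared successive differences of the
> # residuals divided by the sum of squared residuals.
> r <- residuals(Oven.fit)
> sum(diff(r)^2)/sum(r^2)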

There are many other tests of serial correlation that fall into the category of “runs tests”; we’ll just look at runs.test() in the tseries package. It takes a two-valued (positive and negative) factor variable in time order, treats each group of consecutive identical values as a run, and counts the number of runs. Too few or too many runs indicate association (positive and negative serial correlation, respectively).

So we need to take our residuals in time order and turn them into a two-valued factor. Here we split them as positive or negative.

We see below that there are fewer runs than we might expect from random data, meaning that positives and negatives are more tightly clustered than one would expect (p-value .001).

> pos.not.neg <- factor(residuals(Oven.fit) > 0)
> tseries::runs.test(pos.not.neg)

    Runs Test

data:  pos.not.neg
Standard Normal = -3.2761, p-value = 0.001052
alternative hypothesis: two.sided
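The runs themselves are easy to count by hand (a minimal sketch): a new run starts each time the sign of the residual changes.
> # Sketch: count runs of positive/negative residuals by hand.
> signs <- residuals(Oven.fit) > 0
> 1 + sum(signs[-1] != signs[-length(signs)])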

Remember, only assess temporal association when there actually is a time order for your data.

28.5 And Now for the Bayesians

Assessment options for Bayesians are similar to, but less extensive than, those for non-Bayesians. We can compute residuals as observed data minus the posterior expected value of the model fit. We can then do residual versus fitted value plots and normal probability plots as above. In addition, we can compute the predictive distribution for each observation, and then plot the observation in that distribution. If it is way out in the tails, then we have an outlier.
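As a schematic only (this is not the bcfcdae implementation; fit.draws is a hypothetical matrix of posterior draws of the fitted values), the residual computations might look like this:
> # Hypothetical sketch: fit.draws is a (draws x N) matrix of posterior draws
> # of the fitted values for the N observations.
> # Residual = observed response minus posterior expected fit:
> res.bayes <- ResinLifetimes$Time - colMeans(fit.draws)
> plot(colMeans(fit.draws), res.bayes); abline(h=0)   # residuals vs fitted
> qqnorm(res.bayes); qqline(res.bayes)                # normal probability plot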

Here is a Bayesian fit of the resin lifetimes.
> library(bcfcdae)
> bfit.resin <- bglmm(Time~temp,data=ResinLifetimes,quiet=TRUE,seed=10026)
The residual plot and normal probability plot look essentially the same as we saw above.
> plot(bfit.resin)

The pointwise plot does not show any grand outliers, but it does reveal the nonconstant variance in another way. The model is fit with constant variance, so the width of the predictive distributions is basically the same across all of the data. However, when the mean is highest, the observed data are far out in the predictive distribution, because the fitted constant variance is smaller than the actual variance for that treatment. Conversely, when the mean is smallest, the observed data are pretty much in the center of the predictive distribution, because the fitted constant variance is larger than the actual variance for that treatment.

> plot(bfit.resin,"pointwise")

Look at the cloud seeding data with a Bayesian fit.
> bfit.clouds <- bglmm(rainfall~seeding,data=CloudSeeding,quiet=TRUE,
+                      seed=10027)
Warning: There were 21 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
We are warned about divergent transitions, so we try increasing adapt_delta.
> bfit.clouds <- bglmm(rainfall~seeding,data=CloudSeeding,quiet=TRUE,
+                      seed=10027,adapt_delta=.995)
Warning: There were 2 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
Better, but still not there. Tightening the prior distribution on \(\sigma\) does the trick:
> bfit.clouds <- bglmm(rainfall~seeding,data=CloudSeeding,quiet=TRUE,
+                      seed=10027,adapt_delta=.995,sigma.shape=3) # default was 2
The residual and normal plots look pretty much the same as we saw for the non-Bayesian models, revealing nonconstant variance and an asymmetric distribution.
> plot(bfit.clouds)
The pointwise plot does show a point far in the tail of the predictive distribution, so it identifies the same outlier we saw before.
> plot(bfit.clouds,"pointwise")
