Chapter 37 Two Series Designs

37.0.1 Daniel’s drill data

These data are from Daniel (1976). The experiment was on a stone drill. Factor A is the load on the drill; B is the flow rate through the drill; C is the rotational speed; D is the type of mud used. The response is the rate of drill advance. There was just one replication.

This is a \(2^4\) design, and the data are in standard order.

> library(cfcdae)
> drill <- read.table("drill.data",header=TRUE)
> drill
   load flow speed mud advance.rate
1     1    1     1   1         1.68
2     2    1     1   1         1.98
3     1    2     1   1         3.28
4     2    2     1   1         3.44
5     1    1     2   1         4.98
6     2    1     2   1         5.70
7     1    2     2   1         9.97
8     2    2     2   1         9.07
9     1    1     1   2         2.07
10    2    1     1   2         2.44
11    1    2     1   2         4.09
12    2    2     1   2         4.53
13    1    1     2   2         7.77
14    2    1     2   2         9.43
15    1    2     2   2        11.75
16    2    2     2   2        16.30

We should start the analysis of an unreplicated two-series with a half-normal plot of the effects of the full model and Basso-Salmaso to indicate important terms. Here only speed appears to be important.

> TwoSeriesPlots(advance.rate~load*mud*speed*flow,data=drill)
B set to 3000
Half-Normal plot for Drill data

Figure 37.1: Half-Normal plot for Drill data

Looking at just the significant term, it looks like nonconstant variance.

> plot(lm(advance.rate~speed,data=drill),which=1)
Residual plot for drill data

Figure 37.2: Residual plot for drill data

Surprise! Box-Cox says to take a log.

> car::boxCox(lm(advance.rate~speed,data=drill))
Box-Cox plot for drill data

Figure 37.3: Box-Cox plot for drill data

OK, now do the Basso-Salmaso on the logged response. Now it looks like three main effects.

> TwoSeriesPlots(log(advance.rate)~load*mud*speed*flow,data=drill)
B set to 3000
Half-normal plot for logged drill data

Figure 37.4: Half-normal plot for logged drill data

Residuals for the three main effects model look pretty good.

> par(mfrow=c(1,2))
> plot(lm(log(advance.rate)~mud+flow+speed,data=drill),which=1:2)
Residual plots for three main effects model of logged drill data

Figure 37.5: Residual plots for three main effects model of logged drill data

> par(mfrow=c(1,1))

You can add PSE cutoffs if you like. The plot will show the comparisonwise cutoff (dashed line) and the Bonferroni adjusted cutoff (dotted line).

> TwoSeriesPlots(log(advance.rate)~load*mud*speed*flow,data=drill,pse=TRUE)
B set to 3000
Half normal plot for logged drill data with PSE cutoffs

Figure 37.6: Half normal plot for logged drill data with PSE cutoffs

You can change the test level. Same three selected for Basso-Salmaso; mud would not be significant with Bonferroni-adjusted PSE.

> TwoSeriesPlots(log(advance.rate)~load*mud*speed*flow,data=drill,alpha=.01,pse=TRUE)
B set to 13934
Half normal plot for logged drill data with .01 cutoff

Figure 37.7: Half normal plot for logged drill data with .01 cutoff

Some people like a Pareto chart; to each his own.

> TwoSeriesPlots(log(advance.rate)~load*mud*speed*flow,data=drill,type="pareto")
B set to 3000
Half normal plot for logged drill data with .01 cutoff

Figure 37.8: Half normal plot for logged drill data with .01 cutoff

37.0.2 Davies’ isatin

Data from Davies (1954) via Box and Meyer (1986). Response is the yield of isatin under different production conditions. Factors are: A – acid strength; B – time; C – amount of acid; D – temperature. These data are also in standard order.

> y<-c(8,4,53,43,31,9,12,36,79,68,73,8,77,38,49,23)/100
> A <- factor(rep(1:2,times=8))
> B <- factor(rep(1:2,each=2,times=4))
> C <- factor(rep(1:2,each=4,times=2))
> D <- factor(rep(1:2,each=8))

Basso-Salmaso says nothing is going on.

> TwoSeriesPlots(y~A*B*C*D)
B set to 3000
Half normal plot for isatin data

Figure 37.9: Half normal plot for isatin data

Now pretend for a moment that you don’t have TwoSeriesPlots() to help you test. You would probably fit a lower order model, pooling higher order terms into a surrogate error. Two obvious places to start would be a main effects model or a two way model.

> mains <- lm(y~A+B+C+D)
> twoways <- lm(y~(A+B+C+D)^2)

Residual plots detect a problem with the main effects model: we either need a transformation or we have missed an important effect.

> par(mfrow=c(1,2))
> plot(mains,which=1:2)
Residual plots for isatin main effects model

Figure 37.10: Residual plots for isatin main effects model

Residual plots look much better for the two-way model.

> par(mfrow=c(1,2))
> plot(twoways,which=1:2)
Residual plots for isatin two way model

Figure 37.11: Residual plots for isatin two way model

We would now do ANOVA for the two-way model. With only five df for error, we would pool some additional terms into error. Or what about a D-only model? It sure looks like we would wind up with some terms to fit in our model.

> anova(twoways)
Analysis of Variance Table

Response: y
          Df   Sum Sq  Mean Sq F value  Pr(>F)  
A          1 0.146306 0.146306  3.8035 0.10863  
B          1 0.001806 0.001806  0.0470 0.83701  
C          1 0.023256 0.023256  0.6046 0.47200  
D          1 0.299756 0.299756  7.7927 0.03837 *
A:B        1 0.000006 0.000006  0.0002 0.99032  
A:C        1 0.004556 0.004556  0.1184 0.74473  
A:D        1 0.104006 0.104006  2.7038 0.16103  
B:C        1 0.017556 0.017556  0.4564 0.52929  
B:D        1 0.252506 0.252506  6.5644 0.05052 .
C:D        1 0.002756 0.002756  0.0717 0.79964  
Residuals  5 0.192331 0.038466                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> reduced <- lm(y~B:D+A:D)
> anova(reduced)
Analysis of Variance Table

Response: y
          Df  Sum Sq  Mean Sq F value   Pr(>F)   
B:D        3 0.55407 0.184690  7.6806 0.005941 **
D:A        2 0.25031 0.125156  5.2048 0.028237 * 
Residuals 10 0.24046 0.024046                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> Donly <- lm(y~D)
> anova(Donly,reduced)
Analysis of Variance Table

Model 1: y ~ D
Model 2: y ~ B:D + A:D
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1     14 0.74509                              
2     10 0.24046  4   0.50463 5.2464 0.01536 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Why are things so different in the two approaches for these data? It’s actually pretty simple. Basso-Salmaso controls the strong familywise error rate. In the surrogate error approach, we looked for big terms to call “model” and we looked for smallish terms to call “error”. After all that searching it’s no surprise that model looks big relative to error.

Moral: Sometimes we have no choice but to use a surrogate error, but be aware that the p-values are not completely trustworthy.

37.0.3 It gets worse

When you are working to find a surrogate error at the same time you’re trying to find the proper scale, complexities ensue.

Go back to the drill data and pretend we have no Basso-Salmaso. In one approach, we begin with a main effects model; in a second, we begin with the two way model.

Begin with an all main effects model.

  1. Residuals are bad.
  2. Box-Cox suggests a log.
  3. On the log scale, the main effects model has flow, speed, and mud highly significant, and load has a p-value of .02.
  4. If you now go ahead and look at two factor interactions as well (logged data), you can pool 4 df into error. The speed by mud interaction is marginally significant with a p-value of .03.

You would be likely to wind up with a model that includes all main effects plus the mud by speed interaction.

Now begin with the two way model.

  1. Residuals are bad.
  2. Box-Cox suggests a -.75 power, but -1 (reciprocal) is right at the edge of the interval. Reciprocal is easy to interpret as time to distance.
  3. On the reciprocal scale, all main effects have p-values under .01, flow by speed has a p-value under .0001, and three other two-way interactions have p-values between .01 and .05.

It’s difficult to say which model is correct; they were both derived in sensible fashion. The log scale model is easier to interpret, but without any correction for multiple testing we get more terms than Basso-Salmasso finds.