Chapter 16 Example 3.5: Comparing Models via ANOVA
Analysis of Variance compares models by looking at the change in residual sum of squares as we move between a simple model and a more complex model that includes the simple model as a special case. For example, we could compare the single mean model to the separate means model, or we could compare the linear to the quadratic to the cubic.
You get Analysis of Variance by
using the anova()
command with a fitted model as
the argument. What anova()
does is sequentially compare
the model fits (staring with just the constant assuming you
didn’t remove 1
from the model) obtained as you add the terms to the model.
Here we ask for anova(separate.means)
(this and all
other models were fit in the previous chapter); the only term other other than 1
in the separate means model is temp
, so we get the comparison with
1
as the smaller model and temp
as the larger model (with 1
in
that larger model by default). This is a comparison of the
separate means model to the single mean model.
We see that temperature is highly significant, but that was obvious
in the box plot.
> anova(separate.means)
Analysis of Variance Table
Response: logTime
Df Sum Sq Mean Sq F value Pr(>F)
temp 4 3.5376 0.88441 96.363 < 2.2e-16 ***
Residuals 32 0.2937 0.00918
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA is a comparison
of models. Another usage of the anova()
function directly specifies
the models to compare. Below compares
the single mean and separate means models by looking at how the residual
sum of squares decreases. The comparison in line 2 is the same as the line for
temp
in the ANOVA above.
> anova(single.mean,separate.means)
Analysis of Variance Table
Model 1: logTime ~ 1
Model 2: logTime ~ temp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 3.8313
2 32 0.2937 4 3.5376 96.363 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The SS for temp.z
is the reduction in residual SS
going from a model of a single mean to a model where the mean varies linearly
with temp.z
. Note that the SS for temp.z
and Residuals
in this ANOVA sum
to the residual SS in the anova for p0
.
> anova(p1)
Analysis of Variance Table
Response: logTime
Df Sum Sq Mean Sq F value Pr(>F)
temp.z 1 3.4593 3.4593 325.41 < 2.2e-16 ***
Residuals 35 0.3721 0.0106
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With multiple terms in the model, the SS are always the reduction in residual SS when adding that term to a model that includes the preceding terms. That is, we see linear improving constant; quadratic improving linear; cubic improving quadratic; and quartic improving cubic.
Here we see that quartic does not improve over cubic, and cubic does not improve over quadratic, but quadratic does improve over linear. Thus an acceptable simple model would include linear and quadratic terms, i.e., model p2.
> anova(p4)
Analysis of Variance Table
Response: logTime
Df Sum Sq Mean Sq F value Pr(>F)
temp.z 1 3.4593 3.4593 376.9128 < 2.2e-16 ***
temp.z2 1 0.0783 0.0783 8.5361 0.006338 **
temp.z3 1 0.0000 0.0000 0.0020 0.964399
temp.z4 1 0.0000 0.0000 0.0009 0.976258
Residuals 32 0.2937 0.0092
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here is the same information,
but making the sequence of model comparisons explicit.
Note that you can put more than two models in the anova()
comparison so long as
the models follow the rule that each model is a special case of the
model that follows it.
> anova(p0,p1,p2,p3,p4)
Analysis of Variance Table
Model 1: logTime ~ 1
Model 2: logTime ~ temp.z
Model 3: logTime ~ temp.z + temp.z2
Model 4: logTime ~ temp.z + temp.z2 + temp.z3
Model 5: logTime ~ temp.z + temp.z2 + temp.z3 + temp.z4
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 3.8313
2 35 0.3721 1 3.4593 376.9128 < 2.2e-16 ***
3 34 0.2937 1 0.0783 8.5361 0.006338 **
4 33 0.2937 1 0.0000 0.0020 0.964399
5 32 0.2937 1 0.0000 0.0009 0.976258
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Even though we built p0 and p4 as
functional/quantitative variable/regression type models, the full model
p4
can fit five means exactly, so it exactly matches the five treatment
means, just like separate.means
did. Thus the comparison between
single.mean
and separate.means
is the same as that between
p0
and p4
.
> sum( (predict(p4)-predict(separate.means))^2 ) # numerically 0
[1] 1.177345e-24
> anova(single.mean,separate.means)
Analysis of Variance Table
Model 1: logTime ~ 1
Model 2: logTime ~ temp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 3.8313
2 32 0.2937 4 3.5376 96.363 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> anova(single.mean,p4)
Analysis of Variance Table
Model 1: logTime ~ 1
Model 2: logTime ~ temp.z + temp.z2 + temp.z3 + temp.z4
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 3.8313
2 32 0.2937 4 3.5376 96.363 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although it’s not nearly as obvious, the trigonometric model we
used can also fit the five treatment means exactly, so it
has the same comparison with single.mean
as
separate.means
or p4
.
> sum( (predict(trig4)-predict(separate.means))^2 ) # numerically 0
[1] 1.16357e-29
> anova(single.mean,separate.means)
Analysis of Variance Table
Model 1: logTime ~ 1
Model 2: logTime ~ temp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 3.8313
2 32 0.2937 4 3.5376 96.363 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> anova(single.mean,trig4)
Analysis of Variance Table
Model 1: logTime ~ 1
Model 2: logTime ~ sin(2 * pi * temp.z/60) + cos(2 * pi * temp.z/60) +
sin(4 * pi * temp.z/60) + cos(4 * pi * temp.z/60)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 3.8313
2 32 0.2937 4 3.5376 96.363 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1