DeGroot and Schervish do not cover the general topic of categorical predictors (also referred to as the topic of dummy variables), although they do cover a very special case, ANOVA, in their Sections 10.6, 10.7, and 10.8. So we will use the data in the file. This data set contains two predictors, one numeric predictor x and one categorical predictor color, and one response variable y.
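Reading such a file into R might look like the following minimal sketch, assuming a whitespace-separated file with a header line; the file name dummy.txt and the data frame name foo are hypothetical, not from the original notes.

foo <- read.table("dummy.txt", header = TRUE)  # hypothetical file name
names(foo)                                     # should print "x" "color" "y"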
What we want to do here is fit parallel regression lines for the three categories. Logically, our model is of the form

E(y) = βred + γx, when the individual is in the red category
E(y) = βblue + γx, when the individual is in the blue category
E(y) = βgreen + γx, when the individual is in the green category

There are four regression coefficients: βred, βblue, and βgreen, the intercepts for each of the categories, and the common slope parameter γ.
At first this seems very ad hoc. But on further consideration, this is no different from any other linear regression. Remember: it's called linear regression because it's linear in the parameters, not because it's linear in x. Our linear regression model above is indeed linear in the regression coefficients, so there's nothing special about it. We just need to learn how to specify such models.
What predictor variable does βred multiply? The notation we used above doesn't explicitly indicate a variable, but there is one, the so-called dummy variable that is the indicator variable of the category red. Let us adopt the notation Ired for this variable (it is 1 when the individual is in the red category and 0 otherwise) and similar notation for the other categories. Then we can rewrite our regression model as

E(y) = βred Ired + βblue Iblue + βgreen Igreen + γx
For various reasons R prefers the equivalent model with parameterization

E(y) = α + βblue Iblue + βgreen Igreen + γx

where we drop one of the dummy variables (here Ired) and add an intercept α. The reason is that when there are several categorical predictors we must drop one dummy variable from each set of categories and add one intercept to get a well-defined model. Having all the dummy variables would give a model with non-invertible X'X matrix, because the sum of all the dummy variables is the constant predictor, for example,

Ired + Iblue + Igreen = 1

Thus if we include the constant predictor (1), then we must drop one of the dummy variables.
When the category labels are non-numeric, R just does the right thing. R automagically constructs the required dummy variables.
Warning: if the categorical predictor is coded numerically (that is, the values are numbers, but they aren't meant to be treated as numbers, just as category codes), then R won't know the predictor is categorical unless you tell it. Then you need to say
out <- lm(y ~ x + factor(color))
It's a little easier to see what's going on here, where there is only one categorical predictor, if we tell R not to fit an intercept. The secret code for that is to add + 0 to the formula specifying the regression model. Then we see the intercept for each category as a regression coefficient. And it is easier to plot the corresponding regression lines.
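Here is a minimal sketch of that, assuming the hypothetical data frame foo with variables x, color, and y described above:

out <- lm(y ~ x + color + 0, data = foo)  # + 0 means no common intercept
summary(out)                              # one intercept per color, one slope for x
# plot the data and add the three parallel fitted lines
plot(foo$x, foo$y, pch = as.numeric(foo$color))
cf <- coef(out)
for (lev in levels(foo$color))
    abline(cf[paste0("color", lev)], cf["x"])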
From the dummy variables
point of view, there's nothing special
about analysis of variance (ANOVA). It's just linear regression in the
special case that all predictor variables are categorical.
Here's Example 10.6.1 in DeGroot and Schervish done in R.
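A sketch of that analysis, assuming the data have a numeric response y and a categorical predictor type (the names match the aov output quoted further below):

out <- aov(y ~ type)
summary(out)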
Same as Table 10.16 in DeGroot and Schervish with no fuss.

lm instead of aov
Here's Example 10.6.1 in DeGroot and Schervish done in R again, this time using lm instead of aov.
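A sketch, with the same assumed variable names as before:

out <- lm(y ~ type)
summary(out)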
Same results. This analysis says

F-statistic: 11.6 on 3 and 59 DF,  p-value: 4.479e-06

The other says

            Df Sum Sq Mean Sq F value    Pr(>F)
type         3  19454    6485  11.595 4.479e-06 ***
Residuals   59  32995     559

Note that the numbers that matter, the F statistic, its degrees of freedom, and the P-value, are the same in both cases.
The F test in ANOVA is the same as the F test for the corresponding regression model.
Since we are treating ANOVA as just a special case of regression, we don't really need any special theory for it. We can skip from the first example in DeGroot and Schervish to the last.
Here's Example 10.8.1 in DeGroot and Schervish done in R.
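A sketch of the main-effects-only fit, assuming a numeric response y (gas mileage) and categorical predictors equipped and size, the names that appear in the aov output at the end of this section:

out <- lm(y ~ equipped + size)
summary(out)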
This doesn't give the same results as in the book because we're fitting a different model. The jargon for what we are doing is fitting only main effects with no interaction terms.
If what we were interested in testing was whether the device that is supposed to increase gas mileage actually does (whether there is a significant effect of the equipped variable), the conclusion is that the device doesn't work (effect not statistically significant, P = 0.4162). If what we were interested in testing was the obvious fact that bigger cars get worse gas mileage, then the data do support that, but that's so obvious we don't need statistics for it.
One might wonder if a more sophisticated model might show something. Perhaps the device only improves gas mileage on big cars or something of the sort.
The statistics jargon for that is that there may be an interaction between the equipped and size variables.
The way we complicate the model to include interactions is to add dummy variables that are products of the other variables, variables like

Iyes × Icompact

(where Iyes is the dummy variable for equipped == "yes" and Icompact the dummy variable for size == "compact") and the like. Note that the product is equal to 1 if we are in the categories equipped == "yes" and size == "compact" and is equal to 0 otherwise. Thus products of main effect dummy variables are indicator variables for the cells of the two-way table whose row and column labels are the values of the two categorical variables.
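In R one never constructs these product dummy variables by hand; the * operator in a formula does it. As a sketch of what R builds, again assuming a hypothetical data frame foo containing equipped and size, the model matrix shows all the dummy variables, products included:

model.matrix(~ equipped * size, data = foo)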
Here's Example 10.8.1 in DeGroot and Schervish done in R with interactions.
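A sketch, under the same assumed names (in an R formula, equipped * size means both main effects and their interaction):

out <- lm(y ~ equipped * size)
anova(out)  # sums of squares for main effects and interactions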
Now we get the same sums of squares as DeGroot and Schervish (their Table 10.28).
Same conclusions as when we didn't have interactions. The P-value for the equipped variable has hardly changed. Moreover, the interaction P-value is also non-significant.
Suppose we want to test whether the variable equipped
has no effect whatsoever, neither main effects nor interaction effects.
In the terminology of DeGroot and Schervish this is a test with hypotheses
(10.8.13) or (10.8.17) depending on whether you call the main effects
alphas or betas.
This is an F test for model comparison.
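A sketch of that model comparison, fitting the little model without equipped and the big model with it (object names ours):

out.little <- lm(y ~ size)
out.big <- lm(y ~ equipped * size)
anova(out.little, out.big)  # F test comparing the nested models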
The F statistic 0.2617 is the same as DeGroot and Schervish give in their Example 10.8.4.
Nothing there. P = 0.8522. Nowhere remotely close to significance.
The test just done uses the test statistic given by equation (10.8.14) in DeGroot and Schervish. The test statistics for the separate hypotheses about main effects and interactions are given by the original ANOVA with interactions for these data.
Rweb:> out <- aov(y ~ equipped * size)
Rweb:> summary(out)
              Df  Sum Sq Mean Sq F value    Pr(>F)
equipped       1  0.4813  0.4813  0.6342    0.4336
size           2 30.9227 15.4613 20.3707 6.735e-06 ***
equipped:size  2  0.1147  0.0573  0.0755    0.9275
Residuals     24 18.2160  0.7590