Categorical Predictors and Dummy Variables
The subject of this web page is linear models in which some or all of the predictors are categorical. Special cases are called ANOVA and ANCOVA.
In principle, there is no problem. The model matrix is allowed to be any function whatsoever of the predictor variables (covariates).
In practice, we need to explain the most commonly used way in which the model matrix is made to depend on categorical covariates.
For our first example we will use the data in the file
This data set contains two predictors, one numeric predictor x and one categorical predictor color, and one response variable y.
What we want to do here is fit parallel regression lines for the three categories. Logically, our model is of the form

yi = βcolor(i) + γ xi + ei

where color(i) denotes the category of the i-th individual. There are four regression coefficients: βred, βblue, and βgreen, the intercepts for each of the categories, and the common slope parameter γ.
At first this seems very ad hoc. But on further consideration, it is no different from any other linear regression. Remember: it's called linear regression because it's linear in the parameters, not because it's linear in x. Our linear regression model above is indeed linear in the regression coefficients, so there's nothing special about it. We just need to learn how to specify such models.
What predictor variable does βred multiply? The notation we used above doesn't explicitly indicate a variable, but there is one, the so-called dummy variable that is the indicator variable of the category red. Let us adopt the notation Ired for this variable (it is 1 when the individual is in the red category and 0 otherwise) and similar notation for the other categories. Then we can rewrite our regression model as

yi = βred Ired,i + βblue Iblue,i + βgreen Igreen,i + γ xi + ei
For various reasons R prefers the equivalent model with the parameterization

yi = α + βblue Iblue,i + βgreen Igreen,i + γ xi + ei

where we drop one of the dummy variables and add an intercept α. The reason is that when there are several categorical predictors we must drop one dummy variable from each set of categories and add one intercept to get a well-defined model. Having all the dummy variables would give a model whose model matrix is not full rank, because the sum of all the dummy variables is the constant predictor, for example,

Ired + Iblue + Igreen = 1

Thus if we include the constant predictor (1), then we must drop one of the dummy variables in order to have a full rank model matrix.
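To see this rank deficiency concretely, here is a minimal sketch with made-up data (the variable values here are invented for illustration; the actual data come from the file used in the example):

```r
# made-up categorical predictor with three levels
color <- factor(rep(c("red", "blue", "green"), times = 4))

# one dummy (indicator) variable per category, no intercept
I <- model.matrix(~ color + 0)
head(I)

# the dummy variables sum to the constant predictor (a column of ones),
# so also including an intercept column makes the model matrix rank deficient
rowSums(I)
```

Every row sum is 1, which is exactly the constant predictor.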
Fitting a Regression Model
With Intercept
When the category labels are non-numeric, R just does the right thing. R automagically constructs the required dummy variables.
Warning: if the categorical predictor is coded numerically (that is, the values are numbers, but they aren't meant to be treated as numbers, just as category codes), then R won't know the predictor is categorical unless you tell it. Then you need to say
out <- lm(y ~ x + factor(color))
Alternatively, one can tell R that color is a factor before doing the regression:

color <- factor(color)
out <- lm(y ~ x + color)
Without Intercept
It's a little easier to see what's going on here, where there is only one categorical predictor, if we tell R not to fit an intercept. The secret code for that is to add + 0 to the formula specifying the regression model (on-line help). Then we see the intercept for each category as a regression coefficient. And it is easier to plot the corresponding regression lines.
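A sketch of the no-intercept fit, using simulated data as a stand-in for the example data set (the real data are in the file linked above):

```r
# simulated stand-in for the example data: parallel lines, three colors
set.seed(42)
color <- factor(rep(c("red", "blue", "green"), times = 10))
x <- runif(30)
intercepts <- c(red = 1, blue = 2, green = 3)
y <- intercepts[as.character(color)] + 0.5 * x + rnorm(30, sd = 0.1)

# + 0 suppresses the overall intercept, so each category gets its own
# intercept coefficient, and x gets the common slope
out0 <- lm(y ~ x + color + 0)
coef(out0)
```

The coefficients named colorred, colorblue, and colorgreen are the three category intercepts; the coefficient for x is the common slope γ.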
Tests Comparing Models
If we want to compare this model with models that fit the same regression line to all colors or completely different regression lines (different slope as well as different intercept) to different colors, we need to do F tests of model comparison.
The result? The middle model (different intercepts, same slope) fits much better than the little model (same intercept, same slope) (P ≈ 0). The big model (different intercepts, different slopes) fits no better than the middle model (different intercepts, same slope) (P = 0.55). This says the middle model is the one to use (much better than the smaller model and just as good as the bigger, more complicated one).
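The comparisons described above can be sketched like this, again with simulated data standing in for the example data set (the real data are in the file linked above):

```r
# simulated stand-in for the example data
set.seed(42)
color <- factor(rep(c("red", "blue", "green"), times = 10))
x <- runif(30)
y <- c(red = 1, blue = 2, green = 3)[as.character(color)] + 0.5 * x +
    rnorm(30, sd = 0.1)

little <- lm(y ~ x)          # same intercept, same slope for all colors
middle <- lm(y ~ x + color)  # different intercepts, same slope
big    <- lm(y ~ x * color)  # different intercepts, different slopes

# F tests comparing the nested models
anova(little, middle, big)
```

The anova output gives one F test for each step up in model complexity.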
ANOVA
Linear models in which all covariates are categorical are called ANOVA.
One-Way
Linear models in which there is one and only one covariate, which is categorical, are called one-way ANOVA.
Here's an example done in R. The file http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt contains two variables, the response y and a categorical covariate treat (for treatment). We wish to test the null hypothesis that all of the treatments have the same effect (all treatment groups have the same mean) versus the alternative hypothesis, which is anything else. The following R code does this.
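The code from the original page is not reproduced here; a sketch that should do the same thing (assuming the file has a header line, and using factor in case treat is coded numerically) is:

```r
# read the example data from the course web site
foo <- read.table(url("http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt"),
    header = TRUE)

# one-way ANOVA: test equality of the treatment group means
out <- aov(y ~ factor(treat), data = foo)
summary(out)
```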
For our purposes, the only important number in the printout is the P-value 0.00033. The rest of the table printed by the summary function gives, reading from right to left, the F statistic, the numerator and denominator of the F statistic, the chi-square random variables from which the numerator and denominator are constructed, and the degrees of freedom of these chi-square random variables. When all of this was done by hand calculation rather than computer, it was traditional to present all of this information in a table like this. It still gives a warm fuzzy feeling to those who have been taught to do this calculation by hand.
We use the aov function (on-line help) rather than the lm function we use to fit other linear models, but that is a matter of preference. As we see in the following section, lm works too.
One-Way, Using lm instead of aov
From the dummy variables point of view, there's nothing special about ANOVA. It's just linear regression in the special case that all predictor variables are categorical. Here's the same example redone using the R function lm (on-line help).
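A sketch of the lm version, under the same assumptions about the data file as before:

```r
# read the same example data
foo <- read.table(url("http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt"),
    header = TRUE)

# the same one-way ANOVA fit with lm; anova() prints the ANOVA table
out <- lm(y ~ factor(treat), data = foo)
anova(out)
```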
More or less the same table. Exactly the same P-value for the F test.
Two-Way, Main Effects Only
Linear models in which there are two and only two covariates, both of which are categorical, are called two-way ANOVA.
Here's an example done in R. The file http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt contains three variables, the response y and categorical covariates treat (for treatment) and block.
We wish to test the null hypothesis that all of the treatments have the same effect (all treatment groups have the same mean) versus the alternative hypothesis, which is anything else.
Note that the null and alternative hypotheses do not involve the block effects (the regression coefficients for the dummy variables for blocks). The blocks are nevertheless important. If they are left out of the model, the analysis is entirely different.
The following R code does this.
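A sketch of that code, under the same assumptions about the data file format as in the one-way example:

```r
# read the two-way example data from the course web site
foo <- read.table(url("http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt"),
    header = TRUE)

# main-effects-only two-way ANOVA: treatment effects adjusted for blocks
out <- aov(y ~ factor(treat) + factor(block), data = foo)
summary(out)
```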
Since R does not know which categorical predictor is the one we want to test, it reports two P-values. We are only interested in the one for treatments (P = 0.036).
Try leaving block out of the model and see what happens.
Two-Way, With Interactions
More complicated models are possible. A commonly asked question is: are the treatment effects the same in all blocks? The conventional way to address this question is to add an interaction term to the model.
This adds all products of dummy variables for treatments and dummy variables for blocks to the model, which produces, in effect, a dummy variable for each treatment-block combination. The model thus fits, in effect, a different mean for each treatment-block combination.
The following R code does this.
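A sketch of that code; in an R formula, * expands to both main effects and their interaction:

```r
# read the two-way example data
foo <- read.table(url("http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt"),
    header = TRUE)

# treat * block means treat + block + treat:block (interaction included)
out <- aov(y ~ factor(treat) * factor(block), data = foo)
summary(out)
```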
In this ANOVA table the important P-value is the one for the interaction (P = 0.1457). It is not statistically significant. Hence the simpler model in the preceding section fits the data just fine.
Model Comparison
Another way to decide which model is better is to do an F test of model comparison just like we would do for any other nested linear models.
The following R code does this.
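A sketch of that comparison, fitting both models and testing them against each other:

```r
# read the two-way example data
foo <- read.table(url("http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt"),
    header = TRUE)

small <- lm(y ~ factor(treat) + factor(block), data = foo)  # main effects only
big   <- lm(y ~ factor(treat) * factor(block), data = foo)  # with interactions

# F test of model comparison for the nested pair
anova(small, big)
```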
The P-values obtained in this section and in the preceding section are the same: P = 0.1457.