Categorical Predictors and Dummy Variables
We will use the data in the file. This data set contains two predictors, one numeric predictor x and one categorical predictor color, and one response variable y.
What we want to do here is fit parallel regression lines for the three categories. Logically, our model is of the form

    μi = βcolor(i) + γ xi

There are four regression coefficients: βred, βblue, and βgreen, the intercepts for each of the categories, and the common slope parameter γ.
At first this seems very ad hoc. But on further consideration, it is no different from any other linear regression. Remember: it's called linear regression because it's linear in the parameters, not because it's linear in x. Our linear regression model above is indeed linear in the regression coefficients, so there's nothing special about it. We just need to learn how to specify such models.
What predictor variable does βred multiply? The notation we used above doesn't explicitly indicate a variable, but there is one, the so-called dummy variable that is the indicator variable of the category red. Let us adopt the notation Ired for this variable (it is 1 when the individual is in the red category and 0 otherwise) and similar notation for the other categories. Then we can rewrite our regression model as

    μi = βred Ired,i + βblue Iblue,i + βgreen Igreen,i + γ xi
For various reasons R prefers the equivalent model with parameterization

    μi = α + βgreen Igreen,i + βred Ired,i + γ xi

where we drop one of the dummy variables and add an intercept α (by default R drops the dummy for the first factor level, here blue, so α is the intercept for the blue category and the remaining coefficients are differences from it).
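To see that the two parameterizations describe the same model, one can fit both and compare fitted values. Here is a minimal sketch with made-up data, since the course data file is not reproduced here:

```r
set.seed(1)
color <- factor(rep(c("red", "blue", "green"), times = 10))
x <- runif(30)
y <- 1 + 2 * as.numeric(color == "red") + 0.5 * x + rnorm(30)

# dummy (indicator) variables constructed by hand
Ired   <- as.numeric(color == "red")
Iblue  <- as.numeric(color == "blue")
Igreen <- as.numeric(color == "green")

# first parameterization: all three dummies, no intercept
fit1 <- lm(y ~ Ired + Iblue + Igreen + x + 0)

# R's preferred parameterization: intercept plus two dummies
# (blue, the first factor level, is the baseline)
fit2 <- lm(y ~ x + color)

# same model, hence the same fitted values
all.equal(fitted(fit1), fitted(fit2))   # TRUE
```

The coefficients translate between the two forms: for example, the Ired coefficient in the first fit equals the intercept plus the colorred coefficient in the second.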
The reason is that when there are several categorical predictors we must drop one dummy variable from each set of categories and add one intercept to get a well-defined model. Having all the dummy variables would give a model with non-invertible X'X matrix, because the sum of all the dummy variables for a categorical predictor is the constant predictor, for example,

    Ired,i + Iblue,i + Igreen,i = 1,    for all i.
Thus if we include the constant predictor (1), then we must drop one of the dummy variables in order to have a full rank model matrix.
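The rank deficiency is easy to check directly. A short sketch (the color vector here is made up; any data would do):

```r
# with an intercept and all three dummies the model matrix
# has more columns than its rank
color <- factor(c("red", "blue", "green", "red", "blue", "green"))
X <- cbind(intercept = 1,
           Ired   = as.numeric(color == "red"),
           Iblue  = as.numeric(color == "blue"),
           Igreen = as.numeric(color == "green"))
ncol(X)        # 4 columns ...
qr(X)$rank     # ... but rank 3, so X'X = crossprod(X) is singular
```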
Fitting a Regression Model
With Intercept
When the category labels are non-numeric, R just does the right thing. R automagically constructs the required dummy variables.
Warning: if the categorical predictor is coded numerically (that is, the values are numbers, but they aren't meant to be treated as numbers, just as category codes), then R won't know the predictor is categorical unless you tell it. Then you need to say
out <- lm(y ~ x + factor(color))
Alternatively, one can tell R that color
is a factor
before doing the regression
color <- factor(color)
out <- lm(y ~ x + color)
Without Intercept
It's a little easier to see what's going on here, where there is only one categorical predictor, if we tell R not to fit an intercept. The secret code for that is to add + 0 to the formula specifying the regression model. Then we see the intercept for each category as a regression coefficient, and it is easier to plot the corresponding regression lines.
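A sketch of the whole procedure, using simulated data with a known parallel-lines structure (assumed here, since the course data file is not reproduced):

```r
set.seed(42)
n <- 90
color <- factor(rep(c("blue", "green", "red"), each = n / 3))
x <- runif(n)
intercepts <- c(blue = 1, green = 2, red = 3)
y <- intercepts[as.character(color)] + 0.5 * x + rnorm(n, sd = 0.1)

# dropping the intercept with + 0 gives one coefficient per category
out0 <- lm(y ~ x + color + 0)
coef(out0)   # colorblue, colorgreen, colorred are the three intercepts

# plotting the parallel fitted lines is then direct
plot(x, y, col = as.character(color))
for (lev in levels(color))
    abline(coef(out0)[paste0("color", lev)], coef(out0)["x"], col = lev)
```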
Summary
From our point of view, there is no difference between

- linear models with no categorical predictors (only numeric), which people usually call multiple regression,
- linear models with only categorical predictors, which people usually call analysis of variance (ANOVA),
- and linear models with both numeric and categorical predictors, which people sometimes call analysis of covariance (ANCOVA).
When we move to generalized linear models and aster models, nobody makes a big fuss about special nomenclature to single out a trivial concept like dummy variables.