Categorical Predictors and Dummy Variables

We will use the data in the file.

This data set contains two predictors, one numeric predictor x and one categorical predictor color, and one response variable y.

What we want to do here is fit parallel regression lines for the three categories. Logically, our model is of the form

y = β_color + γ x + error

There are four regression coefficients: β_red, β_blue, and β_green, the intercepts for the three categories, and the common slope parameter γ.

At first this seems very ad hoc. But on further consideration, this is no different from any other linear regression. Remember

It's called linear regression because it's linear in the parameters, not because it's linear in x.

Our linear regression model above is indeed linear in the regression coefficients, so there's nothing special about it. We just need to learn how to specify such models.

What predictor variable does β_red multiply? The notation we used above doesn't explicitly indicate a variable, but there is one: the so-called dummy variable, which is the indicator variable of the category red. Let us adopt the notation I_red for this variable (which is 1 when the individual is in the red category and 0 otherwise) and similar notation for the other categories. Then we can rewrite our regression model as

y = β_red I_red + β_blue I_blue + β_green I_green + γ x + error
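To make these dummy variables concrete, here is a minimal sketch, using made-up toy data (not the data file for this example), that asks R for the model matrix of this parameterization.

# toy data, just for illustration (not the data file used in this example)
color <- factor(c("red", "blue", "green", "red", "green", "blue"))
x <- c(1.2, 3.4, 2.2, 0.7, 4.1, 2.8)
model.matrix(~ 0 + color + x)    # columns colorblue, colorgreen, colorred, x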

For various reasons, R prefers the equivalent model with the parameterization

y = β_intercept + β_red I_red + β_green I_green + γ x + error

where we drop one of the dummy variables and add an intercept.

The reason is that when there are several categorical predictors, we must drop one dummy variable from each set of categories and add one intercept to get a well-defined model. Having all the dummy variables would give a model with a non-invertible X'X matrix, because the sum of all the dummy variables is the constant predictor, for example,

I_red + I_blue + I_green = 1

Thus if we include the constant predictor (1), then we must drop one of the dummy variables in order to have a full-rank model matrix.
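To see the rank deficiency directly, here is a sketch, continuing the toy data above, that puts the dropped dummy variable back into the model matrix and checks the rank.

m <- model.matrix(~ x + color)    # intercept, x, and two dummies (blue is dropped)
m.bad <- cbind(m, colorblue = as.numeric(color == "blue"))
ncol(m.bad)       # 5 columns ...
qr(m.bad)$rank    # ... but rank 4: the three dummies sum to the intercept column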

Fitting a Regression Model

With Intercept

When the category labels are non-numeric, R just does the right thing: it automagically constructs the required dummy variables.
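For instance, here is a sketch with a made-up data frame dat standing in for the data file (the variable names y, x, and color match the description above).

# hypothetical data, standing in for the data file
dat <- data.frame(
    x = c(1.2, 3.4, 2.2, 0.7, 4.1, 2.8),
    color = c("red", "blue", "green", "red", "green", "blue"),
    y = c(2.1, 7.9, 5.0, 1.4, 8.8, 6.7)
)
out <- lm(y ~ x + color, data = dat)
summary(out)    # shows coefficients colorgreen and colorred: dummies built automatically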

Warning: if the categorical predictor is coded numerically (that is, the values are numbers, but they aren't meant to be treated as numbers, just as category codes), then R won't know the predictor is categorical unless you tell it. In that case you need to say

out <- lm(y ~ x + factor(color))

Alternatively, one can tell R that color is a factor before doing the regression

color <- factor(color)
out <- lm(y ~ x + color)
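A note on interpreting this parameterization, continuing the toy example above: by default R drops the dummy variable for the first factor level (alphabetical order), and that category's intercept becomes the (Intercept) coefficient.

levels(factor(dat$color))    # "blue" "green" "red", so blue is the reference category
coef(lm(y ~ x + color, data = dat))
# (Intercept) is the intercept for blue; colorgreen and colorred are the
# differences of the green and red intercepts from blue; x is the common slope γ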

Without Intercept

It's a little easier to see what's going on here, where there is only one categorical predictor, if we tell R not to fit an intercept.

The secret code for that is to add + 0 to the formula specifying the regression model.

Then we see the intercept for each category as a regression coefficient. And it is easier to plot the corresponding regression lines.
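Concretely, a sketch with the toy data frame dat from above: the + 0 fit reports the three intercepts directly, and plotting the parallel lines is a loop over them.

out0 <- lm(y ~ x + color + 0, data = dat)
cf <- coef(out0)
cf    # colorblue, colorgreen, colorred are the three intercepts; x is the common slope
plot(dat$x, dat$y, col = dat$color)    # these category labels happen to be R color names
for (lev in c("colorblue", "colorgreen", "colorred"))
    abline(cf[lev], cf["x"])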

Summary

From our point of view, there is no difference between regression with categorical predictors and any other linear regression: the dummy variables are just predictor variables like any others.

When we move to generalized linear models and aster models, nobody makes a big fuss about special nomenclature to single out a trivial concept like dummy variables.