Categorical Predictors and Dummy Variables

The subject of this web page is linear models in which some or all of the predictors are categorical. Important special cases go by the names ANOVA (analysis of variance) and ANCOVA (analysis of covariance).

In principle, there is no problem. The model matrix is allowed to be any function whatsoever of the predictor variables (covariates).

In practice, we need to explain the most commonly used way in which the model matrix is made to depend on categorical covariates.

For our first example we will use the data in the file

This data set contains two predictors, one numeric predictor x and one categorical predictor color, and one response variable y.

What we want to do here is fit parallel regression lines for the three categories. Logically, our model is of the form

y = βcolor + γ x + error

There are four regression coefficients: βred, βblue, and βgreen, the intercepts for the three categories, and the common slope parameter γ.

At first this seems very ad hoc. But on further consideration, this is no different from any other linear regression. Remember

It's called linear regression because it's linear in the parameters, not because it's linear in x.

Our linear regression model above is indeed linear in the regression coefficients so there's nothing special about it. We just need to learn how to specify such models.

What predictor variable does βred multiply? The notation we used above doesn't explicitly indicate a variable, but there is one, the so-called dummy variable that is the indicator variable of the category red. Let us adopt the notation Ired for this variable (that is 1 when the individual is in the red category and 0 otherwise) and similar notation for the other categories. Then we can rewrite our regression model as

y = βred Ired + βblue Iblue + βgreen Igreen + γ x + error
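To make the dummy variables concrete, here is a sketch of how they could be constructed by hand in R (it assumes the variables y, x, and color are in the workspace and that the category labels are red, blue, and green)

Ired <- as.numeric(color == "red")
Iblue <- as.numeric(color == "blue")
Igreen <- as.numeric(color == "green")
# the + 0 suppresses the usual intercept, so each dummy variable gets its
# own coefficient, which is the intercept for that category
out.hand <- lm(y ~ Ired + Iblue + Igreen + x + 0)
summary(out.hand)

In practice one never does this by hand; it is shown here only to make the meaning of the dummy variables explicit.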

For various reasons, R prefers the equivalent model with the parameterization

y = βintercept + βred Ired + βgreen Igreen + γ x + error

where we drop one of the dummy variables and add an intercept.

The reason is that when there are several categorical predictors we must drop one dummy variable from each set of categories and add one intercept to get a well-defined model. Keeping all the dummy variables would give a model whose model matrix is not full rank, because the dummy variables for any one categorical predictor sum to the constant predictor, for example,

Ired + Iblue + Igreen = 1

Thus if we include the constant predictor (1), then we must drop one of the dummy variables in order to have a full rank model matrix.
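One can see the rank deficiency directly in R. A sketch, using the hand-made dummy variables from above:

m <- model.matrix(~ Ired + Iblue + Igreen + x)
ncol(m)        # five columns: intercept, three dummies, and x
qr(m)$rank     # but the rank is only four: the dummies sum to the intercept column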

Fitting a Regression Model

With Intercept

When the category labels are non-numeric, R just does the right thing. R automagically constructs the required dummy variables.
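For example, if the data have been read in so that y, x, and color are available (with color having non-numeric labels), the fit is just (a sketch)

out <- lm(y ~ x + color)
summary(out)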

Warning: if the categorical predictor is coded numerically (that is, the values are numbers, but they aren't meant to be treated as numbers, just as category codes), then R won't know the predictor is categorical unless you tell it. In that case you need to say

out <- lm(y ~ x + factor(color))

Alternatively, one can tell R that color is a factor before doing the regression

color <- factor(color)
out <- lm(y ~ x + color)

Without Intercept

It's a little easier to see what's going on here, where there is only one categorical predictor, if we tell R not to fit an intercept.

The secret code for that is to add + 0 to the formula specifying the regression model (on-line help).
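A sketch (it assumes color has already been converted to a factor, as above, and that its levels red, blue, and green happen to be valid R color names)

out0 <- lm(y ~ x + color + 0)
summary(out0)
# draw the fitted parallel lines, one per category
plot(x, y, col = as.character(color))
for (lev in levels(color))
    abline(coef(out0)[paste0("color", lev)], coef(out0)["x"], col = lev)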

Then we see the intercept for each category as a regression coefficient. And it is easier to plot the corresponding regression lines.

Tests Comparing Models

If we want to compare this model with models that fit the same regression line to all colors or completely different regression lines (different slope as well as different intercept) to different colors, we need to do F tests of model comparison.
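A sketch of the three nested fits and the two F tests (the smaller model goes first in each call to anova)

out.little <- lm(y ~ x)             # same line for every color
out.middle <- lm(y ~ x + color)     # different intercepts, same slope
out.big <- lm(y ~ x * color)        # different intercepts and slopes
anova(out.little, out.middle)
anova(out.middle, out.big)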

The result? The middle model (different intercepts, same slope) fits much better than the little model (same intercept, same slope) (P ≈ 0). The big model (different intercepts, different slopes) fits no better than the middle model (different intercepts, same slope) (P = 0.55). This says the middle model is the one to use (much better than the smaller model and just as good as the bigger, more complicated one).

ANOVA

Linear models in which all covariates are categorical are called ANOVA.

One-Way

Linear models in which there is one and only one covariate, which is categorical, are called one-way ANOVA.

Here's an example done in R. The file

http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt

contains two variables, the response y and a categorical covariate treat (for treatment). We wish to test the null hypothesis that all of the treatments have the same effect (all treatment groups have the same mean) versus the alternative hypothesis, which is anything else. The following R code does this.
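A sketch, reading the file from the URL above (it assumes the file has a header line naming the variables)

foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt",
    header = TRUE)
# factor() in case treat is coded numerically
out <- aov(y ~ factor(treat), data = foo)
summary(out)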

For our purposes, the only important number in the printout is the P-value 0.00033. The rest of the table printed by the summary function gives, reading from right to left, the F statistic, the numerator and denominator of the F statistic, the chi-square random variables from which the numerator and denominator are constructed, and the degrees of freedom of these chi-square random variables. When all of this was done by hand calculation rather than by computer, it was traditional to present all of this information in a table like this. It still gives a warm fuzzy feeling to those who have been taught to do this calculation by hand.

We use the aov function (on-line help) rather than the lm function we use to fit other linear models, but that is a matter of preference. As we see in the following section, lm works too.

One-Way, Using lm instead of aov

From the dummy variables point of view, there's nothing special about ANOVA. It's just linear regression in the special case that all predictor variables are categorical.

Here's the same example redone using the R function lm (on-line help)
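A sketch, using the same data frame foo read in above

out.lm <- lm(y ~ factor(treat), data = foo)
anova(out.lm)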

More or less the same table. Exactly the same P-value for the F test.

Two-Way, Main Effects Only

Linear models in which there are two and only two covariates, both of which are categorical, are called two-way ANOVA.

Here's an example done in R. The file

http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt

contains three variables, the response y and categorical covariates treat (for treatment) and block. We wish to test the null hypothesis that all of the treatments have the same effect (all treatment groups have the same mean) versus the alternative hypothesis, which is anything else.

Note that the null and alternative hypotheses do not involve the block effects (the regression coefficients for the dummy variables for blocks). The blocks are nevertheless important. If they are left out of the model, the analysis is entirely different.

The following R code does this.
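A sketch, reading the file from the URL above (again assuming a header line, and using factor() in case the covariates are coded numerically; the data frame name bar is just for illustration)

bar <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt",
    header = TRUE)
out.main <- aov(y ~ factor(treat) + factor(block), data = bar)
summary(out.main)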

Since R does not know which categorical predictor is the one we want to test, it reports two P-values. We are only interested in the one for treatments (P = 0.036).

Try leaving block out of the model and see what happens.

Two-Way, With Interactions

More complicated models are possible. A commonly asked question is: are the treatment effects the same in all blocks? The conventional way to address this question is to add an interaction term to the model.

This adds all products of dummy variables for treatments and dummy variables for blocks to the model, which produces, in effect, a dummy variable for each treatment-block combination. The model thus fits, in effect, a different mean for each treatment-block combination.

The following R code does this.
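A sketch, continuing with the data frame bar from the preceding section (the * in the formula means main effects plus interaction)

out.inter <- aov(y ~ factor(treat) * factor(block), data = bar)
summary(out.inter)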

In this ANOVA table the important P-value is the one for the interaction (P = 0.1457). It is not statistically significant. Hence the simpler model in the preceding section fits the data just fine.

Model Comparison

Another way to decide which model is better is to do an F test of model comparison just like we would do for any other nested linear models.

The following R code does this.
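A sketch, assuming the fits out.main and out.inter from the preceding sections (smaller model first)

anova(out.main, out.inter)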

The P-values obtained in this section and in the preceding section are the same: P = 0.1457.