## Categorical Predictors and Dummy Variables

The subject of this web page is linear models in which some or all of the predictors are categorical. Special cases are called ANOVA and ANCOVA.

In principle, there is no problem. The model matrix is allowed to be any function whatsoever of the predictor variables (covariates).

In practice, we need to explain the most commonly used way in which the model matrix is made to depend on categorical covariates.

For our first example we will use the data in the file

This data set contains two predictors, one numeric predictor `x` and one categorical predictor `color`, and one response variable `y`.

What we want to do here is fit parallel regression lines for the three categories. Logically, our model is of the form

`y` = β_{color} + γ `x` + error

There are four regression coefficients: β_{red}, β_{blue}, and β_{green}, the intercepts for the three categories, and the common slope parameter γ.

At first this seems very *ad hoc*. But on further consideration, this is no different from any other linear regression. Remember: it's called *linear* regression because it's linear in the *parameters*, not because it's linear in `x`.

Our linear regression model above is indeed linear in the regression coefficients, so there's nothing special about it. We just need to learn how to specify such models.

What predictor variable does β_{red} multiply? The notation we used above doesn't explicitly indicate a variable, but there is one, the so-called *dummy variable* that is the indicator variable of the category red. Let us adopt the notation `I`_{red} for this variable (it is 1 when the individual is in the red category and 0 otherwise) and similar notation for the other categories. Then we can rewrite our regression model as

`y` = β_{red} `I`_{red} + β_{blue} `I`_{blue} + β_{green} `I`_{green} + γ `x` + error

For various reasons R prefers the equivalent model with the parameterization

`y` = β_{intercept} + β_{red} `I`_{red} + β_{green} `I`_{green} + γ `x` + error

where we drop one of the dummy variables and add an intercept.

The reason is that when there are several categorical predictors we must drop one dummy variable from each set of categories and add one intercept to get a well-defined model. Having all the dummy variables would give a model whose model matrix is not full rank, because the sum of all the dummy variables is the constant predictor; for example,

`I`_{red} + `I`_{blue} + `I`_{green} = 1

Thus if we include the constant predictor (1), then we must drop one of the dummy variables in order to have a full rank model matrix.
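To see this rank deficiency concretely, here is a small sketch (the toy `color` vector is made up for illustration, not the data from the file above) showing the model matrices R builds with and without an intercept:

```r
# Toy factor for illustration only
color <- factor(c("red", "blue", "green", "red", "blue", "green"))

# With an intercept: R drops the dummy for the first level (blue here),
# so the model matrix stays full rank
model.matrix(~ color)

# Without an intercept: one dummy per level and no constant column
model.matrix(~ color + 0)
```

The three dummy columns in the second matrix sum to the constant column in the first, which is exactly the linear dependence described above.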

## Fitting a Regression Model

### With Intercept

When the category labels are non-numeric, R just does the right thing. R automagically constructs the required dummy variables.

**Warning:** if the categorical predictor is coded numerically
(that is, the values are numbers, but they aren't meant to be treated as
numbers, just as category codes), then R won't know the predictor is
categorical unless you tell it. Then you need to say

```r
out <- lm(y ~ x + factor(color))
```

Alternatively, one can tell R that `color` is a factor before doing the regression:

```r
color <- factor(color)
out <- lm(y ~ x + color)
```

### Without Intercept

It's a little easier to see what's going on here, where there is only one categorical predictor, if we tell R not to fit an intercept. The secret code for that is to add `+ 0` to the formula specifying the regression model (on-line help). Then we see the intercept for each category as a regression coefficient, and it is easier to plot the corresponding regression lines.
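As a hedged sketch (assuming the data have been read into variables `y`, `x`, and `color`, with `color` already a factor), the no-intercept fit looks like:

```r
# Separate intercept per color plus a common slope, no overall intercept
out <- lm(y ~ x + color + 0)
summary(out)

# The coefficients are the three intercepts and the common slope,
# which is convenient for plotting the three parallel regression lines
coef(out)
```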

### Tests Comparing Models

If we want to compare this model with models that fit the same
regression line to all colors or completely different regression lines
(different slope as well as different intercept) to different colors,
we need to do `F` tests of model comparison.
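A sketch of how the three nested models might be fit and compared (again assuming variables `y`, `x`, and factor `color`; the object names are made up):

```r
out.little <- lm(y ~ x)          # same line for all colors
out.middle <- lm(y ~ x + color)  # parallel lines (different intercepts)
out.big    <- lm(y ~ x * color)  # different intercepts and slopes
anova(out.little, out.middle, out.big)  # F tests of model comparison
```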

The result? The middle model (different intercepts, same slope) fits
much better than the little model (same intercept, same slope)
(`P` ≈ 0).
The big model (different intercepts, different slopes) fits
no better than the middle model (different intercepts, same slope)
(`P` = 0.55). This says the middle model is the one to use
(much better than the smaller model and just as good as the bigger,
more complicated one).

## ANOVA

Linear models in which all covariates are categorical are called ANOVA.

### One-Way

Linear models in which there is one and only one covariate, which is categorical, are called one-way ANOVA.

Here's an example done in R. The file

http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt

contains two variables, the response `y` and a categorical covariate `treat` (for treatment). We wish to test the null hypothesis that all of the treatments have the same effect (all treatment groups have the same mean) versus the alternative hypothesis, which is anything else. The following R code does this.
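A sketch of the analysis (assuming the file has a header line; the object names `foo` and `out` are arbitrary):

```r
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt",
    header = TRUE)
out <- aov(y ~ factor(treat), data = foo)
summary(out)
```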

For our purposes, the only important number in the printout is the `P`-value 0.00033. The rest of the table printed by the `summary` function gives, reading from right to left, the `F` statistic, the numerator and denominator of the `F` statistic, the chi-square random variables from which the numerator and denominator are constructed, and the degrees of freedom of these chi-square random variables. When all of this was done by hand calculation rather than by computer, it was traditional to present all of this information in a table like this. It still gives a warm fuzzy feeling to those who have been taught to do this calculation by hand.

We use the `aov` function (on-line help) rather than the `lm` function we use to fit other linear models, but that is a matter of preference. As we see in the following section, `lm` works too.

### One-Way, Using `lm` instead of `aov`

From the dummy variables point of view, there's nothing special about ANOVA. It's just linear regression in the special case that all predictor variables are categorical.

Here's the same example redone using the R function `lm` (on-line help).
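A sketch of the same analysis done with `lm` (same assumptions as before: header line in the file, arbitrary object names):

```r
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-5.txt",
    header = TRUE)
out <- lm(y ~ factor(treat), data = foo)
anova(out)  # prints the ANOVA table, including the F test
```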

More or less the same table. Exactly the same `P`-value for
the `F` test.

### Two-Way, Main Effects Only

Linear models in which there are two and only two covariates, both of which are categorical, are called two-way ANOVA.

Here's an example done in R. The file

http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt

contains three variables, the response `y` and categorical covariates `treat` (for treatment) and `block`. We wish to test the null hypothesis that all of the treatments have the same effect (all treatment groups have the same mean) versus the alternative hypothesis, which is anything else.

Note that the null and alternative hypotheses do not involve the block effects (the regression coefficients for the dummy variables for blocks). The blocks are nevertheless important. If they are left out of the model, the analysis is entirely different.

The following R code does this.
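A sketch of the main-effects analysis (assuming a header line in the file; object names are arbitrary):

```r
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt",
    header = TRUE)
# Main effects only: treatment and block dummies, no interaction
out <- aov(y ~ factor(treat) + factor(block), data = foo)
summary(out)
```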

Since R does not know which categorical predictor is the one we want to test,
it reports two `P`-values. We are only interested in the one
for treatments (`P` = 0.036).

Try leaving `block` out of the model and see what happens.

### Two-Way, With Interactions

More complicated models are possible. A commonly asked question is: are the treatment effects the same in all blocks? The conventional way to address this question is to add an *interaction* term to the model.

This adds all products of dummy variables for treatments and dummy variables for blocks to the model, which produces, in effect, a dummy variable for each treatment-block combination. The model thus fits, in effect, a different mean for each treatment-block combination.

The following R code does this.
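A sketch of the interaction model (in R's formula mini-language `a * b` expands to `a + b + a:b`, that is, both main effects plus their interaction; assumptions as before):

```r
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt",
    header = TRUE)
# Main effects plus treatment-by-block interaction
out <- aov(y ~ factor(treat) * factor(block), data = foo)
summary(out)
```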

In this ANOVA table
the important `P`-value is the one for the interaction
(`P` = 0.1457). It is not statistically significant.
Hence the simpler model in
the preceding section fits the data just fine.

### Model Comparison

Another way to decide which model is better is to do an
`F` test of model comparison
just like we would do for any other nested linear models.

The following R code does this.
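A sketch of the model comparison (the small model has main effects only, the big model adds the interaction; assumptions as before):

```r
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-6.txt",
    header = TRUE)
out.small <- lm(y ~ factor(treat) + factor(block), data = foo)
out.big   <- lm(y ~ factor(treat) * factor(block), data = foo)
anova(out.small, out.big)  # F test comparing the nested models
```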

The `P`-values obtained in this section
and in the preceding section are the same:
`P` = 0.1457.