Statistics 5102 (Geyer, Spring 2003) Regression with Dummy Variables in R

Categorical Predictors and Dummy Variables

DeGroot and Schervish do not cover the general topic of categorical predictors (also referred to as the topic of dummy variables), although they do cover a very special case, ANOVA, in their sections 10.6, 10.7, and 10.8.

So we will use the data in the file

http://www.stat.umn.edu/geyer/old03/5102/examp/jane.txt

This data set contains two predictors, one numeric predictor x and one categorical predictor color, and one response variable y.

What we want to do here is fit parallel regression lines for the three categories. Logically, our model is of the form

y = β_color + γ x + error

There are four regression coefficients: β_red, β_blue, and β_green, the intercepts for the three categories, and the common slope parameter γ.

At first this seems very ad hoc. But on further consideration, this is no different from any other linear regression. Remember

It's called linear regression because it's linear in the parameters not because it's linear in x

Our linear regression model above is indeed linear in the regression coefficients so there's nothing special about it. We just need to learn how to specify such models.

What predictor variable does β_red multiply? The notation we used above doesn't explicitly indicate a variable, but there is one, the so-called dummy variable that is the indicator variable of the category red. Let us adopt the notation I_red for this variable (that is 1 when the individual is in the red category and 0 otherwise) and similar notation for the other categories. Then we can rewrite our regression model as

y = β_1 I_red + β_2 I_blue + β_3 I_green + β_4 x + error

For various reasons R prefers the equivalent model with parameterization

y = β_1 + β_2 I_red + β_3 I_green + β_4 x + error

where we drop one of the dummy variables and add an intercept.

The reason is that when there are several categorical predictors we must drop one dummy variable from each set of categories and add one intercept to get a well-defined model. Having all the dummy variables would give a model with a non-invertible X'X matrix, because the sum of all the dummy variables is the constant predictor, for example,

I_red + I_blue + I_green = 1

Thus if we include the constant predictor (1), then we must drop one of the dummy variables.
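
Here is a little sketch, with a made-up color vector, showing the dummy variables R constructs in each parameterization.

# made-up example: a three-level categorical variable
color <- factor(c("red", "blue", "green", "red", "green"))

# with an intercept, R drops the dummy for the first level ("blue")
model.matrix(~ color)

# without an intercept (+ 0), all three dummies appear
model.matrix(~ color + 0)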

Fitting a Regression Model

Fitting a Regression Model, With Intercept

When the category labels are non-numeric, R just does the right thing. R automagically constructs the required dummy variables.

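Here is a sketch of this analysis, assuming the data file has a header line naming the variables x, color, and y.

# read the example data from the course web site
jane <- read.table("http://www.stat.umn.edu/geyer/old03/5102/examp/jane.txt",
    header = TRUE)

# parallel regression lines: common slope for x, one intercept per color
out <- lm(y ~ x + color, data = jane)
summary(out)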

Warning: if the categorical predictor is coded numerically (that is, the values are numbers, but they aren't meant to be treated as numbers, just as category codes), then R won't know the predictor is categorical unless you tell it. Then you need to say

out <- lm(y ~ x + factor(color))

Fitting a Regression Model, Without Intercept

It's a little easier to see what's going on here, where there is only one categorical predictor, if we tell R not to fit an intercept.

The secret code for that is to add + 0 to the formula specifying the regression model.

Then we see the intercept for each category as a regression coefficient. And it is easier to plot the corresponding regression lines.

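Here is a sketch of the no-intercept fit and a plot of the three parallel lines, using the same jane data frame as above.

# same model, but with no overall intercept, so each color gets its
# own intercept as a regression coefficient
out0 <- lm(y ~ x + color + 0, data = jane)
summary(out0)

# plot the data and add the three parallel fitted lines
plot(jane$x, jane$y, xlab = "x", ylab = "y")
gamma <- coef(out0)["x"]
intercepts <- coef(out0)[names(coef(out0)) != "x"]
for (b in intercepts) abline(b, gamma)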

ANOVA

One Way

From the dummy variables point of view, there's nothing special about analysis of variance (ANOVA). It's just linear regression in the special case that all predictor variables are categorical.

Here's Example 10.6.1 in DeGroot and Schervish done in R

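As a sketch, assuming the data of Example 10.6.1 are in a data frame (called dat here, a made-up name) with a response y and a categorical predictor type, as in the output below, the analysis looks like this.

# one-way ANOVA: the only predictor is the categorical variable type
# (if type were coded numerically, wrap it in factor() as described above)
out <- aov(y ~ type, data = dat)
summary(out)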

Same as Table 10.16 in DeGroot and Schervish with no fuss.

One Way, Using lm instead of aov

Since ANOVA is just linear regression, we can also fit the same model with lm instead of aov.

Here's Example 10.6.1 in DeGroot and Schervish done in R again, this time with lm.

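Again only a sketch, using the same made-up data frame name dat as above.

# the same one-way ANOVA fit as an ordinary regression
out <- lm(y ~ type, data = dat)
summary(out)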

Same results. This analysis says

F-statistic:  11.6 on 3 and 59 DF,  p-value: 4.479e-06

The other says

            Df Sum Sq Mean Sq F value    Pr(>F)     
type         3  19454    6485  11.595 4.479e-06 *** 
Residuals   59  32995     559      

Note that the F statistic, degrees of freedom, and P-value are the same in both cases.

The F test in the ANOVA is the same as the overall F test for the regression model.

Two Way, Main Effects Only

Since we are treating ANOVA as just a special case of regression, we don't really need any special theory for it. We can skip from the first example in DeGroot and Schervish to the last.

Here's Example 10.8.1 in DeGroot and Schervish done in R

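A sketch of the main-effects-only fit, assuming the gas mileage data of Example 10.8.1 are in a data frame (called mileage here, a made-up name) with response y and categorical predictors equipped and size, as in the output shown further below.

# two-way ANOVA, main effects only (no interaction terms)
out <- aov(y ~ equipped + size, data = mileage)
summary(out)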

This doesn't give the same results as in the book because we're fitting a different model. The jargon for what we are doing is fitting only main effects with no interaction terms.

Conclusion

If what we were interested in testing was whether the device that is supposed to increase gas mileage actually does (whether there is a significant effect of the equipped variable), the conclusion is that the device doesn't work (effect not statistically significant, P = 0.4162).

If what we were interested in testing was the obvious fact that bigger cars get worse gas mileage, then the data do support that, but that's so obvious, we don't need statistics for that.

Two Way, With Interactions

One might wonder if a more sophisticated model might show something. Perhaps the device only improves gas mileage on big cars or something of the sort.

The statistics jargon for that is that there may be an interaction between the equipped and the size variables.

The way we complicate the model to include interactions is to add dummy variables that are products of the main-effect dummy variables, variables like

I_yes I_compact

and the like. Note that the product is equal to 1 if we are in the categories equipped == "yes" and size == "compact" and is equal to 0 otherwise. Thus products of main effect dummy variables are indicator variables for the cells of the two-way table whose row and column labels are the values of the two categorical variables.
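
Here is a little sketch showing that the interaction columns R constructs are exactly these products of main-effect dummies (the size level names other than "compact" are invented for illustration).

# made-up layout: one observation per combination of the two variables
d <- expand.grid(equipped = c("no", "yes"),
    size = c("compact", "intermediate", "large"))

# the columns whose names contain a colon are products of main-effect dummies
model.matrix(~ equipped * size, data = d)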


Here's Example 10.8.1 in DeGroot and Schervish done in R with interactions.

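A sketch of the fit with interactions, using the same made-up data frame name mileage; the two formulas below specify the same model.

# shorthand: * expands to both main effects plus their interaction
out <- aov(y ~ equipped * size, data = mileage)

# the same model spelled out term by term
out <- aov(y ~ equipped + size + equipped:size, data = mileage)
summary(out)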

Now we get the same sums of squares as DeGroot and Schervish (their Table 10.28).

Conclusion

Same as when we didn't have interactions.

The P-value for the equipped variable has hardly changed. Moreover, the interaction P-value is also non-significant.

Model Comparison

Suppose we want to test whether the variable equipped has no effect whatsoever, neither main effects nor interaction effects. In the terminology of DeGroot and Schervish this is a test with hypotheses (10.8.13) or (10.8.17) depending on whether you call the main effects alphas or betas.

This is an F test for model comparison.

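A sketch of the model comparison, again with the made-up data frame name mileage: fit the small model with no equipped terms, fit the big model with all of them, and hand both to anova.

# small model: equipped does not appear at all
out.small <- aov(y ~ size, data = mileage)

# big model: equipped main effect plus equipped-by-size interaction
out.big <- aov(y ~ equipped * size, data = mileage)

# F test comparing the two models
anova(out.small, out.big)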

The F statistic 0.2617 is the same as the one DeGroot and Schervish give in their Example 10.8.4.

Conclusion

Nothing there. P = 0.8522. Nowhere remotely close to significance.

Comment

The test just done uses the test statistic given by equation (10.8.14) in DeGroot and Schervish. If instead we wanted the test statistics for the separate main-effect and interaction hypotheses, those are given by the original ANOVA with interactions for these data.

Rweb:> out <- aov(y ~ equipped * size) 
Rweb:> summary(out) 
              Df  Sum Sq Mean Sq F value    Pr(>F)     
equipped       1  0.4813  0.4813  0.6342    0.4336 
size           2 30.9227 15.4613 20.3707 6.735e-06 *** 
equipped:size  2  0.1147  0.0573  0.0755    0.9275     
Residuals     24 18.2160  0.7590