Chapter 19 More about Factors

You can turn a vector into a factor by either factor(vector) or as.factor(vector). For most usages, it doesn’t matter which you use. as.factor() can be slightly faster (but not enough to make a difference in most usage). factor() has optional arguments that let you reorder the levels, change factor labels, exclude some levels, and so on. If myfactor is already a factor, as.factor(myfactor) leaves it absolutely alone, but factor(myfactor) could make some subtle changes having to do with empty levels. But again, in most situations it doesn’t matter.

If you make something into a factor, R puts the levels in ascending order (numerical, alphabetical, etc). You can override that default order by using the levels option in factor(). Note below that the two factors print the same, but the levels attributes are reordered.

> junk <- as.factor(c(22,22,33,33,15,26));junk
[1] 22 22 33 33 15 26
Levels: 15 22 26 33
> junk2 <- factor(c(22,22,33,33,15,26),levels=c(26,33,22,15));junk2
[1] 22 22 33 33 15 26
Levels: 26 33 22 15

You can also turn character strings into a factor, and R will sort the levels alphabetically (unless you specify some other order as we just showed). Note that levels are case sensitive.

> junk3<-factor(c("red","blue","yellow","Red","white","blue"));junk3
[1] red    blue   yellow Red    white  blue  
Levels: blue red Red white yellow

If you try to make a factor numeric, what it gives you are the category numbers, not the original numeric values (assuming they were numeric). In junk 22 was category 2, but 22 is category 3 in junk2.

> as.numeric(junk)
[1] 2 2 4 4 1 3
> as.numeric(junk2)
[1] 3 3 2 2 4 1

You can’t do arithmetic with factors.

> junk+10
Warning in Ops.factor(junk, 10): '+' not meaningful for factors
[1] NA NA NA NA NA NA

You can convert a factor to numeric and do arithmetic, but you are working with the group indicators, not the original numbers that created the groups (15, 22, etc.).

> as.numeric(junk)+10
[1] 12 12 14 14 11 13

You can retrieve the levels of a factor, but they show up as character data (they’re labels).

> levels(junk)
[1] "15" "22" "26" "33"

Because the levels here are character versions of numbers, we can coerce them to numbers and then do arithmetic with them if we wish.

> as.numeric(levels(junk))
[1] 15 22 26 33
> as.numeric(levels(junk))+10
[1] 25 32 36 43

Of course, that doesn’t work if the levels are not versions of numbers.

> levels(junk3)
[1] "blue"   "red"    "Red"    "white"  "yellow"
> as.numeric(levels(junk3))
Warning: NAs introduced by coercion
[1] NA NA NA NA NA

We can retrieve the levels of the factor (which show up as character), and then convert them to numeric. If we then subscript with the numeric version of the factor we can recover the original numeric data. So you can get the original back, but it’s often easier to just have two variables, one a factor and one not a factor.

as.numeric(levels(junk))[as.numeric(junk)]

19.1 Patterned data

If you need to type in factor levels, you can save a lot of time by learning to use the rep() function (repeat function). Here 1 gets repeated 8 times, 2 gets repeated 8 times, and finally 5 gets repeated six times.

> tb <- factor(rep(1:5,c(8,8,8,7,6)))
> tb
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5
Levels: 1 2 3 4 5

For more regular patterns, you can use rep() with the each and length parameters.

> rep(1:4,each=3,length=24)
 [1] 1 1 1 2 2 2 3 3 3 4 4 4 1 1 1 2 2 2 3 3 3 4 4 4

Here is a way to get counts (replications) in different treatment groups.

> library(cfcdae)
> data(ResinLifetimes)
> replications(~temp,data=ResinLifetimes)
$temp
temp
175 194 213 231 250 
  8   8   8   7   6 

19.2 Splitting and applying

You can split a data vector by a factor, and the result will be a list with named components corresponding to the different levels of the factor. Each element of the list will be a vector of data with the corresponding level of the factor.

> attach(ResinLifetimes)
The following object is masked _by_ .GlobalEnv:

    logTime
The following objects are masked from ResinLifetimes (pos = 4):

    logTime, temp, temp.z, Time
> split(logTime,temp)
Warning in split.default(logTime, temp): data length is not a multiple of split
variable
$`175`
[1] 3.14159

$`194`
numeric(0)

$`213`
numeric(0)

$`231`
numeric(0)

$`250`
numeric(0)

Splitting is useful in several ways. First, some functions operate directly on lists. For example, boxplot() can take a list argument and make separate boxplots for each component of the list. Second, the functions lapply() and sapply() allow you to execute a function for each of the components of a list. lapply(mylist,function.name) applies the function function.name() separately to the data in each component of mylist, and then collects the output in a new list. In this example, we get summary statistics for the data in each treatment group.

> lapply(split(logTime,temp), summary)
Warning in split.default(logTime, temp): data length is not a multiple of split
variable
$`175`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.142   3.142   3.142   3.142   3.142   3.142 

$`194`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                                                

$`213`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                                                

$`231`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                                                

$`250`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
                                                

When the output of each function application is the same shape, sapply() will simplify the output into a compact matrix form.
This would not work if we were applying the sort() function instead of summary(), because the different groups had different lengths, and so the sorted groups would have different lengths after sorting and would not fit into a matrix.

> sapply(split(logTime,temp), summary)
Warning in split.default(logTime, temp): data length is not a multiple of split
variable
            175 194 213 231 250
Min.    3.14159  NA  NA  NA  NA
1st Qu. 3.14159  NA  NA  NA  NA
Median  3.14159  NA  NA  NA  NA
Mean    3.14159 NaN NaN NaN NaN
3rd Qu. 3.14159  NA  NA  NA  NA
Max.    3.14159  NA  NA  NA  NA

Finally, if the output of your function is just a scalar value, tapply(y,a,function.name) is a shortcut for sapply(split(y,a),function.name).

19.3 More on factor representation

Fair warning: this is a more technical/advanced topic, and it can get to be a bit of tough sledding. You might not need all the details.

Factor representation is basically the outcome of how a factor is parameterized. For \(g\) groups, there are \(g\) group means \(\mu_i\). It is also possible to write the means as \(\mu_i = \mu + \alpha_i\). The parameter \(\mu\) is some kind of reference value, and the treatment effects \(\alpha_i\) describe how group \(i\) differs from the reference value. Each different way of defining the reference value yields a different set of treatment effects. Similarly, each different way of defining the reference value leads to a different way of representing the factor as a matrix predictor. Because we are using \(g+1\) parameters, and there are only \(g\) groups to fit, there is redundancy. Once we have determined what we mean by the reference value \(\mu\), only \(g-1\) of the treatment effects can vary freely; the last one must be constrained to make the means line up. This means that if we include the overall mean \(\mu\) in the formula, we only need \(g-1\) columns in the matrix predictor representation of the fact. Mathematically, all these different representations (parameterizations or ways of defining the reference value) are interchangeable. They all lead to the same fitted values.

We will explain the basic issues using a data set of size 6. Factor A has three levels that appear as (1,1,2,2,3,3) in the data. Factor B has two levels that appear as (1,2,1,2,1,2) in the data. Covariate z is numeric with values (1,2,3,4,5,6).
> A <- factor(c(1,1,2,2,3,3))
> B <- factor(c(1,2,1,2,1,2))
> z <- 1:6
Consider the formula ~ 0 + A. In this case, there is no overall constant in the model (you are not fitting a reference value, called (Intercept) in R), so R assumes that you want to use treatment means \(\mu_i\). In effect, the reference value is \(\mu = 0\) (because the intercept is not in the model) and \(\mu_i = \alpha_i\) (the treatment effects are the treatment means). When there is no overall constant preceding A in the formula, factor A will be represented by three columns of dummy (0 or 1) variables, each indicating one of the groups (you can ignore the attributes [attr stuff] in the output).
> model.matrix(~ 0 + A)
  A1 A2 A3
1  1  0  0
2  1  0  0
3  0  1  0
4  0  1  0
5  0  0  1
6  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$A
[1] "contr.sum"
The coefficients (multipliers) for each column will just be the means of the different groups. Call this matrix D.A for dummy A. There is an analogous matrix for B, which we will call D.B.
> model.matrix(~ 0 + B)
  B1 B2
1  1  0
2  0  1
3  1  0
4  0  1
5  1  0
6  0  1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$B
[1] "contr.sum"

19.3.1 Contrasts

For the formula ~ 1 + A (or just ~ A), the factor variable follows the intercept in the formula. In this case, the intercept is represented by a column of all ones, and A is represented by two columns. (In general, for a factor with \(g\) groups the representation of the factor will have \(g-1\) columns.) There are infinitely many choices for those two columns for A, but a handful of options are common.

R describes these various options by what it calls “contrasts”. By default, R uses what it calls treatment contrasts. For treatment contrasts, R just drops the first column from D.A and uses the remaining \(g-1\) columns.
> model.matrix(~A,contrasts=list(A=contr.treatment))
  (Intercept) A2 A3
1           1  0  0
2           1  0  0
3           1  1  0
4           1  1  0
5           1  0  1
6           1  0  1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$A
  2 3
1 0 0
2 1 0
3 0 1

In this setting, the coefficient of the column of all ones will be the mean of group 1 of A (that is, \(\mu = \mu_1\) or \(\alpha_1=0\)), the coefficient of column two (A2) will be the difference group 2 mean minus group 1 mean (\(\alpha_2 = \mu_2 - \mu_1\)), and the coefficient of column three (A3) will be the difference group 3 mean minus group 1 mean (\(\alpha_3 = \mu_3 - \mu_1\)).

Similarly, the representation for ~ B has two columns: the column of all ones and the second column of D.B.
> model.matrix(~B,contrasts=list(B=contr.treatment))
  (Intercept) B2
1           1  0
2           1  1
3           1  0
4           1  1
5           1  0
6           1  1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$B
  2
1 0
2 1

In this model, the intercept represents the mean of group 1 of B, and the coefficient of the second column represents the difference between the mean of B group 2 and B group 1.

Note that the meaning of the intercept changes between these two models; in the first example it was the mean in group 1 of factor A, and in the second example it was the mean of group 1 of factor B.

Another common set of contrasts is the “sum” contrasts. When you load cfcdae, it changes the default contrasts to be “sum” contrasts. The sum contrasts force the treatment effects to add to zero. Equivalently, the sum contrast forces \(\mu\) to be the average of the \(\mu_i\) values. Because treatment effects add to zero with these contrasts, the last treatment effect (here \(\alpha_3\)) is minus the sum of the others (here \(\alpha_3 = -\alpha_1 -\alpha_2)\). The representing matrix looks like:
> model.matrix(~A,contrasts=list(A=contr.sum))
  (Intercept) A1 A2
1           1  1  0
2           1  1  0
3           1  0  1
4           1  0  1
5           1 -1 -1
6           1 -1 -1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$A
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

When you are in group 1, you get \(\mu + \alpha_1\), and you get \(\mu + \alpha_2\) when you are in group 2. In group 3, you get \(\mu - \alpha_1 - \alpha_2\), which is the same as \(\mu + \alpha_3\) when the effects add to zero. Of course, you get an analogous matrix for factor B, making use of the fact that \(\beta_2 = -\beta_1\).

Notice that the meaning of \(\mu\) is the same in both ~A and ~B when you use the sum to zero contrasts. (In unbalanced data, the meaning of \(\mu\) would change slightly between the two models.)

If you have more than one factor variable in a model, the second and following factors will always be represented by the contrasts form; removing the intercept only affects the representation of the first factor. In this example, we see A in dummy form and B in contrast form:
> model.matrix(~ 0 + A + B,contrasts=c(A=contr.sum,B=contr.sum))
  A1 A2 A3 B1
1  1  0  0  1
2  1  0  0 -1
3  0  1  0  1
4  0  1  0 -1
5  0  0  1  1
6  0  0  1 -1
attr(,"assign")
[1] 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$A
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

attr(,"contrasts")$B
  [,1]
1    1
2   -1
Finally, life is easy for a quantitative predictor such as z.
> model.matrix(~z)
  (Intercept) z
1           1 1
2           1 2
3           1 3
4           1 4
5           1 5
6           1 6
attr(,"assign")
[1] 0 1

19.3.2 Once more, with interaction

Consider first the simple hierarchical situation. Here, any interaction in a formula is preceded by any terms it includes. Thus, a:b must follow both a and b in the formula, and a:b:c must follow a, b, c, a:b, a:c, and b:c in the formula. In this case, the matrix that represents an interaction is the matrix that has all the products of one column from each of the matrices representing the interaction’s constituent factors.
> model.matrix(~ A + B + A:B, contrasts=c(A=contr.sum,B=contr.sum))
  (Intercept) A1 A2 B1 A1:B1 A2:B1
1           1  1  0  1     1     0
2           1  1  0 -1    -1     0
3           1  0  1  1     0     1
4           1  0  1 -1     0    -1
5           1 -1 -1  1    -1    -1
6           1 -1 -1 -1     1     1
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$A
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

attr(,"contrasts")$B
  [,1]
1    1
2   -1
This works as well with quantitative predictors.
> model.matrix(~ A + z + A:z,contrasts=c(A=contr.sum))
  (Intercept) A1 A2 z A1:z A2:z
1           1  1  0 1    1    0
2           1  1  0 2    2    0
3           1  0  1 3    0    3
4           1  0  1 4    0    4
5           1 -1 -1 5   -5   -5
6           1 -1 -1 6   -6   -6
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$A
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

In this example, the coefficient of z is an average slope, and the coefficients of A:z adjust the slope up or down for the first two groups of A. The coefficient for the last group of A is set so that the three sum to zero.

The process is rather less obvious for formulas that do not follow hierarchy, and it’s probably best to leave that for another time.