Chapter 13 Data Preliminaries

One way or another, you need to get your data into R. The brute force approach is to type your data in, as in:

my.data <- c(5,3,8,9,10,11) # some interesting data

However, this should be a last resort, as typographical errors are common. There is usually a better way if the data are already in some electronic format.

13.1 Read from a file

13.1.1 Plain text

If your data are in a plain text file on your computer (often named with a .txt extension), then you have a couple of options. The most basic is to use the command scan(), as in mydata <- scan("myfilename.txt"). By default, scan() assumes your file contains numeric data; it reads them all and joins them into a vector which it returns (and you will want to save into an R variable). scan() has many options, allowing you to read character string data (and other types), designate missing values, skip lines at the beginning, treat other lines as comment lines, and many other possibilities.

A common format for a data file is “cases by variables”. In this format, data for different variables appear in different columns, and there is one row of data for each case (for example, each experimental unit). Often, there is a header line giving the names of the variables, as in

X  Y
2  5
3  6

The function mydata <- read.table("myfilename.txt",header=TRUE) will read a file in this format, returning a data frame with one named column for each column in the file. (Use header=FALSE if there is no header line.) One big advantage to read.table() is that you can mix types in a file. For example, the file containing

name  age
"Gary" 67
"Becky" 68
"Christie" 38
"Erica" 35

will be read (using header=TRUE) as one variable with character string data and one variable with numeric data. (If strings have no spaces you can usually dispense with the quotes.) As with scan(), read.table() has many options. For example, you can ask it to automatically convert string variables to factors via stringsAsFactors=TRUE.

There is a read.csv() function that sets the options in read.table() so that it can correctly read comma separated value files (csv files). There is also a read.csv2() version that works with files using commas as decimal points.

13.1.2 Other formats

R has a function that allows you to save an R variable to a file in a special R format. These files are often given the extension .rda or .rData. Loading the variable back into R is simple, merely use load("myfilename.rda").

There are optional packages that can read other formats. For example, the package haven includes the functions read_sas(), read_spss(), and read_dta for reading data saved from SAS, SPSS, and Stata.

13.2 Read from a package

Many R packages (including cfcdae) contain data sets. After loading the package using library(), you can use data() to bring the data set into your workspace as a data frame. For example,
> library(cfcdae)
> data(ResinLifetimes)

13.3 Working with a data frame

OK, you now have a data frame. A data frame is kind of like a matrix, with content arranged in rows and columns. Columns represent different variables, and rows different cases within variables. Rows and columns can be addressed by number, and columns are also named. You can find the names using names(frame) or by looking at the first few rows using head(frame) or by getting summaries using summary(frame). For ResinLifetimes we obtain four column names:
> names(ResinLifetimes)
[1] "temp.z"  "logTime" "temp"    "Time"   
The first few values; temp and temp.z look the same when printed, but they are different.
> head(ResinLifetimes)
  temp.z logTime temp      Time
1    175    2.04  175 109.64782
2    175    1.91  175  81.28305
3    175    2.00  175 100.00000
4    175    1.92  175  83.17638
5    175    1.85  175  70.79458
6    175    1.96  175  91.20108
Summaries that show that temp.z is numeric, whereas temp is a factor with five different categories (along with counts of the categories):
> summary(ResinLifetimes)
     temp.z         logTime       temp        Time        
 Min.   :175.0   Min.   :0.830   175:8   Min.   :  6.761  
 1st Qu.:194.0   1st Qu.:1.210   194:8   1st Qu.: 16.218  
 Median :213.0   Median :1.380   213:8   Median : 23.988  
 Mean   :210.1   Mean   :1.465   231:7   Mean   : 38.305  
 3rd Qu.:231.0   3rd Qu.:1.710   250:6   3rd Qu.: 51.286  
 Max.   :250.0   Max.   :2.040           Max.   :109.648  

There are multiple ways to get a named variable out of a data frame. These all give you the same value (not shown here). The first three versions are different ways of finding the third column (which is named temp) of the data frame. The last version says to search for something named temp, but begin the search inside of the data frame ResinLifetimes.

> ResinLifetimes$temp
> ResinLifetimes[,3]
> ResinLifetimes[,"temp"]
> with(ResinLifetimes,temp)
Because the experimental factor (temperature) is a quantitative variable, we have the option of fitting dose-response or functional models as well as the usual separate-means models. To aid in this, will add some powers of temp.z to the data frame and save the new data frame as myRL. We will look at two ways to do this. It does not matter which way you do it; it’s a matter of personal preference. In the first, we make a copy of ResinLifetimes and then add new named component variables.
> myRL <- ResinLifetimes
> myRL$temp.z2 <- myRL$temp.z^2
> myRL$temp.z3 <- myRL$temp.z^3
> myRL$temp.z4 <- myRL$temp.z^4
> head(myRL)
  temp.z logTime temp      Time temp.z2 temp.z3   temp.z4
1    175    2.04  175 109.64782   30625 5359375 937890625
2    175    1.91  175  81.28305   30625 5359375 937890625
3    175    2.00  175 100.00000   30625 5359375 937890625
4    175    1.92  175  83.17638   30625 5359375 937890625
5    175    1.85  175  70.79458   30625 5359375 937890625
6    175    1.96  175  91.20108   30625 5359375 937890625

In the second version, we use the within() function to work within a copy of ResinLifetimes, and then we save that modified copy into myRL. Think of the copy of ResinLifetimes as its own little workspace. Within that workspace, we create some new variables. Then we save that new workspace (actually, a data frame) into a new named variable myRL. with(dataframe,commands) means that R should first look in dataframe for variables when running the commands; within(dataframe,commands) says that anything you create in commands is kept inside a copy of dataframe. You need to save that copy if you want to retain the new stuff for future use.

Thus, with(foo,x+y) tries to do x+y by first looking for x and y inside of foo. It uses them if it finds them, otherwise it looks for x and y as generic variables. with() returns the result of the command.

within(foo,assignment) tries to make the assignment within a copy of foo. This could either add a new variable or change a variable in foo. It returns the modified data frame copy (which you must assign to a variable if you want to keep).

If you want to do more than one thing in the assignment, you must separate the commands with semicolons and wrap them all in curly brackets.

> myRL <- within(ResinLifetimes,
+       {temp.z2 <- temp.z^2;temp.z3<-temp.z^3;temp.z4<-temp.z^4})
> head(myRL,3)
  temp.z logTime temp      Time   temp.z4 temp.z3 temp.z2
1    175    2.04  175 109.64782 937890625 5359375   30625
2    175    1.91  175  81.28305 937890625 5359375   30625
3    175    2.00  175 100.00000 937890625 5359375   30625

For your own sanity, you need to keep track of whether you are creating/modifying variables within a data frame or outside of a data frame. You can wind up with similarly named variables inside and out, and then things get confusing, because you might think you are using one but you get the other.

13.4 Descriptive Statistics; A Quick Look at the Data

It’s usually best to begin with simple descriptive statistics to get a feel for the data. As above, we get some summary information about the contents of ResinLifetimes.
> summary(ResinLifetimes)
     temp.z         logTime       temp        Time        
 Min.   :175.0   Min.   :0.830   175:8   Min.   :  6.761  
 1st Qu.:194.0   1st Qu.:1.210   194:8   1st Qu.: 16.218  
 Median :213.0   Median :1.380   213:8   Median : 23.988  
 Mean   :210.1   Mean   :1.465   231:7   Mean   : 38.305  
 3rd Qu.:231.0   3rd Qu.:1.710   250:6   3rd Qu.: 51.286  
 Max.   :250.0   Max.   :2.040           Max.   :109.648  

Notice that the information for temp is different from the others. That is because temp is a factor. Factors indicate groups, not quantities. A factor doesn’t know that 194 is in some sense between 175 and 213. To a factor, they’re just groups. In this case, the labels for the levels of the factor are numbers, but they may as well be “red,” “white,” and “blue” as far as the factor is concerned. What summary does for a factor is to list the labels for its levels and the number of data values corresponding to each.

You could get similar information for temp via (results not shown):
> table(ResinLifetimes$temp)
> with(ResinLifetimes,table(temp))
A histogram of the log lifetimes looks sort of bimodal:
> hist(ResinLifetimes$logTime)
Histogram of log lifetimes

Figure 13.1: Histogram of log lifetimes

We get something similar with a stem and leaf plot.
> with(ResinLifetimes,stem(logTime))

  The decimal point is 1 digit(s) to the left of the |

   8 | 3
  10 | 26895677
  12 | 1266781588
  14 | 2345
  16 | 166616
  18 | 580126
  20 | 04
Given that temp is a grouping factor, how about summaries of one variable by levels of another? Here, make a table of means of log lifetimes:
> tapply(ResinLifetimes$logTime,ResinLifetimes$temp,mean)
     175      194      213      231      250 
1.932500 1.628750 1.377500 1.194286 1.056667 
Or if you want the result to be a list, here is a list of standard deviations of lifetime:
> lapply(split(ResinLifetimes$logTime,ResinLifetimes$temp),sd)
$`175`
[1] 0.06341473

$`194`
[1] 0.1048042

$`213`
[1] 0.1071381

$`231`
[1] 0.04577377

$`250`
[1] 0.1383715

There are many ways to get boxplots of multiple groups of data. In this approach, we use a model-like approach where we model logTime by temp. This gives us a boxplot of logTime data separately by the levels of temp. It’s clear here that there are big differences in lifetimes between levels of temperature, and the inferential methods are only going to confirm what we can see in the plot.

> boxplot(logTime~temp,data=myRL)
boxplots of log lifetime data by temperature group

Figure 13.2: boxplots of log lifetime data by temperature group

The horizontal locations in the previous boxplot are just equally spaced. In some situations, it is useful to put the boxes at specific positions. For these data, the specified positions are almost equally spaced, so the boxplots at specified positions look a lot like the default. (You need the boxwex=5 or you get skinny boxplots.)

> boxplot(logTime~temp,data=myRL,at=unique(myRL$temp.z),boxwex=5)
boxplots of log lifetime data by temperature

Figure 13.3: boxplots of log lifetime data by temperature

13.5 Down the rabbit hole of search paths

Let’s think about variable naming and how R finds variables when you type a variable name. I don’t want to scare you with this, but I have been burned by getting or modifying the wrong variable, so I add this extended example so that you will benefit from my mistakes.

The moral of the story will be that if you have variables with the same names in different places (for example, ordinary variables and inside data frames), make sure you know that you’re working with the right one.

R has search paths, meaning that it will look in one place for a variable, then in another, and so on, using the first one it finds. The with() function changes the search order. The attach() function changes the search order. Some functions have a data=frame argument that changes the search order.

R will even tell you what the search path is. Here we see that .GlobalEnv is the first place R looks. This is just where normal variables you create are put. After that come a bunch of built in R locations. These are where R functions (which internally are just another R variable) are found. And, there are some data sets as well.

> searchpaths()
 [1] ".GlobalEnv"                                                                      
 [2] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/rstan"      
 [3] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/ggplot2"    
 [4] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/StanHeaders"
 [5] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/bcfcdae"    
 [6] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/perm"       
 [7] "RunStitch"                                                                       
 [8] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/stats"      
 [9] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/graphics"   
[10] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/grDevices"  
[11] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/utils"      
[12] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/datasets"   
[13] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/methods"    
[14] "Autoloads"                                                                       
[15] "/Library/Frameworks/R.framework/Resources/library/base"                          
When you bring in a package, that package (and sometimes others it depends on) is added to the search list right after .GlobalEnv. That’s what library() really does; it puts the package in the search path.
> library(cfcdae)
> searchpaths()
 [1] ".GlobalEnv"                                                                      
 [2] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/cfcdae"     
 [3] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/rstan"      
 [4] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/ggplot2"    
 [5] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/StanHeaders"
 [6] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/bcfcdae"    
 [7] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/perm"       
 [8] "RunStitch"                                                                       
 [9] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/stats"      
[10] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/graphics"   
[11] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/grDevices"  
[12] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/utils"      
[13] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/datasets"   
[14] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/methods"    
[15] "Autoloads"                                                                       
[16] "/Library/Frameworks/R.framework/Resources/library/base"                          

The data() command does not change the search path. It looks for (in the search path) a data frame with that name and makes a copy of it in .GlobalEnv. That way you can use it.

In this example, I am starting with no global environment variables. Then I bring ResinLifetimes in.
> ls() # list out variables in .GlobalEnv
 [1] "all.compare"     "all.out"         "alldata"         "alldata2"       
 [5] "cutoff"          "day15"           "day28"           "differences"    
 [9] "fit.0.0"         "fit.0.1"         "fit.0.3"         "fit.0.5"        
[13] "fit.0.5.s"       "fit.0.8"         "fit.5.0"         "fit.5.1"        
[17] "fit.5.3"         "fit.5.5"         "fit.5.8"         "fit.notch.indiv"
[21] "fit.single"      "fullLike"        "fullmodel"       "i"              
[25] "loglike"         "LRT0"            "LRTp5"           "matrixdata"     
[29] "mcmc.intercept"  "mcmc.samples"    "myLogLike"       "myRL"           
[33] "mytable"         "notched"         "NotchedBoards"   "null0Like"      
[37] "null0model"      "nullp5Like"      "nullp5model"     "r.order"        
[41] "ratio"           "ResinLifetimes"  "rowmatrix"       "RunStitch"      
[45] "shape"           "subject.names"   "temp.diff"       "unnotched"      
[49] "x"               "y"              
> data(Shrinkage)
> ls()
 [1] "all.compare"     "all.out"         "alldata"         "alldata2"       
 [5] "cutoff"          "day15"           "day28"           "differences"    
 [9] "fit.0.0"         "fit.0.1"         "fit.0.3"         "fit.0.5"        
[13] "fit.0.5.s"       "fit.0.8"         "fit.5.0"         "fit.5.1"        
[17] "fit.5.3"         "fit.5.5"         "fit.5.8"         "fit.notch.indiv"
[21] "fit.single"      "fullLike"        "fullmodel"       "i"              
[25] "loglike"         "LRT0"            "LRTp5"           "matrixdata"     
[29] "mcmc.intercept"  "mcmc.samples"    "myLogLike"       "myRL"           
[33] "mytable"         "notched"         "NotchedBoards"   "null0Like"      
[37] "null0model"      "nullp5Like"      "nullp5model"     "r.order"        
[41] "ratio"           "ResinLifetimes"  "rowmatrix"       "RunStitch"      
[45] "shape"           "Shrinkage"       "subject.names"   "temp.diff"      
[49] "unnotched"       "x"               "y"              

The attach() function does something similar to library() by taking the named data frame object in the argument and putting it in the search path. So if I have a data frame named VeryLongNameThatIsAPainToType, then I can just attach it into the search path. That way when I use a variable name that corresponds to a column in VeryLongNameThatIsAPainToType, R will find it without having to type in the long name again.

> mean(ResinLifetimes$logTime)
[1] 1.465135
> mean(logTime) # wont find it
Error in mean(logTime): object 'logTime' not found
> attach(ResinLifetimes)
> searchpaths() # now in second slot
 [1] ".GlobalEnv"                                                                      
 [2] "ResinLifetimes"                                                                  
 [3] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/cfcdae"     
 [4] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/rstan"      
 [5] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/ggplot2"    
 [6] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/StanHeaders"
 [7] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/bcfcdae"    
 [8] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/perm"       
 [9] "RunStitch"                                                                       
[10] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/stats"      
[11] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/graphics"   
[12] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/grDevices"  
[13] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/utils"      
[14] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/datasets"   
[15] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/methods"    
[16] "Autoloads"                                                                       
[17] "/Library/Frameworks/R.framework/Resources/library/base"                          
> mean(logTime)
[1] 1.465135

Be aware that if you have an ordinary variable (something in .GlobalEnv) with the same name as a column in the data frame you just attached, R is going to find the ordinary variable before it finds the column in the attached data frame, because .GlobalEnv comes before the attached data frame in the search order.

> logTime <- 3.14159
> mean(logTime) #pi not the log lifetimes
[1] 3.14159

The with(myframe,mycommand) function changes the search order temporarily. During the execution of mycommand, myframe is searched first, even before ordinary variables in .GlobalEnv.

> logTime <- 3.14159
> with(ResinLifetimes,mean(logTime)) # not pi this time
[1] 1.465135

Some functions have a data=myframe argument. That says to look in myframe before anything else, including the search path and anything added temporarily via with().

One needs to be careful … If you have several things with the same name in various places in the search path or in various data frames

  • Make sure that your command finds the one you want.
  • When you change a variable, make sure that your command is changing the one you want.

13.5.1 More Examples

Make some new variables for illustration.

  • xx is a regular variable and also in data frames foo and goo.
  • yy is in the two data frames but is not a regular variable.
  • zz is a regular variable but not in the data frames.

The values from the different sources are all different, so we can tell which one we’re getting.

> xx <- 1;zz <- 3
> foo <- data.frame(xx=10,yy=20)
> goo <- data.frame(xx=100,yy=200)
Let’s try to see xx. First we just look at xx as a usual variable; this does not look into the data frames.
> xx
[1] 1
The next two commands show that using with() pushes the data frame ahead of the normal variable, so R finds the data frame versions.
> with(foo,xx)
[1] 10
> with(goo,xx)
[1] 100
The fourth command starts out by using with() to put foo at the front of the search path. But then the command it runs starts out by pushing goo to the head of the search path, so we get the version in goo.
> with(foo,with(goo,xx))
[1] 100
Finally, there is no yy variable, and even though there are versions of yy in the data frames, if we just use yy bare R won’t find it.
> yy
Error in eval(expr, envir, enclos): object 'yy' not found

zz is just a regular variable, and there is no version in either data frame. If we ask for zz bare, we get it. If we ask for zz but push foo to the head of the search path using with(), then R looks inside foo and doesn’t find zz, so it moves on to the next item in the search path, which is .GlobalEnv, and finds zz there.

> zz
[1] 3
> with(foo,zz)
[1] 3

We’re going to fit a single mean model to a single data point (we do this instead of using something simple like mean() because ls() has a data= option), but it lets us see which variable we’re getting.

There is a plain variable xx which R can find, but yy lives only in the data frames, so R can’t find it without help.

> coef( lm(xx~1) )
(Intercept) 
          1 
> coef( lm(yy~1) )
Error in eval(predvars, data, env): object 'yy' not found

The data=foo argument tells lm() to look in foo, so no problem. Using with() also works.

> coef( lm(yy~1,data=foo))
(Intercept) 
         20 
> with(foo, coef(lm(yy~1)))
(Intercept) 
         20 

There is a yy in goo, but the data=foo argument tells lm() to look in foo first, so that’s the one it gets, even ahead of what is pushed via with().

> with(goo,coef(lm(yy~1,data=foo)))
(Intercept) 
         20 

This attach() command puts goo in the search path, but it puts it in after normal variables (.GlobalEnv). The warning is saying that the xx in goo is hidden by the plain xx variable.

> attach(goo)
The following object is masked _by_ .GlobalEnv:

    xx

There is no ordinary variable yy, so not finding it there R goes ahead and finds it in goo. On the other hand, there is a plain xx, so R finds that one before the one in goo.

> yy
[1] 200
> xx
[1] 1

with() pushes variables in foo ahead of ordinary variables and anything in goo.

> with(foo,yy)
[1] 20

The data=foo form pushes foo ahead of anything attached, or anything added to the search path via with.

> coef(lm(yy~1,data=foo))
(Intercept) 
         20 
> with(goo,coef(lm(yy~1,data=foo)))
(Intercept) 
         20 

You can remove something from the search path using detach().

> detach(goo)