Chapter 13 Data Preliminaries
One way or another, you need to get your data into R. The brute force approach is to type your data in, as in:
my.data <- c(5,3,8,9,10,11) # some interesting data
However, this should be a last resort, as typographical errors are common. There is usually a better way if the data are already in some electronic format.
13.1 Read from a file
13.1.1 Plain text
If your data are in a plain text file on your computer
(often named with a .txt
extension), then you have a couple
of options. The most basic is to use the command scan()
,
as in mydata <- scan("myfilename.txt")
. By default, scan()
assumes your
file contains numeric data; it reads them all and joins them into
a vector which it returns (and you will want to save into an R variable).
scan()
has many options, allowing you to read character string data
(and other types), designate missing values,
skip lines at the beginning, treat other lines as comment lines,
and many other possibilities.
A common format for a data file is “cases by variables”. In this format, data for different variables appear in different columns, and there is one row of data for each case (for example, each experimental unit). Often, there is a header line giving the names of the variables, as in
X Y
2 5
3 6
The function mydata <- read.table("myfilename.txt",header=TRUE)
will read a
file in this format, returning a data frame with one named column
for each column in the file. (Use header=FALSE
if there is no header line.)
One big advantage to read.table()
is that you can mix types in a file.
For example, the file containing
name age
"Gary" 67
"Becky" 68
"Christie" 38
"Erica" 35
will be read (using header=TRUE
) as one variable with character string
data and one variable with numeric data. (If strings have no spaces you
can usually dispense with the quotes.)
As with scan()
, read.table()
has many options. For example,
you can ask it to automatically convert string variables to factors via
stringsAsFactors=TRUE
.
There is a read.csv()
function that sets the options in read.table()
so that
it can correctly read comma separated value files (csv files). There is also a
read.csv2()
version that works with files using commas as decimal points.
13.1.2 Other formats
R has a function that allows you to save an R variable to a file in a special R format.
These files are often given the extension .rda
or .rData
. Loading the variable
back into R is simple, merely use load("myfilename.rda")
.
There are optional packages that can read other formats. For example,
the package haven
includes the functions read_sas()
, read_spss()
,
and read_dta
for reading data saved from SAS, SPSS, and Stata.
13.2 Read from a package
Many R packages (includingcfcdae
) contain data sets. After loading
the package using library()
, you can use data()
to bring the data
set into your workspace as a data frame. For example,
> library(cfcdae)
> data(ResinLifetimes)
13.3 Working with a data frame
OK, you now have a data frame. A data frame is kind of like a matrix, with content arranged in rows and columns. Columns represent different variables, and rows different cases within variables. Rows and columns can be addressed by number, and columns are also named. You can find the names usingnames(frame)
or by looking at the first few rows using head(frame)
or by
getting summaries using summary(frame)
. For ResinLifetimes
we obtain
four column names:
> names(ResinLifetimes)
[1] "temp.z" "logTime" "temp" "Time"
temp
and temp.z
look the same when
printed, but they are different.
> head(ResinLifetimes)
temp.z logTime temp Time
1 175 2.04 175 109.64782
2 175 1.91 175 81.28305
3 175 2.00 175 100.00000
4 175 1.92 175 83.17638
5 175 1.85 175 70.79458
6 175 1.96 175 91.20108
temp.z
is numeric, whereas temp
is
a factor with five different categories (along with counts of the categories):
> summary(ResinLifetimes)
temp.z logTime temp Time
Min. :175.0 Min. :0.830 175:8 Min. : 6.761
1st Qu.:194.0 1st Qu.:1.210 194:8 1st Qu.: 16.218
Median :213.0 Median :1.380 213:8 Median : 23.988
Mean :210.1 Mean :1.465 231:7 Mean : 38.305
3rd Qu.:231.0 3rd Qu.:1.710 250:6 3rd Qu.: 51.286
Max. :250.0 Max. :2.040 Max. :109.648
There
are multiple ways to get a named variable out of a data frame.
These all give you the same value (not shown here). The
first three versions are different ways of finding the
third column (which is named temp
) of the data frame.
The last version says to search for something named temp
,
but begin the search inside of the data frame ResinLifetimes
.
> ResinLifetimes$temp
> ResinLifetimes[,3]
> ResinLifetimes[,"temp"]
> with(ResinLifetimes,temp)
temp.z
to the data frame and save the new data frame as myRL
.
We will look at two ways to do this. It does not matter
which way you do it; it’s a matter of personal preference.
In the first, we
make a copy of ResinLifetimes
and then add new
named component variables.
> myRL <- ResinLifetimes
> myRL$temp.z2 <- myRL$temp.z^2
> myRL$temp.z3 <- myRL$temp.z^3
> myRL$temp.z4 <- myRL$temp.z^4
> head(myRL)
temp.z logTime temp Time temp.z2 temp.z3 temp.z4
1 175 2.04 175 109.64782 30625 5359375 937890625
2 175 1.91 175 81.28305 30625 5359375 937890625
3 175 2.00 175 100.00000 30625 5359375 937890625
4 175 1.92 175 83.17638 30625 5359375 937890625
5 175 1.85 175 70.79458 30625 5359375 937890625
6 175 1.96 175 91.20108 30625 5359375 937890625
In the second version, we use the within()
function to
work within a copy of ResinLifetimes
, and then we save that
modified copy into myRL
.
Think of the copy of ResinLifetimes
as its own little workspace. Within
that workspace, we create some new variables. Then we save that new workspace
(actually, a data frame) into a new named variable myRL
. with(dataframe,commands)
means that R should first look in dataframe
for variables when
running the commands
; within(dataframe,commands)
says that anything
you create in commands
is kept inside a copy of dataframe
. You need to save that copy if you
want to retain the new stuff for future use.
Thus, with(foo,x+y)
tries to do x+y
by first looking for
x
and y
inside of foo
. It uses them if it finds them, otherwise
it looks for x
and y
as generic variables. with()
returns
the result of the command.
within(foo,assignment)
tries to make the assignment within
a copy
of foo
. This could either add a new variable or change a variable in
foo
. It returns the modified data frame copy (which you must
assign to a variable if you want to keep).
If you want to do more than one thing in the assignment
, you must
separate the commands with semicolons and wrap them all in curly brackets.
> myRL <- within(ResinLifetimes,
+ {temp.z2 <- temp.z^2;temp.z3<-temp.z^3;temp.z4<-temp.z^4})
> head(myRL,3)
temp.z logTime temp Time temp.z4 temp.z3 temp.z2
1 175 2.04 175 109.64782 937890625 5359375 30625
2 175 1.91 175 81.28305 937890625 5359375 30625
3 175 2.00 175 100.00000 937890625 5359375 30625
For your own sanity, you need to keep track of whether you are creating/modifying variables within a data frame or outside of a data frame. You can wind up with similarly named variables inside and out, and then things get confusing, because you might think you are using one but you get the other.
13.4 Descriptive Statistics; A Quick Look at the Data
It’s usually best to begin with simple descriptive statistics to get a feel for the data. As above, we get some summary information about the contents ofResinLifetimes
.
> summary(ResinLifetimes)
temp.z logTime temp Time
Min. :175.0 Min. :0.830 175:8 Min. : 6.761
1st Qu.:194.0 1st Qu.:1.210 194:8 1st Qu.: 16.218
Median :213.0 Median :1.380 213:8 Median : 23.988
Mean :210.1 Mean :1.465 231:7 Mean : 38.305
3rd Qu.:231.0 3rd Qu.:1.710 250:6 3rd Qu.: 51.286
Max. :250.0 Max. :2.040 Max. :109.648
Notice that the information for temp
is different from the others. That is because temp
is
a factor. Factors indicate groups, not quantities. A factor
doesn’t know that 194 is in some sense between 175 and 213. To a factor,
they’re just groups. In this case, the labels for the levels of the
factor are numbers, but they may as well be “red,” “white,” and
“blue” as far as the factor is concerned.
What summary
does for a factor is to list the labels
for its levels and
the number of data values corresponding to each.
temp
via (results not shown):
> table(ResinLifetimes$temp)
> with(ResinLifetimes,table(temp))
> hist(ResinLifetimes$logTime)

Figure 13.1: Histogram of log lifetimes
> with(ResinLifetimes,stem(logTime))
The decimal point is 1 digit(s) to the left of the |
8 | 3
10 | 26895677
12 | 1266781588
14 | 2345
16 | 166616
18 | 580126
20 | 04
temp
is a grouping factor, how about summaries of
one variable by levels of another? Here, make a table of means of log lifetimes:
> tapply(ResinLifetimes$logTime,ResinLifetimes$temp,mean)
175 194 213 231 250
1.932500 1.628750 1.377500 1.194286 1.056667
list
, here is a list of standard deviations of lifetime:
> lapply(split(ResinLifetimes$logTime,ResinLifetimes$temp),sd)
$`175`
[1] 0.06341473
$`194`
[1] 0.1048042
$`213`
[1] 0.1071381
$`231`
[1] 0.04577377
$`250`
[1] 0.1383715
There are many
ways to get boxplots of multiple groups of data. In this approach,
we use a model-like approach where we model logTime
by temp
. This gives
us a boxplot of logTime
data separately by the levels of temp
.
It’s clear here that there are big differences in lifetimes
between levels of temperature, and the inferential methods
are only going to confirm what we can see in the plot.
> boxplot(logTime~temp,data=myRL)

Figure 13.2: boxplots of log lifetime data by temperature group
The horizontal locations in the previous boxplot are just
equally spaced. In some situations, it is useful to put
the boxes at specific positions. For these data,
the specified positions are almost equally spaced,
so the boxplots at specified positions look a lot like
the default. (You need the boxwex=5
or you get skinny
boxplots.)
> boxplot(logTime~temp,data=myRL,at=unique(myRL$temp.z),boxwex=5)

Figure 13.3: boxplots of log lifetime data by temperature
13.5 Down the rabbit hole of search paths
Let’s think about variable naming and how R finds variables when you type a variable name. I don’t want to scare you with this, but I have been burned by getting or modifying the wrong variable, so I add this extended example so that you will benefit from my mistakes.
The moral of the story will be that if you have variables with the same names in different places (for example, ordinary variables and inside data frames), make sure you know that you’re working with the right one.
R has search paths, meaning that it will
look in one place for a variable, then in another, and so on,
using the first one it finds. The with()
function changes the search order. The attach()
function changes the search
order. Some functions have a data=frame
argument that changes the search order.
R will even tell you
what the search path is.
Here we see that .GlobalEnv
is the first place R looks.
This is just where normal variables you create are put.
After that come a bunch of built in R locations. These
are where R functions (which internally are just another R variable)
are found. And, there are some data sets as well.
> searchpaths()
[1] ".GlobalEnv"
[2] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/rstan"
[3] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/ggplot2"
[4] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/StanHeaders"
[5] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/bcfcdae"
[6] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/perm"
[7] "RunStitch"
[8] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/stats"
[9] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/graphics"
[10] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/grDevices"
[11] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/utils"
[12] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/datasets"
[13] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/methods"
[14] "Autoloads"
[15] "/Library/Frameworks/R.framework/Resources/library/base"
.GlobalEnv
. That’s what library()
really
does; it puts the package in the search path.
> library(cfcdae)
> searchpaths()
[1] ".GlobalEnv"
[2] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/cfcdae"
[3] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/rstan"
[4] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/ggplot2"
[5] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/StanHeaders"
[6] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/bcfcdae"
[7] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/perm"
[8] "RunStitch"
[9] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/stats"
[10] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/graphics"
[11] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/grDevices"
[12] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/utils"
[13] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/datasets"
[14] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/methods"
[15] "Autoloads"
[16] "/Library/Frameworks/R.framework/Resources/library/base"
The data()
command does not change the search path.
It looks for (in the search path) a data frame with that name
and makes a copy of it in .GlobalEnv
. That way you can use it.
> ls() # list out variables in .GlobalEnv
[1] "all.compare" "all.out" "alldata" "alldata2"
[5] "cutoff" "day15" "day28" "differences"
[9] "fit.0.0" "fit.0.1" "fit.0.3" "fit.0.5"
[13] "fit.0.5.s" "fit.0.8" "fit.5.0" "fit.5.1"
[17] "fit.5.3" "fit.5.5" "fit.5.8" "fit.notch.indiv"
[21] "fit.single" "fullLike" "fullmodel" "i"
[25] "loglike" "LRT0" "LRTp5" "matrixdata"
[29] "mcmc.intercept" "mcmc.samples" "myLogLike" "myRL"
[33] "mytable" "notched" "NotchedBoards" "null0Like"
[37] "null0model" "nullp5Like" "nullp5model" "r.order"
[41] "ratio" "ResinLifetimes" "rowmatrix" "RunStitch"
[45] "shape" "subject.names" "temp.diff" "unnotched"
[49] "x" "y"
> data(Shrinkage)
> ls()
[1] "all.compare" "all.out" "alldata" "alldata2"
[5] "cutoff" "day15" "day28" "differences"
[9] "fit.0.0" "fit.0.1" "fit.0.3" "fit.0.5"
[13] "fit.0.5.s" "fit.0.8" "fit.5.0" "fit.5.1"
[17] "fit.5.3" "fit.5.5" "fit.5.8" "fit.notch.indiv"
[21] "fit.single" "fullLike" "fullmodel" "i"
[25] "loglike" "LRT0" "LRTp5" "matrixdata"
[29] "mcmc.intercept" "mcmc.samples" "myLogLike" "myRL"
[33] "mytable" "notched" "NotchedBoards" "null0Like"
[37] "null0model" "nullp5Like" "nullp5model" "r.order"
[41] "ratio" "ResinLifetimes" "rowmatrix" "RunStitch"
[45] "shape" "Shrinkage" "subject.names" "temp.diff"
[49] "unnotched" "x" "y"
The attach()
function does something similar to library()
by
taking the named data frame object in the argument and putting it in the
search path. So if I have a data frame named VeryLongNameThatIsAPainToType
,
then I can just attach it into the search path. That way when I use
a variable name that corresponds to a column in VeryLongNameThatIsAPainToType
,
R will find it without having to type in the long name again.
> mean(ResinLifetimes$logTime)
[1] 1.465135
> mean(logTime) # wont find it
Error in mean(logTime): object 'logTime' not found
> attach(ResinLifetimes)
> searchpaths() # now in second slot
[1] ".GlobalEnv"
[2] "ResinLifetimes"
[3] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/cfcdae"
[4] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/rstan"
[5] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/ggplot2"
[6] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/StanHeaders"
[7] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/bcfcdae"
[8] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/perm"
[9] "RunStitch"
[10] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/stats"
[11] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/graphics"
[12] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/grDevices"
[13] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/utils"
[14] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/datasets"
[15] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/methods"
[16] "Autoloads"
[17] "/Library/Frameworks/R.framework/Resources/library/base"
> mean(logTime)
[1] 1.465135
Be aware that if you have an ordinary variable (something in .GlobalEnv
) with
the same name as a column in the data frame you just attached, R is
going to find the ordinary variable before it finds the column in the attached
data frame, because .GlobalEnv
comes before the attached data frame
in the search order.
> logTime <- 3.14159
> mean(logTime) #pi not the log lifetimes
[1] 3.14159
The with(myframe,mycommand)
function changes the search order temporarily. During
the execution of mycommand
, myframe
is searched first,
even before ordinary variables in .GlobalEnv
.
> logTime <- 3.14159
> with(ResinLifetimes,mean(logTime)) # not pi this time
[1] 1.465135
Some functions have a data=myframe
argument.
That says to look in myframe
before anything else,
including the search path and anything added
temporarily via with()
.
One needs to be careful … If you have several things with the same name in various places in the search path or in various data frames
- Make sure that your command finds the one you want.
- When you change a variable, make sure that your command is changing the one you want.
13.5.1 More Examples
Make some new variables for illustration.
xx
is a regular variable and also in data framesfoo
andgoo
.yy
is in the two data frames but is not a regular variable.zz
is a regular variable but not in the data frames.
The values from the different sources are all different, so we can tell which one we’re getting.
> xx <- 1;zz <- 3
> foo <- data.frame(xx=10,yy=20)
> goo <- data.frame(xx=100,yy=200)
xx
. First we just look at xx
as a usual variable; this does
not look into the data frames.
> xx
[1] 1
with()
pushes
the data frame ahead of the normal variable, so R finds the data frame versions.
> with(foo,xx)
[1] 10
> with(goo,xx)
[1] 100
with()
to put foo
at the front of the search path. But then the command it
runs
starts out by pushing goo
to the head of the search path, so we get the
version in goo
.
> with(foo,with(goo,xx))
[1] 100
yy
variable, and
even though there are versions of yy
in the data frames, if
we just use yy
bare R won’t find it.
> yy
Error in eval(expr, envir, enclos): object 'yy' not found
zz
is just a regular variable, and there
is no version in either data frame. If we ask for zz
bare, we get it.
If we ask for zz
but push foo
to the head of the search path
using with()
, then R looks inside foo
and doesn’t find zz
, so it moves
on to the next item in the search path, which is .GlobalEnv
, and
finds zz
there.
> zz
[1] 3
> with(foo,zz)
[1] 3
We’re going to fit a single mean model to a single data point (we do this instead
of using something simple like mean()
because ls()
has a data=
option),
but it lets us see which variable we’re getting.
There is a plain variable xx
which R can find, but yy
lives
only in the data frames, so R can’t find it without help.
> coef( lm(xx~1) )
(Intercept)
1
> coef( lm(yy~1) )
Error in eval(predvars, data, env): object 'yy' not found
The data=foo
argument
tells lm()
to look in foo
, so no problem. Using with()
also works.
> coef( lm(yy~1,data=foo))
(Intercept)
20
> with(foo, coef(lm(yy~1)))
(Intercept)
20
There is a yy
in goo
,
but the data=foo
argument
tells lm()
to look in foo
first, so that’s the one it gets,
even ahead of what is pushed via with()
.
> with(goo,coef(lm(yy~1,data=foo)))
(Intercept)
20
This attach()
command puts goo
in the search path, but it puts it in after normal variables (.GlobalEnv
).
The warning is saying that the xx
in goo
is hidden by
the plain xx
variable.
> attach(goo)
The following object is masked _by_ .GlobalEnv:
xx
There is no ordinary variable yy
, so not finding
it there R goes ahead and finds it in goo
. On the other hand, there
is a plain xx
, so R finds that one before the one in goo
.
> yy
[1] 200
> xx
[1] 1
with()
pushes variables in foo
ahead of ordinary variables and anything in goo
.
> with(foo,yy)
[1] 20
The data=foo
form pushes
foo
ahead of anything attached, or anything added to the search
path via with
.
> coef(lm(yy~1,data=foo))
(Intercept)
20
> with(goo,coef(lm(yy~1,data=foo)))
(Intercept)
20
You can remove something from the search path using detach()
.
> detach(goo)