University of Minnesota, Twin Cities     School of Statistics     Stat 5601     Rweb

Stat 5601 (Geyer) Examples (R Intro)

What are R, S, and S-PLUS?

First there was S, a general-purpose, interpreted, computer language especially designed for statistics. S by itself is no longer commercially available, although it still exists as a research project at Bell Labs. S together with additional functions and features is marketed by Insightful Corporation under the name S-PLUS. R is free software available from the Comprehensive R Archive Network (CRAN). It is free as in "free beer" (you can download it with no charge) and free as in "free speech" (you can do whatever you want with it except make it non-free). R and S-PLUS are more or less compatible. Roughly 90% of things you want to do work in both. Most other things work with minor variations.

The code in this introduction works in R. It may well work in S-PLUS, but we have not bothered to test that. From now on we will just say "R" rather than "R or S or S-PLUS".

Anything you can make a computer do, you can do in R. It has all the features of a general-purpose computer language like C or Pascal or Java and so can do any computation that can be programmed. Unlike those languages, R is interpreted like Lisp or Logo or Basic or Python. This means you can execute the language one statement at a time, which makes it much easier to find errors.

We must admit, though, that like any computer language R is finicky. Any mistake, even a simple typo, and your command won't work. If you have never programmed at all and have no idea how to do anything on a computer except with mice and menus, you may find it difficult. But the examples in these pages should help a lot.

R has a huge number of functions and features designed especially for statistics, all kinds of statistics. Most of what it does, we won't touch in this course (nonparametrics being a small part of statistics). If you have the time and want to know more about R, the Introduction to R that comes with the R software is the first thing to read, but it is way more than you need to know for this course.

Variables and Assignment

Like all other computer languages, R has variables, which are referred to by variable names. Variable names may contain any letter, digit, or the dot (.) and cannot begin with a digit. Names are case sensitive, thus fred, Fred, and FRED refer to different variables.

The assignment operator in R is an arrow "<-" constructed from two characters. An assignment statement looks like

fred <- 4
or
sally <- 2 + 2
or
a.very.long.variable.name <- sqrt(16)
Each assigns the value of the expression on the right side of the assignment operator to the variable name on the left side. In each case the variable gets the value 4.

Nothing is printed for an assignment statement. If a statement is not an assignment but just an expression, the value of the expression is printed. Thus in order to look at the value of a variable, you just use a statement that consists only of the variable name.

If you understand what each line in the following example does, you understand most of what you need to know about R variables.

sally <- 2 + 2
sally
2 + 2
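Since names are case sensitive and may contain dots, a small sketch may help. The variable names and values below are made up for illustration only.

```r
# Three distinct variables whose names differ only in case.
fred <- 1
Fred <- 2
FRED <- 3
fred + Fred + FRED     # 6: all three values are used

# A dot is allowed inside a name.
my.total <- fred + Fred + FRED
my.total               # 6

# A name cannot begin with a digit, so a line like
#   2fred <- 4
# is a syntax error (left commented out here).
```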

Vectors

Not all R variable values are single numbers (in fact most aren't). Most R variables are vectors, which is R's name for a list of objects of the same type (often numbers but character variables and other types are possible).

There are many ways to create vectors in R. Many functions and operators return vector values if given vector values as arguments. Here we will only look at three ways to create a vector and a few functions and operators that work vectorwise.

The c Function

The R function c (on-line help) "combines" or "collects" all its arguments into one vector, for example

herbie <- c(-3, 27.1, 6.5, 0.02)
herbie

The seq Function

The R function seq (on-line help) creates a sequence, for example

helen <- seq(1, 40)
helen
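The seq function takes an optional third argument giving the step size, which is used in the indexing examples further down this page. A minimal sketch with arbitrary end points:

```r
seq(1, 10)           # 1 2 3 4 5 6 7 8 9 10
1:10                 # the colon operator is shorthand for the same sequence
seq(1, 10, 2)        # third argument is the step: 1 3 5 7 9
seq(2, 10, 2)        # 2 4 6 8 10
seq(0, 1, 0.25)      # steps need not be integers: 0.00 0.25 0.50 0.75 1.00
length(seq(1, 40))   # 40
```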

The scan Function

The R function scan (on-line help) reads a file or other "connection", for example, if you had a file "blurfle.txt" in the folder where R looks for data (this can be selected using the menus in the Windows version and is the current working directory under UNIX)
bob <- scan("blurfle.txt")
bob
would read the contents of the file and assign it to the variable bob.

In order to try this out, you can do one of two things. First, you can create the file "blurfle.txt" somewhere using a text editor (note: not a word processor like Microsoft Word, but something simple like the Notepad editor that comes with Windoze). The file should be a simple ASCII (plain text) file with white-space-separated numbers. Alternatively, you can just download such a file, for example http://www.stat.umn.edu/geyer/5601/examp/blurfle.txt.
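Yet another way to experiment, if you would rather not leave R, is to write a small file from R itself and then scan it. This is only a sketch: the numbers are made up, the variable name trial is ours, and tempfile() is used so that nothing is left behind in your data folder.

```r
# Create a throwaway plain-text file of white-space-separated numbers.
path <- tempfile(fileext = ".txt")
writeLines(c("3.1 4.7 2.2", "5.0 1.8"), path)

# scan reads all the numbers, across lines, into one vector.
trial <- scan(path)
trial
length(trial)    # 5
```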

The scan Function with url Argument

A cool feature new in Version 1.3.0 of R is the ability to supply a URL in place of a file name in scan and other functions that normally read files.

To read the file "blurfle.txt" from the web without saving it to disk on your computer, do

bob <- scan(url("http://www.stat.umn.edu/geyer/5601/examp/blurfle.txt"))
bob

That's all for now about data input. More later (the section on data frames).

Vectorwise Functions and Operators

It is an important and generally useful fact about R that most functions and operators work vectorwise (operating on each element of the vector).

For example, with bob as defined above

barbie <- 2 * bob
ken <- bob^2
barbie + ken
Note that multiplication needs an explicit operator * as in most computer languages. The ^ operator is exponentiation: bob^2 is "bob squared".
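If you don't have bob handy, the same vectorwise behavior can be seen with a small made-up vector (the name x and its values are just for illustration).

```r
x <- c(1, 2, 3, 4)   # made-up values
2 * x                # 2 4 6 8    (each element doubled)
x^2                  # 1 4 9 16   (each element squared)
2 * x + x^2          # 3 8 15 24  (added element by element)
sqrt(x^2)            # 1 2 3 4    (functions work vectorwise too)
```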

That's all for now (admittedly too brief; see Simple manipulations; numbers and vectors in the Introduction to R document if you need to know more, but don't look at it your first time through this).

Indexing Vectors

Indexing operations allow you to modify or pick out or remove specified elements of a vector.

Integer Indexing

The simplest form of indexing uses positive integers in the range from one to the length of the vector. For example, with bob as defined above

bob[1]
alice <- bob[c(3, 5, 7)]
alice
do what is obvious (after you get used to vector indexing). Not quite so obvious is that subscripts work the same way on the other side of the assignment operator.
bob
bob[1] <- 17
bob[c(3, 5, 7)] <- 42
bob
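Here is the same pattern on a short made-up vector, so you can check the results by eye. The name v and its values are ours, not part of the course data.

```r
v <- c(10, 20, 30, 40, 50, 60, 70)   # made-up values
v[1]                  # 10
v[c(3, 5, 7)]         # 30 50 70

# Indexing on the left side of <- replaces just the selected elements.
v[1] <- 17
v[c(3, 5, 7)] <- 42   # the single value 42 goes into all three positions
v                     # 17 20 42 40 42 60 42
```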

Negative Integer Indexing

Negative index values indicate "everything but"

bob[seq(1, 10, 2)]
bob[- seq(2, 10, 2)]
do the same thing (why? figure it out!).
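A smaller made-up example of "everything but" may help with the figuring out. The vector v is hypothetical.

```r
v <- c(10, 20, 30, 40, 50)   # made-up values
v[-1]            # everything but the first element: 20 30 40 50
v[-c(2, 4)]      # everything but elements 2 and 4: 10 30 50
v[c(1, 3, 5)]    # the same elements picked out positively: 10 30 50
```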

Logical Vector Indexing

Perhaps the most useful form of indexing uses logical vectors. First the example, then the explanation.

bob[bob != 42]
is the (vector of) elements of bob not equal to 42.

(The operator != is "not equal". Similarly <= is "less than or equal" and >= is "greater than or equal".)

The result of

bob != 42
is a logical vector (all elements having values TRUE or FALSE). Indexing with such a vector picks out the elements for which the index is TRUE.

When the logical vector is the result of a comparison (as here), it picks out the elements for which the comparison was TRUE.
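To see the two steps separately, here is a sketch on a short made-up vector. The names v and keep are ours.

```r
v <- c(17, 20, 42, 40, 42)   # made-up values
keep <- v != 42              # step 1: the comparison gives a logical vector
keep                         # TRUE TRUE FALSE TRUE FALSE
v[keep]                      # step 2: elements where keep is TRUE: 17 20 40
v[v >= 40]                   # both steps at once, with >= : 42 40 42
```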

That's all for this web page. If you need to know more, see Index vectors; selecting and modifying subsets of a data set in the Introduction to R document, but don't look at it your first time through this.

Data Frames

One last general feature of R before we get to statistics. A data frame (data.frame in R) is a bunch of vectors of the same length, one per column. For example, with barbie and ken as defined above

blurfle <- data.frame(barbie, ken)
blurfle
creates a data frame.

The main benefit of data frames rather than separate vectors is that they can be read in all at once. If you create a file in which different vectors are in parallel columns with variable names as column headers, for example, http://www.stat.umn.edu/geyer/5601/hwdata/t3-2.txt, then you can read the file as a data frame as follows.

foo <- read.table("t3-2.txt", header=TRUE)
foo
assuming you have downloaded the file in the folder where R looks for data.

Or

foo <- read.table(url("http://www.stat.umn.edu/geyer/5601/hwdata/t3-2.txt"),
    header=TRUE)
foo
does the same thing without downloading the file.

The data downloaded here is Table 3.2 in Hollander and Wolfe (typed in, hopefully correctly, by your humble instructor). The other tables in Chapter 3 are in the same place with similar file names.
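To see the file layout read.table expects without downloading anything, you can write a tiny file from R and read it back. This is only a sketch: the column names before and after and all the numbers are made up, and tempfile() is used just to get a throwaway file name.

```r
# Parallel columns, with the variable names as the first (header) line.
path <- tempfile(fileext = ".txt")
writeLines(c("before after",
             "12.1 11.4",
             "9.8 10.2",
             "14.0 13.1"), path)

# header=TRUE tells read.table that the first line holds the names.
foo <- read.table(path, header = TRUE)
foo
names(foo)    # "before" "after"
nrow(foo)     # 3
```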

What is Rweb?

One of the amazing properties of free software is that creative programmers can use it to do whatever they can think of. There's no vendor controlling all use of the software that doesn't squeeze money out of you.

Prof. Jeff Banfield at Montana State University stuck R on the web. You can run simple R commands anywhere you have a computer on the internet. A similar program could be easily done for S-PLUS but would be illegal since it wouldn't make money for the vendor.

The local Rweb server is at http://rweb.stat.umn.edu/Rweb/. This link is also at the top of every course web page.

There are two "interfaces" to Rweb. The simple one, found by clicking on the Rweb link on the main Rweb page, is the only one we will explain. It has the virtue of being embeddable in web pages to make examples.

Extracting Components of Data Frames

Once you have a data frame, how do you get at the variables inside? This section and the next explain three ways.

First, you can use integer indexing.

blurfle <- data.frame(herman=c(1, 2, 3), heloise=c(4, 5, 6))
blurfle[[1]]
blurfle[[2]]

Second, you can use the variable names

blurfle <- data.frame(herman=c(1, 2, 3), heloise=c(4, 5, 6))
blurfle$herman
blurfle$heloise
Note that the indexing in the first method uses double square brackets, not single like vector indexing. That's the R idiom inherited from S. If it's confusing, sorry about that. We won't use that method of extracting components much. We prefer the second method.
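If the difference between double and single brackets is confusing, this sketch shows what each returns for the same data frame.

```r
blurfle <- data.frame(herman = c(1, 2, 3), heloise = c(4, 5, 6))

blurfle[[1]]                 # double brackets: the vector itself, 1 2 3
blurfle[1]                   # single brackets: a one-column data frame
is.numeric(blurfle[[1]])     # TRUE
is.data.frame(blurfle[1])    # TRUE

blurfle[["heloise"]]         # double brackets also accept a name: 4 5 6
identical(blurfle[[1]], blurfle$herman)   # TRUE: both methods agree
```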

The attach Function

A third method allows you to just use the variables in a data frame as if they were ordinary variables.

blurfle <- data.frame(herman=c(1, 2, 3), heloise=c(4, 5, 6))
attach(blurfle)
herman
heloise
After you "attach" a data frame with the attach function, the variables inside the data frame are just like ordinary variables in the R global environment.
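One thing worth knowing, sketched below, is that attach makes the variables available as they were at the time of attaching. Later changes to the data frame do not show up in the attached variables, and detach undoes the attach. The names sums and stale are ours, used only to hold results.

```r
blurfle <- data.frame(herman = c(1, 2, 3), heloise = c(4, 5, 6))
attach(blurfle)
sums <- herman + heloise           # works on the attached variables
sums                               # 5 7 9

# Changing the data frame afterwards does not change what was attached.
blurfle$herman <- c(10, 20, 30)
stale <- herman
stale                              # still 1 2 3

detach(blurfle)                    # remove the attached copy when done
```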

How Does Rweb Work?

The guts of Rweb is a web form. It has a main text area for R code, a "Submit" button, and an "External Data Entry" section with two boxes: "Enter a dataset URL" and, as an alternative, "Select a local file to submit".

If you type any valid R code in the main text area of the form and click the "Submit" button, the code will run on the Rweb server (rweb.stat.umn.edu) and the results will be displayed on a new web page. After you look at the results, you can return here using the "Back" button on your web browser.

This allows you to use R even though it isn't installed on the computer you are sitting at. Try it out.

Rweb Data Entry

Rweb has a fairly limited form of data entry. For security reasons, you are not allowed to run commands like scan and read.table. What it does instead is take a specified file, either a URL or a "local" file (one on the computer you are sitting at), and do the following to it

X <- read.table(filename, header=TRUE)
attach(X)
names(X)
It reads the file using read.table (it can use that command but you can't), puts it in a data frame called X (capital X), attaches the data frame so the variables in it look just like ordinary variables, and shows you the variable names.

Try it out. Type the URL of a data set, for example,

http://www.stat.umn.edu/geyer/5601/hwdata/t3-2.txt
in the "Enter a dataset URL" box and click "Submit".

The results show you that the variables in the data set are named private and government. But nothing else happens because you didn't specify any R commands.

If you add some commands in the main text area, like

private
government
private - government
mean(private - government)
median(private - government)
you can do a bit of statistical analysis of these data (admittedly primitive at this point; we'll do more later).
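If you have R installed on your own computer, you can mimic what Rweb does by running its three data entry commands yourself and then the commands above. This is only a sketch: the file here is a stand-in written to a temporary location, and its numbers are made up for illustration; they are not the values in Table 3.2. The name d is ours.

```r
# A stand-in data file with the same two column names (made-up numbers).
path <- tempfile(fileext = ".txt")
writeLines(c("private government",
             "10 8",
             "12 13",
             "9 7",
             "15 11"), path)

# What Rweb does with the file you specify.
X <- read.table(path, header = TRUE)
attach(X)
names(X)                        # "private" "government"

# Commands you might then type in the main text area.
d <- private - government
d                               # 2 -1 2 4
mean(private - government)      # 1.75
median(private - government)    # 2

detach(X)                       # clean up (our addition)
```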

You can also use the "Select a local file to submit:" box to specify a local file that you have downloaded to disk or created with a text editor. Try that too.