Statistics 3011 (Geyer and Jones, Spring 2006) Examples: R Intro

What is R?

First there was S, a general-purpose, interpreted, computer language especially designed for statistics. Like Unix and C, it came from Bell Labs. That's why it has only one letter for its name. S is now a proprietary product marketed by Insightful Corporation under the name S-PLUS.

Later came R, which is free software (having the same license as Linux) and implements pretty much the same language as S-PLUS. By now R has surpassed its predecessor in many ways. We haven't used S-PLUS in so long we don't know what the incompatibilities are.

R is free software, which means it is free as in "free beer" (you can download it with no charge) and free as in "free speech" (you can do whatever you want with it except make it non-free). R is available from the Comprehensive R Archive Network (CRAN).

R is the language of choice for research statistics. If it's statistics, you can do it in R.

If you have the time and want to know more about R, the Introduction to R that comes with the R software is the first thing to read, but it is way more than you need to know for this course.

Why R?

Why use R, the most powerful statistical computing environment in existence and the one research statisticians use, for an introductory statistics course? Couldn't we use a graphing calculator, a spreadsheet, some dumber statistics program, or any of many other options?

One answer is that we don't want to be limited to what dumb tools can do, but the main reason is explained in the following section.

What is Rweb?

Free software is amazing. Creative programmers can use it to do anything they can think of. There's no vendor controlling use of the software to protect their profits.

Prof. Jeff Banfield at Montana State University put R on the web. You can run simple R commands from any computer connected to the internet. A similar program could be easily done for S-PLUS but would be illegal because the vendor couldn't profit from it.

The local Rweb server is at http://rweb.stat.umn.edu/Rweb. This link is also at the top of every course web page.

There are two "interfaces" to Rweb. The simple one, found by clicking on the Rweb link on the main Rweb page, is the only one we will explain. It has the virtue of being embeddable in web pages to make examples.

Here is a simple example, one of the examples on the Rweb data page.

(Rweb form: External Data Entry, enter a dataset URL.)

To see how the example works, just click the "Submit" button.

When you have seen the example, click the "Back" button on your web browser to return to this page.

Don't bother with what the example does. Just notice that it does some calculations on some data and draws a picture.

The Relation between R and Rweb

Rweb is just R. You type R statements into a web form. You submit them. They get executed on the server. The results get stuffed into a web page sent back to your computer. So Rweb is just R run over the web.

So mostly we will use R and Rweb interchangeably.

One important difference between Rweb and R is that the server remembers nothing between Rweb submissions. The entire calculation you want done must be submitted to Rweb in one web form. R run on your own computer does remember. You can build up a complicated analysis a little bit at a time.

Thus Rweb is fairly useless for complicated problems, but is fine for coursework.
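Because nothing is remembered between submissions, a single Rweb submission has to carry the data, the calculation, and the output together. A minimal sketch of such a self-contained submission follows; the variable names and the data values are made up for illustration.

```r
# Everything needed goes into one submission: data, calculation, output.
heights <- c(61, 65, 70, 72)   # hypothetical data, typed in directly
avg <- mean(heights)           # the calculation
print(avg)                     # the output we want to see
```

Run on your own computer, the same three lines could be entered one at a time, since R there remembers heights and avg between commands.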

Variables and Assignment

Like most computer languages, R has variables, which are referred to by variable names. Variable names may contain letters, digits, the dot (.), and the underscore (_), and cannot begin with a digit or an underscore. Names are case sensitive, thus fred, Fred, and FRED refer to different variables.

The assignment operator in R is an arrow "<-" constructed from two characters. An assignment statement looks like

fred <- 4

or

sally <- 2 + 2

or

a.very.long.variable.name <- sqrt(16)

Each assigns the value of the expression on the right side of the assignment operator to the variable name on the left side. In each case the variable gets the value 4.
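The three assignments above can be checked in one run. The sketch below repeats them and prints each variable; the extra variable Fred (capital F) is added only to illustrate case sensitivity and is not part of the original examples.

```r
fred <- 4
sally <- 2 + 2
a.very.long.variable.name <- sqrt(16)

print(fred)                        # 4
print(sally)                       # 4
print(a.very.long.variable.name)   # 4

# Names are case sensitive: Fred is a different variable from fred.
Fred <- 5
print(fred == Fred)                # FALSE
```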

Output (Print and Plot)

In order to see any results from R, you have to execute a command that makes output, the most common being print and plot.

When a calculation is done or an assignment made, you don't see anything unless you ask explicitly. For example,

sally <- 2 + 2
print(sally)

prints the value (4) assigned to the variable sally.

If the print statement were omitted, there wouldn't be any point because you wouldn't see anything and Rweb wouldn't remember the results for future use.

Actually this example can be shortened to

sally <- 2 + 2
sally

because an expression that is not an assignment usually prints its value so

sally

does the same thing as

print(sally)

If in doubt, put in the print.
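One case where the "usually" matters is inside a loop: there an expression that is not an assignment is evaluated but not printed. A small sketch, reusing the variable sally (the loop itself is an added illustration):

```r
sally <- 2 + 2

# Inside a loop, a bare expression is evaluated but nothing is printed.
for (i in 1:3) {
    sally * i
}

# With an explicit print, each value appears: 4, then 8, then 12.
for (i in 1:3) {
    print(sally * i)
}
```

This is why putting in the print is the safe choice.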

Rweb External Data Entry

From a URL

Variables can also be read into Rweb from an external file, either a file on your own computer or one on the web. An example file is

http://www.stat.umn.edu/geyer/5601/examp/blurfle.txt

The file has the following properties.

This has the result that all of the variables must be vectors of the same length. This can usually be arranged somehow.

When a job is submitted to Rweb, the first thing it does is read the "External Data Entry" file (if there is one) and create the variables in it. The example blurfle.txt creates three variables, color, x, and y and prints them out.
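In ordinary R (not Rweb) a similar effect can be had with read.table. The sketch below assumes a whitespace-separated file whose first line holds the variable names; that layout and the data values are assumptions made for illustration, not the actual contents of blurfle.txt.

```r
# Hypothetical file contents; the real blurfle.txt may differ.
file.text <- "color x y
red 1.0 2.5
blue 2.0 3.5
green 3.0 1.5"

# header = TRUE takes the variable names from the first line.
dat <- read.table(text = file.text, header = TRUE, stringsAsFactors = FALSE)

print(dat$color)
print(dat$x)
print(dat$y)

# Every column has the same length, one value per data line.
print(nrow(dat))   # 3
```

Unlike Rweb, read.table puts the variables inside a data frame (here called dat), so they are referred to as dat$color, dat$x, and dat$y.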

(Rweb form: External Data Entry, enter a dataset URL.)

From a File on Your Computer

Here is another Rweb form with a different kind of data entry that wants a file on your computer.

(Rweb form: External Data Entry, select a local file to submit.)

Try to use it to upload a data file that you create and print one of the variables in it. For a start you can just download blurfle.txt and upload it back.

Then try to create a new data file of the proper form (for which see the list of requirements above) and use it.

R Functions and Arguments

R functions often have many arguments. For example, see the on-line help for the barplot function.

Fortunately, you don't need to know all the arguments to use the function. Most arguments have default values. The second argument, width = 1, says this argument, which is named width, has the value 1 if you don't specify anything. Most defaults are sensible and you can just not specify such arguments unless you have a good reason to.

We do want to use one other argument.

names.arg: a vector of names to be plotted below each bar
(says the help). We need to use that to tell us which bar is which.

If arguments are named, then they can be given in any order. Unnamed arguments must be given in the order in the function definition (which is given in the help linked above). In

barplot(Weight, names.arg = Material)

Weight is the first argument to the barplot function, which actually has the name height, and Material is actually the fourth argument to the function, which has the name names.arg. There are many other arguments, but they all take their default values.

Argument names can be abbreviated to any initial part of the name that unambiguously specifies the argument (no other argument has the same initial part). In the bar plot example we abbreviate names.arg to names.
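The argument-matching rules can be tried without drawing a plot. In the sketch below, describe.bars is a hypothetical function whose first four arguments merely mimic the names and order described above for barplot, and the Weight and Material values are made up.

```r
# Hypothetical function; its arguments mimic height, width, space, names.arg.
describe.bars <- function(height, width = 1, space = NULL, names.arg = NULL) {
    list(height = height, width = width, names.arg = names.arg)
}

Weight <- c(10, 20, 30)                     # made-up data
Material <- c("wood", "steel", "glass")     # made-up data

a <- describe.bars(Weight, names.arg = Material)            # as in the text
b <- describe.bars(names.arg = Material, height = Weight)   # named, any order
d <- describe.bars(Weight, 1, NULL, Material)               # all by position
e <- describe.bars(Weight, names = Material)                # abbreviated name

print(identical(a, b))   # TRUE
print(identical(a, d))   # TRUE
print(identical(a, e))   # TRUE
print(a$width)           # 1, the default
```

All four calls give the same result, and width takes its default value whenever it is not specified.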

Vectors

Not all R variable values are single numbers (in fact most aren't). Most R variables are vectors, which is R's name for a list of objects of the same type (often numbers but character variables and other types are possible).

There are many ways to create vectors in R (besides Rweb external data entry described above). Many functions and operators return vector values if given vector values as arguments. Here we will only look at a few ways to create a vector and a few functions and operators that work vectorwise.

The c Function

The R function c (on-line help) "combines" or "collects" all its arguments into one vector, for example
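A minimal sketch of c in use, with made-up variable names and values:

```r
bob <- c(3, 1, 4, 1, 5)          # a numeric vector of length 5
print(bob)

shades <- c("red", "green")      # a character vector
print(shades)

more <- c(bob, 9, 2)             # vectors and single values can be mixed
print(more)                      # 3 1 4 1 5 9 2
print(length(more))              # 7
```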

The seq Function

The R function seq (on-line help) creates a sequence, for example
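A minimal sketch of seq in use (the particular endpoints and step sizes are made up). The arguments by and length.out are two common ways to say how the sequence is spaced.

```r
one.to.ten <- seq(1, 10)                  # 1, 2, ..., 10
print(one.to.ten)

quarters <- seq(0, 1, by = 0.25)          # 0, 0.25, 0.5, 0.75, 1
print(quarters)

evens <- seq(2, 10, length.out = 5)       # 2, 4, 6, 8, 10
print(evens)

# The colon operator is a shorthand for sequences with step one.
print(1:10)
```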

Vectorwise Functions and Operators

It is an important and generally useful fact about R that most functions and operators work vectorwise (operating on each element of the vector).

Note that multiplication needs an explicit operator * as in most computer languages. The ^ operator is exponentiation: bob^2 is "bob squared".
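A minimal sketch of vectorwise arithmetic, assuming a made-up vector bob:

```r
bob <- c(1, 2, 3, 4)

print(bob + 10)     # 11 12 13 14
print(2 * bob)      # 2 4 6 8
print(bob^2)        # 1 4 9 16
print(sqrt(bob^2))  # 1 2 3 4
print(bob * bob)    # 1 4 9 16, element by element
```

Each operation is applied to every element, and the result is a vector of the same length as bob.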

That's all for now (admittedly too brief). If you need to know more, see Simple manipulations; numbers and vectors in the Introduction to R document, but don't look at it your first time through this.

Indexing Vectors

Indexing operations allow you to modify or pick out or remove specified elements of a vector. They are very useful for removing outliers or obtaining or modifying parts of a data vector.

Integer Indexing

The simplest form of indexing uses positive integers in the range from one to the length of the vector. For example

do what is obvious (after you get used to vector indexing). Not quite so obvious is that subscripts work the same way on the other side of the assignment operator.
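A minimal sketch of positive integer indexing, on both sides of the assignment operator, assuming a made-up vector bob:

```r
bob <- c(10, 20, 30, 40, 50)

print(bob[2])          # 20
print(bob[c(1, 3)])    # 10 30
print(bob[2:4])        # 20 30 40

# Subscripts on the left side of an assignment modify the chosen elements.
bob[2] <- 99
bob[c(4, 5)] <- c(0, 0)
print(bob)             # 10 99 30 0 0
```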

Negative Integer Indexing

Negative index values indicate "everything but"

do the same thing (why? figure it out!).
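A minimal sketch of negative indexing, assuming a made-up vector bob. The first two expressions are one illustrative pair that give the same result.

```r
bob <- c(10, 20, 30, 40, 50)

print(bob[-1])               # everything but the first: 20 30 40 50
print(bob[2:length(bob)])    # elements 2 through 5:     20 30 40 50

print(bob[-c(1, 2)])         # everything but the first two: 30 40 50
```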

Logical Vector Indexing

Perhaps the most useful form of indexing uses logical vectors. First the example, then the explanation.

bob[bob != 42]

is the (vector of) elements of bob not equal to 42.

(The operator != is "not equal". Similarly <= is "less than or equal" and >= is "greater than or equal".)

The result of

bob != 42

is a logical vector (all elements having values TRUE or FALSE). Indexing with such a vector picks out the elements for which the index is TRUE.

When the logical vector is the result of a comparison (as here), it picks out the elements for which the comparison was TRUE.
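The two steps can be seen separately in a small sketch, assuming a made-up vector bob:

```r
bob <- c(3, 42, 7, 42, 11)

not.42 <- bob != 42
print(not.42)            # TRUE FALSE TRUE FALSE TRUE

print(bob[not.42])       # 3 7 11
print(bob[bob != 42])    # the same thing in one step

print(bob[bob >= 7])     # 42 7 42 11
```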

That's all for this web page. If you need to know more, see Index vectors; selecting and modifying subsets of a data set in the Introduction to R document, but don't look at it your first time through this.

Missing Data and Computer Arithmetic

The data system has two sorts of accommodation for values the computer can't handle or at least isn't supposed to deal with.

NA: Not Available

Any data value, numeric or not, can be NA. This is what you use for missing data. Always use NA for this purpose. Never use 999 or some other code that is actually a number. Sad experience of many scientists shows this sort of code is always forgotten at some point and the data analysis thereby ruined.
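A minimal sketch of how NA behaves, using a made-up vector with one missing value:

```r
x <- c(5, NA, 7)

print(is.na(x))                 # FALSE TRUE FALSE

# A calculation involving a missing value is itself missing.
print(mean(x))                  # NA

# Many functions can be told to drop missing values first.
print(mean(x, na.rm = TRUE))    # 6
```

Had the missing value been coded as 999, mean(x) would have quietly returned a wrong number instead of NA.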

NaN: Not a Number

This is a special value that only numeric variables can take. It is the result of an undefined operation like 0 / 0. It is produced by the low level arithmetic of all modern computers. R is just going along with the standard here.
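A minimal sketch showing NaN and how it relates to NA:

```r
z <- 0 / 0
print(z)             # NaN

print(is.nan(z))     # TRUE
print(is.na(z))      # TRUE: is.na treats NaN as missing too
print(is.nan(NA))    # FALSE: NA is not NaN
```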

Inf: Infinity

Numeric variables can also take the values -Inf and Inf. These are produced by the low level arithmetic of all modern computers by operations such as -1 / 0 and 1 / 0. R is just going along with the standard here.

You shouldn't think of these as real infinities, like in calculus. Rather, the correct result, if the computer could calculate it, would probably (but not certainly) be very large in magnitude, larger than the largest number the computer can hold (about 10^300), and of the same sign as the infinity.
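A minimal sketch of where infinities come from. It uses .Machine$double.xmax, which R provides as the largest finite number the computer can hold; the variable name big is made up.

```r
print(1 / 0)             # Inf
print(-1 / 0)            # -Inf

# Going past the largest representable number also gives Inf.
big <- .Machine$double.xmax
print(big)
print(big * 10)          # Inf
print(is.finite(big * 10))   # FALSE

# Some operations on infinities are undefined and give NaN.
print(Inf - Inf)         # NaN
```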