First there was S, a general-purpose, interpreted, computer language especially designed for statistics. S by itself is no longer commercially available, although it still exists as a research project at Bell Labs. S together with additional functions and features is marketed by Insightful Corporation under the name S-PLUS.
R is free S. It is free as in "free beer" (you can download it with no charge) and free as in "free speech" (you can do whatever you want with it except make it non-free). More precisely, R is a dialect of the S language. R and S-PLUS are more or less compatible. Roughly 90% of things you want to do work in both. Most other things work with minor variations. R is available from the Comprehensive R Archive Network (CRAN).
R is the language of choice for research statistics. If it's statistics, you can do it in R.
If you have the time and want to know more about R, the Introduction to R that comes with the R software is the first thing to read, but it is way more than you need to know for this course.
Free software is amazing. Creative programmers can use it to do anything they can think of. There's no vendor controlling use of the software to protect their profits.
Prof. Jeff Banfield at Montana State University put R on the web. You can run simple R commands from any computer connected to the internet. A similar program could be easily done for S-PLUS but would be illegal because the vendor couldn't profit from it.
The local Rweb server is at http://rweb.stat.umn.edu/Rweb. This link is also at the top of every course web page.
There are two "interfaces" to Rweb. The simple one, found by clicking on the Rweb link on the main Rweb page, is the only one we will explain. It has the virtue of being embeddable in web pages to make examples.
Here is a simple example (not having much to do with nonparametrics, just one of the examples on the Rweb page)
To see how the example works, just click the "Submit" button.
When you have seen the example, click the "Back" button on your web browser to return to this page.
For now, don't bother with what the example does. Just notice that it does some calculations on some data and draws a picture.
Rweb is just R. You type R statements into a web form. You submit them. They get executed on the server. The results get stuffed into a web page sent back to your computer. So Rweb is just R run over the web.
So mostly we will use R and Rweb interchangeably.
One important difference between Rweb and R is that the server remembers nothing between Rweb submissions. The entire calculation you want done must be submitted to Rweb in one web form. R run on your own computer does remember. You can build up a complicated analysis a little bit at a time.
Thus Rweb is fairly useless for really complicated problems, but is fine for coursework.
Like all other computer languages, R has variables, which are referred to by variable names. Variable names may contain any letter, digit, or the dot (.) and cannot begin with a digit. Names are case sensitive, thus fred, Fred, and FRED refer to different variables.
The assignment operator in R is an arrow "<-" constructed from two characters. An assignment statement looks like
fred <- 4
or
sally <- 2 + 2
or
a.very.long.variable.name <- sqrt(16)
Each assigns the value of the expression on the right side of the assignment operator to the variable name on the left side. In each case the variable gets the value 4.
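The assignments above, together with the earlier remark about case sensitivity, can be tried out as in the following minimal sketch. The variable Fred and its value 5 are made up here purely for illustration.

```r
# The three assignments from the text; each variable gets the value 4.
fred <- 4
sally <- 2 + 2
a.very.long.variable.name <- sqrt(16)
print(c(fred, sally, a.very.long.variable.name))

# Names are case sensitive: Fred (hypothetical) is a different variable from fred.
Fred <- 5
print(fred == Fred)
```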
In order to see any results from R, you have to execute a command that makes output, the most common being print and plot.
When a calculation is done or an assignment made, you don't see anything unless you ask explicitly.
For example, after the assignment sally <- 2 + 2, the statement print(sally) prints the value (4) assigned to the variable sally.
If the print statement were omitted, there wouldn't be any point because you wouldn't see anything and Rweb wouldn't remember the results for future use.
Actually this example can be shortened, because an expression that is not an assignment usually prints its value, so
sally
does the same thing as
print(sally)
If in doubt, put in the print.
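As a small sketch of the printing behavior just described, the following lines can be pasted into R or Rweb as they stand.

```r
sally <- 2 + 2   # an assignment: nothing is printed
print(sally)     # explicit print of the value of sally
sally            # a bare expression at top level is printed as well
```

Both of the last two lines produce the same output.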
Not all R variable values are single numbers (in fact most aren't). Most R variables are vectors, which is R's name for a list of objects of the same type (often numbers but character variables and other types are possible).
There are many ways to create vectors in R. Many functions and operators return vector values if given vector values as arguments. Here we will only look at a few ways to create a vector and a few functions and operators that work vectorwise.
c Function
The R function c (on-line help) "combines" or "collects" all its arguments into one vector.
seq Function
The R function seq (on-line help) creates a sequence, for example
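Here is a minimal sketch of both functions, c and seq. The variable names and values are made up for illustration and are not the ones used on the original Rweb forms.

```r
# c combines its arguments into one vector.
bob <- c(3, 1, 4, 1, 5)
sue <- c("red", "green", "blue")   # character vectors work too
both <- c(bob, 9, 2)               # vectors and single values can be mixed
print(both)

# seq creates a sequence.
s1 <- seq(1, 10)                   # from 1 to 10 in steps of 1
s2 <- seq(0, 1, by = 0.25)         # from 0 to 1 in steps of 0.25
s3 <- seq(0, 1, length.out = 5)    # 5 equally spaced values from 0 to 1
print(s1)
print(s2)
print(s3)
```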
Variables can also be read into Rweb from an external file, either a file on your own computer or one on the web. We'll only illustrate the latter. An example file is blurfle.txt.
The file has the following properties.
word processors (like Microsoft Word) is not allowed.
This has the result that all of the variables must be vectors of the same length. This can usually be arranged somehow.
When a job is submitted to Rweb, the first thing it does is read the "External Data Entry" file (if there is one) and create the variables in it. The example blurfle.txt creates three variables, color, x, and y and prints them out.
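Outside Rweb, something similar can be done by hand. The sketch below assumes a plain-text file with one header line of variable names followed by whitespace-separated columns. The contents shown are made up and are not the actual contents of blurfle.txt.

```r
# Hypothetical file contents (an assumption, not the real blurfle.txt).
txt <- "color x y
red 1 2.5
blue 2 3.7
green 3 1.2"

# read.table turns each column into a vector; all columns have the same length.
dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
color <- dat$color
x <- dat$x
y <- dat$y
print(color)
print(x)
print(y)
```

With a real file, the text argument would be replaced by a file name or URL.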
It is an important and generally useful fact about R that most functions and operators work vectorwise (operating on each element of the vector).
Note that multiplication needs an explicit operator * as in most computer languages. The ^ operator is exponentiation: bob^2 is "bob squared".
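To make the vectorwise behavior concrete, here is a short sketch with a made-up vector bob. Each operation is applied to every element.

```r
bob <- c(1, 2, 3, 4)
print(bob + 1)     # adds 1 to each element
print(2 * bob)     # doubles each element
print(bob^2)       # squares each element
print(sqrt(bob))   # functions work vectorwise too
print(bob * bob)   # elementwise product, same as bob^2
```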
That's all for now (admittedly too brief; see Simple manipulations; numbers and vectors in the Introduction to R document if you need to know more, but don't look at it your first time through this).
Indexing operations allow you to modify or pick out or remove specified elements of a vector.
The simplest form of indexing uses positive integers in the range from one to the length of the vector. Such index expressions do what is obvious (after you get used to vector indexing). Not quite so obvious is that subscripts work the same way on the other side of the assignment operator.
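A minimal sketch of positive-integer indexing, on both sides of the assignment operator, using a made-up vector:

```r
bob <- c(10, 20, 30, 40, 50)
print(bob[2])          # the second element
print(bob[c(1, 3)])    # the first and third elements
print(bob[2:4])        # elements two through four

# Indexing on the left side of an assignment modifies the chosen elements.
bob[2] <- 99
print(bob)
```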
Negative index values indicate "everything but" the elements named, so a negative index and a positive index listing the remaining positions do the same thing (why? figure it out!).
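For instance, with a made-up five-element vector, the following two index expressions pick out the same elements.

```r
bob <- c(10, 20, 30, 40, 50)
print(bob[-c(1, 2)])   # everything but the first two elements
print(bob[3:5])        # elements three through five
print(identical(bob[-c(1, 2)], bob[3:5]))
```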
Perhaps the most useful form of indexing uses logical vectors. First the example, then the explanation.
bob[bob != 42]
is the (vector of) elements of bob not equal to 42. (The operator != is "not equal". Similarly <= is "less than or equal" and >= is "greater than or equal".)
The result of
bob != 42
is a logical vector (all elements having values TRUE or FALSE). Indexing with such a vector picks out the elements for which the index is TRUE. When the logical vector is the result of a comparison (as here), it picks out the elements for which the comparison was TRUE.
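A short sketch of logical indexing, with made-up values for bob:

```r
bob <- c(42, 7, 42, 13)
keep <- bob != 42      # logical vector, one TRUE or FALSE per element
print(keep)
print(bob[keep])       # elements for which the comparison was TRUE
print(bob[bob != 42])  # the same thing in one step
```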
That's all for this web page. If you need to know more, see Index vectors; selecting and modifying subsets of a data set in the Introduction to R document, but don't look at it your first time through this.
We've already mentioned a few R functions. There are lots and lots of others. By built-in functions, we mean those that you don't have to do anything special to use. Strictly speaking, R doesn't have any built-in functions. Any function is like any other function. None is more special than any other. But seven packages of functions called base, datasets, utils, grDevices, graphics, stats, and methods are automatically available with no special effort.
These functions are listed in the documentation for base and so forth.
To use an R function, you just type the function name followed by the list of arguments in parentheses. We've already seen examples, like
plot(x, y)
Most R functions also have named arguments. The syntax for that is name = value. Named arguments such as main, xlab and ylab of plot can appear in any order so long as they are after the unnamed arguments.
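As a sketch of named arguments in use, the two plot calls below draw the same picture. The data and the label strings are made up for illustration.

```r
x <- 1:10
y <- x^2

# Unnamed arguments first, then named arguments in any order.
plot(x, y, main = "A made-up example", xlab = "x values", ylab = "y values")
plot(x, y, ylab = "y values", main = "A made-up example", xlab = "x values")
```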
This makes the functions much simpler to use. Many functions have dozens of arguments, and you only need to use a few (the others have default values or aren't used the way you are invoking the function).
If you actually know the order of all the arguments, then you don't need the name. For example, the three function calls
rnorm(10, 0.0, 1.0)
rnorm(10, mean = 0.0, sd = 1.0)
rnorm(10)
all do the same thing (generate 10 independent standard normal random numbers) because the second argument is mean and the third is sd and the defaults for these arguments are 0.0 and 1.0, respectively.
Your choice.
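One way to check that the three calls really are equivalent is to reset the random number generator before each one. The seed value 42 is an arbitrary choice made here.

```r
set.seed(42)
a <- rnorm(10, 0.0, 1.0)

set.seed(42)
b <- rnorm(10, mean = 0.0, sd = 1.0)

set.seed(42)
d <- rnorm(10)

# All three calls produce exactly the same numbers from the same seed.
print(identical(a, b) && identical(b, d))
```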
Some functions are not available until the library containing them is added. For example
library(lqs)
adds the lqs library, which does Resistant Regression and Covariance Estimation. We'll use it in the second half of the course.
Other than needing a library command first, these functions are just like any others.
The list of all available packages is here. It can also be found by going to the main Rweb page (follow the link on the navigation bar at the top of any 5601 web page), then clicking on the link HTML documentation in the second paragraph, and then on the link Packages on the main R documentation page.
The function function defines new functions. For example
trim <- function(x, lower = 0.0, upper = 1.0) {
    inies <- x >= lower & x <= upper
    return(x[inies])
}
trims off the values of the argument x that are below or above the arguments lower and upper, respectively. The lower = 0.0 and upper = 1.0 in the definition specify default values for these arguments that are used when the user does not supply values.
Let's check it out.
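Here is a minimal sketch of such a check. The data vector x is made up and is not the one from the original Rweb form.

```r
trim <- function(x, lower = 0.0, upper = 1.0) {
    inies <- x >= lower & x <= upper
    return(x[inies])
}

x <- c(-0.5, 0.2, 0.9, 1.5, 3)
print(trim(x))              # defaults: keeps values between 0 and 1
print(trim(x, upper = 2))   # keeps values between 0 and 2
print(trim(x, -1, 1))       # keeps values between -1 and 1
```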
As the assignment suggests, an R function is just an R variable like any other. In this example, trim is an R variable that happens to be a function and x is an R variable that happens to be a numeric vector.
This allows functions to be passed as arguments to other functions, a very useful technique that we will use often (that's the only reason we will want to define our own functions).
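As a sketch of passing a function as an argument, the standard function lapply applies a given function to each element of a list. The list samples and its contents are made up for illustration.

```r
trim <- function(x, lower = 0.0, upper = 1.0) {
    inies <- x >= lower & x <= upper
    return(x[inies])
}

# A hypothetical list of two numeric vectors.
samples <- list(a = c(-1, 0.5, 2), b = c(0.1, 0.2, 0.3))

# trim itself is the second argument; lapply calls it on each vector.
result <- lapply(samples, trim)
print(result)
```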
A return statement is not strictly necessary. Functions return the value of the last expression if there is no return. The curly brackets are not necessary if there is only one statement.
Thus
trim <- function(x, lower = 0.0, upper = 1.0) x[x >= lower & x <= upper]
works just as well as the other definition. But it is a lot harder to read, and we generally won't use this trick.
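The claim that the two definitions behave alike can be checked with a quick sketch. The names trim.long and trim.short and the test vector are hypothetical, chosen here only so both versions can exist side by side.

```r
trim.long <- function(x, lower = 0.0, upper = 1.0) {
    inies <- x >= lower & x <= upper
    return(x[inies])
}
trim.short <- function(x, lower = 0.0, upper = 1.0) x[x >= lower & x <= upper]

x <- c(-0.5, 0.2, 0.9, 1.5)
print(identical(trim.long(x), trim.short(x)))
print(identical(trim.long(x, -1, 2), trim.short(x, -1, 2)))
```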