An Introduction to S and S-PLUS

This is a brief introduction to the S computing language, originally given as one hour of the graduate student orientation at the University of Washington, back when I was a grad student (when there was no web and computing was done with stone tablets and chisels).

The S computing language was invented by a group at Bell Labs (that's why its name is only one letter, like C). It is still an active research project at Bell Labs. S is marketed by MathSoft under the name S-PLUS. Although S-PLUS contains extra features added by MathSoft, the core language comes from Bell Labs. The S-PLUS web pages contain some interesting links, including an S FAQ.

Getting In and Out

Get into S-PLUS from a UNIX prompt by issuing the command

  Splus
There will be some blather about versions and copyrights, then you will get an S-PLUS prompt
  >
Everything you now type until you quit S-PLUS is evaluated by the S language interpreter. To quit S, issue the S command
  q()
at an S-PLUS prompt (we won't show the prompts in this intro).

Assignment and Simple S Data Structures

Set a variable x to be 23 and print it to see what it is

  x <- 23  

  x
Same for real values
  x <- 23.1 ; x

  x <- 23e10 ; x
(The semicolon is a statement separator: "x <- 23.1" is one S statement and "x" is another, and putting a semicolon in between allows both to be typed on one line.)

Make some vectors and try out the types of subscripting S allows

  x <- 1:10 ; x

  y <- c(2,4,11,17) ; y
A vector of positive subscripts selects the corresponding elements
  x[1:5] 
A vector of negative subscripts removes the corresponding elements (selects everything but them)
  x[-1] 
A vector of logical subscripts selects the elements for which the subscript is TRUE (logical variables have values TRUE or FALSE).
  x[x > 5]        
Note that the subscript was indeed a logical vector
  x > 5

Make a matrix and try out subscripts

  x <- matrix(1:16, ncol=4) ; x
Matrix subscripts have two indices (of course). The first index selects rows
  x[c(1,3),] 
The second index selects columns
  x[,3:4] 

Data Analysis

Simulate a sample of size n of normal data
  n <- 1000         # sample size

  x <- rnorm(n)     # n standard normal random variates
(Everything from a hash sign # to the end of the line is an S comment.)

Start a graphics device driver. Under HP-UX VUE (Motif-flavored X windows) do

  motif()
Now draw a histogram of the sample
  hist(x)
Look at the help for the hist function by doing either
  help(hist) 
or starting up that fancy graphical help thingy
  help.start(gui="motif")
and asking for the help on histograms. The help shows the options. For example,
  hist(x, nclass=20) 
makes a histogram with smaller bins, and
  hist(x, nclass=20, main="Normal Random Sample", xlab="")
makes a histogram with a title and the x-axis label blank.

To find the mean and standard deviation

  mean(x) 

  sqrt(var(x)) 

Linear Regression

Make vectors of correlated normal random variables
  mu.x <- mu.y <- 0
  sigma.x <- sigma.y <- 1
  rho <- 0.5
  beta <- rho * sigma.y / sigma.x

  x <- rnorm(n, mu.x, sigma.x)

  y <- mu.y + beta * (x - mu.x) + rnorm(n, 0, sigma.y * sqrt(1 - rho^2))
Is this right? Check that the sample moments are about right (var(x, y) calculates covariances and cor(x, y) calculates correlations).
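For example, a quick check might look like this (a sketch; the exact values vary from run to run):
  mean(x) ; mean(y)     # should both be near 0
  var(x) ; var(y)       # should both be near 1
  var(x, y)             # covariance, should be near rho * sigma.x * sigma.y = 0.5
  cor(x, y)             # correlation, should be near rho = 0.5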

Make a scatterplot

  plot(x,y)
Do a linear regression. The result is a list of objects
  fitout <- lsfit(x,y)
The expression
  names(fitout)
gives the names of the components of the list. The component
  fitout$coef 
contains the regression coefficients. Are they about right?
  ls.print(fitout)
does a nice printout of some of the information in the regression output.

Add the regression line to the plot

  abline(fitout)

The S function

  qqnorm(fitout$residuals)
does a Q-Q (quantile-quantile) plot of the residuals. This is similar to a so-called "normal probability plot", but not exactly the same. The plot is approximately a straight line if the residuals are normal.

Find the fitted values (the predicted y values)

  yhat <- y - c(fitout$residuals)
and plot residuals versus fitted values
  plot(yhat, fitout$resid, xlab="Fitted Values", ylab="Residuals")
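Equivalently, the fitted values can be computed directly from the coefficients; a quick check (a sketch, using the intercept and slope stored in fitout$coef, with yhat2 just a scratch name for the comparison):
  yhat2 <- fitout$coef[1] + fitout$coef[2] * x
  max(abs(yhat - yhat2))    # should be essentially zero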

Make y a nonlinear function of x plus normal noise, for example,

  y <- x - 0.25 * x^2 + rnorm(n,0,0.5)
Redo the whole analysis above. Is anything noticeable in the plots?
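Condensed, the re-analysis is just the same functions again (a sketch; type the steps one at a time if you prefer):
  fitout <- lsfit(x, y)
  plot(x, y) ; abline(fitout)
  yhat <- y - c(fitout$residuals)
  plot(yhat, fitout$resid, xlab="Fitted Values", ylab="Residuals")
  qqnorm(fitout$residuals)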

Try putting a smooth through the scatterplot.

  plot(x,y)

  lines(lowess(x,y))

Writing Functions in S

Every "command" in S is a function. Even if it has no arguments, it is still a function, and its name is followed by an empty argument list in parentheses. That's why we say q() rather than just q to get out of S. We are invoking the function q with no arguments.

To use an S function, give the name with parentheses

  ls()
Every S function is also just an S dataset. It is a dataset that happens also to be a function. If we type the function name without the parentheses we get the definition of the function, for example,
  ls

Functions are defined using the function keyword

  square <- function(x) x^2
defines a very simple function that just squares its input.

The function that takes x (the argument) to x^2 (the result) is assigned to the name "square". The value returned by a function is the value of the last expression evaluated, here the only expression, x^2. This behavior is sometimes confusing. You may use the return function to explicitly indicate the returned value

  square <- function(x) return(x^2)
if you prefer.

Try it out. Note that "square", like many S functions, operates elementwise on a whole vector or matrix.

  square(2)

  square(1:10)

  square(matrix(1:16, ncol=4))
Note that "x" is a dummy argument in the definition of "square". It has nothing to do with the variable "x" assigned previously.

Quiz: Try to write a function "mad" that calculates the median absolute deviation (MAD) of a vector, the median of the absolute deviations from the median. (There is a built-in function "median" that finds medians and a built-in function "abs" that takes absolute values.)
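If you get stuck, one possible definition is the following (a sketch; try the quiz before peeking):
  mad <- function(x) median(abs(x - median(x)))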

Simulation

Make a sample from a contaminated normal distribution: a mixture that is 90% Normal with mean 0 and standard deviation 1 and 10% Normal with mean 0 and standard deviation 3.
  n <- 20

  x <- ifelse(runif(n) > 0.10, rnorm(n), rnorm(n,0,3))
See the help for ifelse and runif if this doesn't make sense. What's an easier way to do this?
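One possibility (a sketch; it scales a standard normal sample instead of mixing two complete samples):
  x <- rnorm(n) * ifelse(runif(n) > 0.10, 1, 3)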

Try out several estimates of location: the mean, the median, the 20 per cent trimmed mean (the mean of the middle 80 per cent of the sample), and the Hodges-Lehmann estimator (the median of the pairwise averages). S has functions to do all but the last. Define a function "hle" to do that.

  hle <- function(x) {
    tmp <- outer(x, x, "+") / 2           # matrix of all pairwise averages
    median(tmp[row(tmp) <= col(tmp)])     # keep each pair (and each single observation) once
  }
Try them
  mean(x)

  median(x)

  mean(x, trim=0.1)

  hle(x)
The actual center of the population is 0. Which estimator does best?

Make a matrix of 50 samples of size 20 (samples are rows)

  x <- matrix(ifelse(runif(1000) > 0.10, rnorm(1000), rnorm(1000, 0, 3)), ncol=20)
Apply all of the estimators to all of the samples
  estimates.mean <- apply(x, 1, "mean")

  estimates.median <- apply(x, 1, "median")

  estimates.trim <- apply(x, 1, "mean", trim=0.1)

  estimates.hle <- apply(x, 1, "hle")
Compare the mean square errors of the estimators (since the true center is 0, these sums of squares are proportional to the mean square errors)
  sum(estimates.mean^2)

  sum(estimates.median^2)

  sum(estimates.trim^2)

  sum(estimates.hle^2)

What happens if there is more contamination or the contamination has a larger variance? What if x has an entirely different distribution? Try the Cauchy distribution

  x <- matrix(rcauchy(1000), ncol=20)
or the slash distribution
  x <- matrix(rnorm(1000) / runif(1000), ncol=20)
Try plotting the sampling distributions of the estimators to see what they look like.
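For example (a sketch; par(mfrow=c(2,2)) puts four plots on one page):
  par(mfrow=c(2,2))
  hist(estimates.mean, main="Mean")
  hist(estimates.median, main="Median")
  hist(estimates.trim, main="Trimmed Mean")
  hist(estimates.hle, main="Hodges-Lehmann")
  par(mfrow=c(1,1))
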
Author: Charles Geyer (charlie@stat.umn.edu). Comments or corrections gratefully accepted.