This is a brief introduction to the S computing language, originally given as a one-hour session of the graduate student orientation at the University of Washington, back when I was a grad student (when there was no web and computing was done with stone tablets and chisels).
The S computing language was invented by a group at Bell Labs (that's why its name is only one letter, like C). It is still an active research project at Bell Labs. S is marketed by MathSoft under the name S-PLUS. Although S-PLUS contains extra features added by MathSoft, the core language comes from Bell Labs. The S-PLUS web pages contain some interesting links, including an S FAQ.
Get into S-PLUS from a UNIX prompt by issuing the command
Splus
There will be some blather about versions and copyrights, then you will get an S-PLUS prompt
>
Everything you now type until you quit S-PLUS is evaluated by the S language interpreter. To quit S you issue the S command
q()
at an S-PLUS prompt (we won't show the prompts in this intro).
Set a variable x to be 23 and print it to see what it is
x <- 23
x
Same for real values
x <- 23.1 ; x
x <- 23e10 ; x
(The semicolon is a statement separator: "x <- 23.1" is one S statement and "x" is another, and putting a semicolon in between allows both to be typed on one line.)
Make some vectors and try out the types of subscripting S allows
x <- 1:10 ; x
y <- c(2,4,11,17) ; y
A vector of positive subscripts selects the corresponding elements
x[1:5]
A vector of negative subscripts removes the corresponding elements (selects everything but them)
x[-1]
A vector of logical subscripts selects the elements for which the subscript is TRUE (logical variables have values TRUE or FALSE).
x[x > 5]
Note that the subscript was indeed a logical vector
x > 5
Make a matrix and try out subscripts
x <- matrix(1:16, ncol=4) ; x
Matrix subscripts have two indices (of course). The first index selects rows
x[c(1,3),]
The second index selects columns
x[,3:4]
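Both indices can, of course, be used together to select a submatrix, for example
x[1:2, c(2,4)]
selects rows 1 and 2 of columns 2 and 4.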
Generate a random sample from the standard normal distribution
n <- 1000    # sample size
x <- rnorm(n)    # n standard normal random variates
(Everything from a hash sign # to the end of the line is an S comment.)
Start a graphics device driver. Under HP-UX VUE (Motif-flavored X windows) do
motif()
Now draw a histogram of the sample
hist(x)
Look at the help for the hist function by doing either
help(hist)
or starting up that fancy graphical help thingy
help.start(gui="motif")
and asking for the help on histograms. The help shows the options. For example,
hist(x, nclass=20)
makes a histogram with smaller bins, and
hist(x, nclass=20, main="Normal Random Sample", xlab="")
makes a histogram with a title and the x-axis label blank.
To find the mean and standard deviation
mean(x)
sqrt(var(x))
Now make a sample from a correlated bivariate normal distribution
mu.x <- mu.y <- 0
sigma.x <- sigma.y <- 1
rho <- 0.5
beta <- rho * sigma.y / sigma.x
x <- rnorm(n, mu.x, sigma.x)
y <- mu.y + beta * (x - mu.x) + rnorm(n, 0, sigma.y * sqrt(1 - rho^2))
Is this right? Check that the sample moments are about right (var(x, y) calculates covariances and cor(x, y) calculates correlations).
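For example, one quick check (just a sketch; any similar summaries will do):
mean(x) ; mean(y)              # should be near mu.x and mu.y, i.e. near 0
sqrt(var(x)) ; sqrt(var(y))    # should be near sigma.x and sigma.y, i.e. near 1
cor(x, y)                      # should be near rho, i.e. near 0.5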
Make a scatterplot
plot(x,y)
Do a linear regression. The result is a list of objects
fitout <- lsfit(x,y)
names(fitout) gives the names of the components of the list. The component
fitout$coef
is the regression coefficients. Are they about right?
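As a rough check (a sketch using the parameters set above): the true intercept is mu.y - beta * mu.x = 0 and the true slope is beta = 0.5, so compare
fitout$coef                    # estimated intercept and slope
c(mu.y - beta * mu.x, beta)    # true values, 0 and 0.5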
ls.print(fitout)
does a nice printout of some of the information in the regression output.
Add the regression line to the plot
abline(fitout)
The S function
qqnorm(fitout$residuals)
does a Q-Q (quantile-quantile) plot of the residuals. This is similar to so-called "normal probability plots", but not exactly the same. The plot is approximately a straight line if the residuals are normal.
Find the fitted values (the predicted y values)
yhat <- y - c(fitout$residuals)
and plot residuals versus fitted values
plot(yhat, fitout$resid, xlab="Fitted Values", ylab="Residuals")
Make y a nonlinear function of x plus normal noise, for example,
y <- x - 0.25 * x^2 + rnorm(n,0,0.5)
Redo the whole analysis above. Is anything noticeable in the plots?
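That is, repeat the regression steps on the new y (a sketch of the same commands used above):
plot(x, y)
fitout <- lsfit(x, y)
abline(fitout)
ls.print(fitout)
qqnorm(fitout$residuals)
yhat <- y - c(fitout$residuals)
plot(yhat, fitout$resid, xlab="Fitted Values", ylab="Residuals")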
Try putting a smooth through the scatterplot.
plot(x,y)
lines(lowess(x,y))
To use an S function, give the name with parentheses
ls()
Every S function is also just an S dataset. It is a dataset that happens also to be a function. If we type the function name without the parentheses we get the definition of the function, for example,
ls
Functions are defined using the function keyword
square <- function(x) x^2
defines a very simple function that just squares its input.
The function that takes x (the argument) to x^2 (the result) is assigned to the name "square". The value returned by a function is the value of the last expression evaluated, here the only expression, x^2. This behavior is sometimes confusing. You may use the return keyword to explicitly indicate the returned value
square <- function(x) return(x^2)
if you prefer.
Try it out. Note that "square", like many S functions, operates elementwise on a whole vector or matrix.
square(2)
square(1:10)
square(matrix(1:16, ncol=4))
Note that "x" is a dummy argument in the definition of "square". It has nothing to do with the variable "x" assigned previously.
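For instance (a small demonstration; the global x here is whatever it was last assigned above):
x[1:5]      # the first few elements of the global x, before the call
square(3)   # returns 9; the "x" inside square is just the argument 3
x[1:5]      # unchanged afterwards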
Quiz: Try to write a function "mad" that calculates the median absolute deviation (MAD) of a vector, the median of the absolute deviations from the median. (There is a built-in function "median" that finds medians and a built-in function "abs" that takes absolute values.)
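(One possible solution, for checking your own attempt against; it leaves out the scaling constant that some definitions of the MAD include.)
mad <- function(x) median(abs(x - median(x)))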
Make a sample from a contaminated normal distribution: each observation is standard normal with probability 0.90 and normal with mean 0 and standard deviation 3 with probability 0.10.
n <- 20
x <- ifelse(runif(n) > 0.10, rnorm(n), rnorm(n,0,3))
See the help for ifelse and runif if this doesn't make sense. What's an easier way to do this?
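(One possibility, though not necessarily the only answer: fold the contamination into the standard deviation argument of rnorm, which is recycled to the sample size.)
x <- rnorm(n, 0, ifelse(runif(n) > 0.10, 1, 3))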
Try out several estimates of location: the mean, the median, the 20 per cent trimmed mean (the mean of the middle 80 per cent of the sample), and the Hodges-Lehmann estimator (the median of the pairwise averages). S has functions to do all but the last. Define a function "hle" to do that.
hle <- function(x) {
    tmp <- outer(x, x, "+") / 2
    median(tmp[row(tmp) <= col(tmp)])
}
Try them
mean(x)
median(x)
mean(x, trim=0.1)
hle(x)
The actual center of the population is 0. Which estimator does best?
Make a matrix of 50 samples of size 20 (samples are rows)
x <- matrix(ifelse(runif(1000) > 0.10, rnorm(1000), rnorm(1000,0,3)), ncol=20)
Apply all of the estimators to all of the samples
estimates.mean <- apply(x, 1, "mean")
estimates.median <- apply(x, 1, "median")
estimates.trim <- apply(x, 1, "mean", trim=0.1)
estimates.hle <- apply(x, 1, "hle")
Compare the mean square error of the estimates
sum(estimates.mean^2)
sum(estimates.median^2)
sum(estimates.trim^2)
sum(estimates.hle^2)
(Since the true value being estimated is 0, these sums of squares are proportional to the mean square errors.)
What happens if there is more contamination or the contamination has a larger variance? What if x has an entirely different distribution? Try the Cauchy distribution
x <- matrix(rcauchy(1000), ncol=20)
or the slash distribution
x <- matrix(rnorm(1000) / runif(1000), ncol=20)
Try plotting the sampling distributions of the estimators to see what they look like.
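For example, after recomputing the estimates.* vectors for the new x, histograms give a quick look (a sketch):
hist(estimates.mean, main="Sampling Distribution of the Mean")
hist(estimates.median, main="Sampling Distribution of the Median")
hist(estimates.trim, main="Sampling Distribution of the Trimmed Mean")
hist(estimates.hle, main="Sampling Distribution of the Hodges-Lehmann Estimator")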