Statistics 3011 (Geyer and Jones, Spring 2006) Examples: Linear Regression

The Regression Line

For our example we will use the data from Example 5.1 in the textbook (Moore), which is at the URL

http://www.stat.umn.edu/geyer/3011/mdata/chap05/eg05-01.dat

so we can use the URL external data entry method.


Comments

The R function plot was explained on another page.

The R function lm (on-line help) calculates least squares regressions.

The argument of lm is called the formula in the jargon of R. In this example it is

New ~ Returning

An R formula always starts with the response variable, then there is a tilde (also called twiddle) character, and then the predictor variable (or, in multiple regression, several variables connected by various operators, but that is beyond the scope of this course and not in the textbook).

The response is the variable the book calls y. The predictor is the variable the book calls x. But you can't call them y and x in R. You must call them by their actual names in the data set, here New and Returning.

We store the result of the regression in an R object out for later use. We make two uses of it, and will make more in other examples.
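Putting the pieces together, here is a minimal sketch of the R statements for this example. It reads the data directly with read.table rather than through the external data entry form, and it assumes the data file has a header line naming the variables New and Returning (the object name foo is arbitrary).

# read the data from the URL (assumes the first line names the variables)
foo <- read.table("http://www.stat.umn.edu/geyer/3011/mdata/chap05/eg05-01.dat",
    header = TRUE)
attach(foo)

plot(Returning, New)         # scatter plot of response against predictor
out <- lm(New ~ Returning)   # least squares regression
abline(out)                  # add the regression line to the scatter plot
summary(out)                 # print the regression coefficients and more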

The line abline(out) adds the regression line to the scatter plot.

The line summary(out) prints lots of stuff about the regression. In this chapter of the book, we are only concerned with the regression coefficients, what the book calls the intercept a and the slope b. The relevant part of the printout is repeated below.

Coefficients: 
            Estimate Std. Error t value Pr(>|t|)     
(Intercept) 31.93426    4.83762   6.601 3.86e-05 *** 
Returning   -0.30402    0.08122  -3.743  0.00325 **  
--- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  
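A sketch of how to extract these numbers from the fit in R, assuming out is the fitted regression object from above:

coef(out)                    # named vector containing the intercept and slope
summary(out)$coefficients    # the whole coefficient table shown above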

The intercept, labeled (Intercept), is 31.93426.

The slope, labeled with the name of the predictor variable Returning, is −0.30402. (It is labeled this way because in multiple regression, with more than one predictor variable, there is more than one regression coefficient analogous to the slope in simple linear regression, and each is labeled with the name of its predictor variable.)

Thus the equation of the regression line is

y = 31.93426 − 0.30402 × x
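For example, if Returning were 60 (an arbitrary illustrative value, not from the data), the regression would predict New to be 31.93426 − 0.30402 × 60 ≈ 13.69. A sketch of the same prediction in R, assuming out is the fitted regression from above:

predict(out, newdata = data.frame(Returning = 60))    # gives about 13.69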

The book also talks about r², which everyone just calls R squared. It is also in the printout. The relevant part is repeated below.

Residual standard error: 3.667 on 11 degrees of freedom 
Multiple R-Squared: 0.5602,	Adjusted R-squared: 0.5202  
F-statistic: 14.01 on 1 and 11 DF,  p-value: 0.003248

For simple linear regression (only one predictor variable), it is just the square of the correlation, in R cor(New, Returning)^2. R calls it Multiple R-Squared because that is what the generalization to multiple regression is called.
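A quick sketch of that check, assuming New and Returning are still attached as above:

cor(New, Returning)^2     # squared correlation, approximately 0.5602
summary(out)$r.squared    # the same number extracted from the fit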

Residuals Versus Fitted Values Plot

The book explains least squares residuals (p. 113) and discusses plotting them against the values of the predictor variable. We don't like this because it doesn't generalize to multiple regression.

A plot that looks almost the same and does generalize to multiple regression plots the residuals against the predicted values.

For our example we will use the data from Example 5.5 in the textbook (Moore), which is at the URL

http://www.stat.umn.edu/geyer/3011/mdata/chap05/ta05-01.dat

so we can use the URL external data entry method.


The first part of our R statements is just like the example in the Regression Line section. It makes a scatter plot with regression line and prints out the regression coefficients.

The second part starting with the comment

##### residuals versus predicted values #####

(everything following a hash mark # in R is a comment) is the new stuff.

The plot statement makes a scatter plot, but now the variables are predict(out), the predicted values from the regression, and residuals(out), the residuals from the regression.

The abline(h = 0) statement adds the horizontal line with equation y = 0 to the plot. You can think of this as the appropriate least squares regression line for this plot because the residuals always have mean zero and are uncorrelated with the fitted values (and the predictor).
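A sketch of this second part, assuming out is the fitted regression object from the first part:

plot(predict(out), residuals(out))    # residuals against predicted values
abline(h = 0)                         # horizontal line at height zero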

Quantile-Quantile Plot of Residuals

The third part of the example above starting with the comment

##### Q-Q plot of residuals #####

makes a quantile-quantile plot (Q-Q plot for short) of the residuals.

This is a plot of the residuals in sorted order (vertical coordinate) against the values those residuals should have if the distribution of the residuals were normal (horizontal coordinate).

The R function qqnorm (on-line help) makes such plots.

If the residuals are (approximately) normal, then the points on the Q-Q plot of residuals lie (nearly) on a straight line.

To help visualize this, the R function qqline (on-line help on the same page as qqnorm, linked above) adds a line to the Q-Q plot.
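A sketch of this third part, again assuming out is the fitted regression object:

qqnorm(residuals(out))    # Q-Q plot of residuals against normal quantiles
qqline(residuals(out))    # add a line through the first and third quartiles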

Serious Diagnostic Plots

The R function plot.lm (on-line help) makes several regression diagnostic plots (four different plots). Some are like the two we explained above. Others are beyond the scope of this course (take Stat 3022 or Stat 5302 to find out more about them). If you ever do real regression on real scientific data, you should learn about them.

In the example they are produced by the command plot(out).
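A sketch, assuming out is a fitted regression object; the par statements are our addition, not part of the example, and simply put all four plots on one page instead of showing them one at a time:

par(mfrow = c(2, 2))    # arrange the plots in a 2 by 2 grid on one page
plot(out)               # the four regression diagnostic plots
par(mfrow = c(1, 1))    # restore the default one-plot layout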