For our example we will use the data from Example 5.1 in the textbook (Moore), which is in the URL, so we can use the URL external data entry method.
The R function plot was explained on another page. The R function lm (on-line help) calculates least squares regressions.
The argument of lm is called the formula in the jargon of R. In this example it is
New ~ Returning
An R formula always starts with the response variable, followed by a tilde (also called twiddle) character, and then the predictor variable (or, in multiple regression, several predictor variables connected by various operators, but that is beyond the scope of this course and is not in the textbook). The response is the variable the book calls y. The predictor is the variable the book calls x. But you can't call them y and x in R. You must call them by their actual names in the data set, here New and Returning.
We store the result of the regression in an R object out for later use. We make two uses of it here, and will make more in other examples. The line abline(out) adds the regression line to the scatter plot. The line summary(out) prints lots of stuff about the regression.
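The whole sequence can be sketched as follows. This is a minimal sketch: the data here are simulated stand-ins (the real example reads the textbook data from the URL), and only the variable names New and Returning are kept from the text.

```r
# Simulated placeholder data; the real example would read the
# textbook data from the URL with read.table() or similar.
set.seed(42)
Returning <- round(runif(13, 38, 81))
New <- round(32 - 0.3 * Returning + rnorm(13, sd = 3.7))

plot(Returning, New)        # scatter plot of the data
out <- lm(New ~ Returning)  # least squares regression, response first
abline(out)                 # add the fitted regression line to the plot
summary(out)                # coefficients, R-squared, and more
```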
In this chapter of the book, we are only concerned with the regression coefficients, what the book calls intercept and slope, or a and b. The relevant part of the printout is repeated below with the regression coefficients highlighted.
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  31.93426    4.83762   6.601 3.86e-05 ***
Returning    -0.30402    0.08122  -3.743  0.00325 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The intercept, labeled (Intercept), is 31.93426. The slope, labeled with the name of the predictor variable Returning (because in multiple regression, with more than one predictor variable, there is more than one regression coefficient analogous to the slope in simple linear regression), is −0.30402.
Thus the equation of the regression line is
New = 31.93426 − 0.30402 × Returning
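The coefficients can also be pulled out of the fitted object programmatically with the coef function, which returns a named vector. A sketch, again with simulated stand-in data:

```r
# Simulated placeholder data standing in for the textbook's.
set.seed(1)
Returning <- round(runif(13, 38, 81))
New <- round(32 - 0.3 * Returning + rnorm(13, sd = 3.7))
out <- lm(New ~ Returning)

b <- coef(out)    # named vector of regression coefficients
b["(Intercept)"]  # the intercept, what the book calls a
b["Returning"]    # the slope, what the book calls b
```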
The book also talks about r2, something which everyone just calls R squared. It is also in the printout. The relevant part is repeated below with r2 highlighted.
Residual standard error: 3.667 on 11 degrees of freedom
Multiple R-Squared: 0.5602, Adjusted R-squared: 0.5202
F-statistic: 14.01 on 1 and 11 DF, p-value: 0.003248
For simple linear regression (only one predictor variable), it is just the square of the correlation, cor(New, Returning)^2. R calls it Multiple R-Squared because that is what the generalization to multiple regression is called.
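This agreement is easy to check numerically. A sketch with simulated stand-in data:

```r
# For simple linear regression, Multiple R-Squared from summary()
# equals the squared correlation of response and predictor.
# Simulated placeholder data, not the textbook's.
set.seed(7)
Returning <- runif(13, 38, 81)
New <- 32 - 0.3 * Returning + rnorm(13, sd = 3.7)
out <- lm(New ~ Returning)

r2.from.summary <- summary(out)$r.squared
r2.from.cor <- cor(New, Returning)^2
all.equal(r2.from.summary, r2.from.cor)  # TRUE
```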
The book explains least squares residuals (p. 113) and discusses plotting them against the values of the predictor variable. We don't like this because it doesn't generalize to multiple regression. A plot that looks almost the same and does generalize plots residuals against predicted values.
For our example we will use the data from Example 5.5 in the textbook (Moore), which is in the URL, so we can use the URL external data entry method.
The first part of our R statements is just like the example in the Regression Line section. It makes a scatter plot with regression line and prints out the regression coefficients.
The second part, starting with the comment ##### residuals versus predicted values ##### (everything following a hash mark # in R is a comment), is the new stuff.
The plot makes a scatter plot, but the variables are predict(out), which are the predicted values from the regression, and residuals(out), which are the residuals from the regression. The abline(h = 0) statement adds the horizontal line with equation y = 0 to the plot. You can think of this as the appropriate least squares regression line for this plot, because the residuals always have mean zero and are uncorrelated with the fitted values (and the predictor).
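A sketch of this part of the example, with simulated stand-in data; the mean-zero and zero-correlation properties mentioned above hold for any least squares fit with an intercept.

```r
# Simulated placeholder data standing in for the textbook's.
set.seed(13)
Returning <- runif(13, 38, 81)
New <- 32 - 0.3 * Returning + rnorm(13, sd = 3.7)
out <- lm(New ~ Returning)

##### residuals versus predicted values #####
plot(predict(out), residuals(out))
abline(h = 0)  # the appropriate "regression line" for this plot

mean(residuals(out))               # zero, up to rounding error
cor(residuals(out), predict(out))  # zero, up to rounding error
```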
The third part of the example above, starting with the comment ##### Q-Q plot of residuals #####, makes a quantile-quantile plot (Q-Q plot for short) of the residuals. This is a plot of the residuals in sorted order (vertical coordinate) against the values those residuals should have if the distribution of the residuals were normal. The R function qqnorm (on-line help) makes such plots.
If the residuals are (approximately) normal, then the points on the Q-Q plot of residuals lie (nearly) on a straight line.
To help visualize this, the R function qqline (on-line help on the same page as qqnorm, linked above) adds a line to the Q-Q plot.
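A sketch of the Q-Q plot step, again with simulated stand-in data:

```r
# Simulated placeholder data standing in for the textbook's.
set.seed(21)
Returning <- runif(13, 38, 81)
New <- 32 - 0.3 * Returning + rnorm(13, sd = 3.7)
out <- lm(New ~ Returning)

##### Q-Q plot of residuals #####
qqnorm(residuals(out))  # sorted residuals against normal quantiles
qqline(residuals(out))  # reference line for judging straightness
```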
The R function plot.lm (on-line help) makes many regression diagnostics (four different plots). Some are like the two we explained. Others are beyond the scope of this course. (Take Stat 3022 or Stat 5302 to find out more about them.) If you ever do real regression on real scientific data, you should learn about them. In the example they are produced by the command plot(out).
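A sketch of producing the diagnostic plots, with simulated stand-in data. Interactively, plot(out) shows the plots one at a time; setting up a 2 by 2 grid with par first shows them all on one page.

```r
# Simulated placeholder data standing in for the textbook's.
set.seed(3)
Returning <- runif(13, 38, 81)
New <- 32 - 0.3 * Returning + rnorm(13, sd = 3.7)
out <- lm(New ~ Returning)

par(mfrow = c(2, 2))  # 2 x 2 grid so all plots fit on one page
plot(out)             # dispatches to plot.lm on an "lm" object
```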