University of Minnesota, Twin Cities School of Statistics Stat 3011 Rweb Textbook (Wild and Seber)

- General Instructions
- Simple Linear Regression
- Tests and Confidence Intervals for Regression Coefficients
- Simple Quadratic Regression (not covered)
- Confidence Interval for the Regression Function
- Regression Prediction
- Diagnostic Plots
- Correlation

For an example we will use the data for Example 12.3.1 in Wild and Seber, which is in the file cmputer1.txt. This file contains three variables. The two of interest in this problem are `no`, the number of computer terminals, and `pertask`, the time per task (p. 104 and p. 511 in Wild and Seber). The following R commands

- make a scatter plot of the two variables,
- fit the least-squares regression model,
- put the regression line on the plot, and
- print a summary of the regression.
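
A sketch of these commands (the exact statements on the original page may have differed; `read.table` with `header = TRUE` is an assumption about the format of cmputer1.txt):

```r
X <- read.table("cmputer1.txt", header = TRUE)  # read the data file
attach(X)                # make the variables visible by name
plot(no, pertask)        # scatter plot of the two variables
out <- lm(pertask ~ no)  # fit the least-squares regression model
abline(out)              # put the regression line on the plot
summary(out)             # print a summary of the regression
```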

Since the output is rather complicated, we copy it below with the regression coefficients highlighted. The least-squares regression line is

*y* = 3.04963 + 0.26034 *x*

```
Rweb:> summary(out)
Call:
lm(formula = pertask ~ no)
Residuals:
Min 1Q Median 3Q Max
-3.5631 -0.7923 -0.1547 1.3844 2.4403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.04963 1.26045 2.419 0.0323 *
no 0.26034 0.02705 9.625 5.41e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.672 on 12 degrees of freedom
Multiple R-Squared: 0.8853, Adjusted R-squared: 0.8758
F-statistic: 92.63 on 1 and 12 degrees of freedom, p-value: 5.406e-07
```

A test of whether a regression coefficient is zero can be read off the regression printout; no additional work is required.

We repeat here the same printout as in the preceding section, but now the numbers of interest are the test statistic and *P*-value for a test of the hypotheses that the slope of the population regression line is zero (the null hypothesis) versus nonzero (the alternative).

```
Rweb:> summary(out)
Call:
lm(formula = pertask ~ no)
Residuals:
Min 1Q Median 3Q Max
-3.5631 -0.7923 -0.1547 1.3844 2.4403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.04963 1.26045 2.419 0.0323 *
no 0.26034 0.02705 9.625 5.41e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.672 on 12 degrees of freedom
Multiple R-Squared: 0.8853, Adjusted R-squared: 0.8758
F-statistic: 92.63 on 1 and 12 degrees of freedom, p-value: 5.406e-07
```

The *P*-value read off the printout *P* = 5.41e-07 (meaning 5.4 times
10 to the minus 7) is exceedingly small. So there is a
"highly statistically significant" linear relationship between the two
variables.

R has no procedure especially for producing confidence intervals for regression coefficients. However, all the information necessary is in the regression printout. Again we repeat the same regression printout, this time with the information necessary for doing a confidence interval for the slope highlighted.

```
Rweb:> summary(out)
Call:
lm(formula = pertask ~ no)
Residuals:
Min 1Q Median 3Q Max
-3.5631 -0.7923 -0.1547 1.3844 2.4403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.04963 1.26045 2.419 0.0323 *
no 0.26034 0.02705 9.625 5.41e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.672 on 12 degrees of freedom
Multiple R-Squared: 0.8853, Adjusted R-squared: 0.8758
F-statistic: 92.63 on 1 and 12 degrees of freedom, p-value: 5.406e-07
```

In the `Estimate` column is the estimate of the slope. In the `Std. Error` column is the standard error of this estimate. We multiply the standard error by a *t* critical value derived from a *t* distribution with 12 degrees of freedom (*n* - 2, but we can also read that directly off the regression printout). Thus the statements below give the margin of error and the interval estimate for the true slope of the population regression line associated with a 95% confidence level.
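
A sketch of the computation, with the estimate and standard error copied from the printout:

```r
est <- 0.26034               # slope estimate, from the printout
se  <- 0.02705               # standard error of the estimate
tcrit <- qt(0.975, df = 12)  # t critical value for 95% confidence
me <- tcrit * se             # margin of error
me
est + c(-1, 1) * me          # 95% confidence interval for the slope
```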

To get a different confidence level, you would use a different argument in the `qt` function (for example, `qt(0.95, 12)` for 90% confidence).

R also has no procedure especially for performing general hypothesis tests
for regression coefficients. The *P*-value for the test of whether
a regression coefficient is *zero* is given in the regression printout.
If we want to test any *nonzero hypothesized value* for a regression
coefficient, we have to do it "by hand" (assisted perhaps by the computer).

Again, all the information necessary is in the regression printout. In fact we use exactly the same information that was highlighted in the preceding printout. We just use it in a different way.

For an example we will use the data for Example 12.4.3 in Wild and Seber, which is in the file gauge.txt. This file contains three variables. The two of interest in this problem are `can` and `gauge`.

The R statements below do the plot shown on p. 530 in Wild and Seber and give the accompanying regression printout, which does *not* match the regression printout on that page.
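
A sketch of these statements (assuming gauge.txt has a header line naming the variables):

```r
X <- read.table("gauge.txt", header = TRUE)  # read the data file
attach(X)               # make the variables visible by name
plot(can, gauge)        # scatter plot of the two variables
out <- lm(gauge ~ can)  # fit the least-squares regression model
abline(out)             # put the regression line on the plot
summary(out)            # print a summary of the regression
```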

The regression printout on p. 530 in Wild and Seber corresponds to the following regression, which deletes case 35 on the grounds that it is an apparent outlier.
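
A sketch of that regression, matching the `Call` shown in the printout below (the `-35` subscript deletes the 35th element of each variable):

```r
out <- lm(gauge[-35] ~ can[-35])  # refit with case 35 deleted
summary(out)
```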

Again, we present the R regression output with the relevant numbers highlighted.

```
Rweb:> summary(out)
Call:
lm(formula = gauge[-35] ~ can[-35])
Residuals:
Min 1Q Median 3Q Max
-0.074413 -0.020389 -0.008787 0.021637 0.132076
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.034438 0.007612 4.524 3.37e-05 ***
can[-35] 0.439803 0.003649 120.519 < 2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.04156 on 54 degrees of freedom
Multiple R-Squared: 0.9963, Adjusted R-squared: 0.9962
F-statistic: 1.452e+04 on 1 and 54 degrees of freedom, p-value: 0
```

As usual, the test statistic is the estimate minus the hypothesized value, divided by the standard error. The statements sketched below give the test statistic (which agrees with the book) and the one-tailed and two-tailed *P*-values (which don't exactly agree). The book is apparently wrong here, but since no details are given of exactly how they got their number (they say "the two-sided *P*-value is 0.98"), we can't resolve the discrepancy.
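
A sketch of the computation; the hypothesized slope `b0` below is a placeholder, not the value used in the book:

```r
est <- 0.439803              # slope estimate, from the printout
se  <- 0.003649              # standard error of the estimate
b0  <- 0.44                  # hypothesized slope (placeholder value)
tstat <- (est - b0) / se     # test statistic
tstat
pt(tstat, df = 54)           # one-tailed P-value (lower tail)
2 * pt(-abs(tstat), df = 54) # two-tailed P-value
```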

**Note:** This example was not covered in class and won't be on the test
or homework. I've just left it on this web page to have it written down
somewhere.

For an example we again use the data for Example 12.3.1 in Wild and Seber, which is in the file cmputer1.txt. This file contains three variables. The two of interest in this problem are `no` and `time`.

Now we want to fit a quadratic model and see if it fits better than the linear model. The following R commands

- make a scatter plot of the two variables,
- fit the least-squares quadratic regression model,
- put the quadratic regression curve on the plot, and
- print a summary of the regression.
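
A sketch of these commands (assuming the data have been read in so that `no` and `time` are available by name):

```r
plot(no, time)                                       # scatter plot
out <- lm(time ~ no + I(no^2))                       # fit the quadratic model
curve(predict(out, data.frame(no = x)), add = TRUE)  # put the fitted curve on the plot
summary(out)                                         # print a summary of the regression
```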

A couple of fine points.

- The function `I( )` is used to indicate that `no^2` means what it appears to, the square of the variable `no`. Otherwise R treats `^` as a magic character meaning something quite different, as explained in the on-line help for model formulas.
- The third line is explained by the on-line help for the `curve` and `predict` functions. Admittedly, it takes a fairly sophisticated knowledge of R to decipher these help pages. Students are not expected to understand how this line works! I assure you, however, that it does work, not just for this example but for any regression problem. Just replace `no` in `curve(predict(out, data.frame(no=x)), add=TRUE)` with the actual name of the predictor variable in the regression actually done.

The regression printout is repeated below with the coefficient of the
quadratic term highlighted and also the *P*-value for testing whether
this coefficient is nonzero.

```
Rweb:> summary(out)
Call:
lm(formula = time ~ no + I(no^2))
Residuals:
Min 1Q Median 3Q Max
-2.3254 -0.6556 -0.0973 1.0512 1.8757
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.215067 1.941166 0.111 0.91378
no 0.036714 0.100780 0.364 0.72254
I(no^2) 0.004526 0.001209 3.745 0.00324 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1.319 on 11 degrees of freedom
Multiple R-Squared: 0.9715, Adjusted R-squared: 0.9663
F-statistic: 187.7 on 2 and 11 degrees of freedom, p-value: 3.159e-09
```

The actual value of the coefficient 0.004526 is small, but highly statistically
significant (*P* = 0.003). Hence we conclude that the quadratic model
is better and that the linear model is inadequate.

The regression function
is also a parameter of interest. A standard problem asks for
an estimate of this parameter for some particular value of *x*.

The `predict` function in R also does this.

We return to the data used in the first section of this page (Example 12.3.1 in Wild and Seber, data file cmputer1.txt with variables `no` and `pertask`). The top of p. 533 in Wild and Seber has the estimate of the regression function for `no` = 70. The following R statements do this.
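
A sketch of such statements (assuming the data are attached as in the first section):

```r
out <- lm(pertask ~ no)  # fit the model
predict(out, data.frame(no = 70), interval = "confidence")  # estimate and 95% CI at no = 70
```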

As always, the on-line help for the function in question, here `predict`, gives optional arguments and other information about the function. In particular, the optional argument `level=0.90` changes the confidence level to 90%.
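
For example (a sketch):

```r
predict(out, data.frame(no = 70), interval = "confidence", level = 0.90)
```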

This section is very similar to the preceding section. The only difference is that the optional argument `interval` is changed to `interval="prediction"`. This gives a *prediction interval* that has a specified probability (95% by default) of covering a new data value associated with a specified predictor value.

The prediction interval at the top of p. 533 in Wild and Seber for a new `pertask` value associated with `no` = 70 is produced by the following R statements.
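
A sketch of such statements:

```r
out <- lm(pertask ~ no)  # fit the model
predict(out, data.frame(no = 70), interval = "prediction")  # 95% prediction interval at no = 70
```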

Yet again we reuse the data used in the first section of this page (Example 12.3.1 in Wild and Seber, data file cmputer1.txt with variables `no` and `pertask`).

The R statements below do a QQ plot of the residuals.
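
A sketch of such statements (the `qqline` reference line is an addition that may not have been on the original page):

```r
out <- lm(pertask ~ no)  # fit the model
qqnorm(residuals(out))   # normal QQ plot of the residuals
qqline(residuals(out))   # reference line through the quartiles
```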

For the same data, the R statements below do a plot of the *residuals* versus the *fitted values*.
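
A sketch:

```r
plot(fitted(out), residuals(out),
     xlab = "fitted values", ylab = "residuals")  # residuals vs. fitted values
```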

Again, for the same data, the R statements below do a plot of the *residuals* versus the *predictor variable* `no`.
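
A sketch:

```r
plot(no, residuals(out),
     xlab = "no", ylab = "residuals")  # residuals vs. the predictor
```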

The plots in this section and the preceding section are essentially the same for simple linear regression. The difference between them arises only in multiple regression (which we haven't done). If there are *k* predictor variables, then there are *k* residuals versus predictor plots (one for each of the *k* predictors), but only one residuals versus fitted values plot. Since the latter shows most of what you want, it is preferred. Certainly, if you are going to make only one plot of this type, it should be a residuals versus fitted values plot.

For an example we will use the data for Example 12.4.1 in Wild and Seber, which is in the file rating2.txt. This file contains two variables: `parent` and `teacher`.

For comparison with the correlation analysis, we first do a regression analysis, reproducing the analysis in Example 12.4.1.
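
A sketch of such an analysis (assuming rating2.txt has a header line naming the variables):

```r
X <- read.table("rating2.txt", header = TRUE)  # read the data file
attach(X)                    # make the variables visible by name
plot(parent, teacher)        # scatter plot of the two variables
out <- lm(teacher ~ parent)  # fit the least-squares regression model
abline(out)                  # put the regression line on the plot
summary(out)                 # print a summary of the regression
```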

Now we return to Section 12.5 in Wild and Seber, where we just want to examine the *correlation* of these two variables. The R function that calculates correlations is `cor`.
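
For example (a sketch):

```r
cor(parent, teacher)  # sample correlation of the two variables
```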

This does produce the correlation 0.2970181 found in Wild and Seber.

If we want to test whether the correlation is zero, we look at the test of whether the regression slope is zero. Both are exactly the same test, because the correlation is zero if and only if the regression slope is zero.

Thus the *P*-value for the test of whether the correlation is zero can be read out of the regression printout of the preceding Rweb submission:

```
Rweb:> summary(out)
Call:
lm(formula = teacher ~ parent)
Residuals:
Min 1Q Median 3Q Max
-27.936 -13.562 -1.398 11.688 49.945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3659 11.3561 0.120 0.9047
parent 0.4188 0.1799 2.328 0.0236 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 16.85 on 56 degrees of freedom
Multiple R-Squared: 0.08822, Adjusted R-squared: 0.07194
F-statistic: 5.418 on 1 and 56 degrees of freedom, p-value: 0.02357
```