## Fitting Linear Models

### Simple Linear Regression

In so-called simple linear regression we observe the response yi and one quantitative covariate xi for the i-th individual. The mean for the i-th individual is

E(yi) = β1 + xi β2
The i-th row of the model matrix is thus
( 1   xi )
The model matrix is n × 2, the first column is all ones, and the second column is the covariate values for all individuals.

Here is how this model is fit in R.

The regression coefficients are reported in the table

```Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.1908     1.7410   -0.11    0.913
x             1.0221     0.0740   13.81   <2e-16 ***
```
The estimate of β1 is −0.1908 and the estimate of β2 is 1.0221.

The R function `lm` (on-line help) fits the linear model. The expression `y ~ x` specifies the model described above.

More about model formulas is given in the R documentation for model formulas, but you don't need to look at that. We'll cover all the formulas you need to know about in this web page.

For now it's enough to know that the formula specifies the predictor `x` and the response `y`. If these variables were instead named `fred` and `sally`, respectively, the model formula would have to be `sally ~ fred` and the whole model fitting statement would be

```out <- lm(sally ~ fred)
```

Note that the model formula doesn't mention regression coefficients explicitly. There is one regression coefficient for each predictor in the formula (that is, there is a β2 for the predictor `x`) and there is also a regression coefficient for the constant predictor (that is, there is a β1 too) included by default.

The reason why the model fit is saved in a variable `out` is because we generally want to do many things with it. In this example, we use it twice. In other examples, we will use it more.

It's called linear regression because the means are linear functions of the regression coefficients not because the means are linear functions of the covariates — they may be arbitrary functions of the covariates.

In so-called quadratic regression we observe the response yi and one quantitative covariate xi for the i-th individual. The mean for the i-th individual is

E(yi) = β1 + xi β2 + xi2 β3
The i-th row of the model matrix is thus
( 1   xi   xi2 )
The model matrix is n × 3, the first column is all ones, the second column is the covariate values for all individuals, and the third column is the squares of the covariate values for all individuals.

#### Try 1

Here is how this model is fit in R to the data used in the preceding section.

Again the regression coefficients are in the `Estimate` column of the `Coefficients:` table.

For now it's enough to know that the formula specifies the predictors `x` and `x^2` and the response `y`. The plus sign (`+`) is magical. It separates different predictors.

The `I` function that wraps the second predictor function. It is necessary to write `I(x^2)` instead of just `x^2` because otherwise the hat (`^`) would not be interpreted correctly.

Generally, you have to wrap all complicated expressions for predictors. For example, you would write `I(sin(x))` to use `sin(x)` as a predictor. Only simple variable names do not have to be wrapped.

Note that the model formula doesn't mention regression coefficients explicitly. There is one regression coefficient for each predictor in the formula, that is, there is a β2 for the predictor `x` and a β3 for the predictor `I(x^2)`, and there is also a regression coefficient for the constant predictor (that is, there is a β1 too) included by default.

The `curve` function that draws the sample regression curve on the plot we will just treat as magic because it is too complicated to explain. It is, of course, documented in the on-line help, but you don't need to look at that. The `predict` function is explained in the section on confidence intervals for the regression function below.

For comparison, we add (dotted line) the regression line from the other plot (simple linear regression).

#### Try 2

Here is another way to fit the same model used in the preceding section to the same data.

The estimated regression function (the mean values considered as a function of the covariate) is the same as before, but the regression coefficients are different. That is because the R function `poly` (on-line help) uses a different basis for the quadratic function. The means have the form

E(yi) = β1 + g1(xi) β2 + g2(xi2) β3

where g1 is a linear function and g2 is a quadratic function. It doesn't matter what linear function and quadratic function are used. The class of functions is still all quadratic functions of `x`.

This is an example of regression coefficients are meaningless.

#### Proof that Mean Vectors are the Same

Here we check what we already know: the means are the same even though the coefficients are different.

The mean vectors μ = M β are computed by the R function `predict`, which is explained in the section on confidence intervals for the regression function below.

#### Try 3

Here is the same model fit to different data having a much stronger quadratic part.

### Multiple Regression

In so-called multiple regression we observe the response vector y and two (or more) quantitative covariates x1 and x2. The mean for the i-th individual is

E(yi) = β1 + x1 i β2 + x2 i β3

The i-th row of the model matrix is thus

( 1   x1 i   x2 i )

Here's how R fits this model.

Note that we can't call both predictor variables the same name. Here we call them `x1` and `x2`.

Now there is no simple plot that shows the regression function (means as a function of covariates), because with two covariates and one response, the graph of the function is three-dimensional.

For the same data as the preceding section, we can model means as a quadratic function of the covariates. The mean for the i-th individual is

E(yi) = β1 + x1 i β2 + x2 i β3 + x1 i2 β4 + x1 i x2 i β5 + x2 i2 β6
The i-th row of the model matrix is thus
( 1   x1 i   x2 i   x1 i2   x1 i x2 i   x2 i2 )

#### Try 1

Here's how R fits this model.

#### Try 2

Here's another way R can fit this model.

As in the section using the poly function above, the same model is being fit, because the family of regression functions is the same — all bivariate quadratic functions of covariates — and the vector subspace of all mean values is the same. But the regression coefficients are different because a different model matrix is used.

## Hypothesis Tests

#### Done by R

A hypothesis test about whether a regression coefficient is zero is automatically done for each regression coefficient by the R function `summary`. These tests are not corrected for multiple testing and hence must be used with extreme caution.

We'll use the example for simple linear regression and the example for quadratic regression which were done above as examples.

The form below redoes those examples, except for the plots, which we don't need.

The columns labeled `t value` and `Pr(>|t|)` of the `Coefficients:` table give, respectively,

• `t value`, the test statistic for a test of the hypothesis that the specified regression coefficient (the one for that line of the table) is actually zero.
• `Pr(<|t|)` the P-value for the two-tailed test about that regression coefficient.

For example, if we have done simple linear regression, and want to test hypotheses

H0 : β2 = 0
H1 : β2 ≠ 0

then the value of the test statistic reported by R is T = 13.81. Assuming the null hypothesis and assuming the standard regression assumptions (IID normal errors), this test statistic has a Student t distribution with n − 2 degrees of freedom. The corresponding P-value reported by R is P < 2 × 10− 16. (two-tailed). But you don't have to do a look-up in a table of the t distribution, R does it for you.

The interpretation of for this test is, of course, that P < 2 × 10− 16 means the null hypothesis is rejected and the true unknown population regression coefficient β2 is nonzero.

If you wanted a one-tailed rather than a two-tailed test, you would divide the given P-value by 2.

For another example, if we have done quadratic regression, and want to test hypotheses

H0 : β3 = 0
H1 : β3 ≠ 0

then the value of the test statistic is T = −1.320 and the P-value is 0.195.

The interpretation of this P-value is, of course, that the null hypothesis is accepted and the true unknown population regression coefficient is zero. As usual, that doesn't prove the population regression coefficient is actually zero. It only says that the data at hand don't give any real evidence it is not zero.

#### Done by Hand

The by hand here doesn't mean completely by hand. R does most of the work but leaves a bit for you.

Since the calculations are very much like the calculations for confidence intervals, we will put this out of sequence, in the section on confidence intervals in the subsection on general tests about regression coefficients.

#### Summary

The quadratic regression is unnecessary for these data. It appears that the linear model with the regression function

E(yi) = β1 + xi β2

fits the data well.

The linear regression does appear to be necessary for these data. It appears that the model with the constant regression function

E(yi) = β1

does not fit the data well. So the the linear term in the model is necessary.

Regression coefficients in simple linear regression are intimately related to correlation coefficients. The simple relation

β2 = ρ σY ⁄ σX

relates the two. Thus the slope of the population regression line is zero if and only if the correlation of X and Y is zero. And thus the test with null hypothesis β2 = 0 done above is also a test with null hypothesis ρ = 0.

### Omnibus Tests

Also of interest in multiple regression is a test of whether there are any regression coefficients that are significantly nonzero except for the coefficient β1 that goes with the constant predictor and is usually not of interest. That is we want to do a test of

H0 : β2 = β3 = ... = βk = 0
H1 : βi ≠ 0,      for some i ≠ 1

This test is often not of particular research interest. It serves as a straw man to knock down. The null hypothesis is generally thought to be false and is easily rejected with a reasonable amount of data.

But it is important to do the test anyway. When the null hypothesis cannot be rejected, this means the data are completely worthless. The model that has the constant regression function fits as well as the regression model.

Thus if this test fails to reject H0 the right thing to do is throw the data in the trash. They're worthless. No further analysis need be done.

The idea that their precious data, obtained at great cost in time, money, or effort, might actually be worthless upsets some people so much that they don't even want to think about doing this test. But they should. Drawing conclusions from worthless data makes you a fool, whether or not you are aware of your foolishness. If someone else reanalyzes the data, they'll find out.

R does this test in every regression printout. For quadratic regression example were done above part of the printout was

```Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.883215   2.670473  -1.080    0.287
x            1.406719   0.300391   4.683 3.74e-05 ***
I(x^2)      -0.009381   0.007105  -1.320    0.195
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.351 on 37 degrees of freedom
Multiple R-squared: 0.8414,	Adjusted R-squared: 0.8328
F-statistic: 98.11 on 2 and 37 DF,  p-value: 1.613e-15
```

The last line describes this omnibus test.

The interpretation is that the null hypothesis is rejected (P = 1.6 × 10− 15). So the regression is not completely worthless. Either β1 or β2 appears to be nonzero (or perhaps both).

### Model Comparison Tests

In this section we consider testing a big model versus a little model. The null hypothesis is some regression model, and the alternative hypothesis is some other regression model, and the little model is a submodel of the big model (the little model is obtained by setting some of the regression coefficients of the big model to zero).

As an example, let us consider the multiple regression example done above. We use the model from that fit as the little model. And we consider a quadratic model as the big model. A quadratic model has three more terms. The regression function for the big model is

E(Yi) = β1 + β2 x1 + β3 x2 + β4 x12 + β5 x1 x2 + β6 x22

We obtain the regression function for the little model by setting β4 = β5 = β6 = 0. So the little model is indeed a submodel of the big model (as the test requires).

To compare the fits of the two models we first do the fits, which we save in the R objects `out.little` and `out.big`. Then we compare the models using the `anova` function (on-line help). The printout of that function is a so-called ANOVA (analysis of variance) table, which dates back to the days of hand calculation and gives old timers a warm fuzzy feeling. For our purposes we only need to get the test statistic (F = 1.392) and P-value (P = 0.2578) out of this table.

The interpretation of the test is that the null hypothesis is accepted. The data give no evidence of statistically significant departure from the little model. That doesn't prove the little model is actually correct, only that the data give no evidence it isn't.

You can't get the same effect by looking at the three P-values for the three regression coefficients separately. If you want to make this model comparison, you need to do this test.

The omnibus test of the preceding section is the special case of the test of this section where the little model has only the constant predictor.

## Confidence and Prediction Intervals

### Confidence Intervals for Regression Coefficients

We again use the example for quadratic regression done above.

R doesn't do confidence intervals for regression coefficients for you. It does most of the work but leaves a bit left for you. Let's again look at the regression printout for that example.

```Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.883215   2.670473  -1.080    0.287
x            1.406719   0.300391   4.683 3.74e-05 ***
I(x^2)      -0.009381   0.007105  -1.320    0.195
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.351 on 37 degrees of freedom
Multiple R-squared: 0.8414,	Adjusted R-squared: 0.8328
F-statistic: 98.11 on 2 and 37 DF,  p-value: 1.613e-15
```

The printout gives us three numbers (highlighted in yellow) needed to construct the confidence interval. The same three numbers are also needed for general tests about regression coefficients described in the following section.

• The first number is the estimate, also called the sample regression coefficient.
• The second number is the standard error of this estimate (the estimated standard deviation of the sampling distribution of the estimate).
• The third number is the degrees of freedom for the t pivotal quantity made from the first two numbers.

Thus one makes up a confidence interval in the usual way

point estimate ± critical value × standard error

where the critical value here comes from the t distribution with the degrees of freedom stated.

If we want a 95% confidence interval, the critical value for 37 degrees for freedom is given by

```Rweb:> qt(0.975, 37)
[1] 2.026192
```

and then the confidence interval for the parameter β3 for these data is given by

```Rweb:> -0.009381 + c(-1, 1) * qt(0.975, 7) * 0.007105
[1] -0.026181655  0.007419655
```

Of course, this whole calculation can be done by hand with a calculator and a table of the t distribution starting with the regression output. No further use of R is necessary after getting the regression output.

### General Tests about Regression Coefficients

General test (with the hypothesized value under the null hypothesis some number other than zero) are done by hand in the same way as the confidence intervals.

We use the same three numbers highlighted above.

Suppose we want to test

H0 : β3 = 0.25
H1 : β3 < 0.25

The test statistic for a t test is

(point estimate − hypothesized value) / standard error

In this example that is

```Rweb:> (-0.009381 - 0.25) / 0.007105
[1] -36.50683
```

And we look up the P-value using the `pt` function.

```Rweb:> tstat <- (-0.009381 - 0.25) / 0.007105
Rweb:> pt(tstat, 37)
[1] 6.359019e-31
```

Highly statistically significant difference.

### Confidence Intervals for the Regression Function

The R function `predict` (on-line help) makes confidence intervals for means and prediction intervals for new data.

### Mean Values for Old Data

Estimates of the (unknown, true) mean values for the observed data are done by the `predict` function using the defaults for all arguments except the first. We again use the data for the quadratic regression example.

### Confidence Intervals for Means for Old Data

Confidence intervals for the (unknown, true) mean values for the observed data are done by the `predict` function using the optional argument `interval = "confidence"`.

### Mean Values for New Data

Estimates of the (unknown, true) mean values for new data are done by the `predict` function using optional argument `newdata`, which should be a data frame (on-line help) that contains variables having the same names as the predictors used in the model formula in the regression fit. This data frame for the new data does not need to contain the response variable or unused predictor variables.

### Confidence Intervals for Means for New Data

Combining the methods of the two preceding sections, we get confidence intervals for the (unknown, true) mean values for new data.

### Confidence Intervals for Means for New Data, Multiple Covariates

When there are multiple covariates, the data frame for the new data contains all covariates involved in the formula. We use the multiple regression example.

### Prediction Intervals

Prediction intervals are also done by the R function `predict`. Just use `interval = "prediction"` instead of `interval = "confidence"`.

### Prediction Interval Plot

The middle curve is the estimated regression function. The upper and lower curves are the 95% prediction bounds.

To get confidence bounds for the regression function, change `type <- "prediction"` to `type <- "confidence"`.