\documentclass{article}

\usepackage{indentfirst}
  \usepackage{verbatim}

\RequirePackage{amsmath}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\mse}{mse}
\DeclareMathOperator{\tr}{tr}

\newcommand{\SSResid}{\text{\normalfont\upshape SSResid}}

\newcommand{\boldbeta}{\boldsymbol{\beta}}
\newcommand{\boldeta}{\boldsymbol{\eta}}
\newcommand{\boldtheta}{\boldsymbol{\theta}}
\newcommand{\boldmu}{\boldsymbol{\mu}}
\newcommand{\boldp}{\mathbf{p}}
\newcommand{\boldx}{\mathbf{x}}
\newcommand{\boldy}{\mathbf{y}}
\newcommand{\boldA}{\mathbf{A}}
\newcommand{\boldD}{\mathbf{D}}
\newcommand{\boldH}{\mathbf{H}}
\newcommand{\boldI}{\mathbf{I}}
\newcommand{\boldX}{\mathbf{X}}

\newcommand{\mdot}{\,\cdot\,}

\newcommand{\Ybar}{Y{\mkern -14.0 mu}\overline{\phantom{\text{Y}}}}

\newcommand{\norm}[1]{\lVert #1 \rVert}

\RequirePackage{amsfonts}
\newcommand{\real}{\mathbb{R}}

  \newcommand{\centerfig}[2]{\relax}

\RequirePackage{amsthm}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}

\newtheoremstyle{example}% name
     {\topsep}%      Space above
     {\topsep}%      Space below
     {}%         Body font
     {}%         Indent amount (empty = no indent, \parindent = para indent)
     {\bfseries}% Thm head font
     {.}%        Punctuation after thm head
     {\newline}%     Space after thm head: " " = normal interword space;
     {}%         Thm head spec
{\theoremstyle{example} \newtheorem{example}{Example}}

\newenvironment{slogan}{\begin{quote}\em}{\end{quote}}

\begin{document}

\title{Model Selection in R}
\author{Charles J. Geyer}
\maketitle

This used to be a section of my master's level theory notes.
It is a bit overly theoretical for this R course.  Just think
of it as an example of literate programming in R using
the \verb@Sweave@ function.  You don't have to absorb all the
theory, although it is there for your perusal if you are interested.

This discusses the problem of choosing among many models.
Although our examples will be regression problems and some of the
discussion and methods will be specific to linear regression,
the problem is general.  When many models are under consideration,
how does one choose the best?  Thus we also discuss methods
that apply outside the regression context.

There are two main issues.
\begin{itemize}
\item Non-nested models
\item Many models.
\end{itemize}
The only methods for model comparison we have studied, the $F$ test
for comparison of linear regression models and the likelihood ratio
test for comparison of general models, are valid only for comparing
two \emph{nested} models.  We also want to test \emph{non-nested}
models, and for that we need new theory.
When more than two models are under consideration, the issue of
correction for multiple testing arises.  If there are only a handful
of models, Bonferroni correction (or some similar procedure) may suffice.
When there are many models, a conservative correction like Bonferroni
is too conservative.  Let's consider how many models might be
under consideration in a model selection problem.
Consider a regression problem with $k$ predictor variables.
\begin{itemize}
\item \textbf{[Almost the worst case]} There are $2^k$ possible submodels
formed by choosing a subset of the $k$ predictors to include in the model
(because a set with $k$ elements has $2^k$ subsets).
\item \textbf{[Actually the worst case]} That doesn't consider all the
new predictors that one might ``make up'' using functions of the old
predictors.  Thus there are potentially infinitely many models under
consideration.
\end{itemize}
Bonferroni correction for infinitely many tests is undefined.  Even for
$2^k$ tests with $k$ large, Bonferroni is completely pointless.  It would
make nothing statistically significant.

\section{Overfitting}

\begin{slogan}
Least squares is \textbf{good} for model fitting,
but \textbf{useless} for model selection.
\end{slogan}
Why?  A bigger model \emph{always} has a smaller residual sum of squares,
just because a minimum taken over a larger set is smaller.
% \footnote{If
% $g$ is any
% real-valued function and $A$ and $B$ are two subsets of the domain of $g$
% with $A \subset B$ then
% $$
%    \inf_{x \in A} g(x) \ge \inf_{y \in B} g(y)
% $$
% simply because every $x$ that occurs on the left hand side also occurs
% as a $y$ on the right hand side because $A \subset B$.  Taking $g$ to
% be the least squares criterion for a regression model, and $A$ and $B$
% the parameter spaces for two nested models gives the result that the
% larger model always has the smaller residual sum of squares.
% 
% Note this is exactly analogous to what happens with maximum likelihood.
% The larger model always has the larger log likelihood.  The reasoning is
% exactly the same except for the inequality being reversed because of maximizing
% in maximum likelihood rather than minimizing in least squares.}
Thus least squares, taken as a criterion for model selection says
``always choose the biggest model.''  But this is silly.
Consider what the principle of choosing the biggest model says about polynomial
regression.
\begin{example}[Overfitting in Polynomial Regression]
Consider \label{ex:over} the regression data in the file \verb@ex12.7.1.dat@
in this directory, which is read by
<<read-foo>>=
X <- read.table("ex12.7.1.dat", header = TRUE)
names(X)
attach(X)
@
The ``simple'' linear regression is done and a plot made
(Figure~\ref{fig:over1}) by the following
<<label=fig1plot,include=FALSE>>=
par(mar = c(5, 4, 1, 1) + 0.1)
plot(x, y)
out <- lm(y ~ x)
abline(out)
@
\begin{figure}
\begin{center}
<<label=fig1,fig=TRUE,echo=FALSE>>=
<<fig1plot>>
@
\end{center}
\caption{Some regression data.  With fitted ``simple'' regression function
of the form $y = \hat{\alpha} + \hat{\beta} x$.}
\label{fig:over1}
\end{figure}

Couldn't ask for nicer data for simple linear regression.  The data appear
to fit the ``simple'' model (one non-constant predictor, which we take
to be $x$ itself).

But now consider what happens when we try to be a bit more sophisticated.
How do we know that the ``simple'' model is o.~k.?  Perhaps we should consider
some more complicated models.  How about trying polynomial regression?
But if we consider all possible polynomial models, that's an infinite
number of models (polynomials of all orders).

Although an infinite number of models are potentially under consideration,
the fact that the data set is finite limits the number of models that actually
need to be considered to a finite subset.  You may recall from algebra that
any set of $n$ points in the plane having $n$ different $x$ values can be
interpolated (fit exactly) by a polynomial of degree $n - 1$.  A polynomial
that fits exactly has residual sum of squares zero (fits perfectly).
Can't do better than that by the least squares criterion!  Thus all
polynomials of degree at least $n - 1$ will give the same fitted values
and zero residual sum of squares.  Although they may give different
predicted values for $x$ values that do not occur in the data, they give
the same predicted values at those that do.  Hence the polynomials with
degree at least $n - 1$ cannot be distinguished by least squares.
In fact the polynomials with degree more than $n - 1$ cannot be fit
at all, because they have more parameters than there are equations to
determine them.  $\boldX' \boldX$ is a singular matrix and cannot inverted
to give unique estimates of the regression coefficients.

This is a general phenomenon, which occurs in all settings, not just with
linear regression.  Models with more parameters than there are data points
are underdetermined.  Many different parameter vectors give the same
likelihood, or the same empirical moments for the method of moments, or
the same for whatever criterion is being used for parameter estimation.
Thus in general, even when an infinite number of models are theoretically
under consideration, only $n$ models are practically under consideration,
where $n$ is the sample size.

Hence the ``biggest model'' that the least squares criterion selects is the
polynomial of degree $n - 1$.  What does it look like?
The following code fits this model
<<over>>=
out.poly <- lm(y ~ poly(x, 9))
summary(out.poly)
@
(Note the zero residual degrees of freedom.  That's the overfitting!)

Then
<<label=fig2plot,include=FALSE>>=
par(mar = c(5, 4, 1, 1) + 0.1)
plot(x, y)
curve(predict(out.poly, data.frame(x = x)), add = TRUE)
abline(out, lty = 2)
@
plots the two regression functions for our fits (Figure~\ref{fig:over2})
the best fitting (perfectly fitting!)
polynomial of degree $n - 1 = 9$ and the least squares regression line
from the other figure.
\begin{figure}
\begin{center}
<<label=fig2,fig=TRUE,echo=FALSE>>=
<<fig2plot>>
@
\end{center}
\caption{Some regression data.  With fitted linear regression function
(dashed line) and ninth degree polynomial regression function
(solid curve).  See also Figure~\ref{fig:over3}.}
\label{fig:over2}
\end{figure}
\begin{comment}
 n <- 10
 x <- 1:10
 sigma <- 1
 alpha <- 0
 beta <- 1
 y <- alpha + beta * x + rnorm(n, 0, sigma)

 write(rbind(x, y), ncol=2, file="ex12.7.1.dat")

 X <- read.table("/usr/HP9/WEB/httpd/htdocs/geyer/5102/ex12.7.1.dat", header=T)
 attach(X)
 n <- length(x)

 out <- lm(y ~ x)
 summary(out)
 plot(x, y)
 abline(out)

 postscript(file="over1.ps", height=5, width=6, horizontal=FALSE)
 par(mar=c(5,4,0,0)+.1)
 plot(x, y)
 abline(out)

 norm <- function(x) sqrt(sum(x^2))

 foo <- rep(1, n)
 foo <- foo / norm(foo)

 design <- as.matrix(foo)
 for (i in 1:(n - 1)) {
    foo <- (x - mean(x))^i
    a <- t(design) %*% foo
    bar <- design %*% a
    bar <- foo - bar
    bar <- bar / norm(bar)
    design <- cbind(design, bar)
 }

 outn <- lm(y ~ design + 0)
 summary(outn)
 
 xx <- seq(0, n + 1, 0.01)

 foo <- rep(1, n)
 fool <- rep(1, length(xx))
 fool <- fool / norm(foo)
 foo <- foo / norm(foo)

 design <- as.matrix(foo)
 desl <- as.matrix(fool)
 for (i in 1:(n - 1)) {
    foo <- (x - mean(x))^i
    fool <- (xx - mean(x))^i
    a <- t(design) %*% foo

    bar <- design %*% a
    barl <- desl %*% a
    bar <- foo - bar
    barl <- fool - barl
    barl <- barl / norm(bar)
    bar <- bar / norm(bar)
    design <- cbind(design, bar)
    desl <- cbind(desl, barl)
 }

 all(design == desl[match(x, xx), ])

 predl <- desl %*% outn$coef

 plot(x, y)
 abline(out)
 lines(xx, predl)

 postscript(file="over2.ps", height=5, width=6, horizontal=FALSE)
 par(mar=c(5,4,0,0)+.1)
 plot(x, y, ylim=range(predl[xx >= 1 & xx <= n]))
 lines(xx, predl)
 abline(out, lty=2)
 
\end{comment}

How well does the biggest model do?  It fits the observed data perfectly,
but it's hard to believe that it would fit \emph{new} data from the same
population as well.  The extreme oscillations near the ends of the range
of the data are obviously nonsensical, but even the smaller oscillations
in the middle seem to be tracking random noise rather than any real
features of the population regression function.  Of course, we don't actually
know what the true population regression function is.  It could be either
of the two functions graphed in the figure, or it could be some other function.
But it's hard to believe, when the linear function fits so well, that something
as complicated as the ninth degree polynomial is close to the true regression
function.  We say it ``overfits'' the data, meaning it's \emph{too} close to
the data and not close enough to the true population regression function.

Let's try again on Figure~\ref{fig:over2}.  Perhaps we would like to see
the extent of those big oscillations.  Something like this should do it.
<<label=fig3plot,include=FALSE>>=
par(mar = c(5, 4, 1, 1) + 0.1)
xx <- seq(min(x), max(x), length = 1000)
yy <- predict(out.poly, data.frame(x = xx))
plot(x, y, ylim = range(yy))
curve(predict(out.poly, data.frame(x = x)), add = TRUE)
abline(out, lty = 2)
@
The result is shown in Figure~\ref{fig:over3}.
\begin{figure}
\begin{center}
<<label=fig3,fig=TRUE,echo=FALSE>>=
<<fig3plot>>
@
\end{center}
\caption{Same as Figure~\ref{fig:over2} except for greater vertical extent.}
\label{fig:over3}
\end{figure}
\end{example}

\section{Mean Square Error}

So what criterion should we use for model selection if residual sum of
squares is no good?  One theoretical criterion is mean square error.
In the regression setting, it is unclear what quantities mean square error
should apply to.  Do we use the mean square error of the parameter estimates?
We haven't even defined mean square error for vector quantities.
That alone suggests we should avoid looking at MSE of regression
coefficients.  There is also our slogan that regression coefficients
are meaningless.  Hence we should look at estimates of the regression function.

But here too, there are still issues that need to be clarified.  The
regression function $h(\boldx) = E(Y \mid \boldx)$ is a scalar
function of $\boldx$.
(Hence we don't need to worry about MSE of a vector quantity).
But it is a function of the predictor value $\boldx$.
\begin{equation} \label{eq:mse-of-x}
   \mse\{\hat{h}(\boldx)\}
   =
   \text{variance} + \text{bias}^2
   =
   \var\{\hat{h}(\boldx)\} + \left(E\{\hat{h}(\boldx)\} - h(\boldx)\right)^2
\end{equation}
where $h(\boldx)$ is the true population regression function
and $\hat{h}(\boldx)$ is an estimate (we are thinking of least squares
regression estimates, but \eqref{eq:mse-of-x} applies to any estimate).
(``Bias?''\ did I hear someone say?  Aren't linear regression estimates
\emph{unbiased}?  Yes, they are when the model is \emph{correct}.
Here we are considering cases when the model is too small to contain
the true regression function.)

To make a criterion that we can minimize to find the best model, we need
a single scalar quantity, not a function.  There are several things we
could do to make a scalar quantity from \eqref{eq:mse-of-x}.
We could \emph{integrate} it over some range of values,
obtaining so-called \emph{integrated mean squared error}.
A simpler alternative is to sum it over the \emph{design points},
the $\boldx$ values occurring in the data set under discussion.
We'll discuss only the latter.

Write $\boldmu$ for the true population means
of the responses given the predictors, defined by $\mu_i = h(\boldx_i)$.
Let $m$ index models.  The $m$-th model will have design matrix
$\boldX_m$ and hat matrix
\begin{equation} \label{eq:hat}
   \boldH_m = \boldX_m (\boldX_m' \boldX_m)^{-1} \boldX_m'.
\end{equation}
It is called the hat matrix because it ``puts the hat on'' $\boldy$,
that is, $\hat{\boldy}_m = \boldH_m \boldy$ is the vector of predicted
values for the $m$-th model.
The expected value of the regression
predictions is
$$
   E(\hat{\boldy}_m) = \boldH_m \boldmu.
$$
Hence the bias is
\begin{equation} \label{eq:bias-of-reg}
   E(\hat{\boldy}) - E(\boldy) = - (\boldI - \boldH_m) \boldmu.
\end{equation}

If the $m$-th model is correct, then $\boldmu$ is in the range of $\boldH_m$
and the bias is zero.  Models that are incorrect have nonzero bias.
The bias \eqref{eq:bias-of-reg} is, of course, a vector.
Its $i$-th element gives
the bias of $\hat{y}_i$.  What we decided to study was the sum of the
MSEs at the design points, the ``bias'' part of which is just the sum
of squares of the elements of \eqref{eq:bias-of-reg}, which is the same
thing as the squared length of this vector
\begin{equation} \label{eq:the-bias}
   \text{bias}^2
   =
   \norm{(\boldI - \boldH_m) \boldmu}^2
   =
   \boldmu' (\boldI - \boldH_m)^2 \boldmu
   =
   \boldmu' (\boldI - \boldH_m) \boldmu.
\end{equation}

The variance is
\begin{align*}
   \var(\hat{\boldy})
   & =
   E\left\{ \bigl\lVert \hat{\boldy} - E(\hat{\boldy}) \bigr\rVert^2 \right\}
   \\
   & =
   E\left\{ \bigl\lVert \hat{\boldy} - \boldH_m \boldmu \bigr\rVert^2 \right\}
   \\
   & =
   E\left\{ \bigl\lVert \boldH_m (\boldy - \boldmu) \bigr\rVert^2 \right\}
   \\
   & =
   E\left\{ \boldH_m (\boldy - \boldmu)' (\boldy - \boldmu) \boldH_m \right\}
   \\
   & =
   \sigma^2 \boldH_m^2
   \\
   & =
   \sigma^2 \boldH_m
\end{align*}
This is, of course, a matrix.  What we want though is just the sum of the
diagonal elements.  The $i$-th diagonal element is the variance of $\hat{y}_i$,
and our decision to focus on the sum of the MSEs at the design points
says we want the sum of these.  The sum of the diagonal elements of a
square matrix $\boldA$ is called its \emph{trace}, denoted $\tr(\boldA)$.
Thus the ``variance'' part of the MSE is
\begin{equation} \label{eq:the-var}
   \text{variance} = \sigma^2 \tr(\boldH_m).
\end{equation}
And
$$
   \mse_m = \sum_{i = 1}^n \mse(\hat{y}_i)
   = \sigma^2 \tr(\boldH_m) + \boldmu' (\boldI - \boldH_m) \boldmu.
$$

\section{The Bias-Variance Trade-Off}

Generally, one cannot reduce both bias and variance at the same time.
Bigger models have less bias but \emph{more} variance.
Smaller models have less variance but \emph{more} bias.
This is called the ``bias-variance trade-off.''
\begin{example}
Consider \label{ex:trade} the regression data in the file \verb@ex12.3.2.dat@
in this directory, which is read by
<<label=detach1,echo=FALSE>>=
detach(X)
@
<<read-sinus>>=
X <- read.table("ex12.3.2.dat", header = TRUE)
names(X)
attach(X)
@
Figure~\ref{fig:sinus-scatter} shows the scatter plot.
\begin{figure}
\begin{center}
<<label=figsinus-scatter,fig=TRUE,echo=FALSE>>=
plot(x, y)
@
\end{center}
\caption{Some regression data.}
\label{fig:sinus-scatter}
\end{figure}

A mere glance at the plot shows that $y$ is decidedly not a linear function
of $x$, not even close.  However, there is nothing that
prevents us from making up more predictor variables.  The data set itself
has no other variables in it, just $x$ and $y$.  So any new predictor variables
we make up must be functions of $x$.  What functions?

Since we are told nothing about the data, we have no guidance as to what
functions to make up.  In a real application, there might be some guidance
from scientific theories that describe the data.  Or there might not.
Users of linear regression often have no preconceived ideas as to what
particular functional form the regression function may have.  One thing
that is often done, more or less for lack of any better ideas, is
to try a polynomial.

The following fits a bunch of polynomials and compares them.
<<sinus-fit>>=
out1 <- lm(y ~ x)
out2 <- update(out1, . ~ . + I(x^2))
out3 <- update(out2, . ~ . + I(x^3))
out4 <- update(out3, . ~ . + I(x^4))
out5 <- update(out4, . ~ . + I(x^5))
out6 <- update(out5, . ~ . + I(x^6))
out7 <- update(out6, . ~ . + I(x^7))
anova(out1, out2, out3, out4, out5, out6, out7)
@
(no particular reason for stopping at 7).

It is clear from the printout that odd degree polynomials fit way
better than the even ones (strictly speaking, better than the immediately
preceding even one to which they are compared).  In hindsight, this is
obvious.  A glance at Figure~\ref{fig:sinus-scatter} shows that the
regression function is bending down on the left side and up on the right.
This is the behavior of an odd degree polynomial.  An even degree polynomial
heads the same way at both sides (both up or both down).  Thus we need only
consider odd degree polynomials.  Let's add some more.
<<sinus-fit-more>>=
out9 <- update(out7, . ~ . + I(x^8) + I(x^9))
out11 <- update(out9, . ~ . + I(x^10) + I(x^11))
out13 <- update(out11, . ~ . + I(x^12) + I(x^13))
anova(out1, out3, out5, out7, out9, out11, out13)
@
Hmmmmm.  Seems we stopped in the right place.  Model 4 here,
which is the polynomial of degree 7 is clearly fits much better than
model 3.  But models 5, 6, and 7 don't fit significantly better than
model 4.  Thus it is fairly clear that model 4 is the one to go with.

Double Hmmmmm.  The 1 degree of freedom for comparing models 6 and 7 is
wrong.  The only reason I can think of for this is that using the monomials
as predictors is very unstable.  Let's try again.
<<sinus-fit-again>>=
out3 <- lm(y ~ poly(x, 3))
out5 <- lm(y ~ poly(x, 5))
out7 <- lm(y ~ poly(x, 7))
out9 <- lm(y ~ poly(x, 9))
out11 <- lm(y ~ poly(x, 11))
out13 <- lm(y ~ poly(x, 13))
anova(out1, out3, out5, out7, out9, out11, out13)
@
No difference at all except for that last comparison, which
\verb@poly@ gets right!

Figure~\ref{fig:sinus-poly} shows the scatter plot again
with the regression function for the 7th degree polynomial added.
\begin{figure}
\begin{center}
<<label=figsinus,fig=TRUE,echo=FALSE>>=
plot(x, y)
curve(predict(out7, data.frame(x = x)), add = TRUE)
@
\end{center}
\caption{Same regression data as in Figure~\ref{fig:sinus-scatter}.
Now with sample regression function in the model of all polynomials
of degree seven.}
\label{fig:sinus-poly}
\end{figure}

From a data analytic point of view, there's nothing more to be said.
From a theoretical point of view, we haven't even started.  Theoretically,
we haven't yet looked at the bias-variance tradeoff, or for that matter
at the bias or variance.  These are theoretical quantities that cannot
be discovered by looking at fitted models.

To know the bias we need to know the true population regression function.
I do happen to know that here because this is simulated data.
The true population regression function is trignometric, given by
\begin{subequations}
\begin{align}
  \mu_i & = \sin(x_i) + \sin(2 x_i)
  \label{eq:trade-mu}
  \\
  \sigma & = 0.2
  \label{eq:trade-sig}
\end{align}
\end{subequations}
Again we consider various polynomial regression models.
Since we are not being theoretical, we use only the $x$ values
and ignore the $y$ values.

Since the true population regression curve is a trigonometric rather than
a polynomial function, \emph{no} polynomial is unbiased.
This is typical of real applications.  No model under
consideration is exactly correct.

The R function \verb@lm@ returns the design matrix if asked, and from
it we can construct the hat matrix, for example
<<sinus-hat1>>=
out7 <- lm(y ~ poly(x, 7), x = TRUE)
hat7 <- out7$x %*% solve(t(out7$x) %*% out7$x) %*% t(out7$x)
### check that hat matrix really does put the hat on y
all.equal(predict(out7), as.numeric(hat7 %*% y))
@

Then (integrated) bias squared and variance are given by
<<sinus-bv7>>=
mu <- sin(x) + sin(2 * x)
sigma <- 0.2
n <- length(x)
### bias squared
t(mu) %*% (diag(n) - hat7) %*% mu
### variance
sigma^2 * sum(diag(hat7))
@

Put the above in a loop.
<<sinus-bv>>=
kmax <- 20
bsq <- double(kmax)
v <- double(kmax)
for (k in 1:kmax) {
    out <- lm(y ~ poly(x, k), x = TRUE)
    hat <- out$x %*% solve(t(out$x) %*% out$x) %*% t(out$x)
    bsq[k] <- t(mu) %*% (diag(n) - hat) %*% mu
    v[k] <- sigma^2 * sum(diag(hat))
}
foo <- cbind(1:kmax, bsq, v, bsq + v)
dimnames(foo) <- list(NULL, c("degree", "bias^2", "variance", "mse"))
foo
@

The most interesting fact here is that the best model in terms of mean
square error has degree 9 (not the degree 7 model chosen by the data analysis).
Of course, to know the theoretically best model, we need to know the true
population regression function \verb@mu@ and when we are doing a real data
analysis, we don't.  The point is that there is a theoretically best model
and the data analysis need not find it.

Also very interesting is that we don't want to be less biased, much less
unbiased.  Many people exposed to bad statistics teaching think unbiasedness
is absolutely wonderful.  But here we see that less bias is bad when we
go beyond the point of optimum bias-variance tradeoff.  This is quite a
general point.  Linear regression is only unbiased when all of the model
assumptions are correct: the true population regression function is
among those under consideration (exactly) and the errors are IID
(again exactly).  In real life, these assumptions are rarely if ever
known to be correct.  So unbiasedness is mythical.

Of minor interest is the property that the even degree models have the
same bias as the next lower odd degree model.  This is because the true
population regression is an odd function, satisfying $f(x) = - f(- x)$,
which makes the true population regression coefficients on the even
degree polynomials, which are even functions, satisfying $f(x) = f(- x)$,
exactly zero.

Figure~\ref{fig:trade} shows both the true regression function used
to simulate the response values \eqref{eq:trade-mu} and the sample
ninth-degree polynomial regression function.  The R statements making
the figure are
<<label=fig4plot,include=FALSE>>=
par(mar = c(5, 4, 1, 1) + 0.1)
plot(x, y)
curve(sin(x) + sin(2 * x), add = TRUE)
curve(predict(out9, data.frame(x = x)), add = TRUE, lty = 2)
@
The result is shown in Figure~\ref{fig:trade}.
\begin{figure}
\begin{center}
<<label=fig4,fig=TRUE,echo=FALSE>>=
<<fig4plot>>
@
\end{center}
\caption{Some regression data with true regression function (solid line)
and sample regression function (dashed line).}
\label{fig:trade}
\end{figure}

We see that the sample regression function does not estimate the true
regression function perfectly (because the sample is not the population),
but we now know from the theoretical analysis in this example that no
polynomial will fit better.  A lower degree polynomial will have less
variance.  A higher degree will have less bias.  But the bias-variance
trade-off will be worse for either.
\end{example}

\section{Model Selection Criteria}

The theory discussed in the preceding section gives us a framework for
discussing model selection.  We want the model with the smallest MSE,
the model which makes the optimal bias-variance trade-off.

Unfortunately, this is useless in practice.  Mean square error is a theoretical
quantity.  It depends on the unknown true regression function and the unknown
error variance.  Furthermore, there is no obvious way to estimate it.
Without knowing which models are good, which is exactly the question we
are trying to resolve, we can't get a good estimate of the true regression
function (and without that we can't estimate the error variance well either).

There are several quantities that have been proposed in the literature
as estimates of MSE.  No one is obviously right.  We shall not
go into the theory justifying them (it is merely heuristic anyway).
One fairly natural quantity is
\begin{equation} \label{eq:press}
   \sum_{i = 1}^n \hat{e}_{(i)}^2
\end{equation}
where $\hat{e}_{(i)}$ is the so-called leave-one-out residual
$$
   \hat{e}_{(i)} = y_i - \hat{y}_{(i)}
$$
where $\hat{y}_{(i)}$ is the predicted value at the point $x_i$ from
the model fitted to the data that \emph{leaves out $y_i$}.  The idea
is that this gives an ``honest'' estimate of the error because $y_i$
is not used in calculating $\hat{y}_{(i)}$.  Hence $\hat{y}_{(i)}$
can't be ``too close'' to $y_i$ since it doesn't ``know'' about $y_i$.

It turns out that it is not necessary to do $n$ different regressions
(each leaving out one $y_i$).  All of the leave-one-out residuals can
be calculated from the fit in the full regression model using
\begin{equation} \label{eq:lout}
   \hat{e}_{(i)} = \frac{\hat{e}_i}{1 - h_{i i}}
\end{equation}
where $h_{i i}$ is the $i$, $i$ element of the hat matrix.
This nifty algebraic fact makes \eqref{eq:press} much easier to calculate than
it appears at first sight and consequently makes \eqref{eq:press} a much more
useful criterion.

Unlike $\SSResid$ (the sum of the $\hat{e}_i^2$), this does not always
favor the biggest model.  Models that overfit tend to do a bad job
of their leave-one-out predictions.  \eqref{eq:press} is called
the \emph{predicted residual sum of squares} (PRESS) or the
\emph{cross-validated sum of squares} (CVSS), cross-validation being
another term used to describe the leave-one-out idea.

The idea of CVSS or other criteria to be described presently
is to pick the model with the smallest value of the criterion.
This will not necessarily be the model with the smallest MSE,
but it is a reasonable estimate of it.

Another criterion somewhat easier to calculate is Mallows' $C_p$
defined by
\begin{equation} \label{eq:cp}
\begin{split}
  C_p & = \frac{\SSResid_p}{\hat{\sigma}^2} + 2 p - n \\
      & = \frac{\SSResid_p - \SSResid_k}{\hat{\sigma}^2} + p - (k - p) \\
      & = (k - p) (F_{k - p, n - k} - 1) + p
\end{split}
\end{equation}
where $\SSResid_p$ is the sum of squares of the residuals for some model
with $p$ predictors (including the constant predictor, if present),
$\hat{\sigma}^2 = \SSResid_k / (n - k)$ is the estimated error variance
for the largest model under consideration, which has $k$ predictors,
and $F_{k - p, n - k}$ is the $F$ statistic for the $F$ test for comparison of
these two models.  The $F$ statistic is about one in size if the small
model is correct, in which case $C_p \approx p$.  This gives us a criterion
for finding reasonably fitting models.  When many models are under
consideration, many of them may have $C_p \approx p$ or smaller.
All such models must be considered reasonably good fits.
Any might be the correct model or as close to correct as any
model under consideration.  The quantity estimated by $C_p$ is the
mean square error of the model with $p$ predictors, divided by $\sigma^2$.

An idea somewhat related to Mallows' $C_p$, but applicable outside the
regression context is the \emph{Akaike information criterion} (AIC).
\begin{equation} \label{eq:aic}
  - 2 \cdot (\text{log likelihood}) + 2 p
\end{equation}
where as in Mallows' $C_p$, the number of parameters in the model is $p$.
Although \eqref{eq:cp} and \eqref{eq:aic} both have the term $2 p$,
they are otherwise different.  The log likelihood for a linear regression
model, with the MLEs plugged in for the parameters is
$$
   \text{log likelihood} = \text{constant} - \frac{n}{2} \log(\SSResid)
$$
Thus for a regression model with $p$ predictors,
$$
   \text{AIC} = \text{constant} + n \log(\SSResid_p) + 2 p
$$
where ``constant'' here is a term that does not depend on the model
and hence does not matter in model comparisons.

Finally, we add one last criterion, almost the same as AIC, called
the \emph{Bayes information criterion} (BIC)
\begin{equation} \label{eq:bic}
  - 2 \cdot (\text{log likelihood}) + p \log(n)
\end{equation}
In the regression context, this becomes
$$
   \text{BIC} = \text{constant} + n \log(\SSResid_p) + p \log(n).
$$
Neither AIC nor BIC have a rigorous theoretical justification applicable
to a wide variety of models.  Both were derived for special classes of
models that were easy to analyze and both involve some approximations.
Neither can be claimed to be the right thing (nor can anything else).
As the ``Bayes'' in BIC indicates, the BIC criterion is intended to
approximate using Bayes tests instead of frequentist tests (although
its approximation to true Bayes tests is fairly crude).
Note that BIC penalizes models with more parameters more strongly than AIC
($p \log n$ versus $2 p$).  So BIC always selects a larger model than AIC.

This gives us four criteria for model selection.  There are arguments in
favor of each.  None of the arguments are completely convincing.  All are
widely used.
\begin{example}
We use the data for Example~\ref{ex:trade} again.
Now we fit polynomials of various degrees to the data,
and look at our four criteria.
<<four-crit>>=
n <- length(y)
deg <- seq(1, 15, 2)
cvss <- double(length(deg))
cp <- double(length(deg))
aic <- double(length(deg))
bic <- double(length(deg))
out.big <- lm(y ~ poly(x, 17))
sigsqhat.big <- summary(out.big)$sigma^2
for (i in seq(along = deg)) {
    k <- deg[i]
    out <- lm(y ~ poly(x, k))
    aic[i] <- AIC(out)
    bic[i] <- AIC(out, k = log(n))
    cvss[i] <- sum((out$residuals / (1 - hatvalues(out)))^2)
    cp[i] <- sum(out$residuals^2) / sigsqhat.big + 2 * out$rank - n 
}
foo <- cbind(deg, cvss, cp, aic, bic)
dimnames(foo) <- list(NULL, c("degree", "CVSS", "Cp", "AIC", "BIC"))
foo
@

In this example, all four criteria select the same model,
the polynomial of degree 7.  This is not the model with the smallest
MSE discovered by the theoretical analysis.
The criteria do something sensible, but as everywhere else in
statistics, there are errors (the sample is not the population).
\end{example}

\section{All Subsets Regression}

We now return to the situation in which there are $k$ predictors
including the constant predictor and $2^{k - 1}$ models under consideration
(the constant predictor is usually included in all models).
If $k$ is large and one has no non-statistical reason (e.~g., a practical
or scientific reason) that cuts down the number of models to be considered,
then one must fit them all.  Fortunately, there are fast algorithms that
allow a huge number of models to be fit or at least quickly checked to see
that they are much worse than other models of the same size.

There is a contributed package to R that contains a function \texttt{leaps}
that does this.
\begin{example}[$2^k$ subsets]
The \label{ex:twokey} data set in the file \verb@ex12.7.4.dat@
in this directory, read by
<<label=detach2,echo=FALSE>>=
detach(X)
@
<<read-sinus>>=
X <- read.table("ex12.7.4.dat", header = TRUE)
names(X)
attach(X)
@
consists of multivariate normal data with one response variable $y$
and 20 predictor variables $x_1$, $\ldots$, $x_{20}$.  The predictors
are correlated.  The distribution from which they were simulated has
all correlations equal to one-half.  The actual (sample) correlations,
of course, are all different because of chance variation.

The true population regression function (the one used to simulate the $y$
values) was
\begin{equation} \label{eq:twokey-truth}
   y = x_1 + x_2 + x_3 + x_4 + x_5 + e
\end{equation}
with error variance $\sigma^2 = 1.5^2$.
We fit the model by
<<twokey-fit>>=
out <- lm(y ~ ., data = X, x = TRUE)
summary(out)
@

We've left in the significance codes, bogus though they may be
(more on this later), so you can easily spot the regression coefficients
that the least squares fit indicates may be important.  If we go only with
the strongest evidence (two or three stars) we get as ``significant''
three of the five truly important regression coefficients
[recall from \eqref{eq:twokey-truth} that the true nonzero regression
coefficients are $\beta_1$ through $\beta_5$].
The other two are missed.

If we use a less stringent standard, say one or more stars,
we do pick up another truly nonzero regression coefficient, but
we also pick up a false one.  Thus we now have both false negatives
(we still missed $\beta_3$) and false positives (we've picked up $\beta_8$).

With the least stringent standard, all the coefficients marked by any
of the ``significance codes'' we now have no false negatives
(all five of the truly nonzero regression coefficients are now declared
``significant'') but we have four false positives.

No matter how you slice it, least squares regression doesn't pick the right
model.  Of course, this is no surprise.  It's just ``the sample is not the
population.''  But it does show that the results of such model selection
procedures must be treated with skepticism.

Actually, we haven't even started a sensible model selection procedure.
Recall the slogan that if you want to know how good a model fits, you have
to fit that model.  So far we haven't fit any of the models we've discussed.
We're fools to think we can pick out the good submodels
just by looking at printout for the big model.

There is a function \texttt{leaps} in the \texttt{leaps} contributed
package\footnote{This has to be installed separately.  It doesn't
come with the default install.  Like everything else about R, it can be found
at \texttt{http://cran.r-project.org}.}
that fits a huge number of models.
By default, it finds the 10 best models of each size (number of
regression coefficients) for which there are 10 or more models and
finds all the models of other sizes.

It uses the inequality that a bigger
model always has a smaller sum of squares to eliminate many models.
Suppose we have already found 10 models of size $p$ with $\SSResid$ less
than 31.2.  Suppose there was a model of size $p + 1$ that we fit and found
its $\SSResid$ was 38.6.  Finally suppose $\hat{\sigma}^2$ for the big model
is 2.05.  Now the $C_p$ for the 10 best models of size $p$ already found is
$$
   C_p = \frac{\SSResid_p}{\hat{\sigma}^2} + 2 p - n
   < \frac{31.2}{2.05} + 2 p - n
$$
and the $C_p$ for any submodel of size $p$ of the model with $\SSResid = 38.6$
(i.~e., models obtained by dropping one predictor from that model)
has
$$
   C_p \ge \frac{38.6}{2.05} + 2 p - n
$$
This means that no such model can be better than the 10 already found,
so they can be rejected even though we haven't bothered to fit them.
Considerations of this sort make it possible for \texttt{leaps} to
pick the 10 best of each size without fitting all or even a sizable
fraction of the models of each size.  Thus it manages to do in minutes
what it couldn't do in a week if it actually had to fit all $2^k$
models.
The reason why \texttt{leaps} uses $C_p$ as its criterion rather than one
of the others is that it is a simple function of $\SSResid$ and hence to
these inequalities that permit its efficient operation.

We run the leaps function as follows, with the design matrix \texttt{x}
defined as above,\footnote{For some reason, \texttt{leaps} doesn't take
formula expressions like \texttt{lm} does.  The reason is probably historical.
The equivalent S-plus function doesn't either, because it was written before
S had model formulas and hasn't changed.  The \texttt{strictly.compatible=FALSE}
tells R not to be bug-for-bug compatible with S-plus.}
<<label=fig5plot,include=FALSE>>=
par(mar = c(5, 4, 1, 1) + 0.1)
x <- out$x
dim(x)
library(leaps)
outs <- leaps(x, y, int = FALSE, strictly.compatible = FALSE)
plot(outs$size, outs$Cp, log = "y", xlab = "p", ylab = expression(C[p]))
lines(outs$size, outs$size)
@
Figure~\ref{fig:leaps} shows the plot made by the two plot commands.
\begin{figure}
\begin{center}
<<label=fig5,fig=TRUE,echo=FALSE>>=
<<fig5plot>>
@
\end{center}
\caption{$C_p$ plot.  Plot of $C_p$ versus $p$.  The dots are the $C_p$ for
the 10 best models for each $p$.  The curve is the line $C_p = p$
(curved because of the log scale for $C_p$).}
\label{fig:leaps}
\end{figure}

Every model with $C_p < p$, corresponding to the dots below the line is
``good.''  There are a \emph{huge} number of perfectly acceptable models,
because for the larger $p$ there are many more than 10 good models, which
are not shown.

The best model according to the $C_p$ criterion is one with $p = 12$,
so 11 non-constant predictors, which happen to be $x_1$ through $x_5$
(the truly significant predictors) plus $x_8$, $x_{10}$, $x_{11}$, $x_{12}$,
$x_{14}$, and $x_{20}$.  We can get its regression output as follows.
<<leaps-best>>=
ifoo <- outs$Cp == min(outs$Cp)
ifoo <- outs$which[ifoo, ]
foo <- x[ , ifoo]
out.best <- lm(y ~ foo + 0)
summary(out.best)
@

Note that the ``stargazing'' doesn't correspond with the notion of the
best model by the $C_p$ criterion.  One of these coefficients doesn't
even have a dot (so for it $P > 0.10$), and four others only have dots
($0.05 < P < 0.10$).  Considering them separately, this would lead us
to drop them.  But that would be the Wrong Thing (multiple testing without
correction).  The \texttt{leaps} function does as close to the Right Thing
as can be done.  The only defensible improvement would be to change the
criterion, to BIC perhaps, which would choose a smaller ``best'' model
because it penalizes larger models more.  However BIC wouldn't have the
nice inequalities that make \texttt{leaps} so efficient, which accounts
for the use of $C_p$.

I hope you can see from this analysis that model selection when there are
a huge number of models under consideration and no extra-statistical
information (scientific, practical, etc.)\ that can be used to cut down
the number is a mug's game.  The best you can do is not very good.
The only honest conclusion is that a huge number of models are about equally
good, as good as one would expect the correct model to be ($C_p \approx p$).

Thus it is silly to get excited about exactly which model is chosen
as the ``best'' by some model selection procedure (any procedure)!
When many models are equally good, the specific features of any one
of them can't be very important.

All of this is related to our slogan about ``regression is for prediction,
not explanation.''  All of the models with $C_p < p$ predict about equally well.
So if regression is used for \emph{prediction}, the model selection problem
is not serious.  Just pick any one of the many good models and use it.
For \emph{prediction} it doesn't matter which good prediction is used.
But if regression is used for \emph{explanation}, the model selection problem
is insoluble.  If you can't decide which model is ``best'' and are honest
enough to admit that lots of other models are equally good, then how can
you claim to have found the predictors which ``explain'' the response?
Of course, if you really understand ``correlation is not causation, and
regression isn't either,'' then you know that such ``explanations'' are
bogus anyway, even in the ``simple'' case (one non-constant predictor)
where the model selection problem does not arise.
\begin{comment}

 n <- 100
 k <- 20

 M <- diag(k)
 M <- M + .5
 diag(M) <- 1

 L <- chol(M)
 t(L) %*% L

 z <- matrix( rnorm(n * k), nrow=k)
 x <- t(z) %*% L

 var(x)

 beta <- rep(0, k)
 beta[1:5] <- 1
 mu <- x %*% beta
 sigma <- 1.5

 y <- mu + sigma * rnorm(n)

 write(t(cbind(y, x)), ncol=ncol(x)+1, file="ex12.7.4.dat")

 X <- read.table("/usr/HP9/WEB/httpd/htdocs/geyer/5102/ex12.7.4.dat", header=T)
 attach(X)

 foo <- as.matrix(X)
 x <- foo[ , -1]
 out <- lm(y ~ x)
 summary(out)

 library(leaps)


 postscript(file="leaps.ps", height=5, width=6, horizontal=FALSE)
 par(mar=c(5,4,0,0)+.1)
 outs <- leaps(x, y, strictly.compatible=FALSE)
 plot(outs$size, outs$Cp, log="y", xlab="p", ylab=expression(C[p]))
 lines(outs$size, outs$size)

 outs$which[outs$Cp == min(outs$Cp)]
    11    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA 
 TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE 
   NA    NA    NA    NA    NA    NA    NA 
 TRUE FALSE FALSE FALSE FALSE FALSE  TRUE 

 foo <- x[ , outs$which[outs$Cp == min(outs$Cp)]]
 out.best <- lm(y ~ foo)
 summary(out.best)

\end{comment}
\end{example}

\end{document}

\section*{Problems}

\begin{problem}
Prove the assertion
$\hat{\boldbeta} = (\boldX' \boldX)^{- 1} \boldX' \hat{\boldy}$ in the
proof of Theorem~\ref{th:ssresid-dist}.
\end{problem}

\begin{problem}
For the data in Example~\ref{ex:multreg} give 90\% confidence intervals
for $\alpha$ and $\beta_1$ (the intercept and coefficient for $x_1$,
the coefficient for $x_2$ was done in Example~\ref{ex:multreg-cont}
\end{problem}

\begin{problem}
Fill in the details of the proof of Lemma~\ref{th:hat}.
\end{problem}

\begin{problem}
For \label{prob:foo} the data in Example~\ref{ex:foo} in the URL
\begin{verbatim}
http://www.stat.umn.edu/geyer/5102/ex12.3.2.dat
\end{verbatim}
\noindent
try fitting some Fourier series models.  A Fourier series is sum of
sines and cosines of multiples of one fundamental frequency
$$
   f(x) = a + \sum_{k = 1}^m b_k \sin(2 \pi k x / L) +
   \sum_{k = 1}^m c_k \cos(2 \pi k x / L)
$$
where $L$ is a known constant (the wavelength of the lowest frequency
sine and cosine terms) and $a$, the $b_k$, and the $c_k$ are
adjustable constants.  These adjustable constants will be the regression
coefficients you fit using linear regression.  A Fourier series is always
periodic with period $L$.  Since the variable $x$ in this data set does
just happen to take values evenly spaced between zero and $2 \pi$, and
inspection of Figure~\ref{fig:foo} suggests the true regression may be
periodic with this period, I recommend using $L = 2 \pi$, which gives
a regression function of the form
$$
   E(Y | X) = \alpha + \sum_{k = 1}^m \beta_k \sin(k X) +
   \sum_{k = 1}^m \gamma_k \cos(k X)
$$
and we have changed the coefficients to Greek letters to indicate that
they are the population regression coefficients, which are unknown constants
that we have to estimate.

Use linear regression to find a sample regression
function that seems to fit the data better than the polynomial found in
Example~\ref{ex:foo}.  (It's up to you to figure out how high $m$ should be
and whether all the terms up to order $m$ should be included.  You don't
have to find the ``best'' model, whatever that means, just a good model.)
Hand in a plot of your fitted sample regression function with the data points
also plotted (like Figure~\ref{fig:footoo} and the output from the R
\texttt{summary} command showing the regression fit.  (The R functions for
sine and cosine are \texttt{sin} and \texttt{cos}.  The R for $\sin(2 x)$
is \texttt{sin(2 * x)}, because you need the \texttt{*} operator for
multiplication.  You will also need to wrap such terms in the \texttt{I()}
function, like \texttt{I(sin(2 * x))}.)
\end{problem}
\begin{comment}
\begin{problem}
Show \label{prob:mle} that the maximum likelihood estimates of the
parameters under the ``strong assumptions'' \eqref{eq:errors} and
\eqref{eq:ass-strong} are \eqref{eq:lse} for the regression coefficients
and \eqref{eq:sigsq-mle} for the error variance.
\end{problem}
\end{comment}

\begin{problem}
The data set in the URL
\begin{verbatim}
http://www.stat.umn.edu/geyer/5102/prob12-5.dat
\end{verbatim}
\noindent
has three variables \texttt{x1} and \texttt{x2}
(the predictor variables) and \texttt{y} (the response variable).

Fit three models to this data set (1) a ``linear'' model fitting a polynomial
of degree one in the two predictor variables, (2) a ``quadratic'' model
fitting a polynomial of degree two, and (3) a ``cubic'' model fitting a
polynomial of degree three.  Don't forget the terms of degree two and
three containing products of powers of the two predictor variables.

Print out the ANOVA table for comparing these three models and interpret
the $P$-values in the table.  Which model would you say is the best fitting,
and why?
\begin{comment}
  n <- 60
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  a <- 1
  b <- .5
  sig <- 2
  mu <- a * (x1 - x2) + b * (3 * x1^2 + 3 * x2^2 + 2 * x1 * x2)
  y <- mu + rnorm(n, 0, sig)

  out1 <- lm(y ~ x1 + x2)
  summary(out1)
  out2 <- lm(y ~ x1 + x2 + I(x1^2) + I(x1 * x2) + I(x2^2))
  summary(out2)
  out3 <- lm(y ~ x1 + x2 + I(x1^2) + I(x1 * x2) + I(x2^2)
  + I(x1^3) + I(x1^2 * x2) + I(x1 * x2^2) + I(x2^3))
  summary(out3)
  anova(out1, out2, out3)

 write(rbind(x1, x2, y), ncol=3, file="prob12-5.dat")
\end{comment}
\end{problem}

\begin{problem}
The data set in the URL
\begin{verbatim}
http://www.stat.umn.edu/geyer/5102/prob12-6.dat
\end{verbatim}
\noindent
has two variables \texttt{x}
(the predictor variable) and \texttt{y} (the response variable).
As a glance at a scatter plot of the data done in R by
\begin{verbatim}
plot(x, y)
\end{verbatim}
\noindent
shows, the relationship between $x$ and $y$ does not appear to be linear.
However, it does appear that a so-called \emph{piecewise linear} function
with a \emph{knot} at 11 may fit the data well.  The means a function
having the following three properties.
\begin{itemize}
\item It is linear on the interval $x \le 11$.
\item It is linear on the interval $x \ge 11$.
\item These two linear functions agree at $x = 11$.
\end{itemize}
Figure out how to fit this model using linear regression.
(For some choice of predictor variables, which are functions of $x$,
the regression function of the model is the piecewise linear function
described above.  Your job is to figure out what predictor variables
do this.)
\begin{alist}
\item Describe your procedure.  What predictors are you using?  How
many regression coefficients does your procedure have?
\item Use R to fit the model.  Report the parameter estimates
(regression coefficients and residual standard error).
\end{alist}
The following plot
\begin{verbatim}
plot(x, y)
lines(x, out$fitted.values)
\end{verbatim}
\noindent
puts a line of predicted values ($\hat{\boldy}$ in the notation used in the
notes and in Lindgren) on the scatter plot.  It may help you see when you
have got the right thing.  You do not have to turn in the plot.
\begin{comment}
  n <- 21
  x <- 1:n
  sig <- 3
  mu <- ifelse(x < median(x), x, 2 * (x - median(x)) + median(x))
  y <- mu + sig * rnorm(n)
  plot(x, y)
  lines(x, mu)

  x1 <- median(x) - x
  x2 <- - x1
  x1 <- ifelse(x1 > 0, x1, 0)
  x2 <- ifelse(x2 > 0, x2, 0)
  out <- lm(y ~ x1 + x2)
  summary(out)

  plot(x, y)
  lines(x, out$fitted.values)

 write(rbind(x, y), ncol=2, file="prob12-6.dat")
\end{comment}

\textbf{Hint:} The \texttt{ifelse} function in R defines vectors whose
values depend on a condition, for example
\begin{verbatim}
ifelse(x <= 11, 1, 0)
\end{verbatim}
\noindent
defines the indicator function of the interval $x \le 11$.
(This is \emph{not} one of the predictor variables you need for this problem.
It's a hint, but not that much of a hint.  The \texttt{ifelse} function may
be useful, this particular instance is not.)
\end{problem}

\begin{problem}
For the data in the URL
\begin{verbatim}
http://www.stat.umn.edu/geyer/5102/ex12.3.2.dat
\end{verbatim}
\begin{alist}
\item Find the 95\% percent prediction interval for an individual with
$x$ value 5 using the fifth degree polynomial model fit in
Examples~\ref{ex:foo} and~\ref{ex:foofoo} (this interval can be read
off Figure~\ref{fig:foothree}, but get the exact numbers from R).
\item Find the 95\% percent confidence interval for the population
regression function at the same $x$ value for the same model.
\end{alist}
\begin{comment}
out <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5))
predict(out, data.frame(x=5), interval="prediction")
predict(out, data.frame(x=5), interval="confidence")
out <- lm(y ~ I(sin(x)) + I(cos(x)) + I(sin(2 * x)) + I(cos(2 * x)))
predict(out, data.frame(x=5), interval="prediction")
predict(out, data.frame(x=5), interval="confidence")
\end{comment}
\end{problem}

\begin{problem}
Prove Theorem~\ref{th:resid}.
(Use $\hat{\boldy} = \boldH \boldy$, don't reprove it.)
\end{problem}

\begin{problem}
The data set
\begin{verbatim}
http://www.stat.umn.edu/geyer/5102/prob12-9.dat
\end{verbatim}
\noindent
contains three variables, \texttt{x}, \texttt{y}, and \texttt{z}.
Each is an \iid\ sample from some distribution.  The three variables
are independent of each other (this is not a regression problem).
Make Q-Q plots of the variables.  One is normal.  Which one?
Describe the features of the other two plots that make you think the
variables plotted are not normal.

It is an interesting comment on the usefulness of Q-Q plots that this
problem is essentially undoable at sample size 50 (no differences are
apparent in the plots).  It's not completely obvious at the sample
size 100 used here.  Fairly large sample sizes are necessary for
Q-Q plots to be useful.
\begin{comment}
n <- 100
x <- rnorm(n, 20, 5)
y <- 20 + 5 * rt(n, 5)
z <- 20 + 5 * (rgamma(n, 10, 1 / sqrt(10)) - sqrt(10))
mean(x); sd(x)
mean(y); sd(y)
mean(z); sd(z)
qqnorm(x); qqline(x)
qqnorm(y); qqline(y)
qqnorm(z); qqline(z)
write(rbind(y, z, x), ncol=3, file="prob12-9.dat")
\end{comment}
\end{problem}