\documentclass{article} \usepackage{indentfirst} \usepackage{verbatim} \RequirePackage{amsmath} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\mse}{mse} \DeclareMathOperator{\tr}{tr} \newcommand{\SSResid}{\text{\normalfont\upshape SSResid}} \newcommand{\boldbeta}{\boldsymbol{\beta}} \newcommand{\boldeta}{\boldsymbol{\eta}} \newcommand{\boldtheta}{\boldsymbol{\theta}} \newcommand{\boldmu}{\boldsymbol{\mu}} \newcommand{\boldp}{\mathbf{p}} \newcommand{\boldx}{\mathbf{x}} \newcommand{\boldy}{\mathbf{y}} \newcommand{\boldA}{\mathbf{A}} \newcommand{\boldD}{\mathbf{D}} \newcommand{\boldH}{\mathbf{H}} \newcommand{\boldI}{\mathbf{I}} \newcommand{\boldX}{\mathbf{X}} \newcommand{\mdot}{\,\cdot\,} \newcommand{\Ybar}{Y{\mkern -14.0 mu}\overline{\phantom{\text{Y}}}} \newcommand{\norm}[1]{\lVert #1 \rVert} \RequirePackage{amsfonts} \newcommand{\real}{\mathbb{R}} \newcommand{\centerfig}[2]{\relax} \RequirePackage{amsthm} \newtheorem{theorem}{Theorem} \newtheorem{lemma}[theorem]{Lemma} \newtheorem{corollary}[theorem]{Corollary} \newtheoremstyle{example}% name {\topsep}% Space above {\topsep}% Space below {}% Body font {}% Indent amount (empty = no indent, \parindent = para indent) {\bfseries}% Thm head font {.}% Punctuation after thm head {\newline}% Space after thm head: " " = normal interword space; {}% Thm head spec {\theoremstyle{example} \newtheorem{example}{Example}} \newenvironment{slogan}{\begin{quote}\em}{\end{quote}} \begin{document} \title{Model Selection in R} \author{Charles J. Geyer} \maketitle This used to be a section of my master's level theory notes. It is a bit overly theoretical for this R course. Just think of it as an example of literate programming in R using the \verb@Sweave@ function. You don't have to absorb all the theory, although it is there for your perusal if you are interested. This discusses the problem of choosing among many models. Although our examples will be regression problems and some of the discussion and methods will be specific to linear regression, the problem is general. When many models are under consideration, how does one choose the best? Thus we also discuss methods that apply outside the regression context. There are two main issues. \begin{itemize} \item Non-nested models \item Many models. \end{itemize} The only methods for model comparison we have studied, the $F$ test for comparison of linear regression models and the likelihood ratio test for comparison of general models, are valid only for comparing two \emph{nested} models. We also want to test \emph{non-nested} models, and for that we need new theory. When more than two models are under consideration, the issue of correction for multiple testing arises. If there are only a handful of models, Bonferroni correction (or some similar procedure) may suffice. When there are many models, a conservative correction like Bonferroni is too conservative. Let's consider how many models might be under consideration in a model selection problem. Consider a regression problem with $k$ predictor variables. \begin{itemize} \item \textbf{[Almost the worst case]} There are $2^k$ possible submodels formed by choosing a subset of the $k$ predictors to include in the model (because a set with $k$ elements has $2^k$ subsets). \item \textbf{[Actually the worst case]} That doesn't consider all the new predictors that one might ``make up'' using functions of the old predictors. Thus there are potentially infinitely many models under consideration. \end{itemize} Bonferroni correction for infinitely many tests is undefined. Even for $2^k$ tests with $k$ large, Bonferroni is completely pointless. It would make nothing statistically significant. \section{Overfitting} \begin{slogan} Least squares is \textbf{good} for model fitting, but \textbf{useless} for model selection. \end{slogan} Why? A bigger model \emph{always} has a smaller residual sum of squares, just because a minimum taken over a larger set is smaller. % \footnote{If % $g$ is any % real-valued function and $A$ and $B$ are two subsets of the domain of $g$ % with $A \subset B$ then % $$ % \inf_{x \in A} g(x) \ge \inf_{y \in B} g(y) % $$ % simply because every $x$ that occurs on the left hand side also occurs % as a $y$ on the right hand side because $A \subset B$. Taking $g$ to % be the least squares criterion for a regression model, and $A$ and $B$ % the parameter spaces for two nested models gives the result that the % larger model always has the smaller residual sum of squares. % % Note this is exactly analogous to what happens with maximum likelihood. % The larger model always has the larger log likelihood. The reasoning is % exactly the same except for the inequality being reversed because of maximizing % in maximum likelihood rather than minimizing in least squares.} Thus least squares, taken as a criterion for model selection says ``always choose the biggest model.'' But this is silly. Consider what the principle of choosing the biggest model says about polynomial regression. \begin{example}[Overfitting in Polynomial Regression] Consider \label{ex:over} the regression data in the file \verb@ex12.7.1.dat@ in this directory, which is read by <>= X <- read.table("ex12.7.1.dat", header = TRUE) names(X) attach(X) @ The ``simple'' linear regression is done and a plot made (Figure~\ref{fig:over1}) by the following <>= par(mar = c(5, 4, 1, 1) + 0.1) plot(x, y) out <- lm(y ~ x) abline(out) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Some regression data. With fitted ``simple'' regression function of the form $y = \hat{\alpha} + \hat{\beta} x$.} \label{fig:over1} \end{figure} Couldn't ask for nicer data for simple linear regression. The data appear to fit the ``simple'' model (one non-constant predictor, which we take to be $x$ itself). But now consider what happens when we try to be a bit more sophisticated. How do we know that the ``simple'' model is o.~k.? Perhaps we should consider some more complicated models. How about trying polynomial regression? But if we consider all possible polynomial models, that's an infinite number of models (polynomials of all orders). Although an infinite number of models are potentially under consideration, the fact that the data set is finite limits the number of models that actually need to be considered to a finite subset. You may recall from algebra that any set of $n$ points in the plane having $n$ different $x$ values can be interpolated (fit exactly) by a polynomial of degree $n - 1$. A polynomial that fits exactly has residual sum of squares zero (fits perfectly). Can't do better than that by the least squares criterion! Thus all polynomials of degree at least $n - 1$ will give the same fitted values and zero residual sum of squares. Although they may give different predicted values for $x$ values that do not occur in the data, they give the same predicted values at those that do. Hence the polynomials with degree at least $n - 1$ cannot be distinguished by least squares. In fact the polynomials with degree more than $n - 1$ cannot be fit at all, because they have more parameters than there are equations to determine them. $\boldX' \boldX$ is a singular matrix and cannot inverted to give unique estimates of the regression coefficients. This is a general phenomenon, which occurs in all settings, not just with linear regression. Models with more parameters than there are data points are underdetermined. Many different parameter vectors give the same likelihood, or the same empirical moments for the method of moments, or the same for whatever criterion is being used for parameter estimation. Thus in general, even when an infinite number of models are theoretically under consideration, only $n$ models are practically under consideration, where $n$ is the sample size. Hence the ``biggest model'' that the least squares criterion selects is the polynomial of degree $n - 1$. What does it look like? The following code fits this model <>= out.poly <- lm(y ~ poly(x, 9)) summary(out.poly) @ (Note the zero residual degrees of freedom. That's the overfitting!) Then <>= par(mar = c(5, 4, 1, 1) + 0.1) plot(x, y) curve(predict(out.poly, data.frame(x = x)), add = TRUE) abline(out, lty = 2) @ plots the two regression functions for our fits (Figure~\ref{fig:over2}) the best fitting (perfectly fitting!) polynomial of degree $n - 1 = 9$ and the least squares regression line from the other figure. \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Some regression data. With fitted linear regression function (dashed line) and ninth degree polynomial regression function (solid curve). See also Figure~\ref{fig:over3}.} \label{fig:over2} \end{figure} \begin{comment} n <- 10 x <- 1:10 sigma <- 1 alpha <- 0 beta <- 1 y <- alpha + beta * x + rnorm(n, 0, sigma) write(rbind(x, y), ncol=2, file="ex12.7.1.dat") X <- read.table("/usr/HP9/WEB/httpd/htdocs/geyer/5102/ex12.7.1.dat", header=T) attach(X) n <- length(x) out <- lm(y ~ x) summary(out) plot(x, y) abline(out) postscript(file="over1.ps", height=5, width=6, horizontal=FALSE) par(mar=c(5,4,0,0)+.1) plot(x, y) abline(out) norm <- function(x) sqrt(sum(x^2)) foo <- rep(1, n) foo <- foo / norm(foo) design <- as.matrix(foo) for (i in 1:(n - 1)) { foo <- (x - mean(x))^i a <- t(design) %*% foo bar <- design %*% a bar <- foo - bar bar <- bar / norm(bar) design <- cbind(design, bar) } outn <- lm(y ~ design + 0) summary(outn) xx <- seq(0, n + 1, 0.01) foo <- rep(1, n) fool <- rep(1, length(xx)) fool <- fool / norm(foo) foo <- foo / norm(foo) design <- as.matrix(foo) desl <- as.matrix(fool) for (i in 1:(n - 1)) { foo <- (x - mean(x))^i fool <- (xx - mean(x))^i a <- t(design) %*% foo bar <- design %*% a barl <- desl %*% a bar <- foo - bar barl <- fool - barl barl <- barl / norm(bar) bar <- bar / norm(bar) design <- cbind(design, bar) desl <- cbind(desl, barl) } all(design == desl[match(x, xx), ]) predl <- desl %*% outn$coef plot(x, y) abline(out) lines(xx, predl) postscript(file="over2.ps", height=5, width=6, horizontal=FALSE) par(mar=c(5,4,0,0)+.1) plot(x, y, ylim=range(predl[xx >= 1 & xx <= n])) lines(xx, predl) abline(out, lty=2) \end{comment} How well does the biggest model do? It fits the observed data perfectly, but it's hard to believe that it would fit \emph{new} data from the same population as well. The extreme oscillations near the ends of the range of the data are obviously nonsensical, but even the smaller oscillations in the middle seem to be tracking random noise rather than any real features of the population regression function. Of course, we don't actually know what the true population regression function is. It could be either of the two functions graphed in the figure, or it could be some other function. But it's hard to believe, when the linear function fits so well, that something as complicated as the ninth degree polynomial is close to the true regression function. We say it ``overfits'' the data, meaning it's \emph{too} close to the data and not close enough to the true population regression function. Let's try again on Figure~\ref{fig:over2}. Perhaps we would like to see the extent of those big oscillations. Something like this should do it. <>= par(mar = c(5, 4, 1, 1) + 0.1) xx <- seq(min(x), max(x), length = 1000) yy <- predict(out.poly, data.frame(x = xx)) plot(x, y, ylim = range(yy)) curve(predict(out.poly, data.frame(x = x)), add = TRUE) abline(out, lty = 2) @ The result is shown in Figure~\ref{fig:over3}. \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Same as Figure~\ref{fig:over2} except for greater vertical extent.} \label{fig:over3} \end{figure} \end{example} \section{Mean Square Error} So what criterion should we use for model selection if residual sum of squares is no good? One theoretical criterion is mean square error. In the regression setting, it is unclear what quantities mean square error should apply to. Do we use the mean square error of the parameter estimates? We haven't even defined mean square error for vector quantities. That alone suggests we should avoid looking at MSE of regression coefficients. There is also our slogan that regression coefficients are meaningless. Hence we should look at estimates of the regression function. But here too, there are still issues that need to be clarified. The regression function $h(\boldx) = E(Y \mid \boldx)$ is a scalar function of $\boldx$. (Hence we don't need to worry about MSE of a vector quantity). But it is a function of the predictor value $\boldx$. \begin{equation} \label{eq:mse-of-x} \mse\{\hat{h}(\boldx)\} = \text{variance} + \text{bias}^2 = \var\{\hat{h}(\boldx)\} + \left(E\{\hat{h}(\boldx)\} - h(\boldx)\right)^2 \end{equation} where $h(\boldx)$ is the true population regression function and $\hat{h}(\boldx)$ is an estimate (we are thinking of least squares regression estimates, but \eqref{eq:mse-of-x} applies to any estimate). (``Bias?''\ did I hear someone say? Aren't linear regression estimates \emph{unbiased}? Yes, they are when the model is \emph{correct}. Here we are considering cases when the model is too small to contain the true regression function.) To make a criterion that we can minimize to find the best model, we need a single scalar quantity, not a function. There are several things we could do to make a scalar quantity from \eqref{eq:mse-of-x}. We could \emph{integrate} it over some range of values, obtaining so-called \emph{integrated mean squared error}. A simpler alternative is to sum it over the \emph{design points}, the $\boldx$ values occurring in the data set under discussion. We'll discuss only the latter. Write $\boldmu$ for the true population means of the responses given the predictors, defined by $\mu_i = h(\boldx_i)$. Let $m$ index models. The $m$-th model will have design matrix $\boldX_m$ and hat matrix \begin{equation} \label{eq:hat} \boldH_m = \boldX_m (\boldX_m' \boldX_m)^{-1} \boldX_m'. \end{equation} It is called the hat matrix because it ``puts the hat on'' $\boldy$, that is, $\hat{\boldy}_m = \boldH_m \boldy$ is the vector of predicted values for the $m$-th model. The expected value of the regression predictions is $$ E(\hat{\boldy}_m) = \boldH_m \boldmu. $$ Hence the bias is \begin{equation} \label{eq:bias-of-reg} E(\hat{\boldy}) - E(\boldy) = - (\boldI - \boldH_m) \boldmu. \end{equation} If the $m$-th model is correct, then $\boldmu$ is in the range of $\boldH_m$ and the bias is zero. Models that are incorrect have nonzero bias. The bias \eqref{eq:bias-of-reg} is, of course, a vector. Its $i$-th element gives the bias of $\hat{y}_i$. What we decided to study was the sum of the MSEs at the design points, the ``bias'' part of which is just the sum of squares of the elements of \eqref{eq:bias-of-reg}, which is the same thing as the squared length of this vector \begin{equation} \label{eq:the-bias} \text{bias}^2 = \norm{(\boldI - \boldH_m) \boldmu}^2 = \boldmu' (\boldI - \boldH_m)^2 \boldmu = \boldmu' (\boldI - \boldH_m) \boldmu. \end{equation} The variance is \begin{align*} \var(\hat{\boldy}) & = E\left\{ \bigl\lVert \hat{\boldy} - E(\hat{\boldy}) \bigr\rVert^2 \right\} \\ & = E\left\{ \bigl\lVert \hat{\boldy} - \boldH_m \boldmu \bigr\rVert^2 \right\} \\ & = E\left\{ \bigl\lVert \boldH_m (\boldy - \boldmu) \bigr\rVert^2 \right\} \\ & = E\left\{ \boldH_m (\boldy - \boldmu)' (\boldy - \boldmu) \boldH_m \right\} \\ & = \sigma^2 \boldH_m^2 \\ & = \sigma^2 \boldH_m \end{align*} This is, of course, a matrix. What we want though is just the sum of the diagonal elements. The $i$-th diagonal element is the variance of $\hat{y}_i$, and our decision to focus on the sum of the MSEs at the design points says we want the sum of these. The sum of the diagonal elements of a square matrix $\boldA$ is called its \emph{trace}, denoted $\tr(\boldA)$. Thus the ``variance'' part of the MSE is \begin{equation} \label{eq:the-var} \text{variance} = \sigma^2 \tr(\boldH_m). \end{equation} And $$ \mse_m = \sum_{i = 1}^n \mse(\hat{y}_i) = \sigma^2 \tr(\boldH_m) + \boldmu' (\boldI - \boldH_m) \boldmu. $$ \section{The Bias-Variance Trade-Off} Generally, one cannot reduce both bias and variance at the same time. Bigger models have less bias but \emph{more} variance. Smaller models have less variance but \emph{more} bias. This is called the ``bias-variance trade-off.'' \begin{example} Consider \label{ex:trade} the regression data in the file \verb@ex12.3.2.dat@ in this directory, which is read by <>= detach(X) @ <>= X <- read.table("ex12.3.2.dat", header = TRUE) names(X) attach(X) @ Figure~\ref{fig:sinus-scatter} shows the scatter plot. \begin{figure} \begin{center} <>= plot(x, y) @ \end{center} \caption{Some regression data.} \label{fig:sinus-scatter} \end{figure} A mere glance at the plot shows that $y$ is decidedly not a linear function of $x$, not even close. However, there is nothing that prevents us from making up more predictor variables. The data set itself has no other variables in it, just $x$ and $y$. So any new predictor variables we make up must be functions of $x$. What functions? Since we are told nothing about the data, we have no guidance as to what functions to make up. In a real application, there might be some guidance from scientific theories that describe the data. Or there might not. Users of linear regression often have no preconceived ideas as to what particular functional form the regression function may have. One thing that is often done, more or less for lack of any better ideas, is to try a polynomial. The following fits a bunch of polynomials and compares them. <>= out1 <- lm(y ~ x) out2 <- update(out1, . ~ . + I(x^2)) out3 <- update(out2, . ~ . + I(x^3)) out4 <- update(out3, . ~ . + I(x^4)) out5 <- update(out4, . ~ . + I(x^5)) out6 <- update(out5, . ~ . + I(x^6)) out7 <- update(out6, . ~ . + I(x^7)) anova(out1, out2, out3, out4, out5, out6, out7) @ (no particular reason for stopping at 7). It is clear from the printout that odd degree polynomials fit way better than the even ones (strictly speaking, better than the immediately preceding even one to which they are compared). In hindsight, this is obvious. A glance at Figure~\ref{fig:sinus-scatter} shows that the regression function is bending down on the left side and up on the right. This is the behavior of an odd degree polynomial. An even degree polynomial heads the same way at both sides (both up or both down). Thus we need only consider odd degree polynomials. Let's add some more. <>= out9 <- update(out7, . ~ . + I(x^8) + I(x^9)) out11 <- update(out9, . ~ . + I(x^10) + I(x^11)) out13 <- update(out11, . ~ . + I(x^12) + I(x^13)) anova(out1, out3, out5, out7, out9, out11, out13) @ Hmmmmm. Seems we stopped in the right place. Model 4 here, which is the polynomial of degree 7 is clearly fits much better than model 3. But models 5, 6, and 7 don't fit significantly better than model 4. Thus it is fairly clear that model 4 is the one to go with. Double Hmmmmm. The 1 degree of freedom for comparing models 6 and 7 is wrong. The only reason I can think of for this is that using the monomials as predictors is very unstable. Let's try again. <>= out3 <- lm(y ~ poly(x, 3)) out5 <- lm(y ~ poly(x, 5)) out7 <- lm(y ~ poly(x, 7)) out9 <- lm(y ~ poly(x, 9)) out11 <- lm(y ~ poly(x, 11)) out13 <- lm(y ~ poly(x, 13)) anova(out1, out3, out5, out7, out9, out11, out13) @ No difference at all except for that last comparison, which \verb@poly@ gets right! Figure~\ref{fig:sinus-poly} shows the scatter plot again with the regression function for the 7th degree polynomial added. \begin{figure} \begin{center} <>= plot(x, y) curve(predict(out7, data.frame(x = x)), add = TRUE) @ \end{center} \caption{Same regression data as in Figure~\ref{fig:sinus-scatter}. Now with sample regression function in the model of all polynomials of degree seven.} \label{fig:sinus-poly} \end{figure} From a data analytic point of view, there's nothing more to be said. From a theoretical point of view, we haven't even started. Theoretically, we haven't yet looked at the bias-variance tradeoff, or for that matter at the bias or variance. These are theoretical quantities that cannot be discovered by looking at fitted models. To know the bias we need to know the true population regression function. I do happen to know that here because this is simulated data. The true population regression function is trignometric, given by \begin{subequations} \begin{align} \mu_i & = \sin(x_i) + \sin(2 x_i) \label{eq:trade-mu} \\ \sigma & = 0.2 \label{eq:trade-sig} \end{align} \end{subequations} Again we consider various polynomial regression models. Since we are not being theoretical, we use only the $x$ values and ignore the $y$ values. Since the true population regression curve is a trigonometric rather than a polynomial function, \emph{no} polynomial is unbiased. This is typical of real applications. No model under consideration is exactly correct. The R function \verb@lm@ returns the design matrix if asked, and from it we can construct the hat matrix, for example <>= out7 <- lm(y ~ poly(x, 7), x = TRUE) hat7 <- out7$x %*% solve(t(out7$x) %*% out7$x) %*% t(out7$x) ### check that hat matrix really does put the hat on y all.equal(predict(out7), as.numeric(hat7 %*% y)) @ Then (integrated) bias squared and variance are given by <>= mu <- sin(x) + sin(2 * x) sigma <- 0.2 n <- length(x) ### bias squared t(mu) %*% (diag(n) - hat7) %*% mu ### variance sigma^2 * sum(diag(hat7)) @ Put the above in a loop. <>= kmax <- 20 bsq <- double(kmax) v <- double(kmax) for (k in 1:kmax) { out <- lm(y ~ poly(x, k), x = TRUE) hat <- out$x %*% solve(t(out$x) %*% out$x) %*% t(out$x) bsq[k] <- t(mu) %*% (diag(n) - hat) %*% mu v[k] <- sigma^2 * sum(diag(hat)) } foo <- cbind(1:kmax, bsq, v, bsq + v) dimnames(foo) <- list(NULL, c("degree", "bias^2", "variance", "mse")) foo @ The most interesting fact here is that the best model in terms of mean square error has degree 9 (not the degree 7 model chosen by the data analysis). Of course, to know the theoretically best model, we need to know the true population regression function \verb@mu@ and when we are doing a real data analysis, we don't. The point is that there is a theoretically best model and the data analysis need not find it. Also very interesting is that we don't want to be less biased, much less unbiased. Many people exposed to bad statistics teaching think unbiasedness is absolutely wonderful. But here we see that less bias is bad when we go beyond the point of optimum bias-variance tradeoff. This is quite a general point. Linear regression is only unbiased when all of the model assumptions are correct: the true population regression function is among those under consideration (exactly) and the errors are IID (again exactly). In real life, these assumptions are rarely if ever known to be correct. So unbiasedness is mythical. Of minor interest is the property that the even degree models have the same bias as the next lower odd degree model. This is because the true population regression is an odd function, satisfying $f(x) = - f(- x)$, which makes the true population regression coefficients on the even degree polynomials, which are even functions, satisfying $f(x) = f(- x)$, exactly zero. Figure~\ref{fig:trade} shows both the true regression function used to simulate the response values \eqref{eq:trade-mu} and the sample ninth-degree polynomial regression function. The R statements making the figure are <>= par(mar = c(5, 4, 1, 1) + 0.1) plot(x, y) curve(sin(x) + sin(2 * x), add = TRUE) curve(predict(out9, data.frame(x = x)), add = TRUE, lty = 2) @ The result is shown in Figure~\ref{fig:trade}. \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Some regression data with true regression function (solid line) and sample regression function (dashed line).} \label{fig:trade} \end{figure} We see that the sample regression function does not estimate the true regression function perfectly (because the sample is not the population), but we now know from the theoretical analysis in this example that no polynomial will fit better. A lower degree polynomial will have less variance. A higher degree will have less bias. But the bias-variance trade-off will be worse for either. \end{example} \section{Model Selection Criteria} The theory discussed in the preceding section gives us a framework for discussing model selection. We want the model with the smallest MSE, the model which makes the optimal bias-variance trade-off. Unfortunately, this is useless in practice. Mean square error is a theoretical quantity. It depends on the unknown true regression function and the unknown error variance. Furthermore, there is no obvious way to estimate it. Without knowing which models are good, which is exactly the question we are trying to resolve, we can't get a good estimate of the true regression function (and without that we can't estimate the error variance well either). There are several quantities that have been proposed in the literature as estimates of MSE. No one is obviously right. We shall not go into the theory justifying them (it is merely heuristic anyway). One fairly natural quantity is \begin{equation} \label{eq:press} \sum_{i = 1}^n \hat{e}_{(i)}^2 \end{equation} where $\hat{e}_{(i)}$ is the so-called leave-one-out residual $$ \hat{e}_{(i)} = y_i - \hat{y}_{(i)} $$ where $\hat{y}_{(i)}$ is the predicted value at the point $x_i$ from the model fitted to the data that \emph{leaves out $y_i$}. The idea is that this gives an ``honest'' estimate of the error because $y_i$ is not used in calculating $\hat{y}_{(i)}$. Hence $\hat{y}_{(i)}$ can't be ``too close'' to $y_i$ since it doesn't ``know'' about $y_i$. It turns out that it is not necessary to do $n$ different regressions (each leaving out one $y_i$). All of the leave-one-out residuals can be calculated from the fit in the full regression model using \begin{equation} \label{eq:lout} \hat{e}_{(i)} = \frac{\hat{e}_i}{1 - h_{i i}} \end{equation} where $h_{i i}$ is the $i$, $i$ element of the hat matrix. This nifty algebraic fact makes \eqref{eq:press} much easier to calculate than it appears at first sight and consequently makes \eqref{eq:press} a much more useful criterion. Unlike $\SSResid$ (the sum of the $\hat{e}_i^2$), this does not always favor the biggest model. Models that overfit tend to do a bad job of their leave-one-out predictions. \eqref{eq:press} is called the \emph{predicted residual sum of squares} (PRESS) or the \emph{cross-validated sum of squares} (CVSS), cross-validation being another term used to describe the leave-one-out idea. The idea of CVSS or other criteria to be described presently is to pick the model with the smallest value of the criterion. This will not necessarily be the model with the smallest MSE, but it is a reasonable estimate of it. Another criterion somewhat easier to calculate is Mallows' $C_p$ defined by \begin{equation} \label{eq:cp} \begin{split} C_p & = \frac{\SSResid_p}{\hat{\sigma}^2} + 2 p - n \\ & = \frac{\SSResid_p - \SSResid_k}{\hat{\sigma}^2} + p - (k - p) \\ & = (k - p) (F_{k - p, n - k} - 1) + p \end{split} \end{equation} where $\SSResid_p$ is the sum of squares of the residuals for some model with $p$ predictors (including the constant predictor, if present), $\hat{\sigma}^2 = \SSResid_k / (n - k)$ is the estimated error variance for the largest model under consideration, which has $k$ predictors, and $F_{k - p, n - k}$ is the $F$ statistic for the $F$ test for comparison of these two models. The $F$ statistic is about one in size if the small model is correct, in which case $C_p \approx p$. This gives us a criterion for finding reasonably fitting models. When many models are under consideration, many of them may have $C_p \approx p$ or smaller. All such models must be considered reasonably good fits. Any might be the correct model or as close to correct as any model under consideration. The quantity estimated by $C_p$ is the mean square error of the model with $p$ predictors, divided by $\sigma^2$. An idea somewhat related to Mallows' $C_p$, but applicable outside the regression context is the \emph{Akaike information criterion} (AIC). \begin{equation} \label{eq:aic} - 2 \cdot (\text{log likelihood}) + 2 p \end{equation} where as in Mallows' $C_p$, the number of parameters in the model is $p$. Although \eqref{eq:cp} and \eqref{eq:aic} both have the term $2 p$, they are otherwise different. The log likelihood for a linear regression model, with the MLEs plugged in for the parameters is $$ \text{log likelihood} = \text{constant} - \frac{n}{2} \log(\SSResid) $$ Thus for a regression model with $p$ predictors, $$ \text{AIC} = \text{constant} + n \log(\SSResid_p) + 2 p $$ where ``constant'' here is a term that does not depend on the model and hence does not matter in model comparisons. Finally, we add one last criterion, almost the same as AIC, called the \emph{Bayes information criterion} (BIC) \begin{equation} \label{eq:bic} - 2 \cdot (\text{log likelihood}) + p \log(n) \end{equation} In the regression context, this becomes $$ \text{BIC} = \text{constant} + n \log(\SSResid_p) + p \log(n). $$ Neither AIC nor BIC have a rigorous theoretical justification applicable to a wide variety of models. Both were derived for special classes of models that were easy to analyze and both involve some approximations. Neither can be claimed to be the right thing (nor can anything else). As the ``Bayes'' in BIC indicates, the BIC criterion is intended to approximate using Bayes tests instead of frequentist tests (although its approximation to true Bayes tests is fairly crude). Note that BIC penalizes models with more parameters more strongly than AIC ($p \log n$ versus $2 p$). So BIC always selects a larger model than AIC. This gives us four criteria for model selection. There are arguments in favor of each. None of the arguments are completely convincing. All are widely used. \begin{example} We use the data for Example~\ref{ex:trade} again. Now we fit polynomials of various degrees to the data, and look at our four criteria. <>= n <- length(y) deg <- seq(1, 15, 2) cvss <- double(length(deg)) cp <- double(length(deg)) aic <- double(length(deg)) bic <- double(length(deg)) out.big <- lm(y ~ poly(x, 17)) sigsqhat.big <- summary(out.big)$sigma^2 for (i in seq(along = deg)) { k <- deg[i] out <- lm(y ~ poly(x, k)) aic[i] <- AIC(out) bic[i] <- AIC(out, k = log(n)) cvss[i] <- sum((out$residuals / (1 - hatvalues(out)))^2) cp[i] <- sum(out$residuals^2) / sigsqhat.big + 2 * out$rank - n } foo <- cbind(deg, cvss, cp, aic, bic) dimnames(foo) <- list(NULL, c("degree", "CVSS", "Cp", "AIC", "BIC")) foo @ In this example, all four criteria select the same model, the polynomial of degree 7. This is not the model with the smallest MSE discovered by the theoretical analysis. The criteria do something sensible, but as everywhere else in statistics, there are errors (the sample is not the population). \end{example} \section{All Subsets Regression} We now return to the situation in which there are $k$ predictors including the constant predictor and $2^{k - 1}$ models under consideration (the constant predictor is usually included in all models). If $k$ is large and one has no non-statistical reason (e.~g., a practical or scientific reason) that cuts down the number of models to be considered, then one must fit them all. Fortunately, there are fast algorithms that allow a huge number of models to be fit or at least quickly checked to see that they are much worse than other models of the same size. There is a contributed package to R that contains a function \texttt{leaps} that does this. \begin{example}[$2^k$ subsets] The \label{ex:twokey} data set in the file \verb@ex12.7.4.dat@ in this directory, read by <>= detach(X) @ <>= X <- read.table("ex12.7.4.dat", header = TRUE) names(X) attach(X) @ consists of multivariate normal data with one response variable $y$ and 20 predictor variables $x_1$, $\ldots$, $x_{20}$. The predictors are correlated. The distribution from which they were simulated has all correlations equal to one-half. The actual (sample) correlations, of course, are all different because of chance variation. The true population regression function (the one used to simulate the $y$ values) was \begin{equation} \label{eq:twokey-truth} y = x_1 + x_2 + x_3 + x_4 + x_5 + e \end{equation} with error variance $\sigma^2 = 1.5^2$. We fit the model by <>= out <- lm(y ~ ., data = X, x = TRUE) summary(out) @ We've left in the significance codes, bogus though they may be (more on this later), so you can easily spot the regression coefficients that the least squares fit indicates may be important. If we go only with the strongest evidence (two or three stars) we get as ``significant'' three of the five truly important regression coefficients [recall from \eqref{eq:twokey-truth} that the true nonzero regression coefficients are $\beta_1$ through $\beta_5$]. The other two are missed. If we use a less stringent standard, say one or more stars, we do pick up another truly nonzero regression coefficient, but we also pick up a false one. Thus we now have both false negatives (we still missed $\beta_3$) and false positives (we've picked up $\beta_8$). With the least stringent standard, all the coefficients marked by any of the ``significance codes'' we now have no false negatives (all five of the truly nonzero regression coefficients are now declared ``significant'') but we have four false positives. No matter how you slice it, least squares regression doesn't pick the right model. Of course, this is no surprise. It's just ``the sample is not the population.'' But it does show that the results of such model selection procedures must be treated with skepticism. Actually, we haven't even started a sensible model selection procedure. Recall the slogan that if you want to know how good a model fits, you have to fit that model. So far we haven't fit any of the models we've discussed. We're fools to think we can pick out the good submodels just by looking at printout for the big model. There is a function \texttt{leaps} in the \texttt{leaps} contributed package\footnote{This has to be installed separately. It doesn't come with the default install. Like everything else about R, it can be found at \texttt{http://cran.r-project.org}.} that fits a huge number of models. By default, it finds the 10 best models of each size (number of regression coefficients) for which there are 10 or more models and finds all the models of other sizes. It uses the inequality that a bigger model always has a smaller sum of squares to eliminate many models. Suppose we have already found 10 models of size $p$ with $\SSResid$ less than 31.2. Suppose there was a model of size $p + 1$ that we fit and found its $\SSResid$ was 38.6. Finally suppose $\hat{\sigma}^2$ for the big model is 2.05. Now the $C_p$ for the 10 best models of size $p$ already found is $$ C_p = \frac{\SSResid_p}{\hat{\sigma}^2} + 2 p - n < \frac{31.2}{2.05} + 2 p - n $$ and the $C_p$ for any submodel of size $p$ of the model with $\SSResid = 38.6$ (i.~e., models obtained by dropping one predictor from that model) has $$ C_p \ge \frac{38.6}{2.05} + 2 p - n $$ This means that no such model can be better than the 10 already found, so they can be rejected even though we haven't bothered to fit them. Considerations of this sort make it possible for \texttt{leaps} to pick the 10 best of each size without fitting all or even a sizable fraction of the models of each size. Thus it manages to do in minutes what it couldn't do in a week if it actually had to fit all $2^k$ models. The reason why \texttt{leaps} uses $C_p$ as its criterion rather than one of the others is that it is a simple function of $\SSResid$ and hence to these inequalities that permit its efficient operation. We run the leaps function as follows, with the design matrix \texttt{x} defined as above,\footnote{For some reason, \texttt{leaps} doesn't take formula expressions like \texttt{lm} does. The reason is probably historical. The equivalent S-plus function doesn't either, because it was written before S had model formulas and hasn't changed. The \texttt{strictly.compatible=FALSE} tells R not to be bug-for-bug compatible with S-plus.} <>= par(mar = c(5, 4, 1, 1) + 0.1) x <- out$x dim(x) library(leaps) outs <- leaps(x, y, int = FALSE, strictly.compatible = FALSE) plot(outs$size, outs$Cp, log = "y", xlab = "p", ylab = expression(C[p])) lines(outs$size, outs$size) @ Figure~\ref{fig:leaps} shows the plot made by the two plot commands. \begin{figure} \begin{center} <>= <> @ \end{center} \caption{$C_p$ plot. Plot of $C_p$ versus $p$. The dots are the $C_p$ for the 10 best models for each $p$. The curve is the line $C_p = p$ (curved because of the log scale for $C_p$).} \label{fig:leaps} \end{figure} Every model with $C_p < p$, corresponding to the dots below the line is ``good.'' There are a \emph{huge} number of perfectly acceptable models, because for the larger $p$ there are many more than 10 good models, which are not shown. The best model according to the $C_p$ criterion is one with $p = 12$, so 11 non-constant predictors, which happen to be $x_1$ through $x_5$ (the truly significant predictors) plus $x_8$, $x_{10}$, $x_{11}$, $x_{12}$, $x_{14}$, and $x_{20}$. We can get its regression output as follows. <>= ifoo <- outs$Cp == min(outs$Cp) ifoo <- outs$which[ifoo, ] foo <- x[ , ifoo] out.best <- lm(y ~ foo + 0) summary(out.best) @ Note that the ``stargazing'' doesn't correspond with the notion of the best model by the $C_p$ criterion. One of these coefficients doesn't even have a dot (so for it $P > 0.10$), and four others only have dots ($0.05 < P < 0.10$). Considering them separately, this would lead us to drop them. But that would be the Wrong Thing (multiple testing without correction). The \texttt{leaps} function does as close to the Right Thing as can be done. The only defensible improvement would be to change the criterion, to BIC perhaps, which would choose a smaller ``best'' model because it penalizes larger models more. However BIC wouldn't have the nice inequalities that make \texttt{leaps} so efficient, which accounts for the use of $C_p$. I hope you can see from this analysis that model selection when there are a huge number of models under consideration and no extra-statistical information (scientific, practical, etc.)\ that can be used to cut down the number is a mug's game. The best you can do is not very good. The only honest conclusion is that a huge number of models are about equally good, as good as one would expect the correct model to be ($C_p \approx p$). Thus it is silly to get excited about exactly which model is chosen as the ``best'' by some model selection procedure (any procedure)! When many models are equally good, the specific features of any one of them can't be very important. All of this is related to our slogan about ``regression is for prediction, not explanation.'' All of the models with $C_p < p$ predict about equally well. So if regression is used for \emph{prediction}, the model selection problem is not serious. Just pick any one of the many good models and use it. For \emph{prediction} it doesn't matter which good prediction is used. But if regression is used for \emph{explanation}, the model selection problem is insoluble. If you can't decide which model is ``best'' and are honest enough to admit that lots of other models are equally good, then how can you claim to have found the predictors which ``explain'' the response? Of course, if you really understand ``correlation is not causation, and regression isn't either,'' then you know that such ``explanations'' are bogus anyway, even in the ``simple'' case (one non-constant predictor) where the model selection problem does not arise. \begin{comment} n <- 100 k <- 20 M <- diag(k) M <- M + .5 diag(M) <- 1 L <- chol(M) t(L) %*% L z <- matrix( rnorm(n * k), nrow=k) x <- t(z) %*% L var(x) beta <- rep(0, k) beta[1:5] <- 1 mu <- x %*% beta sigma <- 1.5 y <- mu + sigma * rnorm(n) write(t(cbind(y, x)), ncol=ncol(x)+1, file="ex12.7.4.dat") X <- read.table("/usr/HP9/WEB/httpd/htdocs/geyer/5102/ex12.7.4.dat", header=T) attach(X) foo <- as.matrix(X) x <- foo[ , -1] out <- lm(y ~ x) summary(out) library(leaps) postscript(file="leaps.ps", height=5, width=6, horizontal=FALSE) par(mar=c(5,4,0,0)+.1) outs <- leaps(x, y, strictly.compatible=FALSE) plot(outs$size, outs$Cp, log="y", xlab="p", ylab=expression(C[p])) lines(outs$size, outs$size) outs$which[outs$Cp == min(outs$Cp)] 11 NA NA NA NA NA NA NA NA NA NA NA NA TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE NA NA NA NA NA NA NA TRUE FALSE FALSE FALSE FALSE FALSE TRUE foo <- x[ , outs$which[outs$Cp == min(outs$Cp)]] out.best <- lm(y ~ foo) summary(out.best) \end{comment} \end{example} \end{document} \section*{Problems} \begin{problem} Prove the assertion $\hat{\boldbeta} = (\boldX' \boldX)^{- 1} \boldX' \hat{\boldy}$ in the proof of Theorem~\ref{th:ssresid-dist}. \end{problem} \begin{problem} For the data in Example~\ref{ex:multreg} give 90\% confidence intervals for $\alpha$ and $\beta_1$ (the intercept and coefficient for $x_1$, the coefficient for $x_2$ was done in Example~\ref{ex:multreg-cont} \end{problem} \begin{problem} Fill in the details of the proof of Lemma~\ref{th:hat}. \end{problem} \begin{problem} For \label{prob:foo} the data in Example~\ref{ex:foo} in the URL \begin{verbatim} http://www.stat.umn.edu/geyer/5102/ex12.3.2.dat \end{verbatim} \noindent try fitting some Fourier series models. A Fourier series is sum of sines and cosines of multiples of one fundamental frequency $$ f(x) = a + \sum_{k = 1}^m b_k \sin(2 \pi k x / L) + \sum_{k = 1}^m c_k \cos(2 \pi k x / L) $$ where $L$ is a known constant (the wavelength of the lowest frequency sine and cosine terms) and $a$, the $b_k$, and the $c_k$ are adjustable constants. These adjustable constants will be the regression coefficients you fit using linear regression. A Fourier series is always periodic with period $L$. Since the variable $x$ in this data set does just happen to take values evenly spaced between zero and $2 \pi$, and inspection of Figure~\ref{fig:foo} suggests the true regression may be periodic with this period, I recommend using $L = 2 \pi$, which gives a regression function of the form $$ E(Y | X) = \alpha + \sum_{k = 1}^m \beta_k \sin(k X) + \sum_{k = 1}^m \gamma_k \cos(k X) $$ and we have changed the coefficients to Greek letters to indicate that they are the population regression coefficients, which are unknown constants that we have to estimate. Use linear regression to find a sample regression function that seems to fit the data better than the polynomial found in Example~\ref{ex:foo}. (It's up to you to figure out how high $m$ should be and whether all the terms up to order $m$ should be included. You don't have to find the ``best'' model, whatever that means, just a good model.) Hand in a plot of your fitted sample regression function with the data points also plotted (like Figure~\ref{fig:footoo} and the output from the R \texttt{summary} command showing the regression fit. (The R functions for sine and cosine are \texttt{sin} and \texttt{cos}. The R for $\sin(2 x)$ is \texttt{sin(2 * x)}, because you need the \texttt{*} operator for multiplication. You will also need to wrap such terms in the \texttt{I()} function, like \texttt{I(sin(2 * x))}.) \end{problem} \begin{comment} \begin{problem} Show \label{prob:mle} that the maximum likelihood estimates of the parameters under the ``strong assumptions'' \eqref{eq:errors} and \eqref{eq:ass-strong} are \eqref{eq:lse} for the regression coefficients and \eqref{eq:sigsq-mle} for the error variance. \end{problem} \end{comment} \begin{problem} The data set in the URL \begin{verbatim} http://www.stat.umn.edu/geyer/5102/prob12-5.dat \end{verbatim} \noindent has three variables \texttt{x1} and \texttt{x2} (the predictor variables) and \texttt{y} (the response variable). Fit three models to this data set (1) a ``linear'' model fitting a polynomial of degree one in the two predictor variables, (2) a ``quadratic'' model fitting a polynomial of degree two, and (3) a ``cubic'' model fitting a polynomial of degree three. Don't forget the terms of degree two and three containing products of powers of the two predictor variables. Print out the ANOVA table for comparing these three models and interpret the $P$-values in the table. Which model would you say is the best fitting, and why? \begin{comment} n <- 60 x1 <- rnorm(n) x2 <- rnorm(n) a <- 1 b <- .5 sig <- 2 mu <- a * (x1 - x2) + b * (3 * x1^2 + 3 * x2^2 + 2 * x1 * x2) y <- mu + rnorm(n, 0, sig) out1 <- lm(y ~ x1 + x2) summary(out1) out2 <- lm(y ~ x1 + x2 + I(x1^2) + I(x1 * x2) + I(x2^2)) summary(out2) out3 <- lm(y ~ x1 + x2 + I(x1^2) + I(x1 * x2) + I(x2^2) + I(x1^3) + I(x1^2 * x2) + I(x1 * x2^2) + I(x2^3)) summary(out3) anova(out1, out2, out3) write(rbind(x1, x2, y), ncol=3, file="prob12-5.dat") \end{comment} \end{problem} \begin{problem} The data set in the URL \begin{verbatim} http://www.stat.umn.edu/geyer/5102/prob12-6.dat \end{verbatim} \noindent has two variables \texttt{x} (the predictor variable) and \texttt{y} (the response variable). As a glance at a scatter plot of the data done in R by \begin{verbatim} plot(x, y) \end{verbatim} \noindent shows, the relationship between $x$ and $y$ does not appear to be linear. However, it does appear that a so-called \emph{piecewise linear} function with a \emph{knot} at 11 may fit the data well. The means a function having the following three properties. \begin{itemize} \item It is linear on the interval $x \le 11$. \item It is linear on the interval $x \ge 11$. \item These two linear functions agree at $x = 11$. \end{itemize} Figure out how to fit this model using linear regression. (For some choice of predictor variables, which are functions of $x$, the regression function of the model is the piecewise linear function described above. Your job is to figure out what predictor variables do this.) \begin{alist} \item Describe your procedure. What predictors are you using? How many regression coefficients does your procedure have? \item Use R to fit the model. Report the parameter estimates (regression coefficients and residual standard error). \end{alist} The following plot \begin{verbatim} plot(x, y) lines(x, out$fitted.values) \end{verbatim} \noindent puts a line of predicted values ($\hat{\boldy}$ in the notation used in the notes and in Lindgren) on the scatter plot. It may help you see when you have got the right thing. You do not have to turn in the plot. \begin{comment} n <- 21 x <- 1:n sig <- 3 mu <- ifelse(x < median(x), x, 2 * (x - median(x)) + median(x)) y <- mu + sig * rnorm(n) plot(x, y) lines(x, mu) x1 <- median(x) - x x2 <- - x1 x1 <- ifelse(x1 > 0, x1, 0) x2 <- ifelse(x2 > 0, x2, 0) out <- lm(y ~ x1 + x2) summary(out) plot(x, y) lines(x, out$fitted.values) write(rbind(x, y), ncol=2, file="prob12-6.dat") \end{comment} \textbf{Hint:} The \texttt{ifelse} function in R defines vectors whose values depend on a condition, for example \begin{verbatim} ifelse(x <= 11, 1, 0) \end{verbatim} \noindent defines the indicator function of the interval $x \le 11$. (This is \emph{not} one of the predictor variables you need for this problem. It's a hint, but not that much of a hint. The \texttt{ifelse} function may be useful, this particular instance is not.) \end{problem} \begin{problem} For the data in the URL \begin{verbatim} http://www.stat.umn.edu/geyer/5102/ex12.3.2.dat \end{verbatim} \begin{alist} \item Find the 95\% percent prediction interval for an individual with $x$ value 5 using the fifth degree polynomial model fit in Examples~\ref{ex:foo} and~\ref{ex:foofoo} (this interval can be read off Figure~\ref{fig:foothree}, but get the exact numbers from R). \item Find the 95\% percent confidence interval for the population regression function at the same $x$ value for the same model. \end{alist} \begin{comment} out <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5)) predict(out, data.frame(x=5), interval="prediction") predict(out, data.frame(x=5), interval="confidence") out <- lm(y ~ I(sin(x)) + I(cos(x)) + I(sin(2 * x)) + I(cos(2 * x))) predict(out, data.frame(x=5), interval="prediction") predict(out, data.frame(x=5), interval="confidence") \end{comment} \end{problem} \begin{problem} Prove Theorem~\ref{th:resid}. (Use $\hat{\boldy} = \boldH \boldy$, don't reprove it.) \end{problem} \begin{problem} The data set \begin{verbatim} http://www.stat.umn.edu/geyer/5102/prob12-9.dat \end{verbatim} \noindent contains three variables, \texttt{x}, \texttt{y}, and \texttt{z}. Each is an \iid\ sample from some distribution. The three variables are independent of each other (this is not a regression problem). Make Q-Q plots of the variables. One is normal. Which one? Describe the features of the other two plots that make you think the variables plotted are not normal. It is an interesting comment on the usefulness of Q-Q plots that this problem is essentially undoable at sample size 50 (no differences are apparent in the plots). It's not completely obvious at the sample size 100 used here. Fairly large sample sizes are necessary for Q-Q plots to be useful. \begin{comment} n <- 100 x <- rnorm(n, 20, 5) y <- 20 + 5 * rt(n, 5) z <- 20 + 5 * (rgamma(n, 10, 1 / sqrt(10)) - sqrt(10)) mean(x); sd(x) mean(y); sd(y) mean(z); sd(z) qqnorm(x); qqline(x) qqnorm(y); qqline(y) qqnorm(z); qqline(z) write(rbind(y, z, x), ncol=3, file="prob12-9.dat") \end{comment} \end{problem}