A regression with 20 covariates plus intercept. We include the intercept in all models, which is the default, so there are 2^20 = 1,048,576 models to consider.

This is simulated data. The true (pretend unknown) regression function is

y = x1 + x2 + x3 + x4 + x5
that is, the true model has only 5 of the 20 possible predictors.
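
Something like the following would generate data of this form. It is only a sketch: the sample size, the covariate distribution, and the error standard deviation here are assumptions made for illustration, not the values used in the original example.

    # assumed setup: n = 100 cases, i.i.d. standard normal covariates,
    # standard normal errors
    set.seed(42)
    n <- 100
    x <- matrix(rnorm(n * 20), n, 20)
    colnames(x) <- paste0("x", 1:20)
    # true regression function uses only the first five predictors
    y <- rowSums(x[, 1:5]) + rnorm(n)
    dat <- data.frame(y, x)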

Fit the full model.
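
In R (using the data frame dat from the sketch above) that is just

    # fit the model with all 20 predictors plus intercept
    out <- lm(y ~ ., data = dat)
    summary(out)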

We've left in the significance codes, bogus though they may be, so you can easily spot the regression coefficients that the least squares fit indicates may be important. If we go only with the strongest evidence (two or three stars), we get three of the five truly important regression coefficients declared significant. The other two are missed.

If we use a less stringent standard, say one or more stars, we do pick up another truly nonzero regression coefficient, but we also pick up a false one. Thus we now have both false negatives (we still missed β3) and false positives (we've picked up β8).

With the least stringent standard, counting all the coefficients marked by any of the significance codes, we now have no false negatives (all five of the truly nonzero regression coefficients are declared significant), but we have four false positives.

No matter how you slice it, least squares regression doesn't pick the right model. Of course, this is no surprise. It's just that the sample is not the population. But it does show that the results of such model selection procedures must be treated with skepticism.

Actually, we haven't even started a sensible model selection procedure. If you want to know how well a model fits, you have to fit that model. So far we haven't fit any of the smaller models we've discussed. We're fools to think we can pick out the good submodels just by looking at the printout for the big model.

Do branch and bound using Mallows' Cp as the criterion. Also make the so-called Cp plot, which is a plot of Cp versus p. The dots are the Cp values for the 10 best models for each p. The curve is the line Cp = p (curved because of the log scale for Cp).
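
A sketch of how to do this with the leaps function (leaps package), which performs the branch and bound; the plotting details simply follow the description above.

    library(leaps)
    # 10 best models of each size, ranked by Mallows' Cp
    cp.out <- leaps(x, y, method = "Cp", nbest = 10)
    plot(cp.out$size, cp.out$Cp, log = "y",
        xlab = "p", ylab = expression(C[p]))
    # the line Cp = p, drawn point by point so it follows the log scale
    ps <- sort(unique(cp.out$size))
    lines(ps, ps)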

Every model with Cp < p, corresponding to the dots below the line, is good. There are a huge number of perfectly acceptable models, because for the larger values of p there are many more than 10 good models, which are not shown.

The best model according to the Cp criterion is one with p = 12, so 11 non-constant predictors, which happen to be x1 through x5 (the truly nonzero predictors) plus x8, x10, x11, x12, x14, and x20. We can get its regression output as follows.
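
One way to do that (a sketch, continuing with the objects created above) is to pull the model with the smallest Cp out of the leaps output and refit it with lm.

    # variables in the model with the smallest Cp
    i.best <- which.min(cp.out$Cp)
    vars <- colnames(x)[cp.out$which[i.best, ]]
    # refit that model by least squares and print the usual summary
    out.best <- lm(reformulate(vars, response = "y"), data = dat)
    summary(out.best)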

Note that the stargazing doesn't correspond with the notion of the best model by the Cp criterion. One of these coefficients doesn't even have a dot (so for it P > 0.10), and four others have only dots (0.05 < P < 0.10). Considered separately, this would lead us to drop them. But that would be the Wrong Thing (multiple testing without correction). The leaps function does as close to the Right Thing as can be done. The only defensible improvement would be to change the criterion, to BIC perhaps, which would choose a smaller best model because it penalizes larger models more.

Since Cp is a function of the residual sum of squares for a model (RSSp) and the size of the model p, and so are AIC and BIC, we can convert these results to AIC or BIC if we want to. The formulas are

Cp = RSSp / σ̂² + 2 p − n
AIC = − 2 Mp + 2 p
BIC = − 2 Mp + log(n) p
where σ̂² is the estimated response variance from the largest model, n is the number of cases, and Mp is the maximum value of the log likelihood for the model (up to an additive constant that does not depend on the model)
Mp = − n / 2 − (n / 2) log(RSSp / n)
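
For example (again a sketch using the objects from above), we can invert the Cp formula to recover RSSp for each model that leaps reported and then compute Mp.

    # estimated response variance from the largest (full) model
    sigma.hat.sq <- summary(out)$sigma^2
    n <- nrow(dat)
    # RSSp recovered from Cp, then the maximized log likelihood Mp
    rss <- sigma.hat.sq * (cp.out$Cp + n - 2 * cp.out$size)
    Mp <- - n / 2 - (n / 2) * log(rss / n)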

Here's AIC.
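
A sketch of how it can be computed and plotted from the quantities above:

    aic <- - 2 * Mp + 2 * cp.out$size
    plot(cp.out$size, aic, xlab = "p", ylab = "AIC")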

And here's BIC.
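
A matching sketch, which just replaces the penalty 2 with log(n):

    bic <- - 2 * Mp + log(n) * cp.out$size
    plot(cp.out$size, bic, xlab = "p", ylab = "BIC")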