A regression with 20 covariates plus intercept. We include the intercept in all models, which is the default, so there are 2^20 = 1,048,576 models to consider.

This is simulated data. The true (pretend unknown) regression function is

y = x1 + x2 + x3 + x4 + x5
that is, the true model has only 5 of the 20 possible predictors.
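
Something like the following would generate data of this form. It is only a sketch: the sample size, the covariate distribution, and the error standard deviation here are assumptions made for illustration, not the values used in the original example.

    # assumed setup: n = 100 cases, i.i.d. standard normal covariates,
    # standard normal errors
    set.seed(42)
    n <- 100
    x <- matrix(rnorm(n * 20), n, 20)
    colnames(x) <- paste0("x", 1:20)
    # true regression function uses only the first five predictors
    y <- rowSums(x[, 1:5]) + rnorm(n)
    dat <- data.frame(y, x)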

Fit the full model.
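
In R (using the data frame dat from the sketch above) that is just

    # fit the model with all 20 predictors plus intercept
    out <- lm(y ~ ., data = dat)
    summary(out)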

We've left in the significance codes, bogus though they may be, so you can easily spot the regression coefficients that the least squares fit indicates may be important. If we go only with the strongest evidence (two or three stars), we get three of the five truly important regression coefficients declared significant. The other two are missed.

If we use a less stringent standard, say one or more stars, we do pick up another truly nonzero regression coefficient, but we also pick up a false one. Thus we now have both false negatives (we still missed β3) and false positives (we've picked up β8).

With the least stringent standard, counting all the coefficients marked by any of the significance codes, we now have no false negatives (all five of the truly nonzero regression coefficients are declared significant), but we have four false positives.

No matter how you slice it, least squares regression doesn't pick the right model. Of course, this is no surprise. It's just that the sample is not the population. But it does show that the results of such model selection procedures must be treated with skepticism.

Actually, we haven't even started a sensible model selection procedure. If you want to know how well a model fits, you have to fit that model. So far we haven't fit any of the smaller models we've discussed. We're fools to think we can pick out the good submodels just by looking at the printout for the big model.

Do branch and bound using Mallows' Cp as the criterion. Also make the so-called Cp plot, which is a plot of Cp versus p. The dots are the Cp values for the 10 best models for each p. The curve is the line Cp = p (curved because of the log scale for Cp).
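
A sketch of how to do this with the leaps function (leaps package), which performs the branch and bound; the plotting details simply follow the description above.

    library(leaps)
    # 10 best models of each size, ranked by Mallows' Cp
    cp.out <- leaps(x, y, method = "Cp", nbest = 10)
    plot(cp.out$size, cp.out$Cp, log = "y",
        xlab = "p", ylab = expression(C[p]))
    # the line Cp = p, drawn point by point so it follows the log scale
    ps <- sort(unique(cp.out$size))
    lines(ps, ps)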

Every model with Cp < p, corresponding to the dots below the line, is good. There are a huge number of perfectly acceptable models, because for the larger values of p there are many more than 10 good models, which are not shown.

The best model according to the Cp criterion is one with p = 12, so 11 non-constant predictors, which happen to be x1 through x5 (the truly nonzero predictors) plus x8, x10, x11, x12, x14, and x20. We can get its regression output as follows.
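
One way to do that (a sketch, continuing with the objects created above) is to pull the model with the smallest Cp out of the leaps output and refit it with lm.

    # variables in the model with the smallest Cp
    i.best <- which.min(cp.out$Cp)
    vars <- colnames(x)[cp.out$which[i.best, ]]
    # refit that model by least squares and print the usual summary
    out.best <- lm(reformulate(vars, response = "y"), data = dat)
    summary(out.best)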

Note that the stargazing doesn't correspond with the notion of the best model by the Cp criterion. One of these coefficients doesn't even have a dot (so for it P > 0.10), and four others have only dots (0.05 < P < 0.10). Considered separately, this would lead us to drop them. But that would be the Wrong Thing (multiple testing without correction). The leaps function does as close to the Right Thing as can be done. The only defensible improvement would be to change the criterion, to BIC perhaps, which would choose a smaller best model because it penalizes larger models more.

Since Cp is a function of the residual sum of squares for a model (RSSp) and the size of the model p, and so are AIC and BIC, we can convert these results to AIC or BIC if we want to. The formulas are

Cp = RSSp / σ̂² + 2 p − n
AIC = − 2 Mp + 2 p
BIC = − 2 Mp + log(n) p
where σ̂² is the estimated response variance from the largest model, n is the number of cases, and Mp is the maximum value of the log likelihood for the model (up to an additive constant that does not depend on the model)
Mp = − n / 2 − (n / 2) log(RSSp / n)
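
For example (again a sketch using the objects from above), we can invert the Cp formula to recover RSSp for each model that leaps reported and then compute Mp.

    # estimated response variance from the largest (full) model
    sigma.hat.sq <- summary(out)$sigma^2
    n <- nrow(dat)
    # RSSp recovered from Cp, then the maximized log likelihood Mp
    rss <- sigma.hat.sq * (cp.out$Cp + n - 2 * cp.out$size)
    Mp <- - n / 2 - (n / 2) * log(rss / n)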

Here's AIC.
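
A sketch of how it can be computed and plotted from the quantities above:

    aic <- - 2 * Mp + 2 * cp.out$size
    plot(cp.out$size, aic, xlab = "p", ylab = "AIC")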

And here's BIC.
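
A matching sketch, which just replaces the penalty 2 with log(n):

    bic <- - 2 * Mp + log(n) * cp.out$size
    plot(cp.out$size, bic, xlab = "p", ylab = "BIC")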