Cross-Validation in Arc
Sanford Weisberg
School of Statistics, University of Minnesota, St. Paul, MN 55108-6042.
Supported by National Science Foundation Grant DUE 97-52887.
December 7, 1999, revised January 31, 2002
Abstract:
This paper describes an Arc add-in for simple cross-validation.
Cross-validation is a common method used for model checking in regression
problems. The basic outline is:
- Divide the data into two parts, a "construction" set and a
"validation" set.
- Fit a model of interest to the construction set.
- Compute a summary statistic, usually a function of the deviance and
possibly the number of parameters, based on the validation set only.
- Prefer models for which the chosen statistic, computed on the
validation data, is small (for most statistics, smaller is better).
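The outline above can be sketched in a few lines. The example below is written in Python rather than in Arc's Lisp, purely as an illustration; the helper names `fit_line` and `holdout_msd`, the single-predictor model, and the data are all made up for this sketch and are not part of the add-in.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b


def holdout_msd(xs, ys, validation_cases):
    """Fit on the construction set only, then return the mean squared
    deviation of the validation responses from their predictions."""
    held = set(validation_cases)
    construction = [i for i in range(len(xs)) if i not in held]
    a, b = fit_line([xs[i] for i in construction],
                    [ys[i] for i in construction])
    devs = [(ys[i] - (a + b * xs[i])) ** 2 for i in sorted(held)]
    return sum(devs) / len(devs)


# Made-up data lying exactly on y = 1 + 2x, so the validation error is zero.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x for x in xs]
msd = holdout_msd(xs, ys, validation_cases=[1, 4])
```

Competing models would each be fit to the same construction set and compared on the value returned for the same validation cases.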
This Arc add-in permits (1) selecting the construction and validation sets,
and (2) automatic computation and printing of a few summary statistics for the
validation set.
Getting the Code
Download the file
http://www.stat.umn.edu/arc/cv.lsp.
Place it in the Extras directory in
your Arc directory (if you don't have such a directory, create one).
The file will be automatically loaded every time you start Arc.
Using Cross-validation
Loading cv.lsp will add a menu item to the data set menu called Cross
validation. Select this item to start using cross validation. You will get
a dialog with three options:
- Set the fraction to a number between 0 and 1 giving the
fraction of the data to be put into the validation set.
- Give a list of case numbers that you want to be in the
validation set. If you type '(1 2 3 4 15), then cases 1, 2,
3, 4 and 15 will be put into the validation set. If you type
(iseq 7 30), then cases 7 to 30 will be in the validation set.
If the list c has been defined, for example,
(def c '( 2 5 4 3 6)), then you can simply type c in the
dialog. If you set this option, then the fraction you specified is
ignored.
- If you select the Remove item, then cross validation is
stopped, and all cases are used in fitting.
If you select this item, the other two are ignored.
Selecting Cross validation a second time will change the cases in the
validation set.
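The precedence among the three dialog options can be summarized in a short sketch. It is written in Python for illustration only and is not the add-in's code. The function `choose_validation` and its arguments are hypothetical, case numbers are assumed to start at 0, and drawing a simple random sample whose size is the rounded fraction is an assumption, since this note does not say how Arc chooses cases from the fraction.

```python
import random


def choose_validation(n_cases, fraction=0.0, cases=None, remove=False, seed=None):
    """Return the sorted case numbers held out for validation.

    Hypothetical helper mirroring the dialog's precedence:
    Remove > explicit case list > fraction.
    """
    if remove:
        return []                    # cross validation off: every case is used
    if cases:
        return sorted(set(cases))    # explicit list wins; the fraction is ignored
    k = round(fraction * n_cases)    # assumption: size is the rounded fraction
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_cases), k))


# An explicit list overrides the fraction, as described for the dialog.
explicit = choose_validation(30, fraction=0.5, cases=[1, 2, 3, 4, 15])
# Analogue of typing (iseq 7 30): cases 7 through 30 inclusive.
by_range = choose_validation(40, cases=list(range(7, 31)))
# Fraction only: a random quarter of 200 cases.
by_fraction = choose_validation(200, fraction=0.25, seed=1)
```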
All models you fit with this data set will exclude the validation set from the
fitting. The output will include a few summary statistics based on the
validation set only. Here is the output for linear models:
Data set = AIS, Name of Fit = L1
64 cases have been deleted.
Normal Regression
Kernel mean function = Identity
Response             = LBM
Terms                = (Ht Wt RCC Sex)

Coefficient Estimates
Label          Estimate      Std. Error    t-value   p-value
Constant      -0.183342      7.72187       -0.024    0.9811
Ht             0.0920206     0.0397343      2.316    0.0221
Wt             0.646491      0.0266816     24.230    0.0000
RCC            0.920939      0.748719       1.230    0.2209
Sex           -8.63126       0.820372     -10.521    0.0000

R Squared:               0.957295
Sigma hat:               2.88669
Number of cases:         202
Number of cases used:    138
Degrees of freedom:      133

Summary Analysis of Variance Table
Source        df      SS          MS          F         p-value
Regression     4      24843.7     6210.93     745.34    0.0000
Residual     133      1108.29     8.33298

Cross validation summary of cases not used to get estimates:
Sum of squared deviations:       355.439
Mean squared deviation:          5.55373
Sqrt(mean squared deviation):    2.35663
Number of observations:          64
The additional quantities at the end of the regression output are for the
validation set only.
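The three cross validation summaries are related by simple arithmetic, which the following Python check reproduces from the printed values. It suggests that the mean squared deviation is the sum of squared deviations divided by the number of validation cases (64), with no degrees-of-freedom adjustment; the variable names are illustrative only.

```python
import math

sum_sq_dev = 355.439    # "Sum of squared deviations" from the output
n_validation = 64       # "Number of observations" in the validation set
n_total = 202           # "Number of cases" in the data set

# Cases left for fitting after the validation set is removed.
n_construction = n_total - n_validation

# Divide by the number of validation cases, then take the square root.
mean_sq_dev = sum_sq_dev / n_validation
root_mean_sq_dev = math.sqrt(mean_sq_dev)
```

To the precision printed, these agree with the reported 5.55373 and 2.35663, and the construction set size matches the 138 cases used in fitting.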
Users familiar with Lisp can modify the method :display-cross-validation
to give other statistics of interest.
The output for generalized linear models is based on deviance rather than
squared residuals.
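The exact deviance summary that Arc prints is not shown in this note. As a rough illustration of what a deviance-based validation statistic can look like, the Python sketch below computes the usual deviance for 0/1 responses from fitted probabilities. The function name, the choice of a binary-response model, and the data are assumptions made for this sketch.

```python
import math


def binomial_deviance(ys, probs):
    """Deviance for 0/1 responses: -2 times the log-likelihood of the
    fitted probabilities (the saturated log-likelihood is zero here)."""
    return -2.0 * sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(ys, probs)
    )


# Made-up validation responses and fitted probabilities from some model
# that was fit to the construction set only.
y_val = [1, 0, 1, 1]
p_val = [0.5, 0.5, 0.5, 0.5]
dev = binomial_deviance(y_val, p_val)
```

As with squared deviations, smaller validation deviance indicates predictions closer to the held-out responses.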