Cross-Validation in Arc

Sanford Weisberg
School of Statistics, University of Minnesota, St. Paul, MN 55108-6042.
Supported by National Science Foundation Grant DUE 97-52887.

December 7, 1999, revised January 31, 2002

Abstract:

This paper describes an Arc add-in for simple cross-validation.

Introduction

Cross-validation is a common method used for model checking in regression problems. The basic outline is:

Divide the data into two parts, a "construction" set and a "validation" set.
Fit a model of interest to the construction set.
Compute a summary statistic, usually a function of the deviance and possibly the number of parameters, based on the validation set only.
Select models for which the chosen statistic, computed on the validation data, is small (smaller is usually better).

This Arc add-in permits (1) selection of the construction and validation sets, and (2) automatic computation and printing of a few summary statistics for the validation set.
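The four-step outline above can be sketched outside Arc in a few lines of Python. This is only an illustration under stated assumptions, not the add-in's code: the function names fit_line and cross_validate are hypothetical, the model is a one-predictor least-squares line, the summary statistic is the sum of squared deviations, and the data are made up.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def cross_validate(x, y, validation_cases):
    """Fit on the construction set, summarize on the validation set."""
    # Step 1: split the case indices into validation and construction sets.
    held_out = set(validation_cases)
    construction = [i for i in range(len(x)) if i not in held_out]
    # Step 2: fit the model using the construction cases only.
    b0, b1 = fit_line([x[i] for i in construction],
                      [y[i] for i in construction])
    # Step 3: squared deviations for the validation cases only.
    ss = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out)
    return {"coef": (b0, b1), "ss": ss,
            "mse": ss / len(held_out), "n": len(held_out)}

# Made-up data, for illustration only.
rng = random.Random(0)
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
result = cross_validate(x, y, validation_cases=[3, 7, 11, 15, 19])
print(result)
```

Step 4 then amounts to calling cross_validate for each candidate model with the same validation cases and comparing the resulting statistics.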

Getting the Code

Download the file http://www.stat.umn.edu/arc/cv.lsp. Place it in the Extras directory in your Arc directory (if you don't have such a directory, create one). The file will be automatically loaded every time you start Arc.

Using Cross-validation

Loading cv.lsp will add a menu item to the data set menu called Cross validation. Select this item to start using cross validation. You will get a dialog with three options:

Set the fraction to a number between 0 and 1 giving the fraction of the data to be put into the validation set.
Give a list of case numbers that you want to be in the validation set. If you type '(1 2 3 4 15), then cases 1, 2, 3, 4 and 15 will be put into the validation set. If you type (iseq 7 30), then cases 7 to 30 will be in the validation set. If the list c has been defined, for example, (def c '( 2 5 4 3 6)), then you can simply type c in the dialog. If you set this option, then the fraction you specified is ignored.
If you select the Remove item, then cross validation is stopped, and all cases are again used in fitting. If you select this item, the other two are ignored.
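The precedence among the three dialog options can be summarized in a short Python sketch. The function choose_validation_cases and its arguments are hypothetical. How cv.lsp turns a fraction into a set of cases is not described here, so the random sampling, the rounding of the set size, and the zero-based case indices below are all assumptions.

```python
import random

def choose_validation_cases(n_cases, fraction=None, case_list=None,
                            remove=False, seed=None):
    """Return the sorted case indices that form the validation set."""
    if remove:
        # Remove overrides the other two options: no validation set.
        return []
    if case_list is not None:
        # An explicit list of cases overrides the fraction.
        return sorted(set(case_list))
    if fraction is None or not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Assumption: the set size is the rounded fraction of the cases,
    # and the cases are drawn at random without replacement.
    k = round(fraction * n_cases)
    return sorted(random.Random(seed).sample(range(n_cases), k))

# Analogue of typing '(1 2 3 4 15) in the dialog.
print(choose_validation_cases(30, case_list=[1, 2, 3, 4, 15]))
# Analogue of (iseq 7 30): cases 7 to 30 inclusive.
print(choose_validation_cases(31, case_list=range(7, 31)))
# A fraction of the data, chosen at random (assumed behavior).
print(len(choose_validation_cases(202, fraction=0.3, seed=1)))
```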

Selecting Cross validation a second time will change the cases in the validation sample.

All models you fit with this data set will exclude the validation set from the fitting. The output will include a few summary statistics based on the validation set only. Here is the output for linear models:

Data set = AIS, Name of Fit = L1
64 cases have been deleted.
Normal Regression
Kernel mean function = Identity
Response      = LBM
Terms         = (Ht Wt RCC Sex)
Coefficient Estimates
Label      Estimate        Std. Error    t-value    p-value
Constant  -0.183342        7.72187        -0.024     0.9811
Ht         0.0920206       0.0397343       2.316     0.0221
Wt         0.646491        0.0266816      24.230     0.0000
RCC        0.920939        0.748719        1.230     0.2209
Sex       -8.63126         0.820372      -10.521     0.0000

R Squared:               0.957295    
Sigma hat:                2.88669    
Number of cases:             202
Number of cases used:        138
Degrees of freedom:          133

Summary Analysis of Variance Table
Source         df       SS            MS           F    p-value
Regression      4   24843.7       6210.93     745.34    0.0000
Residual      133   1108.29       8.33298    

Cross validation summary of cases not used to get estimates:
Sum of squared deviations:      355.439    
Mean squared deviation:         5.55373    
Sqrt(mean squared deviation):   2.35663    
Number of observations:             64

The additional quantities at the end of the regression output are for the validation set only. Users familiar with Lisp can modify the method :display-cross-validation to give other statistics of interest.
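The relation among the printed cross-validation quantities can be checked by hand. The short Python calculation below uses only numbers copied from the output above. It suggests that the mean squared deviation is the sum of squared deviations divided by the number of validation cases (64 = 202 - 138), not by a degrees-of-freedom count. Agreement is only up to the rounding of the printed values.

```python
import math

# Values copied from the printed output above.
n_total, n_used = 202, 138
ss_validation = 355.439      # Sum of squared deviations
printed_mean_sq = 5.55373    # Mean squared deviation
printed_root = 2.35663       # Sqrt(mean squared deviation)

n_validation = n_total - n_used          # cases deleted from the fit
mean_sq = ss_validation / n_validation   # divisor is the validation count
root_mean_sq = math.sqrt(mean_sq)

print(n_validation)
print(f"{mean_sq:.5f} vs printed {printed_mean_sq}")
print(f"{root_mean_sq:.5f} vs printed {printed_root}")
```

The square root is on the scale of the response, so it can be set beside Sigma hat from the construction set, which divides by the residual degrees of freedom instead.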

The output for generalized linear models is based on deviance rather than squared residuals.
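As an illustration of a deviance-based summary, the sketch below computes the Poisson deviance of held-out counts against the means predicted by a model fitted to the construction set. The Poisson family is chosen only as an example, and the function name poisson_validation_deviance is hypothetical. The exact statistics that cv.lsp prints for generalized linear models are not given here, so this should be read as an assumption about their general form.

```python
import math

def poisson_validation_deviance(y, mu):
    """Poisson deviance of held-out counts y against predicted means mu."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # y * log(y / mu) is taken as 0 when y == 0 (its limiting value).
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total

# Made-up validation counts and predicted means, for illustration only.
y_validation = [0, 2, 5]
mu_validation = [0.5, 2.0, 4.0]
print(poisson_validation_deviance(y_validation, mu_validation))
```

Dividing the result by the number of validation cases would give a mean deviance analogous to the mean squared deviation in the linear model output.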
