**Sanford Weisberg
School of Statistics, University of Minnesota, St. Paul, MN 55108-6042.
Supported by National Science Foundation Grant DUE 97-52887.**

**December 7, 1999, revised January 31, 2002**

This paper describes an **Arc** add-in for simple cross-validation, which proceeds as follows:

- Divide the data into two parts, a "construction" set and a "validation" set.
- Fit a model of interest to the construction set.
- Compute a summary statistic, usually a function of the deviance and possibly the number of parameters, based on the validation set only.
- Select the models for which the chosen statistic, computed on the validation data, is best (usually this means smallest).
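The four steps above can be sketched in a few lines. This is a minimal illustration in Python, not the add-in's code: the helper names `fit_line` and `cross_validate`, the one-predictor model, and the simulated data are all assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b


def cross_validate(xs, ys, validation_cases):
    """Fit on the construction set, then summarize on the validation set."""
    held_out = set(validation_cases)
    construction = [i for i in range(len(xs)) if i not in held_out]
    a, b = fit_line([xs[i] for i in construction],
                    [ys[i] for i in construction])
    # The summary statistic uses the validation cases only.
    ss = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return {"intercept": a, "slope": b,
            "sum_sq_dev": ss, "mean_sq_dev": ss / len(held_out)}


# Hypothetical data: y = 1 + 2x plus noise (not from the paper).
rng = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]
print(cross_validate(xs, ys, validation_cases=[3, 7, 11, 15]))
```

To compare candidate models, one would call `cross_validate` once per model with the same validation cases and prefer the model with the smaller validation statistic.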

Getting the Code

Using Cross-validation

- Set the fraction to a number between 0 and 1 giving the fraction of the data to be put into the validation set.
- Give a list of case numbers that you want to be in the validation set. If you type `'(1 2 3 4 15)`, then cases 1, 2, 3, 4 and 15 will be put into the validation set. If you type `(iseq 7 30)`, then cases 7 to 30 will be in the validation set. If the list `c` has been defined, for example, `(def c '(2 5 4 3 6))`, then you can simply type `c` in the dialog. If you set this option, then the fraction you specified is ignored.
- If you select the Remove item, then cross validation is stopped, and all cases are used in computing. If you select this item, the other two are ignored.
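The precedence among the three dialog items can be written out explicitly. The following Python sketch is hypothetical: the function `validation_cases` and its parameters are invented for illustration, cases are numbered from 0 here for convenience, and drawing `round(fraction * n)` cases at random without replacement is an assumption, since the text does not say how the add-in selects cases from the fraction.

```python
import random


def validation_cases(n_cases, fraction=0.0, case_list=None,
                     remove=False, seed=None):
    """Return the sorted case numbers held out for validation.

    Precedence follows the dialog description: Remove overrides
    everything, and an explicit case list overrides the fraction.
    """
    if remove:
        return []  # cross-validation off: all cases are used in fitting
    if case_list is not None:
        return sorted(set(case_list))  # the fraction is ignored
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Assumption: a simple random sample of round(fraction * n) cases.
    k = round(fraction * n_cases)
    return sorted(random.Random(seed).sample(range(n_cases), k))


# Counterpart of typing (iseq 7 30) in the dialog: cases 7 through 30.
print(validation_cases(202, fraction=0.3, case_list=range(7, 31)))
```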

All models you fit with this data set will exclude the validation set from the fitting. The output will include a few summary statistics based on the validation set only. Here is the output for linear models:

    Data set = AIS, Name of Fit = L1
    64 cases have been deleted.
    Normal Regression
    Kernel mean function = Identity
    Response       = LBM
    Terms          = (Ht Wt RCC Sex)
    Coefficient Estimates
    Label        Estimate     Std. Error    t-value   p-value
    Constant    -0.183342     7.72187        -0.024    0.9811
    Ht           0.0920206    0.0397343       2.316    0.0221
    Wt           0.646491     0.0266816      24.230    0.0000
    RCC          0.920939     0.748719        1.230    0.2209
    Sex         -8.63126      0.820372      -10.521    0.0000

    R Squared:               0.957295
    Sigma hat:               2.88669
    Number of cases:         202
    Number of cases used:    138
    Degrees of freedom:      133

    Summary Analysis of Variance Table
    Source        df      SS         MS         F       p-value
    Regression     4      24843.7    6210.93    745.34  0.0000
    Residual     133      1108.29    8.33298

    Cross validation summary of cases not used to get estimates:
    Sum of squared deviations:       355.439
    Mean squared deviation:          5.55373
    Sqrt(mean squared deviation):    2.35663
    Number of observations:          64

The additional quantities at the end of the regression output are for the validation set only. Users familiar with

`:display-cross-validation`

to give other statistics of interest.
The output for generalized linear models is based on deviance rather than squared residuals.
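For linear models, the three validation summaries are related by simple arithmetic: the mean squared deviation is the sum of squared deviations divided by the number of validation cases, and the last line is its square root. The Python sketch below is an illustration, not the add-in's code, and the function name `validation_summary` is hypothetical. It computes the same summaries from observed and fitted values and checks the figures reported in the output above.

```python
import math


def validation_summary(observed, fitted):
    """Squared-deviation summaries over the validation cases only."""
    ss = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    n = len(observed)
    msd = ss / n  # divides by the number of validation cases
    return {"sum_sq_dev": ss, "mean_sq_dev": msd,
            "root_mean_sq_dev": math.sqrt(msd), "n": n}


# Consistency check using the figures reported in the output above.
reported_ss, reported_n = 355.439, 64
msd = reported_ss / reported_n
print(msd, math.sqrt(msd))
```

The two printed values agree with the reported 5.55373 and 2.35663 up to rounding of the printed sum of squares.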
