Sample Size Consideration in Multivariate Normal Classification

by Seymour Geisser and Wesley Johnson
Technical Report No. 580
School of Statistics
University of Minnesota
and University of California, Davis
July, 1992

Research supported in part by NIGMS Grant 25271.


Introduction

In classification problems involving two multivariate normal training samples of sizes N_1 and N_2 already in hand, we address the question of whether it would be worthwhile to increase the training samples by given amounts. Consider a situation where it is possible that totals of, say, n_1 and n_2 observations would be taken from the two populations, respectively. Suppose a decision is made to observe N_1 < n_1 and N_2 < n_2 of these first, at an "interim" stage. Our main focus is on the performance of linear allocation rules, where performance is measured by the magnitude of the misallocation probabilities. At the "interim" stage, we can assess predictive probabilities for the various probabilities of error at the "end" of the experiment. For example, the "actual" error rate for Fisher's sample linear discriminant can be estimated both at the interim stage and at the end of the experiment. At the interim stage, it may be of interest to assess the chances that this error rate will be less than or greater than .01, .05, .2, etc., after more observations are taken. If it is assessed, with predictive probability .99, that the estimated "actual" error rate will be greater than .2 at the "end" of the experiment, this may be grounds to terminate the experiment at the interim stage, or perhaps to consider additional variables that might aid in lowering the error rate. At the other extreme, if the "actual" error rate is estimated to be less than .01 at the interim stage, and if the predictive probability that it will remain that low is high, it may be deemed unnecessary to observe more data, or the experiment may be continued with enthusiasm. Similar considerations are made with respect to the "true" error rate defined for the population linear discriminant.

The approach taken is Bayesian. Since the problem involves the Mahalanobis measure of divergence D^2, which also arises in testing the similarity of two multivariate normal populations, we first discuss that problem in sections 2 and 3. Section 4 considers the effect of a potential training sample increase on the "true" errors of classification. The effect of the training sample increase on the "actual" errors of classification is addressed in section 5. The results are illustrated by an example presented in section 6.
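
As a rough illustration of the quantities named above, the following Python sketch computes the sample Mahalanobis D^2 from two training samples and the corresponding plug-in error rate Phi(-D/2) for a linear discriminant. It is not the Bayesian predictive analysis of the report: it assumes equal prior probabilities and a common covariance matrix, and the function names mahalanobis_d2 and plugin_error_rate, as well as the toy data, are hypothetical choices made for this example.

```python
import math

import numpy as np


def mahalanobis_d2(x1, x2):
    """Sample Mahalanobis D^2 between two samples (rows are observations)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.shape[0], x2.shape[0]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # Pooled covariance estimate with divisor N_1 + N_2 - 2 (assumed convention).
    s1 = np.atleast_2d(np.cov(x1, rowvar=False))
    s2 = np.atleast_2d(np.cov(x2, rowvar=False))
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    # D^2 = diff' S^{-1} diff, computed without forming the inverse explicitly.
    return float(diff @ np.linalg.solve(pooled, diff))


def plugin_error_rate(d2):
    """Plug-in error rate Phi(-D/2), assuming equal priors and common covariance."""
    # Phi(-z) = 0.5 * erfc(z / sqrt(2)) with z = D / 2.
    return 0.5 * math.erfc(math.sqrt(d2) / (2.0 * math.sqrt(2.0)))


# Toy training samples (hypothetical data): the second is the first shifted by 3.
sample_1 = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
sample_2 = sample_1 + np.array([3.0, 0.0])

d2 = mahalanobis_d2(sample_1, sample_2)
print("D^2 =", d2)
print("plug-in error rate =", plugin_error_rate(d2))
```

A point estimate of this kind ignores the uncertainty in D^2 at the interim stage, which is the uncertainty that predictive probabilities for the end-of-experiment error rate are meant to capture.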

