

discrim(groups, y), vector of positive integers groups, REAL matrix y
  with no MISSING values

Keywords: classification, discrimination
discrim(groups, y), where groups is a factor or an integer vector, and
y is a REAL data matrix with no MISSING elements, computes the
coefficients of linear discriminant functions that can be used to
classify an observation into one of the populations specified by
argument groups.

The functions being estimated are optimal in the case when the
distribution in each population is multivariate normal and the
variance-covariance matrices are the same for all populations.

When there are g = max(groups) populations, and p = ncols(y) variables,
the value returned is structure(coefs:L, add:C) where L is a REAL p by
g matrix and C is a 1 by g row vector.
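Under the stated model (multivariate normal populations with a common covariance matrix), the standard linear discriminant coefficients are L[,j] = S^{-1} m_j and C[j] = -m_j' S^{-1} m_j / 2, where m_j is the sample mean of group j and S is the pooled within-group covariance. The following Python/NumPy sketch illustrates that computation; it is an assumption-laden illustration of the formulas, not MacAnova code, and need not match discrim()'s exact numerics:

```python
import numpy as np

def discrim_sketch(groups, y):
    """Illustrative linear discriminant coefficients under the
    equal-covariance normal model.  groups: integer labels 1..g;
    y: n x p data matrix with no missing values."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)                      # the g populations
    n, p = y.shape
    g = len(labels)
    means = np.array([y[groups == k].mean(axis=0) for k in labels])  # g x p
    # pooled within-group covariance, divisor n - g
    resid = y - means[np.searchsorted(labels, groups)]
    S = resid.T @ resid / (n - g)
    Sinv = np.linalg.inv(S)
    L = Sinv @ means.T                              # p x g coefficients
    C = -0.5 * np.sum(means * (means @ Sinv), axis=1)  # length-g constants
    return L, C
```

With well-separated groups, a point near a group's mean then gets its largest discriminant score from that group.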

If y is a length p vector of data to be classified to one of the
populations, then f = L' %*% y + C' is the vector of discriminant
function values (scores) for the g populations.

If f[j] = max(f) is the largest element of f then, assuming the g
populations are equally probable (each with prior probability 1/g),
population j is the most probable population based on y.
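The equal-prior classification rule is simply an argmax over the scores. A minimal Python sketch (function name hypothetical; this mirrors the formula f = L' %*% y + C', not MacAnova internals):

```python
import numpy as np

def classify_sketch(L, C, y0):
    """Pick the most probable population under equal priors:
    the index j maximizing the discriminant score f[j]."""
    f = L.T @ np.asarray(y0, dtype=float) + C   # length-g score vector
    return int(np.argmax(f)) + 1                # populations numbered 1..g
```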

If P is a length g vector such that P[j] is the prior probability that
a randomly selected case belongs to population j, then the estimated
posterior probability that y belongs to population k is
   P[k]*exp(f[k])/sum(P * exp(f)) =
            P[k]*exp(f[k] - f[1])/sum(P * exp(f - f[1]))
The second form is preferred since exp(f[k]) can be too large to
represent, causing overflow.
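The shift by f[1] cancels in the ratio, so any shift works; subtracting max(f) is a common variant that guarantees no argument to exp() is positive. A Python sketch of the stabilized computation (illustrative only, not MacAnova code):

```python
import numpy as np

def posterior_sketch(f, P):
    """Posterior probabilities from discriminant scores f and priors P.
    Shifting by max(f) avoids overflow in exp(); the shift cancels
    in the ratio, exactly as with the f[1] shift in the text."""
    f = np.asarray(f, dtype=float)
    w = P * np.exp(f - f.max())
    return w / w.sum()
```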

When Y is an m by p data matrix whose rows are to be classified, F = Y
%*% L + C is an m by g matrix, with F[i,j] containing the value of the
discriminant function for population j evaluated with the data in row i
of Y.  An m by g matrix of posterior probabilities for each group and
case can be computed by
       P * exp(F  - F[,1])/((P * exp(F - F[,1])) %*% rep(1,g))
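The same row-wise computation can be sketched in Python/NumPy (illustrative, not MacAnova; each row is shifted by its first column, matching F - F[,1] above):

```python
import numpy as np

def posterior_matrix_sketch(F, P):
    """Row-wise posteriors for an m x g score matrix F and priors P.
    F[:, [0]] broadcasts so each row is shifted by its first entry,
    then each row is normalized to sum to 1."""
    F = np.asarray(F, dtype=float)
    W = P * np.exp(F - F[:, [0]])
    return W / W.sum(axis=1, keepdims=True)
```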

NOTE: It is well known that posterior probabilities computed for a case
in the "training set", the data set from which a classification method
was estimated, are biased in an "optimistic" direction: the estimated
posterior probability for a case's actual population is biased upward.
For this reason posterior probabilities should be estimated only for
cases that are not in the training set.  See macro jackknife() for a
partial remedy.

Gary Oehlert 2003-01-15