Usage:
discrim(groups, y), vector of positive integers groups, REAL matrix y
with no MISSING values
Keywords:
classification, discrimination
Usage
discrim(groups, y), where groups is a factor or an integer vector, and
y is a REAL data matrix with no MISSING elements, computes the
coefficients of linear discriminant functions that can be used to
classify an observation into one of the populations specified by
argument groups.
The estimated functions are optimal when the distribution in each
population is multivariate normal and all populations share the same
variance-covariance matrix.
When there are g = max(groups) populations, and p = ncols(y) variables,
the value returned is structure(coefs:L, add:C) where L is a REAL p by
g matrix and C is a 1 by g row vector.
If y is a length p vector of data to be classified to one of the
populations, then f = L' %*% y + C' is the vector of discriminant
function values (scores) for the g populations.
If f[j] = max(f) is the largest element of f, then, assuming the g
populations are equally probable (each has prior probability 1/g),
population j is the most probable population based on y.
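discrim() itself is a MacAnova macro, so as a sketch of the computation it
performs, here is a Python/NumPy version under the assumption that the
coefficients are the classical linear discriminant coefficients
L[,j] = S^(-1) %*% mu_j and C[j] = -mu_j' %*% S^(-1) %*% mu_j / 2, with S
the pooled within-group covariance matrix (the function name and details
are illustrative, not the macro's actual code):

```python
import numpy as np

def discrim(groups, y):
    """Sketch of discrim(): return (L, C), the p by g coefficient matrix
    and length-g constant vector of the linear discriminant functions.

    groups : 1-D array of integers 1, ..., g labelling each row of y
    y      : n by p data matrix with no missing values
    """
    groups = np.asarray(groups)
    y = np.asarray(y, dtype=float)
    g = groups.max()
    n, p = y.shape
    # Group mean vectors, stacked as a g by p matrix
    means = np.vstack([y[groups == j].mean(axis=0) for j in range(1, g + 1)])
    # Pooled within-group covariance matrix S (divisor n - g)
    S = np.zeros((p, p))
    for j in range(1, g + 1):
        dev = y[groups == j] - means[j - 1]
        S += dev.T @ dev
    S /= (n - g)
    L = np.linalg.solve(S, means.T)          # p by g: S^(-1) mu_j in column j
    C = -0.5 * np.sum(means.T * L, axis=0)   # length g: -mu_j' S^(-1) mu_j / 2
    return L, C
```

With this sketch, f = L.T @ x + C gives the g discriminant scores for a
new observation x, and f.argmax() picks the most probable population
under equal priors.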
Prior and posterior probabilities
If P is a length g vector such that P[j] is the prior probability that a
randomly selected case belongs to population j, then the estimated
posterior probability that y belongs to population k is
P[k]*exp(f[k])/sum(P * exp(f)) =
P[k]*exp(f[k] - f[1])/sum(P * exp(f - f[1]))
The second form is preferred since exp(f[k]) can be too large to
compute.
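As a hedged Python/NumPy sketch of this posterior computation (the
function name is an assumption, not part of the macro), subtracting the
largest score rather than f[1] before exponentiating, which removes any
chance of overflow and yields the same probabilities:

```python
import numpy as np

def posterior(f, P):
    """Posterior probabilities from discriminant scores f and priors P.

    Subtracting max(f) before exponentiating plays the same role as the
    help file's f - f[1]: any constant cancels in the ratio, and the
    maximum guarantees every exponent is <= 0, so exp() cannot overflow.
    """
    f = np.asarray(f, dtype=float)
    P = np.asarray(P, dtype=float)
    w = P * np.exp(f - f.max())
    return w / w.sum()
```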
Classifying rows of a matrix
When Y is an m by p data matrix whose rows are to be classified, F = Y
%*% L + C is an m by g matrix, with F[i,j] containing the value of the
discriminant function for population j evaluated with the data in row i
of Y. An m by g matrix of posterior probabilities for each group and
case can be computed by
P * exp(F - F[,1])/((P * exp(F - F[,1])) %*% rep(1,g))
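The matrix expression above can be sketched in Python/NumPy as follows
(an illustrative translation, not the macro's code; it subtracts each
row's maximum score instead of the first column F[,1], which is
equivalent and safe against overflow):

```python
import numpy as np

def posterior_matrix(Y, L, C, P):
    """m by g matrix of posterior probabilities for each row of Y.

    Y : m by p data matrix; L : p by g coefficients; C : length-g
    constants; P : length-g prior probabilities.
    """
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    F = Y @ L + C                                       # m by g scores
    W = P * np.exp(F - F.max(axis=1, keepdims=True))    # rowwise shift
    return W / W.sum(axis=1, keepdims=True)             # rows sum to 1
```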
Bias in estimating posterior probabilities
It is well known that posterior probabilities computed for a case in the
"training set", the data set from which a classification method was
estimated, are biased in an "optimistic" direction: the estimated
posterior probability for the case's actual population is biased upward.
For this reason, posterior probabilities should be estimated only for
cases that are not in the training set. See macro jackknife() for a
partial remedy.
Gary Oehlert
2006-01-30