# describe()

Usage:
```
describe(data [, silent:T, excludeM:T, all:T, fivenum:T, n:T|F, min:T|F,
  max:T|F, q1:T|F, q2:T|F, q3:T|F, median:T|F, mean:T|F, var:T|F,
  stddev:T|F, m2:T|F, m3:T|F, m4:T|F, g1:T|F, g2:T|F, iqr:T|F, range:T|F])
```
where data is REAL or a structure with REAL components.  Use F values
only together with all:T or fivenum:T.

Keywords: descriptive statistics
describe(Data) computes statistics describing the data in the REAL
vector or array Data.

The value of describe(Data) is a structure with the following components:
n            sample size, excluding MISSING values
min          minimum
q1           Q1 = lower quartile
median       M = median
q3           Q3 = upper quartile
max          maximum
mean         average
var          variance (with divisor of n-1)

By default Q1 and Q3 are computed as the medians of the lower and upper
halves of the data, *including* the median in both halves when n is odd.
Keyword phrase excludeM:T (see below) changes this definition so that Q1
and Q3 are computed as medians of the lower and upper halves *excluding*
the median.
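
The two quartile conventions can be illustrated outside MacAnova.  The
following is a minimal Python sketch of the rules as described above, not
the code describe() actually uses.  The function name quartiles and its
exclude_median parameter are hypothetical stand-ins for the default
behavior and for excludeM:T, and MISSING values are not handled.

```python
from statistics import median

def quartiles(data, exclude_median=False):
    """Return (Q1, Q3) as the medians of the lower and upper halves of data.

    Hypothetical helper; assumes at least 2 values and no missing values.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1 and not exclude_median:
        # Odd n, default rule: the middle value belongs to both halves.
        lower, upper = xs[:half + 1], xs[half:]
    else:
        # Even n, or odd n with the median left out of both halves.
        lower, upper = xs[:half], xs[n - half:]
    return median(lower), median(upper)
```

For the values 1 through 7, this sketch gives Q1 = 2.5 and Q3 = 5.5 under
the default rule, and Q1 = 2 and Q3 = 6 when the median is excluded.  For
an even number of values the two rules agree.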

describe(Data, silent:T) does the same, but any warning messages about
MISSING values or overflows are suppressed.

describe() can compute additional statistics including the standard
deviation and the interquartile range.  See below.

You can specify particular statistics to compute using keyword phrases.
For example, describe(Data, mean:T) has the same result as
describe(Data)$mean, except that no unwanted statistics are computed, and
describe(Data, mean:T,var:T) returns a structure with components 'mean'
and 'var' without computing other statistics.

When only one statistic is requested, the result is a REAL variable and
not a structure.

You can use 'm1' instead of 'mean' and 'q2' instead of 'median' when
specifying what to compute; however, when other statistics are also
computed, the components still have names 'mean' and 'median'.  For
example, describe(x,m1:T,q2:T) is equivalent to describe(x,mean:T,
median:T).

describe(x, fivenum:T) is equivalent to describe(x,min:T,q1:T,median:T,
q3:T,max:T), that is, it computes the five number summary consisting of
the extremes and quartiles.

If you want additional statistics, say, the mean, use describe(x,
fivenum:T,mean:T).

If you want to suppress one or more of the five numbers, say the
extremes, use describe(x,fivenum:T,max:F,min:F).

There are other statistics that can be computed only by using keyword
phrases.
stddev        standard deviation = sqrt(var)
m2            sum((x-xbar)^2)/n = s2/n = (n-1)*var/n
m3            sum((x-xbar)^3)/n = s3/n
m4            sum((x-xbar)^4)/n = s4/n
g1            coefficient of skewness (see below)
g2            coefficient of kurtosis (see below)
range         maximum - minimum
iqr           IQR = Q3 - Q1 = interquartile range

Some textbooks give m2 = s2/n as the definition of sample variance
instead of the value of var = s2/(n-1).
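
The relation between the two variance definitions can be checked
numerically.  This is a small Python sketch under the definitions given
above (s2 is the sum of squared deviations from the mean).  The function
name second_moments is hypothetical and is not part of MacAnova.

```python
def second_moments(data):
    """Return var = s2/(n-1) and m2 = s2/n for a list of numbers (n >= 2)."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data)  # sum of squared deviations
    return {"var": s2 / (n - 1), "m2": s2 / n}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and s2 = 32, so
m2 = 4 while var = 32/7, and m2 = (n-1)*var/n as stated.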

Example:
Cmd> describe(x, g1:T, g2:T)

returns a structure containing the skewness and kurtosis of x in
components g1 and g2.  See below for their exact definitions.

describe(Data, all:T) returns a structure with the 8 standard components
plus components stddev, m2, m3, m4, g1, g2, range and iqr.  You can
suppress any component by, for example, median:F.

Example:
Cmd> describe(Data, all:T, q1:F, median:F, q3:F, silent:T)

returns a structure containing all statistics except the median and
quartiles.  Because 'silent:T' is an argument, no warning is printed if
Data contains MISSING values.

describe(Data, excludeM:T ...) computes the same statistics, except that Q1 and Q3 are
computed as medians of the lower and upper half of the data, *excluding*
the median when n is odd, and the IQR is computed from these modified
quartiles.  This corresponds with the definition of quartiles in some
statistical texts, including David S. Moore, The Basic Practice of
Statistics.

'excludeM:T' modifies results only when the number n of non-MISSING
values is odd and one or more of Q1, Q3 or IQR is computed.

The case, upper or lower, of letters is ignored in describe() keyword
names.  For example, Q1:T and q1:T are equivalent.  This currently
differs from the behavior of most other functions.  The names of
components are all lower case.

When Data is multidimensional (a matrix or array) with dimensions n1,
n2, ..., nk, each component of the result (or the result itself when
only one statistic is requested) is an array with dimensions n2, n3,
..., nk, that is, it has one fewer dimensions than Data.  Each statistic
describes all values with the last k-1 subscripts fixed (a column when
Data is a matrix).  In particular, when Data is a true matrix (exactly 2
dimensions), the component is a vector.  For example, when Data is a
true matrix, describe(Data, mean:T) is a vector, but
sum(Data)/nrows(Data) is a row vector (matrix with 1 row).
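
The reduction over the first subscript can be pictured for the matrix
case with a short Python sketch.  It assumes the matrix is held as a list
of n1 rows, each with n2 values, which is only an illustration and not
how MacAnova stores data.  The function name column_means is
hypothetical.

```python
def column_means(matrix):
    """Return one mean per column of a matrix given as a list of rows."""
    # zip(*matrix) regroups the row-wise data into columns.
    return [sum(col) / len(col) for col in zip(*matrix)]
```

A matrix with 3 rows and 2 columns yields a result of length 2, one value
per column, which mirrors the shape rule described above.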

When Data is a vector or matrix, you can also use tabs(Data [,keywords])
to compute some of the statistics computed by describe() (not q1,
median, q3, m2, m3, m4, g1, g2, range or iqr).  See tabs().

When Data itself is a structure, each component of the result (or the
result itself when only one statistic is requested) is itself a
structure with the same shape as Data, whose components contain summary
values for the corresponding component of Data.

Examples:
Cmd> xbar <- describe(x, mean:T); sx <- describe(x, stddev:T)
compute the mean and standard deviation of x.

Cmd> medians <- describe(split(y,a), median:T) # or MEDIAN:T
and
Cmd> medians <- describe(split(y,a))$median # not $MEDIAN
both compute a structure, each of whose components is the median of the
values of y corresponding to a level of factor a.  The first does less
computing of results you aren't saving.

Cmd> describe(x, mean:T, var:T) # or Mean:T, VAR:T
and
Cmd> describe(x)[vector(7,8)]

are equivalent, except the latter does much unnecessary computing
because it computes and then discards the extremes, the quartiles and
the median.

Skewness g1 = k3/k2^1.5 and kurtosis g2 = k4/k2^2 are computed from
Fisher's k-statistics k2, k3 and k4 defined as

k2 = var = s2/(n - 1)
k3 = n*s3/((n - 1)*(n - 2)), and
k4 = (n*(n + 1)*s4 - 3*(n - 1)*s2^2)/((n - 1)*(n - 2)*(n - 3))

g1 and g2 are similar to, but not identical to, sqrt(beta1) = m3/m2^1.5 and
beta2 = m4/m2^2 - 3 which are also used to measure skewness and
kurtosis.

Expressed in terms of sqrt(beta1) and beta2:

g1 = (sqrt(n*(n - 1))/(n - 2))*sqrt(beta1)
g2 = ((n^2 - 1)/((n - 2)*(n - 3)))*(beta2  +  6/(n + 1))

When n <= 2, g1 is computed to be 0.  When n <= 3, g2 is computed to be
0.
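
The two ways of writing g1 and g2 can be compared numerically.  The
following Python sketch follows the formulas above term by term.  It is
not MacAnova's implementation, the function names are hypothetical, and
returning 0 for constant data (s2 = 0) is an assumption made here only to
avoid dividing by zero.

```python
import math

def _central_sums(data):
    """Return n and the sums s2, s3, s4 of powers of deviations from the mean."""
    n = len(data)
    xbar = sum(data) / n
    s2, s3, s4 = (sum((x - xbar) ** p for x in data) for p in (2, 3, 4))
    return n, s2, s3, s4

def g_from_k(data):
    """g1 and g2 computed from Fisher's k-statistics."""
    n, s2, s3, s4 = _central_sums(data)
    if n <= 2 or s2 == 0:  # s2 == 0 guard is an assumption of this sketch
        return 0.0, 0.0
    k2 = s2 / (n - 1)
    k3 = n * s3 / ((n - 1) * (n - 2))
    g1 = k3 / k2 ** 1.5
    if n <= 3:
        return g1, 0.0
    k4 = (n * (n + 1) * s4 - 3 * (n - 1) * s2 ** 2) / (
        (n - 1) * (n - 2) * (n - 3))
    return g1, k4 / k2 ** 2

def g_from_beta(data):
    """g1 and g2 computed from sqrt(beta1) and beta2 as defined above."""
    n, s2, s3, s4 = _central_sums(data)
    if n <= 2 or s2 == 0:
        return 0.0, 0.0
    m2, m3, m4 = s2 / n, s3 / n, s4 / n
    g1 = math.sqrt(n * (n - 1)) / (n - 2) * (m3 / m2 ** 1.5)
    if n <= 3:
        return g1, 0.0
    beta2 = m4 / m2 ** 2 - 3
    g2 = (n ** 2 - 1) / ((n - 2) * (n - 3)) * (beta2 + 6 / (n + 1))
    return g1, g2
```

Applied to the same data, the two functions should agree up to rounding
error, which is a quick check that the k-statistic and beta forms are
consistent.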

g1 and g2 are sometimes used to test the null hypothesis that a sample
comes from a normal population.  If the data are a random sample from a
normal distribution, then g1 and g2 have mean 0 and

V[g1] = 6*n*(n - 1)/((n - 2)*(n + 1)*(n + 3))
V[g2] = 24*n*(n - 1)^2/((n - 3)*(n - 2)*(n + 3)*(n + 5))