To do each example, just click the submit button in the form for that example.
You do not have to type in any R instructions or specify a dataset.
That's already done for you.
(Cumulative) Distribution Functions
The (cumulative) distribution function of a (real-valued) random variable X is the function F defined by

F(x) = Pr(X ≤ x),    −∞ < x < +∞.
- Because F(x) is a probability, it is necessarily between zero and one.
- Because the event X ≤ x increases as x increases, F is a nondecreasing function.
- Because the event X ≤ x decreases to the empty set as x goes to minus infinity, lim_{x → −∞} F(x) = 0.
- Because the event X ≤ x increases to the whole real line as x goes to plus infinity, lim_{x → +∞} F(x) = 1.
- If the support of X is not the whole real line, then all of the increase of F takes place on the support, that is, if a ≤ X ≤ b with probability one, then F(a) = 0 and F(b) = 1.
- Other properties of the distribution function depend on whether X is discrete or continuous.
If X is a continuous random variable, then
- F is a continuous function and is strictly increasing on the support of X.
If X is a discrete random variable, then
- F is a discontinuous function.
- The discontinuities (jumps) of F occur at the atoms of X (the points having nonzero probability).
- The height of the jump gives the probability of the atom, that is,
Pr(X = x) = F(x) − F(x − ε) whenever ε is small enough so that there are no other jumps between x − ε and x.
- F is constant (its graph is horizontal) between jumps.
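These properties are easy to check numerically. Here is a minimal sketch (our own, not part of the original examples) using the standard normal for the continuous case and the Binomial(10, 1/2) for the discrete case.

```r
# continuous case: F is a probability and is nondecreasing
x <- seq(-4, 4, 0.01)
Fx <- pnorm(x)
stopifnot(all(Fx >= 0 & Fx <= 1))     # between zero and one
stopifnot(all(diff(Fx) >= 0))         # nondecreasing

# discrete case: the jump height at an atom is the probability of the atom
# (evaluating just below 5 gives the value before the jump)
jump <- pbinom(5, 10, 0.5) - pbinom(5 - 1e-9, 10, 0.5)
stopifnot(isTRUE(all.equal(jump, dbinom(5, 10, 0.5))))
```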
Empirical (Cumulative) Distribution Functions
The empirical distribution function Fn is just the distribution function of the empirical distribution, which puts probability 1 / n at each data point of a sample of size n.
If x(i) are the order statistics and all of the order statistics are distinct, then the empirical distribution function jumps from (i − 1) / n to i / n at the point x(i) and is constant except for the jumps at the order statistics.
If exactly k order statistics x(i), …, x(i + k − 1) are tied at some value, then the empirical distribution function jumps from (i − 1) / n to (i + k − 1) / n at that point.
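The jump structure just described can be verified directly with R's ecdf function (the data and seed here are our own, chosen only for illustration).

```r
# with distinct data, Fn takes the value i/n at the order statistic x(i)
set.seed(42)
x <- rnorm(10)
Fn <- ecdf(x)
xs <- sort(x)                      # the order statistics x(1), ..., x(n)
stopifnot(isTRUE(all.equal(Fn(xs), (1:10) / 10)))
# just below x(i) the value is (i - 1)/n, so each jump has height 1/n
stopifnot(isTRUE(all.equal(Fn(xs - 1e-9), (0:9) / 10)))
```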
Distribution Function Examples
The R function ecdf produces empirical (cumulative) distribution functions. The R functions of the form p followed by a distribution name (pnorm, pbinom, etc.) produce theoretical distribution functions.
If you increase the sample size
n the empirical distribution
function will get closer to the theoretical distribution function.
If you change the theoretical distribution function from standard normal
to something else, the empirical and theoretical distribution functions
will still be close to each other, just different. For example, try
standard exponential (pexp in place of pnorm, rexp in place of rnorm).
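The comparison described above can be sketched in a few lines (sample size, seed, and the choice of the standard exponential are our own).

```r
# empirical versus theoretical distribution function, standard exponential
set.seed(1)
n <- 100
x <- rexp(n)
plot(ecdf(x), main = "empirical vs. theoretical distribution function")
curve(pexp(x), add = TRUE, col = "red")
# the maximum vertical discrepancy at the jump points shrinks as n grows
max(abs(ecdf(x)(sort(x)) - pexp(sort(x))))
```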
The Uniform Law of Large Numbers (Glivenko-Cantelli Theorem)
As everywhere else in statistics, the law of large numbers holds. In fact, for fixed x this is just the usual law of large numbers because the empirical distribution function Fn(x) is a sample proportion (the proportion of the Xi that are less than or equal to x) that estimates the true population proportion F(x). Thus the statement that

Fn(x) → F(x),    as n → ∞,

is just the ordinary law of large numbers (the convergence here is either in probability or almost sure).
But much more is true. In fact, the convergence is actually uniform

sup_x |Fn(x) − F(x)| → 0,    as n → ∞,

(a fact known as the Glivenko-Cantelli theorem in advanced probability theory).
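A simulation makes the uniform convergence vivid (sample sizes, seed, and the standard normal population are our own choices).

```r
# sup distance between Fn and F for increasing sample sizes
set.seed(17)
sup.dist <- function(n) {
    x <- sort(rnorm(n))
    Fn <- ecdf(x)
    # on each interval the sup is attained at a jump, approached from
    # above (value i/n) or from below (value (i - 1)/n)
    max(abs(Fn(x) - pnorm(x)), abs(Fn(x) - 1 / n - pnorm(x)))
}
sapply(c(10, 100, 1000, 10000), sup.dist)   # shrinks toward zero
```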
The Asymptotic Distribution (Brownian Bridge)
As everywhere else in statistics, there is also asymptotic normality. In fact, as noted above, for fixed x this is just the usual central limit theorem because Fn(x) is a sample proportion. Thus the statement that

√n ( Fn(x) − F(x) ) → Normal( 0, p(1 − p) ),    as n → ∞,

where p = F(x), is just the ordinary central limit theorem (the convergence here is convergence in distribution).
But much more is true. In fact, the convergence is actually uniform in a sense that we can't even start to explain at this level, because the objects of interest are not scalar-valued random variables, nor even vector-valued random variables, but function-valued random variables Fn. This can be thought of as an infinite-dimensional random vector because it has an infinite number of coordinates Fn(x) for each of the (infinitely many) values of x.
But we won't bother with those technicalities. Suffice it to say that

√n ( Fn(x) − F(x) ),    0 < x < 1,

considered as a random function of x, converges to a Gaussian stochastic process called the Brownian bridge in the special case that the true population distribution is Uniform(0, 1).
Gaussian here refers to the normal distribution, more about this below.
This result would have no place in a course on nonparametrics if it were peculiar to the uniform distribution. But the non-uniform case is not much different. It will be described presently.
We can see what the Brownian bridge looks like by just taking a very large sample size (large enough for the asymptotics to work).
If you repeat the plot over and over, you will see many different realizations of this random function.
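Here is a minimal sketch of such a plot, without the form on the page (sample size and seed are our own): simulate a large uniform sample and plot √n ( Fn(x) − x ).

```r
# one realization of (approximately) the Brownian bridge
set.seed(101)
n <- 1e4
x <- sort(runif(n))
# sqrt(n) * (Fn(x) - x) evaluated at the jump points, where Fn(x(i)) = i/n
b <- sqrt(n) * ((1:n) / n - x)
plot(x, b, type = "l", xlab = "t", ylab = "B(t)")
```

Changing the seed (or just rerunning without one) gives a different realization each time.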
The non-uniform case has to do with a curious fact from theoretical statistics. If X is any continuous random variable and F is its distribution function, then F(X) is a Uniform(0, 1) random variable. This means
- Any continuous random variable X can be mapped to a Uniform(0, 1) random variable U (by the transformation F) and vice versa (by the transformation F−1).
- More importantly for the subject of Kolmogorov-Smirnov tests, this means that the distribution of √n ( Fn(x) − F(x) ) is the same for all continuous population distributions except for a transformation of the x-axis. If we base our procedures only on the vertical distance between Fn(x) and F(x) and ignore horizontal distances (which are transformed), then our procedure will be truly nonparametric.
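The probability integral transform in the first bullet is easy to check by simulation (distribution, sample size, and seed are our own choices).

```r
# if X has continuous distribution function F, then F(X) is Uniform(0, 1)
set.seed(7)
x <- rnorm(1000)
u <- pnorm(x)                      # should look Uniform(0, 1)
stopifnot(all(u > 0 & u < 1))
mean(u)                            # should be near 1/2
ks.test(u, "punif")                # typically no evidence against uniformity
```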
Suprema over the Brownian Bridge
The distributions of both one-sided and two-sided suprema over the Brownian bridge are known. Define

D+ = sup_{0 < t < 1} B(t),

where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D+ is a random variable. The distribution of this random variable is known. It has distribution function

Pr(D+ ≤ d) = 1 − e^{−2d²},    d > 0.
The Brownian bridge is symmetric with respect to being turned upside down (in distribution). Thus the statistic D− defined by putting − inf in place of sup in the definition of D+, that is,

D− = − inf_{0 < t < 1} B(t),

has the same distribution as D+.
Similarly, if we define the two-sided supremum

D = sup_{0 < t < 1} |B(t)|,

where B(t) is the Brownian bridge, then, since the Brownian bridge is a random function, D is a random variable. The distribution of this random variable is also known. It has distribution function

Pr(D ≤ d) = 1 − 2 Σ_{k = 1}^∞ (−1)^{k − 1} e^{−2k²d²},    d > 0.

Although this involves an infinite series, the series is extremely rapidly converging. Usually a few terms suffice for very high accuracy.
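The rapid convergence is easy to see by coding the series directly (the function name and the choice of 10 terms are our own; far fewer terms would do).

```r
# distribution function of D from the series, truncated after a few terms
pkolm <- function(d, terms = 10) {
    k <- 1:terms
    1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * d^2))
}
pkolm(1.358099)   # about 0.95: the usual two-sided 95% critical value
```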
The one-sample Kolmogorov-Smirnov test is based on the test statistic

√n sup_x ( Fn(x) − F(x) )

for an upper-tailed test.
Or on the test statistic

− √n inf_x ( Fn(x) − F(x) ),

defined by putting − inf in place of sup in the formula above, for a lower-tailed test.
Or on the test statistic

√n sup_x | Fn(x) − F(x) |

for a two-tailed test. Usually, we want a two-tailed test.
Because the distribution F hypothesized under the null hypothesis must be completely specified (no free parameters whatsoever), this test is fairly useless, and Hollander and Wolfe do not cover it. However, a very closely related test, the Lilliefors test, covered below, is useful.
For now we just do a toy example using the R function ks.test.
As can be seen by trying out the example, the test is not very powerful even for large sample sizes if the distributions are not too different. Try different sample sizes and degrees of freedom for the t.
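A sketch of the sort of toy example meant here (sample size, degrees of freedom, and seed are our own): t-distributed data tested against a standard normal null.

```r
# one-sample Kolmogorov-Smirnov test: t data, standard normal null
set.seed(21)
x <- rt(200, df = 3)
out <- ks.test(x, "pnorm")
out
```

Try different sample sizes and degrees of freedom and watch how the P-value behaves.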
The Corresponding Confidence Interval
As we said, one-sample Kolmogorov-Smirnov tests are fairly useless from
an applied point of view (however theoretically important). But the
dual confidence interval is of use. It gives a confidence band
for the whole distribution function (Section 11.5 in Hollander and Wolfe).
The programmer who wrote the
ks.test function for R didn't
bother with the confidence interval. So we are on our own again. We
(like Hollander and Wolfe) will only do the two-sided interval. The
one-sided is similar. Just use the distribution of D+
instead of the distribution of D.
Example 11.6 in Hollander and Wolfe.
- The step function is the empirical distribution function.
- The dashed lines on either side mark a 95% confidence band for the true population distribution function. (The probability that the true population distribution function lies entirely within the confidence band gets closer and closer to 0.95 as the sample size goes to infinity.)
- The first half of the code (above the blank line) could be replaced by crit.val <- 1.358099 if there were no interest in confidence levels other than 95%.
- The tricky ylab = expression(F[n](x)) argument to the first plot function makes the y-axis label Fn(x) with the n a subscript. Many more such effects are possible and are described by help(plotmath) (on-line version of this help).
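The code the bullets above describe does not survive in this copy; here is a minimal sketch of such a band (data, sample size, and seed are our own, not the Example 11.6 data).

```r
# 95% asymptotic confidence band for the distribution function
set.seed(3)
x <- rnorm(50)
n <- length(x)
crit.val <- 1.358099                     # 0.95 point of the distribution of D
Fn <- ecdf(x)
plot(Fn, ylab = expression(F[n](x)), main = "")
xs <- sort(x)
# band: Fn plus and minus crit.val / sqrt(n), clamped to [0, 1]
upper <- pmin(Fn(xs) + crit.val / sqrt(n), 1)
lower <- pmax(Fn(xs) - crit.val / sqrt(n), 0)
lines(xs, upper, lty = 2, type = "s")
lines(xs, lower, lty = 2, type = "s")
```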
The Corresponding Point Estimate
Procedures always come in threes: a hypothesis test, the dual confidence interval, and the corresponding point estimate. What is the point estimator here? The empirical distribution function Fn itself, of course.
Two-Sample Kolmogorov-Smirnov Tests
The difference of two independent Brownian bridges is a rescaled Brownian bridge (vertical axis expanded by √2). The obvious statistic for comparing two empirical distribution functions Fm and Gn, which is

sup_x | Fm(x) − Gn(x) |,

has an asymptotic distribution that is a Brownian bridge with the vertical axis expanded by (1 / m + 1 / n)^{1 / 2} because Fm has variance proportional to 1 / m and Gn has variance proportional to 1 / n. Thus the rescaled statistic

(1 / m + 1 / n)^{−1 / 2} sup_x | Fm(x) − Gn(x) |

has the standard Brownian bridge for its asymptotic distribution.
But we don't actually need to know this ourselves. It is buried in
the code for ks.test, which does two-sample as well as one-sample tests.
Example 5.4 in Hollander and Wolfe.
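A sketch of a two-sample test (the data here are simulated with a shift, not the Example 5.4 data).

```r
# two-sample Kolmogorov-Smirnov test: same shape, shifted location
set.seed(8)
x <- rnorm(60)
y <- rnorm(60, mean = 0.5)
out <- ks.test(x, y)
out
```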
It won't bother those with no previous exposure to the R function (covered in the on-line help), but it came as a shock to me that the meaning of
alternative = "less" has changed since the last time I taught
the course. It now means that the distribution function of x is less than the distribution function of y.
But if the distribution function of x is less than that of y, the median of x is greater than that of y.
As the on-line help puts it, the possible values of alternative specify the null hypothesis that the true distribution function of x is equal to, not less than or not greater than the hypothesized distribution function (one-sample case) or the distribution function of y (two-sample case), respectively.
Thus the applied meaning of alternative is just the opposite
of what it is for wilcox.test. If you want
wilcox.test(x, y, alternative = "less")
its competitor is
ks.test(x, y, alternative = "greater")
No real problem as long as you are aware of this issue. (A big problem if you forget!)
The Lilliefors Test
The one-sample Kolmogorov-Smirnov test isn't very useful in practice because it requires a simple null hypothesis, that is, the distribution must be completely specified with all parameters known.
What you want is a test with unknown parameters. You would like the null hypothesis to be all normal distributions (and the alternative all non-normal distributions), or something like that. In other words, you want a Kolmogorov-Smirnov test with estimated parameters.
The reason for the WARNING is that estimating the parameters changes the null distribution of the test statistic. The null distribution is generally not known when parameters are estimated and is not the same as when parameters are known.
Fortunately, when we have a computer, we can approximate the null distribution of the test statistic by simulation.
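The simulation can be sketched as follows (sample size, seed, nsim, and the exponential example data are our own): compute the Kolmogorov-Smirnov statistic with estimated mean and standard deviation, simulate its null distribution under normality, and use the add-one trick for the P-value.

```r
# Lilliefors-style simulation of the null distribution
set.seed(13)
x <- rexp(50)                         # decidedly non-normal example data
ks.stat <- function(x)
    ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic
dhat <- ks.stat(x)
nsim <- 999
# simulate normal data, re-estimating the parameters each time
dsim <- replicate(nsim, ks.stat(rnorm(length(x), mean(x), sd(x))))
p <- (sum(dsim >= dhat) + 1) / (nsim + 1)   # add-one simulation P-value
p
```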
There is random error in this calculation from the simulation.
However, because of the trick of adding 1 to the numerator and denominator
in calculating the P-value it can be used
straight without regard
for the randomness. Under the null hypothesis the probability
Pr(P ≤ k / nsim)
is exactly k / nsim when both the randomness
in the data and the randomness in the simulation are taken into account.
- Bogus P-value: 0.1578
- Simulation P-value: 0.004 ± 0.001
The name Lilliefors test only applies to this procedure of using the Kolmogorov-Smirnov test statistic with estimated null distribution when the null distribution is assumed to be normal. In this case, the test is exact because the test statistic and the normal family of distributions are invariant under location-scale transformations.
If the same procedure were used with another family of distributions that was not a location-scale family, then the test would not be exact. It would be a special case of the parametric bootstrap, which we will eventually cover.