Statistics 5601 (Geyer, Fall 2013) Kolmogorov-Smirnov and Lilliefors Tests

General Instructions

To do each example, just click the Submit button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Theory

(Cumulative) Distribution Functions

The (cumulative) distribution function of a (real-valued) random variable X is the function F defined by

F(x) = pr(X ≤ x), − ∞ < x < ∞

Because F(x) is a probability, it is necessarily between zero and one.
Because the event X ≤ x increases as x increases, F is a nondecreasing function.
Because the event X ≤ x decreases to the empty set as x goes to minus infinity,
lim_{x → − ∞} F(x) = 0.
Because the event X ≤ x increases to the whole real line as x goes to plus infinity,
lim_{x → + ∞} F(x) = 1.
If the support of X is not the whole real line, then all of the increase of F takes place on the support, that is, if a ≤ X ≤ b with probability one, then F(x) = 0 for all x < a and F(b) = 1.
Other properties of the distribution function depend on whether X is discrete or continuous.
If X is a continuous random variable, then
- F is a continuous function and is strictly increasing on the support of X.
If X is a discrete random variable, then
- F is a discontinuous function.
- The discontinuities (jumps) of F occur at the atoms of X (the points having nonzero probability).
- The height of the jump gives the probability of the atom, that is,
  pr(X = x) = F(x) − F(x − ε)
  whenever ε is small enough so that there are no other jumps between x − ε and x.
- F is constant (its graph is horizontal) between jumps.
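The jump property can be checked numerically. The following is a minimal sketch, assuming a Binomial(10, 0.3) distribution (an arbitrary choice for illustration). Its atoms are the integers 0, …, 10, so any ε less than one leaves no other jump between x − ε and x.

```r
# Binomial(10, 0.3) is an arbitrary illustrative choice
size <- 10
prob <- 0.3
atoms <- 0:size

# any eps < 1 works here because the atoms are one unit apart
eps <- 0.5
jumps <- pbinom(atoms, size, prob) - pbinom(atoms - eps, size, prob)

# the jump heights agree with the probabilities of the atoms
all.equal(jumps, dbinom(atoms, size, prob))

# F is constant between jumps
pbinom(2.1, size, prob) == pbinom(2.9, size, prob)
```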

Empirical (Cumulative) Distribution Functions

The empirical distribution function F_n is just the distribution function of the empirical distribution, which puts probability 1 / n at each data point of a sample of size n.

If x_(i) are the order statistics and all of the order statistics are distinct, then the empirical distribution function jumps from (i − 1) / n to i / n at the point x_(i) and is constant except for the jumps at the order statistics.

If exactly k order statistics x_(i), …, x_{(i + k − 1)} are tied at some value, then the empirical distribution function jumps from (i − 1) / n to (i + k − 1) / n at that point.
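A small check of the tied case, using made-up data with a three-way tie (the numbers are hypothetical, chosen only for illustration). Since k of the n data points sit at the tied value, the jump there has height k / n, so it ends at (i + k − 1) / n.

```r
# hypothetical data with a three-way tie at the value 2
x <- c(0.5, 2, 2, 2, 3.1, 4.7)
n <- length(x)
Fn <- ecdf(x)

# the tied order statistics are x_(2), x_(3), x_(4), so i = 2 and k = 3
i <- 2
k <- 3

# value just to the left of the tie and value at the tie
left <- Fn(2 - 1e-8)
right <- Fn(2)

all.equal(left, (i - 1) / n)
all.equal(right, (i + k - 1) / n)
```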

Distribution Function Examples

The R function ecdf (on-line help) produces empirical (cumulative) distribution functions. The R functions of the form p followed by a distribution name (pnorm, pbinom, etc.) produce theoretical distribution functions.
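A minimal sketch of such a comparison, assuming a standard normal sample of size 100 (the seed, the sample size, and the plotting range are arbitrary choices):

```r
set.seed(42)          # arbitrary seed, only for reproducibility
n <- 100
x <- rnorm(n)

# empirical distribution function of the sample (a step function)
Fn <- ecdf(x)
plot(Fn, main = "Empirical and theoretical distribution functions")

# theoretical distribution function the sample was drawn from
curve(pnorm, from = -4, to = 4, add = TRUE, col = "red")
```

The object returned by ecdf is itself an R function, so Fn(0) gives the proportion of the sample less than or equal to zero.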

Comments

If you increase the sample size n, the empirical distribution function will get closer to the theoretical distribution function.

If you change the theoretical distribution function from standard normal to something else, the empirical and theoretical distribution functions will still be close to each other; only the shape they share changes. For example, try standard exponential (rexp replaces rnorm and pexp replaces pnorm).

The Uniform Law of Large Numbers (Glivenko-Cantelli Theorem)

As everywhere else in statistics, the law of large numbers holds. In fact, for fixed x this is just the usual law of large numbers because the empirical distribution function F_n(x) is a sample proportion (the proportion of X_i that are less than or equal to x) that estimates the true population proportion F(x). Thus the statement that

F_n(x) → F(x), n → ∞

is just the ordinary law of large numbers (the convergence here is either in probability or almost sure).

But much more is true. In fact, the convergence is actually uniform

sup_{− ∞ < x < + ∞} | F_n(x) − F(x) | → 0, n → ∞

(a fact known as the Glivenko-Cantelli theorem in advanced probability theory).
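The uniform convergence can be watched numerically. The sketch below assumes standard normal data and uses a helper function, sup.dist, which is our own name and not part of R. For continuous F the supremum is attained at, or just to the left of, an order statistic, so it reduces to a maximum over finitely many values.

```r
# sup over x of |F_n(x) - F(x)| for a continuous F; the sup is attained
# at, or just to the left of, an order statistic
sup.dist <- function(x, cdf) {
    n <- length(x)
    Fx <- cdf(sort(x))
    max(seq_len(n) / n - Fx, Fx - (seq_len(n) - 1) / n)
}

set.seed(42)
for (n in c(10, 100, 1000, 10000)) {
    x <- rnorm(n)
    cat("n =", n, " sup distance =", sup.dist(x, pnorm), "\n")
}
```

The distances need not decrease at every step, since each line uses a fresh random sample, but they should shrink at roughly the rate 1 / √n.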

The Asymptotic Distribution (Brownian Bridge)

As everywhere else in statistics, there is also asymptotic normality. In fact, as noted above, for fixed x this is just the usual central limit theorem because F_n(x) is a sample proportion. The statement that

√n [ F_n(x) − F(x) ] → Normal(0, p (1 − p)), n → ∞,

where p = F(x), is just the ordinary central limit theorem (the convergence here is convergence in distribution).
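A quick simulation check of the variance p (1 − p), assuming standard normal data and the fixed point x = 0.5 (both arbitrary choices, as are the sample size and the number of replicates):

```r
set.seed(42)
n <- 400        # sample size for each replicate
nsim <- 5000    # number of replicates
x0 <- 0.5       # the fixed point x
p <- pnorm(x0)

# each replicate: sqrt(n) * (F_n(x0) - F(x0)) for a fresh normal sample
z <- replicate(nsim, sqrt(n) * (mean(rnorm(n) <= x0) - p))

c(simulated = var(z), theoretical = p * (1 - p))
```

A histogram of z, made with hist(z), should also look approximately normal.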

But much more is true. In fact, the convergence is actually uniform in a sense that we can't even start to explain at this level, because the objects of interest are not scalar-valued random variables, nor even vector-valued random variables, but function-valued random variables F_n. Such a random function can be thought of as an infinite-dimensional random vector because it has one coordinate F_n(x) for each of the (infinitely many) values of x.

But we won't bother with those technicalities. Suffice it to say that

√n [ F_n(x) − F(x) ]

converges to a Gaussian stochastic process called the Brownian bridge in the special case that the true population distribution is Uniform(0, 1). The Gaussian here refers to the normal distribution; more about this in class.

This result would have no place in a course on nonparametrics if it were peculiar to the uniform distribution. But the non-uniform case is not much different. It will be described presently.

We can see what the Brownian bridge looks like by just taking a very large sample size (large enough for the asymptotics to work).

If you repeat the plot over and over, you will see many different realizations of this random function.
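A minimal sketch of such a plot, assuming a Uniform(0, 1) sample of size 10000 (the sample size and seed are arbitrary). At the i-th order statistic the empirical distribution function equals i / n and the true distribution function equals the order statistic itself.

```r
set.seed(42)   # remove this line to see a new realization on each run
n <- 1e4
u <- sort(runif(n))

# at the i-th order statistic, F_n = i / n and F(t) = t for Uniform(0, 1)
b <- sqrt(n) * (seq_len(n) / n - u)

# the process is tied down to zero at t = 0 and t = 1
plot(c(0, u, 1), c(0, b, 0), type = "l",
    xlab = "t", ylab = "B(t)", main = "Approximate Brownian bridge")
abline(h = 0, lty = "dotted")
```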

The non-uniform case has to do with a curious fact from theoretical statistics. If X is any continuous random variable and F is its distribution function, then F(X) is a Uniform(0, 1) random variable. This means

- Any continuous random variable X can be mapped to a Uniform(0, 1) random variable U (by the transformation F) and vice versa (by the transformation F⁻¹).
- More importantly for the subject of Kolmogorov-Smirnov tests, this means that the distribution of √n ( F_n(x) − F(x) ) is the same for all continuous population distributions except for a transformation of the x-axis. If we base our procedures only on the vertical distance between F_n(x) and F(x) and ignore horizontal distances (which are transformed), then our procedure will be truly nonparametric.
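The invariance can be seen directly. The sketch below assumes standard exponential data (an arbitrary continuous choice). Transforming the data by their own distribution function changes the horizontal axis but leaves every vertical distance, and hence the Kolmogorov-Smirnov statistic, unchanged.

```r
set.seed(42)
n <- 50
x <- rexp(n)              # standard exponential, an arbitrary continuous choice
u <- pexp(x)              # transformed by its own distribution function

# same vertical distances, so the same test statistic
d.x <- ks.test(x, pexp)$statistic
d.u <- ks.test(u, punif)$statistic
all.equal(d.x, d.u)
```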

Suprema over the Brownian Bridge

The distributions of both one-sided and two-sided suprema over the Brownian bridge are known. Define

D⁺ = sup_{0 < t < 1} B(t)

where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D⁺ is a random variable. The distribution of this random variable is known. It has distribution function

F_D⁺(x) = 1 − exp(− 2 x²)

The Brownian bridge is symmetric with respect to being turned upside down (in distribution). Thus the statistic D⁻, defined by replacing sup with inf in the definition of D⁺, has the same distribution as − D⁺.

Similarly, define the two-sided supremum

D = sup_{0 < t < 1} | B(t) |

where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D is a random variable. The distribution of this random variable is also known. It has distribution function

F_D(x) = 1 + 2 ∑_{k = 1}^∞ (− 1)^k exp(− 2 k² x²)

Although this involves an infinite series, the series converges extremely rapidly. Usually a few terms suffice for very high accuracy.
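Both distribution functions are easy to code. In the sketch below the names pdplus and pd are our own (they are not R functions), and the truncation point kmax is an assumption that is adequate when x is not very close to zero.

```r
# distribution function of D+ (one-sided supremum)
pdplus <- function(x) ifelse(x > 0, 1 - exp(-2 * x^2), 0)

# distribution function of D (two-sided supremum), series truncated
# at kmax terms; kmax = 100 is far more than needed for moderate x
pd <- function(x, kmax = 100) {
    k <- seq_len(kmax)
    sapply(x, function(xi) {
        if (xi <= 0) return(0)
        1 + 2 * sum((-1)^k * exp(-2 * k^2 * xi^2))
    })
}

pd(1.358099)              # close to 0.95
pd(1, kmax = 3) - pd(1)   # truncation error with only three terms
```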

One-Sample Tests

The one-sample Kolmogorov-Smirnov test is based on the test statistic

D⁺_n = sup_{− ∞ < x < + ∞} √n [ F_n(x) − F(x) ]

for an upper-tailed test, or on the test statistic D⁻_n, defined by replacing sup with inf in the formula above, for a lower-tailed test, or on the test statistic

D_n = sup_{− ∞ < x < + ∞} √n | F_n(x) − F(x) |

for a two-tailed test. Usually, we want a two-tailed test.

Because the distribution F hypothesized under the null hypothesis must be completely specified (no free parameters whatsoever), this test is fairly useless, and Hollander and Wolfe do not cover it. However, a very closely related test, the Lilliefors test, covered below, is useful.

For now we just do a toy example using the R function ks.test (on-line help).

As can be seen by trying out the example, the test is not very powerful even for large sample sizes if the distributions are not too different. Try different sample sizes and degrees of freedom for the t.
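A minimal sketch of such a toy example, assuming the data come from a t distribution and the null hypothesis is standard normal (the seed, the sample size, and the degrees of freedom are arbitrary choices):

```r
set.seed(42)
n <- 100      # sample size, worth varying
df <- 5       # degrees of freedom of the t distribution, worth varying
x <- rt(n, df)

# null hypothesis: the data are standard normal (completely specified)
out <- ks.test(x, pnorm)
out
```

Parameters of the hypothesized distribution, if any, are passed as further arguments, as in ks.test(x, pnorm, mean = 1, sd = 2).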

The Corresponding Confidence Interval

As we said, one-sample Kolmogorov-Smirnov tests are fairly useless from an applied point of view (however theoretically important). But the dual confidence interval is of use. It gives a confidence band for the whole distribution function (Section 11.5 in Hollander and Wolfe).

The programmer who wrote the ks.test function for R didn't bother with the confidence interval. So we are on our own again. We (like Hollander and Wolfe) will only do the two-sided interval. The one-sided is similar. Just use the distribution of D⁺ instead of the distribution of D.

Example 11.6 in Hollander and Wolfe.

Comments.

The step function is the empirical distribution function.
The dashed lines on either side mark a 95% confidence band for the true population distribution function. (The probability that the true population distribution function lies entirely within the confidence band gets closer and closer to 0.95 as the sample size goes to infinity.)
The first half of the code (above the blank line) could be replaced by
```
crit.val <- 1.358099
```
if there were no interest in confidence levels other than 95%.
The tricky ylab = expression(F[n](x)) argument to the first plot function makes the y-axis label F_n(x) with n a subscript. Many more such effects are possible and are described by
```
help(plotmath)
```
(on-line version of this help).
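For readers without the textbook data at hand, here is a sketch of one way to compute such a band. It assumes made-up normal data in place of the data of Example 11.6 and uses the asymptotic distribution of D; the function name pd is our own.

```r
# hypothetical data, standing in for the data of the textbook example
set.seed(42)
x <- rnorm(20)
n <- length(x)
conf.level <- 0.95

# two-sided asymptotic distribution function of D (series truncated)
pd <- function(x, kmax = 100) {
    k <- seq_len(kmax)
    1 + 2 * sum((-1)^k * exp(-2 * k^2 * x^2))
}
# critical value: the conf.level quantile of D, found by root finding
crit.val <- uniroot(function(x) pd(x) - conf.level,
    lower = 0.5, upper = 3, tol = 1e-10)$root

# band: empirical distribution function plus or minus crit.val / sqrt(n),
# clipped to [0, 1]; drawn only between the smallest and largest data points
xs <- sort(x)
fn <- seq_len(n) / n
upper <- pmin(fn + crit.val / sqrt(n), 1)
lower <- pmax(fn - crit.val / sqrt(n), 0)

plot(ecdf(x), ylab = expression(F[n](x)), main = "")
lines(xs, upper, type = "s", lty = "dashed")
lines(xs, lower, type = "s", lty = "dashed")
```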

The Corresponding Point Estimate

Procedures always come in threes: a hypothesis test, the dual confidence interval, and the corresponding point estimate. What is the point estimator here?

Two-Sample Tests

The difference of two independent Brownian bridges is a rescaled Brownian bridge (vertical axis expanded by √2). The obvious statistic for comparing two empirical distribution functions F_m and G_n is

sup_{− ∞ < x < + ∞} | F_m(x) − G_n(x) |

Under the null hypothesis that the two population distributions are the same, the difference F_m − G_n behaves asymptotically like a Brownian bridge with the vertical axis expanded by (1 / m + 1 / n)^{1 / 2}, because F_m has variance proportional to 1 / m and G_n has variance proportional to 1 / n.

Thus

(1 / m + 1 / n)^{− 1 / 2} sup_{− ∞ < x < + ∞} | F_m(x) − G_n(x) |

has the same asymptotic distribution as D, the two-sided supremum over the standard Brownian bridge.

But we don't actually need to know this ourselves. It is buried in the code for ks.test.
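Still, it is easy to confirm what ks.test computes. The sketch below uses made-up normal samples (the sample sizes and the shift are arbitrary) and compares a hand computation of the statistic with the one ks.test reports.

```r
set.seed(42)
m <- 30
n <- 40
x <- rnorm(m)               # hypothetical samples
y <- rnorm(n, mean = 0.5)

Fm <- ecdf(x)
Gn <- ecdf(y)

# both step functions only change at data points, so the sup is a max
z <- sort(c(x, y))
d <- max(abs(Fm(z) - Gn(z)))

all.equal(d, unname(ks.test(x, y)$statistic))

# rescaled statistic, to be compared with the distribution of D
d / sqrt(1 / m + 1 / n)
```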

Example 5.4 in Hollander and Wolfe.

Comment

It won't bother those with no previous exposure to the R ks.test function (on-line help), but it came as a shock to me that the meaning of alternative = "less" has changed since the last time I taught the course. It now means

The possible values "two.sided", "less" and "greater" of alternative specify the null hypothesis that the true distribution function of x is equal to, not less than or not greater than the hypothesized distribution function (one-sample case) or the distribution function of y (two-sample case), respectively.

But if the distribution function of x is less than that of y, the median of x is greater than that of y.

So the applied meaning of alternative is just the opposite of what it is for wilcox.test. If you want

wilcox.test(x, y, alternative = "less")

its competitor is

ks.test(x, y, alternative = "greater")

No real problem as long as you are aware of this issue. (A big problem if you forget!)
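The issue is easy to see on made-up data in which x tends to be smaller than y (the samples below are hypothetical, chosen only for illustration):

```r
set.seed(42)
x <- rnorm(50)               # hypothetical data: x tends to be smaller
y <- rnorm(50, mean = 1)     # than y

# both of these look for x shifted to the left of y
p.w <- wilcox.test(x, y, alternative = "less")$p.value
p.g <- ks.test(x, y, alternative = "greater")$p.value

# the opposite choice for ks.test looks in the wrong direction
p.l <- ks.test(x, y, alternative = "less")$p.value

c(wilcox.less = p.w, ks.greater = p.g, ks.less = p.l)
```

With a shift this large the first two P-values should be small and the third large.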

The Lilliefors Test

The one-sample Kolmogorov-Smirnov test isn't very useful in practice because it requires a simple null hypothesis, that is, the distribution must be completely specified with all parameters known.

What you want to do is test with unknown parameters. You would like the null hypothesis to be all normal distributions (and the alternative all non-normal distributions) or something like that. That is, you want something like this, a Kolmogorov-Smirnov test with estimated parameters.

The reason for the WARNING is that estimating the parameters changes the null distribution of the test statistic. The null distribution is generally not known when parameters are estimated and is not the same as when parameters are known.

Fortunately, when we have a computer, we can approximate the null distribution of the test statistic by simulation.

There is random error in this calculation from the simulation. However, because of the trick of adding 1 to the numerator and denominator in calculating the P-value, it can be used as it stands, without regard for that randomness. Under the null hypothesis the probability Pr(P ≤ k / n_sim) is exactly k / n_sim when both the randomness in the data and the randomness in the simulation are taken into account.
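The whole procedure can be sketched as follows. The data here are made up (a deliberately non-normal exponential sample), the helper ks.stat is our own name, and the number of simulations is an arbitrary choice, so the resulting P-value has nothing to do with the figures quoted for the course example.

```r
set.seed(42)
x <- rexp(30)        # hypothetical, deliberately non-normal data
n <- length(x)

# Kolmogorov-Smirnov statistic with estimated mean and standard deviation
ks.stat <- function(x)
    unname(ks.test(x, pnorm, mean = mean(x), sd = sd(x))$statistic)

dhat <- ks.stat(x)

# simulate the statistic under the null; by location-scale invariance
# any normal distribution will do, so use the standard normal
nsim <- 999
dsim <- replicate(nsim, ks.stat(rnorm(n)))

# add 1 to numerator and denominator: the observed statistic counts as
# one of the draws
pval <- (sum(dsim >= dhat) + 1) / (nsim + 1)
pval
```

In this sketch nsim counts only the simulated statistics, so the possible P-values are multiples of 1 / (nsim + 1).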

Summary

Bogus P-value: 0.1578
Simulation P-value: 0.004 ± 0.001

Comment

The name Lilliefors test only applies to this procedure of using the Kolmogorov-Smirnov test statistic with estimated null distribution when the null distribution is assumed to be normal. In this case, the test is exact because the test statistic and the normal family of distributions are invariant under location-scale transformations.

If the same procedure were used with another family of distributions that was not a location-scale family, then the test would not be exact. It would be a special case of the parametric bootstrap, which we will eventually cover.
