
Stat 5601 (Geyer) Examples (Kolmogorov-Smirnov Tests)

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Theory

(Cumulative) Distribution Functions

The (cumulative) distribution function of a (real-valued) random variable X is the function F defined by

F(x) = Pr(X ≤ x),      −∞ < x < ∞

Empirical (Cumulative) Distribution Functions

The empirical distribution function Fn is just the distribution function of the empirical distribution, which puts probability 1 / n at each data point of a sample of size n.

If x(i) are the order statistics, then the empirical distribution function jumps from (i - 1) / n to i / n at the point x(i) and is constant except for the jumps at the order statistics.

Distribution Function Examples

The R function ecdf in the stepfun library produces empirical (cumulative) distribution functions. The R functions of the form p followed by a distribution name (pnorm, pbinom, etc.) produce theoretical distribution functions.
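
For example, here is a minimal sketch (the sample size and the choice of standard normal are arbitrary; in current R, ecdf is in the standard stats package, so library(stepfun) may not be needed):

    ## compare an empirical CDF with the theoretical standard normal CDF
    n <- 100
    x <- rnorm(n)                                # standard normal sample
    plot(ecdf(x), verticals = TRUE, do.points = FALSE,
         main = "Empirical vs. theoretical CDF")
    curve(pnorm(x), add = TRUE, col = "red")     # theoretical CDF overlaid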

Comments

If you increase the sample size n, the empirical distribution function gets closer to the theoretical distribution function.

If you change the theoretical distribution function from standard normal to something else, the empirical and theoretical distribution functions will still be close to each other, just different. For example, try standard exponential (rexp replaces rnorm and pexp replaces pnorm).

The Asymptotic Distribution (Brownian Bridge)

As everywhere else in statistics, there is asymptotic normality. Here it is a bit trickier than usual because the objects of interest are not scalar-valued random variables, nor even vector-valued random variables, but function-valued random variables Fn. A function-valued random variable can be thought of as an infinite-dimensional random vector because it has one coordinate Fn(x) for each of the (infinitely many) values of x.

But we won't bother with those technicalities. Suffice it to say that

√n [ Fn(x) − F(x) ]
converges to a Gaussian stochastic process called the Brownian bridge in the special case that the true population distribution is Uniform(0, 1). The term Gaussian here refers to the normal distribution; more about this in class.

This result would have no place in a course on nonparametrics if it were peculiar to the uniform distribution. But the non-uniform case is not much different. It will be described presently.

We can see what the Brownian bridge looks like by just taking a very large sample size (large enough for the asymptotics to work).
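
A minimal sketch of such a simulation (n = 10000 is our arbitrary choice of "very large"):

    ## approximate Brownian bridge: sqrt(n) * (Fn(t) - t) for a large
    ## Uniform(0, 1) sample, evaluated at the order statistics
    n <- 10000
    t <- sort(runif(n))           # order statistics of a uniform sample
    Fn <- (1:n) / n               # empirical CDF at the order statistics
    plot(t, sqrt(n) * (Fn - t), type = "l",
         xlab = "t", ylab = "", main = "Approximate Brownian bridge")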

If you repeat the plot over and over, you will see many different realizations of this random function.

The non-uniform case has to do with a curious fact from theoretical statistics. If X is any continuous random variable and F is its distribution function, then F(X) is a Uniform(0, 1) random variable. This means

  1. Any continuous random variable X can be mapped to a Uniform(0, 1) random variable U (by the transformation F) and vice versa (by the transformation F⁻¹); a small simulation after this list illustrates the point.
  2. More importantly for the subject of Kolmogorov-Smirnov tests, this means that the distribution of √n ( Fn(x) − F(x) ) is the same for all continuous population distributions except for a transformation of the x-axis. If we base our procedures only on the vertical distance between Fn(x) and F(x) and ignore horizontal distances (which are what get transformed), then our procedure will be truly nonparametric.
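
Here is the small simulation illustrating point 1 (the exponential distribution is an arbitrary choice; any continuous distribution works):

    ## probability integral transform: if X has continuous CDF F,
    ## then F(X) is Uniform(0, 1)
    x <- rexp(1000, rate = 2)     # any continuous distribution will do
    u <- pexp(x, rate = 2)        # apply the distribution's own CDF
    hist(u)                       # approximately flat: Uniform(0, 1)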

Suprema over the Brownian Bridge

The distributions of both one-sided and two-sided suprema over the Brownian bridge are known. Define

D⁺ = sup_{0 < t < 1} B(t)
where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D+ is a random variable. The distribution of this random variable is known. It has distribution function
F_{D⁺}(x) = 1 − exp(−2 x²)

The Brownian bridge is symmetric with respect to being turned upside down (in distribution). Thus the statistic D⁻, defined by replacing sup with inf in the definition of D⁺ and changing the sign, has the same distribution as D⁺.

Similarly, define the two-sided supremum

D = sup_{0 < t < 1} | B(t) |
where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D is a random variable. The distribution of this random variable is also known. It has distribution function
F_D(x) = 1 + 2 ∑_{k = 1}^{∞} (−1)ᵏ exp(−2 k² x²)
Although this involves an infinite series, the series converges extremely rapidly. Usually a few terms suffice for very high accuracy.
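
For illustration, a sketch of evaluating the series in R (the function name pkolmogorov and the cutoff of ten terms are our own choices, not part of R):

    ## two-sided asymptotic K-S distribution function via its series,
    ## F_D(x) = 1 + 2 sum_{k >= 1} (-1)^k exp(-2 k^2 x^2), for x > 0
    pkolmogorov <- function(x, nterms = 10) {
        k <- 1:nterms
        sapply(x, function(xx) 1 + 2 * sum((-1)^k * exp(-2 * k^2 * xx^2)))
    }
    pkolmogorov(1.36)    # roughly 0.95, the usual two-sided 5% critical value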

One-Sample Tests

The one-sample Kolmogorov-Smirnov test is based on the test statistic

Dn⁺ = sup_{−∞ < x < ∞} √n [ Fn(x) − F(x) ]
for an upper-tailed test. Or on the test statistic Dn⁻, defined by replacing sup with inf in the formula above and changing the sign, for a lower-tailed test. Or on the test statistic
Dn = sup_{−∞ < x < ∞} √n | Fn(x) − F(x) |
for a two-tailed test. Usually, we want a two-tailed test.

Because the distribution F hypothesized under the null hypothesis must be completely specified (no free parameters whatsoever), this test is fairly useless, and Hollander and Wolfe do not cover it.

As we shall see when we get to the bootstrap, the test can be used with free parameters to be estimated in the null distribution, but that takes us out of Hollander and Wolfe and into Efron and Tibshirani. So we will put that off.

For now we just do a toy example using the R function ks.test (see the on-line help).
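
A sketch along those lines, with t data tested against a standard normal null (the sample size and degrees of freedom are placeholders to play with):

    ## one-sample K-S test of t data against a standard normal null
    n <- 100              # try different sample sizes
    x <- rt(n, df = 5)    # try different degrees of freedom
    ks.test(x, "pnorm")   # two-tailed test of H0: X ~ N(0, 1)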

As can be seen by trying out the example, the test is not very powerful even for large sample sizes if the distributions are not too different. Try different sample sizes and degrees of freedom for the t.

The Corresponding Confidence Interval

As we said, one-sample Kolmogorov-Smirnov tests are fairly useless from an applied point of view (however theoretically important). But the dual confidence interval is of use. It gives a confidence band for the whole distribution function (Section 11.5 in Hollander and Wolfe).

The programmer who wrote the ks.test function for R didn't bother with the confidence interval. So we are on our own again. We (like Hollander and Wolfe) will only do the two-sided interval. The one-sided is similar. Just use the distribution of D⁺ instead of the distribution of D.
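
Here is a sketch of such a band, using the asymptotic (large-sample) critical value from the one-term approximation 1 − 2 exp(−2 d²) = 1 − α, rather than the exact small-sample tables in Hollander and Wolfe (the data here are placeholders):

    ## asymptotic two-sided confidence band for the distribution function:
    ## Fn(x) +/- d / sqrt(n), where d solves 1 - 2 exp(-2 d^2) = 1 - alpha
    x <- rnorm(100)                   # placeholder data
    n <- length(x)
    alpha <- 0.05
    d <- sqrt(-log(alpha / 2) / 2)    # asymptotic critical value, about 1.36
    Fhat <- ecdf(x)
    plot(Fhat, verticals = TRUE, do.points = FALSE)
    curve(pmin(Fhat(x) + d / sqrt(n), 1), add = TRUE, lty = 2)   # upper band
    curve(pmax(Fhat(x) - d / sqrt(n), 0), add = TRUE, lty = 2)   # lower band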

Example 11.6 in Hollander and Wolfe.


The Corresponding Point Estimate

Procedures always come in threes: a hypothesis test, the dual confidence interval, and the corresponding point estimate. What is the point estimator here?

Two-Sample Tests

The difference of two independent Brownian bridges is a rescaled Brownian bridge (vertical axis expanded by √2). The obvious statistic for comparing two empirical distribution functions Fm and Gn, which is

sup_{−∞ < x < ∞} | Fm(x) − Gn(x) |
has an asymptotic distribution that is a Brownian bridge with the vertical axis expanded by √(1 / m + 1 / n), because Fm has variance proportional to 1 / m and Gn has variance proportional to 1 / n.

Thus

(1 / m + 1 / n)^{−1/2} sup_{−∞ < x < ∞} | Fm(x) − Gn(x) |
has the standard Brownian bridge for its asymptotic distribution.

But we don't actually need to know this ourselves. It is buried in the code for ks.test.
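
A minimal sketch (the two samples here are made up):

    ## two-sample K-S test of H0: both samples have the same
    ## (continuous) distribution
    x <- rnorm(50)                # first sample
    y <- rnorm(70, mean = 0.5)    # second sample, shifted
    ks.test(x, y)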

Example 5.4 in Hollander and Wolfe.
