Stat 5601 (Geyer) Examples (Kolmogorov-Smirnov Tests)

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Theory

(Cumulative) Distribution Functions

The (cumulative) distribution function of a (real-valued) random variable X is the function F defined by

F(x) = pr(X ≤ x),      −∞ < x < ∞
• Because F(x) is a probability, it is necessarily between zero and one.
• Because the event X ≤ x increases as x increases, F is a nondecreasing function.
• Because the event X ≤ x decreases to the empty set as x goes to minus infinity,
lim_{x → −∞} F(x) = 0.
• Because the event X ≤ x increases to the whole real line as x goes to plus infinity,
lim_{x → +∞} F(x) = 1.
• If the support of X is not the whole real line, then all of the increase of F takes place on the support, that is, if a ≤ X ≤ b with probability one, then F(x) = 0 for x < a and F(x) = 1 for x ≥ b.
• Other properties of the distribution function depend on whether X is discrete or continuous.
• If X is a continuous random variable, then
• F is a continuous function and is strictly increasing on the support of X.
• If X is a discrete random variable, then
• F is a discontinuous function.
• The discontinuities (jumps) of F occur at the atoms of X (the points having nonzero probability).
• The height of the jump gives the probability of the atom, that is,
pr(X = x) = F(x) - F(x - ε)
whenever ε is small enough so that there are no other jumps between x - ε and x.
• F is constant (its graph is horizontal) between jumps.

Empirical (Cumulative) Distribution Functions

The empirical distribution function F_n is just the distribution function of the empirical distribution, which puts probability 1/n at each data point of a sample of size n.

If x_(i) are the order statistics, then the empirical distribution function jumps from (i − 1)/n to i/n at the point x_(i) and is constant except for the jumps at the order statistics.
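For instance (a minimal sketch using a made-up sample of size 3; any numbers would do):

```
x <- c(2.5, 1.0, 3.7)   # toy sample, n = 3
Fn <- ecdf(x)           # the empirical distribution function
Fn(sort(x))             # returns 1/3, 2/3, 3/3: the values at the order statistics
```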

Distribution Function Examples

The R function `ecdf` in the `stepfun` library produces empirical (cumulative) distribution functions. The R functions of the form `p` followed by a distribution name (`pnorm`, `pbinom`, etc.) produce theoretical distribution functions.

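A minimal sketch of the sort of code the example runs (the sample size and the standard normal distribution are choices, not requirements):

```
n <- 50                                 # sample size (try changing it)
x <- rnorm(n)                           # simulated standard normal data
plot(ecdf(x), verticals = TRUE, do.points = FALSE)
curve(pnorm(x), add = TRUE, lty = 2)    # theoretical distribution function
```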

If you increase the sample size `n`, the empirical distribution function will get closer to the theoretical distribution function.

If you change the theoretical distribution function from standard normal to something else, the empirical and theoretical distribution functions will still be close to each other, just different. For example, try the standard exponential (`rexp` replaces `rnorm` and `pexp` replaces `pnorm`).

The Asymptotic Distribution (Brownian Bridge)

As everywhere else in statistics, there is asymptotic normality. Here it is a bit trickier than usual because the objects of interest are not scalar-valued random variables, nor even vector-valued random variables, but function-valued random variables F_n. Such an object can be thought of as an infinite-dimensional random vector because it has one coordinate F_n(x) for each of the (infinitely many) values of x.

But we won't bother with those technicalities. Suffice it to say that

√n ( F_n(x) − F(x) )
converges to a Gaussian stochastic process called the Brownian bridge in the special case where the true population distribution is Uniform(0, 1). Gaussian here refers to the normal distribution; more about this in class.

This result would have no place in a course on nonparametrics if it were peculiar to the uniform distribution. But the non-uniform case is not much different. It will be described presently.

We can see what the Brownian bridge looks like by just taking a very large sample size (large enough for the asymptotics to work).

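A sketch of this simulation (the sample size is an assumption; it just needs to be large enough for the asymptotics to work):

```
n <- 1e4                                 # large sample size (assumed)
x <- runif(n)                            # Uniform(0, 1) data
Fn <- ecdf(x)
t <- seq(0, 1, length.out = 1001)
plot(t, sqrt(n) * (Fn(t) - t), type = "l",
     ylab = "sqrt(n) * (Fn(t) - t)")     # one realization of the Brownian bridge
```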

If you repeat the plot over and over, you will see many different realizations of this random function.

The non-uniform case has to do with a curious fact from theoretical statistics. If X is any continuous random variable and F is its distribution function, then F(X) is a Uniform(0, 1) random variable. This means

1. Any continuous random variable X can be mapped to a Uniform(0, 1) random variable U (by the transformation F) and vice versa (by the transformation F^(−1)).
2. More importantly for the subject of Kolmogorov-Smirnov tests, this means that the distribution of √n ( F_n(x) − F(x) ) is the same for all continuous population distributions except for a transformation of the x-axis. If we base our procedures only on the vertical distance between F_n(x) and F(x) and ignore horizontal distances (which are what get transformed), then our procedure will be truly nonparametric.
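A quick sketch of this probability integral transform in action (the exponential distribution and its rate are arbitrary choices):

```
x <- rexp(1000, rate = 2)   # any continuous distribution will do
u <- pexp(x, rate = 2)      # u = F(x) should be Uniform(0, 1)
hist(u)                     # looks flat
ks.test(u, "punif")         # a formal check of uniformity
```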

Suprema over the Brownian Bridge

The distributions of both one-sided and two-sided suprema over the Brownian bridge are known. Define

D^+ = sup_{0 < t < 1} B(t)
where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D^+ is a random variable. The distribution of this random variable is known. It has distribution function
F(x) = 1 − exp(−2 x^2)

The Brownian bridge is symmetric with respect to being turned upside down (in distribution). Thus the statistic D^−, defined by replacing sup with inf in the definition of D^+, has the same distribution as D^+ apart from sign: −D^− is distributed as D^+.

Similarly, if we define the two-sided supremum

D = sup_{0 < t < 1} | B(t) |
where B(t) is again the Brownian bridge, then D is also a random variable whose distribution is known. It has distribution function
F(x) = 1 + 2 ∑_{k = 1}^∞ (−1)^k exp(−2 k^2 x^2)
Although this involves an infinite series, the series converges extremely rapidly. Usually a few terms suffice for very high accuracy.
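Because the series converges so fast, these distribution functions are easy to compute. A minimal sketch (the function names and the 20-term truncation are my own choices):

```
# distribution function of D^+, the one-sided supremum
pbridge1 <- function(x) 1 - exp(-2 * x^2)

# distribution function of D, the two-sided supremum (truncated series)
pbridge2 <- function(x, k.max = 20) {
    k <- 1:k.max
    1 + 2 * sum((-1)^k * exp(-2 * k^2 * x^2))
}

pbridge2(1.358099)   # about 0.95: the usual 95% critical value
```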

One-Sample Tests

The one-sample Kolmogorov-Smirnov test is based on the test statistic

D_n^+ = sup_x √n ( F_n(x) − F(x) )
for an upper-tailed test, or on the test statistic D_n^− defined by replacing sup with inf in the formula above for a lower-tailed test, or on the test statistic
D_n = sup_x √n | F_n(x) − F(x) |
for a two-tailed test. Usually, we want a two-tailed test.
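The two-sided statistic can be computed by hand from the order statistics and checked against `ks.test` (the normal sample is a made-up example; note that `ks.test` reports the statistic without the √n factor):

```
x <- rnorm(20)                   # hypothetical sample
n <- length(x)
xx <- sort(x)
i <- 1:n
# the sup of | Fn - F | is attained just before or at a jump of Fn
D <- max(pmax(abs(i / n - pnorm(xx)), abs((i - 1) / n - pnorm(xx))))
D
ks.test(x, "pnorm")$statistic    # agrees with the hand computation
```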

Because the distribution F hypothesized under the null hypothesis must be completely specified (no free parameters whatsoever), this test is fairly useless, and Hollander and Wolfe do not cover it.

As we shall see when we get to the bootstrap, the test can be used with free parameters to be estimated in the null distribution, but that takes us out of Hollander and Wolfe and into Efron and Tibshirani. So we will put that off.

For now we just do a toy example using the R function `ks.test` (see its on-line help).

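A sketch of such a toy example (the sample size and the degrees of freedom of the t are the things to play with):

```
n <- 100
x <- rt(n, df = 5)     # data really are t with 5 degrees of freedom
ks.test(x, "pnorm")    # null hypothesis: standard normal
```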

As can be seen by trying out the example, the test is not very powerful even for large sample sizes if the distributions are not too different. Try different sample sizes and degrees of freedom for the t.

The Corresponding Confidence Interval

As we said, one-sample Kolmogorov-Smirnov tests are fairly useless from an applied point of view (however theoretically important). But the dual confidence interval is of use. It gives a confidence band for the whole distribution function (Section 11.5 in Hollander and Wolfe).

The programmer who wrote the `ks.test` function for R didn't bother with the confidence interval. So we are on our own again. We (like Hollander and Wolfe) will only do the two-sided interval. The one-sided is similar; just use the distribution of D^+ instead of the distribution of D.

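A minimal sketch of code for the band (the normal sample is a hypothetical stand-in for data entered via the URL, and the function name `pbridge2` is reused from the sketch above):

```
# first find the 95% critical value by inverting the series for the
# distribution function of the two-sided supremum D
pbridge2 <- function(d, k.max = 20) {
    k <- 1:k.max
    1 + 2 * sum((-1)^k * exp(-2 * k^2 * d^2))
}
crit.val <- uniroot(function(d) pbridge2(d) - 0.95, c(0.5, 3))$root

x <- rnorm(50)         # hypothetical data
n <- length(x)
xx <- sort(x)
Fn <- ecdf(x)
plot(Fn, verticals = TRUE, do.points = FALSE)
lines(xx, pmin(Fn(xx) + crit.val / sqrt(n), 1), lty = 2, type = "s")
lines(xx, pmax(Fn(xx) - crit.val / sqrt(n), 0), lty = 2, type = "s")
```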

• The step function is the empirical distribution function.
• The dashed lines on either side mark a 95% confidence band for the true population distribution function. (The probability that the true population distribution function lies entirely within the confidence band gets closer and closer to 0.95 as the sample size goes to infinity.)
• The first half of the code (above the blank line) could be replaced by
```
crit.val <- 1.358099
```
if there were no interest in confidence levels other than 95%.

The Corresponding Point Estimate

Procedures always come in threes: a hypothesis test, the dual confidence interval, and the corresponding point estimate. What is the point estimator here?

Two-Sample Tests

The difference of two independent Brownian bridges is a rescaled Brownian bridge (vertical axis expanded by √2). The obvious statistic for comparing two empirical distribution functions F_m and G_n,

sup_{0 < t < 1} | F_m(t) − G_n(t) |
has an asymptotic distribution that is the supremum of a Brownian bridge with the vertical axis expanded by (1/m + 1/n)^(1/2), because F_m has variance proportional to 1/m and G_n has variance proportional to 1/n.

Thus

(1/m + 1/n)^(−1/2) sup_{0 < t < 1} | F_m(t) − G_n(t) |
has the supremum of the standard Brownian bridge as its asymptotic distribution.

But we don't actually need to know this ourselves. It is buried in the code for `ks.test`.

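A sketch of the calls (with made-up samples of size 10 standing in for the data from Hollander and Wolfe):

```
x <- rnorm(10)                  # hypothetical sample one
y <- rnorm(10, mean = 1)        # hypothetical sample two
ks.test(x, y)                   # default settings (exact for small samples)
ks.test(x, y, exact = FALSE)    # force the asymptotic approximation
```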

• Arrrrrggggghhhh!!!!! `ks.test` with `exact=TRUE` (the default) is completely wrong. There must be a horrible bug. (I haven't investigated.)
• It does do the asymptotic approximation correctly, agreeing with the answer in Hollander and Wolfe up to rounding error (they have rounding error; it doesn't). Of course, the exact P-value is the 0.0524 given by Hollander and Wolfe. The 0.0546 given by `ks.test` is only a large-sample approximation, which isn't quite right because the sample sizes (10 and 10) aren't really large.

We can get the exact test using `perm2fun`, because the Kolmogorov-Smirnov test is also a special case of a permutation test. But it takes so long that I will just provide the printout rather than a form for Rweb to run.

```
> library(stat5601)
```

As can be seen from the prompt, this was run in R, not Rweb, on a computer somewhat faster than `rweb.stat.umn.edu`. It still took 525.25 seconds = 8 min and 45 sec. But we do get the answer in the book.
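For the curious, a generic sketch of the same exact computation in plain R (this is not `perm2fun`, whose interface is not shown here, and the samples are made up, not the Hollander and Wolfe data):

```
x <- rnorm(10)                  # hypothetical samples again
y <- rnorm(10, mean = 1)
z <- c(x, y)
m <- length(x)
dobs <- ks.test(x, y, exact = FALSE)$statistic

# enumerate all choose(20, 10) = 184756 ways to relabel the data
# (this is the slow part, as the timing above suggests)
splits <- combn(length(z), m)
dperm <- apply(splits, 2, function(i)
    ks.test(z[i], z[-i], exact = FALSE)$statistic)
mean(dperm >= dobs - sqrt(.Machine$double.eps))   # exact permutation P-value
```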