University of Minnesota, Twin Cities School of Statistics Stat 5601 Rweb
The (cumulative) distribution function of a (real-valued) random variable X is the function F defined by F(x) = Pr(X ≤ x).
The empirical distribution function F_{n} is just the distribution function of the empirical distribution, which puts probability 1 / n at each data point of a sample of size n.
If x_{(i)} are the order statistics, then the empirical distribution function jumps from (i - 1) / n to i / n at the point x_{(i)} and is constant except for the jumps at the order statistics.
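As a quick sanity check (a sketch in plain R; the data values here are made up for illustration), the empirical distribution function evaluated at the i-th order statistic is exactly i / n:

```r
x <- c(2.3, 1.1, 3.7, 0.4, 2.9)    # made-up data
n <- length(x)
Fn <- ecdf(x)                      # empirical distribution function
x.sorted <- sort(x)                # order statistics x_(1), ..., x_(n)
Fn(x.sorted)                       # i / n for i = 1, ..., n
all.equal(as.numeric(Fn(x.sorted)), (1:n) / n)   # TRUE
```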
The R function ecdf in the stepfun library produces empirical (cumulative) distribution functions. The R functions of the form p followed by a distribution name (pnorm, pbinom, etc.) produce theoretical distribution functions.
If you increase the sample size n, the empirical distribution function will get closer to the theoretical distribution function.
If you change the theoretical distribution function from standard normal to something else, the empirical and theoretical distribution functions will still be close to each other, just different. For example, try standard exponential (rexp replaces rnorm and pexp replaces pnorm).
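A sketch of the exponential version of this experiment (the sample size is an arbitrary choice):

```r
n <- 100
x <- rexp(n)                             # standard exponential sample
plot(ecdf(x), main = "empirical vs. theoretical")
curve(pexp(x), add = TRUE, col = "red")  # theoretical distribution function
# the largest vertical discrepancy shrinks at rate 1 / sqrt(n)
max(abs(ecdf(x)(sort(x)) - pexp(sort(x))))
```

Rerunning with larger n makes the two curves harder to tell apart.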
As everywhere else in statistics, there is asymptotic normality. Here it is a bit trickier than usual because the objects of interest are not scalar-valued random variables, nor even vector-valued random variables, but function-valued random variables F_{n}. This can be thought of as an infinite-dimensional random vector because it has an infinite number of coordinates F_{n}(x) for each of the (infinitely many) values of x.
But we won't bother with those technicalities. Suffice it to say that for an i.i.d. Uniform(0, 1) sample, the random function sqrt(n) [F_{n}(t) − t] converges in distribution, as n goes to infinity, to a Gaussian process called the Brownian bridge. Gaussian here refers to the normal distribution; more about this in class.
This result would have no place in a course on nonparametrics if it were peculiar to the uniform distribution. But the non-uniform case is not much different. It will be described presently.
We can see what the Brownian bridge looks like by just taking a very large sample size (large enough for the asymptotics to work).
If you repeat the plot over and over, you will see many different realizations of this random function.
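A sketch of such a simulation (the sample size 10000 is an arbitrary "large" choice): plotting sqrt(n) [F_{n}(t) − t] for a uniform sample gives one approximate realization of the Brownian bridge.

```r
n <- 10000
u <- runif(n)                          # Uniform(0, 1) sample
Fn <- ecdf(u)
t <- seq(0, 1, length = 1001)
plot(t, sqrt(n) * (Fn(t) - t), type = "l",
     ylab = "approximate Brownian bridge")
abline(h = 0, lty = 2)                 # the bridge is tied down at 0 and 1
```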
The non-uniform case has to do with a curious fact from theoretical statistics. If X is any continuous random variable and F is its distribution function, then F(X) is a Uniform(0, 1) random variable. This means the general case reduces to the uniform case: applying the uniform-case result to the transformed variables F(X_{i}) gives the same Brownian bridge limit for sqrt(n) [F_{n}(x) − F(x)].
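This fact is easy to check empirically (a sketch, using the standard normal as the continuous distribution):

```r
x <- rnorm(1000)
u <- pnorm(x)        # probability integral transform F(X)
hist(u)              # looks flat, as Uniform(0, 1) should
ks.test(u, "punif")  # typically a large P-value
```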
The distributions of both one-sided and two-sided suprema over the Brownian bridge are known. Define D^{+} = sup_{x} [F_{n}(x) − F(x)].
The Brownian bridge is symmetric with respect to being turned upside down (in distribution). Thus the statistic D^{−} defined by replacing sup with inf in the definition of D^{+} has the same distribution as D^{+}.
Similarly, if we define the two-sided supremum D = sup_{x} |F_{n}(x) − F(x)|, then D = max(D^{+}, D^{−}).
The one-sample Kolmogorov-Smirnov test is based on the test statistic D^{+} for an upper-tailed test (with sup replaced by inf in the formula above for a lower-tailed test), or on the test statistic D for a two-tailed test.
Because the distribution F hypothesized under the null hypothesis must be completely specified (no free parameters whatsoever), this test is fairly useless, and Hollander and Wolfe do not cover it.
As we shall see when we get to the bootstrap, the test can be used with free parameters to be estimated in the null distribution, but that takes us out of Hollander and Wolfe and into Efron and Tibshirani. So we will put that off.
For now we just do a toy example using the R function ks.test (see its on-line help).
As can be seen by trying out the example, the test is not very powerful even for large sample sizes if the distributions are not too different. Try different sample sizes and degrees of freedom for the t.
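A sketch of the kind of thing to try (the sample size and degrees of freedom are arbitrary choices): a t sample is not standard normal, but the test often fails to notice.

```r
x <- rt(100, df = 5)   # t distribution, not normal
ks.test(x, "pnorm")    # often a non-small P-value despite H0 being false
```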
As we said, one-sample Kolmogorov-Smirnov tests are fairly useless from
an applied point of view (however theoretically important). But the
dual confidence interval is of use. It gives a confidence band
for the whole distribution function (Section 11.5 in Hollander
and Wolfe).
The programmer who wrote the ks.test
function for R didn't
bother with the confidence interval. So we are on our own again. We
(like Hollander and Wolfe) will only do the two-sided interval. The
one-sided is similar. Just use the distribution of D^{+}
instead of the distribution of D.
This gives a 95% confidence band for the true population distribution function. (The probability that the true population distribution function lies entirely within the confidence band gets closer and closer to 0.95 as the sample size goes to infinity.)
We could just use crit.val <- 1.358099 if there were no interest in confidence levels other than 95%.
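A sketch of the band construction (the data here are simulated for illustration; 1.358099 is the asymptotic 95% critical value just mentioned, so the half-width of the band is crit.val / sqrt(n)):

```r
n <- 50
x <- rnorm(n)                        # illustrative data
crit.val <- 1.358099                 # asymptotic 95% two-sided critical value
Fn <- ecdf(x)
xs <- sort(x)
upper <- pmin(Fn(xs) + crit.val / sqrt(n), 1)  # clamp band to [0, 1]
lower <- pmax(Fn(xs) - crit.val / sqrt(n), 0)
plot(Fn, main = "95% confidence band")
lines(xs, upper, type = "s", lty = 2)
lines(xs, lower, type = "s", lty = 2)
```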
Procedures always come in threes: a hypothesis test, the dual confidence interval, and the corresponding point estimate. What is the point estimator here?
The difference of two independent Brownian bridges is a rescaled Brownian bridge (vertical axis expanded by √2). The obvious statistic for comparing two empirical distribution functions F_{m} and G_{n} is D = sup_{x} |F_{m}(x) − G_{n}(x)|. Thus the properly rescaled statistic, sqrt(mn / (m + n)) D, has asymptotically the same distribution as the two-sided supremum in the one-sample case.
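We can check that ks.test computes this two-sample statistic (a sketch with simulated data; since both empirical distribution functions are constant between data points, the sup is attained at one of the pooled data points):

```r
x <- rnorm(10)
y <- rnorm(10, mean = 1)
z <- sort(c(x, y))                  # pooled data points
D <- max(abs(ecdf(x)(z) - ecdf(y)(z)))
all.equal(as.numeric(ks.test(x, y, exact = FALSE)$statistic), D)   # TRUE
```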
But we don't actually need to know this ourselves. It is buried in the code for ks.test.
The answer given by ks.test with exact=TRUE (the default) is completely wrong. There must be a horrible bug. (I haven't investigated.) The answer given by ks.test with exact=FALSE is only a large-sample approximation, which isn't quite right because the sample sizes (10 and 10) aren't really large.
We can get the exact test using perm2fun because Kolmogorov-Smirnov is also a special case of permutation tests. But it takes so long, I will just provide the printout rather than a form for Rweb to run.
> library(stat5601)
> foo <- read.table(url("http://www.stat.umn.edu/geyer/5601/hwdata/t5-7.txt"),
+     header=TRUE)
> attach(foo)
> tstat <- ks.test(x, y, exact=FALSE)$statistic
> tstat
  D 
0.6 
> fred <- function(x, y) ks.test(x, y, exact=FALSE)$statistic
> system.time(tsim <- perm2fun(c(x, y), length(x), fred))
[1] 514.46  10.40 525.25   0.00   0.00
> mean(tsim >= tstat)
[1] 0.05244755
As can be seen from the prompts, this was run in R, not Rweb, on a computer somewhat faster than rweb.stat.umn.edu.
It still took 525.25 seconds = 8 min and 45 sec. But we do
get the answer in the book.