Course Announcement

Instructor: Charles Geyer (5-8511,


Nonparametric statistics is a very strange name for a subject. Very negative. It defines the subject by what it's not. It's not parametric statistics.

Most modern statistics is parametric. What you learn in other statistics courses at the same level as this one, such as Stat 5302 (regression), Stat 5303 (design, ANOVA), Stat 5401 (multivariate), or Stat 5421 (categorical), is parametric statistics.

So what is nonparametrics? The counterculture? Sort of.


In intro stats we learn that parameters are population quantities we want to know and statistics (or estimates) are the corresponding sample quantities we use to try to make inferences about them.

For example

Parameter                       Statistic
population mean                 sample mean
population median               sample median
population 15% trimmed mean     sample 15% trimmed mean

But there is a more advanced meaning of parameter that one meets only in a theory course, such as Stat 5101-5102. A statistical model is parametric if it has only a few parameters. For example the normal family of distributions has two parameters, the mean μ and the variance σ². If we know these two parameters, then we know the distribution (within the normal family).

So there are two related but somewhat different notions of parameter in statistics:

  1. a population quantity we want to estimate
  2. one of a (small) set of variables that (together) determine one probability model in an assumed family of models.

Nonparametrics has parameters in the first sense, but not in the second. It uses families of probability models too large to be determined by a finite set of parameters.


When we are willing to assume a parametric family, this has very strong consequences. The theory of maximum likelihood (Stat 5102) says that the sample mean and sample variance are the best possible estimators of the parameters μ and σ² (when the assumed parametric family is normal). More precisely, these estimators have smaller asymptotic variance than any others, where asymptotic refers to the limit as the sample size goes to infinity.

This concept is called asymptotic efficiency. Note well (this is very important) that this efficiency property depends crucially on the assumed parametric family.


What if the assumed parametric family is incorrect?

Then the whole efficiency story goes out the window, and what was best (asymptotically efficient) can become the worst.

We say an estimator is robust when it tolerates departures from assumptions. The simplest measure of robustness is breakdown point, which is the asymptotic fraction of the data that can be complete junk while the estimator stays sensible.

This point is usually explained in intro stats without being explicit about the terminology.

The sample mean has breakdown point zero (!) because even one wild observation (outlier, gross error) can make an arbitrarily large change in the sample mean.

The sample median has breakdown point 50% because even if just under half of the data are wild, the median is in the majority half. (The median changes as more data go bad, but it doesn't become completely useless.)
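The contrast between the two breakdown points is easy to see numerically. The following is a minimal sketch, not part of the course materials: the sample values, the junk value, and the helper name corrupt are all made up for illustration.

```python
import statistics

# Hypothetical well-behaved sample (values chosen only for illustration).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7]

def corrupt(data, k, junk=1e6):
    """Return a copy of data with the last k values replaced by junk."""
    return data[: len(data) - k] + [junk] * k

# Replace 0, 1, 2, 3 of the 7 observations (always fewer than half) by junk.
for k in range(4):
    bad = corrupt(clean, k)
    print(k, statistics.mean(bad), statistics.median(bad))
```

One junk value is enough to drag the mean far away from 10, while the median stays inside the range of the good data as long as fewer than half of the observations are junk.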

There is a trade-off. Efficiency and robustness generally go in opposite directions. The mean is most efficient and least robust. The median (a 50% trimmed mean) is most robust and least efficient (when the assumed parametric family is normal). Trimmed means are in between on both robustness and efficiency.
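The efficiency half of this trade-off can be checked by simulation. The sketch below assumes standard normal data, a sample size of 25, and a trimmed mean that drops the stated fraction from each end (consistent with calling the median a 50% trimmed mean). The function names, the seed, and the number of replications are arbitrary choices made here, not anything prescribed by the course.

```python
import random
import statistics

def trimmed_mean(data, frac):
    """Mean after dropping a fraction frac of the observations from each end."""
    x = sorted(data)
    k = int(frac * len(x))          # number dropped from each end
    kept = x[k : len(x) - k]
    return sum(kept) / len(kept)

def sampling_variances(n=25, reps=4000, seed=42):
    """Simulated variances of three location estimators for standard normal data."""
    rng = random.Random(seed)
    means, trims, medians = [], [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(x) / n)
        trims.append(trimmed_mean(x, 0.15))
        medians.append(statistics.median(x))
    return (statistics.variance(means),
            statistics.variance(trims),
            statistics.variance(medians))

var_mean, var_trim, var_median = sampling_variances()
print(var_mean, var_trim, var_median)
```

Under these assumptions the simulated variance should be smallest for the mean, largest for the median, and in between for the 15% trimmed mean. Smaller variance means higher efficiency.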

In short, if you have perfect data (no errors, no outliers, fits the assumed normal model), then you should use the sample mean and related procedures (t tests, linear regression, ANOVA, F tests).

This course is about what you do when you don't have perfect data (or at least don't want to take a chance on imperfect data).

The Median and the Sign Test

For a start, consider the sample median as an estimator of the population median. You probably were told a little bit about the median in the descriptive statistics part of your intro stats course, but when you got to inference, the median was forgotten.

So what is the analog of t tests and confidence intervals (procedures associated with the sample mean and variance) for the median?

We will start there. They are called the sign test and related procedures.
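To give a flavor of the sign test before the course gets to it, here is a minimal sketch. It rests on a standard fact not spelled out on this page: if the population is continuous with median m0, the number of observations above m0 is binomial with success probability 1/2. The function name, the data, and the convention of dropping observations exactly equal to m0 are assumptions made for illustration.

```python
from math import comb

def sign_test_p_value(data, m0):
    """Two-sided exact sign test of the hypothesis that the population median is m0."""
    above = sum(x > m0 for x in data)
    below = sum(x < m0 for x in data)
    n = above + below               # observations equal to m0 are dropped (one common convention)
    k = min(above, below)
    # P(Binomial(n, 1/2) <= k), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical data: 9 of the 10 observations lie above the hypothesized median 2.0.
sample = [1.2, 2.3, 3.1, 4.8, 5.0, 6.7, 7.7, 8.1, 9.4, 10.2]
print(sign_test_p_value(sample, 2.0))   # 22/1024, about 0.021
```

Only the signs of the differences from m0 enter the calculation, which is why a wild observation cannot do much damage.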

Some people think the sign test gives up too much efficiency. They don't need that much robustness (50% breakdown point).

So there's lots more, for example the Wilcoxon signed rank test and related procedures, which are competitors of both sign tests and t tests.


One peculiarity of this course is an emphasis on assumptions. One way to think about nonparametrics is that it makes fewer assumptions. "Valid for larger families of probability models" is just another way of saying "makes fewer assumptions."

For comparison

Procedure           Assumption               Parameter
sign test           continuous               median
signed rank test    continuous, symmetric    center of symmetry
t test              normal (Gaussian)        mean

The t test and related procedures make the strongest assumption: the population distribution is normal (Gaussian, bell curve). The parameter about which inference is made is the population mean. If the normality assumption is not exactly correct, the inference may be garbage.

The signed rank test and related procedures make a much weaker assumption: the population distribution is continuous and symmetric. The parameter about which inference is made is the population center of symmetry. If the continuity and symmetry assumptions are not exactly correct, the inference may be garbage.

The sign test and related procedures make the weakest assumption of all: the population distribution is continuous. The parameter about which inference is made is the population median. If the continuity assumption is not exactly correct, the inference may be garbage.

These sets of assumptions become progressively weaker, because the normal distribution is continuous and symmetric. Moreover the parameters are comparable, because the center of symmetry (if there is one) is also the median and (if the mean exists) also the mean.

Thus the sign test competes with the signed rank test when both are valid, and these compete with the t test when it is valid.

Efficiency Revisited

So what is the price in efficiency for the gain in robustness?

Assume, temporarily, that the population is exactly normal. Then we have the following:

Procedure           Efficiency    Robustness (breakdown point)
sign test           63%           50%
signed rank test    95%           29%
t test              100%          0%
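For the curious, the percentages for the sign test and signed rank test match standard closed-form results that this page does not state, so take the formulas below as assumptions brought in from the general literature: the asymptotic relative efficiencies under normality are 2/π and 3/π, and the breakdown point of the estimator associated with the signed rank test is 1 - 1/√2. The variable names are made up here.

```python
from math import pi, sqrt

# Asymptotic relative efficiency versus the t test, assuming exactly normal data.
are_sign = 2 / pi                          # sign test (sample median)
are_signed_rank = 3 / pi                   # Wilcoxon signed rank test

# Breakdown point of the estimator associated with the signed rank test.
breakdown_signed_rank = 1 - 1 / sqrt(2)

print(f"sign test efficiency:        {are_sign:.4f}")
print(f"signed rank efficiency:      {are_signed_rank:.4f}")
print(f"signed rank breakdown point: {breakdown_signed_rank:.4f}")
```

These values agree with the table to within a percentage point.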

Robustness does not depend on the true unknown population distribution, but efficiency does. If we drop the normality assumption, then things become really complicated -- just about anything can happen to efficiency.

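But the main story can be seen from the table above: the signed rank test gives up only 5% of the efficiency of the t test, even when the population is exactly normal, and gains a 29% breakdown point in exchange.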

So why does anyone ever use the t test and related procedures? Dunno. Beats me!

(Actually I do know: mathematical convenience. The most complicated normal-theory procedures, multiple regression and analysis of variance for complicated experimental designs, have no nonparametric analogues -- at least none that are not bleeding edge research unsuitable for a course like this. But this doesn't apply to simple normal-theory procedures that have simple nonparametric competitors.)

The Bootstrap

And now for something completely different, the bootstrap.

The bootstrap is a cutesy name using an old cliche for what is really a generalization of the fundamental problem of statistics: the sample is not the population.

Since we've had intro stats, we never make the fundamental blunder of statistics: confusing the sample and the population.

But one level up, the blunder is sometimes (!) no longer a blunder. For inference about the mean we often plug in the sample standard deviation where the (unknown) population standard deviation should go.

Strictly, this plug-in is wrong. Practically, it works OK if the sample size is large: the t distribution (exact) is almost the same as the z distribution (approximate).
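How wrong is the plug-in for small samples? One way to get a feel for it is to simulate nominal 95% intervals that use the z critical value 1.96 together with the sample standard deviation, and count how often they cover the true mean. This is a rough sketch assuming standard normal data. The function name, sample sizes, seed, and number of replications are arbitrary choices made here.

```python
import random

def z_interval_coverage(n, reps=4000, z=1.96, seed=42):
    """Fraction of nominal 95% plug-in z intervals that cover the true mean (0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(x) / n
        sd = (sum((xi - mean) ** 2 for xi in x) / (n - 1)) ** 0.5   # sample sd
        if abs(mean) <= z * sd / n ** 0.5:
            hits += 1
    return hits / reps

for n in (5, 100):
    print(n, z_interval_coverage(n))
```

With n = 5 the simulated coverage should fall noticeably short of 95%, while with n = 100 it should be close to the nominal level.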

The bootstrap is the most general form of plug-in.

In Efron's nonparametric bootstrap we use a completely nonparametric estimate of the population distribution: the empirical distribution (which simply treats the sample as a population).

Now, we cannot generally do much theoretically with such a plug-in estimate, but computer simulation comes to the rescue.

Together these two ideas (general plug-in and simulation replacing theory) are tremendously liberating. They allow us to do statistics in very complicated situations without traditional theoretical understanding (the computer does the theory, we just run the computer).
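A minimal sketch of the two ideas working together: treat the sample as the population (resample from it with replacement) and let simulation stand in for theory (compute the estimator on each resample and look at the spread). The data, the function name bootstrap_se, the seed, and the number of resamples are hypothetical choices for illustration, not the course's own code.

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=2000, seed=42):
    """Bootstrap standard error of estimator, resampling from the empirical distribution."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)     # n draws with replacement from the sample
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

# Hypothetical sample with one wild observation.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 12.5, 3.0, 2.6]

print("SE of median:", bootstrap_se(sample, statistics.median))
print("SE of mean:  ", bootstrap_se(sample, statistics.mean))
```

Nothing here uses a formula for the standard error of the median. Swapping in a different estimator only means passing a different function.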

The bozo view of the bootstrap is that it makes theoretical statistics completely unnecessary. A more sophisticated view remembers the fundamental problem. The bootstrap does not do the Right Thing (plug-in treats the sample as the population). So how wrong is it?

The answer to this question is very tricky and the course will spend weeks answering it. The short answer (sufficient for this web page) is that sometimes the bootstrap works, and sometimes it doesn't. When a naive bootstrap doesn't work, some modification may.

Despite the shortness of this section, we will spend about half the course on the bootstrap.