This web page describes the subject matter of the course. For procedural details, see the General Info page.
- Nonparametric Methods
- Stat 5601
- Fall 2007
- 11:15-12:05 MWF
- FordH 115
|Instructor:||Charles Geyer (5-8511, firstname.lastname@example.org)|
Nonparametric statistics is a very strange name for a subject. Very negative. It defines the subject by what it's not. It's not parametric statistics.
Most modern statistics is parametric. What you learn in other statistics courses at the same level as this one, such as Stat 5302 (regression), Stat 5303 (design, ANOVA), Stat 5401 (multivariate), or Stat 5421 (categorical) is parametric statistics.
So what is nonparametrics? The counterculture? Sort of.
In intro stats we learn that parameters are population quantities we want to know and statistics (or estimates) are the corresponding sample quantities we use to try to make inferences about them.
|parameter||statistic|
|population mean||sample mean|
|population median||sample median|
|population 15% trimmed mean||sample 15% trimmed mean|
But there is a more advanced meaning of parameter that one meets only in a theory course, such as Stat 5101-5102. A statistical model is parametric if it has only a few parameters. For example, the normal family of distributions has two parameters, the mean μ and the variance σ². If we know these two parameters, then we know the distribution (within the normal family).
So there are two related but somewhat different notions of parameter:
- a population quantity we want to estimate
- one of a (small) set of variables that (together) determine one probability model in an assumed family of models.
Nonparametric statistics uses parameters in the first sense, but not in the second.
It uses families of probability models too large to be determined by a finite
set of parameters.
When we are willing to assume a parametric family, this has very strong
consequences. The theory of maximum likelihood (Stat 5102) says that
the sample mean and sample variance are the best possible estimators
of the parameters μ and σ² (when the assumed parametric
family is normal). More precisely,
these estimators have smaller asymptotic variance than any others,
where asymptotic refers to the limit as the sample size goes to infinity.
This concept is called asymptotic efficiency. Note well (this is very important) that this efficiency property depends crucially on the assumed parametric family.
What if the assumed parametric family is incorrect?
Then the whole efficiency story goes out the window, and what was best (asymptotically efficient) can become the worst.
We say an estimator is robust when it tolerates departures from assumptions. The simplest measure of robustness is breakdown point, which is the asymptotic fraction of the data that can be complete junk while the estimator stays sensible.
This point is usually explained in intro stats without being explicit about the terminology.
The sample mean has breakdown point zero (!) because even one wild observation (outlier, gross error) can make an arbitrarily large change in the sample mean.
The sample median has breakdown point 50% because even if just under half of the data are wild, the median is in the majority half. (The median changes as more data go bad, but it doesn't become completely useless.)
There is a trade-off. Efficiency and robustness generally go in opposite directions. The mean is most efficient and least robust. The median (a 50% trimmed mean) is most robust and least efficient (when the assumed parametric family is normal). Trimmed means are in between on both robustness and efficiency.
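To make the breakdown point concrete, here is a minimal sketch using only the Python standard library. The data values and the helper `trimmed_mean` are made up for illustration; they are not from the course materials.

```python
import statistics

def trimmed_mean(data, fraction):
    """Mean after dropping `fraction` of the points from each end."""
    xs = sorted(data)
    k = int(fraction * len(xs))              # number trimmed from each tail
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return statistics.fmean(kept)

# Made-up clean sample (illustrative values only).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 9.9]

# Replace one observation with a gross error.
dirty = clean.copy()
dirty[0] = 1.0e6

for name, data in [("clean", clean), ("one outlier", dirty)]:
    print(name,
          "| mean:", statistics.fmean(data),
          "| 15% trimmed mean:", trimmed_mean(data, 0.15),
          "| median:", statistics.median(data))
```

One wild value drags the mean far away from 10, while the trimmed mean and the median barely move.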
In short, if you have perfect data (no errors, no outliers, fits the assumed normal model), then you should use the sample mean and related procedures (t tests, linear regression, ANOVA, F tests).
This course is about what you do when you don't have perfect data (or at least don't want to take a chance on imperfect data).
The Median and the Sign Test
For a start, consider the sample median as an estimator of the population
median. You probably were told a little bit about the median in the
descriptive statistics part of your intro stats course,
but when you got to inference, the median was forgotten.
So what are the analogs of t tests and confidence intervals (procedures associated with the sample mean and variance) for the median?
We will start there. They are called the sign test and related procedures.
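As a rough sketch of the idea, the sign test counts how many observations fall above the hypothesized median and refers that count to a Binomial(n, 1/2) distribution. The function name `sign_test`, the convention of dropping observations equal to the hypothesized median, and the data are assumptions made for this illustration.

```python
from math import comb

def sign_test(data, median0):
    """Two-sided exact sign test of H0: population median == median0.

    Returns (number above median0, number of nonzero differences, p-value).
    Observations equal to median0 are dropped, a common convention.
    """
    diffs = [x - median0 for x in data if x != median0]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    # Under H0 the count of positive signs is Binomial(n, 1/2).
    k = min(n_pos, n - n_pos)
    tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n
    return n_pos, n, min(1.0, 2.0 * tail)

# Hypothetical data, for illustration only.
sample = [3.1, 4.7, 5.2, 5.9, 6.3, 6.8, 7.4, 8.0, 8.6, 9.9]
print(sign_test(sample, 5.0))
```

Only the signs of the differences enter the calculation, which is why no assumption beyond continuity is needed.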
Some people think the sign test gives up too much efficiency. They don't need that much robustness (50% breakdown point).
So there's lots more, for example the Wilcoxon signed rank test and related procedures, which compete with both sign tests and t tests.
One peculiarity of this course is an emphasis on
One way to think about nonparametrics is that it makes fewer assumptions.
Being valid for larger families of probability models is just another way
of saying it makes fewer assumptions.
|procedure||assumptions||parameter|
|sign test||continuous||median|
|signed rank test||continuous, symmetric||center of symmetry|
|t test||normal (Gaussian)||mean|
The t test and related procedures make the strongest assumption: the population distribution is normal (Gaussian, bell curve). The parameter about which inference is made is the population mean. If the normality assumption is not exactly correct, the inference may be garbage.
The signed rank test and related procedures make a much weaker assumption: the population distribution is continuous and symmetric. The parameter about which inference is made is the population center of symmetry. If the continuity and symmetry assumptions are not exactly correct, the inference may be garbage.
The sign test and related procedures make the weakest assumption of all: the population distribution is continuous. The parameter about which inference is made is the population median. If the continuity assumption is not exactly correct, the inference may be garbage.
These sets of assumptions become progressively weaker, because the normal distribution is continuous and symmetric. Moreover the parameters are comparable, because the center of symmetry (if there is one) is also the median and (if the mean exists) also the mean.
Thus the sign test competes with the signed rank test when both are valid, and these compete with the t test when it is valid.
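For comparison with the sign test, here is a hedged sketch of the exact Wilcoxon signed rank test for small samples. It assumes no zero differences and no ties among the absolute differences, which the continuity assumption guarantees with probability one. The function name and the data are invented for this example.

```python
def signed_rank_test(data, center0):
    """Two-sided exact Wilcoxon signed rank test of H0: center == center0.

    Assumes no zero differences and no tied absolute differences.
    Returns (W+, p-value), where W+ is the sum of the ranks of the
    positive differences.
    """
    diffs = [x - center0 for x in data]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)

    # Null distribution: each rank 1..n joins W+ independently with
    # probability 1/2.  counts[s] = number of sign patterns with W+ == s.
    max_w = n * (n + 1) // 2
    counts = [1] + [0] * max_w
    for r in range(1, n + 1):
        for s in range(max_w, r - 1, -1):
            counts[s] += counts[s - r]
    total = 2 ** n
    lower = sum(counts[: w_plus + 1]) / total    # P(W+ <= observed)
    upper = sum(counts[w_plus:]) / total         # P(W+ >= observed)
    return w_plus, min(1.0, 2.0 * min(lower, upper))

# Hypothetical data, for illustration only.
sample = [0.4, -0.1, 1.3, 0.9, -0.3, 1.8, 0.7, 1.1]
print(signed_rank_test(sample, 0.0))
```

Unlike the sign test, this uses the sizes of the differences (through their ranks) as well as their signs, which is where the symmetry assumption comes in.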
So what is the price in efficiency for the gain in robustness?
Assume, temporarily, that the population is exactly normal. Then we have the following.
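|procedure||asymptotic relative efficiency||breakdown point|
|sign test||64%||50%|
|signed rank test||95%||29%|
|t test||100%||0%|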
Robustness does not depend on the true unknown population distribution, but efficiency does. If we drop the normality assumption, then things become really complicated -- just about anything can happen to efficiency.
But the main story can be seen from the table:
- The signed rank test and related procedures are nearly as efficient as the t test and related procedures, even when the normality assumptions of the latter are true (and perhaps more efficient when they are false).
- For the slight loss of efficiency, there is a huge gain in robustness.
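Efficiency claims like these can be checked roughly by simulation. The sketch below compares the sampling variance of the mean and the median for normal data; the ratio should land near the standard asymptotic value 2/π (about 64%) for the median, which is also the efficiency of the sign test relative to the t test. The 95% figure for the signed rank test corresponds to 3/π, a standard result not derived on this page. The sample size, replication count, and seed are arbitrary choices for illustration.

```python
import random
import statistics
from math import pi

def relative_efficiency(n=101, reps=2000, seed=42):
    """Estimate Var(sample mean) / Var(sample median) for N(0, 1) samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
    return statistics.pvariance(means) / statistics.pvariance(medians)

print("simulated mean-vs-median efficiency:", relative_efficiency())
print("asymptotic value 2/pi:", 2 / pi)
print("signed rank vs t, 3/pi:", 3 / pi)
```

This only covers the normal case; for other population distributions the ratio can come out very differently.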
So why does anyone ever use the t test and related procedures? Dunno. Beats me!
(Actually I do know: mathematical convenience. The most complicated normal-theory procedures, multiple regression and analysis of variance for complicated experimental designs, have no nonparametric analogues -- at least none that are not bleeding edge research unsuitable for a course like this. But this doesn't apply to simple normal-theory procedures that have simple nonparametric competitors.)
And now for something completely different, the bootstrap.
The bootstrap is a cutesy name using an
old cliche for what is really a generalization of the fundamental
problem of statistics: the sample is not the population.
Since we've had intro stats we never make the fundamental blunder of statistics: confusing the sample and the population.
But one level up, the
blunder is sometimes (!) no longer a blunder.
For inference about the mean we often
plug in the sample standard
deviation where the (unknown) population standard deviation should go.
Strictly, this plug-in is wrong. Practically, it works OK if the sample size is large: the t distribution (exact) is almost the same as the z distribution (approximate).
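A small simulation suggests how wrong the plug-in is for small samples and how harmless it becomes for large ones. This sketch builds nominal 95% z intervals with the sample standard deviation plugged in and counts how often they cover the true mean. The sample sizes, replication count, and seed are arbitrary illustrative choices.

```python
import random
from math import fsum, sqrt
from statistics import NormalDist

def z_interval_coverage(n, reps=4000, seed=1):
    """Fraction of nominal 95% z intervals, with the sample standard
    deviation plugged in for sigma, that cover the true mean (0 here)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)          # about 1.96
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = fsum(x) / n
        s = sqrt(fsum((xi - xbar) ** 2 for xi in x) / (n - 1))
        if abs(xbar) <= z * s / sqrt(n):
            hits += 1
    return hits / reps

for n in (5, 200):
    print("n =", n, "coverage:", z_interval_coverage(n))
```

With n = 5 the coverage falls noticeably short of 95%, because the exact reference distribution is t with 4 degrees of freedom; with n = 200 the shortfall is negligible.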
The bootstrap is the most general form of plug-in.
In Efron's nonparametric bootstrap we use a completely
nonparametric estimate of the population distribution: the empirical
distribution (which simply treats the sample as a population).
Now, we cannot generally do much theoretically with such a plug-in estimate, but computer simulation comes to the rescue.
- Any probability distribution we can simulate we can thoroughly understand with sufficient simulation.
- The empirical distribution is trivial to simulate. Samples from it are (re)samples with replacement from the original sample.
Together these two ideas (general plug-in and simulation replacing theory) are tremendously liberating. They allow us to do statistics in very complicated situations without traditional theoretical understanding (the computer does the theory, we just run the computer).
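The two ideas can be sketched in a few lines: resample with replacement from the original sample, recompute the estimator each time, and look at the spread of the results. The function name `bootstrap_replicates`, the data, the number of resamples, and the seed are all assumptions for this illustration, and this is the naive bootstrap with no refinements.

```python
import random
import statistics

def bootstrap_replicates(sample, estimator, n_boot=2000, seed=0):
    """Apply `estimator` to n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(n_boot)]

# Hypothetical data, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9, 2.5]

reps = bootstrap_replicates(sample, statistics.median)
print("sample median:", statistics.median(sample))
print("bootstrap standard error:", statistics.stdev(reps))
```

Swapping `statistics.median` for any other function of the sample gives a standard error for that estimator, with no theory required to run it (though theory is still needed to know whether to believe it).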
The bozo view of the bootstrap is that it makes theoretical statistics completely unnecessary. A more sophisticated view remembers the fundamental problem. The bootstrap does not do the Right Thing (plug-in treats the sample as the population). So how wrong is it?
The answer to this question is very tricky and the course will spend weeks answering it. The short answer (sufficient for this web page) is that sometimes the bootstrap works, and sometimes it doesn't. When a naive bootstrap doesn't work, some modification may.
Despite the shortness of this section, we will spend about half the course on the bootstrap.
In linear regression (including multiple linear regression, the subject of Stat 5302) we assume
1. The errors are independent, normally distributed, mean zero, same variance.
2. The response mean is a linear function of unknown parameters:
$\mu_i = x_{i1} \beta_1 + x_{i2} \beta_2 + \cdots + x_{ip} \beta_p$
In nonparametric regression we drop one or the other of these assumptions.
If one tries to drop both, there is no structure left to make it a
regression problem, at least no structure that has led to any
usable statistical methodology.
If we drop assumption 1 (normal errors) but keep assumption 2 (means linear in the regression coefficients), we get robust regression, procedures robust to departure from normality assumptions.
Since we are not making parametric assumptions about the error distribution, this is nonparametric.
We will just do a few examples of this.
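To give the flavor, here is a sketch of one simple robust line fit: take the median of all pairwise slopes (often called the Theil-Sen estimator). This page does not say which robust regression methods the course covers, so the choice of method, the function name, and the data are assumptions made purely for illustration.

```python
from itertools import combinations
from statistics import median

def theil_sen(x, y):
    """Slope = median of pairwise slopes; intercept = median of y - slope * x."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return intercept, slope

# Hypothetical data on the line y = 1 + 2x, with one gross error at the end.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1, 3, 5, 7, 9, 11, 13, 100]
print(theil_sen(x, y))
```

The mean function is still linear in the coefficients (assumption 2 is kept); only the reliance on normal errors is dropped, so a single wild response does not move the fitted line.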
Smoothing
If we keep assumption 1 (normal errors) but drop assumption 2 (means linear in the regression coefficients), we get smoothing, which assumes means are smooth functions of predictor variables.
Since we are not making parametric assumptions about the form of the mean function, this is nonparametric.
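As an illustration of what a smoother does, here is a minimal sketch of a kernel smoother (a Nadaraya-Watson weighted average with a Gaussian kernel). This page does not say which smoothers the course uses, so the method, the function name, the bandwidth, and the data are assumptions for this example.

```python
from math import exp

def kernel_smooth(x, y, x0, bandwidth):
    """Estimate the mean response at x0 as a weighted average of y,
    with Gaussian weights that decay with distance from x0."""
    weights = [exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)

# Hypothetical data following a curve that is not a straight line.
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0.1, 0.4, 0.9, 1.1, 0.8, 0.7, 0.2, -0.4, -0.7]

for x0 in (0.5, 2.0, 3.5):
    print(x0, kernel_smooth(x, y, x0, bandwidth=0.5))
```

The bandwidth controls the trade-off: a small bandwidth follows the data closely, while a large one averages over more neighbors and gives a smoother curve.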