Statistics 5601 (Geyer, Fall 2013) Course Announcement

This web page describes the subject matter of the course. For procedural details, see the General Info page.

Course Announcement

Nonparametric Methods
Stat 5601
Fall 2007
11:15-12:05 MWF
FordH 115

Instructor:	Charles Geyer (5-8511, `charlie@stat.umn.edu`)

Introduction

Nonparametric statistics is a very strange name for a subject. Very negative. It defines the subject by what it's not. It's not parametric statistics.

Most modern statistics is parametric. What you learn in other statistics courses at the same level as this one, such as Stat 5302 (regression), Stat 5303 (design, ANOVA), Stat 5401 (multivariate), or Stat 5421 (categorical), is parametric statistics.

So what is nonparametrics? The counterculture? Sort of.

Parameters

In intro stats we learn that parameters are population quantities we want to know and statistics (or estimates) are the corresponding sample quantities we use to try to make inferences about them.

For example

Parameter	Statistic
population mean	sample mean
population median	sample median
population 15% trimmed mean	sample 15% trimmed mean

But there is a more advanced meaning of parameter that one meets only in a theory course, such as Stat 5101-5102. A statistical model is parametric if it has only a few parameters. For example the normal family of distributions has two parameters, the mean μ and the variance σ². If we know these two parameters, then we know the distribution (within the normal family).

So there are two related but somewhat different notions of parameter in statistics:

a population quantity we want to estimate
one of a (small) set of variables that (together) determine one probability model in an assumed family of models.

Nonparametrics has parameters in the first sense, but not in the second. It uses families of probability models too large to be determined by a finite set of parameters.

Efficiency

When we are willing to assume a parametric family, this has very strong consequences. The theory of maximum likelihood (Stat 5102) says that the sample mean and sample variance are the best possible estimators of the parameters μ and σ² (when the assumed parametric family is normal). More precisely, these estimators have smaller asymptotic variance than any others, where asymptotic refers to the limit as the sample size goes to infinity.

This concept is called asymptotic efficiency. Note well (this is very important) that this efficiency property depends crucially on the assumed parametric family.
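
A small simulation can make the efficiency idea concrete. This is only a sketch: the sample size, number of replications, seed, and function name are arbitrary choices made for illustration. It draws many samples from a standard normal population and compares the sampling variance of the sample mean with that of the sample median.

```python
import random
import statistics

def simulate_variances(n=101, reps=4000, seed=1):
    """Monte Carlo variances of the sample mean and the sample median
    for standard normal samples of size n (all settings are illustrative)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
    return statistics.variance(means), statistics.variance(medians)

var_mean, var_median = simulate_variances()

# The ratio estimates the efficiency of the median relative to the mean
# when the population really is normal.
ratio = var_mean / var_median
print(f"variance of sample mean:   {var_mean:.5f}")
print(f"variance of sample median: {var_median:.5f}")
print(f"relative efficiency of the median: {ratio:.2f}")
```

Under normality the mean should come out with the smaller variance. Change the population (say, to something heavy tailed) and the comparison can reverse, which is the point of the "note well" above.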

Robustness

What if the assumed parametric family is incorrect?

Then the whole efficiency story goes out the window, and what was best (asymptotically efficient) can become the worst.

We say an estimator is robust when it tolerates departures from assumptions. The simplest measure of robustness is breakdown point, which is the asymptotic fraction of the data that can be complete junk while the estimator stays sensible.

This point is usually explained in intro stats without being explicit about the terminology.

The sample mean has breakdown point zero (!) because even one wild observation (outlier, gross error) can make an arbitrarily large change in the sample mean.

The sample median has breakdown point 50% because even if just under half of the data are wild, the median is in the majority half. (The median changes as more data go bad, but it doesn't become completely useless.)

There is a trade-off. Efficiency and robustness generally go in opposite directions. The mean is most efficient and least robust. The median (a 50% trimmed mean) is most robust and least efficient (when the assumed parametric family is normal). Trimmed means are in between on both robustness and efficiency.
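
To see breakdown in action, here is a minimal sketch with made-up data (the numbers and the helper `trimmed_mean` are hypothetical, and the trimming convention, dropping the stated fraction from each end, is an assumption). One observation is replaced by a gross error and the three estimators are recomputed.

```python
import statistics

def trimmed_mean(x, prop):
    """Mean after dropping a fraction `prop` of the observations
    from each end of the sorted data (assumed convention)."""
    xs = sorted(x)
    k = int(prop * len(xs))
    return statistics.fmean(xs[k:len(xs) - k])

# Made-up measurements centered near 10
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 9.9]
# Same data with the last observation replaced by a gross error
dirty = clean[:-1] + [1.0e6]

for name, data in [("clean", clean), ("dirty", dirty)]:
    print(name,
          "mean =", statistics.fmean(data),
          "15% trimmed mean =", trimmed_mean(data, 0.15),
          "median =", statistics.median(data))
```

One wild value drags the mean far from 10, while the trimmed mean and the median barely move.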

In short, if you have perfect data (no errors, no outliers, fits the assumed normal model), then you should use the sample mean and related procedures (t tests, linear regression, ANOVA, F tests).

This course is about what you do when you don't have perfect data (or at least don't want to take a chance on imperfect data).

The Median and the Sign Test

For a start, consider the sample median as an estimator of the population median. You probably were told a little bit about the median in the descriptive statistics part of your intro stats course, but when you got to inference, the median was forgotten.

So what is the analog of t tests and confidence intervals (procedures associated with the sample mean and variance) for the median?

We will start there. They are called the sign test and related procedures.
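
As a preview, the sign test is simple enough to sketch in a few lines. The function name, the data, and the convention of dropping observations equal to the hypothesized median are all assumptions made for this illustration. The idea: if the hypothesized median is correct, each observation falls above it with probability 1/2, so the count above it is binomial.

```python
from math import comb

def sign_test(x, mu0):
    """Exact two-sided sign test of H0: population median == mu0.
    Observations equal to mu0 are dropped (an assumed convention)."""
    above = sum(1 for v in x if v > mu0)
    below = sum(1 for v in x if v < mu0)
    n = above + below
    k = min(above, below)
    # Under H0 the number of observations above mu0 is Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return above, n, min(1.0, 2 * tail)

# Made-up data, testing whether the population median is 2.0
x = [1.2, 3.4, 2.2, 5.1, 4.8, 3.9, 2.7, 4.1, 3.3, 4.6]
above, n, p_value = sign_test(x, 2.0)
print(f"{above} of {n} observations above 2.0, P-value = {p_value:.4f}")
```

Here 9 of 10 observations exceed 2.0, so the two-sided P-value is 2 × (1 + 10) / 1024 ≈ 0.0215.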

Some people think the sign test gives up too much efficiency. They don't need that much robustness (50% breakdown point).

So there's lots more, for example the Wilcoxon signed rank test and related procedures, which compete with both sign tests and t tests.
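
A sketch of the signed rank test, under simplifying assumptions that are ours: no differences equal to zero and no ties among the absolute differences (the function name and data are hypothetical). The test statistic is the sum of the ranks of the absolute differences that belong to positive differences. Under the null hypothesis each rank is equally likely to carry a plus or a minus sign, so the exact null distribution can be found by counting subsets of the ranks.

```python
def signed_rank_test(x, mu0):
    """Exact two-sided Wilcoxon signed rank test of
    H0: center of symmetry == mu0.
    Assumes no zero differences and no tied absolute differences."""
    d = [v - mu0 for v in x]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    # T+ = sum of ranks of |d| that belong to positive differences
    t_plus = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    total = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks {1, ..., n} with sum s
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    # The null distribution is symmetric, so double the smaller tail.
    t_small = min(t_plus, total - t_plus)
    tail = sum(counts[:t_small + 1]) / 2 ** n
    return t_plus, min(1.0, 2 * tail)

# Made-up data, testing whether the center of symmetry is 2.0
x = [1.2, 3.4, 2.2, 5.1, 4.8, 3.9, 2.7, 4.1, 3.3, 4.6]
t_plus, p_value = signed_rank_test(x, 2.0)
print(f"T+ = {t_plus}, P-value = {p_value:.4f}")
```

Unlike the sign test, this uses the sizes of the differences (through their ranks), not just their signs, which is where the extra efficiency comes from.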

Assumptions

One peculiarity of this course is an emphasis on assumptions. One way to think about nonparametrics is that it makes fewer assumptions. "Valid for larger families of probability models" is just another way of saying "makes fewer assumptions".

For comparison

Procedure	Assumption	Parameter
sign test	continuous	median
signed rank test	continuous, symmetric	center of symmetry
`t` test	normal (Gaussian)	mean

The t test and related procedures make the strongest assumption: the population distribution is normal (Gaussian, bell curve). The parameter about which inference is made is the population mean. If the normality assumption is not exactly correct, the inference may be garbage.

The signed rank test and related procedures make a much weaker assumption: the population distribution is continuous and symmetric. The parameter about which inference is made is the population center of symmetry. If the continuity and symmetry assumptions are not exactly correct, the inference may be garbage.

The sign test and related procedures make the weakest assumption of all: the population distribution is continuous. The parameter about which inference is made is the population median. If the continuity assumption is not exactly correct, the inference may be garbage.

These sets of assumptions become progressively weaker, because the normal distribution is continuous and symmetric. Moreover the parameters are comparable, because the center of symmetry (if there is one) is also the median and (if the mean exists) also the mean.

Thus the sign test competes with the signed rank test when both are valid, and these compete with the t test when it is valid.

Efficiency Revisited

So what is the price in efficiency for the gain in robustness?

Assume, temporarily, that the population is exactly normal. Then we have the following

Procedure	Efficiency (relative to t test)	Robustness (breakdown point)
sign test	64%	50%
signed rank test	95%	29%
`t` test	100%	0%

Robustness does not depend on the true unknown population distribution, but efficiency does. If we drop the normality assumption, then things become really complicated -- just about anything can happen to efficiency.
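
For the curious, the percentages in the table are approximately the values of some standard closed-form constants. The identification below is our gloss, not something derived on this page: 2/π and 3/π are the usual asymptotic relative efficiencies of the sign test and signed rank test relative to the t test under normality, and 1 − 1/√2 is the usual breakdown point of the estimator associated with the signed rank test. The variable names are made up.

```python
from math import pi, sqrt

# Asymptotic relative efficiency versus the t test, normal population
are_sign = 2 / pi
are_signed_rank = 3 / pi

# Breakdown point of the estimator associated with the signed rank test
bp_signed_rank = 1 - 1 / sqrt(2)

print(f"sign test efficiency:        {are_sign:.3f}")
print(f"signed rank efficiency:      {are_signed_rank:.3f}")
print(f"signed rank breakdown point: {bp_signed_rank:.3f}")
```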

But the main story can be seen from our table here:

The signed rank test and related procedures are nearly as efficient as the t test and related procedures, even when the normality assumptions of the latter are true (and perhaps more efficient when they are false).
For the slight loss of efficiency, there is a huge gain in robustness.

So why does anyone ever use the t test and related procedures? Dunno. Beats me!

(Actually I do know: mathematical convenience. The most complicated normal-theory procedures, multiple regression and analysis of variance for complicated experimental designs, have no nonparametric analogues -- at least none that are not bleeding edge research unsuitable for a course like this. But this doesn't apply to simple normal-theory procedures that have simple nonparametric competitors.)

The Bootstrap

And now for something completely different, the bootstrap.

The bootstrap is a cutesy name using an old cliche for what is really a generalization of the fundamental problem of statistics: the sample is not the population.

Since we've had intro stats we never make the fundamental blunder of statistics: confusing the sample and the population.

But one level up, the blunder is sometimes (!) no longer a blunder. For inference about the mean we often plug in the sample standard deviation where the (unknown) population standard deviation should go.

Strictly, this plug-in is wrong. Practically, it works OK if the sample size is large: the t distribution (exact) is almost the same as the z distribution (approximate).

The bootstrap is the most general form of plug-in.

In Efron's nonparametric bootstrap we use a completely nonparametric estimate of the population distribution: the empirical distribution (which simply treats the sample as a population).

Now, we cannot generally do much theoretically with such a plug-in estimate, but computer simulation comes to the rescue.

Any probability distribution we can simulate we can thoroughly understand with sufficient simulation.
The empirical distribution is trivial to simulate. Samples from it are (re)samples with replacement from the original sample.

Together these two ideas (general plug-in and simulation replacing theory) are tremendously liberating. They allow us to do statistics in very complicated situations without traditional theoretical understanding (the computer does the theory, we just run the computer).
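
The two ideas fit in a few lines of code. The following is a minimal sketch, not the course's own code: the data are made up, and the function name, number of bootstrap replications, and seed are arbitrary. It estimates the standard error of an estimator by resampling with replacement from the original sample and recomputing the estimator each time.

```python
import random
import statistics

def bootstrap_se(data, estimator, nboot=2000, seed=42):
    """Bootstrap standard error of `estimator`: the standard deviation
    of the estimator over resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(nboot):
        resample = rng.choices(data, k=n)  # size n, with replacement
        reps.append(estimator(resample))
    return statistics.stdev(reps)

# Made-up sample
data = [3.1, 4.7, 2.2, 5.9, 4.1, 3.8, 6.4, 2.9, 4.4, 5.2,
        3.6, 4.9, 7.3, 3.3, 4.0]

se_mean = bootstrap_se(data, statistics.fmean)
se_median = bootstrap_se(data, statistics.median)
print(f"bootstrap SE of the mean:   {se_mean:.3f}")
print(f"bootstrap SE of the median: {se_median:.3f}")
```

For the mean we already have a formula for the standard error, so the bootstrap adds nothing there except a sanity check. For the median (or anything more complicated) no simple formula is available, and the same code works unchanged.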

The bozo view of the bootstrap is that it makes theoretical statistics completely unnecessary. A more sophisticated view remembers the fundamental problem. The bootstrap does not do the Right Thing (plug-in treats the sample as the population). So how wrong is it?

The answer to this question is very tricky and the course will spend weeks answering it. The short answer (sufficient for this web page) is that sometimes the bootstrap works, and sometimes it doesn't. When a naive bootstrap doesn't work, some modification may.

Despite the shortness of this section, we will spend about half the course on the bootstrap.

Nonparametric Regression

In linear regression (including multiple linear regression, the subject of Stat 5302) we assume

1. The errors are independent, normally distributed, mean zero, same variance.
2. The response mean is a linear function of unknown parameters (regression coefficients)
μ_i = x_i1 β₁ + x_i2 β₂ + … + x_ip β_p

where x_ij is the value of the j-th predictor variable for the i-th individual and β_j is the j-th regression coefficient.

In nonparametric regression we drop one or the other of these assumptions. If one tries to drop both, there is no structure left to make it a regression problem, at least no structure that has led to any usable statistical methodology.

Robust Regression

If we drop assumption 1 (normal errors) but keep assumption 2 (means linear in the regression coefficients), we get robust regression, procedures robust to departure from normality assumptions.

Since we are not making parametric assumptions about the error distribution, this is nonparametric.

We will just do a few examples of this.
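
To give the flavor, here is a sketch of one simple robust regression method for a single predictor, the median of pairwise slopes (often called the Theil-Sen estimator). It is chosen here only because it is short; it is not necessarily one of the methods covered in the course. The data, the function names, and the median-of-residuals intercept are illustrative assumptions. Least squares is included for comparison.

```python
import statistics
from itertools import combinations

def theil_sen(x, y):
    """Slope = median of all pairwise slopes;
    intercept = median of y - slope * x (one common choice)."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    slope = statistics.median(slopes)
    intercept = statistics.median(yi - slope * xi for xi, yi in zip(x, y))
    return intercept, slope

def least_squares(x, y):
    """Ordinary least squares for one predictor plus intercept."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up data exactly on the line y = 1 + 2 x, then one gross error
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x]
y[9] = 100.0  # should have been 19

ts_intercept, ts_slope = theil_sen(x, y)
ls_intercept, ls_slope = least_squares(x, y)
print(f"median of slopes: intercept {ts_intercept:.2f}, slope {ts_slope:.2f}")
print(f"least squares:    intercept {ls_intercept:.2f}, slope {ls_slope:.2f}")
```

The single bad point pulls the least squares line well away from the other nine points, while the median-of-slopes fit ignores it.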

Smoothing

If we keep assumption 1 (normal errors) but drop assumption 2 (means linear in the regression coefficients), we get smoothing, which assumes means are smooth functions of predictor variables

μ_i = s₁(x_i1) + s₂(x_i2) + … + s_p(x_ip)

where the x_ij are as before and s_j is an unknown smooth function.

Since we are not making parametric assumptions about the form of the mean function, this is nonparametric.

We will do more of this.
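
As a taste of smoothing with one predictor, here is a sketch of a kernel smoother (a Nadaraya-Watson estimate with a Gaussian kernel). This is just one of many smoothers, picked for brevity, and not necessarily one used in the course. The simulated data, the bandwidth, and the function name are assumptions. The estimate of the mean at a point is a weighted average of the responses, with nearby observations getting the most weight.

```python
import math
import random

def kernel_smooth(x, y, x0, bandwidth):
    """Weighted average of y with Gaussian kernel weights centered at x0."""
    weights = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)

# Simulated data: a smooth mean function plus normal errors
rng = random.Random(0)
x = [i / 50 for i in range(101)]  # grid on [0, 2]
y = [math.sin(math.pi * xi) + rng.gauss(0.0, 0.2) for xi in x]

# Estimated mean function evaluated at each observed x
fitted = [kernel_smooth(x, y, x0, bandwidth=0.1) for x0 in x]

for i in (25, 50, 75):
    print(f"x = {x[i]:.2f}  truth = {math.sin(math.pi * x[i]):+.2f}  "
          f"smooth = {fitted[i]:+.2f}")
```

The bandwidth controls the trade-off: too small and the fit chases the noise, too large and it flattens real features of the mean function.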