This web page describes the subject matter of the course. For procedural details, see the General Info page.

## Course Announcement

- Nonparametric Methods
- Stat 5601
- Fall 2007
- 11:15-12:05 MWF
- FordH 115

Instructor: Charles Geyer (5-8511, charlie@stat.umn.edu)

## Introduction

Nonparametric statistics is a very strange name for a subject. Very negative.
It defines the subject by what it's not.
It's not *parametric statistics*.

Most modern statistics is parametric. What you learn in other statistics courses at the same level as this one, such as Stat 5302 (regression), Stat 5303 (design, ANOVA), Stat 5401 (multivariate), or Stat 5421 (categorical) is parametric statistics.

So what is nonparametrics? The counterculture? Sort of.

## Parameters

In intro stats we learn that *parameters* are population quantities
we want to know and *statistics* (or *estimates*) are the
corresponding sample quantities we use to try to make *inferences*
about them.

For example

Parameter | Statistic |
---|---|
population mean | sample mean |
population median | sample median |
population 15% trimmed mean | sample 15% trimmed mean |

But there is a more advanced meaning of *parameter* that one
meets only in a theory course, such as Stat 5101-5102. A statistical
model is *parametric* if it has only a few parameters. For example
the *normal* family of distributions has two parameters, the mean
μ and the variance σ^{2}. If we know these two parameters,
then we know the distribution (within the normal family).

So there are two related but somewhat different notions of "parameter" in statistics

- a population quantity we want to estimate
- one of a (small) set of variables that (together) determine one probability model in an assumed family of models.

Nonparametrics has "parameters" in the first sense, but not in the second.
It uses families of probability models too large to be determined by a finite
set of parameters.

## Efficiency

When we are willing to assume a parametric family, this has very strong
consequences. The theory of maximum likelihood (Stat 5102) says that
the sample mean and sample variance are the best possible estimators
of the parameters μ and σ^{2} (when the assumed parametric
family is normal). More precisely,
these estimators have smaller asymptotic variance than any others,
where "asymptotic" refers to the limit as the sample size goes
to infinity.

This concept is called *asymptotic efficiency*. Note well (this
is very important) that this efficiency property depends crucially
on the assumed parametric family.

## Robustness

What if the assumed parametric family is incorrect?

Then the whole efficiency story goes out the window, and what was best (asymptotically efficient) can become the worst.

We say an estimator is *robust* when it tolerates departures
from assumptions. The simplest measure of robustness
is *breakdown point*, which is the asymptotic fraction of the
data that can be complete junk while the estimator stays sensible.

This point is usually explained in intro stats without being explicit about the terminology.

The sample mean has breakdown point zero (!) because even one wild observation (outlier, gross error) can make an arbitrarily large change in the sample mean.

The sample median has breakdown point 50% because even if just under half of the data are wild, the median is in the majority half. (The median changes as more data go bad, but it doesn't become completely useless).

There is a trade-off. Efficiency and robustness generally go in opposite directions. The mean is most efficient and least robust. The median (a 50% trimmed mean) is most robust and least efficient (when the assumed parametric family is normal). Trimmed means are in between on both robustness and efficiency.
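A small numerical illustration of breakdown, as a sketch only: the data values are made up, and `trimmed_mean` is a hypothetical helper that assumes "15% trimmed" means dropping 15% of the points from each end (rounded down), which is one common convention.

```python
import statistics

def trimmed_mean(data, trim_fraction):
    """Mean after dropping trim_fraction of the points from each end."""
    ordered = sorted(data)
    k = int(trim_fraction * len(ordered))  # points dropped per tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Made-up "perfect" data centered near 10.
clean = [9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2, 10.0, 9.9, 10.1]
# The same data with one observation replaced by a gross error.
dirty = clean[:-1] + [1000.0]

for name, data in [("clean", clean), ("dirty", dirty)]:
    print(name,
          "mean:", round(statistics.mean(data), 3),
          "trimmed:", round(trimmed_mean(data, 0.15), 3),
          "median:", round(statistics.median(data), 3))
```

One wild value drags the mean far from 10, while the trimmed mean and the median barely move.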

In short, if you have perfect data (no errors, no outliers, fits the
assumed normal model), then
you should use the sample mean and related procedures (`t` tests,
linear regression, ANOVA, `F` tests).

This course is about what you do when you don't have perfect data (or at least don't want to take a chance on imperfect data).

## The Median and the Sign Test

For a start, consider the sample median as an estimator of the population
median. You probably were told a little bit about the median in the
"descriptive statistics" part of your intro stats course,
but when you got to inference, the median was forgotten.

So what is the analog of `t` tests and confidence intervals
(procedures associated with the sample mean and variance) for the median?

We will start there. They are called the *sign test* and related procedures.
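To make the idea concrete, here is a minimal sketch of an exact two-sided sign test. The function name `sign_test`, the example data, and the convention of dropping observations equal to the hypothesized median are assumptions made for illustration, not necessarily the form used in the course. Under the null hypothesis each observation falls above the median with probability 1/2, so the count of positive signs is binomial.

```python
from math import comb

def sign_test(data, median0):
    """Two-sided exact sign test of H0: population median == median0."""
    # Observations equal to median0 carry no sign; dropping them is one
    # common convention (an assumption here).
    diffs = [x - median0 for x in data if x != median0]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)

    # Under H0 the number of positive signs is Binomial(n, 1/2).
    def cdf(k):
        return sum(comb(n, j) for j in range(k + 1)) / 2 ** n

    lower = cdf(n_pos)             # P(B <= n_pos)
    upper = 1.0 - cdf(n_pos - 1)   # P(B >= n_pos)
    p_value = min(1.0, 2.0 * min(lower, upper))
    return n_pos, n, p_value

# Made-up data; test whether the population median could be 2.0.
sample = [1.2, 3.4, 2.2, 5.1, 4.8, 3.9, 6.0, 2.9, 4.4, 5.5]
print(sign_test(sample, 2.0))
```

Only the signs of the differences enter the calculation, which is why continuity is the sole distributional assumption.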

Some people think the sign test gives up too much efficiency. They don't need that much robustness (50% breakdown point).

So there's lots more, for example
the Wilcoxon signed rank test and related procedures, which compete
with both sign tests and `t` tests.
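As a rough sketch of how the signed rank test uses more of the data than the sign test, the following computes the statistic and an exact two-sided P-value. The name `signed_rank_test` and the data are hypothetical, and the code assumes no ties among the absolute differences; handling ties needs a convention that is not shown here.

```python
def signed_rank_test(data, center0):
    """Two-sided exact Wilcoxon signed rank test of H0: center == center0."""
    # Assumes no ties among the absolute differences (zero differences
    # are dropped); real data with ties needs a tie-handling convention.
    diffs = [x - center0 for x in data if x != center0]
    n = len(diffs)
    by_size = sorted(diffs, key=abs)
    # T+ is the sum of the ranks (1 = smallest |difference|) of the
    # positive differences.
    t_plus = sum(rank for rank, d in enumerate(by_size, start=1) if d > 0)
    # Null distribution: each rank is positive or negative with
    # probability 1/2, so count subsets of {1, ..., n} by their sum.
    total = n * (n + 1) // 2
    counts = [1] + [0] * total
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    lower = sum(counts[: t_plus + 1]) / 2 ** n   # P(T+ <= t_plus)
    upper = sum(counts[t_plus:]) / 2 ** n        # P(T+ >= t_plus)
    return t_plus, min(1.0, 2.0 * min(lower, upper))

# Made-up data; test whether the center of symmetry could be 2.0.
sample = [1.2, 3.4, 2.2, 5.1, 4.8, 3.9, 6.0, 2.9, 4.4, 5.5]
print(signed_rank_test(sample, 2.0))
```

Unlike the sign test, the sizes of the differences matter here (through their ranks), which is where the extra efficiency comes from.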

## Assumptions

One peculiarity of this course is an emphasis on assumptions.
One way to think about nonparametrics is that it makes fewer assumptions.
"Valid for larger families of probability models" is just another way
of saying "makes fewer assumptions".

For comparison

Procedure | Assumption | Parameter |
---|---|---|
sign test | continuous | median |
signed rank test | continuous, symmetric | center of symmetry |
`t` test | normal (Gaussian) | mean |

The `t` test and related procedures make the strongest assumption:
the population distribution is normal (Gaussian, bell curve).
The parameter about which inference is made is the population mean.
If the normality assumption is not exactly correct, the inference may
be garbage.

The signed rank test and related procedures make a much weaker assumption: the population distribution is continuous and symmetric. The parameter about which inference is made is the population center of symmetry. If the continuity and symmetry assumptions are not exactly correct, the inference may be garbage.

The sign test and related procedures make the weakest assumption of all: the population distribution is continuous. The parameter about which inference is made is the population median. If the continuity assumption is not exactly correct, the inference may be garbage.

These sets of assumptions become progressively weaker, because the normal distribution is continuous and symmetric. Moreover the parameters are comparable, because the center of symmetry (if there is one) is also the median and (if the mean exists) also the mean.

Thus the sign test competes with the signed rank test when both are valid,
and these compete with the `t` test when it is valid.

## Efficiency Revisited

So what is the price in efficiency for the gain in robustness?

Assume, temporarily, that the population is exactly normal. Then we have the following

Procedure | Efficiency | Robustness (breakdown point) |
---|---|---|
sign test | 63% | 50% |
signed rank test | 95% | 29% |
`t` test | 100% | 0% |

Robustness does not depend on the true unknown population distribution, but efficiency does. If we drop the normality assumption, then things become really complicated -- just about anything can happen to efficiency.

But the main story can be seen from our table here

- The signed rank test and related procedures are nearly as efficient as the `t` test and related procedures, *even when their normality assumptions are true* (and perhaps more efficient when they are false).
- For the slight loss of efficiency, there is a huge gain in robustness.
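The sign test row of the table can be checked roughly by simulation. The efficiency of the sign test relative to the `t` test under normality matches that of the sample median relative to the sample mean, whose asymptotic value is 2/π (about 0.64). The sketch below estimates that ratio as a ratio of variances. The function name, sample size, replication count, and seed are arbitrary choices for illustration, and at a finite sample size the ratio only approximates the asymptotic value.

```python
import random
import statistics

def variance_ratio(n=101, reps=4000, seed=5601):
    """Estimate var(sample mean) / var(sample median) for normal data."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # Both estimators are centered at 0, so comparing their variances
    # across replications compares their precision.
    return statistics.pvariance(means) / statistics.pvariance(medians)

ratio = variance_ratio()
print("estimated efficiency of the median relative to the mean:", round(ratio, 3))
```

Replacing `rng.gauss` with a heavy-tailed generator is an easy way to watch the ranking change once normality is dropped.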

So why does anyone ever use the `t` test and related procedures?
Dunno. Beats me!

(Actually I do know: mathematical convenience. The most complicated normal-theory procedures, multiple regression and analysis of variance for complicated experimental designs have no nonparametric analogues -- at least none that are not bleeding edge research unsuitable for a course like this. But this doesn't apply to simple normal-theory procedures that have simple nonparametric competitors.)

## The Bootstrap

And now for something completely different, the bootstrap.

The "bootstrap" is a cutesy name using an
old cliche for what is really a generalization of the fundamental
problem of statistics: the sample is not the population.

Since we've had intro stats we never make the fundamental blunder of statistics: confusing the sample and the population.

But one level up, the "blunder" is sometimes (!) no longer a blunder.
For inference about the mean we often "plug-in" the sample standard
deviation where the (unknown) population standard deviation should go.

Strictly, this plug-in is wrong. Practically, it works OK if the sample
size is large: the `t` distribution (exact) is almost the same as the
`z` distribution (approximate).

The *bootstrap* is the most general form of plug-in.

In Efron's *nonparametric bootstrap* we use a completely
nonparametric estimate of the population distribution: the empirical
distribution (which simply treats the sample as a population).

Now, we cannot generally do much theoretically with such a plug-in estimate, but computer simulation comes to the rescue.

- Any probability distribution we can simulate we can thoroughly understand with sufficient simulation.
- The empirical distribution is trivial to simulate. Samples from it are (re)samples *with replacement* from the original sample.
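The two points above can be sketched in a few lines. This is a minimal illustration of the naive nonparametric bootstrap, used here to estimate a standard error. The helper name `bootstrap_se`, the data, the number of resamples, and the seed are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_se(data, statistic, n_boot=2000, seed=5601):
    """Bootstrap standard error of statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # A sample from the empirical distribution: n draws from the
        # original data, with replacement.
        resample = rng.choices(data, k=n)
        replicates.append(statistic(resample))
    # The spread of the replicates stands in for the sampling
    # variability we cannot observe directly.
    return statistics.stdev(replicates)

# Made-up data.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.0, 2.5, 4.9]
print("SE of mean:  ", round(bootstrap_se(data, statistics.fmean), 3))
print("SE of median:", round(bootstrap_se(data, statistics.median), 3))
```

For the mean there is a textbook formula to compare against. For the median there is no simple one, yet the same code applies unchanged.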

Together these two ideas (general plug-in and simulation replacing theory) are tremendously liberating. They allow us to do statistics in very complicated situations without traditional theoretical understanding (the computer does the theory, we just run the computer).

The bozo view of the bootstrap is that it makes theoretical statistics completely unnecessary. A more sophisticated view remembers the fundamental problem. The bootstrap does not do the Right Thing (plug-in treats the sample as the population). So how wrong is it?

The answer to this question is very tricky and the course will spend weeks answering it. The short answer (sufficient for this web page) is that sometimes the bootstrap works, and sometimes it doesn't. When a naive bootstrap doesn't work, some modification may.

Despite the shortness of this section, we will spend about half the course on the bootstrap.

## Nonparametric Regression

In linear regression (including multiple linear regression, the subject of Stat 5302) we assume

1. The errors are independent, normally distributed, mean zero, same variance.
2. The response mean is a linear function of unknown parameters (regression coefficients)

   $$\mu_i = x_{i1} \beta_1 + x_{i2} \beta_2 + \cdots + x_{ip} \beta_p$$

   where $x_{ij}$ is the value of the $j$-th predictor variable for the $i$-th individual and $\beta_j$ is the $j$-th regression coefficient.

In nonparametric regression we drop one or the other of these assumptions.
If one tries to drop both, there is no structure left to make it
a "regression" problem, at least no structure that has led to any
usable statistical methodology.

### Robust Regression

If we drop assumption 1 (normal errors) but keep assumption 2 (means linear
in the regression coefficients), we get *robust regression*, procedures
robust to departure from normality assumptions.

Since we are not making parametric assumptions about the error distribution, this is nonparametric.

We will just do a few examples of this.
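As one illustration of the idea (not necessarily one of the examples the course uses), here is a sketch of the Theil-Sen estimator for a single predictor. It keeps the linear mean but replaces least squares with medians: the slope is the median of all pairwise slopes. The function name and data are hypothetical.

```python
import statistics
from itertools import combinations

def theil_sen(x, y):
    """Median-of-pairwise-slopes line fit for one predictor."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
              if x2 != x1]  # pairs with equal x have no defined slope
    slope = statistics.median(slopes)
    # One simple choice of intercept: the median residual from the
    # fitted slope.
    intercept = statistics.median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept

# Made-up data on the line y = 2x + 1, with the last point a gross error.
x = list(range(10))
y = [2 * xi + 1 for xi in x]
y[-1] = 100
print(theil_sen(x, y))
```

Least squares would be pulled toward the wild point, while the median of slopes ignores it as long as most pairs are uncontaminated.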

### Smoothing

If we keep assumption 1 (normal errors) but drop assumption 2 (means linear in the regression coefficients), we get *smoothing*, which assumes means are smooth functions of predictor variables

$$\mu_i = s_1(x_{i1}) + s_2(x_{i2}) + \cdots + s_p(x_{ip})$$

where the $x_{ij}$ are as before and each $s_j$ is an unknown smooth function.

Since we are not making parametric assumptions about the form of the mean function, this is nonparametric.
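For a single predictor, one simple way to estimate an unknown smooth mean function is a kernel-weighted local average. The sketch below uses a Gaussian kernel. The choice of smoother, the function name `kernel_smooth`, the bandwidth, and the noise-free test curve are all assumptions made for illustration; the page does not say which smoothers the course covers.

```python
import math

def kernel_smooth(x, y, x0, bandwidth):
    """Gaussian-kernel weighted average of y, centered at x0."""
    # Points near x0 get weight close to 1; distant points get weight
    # close to 0. The bandwidth controls how fast the weights decay.
    weights = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)

# Made-up data: a smooth curve observed on a grid (no noise, for clarity).
x = [i / 10 for i in range(101)]
y = [math.sin(xi) for xi in x]
estimate = kernel_smooth(x, y, 5.0, 0.3)
print(round(estimate, 3), "vs true value", round(math.sin(5.0), 3))
```

No parametric form for the curve is assumed. A larger bandwidth gives a smoother but more biased estimate, and a smaller one a rougher estimate that follows the data more closely.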