University of Minnesota, Twin Cities School of Statistics Stat 5601 Rweb

- General Instructions
- Theory
- Permutations for Unpaired Two-Sample Tests
- Permutations for One-Sample or Paired Two-Sample Tests
- Monte Carlo Tests

Permutation tests, despite being the oldest of all nonparametric procedures that are still used today, do not seem to be covered in Hollander and Wolfe. Thus we need a theory section.

However, both Wilcoxon tests are special cases of the general notion of permutation tests. Thus the subject should seem somewhat familiar.

As the plural *permutation tests* implies, this name refers to a class of tests. It is not a single procedure like the sign test or either of the Wilcoxon tests. Rather, it is a large class of procedures all derived from the same fundamental principle.

As with any test of statistical hypotheses we need the following elements.

- data
- null hypothesis
- test statistic
- the sampling distribution of the test statistic under the null hypothesis

For most hypothesis tests, we start with the assumptions and work forward to derive the sampling distribution of the test statistic under the null hypothesis. For permutation tests we will reverse the procedure, since the sampling distribution involves the permutations which give the procedure its name and are the key theoretical issue in understanding the test.

In mathematics, a permutation is a reordering of the numbers
1, ..., `n`. For example,

(1, 2, 3, 4, 5, 6, 7)

(1, 3, 2, 4, 5, 6, 7)

(4, 7, 5, 2, 6, 1, 3)

are all permutations of the numbers 1, ..., 7 (note that this includes the standard order in line one). As one learns at the beginning of a probability course, there are `n!` permutations of the numbers 1, ..., `n`.
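The count is easy to check by brute force. A quick sketch (in Python rather than R, purely as a self-contained illustration):

```python
from itertools import permutations
from math import factorial

# enumerate every reordering of 1, ..., 7 and count them
count = sum(1 for _ in permutations(range(1, 8)))
print(count)         # 5040
print(factorial(7))  # 5040, i.e. 7!
```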

In the term *permutation tests* the notion of permutation is somewhat
different. It refers to *any* of a specified class of rearrangements
or modifications of the data.

The null hypothesis of the test specifies that the permutations are all
equally likely. A concise way to say this is that the distribution of
the data under the null hypothesis satisfies *exchangeability*.

The sampling distribution of the test statistic under the null hypothesis is computed by forming all of the permutations, calculating the test statistic for each and considering these values all equally likely.

Rather than give a general description of all the kinds of permutations that may arise in applications, we will just consider two specific cases.

For unpaired two-sample tests, in particular,
for Wilcoxon rank sum tests, the permutations are really *combinations*
because order within the `x`'s and `y`'s doesn't matter.

Suppose, as in Comment 5 of Section 4.1 in Hollander and Wolfe,
there are 3 `x`'s and 2 `y`'s. The combinations can
be represented as follows

```
Rweb:> library(stat5601)
Rweb:> foo <- perm2(5, 2)
Rweb:> t(foo)
      [,1] [,2]
 [1,]    1    2
 [2,]    1    3
 [3,]    1    4
 [4,]    1    5
 [5,]    2    3
 [6,]    2    4
 [7,]    2    5
 [8,]    3    4
 [9,]    3    5
[10,]    4    5
```

Each row of the printed matrix gives one combination. The sampling distribution of the test statistic is formed by taking, for each row of the matrix, the corresponding elements of the combined data vector as one sample (the remaining elements being the other sample) and calculating the test statistic. Each of the

`choose(n, k)`

combinations is equally likely under the
null hypothesis.
The function `perm2` (on-line help) used here produces all the combinations.
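The same enumeration can be reproduced without the `stat5601` package. A minimal sketch (in Python, as a stand-in for `perm2`):

```python
from itertools import combinations

# every way to choose 2 of the positions 1, ..., 5: the same ten rows,
# in the same lexicographic order, that perm2(5, 2) prints above
combos = list(combinations(range(1, 6), 2))
print(len(combos))  # 10 = choose(5, 2)
```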

Continuing with this example, the sampling distribution of the test statistic is calculated as follows.

The function `perm2sum` (on-line help) produces the sum of each combination of data elements.
Since the argument it is fed is the rank vector (note that there are ties,
so we are calculating the exact sampling distribution of the test statistic
when there are tied ranks), it calculates the sum of each possible pair of
ranks (for this particular rank vector with its particular pattern of ties).

Thus the vector `wsim` is the vector of all possible values of the Wilcoxon rank sum test statistic for this particular pattern of ties.

The fraction of these values at or below the observed value of the test
statistic is the `P`-value for an exact lower-tailed test.

For an upper-tailed test, reverse the inequality in the last statement.

For a two-tailed test, double the smaller
of the one-tailed `P`-values.
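The whole calculation, from rank vector to exact `P`-values, can be sketched in a self-contained way (in Python as a stand-in for `perm2sum`; the data here are hypothetical, not the Hollander and Wolfe example, but they include a tie so the tied-rank machinery is exercised):

```python
from itertools import combinations

def midranks(v):
    # ranks with ties replaced by midranks, as rank() computes them in R
    s = sorted(v)
    return [(s.index(t) + 1 + len(s) - s[::-1].index(t)) / 2 for t in v]

# hypothetical data: 3 x's and 2 y's, with one tied value
x = [1.2, 3.4, 0.9]
y = [0.5, 0.9]
r = midranks(x + y)
n, k = len(x) + len(y), len(y)

# wsim: the y rank sum for every combination (the vector the text calls wsim)
wsim = [sum(r[i] for i in idx) for idx in combinations(range(n), k)]
w_obs = sum(r[len(x):])  # observed rank sum of the y's

p_lower = sum(w <= w_obs for w in wsim) / len(wsim)
p_upper = sum(w >= w_obs for w in wsim) / len(wsim)
p_two = min(1.0, 2 * min(p_lower, p_upper))
print(p_lower)  # 0.2
```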

There is no reason whatsoever why a permutation test has to use any particular test statistic. Any test statistic will do!

Consider, for example, the usual test statistic for
a two-sample `t` test.

The function `perm2fun` (on-line help) used here evaluates an arbitrary function on each combination of data elements. It takes arguments `x` and `y`, which are the `x` and `y` values for the permuted data for each permutation.

This example is really two examples.

- In the first example, we do the Wilcoxon rank sum test of Example 4.1
in Hollander and Wolfe, just to see that we get the right answer. The function
`fred` calculates the test statistic for the rank sum test.
- In the second example, we do a Student-`t`-like permutation test on the same data. The function `sally` calculates the test statistic for the `t` test that uses Welch's approximation (doesn't assume equal variances).
- Just for comparison we also do the `t` test.
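A `t`-like permutation test of this kind can be sketched in a self-contained way (in Python, with hypothetical data, not the Example 4.1 data; the Welch statistic plays the role of the function the text calls `sally`):

```python
from itertools import combinations
from math import sqrt

def welch_t(x, y):
    # Welch two-sample t statistic (no equal-variance assumption)
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((u - mx) ** 2 for u in x) / (nx - 1)
    vy = sum((u - my) ** 2 for u in y) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

# hypothetical pooled data: 4 x's followed by 3 y's
data = [3.1, 4.5, 2.2, 6.0, 5.1, 1.9, 4.0]
ny = 3
t_obs = welch_t(data[:-ny], data[-ny:])

# permutation distribution: every way of choosing which ny
# observations play the role of the y's
tsim = []
for idx in combinations(range(len(data)), ny):
    ysub = [data[i] for i in idx]
    xsub = [data[i] for i in range(len(data)) if i not in idx]
    tsim.append(welch_t(xsub, ysub))

p_lower = sum(t <= t_obs for t in tsim) / len(tsim)
p_upper = sum(t >= t_obs for t in tsim) / len(tsim)
p_two = min(1.0, 2 * min(p_lower, p_upper))
```

The point of the sketch is that the permutation machinery is identical to the rank sum case; only the statistic computed on each combination changes.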

Note the following very interesting observations.

- The first two tests are *equally nonparametric*. The permutation test requires no assumptions other than those for the Wilcoxon rank sum test.
- The two nonparametric tests are not the same in any other way.
  - The `t`-like permutation test is more efficient.
  - The Wilcoxon is more robust.
  - They give different `P`-values.
- The two `t` tests are not the same either.
  - They have exactly the same test statistic.
  - They give different `P`-values.
  - They have different assumptions.
- Any function whatsoever can be used to define the test statistic. There's nothing magical about these two.

For one-sample or paired two-sample tests, in particular
for Wilcoxon signed rank tests, the permutations are really *subsets*.
The permutation distribution chooses an arbitrary subset to mark `+`, and the complementary subset is marked `-`. Either subset can be empty.

Suppose, as in Comment 11 of Section 3.1 in Hollander and Wolfe,
there are 4 `z`'s. The subsets can be represented as follows

```
Rweb:> library(stat5601)
Rweb:> foo <- perm1(4)
Rweb:> t(foo)
      [,1] [,2] [,3] [,4]
 [1,]    0    0    0    0
 [2,]    1    0    0    0
 [3,]    0    1    0    0
 [4,]    1    1    0    0
 [5,]    0    0    1    0
 [6,]    1    0    1    0
 [7,]    0    1    1    0
 [8,]    1    1    1    0
 [9,]    0    0    0    1
[10,]    1    0    0    1
[11,]    0    1    0    1
[12,]    1    1    0    1
[13,]    0    0    1    1
[14,]    1    0    1    1
[15,]    0    1    1    1
[16,]    1    1    1    1
```

Each row of the printed matrix gives (the indicator vector of) one subset. The sampling distribution of the test statistic is formed by, for each row of the matrix, summing the ranks of the data elements corresponding to the 1's. Each of the

`2^n`

subsets is equally likely under the
null hypothesis.
The function `perm1` (on-line help) used here produces all the subsets.
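Again the enumeration can be reproduced without the `stat5601` package. A minimal sketch (in Python, as a stand-in for `perm1`):

```python
from itertools import product

# every 0/1 indicator vector of length 4: the same 16 rows perm1(4)
# prints above (perm1 varies the first coordinate fastest, product the last)
subsets = list(product([0, 1], repeat=4))
print(len(subsets))  # 16 = 2^4
```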

Continuing with this example, the sampling distribution of the test statistic is calculated as follows.

The function `perm1sum` (on-line help) produces the sum of each subset of data elements.
Since the argument it is fed is the absolute rank vector (note that there
are ties,
so we are calculating the exact sampling distribution of the test statistic
when there are tied ranks), it calculates the sum of each possible subset of
ranks (for this particular rank vector with its particular pattern of ties).

Thus the vector `tsim` is the vector of all possible values of the Wilcoxon signed rank test statistic for this particular pattern of ties.

The fraction of these values at or above the observed value of the test
statistic is the `P`-value for an exact upper-tailed test.

For a lower-tailed test, reverse the inequality in the last statement.

For a two-tailed test, double the smaller
of the one-tailed `P`-values.
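The whole signed rank calculation can likewise be sketched in a self-contained way (in Python as a stand-in for `perm1sum`; the data are hypothetical, not the Comment 11 example, but they include tied absolute values):

```python
from itertools import product

def midranks(v):
    # ranks with ties replaced by midranks, as rank() computes them in R
    s = sorted(v)
    return [(s.index(t) + 1 + len(s) - s[::-1].index(t)) / 2 for t in v]

# hypothetical data: 4 z's with tied |z|'s
z = [1.5, -1.5, 2.3, 0.7]
r = midranks([abs(v) for v in z])  # absolute rank vector

# tsim: for every indicator vector, the sum of the ranks marked '+'
tsim = [sum(ri for ri, b in zip(r, bits) if b)
        for bits in product([0, 1], repeat=len(z))]
t_obs = sum(ri for ri, v in zip(r, z) if v > 0)  # observed signed rank statistic

p_upper = sum(t >= t_obs for t in tsim) / len(tsim)
print(p_upper)  # 0.25
```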

There is no reason whatsoever why a permutation test has to use any particular test statistic. Any test statistic will do!

Consider, for example, the usual test statistic for
a one-sample `t` test.

The function `perm1fun` (on-line help) used here evaluates an arbitrary function on each subset of data elements and its complement. It takes arguments `x` and `y`, which are the `+` and `-` values for each permutation.

This example is really two examples.

- In the first example, we do the Wilcoxon signed rank test for Problem 3.87
in Hollander and Wolfe (which we did for homework), just to see that we get
the right answer. The function `fred` calculates the test statistic for the signed rank test.
- In the second example, we do a Student-`t`-like permutation test on the same data. The function `sally` calculates the test statistic for the `t` test.
- Just for comparison we also do the `t` test.
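A one-sample `t`-like permutation test amounts to flipping signs every possible way. A self-contained sketch (in Python, with hypothetical data, not the Problem 3.87 data):

```python
from itertools import product
from math import sqrt

def t_stat(z):
    # one-sample t statistic for the hypothesis that the mean is zero
    n, m = len(z), sum(z) / len(z)
    s2 = sum((v - m) ** 2 for v in z) / (n - 1)
    return m / sqrt(s2 / n)

# hypothetical data, already centered at the hypothesized mean,
# so sign flips are the relevant permutations
z = [1.5, -0.4, 2.3, 0.7, -1.1]
t_obs = t_stat(z)

# attach every possible sign pattern to the |z|'s: the 2^n permutations
tsim = [t_stat([s * abs(v) for s, v in zip(signs, z)])
        for signs in product([-1, 1], repeat=len(z))]

p_upper = sum(t >= t_obs for t in tsim) / len(tsim)
```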

Note the following very interesting observations.

- The first two tests are *equally nonparametric*. The permutation test requires no assumptions other than those for the Wilcoxon signed rank test.
- The two nonparametric tests are not the same in any other way.
  - The `t`-like permutation test is more efficient.
  - The Wilcoxon is more robust. Here, where there is a gross outlier, the `t` seems highly non-robust. The Wilcoxon estimates μ less than the hypothesized value (17.85) and the `t` estimates it the other way (18.975)!
  - They give different `P`-values.
- The two `t` tests are not the same either.
  - They have exactly the same test statistic.
  - They give different `P`-values.
  - They have different assumptions.
- Any function whatsoever can be used to define the test statistic. There's nothing magical about these two.

The `perm` functions described above don't work for large sample sizes. The amount of memory used, `choose(n, k)` or `2^n`, gets larger than the computer can hold.

What then?

For the Wilcoxon tests, we can use the normal approximations. But for arbitrary permutation tests there are no normal approximations.

The answer is to use so-called *Monte Carlo* tests which
use a random sample of the permutations rather than all the permutations.

The consequence of this is that the result of the calculation,
the `P`-value of the test, is itself a random number.
However, we can calculate its precision using statistics and
so use enough random permutations to get whatever precision we want.

- Everything down to the for loop is hopefully familiar.
- Each time through the loop takes one sample without replacement
of size `length(y)` from the ranks. That's exactly what the exact code above does, except that the exact code does all the possibilities rather than just some of them.
- `phat` is the fraction of samples that have (random!) values of the test statistic less than the value for the observed data.
- The next line makes a slight adjustment to get the `P`-value. With this adjustment the test is exact, that is, the probability that `P` < α for any significance level α is exactly α if we include both the Monte Carlo randomness and the randomness from the sampling distribution of the data (argument given in class).
- The last line gives the estimated standard deviation of that Monte Carlo `P`-value, from which we can tell how accurate it is and whether `nsim` should be increased.
- For comparison, the exact `P`-value (from the rank sum test page) is 0.1272061.
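The steps in the list above can be sketched in a self-contained way (in Python, with a hypothetical rank vector, not the course data; the `(count + 1)/(nsim + 1)` line is one standard form of the "slight adjustment", used here as an assumption since the course code itself is not reproduced in this extraction):

```python
import random
from math import sqrt

# hypothetical rank vector (with ties) and y-group size
rank = [4, 5, 2.5, 1, 2.5]
ny = 2
w_obs = sum(rank[-ny:])  # observed y rank sum

random.seed(42)
nsim = 10000
count = 0
for _ in range(nsim):
    # one sample of size ny without replacement from the ranks
    if sum(random.sample(rank, ny)) <= w_obs:
        count += 1

phat = count / nsim                  # fraction at or below the observed value
p = (count + 1) / (nsim + 1)         # the slight adjustment giving the P-value
se = sqrt(phat * (1 - phat) / nsim)  # Monte Carlo standard error
```

Here `se` plays the role of the last line in the course code: if it is too large for the precision wanted, increase `nsim`.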