University of Minnesota, Twin Cities School of Statistics Stat 5601 Rweb

- General Instructions
- Theory
- Permutations for Unpaired Two-Sample Tests
- Permutations for One-Sample or Paired Two-Sample Tests
- Monte Carlo Tests

Permutation tests, despite being the oldest of all nonparametric procedures that are still used today, do not seem to be covered in Hollander and Wolfe. Thus we need a theory section.

However, both Wilcoxon tests are special cases of the general notion of permutation tests. Thus the subject should seem somewhat familiar.

As the plural *permutation tests* implies, this name refers to a class of tests. It is not a single procedure like the sign test or either of the Wilcoxon tests. Rather, it is a large class of procedures all derived from the same fundamental principle.

As with any test of statistical hypotheses we need the following elements.

- data
- null hypothesis
- test statistic
- the sampling distribution of the test statistic under the null hypothesis

For most hypothesis tests, we start with the assumptions and work forward to derive the sampling distribution of the test statistic under the null hypothesis. For permutation tests we will reverse the procedure, since the sampling distribution involves the permutations which give the procedure its name and are the key theoretical issue in understanding the test.

In mathematics, a permutation is a reordering of the numbers
1, ..., `n`. For example,

(1, 2, 3, 4, 5, 6, 7)

(1, 3, 2, 4, 5, 6, 7)

(4, 7, 5, 2, 6, 1, 3)

are all permutations of the numbers 1, ..., 7 (note that this includes the standard order in line one). As one learns at the beginning of a probability course, there are `n!` permutations of the numbers 1, ..., `n`.
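The count is easy to check by brute force. A quick sketch (in Python rather than R, purely as a self-contained illustration):

```python
from itertools import permutations
from math import factorial

# enumerate every reordering of 1, ..., 7 and count them
count = sum(1 for _ in permutations(range(1, 8)))
print(count)         # 5040
print(factorial(7))  # 5040, i.e. 7!
```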

In the term *permutation tests* the notion of permutation is somewhat
different. It refers to *any* of a specified class of rearrangements
or modifications of the data.

The null hypothesis of the test specifies that the permutations are all
equally likely. A concise way to say this is that the distribution of
the data under the null hypothesis satisfies *exchangeability*.

The sampling distribution of the test statistic under the null hypothesis is computed by forming all of the permutations, calculating the test statistic for each and considering these values all equally likely.

Rather than give a general description of all the kinds of permutations that may arise in applications, we will just consider two specific cases.

For unpaired two-sample tests, in particular,
for Wilcoxon rank sum tests, the permutations are really *combinations*
because order within the `x`'s and `y`'s doesn't matter.

Suppose, as in Comment 5 of Section 4.1 in Hollander and Wolfe,
there are 3 `x`'s and 2 `y`'s. The combinations can
be represented as follows

```
Rweb:> library(stat5601)
Rweb:> foo <- perm2(5, 2)
Rweb:> t(foo)
      [,1] [,2]
 [1,]    1    2
 [2,]    1    3
 [3,]    1    4
 [4,]    1    5
 [5,]    2    3
 [6,]    2    4
 [7,]    2    5
 [8,]    3    4
 [9,]    3    5
[10,]    4    5
```

Each row of the printed matrix gives one combination. The sampling distribution of the test statistic is formed by taking, for each row of the matrix, the corresponding elements of the combined data vector as one sample (the remaining elements being the other sample) and calculating the test statistic. Each of the

`choose(n, k)`

combinations is equally likely under the
null hypothesis.
The function `perm2` (on-line help) used here produces all the combinations.
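The same enumeration can be reproduced without the `stat5601` package. A minimal sketch (in Python, as a stand-in for `perm2`):

```python
from itertools import combinations

# every way to choose 2 of the positions 1, ..., 5: the same ten rows,
# in the same lexicographic order, that perm2(5, 2) prints above
combos = list(combinations(range(1, 6), 2))
print(len(combos))  # 10 = choose(5, 2)
```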

Continuing with this example, the sampling distribution of the test statistic is calculated as follows.

The function `perm2sum` (on-line help) produces the sum of each combination of data elements.
Since the argument it is fed is the rank vector (note that there are ties,
so we are calculating the exact sampling distribution of the test statistic
when there are tied ranks), it calculates the sum of each possible pair of
ranks (for this particular rank vector with its particular pattern of ties).

Thus the vector `wsim` is the vector of all possible values of the Wilcoxon rank sum test statistic for this particular pattern of ties.

The fraction of these values at or below the observed value of the test
statistic is the `P`-value for an exact lower-tailed test.

For an upper-tailed test, reverse the inequality in the last statement.

For a two-tailed test, double the smaller
of the one-tailed `P`-values.
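The whole calculation, from rank vector to exact `P`-values, can be sketched in a self-contained way (in Python as a stand-in for `perm2sum`; the data here are hypothetical, not the Hollander and Wolfe example, but they include a tie so the tied-rank machinery is exercised):

```python
from itertools import combinations

def midranks(v):
    # ranks with ties replaced by midranks, as rank() computes them in R
    s = sorted(v)
    return [(s.index(t) + 1 + len(s) - s[::-1].index(t)) / 2 for t in v]

# hypothetical data: 3 x's and 2 y's, with one tied value
x = [1.2, 3.4, 0.9]
y = [0.5, 0.9]
r = midranks(x + y)
n, k = len(x) + len(y), len(y)

# wsim: the y rank sum for every combination (the vector the text calls wsim)
wsim = [sum(r[i] for i in idx) for idx in combinations(range(n), k)]
w_obs = sum(r[len(x):])  # observed rank sum of the y's

p_lower = sum(w <= w_obs for w in wsim) / len(wsim)
p_upper = sum(w >= w_obs for w in wsim) / len(wsim)
p_two = min(1.0, 2 * min(p_lower, p_upper))
print(p_lower)  # 0.2
```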

There is no reason whatsoever why a permutation test has to use any particular test statistic. Any test statistic will do!

Consider, for example, the usual test statistic for
a two-sample `t` test.

The function `perm2fun` (on-line help) used here evaluates an arbitrary function on each combination of data elements. It takes arguments `x` and `y`, which are the `x` and `y` values for the permuted data for each permutation.

This example is really two examples.

- In the first example, we do the Wilcoxon rank sum test of Example 4.1
in Hollander and Wolfe, just to see that we get the right answer. The function
`fred` calculates the test statistic for the rank sum test.
- In the second example, we do a Student-`t`-like permutation test on the same data. The function `sally` calculates the test statistic for the `t` test that uses Welch's approximation (doesn't assume equal variances).
- Just for comparison we also do the `t` test.
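A `t`-like permutation test of this kind can be sketched in a self-contained way (in Python, with hypothetical data, not the Example 4.1 data; the Welch statistic plays the role of the function the text calls `sally`):

```python
from itertools import combinations
from math import sqrt

def welch_t(x, y):
    # Welch two-sample t statistic (no equal-variance assumption)
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((u - mx) ** 2 for u in x) / (nx - 1)
    vy = sum((u - my) ** 2 for u in y) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

# hypothetical pooled data: 4 x's followed by 3 y's
data = [3.1, 4.5, 2.2, 6.0, 5.1, 1.9, 4.0]
ny = 3
t_obs = welch_t(data[:-ny], data[-ny:])

# permutation distribution: every way of choosing which ny
# observations play the role of the y's
tsim = []
for idx in combinations(range(len(data)), ny):
    ysub = [data[i] for i in idx]
    xsub = [data[i] for i in range(len(data)) if i not in idx]
    tsim.append(welch_t(xsub, ysub))

p_lower = sum(t <= t_obs for t in tsim) / len(tsim)
p_upper = sum(t >= t_obs for t in tsim) / len(tsim)
p_two = min(1.0, 2 * min(p_lower, p_upper))
```

The point of the sketch is that the permutation machinery is identical to the rank sum case; only the statistic computed on each combination changes.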

Note the following very interesting observations.

- The first two tests are *equally nonparametric*. The permutation test requires no assumptions other than those for the Wilcoxon rank sum test.
- The two nonparametric tests are not the same in any other way.
  - The `t`-like permutation test is more efficient.
  - The Wilcoxon is more robust.
  - They give different `P`-values.
- The two `t` tests are not the same either.
  - They have exactly the same test statistic.
  - They give different `P`-values.
  - They have different assumptions.
- Any function whatsoever can be used to define the test statistic. There's nothing magical about these two.

For one-sample or paired two-sample tests, in particular
for Wilcoxon signed rank tests, the permutations are really *subsets*.
The permutation distribution chooses an arbitrary subset to mark `+`, and the complementary subset is marked `-`. Either subset can be empty.

Suppose, as in Comment 11 of Section 3.1 in Hollander and Wolfe,
there are 4 `z`'s. The subsets can be represented as follows

```
Rweb:> library(stat5601)
Rweb:> foo <- perm1(4)
Rweb:> t(foo)
      [,1] [,2] [,3] [,4]
 [1,]    0    0    0    0
 [2,]    1    0    0    0
 [3,]    0    1    0    0
 [4,]    1    1    0    0
 [5,]    0    0    1    0
 [6,]    1    0    1    0
 [7,]    0    1    1    0
 [8,]    1    1    1    0
 [9,]    0    0    0    1
[10,]    1    0    0    1
[11,]    0    1    0    1
[12,]    1    1    0    1
[13,]    0    0    1    1
[14,]    1    0    1    1
[15,]    0    1    1    1
[16,]    1    1    1    1
```

Each row of the printed matrix gives (the indicator vector of) one subset. The sampling distribution of the test statistic is formed by, for each row of the matrix, summing the ranks of the data elements corresponding to the 1's. Each of the

`2^n`

subsets is equally likely under the
null hypothesis.
The function `perm1` (on-line help) used here produces all the subsets.
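Again the enumeration can be reproduced without the `stat5601` package. A minimal sketch (in Python, as a stand-in for `perm1`):

```python
from itertools import product

# every 0/1 indicator vector of length 4: the same 16 rows perm1(4)
# prints above (perm1 varies the first coordinate fastest, product the last)
subsets = list(product([0, 1], repeat=4))
print(len(subsets))  # 16 = 2^4
```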

Continuing with this example, the sampling distribution of the test statistic is calculated as follows.

The function `perm1sum` (on-line help) produces the sum of each subset of data elements.
Since the argument it is fed is the absolute rank vector (note that there
are ties,
so we are calculating the exact sampling distribution of the test statistic
when there are tied ranks), it calculates the sum of each possible subset of
ranks (for this particular rank vector with its particular pattern of ties).

Thus the vector `tsim` is the vector of all possible values of the Wilcoxon signed rank test statistic for this particular pattern of ties.

The fraction of these values at or above the observed value of the test
statistic is the `P`-value for an exact upper-tailed test.

For a lower-tailed test, reverse the inequality in the last statement.

For a two-tailed test, double the smaller
of the one-tailed `P`-values.
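The whole signed rank calculation can likewise be sketched in a self-contained way (in Python as a stand-in for `perm1sum`; the data are hypothetical, not the Comment 11 example, but they include tied absolute values):

```python
from itertools import product

def midranks(v):
    # ranks with ties replaced by midranks, as rank() computes them in R
    s = sorted(v)
    return [(s.index(t) + 1 + len(s) - s[::-1].index(t)) / 2 for t in v]

# hypothetical data: 4 z's with tied |z|'s
z = [1.5, -1.5, 2.3, 0.7]
r = midranks([abs(v) for v in z])  # absolute rank vector

# tsim: for every indicator vector, the sum of the ranks marked '+'
tsim = [sum(ri for ri, b in zip(r, bits) if b)
        for bits in product([0, 1], repeat=len(z))]
t_obs = sum(ri for ri, v in zip(r, z) if v > 0)  # observed signed rank statistic

p_upper = sum(t >= t_obs for t in tsim) / len(tsim)
print(p_upper)  # 0.25
```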

There is no reason whatsoever why a permutation test has to use any particular test statistic. Any test statistic will do!

Consider, for example, the usual test statistic for
a one-sample `t` test.

The function `perm1fun` (on-line help) used here evaluates an arbitrary function on each subset of data elements and its complement. It takes arguments `x` and `y`, which are the `+` and `-` values for each permutation.

This example is really two examples.

- In the first example, we do the Wilcoxon signed rank test for Problem 3.87
in Hollander and Wolfe (which we did for homework), just to see that we get
the right answer. The function `fred` calculates the test statistic for the signed rank test.
- In the second example, we do a Student-`t`-like permutation test on the same data. The function `sally` calculates the test statistic for the `t` test.
- Just for comparison we also do the `t` test.
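A one-sample `t`-like permutation test amounts to flipping signs every possible way. A self-contained sketch (in Python, with hypothetical data, not the Problem 3.87 data):

```python
from itertools import product
from math import sqrt

def t_stat(z):
    # one-sample t statistic for the hypothesis that the mean is zero
    n, m = len(z), sum(z) / len(z)
    s2 = sum((v - m) ** 2 for v in z) / (n - 1)
    return m / sqrt(s2 / n)

# hypothetical data, already centered at the hypothesized mean,
# so sign flips are the relevant permutations
z = [1.5, -0.4, 2.3, 0.7, -1.1]
t_obs = t_stat(z)

# attach every possible sign pattern to the |z|'s: the 2^n permutations
tsim = [t_stat([s * abs(v) for s, v in zip(signs, z)])
        for signs in product([-1, 1], repeat=len(z))]

p_upper = sum(t >= t_obs for t in tsim) / len(tsim)
```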

Note the following very interesting observations.

- The first two tests are *equally nonparametric*. The permutation test requires no assumptions other than those for the Wilcoxon signed rank test.
- The two nonparametric tests are not the same in any other way.
  - The `t`-like permutation test is more efficient.
  - The Wilcoxon is more robust. Here, where there is a gross outlier, the `t` seems highly non-robust. The Wilcoxon estimates μ less than the hypothesized value (17.85) and the `t` estimates it the other way (18.975)!
  - They give different `P`-values.
- The two `t` tests are not the same either.
  - They have exactly the same test statistic.
  - They give different `P`-values.
  - They have different assumptions.
- Any function whatsoever can be used to define the test statistic. There's nothing magical about these two.

The `perm` functions described above don't work for large sample sizes. The amount of memory used, `choose(n, k)` or `2^n`, gets larger than the computer can hold.

What then?

For the Wilcoxon tests, we can use the normal approximations. But for arbitrary permutation tests there are no normal approximations.

The answer is to use so-called *Monte Carlo* tests which
use a random sample of the permutations rather than all the permutations.

The consequence of this is that the result of the calculation,
the `P`-value of the test, is itself a random number.
However, we can calculate its precision using statistics and
so use enough random permutations to get whatever precision we want.

- Everything down to the for loop is hopefully familiar.
- Each time through the loop takes one sample without replacement
of size `length(y)` from the ranks. That's exactly what the exact code above does, except that the exact code does all the possibilities rather than just some of them.
- `phat` is the fraction of samples that have (random!) values of the test statistic less than the value for the observed data.
- The next line makes a slight adjustment to get the `P`-value. With this adjustment the test is exact, that is, the probability that `P` < α for any significance level α is exactly α if we include both the Monte Carlo randomness and the randomness from the sampling distribution of the data (argument given in class).
- The last line gives the estimated standard deviation of that Monte Carlo `P`-value, from which we can tell how accurate it is and whether `nsim` should be increased.
- For comparison, the exact `P`-value (from the rank sum test page) is 0.1272061.
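The steps in the list above can be sketched in a self-contained way (in Python, with a hypothetical rank vector, not the course data; the `(count + 1)/(nsim + 1)` line is one standard form of the "slight adjustment", used here as an assumption since the course code itself is not reproduced in this extraction):

```python
import random
from math import sqrt

# hypothetical rank vector (with ties) and y-group size
rank = [4, 5, 2.5, 1, 2.5]
ny = 2
w_obs = sum(rank[-ny:])  # observed y rank sum

random.seed(42)
nsim = 10000
count = 0
for _ in range(nsim):
    # one sample of size ny without replacement from the ranks
    if sum(random.sample(rank, ny)) <= w_obs:
        count += 1

phat = count / nsim                  # fraction at or below the observed value
p = (count + 1) / (nsim + 1)         # the slight adjustment giving the P-value
se = sqrt(phat * (1 - phat) / nsim)  # Monte Carlo standard error
```

Here `se` plays the role of the last line in the course code: if it is too large for the precision wanted, increase `nsim`.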