University of Minnesota, Twin Cities School of Statistics Stat 5601 Rweb
Permutation tests, despite being the oldest of all nonparametric procedures that are still used today, do not seem to be covered in Hollander and Wolfe. Thus we need a theory section.
However, both Wilcoxon tests are special cases of the general notion of permutation tests. Thus the subject should seem somewhat familiar.
As the plural permutation tests
implies, this name refers to a class
of tests. It is not a single procedure like the sign test or either of the
Wilcoxon tests. Rather, it is a large class of procedures all derived from
the same fundamental principle.
As with any test of statistical hypotheses we need the following elements.
For most hypothesis tests, we start with the assumptions and work forward
to derive the sampling distribution of the test statistic under the
null hypothesis. For permutation tests we will reverse the procedure,
since the sampling distribution involves the permutations
which
give the procedure its name and are the key theoretical issue in understanding
the test.
In mathematics, a permutation is a reordering of the numbers 1, ..., n. For example,
(1, 2, 3, 4, 5, 6, 7)are all permutations of the numbers 1, ..., 7 (note that this includes the standard order in line one). As one learns in at the beginning of a probability course, there are n! (n factorial) permutations of n objects. In this case, 7! = 5040, so you can see why they aren't all written out here.
(1, 3, 2, 4, 5, 6, 7)
(4, 7, 5, 2, 6, 1, 3)
In the term permutation tests the notion of permutation is somewhat different. It refers to any of a specified class of rearrangements or modifications of the data.
The null hypothesis of the test specifies that the permutations are all equally likely. A concise way to say this is that the distribution of the data under the null hypothesis satisfies exchangeability.
The sampling distribution of the test statistic under the null hypothesis is computed by forming all of the permutations, calculating the test statistic for each and considering these values all equally likely.
Rather than give a general description of all the kinds of permutations that may arise in applications, we will just consider two specific cases.
For unpaired two-sample tests, in particular, for Wilcoxon rank sum tests, the permutations are really combinations because order within the x's and y's doesn't matter.
Suppose, as in Comment 5 of Section 4.1 in Hollander and Wolfe, there are 3 x's and 2 y's. The combinations can be represented as follows
Rweb:> library(stat5601) Rweb:> foo <- perm2(5, 2) Rweb:> t(foo) [,1] [,2] [1,] 1 2 [2,] 1 3 [3,] 1 4 [4,] 1 5 [5,] 2 3 [6,] 2 4 [7,] 2 5 [8,] 3 4 [9,] 3 5 [10,] 4 5Each row of the printed matrix gives one combination. The sampling distribution of the test statistic is formed by taking for each row of the matrix the corresponding elements of the combined data vector, calling those elements y's, and computing the test statistic. Each of the
choose(n, k)
combinations is equally likely under the
null hypothesis.
The function perm2
(
on-line help)
used here produces all the combinations.
Continuing with this example, the following code calculates
The function perm2sum
(
on-line help)
used here produces the sum of each combination of data elements.
Since the argument it is fed is the rank vector (note that there are ties,
so we are calculating the exact sampling distribution of the test statistic
when there are tied ranks), it calculates the sum of each possible pair of
ranks (for this particular rank vector with its particular pattern of ties).
Thus the vector wsim
is the vector of all possible values of
the Wilcoxon rank sum test statistic for this particular pattern of ties.
The fraction of these values at or below the observed value of the test statistic is the P-value for an exact lower-tailed test.
For an upper-tailed test, reverse the inequality in the last statement.
For a two-tailed test, double the smaller of the one-tailed P-values.
There is no reason whatsoever why a permutation test has to use any particular test statistic. Any test statistic will do!
Consider, for example, the usual test statistic for
a two-sample z
test.
The function perm2fun
(
on-line help)
used here evaluates an arbitrary function on each combination of data elements.
It takes arguments x
and y
which are the
x and y values for the permuted
data for
each permutation.
This example is really two examples.
fred
calculates the test statistic for the rank sum test.
sally
calculates the test
statistic for the t test that uses Welch's approximation
(doesn't assume equal variance).
Note the following very interesting observations.
For one-sample or paired two-sample tests, in particular,
for Wilcoxon signed rank tests, the permutations are really subsets.
The permutation distribution choses an arbitrary subset to mark +
and the complementary subset is marked -
. Either subset can
be empty.
Suppose, as in Comment 11 of Section 3.1 in Hollander and Wolfe, there are 4 z's. The subsets can be represented as follows
Rweb:> library(stat5601) Rweb:> foo <- perm1(4) Rweb:> t(foo) [,1] [,2] [,3] [,4] [1,] 0 0 0 0 [2,] 1 0 0 0 [3,] 0 1 0 0 [4,] 1 1 0 0 [5,] 0 0 1 0 [6,] 1 0 1 0 [7,] 0 1 1 0 [8,] 1 1 1 0 [9,] 0 0 0 1 [10,] 1 0 0 1 [11,] 0 1 0 1 [12,] 1 1 0 1 [13,] 0 0 1 1 [14,] 1 0 1 1 [15,] 0 1 1 1 [16,] 1 1 1 1Each row of the printed matrix gives (the indicator vector of) one subset. The sampling distribution of the test statistic is formed by for each row of the matrix summing the ranks of the data elements corresponding to the 1's. Each of the
2^n
subsets is equally likely under the
null hypothesis.
The function perm1
(
on-line help)
used here produces all the subsets.
Continuing with this example, the following code calculates
The function perm1sum
(
on-line help)
used here produces the sum of each subset of data elements.
Since the argument it is fed is the absolute rank vector (note that there
are ties,
so we are calculating the exact sampling distribution of the test statistic
when there are tied ranks), it calculates the sum of each possible subset of
ranks (for this particular rank vector with its particular pattern of ties).
Thus the vector tsim
is the vector of all possible values of
the Wilcoxon signed rank test statistic for this particular pattern of ties.
The fraction of these values at or above the observed value of the test statistic is the P-value for an exact upper-tailed test.
For a lower-tailed test, reverse the inequality in the last statement.
For a two-tailed test, double the smaller of the one-tailed P-values.
There is no reason whatsoever why a permutation test has to use any particular test statistic. Any test statistic will do!
Consider, for example, the usual test statistic for
a one-sample z
test.
The function perm1fun
(
on-line help)
used here evaluates an arbitrary function on each subset of data elements
and its complement.
It takes arguments x
and y
which are the
+ and - values for each permutation.
This example is really two examples.
fred
calculates the test statistic for the signed rank test.
sally
calculates the test
statistic for the t test.
Note the following very interesting observations.
The perm
functions described above don't work for large sample
sizes. The amount of memory used, choose(n, k)
or 2^n
, gets larger than the computer can hold.
What then?
For the Wilcoxon tests, we can use the normal approximations. But for arbitrary permutation tests there are no normal approximations.
The answer is to use so-called Monte Carlo tests which use a random sample of the permutations rather than all the permutations.
The consequence of this is that the result of the calculation, the P-value of the test is itself a random number. However, we can calculate its precision using statistics and so use enough random pemutations to get whatever precision we want.
length(y)
from the ranks. That's exactly what
the exact code above does, except that the exact code does all the
possibilities rather than just some of them.
phat
is the fraction of samples that have (random!)
values of the test statistic less than the value for the observed data.
nsim
should be increased.