General Instructions
To do each example, just click the Submit
button.
You do not have to type in any R instructions or specify a dataset.
That's already done for you.
The Wilcoxon Signed Rank Test
Example 3.1 in Hollander and Wolfe.
Summary
 Lower-tailed signed rank test
 Test statistic: T^{+} = 5
 Sample size: n = 9
 P-value: P = 0.01953
Comments
 Line one assigns the value of the parameter (population median) assumed under the null hypothesis. Usually zero.
 There is no need to sort the z values in line two. It just makes the data easier to look at.
 The vector r defined in line four contains the (absolute) ranks.  Line five prints the differences z and the ranks r stuffed into a matrix so they line up, each difference over the corresponding rank.  Line six prints the test statistic (sum of the positive signed ranks).
 For an upper-tailed test the seventh line would be replaced by any of the following, which all do the same thing,
1 - psignrank(tplus - 1, n)
psignrank(tplus - 1, n, lower.tail = FALSE)
psignrank(n * (n + 1) / 2 - tplus, n)
The first computes one minus the lower-tail probability and is less accurate than the second for very small P-values. The third gives exactly the same result as the second because of the symmetry of the distribution of the signed rank test statistic.  For a two-tailed test do both the lower-tailed and the upper-tailed test and double the smaller of the two P-values. (Two tails is twice one tail because of the symmetry of the null distribution of the test statistic.)
 For handling zeros and tied ranks, see Hollander and Wolfe, the class discussion, and below.
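Assembled as one runnable sketch, the lines described above look like this. The differences z here are made up for illustration; they are not the Example 3.1 data.

```r
# Sketch of the lower-tailed signed rank test described above.
mu <- 0                    # hypothesized value of the population median
z <- c(-1.3, 0.9, -2.1, -0.4, 1.7, -0.8, -3.2, 0.2, -1.1)
z <- sort(z - mu)          # sorting just makes the data easier to look at
r <- rank(abs(z))          # the (absolute) ranks
rbind(z, r)                # each difference over the corresponding rank
tplus <- sum(r[z > 0])     # sum of the positive signed ranks
n <- length(z)
psignrank(tplus, n)        # lower-tailed P-value
```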
The Associated Point Estimate (Median of the Walsh Averages)
The Hodges-Lehmann estimator associated with the signed rank test is the median of the Walsh averages, which are the n (n + 1) / 2 averages (Z_{i} + Z_{j}) / 2 of pairs of differences, one for each pair i ≤ j.
The following somewhat tricky code computes the Walsh averages and their median.
Example 3.3 in Hollander and Wolfe.
Summary
 Point Estimate (sample median of Walsh averages): −0.46
Comments
 There is no need to sort the z values in line one. It just makes the data easier to look at.
 There is no need to sort the Walsh averages in line three. It just makes them easier to look at. Having them sorted is necessary later on when we use them to construct a confidence interval.
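One way to compute the Walsh averages uses the outer function. The differences here are made up for illustration, not the Example 3.3 data.

```r
# Compute all n * (n + 1) / 2 Walsh averages and their median.
z <- sort(c(-1.2, -0.9, -0.5, -0.1, 0.4, 1.3))   # made-up differences
n <- length(z)
w <- outer(z, z, "+") / 2                    # all averages (z[i] + z[j]) / 2
walsh <- sort(w[lower.tri(w, diag = TRUE)])  # keep each pair i <= j once
length(walsh)    # n * (n + 1) / 2 of them
median(walsh)    # the Hodges-Lehmann point estimate
```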
The Associated Confidence Interval
Very similar to the confidence interval associated with the sign test, the confidence interval has the form
(W_{(k)}, W_{(m + 1 − k)})
where m = n (n + 1) / 2 is the number of Walsh averages, and, as always, parentheses on subscripts indicate order statistics, in this case, of the Walsh averages W_{k}. That is, one counts in k from each end in the list of sorted Walsh averages to find the confidence interval.
Example 3.4 in Hollander and Wolfe.
Summary
 Achieved confidence level: 0.9609375
 Confidence interval for the population median: (−0.786, −0.010)
Comments
 Some experimentation may be needed to achieve the confidence level you want. The possible confidence levels are given by
1 - 2 * psignrank(k - 1, n)
for different values of k. The vectorwise operation of R functions can give them all at once
k <- seq(1, 100)
conf <- 1 - 2 * psignrank(k - 1, n)
conf[conf > 1 / 2]
If one adds these lines to the form above, one sees that the choice is fairly restricted. The nine possible achieved levels between 0.99 and 0.80 are 0.9883, 0.9805, 0.9727, 0.9609, 0.9453, 0.9258, 0.9023, 0.8711, and 0.8359.  Alternatively, you can just assign k to be any integer between one and n / 2 just before the second-to-last line in the form (the cat . . . line). A confidence interval with some achieved confidence level will be produced.  For a one-tailed confidence interval (called upper and lower bounds by Hollander and Wolfe) just use alpha rather than alpha / 2 in the fifth line of the form. Then make either the lower limit minus infinity or the upper limit plus infinity, as desired.
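A sketch of the whole construction, again on made-up differences. Choosing k with qsignrank is one way (an assumption here, not necessarily the form's exact code) to make the achieved level at least the nominal one.

```r
# Confidence interval from the sorted Walsh averages (made-up data).
z <- sort(c(-1.2, -0.9, -0.5, -0.1, 0.4, 1.3))
n <- length(z)
m <- n * (n + 1) / 2
w <- outer(z, z, "+") / 2
walsh <- sort(w[lower.tri(w, diag = TRUE)])
alpha <- 0.05
k <- qsignrank(alpha / 2, n)          # smallest k with achieved level >= 1 - alpha
conf <- 1 - 2 * psignrank(k - 1, n)   # achieved confidence level
ci <- c(walsh[k], walsh[m + 1 - k])   # count in k from each end
conf
ci
```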
The R Functions wilcox.test and wilcox.exact
All of the above can be done in one shot with the R function wilcox.test (online help).
Only one complaint: it does not report the actual achieved confidence level (here 96.1%) but rather the confidence level asked for (here 95%, the default). If you want to know the actual achieved confidence level, you'll have to use the code in the confidence interval section above. But you can use wilcox.test as a convenient check (the intervals should agree).
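For instance, on made-up paired data (not the textbook data, and with no ties or zeros among the differences), one call does the test, the point estimate, and the interval:

```r
# Made-up paired samples; the differences y - x have no ties or zeros.
x <- c(5.1, 4.8, 6.0, 5.5, 4.9, 6.2, 5.7, 5.3)
y <- c(4.7, 5.0, 5.1, 4.9, 4.4, 5.9, 5.0, 4.5)
res <- wilcox.test(y, x, paired = TRUE, conf.int = TRUE, conf.level = 0.95)
res$statistic   # sum of the positive signed ranks of y - x
res$p.value     # exact two-tailed P-value (small n, no ties)
res$conf.int    # labeled 95%, though the achieved level differs
```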
Warning About Ties and Zeros
Do not use the wilcox.test function when there are ties or zeros in the data. See the following section.
There is a function wilcox.exact, explained below, that does do hypothesis tests correctly in the presence of ties and zeros (if you accept the zero fudge).
Neither wilcox.test nor wilcox.exact calculates point estimates or confidence intervals correctly in the presence of ties or zeros. Do not use them for confidence intervals or point estimates!
Ties and Zeros
What if the continuity assumption is false and there are tied absolute Z values or zero Z values?
First, neither ties nor zeros should make any difference in calculating point estimators or confidence intervals.
 The point estimate is the median of the Walsh averages (section on the associated point estimate above).
 The end points of the confidence interval are k in from each end of the sorted Walsh averages (section on the associated confidence interval above).
 Ties and zeros affect only hypothesis tests.
This is a bit of programmer brain damage (PBD) in the implementation of the wilcox.test and wilcox.exact functions. They change the way they calculate point estimates and confidence intervals when there are ties or zeros. But they shouldn't.
The Zero Fudge
What we called the zero fudge in the context of the sign test (because it is fairly bogus there) makes much more sense in the context of the signed rank test. Zero values in the vector Z = Y − X of paired differences should get the smallest possible absolute rank because zero is smaller in absolute value than any other number. We might as well give them rank zero, starting counting at zero rather than at one. Then they make no contribution to the sum of ranks statistic.
Here's another way to see the same thing. Let's compare three procedures: Student t, signed rank, and sign.
 The t procedures use the actual numbers, both size and sign. Because of this they are highly nonrobust (breakdown point zero).
 The sign procedures completely ignore the size of the numbers, using only the sign (whether the differences are positive or negative). Because of this they are highly robust (breakdown point one-half).
 The Wilcoxon signed rank procedures use both size and sign. But they don't use the actual magnitudes of the numbers, only the ranks of their magnitudes. So they pay some attention to size, but not as much attention as the parametric Student t procedures. Because of this they have in-between robustness (breakdown point 0.293).
This analysis explains why the Wilcoxon should treat differences of zero size just like the Student t, that is, they don't count at all.
There is another issue about the zero fudge. As explained in the section about the zero fudge for the sign test, the null hypothesis for the sign test with zero fudge is
Pr(Z > μ) = Pr(Z < μ)
which generally is neither true nor interesting for the sign test but which is true under the usual assumptions for the Wilcoxon signed rank test. We are already assuming the distribution of the differences is symmetric, and that implies the probability to either side of the population median μ is the same.
All that having been said, I still feel that there is something wrong with deliberately ignoring data favoring the null hypothesis. It still seems like cheating.
Thus we won't argue (at least not much) with the zero fudge for the signed rank test. We will start our hypothesis test calculations (don't do this for point estimates or confidence intervals) with
z <- y - x
z <- z - mu
z <- z[z != 0]
Tied Ranks
The preceding section takes care of one kind of violation of the continuity assumption. But there's a second kind of violation that causes another problem.
If there are ties among the magnitudes of the Z values, then
 there is no way to unambiguously assign ranks,
 the theorem about the equivalence of the T^{+} and W^{+} statistics is no longer true, and
 the distribution of T^{+} tabulated in Table A.4 in Hollander and Wolfe and computed by the psignrank and qsignrank functions in R is no longer correct in the presence of ties.
So there are really two issues to be resolved.
 What test statistic?
 What is the sampling distribution of the test statistic under the null hypothesis?
The standard solution to the first problem is to use so-called tied ranks, in which each of a set of tied magnitudes is assigned the average of the ranks they otherwise would have gotten if they had been slightly different (and untied). The R rank function automatically does this.
So the ranks are done the same as before, and the test statistic is calculated from the ranks the same as before.
r <- rank(abs(z))
tplus <- sum(r[z > 0])
The Wrong Thing
The Wrong Thing (with a capital W and a capital T) is to just ignore the fact that tied ranks change the sampling distribution and just use tables in books or computer functions that are based on the assumption of no ties.
This is not quite as bad as it sounds, because tied ranks were thought up in the first place with the idea of not changing the sampling distribution much. So this does give wrong answers, but not horribly wrong.
Example 3.2 in Hollander and Wolfe.
Summary
 Upper-tailed signed rank test
 Test statistic: T^{+} = 62.5
 Sample size: n = 12
 P-value (wrong): P = 0.03857
Comments
 Note that when the variable names aren't x and y, you have to use the actual names, here private and government.  The lines
mu <- 0 # hypothesized value of median
and
z <- z - mu
z <- z[z != 0]
serve no purpose in this problem, where μ = 0 and there are no zero differences. They are only there for the general case.  The floor function in the last line rounds tplus down to the nearest integer (in this case from 62.5 to 62) because only integer arguments make sense to psignrank and rounding down is conservative for upper tails (makes them bigger).  For a lower-tailed test replace the last line with
psignrank(ceiling(tplus), n)
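Putting those pieces together, here is the whole "wrong thing" calculation on made-up data containing tied magnitudes (not the Example 3.2 data):

```r
# Signed rank test with tied ranks, using the no-ties null distribution
# (the "wrong thing"); the data are made up for illustration.
mu <- 0
z <- c(1.5, -1.5, 2.0, 2.0, -0.5, 3.1, 0.7)
z <- z - mu
z <- z[z != 0]             # the zero fudge (no zeros here)
n <- length(z)
r <- rank(abs(z))          # tied ranks: note the 3.5s and 5.5s
tplus <- sum(r[z > 0])     # half-integer because of the ties
1 - psignrank(floor(tplus) - 1, n)   # upper-tailed, ignoring the ties
```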
The Right Thing
As long as you have a computer, why not use it? There are only 2^{n} points in the sample space of the sampling distribution of the test statistic under the null hypothesis corresponding to all possible choices of signs to the ranks. In this case, 2^{12} = 4096, not a particularly big number for a computer (although out of the question for hand calculation). So just do it.
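A minimal brute-force version of this idea, on the same made-up tied data as above (not the Example 3.2 data), enumerates every sign assignment directly:

```r
# Exact null distribution of T+ with tied ranks, by enumerating all
# 2^n assignments of signs to the ranks; feasible for small n.
z <- c(1.5, -1.5, 2.0, 2.0, -0.5, 3.1, 0.7)
n <- length(z)
r <- rank(abs(z))
tplus <- sum(r[z > 0])
signs <- as.matrix(expand.grid(rep(list(c(0, 1)), n)))  # 2^n rows
tdist <- drop(signs %*% r)      # T+ for every possible sign assignment
mean(tdist >= tplus)            # exact upper-tailed P-value, ties and all
```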
There is an R function wilcox.exact (online help) that does the job.
Example 3.2 in Hollander and Wolfe.
Summary
 Upper-tailed signed rank test
 Test statistic: T^{+} = 62.5
 Sample size: n = 12
 P-value (exact): P = 0.03345
Comments
 Note that the wrong thing isn't all that wrong, though it doesn't even get one significant figure correct:
Wrong P = 0.04
Right P = 0.03
 You may be wondering why R has a wilcox.test function to do the wrong thing and a wilcox.exact function to do the right thing. I wonder too. Here's my guess. The wilcox.test function has been around for more than 10 years and uses 60-year-old ideas. It's well understood. The wilcox.exact function uses ideas from very recent research. It may eventually replace the other function, but has to wait a while.
The Large Sample Approximation
We have been ignoring, up to now, large sample (also called asymptotic) approximations to sampling distributions of test statistics. Why use an approximation when the computer can do it exactly?
However, the computer can't do it exactly for really large n. The functions psignrank and qsignrank crash when n is larger than about 50. The function wilcox.exact may also crash for large n, but I haven't experimented with it. Thus the need for large sample approximation.
It is a fact, which Hollander and Wolfe derive on pp. 44–45, that the mean and variance of the sampling distribution of T^{+} are
E(T^{+}) = n (n + 1) / 4
var(T^{+}) = n (n + 1) (2 n + 1) / 24
under the assumption of continuity (hence no ties).
When there are ties, the mean stays the same but the variance is reduced by a quantity that, because it has a summation sign in it, doesn't look good in web pages. Just see equation (3.13) in Hollander and Wolfe.
Here's how R uses this approximation.
Example 3.2 in Hollander and Wolfe.
Summary
 Upper-tailed signed rank test
 Test statistic: T^{+} = 62.5
 Sample size: n = 12
 P-value (approximate with correction for ties): P = 0.03258
 P-value (approximate with correction for ties and correction for continuity): P = 0.03403
 P-value (approximate with correction for ties and another correction for continuity): P = 0.03554
Comments

 The function that calculates the c. d. f. of the normal distribution is pnorm.  The last two lines calculate without and with a correction for continuity, which is 1 / 4 here rather than 1 / 2 because the discreteness in the distribution is in steps of 1 / 2 rather than integer steps. This is because tied ranks may have half-integer values (and do in this case). Arrghh!!! The tied ranks may also have only integer values (when the number tied in each tied group is odd). In that case the correction for continuity should be 1 / 2.

 For a lower-tailed test, we need to reverse the sign of the correction for continuity
pnorm(tplus, et, sqrt(vt))
pnorm(tplus + 1 / 4, et, sqrt(vt))
pnorm(tplus + 3 / 8, et, sqrt(vt))
pnorm(tplus + 1 / 2, et, sqrt(vt))
whichever you think is better.
The R function wilcox.test always uses 1 / 2 for the correction for continuity even though we can see that 1 / 4 works better in this case.
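A sketch of the approximation on made-up tied data. The tie correction used here, sum(t^3 − t) / 48 over the sizes t of the tied groups, is (I believe) equation (3.13) of Hollander and Wolfe in another form, since t^3 − t = t (t − 1) (t + 1).

```r
# Normal approximation to the upper-tailed P-value, with tie correction
# and the 1/4 continuity correction; data made up for illustration.
z <- c(1.5, -1.5, 2.0, 2.0, -0.5, 3.1, 0.7)
n <- length(z)
r <- rank(abs(z))
tplus <- sum(r[z > 0])
et <- n * (n + 1) / 4                     # mean of T+
ties <- table(r)                          # sizes of the tied groups
vt <- n * (n + 1) * (2 * n + 1) / 24 - sum(ties^3 - ties) / 48
1 - pnorm(tplus - 1 / 4, et, sqrt(vt))    # upper tail, corrected
```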
Fuzzy Procedures
These are analogous to the fuzzy procedures for the sign test explained on the sign test and related procedures page and on the fuzzy confidence intervals and Pvalues page.
Since they are so similar, we won't belabor the issues and interpretations. The only difference is that for fuzzy confidence intervals the jumps in the plot are at Walsh averages (no surprise) rather than at order statistics and for fuzzy Pvalues they are at numbers in the CDF table for the null distribution of the test statistic, which is now the Wilcoxon signed rank distribution rather than the symmetric binomial distribution.
In short, the distributions have changed but everything else remains the same.
Fuzzy PValues
Summary
 Upper-tailed signed rank test
 Fuzzy P-value is more or less uniformly distributed on the interval from 0.02612 to 0.03857.
Interpretation: moderate evidence against the null hypothesis of no difference.
Is the fuzzy P-value really worse than the mess that preceded it in the literature and textbooks? Why not just do the Right Thing?
Fuzzy Confidence Intervals
Summary
 Fuzzy confidence interval associated with the signed rank test
 Associated randomized confidence interval is a mixture of two intervals (to quote the printout)
probability   lower end   upper end
      0.23        -150        1450
      0.77         -50        1400
By itself the interval (−150, 1450), which counts in k = 14 from each end of the list of sorted Walsh averages, would have 95.75% confidence. By itself the interval (−50, 1400), which counts in k = 15 from each end of the list of sorted Walsh averages, would have 94.78% confidence.
Since these are so close to 95%, perhaps either is just as good as the fuzzy interval for practical purposes, but fuzzy.signrank.ci does what it was asked to do, whether it is useful or not. Here the ordinary intervals are fairly simple and close to what is wanted, so the advantage of fuzziness is not so clear.