This section is about the ordinary tests of statistical hypotheses you saw in your introductory statistics class. Nothing new here; we describe what you already know for ready comparison with randomized and fuzzy tests.
The decision-theoretic view says the job of a test is to arrive at one of two decisions: either accept or reject the null hypothesis. It is done in a way such that the so-called type I error of rejecting the null hypothesis when it is true is controlled. A number α between zero and one is chosen, called the significance level, and the test is constructed so that

Pr(reject H_{0}) ≤ α

where the probability is calculated assuming the null hypothesis is true.
As an aside, one may well ask why not control the so-called type II error of accepting the null hypothesis when it is false. The reason has nothing to do with this being a worse idea. Rather the reason is pure mathematical convenience: that's what we have to do to get a workable procedure. Suppose the null and alternative hypotheses are

H_{0}: μ = μ_{0}
H_{1}: μ > μ_{0}

(an upper-tailed test). Then the null hypothesis specifies a single parameter value, hence a single probability distribution, which can be used to do a calculation. The alternative hypothesis specifies many parameter values, hence many probability distributions, which can be used to do many calculations, but not a single calculation. The type II error probability

Pr_{μ}(accept H_{0})

depends on which μ in the alternative hypothesis you pick. This cannot provide a single criterion that tells us how to do the test.
Most tests (and all we will consider in this course) depend on a test statistic: a function of the data only, not depending on unknown parameters, although it may depend on the value of the parameter hypothesized under the null hypothesis (that is, it can depend on μ_{0} but not μ). The test rejects H_{0} when the test statistic is large.
For a concrete example, consider the sign test. The test statistic for an upper-tailed test is the number B of data values above the hypothesized value μ_{0} of the population median. Under the null hypothesis (if the true unknown μ is actually μ_{0}) and under the required assumptions (the population distribution is continuous) the distribution of B is symmetric binomial, that is, Binomial(n, 1/2) where n is the data sample size.
For an even more concrete example, consider sample size 10. Then an upper tail area table for this particular problem is calculated as follows
Rweb:> n <- 10
Rweb:> x <- seq(0, n)
Rweb:> p <- pbinom(x - 1, n, 1/2, lower.tail = FALSE)
Rweb:> print(cbind(x, p), digits = 2)
       x       p
 [1,]  0 1.00000
 [2,]  1 0.99902
 [3,]  2 0.98926
 [4,]  3 0.94531
 [5,]  4 0.82812
 [6,]  5 0.62305
 [7,]  6 0.37695
 [8,]  7 0.17187
 [9,]  8 0.05469
[10,]  9 0.01074
[11,] 10 0.00098
The only significance levels we can choose are those that appear in the table. At α = 0.00098 we reject when B ≥ 10. At α = 0.01074 we reject when B ≥ 9. At α = 0.05469 we reject when B ≥ 8. And these are the only significance levels of any practical interest.
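The dependence of the type II error on the alternative, mentioned in the aside above, is easy to see in this example. The following is our own illustration, not part of the original text: if the true median exceeds μ_{0} so that p = Pr(X > μ_{0}) > 1/2, then B ~ Binomial(10, p), and the type II error probability of, say, the α = 0.05469 test (reject when B ≥ 8) is Pr(B ≤ 7), which varies with p.

```r
# Our own illustration: type II error of the upper-tailed sign test with
# n = 10 that rejects when B >= 8.  Under an alternative with
# p = Pr(X > mu_0) > 1/2, we have B ~ Binomial(10, p).
p <- c(0.6, 0.7, 0.8, 0.9)
type2 <- pbinom(7, 10, p)   # Pr(B <= 7), the probability of accepting H_0
round(type2, 4)
```

The type II error shrinks as the alternative moves farther from the null, which is exactly why no single alternative can serve as the criterion for constructing the test.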
The decision-theoretic viewpoint is very clean theoretically. And it does apply in some practical situations, such as industrial quality control. After testing one does make an immediate decision with real-world effects, say either keep the assembly line running or stop it and correct a problem. The type I error is stopping when there is in fact nothing wrong (just an unlucky sample). The type II error is keeping it running when there is actually a problem, producing defective product that is shipped to customers.
In scientific inference, such decisions are rare to nonexistent. A statistical test is merely a way of discussing results. The decision to accept or reject affects nothing other than ink on paper and whatever impression that makes on readers. Nothing is really decided by these data alone. Many experiments and papers contribute to the eventual consensus of the scientific community.
For scientific inference, so-called P-values are better.
There are two ways to look at P-values.
The P-value is the lower bound of α at which the null hypothesis is rejected and the upper bound of α at which it is accepted.
The P-value can also be defined with no reference to the decision-theoretic view as

P = Pr(B ≥ b)

(for an upper-tailed test), where b is the observed value of the test statistic (the value calculated from the actual data) and B is a random variable having the distribution of the test statistic under the null hypothesis.
For a concrete example, we again use the sign test with sample size 10 and do an upper tailed test, which uses the table calculated above. If we observe B = 10, we report P = 0.00098. If we observe B = 9, we report P = 0.01074. If we observe B = 8, we report P = 0.05469. And, again, these are the only ones of any practical or scientific interest.
A randomized test is an idea which at first sight appears loony (a stupid theoretician trick). It is a severe problem in doing theory about tests involving discrete distributions (like the sign test) that only a few significance levels are possible (those in the distribution table of the null distribution, like 0.00098, 0.01074, and 0.05469 in the examples above). We cannot arrange Pr(reject) = α for any α, only for a few.
A randomized test splits the atom (atom being a term of probability theory for a possible value of a discrete random variable). An example shows how. Suppose we want α = 0.05. A conventional test can't have that. If only we could split the atom corresponding to B = 8.
We have probability 0.01074 for B > 8. We have 0.05 − 0.01074 = 0.03926 left to go to get to the level we want. So here is how we split the atom. The total probability of the atom is 0.05469 − 0.01074 = 0.04395. The part of the atom we want is 0.03926 / 0.04395 = 0.89329. So we reject with probability one when B > 8, reject with probability 0.89329 when B = 8, and reject with probability zero otherwise. Then the total probability of rejection adds up to exactly 0.05.
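The arithmetic above is easy to check in R. This is just a recomputation of the numbers in the text (working with the exact binomial probabilities rather than the rounded table values, so the atom fraction comes out 0.8933 rather than 0.89329).

```r
# Recompute the atom splitting for the sign test, n = 10, alpha = 0.05.
n <- 10
p.gt8 <- pbinom(8, n, 1/2, lower.tail = FALSE)  # Pr(B > 8), about 0.01074
p.eq8 <- dbinom(8, n, 1/2)                      # Pr(B = 8), about 0.04395
gamma <- (0.05 - p.gt8) / p.eq8                 # fraction of the atom, about 0.8933
size <- p.gt8 + gamma * p.eq8                   # total rejection probability
# size is exactly 0.05
```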
From a theoretical point of view this is great stuff. From a practical scientific point of view this is bizarre. People don't like randomness at all. Most people don't like the way conventional statistical tests work. The notion that the decision in a hypothesis test should involve additional randomness having nothing to do with sampling the data is very strange.
The idea that you and I could analyze the same data set and you decide accept and I decide reject because we followed exactly the same procedure but the additional randomization came out differently is especially bizarre.
Say in the example we continue to use, we observe B = 8, so we both should use a randomized decision. We make a pie chart with 0.89329 of the disk painted red and 0.10671 painted green. Then we attach a spinner pivoted in the center. You spin and it comes up green = accept. I spin and it comes up red = reject. This is doing statistics? Sounds like a little kid's board game.
Nevertheless, this is the key idea in PhD level theory for exact tests of statistical hypotheses for discrete data (there is never any need to randomize with continuous data because there are no atoms to split, every point in the sample space has probability zero). This antagonism of theory and practice has existed for most of the twentieth century.
A recent paper by your humble instructor and another member of this department gave randomized tests a new interpretation. We think actual randomized decisions make no sense. But there is nothing wrong with simply reporting the randomization probability. In the example where we observed B = 8, we both report that the randomized test would reject with probability 0.89329, but we don't do any additional randomization. We just stop here and write that number 0.89329 in our report of the results. Thus all statisticians analyzing the same data (by the same procedure) report the same results.
This is perfectly satisfactory in scientific inference where no immediate real world decision is needed. Readers are perfectly capable of digesting the number 0.89329 for what it is worth.
Of course, in practice, one doesn't want to use the decision-theoretic view in scientific inference anyway.
The theory of randomized tests never had anything corresponding to P-values, but the paper mentioned above does give such a notion, which it calls the fuzzy P-value.
A fuzzy P-value can be thought of as either a random number having a probability distribution or as simply a smeared-out fuzzy number. Either way it's the same theory, just with a different interpretation.
We won't give here the general theory of the paper mentioned above, because when applied to rank tests (there is another paper specifically about that) everything is much simpler than in the general theory.
For any test whose test statistic B has a symmetric distribution, the fuzzy P-value is uniformly distributed on an interval, which is described in the table below (adapted from the paper mentioned above).
| interval type | lower endpoint | upper endpoint |
| --- | --- | --- |
| upper-tailed | Pr(B > b) | Pr(B ≥ b) |
| lower-tailed | Pr(B < b) | Pr(B ≤ b) |
| two-tailed | Pr(\|B\| > \|b\|) | Pr(\|B\| ≥ \|b\|) |
Consider the upper-tailed test in particular (the example we have been doing throughout this page). The conventional P-value (when we observe B = 8 and n = 10) is Pr(B ≥ b) = 0.05469, which we now learn is the upper endpoint of the range of the fuzzy P-value. The lower endpoint in this case is Pr(B > b) = 0.01074.
Think of the fuzzy P-value as a random variable P, which in the example is uniformly distributed on the interval (0.01074, 0.05469). Then it provides an exact randomized test as follows. Fix a significance level α. Reject the null hypothesis if P ≤ α. Note this is exactly the same recipe as one uses with conventional P-values. The only difference is that now P is still random given the test statistic B = 8. This additional randomness, inherent in a randomized test, is what makes the test exact for any α.
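The exactness can be verified directly (our own check, using only the facts stated above): given B = b, the conditional probability of rejection is Pr(P ≤ α) where P is uniform on (Pr(B > b), Pr(B ≥ b)), and averaging this over the null distribution of B gives exactly α.

```r
# Check exactness of the randomized test based on the fuzzy P-value.
# For each possible b, P is uniform on (Pr(B > b), Pr(B >= b)), so
# Pr(P <= alpha | B = b) = punif(alpha, lo, hi).  Averaging over the
# null distribution Binomial(10, 1/2) gives alpha exactly.
n <- 10
alpha <- 0.05   # any value in (0, 1) works
b <- 0:n
lo <- pbinom(b, n, 1/2, lower.tail = FALSE)      # Pr(B > b)
hi <- pbinom(b - 1, n, 1/2, lower.tail = FALSE)  # Pr(B >= b)
size <- sum(dbinom(b, n, 1/2) * punif(alpha, lo, hi))
# size equals alpha (up to floating point)
```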
The theory in the preceding section is nice, but of little practical or scientific relevance. In practice, we just compute the P-value and interpret it directly, with no mention of randomized tests, significance levels or other theoretical claptrap.
Low P-values are evidence against the null hypothesis. P = 0.0001 says that, if we assume the null hypothesis is true, then an event having probability 0.0001 has occurred. Since this is very unlikely, perhaps the assumption was incorrect, that is, the null hypothesis is actually false. Thus the lower the P-value, the stronger the evidence against the null hypothesis.
If you are on the side that wants to reject the null hypothesis (there are two sides to every question, my grandmother used to say), then low P-values are good (for your side) and high P-values are bad (for your side). Somewhere in the middle, traditionally around 0.05 (one chance in twenty), P-values are equivocal.
Clearly, this equivocality is not sharp. Not only is P = 0.05 equivocal, so are P = 0.04 and P = 0.06. People who through bad statistics teaching believe P = 0.049 means your theory has been proved by statistics and P = 0.051 means your data are worthless have no understanding of either science or statistics.
When P-values become fuzzy, the equivocality becomes a bit more of an issue, but the issues are similar. In the example where we had P uniformly distributed on the interval (0.01074, 0.05469), it is clear that we are in the equivocal situation, with the interval straddling the conventional 0.05 level. But most of the interval is below 0.05. Thus this fuzzy P-value provides moderate but not overwhelming evidence against the null hypothesis. Saying anything stronger is an attempt to make the data say more than they really do.
A fuzzy confidence interval can also be thought of as a randomized procedure. Again, let us keep on using sample size n = 10 as our example. The possible confidence levels are calculated as follows
Rweb:> n <- 10
Rweb:> k <- seq(1, n / 2)
Rweb:> level <- 100 * (1 - 2 * pbinom(k - 1, n, 1 / 2))
Rweb:> print(cbind(k, level), digits = 4)
     k level
[1,] 1 99.80
[2,] 2 97.85
[3,] 3 89.06
[4,] 4 65.62
[5,] 5 24.61

(For an explanation of this code see the confidence intervals associated with the sign test page.)
We can have 99.80% confidence, 97.85% confidence, 89.06% confidence, 65.62% confidence, or 24.61% confidence. No other levels are possible unless we use randomized or fuzzy intervals.
The randomized confidence interval proceeds as follows. To get 95% confidence, use the 97.85% interval part of the time and the 89.06% interval the rest of the time, adjusting to get 95% all together. If p is the fraction of the time we use the larger interval, then we need to solve

97.85 p + 89.06 (1 − p) = 95

which is equivalent to

89.06 + 8.79 p = 95

or

p = 5.94 / 8.79 = 0.6758
One way to report a fuzzy confidence interval is to just say you use one interval so and so often and the other so and so often. In the n = 10 example for the interval dual to the sign test, we calculate that 0.6758 of the time you use the interval (X_{(2)}, X_{(9)}), where the parenthesized subscripts indicate order statistics, and 0.3242 of the time you use the interval (X_{(3)}, X_{(8)}).
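The mixing probability can be recomputed in R (our own check). Using the exact rather than the rounded confidence levels gives 0.6756; the 0.6758 above comes from doing the arithmetic with the levels rounded to two decimal places.

```r
# Solve level2 * p + level3 * (1 - p) = 0.95 for p, where level2 and
# level3 are the exact confidence levels of the (X_(2), X_(9)) and
# (X_(3), X_(8)) intervals for n = 10.
level2 <- 1 - 2 * pbinom(1, 10, 1/2)   # about 0.9785
level3 <- 1 - 2 * pbinom(2, 10, 1/2)   # about 0.8906
p <- (0.95 - level3) / (level2 - level3)
round(p, 4)
```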
We don't recommend that you actually report a random choice of intervals. You should report the whole description: both intervals and the probabilities of each.
That way there is no added randomness in the report. Every statistician analyzing the same data makes the same report.
Another way to represent this interval is as a fuzzy set. This is the way the fuzzyRankTests package does it.
Let us make up some data and apply this procedure to it.
The resulting plot is shown below
The jumps (where the solid dots are) are the order statistics X_{(2)}, X_{(3)}, X_{(8)}, and X_{(9)}. They will be in different locations if you remake the plot, because the data are simulated afresh each time. Everything else should be the same. The height of the middle steps is 0.6758, as we calculated above.
How do we interpret this object? We say it gets full credit if the true unknown parameter value (the true unknown population median) is in the interval where it has the value one, which is the interval (X_{(3)}, X_{(8)}). And it gets partial credit, 0.6758, if the true unknown parameter value is in either of the two shorter intervals where it has that value. (And no credit otherwise.)
Averaged over the distribution of the data, intervals formed in this fashion will cover the true unknown parameter value (counting full and partial credit correctly) exactly 95% of the time.
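The coverage claim can be checked by simulation. This is our own sketch, not code from the text: any continuous population distribution works, and here we use the standard normal, whose median is 0.

```r
# Simulate coverage (counting partial credit) of the fuzzy 95% interval
# for the population median with n = 10.  The mixing probability p is
# computed from the exact confidence levels (about 0.6756).
set.seed(42)
level2 <- 1 - 2 * pbinom(1, 10, 1/2)
level3 <- 1 - 2 * pbinom(2, 10, 1/2)
p <- (0.95 - level3) / (level2 - level3)
m <- 0   # true median of the standard normal
credit <- replicate(20000, {
  s <- sort(rnorm(10))                  # order statistics X_(1), ..., X_(10)
  if (s[3] < m && m < s[8]) 1           # full credit: inside (X_(3), X_(8))
  else if (s[2] < m && m < s[9]) p      # partial credit: inside (X_(2), X_(9))
  else 0
})
mean(credit)   # close to 0.95
```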
These are exact fuzzy confidence intervals just like the objects described in the preceding section are exact randomized confidence intervals. We don't usually bother to even distinguish the two procedures, since they are so closely related. We just call both fuzzy. Some people insist on calling both randomized. They are what they are. A rose by any other name . . . .
If the population distribution is continuous, then the values of the fuzzy confidence interval at the jumps (shown by the dots in the figure above) do not matter, because the probability of a data point being exactly at the population median is zero. When ties occur (covered in the next section), then the dots do matter.
Ties do almost nothing to fuzzy confidence intervals (for rank tests, done by the fuzzyRankTests package). When there are ties, then some of the intervals seen in the plot above may disappear. For example, if X_{(2)} = X_{(3)}, then the short interval between about 0.8 and 0.9 will disappear; there will be only one jump in this region (not two) and only one dot (not two), which is the value at the jump. But the idea is much the same.
Let's try it.
The tricky code at the bottom shows where the simulation truth parameter value (the median of the distribution of the rounded simulated data) is. Each time the code is run, different random numbers are used. So doing this over and over is like doing an (inefficient) simulation study.
In order to see more failures of the fuzzy confidence interval, lower the confidence level, with
plot(fuzzy.sign.ci(x, conf.level = 0.75))
or something of the sort replacing the analogous code given (the default level is 0.95).
Summary: These intervals are exact fuzzy confidence intervals, meaning that, regardless of the true distribution of the data and regardless of the true unknown median of the data (for the sign test), the fuzzy coverage (counting both full and partial credit) is exactly the nominal level.
Ties do a lot more to fuzzy P-values.
The fuzzy tests that are dual to the fuzzy confidence intervals illustrated above work as follows. Conceptually, we do a randomized test, breaking all ties by fair randomization. For the sign test in particular, we (conceptually) flip a fair coin to move each data value tied with the value of the parameter of interest (population median) hypothesized under the null hypothesis to above or below (50% chance of going each way). This gives us a distribution of values of the test statistic. For each way the randomization comes out, we know what to do: the fuzzy P-value would be uniform on some interval whose endpoints are numbers in the CDF of the null distribution of the test statistic (the symmetric binomial distribution for the sign test). When we unrandomize, we get a mixture of these uniform distributions, which is a distribution whose PDF is a step function, one step for each possible value of the test statistic under the randomization.
It's easier to look at an example than try to picture what this does.
When there are no ties (no data points equal to zero or whatever one sets mu.hypothesized to), then we get the usual fuzzy P-value, which is uniform on some interval (which depends on how many data points are above and below the hypothesized value).
When there are ties (zeroes or whatever), then we get a fuzzy P-value whose PDF has as many steps as there are possible ways to break ties.
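The computation just described can be sketched in a few lines of base R (our own code, not from the fuzzyRankTests package, and the particular counts n = 10, b = 7, t = 2 are made up for illustration): with t data values tied with the hypothesized median and b values strictly above it, the coin flips add T ~ Binomial(t, 1/2) to the test statistic, and the fuzzy P-value is the corresponding mixture of uniforms.

```r
# Fuzzy P-value for the upper-tailed sign test with ties, following the
# description above.  n is the sample size, b the count strictly above
# the hypothesized median, t the count tied with it (made-up numbers).
n <- 10; b <- 7; t <- 2
k <- 0:t
w <- dbinom(k, t, 1/2)  # probability of each way the coin flips come out
lo <- pbinom(b + k, n, 1/2, lower.tail = FALSE)      # Pr(B > b + k)
hi <- pbinom(b + k - 1, n, 1/2, lower.tail = FALSE)  # Pr(B >= b + k)
# The fuzzy P-value is the mixture of Uniform(lo[i], hi[i]) with weights
# w[i]; its PDF is a step function, one step per possible value of b + k.
cdf <- function(x) sum(w * punif(x, lo, hi))  # CDF of the fuzzy P-value
```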
But the interpretation is the same in both cases. Low P-values are good (for the side that wants to reject the null hypothesis). High P-values are bad. Intermediate P-values are equivocal. And the fact that they are now smeared out, either uniformly or nonuniformly, over some interval doesn't make much difference in interpretation.
Summary: These P-values are exact fuzzy P-values, meaning that, regardless of the true distribution of the data, when the null hypothesis is true and for any fixed α the probability that P < α is α, where P is a random variable whose PDF is plotted.