Tests of significance are very important. Most scientific research depends on them, but most examples in introductory textbooks are so silly that it is hard to see the point (and our textbook is like the rest in this respect). Thus we begin with an extended example that, though made up, is enough like real scientific problems to show the point.
Suppose a psychologist is studying learning by running mice through a maze. She has 20 mice, each of which has been run through the same maze five times. The mice are all genetically identical, from an inbred strain commonly used in laboratory experiments. We can think of them as a random sample from the population of all mice from the same strain raised under the same laboratory conditions.
All the data are in the URL
These data are rather complicated — too complicated for us to analyze in their entirety now (this is a job for two-way analysis of variance, covered in the follow-on course, Stat 3022) — so we (and the hero of our story, the psychologist) consider something simpler.
Instead of considering all five runs, we consider only the difference between the first and last times through the maze. This difference, if positive, is taken to indicate learning. (If the first time is greater than the last time, the difference first minus last is positive: the mouse ran faster after practice.)
So let's look at the first minus last differences.
The mean is positive, which indicates the mice got faster with practice. Or does it? Not all mice were faster. From the stem-and-leaf plot we see that five were actually slower (negative difference), and some of the rest were not much faster.
Let's try out the only statistical inference method we have already learned and construct a 95% Student t confidence interval for the unknown population mean μ (the mean learning effect for the whole population of mice of which these 20 are considered a random sample).
We copy the relevant printout below
95 percent confidence interval:
  5.127726 26.281274
sample estimates:
mean of x
  15.7045
So the confidence interval says the population mean μ is probably positive, even though not every individual mouse's difference is positive.
A test of significance, also called a statistical hypothesis test, is a slightly different twist on the same mathematics used in the confidence interval.
Instead of trying to estimate the difference due to learning, it focuses on a different question: whether there is any learning at all. Never mind the size of the learning effect. How can we be sure that there is any effect at all? There is a lot of variability in the maze running performance of mice. Some of the mice had negative differences in performance, running the maze slower after practice. Perhaps the apparent learning effect — positive sample mean — was due to chance. Perhaps if the experimenter redid the experiment with another 20 mice or even with the same 20 mice, the results would come out the other way.
Many people don't like this. They don't like to think their ideas may be wrong. Our hero, the experimental psychologist, is no different. She doesn't like this either. If she hadn't thought there was a learning effect, then she wouldn't have done the experiment. But in order for her work to have any value, she must convince other scientists of its validity. They aren't convinced she is right. (And when the shoe is on the other foot, she isn't convinced that other scientists are right. Scientists don't get convinced without good evidence.) Thus she is forced to consider how the skeptics will think. She is forced to consider that there may be no learning effect, in which case the true population mean μ is zero.
It is traditional in statistics to state this formally, a statement called the null hypothesis of the test of significance.
There is no learning effect and the differences in running times are entirely due to chance. The population mean difference μ is zero, and the sample mean difference x̄ is different from zero only because of chance variation.
This says that the sample mean 15.7045, though apparently large, only appears to indicate a positive learning effect. The true learning effect μ is zero (that's what the null hypothesis assumes) and there is no learning.
The experimenter doesn't make this assumption because she thinks it is true. In fact, she doesn't like this assumption one bit, because if it were true it would mean that she has wasted a lot of time on a stupid experiment with no point — an experiment designed to study an effect that doesn't really exist.
The null hypothesis is only assumed (temporarily) in order to do a calculation that will prove it is false. As my father used to say, it is a straw man to knock down. The point is that making a precise assumption allows a precise probability calculation.
So we calculate. What is the probability of getting a sample mean X̄ as large as or larger than 15.7045 when in fact μ = 0?

First we standardize. The random variable

t = (X̄ − μ) ⁄ (s ⁄ √n)

has the Student t distribution with n − 1 degrees of freedom.

Plugging in the data for this problem (x̄ = 15.7045, n = 20, and the sample standard deviation s) we get t = 3.107743.

Note that we use the value of the unknown parameter μ that is hypothesized under the null hypothesis (here zero), so we have everything we need to calculate t.
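For readers following along without Rweb, the t statistic can be cross-checked in a few lines of Python. Since the raw data are not reproduced here, we back-derive the standard error from the reported 95% confidence interval (an assumption on our part: the interval half-width equals t* × SE with t* = 2.093 for 19 degrees of freedom, from Table C).

```python
# Summary numbers from the printout above
xbar = 15.7045            # sample mean of the differences
ci_lower = 5.127726       # lower end of the 95% confidence interval
t_star = 2.093            # t critical value for 19 df (Table C, .025 column)

# Back-derive the standard error s / sqrt(n) from the interval half-width
se = (xbar - ci_lower) / t_star

# Standardize under the null hypothesis mu = 0
t = (xbar - 0) / se
print(t)   # about 3.1077, agreeing with the text
```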
Make the following assumption: the population distribution of the differences (diff15 = run1 - run5) is exactly normal.
Then the distribution of t is exactly Student's t distribution for n − 1 degrees of freedom (here 19 degrees of freedom).
So our question comes down to the probability that t > 3.107743 when t has Student's t distribution for 19 d. f.
Answer (copied from the printout of the form above)
Rweb:> pt(3.107743, df = 19, lower.tail = FALSE)
[1] 0.002897115
This number is called the P-value of the test. The scientist will report something like
The learning effect (mean difference of first and last runs) was statistically significant (P = 0.003, upper-tailed t test).
This language is a widely used convention. It alludes to the following more detailed argument. If the null hypothesis (μ = 0) is true, then something very improbable has happened, the probability being 0.003. This is so improbable that we can dismiss the null hypothesis from further consideration. Hence the argument of the skeptics (who argue that μ = 0) has been defeated. We can say the learning effect is real.
As my mother used to say, there are two sides to every question. That applies to statistical tests. There is the side that wants to reject the null hypothesis (our hero, the experimental psychologist) and the side that wants to accept the null hypothesis (the skeptics). If you are on the side that wants to reject the null hypothesis, then low P-values are good (for your side). The lower the P-value, the stronger the evidence against the null hypothesis.
After all, P = 0.1 only means that if the null hypothesis is true, then a one-chance-in-ten long shot has occurred, but 1 ⁄ 10 is hardly a low probability. If you repeat the experiment 10 times, you should see just such a long shot by chance variation even if the null hypothesis is true. A high P-value, like 0.3, is evidence against the null hypothesis so weak as to be no evidence at all.
So where is the borderline between strong evidence and weak evidence?
People like clear guidelines. To cater to this hankering, introductory statistics books have for seventy years (about as long as there have been introductory statistics books) promulgated the arbitrary dividing line of 0.05 (one chance in twenty). By this rule, if P < 0.05, then the result is statistically significant, otherwise not.
The number 0.05 does have the virtue of being a round number, at least when considered by humans. It is round because humans have five fingers. If we were intelligent computers, who count in binary, we would consider some power of two, perhaps 1 ⁄ 16 or 1 ⁄ 32, the arbitrary dividing line between statistical significance and insignificance.
In the opinion of your instructors and your textbook author, the arbitrary dividing line 0.05, or any other arbitrary dividing line, has absolutely nothing to recommend it. It is nonsense. From the textbook, Basic Practice of Statistics by Moore, p. 371,
There is no sharp border between "significant" and "insignificant", only increasingly strong evidence as the P-value decreases. There is no practical distinction between the P-values 0.049 and 0.051.
Or as I once put it more strongly, in language that a co-author cut from our paper as too argumentative
Anyone who thinks there is a scientifically meaningful distinction between P = 0.049 and P = 0.051 understands neither science nor statistics.
The reason we make this point so strongly is that you will find if you do much statistics that many scientists have just this view: that 0.05 is the universal borderline of statistical significance. That's what they learned when they took introductory statistics. They use it in their scientific papers and do not get complaints. Such is the strength of culture, even in science. Culture changes slowly.
Fortunately, the culture is changing. Most introductory statistics textbooks written in the past ten years agree with us.
Just give the exact P-value. Let readers decide whether it is "statistically significant" or not. They will anyway.
Back in the stone age, before everyone had a desktop or laptop computer with statistics software installed (or at least a web browser to use Rweb), it was hard to get P-values. Tables in books were inadequate.
Suppose we want to calculate the P-value for the test in our example above (recall it is P = 0.002897115). We need to calculate
where t has a Student t distribution with n − 1 = 19 degrees of freedom.
Now we get to use the header at the top of Table C in the textbook which gives "upper tail probability". We need to do a backwards lookup in this table. Find t in the appropriate (df = 19) row of the table, and report the upper tail area from the header.
df |  .25   .20   .15   .10   .05  .025   .02   .01  .005 .0025  .001 .0005
19 | 0.688 0.861 1.066 1.328 1.729 2.093 2.205 2.539 2.861 3.174 3.579 3.883
The problem is that we don't have many numbers to work with. All we can say is that since t = 3.107743 lies between the tabulated values 2.861 and 3.174, the P-value must be between the corresponding upper tail probabilities in the header, that is, 0.0025 < P < 0.005.
And if we had used an older book with a smaller table, we might have only been able to say, perhaps P < 0.01 or even only P < 0.05. Some older, computer-illiterate scientists still think this is a reasonable way to report a P-value — sloppily. It's really absurd in this day and age.
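The backwards table lookup can be sketched in Python. The dictionary below is the df = 19 row of Table C (upper tail probability mapped to critical value, as in the text); `bracket_p` is our own hypothetical helper name.

```python
# Table C, df = 19 row: upper tail probability -> critical value
row19 = {0.25: 0.688, 0.20: 0.861, 0.15: 1.066, 0.10: 1.328,
         0.05: 1.729, 0.025: 2.093, 0.02: 2.205, 0.01: 2.539,
         0.005: 2.861, 0.0025: 3.174, 0.001: 3.579, 0.0005: 3.883}

def bracket_p(t, row):
    """Backwards lookup: bracket the upper-tail P-value for observed t."""
    # Sort by critical value; the tail probabilities then run largest to smallest.
    items = sorted(row.items(), key=lambda kv: kv[1])
    for (p_big, c_low), (p_small, c_high) in zip(items, items[1:]):
        if c_low <= t < c_high:
            return p_small, p_big       # p_small < P < p_big
    return None                         # t falls off the end of the table

print(bracket_p(3.107743, row19))      # (0.0025, 0.005)
```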
This section could also be called the formal theory of tests of significance. At least this is as formal as we are going to get.
A test of significance involves three things.
The null hypothesis, we have already met. In the example about mice it was μ = 0. A null hypothesis is always a statement that specifies the value of some parameter or parameters.
The test statistic, we have already met, although we did not give it a specific name before. In the example about mice it was

t = (X̄ − μ) ⁄ (s ⁄ √n)

As it stands, this is not a statistic, because it depends on a parameter μ. It becomes a statistic when we plug in the value of μ specified under the null hypothesis. Then, as required, when the null hypothesis is true, the distribution of this test statistic is completely known: the Student t distribution with n − 1 degrees of freedom.
The null hypothesis always makes the test statistic a statistic (not depending on unknown parameter values) and completely determines the sampling distribution of the test statistic. This is the distribution of the test statistic when the null hypothesis is true.
Finally, we choose the tail or tails to test. Many tests come in three kinds. If the distribution of the test statistic t when the null hypothesis is true is symmetric about zero, then these are
lower tail:  P(T ≤ t)
upper tail:  P(T ≥ t)
two tail:    P(|T| ≥ |t|)
In the table above t is the observed value of the test statistic (calculated from the data), and T is a random variable having the distribution of the test statistic when the null hypothesis is true. This uses the capital letters for random variables and lower case letters for numbers convention. When we think of the test statistic as a random variable (because it is calculated from a random sample), then it is T. When we think of the test statistic as a number (like the number 3.107743 for the data in our example), then it is t.
Often tail type is associated with an alternative hypothesis that indicates the tail or tails favored by the test — the tail or tails that contribute to the P-value.
The null hypothesis is often denoted H0 and the alternative hypothesis Ha. With this notation, we can expand our table.
tail         P-value        H0       Ha
lower tail   P(T ≤ t)       μ = 0    μ < 0
upper tail   P(T ≥ t)       μ = 0    μ > 0
two tail     P(|T| ≥ |t|)   μ = 0    μ ≠ 0
The formula for a two-tailed P-value looks complicated, but for all of the two-tailed tests we will meet in this course, it is actually very simple.
When the distribution of the test statistic when H0 is true is symmetric, then the two-tailed P-value is twice the smaller of the one-tailed P-values.
In short, two tails is twice one tail.
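As a tiny Python illustration of the "two tails is twice one tail" rule (the function name `two_tailed` is ours, not from any library):

```python
def two_tailed(p_lower, p_upper):
    """Two-tailed P-value from the two one-tailed P-values,
    for a test statistic whose null distribution is symmetric."""
    return 2 * min(p_lower, p_upper)

# Upper-tailed P-value from the mouse example; the lower tail is its complement
p_upper = 0.002897115
print(two_tailed(1 - p_upper, p_upper))   # twice the smaller tail
```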
So in the example above, which had P = 0.002897115 one-tailed, we would have P = 2 × 0.002897115 = 0.00579423 two-tailed. As before, our scientist would round this in her report
The learning effect (mean difference of first and last runs) was statistically significant (P = 0.006, two-tailed t test).
A one-tailed test thus always looks "more significant" than the two-tailed test. So which should be used? If you would claim "statistical significance" no matter which way the data comes out — no matter which tail of its sampling distribution (when the null hypothesis is true) the test statistic is in — then use a two-tailed test. If you would only claim "statistical significance" when the test statistic is in one particular tail of its sampling distribution (when the null hypothesis is true), then use the corresponding one-tailed test.
The issue is what the scientist would have done if the data had come out in the other tail. What would our experimental psychologist have done if the 20 mice in her experiment had run slower with practice? If she would have reported a statistically significant "tiredness effect" or some such, then she must use a two-tailed test in reporting results for either (actual or imaginary) data.
These are the test procedures related to confidence intervals for the population mean based on the normal distribution.
As with the confidence intervals there are two kinds, exact and approximate (large n). If the population distribution is exactly normal, then

z = (X̄ − μ) ⁄ (σ ⁄ √n)

has the standard normal distribution (exactly). Here σ is the population standard deviation, which must be known. If σ is unknown, then

z = (X̄ − μ) ⁄ (s ⁄ √n)

has the standard normal distribution (approximately, when n is sufficiently large).
In either case the P-values are calculated in the same way and the null and alternative hypotheses are stated the same way.
tail         P-value        H0        Ha
lower tail   P(Z ≤ z)       μ = μ0    μ < μ0
upper tail   P(Z ≥ z)       μ = μ0    μ > μ0
two tail     P(|Z| ≥ |z|)   μ = μ0    μ ≠ μ0
For an example of this procedure we use the data in the URL
which are just like the data in the example about mice but with larger sample size (in fact the first 20 rows of this data set are the entirety of the data set for the earlier example). Now the sample size is 50, which makes the normal approximation reasonable.
As in the other example, we do an upper-tailed test.
Now the P-value (P = 0.0309, upper-tailed z test) provides much weaker evidence against the null hypothesis than the data set with only 20 mice. It is still significant by the conventional 0.05 borderline, and by anyone's standard it still provides moderate evidence against the null hypothesis.
Notice that if a two-tailed test had been done, the P-value would be twice the one-tailed value (P = 0.0619, two-tailed z test), and this would not be significant by the conventional 0.05 borderline, although it would still provide moderate evidence against the null hypothesis.
If we had wanted to do a lower-tailed test, the P-value would have been calculated the same way except for the last command, which would become pnorm(z) (R's pnorm gives the lower tail by default).
If we had wanted to do a two-tailed test, the P-value would be twice the P from the upper-tailed test or the lower-tailed test, whichever was smaller.
The statistics for the data set analyzed above are the sample size n = 50 and the sample mean x̄ = 7.945. The null and alternative hypotheses are H0: μ = 0 and Ha: μ > 0, indicating an upper-tailed test with μ0 = 0. Thus the test statistic is z = 1.8673, the same value the computer reported.
Then the P-value is the area to the right of z under the standard normal density curve (Table A). The area to the left of z is found in the following row of the table
z   |  .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
1.8 | .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
The area to the left of z = 1.87 is 0.9693, so the area to the right is found by applying the complement rule: 1 − 0.9693 = 0.0307.
This agrees with the answer we got by computer except for some rounding errors.
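The Table A lookup can also be checked in pure Python: the standard normal CDF Φ can be written with the error function from the math module, and the complement rule then gives the upper tail. (This is a cross-check of the table, not the document's Rweb computation.)

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = 1.87                 # test statistic rounded for table lookup
left = phi(z)            # area to the left, as tabulated in Table A
p_value = 1.0 - left     # complement rule: area to the right
print(round(left, 4), round(p_value, 4))   # 0.9693 0.0307
```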
These are the test procedures related to confidence intervals for the population mean based on the Student t distribution.
If the population distribution is exactly normal, then

t = (X̄ − μ) ⁄ (s ⁄ √n)

has the Student t distribution with n − 1 degrees of freedom. The P-values are calculated in the same way as for the z tests, and the null and alternative hypotheses are stated the same way.
tail         P-value        H0        Ha
lower tail   P(T ≤ t)       μ = μ0    μ < μ0
upper tail   P(T ≥ t)       μ = μ0    μ > μ0
two tail     P(|T| ≥ |t|)   μ = μ0    μ ≠ μ0
Here we use the R function t.test to do the test. This form, when run, produces the following printout.
Rweb:> t.test(diff15, alternative = "greater")

        One Sample t-test

data:  diff15
t = 1.8673, df = 49, p-value = 0.03393
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 0.8114801       Inf
sample estimates:
mean of x
    7.945
Note both the test statistic and the P-value in the printout.
Notice that the test statistic (1.8673) is the same as that for the z test (we just call it t now instead of z).
The P-value 0.0339 is a little higher than that for the z test 0.0309 because the Student t distribution with 49 degrees of freedom has slightly heavier tails than the standard normal distribution.
The learning effect (mean difference of first and last runs) was statistically significant (P = 0.0339, upper-tailed t test).
The alternative argument to the t.test function has possible values "two.sided" (the default), "less", and "greater". To do an upper-tailed test, you must say, as in the example above, alternative = "greater". To do a lower-tailed test, you must say alternative = "less". To do a two-tailed test, you may say alternative = "two.sided", but you can also omit the alternative argument entirely.
To specify a different μ0 use the optional argument mu. For example,

t.test(fred, mu = 6, alternative = "greater")

tests the hypotheses H0: μ = 6 versus Ha: μ > 6 for the dataset fred.
As mentioned in the section on doing the test by computer, the test statistic t = 1.8673 is the same whether we are doing a t test or a z test.
We need to look this up in the Student t distribution table (Table C) for n − 1 = 49 degrees of freedom. The table only has rows for 40 and 50 degrees of freedom.
df |  .25   .20   .15   .10   .05  .025   .02   .01  .005 .0025  .001 .0005
40 | 0.681 0.851 1.050 1.303 1.684 2.021 2.123 2.423 2.704 2.971 3.307 3.551
50 | 0.679 0.849 1.047 1.299 1.676 2.009 2.109 2.403 2.678 2.937 3.261 3.496
But using either row we see that t = 1.8673 lies between the entries in the .05 and .025 columns, so 0.025 < P < 0.05, because the header gives upper tail areas, which is what we want for an upper-tailed test. Because the numbers for 49 degrees of freedom, if we had them, would be between those for 40 and 50, we conclude that 0.025 < P < 0.05 for 49 degrees of freedom too, but that is the most we can discover from Table C.
This is most unsatisfactory. O. K. for homework, but not acceptable for real data. Use the computer in real life.
To a certain extent, when you've seen one test of significance, you've seen 'em all. There is always a null hypothesis, a test statistic, the sampling distribution of the test statistic when the null hypothesis is true, and a P-value calculated from a tail (or both tails) of that distribution. The details change, but the big picture remains the same. (Readers should go back and make sure they know what each of these terms is for the z and t tests described above.)
Now we look at a so-called nonparametric test of significance, called the sign test. When we discuss the relationship between tests and confidence intervals below, we will see that this is the test dual to the nonparametric confidence interval for the population median presented on our confidence intervals page.
We start with its assumptions. We assume our sample X1, …, Xn is IID from some continuous population distribution. The population distribution can have any shape whatsoever. The only point of the continuity assumption is to rule out ties: all of the Xi will be different from each other, and all will be different from the population median θ.
The null hypothesis specifies a value of the population median: H0: θ = θ0. As with the t and z tests, the hypothesized value θ0 is usually zero, but we allow other choices.
The test statistic Y is the number of Xi that are greater than θ0.
The sampling distribution of the test statistic, when the null hypothesis is true, is binomial. The events Xi > θ0 are independent and all have the same success probability 1 ⁄ 2. Thus the distribution of the test statistic Y is binomial with sample size n and success probability 1 ⁄ 2.
This distribution is symmetric, but in a slight difference from the t and z distributions, it is not symmetric about zero. The possible values of Y range from zero to n and the distribution is symmetric about the point n ⁄ 2.
Thus Y and n − Y have the same distribution, which is slightly different from the other test statistics (T and − T have the same distribution, as do Z and − Z). This leads to slight differences in the formulas for P-values.
tail         P-value                     H0        Ha
lower tail   P(Y ≤ y)                    θ = θ0    θ < θ0
upper tail   P(Y ≥ y)                    θ = θ0    θ > θ0
two tail     2 P(Y ≤ y) or 2 P(Y ≥ y)   θ = θ0    θ ≠ θ0
In the last line of the table, 2 P(Y ≤ y) or 2 P(Y ≥ y) for the P-value means whichever of these can be a P-value is the P-value. One of these will be between zero and one and hence is the P-value. One of these will be greater than one and hence is not.
Note that we still have the two tails is twice one tail property.
We redo the smaller mouse data, which has n = 20.
The value of the parameter (the population median θ) hypothesized by H0 is zero.
The test statistic we can easily calculate in our heads from the stem-and-leaf plot made by the first form on this page. There are 15 differences (first minus last) above zero, so the observed value of the test statistic is y = 15.
To calculate the P-value for an upper-tailed test, we need to calculate P(Y ≥ 15) when Y has the binomial distribution with sample size 20 and success probability 1 ⁄ 2.
Having no tables of the binomial distribution, we use the computer.
Rweb:> pbinom(14, 20, 1 / 2, lower.tail = FALSE)
[1] 0.02069473
and our scientist reports
The learning effect (mean difference of first and last runs) was statistically significant (P = 0.02, upper-tailed sign test).
Some readers should now be asking: where did 14 come from? This is explained in the section on how to shoot yourself in the foot on the web page about the binomial distribution. Simply put,

P(Y ≥ 15) = P(Y > 14) = 1 − P(Y ≤ 14)

and the last expression is what pbinom(14, 20, 1 / 2, lower.tail = FALSE) calculates.
One must be very careful with probabilities of tail areas for discrete distributions.
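For a cross-check that needs no statistical tables at all, the binomial tail can be computed exactly in pure Python with math.comb (this mirrors, but is independent of, the Rweb pbinom computation above):

```python
from math import comb

def upper_tail(y, n):
    """P(Y >= y) for Y ~ Binomial(n, 1/2): the upper-tailed sign test P-value."""
    return sum(comb(n, k) for k in range(y, n + 1)) / 2 ** n

# P(Y >= 15) = 1 - P(Y <= 14), matching pbinom(14, 20, 1/2, lower.tail = FALSE)
print(upper_tail(15, 20))   # 0.0206947...
```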
In general, suppose we have the sample size n and the observed value of the test statistic y already defined as R objects. To calculate the P-value for an upper-tailed sign test do

pbinom(y - 1, n, 1 / 2, lower.tail = FALSE)

To calculate the P-value for a lower-tailed sign test do

pbinom(y, n, 1 / 2)

To calculate the P-value for a two-tailed sign test do both

2 * pbinom(y, n, 1 / 2)
2 * pbinom(y - 1, n, 1 / 2, lower.tail = FALSE)

and take the one of these that is between zero and one.
The fact that we use y for lower-tailed and y - 1 for upper-tailed is confusing, but is just how lower.tail = FALSE works (there is a logical reason why it works this way, but not a logic that applies to tests of significance).
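The three recipes can be collected into one pure-Python sketch (`binom_cdf` and `sign_test` are our own names, not R functions):

```python
from math import comb

def binom_cdf(y, n):
    """P(Y <= y) for Y ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(0, y + 1)) / 2 ** n

def sign_test(y, n, tail="upper"):
    """Sign test P-value for y observations above the hypothesized median."""
    lower = binom_cdf(y, n)              # P(Y <= y)
    upper = 1.0 - binom_cdf(y - 1, n)    # P(Y >= y) = 1 - P(Y <= y - 1)
    if tail == "lower":
        return lower
    if tail == "upper":
        return upper
    # two-tailed: double the smaller tail, capping at 1 when y is near n/2
    return min(2.0 * min(lower, upper), 1.0)

print(sign_test(15, 20, "upper"))   # matches pbinom(14, 20, 1/2, lower.tail = FALSE)
```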
We have now considered four tests.
test   exact?   parameter             assumptions                  robust?
z      yes      population mean μ     population normal, σ known   no
z      no       population mean μ     n large                      yes
t      yes      population mean μ     population normal            no
sign   yes      population median θ   none                         yes
Our table needs a few comments.
We should mention for completeness that there is yet another competitor that we have not covered. It is covered in Stat 5601 and other courses with nonparametrics in the title. It is also covered in our textbook in supplementary Chapter 23, found on the CD-ROM inside the back cover. It would get a row in our table
signrank   yes   population center of symmetry = μ = θ   population symmetric   yes
It is in between on assumptions and robustness. Outliers don't bother it, but skewness does. We say no more about it. (It won't be on the exam.)
So if the sign test is so great and the other tests so bad, why not always use the sign test? It's a sad fact of life that you don't get something for nothing. The robustness of the sign test comes with a cost. If the population really is normal, then the t test (or the z if σ is known) will work better, and will give lower P-values.
We saw this in our examples: P = 0.003, upper-tailed t test, but P = 0.02, upper-tailed sign test. Note the zeros: P = 0.02 is nearly 10 times larger than P = 0.003. The data look a lot more significant when we use the t test.

That was for n = 20. If we reanalyze the n = 50 data, we get a similar picture: P = 0.031, upper-tailed t test, but P = 0.016, upper-tailed sign test. Oops! No, we don't get a similar picture. The sign test looks more significant.
All we can say is that, in general, when the population is exactly normally distributed, the t test will usually but not always give a lower P-value than the sign test.
On the other hand, when the population is not exactly normally distributed and not even close (when you clearly see skewness or outliers in a stem-and-leaf plot, for example), then the t and z procedures are not exact at all. They shouldn't be used, and if they are used anyway, they will do a much worse job than the sign test.
Sometimes the purpose of a test of significance is to make a decision. As we described the use of tests of significance in scientific inference, there are no decisions. The scientist does the experiment, the scientist writes up the statistically significant results with some P-value, and readers make of it what they will, considering not only the P-value but also everything else relevant: their prior opinions on the subject, what other data in other papers say, theoretical arguments, details of the experimental procedures and possible experimental errors and biases, and so forth. The P-value is only a small part (though an important part) of what it takes to convince scientists of the conclusions of the scientist's write-up.
In industrial quality control, if we take a sample of material coming off the production line to the lab for testing, then the purpose is to make a decision between two actions: continue production as usual, or stop production and look for a quality problem. Either decision is costly if mistaken. If the decision to continue is mistaken, we produce defective product and ship it to customers, resulting in costs of rework, support, and lost future sales. If the decision to stop is mistaken, we waste time looking for a problem that doesn't exist.
Tests of significance are suited for making these decisions if the two decisions above can be made to correspond to the null and alternative hypotheses of a test. Then low P-values are evidence in favor of Ha, which in this context has the interpretation "quality problem, stop production", and high P-values are evidence in favor of H0, which in this context has the interpretation "no problem, continue production". Hence we should choose a dividing line α and decide as follows.
P-value   decide for   interpretation
P ≤ α     Ha           quality problem, stop production
P > α     H0           no problem, continue production
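The decision rule in the table is one line of Python (an illustration only; the strings are the interpretations from our quality-control example):

```python
def decide(p_value, alpha):
    """Decision rule: reject H0 exactly when P <= alpha."""
    if p_value <= alpha:
        return "quality problem, stop production"   # decide for Ha
    return "no problem, continue production"        # decide for H0

print(decide(0.003, 0.05))
print(decide(0.3, 0.05))
```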
The same logic applies to any other situation where decisions can be made to correspond to null and alternative hypotheses of a test. The first two columns of our table above stay the same. The interpretation must be adjusted to the new situation.
When one uses tests of significance to make decisions, it is important that α be chosen carefully to minimize costs and maximize benefits (negative costs) on average. This involves calculations too complicated to cover in this course.
A carefully chosen α will never be a round number like 0.05. When you see someone using α = 0.05, you know they aren't making careful decisions. Usually, they aren't making actual decisions at all. The language of decisions is only being used because they think they are supposed to always use it to talk about tests of significance (since most introductory statistics books, until recently, emphasized decisions).
You can tell someone overly influenced by the decision-making view of tests of significance. They report their results like

The learning effect (mean difference of first and last runs) was statistically significant (P ≤ 0.05, upper-tailed t test).

That is, they say P ≤ 0.05 when they could have said P = 0.003.
In real science, this is a terrible idea, a pernicious influence of bad statistics teaching.
There is a theoretical virtue to thinking like this. If it only matters whether P ≤ α or P > α, then the relationship between tests (considered this way) and confidence intervals is very simple.
A test with α = 0.05 goes with a 95% confidence interval, because 95% is 1 − α expressed as a percent.
A two-tailed test with null hypothesis θ = θ0 for any parameter θ has P ≤ 0.05 if and only if a 95% confidence interval for θ does not contain θ0.
A two-tailed test with null hypothesis θ = θ0 has P ≤ α if and only if a 100(1 − α)% confidence interval for θ does not contain θ0.
In summary, the test decides against H0 and for Ha at decision level α if and only if the confidence interval for the corresponding confidence level (1 − α expressed as a percent) does not contain the value of the parameter hypothesized by H0.
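We can check this duality on the n = 20 mouse example in Python, using the two-tailed P-value and the 95% confidence interval reported earlier in the text (the helper names are ours):

```python
def reject_by_test(p_two_tailed, alpha):
    """Decide against H0 when the two-tailed P-value is at most alpha."""
    return p_two_tailed <= alpha

def reject_by_interval(ci, theta0):
    """Decide against H0 when the confidence interval excludes theta0."""
    lo, hi = ci
    return not (lo <= theta0 <= hi)

# Numbers from the n = 20 mouse example: alpha = 0.05 goes with a 95% interval
p_two = 0.00579423
ci_95 = (5.127726, 26.281274)
assert reject_by_test(p_two, 0.05) == reject_by_interval(ci_95, 0.0)
```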
The logic of the preceding section can be turned around. We can use tests to make confidence intervals.
A 100(1 − α)% confidence interval for θ consists of those points θ0 such that the null hypothesis H0: θ = θ0 is not rejected at decision level α (meaning P > α) by the test of significance.
So tests of significance and confidence intervals come in pairs. We have met four such pairs: the four tests in our summary table above, and the corresponding confidence intervals covered on the confidence intervals page.
Our textbook (pp. 371–372) says statistical significance is not the same thing as practical significance. We mean the same thing. We're just emphasizing scientific inference.
If the null hypothesis is false, then statistics can prove that, given enough data. No matter how small the difference between the true (unknown) value of the parameter θ and the value θ0 hypothesized by the null hypothesis, this difference can be detected if the sample size is sufficiently large.
Because of the square root law, it may take a humongous sample size. If n = 10 can detect more often than not a difference θ − θ0 = 1, then n = 1000 can detect more often than not a difference θ − θ0 = 0.1, and then n = 100,000 can detect more often than not a difference θ − θ0 = 0.01, and so forth.
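The square root law in the preceding paragraph is simple arithmetic: the smallest effect detectable more often than not shrinks like 1 ⁄ √n. The function below is our own illustration, anchored at the example's n = 10 detecting an effect of 1.

```python
from math import sqrt

def detectable_effect(n, base_n=10, base_effect=1.0):
    """Detectable effect scales like 1/sqrt(n): multiplying n by 100
    divides the detectable effect by 10."""
    return base_effect * sqrt(base_n / n)

print(detectable_effect(1000))     # 0.1
print(detectable_effect(100000))   # 0.01
```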
Thus whether an effect θ − θ0 is statistically significant has nothing to do with how large it actually is. But any practical or scientific significance has very much to do with how large it actually is. Small effects may not matter.
In order to say something about the actual size of θ, use a confidence interval.
Tests of significance are about the issue of whether effects exist, not about how large they are.
Secondarily, they are about whether the sample size n is large enough to detect the actual effect if it exists. When P = 0.2, there is no point in further discussion of these data. They do not even show the effect exists. So never mind how large it may be if it exists.
The philosophical dogma about tests of significance is the following.
You do only one test. You choose the test before any data are collected. You describe the test to be done in the experimental protocol so there can be no argument about whether you followed the dogma.
This is a very strict rule, so strict that almost no one follows it.
Medical clinical trials do follow it, because if it is not followed, then no one can be sure there was no cheating.
There are, of course, many ways to cheat in statistics: outright fraud, mathematical mistakes, applying a procedure when its assumptions do not hold. Most of them are not peculiar to statistics.
But this section is about one form of cheating that is very widespread and applies only to tests of significance and confidence intervals. It may be more common than all other forms of cheating (about tests of significance) put together. It is so widespread that many scientists do not even know it is wrong, and many others, although they know it is wrong, don't think it is important.
It goes under many names, such as data snooping and data dredging. The technical term is multiple testing without correction. It is violation of the first part of the dogma: do only one test.
When the null hypothesis is true, a P-value expresses a certain probability. If we think of the test as making a decision at level α, then we have P ≤ α with probability α. This can be thought of as the probability of an erroneous announcement of statistical significance at level α. But it has that interpretation only when the dogma is followed, that is, when it is the only test done.
When many tests are done, the probability of erroneous announcement of statistical significance goes up as the number of tests goes up. Let Ai be the event that the i-th test of a sequence has P ≤ α even though the null hypothesis is true. Then

Pr(Ai) = α

and if this is the only test, then the probability of such an error is only α, a small number we have chosen.
But when k tests are done, the probability of erroneous announcement of statistical significance is, by the subadditivity rule,

Pr(A1 ∪ A2 ∪ ⋯ ∪ Ak) ≤ Pr(A1) + Pr(A2) + ⋯ + Pr(Ak) = kα

which will not be small when k is large.
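This inflation is easy to see by simulation. The sketch below is an illustration, relying on the fact that under a true null hypothesis a P-value based on a continuous test statistic is uniformly distributed on (0, 1); the numbers of tests and replications are chosen arbitrarily.

```python
# Estimate the chance of at least one P <= alpha among k independent
# tests of true null hypotheses, simulating P-values as Uniform(0, 1).
import random

def prob_any_significant(k, alpha=0.05, reps=50_000, seed=42):
    """Monte Carlo estimate of Pr(at least one of k tests has P <= alpha)."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(reps)
        if any(rng.random() <= alpha for _ in range(k))
    )
    return hits / reps

for k in (1, 10, 100):
    print(f"k = {k:3d}: estimated {prob_any_significant(k):.3f}, "
          f"bound k*alpha = {min(1.0, k * 0.05):.2f}")
```

With k = 10 the estimate is already about 0.40, nowhere near the nominal 0.05.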
From this we see that one way to correct for multiple testing is to multiply all P-values by the number of tests done. This is called Bonferroni correction. It is the only universally applicable correction, working for all multiple testing situations.
Advanced textbooks describe other corrections designed for particular situations and better than Bonferroni for those situations.
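In code, Bonferroni correction is a one-liner. The sketch below uses made-up P-values for illustration: each P-value is multiplied by the number of tests done, capped at 1 because a probability cannot exceed 1.

```python
def bonferroni(pvalues):
    """Multiply each P-value by the number of tests done, capping at 1."""
    k = len(pvalues)
    return [min(1.0, p * k) for p in pvalues]

raw = [0.003, 0.02, 0.2, 0.6]  # made-up P-values from k = 4 tests
print(bonferroni(raw))  # → [0.012, 0.08, 0.8, 1.0]
```

Only corrected P-values, not the raw ones, should be compared with the chosen level α.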
Two-tailed tests are a special case. A two-tailed test is just like
doing both one-tailed tests and getting excited when either says
statistically significant, and the correction for doing two tests
is to multiply the P-value by two: two tails is twice one tail.
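The relation can be checked directly. The sketch below assumes a test statistic with a standard normal null distribution (the observed value is chosen arbitrarily for illustration) and computes both P-values.

```python
# One-tailed versus two-tailed P-values for a z statistic, assuming
# a standard normal null distribution (illustration only).
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 2.1  # an arbitrary observed z statistic
one_tailed = 1.0 - phi(z)                # upper-tailed P-value
two_tailed = 2.0 * (1.0 - phi(abs(z)))   # Bonferroni with k = 2
print(f"one-tailed P = {one_tailed:.4f}, two-tailed P = {two_tailed:.4f}")
```

The two-tailed P-value is exactly twice the one-tailed P-value: two tails is twice one tail.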
So everyone who understands the issue agrees: Bonferroni correction (or some more complicated correction applicable to some special situation) is always required when multiple tests are done. But Bonferroni correction is rarely used by working scientists. Again, medical clinical trials are the exception. If multiple tests will be done, this is specified in the protocol, and the relevant correction for multiple testing is also prescribed.
The problem is that real science is not so neat and tidy as it seems from the outside. Most data are collected without any clear idea what they will show. Observational data are often collected by one group of scientists and analyzed by many other groups looking for many different effects. Unlike a medical clinical trial, there is no protocol that describes how the data are collected and every statistical analysis that will be done. Even in medicine, follow-up analyses, by definition, do not follow the protocol for the original analysis.
Hence scientists often do not know what they are looking for until they stumble over it. Necessarily, this involves many tests. If Bonferroni correction is rigorously applied, only very extreme results will seem significant. Interesting scientific findings might be missed.
But if Bonferroni correction is not rigorously applied, then P-values don't mean what they are supposed to mean, and no one can be sure what they mean. It's a conundrum.
One, perhaps somewhat artificial, distinction is between exploratory and confirmatory analyses. In an exploratory analysis, when one doesn't know what one is looking for until it is found, P-values must be taken cum grano salis. In a confirmatory analysis, one follows the dogma and does only one test, and the P-value means what it is supposed to mean.
The issue has no entirely satisfactory solution. It's enough that students and other users of statistics are aware of it.
It follows from the only-one-test dogma that the following idea,
which seems logical at first glance, and which seems to occur naturally
to many people who have been exposed to the material on this page
and the confidence interval page, is thoroughly
and completely wrong.
The idea is
You need certain assumptions satisfied, so first do a test to check whether those assumptions are satisfied. Only when your data passes the pretest do you apply the procedure (confidence interval or test of significance) that requires those assumptions.
A simple reason why this pretest-posttest procedure is wrong is that it does more than one test. If that does not satisfy you, consider the following more complicated explanation.
Points 1 and 2 are different ways of saying the same thing. Point 3 is the crucial issue. Doing a pretest raises theoretical issues that are far beyond our current knowledge to handle.
The textbook (pp. 454–455) recommends that you avoid two particular pretest-posttest combinations that many users think up. But it criticizes them only on particular grounds: either the pretest or the posttest is not a very good test, not robust or whatever.
Here we are making the much stronger point that all pretest-posttest combinations are bad unless a correction for multiple testing is done.
If you are worried that the assumptions for a procedure do not hold, then don't use the procedure, use a more robust one.
If you are afraid that the normality assumptions for a t test don't hold, then use the sign test. Do not first look at a stem and leaf plot and do a t test if the plot looks more or less normal and a sign test otherwise. You have no idea what the statistical properties of your pretest-posttest combination are.