Statistics 3011 (Geyer and Jones, Spring 2006) Examples:
Tests of Significance for One-Sample Location Problems

Tests of Significance

A Story

Tests of significance are very important. Most scientific research depends on them, but most examples in introductory textbooks are so silly that it is hard to see the point (and our textbook is like the rest in this respect). Thus we begin with an extended example that, though made up, is enough like real scientific problems to show the point.

Suppose a psychologist is studying learning by running mice through a maze. She has 20 mice each of which has been run through the same maze five times. The mice are all genetically identical from an inbred strain commonly used in laboratory experiments. We can think of them as a random sample from the population of all mice from the same strain raised under the same laboratory conditions.

All the data are in the URL

http://www.stat.umn.edu/geyer/3011/gdata/mouse.txt

These data are rather complicated — too complicated for us to analyze in their entirety now (this is a job for two-way analysis of variance, covered in the follow-on course, Stat 3022) — so we (and the hero of our story, the psychologist) consider something simpler.

Instead of considering all five runs, we consider only the difference between the first and last times through the maze. This difference, if positive, is taken to indicate learning. (If the first time is greater than the last time, the difference first minus last is positive: the mouse ran faster after practice.)

So let's look at the first minus last differences.

[Rweb external data entry form: enter the dataset URL above and submit to run the analysis.]

The mean is positive, so this indicates the average mouse got faster with practice. Or does it? Not all mice were faster. From the stem and leaf plot we see that five were actually slower (negative difference), and some of the rest were not much faster.
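For readers following along in R rather than Rweb, here is a sketch of what the form above does. The column names time1 and time5 are an assumption for illustration; check the header of mouse.txt for the actual names.

# sketch only: read the mouse data and look at the first-minus-last differences
# (the column names time1 and time5 are assumptions, not taken from the file)
mouse <- read.table("http://www.stat.umn.edu/geyer/3011/gdata/mouse.txt",
    header = TRUE)
diff15 <- mouse$time1 - mouse$time5   # first run minus last run
stem(diff15)                          # stem and leaf plot
mean(diff15)                          # sample mean, about 15.7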

A Confidence Interval

Let's try out the only statistical inference method we have already learned and construct a 95% Student t confidence interval for the unknown population mean μ (the mean learning effect for the whole population of mice of which these 20 are considered a random sample).

[Rweb external data entry form: enter the dataset URL above and submit to run the analysis.]

We copy the relevant printout below

95 percent confidence interval: 
  5.127726 26.281274  
sample estimates: 
mean of x  
  15.7045  

So the confidence interval says the population mean μ is probably positive, even if not every mouse is positive.
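The printout above is the kind produced by the R function t.test. A minimal call that reproduces it, assuming the 20 differences are in a vector named diff15 as in the sketch above, is

t.test(diff15)   # default: two-sided test and 95 percent confidence interval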

A Test of Significance

A test of significance, also called a statistical hypothesis test, is a slightly different twist on the same mathematics used in the confidence interval.

Instead of trying to estimate the difference due to learning, it focuses on a different question: whether there is any learning at all. Never mind the size of the learning effect. How can we be sure that there is any effect at all? There is a lot of variability in the maze running performance of mice. Some of the mice had negative differences in performance, running the maze slower after practice. Perhaps the apparent learning effect — positive sample mean — was due to chance. Perhaps if the experimenter redid the experiment with another 20 mice or even with the same 20 mice, the results would come out the other way.

Many people don't like this. They don't like to think their ideas may be wrong. Our hero, the experimental psychologist, is no different. She doesn't like this either. If she hadn't thought there was a learning effect, then she wouldn't have done the experiment. But in order for her work to have any value, she must convince other scientists of its validity. They aren't convinced she is right. (And when the shoe is on the other foot, she isn't convinced that other scientists are right. Scientists don't get convinced without good evidence.) Thus she is forced to consider how the skeptics will think. She is forced to consider that there may be no learning effect, in which case the true population mean μ is zero.

It is traditional in statistics to state this formally, a statement called the null hypothesis of the test of significance.

There is no learning effect and the differences in running times are entirely due to chance. The population mean difference μ is zero, and the sample mean difference x is different from zero only because of chance variation.

This says that the sample mean 15.7045, though apparently large, only appears to indicate a positive learning effect. The true learning effect μ is zero (that's what the null hypothesis assumes) and there is no learning.

The experimenter doesn't make this assumption because she thinks it is true. In fact, she doesn't like this assumption one bit, because if it were true it would mean that she has wasted a lot of time on a stupid experiment with no point — an experiment designed to study an effect that doesn't really exist.

The null hypothesis is only assumed (temporarily) in order to do a calculation that will prove it is false. As my father used to say, it is a straw man to knock down. The point is that making a precise assumption allows a precise probability calculation.

So we calculate. What is the probability of getting X as large or larger than 15.7045 when in fact μ = 0?

First we standardize. The random variable

t = (x − μ) ⁄ (s ⁄ √n)

has the Student t distribution with n − 1 degrees of freedom.

Plugging in the data for this problem we get

[Rweb external data entry form: enter the dataset URL above and submit to run the analysis.]

In summary

x = 15.7045
s = 22.59925
n = 20

So

t = (15.7045 − 0) ⁄ (22.59925 ⁄ √20) = 3.107743

Note that we use the value of the unknown parameter μ that is hypothesized under the null hypothesis (here zero) so we have everything we need to calculate t.
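The same arithmetic in R, assuming the differences are in the vector diff15 from the sketch above, looks like this.

xbar <- mean(diff15)                  # 15.7045
s <- sd(diff15)                       # 22.59925
n <- length(diff15)                   # 20
tstat <- (xbar - 0) / (s / sqrt(n))   # hypothesized mu is zero
tstat                                 # 3.107743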

Make the following assumptions: the data are an IID random sample from the population of interest, and the population distribution is exactly normal.

Then the distribution of t is exactly Student's t distribution for n − 1 degrees of freedom (here 19 degrees of freedom).

So our question comes down to the probability that t > 3.107743 when t has Student's t distribution for 19 d. f.

Answer (copied from the printout of the form above)

Rweb:> pt(3.107743, df = 19, lower.tail = FALSE) 
[1] 0.002897115 

P-values and Statistical Significance

This number is called the P-value of the test. The scientist will report something like

The learning effect (mean difference of first and last runs) was statistically significant (P = 0.003, upper-tailed t test).

This language is a widely used convention. It alludes to the following more detailed argument. If the null hypothesis (μ = 0) is true, then something very improbable has happened, the probability being 0.003. This is so improbable that we can dismiss the null hypothesis from further consideration. Hence the argument of the skeptics (who argue that μ = 0) has been defeated. We can say the learning effect is real.

As my mother used to say, there are two sides to every question. That applies to statistical tests. There is the side that wants to reject the null hypothesis (our hero, the experimental psychologist) and the side that wants to accept the null hypothesis (the skeptics).

If you are on the side that wants to reject the null hypothesis, then low P-values are good (for your side).

The lower the P-value, the stronger the evidence against the null hypothesis.

After all, P = 0.1 only means that if the null hypothesis is true then a one-chance-in-ten long shot has occurred, but 1 ⁄ 10 is hardly a low probability. If you repeat the experiment 10 times, you should see just such a long shot by chance variation even if the null hypothesis is true. A high P-value, like 0.3, is evidence against the null hypothesis so weak as to not be evidence at all.

So where is the borderline between strong evidence and weak evidence? People like clear guidelines. To cater to this hankering, introductory statistics books have for seventy years (about how long there have been introductory statistics books) promulgated the arbitrary dividing line of 0.05 (one chance in twenty). By this rule, if P < 0.05, then the result is statistically significant, otherwise not. The number 0.05 does have the virtue of being a round number when considered by humans. It is round because humans have five fingers. If we were intelligent computers, who count in binary, we would consider some power of two, perhaps 1 ⁄ 16 or 1 ⁄ 32, as the arbitrary dividing line between statistical significance and insignificance.

In the opinion of your instructors and your textbook author the arbitrary dividing line 0.05 or any other arbitrary dividing line has absolutely nothing to recommend it. It is nonsense. From the textbook, Basic Practice of Statistics by Moore, p. 371,

There is no sharp border between significant and insignificant, only increasingly strong evidence as the P-value decreases. There is no practical distinction between the P-values 0.049 and 0.051.

Or as I once put it more strongly, in language that a co-author cut from our paper as too argumentative

Anyone who thinks there is a scientifically meaningful distinction between P = 0.049 and P = 0.051 understands neither science nor statistics.

The reason we make this point so strongly is that you will find if you do much statistics that many scientists have just this view: that 0.05 is the universal borderline of statistical significance. That's what they learned when they took introductory statistics. They use it in their scientific papers and do not get complaints. Such is the strength of culture, even in science. Culture changes slowly.

Fortunately, the culture is changing. Most introductory statistics textbooks written in the past ten years agree with us.

Just give the exact P-value. Let readers decide whether it is statistically significant or not. They will anyway.

Back in the stone age, before everyone had a desktop or laptop computer with statistics software installed (or at least a web browser to use Rweb), it was hard to get P-values. Tables in books were inadequate.

Suppose we want to calculate the P-value for the test in our example above (recall it is P = 0.002897115). We need to calculate

P(t > 3.107743)

where t has a Student t distribution with n − 1 = 19 degrees of freedom.

Now we get to use the header at the top of Table C in the textbook, which gives "upper tail probability". We need to do a backwards lookup in this table: find t in the appropriate row (df = 19), and read off the upper tail area from the header.

df    .25   .20   .15   .10   .05   .025  .02   .01   .005 .0025  .001 .0005
19 | 0.688 0.861 1.066 1.328 1.729 2.093 2.205 2.539 2.861 3.174 3.579 3.883

The problem is that we don't have many numbers to work with. All we can say is that since t = 3.107743 falls between the table entries 2.861 and 3.174, the P-value must be between the two upper tail probabilities at the heads of those columns, that is, 0.0025 < P < 0.005.
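You can check this bracketing in R: by construction of the table, 2.861 and 3.174 are the 0.005 and 0.0025 upper-tail critical values for 19 degrees of freedom.

pt(c(2.861, 3.174), df = 19, lower.tail = FALSE)   # approximately 0.005 and 0.0025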

And if we had used an older book with a smaller table, we might have only been able to say, perhaps P < 0.01 or even only P < 0.05. Some older, computer-illiterate scientists still think this is a reasonable way to report a P-value — sloppily. It's really absurd in this day and age.

One-Tailed and Two-Tailed Tests

This section could also be called the formal theory of tests of significance. At least this is as formal as we are going to get.

A test of significance involves three things.

The null hypothesis, we have already met. In the example about mice it was μ = 0. A null hypothesis is always a statement that specifies the value of some parameter or parameters.

The test statistic, we have already met, although we did not give it a specific name before. In the example about mice it was

t = (x − μ) ⁄ (s ⁄ √n)

As it stands, this is not a statistic, because it depends on a parameter μ. It becomes a statistic when we plug in the value of μ specified under the null hypothesis. Then, as required, when the null hypothesis is true, the distribution of this test statistic is completely known: the Student t distribution with n − 1 degrees of freedom.

The null hypothesis always makes the test statistic a statistic (not depending on unknown parameter values) and completely determines the sampling distribution of the test statistic. This is the distribution of the test statistic when the null hypothesis is true.

Finally, we choose the tail or tails to test. Many tests come in three kinds. If the distribution of the test statistic t when the null hypothesis is true is symmetric about zero, then these are

Kinds of Tests
tail type  | P-value
lower tail | P(T ≤ t)
upper tail | P(T ≥ t)
two tail   | P(|T| ≥ |t|)

In the table above t is the observed value of the test statistic (calculated from the data), and T is a random variable having the distribution of the test statistic when the null hypothesis is true. This uses the capital letters for random variables and lower case letters for numbers convention. When we think of the test statistic as a random variable (because it is calculated from a random sample), then it is T. When we think of the test statistic as a number (like the number 3.107743 for the data in our example), then it is t.

Often tail type is associated with an alternative hypothesis that indicates the tail or tails favored by the test — the tail or tails that contribute to the P-value.

The null hypothesis is often denoted H0 and the alternative hypothesis Ha. With this notation, we can expand our table.

Kinds of Tests and Hypotheses
tail type  | P-value      | H0    | Ha
lower tail | P(T ≤ t)     | μ = 0 | μ < 0
upper tail | P(T ≥ t)     | μ = 0 | μ > 0
two tail   | P(|T| ≥ |t|) | μ = 0 | μ ≠ 0

The formula for a two-tailed P-value looks complicated, but for all of the two-tailed tests we will meet in this course, it is actually very simple.

When the distribution of the test statistic when H0 is true is symmetric, then the two-tailed P-value is twice the smaller of the one-tailed P-values.

In short, two tails is twice one tail.

So the example above, which had P = 0.002897115 one-tailed, would have P = 2 × 0.002897115 = 0.00579423 two-tailed. As before, our scientist would round this in her report

The learning effect (mean difference of first and last runs) was statistically significant (P = 0.006, two-tailed t test).
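In R the two-tailed P-value is just twice the upper-tail probability computed before.

2 * pt(3.107743, df = 19, lower.tail = FALSE)   # about 0.0058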

Notice that the two-tailed P-value is twice as large, so the evidence against the null hypothesis looks a bit weaker.

So which should the scientist report, the one-tailed test or the two-tailed test?

The issue is what the scientist would have done if the data had come out in the other tail. What would our experimental psychologist have done if the 20 mice in her experiment had run slower with practice?

Tests of Significance for the Population Mean

Based on the Normal Distribution

These are the test procedures related to confidence intervals for the population mean based on the normal distribution.

As with the confidence intervals there are two kinds, exact and approximate (large n).

First, the exact procedure, which assumes the population distribution is exactly normal and the population standard deviation σ is known.

Second, the approximate procedure, which assumes only that the sample size n is large and uses the sample standard deviation s in place of the unknown σ.

In either case the P-values are calculated in the same way and the null and alternative hypotheses are stated the same way.

Kinds of Tests and Hypotheses
tail type  | P-value      | H0     | Ha
lower tail | P(Z ≤ z)     | μ = μ0 | μ < μ0
upper tail | P(Z ≥ z)     | μ = μ0 | μ > μ0
two tail   | P(|Z| ≥ |z|) | μ = μ0 | μ ≠ μ0

For an example of this procedure we use the data in the URL

http://www.stat.umn.edu/geyer/3011/gdata/moremouse.txt

which are just like the data in the example about mice but with larger sample size (in fact the first 20 rows of this data set are the entirety of the data set for the earlier example). Now the sample size is 50, which makes the normal approximation reasonable.

As in the other example, we do an upper-tailed test.

By Computer

[Rweb external data entry form: enter the dataset URL above and submit to run the analysis.]
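The form runs something along the following lines (a sketch only; we assume the 50 first-minus-last differences have already been computed into a vector diff15, as in the t.test example below).

xbar <- mean(diff15)                  # 7.945
s <- sd(diff15)                       # 30.087
n <- length(diff15)                   # 50
z <- (xbar - 0) / (s / sqrt(n))       # test statistic, about 1.87
pnorm(z, lower.tail = FALSE)          # upper-tailed P-value, about 0.0309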

Interpretation

Now the P-value (P = 0.0309, upper-tailed z test) provides much weaker evidence against the null hypothesis than the data set with only 20 mice. It is still statistically significant by the conventional 0.05 borderline, and by anyone's standard it still provides moderate evidence against the null hypothesis.

Notice that if a two-tailed test had been done, the P-value would be twice the one-tailed value (P = 0.0619, two-tailed z test), and this would not be statistically significant by the conventional 0.05 borderline, although it would still provide moderate evidence against the null hypothesis.

Other Alternatives

If we had wanted to do a lower-tailed test, the P-value would have been calculated the same way except for the last command, which would become pnorm(z).

If we had wanted to do a two-tailed test, the P-value would be twice the P from the upper-tailed test or the lower-tailed test, whichever was smaller.

By Hand

The statistics for the data set analyzed above are

x = 7.945
s = 30.087
n = 50

The null and alternative hypotheses are

H0: μ = 0
Ha: μ > 0

indicating an upper-tailed test with μ0 = 0.

Thus the test statistic is

z = (x − μ0) ⁄ (s ⁄ √n) = (7.945 − 0) ⁄ (30.087 ⁄ √50) = 1.87

Then the P-value is the area to the right of z under the standard normal density curve (Table A). The area to the left of z is found in the following row of the table

         .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
  1.8 | .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706

The area to the right of z is then found by applying the complement rule: 1 − 0.9693 = 0.0307.
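The same lookup and complement done in R:

pnorm(1.87, lower.tail = FALSE)   # about 0.0307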

This agrees with the answer we got by computer except for some rounding errors.

Based on the Student T Distribution

These are the test procedures related to confidence intervals for the population mean based on the Student t distribution.

The P-values are calculated in the same way and the null and alternative hypotheses are stated in the same way.

Kinds of Tests and Hypotheses
tail type  | P-value      | H0     | Ha
lower tail | P(T ≤ t)     | μ = μ0 | μ < μ0
upper tail | P(T ≥ t)     | μ = μ0 | μ > μ0
two tail   | P(|T| ≥ |t|) | μ = μ0 | μ ≠ μ0

By Computer

Here we use the R function t.test (on-line help) to do the test.

[Rweb external data entry form: enter the dataset URL above and submit to run the analysis.]

This form, when run, produces the following printout.

Rweb:> t.test(diff15, alternative = "greater") 
 
	One Sample t-test 
 
data:  diff15  
t = 1.8673, df = 49, p-value = 0.03393
alternative hypothesis: true mean is greater than 0  
95 percent confidence interval: 
 0.8114801       Inf  
sample estimates: 
mean of x  
    7.945  

Note both the test statistic and the P-value in the printout.

Notice that the test statistic (1.8673) is the same as that for the z test (we just call it t now instead of z).

The P-value 0.0339 is a little higher than that for the z test 0.0309 because the Student t distribution with 49 degrees of freedom has slightly heavier tails than the standard normal distribution.

Interpretation

The learning effect (mean difference of first and last runs) was statistically significant (P = 0.0339, upper-tailed t test).

Other Alternatives

The alternative argument to the t.test function (on-line help) has possible values "two.sided", which is the default, "less", and "greater".

To do an upper-tailed test, you must say, as in the example above, alternative = "greater".

To do a lower-tailed test, you must say alternative = "less".

To do a two-tailed test, you may say alternative = "two.sided", but can also omit the alternative argument entirely.

Other Hypothesized Values of the Population Mean

To specify a different μ0 use the optional argument mu. For example,

t.test(fred, mu = 6, alternative = "greater")

tests the hypotheses

H0: μ = 6
Ha: μ > 6

for the dataset fred.

By Hand

As mentioned in the by computer section, the test statistic is the same whether we are doing t or z

t = (x − μ0) ⁄ (s ⁄ √n) = 1.87

We need to look this up in the Student t distribution table (Table C) for n − 1 = 49 degrees of freedom. The table only has rows for 40 and 50 degrees of freedom.

df |  .25   .20   .15   .10   .05   .025  .02   .01   .005 .0025  .001 .0005
40 | 0.681 0.851 1.050 1.303 1.684 2.021 2.123 2.423 2.704 2.971 3.307 3.551
50 | 0.679 0.849 1.047 1.299 1.676 2.009 2.109 2.403 2.678 2.937 3.261 3.496

But using either row we see that 0.05 > P > 0.025 because the header gives upper tail areas, which is what we want for an upper-tailed test.

Because the numbers for 49 degrees of freedom, if we had them, would be between those for 40 and 50, we conclude that 0.05 > P > 0.025 for 49 degrees of freedom too, but that is the most we can discover from Table C.

This is most unsatisfactory. O. K. for homework, but not acceptable for real data. Use the computer in real life.
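In real life the computer gives the exact value in one line.

pt(1.87, df = 49, lower.tail = FALSE)   # about 0.034, compare 0.0339 from the printout above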

A Nonparametric Test of Significance for the Population Median

To a certain extent, if you've seen one test of significance, you've seen 'em all. There is always a null hypothesis, a test statistic, the sampling distribution of the test statistic when the null hypothesis is true, and a P-value computed from the tail or tails of that distribution.

The details change, but the big picture remains the same. (Readers should go back and make sure they know what each of these terms is for the z and t tests described above.)

Now we look at a so-called nonparametric test of significance, called the sign test. When we discuss the relationship between tests and confidence intervals below, we will see that this is the test dual to the nonparametric confidence interval for the population median presented on our confidence intervals page.

We start with its assumptions. We assume our sample X1, … Xn is IID from some continuous population distribution. The population distribution can have any shape whatsoever. The only point of the continuity assumption is to rule out ties. All of the Xi will be different from each other, and all will be different from the population median θ.

By definition

P(Xi < θ) = P(Xi > θ) = 1 ⁄ 2

The null hypothesis specifies a value of the population median

H0: θ = θ0

As with the t and z tests, the hypothesized value θ0 is usually zero, but we allow other choices.

The test statistic Y is the number of Xi that are greater than θ0.

The sampling distribution of the test statistic when the null hypothesis is true is binomial. The events Xi > θ0 are independent and all have the same success probability 1 ⁄ 2. Thus the distribution of the test statistic Y is binomial with sample size n and success probability 1 ⁄ 2.

This distribution is symmetric, but in a slight difference from the t and z distributions, it is not symmetric about zero. The possible values of Y range from zero to n and the distribution is symmetric about the point n ⁄ 2.

Thus Y and n − Y have the same distribution, which is slightly different from the other test statistics (T and − T have the same distribution, as do Z and − Z). This leads to slight differences in the formulas for P-values.

Kinds of Tests and Hypotheses
tail type  | P-value                  | H0     | Ha
lower tail | P(Y ≤ y)                 | θ = θ0 | θ < θ0
upper tail | P(Y ≥ y)                 | θ = θ0 | θ > θ0
two tail   | 2 P(Y ≤ y) or 2 P(Y ≥ y) | θ = θ0 | θ ≠ θ0

In the last line of the table, 2 P(Y ≤ y) or 2 P(Y ≥ y) for the P-value means whichever of these can be a P-value is the P-value. One of them will be between zero and one and hence is the P-value; the other will be greater than one and hence is not.

Note that we still have the two tails is twice one tail property.

Example

We redo the smaller mouse data, which has n = 20.

The value of the parameter (the population median θ) hypothesized by H0 is zero.

The test statistic we can easily calculate in our heads from the stem and leaf plot made by the first form on this page. There are 15 differences (first minus last) above zero, so the observed value of the test statistic is y = 15.
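The same count in R, assuming the 20 differences are in the vector diff15:

y <- sum(diff15 > 0)   # number of positive differences
y                      # 15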

To calculate the P-value for an upper-tailed test, we need to calculate P(Y ≥ 15) when Y has the binomial distribution with sample size 20 and success probability 1 ⁄ 2.

Having no tables of the binomial distribution, we use the computer.

The results

Rweb:> pbinom(14, 20, 1 / 2, lower.tail = FALSE) 
[1] 0.02069473

and our scientist reports

The learning effect (mean difference of first and last runs) was statistically significant (P = 0.02, upper-tailed sign test).

Some readers should now be asking where the 14 came from. This is explained in the section on how to shoot yourself in the foot on the web page about the binomial distribution. Simply put,

P(Y ≥ 15) = 1 − P(Y < 15) = 1 − P(Y ≤ 14)

and the last line is what pbinom(14, 20, 1 / 2, lower.tail = FALSE) calculates.

One must be very careful with probabilities of tail areas for discrete distributions.

In general, suppose we have sample size n and observed value of the test statistic y already defined as R objects. To calculate the P-value for an upper-tailed sign test do

pbinom(y - 1, n, 1 / 2, lower.tail = FALSE)

To calculate the P-value for a lower-tailed sign test do

pbinom(y, n, 1 / 2)

To calculate the P-value for a two-tailed sign test do

2 * pbinom(y, n, 1 / 2)
2 * pbinom(y - 1, n, 1 / 2, lower.tail = FALSE)

and take the one of these that is between zero and one.

The fact that we use y for lower-tailed and y - 1 for upper-tailed is confusing, but is just the way lower.tail = FALSE works (there is a logical reason why it works this way, but not a logic that applies to tests of significance).
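Here is a minimal sketch that wraps these recipes into one function. The name sign.test and its arguments are ours for illustration, not a standard R function. (R's built-in binom.test does the same binomial calculation, for example binom.test(15, 20, alternative = "greater").)

sign.test <- function(x, mu0 = 0,
                      alternative = c("two.sided", "less", "greater"))
{
    alternative <- match.arg(alternative)
    x <- x[x != mu0]              # continuity rules out ties, but drop any anyway
    n <- length(x)
    y <- sum(x > mu0)             # test statistic: number of x above mu0
    p.lower <- pbinom(y, n, 1 / 2)
    p.upper <- pbinom(y - 1, n, 1 / 2, lower.tail = FALSE)
    switch(alternative,
        less = p.lower,
        greater = p.upper,
        two.sided = min(1, 2 * min(p.lower, p.upper)))   # two tails is twice one tail
}
sign.test(diff15, alternative = "greater")   # about 0.02 if diff15 holds the 20 differences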

Comparison of Tests of Significance for One-Sample Location Problems

We have now considered four tests.

Summary of Tests
test | exact | parameter            | assumptions                | robust
z    | yes   | population mean μ    | population normal, σ known | no
z    | no    | population mean μ    | n large                    | yes
t    | yes   | population mean μ    | population normal          | no
sign | yes   | population median θ  | none                       | yes

Our table needs a few comments.

We should mention for completeness that there is yet another competitor that we have not covered. It is covered in Stat 5601 and other courses with nonparametrics in the title. It is also covered in our textbook in supplementary Chapter 23 found on the cdrom inside the back cover. It would get a row in our table

One Additional Line for Summary of Tests
test     | exact | parameter                               | assumptions          | robust
signrank | yes   | population center of symmetry (= μ = θ) | population symmetric | yes

It is in between on assumptions and robustness. Outliers don't bother it, but skewness does. We say no more about it. (It won't be on the exam.)

So if the sign test is so great and the other tests so bad, why not always use the sign test? It's a sad fact of life that you don't get something for nothing. The robustness of the sign test comes with a cost. If the population really is normal, then the t test (or the z if σ is known) will work better, and will give lower P-values.

We saw this in our examples: P = 0.003, upper-tailed t test, but P = 0.02, upper-tailed sign test. The sign test P-value is about seven times larger. The data look a lot more statistically significant when we use the t test.

That was for n = 20. If we reanalyze the n = 50 data, we get a similar picture: P = 0.034, upper-tailed t test, but P = 0.016, upper-tailed sign test. Oops! No, we don't get a similar picture. The sign test looks more significant.

All we can say is that, in general, when the population is exactly normally distributed, the t test will usually but not always give a lower P-value than the sign test.

On the other hand, when the population is not exactly normally distributed and not even close (when you clearly see skewness or outliers in a stem and leaf plot, for example), then the exact t and z procedures are not exact at all. They shouldn't be used, and if they are used even though they shouldn't, they will do a much worse job than the sign test.

Tests and Decisions

Sometimes the purpose of a test of significance is to make a decision.

As we described the use of tests of significance in scientific inference, there are no decisions. Scientist does experiment, scientist writes up experiment reporting statistically significant results with some P-value, readers make of it what they will, considering not only the P-value but also everything else relevant: their prior opinions on the subject, what other data in other papers say, theoretical arguments, details of the experimental procedures and possible experimental errors and biases, and so forth. The P-value is only a small part (though an important part) of what it takes to convince scientists of the conclusions of the scientist's write-up.

In industrial quality control, if we take a sample of material coming off the production line to the lab for testing, then the purpose is to make a decision between two actions: decide the process is working properly and continue production, or decide there is a quality problem and stop production to find and fix it.

Either decision is costly if mistaken. If the first decision is mistaken, we produce defective product and ship it to customers, resulting in costs of rework, support, and lost future sales. If the second decision is mistaken, we waste time looking for a problem that doesn't exist.

Tests of significance are suited for making these decisions if the two decisions above can be made to correspond to the null and alternative hypotheses of a test. Then low P-values are evidence in favor of Ha, which in this context has the interpretation quality problem, stop production, and high P-values are evidence in favor of H0, which in this context has the interpretation no problem, continue production.

Hence we should choose a dividing line α and decide as follows

Tests and Decisions
result | hypothesis chosen | interpretation
P ≤ α  | Ha                | quality problem, stop production
P > α  | H0                | no problem, continue production

The same logic applies to any other situation where decisions can be made to correspond to null and alternative hypotheses of a test. The first two columns of our table above stay the same. The interpretation must be adjusted to the new situation.

When one uses tests of significance to make decisions, it is important that α be chosen carefully to minimize costs and maximize benefits (negative costs) on average. This involves calculations too complicated to cover in this course.

A carefully chosen α will never be a round number like 0.05. When you see someone using α = 0.05, you know they aren't making careful decisions. Usually, they aren't making actual decisions at all. The language of decisions is only being used because they think they are supposed to always use it to talk about tests of significance (since most introductory statistics books, until recently, emphasized decisions).

Duality of Tests and Confidence Intervals

You can tell someone overly influenced by the decision picture of tests of significance. They report their results like

The learning effect (mean difference of first and last runs) was statistically significant (P ≤ 0.05, upper-tailed t test).

That is they say P ≤ 0.05 when they could have said P = 0.003.

In real science, this is a terrible idea, a pernicious influence of bad statistics teaching.

There is a theoretical virtue to thinking like this. If it only matters whether P ≤ α or P > α, then the relationship between tests (considered this way) and confidence intervals is very simple.

Tests from Confidence Intervals

A test with α = 0.05 goes with a 95% confidence interval, because 95% is 1 − α expressed as a percent.

A two-tailed test with null hypothesis θ = θ0 for any parameter θ has P ≤ 0.05 if and only if a 95% confidence interval for θ does not contain θ0.

A two-tailed test with null hypothesis θ = θ0 has P ≤ α if and only if a 100 (1 − α)% confidence interval for θ does not contain θ0.

In summary, the test decides against H0 and for Ha at decision level α if and only if the confidence interval for the corresponding confidence level (1 − α expressed as a percent) does not contain the value of the parameter hypothesized by H0.
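For the 20-mouse data, assuming the differences are in the vector diff15, the duality is easy to check directly.

out <- t.test(diff15, mu = 0)   # two-tailed t test of H0: mu = 0
out$conf.int                    # 95 percent interval, does not contain 0
out$p.value                     # and correspondingly P is well below 0.05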

Confidence Intervals from Tests

The logic of the preceding section can be turned around. We can use tests to make confidence intervals.

A 100 (1 − α)% confidence interval for θ consists of those points θ0 such that the null hypothesis H0: θ = θ0 is not rejected at decision level α (meaning P > α) by the test of significance.
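A sketch of this inversion for the t procedure and the 20-mouse data: collect every hypothesized value μ0 that the two-tailed test does not reject at α = 0.05 (grid resolution is an arbitrary choice here).

mu0 <- seq(0, 30, by = 0.01)    # grid of hypothesized values to test
keep <- sapply(mu0, function(m) t.test(diff15, mu = m)$p.value > 0.05)
range(mu0[keep])                # roughly 5.13 to 26.28, the 95 percent interval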

Summary

So tests of significance and confidence intervals come in dual pairs.

We have met four such pairs: the four tests in our summary table above, and the corresponding confidence intervals covered on the confidence intervals page.

Philosophy

Statistical Significance ≠ Scientific Significance

Our textbook (pp. 371–372) says statistical significance is not the same thing as practical significance. We mean the same thing. We're just emphasizing scientific inference.

If the null hypothesis is false, then statistics can prove that. No matter how small the difference between the true (unknown) value of the parameter θ and the value θ0 hypothesized by the null hypothesis, this difference can be detected if the sample size is sufficiently large.

Because of the square root law, it may take a humongous sample size. If n = 10 can detect more often than not a difference θ − θ0 = 1, then n = 1000 can detect more often than not a difference θ − θ0 = 0.1, and then n = 100,000 can detect more often than not a difference θ − θ0 = 0.01, and so forth.
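A quick check of this claim using R's power.t.test function (a one-sample test with σ = 1 is an assumption made for the illustration): as n goes up by a factor of 100 and the difference shrinks by a factor of 10, the chance of detection stays roughly the same.

sapply(c(10, 1000, 100000), function(n)
    power.t.test(n = n, delta = 1 / sqrt(n / 10), sd = 1,
                 type = "one.sample")$power)
# all three powers are well above one half and roughly comparable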

Thus whether an effect θ − θ0 is statistically significant has nothing to do with how large it actually is. But any practical or scientific significance has very much to do with how large it actually is. Small effects may not matter.

In order to say something about the actual size of θ use a confidence interval.

Tests of significance are about the issue of whether effects exist, not about how large they are.

Secondarily, they are about whether the sample size n is large enough to detect the actual effect if it exists. When P = 0.2, there is no point in further discussion of these data. They do not even show the effect exists. So never mind how large it may be if it exists.

Data Snooping and other Cheating

The philosophical dogma about tests of significance is the following.

You do only one test. You choose the test before any data are collected. You describe the test to be done in the experimental protocol so there can be no argument about whether you followed the dogma.

This is a very strict rule, so strict that almost no one follows it.

Medical clinical trials do follow it, because if it is not followed, then no one can be sure there was no cheating.

There are, of course, many ways to cheat in statistics: outright fraud, mathematical mistakes, applying a procedure when its assumptions do not hold. Most of them are not peculiar to statistics.

But this section is about one form of cheating that is very widespread and applies only to tests of significance and confidence intervals. It may be more common than all other forms of cheating (about tests of significance) put together. It is so widespread that many scientists do not even know it is wrong, and many others, although they know it is wrong, don't think it is important.

It goes under many names, such as data snooping and data dredging. The technical term is multiple testing without correction. It is violation of the first part of the dogma: do only one test.

When the null hypothesis is true, a P-value expresses a certain probability. If we think of the test as making a decision at level α, then we have P ≤ α with probability α. This can be thought of as the probability of an erroneous announcement of statistical significance at level α. But it only has that interpretation when the dogma is followed, when it is the only test done.

When many tests are done, the probability of erroneous announcement of statistical significance goes up as the number of tests goes up. Let Ai be the event that the i-th test of a sequence has P ≤ α even though the null hypothesis is true. Then

P(Ai) = α

and if this is the only test, then the probability of such an error is only α, a small number we have chosen.

But when k tests are done, the probability of erroneous announcement of statistical significance is by the subadditivity rule

P(A1 or A2 or … or Ak) ≤ k α

which will not be small when k is large.

From this we see that one way to correct for multiple testing is to multiply all P-values by the number of tests done. This is called Bonferroni correction. It is the only universally applicable correction, working for all multiple testing situations.

Advanced textbooks describe other corrections designed for particular situations and better than Bonferroni for those situations.
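In R, Bonferroni correction (and several of the fancier corrections the advanced textbooks describe) is available through the p.adjust function. The P-values below are made up purely for illustration.

pvals <- c(0.003, 0.020, 0.040, 0.200)     # made-up P-values from k = 4 tests
p.adjust(pvals, method = "bonferroni")     # each multiplied by 4, capped at 1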

Two-tailed tests are a special case. A two-tailed test is just like doing both one-tailed tests and getting excited when either says statistically significant, and the correction for doing two tests is to multiply the P-value by two: two tails is twice one tail.

So everyone who understands the issue agrees that Bonferroni correction (or some more complicated correction applicable to some special situation) is always required when multiple tests are done, but Bonferroni correction is rarely used by working scientists. Again, medical clinical trials are the exception. If multiple tests will be done, this is specified in the protocol, and the relevant correction for multiple testing is also prescribed.

The problem is that real science is not so neat and tidy as it seems from the outside. Most data are collected without any clear idea what they will show. Observational data are often collected by one group of scientists and analyzed by many other groups looking for many different effects. Unlike a medical clinical trial, there is no protocol that describes how the data are collected and every statistical analysis that will be done. Even in medicine, follow-up analyses, by definition, do not follow the protocol for the original analysis.

Hence scientists often do not know what they are looking for until they stumble over it. Necessarily, this involves many tests. If Bonferroni correction is rigorously applied, only very extreme results will seem significant. Interesting scientific findings might be missed.

But if Bonferroni correction is not rigorously applied, then P-values don't mean what they are supposed to mean, and no one can be sure what they mean. It's a conundrum.

One, perhaps somewhat artificial, distinction is between exploratory and confirmatory analyses. In an exploratory analysis, when one doesn't know what one is looking for until it is found, P-values must be taken cum grano salis. In a confirmatory analysis, one follows the dogma and does only one test, and the P-value means what it is supposed to mean.

The issue has no entirely satisfactory solution. It's enough that students and other users of statistics are aware of it.

Pretests Not Recommended

It follows from the only one test dogma that the following idea, which seems logical at first glance, and which seems to occur naturally to many people who have been exposed to the material on this page and the confidence interval page, is thoroughly and completely wrong.

The idea is

You need certain assumptions satisfied, so first do a test to check whether those assumptions are satisfied. Only when your data passes the pretest do you apply the procedure (confidence interval or test of significance) that requires those assumptions.

A simple reason why this pretest-posttest procedure is wrong is that it does more than one test. If that does not satisfy you, consider the following more complicated explanation.

  1. The pretest is not guaranteed to make the correct decision.
  2. The decision made by the pretest is random.
  3. The pretest-posttest combination does not have the properties claimed for the posttest alone, even if its required assumptions are true, because the pretest changes sampling distributions.

Points 1 and 2 are different ways of saying the same thing. Point 3 is the crucial issue. Doing a pretest raises theoretical issues that are far beyond our current knowledge to handle.

The textbook (pp. 454–455) recommends that you avoid two particular pretest-posttest combinations that many users think up. But it criticizes them only on particular grounds: either the pretest or the posttest is not a very good test, not robust or whatever.

Here we are making the much stronger point that all pretest-posttest combinations are bad unless a correction for multiple testing is done.

If you are worried that the assumptions for a procedure do not hold, then don't use the procedure, use a more robust one.

If you are afraid that the normality assumptions for a t test don't hold, then use the sign test. Do not first look at a stem and leaf plot and do a t test if the plot looks more or less normal and a sign test otherwise. You have no idea what the statistical properties of your pretest-posttest combination are.