University of Minnesota, Twin Cities     School of Statistics     Stat 3011     Rweb     Textbook (Wild and Seber)

Stat 3011 (Geyer) In-Class Examples (More on Proportions)

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions (that's already done for you). You do not have to select a dataset (that's already done for you).

What this Page is About

Section 8.5 of Wild and Seber makes a valiant attempt to clarify some extremely important issues that other introductory statistics books generally ignore, and the authors deserve cheers for it. But last year's experience was that more was needed than just those few pages. Hence this page.

The Square Root Law and Subgroups

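Statistical accuracy varies as the square root of the sample size (sample sizes when two samples are involved). That is, the margin of error is proportional to one over the square root of the sample size, so quadrupling the sample size only cuts the margin of error in half.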

What the sample size of a poll is depends on what you are talking about. If a proportion is for a subgroup, then the relevant sample size is the size of the subgroup, not the whole sample.

Wild and Seber mention this on p. 350, but it needs more emphasis.

If the margin of error of a poll is reported as 3%, then that margin of error only applies to questions involving the whole sample.

For questions involving a subgroup, you must multiply the margin of error by the square root of the ratio of the whole sample size to the subgroup size. For example, if the margin of error is reported as 3% and a question involves a subgroup that is only one-tenth of the sample, then the margin of error must be multiplied by sqrt(10) = 3.162. So the relevant margin of error for proportions in this subgroup is 9.5% rather than 3%.
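The subgroup adjustment is simple enough to check by machine. Here is a minimal Python sketch (not the Rweb code used elsewhere on this page). The function name `subgroup_margin` and the whole sample size of 1000 are assumptions made for illustration; only the ratio of the two sizes matters.

```python
import math

def subgroup_margin(reported_margin, whole_n, subgroup_n):
    """Scale a whole-sample margin of error up to a subgroup.

    The multiplier is sqrt(whole sample size / subgroup size).
    """
    return reported_margin * math.sqrt(whole_n / subgroup_n)

# Reported margin of error 3%, subgroup is one-tenth of the sample.
# The sizes 1000 and 100 are hypothetical; any 10-to-1 ratio gives the same answer.
adjusted = subgroup_margin(0.03, 1000, 100)
print(round(adjusted, 3))  # prints 0.095
```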

Three Kinds of Questions about Differences of Proportions

In Figure 8.5.1 and Table 8.5.5 Wild and Seber make a valiant attempt to clarify a rather confusing situation. Past experience says their discussion is clear as mud. So here's my try.

There are three situations, labeled (a), (b), and (c) in Table 8.5.5 in Wild and Seber. All we need is to be able to identify the situations as they arise so we can use the correct formula. Here's my description.

  1. A difference of proportions involving two different polls.

    Each poll is done independently of the other, so that's why Wild and Seber call this "two independent samples."

    Note that this is the case we have known about since the end of Chapter 7. So far it is the only interval for difference of proportions that we have studied. It is also the only kind of interval the R function prop.test knows about.

    The following two kinds of intervals are completely new. We have never seen them before.

  2. A difference of proportions involving the same question on the same poll.

    Of course, if there's only one poll, there is only one sample size. The label "several response categories" vaguely refers to the fact that we are comparing different answers (response categories) to the same question.

  3. A difference of proportions involving different questions on the same poll.

    Of course, if there's only one poll, there is only one sample size. The label "many yes/no items" is rather misleading. Whether the questions are yes/no or multiple guess is irrelevant. The point is that we are comparing answers to different questions.

Examples

Example of Type (a)

Suppose in two polls a month apart 50% of likely voters said they favored Jones in the first poll and only 45% in the second poll. Both polls had a sample size of 1000. What is a 2 se interval for the difference?

The result from Rweb is (0.005, 0.095). Since this interval only contains positive numbers, it looks like the true population proportion has decreased in the month between the two polls.
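For anyone who wants to see the arithmetic, here is a minimal Python sketch of the type (a) calculation (not the Rweb code itself). It uses the usual standard error for two independent samples, sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2); the function name is made up for illustration.

```python
import math

def two_se_interval_independent(p1, n1, p2, n2):
    """2 se interval for p1 - p2 when the proportions come from two independent samples."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - 2 * se, diff + 2 * se

# Jones: 50% in the first poll, 45% in the second, 1000 respondents each.
low, high = two_se_interval_independent(0.50, 1000, 0.45, 1000)
print(round(low, 3), round(high, 3))  # prints 0.005 0.095
```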

Note: this is case (a) in Wild and Seber's classification, the one we have long known how to do.

Another note: if the proportions involve subgroups, then n1 and n2 are the subgroup sizes, not the whole sample sizes! (See also the first section of this page and the last section of this page.)

Example of Type (b)

Suppose in one poll 50% of likely voters said they favored Jones, 45% said they favored Smith and 5% were undecided. The poll had a sample size of 1000. What is a 2 se interval for the difference between Jones and Smith?

The result from Rweb is (-0.012, 0.112).

Note that this is very different from (quite a bit wider than) the interval for the type (a) example, although everything is the same except the type. The sample sizes are all the same (1000). The sample proportions p1 and p2 are the same. The only thing different is the formula for the standard error of p1 - p2, which is very different for type (a) and type (b) problems.
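Here is a minimal Python sketch of the type (b) calculation (not the Rweb code itself). It uses the usual multinomial standard error for two answers to the same question, sqrt((p1 + p2 - (p1 - p2)^2) / n), which I assume is what Table 8.5.5(b) gives; the function name is made up for illustration.

```python
import math

def two_se_interval_same_question(p1, p2, n):
    """2 se interval for p1 - p2 when both are answers to one question on one poll."""
    # Answers to the same question are negatively correlated (a vote for
    # Jones is a vote not for Smith), which makes this se larger than type (a).
    se = math.sqrt((p1 + p2 - (p1 - p2) ** 2) / n)
    diff = p1 - p2
    return diff - 2 * se, diff + 2 * se

# Jones 50%, Smith 45%, one poll of 1000.
low, high = two_se_interval_same_question(0.50, 0.45, 1000)
print(round(low, 3), round(high, 3))  # prints -0.012 0.112
```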

Another note: if the proportions involve a subgroup, then n is the subgroup size, not the whole sample size! (See also the first section of this page and the last section of this page.)

Example of Type (c)

Suppose in one poll 50% of respondents said they liked Dilbert and 45% said they liked Doonesbury. The poll had a sample size of 1000. What is a 2 se interval for the true population difference between liking for these two comic strips?

Note that these are answers to two different questions. Some people like both. Some people like neither. Some only one or the other. So this is a problem of type (c).

The result from Rweb is (-0.012, 0.112).

Here the type (b) and type (c) standard error formulas give exactly the same result, so the intervals for our two examples are exactly the same.

The type (b) and type (c) standard error formulas give quite different results if p1 and p2 are both large. Change them to 0.80 and 0.75 in both examples and see how different the results are.
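Here is a minimal Python sketch of that comparison (not the Rweb code itself). The type (b) standard error is the usual multinomial one, sqrt((p1 + p2 - (p1 - p2)^2) / n). The type (c) version, which replaces p1 + p2 by the smaller of p1 + p2 and q1 + q2 (with q = 1 - p), is an assumption about what Table 8.5.5 gives, so check it against the book. The function names are made up for illustration.

```python
import math

def se_type_b(p1, p2, n):
    """Standard error of p1 - p2: same question, same poll."""
    return math.sqrt((p1 + p2 - (p1 - p2) ** 2) / n)

def se_type_c(p1, p2, n):
    """Standard error of p1 - p2: different questions, same poll.

    ASSUMPTION: uses min(p1 + p2, q1 + q2), as recalled from Table 8.5.5.
    """
    q1, q2 = 1 - p1, 1 - p2
    return math.sqrt((min(p1 + p2, q1 + q2) - (p1 - p2) ** 2) / n)

def two_se_interval(p1, p2, se):
    """2 se interval for p1 - p2 given a standard error."""
    diff = p1 - p2
    return round(diff - 2 * se, 3), round(diff + 2 * se, 3)

n = 1000
for p1, p2 in [(0.50, 0.45), (0.80, 0.75)]:
    interval_b = two_se_interval(p1, p2, se_type_b(p1, p2, n))
    interval_c = two_se_interval(p1, p2, se_type_c(p1, p2, n))
    print(p1, p2, "type (b):", interval_b, "type (c):", interval_c)
```

With 0.50 and 0.45 the two formulas agree because p1 + p2 is smaller than q1 + q2. With 0.80 and 0.75 the type (c) formula switches to q1 + q2 and its interval is much narrower.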

Another note: if the proportions involve a subgroup, then n is the subgroup size, not the whole sample size! (See also the first section of this page and the last section of this page.)

Quick and Dirty Calculations

Wild and Seber call these mental adjustments (p. 350). They are invaluable in reading about polls in newspapers and magazines or even watching TV news about polls.

Let's redo the three examples above the quick and dirty way.

Suppose the reported margins of error for the polls are 3% (which is about right for sample size 1000).

For the type (a) example the interval is 5% plus or minus 1.5 times 3%, which is 5% plus or minus 4.5%.

That's (0.005, 0.095), which happens to be the same as the exact calculation (!) to this many significant figures.

For the type (b) and (c) examples the interval is 5% plus or minus 2 times 3%, which is 5% plus or minus 6%.

That's (-0.01, 0.11), which is not so different from the result (-0.012, 0.112) of the exact calculation.
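The quick and dirty rule is just "multiply the reported margin of error by a fixed factor," so it fits in a few lines. Here is a minimal Python sketch; the multipliers 1.5 and 2 are the ones used above, and the function name is made up for illustration.

```python
def quick_interval(diff, reported_margin, multiplier):
    """Quick and dirty interval: difference plus or minus multiplier times the reported margin."""
    half_width = multiplier * reported_margin
    return diff - half_width, diff + half_width

# Observed difference 5%, reported margin of error 3%.
type_a = quick_interval(0.05, 0.03, 1.5)   # two different polls
type_bc = quick_interval(0.05, 0.03, 2.0)  # same poll, type (b) or (c)

print(tuple(round(x, 3) for x in type_a))   # prints (0.005, 0.095)
print(tuple(round(x, 3) for x in type_bc))  # prints (-0.01, 0.11)
```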

Quick and Dirty Subgroup Calculations

It's not so easy to do the square roots required for subgroup calculations (first section of this page) in your head.

In their section on mental adjustments Wild and Seber suggest just taking the closest fraction that is a perfect square. So for our example above, instead of using sqrt(10) = 3.162, they say to use sqrt(9) = 3.

However, if you don't know all the perfect squares, this may still be hard. Maybe it's best to use a calculator.

Differences Involving Subgroups

If the proportions in a difference of proportions problem involve subgroups, then the n1 and n2 for a type (a) problem or the n for a type (b) or type (c) problem are the subgroup sizes, not the whole sample sizes!

The quick and dirty calculation or mental adjustment is to just apply both adjustments.
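Here is a minimal Python sketch of applying both adjustments at once. The example is hypothetical: it reuses the 3% reported margin and the one-tenth subgroup from the first section, with the multiplier 2 for a type (b) or (c) difference. The function name and the sample sizes are made up for illustration.

```python
import math

def quick_subgroup_difference_margin(reported_margin, whole_n, subgroup_n, multiplier):
    """Apply the subgroup adjustment and the difference-of-proportions multiplier together."""
    subgroup_factor = math.sqrt(whole_n / subgroup_n)
    return multiplier * subgroup_factor * reported_margin

# Reported margin 3%, subgroup of 100 out of 1000 (hypothetical sizes),
# difference between two answers within that subgroup (multiplier 2).
margin = quick_subgroup_difference_margin(0.03, 1000, 100, 2.0)
print(round(margin, 2))  # prints 0.19
```

So a 3% poll becomes roughly a 19% margin of error for a difference within a one-tenth subgroup, which is why subgroup comparisons in news stories deserve suspicion.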