University of Minnesota, Twin Cities School of Statistics Stat 5601 Rweb Computing Examples

Stat 5601 (Geyer) Examples (Parametric Bootstrap)

This page is still under construction (again). Don't look yet. Or you can look, but be warned that it's unfinished.

General Instructions
Theory
The Normal Location Problem
The Cauchy Location Problem
Theory II: Multinomial Sampling
A Chi-Square Test of Goodness of Fit
Goodness of Fit with Estimated Parameters
A Chi-Square Test of Independence
A Double Bootstrap Test

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Theory

The theory of the parametric bootstrap is quite similar to that of the nonparametric bootstrap. The only difference is that instead of simulating bootstrap samples that are i. i. d. from the empirical distribution (the nonparametric estimate of the distribution of the data), we simulate bootstrap samples that are i. i. d. from the estimated parametric model.

All the same considerations arise.

Since the parameter estimate theta hat is not the true parameter value theta, we do not sample from the correct distribution. We should sample from F_θ. We do sample from the same thing with a hat on the θ (which I can't do on a web page).
Thus the bootstrap does not do the right thing, only close to the right thing when the sample size is large.
In constructing confidence intervals, it helps to bootstrap pivotal or at least variance-stabilized quantities.

And so forth.

Simulating from a parametric model is not so easy as simulating from the empirical distribution. In fact, it can be arbitrarily complicated, so hard that how to do it is an open research problem. For some parametric models sampling is easy; for others it is not. In general, it bears no relation to sampling from the empirical distribution.

If the observed data are in the vector x, then

x.star <- sample(x, replace = TRUE)

makes a nonparametric bootstrap sample.

In contrast, if the observed data are assumed to be i. i. d. normal, then

x.star <- rnorm(length(x), mean = mean(x), sd = sd(x))

makes a parametric bootstrap sample. This does not do the right thing because we should specify mean to be the true population mean and sd to be the true population standard deviation (but since we don't know the population values we must use estimates).

For more contrast, if the observed data are assumed to be i. i. d. Cauchy, then

x.star <- rcauchy(length(x), location = median(x), scale = IQR(x) / 2)

makes a parametric bootstrap sample. We can't use mean(x) and sd(x) as estimators of location and scale because the Cauchy distribution doesn't have moments and hence these aren't consistent estimators (of anything, much less location and scale). Why median(x) and IQR(x) / 2 are consistent (even asymptotically normal) estimators of location and scale would be more theory than we want to go into here. The only point we wanted to make is that the three examples look a lot different from each other.

The Normal Location Problem
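
As a minimal sketch of how the rnorm statement above fits into a complete bootstrap calculation, the following computes a bootstrap-t confidence interval for the population mean. The data are simulated stand-ins (not a course dataset), and the choice of a studentized quantity is our assumption, following the earlier remark that it helps to bootstrap pivotal quantities.

```r
set.seed(42)                        # for reproducibility
x <- rnorm(30, mean = 10, sd = 2)   # hypothetical data, assumed for illustration

n <- length(x)
theta.hat <- mean(x)
se.hat <- sd(x) / sqrt(n)

nboot <- 999
t.star <- double(nboot)
for (i in 1:nboot) {
    # parametric bootstrap sample from the estimated normal model
    x.star <- rnorm(n, mean = mean(x), sd = sd(x))
    # studentized quantity, recomputed on each bootstrap sample
    t.star[i] <- (mean(x.star) - theta.hat) / (sd(x.star) / sqrt(n))
}

# bootstrap-t 95% confidence interval for the population mean
crit <- sort(t.star)[round((nboot + 1) * c(0.975, 0.025))]
ci <- theta.hat - crit * se.hat
print(ci)
```

For the normal model the studentized quantity is exactly pivotal, so this interval should be close to the usual t interval from t.test(x).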

The Cauchy Location Problem
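
A comparable sketch for the Cauchy model, again with simulated stand-in data, uses the rcauchy statement above together with median(x) and IQR(x) / 2 as the location and scale estimates. Studentizing by the scale estimate is our assumption here; because the Cauchy is a location-scale family, the resulting quantity is pivotal.

```r
set.seed(42)                                 # for reproducibility
x <- rcauchy(30, location = 5, scale = 1)    # hypothetical data, assumed for illustration

n <- length(x)
theta.hat <- median(x)     # location estimate
sigma.hat <- IQR(x) / 2    # scale estimate

nboot <- 999
z.star <- double(nboot)
for (i in 1:nboot) {
    # parametric bootstrap sample from the estimated Cauchy model
    x.star <- rcauchy(n, location = theta.hat, scale = sigma.hat)
    # location error divided by the scale estimate of the same sample
    z.star[i] <- (median(x.star) - theta.hat) / (IQR(x.star) / 2)
}

# 95% confidence interval for the location parameter
crit <- sort(z.star)[round((nboot + 1) * c(0.975, 0.025))]
ci <- theta.hat - crit * sigma.hat
print(ci)
```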

Theory II: Multinomial Sampling

To get to some examples without wading through a tremendous amount of theory, we will stick to one parametric model for which the sampling looks fairly similar to the nonparametric bootstrap. This is the multinomial distribution.

The multinomial distribution is the distribution of categorical measurements on i. i. d. individuals. The numbers of individuals in each category make up the data vector x, and the probabilities of individuals being in each category make up a probability vector p (where probability vector means all(p >= 0) and sum(p) == 1).

Given a probability vector p of length k and a sample size n one creates a multinomial sample with the R statements

c.star <- sample(1:k, n, prob = p, replace = TRUE)
x.star <- tabulate(c.star, k)

(The first statement creates an i. i. d. sample of category numbers with the specified probabilities. The second counts the number of individuals in each category. So x.star is a vector of length k.)
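
To make the two statements concrete, here is a small sketch with a made-up probability vector and sample size (both are assumptions for illustration). It also draws the same kind of sample with rmultinom, the multinomial generator in base R, as a cross-check.

```r
set.seed(42)                   # for reproducibility
p <- c(0.1, 0.2, 0.3, 0.4)     # hypothetical probability vector
k <- length(p)
n <- 50                        # hypothetical sample size

# check that p is a probability vector, allowing for rounding error
stopifnot(all(p >= 0), isTRUE(all.equal(sum(p), 1)))

# the two-statement method from the text
c.star <- sample(1:k, n, prob = p, replace = TRUE)
x.star <- tabulate(c.star, k)

# the same kind of sample in one call; rmultinom returns a k by 1 matrix
x.star2 <- as.vector(rmultinom(1, size = n, prob = p))

print(x.star)
print(x.star2)
```

The check uses all.equal rather than sum(p) == 1 because floating-point rounding can make an exact comparison fail for a perfectly good probability vector.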

A Chi-Square Test of Goodness of Fit

For example, suppose we observe the multinomial data defined to be x in the form below, and we want to test the null hypothesis that the true category probabilities are all equal (to 1 / 6 because there are 6 categories). The R function chisq.test does the usual chi-square test that uses the large-sample approximation (that the chi-square test statistic has a chi-square distribution). The remainder of the code does the parametric bootstrap test.

Actually, since the null hypothesis is completely specified here this is, strictly speaking, a Monte Carlo test rather than a parametric bootstrap. The test is exact.
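
A minimal sketch of such a test, assuming hypothetical counts for the six categories (the counts, the helper name chisq.stat, and the number of simulations are ours, not from the course form):

```r
set.seed(42)                     # for reproducibility
x <- c(14, 9, 11, 6, 13, 7)      # hypothetical counts for 6 categories
k <- length(x)
n <- sum(x)
p0 <- rep(1 / k, k)              # null hypothesis: all categories equally likely

# Pearson chi-square statistic for counts x under probabilities p
# (chisq.stat is a helper defined here for illustration)
chisq.stat <- function(x, p) {
    expected <- sum(x) * p
    sum((x - expected)^2 / expected)
}

tstat.hat <- chisq.stat(x, p0)

# usual test based on the large-sample approximation, for comparison
print(chisq.test(x, p = p0))

nboot <- 999
tstat.star <- double(nboot)
for (i in 1:nboot) {
    # simulate from the completely specified null distribution
    c.star <- sample(1:k, n, prob = p0, replace = TRUE)
    x.star <- tabulate(c.star, k)
    tstat.star[i] <- chisq.stat(x.star, p0)
}

# Monte Carlo P-value; the small tolerance keeps ties from being
# lost to floating-point rounding
p.value <- (sum(tstat.star >= tstat.hat - 1e-8) + 1) / (nboot + 1)
print(p.value)
```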

Goodness of Fit with Estimated Parameters

For another example, suppose we observe the Poisson data defined to be x in the form below, and we want to test the null hypothesis that the data are i. i. d. Poisson with unknown mean, which must be estimated from the data.

A Chi-Square Test of Independence

For another example, suppose we observe the contingency table defined to be x in the form below, and we want to test the null hypothesis of independence (that the row category labels and column category labels are independent random variables).
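
Here the null hypothesis is not completely specified, so the row and column probabilities must be estimated and the test is a genuine parametric bootstrap. A minimal sketch, assuming a hypothetical 2 by 3 table (the table and the helper name chisq.indep are ours):

```r
set.seed(42)                                    # for reproducibility
# hypothetical 2 x 3 contingency table, assumed for illustration
x <- matrix(c(12, 5, 8, 9, 6, 15), nrow = 2)
n <- sum(x)

# Pearson chi-square statistic for independence
# (chisq.indep is a helper defined here for illustration)
chisq.indep <- function(x) {
    expected <- outer(rowSums(x), colSums(x)) / sum(x)
    # a cell with zero expected count also has zero observed count;
    # it contributes nothing to the statistic
    sum(ifelse(expected > 0, (x - expected)^2 / expected, 0))
}

tstat.hat <- chisq.indep(x)

# estimated null model: cell probabilities are products of the
# estimated row and column probabilities
p.hat <- outer(rowSums(x), colSums(x)) / n^2

nboot <- 999
tstat.star <- double(nboot)
for (i in 1:nboot) {
    # multinomial sample over the cells, in column-major order
    c.star <- sample(1:length(x), n, prob = p.hat, replace = TRUE)
    x.star <- matrix(tabulate(c.star, length(x)), nrow = nrow(x))
    tstat.star[i] <- chisq.indep(x.star)
}

# parametric bootstrap P-value (tolerance guards against rounding in ties)
p.value <- (sum(tstat.star >= tstat.hat - 1e-8) + 1) / (nboot + 1)
print(p.value)
```

The statistic can be compared with the one reported by chisq.test(x), which uses the large-sample approximation for its P-value.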

A Double Bootstrap Test

print(x <- c(7, 4, 7, 5, 3, 8, 1, 4, 7, 4, 7, 2, 7,
    8, 5, 8, 4, 1, 6, 3))

print(theta.hat <- mean(x))

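# Likelihood-ratio-type statistic: sum over bins 0, ..., max(x) of
# counts * log(observed proportion / Poisson probability)
tstat <- function(x, theta) {
    counts <- tabulate(1 + x, 1 + max(x))   # counts of 0, 1, ..., max(x)
    bins <- seq(0, max(x))
    p0 <- dpois(bins, theta)                # Poisson probabilities under the null
    p1 <- counts / sum(counts)              # observed proportions
    foo <- counts * log(p1 / p0)
    foo[counts == 0] <- 0                   # 0 * log(0) is NaN in R; count it as 0
    sum(foo)
}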

print(tstat.hat <- tstat(x, theta.hat))

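# Parametric bootstrap P-value for the Poisson goodness-of-fit test
boot.pval <- function(x, nboot) {
    n <- length(x)   # sample size from the argument, not from a global variable
    theta.hat <- mean(x)
    tstat.hat <- tstat(x, theta.hat)
    tstat.star <- double(nboot)
    for (i in 1:nboot) {
        x.star <- rpois(n, theta.hat)
        theta.star <- mean(x.star)   # re-estimate the parameter in each sample
        tstat.star[i] <- tstat(x.star, theta.star)
    }
    (sum(tstat.star >= tstat.hat) + 1) / (nboot + 1)
}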

n <- length(x)
nboot2 <- 199
nboot1 <- 199

print(p.hat <- boot.pval(x, nboot1))

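# Outer bootstrap: simulate the null distribution of the bootstrap P-value itself
p.star <- double(nboot2)
for (i in 1:nboot2) {
    x.star <- rpois(n, theta.hat)
    p.star[i] <- boot.pval(x.star, nboot1)   # inner bootstrap on each outer sample
}
# Double bootstrap P-value: how extreme p.hat is among the simulated P-values
(sum(p.star <= p.hat) + 1) / (nboot2 + 1)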

plot(seq(along = p.star) / (nboot2 + 1), sort(p.star))
abline(0, 1)

New Stuff

See the glm function in R (on-line help).

out <- glm(kyphosis ~ age + I(age^2) + number + start,
    family = "binomial")
summary(out)
out2 <- glm(kyphosis ~ number + start, family = "binomial")
summary(out2)
tout <- anova(out2, out, test = "Chisq")
print(tout)
dev <- tout[2, 4]
print(dev)
pev <- tout[2, 5]
print(pev)

pred <- predict(out2, type = "response")

n <- length(kyphosis)
nboot <- 999
dev.star <- double(nboot)
pev.star <- double(nboot)
for (i in 1:nboot) {
    kyphosis.star <- rbinom(n, 1, pred)
    out.star <- glm(kyphosis.star ~ age + I(age^2) +
        number + start, family = "binomial")
    out2.star <- glm(kyphosis.star ~ number + start,
        family = "binomial")
    tout.star <- anova(out2.star, out.star, test = "Chisq")
    dev.star[i] <- tout.star[2, 4]
    pev.star[i] <- tout.star[2, 5]
}
hist(dev.star)
abline(v = dev, lty = 2)
hist(pev.star)
abline(v = pev, lty = 2)
(sum(pev.star <= pev) + 1) / (nboot + 1)

plot(seq(1, nboot) / (nboot + 1), sort(pev.star),
    xlab = "Quantiles of Uniform(0, 1)",
    ylab = "Quantiles of Bootstrap P-Value")
abline(0, 1)

cat("Calculation took", proc.time()[1], "seconds\n")

New New Stuff

See the glm function in R (on-line help).

out <- glm(kyphosis ~ age + I(age^2) + number + start,
    family = "binomial")
sout <- summary(out)
print(sout)
names(sout)
print(sout$coefficients)
theta.hat <- sout$coefficients[4, 1]
se.theta.hat <- sout$coefficients[4, 2]
print(theta.hat)
print(se.theta.hat)

pred <- predict(out, type = "response")

n <- length(kyphosis)
nboot <- 999
theta.star <- double(nboot)
se.theta.star <- double(nboot)
for (i in 1:nboot) {
    kyphosis.star <- rbinom(n, 1, pred)
    out.star <- glm(kyphosis.star ~ age + I(age^2) +
        number + start, family = "binomial")
    sout.star <- summary(out.star)
    theta.star[i] <- sout.star$coefficients[4, 1]
    se.theta.star[i] <- sout.star$coefficients[4, 2]
}

hist(theta.star)
abline(v = theta.hat, lty = 2)
z.star <- (theta.star - theta.hat) / se.theta.star
all(is.finite(z.star))
hist(z.star)
abline(v = 0, lty = 2)

theta.hat - sort(z.star)[(nboot + 1) * c(0.975, 0.025)] * se.theta.hat

qqnorm(z.star)
abline(0, 1)

cat("Calculation took", proc.time()[1], "seconds\n")
