Statistics 5601 (Geyer, Fall 2013) Bootstrap Bias Correction


Bias Estimation

Section 10.3 in Efron and Tibshirani.

Comments

Everything pretty obvious here.

The function ratio calculates the estimator we are investigating.
The bootstrap is just like other bootstraps we have done that use the k.star trick.
The bootstrap estimate of bias is the mean of theta.star minus theta.hat, that is, mean(theta.star) - theta.hat. This is the obvious bootstrap-world analog of the actual bias, which is the mean of theta.hat minus the true unknown parameter value theta.
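The steps just described can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated stand-ins (not the course dataset), and the two-argument signature of ratio is a guess, since only the function's name appears in these comments.

```r
set.seed(5601)

# Simulated stand-in data (an assumption, NOT the course dataset)
n <- 8
z <- rnorm(n, mean = 6000, sd = 2000)
y <- rnorm(n, mean = -400, sd = 400)

# Hypothetical form of the ratio estimator: ratio of sample means
ratio <- function(y, z) mean(y) / mean(z)
theta.hat <- ratio(y, z)

nboot <- 1000
theta.star <- double(nboot)
for (i in 1:nboot) {
    # the k.star trick: resample indices, so y and z stay paired
    k.star <- sample(n, replace = TRUE)
    theta.star[i] <- ratio(y[k.star], z[k.star])
}

# bootstrap estimate of bias
bias.hat <- mean(theta.star) - theta.hat
print(bias.hat)
# bias corrected estimate
print(theta.hat - bias.hat)
```

Resampling indices rather than the data vectors themselves keeps each y[i] paired with its z[i], which is what the ratio estimator needs.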

Improved Bias Estimation

Section 10.4 in Efron and Tibshirani.

R statements

    z <- oldpatch - placebo
    y <- newpatch - oldpatch
    n <- length(y)

    rmean <- function(p, x) {
        if (length(p) != length(x))
            stop("argument lengths don't match")
        if (any(p < 0))
            stop("argument p not a probability vector")
        if (round(sum(p), 10) != 1)
            stop("argument p not a probability vector")
        sum(x * p)
    }

    rratio <- function(p, x, y) rmean(p, x) / rmean(p, y)

    p.hat <- rep(1 / n, n)
    print(theta.hat <- rratio(p.hat, y, z))

    nboot <- 1000
    theta.star <- double(nboot)
    p.bar <- rep(0, n)
    for (i in 1:nboot) {
        k.star <- sample(n, replace = TRUE)
        p.star <- tabulate(k.star, n) / n
        theta.star[i] <- rratio(p.star, y, z)
        p.bar <- p.bar + p.star
    }
    p.bar <- p.bar / nboot
    print(theta.bar <- rratio(p.bar, y, z))

    hist(theta.star)
    abline(v = theta.hat, col = 2)
    abline(v = theta.bar, col = 3)
    # woof! can't tell the difference
    hist(theta.star, xlim = c(-0.2, 0))
    abline(v = theta.hat, col = 2)
    abline(v = theta.bar, col = 3)

    # improved bootstrap estimate of bias
    print(bias.hat <- mean(theta.star) - theta.bar)
    # unimproved estimate -- for comparison only
    print(mean(theta.star) - theta.hat)
    # bias corrected estimate
    theta.hat - bias.hat

Comments

The main comment is about the rather strange form of the functions rmean and rratio that calculate the ratio estimator.
These functions use what Efron and Tibshirani call the resampling vector (pp. 130–132) and the resampling form (pp. 189–190) of the estimator.
The resampling vector is the vector of weights given to the original data points in a resample X_1*, ..., X_n*. The weight p_i* given to the original data point X_i is the fraction of times X_i appears in the resample. This is calculated by the statement
```
p.star <- tabulate(k.star, n) / n
```
in the bootstrap loop, where k.star is the by now familiar resample of indices. The analogous vector for the original sample is calculated by the statement
```
p.hat <- rep(1 / n, n)
```
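To make the resampling vector concrete, here is a tiny sketch with a hand-picked resample of indices. The particular values of n and k.star are made up for illustration.

```r
n <- 5
k.star <- c(2, 2, 5, 1, 2)          # hand-picked resample of indices (illustrative)

# index 1 appears once, index 2 three times, indices 3 and 4 never, index 5 once
counts <- tabulate(k.star, n)
p.star <- counts / n
print(counts)
print(p.star)

# the original sample gives equal weight to every data point
p.hat <- rep(1 / n, n)
print(p.hat)
```

The second argument to tabulate matters: without it, trailing indices that never appear in the resample would be dropped and p.star would come out too short.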
We now have to write a function that calculates the estimator given the data y and z and the resampling vector p.star.
Unfortunately, this is, in general, hard.
Fortunately, this is, for moments, quite straightforward.
For any function g, any data vector x, and any probability vector p, the expression
sum(g(x) * p)
calculates the expectation of the random variable g(X) in the probability model that assigns probability p[i] to the point x[i] for each i (and probability zero to everywhere else).
Thus
```
sum(x * p)
```
calculates the mean,
```
sum((x - a)^2 * p)
```
calculates the second moment about the point a, and so forth.
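A small numerical check of these moment formulas may help. The data vector, the probability vector, and the helper name moment are all made up for this sketch.

```r
x <- c(1, 2, 3, 4)
p <- c(0.5, 0.25, 0.25, 0)        # made-up probability vector

# hypothetical helper: expectation of g(X) when x[i] has probability p[i]
moment <- function(g, x, p) sum(g(x) * p)

m <- moment(function(x) x, x, p)              # mean
v <- moment(function(x) (x - m)^2, x, p)      # second moment about m

# this p is the resampling vector of the resample (1, 1, 2, 3),
# so the ordinary sample moments of that resample should agree
x.star <- c(1, 1, 2, 3)
print(c(m, mean(x.star)))
print(c(v, mean((x.star - mean(x.star))^2)))
```

Note that the second moment about the mean computed this way divides by n, not n - 1, so it is not what R's var function returns.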
The function rmean(p, x) calculates the sample mean of the data vector x in resampling form.
The stop commands for various error situations are, of course, not required. If the function call is done properly they don't do anything. But it will save you endless hours of head scratching sometime if you get in the habit of putting error checks in the functions you write.
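To see the error checks earn their keep, the following sketch calls rmean with one good and three bad arguments. The wrapper try.rmean and the test values are inventions for this illustration; the wrapper just returns the error message instead of halting.

```r
rmean <- function(p, x) {
    if (length(p) != length(x))
        stop("argument lengths don't match")
    if (any(p < 0))
        stop("argument p not a probability vector")
    if (round(sum(p), 10) != 1)
        stop("argument p not a probability vector")
    sum(x * p)
}

# illustration-only wrapper: capture the error message rather than stopping
try.rmean <- function(p, x)
    tryCatch(rmean(p, x), error = function(e) conditionMessage(e))

x <- c(10, 20, 30)
ok <- try.rmean(c(0.2, 0.3, 0.5), x)          # valid call
bad.length <- try.rmean(c(0.5, 0.5), x)       # wrong length
bad.sum <- try.rmean(c(0.5, 0.5, 0.5), x)     # sums to 1.5
bad.sign <- try.rmean(c(-0.5, 1, 0.5), x)     # negative weight

print(ok)
print(bad.length)
print(bad.sum)
print(bad.sign)
```

Without the checks, the last two calls would silently return a number that is not the mean under any probability model.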
The function rratio(p, x, y) calculates the ratio estimator for data x and y using the rmean function in the obvious fashion.
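One way to convince yourself that the resampling form is right is to check it against the ordinary calculation on the resampled data. In this sketch the data are simulated stand-ins, and the error checks in rmean are omitted for brevity.

```r
rmean <- function(p, x) sum(x * p)      # error checks omitted in this sketch
rratio <- function(p, x, y) rmean(p, x) / rmean(p, y)

set.seed(10)
n <- 8
z <- rnorm(n, mean = 50, sd = 5)        # simulated, not the course data
y <- rnorm(n, mean = -3, sd = 2)

k.star <- sample(n, replace = TRUE)
p.star <- tabulate(k.star, n) / n

via.vector <- rratio(p.star, y, z)                  # resampling form
via.data <- mean(y[k.star]) / mean(z[k.star])       # usual k.star form

print(c(via.vector, via.data))
print(all.equal(via.vector, via.data))
```

The comparison uses all.equal rather than == because the two calculations add the same numbers in different orders and may differ by rounding error.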
In the bootstrap loop the vector p.bar accumulates the sum of the p.star vectors. After the bootstrap loop terminates, it is divided by nboot to give the average of the p.star vectors.
Ideally, if nboot were infinity, p.bar would be the same as p.hat. Since nboot is considerably less than infinity, p.bar is different from p.hat.
Since mean(theta.star) is based on the resamples that yielded the p.bar vector, it makes sense to subtract off rratio(p.bar, y, z) rather than theta.hat to estimate bias.
The logic is that the Monte Carlo errors in mean(theta.star) and rratio(p.bar, y, z) tend to be in the same direction and cancel to some degree, giving an improved estimator.
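In symbols, the two bias estimates computed in the code can be written as follows. The notation is chosen for this sketch: B stands for nboot, T for the estimator in resampling form (here rratio), P^0 for p.hat, P^{*b} for the p.star of the b-th resample, and \bar{P}^* for p.bar.

```latex
\begin{align*}
  \bar{P}^* &= \frac{1}{B} \sum_{b=1}^{B} P^{*b} \\
  \widehat{\mathrm{bias}}_B
    &= \frac{1}{B} \sum_{b=1}^{B} T(P^{*b}) - T(P^0)
    && \text{(unimproved: \texttt{mean(theta.star) - theta.hat})} \\
  \overline{\mathrm{bias}}_B
    &= \frac{1}{B} \sum_{b=1}^{B} T(P^{*b}) - T(\bar{P}^*)
    && \text{(improved: \texttt{mean(theta.star) - theta.bar})}
\end{align*}
```

The two estimates differ only in the term subtracted, and as B goes to infinity \bar{P}^* converges to P^0, so they agree in the limit.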
This method of expressing estimators in resampling form is an important bootstrap technique, which will be used again for the ABC better bootstrap confidence interval technique.

More on Resampling Form Estimators

Comments

The function rmedian calculates the median of a bootstrap sample given in resampling form.

The code is a bit tricky. The statement

    n.star <- round(p * n)

converts p back to counts, the round function being there to make sure the result is exactly integer-valued (not just close). Then the statement

    k.star <- rep(1:n, n.star)

converts these back to the index values that were counted: each element of the sequence 1:n is repeated as many times as the corresponding count in n.star. The resulting k.star inside the function definition is just like the k.star outside the function definition except for order, which doesn't matter. Then we can use k.star to make x.star in the usual way, and apply the function that computes the estimator to x.star in the usual way.

We try it out, and indeed do get the same answers either way.

Clearly, this function has nothing particular to do with medians. Changing the last line lets it calculate any other function.
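A function along the lines described might look like the following. This is a reconstruction from the description above, not necessarily the exact code used on the original page, and the data are simulated for the check.

```r
rmedian <- function(p, x) {
    n <- length(x)
    if (length(p) != n)
        stop("argument lengths don't match")
    n.star <- round(p * n)          # back to exactly integer-valued counts
    k.star <- rep(1:n, n.star)      # back to indices (order differs, which doesn't matter)
    x.star <- x[k.star]
    median(x.star)                  # change this line to compute another estimator
}

set.seed(42)
x <- rnorm(15)                      # simulated data for the check
n <- length(x)
k.star <- sample(n, replace = TRUE)
p.star <- tabulate(k.star, n) / n

direct <- median(x[k.star])         # the usual way
resamp <- rmedian(p.star, x)        # via the resampling vector
print(c(direct, resamp))
```

Both calculations take the median of the same collection of values, so they should agree.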
