Student Seminar Series - May 4, 2006
University of Minnesota
School of Statistics
College of Liberal Arts

Variance Estimation in Complex Survey Data: A Comparative Review and a Bayesian Proposal


Jeremy W. Strief


Thursday, May 4, 2006
9:00 AM, 170 Ford Hall
Minneapolis, East Bank Campus

Refreshments at 8:30 AM
300 Ford Hall


Abstract

In stratified cluster surveys like the U.S. Census Bureau Long Form Report, individuals often have unequal probabilities of being included in the sample. Design-based statisticians would argue that these inclusion probabilities must be considered when estimating population characteristics such as means, totals, or regression coefficients. Performing inference on the regression coefficients is especially challenging for a design-based statistician, since no closed-form standard errors exist for the estimated coefficients. Approximations to the standard errors are commonly calculated with tools like Taylor series linearization, the bootstrap, the jackknife, and balanced repeated replication. The model-based perspective, by contrast, treats the finite population as generated from some statistical model, irrespective of the sampling design, so the inclusion probabilities have no effect on the estimate of any population quantity. In the case of a simple random sample, model-based standard errors of the regression coefficients may be calculated from standard linear model theory. Mixed-effects models may be applied to more complex survey designs.
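As a rough illustration of the design-based approach described above, the following sketch computes a survey-weighted regression slope and approximates its standard error with a delete-one jackknife. It is only a toy under stated assumptions: the function names (`weighted_slope`, `jackknife_se`) and the data are hypothetical, the weights are taken to be the reciprocals of the inclusion probabilities, and each observation is treated as its own sampling unit in a single stratum. A real stratified cluster design would delete whole clusters within strata instead.

```python
import math

def weighted_slope(x, y, w):
    """Survey-weighted least-squares slope; w holds 1 / inclusion probability."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

def jackknife_se(x, y, w):
    """Delete-one jackknife standard error of the weighted slope.

    Assumption: every observation is its own sampling unit (no strata,
    no clusters), which is simpler than the designs discussed in the talk.
    """
    n = len(x)
    reps = []
    for i in range(n):
        # Recompute the slope with observation i left out.
        reps.append(weighted_slope(x[:i] + x[i + 1:],
                                   y[:i] + y[i + 1:],
                                   w[:i] + w[i + 1:]))
    mean_rep = sum(reps) / n
    return math.sqrt((n - 1) / n * sum((r - mean_rep) ** 2 for r in reps))

# Hypothetical data: six respondents with unequal inclusion probabilities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
incl_prob = [0.1, 0.1, 0.2, 0.2, 0.5, 0.5]
w = [1.0 / p for p in incl_prob]

print("weighted slope:", weighted_slope(x, y, w))
print("jackknife SE:  ", jackknife_se(x, y, w))
```

The jackknife is used here only because it is the shortest of the replication methods to write down; the bootstrap or balanced repeated replication would follow the same pattern of recomputing the estimate on perturbed samples.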

Our discussion is inspired by the Minnesota Population Center (MPC), an interdepartmental demography research group at the University of Minnesota. The MPC's databases incorporate subsets of the Census Bureau's internal data, but privacy concerns prevent the Census Bureau from releasing stratification information to the MPC. This situation makes the standard design-based and model-based methods difficult to use. The Bayesian perspective on survey sampling, however, can incorporate uncertain stratum membership through the use of prior distributions. In particular, we propose an extension of the Polya posterior to the case of stratified cluster surveys. The Polya posterior simulates complete copies of the population; by calculating the regression coefficients (or any other desired quantity) for each copy, one obtains a sampling distribution from which the variance may be estimated.
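The simulation idea in the last sentence can be sketched for the simplest case. The code below implements the basic Polya posterior for a simple random sample: the observed values are placed in an urn, and a value is repeatedly drawn at random and returned together with a duplicate until the urn holds as many values as the population. This is not the stratified cluster extension proposed in the talk, whose details the abstract does not give; the function names, the sample, and the population size are all hypothetical.

```python
import random
import statistics

def polya_copy(sample, pop_size, rng):
    """Simulate one complete copy of the population by Polya urn sampling."""
    if pop_size < len(sample):
        raise ValueError("pop_size must be at least the sample size")
    urn = list(sample)
    while len(urn) < pop_size:
        # Draw one value uniformly from the urn and add a duplicate of it.
        urn.append(rng.choice(urn))
    return urn

def polya_posterior(sample, pop_size, stat, n_copies=1000, seed=0):
    """Evaluate `stat` on n_copies simulated complete populations."""
    rng = random.Random(seed)
    return [stat(polya_copy(sample, pop_size, rng)) for _ in range(n_copies)]

# Hypothetical example: 8 observed values from a population of 100.
sample = [3.1, 4.7, 2.9, 5.5, 4.0, 3.8, 6.2, 4.4]
draws = polya_posterior(sample, pop_size=100, stat=statistics.mean,
                        n_copies=500, seed=1)

print("estimate of the population mean:", statistics.mean(draws))
print("simulated variance of the mean: ", statistics.variance(draws))
```

Any function of a complete population can be passed as `stat`, so the same loop would serve for a regression coefficient once each urn element is a (covariate, response) pair.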