--- title: "Bayesian Data Snooping" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: bookdown::html_document2: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML bookdown::pdf_document2: number_sections: true md_extensions: -tex_math_single_backslash linkcolor: blue urlcolor: blue --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. # Data Snooping Statisticians have long had a pejorative term "data snooping" or "data dredging", which many newer terms such a $P$-hacking and HARKing serve as partial replacements. Outside of statistics, the term "cherry picking" is often used to indicate selective use of evidence: ignoring evidence against one's pet theory and possibly mischaracterising putative evidence for it. Many Bayesians have long argued that "optional stopping" for Bayesians is not a problem. It does not change the likelihood and hence does not change Bayesian inference. A recent paper arguing this is > Jeffrey N. Rouder (2014) > Optional stopping: No problem for Bayesians > *Psychonomic Bulletin & Review*, **21**, 301–308 > [DOI:10.3758/s13423-014-0595-4](https://doi.org/10.3758/s13423-014-0595-4) More astute Bayesians have realized that this is a problem. > Rianne de Heide and Peter D. Grünwald (2020) > Why optional stopping can be a problem for Bayesians > *Psychonomic Bulletin & Review*, online > [DOI:10.3758/s13423-020-01803-x](https://doi.org/10.3758/s13423-020-01803-x) > Paul R. Rosenbaum and Donald B. Rubin (1984) > Sensitivity of Bayes Inference with Data-Dependent Stopping Rules > *The American Statistician*, **38**, 106–109 > [DOI:10.1080/00031305.1984.10483176](https://doi.org/10.1080/00031305.1984.10483176) Here we look at stopping rules similar to those discussed by Rosenbaum and Rubin but even more "data snooping". We do Bayesian inference of independent and identically distributed normal data with known variance and flat prior so there is no difference between frequentist and Bayesian intervals. Thus there is nothing inherently Bayesian about the rest of our discussion. What "data snooping" we allow affects frequentists and Bayesians in exactly the same way. # Our Stopping Rule Without loss of generality, assume the known variance of our data is one. Our stopping rule is that we do not stop until the sample mean $\bar{x}_n$ is greater than $C / \sqrt{n}$ where $n$ is the sample size and $C > 0$ is a data-snooper-specified constant. Note that the frequentist or Bayesian interval estimate of the true unknown mean (with, as mentioned above, flat prior on this unknown mean) is $\bar{x}_n \pm z_{\alpha / 2} / \sqrt{n}$, where $z_{\alpha / 2}$ is the $1 - \alpha / 2$ quantile of the standard normal distribution, where $1 - \alpha$ is the desired frequentist coverage probability or Bayesian posterior probability. Hence large enough $C$ guarantees that the resulting interval will exclude zero, even when zero is the true unknown mean. The law of the iterated logarithm says $$ \limsup_{n \to \infty} \frac{\bar{x}_n}{\sqrt{2 \log(\log(n)) / n}} = 1 $$ So with probability one $\bar{x}_n$ will eventually exceed $C / \sqrt{n}$ for any $C$. 
But we may have to wait a very long time for that to happen. So we allow
for minimum and maximum values of the sample size and see what happens.

For efficiency, we code our stopping rule in C.

```{c foo, results="hide"}
#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>
#include <R_ext/Random.h>
#include <R_ext/Utils.h>

SEXP foo(SEXP crit_in, SEXP nmin_in, SEXP nmax_in)
{
    if (! isReal(crit_in))
        error("argument crit must be storage mode double");
    if (! isReal(nmin_in))
        error("argument nmin must be storage mode double");
    if (! isReal(nmax_in))
        error("argument nmax must be storage mode double");
    if (LENGTH(crit_in) != 1)
        error("argument crit must be length one");
    if (LENGTH(nmin_in) != 1)
        error("argument nmin must be length one");
    if (LENGTH(nmax_in) != 1)
        error("argument nmax must be length one");

    const double crit = REAL(crit_in)[0];
    const long nmin = REAL(nmin_in)[0];
    const long nmax = REAL(nmax_in)[0];
    if (crit <= 0) error("argument crit must be positive");
    if (nmin <= 0) error("argument nmin must be positive");
    if (nmax <= 0) error("argument nmax must be positive");

    const double critsq = crit * crit;

    GetRNGstate();

    double sum_x = 0.0;
    double n = 0.0;
    for (;;) {
        R_CheckUserInterrupt();
        sum_x += norm_rand();
        n += 1;
        /* stop when xbar = sum_x / n exceeds crit / sqrt(n), that is,
           when sum_x > 0 and sum_x^2 > crit^2 * n, but not before nmin */
        if (n >= nmin && sum_x > 0 && sum_x * sum_x > critsq * n) break;
        /* give up when we hit nmax */
        if (n >= nmax) break;
    }

    PutRNGstate();

    /* return the final cumulative sum and the sample size */
    SEXP result = PROTECT(allocVector(REALSXP, 2));
    REAL(result)[0] = sum_x;
    REAL(result)[1] = n;
    UNPROTECT(1);
    return result;
}
```

And we exercise our code as follows.

```{r snoop.setup,cache=TRUE,cache.vars=c(".Random.seed", "nboot", "crit", "nmax", "nmin")}
set.seed(42)
nboot <- 30
crit <- qnorm(0.95)
crit
nmax <- 1e9
nmin <- 1e2
```

```{r snoop,cache=TRUE,dependson="snoop.setup"}
foo_star <- NULL
for (iboot in 1:nboot)
    foo_star <- rbind(foo_star,
        .Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))
foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
```

Having a smaller `nmin` allows not only shorter runs but also a smaller
proportion of runs hitting `nmax`, because there is less accumulated
negativity for our stopping rule to overcome. In the runs above we had $n$
less than `nmax` with probability
`r p <- mean(foo_star$n != nmax); p`
(estimated, standard error
`r round(sqrt(p * (1 - p) / nrow(foo_star)), 3)`).
If we decrease `nmin` to the minimum allowable, then we get a higher
probability.

```{r snoop.too,cache=TRUE,dependson="snoop"}
nmin <- 1
foo_star <- NULL
for (iboot in 1:nboot)
    foo_star <- rbind(foo_star,
        .Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))
foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
```

Now we have $n$ less than `nmax` with probability
`r p <- mean(foo_star$n != nmax); p`
(estimated, standard error
`r round(sqrt(p * (1 - p) / nrow(foo_star)), 3)`).

With the critical value we chose, the 90% interval is guaranteed not to
contain the true unknown parameter value (zero in our simulations),
conditional on $n$ being less than `nmax`.
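
To make this claim concrete, here is a minimal sketch of the check (not
evaluated here, and relying on the `foo_star` data frame from the chunk
above): since the `z` column is $\bar{x}_n \sqrt{n}$, the lower endpoint
of the interval, $\bar{x}_n - z_{\alpha / 2} / \sqrt{n}$, is positive
exactly when `z` exceeds `qnorm(0.95)`, which is what the stopping rule
enforces.

```{r snoop.check,eval=FALSE}
# sketch only (not run): every run that stopped before nmax should give
# a 90% interval whose lower endpoint lies above zero
stopped <- foo_star$n < nmax
lower <- (foo_star$z - qnorm(0.95)) / sqrt(foo_star$n)
all(lower[stopped] > 0)
```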
Our analysis using the law of the iterated logarithm shows that we could
push the probability of stopping before `nmax`, and hence of excluding
zero, all the way to 100% by setting `nmax <- Inf`, but that might use a
huge amount of computing time. We will just show what happens when we bump
`nmax` by a factor of 100.

```{r snoop.too.too,cache=TRUE,dependson="snoop.too"}
nmax <- nmax * 100
foo_star <- NULL
for (iboot in 1:nboot)
    foo_star <- rbind(foo_star,
        .Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))
foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
```

Now we have $n$ less than `nmax` with probability
`r p <- mean(foo_star$n != nmax); p`
(estimated, standard error
`r round(sqrt(p * (1 - p) / nrow(foo_star)), 3)`).
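
If one wanted to explore how this probability grows with `nmax`, a
hypothetical wrapper like the following could be used (a sketch only, not
part of the analysis above, and not run here because large values of
`nmax` can take a very long time; the function name `snoop_prob` is ours).

```{r nmax.sweep,eval=FALSE}
# hypothetical sketch (not run): estimate the probability of stopping
# before nmax for several values of nmax, using the compiled foo above
snoop_prob <- function(nmax, nboot = 30, crit = qnorm(0.95), nmin = 1) {
    n_stop <- replicate(nboot,
        .Call("foo", as.double(crit), as.double(nmin), as.double(nmax))[2])
    mean(n_stop < nmax)
}
sapply(c(1e9, 1e11, 1e13), snoop_prob)
```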