1 License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).

2 R

3 Data Snooping

Statisticians have long had the pejorative terms "data snooping" and "data dredging", for which many newer terms, such as \(P\)-hacking and HARKing, serve as partial replacements. Outside of statistics, the term "cherry picking" is often used to indicate selective use of evidence: ignoring evidence against one's pet theory and possibly mischaracterizing putative evidence for it.

Many Bayesians have long argued that "optional stopping" is not a problem for Bayesians: it does not change the likelihood and hence does not change Bayesian inference. A recent paper arguing this is

Jeffrey N. Rouder (2014)
Optional stopping: No problem for Bayesians
Psychonomic Bulletin & Review, 21, 301–308
DOI:10.3758/s13423-014-0595-4

More astute Bayesians have realized that this can be a problem.

Rianne de Heide and Peter D. Grünwald (2020)
Why optional stopping can be a problem for Bayesians
Psychonomic Bulletin & Review, online
DOI:10.3758/s13423-020-01803-x

Paul R. Rosenbaum and Donald B. Rubin (1984)
Sensitivity of Bayes Inference with Data-Dependent Stopping Rules
The American Statistician, 38, 106–109
DOI:10.1080/00031305.1984.10483176

Here we look at stopping rules similar to those discussed by Rosenbaum and Rubin but involving even more aggressive data snooping. We do Bayesian inference for independent and identically distributed normal data with known variance and a flat prior, so there is no difference between frequentist and Bayesian interval estimates. Thus there is nothing inherently Bayesian about the rest of our discussion. Whatever data snooping we allow affects frequentists and Bayesians in exactly the same way.

4 Our Stopping Rule

Without loss of generality, assume the known variance of our data is one. Our stopping rule is that we do not stop until the sample mean \(\bar{x}_n\) is greater than \(C / \sqrt{n}\), where \(n\) is the sample size and \(C > 0\) is a data-snooper-specified constant. Note that the frequentist or Bayesian interval estimate of the true unknown mean (with, as mentioned above, a flat prior on this unknown mean) is \(\bar{x}_n \pm z_{\alpha / 2} / \sqrt{n}\), where \(z_{\alpha / 2}\) is the \(1 - \alpha / 2\) quantile of the standard normal distribution and \(1 - \alpha\) is the desired frequentist coverage probability or Bayesian posterior probability. Hence large enough \(C\) guarantees that the resulting interval excludes zero, even when zero is the true unknown mean.
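
To be specific about "large enough": when the rule stops with \(\bar{x}_n > C / \sqrt{n}\), the lower endpoint of the interval satisfies \[ \bar{x}_n - \frac{z_{\alpha / 2}}{\sqrt{n}} > \frac{C - z_{\alpha / 2}}{\sqrt{n}} \ge 0 \qquad \text{whenever } C \ge z_{\alpha / 2}, \] so any \(C \ge z_{\alpha / 2}\) forces the interval to lie entirely above zero.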

The law of the iterated logarithm says that, with probability one, \[ \limsup_{n \to \infty} \frac{\bar{x}_n}{\sqrt{2 \log(\log(n)) / n}} = 1. \] So with probability one \(\bar{x}_n\) will eventually exceed \(C / \sqrt{n}\) for any \(C\).
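
To spell out the last step: the envelope in the law of the iterated logarithm eventually dominates the stopping boundary, because \[ \sqrt{\frac{2 \log(\log(n))}{n}} > \frac{C}{\sqrt{n}} \quad \Longleftrightarrow \quad n > \exp\bigl(\exp(C^2 / 2)\bigr), \] which for \(C = 1.645\) is roughly \(n > 48\). Since \(\bar{x}_n\) exceeds any fixed fraction of that envelope infinitely often, it also eventually exceeds \(C / \sqrt{n}\).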

But we may have to wait a very long time for that to happen, so we allow for minimum and maximum values of the sample size and see what happens. For efficiency, we code our stopping rule in C.

#include <R.h>
#include <Rmath.h>
#include <Rinternals.h>
#include <R_ext/Utils.h>
#include <math.h>

SEXP foo(SEXP crit_in, SEXP nmin_in, SEXP nmax_in) {

    if (! isReal(crit_in))
        error("argument crit must be storage mode double");
    if (! isReal(nmin_in))
        error("argument nmin must be storage mode double");
    if (! isReal(nmax_in))
        error("argument nmax must be storage mode double");
    if (LENGTH(crit_in) != 1)
        error("argument crit must be length one");
    if (LENGTH(nmin_in) != 1)
        error("argument nmin must be length one");
    if (LENGTH(nmax_in) != 1)
        error("argument nmax must be length one");
    const double crit = REAL(crit_in)[0];
    /* keep nmin and nmax as doubles: values such as 1e11 (used below)
       can overflow a 32-bit long */
    const double nmin = REAL(nmin_in)[0];
    const double nmax = REAL(nmax_in)[0];
    if (crit <= 0)
        error("argument crit must be positive");
    if (nmin <= 0)
        error("argument nmin must be positive");
    if (nmax <= 0)
        error("argument nmax must be positive");
 
    const double critsq = crit * crit;

    GetRNGstate();

    double sum_x = 0.0;
    double n = 0.0;

    for (;;) {
        R_CheckUserInterrupt();
        /* add one more standard normal observation to the running sum */
        sum_x += norm_rand();
        n += 1;
        /* stop as soon as n >= nmin and xbar_n > crit / sqrt(n),
           that is, sum_x > crit * sqrt(n) */
        if (n >= nmin && sum_x > 0 && sum_x * sum_x > critsq * n) break;
        /* give up when the maximum sample size is reached */
        if (n >= nmax) break;
    }

    PutRNGstate();

    SEXP result = PROTECT(allocVector(REALSXP, 2));
    REAL(result)[0] = sum_x;
    REAL(result)[1] = n;
    UNPROTECT(1);
    return result;
}
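
Before this function can be called from R via .Call, the C code has to be compiled into a shared library and loaded into R. A minimal sketch, assuming the source above is saved in a file named foo.c (the shared-library suffix depends on the platform):

system("R CMD SHLIB foo.c")
dyn.load(paste0("foo", .Platform$dynlib.ext))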

We exercise our code as follows.

set.seed(42)

nboot <- 30
crit <- qnorm(0.95)
crit
## [1] 1.644854
nmax <- 1e9
nmin <- 1e2
foo_star <- NULL

for (iboot in 1:nboot)
    foo_star <- rbind(foo_star,
        .Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))

foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
##           z          n
##   1.6451428    7150240
##   1.6448615  724582889
##   1.6660137       2299
##   1.6455300        144
##   1.7311039        119
##   1.6455967      17716
##   1.6464479     745835
##   1.6448807    3465383
##   0.4604622 1000000000
##   2.0905846        100
##   1.6553572       6549
##   1.6465587      99237
##  -1.4574581 1000000000
##   1.6451190     332259
##   1.6449366   73604375
##   1.6491055       3304
##  -0.5747132 1000000000
##   1.6474857       7176
##   1.6448879   19307419
##   0.9594091 1000000000
##  -0.7115764 1000000000
##   1.7909918        100
##   0.9675176 1000000000
##  -0.4218012 1000000000
##   1.6488675      16973
##   1.6505782       3399
##   1.6460052      89953
##   0.4606342 1000000000
##  -1.2585646 1000000000
##   1.6501299        531

Having a smaller nmin would allow not only shorter runs but also a smaller proportion of runs hitting nmax, because the running sum has less opportunity to drift far negative before the stopping rule is allowed to take effect.

In the runs above we had \(n\) less than nmax with probability 0.7 (estimated, standard error 0.084).
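
This estimate and its standard error are just the usual sample proportion and binomial standard error computed from the 30 runs. A sketch of the calculation, using objects already in the R session (the same recipe applies to the later batches of runs):

p.hat <- mean(foo_star$n < nmax)   # proportion of runs stopping before nmax
p.hat
sqrt(p.hat * (1 - p.hat) / nboot)  # binomial standard error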

If we decrease nmin to the minimum allowable value, which is one, then we get a higher probability of stopping before nmax.

nmin <- 1

foo_star <- NULL

for (iboot in 1:nboot)
    foo_star <- rbind(foo_star,
        .Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))

foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
##            z          n
##   1.98339058          6
##   1.65123737       5481
##   1.66004563          1
##  -0.01745529 1000000000
##   1.64488731  327446003
##   1.89862772          5
##   1.66972236       3799
##   1.70603012         42
##   1.67207643          1
##   2.05752042         11
##   1.75571501          1
##   1.65357076        579
##   1.73293120         33
##   1.64489872   23145336
##   0.16516065 1000000000
##   1.64568351      10290
##   1.65616908       3569
##   1.64565728       5859
##  -1.05717551 1000000000
##   1.64602081      15299
##   1.64487472  270884872
##   1.64612420      51621
##   1.80804637         73
##   1.65043125      23389
##   1.65237816       3591
##   1.66255064         57
##   1.66428354       3682
##   1.88605638          1
##   1.64513525    4897600
##   1.64521104    4389553

Now we have \(n\) less than nmax with probability 0.9 (estimated, standard error 0.055).

With the critical value we chose, the 90% interval is guaranteed not to contain the true unknown parameter value (zero in our simulations), conditional on \(n\) being less than nmax.
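
This can be checked directly from the simulation output. A minimal sketch, using the objects created above: the z column is \(\sqrt{n} \, \bar{x}_n\), so the lower endpoint of the 90% interval is (z - crit) / sqrt(n), which is positive for every run that stopped before nmax.

stopped <- foo_star$n < nmax
lower <- (foo_star$z - crit) / sqrt(foo_star$n)
all(lower[stopped] > 0)   # should be TRUE: every stopped run excludes zero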

Our analysis using the law of the iterated logarithm shows we could increase this probability to 100% by setting nmax <- Inf, but that might require an enormous amount of computing time.

We will just show what happens by bumping nmax by a factor of 100.

nmax <- nmax * 100

foo_star <- NULL

for (iboot in 1:nboot)
    foo_star <- rbind(foo_star,
        .Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))

foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
##           z            n
##  -2.3706516 100000000000
##   1.6558742         3170
##   1.6457539        66883
##  -1.6195818 100000000000
##   1.6456764        12438
##   1.6505346            7
##   1.6452212      4795523
##   1.6484579         1042
##  -1.7841709 100000000000
##   1.6543119          494
##   1.7974073           37
##   1.8698472           11
##   1.6673684           80
##   1.7556894            7
##   1.6448561   8818986051
##   1.6573505           28
##   1.6752640         2141
##  -0.5107717 100000000000
##   1.7337795           19
##   1.6448803      3408106
##   1.6464094       239236
##   1.6612827          787
##   1.6448648   5805155111
##   1.6450392        20831
##   1.6468050         4159
##   1.6448578   5434275625
##  -0.3973488 100000000000
##   1.8058648            3
##   1.9054245            1
##  -0.4673361 100000000000

Now we have \(n\) less than nmax with probability 0.8 (estimated, standard error 0.073).