---
title: "Stat 5421 Notes: Review of Baby Statistics"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
  pdf_document:
    latex_engine: xelatex
---

# License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License (http://creativecommons.org/licenses/by-sa/4.0/).

# What is This?

Chapter 1 of the textbook (Agresti) already reviews bits of the theory of
probability and statistics, but [IMHO](https://en.wiktionary.org/wiki/IMHO)
that chapter is not basic enough for some students. Hence this.

# Sample, Population, Parameters, and Statistics

## The Fundamental Scam of Baby Statistics

In most (all, [AFAIK](https://en.wiktionary.org/wiki/AFAIK)) intro statistics
classes, the theory of probability is dumbed down to the theory of simple
random sampling from a finite population. This makes much of the discussion
of probability theory in such classes wrong, or at least wrongheaded or
incoherent.

In real probability theory (like that taught in theory classes like
Stat 4101–4102 or 5101–5102 at the [U of M](https://umn.edu)) probabilities
can be any numbers between zero and one, inclusive, not just the rational
numbers that arise in finite population sampling. Also in real probability
theory we allow infinite sample spaces (sets of possible outcomes of random
processes), and that requires calculus for actually doing the math (which is
why those theory classes require calculus).

But this course requires neither calculus nor those theory courses as a
prerequisite. So we will have to handwave some of the mathematical issues.
But we won't dumb it down quite as much as baby statistics courses do.

So in statistics we are interested in random "samples" from finite
"populations" except not really. This language actually makes no sense when
we have infinite sample spaces. It just becomes a bad metaphor when we are
"sampling" from an infinite "population".

## Independent and Identically Distributed, Probability Distributions

So we replace "sample" and "population" with the notion of *independent and
identically distributed* (IID) data. What corresponds to the "population" in
the finite-population-sampling picture is a *probability model*, also called
a *probability distribution*.

You may have already met some of these in previous statistics courses. Those
courses certainly used the *binomial distribution*, the *normal
distribution*, the *t distribution*, and the *F distribution*. But they may
have been very sloppy in their discussion (dumbing things down to the point
of actually being incorrect).

In this course, we will have no use for the t and F distributions. They only
apply when the data are assumed to be normally distributed. But this course
is about *discrete data*, also called *categorical data*. Normally
distributed data are *continuous*, not *discrete*. Hence no t and F.

We will introduce some other distributions (chi-square, Poisson, multinomial,
product multinomial, multivariate normal), but not on this page. They can
wait until we start with Chapter 1 of Agresti.
A sequence of observations is IID if every random variable in the sequence
has the same distribution (that's the *identically distributed* part) and the
value of any single random variable or any set of random variables has
nothing whatsoever to do with any other random variables, separately or
collectively (that's the *independent* part). Here and everywhere else in
probability and statistics *independent* means *stochastically independent*.
We will [eventually](ch1.html#iid) give a more rigorous definition of IID,
but this will have to do for now.

## The Sample is Not the Population

When teaching baby statistics, I repeat this section heading as often as
possible. This is the most fundamental issue in statistics. Your data are
*wrong*. They do not tell you what you want to know — no matter how much
effort you put into collecting them.

## And Statistics are Not the Parameters They Estimate

In statistics (the subject) we call properties of the probability model
*parameters*. Usually the true probability model of the data is *unknown*.
That is why we are doing statistics: we are trying to find out about the true
unknown probability model (metaphorically, find out about the "population")
by computing the analogous properties for a (hopefully) representative sample
from the "population". Those quantities that we calculate from the sample
(that do not depend on the unknown population parameters) are called
*statistics* (singular *statistic*).

Thus we have *statistics*, the academic subject, whose practitioners are
called statisticians. And we have *statistic*, a technical term of this
academic subject that refers to functions of random data only (that do not
depend on unknown parameters).

Statistics of the second kind are used to *estimate* unknown parameters. For
example, the sample mean (a *statistic*) is an estimate of the population
mean (a *parameter*) and the sample standard deviation (a *statistic*) is an
estimate of the population standard deviation (a *parameter*).

The "estimate" is there to remind you that statistics are not the parameters
they estimate. The sample mean is not the population mean. The sample mean
is a random quantity (because the data are random). The population mean is
not random (unless you are a Bayesian, but we ignore that for now). Hence
the sample mean (because it is random) has a very small probability (perhaps
zero) of being exactly equal to the population mean. The most we can hope
for is that an estimate is *close* to the parameter it estimates.

## Sampling Distributions

### The Square Root Law

The *square root law* says that statistical precision varies like the square
root of the sample size, more specifically for any estimator
$\hat{\theta}_n$ we have
$$
   \mathop{\rm sd}\bigl(\hat{\theta}_n\bigr) \propto \frac{1}{\sqrt{n}}
$$
where $n$ is the sample size and the symbol $\propto$ means proportional to.
If we do not use the $\propto$ symbol, we can rewrite this as
$$
   \mathop{\rm sd}\bigl(\hat{\theta}_n\bigr) = \frac{\text{constant}}{\sqrt{n}}
$$
where $\text{constant}$ is some constant (not random) that we may or may not
be able to calculate (and even if we do not know how to calculate it,
sometimes the computer does know).

This is not a "law" except in baby statistics. There are required conditions
for it to hold, which is another way to say there are exceptions to the rule
(cases that do not satisfy the required conditions).
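For the simplest estimator, the sample mean of IID data, the law does hold
exactly, and a short simulation can check it. Here is a minimal sketch (the
choice of the Exp(1) distribution, which has standard deviation one so the
constant in the law is one, is arbitrary and just for illustration).

```{r}
# square root law for the sample mean of IID Exp(1) data:
# sd(xbar) should be 1 / sqrt(n), since Exp(1) has standard deviation one
set.seed(42)
sim.sd <- function(n, nsim = 1e4) sd(replicate(nsim, mean(rexp(n))))
n <- c(100, 400, 1600)
cbind(n, simulated = sapply(n, sim.sd), theoretical = 1 / sqrt(n))
```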
However, the square root law does hold (at least approximately) for a very
wide variety of estimators and probability distributions, including most of
those that arise in applied statistics and for every probability model we
will consider in this course. Except this is dumbed down to the point of
being wrong (see next section).

## Asymptotics and a More Precise Square Root Law

### Asymptotic Theory, also called Large Sample Theory

It is very common in statistics that, no matter how much theory you know, you
do not know the exact sampling distribution of an estimator. This will be
true of most of the estimators we discuss in this course.

What we usually (not always) do have is *consistency* and *asymptotic
normality* of estimators along with the square root law. Then we say we have
a *consistent and asymptotically normal* (CAN) estimator. And we write this
in math as
$$
   \sqrt{n} \bigl(\hat{\theta}_n - \theta\bigr)
   \stackrel{\mathcal{D}}{\longrightarrow}
   \text{Normal}(0, \tau^2)
$$
where

 * $\hat{\theta}_n$ is a statistic for sample size $n$,

 * $\theta$ is the parameter it is supposed to estimate,

 * $\stackrel{\mathcal{D}}{\longrightarrow}$ indicates convergence in
   distribution (a concept of probability theory we will not precisely
   define, [leaving that for theory
   courses](https://www.stat.umn.edu/geyer/5101/slides/s7.pdf#page=3)),

 * and $\tau^2$ indicates a quantity which is the variance of the asymptotic
   normal distribution and which typically depends on unknown parameters.

This $\tau^2$ need not have anything to do with the "population" variance
(the variance of the true unknown probability distribution of the data). For
example, we find out in theory courses that if $\hat{\theta}_n$ is the sample
median for sample size $n$ and $\theta$ is the "population" median, then
$$
   \sqrt{n} \bigl(\hat{\theta}_n - \theta\bigr)
   \stackrel{\mathcal{D}}{\longrightarrow}
   \text{Normal}\left(0, \frac{1}{4 f(\theta)^2} \right)
$$
where $f$ is the [probability density function](ch1.html#pmf-pdf) of the
probability model (which means this probability model is *continuous*), and
we need to assume $f(\theta) > 0$ in order to avoid dividing by zero, and we
also need to assume that $f$ is a continuous function (which is not implied
by the probability distribution being continuous).

The only point of the example above is to show that the asymptotic variance
need not have anything to do with the population variance. Otherwise it is
irrelevant to this course and may be forgotten. We will see many other
examples where the asymptotic variance has no obvious relationship to the
population variance, but they are not as simple as the median (which is why
we did not choose an example more relevant to this course).

Now we come to where the preceding section is dumbed down to the point of
being wrong. The *asymptotic variance* $\tau^2$ is the variance of the
*asymptotic distribution* — the distribution that the exact sampling
distribution of $\sqrt{n} \bigl(\hat{\theta}_n - \theta\bigr)$ is getting
closer and closer to as $n$ goes to infinity. The actual finite sample size
estimator $\hat{\theta}_n$ need not have a variance, hence not a standard
deviation either.

As an example of this phenomenon, we know from intro stats that the usual
estimator $\hat{\pi}_n$ of the usual parameter $\pi$ of the binomial
distribution has
$$
   \mathop{\rm sd}\bigl(\hat{\pi}_n\bigr) = \sqrt{\frac{\pi (1 - \pi)}{n}}
$$
so this estimator obeys the square root law *exactly* (not just
approximately).
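Here is a quick simulation check of that exact formula (the particular values
$\pi = 0.3$ and $n = 50$ are arbitrary choices for illustration).

```{r}
# check sd(pi.hat) = sqrt(pi (1 - pi) / n) for the binomial distribution
set.seed(42)
pi.true <- 0.3
n <- 50
pi.hat <- rbinom(1e5, n, pi.true) / n
c(simulated = sd(pi.hat), theoretical = sqrt(pi.true * (1 - pi.true) / n))
```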
But for reasons to be discussed later (many times throughout the course, this
being a major theme), we do not want to use this parameter but rather
$$
   \hat{\theta}_n = \mathop{\rm logit}\bigl(\hat{\pi}_n\bigr)
$$
where the logit function (pronounced low-jit) is defined by
$$
   \mathop{\rm logit}(\pi)
   =
   \log\left(\frac{\pi}{1 - \pi}\right)
   =
   \log(\pi) - \log(1 - \pi)
$$
*except* when $\pi = 0$ or $\pi = 1$, where it is undefined. Or we can set
\begin{align*}
   \mathop{\rm logit}(0) & = - \infty
   \\
   \mathop{\rm logit}(1) & = + \infty
\end{align*}
because
\begin{align*}
   \lim_{\pi \searrow 0} \mathop{\rm logit}(\pi) & = - \infty
   \\
   \lim_{\pi \nearrow 1} \mathop{\rm logit}(\pi) & = + \infty
\end{align*}
And either way this $\hat{\theta}_n$ does not have a variance or standard
deviation (or has infinite variance and standard deviation, depending on how
you want to think about it) *for all finite sample sizes*. In fact (much
more on this later in the course)
$$
   \sqrt{n} \bigl(\hat{\theta}_n - \theta\bigr)
   \stackrel{\mathcal{D}}{\longrightarrow}
   \text{Normal}\left(0, \frac{1}{\pi (1 - \pi)} \right)
$$
So this is a CAN estimator, but it does not have a finite sample standard
deviation. *This is the usual case for the estimators we use in this
course.*

### Sloppy versus Careful Asymptotics

Usually we will rewrite the first displayed formula of the preceding section
as
$$
   \hat{\theta}_n \approx \text{Normal}\left(\theta, \frac{\tau^2}{n} \right)
$$
where $\approx$ means approximately distributed as or something of the sort.
This is theoretically sloppy, because the right-hand side here cannot be a
limit as $n$ goes to infinity (such a limit cannot depend on $n$; that's the
whole point of a limit, to get rid of the dependence on $n$). One will find
in theory courses that one cannot use this sloppy notation when arguments get
complicated (too sloppy). But the sloppy notation is good enough for most
applied work and most of this course.

This makes sense because if
$$
   \sqrt{n} \bigl(\hat{\theta}_n - \theta\bigr)
$$
had exactly (not approximately) the
$$
   \text{Normal}(0, \tau^2)
$$
distribution, then $\hat{\theta}_n$ itself would have exactly (not
approximately) the
$$
   \text{Normal}\left(\theta, \frac{\tau^2}{n} \right)
$$
distribution by the [linearity rule](ch1.html#delta-method).

### The Square Root Law Again

Note that the sloppy version of the asymptotics agrees with the square root
law: if the asymptotic variance is $\tau^2 / n$, then the asymptotic standard
deviation is the square root of that, which is $\tau / \sqrt{n}$. And that's
the square root law (asymptotic version).

In this course, no estimators violate the square root law. So if your error
of estimation (exact or approximate, theoretical or estimated) is not
proportional to $1 / \sqrt{n} = n^{- 1 / 2}$, then you are making a mistake.

### Summary

If $\hat{\theta}_n$ is as described in the binomial distribution example
(that is, $\hat{\theta}_n = \mathop{\rm logit}\bigl(\hat{\pi}_n\bigr)$), then

 * **(wrong)** the approximate or large-sample or asymptotic *variance* of
   $\hat{\theta}_n$ is $1 / (n \pi (1 - \pi))$ (this is infinitely wrong)

 * **(right)** the approximate or large-sample or asymptotic *distribution*
   of $\hat{\theta}_n$ is the normal distribution having mean $\theta$ and
   variance $1 / (n \pi (1 - \pi))$.

Or even shorter

 * **(wrong)** the variance is approximated

 * **(right)** the distribution is approximated.
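Here is a simulation sketch of this summary (the values $\pi = 0.1$, $n =
20$, and $n = 1000$ are arbitrary choices for illustration).

```{r}
set.seed(42)
logit <- function(p) log(p) - log(1 - p)
pi.true <- 0.1
# for small n the event pi.hat = 0 has appreciable probability, and then
# logit(pi.hat) = -Inf, so the exact sd does not exist for any finite n
theta.hat <- logit(rbinom(1e5, 20, pi.true) / 20)
mean(is.infinite(theta.hat))   # should be near (1 - 0.1)^20, about 0.12
# but the asymptotic distribution, which has standard deviation
# 1 / sqrt(n pi (1 - pi)), describes the estimator well for large n
n <- 1000
theta.hat <- logit(rbinom(1e5, n, pi.true) / n)
c(simulated = sd(theta.hat[is.finite(theta.hat)]),
  asymptotic = 1 / sqrt(n * pi.true * (1 - pi.true)))
```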
# There is No One True Way to Do Frequentist Statistics

In so-called frequentist statistics (which would be better called
samplingdistributionist if English made compound words this way — like German
does — because it is statistical inference derived from sampling
distributions of statistics) there is no one right way to do anything.

 * Any statistic you say is an estimator of a parameter $\theta$ is one. It
   may be completely ridiculous, like the estimator that ignores the data and
   always says 42. It may be even more than completely ridiculous, like a
   negative estimate for a parameter that is known to be positive.

   Of course, some estimators are better than others. So we need theoretical
   statistics to tell us how to figure that out so we can use good ones
   rather than bad ones. But this course does not require theory as a
   prerequisite, so we will just have to tell you what theory says rather
   than go through proofs.

   In this course, we will usually use [maximum likelihood
   estimation](ch1.html#principle-in-scare-quotes-of-maximum-likelihood),
   which is asymptotically (but not exactly) best possible, meaning that its
   asymptotic normal distribution has the smallest variance possible for any
   estimation procedure. But that does not mean that maximum likelihood is
   best at any finite sample size (and we have examples like [James-Stein
   estimators](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator)
   to show that maximum likelihood need not be the best at any finite sample
   size in certain situations).

 * Any interval you say is a confidence interval is one. Even if it does not
   have the stated coverage probability, it has *approximately* the stated
   coverage probability if a very bad approximation is allowed.

   In this course, because of discreteness, [no confidence interval will be
   exact](http://www.stat.umn.edu/geyer/5102/examp/coverage.html). So all
   will only have approximately the stated coverage probability (and the
   coverage probability will depend on the true parameter value, which is
   unknown).

   Again, we know that confidence intervals associated with the maximum
   likelihood estimator are asymptotically best possible. But this is tricky
   for discrete data, because regardless of sample size, the approximation
   gets worse and worse as the parameter values get extreme (mean values go
   to the boundary of their parameter space), so the confidence intervals
   work well when the parameters are far from the boundary but not otherwise.
   More on this later.

 * Similarly for hypothesis tests. [Any recipe for confidence intervals
   implies a corresponding one for hypothesis tests and vice
   versa](ch1.html#agresti-section-1.3.4), so when you have many confidence
   interval recipes you also have many hypothesis test recipes (a concrete
   illustration of several competing interval recipes appears at the end of
   these notes).

It would be nice (for you) if we dumbed this down and just said "learn this
one" and allowed you to forget the others. But it would not be nice if later
on you had to deal with the others in real life. So we will learn about all
of the widely used procedures.

# There is One True Way to Do Bayesian Statistics

The one true way to do Bayesian statistics is to use Bayes' rule (this may
not make much sense if you have not been exposed to [Bayesian
statistics](https://www.stat.umn.edu/geyer/3701/notes/mcmc-bayes.html), but
will eventually). But different Bayesians will produce different inferences:
[if they use different prior distributions, then they will get different
posterior
distributions](https://www.stat.umn.edu/geyer/3701/notes/mcmc-bayes.html#bayesian-inference).
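To make that concrete, here is a minimal sketch using a model where Bayes'
rule can be done in one line: binomial data with a conjugate beta prior (the
posterior is then another beta distribution; the particular priors and data
below are made up for illustration).

```{r}
# binomial data: x successes in n trials; with a Beta(a, b) prior the
# posterior is Beta(a + x, b + n - x), so Bayes' rule is one line
x <- 7
n <- 10
post.mean <- function(a, b) (a + x) / (a + b + n)
post.mean(1, 1)     # flat Beta(1, 1) prior
post.mean(10, 10)   # Beta(10, 10) prior, concentrated near 1/2
```

Same data, different priors, different posterior means. And, given its
prior, each is the correct Bayesian answer.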
But, unlike frequentist inference, where some procedures are better than
others, all of these Bayesian inferences are perfectly correct (assuming
Bayes' rule was followed correctly and the prior was defensible). More on
this when we get to Bayes.

# No One Right Answer

But either way (frequentist or Bayesian, or "other" for that matter; not all
statistical inference is either of these, although we will not cover anything
else in this course) there will rarely be only one possible correct answer
(unless we are very specific in telling you what to do). And there will
never be only one possible correct answer in real life (if you ever do
statistics in real life).

Sorry about that. But that's just the way it is.
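To end with the concrete illustration promised above: here are three standard
95% confidence intervals for the same (made up) binomial data, computed with
functions in base R. All three are widely used, all three are defensible,
and none of them is *the* answer.

```{r}
x <- 7
n <- 10
prop.test(x, n)$conf.int                   # score interval, continuity corrected
prop.test(x, n, correct = FALSE)$conf.int  # score interval, uncorrected
binom.test(x, n)$conf.int                  # Clopper-Pearson "exact" interval
```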