---
title: "Stat 5421 Lecture Notes: To Accompany Agresti Ch 4"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
    html_document:
        md_extensions: -tex_math_single_backslash
        mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
    pdf_document:
        latex_engine: xelatex
---

# License

This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License
(http://creativecommons.org/licenses/by-sa/4.0/).

# Section 4.2.2 in Agresti

## Data

```{r snoring.data}
heart.disease <- rbind(c(24, 1355),
                       c(35,  603),
                       c(21,  192),
                       c(30,  224))
snoring <- c("never", "occasionally", "nearly.every.night", "every.night")
x <- c(0, 2, 4, 5)
```

## Logistic Regression

Then we fit the logistic GLM (usually called *logistic regression*) by
```{r snoring.glm}
gout <- glm(heart.disease ~ x, family = binomial)
summary(gout)
```

We show this model first because it is the one we and Agresti recommend
and is by far the most widely used.

## Bernoulli Regression With Identity Link

Now we do the nonsense model, that no one ever uses, except teachers sometimes
show it to show why all your intuitions developed from learning about linear
models do not transfer to GLM.
```{r snoring.glm.quasi.linear.link}
gout.goofy <- glm(heart.disease ~ x, family = binomial(link="identity"))
summary(gout.goofy)
```

We make a plot like Figure 4.1 in Agresti.
```{r curves, fig.align="center", fig.cap="logit link (black) versus identity link (red)"}
mu <- predict(gout, type = "response")
mu.goofy <- predict(gout.goofy, type = "response")
mu
mu.goofy

plot(x, mu, ylab = "probability", pch = 19, ylim = range(mu, mu.goofy))
curve(predict(gout, newdata = data.frame(x = x), type = "response"),
    add = TRUE)
grep("red", colors(), value = TRUE)
points(x, mu.goofy, col = "red", pch = 19)
curve(predict(gout.goofy, newdata = data.frame(x = x), type = "response"),
    add = TRUE, col = "red")
```

## Comments

### Linearity versus Constraints

Linearity on the mean scale (identity link) fights the constraints
that probabilities have to be between zero and one.  Let us look at
a few examples where we have probabilities near zero and one for
our regression function.

Make up some data where the logistic regression model is true and
probabilities go from near zero to near one.
```{r logistic.make.up}
z <- 1:100
theta <- z - mean(z)
theta <- 3 * theta / max(theta)
p <- 1 / (1 + exp(- theta))
y <- rbinom(length(p), 1, p)
rm(theta, p)
```
Now fit the logistic regression model and plot it.
```{r logistic.fit.and.plot, fig.align="center", fig.caption="Scatterplot with Regression Function: Logistic Regression"}
gout.logit <- glm(y ~ z, family = binomial)
summary(gout.logit)
plot(z, y)
curve(predict(gout.logit, newdata = data.frame(z = x),
    type = "response"), add = TRUE)
```
Now fit the same data with identity link and add it to the plot.
```{r identity.fit.one, error=TRUE}
gout.identity <- glm(y ~ z, family = binomial(link = "identity"))
```
This model is so bad that R function `glm` has trouble fitting it.
Try two.
```{r identity.fit.two, error=TRUE}
gout.identity <- glm(y ~ z, family = binomial(link = "identity"),
    start = c(0.5, 0))
summary(gout.identity)
```
Now plot.
```{r logistic.plot.two, fig.align="center", fig.caption="Scatterplot with Regression Functions: Logit link (black) and identity link (red)", error=TRUE}
plot(z, y)
curve(predict(gout.logit, newdata = data.frame(z = x),
    type = "response"), add = TRUE)
curve(predict(gout.identity, newdata = data.frame(z = x),
    type = "response"), add = TRUE, col = "red")
```

* The identity link model is so bad that R function `glm` has a lot of
  problems with it and

* we can make the identity link look as bad as we want: it clearly is
  only paying attention to the endpoints of the data and ignoring
  everything in between.

### Hypothesis Tests of Non-Nested Models

Although theory exists for comparing non-nested models
(Cox, [1961](https://projecteuclid.org/euclid.bsmsp/1200512162),
[1962](https://doi.org/10.1111/j.2517-6161.1962.tb00468.x),
[2013](https://www.jstor.org/stable/23360923)),
it is not widely known or used.  The simplest procedure that makes
sense is to compare both to the smallest model containing both.
This is the model with four parameters that fits the data perfectly.
```{r super}
gout.super <- glm(heart.disease ~ snoring, family = binomial)
summary(gout.super)
gout.goofy.super <- glm(heart.disease ~ snoring,
    family = binomial(link="identity"))
summary(gout.goofy.super)
mu.super <- predict(gout.super, type = "response")
mu.goofy.super <- predict(gout.goofy.super, type = "response")
all.equal(mu.super, mu.goofy.super)
```
We fit the model with two different link functions, even though they
are the same model, as proved by the `all.equal` statement.
The reason for this is we are not sure how R function `anova` deals with
different link functions.  We compare the two models of interest to
their least common supermodel as follows.
```{r super.anova}
anova(gout, gout.super, test = "LRT")
anova(gout.goofy, gout.goofy.super, test = "LRT")
```
We see that both models apparently fit these data, but what we are calling
the "goofy" model actually fits better.

# Section 4.2.5 in Agresti

There are other link functions that can be used.
```{r other, fig.align="center", fig.cap="logit link (black) versus identity link (red) versus probit link (yellow) versus cauchit link (blue)"}
gout.probit <- glm(heart.disease ~ x, family = binomial(link="probit"))
gout.cauchit <- glm(heart.disease ~ x, family = binomial(link="cauchit"))
mu.probit <- predict(gout.probit, type = "response")
mu.cauchit <- predict(gout.cauchit, type = "response")

plot(x, mu, ylab = "probability", pch = 19,
    ylim = range(mu, mu.goofy, mu.probit, mu.cauchit))
curve(predict(gout, newdata = data.frame(x = x), type = "response"),
    add = TRUE)
points(x, mu.goofy, col = "red", pch = 19)
curve(predict(gout.goofy, newdata = data.frame(x = x), type = "response"),
    add = TRUE, col = "red")
points(x, mu.probit, col = "yellow3", pch = 19)
curve(predict(gout.probit, newdata = data.frame(x = x), type = "response"),
    add = TRUE, col = "yellow3")
grep("yellow", colors(), value = TRUE)
points(x, mu.cauchit, col = "blue", pch = 19)
curve(predict(gout.cauchit, newdata = data.frame(x = x), type = "response"),
    add = TRUE, col = "blue")
```

Your humble instructor recommends logit link for reasons that will become
apparent when we discuss exponential families.  One really has to like
what Agresti calls the "latent tolerance" motivation to use probit or cauchit.
And even then, logit link is also a "latent tolerance" model based on
the logistic distribution.  So it is unclear that there is any reason
for avoiding the logit link even if one likes the "latent tolerance" story.

The underlying "latent" distributions (meaning completely unobservable,
completely hypothetical, and completely unverifiable) for the available
links for R function `binomial`.
```{r latent, fig.align="center", fig.cap="Distributions.  Logistic (black), normal (yellow) and Cauchy (blue).  Same median and quartiles."}
curve(dcauchy(x, scale = 1 / qcauchy(0.75)), from = -4, to = 4, col = "blue",
    xlab = "latent variable", ylab = "probability density")
curve(dnorm(x, sd = 1 / qnorm(0.75)), add = TRUE, col = "yellow3")
curve(dlogis(x, scale = 1 / qlogis(0.75)), add = TRUE)
```

But there is absolutely no reason to stop at these.  There are an infinite
variety of such distributions.  For example, the $t$ distributions with
$\nu$ degrees of freedom for $1 < \nu < \infty$ (although only integer
degrees of freedom are used in intro stats, the formula for the $t$ PDF
makes sense for non-integer degrees of freedom, as explained in theory
courses) interpolate between the Cauchy distribution, which is the $t$
distribution with one degree of freedom, and the normal distribution,
which is the limit of $t(\nu)$ distributions as $\nu \to \infty$.
So that is already an infinity of possibilities, but there are even more
than that.  And there is no way to choose a "right" one.  So R function
`binomial` stops at these three latent distributions.

# Confidence Intervals

## Wald Intervals

These are produced by R function `predict`.  From now on we will only
use the fit in R object `gout`, which uses the default logit link and
has 2 parameters.

First, the output of summary gives standard errors, so if one wanted
confidence intervals for the betas, one would do them as follows.
```{r ci.beta}
sout <- summary(gout)
class(sout)
class(unclass(sout))
names(unclass(sout))
cout <- sout$coefficients
class(cout)
cout

conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
crit

lower <- cout[ , "Estimate"] - crit * cout[ , "Std. Error"]
upper <- cout[ , "Estimate"] + crit * cout[ , "Std. Error"]
foo <- data.frame(lower, upper)
foo
```

However, the regression coefficients (betas) are meaningless quantities.
They are very far from the data, and the model can be reparameterized
without changing the model.  Thus we look at confidence intervals
for probabilities.
```{r ci.mu, fig.align="center", fig.cap="95% Wald confidence intervals, not simultaneous"}
pout <- predict(gout, type = "response", se.fit = TRUE)
lower <- pout$fit - crit * pout$se.fit
upper <- pout$fit + crit * pout$se.fit
data.frame(lower, upper)

library(sfsmisc)
errbar(x, mu, upper, lower, ylab = "probability")
```

These are not, of course, corrected for multiple comparisons.

We can also get Wald intervals at $x$ values other than those in the
observed data.  It is not clear that these make any sense in these data,
but we will illustrate them anyway.
```{r ci.mu.too, fig.align="center", fig.cap="95% Wald confidence bands, not simultaneous"}
plot(x, mu, ylim = range(mu, upper, lower), ylab = "probability")
curve(predict(gout, newdata = data.frame(x = x), type = "response"),
    add = TRUE)
sally <- function(x) {
    pout <- predict(gout, type = "response", se.fit = TRUE,
        newdata = data.frame(x = x))
    pout$fit - crit * pout$se.fit
}
curve(sally, add = TRUE, lty = "dashed")
sally <- function(x) {
    pout <- predict(gout, type = "response", se.fit = TRUE,
        newdata = data.frame(x = x))
    pout$fit + crit * pout$se.fit
}
curve(sally, add = TRUE, lty = "dashed")
```

## Likelihood Intervals

### R Function Confint

R has a built-in function for doing likelihood intervals
```{r ci.like.beta}
library(MASS)
confint(gout)
```

Unfortunately these intervals are for the meaningless parameters
(regression coefficients) rather than the meaningful parameters
(success probability as a function of $x$).  To calculate those
we need to use the methods of the
[handout on likelihood computation](http://www.stat.umn.edu/geyer/5421/notes/likelihood-compute.html#likelihood-intervals).

### Log Likelihood

But even before we try that, we need the log likelihood function, which
we have not had explicitly, because we were using R function `glm`,
which knows the log likelihood function but doesn't return it to the user.

For this problem, the log likelihood is
$$
   \textstyle \sum_i \bigl[ x_i \log(p_i) - (n_i - x_i) \log(1 - p_i) \bigr]
$$
where each row of the data matrix `heart.disease` is $(x_i, n_i - x_i)$
and the $p_i$ are components of a vector $\mathbf{p}$ that satisfies
$$
   \boldsymbol{\theta} = \mathbf{M} \boldsymbol{\beta}
$$
where $\mathbf{M}$ is the model matrix, $\boldsymbol{\beta}$ is
the vector of regression coefficients, and $\boldsymbol{\theta}$ is
the so-called "linear predictor" which is related to $\mathbf{p}$ by
the inverse link function.
$$
   p_i = \frac{e^{\theta_i}}{1 + e^{\theta_i}} =
   \frac{1}{1 + e^{- \theta_i}}
$$
so
$$
   1 - p_i = \frac{1}{1 + e^{\theta_i}} =
   \frac{e^{- \theta_i}}{1 + e^{- \theta_i}}
$$
and
\begin{gather*}
   \log(p_i) = \theta_i - \log(1 + e^{\theta_i}) = - \log(1 + e^{- \theta_i})
   \\
   \log(1 - p_i) = - \log(1 + e^{\theta_i}) = - \theta_i - \log(1 + e^{- \theta_i})
\end{gather*}
The reason why we have two expressions for each of these is that
$e^\theta$ may overflow (be a number too big for the computer to represent)
when $\theta$ is moderately large
```{r overflow}
exp(800)
```
and, of course, $e^{- \theta}$ can overflow when $\theta$ is moderately
large negative.  Thus we want to use whichever formula cannot overflow,
the one containing $e^{- \theta}$ when $\theta > 0$ and the one containing
$e^\theta$ when $\theta$ is negative.

Also we never want to calculate $1 - p_i$ by subtraction because that can
cause "catastrophic cancellation", loss of all significant figures, when
$p_i$ is near zero.
```{r catastrophic}
1 + 1e-20 - 1
```
Hence we need all 4 formulas above.

Finally, to avoid more catastrophic cancellation, we use R function
`log1p` which calculates the function $x \mapsto \log(1 + x)$ more
accurately when $x$ is small than just using the log function and addition.
From calculus $\log(1 + x) \approx x$ when $x$ is near zero.
```{r log1p}
xx <- c(1e-20, 1e-10, 1, 10)
log1p(xx)
log(1 + xx)
```

Such finicky calculation is also something we do not expect for homework
but which is necessary to write good code usable by others.
This subject of computer arithmetic is really the subject of another
course, so we won't say much about it in this course.
Those that are interested can look at my [Stat 3701 notes on this
subject](http://www.stat.umn.edu/geyer/3701/notes/arithmetic.html).

We also need the model matrix $M$.  We could figure it out for ourselves,
but we can also get it from the output of R function `glm` if we ask for it.
```{r snoring.glm.x}
gout <- glm(heart.disease ~ x, family = binomial, x = TRUE)
gout$x
```

Now we are ready to code the log likelihood function.  We use a factory
function, but, of course, the global variables approach can also be used.
```{r mlogl.logistic.factory}
mlogl.factory <- function(modmat, response) {
    stopifnot(is.numeric(modmat))
    stopifnot(is.finite(modmat))
    stopifnot(is.matrix(modmat))
    stopifnot(is.numeric(response))
    stopifnot(is.finite(response))
    stopifnot(is.matrix(response))
    stopifnot(round(response) == response) # check for integer-valued
    stopifnot(response >= 0)
    stopifnot(ncol(response) == 2)
    stopifnot(nrow(response) == nrow(modmat))
    success <- response[ , 1]
    failure <- response[ , 2]
    function(beta) {
        stopifnot(is.numeric(beta))
        stopifnot(is.finite(beta))
        stopifnot(length(beta) == ncol(modmat))
        theta <- as.vector(modmat %*% beta)
        logp <- ifelse(theta > 0, - log1p(exp(- theta)),
            theta - log1p(exp(theta)))
        logq <- ifelse(theta > 0, - theta - log1p(exp(- theta)),
            - log1p(exp(theta) ))
        - sum(success * logp + failure * logq)
    }
}
```
We should have said when we introduced the function factory approach
the reasons why it is better than the global variables approach are

 * global variables are evil --- they are easily clobbered accidentally
   before they are used, or they can clobber other variables of the
   same name being used by the user for other purposes and

 * the factory function can check the validity of what would be the
   global variables in the other approach --- and this check is done
   once at the time the `mlogl` function is created rather than
   every time the `mlogl` function is invoked (as would have to be
   done in the global variables approach where there is no factory
   function to do this checking)

```{r mlogl.logistic}
mlogl <- mlogl.factory(gout$x, heart.disease)
```

Let us check that our function works and gives the same MLE as R
function `glm`
```{r mlogl.logistic.test}
nout <- nlm(mlogl, c(0, 0))
nout$code <= 2
all.equal(nout$estimate, gout$coefficients, check.attributes = FALSE)
```
Good!  The reason why we did not bother to find a "good" starting point
for the optimization is that we know this log likelihood function has
a unique local maximum, which is also the global maximum, that is found
from any starting point.  But the explanation for this will have to
wait until we study exponential families of distributions.

Now we also need the function that maps $\boldsymbol{\beta}$ to $\mathbf{p}$.
```{r pfun}
pooh.factory <- function(x) {
    stopifnot(is.numeric(x))
    stopifnot(is.finite(x))
    stopifnot(length(x) == 1)
    function(beta) {
        stopifnot(is.numeric(beta))
        stopifnot(is.finite(beta))
        stopifnot(length(beta) == 2)
        theta <- sum(c(1, x) * beta)
        1 / (1 + exp(- theta))
    }
}
```
Try it out.
```{r pfun.test}
pooh <- pooh.factory(0)
pooh(nout$estimate)
mu[1]
```

Now we are finally ready to calculate the likelihood intervals.
```{r logistic.likelihood.intervals, error=TRUE}
library(alabama)
lower <- double(length(x))
for (i in seq(along = x)) {
    aout <- suppressWarnings(auglag(nout$estimate,
        fn = pooh.factory(x[i]),
        hin = function(theta) nout$minimum + crit - mlogl(theta),
        control.outer = list(trace = FALSE)))
    stopifnot(aout$convergence == 0)
    lower[i] <- aout$value
}
upper <- double(length(x))
for (i in seq(along = x)) {
    aout <- suppressWarnings(auglag(nout$estimate,
        fn = pooh.factory(x[i]),
        hin = function(theta) nout$minimum + crit - mlogl(theta),
        control.outer = list(trace = FALSE),
        control.optim = list(fnscale = -1)))
    stopifnot(aout$convergence == 0)
    upper[i] <- aout$value
}
data.frame(lower, upper)
```

```{r ci.pooh, fig.align="center", fig.cap="95% likelihood confidence intervals, not simultaneous"}
errbar(x, mu, upper, lower, ylab = "probability")
```

Of course, if we want simultaneous coverage, we can just use the appropriate
critical value for that, degrees of freedom equal to the length of `beta`,
here 2, as discussed in the [section on simultaneous coverage of the
lecture notes on likelihood computation](http://www.stat.umn.edu/geyer/5421/notes/likelihood-compute.html#simultaneous-coverage)

# Section 4.3 in Agresti

## A Poisson GLM Example

```{r clean}
# clean out R global environment and start fresh
rm(list = ls())
```

### Data

There are also GLM for Poisson response.  For an example, we will use
some data analyzed in my 5102 lecture notes
```{r poisson.data, fig.align="center"}
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex6-4.txt",
    header = TRUE)
names(foo)
hour <- foo$hour
count <- foo$count
plot(hour, count)
```

These (simulated) data are hourly counts from a not necessarily homogeneous
Poisson process. The variables are

 * `hour` which counts hours sequentially throughout a 14-day period
   (running from 1 to $14 \times 24 = 336$) and

 * `count` giving the count for each of these hours.


The idea of the regression is to estimate the mean as a function of time.
Many time series have a daily cycle. If we pool the counts for
the same hour of the day over the 14 days of the series, we see a
clear pattern in the histogram.
```{r poisson.histo, fig.align="center"}
hourofday <- (hour - 1) %% 24 + 1
fred <- structure(list(breaks = 0:24,
    counts = sapply(split(count, hourofday), sum),
    mids = 1:24 - 1 / 2, xname = "hour of day",
    equidist = TRUE), class = "histogram")
plot(fred)
```
The somewhat mysterious code above makes an R object of class `"histogram"`
and hands it off to R generic function `plot` which has a method for
objects of this class.  The requirements for such objects are documented
in the help page for R function `histogram`.

### First GLM

Since there seems to be a daily cycle with two peaks we make
the "linear predictor" for our GLM a Fourier series with frequencies
one per day and two per day.
```{r poisson.glm, fig.align="center"}
w <- hour / 24 * 2 * pi
out <- glm(count ~ I(sin(w)) + I(cos(w)) + I(sin(2 * w)) +
    I(cos(2 * w)), family = poisson)
summary(out)
plot(hourofday, count, xlab = "hour of the day")
curve(predict(out, data.frame(w = x / 24 * 2 * pi),
    type="response"), add=TRUE)
```

Not bad.  We seem to have a reasonable regression function for these data.

### Hypothesis Tests

But maybe more terms in our Fourier series would be better?
```{r poisson.glm.more, fig.align="center"}
form <- "count ~ I(sin(w)) + I(cos(w))"
for (i in 2:10) {
    form <- c(form, paste0(form[length(form)],
        " + I(sin(", i, " * w)) + I(cos(", i, " * w))"))
}
outs <- list()
for (i in seq(along = form))
    outs[[i]] <- glm(as.formula(form[i]), family = poisson)
class(outs) <- "glmlist"
anova(outs, test = "LRT")
```
The big improvement is going from frequency 1 per day to 2 per day,
but smaller improvements occur up to frequency 6 per day.

Alternative methods of model selection that avoid formal hypothesis
tests use the information criteria AIC and BIC.
```{r poisson.information.criterion}
out.aic <- sapply(outs, AIC)
out.aic
which(out.aic == min(out.aic))
out.bic <- sapply(outs, BIC)
out.bic
which(out.bic == min(out.bic))
```

 * BIC (Bayesian Information Criterion) is for when one believes that
   the true unknown probability model is in one of the statistical models
   considered --- in this example that the regression function mapped to the
   linear predictor scale (by taking logs) has the form of a trigonometric
   series.

 * AIC (Akaike Information Criterion) is for when one does not believe
   the true unknown probability model is in one of the statistical models
   considered.

### Confidence Intervals

Confidence intervals are very like those done above for binomial response.
The only difference is that one uses the log likelihood for Poisson response
rather than binomial response.

We will just show some Wald confidence bands.
```{r poisson.bands, fig.align="center", fig.cap="95% Wald confidence bands, not simultaneous"}
conf.level <- 0.95
crit <- qnorm((1 + conf.level) / 2)
plot(hourofday, count, xlab = "hour of the day")
curve(predict(out, newdata = data.frame(w = x / 24 * 2 * pi),
    type="response"), add=TRUE)
sally <- function(x) {
    pout <- predict(out, type = "response", se.fit = TRUE,
        newdata = data.frame(w = x / 24 * 2 * pi))
    pout$fit - crit * pout$se.fit
}
curve(sally, add = TRUE, lty = "dashed")
sally <- function(x) {
    pout <- predict(out, type = "response", se.fit = TRUE,
        newdata = data.frame(w = x / 24 * 2 * pi))
    pout$fit + crit * pout$se.fit
}
curve(sally, add = TRUE, lty = "dashed")
```

# Intercept Makes No Difference When Any Predictor Is Categorical

This is just a fact about how R formulas work.
It is another aspect of canonical parameters are meaningless.

We use for an example data from the examples for R function `glm`
```{r glm.example}
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
data.frame(treatment, outcome, counts) # showing data
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
summary(glm.D93)
```

Note that `outcome` and `treatment` are factors (the terminology R uses
for categorical variables)
```{r glm.example.types}
class(outcome)
class(treatment)
```
and they each have three levels
```{r glm.example.nlevels}
nlevels(outcome)
nlevels(treatment)
```

So how does a categorical variable correspond to a regression model in
which it is used?  For each categorical variable, the regression model
needs *dummy variables*, one for each category (R calls them *levels*
of the *factor* variable).  We can see these dummy variables as follows
```{r glm.example.dummies}
model.matrix(~ 0 + outcome)
model.matrix(~ 0 + treatment)
```
Each dummy variable indicates the cases (components of the response vector,
rows of the model matrix) which are in that category (level).

Note that the dummy variables corresponding to a categorical variable
add up to the vector all of whose components are equal to one,
which is what R calls the *intercept* dummy variable.
```{r glm.example.dummy.sums}
rowSums(model.matrix(~ 0 + outcome))
rowSums(model.matrix(~ 0 + treatment))
model.matrix(counts ~ 1)
```

Thus we would not have an identifiable model if we kept all of the
dummy variables corresponding to a categorical variable and kept
an "intercept" dummy variable.   The default behavior in R is to

 * keep an "intercept" dummy variable and

 * drop one of the dummy variables for each categorical variable (factor).

This gives an identifiable model.
```{r glm.example.default.model.matrix}
model.matrix(counts ~ outcome + treatment)
```
The dummy variables `outcome1` and `treatment1` are omitted.

But we can tell R to omit the intercept.
```{r glm.example.no.intercept.model.matrix}
model.matrix(counts ~ 0 + outcome + treatment)
```

Now the behavior is

 * omit an "intercept" dummy variable and

 * drop one of the dummy variables for each categorical variable (factor)
   **except for one** categorical variable (for which we keep all of its
   dummy variables)

Note — this is very important — that these two recipes give the *same*
model.

 * Leaving out the intercept produces a *different* model *when all of
   the predictor variables are quantitative*.

 * Leaving out the intercept produces the *same* model *when at least one
   of the predictor variables is categorical (qualitative, factor)*.

```{r glm.example.no.intercept}
glm.D93.no.intercept <- glm(counts ~ 0 + outcome + treatment,
    family = poisson())
summary(glm.D93.no.intercept)
```
Note that the regression coefficients (estimates) are different from
the ones with intercept shown above.  But the estimated cell means are
the same, hence the *both specify the same statistical model*.
These are different parameterizations of the same model.
```{r glm.example.no.expected.values}
mu <- predict(glm.D93, type = "response")
mu.no.intercept <- predict(glm.D93.no.intercept, type = "response")
all.equal(mu, mu.no.intercept)
```

# R Function `drop1`

R function `drop1` is useful for finding out the effect of dropping
terms from a model, especially when there are categorical variables,
in which case one wants to drop all of the dummy variables for
a categorical variable or drop none of them.
```{r glm.example.drop1}
drop1(glm.D93, test = "LRT")
```

# R Function `add1`

There is also an R function `add1`
```{r glm.example.add1}
glm.D93.intercept.only <- glm(counts ~ 1, family = poisson())
add1(glm.D93.intercept.only, scope = ~ outcome + treatment, test = "LRT")
```

# R Function `anova`

But one can also do these tests using R function `anova`
```{r glm.example.anova}
glm.D93.outcome.only <- glm(counts ~ outcome, family = poisson())
glm.D93.treatment.only <- glm(counts ~ treatment, family = poisson())

# compare with results of drop1
anova(glm.D93.outcome.only, glm.D93, test = "LRT")
anova(glm.D93.treatment.only, glm.D93, test = "LRT")

# compare with results of add1 
anova(glm.D93.intercept.only, glm.D93.outcome.only, test = "LRT")
anova(glm.D93.intercept.only, glm.D93.treatment.only, test = "LRT")
```

# The Bozo Method of Model Selection

This is not the time to discuss model selection.  We have
[notes on that](glmbb.html) but are not ready for that yet.

This section is just a warning about a particularly awful method of
model selection that is invented over and over again by naive users.
I call it the Bozo method of model selection after a [particularly
losing clown](https://en.wikipedia.org/wiki/Bozo_the_Clown).
The word [bozotic](http://www.catb.org/jargon/html/B/bozotic.html) is
a technical term of geek speak.

The Bozo method is to look at some regression output, for example
```{r bozo-input}
summary(outs[[10]])
```
and keep the regressors that have stars and drop the regressors that don't.

There are many things wrong with this method.

 * It ignores the structure of the problem.  In this problem it makes no
   sense to keep $\sin(k w)$ and drop $\cos(k w)$ for the same $k$ or vice
   versa.  That's not how Fourier series work.
     In other problems, it makes no sense to drop some of the regressors
   that come from one categorical variable (`factor` in R speak) and keep
   others.  The whole point of R functions `drop1` and `add1` is to not
   do that.

 * It is, in effect, multiple testing without correction.  Hypothesis tests
   only make sense when you either do only one test or make some sort
   of correction for multiple testing.

 * There are many methods of model selection that have been put in the
   statistics literature.  This is not one of them.  No authority has
   ever recommended the Bozo method.  We will
   [cover some valid methods](glmbb.html) but not yet.

What R calls "signif stars" are so misleading that I sometimes recommend
putting the command
```{r falsy}
options(show.signif.stars = FALSE)
```
In your Rprofile so you never see them again.
```{r too}
summary(outs[[10]])
```

The $P$-values output by R function `summary` are only valid if corrected
for multiple testing (by Bonferroni correction, for example).  Or if you
are only interested in one of them and are ignoring the rest.

# The Log Likelihood for Poisson Regression

For homework problem 3-2 we need the log likelihood for Poisson regression
(with default link).  This does not seem to be in the notes we have seen
so far.  It is in the notes on exponential families, sections on the
[Poisson distribution](expfam.html#the-poisson-distribution) and
[canonical affine submodels](expfam#casm).

The Poisson distribution has log likelihood for the usual parameter $\mu$
$$
   l(\mu) = y \log(\mu) - \mu
$$
but R function `glm` and other GLM software use the parameter
$$
   \theta = \log(\mu)
$$
(or some other parameter if one does not use the default link function).

Solving for $\mu$ gives
$$
   \mu = e^\theta
$$
So the log likelihood for $\theta$ is
$$
   l(\theta) = y \theta - e^\theta
$$

Now we want to have a contingency table with more than one cell, so
$y$, $\mu$, and $\theta$ all become vectors, and the log likelihood for
$\theta$ becomes
$$
   l(\theta) = \sum_{i \in I} \left[ y_i \theta_i - e^{\theta_i} \right]
$$

The reason for the sum here is that [the joint distribution is the product
of the marginal distributions](ch1.html#iid)
and the log of a product is the sum of the logs.

Now we come to the linear model part of GLM (generalized linear model).
The parameter vector $\theta$ is defined in terms of the parameter
vector $\beta$ (what R calls the ``coefficients'') by
$$
   \theta = M \beta
$$
where $M$ is the model matrix.