--- title: "Stat 3701 Lecture Notes: Statistical Models" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true mathjax: "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" pdf_document: number_sections: true --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. * The version of the `pkgsearch` package used to make this document is `r packageVersion("pkgsearch")`. * The version of the `MASS` package used to make this document is `r packageVersion("MASS")`. * The version of the `mgcv` package used to make this document is `r packageVersion("mgcv")`. ```{r libraries} library("pkgsearch") library("MASS") library("mgcv") ``` # Statistical Models of the R Kind Statistical models come in two kinds, those that R calls models and those it doesn't. (Have you heard the joke: there are two kinds of people in this world, those who divide everything into two kinds, and those who don't? Or its geek version: there are 10 kinds of people in this world, those who know binary and those who don't?) What R calls statistical models are fit by functions that work like the R function `lm` in many ways, make assumptions that are like those for linear models in many ways, and use many of the same generic functions to process their output, including `summary`, `predict`, and `anova`. Of course these models are not *exactly* like linear models. They have to be different in *some* ways. But they are similar enough for some of the intuitions to carry over and some of the ways of working with models to carry over. ## Regression Models Linear models are a special case of "regression" models. They make the following assumptions. * The data can be put in a data frame. * The rows are *cases*, also called *individuals*. The columns are *variables*. * One variable is special: the *response*. All other variables are called *predictors*. * The job of *regression* is to estimate the conditional distribution of the response given the predictors for each case. If $y_i$ is the response for case $i$, and $x_i$ and $z_i$ are the predictors for case $i$, then the job is to estimate the conditional distribution of $y_i$ given $x_i$ and $z_i$. For those who have not had a theory course and don't know what a conditional distribution is, it is just a probability distribution like any other probability distribution — given by a probability mass function (PMF) if the response is a discrete random variable or by a probability density function (PDF) if the response is a continuous random variable — except that the parameters of the distribution are allowed to depend on the conditioning variables. The conditional distribution of the response given the predictors depends on what the values of the predictors are. If you imagine new data (not part of the data used to fit the model), in which you do not get to see the response $y_i$ for this new case but do get to see the predictors (say $x_i$ and $z_i$), then the estimated conditional distribution allows *predictions* about $y_i$ given the values of $x_i$ and $z_i$. There can be as many predictor variables as you like. There is only one response variable. 
(Exception: the R function `lm` can also fit linear models with a *vector* response, producing fits of class `"mlm"`, but we won't talk about that.)

You can even make up new predictor variables from old ones.

* In polynomial regression, starting with predictors $x_i$ and $z_i$, one can make up
    + quadratic predictors $x_i^2$ and $x_i z_i$ and $z_i^2$
    + cubic predictors $x_i^3$ and $x_i^2 z_i$ and $x_i z_i^2$ and $z_i^3$
    + and so forth for higher order polynomials.
* In regression with a categorical predictor vector $x$ (what R calls a `factor`) one must make up *dummy variables*.
    + If $x$ has $k$ categories, then there are dummy variables $d_1$, $d_2$, $\ldots,$ $d_k$.
    + A dummy variable is zero-or-one valued (an *indicator* variable).
    + The component of $d_j$ for case $i$ is equal to one if the value of $x_i$ is the $j$-th category (in R, `levels(x)[j]`) and is equal to zero otherwise.
    + If there is an *intercept* in the regression, then one of the dummy variables is "dropped" (left out of the model). Otherwise, there would be *collinearity* (and not all *coefficients* could be estimated).
    + Most R functions that fit models treat R variables of type `"character"` as if they were factors. They are automatically converted to factors in the process of model fitting.
* The *intercept* variable (if present) is itself a "made up" predictor. It is the vector all of whose components are equal to one.
* In general one can make up arbitrary predictors. Starting with a variable $x$ one can make up $g_1(x)$, $g_2(x)$, $g_3(x)$, and so forth, where $g_1$ and $g_2$ and $g_3$ are *arbitrary* functions that act vectorwise.

The regression equation is
$$
   \theta_i = m_{i 1} \beta_1 + m_{i 2} \beta_2 + \cdots + m_{i p} \beta_p
$$
where $\theta_i$ is some parameter (more on this later), the betas are unknown parameters to be estimated (called the *coefficients* of the model), and each $m_{i j}$ is the value of some predictor variable (either originally given or made up) for case $i$.

If we are sophisticated we think of the $m_{i j}$ as the components of a matrix $M$. Then the regression equation above can be rewritten as a matrix-vector equation
$$
   \theta = M \beta
$$
If there are $n$ cases and $p$ coefficients, then

* the dimension of $\theta$ is $n$,
* the dimension of $\beta$ is $p$,
* the dimension of $M$ is $n$ by $p$ ($n$ rows and $p$ columns).

The matrix $M$ is called the *model matrix* by some and the *design matrix* by others. But the name *design matrix* does not make much sense when the data do not come from a designed experiment, although this does not stop the people who like the term from using it for observational studies. Also R uses only the term *model matrix* in its documentation and in the name of the function `model.matrix` that turns R formulas into model matrices. So we will always say *model matrix*.

Weisberg makes a useful distinction. He calls the originally given predictor variables *predictors*, just like I have. He calls the columns of the model matrix *regressors*. This is shorter and sounds more scientific than "either originally given or made-up predictors".

You know all of this from STAT 3032. This is just a reminder.

## Linear Models (LM)

In order to know what stays the same and what changes, we briefly review linear models (those fit by the R function `lm`). Their assumptions are as follows.

* The parameter in the regression equation is the mean, so we rewrite it
$$
   \mu = M \beta
$$
* The components of the response vector are conditionally independent given the predictors.
* The conditional distributions of the components of the response vector given the predictors are *normal* and the variance does not depend on the predictors. If the response vector is $y$, then the conditional distribution of $y_i$ given the predictors is
$$
   \text{Normal}(\mu_i, \sigma^2)
$$
* The unknown parameters to be estimated are the regression coefficients (betas) and $\sigma^2$.

You may be in the habit of thinking of the regression equation as
$$
   y = M \beta + e
$$
or written out in long form
$$
   y_i = m_{i 1} \beta_1 + m_{i 2} \beta_2 + \cdots + m_{i p} \beta_p + e_i
$$
where $e$ is "error". If so, forget that. That formulation does not generalize to so-called "generalized linear models". The description above does generalize.

Let's do a toy example, just to make the abstractions above concrete.
```{r}
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex5-4.txt",
    header = TRUE)
names(foo)
sapply(foo, class)
```
Suppose we want to fit parallel regression lines, regressing `y` on `x` for each color.
```{r}
lout <- lm(y ~ x + color, data = foo)
summary(lout)
```
The fit will be easier to work with if we do not have an intercept. In that case R will not drop one dummy variable for `color`, and the fitted model will be the same in the sense that it makes the same predictions about the conditional distribution of the response given the predictors, although the coefficients will change.
```{r}
lout <- lm(y ~ 0 + x + color, data = foo)
summary(lout)
```
Then we plot the results.
```{r fig.align='center'}
coef <- coefficients(lout)
attach(foo)
plot(x, y, col = as.character(color))
for (col in unique(color)) {
    predname <- paste("color", col, sep = "")
    abline(coef[predname], coef["x"], col = col)
}
detach(foo)
```
But that wasn't what we really wanted to illustrate. Where are the model matrix and the response vector? We don't see them. But we can!
```{r}
lout <- lm(y ~ 0 + x + color, data = foo, x = TRUE, y = TRUE)
# model matrix
lout$x
# response vector
lout$y
```
If we ever had a model in which the R formula mini-language — the mini-language whose interpreter is the R function `model.matrix` and whose documentation is `?formula` — is inadequate to describe the model matrix, then we can always just construct the response vector and model matrix without using a formula and tell `lm` to use them.
```{r}
m <- as.matrix(lout$x)   ### strip attributes
y <- as.vector(lout$y)   ### strip names
lout.too <- lm(y ~ 0 + m)
summary(lout.too)
```
We need the `0 +` because the "intercept" regressor is already in `m`. Except for changing the names of the coefficients by pasting `"m"` on the front, everything is the same.
```{r}
identical(as.vector(coefficients(lout)), as.vector(coefficients(lout.too)))
```

## Generalized Linear Models (GLM)

### Introduction

In generalized linear models we drop the assumption that the conditional distribution of the response given the predictors is normal. We even drop the assumption that this distribution is continuous. GLM work for discrete responses.

* In LM the response must be continuous, but the predictors can be anything (categorical predictors get turned into dummy variables).
* In GLM the response can be discrete or continuous.
* The R function `glm` fits GLM.
* The families that are allowed for the conditional distribution of response given predictors are documented in `?family`. The two most important are `binomial` and `poisson`.
* The regression equation has the general form
$$
   \theta = M \beta
$$
where $\theta_i$ is some parameter (usually not the mean) that specifies the conditional distribution of $y_i$ given the predictors for case $i$ within a specified family. The vector $\theta$ is called the *linear predictor* in GLM parlance.
* The relationship between mean values and linear predictor values is given by the *link function*. If $g$ is the link function, then $\theta = g(\mu)$, where the link function $g$ operates vectorwise.
    + For the binomial family, the default link function is the logit function (pronounced low-jit).
    + For the poisson family, the default link function is `log`.
* The components of the response vector are conditionally independent given the predictors. (This assumption is the same for LM and GLM.)

R does not have the logit function, but we can easily define it. We also define its inverse function.
```{r}
logit <- function(p) log(p) - log1p(- p)
invlogit <- function(theta) 1 / (1 + exp(- theta))
```

### Model Fitting

So let's try out GLM with Bernoulli response (a random variable is *Bernoulli* if it is zero-or-one-valued, in other words if it is binomial with sample size one).
```{r}
foo <- read.table("http://www.stat.umn.edu/geyer/5102/data/ex6-1.txt",
    header = TRUE)
names(foo)
sapply(foo, class)
gout <- glm(y ~ x, data = foo, family = binomial)
summary(gout)
```
A plot shows what we have done.
```{r fig.align='center'}
with(foo, plot(x, y))
curve(predict(gout, newdata = data.frame(x = x), type = "response"),
    add = TRUE)
```
From the plot we see the reason for the link function. The curve is the mean of the conditional Bernoulli distribution of $y$ given $x$. It does not look like this relation should be linear. The curve given by the GLM looks reasonable. So whether or not the logit link is exactly the right thing, it is *a lot* more reasonable than the identity link (which would assume the means are linear in $x$). The R function `glm` does not even allow trying to use the identity link here.

### Hypothesis Tests

Suppose we try out some polynomial regressions (polynomial on the linear predictor scale).
```{r fig.align='center'}
gout2 <- glm(y ~ x + I(x^2), data = foo, family = binomial)
summary(gout2)
gout3 <- glm(y ~ x + I(x^2) + I(x^3), data = foo, family = binomial)
summary(gout3)
plot(foo)
curve(predict(gout, newdata = data.frame(x = x), type = "response"),
    add = TRUE)
curve(predict(gout2, newdata = data.frame(x = x), type = "response"),
    add = TRUE, col = "red")
curve(predict(gout3, newdata = data.frame(x = x), type = "response"),
    add = TRUE, col = "blue")
legend(1, 1, legend = c("linear", "quadratic", "cubic"),
    col = c("black", "red", "blue"), lty = 1)
```
Another example of TIMTOWTDI (there is more than one way to do it): here are several different ways to make the same plot.
```{r eval=FALSE}
plot(foo$x, foo$y, xlab = "x", ylab = "y")
attach(foo); plot(x, y); detach(foo)
with(foo, plot(x, y))
plot(foo)
```
(The last only works because the names are `x` and `y`.)

As we know from our experience with `lm`, which produces objects of class `"lm"`, and with all of the generic functions that have methods for such objects, there are two ways to do hypothesis tests involving such fits.

* If the test involves whether one coefficient should be in or out of the model, we just look at the $P$-values printed by the R function `summary`.
If we want to know whether the coefficient for `"I(x^3)"` is statistically significantly different from zero in the fit `gout3`, the $P$-value `r formatC(summary(gout3)$coefficients["I(x^3)", "Pr(>|z|)"], digits=3, format="f")` says it is not. Then, having decided that the quadratic model is to be preferred on grounds of parsimony to the cubic model, we look in the fit of that model to see whether the coefficient for `"I(x^2)"` is statistically significantly different from zero, and the $P$-value `r formatC(summary(gout2)$coefficients["I(x^2)", "Pr(>|z|)"], digits=3, format="f")` says it is not.

* If the test involves more than one parameter, then we use the R function `anova` to do the comparison. And it can do several different, but asymptotically equivalent, tests (asymptotically equivalent means they give nearly the same result when the sample size is large). Let's try them.
```{r}
anova(gout, gout3, test = "Chisq")
anova(gout, gout3, test = "LRT")
anova(gout, gout3, test = "Rao")
```
Somewhat bizarrely, `"Chisq"` and `"LRT"` name the same test: the likelihood ratio test (LRT), whose asymptotic distribution under the null hypothesis is chi-square with degrees of freedom equal to the difference in the numbers of parameters of the two models being compared. The option `"Rao"` names a different test: the Rao test, also called the "score" test, also called the "Lagrange multiplier" test. Its asymptotic distribution under the null hypothesis is the same as that of the LRT.

* These $P$-values, `r formatC(anova(gout, gout3, test="LRT")[2, "Pr(>Chi)"], digits=4, format="f")` and `r formatC(anova(gout, gout3, test="Rao")[2, "Pr(>Chi)"], digits=4, format="f")`, would be nearly the same if we were in asymptopia (large sample size territory). They are not because the sample size is not large.
* Either of them, though, says that the larger model (the cubic one) fits these data no better than the smaller model (the linear one), hence the latter is preferred on grounds of parsimony.

In short,

* to compare models that differ in one parameter, use the $P$-values given by the R function `summary` or perhaps by the R functions `drop1` and `add1` (which we do not illustrate), and
* to compare models that differ in more than one parameter, use the $P$-values given by the R function `anova`. Of course, `anova` compares any two models, even those that differ in one parameter.
* Hence one can always use `anova`, but only sometimes use `summary` or `add1` or `drop1`.

**Warning:** R function `anova` is actually a footgun. One condition is required for the test to be valid (other than large sample size), and that is that the models be *nested*, which means that the little model (null hypothesis) is a special case of the big model (alternative hypothesis). R function `anova` does not check this condition, so you have to know what you are doing to use it. Usually it is obvious when the models are nested (every term in the formula for the little model is also a term in the formula for the big model). But it is not always obvious. The real condition is that every *probability distribution* in the little model is also in the big model.

### Confidence Intervals

Statistical inference for a GLM is slightly different from that for LM in that the sampling distributions of estimators and test statistics are approximate, "large $n$" distributions: where LM has exact $t$ and $F$ distributions, GLM have approximate normal and chi-square distributions. But that is the only difference. Everything else works the same.
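To make "everything else works the same" concrete, here is a minimal sketch (not part of the original example) of a Wald confidence interval for a single coefficient, the coefficient of `x` in the fit `gout`, computed by hand from the estimate and standard error reported by `summary` and checked against the R function `confint.default`, which does the same calculation. (The generic function `confint` applied to a GLM fit typically profiles the likelihood instead, so it can give a somewhat different interval.)
```{r}
# a sketch: hand-rolled Wald (normal approximation) interval for the
# coefficient of x in gout, compared with confint.default
sgout <- summary(gout)$coefficients
est <- sgout["x", "Estimate"]
se <- sgout["x", "Std. Error"]
crit <- qnorm((1 + 0.95) / 2)   # two-sided 95% critical value
est + c(-1, 1) * crit * se
confint.default(gout, parm = "x")
```
For LM the same hand calculation would use `qt` with the residual degrees of freedom instead of `qnorm`; that is the "exact $t$" versus "approximate normal" difference mentioned above.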
For another example, here is a confidence interval for the mean at the predictor value $x = 25$.
```{r}
conf.level <- 0.95
xnew <- 25
crit <- qnorm((1 + conf.level) / 2)
pout <- predict(gout, newdata = data.frame(x = xnew), se.fit = TRUE,
    type = "response")
pout$fit + c(-1, 1) * crit * pout$se.fit
```
This confidence interval does not work very well. Notice that its upper end point is well above the range of possible values of the parameter (the response vector is zero-or-one-valued, so the means range between zero and one). There is nothing wrong with the computation. The confidence interval is based on "large sample approximation" and our data are not numerous.

Still, we can do somewhat better. The "large sample approximation" assumes, among other things, that the inverse link function is approximately linear in the region of the confidence interval, and we can just see from the picture that that's not correct. So if we make a confidence interval for the linear predictor and then map that to the mean value parameter scale, that should be better. At least our new confidence interval will have to be in the range of possible parameter values (because the inverse link function maps into that range).
```{r}
tout <- predict(gout, newdata = data.frame(x = xnew), se.fit = TRUE)
invlogit(tout$fit + c(-1, 1) * crit * tout$se.fit)
```
The fact that these two confidence intervals are so different, when "large sample theory" says they are "asymptotically equivalent" (should be very close for large sample sizes), says that the "large sample approximation" is not working very well for these data.

## Nonparametric Regression

*Nonparametric* refers to statistical models that cannot be described using a finite set of parameters. They are models that are too big to be parametric.

There does not seem to be any useful way to be totally nonparametric about regression: to say that there is some relationship between response and predictors and we want to proceed making no assumptions whatsoever. There is just not enough structure to get going. Or at least no one has ever had any idea about how to proceed in this manner AFAIK.

So nonparametric regression divides into two approaches:

* being parametric about the regression equation but nonparametric about the error distribution and
* being nonparametric about the regression equation but parametric about the error distribution, usually making the same error distribution assumptions as LM or GLM.

### Being Nonparametric About the Error Distribution

There isn't a way to be nonparametric about a zero-or-one-valued response. A zero-or-one-valued random variable is automatically Bernoulli. So this section is about continuous responses.

In this section we are looking at competitors that make all the same assumptions as LM except they

* drop the assumption that the error distribution is homoscedastic (same variance) normal and
* instead assume the error distribution is symmetric about zero and the same for all components of the response vector.

The normal distribution is symmetric. Hence we have changed "normal distribution symmetric about zero" into "any distribution symmetric about zero". That's nonparametric because "any distribution" is too large a family to be described by a finite set of parameters.

Already in the notes about correcting errors in data we tried out two R functions that approach this problem using two different methodologies. Recall that these were the R functions `lqs` and `rlm`, both of which are in the R package `MASS`, which is a "recommended" package that comes with every installation of R.
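As a quick reminder (a sketch, not from the original notes) of what these two functions do: `lqs` fits resistant regression, by default by least trimmed squares (`method = "lts"`), and `rlm` fits robust regression by M-estimation, by default with Huber's psi function. Both use the usual formula interface, so the calls look just like calls to `lm`.
```{r eval=FALSE}
# a sketch of the basic calls; `somedata` is a hypothetical data frame
# with columns y and x (real examples follow below)
library("MASS")
qout <- lqs(y ~ x, data = somedata)   # least trimmed squares by default
rout <- rlm(y ~ x, data = somedata)   # Huber M-estimation by default
```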
Let us make up some data that are a challenge for LM but that these functions handle well. First, being a bit bored with "simple" linear regression, let's make the means a cubic function of the predictor.
```{r fig.align='center'}
x <- 0:200
mu.true <- x * (x - 100) * (x - 200) * 1e-5
plot(x, mu.true, type = "l", xlab = "x", ylab = expression(mu))
```
So now we want to add "errors" having a horrible distribution. The worst distribution we know for errors (at least worst for LM) is the Cauchy distribution, which is very heavy tailed. If you have Cauchy errors, and are expecting normal errors, then it looks like there are lots of "outliers". But they aren't mistakes. The errors just have a different distribution.
```{r fig.align='center'}
set.seed(42)
y <- mu.true + rcauchy(length(x))
plot(x, y)
lines(x, mu.true)
```
So now we have some data to try out methods on. But we have a problem.

> It's hard to know what lessons you are supposed to draw from a toy problem.
>
> — Me

We now have to imagine that we don't know how the data were created and just want to do regression. Let's try LM first (even though the data are wildly inappropriate for it, which we aren't supposed to know).
```{r}
lout <- lm(y ~ poly(x, degree = 3))
summary(lout)
```
Comparing with the "simulation truth", which of course we are pretending we don't know, this doesn't look good. The true regression function is cubic with coefficients

* coefficient of constant term (intercept): 0
* coefficient of first degree term: $(0 \cdot (-100) + 0 \cdot (-200) + (-100) (-200)) \times 10^{-5} = `r 100 * 200 * 1e-5`$
* coefficient of second degree term: $(0 + (-100) + (-200)) \times 10^{-5} = `r ((-100) + (-200)) * 1e-5`$
* coefficient of third degree term: $10^{-5}$.

(Keep in mind that `poly` uses orthogonal polynomials by default, so the coefficients printed by `summary` are not on the same basis as the raw cubic coefficients listed above; the fitted curves, which we plot below, are the fair comparison.)

So now let's try the others.
```{r}
qout <- lqs(y ~ poly(x, degree = 3))
summary(qout)
```
How annoying! There is no summary method for class `"lqs"`, so we just print the coefficients.
```{r}
qout$coefficients
```
This doesn't seem to have done any better than `lm` in coming close to the simulation truth. (There are a lot of options to play with, but we won't bother.)

On to `rlm`.
```{r}
rout <- rlm(y ~ poly(x, degree = 3))
summary(rout)
```
I have no idea why none of these "work". Let's look at what they did.
```{r fig.align='center'}
plot(x, y)
lines(x, mu.true)
curve(predict(lout, newdata = data.frame(x = x)), col = "red", add = TRUE)
curve(predict(qout, newdata = data.frame(x = x)), col = "blue", add = TRUE)
curve(predict(rout, newdata = data.frame(x = x)), col = "green", add = TRUE)
```
Oh. There isn't that much difference. We can't really see what is going on because the extreme Cauchy errors stretch the vertical axis and compress the range of variation of the regression curves. Let's zoom in.
```{r fig.align='center'}
plot(x, y, ylim = c(-5, 5))
lines(x, mu.true)
curve(predict(lout, newdata = data.frame(x = x)), col = "red", add = TRUE)
curve(predict(qout, newdata = data.frame(x = x)), col = "blue", add = TRUE)
curve(predict(rout, newdata = data.frame(x = x)), col = "green", add = TRUE)
legend(160, 5, legend = c("truth", "lm", "lts", "rlm"),
    col = c("black", "red", "blue", "green"), lty = 1)
```
I guess we have to say that `rlm` is the winner on this one toy problem, because it is closest to the simulation truth at both ends, whereas `lm` is quite a bit off near $x = 50$ and `lts` (`lqs` using its default method `lts`) is quite a bit off near $x = 150$. But, as said above, it is hard to know what lessons to draw from a toy problem.

Let's look at some confidence intervals.
```{r}
xxx <- seq(0, 200, 50)
# simulation truth
mu.true[x %in% xxx]
predict(lout, newdata = data.frame(x = xxx), interval = "confidence")
```
Three out of five 95% confidence intervals miss the simulation truth. No surprise, because the assumptions for LM are badly violated. The only surprise is that LM doesn't do even worse.

The R function `predict.lqs` does not do confidence intervals. We would have to bootstrap or something of the sort to do any inference after using `lqs`. More on the bootstrap later.
```{r}
predict(rout, newdata = data.frame(x = xxx), interval = "confidence")
```
The R function `predict.rlm` apparently does confidence intervals, but I say "apparently" because this function is undocumented. When you do `?predict.rlm` you are taken to the help page for the `rlm` function, but nothing is said about the `rlm` method of the generic function `predict`. We managed to use it above by assuming it works just like `predict.lm` when it is just doing predictions (no confidence intervals). If we look at the definition of this function
```{r}
getS3method("predict", "rlm")
```
we see something very mysterious. What does `NextMethod` do? It calls the method for the next class, but what is that?
```{r}
class(rout)
```
Oh. The next class is `"lm"`, so it actually calls the function `predict.lm`, but on an object whose components were made by `rlm` rather than `lm`. This is not how to document code IMHO. If you have to [Use the source, Luke!](http://www.catb.org/jargon/html/U/UTSL.html), that's not documentation.

### Being Nonparametric About the Regression Equation

#### A Plethora of Packages

Now we are back to assuming normal errors, but we want to be nonparametric about the regression function. Suppose in our toy problem we don't know that the regression equation is cubic?

Since we need normal errors, let's go back and put normal errors on the same simulation truth regression function.
```{r}
y <- mu.true + rnorm(length(x))
```
There are several nonparametric regression functions in the "core" and "recommended" packages that come with every R installation:

* `smooth.spline` in the `stats` package (core)
* `ksmooth` in the `stats` package (core)
* `locpoly` in the `KernSmooth` package (recommended)
* `gam` in the `mgcv` package (recommended)
* `loess` in the `stats` package (core)
* `supsmu` in the `stats` package (core)

And there are lots more on CRAN.
```{r cache=TRUE}
foo <- advanced_search("smooth|kernel|spline", size = 1000)
dim(foo)
names(foo)
foo$package[1:50]
```
I have no idea how many of these are true positives (really do nonparametric regression).

#### R function `smooth.spline`

Rather than tackle what is obviously a huge subject, let's just illustrate one.
```{r fig.align='center'}
plot(x, y)
lines(x, mu.true)
sout <- smooth.spline(x, y, all.knots = TRUE)
lines(sout, col = "green")
```
Good job! Note that we did not tell `smooth.spline` that the regression function was cubic. We didn't tell it anything about the regression function. Nevertheless, it comes pretty close.

Statistical inference after "smoothing" (which is what this is often called rather than "nonparametric regression") is also problematic. We would have to bootstrap to get confidence intervals. Some smoothing methods do come with confidence intervals, but those intervals assume the degree of smoothness of the regression function is known rather than estimated from the data.
So they are bogus when, as here, we don't specify a degree of smoothness (by using the optional argument `df`).

#### R function `gam` in package `mgcv`

Let's try one more. The R package `mgcv`, which is a "recommended" package that comes with every R installation, fits much more general models than `smooth.spline`. It fits so-called *additive models*, in which the mean vector has the form
$$
   \mu = g_1(x_1) + g_2(x_2) + \cdots + g_k(x_k)
$$
where $x_1$, $x_2$, $\ldots,$ $x_k$ are different predictors and $g_1$, $g_2$, $\ldots,$ $g_k$ are different smooth functions that are to be figured out by the computer. So it does nonparametric regression with multiple predictors. But we are just going to illustrate one predictor, using the same data as in the preceding section.
```{r fig.align='center'}
plot(x, y)
lines(x, mu.true)
aout <- gam(y ~ s(x, bs = "cr"))
summary(aout)
curve(predict(aout, newdata = data.frame(x = x), type = "response"),
    add = TRUE, col = "darkseagreen")
```
Unlike some other functions we have looked at, the R function `gam` produces objects of class
```{r}
class(aout)
```
so there is a rich set of methods for generic functions. We can make confidence intervals.
```{r fig.align='center'}
conf.level <- 0.95
plot(x, y)
lines(x, mu.true)
fred <- function(x) {
    foo <- predict(aout, newdata = data.frame(x = x), type = "response",
        se.fit = TRUE)
    crit <- qnorm((1 + conf.level) / 2)
    foo$fit + crit * foo$se.fit
}
curve(fred, lty = "dashed", add = TRUE)
fred <- function(x) {
    foo <- predict(aout, newdata = data.frame(x = x), type = "response",
        se.fit = TRUE)
    crit <- qnorm((1 + conf.level) / 2)
    foo$fit - crit * foo$se.fit
}
curve(fred, lty = "dashed", add = TRUE)
```
This seems pretty good. Even though `gam` and `predict.gam` know nothing about the true unknown regression function, and even though these confidence intervals

* assume that the degree of smoothness is known rather than estimated, and
* are not corrected for simultaneous coverage,

they mostly (nearly everywhere) cover the simulation truth.

# Statistical Models of Other Kinds

Not all statistical models are regression-like, with a response and predictors. You know from intro stats that the simplest models just have one variable. But there is still statistical inference. There can also be very complicated models with many parameters to estimate (but no single variable picked out as being a "response" to "predict"). More on these later. First we get some other subjects out of the way.