1 License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).

2 R

3 Data

We will use these data

load(url("http://www.stat.umn.edu/geyer/8054/data/nfl2019.rda"))
ls()
## [1] "nfl2019"
class(nfl2019)
## [1] "data.frame"
names(nfl2019)
## [1] "road.team"  "home.team"  "road.score" "home.score" "week"      
## [6] "road.seed"  "home.seed"  "conf"       "neutral"
head(nfl2019)
##   road.team home.team road.score home.score week road.seed home.seed conf neutral
## 1   Packers     Bears         10          3    1        NA        NA <NA>       0
## 2      Rams  Panthers         30         27    1        NA        NA <NA>       0
## 3    Titans    Browns         43         13    1        NA        NA <NA>       0
## 4    Chiefs   Jaguars         40         26    1        NA        NA <NA>       0
## 5    Ravens  Dolphins         59         10    1        NA        NA <NA>       0
## 6   Falcons   Vikings         12         28    1        NA        NA <NA>       0

These are the results for the 2019–2020 season for the highest professional league for American football. But you don’t need to know anything about the game beyond what is explained in these notes to analyze these data for this homework.

4 Model

We are going to do a Bayesian analysis of sports league data, just like the example in the 3701 Lecture Notes with a few minor changes.

For these data there were 6 games at neutral sites, so these games do not have a home field advantage. These are indicated by the variable neutral in this data frame (1 indicates neutral site, so no home field advantage, 0 indicates there is a home field advantage).

Also ties are possible in regular season games, the 256 games with the value of the week variable less than or equal to 17. Ties are rare because of the overtime rule, but they can occur. In these data there is 1 tie.

The question is: how to handle ties? We want to do this in a way that respects the way the league accounts for ties in the standings. They count as half a win and half a loss. Thus we want to fit a multinomial response model with the following canonical sufficient statistics.

  1. The number of ties.

  2. The number of wins plus half the number of ties for actual home teams (this is the home field advantage statistic, so games at neutral sites do not count here).

  3. For each team, the number of wins plus half the number of ties for that team.

This method of devising a model was recommended by Geyer and Thompson (1992, JRSS-B). It says choose your exponential family (in this case product trinomial, each game has one of three results, win, loss, or tie, except that post-season games cannot have ties, so the model is a product of trinomials and binomials). Then choose the quantities you want to be sufficient statistics, those that make scientific, business, sports (or whatever) sense. And use the canonical linear submodel of the exponential family that has those statistics as its canonical statistics.

The saturated model for the exponential family has log likelihood \[ l(\theta) = \langle y, \theta \rangle - c(\theta) \] where \(y\) is the full data (response vector), \(\theta\) is the corresponding canonical parameter vector, \(\langle\,\cdot\,,\,\cdot\,\rangle\) is the canonical bilinear form placing the canonical statistic space and the canonical parameter space in duality (you can think of it as the inner product on \(\mathbb{R}^d\) if you like, or even as \(y^T \theta\) or \(\theta^T y\) if you like), and \(c\) is the cumulant function.

A canonical linear submodel with model matrix \(M\) introduces a new parameterization \[ \theta = M \beta \] From \[ \langle y, M \beta \rangle = y^T M \beta = (M^T y)^T \beta = \langle M^T y, \beta \rangle \] we see that the log likelihood for \(\beta\) is \[ l(\beta) = \langle M^T y, \beta \rangle - c(M \beta) \] This shows we again have a regular full exponential family (supposing we started with one) having canonical statistic \(M^T y\), canonical parameter \(\beta\), and cumulant function \(\beta \mapsto c(M \beta)\).

Since the canonical statistics of an exponential family are always sufficient, we see that \(M^T y\) is sufficient.

What this gives us is a very different way to think about model matrices. Instead of “interpreting” the rows of \(M\) we interpret the rows of \(M^T\), which are columns of \(M\).
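
As a tiny illustration of that point, here is a hypothetical ordinary (two-index) model matrix, made up just for this paragraph: each column of \(M\) pairs with \(y\) to give one component of \(M^T y\), so interpreting the columns of \(M\) is interpreting the sufficient statistics.

M.toy <- cbind(intercept = 1, treated = c(0, 1, 0, 1))   # hypothetical toy example
y.toy <- c(2, 5, 3, 7)                                   # hypothetical toy response
crossprod(M.toy, y.toy)   # same as t(M.toy) %*% y.toy
# the "intercept" column gives the sum of y.toy, and the "treated" column
# gives the sum of y.toy over the treated cases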

In our problem (Bradley-Terry models with home field advantage and ties) we need for \(M^T y\) to be the sufficient statistics listed above.

So now we need to figure out how to lay out \(M\) and \(y\). As always with multinomials we should think of the multinomial response as a vector response, not a scalar response. So let us think of \(y\) as a matrix with nrow(nfl2019) = 267 rows and 3 columns. Every entry is either zero or one and the entries in each row sum to one (because every game is a win, loss, or tie). The first column has a one if the home team wins (or, if the game is at a neutral site, if the team in the home.team column of the data wins), the second column has a one if the road team wins (or, if the game is at a neutral site, if the team in the road.team column of the data wins), and the third column has a one if the game is tied (indicated in our data by the two teams scoring an equal number of points). As we said before, the whole third column contains only a single 1 because there was only one tie.
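
Here is one way to construct this response matrix in R (a sketch only, using just the columns of nfl2019 shown above; the object name y and the column names are our choices, not anything required).

ngames <- nrow(nfl2019)
y <- matrix(0, nrow = ngames, ncol = 3)
colnames(y) <- c("home.win", "road.win", "tie")
y[ , 1] <- as.numeric(nfl2019$home.score > nfl2019$road.score)
y[ , 2] <- as.numeric(nfl2019$home.score < nfl2019$road.score)
y[ , 3] <- as.numeric(nfl2019$home.score == nfl2019$road.score)
stopifnot(all(rowSums(y) == 1))
sum(y[ , 3])   # number of ties, should be 1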

Because now \(y\) has two indices, the model “matrix” must get three indices. The \(k\)-th element of \(M^T y\) is given by \[ \textstyle \sum_i \sum_j m_{i j k} y_{i j} \]

Now we figure out what \(M\) has to be to give us the desired canonical sufficient statistics.

First, we need the first statistic to be just the number of ties. So we need \[\begin{align*} m_{i 1 1} & = 0 \\ m_{i 2 1} & = 0 \\ m_{i 3 1} & = 1 \end{align*}\] and these hold for all \(i\).

Second, we need the second statistic to be number of wins + half the number of ties for all actual (not just listed as) home teams. So we need \[\begin{align*} m_{i 1 2} & = 1 \\ m_{i 2 2} & = 0 \\ m_{i 3 2} & = 1 / 2 \end{align*}\] for \(i\) such that the game is not at a neutral site, and we need \[ m_{i j 2} = 0 \] for \(i\) such that the game is at a neutral site and for all \(j\).

Third, we need one statistic for each team, namely the number of wins + half the number of ties for that team. So there is one statistic here for each team, say team \(k\) (renumber the teams so the team numbers start at 3, because we have already used \(k = 1\) and \(k = 2\) for the tie statistic and the home field advantage statistic). So we need \[\begin{align*} m_{i 1 k} & = 1 \\ m_{i 2 k} & = 0 \\ m_{i 3 k} & = 1 / 2 \end{align*}\] for \(i\) such that game \(i\) has team \(k\) as the home team, and we need \[\begin{align*} m_{i 1 k} & = 0 \\ m_{i 2 k} & = 1 \\ m_{i 3 k} & = 1 / 2 \end{align*}\] for \(i\) such that game \(i\) has team \(k\) as the road team, and we need \[ m_{i j k} = 0 \] for \(i\) such that game \(i\) does not involve team \(k\) and for all \(j\).

And now I hate to tell you that this only applies to the part of \(M\) that applies to regular season games (which can have ties). In post-season games, the ones with the week variable strictly greater than 17, the rules are different, and no ties can occur. There should be no \(j = 3\) elements of \(y_{i j}\) for those games \(i\), and there should be no \(j = 3\) elements of \(m_{i j k}\) for those games \(i\) and for all coefficient indices \(k\).

So that makes \(M\) a ragged array and \(y\) a ragged matrix. But that is no problem for R: we can put NA values in there if we please. We just have to deal with those NA values properly as we go along.
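
Continuing the sketch above, here is one way to build \(M\) as a three-index array and compute the sufficient statistic vector \(M^T y\). Coefficient 1 is the ties parameter, coefficient 2 is home field advantage, and the remaining coefficients are the team parameters (again, the object names are our choices).

teams <- sort(unique(c(as.character(nfl2019$home.team),
    as.character(nfl2019$road.team))))
ncoef <- 2 + length(teams)
M <- array(0, dim = c(ngames, 3, ncoef))
dimnames(M) <- list(NULL, colnames(y), c("ties", "home", teams))
M[ , 3, 1] <- 1                   # the ties statistic
home <- nfl2019$neutral == 0
M[home, 1, 2] <- 1                # home field advantage statistic,
M[home, 3, 2] <- 1 / 2            # a tie counting as half a win
for (k in seq(along = teams)) {
    i.home <- nfl2019$home.team == teams[k]
    i.road <- nfl2019$road.team == teams[k]
    M[i.home, 1, k + 2] <- 1
    M[i.home, 3, k + 2] <- 1 / 2
    M[i.road, 2, k + 2] <- 1
    M[i.road, 3, k + 2] <- 1 / 2
}
# post-season games cannot tie: make y and M ragged with NA in those cells
post <- nfl2019$week > 17
y[post, 3] <- NA
M[post, 3, ] <- NA
# the sufficient statistic (M^T y)_k = sum_i sum_j m_ijk y_ij,
# with the NA cells simply dropped
tstat <- apply(M * as.vector(y), 3, sum, na.rm = TRUE)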

Now we have to figure out how to write down the probability distribution for the multinomial random vectors that are the rows of \(y\). Since we are not eliminating categories, this is what my 5421 Lecture Notes on Exponential Families, Section 2.4 calls the “try III” parameterization (Section 2.4.3).
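
To make that concrete, here is a sketch (continuing with the objects M, ngames, and ncoef defined above) mapping a coefficient vector \(\beta\) to the canonical parameters \(\theta = M \beta\) and then to the “try III” probabilities: each row of \(\theta\) is exponentiated and renormalized, and the NA cells (the impossible post-season ties) simply drop out.

theta.fun <- function(beta) {
    stopifnot(length(beta) == ncoef)
    # collapse the first two indices of M and matrix multiply, so that
    # theta[i, j] = sum_k M[i, j, k] * beta[k]
    theta <- matrix(M, ncol = ncoef) %*% beta
    dim(theta) <- c(ngames, 3)
    theta
}
prob.fun <- function(beta) {
    p <- exp(theta.fun(beta))
    p / rowSums(p, na.rm = TRUE)
}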

So that takes care of the likelihood for Bayesian inference. How about priors? Since we hope the priors do not matter much we use the same sort of priors as used in the 3701 notes we are following.

Those priors might not make as much sense in this context as they did there, where they were justified by the correspondence with conclusions from “made-up” data. But they still specify functions that can serve as unnormalized priors. We also need a prior for the ties parameter that will look like the prior for the home field advantage parameter.
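
Purely as a placeholder (an assumption of ours, not the priors from the 3701 notes, which are what you should actually adapt), one could use independent normal priors on all the coefficients, including the ties and home field advantage parameters.

# placeholder only (assumption): adapt the priors from the 3701 notes instead
log.prior <- function(beta) sum(dnorm(beta, mean = 0, sd = 10, log = TRUE))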

5 Analysis

5.1 Task 1

Implement the Markov chain. You may use R function metrop in R package mcmc, but you will need to implement a log unnormalized density function.

Run the Markov chain long enough.
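
Here is a minimal sketch of what that might look like, continuing with the objects defined in the sketches above (tstat, theta.fun, log.prior). The starting value, proposal scale, and run lengths are arbitrary choices that you will need to tune.

logl <- function(beta) {
    theta <- theta.fun(beta)
    # cumulant function of the submodel: sum over games of
    # log sum exp over the possible outcomes of that game
    cumulant <- apply(theta, 1, function(th) {
        th <- th[! is.na(th)]
        m <- max(th)
        m + log(sum(exp(th - m)))
    })
    sum(tstat * beta) - sum(cumulant)
}
# note: adding the same constant to every team coefficient does not change
# logl; the proper prior takes care of that flat direction
log.unnorm.post <- function(beta) logl(beta) + log.prior(beta)

library(mcmc)
set.seed(42)
out <- metrop(log.unnorm.post, initial = rep(0, ncoef),
    nbatch = 100, blen = 100, scale = 0.1)
out$accept   # adjust scale until the acceptance rate is roughly 20 to 25 percent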

5.2 Specific Question 1

If the super bowl (the last game, week 21)

subset(nfl2019, week == max(week))
##     road.team home.team road.score home.score week road.seed home.seed conf neutral
## 267     49ers    Chiefs         20         31   21         1         2 <NA>       1

were to be replayed, what would the outcome be? That is, what is the posterior predictive probability of each team winning? The probability that the Chiefs (the actual winner) would win again is a function of the parameters of the model. Integrating out those parameters with respect to the posterior distribution gives the posterior predictive distribution, which is the Bayesian answer to this question.
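
Here is a sketch of that calculation, continuing with the objects from the sketches above. Given \(\beta\), the probability that the Chiefs beat the 49ers at a neutral site (no tie possible, no home field advantage) is \(\exp(\beta_{\text{Chiefs}}) / (\exp(\beta_{\text{Chiefs}}) + \exp(\beta_{\text{49ers}}))\), and the outfun argument of metrop averages that over the posterior.

i.chiefs <- 2 + match("Chiefs", teams)
i.niners <- 2 + match("49ers", teams)
outfun <- function(beta) 1 / (1 + exp(beta[i.niners] - beta[i.chiefs]))
out.sb <- metrop(out, outfun = outfun)   # continue the chain, same tuning
colMeans(out.sb$batch)                   # posterior predictive P(Chiefs win)
apply(out.sb$batch, 2, sd) / sqrt(out.sb$nbatch)   # its MCSE
# the probability that the 49ers win is one minus this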

Our answer to this question will be very different from that of the average sports fan or the average talking head on TV or YouTube. They always say the team that actually won was destined to win. Never mind that they were actually behind by 10 points with six and a half minutes left to play (out of 60 minutes for the whole game) and were actually behind by 3 points with three minutes left to play. They were destined to win.

As statisticians, we say nothing of the sort. The outcome was random, and could easily have gone the other way. We have a statistical model. We calculate what it says. And that’s that. We don’t pay any attention to the opinions of fandom or talking heads.

Note that this is the same calculation we would do if we were predicting the result of the super bowl before it occurred, except that then our posterior distribution would only be based on the games up to and including week 20 because we are predicting the result of week 21 before it happens (so we wouldn’t have that data yet).

5.3 Specific Question 2

If the whole playoffs were replayed, what would the outcome be?

This is a rather complicated question because if different teams win in week 18, then different teams are playing in week 19 and so forth. For this, one does need some understanding of the rules, which I will now explain (enough to do this part). (Actually, an earlier version of this web page did not contain enough information to do this correctly, but we are now adding that information.)

subset(nfl2019, week > 17)
##     road.team home.team road.score home.score week road.seed home.seed conf neutral
## 257     Bills    Texans         19         22   18         5         4  AFC       0
## 258    Titans  Patriots         20         13   18         6         3  AFC       0
## 259   Vikings    Saints         26         20   18         6         3  NFC       0
## 260  Seahawks    Eagles         17          9   18         5         4  NFC       0
## 261   Vikings     49ers         10         27   19         6         1  NFC       0
## 262    Titans    Ravens         28         12   19         6         1  AFC       0
## 263    Texans    Chiefs         31         51   19         4         2  AFC       0
## 264  Seahawks   Packers         23         28   19         5         2  NFC       0
## 265    Titans    Chiefs         24         35   20         6         2  AFC       0
## 266   Packers     49ers         20         37   20         2         1  NFC       0
## 267     49ers    Chiefs         20         31   21         1         2 <NA>       1

That is what actually happened, but to understand what could have happened we have to look at the bracket and the seeds.

In order to match up the names in our dataset with the names in the bracket we need to look at the seeds table (second link above). Each team has two names (locality represented and a nickname, for example, Minnesota Vikings) and our dataset has only nicknames and the bracket has only localities. But the seeds table has both and allows us to match them up.

First look at the bracket (first link above). The columns are weeks (also called rounds, first round, second round, etc. but they also have goofy names specific to American football: wild-card round (first column), division championships (second column), conference championships (third column), super bowl (fourth column)). The winners of the first round games advance to the second round where they play teams that did not have to play in the first round to get to the second round (they were awarded this privilege for doing better in the regular season games). The games the first round winners advance to are indicated by the lines in the bracket. Then the winners of the second round games advance to the third round (again indicated by the lines in the bracket). Then the winners of the third round games advance to the final game and play for the overall championship.

Every playoff game except the last (round four, super bowl) has a home team. The super bowl is played at a neutral site (at least that is the way it has always happened until 2021 when Tampa Bay won in their own stadium). (In the data we are analyzing, the home team for the super bowl did not make the playoffs so no matter what happens in the preceding rounds, the super bowl has no home team).

The home team in each game in the playoff bracket is the higher seeded team where the seeds are given in the seeds table (also in the seeds columns of our dataset). In the first and second rounds the bottom team listed in the game is the home team. For example, in the second round Green Bay would have been the home team regardless of whether their opponent was Philadelphia or Seattle, and San Francisco would have been the home team regardless of whether their opponent was Minnesota or New Orleans (and so forth). In the third round there are 9 possibilities for the teams in each game (any of three teams could come out of the top half of the blue part of the bracket and any of three teams could have come out of the bottom half of the blue part of the bracket) and whatever happens the home team is the higher seeded team (according to the seeds table). For example, if the teams in the blue game of the third round (also called NFC championship) had been Minnesota and Seattle, then Seattle would have been the home team because they were seeded higher. (And, repeating what was said above, in the fourth round there is no home team, so we don’t have to worry about that. It is only the third round where we have to work out the home team using the seeds. In the first and second round the home teams are the ones shown as the bottom team in each game in the bracket.)

Thus to calculate the probability of winning for each team that made the playoffs (all 12 of the teams that appear in the last printout or in the seeds table or in the bracket) one has to consider all the possibilities. For the 4 games in round one there are \(2^4 = 16\) possible outcomes (no ties allowed in the playoffs), and for each of those \(2^4\) outcomes the second round participants are determined. Thus (given one of those \(2^4\) outcomes) there are again \(2^4\) second round outcomes. Thus there are \(2^4 \cdot 2^4 = 2^8\) possible outcomes of the first and second rounds. However, these \(2^8\) possible outcomes do not result in \(2^8\) different possibilities for the third round games. There are only three possibilities for each slot in the third round, for example, in the slot occupied by Green Bay in the bracket, the possibilities were Green Bay, Seattle, and Philadelphia. And we have, for example for this slot, \[\begin{align*} \Pr(&\text{Green Bay to round three}) \\ & = \Pr(\text{Green Bay (home) beats Seattle}) \Pr(\text{Seattle beats Philadelphia (home)}) \\ & + \Pr(\text{Green Bay (home) beats Philadelphia}) \Pr(\text{Philadelphia (home) beats Seattle}) \\ \Pr(&\text{Seattle to round three}) \\ & = \Pr(\text{Seattle beats Green Bay (home)}) \Pr(\text{Seattle beats Philadelphia (home)}) \\ \Pr(&\text{Philadelphia to round three}) \\ & = \Pr(\text{Philadelphia beats Green Bay (home)}) \Pr(\text{Philadelphia (home) beats Seattle}) \end{align*}\] so 4 possible outcomes of games result in only three possible results (who moves on to round three). So then for each of the \(3^2\) possibilities of teams in each of the round three games (with probabilities calculated analogously to those above), we have a resulting team that moves on. And, of course \[\begin{align*} \Pr(&\text{Green Bay to round four}) \\ & = \Pr(\text{Green Bay (home) beats Minnesota}) \Pr(\text{Green Bay to round three}) \Pr(\text{Minnesota to round three}) \\ & + \Pr(\text{Green Bay (home) beats New Orleans}) \Pr(\text{Green Bay to round three}) \Pr(\text{New Orleans to round three}) \\ & + \Pr(\text{Green Bay beats San Francisco (home)}) \Pr(\text{Green Bay to round three}) \Pr(\text{San Francisco to round three}) \end{align*}\] and so forth. Then finally \[\begin{align*} \Pr(&\text{Minnesota wins round four}) \\ & = \Pr(\text{Minnesota beats Tennessee}) \Pr(\text{Minnesota to round four}) \Pr(\text{Tennessee to round four}) \\ & + \Pr(\text{Minnesota beats Buffalo}) \Pr(\text{Minnesota to round four}) \Pr(\text{Buffalo to round four}) \\ & + \Pr(\text{Minnesota beats Houston}) \Pr(\text{Minnesota to round four}) \Pr(\text{Houston to round four}) \\ & + \Pr(\text{Minnesota beats New England}) \Pr(\text{Minnesota to round four}) \Pr(\text{New England to round four}) \\ & + \Pr(\text{Minnesota beats Kansas City}) \Pr(\text{Minnesota to round four}) \Pr(\text{Kansas City to round four}) \\ & + \Pr(\text{Minnesota beats Baltimore}) \Pr(\text{Minnesota to round four}) \Pr(\text{Baltimore to round four}) \end{align*}\] (there are no home teams here) and so forth. And all of this has to be calculated for each of the 12 possible winners (if, that is, we want to do a deterministic calculation of the output function).

Fortunately, we don’t have to do that. We can use the Monte Carlo method, which we are already using to average over the posterior distribution of the parameters (which, the parameters being continuous random variables, has an uncountable infinity of possible outcomes). Instead of summing, we simulate.

  1. Given a realization of the posterior distribution of the parameters (the state of the Markov chain at some iteration), simulate the results of week 18 (based on the probabilities determined by that particular realization of the posterior).

  2. Given those simulated winners of week 18, figure out who plays who and who the home teams are in week 19.

  3. Given the schedule for week 19 decided in the previous step and given the same realization of the posterior distribution of the parameters used in step 1, simulate the results of week 19.

  4. Given those simulated winners of week 19, figure out who plays who and who the home teams are in week 20.

  5. Given the schedule for week 20 decided in the previous step and given the same realization of the posterior distribution of the parameters used in steps 1 and 3, simulate the results of week 20.

  6. Given those simulated winners of week 20, they play each other in week 21 (super bowl) and it doesn’t matter who the home team is considered to be because the game is played at a neutral site. Using the same realization of the posterior distribution of the parameters used in steps 1 and 3 and 5, simulate the results of week 21.

Then record what happened in these simulations somehow. To keep from making things any more complicated than they already are, let us say we are only interested in the posterior predictive distribution of the winner of the super bowl. To answer that Bayesian question you simply record which team is the simulated winner in step 6.

And then you do this over and over again. For each iteration of the Markov chain, you use that realization of the posterior distribution of the parameters to simulate winners in each of the playoff weeks. And then you record which team wins the super bowl.

Over a long run of the Markov chain you will have different teams winning in different iterations. Averaging over all of those iterations gives (a point estimate of) the probabilities of each team winning the super bowl (if the playoffs were redone).
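
Here is one possible sketch of that simulation, continuing with the objects defined above. It reads the seeds off the playoff rows of nfl2019 and implements the bracket as described in this section: in round one seed 6 plays at seed 3 and seed 5 plays at seed 4, in round two seed 1 hosts the winner of the 3 versus 6 game and seed 2 hosts the winner of the 4 versus 5 game (following the lines of the bracket), in round three the higher seed hosts, and the super bowl is at a neutral site. The run length and all object and function names are our own arbitrary choices.

# the 12 playoff teams with their seeds and conferences, read off the data
po <- subset(nfl2019, week > 17 & ! is.na(conf))
seeds <- unique(rbind(
    data.frame(team = as.character(po$home.team), seed = po$home.seed,
        conf = as.character(po$conf), stringsAsFactors = FALSE),
    data.frame(team = as.character(po$road.team), seed = po$road.seed,
        conf = as.character(po$conf), stringsAsFactors = FALSE)))
stopifnot(nrow(seeds) == 12)

# simulate one playoff game (no ties possible), returning the winner's name
sim.game <- function(beta, home, road, neutral = FALSE) {
    # beta is laid out as in the model matrix sketch: beta[2] is home
    # field advantage and beta[2 + k] belongs to teams[k]
    theta.home <- beta[2 + match(home, teams)]
    if (! neutral) theta.home <- theta.home + beta[2]
    theta.road <- beta[2 + match(road, teams)]
    p.home <- 1 / (1 + exp(theta.road - theta.home))
    if (runif(1) < p.home) home else road
}

# simulate one conference's playoffs, returning the conference champion
sim.conf <- function(beta, conference) {
    s <- seeds[seeds$conf == conference, ]
    team <- function(k) s$team[s$seed == k]
    w36 <- sim.game(beta, home = team(3), road = team(6))   # week 18
    w45 <- sim.game(beta, home = team(4), road = team(5))
    d1 <- sim.game(beta, home = team(1), road = w36)        # week 19
    d2 <- sim.game(beta, home = team(2), road = w45)
    ss <- c(s$seed[s$team == d1], s$seed[s$team == d2])     # week 20,
    sim.game(beta, home = c(d1, d2)[which.min(ss)],         # higher seed hosts
        road = c(d1, d2)[which.max(ss)])
}

# simulate the whole playoffs once, returning the super bowl winner
sim.superbowl <- function(beta)
    sim.game(beta, home = sim.conf(beta, "AFC"),
        road = sim.conf(beta, "NFC"), neutral = TRUE)

# continue the chain with blen = 1 so the saved "batch means" are the
# actual states, then simulate one playoff replay per iteration
out.sim <- metrop(out, nbatch = 1e4, blen = 1)
champ <- character(out.sim$nbatch)
for (n in seq(along = champ))
    champ[n] <- sim.superbowl(out.sim$batch[n, ])
playoff.teams <- sort(unique(seeds$team))
zeta <- outer(champ, playoff.teams, "==") + 0
colnames(zeta) <- playoff.teams
colMeans(zeta)   # estimated P(each team wins a replayed super bowl)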

And then we want Monte Carlo standard errors (MCSE) for the whole thing.
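
Continuing the sketch, here is a nonoverlapping batch means calculation of those MCSEs applied to the zeta indicators just simulated (the batch length is an arbitrary choice).

blen <- 100                     # batch length, so 1e4 / 100 = 100 batches
stopifnot(nrow(zeta) %% blen == 0)
bm <- apply(zeta, 2, function(x) colMeans(matrix(x, nrow = blen)))
round(rbind(estimate = colMeans(bm),
    MCSE = apply(bm, 2, sd) / sqrt(nrow(bm))), 4)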

Note that this is the same calculation we would do if we were predicting the result of the super bowl before any of the playoffs had occurred (between weeks 17 and 18), except that then our posterior distribution would only be based on the games up to and including week 17 (because that would be all the data we would have had at that point in time).

Also note that this whole simulation is a Markov chain. The parameter vectors \(\theta_n\) at iteration \(n\) for all \(n\) form a Markov chain, and the random vector, say \(\zeta_n\), which says who won the super bowl (all components are zero except that the component for the winner is one, just like we did for the output function in the 3701 notes) is conditionally independent of the rest of the \(\theta\) chain given \(\theta_n\). This makes the pair \((\theta_n, \zeta_n)\) the state vector of a different Markov chain (having state vector \((\theta, \zeta)\) rather than state vector \(\theta\)). But this is still a reversible Markov chain, so nonoverlapping batch means and initial sequence estimators still work.

5.4 Specific Question 3

This question is so obnoxious that it is not part of the homework assignment. But we can see that we could also address the natural Bayesian question: what are the probabilities of each of the 32 teams winning the super bowl if the whole season were replayed? The games, the home teams, and whether the game was at a neutral site would not change. Then if we simulate the results for the 17 weeks of the regular season (once for each realization of the posterior distribution of the parameters, which is what MCMC gives us), there is an extraordinarily complicated set of rules that determines the seeding of the teams for the playoffs. Then that seeding determines who plays who and who the home teams are in week 18, and then we can proceed like our simulation in the preceding part.

Note that this is the same calculation we would do if we were predicting the result of the super bowl before the whole season had occurred (before week 1), except that then our posterior distribution would not be based on any of the games in these data (because we are assuming they wouldn’t have happened yet). Perhaps we would use the posterior for the previous season to calculate for this season. Or perhaps we would use some other distribution. This is a much iffier question than the ones that are actually assigned.

5.5 Apology

Why are these things so complicated? People like their sports ridiculously complicated! It adds to the fun. That makes it really hard to do sports statistics, though.

Of course, science is hard too.

And, business analytics is hard too.

All real applications are much harder than anything you ever see in purportedly realistic applications that appear in textbooks, journal articles, and lecture notes. Uffda!