Statistics 3701 (Geyer, Fall 2022) Quiz 4

Rules

See the Section about Rules for Quizzes and Homeworks on the General Info page.

Your work handed into Canvas should be an Rmarkdown file with text and code chunks that can be run to produce what you did. We do not take your word for what the output is. We may run it ourselves. But we also want the output.

You may ask questions during the quiz, especially if the wording of a question is confusing or there seems to be an issue with the question, but the instructor will not be giving hints.

You must be in the classroom, Molecular and Cellular Biology 2-120, to take the quiz.

Quizzes must be uploaded by the end of class (1:10). Canvas should actually allow a few minutes after that, but quizzes not uploaded by 1:10 will be marked late. Here is the link for uploading this quiz: https://canvas.umn.edu/courses/330843/assignments/2847451

Quiz 4

Problem 1

As noted above, this problem could not be done during the quiz, but it now can be (and must be) done for homework following the example in the revised Section on Data Scraping of the course notes about data.

Scrape the data from the table(s) in the following web page

https://wcha.com/standings.aspx?path=whockey

And answer the following questions. Your answers must be computed entirely using R operating on that web page. Simply reading the answers yourself gets no credit. You have to tell R how to get the answers. Print your answers so the grader can see them.

Note that the R functions in the CRAN package rvest used for that example read in all items in the table as character strings. You will have to convert them to numbers if you want to use them as numbers. The R function as.numeric will convert character strings to numbers if they are numbers.
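As a minimal sketch of this conversion, the following uses a made-up character vector (the values are hypothetical, not taken from the web page) standing in for a scraped column.

```r
# Made-up character column of the kind a scraped table contains
pts_chr <- c("45", "38", "7")
is.character(pts_chr)            # TRUE, arithmetic on this would fail

# Convert to numbers before doing arithmetic
pts_num <- as.numeric(pts_chr)
is.numeric(pts_num)              # TRUE
sum(pts_num)                     # 90

# Strings that are not numbers become NA (R also gives a warning,
# which is suppressed here only to keep the output clean)
bad <- suppressWarnings(as.numeric(c("12", "n/a")))
is.na(bad)                       # FALSE TRUE
```

Checking for NA after the conversion is a cheap way to find out whether a column contained anything that was not a number.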

Also note that some columns have two or three numbers separated by hyphens (W-L-T for wins-losses-ties, OW-OL for overtime wins-overtime losses). To separate these strings at the hyphens, the R function strsplit is useful. For example, if you have read the table and assigned it to the name foo, then


strsplit(foo[["WCHA Record"]], split="-")

produces a list, each component of which is a vector of 3 numbers: the wins, losses, and ties (but still as character strings).
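One possible way to go from that list to numbers is sketched below. The data frame here is a hypothetical stand-in for the scraped table (the team names and records are made up), and the name wlt is just for illustration.

```r
# Hypothetical stand-in for the scraped table; the records are made up
foo <- data.frame(check.names = FALSE, stringsAsFactors = FALSE,
    "School" = c("A", "B", "C"),
    "WCHA Record" = c("10-2-1", "7-5-1", "3-9-2"))

# List with one character vector of length 3 per team
pieces <- strsplit(foo[["WCHA Record"]], split = "-")

# Convert each component to numeric and stack into a matrix
# with one row per team and columns wins, losses, ties
wlt <- t(vapply(pieces, as.numeric, numeric(3)))
colnames(wlt) <- c("W", "L", "T")
wlt
```

Using vapply rather than sapply makes R complain if some record does not split into exactly three pieces, which is a useful check on scraped data.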

Read the data in this web page. Show the dataframe that you get.
The rules for computing points are discussed below the table in the web page.
All conference games are now worth three points. The points are awarded as follows:
- 3 for a regulation win
- 2 for an overtime win or shootout "win"
- 1 for an overtime or shootout "loss"
- 0 for a regulation loss
This is confusing. In the WCHA Record column, the wins and losses in the W-L-T include the overtime wins and losses in the OW-OL column, but the shootout "wins" and "losses" (in scare quotes) are recorded as ties in the W-L-T. Thus to count points (the WCHA Pts column) we subtract overtime wins and overtime losses from the wins and losses in the W-L-T, and then we count 3 points for each regulation win (what we have after the subtraction), plus 2 points for each overtime win (OW), plus 1 point for each overtime loss (OL), plus 1 point for each tie (which then goes to a shootout), plus 1 point for each SOW (shootout win). So a shootout "win" earns 2 points in total (1 for the tie plus 1 for the SOW) and a shootout "loss" earns 1, in agreement with the rules above. This part of the question is to check this. Do this calculation for each team and check that you get the number in the WCHA Pts column.
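The arithmetic just described can be sketched on a toy data frame. Everything in this data frame is made up for illustration, the column name SOW is an assumption about how the web page labels shootout wins, and split_numeric is a hypothetical helper, not a function from any package.

```r
# Toy stand-in for the scraped table; all values are made up
foo <- data.frame(check.names = FALSE, stringsAsFactors = FALSE,
    "WCHA Record" = c("10-2-1", "7-5-1", "3-9-2"),
    "OW-OL" = c("2-1", "1-2", "0-1"),
    "SOW" = c("1", "0", "1"),
    "WCHA Pts" = c("31", "23", "13"))

# Hypothetical helper: split at hyphens, return numeric matrix with n columns
split_numeric <- function(x, n) {
    t(vapply(strsplit(x, split = "-"), as.numeric, numeric(n)))
}

wlt <- split_numeric(foo[["WCHA Record"]], 3)   # columns W, L, T
owol <- split_numeric(foo[["OW-OL"]], 2)        # columns OW, OL
sow <- as.numeric(foo[["SOW"]])

# Regulation wins are wins minus overtime wins
# (regulation losses are worth 0 points, so they are not needed)
reg_wins <- wlt[, 1] - owol[, 1]

# 3 per regulation win, 2 per OW, 1 per OL, 1 per tie, 1 more per SOW
pts <- 3 * reg_wins + 2 * owol[, 1] + owol[, 2] + wlt[, 3] + sow

# Compare with the points column of the table
all(pts == as.numeric(foo[["WCHA Pts"]]))
```

For the first toy team, for example, this is 3 * 8 + 2 * 2 + 1 + 1 + 1 = 31.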

Problem 2

This problem uses the data read in by


foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p2.csv")

which makes foo a data frame having variables speed (quantitative), state (categorical), color (categorical), and y (nonnegative integer).

Treat y as the response to be predicted by the other three variables. Assume the conditional distribution of this response given the predictors is Poisson. Use the default link (log).

Following the example in Section 3.3 of the course notes about statistical models, fit a GLM that has each of the predictor variables as main effects (no interactions).

Perform tests of statistical hypotheses about whether each of these variables can be dropped from the model without making the model fit the data worse.

Interpret the P-values for these tests. What model do they say is the most parsimonious model that fits the data?
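The general shape of such an analysis is sketched below on simulated data (not the quiz data). The variable names match the problem, but the data frame sim, the simulation truth, and the choice of drop1 with a likelihood ratio test are assumptions made for illustration; the course notes may do the tests differently.

```r
# Simulated data with the same kinds of variables as the problem
set.seed(42)
n <- 200
sim <- data.frame(
    speed = runif(n, 0, 10),
    state = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    color = factor(sample(c("red", "blue"), n, replace = TRUE)))
# The simulation truth depends on speed and state but not on color
sim$y <- rpois(n, exp(0.5 + 0.1 * sim$speed + 0.3 * (sim$state == "b")))

# Poisson GLM, default (log) link, main effects only
gout <- glm(y ~ speed + state + color, family = poisson, data = sim)

# Likelihood ratio test for dropping each term, one at a time
dout <- drop1(gout, test = "LRT")
dout

# The same kind of test done by hand for one variable:
# compare the model without color to the model with it
gout.no.color <- glm(y ~ speed + state, family = poisson, data = sim)
aout <- anova(gout.no.color, gout, test = "Chisq")
aout
```

A categorical predictor with more than two levels contributes several coefficients, so a test that drops the whole term (as here) is not the same as reading single coefficient P-values from summary(gout).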

Problem 3

This problem uses the data read in by


foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p3.csv")

which makes foo a data frame having variables x and y, both quantitative.

Treat y as the response to be predicted by x.

Following the example in Section 3.4.2.3 of the course notes about statistical models, fit a GAM that assumes the conditional mean of y given x is a smooth function of x (but makes no parametric assumptions about this smooth function).

On a scatter plot of the data, add lines that are the lower and upper end points of 95% confidence intervals for the mean of y given x for each value of x. As in the example in the course notes, do not adjust these intervals to obtain simultaneous coverage.

This is the first question that asks for a plot. For this question, upload not only your R code but also the plot as a PDF file called q4p3.pdf.

Also give numeric 95% confidence intervals for the conditional mean of y given x for the x values 0, 20, 40, 60, 80, 100.
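The pieces of this problem can be sketched on simulated data (not the quiz data) as follows. This assumes the R package mgcv, which is a recommended package distributed with R. The data frame sim, the choice of smoother basis, and the use of qnorm(0.975) as the critical value are assumptions made for illustration and may differ from what the course notes do.

```r
library(mgcv)

# Simulated data: a smooth mean function plus noise
set.seed(42)
sim <- data.frame(x = seq(0, 100, length.out = 150))
sim$y <- sin(sim$x / 15) + rnorm(nrow(sim), sd = 0.3)

# GAM with a smooth function of x (cubic regression spline basis)
gout <- gam(y ~ s(x, bs = "cr"), data = sim)

# Pointwise (not simultaneous) 95% intervals: fit +/- crit * std. error
crit <- qnorm(0.975)
xgrid <- data.frame(x = seq(0, 100, length.out = 101))
pout <- predict(gout, newdata = xgrid, se.fit = TRUE)

# Scatter plot with lower and upper interval end points, saved as PDF
pdf("q4p3.pdf")
plot(sim$x, sim$y, xlab = "x", ylab = "y")
lines(xgrid$x, pout$fit - crit * pout$se.fit, lty = "dashed")
lines(xgrid$x, pout$fit + crit * pout$se.fit, lty = "dashed")
dev.off()

# Numeric intervals at x = 0, 20, 40, 60, 80, 100
xnew <- data.frame(x = seq(0, 100, by = 20))
pnew <- predict(gout, newdata = xnew, se.fit = TRUE)
ci <- cbind(x = xnew$x,
    lower = as.numeric(pnew$fit - crit * pnew$se.fit),
    upper = as.numeric(pnew$fit + crit * pnew$se.fit))
ci
```

The intervals are for the conditional mean at each x separately, so the dashed lines are not a band with 95% simultaneous coverage.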