Rules

See the Section about Rules for Quizzes and Homeworks on the General Info page.

Your work handed in to Canvas should be an RMarkdown file with text and code chunks that can be run to produce what you did. We do not take your word for what the output is. We may run it ourselves. But we also want the output.

Homeworks must be uploaded before midnight on the day they are due. Here is the link for uploading this homework: https://canvas.umn.edu/courses/330843/assignments/2847452.

Each homework includes the preceding quiz. You may either redo the quiz questions for homework or not redo them if you are satisfied with your quiz answers. In either case the quiz questions also count as homework questions (so quiz questions count twice, once on the quiz and once on the homework, whether redone or not). If you don't submit anything for problems 1–3 (the quiz questions), then we assume you liked the answers you already submitted.

About Problem 1: During the quiz it was discovered that R package htmltab had been removed from CRAN the very day of the quiz. So we could not do the problem. The notes about web scraping have been revised to use another package, so if you follow the revised notes you can now do the problem. Thus everybody must redo Problem 1 for this homework.

Quiz 4

Problem 1

As noted above, this problem could not be done during the quiz, but it now can be (and must be) done for homework following the example in the revised Section on Data Scraping of the course notes about data.

Scrape the data from the table(s) in the following web page

And answer the following questions. Your answers must be computed entirely using R operating on that web page. Simply reading the answers yourself gets no credit. You have to tell R how to get the answers. Print your answers so the grader can see them.

Note that the R functions in the CRAN package rvest used for that example read in all items in the table as character strings. You will have to convert them to numbers if you want to use them as numbers. The R function as.numeric will convert character strings to numbers if they are numbers.

Also note that some columns have two or three numbers separated by hyphens (W-L-T for wins-losses-ties, OW-OL for overtime wins-overtime losses). To separate these strings at the hyphens, the R function strsplit is useful. For example, if you have read the table and assigned it to the name foo, then


strsplit(foo[["WCHA Record"]], split="-")

produces a list, each component of which is the vector of 3 numbers: the wins, losses, and ties (but still as character strings).
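
Putting strsplit together with as.numeric, the conversion might be sketched like this (a sketch, assuming foo and the column name "WCHA Record" as above):

```r
# split each "W-L-T" string at the hyphens, giving a list of
# character vectors, then convert to numbers; sapply simplifies
# the result to a 3-row matrix, one column per team
bar <- sapply(strsplit(foo[["WCHA Record"]], split = "-"), as.numeric)
wins <- bar[1, ]
losses <- bar[2, ]
ties <- bar[3, ]
```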

  1. Read the data in this web page. Show the data frame that you get.
  2. The rules for computing points are discussed below the table in the web page.
    All conference games are now worth three points. The points are awarded as follows:
    • 3 for a regulation win
    • 2 for an overtime win or shootout "win"
    • 1 for an overtime or shootout "loss"
    • 0 for a regulation loss
    This is confusing: in the WCHA Record column, the wins and losses in the W-L-T include the overtime wins and losses in the OW-OL column, but the shootout "wins" and "losses" (in scare quotes) are recorded as ties in the W-L-T. Thus to count points (the WCHA Pts column), we subtract overtime wins and overtime losses from the wins and losses in the W-L-T, and then count 3 points for each regulation win (what we have after the subtraction), plus 2 points for each overtime win (OW), plus 1 point for each overtime loss (OL), plus 1 point for each tie (which then goes to a shootout), plus 1 point for each shootout win (SOW). This part of the question is to check this: do this calculation for each team and check that you get the number in the WCHA Pts column.
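
The check described above might be sketched as follows. The column names "OW-OL", "SOW", and "WCHA Pts" here are guesses at what the scraped table contains; substitute whatever names your data frame actually has.

```r
# W-L-T and OW-OL as 3-row and 2-row numeric matrices, one column per team
wlt <- sapply(strsplit(foo[["WCHA Record"]], split = "-"), as.numeric)
owol <- sapply(strsplit(foo[["OW-OL"]], split = "-"), as.numeric)
reg.wins <- wlt[1, ] - owol[1, ]  # regulation wins
ties <- wlt[3, ]                  # ties, i.e., games that went to a shootout
sow <- as.numeric(foo[["SOW"]])   # shootout "wins" (assumed separate column)
pts <- 3 * reg.wins + 2 * owol[1, ] + owol[2, ] + ties + sow
all(pts == as.numeric(foo[["WCHA Pts"]]))
```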

Problem 2

This problem uses the data read in by


foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p2.csv")

which makes foo a data frame having variables speed (quantitative), state (categorical), color (categorical), and y (nonnegative integer).

Treat y as the response to be predicted by the other three variables. Assume the conditional distribution of this response given the predictors is Poisson. Use the default link (log).

Following the example in Section 3.3 of the course notes about statistical models, fit a GLM that has each of the predictor variables as main effects (no interactions).

Perform tests of statistical hypotheses about whether each of these variables can be dropped from the model without making the model fit the data worse.

Interpret the P-values for these tests. What model do they say is the most parsimonious model that fits the data?
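
A minimal sketch of the sort of code this calls for (the formula puts in the three main effects; drop1 does a likelihood ratio test for dropping each one separately):

```r
foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p2.csv")
# Poisson GLM with main effects only; log link is the default for poisson
gout <- glm(y ~ speed + state + color, family = poisson, data = foo)
summary(gout)
# likelihood ratio tests for dropping each predictor
drop1(gout, test = "LRT")
```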

Problem 3

This problem uses the data read in by


foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p3.csv")

which makes foo a data frame having variables x and y both quantitative.

Treat y as the response to be predicted by x.

Following the example in Section 3.4.2.3 of the course notes about statistical models, fit a GAM that assumes the conditional mean of y given x is a smooth function (but makes no parametric assumptions about this smooth function).

On a scatter plot of the data, add lines that are the lower and upper end points of 95% confidence intervals for the mean of y given x for each value of x. As in the example in the course notes, do not adjust these intervals to obtain simultaneous coverage.

This is the first question that asks for a plot. For this question, upload not only your R code but also the plot, as a PDF file called q4p3.pdf.

Also give numeric 95% confidence intervals for the conditional mean of y given x for the x values 0, 20, 40, 60, 80, 100.
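
Assuming the example in the notes uses R function gam in CRAN package mgcv (an assumption; follow whatever the notes actually use), a sketch of the whole problem is:

```r
library(mgcv)
foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p3.csv")
# smooth (nonparametric) regression function for the conditional mean
gout <- gam(y ~ s(x), data = foo)
# predictions with standard errors along a fine grid of x values
xx <- seq(min(foo$x), max(foo$x), length = 401)
pout <- predict(gout, newdata = data.frame(x = xx), se.fit = TRUE)
crit <- qnorm(0.975)
plot(foo$x, foo$y, xlab = "x", ylab = "y")
lines(xx, pout$fit - crit * pout$se.fit, lty = "dashed")
lines(xx, pout$fit + crit * pout$se.fit, lty = "dashed")
# numeric intervals at the requested x values
pnew <- predict(gout, newdata = data.frame(x = seq(0, 100, 20)), se.fit = TRUE)
cbind(lower = pnew$fit - crit * pnew$se.fit,
      upper = pnew$fit + crit * pnew$se.fit)
```

To get the PDF file, wrap the plotting commands in pdf("q4p3.pdf") and dev.off().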

Homework 4

Problem 4


foo <- available.packages(repos = "https://cloud.r-project.org")

(this may take a few seconds) gets the packages on CRAN that are available in the sense that (quoting from the help for R function available.packages)
By default, the return value includes only packages whose version and OS requirements are met by the running version of R, and only gives information on the latest versions of packages.

If you look at some of these data


class(foo)
head(foo)

you see that foo is a matrix with informative column names. Each row is a CRAN package, and the column named "Package" gives the package name.

Using these data, answer the following questions.

  1. Every package has a field NeedsCompilation that is "yes" if the package contains C, C++, or Fortran code that is compiled and called from R (many R functions work this way). Produce a vector of all values of the field NeedsCompilation in all packages. How many packages need compilation? What proportion of packages need compilation?
  2. Every package has a field License that specifies which licenses among the licenses on the web page https://www.r-project.org/Licenses/ the package is licensed under. Produce a vector of all the packages that are licensed under some version of the GPL (the most common license). Note that AGPL and LGPL don't count: your answer should not include packages that have only these and not also GPL in their license field. The way to find stuff in character strings is the R function grep, and the way to find complicated matches is to have the match string be a regular expression, which is documented on the R help page ?regex. There we see that \b matches word boundaries, so \bGPL should match GPL but not AGPL or LGPL. And that is correct, but in R strings a backslash is an escape character, so to put this regular expression in an R string (as we must do to hand it to the grep function) we have to escape the backslash (each \\ puts a single backslash character in the string). Thus the pattern to match is "\\bGPL" when put in an R character string. How many packages are licensed under the GPL? What proportion of packages are licensed under the GPL?
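
Both parts might be sketched like this (a sketch; note that NeedsCompilation can be NA for some packages, so you may need to decide how to treat those):

```r
foo <- available.packages(repos = "https://cloud.r-project.org")
# part 1: how many and what proportion need compilation
nc <- foo[ , "NeedsCompilation"]
sum(nc == "yes", na.rm = TRUE)
mean(nc == "yes", na.rm = TRUE)
# part 2: packages whose License field matches GPL at a word boundary,
# which excludes AGPL and LGPL
gpl <- grep("\\bGPL", foo[ , "License"])
foo[gpl, "Package"]
length(gpl)
length(gpl) / nrow(foo)
```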

Problem 5

This problem is also about CRAN.

It uses the data read in by


foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p5.csv")

which makes foo a data frame having five variables,

Do the following.

  1. Turn the variable r.version into something that can be compared so that earlier versions of R come first. Alphabetical order (the default for character strings) does not work. And these strings cannot be converted to numbers by R function as.numeric (because no number has two decimal points). Hence you probably want to use major, minor, and patchlev, which are numeric.

    Hint: R function order works on multiple vectors.

  2. How many CRAN packages do not specify any version of R? Same question except what fraction instead of how many?
  3. Among the CRAN packages that do specify some versions of R, what are the quantiles, that is, for what version numbers at least 75%, at least 50%, and at least 25% require that version or higher (and at least 25%, at least 50%, and at least 75% or lower, respectively)? If this is not clear, then these theory notes may help and also these notes which discuss R function quantile.
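
A sketch of one way to do all three parts. It assumes packages that do not specify a version of R have NA for major, minor, and patchlev; check what the data actually have and adjust accordingly:

```r
foo <- read.csv("https://www.stat.umn.edu/geyer/3701/data/2022/q4p5.csv")
# part 1: sort versions numerically, earliest first
# (order breaks ties in major by minor, then by patchlev)
foo <- foo[with(foo, order(major, minor, patchlev)), ]
# part 2: packages specifying no version of R (assumed NA here)
none <- is.na(foo$major)
sum(none)
mean(none)
# part 3: quantiles among packages that do specify a version;
# the rows are already sorted, so index into them
# (the notes on R function quantile discuss other definitions)
bar <- foo$r.version[! none]
bar[ceiling(c(0.25, 0.50, 0.75) * length(bar))]
```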

Problem 6

The data for this problem has two sources.

The first is in a CRAN package. To get it do


library(nycflights13)
data(flights)

and the dataset has the same name as in the data statement; help(flights) will show its description.

The second source is the web page https://www.nationsonline.org/oneworld/IATA_Codes/airport_code_list.htm, but the HTML in that table is a mess because whoever did it did not know how to write good HTML tables (it has a different table for each letter).

So we reformatted its data into one good table. To get it do


airp <- read.table("https://www.stat.umn.edu/geyer/3701/data/airport-codes.txt", header = TRUE)

This reads in a data frame with the same three columns and the same data as the web page referenced above.

R tibble flights has variables origin and dest for the origin of a flight (one of the New York City airports) and its destination (some other airport). The values of these variables are IATA three-letter airport codes (like MSP for Minneapolis-St. Paul). The R data frame airp gives the actual names for these, for example


with(airp, City.Airport[IATA.Code == "MSP"])

and (all airports in Minnesota, assuming they give the state for all airports in the USA, not just for these)

subset(airp, grepl("MN", airp$City.Airport))

Answer the following (all your answers must be calculated by R, none just by you looking at the data and figuring this out).
  1. What are the IATA codes and the corresponding actual names of the New York City airports that are origins of flights in the flights database?
  2. How many flights go from each of the New York City airports to MSP?
  3. What fraction of flights go from each of the New York City airports to MSP?
  4. Are there any flights from any of the New York City airports to any Minnesota airport other than MSP?
  5. How many flights from each of the New York City airports are international (go to some airport not in the USA)?
  6. What fraction of flights from each of the New York City airports are international?

Hint: If you want to use R package dplyr, then the examples in the course notes should help. If you want to use base R, then R function split should help.
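
In base R, the first three questions might be sketched like this (the column name IATA.Code is taken from the examples above):

```r
library(nycflights13)
data(flights)
airp <- read.table("https://www.stat.umn.edu/geyer/3701/data/airport-codes.txt",
    header = TRUE)
# question 1: origins of flights and their actual names
orig <- unique(flights$origin)
subset(airp, IATA.Code %in% orig)
# questions 2 and 3: counts and fractions of flights to MSP by origin
counts <- table(flights$origin)
to.msp <- subset(flights, dest == "MSP")
# use factor with explicit levels so the two tables line up
msp.counts <- table(factor(to.msp$origin, levels = names(counts)))
msp.counts
msp.counts / counts
```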