Rules
See the Section about Rules for Quizzes and Homeworks on the General Info page.
Your work handed in to Moodle should be a plain text file of R commands and comments that can be run to reproduce what you did. We do not take your word for what the output is; we run it ourselves.
Note: Plain text specifically excludes Microsoft Word native format (extension .docx). If you have to use Word as your text editor, then use Save As and choose the format Plain Text (.txt) or something like that. Then upload the saved plain text file.
Note: Plain text specifically excludes PDF (Adobe Portable Document Format) (extension .pdf). If you use Sweave, knitr, or R Markdown, upload the source (extension .Rnw or .Rmd), not PDF or any other kind of output.
If you have questions about the quiz, ask them in the Moodle forum for this quiz. Here is the link for that https://ay16.moodle.umn.edu/mod/forum/view.php?id=1310928.
You must be in the classroom, Armory 202, while taking the quiz.
Quizzes must be uploaded by the end of class (1:10). Moodle actually allows a few minutes after that. Here is the link for uploading the quiz https://ay16.moodle.umn.edu/mod/assign/view.php?id=1310947.
Homeworks must be uploaded before midnight on the day they are due. Here is the link for uploading the homework https://ay16.moodle.umn.edu/mod/assign/view.php?id=1310954.
Quiz 4
Problem 1
Scrape the data from the table(s) in the following web page, following the example in Section 4 of the course notes about data, and answer the following questions. Your answers must be computed entirely using R operating on that web page. Simply reading the answers off the page yourself gets no credit; you have to tell R how to get the answers. Print your answers so the grader can see them.
Note that the R function readHTMLTable in the CRAN package XML that was used for that example reads in all items in the table as character strings. You will have to convert them to numbers if you want to use them as numbers. The R function as.numeric will convert character strings to numbers if they are numbers.
- Read the data in this web page, converting numeric columns in the tables to type "numeric".
- In the conference a win counts 3 points, a loss zero points, and a tie counts either 1 or 2 points depending on a shootout. Verify that the points are calculated correctly (SOW stands for shootout wins).
- Outside the conference, shootout wins don't count. A win counts 2 points and a tie one point. What would the points be if they were counted for overall records? Associate team names with the numbers you calculate so we can see which is which.
- Since the teams did not play the same number of non-conference games, adjust the numbers calculated in part (c) by dividing by games played.
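As a sketch of the character-to-numeric conversion mentioned above (the data frame tab here is made up for illustration; the real one would come from readHTMLTable):

```r
# Hypothetical stand-in for one table as returned by readHTMLTable:
# every column is a character string.
tab <- data.frame(Team = c("A", "B"), W = c("10", "7"), L = c("2", "5"),
    stringsAsFactors = FALSE)
# convert the columns that should be numeric
tab$W <- as.numeric(tab$W)
tab$L <- as.numeric(tab$L)
# now arithmetic works, e.g., 3 points per win
points <- 3 * tab$W
```

The same idea applies column by column to whatever numeric columns the scraped table has.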
Problem 2
This problem uses the data read in by

foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q4p2.csv", stringsAsFactors = FALSE)

which makes foo a data frame having variables speed (quantitative), state (categorical), color (categorical), and y (zero-or-one). Treat y as the response to be predicted by the other three variables.
Following the example in Section 3.3 of the course notes about statistical models, fit a GLM that has each of the predictor variables as main effects (no interactions).
Perform tests of statistical hypotheses about whether each of these variables can be dropped from the model without making the model fit the data worse.
Interpret the P-values for these tests. What model do they say is the most parsimonious model that fits the data?
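A sketch of the fit-and-test pattern on made-up data of the same shape as q4p2.csv (the real data come from the URL above; the logistic family is an assumption based on y being zero-or-one):

```r
# simulated data with the same variables as the quiz data
set.seed(42)
d <- data.frame(
    speed = runif(100, 0, 100),
    state = sample(c("MN", "WI"), 100, replace = TRUE),
    color = sample(c("red", "blue"), 100, replace = TRUE),
    y = rbinom(100, 1, 0.5))
# main-effects-only GLM for a zero-or-one response
gout <- glm(y ~ speed + state + color, data = d, family = binomial)
# likelihood ratio test for dropping each variable from the model
drop1(gout, test = "LRT")
```

Each row of the drop1 output gives the test of the model without that variable against the full model.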
Problem 3
This problem uses the data read in by

foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q4p3.csv")

which makes foo a data frame having variables x and y, both quantitative. Treat y as the response to be predicted by x.
Following the example in Section 3.4.2.3 of the course notes about statistical models, fit a GAM that assumes the conditional mean of y given x is a smooth function (with no parametric assumptions about this smooth function).

On a scatter plot of the data, add lines that are the lower and upper end points of 95% confidence intervals for the mean of y given x for each value of x. As in the example in the course notes, do not adjust these intervals to obtain simultaneous coverage.
This is the first question that asks for a plot. For this question, upload not only your R code but also the plot as a PDF file called q4p3.pdf.
Also give numeric 95% confidence intervals for the conditional mean of y given x for the x values 0, 20, 40, 60, 80, 100.
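A sketch of the GAM fit and pointwise intervals on simulated data (the real data come from q4p3.csv; that the course example uses mgcv::gam with s(x) is an assumption here):

```r
library(mgcv)  # recommended package, ships with R
set.seed(42)
x <- seq(0, 100, length = 200)
y <- sin(x / 20) + rnorm(200, sd = 0.3)
# smooth, nonparametric conditional mean of y given x
gout <- gam(y ~ s(x))
# pointwise (not simultaneous) 95% confidence intervals for the mean
pout <- predict(gout, newdata = data.frame(x = x), se.fit = TRUE)
low <- pout$fit - qnorm(0.975) * pout$se.fit
hig <- pout$fit + qnorm(0.975) * pout$se.fit
plot(x, y)
lines(x, low, lty = 2)
lines(x, hig, lty = 2)
# numeric intervals at specific x values
predict(gout, newdata = data.frame(x = c(0, 20, 40, 60, 80, 100)),
    se.fit = TRUE)
```

The fit plus or minus qnorm(0.975) standard errors gives the unadjusted pointwise intervals the problem asks for.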
Homework 4
Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.
Problem 4
This is a problem about JSON, but like the examples in Section 5 of the course notes about data, we don't actually deal with JSON. The R function fromJSON in the CRAN package jsonlite returns R data structures, which are the only thing we have to deal with.

Follow the example in Section 5.2 of the course notes about data to read in some data about CRAN.

library(jsonlite)
foo <- fromJSON("http://crandb.r-pkg.org/-/latest")

(this may take a few seconds). If you look at some of these data

head(foo)

you see that foo is a list, each component of foo has a name that is the name of a CRAN package, and each component of foo is itself a list, and the names of the components of that list correspond to the names of fields of the DESCRIPTION file.
The authoritative reference for what these data are about (contents of the DESCRIPTION file in CRAN packages) is Section 1.1.1 of Writing R Extensions. Information about the License field of this file is in Section 1.1.2 of Writing R Extensions.
You probably do not need to look at these
references to do this problem, but they have been provided in case you think
you need to.
Using these data, answer the following questions.
- The number of fields in the DESCRIPTION file is different for different packages (some fields are mandatory, others are optional). Produce a vector of all field names in all packages. How many unique field names are there?
- Every package has a field NeedsCompilation that is "yes" if the package contains C, C++, or Fortran code that is compiled and called from R (many R functions work this way). Produce a vector of all values of the field NeedsCompilation in all packages. How many packages need compilation? What proportion of packages need compilation?
- Every package has a field License that specifies which licenses among the licenses on the web page https://www.r-project.org/Licenses/ the package is licensed under. Produce a vector of all the packages that are licensed under some version of the GPL (the most common license). Note that AGPL and LGPL don't count, and your answer should not include packages that have only these and not also GPL in their license field. The way to find stuff in character strings is the R function grep, and the way to find complicated matches is to have the match string be a regular expression, which is documented on the R help page ?regex. There we see that \b matches word boundaries, so \bGPL\b should match GPL but not AGPL or LGPL. And that is correct, but in R strings a backslash followed by another character represents a single character, so to put this regular expression in an R string (as we must do to hand it to the grep function) we have to escape the backslashes (each \\ puts a single character \ in the string). Thus the pattern to match is "\\bGPL\\b" when put in an R character string. How many packages are licensed under the GPL? What proportion of packages are licensed under the GPL?
- Some packages have a field Depends that says which version of R itself and which versions of other packages this package depends on (and won't work unless they are present). This field is optional. What is the structure of the Depends component of items of foo that have such a component?
  - Produce a list of the names of packages on which each package depends. Do not include "R" in this list. Do not include version information in this list.
  - Produce a vector of the names of packages on which any other package depends, repeating the name for each dependency (so we can count how many times any package is in the Depends field of another package).
  - Produce a table of counts of how many times each package that appears in some Depends field does so. (Hint: the R function table is useful here.)
  - Reorder your table so it is in decreasing order of the counts.
  - Notice that the packages that have the highest counts are core or recommended packages, which are in

    rcore <- c("base", "compiler", "datasets", "graphics", "grDevices",
        "grid", "methods", "parallel", "splines", "stats", "stats4",
        "tcltk", "tools", "translations", "utils", "boot", "class",
        "cluster", "codetools", "foreign", "KernSmooth", "lattice",
        "MASS", "Matrix", "mgcv", "nlme", "nnet", "rpart", "spatial",
        "survival")

    Eliminate these packages from your table, and again produce a table reordered so it is in decreasing order of the counts. I found the R function setdiff useful here.
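A sketch of the list manipulation and the GPL regular expression on a toy list of the same shape as foo (the package names and fields here are made up; the real foo comes from the CRAN database above):

```r
# hypothetical stand-in for foo: a named list of named lists
foo <- list(
    pkgA = list(License = "GPL-2", NeedsCompilation = "yes"),
    pkgB = list(License = "LGPL-3", NeedsCompilation = "no"),
    pkgC = list(License = "GPL (>= 2)", NeedsCompilation = "no"))
# vector of all field names in all packages
fields <- unlist(lapply(foo, names))
length(unique(fields))
# vector of all licenses, then packages matching GPL but not AGPL/LGPL
lic <- sapply(foo, function(p) p$License)
gpl <- names(foo)[grep("\\bGPL\\b", lic)]
```

Here \b keeps LGPL-3 from matching, because there is no word boundary between the L and the G.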
Problem 5
This problem is about SQL databases. It starts off following Section 6.1 of the course notes about data, but this problem is more complicated than the example in the notes and requires more SQL.
Set up a database by doing
load(url("http://www.stat.umn.edu/geyer/s17/3701/data/q4p5.rda"))
library(DBI)
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "depends", d$depends)
dbWriteTable(mydb, "imports", d$imports)
dbWriteTable(mydb, "suggests", d$suggests)
dbWriteTable(mydb, "linking", d$linking)
rm(d)
ls()
This requires that you have the CRAN packages DBI and RSQLite both installed. The rm command removes the R object that had contained the data, as the ls command shows. There is nothing left but the database connection.

dbListTables(mydb)

shows there are four tables, and, for example,

dbGetQuery(mydb, "SELECT * FROM depends LIMIT 10")

(which is sort of the equivalent of the R function head) shows what this table is about. It is the same data as in the preceding problem except that all dependencies on "R" or any of the core or recommended packages have already been removed.
All four tables in the database have the same structure (the same field names and the same kind of field entries: all names of CRAN packages). The tables only differ in which field of the DESCRIPTION file of the CRAN packages the data come from. The fields are Depends, Imports, Suggests, and LinkingTo.
We want to do more or less what we did in the last problem, except that we want to do as much as possible using SQL, and we want to use the combined data from all four tables. This requires SQL not covered in the notes. If you just cannot do it with SQL, then do it using R, but you have to start with the commands above and cannot re-load the object d from the URL used above. So you have to at least use SQL to get the data out of the database.
We want to do the following.
- Extract the packto columns of each table, combining them into a new table. An SQL statement that starts CREATE TABLE temp AS SELECT and continues with the rest of a SELECT statement creates a new table named temp that is the table that results from the SELECT statement. For more information see this web page about CREATE TABLE (the AS part is about halfway down the page). The SQL UNION ALL operator can be used to combine SELECT statements, putting all the results in one result. UNION removes duplicates; UNION ALL does not, which is what we want here, because we want to count the duplicates. For more information see this web page about UNION and UNION ALL.
- It is probably easiest to just get this table and do the rest in R, but, if you want to continue with SQL, the SQL statement

  SELECT packto, COUNT(packto) FROM temp GROUP BY packto

  does the equivalent of the R function table, as can be seen by executing it. For more information see this web page about GROUP BY (an example using COUNT with GROUP BY is about halfway down the page). However, the weird name of the second column of this table befuddled me as to how I could sort on it. So I did

  SELECT packto, COUNT(packto) AS packcount FROM temp GROUP BY packto

  where the AS gives the count a new name. For more information see this web page about AS.
- Now we want to sort these results. The SQL for that uses an ORDER BY clause. For more information see this web page about ORDER BY. Note that the DESC modifier asks for a sort in descending order.
- But this isn't the last thing we want to do. We only want to look at the CRAN packages that have at least 100 packages depending on them. The SQL syntax for that is to add

  WHERE packcount >= 100

  For more information see this web page about WHERE. But no matter what I tried I could not get all of this to work in one query. So I created yet another table that had the counting done and then did the sort and extraction of the counts over 100 in another statement.
Problem 6
This problem is about smoothing.
The R command

foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q4p3.csv")

makes foo a data frame having variables x and y, both quantitative. This is the same data as for Problem 3 above. As there, we treat y as the response to be predicted by x.
The difference is that we are going to try kernel smoothing (a method not described in the course notes). Use the R functions locpoly and dpill in the R package KernSmooth, which is a recommended package that comes with every R installation, to fit a smooth to these data (locpoly does the smoothing and dpill does the bandwidth selection, i.e., choosing the right amount of smoothness). You have to read the help pages and follow the examples to do this problem.
Like Problem 3 above, this problem also needs a plot to show your solution. Do a scatter plot with the estimated regression function superimposed. For this question, upload not only your R code but also the plot as a PDF file called q4p6.pdf.
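A sketch of the kernel-smoothing workflow on simulated data (the real data come from q4p3.csv; the exact tuning choices here are illustrative, not prescribed by the problem):

```r
library(KernSmooth)  # recommended package, ships with R
set.seed(42)
x <- runif(200, 0, 100)
y <- sin(x / 20) + rnorm(200, sd = 0.3)
h <- dpill(x, y)                      # plug-in bandwidth selection
fit <- locpoly(x, y, bandwidth = h)   # local polynomial kernel smooth
# scatter plot with the estimated regression function superimposed
pdf("q4p6.pdf")
plot(x, y)
lines(fit$x, fit$y)
dev.off()
```

locpoly returns a list with components x (a grid of points) and y (the estimated regression function on that grid), which is what lines wants.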