Statistics 3701 (Geyer, Spring 2017) Homework 1

Rules

See the Section about Rules for Quizzes and Homeworks on the General Info page.

Your work handed into Moodle should be a plain text file with R commands and comments that can be run to produce what you did. We do not take your word for what the output is. We run it ourselves.

Note: Plain text specifically excludes Microsoft Word native format (extension .docx). If you have to use Word as your text editor, then use Save As and choose the format Text (.txt) or something like that. Then upload the saved plain text file.

If you have questions about the quiz, ask them in the Moodle forum for this quiz. Here is the link for that: https://ay16.moodle.umn.edu/mod/forum/view.php?id=1221096.

On future assignments you can use knitr or rmarkdown after we have talked about it. But avoid that on this assignment.

Quizzes must be uploaded by the end of class (1:10). Uploading should actually be allowed for a few minutes after that. Here is the link for uploading the quiz: https://ay16.moodle.umn.edu/mod/assign/view.php?id=1221097.

Homeworks must be uploaded before midnight the day they are due. Here is the link for uploading the homework: https://ay16.moodle.umn.edu/mod/assign/view.php?id=1221099.

Quiz 1

Problem 1

Write an R function that, given a numeric vector x, calculates its mean, population variance, and population standard deviation. That is, if x_i are the components of x and n is the length of x, then the mean is

μ = (1 / n) ∑_{i = 1}ⁿ x_i

and with μ given by the above the population variance is given by

σ² = (1 / n) ∑_{i = 1}ⁿ (x_i − μ)²

and with σ² given by the above the population standard deviation is σ (the square root of the variance).
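As a quick check of the formulas, here is a small worked example on a made-up vector (the values 1, 2, 3, 4 are an assumption for illustration, not the assignment data). It only evaluates the sums step by step; packaging the calculation as a function is left to the problem.

```r
# made-up data for illustration only (not the assignment data)
x <- c(1, 2, 3, 4)
n <- length(x)

# mean: (1 / n) times the sum of the x_i
mu <- sum(x) / n

# population variance: divide by n (not by n - 1)
sigsq <- sum((x - mu)^2) / n

# population standard deviation: square root of the variance
sigma <- sqrt(sigsq)

print(c(mu, sigsq, sigma))
```

For these values the mean is 2.5 and the population variance is 1.25.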

Do not use the R functions mean, var, or sd. You may use the R function sum or any other R function in the R core (what is available without using the R function library to attach a package).

Your function should return a list with three components named mean, var, and sd, which are the three things you calculated.
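If returning a list with named components is unfamiliar, the following sketch shows only the shape of such a return value. The function range_summary and its component names are hypothetical and have nothing to do with the quantities asked for here.

```r
# hypothetical helper, unrelated to the assignment:
# it only illustrates returning a list with named components
range_summary <- function(x) {
    lo <- min(x)
    hi <- max(x)
    list(min = lo, max = hi, width = hi - lo)
}

out <- range_summary(c(3, 1, 4, 1, 5))
print(names(out))   # the component names: "min" "max" "width"
print(out$width)    # components are extracted with the $ operator
```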

For this problem you do not have to worry about GIEMO (garbage in, error messages out). That is the next problem. If your function does what it is supposed to when the input is correct, that gets full credit.

Not only write a function, but also show it working on the data obtained by the R command


x <- scan(url("http://www.stat.umn.edu/geyer/s17/3701/data/q1p1.txt"))

Problem 2

Rewrite your function for the preceding problem so it does GIEMO (garbage in, error messages out).

It should give an error if its argument has length zero, has NA or NaN or Inf or -Inf components, or is not of type "numeric".

Hint: in Section 8 of the first course notes Basics we used the function is.finite. Look it up and see if you want to use that.
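As a rough illustration of the building blocks (not a complete solution), the sketch below shows what is.finite, is.numeric, and length report on some made-up inputs, and how stop signals an error. The function check_nonempty is a hypothetical name used only here.

```r
# made-up vector containing every kind of value the problem calls garbage
y <- c(1, NA, NaN, Inf, -Inf)

finite_flags <- is.finite(y)
print(finite_flags)          # TRUE only for the first component
print(all(finite_flags))     # FALSE, so y would be rejected

print(is.numeric(y))         # TRUE: NA, NaN, and Inf still live in a numeric vector
print(is.numeric("1"))       # FALSE: a character string is not numeric
print(length(numeric(0)))    # 0: a numeric vector with no components

# hypothetical check showing how stop() turns a failed test into an error
check_nonempty <- function(z) {
    if (length(z) == 0) stop("argument has length zero")
    invisible(TRUE)
}
check_nonempty(y)            # passes silently
```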

Show that your new function still works on the data described in the preceding problem.

Problem 3

Modify the calculations of Sections 7.5.2, 7.5.3, and 7.5.4 of the first course notes Basics so that they are done by one R function.

Your R function will have one argument, which is the data (x in the example in the notes) and will produce one scalar value, which is the MLE (maximum likelihood estimate) (oout$maximum in the example in the notes).

For this problem you can use the easier method of Section 7.5.3 of the first course notes, because inside your function x is a local variable rather than a global variable (hence not evil).

For this problem you do not have to worry about GIEMO (garbage in, error messages out). If your function does what it is supposed to when the input is correct, that gets full credit.

Not only write a function, but also show it working on the data obtained by the R command


x <- scan(url("http://www.stat.umn.edu/geyer/s17/3701/data/q1p3.txt"))

Homework 1

Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.

Problem 4

This is a modification of problem 1. Now we will allow unequal probabilities for the data values. So now we have two vectors of the same length, call them x and p, and the latter is a probability vector, meaning its components are nonnegative and sum to one.

If x_i are the components of x and p_i are the components of p and n is the length of both x and p, then the equations in problem 1 are modified to

μ = ∑_{i = 1}ⁿ x_i p_i

for the mean, and

σ² = ∑_{i = 1}ⁿ (x_i − μ)² p_i

for the population variance. As before and as always, the standard deviation is the square root of the variance.
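Here is a small worked example of the weighted formulas on made-up values (both vectors below are assumptions for illustration, not the assignment data). It also checks that equal probabilities 1 / n give the same sums as the formulas in problem 1.

```r
# made-up data and probability vector, for illustration only
x <- c(1, 2, 3)
p <- c(0.2, 0.3, 0.5)

# mean: sum of x_i * p_i
mu <- sum(x * p)

# population variance: sum of (x_i - mu)^2 * p_i
sigsq <- sum((x - mu)^2 * p)
print(c(mu, sigsq))

# with equal probabilities 1 / n the formulas reduce to those of problem 1
x_eq <- c(1, 2, 3, 4)
p_eq <- rep(1 / length(x_eq), length(x_eq))
mu_eq <- sum(x_eq * p_eq)
sigsq_eq <- sum((x_eq - mu_eq)^2 * p_eq)
print(c(mu_eq, sigsq_eq))
```

For the first pair of vectors the mean is 2.3 and the variance is 0.61, up to floating-point rounding.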

Again for this problem you need not worry about GIEMO (that's the next problem).

Data for this problem are


d <- read.table(url("http://www.stat.umn.edu/geyer/s17/3701/data/q1p4.txt"),
    header = TRUE)

This produces a data frame, which we have not covered yet but which you can think of as a list (which it is, just a list with extra requirements), that is d$x is what we are calling x above and d$p is what we are calling p above.
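To see the list-like behavior without downloading anything, the sketch below builds a small made-up data frame (the values are assumptions, not the contents of the data file) and extracts its columns the same way.

```r
# made-up data frame standing in for the one read from the URL
d <- data.frame(x = c(1, 2, 3), p = c(0.2, 0.3, 0.5))

print(is.list(d))   # TRUE: a data frame is a list
print(names(d))     # "x" "p": the column (component) names

# each column comes out as an ordinary numeric vector
x <- d$x
p <- d$p
print(length(x) == length(p))   # TRUE: columns of a data frame have equal length
```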

Otherwise, this problem is just like problem 1: write the function and show it working on these data.

Problem 5

This problem is to problem 4 as problem 2 is to problem 1. Add GIEMO to your solution to the preceding problem. Catch any problems with either argument. Show your function working.

Hint: when you are checking that p sums to one, don't compare doubles for exact equality, as Section 10.6 of the first course notes Basics explains.
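The sketch below illustrates why exact comparison is fragile and shows two common alternatives. The tolerance tol is an assumed value chosen for illustration; the course notes may recommend something different.

```r
# a classic case: the two sides differ by rounding error
a <- 0.1 + 0.2
exact_equal <- (a == 0.3)
print(exact_equal)                 # FALSE

# alternative 1: all.equal compares up to a small tolerance;
# wrap it in isTRUE because it returns a character string on failure
close_enough <- isTRUE(all.equal(a, 0.3))
print(close_enough)                # TRUE

# alternative 2: an explicit tolerance (the value here is an assumption)
tol <- sqrt(.Machine$double.eps)
within_tol <- abs(a - 0.3) < tol
print(within_tol)                  # TRUE
```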

Problem 6

This problem is about testing. In particular, seeing that the error checks catch all the errors.

For both the functions you wrote in problems 2 and 5, make up bad data for which they fail. Make up data that makes them fail each different check you put in the functions.
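Because an error normally stops a script at the point where it occurs, one way to exercise several checks in a single file is to wrap each failing call in try or tryCatch. The sketch below assumes that approach; needs_positive is a hypothetical function with a single check, standing in for your own functions.

```r
# hypothetical function with one error check, standing in for your own
needs_positive <- function(x) {
    if (any(x <= 0)) stop("x must be positive")
    sum(log(x))
}

# try() catches the error so the script keeps running
result <- try(needs_positive(c(1, -2, 3)), silent = TRUE)
print(inherits(result, "try-error"))   # TRUE: the check fired

# tryCatch() can recover the error message itself
msg <- tryCatch(needs_positive(c(1, -2, 3)),
    error = function(e) conditionMessage(e))
print(msg)
```

Repeating this pattern once per check, with bad data chosen to trigger that check and no earlier one, shows that each check does its job.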

Problem 7

This problem is like problem 3 except that now we want to allow different probability distributions. We will still ignore GIEMO, because that may be too hard at this point in the course.

Write an R function that has three arguments:

- the data vector x, just like before,
- a function that itself has two arguments,
  - the (univariate) parameter theta and
  - the data vector x,
- and an interval over which to search, given by a vector of length two called interval.

The function that is the second argument is supposed to return the log of the probability density function (PDF) or probability mass function (PMF), depending on whether the distribution of x is continuous or discrete, respectively. (The reason we have the user supply the interval is that there is no way we can tell what the range of values of the parameter is.)

One example of such a function is


function(theta, x) dgamma(x, shape = theta, log = TRUE)

that we used in our function but another would be


function(theta, x) dcauchy(x, location = theta, log = TRUE)

and yet another would be


function(theta, x) dbinom(x, 20, prob = 1 / (1 + exp(- theta)), log = TRUE)

The idea is that the user provides a function for whatever distribution the user wants.
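To illustrate the mechanics of passing such a function around, here is a minimal sketch under stated assumptions. The helper loglik_at, the made-up data, and the normal log density are all hypothetical choices made for this illustration (the normal is deliberately not one of the three distributions in this problem), and the sketch is not wrapped up into the single function the problem asks for.

```r
# hypothetical helper: log likelihood at one parameter value,
# given data x and a user-supplied log density logf(theta, x)
loglik_at <- function(theta, x, logf) sum(logf(theta, x))

# made-up data and a log density not used in the assignment
# (normal with unknown mean theta and standard deviation one)
x <- c(-0.5, 0.3, 1.2, 0.8)
logf <- function(theta, x) dnorm(x, mean = theta, log = TRUE)

# the supplied function is called like any other function
print(loglik_at(0, x, logf))

# optimize() searches the interval over its function's first argument;
# the extra named arguments x and logf are passed through to loglik_at
oout <- optimize(loglik_at, interval = c(-10, 10), maximum = TRUE,
    x = x, logf = logf)
print(oout$maximum)
```

For this particular log density the maximizer is known to be the sample average (0.45 here), which gives an independent check on what optimize returns.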

Modify your answer to problem 3 so it works as described here.

If you use the gamma PDF function above as your function argument, then the data for problem 3 are appropriate.

If you use the Cauchy PDF function above as your function argument, then the following data are appropriate


x <- scan(url("http://www.stat.umn.edu/geyer/s17/3701/data/q1p7c.txt"))

If you use the binomial PMF function above as your function argument, then the following data are appropriate


x <- scan(url("http://www.stat.umn.edu/geyer/s17/3701/data/q1p7b.txt"))

The parameter spaces are 0 to ∞ for gamma and −∞ to ∞ for Cauchy and binomial. But I assure you that the true unknown parameter values are less than 10 in absolute value and, for the gamma, greater than 0.1. You are relying on the user of your function to get this right, but while testing your function you have to play the role of the user.

Show that your function works with all three of the user-supplied functions given above.