Statistics 3701 (Geyer, Fall 2022) Homework 1

Rules

See the Section about Rules for Quizzes and Homeworks on the General Info page.

Your work handed into Canvas should be an Rmarkdown file with text and code chunks that can be run to produce what you did. We do not take your word for what the output is. We may run it ourselves. But we also want the output.

You may ask questions if the wording of a question is confusing. But the instructor will not be giving hints.

Quizzes must be uploaded by the end of class (1:10). Canvas should actually allow a few minutes after that, but those not uploaded by 1:10 will be marked late. Here is the link for uploading this quiz: https://canvas.umn.edu/courses/330843/assignments/2795250.

Homeworks must be uploaded before midnight the day they are due. Here is the link for uploading this homework: https://canvas.umn.edu/courses/330843/assignments/2795258.

Quiz 1

Problem 1

Write an R function that, given a numeric vector x, returns a numeric vector whose components are the first 10 strictly positive components of x, or all of the strictly positive components of x if there are fewer than 10. If x has no positive components, then your function should return a numeric vector of length zero.

For this problem you do not have to worry about GIEMO (garbage in, error messages out). That is the next problem. If your function does what it is supposed to when the input is correct, that gets full credit.

Not only write a function, but also show it working on numeric vectors having

zero strictly positive components,
between one and nine strictly positive components,
exactly ten strictly positive components, and
more than ten strictly positive components.

And for each of the cases above show your function working on a vector having no non-positive components and also on another vector having some non-positive components. (So there are eight cases in total.)
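One possible shape for such a function, offered as a minimal sketch rather than the expected solution: logical indexing picks out the strictly positive components, and R function head keeps at most the first 10 of them. The name first_positive and the test vectors are made up for illustration, and only three of the eight cases are shown.

```r
# Hypothetical helper: first 10 strictly positive components of x
first_positive <- function(x) {
    pos <- x[x > 0]      # logical indexing keeps strictly positive values
    head(pos, 10)        # at most the first 10; fewer if pos is shorter
}

first_positive(c(-1, 0, -3))          # no positive components: numeric(0)
first_positive(c(2, -1, 5, 0, 7))     # fewer than ten: 2 5 7
first_positive(as.numeric(1:15))      # more than ten: 1 through 10
```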

Problem 2

Rewrite your function for the preceding problem so it does GIEMO (garbage in, error messages out).

It should give an error if its argument has NA or NaN components or is not of type "numeric".

Show that your new function still works on the data described in the preceding problem.

Note: In R the logical not operator is ! (exclamation point, also called bang). So to reverse a test, precede it with !. The expression ! (x < 0) does the same thing as x >= 0. This is illustrated in several places in the notes, but we did not mention it in class.
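A sketch of how such checks might look, assuming that R function is.numeric is the intended test for type "numeric" (it is also TRUE for integer vectors) and using R function anyNA, which is TRUE when any component is NA or NaN. The function name first_positive_checked is hypothetical.

```r
# Hypothetical checked version of the function from Problem 1
first_positive_checked <- function(x) {
    stopifnot(is.numeric(x))     # error unless x is numeric
    stopifnot(! anyNA(x))        # anyNA is TRUE for NA and for NaN
    head(x[x > 0], 10)
}

first_positive_checked(c(3, -2, 8))   # 3 8
```

Calls such as first_positive_checked("a") or first_positive_checked(c(1, NaN)) would stop with an error under these checks.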

Problem 3

Modify the calculations of Section 7.5.4 of the first course notes Basics so that the statistical model is the Cauchy location family, that is, the Cauchy family of distributions (documented in the help for R function dcauchy) with unknown location parameter and known scale parameter, which we take to be the default value (1).

As in that section, use a function factory to make your log likelihood function.

Not only write a function, but also show it working for finding the MLE (maximum likelihood estimate) for the data obtained by the R command


x <- scan(url("https://www.stat.umn.edu/geyer/3701/data/2022/q1p3.txt"))

Note: If this does not work on your computer, see a note about downloading files. Of course, you have to modify the command used in that note to be the command used here. The point is to give R function scan a local file to input when the computer forbids downloads from the internet.

Note: For an interval to find the MLE use sample median plus or minus 1. Asymptotic theory says the standard error of the median considered as an estimator of the location parameter is (π ⁄ 2) ⁄ √ n where n is the sample size. This is much smaller than 1 here, so this interval should include the MLE.

Note: R function median calculates the median.

Note: For this problem ignore GIEMO (garbage in, error messages out). You do not have to detect erroneous arguments to your function.
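The pieces described in this problem can be put together roughly as follows. This is a sketch under stated assumptions, not the calculation in the course notes: the names make_logl, logl, and oout are made up, the data are simulated with rcauchy as a stand-in for the real data, and R function optimize is assumed as the optimizer.

```r
# Function factory: takes data, returns the log likelihood as a function
# of the location parameter mu (scale left at its default value 1)
make_logl <- function(x) {
    function(mu) sum(dcauchy(x, location = mu, log = TRUE))
}

set.seed(42)                       # simulated stand-in for the real data
x <- rcauchy(30, location = 5)

logl <- make_logl(x)
interval <- median(x) + c(-1, 1)   # sample median plus or minus 1
oout <- optimize(logl, interval = interval, maximum = TRUE)
oout$maximum                       # approximate MLE of the location

(pi / 2) / sqrt(length(x))         # asymptotic standard error of the median
```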

Homework 1

Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.

Problem 4

This problem is about probability models on finite sample spaces. These can be specified by two vectors

x, which gives possible values of a random variable, and
p, which gives the corresponding probabilities.

Thus both must be numeric vectors, and p must have components that are nonnegative and sum to one.

If x_i are the components of x and p_i are the components of p and n is the length of both x and p, then the mean of the random variable is

μ = ∑_{i = 1}ⁿ x_i p_i

and the variance of the random variable is

σ² = ∑_{i = 1}ⁿ (x_i − μ)² p_i

and, of course, the standard deviation is the square root of the variance.

Your function should return a list with three components named mean, var, and sd, which are the three things you calculated.

For this problem you need not worry about GIEMO (that's the next problem). If your function works correctly, then it will be considered correct.

Data for this problem are


d <- read.table(url("https://www.stat.umn.edu/geyer/3701/data/q1p4.txt"),
    header = TRUE)

This produces a data frame, which we have not covered yet but which you can think of as a list (which it is, just a list with extra requirements). That is, d$x is what we are calling x above and d$p is what we are calling p above.

Write the function and show it working on these data.
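The formulas for the mean and variance in this problem translate almost directly into vectorized R. The following is a minimal sketch; the function name moments and the fair-die example are made up for illustration and are not the data for this problem.

```r
# Hypothetical function: mean, variance, and standard deviation of a
# distribution on a finite sample space
moments <- function(x, p) {
    mu <- sum(x * p)                  # mean
    sigsq <- sum((x - mu)^2 * p)      # variance
    list(mean = mu, var = sigsq, sd = sqrt(sigsq))
}

# Made-up example: a fair six-sided die
mout <- moments(1:6, rep(1 / 6, 6))
mout$mean     # 3.5
mout$var      # 35 / 12
```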

Problem 5

This problem is to problem 4 as problem 2 is to problem 1. Add GIEMO to your solution to the preceding problem. Catch any problems with either argument. Show your function working.

Hint: When you are checking that p sums to one, don't compare doubles for exact equality, as Section 10.6 of the first course notes Basics explains.
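To illustrate the hint, here is a small sketch using R function all.equal, which compares numbers up to a tolerance. The helper name sums_to_one is hypothetical, and all.equal is only one way to do a tolerant comparison.

```r
0.1 + 0.2 == 0.3                      # FALSE: exact comparison fails
isTRUE(all.equal(0.1 + 0.2, 0.3))     # TRUE: equal up to a tolerance

# Hypothetical check that a probability vector sums to one
sums_to_one <- function(p) isTRUE(all.equal(sum(p), 1))
sums_to_one(c(0.25, 0.25, 0.5))       # TRUE
sums_to_one(c(0.3, 0.3, 0.3))         # FALSE
```

The call to isTRUE is needed because all.equal returns a character string describing the difference, not FALSE, when its arguments differ.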

Problem 6

This problem is about testing. In particular, seeing that the error checks catch all the errors.

For both the functions you wrote in problems 2 and 5, make up bad data for which they fail. Make up data that makes them fail each different check you put in the functions.

Note: In order to have errors not stop Rmarkdown, you need to look at the new Section about Errors in the Rmarkdown Demo document.
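Independently of what the Rmarkdown Demo document recommends, one way to confirm in code that a call fails is to wrap it in R function tryCatch and look at the error message. This is a sketch with a made-up function bad_sqrt, not necessarily the method of the demo document. (The knitr chunk option error = TRUE is another way to let a document keep knitting after an error.)

```r
# Hypothetical function with one argument check
bad_sqrt <- function(x) {
    if (! is.numeric(x)) stop("x must be numeric")
    sqrt(x)
}

# tryCatch turns the error into a value, so the script keeps running
msg <- tryCatch(bad_sqrt("four"), error = function(e) conditionMessage(e))
msg    # "x must be numeric"
```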

Problem 7

This problem is to add GIEMO (garbage in, error messages out) to Problem 3.

For the Cauchy location model the parameter can be any real number (so do not check that it is positive) and the data can be any real numbers (so do not check that they are positive).
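With a function factory there are two places where garbage can come in: the data given to the factory and the parameter given to the function it returns. A sketch of checking both, assuming that NA, NaN, and infinite values all count as garbage; the name make_logl_checked and the data are made up.

```r
make_logl_checked <- function(x) {
    # check the data once, when the log likelihood is made
    stopifnot(is.numeric(x), length(x) >= 1, all(is.finite(x)))
    function(mu) {
        # check the parameter on every call; any finite real is allowed
        stopifnot(is.numeric(mu), length(mu) == 1, is.finite(mu))
        sum(dcauchy(x, location = mu, log = TRUE))
    }
}

logl <- make_logl_checked(c(-1.3, 0.2, 4.7))   # made-up data
logl(0)                                        # works for any real mu
```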

Problem 8

This problem is like Problem 4 except that we want to add the median of the distribution to our output.

Defining the median of the distribution is tricky. First we have to sort the x vector (because the median needs the data in sorted order). But we have to keep track of which components of p go with which components of x, so we cannot just sort x. R has a function order to do this.


i <- order(x)
x <- x[i]
p <- p[i]

does the same thing to x as


x <- sort(x)

but keeps the corresponding components of x and p in the same place.
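A tiny made-up example shows what the index vector from order looks like and how the pairing is preserved:

```r
x <- c(3, 1, 2)          # made-up values
p <- c(0.2, 0.3, 0.5)    # made-up probabilities, p[k] goes with x[k]

i <- order(x)            # 2 3 1: positions of x in increasing order
x <- x[i]                # 1 2 3
p <- p[i]                # 0.3 0.5 0.2: still paired with the sorted x
```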

Then the other tricky part is that we need the cumulative distribution function of this distribution. For each component of x we need the probability that the random variable is less than or equal to that value. R function cumsum does that (after both x and p have been modified as described above):


foo <- cumsum(p)

Then foo[k] is the value of the cumulative distribution function at x[k].

With all of that done, the median of the distribution of the random variable is the smallest x[k] such that foo[k] is greater than or equal to one half.
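As a sketch of this last step on made-up data (already sorted, so the reordering is not repeated), R functions which and min can pick out the first index at which the cumulative distribution function reaches one half. This is one way to do it, not necessarily the intended one.

```r
x <- c(1, 2, 3)                 # made-up values, already sorted
p <- c(0.3, 0.5, 0.2)           # corresponding probabilities

foo <- cumsum(p)                # 0.3 0.8 1.0
k <- min(which(foo >= 1 / 2))   # first index where the CDF reaches one half
x[k]                            # 2
```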

So redo your solution to Problem 4 adding a component median to the output. Show your function working on the data for Problem 4.

As in Problem 4, you may ignore GIEMO.