Announcements

New! Reading assignment number 7 assigned, due Wednesday Mar 3, and discussion group in Canvas has been opened for it.

A clarification has been added to the description of Doing Assignment 2.

This page rearranged with two new sections, one of which has subsections.

I explained to someone in office hours that questions do not have to seem deep to have deep answers. So don't worry about how your questions sound. As we saw today, even when the question seems misguided — making some assumption about R that is false — it can still be very interesting and important to know the answer. So just ask!

Course Material

Course material for this course is scattered all over the place. We will look at some material from

Needless to say this is too much material for one course. We can discuss what we want to do.

2021 Announcements of Permanent Interest

Denormalized Numbers

Finishing the example on denormalized numbers I was doing in class.

> exp(-745)
[1] 4.940656e-324
> exp(-746)
[1] 0
So we see that exp(-745) is close to the smallest positive nonzero number defined in IEEE arithmetic. What is it actually?
> library(Rmpfr)
> foo <- mpfr(exp(-745), 120)
> foo
1 'mpfr' number of precision  120   bits 
[1] 4.9406564584124654417656879286822137237e-324
> log2(1 / foo)
1 'mpfr' number of precision  120   bits 
[1] 1074
So the result foo is exactly 2−1074 in the computer's arithmetic. And we know from calculus, that isn't exact (should be an irrational number). We can also see the inexactness in
> log(exp(-745))
[1] -744.4401
> log(2^(-1074))
[1] -744.4401
> log2(exp(-745))
[1] -1074
Note that there is nothing special about exp and log here. Denormalized numbers arise whenever results get in this range
> .Machine$double.xmin
[1] 2.225074e-308
> 2^(-1074)
[1] 4.940656e-324
> 2^(-1074) /2
[1] 0
.Machine$double.xmin is the smallest positive normalized number (IEEE double precision) and 2−1074 is the smallest positive denormalized (IEEE double precision). When you are in that range, you are denormalized. When you are even below that range, you have underflow.

Makevars File

As discussed in class, you cannot put nonstandard compiler flags in the src/Makevars file of an R package. For example, if you add -Wall -Wextra to the definition of PKG_CFLAGS, then R CMD check will emit the warning

Non-portable flags in variable 'PKG_CFLAGS':
  -Wall -Wextra
So what you do is follow the instructions in file package/notes in R repository foo in my account at github.com that I illustrated in class Feb 17.

Searching CRAN

It came up in talking with a student that it would be helpful to cover how to search CRAN (as the following announcement is based on). This is done with R package pkgsearch. For example,

library("pkgsearch")
advanced_search("Arnoldi")
returns no hits. But
advanced_search("Conjugate Gradient", size = 200) -> foo
returns 178 hits (as this is written). The result foo is a data frame with variables
> names(foo)
 [1] "score"                "package"              "version"             
 [4] "title"                "description"          "date"                
 [7] "maintainer_name"      "maintainer_email"     "revdeps"             
[10] "downloads_last_month" "license"              "url"                 
[13] "bugreports"           "package_data"        
And we can now use R function grep to whittle down our search. For example, the method of conjugate gradients can be used either for optimization or for solving linear equations. If we are only interested in the latter, then
> 
> bar <- grepl("equation", foo$title, ignore.case = TRUE) |
+     grepl("equation", foo$description, ignore.case = TRUE)
> sum(bar)
[1] 11
> foo$package[bar]
 [1] "pcg"         "cPCG"        "sanic"       "hgm"         "LassoGEE"   
 [6] "shinyKGode"  "gclm"        "KGode"       "deGradInfer" "calculus"   
[11] "rootSolve"  
cuts down our task. At this point we had to look at the actual titles, descriptions, and web pages for the packages.

Twenty-First Century Numerical Analysis

A search of CRAN turned up two packages that use modern methods for solving linear equations.

But I could not find anything about modern methods for eigenvalue-eigenvector problems.

Creating R Lists in C Called from R

The example I was using for constructing an R object of type "list" inside C code called from R using the .Call interface was C function scdd in R package rcdd which is on GitHub. The code that constructs an R object of type "list" begins at line 268.

Inheritance in OOP

I may have given the wrong impression in my in class answer about inheritance. The R S3 OOP system does not have inheritance in the sense of C++ and Java. In those languages fields and methods of a class can be inherited from its superclass. In R S3 there is no such thing. But there is a very weak form of inheritance in that the class attribute of an object can be a vector, and one can think of the components of this vector as being class, superclass, superclass of superclass and so forth. Thus one should never test for class equality. The R idiom is

if (inherits(foo, "Bar"))
for does object foo have class "Bar".

R Type System

A better answer about when to use typeof, mode, storage.mode, or class. The R Language Definition, Section 1, Objects and Section 5, Object-oriented programming give the better answers.

Mutability

A quotation from Advanced R by Hadley Wickham

Surprisingly, the implementation of reference classes is almost entirely in R code — they are a combination of S4 methods and environments. This is a testament to the flexibility of S4.

It also says that if you need mutability but don't need OOP, then you don't need reference classes, you just need environments. That's what I use.

Changing Parent Environment

My answer in class about whether environments can be changed on the fly was completely wrong. I should not have guessed. As always, TRT is to RTFM. And help(parent.env) says, among other things

environment(fun) <- value
parent.env(env) <- value
(these are the only assignment functions documented on this help page). So both the environment of a function and the parent environment of an environment can be assigned arbitrarily. An examination of the R code base shows that the first is done in model fitting functions and methods functions, and the latter is done only in methods functions and in R startup. So it is not something you would ordinarily do, but it is something you can do. The documentation also says

The replacement function ‘parent.env<-’ is extremely dangerous as it can be used to destructively change environments in ways that violate assumptions made by the internal C code. It may be removed in the near future.

So changing the parent of an environment is something you can do but are strongly advised not too!

Removing Sensitive Material from Git and GitHub

A better answer to what to do when sensitive material has mistakenly been committed to git and pushed to GitHub is found on this GitHub documentation. It suggests using either the git filter-branch command or the BFG Repo-Cleaner open source tool. I think I have used git filter-branch in the past. But it also says

Warning: Once you have pushed a commit to GitHub, you should consider any data it contains to be compromised. If you committed a password, change it! If you committed a key, generate a new one.

2020 Announcements

R Packages that do Regression

The info about R packages that do regression is linked on the course notes about R packages, in particular, follow the links to git repository bar and the accompanying gist.

My way of converting formulas to x and y is a way of R, but not the only way. Instead you can follow http://developer.r-project.org/model-fitting-functions.html if you like.

Robustness

Robustness came up today (Apr 24) in class. This area isn't as trendy as it used to be, but it is still very important. Every statistician should know something about it. For a start read the help pages for R functions lqs and rlm in R package MASS and try the examples.

Then for more, the following

library("pkgsearch")
foo <- advanced_search(Title = "robust", Description = "robust", size = 1000)
returns (as I write this) 144 CRAN packages with robust in the Title or Description fields of the package DESCRIPTION file. Of course, in some of these robust is used with a different meaning from the robust in robust statistics (Wikipedia pages Robust statistics and Robust regression). But there are clearly a lot of robustness packages on CRAN (not to mention the ones in the recommended package MASS that comes with every R installation — which we already did mention, of course).

R Standalone Library of C Functions

New handout about the R standalone library under Course Notes in the navigation to the left.

The authoritative documentation is Section 6.17 of Writing R Extensions. This allows use of many C functions that back up R functions, including the R random number system and all of the rest of R functions for probability distributions, from C code not called from R.

Branch and Bound Algorithm

This is in addition to the slides on Branch and Bound under Slides in the navigation to the left.

R package glmbb (https://cloud.r-project.org/package=glmbb) does branch and bound and is programmed in R (no C or C++). It uses three main ideas

The same ideas will work on any branch and bound problem.

Using SSH Without Passwords

A note about using the ssh command without passwords.

  1. First you have to have ssh. On unix (Linux and Mac OS X) it just comes with the computer. On up to date versions of Windows 10 it can easily be installed (so they say — I haven't tried it since I don't do windows).
  2. Then, if you have never used ssh before, do (from the command line in a terminal)
    ssh username@lts.cla.umn.edu
    
    where you replace username with your actual U of M username (the one you log in with to read mail, register for courses, use the library, etc.). Type your U of M password when prompted and you should be on lts.
  3. Log back out (say exit), because we need to do some work locally.
  4. (Do this step once per computer you own. If you have already done this for working with Github, don't do it again.) From the command line on your computer do
    ssh-keygen
    
    Answer any questions it asks by choosing the default (hit return) until you get to Enter passphrase (empty for no passphrase):. Enter a passphrase. This can be as long as you like, and should be memorable. It should not be used for anything else. Then you have to enter it again to make sure you typed it correctly. Then it should say the keys were successfully created. If you do
    ls .ssh
    
    you should see two files
    id_rsa
    id_rsa.pub
    
    The first is your private key. It stays secret. The only copy anywhere should be right where it is now, on your computer. The second is your public key. You can give that out to anyone and move it to any computer.
  5. The first thing you do is copy it to lts
    scp .ssh/id_rsa.pub lts.cla.umn.edu:
    
    (Note the colon at the end of this command. It's important. You will have to type your U of M password again for this to work).
  6. Now log back into lts again, just like in item 2 above.
  7. This time we have some work to do. We need to add this public key to the authorized_keys in the .ssh directory on the remote computer (in this case lts).
    cat id_rsa.pub >> .ssh/authorized_keys
    
    (you can now remove the file id_rsa.pub if you like, only the copy in .ssh/authorized_keys is used).
  8. Now log out of lts again, just like in item 3 above.
  9. Now log back into lts again, just like in item 2 above. This time you are not asked for a password. This time you are asked for the passphrase for that SSH key. Type that passphrase (that you typed into ssh-keygen. This should get you in to lts. On my computer (an Ubuntu Linux box) it pops up a new window for me to type the passphrase in.
  10. Now log out of lts again, just like in item 3 above.
  11. Now log back into lts again, just like in item 2 above. This time you should just get in. No password and no passphrase. Your computer remembers the key for you. You will never need to type this passphrase again unless you log out of your account on your computer or reboot your computer. So you have to type this passphrase once per reboot rather than once per ssh or scp command.
  12. Since you have the same file system on lts and compute. You can now get into it without a password or passphrase. Try that. Same as item 2, except change lts to compute.
  13. Now copy your public key to github, either github.umn.edu or github.com or both. Log into your account on one or the other.
    • Find the main menu in the upper right corner of your home page. Choose Settings.
    • Now on the left you should see a Personal settings menu one item of which is SSH and GPG keys. Click that.
    • Now in the upper right you should see a green button labeled New SSH key. Click that.
    • Now at a command line on your computer do
      cat ~/.ssh/id_rsa.pub
      
      You will see one very long line of text that probably wraps to several lines in the terminal window. Cut and paste that into the text form labeled Key at Github. You probably also want to give this key a Title so you will know what it is if you have more than one key (from more than one computer — you can use this key on this computer for everything). Then click Add SSH key. You should now see this key among your SSH keys. At a command line in your computer do
      ssh-add -l -E md5
      
      and you should see the hash of the key that Github shows.
  14. Now in any git repo on your computer that has this Github (the one you put the SSH key in) as its origin, what you see when you do
    git remote -v
    
    you should be able to do
    git pull origin master
    
    or
    git push origin master
    
    without a password or passphrase.
  15. Github now suggests you call the main branch in any Git repo main rather than master and has instructions on how to do this.

    As far as git or Github is concerned, you can call it anything you want.

Background Jobs

As mentioned in lecture, the unix way (including Mac OS X and Linux) to do background jobs is just to put an ampersand at the end of the command line. This is mentioned in help("BATCH") in R.

R CMD BATCH --vanilla foo.R &
will run all the R in file foo.R in the background. You get your command line back immediately to do other work while the background is going. Unix functions ps and top can tell you whether the background job is still running and how much time it has taken. top is especially easy to use, requiring no command line arguments in ordinary use.

But if you log out of the computer or even kill the window in which the background job was started, your background job will be killed by default. To avoid this do

nohup R CMD BATCH --vanilla foo.R &
The nohup says let the background job run to completion no matter what (unless the computer reboots, of course).

If you are going to run the background job on computers that other people use for everyday work, then you should nice your job, giving it lower priority than processes that involve user interaction. For that do

nice nohup R CMD BATCH --vanilla foo.R &
For more on all of this see the part about batch processing in some old slides about parallel processing (the rest of that slide deck is defunct, see the current course notes on parallel processing).

Sockets Cluster over Multiple Machines

Here is the example I couldn't do in class.

ssh -A ssh.stat.umn.edu
ssh -A oak.stat.umn.edu
R --vanilla
library("parallel")
hosts <- c("oak", "bald")
cl <- makePSOCKcluster(hosts, homogeneous = FALSE)
parLapply(cl, seq(along = hosts), function(x)
    list(hostname = system("hostname", intern = TRUE), pid = Sys.getpid()))
stopCluster(cl)
q()
exit
exit

Spatial Point Processes

The reading on spatial point processes (Stat 8501 Lecture Notes on Spatial Point Processes) mentioned right at the end of class one day is NOT REQUIRED. That is a theory class, this is a computing class. Nevertheless, if anyone is interested in spatial point processes, that is a good place to start.

For example, the question asked near the end of class was about an equation in the reading that is equation (5) in these 8501 notes, and that equation in the 8501 notes is surrounded by a lot more explanation, essentially what I said in answer to the question.

Initial Sequence Estimators for MCMC

New handout about autocovariance functions for functionals of reversible Markov chains. Expands an answer to a question asked in class.

MCMC Long Runs

In answer to the question about how long to run MCMC, I should have mentioned the soufflé rule: it takes as long as it takes (if you take a soufflé out of the oven too soon, it is inedible).

Parameterized SQL Queries

In regard to the question about parameterized SQL queries mentioned in the vignette for R package RSQLite, we note the following

Copy and Paste Programming

The coolest quote I have ever seen about computing. If you learn nothing else in this course but really learn this, it would be a very valuable course. This is by Douglas Crockford from the Read This First chapter of his new book.

I strongly recommend that you do not copy and paste code that you do not understand, even mine. This seems to have become a standard practice, but it is dangerous and reckless. It is not nearly as stupid as installing packages that you haven't even looked at, but it is still really bad. Given the current state of the art, the most important security filter is your brain. Use it. This is important.

C++

Why your humble instructor is anti-C++.

Good API Design

API is a TLA that stands for application programming interface. For us that mostly means what functions are provided and how those functions are used (what their arguments are).

Speaking of Scott Meyers, he has another wonderful talk titled The Most Important Design Guideline, the guideline being Make Interfaces Easy to Use Correctly and Hard to Use Incorrectly. See also the section on GIEMO in my Stat 3701 notes.

Good Programming Practices

Speaking of wonderful talks, the best talk I know of about R is by Martin Mächler (a member of the R core team), Good Practices in R Programming, which is also recommended on my Stat 3701 home page.

Object-Oriented Programming

Your curmugeonly instructor is also not a big fan of OOP (object-oriented programming). In this he has some fellow travelers (see here or here).

CPU Caches

Speaking of Scott Meyers, he has another wonderful talk titled CPU Caches and Why You Care. This talk is about hardware and what you need to do to make it go fast. Two guidelines (of several)

Unfortunately, your humble instructor knows too little about this subject to make a unit on it. So this course won't cover it. But you cannot make programs go fast without understanding this, so perhaps we should somehow. We can talk about that.