New! Reading assignment number 7 assigned, due Wednesday Mar 3, and discussion group in Canvas has been opened for it.
A clarification has been added to the description of Doing Assignment 2.
This page rearranged with two new sections, one of which has subsections.
I explained to someone in office hours that questions do not have to seem
deep to have deep answers. So don't worry about how your questions
sound. As we saw today, even when the question seems misguided —
making some assumption about R that is false — it can still be very
interesting and important to know the answer. So just ask!
Course material for this course is scattered all over the place. We will look at some material from
- this web site, which has copies of stuff from previous versions of this course,
- some of my github repositories, which originated as demo packages used for this course
- the web site for my version of STAT 3701, which, although at a lower level than this course, has topics not covered before in this course and also some details that are newer than what this course has,
- the web site for my part of graduate student orientation 2018, which, although at a lower level than this course, also has topics not covered in this course (using Rmarkdown in Rstudio).
- the book Advanced R by Hadley Wickham, available in various formats from Amazon and other booksellers but also available for free on the web.
2021 Announcements of Permanent Interest
Finishing the example on denormalized numbers I was doing in class.
> exp(-745)  4.940656e-324 > exp(-746)  0So we see that
exp(-745)is close to the smallest positive nonzero number defined in IEEE arithmetic. What is it actually?
> library(Rmpfr) > foo <- mpfr(exp(-745), 120) > foo 1 'mpfr' number of precision 120 bits  4.9406564584124654417656879286822137237e-324 > log2(1 / foo) 1 'mpfr' number of precision 120 bits  1074So the result
foois exactly 2−1074 in the computer's arithmetic. And we know from calculus, that isn't exact (should be an irrational number). We can also see the inexactness in
> log(exp(-745))  -744.4401 > log(2^(-1074))  -744.4401 > log2(exp(-745))  -1074Note that there is nothing special about
loghere. Denormalized numbers arise whenever results get in this range
> .Machine$double.xmin  2.225074e-308 > 2^(-1074)  4.940656e-324 > 2^(-1074) /2  0
.Machine$double.xminis the smallest positive normalized number (IEEE double precision) and 2−1074 is the smallest positive denormalized (IEEE double precision). When you are in that range, you are denormalized. When you are even below that range, you have underflow.
As discussed in class, you cannot put nonstandard compiler flags in
src/Makevars file of an R package. For example, if you
-Wall -Wextra to the definition of
R CMD check will emit the warning
Non-portable flags in variable 'PKG_CFLAGS': -Wall -WextraSo what you do is follow the instructions in file
package/notesin R repository
fooin my account at github.com that I illustrated in class Feb 17.
It came up in talking with a student that it would be helpful to cover
how to search CRAN
(as the following announcement is based on).
This is done with R package
pkgsearch. For example,
library("pkgsearch") advanced_search("Arnoldi")returns no hits. But
advanced_search("Conjugate Gradient", size = 200) -> fooreturns 178 hits (as this is written). The result
foois a data frame with variables
> names(foo)  "score" "package" "version"  "title" "description" "date"  "maintainer_name" "maintainer_email" "revdeps"  "downloads_last_month" "license" "url"  "bugreports" "package_data"And we can now use R function
grepto whittle down our search. For example, the method of conjugate gradients can be used either for optimization or for solving linear equations. If we are only interested in the latter, then
> > bar <- grepl("equation", foo$title, ignore.case = TRUE) | + grepl("equation", foo$description, ignore.case = TRUE) > sum(bar)  11 > foo$package[bar]  "pcg" "cPCG" "sanic" "hgm" "LassoGEE"  "shinyKGode" "gclm" "KGode" "deGradInfer" "calculus"  "rootSolve"cuts down our task. At this point we had to look at the actual titles, descriptions, and web pages for the packages.
Twenty-First Century Numerical Analysis
A search of CRAN turned up two packages that use modern methods for solving linear equations.
- R package sanic does BiCGSTAB (Biconjugate gradient stabilized method)
- R package Rlinsolve does BiCGSTAB and GMRES.
Creating R Lists in C Called from R
The example I was using for constructing an R object of type
inside C code called from R using the
.Call interface was
scdd in R package
which is on GitHub. The code that constructs an R object
"list" begins at line 268.
Inheritance in OOP
I may have given the wrong impression in my in class answer about inheritance.
The R S3 OOP system
does not have inheritance in the sense of C++ and Java. In those languages
fields and methods of a class can be inherited from its superclass.
In R S3 there is no such thing. But there is a very weak form of inheritance
in that the
class attribute of an object can be a vector,
and one can think of the components of this vector as being class, superclass,
superclass of superclass and so forth.
Thus one should never test for class equality.
The R idiom is
if (inherits(foo, "Bar"))for does object
R Type System
- R function
typeofgives the type of the object as R considers it. As the manual cited above says
The R specific function
typeofreturns the type of an R object. Note that in the C code underlying R, all objects are pointers to a structure with typedef
SEXPREC; the different R data types are represented in C by
SEXPTYPE, which determines how the information in the various parts of the structure is used.
- R function
modeis for backwards compatibility with the S programming language (which R is a reimplementation of). As the manual cited above says
modegives information about the mode of an object in the sense of Becker, Chambers & Wilks (1988) and is more compatible with other implementations of the S language.
- R function
storage.modeis also for backwards compatibility with the S programming language. As the manual cited above says
Finally, the function
storage.modereturns the storage mode of its argument in the sense of Becker et al. (1988). It is generally used when calling functions written in another language, such as C or FORTRAN, to ensure that R objects have the data type expected by the routine being called. (In the S language, vectors with integer or real values are both of mode
"numeric", so their storage modes need to be distinguished.)
- R function
classgives the class of the object as the various OOP systems of core R consider it. In the
S3OOP system the class of an object is just an attribute named
"class"that is an atomic vector of type
"character". So classes are nothing like in C++ or Java. They don't say anything about the object except the name of the class. They are used in method dispatch of generic functions.
A quotation from Advanced R by Hadley Wickham
It also says that if you need mutability but don't need OOP, then you don't need reference classes, you just need environments. That's what I use.
Surprisingly, the implementation of reference classes is almost entirely in R code — they are a combination of S4 methods and environments. This is a testament to the flexibility of S4.
Changing Parent Environment
environment(fun) <- value parent.env(env) <- value(these are the only assignment functions documented on this help page). So both the environment of a function and the parent environment of an environment can be assigned arbitrarily. An examination of the R code base shows that the first is done in model fitting functions and methods functions, and the latter is done only in methods functions and in R startup. So it is not something you would ordinarily do, but it is something you can do. The documentation also says
So changing the parent of an environment is something you can do but are strongly advised not too!
The replacement function ‘parent.env<-’ is extremely dangerous as it can be used to destructively change environments in ways that violate assumptions made by the internal C code. It may be removed in the near future.
Removing Sensitive Material from Git and GitHub
A better answer to what to do when sensitive material has mistakenly been
committed to git and pushed to GitHub is found on this GitHub documentation. It suggests using either the
git filter-branch command or the BFG Repo-Cleaner open source tool.
I think I have used
git filter-branch in the past.
But it also says
Warning: Once you have pushed a commit to GitHub, you should consider any data it contains to be compromised. If you committed a password, change it! If you committed a key, generate a new one.
R Packages that do Regression
My way of converting formulas to x and y is a way of R, but not the only way. Instead you can follow http://developer.r-project.org/model-fitting-functions.html if you like.
Robustness came up today (Apr 24) in class. This area isn't as trendy as it used to be, but it is still very important. Every statistician should know something about it. For a start read the help pages for R functions lqs and rlm in R package MASS and try the examples.
Then for more, the following
library("pkgsearch") foo <- advanced_search(Title = "robust", Description = "robust", size = 1000)returns (as I write this) 144 CRAN packages with
robustin the Title or Description fields of the package DESCRIPTION file. Of course, in some of these robust is used with a different meaning from the robust in
robust statistics(Wikipedia pages Robust statistics and Robust regression). But there are clearly a lot of robustness packages on CRAN (not to mention the ones in the recommended package MASS that comes with every R installation — which we already did mention, of course).
R Standalone Library of C Functions
New handout about the R standalone library under Course Notes in the navigation to the left.
The authoritative documentation is Section 6.17 of Writing R Extensions. This allows use of many C functions that back up R functions, including the R random number system and all of the rest of R functions for probability distributions, from C code not called from R.
Branch and Bound Algorithm
This is in addition to the slides on Branch and Bound under Slides in the navigation to the left.
R package glmbb (https://cloud.r-project.org/package=glmbb) does branch and bound and is programmed in R (no C or C++). It uses three main ideas
- It uses the R formula mini-language to specify models (all possible models), adding and dropping terms to move between models.
- It uses an R environment object to store what it thinks of as global variables. It does not store these variables in the R global environment.
- It uses recursion. There is a function that does branch and bound for all models in a specified set. This function works by partitioning the specified set and then calling itself on each element of the partition.
Using SSH Without Passwords
A note about using the ssh command without passwords.
- First you have to have ssh. On unix (Linux and Mac OS X) it just comes with the computer. On up to date versions of Windows 10 it can easily be installed (so they say — I haven't tried it since I don't do windows).
Then, if you have
never used ssh before, do (from the command line in a terminal)
ssh email@example.com you replace username with your actual U of M username (the one you log in with to read mail, register for courses, use the library, etc.). Type your U of M password when prompted and you should be on lts.
- Log back out (say exit), because we need to do some work locally.
(Do this step once per computer you own. If you have already done this
for working with Github, don't do it again.)
From the command line on your computer do
ssh-keygenAnswer any questions it asks by choosing the default (hit return) until you get to Enter passphrase (empty for no passphrase):. Enter a passphrase. This can be as long as you like, and should be memorable. It should not be used for anything else. Then you have to enter it again to make sure you typed it correctly. Then it should say the keys were successfully created. If you do
ls .sshyou should see two files
id_rsa id_rsa.pubThe first is your private key. It stays secret. The only copy anywhere should be right where it is now, on your computer. The second is your public key. You can give that out to anyone and move it to any computer.
The first thing you do is copy it to lts
scp .ssh/id_rsa.pub lts.cla.umn.edu:(Note the colon at the end of this command. It's important. You will have to type your U of M password again for this to work).
- Now log back into lts again, just like in item 2 above.
This time we have some work to do. We need to add this public key
to the authorized_keys in the .ssh directory
on the remote computer (in this case lts).
cat id_rsa.pub >> .ssh/authorized_keys(you can now remove the file id_rsa.pub if you like, only the copy in .ssh/authorized_keys is used).
- Now log out of lts again, just like in item 3 above.
- Now log back into lts again, just like in item 2 above. This time you are not asked for a password. This time you are asked for the passphrase for that SSH key. Type that passphrase (that you typed into ssh-keygen. This should get you in to lts. On my computer (an Ubuntu Linux box) it pops up a new window for me to type the passphrase in.
- Now log out of lts again, just like in item 3 above.
- Now log back into lts again, just like in item 2 above. This time you should just get in. No password and no passphrase. Your computer remembers the key for you. You will never need to type this passphrase again unless you log out of your account on your computer or reboot your computer. So you have to type this passphrase once per reboot rather than once per ssh or scp command.
- Since you have the same file system on lts and compute. You can now get into it without a password or passphrase. Try that. Same as item 2, except change lts to compute.
Now copy your public key to github, either github.umn.edu
or github.com or both. Log into your account on one or the other.
- Find the main menu in the upper right corner of your home page. Choose Settings.
Now on the left you should see a
Personal settingsmenu one item of which is SSH and GPG keys. Click that.
Now in the upper right you should see a green button labeled
New SSH key. Click that.
Now at a command line on your computer do
cat ~/.ssh/id_rsa.pubYou will see one very long line of text that probably wraps to several lines in the terminal window. Cut and paste that into the text form labeled
Keyat Github. You probably also want to give this key a
Titleso you will know what it is if you have more than one key (from more than one computer — you can use this key on this computer for everything). Then click
Add SSH key. You should now see this key among your SSH keys. At a command line in your computer do
ssh-add -l -E md5and you should see the
hashof the key that Github shows.
Now in any git repo on your computer that has this Github (the one you
put the SSH key in) as its origin, what you see when you do
git remote -vyou should be able to do
git pull origin masteror
git push origin masterwithout a password or passphrase.
- Github now suggests you call the main branch in any Git repo
masterand has instructions on how to do this.
As far as git or Github is concerned, you can call it anything you want.
As mentioned in lecture, the unix way (including Mac OS X and Linux) to do background jobs is just to put an ampersand at the end of the command line. This is mentioned in help("BATCH") in R.
R CMD BATCH --vanilla foo.R &will run all the R in file foo.R in the background. You get your command line back immediately to do other work while the background is going. Unix functions ps and top can tell you whether the background job is still running and how much time it has taken. top is especially easy to use, requiring no command line arguments in ordinary use.
But if you log out of the computer or even kill the window in which the background job was started, your background job will be killed by default. To avoid this do
nohup R CMD BATCH --vanilla foo.R &The nohup says let the background job run to completion no matter what (unless the computer reboots, of course).
If you are going to run the background job on computers that other people
use for everyday work, then you should
nice your job, giving it lower
priority than processes that involve user interaction.
For that do
nice nohup R CMD BATCH --vanilla foo.R &For more on all of this see the part about batch processing in some old slides about parallel processing (the rest of that slide deck is defunct, see the current course notes on parallel processing).
Sockets Cluster over Multiple Machines
Here is the example I couldn't do in class.
ssh -A ssh.stat.umn.edu ssh -A oak.stat.umn.edu R --vanilla library("parallel") hosts <- c("oak", "bald") cl <- makePSOCKcluster(hosts, homogeneous = FALSE) parLapply(cl, seq(along = hosts), function(x) list(hostname = system("hostname", intern = TRUE), pid = Sys.getpid())) stopCluster(cl) q() exit exit
- The -A flag enables agent forwarding so we don't need a password if ssh keys have been set up, as is discussed here (the rest of this slide deck is obsolete, replaced by our current course notes on parallel processing).
- I have to go to oak because ssh, also called lts.cla.umn.edu does not allow R jobs.
- The homogeneous = FALSE tells R not to assume that all the hosts have Rscript in the same place. That's the bit that threw me in class.
Spatial Point Processes
The reading on spatial point processes (Stat 8501 Lecture Notes on Spatial Point Processes) mentioned right at the end of class one day is NOT REQUIRED. That is a theory class, this is a computing class. Nevertheless, if anyone is interested in spatial point processes, that is a good place to start.
For example, the question asked near the end of class was about an equation in the reading that is equation (5) in these 8501 notes, and that equation in the 8501 notes is surrounded by a lot more explanation, essentially what I said in answer to the question.
Initial Sequence Estimators for MCMC
New handout about autocovariance functions for functionals of reversible Markov chains. Expands an answer to a question asked in class.
MCMC Long Runs
In answer to the question about how long to run MCMC, I should have mentioned the soufflé rule: it takes as long as it takes (if you take a soufflé out of the oven too soon, it is inedible).
Parameterized SQL Queries
In regard to the question about parameterized SQL queries mentioned in the vignette for R package RSQLite, we note the following
- Parameterized queries do not seem to be a part of standard SQL, which is why I could not find a section about them in W3schools SQL Tutorial, although they are mentioned under the topic "SQL injection".
- Even in R there does not seem to be a standard syntax for parameterized queries used in all CRAN packages involving SQL.
- Thus the examples in the RSQLite package vignette seem to be specific to RSQLite and not portable to other databases (some other syntax might work for them).
Copy and Paste Programming
The coolest quote I have ever seen about computing. If you learn nothing
else in this course but really learn this, it would be a very valuable course.
This is by Douglas Crockford from the
Read This First chapter of his
I strongly recommend that you do not copy and paste code that you do not understand, even mine. This seems to have become a standard practice, but it is dangerous and reckless. It is not nearly as stupid as installing packages that you haven't even looked at, but it is still really bad. Given the current state of the art, the most important security filter is your brain. Use it. This is important.
Why your humble instructor is anti-C++.
- Recently seen on Hacker News, an article Initialization in C++ is Seriously Bonkers: Just Start With C. And your humble instructor just ends with C.
- Less recently seen, a talk by Scott Meyers titled
The Last Thing D Needs[a bunch of C++ features]. For those who don't know Scott Meyers is a C++ wizard (see next item).
- The series of books by Scott Meyers
- Effective C++: 55 Specific Ways to Improve Your Programs and Designs, 3rd Edition
- More Effective C++: 35 New Ways to Improve Your Programs and Designs
- Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library
- Effective Modern C++: 42 Specific Ways to Improve Your Use of C++11 and C++14
- Some personal experience trying to use GiNaC, which is a recursive acronym that stands for
GiNaC is not a CAS, and CAS is a TLA that stands for
computer algebra system(Mathematica, Maple, MACSYMA, or the like). I did not get far enough to become a knowledgeable C++ programmer.
Good API Design
Speaking of Scott Meyers, he has another
wonderful talk titled
The Most Important Design Guideline, the guideline being
Make Interfaces Easy to Use Correctly and Hard to Use Incorrectly.
See also the section on GIEMO in my Stat 3701 notes.
Good Programming Practices
Speaking of Scott Meyers, he has another
wonderful talk titled
CPU Caches and Why You Care. This talk is about hardware and what
you need to do to make it go fast. Two guidelines (of several)
- Small is the same as fast, and when we are talking about hardware cache small means really small, a few hundred bytes.
- The hardware loves arrays in C/C++ terminology, which are called vectors in R terminology. It hates all other data structures. In particular it hates OOP.