Announcements
One more reading assignment assigned.
R package `qux` was indeed broken, as found in class. Fixed.
The thing I was confused about in class, I figured out. Without parallel computing, the toy problem took roughly 7 seconds of elapsed time (7.1 ± 0.6). With parallel computing, it took roughly 1.7 seconds (1.66 ± 0.03). The bigger number, ranging from 9.5 to 11.7 seconds, is the total CPU time for all child processes together. So the parallel version does more total work (there is communication overhead), but the elapsed time (wall clock time) is much less.
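Here is a minimal sketch of how to see this yourself with `system.time` (the toy task and core count are made up for illustration, not the actual in-class example; `mclapply` forks, so this is for UNIX-alikes):

```r
library(parallel)
toy <- function(i) mean(rnorm(1e6))   # hypothetical stand-in for the toy problem
# sequential: user CPU time and elapsed time roughly agree
system.time(lapply(1:20, toy))
# parallel: "elapsed" (wall clock) drops, but the CPU time accumulated
# by the child processes adds up to more than the sequential run
system.time(mclapply(1:20, toy, mc.cores = 4))
```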
The problem with git clone hanging was that github.com no longer supports the git protocol. The syntax

```
git clone git://github.com/cjgeyer/foo.git
```

used to work, and according to the documentation for the current version of git still works. But it doesn't work with github.com. So I changed the course notes to use the alternative syntax

```
git clone git@github.com:cjgeyer/foo.git
```

which is what github.com now says to use and which does work.
Policy in this class is that I am going to lecture as little as possible. There will be reading to be done outside of class, and then in class there will be only questions from students and answers from me (or other students if they volunteer). Since students these days seem reluctant to ask questions, this will be mandatory. I will take attendance and record each question asked. Depending on how many students there are in the class and how long the answers are (one really deep question could take a whole class to answer), each student may need to think up two questions per class period.
I explained to someone in office hours that questions do not have to seem "deep" to have deep answers. So don't worry about how your questions sound. As the 2021 class saw, even when the question seems misguided — making some assumption that is false — it can still be very interesting and important to know the answer. So just ask!
XKCD Project Roles
Roles for Doing Assignments 2 and 3.
Name | Role |
---|---|
Alexander Kuhn | BDFL |
Rui Zhou | R function dxkcd |
Difan Ouyang | R function pxkcd |
Seung Won Lee | R function qxkcd |
Jeonghwan Lee | R function rxkcd |
Jinwen Fu | tests |
Hyun Jung Koo | tests |
Benjamin Lewis | tests |
An Luo | vignette |
Class Project Presentations
These are the times for presentations about class projects.
Name | Date | Time |
---|---|---|
Rui Zhou | 2024-04-19 | 12:45 |
Alexander Kuhn | 2024-04-22 | 12:20 |
An Luo | 2024-04-22 | 12:45 |
Jinwen Fu | 2024-04-24 | 12:20 |
Benjamin Lewis | 2024-04-24 | 12:45 |
Jeonghwan Lee | 2024-04-26 | 12:20 |
Difan Ouyang | 2024-04-26 | 12:45 |
Seung Won Lee | 2024-04-29 | 12:20 |
Hyun Jung Koo | 2024-04-29 | 12:45 |
The code

```r
u <- "https://www.stat.umn.edu/geyer/8054/"
library(rvest)
foo <- read_html(u) |> html_elements("table") |> html_table()
foo
bar <- foo[[2]] |>
    transform(DateTime = paste(Date, Time)) |>
    transform(DateTime = as.POSIXct(DateTime, tz = "America/Chicago"))
bar
sapply(bar, class)
```

reads this table (so this is an example of how to do an HTML table right).
Course Material
Course material for this course is scattered all over the place. We will look at some material from
- this web site, which has copies of stuff from previous versions of this course,
- some of my GitHub repositories, which originated as demo packages used for this course
  - R packages `foo` and `fooRegister`, both in repository foo,
  - R package `bar` in repository bar, which is also accompanied by a gist about regression packages,
  - R package `baz` in repository mat,
  - R package `qux` in repository qux,
  - and R package `linkingTo` in repository linkingTo,
- the web site for my version of STAT 3701, which, although at a lower level than this course, has topics not covered before in this course and also some details that are newer than what this course has,
- the web site for my part of graduate student orientation 2018, which, although at a lower level than this course, also has topics not covered in this course (using Rmarkdown in Rstudio).
- the book Advanced R by Hadley Wickham, available in various formats from Amazon and other booksellers but also available for free on the web.
2024 Announcements of Permanent Interest
The Law of Demos
A student asked whether their class project presentation should include a live demo, and I referred to the Law of Demos (they never work). Following an R markdown vignette (or other notes), like I do in class, is better.
R Primitive Functions
Chapter 2 of R Internals has a lot about primitive functions and R function `.Primitive`.
2023 Announcements of Permanent Interest
CRAN Package foreach
The answers to the questions about R package foreach were in the vignette.

- `%do%` does non-parallel and `%dopar%` does parallel.
- The package is smart about exports to workers. In the random forest example it detects that `x` and `y` are global variables that must be exported to worker processes, but it cannot know in what package to find the function `randomForest` unless told. So that is pretty fancy. (A sketch of both operators follows this list.)
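Here is that sketch (the doParallel backend and the toy expression are my choices, not from the vignette):

```r
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)   # register a parallel backend for %dopar%

# %do% evaluates the loop body sequentially in the current session
foreach(i = 1:3, .combine = c) %do% sqrt(i)

# %dopar% ships the work to the registered workers; the .packages
# argument tells the workers which package to load, since (as noted
# above) the package cannot always figure that out by itself
foreach(i = 1:3, .combine = c, .packages = "stats") %dopar% sqrt(i)

stopImplicitCluster()
```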
Web Sites that Check R Packages
win-builder.r-project.org was already mentioned in class. This web site checks builds of R packages for Microsoft Windows and allows you to download the built Windows binaries.
R-hub is a site that checks R packages from within R using CRAN package rhub.
S7 Object System for R
Core R already has three object systems (S3, S4, and reference classes, also called R5). The S7 object system aims to replace S3 and S4. It is unclear when it will arrive in core R. Right now it is a CRAN package.
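For the curious, here is a minimal sketch of what S7 code looks like, following the package's own documentation (the Dog example is theirs, lightly abbreviated):

```r
library(S7)
# a class with typed properties
Dog <- new_class("Dog", properties = list(
    name = class_character,
    age  = class_numeric
))
lucy <- Dog(name = "Lucy", age = 2)
# generics and methods are ordinary R objects
speak <- new_generic("speak", "x")
method(speak, Dog) <- function(x) "Woof!"
speak(lucy)
```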
The Joel Test for Software Teams
The Joel Test: 12 Steps to Better Code. Just found this today, linked from somewhere else. It is more than 20 years old but still as good as the day it was written. You have no way of knowing this, but this site was at the time one of the most popular sites with advice about startups.
Since we are embarking on a homework assignment where the class is a software team, let's grade ourselves.
- Do you use source control? Yes! We pass this. Notice that Git is not mentioned. At the time (2000) the Linux kernel did not pass this. Git was written so it could (the history is a little more complicated than that).
- Can you make a build in one step? Yes! We pass this. `R CMD build` does this.
- Do you make daily builds? Maybe? Depends on the BDFL. But not automatic, which is what is meant. We are just not a big enough project for that.
- Do you have a bug database? Yes! GitHub provides one, if we want to use it.
- Do you fix bugs before writing new code? Haven't talked about this yet.
- Do you have an up-to-date schedule? No. A due date is not a schedule.
- Do you have a spec? I think that what Joel means by a spec is the homework assignment. It completely specifies what the software is to do. But we are going for more than that: we should have a design document that justifies each choice of algorithm with careful attention to inexactness of computer arithmetic.
- Do programmers have quiet working conditions? Dunno. Can't do anything about that. Are your offices like Dilbert's?
- Do you use the best tools money can buy? This one is obsolete, at least for our level. A laptop backed up by compute.cla.umn.edu is way, way in excess of what any software engineer had back then.
- Do you have testers? No. This is bad. But perhaps less so for mathematical software. Writing `d`, `p`, `q`, and `r` functions for yet another distribution is very well defined. So our own testing might be sufficient. Nevertheless, R packages do need testers. If R package aster had had testers, some incredibly annoying design mistakes could have been avoided. Some of these mistakes have been ameliorated but not fixed for reasons of backward compatibility.
- Do new candidates write code during their interview? Inapplicable. Although that would make for an interesting first day of class!
- Do you do hallway usability testing? No. But usability testing is important for statistical software too. And grabbing people walking by is as good a procedure as any. Try it?

So only 4 definite yesses. But as Joel says, it could be worse.
I probably assigned the roles wrong. Should have had an independent testing department (one person assigned to tests only). Too late for this year. Will definitely do next year.
The point of this is that the things we are trying to teach in this course are not the only important things about computing.
2022 Announcements of Permanent Interest
Some MCMC Theory
The fundamental papers on convergence rates of Metropolis-Hastings samplers, in particular, when they are geometrically ergodic (which implies the CLT holds) are Mengersen and Tweedie (1996, doi:10.1214/aos/1033066201) which only does the one-dimensional case, Roberts and Tweedie (1996, doi:10.1093/biomet/83.1.95) which does the finite-dimensional case, and Jarner and Hansen (2000, doi:10.1016/S0304-4149(99)00082-4) which simplifies some of the conditions in Roberts and Tweedie (but is still hard to use).
An application of this theory to random walk Metropolis and R package mcmc in particular is Johnson and Geyer (2012, doi:10.1214/12-AOS1048).
An application of this theory to a toy problem asked about by a student in class is some notes on geometric ergodicity.
Why Perfect Sampling is the Only Perfect Diagnostic
Perfect sampling (Propp and Wilson, 1996, doi:10.1002/(SICI)1098-2418(199608/09)9:1/2<223::AID-RSA14>3.0.CO;2-O; Kendall, 1998, doi:10.1007/978-1-4612-2224-8_13, Murdoch and Green, 1998, doi:10.1111/1467-9469.00116) should not be thought of as a form of Markov chain Monte Carlo, but rather a form of Markov chain assisted independent and identically distributed sampling. That is, it is a form of ordinary Monte Carlo (OMC) rather than Markov chain Monte Carlo (MCMC).
What is the probability that a sample of (Monte Carlo) size n misses an event of probability ε when doing OMC? Answer: (1 − ε)^n. Same question except for MCMC rather than OMC? Answer if only Harris recurrence is assumed: we only know that the event will occur eventually with probability one. Answer if geometric ergodicity is assumed: M ρ^n for some ρ < 1, which we only know how to calculate in toy problems, and some constant M which depends on the starting position. So this really doesn't guarantee anything for any particular n.
Regeneration in MCMC (Mykland, Tierney, and Yu, 1995, doi:10.1080/01621459.1995.10476507; Geyer and Thompson, 1995, doi:10.1080/01621459.1995.10476590) builds regenerations into Markov chains. Every so often the chain regenerates. Denote these times T_1, T_2, …. Then the parts of the Markov chain from just after a regeneration up to and including the next regeneration are independent random thingummies. I call them "tours". Since the tours are independent, they can be run on different computers in parallel. If an MCMC run consists only of complete tours, then the usual MCMC estimator (the average over the run) is an unbiased estimator of the expectation with respect to the target distribution that it estimates. But since the lengths of the tours are random, and you never know how long a tour will take, it is possible that you start off ten thousand tours and most of them finish within minutes but one of them takes ten thousand years, and you have to wait for it to finish to get your unbiased estimate. And since you never know whether there might be a tour like that (except by using very hard theory), you never know whether you are being fooled by pseudoconvergence (unlike with perfect sampling).
So called unbiased MCMC (Jacob, O’Leary, and Atchadé, 2020, doi:10.1111/rssb.12336) has all of the advantages and disadvantages of regeneration. It does provide an unbiased estimator but can also be fooled by pseudoconvergence. You can seem to have a good estimator from ten thousand coupled unbiased runs, but the ten thousand and first can take ten thousand years to do and give a wildly different answer.
R Types ExternalPtr and Raw
I said something wrong in class. The manual Writing R Extensions does cover external pointers in Section 5.13.
R CMD SHLIB
Here is the minimum you need to do to compile a C function and call it from R.

First make a file `qux.c`, for example

```c
void qux(double *x)
{
    x[0] = 42.0;
}
```

Now at the operating system command line (not in R) do

```
R CMD SHLIB qux.c
```

and then look at what you got (on my computer I see)

```
fagus$ ls
qux.c  qux.o  qux.so
```

The file with extension `so` is the shared library. It may have extension `sl` or `dll` on some other operating system.

Now go into R and run

```
> dyn.load("qux.so")
> .C("qux", x = double(1))
$x
[1] 42
```

and you see it works (the argument of R function `dyn.load` must be the actual name of the shared library).
A Rant about OOP
Execution in the Kingdom of Nouns by Steve Yegge.
Preallocation in R
A student asked in class about preallocating vectors and other stuff (matrices, arrays, data frames, tibbles) in R. This is a very deep subject.
Some googling found a lot of advice that says to do it, but no explanation about why.
Hadley Wickham puts his finger on the key issue: aliasing. But he doesn't explain how R knows about that.
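First, a minimal sketch of why the preallocation advice matters at all (timings are illustrative, not measured):

```r
n <- 1e5
# growing a vector: assigning past the end must allocate a longer
# vector and copy the old elements, over and over
system.time({
    x <- numeric(0)
    for (i in 1:n) x[i] <- sqrt(i)
})
# preallocating: each assignment can update in place because x has
# only one reference
system.time({
    x <- numeric(n)
    for (i in 1:n) x[i] <- sqrt(i)
})
```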
Before we can learn anything on the subject, we note that we are talking about calling R function `[<-`, which R says is

```
> `[<-`
.Primitive("[<-")
```

but which the manual (`help("[<-")`) says is generic (in the Details section). So maybe the first thing we need to figure out is how that works.
Actually no. The first thing we need to know is the semantics of this function, what it is supposed to do. Section 3.4.4 of the R Language Definition is clear. When the statement

```r
foo[bar] <- baz
```

is evaluated, what happens must be as if a copy of `foo` is made, then the subset assignment is made to the copy, and then that copy is associated with the name-also-called-symbol `foo`. So whether or not update in place (without copying) is done, it must have the same effect as if update with copying was done. This does not apply to assignments to environments or reference classes (which are mutable), but does apply to all other assignment functions like `names<-`.
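A minimal sketch of these semantics in action:

```r
foo <- c(1, 2, 3)
baz <- foo        # baz now refers to the same values as foo
foo[2] <- 99      # must behave as if a copy of foo were modified and rebound
baz               # unchanged: 1 2 3

e <- new.env()    # environments are mutable, so no copy semantics
f <- e
assign("x", 99, envir = e)
get("x", envir = f)   # 99: f sees the change
```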
Section 5.9.10 of Writing R Extensions is also somewhat confused, but at the end it clearly states that since R-4.0.0 reference counting (see below) is used and C preprocessor macros `MAYBE_REFERENCED`, `MAYBE_SHARED`, and `MARK_NOT_MUTABLE` are available in C code called from R.
Then we are on our own. None of the R manuals or any documentation explain further. We have to RTFS also called UTSL.
But reading C code is hard. And reading the incredibly gnarly C code for some of the lowest level parts of R is really hard. So I would just say "we don't want to go there" except there isn't anything else to do (except, of course, handwaving and bullshit, but that's already available in plenitude on the intertubes).
So the first thing to understand is that when R calls a primitive function like this (and some other functions, I don't understand the details) it goes through a dispatch table found in the C source file names.c, where we find that a call to R function `[<-` is translated to a call to C function `do_subassign`, which, after some grovelling in the source code, is found in C source file subassign.c.
And the first thing we see in this file is a bunch of stuff about special casing every type of assignment for efficiency. So right away we see that this code will be hard to read.
If we go to the definition of the function `do_subassign` we find

```c
/* The [<- operator. "x" is the vector that is to be assigned into, */
/* y is the vector that is going to provide the new values and subs is */
/* the vector of subscripts that are going to be replaced. */
/* On entry (CAR(args)) and the last argument have been evaluated */
/* and the remainder of args have not. If this was called directly */
/* the CAR(args) and the last arg won't have been. */

SEXP attribute_hidden do_subassign(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP ans;

    /* This code performs an internal version of method dispatch. */
    /* We evaluate the first argument and attempt to dispatch on it. */
    /* If the dispatch fails, we "drop through" to the default code below. */

    /* DispatchOrEval internal generic: [<- */
    if(R_DispatchOrEvalSP(call, op, "[<-", args, rho, &ans))
/*     if(DispatchAnyOrEval(call, op, "[<-", args, rho, &ans, 0, 0)) */
        return(ans);

    return do_subassign_dflt(call, op, ans, rho);
}
```
So we see this calls C function `R_DispatchOrEvalSP`, which is defined in this file, and it does a lot of complicated things before calling C function `R_DispatchOrEval`, which is defined in C source file eval.c. So we look in there and find its definition (which is long and complicated) is preceded by this comment

```c
/* DispatchOrEval is used in internal functions which dispatch to
 * object methods (e.g. "[" or "[["). The code either builds promises
 * and dispatches to the appropriate method, or it evaluates the
 * (unevaluated) arguments it comes in with and returns them so that
 * the generic built-in C code can continue.

 * To call this an ugly hack would be to insult all existing ugly hacks
 * at large in the world.
*/
```
This is not quite as programmer-hostile as the famous You are not expected to understand this comment in an early version of the UNIX source code.
The comment does inform us that this function only finishes the job if there is a method for R generic function `[<-` for the type of object which is the first argument. That is, if we are trying to execute the R command

```r
foo[bar] <- baz
```

and R object `foo` has class `"qux"` and there exists a method `[<-.qux` to handle such objects, then it will be called here to do the job. This task is complicated by the fact that we can have complicated expressions like

```r
as.symbol(paste0("foo", i))[bar > 6] <- baz * 10
```

and it says that if there is no method to handle the class of the object that `as.symbol(paste0("foo", i))` evaluates to, then this function does evaluate all of its arguments and hands the R objects they evaluate to on to the next function called, which is C function `do_subassign_dflt`, also found in C source file subassign.c.
And if we look inside the source code of this function, then we finally find what we are looking for (at least part of it)

```c
    /* If there are multiple references to an object we must */
    /* duplicate it so that only the local version is mutated. */
    /* This will duplicate more often than necessary, but saves */
    /* over always duplicating. */
    if (MAYBE_SHARED(CAR(args)) ||
        ((! IS_ASSIGNMENT_CALL(call)) && MAYBE_REFERENCED(CAR(args))))
        x = SETCAR(args, shallow_duplicate(CAR(args)));
```
But the code here is really ugly. The convention is that all caps like `MAYBE_SHARED` and `CAR` and the rest are not C functions but rather C preprocessor macros. So we need to look in some header file for them.

First, I happen to already know that the `CAR` macro is a LISPism (R is more like LISP under the hood than one might imagine), so this is the LISP CAR operator, and `CAR(args)` is the first argument of the function call.
We find these macros defined in the C header file Rinternals.h

```c
/* Macros for some common idioms. */
#ifdef SWITCH_TO_REFCNT
# define MAYBE_SHARED(x) (REFCNT(x) > 1)
# define NO_REFERENCES(x) (REFCNT(x) == 0)
#else
# define MAYBE_SHARED(x) (NAMED(x) > 1)
# define NO_REFERENCES(x) (NAMED(x) == 0)
#endif
#define MAYBE_REFERENCED(x) (! NO_REFERENCES(x))
#define NOT_SHARED(x) (! MAYBE_SHARED(x))
```
And now we are stuck unless we know some computer science. R has garbage collection, and some garbage collection uses reference counting. And reference counting can, in some cases, give incorrect results, but AFAIK when it does it says an object is shared when it actually isn't (so `MAYBE_SHARED` can give a false positive but not a false negative, which, as the comment above says, is OK: it wastes time but does not cause errors).
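You can watch this machinery from R with R function `tracemem` (a minimal sketch; the exact behavior depends on your R version):

```r
x <- numeric(3)
tracemem(x)       # ask R to report when x is copied
x[1] <- 1         # no copy: x is not shared
y <- x            # now the value is (maybe) shared
x[2] <- 2         # tracemem reports a copy here; afterwards x is unshared
x[3] <- 3         # so no further copy
untracemem(x)
```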
But what is that stuff about `SWITCH_TO_REFCNT`? Sometimes R uses reference counting and sometimes not? Or on some architectures and on some not? Or what?
So we are really still a long way from understanding this issue. We might have to read and fully understand thousands of lines of the R source code to really grok in fullness.
But this is probably more than any of you wanted to know, and it is even more than I wanted to know (except I am still left feeling dissatisfied and wish I knew more but am just a bit too lazy to learn more). So that's that!
Machine Language
I forgot what may be a very important architecture in the future: RISC-V.
There are of course many more architectures, but all except Intel and ARM are now mostly of historical interest.
Rules for Optimizing Computer Code
The link to the quotes about optimization.

LU and QR
Why do we use the QR decomposition rather than some other matrix decomposition to solve least squares problems? And why don't we use a matrix decomposition on M^T M rather than on M itself, where M is the model matrix? The answer isn't operation counts. It's accuracy (Fletcher, 1975, DOI:10.1080/01621459.1975.10480270).

Firstly, recall that the right way to think about least squares is "regression is projection". What determines a linear model is not the model matrix M but rather its column space V, the vector subspace spanned by its columns. Different model matrices, but same column spaces, same regression model (a statistical model is a family of probability distributions, and same column spaces, same family of probability distributions). Thus to really understand a regression model we have to get a handle on V. When we do a reduced QR decomposition of the model matrix, the columns of Q are an orthonormal basis for V.
A model matrix is just another basis for V, or perhaps, in case of collinearity, not even a basis (some columns of M need to be dropped to get a basis for V). But some bases are better than others in regard to accuracy. Orthonormal bases are the best. But allowing V to be specified in terms of model matrices allows users to specify linear models in arbitrarily bad ways. And software has to deal with that as best it can.
A condition number is a number that bounds the accuracy of the solution of a problem in terms of the accuracy of the data for a problem. For a least squares problem, we can take it to be the ratio of the largest to smallest singular values of the model matrix. If the model matrix has orthonormal columns, then this condition number is one, the best it can be. If there is collinearity, this condition number is infinite, the worst it can be. If this condition number is very large, we say the user's specification of the least squares problem is ill conditioned. That means the solution for the regression coefficients will be very inaccurate.
When you form M^T M you square the condition number of the problem, which means your answer will have half the accurate digits it might have had before you did that. So M^T M is something that is only OK in theory. You don't want to use that in the computer.
The QR decomposition with pivoting allows computation of the least squares solution without making the condition number any worse than it has to be. The condition number of the procedure is the condition number of the model matrix used to specify the problem. So that is why QR.
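A minimal sketch of the squaring of the condition number (the nearly collinear model matrix is made up for illustration):

```r
set.seed(42)
M <- matrix(rnorm(300), 100, 3)
M[, 3] <- M[, 1] + 1e-6 * rnorm(100)      # nearly collinear columns
kappa(M, exact = TRUE)                    # condition number of M
kappa(crossprod(M), exact = TRUE)         # that of M^T M: roughly the square
y <- rnorm(100)
beta.qr <- qr.coef(qr(M), y)              # via QR, essentially what lm() does
beta.ne <- solve(crossprod(M), crossprod(M, y))  # normal equations: avoid
```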
R Functions for Matrix Factorizations
We know about R functions `qr`, `svd`, `eigen`, and `chol` in base R that do the QR decomposition, singular value decomposition, spectral decomposition, and Cholesky decomposition. How about LU?
Fortran function `DGETRF` does the LU decomposition. R has the source code for this in the file dlapack.f. And it is part of the official R API, with prototype in the include file R_ext/Lapack.h. So you can use it in your C, C++, or Fortran code called from R. But R does not appear to use it. For anything.
R package Matrix does do LU and Schur decompositions, which base R does not appear to. It also does sparse Cholesky and sparse QR, which are more efficient than the base R routines for sparse matrices.
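For example (a minimal sketch; `lu` and `expand` are from R package Matrix):

```r
library(Matrix)                  # recommended package, ships with R
A <- Matrix(c(2, 1, 1, 3, 2, 1, 1, 1, 4), 3, 3)
decomp <- lu(A)                  # LU decomposition with partial pivoting
str(expand(decomp))              # the factors L and U and the permutation P
```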
How R Calls pnorm and Friends
```
> pnorm
function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
.Call(C_pnorm, q, mean, sd, lower.tail, log.p)
<bytecode: 0x55c3a6790c58>
<environment: namespace:stats>
```

So R function `pnorm` calls a C function to do the job through the `.Call` interface. The help for R function `.Call` says the first argument is

```
.NAME: a character string giving the name of a C function, or an
       object of class ‘"NativeSymbolInfo"’,
       ‘"RegisteredNativeSymbol"’ or ‘"NativeSymbol"’ referring to
       such a name.
```

Since this is obviously not a character string, it must be one of these other thingummies, which are described in the section about registering native routines in Writing R Extensions. But here tricks are being played. That section suggests the name of the function is `pnorm`.
But if we look in R source file init.c for package stats we find

```c
CALLDEF_MATH3_2(pnorm),
```

used to register this function, where the C preprocessor macro being used here is defined by

```c
#define CALLDEF_MATH3_2(name) CALLDEF_DO(name, 5)
```

and this is invoking another C preprocessor macro defined by

```c
#define CALLDEF_DO(name, n) {#name, (DL_FUNC) &do_##name, n}
```

so it seems as if the name of the C function actually registered is `do_pnorm` (you have to know what the C preprocessor does with the double hash: it is the token-pasting operator that makes one token out of two, so, since here `name` is `pnorm`, it registers the C function pointer `do_pnorm`).
Now if we look for that function, we find in R source file statsR.h in package stats a prototype for that function

```c
SEXP do_pnorm(SEXP sa, SEXP sb, SEXP sc, SEXP sI, SEXP sJ);
```

but we don't find (at least not by grepping for `do_pnorm`) the function definition. So we look harder and find in R source file distn.c in package stats

```c
DEFMATH3_2(pnorm)
```

where this C preprocessor macro is defined as
```c
#define DEFMATH3_2(name) \
    SEXP do_##name(SEXP sa, SEXP sb, SEXP sc, SEXP sI, SEXP sJ) { \
        return math3_2(sa, sb, sc, sI, sJ, name); \
    }
```

And here `name` is again `pnorm`, so this does indeed define a function `do_pnorm` which is a one-liner with body

```c
return math3_2(sa, sb, sc, sI, sJ, pnorm);
```

where now `pnorm` is a C function pointer to the C function of that name.
So now we have to find C function `math3_2`, which is defined in R source file arithmetic.c, where we find

```c
static SEXP math3_2(SEXP sa, SEXP sb, SEXP sc, SEXP sI, SEXP sJ,
                    double (*f)(double, double, double, int, int),
                    SEXP lcall)
{
```

and it is something of a worry that we are calling this with six arguments but it takes seven. However, we check that the last is not used anywhere in the function, so it is OK to omit it (I think; at least the R core team think so).
Then if we look at the rest of the body of the C function `math3_2` we see another preprocessor macro `MOD_ITERATE3`, which is defined in R source file R_ext/Itermacros.h to be

```c
#define MOD_ITERATE3_CORE(n, n1, n2, n3, i, i1, i2, i3, loop_body) do { \
    for (; i < n; \
         i1 = (++i1 == n1) ? 0 : i1, \
         i2 = (++i2 == n2) ? 0 : i2, \
         i3 = (++i3 == n3) ? 0 : i3, \
         ++i) { \
        loop_body \
    } \
} while (0)

#define MOD_ITERATE3(n, n1, n2, n3, i, i1, i2, i3, loop_body) do { \
    i = i1 = i2 = i3 = 0; \
    MOD_ITERATE3_CORE(n, n1, n2, n3, i, i1, i2, i3, loop_body); \
} while (0)
```

and this takes care of the recycling of indices.
And then going back to the source code of `math3_2` we see that it fills up the result vector with repeated calls to `f`, which is a C function pointer pointing to the C function `pnorm` (in the instance we are tracing).

So that was a very long winded way of seeing (and we omitted a lot of detail) that the C function that gets called to do the basic work of R function `pnorm` is the C function `pnorm`.
And that C function is found in the R source file pnorm.c where we find this (part of a) comment

```c
 *  SYNOPSIS
 *
 *   #include <Rmath.h>
 *
 *   double pnorm5(double x, double mu, double sigma, int lower_tail,int log_p);
 *           {pnorm (..) is synonymous and preferred inside R}
```

so the name of the C function is actually `pnorm5`, but the comment says the name `pnorm` is preferred (better style), so this must be defined in some header file, probably the one cited in the comment, Rmath.h, where we do indeed find

```c
#define dnorm dnorm4
#define pnorm pnorm5
#define qnorm qnorm5
```

So C function `pnorm5`, also called `pnorm`, is called once for each value in the vector returned by R function `pnorm`.
In your C code, you can ignore all of this nonsense and just call `pnorm` directly (if you have

```c
#include <Rmath.h>
```

before any call to C function `pnorm`). And you don't really have any choice but to call it directly, because C function `math3_2` and the like are not part of the R API.
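For example, here is a minimal sketch of C code for the `.Call` interface (the function name `mypnorm` is made up) that calls C function `pnorm` directly:

```c
#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>

/* standard normal CDF applied elementwise to a vector of type "double" */
SEXP mypnorm(SEXP x)
{
    int n = LENGTH(x);
    SEXP result = PROTECT(allocVector(REALSXP, n));
    for (int i = 0; i < n; i++)
        /* pnorm(q, mean, sd, lower_tail, log_p), from Rmath.h */
        REAL(result)[i] = pnorm(REAL(x)[i], 0.0, 1.0, 1, 0);
    UNPROTECT(1);
    return result;
}
```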
2021 Announcements of Permanent Interest
Randomized Algorithms
It is not known whether randomized algorithms are better than deterministic
algorithms.
In computational complexity theory,
BPP is the
class of problems solved in polynomial time by randomized algorithms
and
P is the
class of problems solved in polynomial time by deterministic algorithms.
The link cited above says that it is unknown whether P = BPP, although most complexity theorists conjecture today that this is so. A post on Scott Aaronson's blog says Avi [who just won the Abel prize] "hasn't yet proved P = BPP, just taken some major steps toward it", although the news media missed the distinction.
R Packages for Optimization
The R packages mentioned in class but not on the slides were
- R package clpAPI which is very high quality linear programming software, competitive with very expensive commercial solvers. Its documentation is horrible unfortunately. The package has been defunct for several months. I have replaced it with the following in my code.
- R package glpkAPI which is also very high quality linear programming software, competitive with very expensive commercial solvers. Its documentation is much better than the preceding package, but still borderline unreadable.
- R package alabama, which has done all of the (fairly low dimensional) examples I have used it for in several courses, mostly calculating likelihood-based confidence intervals and similar things. The function in this package I have used is R function `auglag`.
Web Scraping HTML Tables
Reading the bug reports on GitHub for R package htmltab, it looked like the package had possibly been abandoned by the author. No! It is back in business: https://cran.r-project.org/package=htmltab. So it looked like we would switch to the Hadleyverse (except we don't have to now).
```
> library(rvest)
Loading required package: xml2
> foo <- read_html("https://www.avca.org/polls/di-women/2-22-21.html")
> bar <- html_table(foo)
> class(bar)
[1] "list"
> sapply(bar, class)
[1] "data.frame"
```

It looks like R function `html_table` returns a list of all of the tables in the document converted to R data frames.
```
> baz <- bar[[1]]
> names(baz)
[1] "Rank"                                 "School (First-Place Votes Adjusted)"
[3] "Total Points Adjusted"                "2020-21 Record"
[5] "Previous Rank"
```

If we look at the web page or just at the second column, we see that AVCA have followed very bad data practice and put two bits of data in one field. We have to extract the two fields using regular expressions, which is gross. Don't do that!
```
> qux <- baz[ , 2]
> head(qux)
[1] "Wisconsin (51)" "Texas (8)"      "Kentucky (1)"   "Nebraska"
[5] "Minnesota"      "Baylor"
> school <- sub(" *\\([0-9]*\\)$", "", qux)
> head(school)
[1] "Wisconsin" "Texas"     "Kentucky"  "Nebraska"  "Minnesota" "Baylor"
> votes <- sub("^.*\\(([0-9]*)\\)$", "\\1", qux)
> head(votes)
> votes <- as.numeric(votes)
Warning message:
NAs introduced by coercion
> votes[is.na(votes)] <- 0
```

So now we have our data. The R is gross because the data were gross.
```
> fred <- data.frame(Rank = baz$Rank, School = school,
+     First.Place.Votes.Adjusted = votes,
+     Total.Points.Adjusted = baz[["Total Points Adjusted"]],
+     Record = baz[["2020-21 Record"]], Previous.Rank = baz[["Previous Rank"]])
> head(fred)
  Rank    School First.Place.Votes.Adjusted Total.Points.Adjusted Record Previous.Rank
1    1 Wisconsin                         51                  1490   10-0             1
2    2     Texas                          8                  1432   14-0             2
3    3  Kentucky                          1                  1359   12-0             3
4    4  Nebraska                          0                  1313    7-1             4
5    5 Minnesota                          0                  1276    9-1             5
6    6    Baylor                          0                  1201   13-3             6
```

And now we are ready to work with these data. If the web site authors had given more thought to making their site computer-readable, we wouldn't have had to do so much work.
Denormalized Numbers
Finishing the example on denormalized numbers I was doing in class.
```
> exp(-745)
[1] 4.940656e-324
> exp(-746)
[1] 0
```

So we see that `exp(-745)` is close to the smallest positive nonzero number defined in IEEE arithmetic. What is it actually?
```
> library(Rmpfr)
> foo <- mpfr(exp(-745), 120)
> foo
1 'mpfr' number of precision  120   bits
[1] 4.9406564584124654417656879286822137237e-324
> log2(1 / foo)
1 'mpfr' number of precision  120   bits
[1] 1074
```

So the result `foo` is exactly 2^(-1074) in the computer's arithmetic. And we know from calculus that that isn't exact (it should be an irrational number). We can also see the inexactness in
```
> log(exp(-745))
[1] -744.4401
> log(2^(-1074))
[1] -744.4401
> log2(exp(-745))
[1] -1074
```

Note that there is nothing special about `exp` and `log` here. Denormalized numbers arise whenever results get in this range:

```
> .Machine$double.xmin
[1] 2.225074e-308
> 2^(-1074)
[1] 4.940656e-324
> 2^(-1074) / 2
[1] 0
```
`.Machine$double.xmin` is the smallest positive normalized number (in IEEE double precision arithmetic) and 2^(-1074) is the smallest positive denormalized number (in IEEE double precision arithmetic). When you are in the range between them, you are denormalized. When you are even below that range, you have underflow.
Makevars File
As discussed in class, you cannot put nonstandard compiler flags in the src/Makevars file of an R package. For example, if you add `-Wall -Wextra` to the definition of `PKG_CFLAGS`, then `R CMD check` will emit the warning
```
Non-portable flags in variable 'PKG_CFLAGS':
  -Wall -Wextra
```

So what you do is follow the instructions in file package/notes in R repository foo in my account at GitHub.com that I illustrated in class Feb 17.
Searching CRAN
It came up in talking with a student that it would be helpful to cover
how to search CRAN
(as the following announcement is based on).
This is done with R package pkgsearch. For example,

```r
library("pkgsearch")
advanced_search("Arnoldi")
```

returns no hits. But

```r
advanced_search("Conjugate Gradient", size = 200) -> foo
```

returns 178 hits (as this is written). The result `foo` is a data frame with variables
```
> names(foo)
 [1] "score"                "package"              "version"
 [4] "title"                "description"          "date"
 [7] "maintainer_name"      "maintainer_email"     "revdeps"
[10] "downloads_last_month" "license"              "url"
[13] "bugreports"           "package_data"
```

And we can now use R function `grepl` to whittle down our search. For example, the method of conjugate gradients can be used either for optimization or for solving linear equations. If we are only interested in the latter, then

```
> bar <- grepl("equation", foo$title, ignore.case = TRUE) |
+     grepl("equation", foo$description, ignore.case = TRUE)
> sum(bar)
[1] 11
> foo$package[bar]
 [1] "pcg"         "cPCG"        "sanic"       "hgm"         "LassoGEE"
 [6] "shinyKGode"  "gclm"        "KGode"       "deGradInfer" "calculus"
[11] "rootSolve"
```

cuts down our task. At this point we had to look at the actual titles, descriptions, and web pages for the packages.
Twenty-First Century Numerical Analysis
A search of CRAN turned up two packages that use modern methods for solving linear equations.
- R package sanic does BiCGSTAB (Biconjugate gradient stabilized method)
- R package Rlinsolve does BiCGSTAB and GMRES.
Creating R Lists in C Called from R
The example I was using for constructing an R object of type `"list"` inside C code called from R using the `.Call` interface was C function `scdd` in R package rcdd, which is on GitHub. The code that constructs an R object of type `"list"` begins at line 268.
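The general pattern looks like this (a minimal sketch, not the rcdd code itself; the function name `makelist` is made up):

```c
#include <R.h>
#include <Rinternals.h>

/* return list(x = 3.14, y = 42L) to R via the .Call interface */
SEXP makelist(void)
{
    SEXP result = PROTECT(allocVector(VECSXP, 2));     /* type "list" is VECSXP */
    SEXP resultnames = PROTECT(allocVector(STRSXP, 2));
    SET_VECTOR_ELT(result, 0, ScalarReal(3.14));
    SET_VECTOR_ELT(result, 1, ScalarInteger(42));
    SET_STRING_ELT(resultnames, 0, mkChar("x"));
    SET_STRING_ELT(resultnames, 1, mkChar("y"));
    setAttrib(result, R_NamesSymbol, resultnames);     /* names attribute */
    UNPROTECT(2);
    return result;
}
```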
Inheritance in OOP
I may have given the wrong impression in my in class answer about inheritance.
The R S3 OOP system does not have inheritance in the sense of C++ and Java. In those languages fields and methods of a class can be inherited from its superclass. In R S3 there is no such thing. But there is a very weak form of inheritance in that the `class` attribute of an object can be a vector, and one can think of the components of this vector as being class, superclass, superclass of superclass, and so forth.
Thus one should never test for class equality. The R idiom is

```r
if (inherits(foo, "Bar"))
```

for "does object `foo` have class `"Bar"`".
R Type System
A better answer about when to use `typeof`, `mode`, `storage.mode`, or `class`.
The R Language Definition, Section 1, Objects, and Section 5, Object-oriented programming, give the better answers.
- R function `typeof` gives the type of the object as R considers it. As the manual cited above says: "The R specific function typeof returns the type of an R object. Note that in the C code underlying R, all objects are pointers to a structure with typedef SEXPREC; the different R data types are represented in C by SEXPTYPE, which determines how the information in the various parts of the structure is used."
- R function `mode` is for backwards compatibility with the S programming language (which R is a reimplementation of). As the manual cited above says: "Function mode gives information about the mode of an object in the sense of Becker, Chambers & Wilks (1988) and is more compatible with other implementations of the S language."
- R function `storage.mode` is also for backwards compatibility with the S programming language. As the manual cited above says: "Finally, the function storage.mode returns the storage mode of its argument in the sense of Becker et al. (1988). It is generally used when calling functions written in another language, such as C or FORTRAN, to ensure that R objects have the data type expected by the routine being called. (In the S language, vectors with integer or real values are both of mode "numeric", so their storage modes need to be distinguished.)"
- R function `class` gives the class of the object as the various OOP systems of core R consider it. In the S3 OOP system the class of an object is just an attribute named "class" that is an atomic vector of type "character". So classes are nothing like in C++ or Java. They don't say anything about the object except the name of the class. They are used in method dispatch of generic functions.
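A minimal sketch of how the four functions differ on the same objects:

```r
x <- 1:3
typeof(x)          # "integer"
mode(x)            # "numeric" (S lumps integer and double together)
storage.mode(x)    # "integer" (what matters for .C and .Fortran)
class(x)           # "integer" (implicit class, no class attribute)

df <- data.frame(a = 1)
typeof(df)         # "list" (a data frame is a list underneath)
class(df)          # "data.frame" (the class attribute, used for dispatch)
```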
Mutability
A quotation from Advanced R by Hadley Wickham
Surprisingly, the implementation of reference classes is almost entirely in R code — they are a combination of S4 methods and environments. This is a testament to the flexibility of S4.

It also says that if you need mutability but don't need OOP, then you don't need reference classes, you just need environments. That's what I use.
Changing Parent Environment
My answer in class about whether environments can be changed on the fly was completely wrong. I should not have guessed. As always, TRT is to RTFM. And `help(parent.env)` says, among other things,

```
environment(fun) <- value
parent.env(env) <- value
```

(these are the only assignment functions documented on this help page). So both the environment of a function and the parent environment of an environment can be assigned arbitrarily. An examination of the R code base shows that the first is done in model fitting functions and methods functions, and the latter is done only in methods functions and in R startup. So it is not something you would ordinarily do, but it is something you can do. The documentation also says

The replacement function ‘parent.env<-’ is extremely dangerous as it can be used to destructively change environments in ways that violate assumptions made by the internal C code. It may be removed in the near future.

So changing the parent of an environment is something you can do but are strongly advised not to!
Removing Sensitive Material from Git and GitHub
A better answer to what to do when sensitive material has mistakenly been committed to git and pushed to GitHub is found in this GitHub documentation. It suggests using either the `git filter-branch` command or the BFG Repo-Cleaner open source tool. I think I have used `git filter-branch` in the past.
But it also says
Warning: Once you have pushed a commit to GitHub, you should consider any data it contains to be compromised. If you committed a password, change it! If you committed a key, generate a new one.
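From memory, the `git filter-branch` incantation looks roughly like the following (PATH-TO-SENSITIVE-FILE is a placeholder; check the linked GitHub documentation before relying on this):

```
git filter-branch --force --index-filter \
    "git rm --cached --ignore-unmatch PATH-TO-SENSITIVE-FILE" \
    --prune-empty --tag-name-filter cat -- --all
git push origin --force --all
```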
2020 Announcements
R Packages that do Regression
The info about R packages that do regression is linked on the course notes about R packages, in particular, follow the links to git repository bar and the accompanying gist.
My way of converting formulas to x and y is one R way, but not the only way. Instead you can follow http://developer.r-project.org/model-fitting-functions.html if you like.
Robustness
Robustness came up today (Apr 24) in class. This area isn't as trendy as it used to be, but it is still very important. Every statistician should know something about it. For a start read the help pages for R functions lqs and rlm in R package MASS and try the examples.
Then for more, the following

```r
library("pkgsearch")
foo <- advanced_search(Title = "robust", Description = "robust", size = 1000)
```

returns (as I write this) 144 CRAN packages with "robust" in the Title or Description fields of the package DESCRIPTION file. Of course, in some of these, robust is used with a different meaning from the robust in "robust statistics" (Wikipedia pages Robust statistics and Robust regression). But there are clearly a lot of robustness packages on CRAN (not to mention the ones in the recommended package MASS that comes with every R installation — which we already did mention, of course).
R Standalone Library of C Functions
New handout about the R standalone library under Course Notes in the navigation to the left.
The authoritative documentation is Section 6.17 of Writing R Extensions. This allows use of many C functions that back up R functions, including the R random number system and all of the rest of R functions for probability distributions, from C code not called from R.
Branch and Bound Algorithm
This is in addition to the slides on Branch and Bound under Slides in the navigation to the left.
R package glmbb (https://cloud.r-project.org/package=glmbb) does branch and bound and is programmed in R (no C or C++). It uses three main ideas
- It uses the R formula mini-language to specify models (all possible models), adding and dropping terms to move between models.
- It uses an R environment object to store what it thinks of as global variables. It does not store these variables in the R global environment.
- It uses recursion. There is a function that does branch and bound for all models in a specified set. This function works by partitioning the specified set and then calling itself on each element of the partition.
Alternatives to using R environments to store R objects are various CRAN packages that implement hash tables and may be more efficient than R environments. In particular, R package filehash, which stores in a file on disk rather than in memory, may allow slightly larger problems to be done.
But, since branch-and-bound converts exponential in n searches to exponential in n searches (with a smaller constant), nothing will allow really large ("big data") problems. That is why things like the LASSO were invented. They "work" on very large problems for some definition of "work" (they provide answers with no guarantee they are anywhere near the best).
Addendum added in 2023: Core R now has hash tables (in R package utils, which is a recommended package installed by default in every installation of R). See

```r
?utils::hash
```

This is still marked "Experimental" but is the way of the future. This has been available since R-4.2.0.
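A minimal sketch of both idioms (keys and values made up):

```r
# the old idiom: an environment used as a hash table
tab <- new.env(hash = TRUE)
assign("alpha", 1.5, envir = tab)
get("alpha", envir = tab)
exists("beta", envir = tab)    # FALSE

# the new idiom, since R-4.2.0 (still marked experimental)
h <- utils::hashtab()
utils::sethash(h, "alpha", 1.5)
utils::gethash(h, "alpha")
```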
Using SSH Without Passwords
A note about using the ssh command without passwords.
1. First you have to have ssh. On UNIX (Linux and Mac OS X) it just comes with the computer. On up to date versions of Windows 10 it can easily be installed (so they say — I haven't tried it since I don't do windows).
2. Then, if you have never used ssh before, do (from the command line in a terminal)
   ```
   ssh username@lts.cla.umn.edu
   ```
   where you replace username with your actual U of M username (the one you log in with to read mail, register for courses, use the library, etc.). Type your U of M password when prompted and you should be on lts.
3. Log back out (say exit), because we need to do some work locally.
4. (Do this step once per computer you own. If you have already done this for working with GitHub, don't do it again.) From the command line on your computer do
   ```
   ssh-keygen
   ```
   Answer any questions it asks by choosing the default (hit return) until you get to Enter passphrase (empty for no passphrase):. Enter a passphrase. This can be as long as you like, and should be memorable. It should not be used for anything else. Then you have to enter it again to make sure you typed it correctly. Then it should say the keys were successfully created. If you do
   ```
   ls .ssh
   ```
   you should see two files
   ```
   id_rsa  id_rsa.pub
   ```
   The first is your private key. It stays secret. The only copy anywhere should be right where it is now, on your computer. The second is your public key. You can give that out to anyone and move it to any computer.
5. The first thing you do is copy it to lts
   ```
   scp .ssh/id_rsa.pub lts.cla.umn.edu:
   ```
   (Note the colon at the end of this command. It's important. You will have to type your U of M password again for this to work.)
6. Now log back into lts again, just like in item 2 above.
7. This time we have some work to do. We need to add this public key to the authorized_keys file in the .ssh directory on the remote computer (in this case lts).
   ```
   cat id_rsa.pub >> .ssh/authorized_keys
   ```
   (You can now remove the file id_rsa.pub if you like; only the copy in .ssh/authorized_keys is used.) If the command above complains that the directory does not exist, make it:
   ```
   mkdir .ssh
   ```
8. Now log out of lts again, just like in item 3 above.
9. Now log back into lts again, just like in item 2 above. This time you are not asked for a password. This time you are asked for the passphrase for that SSH key. Type that passphrase (the one you typed into ssh-keygen). This should get you into lts. On my computer (an Ubuntu Linux box) it pops up a new window for me to type the passphrase in.
10. Now log out of lts again, just like in item 3 above.
11. Now log back into lts again, just like in item 2 above. This time you should just get in. No password and no passphrase. Your computer remembers the key for you. You will never need to type this passphrase again unless you log out of your account on your computer or reboot your computer. So you have to type this passphrase once per reboot rather than once per ssh or scp command.
12. Since you have the same file system on lts and compute, you can now get into compute without a password or passphrase. Try that. Same as item 2, except change lts to compute.
13. Now copy your public key to GitHub, either github.umn.edu or github.com or both. Log into your account on one or the other.
    - Find the main menu in the upper right corner of your home page. Choose Settings.
    - Now on the left you should see a Personal settings menu, one item of which is SSH and GPG keys. Click that.
    - Now in the upper right you should see a green button labeled New SSH key. Click that.
    - Now at a command line on your computer do
      ```
      cat ~/.ssh/id_rsa.pub
      ```
      You will see one very long line of text that probably wraps to several lines in the terminal window. Cut and paste that into the text form labeled Key at GitHub. You probably also want to give this key a Title so you will know what it is if you have more than one key (from more than one computer — you can use this key on this computer for everything). Then click Add SSH key. You should now see this key among your SSH keys. At a command line on your computer do
      ```
      ssh-add -l -E md5
      ```
      and you should see the hash of the key that GitHub shows.
14. Now in any git repo on your computer that has this GitHub (the one you put the SSH key in) as its origin (what you see when you do `git remote -v`), you should be able to do `git pull origin master` or `git push origin master` without a password or passphrase.
15. GitHub now suggests you call the main branch in any Git repo main rather than master and has instructions on how to do this. As far as git or GitHub is concerned, you can call it anything you want.
Background Jobs
As mentioned in lecture, the UNIX way (including Mac OS X and Linux) to do background jobs is just to put an ampersand at the end of the command line. This is mentioned in help("BATCH") in R.
```
R CMD BATCH --vanilla foo.R &
```

will run all the R in file foo.R in the background. You get your command line back immediately to do other work while the background job is going. Unix functions ps and top can tell you whether the background job is still running and how much time it has taken. top is especially easy to use, requiring no command line arguments in ordinary use.
But if you log out of the computer or even kill the window in which the background job was started, your background job will be killed by default. To avoid this do
```
nohup R CMD BATCH --vanilla foo.R &
```

The nohup says let the background job run to completion no matter what (unless the computer reboots, of course).
If you are going to run the background job on computers that other people use for everyday work, then you should `nice` your job, giving it lower priority than processes that involve user interaction. For that do

```
nice nohup R CMD BATCH --vanilla foo.R &
```

For more on all of this see the part about batch processing in some old slides about parallel processing (the rest of that slide deck is defunct; see the current course notes on parallel processing).
Sockets Cluster over Multiple Machines
Here is the example I couldn't do in class.
```
ssh -A ssh.stat.umn.edu
ssh -A oak.stat.umn.edu
R --vanilla
library("parallel")
hosts <- c("oak", "bald")
cl <- makePSOCKcluster(hosts, homogeneous = FALSE)
parLapply(cl, seq(along = hosts), function(x)
    list(hostname = system("hostname", intern = TRUE), pid = Sys.getpid()))
stopCluster(cl)
q()
exit
exit
```
- The `-A` flag enables agent forwarding so we don't need a password if ssh keys have been set up, as is discussed here (the rest of this slide deck is obsolete, replaced by our current course notes on parallel processing).
- I have to go to oak because ssh, also called lts.cla.umn.edu, does not allow R jobs.
- The `homogeneous = FALSE` tells R not to assume that all the hosts have Rscript in the same place. That's the bit that threw me in class.
Spatial Point Processes
The reading on spatial point processes (Stat 8501 Lecture Notes on Spatial Point Processes) mentioned right at the end of class one day is NOT REQUIRED. That is a theory class, this is a computing class. Nevertheless, if anyone is interested in spatial point processes, that is a good place to start.
For example, the question asked near the end of class was about an equation in the reading that is equation (5) in these 8501 notes, and that equation in the 8501 notes is surrounded by a lot more explanation, essentially what I said in answer to the question.
Initial Sequence Estimators for MCMC
New handout about autocovariance functions for functionals of reversible Markov chains. Expands an answer to a question asked in class.
MCMC Long Runs
In answer to the question about how long to run MCMC, I should have mentioned the soufflé rule: it takes as long as it takes (if you take a soufflé out of the oven too soon, it is inedible).
Parameterized SQL Queries
In regard to the question about parameterized SQL queries mentioned in the vignette for R package RSQLite, we note the following
- Parameterized queries do not seem to be a part of standard SQL, which is why I could not find a section about them in W3schools SQL Tutorial, although they are mentioned under the topic "SQL injection".
- Even in R there does not seem to be a standard syntax for parameterized queries used in all CRAN packages involving SQL.
- Thus the examples in the RSQLite package vignette seem to be specific to RSQLite and not portable to other databases (some other syntax might work for them).
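For concreteness, here is a minimal sketch of a parameterized query using DBI and RSQLite (the table and the cutoff are made up; as the last item says, the syntax may not carry over to other databases):

```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
# the ? placeholder is filled in from params by the driver, so the
# value is never pasted into the SQL string (no SQL injection)
dbGetQuery(con,
    'SELECT * FROM iris WHERE "Sepal.Length" < ?',
    params = list(4.6))
dbDisconnect(con)
```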
Copy and Paste Programming
The coolest quote I have ever seen about computing. If you learn nothing
else in this course but really learn this, it would be a very valuable course.
This is by Douglas Crockford, from the "Read This First" chapter of his new book.
I strongly recommend that you do not copy and paste code that you do not understand, even mine. This seems to have become a standard practice, but it is dangerous and reckless. It is not nearly as stupid as installing packages that you haven't even looked at, but it is still really bad. Given the current state of the art, the most important security filter is your brain. Use it. This is important.
C++
Why your humble instructor is anti-C++.
- Recently seen on Hacker News, an article Initialization in C++ is Seriously Bonkers: Just Start With C. And your humble instructor just ends with C.
- Less recently seen, a talk by Scott Meyers titled "The Last Thing D Needs" [a bunch of C++ features]. For those who don't know, Scott Meyers is a C++ wizard (see next item).
[a bunch of C++ features]. For those who don't know Scott Meyers is a C++ wizard (see next item). - The series of books by Scott Meyers
- Effective C++: 55 Specific Ways to Improve Your Programs and Designs, 3rd Edition
- More Effective C++: 35 New Ways to Improve Your Programs and Designs
- Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library
- Effective Modern C++: 42 Specific Ways to Improve Your Use of C++11 and C++14
- Some personal experience trying to use GiNaC, which is a recursive acronym that stands for "GiNaC is not a CAS", where CAS is a TLA that stands for "computer algebra system" (Mathematica, Maple, MACSYMA, or the like). I did not get far enough to become a knowledgeable C++ programmer.
- (added for 2022) The C++ Frequently Questioned Answers, an anti-FAQ.
Good API Design
API is a TLA that stands for application programming interface. For us that mostly means what functions are provided and how those functions are used (what their arguments are).
Speaking of Scott Meyers, he has another wonderful talk titled "The Most Important Design Guideline", the guideline being: Make Interfaces Easy to Use Correctly and Hard to Use Incorrectly.
See also the section on GIEMO in my Stat 3701 notes.
Good Programming Practices
Speaking of wonderful talks, the best talk I know of about R is by Martin Mächler (a member of the R core team), Good Practices in R Programming, which is also recommended on my Stat 3701 home page.
Object-Oriented Programming
Your curmudgeonly instructor is also not a big fan of OOP (object-oriented programming). In this he has some fellow travelers (see here or here).
CPU Caches
Speaking of Scott Meyers, he has another wonderful talk titled "CPU Caches and Why You Care". This talk is about hardware and what you need to do to make it go fast. Two guidelines (of several):
- Small is the same as fast, and when we are talking about hardware cache small means really small, a few hundred bytes.
- The hardware loves arrays in C/C++ terminology, which are called vectors in R terminology. It hates all other data structures. In particular it hates OOP.