Statistics 8054 (Geyer, Spring 2024)

Announcements

One more reading assignment assigned.

R package qux was indeed broken, as found in class. Fixed.

The thing I was confused about in class, I figured out. When not using parallel computing, the toy problem took roughly 7 seconds (7.1 ± 0.6). When using parallel computing, the toy problem took roughly 1.7 seconds (1.66 ± 0.03). The number that was bigger, ranging from 9.5 to 11.7 seconds is for all child processes together. So the parallel processing does more work (there is communication overhead) but the elapsed time (wall clock time) is much less.

The problem with git clone hanging was that github.com does not understand git. The syntax

git clone git://github.com/cjgeyer/foo.git

used to work, and according to the documentation for the current version still works. But it doesn't work with github.com. So I changed the course notes to use the alternative syntax

git clone git@github.com:cjgeyer/foo.git

which is what github.com now says to use and which does work.

Policy in this class is that I am going to lecture as little as possible. There will be reading to be done outside of class, and then in class there will be only questions from students and answers from me (or other students if they volunteer). Since students these days seem reluctant to ask questions, this will be mandatory. I will take attendance and record each question asked. Depending on how many students there are in the class and how long the answers are (one really deep question could take a whole class to answer), each student may need to think up two questions per class period.

I explained to someone in office hours that questions do not have to seem deep to have deep answers. So don't worry about how your questions sound. As the 2021 class saw, even when the question seems misguided — making some assumption that is false — it can still be very interesting and important to know the answer. So just ask!

XKCD Project Roles

Roles for Doing Assignments 2 and 3.

XKCD Project Roles
Name	Role
Alexander Kuhn	bdfl
Rui Zhou	R function dxkcd
Difan Ouyang	R function pxkcd
Seung Won Lee	R function qxkcd
Jeonghwan Lee	R function rxkcd
Jinwen Fu	tests
Hyun Jung Koo	tests
Benjamin Lewis	tests
An Luo	vignette

Class Project Presentations

These are the times for presentations about class projects.

Class Project Presentations
Name	Date	Time
Rui Zhou	2024-04-19	12:45
Alexander Kuhn	2024-04-22	12:20
An Luo	2024-04-22	12:45
Jinwen Fu	2024-04-24	12:20
Benjamin Lewis	2024-04-24	12:45
Jeonghwan Lee	2024-04-26	12:20
Difan Ouyang	2024-04-26	12:45
Seung Won Lee	2024-04-29	12:20
Hyun Jung Koo	2024-04-29	12:45

The code


u <- "https://www.stat.umn.edu/geyer/8054/"
library(rvest)
foo <- read_html(u) |> html_elements("table") |> html_table()
foo
bar <- foo[[2]] |>
    transform(DateTime = paste(Date, Time)) |>
    transform(DateTime = as.POSIXct(DateTime, tz = "America/Chicago"))
bar
sapply(bar, class)

reads this table (so this is an example of how to do an HTML table right).

Course Material

Course material for this course is scattered all over the place. We will look at some material from

this web site, which has copies of stuff from previous versions of this course,
some of my GitHub repositories, which originated as demo packages used for this course
- R packages foo and fooRegister both in repository foo,
- R package bar in repository bar, which is also accompanied by a gist about regression packages,
- R package baz in repository mat,
- R package qux in and repository qux.
- and R package linkingTo in and repository linkingTo,
the web site for my version of STAT 3701, which, although at a lower level than this course, has topics not covered before in this course and also some details that are newer than what this course has,
the web site for my part of graduate student orientation 2018, which, although at a lower level than this course, also has topics not covered in this course (using Rmarkdown in Rstudio).
the book Advanced R by Hadley Wickham, available in various formats from Amazon and other booksellers but also available for free on the web.

Needless to say this is too much material for one course. We can discuss what we want to do.

2024 Announcements of Permanent Interest

The Law of Demos

A student asked whether their class project presentation should include a live demo and I referred to the Law of Demos (they never work).

Following an R markdown vignette (or other notes), like I do in class, is better.

R Primitive Functions

Chapter 2 of R Internals has a lot about primitive functions and R function .Primitive.

2023 Announcements of Permanent Interest

CRAN Package `foreach`

The answers to the questions about R package foreach were in the vignette.

%do% does non-parallel and %dopar% does parallel.
The package is smart about exports to workers. In the random forest example it detects that x and y are global variables that must be exported to worker processes, but it cannot know in what package to find the function randomForest unless told. So that is pretty fancy.

Web Sites that Check R Packages

win-builder.r-project.org was already mentioned in class. This web site checks builds of R packages for Microsoft Windows and allows you to download the built Windows binaries.

R-hub is a site that checks R packages from within R using CRAN package rhub.

S7 Object System for R

Core R already has three object systems (S3, S4, and R6). The S7 object system aims to replace S3 and S4. Unclear when it will arrive in core R. Right now it is a CRAN package.

The Joel Test for Software Teams

The Joel Test: 12 Steps to Better Code. Just found this today, linked from somewhere else. It is more than 20 years old but still as good as the day it was written. You have no way of knowing this, but this site was at the time one of the most popular sites with advice about startups.

Since we are embarking on a homework assignment where the class is a software team, let's grade ourselves.

Do you use source control?
Yes! We pass this. Notice that Git is not mentioned. At the time (2000) The linux kernel did not pass this. Git was written so it could (with history a little more complicated than that.)
Can you make a build in one step?
Yes! We pass this. R CMD build does this.
Do you make daily builds?
Maybe? Depends on the BDFL. But not automatic, which is what is meant. We are just not a big enough project for that.
Do you have a bug database?
Yes! GitHub provides one, if we want to use it.
Do you fix bugs before writing new code?
Haven't talked about this yet.
Do you have an up-to-date schedule?
No. A due date is not a schedule.
Do you have a spec?
I think that what Joel mean by a spec is the homework assignment. It completely specifies what the software is to do.
But we are going for more than that, we should have a design document that justifies each choice of algorithm with careful attention to inexactness of computer arithmetic.
Do programmers have quiet working conditions?
Dunno. Can't do thing about that. Are your offices like Dilbert?
Do you use the best tools money can buy?
This one is obsolete, at least for our level. A laptop backed up by compute.cla.umn.edu is way way in excess of what any software engineer had back then.
Do you have testers?
No. This is bad. But perhaps less so for mathematical software. Writing d, p, q, and r functions for yet another distribution is very well defined. So our own testing might be sufficient.
Nevertheless R packages do need testers. If R package aster had had testers, some incredibly annoying design mistakes could have been avoided. Some of these mistakes have been ameliorated but not fixed for reasons of backward compatibility.
Do new candidates write code during their interview?
Inapplicable. Although that would make for an interesting first day of class!
Do you do hallway usability testing?
No. But usability testing is important for statistical software too. And grabbing people walking by is as good a procedure as any. Try it?

So only 4 definite yesses. But as Joel says, could be worse.

I probably assigned the roles wrong. Should have had an independent testing department (one person assigned to tests only). Too late for this year. Will definitely do next year.

The point of this is that the things we are trying to teach in this course are not the only important things about computing.

2022 Announcements of Permanent Interest

Some MCMC Theory

The fundamental papers on convergence rates of Metropolis-Hastings samplers, in particular, when they are geometrically ergodic (which implies the CLT holds) are Mengersen and Tweedie (1996, doi:10.1214/aos/1033066201) which only does the one-dimensional case, Roberts and Tweedie (1996, doi:10.1093/biomet/83.1.95) which does the finite-dimensional case, and Jarner and Hansen (2000, doi:10.1016/S0304-4149(99)00082-4) which simplifies some of the conditions in Roberts and Tweedie (but is still hard to use).

An application of this theory to random walk Metropolis and R package mcmc in particular is Johnson and Geyer (2012, doi:10.1214/12-AOS1048).

An application of this theory to a toy problem asked about by a student in class is some notes on geometric ergodicity.

Why Perfect Sampling is the Only Perfect Diagnostic

Perfect sampling (Propp and Wilson, 1996, doi:10.1002/(SICI)1098-2418(199608/09)9:1/2<223::AID-RSA14>3.0.CO;2-O; Kendall, 1998, doi:10.1007/978-1-4612-2224-8_13, Murdoch and Green, 1998, doi:10.1111/1467-9469.00116) should not be thought of as a form of Markov chain Monte Carlo, but rather a form of Markov chain assisted independent and identically distributed sampling. That is, it is a form of ordinary Monte Carlo (OMC) rather than Markov chain Monte Carlo (MCMC).

What is the probability that a sample of (Monte Carlo) size n misses an event of probability ε when doing OMC? Answer: (1 − ε)ⁿ. Same question except for MCMC rather than OMC? Answer if only Harris recurrence is assumed: we only know that the event will occur eventually with probability one. Answer if geometric ergodicity is assumed: M ρⁿ for some ρ < 1, which we only know how to calculate in toy problems, and some constant M which depends on the the starting position. So this really doesn't guarantee anything for any particular n.

Regeneration in MCMC (Mykland, Tierney, and Yu, 1995, doi:10.1080/01621459.1995.10476507; Geyer and Thompson, 1995, doi:10.1080/01621459.1995.10476590) builds regenerations into Markov chains. Every so often the chain regenerates. Denote these times T₁, T₂, …. Then the parts of the Markov chain from just after a regeneration up to and including the next regeneration are independent random thingummies. I call them tours. Since the tours are independent, they can be run on different computers in parallel. If an MCMC run consists only of complete tours, then the usual MCMC estimator (the average over the run) is an unbiased estimator of the expectation with respect to the target distribution that it estimates. But since the lengths of the tours are random, and you never know how long a tour will take, it is possible that you start off ten thousand tours and most of them finish within minutes but one of them takes ten thousand years, and you have to wait for it to finish to get your unbiased estimate. And since you never know whether there might be a tour like that (except by using very hard theory), you never know whether you are being fooled by pseudoconvergence (unlike with perfect sampling).

So called unbiased MCMC (Jacob, O’Leary, and Atchadé, 2020, doi:10.1111/rssb.12336) has all of the advantages and disadvantages of regeneration. It does provide an unbiased estimator but can also be fooled by pseudoconvergence. You can seem to have a good estimator from ten thousand coupled unbiased runs, but the ten thousand and first can take ten thousand years to do and give a wildly different answer.

R Types ExternalPtr and Raw

I said something wrong in class. The manual Writing R Extensions does cover external pointers in Section 5.13.

R CMD SHLIB

Here is the minimum you need to do to compile a C function and call it from R.

First make a file qux.c, for example

void qux(double *x)
{
    x[0] = 42.0;
}

Now at the operating system command line (not in R) do

R CMD SHLIB qux.c

and then look at what you got (on my computer I see)

fagus$ ls
qux.c  qux.o  qux.so

The file with extension so is the shared library. It may have extension sl or dll on some other operating system.

Now go into R and run

> dyn.load("qux.so")
> .C("qux", x = double(1))
$x
[1] 42

and you see it works (the argument of R function dyn.load must be the actual name of the shared library).

A Rant about OOP

Execution in the Kingdom of Nouns by Steve Yegge.

Preallocation in R

A student asked in class about preallocating vectors and other stuff (matrices, arrays, data frames, tibbles) in R. This is a very deep subject.

Some googling found a lot of advice that says to do it, but no explanation about why.

Hadley Wickham puts his finger on the key issue: aliasing. But he doesn't explain how R knows about that.

Before we can learn anything on the subject, we are talking about calling R function `[<-` which R says is

> `[<-`
.Primitive("[<-")

but which the manual (help("[<-") says is generic (in the Details section). So maybe the first thing we need to figure out is how that works.

Actually no. The first thing we need to know is the semantics of this function, what is is supposed to do. Section 3.4.4 of the R Language Definition is clear. When the statement

foo[bar] <- baz

what happens must be as if a copy of foo is made, then the subset assignment is made to the copy, and then that copy is associated with the name-also-called-symbol foo. So whether or not update in place (without copying) is done, it must have the same effect as if update with copying was done. This does not apply to assignments to environments or reference classes (which are mutable), but does apply to all other assignment functions like names<-.

Also Section 5.9.10 of Writing R Extensions is also somewhat confused but at the end clearly states that since R-4.0.0 reference counting (see below) is used and C preprocessor macros MAYBE_REFERENCED, MAYBE_SHARED, and MARK_NOT_MUTABLE are available in C code called from R.

Then we are on our own. None of the R manuals or any documentation explain further. We have to RTFS also called UTSL.

But reading C code is hard. And reading incredible gnarly C code for some of the lowest level parts of R is really hard. So I would just say we don't want to go there except there isn't anything else to do (except, of course handwaving and bullshit, but that's already available in plenitude on the intertubes).

So the first thing to understand is that when R calls a primitive function like this (and some other functions, I don't understand the details) it goes through a dispatch table found in the C source file names.c where we find that a call to R function `[<-` is translated to a call to C function do_subassign which some grovelling in the source code is found in C source file subassign.c.

And the first thing we see in this file is a bunch of stuff about special casing every type of assignment for efficiency. So right away we see that this code will be hard to read.

If we go to the definition of the function do_subassign we find

/* The [<- operator.  "x" is the vector that is to be assigned into, */
/* y is the vector that is going to provide the new values and subs is */
/* the vector of subscripts that are going to be replaced. */
/* On entry (CAR(args)) and the last argument have been evaluated */
/* and the remainder of args have not.  If this was called directly */
/* the CAR(args) and the last arg won't have been. */

SEXP attribute_hidden do_subassign(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP ans;

    /* This code performs an internal version of method dispatch. */
    /* We evaluate the first argument and attempt to dispatch on it. */
    /* If the dispatch fails, we "drop through" to the default code below. */

    /* DispatchOrEval internal generic: [<- */
    if(R_DispatchOrEvalSP(call, op, "[<-", args, rho, &ans))
/*     if(DispatchAnyOrEval(call, op, "[<-", args, rho, &ans, 0, 0)) */
      return(ans);

    return do_subassign_dflt(call, op, ans, rho);
}

So we see this calls C function R_DispatchOrEvalSP which is defined in this file, and it does a lot of complicated things before calling C function R_DispatchOrEval which is defined in C source file eval.c. So we look in there and find its definition (which is long and complicated) is preceded by this comment

/* DispatchOrEval is used in internal functions which dispatch to
 * object methods (e.g. "[" or "[[").  The code either builds promises
 * and dispatches to the appropriate method, or it evaluates the
 * (unevaluated) arguments it comes in with and returns them so that
 * the generic built-in C code can continue.

 * To call this an ugly hack would be to insult all existing ugly hacks
 * at large in the world.
 */

This is not quite as programmer-hostile as the famous You are not expected to understand this comment in an early version of the UNIX source code.

The comment does inform us that this function only finished the job if there is a method for R generic function `[<-` for the type of object which is the first argument, that is, if we are trying to execute the R command

foo[bar] <- baz

and R object foo has class "qux" and there exists a method `[<-.qux` to handle such objects, then it will be called here to do the job. This task is complicated by the fact that we can have complicated expressions like

as.symbol(paste0("foo", i))[bar > 6] <- baz * 10

and it says that if there is no method to handle the class of the object that as.symbol(paste0("foo", i)) evaluates to, then this function does evaluate all of its arguments and hands the R objects they evaluate to to the next function called, which is C function do_subassign_dflt also found in in C source file subassign.c.

And if we look inside the source code of this function, then we finally find what we are looking for (at least part of it)

    /* If there are multiple references to an object we must */
    /* duplicate it so that only the local version is mutated. */
    /* This will duplicate more often than necessary, but saves */
    /* over always duplicating. */

    if (MAYBE_SHARED(CAR(args)) ||
	((! IS_ASSIGNMENT_CALL(call)) && MAYBE_REFERENCED(CAR(args))))
	x = SETCAR(args, shallow_duplicate(CAR(args)));

But the code here is really ugly. The convention is that all caps like MAYBE_SHARED and CAR and the rest are not C functions but rather C preprocessor macros. So we need to look in some header file for them.

First, I happen to already know that the CAR macro is a LISPism (R is more like LISP under the hood than one might imagine, so this is the LISP CAR operator and CAR(args) is the first argument of the function call.

We find these macros defined in the C header file Rinternals.h

/* Macros for some common idioms. */
#ifdef SWITCH_TO_REFCNT
# define MAYBE_SHARED(x) (REFCNT(x) > 1)
# define NO_REFERENCES(x) (REFCNT(x) == 0)
#else
# define MAYBE_SHARED(x) (NAMED(x) > 1)
# define NO_REFERENCES(x) (NAMED(x) == 0)
#endif
#define MAYBE_REFERENCED(x) (! NO_REFERENCES(x))
#define NOT_SHARED(x) (! MAYBE_SHARED(x))

And now we are stuck unless we know some computer science. R has garbage collection, and some garbage collection uses reference counting. And reference counting can, in some cases give incorrect results, but AFAIK when it does it says an object is shared when it actually isn't (so MAYBE_SHARED can give a false positive but not a false negative, which as the comment above says, is OK, it wastes time but does not cause errors). But what is that stuff about SWITCH_TO_REFCNT? Sometimes R uses reference counting and sometimes not? Or on some architectures and on some not? Or what?

So we are really still a long way from understanding this issue. We might have to read and fully understand thousands of lines of the R source code to really grok in fullness.

But since this is probably more than any of you wanted to know, and is even more than I wanted to know (except I am still left feeling dissatisfied and wish I knew more but am just a bit too lazy to learn more). So that's that!

Machine Language

I forgot what may be a very important architecture in the future: RISC-V.

There of course many more architectures all but Intel and ARM mostly of historical interest only.

Rules for Optimizing Computer Code

The link to the quotes about optimization.

LU and QR

Why do we use QR decomposition rather than some other matrix decomposition to solve least squares problems? And why don't we use a matrix decomposition on M^T M rather than on M itself, where M is the model matrix? The answer isn't operation counts. It's accuracy (Fletcher, 1975, DOI:10.1080/01621459.1975.10480270).

Firstly, recall that the right way to think about least squares is regression is projection. What determines a linear model is not the model matrix M but rather its column space V, the vector subspace spanned by its columns. Different model matrices, but same column spaces, same regression model (a statistical model is a family of probability distributions, and same column spaces, same family of probability distributions). Thus to really understand a regression model we have to get a handle on V. When we do a reduced QR decomposition of the model matrix, the columns of Q are an orthonormal basis for V.

A model matrix is just another basis for V or perhaps in case of collinearity not even a basis (some columns of M need to be dropped to get a basis for V). But some bases are better than others in regard to accuracy. Orthornormal bases are the best. But allowing V to be specified in terms of model matrices allows users to specify linear models in arbitrarily bad ways. And software has to deal with that as best it can.

A condition number is a number that bounds the accuracy of the solution of a problem in terms of the accuracy of the data for a problem. For a least squares problem, we can take it to be the ratio of the largest to smallest singular values of the model matrix. If the model matrix has orthornormal columns, then this condition number is one, the best it can be. If there is collinearity, this condition number is infinite, the worst it can be. If this condition number is very large, we say the user's specification of the least squares problem is ill conditioned. That means the solution for the regression coefficients will be very inaccurate.

When you form M^T M you square the condition number of the problem, which means your answer will have half the accurate digits it might have had before you did that. So M^T M is something that is only OK in theory. You don't want to use that in the computer.

The QR decomposition with pivoting allows computation of the least squares solution without making the condition number any worse than it has to be. The condition number of the procedure is the condition number of the model matrix used to specify the problem. So that is why QR.

R Functions for Matrix Factorizations

We know about R functions qr, svd, eigen, and chol in base R that do the QR decomposition, singular value decomposition, spectral decomposition, and Cholesky decomposition. How about LU?

Fortran function DGETRF does the LU decomposition. R has the source code for this in the file dlapack.f. And it is part of the official R API, with prototype in the include file R_ext/Lapack.h. So you can use it in your C, C++, or Fortran code called from R. But R does not appear to use it. For anything.

R package Matrix does do LU and Shur decompositions, which base R does not appear to. It also does sparse Cholesky and sparse QR, which are more efficient than the base R routines for sparse matrices.

How R Calls Pnorm and Friends

> pnorm
function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) 
.Call(C_pnorm, q, mean, sd, lower.tail, log.p)
<bytecode: 0x55c3a6790c58>
<environment: namespace:stats>

So R function pnorm calls a C function to do the job through the .Call interface. The help for R function .Call says the first argument is

   .NAME: a character string giving the name of a C function, or an
          object of class ‘"NativeSymbolInfo"’,
          ‘"RegisteredNativeSymbol"’ or ‘"NativeSymbol"’ referring to
          such a name.

Since this is obviously not a character string, it must be one of these other thingummies, which are described in the section about registering native routines in Writing R Extensions. But here tricks are being played. That section suggests the name of the function is pnorm.

But if we look in R source file init.c for package stats we find

    CALLDEF_MATH3_2(pnorm),

used to register this function where the C preprocessor macro being used here is defined by

#define CALLDEF_MATH3_2(name) CALLDEF_DO(name, 5)

and this is invoking another C preprocessor macro defined by

#define CALLDEF_DO(name, n) {#name, (DL_FUNC) &do_##name, n}

so it seems as if the name of the C function actually registered is do_pnorm (you have to know what the C preprocessor does with double hash. It is the token-pasting operator that makes one token out of two, so (since here name is pnorm it registers the C function pointer do_pnorm.

Now if we look for that function, we find in R source file statsR.h in package stats a prototype for that function

SEXP do_pnorm(SEXP sa, SEXP sb, SEXP sc, SEXP sI, SEXP sJ);

but we don't find (at least not by grepping for do_pnorm) the function definition. So we look harder and find in R source file distn.c in package stats

DEFMATH3_2(pnorm)

where this C preprocessor macro is defined as

#define DEFMATH3_2(name) \
    SEXP do_##name(SEXP sa, SEXP sb, SEXP sc, SEXP sI, SEXP sJ) { \
        return math3_2(sa, sb, sc, sI, sJ, name); \
    }

And here name is again pnorm so this does indeed define a function do_pnorm which is a one-liner with body

        return math3_2(sa, sb, sc, sI, sJ, pnorm);

where now pnorm is a C function pointer to the C function of that name.

So now we have to find C function math3_2, which is defined in R source file arithmetic.c where we find

static SEXP math3_2(SEXP sa, SEXP sb, SEXP sc, SEXP sI, SEXP sJ,
                    double (*f)(double, double, double, int, int), SEXP lcall)
{

and this is something of a worry that we are calling this with five arguments but it takes six. However, we check that the last is not used anywhere in the function so it is OK to omit it (I think, at least the R core team think so).

Then if we look at the rest of the body of the C function math3_2 we see another preprocessor macro MOD_ITERATE3 which is defined in R source file R_ext/Itermacros.h to be

#define MOD_ITERATE3_CORE(n, n1, n2, n3, i, i1, i2, i3, loop_body) do { \
        for (; i < n;                                                   \
             i1 = (++i1 == n1) ? 0 : i1,                                \
                 i2 = (++i2 == n2) ? 0 : i2,                            \
                 i3 = (++i3 == n3) ? 0 : i3,                            \
                 ++i) {                                                 \
            loop_body                                                   \
                }                                                       \
    } while (0)

#define MOD_ITERATE3(n, n1, n2, n3, i, i1, i2, i3, loop_body) do {      \
        i = i1 = i2 = i3 = 0;                                           \
        MOD_ITERATE3_CORE(n, n1, n2, n3, i, i1, i2, i3, loop_body);     \
    } while (0)

and this takes care of the recycling of indices.

And then going back to the source code of math3_2 we see that it fills up the result vector with repeated calls to f, which is a C function pointer pointing to the C function pnorm (in the instance we are tracing).

So that was a very long winded way (and we omitted a lot of detail) that the C function that gets called to do the basic work of R function pnorm is R function pnorm.

And that C function is found in the R source file pnorm.c where we find this (part of a) comment

 *  SYNOPSIS
 *
 *   #include <Rmath.h>
 *
 *   double pnorm5(double x, double mu, double sigma, int lower_tail,int log_p);
 *         {pnorm (..) is synonymous and preferred inside R}

so the name of the C function is actually pnorm5 but the comment says the name pnorm is preferred (better style) so this must be defined in some header file, probably the one cited in the comment, Rmath.h where we do indeed find

#define dnorm dnorm4
#define pnorm pnorm5
#define qnorm qnorm5

So C function pnorm5 also called pnorm is called once for each value in the vector returned by R function pnorm.

In your C code, you can ignore all of this nonsense and just call pnorm directly (if you have

#include <Rmath.h>

before any call to C function pnorm. And you don't really have any choice but to call it directly because C function math3_2 and the like are not part of the R API.

2021 Announcements of Permanent Interest

Randomized Algorithms

It is not known whether randomized algorithms are better than deterministic algorithms. In computational complexity theory, BPP is the class of problems solved in polynomial time by randomized algorithms and P is the class of problems solved in polynomial time by deterministic algorithms. The link cited above says that it is unknown whether P = BPP, although most complexity theorists conjecture today that this is so. A post on Scott Aaronson's blog says Avi [who just won the Abel prize] hasn’t yet proved P = BPP, just taken some major steps toward it although the news media missed the distinction.

R Packages for Optimization

The R packages mentioned in class but not on the slides were

R package clpAPI which is very high quality linear programming software, competitive with very expensive commercial solvers. Its documentation is horrible unfortunately. The package has been defunct for several months. I have replaced it with the following in my code.
R package glpkAPI which is also very high quality linear programming software, competitive with very expensive commercial solvers. Its documentation is much better than the preceding package, but still borderline unreadable.
R package alabama which has done all of the (fairly low dimensional) examples I have used it for in several courses, mostly calculating likelihood-based confidence intervals and similar things. The function in this package I have used is R function auglag.

Web Scraping HTML Tables

So reading the bug reports on GitHub for R package htmltab it looks like the package has possibly been abandoned by the author. No! It is back in business https://cran.r-project.org/package=htmltab.

So it looks like we switch to the Hadleyverse (except we don't have to now).

> library(rvest)
Loading required package: xml2
> foo <- read_html("https://www.avca.org/polls/di-women/2-22-21.html")
> bar <- html_table(foo)
> class(bar)
[1] "list"
> sapply(bar, class)
[1] "data.frame"

It looks like R function html_table returns a list of all of the tables in the document converted to R data frames.

> baz <- bar[[1]]
> names(baz)
[1] "Rank"                                "School (First-Place Votes Adjusted)"
[3] "Total Points Adjusted"               "2020-21 Record"                     
[5] "Previous Rank"

If we look a the web page or just at the second column, we see that AVCA have followed very bad data practice and put two bits of data in one field. We have to extract the two fields using regular expressions, which is gross. Don't do that!

> qux <- baz[ , 2]
> head(qux)
[1] "Wisconsin (51)" "Texas (8)"      "Kentucky (1)"   "Nebraska"
[5] "Minnesota"      "Baylor"
> school <- sub(" *\\([0-9]*\\)$", "", qux)
> head(school)
[1] "Wisconsin" "Texas"     "Kentucky"  "Nebraska"  "Minnesota" "Baylor"
votes <- sub("^.*\\(([0-9]*)\\)$", "\\1", qux)
head(votes)
> votes <- as.numeric(votes)
Warning message:
NAs introduced by coercion 
> votes[is.na(votes)] <- 0

So now we have our data. The R is gross because the data were gross.

> fred <- data.frame(Rank = baz$Rank, School = school,
+     First.Place.Votes.Adjusted = votes,
+     Total.Points.Adjusted = baz[["Total Points Adjusted"]],
+     Record = baz[["2020-21 Record"]], Previous.Rank = baz[["Previous Rank"]])
> head(fred)
  Rank    School First.Place.Votes.Adjusted Total.Points.Adjusted Record Previous.Rank
1    1 Wisconsin                         51                  1490   10-0             1
2    2     Texas                          8                  1432   14-0             2
3    3  Kentucky                          1                  1359   12-0             3
4    4  Nebraska                          0                  1313    7-1             4
5    5 Minnesota                          0                  1276    9-1             5
6    6    Baylor                          0                  1201   13-3             6

And now we are ready to work with these data. If the web site authors had given more thought to making their site computer-readable. We wouldn't have had to do so much work.

Denormalized Numbers

Finishing the example on denormalized numbers I was doing in class.

> exp(-745)
[1] 4.940656e-324
> exp(-746)
[1] 0

So we see that exp(-745) is close to the smallest positive nonzero number defined in IEEE arithmetic. What is it actually?

> library(Rmpfr)
> foo <- mpfr(exp(-745), 120)
> foo
1 'mpfr' number of precision  120   bits 
[1] 4.9406564584124654417656879286822137237e-324
> log2(1 / foo)
1 'mpfr' number of precision  120   bits 
[1] 1074

So the result foo is exactly 2⁻¹⁰⁷⁴ in the computer's arithmetic. And we know from calculus, that isn't exact (should be an irrational number). We can also see the inexactness in

> log(exp(-745))
[1] -744.4401
> log(2^(-1074))
[1] -744.4401
> log2(exp(-745))
[1] -1074

Note that there is nothing special about exp and log here. Denormalized numbers arise whenever results get in this range

> .Machine$double.xmin
[1] 2.225074e-308
> 2^(-1074)
[1] 4.940656e-324
> 2^(-1074) /2
[1] 0

.Machine$double.xmin is the smallest positive normalized number (in IEEE double precision arithmetic) and 2⁻¹⁰⁷⁴ is the smallest positive denormalized (in IEEE double precision arithmetic). When you are in the range between them, you are denormalized. When you are even below that range, you have underflow.

Makevars File

As discussed in class, you cannot put nonstandard compiler flags in the src/Makevars file of an R package. For example, if you add -Wall -Wextra to the definition of PKG_CFLAGS, then R CMD check will emit the warning

Non-portable flags in variable 'PKG_CFLAGS':
  -Wall -Wextra

So what you do is follow the instructions in file package/notes in R repository foo in my account at GitHub.com that I illustrated in class Feb 17.

Searching CRAN

It came up in talking with a student that it would be helpful to cover how to search CRAN (as the following announcement is based on). This is done with R package pkgsearch. For example,

library("pkgsearch")
advanced_search("Arnoldi")

returns no hits. But

advanced_search("Conjugate Gradient", size = 200) -> foo

returns 178 hits (as this is written). The result foo is a data frame with variables

> names(foo)
 [1] "score"                "package"              "version"             
 [4] "title"                "description"          "date"                
 [7] "maintainer_name"      "maintainer_email"     "revdeps"             
[10] "downloads_last_month" "license"              "url"                 
[13] "bugreports"           "package_data"

And we can now use R function grep to whittle down our search. For example, the method of conjugate gradients can be used either for optimization or for solving linear equations. If we are only interested in the latter, then

> 
> bar <- grepl("equation", foo$title, ignore.case = TRUE) |
+     grepl("equation", foo$description, ignore.case = TRUE)
> sum(bar)
[1] 11
> foo$package[bar]
 [1] "pcg"         "cPCG"        "sanic"       "hgm"         "LassoGEE"   
 [6] "shinyKGode"  "gclm"        "KGode"       "deGradInfer" "calculus"   
[11] "rootSolve"

cuts down our task. At this point we had to look at the actual titles, descriptions, and web pages for the packages.

Twenty-First Century Numerical Analysis

A search of CRAN turned up two packages that use modern methods for solving linear equations.

R package sanic does BiCGSTAB (Biconjugate gradient stabilized method)
R package Rlinsolve does BiCGSTAB and GMRES.

But I could not find anything about modern methods for eigenvalue-eigenvector problems.

Creating R Lists in C Called from R

The example I was using for constructing an R object of type "list" inside C code called from R using the .Call interface was C function scdd in R package rcdd which is on GitHub. The code that constructs an R object of type "list" begins at line 268.

Inheritance in OOP

I may have given the wrong impression in my in class answer about inheritance. The R S3 OOP system does not have inheritance in the sense of C++ and Java. In those languages fields and methods of a class can be inherited from its superclass. In R S3 there is no such thing. But there is a very weak form of inheritance in that the class attribute of an object can be a vector, and one can think of the components of this vector as being class, superclass, superclass of superclass and so forth. Thus one should never test for class equality. The R idiom is

if (inherits(foo, "Bar"))

for does object foo have class "Bar".

R Type System

A better answer about when to use typeof, mode, storage.mode, or class. The R Language Definition, Section 1, Objects and Section 5, Object-oriented programming give the better answers.

R function typeof gives the type of the object as R considers it. As the manual cited above says

The R specific function typeof returns the type of an R object. Note that in the C code underlying R, all objects are pointers to a structure with typedef SEXPREC; the different R data types are represented in C by SEXPTYPE, which determines how the information in the various parts of the structure is used.
R function mode is for backwards compatibility with the S programming language (which R is a reimplementation of). As the manual cited above says

Function mode gives information about the mode of an object in the sense of Becker, Chambers & Wilks (1988) and is more compatible with other implementations of the S language.
R function storage.mode is also for backwards compatibility with the S programming language. As the manual cited above says

Finally, the function storage.mode returns the storage mode of its argument in the sense of Becker et al. (1988). It is generally used when calling functions written in another language, such as C or FORTRAN, to ensure that R objects have the data type expected by the routine being called. (In the S language, vectors with integer or real values are both of mode "numeric", so their storage modes need to be distinguished.)
R function class gives the class of the object as the various OOP systems of core R consider it. In the S3 OOP system the class of an object is just an attribute named "class" that is an atomic vector of type "character". So classes are nothing like in C++ or Java. They don't say anything about the object except the name of the class. They are used in method dispatch of generic functions.

Mutability

A quotation from Advanced R by Hadley Wickham

Surprisingly, the implementation of reference classes is almost entirely in R code — they are a combination of S4 methods and environments. This is a testament to the flexibility of S4.

It also says that if you need mutability but don't need OOP, then you don't need reference classes, you just need environments. That's what I use.

Changing Parent Environment

My answer in class about whether environments can be changed on the fly was completely wrong. I should not have guessed. As always, TRT is to RTFM. And help(parent.env) says, among other things

environment(fun) <- value
parent.env(env) <- value

(these are the only assignment functions documented on this help page). So both the environment of a function and the parent environment of an environment can be assigned arbitrarily. An examination of the R code base shows that the first is done in model fitting functions and methods functions, and the latter is done only in methods functions and in R startup. So it is not something you would ordinarily do, but it is something you can do. The documentation also says

The replacement function ‘parent.env<-’ is extremely dangerous as it can be used to destructively change environments in ways that violate assumptions made by the internal C code. It may be removed in the near future.

So changing the parent of an environment is something you can do but are strongly advised not too!

Removing Sensitive Material from Git and GitHub

A better answer to what to do when sensitive material has mistakenly been committed to git and pushed to GitHub is found on this GitHub documentation. It suggests using either the git filter-branch command or the BFG Repo-Cleaner open source tool. I think I have used git filter-branch in the past. But it also says

Warning: Once you have pushed a commit to GitHub, you should consider any data it contains to be compromised. If you committed a password, change it! If you committed a key, generate a new one.

2020 Announcements

R Packages that do Regression

The info about R packages that do regression is linked on the course notes about R packages, in particular, follow the links to git repository bar and the accompanying gist.

My way of converting formulas to x and y is a way of R, but not the only way. Instead you can follow http://developer.r-project.org/model-fitting-functions.html if you like.

Robustness

Robustness came up today (Apr 24) in class. This area isn't as trendy as it used to be, but it is still very important. Every statistician should know something about it. For a start read the help pages for R functions lqs and rlm in R package MASS and try the examples.

Then for more, the following

library("pkgsearch")
foo <- advanced_search(Title = "robust", Description = "robust", size = 1000)

returns (as I write this) 144 CRAN packages with robust in the Title or Description fields of the package DESCRIPTION file. Of course, in some of these robust is used with a different meaning from the robust in robust statistics (Wikipedia pages Robust statistics and Robust regression). But there are clearly a lot of robustness packages on CRAN (not to mention the ones in the recommended package MASS that comes with every R installation — which we already did mention, of course).

R Standalone Library of C Functions

New handout about the R standalone library under Course Notes in the navigation to the left.

The authoritative documentation is Section 6.17 of Writing R Extensions. This allows use of many C functions that back up R functions, including the R random number system and all of the rest of R functions for probability distributions, from C code not called from R.

Branch and Bound Algorithm

This is in addition to the slides on Branch and Bound under Slides in the navigation to the left.

R package glmbb (https://cloud.r-project.org/package=glmbb) does branch and bound and is programmed in R (no C or C++). It uses three main ideas

It uses the R formula mini-language to specify models (all possible models), adding and dropping terms to move between models.
It uses an R environment object to store what it thinks of as global variables. It does not store these variables in the R global environment.
It uses recursion. There is a function that does branch and bound for all models in a specified set. This function works by partitioning the specified set and then calling itself on each element of the partition.

The same ideas will work on any branch and bound problem.

Alternatives to using R environments to store R objects are various CRAN packages that implement hash tables and may be more efficient than R environments. These include

R package filehash,
R package hash, and
R package r2r.

In particular, R package filehash which stores in a file on disk rather than in memory may allow slightly larger problems to be done. But, since branch-and-bound converts exponential in n searches to exponential in n searches (with a smaller constant) nothing will allow really large (big data problems). That is why things like LASSO were invented. They work on very large problems for some definition of work (they provide answers with no guarantee they are anywhere near the best).

Addendum added in 2023 Core R now has hash tables (in R package utils, which is a recommended package installed by default in every installation of R). See

?utils::hash

This is still marked Experimental but is the way of the future. This has been available since R-4.2.0.

Using SSH Without Passwords

A note about using the ssh command without passwords.

First you have to have ssh. On UNIX (Linux and Mac OS X) it just comes with the computer. On up to date versions of Windows 10 it can easily be installed (so they say — I haven't tried it since I don't do windows).
Then, if you have never used ssh before, do (from the command line in a terminal)
```
ssh username@lts.cla.umn.edu
```
where you replace username with your actual U of M username (the one you log in with to read mail, register for courses, use the library, etc.). Type your U of M password when prompted and you should be on lts.
Log back out (say exit), because we need to do some work locally.
(Do this step once per computer you own. If you have already done this for working with GitHub, don't do it again.) From the command line on your computer do
```
ssh-keygen
```
Answer any questions it asks by choosing the default (hit return) until you get to Enter passphrase (empty for no passphrase):. Enter a passphrase. This can be as long as you like, and should be memorable. It should not be used for anything else. Then you have to enter it again to make sure you typed it correctly. Then it should say the keys were successfully created. If you do
```
ls .ssh
```
you should see two files
```
id_rsa
id_rsa.pub
```
The first is your private key. It stays secret. The only copy anywhere should be right where it is now, on your computer. The second is your public key. You can give that out to anyone and move it to any computer.
The first thing you do is copy it to lts
```
scp .ssh/id_rsa.pub lts.cla.umn.edu:
```
(Note the colon at the end of this command. It's important. You will have to type your U of M password again for this to work).
Now log back into lts again, just like in item 2 above.
This time we have some work to do. We need to add this public key to the authorized_keys in the .ssh directory on the remote computer (in this case lts).
```
cat id_rsa.pub >> .ssh/authorized_keys
```
(you can now remove the file id_rsa.pub if you like, only the copy in .ssh/authorized_keys is used).
If the command above complains that the directory does not exist, make it.
```
mkdir .ssh
```
Now log out of lts again, just like in item 3 above.
Now log back into lts again, just like in item 2 above. This time you are not asked for a password. This time you are asked for the passphrase for that SSH key. Type that passphrase (that you typed into ssh-keygen. This should get you in to lts. On my computer (an Ubuntu Linux box) it pops up a new window for me to type the passphrase in.
Now log out of lts again, just like in item 3 above.
Now log back into lts again, just like in item 2 above. This time you should just get in. No password and no passphrase. Your computer remembers the key for you. You will never need to type this passphrase again unless you log out of your account on your computer or reboot your computer. So you have to type this passphrase once per reboot rather than once per ssh or scp command.
Since you have the same file system on lts and compute. You can now get into it without a password or passphrase. Try that. Same as item 2, except change lts to compute.
Now copy your public key to GitHub, either github.umn.edu or github.com or both. Log into your account on one or the other.
- Find the main menu in the upper right corner of your home page. Choose Settings.
- Now on the left you should see a Personal settings menu one item of which is SSH and GPG keys. Click that.
- Now in the upper right you should see a green button labeled New SSH key. Click that.
- Now at a command line on your computer do
```
cat ~/.ssh/id_rsa.pub
```
  You will see one very long line of text that probably wraps to several lines in the terminal window. Cut and paste that into the text form labeled Key at GitHub. You probably also want to give this key a Title so you will know what it is if you have more than one key (from more than one computer — you can use this key on this computer for everything). Then click Add SSH key. You should now see this key among your SSH keys. At a command line in your computer do
```
ssh-add -l -E md5
```
  and you should see the hash of the key that GitHub shows.
Now in any git repo on your computer that has this GitHub (the one you put the SSH key in) as its origin, what you see when you do
```
git remote -v
```
you should be able to do
```
git pull origin master
```
or
```
git push origin master
```
without a password or passphrase.
GitHub now suggests you call the main branch in any Git repo main rather than master and has instructions on how to do this.
As far as git or GitHub is concerned, you can call it anything you want.

Background Jobs

As mentioned in lecture, the UNIX way (including Mac OS X and Linux) to do background jobs is just to put an ampersand at the end of the command line. This is mentioned in help("BATCH") in R.

R CMD BATCH --vanilla foo.R &

will run all the R in file foo.R in the background. You get your command line back immediately to do other work while the background is going. Unix functions ps and top can tell you whether the background job is still running and how much time it has taken. top is especially easy to use, requiring no command line arguments in ordinary use.

But if you log out of the computer or even kill the window in which the background job was started, your background job will be killed by default. To avoid this do

nohup R CMD BATCH --vanilla foo.R &

The nohup says let the background job run to completion no matter what (unless the computer reboots, of course).

If you are going to run the background job on computers that other people use for everyday work, then you should nice your job, giving it lower priority than processes that involve user interaction. For that do

nice nohup R CMD BATCH --vanilla foo.R &

For more on all of this see the part about batch processing in some old slides about parallel processing (the rest of that slide deck is defunct, see the current course notes on parallel processing).

Sockets Cluster over Multiple Machines

Here is the example I couldn't do in class.

ssh -A ssh.stat.umn.edu
ssh -A oak.stat.umn.edu
R --vanilla
library("parallel")
hosts <- c("oak", "bald")
cl <- makePSOCKcluster(hosts, homogeneous = FALSE)
parLapply(cl, seq(along = hosts), function(x)
    list(hostname = system("hostname", intern = TRUE), pid = Sys.getpid()))
stopCluster(cl)
q()
exit
exit

The -A flag enables agent forwarding so we don't need a password if ssh keys have been set up, as is discussed here (the rest of this slide deck is obsolete, replaced by our current course notes on parallel processing).
I have to go to oak because ssh, also called lts.cla.umn.edu does not allow R jobs.
The homogeneous = FALSE tells R not to assume that all the hosts have Rscript in the same place. That's the bit that threw me in class.

Spatial Point Processes

The reading on spatial point processes (Stat 8501 Lecture Notes on Spatial Point Processes) mentioned right at the end of class one day is NOT REQUIRED. That is a theory class, this is a computing class. Nevertheless, if anyone is interested in spatial point processes, that is a good place to start.

For example, the question asked near the end of class was about an equation in the reading that is equation (5) in these 8501 notes, and that equation in the 8501 notes is surrounded by a lot more explanation, essentially what I said in answer to the question.

Initial Sequence Estimators for MCMC

New handout about autocovariance functions for functionals of reversible Markov chains. Expands an answer to a question asked in class.

MCMC Long Runs

In answer to the question about how long to run MCMC, I should have mentioned the soufflé rule: it takes as long as it takes (if you take a soufflé out of the oven too soon, it is inedible).

Parameterized SQL Queries

In regard to the question about parameterized SQL queries mentioned in the vignette for R package RSQLite, we note the following

Parameterized queries do not seem to be a part of standard SQL, which is why I could not find a section about them in W3schools SQL Tutorial, although they are mentioned under the topic "SQL injection".
Even in R there does not seem to be a standard syntax for parameterized queries used in all CRAN packages involving SQL.
Thus the examples in the RSQLite package vignette seem to be specific to RSQLite and not portable to other databases (some other syntax might work for them).

Copy and Paste Programming

The coolest quote I have ever seen about computing. If you learn nothing else in this course but really learn this, it would be a very valuable course. This is by Douglas Crockford from the Read This First chapter of his new book.

I strongly recommend that you do not copy and paste code that you do not understand, even mine. This seems to have become a standard practice, but it is dangerous and reckless. It is not nearly as stupid as installing packages that you haven't even looked at, but it is still really bad. Given the current state of the art, the most important security filter is your brain. Use it. This is important.

C++

Why your humble instructor is anti-C++.

Recently seen on Hacker News, an article Initialization in C++ is Seriously Bonkers: Just Start With C. And your humble instructor just ends with C.
Less recently seen, a talk by Scott Meyers titled The Last Thing D Needs [a bunch of C++ features]. For those who don't know Scott Meyers is a C++ wizard (see next item).
The series of books by Scott Meyers
is very important if you want to use C++ correctly. These books are not just about writing elegant code. They are more about writing correct code that uses C++ features. These books should convince you that is very hard to do.
Some personal experience trying to use GiNaC, which is a recursive acronym that stands for GiNaC is not a CAS, and CAS is a TLA that stands for computer algebra system (Mathematica, Maple, MACSYMA, or the like). I did not get far enough to become a knowledgeable C++ programmer.
(added for 2022) The C++ Frequently Questioned Answers an anti-FAQ.

Good API Design

API is a TLA that stands for application programming interface. For us that mostly means what functions are provided and how those functions are used (what their arguments are).

Speaking of Scott Meyers, he has another wonderful talk titled The Most Important Design Guideline, the guideline being Make Interfaces Easy to Use Correctly and Hard to Use Incorrectly. See also the section on GIEMO in my Stat 3701 notes.

Good Programming Practices

Speaking of wonderful talks, the best talk I know of about R is by Martin Mächler (a member of the R core team), Good Practices in R Programming, which is also recommended on my Stat 3701 home page.

Object-Oriented Programming

Your curmudgeonly instructor is also not a big fan of OOP (object-oriented programming). In this he has some fellow travelers (see here or here).

CPU Caches

Speaking of Scott Meyers, he has another wonderful talk titled CPU Caches and Why You Care. This talk is about hardware and what you need to do to make it go fast. Two guidelines (of several)

Small is the same as fast, and when we are talking about hardware cache small means really small, a few hundred bytes.
The hardware loves arrays in C/C++ terminology, which are called vectors in R terminology. It hates all other data structures. In particular it hates OOP.

Unfortunately, your humble instructor knows too little about this subject to make a unit on it. So this course won't cover it. But you cannot make programs go fast without understanding this, so perhaps we should somehow. We can talk about that.