---
title: "Stat 8054 Lecture Notes: Parallel Computing in R"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    number_sections: true
    md_extensions: -tex_math_single_backslash
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML
  pdf_document:
    number_sections: true
    md_extensions: -tex_math_single_backslash
---

# License

This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License
(http://creativecommons.org/licenses/by-sa/4.0/).

# R

 * The version of R used to make this document is `r getRversion()`.

 * The version of the `rmarkdown` package used to make this document is
   `r packageVersion("rmarkdown")`.

 * The version of the `parallel` package used to make this document is
   `r packageVersion("parallel")`.   This is a "recommended" package
   that is installed by default in every installation of R, so the
   package version goes with the R version.

# Introduction

The example that we will use throughout this document
is simulating the sampling distribution of the MLE
for $\text{Normal}(\theta, \theta^2)$ data.

# Set-Up

```{r "code", cache=TRUE}
# sample size
n <- 10
# simulation sample size
nsim <- 1e4
# true unknown parameter value
# of course in the simulation it is known, but we pretend we don't
# know it and estimate it
theta <- 1

doit <- function(estimator, seed = 42) {
    set.seed(seed)
    result <- double(nsim)
    for (i in 1:nsim) {
        x <- rnorm(n, theta, abs(theta))
        result[i] <- estimator(x)
    }
    return(result)
}

mlogl <- function(theta, x) sum(- dnorm(x, theta, abs(theta), log = TRUE))

mle <- function(x) {
    theta.start <- sign(mean(x)) * sd(x)
    if (all(x == 0) || theta.start == 0)
        return(0)
    nout <- nlm(mlogl, theta.start, iterlim = 1000, x = x)
    if (nout$code > 3)
        return(NaN)
    return(nout$estimate)
}
```

 * R function `doit` simulates `nsim` datasets, applies an estimator
   supplied as an argument to the function to each, and returns the
   vector of results.

 * R function `mlogl` is minus the log likelihood of the model in question.
   We could easily change the code to do another model by changing only
   this function.  (When the code mimics the math, the design is usually good.)

 * R function `mle` calculates the estimator by calling R function `nlm` to
   minimize `mlogl`.  The starting value `sign(mean(x)) * sd(x)` is a
   reasonable estimator because `mean(x)` is a consistent estimator
   of $\theta$ and `sd(x)` is a consistent estimator of
   $\lvert \theta \rvert$.

# Doing the Simulation without Parallelization

## Try It

```{r}
theta.hat <- doit(mle)
```

## Check It

```{r "hist-non-par", fig.align='center'}
hist(theta.hat, probability = TRUE, breaks = 30)
curve(dnorm(x, mean = theta, sd = theta / sqrt(3 * n)), add = TRUE)
```

The curve is the PDF of the asymptotic normal distribution of the MLE,
which uses the formula
$$
   I_n(\theta) = \frac{3 n}{\theta^2}
$$
(the "usual" asymptotics of maximum likelihood).

Looks pretty good.  The large negative estimates are probably not a mistake.
The parameter is allowed to be negative, so sometimes the estimates come
out negative even though the truth is positive.  And not just a little
negative because $\lvert \theta \rvert$ is also the standard deviation,
so it cannot be small and the model fit the data.
```{r}
sum(is.na(theta.hat))
mean(is.na(theta.hat))
sum(theta.hat < 0, na.rm = TRUE)
mean(theta.hat < 0, na.rm = TRUE)
```

## Time It

Now for something new.  We will time it.
```{r "time-non-par-norep", cache=TRUE, dependson="code"}
time1 <- system.time(theta.hat.mle <- doit(mle))
time1
```

The components of this vector are (these are taken from the R help page
for R function `proc.time`, which R function `system.time` calls, and
the UNIX man page for the UNIX system call `getrusage` system call,
which `proc.time` calls)

 * `user.self` the time the parent process (the R process executing the
   commands we see, like `doit`) spends in user mode

 * `sys.self` the time the parent process spends in kernel model
   (doing system calls),

 * `elapsed` the clock time,

We see that there is not a lot of difference between user mode time and
elapsed time.  We can take either to be the time.

## Time It More Accurately

That's too short a time for accurate timing.  So increase the number
of iterations.  Also we should probably average
over several IID iterations to get a good average.  Try again.
```{r "time-non-par", cache=TRUE, dependson="code"}
nsim <- 1e5
nrep <- 7
time1 <- NULL
for (irep in 1:nrep)
    time1 <- rbind(time1, system.time(theta.hat.mle <- doit(mle)))
time1
```

Now we see two more components of `proc.time` objects

 * `user.child` total user mode time for for all children of the calling
   process that have terminated and been waited for and grandchildren
   and further removed descendants, if all of the intervening  descendants
   waited on their terminated children, which is a mouthful, here it is
   all the work done by the fork-and-exec'ed children, and

 * `sys.child` like the preceding except kernel mode rather than user mode.

These components are in every `proc_time` object.  The reason why we didn't
see them before is something about what R function `print.proc_time` does.
Now
```{r "time-non-par-class"}
class(time1)
```
the printing is being done by R function `print.default` since there is no
`print.matrix` and the class of what we are printing is no longer
`proc_time`.

```{r "time-non-par-too"}
apply(time1, 2, mean)
apply(time1, 2, sd) / sqrt(nrep)
```
So we have about one and a half significant figures in our timing on this.
Longer runs would have more accuracy.

# Parallel Computing With Unix Fork and Exec

## Introduction

This method is by far the simplest but

 * it only works on one computer (using however many simultaneous processes
   the computer can do), and

 * it does not work on Windows.

## Toy Problem

First a toy problem that does nothing except show that we are actually
using different processes.
```{r}
library(parallel)
ncores <- detectCores()
mclapply(1:ncores, function(x) Sys.getpid(), mc.cores = ncores)
```

## Warning

Quoted from the help page for R function `mclapply`

>    It is _strongly discouraged_ to use these functions in GUI or
>    embedded environments, because it leads to several processes
>    sharing the same GUI which will likely cause chaos (and possibly
>    crashes).  Child processes should never use on-screen graphics
>    devices.

GUI includes RStudio.  If you want speed, then you will have to learn
how to use plain old R.
The examples in the [section on using clusters](#latis) show that.
Of course, this whole document shows that too.
(No RStudio was used to make this document.  In plain old R we said
`library("rmarkdown")` and then `render("parallel.Rmd")`.)

## Parallel Streams of Random Numbers

To get random numbers in parallel, we need to use a special
random number generator (RNG) designed for parallelization.

```{r}
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
mclapply(1:ncores, function(x) rnorm(5), mc.cores = ncores)
```
Just right!
We have different random numbers in all our jobs.
And it is reproducible.
 
But this may not work like you may think it does.
If we do it again we get exactly the same results.
```{r}
mclapply(1:ncores, function(x) rnorm(5), mc.cores = ncores)
```
Running `mclapply` does not change `.Random.seed` in the parent process
(the R process you are typing into).  It only changes it in the child
processes (that do the work).  But there is no communication from child
to parent *except* the list of results returned by `mclapply`.

This is a fundamental problem with `mclapply` and the fork-exec method
of parallelization.  And it has no real solution.
You just have to be aware of it.

If you want to do exactly the same random thing with `mclapply`
and get different random results, then you must change `.Random.seed`
in the parent process, either with `set.seed` or by otherwise using
random numbers *in the parent process*.

## The Example {#fork-example}

We need to rewrite our `doit` function

 * to only do `1 / ncores` of the work in each child process,

 * to not set the random number generator seed, and

 * to take an argument in some list we provide.

```{r "code-too", cache=TRUE}
doit <- function(nsim, estimator) {
    result <- double(nsim)
    for (i in 1:nsim) {
        x <- rnorm(n, theta, abs(theta))
        result[i] <- estimator(x)
    }
    return(result)
}
```

## Try It {#fork-try}

```{r "try-fork-exec", cache=TRUE, dependson=c("code","code-too")}
mout <- mclapply(rep(nsim / ncores, ncores), doit,
    estimator = mle, mc.cores = ncores)
lapply(mout, head)
```

## Check It {#fork-check}

Seems to have worked.
```{r}
length(mout)
sapply(mout, length)
lapply(mout, head)
```

Plot it.
```{r "hist-fork-exec", fig.align='center'}
theta.hat <- unlist(mout)
hist(theta.hat, probability = TRUE, breaks = 30)
curve(dnorm(x, mean = theta, sd = theta / sqrt(3 * n)), add = TRUE)
```

## Time It {#fork-time}

```{r "time-fork-exec", cache=TRUE, dependson=c("code","code-too")}
time4 <- NULL
for (irep in 1:nrep)
    time4 <- rbind(time4, system.time(theta.hat.mle <-
        unlist(mclapply(rep(nsim / ncores, ncores), doit,
            estimator = mle, mc.cores = ncores))))
time4
```
Now we see that, unlike when we had no parallelism, now `user.self` is
almost no time.  All the time is in `user.child`.  The child processes
are doing all the work.  We also see that the total child time is far
longer than the actual elapsed time (in the real world).  So we are
getting parallelism.
```{r "time-fork-exec-too"}
apply(time4, 2, mean)
apply(time4, 2, sd) / sqrt(nrep)
```

We got the desired speedup.  The elapsed time averages
```{r}
apply(time4, 2, mean)["elapsed"]
```
with parallelization and
```{r}
apply(time1, 2, mean)["elapsed"]
```
```{r echo = FALSE}
rats <- apply(time1, 2, mean)["elapsed"] / apply(time4, 2, mean)["elapsed"]
```
without parallelization.  But we did not get an `r ncores`-fold speedup
with `r ncores` cores.
There is a cost to starting and stopping the child processes.  And some
time needs to be taken from this number crunching to run the rest of the
computer.  However, we did get a `r round(rats, 1)`-fold speedup.  If we had
more cores, we could do even better.

# The Example With a Cluster

## Introduction

This method is more complicated but

 * it works on clusters like the ones at
[LATIS (College of Liberal Arts Technologies and Innovation
Services)](http://z.umn.edu/claresearchcomputing) or at the
[Minnesota Supercomputing Institute](https://www.msi.umn.edu/).

 * according to the documentation, it does work on Windows.

## Toy Problem

First a toy problem that does nothing except show that we are actually
using different processes.
```{r}
library(parallel)
ncores <- detectCores()
cl <- makePSOCKcluster(ncores)
parLapply(cl, 1:ncores, function(x) Sys.getpid())
stopCluster(cl)
```

This is more complicated in that

 * first you you set up a cluster, here with `makePSOCKcluster` but
   not everywhere --- there are a variety of different commands to
   make clusters and the command would be different at LATIS or MSI
   --- and

 * at the end you tear down the cluster with `stopCluster`.

Of course, you do not need to tear down the cluster before you are
done with it.  You can execute multiple `parLapply` commands on the
same cluster.

There are also a lot of other commands other than `parLapply` that
can be used on the cluster.  We will see some of them below.

## Parallel Streams of Random Numbers {#rng-cluster}

```{r}
cl <- makePSOCKcluster(ncores)
clusterSetRNGStream(cl, 42)
parLapply(cl, 1:ncores, function(x) rnorm(5))
parLapply(cl, 1:ncores, function(x) rnorm(5))
```

We see that clusters do not have the same problem with continuing
random number streams that the fork-exec mechanism has.

 * Using fork-exec there is a *parent* process and *child* processes
   (all running on the same computer) and the *child* processes end
   when their work is done (when `mclapply` finishes).

 * Using clusters there is a *controller* process and *worker* processes
   (possibly running on many different computers) and the *worker*
   processes end when the cluster is torn down (with `stopCluster`).

So the worker processes continue and each remembers where it is in its
random number stream (each has a different random number stream).

## The Example on a Cluster

### Set Up {#cluster-setup}

Another complication of using clusters is that the worker processes
are completely independent of the controller process.  Any information they have
must be explicitly passed to them.

This is very unlike the fork-exec model in which all of the child processes
are copies of the parent process inheriting all of its memory (and thus
knowing about any and all R objects it created).

So in order for our example to work we must explicitly distribute stuff
to the cluster.
```{r}
clusterExport(cl, c("doit", "mle", "mlogl", "n", "nsim", "theta"))
```

Now all of the workers have those R objects, as copied from the controller
process right now.  If we change them in the controller (pedantically if
we change the R objects those *names* refer to) the workers won't know
about it.  Thus if we change these objects on the controller, we must re-export
them to the workers.

### Try It {#cluster-try}

So now we are set up to try our example.
```{r "try-cluster", cache=TRUE, dependson=c("code","code-too")}
pout <- parLapply(cl, rep(nsim / ncores, ncores), doit, estimator = mle)
```

### Check It {#cluster-check}

Seems to have worked.
```{r}
length(pout)
sapply(pout, length)
lapply(pout, head)
```

Plot it.
```{r "hist-cluster", fig.align='center'}
theta.hat <- unlist(mout)
hist(theta.hat, probability = TRUE, breaks = 30)
curve(dnorm(x, mean = theta, sd = theta / sqrt(3 * n)), add = TRUE)
```

### Time It {#cluster-time}

```{r "time-cluster", cache=TRUE, dependson=c("code","code-too")}
time5 <- NULL
for (irep in 1:nrep)
    time5 <- rbind(time5, system.time(theta.hat.mle <-
        unlist(parLapply(cl, rep(nsim / ncores, ncores),
            doit, estimator = mle))))
time5
```
Now we are not seeing any child times because the workers are not children
of the controller R process, they need not even be running on the same computer.
If we wanted to know about times on the workers, we would have to run
R function `system.time` on the workers and also include that information
in the result somehow.  So the only time of interest here is `elapsed`.
```{r "time-cluster-too"}
apply(time5, 2, mean)
apply(time5, 2, sd) / sqrt(nrep)
```

We got the desired speedup.  The elapsed time averages
```{r}
apply(time5, 2, mean)["elapsed"]
```
with parallelization and
```{r}
apply(time1, 2, mean)["elapsed"]
```
```{r echo = FALSE}
rats <- apply(time1, 2, mean)["elapsed"] / apply(time5, 2, mean)["elapsed"]
```
without parallelization.  But we did not get an `r ncores`-fold speedup
with `r ncores` cores.
There is a cost to starting and stopping the child processes.  And some
time needs to be taken from this number crunching to run the rest of the
computer.  However, we did get a `r round(rats, 1)`-fold speedup.  If we had
more cores, we could do even better.

We also see that this method isn't quite as fast as the other method.
So why do we want it (other than that the other doesn't work on Windows)?
Because it scales.  You can get clusters with thousands of cores, but
you can't get thousands of cores in one computer.

## Tear Down

Don't forget to tear down the cluster when you are done.
```{r}
stopCluster(cl)
```

# LATIS

For more about the LATIS see https://cla.umn.edu/latis.

For more about the compute cluster `compute.cla.umn.edu` run by LATIS see
https://cla.umn.edu/research-creative-work/research-development/research-computing-and-software.

For all the info on various queues and resource limits see
https://neighborhood.cla.umn.edu/node/5371.

## Fork-Exec

This is just like the
[fork-exec section above](#parallel-computing-with-unix-fork-and-exec)
except for a few minor changes for running on LATIS.

### Interactive Session

SSH into `compute.cla.umn.edu`.  Then
```{r engine='bash', eval=FALSE}
qsub -I -l nodes=1:ppn=8
cd tmp/8054/parallel # or wherever
wget -N http://www.stat.umn.edu/geyer/8054/scripts/latis/fork-exec.R
module load R/3.6.1
R CMD BATCH --vanilla fork-exec.R
cat fork-exec.Rout
exit
```
### Batch Job

```{r engine='bash', eval=FALSE}
cd tmp/8054/parallel # or wherever
wget -N http://www.stat.umn.edu/geyer/8054/scripts/latis/fork-exec.R
wget -N http://www.stat.umn.edu/geyer/8054/scripts/latis/fork-exec.pbs
qsub fork-exec.pbs
```
If you want e-mail sent to you when the job starts and completes,
then add the lines
```
### EMAIL NOTIFICATION OPTIONS ###
# Send email on a:abort, b:begin, e:end
#PBS -m abe
# Your email address
#PBS -M yourusername@umn.edu
```
to `fork-exec.pbs` where, of course, `yourusername` is replaced by
your actual username.

You may also need to alter the line in the file `fork-exec.pbs`
about `walltime`.
This asks for 20 minutes.  If the job doesn't finish in that amount of
time, then it is killed.  If you delete this line entirely, then the
default two days.

The linux command
```
qstat
```
will tell you if your job is running.

The linux command
```
qdel jobnumber
```
where `jobnumber` is the actual job number shown by `qstat`, will kill the job.

You can log out of `compute.cla.umn.edu` after your job starts in batch
mode.  It will keep running.

## MPI Cluster

[MPI clusters](https://en.wikipedia.org/wiki/Message_Passing_Interface)
are the way "big iron" does parallel processing.  So if you ever want to
move up to the clusters at the
[Minnesota Supercomputing Institute](https://www.msi.umn.edu/)
or even bigger clusters, you need to start here.

This is just like the
[cluster section above](#the-example-with-a-cluster)
except for a few minor changes for running on LATIS.

### Interactive Session

In order for an MPI cluster to work, we need to install R packages
`Rmpi` and `snow` which LATIS does not install for us.
So users have to install them themselves (like any other CRAN package
they want to use).  It also turns out that we have to load two modules
rather than one: the module containing R and another module containing
a newer version of openmpi than the default.  This means we will also
have to load these two modules whenever we want to use an MPI cluster.
```
qsub -I
module load R/3.6.1
module load openmpi/3.1.3
R --vanilla
options(repos = c(CRAN="https://cloud.r-project.org/"))
install.packages("Rmpi")
install.packages("snow")
q()
exit
```
R function ```install.packages``` will tell you that it cannot install
the packages in the usual place and asks you if you want to install
it in a "personal library".  Say yes.  Then it suggests a location for
the "personal library".  Say yes.

Now almost the same thing again (again we use only one node because we
are only allowed one node for an interactive queue)
```{r engine='bash', eval=FALSE}
qsub -I -l nodes=1:ppn=8
cd tmp/8054/parallel # or wherever
wget -N http://www.stat.umn.edu/geyer/8054/scripts/latis/cluster.R
module load R/3.6.1
module load openmpi/3.1.3
R CMD BATCH --vanilla cluster.R
cat cluster.Rout
exit
```

The `qsub` command says we want to use 8 cores on 1 node (computer).
The interactive queue is not allowed to use more than one node.

### Batch Job

Same as above [*mutatis mutandis*](https://www.merriam-webster.com/dictionary/mutatis%20mutandis)
```{r engine='bash', eval=FALSE}
cd tmp/8054/parallel # or wherever
wget -N http://www.stat.umn.edu/geyer/8054/scripts/latis/cluster.R
wget -N http://www.stat.umn.edu/geyer/8054/scripts/latis/cluster.pbs
qsub cluster.pbs
```
with the same comments about email and walltime.

## Sockets Cluster on LATIS

You can make a sockets cluster on LATIS if you are only using one node.
It works just like in the example in the main text.