---
title: "Stat 8054 Lecture Notes: R and SQL Databases"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    number_sections: true
    md_extensions: -tex_math_single_backslash
  pdf_document:
    number_sections: true
    md_extensions: -tex_math_single_backslash
---

# License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License (https://creativecommons.org/licenses/by-sa/4.0/).

# R

 * The version of R used to make this document is `r getRversion()`.

 * The version of the `rmarkdown` package used to make this document is
   `r packageVersion("rmarkdown")`.

 * The version of the `RSQLite` package used to make this document is
   `r packageVersion("RSQLite")`.

 * The version of the `DBI` package used to make this document is
   `r packageVersion("DBI")`.

# History

I am not an expert on the history of databases, but I at least know there are
phases to this history.  Most of this relies on the
[Wikipedia article](https://en.wikipedia.org/wiki/Database)
but has some different emphases.

The history is divided into four eras, or generations, which overlap:

 * the dinosaur era (1960's), in which there were large databases but they
   were clumsy to use and relied on highly complicated and usually buggy code
   written by highly trained programmers,

 * the relational and SQL database era (1970's through 1990's), which had the
   following features (not all of which arrived at the same time in the same
   products):

     - relational databases
       ([Wikipedia article](https://en.wikipedia.org/wiki/Relational_database)),
       which to users look like real math: stored tables act like mathematical
       relations that are "programmed" using mathematical logic via

     - SQL (acronym for *Structured Query Language* but pronounced like the
       English word "sequel";
       [Wikipedia article](https://en.wikipedia.org/wiki/SQL)),
       a standardized computer language for relational database operations,
       a language just like R or C++ except just for database operations,

     - ACID (acronym for *Atomicity, Consistency, Isolation, Durability*,
       pronounced like the English word "acid";
       [Wikipedia article](https://en.wikipedia.org/wiki/ACID_(computer_science))),
       which describes the highly reliable transactions found in modern
       so-called SQL databases, like Oracle (and many other products),

 * the noSQL era (2000's), in which all of the great ideas of the relational
   database era were dropped, putting programmers back in the dinosaur era or
   worse, all in the name of scaling to internet scale
   ([Wikipedia article](https://en.wikipedia.org/wiki/NoSQL));
   leading examples are Amazon's Dynamo, Apache Cassandra, CouchDB, MongoDB,
   Redis, HBase, and MemcacheDB,

 * the newSQL era (now;
   [Wikipedia article](https://en.wikipedia.org/wiki/NewSQL)),
   which has the best of both worlds: relational, SQL, ACID, and highly
   scalable; a leading example is Google Spanner.

So while in the 2000's it looked like SQL was old hat and all "data
scientists" needed to learn about noSQL, that now looks dated, although a lot
of web services run on noSQL databases.

A word about pronunciation: sometimes SQL is pronounced "sequel" and sometimes
S-Q-L (sounding the letters).  In "Microsoft SQL Server" the SQL is always
"sequel".  In Oracle MySQL Server the SQL is always S-Q-L, so this is
pronounced "my-S-Q-L".  MySQL was originally open source software before being
acquired by Oracle; its free software successor (fork) is MariaDB.
# SQLite

For learning SQL the greatest thing since sliced bread is SQLite, a relational
database with full SQL support that runs as a user application.  It is just a
software library backed by a file on disk.  So you can do little database
applications with no expensive database server.  And you can learn on it.

The author of SQLite pronounces it S-Q-L-ite, "like a mineral", but does not
object to other pronunciations.

# R and SQL and SQLite

The R package that talks to all SQL databases is CRAN package `DBI`
(for database interface).  The R package that makes SQLite available to R is
CRAN package `RSQLite`.

[Section 6 of my STAT 3701 lecture notes on data](https://www.stat.umn.edu/geyer/3701/notes/data.html#databases)
is about this.  The example in this document was a homework problem in that
course that had numerous hints.  It is better as a straightforward example.

# Reading

 * The [`RSQLite` package vignette](https://cloud.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html)

 * The [SQL Tutorial at w3schools.com](https://www.w3schools.com/sql/)

 * There are a bazillion books on SQL.  I don't have a particular
   recommendation.

# An Example

We just happen to have an SQLite database to serve as an example.

```{r "download"}
download.file("https://www.stat.umn.edu/geyer/8054/data/cran-info.sqlite",
    "cran-info.sqlite")
```

**Note:** On Windows this command must be
```
download.file("https://www.stat.umn.edu/geyer/8054/data/cran-info.sqlite",
    "cran-info.sqlite", mode = "wb")
```
because otherwise Windows does not treat the file as binary but rather as a
text file and messes it up.

We connect to it from R using R packages `DBI` and `RSQLite`.

```{r "connect"}
library(DBI)
mydb <- dbConnect(RSQLite::SQLite(), "cran-info.sqlite")
```

Just to see what we have, we execute some simple SQL queries.

```{r "what"}
dbListTables(mydb)
dbGetQuery(mydb, "SELECT * FROM depends LIMIT 20")
dbGetQuery(mydb, "SELECT * FROM imports LIMIT 20")
dbGetQuery(mydb, "SELECT * FROM suggests LIMIT 20")
dbGetQuery(mydb, "SELECT * FROM linking LIMIT 20")
```

The table `depends` lists CRAN packages in column `packfrom` and in column
`packto` lists other CRAN packages on which they depend (this does not include
R core or recommended packages).  The tables `imports`, `suggests`, and
`linking` have the same structure.

In R one might store data like this in R lists.  Object `depends` would be a
list with one component for each CRAN package, which would be a character
vector (perhaps of length zero) of all CRAN packages on which that package
depends.  We could use the `names` attribute of the list to indicate the
"from" package.  We could do this in code by

```{r "to-list"}
foo <- dbGetQuery(mydb, "SELECT * FROM depends")
depends <- split(foo$packto, foo$packfrom)
head(depends)
rm(foo, depends)
```

But in a SQL database, everything must be a table.  No lists.  Hence we have a
table whose rows are all from-to pairs.

We want to process these data as if we could not fit them all into R (which is
false for this toy problem but might be true for big data), so we have to use
the SQL database to do all operations until we get down to small results we
can return to R.

I could not figure out how to do this in one SQL command, so I had to create
temporary thingummies, which in a SQL database must be tables, since
everything in a SQL database is a table.  (A sketch of how this might be done
in a single query, without temporary tables, appears at the end of these
notes.)
```{r "tmp-one"}
query <- paste("CREATE TABLE temp AS",
    "SELECT packto FROM depends",
    "UNION ALL",
    "SELECT packto FROM imports",
    "UNION ALL",
    "SELECT packto FROM suggests",
    "UNION ALL",
    "SELECT packto FROM linking")
dbExecute(mydb, query)
dbGetQuery(mydb, "SELECT * FROM temp LIMIT 10")
```

The `dbGetQuery` command is just to see what we got (to check that we actually
did what we thought we did).

```{r "tmp-two"}
query <- paste("CREATE TABLE temptoo AS",
    "SELECT packto, COUNT(packto) AS packcount",
    "FROM temp GROUP BY packto")
dbExecute(mydb, query)
dbGetQuery(mydb, "SELECT * FROM temptoo LIMIT 10")
```

Looks OK again.

```{r "result"}
query <- paste("SELECT * FROM temptoo WHERE packcount >= 100",
    "ORDER BY packcount DESC")
dbGetQuery(mydb, query)
```

These are the CRAN packages that are used (depended on, imported, suggested,
or linked to) by at least 100 other CRAN packages, in descending order of that
count.

Clean up.

```{r "clean"}
dbListTables(mydb)
dbExecute(mydb, "DROP TABLE temp")
dbExecute(mydb, "DROP TABLE temptoo")
dbListTables(mydb)
```
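
# One Query Instead of Temporary Tables

As promised above, here is a sketch of how the same computation might be done
in a single query, so that no temporary tables need to be created and dropped.
It uses a common table expression (the SQL `WITH` clause), which recent
versions of SQLite support; the name `alldeps` is just a made-up name for that
common table expression.  This is offered only as a sketch, not as part of the
worked example above.
```
# one query: stack the four tables with UNION ALL inside a common table
# expression, then count and filter in the outer SELECT
query <- paste("WITH alldeps AS (",
    "SELECT packto FROM depends",
    "UNION ALL",
    "SELECT packto FROM imports",
    "UNION ALL",
    "SELECT packto FROM suggests",
    "UNION ALL",
    "SELECT packto FROM linking",
    ")",
    "SELECT packto, COUNT(packto) AS packcount",
    "FROM alldeps GROUP BY packto",
    "HAVING COUNT(packto) >= 100",
    "ORDER BY packcount DESC")
dbGetQuery(mydb, query)
```
Either way, when we are completely done with the database, it is good practice
to close the connection with `dbDisconnect(mydb)`.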