---
title: "Stat 8054 Lecture Notes: R and SQL Databases"
author: "Charles J. Geyer"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    number_sections: true
    md_extensions: -tex_math_single_backslash
  pdf_document:
    number_sections: true
    md_extensions: -tex_math_single_backslash
---

# License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License (https://creativecommons.org/licenses/by-sa/4.0/).

# R

 * The version of R used to make this document is `r getRversion()`.

 * The version of the `rmarkdown` package used to make this document is
   `r packageVersion("rmarkdown")`.

 * The version of the `RSQLite` package used to make this document is
   `r packageVersion("RSQLite")`.

 * The version of the `DBI` package used to make this document is
   `r packageVersion("DBI")`.

# History

I am not an expert on the history of databases, but I at least know there are
phases to this history.  Most of this relies on the
[Wikipedia article](https://en.wikipedia.org/wiki/Database)
but has some different emphases.

The history is divided into four eras, or generations, which overlap:

 * the dinosaur era (1960's), in which there were large databases but they
   were clumsy to use and relied on highly complicated and usually buggy code
   written by highly trained programmers,

 * the relational and SQL database era (1970's through 1990's), which had the
   following features (not all of which arrived at the same time in the same
   products):

     - relational databases
       ([Wikipedia article](https://en.wikipedia.org/wiki/Relational_database)),
       which to users look like real math: stored tables act like mathematical
       relations that are "programmed" using mathematical logic via

     - SQL (acronym for *Structured Query Language* but pronounced like the
       English word "sequel";
       [Wikipedia article](https://en.wikipedia.org/wiki/SQL)),
       a standardized computer language for relational database operations,
       a language just like R or C++ except just for database operations,

     - ACID (acronym for *Atomicity, Consistency, Isolation, Durability*,
       pronounced like the English word "acid";
       [Wikipedia article](https://en.wikipedia.org/wiki/ACID_(computer_science))),
       which describes the highly reliable transactions found in modern
       so-called SQL databases, like Oracle (and many other products),

 * the noSQL era (2000's), in which all of the great ideas of the relational
   database era were dropped, putting programmers back in the dinosaur era or
   worse, all in the name of scaling to internet scale
   ([Wikipedia article](https://en.wikipedia.org/wiki/NoSQL));
   leading examples are Amazon's Dynamo, Apache Cassandra, CouchDB, MongoDB,
   Redis, HBase, and MemcacheDB,

 * the newSQL era (now;
   [Wikipedia article](https://en.wikipedia.org/wiki/NewSQL)),
   which has the best of both worlds: relational, SQL, ACID, and highly
   scalable; a leading example is Google Spanner.

So while in the 2000's it looked like SQL was old hat and all "data
scientists" needed to learn about noSQL, that now looks dated, although a lot
of web services run on noSQL databases.

A word about pronunciation: sometimes SQL is pronounced "sequel" and sometimes
S-Q-L (sounding the letters).  In "Microsoft SQL Server" the SQL is always
"sequel".  In Oracle MySQL Server the SQL is always S-Q-L, so this is
pronounced "my-S-Q-L".  MySQL was originally open source software before being
acquired by Oracle; its free software successor (fork) is MariaDB.
# SQLite

For learning SQL the greatest thing since sliced bread is SQLite, a relational
database with full SQL support that runs as a user application.  It is just a
software library backed by a file on disk.  So you can do little database
applications with no expensive database server.  And you can learn on it.

The author of SQLite pronounces it S-Q-L-ite, "like a mineral", but does not
object to other pronunciations.

# R and SQL and SQLite

The R package that talks to all SQL databases is CRAN package `DBI`
(for database interface).  The R package that makes SQLite available to R is
CRAN package `RSQLite`.

[Section 6 of my STAT 3701 lecture notes on data](https://www.stat.umn.edu/geyer/3701/notes/data.html#databases)
is about this.  The example in this document was a homework problem in that
course that had numerous hints.  It is better as a straightforward example.

# Reading

 * The [`RSQLite` package vignette](https://cloud.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html)

 * The [SQL Tutorial at w3schools.com](https://www.w3schools.com/sql/)

 * There are a bazillion books on SQL.  I don't have a particular
   recommendation.

# An Example

We just happen to have an SQLite database to serve as an example.

```{r "download"}
download.file("https://www.stat.umn.edu/geyer/8054/data/cran-info.sqlite",
    "cran-info.sqlite")
```

**Note:** On Windows this command must be
```
download.file("https://www.stat.umn.edu/geyer/8054/data/cran-info.sqlite",
    "cran-info.sqlite", mode = "wb")
```
because otherwise Windows does not treat the file as binary but rather as a
text file and messes it up.

We connect to it from R using R packages `DBI` and `RSQLite`.

```{r "connect"}
library(DBI)
mydb <- dbConnect(RSQLite::SQLite(), "cran-info.sqlite")
```

Just to see what we have, we execute some simple SQL queries.

```{r "what"}
dbListTables(mydb)
dbGetQuery(mydb, "SELECT * FROM depends LIMIT 20")
dbGetQuery(mydb, "SELECT * FROM imports LIMIT 20")
dbGetQuery(mydb, "SELECT * FROM suggests LIMIT 20")
dbGetQuery(mydb, "SELECT * FROM linking LIMIT 20")
```

The table `depends` lists CRAN packages in column `packfrom` and in column
`packto` lists other CRAN packages on which they depend (this does not include
R core or recommended packages).  The tables `imports`, `suggests`, and
`linking` have the same structure.

In R one might store data like this in R lists.  Object `depends` would be a
list with one component for each CRAN package, which would be a character
vector (perhaps of length zero) of all CRAN packages on which that package
depends.  We could use the `names` attribute of the list to indicate the
"from" package.  We could do this in code by

```{r "to-list"}
foo <- dbGetQuery(mydb, "SELECT * FROM depends")
depends <- split(foo$packto, foo$packfrom)
head(depends)
rm(foo, depends)
```

But in a SQL database, everything must be a table.  No lists.  Hence we have a
table whose rows are all from-to pairs.

We want to process these data as if we could not fit them all into R (which is
false for this toy problem but might be true for big data), so we have to use
the SQL database to do all operations until we get down to small results we
can return to R.

I could not figure out how to do this in one SQL command, so I had to create
temporary thingummies, which in a SQL database must be tables, since
everything in a SQL database is a table.  (A sketch of how this might be done
in a single query, without temporary tables, appears at the end of these
notes.)
```{r "tmp-one"}
query <- paste("CREATE TABLE temp AS",
    "SELECT packto FROM depends",
    "UNION ALL",
    "SELECT packto FROM imports",
    "UNION ALL",
    "SELECT packto FROM suggests",
    "UNION ALL",
    "SELECT packto FROM linking")
dbExecute(mydb, query)
dbGetQuery(mydb, "SELECT * FROM temp LIMIT 10")
```

The `dbGetQuery` command is just to see what we got (to check that we actually
did what we thought we did).

```{r "tmp-two"}
query <- paste("CREATE TABLE temptoo AS",
    "SELECT packto, COUNT(packto) AS packcount",
    "FROM temp GROUP BY packto")
dbExecute(mydb, query)
dbGetQuery(mydb, "SELECT * FROM temptoo LIMIT 10")
```

Looks OK again.

```{r "result"}
query <- paste("SELECT * FROM temptoo WHERE packcount >= 100",
    "ORDER BY packcount DESC")
dbGetQuery(mydb, query)
```

These are the CRAN packages that are used (depended on, imported, suggested,
or linked to) by at least 100 other CRAN packages, in descending order of that
count.

Clean up.

```{r "clean"}
dbListTables(mydb)
dbExecute(mydb, "DROP TABLE temp")
dbExecute(mydb, "DROP TABLE temptoo")
dbListTables(mydb)
```
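
# One Query Instead of Temporary Tables

As promised above, here is a sketch of how the same computation might be done
in a single query, so that no temporary tables need to be created and dropped.
It uses a common table expression (the SQL `WITH` clause), which recent
versions of SQLite support; the name `alldeps` is just a made-up name for that
common table expression.  This is offered only as a sketch, not as part of the
worked example above.
```
# one query: stack the four tables with UNION ALL inside a common table
# expression, then count and filter in the outer SELECT
query <- paste("WITH alldeps AS (",
    "SELECT packto FROM depends",
    "UNION ALL",
    "SELECT packto FROM imports",
    "UNION ALL",
    "SELECT packto FROM suggests",
    "UNION ALL",
    "SELECT packto FROM linking",
    ")",
    "SELECT packto, COUNT(packto) AS packcount",
    "FROM alldeps GROUP BY packto",
    "HAVING COUNT(packto) >= 100",
    "ORDER BY packcount DESC")
dbGetQuery(mydb, query)
```
Either way, when we are completely done with the database, it is good practice
to close the connection with `dbDisconnect(mydb)`.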