Learning Objectives

  • Access a database from R.
  • Run SQL queries in R using RSQLite and dplyr.
  • Create an SQLite database from existing .csv files.

Introduction

So far, we have dealt with small datasets that easily fit into your computer’s memory. But what about datasets that are too large for your computer to handle as a whole? In this case, storing the data outside of R and organizing it in a database is helpful. Connecting to the database allows you to retrieve only the chunks needed for the current analysis.

Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.

R can connect to almost any existing database type. Most common database types have R packages that allow you to connect to them (e.g., RSQLite, RMySQL). Furthermore, the dplyr package you used in the previous chapter, in conjunction with dbplyr, supports connecting to the widely-used open-source databases SQLite, MySQL, and PostgreSQL, as well as Google's BigQuery, and it can also be extended to other database types (a vignette in the dbplyr package explains how to do it).

Interfacing with databases through dplyr focuses on retrieving and analyzing datasets by generating SELECT SQL statements, but it doesn't modify the database itself: dplyr does not offer functions to UPDATE or DELETE entries. If you need these functionalities, you will need to use additional R packages (e.g., RSQLite). Here we will demonstrate how to interact with a database using dplyr, using both dplyr's verb syntax and SQL syntax.
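If you ever do need to modify a database from R, the DBI package (which RSQLite implements) provides functions such as dbExecute() for sending arbitrary SQL statements. Here is a minimal sketch; the file path and the UPDATE statement are purely illustrative:

library(DBI)
# Connect to an SQLite file via DBI (the path is a placeholder)
con <- dbConnect(RSQLite::SQLite(), "path/to/database.sqlite")
# Hypothetical clean-up: recode empty sex entries
dbExecute(con, "UPDATE specimens SET sex = 'unknown' WHERE sex = ''")
dbDisconnect(con)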

The idigbio_rodents database

We will continue to explore the specimens data you are already familiar with from previous lessons. First, we are going to install the dbplyr package:

install.packages("dbplyr")
library(dbplyr)

The SQLite database is contained in a single file idigbio_rodents.sqlite that you generated during the SQL lesson. If you don’t have it, you can download it from Figshare into the data subdirectory using:

dir.create("data", showWarnings = FALSE)
download.file(url = "https://ndownloader.figshare.com/files/9583582",
              destfile = "data/idigbio_rodents.sqlite", mode = "wb")

Connecting to databases with dplyr

We can point R to this database with dplyr’s src_sqlite() command.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
rodents <- src_sqlite("data/idigbio_rodents.sqlite")

The src_sqlite() command does not load the data into the R session (as the read.csv() function did). Instead, it merely instructs R to connect to the SQLite database contained in the idigbio_rodents.sqlite file.

(You can use src_mysql(), src_postgres(), and src_bigquery() to connect to the other database types supported by dplyr.)

Let’s take a closer look at the rodents database we just connected to:

rodents
#> src:  sqlite 3.19.3 [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#> tbls: localities, species, specimens

Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple tables. In this case three of them are listed in the tbls row in the output above:

  • localities
  • species
  • specimens

Now that we know we can connect to the database, let’s explore how to get the data from its tables into R.

Querying the database with the SQL syntax

To connect to tables within a database, you can use the tbl() function from dplyr. This function can be used to send SQL queries to the database. To demonstrate this functionality, let’s select the columns “year”, “speciesID”, and “localityID” from the specimens table:

tbl(rodents, sql("SELECT year, speciesID, localityID FROM specimens"))

With this approach you can use any of the SQL queries we have seen in the database lesson.
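For example, a query with a WHERE clause is passed through in exactly the same way (the weight cutoff is arbitrary):

tbl(rodents, sql("SELECT year, speciesID, weight FROM specimens WHERE weight < 50"))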

Querying the database with the dplyr syntax

One of the strengths of dplyr is that the same operation can be done using dplyr’s verbs instead of writing SQL. First, we select the table on which to do the operations by creating the specimens object, and then we use the standard dplyr syntax as if it were a data frame:

specimens <- tbl(rodents, "specimens")
specimens %>%
    select(year, speciesID, localityID)

In this case, the specimens object behaves like a data frame. Several functions that can be used with data frames can also be used on tables from a database. For instance, the head() function can be used to check the first 10 rows of the table:

head(specimens, n = 10)
#> # Source:   lazy query [?? x 14]
#> # Database: sqlite 3.19.3
#> #   [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#>                                    uuid  speciesID localityID
#>                                   <chr>      <int>      <int>
#>  1 060380ea-7b06-474e-8d2e-b6e4a8c21e1a  411735100  887181682
#>  2 0fb17a79-a8ce-45b6-b57a-2f640e8cccb6  411735100 1668152736
#>  3 1a69c8ad-0ac3-4612-9dc0-6867e8b9a218  411735100 1668152736
#>  4 1a9932b4-beab-4472-bec1-a7e68c4b9e6e  411735100 1668152736
#>  5 1f3b8aea-fbae-46d1-91c8-274924b40c9f  411735100 1668152736
#>  6 203f0531-9b46-403f-ac09-3acab5be977c  596319155 1294942448
#>  7 21ceb6f0-3905-4e9a-92ae-8a1cd2af4a3a 1323405193 1811586400
#>  8 23d3f0d9-0075-472a-96c4-5d9f01c78d0f 2033031125 1797513247
#>  9 244bbe9f-81aa-42b9-bc92-ffef2b530fc4 1390967007 1073686650
#> 10 2682aa08-3d8f-496a-a487-a76575566b13  597275825  944367018
#> # ... with more rows, and 11 more variables: institutionCode <chr>,
#> #   collectionCode <chr>, catalogNumber <chr>, recordedBy <chr>,
#> #   eventDate <chr>, year <int>, month <int>, day <int>, weight <dbl>,
#> #   length <dbl>, sex <chr>

The output of the head() command looks just like a regular data.frame: the table has 14 columns and head() shows us its first 10 rows. Note that the columns stateProvince, countryCode, genus, and scientificName are missing. These are now located in the localities and species tables, which we will join together in a moment.

However, some functions don’t work quite as expected. For instance, let’s check how many rows there are in total using nrow():

nrow(specimens)
#> [1] NA

That’s strange - R doesn’t know how many rows the specimens table contains - it returns NA instead. You might have already noticed that the first line of the head() output included ?? indicating that the number of rows wasn’t known.

The reason for this behavior highlights a key difference between using dplyr on datasets in memory (e.g. loaded into your R session via read.csv()) and on those provided by a database. To understand it, let’s take a closer look at how dplyr communicates with our SQLite database.

SQL translation

Relational databases typically use a special-purpose language, Structured Query Language (SQL), to manage and query data.

For example, the following SQL query returns the first 10 rows from the specimens table:

SELECT *
FROM `specimens`
LIMIT 10

Behind the scenes, dplyr:

  1. translates your R code into SQL
  2. submits it to the database
  3. translates the database’s response into an R data frame

To lift the curtain, we can use dplyr’s show_query() function to show which SQL commands are actually sent to the database:

show_query(head(specimens, n = 10))
#> <SQL>
#> SELECT *
#> FROM `specimens`
#> LIMIT 10

The output shows the actual SQL query sent to the database; it matches our manually constructed SELECT statement above.

Instead of having to formulate the SQL query ourselves - and having to mentally switch back and forth between R and SQL syntax - we can delegate this translation to dplyr. (You don’t even need to know SQL to interact with a database via dplyr!)

dplyr, in turn, doesn’t do the real work of subsetting the table, either. Instead, it merely sends the query to the database, waits for its response and returns it to us.

That way, R never gets to see the full specimens table - and that’s why it could not tell us how many rows it contains. On the bright side, this allows us to work with large datasets - even ones too large to fit into our computer’s memory.

dplyr can translate many different query types into SQL, allowing us to, e.g., select() specific columns, filter() rows, or join tables.
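For instance, a filter() call becomes a WHERE clause. We can check the translation with show_query() (the year threshold is arbitrary, and the generated SQL may differ slightly between dplyr versions):

specimens %>%
  filter(year > 2000) %>%
  show_query()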

To see this in action, let’s compose a few queries with dplyr.

Simple database queries

First, let’s only request rows of the specimens table in which weight is less than 50 and keep only the speciesID, sex, and weight columns.

specimens %>%
  filter(weight < 50) %>%
  select(speciesID, sex, weight)
#> # Source:   lazy query [?? x 3]
#> # Database: sqlite 3.19.3
#> #   [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#>     speciesID    sex weight
#>         <int>  <chr>  <dbl>
#>  1  411735100   male   30.5
#>  2  411735100 female   22.0
#>  3  411735100   male   49.0
#>  4  411735100 female   33.0
#>  5  411735100 female   29.0
#>  6  596319155 female   23.5
#>  7 1323405193          24.0
#>  8 2033031125 female   27.0
#>  9  597275825   male   12.0
#> 10  621625598 female   33.5
#> # ... with more rows

Executing this command will return a table with 10 rows and the requested speciesID, sex and weight columns. Great!

… but wait, why are there only 10 rows?

The last line:

# ... with more rows

indicates that there are more results that fit our filtering criterion. So why was R lazy, retrieving only 10 of them?

Laziness

Hadley Wickham, the author of dplyr, explains:

When working with databases, dplyr tries to be as lazy as possible:

  • It never pulls data into R unless you explicitly ask for it.
  • It delays doing any work until the last possible moment - it collects together everything you want to do and then sends it to the database in one step.

When you construct a dplyr query, you can connect multiple verbs into a single pipeline. For example, we combined the filter() and select() verbs using the %>% pipe.

If we wanted to, we could add on even more steps, e.g. remove the sex column in an additional select call:

data_subset <- specimens %>%
  filter(weight < 50) %>%
  select(speciesID, sex, weight)

data_subset %>%
  select(-sex)
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.19.3
#> #   [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#>     speciesID weight
#>         <int>  <dbl>
#>  1  411735100   30.5
#>  2  411735100   22.0
#>  3  411735100   49.0
#>  4  411735100   33.0
#>  5  411735100   29.0
#>  6  596319155   23.5
#>  7 1323405193   24.0
#>  8 2033031125   27.0
#>  9  597275825   12.0
#> 10  621625598   33.5
#> # ... with more rows

Just like the first select(speciesID, sex, weight) call, the select(-sex) command is not executed by R. It is sent to the database instead. Only the final result is retrieved and displayed to you.
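You can verify this by asking for the SQL generated by the combined pipeline (the exact query text may vary between dplyr versions):

data_subset %>%
  select(-sex) %>%
  show_query()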

Of course, we could always add on more steps, e.g., we could filter by speciesID or minimum weight. That’s why R doesn’t retrieve the full set of results - instead it only retrieves the first 10 results from the database by default. (After all, you might want to add an additional step and get the database to do more work…)
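For example, we could keep narrowing the same lazy query (the cutoff is arbitrary):

data_subset %>%
  filter(weight > 20)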

To instruct R to stop being lazy, i.e. to retrieve all of the query results from the database, we add the collect() command to our pipeline. It indicates that our database query is finished: time to get the final results and load them into the R session.

data_subset <- specimens %>%
  filter(weight < 50) %>%
  select(speciesID, sex, weight) %>%
  collect()

Now we have all 7157 rows that match our query in a data.frame and can continue to work with them exclusively in R, without communicating with the database.
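Now that the results are an ordinary data.frame, functions like nrow() behave as expected:

nrow(data_subset)
#> [1] 7157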

Complex database queries

dplyr enables database queries across one or multiple database tables, using the same single- and multiple-table verbs you encountered previously. This means you can use the same commands regardless of whether you interact with a remote database or local dataset! This is a really useful feature if you work with large datasets: you can first prototype your code on a small subset that fits into memory, and when your code is ready, you can change the input dataset to your full database without having to change the syntax.

On the other hand, being able to use SQL queries directly can be useful if your collaborators have already put together complex queries to prepare the dataset that you need for your analysis.

To illustrate how to use dplyr with these complex queries, we are going to join the localities and specimens tables. The localities table in the database contains information about the different localities surveyed by the researchers. To access it, we point the tbl() command to it:

localities <- tbl(rodents, "localities")
localities
#> Warning in rsqlite_fetch(res@ptr, n = n): Column `decimalLatitude`: mixed
#> type, first seen values of type real, coercing other values of type string
#> Warning in rsqlite_fetch(res@ptr, n = n): Column `decimalLongitude`: mixed
#> type, first seen values of type real, coercing other values of type string
#> # Source:   table<localities> [?? x 6]
#> # Database: sqlite 3.19.3
#> #   [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#>    localityID countryCode stateProvince           county decimalLatitude
#>         <int>       <chr>         <chr>            <chr>           <dbl>
#>  1 1811586400         usa      oklahoma         le flore        34.53903
#>  2 1814588800         usa      oklahoma         mcintosh         0.00000
#>  3  543639200         usa    washington snohomish county        47.78533
#>  4 1234201200         usa    new mexico  sandoval county        35.80743
#>  5 1133867400         usa    california  mariposa county        37.66312
#>  6  914916400         usa    new mexico     grant county        32.89984
#>  7 1740702200         usa         idaho        bear lake        42.19194
#>  8  599691200         usa      oklahoma          tillman        34.55083
#>  9  612348800         usa    new mexico    sierra county        32.98070
#> 10 1197391000         usa    california del norte county        41.82547
#> # ... with more rows, and 1 more variables: decimalLongitude <dbl>

The localityID column also features in the specimens table:

specimens
#> # Source:   table<specimens> [?? x 14]
#> # Database: sqlite 3.19.3
#> #   [/home/mcollins/Git/R-nhcdata-lesson/data/idigbio_rodents.sqlite]
#>                                    uuid  speciesID localityID
#>                                   <chr>      <int>      <int>
#>  1 060380ea-7b06-474e-8d2e-b6e4a8c21e1a  411735100  887181682
#>  2 0fb17a79-a8ce-45b6-b57a-2f640e8cccb6  411735100 1668152736
#>  3 1a69c8ad-0ac3-4612-9dc0-6867e8b9a218  411735100 1668152736
#>  4 1a9932b4-beab-4472-bec1-a7e68c4b9e6e  411735100 1668152736
#>  5 1f3b8aea-fbae-46d1-91c8-274924b40c9f  411735100 1668152736
#>  6 203f0531-9b46-403f-ac09-3acab5be977c  596319155 1294942448
#>  7 21ceb6f0-3905-4e9a-92ae-8a1cd2af4a3a 1323405193 1811586400
#>  8 23d3f0d9-0075-472a-96c4-5d9f01c78d0f 2033031125 1797513247
#>  9 244bbe9f-81aa-42b9-bc92-ffef2b530fc4 1390967007 1073686650
#> 10 2682aa08-3d8f-496a-a487-a76575566b13  597275825  944367018
#> # ... with more rows, and 11 more variables: institutionCode <chr>,
#> #   collectionCode <chr>, catalogNumber <chr>, recordedBy <chr>,
#> #   eventDate <chr>, year <int>, month <int>, day <int>, weight <dbl>,
#> #   length <dbl>, sex <chr>

Because localityID is listed in both tables, we can use it to look up matching records, and join the two tables.

[Diagram illustrating inner and left joins]

For example, to extract all specimens collected at the locality with localityID 18620641, we can do:

localities %>%
  filter(localityID == 18620641) %>%
  inner_join(specimens) %>%
  collect()
#> Joining, by = "localityID"
#> # A tibble: 34 x 19
#>    localityID countryCode stateProvince          county decimalLatitude
#>         <int>       <chr>         <chr>           <chr>           <dbl>
#>  1   18620641         usa    california tuolumne county        37.91355
#>  2   18620641         usa    california tuolumne county        37.91355
#>  3   18620641         usa    california tuolumne county        37.91355
#>  4   18620641         usa    california tuolumne county        37.91355
#>  5   18620641         usa    california tuolumne county        37.91355
#>  6   18620641         usa    california tuolumne county        37.91355
#>  7   18620641         usa    california tuolumne county        37.91355
#>  8   18620641         usa    california tuolumne county        37.91355
#>  9   18620641         usa    california tuolumne county        37.91355
#> 10   18620641         usa    california tuolumne county        37.91355
#> # ... with 24 more rows, and 14 more variables: decimalLongitude <dbl>,
#> #   uuid <chr>, speciesID <int>, institutionCode <chr>,
#> #   collectionCode <chr>, catalogNumber <chr>, recordedBy <chr>,
#> #   eventDate <chr>, year <int>, month <int>, day <int>, weight <dbl>,
#> #   length <dbl>, sex <chr>

Important Note: Without the collect() statement, only the first 10 matching rows are returned. By adding collect(), the full set of 34 is retrieved.
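You can confirm this by counting the rows of the collected result:

localities %>%
  filter(localityID == 18620641) %>%
  inner_join(specimens) %>%
  collect() %>%
  nrow()
#> [1] 34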

Challenge

Write a query that returns the number of Dipodomys specimens at each institution in each year.

Hint: Connect to the species table and write a query that joins the species and specimens tables together to exclude all non-Dipodomys genera. The query should return counts of specimens by institution and year.

Optional: Write a query in SQL that will produce the same result. You can join multiple tables together using the following syntax, where key is the shared identifier column (e.g., speciesID):

SELECT table1.col, table2.col
FROM table1
JOIN table2 ON table1.key = table2.key
JOIN table3 ON table2.key = table3.key

Challenge

Write a query that returns the total number of rodents in each genus collected in the different states.

Hint: Write a query that joins the species, localities, and specimens tables together. The query should return counts of genus by state.

This is useful if we are interested in estimating the number of specimens belonging to each genus found in each state. But what if we were interested in the number of distinct scientific names found in each state? Using tally() gives the number of individuals; instead, we need to use n_distinct() to count the number of unique values found in a column.

species <- tbl(rodents, "species")
unique_names <- left_join(specimens, localities) %>%
    left_join(species) %>%
    group_by(stateProvince) %>%
    summarize(
        n_scientificNames = n_distinct(scientificName)
    ) %>%
    collect()
#> Joining, by = "localityID"
#> Joining, by = "speciesID"

n_distinct(), like the other dplyr functions we have used in this lesson, works not only on database connections but also on regular data frames.
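For example, on a plain vector it simply counts the unique values:

n_distinct(c("a", "b", "a"))
#> [1] 2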

Creating a new SQLite database

So far, we have used a previously prepared SQLite database. But we can also use R to create a new database, e.g. from existing csv files. Let’s use R to recreate the rodents database that we’ve been working with. First, let’s read in the csv files.

species <- read.csv("data/species.csv")
specimens <- read.csv("data/specimens.csv")
localities <- read.csv("data/localities.csv")

Creating a new SQLite database with dplyr is easy. You can re-use the same command we used above to open an existing .sqlite file. The create = TRUE argument instructs R to create a new, empty database instead.

Caution: When create = TRUE is added, any existing database at the same location is overwritten without warning.

my_db_file <- "my_rodent_database.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)

Currently, our new database is empty; it doesn’t contain any tables:

my_db
#> src:  sqlite 3.19.3 [my_rodent_database.sqlite]
#> tbls:

To add tables, we copy the existing data.frames into the database one by one:

copy_to(my_db, specimens)
copy_to(my_db, localities)
my_db

If you check the location of our database you’ll see that data is automatically being written to disk. R and dplyr not only provide easy ways to query existing databases, they also allow you to easily create your own databases from flat files!
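For example, the freshly copied table can be queried just like the tables of the downloaded database:

tbl(my_db, "specimens") %>%
  select(year, speciesID, localityID)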

Challenge

Add the remaining species table to the my_db database and run some of your queries from earlier in the lesson to verify that you have faithfully recreated the rodents database.

Note: In this example, we first loaded all of the data into the R session by reading the three csv files. Because all the data has to flow through R, this is not suitable for very large datasets.
