Quick take-aways from RStudio::conf Training Day 02

For the past few days I’ve been in Austin, Texas with Stephen Knight and Riinu Ots representing the Surgical Informatics Group at RStudio::conf 2019. The conference brings together nearly 2000 data scientists, developers and a couple of surgeons to learn the latest best practice and best approaches when programming with R.

I have attended the Big Data workshop. “Big Data” is a bit of a vague term but it can be helpful to think of Big Data as one of two groups: data that is just so big that you can’t open it on your own computer (imagine opening one of those massive files that just crashes your computer) or data that is stored somewhere remotely and accessed through your computer by a slow connection.

The key principles for handling Big Data effectively include:

  1. Safe storage (often a data administrator sorts this)
  2. Safe access (password protected in many cases with care to avoid publishing passwords in R scripts)
  3. Getting the database itself to do the heavy work whilst leaving R to do the statistical analysis and plotting we know it does best
  4. Leave the data transformation until the latest possible minute
  5. Access the database as few times as possible

Today I’ll focus briefly on safe access, getting the database to do all of the heavy lifting and leaving the transformation to the last possible minute.

Safe Access

Using a R script to access a remote database usually requires credentials. R needs to know where the data is stored and the database needs to know whether it can allow you to access the data. There are lots of different ways to set up a connection to a database but the DBI package and the obdc package are going to come in very handy. You might also need to install a package which supports the driver for the type of database.

There are loads of options when connecting to databases and securing credentials but it’s key to avoid posting critical information like passwords in plain text, for example:

con <- dbConnect(

  odbc::odbc(),

  Driver = "PostgreSQL",

  Server = "localhost",

  UID    = "myusername",

  PWD    = "my_unsecure_password",

  Port = 5432,

  Database = "postgres"

)

Best Solution for Securing Credentials

The most secure option for connecting to a database involves using a Data Source Name (DSN) although this does require some pre-configuration and the ability to perform the following:

1. Establish integrated security between the terminal and the database, usually via Kerberos.

2. Pre-configure an ODBC connection in the Desktop or Server (requires sufficient access rights). The ODBC connection will have a unique Data Source Name, or DSN.

For example:

con <- DBI::dbConnect(odbc::odbc(), "My DSN Name")

Easier alternatives

It is still possible to connect securely to a database using either of the following techniques:

con <- dbConnect(

  odbc::odbc(),

  Driver = "PostgreSQL",

  Server = "localhost",

  UID    = rstudioapi::askForPassword("Database user"),

  PWD    = rstudioapi::askForPassword("Database password"),

  Port = 5432,

  Database = "postgres"

)

This will require a predefined username and password which the user is prompted to enter on setting up the connection.

Finally, another option which may be easier for those without data administrator is to use the config package and create a yml file. After creating the yml file it can be used to configure the connection without directly publishing credentials, in particular password. Instead, password is retrieved from the yml file:

dw <- config::get("datawarehouse-dev")

con <- dbConnect(odbc::odbc(),

   Driver = dw$driver,

   Server = dw$server,

   UID    = dw$uid,

   PWD    = dw$pwd,

   Port   = dw$port,

   Database = dw$database

)

Making the database do the work

For many physicians who work with R we are quite happy to create data frames in the local environment, modify them there and then plot the data from there. This works very well with smaller data sets which don’t require much memory and even with medium-sized data sets (as long as you don’t try to open the whole file using View()!). When working with Big Data often the local environment isn’t large enough because of limited RAM. To give an example if you are using all of the UK Biobank genetic data which amounts to over 12 terabytes then the modest 8 gigabytes (1500 times less) of RAM I have on my own laptop just won’t do.

The solution is to manipulate the data remotely. Think bomb disposal. You want to have a screen to show you what a bomb disposal robot is doing and a way of controlling the robot but you don’t want to be up close and personal with the robot doing the work. Big Data is exactly the same, you want to see the results of the data transformations and plotting on your own device but let the database do the work for you. Trying to bring the heavy lifting of the data onto your own device creates problems and may result in the computer or server crashing.

dplyr is a truly fantastic data manipulation package which is part of the tidyverse and makes everyday data manipulation for clinicians achievable, understandable and consistent. The great news is that when working with remote data, you can use dplyr! The package dbplyr (database plyer) converts the dplyr code into SQL which then runs in the remote database meaning a familiarity with dplyr is almost all that’s needed to handle Big Data.

The show_query() function demonstrates just how much work goes on under the hood of dbplyr:

flights %>%

  summarise_if(is.numeric, mean, na.rm = TRUE) %>%

  show_query

Output:

Applying predicate on the first 100 rows

<SQL>

SELECT AVG("flightid") AS "flightid", AVG("year") AS "year", AVG("month") AS "month", AVG("dayofmonth") AS "dayofmonth", AVG("dayofweek") AS "dayofweek", AVG("deptime") AS "deptime", AVG("crsdeptime") AS "crsdeptime", AVG("arrtime") AS "arrtime", AVG("crsarrtime") AS "crsarrtime", AVG("flightnum") AS "flightnum", AVG("actualelapsedtime") AS "actualelapsedtime", AVG("crselapsedtime") AS "crselapsedtime", AVG("airtime") AS "airtime", AVG("arrdelay") AS "arrdelay", AVG("depdelay") AS "depdelay", AVG("distance") AS "distance", AVG("taxiin") AS "taxiin", AVG("taxiout") AS "taxiout", AVG("cancelled") AS "cancelled", AVG("diverted") AS "diverted", AVG("carrierdelay") AS "carrierdelay", AVG("weatherdelay") AS "weatherdelay", AVG("nasdelay") AS "nasdelay", AVG("securitydelay") AS "securitydelay", AVG("lateaircraftdelay") AS "lateaircraftdelay", AVG("score") AS "score"

FROM datawarehouse.flight

Leaving the Data Transformation to the last possible minute

dbplyr prevents unnecessary work occurring within the database until the user is explicit that they want some results. When modifying a data frame using a remote connect and dbplyr the user can work with references to the remote data frame in the local environment. When performing and saving a data transformation then dplyr saves a reference to the intended transformation (rather than actually transforming the data). Only when the user is explicit that they want to see some of the data will dbplyr let the database get busy and transform the data. The user can do this by plotting the data or by using the collect() function to print out the resulting data frame. Working with data in this way by saving up a pipeline of intended transformations and only executing the transformation at the very end is a much more efficient way of working with Big Data.

One Reply to “Quick take-aways from RStudio::conf Training Day 02”

Leave a Reply

Your email address will not be published. Required fields are marked *