Quick take-aways from RStudio::conf Training Day 02

For the past few days I’ve been in Austin, Texas with Stephen Knight and Riinu Ots representing the Surgical Informatics Group at RStudio::conf 2019. The conference brings together nearly 2000 data scientists, developers and a couple of surgeons to learn the latest best practice and best approaches when programming with R.

I have attended the Big Data workshop. “Big Data” is a bit of a vague term but it can be helpful to think of Big Data as one of two groups: data that is just so big that you can’t open it on your own computer (imagine opening one of those massive files that just crashes your computer) or data that is stored somewhere remotely and accessed through your computer by a slow connection.

The key principles for handling Big Data effectively include:

  1. Safe storage (often a data administrator sorts this)
  2. Safe access (password protected in many cases with care to avoid publishing passwords in R scripts)
  3. Getting the database itself to do the heavy work whilst leaving R to do the statistical analysis and plotting we know it does best
  4. Leave the data transformation until the latest possible minute
  5. Access the database as few times as possible

Today I’ll focus briefly on safe access, getting the database to do all of the heavy lifting and leaving the transformation to the last possible minute.

Safe Access

Using a R script to access a remote database usually requires credentials. R needs to know where the data is stored and the database needs to know whether it can allow you to access the data. There are lots of different ways to set up a connection to a database but the DBI package and the obdc package are going to come in very handy. You might also need to install a package which supports the driver for the type of database.

There are loads of options when connecting to databases and securing credentials but it’s key to avoid posting critical information like passwords in plain text, for example:

con <- dbConnect(

  odbc::odbc(),

  Driver = "PostgreSQL",

  Server = "localhost",

  UID    = "myusername",

  PWD    = "my_unsecure_password",

  Port = 5432,

  Database = "postgres"

)

Best Solution for Securing Credentials

The most secure option for connecting to a database involves using a Data Source Name (DSN) although this does require some pre-configuration and the ability to perform the following:

1. Establish integrated security between the terminal and the database, usually via Kerberos.

2. Pre-configure an ODBC connection in the Desktop or Server (requires sufficient access rights). The ODBC connection will have a unique Data Source Name, or DSN.

For example:

con <- DBI::dbConnect(odbc::odbc(), "My DSN Name")

Easier alternatives

It is still possible to connect securely to a database using either of the following techniques:

con <- dbConnect(

  odbc::odbc(),

  Driver = "PostgreSQL",

  Server = "localhost",

  UID    = rstudioapi::askForPassword("Database user"),

  PWD    = rstudioapi::askForPassword("Database password"),

  Port = 5432,

  Database = "postgres"

)

This will require a predefined username and password which the user is prompted to enter on setting up the connection.

Finally, another option which may be easier for those without data administrator is to use the config package and create a yml file. After creating the yml file it can be used to configure the connection without directly publishing credentials, in particular password. Instead, password is retrieved from the yml file:

dw <- config::get("datawarehouse-dev")

con <- dbConnect(odbc::odbc(),

   Driver = dw$driver,

   Server = dw$server,

   UID    = dw$uid,

   PWD    = dw$pwd,

   Port   = dw$port,

   Database = dw$database

)

Making the database do the work

For many physicians who work with R we are quite happy to create data frames in the local environment, modify them there and then plot the data from there. This works very well with smaller data sets which don’t require much memory and even with medium-sized data sets (as long as you don’t try to open the whole file using View()!). When working with Big Data often the local environment isn’t large enough because of limited RAM. To give an example if you are using all of the UK Biobank genetic data which amounts to over 12 terabytes then the modest 8 gigabytes (1500 times less) of RAM I have on my own laptop just won’t do.

The solution is to manipulate the data remotely. Think bomb disposal. You want to have a screen to show you what a bomb disposal robot is doing and a way of controlling the robot but you don’t want to be up close and personal with the robot doing the work. Big Data is exactly the same, you want to see the results of the data transformations and plotting on your own device but let the database do the work for you. Trying to bring the heavy lifting of the data onto your own device creates problems and may result in the computer or server crashing.

dplyr is a truly fantastic data manipulation package which is part of the tidyverse and makes everyday data manipulation for clinicians achievable, understandable and consistent. The great news is that when working with remote data, you can use dplyr! The package dbplyr (database plyer) converts the dplyr code into SQL which then runs in the remote database meaning a familiarity with dplyr is almost all that’s needed to handle Big Data.

The show_query() function demonstrates just how much work goes on under the hood of dbplyr:

flights %>%

  summarise_if(is.numeric, mean, na.rm = TRUE) %>%

  show_query

Output:

Applying predicate on the first 100 rows

<SQL>

SELECT AVG("flightid") AS "flightid", AVG("year") AS "year", AVG("month") AS "month", AVG("dayofmonth") AS "dayofmonth", AVG("dayofweek") AS "dayofweek", AVG("deptime") AS "deptime", AVG("crsdeptime") AS "crsdeptime", AVG("arrtime") AS "arrtime", AVG("crsarrtime") AS "crsarrtime", AVG("flightnum") AS "flightnum", AVG("actualelapsedtime") AS "actualelapsedtime", AVG("crselapsedtime") AS "crselapsedtime", AVG("airtime") AS "airtime", AVG("arrdelay") AS "arrdelay", AVG("depdelay") AS "depdelay", AVG("distance") AS "distance", AVG("taxiin") AS "taxiin", AVG("taxiout") AS "taxiout", AVG("cancelled") AS "cancelled", AVG("diverted") AS "diverted", AVG("carrierdelay") AS "carrierdelay", AVG("weatherdelay") AS "weatherdelay", AVG("nasdelay") AS "nasdelay", AVG("securitydelay") AS "securitydelay", AVG("lateaircraftdelay") AS "lateaircraftdelay", AVG("score") AS "score"

FROM datawarehouse.flight

Leaving the Data Transformation to the last possible minute

dbplyr prevents unnecessary work occurring within the database until the user is explicit that they want some results. When modifying a data frame using a remote connect and dbplyr the user can work with references to the remote data frame in the local environment. When performing and saving a data transformation then dplyr saves a reference to the intended transformation (rather than actually transforming the data). Only when the user is explicit that they want to see some of the data will dbplyr let the database get busy and transform the data. The user can do this by plotting the data or by using the collect() function to print out the resulting data frame. Working with data in this way by saving up a pipeline of intended transformations and only executing the transformation at the very end is a much more efficient way of working with Big Data.

Teaching our REDCap basic anatomy, cancer classifications, and some common sense

This post was originally published here

With over 11,000 patients now entered into REDCap and more being entered every day we thought it would be a good time to reflect on some of the ways in which GlobalSurg 3 has been set up to help collaborators from around the world enter accurate and high-quality data.

With so many important things to know about each patient undergoing cancer surgery, the GlobalSurg team of nearly 3000 collaborators has been busy entering data into our secure database at redcap.globalsurg.org (REDCap is an amazing database software developed by Vanderbilt University). With over 750,000 values entered, it isn’t surprising that from time to time a mistake occurs whilst entering the data. This might be because the data entered onto a paper form was incorrect to begin with or it might be due to accidentally clicking on the wrong options when entering the data to REDCap. In some cases, the incorrect data might even appear in the notes if the surgeon or the anaesthetist has forgotten how to decide on the most appropriate ASA grade.

To try to help our collaborators identify cases when these mistakes may have happened we have taught our REDCap some basic anatomy, cancer classifications and some common sense so that it can alert collaborators to mistake mistakes as soon as they occur. The automatic alerts appear when a collaborator tries to save a page with incorrect data meaning that they can change it immediately when they still have access to the patient notes.

Our REDCap now knows 58 things about cancer surgery that it is using to help collaborators enter accurate data.

globalsurg_redcap_dq (1).png

Figure 1. Examples of our (a) paper data collection form, (b) REDCap interface, and (c) a data quality pop-up warning.

All 58 rules are given at the bottom of this post, but here are some examples:

  • Our REDCap knows some basic anaesthesiology:
    • As a collaborator if you have tried to enter a patient with diabetes mellitus and stated that they have an ASA grade of 1 Our REDCap should have informed you that diabetic patients should really have an ASA grade of 2 or more
  • Our REDCap knows the basics of anatomy:
    • It knows that if a patient had a total colectomy that they don’t have a colostomy
  • Our REDCap knows about TNM staging:
    • It knows that patients with an M score of M1 should also have an Essential TNM score of M+
  • Our REDCap also knows some common sense:
    • It knows that patients can’t have more involved lymph nodes in a specimen that the total number of lymph nodes in a specimen
    • It knows that a patient couldn’t have their operation before being admitted to the hospital

Our REDCap has been working tirelessly for several months to generate these alerts and help collaborators ensure their data is accurate. We hope that training REDCap to detect problems with the data will make the GlobalSurg 3 analysis more efficient and contribute to the accuracy of the data.

The final data entry deadline is 17th December so remember to upload all of your data before then. Our REDCap is ready and waiting to store and check the data.

View and download all of our Data Quality rules at github.com/SurgicalInformatics