Cameron Fairfield

18th September 201918th September 2019

rmedicine2019 – some quick thoughts and good packages

Kenny McLean and I recently attended rmedicine 2019 in Boston MA. The conference is aimed at clinicians and non-clinicians who use R for day-to-day research and monitoring of clinical processes.

Day 1 covered two parallel workshops: R Markdown for Medicine and Wrangling Survival Data

I attended R Markdown for Medicine run by Alison Hill from RStudio. Using .rmd files has become the default for the Surgical Informatics Group and, so it seems, a great number of others who attended rmedicine. Around a third of the presentations at rmedicine covered workflows involving sharing of data via either .rmd files or through shiny, an R package for creating deploy-able dashboards for data visualisation and interactive exploration.

R Markdown for Medicine

An Overview of Useful Tips and Tricks

R markdown is an extension of R which allows you to combine narrative text and R code within one document. This means your notes, code, results and plots are all in one place. Code is contained in between three backticks with an {r} after the first set. Inline code can also be used between single backticks followed by r without the curly brackets and then the code. This means that results can be changed automatically so that for a trial when you describe the results of numbers included / excluded, this only needs changed in one place so that the rest of the text (and / or flowcharts) updates automatically. It is also possible to mix-and-match other chunks of code from other languages.

Use Params!

Parameters are set in the YAML header at the top of the .rmd document. If you set a parameter of data to a default .csv or .rda file then this can be changed for other similar files without creating a new document. A really useful example would be when you have multiple hospitals or multiple diseases each with a separate data file, a report can then be generated for each file. If you use rmarkdown::render() along with purrr::pwalk you can generate a separate output file for any number of hospitals / diseases / countries / individuals etc. in just a couple of lines of code.

Use Helper Packages

There are some greater .rmd helper packages to improve the workflow, improve the rendering of documents and generally make life easier.

bookdown allows several .rmd documents to be combined to a book but also has some general usefulness for single documents as well. Using bookdown::word_document2 or bookdown::html_document2 in the YAML header under the output field is designed to improve cross-referencing of tables and figures compared to the default versions.

wordcountaddin allows an accurate word count to be performed which will not count YAML or code etc. without knitting the document. This is much easier than knitting the document and then performing a word count!

citr allows automated insertion of markdown citations to assist with referencing. Check out my earlier blog on referencing to get an idea of how to set up .bib files. I may add another blog on this topic, watch this space!

xaringan is a useful package for creating HTML presentations with high levels of customisation. It is possible to use an additional .css file for even greater customisation and styling of your slides but xaringan offers a great deal of user-friendly options.

distill appears to be good at supporting mobile-friendly web publishing for scientific communication with flexible figure layouts, table pagination, LaTeX math support and incorporation of javascript.

There are countless other helper packages and more likely to be on their way. Many allow additional aesthetic modification of the output documents and may allow you to run R code rather than modifying a .css file.

List Numbering the Lazy Way

List numbering in .rmd works without needing to manually enter the correct numbers. Just make a list where every element begins with 1. and .rmd will transform it into an appropriately-numbered list. Great if you need to add in a new element to the middle of the list later!

Multiple lots in a Grid

I’ve previously come across patchwork as a way to plot several plots into a grid which could be 1×2, 2×2, 3 in one column and one in the other etc. There are also two other packages cowplot and egg. I haven’t explored the differences between them but if you find that one doesn’t give you the exact customisation or alignment you need then possibly try another one. cowplot looks as if it might perform better at overlaying plots on top of another and at exact axis line matching.

Use the here package to help with file paths

here is a great package for swapping between Windows and Mac file paths (no more swapping backslashes and forward slashes!). Using here::here() will default to looking for a file in the .Rproj directory rather than the .rmd directory which is the default otherwise – great if you want to have multiple .rmd documents each in their own sub-directory with a shared data file in the parent directory.

Customise Code Outputs

R markdown allows customisation of appearance of code. Some of this can be done through modifying a .css file but there are some simpler ways to make basic changes. Try adding comment = "#>" to knitr::opst_chunk$set()to customise how comments appear in your document.

Word document creations tips

R markdown is generally great for HTML and PDF formats. The options for knitting to Word are not as well developed but there are some good options. The bookdown package is useful as discussed. The redoc package has been used to facilitate conversion to and from word – not tried it personally but if it can print out to word and then handle tracked changes back into markdown then it could be very useful.

For converting more complex tables and figures to word an option is to knit to rtf (rich text format) and then open the rtf file in word. This tends to be very good at keeping the desired formatting.

Future updates – hopefully!

R markdown is a great resource although there are a handful of minor issues which are currently difficult to resolve. One of the main problems I find it with tables and cross-referencing. I really like the syntax and customisation of the gt package but at present it appears cross-referencing in a way which works across HTML, PDF and Word outputs is not supported – a great opportunity to submit a pull request if you think you can get this to work.

Other Useful rmedicine Packages and Ideas

survival Package Update

The latest version (version 3.0) of the survival package was presented by Terry Therneau and is now available on github. This package is used by over 650 additional downstream dependencies. The latest version allows for multiple observations per subject, multiple endpoints per subject and multiple types of end-point. This will be particularly useful for competing risks analyses e.g. outcomes for liver transplant patients (transplanted, still on list, removed from list as no longer eligible or died).

Keep an eye-out for Kenny McLean’s blog where he plans to cover the survival package and many other useful packages presented at rmedicine 2019.

hreport Automated Trial Reporting

hreport by Frank Harrel (currently available on github) is for automated reporting of trials and studies with generation of interactive html graphs based in plotly. Several aspects of a study can be rendered easily into plots demonstrating accrual, exclusions, descriptive statistics, adverse events and time-to-event data. Another key theme of rmedicine 2019 appears to have been the use of plotly or similar packages to enable interaction with data.

timevis – interactive timelines

timevis allows generation of highly interactive timeline plots which allow zooming, adding or removal of events, resizing, etc.

Holepunch package

For working with projects that require a number of packages that then need shared with a colleague, holepunch provides a quick method for generating a list of dependencies and a Dockerfile. The package creates a link for another user to open a free RStudio server with all of the required packages installed. This may be useful for trouble-shooting in a department and showing code examples.

Summary

rmedicine 2019 has shown that clinical researchers are moving increasingly towards literate programming, interactive visualisations and automated workflows using R and Rmarkdown.

The conference was a great mix of methods presentations and data presentations from R users. You definitely don’t need any in-depth knowledge of R to benefit from it and I’d highly recommend booking for rmedicine 2020.

7th May 20197th May 2019

Reference Management – An Efficient Setup for writing a Thesis

This blog is intended for researchers, PhD students, MD students and any other students who wish to have a robust and effective reference management setup. The blog has a particular focus on those using R markdown, Bookdown or LaTeX. Parts of the blog can also help setup Zotero for use with Microsoft Word. The blog has been designed to help achieve the following goals:

Effective citation storage
- Fast and easy citation storage (one-click from Chrome)
- Fast and easy PDF storage using cloud storage
- Immediate, automatic and standardised PDF renaming
- Immediate, automatic and standardised citation key generation
Effective citation integration with markdown etc.
- Generation of citation keys which work with LaTeX and md (no non-standard characters)
- Ability to lock citation keys so that they don’t update with Zotero updates
- Storage of immediately updated .bib files for use with Rmd, Bookdown and LaTeX
- Automated update of the .bib file in RStudio server

Downloads and Setup

For my current reference management setup I need the following software:

Zotero
- Zotero comes with 300MB of free storage which allows well over 1000 references to be stored as long as PDFs are stored separately
- From the same download page download the Chrome connector to enable the “save to zotero” function in Google Chrome
ZotFile
- Zotfile is a Zotero plugin which helps with PDF management, download the .xpi file and then open Zotero, go to “Tools → Add-Ons” and click the little cog in the top right corner and navigate to file to install (Figure 1)
Better BibTeX
- Better BibTeX is a plugin to help generate citation keys which will be essential for writing articles in LaTeX, R Markdown or Bookdown
- If the link doesn’t work go to github and scroll down to the ReadMe to find a link to download the .xpi file
- The same approach is then used to install the Better BibTeX plugin for zotero (“Tools → Add-Ons”)

Figure 1: Zotero Plugin/Add-on Installation

After downloading Zotero, ZotFile and Better BibTeX create an account on Zotero online.

In addition to the Zotero downloads this guide will focus on an efficient setup for writing with R markdown or Bookdown and assumes that you have access to the following software / accounts:

Dropbox / Google Drive / other cloud storage service which allows APIs
- It will also be necessary for these to be accessible using Windows Explorer or Mac Finder (there are many guides online for syncing Google Drive and Dropbox so that they appear in file explorers)
RStudio (this is not 100% essential but it is far harder to use Rmd without it)
- Packages which will be required for this setup include rdrop2 (if using dropbox, other packages are available to convert this setup to Google Drive etc.), encryptr, bookdown or Rmarkdown, tinytex and a LaTeX installation (the Bookdown author recommends using tinytex which can be installed by the similarly named R package: tinytex::install_tinytex())

Folder Setup

When using Zotero it is a good ideal to create a folder in which you will store PDFs retrieved from articles. Ultimately it is optional whether or not PDFs are stored but if you have access to cloud storage with a good quota then it can make writing in Rmd etc. much faster as there is no requirement to search online for the original PDF. This folder should be set up in Google Drive, Dropbox or another cloud storage service which can be accessed from your own computer through the file explorer.

A second folder may be useful to store bibliographies which will be generated for specific projects or submissions. Again this folder should be made available in cloud storage.

ZotFile Preferences

To setup Zotero so that retrieved PDFs are automatically stored and renamed in the cloud storage without consuming the Zotero storage quota go to “Tools → ZotFile Preferences” and on the first tab: General Settings and set the folder and subfolder naming strategy for PDFs. I have set the location of the files to a Custom location and in this case used the path to a Google Drive folder (~\Google Drive\Zotero PDF Library). ZotFile will also store retrieved PDFs in subfolders to help with finding PDFs at a later date. The current setup I use is to create a subfolder with the first author surname so that all papers authored by one (or more) author with the same name are stored together using the \%a in the subfolder field (Figure 2). Other alternatives are to store PDFs in subfolders using year (\%y); journal or publisher (\%w); or item type (\%T).

Figure 2: ZotFile Preferences

Next the Renaming Rules tab can be configured to provide sensible names to each of the files (this is essential if PDFs are not to be stored as random strings of characters which provide no meaning). In this tab I have set the format to: {%a_}{%y_}{%t} which provides names for the PDFs in the format of: Fairfield_2019_Gallstone_Disease_and_the_Risk_of_Cardiovascular_Disease.pdf. I find that this shows author, year and first word of title without needing to expand the file name (Figure 3).

Figure 3: ZotFile PDF Renaming Preferences

I have not changed any of the default settings in either the Tablet Settings or Advanved Settings tabs apart from removing special characters in the Advanced Settings (this stops things from breaking later).

General Zotero Settings

Zotero has several configurable settings (accessed through: “Edit → Preferences”) and I have either adopted the defaults or made changes as follows:

General:

I have ticked the following:
- Automatically attach associated PDFs
- Automatically retrieve metadata for PDFs
- Automatically rename attachments using parent metadata
- Automatically tag items with keywords and subject headings
- All options in Group section
I have left the following unticked:
- Automatically take snapshots
- Rename linked files

Sync:

Enter the account details
Tick sync automatically
Untick sync full text (if you choose to save PDFs then syncing full text will quickly consume the 300MB quota)

Search:

Left unchanged

Export:

Left unchanged

Cite:

There are several sensible defaults but if there is a new citation style you wish to be able to use in Microsoft Word for example then click “Get additional styles” as there is probably a version that you need already created. You can click the “+” button to add a style from a .csl file if you have one already. Finally, if you are desperate for a style that doesn’t already exist then you can select a citation style and click Style Editor and edit the raw .csl file.
In the Word Processors subtab (on the main Cite tab), you can install the Microsoft Word add-in to allow Zotero to work in Microsoft Word.

Advanced:

I changed nothing on the General subtab
In the Files and Folders subtab I have selected the path to base directory for attachments
I have not changed the Shortcuts subtab
I have not changed the Feeds subtab

Better BibTex:

In this section I have set my Citation Key format to [auth:lower:alphanum]_[year:alphanum]_[veryshorttitle:lower:alphanum]_[journal:lower:clean:alphanum] (Figure 4). This generates a citation key for each reference in the format of fairfield_2019_gallstones_scientificreports or harrison_2012_hospital_bmj. It always takes the first author’s surname, the year, the first word of the title and the journal abbreviation if known. The clean and alphanum arguments to this field are used to remove unwanted punctuation which can cause citation to fail in LaTeX.

Once the settings have been configured if you already had references stored in Zotero and wish to change the citation key for old references select your entire library root (above all folders), select all references, right click and use “Better BibTex → Refresh BibTeX Key” and all of the citation keys should be updated.

Creating a `.bib` file

For referencing in a new project, publication or submission it may be helpful to have a dynamic .bib file that updates with every new publication and can be accessed from any device through cloud storage.

To set up a .bib file, first find the folder that you wish to create the file from (this should be the folder which contains any citations you will use and ideally not the full library to cut down on unnecessary storage and syncing requirements). Note that the .bib file will generate a bibliography from any citations stored directly in the folder when using default settings. This prevents use of subfolders which I find particularly helpful for organising citations and I have therefore changed the setting so that folders also show any citations stored in subfolders. To make this change go to “Edit Preferences” and select the “Advanced” tab and at the bottom of the “General” subtab select “Config Editor”. This will bring up a searchable list of configurations (it may show a warning message before this) and search in the search box for “extensions.zotero.recursiveCollections”. Set “Value” to TRUE and then when you click a folder you should see all of the citations also stored in subfolders.

Right click the folder and select “Export Collection”. A pop-up window will appear at which point select “Keep Updated” and if using RStudio desktop save the file in the directory where you have your Rmd project files. If you are working with RStudio server then save the file in a cloud storage location which will then be accessed from the server. I have a .bib file stored in Dropbox which I access from RStudio server.

Linking Dropbox and RStudio Server to Access the `.bib` File

The following covers linking Dropbox to RStudio server but could be adapted to cover another cloud storage service.

Dropbox provides a token to allow communication between different apps. The rdrop2 package is what I used to create a token to allow this. I actually created the token on RStudio desktop as I couldn’t get the creation to work on the server but this is perfectly ok.

Caution: The token generated by this process could be used to access your Dropbox from anywhere using RStudio if you do not keep it secure. If somebody were to access an unencrypted token then it would be equivalent to handing out your email and password. I therefore used the encryptr package to allow safe storage of this token.

Token Creation

Open Rstudio desktop and enter the following code:

library(rdrop2)
library(encryptr)

# Create token
token <- drop_auth()

# Save token
saveRDS(token, "droptoken.rds")

# Encrypt token
genkeys()               # Default file names are id_rsa and id_rsa.pub
encrypt_file("droptoken.rds", "droptoken.rds.encryptr.bin")
encrypt_file(".httr-oauth", ".httr-oauth.encryptr.bin")

# Same details should appear later
drop_acc()

# Remove token from local environment
rm(token)


# Delete the unencrypted files
system("rm droptoken.rds")
system("rm .httr-oauth")

library(rdrop2)

library(encryptr)

# Create token

token <- drop_auth()

# Save token

saveRDS(token, "droptoken.rds")

# Encrypt token

genkeys() # Default file names are id_rsa and id_rsa.pub

encrypt_file("droptoken.rds", "droptoken.rds.encryptr.bin")

encrypt_file(".httr-oauth", ".httr-oauth.encryptr.bin")

# Same details should appear later

drop_acc()

# Remove token from local environment

rm(token)

# Delete the unencrypted files

system("rm droptoken.rds")

system("rm .httr-oauth")

The code will create two files, a token and the .httr-oauth file from which a token can also be made. The encryptr package can then encrypt the files using a public / private key pair. It is essential that the password that is set when using genkeys() is remembered otherwise the token cannot then be used. In this case the original token can’t be retrieved but could be created again from scratch.

The following files will then be needed to upload to the RStudio server:

droptoken.rds.encryptr.bin – or the name provided for the encrypted Dropbox token
id_rsa – or the name provided for the private key from the private / public key pair

Dropbox Linkage for Referencing the `.bib` File

Now that the encrypted token and necessary (password-protected) private key are available in RStudio server, the following can be saved as a separate script. The script is designed to read in and decrypt the encrypted token (this will require a password and should be done if the .bib file needs updated). Only the drop_download() needs repeated if using the token again during the same session. The token should be cleared at the end of every session for additional security.

library(rdrop2)
library(encryptr)

# ******** WARNING ********
# Losing the unencrypted token will give anyone 
# complete control of your Dropbox account
# If you are concerned this has happened,
# you can then revoke the rdrop2 app from your
# dropbox account and start over.
# ******** WARNING ********


safely_extract_dropbox_token <- function(encrypted_db_token = NULL, private_key_file = NULL){
  decrypt_file(encrypted_db_token, file_name = "temporary_dropbox_token.rds", private_key_path = private_key_file)
  
  token <<- readRDS("temporary_dropbox_token.rds")
  
  system("rm temporary_dropbox_token.rds")
}

safely_extract_dropbox_token(encrypted_db_token = "droptoken.rds.encryptr.bin", private_key_file = "id_rsa")

# Then pass the token to each drop_ function
drop_acc(dtoken = token)

# The path is the Dropbox file location
drop_download(path = "My_Dropbox_Home_Directory/Zotero Library/my.bib", 
              local_path = "my.bib", 
              dtoken = token,
              overwrite = TRUE)

library(rdrop2)

library(encryptr)

# ******** WARNING ********

# Losing the unencrypted token will give anyone

# complete control of your Dropbox account

# If you are concerned this has happened,

# you can then revoke the rdrop2 app from your

# dropbox account and start over.

# ******** WARNING ********

safely_extract_dropbox_token <- function(encrypted_db_token = NULL, private_key_file = NULL){

decrypt_file(encrypted_db_token, file_name = "temporary_dropbox_token.rds", private_key_path = private_key_file)

token <<- readRDS("temporary_dropbox_token.rds")

system("rm temporary_dropbox_token.rds")

}

safely_extract_dropbox_token(encrypted_db_token = "droptoken.rds.encryptr.bin", private_key_file = "id_rsa")

# Then pass the token to each drop_ function

drop_acc(dtoken = token)

# The path is the Dropbox file location

drop_download(path = "My_Dropbox_Home_Directory/Zotero Library/my.bib",

local_path = "my.bib",

dtoken = token,

overwrite = TRUE)

Now that the .bib file has been created and is stored as “my.bib” in the local directory, it should update whenever the token is loaded and drop_download() is run.

Final Result

On clicking “Save to Zotero” button in Chrome and running drop_download() the following should all happen almost instantaneously:

Zotero stores a new reference
A PDF is stored in the cloud storage having been named appropriately
A link to the PDF is stored in Zotero (without using up significant memory)
A citation key is established for the reference in a standardised format without conflicts
Pre-existing citation keys which have been referenced earlier in the writing of the paper are not altered
A .bib file is updated in the RStudio server directory
And much unwanted frustration of reference management is resolved

This is my current reference management system which I have so far found to be very effective. If there are ways you think it can be improved I would love to hear about them.

21st February 201922nd February 2019

Encryptr package: easily encrypt and decrypt columns of sensitive data

This post was originally published here

A number of existing R packages support data encryption. However, we haven’t found one that easily suits our needs: to encrypt one or many columns of a data frame or tibble using a private/public key pair in tidyversefunctions. The emphasis is on the easily.

Encrypting and decrypting data securely is important when it comes to healthcare and sociodemographic data. We have developed a simple and secure package encryptyr which allows non-experts to encrypt and decrypt columns of data.

There is a simple and easy-to-follow vignette available on our GitHub page which guides you through the process of using encryptr:

https://github.com/SurgicalInformatics/encryptr.

Confidential data – security challenges

Data containing columns of disclosive or confidential information such as a postcode or a patient ID (CHI in Scotland) require extreme care. Storing sensitive information as raw values leaves the data vulnerable to confidentiality breaches.

It is best to just remove confidential information from the records whenever possible. However, this can mean the data can never be re-associated with an individual. This may be a problem if, for example, auditors of a clinical trial need to re-identify an individual from the trial data.

One potential solution currently in common use is to generate a study number which is linked to the confidential data in a separate lookup table, but this still leaves the confidential data available in another file.

Encryptr package solution – storing encrypted data

The encryptr package allows users to store confidential data in a pseudoanonymised form, which is far less likely to result in re-identification.

The package allows users to create a public key and a private key to enable RSA encryption and decryption of the data. The public key allows encryption of the data. The private key is required to decrypt the data. The data cannot be decrypted with the public key. This is the basis of many modern encryption systems.

When creating keys, the user sets a password for the private key using a dialogue box. This means that the password is included in an R script. We recommend creating a secure password with a variety of alphanumeric characters and symbols.

As the password is not stored, it is important that you are able to remember it if you need to decrypt the data later.

Once the keys are created it is possible to encrypt one or more columns of data in a data frame or tibble using the public key. Every time RSA encryption is used it will generate a unique output. Even if the same information is encrypted more than once, the output will always be different. It is not possible therefore to match two encrypted values.

These outputs are also secure from decryption without the private key. This may allow sharing of data within or between research teams without sharing confidential data.

Caution: data often remains potentially disclosive (or only pseudoanomymised) even after encryption of identifiable variables and all of the required permissions for usage and sharing of data must still be in place.

Encryptr package – decrypting the data

Sometimes decrypting data is necessary. For example, participants in a clinical trial may need to be contacted to explain a change or early termination of the trial.

The encryptr package allows users to securely and reliably decrypt the data. The decrypt function will use the private key to decrypt one or more columns. The user will be required to enter the password created when the keys were generated.

As the private key is able to decrypt all of the data, we do not recommend sharing this key.

Blinding and unblinding clinical trials – another encryptr package use

Often when working with clinical trial data, the participants are randomised to one or more treatment groups. Often teams working on the trial are unaware of the group to which patients were randomised (blinded).

Using the same method of encryption, it is possible to encrypt the participant allocation group, allowing the sharing of data without compromising blinding. If other members of the trial team are permitted to see treatment allocation (unblinded), then the decryption process can be followed to reveal the group allocation.

What this is not

This is a simple set of wrappers of openssl aimed at non-experts. It does not seek to replace the many excellent encryption packages available in R, such as PKI, sodium and safer. We believe however that it makes things much easier. Comments and forks welcome.

30th January 201930th January 2019

Quick take-aways from RStudio::conf Training Day 02 (Part 2 – sparklyr)

It’s now a week since I returned from RStudio::conf 2019 in Austin, Texas and in this blog I’m going to focus using the sparklyr package (spark-lee-r) which enables R to connect to an Apache Spark instance for general purpose cluster-computing. sparklyr has its own inbuilt functions as well as allowing dbplyr to do all of the amazing features I described in my first blog post: https://surgicalinformatics.org/quick-take-aways-from-rstudioconf-training-day-02/. The code contained in this blog should work on your own local RStudio without any preconfigured cluster should you wish to experiment with sparklyr’s capabilities.

Establishing a connection

The following example code will help set up a local connection in order to experiment with some of the functionality of the dbplyr package. This is really useful if you are waiting for data or access to a database so you can have pre-prepared scripts in progress without the remote database connection.

library(tidyverse)
library(dbplyr)
library(sparklyr)
library(nycflights13)

sc <- spark_connect(master = "local")

library(tidyverse)

library(dbplyr)

library(sparklyr)

library(nycflights13)

sc <- spark_connect(master = "local")

The connection is typically stored as “sc” which you can also see in the Environment. This is the object that is referenced each time data is accessed in the spark cluster.

To check that the new connection to a spark instance has been established go to the connections tab in your RStudio interface to see if the connection has been established (this is typically located alongside your “Environment” and “History” tabs. Click on the Spark UI button to view the user interface for the current spark session in a browser (this will be helpful later if you want to view an event log for the activity of your session). Another way to check if the cluster is open is by using: spark_connection_is_open(sc). This should return “TRUE” if the connection is open.

Adding and manipulating data via the connection

Now that you have a connection established some data can be added to the spark cluster:

spark_flights <- sdf_copy_to(sc, flights, "my_flights")

1	spark_flights <- sdf_copy_to(sc, flights, "my_flights")

spark_flights becomes an object in the local environment but is really just a reference to the data in the spark cluster. Click on the Connections tab and you should see that “my_flights” is now a data frame stored in the cluster. The Spark UI page which opened in your browser will also now show some of the changes you have made. Click the Storage tab in the UI and you should see the data frame.

When manipulating the data the reference to the data frame within the local environment can be treated as if the data was stored locally. One key difference is that the creation of new data frames is delayed until the last possible minute. The following example groups flights from the nycflights13 data frame flights and calculated the average delay based on destination. Notice that the real computation happens only once the “average_delay” data frame is printed, the first command simply creates a reference in the local environment in which is saved your intended action. Also notice the “lazy” approach which occurs with sparklyr in which the total number of rows is not returned and is replaced by “… with more rows”. If the full number of rows is then desired the collect function can be used:

spark_flights %>%   
  group_by(dest) %>%   
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) -> average_delay

average_delay
# # Source: spark<?> [?? x 2]
# dest  avg_delay
# <chr>     <dbl>
#   1 IAH        4.24
# 2 PBI        8.56
# 3 BOS        2.91
# 4 CLT        7.36
# 5 SNA       -7.87
# 6 XNA        7.47
# 7 SYR        8.90
# 8 JAX       11.8 
# 9 CHS       10.6 
# 10 MEM       10.6 
# # ... with more rows

average_delay %>% 
  collect()
# # A tibble: 105 x 2
# dest  avg_delay
# <chr>     <dbl>
#   1 IAH        4.24
# 2 PBI        8.56
# 3 BOS        2.91
# 4 CLT        7.36
# 5 SNA       -7.87
# 6 XNA        7.47
# 7 SYR        8.90
# 8 JAX       11.8 
# 9 CHS       10.6 
# 10 MEM       10.6 
# # ... with 95 more rows

spark_flights %>%

group_by(dest) %>%

summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) -> average_delay

average_delay

# # Source: spark<?> [?? x 2]

# dest avg_delay

# <chr> <dbl>

# 1 IAH 4.24

# 2 PBI 8.56

# 3 BOS 2.91

# 4 CLT 7.36

# 5 SNA -7.87

# 6 XNA 7.47

# 7 SYR 8.90

# 8 JAX 11.8

# 9 CHS 10.6

# 10 MEM 10.6

# # ... with more rows

average_delay %>%

collect()

# # A tibble: 105 x 2

# dest avg_delay

# <chr> <dbl>

# 1 IAH 4.24

# 2 PBI 8.56

# 3 BOS 2.91

# 4 CLT 7.36

# 5 SNA -7.87

# 6 XNA 7.47

# 7 SYR 8.90

# 8 JAX 11.8

# 9 CHS 10.6

# 10 MEM 10.6

# # ... with 95 more rows

Caching data

Have a look at the Spark UI and check out the SQL tab. Click on one of the queries (highlight in blue) to get a breakdown of the components for each task. Notice the difference between the query in which collect() was used, it takes a lot longer to execute than the “lazy” query which sparklyr uses by default. This is really useful if you want to leave the “heavy lifting” of data transformation right until the end but if you then want to use an intermediate data frame for several further transformations (this could be sorting destinations based on average delay, only looking at destinations where the average departure time was early etc.) then it might be useful to cache the data within the cluster so that the data is transformed only once. The downside to this approach may be additional memory requirements. The following code using compute() will cache the intermediate data frame:

cached_flights <- spark_flights %>% 
group_by(dest) %>%   
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
compute("sub_flights")

#If you have already created the “average_delay” reference then you
#could also run:
#cached_flights <- compute(average_delay, “sub_flights”)

cached_flights <- spark_flights %>%

group_by(dest) %>%

summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%

compute("sub_flights")

#If you have already created the “average_delay” reference then you

#could also run:

#cached_flights <- compute(average_delay, “sub_flights”)

Now you should be able to see the “sub_flights” data frame in the Connections tab, the Storage tab of the Spark UI and the SQL code generated in the SQL tab of the UI. The cached_flights reference should also appear in the Environment tab in RStudio.

Some extra functions

As well as working through dplyr and dbplyr, sparkylr also comes with its own functions for data analysis and transformation which may be useful particularly when setting up pipelines you plan to execute later. A couple of useful examples are the ft_binnarizer and ft_bucketizer commands which I demonstrate determining destinations which are on average over 10 minutes delayed and then demonstrate grouping by distance:

cached_flights %>%   
ft_binarizer(
input_col = "avg_delay",     
output_col = "delayed",     
threshold = 15) %>%   
select(avg_delay, delayed) %>%   
head(100)

# # Source: spark<?> [?? x 2]
# avg_delay delayed
# <dbl>   <dbl>
#   1      4.24       0
# 2      8.56       0
# 3      2.91       0
# 4      7.36       0
# 5     -7.87       0
# 6      7.47       0
# 7      8.90       0
# 8     11.8        1
# 9     10.6        1
# 10     10.6        1
# # ... with more rows

cached_flights %>%

ft_binarizer(

input_col = "avg_delay",

output_col = "delayed",

threshold = 15) %>%

select(avg_delay, delayed) %>%

head(100)

# # Source: spark<?> [?? x 2]

# avg_delay delayed

# <dbl> <dbl>

# 1 4.24 0

# 2 8.56 0

# 3 2.91 0

# 4 7.36 0

# 5 -7.87 0

# 6 7.47 0

# 7 8.90 0

# 8 11.8 1

# 9 10.6 1

# 10 10.6 1

# # ... with more rows

spark_flights %>%  
  filter(!is.na(arr_delay)) %>% 
  ft_bucketizer(     
    input_col = "distance",     
    output_col = "distance_grouping",     
    splits = c(0, 500, 1000, 1500, 2000, 2500, 6000)) %>%   
  select(distance, distance_grouping) %>%   
  head(100)

# # Source: spark<?> [?? x 2]
# distance distance_grouping
# <dbl>             <dbl>
#   1     1400                 2
# 2     1416                 2
# 3     1089                 2
# 4     1576                 3
# 5      762                 1
# 6      719                 1
# 7     1065                 2
# 8      229                 0
# 9      944                 1
# 10      733                 1
# # ... with more rows

spark_flights %>%

filter(!is.na(arr_delay)) %>%

ft_bucketizer(

input_col = "distance",

output_col = "distance_grouping",

splits = c(0, 500, 1000, 1500, 2000, 2500, 6000)) %>%

select(distance, distance_grouping) %>%

head(100)

# # Source: spark<?> [?? x 2]

# distance distance_grouping

# <dbl> <dbl>

# 1 1400 2

# 2 1416 2

# 3 1089 2

# 4 1576 3

# 5 762 1

# 6 719 1

# 7 1065 2

# 8 229 0

# 9 944 1

# 10 733 1

# # ... with more rows

These functions can be combined with others such as sdf_partition, sdf_pivot and sdf_register to prepare a data set for predictive modelling. Sparklyr has its own inbuilt functions for logistic regression (ml_logistic_regression), predictive modelling (sdf_predict) and even some dedicated natural language processing techniques (ft_tokenizer, ft_stop_words_remover).

To finish the session close down the connection using:

spark_disconnect(sc)

1	spark_disconnect(sc)

The connection should now be terminated meaning the Spark UI will no longer be accessible and the connections tab has changed. Should you wish to work with any data frames or aggregated results following the disconnect then make sure to use collect() and create a new object before disconnecting.

18th January 201918th January 2019

Quick take-aways from RStudio::conf Training Day 02

For the past few days I’ve been in Austin, Texas with Stephen Knight and Riinu Ots representing the Surgical Informatics Group at RStudio::conf 2019. The conference brings together nearly 2000 data scientists, developers and a couple of surgeons to learn the latest best practice and best approaches when programming with R.

I have attended the Big Data workshop. “Big Data” is a bit of a vague term but it can be helpful to think of Big Data as one of two groups: data that is just so big that you can’t open it on your own computer (imagine opening one of those massive files that just crashes your computer) or data that is stored somewhere remotely and accessed through your computer by a slow connection.

The key principles for handling Big Data effectively include:

Safe storage (often a data administrator sorts this)
Safe access (password protected in many cases with care to avoid publishing passwords in R scripts)
Getting the database itself to do the heavy work whilst leaving R to do the statistical analysis and plotting we know it does best
Leave the data transformation until the latest possible minute
Access the database as few times as possible

Today I’ll focus briefly on safe access, getting the database to do all of the heavy lifting and leaving the transformation to the last possible minute.

Safe Access

Using a R script to access a remote database usually requires credentials. R needs to know where the data is stored and the database needs to know whether it can allow you to access the data. There are lots of different ways to set up a connection to a database but the DBI package and the obdc package are going to come in very handy. You might also need to install a package which supports the driver for the type of database.

There are loads of options when connecting to databases and securing credentials but it’s key to avoid posting critical information like passwords in plain text, for example:





con <- dbConnect(



  odbc::odbc(),



  Driver = "PostgreSQL",



  Server =
"localhost",



  UID    = "myusername",



  PWD    = "my_unsecure_password",



  Port = 5432,



  Database =
"postgres"



)

Best Solution for Securing Credentials

The most secure option for connecting to a database involves using a Data Source Name (DSN) although this does require some pre-configuration and the ability to perform the following:

1. Establish integrated security between the terminal and the database, usually via Kerberos.

2. Pre-configure an ODBC connection in the Desktop or Server (requires sufficient access rights). The ODBC connection will have a unique Data Source Name, or DSN.

For example:

con <- DBI::dbConnect(odbc::odbc(), "My DSN Name")

Easier alternatives

It is still possible to connect securely to a database using either of the following techniques:





con <- dbConnect(



  odbc::odbc(),



  Driver =
"PostgreSQL",



  Server =
"localhost",



  UID    = rstudioapi::askForPassword("Database
user"),



  PWD    = rstudioapi::askForPassword("Database
password"),



  Port = 5432,



  Database =
"postgres"



)

This will require a predefined username and password which the user is prompted to enter on setting up the connection.

Finally, another option which may be easier for those without data administrator is to use the config package and create a yml file. After creating the yml file it can be used to configure the connection without directly publishing credentials, in particular password. Instead, password is retrieved from the yml file:





dw <- config::get("datawarehouse-dev")



con <- dbConnect(odbc::odbc(),



   Driver = dw$driver,



   Server = dw$server,



   UID    = dw$uid,



   PWD    = dw$pwd,



   Port   = dw$port,



   Database =
dw$database



)

Making the database do the work

For many physicians who work with R we are quite happy to create data frames in the local environment, modify them there and then plot the data from there. This works very well with smaller data sets which don’t require much memory and even with medium-sized data sets (as long as you don’t try to open the whole file using View()!). When working with Big Data often the local environment isn’t large enough because of limited RAM. To give an example if you are using all of the UK Biobank genetic data which amounts to over 12 terabytes then the modest 8 gigabytes (1500 times less) of RAM I have on my own laptop just won’t do.

The solution is to manipulate the data remotely. Think bomb disposal. You want to have a screen to show you what a bomb disposal robot is doing and a way of controlling the robot but you don’t want to be up close and personal with the robot doing the work. Big Data is exactly the same, you want to see the results of the data transformations and plotting on your own device but let the database do the work for you. Trying to bring the heavy lifting of the data onto your own device creates problems and may result in the computer or server crashing.

dplyr is a truly fantastic data manipulation package which is part of the tidyverse and makes everyday data manipulation for clinicians achievable, understandable and consistent. The great news is that when working with remote data, you can use dplyr! The package dbplyr (database plyer) converts the dplyr code into SQL which then runs in the remote database meaning a familiarity with dplyr is almost all that’s needed to handle Big Data.

The show_query() function demonstrates just how much work goes on under the hood of dbplyr:





flights %>%



 
summarise_if(is.numeric, mean, na.rm = TRUE) %>% 



  show_query

Output:





Applying predicate on the first 100
rows



<SQL>



SELECT AVG("flightid") AS
"flightid", AVG("year") AS "year",
AVG("month") AS "month", AVG("dayofmonth") AS
"dayofmonth", AVG("dayofweek") AS "dayofweek",
AVG("deptime") AS "deptime", AVG("crsdeptime") AS
"crsdeptime", AVG("arrtime") AS "arrtime",
AVG("crsarrtime") AS "crsarrtime",
AVG("flightnum") AS "flightnum",
AVG("actualelapsedtime") AS "actualelapsedtime",
AVG("crselapsedtime") AS "crselapsedtime",
AVG("airtime") AS "airtime", AVG("arrdelay") AS
"arrdelay", AVG("depdelay") AS "depdelay",
AVG("distance") AS "distance", AVG("taxiin") AS
"taxiin", AVG("taxiout") AS "taxiout",
AVG("cancelled") AS "cancelled", AVG("diverted")
AS "diverted", AVG("carrierdelay") AS
"carrierdelay", AVG("weatherdelay") AS
"weatherdelay", AVG("nasdelay") AS "nasdelay",
AVG("securitydelay") AS "securitydelay", AVG("lateaircraftdelay")
AS "lateaircraftdelay", AVG("score") AS "score"



FROM datawarehouse.flight

Leaving the Data Transformation to the last possible minute

dbplyr prevents unnecessary work occurring within the database until the user is explicit that they want some results. When modifying a data frame using a remote connect and dbplyr the user can work with references to the remote data frame in the local environment. When performing and saving a data transformation then dplyr saves a reference to the intended transformation (rather than actually transforming the data). Only when the user is explicit that they want to see some of the data will dbplyr let the database get busy and transform the data. The user can do this by plotting the data or by using the collect() function to print out the resulting data frame. Working with data in this way by saving up a pipeline of intended transformations and only executing the transformation at the very end is a much more efficient way of working with Big Data.

16th November 201816th November 2018

Teaching our REDCap basic anatomy, cancer classifications, and some common sense

This post was originally published here

With over 11,000 patients now entered into REDCap and more being entered every day we thought it would be a good time to reflect on some of the ways in which GlobalSurg 3 has been set up to help collaborators from around the world enter accurate and high-quality data.

With so many important things to know about each patient undergoing cancer surgery, the GlobalSurg team of nearly 3000 collaborators has been busy entering data into our secure database at redcap.globalsurg.org (REDCap is an amazing database software developed by Vanderbilt University). With over 750,000 values entered, it isn’t surprising that from time to time a mistake occurs whilst entering the data. This might be because the data entered onto a paper form was incorrect to begin with or it might be due to accidentally clicking on the wrong options when entering the data to REDCap. In some cases, the incorrect data might even appear in the notes if the surgeon or the anaesthetist has forgotten how to decide on the most appropriate ASA grade.

To try to help our collaborators identify cases when these mistakes may have happened we have taught our REDCap some basic anatomy, cancer classifications and some common sense so that it can alert collaborators to mistake mistakes as soon as they occur. The automatic alerts appear when a collaborator tries to save a page with incorrect data meaning that they can change it immediately when they still have access to the patient notes.

Our REDCap now knows 58 things about cancer surgery that it is using to help collaborators enter accurate data.

globalsurg_redcap_dq (1).png

Figure 1. Examples of our (a) paper data collection form, (b) REDCap interface, and (c) a data quality pop-up warning.

All 58 rules are given at the bottom of this post, but here are some examples:

Our REDCap knows some basic anaesthesiology:
- As a collaborator if you have tried to enter a patient with diabetes mellitus and stated that they have an ASA grade of 1 Our REDCap should have informed you that diabetic patients should really have an ASA grade of 2 or more
Our REDCap knows the basics of anatomy:
- It knows that if a patient had a total colectomy that they don’t have a colostomy
Our REDCap knows about TNM staging:
- It knows that patients with an M score of M1 should also have an Essential TNM score of M+
Our REDCap also knows some common sense:
- It knows that patients can’t have more involved lymph nodes in a specimen that the total number of lymph nodes in a specimen
- It knows that a patient couldn’t have their operation before being admitted to the hospital

Our REDCap has been working tirelessly for several months to generate these alerts and help collaborators ensure their data is accurate. We hope that training REDCap to detect problems with the data will make the GlobalSurg 3 analysis more efficient and contribute to the accuracy of the data.

The final data entry deadline is 17th December so remember to upload all of your data before then. Our REDCap is ready and waiting to store and check the data.

View and download all of our Data Quality rules at github.com/SurgicalInformatics

R Markdown for Medicine

An Overview of Useful Tips and Tricks

Use Params!

Use Helper Packages

List Numbering the Lazy Way

Multiple lots in a Grid

Use the here package to help with file paths

Customise Code Outputs

Word document creations tips

Future updates – hopefully!

Other Useful rmedicine Packages and Ideas

survival Package Update

hreport Automated Trial Reporting

timevis – interactive timelines

Holepunch package

Summary

Downloads and Setup

Folder Setup

ZotFile Preferences

General Zotero Settings

Creating a .bib file

Linking Dropbox and RStudio Server to Access the .bib File

Token Creation

Dropbox Linkage for Referencing the .bib File

Final Result

Confidential data – security challenges

Encryptr package solution – storing encrypted data

Encryptr package – decrypting the data

Blinding and unblinding clinical trials – another encryptr package use

What this is not

Creating a `.bib` file

Linking Dropbox and RStudio Server to Access the `.bib` File

Dropbox Linkage for Referencing the `.bib` File