Tidyverse does sequencing data!

What I’ve been up to…

For the past few months I’ve been stationed between Surgical Informatics in Edinburgh and the CRUK Beatson Institute for Cancer Research in Glasgow. The aim – to identify new treatments for hepatocellular carcinoma, a disease that affects nearly a million people worldwide each year. Surgery is the only cure, but due to the complexity of surgery it tends to be quite a specialist and risky undertaking. This, combined with the late stage at which patients become symptomatic, means that a lot of liver cancer globally in incurable.

Current chemotherapy keeps patients in a ‘holding pattern’ where it slows the growth of tumours, but lacks the ability to kill enough of the tumour to make it possible to then use surgery as a cure (called ‘downstaging’). If we can find a drug or treatment that can facilitate this, that’d be very exciting.

So, I’ve been accruing loads of sequencing data from human and mouse tumours to try and identify weaknesses in these tumours that we can target. But analysing this RNA and DNA sequencing data takes a lot of time and computational power. So I’ve been investigating how to shrink this down and try and take as much of it straight to R as possible. Thus, simplifying the pipeline and hopefully speeding things up a bit. Traditionally, this process takes hours and days. But with this workflow I’ve shrunk it to about a day for 30+ samples so far. This is obviously dependent on how many samples, how big your genome/transcriptome is and sequencing coverage/depth too.

Quick alignment and mapping

Taking the FASTQ files generated by the sequencers, we first run them through the Salmon pseudoaligner. Salmon is a package that uses pseduoalignment to align and map the reads from RNASeq data, thus shrinking two steps into one. This also enables us to leverage the speed and accuracy advantages of pseudoalignment (meant to be more robust in several aspects). I tend to dump all my FASTQ files into one folder per species, experiment or cohort and then run Salmon on this.

String manipulation in Bash

One challenge I have found is how our institute appends suffixes to the files for identification. This is obviously great, but Salmon needs to know which reads come from which end in the case of paired end sequencing or which technical repeat. Otherwise we’re simply going to be aligning gobbledygook.

In this case, the Illumina sequencer kicks out files which have a ‘S’ in front of the sample number, an ‘L’ in front of the technical repeat number and an ‘R’ in front of the direction of sequence (i.e. for paired end reads).

So using some loops in bash, I devised the following (note index file for mouse transcriptome mm10_t_index):

source /home/user/miniconda3/etc/profile.d/conda.sh
conda activate salmon

//note ’18’ can be any number – this is just the number of samples I had.
FILE_LIST=$(for fn in /home/user/data/sequencing_data_here/fastq/*; do echo ${fn::-18} ; done)

#now check we can loop through this list
#for fn_short in $FILE_LIST; do echo “${fn_short}_test”; done

#now remove duplicates
FILE_LIST=$(echo $FILE_LIST | awk ‘BEGIN{ORS=” “}{for (i=1; i<=NF; i++)a[$i]++} END {for (i in a) print i }’)
FILE_LIST=$(echo “$FILE_LIST” | tr ‘ ‘ ‘\n’)

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L001_R1_001.fastq \
-2 “${fn}”_L001_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L001_quant

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L002_R1_001.fastq \
-2 “${fn}”_L002_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L002_quant

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L003_R1_001.fastq \
-2 “${fn}”_L003_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L003_quant

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L004_R1_001.fastq \
-2 “${fn}”_L004_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L004_quant

Now, when run, this should generate aligned and mapped data in a separate ‘quants’ folder. Alongside the QC information etc. in each folder.

The most interesting files in these quantified data folders are the tab delimited quants.sf files. These contain the counts for mapped reads at the gene level. I tend to work with count data only, as from this it’s possible to derive TPM and FPKM.

Using the tidyverse with map() to quickly prepare sequencing data

Using the tidyverse we can make this really quick and easy!

First, using readr::read_tsv() and purrr::map() we can open all the files within a specific directory and add the pathname as a column (so we don’t lost track of our samples). Then we use mutate(), to put a path and sample name in. Finally, using spread() and summarise() we sum the counts across our technical repeats!


#required functions
file_directory = ‘/mnt/data/this_is_folder_where_data_is’

file_list = dir(path = file_directory, full.names = T, pattern = “*quant.sf”, recursive = T)

#Step 1 – open all files, read and extract sample identifiers
file_list %>%
map(~read_tsv(.) %>%
mutate(path = gsub(‘/mnt/data/this_is_folder_where_data_is/quants/’, ”, .x)) %>% #Edit this to the appropriate file directory!
mutate(path = gsub(‘_quant/quant.sf’, ”, path)) %>%
mutate(path = gsub(‘__’, ‘_’, path))) -> gene_counts_list

#Step 2 – split repeats up and combine into 1 dataframe
gene_counts_list %>%
bind_rows(gene_counts_list) %>%
extract(path, into = c(“sample_id”, “n_repeat”), “(.*)_([^_]+)$”) %>%
select(sample_id, n_repeat,everything()) -> genes_all_samples_df

#Step 3 – sum reads together!
genes_all_samples_df %>%
group_by(sample_id, Name) %>%
summarise(NumReadsTotal = sum(NumReads)) %>%
spread(key = sample_id, value = NumReadsTotal) %>%
column_to_rownames(var = ‘Name’) -> gene_counts_by_sample

rm(file_directory, file_list, gene_counts_list)
#Save data

Happy sequencing! (and thanks Riinu Ots!)

New intuitive ways for reshaping data in R: long live pivot_longer() and pivot_wider()

This post was originally published here

TLDR: there are two new and very intuitive R functions for reshaping data: see Examples of pivot_longer() and pivot_wider() below. At the time of writing, these new functions are extremely fresh and only exist in the development version on GitHub (see Installation), we should probably wait for the tidyverse team to officially release them (in CRAN) before putting them into day-to-day use.



The juxtapose of data collection vs data analysis: data that was very easy to collect, is probably very hard to analyse, and vice versa. For example, if data is collected/written down whichever format was most convenient at the time of data collection, it is probably not recorded in a regularly shaped table, with various bits of information in different parts of the document. And even if data is collected into a table, it is often intuitive (for data entry) to include information about the same variable in different columns. For example, look at this example data I just made up:


candydata_raw = read_csv("2019-04-07_candy_preference_data.csv")
candy_type likes age: 5 likes age: 10 likes age: 15 gets age: 5 gets age: 10 gets age: 15
Chocolate 4 6 8 2 4 6
Lollipop 10 8 6 8 6 4

For each candy type, there are 8 columns with values. But actually, these 8 columns capture a combination of 3 variables: age, likes and eats. This is known as the wide format, and it is a convenient way to either note down or even present values. It is human-readable. For effective data analysis, however, we need data to be in the tidy data format, where each column is a single variable, and each row a single observation (https://www.jstatsoft.org/article/view/v059i10). It needs to be less human-readable and more computer-friendly.

Some of you may remember now retired reshape2::melt() or reshape2::dcast(), and many of you (inclduing myself!) have struggled remebering the arguments for tidyr::gather() and tidyr::spread(). Based on extensive community feedback, the tidyverse team have reinveted these functions using both more intuitive names, as well as clearer syntax (arguments):


These functions were added just a month ago, so these functions are not yet included in the standard version of tidyr that comes with install.packages("tidyverse") or even update.packages() (the current version of tidyr on CRAN is 0.8.3). To play with the bleeding edge versions of R packages, run install.packages("devtools") and then devtools::install_github("tidyverse/tidyr"). If you are a Mac user and it asks you “Do you want to install from sources the package which needs compilation?”, say Yes.

You might need to Restart R (Session menu at the top) and load library(tidyverse) again. You can check whether you now have these functions installed by typing in pivot_longer and pressing F1 – if a relevant Help tab pops open you got it.


candydata_longer = candydata_raw %>% 
candy_type name value
Chocolate likes age: 5 4
Chocolate likes age: 10 6
Chocolate likes age: 15 8
Chocolate gets age: 5 2
Chocolate gets age: 10 4
Chocolate gets age: 15 6
Lollipop likes age: 5 10
Lollipop likes age: 10 8
Lollipop likes age: 15 6
Lollipop gets age: 5 8
Lollipop gets age: 10 6
Lollipop gets age: 15 4

Now, that’s already a lot better, but we still need to split the name column into the two different variables it really includes. “name” is what pivot_longer() calls this new column by default. Remember, each column is a single variable.

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age")) %>% 
  separate(name, into = c("questions", NA, "age"), convert = TRUE)
candy_type questions age value
Chocolate likes 5 4
Chocolate likes 10 6
Chocolate likes 15 8
Chocolate gets 5 2
Chocolate gets 10 4
Chocolate gets 15 6
Lollipop likes 5 10
Lollipop likes 10 8
Lollipop likes 15 6
Lollipop gets 5 8
Lollipop gets 10 6
Lollipop gets 15 4

And pivot_wider() can be used to do the reverse:

candydata = candydata_longer %>% 
  pivot_wider(names_from = questions, values_from = value)
candy_type age likes gets
Chocolate 5 4 2
Chocolate 10 6 4
Chocolate 15 8 6
Lollipop 5 10 8
Lollipop 10 8 6
Lollipop 15 6 4

It is important to spell out the arguments here (names_from =, values_frame =) since they are not the second and third arguments of pivot_wider() (like they were in spread()). Investigate the pivot_wider+F1 Help tab for more information.

Wrap-up and notes

Now these are datasets we can work with: each column is a variable, each row is an observation.

Do not start replacing working and tested instances of gather() or spread() in your existing R code with these new functions. That is neither efficient nor necessary – gather() and spread() will remain in tidyr to make sure people’s scripts don’t suddenly stop working. Meaning: tidyr is backward compatible. But after these functions are officially released, I will start using them in all new scripts I write.

I made the original messy columns still relatively nice to work with – no typos and reasonable delimiters. Usually, the labels are much worse and need the help of janitor::clean_names(), stringr::str_replace(), and multiple iterations of tidyr::separate() to arrive at a nice tidy tibble/data frame.

tidyr::separate() tips:

into = c("var1", NA, "var2") – now this is an amazing trick I only came across this week! This is a convenient way to drop useless (new) columns. Previously, I would have achieved the same result with:

... %>% 
    separate(..., into = c("var1", "drop", "var2")) %>% 
    select(-drop) %>% 

convert = TRUE: by default, separate() creates new variables that are also just “characters”. This means our age would have been a chacter vector of, e.g., “5”, “10”, rather than 5, 10, and R wouldn’t have known how to do arithmetic on it. In this example, convert = TRUE is equivalent to mutate(age = as.numeric(age)).

Good luck!

P.S. This is one of the coolest Tweets I’ve ever seen:

Making sense of machine learning – how do we measure performance?

An exciting direction for the Surgical Informatics group is the application of machine learning models to clinical problems.

As we hear on a nearly daily basis, machine learning has loads to offer patients and clinicians, but how can we make these models understandable and importantly, how do we measure that these models are looking at what we’re interested in?

Currently, how well a diagnostic test performs is described by four main parameters (most students and clinicians will groan when they hear these words):

  • Sensitivity (how many people who have the condition are identified correctly)
  • Specificity (how many people who don’t have the condition are identified correctly)
  • Positive Predictive Value (how many times a test positive is a true positive)
  • Negative Predictive Value (how many times a test negative is a true negative)

Now, interestingly the field of machine learning has evolved some separate parameters for measuring the usefulness of machine learning models:

  • Recall (synonymous to sensitivity)
  • Precision (synonymous to positive predictive value)

There are other measures too, including F1 score and accuracy. The issue around these metrics is that although they are handy mathematically to describe models, they lack relevance to what is clinically important. For example, if a patient wants to know how many times a test might give a false result, the F1 score (a weighted average of precision and recall) is going to be pretty useless.

Now, if we want to make a machine learning risk prediction model, we need a clinically relevant metric to allow model training to be measured and optimised. In python, there’s lots of functions for this, however, R is far more common in healthcare data analysis. At Surgical Informatics, we use Keras to interact with TensorFlow in R. Keras for R is far newer than python, so there are fewer metric functions available.

Clinically, a model to predict a specific event happening is more useful than ruling it out, particularly if the event is serious (i.e. death). A recall metric would be perfect for this, however, there is no custom function available for recall in R.

So lets make one!

Fortunately Keras provides us with functions to perform calculations on tensors such as k_sum, k_round and k_clip. This lets us manipulate Tensors using Keras and come up with custom metrics. You can find other backend keras functions here:


So if recall is equal to the number of true positives, divided by the number of true positives plus false negatives we need to write a function to define these.

Now should we just add pp and tp? Unforunately Keras doesn’t like this. So we use k_epsilon() to replace tp in the recall expression, to give:

And that should calculate the recall (or sensitivity) for the model!

Quick take-aways from RStudio::conf Training Day 02 (Part 2 – sparklyr)

It’s now a week since I returned from RStudio::conf 2019 in Austin, Texas and in this blog I’m going to focus using the sparklyr package (spark-lee-r) which enables R to connect to an Apache Spark instance for general purpose cluster-computing. sparklyr has its own inbuilt functions as well as allowing dbplyr to do all of the amazing features I described in my first blog post: https://surgicalinformatics.org/quick-take-aways-from-rstudioconf-training-day-02/. The code contained in this blog should work on your own local RStudio without any preconfigured cluster should you wish to experiment with sparklyr’s capabilities.

Establishing a connection

The following example code will help set up a local connection in order to experiment with some of the functionality of the dbplyr package. This is really useful if you are waiting for data or access to a database so you can have pre-prepared scripts in progress without the remote database connection.

The connection is typically stored as “sc” which you can also see in the Environment. This is the object that is referenced each time data is accessed in the spark cluster.

To check that the new connection to a spark instance has been established go to the connections tab in your RStudio interface to see if the connection has been established (this is typically located alongside your “Environment” and “History” tabs. Click on the Spark UI button to view the user interface for the current spark session in a browser (this will be helpful later if you want to view an event log for the activity of your session). Another way to check if the cluster is open is by using: spark_connection_is_open(sc). This should return “TRUE” if the connection is open.

Adding and manipulating data via the connection

Now that you have a connection established some data can be added to the spark cluster:

spark_flights becomes an object in the local environment but is really just a reference to the data in the spark cluster. Click on the Connections tab and you should see that “my_flights” is now a data frame stored in the cluster. The Spark UI page which opened in your browser will also now show some of the changes you have made. Click the Storage tab in the UI and you should see the data frame.

When manipulating the data the reference to the data frame within the local environment can be treated as if the data was stored locally. One key difference is that the creation of new data frames is delayed until the last possible minute. The following example groups flights from the nycflights13 data frame flights and calculated the average delay based on destination. Notice that the real computation happens only once the “average_delay” data frame is printed, the first command simply creates a reference in the local environment in which is saved your intended action. Also notice the “lazy” approach which occurs with sparklyr in which the total number of rows is not returned and is replaced by “… with more rows”. If the full number of rows is then desired the collect function can be used:

Caching data

Have a look at the Spark UI and check out the SQL tab. Click on one of the queries (highlight in blue) to get a breakdown of the components for each task. Notice the difference between the query in which collect() was used, it takes a lot longer to execute than the “lazy” query which sparklyr uses by default. This is really useful if you want to leave the “heavy lifting” of data transformation right until the end but if you then want to use an intermediate data frame for several further transformations (this could be sorting destinations based on average delay, only looking at destinations where the average departure time was early etc.) then it might be useful to cache the data within the cluster so that the data is transformed only once. The downside to this approach may be additional memory requirements. The following code using compute() will cache the intermediate data frame:

Now you should be able to see the “sub_flights” data frame in the Connections tab, the Storage tab of the Spark UI and the SQL code generated in the SQL tab of the UI. The cached_flights reference should also appear in the Environment tab in RStudio.

Some extra functions

As well as working through dplyr and dbplyr, sparkylr also comes with its own functions for data analysis and transformation which may be useful particularly when setting up pipelines you plan to execute later. A couple of useful examples are the ft_binnarizer and ft_bucketizer commands which I demonstrate determining destinations which are on average over 10 minutes delayed and then demonstrate grouping by distance:

These functions can be combined with others such as sdf_partition, sdf_pivot and sdf_register to prepare a data set for predictive modelling. Sparklyr has its own inbuilt functions for logistic regression (ml_logistic_regression), predictive modelling (sdf_predict)  and even some dedicated natural language processing techniques (ft_tokenizer, ft_stop_words_remover).

To finish the session close down the connection using:

The connection should now be terminated meaning the Spark UI will no longer be accessible and the connections tab has changed. Should you wish to work with any data frames or aggregated results following the disconnect then make sure to use collect() and create a new object before disconnecting.

Quick take-aways from RStudio::conf Training Day 02

For the past few days I’ve been in Austin, Texas with Stephen Knight and Riinu Ots representing the Surgical Informatics Group at RStudio::conf 2019. The conference brings together nearly 2000 data scientists, developers and a couple of surgeons to learn the latest best practice and best approaches when programming with R.

I have attended the Big Data workshop. “Big Data” is a bit of a vague term but it can be helpful to think of Big Data as one of two groups: data that is just so big that you can’t open it on your own computer (imagine opening one of those massive files that just crashes your computer) or data that is stored somewhere remotely and accessed through your computer by a slow connection.

The key principles for handling Big Data effectively include:

  1. Safe storage (often a data administrator sorts this)
  2. Safe access (password protected in many cases with care to avoid publishing passwords in R scripts)
  3. Getting the database itself to do the heavy work whilst leaving R to do the statistical analysis and plotting we know it does best
  4. Leave the data transformation until the latest possible minute
  5. Access the database as few times as possible

Today I’ll focus briefly on safe access, getting the database to do all of the heavy lifting and leaving the transformation to the last possible minute.

Safe Access

Using a R script to access a remote database usually requires credentials. R needs to know where the data is stored and the database needs to know whether it can allow you to access the data. There are lots of different ways to set up a connection to a database but the DBI package and the obdc package are going to come in very handy. You might also need to install a package which supports the driver for the type of database.

There are loads of options when connecting to databases and securing credentials but it’s key to avoid posting critical information like passwords in plain text, for example:

con <- dbConnect(


  Driver = "PostgreSQL",

  Server = "localhost",

  UID    = "myusername",

  PWD    = "my_unsecure_password",

  Port = 5432,

  Database = "postgres"


Best Solution for Securing Credentials

The most secure option for connecting to a database involves using a Data Source Name (DSN) although this does require some pre-configuration and the ability to perform the following:

1. Establish integrated security between the terminal and the database, usually via Kerberos.

2. Pre-configure an ODBC connection in the Desktop or Server (requires sufficient access rights). The ODBC connection will have a unique Data Source Name, or DSN.

For example:

con <- DBI::dbConnect(odbc::odbc(), "My DSN Name")

Easier alternatives

It is still possible to connect securely to a database using either of the following techniques:

con <- dbConnect(


  Driver = "PostgreSQL",

  Server = "localhost",

  UID    = rstudioapi::askForPassword("Database user"),

  PWD    = rstudioapi::askForPassword("Database password"),

  Port = 5432,

  Database = "postgres"


This will require a predefined username and password which the user is prompted to enter on setting up the connection.

Finally, another option which may be easier for those without data administrator is to use the config package and create a yml file. After creating the yml file it can be used to configure the connection without directly publishing credentials, in particular password. Instead, password is retrieved from the yml file:

dw <- config::get("datawarehouse-dev")

con <- dbConnect(odbc::odbc(),

   Driver = dw$driver,

   Server = dw$server,

   UID    = dw$uid,

   PWD    = dw$pwd,

   Port   = dw$port,

   Database = dw$database


Making the database do the work

For many physicians who work with R we are quite happy to create data frames in the local environment, modify them there and then plot the data from there. This works very well with smaller data sets which don’t require much memory and even with medium-sized data sets (as long as you don’t try to open the whole file using View()!). When working with Big Data often the local environment isn’t large enough because of limited RAM. To give an example if you are using all of the UK Biobank genetic data which amounts to over 12 terabytes then the modest 8 gigabytes (1500 times less) of RAM I have on my own laptop just won’t do.

The solution is to manipulate the data remotely. Think bomb disposal. You want to have a screen to show you what a bomb disposal robot is doing and a way of controlling the robot but you don’t want to be up close and personal with the robot doing the work. Big Data is exactly the same, you want to see the results of the data transformations and plotting on your own device but let the database do the work for you. Trying to bring the heavy lifting of the data onto your own device creates problems and may result in the computer or server crashing.

dplyr is a truly fantastic data manipulation package which is part of the tidyverse and makes everyday data manipulation for clinicians achievable, understandable and consistent. The great news is that when working with remote data, you can use dplyr! The package dbplyr (database plyer) converts the dplyr code into SQL which then runs in the remote database meaning a familiarity with dplyr is almost all that’s needed to handle Big Data.

The show_query() function demonstrates just how much work goes on under the hood of dbplyr:

flights %>%

  summarise_if(is.numeric, mean, na.rm = TRUE) %>%



Applying predicate on the first 100 rows


SELECT AVG("flightid") AS "flightid", AVG("year") AS "year", AVG("month") AS "month", AVG("dayofmonth") AS "dayofmonth", AVG("dayofweek") AS "dayofweek", AVG("deptime") AS "deptime", AVG("crsdeptime") AS "crsdeptime", AVG("arrtime") AS "arrtime", AVG("crsarrtime") AS "crsarrtime", AVG("flightnum") AS "flightnum", AVG("actualelapsedtime") AS "actualelapsedtime", AVG("crselapsedtime") AS "crselapsedtime", AVG("airtime") AS "airtime", AVG("arrdelay") AS "arrdelay", AVG("depdelay") AS "depdelay", AVG("distance") AS "distance", AVG("taxiin") AS "taxiin", AVG("taxiout") AS "taxiout", AVG("cancelled") AS "cancelled", AVG("diverted") AS "diverted", AVG("carrierdelay") AS "carrierdelay", AVG("weatherdelay") AS "weatherdelay", AVG("nasdelay") AS "nasdelay", AVG("securitydelay") AS "securitydelay", AVG("lateaircraftdelay") AS "lateaircraftdelay", AVG("score") AS "score"

FROM datawarehouse.flight

Leaving the Data Transformation to the last possible minute

dbplyr prevents unnecessary work occurring within the database until the user is explicit that they want some results. When modifying a data frame using a remote connect and dbplyr the user can work with references to the remote data frame in the local environment. When performing and saving a data transformation then dplyr saves a reference to the intended transformation (rather than actually transforming the data). Only when the user is explicit that they want to see some of the data will dbplyr let the database get busy and transform the data. The user can do this by plotting the data or by using the collect() function to print out the resulting data frame. Working with data in this way by saving up a pipeline of intended transformations and only executing the transformation at the very end is a much more efficient way of working with Big Data.

Quick take-aways from RStudio::conf Training Day 01

This week, Riinu, Steve, and Cameron are attending the annual RStudio Workshops (Tue-Wed), Conference (Thu-Fri), and the tidyverse developer day (Sat) in Austin, Texas.

We won’t even try to summarise everything we’re learning here, since the content is vast and the learning is very much hands on, but we will be posting a small selection of some take-aways in this blog.

We’re all attending different workshops (Machine Learning, Big Data, Markdown&Shiny).

Interesting take-away no. 1: terminology

Classification means categorical outcome variable.

Regression means continuous (numeric) outcome variable.

The is a bit confusing when using logistic regression – which by this definition is “classification”, rather than “regression”. But it is very common machine learning terminology, and makes sense considering the wide range of different methods used for classification (so not just regression).

Interesting take-away no. 2: library(parsnip)

The biggest strength of R is how many different packages (=extensions) it has. Basically, if you can think of a statistical or machine learning method, it’s probably implemented in R. This is because a lot of R users are also R developers – if you find a method that you really want to use, but that hasn’t been implemented yet, you can just go on and implement it youself. And then publish this new functionality as an R package than everyone can use.

However, this also means that different R packages sometimes do similar things using very different syntax. This is where the parsnip packages comes to resque, providing a unified interface for using some of these modelling packages.

For example:

Instead of figuring out the syntax for lm() (basic linear regression model), and then for Stan, and Spark, and keras, set_engine() from library(parsnip) provides us with a unified interface for all of these different methods for linear regression.

A fully working example can be found in the course materials (all publicly available):


Interesting take-away no 3: Communication by a new means

The Rmarkdown workshop raised two interesting points within the first few mintues of starting – how prevalent communication by html has become (i.e. the internet, use of interactive documents and apps to relay industry and research data to colleagues and the wider commmunity).

But, maybe more importantly, how little is understood by the general public and how it can be used relatively easily for impressive interactivity with few lines of code….followed by the question – how about that raise boss??

For example, how using the package plotly can add immediate interactivity following on from all the ggplot basics learnt at healthyR:


And when you come across a website called “pimp my Rmarkdown” how can you not want to play!!!!

Interesting take-away no. 4: monitor progress with Viewer pane

Regular knitting, including at the start of an Rmd document to ensure any errors are highlighted early, is key. Your RMarkdown is a toddler who loves to misbehave. Previewing your document in a new window can take time and slow you down….

Frequent knitting into the Viewer pane can give you quick updates on how your code is behaving and identify bugs early!

The default in Rstudio loads your document into a new window when the Knit button is hit. A loading of a preview into the Viewer pane can be set as follows:

Tools tab > Global Options > RMarkdown > Set “Output preview in” to Viewer pane

Rmarkdown hack of the day: New chunk shortcut

Control + Alt + I
Cmd + Alt + I

Teaching our REDCap basic anatomy, cancer classifications, and some common sense

This post was originally published here

With over 11,000 patients now entered into REDCap and more being entered every day we thought it would be a good time to reflect on some of the ways in which GlobalSurg 3 has been set up to help collaborators from around the world enter accurate and high-quality data.

With so many important things to know about each patient undergoing cancer surgery, the GlobalSurg team of nearly 3000 collaborators has been busy entering data into our secure database at redcap.globalsurg.org (REDCap is an amazing database software developed by Vanderbilt University). With over 750,000 values entered, it isn’t surprising that from time to time a mistake occurs whilst entering the data. This might be because the data entered onto a paper form was incorrect to begin with or it might be due to accidentally clicking on the wrong options when entering the data to REDCap. In some cases, the incorrect data might even appear in the notes if the surgeon or the anaesthetist has forgotten how to decide on the most appropriate ASA grade.

To try to help our collaborators identify cases when these mistakes may have happened we have taught our REDCap some basic anatomy, cancer classifications and some common sense so that it can alert collaborators to mistake mistakes as soon as they occur. The automatic alerts appear when a collaborator tries to save a page with incorrect data meaning that they can change it immediately when they still have access to the patient notes.

Our REDCap now knows 58 things about cancer surgery that it is using to help collaborators enter accurate data.

globalsurg_redcap_dq (1).png

Figure 1. Examples of our (a) paper data collection form, (b) REDCap interface, and (c) a data quality pop-up warning.

All 58 rules are given at the bottom of this post, but here are some examples:

  • Our REDCap knows some basic anaesthesiology:
    • As a collaborator if you have tried to enter a patient with diabetes mellitus and stated that they have an ASA grade of 1 Our REDCap should have informed you that diabetic patients should really have an ASA grade of 2 or more
  • Our REDCap knows the basics of anatomy:
    • It knows that if a patient had a total colectomy that they don’t have a colostomy
  • Our REDCap knows about TNM staging:
    • It knows that patients with an M score of M1 should also have an Essential TNM score of M+
  • Our REDCap also knows some common sense:
    • It knows that patients can’t have more involved lymph nodes in a specimen that the total number of lymph nodes in a specimen
    • It knows that a patient couldn’t have their operation before being admitted to the hospital

Our REDCap has been working tirelessly for several months to generate these alerts and help collaborators ensure their data is accurate. We hope that training REDCap to detect problems with the data will make the GlobalSurg 3 analysis more efficient and contribute to the accuracy of the data.

The final data entry deadline is 17th December so remember to upload all of your data before then. Our REDCap is ready and waiting to store and check the data.

View and download all of our Data Quality rules at github.com/SurgicalInformatics

Global map of country names

This post was originally published here

This post demonstrates the use of two very cool R packages – ggrepel and patchwork.

ggrepel deals with overlapping text labels (Code#1 at the bottom of this post):

patchwork is a very convenient new package for combining multiple different plots together (i.e. what we usually to use grid and gridExtra for).

More info:

To really demonstrate the power of them, let’s make a global map of country names using ggrepel:

Now this is very good already with hardly any overlapping labels and the world is pretty recognisable. And really, you can make this plot with just 2 lines of code:

So what these two lines make is already very amazing.

But I feel like Europe is a little bit misshapen and that the Caribbean and Africa are too close together. So I divided the world into regions (in this case same as continents except Russia is it’s own region – it’s just so big). Then wrote two functions that asked ggrepel to plot each region separately and use patchwork to patch each region together:

This gives continents a much better shape, but it does severaly misplace Polynesia. See if you can find where, e.g., Tonga is and where it should be.

To see what I did with patchwork there, let’s add black borders to each region (Code#2):



My data science toolbox

This post was originally published here

I’ve been doing data science for over 10 years now. Although most of this time I didn’t realise I was doing data science. I thought I was just doing normal science but focusing on simulations and data analysis, rather than field or lab work. I’ve switched fields a few times now- physics BSc, Chemistry PhD, now working in medical research. Therefore, instead of this lenghty introduction:

“I’m a physicist by background with substantial interdisciplinary expertise in simulations, data analysis, programming…”

I just go with:

I’m a data scientist.

Anyway, here’s how my toolbox and technical skills have evolved over the years:

My data science toolbox evolution.

P.S. Once a physicist, always a physicist.

Islay distilleries in 3 days

This post was originally published here

Day 0 (Sunday 18-February 2018)

Left Edinburgh at 8am for a 1pm ferry Kennacraig to Port Askaig (Islay). Edinburgh-Kennacraig should be a 3.5h drive (and it was), but we left early to allow for any delays on the road. Arrived on Islay at 3pm and our accommodation near Port Ellen (southern Islay, close to to Ardbeg, Lagavulin, Laphroiaig) was a 40 min drive from the port.

Map of Islay with all its lovely distilleries.

Map of Islay with all its lovely distilleries.

Day 1 (Monday 19-February 2018): Ardbeg, Lagavulin, Laphroiaig

We hadn’t booked anything other than the ferry and accommodation. February is very low season so we were right to think that no other advance bookings were necessary.

We had a lazy morning and drove to Laphroaig at about 11am. We asked which tours or tasting events were on that day and booked Einar onto the Layers of Laphroaig tasting at 3pm (as the driver, I was allowed to accompany him for free). We then drove to Lagavulin (just a few miles from Laphroaig) and booked us onto the tour at 1pm. We then drove to Ardbeg (another few miles) and had second breakfast at their cafe. Then drove back to Lagavulin for the tour, and then back to Laphroaig for the testing.

Ardbeg’s epic cafe.

Ardbeg’s epic cafe.

Waiting for the tour to begin at Lagavulin’s homey tasting room.

Waiting for the tour to begin at Lagavulin’s homey tasting room.

In Laphroiaig’s tasting room: The Layers of Laphroiaig introduced whiskies from different casks that make up their range of malts. These include ex-bourbon, virgin oak (I did not know Scotch could be matured in virgin casks - I thought it always had to be ex-something!), ex-sherry, ex-port. We were the only ones booked on this so it ended up being a private tasting.

In Laphroiaig’s tasting room: The Layers of Laphroiaig introduced whiskies from different casks that make up their range of malts. These include ex-bourbon, virgin oak (I did not know Scotch could be matured in virgin casks – I thought it always had to be ex-something!), ex-sherry, ex-port. We were the only ones booked on this so it ended up being a private tasting.

Day 2 (Tuesday 20-Febaruary 2018): Kilchoman, Bruichladdich

Einar drove us to Kilchoman where I had a tasting of their 3 limited edition malts in the visitor centre. Kilchoman is a “farm-distillery” and they even grow some of their own barley. We bought a bottle of their “100% Islay” which is made from barley grown at the premises. Unfortunately, we completely forgot to take any pictures there. Must go back.

Driving on Islay.

Driving on Islay.

We then went to Bruichladdich and booked me on the Warehouse Experience at 2pm. Simiarly to Laphroaig, the driver was allowed to accompany for free. We had lunch at Port Charlotte while waiting for the event.

Bruichladdich warehouse experience

Bruichladdich warehouse experience

We then went by Bowmore (it was nearly 5pm) and asked about the different tours and experiences they have on the next day. Decided to do the “Bottle Your Own in the Vaults” first thing on Wednesday morning.

Day 3 (Wednesday 21-February 2018): Bowmore, Bunnahabhain, Ardnahoe, Caol Ila

Bottling a 17-year-old sherry cask beauty at Bowmore.

Bottling a 17-year-old sherry cask beauty at Bowmore.

We then dropped by Bunnahabhain – no tours were running that but we were offered a few free tasters at the shop. On our way back from Bunnahabhain we took a picture at Ardanahoe (a new distillery that opens any day now).

Visiting Bunnahabhain and stopping at soon to be opened Ardnahoe.

Visiting Bunnahabhain and stopping at soon to be opened Ardnahoe.

The final distillery was Caol Ila where we went on the standard tour. The view in the stills room was just out of this world. They didn’t allow us to take pictures inside, so I took this from their website:

Caol Ila stills with a view of the Isle of Jura. Picture from: https://www.malts.com/en-row/distilleries/caol-ila/

Caol Ila stills with a view of the Isle of Jura. Picture from: https://www.malts.com/en-row/distilleries/caol-ila/

Me outside Caol Ila with Jura in the background

Me outside Caol Ila with Jura in the background

What we brought back with us

In addition to whisky distilleries, we also visited a nano-brewery, and it turns out that The Botanist (a gin) is made at Bruichladdich.

In addition to whisky distilleries, we also visited a nano-brewery, and it turns out that The Botanist (a gin) is made at Bruichladdich.