Tidyverse does sequencing data!

What I’ve been up to…

For the past few months I’ve been stationed between Surgical Informatics in Edinburgh and the CRUK Beatson Institute for Cancer Research in Glasgow. The aim – to identify new treatments for hepatocellular carcinoma, a disease that affects nearly a million people worldwide each year. Surgery is the only cure, but due to the complexity of surgery it tends to be quite a specialist and risky undertaking. This, combined with the late stage at which patients become symptomatic, means that a lot of liver cancer globally in incurable.

Current chemotherapy keeps patients in a ‘holding pattern’ where it slows the growth of tumours, but lacks the ability to kill enough of the tumour to make it possible to then use surgery as a cure (called ‘downstaging’). If we can find a drug or treatment that can facilitate this, that’d be very exciting.

So, I’ve been accruing loads of sequencing data from human and mouse tumours to try and identify weaknesses in these tumours that we can target. But analysing this RNA and DNA sequencing data takes a lot of time and computational power. So I’ve been investigating how to shrink this down and try and take as much of it straight to R as possible. Thus, simplifying the pipeline and hopefully speeding things up a bit. Traditionally, this process takes hours and days. But with this workflow I’ve shrunk it to about a day for 30+ samples so far. This is obviously dependent on how many samples, how big your genome/transcriptome is and sequencing coverage/depth too.

Quick alignment and mapping

Taking the FASTQ files generated by the sequencers, we first run them through the Salmon pseudoaligner. Salmon is a package that uses pseduoalignment to align and map the reads from RNASeq data, thus shrinking two steps into one. This also enables us to leverage the speed and accuracy advantages of pseudoalignment (meant to be more robust in several aspects). I tend to dump all my FASTQ files into one folder per species, experiment or cohort and then run Salmon on this.

String manipulation in Bash

One challenge I have found is how our institute appends suffixes to the files for identification. This is obviously great, but Salmon needs to know which reads come from which end in the case of paired end sequencing or which technical repeat. Otherwise we’re simply going to be aligning gobbledygook.

In this case, the Illumina sequencer kicks out files which have a ‘S’ in front of the sample number, an ‘L’ in front of the technical repeat number and an ‘R’ in front of the direction of sequence (i.e. for paired end reads).

So using some loops in bash, I devised the following (note index file for mouse transcriptome mm10_t_index):

source /home/user/miniconda3/etc/profile.d/conda.sh
conda activate salmon

//note ’18’ can be any number – this is just the number of samples I had.
FILE_LIST=$(for fn in /home/user/data/sequencing_data_here/fastq/*; do echo ${fn::-18} ; done)

#now check we can loop through this list
#for fn_short in $FILE_LIST; do echo “${fn_short}_test”; done

#now remove duplicates
FILE_LIST=$(echo $FILE_LIST | awk ‘BEGIN{ORS=” “}{for (i=1; i<=NF; i++)a[$i]++} END {for (i in a) print i }’)
FILE_LIST=$(echo “$FILE_LIST” | tr ‘ ‘ ‘\n’)

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L001_R1_001.fastq \
-2 “${fn}”_L001_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L001_quant

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L002_R1_001.fastq \
-2 “${fn}”_L002_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L002_quant

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L003_R1_001.fastq \
-2 “${fn}”_L003_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L003_quant

for fn in $FILE_LIST;
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L004_R1_001.fastq \
-2 “${fn}”_L004_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L004_quant

Now, when run, this should generate aligned and mapped data in a separate ‘quants’ folder. Alongside the QC information etc. in each folder.

The most interesting files in these quantified data folders are the tab delimited quants.sf files. These contain the counts for mapped reads at the gene level. I tend to work with count data only, as from this it’s possible to derive TPM and FPKM.

Using the tidyverse with map() to quickly prepare sequencing data

Using the tidyverse we can make this really quick and easy!

First, using readr::read_tsv() and purrr::map() we can open all the files within a specific directory and add the pathname as a column (so we don’t lost track of our samples). Then we use mutate(), to put a path and sample name in. Finally, using spread() and summarise() we sum the counts across our technical repeats!


#required functions
file_directory = ‘/mnt/data/this_is_folder_where_data_is’

file_list = dir(path = file_directory, full.names = T, pattern = “*quant.sf”, recursive = T)

#Step 1 – open all files, read and extract sample identifiers
file_list %>%
map(~read_tsv(.) %>%
mutate(path = gsub(‘/mnt/data/this_is_folder_where_data_is/quants/’, ”, .x)) %>% #Edit this to the appropriate file directory!
mutate(path = gsub(‘_quant/quant.sf’, ”, path)) %>%
mutate(path = gsub(‘__’, ‘_’, path))) -> gene_counts_list

#Step 2 – split repeats up and combine into 1 dataframe
gene_counts_list %>%
bind_rows(gene_counts_list) %>%
extract(path, into = c(“sample_id”, “n_repeat”), “(.*)_([^_]+)$”) %>%
select(sample_id, n_repeat,everything()) -> genes_all_samples_df

#Step 3 – sum reads together!
genes_all_samples_df %>%
group_by(sample_id, Name) %>%
summarise(NumReadsTotal = sum(NumReads)) %>%
spread(key = sample_id, value = NumReadsTotal) %>%
column_to_rownames(var = ‘Name’) -> gene_counts_by_sample

rm(file_directory, file_list, gene_counts_list)
#Save data

Happy sequencing! (and thanks Riinu Ots!)

Making sense of machine learning – how do we measure performance?

An exciting direction for the Surgical Informatics group is the application of machine learning models to clinical problems.

As we hear on a nearly daily basis, machine learning has loads to offer patients and clinicians, but how can we make these models understandable and importantly, how do we measure that these models are looking at what we’re interested in?

Currently, how well a diagnostic test performs is described by four main parameters (most students and clinicians will groan when they hear these words):

  • Sensitivity (how many people who have the condition are identified correctly)
  • Specificity (how many people who don’t have the condition are identified correctly)
  • Positive Predictive Value (how many times a test positive is a true positive)
  • Negative Predictive Value (how many times a test negative is a true negative)

Now, interestingly the field of machine learning has evolved some separate parameters for measuring the usefulness of machine learning models:

  • Recall (synonymous to sensitivity)
  • Precision (synonymous to positive predictive value)

There are other measures too, including F1 score and accuracy. The issue around these metrics is that although they are handy mathematically to describe models, they lack relevance to what is clinically important. For example, if a patient wants to know how many times a test might give a false result, the F1 score (a weighted average of precision and recall) is going to be pretty useless.

Now, if we want to make a machine learning risk prediction model, we need a clinically relevant metric to allow model training to be measured and optimised. In python, there’s lots of functions for this, however, R is far more common in healthcare data analysis. At Surgical Informatics, we use Keras to interact with TensorFlow in R. Keras for R is far newer than python, so there are fewer metric functions available.

Clinically, a model to predict a specific event happening is more useful than ruling it out, particularly if the event is serious (i.e. death). A recall metric would be perfect for this, however, there is no custom function available for recall in R.

So lets make one!

Fortunately Keras provides us with functions to perform calculations on tensors such as k_sum, k_round and k_clip. This lets us manipulate Tensors using Keras and come up with custom metrics. You can find other backend keras functions here:


So if recall is equal to the number of true positives, divided by the number of true positives plus false negatives we need to write a function to define these.

Now should we just add pp and tp? Unforunately Keras doesn’t like this. So we use k_epsilon() to replace tp in the recall expression, to give:

And that should calculate the recall (or sensitivity) for the model!