Tidyverse does sequencing data!

What I’ve been up to…

For the past few months I’ve been stationed between Surgical Informatics in Edinburgh and the CRUK Beatson Institute for Cancer Research in Glasgow. The aim – to identify new treatments for hepatocellular carcinoma, a disease that affects nearly a million people worldwide each year. Surgery is the only cure, but due to the complexity of surgery it tends to be quite a specialist and risky undertaking. This, combined with the late stage at which patients become symptomatic, means that a lot of liver cancer globally in incurable.

Current chemotherapy keeps patients in a ‘holding pattern’ where it slows the growth of tumours, but lacks the ability to kill enough of the tumour to make it possible to then use surgery as a cure (called ‘downstaging’). If we can find a drug or treatment that can facilitate this, that’d be very exciting.

So, I’ve been accruing loads of sequencing data from human and mouse tumours to try and identify weaknesses in these tumours that we can target. But analysing this RNA and DNA sequencing data takes a lot of time and computational power. So I’ve been investigating how to shrink this down and try and take as much of it straight to R as possible. Thus, simplifying the pipeline and hopefully speeding things up a bit. Traditionally, this process takes hours and days. But with this workflow I’ve shrunk it to about a day for 30+ samples so far. This is obviously dependent on how many samples, how big your genome/transcriptome is and sequencing coverage/depth too.

Quick alignment and mapping

Taking the FASTQ files generated by the sequencers, we first run them through the Salmon pseudoaligner. Salmon is a package that uses pseduoalignment to align and map the reads from RNASeq data, thus shrinking two steps into one. This also enables us to leverage the speed and accuracy advantages of pseudoalignment (meant to be more robust in several aspects). I tend to dump all my FASTQ files into one folder per species, experiment or cohort and then run Salmon on this.

String manipulation in Bash

One challenge I have found is how our institute appends suffixes to the files for identification. This is obviously great, but Salmon needs to know which reads come from which end in the case of paired end sequencing or which technical repeat. Otherwise we’re simply going to be aligning gobbledygook.

In this case, the Illumina sequencer kicks out files which have a ‘S’ in front of the sample number, an ‘L’ in front of the technical repeat number and an ‘R’ in front of the direction of sequence (i.e. for paired end reads).

So using some loops in bash, I devised the following (note index file for mouse transcriptome mm10_t_index):

#!/bin/bash
source /home/user/miniconda3/etc/profile.d/conda.sh
conda activate salmon

//note ’18’ can be any number – this is just the number of samples I had.
FILE_LIST=$(for fn in /home/user/data/sequencing_data_here/fastq/*; do echo ${fn::-18} ; done)

#now check we can loop through this list
#for fn_short in $FILE_LIST; do echo “${fn_short}_test”; done

#now remove duplicates
FILE_LIST=$(echo $FILE_LIST | awk ‘BEGIN{ORS=” “}{for (i=1; i<=NF; i++)a[$i]++} END {for (i in a) print i }’)
FILE_LIST=$(echo “$FILE_LIST” | tr ‘ ‘ ‘\n’)

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L001_R1_001.fastq \
-2 “${fn}”_L001_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L001_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L002_R1_001.fastq \
-2 “${fn}”_L002_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L002_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L003_R1_001.fastq \
-2 “${fn}”_L003_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L003_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L004_R1_001.fastq \
-2 “${fn}”_L004_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L004_quant
done

Now, when run, this should generate aligned and mapped data in a separate ‘quants’ folder. Alongside the QC information etc. in each folder.

The most interesting files in these quantified data folders are the tab delimited quants.sf files. These contain the counts for mapped reads at the gene level. I tend to work with count data only, as from this it’s possible to derive TPM and FPKM.

Using the tidyverse with map() to quickly prepare sequencing data

Using the tidyverse we can make this really quick and easy!

First, using readr::read_tsv() and purrr::map() we can open all the files within a specific directory and add the pathname as a column (so we don’t lost track of our samples). Then we use mutate(), to put a path and sample name in. Finally, using spread() and summarise() we sum the counts across our technical repeats!

library(tidyverse)


#required functions
file_directory = ‘/mnt/data/this_is_folder_where_data_is’


file_list = dir(path = file_directory, full.names = T, pattern = “*quant.sf”, recursive = T)


#Step 1 – open all files, read and extract sample identifiers
file_list %>%
map(~read_tsv(.) %>%
mutate(path = gsub(‘/mnt/data/this_is_folder_where_data_is/quants/’, ”, .x)) %>% #Edit this to the appropriate file directory!
mutate(path = gsub(‘_quant/quant.sf’, ”, path)) %>%
mutate(path = gsub(‘__’, ‘_’, path))) -> gene_counts_list


#Step 2 – split repeats up and combine into 1 dataframe
gene_counts_list %>%
bind_rows(gene_counts_list) %>%
extract(path, into = c(“sample_id”, “n_repeat”), “(.*)_([^_]+)$”) %>%
select(sample_id, n_repeat,everything()) -> genes_all_samples_df


#Step 3 – sum reads together!
genes_all_samples_df %>%
group_by(sample_id, Name) %>%
summarise(NumReadsTotal = sum(NumReads)) %>%
spread(key = sample_id, value = NumReadsTotal) %>%
column_to_rownames(var = ‘Name’) -> gene_counts_by_sample


rm(file_directory, file_list, gene_counts_list)
#Save data
save.image(‘/mnt/data/results_folder’)

Happy sequencing! (and thanks Riinu Ots!)

Leave a Reply

Your email address will not be published. Required fields are marked *