HealthyR Notebooks Estonia Day 2

Everybody came back for Day 2 of HealthyR Notebooks in Estonia!

Today focussed on modelling kicking off with linear regression in detail from Riinu – if you understand this you understand the majority of statistical tests!

Factors were introduced by Cameron, which led nicely into logistic regression. By the end of this the whole room was comfortably building regression models as if they had been doing it for years!

Notebooks are a really powerful tool for teaching this sort of material allowing seamless output into PDF and Word format. Some of the delegates commented on how they had struggled with this aspect of R in the past.

After all the intense work it was a great relief to break early for some fantastic team building including stuff like this!

Some even tried there hand at archery and felt pretty smug about their performance 😂

Finally, we are so grateful Julius Juurmaa for all of the organisation. He even started a company to administer the course! Thank you Julius.

HealthyR Notebooks Estonia Day 1

The Surgical Informatics team arrived in beautiful Estonia yesterday.

We are here as part of our Wellcome Trust Open Research Fund grant – “HealthyR Notebooks: Democratising open and reproducible data analysis in resource-poor environments”.

It’s a mouthful, but important! We have adapted our popular HealthyR training course to be easily delivered on small laptop screens allowing state-of-the-art data analysis to be performed using RStudio anywhere by anyone. We’re testing this in Estonia and running it again in Ghana in November.

Also look out for HealthyR the book coming soon.

The setting is fantastic:

The delegates are already amazing at using R, but we’re teaching Tidyverse and Finalfit to bring everyone up to date with all the great new modern packages.

There are threats of pedalo racing later 😳.

Tidying up edgeR and differential expression

For anyone who has done any differential expression from RNA array/ sequencing datasets in R, edgeR or DESeq2 are the packages to go to! They contain powerful functions and models which cater for most uses.

As with most bioinformatics type packages, they end up creating scripts with endless lines of code and square brackets aplenty. Now not only does this make scripts difficult to read, it’s difficult for R to handle in some aspects. Quite often these packages have custom objects etc. which makes changing formats and sending data into different packages a pain.

Well, if you do lots of differential expression, here’s some wrappers for you! Taking inspiration from the FinalFit and tidyverse approach,
I have written a few wrappers for differential expression analyses to make your code and life happier.

Please note – throughout this all I use count data as this is what edgeR likes, rather than FPKM/RPKM or RSEM.

Step 1 – Taking the first steps to expression happiness

So – first up, preparing and filtering your data. This function turns your data and any clinical/ sample data, wraps it up into a DGEList object,
then will filter it. I like to function based on proportions of lowly expressed transcripts, as purely filtering on arbitary CPM values has its own issues, particularly if your read depth is low.

tidy_dge() is a function which does all of this! It combines two functions dge_bind() – which sets up your DGEList based on two data frames and tidy_gene_filter() which then filters these objects and calculates normalisation values.

We can then combine this into one function tidy_dge(), which shrinks about 10-20 lines into one.

Step 2 – Translating annotations

Next up – convert_genes(). Switching between gene annotation systems (i.e. ENSEMBL, HUGO, ENTREZ) is a pain. We all know that. The package biomaRt tries to solve this a bit, but still we’re left with somewhat unwieldly functions and lines of code trying to pick out where genes sit and what to map them back to. For this, I have devised convert_gene().

convert_gene() passes a request to biomaRt (the ENSEMBL API) and will convert any (common) gene name to any other gene annotation out. It takes a little while to contact the servers, but is well worth it! Alternatively, you will be able to pass a saved biomaRt output to this, so you can make one pull only which speeds it up! If we are to do this, we should create three separate marts – one containing genes, one genes and GO codes, the other the GO codes and processes. Here’s some functions to do this:

The reason for this- in testing some of the marts have errors when being pulled with genes and GO codes all at once! The ensembl biomaRt API is also somewhat unstable and frequently produces 404 errors/ lots of downtime. Plus – we can then use the mart we created for genes in convert_gene().

Back to convert_gene()! It is designed to work with a glm object straight out of edgeR or with a dataframe of counts, but will detect if it’s another type of file. Just be careful! As the default setting is mouse currently:

GO – The final step to expression fulfillment

Finally – those GO analyses, how on earth do you then trace back and find which genes in which pathways are up-regulated? I’ve never been able to find a nice function to do this.

Fear not – we have a new one! gene_GO_explorer(). This is very handy indeed. It basically will take your glm object from edgeR and map the genes to GO processes and codes.


There are some functions wrapped up in this function for dissecting pathways out too; go_grep, go_code, go_pathway and go_components. These allow you to supply a list of GO processes, GO codes or to search the GO annotations for pathways of interest. For example, If your GO enrichment shows pathway activity for a specific GO pathway, you can input the GO code into go_code to get the genes involved in that pathway. For searching GO pathway annotations, we can use go_grep. Here, if we set go_grep to ‘stress’, gene_GO_explorer() will search and return any pathways with the term ‘stress’ in their description.

If genes are part of multiple pathways that are included in our search, the GO terms will be collapsed down together – separated by a semicolon.
This data can be displayed in a ‘long’ format if required by specifying long = TRUE.

Note – unlike goana, gene_GO_explorer() does not mind what type of gene annotation is supplied! I use convert_gene() to prepare my glms for goana then pop it through gene_GO_explorer().

If you’re not using a glm from edgeR – switch off the adjusted p-value bit of gene_GO_explorer() using adjust_p = FALSE, or set the column with your p-values to ‘PValues’.

I’ll be developing more of this sort of thing, in addition to some nicer wrappers for preparing your sequencing data straight from FASTQ formats, making user readable edgeR models and then querying protein interaction databases too! Comments/ thoughts appreciated if people find this kind of stuff useful. Considering making a package. FinalOmics anyone?

Reference Management – An Efficient Setup for writing a Thesis

This blog is intended for researchers, PhD students, MD students and any other students who wish to have a robust and effective reference management setup. The blog has a particular focus on those using R markdown, Bookdown or LaTeX. Parts of the blog can also help setup Zotero for use with Microsoft Word. The blog has been designed to help achieve the following goals:

  • Effective citation storage
    • Fast and easy citation storage (one-click from Chrome)
    • Fast and easy PDF storage using cloud storage
    • Immediate, automatic and standardised PDF renaming
    • Immediate, automatic and standardised citation key generation
  • Effective citation integration with markdown etc.
    • Generation of citation keys which work with LaTeX and md (no non-standard characters)
    • Ability to lock citation keys so that they don’t update with Zotero updates
    • Storage of immediately updated .bib files for use with Rmd, Bookdown and LaTeX
    • Automated update of the .bib file in RStudio server

Downloads and Setup

For my current reference management setup I need the following software:

  • Zotero
    • Zotero comes with 300MB of free storage which allows well over 1000 references to be stored as long as PDFs are stored separately
    • From the same download page download the Chrome connector to enable the “save to zotero” function in Google Chrome
  • ZotFile
    • Zotfile is a Zotero plugin which helps with PDF management, download the .xpi file and then open Zotero, go to “Tools → Add-Ons” and click the little cog in the top right corner and navigate to file to install (Figure 1)
  • Better BibTeX
    • Better BibTeX is a plugin to help generate citation keys which will be essential for writing articles in LaTeX, R Markdown or Bookdown
    • If the link doesn’t work go to github and scroll down to the ReadMe to find a link to download the .xpi file
    • The same approach is then used to install the Better BibTeX plugin for zotero (“Tools → Add-Ons”)

After downloading Zotero, ZotFile and Better BibTeX create an account on Zotero online.

In addition to the Zotero downloads this guide will focus on an efficient setup for writing with R markdown or Bookdown and assumes that you have access to the following software / accounts:

  • Dropbox / Google Drive / other cloud storage service which allows APIs
    • It will also be necessary for these to be accessible using Windows Explorer or Mac Finder (there are many guides online for syncing Google Drive and Dropbox so that they appear in file explorers)
  • RStudio (this is not 100% essential but it is far harder to use Rmd without it)
    • Packages which will be required for this setup include rdrop2 (if using dropbox, other packages are available to convert this setup to Google Drive etc.), encryptr, bookdown or Rmarkdown, tinytex and a LaTeX installation (the Bookdown author recommends using tinytex which can be installed by the similarly named R package: tinytex::install_tinytex())

Folder Setup

When using Zotero it is a good ideal to create a folder in which you will store PDFs retrieved from articles. Ultimately it is optional whether or not PDFs are stored but if you have access to cloud storage with a good quota then it can make writing in Rmd etc. much faster as there is no requirement to search online for the original PDF. This folder should be set up in Google Drive, Dropbox or another cloud storage service which can be accessed from your own computer through the file explorer.

A second folder may be useful to store bibliographies which will be generated for specific projects or submissions. Again this folder should be made available in cloud storage.

ZotFile Preferences

To setup Zotero so that retrieved PDFs are automatically stored and renamed in the cloud storage without consuming the Zotero storage quota go to “Tools → ZotFile Preferences” and on the first tab: General Settings and set the folder and subfolder naming strategy for PDFs. I have set the location of the files to a Custom location and in this case used the path to a Google Drive folder (~\Google Drive\Zotero PDF Library). ZotFile will also store retrieved PDFs in subfolders to help with finding PDFs at a later date. The current setup I use is to create a subfolder with the first author surname so that all papers authored by one (or more) author with the same name are stored together using the \%a in the subfolder field (Figure 2). Other alternatives are to store PDFs in subfolders using year (\%y); journal or publisher (\%w); or item type (\%T).

Next the Renaming Rules tab can be configured to provide sensible names to each of the files (this is essential if PDFs are not to be stored as random strings of characters which provide no meaning). In this tab I have set the format to: {%a_}{%y_}{%t} which provides names for the PDFs in the format of: Fairfield_2019_Gallstone_Disease_and_the_Risk_of_Cardiovascular_Disease.pdf. I find that this shows author, year and first word of title without needing to expand the file name (Figure 3).

I have not changed any of the default settings in either the Tablet Settings or Advanved Settings tabs apart from removing special characters in the Advanced Settings (this stops things from breaking later).

General Zotero Settings

Zotero has several configurable settings (accessed through: “Edit → Preferences”) and I have either adopted the defaults or made changes as follows:

General:

  • I have ticked the following:
    • Automatically attach associated PDFs
    • Automatically retrieve metadata for PDFs
    • Automatically rename attachments using parent metadata
    • Automatically tag items with keywords and subject headings
    • All options in Group section
  • I have left the following unticked:
    • Automatically take snapshots
    • Rename linked files

Sync:

  • Enter the account details
  • Tick sync automatically
  • Untick sync full text (if you choose to save PDFs then syncing full text will quickly consume the 300MB quota)

Search:

  • Left unchanged

Export:

  • Left unchanged

Cite:

  • There are several sensible defaults but if there is a new citation style you wish to be able to use in Microsoft Word for example then click “Get additional styles” as there is probably a version that you need already created. You can click the “+” button to add a style from a .csl file if you have one already. Finally, if you are desperate for a style that doesn’t already exist then you can select a citation style and click Style Editor and edit the raw .csl file.
  • In the Word Processors subtab (on the main Cite tab), you can install the Microsoft Word add-in to allow Zotero to work in Microsoft Word.

Advanced:

  • I changed nothing on the General subtab
  • In the Files and Folders subtab I have selected the path to base directory for attachments
  • I have not changed the Shortcuts subtab
  • I have not changed the Feeds subtab

Better BibTex:

  • In this section I have set my Citation Key format to [auth:lower:alphanum]_[year:alphanum]_[veryshorttitle:lower:alphanum]_[journal:lower:clean:alphanum] (Figure 4). This generates a citation key for each reference in the format of fairfield_2019_gallstones_scientificreports or harrison_2012_hospital_bmj. It always takes the first author’s surname, the year, the first word of the title and the journal abbreviation if known. The clean and alphanum arguments to this field are used to remove unwanted punctuation which can cause citation to fail in LaTeX.
Figure 4: Better BibTeX Citation Key

Once the settings have been configured if you already had references stored in Zotero and wish to change the citation key for old references select your entire library root (above all folders), select all references, right click and use “Better BibTex → Refresh BibTeX Key” and all of the citation keys should be updated.

Creating a .bib file

For referencing in a new project, publication or submission it may be helpful to have a dynamic .bib file that updates with every new publication and can be accessed from any device through cloud storage.

To set up a .bib file, first find the folder that you wish to create the file from (this should be the folder which contains any citations you will use and ideally not the full library to cut down on unnecessary storage and syncing requirements). Note that the .bib file will generate a bibliography from any citations stored directly in the folder when using default settings. This prevents use of subfolders which I find particularly helpful for organising citations and I have therefore changed the setting so that folders also show any citations stored in subfolders. To make this change go to “Edit Preferences” and select the “Advanced” tab and at the bottom of the “General” subtab select “Config Editor”. This will bring up a searchable list of configurations (it may show a warning message before this) and search in the search box for “extensions.zotero.recursiveCollections”. Set “Value” to TRUE and then when you click a folder you should see all of the citations also stored in subfolders.

Right click the folder and select “Export Collection”. A pop-up window will appear at which point select “Keep Updated” and if using RStudio desktop save the file in the directory where you have your Rmd project files. If you are working with RStudio server then save the file in a cloud storage location which will then be accessed from the server. I have a .bib file stored in Dropbox which I access from RStudio server.

Linking Dropbox and RStudio Server to Access the .bib File

The following covers linking Dropbox to RStudio server but could be adapted to cover another cloud storage service.

Dropbox provides a token to allow communication between different apps. The rdrop2 package is what I used to create a token to allow this. I actually created the token on RStudio desktop as I couldn’t get the creation to work on the server but this is perfectly ok.

Caution: The token generated by this process could be used to access your Dropbox from anywhere using RStudio if you do not keep it secure. If somebody were to access an unencrypted token then it would be equivalent to handing out your email and password. I therefore used the encryptr package to allow safe storage of this token.

Token Creation

Open Rstudio desktop and enter the following code:

The code will create two files, a token and the .httr-oauth file from which a token can also be made. The encryptr package can then encrypt the files using a public / private key pair. It is essential that the password that is set when using genkeys() is remembered otherwise the token cannot then be used. In this case the original token can’t be retrieved but could be created again from scratch.

The following files will then be needed to upload to the RStudio server:

  • droptoken.rds.encryptr.bin – or the name provided for the encrypted Dropbox token
  • id_rsa – or the name provided for the private key from the private / public key pair

Dropbox Linkage for Referencing the .bib File

Now that the encrypted token and necessary (password-protected) private key are available in RStudio server, the following can be saved as a separate script. The script is designed to read in and decrypt the encrypted token (this will require a password and should be done if the .bib file needs updated). Only the drop_download() needs repeated if using the token again during the same session. The token should be cleared at the end of every session for additional security.

Now that the .bib file has been created and is stored as “my.bib” in the local directory, it should update whenever the token is loaded and drop_download() is run.

Final Result

On clicking “Save to Zotero” button in Chrome and running drop_download() the following should all happen almost instantaneously:

  • Zotero stores a new reference
  • A PDF is stored in the cloud storage having been named appropriately
  • A link to the PDF is stored in Zotero (without using up significant memory)
  • A citation key is established for the reference in a standardised format without conflicts
  • Pre-existing citation keys which have been referenced earlier in the writing of the paper are not altered
  • A .bib file is updated in the RStudio server directory
  • And much unwanted frustration of reference management is resolved

This is my current reference management system which I have so far found to be very effective. If there are ways you think it can be improved I would love to hear about them.

Encryptr now makes it easy to encrypt and decrypt files

This post was originally published here

Data security is paramount and encryptr was written to make this easier for non-experts. Columns of data can be encrypted with a couple of lines of R code, and single cells decrypted as required.

But what was missing was an easy way to encrypt the file source of that data.

Now files can be encrypted with a couple of lines of R code.

Encryption and decryption with asymmetric keys is computationally expensive. This is how encrypt for data columns works. This makes it easy for each piece of data in a data frame to be decrypted without compromise of the whole data frame. This works on the presumption that each cell contains less than 245 bytes of data.

File encryption requires a different approach as files are larger in size. encrypt_file encrypts a file using a symmetric “session” key and the AES-256 cipher. This key is itself then encrypted using a public key generated using genkeys. In OpenSSL this combination is referred to as an envelope.

It should work with any type of single file but not folders.

Documentation is maintained at encrypt-r.org

Generate keys

genkeys()
#> Private key written with name 'id_rsa'
#> Public key written with name 'id_rsa.pub'

Encrypt file

To demonstrate, the included dataset is written as a .csv file.

write.csv(gp, "gp.csv")
encrypt_file("gp.csv")
#> Encrypted file written with name 'gp.csv.encryptr.bin'

Important: check that the file can be decrypted prior to removing the original file from your system.

Warning: it is strongly suggested that the original unencrypted data file is securely stored else where as a back-up in case unencryption is not possible, e.g., the private key file or password is lost

Decrypt file

The decrypt_file function will not allow the original file to be overwritten, therefore if it is still present, use the option to specify a new name for the unencrypted file.

decrypt_file("gp.csv.encryptr.bin", file_name = "gp2.csv")
#> Decrypted file written with name 'gp2.csv'

Support / bugs

The new version 0.1.3 is on its way to CRAN today or you can install from github:

github.com/SurgicalInformatics/encryptr

Tidyverse does sequencing data!

What I’ve been up to…

For the past few months I’ve been stationed between Surgical Informatics in Edinburgh and the CRUK Beatson Institute for Cancer Research in Glasgow. The aim – to identify new treatments for hepatocellular carcinoma, a disease that affects nearly a million people worldwide each year. Surgery is the only cure, but due to the complexity of surgery it tends to be quite a specialist and risky undertaking. This, combined with the late stage at which patients become symptomatic, means that a lot of liver cancer globally in incurable.

Current chemotherapy keeps patients in a ‘holding pattern’ where it slows the growth of tumours, but lacks the ability to kill enough of the tumour to make it possible to then use surgery as a cure (called ‘downstaging’). If we can find a drug or treatment that can facilitate this, that’d be very exciting.

So, I’ve been accruing loads of sequencing data from human and mouse tumours to try and identify weaknesses in these tumours that we can target. But analysing this RNA and DNA sequencing data takes a lot of time and computational power. So I’ve been investigating how to shrink this down and try and take as much of it straight to R as possible. Thus, simplifying the pipeline and hopefully speeding things up a bit. Traditionally, this process takes hours and days. But with this workflow I’ve shrunk it to about a day for 30+ samples so far. This is obviously dependent on how many samples, how big your genome/transcriptome is and sequencing coverage/depth too.

Quick alignment and mapping

Taking the FASTQ files generated by the sequencers, we first run them through the Salmon pseudoaligner. Salmon is a package that uses pseduoalignment to align and map the reads from RNASeq data, thus shrinking two steps into one. This also enables us to leverage the speed and accuracy advantages of pseudoalignment (meant to be more robust in several aspects). I tend to dump all my FASTQ files into one folder per species, experiment or cohort and then run Salmon on this.

String manipulation in Bash

One challenge I have found is how our institute appends suffixes to the files for identification. This is obviously great, but Salmon needs to know which reads come from which end in the case of paired end sequencing or which technical repeat. Otherwise we’re simply going to be aligning gobbledygook.

In this case, the Illumina sequencer kicks out files which have a ‘S’ in front of the sample number, an ‘L’ in front of the technical repeat number and an ‘R’ in front of the direction of sequence (i.e. for paired end reads).

So using some loops in bash, I devised the following (note index file for mouse transcriptome mm10_t_index):

#!/bin/bash
source /home/user/miniconda3/etc/profile.d/conda.sh
conda activate salmon

//note ’18’ can be any number – this is just the number of samples I had.
FILE_LIST=$(for fn in /home/user/data/sequencing_data_here/fastq/*; do echo ${fn::-18} ; done)

#now check we can loop through this list
#for fn_short in $FILE_LIST; do echo “${fn_short}_test”; done

#now remove duplicates
FILE_LIST=$(echo $FILE_LIST | awk ‘BEGIN{ORS=” “}{for (i=1; i<=NF; i++)a[$i]++} END {for (i in a) print i }’)
FILE_LIST=$(echo “$FILE_LIST” | tr ‘ ‘ ‘\n’)

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L001_R1_001.fastq \
-2 “${fn}”_L001_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L001_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L002_R1_001.fastq \
-2 “${fn}”_L002_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L002_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L003_R1_001.fastq \
-2 “${fn}”_L003_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L003_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L004_R1_001.fastq \
-2 “${fn}”_L004_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L004_quant
done

Now, when run, this should generate aligned and mapped data in a separate ‘quants’ folder. Alongside the QC information etc. in each folder.

The most interesting files in these quantified data folders are the tab delimited quants.sf files. These contain the counts for mapped reads at the gene level. I tend to work with count data only, as from this it’s possible to derive TPM and FPKM.

Using the tidyverse with map() to quickly prepare sequencing data

Using the tidyverse we can make this really quick and easy!

First, using readr::read_tsv() and purrr::map() we can open all the files within a specific directory and add the pathname as a column (so we don’t lost track of our samples). Then we use mutate(), to put a path and sample name in. Finally, using spread() and summarise() we sum the counts across our technical repeats!

library(tidyverse)


#required functions
file_directory = ‘/mnt/data/this_is_folder_where_data_is’


file_list = dir(path = file_directory, full.names = T, pattern = “*quant.sf”, recursive = T)


#Step 1 – open all files, read and extract sample identifiers
file_list %>%
map(~read_tsv(.) %>%
mutate(path = gsub(‘/mnt/data/this_is_folder_where_data_is/quants/’, ”, .x)) %>% #Edit this to the appropriate file directory!
mutate(path = gsub(‘_quant/quant.sf’, ”, path)) %>%
mutate(path = gsub(‘__’, ‘_’, path))) -> gene_counts_list


#Step 2 – split repeats up and combine into 1 dataframe
gene_counts_list %>%
bind_rows(gene_counts_list) %>%
extract(path, into = c(“sample_id”, “n_repeat”), “(.*)_([^_]+)$”) %>%
select(sample_id, n_repeat,everything()) -> genes_all_samples_df


#Step 3 – sum reads together!
genes_all_samples_df %>%
group_by(sample_id, Name) %>%
summarise(NumReadsTotal = sum(NumReads)) %>%
spread(key = sample_id, value = NumReadsTotal) %>%
column_to_rownames(var = ‘Name’) -> gene_counts_by_sample


rm(file_directory, file_list, gene_counts_list)
#Save data
save.image(‘/mnt/data/results_folder’)

Happy sequencing! (and thanks Riinu Ots!)

New intuitive ways for reshaping data in R: long live pivot_longer() and pivot_wider()

This post was originally published here

TLDR: there are two new and very intuitive R functions for reshaping data: see Examples of pivot_longer() and pivot_wider() below. At the time of writing, these new functions are extremely fresh and only exist in the development version on GitHub (see Installation), we should probably wait for the tidyverse team to officially release them (in CRAN) before putting them into day-to-day use.

Exciting!

Introduction

The juxtapose of data collection vs data analysis: data that was very easy to collect, is probably very hard to analyse, and vice versa. For example, if data is collected/written down whichever format was most convenient at the time of data collection, it is probably not recorded in a regularly shaped table, with various bits of information in different parts of the document. And even if data is collected into a table, it is often intuitive (for data entry) to include information about the same variable in different columns. For example, look at this example data I just made up:

library(tidyverse)

candydata_raw = read_csv("2019-04-07_candy_preference_data.csv")
candy_type likes age: 5 likes age: 10 likes age: 15 gets age: 5 gets age: 10 gets age: 15
Chocolate 4 6 8 2 4 6
Lollipop 10 8 6 8 6 4

For each candy type, there are 8 columns with values. But actually, these 8 columns capture a combination of 3 variables: age, likes and eats. This is known as the wide format, and it is a convenient way to either note down or even present values. It is human-readable. For effective data analysis, however, we need data to be in the tidy data format, where each column is a single variable, and each row a single observation (https://www.jstatsoft.org/article/view/v059i10). It needs to be less human-readable and more computer-friendly.

Some of you may remember now retired reshape2::melt() or reshape2::dcast(), and many of you (inclduing myself!) have struggled remebering the arguments for tidyr::gather() and tidyr::spread(). Based on extensive community feedback, the tidyverse team have reinveted these functions using both more intuitive names, as well as clearer syntax (arguments):

Installation

These functions were added just a month ago, so these functions are not yet included in the standard version of tidyr that comes with install.packages("tidyverse") or even update.packages() (the current version of tidyr on CRAN is 0.8.3). To play with the bleeding edge versions of R packages, run install.packages("devtools") and then devtools::install_github("tidyverse/tidyr"). If you are a Mac user and it asks you “Do you want to install from sources the package which needs compilation?”, say Yes.

You might need to Restart R (Session menu at the top) and load library(tidyverse) again. You can check whether you now have these functions installed by typing in pivot_longer and pressing F1 – if a relevant Help tab pops open you got it.

Examples

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age"))
candy_type name value
Chocolate likes age: 5 4
Chocolate likes age: 10 6
Chocolate likes age: 15 8
Chocolate gets age: 5 2
Chocolate gets age: 10 4
Chocolate gets age: 15 6
Lollipop likes age: 5 10
Lollipop likes age: 10 8
Lollipop likes age: 15 6
Lollipop gets age: 5 8
Lollipop gets age: 10 6
Lollipop gets age: 15 4

Now, that’s already a lot better, but we still need to split the name column into the two different variables it really includes. “name” is what pivot_longer() calls this new column by default. Remember, each column is a single variable.

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age")) %>% 
  separate(name, into = c("questions", NA, "age"), convert = TRUE)
candy_type questions age value
Chocolate likes 5 4
Chocolate likes 10 6
Chocolate likes 15 8
Chocolate gets 5 2
Chocolate gets 10 4
Chocolate gets 15 6
Lollipop likes 5 10
Lollipop likes 10 8
Lollipop likes 15 6
Lollipop gets 5 8
Lollipop gets 10 6
Lollipop gets 15 4

And pivot_wider() can be used to do the reverse:

candydata = candydata_longer %>% 
  pivot_wider(names_from = questions, values_from = value)
candy_type age likes gets
Chocolate 5 4 2
Chocolate 10 6 4
Chocolate 15 8 6
Lollipop 5 10 8
Lollipop 10 8 6
Lollipop 15 6 4

It is important to spell out the arguments here (names_from =, values_frame =) since they are not the second and third arguments of pivot_wider() (like they were in spread()). Investigate the pivot_wider+F1 Help tab for more information.

Wrap-up and notes

Now these are datasets we can work with: each column is a variable, each row is an observation.

Do not start replacing working and tested instances of gather() or spread() in your existing R code with these new functions. That is neither efficient nor necessary – gather() and spread() will remain in tidyr to make sure people’s scripts don’t suddenly stop working. Meaning: tidyr is backward compatible. But after these functions are officially released, I will start using them in all new scripts I write.

I made the original messy columns still relatively nice to work with – no typos and reasonable delimiters. Usually, the labels are much worse and need the help of janitor::clean_names(), stringr::str_replace(), and multiple iterations of tidyr::separate() to arrive at a nice tidy tibble/data frame.

tidyr::separate() tips:

into = c("var1", NA, "var2") – now this is an amazing trick I only came across this week! This is a convenient way to drop useless (new) columns. Previously, I would have achieved the same result with:

... %>% 
    separate(..., into = c("var1", "drop", "var2")) %>% 
    select(-drop) %>% 
    ...
    

convert = TRUE: by default, separate() creates new variables that are also just “characters”. This means our age would have been a chacter vector of, e.g., “5”, “10”, rather than 5, 10, and R wouldn’t have known how to do arithmetic on it. In this example, convert = TRUE is equivalent to mutate(age = as.numeric(age)).

Good luck!

P.S. This is one of the coolest Tweets I’ve ever seen:

Making sense of machine learning – how do we measure performance?

An exciting direction for the Surgical Informatics group is the application of machine learning models to clinical problems.

As we hear on a nearly daily basis, machine learning has loads to offer patients and clinicians, but how can we make these models understandable and importantly, how do we measure that these models are looking at what we’re interested in?

Currently, how well a diagnostic test performs is described by four main parameters (most students and clinicians will groan when they hear these words):

  • Sensitivity (how many people who have the condition are identified correctly)
  • Specificity (how many people who don’t have the condition are identified correctly)
  • Positive Predictive Value (how many times a test positive is a true positive)
  • Negative Predictive Value (how many times a test negative is a true negative)

Now, interestingly the field of machine learning has evolved some separate parameters for measuring the usefulness of machine learning models:

  • Recall (synonymous to sensitivity)
  • Precision (synonymous to positive predictive value)

There are other measures too, including F1 score and accuracy. The issue around these metrics is that although they are handy mathematically to describe models, they lack relevance to what is clinically important. For example, if a patient wants to know how many times a test might give a false result, the F1 score (a weighted average of precision and recall) is going to be pretty useless.

Now, if we want to make a machine learning risk prediction model, we need a clinically relevant metric to allow model training to be measured and optimised. In python, there’s lots of functions for this, however, R is far more common in healthcare data analysis. At Surgical Informatics, we use Keras to interact with TensorFlow in R. Keras for R is far newer than python, so there are fewer metric functions available.

Clinically, a model to predict a specific event happening is more useful than ruling it out, particularly if the event is serious (i.e. death). A recall metric would be perfect for this, however, there is no custom function available for recall in R.

So lets make one!

Fortunately Keras provides us with functions to perform calculations on tensors such as k_sum, k_round and k_clip. This lets us manipulate Tensors using Keras and come up with custom metrics. You can find other backend keras functions here:

https://keras.rstudio.com/articles/backend.html#backend-functions.

So if recall is equal to the number of true positives, divided by the number of true positives plus false negatives we need to write a function to define these.

Now should we just add pp and tp? Unforunately Keras doesn’t like this. So we use k_epsilon() to replace tp in the recall expression, to give:

And that should calculate the recall (or sensitivity) for the model!

Encryptr package: easily encrypt and decrypt columns of sensitive data

This post was originally published here

A number of existing R packages support data encryption. However, we haven’t found one that easily suits our needs: to encrypt one or many columns of a data frame or tibble using a private/public key pair in tidyversefunctions. The emphasis is on the easily.

Encrypting and decrypting data securely is important when it comes to healthcare and sociodemographic data. We have developed a simple and secure package encryptyr which allows non-experts to encrypt and decrypt columns of data.

There is a simple and easy-to-follow vignette available on our GitHub page which guides you through the process of using encryptr:

https://github.com/SurgicalInformatics/encryptr.

Confidential data – security challenges

Data containing columns of disclosive or confidential information such as a postcode or a patient ID (CHI in Scotland) require extreme care. Storing sensitive information as raw values leaves the data vulnerable to confidentiality breaches.

It is best to just remove confidential information from the records whenever possible. However, this can mean the data can never be re-associated with an individual. This may be a problem if, for example, auditors of a clinical trial need to re-identify an individual from the trial data.

One potential solution currently in common use is to generate a study number which is linked to the confidential data in a separate lookup table, but this still leaves the confidential data available in another file.

Encryptr package solution – storing encrypted data

The encryptr package allows users to store confidential data in a pseudoanonymised form, which is far less likely to result in re-identification.

The package allows users to create a public key and a private key to enable RSA encryption and decryption of the data. The public key allows encryption of the data. The private key is required to decrypt the data. The data cannot be decrypted with the public key. This is the basis of many modern encryption systems.

When creating keys, the user sets a password for the private key using a dialogue box. This means that the password is included in an R script. We recommend creating a secure password with a variety of alphanumeric characters and symbols.

As the password is not stored, it is important that you are able to remember it if you need to decrypt the data later.

Once the keys are created it is possible to encrypt one or more columns of data in a data frame or tibble using the public key. Every time RSA encryption is used it will generate a unique output. Even if the same information is encrypted more than once, the output will always be different. It is not possible therefore to match two encrypted values.

These outputs are also secure from decryption without the private key. This may allow sharing of data within or between research teams without sharing confidential data.

Caution: data often remains potentially disclosive (or only pseudoanomymised) even after encryption of identifiable variables and all of the required permissions for usage and sharing of data must still be in place.

Encryptr package – decrypting the data

Sometimes decrypting data is necessary. For example, participants in a clinical trial may need to be contacted to explain a change or early termination of the trial.

The encryptr package allows users to securely and reliably decrypt the data. The decrypt function will use the private key to decrypt one or more columns. The user will be required to enter the password created when the keys were generated.

As the private key is able to decrypt all of the data, we do not recommend sharing this key.

Blinding and unblinding clinical trials – another encryptr package use

Often when working with clinical trial data, the participants are randomised to one or more treatment groups. Often teams working on the trial are unaware of the group to which patients were randomised (blinded).

Using the same method of encryption, it is possible to encrypt the participant allocation group, allowing the sharing of data without compromising blinding. If other members of the trial team are permitted to see treatment allocation (unblinded), then the decryption process can be followed to reveal the group allocation.

What this is not

This is a simple set of wrappers of openssl aimed at non-experts. It does not seek to replace the many excellent encryption packages available in R, such as PKI, sodium and safer. We believe however that it makes things much easier. Comments and forks welcome.