rmedicine2019 – some quick thoughts and good packages

Kenny McLean and I recently attended rmedicine 2019 in Boston MA. The conference is aimed at clinicians and non-clinicians who use R for day-to-day research and monitoring of clinical processes.

Day 1 covered two parallel workshops: R Markdown for Medicine and Wrangling Survival Data 

I attended R Markdown for Medicine run by Alison Hill from RStudio. Using .rmd files has become the default for the Surgical Informatics Group and, so it seems, a great number of others who attended rmedicine. Around a third of the presentations at rmedicine covered workflows involving sharing of data via either .rmd files or through shiny, an R package for creating deploy-able dashboards for data visualisation and interactive exploration.

R Markdown for Medicine

An Overview of Useful Tips and Tricks

R markdown is an extension of R which allows you to combine narrative text and R code within one document. This means your notes, code, results and plots are all in one place. Code is contained in between three backticks with an {r} after the first set. Inline code can also be used between single backticks followed by r without the curly brackets and then the code. This means that results can be changed automatically so that for a trial when you describe the results of numbers included / excluded, this only needs changed in one place so that the rest of the text (and / or flowcharts) updates automatically. It is also possible to mix-and-match other chunks of code from other languages.

Use Params!

Parameters are set in the YAML header at the top of the .rmd document. If you set a parameter of data to a default .csv or .rda file then this can be changed for other similar files without creating a new document. A really useful example would be when you have multiple hospitals or multiple diseases each with a separate data file, a report can then be generated for each file. If you use rmarkdown::render() along with purrr::pwalk you can generate a separate output file for any number of hospitals / diseases / countries / individuals etc. in just a couple of lines of code.

Use Helper Packages

There are some greater .rmd helper packages to improve the workflow, improve the rendering of documents and generally make life easier.

bookdown allows several .rmd documents to be combined to a book but also has some general usefulness for single documents as well. Using bookdown::word_document2 or bookdown::html_document2 in the YAML header under the output field is designed to improve cross-referencing of tables and figures compared to the default versions.

wordcountaddin allows an accurate word count to be performed which will not count YAML or code etc. without knitting the document. This is much easier than knitting the document and then performing a word count!

citr allows automated insertion of markdown citations to assist with referencing. Check out my earlier blog on referencing to get an idea of how to set up .bib files. I may add another blog on this topic, watch this space!

xaringan is a useful package for creating HTML presentations with high levels of customisation. It is possible to use an additional .css file for even greater customisation and styling of your slides but xaringan offers a great deal of user-friendly options.

distill appears to be good at supporting mobile-friendly web publishing for scientific communication with flexible figure layouts, table pagination, LaTeX math support and incorporation of javascript.

There are countless other helper packages and more likely to be on their way. Many allow additional aesthetic modification of the output documents and may allow you to run R code rather than modifying a .css file.

List Numbering the Lazy Way

List numbering in .rmd works without needing to manually enter the correct numbers. Just make a list where every element begins with 1. and .rmd will transform it into an appropriately-numbered list. Great if you need to add in a new element to the middle of the list later!

Multiple lots in a Grid

I’ve previously come across patchwork as a way to plot several plots into a grid which could be 1×2, 2×2, 3 in one column and one in the other etc. There are also two other packages cowplot and egg. I haven’t explored the differences between them but if you find that one doesn’t give you the exact customisation or alignment you need then possibly try another one. cowplot looks as if it might perform better at overlaying plots on top of another and at exact axis line matching.

Use the here package to help with file paths

here is a great package for swapping between Windows and Mac file paths (no more swapping backslashes and forward slashes!). Using here::here() will default to looking for a file in the .Rproj directory rather than the .rmd directory which is the default otherwise – great if you want to have multiple .rmd documents each in their own sub-directory with a shared data file in the parent directory.

Customise Code Outputs

R markdown allows customisation of appearance of code. Some of this can be done through modifying a .css file but there are some simpler ways to make basic changes. Try adding comment = "#>" to knitr::opst_chunk$set()to customise how comments appear in your document.

Word document creations tips

R markdown is generally great for HTML and PDF formats. The options for knitting to Word are not as well developed but there are some good options. The bookdown package is useful as discussed. The redoc package has been used to facilitate conversion to and from word – not tried it personally but if it can print out to word and then handle tracked changes back into markdown then it could be very useful.

For converting more complex tables and figures to word an option is to knit to rtf (rich text format) and then open the rtf file in word. This tends to be very good at keeping the desired formatting.

Future updates – hopefully!

R markdown is a great resource although there are a handful of minor issues which are currently difficult to resolve. One of the main problems I find it with tables and cross-referencing. I really like the syntax and customisation of the gt package but at present it appears cross-referencing in a way which works across HTML, PDF and Word outputs is not supported – a great opportunity to submit a pull request if you think you can get this to work.

Other Useful rmedicine Packages and Ideas

survival Package Update

The latest version (version 3.0) of the survival package was presented by Terry Therneau and is now available on github. This package is used by over 650 additional downstream dependencies. The latest version allows for multiple observations per subject, multiple endpoints per subject and multiple types of end-point. This will be particularly useful for competing risks analyses e.g. outcomes for liver transplant patients (transplanted, still on list, removed from list as no longer eligible or died).

Keep an eye-out for Kenny McLean’s blog where he plans to cover the survival package and many other useful packages presented at rmedicine 2019.

hreport Automated Trial Reporting

hreport by Frank Harrel (currently available on github) is for automated reporting of trials and studies with generation of interactive html graphs based in plotly. Several aspects of a study can be rendered easily into plots demonstrating accrual, exclusions, descriptive statistics, adverse events and time-to-event data. Another key theme of rmedicine 2019 appears to have been the use of plotly or similar packages to enable interaction with data.

timevis – interactive timelines

timevis allows generation of highly interactive timeline plots which allow zooming, adding or removal of events, resizing, etc.

Holepunch package

For working with projects that require a number of packages that then need shared with a colleague, holepunch provides a quick method for generating a list of dependencies and a Dockerfile. The package creates a link for another user to open a free RStudio server with all of the required packages installed. This may be useful for trouble-shooting in a department and showing code examples.

Summary

rmedicine 2019 has shown that clinical researchers are moving increasingly towards literate programming, interactive visualisations and automated workflows using R and Rmarkdown.

The conference was a great mix of methods presentations and data presentations from R users. You definitely don’t need any in-depth knowledge of R to benefit from it and I’d highly recommend booking for rmedicine 2020.

Survival analysis with strata, clusters, frailties and competing risks in in Finalfit

This post was originally published here

Background

In healthcare, we deal with a lot of binary outcomes. Death yes/no, disease recurrence yes/no, for instance. These outcomes are often easily analysed using binary logistic regression via finalfit().

When the time taken for the outcome to occur is important, we need a different approach. For instance, in patients with cancer, the time taken until recurrence of the cancer is often just as important as the fact it has recurred.

Finalfit wraps a number of functions to make these analyses easy to perform and output into PDFs and Word documents.

Installation

# Make sure finalfit is up-to-date 
install.packages("finalfit")

Dataset

We’ll use the classic “Survival from Malignant Melanoma” dataset from the boot package to illustrate. The data consist of measurements made on patients with malignant melanoma. Each patient had their tumour removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark during the period 1962 to 1977.

For the purposes of demonstration, we are interested in the association between tumour ulceration and survival after surgery.

Get data and check

library(finalfit)
melanoma = boot::melanoma #F1 here for help page with data dictionary
ff_glimpse(melanoma)
#> Continuous
#>               label var_type   n missing_n missing_percent   mean     sd
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1
#> status       status    <dbl> 205         0             0.0    1.8    0.6
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5
#> age             age    <dbl> 205         0             0.0   52.5   16.7
#> year           year    <dbl> 205         0             0.0 1969.9    2.6
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5
#>              min quartile_25 median quartile_75    max
#> time        10.0      1525.0 2005.0      3042.0 5565.0
#> status       1.0         1.0    2.0         2.0    3.0
#> sex          0.0         0.0    0.0         1.0    1.0
#> age          4.0        42.0   54.0        65.0   95.0
#> year      1962.0      1968.0 1970.0      1972.0 1977.0
#> thickness    0.1         1.0    1.9         3.6   17.4
#> ulcer        0.0         0.0    0.0         1.0    1.0
#> 
#> Categorical
#> data frame with 0 columns and 205 rows

As can be seen, all variables are coded as numeric and some need recoding to factors.

Death status

status is the the patients status at the end of the study.

  • 1 indicates that they had died from melanoma;
  • 2 indicates that they were still alive and;
  • 3 indicates that they had died from causes unrelated to their melanoma.

There are three options for coding this.

  • Overall survival: considering all-cause mortality, comparing 2 (alive) with 1 (died melanoma)/3 (died other);
  • Cause-specific survival: considering disease-specific mortality comparing 2 (alive)/3 (died other) with 1 (died melanoma);
  • Competing risks: comparing 2 (alive) with 1 (died melanoma) accounting for 3 (died other); see more below.

Time and censoring

time is the number of days from surgery until either the occurrence of the event (death) or the last time the patient was known to be alive. For instance, if a patient had surgery and was seen to be well in a clinic 30 days later, but there had been no contact since, then the patient’s status would be considered 30 days. This patient is censored from the analysis at day 30, an important feature of time-to-event analyses.

Recode

library(dplyr)
library(forcats)
melanoma = melanoma %>%
  mutate(
    # Overall survival
    status_os = case_when(
      status == 2 ~ 0, # "still alive"
      TRUE ~ 1), # "died melanoma" or "died other causes"
    
    # Diease-specific survival
    status_dss = case_when(
      status == 2 ~ 0,  # "still alive"
      status == 1 ~ 1,  # "died of melanoma"
      status == 3 ~ 0), # "died of other causes is censored"

    # Competing risks regression
    status_crr = case_when(
    	status == 2 ~ 0,  # "still alive"
        status == 1 ~ 1,  # "died of melanoma"
        status == 3 ~ 2), # "died of other causes"

    # Label and recode other variables
    age = ff_label(age, "Age (years)"), # table friendly labels
    thickness = ff_label(thickness, "Tumour thickness (mm)"),
    sex = factor(sex) %>% 
      fct_recode("Male" = "1", 
                 "Female" = "0") %>% 
      ff_label("Sex"),
    ulcer = factor(ulcer) %>% 
      fct_recode("No" = "0",
                 "Yes" = "1") %>% 
      ff_label("Ulcerated tumour")
  )

Kaplan-Meier survival estimator

We can use the excellent survival package to produce the Kaplan-Meier (KM) survival estimator. This is a non-parametric statistic used to estimate the survival function from time-to-event data. Note use of %$% to expose left-side of pipe to older-style R functions on right-hand side.

library(survival)

survival_object = melanoma %$% 
  Surv(time, status_os)

# Explore:
head(survival_object) # + marks censoring, in this case "Alive"
#> [1]  10   30   35+  99  185  204

# Expressing time in years
survival_object = melanoma %$% 
  Surv(time/365, status_os)

KM analysis for whole cohort

Model

The survival object is the first step to performing univariable and multivariable survival analyses.

If you want to plot survival stratified by a single grouping variable, you can substitute “survival_object ~ 1” by “survival_object ~ factor”

# Overall survival in whole cohort
my_survfit = survfit(survival_object ~ 1, data = melanoma)
my_survfit # 205 patients, 71 events
#> Call: survfit(formula = survival_object ~ 1, data = melanoma)
#> 
#>       n  events  median 0.95LCL 0.95UCL 
#>  205.00   71.00      NA    9.15      NA

Life table

A life table is the tabular form of a KM plot, which you may be familiar with. It shows survival as a proportion, together with confidence limits. The whole table is shown with summary(my_survfit).

summary(my_survfit, times = c(0, 1, 2, 3, 4, 5))
#> Call: survfit(formula = survival_object ~ 1, data = melanoma)
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     0    205       0    1.000  0.0000        1.000        1.000
#>     1    193      11    0.946  0.0158        0.916        0.978
#>     2    183      10    0.897  0.0213        0.856        0.940
#>     3    167      16    0.819  0.0270        0.767        0.873
#>     4    160       7    0.784  0.0288        0.730        0.843
#>     5    122      10    0.732  0.0313        0.673        0.796
# 5 year overall survival is 73%

Kaplan Meier plot

We can plot survival curves using the finalfit wrapper for the package excellent package survminer. There are numerous options available on the help page. You should always include a number-at-risk table under these plots as it is essential for interpretation.

As can be seen, the probability of dying is much greater if the tumour was ulcerated, compared to those that were not ulcerated.

dependent_os = "Surv(time/365, status_os)"
explanatory = "ulcer"

melanoma %>% 
  surv_plot(dependent_os, explanatory, pval = TRUE)

Cox-proportional hazards regression

CPH regression can be performed using the all-in-one finalfit() function. It produces a table containing counts (proportions) for factors, mean (SD) for continuous variables and a univariable and multivariable CPH regression.

A hazard is the term given to the rate at which events happen.
The probability that an event will happen over a period of time is the hazard multiplied by the time interval.
An assumption of CPH is that hazards are constant over time (see below).

It produces a table containing counts (proportions) for factors, mean (SD) for continuous variables and a univariable and multivariable CPH regression.

Univariable and multivariable models

dependent_os = "Surv(time, status_os)"
dependent_dss = "Surv(time, status_dss)"
dependent_crr = "Surv(time, status_crr)"
explanatory = c("age", "sex", "thickness", "ulcer")

melanoma %>% 
    finalfit(dependent_os, explanatory)

The labelling of the final table can be easily adjusted as desired.

melanoma %>% 
    finalfit(dependent_os, explanatory, add_dependent_label = FALSE) %>% 
    rename("Overall survival" = label) %>% 
    rename(" " = levels) %>% 
    rename(" " = all)

Reduced model

If you are using a backwards selection approach or similar, a reduced model can be directly specified and compared. The full model can be kept or dropped.

explanatory_multi = c("age", "thickness", "ulcer")
melanoma %>% 
    finalfit(dependent_os, explanatory, explanatory_multi, 
      keep_models = TRUE)

Testing for proportional hazards

An assumption of CPH regression is that the hazard associated with a particular variable does not change over time. For example, is the magnitude of the increase in risk of death associated with tumour ulceration the same in the early post-operative period as it is in later years.

The cox.zph() function from the survival package allows us to test this assumption for each variable. The plot of scaled Schoenfeld residuals should be a horizontal line. The included hypothesis test identifies whether the gradient differs from zero for each variable. No variable significantly differs from zero at the 5% significance level.

explanatory = c("age", "sex", "thickness", "ulcer", "year")
melanoma %>% 
    coxphmulti(dependent_os, explanatory) %>% 
    cox.zph() %>% 
    {zph_result <<- .} %>% 
    plot(var=5)
zph_result
#>               rho  chisq      p
#> age        0.1633 2.4544 0.1172
#> sexMale   -0.0781 0.4473 0.5036
#> thickness -0.1493 1.3492 0.2454
#> ulcerYes  -0.2044 2.8256 0.0928
#> year       0.0195 0.0284 0.8663
#> GLOBAL         NA 8.4695 0.1322

Stratified models

One approach to dealing with a violation of the proportional hazards assumption is to stratify by that variable. Including a strata() term will result in a separate baseline hazard function being fit for each level in the stratification variable. It will be no longer possible to make direct inference on the effect associated with that variable.

This can be incorporated directly into the explanatory variable list.

explanatory= c("age", "sex", "ulcer", "thickness", "strata(year)")
melanoma %>% 
    finalfit(dependent_os, explanatory)

Correlated groups of observations

As a general rule, you should always try to account for any higher structure in the data within the model. For instance, patients may be clustered within particular hospitals.

There are two broad approaches to dealing with correlated groups of observations.

Including a cluster() term is akin to using generalised estimating equations (GEE). Here, a standard CPH model is fitted but the standard errors of the estimated hazard ratios are adjusted to account for correlations.

Including a frailty() term is akin to using a mixed effects model, where specific random effects term(s) are directly incorporated into the model.

Both approaches achieve the same goal in different ways. Volumes have been written on GEE vs mixed effects models. We favour the latter approach because of its flexibility and our preference for mixed effects modelling in generalised linear modelling. Note cluster() and frailty() terms cannot be combined in the same model.

# Simulate random hospital identifier
melanoma = melanoma %>% 
  mutate(hospital_id = c(rep(1:10, 20), rep(11, 5)))

# Cluster model
explanatory = c("age", "sex", "thickness", "ulcer", "cluster(hospital_id)")
melanoma %>% 
  finalfit(dependent_os, explanatory)
# Frailty model
explanatory = c("age", "sex", "thickness", "ulcer", "frailty(hospital_id)")
melanoma %>% 
  finalfit(dependent_os, explanatory)

The frailty() method here is being superseded by the coxme package, and we’ll incorporate this soon.

Hazard ratio plot

A plot of any of the above models can be produced by passing the terms to hr_plot().

melanoma %>% 
    hr_plot(dependent_os, explanatory)

Competing risks regression

Competing-risks regression is an alternative to CPH regression. It can be useful if the outcome of interest may not be able to occur because something else (like death) has happened first. For instance, in our example it is obviously not possible for a patient to die from melanoma if they have died from another disease first. By simply looking at cause-specific mortality (deaths from melanoma) and considering other deaths as censored, bias may result in estimates of the influence of predictors.

The approach by Fine and Gray is one option for dealing with this. It is implemented in the package cmprsk. The crr() syntax differs from survival::coxph() but finalfit brings these together.

It uses the finalfit::ff_merge() function, which can join any number of models together.

explanatory = c("age", "sex", "thickness", "ulcer")
dependent_dss = "Surv(time, status_dss)"
dependent_crr = "Surv(time, status_crr)"

melanoma %>%

  # Summary table
  summary_factorlist(dependent_dss, explanatory, 
    column = TRUE, fit_id = TRUE) %>%

  # CPH univariable
  ff_merge(
    melanoma %>%
      coxphmulti(dependent_dss, explanatory) %>%
      fit2df(estimate_suffix = " (DSS CPH univariable)")
    ) %>%
    
# CPH multivariable
  ff_merge(
    melanoma %>%
      coxphmulti(dependent_dss, explanatory) %>%
      fit2df(estimate_suffix = " (DSS CPH multivariable)")
    ) %>%
    
# Fine and Gray competing risks regression
  ff_merge(
    melanoma %>%
      crrmulti(dependent_crr, explanatory) %>%
      fit2df(estimate_suffix = " (competing risks multivariable)")
    ) %>%

  select(-fit_id, -index) %>%
  dependent_label(melanoma, "Survival")

Summary

So here we have various aspects of time-to-event analysis commonly used when looking at survival. There are many other applications, some which may not be obvious: for instance we use CPH for modelling length of stay in in hospital.

Stratification can be used to deal with non-proportional hazards in a particular variable.

Hierarchical structure in your data can be accommodated with cluster or frailty (random effects) terms.

Competing risks regression may be useful if your outcome is in competition with another, such as all-cause death, but is currently limited in its ability to accommodate hierarchical structures.

HealthyR Estonia Day 3

Well, what a great 3 days this has been! Again today, we gained extra people to join in HealthyR Notebooks – A formidable achievement for a statistics course!

We kicked off with a brilliant session by Ewen Harrison about survival analysis and time to event data, introducing new concepts and the R survival package. Then went into an amazing session by Riinu Ots, who showcased how to plot your data with real world, practical examples. This session really was brilliant, living up to Riinus catch-phrase ‘always plot your data’. This was followed by a short pop quiz, which all participants did brilliantly!

After a tasty lunch, we then continued into a new session, how to work with your data. This session is aimed at translating the learning of HealthyR, straight to a reallife dataset of the participants choosing. Participants were guided through the practical application of R to their own data, giving them a springboard to produce some cool analyses after the course.

Following the final session we departed to Tallinn for flights back home tomorrow.

All in all, HealthyR notebooks were a success and very fun to teach. Estonia was well worth the trip and showed we could teach R to an international audience (even having fun at the same time!). Looking forward to developing the course further when the team go to Ghana later this year. Big thanks to Julius for organising the course and to the Welcome Trust!

HealthyR Notebooks Estonia Day 2

Everybody came back for Day 2 of HealthyR Notebooks in Estonia!

Today focussed on modelling kicking off with linear regression in detail from Riinu – if you understand this you understand the majority of statistical tests!

Factors were introduced by Cameron, which led nicely into logistic regression. By the end of this the whole room was comfortably building regression models as if they had been doing it for years!

Notebooks are a really powerful tool for teaching this sort of material allowing seamless output into PDF and Word format. Some of the delegates commented on how they had struggled with this aspect of R in the past.

After all the intense work it was a great relief to break early for some fantastic team building including stuff like this!

Some even tried there hand at archery and felt pretty smug about their performance 😂

Finally, we are so grateful Julius Juurmaa for all of the organisation. He even started a company to administer the course! Thank you Julius.

HealthyR Notebooks Estonia Day 1

The Surgical Informatics team arrived in beautiful Estonia yesterday.

We are here as part of our Wellcome Trust Open Research Fund grant – “HealthyR Notebooks: Democratising open and reproducible data analysis in resource-poor environments”.

It’s a mouthful, but important! We have adapted our popular HealthyR training course to be easily delivered on small laptop screens allowing state-of-the-art data analysis to be performed using RStudio anywhere by anyone. We’re testing this in Estonia and running it again in Ghana in November.

Also look out for HealthyR the book coming soon.

The setting is fantastic:

The delegates are already amazing at using R, but we’re teaching Tidyverse and Finalfit to bring everyone up to date with all the great new modern packages.

There are threats of pedalo racing later 😳.

Tidying up edgeR and differential expression

For anyone who has done any differential expression from RNA array/ sequencing datasets in R, edgeR or DESeq2 are the packages to go to! They contain powerful functions and models which cater for most uses.

As with most bioinformatics type packages, they end up creating scripts with endless lines of code and square brackets aplenty. Now not only does this make scripts difficult to read, it’s difficult for R to handle in some aspects. Quite often these packages have custom objects etc. which makes changing formats and sending data into different packages a pain.

Well, if you do lots of differential expression, here’s some wrappers for you! Taking inspiration from the FinalFit and tidyverse approach,
I have written a few wrappers for differential expression analyses to make your code and life happier.

Please note – throughout this all I use count data as this is what edgeR likes, rather than FPKM/RPKM or RSEM.

Step 1 – Taking the first steps to expression happiness

So – first up, preparing and filtering your data. This function turns your data and any clinical/ sample data, wraps it up into a DGEList object,
then will filter it. I like to function based on proportions of lowly expressed transcripts, as purely filtering on arbitary CPM values has its own issues, particularly if your read depth is low.

tidy_dge() is a function which does all of this! It combines two functions dge_bind() – which sets up your DGEList based on two data frames and tidy_gene_filter() which then filters these objects and calculates normalisation values.

We can then combine this into one function tidy_dge(), which shrinks about 10-20 lines into one.

Step 2 – Translating annotations

Next up – convert_genes(). Switching between gene annotation systems (i.e. ENSEMBL, HUGO, ENTREZ) is a pain. We all know that. The package biomaRt tries to solve this a bit, but still we’re left with somewhat unwieldly functions and lines of code trying to pick out where genes sit and what to map them back to. For this, I have devised convert_gene().

convert_gene() passes a request to biomaRt (the ENSEMBL API) and will convert any (common) gene name to any other gene annotation out. It takes a little while to contact the servers, but is well worth it! Alternatively, you will be able to pass a saved biomaRt output to this, so you can make one pull only which speeds it up! If we are to do this, we should create three separate marts – one containing genes, one genes and GO codes, the other the GO codes and processes. Here’s some functions to do this:

The reason for this- in testing some of the marts have errors when being pulled with genes and GO codes all at once! The ensembl biomaRt API is also somewhat unstable and frequently produces 404 errors/ lots of downtime. Plus – we can then use the mart we created for genes in convert_gene().

Back to convert_gene()! It is designed to work with a glm object straight out of edgeR or with a dataframe of counts, but will detect if it’s another type of file. Just be careful! As the default setting is mouse currently:

GO – The final step to expression fulfillment

Finally – those GO analyses, how on earth do you then trace back and find which genes in which pathways are up-regulated? I’ve never been able to find a nice function to do this.

Fear not – we have a new one! gene_GO_explorer(). This is very handy indeed. It basically will take your glm object from edgeR and map the genes to GO processes and codes.


There are some functions wrapped up in this function for dissecting pathways out too; go_grep, go_code, go_pathway and go_components. These allow you to supply a list of GO processes, GO codes or to search the GO annotations for pathways of interest. For example, If your GO enrichment shows pathway activity for a specific GO pathway, you can input the GO code into go_code to get the genes involved in that pathway. For searching GO pathway annotations, we can use go_grep. Here, if we set go_grep to ‘stress’, gene_GO_explorer() will search and return any pathways with the term ‘stress’ in their description.

If genes are part of multiple pathways that are included in our search, the GO terms will be collapsed down together – separated by a semicolon.
This data can be displayed in a ‘long’ format if required by specifying long = TRUE.

Note – unlike goana, gene_GO_explorer() does not mind what type of gene annotation is supplied! I use convert_gene() to prepare my glms for goana then pop it through gene_GO_explorer().

If you’re not using a glm from edgeR – switch off the adjusted p-value bit of gene_GO_explorer() using adjust_p = FALSE, or set the column with your p-values to ‘PValues’.

I’ll be developing more of this sort of thing, in addition to some nicer wrappers for preparing your sequencing data straight from FASTQ formats, making user readable edgeR models and then querying protein interaction databases too! Comments/ thoughts appreciated if people find this kind of stuff useful. Considering making a package. FinalOmics anyone?

Reference Management – An Efficient Setup for writing a Thesis

This blog is intended for researchers, PhD students, MD students and any other students who wish to have a robust and effective reference management setup. The blog has a particular focus on those using R markdown, Bookdown or LaTeX. Parts of the blog can also help setup Zotero for use with Microsoft Word. The blog has been designed to help achieve the following goals:

  • Effective citation storage
    • Fast and easy citation storage (one-click from Chrome)
    • Fast and easy PDF storage using cloud storage
    • Immediate, automatic and standardised PDF renaming
    • Immediate, automatic and standardised citation key generation
  • Effective citation integration with markdown etc.
    • Generation of citation keys which work with LaTeX and md (no non-standard characters)
    • Ability to lock citation keys so that they don’t update with Zotero updates
    • Storage of immediately updated .bib files for use with Rmd, Bookdown and LaTeX
    • Automated update of the .bib file in RStudio server

Downloads and Setup

For my current reference management setup I need the following software:

  • Zotero
    • Zotero comes with 300MB of free storage which allows well over 1000 references to be stored as long as PDFs are stored separately
    • From the same download page download the Chrome connector to enable the “save to zotero” function in Google Chrome
  • ZotFile
    • Zotfile is a Zotero plugin which helps with PDF management, download the .xpi file and then open Zotero, go to “Tools → Add-Ons” and click the little cog in the top right corner and navigate to file to install (Figure 1)
  • Better BibTeX
    • Better BibTeX is a plugin to help generate citation keys which will be essential for writing articles in LaTeX, R Markdown or Bookdown
    • If the link doesn’t work go to github and scroll down to the ReadMe to find a link to download the .xpi file
    • The same approach is then used to install the Better BibTeX plugin for zotero (“Tools → Add-Ons”)

After downloading Zotero, ZotFile and Better BibTeX create an account on Zotero online.

In addition to the Zotero downloads this guide will focus on an efficient setup for writing with R markdown or Bookdown and assumes that you have access to the following software / accounts:

  • Dropbox / Google Drive / other cloud storage service which allows APIs
    • It will also be necessary for these to be accessible using Windows Explorer or Mac Finder (there are many guides online for syncing Google Drive and Dropbox so that they appear in file explorers)
  • RStudio (this is not 100% essential but it is far harder to use Rmd without it)
    • Packages which will be required for this setup include rdrop2 (if using dropbox, other packages are available to convert this setup to Google Drive etc.), encryptr, bookdown or Rmarkdown, tinytex and a LaTeX installation (the Bookdown author recommends using tinytex which can be installed by the similarly named R package: tinytex::install_tinytex())

Folder Setup

When using Zotero it is a good ideal to create a folder in which you will store PDFs retrieved from articles. Ultimately it is optional whether or not PDFs are stored but if you have access to cloud storage with a good quota then it can make writing in Rmd etc. much faster as there is no requirement to search online for the original PDF. This folder should be set up in Google Drive, Dropbox or another cloud storage service which can be accessed from your own computer through the file explorer.

A second folder may be useful to store bibliographies which will be generated for specific projects or submissions. Again this folder should be made available in cloud storage.

ZotFile Preferences

To setup Zotero so that retrieved PDFs are automatically stored and renamed in the cloud storage without consuming the Zotero storage quota go to “Tools → ZotFile Preferences” and on the first tab: General Settings and set the folder and subfolder naming strategy for PDFs. I have set the location of the files to a Custom location and in this case used the path to a Google Drive folder (~\Google Drive\Zotero PDF Library). ZotFile will also store retrieved PDFs in subfolders to help with finding PDFs at a later date. The current setup I use is to create a subfolder with the first author surname so that all papers authored by one (or more) author with the same name are stored together using the \%a in the subfolder field (Figure 2). Other alternatives are to store PDFs in subfolders using year (\%y); journal or publisher (\%w); or item type (\%T).

Next the Renaming Rules tab can be configured to provide sensible names to each of the files (this is essential if PDFs are not to be stored as random strings of characters which provide no meaning). In this tab I have set the format to: {%a_}{%y_}{%t} which provides names for the PDFs in the format of: Fairfield_2019_Gallstone_Disease_and_the_Risk_of_Cardiovascular_Disease.pdf. I find that this shows author, year and first word of title without needing to expand the file name (Figure 3).

I have not changed any of the default settings in either the Tablet Settings or Advanved Settings tabs apart from removing special characters in the Advanced Settings (this stops things from breaking later).

General Zotero Settings

Zotero has several configurable settings (accessed through: “Edit → Preferences”) and I have either adopted the defaults or made changes as follows:

General:

  • I have ticked the following:
    • Automatically attach associated PDFs
    • Automatically retrieve metadata for PDFs
    • Automatically rename attachments using parent metadata
    • Automatically tag items with keywords and subject headings
    • All options in Group section
  • I have left the following unticked:
    • Automatically take snapshots
    • Rename linked files

Sync:

  • Enter the account details
  • Tick sync automatically
  • Untick sync full text (if you choose to save PDFs then syncing full text will quickly consume the 300MB quota)

Search:

  • Left unchanged

Export:

  • Left unchanged

Cite:

  • There are several sensible defaults but if there is a new citation style you wish to be able to use in Microsoft Word for example then click “Get additional styles” as there is probably a version that you need already created. You can click the “+” button to add a style from a .csl file if you have one already. Finally, if you are desperate for a style that doesn’t already exist then you can select a citation style and click Style Editor and edit the raw .csl file.
  • In the Word Processors subtab (on the main Cite tab), you can install the Microsoft Word add-in to allow Zotero to work in Microsoft Word.

Advanced:

  • I changed nothing on the General subtab
  • In the Files and Folders subtab I have selected the path to base directory for attachments
  • I have not changed the Shortcuts subtab
  • I have not changed the Feeds subtab

Better BibTex:

  • In this section I have set my Citation Key format to [auth:lower:alphanum]_[year:alphanum]_[veryshorttitle:lower:alphanum]_[journal:lower:clean:alphanum] (Figure 4). This generates a citation key for each reference in the format of fairfield_2019_gallstones_scientificreports or harrison_2012_hospital_bmj. It always takes the first author’s surname, the year, the first word of the title and the journal abbreviation if known. The clean and alphanum arguments to this field are used to remove unwanted punctuation which can cause citation to fail in LaTeX.
Figure 4: Better BibTeX Citation Key

Once the settings have been configured if you already had references stored in Zotero and wish to change the citation key for old references select your entire library root (above all folders), select all references, right click and use “Better BibTex → Refresh BibTeX Key” and all of the citation keys should be updated.

Creating a .bib file

For referencing in a new project, publication or submission it may be helpful to have a dynamic .bib file that updates with every new publication and can be accessed from any device through cloud storage.

To set up a .bib file, first find the folder that you wish to create the file from (this should be the folder which contains any citations you will use and ideally not the full library to cut down on unnecessary storage and syncing requirements). Note that the .bib file will generate a bibliography from any citations stored directly in the folder when using default settings. This prevents use of subfolders which I find particularly helpful for organising citations and I have therefore changed the setting so that folders also show any citations stored in subfolders. To make this change go to “Edit Preferences” and select the “Advanced” tab and at the bottom of the “General” subtab select “Config Editor”. This will bring up a searchable list of configurations (it may show a warning message before this) and search in the search box for “extensions.zotero.recursiveCollections”. Set “Value” to TRUE and then when you click a folder you should see all of the citations also stored in subfolders.

Right click the folder and select “Export Collection”. A pop-up window will appear at which point select “Keep Updated” and if using RStudio desktop save the file in the directory where you have your Rmd project files. If you are working with RStudio server then save the file in a cloud storage location which will then be accessed from the server. I have a .bib file stored in Dropbox which I access from RStudio server.

Linking Dropbox and RStudio Server to Access the .bib File

The following covers linking Dropbox to RStudio server but could be adapted to cover another cloud storage service.

Dropbox provides a token to allow communication between different apps. The rdrop2 package is what I used to create a token to allow this. I actually created the token on RStudio desktop as I couldn’t get the creation to work on the server but this is perfectly ok.

Caution: The token generated by this process could be used to access your Dropbox from anywhere using RStudio if you do not keep it secure. If somebody were to access an unencrypted token then it would be equivalent to handing out your email and password. I therefore used the encryptr package to allow safe storage of this token.

Token Creation

Open Rstudio desktop and enter the following code:

The code will create two files, a token and the .httr-oauth file from which a token can also be made. The encryptr package can then encrypt the files using a public / private key pair. It is essential that the password that is set when using genkeys() is remembered otherwise the token cannot then be used. In this case the original token can’t be retrieved but could be created again from scratch.

The following files will then be needed to upload to the RStudio server:

  • droptoken.rds.encryptr.bin – or the name provided for the encrypted Dropbox token
  • id_rsa – or the name provided for the private key from the private / public key pair

Dropbox Linkage for Referencing the .bib File

Now that the encrypted token and necessary (password-protected) private key are available in RStudio server, the following can be saved as a separate script. The script is designed to read in and decrypt the encrypted token (this will require a password and should be done if the .bib file needs updated). Only the drop_download() needs repeated if using the token again during the same session. The token should be cleared at the end of every session for additional security.

Now that the .bib file has been created and is stored as “my.bib” in the local directory, it should update whenever the token is loaded and drop_download() is run.

Final Result

On clicking “Save to Zotero” button in Chrome and running drop_download() the following should all happen almost instantaneously:

  • Zotero stores a new reference
  • A PDF is stored in the cloud storage having been named appropriately
  • A link to the PDF is stored in Zotero (without using up significant memory)
  • A citation key is established for the reference in a standardised format without conflicts
  • Pre-existing citation keys which have been referenced earlier in the writing of the paper are not altered
  • A .bib file is updated in the RStudio server directory
  • And much unwanted frustration of reference management is resolved

This is my current reference management system which I have so far found to be very effective. If there are ways you think it can be improved I would love to hear about them.

Encryptr now makes it easy to encrypt and decrypt files

This post was originally published here

Data security is paramount and encryptr was written to make this easier for non-experts. Columns of data can be encrypted with a couple of lines of R code, and single cells decrypted as required.

But what was missing was an easy way to encrypt the file source of that data.

Now files can be encrypted with a couple of lines of R code.

Encryption and decryption with asymmetric keys is computationally expensive. This is how encrypt for data columns works. This makes it easy for each piece of data in a data frame to be decrypted without compromise of the whole data frame. This works on the presumption that each cell contains less than 245 bytes of data.

File encryption requires a different approach as files are larger in size. encrypt_file encrypts a file using a symmetric “session” key and the AES-256 cipher. This key is itself then encrypted using a public key generated using genkeys. In OpenSSL this combination is referred to as an envelope.

It should work with any type of single file but not folders.

Documentation is maintained at encrypt-r.org

Generate keys

genkeys()
#> Private key written with name 'id_rsa'
#> Public key written with name 'id_rsa.pub'

Encrypt file

To demonstrate, the included dataset is written as a .csv file.

write.csv(gp, "gp.csv")
encrypt_file("gp.csv")
#> Encrypted file written with name 'gp.csv.encryptr.bin'

Important: check that the file can be decrypted prior to removing the original file from your system.

Warning: it is strongly suggested that the original unencrypted data file is securely stored else where as a back-up in case unencryption is not possible, e.g., the private key file or password is lost

Decrypt file

The decrypt_file function will not allow the original file to be overwritten, therefore if it is still present, use the option to specify a new name for the unencrypted file.

decrypt_file("gp.csv.encryptr.bin", file_name = "gp2.csv")
#> Decrypted file written with name 'gp2.csv'

Support / bugs

The new version 0.1.3 is on its way to CRAN today or you can install from github:

github.com/SurgicalInformatics/encryptr

Tidyverse does sequencing data!

What I’ve been up to…

For the past few months I’ve been stationed between Surgical Informatics in Edinburgh and the CRUK Beatson Institute for Cancer Research in Glasgow. The aim – to identify new treatments for hepatocellular carcinoma, a disease that affects nearly a million people worldwide each year. Surgery is the only cure, but due to the complexity of surgery it tends to be quite a specialist and risky undertaking. This, combined with the late stage at which patients become symptomatic, means that a lot of liver cancer globally in incurable.

Current chemotherapy keeps patients in a ‘holding pattern’ where it slows the growth of tumours, but lacks the ability to kill enough of the tumour to make it possible to then use surgery as a cure (called ‘downstaging’). If we can find a drug or treatment that can facilitate this, that’d be very exciting.

So, I’ve been accruing loads of sequencing data from human and mouse tumours to try and identify weaknesses in these tumours that we can target. But analysing this RNA and DNA sequencing data takes a lot of time and computational power. So I’ve been investigating how to shrink this down and try and take as much of it straight to R as possible. Thus, simplifying the pipeline and hopefully speeding things up a bit. Traditionally, this process takes hours and days. But with this workflow I’ve shrunk it to about a day for 30+ samples so far. This is obviously dependent on how many samples, how big your genome/transcriptome is and sequencing coverage/depth too.

Quick alignment and mapping

Taking the FASTQ files generated by the sequencers, we first run them through the Salmon pseudoaligner. Salmon is a package that uses pseduoalignment to align and map the reads from RNASeq data, thus shrinking two steps into one. This also enables us to leverage the speed and accuracy advantages of pseudoalignment (meant to be more robust in several aspects). I tend to dump all my FASTQ files into one folder per species, experiment or cohort and then run Salmon on this.

String manipulation in Bash

One challenge I have found is how our institute appends suffixes to the files for identification. This is obviously great, but Salmon needs to know which reads come from which end in the case of paired end sequencing or which technical repeat. Otherwise we’re simply going to be aligning gobbledygook.

In this case, the Illumina sequencer kicks out files which have a ‘S’ in front of the sample number, an ‘L’ in front of the technical repeat number and an ‘R’ in front of the direction of sequence (i.e. for paired end reads).

So using some loops in bash, I devised the following (note index file for mouse transcriptome mm10_t_index):

#!/bin/bash
source /home/user/miniconda3/etc/profile.d/conda.sh
conda activate salmon

//note ’18’ can be any number – this is just the number of samples I had.
FILE_LIST=$(for fn in /home/user/data/sequencing_data_here/fastq/*; do echo ${fn::-18} ; done)

#now check we can loop through this list
#for fn_short in $FILE_LIST; do echo “${fn_short}_test”; done

#now remove duplicates
FILE_LIST=$(echo $FILE_LIST | awk ‘BEGIN{ORS=” “}{for (i=1; i<=NF; i++)a[$i]++} END {for (i in a) print i }’)
FILE_LIST=$(echo “$FILE_LIST” | tr ‘ ‘ ‘\n’)

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L001_R1_001.fastq \
-2 “${fn}”_L001_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L001_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L002_R1_001.fastq \
-2 “${fn}”_L002_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L002_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L003_R1_001.fastq \
-2 “${fn}”_L003_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L003_quant
done

for fn in $FILE_LIST;
do
samp=basename "${fn}"
echo “Processing sample ${samp}”
salmon quant -i “/home/user/mm10_t_index/” -l A \
-1 “${fn}”_L004_R1_001.fastq \
-2 “${fn}”_L004_R2_001.fastq \
-p 8 –validateMappings -o quants/${samp}_L004_quant
done

Now, when run, this should generate aligned and mapped data in a separate ‘quants’ folder. Alongside the QC information etc. in each folder.

The most interesting files in these quantified data folders are the tab delimited quants.sf files. These contain the counts for mapped reads at the gene level. I tend to work with count data only, as from this it’s possible to derive TPM and FPKM.

Using the tidyverse with map() to quickly prepare sequencing data

Using the tidyverse we can make this really quick and easy!

First, using readr::read_tsv() and purrr::map() we can open all the files within a specific directory and add the pathname as a column (so we don’t lost track of our samples). Then we use mutate(), to put a path and sample name in. Finally, using spread() and summarise() we sum the counts across our technical repeats!

library(tidyverse)


#required functions
file_directory = ‘/mnt/data/this_is_folder_where_data_is’


file_list = dir(path = file_directory, full.names = T, pattern = “*quant.sf”, recursive = T)


#Step 1 – open all files, read and extract sample identifiers
file_list %>%
map(~read_tsv(.) %>%
mutate(path = gsub(‘/mnt/data/this_is_folder_where_data_is/quants/’, ”, .x)) %>% #Edit this to the appropriate file directory!
mutate(path = gsub(‘_quant/quant.sf’, ”, path)) %>%
mutate(path = gsub(‘__’, ‘_’, path))) -> gene_counts_list


#Step 2 – split repeats up and combine into 1 dataframe
gene_counts_list %>%
bind_rows(gene_counts_list) %>%
extract(path, into = c(“sample_id”, “n_repeat”), “(.*)_([^_]+)$”) %>%
select(sample_id, n_repeat,everything()) -> genes_all_samples_df


#Step 3 – sum reads together!
genes_all_samples_df %>%
group_by(sample_id, Name) %>%
summarise(NumReadsTotal = sum(NumReads)) %>%
spread(key = sample_id, value = NumReadsTotal) %>%
column_to_rownames(var = ‘Name’) -> gene_counts_by_sample


rm(file_directory, file_list, gene_counts_list)
#Save data
save.image(‘/mnt/data/results_folder’)

Happy sequencing! (and thanks Riinu Ots!)

New intuitive ways for reshaping data in R: long live pivot_longer() and pivot_wider()

This post was originally published here

TLDR: there are two new and very intuitive R functions for reshaping data: see Examples of pivot_longer() and pivot_wider() below. At the time of writing, these new functions are extremely fresh and only exist in the development version on GitHub (see Installation), we should probably wait for the tidyverse team to officially release them (in CRAN) before putting them into day-to-day use.

Exciting!

Introduction

The juxtapose of data collection vs data analysis: data that was very easy to collect, is probably very hard to analyse, and vice versa. For example, if data is collected/written down whichever format was most convenient at the time of data collection, it is probably not recorded in a regularly shaped table, with various bits of information in different parts of the document. And even if data is collected into a table, it is often intuitive (for data entry) to include information about the same variable in different columns. For example, look at this example data I just made up:

library(tidyverse)

candydata_raw = read_csv("2019-04-07_candy_preference_data.csv")
candy_type likes age: 5 likes age: 10 likes age: 15 gets age: 5 gets age: 10 gets age: 15
Chocolate 4 6 8 2 4 6
Lollipop 10 8 6 8 6 4

For each candy type, there are 8 columns with values. But actually, these 8 columns capture a combination of 3 variables: age, likes and eats. This is known as the wide format, and it is a convenient way to either note down or even present values. It is human-readable. For effective data analysis, however, we need data to be in the tidy data format, where each column is a single variable, and each row a single observation (https://www.jstatsoft.org/article/view/v059i10). It needs to be less human-readable and more computer-friendly.

Some of you may remember now retired reshape2::melt() or reshape2::dcast(), and many of you (inclduing myself!) have struggled remebering the arguments for tidyr::gather() and tidyr::spread(). Based on extensive community feedback, the tidyverse team have reinveted these functions using both more intuitive names, as well as clearer syntax (arguments):

Installation

These functions were added just a month ago, so these functions are not yet included in the standard version of tidyr that comes with install.packages("tidyverse") or even update.packages() (the current version of tidyr on CRAN is 0.8.3). To play with the bleeding edge versions of R packages, run install.packages("devtools") and then devtools::install_github("tidyverse/tidyr"). If you are a Mac user and it asks you “Do you want to install from sources the package which needs compilation?”, say Yes.

You might need to Restart R (Session menu at the top) and load library(tidyverse) again. You can check whether you now have these functions installed by typing in pivot_longer and pressing F1 – if a relevant Help tab pops open you got it.

Examples

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age"))
candy_type name value
Chocolate likes age: 5 4
Chocolate likes age: 10 6
Chocolate likes age: 15 8
Chocolate gets age: 5 2
Chocolate gets age: 10 4
Chocolate gets age: 15 6
Lollipop likes age: 5 10
Lollipop likes age: 10 8
Lollipop likes age: 15 6
Lollipop gets age: 5 8
Lollipop gets age: 10 6
Lollipop gets age: 15 4

Now, that’s already a lot better, but we still need to split the name column into the two different variables it really includes. “name” is what pivot_longer() calls this new column by default. Remember, each column is a single variable.

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age")) %>% 
  separate(name, into = c("questions", NA, "age"), convert = TRUE)
candy_type questions age value
Chocolate likes 5 4
Chocolate likes 10 6
Chocolate likes 15 8
Chocolate gets 5 2
Chocolate gets 10 4
Chocolate gets 15 6
Lollipop likes 5 10
Lollipop likes 10 8
Lollipop likes 15 6
Lollipop gets 5 8
Lollipop gets 10 6
Lollipop gets 15 4

And pivot_wider() can be used to do the reverse:

candydata = candydata_longer %>% 
  pivot_wider(names_from = questions, values_from = value)
candy_type age likes gets
Chocolate 5 4 2
Chocolate 10 6 4
Chocolate 15 8 6
Lollipop 5 10 8
Lollipop 10 8 6
Lollipop 15 6 4

It is important to spell out the arguments here (names_from =, values_frame =) since they are not the second and third arguments of pivot_wider() (like they were in spread()). Investigate the pivot_wider+F1 Help tab for more information.

Wrap-up and notes

Now these are datasets we can work with: each column is a variable, each row is an observation.

Do not start replacing working and tested instances of gather() or spread() in your existing R code with these new functions. That is neither efficient nor necessary – gather() and spread() will remain in tidyr to make sure people’s scripts don’t suddenly stop working. Meaning: tidyr is backward compatible. But after these functions are officially released, I will start using them in all new scripts I write.

I made the original messy columns still relatively nice to work with – no typos and reasonable delimiters. Usually, the labels are much worse and need the help of janitor::clean_names(), stringr::str_replace(), and multiple iterations of tidyr::separate() to arrive at a nice tidy tibble/data frame.

tidyr::separate() tips:

into = c("var1", NA, "var2") – now this is an amazing trick I only came across this week! This is a convenient way to drop useless (new) columns. Previously, I would have achieved the same result with:

... %>% 
    separate(..., into = c("var1", "drop", "var2")) %>% 
    select(-drop) %>% 
    ...
    

convert = TRUE: by default, separate() creates new variables that are also just “characters”. This means our age would have been a chacter vector of, e.g., “5”, “10”, rather than 5, 10, and R wouldn’t have known how to do arithmetic on it. In this example, convert = TRUE is equivalent to mutate(age = as.numeric(age)).

Good luck!

P.S. This is one of the coolest Tweets I’ve ever seen: