HealthyR Online: Lockdown Learning

With news of the lockdown in March came the dawning reality that we wouldn’t be able to deliver our usual 2.5-day HealthyR quick-start course in May.

The course is always over-subscribed so we were keen to find a solution rather than cancelling altogether.

HealthyR teaches the Notebook format which is already an online tool hosted by RStudio Cloud – so we knew that bit would work online. But what to do about getting attendees and tutors online, delivering lectures and offering interactive support with coding? Could we recreate our usual classroom environment online?

Never a group to shy away from a technical challenge, and with expertise in online education, we set about researching what online tools could be used.

After trying various options we went with Blackboard Collaborate to provide an online classroom, together with our usual RStudio Cloud to provide the Notebooks interface. Collaborate has a really nice feature of ‘break-out rooms’ where small groups can be assigned a separate online room with a tutor to work through exercises. The tutor can provide support and answer questions, using the screen share option to see exactly what each person might be having difficulty with.

After a few rehearsals to work out what roles to assign all our moderators and attendees, and how to send people to the breakout rooms and recall them to the main room, we were set!

Ahead of the course, attendees were emailed the usual pre-course materials and a login for their RStudio Cloud accounts, together with an invite to a Collaborate session for each of the 3 days. We split the 20 attendees who had confirmed attendance into groups of 5 and assigned one of our fantastic tutors to each group.

We also set up an extra breakout room with a dedicated tutor, which could be used by anyone needing specific one-to-one help.

After the ice-breaker, ‘What’s a new thing you’ve done since lockdown?’ – everything from macrame to margaritas plus tie-dyeing and a lot of baking – the course got underway with the first lecture.

One or two delegates had some problems with internet connections, and the assigning of breakout rooms took a bit of getting used to, but Riinu soon worked out an efficient system and the first coding exercises were underway!

We were delighted that the course received really positive feedback overall – none of us were sure this would work, but it did! The live coding sessions and pop quizzes were particularly popular.

We’ll definitely run HealthyR online again if the lockdown continues. Even after the lockdown, moving online widens access and offers the possibility for our international collaborators to join a course without having to travel.

Thank you to all our attendees, who quickly adapted to the online format, and to our amazing tutors, Tom, Kenny, Derek, Peter, Katie, Stephen, Michael and Ewen, who provided 3 days of their time to run the course, led, as ever, by Riinu.

Course Feedback

Collaborate and RStudio Cloud worked very well for me. The breakout rooms were a nice touch to allow discussions.

Very well set-up, particularly considering the challenges of online teaching! Collaborate and RStudio made the course very accessible. Also a fantastic ratio of tutors to pupils and very clear explanations of key concepts in ‘R’ language and stats!

Clear and easy instructions. Worked seamlessly!

Teaching materials fantastic. In particular I thought linear and logistic regressions were superbly well taught (as difficult to teach/understand). I think I’ve now understand these for the first time having wasted loads of time reading about them in the past!

This was a great course. I think in person would have allowed more interaction so I would still keep your original format available after this lockdown is over but well done on adapting and providing an excellent course.

Resources

https://healthyr.surgicalinformatics.org

All the HealthyR resources, including our new online book, are available for free on the HealthyR website.

Using codepen.io and google cloud to build a handy risk calculator.

If you’ve been watching the news or Twitter over the past week, you may have seen the appendicitis-related headlines about unnecessary operations being performed. The RIFT collaborative and Dmitri Nepogodiev have really spearheaded some cool work looking at who gets unnecessary operations. The article and coverage below are all well worth a read:

Original article:

https://bjssjournals.onlinelibrary.wiley.com/doi/10.1002/bjs.11440

(Selected news coverage):

https://www.theguardian.com/society/2019/dec/04/unnecessary-appendix-surgery-performed-on-thousands-in-uk

https://www.dailymail.co.uk/health/article-7750707/Thousands-young-British-women-needless-operations-remove-appendix.html

https://www.independent.co.uk/life-style/health-and-families/women-appendix-surgery-appendicitis-study-a9232146.html

So, when Dmitri asked if I could develop a web application for risk scoring to help identify those at low risk of appendicitis, I was very excited.

Having quite often used risk calculators in clinical practice, I started to write a list of what makes a good calculator and how to make one that can be used effectively. The most important were:

  • Easy to use
  • Works on any platform (as NHS IT has a wide variety of browsers!) and on mobile (some hospitals have great Wi-Fi through eduroam)
  • Can be quickly updated
  • Looks good and gives an intuitive result
  • Lightweight, requiring minimal processing power, so many users can use it simultaneously

Now we use a lot of R in surgical informatics, but Shiny wasn’t going to be the one for this as it’s not that mobile friendly and doesn’t necessarily work that smoothly on every browser (sorry Shiny!). Similarly, the computational footprint required to run Shiny is too heavy for this. So, using codepen.io and a pug HTML compiler, I wrote a mobile-friendly website (there are still a couple of tweaks I’d like to make so that it is entirely mobile friendly!).

Similarly, I get asked: why not an app? Well, app development requires developing on multiple platforms (Apple, Android, Blackberry) and apps can’t be used on those pesky NHS PCs. Furthermore, if something goes out of date or needs to be updated quickly, repairing it will take ages as updates sometimes have to be vetted by app stores etc.

My codepen.io for the calculator:

Codepen.io is a great development tool and allows you to combine and get inspired by other people’s work too!

I then set up a micro instance on Google Cloud, installed the pug compiler and apache2, selected a fixed IP, opened the HTTP port to the world and all done! (The setup is a little more involved than this, but it was straightforward!) The micro instance is very, very cheap, so it’s not expensive to run. The Birmingham crew then bought a lovely domain, appy-risk.org, for me to attach it to.

Here’s the obligatory increase in CPU usage since publication (slightly higher but, as you can tell, it’s quite light):

RStudio Server LAN party: Laptop+Router+Docker to serve RStudio offline

This post was originally published here

TLDR: You can teach R on people’s own laptops without having them install anything or requiring an internet connection.

Members of the Surgical Informatics team in Ghana, 2019. More information: surgicalinformatics.org


Introduction

Running R programming courses on people’s own laptops is a pain, especially as we use a lot of very useful extensions that actually make learning and using R much easier and more fun. But long installation instructions can be very off-putting for complete beginners, and people can be discouraged from learning programming if installation hurdles trigger their imposter syndrome.

We almost always run our courses in places with a good internet connection (it does not have to be super fast or flawless), so we get our students all set up on RStudio Server (hosted by us) or https://rstudio.cloud (a free service provided by RStudio!).
You connect to either of these options using a web browser, and even very old computers can handle this. That’s because the actual computations happen on the server and not on the student’s computer. So the computer just serves as a window to the training instance used.

Now, these options work really well as long as you have a stable internet connection. But for teaching R offline and on people’s own laptops, you either have to:

  1. Make sure everyone installs everything correctly before they attend the course
  2. Download all the software and extensions, put them on USB sticks and try to install them together at the start
  3. Start serving RStudio from your computer using a Local Area Network (LAN) created by a router

Now, we already discussed why the first option is problematic (gatekeeper for complete beginners). The second option – installing everything at the start together – means that you start the course with the most boring part. And since everyone’s computers are different (both by operating systems as well as different versions of the operating systems), this can take quite a while to sort. Therefore, cue option 3: an RStudio Server LAN party.

Requirements

  1. A computer with more than 4GB of RAM. macOS alone uses around 2-3GB just to keep going, and running the RStudio Server docker container was using another 3-4 GB, so you’ll definitely need more than 4GB in total.
  2. A network router. For a small number of participants, the same one you already have at home will work. Had to specify “network” here, as apparently, even my Google search for “router” suggests the power tool before network routers.
  3. Docker – free software, dead easy to install on macOS (search the internet for “download Docker”). Looks like installation on the Windows Home operating system might be trickier. If you are a Windows Home user who is using Docker, please do post a link to your favourite instructions in the comments below.
  4. Internet connection for setting up – to download RStudio’s docker image and install your extra packages.

My MacBook Pro serving RStudio to 10 other computers in Ghana, November 2019.

Set-up

Running RStudio using Docker is so simple you won’t believe me. It honestly is just a single-liner to be entered into your Terminal (Command Prompt on Windows):

docker run -d -p 8787:8787 -e ROOT=TRUE -e USER=user -e PASSWORD=password rocker/verse

This will automatically download a Docker image put together by the Rocker project. The one called verse includes all the tidyverse packages as well as publishing-related ones (R Markdown, Shiny, etc.). You can find a list of the different ones here: https://github.com/rocker-org/rocker

Then open a browser and go to localhost:8787 and you should be greeted with an RStudio Server login! (Localhost only works on a Mac or Linux; if using Windows, take a note of your IP address and use that instead of localhost.) More information and instructions can be found here: https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image

Tip: RStudio suggests port 8787, which is what I used for consistency, but if you set it up on port 80 you can omit the :80 from the address, as that’s the default anyway. So you can just go to localhost (or something like 127.0.0.1 if using Windows).
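
As a rough sketch of the port tip, the two shell functions below wrap the docker run one-liner so the host port can be chosen, and print the address to give to participants. The function names start_rstudio and rstudio_url are made up for this illustration, and the image name rocker/verse is an assumption based on the Rocker repository linked above.

```shell
#!/bin/sh
# Hypothetical helper: start the RStudio container on a chosen host port.
# The container side stays on 8787, which is where RStudio Server listens.
start_rstudio() {
  host_port="${1:-8787}"
  docker run -d -p "${host_port}:8787" \
    -e ROOT=TRUE -e USER=user -e PASSWORD=password \
    rocker/verse
}

# Hypothetical helper: print the address to type into a browser.
# Port 80 is the HTTP default, so it can be left off the URL.
rstudio_url() {
  host="$1"
  host_port="${2:-8787}"
  if [ "$host_port" = "80" ]; then
    printf 'http://%s\n' "$host"
  else
    printf 'http://%s:%s\n' "$host" "$host_port"
  fi
}
```

Running start_rstudio 80 and then rstudio_url localhost 80 would point you at http://localhost with no port suffix, while leaving the port argument out keeps the 8787 default used in the rest of this post.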

For those of you who have never seen or used RStudio Server, this is what it looks like:

RStudio Server is almost identical to RStudio Desktop. The main difference is the “Upload” button in the Files pane. This one is running in a Docker container, served at port 8787, and accessed using Safari (but any web browser will work).


The Docker single-liner above will create a single user with sudo rights (since I’ve included -e ROOT=TRUE). After logging into the instance, you can then add other users and copy the course materials to everyone using these scripts: https://github.com/einarpius/create_rstudio_users. Note that the instance is running Debian, so you’ll need very basic familiarity with managing file permissions on the command line. For example, you’ll need to make the scripts executable with chmod 700 create_users.sh.
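
If file permissions are new to you, here is a minimal sketch of what chmod 700 does. It creates a stand-in create_users.sh in a temporary folder (so it is not the real script from the repository above) and then checks the resulting permissions.

```shell
#!/bin/sh
# Stand-in for the real script, so this example can be run anywhere.
workdir="$(mktemp -d)"
script="$workdir/create_users.sh"
printf '#!/bin/sh\necho "creating users"\n' > "$script"

# 700 = read, write and execute for the owner; nothing for group or others.
chmod 700 "$script"

# The first ten characters of ls -l show the file type and permissions.
perms="$(ls -l "$script" | cut -c1-10)"
echo "$perms"   # prints -rwx------
```

The x in the owner slot is what lets you run the script directly, e.g. ./create_users.sh from the folder it sits in.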

Then connect to the same router you’ll be using for your LAN party, go to router settings and assign yourself a fixed IP address, e.g., 192.168.1.78. Once other people connect to the network created by this router (either by WiFi or cable), they need to type 192.168.1.78:8787 into any browser and can just start using RStudio. This will work as long as your computer is running Docker and you are all connected to the same router.

I had 10 people connected to my laptop and, most of the time, the strain on my CPU was negligible – around 10-20%. That’s because it was a course for complete beginners and they were mostly reading the instructions (included in the training Notebooks they were running R code in). So they weren’t actually hitting Run at the same time, and the tasks weren’t computationally heavy. When we did ask everyone to hit the “Knit to PDF” button all at the same time, it got a bit slower and my CPU was apparently working at 200%. But nothing crashed and everyone got their PDFs made.

Why are you calling it a LAN party?

My friends and I having a LAN party in Estonia, 2010. We would mostly play StarCraft or Civilization, or as pictured here - racing games to wind down at the end.


LAN stands for Local Area Network and in most cases means “devices connected to the same WiFi*”. You’ve probably used LANs lots in your life without even realising. One common example is printers: you know when a printer asks you to connect to the same network to be able to print your files? This usually means your computer and the printer will be in a LAN. If your printer accepted files via any internet connection, rather than just the same local network, then people around the world could submit stuff for your printer. Furthermore, if you have any smart devices in your home, they’ll be having a constant LAN party with each other.

The term “LAN party” means people coming together to play multiplayer computer games – as it will allow people to play in the same “world”, to either build things together or fight with each other. Good internet access has made LAN parties practically obsolete – people and their computers no longer have to physically be in the same location to play multiplayer games together. I use the term very loosely to refer to anything fun happening on the same network. And being able to use RStudio is definitely a party in my books anyway.

But LAN parties are still very much relevant for security reasons (e.g., the printer example) or for sharing resources in places without an excellent internet connection.

* Overall, most existing LANs operate via Ethernet cables (or “internet cables” as most people, including myself, refer to them). WiFi LAN or WLAN is a type of LAN. Have a look at your home router: it will probably have different lights for “internet” and “WLAN”/“wireless”. A LAN can also be connected to the internet, if the router itself is connected to the internet. That’s the main purpose of a router – to take the internet coming into your house via a single Ethernet cable, and share it with all your other devices. A LAN is usually just a nice side-effect of that.

Docker, containers, images

Docker image – a file bundling an operating system + programs and files
Docker container – a running image (it may be paused or stopped)

List of all your containers: docker ps -a (plain docker ps lists only running containers, so it leaves out the stopped ones)

List your images: docker images

Run a container using an image:

docker run -d -p 8787:8787 -e ROOT=TRUE -e USER=user -e PASSWORD=password rocker/verse

When you run rocker/verse for the first time it will be downloaded into your images. The next time it will be taken directly from there, rather than downloaded. So you’ll only need internet access once.

Stop an active container: docker stop container-name

Start it up again: docker start container-name

Save a container as an image (for versioning or passing on to other people):

docker commit container-name repository:tag

For example: docker commit rstudio-server rstudio/riinu:test1

Rename container (by default it will get a random label, I’d change it to rstudio-server):

docker rename happy_hippo rstudio-server

You can then start your container with: docker start rstudio-server
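
The commands in this section can be tied together in a small helper, sketched below under a couple of assumptions: the function name ensure_rstudio is made up for this example, and the image name rocker/verse is taken from the Rocker repository linked earlier. It also uses docker run’s --name flag, which names the container at creation time so the rename step is not needed.

```shell
#!/bin/sh
# Name used for the course container throughout this section.
CONTAINER="rstudio-server"

# Hypothetical helper: start the container if it already exists,
# otherwise create it from the image under the chosen name.
ensure_rstudio() {
  # docker ps -a lists all containers; --format prints just their names.
  if docker ps -a --format '{{.Names}}' | grep -qx "$CONTAINER"; then
    docker start "$CONTAINER"
  else
    docker run -d --name "$CONTAINER" -p 8787:8787 \
      -e ROOT=TRUE -e USER=user -e PASSWORD=password \
      rocker/verse
  fi
}
```

Calling ensure_rstudio on the morning of a course should then bring the same container back up, with any users and files added earlier still in place.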

HealthyR Ghana! Quick summary

These past two days were a new frontier for the HealthyR course, taking the number of continents we’ve run it in up to 2.

After the NIHR Unit on Global Surgery meeting, we travelled to Tamale, Ghana’s third largest city. The Wellcome Trust have kindly funded the development of the innovative, open-source HealthyR notebooks course. Spearheaded by Dr Riinu Ots, this course aims to provide an easy way for anyone in the world to learn R.

This is particularly powerful where resources are limited and there are plenty of questions that need to be answered. Enter Stephen Tabiri, Professor of Surgery at the University for Development Studies in Tamale. Stephen is a surgeon and has a large team of junior surgeons in training, nurses and other clinicians.

In an innovative twist, the course was held on a mix of laptops from the data centre and delegates’ own machines. Riinu had a brilliant solution that served an offline RStudio instance to delegates’ computers.

Day 1 quickly introduced some key concepts to the delegates, who quickly worked through the materials! After lunch a global surgery showcase event was held, which showcased the wide range of tools available to analyse data in R!

Day 2 kicked off nicely, completing the basics session and then going straight into everyone’s favourite session – Plotting! Here there were a lot of pleased delegates as they made complicated and colourful ggplots! People were making a lot of progress in what can sometimes be a challenging language to learn!

We finally closed on a logistic regression session delivered by Ewen Harrison, where people built their own models!

Throughout the course there were numerous people bringing laptops to install RStudio software on their own desktops. A very enthusiastic and keen bunch of data scientists!

Excitingly, members of the Ghana R community also attended, to offer support and discuss how best to provide a sustainable future for data science in Ghana.

Touch Down In Tamale!

The Surgical Informatics team arrive in Tamale, Ghana for the next HealthyR Notebooks course

The Surgical Informatics group is delighted to be visiting Tamale in Ghana to deliver our flagship HealthyR Notebooks course as part of our Wellcome Trust grant, ‘HealthyR Notebooks: Democratising open and reproducible data analysis in resource-poor environments’.

We’re being made extremely welcome by our hosts Professor Stephen Tabiri and Benard Ofori Appiah from the NIHR Global Health Research Unit on Global Surgery hub in Ghana.

Over the next few days we’ll be establishing a data centre in Ghana with the provision of 15 laptops and training 20 local delegates to use R for healthcare data analysis. This will build capacity for future data driven research in partnership with the NIHR Global Surgery Unit in Ghana.

Do you speak rlang?

Something for the more advanced R user! We’ll be back to our more exciting programming shortly (I hope!).

rlang? I already speak R

Quite right. rlang is part of the tidyverse side of things, so is probably more useful if you’re an advanced R user. It’s certainly not for the faint-hearted and needs a comprehensive understanding of how R ‘sees’ the code you write.

rlang is a low-level programming API for R which the tidyverse uses. ‘Low-level’ means it speaks to R in as R-like a way as possible, whereas a ‘high-level’ interface is more user-orientated and interpretable. It enables you to extend what the tidyverse can do and adapt it for your own uses. It’s particularly good to use if you’re doing lots of more ‘programming’ type R work, for example, building a package, making a complex Shiny app or writing functions. It might also be handy if you’re doing lots of big data manipulation and want to manipulate different datasets in the same way, for example.

Here’s an example of dynamically naming variables

In this example, say we have a tibble of variables, but we want to apply dynamic changes to it (so we feed R a variable that can change, either using another function like purrr::map or in a Shiny app). In this instance, specifying each variable and each different possible consequence using different logical functions would take forever and be very clunky. So we can use rlang to simply put a dynamic variable/object through the same function.

We make use of the curly-curly operator ({{ }}) too, which allows us to avoid the bulkier enquo() and !! syntax.

HealthyR Estonia Day 3

Well, what a great 3 days this has been! Again today, we gained extra people to join in HealthyR Notebooks – A formidable achievement for a statistics course!

We kicked off with a brilliant session by Ewen Harrison about survival analysis and time to event data, introducing new concepts and the R survival package. We then went into an amazing session by Riinu Ots, who showcased how to plot your data with real world, practical examples. This session really was brilliant, living up to Riinu’s catch-phrase ‘always plot your data’. This was followed by a short pop quiz, which all participants did brilliantly!

After a tasty lunch, we then continued into a new session: how to work with your data. This session is aimed at translating the learning of HealthyR straight to a real-life dataset of the participant’s choosing. Participants were guided through the practical application of R to their own data, giving them a springboard to produce some cool analyses after the course.

Following the final session we departed to Tallinn for flights back home tomorrow.

All in all, HealthyR notebooks were a success and very fun to teach. Estonia was well worth the trip and showed we could teach R to an international audience (even having fun at the same time!). Looking forward to developing the course further when the team go to Ghana later this year. Big thanks to Julius for organising the course and to the Wellcome Trust!

Tidying up edgeR and differential expression

For anyone who has done any differential expression from RNA array/ sequencing datasets in R, edgeR or DESeq2 are the packages to go to! They contain powerful functions and models which cater for most uses.

As with most bioinformatics type packages, they end up creating scripts with endless lines of code and square brackets aplenty. Now not only does this make scripts difficult to read, it’s difficult for R to handle in some aspects. Quite often these packages have custom objects etc. which makes changing formats and sending data into different packages a pain.

Well, if you do lots of differential expression, here are some wrappers for you! Taking inspiration from the FinalFit and tidyverse approach, I have written a few wrappers for differential expression analyses to make your code and life happier.

Please note – throughout this all I use count data as this is what edgeR likes, rather than FPKM/RPKM or RSEM.

Step 1 – Taking the first steps to expression happiness

So – first up, preparing and filtering your data. This function takes your data and any clinical/sample data, wraps it up into a DGEList object, then filters it. I like to filter based on proportions of lowly expressed transcripts, as purely filtering on arbitrary CPM values has its own issues, particularly if your read depth is low.

tidy_dge() is a function which does all of this! It combines two functions dge_bind() – which sets up your DGEList based on two data frames and tidy_gene_filter() which then filters these objects and calculates normalisation values.

We can then combine this into one function tidy_dge(), which shrinks about 10-20 lines into one.

Step 2 – Translating annotations

Next up – convert_gene(). Switching between gene annotation systems (i.e. ENSEMBL, HUGO, ENTREZ) is a pain. We all know that. The package biomaRt tries to solve this a bit, but still we’re left with somewhat unwieldy functions and lines of code trying to pick out where genes sit and what to map them back to. For this, I have devised convert_gene().

convert_gene() passes a request to biomaRt (the ENSEMBL API) and will convert any (common) gene name to any other gene annotation. It takes a little while to contact the servers, but is well worth it! Alternatively, you will be able to pass a saved biomaRt output to this, so you only need to make one pull, which speeds it up! If we are to do this, we should create three separate marts – one containing genes, one genes and GO codes, the other the GO codes and processes. Here are some functions to do this:

The reason for this: in testing, some of the marts have errors when being pulled with genes and GO codes all at once! The ENSEMBL biomaRt API is also somewhat unstable and frequently produces 404 errors and lots of downtime. Plus, we can then use the mart we created for genes in convert_gene().

Back to convert_gene()! It is designed to work with a glm object straight out of edgeR or with a data frame of counts, but will detect if it’s another type of file. Just be careful, as the default species is currently mouse:

GO – The final step to expression fulfillment

Finally – those GO analyses, how on earth do you then trace back and find which genes in which pathways are up-regulated? I’ve never been able to find a nice function to do this.

Fear not – we have a new one! gene_GO_explorer(). This is very handy indeed. It basically will take your glm object from edgeR and map the genes to GO processes and codes.


There are some arguments wrapped up in this function for dissecting pathways out too: go_grep, go_code, go_pathway and go_components. These allow you to supply a list of GO processes or GO codes, or to search the GO annotations for pathways of interest. For example, if your GO enrichment shows pathway activity for a specific GO pathway, you can input the GO code into go_code to get the genes involved in that pathway. For searching GO pathway annotations, we can use go_grep. Here, if we set go_grep to ‘stress’, gene_GO_explorer() will search and return any pathways with the term ‘stress’ in their description.

If genes are part of multiple pathways that are included in our search, the GO terms will be collapsed down together – separated by a semicolon. This data can be displayed in a ‘long’ format if required by specifying long = TRUE.

Note – unlike goana, gene_GO_explorer() does not mind what type of gene annotation is supplied! I use convert_gene() to prepare my glms for goana then pop it through gene_GO_explorer().

If you’re not using a glm from edgeR – switch off the adjusted p-value bit of gene_GO_explorer() using adjust_p = FALSE, or set the column with your p-values to ‘PValues’.

I’ll be developing more of this sort of thing, in addition to some nicer wrappers for preparing your sequencing data straight from FASTQ formats, making user readable edgeR models and then querying protein interaction databases too! Comments/ thoughts appreciated if people find this kind of stuff useful. Considering making a package. FinalOmics anyone?

Reference Management – An Efficient Setup for writing a Thesis

This blog is intended for researchers, PhD students, MD students and any other students who wish to have a robust and effective reference management setup. The blog has a particular focus on those using R markdown, Bookdown or LaTeX. Parts of the blog can also help set up Zotero for use with Microsoft Word. The blog has been designed to help achieve the following goals:

  • Effective citation storage
    • Fast and easy citation storage (one-click from Chrome)
    • Fast and easy PDF storage using cloud storage
    • Immediate, automatic and standardised PDF renaming
    • Immediate, automatic and standardised citation key generation
  • Effective citation integration with markdown etc.
    • Generation of citation keys which work with LaTeX and md (no non-standard characters)
    • Ability to lock citation keys so that they don’t update with Zotero updates
    • Storage of immediately updated .bib files for use with Rmd, Bookdown and LaTeX
    • Automated update of the .bib file in RStudio server

Downloads and Setup

For my current reference management setup I need the following software:

  • Zotero
    • Zotero comes with 300MB of free storage which allows well over 1000 references to be stored as long as PDFs are stored separately
    • From the same download page download the Chrome connector to enable the “save to zotero” function in Google Chrome
  • ZotFile
    • ZotFile is a Zotero plugin which helps with PDF management. Download the .xpi file, then open Zotero, go to “Tools → Add-Ons”, click the little cog in the top right corner and navigate to the file to install it (Figure 1)
  • Better BibTeX
    • Better BibTeX is a plugin to help generate citation keys which will be essential for writing articles in LaTeX, R Markdown or Bookdown
    • If the link doesn’t work go to github and scroll down to the ReadMe to find a link to download the .xpi file
    • The same approach is then used to install the Better BibTeX plugin for Zotero (“Tools → Add-Ons”)

After downloading Zotero, ZotFile and Better BibTeX, create an account on Zotero online.

In addition to the Zotero downloads this guide will focus on an efficient setup for writing with R markdown or Bookdown and assumes that you have access to the following software / accounts:

  • Dropbox / Google Drive / other cloud storage service which allows APIs
    • It will also be necessary for these to be accessible using Windows Explorer or Mac Finder (there are many guides online for syncing Google Drive and Dropbox so that they appear in file explorers)
  • RStudio (this is not 100% essential but it is far harder to use Rmd without it)
    • Packages which will be required for this setup include rdrop2 (if using dropbox, other packages are available to convert this setup to Google Drive etc.), encryptr, bookdown or Rmarkdown, tinytex and a LaTeX installation (the Bookdown author recommends using tinytex which can be installed by the similarly named R package: tinytex::install_tinytex())

Folder Setup

When using Zotero it is a good idea to create a folder in which you will store PDFs retrieved from articles. Ultimately it is optional whether or not PDFs are stored, but if you have access to cloud storage with a good quota then it can make writing in Rmd etc. much faster as there is no requirement to search online for the original PDF. This folder should be set up in Google Drive, Dropbox or another cloud storage service which can be accessed from your own computer through the file explorer.

A second folder may be useful to store bibliographies which will be generated for specific projects or submissions. Again this folder should be made available in cloud storage.

ZotFile Preferences

To set up Zotero so that retrieved PDFs are automatically stored and renamed in the cloud storage without consuming the Zotero storage quota, go to “Tools → ZotFile Preferences” and, on the first tab (General Settings), set the folder and subfolder naming strategy for PDFs. I have set the location of the files to a Custom location and in this case used the path to a Google Drive folder (~\Google Drive\Zotero PDF Library). ZotFile will also store retrieved PDFs in subfolders to help with finding PDFs at a later date. The current setup I use is to create a subfolder named after the first author’s surname, using \%a in the subfolder field, so that all papers whose first author shares that surname are stored together (Figure 2). Other alternatives are to store PDFs in subfolders using year (\%y); journal or publisher (\%w); or item type (\%T).

Next, the Renaming Rules tab can be configured to give each file a sensible name (this is essential if PDFs are not to be stored as meaningless random strings of characters). In this tab I have set the format to: {%a_}{%y_}{%t} which names the PDFs in the format: Fairfield_2019_Gallstone_Disease_and_the_Risk_of_Cardiovascular_Disease.pdf. I find that this shows the author, year and first word of the title without needing to expand the file name (Figure 3).
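To make the pattern concrete, here is a small R sketch that mimics how {%a_}{%y_}{%t} assembles a file name from author, year and title. It is an illustration only, not ZotFile’s own code: the function name zotfile_style_name() is made up, and ZotFile’s real handling of long titles and special characters depends on its settings.

```r
# Illustrative only: imitate the {%a_}{%y_}{%t} naming pattern
zotfile_style_name <- function(author, year, title) {
  # Drop punctuation, then join the title words with underscores
  clean_title <- gsub("[^[:alnum:] ]", "", title)
  clean_title <- gsub(" +", "_", trimws(clean_title))
  paste0(author, "_", year, "_", clean_title, ".pdf")
}

zotfile_style_name(
  "Fairfield", 2019,
  "Gallstone Disease and the Risk of Cardiovascular Disease"
)
```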

I have not changed any of the default settings in either the Tablet Settings or Advanced Settings tabs, apart from choosing to remove special characters in the Advanced Settings (this stops things from breaking later).

General Zotero Settings

Zotero has several configurable settings (accessed through: “Edit → Preferences”) and I have either adopted the defaults or made changes as follows:

General:

  • I have ticked the following:
    • Automatically attach associated PDFs
    • Automatically retrieve metadata for PDFs
    • Automatically rename attachments using parent metadata
    • Automatically tag items with keywords and subject headings
    • All options in Group section
  • I have left the following unticked:
    • Automatically take snapshots
    • Rename linked files

Sync:

  • Enter the account details
  • Tick sync automatically
  • Untick sync full text (if you choose to save PDFs then syncing full text will quickly consume the 300MB quota)

Search:

  • Left unchanged

Export:

  • Left unchanged

Cite:

  • There are several sensible defaults, but if there is a new citation style you wish to use (in Microsoft Word, for example) then click “Get additional styles”, as the style you need has probably already been created. You can click the “+” button to add a style from a .csl file if you already have one. Finally, if you are desperate for a style that doesn’t already exist then you can select a citation style, click Style Editor and edit the raw .csl file.
  • In the Word Processors subtab (on the main Cite tab), you can install the Microsoft Word add-in to allow Zotero to work in Microsoft Word.

Advanced:

  • I changed nothing on the General subtab
  • In the Files and Folders subtab I have set the path to the base directory for attachments
  • I have not changed the Shortcuts subtab
  • I have not changed the Feeds subtab

Better BibTeX:

  • In this section I have set my Citation Key format to [auth:lower:alphanum]_[year:alphanum]_[veryshorttitle:lower:alphanum]_[journal:lower:clean:alphanum] (Figure 4). This generates a citation key for each reference in the format of fairfield_2019_gallstones_scientificreports or harrison_2012_hospital_bmj. It always takes the first author’s surname, the year, the first word of the title and the journal abbreviation if known. The clean and alphanum filters remove unwanted punctuation which can cause citations to fail in LaTeX.
Figure 4: Better BibTeX Citation Key
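The effect of this pattern can be imitated in a few lines of R. This is only a sketch of the idea: Better BibTeX’s own rules are more elaborate, and the function names, the short stop-word list and the example title below are all assumptions made for illustration.

```r
# Lower-case a string and strip everything that is not a letter or digit
alnum_lower <- function(x) tolower(gsub("[^[:alnum:]]", "", x))

# Illustrative imitation of the key pattern: author_year_firstword_journal
sketch_citekey <- function(author, year, title, journal) {
  # Assumed stop words to skip when picking the first title word
  stop_words <- c("a", "an", "the", "of", "on", "in", "and")
  words <- strsplit(title, "[[:space:]]+")[[1]]
  words <- words[!tolower(words) %in% stop_words]
  paste(alnum_lower(author), year, alnum_lower(words[1]),
        alnum_lower(journal), sep = "_")
}

# Made-up title, chosen to reproduce one of the keys quoted above
sketch_citekey("Harrison", 2012, "Hospital volume and outcomes", "BMJ")
```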

Once the settings have been configured, if you already had references stored in Zotero and wish to change the citation keys for old references, select your entire library root (above all folders), select all references, right click and use “Better BibTeX → Refresh BibTeX Key”. All of the citation keys should then be updated.

Creating a .bib file

For referencing in a new project, publication or submission it may be helpful to have a dynamic .bib file that updates with every new publication and can be accessed from any device through cloud storage.

To set up a .bib file, first find the folder that you wish to create the file from (this should be the folder which contains any citations you will use and ideally not the full library, to cut down on unnecessary storage and syncing requirements). Note that with default settings the .bib file is generated only from citations stored directly in the folder, not from those in its subfolders. I find subfolders particularly helpful for organising citations, so I have changed the setting so that folders also show any citations stored in subfolders. To make this change go to “Edit → Preferences”, select the “Advanced” tab and at the bottom of the “General” subtab select “Config Editor”. This will bring up a searchable list of configurations (it may show a warning message first). Search for “extensions.zotero.recursiveCollections” and set “Value” to TRUE. When you then click a folder you should see all of the citations stored in its subfolders as well.

Right click the folder and select “Export Collection”. In the pop-up window that appears, select “Keep Updated” and, if using RStudio desktop, save the file in the directory where you have your Rmd project files. If you are working with RStudio server, save the file in a cloud storage location which can then be accessed from the server. I have a .bib file stored in Dropbox which I access from RStudio server.

Linking Dropbox and RStudio Server to Access the .bib File

The following covers linking Dropbox to RStudio server but could be adapted to cover another cloud storage service.

Dropbox provides a token to allow communication between different apps. The rdrop2 package is what I used to create a token to allow this. I actually created the token on RStudio desktop as I couldn’t get the creation to work on the server but this is perfectly ok.

Caution: The token generated by this process could be used to access your Dropbox from anywhere using RStudio if you do not keep it secure. If somebody were to access an unencrypted token then it would be equivalent to handing out your email and password. I therefore used the encryptr package to allow safe storage of this token.

Token Creation

Open RStudio desktop and enter the following code:
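A minimal sketch of the token creation step, assuming the default key names produced by genkeys() (id_rsa and id_rsa.pub) and a token file called droptoken.rds. The wrapper function create_encrypted_dropbox_token() is a made-up name for this example. It is meant to be run interactively, since drop_auth() opens a browser window and genkeys() prompts for a password.

```r
# Sketch only: run interactively in RStudio desktop
create_encrypted_dropbox_token <- function(token_file = "droptoken.rds") {
  # Authorise rdrop2 with Dropbox; with default caching this also
  # writes a .httr-oauth file to the working directory
  token <- rdrop2::drop_auth()
  saveRDS(token, token_file)

  # Prompts for a password and, by default, writes id_rsa and id_rsa.pub
  encryptr::genkeys()

  # Encrypt each unencrypted file with the public key (id_rsa.pub)
  plain <- c(token_file, ".httr-oauth")
  plain <- plain[file.exists(plain)]
  for (f in plain) {
    encryptr::encrypt_file(f, crypt_file_name = paste0(f, ".encryptr.bin"))
  }

  # Delete the unencrypted copies so only encrypted files remain
  file.remove(plain)
  invisible(paste0(plain, ".encryptr.bin"))
}

if (interactive()) {
  create_encrypted_dropbox_token()
}
```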

The code will create two files, a token and the .httr-oauth file from which a token can also be made. The encryptr package can then encrypt the files using a public / private key pair. It is essential to remember the password that is set when using genkeys(), otherwise the token cannot be used. If the password is lost, the original token can’t be retrieved, but a new one can be created from scratch.

The following files then need to be uploaded to RStudio server:

  • droptoken.rds.encryptr.bin – or the name provided for the encrypted Dropbox token
  • id_rsa – or the name provided for the private key from the private / public key pair

Dropbox Linkage for Referencing the .bib File

Now that the encrypted token and the necessary (password-protected) private key are available in RStudio server, the following can be saved as a separate script. The script is designed to read in and decrypt the encrypted token (this requires the password and should be done whenever the .bib file needs to be updated). Only the drop_download() call needs to be repeated if using the token again during the same session. The token should be cleared at the end of every session for additional security.
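A minimal sketch of such a script, assuming the encrypted token and private key keep the file names listed above. The function name load_dropbox_token(), the temporary file name and the Dropbox path are placeholders made up for this example; replace the path with the location of your own exported .bib file.

```r
# Decrypt the token to a temporary file, read it, then delete the
# temporary file so no unencrypted copy is left on the server
load_dropbox_token <- function(encrypted_token = "droptoken.rds.encryptr.bin",
                               private_key = "id_rsa",
                               tmp_file = "temporary_droptoken.rds") {
  on.exit(unlink(tmp_file), add = TRUE)
  # Prompts for the password set when the key pair was generated
  encryptr::decrypt_file(encrypted_token, file_name = tmp_file,
                         private_key_path = private_key)
  readRDS(tmp_file)
}

if (interactive()) {
  token <- load_dropbox_token()

  # Repeat only this call to refresh the .bib file later in the session.
  # "Zotero Library/my.bib" is a placeholder Dropbox path.
  rdrop2::drop_download(path = "Zotero Library/my.bib",
                        local_path = "my.bib",
                        overwrite = TRUE,
                        dtoken = token)

  # Clear the token at the end of the session
  rm(token)
}
```

Keeping the token in a variable rather than on disk means the password is only needed once per session, while rm(token) removes it when you are finished.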

Now that the .bib file has been created and is stored as “my.bib” in the local directory, it should update whenever the token is loaded and drop_download() is run.

Final Result

On clicking the “Save to Zotero” button in Chrome and running drop_download(), the following should all happen almost instantaneously:

  • Zotero stores a new reference
  • A PDF is stored in the cloud storage having been named appropriately
  • A link to the PDF is stored in Zotero (without using up the Zotero storage quota)
  • A citation key is established for the reference in a standardised format without conflicts
  • Pre-existing citation keys which have been referenced earlier in the writing of the paper are not altered
  • A .bib file is updated in the RStudio server directory
  • And much unwanted frustration of reference management is resolved

This is my current reference management system which I have so far found to be very effective. If there are ways you think it can be improved I would love to hear about them.