If data were an animal, what would it be? 

Cath Montgomery, Medical Sociologist, writes about our recent exploration into pottery and data animals.

Often we think about data as an object: inert, manipulable, and something we control. We collect it, harvest it, scrape it, clean it, curate it, store it, share it, analyse it and display it. In these endeavours, we think about human agency and the work that we – as clinicians, statisticians, data scientists, sociologists – do to make sense of the world through quantified means. But what about the data themselves? What do they do? And what would it mean to give them agency? This is something that Science & Technology Studies scholars do routinely to underscore the ways in which the material world interacts with humans to create societal order. As a fun and playful way to think about some of the features of data that we identify with or relate to, we can ask, “If data were an animal, what would it be?”

After a full day and a half of talking about data at the strategy away day, it was time for people to get their hands dirty at Doodles pottery for some ‘team-building’. What better occasion to paint our own data animals! Everyone chose an item to paint: a mug, a jug, a bowl, and got to work dabbing, splatting, etching, and painting their designs. Creative activities like pottery painting are said to be good for team building because they nurture trust between colleagues; usually, everyone starts with minimal expertise, which is a good leveller, and everyone makes themselves a little bit vulnerable by putting their creations out into the world. This kind of activity also helps people get in touch with their inner artist and the parts of their brain responsible for creativity, imagination and intuition. This is the birthplace of data animals!

If the description of data as inert, manipulable, something we control were sufficient, we might have seen a lot of domestic data animals – cats and dogs, rabbits and rodents. Of these, there were none. Instead, we had a zebra and a giraffe, centipedes and dragonflies, frogs, foxes, owls, a death butterfly and a skull. Certainly, it seems that data are not tame in this group’s collective imagination! 

So what did our data animals have to say about data? Riinu’s rainbow zebra shows the importance of reading between the lines; data analysis is not black and white and datasets are diverse, represented by the zebra’s rainbow stripes. Sarah’s giraffe represents the ability to use data for utilising resources that would otherwise be difficult to access (it’s also an animal in long-format). George’s frog follows an r-selection breeding strategy, otherwise known as an ‘r-strategist’: “this narrative is inspired by my approach to model selection – generating as many as one can sensibly think of and then whittling them down using natural selection/data metric driven selection”. Liz’s centipedes represent lots of quick-moving arms but overall, somewhat slow going; Annemarie’s death butterfly is superficially elegant and beautiful, but must be treated with respect as it can be deadly if provoked or used badly. Ewen’s “ripped off owl jug” embodies imitation as the sincerest form of flattery: in data science, it is best to build on what has already been a success. Cath’s barn owl is a flash of light in the dark, but also eats other data animals for breakfast (sociologists of science and medicine can be a critical bunch). Ian’s animal is deceased and only the skull remains: “being the oldest member of the group I have datasets dead and buried all over Scotland…but a little bit of “data mining” might resurrect some of them?”

So: from sex and death to work and the constant striving for resources, social benefit and success, the data animals have it all. It would be disingenuous to suggest that the explanations we wove to account for our creations preceded the act of painting them; nonetheless, the stories we tell about data are an important way in which we relate to the world and the work that we do to make sense of it through research. 

Making a Research Focus Wordcloud

Is it better to have a narrow or broad research focus? There are obviously pros/cons to both options (and arguably these aren’t mutually exclusive!), but it’s certainly an interesting thought posed in a recent tweet from @dnepo.

While I’m sure we all have a vague idea of where we sit on that spectrum of broad-narrow focus, there’s nothing like a bit of objective data (like a word cloud) to help us understand this better! While there are some online tools out there, R can make getting, cleaning, and displaying this data very easy and reproducible.

We will aim to cut down on the work required in collecting all your publication data by using Google Scholar – if you don’t have an account already, make one!

Firstly, we need 3 packages to achieve this:

  1. scholar: to download publications associated with your Google Scholar account.
  2. tidyverse: to clean and wrangle your publication data into the required format.
  3. wordcloud2: to generate a pretty wordcloud of your publication titles.
# install.packages(c("scholar", "wordcloud2"))
library(tidyverse); library(scholar); library(wordcloud2)

Secondly, we need to provide specific information to R to allow it to do the task.

  1. We need to get our Google Scholar ID from our account (look at the URL) to tell R where to download from (we’ll use mine as an example, but anyone’s can be used here).
  2. We want to tell R which words we can ignore because they’re just filler words or irrelevant (e.g. we don’t care how many times titles have “and” in them!). This is optional, but recommended!
gscholarid <- "MfGBD3EAAAAJ" # Kenneth McLean
remove <- c("and", "a","or", "in", "of", "on","an", "to", "the", "for", "with")

Finally, we can generate our word cloud! The code below is generic, so works for anyone so long as you supply the Google Scholar ID (“gscholarid”) and filler words (remove).

# Download dataframe of publications from Google Scholar
scholar::get_publications(id = gscholarid) %>%
  tibble::as_tibble() %>%
  # Do some basic cleaning of paper titles
  dplyr::mutate(title = stringr::str_to_lower(title),
                title = stringr::str_replace_all(title, ":|,|;|\\?", " "),
                title = stringr::str_remove_all(title, "\\(|\\)"),
                title = stringr::str_remove_all(title, "…"),
                title = stringr::str_remove_all(title, "\\."),
                title = stringr::str_squish(title)) %>%
  # Combine all text together then separate by spaces (" ")
  dplyr::summarise(word = paste(title, collapse = " ")) %>%
  tidyr::separate_rows(word, sep = " ") %>%
  # Count each unique word
  dplyr::group_by(word) %>%
  dplyr::summarise(freq = n()) %>%
  # Remove common filler words
  dplyr::filter(! (word %in% remove)) %>%
  # Put into descending order
  dplyr::arrange(-freq) %>%
  # Draw the word cloud (wordcloud2 expects columns named word and freq)
  wordcloud2::wordcloud2()

And here we go! I think it’s safe to say I’m surgically focussed, but with quite a lot of different topics under that umbrella! Why not run the code here and figure out how your publications break down?

World map using the tidyverse (ggplot2) and an equal-area projection

This post was originally published here

There are several different ways to make maps in R, and I always have to look it up and figure this out again from previous examples that I’ve used. Today I had another look at what’s currently possible and what’s an easy way of making a world map in ggplot2 that doesn’t require fetching data from various places.
TLDR: Copy this code to plot a world map using the tidyverse:

Reshaping multiple variables into tidy data (wide to long)

This post was originally published here

There’s some explanation on what reshaping data in R means, why we do it, as well as the history, e.g., melt() vs gather() vs pivot_longer() in a previous post: New intuitive ways for reshaping data in R
That post shows how to reshape a single variable that had been recorded/entered across multiple different columns. But if multiple different variables are recorded over multiple different columns, then this is what you might want to do:
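As a rough illustration of that situation (not the code from the original post), here is a minimal sketch using tidyr::pivot_longer(). The dataset is made up: weight and height are hypothetical variables, each recorded at two visits, and the underscore separator in the column names is an assumption.

```r
library(tidyr)

# Hypothetical wide data: two variables, each recorded at two visits
wide <- data.frame(
  id       = c(1, 2),
  weight_1 = c(60, 72),
  weight_2 = c(61, 70),
  height_1 = c(1.60, 1.80),
  height_2 = c(1.61, 1.80)
)

# ".value" tells pivot_longer() that the first part of each column name
# (weight/height) becomes its own column; the second part goes into "visit"
long <- pivot_longer(
  wide,
  cols      = -id,
  names_to  = c(".value", "visit"),
  names_sep = "_"
)

print(long)
```

The result has one row per id and visit, with weight and height as separate columns, which is usually the shape that plotting and modelling functions want.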

Setting up a simple one page website using Nicepage and Netlify

This post was originally published here

I’ve just set up a single page website (= online business card) for myself and my husband: https://pius.cloud/ . This post summarises what I did. If you’re looking to get started with something super quickly, then only the first two steps are essential (Creating a website and Serving a website).
Creating a website (using Nicepage) I’ve created websites using various tools such as straight up HTML, WordPress, Hugo+blogdown (this site – riinu.

HealthyR Online: Lockdown Learning

With news of the lockdown in March came the dawning reality that we wouldn’t be able to deliver our usual HealthyR 2.5 day quick start course in May.

The course is always over-subscribed so we were keen to find a solution rather than cancelling altogether.

HealthyR teaches the Notebook format which is already an online tool hosted by RStudio Cloud – so we knew that bit would work online. But what to do about getting attendees and tutors online, delivering lectures and offering interactive support with coding? Could we recreate our usual classroom environment online?

Never a group to shy away from a technical challenge, and with expertise in online education, we set about researching what online tools could be used.

After trying various options we went with Blackboard Collaborate to provide an online classroom, together with our usual RStudio Cloud to provide the Notebooks interface. Collaborate has a really nice feature of ‘break-out rooms’ where small groups can be assigned a separate online room with a tutor to work through exercises. The tutor can provide support and answer questions, using the screen share option to see exactly what each person might be having difficulty with.

After a few rehearsals to work out what roles to assign all our moderators and attendees, and how to send people to the breakout rooms and recall them to the main room, we were set!

Ahead of the course, attendees were emailed the usual pre-course materials and a log in for their RStudio Cloud accounts, together with an invite to a Collaborate session for each of the 3 days. We split the 20 attendees who had confirmed attendance into groups of 5 and assigned one of our fantastic tutors to each group.

We also set up an extra breakout room with a dedicated tutor which could be used for anyone needing specific one-to-one help.

After the ice-breaker, ‘What’s a new thing you’ve done since lockdown?’ – everything from macrame to margaritas plus tie-dying and a lot of baking – the course got underway with the first lecture.

One or two delegates had some problems with internet connections, and the assigning of breakout rooms took a bit of getting used to, but Riinu soon worked out an efficient system and the first coding exercises were underway!

We were delighted that the course received really positive feedback overall – none of us were sure this would work, but it did! The live coding sessions and pop quizzes were particularly popular.

We’ll definitely run HealthyR online again if the lockdown continues. Even after the lockdown, moving online widens access and offers the possibility for our international collaborators to join a course without having to travel.

Thank you to all our attendees who quickly adapted to the online format and to our amazing tutors, Tom, Kenny, Derek, Peter, Katie, Stephen, Michael and Ewen, who provided 3 days of their time to run the course, led as ever, by Riinu.

Course Feedback

Collaborate and RStudio Cloud worked very well for me. The breakout rooms were a nice touch to allow discussions.

Very well set-up, particularly considering the challenges of online teaching! Collaborate and RStudio made the course very accessible. Also a fantastic ratio of tutors to pupils and very clear explanations of key concepts in ’R’ language and stats!

Clear and easy instructions. Worked seamlessly!

Teaching materials fantastic. In particular I thought linear and logistic regressions were superbly well taught (as difficult to teach/understand). I think I’ve now understand these for the first time having wasted loads of time reading about them in the past!

This was a great course. I think in person would have allowed more interaction so I would still keep your original format available after this lockdown is over but well done on adapting and providing an excellent course.



All the HealthyR resources, including our new online book, are available for free on the HealthyR website

R: filtering with NA values

This post was originally published here

NA – Not Available/Not applicable is R’s way of denoting empty or missing values. When doing comparisons – such as equal to, greater than, etc. – extra care and thought needs to go into how missing values (NAs) are handled. More explanations about this can be found in the Chapter 2: R basics of our book that is freely available at the HealthyR website
This post lists a couple of different ways of keeping or discarding rows based on how important the variables with missing values are to you.
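To make this concrete, here is a small base R sketch (an illustration with a made-up patients data frame, not the code from the original post) showing what a comparison does with NA and two ways of deciding what happens to the rows with missing values.

```r
# Hypothetical data: one patient has a missing age
patients <- data.frame(
  id  = 1:4,
  age = c(34, NA, 61, 58)
)

# Any comparison involving NA returns NA rather than TRUE or FALSE
print(patients$age > 50)

# Discard rows with missing age: require a known age AND age > 50
over50_known <- patients[!is.na(patients$age) & patients$age > 50, ]

# Keep rows with missing age: age > 50 OR age unknown
over50_or_missing <- patients[is.na(patients$age) | patients$age > 50, ]

print(over50_known)
print(over50_or_missing)
```

Note that dplyr::filter() drops rows where the condition evaluates to NA, so if the missing rows matter to you, the is.na() condition has to be spelled out explicitly.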

Using codepen.io and google cloud to build a handy risk calculator.

If you’ve been watching the news or twitter over the past week, you may have seen the appendicitis-related headlines about unnecessary operations being performed. The RIFT collaborative and Dmitri Nepogodiev have really spearheaded some cool work looking at who gets unnecessary operations, which are all well worth a read:

Original article:


(Selected news coverage):




So, when Dmitri asked if I could develop a web application for risk scoring to help identify those at low risk of appendicitis, I was very excited.

Having quite often used risk calculators in clinical practice, I started to write a list of what makes a good calculator and how to make one that can be used effectively. The most important were:

  • Easy to use
  • Works on any platform (as NHS IT has a wide variety of browsers!) and on mobile (some hospitals have great Wi-Fi through eduroam)
  • Can be quickly updated
  • Looks good and gives an intuitive result
  • Lightweight, requiring minimal processing power, so many users can use it simultaneously

Now we use a lot of R in surgical informatics, but Shiny wasn’t going to be the one for this as it’s not that mobile friendly and doesn’t necessarily work on every browser that smoothly (sorry Shiny!). Similarly, the computational footprint required to run Shiny is too heavy for this. So, using codepen.io and a pug html compiler, I wrote a mobile friendly website (still a couple of tweaks I’d like to make to make it entirely mobile friendly!).

Similarly, I get asked why not an app? Well app development requires developing on multiple platforms (Apple, Android, Blackberry) and can’t be used on those pesky NHS PCs. Furthermore, if something goes out of date or needs to be updated quickly – repairing it will take ages as updates sometimes have to be vetted by app stores etc.

My codepen.io for the calculator:

Codepen.io is a great development tool and allows you to combine and get inspired by other people’s work too!

I then set up a micro instance on google cloud, installed the pug compiler and apache2, selected a fixed IP and opened the HTTP port to the world and all done! (this set up is a little more involved than this but was straightforward!). The micro instance is very very cheap so it’s not expensive to run. The Birmingham crew then bought a lovely domain appy-risk.org for me to attach it to.

Here’s the obligatory increase in CPU usage since publication (slightly higher but, as you can tell, it’s quite light):

More Fun with Regression:

Confounding, interaction and random effects

The following blog post provides a general overview of some of the terms encountered when carrying out logistic regression and was inspired by attending the extremely informative HealthyR+: Practical logistic regression course at the University of Edinburgh.

  • Confounding
    • What is confounding?
    • Examples
  • Interaction
    • What are interaction effects?
    • Example
    • How do we detect interactions?
    • What happens if we overlook interactions?
    • Terminology
  • Random effects
    • Clustered data
    • Why should we be aware of clustered data?
    • A solution to clustering
    • Terminology
  • Brief summary


Confounding

What is confounding?

Confounding occurs when the association between an explanatory (exposure) and outcome variable is distorted, or confused, because another variable is independently associated with both. 

The timeline of events must also be considered, because a variable cannot be described as confounding if it occurs after (and is directly related to) the explanatory variable of interest.  Instead it is sometimes called a mediating variable as it is located along the causal pathway, between explanatory and outcome.


Examples

Potential confounders often encountered in healthcare data include, for example, age, sex, smoking status, BMI, frailty and disease severity. One of the ways these variables can be controlled is by including them in regression models.

In the Stanford marshmallow experiment, a potential confounder was left out – economic background – leading to an overestimate of the influence of a child’s willpower on their future life outcomes.

Another example includes the alleged link between coffee drinking and lung cancer. More smokers than non-smokers are coffee drinkers, so if smoking is not accounted for in a model examining coffee drinking habits, the results are likely to be confounded.
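The coffee and smoking example can be sketched with simulated data. Everything below is made up for illustration: the sample size, the probabilities and the variable names are assumptions, and in this simulation coffee has no effect on cancer at all, so any association in the unadjusted model comes purely from smoking.

```r
set.seed(42)
n <- 20000

# Simulated (not real) data: smoking raises both the chance of drinking
# coffee and the risk of cancer; coffee itself does nothing here
smoker <- rbinom(n, 1, 0.3)
coffee <- rbinom(n, 1, ifelse(smoker == 1, 0.8, 0.4))
cancer <- rbinom(n, 1, plogis(-3 + 2 * smoker))

# Unadjusted model: coffee appears to be associated with cancer
crude <- glm(cancer ~ coffee, family = binomial)

# Adjusted model: including the confounder in the regression
adjusted <- glm(cancer ~ coffee + smoker, family = binomial)

# Compare the odds ratios for coffee from the two models
crude_or    <- unname(exp(coef(crude)["coffee"]))
adjusted_or <- unname(exp(coef(adjusted)["coffee"]))
print(round(c(crude = crude_or, adjusted = adjusted_or), 2))
```

The odds ratio for coffee should sit well above 1 in the unadjusted model and move close to 1 once smoking is included, which is the distortion that confounding describes.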


Interaction

What are interaction effects?

In a previous blog post, we looked at how collinearity is used to describe the relationship between two very similar explanatory variables. We can think of this as an extreme case of confounding, almost like entering the same variable into our model twice. An interaction, on the other hand, occurs when the effect of an explanatory variable on the outcome depends on the value of another explanatory variable.

When explanatory variables are dependent on each other to tell the whole story, this can be described as an interaction effect; it is not possible to understand the exact effect that one variable has on the outcome without first knowing information about the other variable. 

The use of the word dependent here is potentially confusing as explanatory variables are often called independent variables, and the outcome variable is often called the dependent variable (see word clouds here). This is one reason why I tend to avoid the use of these terms.


Example

An interesting example of interaction occurs when examining our perceptions about climate change and the relationship between political preference and level of education.

We would be missing an important piece of the story concerning attitudes to climate change if we looked in isolation at either education or political orientation.  This is because the two interact; as level of education increases amongst more conservative thinkers, perception about the threat of global warming decreases, but for liberal thinkers as the level of education increases, so too does the perception about the threat of global warming. 

Here is a link to the New York Times article on this story: https://www.nytimes.com/interactive/2017/11/14/upshot/climate-change-by-education.html

What happens if we overlook interactions?

If interaction effects are not considered, then the output of the model might lead the investigator to the wrong conclusions. For instance, if each explanatory variable was plotted in isolation against the outcome variable, important potential information about the interaction between variables might be lost; only main effects would be apparent.

On the other hand, if many variables are used in a model together, without first exploring the nature of potential interactions, it might be the case that unknown interaction effects are masking true associations between the variables.  This is known as confounding bias.

How do we detect interactions?

The best way to start exploring interactions is to plot the variables. Trends are more apparent when we use graphs to visualise these.

If the relationship between two exposure variables on an outcome variable is constant, then we might visualise this as a graph with two parallel lines.  Another way of describing this is additive effect modification.

Two explanatory variables (x1 and x2) are not dependent on each other to explain the outcome.

But if the effect of the exposure variables on the outcome is not constant then the lines will diverge. We can describe this as multiplicative effect modification.

Two explanatory variables (x1 and x2) are dependent on each other to explain the outcome.

Once a plot suggests an interaction, the next step is to test whether that interaction is statistically significant.
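One common way to do that check is to fit the model with and without the interaction term and compare the two. The sketch below uses simulated data loosely mirroring the climate change example; the variable names, effect sizes and sample size are all assumptions for illustration, not results from the article.

```r
set.seed(1)
n <- 5000

# Simulated (not real) data: the effect of education on concern about
# climate change differs by political orientation
education    <- sample(0:4, n, replace = TRUE)  # hypothetical education level
conservative <- rbinom(n, 1, 0.5)               # 1 = conservative, 0 = liberal
log_odds     <- -0.5 + 0.4 * education - 0.8 * education * conservative
concerned    <- rbinom(n, 1, plogis(log_odds))

# Main effects only: assumes a single education effect for everyone
main_only <- glm(concerned ~ education + conservative, family = binomial)

# education * conservative expands to both main effects plus the
# education:conservative interaction term
with_interaction <- glm(concerned ~ education * conservative, family = binomial)

# Likelihood ratio test comparing the two nested models
lrt <- anova(main_only, with_interaction, test = "Chisq")
print(lrt)

interaction_p <- lrt[["Pr(>Chi)"]][2]
```

A small p-value here suggests the model with the interaction fits better than the main effects model, so the effect of education should be reported separately for each group rather than as a single number.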


Terminology

Some degree of ambiguity exists surrounding the terminology of interactions (and statistical terms in general!), but here are a few commonly encountered terms, often used synonymously.

  • Interaction
  • Causal interaction
  • Effect modification
  • Effect heterogeneity

There are subtle differences between interaction and effect modification.  You can find out more in this article: On the distinction between interaction and effect modification.

Random effects

Clustered data

Many methods of statistical analysis are intended to be applied with the assumption that, within a data-set, an individual observation is not influenced by the value of another observation: it is assumed that all observations are independent of one another. 

This may not be the case however, if you are using data, for example, from various hospitals, where natural clustering or grouping might occur.  This happens if observations within individual hospitals have a slight tendency to be more similar to each other than to observations in the rest of the data-set.

Random effects modelling is used if the groups of clustered data can be considered as samples from a larger population.

Why should we be aware of clustered data?

Gathering insight into the exact nature of differences between groups may or may not be important to your analysis, but it is important to account for patterns of clustering because otherwise measures such as standard errors, confidence intervals and p-values may appear to be too small or narrow.  Random effects modelling is one approach which can account for this.

A solution to clustering

The random effects model assumes that having allowed for the random effects of the various clusters or groups, the observations within each individual cluster are still independent.  You can think of it as multiple levels of analysis – first there are the individual observations, and these are then nested within observations at a cluster level, hence an alternative name for this type of modelling is multilevel modelling.


Terminology

There are various terms which are used when referring to random effects modelling, although the terms are not entirely synonymous. Here are a few of them:

  • Random effects
  • Multilevel
  • Mixed-effect
  • Hierarchical

There are two main types of random effects models:

  • Random intercept model: constrains lines to be parallel
  • Random slope and intercept model: does not constrain lines to be parallel
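The two model types can be sketched on simulated clustered data. This is an illustration under stated assumptions: the data are made up, the outcome is continuous to keep the example simple, and the nlme package is one choice among several. For a binary outcome, lme4::glmer() with family = binomial and a term such as (1 | hospital) is the analogous call.

```r
library(nlme)

set.seed(7)
n_hospitals <- 20
n_per_hosp  <- 50

# Simulated (not real) clustered data: each hospital has its own baseline
# (intercept) and its own effect of x (slope)
hospital_id <- rep(seq_len(n_hospitals), each = n_per_hosp)
intercepts  <- rnorm(n_hospitals, mean = 0, sd = 2)
slopes      <- rnorm(n_hospitals, mean = 1, sd = 0.5)
x <- rnorm(n_hospitals * n_per_hosp)
y <- intercepts[hospital_id] + slopes[hospital_id] * x + rnorm(length(x))
d <- data.frame(y = y, x = x, hospital = factor(hospital_id))

# Random intercept: hospitals differ in baseline, fitted lines stay parallel
m_intercept <- lme(y ~ x, random = ~ 1 | hospital, data = d)

# Random slope and intercept: hospitals may also differ in the effect of x
m_slope <- lme(y ~ x, random = ~ x | hospital, data = d)

print(fixef(m_intercept))
print(fixef(m_slope))
```

ranef() on each model returns the hospital-level deviations: one column for the random intercept model, two (intercept and slope) for the random slope and intercept model.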

Brief summary

To finish, here is a quick look at some of the key differences between confounding and interaction.

If you would like to learn more about these terms and how to carry out logistic regression in R, keep an eye on the HealthyR page for updates on courses available.