If you've been watching the news or Twitter over the past
week, you may have seen the appendicitis-related headlines about unnecessary operations
being performed. The RIFT collaborative and Dmitri Nepogodiev have really
spearheaded some cool work looking at who gets unnecessary operations, and the resulting papers
are all well worth a read:
So, when Dmitri asked if I could develop a web application
for risk scoring to help identify those at low risk of appendicitis, I was very
excited.
Having quite often used risk calculators in clinical
practice, I started to write a list of what makes a good calculator and how to
make one that can be used effectively. The most important were:
Easy to use
Works on any platform (as NHS IT has a wide
variety of browsers!) and on mobile (some hospitals have great Wi-Fi through
eduroam)
Can be quickly updated
Looks good and gives an intuitive result
Lightweight, requiring minimal processing power,
so that many users can use it simultaneously
Now, we use a lot of R in surgical informatics, but Shiny wasn't going to be the one for this, as it's not that mobile friendly and doesn't necessarily work smoothly on every browser (sorry, Shiny!). Similarly, the computational footprint required to run Shiny is too heavy for this. So, using codepen.io and a Pug HTML compiler, I wrote a mobile-friendly website (there are still a couple of tweaks I'd like to make before it's entirely mobile friendly!).
Similarly, I get asked: why not an app? Well, app development requires developing for multiple platforms (Apple, Android, Blackberry) and apps can't be used on those pesky NHS PCs. Furthermore, if something goes out of date or needs to be updated quickly, repairing it will take ages, as updates sometimes have to be vetted by app stores etc.
My codepen.io for the calculator:
Codepen.io is a great development tool and allows you to
combine and get inspired by other people’s work too!
I then set up a micro instance on Google Cloud, installed the Pug compiler and apache2, selected a fixed IP, opened the HTTP port to the world, and all done! (The setup is a little more involved than that, but it was straightforward!) The micro instance is very, very cheap, so it's not expensive to run. The Birmingham crew then bought a lovely domain, appy-risk.org, for me to attach it to.
Here's the obligatory increase in CPU usage since publication (slightly higher, but as you can tell, it's quite light):
The following blog post provides a general overview of some of the terms encountered when carrying out logistic regression and was inspired by attending the extremely informative HealthyR+: Practical logistic regression course at the University of Edinburgh.
Confounding
What is confounding?
Examples
Interaction
What are interaction effects?
Example
How do we detect interactions?
What happens if we overlook interactions?
Terminology
Random effects
Clustered data
Why should we be aware of clustered data?
A solution to clustering
Terminology
Brief summary
Confounding
What is confounding?
Confounding occurs when the association between an explanatory (exposure) and outcome variable is distorted, or confused, because another variable is independently associated with both.
The timeline of events must also be considered, because a variable cannot be described as confounding if it occurs after (and is directly related to) the explanatory variable of interest. Instead it is sometimes called a mediating variable as it is located along the causal pathway, between explanatory and outcome.
Examples
Potential confounders often encountered in healthcare data include, for example, age, sex, smoking status, BMI, frailty and disease severity. One of the ways these variables can be controlled for is by including them in regression models.
In the Stanford marshmallow experiment, a potential confounder was left out – economic background – leading to an overestimate of the influence of a child’s willpower on their future life outcomes.
Another example includes the alleged link between coffee drinking and lung cancer. More smokers than non-smokers are coffee drinkers, so if smoking is not accounted for in a model examining coffee drinking habits, the results are likely to be confounded.
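As a toy illustration of the coffee and smoking example, here is a small simulation in R (the numbers are entirely made up). Adjusting for the confounder removes the spurious association:

set.seed(1)
n = 10000
smoker = rbinom(n, 1, 0.3)                      # 30% of people smoke
coffee = rbinom(n, 1, 0.2 + 0.5 * smoker)       # smokers are more likely to drink coffee
cancer = rbinom(n, 1, plogis(-4 + 2 * smoker))  # smoking raises cancer risk; coffee has no true effect

# Unadjusted model: coffee appears to be associated with cancer
exp(coef(glm(cancer ~ coffee, family = binomial)))

# Adjusted model: including the confounder removes the spurious coffee effect
exp(coef(glm(cancer ~ coffee + smoker, family = binomial)))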
Interaction
What are interaction effects?
In a previous blog post, we looked at how collinearity is used to describe the relationship between two very similar explanatory variables. We can think of this as an extreme case of confounding, almost like entering the same variable into our model twice. An interaction, on the other hand, occurs when the effect of an explanatory variable on the outcome depends on the value of another explanatory variable.
When explanatory variables are dependent on each other to tell the whole story, this can be described as an interaction effect; it is not possible to understand the exact effect that one variable has on the outcome without first knowing information about the other variable.
The use of the word dependent here is potentially confusing as explanatory variables are often called independent variables, and the outcome variable is often called the dependent variable (see word clouds here). This is one reason why I tend to avoid the use of these terms.
Example
An interesting example of interaction occurs when examining our perceptions about climate change and how they relate to political preference and level of education.
We would be missing an important piece of the story concerning attitudes to climate change if we looked in isolation at either education or political orientation. This is because the two interact; as level of education increases amongst more conservative thinkers, perception about the threat of global warming decreases, but for liberal thinkers as the level of education increases, so too does the perception about the threat of global warming.
If interaction effects are not considered, then the output of the model might lead the investigator to the wrong conclusions. For instance, if each explanatory variable was plotted in isolation against the outcome variable, important potential information about the interaction between variables might be lost, only main effects would be apparent.
On the other hand, if many variables are used in a model together, without first exploring the nature of potential interactions, it might be the case that unknown interaction effects are masking true associations between the variables. This is known as confounding bias.
How do we detect interactions?
The best way to start exploring interactions is to plot the variables. Trends are more apparent when we use graphs to visualise these.
If the effect of two exposure variables on an outcome variable is constant, then we might visualise this as a graph with two parallel lines. Another way of describing this is additive effect modification.
Two explanatory variables (x1 and x2) are not dependent on each other to explain the outcome.
But if the effect of the exposure variables on the outcome is not constant then the lines will diverge. We can describe this as multiplicative effect modification.
Two explanatory variables (x1 and x2) are dependent on each other to explain the outcome.
Once an interaction has been confirmed, the next step would be to explore whether the interaction is statistically significant or not.
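As a sketch of what this might look like in R (simulated data loosely based on the climate change example above), the interaction term is added with * and tested with a likelihood ratio test:

set.seed(2)
n = 500
education = runif(n, 0, 20)           # years of education
conservative = rbinom(n, 1, 0.5)      # 1 = conservative, 0 = liberal
# Education increases concern for liberals but decreases it for conservatives
concern = rbinom(n, 1, plogis(-1 + 0.15 * education - 0.3 * education * conservative))

fit_main = glm(concern ~ education + conservative, family = binomial)
fit_int  = glm(concern ~ education * conservative, family = binomial)  # adds education:conservative

anova(fit_main, fit_int, test = "Chisq")  # likelihood ratio test for the interaction term
summary(fit_int)$coefficients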
Terminology
Some degree of ambiguity exists surrounding the terminology of interactions (and statistical terms in general!), but here are a few commonly encountered terms, often used synonymously.
Random effects
Clustered data
Many methods of statistical analysis are intended to be applied with the assumption that, within a data-set, an individual observation is not influenced by the value of another observation: it is assumed that all observations are independent of one another.
This may not be the case however, if you are using data, for example, from various hospitals, where natural clustering or grouping might occur. This happens if observations within individual hospitals have a slight tendency to be more similar to each other than to observations in the rest of the data-set.
Random effects modelling is used if the groups of clustered data can be considered as samples from a larger population.
Why should we be aware of clustered data?
Gathering insight into the exact nature of differences between groups may or may not be important to your analysis, but it is important to account for patterns of clustering because otherwise measures such as standard errors, confidence intervals and p-values may appear to be too small or narrow. Random effects modelling is one approach which can account for this.
A solution to clustering
The random effects model assumes that having allowed for the random effects of the various clusters or groups, the observations within each individual cluster are still independent. You can think of it as multiple levels of analysis – first there are the individual observations, and these are then nested within observations at a cluster level, hence an alternative name for this type of modelling is multilevel modelling.
Terminology
There are various terms which are used when referring to random effects modelling, although the terms are not entirely synonymous. Here are a few of them:
Random effects
Multilevel
Mixed-effect
Hierarchical
There are two main types of random effects models (see the sketch after this list):
Random intercept model
Random intercept: Constrains lines to be parallel
Random slope and intercept model
Random slope and intercept: Does not constrain lines to be parallel
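Here is a short sketch using the lme4 package and its built-in sleepstudy data (reaction times measured repeatedly within subjects, so observations are clustered by subject):

library(lme4)

# Random intercept: each subject gets its own baseline; slopes are kept parallel
fit_intercept = lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Random slope and intercept: the effect of Days is also allowed to vary by subject
fit_slope = lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

summary(fit_slope)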
Brief summary
To finish, here is a quick look at some of the key differences between confounding and interaction.
If you would like to learn more about these terms and how to carry out logistic regression in R, keep an eye on the HealthyR page for updates on courses available.
TLDR: You can teach R on people's own laptops without having them install anything and without requiring an internet connection.
Members of the Surgical Informatics team in Ghana, 2019. More information: surgicalinformatics.org
Introduction
Running R programming courses on people's own laptops is a pain, especially as we use a lot of very useful extensions that actually make learning and using R much easier and more fun. But long installation instructions can be very off-putting for complete beginners, and people can be discouraged from learning programming if installation hurdles invoke their imposter syndrome.
We almost always run our courses in places with a good internet connection (it does not have to be super fast or flawless), so we get our students all set up on RStudio Server (hosted by us) or https://rstudio.cloud (a free service provided by RStudio!).
You connect to either of these options using a web browser, and even very old computers can handle this. That’s because the actual computations happen on the server and not on the student’s computer. So the computer just serves as a window to the training instance used.
Now, these options work really well as long as you have a stable internet connection. But for teaching R offline and on people’s own laptops, you either have to:
make sure everyone installs everything correctly before they attend the course
download all the software and extensions, put them on USB sticks and try to install them together at the start
start serving RStudio from your computer using a Local Area Network (LAN) created by a router
Now, we already discussed why the first option is problematic (it is a gatekeeper for complete beginners). The second option, installing everything at the start together, means that you begin the course with the most boring part. And since everyone's computers are different (different operating systems as well as different versions of those systems), this can take quite a while to sort. Therefore, cue option c): an RStudio Server LAN party.
Requirements
A computer with more than 4GB of RAM. macOS alone uses around 2-3GB just to keep going, and running the RStudio Server docker container was using another 3-4 GB, so you’ll definitely need more than 4GB in total.
A network router. For a small number of participants, the same one you already have at home will work. Had to specify “network” here, as apparently, even my Google search for “router” suggests the power tool before network routers.
Docker – free software, dead easy to install on macOS (search the internet for “download Docker”). Looks like installation on the Windows Home operating system might be trickier. If you are a Windows Home user who is using Docker, please do post a link to your favourite instructions in the comments below.
Internet connection for setting up – to download RStudio’s docker image and install your extra packages.
My MacBook Pro serving RStudio to 10 other computers in Ghana, November 2019.
Set-up
Running RStudio using Docker is so simple you won’t believe me. It honestly is just a single-liner to be entered into your Terminal (Command Prompt on Windows):
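Something along these lines (a sketch only: swap in your own password, and the image name here assumes the rocker verse image discussed below):

# a sketch: choose your own password and preferred rocker image
docker run -d -p 8787:8787 -e PASSWORD=yourpassword -e ROOT=TRUE rocker/verse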
This will automatically download a Docker image put together by RStudio. The one called verse includes all the tidyverse packages as well as publishing-related ones (R Markdown, Shiny, etc.). You can find a list of the different ones here: https://github.com/rocker-org/rocker
Then open a browser and go to localhost:8787 and you should be greeted with an RStudio Server login! (localhost only works on a Mac or Linux; if using Windows, take a note of your IP address and use that instead of localhost.) More information and instructions can be found here: https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image
Tip: RStudio suggests port 8787, which is what I used for consistency, but if you set it up on port 80 you can omit the port number altogether, as that's the browser default anyway. So you can just go to localhost (or the IP address you noted earlier if using Windows).
For those of you who have never seen or used RStudio Server, this is what it looks like:
RStudio Server is almost identical to RStudio Desktop. The main difference is the "Upload" button in the Files pane. This one is running in a Docker container, served at port 8787, and accessed using Safari (but any web browser will work).
The Docker single-liner above will create a single user with sudo rights (since I’ve included -e ROOT=TRUE). After logging into the instance, you can then add other users and copy the course materials to everyone using these scripts: https://github.com/einarpius/create_rstudio_users Note that the instance is running Debian, so you’ll need very basic familiarity with managing file permissions on the command line. For example, you’ll need to make the scripts executable with chmod 700 create_users.sh.
Then connect to the same router you'll be using for your LAN party, go to the router settings and assign yourself a fixed IP address, e.g., 192.168.1.78. Once other people connect to the network created by this router (either by WiFi or cable), they need to type 192.168.1.78:8787 into any browser and can just start using RStudio. This will work as long as your computer is running Docker and you are all connected to the same router.
I had 10 people connected to my laptop and, most of the time, the strain on my CPU was negligible – around 10-20%. That’s because it was a course for complete beginners and they were mostly reading the instructions (included in the training Notebooks they were running R code in). So they weren’t actually hitting Run at the same time, and the tasks weren’t computationally heavy. When we did ask everyone to hit the “Knit to PDF” button all at the same time, it got a bit slower and my CPU was apparently working at 200%. But nothing crashed and everyone got their PDFs made.
Why are you calling it a LAN party?
My friends and I having a LAN party in Estonia, 2010. We would mostly play StarCraft or Civilization, or as pictured here – racing games to wind down at the end.
LAN stands for Local Area Network and in most cases means "devices connected to the same WiFi*". You've probably used LANs lots in your life without even realising. One common example is printers: you know when a printer asks you to connect to the same network to be able to print your files? This usually means your computer and the printer will be in a LAN. If your printer accepted files via any internet connection, rather than just the same local network, then people around the world could submit stuff to your printer. Furthermore, if you have any smart devices in your home, they'll be having a constant LAN party with each other.
The term “LAN party” means people coming together to play multiplayer computer games – as it will allow people to play in the same “world”, to either build things together or fight with each other. Good internet access has made LAN parties practically obsolete – people and their computers no longer have to physically be in the same location to play multiplayer games together. I use the term very loosely to refer to anything fun happening on the same network. And being able to use RStudio is definitely a party in my books anyway.
But it is for security reasons (e.g., the printer example), or for sharing resources in places without an excellent internet connection, that LAN parties are still very much relevant.
* Overall, most existing LANs operate via Ethernet cables (or “internet cables” as most people, including myself refer to them). WiFi LAN or WLAN is a type of LAN. Have a look at your home router, it will probably have different lights for “internet” and “WLAN”/“wireless”. A LAN can also be connected to the internet – if the router itself is connected to the internet. That’s the main purpose of a router – to take the internet coming into your house via a single Ethernet cable, and share it with all your other devices. A LAN is usually just a nice side-effect of that.
Docker, containers, images
Docker image – a file bundling an operating system, programs and files.
Docker container – a running image (it may be paused or stopped).
List of all your containers: docker ps -a (just docker ps will list running containers, so the ones not stopped or paused)
When you run rstudio/verse for the first time it will be downloaded into your images. The next time it will be taken directly from there, rather than downloaded. So you’ll only need internet access once.
Stop an active container: docker stop container-name
Start it up again: docker start container-name
Save a container as an image (for versioning or passing on to other people):
docker commit container-name repository:tag
For example: docker commit rstudio-server rstudio/riinu:test1
Rename a container (by default it will get a random label; I'd change it to rstudio-server):
docker rename happy_hippo rstudio-server
You can then start your container with: docker start rstudio-server
These past two days are a new frontier for the HealthyR course, taking the number of continents we've run it in up to two. After the NIHR Unit on Global Surgery meeting, we travelled to Tamale, Ghana's third largest city. The Wellcome Trust have kindly funded the development of the innovative, open-source HealthyR Notebooks course. Spearheaded by Dr Riinu Ots, this course aims to provide an easy way for anyone in the world to learn R. This is particularly powerful where resources are limited and there are plenty of questions that need to be answered.

Enter Stephen Tabiri, Professor of Surgery at the University for Development Studies in Tamale. Stephen is a surgeon and has a large team of junior surgeons in training, nurses and other clinicians. In an innovative twist, the course was held on a mix of laptops, from the data centre and on delegates' own machines. Riinu had a brilliant solution that served an offline RStudio instance to delegates' computers.

Day 1 quickly introduced some key concepts to the delegates, who quickly worked through the materials! After lunch, a global surgery showcase event was held, which showcased the wide range of tools available to analyse data in R!

Day 2 kicked off nicely, completing the basics session and then going straight into everyone's favourite session: plotting! Here there were a lot of pleased delegates as they made complicated and colourful ggplots! People were making a lot of progress in what can sometimes be a challenging language to learn! We finally closed on a logistic regression session delivered by Ewen Harrison, where people built their own models!

Throughout the course there were numerous people bringing laptops to install the RStudio software on their own machines. A very enthusiastic and keen bunch of data scientists! Excitingly, members of the Ghana R community also attended, to offer support and discuss how best to provide a sustainable future for data science in Ghana.
The Surgical Informatics team arrive in Tamale, Ghana for the next HealthyR Notebooks course
The Surgical Informatics group is delighted to be visiting Tamale in Ghana to deliver our flagship HealthyR Notebooks course as part of our Wellcome Trust grant, 'HealthyR Notebooks: Democratising open and reproducible data analysis in resource-poor environments'.
We’re being made extremely welcome by our hosts Professor Stephen Tabiri and Benard Ofori Appiah from the NIHR Global Health Research Unit on Global Surgery hub in Ghana.
Over the next few days we’ll be establishing a data centre in Ghana with the provision of 15 laptops and training 20 local delegates to use R for healthcare data analysis. This will build capacity for future data driven research in partnership with the NIHR Global Surgery Unit in Ghana.
Something for the more advanced R user! We’ll be back to our more exciting programming shortly (I hope!).
rlang? I already speak R
Quite right. rlang is part of the tidyverse side of things, so is probably more useful if you’re an advanced R user. It’s certainly not for the faint-hearted and needs a comprehensive understanding of how R ‘sees’ the code you write.
rlang is a low-level programming API for R which the tidyverse uses (low-level meaning it speaks to R in as R-like a way as possible, whereas a high-level interface is more user-oriented and interpretable). It enables you to extend what the tidyverse can do and adapt it for your own uses. It's particularly good to use if you're doing lots of 'programming'-type R work, for example building a package, making a complex Shiny app or writing functions. It might also be handy if you're doing lots of big data manipulation and want to manipulate different datasets in the same way, for example.
Here’s an example of dynamically naming variables
In this example, say we have a tibble of variables, but we want to apply dynamic changes to it (so we feed R a variable that can change, either using another function like purrr::map or in a Shiny app). In this instance, specifying each variable and each different possible consequence using different logical functions would take forever and be very clunky. So we can use rlang to simply put a dynamic variable/object through the same function.
We make use of curly-curly ({{ }}) too, which allows us to avoid the bulkier enquo() and !! syntax.
library(tidyverse)
#Make data - here nonsense numbers on deaths per 100k from green spacemen say
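# (A minimal sketch of how this example can work; the numbers below and the
#  summarise_by() function are made up for illustration.)
spacemen = tibble(
  region = c("North", "South", "East", "North", "South", "East"),
  year   = c(2018, 2018, 2018, 2019, 2019, 2019),
  deaths_per_100k = c(12, 7, 21, 14, 6, 25)
)

# Curly-curly ({{ }}) passes a bare variable name into dplyr verbs, and :=
# lets us name the resulting column dynamically
summarise_by = function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise("mean_deaths_by_{{ group_var }}" := mean(deaths_per_100k),
              .groups = "drop")
}

# The same function works for whichever grouping variable we feed it
spacemen %>% summarise_by(region)
spacemen %>% summarise_by(year)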
Riinu and I are sitting in Frankfurt airport discussing the paper retracted in JAMA this week.
During analysis, the treatment variable coded [1,2] was recoded in error to [1,0]. The results of the analysis were therefore reversed. The lung-disease self-management program actually resulted in more attendances at hospital, rather than fewer as had been originally reported.
Recode check
Checking of recoding is such an important part of data cleaning – we emphasise this a lot in HealthyR courses – but of course mistakes happen.
Our standard approach is this:
library(finalfit)
colon_s %>%
mutate(
sex.factor2 = forcats::fct_recode(sex.factor,
"F" = "Male",
"M" = "Female")
) %>%
count(sex.factor, sex.factor2)
# A tibble: 2 x 3
sex.factor sex.factor2 n
<fct> <fct> <int>
1 Female M 445
2 Male F 484
The miscode should be obvious.
check_recode()
However, mistakes may still happen and be missed. So we’ve bashed out a useful function that can be applied to your whole dataset. This is not to replace careful checking, but may catch something that has been missed.
The function takes a data frame or tibble and fuzzy matches variable names. It produces crosstables similar to above for all matched variables.
So if you have coded something from sex to sex.factor it will be matched. The match is hungry so it is more likely to match unrelated variables than to miss similar variables. But if you recode death to mortality it won’t be matched.
Here’s a walk through.
# Install
devtools::install_github('ewenharrison/finalfit')
library(finalfit)
library(dplyr)
# Recode example
colon_s_small = colon_s %>%
select(-id, -rx, -rx.factor) %>%
mutate(
age.factor2 = forcats::fct_collapse(age.factor,
"<60 years" = c("<40 years", "40-59 years")),
sex.factor2 = forcats::fct_recode(sex.factor,
# Intentional miscode
"F" = "Male",
"M" = "Female")
)
# Check
colon_s_small %>%
check_recode()
$index
# A tibble: 3 x 2
var1 var2
<chr> <chr>
1 sex.factor sex.factor2
2 age.factor age.factor2
3 sex.factor2 age.factor2
$counts
$counts[[1]]
# A tibble: 2 x 3
sex.factor sex.factor2 n
<fct> <fct> <int>
1 Female M 445
2 Male F 484
$counts[[2]]
# A tibble: 3 x 3
age.factor age.factor2 n
<fct> <fct> <int>
1 <40 years <60 years 70
2 40-59 years <60 years 344
3 60+ years 60+ years 515
$counts[[3]]
# A tibble: 4 x 3
sex.factor2 age.factor2 n
<fct> <fct> <int>
1 M <60 years 204
2 M 60+ years 241
3 F <60 years 210
4 F 60+ years 274
As can be seen, the output takes the form of a list of length 2. The first element is an index of matched variables. The second is the crosstables as tibbles for each variable combination. sex.factor2 can be seen to be miscoded. sex.factor2 and age.factor2 have been matched, but should be ignored.
Numerics are not included by default. To do so:
out = colon_s_small %>%
select(-extent, -extent.factor,-time, -time.years) %>% # choose to exclude variables
check_recode(include_numerics = TRUE)
out
# Output not printed for space
Miscoding in survival::colon dataset?
When doing this just today, we noticed something strange in our example dataset, survival::colon.
The variable node4 should be a binary recode of nodes greater than 4. But as can be seen, something is not right!
We’re interested in any explanations those working with this dataset might have.
There we are then, a function that may be useful in detecting miscoding. So useful in fact, that we have immediately found probable miscoding in a standard R dataset.
"All models are wrong, but some are useful." This quote by statistician George Box feels like a good starting point from which to consider some of the challenges of regression modelling. If we start with the idea that all models are wrong, it follows that one of the main skills in carrying out regression modelling is working out where the weaknesses are and how to minimise these to produce as close an approximation as possible to the data you are working with – to make the model useful.
The idea that producing high-quality regression models is often more of an art than a science appeals to me. Understanding the underlying data, what you want to explore, and the tools you have at hand are essential parts of this process.
After attending the excellent HealthyR+: Practical Logistic Regression course a few weeks ago, my head was buzzing with probabilities, odds ratios and confounding. It was not just the data which was confounded. As someone fairly new to logistic regression, I thought it might be useful to jot down some of the areas I found particularly interesting and concepts which made me want to find out more. In this first blog post we take a brief look at:
Probability and odds
The difference between probability and odds
Why use log(odds) and not just odds?
Famous probability problems
Collinearity and correlation
What is collinearity?
How do we detect collinearity?
Is collinearity a problem?
Probability and odds
The difference between probability and odds
Odds and probability are both measures of how likely it is that a certain outcome might occur in a series of events. Probability is perhaps more intuitive to understand, but its properties make it less useful in statistical models, and so odds, odds ratios and log(odds) are used instead; more on this in the next section.
Interestingly, when the probability of an event occurring is small – <0.1 (or less than 10%) – the odds are quite similar. However, as probability increases, the odds also increase but at a greater rate, see the following figure:
Here we can also see that whilst probabilities range from 0 to 1, odds can take on any value between 0 and infinity.
Why use log(odds) and not just odds?
Asymmetry of the odds scale makes it difficult to compare binary outcomes, but by using log(odds) we can produce a symmetrical scale, see figure below:
In logistic regression, the odds ratio concerning a particular variable represents the change in odds with each unit increase, whilst holding all other variables constant.
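A quick illustration of all three scales in R:

p = c(0.05, 0.25, 0.50, 0.75, 0.95)
odds = p / (1 - p)                         # odds = probability / (1 - probability)
data.frame(p, odds, log_odds = log(odds))
# Note the symmetry of log(odds) around p = 0.5: p = 0.25 and p = 0.75 give
# log(odds) of -1.1 and +1.1, whereas the odds themselves are 0.33 and 3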
Famous probability problems
I find probability problems fascinating, particularly those which seem counter-intuitive. Below are links to explanations of two intriguing probability problems:
The term collinearity (also referred to as multicollinearity) is used to describe a high correlation between two explanatory variables. This can cause problems in regression modelling because the explanatory variables are assumed to be independent (and indeed are sometimes called independent variables, see word clouds below).
The inclusion of variables which are collinear (highly correlated) in a regression model can lead to the false impression, for example, that neither variable is associated with the outcome when, in fact, each variable individually does have a strong association. The figure below might help to visualise the relationships between the variables:
In this image, y represents the outcome variable, and x1 and x2 are the highly correlated, collinear explanatory variables. As you can see, there is a large area of (light grey) overlap between the x variables, whereas there are only two very small areas of independent overlap between each x variable and y. These small areas represent the limited information available to the regression model when trying to carry out analysis.
How do we detect collinearity?
A regression coefficient can be thought of as the rate of change, or as the slope of the regression line. The slope describes the mean change in the outcome variable for every unit of change in the explanatory variable. It is important to note that regression coefficients are calculated based on the assumption that all other variables (apart from the variables of interest) are kept constant.
When two variables are highly correlated, this creates problems. The model will try to predict the outcome but finds it hard to disentangle the influence of either of the explanatory variables due to their strong correlation. As a result, coefficient estimates may change erratically in response to small changes in the model.
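A small simulated sketch of what this can look like in R (made-up data):

set.seed(3)
n = 200
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1, i.e. collinear
y  = x1 + rnorm(n)

cor(x1, x2)                     # correlation very close to 1

# Individually, each variable is clearly associated with the outcome...
coef(summary(lm(y ~ x1)))
coef(summary(lm(y ~ x2)))

# ...but entered together, the coefficients become unstable and the standard errors inflate
coef(summary(lm(y ~ x1 + x2)))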
Various terms are used to describe these x and y variables depending on context. There are slight differences in the meanings, but here are a few terms that you might encounter:
The information I used to generate these word clouds was based on a crude estimate of the number of mentions in Google Scholar within the context of medical statistics.
Is collinearity a problem?
Collinearity is a problem if the purpose of your analysis is to explain the individual contributions of the explanatory variables; however, it has little effect on the overall predictive properties of your model, i.e. the model will provide accurate predictions based on all variables as one big bundle, but will not be able to tell you about the effects of isolated variables.
If you are concerned with exploring specific interactions and you encounter collinearity, there are two main approaches you can take:
Drop one of the variables if it is not vital to your analysis
Combine the variables (e.g. weight and height can be combined to produce BMI); see the quick sketch after this list
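A sketch of the second approach (the column names and values here are hypothetical):

library(dplyr)
patients = tibble::tibble(weight_kg = c(70, 85, 62), height_m = c(1.75, 1.80, 1.60))
patients %>%
  mutate(bmi = weight_kg / height_m^2)   # combine the two variables into a single index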
An example of a publication where missed collinearity led to potentially erroneous conclusions, concerns analyses carried out on data relating to the World Trade Organisation (WTO). Here is a related article which attempts to unpick some of the problems with previous WTO research.
Finishing on an example of a problematic attempt at regression analysis may perhaps seem slightly gloomy, but on the contrary, I hope that this might provide comfort if your own analysis throws up challenges or problems – you are in good company! It also brings us back to the quote by George Box at the beginning of this blog post, where we started with the premise that all models are wrong. They are at best a close approximation, and we must always be alert to their weaknesses.
What next?
Look out for the next HealthyR+: Practical Logistic Regression course and sign up. What areas of medical statistics do you find fun, puzzling, tricky, surprising? Let us know below.
We are using multiple imputation more frequently to “fill in” missing data in clinical datasets. Multiple datasets are created, models run, and results pooled so conclusions can be drawn.
We’ve put some improvements into Finalfit on GitHub to make it easier to use with the mice package. These will go to CRAN soon but not immediately.
Multivariate Imputation by Chained Equations (mice)
mice is a great package and contains lots of useful functions for diagnosing and working with missing data. The purpose here is to demonstrate how mice can be integrated into the Finalfit workflow, with inclusion of models from imputed datasets in tables and plots.
Choose variables to impute and variables to impute from
finalfit::missing_predictorMatrix() makes it easy to specify which variables do what. For instance, we often do not want to impute our outcome or explanatory variable of interest (exposure), but do want to use them to impute other variables.
This is straightforward to code using the arguments drop_from_imputed and drop_from_imputer.
library(mice)
# Specify model
explanatory = c("age", "sex.factor", "nodes",
"obstruct.factor", "smoking_mar")
dependent = "mort_5yr"
# Choose not to impute missing values
# for explanatory variable of interest and
# outcome variable.
# But include in algorithm for imputation.
predM = colon_s %>%
select(dependent, explanatory) %>%
missing_predictorMatrix(
drop_from_imputed = c("obstruct.factor", "mort_5yr")
)
Create imputed datasets
A set of multiply imputed datasets (mids) can be created as below. Various checks should be performed to ensure you understand the data that have been created. See here.
mids = colon_s %>%
select(dependent, explanatory) %>%
mice(m = 4, predictorMatrix = predM) # Usually m = 10
Run models
Here we will use a logistic regression model. The with.mids() function takes a model with a formula object, so use base R functions rather than Finalfit wrappers.
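A sketch of what that step might look like with the variables specified above (the formula is written out in full so that base glm() can be used):

fits = mids %>%
  with(glm(mort_5yr ~ age + sex.factor + nodes + obstruct.factor + smoking_mar,
           family = "binomial"))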
We now have multiple models run with each of the imputed datasets. We haven’t found good methods for combining common model metrics like AIC and c-statistic. I’d be interested to hear from anyone working on methods for this. Metrics can be extracted for each individual model to give an idea of goodness-of-fit and discrimination. We’re not suggesting you use these to compare imputed datasets, but could use them to compare models containing different variables created using the imputed datasets, e.g.
fits %>%
getfit() %>%
purrr::map(AIC)
[[1]]
[1] 1192.57
[[2]]
[1] 1191.09
[[3]]
[1] 1195.49
[[4]]
[1] 1193.729
# C-statistic
fits %>%
getfit() %>%
purrr::map(~ pROC::roc(.x$y, .x$fitted)$auc)
[[1]]
Area under the curve: 0.6839
[[2]]
Area under the curve: 0.6818
[[3]]
Area under the curve: 0.6789
[[4]]
Area under the curve: 0.6836
Pool results
Rubin’s rules are used to combine results of multiple models.
# Pool results
fits_pool = fits %>%
pool()
Plot results
Pooled results can be passed directly to Finalfit plotting functions.
# Can be passed to or_plot
colon_s %>%
or_plot(dependent, explanatory, glmfit = fits_pool, table_text_size=4)
Put results in table
The pooled result can be passed directly to fit2df() as can many common models such as lm(), glm(), lmer(), glmer(), coxph(), crr(), etc.
# Summarise and put in table
fit_imputed = fits_pool %>%
fit2df(estimate_name = "OR (multiple imputation)", exp = TRUE)
fit_imputed
explanatory OR (multiple imputation)
1 age 1.01 (1.00-1.02, p=0.212)
2 sex.factorMale 1.01 (0.77-1.34, p=0.917)
3 nodes 1.24 (1.18-1.31, p<0.001)
4 obstruct.factorYes 1.34 (0.94-1.91, p=0.105)
5 smoking_marSmoker 1.28 (0.88-1.85, p=0.192)
Combine results with summary data
Any model passed through fit2df() can be combined with a summary table generated with summary_factorlist() and any number of other models.
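For instance, a sketch of one way to do this with summary_factorlist() and ff_merge() (check the finalfit documentation for the exact arguments):

colon_s %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
  ff_merge(fit_imputed, last_merge = TRUE)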
Kenny McLean and I recently attended rmedicine 2019 in Boston MA. The conference is aimed at clinicians and non-clinicians who use R for day-to-day research and monitoring of clinical processes.
Day 1 covered two parallel workshops: R Markdown for Medicine and Wrangling Survival Data
I attended R Markdown for Medicine run by Alison Hill from RStudio. Using .rmd files has become the default for the Surgical Informatics Group and, so it seems, a great number of others who attended rmedicine. Around a third of the presentations at rmedicine covered workflows involving sharing of data via either .rmd files or through shiny, an R package for creating deploy-able dashboards for data visualisation and interactive exploration.
R Markdown for Medicine
An Overview of Useful Tips and Tricks
R Markdown is an extension of R which allows you to combine narrative text and R code within one document. This means your notes, code, results and plots are all in one place. Code is contained between three backticks, with an {r} after the first set. Inline code can also be used between single backticks followed by r (without the curly brackets) and then the code. This means that results can update automatically: for a trial, when you describe the numbers included / excluded, this only needs to be changed in one place so that the rest of the text (and / or flowcharts) updates automatically. It is also possible to mix and match chunks of code from other languages.
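As a small hypothetical example (the trial_data object is made up), an .rmd document mixing narrative text, inline code and a code chunk might look like this:

---
title: "Trial report"
output: html_document
---

We screened `r nrow(trial_data)` patients, of whom `r sum(trial_data$included)` were included.

```{r results-table}
# an ordinary R code chunk: three backticks, {r}, then code
table(trial_data$included)
```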
Use Params!
Parameters are set in the YAML header at the top of the .rmd document. If you set a parameter of data to a default .csv or .rda file then this can be changed for other similar files without creating a new document. A really useful example would be when you have multiple hospitals or multiple diseases, each with a separate data file; a report can then be generated for each file. If you use rmarkdown::render() along with purrr::pwalk() you can generate a separate output file for any number of hospitals / diseases / countries / individuals etc. in just a couple of lines of code.
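A sketch of that pattern (report.Rmd and its hospital parameter are hypothetical; the .rmd would declare hospital under params: in its YAML header):

hospitals = c("hospital_a", "hospital_b", "hospital_c")

reports = tibble::tibble(
  input       = "report.Rmd",
  output_file = paste0(hospitals, "_report.html"),
  params      = purrr::map(hospitals, ~ list(hospital = .x))
)

# One rendered report per hospital
purrr::pwalk(reports, rmarkdown::render)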
Use Helper Packages
There are some great .rmd helper packages to improve the workflow, improve the rendering of documents and generally make life easier.
bookdown allows several .rmd documents to be combined to a book but also has some general usefulness for single documents as well. Using bookdown::word_document2 or bookdown::html_document2 in the YAML header under the output field is designed to improve cross-referencing of tables and figures compared to the default versions.
wordcountaddin allows an accurate word count to be performed which will not count YAML or code etc. without knitting the document. This is much easier than knitting the document and then performing a word count!
citr allows automated insertion of markdown citations to assist with referencing. Check out my earlier blog on referencing to get an idea of how to set up .bib files. I may add another blog on this topic, watch this space!
xaringan is a useful package for creating HTML presentations with high levels of customisation. It is possible to use an additional .css file for even greater customisation and styling of your slides but xaringan offers a great deal of user-friendly options.
distill appears to be good at supporting mobile-friendly web publishing for scientific communication with flexible figure layouts, table pagination, LaTeX math support and incorporation of javascript.
There are countless other helper packages and more likely to be on their way. Many allow additional aesthetic modification of the output documents and may allow you to run R code rather than modifying a .css file.
List Numbering the Lazy Way
List numbering in .rmd works without needing to manually enter the correct numbers. Just make a list where every element begins with 1. and .rmd will transform it into an appropriately-numbered list. Great if you need to add in a new element to the middle of the list later!
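For example, writing this in the .rmd:

1. First item
1. Second item
1. An item squeezed in later

renders as a correctly numbered 1, 2, 3 list.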
Multiple Plots in a Grid
I’ve previously come across patchwork as a way to plot several plots into a grid which could be 1×2, 2×2, 3 in one column and one in the other etc. There are also two other packages cowplot and egg. I haven’t explored the differences between them but if you find that one doesn’t give you the exact customisation or alignment you need then possibly try another one. cowplot looks as if it might perform better at overlaying plots on top of another and at exact axis line matching.
Use the here package to help with file paths
here is a great package for swapping between Windows and Mac file paths (no more swapping backslashes and forward slashes!). Using here::here() will default to looking for a file in the .Rproj directory rather than the .rmd directory which is the default otherwise – great if you want to have multiple .rmd documents each in their own sub-directory with a shared data file in the parent directory.
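A small hypothetical example, assuming a data folder sitting at the project root:

library(here)
my_data = readr::read_csv(here("data", "my_data.csv"))   # resolves relative to the .Rproj directory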
Customise Code Outputs
R Markdown allows customisation of the appearance of code. Some of this can be done through modifying a .css file, but there are some simpler ways to make basic changes. Try adding comment = "#>" to knitr::opts_chunk$set() to customise how comments appear in your document.
Word document creation tips
R Markdown is generally great for HTML and PDF formats. The options for knitting to Word are not as well developed, but there are some good options. The bookdown package is useful, as discussed. The redoc package has been used to facilitate conversion to and from Word; I haven't tried it personally, but if it can print out to Word and then handle tracked changes back into markdown then it could be very useful.
For converting more complex tables and figures to Word, an option is to knit to rtf (rich text format) and then open the rtf file in Word. This tends to be very good at keeping the desired formatting.
Future updates – hopefully!
R Markdown is a great resource, although there are a handful of minor issues which are currently difficult to resolve. One of the main problems I find is with tables and cross-referencing. I really like the syntax and customisation of the gt package, but at present it appears that cross-referencing in a way which works across HTML, PDF and Word outputs is not supported – a great opportunity to submit a pull request if you think you can get this to work.
Other Useful rmedicine Packages and Ideas
survival Package Update
The latest version (version 3.0) of the survival package was presented by Terry Therneau and is now available on GitHub. This package is used by over 650 downstream dependencies. The latest version allows for multiple observations per subject, multiple endpoints per subject and multiple types of endpoint. This will be particularly useful for competing risks analyses, e.g. outcomes for liver transplant patients (transplanted, still on list, removed from list as no longer eligible, or died).
Keep an eye-out for Kenny McLean’s blog where he plans to cover the survival package and many other useful packages presented at rmedicine 2019.
hreport Automated Trial Reporting
hreport by Frank Harrell (currently available on GitHub) is for automated reporting of trials and studies, with generation of interactive HTML graphs based on plotly. Several aspects of a study can be rendered easily into plots demonstrating accrual, exclusions, descriptive statistics, adverse events and time-to-event data. Another key theme of rmedicine 2019 appears to have been the use of plotly or similar packages to enable interaction with data.
timevis – interactive timelines
timevis allows generation of highly interactive timeline plots which allow zooming, adding or removal of events, resizing, etc.
Holepunch package
For working with projects that require a number of packages that then need to be shared with a colleague, holepunch provides a quick method for generating a list of dependencies and a Dockerfile. The package creates a link for another user to open a free RStudio server with all of the required packages installed. This may be useful for troubleshooting in a department and showing code examples.
Summary
rmedicine 2019 has shown that clinical researchers are moving increasingly towards literate programming, interactive visualisations and automated workflows using R and Rmarkdown.
The conference was a great mix of methods presentations and data presentations from R users. You definitely don’t need any in-depth knowledge of R to benefit from it and I’d highly recommend booking for rmedicine 2020.