It’s now a week since I returned from RStudio::conf 2019 in Austin, Texas and in this blog I’m going to focus using the sparklyr package (spark-lee-r) which enables R to connect to an Apache Spark instance for general purpose cluster-computing. sparklyr has its own inbuilt functions as well as allowing dbplyr to do all of the amazing features I described in my first blog post: https://surgicalinformatics.org/quick-take-aways-from-rstudioconf-training-day-02/. The code contained in this blog should work on your own local RStudio without any preconfigured cluster should you wish to experiment with sparklyr’s capabilities.
Establishing
a connection
The following example code will help set up a local connection in
order to experiment with some of the functionality of the dbplyr package. This is
really useful if you are waiting for data or access to a database so you can have
pre-prepared scripts in progress without the remote database connection.
R
1
2
3
4
5
6
library(tidyverse)
library(dbplyr)
library(sparklyr)
library(nycflights13)
sc<-spark_connect(master="local")
The connection is typically stored as “sc” which you can also see
in the Environment. This is the object that is referenced each time data is
accessed in the spark cluster.
To check that the new connection to a spark instance has been
established go to the connections tab in your RStudio interface to see if the
connection has been established (this is typically located alongside your
“Environment” and “History” tabs. Click on the Spark UI button to view the user
interface for the current spark session in a browser (this will be helpful
later if you want to view an event log for the activity of your session).
Another way to check if the cluster is open is by using:
spark_connection_is_open(sc). This should return “TRUE” if the connection is
open.
Adding and manipulating data via the connection
Now that you have a connection established some
data can be added to the spark cluster:
spark_flights becomes an object in the local environment but is
really just a reference to the data in the spark cluster. Click on the
Connections tab and you should see that “my_flights” is now a data frame stored
in the cluster. The Spark UI page which opened in your browser will also now
show some of the changes you have made. Click the Storage tab in the UI and you
should see the data frame.
When manipulating the data the reference to the data frame within
the local environment can be treated as if the data was stored locally. One key
difference is that the creation of new data frames is delayed until the last
possible minute. The following example groups flights from the nycflights13 data
frame flights and calculated the average delay based on destination. Notice
that the real computation happens only once the “average_delay” data frame is
printed, the first command simply creates a reference in the local environment
in which is saved your intended action. Also notice the “lazy” approach which
occurs with sparklyr in which the total number of rows is not returned and is replaced
by “… with more rows”. If the full number of rows is then desired the collect
function can be used:
Have a look at the Spark UI and check out the SQL tab. Click on
one of the queries (highlight in blue) to get a breakdown of the components for
each task. Notice the difference between the query in which collect() was
used, it takes a lot longer to execute than the “lazy” query which sparklyr uses
by default. This is really useful if you want to leave the “heavy lifting” of
data transformation right until the end but if you then want to use an
intermediate data frame for several further transformations (this could be
sorting destinations based on average delay, only looking at destinations where
the average departure time was early etc.) then it might be useful to cache the
data within the cluster so that the data is transformed only once. The downside
to this approach may be additional memory requirements. The following code
using compute() will cache the intermediate data frame:
Now you should be able to see the “sub_flights” data frame in the Connections
tab, the Storage tab of the Spark UI and the SQL code generated in the SQL tab
of the UI. The cached_flights reference should also appear in the Environment
tab in RStudio.
Some extra
functions
As well as working through dplyr and dbplyr, sparkylr also comes
with its own functions for data analysis and transformation which may be useful
particularly when setting up pipelines you plan to execute later. A couple of
useful examples are the ft_binnarizer and ft_bucketizer commands which I
demonstrate determining destinations which are on average over 10 minutes
delayed and then demonstrate grouping by distance:
R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
cached_flights%>%
ft_binarizer(
input_col="avg_delay",
output_col="delayed",
threshold=15)%>%
select(avg_delay,delayed)%>%
head(100)
# # Source: spark<?> [?? x 2]
# avg_delay delayed
# <dbl> <dbl>
# 1 4.24 0
# 2 8.56 0
# 3 2.91 0
# 4 7.36 0
# 5 -7.87 0
# 6 7.47 0
# 7 8.90 0
# 8 11.8 1
# 9 10.6 1
# 10 10.6 1
# # ... with more rows
R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
spark_flights%>%
filter(!is.na(arr_delay))%>%
ft_bucketizer(
input_col="distance",
output_col="distance_grouping",
splits=c(0,500,1000,1500,2000,2500,6000))%>%
select(distance,distance_grouping)%>%
head(100)
# # Source: spark<?> [?? x 2]
# distance distance_grouping
# <dbl> <dbl>
# 1 1400 2
# 2 1416 2
# 3 1089 2
# 4 1576 3
# 5 762 1
# 6 719 1
# 7 1065 2
# 8 229 0
# 9 944 1
# 10 733 1
# # ... with more rows
These functions can be combined with others such as
sdf_partition, sdf_pivot and sdf_register to prepare a data set for predictive
modelling. Sparklyr has its own inbuilt functions for logistic regression
(ml_logistic_regression), predictive modelling (sdf_predict) and even some dedicated natural language
processing techniques (ft_tokenizer, ft_stop_words_remover).
To finish the session close down the connection using:
R
1
spark_disconnect(sc)
The connection should now be terminated meaning the Spark UI will
no longer be accessible and the connections tab has changed. Should you wish to
work with any data frames or aggregated results following the disconnect then
make sure to use collect() and create a new object before disconnecting.
For the past few days I’ve been in Austin, Texas with
Stephen Knight and Riinu Ots representing the Surgical Informatics Group at
RStudio::conf 2019. The conference brings together nearly 2000 data scientists,
developers and a couple of surgeons to learn the latest best practice and best
approaches when programming with R.
I have attended the Big Data workshop. “Big Data” is a bit
of a vague term but it can be helpful to think of Big Data as one of two
groups: data that is just so big that you can’t open it on your own computer
(imagine opening one of those massive files that just crashes your computer) or
data that is stored somewhere remotely and accessed through your computer by a
slow connection.
The key principles for handling Big Data effectively
include:
Safe storage (often a data administrator sorts
this)
Safe access (password protected in many cases
with care to avoid publishing passwords in R scripts)
Getting the database itself to do the heavy work
whilst leaving R to do the statistical analysis and plotting we know it does
best
Leave the data transformation until the latest
possible minute
Access the database as few times as possible
Today I’ll focus briefly on safe access, getting the
database to do all of the heavy lifting and leaving the transformation to the
last possible minute.
Safe Access
Using a R script to access a remote database usually
requires credentials. R needs to know where the data is stored and the database
needs to know whether it can allow you to access the data. There are lots of
different ways to set up a connection to a database but the DBI package and the
obdc package are going to come in very handy. You might also need to install a
package which supports the driver for the type of database.
There are loads of options when connecting to databases and
securing credentials but it’s key to avoid posting critical information like
passwords in plain text, for example:
con <- dbConnect(
odbc::odbc(),
Driver = "PostgreSQL",
Server =
"localhost",
UID = "myusername",
PWD = "my_unsecure_password",
Port = 5432,
Database =
"postgres"
)
Best Solution for
Securing Credentials
The most secure option for connecting to a database involves
using a Data Source Name (DSN) although this does require some
pre-configuration and the ability to perform the following:
1. Establish integrated security between the terminal and
the database, usually via Kerberos.
2. Pre-configure an ODBC connection in the Desktop or Server
(requires sufficient access rights). The ODBC connection will have a unique
Data Source Name, or DSN.
For example:
con <- DBI::dbConnect(odbc::odbc(), "My DSN
Name")
Easier alternatives
It is still possible to connect securely to a database using
either of the following techniques:
This will require a predefined username and password which
the user is prompted to enter on setting up the connection.
Finally, another option which may be easier for those
without data administrator is to use the config package and create a yml file.
After creating the yml file it can be used to configure the connection without
directly publishing credentials, in particular password. Instead, password is
retrieved from the yml file:
dw <- config::get("datawarehouse-dev")
con <- dbConnect(odbc::odbc(),
Driver = dw$driver,
Server = dw$server,
UID = dw$uid,
PWD = dw$pwd,
Port = dw$port,
Database =
dw$database
)
Making the database
do the work
For many physicians who work with R we are quite happy to
create data frames in the local environment, modify them there and then plot
the data from there. This works very well with smaller data sets which don’t
require much memory and even with medium-sized data sets (as long as you don’t
try to open the whole file using View()!). When working with Big Data often the
local environment isn’t large enough because of limited RAM. To give an example
if you are using all of the UK Biobank genetic data which amounts to over 12
terabytes then the modest 8 gigabytes (1500 times less) of RAM I have on my own
laptop just won’t do.
The solution is to manipulate the data remotely. Think bomb
disposal. You want to have a screen to show you what a bomb disposal robot is
doing and a way of controlling the robot but you don’t want to be up close and
personal with the robot doing the work. Big Data is exactly the same, you want
to see the results of the data transformations and plotting on your own device
but let the database do the work for you. Trying to bring the heavy lifting of
the data onto your own device creates problems and may result in the computer
or server crashing.
dplyr is a truly fantastic data manipulation package which
is part of the tidyverse and makes everyday data manipulation for clinicians
achievable, understandable and consistent. The great news is that when working
with remote data, you can use dplyr! The package dbplyr (database plyer)
converts the dplyr code into SQL which then runs in the remote database meaning
a familiarity with dplyr is almost all that’s needed to handle Big Data.
The show_query() function demonstrates just how much work goes
on under the hood of dbplyr:
flights %>%
summarise_if(is.numeric, mean, na.rm = TRUE) %>%
show_query
Output:
Applying predicate on the first 100
rows
<SQL>
SELECT AVG("flightid") AS
"flightid", AVG("year") AS "year",
AVG("month") AS "month", AVG("dayofmonth") AS
"dayofmonth", AVG("dayofweek") AS "dayofweek",
AVG("deptime") AS "deptime", AVG("crsdeptime") AS
"crsdeptime", AVG("arrtime") AS "arrtime",
AVG("crsarrtime") AS "crsarrtime",
AVG("flightnum") AS "flightnum",
AVG("actualelapsedtime") AS "actualelapsedtime",
AVG("crselapsedtime") AS "crselapsedtime",
AVG("airtime") AS "airtime", AVG("arrdelay") AS
"arrdelay", AVG("depdelay") AS "depdelay",
AVG("distance") AS "distance", AVG("taxiin") AS
"taxiin", AVG("taxiout") AS "taxiout",
AVG("cancelled") AS "cancelled", AVG("diverted")
AS "diverted", AVG("carrierdelay") AS
"carrierdelay", AVG("weatherdelay") AS
"weatherdelay", AVG("nasdelay") AS "nasdelay",
AVG("securitydelay") AS "securitydelay", AVG("lateaircraftdelay")
AS "lateaircraftdelay", AVG("score") AS "score"
FROM datawarehouse.flight
Leaving the Data
Transformation to the last possible minute
dbplyr prevents unnecessary work occurring within the
database until the user is explicit that they want some results. When modifying
a data frame using a remote connect and dbplyr the user can work with
references to the remote data frame in the local environment. When performing
and saving a data transformation then dplyr saves a reference to the intended
transformation (rather than actually transforming the data). Only when the user
is explicit that they want to see some of the data will dbplyr let the database
get busy and transform the data. The user can do this by plotting the data or
by using the collect() function to print out the resulting data frame. Working
with data in this way by saving up a pipeline of intended transformations and
only executing the transformation at the very end is a much more efficient way
of working with Big Data.
This week, Riinu, Steve, and Cameron are attending the annual RStudio Workshops (Tue-Wed), Conference (Thu-Fri), and the tidyverse developer day (Sat) in Austin, Texas.
We won’t even try to summarise everything we’re learning here, since the content is vast and the learning is very much hands on, but we will be posting a small selection of some take-aways in this blog.
We’re all attending different workshops (Machine Learning, Big Data, Markdown&Shiny).
Interesting take-away no. 1: terminology
Classification means categorical outcome variable.
Regression means continuous (numeric) outcome variable.
The is a bit confusing when using logistic regression – which by this definition is “classification”, rather than “regression”. But it is very common machine learning terminology, and makes sense considering the wide range of different methods used for classification (so not just regression).
Interesting take-away no. 2: library(parsnip)
The biggest strength of R is how many different packages (=extensions) it has. Basically, if you can think of a statistical or machine learning method, it’s probably implemented in R. This is because a lot of R users are also R developers – if you find a method that you really want to use, but that hasn’t been implemented yet, you can just go on and implement it youself. And then publish this new functionality as an R package than everyone can use.
However, this also means that different R packages sometimes do similar things using very different syntax. This is where the parsnip packages comes to resque, providing a unified interface for using some of these modelling packages.
For example:
R
1
2
3
4
5
spec_lin_reg=linear_reg()
spec_lm=set_engine(spec_lin_reg,"lm")
spec_stan=set_engine(spec_lin_reg,"stan")
spec_spark=set_engine(spec_lin_reg,"spark")
Instead of figuring out the syntax for lm() (basic linear regression model), and then for Stan, and Spark, and keras, set_engine() from library(parsnip) provides us with a unified interface for all of these different methods for linear regression.
A fully working example can be found in the course materials (all publicly available):
Interesting take-away no 3: Communication by a new means
The Rmarkdown workshop raised two interesting points within the first few mintues of starting – how prevalent communication by html has become (i.e. the internet, use of interactive documents and apps to relay industry and research data to colleagues and the wider commmunity).
But, maybe more importantly, how little is understood by the general public and how it can be used relatively easily for impressive interactivity with few lines of code….followed by the question – how about that raise boss??
For example, how using the package plotly can add immediate interactivity following on from all the ggplot basics learnt at healthyR:
And when you come across a website called “pimp my Rmarkdown” how can you not want to play!!!!
Interesting take-away no. 4: monitor progress with Viewer pane
Regular knitting, including at the start of an Rmd document to ensure any errors are highlighted early, is key. Your RMarkdown is a toddler who loves to misbehave. Previewing your document in a new window can take time and slow you down….
Frequent knitting into the Viewer pane can give you quick updates on how your code is behaving and identify bugs early!
The default in Rstudio loads your document into a new window when the Knit button is hit. A loading of a preview into the Viewer pane can be set as follows:
Tools tab > Global Options > RMarkdown > Set “Output preview in” to Viewer pane
Many of our projects involve getting doctors, nurses, and medical students to collect data on the patients they are looking after. We want to involve many of them in data analysis, without the requirement for coding experience or access to statistical software. To achieve this we have built Shinyfit, a shiny app for linear, logistic, and Cox PH regression.
Aim: allow access to model fitting without requirement for statistical software or coding experience.
Audience: Those sharing datasets in context of collaborative research or teaching.
Hosting requirements: Basic R coding skills including tidyverse to prepare dataset (5-10 minutes).
Linear, logistic or CPH regression tables Coefficient, odds ratio or hazard ratio plotsCrosstabsInspect dataset with ff_glimpse
Use your data
To use your own data, clone or download app from github.
Edit 0_prep.R to create a shinyfit_data object.
Test the app, usually within RStudio.
Deploy to your shiny hosting platform of choice.
Ensure you have permission to share the data
Editing 0_prep.R is straightforward and takes about 5 mins. The main purpose is to create human-readable menu items and allows sorting of variables into any categories, such as outcome and explanatory.
Errors in shinyfit are usually related to the underlying dataset, e.g.
Variables not appropriately specified as numerics or factors.
A particular factor level is empty, thus regression function (lm, glm, coxph etc.) gives error.
A variable with >2 factor levels is used as an outcome/dependent. This is not supported.
Use Glimpse tabs to check data when any error occurs.
It is fully mobile compliant, including datatables.
Many of our projects involve getting doctors, nurses, and medical students to collect data on the patients they are looking after. We want to involve many of them in data analysis, without the requirement for coding experience or access to statistical software. To achieve this we have built Shinyfit, a shiny app for linear, logistic, and Cox PH regression.
Aim: allow access to model fitting without requirement for statistical software or coding experience.
Audience: Those sharing datasets in context of collaborative research or teaching.
Hosting requirements: Basic R coding skills including tidyverse to prepare dataset (5-10 minutes).
Linear, logistic or CPH regression tables Coefficient, odds ratio or hazard ratio plotsCrosstabsInspect dataset with ff_glimpse
Use your data
To use your own data, clone or download app from github.
Edit 0_prep.R to create a shinyfit_data object.
Test the app, usually within RStudio.
Deploy to your shiny hosting platform of choice.
Ensure you have permission to share the data
Editing 0_prep.R is straightforward and takes about 5 mins. The main purpose is to create human-readable menu items and allows sorting of variables into any categories, such as outcome and explanatory.
Errors in shinyfit are usually related to the underlying dataset, e.g.
Variables not appropriately specified as numerics or factors.
A particular factor level is empty, thus regression function (lm, glm, coxph etc.) gives error.
A variable with >2 factor levels is used as an outcome/dependent. This is not supported.
Use Glimpse tabs to check data when any error occurs.
It is fully mobile compliant, including datatables.
With over 11,000 patients now entered into REDCap and more being entered every day we thought it would be a good time to reflect on some of the ways in which GlobalSurg 3 has been set up to help collaborators from around the world enter accurate and high-quality data.
With so many important things to know about each patient undergoing cancer surgery, the GlobalSurg team of nearly 3000 collaborators has been busy entering data into our secure database at redcap.globalsurg.org (REDCap is an amazing database software developed by Vanderbilt University). With over 750,000 values entered, it isn’t surprising that from time to time a mistake occurs whilst entering the data. This might be because the data entered onto a paper form was incorrect to begin with or it might be due to accidentally clicking on the wrong options when entering the data to REDCap. In some cases, the incorrect data might even appear in the notes if the surgeon or the anaesthetist has forgotten how to decide on the most appropriate ASA grade.
To try to help our collaborators identify cases when these mistakes may have happened we have taught our REDCap some basic anatomy, cancer classifications and some common sense so that it can alert collaborators to mistake mistakes as soon as they occur. The automatic alerts appear when a collaborator tries to save a page with incorrect data meaning that they can change it immediately when they still have access to the patient notes.
Our REDCap now knows 58 things about cancer surgery that it is using to help collaborators enter accurate data.
Figure 1. Examples of our (a) paper data collection form, (b) REDCap interface, and (c) a data quality pop-up warning.
All 58 rules are given at the bottom of this post, but here are some examples:
Our REDCap knows some basic anaesthesiology:
As a collaborator if you have tried to enter a patient with diabetes mellitus and stated that they have an ASA grade of 1 Our REDCap should have informed you that diabetic patients should really have an ASA grade of 2 or more
Our REDCap knows the basics of anatomy:
It knows that if a patient had a total colectomy that they don’t have a colostomy
Our REDCap knows about TNM staging:
It knows that patients with an M score of M1 should also have an Essential TNM score of M+
Our REDCap also knows some common sense:
It knows that patients can’t have more involved lymph nodes in a specimen that the total number of lymph nodes in a specimen
It knows that a patient couldn’t have their operation before being admitted to the hospital
Our REDCap has been working tirelessly for several months to generate these alerts and help collaborators ensure their data is accurate. We hope that training REDCap to detect problems with the data will make the GlobalSurg 3 analysis more efficient and contribute to the accuracy of the data.
The final data entry deadline is 17th December so remember to upload all of your data before then. Our REDCap is ready and waiting to store and check the data.
Now this is very good already with hardly any overlapping labels and the world is pretty recognisable. And really, you can make this plot with just 2 lines of code:
So what these two lines make is already very amazing.
But I feel like Europe is a little bit misshapen and that the Caribbean and Africa are too close together. So I divided the world into regions (in this case same as continents except Russia is it’s own region – it’s just so big). Then wrote two functions that asked ggrepel to plot each region separately and use patchwork to patch each region together:
As a journal editor, I often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data. This is frustrating, as it is often of the utmost importance. Conclusions may (and do) change when missing data is accounted for. A few seem to not even appreciate that in conventional regression, only rows with complete data are included.
These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:
Ensure your data are coded correctly.
Identify missing values within each variable.
Look for patterns of missingness.
Check for associations between missing and observed data.
Decide how to handle missing data.
Finalfit includes a number of functions to help with this.
Some confusing terminology
But first there are some terms which easy to mix up. These are important as they describe the mechanism of missingness and this determines how you can handle the missing data.
Missing completely at random (MCAR)
As it says, values are randomly missing from your dataset. Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.
For instance, when smoking status is not recorded in a random subset of patients.
This is easy to handle, but unfortunately, data are almost never missing completely at random.
Missing at random (MAR)
This is confusing and would be better stated as missing conditionally at random. Here, missing data do have a relationship with other variables in the dataset. However, the actual values that are missing are random.
For example, smoking status is not documented in female patients because the doctor was too shy to ask. Yes ok, not that realistic!
Missing not at random (MNAR)
The pattern of missingness is related to other variables in the dataset, but in addition, the values of the missing data are not random.
For example, when smoking status is not recorded in patients admitted as an emergency, who are also more likely to have worse outcomes from surgery.
Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle. They can only be detected by collecting and examining some of the missing data. This is often difficult or impossible to do.
How you deal with missing data is dependent on the type of missingness. Once you know this, then you can sort it.
More on this below.
1. Ensure your data are coded correctly: ff_glimpse
While clearly obvious, this step is often ignored in the rush to get results. The first step in any analysis is robust data cleaning and coding. Lots of packages have a glimpse function and finalfit is no different. This function has three specific goals:
Ensure all factors and numerics are correctly assigned. That is the commonest reason to get an error with a finalfit function. You think you’re using a factor variable, but in fact it is incorrectly coded as a continuous numeric.
Ensure you know which variables have missing data. This presumes missing values are correctly assigned NA. See here for more details if you are unsure.
Ensure factor levels and variable labels are assigned correctly.
Example scenario
Using the colon cancer dataset that comes with finalfit, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.
For demonstration purposes, we will create random MCAR and MAR smoking variables to the dataset.
# Make sure finalfit is up-to-date
install.packages("finalfit")
library(finalfit)
# Create some extra missing data
## Smoking missing completely at random
set.seed(1)
colon_s$smoking_mcar =
sample(c("Smoker", "Non-smoker", NA),
dim(colon_s)[1], replace=TRUE,
prob = c(0.2, 0.7, 0.1)) %>%
factor()
Hmisc::label(colon_s$smoking_mcar) = "Smoking (MCAR)"
## Smoking missing conditional on patient sex
colon_s$smoking_mar[colon_s$sex.factor == "Female"] =
sample(c("Smoker", "Non-smoker", NA),
sum(colon_s$sex.factor == "Female"),
replace = TRUE,
prob = c(0.1, 0.5, 0.4))
colon_s$smoking_mar[colon_s$sex.factor == "Male"] =
sample(c("Smoker", "Non-smoker", NA),
sum(colon_s$sex.factor == "Male"),
replace=TRUE, prob = c(0.15, 0.75, 0.1))
colon_s$smoking_mar = factor(colon_s$smoking_mar)
Hmisc::label(colon_s$smoking_mar) = "Smoking (MAR)"
The function summarises a data frame or tibble by numeric (continuous) variables and factor (discrete) variables. The dependent and explanatory are for convenience. Pass either or neither e.g. to summarise data frame or tibble:
colon %>%
ff_glimpse()
It doesn’t present well if you have factors with lots of levels, so you may want to remove these.
Use this to check that the variables are all assigned and behaving as expected. The proportion of missing data can be seen, e.g. smoking_mar has 23% missing data.
2. Identify missing values in each variable: missing_plot
In detecting patterns of missingness, this plot is useful. Row number is on the x-axis and all included variables are on the y-axis. Associations between missingness and observations can be easily seen, as can relationships of missingness between variables.
colon_s %>%
missing_plot()
Click to enlarge.
It was only when writing this post that I discovered the amazing package, naniar. This package is recommended and provides lots of great visualisations for missing data.
3. Look for patterns of missingness: missing_pattern
missing_pattern simply wraps mice::md.pattern using finalfit grammar. This produces a table and a plot showing the pattern of missingness between variables.
This allows us to look for patterns of missingness between variables. There are 14 patterns in this data. The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic.
Make sure you include missing data in demographics tables
Table 1 in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders. Do not silently drop missing values in this table. It is easy to do this correctly with summary_factorlist. This function provides a useful summary of a dependent variable against explanatory variables. Despite its name, continuous variables are handled nicely.
na_include=TRUE ensures missing data from the explanatory variables (but not dependent) are included. Note that any p-values are generated across missing groups as well, so run a second time with na_include=FALSE if you wish a hypothesis test only over observed data.
4. Check for associations between missing and observed data: missing_pairs | missing_compare
In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables. This is particularly important (I would say absolutely required) for a primary outcome measure / dependent variable.
Take for example “death”. When that outcome is missing it is often for a particular reason. For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery. And of course, death is more likely after emergency surgery.
missing_pairs uses functions from the excellent GGally package. It produces pairs plots to show relationships between missing values and observed values in all variables.
For continuous variables (age and nodes), the distributions of observed and missing data can be visually compared. Is there a difference between age and mortality above?
For discrete, data, counts are presented by default. It is often easier to compare proportions:
colon_s %>%
missing_pairs(dependent, explanatory, position = "fill", )
It should be obvious that missingness in Smoking (MCAR) does not relate to sex (row 6, column 3). But missingness in Smoking (MAR) does differ by sex (last row, column 3) as was designed above when the missing data were created.
We can confirm this using missing_compare.
explanatory = c("age", "sex.factor",
"nodes", "obstruct.factor")
dependent = "smoking_mcar"
colon_s %>%
missing_compare(dependent, explanatory)
Missing data analysis: Smoking (MCAR) Not missing Missing p
Age (years) Mean (SD) 59.7 (11.9) 59.9 (12.6) 0.867
Sex Female 399 (89.7) 46 (10.3) 0.616
Male 429 (88.6) 55 (11.4)
nodes Mean (SD) 3.6 (3.4) 4 (4.5) 0.990
Obstruction No 654 (89.3) 78 (10.7) 0.786
Yes 156 (88.6) 20 (11.4)
dependent = "smoking_mar"
colon_s %>%
missing_compare(dependent, explanatory)
Missing data analysis: Smoking (MAR) Not missing Missing p
Age (years) Mean (SD) 59.6 (11.9) 60.1 (12) 0.709
Sex Female 288 (64.7) 157 (35.3)
It takes “dependent” and “explanatory” variables, but in this context “dependent” just refers to the variable being tested for missingness against the “explanatory” variables.
Comparisons for continuous data use a Kruskal Wallis and for discrete data a chi-squared test.
As expected, a relationship is seen between Sex and Smoking (MAR) but not Smoking (MCAR).
For those who like an omnibus test
If you are work predominately with numeric rather than discrete data (categorical/factors), you may find these tests from the MissMech package useful. The package and output is well documented, and provides two tests which can be used to determine whether data are MCAR.
These pages from Karen Grace-Martin are great for this.
Prior to a standard regression analysis, we can either:
Delete the variable with the missing data
Delete the cases with the missing data
Impute (fill in) the missing data
Model the missing data
MCAR, MAR, or MNAR
MCAR vs MAR
Using the examples, we identify that Smoking (MCAR) is missing completely at random.
We know nothing about the missing values themselves, but we know of no plausible reason that the values of the missing data, for say, people who died should be different to the values of the missing data for those who survived. The pattern of missingness is therefore not felt to be MNAR.
Common solution
Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.
We therefore elect to simply omit the patients in whom smoking is missing. This is known as list-wise deletion and will be performed by default in standard regression analyses including finalfit.
explanatory = c("age", "sex.factor",
"nodes", "obstruct.factor",
"smoking_mcar")
dependent = "mort_5yr"
colon_s %>%
finalfit(dependent, explanatory, metrics=TRUE)
Dependent: Mortality 5 year Alive Died OR (univariable) OR (multivariable)
Age (years) Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.200)
Sex Female 243 (47.6) 194 (48.0) - -
Male 268 (52.4) 210 (52.0) 0.98 (0.76-1.27, p=0.889) 1.02 (0.76-1.38, p=0.872)
nodes Mean (SD) 2.7 (2.4) 4.9 (4.4) 1.24 (1.18-1.30, p
Other considerations
Sensitivity analysis
Omit the variable
Imputation
Model the missing data
If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis. A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model. Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labeling with “non-smoker”.
If smoking is not associated with the explanatory variable of interest (bowel obstruction) or the outcome, it may be considered not to be a confounder and so could be omitted. That neatly deals with the missing data issue, but of course may not be appropriate.
Imputation and modelling are considered below.
MCAR vs MAR
But life is rarely that simple.
Consider that the smoking variable is more likely to be missing if the patient is female (missing_compareshows a relationship). But, say, that the missing values are not different from the observed values. Missingness is then MAR.
If we simply drop all the cases (patients) in which smoking is missing (list-wise deletion), then proportionality we drop more females than men. This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or outcome.
Common solution
mice is our go to package for multiple imputation. That’s the process of filling in missing data using a best-estimate from all the other data that exists. When first encountered, this doesn’t sounds like a good idea.
However, taking our simple example, if missingness in smoking is predicted strongly by sex, and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and other variables in the dataset.
Imputation is not usually appropriate for the explanatory variable of interest or the outcome variable. With both of these, the hypothesis is that there is an meaningful association with other variables in the dataset, therefore it doesn’t make sense to use these variables to impute them.
Here is some code to run mice. The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process. Read the documentation carefully prior to doing this yourself.
# Multivariate Imputation by Chained Equations (mice)
library(finalfit)
library(dplyr)
library(mice)
explanatory = c("age", "sex.factor",
"nodes", "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"
colon_s %>%
select(dependent, explanatory) %>%
# Exclude outcome and explanatory variable of interest from imputation
dplyr::filter(!is.na(mort_5yr), !is.na(obstruct.factor)) %>%
# Run imputation with 10 imputed sets
mice(m = 10) %>%
# Run logistic regression on each imputed set
with(glm(formula(ff_formula(dependent, explanatory)),
family="binomial")) %>%
# Pool and summarise results
pool() %>%
summary(conf.int = TRUE, exponentiate = TRUE) %>%
# Jiggle into finalfit format
mutate(explanatory_name = rownames(.)) %>%
select(explanatory_name, estimate, `2.5 %`, `97.5 %`, p.value) %>%
condense_fit(estimate_suffix = " (multiple imputation)") %>%
remove_intercept() -> fit_imputed
# Use finalfit merge methods to create and compare results
colon_s %>%
summary_factorlist(dependent, explanatory, fit_id = TRUE) -> summary1
colon_s %>%
glmuni(dependent, explanatory) %>%
fit2df(estimate_suffix = " (univariable)") -> fit_uni
colon_s %>%
glmmulti(dependent, explanatory) %>%
fit2df(estimate_suffix = " (multivariable inc. smoking)") -> fit_multi
explanatory = c("age", "sex.factor",
"nodes", "obstruct.factor")
colon_s %>%
glmmulti(dependent, explanatory) %>%
fit2df(estimate_suffix = " (multivariable)") -> fit_multi_r
# Combine to final table
summary1 %>%
ff_merge(fit_uni) %>%
ff_merge(fit_multi_r) %>%
ff_merge(fit_multi) %>%
ff_merge(fit_imputed) %>%
select(-fit_id, -index)
label levels Alive Died OR (univariable) OR (multivariable) OR (multivariable inc. smoking) OR (multiple imputation)
Age (years) Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.122) 1.02 (1.00-1.03, p=0.010) 1.01 (1.00-1.02, p=0.116)
Sex Female 243 (55.6) 194 (44.4) - - - -
Male 268 (56.1) 210 (43.9) 0.98 (0.76-1.27, p=0.889) 0.98 (0.74-1.30, p=0.890) 0.88 (0.64-1.23, p=0.461) 0.99 (0.75-1.31, p=0.957)
nodes Mean (SD) 2.7 (2.4) 4.9 (4.4) 1.24 (1.18-1.30, p
The final table can easily be exported to Word or as a PDF as described else where.
By examining the coefficients, the effect of the imputation compared with the complete case analysis can be clearly seen.
Other considerations
Omit the variable
Imputing factors with new level for missing data
Model the missing data
As above, if the variable does not appear to be important, it may be omitted from the analysis. A sensitivity analysis in this context is another form of imputation. But rather than using all other available information to best-guess the missing data, we simply assign the value as above. Imputation is therefore likely to be more appropriate.
There is an alternative method to model the missing data for the categorical in this setting – just consider the missing data as a factor level. This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model. Multiple imputation is generally preferred.
library(dplyr)
colon_s %>%
mutate(
smoking_mar = forcats::fct_explicit_na(smoking_mar)
) %>%
finalfit(dependent, explanatory)
Dependent: Mortality 5 year Alive Died OR (univariable) OR (multivariable)
Age (years) Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.119)
Sex Female 243 (47.6) 194 (48.0) - -
Male 268 (52.4) 210 (52.0) 0.98 (0.76-1.27, p=0.889) 0.96 (0.72-1.30, p=0.809)
nodes Mean (SD) 2.7 (2.4) 4.9 (4.4) 1.24 (1.18-1.30, p
MNAR vs MAR
Missing not at random data is tough in healthcare. To determine if data are MNAR for definite, we need to know their value in a subset of observations (patients).
Using our example above. Say smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing cancer. Obstructing bowel cancers may be larger or their position may make the prognosis worse. Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis. The missing values for smoking may therefore not random. Smoking may be more common in the emergency patients and may be more common in those that die.
There is no easy way to handle this. If at all possible, try to get the missing data. Otherwise, take care when drawing conclusions from analyses where data are thought to be missing not at random.
Where to next
We are now doing more in Stan. Missing data can be imputed directly within a Stan model which feels neat. Stan doesn’t yet have the equivalent of NA which makes passing the data block into Stan a bit of a faff.
Alternatively, the missing data can be directly modelled in Stan. Examples are provided in the manual. Again, I haven’t found this that easy to do, but there are a number of Stan developments that will hopefully make this more straightforward in the future.