More Fun with Regression:

Confounding, interaction and random effects

The following blog post provides a general overview of some of the terms encountered when carrying out logistic regression and was inspired by attending the extremely informative HealthyR+: Practical logistic regression course at the University of Edinburgh.

  • Confounding
    • What is confounding?
    • Examples
  • Interaction
    • What are interaction effects?
    • Example
    • What happens if we overlook interactions?
    • How do we detect interactions?
    • Terminology
  • Random effects
    • Clustered data
    • Why should we be aware of clustered data?
    • A solution to clustering
    • Terminology
  • Brief summary

Confounding

What is confounding?

Confounding occurs when the association between an explanatory (exposure) and outcome variable is distorted, or confused, because another variable is independently associated with both. 

The timeline of events must also be considered, because a variable cannot be described as a confounder if it occurs after (and is directly related to) the explanatory variable of interest.  Instead, it is sometimes called a mediating variable, as it lies along the causal pathway between exposure and outcome.

Examples

Potential confounders often encountered in healthcare data include, for example, age, sex, smoking status, BMI, frailty and disease severity.  One of the ways these variables can be controlled for is by including them in regression models. 

In the Stanford marshmallow experiment, a potential confounder was left out – economic background – leading to an overestimate of the influence of a child’s willpower on their future life outcomes.

Another example is the alleged link between coffee drinking and lung cancer. More smokers than non-smokers are coffee drinkers, so if smoking is not accounted for in a model examining coffee-drinking habits, the results are likely to be confounded.
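To make this concrete, here is a small simulated sketch of the coffee and smoking example (the variable names, numbers and effect sizes are invented purely for illustration):

# Smoking drives both coffee drinking and cancer risk; coffee has no true effect
set.seed(1)
n = 1000
smoker = rbinom(n, 1, 0.3)
coffee = rbinom(n, 1, plogis(-1 + 2 * smoker))    # smokers drink more coffee
cancer = rbinom(n, 1, plogis(-4 + 2.5 * smoker))  # cancer risk depends on smoking only

# Unadjusted: coffee appears associated with cancer (confounded)
exp(coef(glm(cancer ~ coffee, family = "binomial")))

# Adjusted for smoking: the spurious association largely disappears
exp(coef(glm(cancer ~ coffee + smoker, family = "binomial")))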

Interaction

What are interaction effects?

In a previous blog post, we looked at how collinearity is used to describe the relationship between two very similar explanatory variables.  We can think of this as an extreme case of confounding, almost like entering the same variable into our model twice.  An interaction, on the other hand, occurs when the effect of an explanatory variable on the outcome depends on the value of another explanatory variable. 

When explanatory variables depend on each other to tell the whole story, this can be described as an interaction effect; it is not possible to understand the exact effect that one variable has on the outcome without first knowing the value of the other variable. 

The use of the word dependent here is potentially confusing as explanatory variables are often called independent variables, and the outcome variable is often called the dependent variable (see word clouds here). This is one reason why I tend to avoid the use of these terms.

Example

An interesting example of interaction occurs when examining our perceptions about climate change and the relationship between political preference and level of education. 

We would be missing an important piece of the story concerning attitudes to climate change if we looked in isolation at either education or political orientation.  This is because the two interact; as level of education increases amongst more conservative thinkers, perception about the threat of global warming decreases, but for liberal thinkers as the level of education increases, so too does the perception about the threat of global warming. 

Here is a link to the New York Times article on this story: https://www.nytimes.com/interactive/2017/11/14/upshot/climate-change-by-education.html

What happens if we overlook interactions?

If interaction effects are not considered, the output of the model might lead the investigator to the wrong conclusions. For instance, if each explanatory variable were plotted in isolation against the outcome variable, important information about the interaction between variables might be lost; only the main effects would be apparent.

On the other hand, if many variables are used in a model together, without first exploring the nature of potential interactions, it might be the case that unknown interaction effects are masking true associations between the variables.  This is known as confounding bias.

How do we detect interactions?

The best way to start exploring interactions is to plot the variables. Trends are more apparent when we use graphs to visualise these.

If the relationship between two exposure variables on an outcome variable is constant, then we might visualise this as a graph with two parallel lines.  Another way of describing this is additive effect modification.

Two explanatory variables (x1 and x2) are not dependent on each other to explain the outcome.

But if the effect of the exposure variables on the outcome is not constant then the lines will diverge. We can describe this as multiplicative effect modification.

Two explanatory variables (x1 and x2) are dependent on each other to explain the outcome.

Once an interaction has been identified graphically, the next step is to explore whether the interaction is statistically significant, for example by including an interaction term in the model, as sketched below.
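In R, one simple approach is to fit the model with and without an interaction term (x1 * x2) and compare the two. Here is a minimal sketch using simulated data (variable names and effect sizes are invented):

set.seed(1)
n  = 500
x1 = rnorm(n)
x2 = rbinom(n, 1, 0.5)
y  = rbinom(n, 1, plogis(-0.5 + 0.5 * x1 + 0.3 * x2 + 0.8 * x1 * x2))

fit_main        = glm(y ~ x1 + x2, family = "binomial")
fit_interaction = glm(y ~ x1 * x2, family = "binomial")   # expands to x1 + x2 + x1:x2

summary(fit_interaction)                          # inspect the x1:x2 term
anova(fit_main, fit_interaction, test = "Chisq")  # likelihood ratio test for the interaction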

Terminology

Some degree of ambiguity exists surrounding the terminology of interactions (and statistical terms in general!), but here are a few commonly encountered terms, often used synonymously. 

  • Interaction
  • Causal interaction
  • Effect modification
  • Effect heterogeneity

There are subtle differences between interaction and effect modification.  You can find out more in this article: On the distinction between interaction and effect modification.

Random effects

Clustered data

Many methods of statistical analysis are intended to be applied with the assumption that, within a data-set, an individual observation is not influenced by the value of another observation: it is assumed that all observations are independent of one another. 

This may not be the case however, if you are using data, for example, from various hospitals, where natural clustering or grouping might occur.  This happens if observations within individual hospitals have a slight tendency to be more similar to each other than to observations in the rest of the data-set.

Random effects modelling is used if the groups of clustered data can be considered as samples from a larger population.

Why should we be aware of clustered data?

Gathering insight into the exact nature of differences between groups may or may not be important to your analysis, but it is important to account for patterns of clustering, because otherwise measures such as standard errors, confidence intervals and p-values may be misleadingly small or narrow.  Random effects modelling is one approach which can account for this.

A solution to clustering

The random effects model assumes that having allowed for the random effects of the various clusters or groups, the observations within each individual cluster are still independent.  You can think of it as multiple levels of analysis – first there are the individual observations, and these are then nested within observations at a cluster level, hence an alternative name for this type of modelling is multilevel modelling.

Terminology

There are various terms which are used when referring to random effects modelling, although the terms are not entirely synonymous. Here are a few of them:

  • Random effects
  • Multilevel
  • Mixed-effect
  • Hierarchical

There are two main types of random effects models (a minimal sketch of both follows below):

  • Random intercept model: constrains the fitted lines for each cluster to be parallel
  • Random slope and intercept model: does not constrain the lines to be parallel
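As a rough illustration (not from the course material), here is how these two models might be specified for a binary outcome using the lme4 package, with simulated patients clustered within hypothetical hospitals:

library(lme4)

set.seed(1)
# Simulated clustered data: 20 hospitals, 50 patients each (names are invented)
my_data = data.frame(
  hospital = factor(rep(1:20, each = 50)),
  exposure = rnorm(1000)
)
hosp_effect = rnorm(20, 0, 0.8)[my_data$hospital]
my_data$outcome = rbinom(1000, 1, plogis(-1 + 0.5 * my_data$exposure + hosp_effect))

# Random intercept: each hospital gets its own baseline; effects are parallel
fit_ri = glmer(outcome ~ exposure + (1 | hospital),
               data = my_data, family = "binomial")

# Random slope and intercept: the exposure effect can also vary by hospital
# (may warn of a singular fit here, since no slope variation was simulated)
fit_rs = glmer(outcome ~ exposure + (1 + exposure | hospital),
               data = my_data, family = "binomial")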

Brief summary

To finish, here is a quick look at some of the key differences between confounding and interaction.

If you would like to learn more about these terms and how to carry out logistic regression in R, keep an eye on the HealthyR page for updates on courses available.

Fun with Regression

“All models are wrong, but some are useful”

George Box

This quote by statistician George Box feels like a good starting point from which to consider some of the challenges of regression modelling.  If we start with the idea that all models are wrong, it follows that one of the main skills in carrying out regression modelling is working out where the weaknesses are and how to minimise these to produce as close an approximation as possible to the data you are working with – to make the model useful.

The idea that producing high-quality regression models is often more of an art than a science appeals to me.  Understanding the underlying data, what you want to explore, and the tools you have at hand are essential parts of this process.

After attending the excellent HealthyR+: Practical Logistic Regression course a few weeks ago, my head was buzzing with probabilities, odds ratios and confounding.  It was not just the data which was confounded.  As someone fairly new to logistic regression, I thought it might be useful to jot down some of the areas I found particularly interesting and concepts which made me want to find out more.  In this first blog post we take a brief look at:

  • Probability and odds
    • The difference between probability and odds
    • Why use log(odds) and not just odds?
    • Famous probability problems
  • Collinearity and correlation
    • What is collinearity?
    • How do we detect collinearity?
    • Is collinearity a problem?

Probability and odds

The difference between probability and odds

Odds and probability are both measures of how likely it is that a certain outcome might occur in a series of events.  Probability is perhaps more intuitive to understand, but its properties make it less useful in statistical models, so odds, odds ratios and log(odds) are used instead; more on this in the next section.

Interestingly, when the probability of an event occurring is small – less than 0.1 (or 10%) – the odds and the probability are quite similar.  However, as the probability increases, the odds increase at a greater rate, see the following figure:

Here we can also see that whilst probabilities range from 0 to 1, odds can take on any value between 0 and infinity.
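You can reproduce this relationship yourself with a couple of lines of R:

prob = seq(0.01, 0.99, by = 0.01)
odds = prob / (1 - prob)

head(cbind(prob, odds))      # for small probabilities, odds are close to the probability
plot(prob, odds, type = "l",
  xlab = "Probability", ylab = "Odds")  # odds grow without bound as probability approaches 1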

Why use log(odds) and not just odds?

The asymmetry of the odds scale makes it difficult to compare binary outcomes, but by using log(odds) we can produce a symmetrical scale, see the figure below:

In logistic regression, the odds ratio for a particular variable represents the multiplicative change in odds for each one-unit increase in that variable, whilst holding all other variables constant.
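As a small illustration using the built-in mtcars data, the coefficients of a logistic regression are returned on the log(odds) scale and can be exponentiated to give odds ratios:

fit = glm(am ~ wt, data = mtcars, family = "binomial")

coef(fit)       # log(odds) scale: symmetrical around 0
exp(coef(fit))  # odds ratio: multiplicative change in odds per unit increase in wt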

Famous probability problems

I find probability problems fascinating, particularly those which seem counter-intuitive. Below are links to explanations of two intriguing probability problems:

Collinearity and correlation

What is collinearity?

The term collinearity (also referred to as multicollinearity) is used to describe a high correlation between two explanatory variables.  This can cause problems in regression modelling because the explanatory variables are assumed to be independent (and indeed are sometimes called independent variables, see word clouds below). 

The inclusion of collinear (highly correlated) variables in a regression model can give the false impression that, for example, neither variable is associated with the outcome when, in fact, each variable individually has a strong association.  The figure below might help to visualise the relationships between the variables:

In this image, y represents the outcome variable, and x1 and x2 are the highly correlated, collinear explanatory variables.  As you can see, there is a large area of (light grey) overlap between the x variables, whereas there are only two very small areas of independent overlap between each x variable and y.  These small areas represent the limited information available to the regression model when estimating each variable's individual effect.

How do we detect collinearity?

A regression coefficient can be thought of as the rate of change, or as the slope of the regression line.  The slope describes the mean change in the outcome variable for every unit of change in the explanatory variable.  It is important to note that regression coefficients are calculated based on the assumption that all other variables (apart from the variables of interest) are kept constant. 

When two variables are highly correlated, this creates problems. The model will try to predict the outcome but finds it hard to disentangle the influence of either of the explanatory variables due to their strong correlation. As a result, coefficient estimates may change erratically in response to small changes in the model.
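A quick way to see this in R is to simulate two almost identical explanatory variables and compare the models; variance inflation factors (e.g. car::vif()) are another common check. This simple linear example is purely illustrative:

set.seed(1)
n  = 100
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.05)   # x2 is almost identical to x1
y  = x1 + rnorm(n)

cor(x1, x2)                      # very high correlation

coef(summary(lm(y ~ x1)))        # x1 alone: clear association
coef(summary(lm(y ~ x2)))        # x2 alone: clear association
coef(summary(lm(y ~ x1 + x2)))   # together: unstable estimates with large standard errors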

Various terms are used to describe these x and y variables depending on context.  There are slight differences in the meanings, but here are a few terms that you might encounter:

The information I used to generate these word clouds was based on a crude estimate of the number of mentions in Google Scholar within the context of medical statistics.

Is collinearity a problem?

Collinearity is a problem if the purpose of your analysis is to explain the relationships between individual variables and the outcome; however, it has little effect on the overall predictive performance of your model.  That is, the model will provide accurate predictions using all the variables as one big bundle, but it will not be able to tell you about the effects of the individual, collinear variables.

If you are concerned with exploring specific interactions and you encounter collinearity, there are two main approaches you can take:

  • Drop one of the variables if it is not vital to your analysis
  • Combine the variables (e.g. weight and height can be combined to produce BMI, as sketched below)
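As a trivial sketch of the second option (illustrative data only):

library(dplyr)

patients = tibble(weight_kg = c(70, 85, 60), height_m = c(1.75, 1.80, 1.62))

patients = patients %>% 
  mutate(bmi = weight_kg / height_m^2) %>%   # single combined variable
  select(-weight_kg, -height_m)              # drop the collinear components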

An example of a publication where missed collinearity led to potentially erroneous conclusions, concerns analyses carried out on data relating to the World Trade Organisation (WTO). Here is a related article which attempts to unpick some of the problems with previous WTO research.

Finishing on an example of a problematic attempt at regression analysis may perhaps seem slightly gloomy, but on the contrary, I hope that this might provide comfort if your own analysis throws up challenges or problems – you are in good company!  It also brings us back to the quote by George Box at the beginning of this blog post, where we started with the premise that all models are wrong.  They are at best a close approximation, and we must always be alert to their weaknesses.

What next?

Look out for the next HealthyR+: Practical Logistic Regression course and sign up.  What areas of medical statistics do you find fun, puzzling, tricky, surprising? Let us know below.

Multiple imputation support in Finalfit

This post was originally published here

We are using multiple imputation more frequently to “fill in” missing data in clinical datasets. Multiple datasets are created, models run, and results pooled so conclusions can be drawn.

We’ve put some improvements into Finalfit on GitHub to make it easier to use with the mice package. These will go to CRAN soon but not immediately.

See finalfit.org/missing.html for more on handling missing data.

Let’s get straight to it by imputing smoking status in a cancer dataset.

Install

devtools::install_github("ewenharrison/finalfit")
library(finalfit)
library(dplyr)

Create missing data for example

# Smoking missing completely at random

set.seed(1)

colon_s = colon_s %>% 
  mutate(
    smoking_mcar = sample(c("Smoker", "Non-smoker", NA), 
      dim(colon_s)[1], replace=TRUE, 
      prob = c(0.2, 0.7, 0.1)) %>% 
    factor() %>% 
    ff_label("Smoking (MCAR)")
    )

# Smoking missing conditional on patient sex
colon_s$smoking_mar[colon_s$sex.factor == "Female"] = 
  sample(c("Smoker", "Non-smoker", NA), 
    sum(colon_s$sex.factor == "Female"), 
    replace = TRUE,
    prob = c(0.1, 0.5, 0.4)
  )

colon_s$smoking_mar[colon_s$sex.factor == "Male"] = 
  sample(c("Smoker", "Non-smoker", NA), 
    sum(colon_s$sex.factor == "Male"), 
    replace=TRUE, prob = c(0.15, 0.75, 0.1)
  )
 
colon_s = colon_s %>% 
  mutate(
    smoking_mar = factor(smoking_mar) %>% 
    ff_label("Smoking (MAR)")
  )

Check data

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
 dependent = "mort_5yr"
 colon_s %>% 
  ff_glimpse(dependent, explanatory)

 Continuous
            label var_type   n missing_n missing_percent mean   sd  min quartile_25 median quartile_75  max
age   Age (years)    <dbl> 929         0             0.0 59.8 11.9 18.0        53.0   61.0        69.0 85.0
nodes       nodes    <dbl> 911        18             1.9  3.7  3.6  0.0         1.0    2.0         5.0 33.0

Categorical
                           label var_type   n missing_n missing_percent levels_n
sex.factor                   Sex    <fct> 929         0             0.0        2
obstruct.factor      Obstruction    <fct> 908        21             2.3        2
mort_5yr        Mortality 5 year    <fct> 915        14             1.5        2
smoking_mcar      Smoking (MCAR)    <fct> 828       101            10.9        2
smoking_mar        Smoking (MAR)    <fct> 719       210            22.6        2
                                             levels  levels_count   levels_percent
sex.factor                         "Female", "Male"      445, 484           48, 52
obstruct.factor            "No", "Yes", "(Missing)"  732, 176, 21 78.8, 18.9,  2.3
mort_5yr               "Alive", "Died", "(Missing)"  511, 404, 14 55.0, 43.5,  1.5
smoking_mcar    "Non-smoker", "Smoker", "(Missing)" 645, 183, 101       69, 20, 11
smoking_mar     "Non-smoker", "Smoker", "(Missing)" 591, 128, 210       64, 14, 23

Multivariate Imputation by Chained Equations (mice)

mice is a great package and contains lots of useful functions for diagnosing and working with missing data. The purpose here is to demonstrate how mice can be integrated into the Finalfit workflow, with inclusion of models from imputed datasets in tables and plots.

Choose variables to impute and variables to impute from

finalfit::missing_predictorMatrix() makes it easy to specify which variables do what. For instance, we often do not want to impute our outcome or explanatory variable of interest (exposure), but we do want to use them to impute other variables.

This is straightforward to code using the arguments drop_from_imputed and drop_from_imputer.

library(mice)

# Specify model
explanatory = c("age", "sex.factor", "nodes", 
  "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

# Choose not to impute missing values
# for explanatory variable of interest and
# outcome variable. 
# But include in algorithm for imputation.
predM = colon_s %>% 
	select(dependent, explanatory) %>% 
	missing_predictorMatrix(
		drop_from_imputed = c("obstruct.factor", "mort_5yr")
	)

Create imputed datasets

A set of multiple imputed datasets (mids) can be created as below. Various checks should be performed to ensure you understand the data that has been created. See here.

mids = colon_s %>% 
  select(dependent, explanatory) %>%
  mice(m = 4, predictorMatrix = predM)    # Usually m = 10

Run models

Here we will use a logistic regression model. The with.mids() function takes a model with a formula object, so use base R functions rather than Finalfit wrappers.

fits = mids %>% 
  with(glm(formula(ff_formula(dependent, explanatory)), 
    family="binomial"))

We now have multiple models run with each of the imputed datasets. We haven’t found good methods for combining common model metrics like AIC and c-statistic. I’d be interested to hear from anyone working on methods for this. Metrics can be extracted for each individual model to give an idea of goodness-of-fit and discrimination. We’re not suggesting you use these to compare imputed datasets, but could use them to compare models containing different variables created using the imputed datasets, e.g.

fits %>% 
  getfit() %>% 
  purrr::map(AIC)
[[1]]
[1] 1192.57

[[2]]
[1] 1191.09

[[3]]
[1] 1195.49

[[4]]
[1] 1193.729

# C-statistic
fits %>% 
  getfit() %>% 
  purrr::map(~ pROC::roc(.x$y, .x$fitted)$auc)
[[1]]
Area under the curve: 0.6839

[[2]]
Area under the curve: 0.6818

[[3]]
Area under the curve: 0.6789

[[4]]
Area under the curve: 0.6836

Pool results

Rubin’s rules are used to combine results of multiple models.

# Pool  results
fits_pool = fits %>% 
  pool()

Plot results

Pooled results can be passed directly to Finalfit plotting functions.

# Can be passed to or_plot
colon_s %>% 
  or_plot(dependent, explanatory, glmfit = fits_pool, table_text_size=4)

Put results in table

The pooled result can be passed directly to fit2df() as can many common models such as lm(), glm(), lmer(), glmer(), coxph(), crr(), etc.

# Summarise and put in table
fit_imputed = fits_pool %>%                                  
  fit2df(estimate_name = "OR (multiple imputation)", exp = TRUE)
fit_imputed

         explanatory  OR (multiple imputation)
1                age 1.01 (1.00-1.02, p=0.212)
2     sex.factorMale 1.01 (0.77-1.34, p=0.917)
3              nodes 1.24 (1.18-1.31, p<0.001)
4 obstruct.factorYes 1.34 (0.94-1.91, p=0.105)
5  smoking_marSmoker 1.28 (0.88-1.85, p=0.192)

Combine results with summary data

Any model passed through fit2df() can be combined with a summary table generated with summary_factorlist() and any number of other models.

# Imputed data alone
## Include missing data in summary table
colon_s %>% 
  summary_factorlist(dependent, explanatory, na_include = TRUE, fit_id = TRUE) %>% 
  ff_merge(fit_imputed, last_merge = TRUE) 

           label     levels       Alive        Died  OR (multiple imputation)
1    Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.01 (1.00-1.02, p=0.212)
6            Sex     Female  243 (55.6)  194 (44.4)                         -
7                      Male  268 (56.1)  210 (43.9) 1.01 (0.77-1.34, p=0.917)
2          nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.31, p<0.001)
4    Obstruction         No  408 (56.7)  312 (43.3)                         -
5                       Yes   89 (51.1)   85 (48.9) 1.34 (0.94-1.91, p=0.105)
3                   Missing   14 (66.7)    7 (33.3)                         -
9  Smoking (MAR) Non-smoker  328 (56.4)  254 (43.6)                         -
10                   Smoker   68 (53.5)   59 (46.5) 1.28 (0.88-1.85, p=0.192)
8                   Missing  115 (55.8)   91 (44.2)                         -

Combine results with other models

Models can be run separately, or using the finalfit() wrapper with the argument keep_fit_id = TRUE.

colon_s %>% 
  finalfit(dependent, explanatory, keep_fit_id = TRUE) %>% 
  ff_merge(fit_imputed, last_merge = TRUE) 

  Dependent: Mortality 5 year                  Alive        Died          OR (univariable)        OR (multivariable)  OR (multiple imputation)
1                 Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.02 (1.00-1.03, p=0.010) 1.01 (1.00-1.02, p=0.212)
5                         Sex     Female  243 (47.6)  194 (48.0)                         -                         -                         -
6                                   Male  268 (52.4)  210 (52.0) 0.98 (0.76-1.27, p=0.889) 0.88 (0.64-1.23, p=0.461) 1.01 (0.77-1.34, p=0.917)
2                       nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p<0.001) 1.25 (1.18-1.33, p<0.001) 1.24 (1.18-1.31, p<0.001)
3                 Obstruction         No  408 (82.1)  312 (78.6)                         -                         -                         -
4                                    Yes   89 (17.9)   85 (21.4) 1.25 (0.90-1.74, p=0.189) 1.26 (0.85-1.88, p=0.252) 1.34 (0.94-1.91, p=0.105)
7               Smoking (MAR) Non-smoker  328 (82.8)  254 (81.2)                         -                         -                         -
8                                 Smoker   68 (17.2)   59 (18.8) 1.12 (0.76-1.65, p=0.563) 1.25 (0.82-1.89, p=0.300) 1.28 (0.88-1.85, p=0.192)

Model missing explicitly in complete case models

A straightforward method of modelling missing cases is to make them explicit using the forcats function fct_explicit_na().

library(forcats)
colon_s %>% 
  mutate(
    smoking_mar = fct_explicit_na(smoking_mar)
  ) %>% 
  finalfit(dependent, explanatory, keep_fit_id = TRUE) %>% 
  ff_merge(fit_imputed, last_merge = TRUE)

  Dependent: Mortality 5 year                  Alive        Died          OR (univariable)        OR (multivariable)  OR (multiple imputation)
1                 Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.119) 1.01 (1.00-1.02, p=0.212)
5                         Sex     Female  243 (47.6)  194 (48.0)                         -                         -                         -
6                                   Male  268 (52.4)  210 (52.0) 0.98 (0.76-1.27, p=0.889) 0.96 (0.72-1.30, p=0.809) 1.01 (0.77-1.34, p=0.917)
2                       nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p<0.001) 1.25 (1.19-1.32, p<0.001) 1.24 (1.18-1.31, p<0.001)
3                 Obstruction         No  408 (82.1)  312 (78.6)                         -                         -                         -
4                                    Yes   89 (17.9)   85 (21.4) 1.25 (0.90-1.74, p=0.189) 1.34 (0.94-1.91, p=0.102) 1.34 (0.94-1.91, p=0.105)
8               Smoking (MAR) Non-smoker  328 (64.2)  254 (62.9)                         -                         -                         -
9                                 Smoker   68 (13.3)   59 (14.6) 1.12 (0.76-1.65, p=0.563) 1.24 (0.82-1.88, p=0.308) 1.28 (0.88-1.85, p=0.192)
7                              (Missing)  115 (22.5)   91 (22.5) 1.02 (0.74-1.41, p=0.895) 0.99 (0.69-1.41, p=0.943)                         -

Export tables to PDF and Word

As described elsewhere, knitr::kable() can be used to export good-looking tables.

rmedicine2019 – some quick thoughts and good packages

Kenny McLean and I recently attended rmedicine 2019 in Boston MA. The conference is aimed at clinicians and non-clinicians who use R for day-to-day research and monitoring of clinical processes.

Day 1 covered two parallel workshops: R Markdown for Medicine and Wrangling Survival Data 

I attended R Markdown for Medicine run by Alison Hill from RStudio. Using .rmd files has become the default for the Surgical Informatics Group and, so it seems, a great number of others who attended rmedicine. Around a third of the presentations at rmedicine covered workflows involving sharing of data via either .rmd files or through shiny, an R package for creating deploy-able dashboards for data visualisation and interactive exploration.

R Markdown for Medicine

An Overview of Useful Tips and Tricks

R Markdown is an extension of R which allows you to combine narrative text and R code within one document. This means your notes, code, results and plots are all in one place. Code is contained between three backticks, with {r} after the first set. Inline code can also be used between single backticks, with the first backtick followed by r (no curly brackets) and then the code. This means that results can update automatically: for a trial, when you describe the numbers included and excluded, this only needs to be changed in one place so that the rest of the text (and/or flowcharts) updates automatically. It is also possible to mix and match chunks of code from other languages.

Use Params!

Parameters are set in the YAML header at the top of the .rmd document. If you set a data parameter to a default .csv or .rda file, then this can be changed for other similar files without creating a new document. A really useful example is when you have multiple hospitals or multiple diseases, each with a separate data file; a report can then be generated for each file. If you use rmarkdown::render() along with purrr::pwalk, you can generate a separate output file for any number of hospitals / diseases / countries / individuals etc. in just a couple of lines of code, as sketched below.
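A rough sketch of this workflow (the file name report.Rmd and the hospital parameter are hypothetical, and report.Rmd would declare hospital under params in its YAML header):

library(purrr)

hospitals = c("Hospital A", "Hospital B", "Hospital C")

pwalk(
  list(hospital = hospitals,
       output_file = paste0("report_", hospitals, ".html")),
  function(hospital, output_file) {
    rmarkdown::render("report.Rmd",
                      params = list(hospital = hospital),
                      output_file = output_file)
  }
)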

Use Helper Packages

There are some great .rmd helper packages which improve the workflow, improve the rendering of documents and generally make life easier.

bookdown allows several .rmd documents to be combined to a book but also has some general usefulness for single documents as well. Using bookdown::word_document2 or bookdown::html_document2 in the YAML header under the output field is designed to improve cross-referencing of tables and figures compared to the default versions.

wordcountaddin allows an accurate word count to be performed which will not count YAML or code etc. without knitting the document. This is much easier than knitting the document and then performing a word count!

citr allows automated insertion of markdown citations to assist with referencing. Check out my earlier blog on referencing to get an idea of how to set up .bib files. I may add another blog on this topic, watch this space!

xaringan is a useful package for creating HTML presentations with high levels of customisation. It is possible to use an additional .css file for even greater customisation and styling of your slides but xaringan offers a great deal of user-friendly options.

distill appears to be good at supporting mobile-friendly web publishing for scientific communication with flexible figure layouts, table pagination, LaTeX math support and incorporation of javascript.

There are countless other helper packages and more likely to be on their way. Many allow additional aesthetic modification of the output documents and may allow you to run R code rather than modifying a .css file.

List Numbering the Lazy Way

List numbering in .rmd works without needing to manually enter the correct numbers. Just make a list where every element begins with 1. and .rmd will transform it into an appropriately-numbered list. Great if you need to add in a new element to the middle of the list later!

Multiple Plots in a Grid

I’ve previously come across patchwork as a way to arrange several plots into a grid, which could be 1×2, 2×2, three in one column and one in the other, etc. There are also two other packages, cowplot and egg. I haven’t explored the differences between them, but if you find that one doesn’t give you the exact customisation or alignment you need, then possibly try another. cowplot looks as if it might perform better at overlaying plots on top of one another and at exact axis-line matching. A minimal patchwork sketch is below.
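For example, using the built-in mtcars data (the layouts are chosen arbitrarily):

library(ggplot2)
library(patchwork)

p1 = ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 = ggplot(mtcars, aes(factor(cyl))) + geom_bar()
p3 = ggplot(mtcars, aes(hp)) + geom_histogram(bins = 10)

p1 + p2          # two plots side by side
(p1 | p2) / p3   # two on top, one spanning the bottom row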

Use the here package to help with file paths

here is a great package for swapping between Windows and Mac file paths (no more swapping backslashes and forward slashes!). Using here::here() will default to looking for a file relative to the .Rproj directory rather than the .rmd directory, which is the default otherwise – great if you want to have multiple .rmd documents, each in their own sub-directory, with a shared data file in the parent directory. A minimal sketch follows.
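A minimal sketch (the data folder and file name are hypothetical):

library(here)

# Builds a path relative to the .Rproj root, whatever the .Rmd's location
path = here("data", "colon_data.csv")
path
# my_data = read.csv(path)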

Customise Code Outputs

R Markdown allows customisation of the appearance of code. Some of this can be done by modifying a .css file, but there are some simpler ways to make basic changes. Try adding comment = "#>" to knitr::opts_chunk$set() to customise the prefix shown before printed output in your document, for example:
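# Typically placed in a "setup" chunk at the top of the .Rmd
knitr::opts_chunk$set(
  comment = "#>",   # prefix shown before printed results
  echo    = TRUE,
  warning = FALSE,
  message = FALSE
)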

Word document creation tips

R Markdown is generally great for HTML and PDF formats. The options for knitting to Word are not as well developed, but there are some good options. The bookdown package is useful, as discussed above. The redoc package has been used to facilitate conversion to and from Word – I have not tried it personally, but if it can print out to Word and then handle tracked changes back into markdown, it could be very useful.

For converting more complex tables and figures to word an option is to knit to rtf (rich text format) and then open the rtf file in word. This tends to be very good at keeping the desired formatting.

Future updates – hopefully!

R Markdown is a great resource, although there are a handful of minor issues which are currently difficult to resolve. One of the main problems I find is with tables and cross-referencing. I really like the syntax and customisation of the gt package, but at present it appears that cross-referencing in a way which works across HTML, PDF and Word outputs is not supported – a great opportunity to submit a pull request if you think you can get this to work.

Other Useful rmedicine Packages and Ideas

survival Package Update

The latest version (version 3.0) of the survival package was presented by Terry Therneau and is now available on GitHub. This package is used by over 650 downstream dependencies. The latest version allows for multiple observations per subject, multiple endpoints per subject and multiple types of endpoint. This will be particularly useful for competing risks analyses, e.g. outcomes for liver transplant patients (transplanted, still on list, removed from list as no longer eligible, or died).

Keep an eye out for Kenny McLean’s blog, where he plans to cover the survival package and many other useful packages presented at rmedicine 2019.

hreport Automated Trial Reporting

hreport by Frank Harrell (currently available on GitHub) is for automated reporting of trials and studies, with generation of interactive HTML graphs based on plotly. Several aspects of a study can be rendered easily into plots demonstrating accrual, exclusions, descriptive statistics, adverse events and time-to-event data. Another key theme of rmedicine 2019 was the use of plotly or similar packages to enable interaction with data.

timevis – interactive timelines

timevis allows generation of highly interactive timeline plots which allow zooming, adding or removal of events, resizing, etc.

Holepunch package

For projects that require a number of packages which then need to be shared with a colleague, holepunch provides a quick method for generating a list of dependencies and a Dockerfile. The package creates a link for another user to open a free RStudio server with all of the required packages installed. This may be useful for trouble-shooting in a department and for showing code examples.

Summary

rmedicine 2019 has shown that clinical researchers are moving increasingly towards literate programming, interactive visualisations and automated workflows using R and Rmarkdown.

The conference was a great mix of methods presentations and data presentations from R users. You definitely don’t need any in-depth knowledge of R to benefit from it and I’d highly recommend booking for rmedicine 2020.

Survival analysis with strata, clusters, frailties and competing risks in Finalfit

This post was originally published here

Background

In healthcare, we deal with a lot of binary outcomes. Death yes/no, disease recurrence yes/no, for instance. These outcomes are often easily analysed using binary logistic regression via finalfit().

When the time taken for the outcome to occur is important, we need a different approach. For instance, in patients with cancer, the time taken until recurrence of the cancer is often just as important as the fact it has recurred.

Finalfit wraps a number of functions to make these analyses easy to perform and output into PDFs and Word documents.

Installation

# Make sure finalfit is up-to-date 
install.packages("finalfit")

Dataset

We’ll use the classic “Survival from Malignant Melanoma” dataset from the boot package to illustrate. The data consist of measurements made on patients with malignant melanoma. Each patient had their tumour removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark during the period 1962 to 1977.

For the purposes of demonstration, we are interested in the association between tumour ulceration and survival after surgery.

Get data and check

library(finalfit)
melanoma = boot::melanoma #F1 here for help page with data dictionary
ff_glimpse(melanoma)
#> Continuous
#>               label var_type   n missing_n missing_percent   mean     sd
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1
#> status       status    <dbl> 205         0             0.0    1.8    0.6
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5
#> age             age    <dbl> 205         0             0.0   52.5   16.7
#> year           year    <dbl> 205         0             0.0 1969.9    2.6
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5
#>              min quartile_25 median quartile_75    max
#> time        10.0      1525.0 2005.0      3042.0 5565.0
#> status       1.0         1.0    2.0         2.0    3.0
#> sex          0.0         0.0    0.0         1.0    1.0
#> age          4.0        42.0   54.0        65.0   95.0
#> year      1962.0      1968.0 1970.0      1972.0 1977.0
#> thickness    0.1         1.0    1.9         3.6   17.4
#> ulcer        0.0         0.0    0.0         1.0    1.0
#> 
#> Categorical
#> data frame with 0 columns and 205 rows

As can be seen, all variables are coded as numeric and some need recoding to factors.

Death status

status is the patient’s status at the end of the study.

  • 1 indicates that they had died from melanoma;
  • 2 indicates that they were still alive and;
  • 3 indicates that they had died from causes unrelated to their melanoma.

There are three options for coding this.

  • Overall survival: considering all-cause mortality, comparing 2 (alive) with 1 (died melanoma)/3 (died other);
  • Cause-specific survival: considering disease-specific mortality comparing 2 (alive)/3 (died other) with 1 (died melanoma);
  • Competing risks: comparing 2 (alive) with 1 (died melanoma) accounting for 3 (died other); see more below.

Time and censoring

time is the number of days from surgery until either the occurrence of the event (death) or the last time the patient was known to be alive. For instance, if a patient had surgery and was seen to be well in a clinic 30 days later, but there had been no contact since, then the patient’s follow-up time would be recorded as 30 days. This patient is censored from the analysis at day 30, an important feature of time-to-event analyses.

Recode

library(dplyr)
library(forcats)
melanoma = melanoma %>%
  mutate(
    # Overall survival
    status_os = case_when(
      status == 2 ~ 0, # "still alive"
      TRUE ~ 1), # "died melanoma" or "died other causes"
    
    # Disease-specific survival
    status_dss = case_when(
      status == 2 ~ 0,  # "still alive"
      status == 1 ~ 1,  # "died of melanoma"
      status == 3 ~ 0), # "died of other causes is censored"

    # Competing risks regression
    status_crr = case_when(
    	status == 2 ~ 0,  # "still alive"
        status == 1 ~ 1,  # "died of melanoma"
        status == 3 ~ 2), # "died of other causes"

    # Label and recode other variables
    age = ff_label(age, "Age (years)"), # table friendly labels
    thickness = ff_label(thickness, "Tumour thickness (mm)"),
    sex = factor(sex) %>% 
      fct_recode("Male" = "1", 
                 "Female" = "0") %>% 
      ff_label("Sex"),
    ulcer = factor(ulcer) %>% 
      fct_recode("No" = "0",
                 "Yes" = "1") %>% 
      ff_label("Ulcerated tumour")
  )

Kaplan-Meier survival estimator

We can use the excellent survival package to produce the Kaplan-Meier (KM) survival estimator. This is a non-parametric statistic used to estimate the survival function from time-to-event data. Note the use of %$% (from magrittr) to expose the left-hand side of the pipe to older-style R functions on the right-hand side.

library(survival)
library(magrittr) # provides the %$% exposition pipe

survival_object = melanoma %$% 
  Surv(time, status_os)

# Explore:
head(survival_object) # + marks censoring, in this case "Alive"
#> [1]  10   30   35+  99  185  204

# Expressing time in years
survival_object = melanoma %$% 
  Surv(time/365, status_os)

KM analysis for whole cohort

Model

The survival object is the first step to performing univariable and multivariable survival analyses.

If you want to plot survival stratified by a single grouping variable, you can substitute “survival_object ~ 1” by “survival_object ~ factor”

# Overall survival in whole cohort
my_survfit = survfit(survival_object ~ 1, data = melanoma)
my_survfit # 205 patients, 71 events
#> Call: survfit(formula = survival_object ~ 1, data = melanoma)
#> 
#>       n  events  median 0.95LCL 0.95UCL 
#>  205.00   71.00      NA    9.15      NA

Life table

A life table is the tabular form of a KM plot, which you may be familiar with. It shows survival as a proportion, together with confidence limits. The whole table is shown with summary(my_survfit).

summary(my_survfit, times = c(0, 1, 2, 3, 4, 5))
#> Call: survfit(formula = survival_object ~ 1, data = melanoma)
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     0    205       0    1.000  0.0000        1.000        1.000
#>     1    193      11    0.946  0.0158        0.916        0.978
#>     2    183      10    0.897  0.0213        0.856        0.940
#>     3    167      16    0.819  0.0270        0.767        0.873
#>     4    160       7    0.784  0.0288        0.730        0.843
#>     5    122      10    0.732  0.0313        0.673        0.796
# 5 year overall survival is 73%

Kaplan-Meier plot

We can plot survival curves using the finalfit wrapper for the excellent survminer package. There are numerous options available on the help page. You should always include a number-at-risk table under these plots, as it is essential for interpretation.

As can be seen, the probability of dying is much greater if the tumour was ulcerated, compared to those that were not ulcerated.

dependent_os = "Surv(time/365, status_os)"
explanatory = "ulcer"

melanoma %>% 
  surv_plot(dependent_os, explanatory, pval = TRUE)

Cox-proportional hazards regression

CPH regression can be performed using the all-in-one finalfit() function. It produces a table containing counts (proportions) for factors, mean (SD) for continuous variables and a univariable and multivariable CPH regression.

A hazard is the term given to the rate at which events happen.
The probability that an event will happen over a period of time is the hazard multiplied by the time interval.
An assumption of CPH is that hazards are constant over time (see below).

Univariable and multivariable models

dependent_os = "Surv(time, status_os)"
dependent_dss = "Surv(time, status_dss)"
dependent_crr = "Surv(time, status_crr)"
explanatory = c("age", "sex", "thickness", "ulcer")

melanoma %>% 
    finalfit(dependent_os, explanatory)

The labelling of the final table can be easily adjusted as desired.

melanoma %>% 
    finalfit(dependent_os, explanatory, add_dependent_label = FALSE) %>% 
    rename("Overall survival" = label) %>% 
    rename(" " = levels) %>% 
    rename(" " = all)

Reduced model

If you are using a backwards selection approach or similar, a reduced model can be directly specified and compared. The full model can be kept or dropped.

explanatory_multi = c("age", "thickness", "ulcer")
melanoma %>% 
    finalfit(dependent_os, explanatory, explanatory_multi, 
      keep_models = TRUE)

Testing for proportional hazards

An assumption of CPH regression is that the hazard associated with a particular variable does not change over time. For example, is the magnitude of the increase in risk of death associated with tumour ulceration the same in the early post-operative period as it is in later years?

The cox.zph() function from the survival package allows us to test this assumption for each variable. The plot of scaled Schoenfeld residuals should be a horizontal line. The included hypothesis test identifies whether the gradient differs from zero for each variable. No variable significantly differs from zero at the 5% significance level.

explanatory = c("age", "sex", "thickness", "ulcer", "year")
melanoma %>% 
    coxphmulti(dependent_os, explanatory) %>% 
    cox.zph() %>% 
    {zph_result <<- .} %>% 
    plot(var=5)
zph_result
#>               rho  chisq      p
#> age        0.1633 2.4544 0.1172
#> sexMale   -0.0781 0.4473 0.5036
#> thickness -0.1493 1.3492 0.2454
#> ulcerYes  -0.2044 2.8256 0.0928
#> year       0.0195 0.0284 0.8663
#> GLOBAL         NA 8.4695 0.1322

Stratified models

One approach to dealing with a violation of the proportional hazards assumption is to stratify by that variable. Including a strata() term results in a separate baseline hazard function being fitted for each level of the stratification variable. It will no longer be possible to make direct inference on the effect associated with that variable.

This can be incorporated directly into the explanatory variable list.

explanatory= c("age", "sex", "ulcer", "thickness", "strata(year)")
melanoma %>% 
    finalfit(dependent_os, explanatory)

Correlated groups of observations

As a general rule, you should always try to account for any higher structure in the data within the model. For instance, patients may be clustered within particular hospitals.

There are two broad approaches to dealing with correlated groups of observations.

Including a cluster() term is akin to using generalised estimating equations (GEE). Here, a standard CPH model is fitted but the standard errors of the estimated hazard ratios are adjusted to account for correlations.

Including a frailty() term is akin to using a mixed effects model, where specific random effects term(s) are directly incorporated into the model.

Both approaches achieve the same goal in different ways. Volumes have been written on GEE vs mixed effects models. We favour the latter approach because of its flexibility and our preference for mixed effects modelling in generalised linear modelling. Note cluster() and frailty() terms cannot be combined in the same model.

# Simulate random hospital identifier
melanoma = melanoma %>% 
  mutate(hospital_id = c(rep(1:10, 20), rep(11, 5)))

# Cluster model
explanatory = c("age", "sex", "thickness", "ulcer", "cluster(hospital_id)")
melanoma %>% 
  finalfit(dependent_os, explanatory)
# Frailty model
explanatory = c("age", "sex", "thickness", "ulcer", "frailty(hospital_id)")
melanoma %>% 
  finalfit(dependent_os, explanatory)

The frailty() method here is being superseded by the coxme package, and we’ll incorporate this soon.

Hazard ratio plot

A plot of any of the above models can be produced by passing the terms to hr_plot().

melanoma %>% 
    hr_plot(dependent_os, explanatory)

Competing risks regression

Competing-risks regression is an alternative to CPH regression. It can be useful if the outcome of interest may not be able to occur because something else (like death) has happened first. For instance, in our example it is obviously not possible for a patient to die from melanoma if they have died from another disease first. If we simply look at cause-specific mortality (deaths from melanoma) and consider other deaths as censored, our estimates of the influence of predictors may be biased.

The approach by Fine and Gray is one option for dealing with this. It is implemented in the package cmprsk. The crr() syntax differs from survival::coxph() but finalfit brings these together.

The example below uses the finalfit::ff_merge() function, which can join any number of models together.

explanatory = c("age", "sex", "thickness", "ulcer")
dependent_dss = "Surv(time, status_dss)"
dependent_crr = "Surv(time, status_crr)"

melanoma %>%

  # Summary table
  summary_factorlist(dependent_dss, explanatory, 
    column = TRUE, fit_id = TRUE) %>%

  # CPH univariable
  ff_merge(
    melanoma %>%
      coxphuni(dependent_dss, explanatory) %>%
      fit2df(estimate_suffix = " (DSS CPH univariable)")
    ) %>%
    
# CPH multivariable
  ff_merge(
    melanoma %>%
      coxphmulti(dependent_dss, explanatory) %>%
      fit2df(estimate_suffix = " (DSS CPH multivariable)")
    ) %>%
    
# Fine and Gray competing risks regression
  ff_merge(
    melanoma %>%
      crrmulti(dependent_crr, explanatory) %>%
      fit2df(estimate_suffix = " (competing risks multivariable)")
    ) %>%

  select(-fit_id, -index) %>%
  dependent_label(melanoma, "Survival")

Summary

So here we have various aspects of time-to-event analysis commonly used when looking at survival. There are many other applications, some of which may not be obvious: for instance, we use CPH for modelling length of stay in hospital.

Stratification can be used to deal with non-proportional hazards in a particular variable.

Hierarchical structure in your data can be accommodated with cluster or frailty (random effects) terms.

Competing risks regression may be useful if your outcome is in competition with another, such as all-cause death, but is currently limited in its ability to accommodate hierarchical structures.

Five steps for missing data with Finalfit

This post was originally published here

As a journal editor, I often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data. This is frustrating, as it is often of the utmost importance. Conclusions may (and do) change when missing data is accounted for.  A few seem to not even appreciate that in conventional regression, only rows with complete data are included.

These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:

  1. Ensure your data are coded correctly.
  2. Identify missing values within each variable.
  3. Look for patterns of missingness.
  4. Check for associations between missing and observed data.
  5. Decide how to handle missing data.

Finalfit includes a number of functions to help with this.

Some confusing terminology

But first, there are some terms which are easy to mix up. These are important, as they describe the mechanism of missingness, and this determines how you can handle the missing data.

Missing completely at random (MCAR)

As it says, values are randomly missing from your dataset. Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.

For instance, when smoking status is not recorded in a random subset of patients.

This is easy to handle, but unfortunately, data are almost never missing completely at random.

Missing at random (MAR)

This is confusing and would be better stated as missing conditionally at random. Here, missing data do have a relationship with other variables in the dataset. However, the actual values that are missing are random.

For example, smoking status is not documented in female patients because the doctor was too shy to ask. Yes ok, not that realistic!

Missing not at random (MNAR)

The pattern of missingness is related to other variables in the dataset, but in addition, the values of the missing data are not random.

For example, when smoking status is not recorded in patients admitted as an emergency, who are also more likely to have worse outcomes from surgery.

Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle. They can only be detected by collecting and examining some of the missing data. This is often difficult or impossible to do.

How you deal with missing data is dependent on the type of missingness. Once you know this, then you can sort it.

More on this below.

1. Ensure your data are coded correctly: ff_glimpse

While clearly obvious, this step is often ignored in the rush to get results. The first step in any analysis is robust data cleaning and coding. Lots of packages have a glimpse function and finalfit is no different. This function has three specific goals:

  1. Ensure all factors and numerics are correctly assigned. That is the commonest reason to get an error with a finalfit function. You think you’re using a factor variable, but in fact it is incorrectly coded as a continuous numeric.
  2. Ensure you know which variables have missing data. This presumes missing values are correctly assigned NA. See here for more details if you are unsure.
  3. Ensure factor levels and variable labels are assigned correctly.

Example scenario

Using the colon cancer dataset that comes with finalfit, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.

For demonstration purposes, we will add random MCAR and MAR smoking variables to the dataset.

The function summarises a data frame or tibble by numeric (continuous) variables and factor (discrete) variables. The dependent and explanatory arguments are for convenience; pass either, or neither, e.g. to summarise the whole data frame or tibble:
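For example, using the colon_s data and the smoking variables created for this example (a minimal sketch, reusing the variable lists shown in the multiple imputation post above):

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  ff_glimpse(dependent, explanatory)

# Or summarise the whole data frame / tibble
colon_s %>% 
  ff_glimpse()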

It doesn’t present well if you have factors with lots of levels, so you may want to remove these.

Use this to check that the variables are all assigned and behaving as expected. The proportion of missing data can be seen, e.g. smoking_mar has 23% missing data.

2. Identify missing values in each variable: missing_plot

This plot is useful for detecting patterns of missingness. Row number is on the x-axis and all included variables are on the y-axis. Associations between missingness and observations can be easily seen, as can relationships of missingness between variables.
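For the colon cancer example this is a one-liner:

colon_s %>% 
  missing_plot()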

It was only when writing this post that I discovered the amazing package, naniar. This package is recommended and provides lots of great visualisations for missing data.

3. Look for patterns of missingness: missing_pattern

missing_pattern simply wraps mice::md.pattern using finalfit grammar. This produces a table and a plot showing the pattern of missingness between variables.

This allows us to look for patterns of missingness between variables. There are 14 patterns in this data. The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic.
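Using the same dependent and explanatory variables as above:

colon_s %>% 
  missing_pattern(dependent, explanatory)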

Make sure you include missing data in demographics tables

Table 1 in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders. Do not silently drop missing values in this table. It is easy to do this correctly with summary_factorlist. This function provides a useful summary of a dependent variable against explanatory variables. Despite its name, continuous variables are handled nicely.

na_include=TRUE ensures missing data from the explanatory variables (but not the dependent) are included. Note that any p-values are generated across missing groups as well, so run a second time with na_include=FALSE if you wish a hypothesis test only over the observed data. For example:
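colon_s %>% 
  summary_factorlist(dependent, explanatory, 
    na_include = TRUE, p = TRUE)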

4. Check for associations between missing and observed data: missing_pairs | missing_compare

In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables. This is particularly important (I would say absolutely required) for a primary outcome measure / dependent variable.

Take for example “death”. When that outcome is missing it is often for a particular reason. For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery. And of course, death is more likely after emergency surgery.

missing_pairs uses functions from the excellent GGally package. It produces pairs plots to show relationships between missing values and observed values in all variables.

For continuous variables (age and nodes), the distributions of observed and missing data can be visually compared. Is there a difference between age and mortality above?

For discrete data, counts are presented by default. It is often easier to compare proportions:
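Something like the following (the position argument is assumed here to switch the bars from counts to proportions; check the missing_pairs() help page for the exact options):

colon_s %>% 
  missing_pairs(dependent, explanatory, position = "fill")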

It should be obvious that missingness in Smoking (MCAR) does not relate to sex (row 6, column 3). But missingness in Smoking (MAR) does differ by sex (last row, column 3), as was designed above when the missing data were created.

We can confirm this using missing_compare.

It takes “dependent” and “explanatory” variables, but in this context “dependent” just refers to the variable being tested for missingness against the “explanatory” variables.

Comparisons for continuous data use a Kruskal-Wallis test, and for discrete data a chi-squared test.

As expected, a relationship is seen between Sex and Smoking (MAR) but not Smoking (MCAR).
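A sketch of that comparison for both smoking variables, testing their missingness against the remaining explanatory variables:

explanatory = c("age", "sex.factor", "nodes", "obstruct.factor")

colon_s %>% 
  missing_compare("smoking_mcar", explanatory)

colon_s %>% 
  missing_compare("smoking_mar", explanatory)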

For those who like an omnibus test

If you work predominantly with numeric rather than discrete data (categorical/factors), you may find these tests from the MissMech package useful. The package and output are well documented, and it provides two tests which can be used to determine whether data are MCAR.

5. Decide how to handle missing data

These pages from Karen Grace-Martin are great for this.

Prior to a standard regression analysis, we can either:

  • Delete the variable with the missing data
  • Delete the cases with the missing data
  • Impute (fill in) the missing data
  • Model the missing data

MCAR, MAR, or MNAR

MCAR vs MAR

Using the examples, we identify that Smoking (MCAR) is missing completely at random.

We know nothing about the missing values themselves, but we know of no plausible reason that the values of the missing data for, say, people who died should be different to the values of the missing data for those who survived. The pattern of missingness is therefore not felt to be MNAR.

Common solution

Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.

We therefore elect to simply omit the patients in whom smoking is missing. This is known as list-wise deletion and will be performed by default in standard regression analyses including finalfit.
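For example, a sketch of the complete case analysis (as in the fuller example later in this document):

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar")
dependent = "mort_5yr"

colon_s %>% 
  finalfit(dependent, explanatory, metrics=TRUE)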

Other considerations

 

  1. Sensitivity analysis
  2. Omit the variable
  3. Imputation
  4. Model the missing data

If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis. A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model. Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labelling with “non-smoker”.

 

If smoking is not associated with the explanatory variable of interest (bowel obstruction) or the outcome, it may be considered not to be a confounder  and so could be omitted. That neatly deals with the missing data issue, but of course may not be appropriate.

 

Imputation and modelling are considered below.

 

MCAR vs MAR

 

But life is rarely that simple.

 

Consider that the smoking variable is more likely to be missing if the patient is female (missing_compare shows a relationship). But, say, the missing values themselves are no different from the observed values. Missingness is then MAR.

 

If we simply drop all the cases (patients) in which smoking is missing (list-wise deletion), then proportionally we drop more females than males. This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or outcome.

 

Common solution

 

mice is our go-to package for multiple imputation. That’s the process of filling in missing data using a best estimate from all the other data that exist. When first encountered, this doesn’t sound like a good idea.

 

However, taking our simple example, if missingness in smoking is predicted strongly by sex, and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and other variables in the dataset.

 

Imputation is not usually appropriate for the explanatory variable of interest or the outcome variable. With both of these, the hypothesis is that there is a meaningful association with other variables in the dataset, therefore it doesn’t make sense to use these variables to impute them.

 

Here is some code to run mice. The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process. Read the documentation carefully prior to doing this yourself.
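A condensed sketch of that workflow follows; the full pipeline, including merging the imputed results into a comparison table, appears later in this document.

library(finalfit)
library(dplyr)
library(mice)

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  select(dependent, explanatory) %>% 
  # Keep the outcome and explanatory variable of interest complete
  dplyr::filter(!is.na(mort_5yr), !is.na(obstruct.factor)) %>%
  # Create 10 imputed datasets
  mice(m = 10) %>% 
  # Fit the logistic regression model in each imputed dataset
  with(glm(formula(ff_formula(dependent, explanatory)), 
    family = "binomial")) %>%
  # Pool coefficients across the imputed datasets and summarise
  pool() %>% 
  summary(conf.int = TRUE, exponentiate = TRUE)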

 

The final table can easily be exported to Word or as a PDF as described elsewhere.

 

By examining the coefficients, the effect of the imputation compared with the complete case analysis can be clearly seen.

 

Other considerations

 

  1. Omit the variable
  2. Imputing factors with new level for missing data
  3. Model the missing data

As above, if the variable does not appear to be important, it may be omitted from the analysis. A sensitivity analysis in this context is another form of imputation. But rather than using all other available information to best-guess the missing data, we simply assign the value as above. Imputation is therefore likely to be more appropriate.

 

There is an alternative method to model the missing data for categorical variables in this setting – just treat the missing values as an explicit factor level. This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model. Multiple imputation is generally preferred.
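For example, a sketch using forcats (as in the fuller example later in this document):

library(dplyr)

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  mutate(
    # Convert NA to an explicit "(Missing)" factor level
    smoking_mar = forcats::fct_explicit_na(smoking_mar)
  ) %>% 
  finalfit(dependent, explanatory)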

 

MNAR vs MAR

 

Missing not at random data are tough in healthcare. To determine definitively whether data are MNAR, we need to know the values of the missing data in a subset of observations (patients).

 

Using our example above, say smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing cancer. Obstructing bowel cancers may be larger or their position may make the prognosis worse. Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis. The missing values for smoking may therefore not be random: smoking may be more common in the emergency patients and more common in those that die.

 

There is no easy way to handle this. If at all possible, try to get the missing data. Otherwise, take care when drawing conclusions from analyses where data are thought to be missing not at random.

 

Where to next

 

We are now doing more in Stan. Missing data can be imputed directly within a Stan model which feels neat. Stan doesn’t yet have the equivalent of NA which makes passing the data block into Stan a bit of a faff.

 

Alternatively, the missing data can be directly modelled in Stan. Examples are provided in the manual. Again, I haven’t found this that easy to do, but there are a number of Stan developments that will hopefully make this more straightforward in the future.

Five steps for missing data with Finalfit

This post was originally published here

As a journal editor, I often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data. This is frustrating, as it is often of the utmost importance. Conclusions may (and do) change when missing data are accounted for. A few do not even seem to appreciate that in conventional regression, only rows with complete data are included.

These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:

  1. Ensure your data are coded correctly.
  2. Identify missing values within each variable.
  3. Look for patterns of missingness.
  4. Check for associations between missing and observed data.
  5. Decide how to handle missing data.

Finalfit includes a number of functions to help with this.

Some confusing terminology

But first there are some terms which are easy to mix up. These are important, as they describe the mechanism of missingness, and this determines how you can handle the missing data.

Missing completely at random (MCAR)

As it says, values are randomly missing from your dataset. Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.

For instance, when smoking status is not recorded in a random subset of patients.

This is easy to handle, but unfortunately, data are almost never missing completely at random.

Missing at random (MAR)

This is confusing and would be better stated as missing conditionally at random. Here, missing data do have a relationship with other variables in the dataset. However, the actual values that are missing are random.

For example, smoking status is not documented in female patients because the doctor was too shy to ask. Yes ok, not that realistic!

Missing not at random (MNAR)

The pattern of missingness is related to other variables in the dataset, but in addition, the values of the missing data are not random.

For example, when smoking status is not recorded in patients admitted as an emergency, who are also more likely to have worse outcomes from surgery.

Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle. They can only be detected by collecting and examining some of the missing data. This is often difficult or impossible to do.

How you deal with missing data is dependent on the type of missingness. Once you know this, then you can sort it.

More on this below.

1. Ensure your data are coded correctly: ff_glimpse

While clearly obvious, this step is often ignored in the rush to get results. The first step in any analysis is robust data cleaning and coding. Lots of packages have a glimpse function and finalfit is no different. This function has three specific goals:

  1. Ensure all factors and numerics are correctly assigned. That is the commonest reason to get an error with a finalfit function. You think you’re using a factor variable, but in fact it is incorrectly coded as a continuous numeric.
  2. Ensure you know which variables have missing data. This presumes missing values are correctly assigned NA (a minimal recoding sketch follows this list). See here for more details if you are unsure.
  3. Ensure factor levels and variable labels are assigned correctly.
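A minimal sketch of the kind of recoding point 2 implies; the data frame, variable name, and “Unknown” placeholder here are hypothetical:

library(dplyr)

# smoking recorded as character, with "Unknown" used where not recorded
df = df %>% 
  mutate(smoking = na_if(smoking, "Unknown"),  # recode placeholder to NA
    smoking = factor(smoking))                 # then assign as a factor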

Example scenario

Using the colon cancer dataset that comes with finalfit, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.

For demonstration purposes, we will add MCAR and MAR smoking variables to the dataset.

# Make sure finalfit is up-to-date 
install.packages("finalfit") 

library(finalfit) 

# Create some extra missing data
## Smoking missing completely at random
set.seed(1)
colon_s$smoking_mcar = 
  sample(c("Smoker", "Non-smoker", NA), 
    dim(colon_s)[1], replace=TRUE, 
    prob = c(0.2, 0.7, 0.1)) %>% 
  factor()
Hmisc::label(colon_s$smoking_mcar) = "Smoking (MCAR)"

## Smoking missing conditional on patient sex
colon_s$smoking_mar[colon_s$sex.factor == "Female"] = 
  sample(c("Smoker", "Non-smoker", NA), 
    sum(colon_s$sex.factor == "Female"), 
    replace = TRUE,
    prob = c(0.1, 0.5, 0.4))

colon_s$smoking_mar[colon_s$sex.factor == "Male"] = 
  sample(c("Smoker", "Non-smoker", NA), 
    sum(colon_s$sex.factor == "Male"), 
    replace=TRUE, prob = c(0.15, 0.75, 0.1))
colon_s$smoking_mar = factor(colon_s$smoking_mar)
Hmisc::label(colon_s$smoking_mar) = "Smoking (MAR)"
# Examine with ff_glimpse
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  ff_glimpse(dependent, explanatory)

Numerics
            label   n missing_n missing_percent mean sd min max range  se
age   Age (years) 929         0             0.0   60 12  18  85    67 0.4
nodes        NULL 911        18             1.9    4  4   0  33    33 0.1

Factors
                           label   n missing_n missing_percent level_n
sex.factor                   Sex 929         0               0       2
obstruct.factor      Obstruction 908        21             2.3       2
mort_5yr        Mortality 5 year 915        14             1.5       2
smoking_mcar      Smoking (MCAR) 828       101              11       2
smoking_mar        Smoking (MAR) 719       210              23       2
                                levels  levels_count   levels_percent
sex.factor            "Female", "Male"      445, 484           48, 52
obstruct.factor            "No", "Yes"  732, 176, 21 78.8, 18.9,  2.3
mort_5yr               "Alive", "Died"  511, 404, 14 55.0, 43.5,  1.5
smoking_mcar    "Non-smoker", "Smoker" 645, 183, 101       69, 20, 11
smoking_mar     "Non-smoker", "Smoker" 591, 128, 210       64, 14, 23

The function summarises a data frame or tibble by numeric (continuous) variables and factor (discrete) variables. The dependent and explanatory arguments are for convenience. Pass either or neither, e.g. to summarise the whole data frame or tibble:

colon_s %>%
  ff_glimpse()

It doesn’t present well if you have factors with lots of levels, so you may want to remove these.

library(dplyr)
colon_s %>% 
  select(-hospital) %>% 
  ff_glimpse()

Use this to check that the variables are all assigned and behaving as expected. The proportion of missing data can be seen, e.g. smoking_mar has 23% missing data.

2. Identify missing values in each variable: missing_plot

This plot is useful for detecting patterns of missingness. Row number is on the x-axis and all included variables are on the y-axis. Associations between missingness and observations can be seen easily, as can relationships of missingness between variables.

colon_s %>%
  missing_plot()


It was only when writing this post that I discovered the amazing package, naniar. This package is recommended and provides lots of great visualisations for missing data.

3. Look for patterns of missingness: missing_pattern

missing_pattern simply wraps mice::md.pattern using finalfit grammar. This produces a table and a plot showing the pattern of missingness between variables.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  missing_pattern(dependent, explanatory)

This allows us to look for patterns of missingness between variables. There are 14 patterns in this data. The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic. 

Make sure you include missing data in demographics tables

Table 1 in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders. Do not silently drop missing values in this table. It is easy to do this correctly with summary_factorlist. This function provides a useful summary of a dependent variable against explanatory variables. Despite its name, continuous variables are handled nicely.

na_include=TRUE ensures missing data from the explanatory variables (but not dependent) are included. Note that any p-values are generated across missing groups as well, so run a second time with na_include=FALSE if you wish a hypothesis test only over observed data.

library(finalfit)

# Explanatory or confounding variables
explanatory = c("age", "sex.factor", 
  "nodes",  
  "smoking_mcar", "smoking_mar")

# Explanatory variable of interest
dependent = "obstruct.factor" # Bowel obstruction

colon_s %>% 
  summary_factorlist(dependent, explanatory, 
  na_include=TRUE, p=TRUE)
  
          label     levels          No         Yes     p
    Age (years)  Mean (SD) 60.2 (11.5) 57.3 (13.3) 0.014
            Sex     Female  346 (79.2)   91 (20.8) 0.290
                      Male  386 (82.0)   85 (18.0)      
          nodes  Mean (SD)   3.7 (3.7)   3.5 (3.2) 0.774
 Smoking (MCAR) Non-smoker  500 (79.4)  130 (20.6) 0.173
                    Smoker  154 (85.6)   26 (14.4)      
                   Missing   78 (79.6)   20 (20.4)      
  Smoking (MAR) Non-smoker  467 (80.9)  110 (19.1) 0.056
                    Smoker   91 (73.4)   33 (26.6)      
                   Missing  174 (84.1)   33 (15.9)

4. Check for associations between missing and observed data: missing_pairs | missing_compare

In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables. This is particularly important (I would say absolutely required) for a primary outcome measure / dependent variable.

Take for example “death”. When that outcome is missing it is often for a particular reason. For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery. And of course, death is more likely after emergency surgery.

missing_pairs uses functions from the excellent GGally package. It produces pairs plots to show relationships between missing values and observed values in all variables.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"
colon_s %>% 
  missing_pairs(dependent, explanatory)

For continuous variables (age and nodes), the distributions of observed and missing data can be visually compared. Is there a difference between age and mortality above?

For discrete data, counts are presented by default. It is often easier to compare proportions:

colon_s %>% 
  missing_pairs(dependent, explanatory, position = "fill")

It should be obvious that missingness in Smoking (MCAR) does not relate to sex (row 6, column 3). But missingness  in Smoking (MAR) does differ by sex (last row, column 3) as was designed above when the missing data were created.

We can confirm this using missing_compare.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor")
dependent = "smoking_mcar"
colon_s %>% 
  missing_compare(dependent, explanatory)

 Missing data analysis: Smoking (MCAR)           Not missing     Missing     p
                           Age (years) Mean (SD) 59.7 (11.9) 59.9 (12.6) 0.867
                                   Sex    Female  399 (89.7)   46 (10.3) 0.616
                                            Male  429 (88.6)   55 (11.4)      
                                 nodes Mean (SD)   3.6 (3.4)     4 (4.5) 0.990
                           Obstruction        No  654 (89.3)   78 (10.7) 0.786
                                             Yes  156 (88.6)   20 (11.4)     
											 
dependent = "smoking_mar"
colon_s %>% 
  missing_compare(dependent, explanatory)

 Missing data analysis: Smoking (MAR)           Not missing    Missing      p
                          Age (years) Mean (SD) 59.6 (11.9)  60.1 (12)  0.709
                                  Sex    Female  288 (64.7) 157 (35.3) 



It takes “dependent” and “explanatory” variables, but in this context “dependent” just refers to the variable being tested for missingness against the “explanatory” variables.

Comparisons for continuous data use a Kruskal-Wallis test and for discrete data a chi-squared test.

As expected, a relationship is seen between Sex and Smoking (MAR) but not Smoking (MCAR).

For those who like an omnibus test

If you work predominantly with numeric rather than discrete data (categorical/factors), you may find the tests from the MissMech package useful. The package and its output are well documented, and it provides two tests which can be used to determine whether data are MCAR.

library(finalfit)
library(dplyr)
library(MissMech)
explanatory = c("age", "nodes")
dependent = "mort_5yr" 

colon_s %>% 
  select(explanatory) %>% 
  MissMech::TestMCARNormality()

5. Decide how to handle missing data

These pages from Karen Grace-Martin are great for this.

Prior to a standard regression analysis, we can either:

  • Delete the variable with the missing data
  • Delete the cases with the missing data
  • Impute (fill in) the missing data
  • Model the missing data

MCAR, MAR, or MNAR

MCAR vs MAR

Using the examples, we identify that Smoking (MCAR) is missing completely at random. 

We know nothing about the missing values themselves, but we know of no plausible reason that the values of the missing data for, say, people who died should be different to the values of the missing data for those who survived. The pattern of missingness is therefore not felt to be MNAR.

Common solution

Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.

We therefore elect to simply omit the patients in whom smoking is missing. This is known as list-wise deletion and will be performed by default in standard regression analyses including finalfit.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar")
dependent = "mort_5yr"
colon_s %>% 
	finalfit(dependent, explanatory, metrics=TRUE)

 Dependent: Mortality 5 year                  Alive        Died          OR (univariable)        OR (multivariable)
                 Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.200)
                         Sex     Female  243 (47.6)  194 (48.0)                         -                         -
                                   Male  268 (52.4)  210 (52.0) 0.98 (0.76-1.27, p=0.889) 1.02 (0.76-1.38, p=0.872)
                       nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p



Other considerations

  1. Sensitivity analysis
  2. Omit the variable
  3. Imputation
  4. Model the missing data

If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis. A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model. Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labelling with “non-smoker”.
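As a sketch of one such sensitivity analysis, reusing the dependent and explanatory variables defined above (re-labelling missing values as “Smoker” is the assumption being tested, not a recommendation):

library(finalfit)
library(dplyr)
library(forcats)

colon_s %>% 
  mutate(
    # Assume all missing smoking values were in fact smokers
    smoking_mcar = fct_explicit_na(smoking_mcar, na_level = "Smoker")
  ) %>% 
  finalfit(dependent, explanatory)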

If smoking is not associated with the explanatory variable of interest (bowel obstruction) or the outcome, it may be considered not to be a confounder  and so could be omitted. That neatly deals with the missing data issue, but of course may not be appropriate.

Imputation and modelling are considered below.

MCAR vs MAR

But life is rarely that simple.

Consider that the smoking variable is more likely to be missing if the patient is female (missing_compare shows a relationship). But, say, the missing values themselves are no different from the observed values. Missingness is then MAR.

If we simply drop all the cases (patients) in which smoking is missing (list-wise deletion), then proportionally we drop more females than males. This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or outcome.

Common solution

mice is our go-to package for multiple imputation. That’s the process of filling in missing data using a best estimate from all the other data that exist. When first encountered, this doesn’t sound like a good idea.

However, taking our simple example, if missingness in smoking is predicted strongly by sex, and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and other variables in the dataset.

Imputation is not usually appropriate for the explanatory variable of interest or the outcome variable. With both of these, the hypothesis is that there is a meaningful association with other variables in the dataset, therefore it doesn’t make sense to use these variables to impute them.

Here is some code to run mice. The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process. Read the documentation carefully prior to doing this yourself.

# Multivariate Imputation by Chained Equations (mice)
library(finalfit)
library(dplyr)
library(mice)
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  select(dependent, explanatory) %>% 
  # Exclude outcome and explanatory variable of interest from imputation
  dplyr::filter(!is.na(mort_5yr), !is.na(obstruct.factor)) %>%
  # Run imputation with 10 imputed sets
  mice(m = 10) %>% 
  # Run logistic regression on each imputed set
  with(glm(formula(ff_formula(dependent, explanatory)), 
    family="binomial")) %>%
  # Pool and summarise results
  pool() %>%                                            
  summary(conf.int = TRUE, exponentiate = TRUE) %>%
  # Jiggle into finalfit format
  mutate(explanatory_name = rownames(.)) %>%            
  select(explanatory_name, estimate, `2.5 %`, `97.5 %`, p.value) %>% 
  condense_fit(estimate_suffix = " (multiple imputation)") %>% 
  remove_intercept() -> fit_imputed

# Use finalfit merge methods to create and compare results
colon_s %>% 
  summary_factorlist(dependent, explanatory, fit_id = TRUE) -> summary1

colon_s %>% 
  glmuni(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (univariable)") -> fit_uni

colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (multivariable inc. smoking)") -> fit_multi

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor")
colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (multivariable)") -> fit_multi_r

# Combine to final table
summary1 %>% 
  ff_merge(fit_uni) %>% 
  ff_merge(fit_multi_r) %>% 
  ff_merge(fit_multi) %>% 
  ff_merge(fit_imputed) %>% 
  select(-fit_id, -index)

         label     levels       Alive        Died          OR (univariable)        OR (multivariable) OR (multivariable inc. smoking)  OR (multiple imputation)
   Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.122)       1.02 (1.00-1.03, p=0.010) 1.01 (1.00-1.02, p=0.116)
           Sex     Female  243 (55.6)  194 (44.4)                         -                         -                               -                         -
                     Male  268 (56.1)  210 (43.9) 0.98 (0.76-1.27, p=0.889) 0.98 (0.74-1.30, p=0.890)       0.88 (0.64-1.23, p=0.461) 0.99 (0.75-1.31, p=0.957)
         nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p



The final table can easily be exported to Word or as a PDF as described elsewhere.

By examining the coefficients, the effect of the imputation compared with the complete case analysis can be clearly seen.

Other considerations

  1. Omit the variable
  2. Imputing factors with new level for missing data
  3. Model the missing data

As above, if the variable does not appear to be important, it may be omitted from the analysis. A sensitivity analysis in this context is another form of imputation. But rather than using all other available information to best-guess the missing data, we simply assign the value as above. Imputation is therefore likely to be more appropriate.

There is an alternative method to model the missing data for categorical variables in this setting – just treat the missing values as an explicit factor level. This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model. Multiple imputation is generally preferred.

library(dplyr)

# Re-include smoking in the model, with missing values as an explicit level
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  mutate(
    smoking_mar = forcats::fct_explicit_na(smoking_mar)
  ) %>% 
  finalfit(dependent, explanatory)

 Dependent: Mortality 5 year                  Alive        Died          OR (univariable)        OR (multivariable)
                 Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.119)
                         Sex     Female  243 (47.6)  194 (48.0)                         -                         -
                                   Male  268 (52.4)  210 (52.0) 0.98 (0.76-1.27, p=0.889) 0.96 (0.72-1.30, p=0.809)
                       nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p



MNAR vs MAR

Missing not at random data are tough in healthcare. To determine definitively whether data are MNAR, we need to know the values of the missing data in a subset of observations (patients).

Using our example above, say smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing cancer. Obstructing bowel cancers may be larger or their position may make the prognosis worse. Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis. The missing values for smoking may therefore not be random: smoking may be more common in the emergency patients and more common in those that die.

There is no easy way to handle this. If at all possible, try to get the missing data. Otherwise, take care when drawing conclusions from analyses where data are thought to be missing not at random. 

Where to next

We are now doing more in Stan. Missing data can be imputed directly within a Stan model which feels neat. Stan doesn’t yet have the equivalent of NA which makes passing the data block into Stan a bit of a faff. 

Alternatively, the missing data can be directly modelled in Stan. Examples are provided in the manual. Again, I haven’t found this that easy to do, but there are a number of Stan developments that will hopefully make this more straightforward in the future. 

Elegant regression results tables and plots in R: the finalfit package

This post was originally published here

The finalfit package brings together the day-to-day functions we use to generate final results tables and plots when modelling. I spent many years repeatedly manually copying results from R analyses and built these functions to automate our standard healthcare data workflow. It is particularly useful when undertaking a large study involving multiple different regression analyses. When combined with RMarkdown, the reporting becomes entirely automated. Its design follows Hadley Wickham’s tidy tool manifesto.

Installation and Documentation

It lives on GitHub.

You can install finalfit from github with:

# install.packages("devtools")
devtools::install_github("ewenharrison/finalfit")

It is recommended that this package is used together with dplyr, which is a dependency.

Some of the functions require rstan and boot. These have been left as Suggests rather than Depends to avoid unnecessary installation. If needed, they can be installed in the normal way:

install.packages("rstan")
install.packages("boot")

To install off-line (or in a Safe Haven), download the zip file and use devtools::install_local().
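For example (the file name here is hypothetical):

# Having downloaded the zip file from GitHub:
devtools::install_local("finalfit-master.zip")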

Main Features

1. Summarise variables/factors by a categorical variable

summary_factorlist() is a wrapper used to aggregate any number of explanatory variables by a single variable of interest. This is often “Table 1” of a published study. When categorical, the variable of interest can have a maximum of five levels. It uses Hmisc::summary.formula().

library(finalfit)
library(dplyr)

# Load example dataset, modified version of survival::colon
data(colon_s)

# Table 1 - Patient demographics by variable of interest ----
explanatory = c("age", "age.factor", 
  "sex.factor", "obstruct.factor")
dependent = "perfor.factor" # Bowel perforation
colon_s %>%
  summary_factorlist(dependent, explanatory,
  p=TRUE, add_dependent_label=TRUE)

See other options relating to inclusion of missing data, mean vs. median for continuous variables, column vs. row proportions, inclusion of a total column, etc.
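For example, a sketch using some of these arguments (check ?summary_factorlist in your installed version for the exact names and defaults):

colon_s %>%
  summary_factorlist(dependent, explanatory,
    cont = "median",    # median (IQR) rather than mean (SD)
    column = FALSE,     # row rather than column proportions
    total_col = TRUE,   # add a total column
    na_include = TRUE,  # include missing explanatory data
    p = TRUE)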

summary_factorlist() is also commonly used to summarise any number of variables by an outcome variable (say dead yes/no).

# Table 2 - 5 yr mortality ----
explanatory = c("age.factor", 
  "sex.factor",
  "obstruct.factor")
dependent = 'mort_5yr'
colon_s %>%
  summary_factorlist(dependent, explanatory, 
  p=TRUE, add_dependent_label=TRUE)

Tables can be knitted to PDF, Word or html documents. We do this in RStudio from a .Rmd document. Example chunk:

```{r, echo = FALSE, results='asis'}
knitr::kable(example_table, row.names=FALSE, 
    align=c("l", "l", "r", "r", "r", "r"))
```

2. Summarise regression model results in final table format

The second main feature is the ability to create final tables for linear (lm()), logistic (glm()), hierarchical logistic (lme4::glmer()) and
Cox proportional hazards (survival::coxph()) regression models.

The finalfit() “all-in-one” function takes a single dependent variable with a vector of explanatory variable names (continuous or categorical variables) to produce a final table for publication including summary statistics, univariable and multivariable regression analyses. The first columns are those produced by summary_factorlist(). The appropriate regression model is chosen on the basis of the dependent variable type and other arguments passed.

Logistic regression: glm()

Of the form: glm(dependent ~ explanatory, family="binomial")

explanatory = c("age.factor", "sex.factor", 
  "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
  finalfit(dependent, explanatory)

Logistic regression with reduced model: glm()

Where a multivariable model contains a subset of the variables specified in the full univariable set, this can be specified.

explanatory = c("age.factor", "sex.factor", 
  "obstruct.factor", "perfor.factor")
explanatory_multi = c("age.factor", 
  "obstruct.factor")
dependent = 'mort_5yr'
colon_s %>%
  finalfit(dependent, explanatory, 
  explanatory_multi)

Mixed effects logistic regression: lme4::glmer()

Of the form: lme4::glmer(dependent ~ explanatory + (1 | random_effect), family="binomial")

Hierarchical/mixed effects/multilevel logistic regression models can be specified using the argument random_effect. At the moment it is just set up for random intercepts (i.e. (1 | random_effect)), but in the future I’ll adjust this to accommodate random gradients if needed (i.e. (variable1 | variable2)).

explanatory = c("age.factor", "sex.factor", 
  "obstruct.factor", "perfor.factor")
explanatory_multi = c("age.factor", "obstruct.factor")
random_effect = "hospital"
dependent = 'mort_5yr'
colon_s %>%
  finalfit(dependent, explanatory, 
  explanatory_multi, random_effect)

Cox proportional hazards: survival::coxph()

Of the form: survival::coxph(dependent ~ explanatory)

explanatory = c("age.factor", "sex.factor", 
"obstruct.factor", "perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
  finalfit(dependent, explanatory)

Add common model metrics to output

metrics=TRUE provides common model metrics. The output is a list of two dataframes. Note chunk specification for output below.

explanatory = c("age.factor", "sex.factor", 
  "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
  finalfit(dependent, explanatory, 
  metrics=TRUE)

```{r, echo=FALSE, results="asis"}
knitr::kable(table7[[1]], row.names=FALSE, align=c("l", "l", "r", "r", "r"))
knitr::kable(table7[[2]], row.names=FALSE)
```

Rather than going all-in-one, any number of subset models can be manually added on to a summary_factorlist() table using finalfit_merge(). This is particularly useful when models take a long time to run or are complicated.

Note the requirement for fit_id=TRUE in summary_factorlist(). fit2df extracts, condenses, and adds metrics to supported models.

explanatory = c("age.factor", "sex.factor", 
  "obstruct.factor", "perfor.factor")
explanatory_multi = c("age.factor", "obstruct.factor")
random_effect = "hospital"
dependent = 'mort_5yr'

# Separate tables
colon_s %>%
  summary_factorlist(dependent, 
  explanatory, fit_id=TRUE) -> example.summary

colon_s %>%
  glmuni(dependent, explanatory) %>%
  fit2df(estimate_suffix=" (univariable)") -> example.univariable

colon_s %>%
  glmmulti(dependent, explanatory) %>%
  fit2df(estimate_suffix=" (multivariable)") -> example.multivariable

colon_s %>%
  glmmixed(dependent, explanatory, random_effect) %>%
  fit2df(estimate_suffix=" (multilevel)") -> example.multilevel

# Pipe together
example.summary %>%
  finalfit_merge(example.univariable) %>%
  finalfit_merge(example.multivariable) %>%
  finalfit_merge(example.multilevel) %>%
  select(-c(fit_id, index)) %>% # remove unnecessary columns
  dependent_label(colon_s, dependent, prefix="") # place dependent variable label

Bayesian logistic regression: with stan

Our own particular rstan models are supported and will be documented in the future. Broadly, if you are running (hierarchical) logistic regression models in [Stan](http://mc-stan.org/users/interfaces/rstan) with coefficients specified as a vector labelled beta, then fit2df() will work directly on the stanfit object in the same way as for a glm or glmerMod object.

3. Summarise regression model results in plot

Models can be summarized with odds ratio/hazard ratio plots using or_plot, hr_plot and surv_plot.

OR plot

# OR plot
explanatory = c("age.factor", "sex.factor", 
  "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
  or_plot(dependent, explanatory)
# Previously fitted models (`glmmulti()` or 
# `glmmixed()`) can be provided directly to `glmfit`

HR plot

# HR plot
explanatory = c("age.factor", "sex.factor", 
  "obstruct.factor", "perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
  hr_plot(dependent, explanatory, dependent_label = "Survival")
# Previously fitted models (`coxphmulti`) can be provided directly using `coxfit`

Kaplan-Meier survival plots

KM plots can be produced using the survminer package:

# KM plot
explanatory = c("perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
  surv_plot(dependent, explanatory, 
  xlab="Time (days)", pval=TRUE, legend="none")

Notes

Use Hmisc::label() to assign labels to variables for tables and plots.

Hmisc::label(colon_s$age.factor) = "Age (years)"

Export data frame tables directly or to R Markdown using knitr::kable().

Note that the wrapper summary_missing() is also useful; it wraps mice::md.pattern().

colon_s %>%
  summary_missing(dependent, explanatory)

Development will be on-going, but any input appreciated.

P-values from random effects linear regression models

This post was originally published here

lme4::lmer() is a useful frequentist approach to hierarchical/multilevel linear regression modelling. For good reason, the model output only includes t-values and doesn’t include p-values (partly due to the difficulty in estimating the degrees of freedom, as discussed here).

Yes, p-values are evil and we should continue to try and expunge them from our analyses. But I keep getting asked about this. So here is a simple bootstrap method to generate two-sided parametric p-values on the fixed effects coefficients. Interpret with caution.