Riinu and I are sitting in Frankfurt airport discussing the paper retracted in JAMA this week.
During analysis, the treatment variable coded [1,2] was recoded in error to [1,0]. The results of the analysis were therefore reversed. The lung-disease self-management program actually resulted in more attendances at hospital, rather than fewer as had been originally reported.
Recode check
Checking of recoding is such an important part of data cleaning – we emphasise this a lot in HealthyR courses – but of course mistakes happen.
Our standard approach is this:
library(finalfit)
colon_s %>%
  mutate(
    sex.factor2 = forcats::fct_recode(sex.factor,
      "F" = "Male",
      "M" = "Female")
  ) %>%
  count(sex.factor, sex.factor2)
# A tibble: 2 x 3
sex.factor sex.factor2 n
<fct> <fct> <int>
1 Female M 445
2 Male F 484
The miscode should be obvious.
check_recode()
However, mistakes may still happen and be missed. So we’ve bashed out a useful function that can be applied to your whole dataset. This is not to replace careful checking, but may catch something that has been missed.
The function takes a data frame or tibble and fuzzy matches variable names. It produces crosstables similar to above for all matched variables.
So if you have coded something from sex to sex.factor it will be matched. The match is hungry so it is more likely to match unrelated variables than to miss similar variables. But if you recode death to mortality it won’t be matched.
Here’s a walk through.
# Install
devtools::install_github('ewenharrison/finalfit')
library(finalfit)
library(dplyr)
# Recode example
colon_s_small = colon_s %>%
  select(-id, -rx, -rx.factor) %>%
  mutate(
    age.factor2 = forcats::fct_collapse(age.factor,
      "<60 years" = c("<40 years", "40-59 years")),
    sex.factor2 = forcats::fct_recode(sex.factor,
      # Intentional miscode
      "F" = "Male",
      "M" = "Female")
  )
# Check
colon_s_small %>%
  check_recode()
$index
# A tibble: 3 x 2
var1 var2
<chr> <chr>
1 sex.factor sex.factor2
2 age.factor age.factor2
3 sex.factor2 age.factor2
$counts
$counts[[1]]
# A tibble: 2 x 3
sex.factor sex.factor2 n
<fct> <fct> <int>
1 Female M 445
2 Male F 484
$counts[[2]]
# A tibble: 3 x 3
age.factor age.factor2 n
<fct> <fct> <int>
1 <40 years <60 years 70
2 40-59 years <60 years 344
3 60+ years 60+ years 515
$counts[[3]]
# A tibble: 4 x 3
sex.factor2 age.factor2 n
<fct> <fct> <int>
1 M <60 years 204
2 M 60+ years 241
3 F <60 years 210
4 F 60+ years 274
As can be seen, the output takes the form of a list of length 2. The first element is an index of matched variables. The second is a crosstable, as a tibble, for each variable combination. sex.factor2 can be seen to be miscoded. sex.factor2 and age.factor2 have been matched, but this pairing should be ignored.
Numeric variables are not included by default. To include them:
out = colon_s_small %>%
  select(-extent, -extent.factor, -time, -time.years) %>% # choose to exclude variables
  check_recode(include_numerics = TRUE)
out
# Output not printed for space
Miscoding in survival::colon dataset?
When doing this just today, we noticed something strange in our example dataset, survival::colon.
The variable node4 should be a binary recode of nodes greater than 4. But as can be seen, something is not right!
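A quick way to look at this yourself is a simple crosstabulation (a sketch using the raw survival::colon data, in which node4 is documented as "more than 4 positive lymph nodes"):
library(dplyr)
# Crosstabulate the recorded node count against the binary recode
survival::colon %>%
  count(nodes > 4, node4)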
We’re interested in any explanations those working with this dataset might have.
There we are then, a function that may be useful in detecting miscoding. So useful in fact, that we have immediately found probable miscoding in a standard R dataset.
We are using multiple imputation more frequently to “fill in” missing data in clinical datasets. Multiple datasets are created, models run, and results pooled so conclusions can be drawn.
We’ve put some improvements into Finalfit on GitHub to make it easier to use with the mice package. These will go to CRAN soon but not immediately.
Multivariate Imputation by Chained Equations (mice)
mice is a great package and contains lots of useful functions for diagnosing and working with missing data. The purpose here is to demonstrate how mice can be integrated into the Finalfit workflow, with inclusion of models from imputed datasets in tables and plots.
Choose variables to impute and variables to impute from
finalfit::missing_predictorMatrix() makes it easy to specify which variables do what. For instance, we often do not want to impute our outcome or explanatory variable of interest (exposure), but do want to use them to impute other variables.
This is straightforward to code using the arguments drop_from_imputed and drop_from_imputer.
library(mice)
# Specify model
explanatory = c("age", "sex.factor", "nodes",
                "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

# Choose not to impute missing values
# for the explanatory variable of interest and
# the outcome variable,
# but do use them in the imputation algorithm.
predM = colon_s %>%
  select(dependent, explanatory) %>%
  missing_predictorMatrix(
    drop_from_imputed = c("obstruct.factor", "mort_5yr")
  )
Create imputed datasets
A set of multiple imputed datasets (mids) can be created as below. Various checks should be performed to ensure you understand the data that have been created; the mice documentation describes useful diagnostics.
mids = colon_s %>%
  select(dependent, explanatory) %>%
  mice(m = 4, predictorMatrix = predM) # Usually m = 10
Run models
Here we will use a logistic regression model. The with.mids() function takes a model with a formula object, so use base R functions rather than Finalfit wrappers.
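A minimal sketch of this step, assuming the dependent and explanatory vectors defined above (finalfit's ff_formula() pastes them into a model formula):
# Run a logistic regression model on each imputed dataset
fits = mids %>%
  with(glm(formula(ff_formula(dependent, explanatory)),
           family = "binomial"))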
We now have multiple models, one run on each of the imputed datasets. We haven't found good methods for combining common model metrics like AIC and the c-statistic, and we'd be interested to hear from anyone working on this. Metrics can be extracted for each individual model to give an idea of goodness-of-fit and discrimination. We're not suggesting you use these to compare imputed datasets, but you could use them to compare models containing different variables created using the imputed datasets, e.g.
fits %>%
  getfit() %>%
  purrr::map(AIC)
[[1]]
[1] 1192.57
[[2]]
[1] 1191.09
[[3]]
[1] 1195.49
[[4]]
[1] 1193.729
# C-statistic
fits %>%
  getfit() %>%
  purrr::map(~ pROC::roc(.x$y, .x$fitted)$auc)
[[1]]
Area under the curve: 0.6839
[[2]]
Area under the curve: 0.6818
[[3]]
Area under the curve: 0.6789
[[4]]
Area under the curve: 0.6836
Pool results
Rubin’s rules are used to combine results of multiple models.
# Pool results
fits_pool = fits %>%
  pool()
Plot results
Pooled results can be passed directly to Finalfit plotting functions.
# Can be passed to or_plot
colon_s %>%
  or_plot(dependent, explanatory, glmfit = fits_pool, table_text_size = 4)
Put results in table
The pooled result can be passed directly to fit2df() as can many common models such as lm(), glm(), lmer(), glmer(), coxph(), crr(), etc.
# Summarise and put in table
fit_imputed = fits_pool %>%
  fit2df(estimate_name = "OR (multiple imputation)", exp = TRUE)
fit_imputed
explanatory OR (multiple imputation)
1 age 1.01 (1.00-1.02, p=0.212)
2 sex.factorMale 1.01 (0.77-1.34, p=0.917)
3 nodes 1.24 (1.18-1.31, p<0.001)
4 obstruct.factorYes 1.34 (0.94-1.91, p=0.105)
5 smoking_marSmoker 1.28 (0.88-1.85, p=0.192)
Combine results with summary data
Any model passed through fit2df() can be combined with a summary table generated with summary_factorlist() and any number of other models.
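For example, a sketch joining the imputed-model results onto a summary table (fit_id = TRUE adds the identifiers that ff_merge() joins on; last_merge = TRUE drops them from the final table):
colon_s %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
  ff_merge(fit_imputed, last_merge = TRUE)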
In healthcare, we deal with a lot of binary outcomes. Death yes/no, disease recurrence yes/no, for instance. These outcomes are often easily analysed using binary logistic regression via finalfit().
When the time taken for the outcome to occur is important, we need a different approach. For instance, in patients with cancer, the time taken until recurrence of the cancer is often just as important as the fact it has recurred.
Finalfit wraps a number of functions to make these analyses easy to perform and output into PDFs and Word documents.
Installation
# Make sure finalfit is up-to-date
install.packages("finalfit")
Dataset
We’ll use the classic “Survival from Malignant Melanoma” dataset from the boot package to illustrate. The data consist of measurements made on patients with malignant melanoma. Each patient had their tumour removed by surgery at the Department of Plastic Surgery, University Hospital of Odense, Denmark during the period 1962 to 1977.
For the purposes of demonstration, we are interested in the association between tumour ulceration and survival after surgery.
Get data and check
library(finalfit)
melanoma = boot::melanoma #F1 here for help page with data dictionary
ff_glimpse(melanoma)
#> Continuous
#> label var_type n missing_n missing_percent mean sd
#> time time <dbl> 205 0 0.0 2152.8 1122.1
#> status status <dbl> 205 0 0.0 1.8 0.6
#> sex sex <dbl> 205 0 0.0 0.4 0.5
#> age age <dbl> 205 0 0.0 52.5 16.7
#> year year <dbl> 205 0 0.0 1969.9 2.6
#> thickness thickness <dbl> 205 0 0.0 2.9 3.0
#> ulcer ulcer <dbl> 205 0 0.0 0.4 0.5
#> min quartile_25 median quartile_75 max
#> time 10.0 1525.0 2005.0 3042.0 5565.0
#> status 1.0 1.0 2.0 2.0 3.0
#> sex 0.0 0.0 0.0 1.0 1.0
#> age 4.0 42.0 54.0 65.0 95.0
#> year 1962.0 1968.0 1970.0 1972.0 1977.0
#> thickness 0.1 1.0 1.9 3.6 17.4
#> ulcer 0.0 0.0 0.0 1.0 1.0
#>
#> Categorical
#> data frame with 0 columns and 205 rows
As can be seen, all variables are coded as numeric and some need recoding to factors.
Death status
status is the patient's status at the end of the study.
1 indicates that they had died from melanoma;
2 indicates that they were still alive and;
3 indicates that they had died from causes unrelated to their melanoma.
Competing risks: comparing 2 (alive) with 1 (died melanoma) accounting for 3 (died other); see more below.
Time and censoring
time is the number of days from surgery until either the occurrence of the event (death) or the last time the patient was known to be alive. For instance, if a patient had surgery and was seen to be well in a clinic 30 days later, but there had been no contact since, then the patient's follow-up time would be recorded as 30 days. This patient is censored from the analysis at day 30, an important feature of time-to-event analyses.
Recode
library(dplyr)
library(forcats)
melanoma = melanoma %>%
  mutate(
    # Overall survival
    status_os = case_when(
      status == 2 ~ 0, # "still alive"
      TRUE ~ 1),       # "died of melanoma" or "died of other causes"

    # Disease-specific survival
    status_dss = case_when(
      status == 2 ~ 0, # "still alive"
      status == 1 ~ 1, # "died of melanoma"
      status == 3 ~ 0), # "died of other causes" is censored

    # Competing risks regression
    status_crr = case_when(
      status == 2 ~ 0, # "still alive"
      status == 1 ~ 1, # "died of melanoma"
      status == 3 ~ 2), # "died of other causes"

    # Label and recode other variables
    age = ff_label(age, "Age (years)"), # table-friendly labels
    thickness = ff_label(thickness, "Tumour thickness (mm)"),
    sex = factor(sex) %>%
      fct_recode("Male" = "1",
                 "Female" = "0") %>%
      ff_label("Sex"),
    ulcer = factor(ulcer) %>%
      fct_recode("No" = "0",
                 "Yes" = "1") %>%
      ff_label("Ulcerated tumour")
  )
Kaplan-Meier survival estimator
We can use the excellent survival package to produce the Kaplan-Meier (KM) survival estimator. This is a non-parametric statistic used to estimate the survival function from time-to-event data. Note the use of %$% (from magrittr) to expose the left-hand side of the pipe to older-style R functions on the right-hand side.
library(survival)
library(magrittr) # provides %$%
survival_object = melanoma %$%
  Surv(time, status_os)

# Explore:
head(survival_object) # + marks censoring, in this case "Alive"
#> [1]  10  30  35+  99 185 204

# Expressing time in years
survival_object = melanoma %$%
  Surv(time/365, status_os)
KM analysis for whole cohort
Model
The survival object is the first step to performing univariable and multivariable survival analyses.
If you want to plot survival stratified by a single grouping variable, you can replace survival_object ~ 1 with survival_object ~ factor.
# Overall survival in whole cohort
my_survfit = survfit(survival_object ~ 1, data = melanoma)
my_survfit # 205 patients, 71 events
#> Call: survfit(formula = survival_object ~ 1, data = melanoma)
#>
#> n events median 0.95LCL 0.95UCL
#> 205.00 71.00 NA 9.15 NA
Life table
A life table is the tabular form of a KM plot, which you may be familiar with. It shows survival as a proportion, together with confidence limits. The whole table is shown with summary(my_survfit).
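For instance, with time expressed in years as above, estimates at yearly intervals can be shown:
summary(my_survfit, times = 0:5)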
Kaplan-Meier plot
We can plot survival curves using surv_plot(), the finalfit wrapper for the excellent survminer package. There are numerous options available on the help page. You should always include a number-at-risk table under these plots, as it is essential for interpretation.
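A minimal sketch, using the ulcer variable recoded above:
dependent_os = "Surv(time/365, status_os)"
explanatory = c("ulcer")
melanoma %>%
  surv_plot(dependent_os, explanatory, pval = TRUE)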
As can be seen, the probability of dying is much greater for patients with ulcerated tumours than for those without.
Cox proportional hazards (CPH) regression
CPH regression can be performed using the all-in-one finalfit() function. It produces a table containing counts (proportions) for factors, mean (SD) for continuous variables, and univariable and multivariable CPH regression results.
A hazard is the term given to the rate at which events happen.
The probability that an event will happen over a period of time is the hazard multiplied by the time interval.
An assumption of CPH is that hazards are constant over time (see below).
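A minimal sketch of the all-in-one call for overall survival, using the variables recoded above:
dependent_os = "Surv(time, status_os)"
explanatory = c("age", "sex", "thickness", "ulcer")
melanoma %>%
  finalfit(dependent_os, explanatory)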
If you are using a backwards selection approach or similar, a reduced model can be directly specified and compared. The full model can be kept or dropped.
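For example, a sketch using finalfit's explanatory_multi argument for the reduced model, with keep_models = TRUE retaining the full model in the table:
explanatory_multi = c("thickness", "ulcer")
melanoma %>%
  finalfit(dependent_os, explanatory,
           explanatory_multi, keep_models = TRUE)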
An assumption of CPH regression is that the hazard associated with a particular variable does not change over time. For example, is the magnitude of the increase in risk of death associated with tumour ulceration the same in the early post-operative period as it is in later years?
The cox.zph() function from the survival package allows us to test this assumption for each variable. The plot of scaled Schoenfeld residuals should be a horizontal line. The included hypothesis test identifies whether the gradient differs from zero for each variable. No variable significantly differs from zero at the 5% significance level.
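A sketch of how the zph_result below might be produced, using finalfit's coxphmulti() to fit the multivariable model (year is included to match the output):
explanatory = c("age", "sex", "thickness", "ulcer", "year")
zph_result = melanoma %>%
  coxphmulti(dependent_os, explanatory) %>%
  cox.zph()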
zph_result
#> rho chisq p
#> age 0.1633 2.4544 0.1172
#> sexMale -0.0781 0.4473 0.5036
#> thickness -0.1493 1.3492 0.2454
#> ulcerYes -0.2044 2.8256 0.0928
#> year 0.0195 0.0284 0.8663
#> GLOBAL NA 8.4695 0.1322
Stratified models
One approach to dealing with a violation of the proportional hazards assumption is to stratify by that variable. Including a strata() term will result in a separate baseline hazard function being fitted for each level of the stratification variable. It will no longer be possible to make direct inference on the effect associated with that variable.
This can be incorporated directly into the explanatory variable list.
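For example, a sketch stratifying by year of surgery:
explanatory = c("age", "sex", "thickness", "ulcer", "strata(year)")
melanoma %>%
  finalfit(dependent_os, explanatory)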
Correlated groups of observations
As a general rule, you should always try to account for any higher structure in the data within the model. For instance, patients may be clustered within particular hospitals.
There are two broad approaches to dealing with correlated groups of observations.
Including a cluster() term is akin to using generalised estimating equations (GEE). Here, a standard CPH model is fitted but the standard errors of the estimated hazard ratios are adjusted to account for correlations.
Including a frailty() term is akin to using a mixed effects model, where specific random effects term(s) are directly incorporated into the model.
Both approaches achieve the same goal in different ways. Volumes have been written on GEE vs mixed effects models. We favour the latter approach because of its flexibility and our preference for mixed effects modelling in generalised linear modelling. Note cluster() and frailty() terms cannot be combined in the same model.
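The melanoma data contain no grouping variable, so as an illustration we simulate a hypothetical hospital_id; both terms below are sketches, not part of the original data:
# Simulate a hypothetical hospital identifier
set.seed(1)
melanoma = melanoma %>%
  mutate(hospital_id = sample(1:10, size = n(), replace = TRUE))

# GEE-like approach: robust standard errors
explanatory = c("age", "sex", "thickness", "ulcer", "cluster(hospital_id)")
melanoma %>%
  finalfit(dependent_os, explanatory)

# Random-effects approach: frailty term
explanatory = c("age", "sex", "thickness", "ulcer", "frailty(hospital_id)")
melanoma %>%
  finalfit(dependent_os, explanatory)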
The frailty() method here is being superseded by the coxme package, and we’ll incorporate this soon.
Hazard ratio plot
A plot of any of the above models can be produced by passing the terms to hr_plot().
melanoma %>%
  hr_plot(dependent_os, explanatory)
Competing risks regression
Competing-risks regression is an alternative to CPH regression. It can be useful if the outcome of interest may not be able to occur because something else (like death) has happened first. For instance, in our example it is obviously not possible for a patient to die from melanoma if they have died from another disease first. Simply looking at cause-specific mortality (deaths from melanoma) and treating other deaths as censored may bias the estimated influence of predictors.
The approach by Fine and Gray is one option for dealing with this. It is implemented in the package cmprsk. The crr() syntax differs from survival::coxph() but finalfit brings these together.
The combined table uses the finalfit::ff_merge() function, which can join any number of models together.
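A sketch of how this might look using the status_crr recode above (crrmulti() wraps cmprsk::crr(); check the argument names against your installed version):
explanatory = c("age", "sex", "thickness", "ulcer")
dependent_crr = "Surv(time, status_crr)"
melanoma %>%
  summary_factorlist(dependent_crr, explanatory,
                     column = TRUE, fit_id = TRUE) %>%
  ff_merge(
    melanoma %>%
      crrmulti(dependent_crr, explanatory) %>%
      fit2df(estimate_suffix = " (competing risks multivariable)"),
    last_merge = TRUE
  )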
So here we have various aspects of time-to-event analysis commonly used when looking at survival. There are many other applications, some of which may not be obvious: for instance, we use CPH for modelling length of stay in hospital.
Stratification can be used to deal with non-proportional hazards in a particular variable.
Hierarchical structure in your data can be accommodated with cluster or frailty (random effects) terms.
Competing risks regression may be useful if your outcome is in competition with another, such as all-cause death, but is currently limited in its ability to accommodate hierarchical structures.
Everybody came back for Day 2 of HealthyR Notebooks in Estonia!
Today focussed on modelling, kicking off with linear regression in detail from Riinu – if you understand this you understand the majority of statistical tests!
Factors were introduced by Cameron, which led nicely into logistic regression. By the end of this the whole room was comfortably building regression models as if they had been doing it for years!
Notebooks are a really powerful tool for teaching this sort of material, allowing seamless output into PDF and Word format. Some of the delegates commented on how they had struggled with this aspect of R in the past.
After all the intense work it was a great relief to break early for some fantastic team building including stuff like this!
Some even tried their hand at archery and felt pretty smug about their performance 😂
Finally, we are so grateful to Julius Juurmaa for all of the organisation. He even started a company to administer the course! Thank you, Julius.
The Surgical Informatics team arrived in beautiful Estonia yesterday.
We are here as part of our Wellcome Trust Open Research Fund grant – “HealthyR Notebooks: Democratising open and reproducible data analysis in resource-poor environments”.
It’s a mouthful, but important! We have adapted our popular HealthyR training course to be easily delivered on small laptop screens allowing state-of-the-art data analysis to be performed using RStudio anywhere by anyone. We’re testing this in Estonia and running it again in Ghana in November.
Also look out for HealthyR the book coming soon.
The setting is fantastic:
The delegates are already amazing at using R, but we’re teaching Tidyverse and Finalfit to bring everyone up to date with all the great new modern packages.
Data security is paramount and encryptr was written to make this easier for non-experts. Columns of data can be encrypted with a couple of lines of R code, and single cells decrypted as required.
But what was missing was an easy way to encrypt the file source of that data.
Now files can be encrypted with a couple of lines of R code.
Encryption and decryption with asymmetric (public/private) keys is computationally expensive, but this is how encrypt() works on data columns: each cell is encrypted individually, making it easy to decrypt a single piece of data without compromising the whole data frame. This works on the presumption that each cell contains less than 245 bytes of data.
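A minimal sketch of column encryption using the package's included gp dataset and its postcode column (assumes keys have already been generated with genkeys(), shown below):
library(encryptr)
library(dplyr)
# Encrypt an identifiable column cell-by-cell
gp_encrypted = gp %>%
  encrypt(postcode)
# Decrypt when required (prompts for the private key password)
gp_encrypted %>%
  decrypt(postcode)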
File encryption requires a different approach because files are much larger. encrypt_file() encrypts a file using a symmetric "session" key and the AES-256 cipher. This key is itself encrypted using a public key generated by genkeys(). In OpenSSL this combination is referred to as an envelope.
It should work with any type of single file but not folders.
Generate keys
genkeys()
#> Private key written with name 'id_rsa'
#> Public key written with name 'id_rsa.pub'
Encrypt file
To demonstrate, the included dataset is written as a .csv file.
write.csv(gp, "gp.csv")
encrypt_file("gp.csv")
#> Encrypted file written with name 'gp.csv.encryptr.bin'
Important: check that the file can be decrypted prior to removing the original file from your system.
Warning: it is strongly suggested that the original unencrypted data file is securely stored elsewhere as a back-up, in case decryption is not possible, e.g., if the private key file or password is lost.
Decrypt file
The decrypt_file() function will not allow the original file to be overwritten; if the original is still present, use the file_name option to specify a new name for the unencrypted file.
decrypt_file("gp.csv.encryptr.bin", file_name = "gp2.csv")
#> Decrypted file written with name 'gp2.csv'
Support / bugs
The new version 0.1.3 is on its way to CRAN today, or you can install from GitHub:
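# Assuming the package still lives at the SurgicalInformatics GitHub organisation
devtools::install_github("SurgicalInformatics/encryptr")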
Many of our projects involve getting doctors, nurses, and medical students to collect data on the patients they are looking after. We want to involve many of them in data analysis, without the requirement for coding experience or access to statistical software. To achieve this we have built Shinyfit, a shiny app for linear, logistic, and Cox PH regression.
Aim: allow access to model fitting without requirement for statistical software or coding experience.
Audience: Those sharing datasets in context of collaborative research or teaching.
Hosting requirements: Basic R coding skills including tidyverse to prepare dataset (5-10 minutes).
Features:
Linear, logistic or CPH regression tables
Coefficient, odds ratio or hazard ratio plots
Crosstabs
Inspect dataset with ff_glimpse
Use your data
To use your own data, clone or download the app from GitHub.
Edit 0_prep.R to create a shinyfit_data object.
Test the app, usually within RStudio.
Deploy to your shiny hosting platform of choice.
Ensure you have permission to share the data
Editing 0_prep.R is straightforward and takes about 5 minutes. The main purpose is to create human-readable menu items and to allow sorting of variables into categories, such as outcome and explanatory.
Errors in shinyfit are usually related to the underlying dataset, e.g.
Variables not appropriately specified as numerics or factors.
A particular factor level is empty, so the regression function (lm, glm, coxph, etc.) returns an error.
A variable with >2 factor levels is used as an outcome/dependent. This is not supported.
Use Glimpse tabs to check data when any error occurs.
It is fully mobile compliant, including datatables.