Five steps for missing data with Finalfit

As a journal editor, I often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data. This is frustrating, as it is often of the utmost importance. Conclusions may (and do) change when missing data are accounted for. A few seem not even to appreciate that in conventional regression, only rows with complete data are included.

These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:

  1. Ensure your data are coded correctly.
  2. Identify missing values within each variable.
  3. Look for patterns of missingness.
  4. Check for associations between missing and observed data.
  5. Decide how to handle missing data.

Finalfit includes a number of functions to help with this.

Some confusing terminology

But first, there are some terms which are easy to mix up. These are important as they describe the mechanism of missingness, and this determines how you can handle the missing data.

Missing completely at random (MCAR)

As it says, values are randomly missing from your dataset. Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.

For instance, when smoking status is not recorded in a random subset of patients.

This is easy to handle, but unfortunately, data are almost never missing completely at random.

Missing at random (MAR)

This is confusing and would be better stated as missing conditionally at random. Here, missing data do have a relationship with other variables in the dataset. However, the actual values that are missing are random.

For example, smoking status is not documented in female patients because the doctor was too shy to ask. Yes ok, not that realistic!

Missing not at random (MNAR)

The pattern of missingness is related to other variables in the dataset, but in addition, the values of the missing data are not random.

For example, when smoking status is not recorded in patients admitted as an emergency, who are also more likely to have worse outcomes from surgery.

Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle. They can only be detected by collecting and examining some of the missing data. This is often difficult or impossible to do.

How you deal with missing data is dependent on the type of missingness. Once you know this, then you can sort it.

More on this below.
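
To make the three definitions concrete, here is a minimal base R simulation. It is only a sketch: the sample size, probabilities, and variable names are all made up for illustration, and the "true" smoking status is known only because we simulate it.

```r
# Simulated illustration of the three mechanisms (all numbers are made up)
set.seed(42)
n <- 10000
sex <- sample(c("Female", "Male"), n, replace = TRUE)
smoker <- rbinom(n, 1, 0.25) == 1  # true status, known only because we simulate it

# MCAR: every patient has the same 10% chance of a missing value
miss_mcar <- runif(n) < 0.10
# MAR: missingness depends on an observed variable (sex), not on smoking itself
miss_mar <- runif(n) < ifelse(sex == "Female", 0.40, 0.10)
# MNAR: missingness depends on the unobserved value (smokers are recorded less often)
miss_mnar <- runif(n) < ifelse(smoker, 0.40, 0.10)

# Proportion missing within each group
tapply(miss_mcar, sex, mean)
tapply(miss_mar, sex, mean)
tapply(miss_mnar, smoker, mean)

# Smoking prevalence among patients whose status was recorded
prevalence <- c(
  truth = mean(smoker),
  mcar  = mean(smoker[!miss_mcar]),
  mar   = mean(smoker[!miss_mar]),
  mnar  = mean(smoker[!miss_mnar])
)
round(prevalence, 3)
```

In the MNAR case the recorded prevalence understates the truth, and nothing in the observed data alone would reveal that. That is exactly why MNAR is so hard to diagnose.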

1. Ensure your data are coded correctly: ff_glimpse

While clearly obvious, this step is often ignored in the rush to get results. The first step in any analysis is robust data cleaning and coding. Lots of packages have a glimpse function and finalfit is no different. This function has three specific goals:

  1. Ensure all factors and numerics are correctly assigned. That is the commonest reason to get an error with a finalfit function. You think you’re using a factor variable, but in fact it is incorrectly coded as a continuous numeric.
  2. Ensure you know which variables have missing data. This presumes missing values are correctly assigned NA. See here for more details if you are unsure.
  3. Ensure factor levels and variable labels are assigned correctly.

Example scenario

Using the colon cancer dataset that comes with finalfit, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.

For demonstration purposes, we will add random MCAR and MAR smoking variables to the dataset.

# Make sure finalfit is up-to-date 
install.packages("finalfit") 

library(finalfit) 

# Create some extra missing data
## Smoking missing completely at random
set.seed(1)
colon_s$smoking_mcar = 
  sample(c("Smoker", "Non-smoker", NA), 
    dim(colon_s)[1], replace=TRUE, 
    prob = c(0.2, 0.7, 0.1)) %>% 
  factor()
Hmisc::label(colon_s$smoking_mcar) = "Smoking (MCAR)"

## Smoking missing conditional on patient sex
colon_s$smoking_mar[colon_s$sex.factor == "Female"] = 
  sample(c("Smoker", "Non-smoker", NA), 
    sum(colon_s$sex.factor == "Female"), 
    replace = TRUE,
    prob = c(0.1, 0.5, 0.4))

colon_s$smoking_mar[colon_s$sex.factor == "Male"] = 
  sample(c("Smoker", "Non-smoker", NA), 
    sum(colon_s$sex.factor == "Male"), 
    replace=TRUE, prob = c(0.15, 0.75, 0.1))
colon_s$smoking_mar = factor(colon_s$smoking_mar)
Hmisc::label(colon_s$smoking_mar) = "Smoking (MAR)"
# Examine with ff_glimpse
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  ff_glimpse(dependent, explanatory)

Numerics
            label   n missing_n missing_percent mean sd min max range  se
age   Age (years) 929         0             0.0   60 12  18  85    67 0.4
nodes        NULL 911        18             1.9    4  4   0  33    33 0.1

Factors
                           label   n missing_n missing_percent level_n
sex.factor                   Sex 929         0               0       2
obstruct.factor      Obstruction 908        21             2.3       2
mort_5yr        Mortality 5 year 915        14             1.5       2
smoking_mcar      Smoking (MCAR) 828       101              11       2
smoking_mar        Smoking (MAR) 719       210              23       2
                                levels  levels_count   levels_percent
sex.factor            "Female", "Male"      445, 484           48, 52
obstruct.factor            "No", "Yes"  732, 176, 21 78.8, 18.9,  2.3
mort_5yr               "Alive", "Died"  511, 404, 14 55.0, 43.5,  1.5
smoking_mcar    "Non-smoker", "Smoker" 645, 183, 101       69, 20, 11
smoking_mar     "Non-smoker", "Smoker" 591, 128, 210       64, 14, 23

The function summarises a data frame or tibble by numeric (continuous) variables and factor (discrete) variables. The dependent and explanatory arguments are for convenience. Pass either or neither, e.g. to summarise the whole data frame or tibble:

colon_s %>%
  ff_glimpse()

It doesn’t present well if you have factors with lots of levels, so you may want to remove these.

library(dplyr)
colon_s %>% 
  select(-hospital) %>% 
  ff_glimpse()

Use this to check that the variables are all assigned and behaving as expected. The proportion of missing data can be seen, e.g. smoking_mar has 23% missing data.
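
If you want the same kind of check without any packages, the core of it is easy to sketch in base R. This is not how ff_glimpse is implemented; `missing_summary` and the `toy` data frame are hypothetical names used only for illustration.

```r
# Toy data with a few missing values (made up for illustration)
toy <- data.frame(
  age    = c(34, 51, NA, 62, 47),
  sex    = factor(c("Female", "Male", "Male", NA, "Female")),
  smoker = factor(c("Smoker", NA, NA, "Non-smoker", "Non-smoker"))
)

# Hypothetical helper: class and missing-value counts for every column
missing_summary <- function(df) {
  missing_n <- colSums(is.na(df))
  data.frame(
    variable        = names(df),
    class           = vapply(df, function(x) class(x)[1], character(1)),
    n               = nrow(df) - missing_n,
    missing_n       = missing_n,
    missing_percent = round(100 * missing_n / nrow(df), 1),
    row.names       = NULL
  )
}

missing_summary(toy)
```

The class column is the quick way to spot a factor that has been read in as a numeric, and the counts only make sense if missing values really are coded as NA.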

2. Identify missing values in each variable: missing_plot

In detecting patterns of missingness, this plot is useful. Row number is on the x-axis and all included variables are on the y-axis. Associations between missingness and observations can be easily seen, as can relationships of missingness between variables.

colon_s %>%
  missing_plot()


It was only when writing this post that I discovered the amazing package, naniar. This package is recommended and provides lots of great visualisations for missing data.

3. Look for patterns of missingness: missing_pattern

missing_pattern simply wraps mice::md.pattern using finalfit grammar. This produces a table and a plot showing the pattern of missingness between variables.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  missing_pattern(dependent, explanatory)

This allows us to look for patterns of missingness between variables. There are 14 patterns in this data. The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic. 
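
For intuition about what a "pattern" is, the tabulation can be sketched in a few lines of base R. This is a simplified illustration on made-up data, not the mice implementation; it borrows the convention of mice::md.pattern that 1 means observed and 0 means missing.

```r
# Toy data (made up): nodes and smoker tend to be missing together
toy <- data.frame(
  age    = c(34, 51, NA, 62, 47, 58),
  nodes  = c(2, NA, NA, 5, 1, NA),
  smoker = c("Smoker", NA, NA, "Non-smoker", "Non-smoker", NA)
)

# 1 = observed, 0 = missing, one column per variable
observed <- (!is.na(toy)) + 0L

# Collapse each row into a pattern string such as "100" and count the patterns
pattern <- apply(observed, 1, paste, collapse = "")
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```

Each distinct string is one pattern. A small number of patterns in which the same variables always go missing together points towards something systematic rather than random.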

Make sure you include missing data in demographics tables

Table 1 in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders. Do not silently drop missing values in this table. It is easy to do this correctly with summary_factorlist. This function provides a useful summary of a dependent variable against explanatory variables. Despite its name, continuous variables are handled nicely.

na_include=TRUE ensures missing data from the explanatory variables (but not dependent) are included. Note that any p-values are generated across missing groups as well, so run a second time with na_include=FALSE if you wish a hypothesis test only over observed data.

library(finalfit)

# Explanatory or confounding variables
explanatory = c("age", "sex.factor", 
  "nodes",  
  "smoking_mcar", "smoking_mar")

# Explanatory variable of interest
dependent = "obstruct.factor" # Bowel obstruction

colon_s %>% 
  summary_factorlist(dependent, explanatory, 
  na_include=TRUE, p=TRUE)
  
          label     levels          No         Yes     p
    Age (years)  Mean (SD) 60.2 (11.5) 57.3 (13.3) 0.014
            Sex     Female  346 (79.2)   91 (20.8) 0.290
                      Male  386 (82.0)   85 (18.0)      
          nodes  Mean (SD)   3.7 (3.7)   3.5 (3.2) 0.774
 Smoking (MCAR) Non-smoker  500 (79.4)  130 (20.6) 0.173
                    Smoker  154 (85.6)   26 (14.4)      
                   Missing   78 (79.6)   20 (20.4)      
  Smoking (MAR) Non-smoker  467 (80.9)  110 (19.1) 0.056
                    Smoker   91 (73.4)   33 (26.6)      
                   Missing  174 (84.1)   33 (15.9)

4. Check for associations between missing and observed data: missing_pairs | missing_compare

In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables. This is particularly important (I would say absolutely required) for a primary outcome measure / dependent variable.

Take for example “death”. When that outcome is missing it is often for a particular reason. For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery. And of course, death is more likely after emergency surgery.

missing_pairs uses functions from the excellent GGally package. It produces pairs plots to show relationships between missing values and observed values in all variables.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar", "smoking_mar")
dependent = "mort_5yr"
colon_s %>% 
  missing_pairs(dependent, explanatory)

For continuous variables (age and nodes), the distributions of observed and missing data can be visually compared. Is there a difference between age and mortality above?

For discrete data, counts are presented by default. It is often easier to compare proportions:

colon_s %>% 
  missing_pairs(dependent, explanatory, position = "fill")

It should be obvious that missingness in Smoking (MCAR) does not relate to sex (row 6, column 3). But missingness in Smoking (MAR) does differ by sex (last row, column 3), as was designed above when the missing data were created.

We can confirm this using missing_compare.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor")
dependent = "smoking_mcar"
colon_s %>% 
  missing_compare(dependent, explanatory)

 Missing data analysis: Smoking (MCAR)           Not missing     Missing     p
                           Age (years) Mean (SD) 59.7 (11.9) 59.9 (12.6) 0.867
                                   Sex    Female  399 (89.7)   46 (10.3) 0.616
                                            Male  429 (88.6)   55 (11.4)      
                                 nodes Mean (SD)   3.6 (3.4)     4 (4.5) 0.990
                           Obstruction        No  654 (89.3)   78 (10.7) 0.786
                                             Yes  156 (88.6)   20 (11.4)     
											 
dependent = "smoking_mar"
colon_s %>% 
  missing_compare(dependent, explanatory)

 Missing data analysis: Smoking (MAR)           Not missing    Missing      p
                          Age (years) Mean (SD) 59.6 (11.9)  60.1 (12)  0.709
                                  Sex    Female  288 (64.7) 157 (35.3) <0.001
                                           Male  431 (89.0)  53 (11.0)       
                                nodes Mean (SD)   3.6 (3.6)  3.8 (3.6)  0.730
                          Obstruction        No  558 (76.2) 174 (23.8)  0.154
                                            Yes  143 (81.2)  33 (18.8)

It takes “dependent” and “explanatory” variables, but in this context “dependent” just refers to the variable being tested for missingness against the “explanatory” variables.

Comparisons for continuous data use a Kruskal-Wallis test and for discrete data a chi-squared test.

As expected, a relationship is seen between Sex and Smoking (MAR) but not Smoking (MCAR).
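
The idea behind these comparisons can be sketched in base R: turn "is this value missing?" into its own variable, then test it against the other variables. The data below are simulated and the variable names are made up; this shows the general approach rather than finalfit's internal code.

```r
# Simulated data in which smoking is MAR by construction (made-up probabilities)
set.seed(1)
n <- 929
sex <- factor(sample(c("Female", "Male"), n, replace = TRUE))
age <- round(rnorm(n, mean = 60, sd = 12))
smoking <- sample(c("Smoker", "Non-smoker"), n, replace = TRUE, prob = c(0.2, 0.8))
# Women are more likely to have smoking status missing
smoking[runif(n) < ifelse(sex == "Female", 0.4, 0.1)] <- NA

# Missingness indicator for the variable under test
smoking_missing <- factor(is.na(smoking), levels = c(FALSE, TRUE),
                          labels = c("Not missing", "Missing"))

# Discrete explanatory variable: chi-squared test on the cross-tabulation
sex_test <- chisq.test(table(sex, smoking_missing))
# Continuous explanatory variable: Kruskal-Wallis test across the two groups
age_test <- kruskal.test(age ~ smoking_missing)

sex_test$p.value
age_test$p.value
```

Here sex should show a clear association with missingness, while age was generated independently of it, so any association seen for age is down to chance.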

For those who like an omnibus test

If you work predominantly with numeric rather than discrete data (categorical/factors), you may find these tests from the MissMech package useful. The package and output are well documented, and it provides two tests which can be used to determine whether data are MCAR.

library(finalfit)
library(dplyr)
library(MissMech)
explanatory = c("age", "nodes")
dependent = "mort_5yr" 

colon_s %>% 
  select(all_of(explanatory)) %>% 
  MissMech::TestMCARNormality()

5. Decide how to handle missing data

These pages from Karen Grace-Martin are great for this.

Prior to a standard regression analysis, we can either:

  • Delete the variable with the missing data
  • Delete the cases with the missing data
  • Impute (fill in) the missing data
  • Model the missing data

MCAR, MAR, or MNAR

MCAR vs MAR

Using the examples, we identify that Smoking (MCAR) is missing completely at random. 

We know nothing about the missing values themselves, but we know of no plausible reason that the values of the missing data, for say, people who died should be different to the values of the missing data for those who survived. The pattern of missingness is therefore not felt to be MNAR.

Common solution

Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.

We therefore elect to simply omit the patients in whom smoking is missing. This is known as list-wise deletion and will be performed by default in standard regression analyses including finalfit.

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor",  
  "smoking_mcar")
dependent = "mort_5yr"
colon_s %>% 
  finalfit(dependent, explanatory, metrics=TRUE)

 Dependent: Mortality 5 year                  Alive        Died          OR (univariable)        OR (multivariable)
                 Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.200)
                         Sex     Female  243 (47.6)  194 (48.0)                         -                         -
                                   Male  268 (52.4)  210 (52.0) 0.98 (0.76-1.27, p=0.889) 1.02 (0.76-1.38, p=0.872)
                       nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p<0.001) 1.25 (1.18-1.33, p<0.001)
                 Obstruction         No  408 (82.1)  312 (78.6)                         -                         -
                                    Yes   89 (17.9)   85 (21.4) 1.25 (0.90-1.74, p=0.189) 1.53 (1.05-2.22, p=0.027)
              Smoking (MCAR) Non-smoker  358 (79.9)  277 (75.3)                         -                         -
                                 Smoker   90 (20.1)   91 (24.7) 1.31 (0.94-1.82, p=0.113) 1.37 (0.96-1.96, p=0.083)

Number in dataframe = 929, Number in model = 782, Missing = 147, AIC = 1003.3, C-statistic = 0.687, H&L = Chi-sq(8) 15.03 (p=0.058)

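The "Number in model" figure above is list-wise deletion in action. A minimal base R sketch on simulated data (made-up variables, not the colon dataset) shows that glm quietly does the same thing:

```r
# Simulated data with some values knocked out at random
set.seed(2)
n <- 500
dat <- data.frame(age = rnorm(n, 60, 12), nodes = rpois(n, 3))
dat$died <- rbinom(n, 1, plogis(-1 + 0.2 * dat$nodes))
dat$age[sample(n, 40)] <- NA
dat$nodes[sample(n, 25)] <- NA

# glm() drops any row with a missing value in a model variable (na.action = na.omit)
fit <- glm(died ~ age + nodes, data = dat, family = binomial)

nrow(dat)                   # rows in the data frame
nobs(fit)                   # rows actually used in the model
sum(complete.cases(dat))    # rows with no missing values: matches nobs(fit)
```

No warning is given, which is why it is worth reporting the number in the data frame alongside the number in the model.
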
Other considerations

  1. Sensitivity analysis
  2. Omit the variable
  3. Imputation
  4. Model the missing data

If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis. A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model. Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labeling with “non-smoker”.

If smoking is not associated with the explanatory variable of interest (bowel obstruction) or the outcome, it may be considered not to be a confounder and so could be omitted. That neatly deals with the missing data issue, but of course may not be appropriate.

Imputation and modelling are considered below.
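
The sensitivity analysis described above can be sketched in base R. Everything here is simulated and the helper names (`or_obstruct`, `fill_missing`) are hypothetical; the point is only the shape of the procedure: refit the same model with the missing values set to each extreme and see whether the estimate of interest moves.

```r
# Simulated data (made-up effect sizes), with smoking missing in 100 patients
set.seed(3)
n <- 800
dat <- data.frame(obstruct = rbinom(n, 1, 0.2))
smoker_true <- rbinom(n, 1, 0.25)
dat$died <- rbinom(n, 1, plogis(-0.5 + 0.4 * dat$obstruct + 0.3 * smoker_true))
dat$smoking <- ifelse(smoker_true == 1, "Smoker", "Non-smoker")
dat$smoking[sample(n, 100)] <- NA

# Hypothetical helper: odds ratio for obstruction, adjusted for smoking
or_obstruct <- function(d) {
  fit <- glm(died ~ obstruct + smoking, data = d, family = binomial)
  unname(exp(coef(fit)["obstruct"]))
}

# Hypothetical helper: re-label every missing smoking value with one level
fill_missing <- function(d, value) {
  d$smoking[is.na(d$smoking)] <- value
  d
}

sensitivity <- c(
  complete_case   = or_obstruct(dat),
  all_smokers     = or_obstruct(fill_missing(dat, "Smoker")),
  all_non_smokers = or_obstruct(fill_missing(dat, "Non-smoker"))
)
round(sensitivity, 2)
```

If the three odds ratios tell the same story, the missing values are unlikely to be driving the conclusion.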

MAR vs MCAR

But life is rarely that simple.

Consider that the smoking variable is more likely to be missing if the patient is female (missing_compare shows a relationship). But, say, that the missing values are not different from the observed values. Missingness is then MAR.

If we simply drop all the cases (patients) in which smoking is missing (list-wise deletion), then proportionally we drop more women than men. This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or outcome.

Common solution

mice is our go-to package for multiple imputation. That’s the process of filling in missing data using a best estimate from all the other data that exist. When first encountered, this doesn’t sound like a good idea.

However, taking our simple example, if missingness in smoking is predicted strongly by sex, and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and other variables in the dataset.

Imputation is not usually appropriate for the explanatory variable of interest or the outcome variable. With both of these, the hypothesis is that there is a meaningful association with other variables in the dataset, therefore it doesn’t make sense to use these variables to impute them.

Here is some code to run mice. The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process. Read the documentation carefully prior to doing this yourself.

# Multivariate Imputation by Chained Equations (mice)
library(finalfit)
library(dplyr)
library(mice)
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
dependent = "mort_5yr"

colon_s %>% 
  select(all_of(c(dependent, explanatory))) %>% 
  # Exclude outcome and explanatory variable of interest from imputation
  dplyr::filter(!is.na(mort_5yr), !is.na(obstruct.factor)) %>%
  # Run imputation with 10 imputed sets
  mice(m = 10) %>% 
  # Run logistic regression on each imputed set
  with(glm(formula(ff_formula(dependent, explanatory)), 
    family="binomial")) %>%
  # Pool and summarise results
  pool() %>%                                            
  summary(conf.int = TRUE, exponentiate = TRUE) %>%
  # Jiggle into finalfit format
  mutate(explanatory_name = rownames(.)) %>%            
  select(explanatory_name, estimate, `2.5 %`, `97.5 %`, p.value) %>% 
  condense_fit(estimate_suffix = " (multiple imputation)") %>% 
  remove_intercept() -> fit_imputed

# Use finalfit merge methods to create and compare results
colon_s %>% 
  summary_factorlist(dependent, explanatory, fit_id = TRUE) -> summary1

colon_s %>% 
  glmuni(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (univariable)") -> fit_uni

colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (multivariable inc. smoking)") -> fit_multi

explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor")
colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (multivariable)") -> fit_multi_r

# Combine to final table
summary1 %>% 
  ff_merge(fit_uni) %>% 
  ff_merge(fit_multi_r) %>% 
  ff_merge(fit_multi) %>% 
  ff_merge(fit_imputed) %>% 
  select(-fit_id, -index)

         label     levels       Alive        Died          OR (univariable)        OR (multivariable) OR (multivariable inc. smoking)  OR (multiple imputation)
   Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.122)       1.02 (1.00-1.03, p=0.010) 1.01 (1.00-1.02, p=0.116)
           Sex     Female  243 (55.6)  194 (44.4)                         -                         -                               -                         -
                     Male  268 (56.1)  210 (43.9) 0.98 (0.76-1.27, p=0.889) 0.98 (0.74-1.30, p=0.890)       0.88 (0.64-1.23, p=0.461) 0.99 (0.75-1.31, p=0.957)
         nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p<0.001) 1.25 (1.19-1.32, p<0.001)       1.25 (1.18-1.33, p<0.001) 1.25 (1.19-1.32, p<0.001)
   Obstruction         No  408 (56.7)  312 (43.3)                         -                         -                               -                         -
                      Yes   89 (51.1)   85 (48.9) 1.25 (0.90-1.74, p=0.189) 1.36 (0.95-1.93, p=0.089)       1.26 (0.85-1.88, p=0.252) 1.36 (0.95-1.93, p=0.089)
 Smoking (MAR) Non-smoker  328 (56.4)  254 (43.6)                         -                         -                               -                         -
                   Smoker   68 (53.5)   59 (46.5) 1.12 (0.76-1.65, p=0.563)                         -       1.25 (0.82-1.89, p=0.300) 1.26 (0.82-1.94, p=0.289)

The final table can easily be exported to Word or as a PDF as described elsewhere.

By examining the coefficients, the effect of the imputation compared with the complete case analysis can be clearly seen.
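
If the pooling step feels like a black box: mice::pool combines the per-imputation estimates using Rubin's rules. Here is a minimal sketch of the arithmetic with made-up log odds ratios and standard errors (not values from the model above). It leaves out the degrees-of-freedom adjustment that is needed for confidence intervals.

```r
# Hypothetical log odds ratios and standard errors from m = 5 imputed datasets
est <- c(0.21, 0.25, 0.19, 0.24, 0.22)
se  <- c(0.20, 0.21, 0.20, 0.22, 0.21)
m <- length(est)

pooled  <- mean(est)     # pooled estimate: the average across imputations
within  <- mean(se^2)    # within-imputation variance
between <- var(est)      # between-imputation variance
total   <- within + (1 + 1 / m) * between  # Rubin's total variance

pooled_se <- sqrt(total)
c(log_or = pooled, se = pooled_se, or = exp(pooled))
```

The between-imputation term is what separates multiple imputation from filling in a single best guess: uncertainty about the imputed values carries through to the standard error.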

Other considerations

  1. Omit the variable
  2. Imputing factors with new level for missing data
  3. Model the missing data

As above, if the variable does not appear to be important, it may be omitted from the analysis. A sensitivity analysis in this context is another form of imputation. But rather than using all other available information to best-guess the missing data, we simply assign the value as above. Imputation is therefore likely to be more appropriate.

There is an alternative method to model the missing data for a categorical variable in this setting: just consider the missing data as a factor level. This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model. Multiple imputation is generally preferred.

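library(dplyr)
# Include smoking_mar again (it was dropped for the reduced model above)
explanatory = c("age", "sex.factor", 
  "nodes", "obstruct.factor", "smoking_mar")
colon_s %>% 
  mutate(
    # Make NA an explicit "(Missing)" level
    # (forcats >= 1.0.0 prefers fct_na_value_to_level() for this)
    smoking_mar = forcats::fct_explicit_na(smoking_mar)
  ) %>% 
  finalfit(dependent, explanatory)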

 Dependent: Mortality 5 year                  Alive        Died          OR (univariable)        OR (multivariable)
                 Age (years)  Mean (SD) 59.8 (11.4) 59.9 (12.5) 1.00 (0.99-1.01, p=0.986) 1.01 (1.00-1.02, p=0.119)
                         Sex     Female  243 (47.6)  194 (48.0)                         -                         -
                                   Male  268 (52.4)  210 (52.0) 0.98 (0.76-1.27, p=0.889) 0.96 (0.72-1.30, p=0.809)
                       nodes  Mean (SD)   2.7 (2.4)   4.9 (4.4) 1.24 (1.18-1.30, p<0.001) 1.25 (1.19-1.32, p<0.001)
                 Obstruction         No  408 (82.1)  312 (78.6)                         -                         -
                                    Yes   89 (17.9)   85 (21.4) 1.25 (0.90-1.74, p=0.189) 1.34 (0.94-1.91, p=0.102)
               Smoking (MAR) Non-smoker  328 (64.2)  254 (62.9)                         -                         -
                                 Smoker   68 (13.3)   59 (14.6) 1.12 (0.76-1.65, p=0.563) 1.24 (0.82-1.88, p=0.308)
                              (Missing)  115 (22.5)   91 (22.5) 1.02 (0.74-1.41, p=0.895) 0.99 (0.69-1.41, p=0.943)
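
The same "missing as a level" recoding can be sketched without forcats, using base R's addNA. The vector here is made up, and the "(Missing)" label is chosen only to match the output above.

```r
# Toy factor with missing values (made up for illustration)
smoking <- factor(c("Smoker", NA, "Non-smoker", NA, "Smoker"))

# addNA() turns NA into a real factor level; then give that level a readable name
smoking_explicit <- addNA(smoking)
levels(smoking_explicit)[is.na(levels(smoking_explicit))] <- "(Missing)"

table(smoking_explicit)
```

Once missing is a level in its own right, those rows are no longer dropped by the model, which is where the extra model term comes from.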

MNAR vs MAR

Missing not at random data is tough in healthcare. To determine if data are MNAR for definite, we need to know their value in a subset of observations (patients).

Consider our example above. Say smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing cancer. Obstructing bowel cancers may be larger or their position may make the prognosis worse. Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis. The missing values for smoking may therefore not be random. Smoking may be more common in the emergency patients and may be more common in those that die.

There is no easy way to handle this. If at all possible, try to get the missing data. Otherwise, take care when drawing conclusions from analyses where data are thought to be missing not at random. 

Where to next

We are now doing more in Stan. Missing data can be imputed directly within a Stan model which feels neat. Stan doesn’t yet have the equivalent of NA which makes passing the data block into Stan a bit of a faff. 

Alternatively, the missing data can be directly modelled in Stan. Examples are provided in the manual. Again, I haven’t found this that easy to do, but there are a number of Stan developments that will hopefully make this more straightforward in the future. 

Source: Blog

Finalfit now includes bootstrap simulation for model prediction

If you’re new to modelling in R and don’t know what this title means, you definitely want to look into doing it.

I’ve always been a fan of converting model outputs to real-life quantities of interest. For example, I like to supplement a logistic regression model table with predicted probabilities for a given set of explanatory variable levels. This can be more intuitive than odds ratios, particularly for a lay audience.

For example, say I have run a logistic regression model for predicted 5 year survival after colon cancer. What is the actual probability of death for a patient under 40 with a small cancer that has not perforated? How does that probability differ for a patient over 40?

I’ve tried this various ways. I used Zelig for a while including here, but it started trying to do too much and was always broken (I updated it the other day in the hope that things were better, but was met with a string of errors again).

I also used rms, including here (check out the nice plots!). I like it and respect the package. But I don’t use it as standard and so need to convert all the models first, e.g. to lrm. Again, for my needs it tries to do too much and I find datadist awkward.

Thirdly, I love Stan for this, e.g. used in this paper. The generated quantities block allows great flexibility to simulate whatever you wish from the posterior. I’m a Bayesian at heart and will always come back to this. But for some applications it’s a bit much, and takes some time to get running as I want.

I often simply want to predict y-hat from lm and glm with bootstrapped intervals and, ideally, a comparison of sets of explanatory levels. Just like sim does in Zelig. But I want it in a format I can immediately use in a publication.

Well now I can with finalfit.

You need to use the GitHub version of the package until CRAN is updated:

devtools::install_github("ewenharrison/finalfit")

There are two main functions, with some new internals to help expand to other models in the future.

Create new dataframe of explanatory variable levels

finalfit_newdata is used to generate a new dataframe. I usually want to set 4 or 5 combinations of x levels and often find it difficult to get this formatted for predict. Pass the original dataset, the names of explanatory variables used in the model, and a list of levels for these. For the latter, they can be included as rows or columns. If the data type is incorrect or you try to pass factor levels that don’t exist, it will fail with a useful warning.

library(finalfit)
explanatory = c("age.factor", "extent.factor", "perfor.factor")
dependent = 'mort_5yr'

colon_s %>%
  finalfit_newdata(explanatory = explanatory, newdata = list(
    c("<40 years",  "Submucosa", "No"),
    c("<40 years", "Submucosa", "Yes"),
    c("<40 years", "Adjacent structures", "No"),
    c("<40 years", "Adjacent structures", "Yes")
  )) -> newdata
newdata

  age.factor       extent.factor perfor.factor
1  <40 years           Submucosa            No
2  <40 years           Submucosa           Yes
3  <40 years Adjacent structures            No
4  <40 years Adjacent structures           Yes

Run bootstrap simulations of model predictions

boot_predict takes standard lm and glm model objects, together with finalfit lmlist and glmlist objects from fitters, e.g. lmmulti and glmmulti. In addition, it requires a newdata object generated from finalfit_newdata. If you’re new to this, don’t be put off by all those model acronyms; it is straightforward.

colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  boot_predict(newdata, 
    estimate_name = "Predicted probability of death",
    R=100, boot_compare = FALSE,
    digits = c(2,3))

        Age    Extent of spread Perforation Predicted probability of death
1 <40 years           Submucosa          No            0.28 (0.00 to 0.52)
2 <40 years           Submucosa         Yes            0.29 (0.00 to 0.61)
3 <40 years Adjacent structures          No            0.71 (0.50 to 0.86)
4 <40 years Adjacent structures         Yes            0.72 (0.45 to 0.89)

Note that the number of simulations (R) here is low for demonstration purposes. You should expect to use 1000 to 10000 to ensure you have stable estimates.

Output to Word, PDF, and html via RMarkdown

Simulations are produced using bootstrapping and everything is tidily outputted in a table/dataframe, which can be passed to knitr::kable.

# Within an .Rmd file
```{r}
knitr::kable(table, row.names = FALSE, align = c("l", "l", "l", "r"))
```

Make comparisons

Better still, by including boot_compare = TRUE, comparisons are made between the first row of newdata and each subsequent row. These can be first differences (e.g. absolute risk differences) or ratios (e.g. relative risk ratios). The comparisons are done on the individual bootstrap predictions and the distribution summarised as a mean with percentile confidence intervals (95% CI as default, e.g. 2.5 and 97.5 percentiles). A p-value is generated from the proportion of values on the other side of the null from the mean, e.g. for a ratio greater than 1.0, p is the proportion of bootstrapped predictions under 1.0, multiplied by two so that it is two-sided. (Sorry about including a p-value.)

Scroll right here:

colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  boot_predict(newdata, 
    estimate_name = "Predicted probability of death",
    compare_name = "Absolute risk difference",
    R=100, digits = c(2,3))

        Age    Extent of spread Perforation Predicted probability of death      Absolute risk difference
1 <40 years           Submucosa          No            0.28 (0.00 to 0.52)                             -
2 <40 years           Submucosa         Yes            0.29 (0.00 to 0.62) 0.01 (-0.15 to 0.20, p=0.920)
3 <40 years Adjacent structures          No            0.71 (0.56 to 0.89)  0.43 (0.19 to 0.68, p<0.001)
4 <40 years Adjacent structures         Yes            0.72 (0.45 to 0.91)  0.43 (0.11 to 0.73, p<0.001)
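
For anyone curious about the mechanics, the procedure described above can be sketched in base R. This is not the code inside boot_predict; it is a simplified illustration on simulated data with a single made-up explanatory variable (`perfor`) and a low number of replicates.

```r
# Simulated data: death is more likely with perforation (made-up probabilities)
set.seed(4)
n <- 400
dat <- data.frame(
  perfor = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.7, 0.3)))
)
dat$died <- rbinom(n, 1, ifelse(dat$perfor == "Yes", 0.55, 0.40))

# Explanatory levels to predict for; row 1 is the reference for comparisons
newdata <- data.frame(perfor = factor(c("No", "Yes"), levels = levels(dat$perfor)))

# Resample rows with replacement, refit the model, predict for each newdata row
R <- 500
boot_pred <- t(replicate(R, {
  resampled <- dat[sample(n, replace = TRUE), , drop = FALSE]
  fit <- glm(died ~ perfor, data = resampled, family = binomial)
  predict(fit, newdata = newdata, type = "response")
}))

# Predicted probabilities: mean and percentile interval per newdata row
estimate <- colMeans(boot_pred)
ci <- apply(boot_pred, 2, quantile, probs = c(0.025, 0.975))

# First difference (row 2 minus row 1), computed within each bootstrap replicate
risk_diff <- boot_pred[, 2] - boot_pred[, 1]
diff_mean <- mean(risk_diff)
diff_ci <- quantile(risk_diff, c(0.025, 0.975))

# p-value: proportion of differences on the other side of zero from the mean, doubled
other_side <- if (diff_mean > 0) risk_diff <= 0 else risk_diff >= 0
p_two_sided <- min(1, 2 * mean(other_side))

round(c(diff = diff_mean, diff_ci, p = p_two_sided), 3)
```

Doing the subtraction within each replicate, rather than comparing the two summary intervals afterwards, is what gives the comparison its own confidence interval.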

What is not included?

It doesn’t yet include our other common models, such as coxph, which I may add in. It doesn’t do lmer or glmer either. bootMer works well for mixed-effects models, which take a bit more care and thought, e.g. how are random effects to be handled in the simulations. So I don’t have immediate plans to add that in; it is better to do it directly.

Plotting

Finally, as with all finalfit functions, results can be produced as individual variables using condense = FALSE. This is particularly useful for plotting.

library(finalfit)
library(ggplot2)
theme_set(theme_bw())

explanatory = c("nodes", "extent.factor", "perfor.factor")
dependent = 'mort_5yr'

colon_s %>%
  finalfit_newdata(explanatory = explanatory, rowwise = FALSE,
    newdata = list(
      rep(seq(0, 30), 4),
      c(rep("Muscle", 62), rep("Adjacent structures", 62)),
      c(rep("No", 31), rep("Yes", 31), rep("No", 31), rep("Yes", 31))
    )
  ) -> newdata

colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  boot_predict(newdata, boot_compare = FALSE, 
  R=100, condense=FALSE) %>% 
  ggplot(aes(x = nodes, y = estimate, ymin = estimate_conf.low,
      ymax = estimate_conf.high, fill=extent.factor))+
    geom_line(aes(colour = extent.factor))+
    geom_ribbon(alpha=0.1)+
    facet_grid(.~perfor.factor)+
    xlab("Number of positive lymph nodes")+
    ylab("Probability of death")+
    labs(fill = "Extent of tumour", colour = "Extent of tumour")+
    ggtitle("Probability of death by lymph node count")

So there you have it. Straightforward bootstrapped simulations of model predictions, together with comparisons and easy plotting.

Source: Blog

Finalfit now in CRAN

Your favourite package for getting model outputs directly into publication-ready tables is now available on CRAN. They make you work for it! Thank you to all who helped. The development version will continue to be available from GitHub.

Source: Blog

Finalfit, knitr and R Markdown for quick results

Thank you for the many requests to provide some extra info on how best to get finalfit results out of RStudio, and particularly into Microsoft Word.

Here is how.

Make sure you are on the most up-to-date version of finalfit.

devtools::install_github("ewenharrison/finalfit")

What follows is for demonstration purposes and is not meant to illustrate model building.

Does a tumour characteristic (differentiation) predict 5-year survival?

Demographics table

First explore the variable of interest (exposure) by making it the dependent.

library(finalfit)
library(dplyr)

dependent = "differ.factor"

# Specify explanatory variables of interest
explanatory = c("age", "sex.factor", 
  "extent.factor", "obstruct.factor", 
  "nodes")

Note this useful alternative way of specifying explanatory variable lists:

colon_s %>% 
  select(age, sex.factor, 
  extent.factor, obstruct.factor, nodes) %>% 
  names() -> explanatory

Look at associations between our exposure and other explanatory variables. Include missing data.

colon_s %>% 
  summary_factorlist(dependent, explanatory, 
  p=TRUE, na_include=TRUE)

             label              levels        Well    Moderate       Poor      p
       Age (years)           Mean (SD) 60.2 (12.8) 59.9 (11.7)  59 (12.8)  0.788
               Sex              Female   51 (11.6)  314 (71.7)  73 (16.7)  0.400
                                  Male    42 (9.0)  349 (74.6)  77 (16.5)       
  Extent of spread           Submucosa    5 (25.0)   12 (60.0)   3 (15.0)  0.081
                                Muscle   12 (11.8)   78 (76.5)  12 (11.8)       
                                Serosa   76 (10.2)  542 (72.8) 127 (17.0)       
                   Adjacent structures     0 (0.0)   31 (79.5)   8 (20.5)       
       Obstruction                  No    69 (9.7)  531 (74.4) 114 (16.0)  0.110
                                   Yes   19 (11.0)  122 (70.9)  31 (18.0)       
                               Missing    5 (25.0)   10 (50.0)   5 (25.0)       
             nodes           Mean (SD)   2.7 (2.2)   3.6 (3.4)  4.7 (4.4) <0.001
Warning messages:
1: In chisq.test(tab, correct = FALSE) :
  Chi-squared approximation may be incorrect
2: In chisq.test(tab, correct = FALSE) :
  Chi-squared approximation may be incorrect

Note the missing data in obstruct.factor. We will drop this variable for now (again, this is for demonstration only). Note also that nodes has not been labelled.
Some cells have small counts, which generates the chisq.test warnings (expected count of less than 5 in at least one cell). Generate the final table.
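To see where the warning comes from, the expected counts can be inspected directly. This is a sketch in base R using the counts printed in the "Extent of spread" rows of the table above; the Fisher's exact test at the end is one possible alternative for sparse tables, not something the table above uses.

```r
# Counts copied from the "Extent of spread" rows of the table above
tab = matrix(
  c( 5,  12,   3,
    12,  78,  12,
    76, 542, 127,
     0,  31,   8),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    extent = c("Submucosa", "Muscle", "Serosa", "Adjacent structures"),
    differ = c("Well", "Moderate", "Poor")))

# Expected counts under independence; small values here trigger the warning
expected = suppressWarnings(chisq.test(tab, correct = FALSE))$expected
round(expected, 1)
n_small = sum(expected < 5)   # number of cells with an expected count below 5
n_small

# One possible alternative for sparse tables: Fisher's exact test,
# with a simulated p-value to keep the computation light
set.seed(1)
fisher_out = fisher.test(tab, simulate.p.value = TRUE, B = 2000)
fisher_out$p.value
```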

Hmisc::label(colon_s$nodes) = "Lymph nodes involved"
explanatory = c("age", "sex.factor", 
  "extent.factor", "nodes")

colon_s %>% 
  summary_factorlist(dependent, explanatory, 
  p=TRUE, na_include=TRUE, 
  add_dependent_label=TRUE) -> table1
table1

Dependent: Differentiation                            Well    Moderate       Poor      p
                Age (years)           Mean (SD) 60.2 (12.8) 59.9 (11.7)  59 (12.8)  0.788
                        Sex              Female   51 (11.6)  314 (71.7)  73 (16.7)  0.400
                                           Male    42 (9.0)  349 (74.6)  77 (16.5)       
           Extent of spread           Submucosa    5 (25.0)   12 (60.0)   3 (15.0)  0.081
                                         Muscle   12 (11.8)   78 (76.5)  12 (11.8)       
                                         Serosa   76 (10.2)  542 (72.8) 127 (17.0)       
                            Adjacent structures     0 (0.0)   31 (79.5)   8 (20.5)       
       Lymph nodes involved           Mean (SD)   2.7 (2.2)   3.6 (3.4)  4.7 (4.4) <0.001

Logistic regression table

Now examine explanatory variables against outcome. Check plot runs ok.

explanatory = c("age", "sex.factor", 
  "extent.factor", "nodes", 
  "differ.factor")
dependent = "mort_5yr"
colon_s %>% 
  finalfit(dependent, explanatory, 
  dependent_label_prefix = "") -> table2

Mortality 5 year                           Alive        Died           OR (univariable)         OR (multivariable)
          Age (years)           Mean (SD) 59.8 (11.4) 59.9 (12.5)  1.00 (0.99-1.01, p=0.986)  1.01 (1.00-1.02, p=0.195)
                  Sex              Female  243 (47.6)  194 (48.0)                          -                          -
                                     Male  268 (52.4)  210 (52.0)  0.98 (0.76-1.27, p=0.889)  0.98 (0.74-1.30, p=0.885)
     Extent of spread           Submucosa    16 (3.1)     4 (1.0)                          -                          -
                                   Muscle   78 (15.3)    25 (6.2)  1.28 (0.42-4.79, p=0.681)  1.28 (0.37-5.92, p=0.722)
                                   Serosa  401 (78.5)  349 (86.4) 3.48 (1.26-12.24, p=0.027) 3.13 (1.01-13.76, p=0.076)
                      Adjacent structures    16 (3.1)    26 (6.4) 6.50 (1.98-25.93, p=0.004) 6.04 (1.58-30.41, p=0.015)
 Lymph nodes involved           Mean (SD)   2.7 (2.4)   4.9 (4.4)  1.24 (1.18-1.30, p<0.001)  1.23 (1.17-1.30, p<0.001)
      Differentiation                Well   52 (10.5)   40 (10.1)                          -                          -
                                 Moderate  382 (76.9)  269 (68.1)  0.92 (0.59-1.43, p=0.694)  0.70 (0.44-1.12, p=0.132)
                                     Poor   63 (12.7)   86 (21.8)  1.77 (1.05-3.01, p=0.032)  1.08 (0.61-1.90, p=0.796)

Odds ratio plot

colon_s %>% 
  or_plot(dependent, explanatory, 
  breaks = c(0.5, 1, 5, 10, 20, 30))

To MS Word via knitr/R Markdown

Important: in most R Markdown set-ups, the document is knitted in a fresh R session, so objects in your environment need to be saved to a file and then loaded within the R Markdown document.

# Save objects for knitr/markdown
save(table1, table2, dependent, explanatory, file = "out.rda")
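Why this matters can be sketched in base R by loading the saved file into an empty environment, which stands in for the clean session a knitted document starts with. The objects below are hypothetical stand-ins, and a temporary file is used in place of out.rda so nothing is overwritten.

```r
# Stand-ins for the real objects (hypothetical values, for illustration only)
table1 = data.frame(label = "Age (years)", levels = "Mean (SD)")
dependent = "mort_5yr"

# Save to a temporary file rather than "out.rda"
rda_file = tempfile(fileext = ".rda")
save(table1, dependent, file = rda_file)

# A knitted document starts without these objects; mimic that with an empty environment
fresh = new.env(parent = emptyenv())
exists("table1", envir = fresh, inherits = FALSE)   # FALSE before loading

loaded = load(rda_file, envir = fresh)              # returns the names restored
loaded
exists("table1", envir = fresh, inherits = FALSE)   # TRUE after loading
```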

We use an RStudio Server Pro set-up on Ubuntu, but these instructions should work fine for most/all default RStudio/R Markdown set-ups.

In RStudio, select File > New File > R Markdown.

A useful template file is produced by default. Try hitting knit to Word on the knitr button at the top of the .Rmd script window.

Now paste this into the file:

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "22/5/2018"
output:
  word_document: default
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE}
colon_s %>% 
  or_plot(dependent, explanatory)
```

It’s ok, but not great.

Create Word template file

Now, edit the Word template. Click on a table. The style should be compact. Right click > Modify... > font size = 9. Alter heading and text styles in the same way as desired. Save this as template.docx. Upload to your project folder. Add this reference to the .Rmd YAML heading, as below. Make sure you get the indentation correct, as YAML is sensitive to it.

The plot also doesn’t look quite right and it prints with warning messages. Experiment with fig.width to get it looking right.

Now paste this into your .Rmd file and run:

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "21/5/2018"
output:
  word_document:
    reference_docx: template.docx  
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10}
colon_s %>% 
  or_plot(dependent, explanatory)
```

This is now looking good for me, and further tweaks can be made.

To PDF via knitr/R Markdown

Default settings for PDF:

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "21/5/2018"
output:
  pdf_document: default
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE}
colon_s %>% 
  or_plot(dependent, explanatory)
```

Again, ok but not great.

We can fix the plot in exactly the same way. But the table is off the side of the page. For this we use the kableExtra package. Install this in the normal manner. You may also want to alter the margins of your page using geometry in the preamble.

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "21/5/2018"
output:
  pdf_document: default
geometry: margin=0.75in
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
library(kableExtra)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"),
  booktabs=TRUE)
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"),
  booktabs=TRUE) %>% 
  kable_styling(font_size=8)
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10}
colon_s %>% 
  or_plot(dependent, explanatory)
```

This is now looking pretty good for me as well.
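An alternative to fixing the font size, sketched here on the assumption that you are knitting to PDF via LaTeX, is to let kableExtra shrink a wide table to the text width with latex_options = "scale_down". The data frame below is a hypothetical stand-in for table2.

```r
library(knitr)
library(kableExtra)

# Hypothetical stand-in for a wide finalfit table
wide_table = data.frame(
  label  = c("Age (years)", "Sex"),
  levels = c("Mean (SD)", "Male"),
  OR     = c("1.00 (0.99-1.01, p=0.986)", "0.98 (0.76-1.27, p=0.889)"))

# format = "latex" is set explicitly so this also runs outside a PDF knit
latex_out = kable(wide_table, format = "latex", row.names = FALSE,
    booktabs = TRUE) %>%
  kable_styling(latex_options = "scale_down")
latex_out
```

Scaling shrinks everything in the table by the same factor, so very wide tables can end up with small text; a fixed font_size gives more predictable results.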

There you have it. A pretty quick workflow to get final results into Word and a PDF.

Source: Blog

Elegant regression results tables and plots in R: the finalfit package

The finalfit package brings together the day-to-day functions we use to generate final results tables and plots when modelling. I spent many years repeatedly manually copying results from R analyses and built these functions to automate our standard healthcare data workflow. It is particularly useful when undertaking a large study involving multiple different regression analyses. When combined with R Markdown, the reporting becomes entirely automated. Its design follows Hadley Wickham’s tidy tool manifesto.

Installation and Documentation

It lives on GitHub.

You can install finalfit from github with:

# install.packages("devtools")
devtools::install_github("ewenharrison/finalfit")

It is recommended that this package is used together with dplyr, which is a dependency.

Some of the functions require rstan and boot. These have been left as Suggests rather than Depends to avoid unnecessary installation. If needed, they can be installed in the normal way:

install.packages("rstan")
install.packages("boot")

To install off-line (or in a Safe Haven), download the zip file and use devtools::install_local().

Main Features

1. Summarise variables/factors by a categorical variable

summary_factorlist() is a wrapper used to aggregate any number of explanatory variables by a single variable of interest. This is often “Table 1” of a published study. When categorical, the variable of interest can have a maximum of five levels. It uses Hmisc::summary.formula().

library(finalfit)
library(dplyr)

# Load example dataset, modified version of survival::colon
data(colon_s)

# Table 1 - Patient demographics by variable of interest ----
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor" # Bowel perforation
colon_s %>%
	summary_factorlist(dependent, explanatory, p=TRUE, add_dependent_label=TRUE)

See other options relating to inclusion of missing data, mean vs. median for continuous variables, column vs. row proportions, include a total column etc.

summary_factorlist() is also commonly used to summarise any number of variables by an outcome variable (say dead yes/no).

# Table 2 - 5 yr mortality ----
explanatory = c("age", "age.factor", "sex.factor",
                "obstruct.factor")
dependent = 'mort_5yr'
colon_s %>%
	summary_factorlist(dependent, explanatory, p=TRUE, add_dependent_label=TRUE)

Tables can be knitted to PDF, Word or html documents. We do this in RStudio from a .Rmd document. Example chunk:

```{r, echo = FALSE, results='asis'}
knitr::kable(example_table, row.names=FALSE, 
    align=c("l", "l", "r", "r", "r", "r"))
```

2. Summarise regression model results in final table format

The second main feature is the ability to create final tables for linear (lm()), logistic (glm()), hierarchical logistic (lme4::glmer()) and Cox proportional hazards (survival::coxph()) regression models.

The finalfit() “all-in-one” function takes a single dependent variable with a vector of explanatory variable names (continuous or categorical variables) to produce a final table for publication including summary statistics, univariable and multivariable regression analyses. The first columns are those produced by summary_factorlist(). The appropriate regression model is chosen on the basis of the dependent variable type and other arguments passed.

Logistic regression: glm()

Of the form: glm(dependent ~ explanatory, family="binomial")

explanatory = c("age.factor", "sex.factor", 
                "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
    finalfit(dependent, explanatory)
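To make the "of the form" line concrete, the equivalent direct call can be sketched in base R. This only illustrates the model being fitted, not finalfit's internals, and it uses the built-in mtcars data with arbitrarily chosen variables rather than colon_s.

```r
# Built-in data; am (transmission, 0/1) stands in for a binary outcome
dependent = "am"
explanatory = c("wt", "hp")

# Build dependent ~ explanatory1 + explanatory2 from the character vectors
model_formula = as.formula(
  paste(dependent, "~", paste(explanatory, collapse = " + ")))
model_formula

fit = glm(model_formula, data = mtcars, family = "binomial")

# Odds ratios with Wald 95% confidence intervals
or_table = exp(cbind(OR = coef(fit), confint.default(fit)))
round(or_table, 2)
```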

Logistic regression with reduced model: glm()

Where a multivariable model contains a subset of the variables specified in the full univariable set, this can be specified.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
explanatory_multi = c("age.factor", "obstruct.factor")
dependent = 'mort_5yr'
colon_s %>%
    finalfit(dependent, explanatory, explanatory_multi)

Mixed effects logistic regression: lme4::glmer()

Of the form: lme4::glmer(dependent ~ explanatory + (1 | random_effect), family="binomial")

Hierarchical/mixed effects/multilevel logistic regression models can be specified using the argument random_effect. At the moment it is just set up for random intercepts, i.e. (1 | random_effect), but in the future I’ll adjust this to accommodate random gradients if needed, i.e. (variable1 | variable2).

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
explanatory_multi = c("age.factor", "obstruct.factor")
random_effect = "hospital"
dependent = 'mort_5yr'
colon_s %>%
    finalfit(dependent, explanatory, explanatory_multi, random_effect)
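For comparison, a random-intercept logistic model fitted directly with lme4 can be sketched as follows. It uses the cbpp data shipped with lme4, where herd plays the role that hospital plays above; nothing here is specific to finalfit.

```r
library(lme4)

# cbpp ships with lme4; herd is the grouping variable
fit = glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = "binomial")

# Fixed effects on the odds ratio scale
or_fixed = exp(fixef(fit))
round(or_fixed, 2)

# Estimated standard deviation of the random intercept
VarCorr(fit)

# A random gradient would use (variable | group) in place of (1 | group)
```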

Cox proportional hazards: survival::coxph()

Of the form: survival::coxph(dependent ~ explanatory)

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
    finalfit(dependent, explanatory)
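As a point of reference, the direct survival::coxph() call and the hazard ratios it yields can be sketched like this. It uses the lung data shipped with the survival package and two arbitrarily chosen covariates, not colon_s.

```r
library(survival)

# lung ships with the survival package (not the colon_s data used above)
fit = coxph(Surv(time, status) ~ age + sex, data = lung)

# Hazard ratios with 95% confidence intervals
hr_table = summary(fit)$conf.int[, c("exp(coef)", "lower .95", "upper .95")]
round(hr_table, 2)
```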

Add common model metrics to output

metrics=TRUE provides common model metrics. The output is a list of two dataframes. Note chunk specification for output below.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
  finalfit(dependent, explanatory, metrics=TRUE) -> table7

```{r, echo=FALSE, results="asis"}
knitr::kable(table7[[1]], row.names=FALSE, align=c("l", "l", "r", "r", "r"))
knitr::kable(table7[[2]], row.names=FALSE)
```

Rather than going all-in-one, any number of subset models can be manually added on to a summary_factorlist() table using finalfit_merge(). This is particularly useful when models take a long time to run or are complicated.

Note the requirement for fit_id=TRUE in summary_factorlist(). fit2df() extracts, condenses, and adds metrics to supported models.

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
explanatory_multi = c("age.factor", "obstruct.factor")
random_effect = "hospital"
dependent = 'mort_5yr'

# Separate tables
colon_s %>%
	summary_factorlist(dependent, explanatory, fit_id=TRUE) -> example.summary

colon_s %>%
	glmuni(dependent, explanatory) %>%
	fit2df(estimate_suffix=" (univariable)") -> example.univariable

colon_s %>%
	glmmulti(dependent, explanatory) %>%
	fit2df(estimate_suffix=" (multivariable)") -> example.multivariable

colon_s %>%
	glmmixed(dependent, explanatory, random_effect) %>%
	fit2df(estimate_suffix=" (multilevel)") -> example.multilevel

# Pipe together
example.summary %>%
	finalfit_merge(example.univariable) %>%
	finalfit_merge(example.multivariable) %>%
	finalfit_merge(example.multilevel) %>%
	select(-c(fit_id, index)) %>% # remove unnecessary columns
	dependent_label(colon_s, dependent, prefix="") # place dependent variable label

Bayesian logistic regression: with stan

Our own particular rstan models are supported and will be documented in the future. Broadly, if you are running (hierarchical) logistic regression models in [Stan](http://mc-stan.org/users/interfaces/rstan) with coefficients specified as a vector labelled beta, then fit2df() will work directly on the stanfit object, in the same way as it does on a glm or glmerMod object.

3. Summarise regression model results in plot

Models can be summarized with odds ratio/hazard ratio plots using or_plot, hr_plot and surv_plot.

OR plot

# OR plot
explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
  or_plot(dependent, explanatory)
# Previously fitted models (`glmmulti()` or `glmmixed()`) can be provided directly to `glmfit`

HR plot

# HR plot
explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
  hr_plot(dependent, explanatory, dependent_label = "Survival")
# Previously fitted models (`coxphmulti`) can be provided directly using `coxfit`

Kaplan-Meier survival plots

KM plots can be produced using the survminer package.

# KM plot
explanatory = c("perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
	surv_plot(dependent, explanatory, xlab="Time (days)", pval=TRUE, legend="none")

Notes

Use Hmisc::label() to assign labels to variables for tables and plots.

Hmisc::label(colon_s$age.factor) = "Age (years)"

Export dataframe tables directly or to R Markdown knitr::kable().

Note wrapper summary_missing() is also useful. Wraps mice::md.pattern.

colon_s %>%
  summary_missing(dependent, explanatory)
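A rough base R equivalent for a first look at missingness is sketched below, using the built-in airquality data rather than colon_s. The call to mice::md.pattern() at the end is the function mentioned above and only runs if mice is installed.

```r
# airquality is a built-in dataset with missing values in two columns
missing_counts = colSums(is.na(airquality))
missing_counts

# Number of rows that would survive a complete-case analysis
n_complete = sum(complete.cases(airquality))
n_complete

# The underlying mice function can also be called directly, if mice is installed
if (requireNamespace("mice", quietly = TRUE)) {
  mice::md.pattern(airquality, plot = FALSE)
}
```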

Development will be on-going, but any input appreciated.

Source: Blog

P-values from random effects linear regression models

lme4::lmer is a useful frequentist approach to hierarchical/multilevel linear regression modelling. For good reason, the model output only includes t-values and doesn’t include p-values (partly due to the difficulty in estimating the degrees of freedom, as discussed here).

Yes, p-values are evil and we should continue to try and expunge them from our analyses. But I keep getting asked about this. So here is a simple bootstrap method to generate two-sided parametric p-values on the fixed effects coefficients. Interpret with caution.

library(lme4)

# Run model with lme4 example data
fit = lmer(angle ~ recipe + temp + (1|recipe:replicate), cake)

# Model summary
summary(fit)

# lme4 profile method confidence intervals
confint(fit)

# Bootstrapped parametric p-values
boot.out = bootMer(fit, fixef, nsim=1000) #nsim determines p-value decimal places
p = rbind(
  (1-apply(boot.out$t<0, 2, mean))*2,
  (1-apply(boot.out$t>0, 2, mean))*2)
apply(p, 2, min)

# Alternative "pipe" syntax
library(magrittr)

lmer(angle ~ recipe + temp + (1|recipe:replicate), cake) %>%
  bootMer(fixef, nsim=100) %$%
  rbind(
  (1-apply(t<0, 2, mean))*2,
  (1-apply(t>0, 2, mean))*2) %>%
  apply(2, min)
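The calculation applied to each column of bootstrap draws can be written out as a small function, which may make the logic easier to follow. The function name boot_p and the toy values are hypothetical and only there for illustration.

```r
# Two-sided bootstrap p-value for one coefficient, mirroring the calculation above:
# twice the smaller of the proportions of draws on either side of zero
boot_p = function(draws) {
  2 * min(mean(draws >= 0), mean(draws <= 0))
}

# Toy draws (hypothetical values): one of four lies below zero
toy = c(-1, 1, 2, 3)
boot_p(toy)   # 2 * min(3/4, 1/4) = 0.5

# With nsim draws the smallest non-zero p-value is 2 / nsim
nsim = 1000
smallest_p = 2 / nsim
smallest_p    # 0.002
```

This is why nsim determines the number of decimal places: reporting p<0.001 would need more than 2000 simulations.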

 

Prediction is very difficult, especially about the future

As Niels Bohr, the Danish physicist, put it, “prediction is very difficult, especially about the future”. Prognostic models are commonplace and seek to help patients and the surgical team estimate the risk of a specific event, for instance, the recurrence of disease or a complication of surgery. “Decision-support tools” aim to help patients make difficult choices, with the most useful providing personalized estimates to assist in balancing the trade-offs between risks and benefits. As we enter the world of precision medicine, these tools will become central to all our practice.

In the meantime, there are limitations. Overwhelming evidence shows that the quality of reporting of prediction model studies is poor. In some instances, the details of the actual model are considered commercially sensitive and are not published, making the assessment of the risk of bias and potential usefulness of the model difficult.

In this edition of HPB, Beal and colleagues aim to validate the American College of Surgeons National Quality Improvement Program (ACS NSQIP) Surgical Risk Calculator (SRC) using data from 854 gallbladder cancer and extrahepatic cholangiocarcinoma patients from the US Extrahepatic Biliary Malignancy Consortium. The authors conclude that the “estimates of risk were variable in terms of accuracy and generally calculator performance was poor”. The SRC underpredicted the occurrence of all examined end-points (death, readmission, reoperation and surgical site infection) and discrimination and calibration were particularly poor for readmission and surgical site infection. This is not the first report of predictive failures of the SRC. Possible explanations cited previously include small sample size, homogeneity of patients, and too few institutions in the validation set. That does not seem to be the case in the current study.

The SRC is a general-purpose risk calculator and while it may be applicable across many surgical domains, it should be used with caution in extrahepatic biliary cancer. It is not clear why the calculator does not provide measures of uncertainty around estimates. This would greatly help patients interpret its output and would go a long way to addressing some of the broader concerns around accuracy.

Source: Blog

Radical but conservative liver surgery

Cutting-edge liver surgery is often associated with modern technology such as the robot. In this edition of HPB, Torzilli and colleagues provide a fascinating account of 12 years of “radical but conservative” open liver surgery.

This is extreme parenchymal-sparing hepatectomy (PSH) in 169 patients with colorectal liver metastases. In all cases, tumour was touching or infiltrating portal pedicles or hepatic veins, a situation where most surgeons would advocate a major hepatectomy where possible. The PSH by its nature results in a 0 mm resection margin when the vessel is preserved, which was the aim in many of these procedures. Although this is off-putting, the cut-edge recurrence rate was no higher than average.

PSH in the form of “easy atypicals” is performed by all HPB surgeons. There are two main differences here. First is the aim to detach tumours from intrahepatic vascular structures. For instance, hepatic veins in contact with tumour were preserved and only resected if infiltrated. Even then, they were tangentially incised if possible and reconstructed with a bovine pericardial patch. Second is the careful attention paid to identifying and using communicating hepatic veins. This is well described but used extensively here to allow complete resection of segments while avoiding congestion in the draining region.

Short-term mortality and morbidity rates are comparable with other published series. A median survival of 36 months and 5-year overall survival of around 30% is reasonable given some of these patients may not be offered surgery in certain centres. The authors describe the parenchymal sparing approach “failing” in 14 (10%) patients: 7 (5%) had recurrence at the cut edge and 8 (6%) within segments which would have been removed using a standard approach. 44% of the 55 patients with liver-only recurrence underwent re-resection.

This is not small surgery. The average operating time is 8.5 h with the longest taking 18.5 h. The 66% thoracotomy rate is also notable in an era of minimally invasive surgery and certainly differs from my own practice. This study is challenging and I look forward to the debates that should arise from it.

Effect of day of the week on mortality after emergency general surgery

Our latest paper published in the BJS describes short- and long-term outcomes after emergency surgery in Scotland. We looked for a weekend effect and didn’t find one.

  • In around 50,000 emergency general surgery patients, we didn’t find an association between day of surgery or day of admission and death rates;
  • In around 100,000 emergency surgery patients including orthopaedic and gynaecology procedures, we didn’t find an association between day of surgery or day of admission and death rates;
  • In around 500,000 emergency and planned surgery patients, we didn’t find an association between day of surgery or day of admission and death rates.

We also found that emergency surgery performed at weekends, or in those admitted at weekends, was performed a little quicker compared with weekdays.

More details can be found here:

Effect of day of the week on short- and long-term mortality after emergency general surgery
http://onlinelibrary.wiley.com/doi/10.1002/bjs.10507/full
