Elegant regression results tables and plots in R: the finalfit package

This post was originally published here

The finalfit package brings together the day-to-day functions we use to generate final results tables and plots when modelling. I spent many years repeatedly manually copying results from R analyses and built these functions to automate our standard healthcare data workflow. It is particularly useful when undertaking a large study involving multiple different regression analyses. When combined with RMarkdown, the reporting becomes entirely automated. Its design follows Hadley Wickham’s tidy tools manifesto.

Installation and Documentation

It lives on GitHub.

You can install finalfit from GitHub with:
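
A minimal sketch, assuming the repository is ewenharrison/finalfit:

# install.packages("devtools") # if not already installed
devtools::install_github("ewenharrison/finalfit")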

It is recommended that this package is used together with dplyr, which is installed as a dependency.

Some of the functions require rstan and boot. These have been left as Suggests rather than Depends to avoid unnecessary installation. If needed, they can be installed in the normal way:
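
For example:

install.packages("rstan")
install.packages("boot")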

To install off-line (or in a Safe Haven), download the zip file and use devtools::install_local().
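
For example, assuming the downloaded file is finalfit-master.zip in the working directory:

devtools::install_local("finalfit-master.zip")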

Main Features

1. Summarise variables/factors by a categorical variable

summary_factorlist() is a wrapper used to aggregate any number of explanatory variables by a single variable of interest. This is often “Table 1” of a published study. When categorical, the variable of interest can have a maximum of five levels. It uses Hmisc::summary.formula().
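
A minimal sketch using colon_s, the example dataset that ships with the package:

library(finalfit)
library(dplyr)

# Summarise explanatory variables by a categorical variable of interest
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"
colon_s %>%
  summary_factorlist(dependent, explanatory, p = TRUE)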

See the other options relating to the inclusion of missing data, mean vs. median for continuous variables, column vs. row proportions, adding a total column, etc.

summary_factorlist() is also commonly used to summarise any number of variables by an outcome variable (say dead yes/no).

Tables can be knitted to PDF, Word or HTML documents. We do this in RStudio from a .Rmd document. Example chunk:
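
A sketch of such a chunk, assuming the summary table has been saved to an object called table1:

```{r, echo = FALSE, results = "asis"}
knitr::kable(table1, row.names = FALSE)
```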

2. Summarise regression model results in final table format

The second main feature is the ability to create final tables for linear (lm()), logistic (glm()), hierarchical logistic (lme4::glmer()) and
Cox proportional hazards (survival::coxph()) regression models.

The finalfit() “all-in-one” function takes a single dependent variable with a vector of explanatory variable names (continuous or categorical variables) to produce a final table for publication including summary statistics, univariable and multivariable regression analyses. The first columns are those produced by summary_factorlist(). The appropriate regression model is chosen on the basis of the dependent variable type and other arguments passed.

Logistic regression: glm()

Of the form: glm(dependent ~ explanatory, family="binomial")
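
For example, using colon_s as above:

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "mort_5yr"
colon_s %>%
  finalfit(dependent, explanatory)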

Logistic regression with reduced model: glm()

Where a multivariable model contains a subset of the variables specified in the full univariable set, this can be specified.
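
A sketch, with the reduced multivariable set passed as explanatory_multi:

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
explanatory_multi = c("age.factor", "obstruct.factor")
dependent = "mort_5yr"
colon_s %>%
  finalfit(dependent, explanatory, explanatory_multi)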

Mixed effects logistic regression: lme4::glmer()

Of the form: lme4::glmer(dependent ~ explanatory + (1 | random_effect), family="binomial")

Hierarchical/mixed effects/multilevel logistic regression models can be specified using the argument random_effect. At the moment it is just set up for random intercepts (i.e. (1 | random_effect)), but in the future I’ll adjust this to accommodate random gradients if needed (i.e. (variable1 | variable2)).
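
For example, assuming a grouping variable called hospital exists in the data:

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "mort_5yr"
colon_s %>%
  finalfit(dependent, explanatory, random_effect = "hospital")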

Cox proportional hazards: survival::coxph()

Of the form: survival::coxph(dependent ~ explanatory)
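
The dependent variable is supplied as a Surv object, e.g. (assuming survival time and status columns in the data):

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "Surv(time, status)"
colon_s %>%
  finalfit(dependent, explanatory)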

Add common model metrics to output

metrics=TRUE provides common model metrics. The output is a list of two dataframes. Note the chunk specification for this output below.
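
For example (variables as above):

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "mort_5yr"
colon_s %>%
  finalfit(dependent, explanatory, metrics = TRUE)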

Rather than going all-in-one, any number of subset models can be manually added on to a summary_factorlist() table using finalfit_merge(). This is particularly useful when models take a long time to run or are complicated.

Note the requirement for fit_id=TRUE in summary_factorlist(). fit2df() extracts, condenses, and adds metrics to supported models.
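
A sketch of the manual merge workflow (variables as above):

# Summary table, with fit_id = TRUE to allow merging
colon_s %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) -> example.summary

# Univariable and multivariable models, condensed with fit2df()
colon_s %>%
  glmuni(dependent, explanatory) %>%
  fit2df(estimate_suffix = " (univariable)") -> example.univariable

colon_s %>%
  glmmulti(dependent, explanatory) %>%
  fit2df(estimate_suffix = " (multivariable)") -> example.multivariable

# Merge the model results on to the summary table
example.summary %>%
  finalfit_merge(example.univariable) %>%
  finalfit_merge(example.multivariable)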

Bayesian logistic regression: Stan

Our own particular rstan models are supported and will be documented in the future. Broadly, if you are running (hierarchical) logistic regression models in Stan (http://mc-stan.org/users/interfaces/rstan) with coefficients specified as a vector labelled beta, then fit2df() will work directly on the stanfit object in a similar manner to a glm or glmerMod object.

3. Summarise regression model results in plot

Models can be summarised with odds ratio/hazard ratio plots using or_plot(), hr_plot() and surv_plot().

OR plot
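
For example (variables as above):

explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "mort_5yr"
colon_s %>%
  or_plot(dependent, explanatory)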

HR plot
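
Similarly for a Cox model, with a Surv object as the dependent variable:

dependent = "Surv(time, status)"
colon_s %>%
  hr_plot(dependent, explanatory)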

Kaplan-Meier survival plots

KM plots can be produced using the survminer package:
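
A sketch, assuming a grouping factor such as rx.factor exists in the data (any categorical variable works):

library(survminer)
dependent = "Surv(time, status)"
colon_s %>%
  surv_plot(dependent, explanatory = "rx.factor")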

Notes

Use Hmisc::label() to assign labels to variables for tables and plots.
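
For example:

library(Hmisc)
label(colon_s$sex.factor) <- "Sex"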

Export dataframe tables directly or to R Markdown using knitr::kable().

The wrapper summary_missing() is also useful; it wraps mice::md.pattern().

Development is ongoing, and any input is appreciated.

Install github package on safe haven server

This post was originally published here

I’ve had a few enquiries about how to install the summarizer package on a server without internet access, such as the NHS Safe Havens.

  1. Upload summarizer-master.zip from here to the server.
  2. Unzip.
  3. Run this:


library(devtools)
source = devtools:::source_pkg("summarizer-master")
install(source)

Edit

As per comments, devtools::install_local() has previously failed, but may now also work directly.
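
That is, assuming the zip has been downloaded to the working directory:

devtools::install_local("summarizer-master.zip")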


Islay distilleries in 3 days

This post was originally published here

Day 0 (Sunday 18-February 2018)

Left Edinburgh at 8am for a 1pm ferry Kennacraig to Port Askaig (Islay). Edinburgh-Kennacraig should be a 3.5h drive (and it was), but we left early to allow for any delays on the road. Arrived on Islay at 3pm and our accommodation near Port Ellen (southern Islay, close to Ardbeg, Lagavulin, Laphroaig) was a 40 min drive from the port.

Map of Islay with all its lovely distilleries.

Day 1 (Monday 19-February 2018): Ardbeg, Lagavulin, Laphroaig

We hadn’t booked anything other than the ferry and accommodation. February is very low season so we were right to think that no other advance bookings were necessary.

We had a lazy morning and drove to Laphroaig at about 11am. We asked which tours or tasting events were on that day and booked Einar onto the Layers of Laphroaig tasting at 3pm (as the driver, I was allowed to accompany him for free). We then drove to Lagavulin (just a few miles from Laphroaig) and booked us onto the tour at 1pm. We then drove to Ardbeg (another few miles) and had second breakfast at their cafe. Then drove back to Lagavulin for the tour, and then back to Laphroaig for the tasting.

Ardbeg’s epic cafe.

Waiting for the tour to begin at Lagavulin’s homey tasting room.

In Laphroaig’s tasting room: The Layers of Laphroaig introduced whiskies from different casks that make up their range of malts. These include ex-bourbon, virgin oak (I did not know Scotch could be matured in virgin casks – I thought it always had to be ex-something!), ex-sherry, ex-port. We were the only ones booked on this so it ended up being a private tasting.

Day 2 (Tuesday 20-February 2018): Kilchoman, Bruichladdich

Einar drove us to Kilchoman where I had a tasting of their 3 limited edition malts in the visitor centre. Kilchoman is a “farm-distillery” and they even grow some of their own barley. We bought a bottle of their “100% Islay” which is made from barley grown at the premises. Unfortunately, we completely forgot to take any pictures there. Must go back.

Driving on Islay.

We then went to Bruichladdich and booked me on the Warehouse Experience at 2pm. Similarly to Laphroaig, the driver was allowed to accompany for free. We had lunch at Port Charlotte while waiting for the event.

Bruichladdich warehouse experience

We then went by Bowmore (it was nearly 5pm) and asked about the different tours and experiences they had on the next day. Decided to do the “Bottle Your Own in the Vaults” first thing on Wednesday morning.

Day 3 (Wednesday 21-February 2018): Bowmore, Bunnahabhain, Ardnahoe, Caol Ila

Bottling a 17-year-old sherry cask beauty at Bowmore.

We then dropped by Bunnahabhain – no tours were running that day, but we were offered a few free tasters at the shop. On our way back from Bunnahabhain we took a picture at Ardnahoe (a new distillery that opens any day now).

Visiting Bunnahabhain and stopping at the soon-to-be-opened Ardnahoe.

The final distillery was Caol Ila where we went on the standard tour. The view in the stills room was just out of this world. They didn’t allow us to take pictures inside, so I took this from their website:

Caol Ila stills with a view of the Isle of Jura. Picture from: https://www.malts.com/en-row/distilleries/caol-ila/

Me outside Caol Ila with Jura in the background

What we brought back with us

In addition to whisky distilleries, we also visited a nano-brewery, and it turns out that The Botanist (a gin) is made at Bruichladdich.

Converting old WordPress posts to Hugo

This post was originally published here

Between 2014-2018 I published 29 posts on riinudata.wordpress.com. Today I’m converting all of those to my new website powered by blogdown-Hugo.

Step 1

Read the Migration: From WordPress chapter of the blogdown book.

Step 2

Get all your WordPress posts into one XML: WP Admin – Tools – Export.

Step 3

Install Exitwp and its dependencies (pyyaml, beautifulsoup4, html2text):

This worked on macOS¹ High Sierra – I already had Python installed.

Step 4

Working in the directory that git clone created (exitwp):

  • Put the WordPress XML in the wordpress-xml directory.
  • Run xmllint riinu_wordpress.xml; it worked the first time for me and I didn’t get any errors (so I’m not sure what “fix errors if there are any” would entail).
  • Back in the exitwp folder, run python exitwp.py
  • This created the folders build/jekyll/riinudata.wordpress.com/_posts containing the converted posts.
  • Move all these into exitwp/post folder.

Step 5

  • Take a copy of https://github.com/yihui/oldblog_xml/blob/master/convert.R to clean these .markdown files up and get them ready for Hugo. I edited the first three lines, skipped the “Do not run if…” chunk as I’d already done that in Step 3, edited the authors = c(), and did not run the very last chunk (local({if (!dir.exist...})).
  • Move all of the files (now .md) into content/post of your blogdown repo. Build and voila!

Further modifications

Looks like most of my posts were converted like a charm, with nicely formatted code blocks and images. But there are a few things I noticed that I’ll have to fix:

  • GitHub gists are now displayed as links; I will make those into code blocks (or embed them using a Hugo shortcode).
  • Most images show up perfectly, but some have got stuck in a code block, e.g. showing up as <img src="https://surgicalinformatics.org/wp-content/uploads/2018/02/rplot.png" alt="Rplot"/>. Will sort these out.

Overall I feared a lot worse and am super happy with the conversion experience. Took exactly 3 h.

My name is Hildegard and I approve this message.


  1. I’m only 1.5 years late to discover that OS X has been rebranded as macOS: https://www.wired.com/2016/06/apple-os-x-dead-long-live-macos/

Hello world: blogdown loves Hugo

This post was originally published here

We are live!

I wrote my last blog post on WordPress on 20-October 2017 and promised myself this was the last time. I’ve been blogging on WordPress since 2014 and the more I used it the more painful it got! This is most likely caused by the fact that I have been drifting further and further away from point-and-click interfaces anyway… oh, and discovering MARKDOWN.

My two rules:

So I finally got round to creating a blogdown-Hugo site:

Hugo is a website generator that is code-based (no more dragging around those pesky WordPress elements); blogdown is an R package that will help you generate Hugo, Jekyll, or Hexo sites, especially if you will be including R Markdown in it.

Steps on 12-February 2018:

  • Created a new blogdown project on RStudio, set kakawait/hugo-tranquilpeak-theme as the theme
  • Edited my name, email etc. information in the config.toml.
  • Absolutely could not figure out how to change coverImage = "cover.jpg". Tried putting my cover image in /static/img/, /static/_images/, source/assets/images and tried linking to these any way I could think of (e.g. with and without the first /) but it just wasn’t happening. Ended up putting my picture in /themes/hugo-tranquilpeak-theme/static/images/ and blatantly naming it cover.jpg (replacing the theme’s default photo). This worked.
  • Pushed the whole project to https://github.com/riinuots/hugo-tranquil-website and then created a submodule in https://github.com/riinuots/hugo-tranquil-website/tree/master/themes so when the theme gets updated I can pull the new version. This is not essential. I need to figure out the cover image issue though.
  • Set up Netlify as in https://bookdown.org/yihui/blogdown/netlify.html which was superquick but then spent some time troubleshooting why my theme wasn’t displaying properly. Turns out that for this theme, it is essential to set the baseURL = "https://riinu.netlify.com/" (in config.toml).
  • Created this Hello World post which seemed to work fine at first. I then added an unquoted colon to the title, broke everything, and spent 2 h trying to figure out what went wrong. These were the errors I was getting and that no-one else in the world (Google) seemed to have reported:
    • edits to the new post not happening, but the site isn’t broken either
    • clean_site() errors with:

rmarkdown::clean_site() Error in file.exists(files) : invalid 'file' argument

  • after spending 2 h on Google/GitHub/RStudio/rmarkdown, the blogdown book, the blogdown repo, and the Hugo documentation, I finally came across hugo -v (v for verbose) and noticed:

yaml: line 1: mapping values are not allowed in this context

(which I had indeed seen before at some point during these 2 hours). Anyway, seeing it for the second time clicked – markdown thinks I’m mapping something that shouldn’t be mapped (mapping usually means defining variables). My title was (second line of the markdown file, really) title: Hello world: blogdown loves Hugo, but a title that includes a colon needs quotes: title: "Hello world: blogdown loves Hugo".

Still better than WordPress.

Next steps:

  • Set up Disqus (comments).
  • Bring over old posts from https://riinudata.wordpress.com
  • Write all the new posts ideas I’ve been gathering over the past 4 months.

P-values from random effects linear regression models

This post was originally published here

lme4::lmer() is a useful frequentist approach to hierarchical/multilevel linear regression modelling. For good reason, the model output only includes t-values and doesn’t include p-values (partly due to the difficulty in estimating the degrees of freedom, as discussed here).

Yes, p-values are evil and we should continue to try and expunge them from our analyses. But I keep getting asked about this. So here is a simple bootstrap method to generate two-sided parametric p-values on the fixed effects coefficients. Interpret with caution.

library(lme4)

# Run model with lme4 example data
fit = lmer(angle ~ recipe + temp + (1|recipe:replicate), cake)

# Model summary
summary(fit)

# lme4 profile method confidence intervals
confint(fit)

# Bootstrapped parametric p-values
boot.out = bootMer(fit, fixef, nsim=1000) #nsim determines p-value decimal places 
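
# Two-sided p-value: for each fixed effect, take twice the smaller tail
# probability of the bootstrap distribution falling below/above zero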
p = rbind(
  (1-apply(boot.out$t<0, 2, mean))*2,
  (1-apply(boot.out$t>0, 2, mean))*2)
apply(p, 2, min)

# Alternative "pipe" syntax
library(magrittr)

lmer(angle ~ recipe + temp + (1|recipe:replicate), cake) %>% 
  bootMer(fixef, nsim=100) %$% 
  rbind(
  (1-apply(t<0, 2, mean))*2,
  (1-apply(t>0, 2, mean))*2) %>% 
  apply(2, min)

Prediction is very difficult, especially about the future

This post was originally published here

As Niels Bohr, the Danish physicist, put it, “prediction is very difficult, especially about the future”. Prognostic models are commonplace and seek to help patients and the surgical team estimate the risk of a specific event, for instance, the recurrence of disease or a complication of surgery. “Decision-support tools” aim to help patients make difficult choices, with the most useful providing personalized estimates to assist in balancing the trade-offs between risks and benefits. As we enter the world of precision medicine, these tools will become central to all our practice.

In the meantime, there are limitations. Overwhelming evidence shows that the quality of reporting of prediction model studies is poor. In some instances, the details of the actual model are considered commercially sensitive and are not published, making the assessment of the risk of bias and potential usefulness of the model difficult.

In this edition of HPB, Beal and colleagues aim to validate the American College of Surgeons National Quality Improvement Program (ACS NSQIP) Surgical Risk Calculator (SRC) using data from 854 gallbladder cancer and extrahepatic cholangiocarcinoma patients from the US Extrahepatic Biliary Malignancy Consortium. The authors conclude that the “estimates of risk were variable in terms of accuracy and generally calculator performance was poor”. The SRC underpredicted the occurrence of all examined end-points (death, readmission, reoperation and surgical site infection) and discrimination and calibration were particularly poor for readmission and surgical site infection. This is not the first report of predictive failures of the SRC. Possible explanations cited previously include small sample size, homogeneity of patients, and too few institutions in the validation set. That does not seem to be the case in the current study.

The SRC is a general-purpose risk calculator and while it may be applicable across many surgical domains, it should be used with caution in extrahepatic biliary cancer. It is not clear why the calculator does not provide measures of uncertainty around estimates. This would greatly help patients interpret its output and would go a long way to addressing some of the broader concerns around accuracy.
