Making sense of machine learning – how do we measure performance?

An exciting direction for the Surgical Informatics group is the application of machine learning models to clinical problems.

As we hear on a nearly daily basis, machine learning has loads to offer patients and clinicians. But how can we make these models understandable and, importantly, how do we check that these models are looking at what we’re interested in?

Currently, how well a diagnostic test performs is described by four main parameters (most students and clinicians will groan when they hear these words):

  • Sensitivity (how many people who have the condition are identified correctly)
  • Specificity (how many people who don’t have the condition are identified correctly)
  • Positive Predictive Value (how many times a test positive is a true positive)
  • Negative Predictive Value (how many times a test negative is a true negative)
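To make the four definitions above concrete, here is a minimal base R sketch. The counts are made up purely for illustration (they are not from any real study), and the variable names are my own.

```r
# Hypothetical counts from a 2x2 table of test result vs. true condition
# (illustrative numbers only)
tp <- 80   # true positives: have the condition, test positive
fn <- 20   # false negatives: have the condition, test negative
fp <- 30   # false positives: no condition, test positive
tn <- 870  # true negatives: no condition, test negative

sensitivity <- tp / (tp + fn)  # proportion of people with the condition identified
specificity <- tn / (tn + fp)  # proportion of people without the condition identified
ppv         <- tp / (tp + fp)  # proportion of test positives that are true positives
npv         <- tn / (tn + fn)  # proportion of test negatives that are true negatives

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 3)
```

Note that sensitivity and specificity are calculated down the "truth" columns, whereas the predictive values are calculated across the "test result" rows, which is why the predictive values change with how common the condition is.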

Now, interestingly the field of machine learning has evolved some separate parameters for measuring the usefulness of machine learning models:

  • Recall (synonymous with sensitivity)
  • Precision (synonymous with positive predictive value)

There are other measures too, including F1 score and accuracy. The issue with these metrics is that, although they are mathematically handy for describing models, they lack relevance to what is clinically important. For example, if a patient wants to know how often a test might give a false result, the F1 score (the harmonic mean of precision and recall) is going to be pretty useless.
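As a quick illustration of why F1 is hard to explain to a patient, here is a small base R sketch. The precision and recall values are invented for the example.

```r
# Hypothetical precision and recall for some model (illustrative values only)
precision <- 0.5
recall    <- 0.9

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
f1
```

The single number that comes out mixes two different questions (how many positives were found, and how many positive calls were right), so on its own it does not say how often the test gives a false result.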

Now, if we want to make a machine learning risk prediction model, we need a clinically relevant metric to allow model training to be measured and optimised. In Python, there are lots of functions for this; however, R is far more common in healthcare data analysis. At Surgical Informatics, we use Keras to interact with TensorFlow in R. Keras for R is far newer than Keras for Python, so there are fewer metric functions available.

Clinically, a model that predicts a specific event happening is more useful than one that rules it out, particularly if the event is serious (e.g. death). A recall metric would be perfect for this. However, there is no ready-made function available for recall in Keras for R.

So let’s make one!

Fortunately Keras provides us with functions to perform calculations on tensors such as k_sum, k_round and k_clip. This lets us manipulate Tensors using Keras and come up with custom metrics. You can find other backend keras functions here:

https://keras.rstudio.com/articles/backend.html#backend-functions.

So if recall is equal to the number of true positives divided by the number of true positives plus false negatives, we need to write a function to define these.

Now should we just add pp and tp? Unfortunately Keras doesn’t like this. So we use k_epsilon() to replace tp in the recall expression, to give:

And that should calculate the recall (or sensitivity) for the model!
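To show the arithmetic, here is a rough sketch in base R on plain numeric vectors rather than on Keras tensors. It is not the Keras metric itself: I am assuming that tp is the count of true positives and pp is the count of all actual positives (true positives plus false negatives), and the function name recall_sketch and the example vectors are made up. In a Keras metric, sum(), round() and the clipping step would be swapped for k_sum(), k_round() and k_clip(), and the small constant for k_epsilon().

```r
# Base R sketch of the recall arithmetic (an illustration, not the Keras metric)
recall_sketch <- function(y_true, y_pred, epsilon = 1e-7) {
  # Keep values within [0, 1], mirroring what k_clip(x, 0, 1) does
  clip01 <- function(x) pmin(pmax(x, 0), 1)

  # True positives: actual positives that are also predicted positive
  tp <- sum(round(clip01(y_true * y_pred)))

  # Assumption: pp counts all actual positives (true positives + false negatives)
  pp <- sum(round(clip01(y_true)))

  # The small epsilon avoids dividing by zero when there are no actual positives
  tp / (pp + epsilon)
}

# Made-up labels and predicted probabilities
y_true <- c(1, 1, 1, 0, 0, 1)
y_pred <- c(0.9, 0.2, 0.7, 0.6, 0.1, 0.8)

recall_sketch(y_true, y_pred)
```

Here three of the four actual positives are predicted with a probability above 0.5, so the result is very close to 0.75.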

Encryptr package: easily encrypt and decrypt columns of sensitive data

This post was originally published here

A number of existing R packages support data encryption. However, we haven’t found one that easily suits our needs: to encrypt one or many columns of a data frame or tibble using a private/public key pair in tidyverse functions. The emphasis is on the easily.

Encrypting and decrypting data securely is important when it comes to healthcare and sociodemographic data. We have developed a simple and secure package, encryptr, which allows non-experts to encrypt and decrypt columns of data.

There is a simple and easy-to-follow vignette available on our GitHub page which guides you through the process of using encryptr:

https://github.com/SurgicalInformatics/encryptr.

Confidential data – security challenges

Data containing columns of disclosive or confidential information such as a postcode or a patient ID (CHI in Scotland) require extreme care. Storing sensitive information as raw values leaves the data vulnerable to confidentiality breaches.

It is best to just remove confidential information from the records whenever possible. However, this can mean the data can never be re-associated with an individual. This may be a problem if, for example, auditors of a clinical trial need to re-identify an individual from the trial data.

One potential solution currently in common use is to generate a study number which is linked to the confidential data in a separate lookup table, but this still leaves the confidential data available in another file.

Encryptr package solution – storing encrypted data

The encryptr package allows users to store confidential data in a pseudoanonymised form, which is far less likely to result in re-identification.

The package allows users to create a public key and a private key to enable RSA encryption and decryption of the data. The public key allows encryption of the data. The private key is required to decrypt the data. The data cannot be decrypted with the public key. This is the basis of many modern encryption systems.

When creating keys, the user sets a password for the private key using a dialogue box. This means that the password is not included in an R script. We recommend creating a secure password with a variety of alphanumeric characters and symbols.

As the password is not stored, it is important that you are able to remember it if you need to decrypt the data later.

Once the keys are created it is possible to encrypt one or more columns of data in a data frame or tibble using the public key. Every time RSA encryption is used it will generate a unique output. Even if the same information is encrypted more than once, the output will always be different. It is not possible therefore to match two encrypted values.
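The behaviour described above can be seen directly with the underlying openssl package. This is only a sketch of the general idea, not the encryptr functions themselves: the key here is generated in memory without a password, and the "identifier" is a made-up string.

```r
library(openssl)

# Generate an RSA key pair in memory (no password, for illustration only)
key    <- rsa_keygen(bits = 2048)
pubkey <- key$pubkey

# A made-up identifier standing in for a confidential value
secret <- charToRaw("example-patient-id-001")

# Encrypt the same value twice with the public key
cipher_1 <- rsa_encrypt(secret, pubkey)
cipher_2 <- rsa_encrypt(secret, pubkey)

# The two encrypted outputs differ, so they cannot be matched to each other
identical(cipher_1, cipher_2)

# Only the private key recovers the original value
rawToChar(rsa_decrypt(cipher_1, key))
```

The outputs differ because random padding is added before each encryption, which is why two encrypted copies of the same value cannot be linked without the private key.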

These outputs are also secure from decryption without the private key. This may allow sharing of data within or between research teams without sharing confidential data.

Caution: data often remains potentially disclosive (or only pseudoanonymised) even after encryption of identifiable variables, and all of the required permissions for usage and sharing of data must still be in place.

Encryptr package – decrypting the data

Sometimes decrypting data is necessary. For example, participants in a clinical trial may need to be contacted to explain a change or early termination of the trial.

The encryptr package allows users to securely and reliably decrypt the data. The decrypt function will use the private key to decrypt one or more columns. The user will be required to enter the password created when the keys were generated.

As the private key is able to decrypt all of the data, we do not recommend sharing this key.

Blinding and unblinding clinical trials – another encryptr package use

Often when working with clinical trial data, the participants are randomised to one of two or more treatment groups. Often teams working on the trial are unaware of the group to which patients were randomised (blinded).

Using the same method of encryption, it is possible to encrypt the participant allocation group, allowing the sharing of data without compromising blinding. If other members of the trial team are permitted to see treatment allocation (unblinded), then the decryption process can be followed to reveal the group allocation.

What this is not

This is a simple set of wrappers of openssl aimed at non-experts. It does not seek to replace the many excellent encryption packages available in R, such as PKI, sodium and safer. We believe however that it makes things much easier. Comments and forks welcome.

Quick take-aways from RStudio::conf Training Day 02 (Part 2 – sparklyr)

It’s now a week since I returned from RStudio::conf 2019 in Austin, Texas, and in this blog I’m going to focus on using the sparklyr package (spark-lee-r), which enables R to connect to an Apache Spark instance for general purpose cluster-computing. sparklyr has its own inbuilt functions as well as allowing dbplyr to do all of the amazing things I described in my first blog post: https://surgicalinformatics.org/quick-take-aways-from-rstudioconf-training-day-02/. The code contained in this blog should work in your own local RStudio without any preconfigured cluster, should you wish to experiment with sparklyr’s capabilities.

Establishing a connection

The following example code will help set up a local connection in order to experiment with some of the functionality of the dbplyr package. This is really useful if you are waiting for data or access to a database so you can have pre-prepared scripts in progress without the remote database connection.
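A minimal sketch of setting up a local connection might look like the following. It assumes the sparklyr package is installed and that a local copy of Spark is available (spark_install() is a one-off download if not).

```r
library(sparklyr)

# One-off step if Spark has not been installed locally yet:
# spark_install()

# Connect to a local Spark instance; "sc" is the connection object
sc <- spark_connect(master = "local")

# TRUE if the connection is open
spark_connection_is_open(sc)
```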

The connection is typically stored as “sc” which you can also see in the Environment. This is the object that is referenced each time data is accessed in the spark cluster.

To check that the new connection to a Spark instance has been established, go to the Connections tab in your RStudio interface (this is typically located alongside your “Environment” and “History” tabs). Click on the Spark UI button to view the user interface for the current Spark session in a browser (this will be helpful later if you want to view an event log for the activity of your session). Another way to check if the connection is open is by using: spark_connection_is_open(sc). This should return “TRUE” if the connection is open.

Adding and manipulating data via the connection

Now that you have a connection established some data can be added to the spark cluster:

spark_flights becomes an object in the local environment but is really just a reference to the data in the spark cluster. Click on the Connections tab and you should see that “my_flights” is now a data frame stored in the cluster. The Spark UI page which opened in your browser will also now show some of the changes you have made. Click the Storage tab in the UI and you should see the data frame.

When manipulating the data, the reference to the data frame within the local environment can be treated as if the data were stored locally. One key difference is that the creation of new data frames is delayed until the last possible minute. The following example groups flights from the nycflights13 data frame flights and calculates the average delay by destination. Notice that the real computation happens only once the “average_delay” data frame is printed; the first command simply creates a reference in the local environment which stores your intended action. Also notice the “lazy” approach taken by sparklyr, in which the total number of rows is not returned and is replaced by “… with more rows”. If the full set of rows is needed, the collect function can be used:
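The steps described above could be sketched roughly as follows. This assumes a local Spark installation plus the sparklyr, dplyr and nycflights13 packages; the choice of dep_delay as "the delay" and the name local_delay are my own assumptions.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy the flights data into Spark; "my_flights" is its name in the cluster
spark_flights <- copy_to(sc, nycflights13::flights, "my_flights",
                         overwrite = TRUE)

# This only stores the intended transformation; nothing is computed yet
average_delay <- spark_flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE))

# Printing triggers the computation, but only a preview of rows is returned
average_delay

# collect() brings the full result into a local data frame
local_delay <- collect(average_delay)
nrow(local_delay)

spark_disconnect(sc)
```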

Caching data

Have a look at the Spark UI and check out the SQL tab. Click on one of the queries (highlighted in blue) to get a breakdown of the components for each task. Notice the difference between the queries: the one in which collect() was used takes a lot longer to execute than the “lazy” query which sparklyr uses by default. This is really useful if you want to leave the “heavy lifting” of data transformation right until the end. However, if you then want to use an intermediate data frame for several further transformations (for example sorting destinations by average delay, or only looking at destinations where the average departure was early), it might be useful to cache the data within the cluster so that the data is transformed only once. The downside to this approach may be additional memory requirements. The following code using compute() will cache the intermediate data frame:
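A rough sketch of caching with compute() is shown here, again assuming a local Spark installation and the nycflights13 data. The summary being cached (average departure delay per destination) is my own choice of example.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
spark_flights <- copy_to(sc, nycflights13::flights, "my_flights",
                         overwrite = TRUE)

# compute() runs the transformation once and stores the result in the
# cluster under the name "sub_flights"
cached_flights <- spark_flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  compute("sub_flights")

# Further transformations now start from the cached table
worst_first <- cached_flights %>% arrange(desc(avg_delay)) %>% collect()
early_only  <- cached_flights %>% filter(avg_delay < 0) %>% collect()

# Tables currently registered in the cluster
cluster_tables <- src_tbls(sc)

spark_disconnect(sc)
```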

Now you should be able to see the “sub_flights” data frame in the Connections tab, the Storage tab of the Spark UI and the SQL code generated in the SQL tab of the UI. The cached_flights reference should also appear in the Environment tab in RStudio.

Some extra functions

As well as working through dplyr and dbplyr, sparklyr also comes with its own functions for data analysis and transformation, which may be particularly useful when setting up pipelines you plan to execute later. A couple of useful examples are the ft_binarizer and ft_bucketizer commands, which I use to determine destinations that are on average over 10 minutes delayed and then to group by distance:
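One way these two functions might be used is sketched below, assuming a local Spark installation and the nycflights13 data. The column names delayed and distance_band and the distance cut points are invented for the example.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
spark_flights <- copy_to(sc, nycflights13::flights, "my_flights",
                         overwrite = TRUE)

delays <- spark_flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(dest) %>%
  summarise(avg_delay    = mean(dep_delay, na.rm = TRUE),
            avg_distance = mean(distance, na.rm = TRUE)) %>%
  # 1 if the average delay is over 10 minutes, otherwise 0
  ft_binarizer(input_col = "avg_delay", output_col = "delayed",
               threshold = 10) %>%
  # Band destinations by average distance (cut points are arbitrary)
  ft_bucketizer(input_col = "avg_distance", output_col = "distance_band",
                splits = c(0, 500, 1500, Inf)) %>%
  collect()

spark_disconnect(sc)

head(delays)
```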

These functions can be combined with others such as sdf_partition, sdf_pivot and sdf_register to prepare a data set for predictive modelling. sparklyr has its own inbuilt functions for logistic regression (ml_logistic_regression), predictive modelling (sdf_predict) and even some dedicated natural language processing techniques (ft_tokenizer, ft_stop_words_remover).

To finish the session, close down the connection using spark_disconnect(sc).

The connection should now be terminated meaning the Spark UI will no longer be accessible and the connections tab has changed. Should you wish to work with any data frames or aggregated results following the disconnect then make sure to use collect() and create a new object before disconnecting.

Quick take-aways from RStudio::conf Training Day 02

For the past few days I’ve been in Austin, Texas with Stephen Knight and Riinu Ots representing the Surgical Informatics Group at RStudio::conf 2019. The conference brings together nearly 2000 data scientists, developers and a couple of surgeons to learn the latest best practice and best approaches when programming with R.

I have attended the Big Data workshop. “Big Data” is a bit of a vague term but it can be helpful to think of Big Data as one of two groups: data that is just so big that you can’t open it on your own computer (imagine opening one of those massive files that just crashes your computer) or data that is stored somewhere remotely and accessed through your computer by a slow connection.

The key principles for handling Big Data effectively include:

  1. Safe storage (often a data administrator sorts this)
  2. Safe access (password protected in many cases with care to avoid publishing passwords in R scripts)
  3. Getting the database itself to do the heavy work whilst leaving R to do the statistical analysis and plotting we know it does best
  4. Leave the data transformation until the latest possible minute
  5. Access the database as few times as possible

Today I’ll focus briefly on safe access, getting the database to do all of the heavy lifting and leaving the transformation to the last possible minute.

Safe Access

Using an R script to access a remote database usually requires credentials. R needs to know where the data is stored and the database needs to know whether it can allow you to access the data. There are lots of different ways to set up a connection to a database but the DBI package and the odbc package are going to come in very handy. You might also need to install a package which supports the driver for the type of database.

There are loads of options when connecting to databases and securing credentials but it’s key to avoid posting critical information like passwords in plain text, for example:

con <- dbConnect(
  odbc::odbc(),
  Driver   = "PostgreSQL",
  Server   = "localhost",
  UID      = "myusername",
  PWD      = "my_unsecure_password",
  Port     = 5432,
  Database = "postgres"
)

Best Solution for Securing Credentials

The most secure option for connecting to a database involves using a Data Source Name (DSN) although this does require some pre-configuration and the ability to perform the following:

1. Establish integrated security between the terminal and the database, usually via Kerberos.

2. Pre-configure an ODBC connection in the Desktop or Server (requires sufficient access rights). The ODBC connection will have a unique Data Source Name, or DSN.

For example:

con <- DBI::dbConnect(odbc::odbc(), "My DSN Name")

Easier alternatives

It is still possible to connect securely to a database using either of the following techniques:

con <- dbConnect(
  odbc::odbc(),
  Driver   = "PostgreSQL",
  Server   = "localhost",
  UID      = rstudioapi::askForPassword("Database user"),
  PWD      = rstudioapi::askForPassword("Database password"),
  Port     = 5432,
  Database = "postgres"
)

This will require a predefined username and password which the user is prompted to enter on setting up the connection.

Finally, another option which may be easier for those without a data administrator is to use the config package and create a yml file. After creating the yml file it can be used to configure the connection without directly publishing credentials, in particular the password. Instead, the password is retrieved from the yml file:

dw <- config::get("datawarehouse-dev")

con <- dbConnect(
  odbc::odbc(),
  Driver   = dw$driver,
  Server   = dw$server,
  UID      = dw$uid,
  PWD      = dw$pwd,
  Port     = dw$port,
  Database = dw$database
)
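To show what such a yml file might contain, here is a small sketch that writes an example file to a temporary location and reads it back with config::get(). All of the values are placeholders invented for the example; in practice the file would normally be called config.yml, sit in the project folder and be kept out of version control.

```r
library(config)

# Write an example yml file to a temporary location (placeholder values only)
cfg_file <- tempfile(fileext = ".yml")
writeLines(c(
  'default:',
  '  datawarehouse-dev:',
  '    driver: "PostgreSQL"',
  '    server: "localhost"',
  '    uid: "myusername"',
  '    pwd: "my_placeholder_password"',
  '    port: 5432',
  '    database: "postgres"'
), cfg_file)

# Read the connection settings back; the script itself never contains them
dw <- config::get("datawarehouse-dev", file = cfg_file)

dw$server
dw$port
```

The script that opens the connection then only refers to dw$uid, dw$pwd and so on, so it can be shared without sharing the credentials.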

Making the database do the work

Many physicians who work with R are quite happy to create data frames in the local environment, modify them there and then plot the data from there. This works very well with smaller data sets which don’t require much memory and even with medium-sized data sets (as long as you don’t try to open the whole file using View()!). When working with Big Data, the local environment often isn’t large enough because of limited RAM. To give an example, if you are using all of the UK Biobank genetic data, which amounts to over 12 terabytes, then the modest 8 gigabytes (1500 times less) of RAM I have on my own laptop just won’t do.

The solution is to manipulate the data remotely. Think bomb disposal. You want to have a screen to show you what a bomb disposal robot is doing and a way of controlling the robot, but you don’t want to be up close and personal with the robot doing the work. Big Data is exactly the same: you want to see the results of the data transformations and plotting on your own device but let the database do the work for you. Trying to bring the heavy lifting of the data onto your own device creates problems and may result in the computer or server crashing.

dplyr is a truly fantastic data manipulation package which is part of the tidyverse and makes everyday data manipulation for clinicians achievable, understandable and consistent. The great news is that when working with remote data, you can use dplyr! The package dbplyr (database plyer) converts the dplyr code into SQL which then runs in the remote database meaning a familiarity with dplyr is almost all that’s needed to handle Big Data.

The show_query() function demonstrates just how much work goes on under the hood of dbplyr:

flights %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  show_query()

Output:

Applying predicate on the first 100 rows
<SQL>
SELECT AVG("flightid") AS "flightid", AVG("year") AS "year", AVG("month") AS "month", AVG("dayofmonth") AS "dayofmonth", AVG("dayofweek") AS "dayofweek", AVG("deptime") AS "deptime", AVG("crsdeptime") AS "crsdeptime", AVG("arrtime") AS "arrtime", AVG("crsarrtime") AS "crsarrtime", AVG("flightnum") AS "flightnum", AVG("actualelapsedtime") AS "actualelapsedtime", AVG("crselapsedtime") AS "crselapsedtime", AVG("airtime") AS "airtime", AVG("arrdelay") AS "arrdelay", AVG("depdelay") AS "depdelay", AVG("distance") AS "distance", AVG("taxiin") AS "taxiin", AVG("taxiout") AS "taxiout", AVG("cancelled") AS "cancelled", AVG("diverted") AS "diverted", AVG("carrierdelay") AS "carrierdelay", AVG("weatherdelay") AS "weatherdelay", AVG("nasdelay") AS "nasdelay", AVG("securitydelay") AS "securitydelay", AVG("lateaircraftdelay") AS "lateaircraftdelay", AVG("score") AS "score"
FROM datawarehouse.flight

Leaving the Data Transformation to the last possible minute

dbplyr prevents unnecessary work occurring within the database until the user is explicit that they want some results. When modifying a data frame using a remote connection and dbplyr, the user can work with references to the remote data frame in the local environment. When performing and saving a data transformation, dbplyr saves a reference to the intended transformation (rather than actually transforming the data). Only when the user is explicit that they want to see some of the data will dbplyr let the database get busy and transform the data. The user can do this by plotting the data or by using the collect() function to print out the resulting data frame. Working with data in this way, by saving up a pipeline of intended transformations and only executing the transformation at the very end, is a much more efficient way of working with Big Data.
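This lazy behaviour can be tried without access to a remote database. The sketch below uses an in-memory SQLite database as a stand-in (my own substitution, assuming the DBI, RSQLite, dplyr and dbplyr packages are installed), with the built-in mtcars data and a made-up table name.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# An in-memory SQLite database stands in for the remote database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "cars_demo")

# A reference to the table in the database, not a local copy
cars_db <- tbl(con, "cars_demo")

# This only saves the intended transformation; no query has run yet
mean_mpg <- cars_db %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# Show the SQL that dbplyr has written on our behalf
show_query(mean_mpg)

# collect() runs the query in the database and returns a local data frame
result <- collect(mean_mpg)
result

dbDisconnect(con)
```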

Quick take-aways from RStudio::conf Training Day 01

This week, Riinu, Steve, and Cameron are attending the annual RStudio Workshops (Tue-Wed), Conference (Thu-Fri), and the tidyverse developer day (Sat) in Austin, Texas.

We won’t even try to summarise everything we’re learning here, since the content is vast and the learning is very much hands on, but we will be posting a small selection of some take-aways in this blog.

We’re all attending different workshops (Machine Learning, Big Data, Markdown&Shiny).

Interesting take-away no. 1: terminology

Classification means categorical outcome variable.

Regression means continuous (numeric) outcome variable.

This is a bit confusing when using logistic regression – which by this definition is “classification”, rather than “regression”. But it is very common machine learning terminology, and makes sense considering the wide range of different methods used for classification (so not just regression).

Interesting take-away no. 2: library(parsnip)

The biggest strength of R is how many different packages (=extensions) it has. Basically, if you can think of a statistical or machine learning method, it’s probably implemented in R. This is because a lot of R users are also R developers – if you find a method that you really want to use, but that hasn’t been implemented yet, you can just go on and implement it yourself. And then publish this new functionality as an R package that everyone can use.

However, this also means that different R packages sometimes do similar things using very different syntax. This is where the parsnip package comes to the rescue, providing a unified interface for using some of these modelling packages.

For example:

Instead of figuring out the syntax for lm() (basic linear regression model), and then for Stan, and Spark, and keras, set_engine() from library(parsnip) provides us with a unified interface for all of these different methods for linear regression.
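As a small sketch of what this looks like (using the built-in mtcars data rather than the course materials, with a formula chosen only for illustration):

```r
library(parsnip)

# Describe the model once: a linear regression...
lm_spec <- linear_reg()

# ...then choose which engine fits it. Swapping "lm" for another supported
# engine (e.g. "stan" or "keras") keeps the rest of the code the same,
# provided the corresponding package is installed.
lm_spec <- set_engine(lm_spec, "lm")

# Fit using the usual formula interface
lm_fit <- fit(lm_spec, mpg ~ wt + hp, data = mtcars)
lm_fit
```

The underlying lm object is still available as lm_fit$fit if you want to use functions you already know, such as summary().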

A fully working example can be found in the course materials (all publicly available):

https://github.com/topepo/rstudio-conf-2019/blob/master/Materials/Part_2_Basic_Principles.R

Interesting take-away no 3: Communication by a new means

The Rmarkdown workshop raised two interesting points within the first few minutes of starting – how prevalent communication by html has become (i.e. the internet, use of interactive documents and apps to relay industry and research data to colleagues and the wider community).

But, maybe more importantly, how little of this is understood by the general public, and how relatively easy it is to produce impressive interactivity with a few lines of code… followed by the question – how about that raise, boss??

For example, how using the package plotly can add immediate interactivity following on from all the ggplot basics learnt at healthyR:
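A minimal sketch of the idea, using the built-in mtcars data (my own choice of example, not the one shown in the workshop):

```r
library(ggplot2)
library(plotly)

# An ordinary static ggplot
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight", y = "Miles per gallon", colour = "Cylinders")

# One extra line turns it into an interactive plot with hover, zoom and pan
interactive_p <- ggplotly(p)
interactive_p
```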

 

And when you come across a website called “pimp my Rmarkdown” how can you not want to play!!!!

Interesting take-away no. 4: monitor progress with Viewer pane

Regular knitting, including at the start of an Rmd document to ensure any errors are highlighted early, is key. Your RMarkdown is a toddler who loves to misbehave. Previewing your document in a new window can take time and slow you down….

Frequent knitting into the Viewer pane can give you quick updates on how your code is behaving and identify bugs early!

By default, RStudio loads your document into a new window when the Knit button is hit. To load the preview into the Viewer pane instead, set the following:

Tools tab > Global Options > RMarkdown > Set “Output preview in” to Viewer pane

Rmarkdown hack of the day: New chunk shortcut

Control + Alt + I (Windows/Linux)
or
Cmd + Option + I (Mac)

Shinyfit: Advanced regression modelling in a shiny app

This post was originally published here

Many of our projects involve getting doctors, nurses, and medical students to collect data on the patients they are looking after. We want to involve many of them in data analysis, without the requirement for coding experience or access to statistical software. To achieve this we have built Shinyfit, a shiny app for linear, logistic, and Cox PH regression.

  • Aim: allow access to model fitting without requirement for statistical software or coding experience.
  • Audience: Those sharing datasets in context of collaborative research or teaching.
  • Hosting requirements: Basic R coding skills including tidyverse to prepare dataset (5-10 minutes).
  • Deployment: Any shiny platform, shinyapps.io, ShinyServer, RStudio Connect etc.

shinyfit uses our finalfit package.

Features

  • Univariable, multivariable and mixed effects linear, logistic, and Cox Proportional Hazards regression via a web browser.
  • Intuitive model building with option to include a reduced model and common metrics.
  • Coefficient, odds ratio, hazard ratio plots.
  • Cross tabulation across multiple variables with statistical comparisons.
  • Subset data by any included factor.
  • Dataset inspection functions.
  • Export tables to Word for publication or as a CSV for further analysis/plotting.
  • Easy to deploy with your own data.

Examples

argoshare.is.ed.ac.uk/shinyfit_colon
argoshare.is.ed.ac.uk/shinyfit_melanoma

Code

github.com/ewenharrison/shinyfit

Screenshots

Linear, logistic or CPH regression tables
Coefficient, odds ratio or hazard ratio plots
Crosstabs
Inspect dataset with ff_glimpse

Use your data

To use your own data, clone or download the app from GitHub.

  • Edit 0_prep.R to create a shinyfit_data object. 
  • Test the app, usually within RStudio.
  • Deploy to your shiny hosting platform of choice.
  • Ensure you have permission to share the data

Editing 0_prep.R is straightforward and takes about 5 mins. The main purpose is to create human-readable menu items and to allow sorting of variables into any categories, such as outcome and explanatory.

Errors in shinyfit are usually related to the underlying dataset, e.g.

  • Variables not appropriately specified as numerics or factors. 
  • A particular factor level is empty, thus regression function (lm, glm, coxph etc.) gives error.
  • A variable with >2 factor levels is used as an outcome/dependent. This is not supported.
  • Use Glimpse tabs to check data when any error occurs.
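Some of the checks above can also be run on the dataset before deploying. The following is a generic base R sketch, not part of shinyfit itself; the function name check_for_model and the tiny example data frame are made up.

```r
# Generic pre-flight checks on a data frame (hypothetical helper, base R only)
check_for_model <- function(data, outcome) {
  # Which columns are factors?
  is_fac <- vapply(data, is.factor, logical(1))

  # Factor columns that contain at least one empty level
  has_empty <- vapply(data[is_fac],
                      function(f) any(table(f) == 0),
                      logical(1))

  list(
    empty_levels   = names(has_empty)[has_empty],
    # Number of levels in the outcome (NA if the outcome is not a factor)
    outcome_levels = if (is.factor(data[[outcome]])) nlevels(data[[outcome]]) else NA_integer_
  )
}

# Made-up example: "Unknown" is a declared level of sex but never occurs
demo <- data.frame(
  sex  = factor(c("F", "M", "F"), levels = c("F", "M", "Unknown")),
  died = factor(c("No", "Yes", "No")),
  age  = c(50, 61, 47)
)

checks <- check_for_model(demo, outcome = "died")
checks
```

An empty level can usually be dropped with droplevels() before the data are passed to the app.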

It is fully mobile compliant, including datatables.

There will be bugs. Please report here
