An exciting direction for the Surgical Informatics group is the application of machine learning models to clinical problems.

As we hear on a nearly daily basis, machine learning has loads to offer patients and clinicians. But how can we make these models understandable and, importantly, how do we measure whether these models are looking at what we’re interested in?

Currently, how well a diagnostic test performs is described by four main parameters (most students and clinicians will groan when they hear these words):

- Sensitivity (how many people who have the condition are identified correctly)
- Specificity (how many people who don’t have the condition are identified correctly)
- Positive Predictive Value (how many times a test positive is a true positive)
- Negative Predictive Value (how many times a test negative is a true negative)
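As a worked example, all four parameters can be computed directly from a 2 × 2 confusion matrix in base R (the counts below are made up purely for illustration):

```r
# Hypothetical counts from a diagnostic test (illustrative numbers only)
tp <- 80   # test positive, condition present
fn <- 20   # test negative, condition present
fp <- 30   # test positive, condition absent
tn <- 870  # test negative, condition absent

sensitivity <- tp / (tp + fn)  # proportion of people with the condition correctly identified
specificity <- tn / (tn + fp)  # proportion of people without the condition correctly identified
ppv <- tp / (tp + fp)          # proportion of test positives that are true positives
npv <- tn / (tn + fn)          # proportion of test negatives that are true negatives
```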

Now, interestingly the field of machine learning has evolved some separate parameters for measuring the usefulness of machine learning models:

- Recall (synonymous with sensitivity)
- Precision (synonymous with positive predictive value)

There are other measures too, including the F1 score and accuracy. The issue with these metrics is that, although they are mathematically handy for describing models, they lack relevance to what is clinically important. For example, if a patient wants to know how often a test might give a false result, the F1 score (the harmonic mean of precision and recall) is going to be pretty useless.
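To make the relationship concrete, here is a minimal sketch in base R showing that precision and recall are the same calculations as PPV and sensitivity under different names, and how the F1 score combines them (counts are illustrative):

```r
# Illustrative counts
tp <- 80  # true positives
fp <- 30  # false positives
fn <- 20  # false negatives

precision <- tp / (tp + fp)  # identical to positive predictive value
recall    <- tp / (tp + fn)  # identical to sensitivity

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
```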

Now, if we want to make a machine learning risk prediction model, we need a clinically relevant metric to allow model training to be measured and optimised. In Python there are lots of functions for this; however, R is far more common in healthcare data analysis. At Surgical Informatics, we use Keras to interact with TensorFlow in R. Keras for R is far newer than Keras for Python, so there are fewer metric functions available.

Clinically, a model that predicts a specific event happening is more useful than one that rules it out, particularly if the event is serious (e.g. death). A recall metric would be perfect for this; however, there is no built-in recall metric available in Keras for R.

So let’s make one!

Fortunately, Keras provides us with functions to perform calculations on tensors, such as k_sum, k_round and k_clip. These let us manipulate tensors through the Keras backend and build custom metrics. You can find the other backend functions here:

https://keras.rstudio.com/articles/backend.html#backend-functions.

Recall is the number of true positives divided by the number of true positives plus false negatives (recall = TP / (TP + FN)), so we need to write code to calculate each of these quantities from tensors.

True positives:

```r
tp <- k_sum(k_round(k_clip(y_true * y_pred, 0, 1)))
```

Possible positives (everyone who actually has the condition, i.e. true positives plus false negatives):

```r
pp <- k_sum(k_round(k_clip(y_true, 0, 1)))
```

Now, can we just divide tp by pp? Unfortunately, Keras doesn’t like this: if pp is ever zero, we would be dividing by zero. So we add k_epsilon() (a tiny constant) to the denominator of the recall expression, to give:

```r
recall_metric <- custom_metric("recall", function(y_true, y_pred) {
  tp <- k_sum(k_round(k_clip(y_true * y_pred, 0, 1)))
  pp <- k_sum(k_round(k_clip(y_true, 0, 1)))
  recall <- tp / (pp + k_epsilon())
  return(k_identity(recall))
})
```

And that should calculate the recall (or sensitivity) for the model!
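The tensor arithmetic above can be sanity-checked in plain base R on ordinary vectors (a sketch that does not use Keras itself; the labels and predictions are made up). Rounding the clipped element-wise product of labels and predictions counts the true positives, and summing the labels counts the possible positives:

```r
# Hypothetical labels and predicted probabilities (illustrative values)
y_true <- c(1, 1, 1, 0, 0)
y_pred <- c(0.9, 0.2, 0.8, 0.1, 0.6)

# Mirror k_clip with pmin/pmax, k_round with round, k_sum with sum
tp <- sum(round(pmin(pmax(y_true * y_pred, 0), 1)))  # true positives: 2
pp <- sum(round(pmin(pmax(y_true, 0), 1)))           # possible positives: 3

# Mirror k_epsilon() with a tiny constant to guard against division by zero
recall <- tp / (pp + 1e-7)
```

Here the model catches two of the three true cases, so recall comes out at roughly 0.67.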