Fun with Regression

“All models are wrong, but some are useful”

George Box

This quote by statistician George Box feels like a good starting point from which to consider some of the challenges of regression modelling.  If we start with the idea that all models are wrong, it follows that one of the main skills in carrying out regression modelling is working out where the weaknesses are and how to minimise these to produce as close an approximation as possible to the data you are working with – to make the model useful.

The idea that producing high-quality regression models is often more of an art than a science appeals to me.  Understanding the underlying data, what you want to explore, and the tools you have at hand are essential parts of this process.

After attending the excellent HealthyR+: Practical Logistic Regression course a few weeks ago, my head was buzzing with probabilities, odds ratios and confounding.  It was not just the data which was confounded.  As someone fairly new to logistic regression, I thought it might be useful to jot down some of the areas I found particularly interesting and concepts which made me want to find out more.  In this first blog post we take a brief look at:

  • Probability and odds
    • The difference between probability and odds
    • Why use log(odds) and not just odds?
    • Famous probability problems
  • Collinearity and correlation
    • What is collinearity?
    • How do we detect collinearity?
    • Is collinearity a problem?

Probability and odds

The difference between probability and odds

Odds and probability are both measures of how likely it is that a certain outcome might occur in a series of events.  Probability is perhaps more intuitive to understand, but its properties make it less useful in statistical models and so odds, odds ratios, and log(odds) are used instead, more on this in the next section.

Interestingly, when the probability of an event occurring is small – <0.1 (or less than 10%) – the odds are quite similar.  However, as probability increases, the odds also increase but at a greater rate, see the following figure:

Here we can also see that whilst probabilities range from 0 to 1, odds can take on any value between 0 and infinity.

Why use log(odds) and not just odds?

Asymmetry of the odds scale makes it difficult to compare binary outcomes, but by using log(odds) we can produce a symmetrical scale, see figure below:

In logistic regression, the odds ratio concerning a particular variable represents the change in odds with each unit increase, whilst holding all other variables constant.

Famous probability problems

I find probability problems fascinating, particularly those which seem counter-intuitive. Below are links to explanations of two intriguing probability problems:

Collinearity and correlation

What is collinearity?

The term collinearity (also referred to as multicollinearity) is used to describe a high correlation between two explanatory variables.  This can cause problems in regression modelling because the explanatory variables are assumed to be independent (and indeed are sometimes called independent variables, see word clouds below). 

The inclusion of variables which are collinear (highly correlated) in a regression model, can lead to the false impression for example, that neither variable is associated with the outcome, when in fact, individually each variable does have a strong association.  The figure below might help to visualise the relationships between the variables:

In this image, y represents the control variable, and x1 and x2 are the highly correlated, collinear explanatory variables.  As you can see, there is a large area of (light grey) overlap between the x variables, whereas there are only two very small areas of independent overlap between each x and y variable.  These small areas represent the limited information available to the regression model when trying to carry out analysis.

How do we detect collinearity?

A regression coefficient can be thought of as the rate of change, or as the slope of the regression line.  The slope describes the mean change in the outcome variable for every unit of change in the explanatory variable.  It is important to note that regression coefficients are calculated based on the assumption that all other variables (apart from the variables of interest) are kept constant. 

When two variables are highly correlated, this creates problems. The model will try to predict the outcome but finds it hard to disentangle the influence of either of the explanatory variables due to their strong correlation. As a result, coefficient estimates may change erratically in response to small changes in the model.

Various terms are used to describe these x and y variables depending on context.  There are slight differences in the meanings, but here are a few terms that you might encounter:

The information I used to generate these word clouds was based on a crude estimate of the number of mentions in Google Scholar within the context of medical statistics.

Is collinearity a problem?

Collinearity is a problem if the purpose of your analysis is to explain the interactions between the data, however it has little effect on the overall predictive properties of your model, i.e. the model will provide accurate predictions based on all variables as one big bundle, but will not be able to tell you about the interactions of isolated variables.

If you are concerned with exploring specific interactions and you encounter collinearity, there are two main approaches you can take:

  • Drop one of the variables if it is not vital to your analysis
  • Combine the variables (e.g. weight and height can be combined to produce BMI)

An example of a publication where missed collinearity led to potentially erroneous conclusions, concerns analyses carried out on data relating to the World Trade Organisation (WTO). Here is a related article which attempts to unpick some of the problems with previous WTO research.

Finishing on an example of a problematic attempt at regression analysis may perhaps seem slightly gloomy, but on the contrary, I hope that this might provide comfort if your own analysis throws up challenges or problems – you are in good company!  It also brings us back to the quote by George Box at the beginning of this blog post, where we started with the premise that all models are wrong.  They are at best a close approximation, and we must always be alert to their weaknesses.

What next?

Look out for the next HealthyR+: Practical Logistic Regression course and sign up.  What areas of medical statistics do you find fun, puzzling, tricky, surprising? Let us know below.