## More Fun with Regression:

### Confounding, interaction and random effects

The following blog post provides a general overview of some of the terms encountered when carrying out logistic regression and was inspired by attending the extremely informative HealthyR+: Practical logistic regression course at the University of Edinburgh.

• Confounding
• What is confounding?
• Examples
• Interaction
• What are interaction effects?
• Example
• How do we detect interactions?
• What happens if we overlook interactions?
• Terminology
• Random effects
• Clustered data
• Why should we be aware of clustered data?
• A solution to clustering
• Terminology
• Brief summary

## Confounding

### What is confounding?

Confounding occurs when the association between an explanatory (exposure) and outcome variable is distorted, or confused, because another variable is independently associated with both.

The timeline of events must also be considered, because a variable cannot be described as confounding if it occurs after (and is directly related to) the explanatory variable of interest.  Instead it is sometimes called a mediating variable as it is located along the causal pathway, between explanatory and outcome.

### Examples

Potential confounders often encountered in healthcare data include for example, age, sex, smoking status, BMI, frailty, disease severity.  One of the ways these variables can be controlled is by including them in regression models.

In the Stanford marshmallow experiment, a potential confounder was left out – economic background – leading to an overestimate of the influence of a child’s willpower on their future life outcomes.

Another example includes the alleged link between coffee drinking and lung cancer. More smokers than non-smokers are coffee drinkers, so if smoking is not accounted for in a model examining coffee drinking habits, the results are likely to be confounded.

## Interaction

### What are interaction effects?

In a previous blog post, we looked at how collinearity is used to describe the relationship between two very similar explanatory variables.  We can think of this as an extreme case of confounding, almost like entering the same variable into our model twice.  An interaction on the other hand, occurs when the effect of an explanatory variable on the outcome, depends on the value of another explanatory variable.

When explanatory variables are dependent on each other to tell the whole story, this can be described as an interaction effect; it is not possible to understand the exact effect that one variable has on the outcome without first knowing information about the other variable.

The use of the word dependent here is potentially confusing as explanatory variables are often called independent variables, and the outcome variable is often called the dependent variable (see word clouds here). This is one reason why I tend to avoid the use of these terms.

### Example

An interesting example of interaction occurs when examining our perceptions about climate change and the relationship between political preference, and level of education.

We would be missing an important piece of the story concerning attitudes to climate change if we looked in isolation at either education or political orientation.  This is because the two interact; as level of education increases amongst more conservative thinkers, perception about the threat of global warming decreases, but for liberal thinkers as the level of education increases, so too does the perception about the threat of global warming.

Here is a link to the New York Times article on this story: https://www.nytimes.com/interactive/2017/11/14/upshot/climate-change-by-education.html

### What happens if we overlook interactions?

If interaction effects are not considered, then the output of the model might lead the investigator to the wrong conclusions. For instance, if each explanatory variable was plotted in isolation against the outcome variable, important potential information about the interaction between variables might be lost, only main effects would be apparent.

On the other hand, if many variables are used in a model together, without first exploring the nature of potential interactions, it might be the case that unknown interaction effects are masking true associations between the variables.  This is known as confounding bias.

### How do we detect interactions?

The best way to start exploring interactions is to plot the variables. Trends are more apparent when we use graphs to visualise these.

If the relationship between two exposure variables on an outcome variable is constant, then we might visualise this as a graph with two parallel lines.  Another way of describing this is additive effect modification.

But if the effect of the exposure variables on the outcome is not constant then the lines will diverge. We can describe this as multiplicative effect modification.

Once an interaction has been confirmed, the next step would be to explore whether the interaction is statistically significant or not.

### Terminology

Some degree of ambiguity exists surrounding the terminology of interactions (and statistical terms in general!), but here are a few commonly encountered terms, often used synonymously.

• Interaction
• Causal interaction
• Effect modification
• Effect heterogeneity

There are subtle differences between interaction and effect modification.  You can find out more in this article: On the distinction between interaction and effect modification.

## Random effects

### Clustered data

Many methods of statistical analysis are intended to be applied with the assumption that, within a data-set, an individual observation is not influenced by the value of another observation: it is assumed that all observations are independent of one another.

This may not be the case however, if you are using data, for example, from various hospitals, where natural clustering or grouping might occur.  This happens if observations within individual hospitals have a slight tendency to be more similar to each other than to observations in the rest of the data-set.

Random effects modelling is used if the groups of clustered data can be considered as samples from a larger population.

### Why should we be aware of clustered data?

Gathering insight into the exact nature of differences between groups may or may not be important to your analysis, but it is important to account for patterns of clustering because otherwise measures such as standard errors, confidence intervals and p-values may appear to be too small or narrow.  Random effects modelling is one approach which can account for this.

### A solution to clustering

The random effects model assumes that having allowed for the random effects of the various clusters or groups, the observations within each individual cluster are still independent.  You can think of it as multiple levels of analysis – first there are the individual observations, and these are then nested within observations at a cluster level, hence an alternative name for this type of modelling is multilevel modelling.

### Terminology

There are various terms which are used when referring to random effects modelling, although the terms are not entirely synonymous. Here are a few of them:

• Random effects
• Multilevel
• Mixed-effect
• Hierarchical

There are two main types of random effects models:

• Random intercept model
• Random slope and intercept model

## Brief summary

To finish, here is a quick look at some of the key differences between confounding and interaction.

If you would like to learn more about these terms and how to carry out logistic regression in R, keep an eye on the HealthyR page for updates on courses available.

## Fun with Regression

“All models are wrong, but some are useful”

George Box

This quote by statistician George Box feels like a good starting point from which to consider some of the challenges of regression modelling.  If we start with the idea that all models are wrong, it follows that one of the main skills in carrying out regression modelling is working out where the weaknesses are and how to minimise these to produce as close an approximation as possible to the data you are working with – to make the model useful.

The idea that producing high-quality regression models is often more of an art than a science appeals to me.  Understanding the underlying data, what you want to explore, and the tools you have at hand are essential parts of this process.

After attending the excellent HealthyR+: Practical Logistic Regression course a few weeks ago, my head was buzzing with probabilities, odds ratios and confounding.  It was not just the data which was confounded.  As someone fairly new to logistic regression, I thought it might be useful to jot down some of the areas I found particularly interesting and concepts which made me want to find out more.  In this first blog post we take a brief look at:

• Probability and odds
• The difference between probability and odds
• Why use log(odds) and not just odds?
• Famous probability problems
• Collinearity and correlation
• What is collinearity?
• How do we detect collinearity?
• Is collinearity a problem?

### Probability and odds

#### The difference between probability and odds

Odds and probability are both measures of how likely it is that a certain outcome might occur in a series of events.  Probability is perhaps more intuitive to understand, but its properties make it less useful in statistical models and so odds, odds ratios, and log(odds) are used instead, more on this in the next section.

Interestingly, when the probability of an event occurring is small – <0.1 (or less than 10%) – the odds are quite similar.  However, as probability increases, the odds also increase but at a greater rate, see the following figure:

Here we can also see that whilst probabilities range from 0 to 1, odds can take on any value between 0 and infinity.

#### Why use log(odds) and not just odds?

Asymmetry of the odds scale makes it difficult to compare binary outcomes, but by using log(odds) we can produce a symmetrical scale, see figure below:

In logistic regression, the odds ratio concerning a particular variable represents the change in odds with each unit increase, whilst holding all other variables constant.

#### Famous probability problems

I find probability problems fascinating, particularly those which seem counter-intuitive. Below are links to explanations of two intriguing probability problems:

### Collinearity and correlation

#### What is collinearity?

The term collinearity (also referred to as multicollinearity) is used to describe a high correlation between two explanatory variables.  This can cause problems in regression modelling because the explanatory variables are assumed to be independent (and indeed are sometimes called independent variables, see word clouds below).

The inclusion of variables which are collinear (highly correlated) in a regression model, can lead to the false impression for example, that neither variable is associated with the outcome, when in fact, individually each variable does have a strong association.  The figure below might help to visualise the relationships between the variables:

In this image, y represents the control variable, and x1 and x2 are the highly correlated, collinear explanatory variables.  As you can see, there is a large area of (light grey) overlap between the x variables, whereas there are only two very small areas of independent overlap between each x and y variable.  These small areas represent the limited information available to the regression model when trying to carry out analysis.

#### How do we detect collinearity?

A regression coefficient can be thought of as the rate of change, or as the slope of the regression line.  The slope describes the mean change in the outcome variable for every unit of change in the explanatory variable.  It is important to note that regression coefficients are calculated based on the assumption that all other variables (apart from the variables of interest) are kept constant.

When two variables are highly correlated, this creates problems. The model will try to predict the outcome but finds it hard to disentangle the influence of either of the explanatory variables due to their strong correlation. As a result, coefficient estimates may change erratically in response to small changes in the model.

Various terms are used to describe these x and y variables depending on context.  There are slight differences in the meanings, but here are a few terms that you might encounter:

The information I used to generate these word clouds was based on a crude estimate of the number of mentions in Google Scholar within the context of medical statistics.

#### Is collinearity a problem?

Collinearity is a problem if the purpose of your analysis is to explain the interactions between the data, however it has little effect on the overall predictive properties of your model, i.e. the model will provide accurate predictions based on all variables as one big bundle, but will not be able to tell you about the interactions of isolated variables.

If you are concerned with exploring specific interactions and you encounter collinearity, there are two main approaches you can take:

• Drop one of the variables if it is not vital to your analysis
• Combine the variables (e.g. weight and height can be combined to produce BMI)

An example of a publication where missed collinearity led to potentially erroneous conclusions, concerns analyses carried out on data relating to the World Trade Organisation (WTO). Here is a related article which attempts to unpick some of the problems with previous WTO research.

Finishing on an example of a problematic attempt at regression analysis may perhaps seem slightly gloomy, but on the contrary, I hope that this might provide comfort if your own analysis throws up challenges or problems – you are in good company!  It also brings us back to the quote by George Box at the beginning of this blog post, where we started with the premise that all models are wrong.  They are at best a close approximation, and we must always be alert to their weaknesses.

### What next?

Look out for the next HealthyR+: Practical Logistic Regression course and sign up.  What areas of medical statistics do you find fun, puzzling, tricky, surprising? Let us know below.

## rmedicine2019 – some quick thoughts and good packages

Kenny McLean and I recently attended rmedicine 2019 in Boston MA. The conference is aimed at clinicians and non-clinicians who use R for day-to-day research and monitoring of clinical processes.

Day 1 covered two parallel workshops: R Markdown for Medicine and Wrangling Survival Data

I attended R Markdown for Medicine run by Alison Hill from RStudio. Using .rmd files has become the default for the Surgical Informatics Group and, so it seems, a great number of others who attended rmedicine. Around a third of the presentations at rmedicine covered workflows involving sharing of data via either .rmd files or through shiny, an R package for creating deploy-able dashboards for data visualisation and interactive exploration.

## R Markdown for Medicine

### An Overview of Useful Tips and Tricks

R markdown is an extension of R which allows you to combine narrative text and R code within one document. This means your notes, code, results and plots are all in one place. Code is contained in between three backticks with an {r} after the first set. Inline code can also be used between single backticks followed by r without the curly brackets and then the code. This means that results can be changed automatically so that for a trial when you describe the results of numbers included / excluded, this only needs changed in one place so that the rest of the text (and / or flowcharts) updates automatically. It is also possible to mix-and-match other chunks of code from other languages.

#### Use Params!

Parameters are set in the YAML header at the top of the .rmd document. If you set a parameter of data to a default .csv or .rda file then this can be changed for other similar files without creating a new document. A really useful example would be when you have multiple hospitals or multiple diseases each with a separate data file, a report can then be generated for each file. If you use rmarkdown::render() along with purrr::pwalk you can generate a separate output file for any number of hospitals / diseases / countries / individuals etc. in just a couple of lines of code.

#### Use Helper Packages

There are some greater .rmd helper packages to improve the workflow, improve the rendering of documents and generally make life easier.

bookdown allows several .rmd documents to be combined to a book but also has some general usefulness for single documents as well. Using bookdown::word_document2 or bookdown::html_document2 in the YAML header under the output field is designed to improve cross-referencing of tables and figures compared to the default versions.

wordcountaddin allows an accurate word count to be performed which will not count YAML or code etc. without knitting the document. This is much easier than knitting the document and then performing a word count!

citr allows automated insertion of markdown citations to assist with referencing. Check out my earlier blog on referencing to get an idea of how to set up .bib files. I may add another blog on this topic, watch this space!

xaringan is a useful package for creating HTML presentations with high levels of customisation. It is possible to use an additional .css file for even greater customisation and styling of your slides but xaringan offers a great deal of user-friendly options.

distill appears to be good at supporting mobile-friendly web publishing for scientific communication with flexible figure layouts, table pagination, LaTeX math support and incorporation of javascript.

There are countless other helper packages and more likely to be on their way. Many allow additional aesthetic modification of the output documents and may allow you to run R code rather than modifying a .css file.

#### List Numbering the Lazy Way

List numbering in .rmd works without needing to manually enter the correct numbers. Just make a list where every element begins with 1. and .rmd will transform it into an appropriately-numbered list. Great if you need to add in a new element to the middle of the list later!

#### Multiple lots in a Grid

I’ve previously come across patchwork as a way to plot several plots into a grid which could be 1×2, 2×2, 3 in one column and one in the other etc. There are also two other packages cowplot and egg. I haven’t explored the differences between them but if you find that one doesn’t give you the exact customisation or alignment you need then possibly try another one. cowplot looks as if it might perform better at overlaying plots on top of another and at exact axis line matching.

#### Use the here package to help with file paths

here is a great package for swapping between Windows and Mac file paths (no more swapping backslashes and forward slashes!). Using here::here() will default to looking for a file in the .Rproj directory rather than the .rmd directory which is the default otherwise – great if you want to have multiple .rmd documents each in their own sub-directory with a shared data file in the parent directory.

#### Customise Code Outputs

R markdown allows customisation of appearance of code. Some of this can be done through modifying a .css file but there are some simpler ways to make basic changes. Try adding comment = "#>" to knitr::opst_chunk\$set()to customise how comments appear in your document.

#### Word document creations tips

R markdown is generally great for HTML and PDF formats. The options for knitting to Word are not as well developed but there are some good options. The bookdown package is useful as discussed. The redoc package has been used to facilitate conversion to and from word – not tried it personally but if it can print out to word and then handle tracked changes back into markdown then it could be very useful.

For converting more complex tables and figures to word an option is to knit to rtf (rich text format) and then open the rtf file in word. This tends to be very good at keeping the desired formatting.

R markdown is a great resource although there are a handful of minor issues which are currently difficult to resolve. One of the main problems I find it with tables and cross-referencing. I really like the syntax and customisation of the gt package but at present it appears cross-referencing in a way which works across HTML, PDF and Word outputs is not supported – a great opportunity to submit a pull request if you think you can get this to work.

## Other Useful rmedicine Packages and Ideas

### survival Package Update

The latest version (version 3.0) of the survival package was presented by Terry Therneau and is now available on github. This package is used by over 650 additional downstream dependencies. The latest version allows for multiple observations per subject, multiple endpoints per subject and multiple types of end-point. This will be particularly useful for competing risks analyses e.g. outcomes for liver transplant patients (transplanted, still on list, removed from list as no longer eligible or died).

Keep an eye-out for Kenny McLean’s blog where he plans to cover the survival package and many other useful packages presented at rmedicine 2019.

### hreport Automated Trial Reporting

hreport by Frank Harrel (currently available on github) is for automated reporting of trials and studies with generation of interactive html graphs based in plotly. Several aspects of a study can be rendered easily into plots demonstrating accrual, exclusions, descriptive statistics, adverse events and time-to-event data. Another key theme of rmedicine 2019 appears to have been the use of plotly or similar packages to enable interaction with data.

### timevis – interactive timelines

timevis allows generation of highly interactive timeline plots which allow zooming, adding or removal of events, resizing, etc.

### Holepunch package

For working with projects that require a number of packages that then need shared with a colleague, holepunch provides a quick method for generating a list of dependencies and a Dockerfile. The package creates a link for another user to open a free RStudio server with all of the required packages installed. This may be useful for trouble-shooting in a department and showing code examples.

## Summary

rmedicine 2019 has shown that clinical researchers are moving increasingly towards literate programming, interactive visualisations and automated workflows using R and Rmarkdown.

The conference was a great mix of methods presentations and data presentations from R users. You definitely don’t need any in-depth knowledge of R to benefit from it and I’d highly recommend booking for rmedicine 2020.

## HealthyR Notebooks Estonia Day 2

Everybody came back for Day 2 of HealthyR Notebooks in Estonia!

Today focussed on modelling kicking off with linear regression in detail from Riinu – if you understand this you understand the majority of statistical tests!

Factors were introduced by Cameron, which led nicely into logistic regression. By the end of this the whole room was comfortably building regression models as if they had been doing it for years!

Notebooks are a really powerful tool for teaching this sort of material allowing seamless output into PDF and Word format. Some of the delegates commented on how they had struggled with this aspect of R in the past.

After all the intense work it was a great relief to break early for some fantastic team building including stuff like this!

Some even tried there hand at archery and felt pretty smug about their performance 😂

Finally, we are so grateful Julius Juurmaa for all of the organisation. He even started a company to administer the course! Thank you Julius.

## HealthyR Notebooks Estonia Day 1

The Surgical Informatics team arrived in beautiful Estonia yesterday.

We are here as part of our Wellcome Trust Open Research Fund grant – “HealthyR Notebooks: Democratising open and reproducible data analysis in resource-poor environments”.

It’s a mouthful, but important! We have adapted our popular HealthyR training course to be easily delivered on small laptop screens allowing state-of-the-art data analysis to be performed using RStudio anywhere by anyone. We’re testing this in Estonia and running it again in Ghana in November.

Also look out for HealthyR the book coming soon.

The setting is fantastic:

The delegates are already amazing at using R, but we’re teaching Tidyverse and Finalfit to bring everyone up to date with all the great new modern packages.

There are threats of pedalo racing later 😳.