## More Fun with Regression:

### Confounding, interaction and random effects

The following blog post provides a general overview of some of the terms encountered when carrying out logistic regression and was inspired by attending the extremely informative HealthyR+: Practical logistic regression course at the University of Edinburgh.

• Confounding
  • What is confounding?
  • Examples
• Interaction
  • What are interaction effects?
  • Example
  • What happens if we overlook interactions?
  • How do we detect interactions?
  • Terminology
• Random effects
  • Clustered data
  • Why should we be aware of clustered data?
  • A solution to clustering
  • Terminology
• Brief summary

## Confounding

### What is confounding?

Confounding occurs when the association between an explanatory (exposure) and outcome variable is distorted, or confused, because another variable is independently associated with both.

The timeline of events must also be considered, because a variable cannot be described as a confounder if it occurs after (and is directly related to) the explanatory variable of interest. Such a variable is instead called a mediating variable, as it lies on the causal pathway between the explanatory variable and the outcome.

### Examples

Potential confounders often encountered in healthcare data include age, sex, smoking status, BMI, frailty and disease severity. One way to control for these variables is to include them in regression models.

In the Stanford marshmallow experiment, a potential confounder was left out – economic background – leading to an overestimate of the influence of a child’s willpower on their future life outcomes.

Another example is the alleged link between coffee drinking and lung cancer. Coffee drinking is more common among smokers than non-smokers, and smoking causes lung cancer, so if smoking is not accounted for in a model examining coffee drinking habits, the results are likely to be confounded.
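To make the coffee example concrete, here is a minimal sketch in Python. All of the counts are hypothetical and were chosen so that coffee has no association with lung cancer within smokers or within non-smokers. The pooling step uses the Mantel-Haenszel estimator, which is one standard way of adjusting for a single categorical confounder; it stands in here for the regression adjustment described above.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b = exposed (coffee drinkers) with / without the outcome
    c, d = unexposed (non-drinkers) with / without the outcome
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Pool stratum-specific 2x2 tables into one adjusted odds ratio."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# HYPOTHETICAL counts, invented for illustration only.
# Smokers: most drink coffee, and lung cancer risk is 10% regardless of coffee.
smokers = (80, 720, 20, 180)
# Non-smokers: few drink coffee, and risk is 1% regardless of coffee.
non_smokers = (3, 297, 7, 693)

# Ignoring smoking means adding the two tables together cell by cell.
crude = tuple(s + n for s, n in zip(smokers, non_smokers))

crude_or = odds_ratio(*crude)
adjusted_or = mantel_haenszel_or([smokers, non_smokers])

print("Odds ratio within smokers:    ", odds_ratio(*smokers))
print("Odds ratio within non-smokers:", odds_ratio(*non_smokers))
print("Crude odds ratio:             ", round(crude_or, 2))
print("Adjusted (Mantel-Haenszel):   ", round(adjusted_or, 2))
```

With these invented numbers the crude odds ratio is well above 1 even though coffee has no effect in either stratum; once smoking is accounted for, the apparent association disappears.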

## Interaction

### What are interaction effects?

In a previous blog post, we looked at how collinearity describes the relationship between two very similar explanatory variables. We can think of this as an extreme case of confounding, almost like entering the same variable into our model twice. An interaction, on the other hand, occurs when the effect of an explanatory variable on the outcome depends on the value of another explanatory variable.

When explanatory variables are dependent on each other to tell the whole story, this can be described as an interaction effect; it is not possible to understand the exact effect that one variable has on the outcome without first knowing information about the other variable.
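One way to write this down, as a sketch, is a logistic regression with a product term. The symbols used here ($p$ for the probability of the outcome, $x_1$ and $x_2$ for the two explanatory variables, and the $\beta$ coefficients) are notation assumed for illustration:

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
$$

In this form the effect of $x_1$ on the log-odds of the outcome is $\beta_1 + \beta_3 x_2$, so it depends on the value of $x_2$ unless $\beta_3 = 0$.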

The use of the word dependent here is potentially confusing as explanatory variables are often called independent variables, and the outcome variable is often called the dependent variable (see word clouds here). This is one reason why I tend to avoid the use of these terms.

### Example

An interesting example of interaction comes from perceptions of climate change and how they relate to political preference and level of education.

We would be missing an important piece of the story concerning attitudes to climate change if we looked in isolation at either education or political orientation.  This is because the two interact; as level of education increases amongst more conservative thinkers, perception about the threat of global warming decreases, but for liberal thinkers as the level of education increases, so too does the perception about the threat of global warming.

Here is a link to the New York Times article on this story: https://www.nytimes.com/interactive/2017/11/14/upshot/climate-change-by-education.html

### What happens if we overlook interactions?

If interaction effects are not considered, the output of the model might lead the investigator to the wrong conclusions. For instance, if each explanatory variable were plotted in isolation against the outcome variable, important information about the interaction between variables might be lost and only main effects would be apparent.

On the other hand, if many variables are used in a model together without first exploring potential interactions, unrecognised interaction effects might mask true associations between the variables. For example, a main effect that averages over two subgroups with opposing effects can appear to be close to zero. This is a separate problem from confounding bias, although the two are easily confused.

### How do we detect interactions?

The best way to start exploring interactions is to plot the variables. Trends are more apparent when we use graphs to visualise these.

If the effect of one exposure variable on the outcome is the same at every level of the other exposure variable, we might visualise this as a graph with two parallel lines. The effects simply add together, so this pattern is often described as additive.

But if the effect of one exposure variable on the outcome changes with the level of the other, the lines will not be parallel: they diverge, converge or cross. This is effect modification, and it is often described as a multiplicative effect.

Once a plot suggests an interaction, the next step is to test formally whether it is statistically significant, for example by adding an interaction term to the model.
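As a rough sketch of what such a test can look like for two binary variables, the Python below compares the odds ratio for education within each political group. The counts are hypothetical and only loosely echo the climate change example; they are not taken from the New York Times article. The standard errors use Woolf's formula and the comparison is a simple Wald test on the log scale, which is an assumed stand-in for fitting an interaction term in a regression model.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Return (log odds ratio, standard error) for a 2x2 table.

    a, b = exposed with / without the outcome
    c, d = unexposed with / without the outcome
    The standard error uses Woolf's formula.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


# HYPOTHETICAL counts, invented for illustration only.
# Exposure: higher level of education. Outcome: sees global warming as a threat.
liberal = (40, 60, 20, 80)
conservative = (20, 80, 40, 60)

log_or_lib, se_lib = log_odds_ratio(*liberal)
log_or_con, se_con = log_odds_ratio(*conservative)

or_liberal = math.exp(log_or_lib)
or_conservative = math.exp(log_or_con)

# The ratio of the two odds ratios measures interaction on the odds scale.
ratio_of_ors = math.exp(log_or_lib - log_or_con)

# Wald test of the null hypothesis that the two odds ratios are equal.
z = (log_or_lib - log_or_con) / math.sqrt(se_lib**2 + se_con**2)
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print("Odds ratio, liberal:     ", round(or_liberal, 2))
print("Odds ratio, conservative:", round(or_conservative, 2))
print("Ratio of odds ratios:    ", round(ratio_of_ors, 2))
print("z statistic:", round(z, 2), " p-value:", p_value)
```

When both variables are binary, the log of this ratio of odds ratios corresponds to the interaction coefficient in a logistic regression that contains both main effects and their product.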

### Terminology

Some degree of ambiguity exists surrounding the terminology of interactions (and statistical terms in general!), but here are a few commonly encountered terms, often used synonymously.

• Interaction
• Causal interaction
• Effect modification
• Effect heterogeneity

There are subtle differences between interaction and effect modification.  You can find out more in this article: On the distinction between interaction and effect modification.

## Random effects

### Clustered data

Many methods of statistical analysis are intended to be applied with the assumption that, within a data-set, an individual observation is not influenced by the value of another observation: it is assumed that all observations are independent of one another.

This may not be the case, however, if you are using data from various hospitals, for example, where natural clustering or grouping might occur. This happens if observations within individual hospitals have a slight tendency to be more similar to each other than to observations in the rest of the data-set.

Random effects modelling is used if the groups of clustered data can be considered as samples from a larger population.

### Why should we be aware of clustered data?

Gathering insight into the exact nature of differences between groups may or may not be important to your analysis, but it is important to account for patterns of clustering. Otherwise, standard errors and p-values may be too small and confidence intervals too narrow, making results look more certain than they really are. Random effects modelling is one approach which can account for this.
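The effect on standard errors can be illustrated with a small simulation in Python. Everything here is assumed for illustration: the number of hospitals, the patients per hospital, the size of the between-hospital and within-hospital variation, and the use of a continuous outcome to keep the arithmetic simple. The cluster-aware estimate below summarises each hospital by its mean, which is a simpler device than a random effects model but shows the same underlying issue.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# ASSUMED simulation settings, chosen purely for illustration.
n_hospitals = 20
patients_per_hospital = 50
between_hospital_sd = 1.0  # how much hospitals differ from one another
within_hospital_sd = 1.0   # how much patients differ within a hospital

# Each hospital shifts all of its patients by a shared hospital effect.
data = []
for _ in range(n_hospitals):
    hospital_effect = random.gauss(0, between_hospital_sd)
    patients = [
        hospital_effect + random.gauss(0, within_hospital_sd)
        for _ in range(patients_per_hospital)
    ]
    data.append(patients)

# Naive analysis: treat all 1000 observations as independent.
all_values = [y for hospital in data for y in hospital]
naive_se = statistics.stdev(all_values) / math.sqrt(len(all_values))

# Cluster-aware analysis: the hospital is the independent unit.
# With equal cluster sizes, the mean of hospital means is the overall mean.
hospital_means = [statistics.mean(hospital) for hospital in data]
cluster_se = statistics.stdev(hospital_means) / math.sqrt(n_hospitals)

print("Naive standard error of the mean:        ", round(naive_se, 3))
print("Cluster-aware standard error of the mean:", round(cluster_se, 3))
```

The naive standard error is smaller because it counts every patient as a fresh, independent piece of information, when patients from the same hospital partly repeat what that hospital has already told us.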

### A solution to clustering

The random effects model assumes that having allowed for the random effects of the various clusters or groups, the observations within each individual cluster are still independent.  You can think of it as multiple levels of analysis – first there are the individual observations, and these are then nested within observations at a cluster level, hence an alternative name for this type of modelling is multilevel modelling.

### Terminology

There are various terms which are used when referring to random effects modelling, although the terms are not entirely synonymous. Here are a few of them:

• Random effects
• Multilevel
• Mixed-effect
• Hierarchical

There are two main types of random effects models:

• Random intercept model
• Random slope and intercept model
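As a sketch of the difference between the two, the models can be written for observation $i$ in cluster $j$ (for example, patient $i$ in hospital $j$). The notation is assumed for illustration: $p_{ij}$ is the probability of the outcome, $x_{ij}$ is an explanatory variable, and the $u$ terms are the cluster-level random effects.

$$
\text{Random intercept:}\quad \log\left(\frac{p_{ij}}{1-p_{ij}}\right) = (\beta_0 + u_{0j}) + \beta_1 x_{ij}, \qquad u_{0j} \sim N(0, \sigma_{u0}^2)
$$

$$
\text{Random slope and intercept:}\quad \log\left(\frac{p_{ij}}{1-p_{ij}}\right) = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, x_{ij}
$$

In the first model each cluster has its own baseline but shares the same effect of $x$. In the second, the effect of $x$ is also allowed to vary from cluster to cluster.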

## Brief summary

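To finish, here is a quick look at the key difference between confounding and interaction. With confounding, a third variable that is associated with both the explanatory variable and the outcome distorts the association between them. With interaction, the effect of one explanatory variable on the outcome depends on the value of another explanatory variable.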

If you would like to learn more about these terms and how to carry out logistic regression in R, keep an eye on the HealthyR page for updates on courses available.