
Start Asking Data Why | Causality Intro | Eyal Kazin


September 22, 2024

For simplicity we’ll examine Simpson’s paradox focusing on two cohorts, male and female adults.

Outcome of the imaginary therapeutic trial, similar to the previous but focusing on the adults. Each symbol is one patient from the respective age-gender cohort and the red line indicates the naïve population trend.

Examining this data we can make three statements about three variables of interest:

Gender is an independent variable (it does not “listen to” the other two)
Treatment depends on Gender (as we can see, in this setting the level given depends on Gender — women have been given, for some reason, a higher dosage.)
Outcome depends on both Gender and Treatment
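
To make these three statements concrete, here is a minimal Python sketch of a data generating process that follows them. All the numbers (dosage levels, baseline outcomes, a true effect of +1 per unit of dosage) and the helper names `simulate_observational` and `ols_slope` are made up for illustration; they are not taken from the trial data in the figure.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of one unit of Treatment on Outcome


def ols_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_observational(n, seed=0):
    """Generate (gender, treatment, outcome) rows; all numbers are invented."""
    rng = random.Random(seed)
    mean_dose = {"female": 6.0, "male": 3.0}  # Treatment depends on Gender
    baseline = {"female": 2.0, "male": 10.0}  # Outcome depends on Gender ...
    rows = []
    for _ in range(n):
        gender = rng.choice(["female", "male"])  # Gender listens to nothing
        treatment = mean_dose[gender] + rng.gauss(0, 1)
        # ... and on Treatment
        outcome = baseline[gender] + TRUE_EFFECT * treatment + rng.gauss(0, 1)
        rows.append((gender, treatment, outcome))
    return rows


rows = simulate_observational(4000)
naive_slope = ols_slope([t for _, t, _ in rows], [y for _, _, y in rows])
print(f"true effect: {TRUE_EFFECT:+.2f}, naive population slope: {naive_slope:+.2f}")
```

With these made-up numbers the naive population slope comes out negative even though every extra unit of dosage helps, because women receive more treatment and also start from a lower baseline.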

Based on these statements we can draw the following causal graph:

Simpson’s paradox Graphic Model where Gender is a confounding variable between Treatment and Outcome

Notice how each arrow communicates one of the statements above. Just as importantly, the absence of an arrow pointing into Gender conveys that it is an independent variable.

Notice also that, because arrows point from Gender to both Treatment and Outcome, Gender is a common cause of the two.

The essence of Simpson’s paradox is that although the Outcome is affected by changes in Treatment, as expected, information also flows between them along a backdoor path via Gender.

The solution to this paradox, as you may have guessed by this stage, is that the common cause Gender is a confounding variable that needs to be controlled.

In terms of the causal graph, controlling for Gender means eliminating the relationship between Gender and Treatment.

This may be done in two ways:

Pre data collection: Setting up a Randomised Controlled Trial (RCT) in which participants are given a dosage regardless of their Gender.
Post data collection: In this made-up scenario the data has already been collected, so we need to deal with what is referred to as Observational Data.

In both the pre- and post-data-collection cases, eliminating the dependency of Treatment on Gender (i.e., controlling for Gender) may be represented by modifying the graph so that the arrow between them is removed:

A modified version of the Simpson’s paradox Graphic Model. The dark node means we control for Gender

Applying this “graphical surgery” means that the last two statements need to be modified (for convenience I’ll write all three):

Gender is an independent variable
Treatment is an independent variable
Outcome depends on Gender and Treatment (but with no backdoor path)
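
As a rough illustration of what this graphical surgery does, the sketch below simulates the modified graph, in which Treatment is assigned at random and no longer listens to Gender. The numbers, the assumed true effect of +1 per unit of dosage, and the helper names `simulate_rct` and `ols_slope` are all invented for this example.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of one unit of Treatment on Outcome


def ols_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_rct(n, seed=1):
    """Generate rows where the Gender -> Treatment arrow has been removed."""
    rng = random.Random(seed)
    baseline = {"female": 2.0, "male": 10.0}  # Outcome still depends on Gender
    rows = []
    for _ in range(n):
        gender = rng.choice(["female", "male"])  # Gender is independent
        treatment = rng.uniform(1.0, 8.0)  # Treatment is independent of Gender
        outcome = baseline[gender] + TRUE_EFFECT * treatment + rng.gauss(0, 1)
        rows.append((gender, treatment, outcome))
    return rows


rct_rows = simulate_rct(4000)
rct_slope = ols_slope([t for _, t, _ in rct_rows], [y for _, _, y in rct_rows])
print(f"true effect: {TRUE_EFFECT:+.2f}, slope under randomisation: {rct_slope:+.2f}")
```

Because Gender no longer influences who gets which dosage, the backdoor path is closed and the plain population slope should land close to the assumed true effect.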

This enables obtaining the causal relationship of interest: we can assess the direct impact of modifying Treatment on the Outcome.

The process of controlling for a confounder, i.e., manipulating the data generation process, is formally referred to as applying an intervention. That is to say, we are no longer passive observers of the data; we take an active role in modifying it to assess the causal impact.

How is this manifested in practice?

In the case of the RCT the researcher needs to ensure that important confounding variables are controlled. Here we limit the discussion to Gender (but in real world settings you can imagine other variables such as Age, Social Status and anything else that might be relevant to one’s health).

RCTs are considered the gold standard for causal analysis in many experimental settings thanks to their practice of controlling for confounding variables. That said, they have several drawbacks:

It may be expensive to recruit individuals and may be complicated logistically
The intervention under investigation may not be physically possible or ethical to conduct (e.g., one can’t ask randomly selected people to smoke or not for ten years)
The artificial setting of a laboratory is not the true natural habitat of the population

Observational data, on the other hand, is much more readily available in industry and academia, and hence much cheaper, and it could be more representative of the actual habits of individuals. But as illustrated in the Simpson’s paradox diagram, it may have confounding variables that need to be controlled.

This is where ingenious solutions developed in the causal community in the past few decades are making headway. Detailing them is beyond the scope of this post, but I briefly mention how to learn more at the end.

To resolve this Simpson’s paradox with the given observational data, one:

Calculates for each cohort the impact of the change of the treatment on the outcome
Calculates a weighted average of these per-cohort impacts, weighting each cohort by its contribution to the population.

Here we will focus on intuition, but in a future post we will describe the maths behind this solution.
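
For intuition only, the two steps above can be sketched in Python on simulated observational data. Everything here is an assumption made for the example: the numbers, the true effect of +1 per unit of dosage, the use of a least-squares slope as the per-cohort "impact", the choice to weight each cohort by its share of the population, and the helper names `simulate_observational` and `ols_slope`.

```python
import random
from collections import defaultdict

TRUE_EFFECT = 1.0  # assumed causal effect of one unit of Treatment on Outcome


def ols_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_observational(n, seed=0):
    """Generate confounded (gender, treatment, outcome) rows; numbers are invented."""
    rng = random.Random(seed)
    mean_dose = {"female": 6.0, "male": 3.0}  # Treatment depends on Gender
    baseline = {"female": 2.0, "male": 10.0}  # Outcome depends on Gender
    rows = []
    for _ in range(n):
        gender = rng.choice(["female", "male"])
        treatment = mean_dose[gender] + rng.gauss(0, 1)
        outcome = baseline[gender] + TRUE_EFFECT * treatment + rng.gauss(0, 1)
        rows.append((gender, treatment, outcome))
    return rows


rows = simulate_observational(4000)

# Step 1: the impact of Treatment on Outcome within each cohort.
by_gender = defaultdict(list)
for gender, treatment, outcome in rows:
    by_gender[gender].append((treatment, outcome))
cohort_slope = {
    g: ols_slope([t for t, _ in pairs], [y for _, y in pairs])
    for g, pairs in by_gender.items()
}

# Step 2: average the cohort slopes, weighted by each cohort's population share.
adjusted_slope = sum(
    cohort_slope[g] * len(pairs) / len(rows) for g, pairs in by_gender.items()
)

naive_slope = ols_slope([t for _, t, _ in rows], [y for _, _, y in rows])
print(f"naive: {naive_slope:+.2f}, per cohort: {cohort_slope}, adjusted: {adjusted_slope:+.2f}")
```

In this toy setting the naive slope is negative while both cohort slopes, and therefore their weighted average, are positive and close to the assumed true effect.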

I am sure that many analysts, just like myself, have noticed Simpson’s paradox at some stage in their data and hopefully have corrected for it. Now you know the name of this effect and can hopefully start to appreciate how useful causal tools are.

That said … being confused at this stage is OK 😕

I’ll be the first to admit that I struggled to understand this concept and it took me three weekends of deep diving into examples to internalise it. This was the gateway drug to causality for me. Part of my process for understanding statistics is playing with data. For this purpose I created an interactive web application hosted on Streamlit which I call Simpson’s Calculator 🧮. I’ll write a separate post for this in the future.

Even if you are confused, the main takeaways of Simpson’s paradox are that:

It is a situation where trends can exist in subgroups but reverse for the whole.
It may be resolved by identifying confounding variables between the treatment and the outcome variables and controlling for them.

This raises the question: should we just control for all variables except for the treatment and outcome? Let’s keep this in mind when resolving Berkson’s paradox.