This article on causal machine learning covers a practical example of how to learn structural causal models (SCMs) directly from data. We will use bnlearn, an open-source Python library for learning the graphical structure of Bayesian networks. Check out my GitHub repo for additional code examples. For other frameworks, check out my page on causal stuff.
Learning a Bayesian network can be split into structure learning and parameter learning, both of which are implemented in bnlearn:
- Structure learning: Given a set of data samples, estimate a DAG that captures the dependencies between the variables.
- Parameter learning: Given a set of data samples and a DAG that captures the dependencies between the variables, estimate the (conditional) probability distributions of the individual variables.
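To make the parameter learning step concrete: for a single edge, the simplest estimate of a conditional probability table is a relative frequency count. The following is a minimal sketch under that assumption, not bnlearn's implementation; the `estimate_cpt` helper and the toy observations are hypothetical and only serve as illustration.

```python
from collections import Counter


def estimate_cpt(samples):
    """Estimate P(child | parent) from (parent, child) pairs by relative frequency."""
    pair_counts = Counter(samples)
    parent_counts = Counter(parent for parent, _ in samples)
    return {
        (parent, child): count / parent_counts[parent]
        for (parent, child), count in pair_counts.items()
    }


# Hypothetical toy observations of (cloudy, rain); not the article's dataset.
toy_samples = [(1, 1), (1, 0), (1, 0), (1, 0), (0, 0), (0, 0)]
cpt = estimate_cpt(toy_samples)
print(cpt[(1, 1)])  # estimated P(rain=1 | cloudy=1)
```

Combinations that never occur in the sample, such as rain without clouds here, simply get no entry in the table.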
Libraries
We will learn through a practical example and code. The following libraries are used to implement the example. NumPy and pandas are used for recreating a classic synthetic dataset often used in causal machine learning, the "sprinkler" dataset. bnlearn is then used to learn the causal structure among the variables in the dataset.
You will need the following imports in Python:
```python
import numpy as np
import pandas as pd
import bnlearn as bn
```
The sprinkler dataset
Imagine a small world with a lawn that is sometimes wet. I bet you can smell that lawn just thinking about it. Only two things cause this lawn to be wet: rain or the sprinkler being on. Otherwise the lawn is dry (i.e. ¬wet). While clouds are needed for rain, not all clouds carry rain. It may therefore be cloudy without rain. On some sunny days the lawn needs water and the sprinkler is turned on. On other sunny days the lawn does not need water and the sprinkler is off. The sprinkler is never on when it is cloudy, because somehow clouds help the lawn stay moist if not wet.
The lawn world implies four stochastic variables:
- Cloudy (independent)
- Rain (depends on Cloudy)
- Sprinkler (depends on not-Cloudy)
- Grass wet (depends on Rain and Sprinkler)
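Read as a Bayesian network, these dependencies correspond to a factorisation of the joint distribution into one factor per variable, each conditioned on that variable's direct causes. The symbols C, R, S and W are shorthand introduced here for Cloudy, Rain, Sprinkler and Grass wet:

```latex
P(C, R, S, W) = P(C)\, P(R \mid C)\, P(S \mid C)\, P(W \mid R, S)
```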
The following code samples the four variables and creates the sprinkler dataset:
```python
n_samples = 10000
# Cloudy is 1 with probability 0.75
cloudy = np.random.choice(2, p=[0.25, 0.75], size=n_samples)
# Rain is only possible when cloudy, then with probability 0.3
rain = cloudy * np.random.choice(2, p=[0.7, 0.3], size=n_samples)
# The sprinkler only runs on dry, cloudless days, then with probability 0.5
sprinkler = (1-rain) * (1-cloudy) * np.random.choice(2, p=[0.5, 0.5], size=n_samples)
# The grass is wet if it rains or the sprinkler is on
grass_wet = np.maximum(rain, sprinkler)
data = np.column_stack((cloudy, rain, sprinkler, grass_wet))
df = pd.DataFrame(data, columns=["cloudy", "rain", "sprinkler", "grass_wet"])
```
The resulting dataset may look like this:
Cloudy | Rain | Sprinkler | Grass wet |
---|---|---|---|
1 | 0 | 0 | 0 |
1 | 1 | 0 | 1 |
0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 |
Take a moment to verify that the observations are consistent with the story told above.
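The same check can also be done programmatically. Below is a minimal sketch in plain Python that encodes the three rules of the story; the `violations` helper is a hypothetical name introduced here, and the rows are the four example observations from the table.

```python
def violations(rows):
    """Return the rows that contradict the lawn story.

    Each row is a (cloudy, rain, sprinkler, grass_wet) tuple of 0/1 values.
    """
    bad = []
    for cloudy, rain, sprinkler, grass_wet in rows:
        rain_without_clouds = rain == 1 and cloudy == 0
        sprinkler_under_clouds = sprinkler == 1 and cloudy == 1
        wetness_mismatch = grass_wet != max(rain, sprinkler)
        if rain_without_clouds or sprinkler_under_clouds or wetness_mismatch:
            bad.append((cloudy, rain, sprinkler, grass_wet))
    return bad


# The four example rows from the table.
table_rows = [(1, 0, 0, 0), (1, 1, 0, 1), (0, 0, 0, 0), (0, 0, 1, 1)]
print(violations(table_rows))  # an empty list means every row fits the story
```

For the full dataset, the same function could be called as `violations(df.itertuples(index=False))`, assuming the column order used when `df` was created.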
Learning the causal structure
The variables were created to have a specific causal structure. Examples of structures among the variables are shown below, where an arrow (X → Y) should be read as "X causes Y":
- Cloudy → Rain → Grass wet (chain)
- Rain → Grass wet ← Sprinkler (collider)
- Sprinkler ← Cloudy → Rain (fork)
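These three patterns are fragments of one and the same DAG. As a rough sketch of how that graph can be written down and checked for cycles, the snippet below uses the standard library's `graphlib` (Python 3.9+); the `parents` dictionary is a hand-written encoding of the lawn story, not something produced by bnlearn.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each variable maps to the set of its direct causes (parents) in the lawn story.
parents = {
    "cloudy": set(),
    "rain": {"cloudy"},
    "sprinkler": {"cloudy"},
    "grass_wet": {"rain", "sprinkler"},
}

# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(parents).static_order())
print(order)  # causes always appear before their effects

# The same graph as a list of (cause, effect) edges.
edges = sorted(
    (cause, effect) for effect, causes in parents.items() for cause in causes
)
print(edges)
```

A learned graph can be compared against this edge list to see which arrows were recovered and which were reversed or missed.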
Let's see if we can learn this causal structure using bnlearn as shown in the following code snippet:
```python
model = bn.structure_learning.fit(df)
model = bn.independence_test(model, df)
```
Notice that we may learn the wrong causal relationships. For example, it may seem that turning the sprinkler off causes clouds to appear. This is because the sprinkler is never on while there are clouds and vice versa. However, any observations where the sprinkler is off and no clouds appear would be evidence to the contrary, which may or may not be present in the sample we generated above.
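Whether such contrary evidence exists in a sample can be checked by counting how often each combination of Cloudy and Sprinkler occurs. The following is a minimal sketch that resamples just these two variables with the standard library, assuming the same probabilities as in the generation code; the seed and the `pair_counts` name are illustrative choices.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed, chosen here only for repeatability
n_samples = 10000

pair_counts = Counter()
for _ in range(n_samples):
    cloudy = int(rng.random() < 0.75)                   # cloudy with probability 0.75
    sprinkler = (1 - cloudy) * int(rng.random() < 0.5)  # only ever on when not cloudy
    pair_counts[(cloudy, sprinkler)] += 1

for (cloudy, sprinkler), count in sorted(pair_counts.items()):
    print(f"cloudy={cloudy} sprinkler={sprinkler}: {count}")
```

The combination cloudy=1, sprinkler=1 never appears, while cloudy=0, sprinkler=0 is the evidence to the contrary. With the article's `df`, a similar table could be obtained with `pd.crosstab(df["cloudy"], df["sprinkler"])`.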
Visualising the causal DAG
Because bnlearn includes networkx, we get the ability to visualise the graph that was learned. It's a single line of code:
```python
G = bn.plot(model)
```
If all went well with the data generation and learning, the graph should look similar to this.
If it does not, simply try to generate the data again, optionally increasing the number of samples.
Conclusion
bnlearn can be used to learn the causal relationships between variables directly from data. It does not always work and is somewhat sensitive to the sample drawn: causal relationships may be misinterpreted if the sample contains insufficient evidence to indicate otherwise.