A core problem that arises in most data-driven personalized decision scenarios is the estimation of heterogeneous treatment effects: what is the effect of an intervention on an outcome of interest as a function of a set of observable characteristics of the treated sample?
In personalized pricing (my field of work) one goal is to estimate the causal effect of a price (i.e. how much something costs to buy) on consumer demand. In this setting we may have an abundance of observational data, where the treatment was chosen via some unknown policy, while the ability to run A/B tests is limited.
So, how can we estimate the effect of an intervention, e.g. a certain price discount or price point?
This is an overview of methods, some taken from the EconML website.
- Structural Equation Modeling (SEM), introduction video
- Double Machine Learning (see e.g. Chernozhukov2016, Chernozhukov2017, Mackey2017, Nie2017, Chernozhukov2018, Foster2019)
- Causal Forests (see e.g. Wager2018, Athey2019, Oprescu2019)
- Deep Instrumental Variables (see e.g. Hartford2017)
- Non-parametric Instrumental Variables Newey2003
- Meta-learners (see e.g. Kunzel2017)
Terminology in causal inference
Observational study: an observational study draws inferences from a sample to a population where the independent variable is not under the control of the researcher, for example because of ethical or practical constraints. This is different from an experiment.
Experiment: a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried
Stable unit treatment value assumption (SUTVA): ensures that only two potential outcomes exist and that (exactly) one of them is observed for each individual. Requires:
- Consistency, i.e. treatment is well defined. For example, if the treatment is exercise, then how much and what kind of exercise counts as a treatment?
- No interference, i.e. the treatment of one individual has no effect on any other individual.
Average Treatment Effect (ATE): measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control.
Conditional Average Treatment Effect (CATE) (video): divide the study data into subgroups (e.g. men and women, or by state), and see if the average treatment effects differ by subgroup. If they do, the treatment effect is heterogeneous. A per-subgroup ATE is called a "conditional average treatment effect" (CATE), i.e. the ATE conditioned on membership in the subgroup.
Individual Treatment Effect (ITE): the estimated treatment effect on a specific individual
Heterogeneous treatment effects: treatment affects different individuals differently (heterogeneously)
Markov condition (or assumption): every node in a Bayesian network is conditionally independent of its nondescendants, given its parents.
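Under randomized treatment, both ATE and per-subgroup CATE can be estimated with a simple difference in means. A minimal sketch on synthetic data (all variable names, effect sizes, and the two-subgroup setup are illustrative):

```python
# Toy difference-in-means estimates of ATE and subgroup CATEs.
# The data-generating process is made up: the true effect is 1.0 in
# group 0 and 3.0 in group 1, so the overall ATE is roughly 2.0.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)   # subgroup indicator (e.g. segment A/B)
t = rng.integers(0, 2, n)       # randomized binary treatment
y = 2.0 + t * (1.0 + 2.0 * group) + rng.normal(0, 1, n)

# ATE: mean outcome of treated minus mean outcome of controls
ate = y[t == 1].mean() - y[t == 0].mean()
# CATE: the same difference in means, computed within each subgroup
cate_0 = y[(t == 1) & (group == 0)].mean() - y[(t == 0) & (group == 0)].mean()
cate_1 = y[(t == 1) & (group == 1)].mean() - y[(t == 0) & (group == 1)].mean()
print(f"ATE ≈ {ate:.2f}, CATE(group=0) ≈ {cate_0:.2f}, CATE(group=1) ≈ {cate_1:.2f}")
```

The differing subgroup CATEs are exactly what "heterogeneous treatment effects" means: the single ATE hides the fact that the treatment works much better in one subgroup.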
Meta-algorithms are methods for learning the CATE function t(x):
- S-learner (one step):
- learn single model that includes treatment as covariate
- t(x) = µ(x,1) - µ(x,0)
- T-learner (two steps):
- learn two models, µ0(x) for control group and µ1(x) for treatment group.
- t(x) = µ1(x) - µ0(x)
- X-learner (three steps):
- learn same two models as in T-learner
- estimate imputed individual treatment effects for the control group as D_i = µ1(X_i) - Y_i, i.e. the hypothetical outcome under treatment compared to the observed outcome without treatment, and for the treatment group as D_i = Y_i - µ0(X_i), i.e. the observed outcome under treatment compared to the hypothetical outcome without treatment. Build two models t0(x) and t1(x) using the imputed effects as response for the control and treatment group separately.
- t(x) = g(x)t0(x) + (1-g(x))t1(x), using weight g ∈ [0,1] over x, e.g. using estimated propensity score for x.
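The T- and X-learner recipes above can be sketched with scikit-learn. The data is synthetic, and gradient boosting as base learner is only an illustrative choice (any regressor works):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 4000
X = rng.uniform(-1, 1, (n, 2))
t = rng.integers(0, 2, n)            # randomized binary treatment
tau_true = 1.0 + X[:, 0]             # heterogeneous true effect (made up)
y = X[:, 1] + t * tau_true + rng.normal(0, 0.5, n)

# T-learner: one outcome model per arm, effect is the difference.
mu0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
mu1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
tau_t = mu1.predict(X) - mu0.predict(X)

# X-learner: impute individual effects with the cross models,
# fit effect models on the imputed effects, then blend by propensity.
d0 = mu1.predict(X[t == 0]) - y[t == 0]   # control group imputed effects
d1 = y[t == 1] - mu0.predict(X[t == 1])   # treatment group imputed effects
tau0 = GradientBoostingRegressor().fit(X[t == 0], d0)
tau1 = GradientBoostingRegressor().fit(X[t == 1], d1)
g = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]  # propensity g(x)
tau_x = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
```

With randomized treatment the propensity is roughly constant at 0.5, so the blend is close to a simple average; the X-learner's weighting pays off mainly when the two arms are very unbalanced.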
Three assumptions sufficient to identify the average causal effect are consistency, positivity, and exchangeability.
Consistency:
- an individual's potential outcome under her observed exposure history is precisely her observed outcome.
Positivity:
- nonzero (i.e. positive) probability of receiving every level of exposure for every combination of values of exposure and confounders that occur among individuals in the population
Exchangeability (i.e. no unmeasured confounders and no informative censoring):
- exposed and unexposed subjects, and censored and uncensored subjects have equal distributions of potential outcomes
Propensity score (video): the probability that an individual receives treatment, given their covariates. In a randomized trial with two equally sized arms this probability is 50% for everyone. The propensity score can be used e.g. for matching, and can be estimated with a multivariable logistic regression model.
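A sketch of propensity score estimation with logistic regression. The confounded assignment mechanism here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(0, 1, (n, 1))             # observed confounder
p_true = 1 / (1 + np.exp(-x[:, 0]))      # true treatment probability
t = rng.binomial(1, p_true)              # confounded treatment assignment

# Estimate the propensity score e(x) = P(T=1 | X=x)
model = LogisticRegression().fit(x, t)
e = model.predict_proba(x)[:, 1]
```

The estimated scores `e` can then be used for matching, stratification, or inverse-probability weighting; checking that they stay well away from 0 and 1 is one practical way to probe the positivity assumption above.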
Randomized experiments (video): TODO
Controlled experiments (video): TODO
Uplift modeling: predictive modelling technique that directly models the incremental impact of a treatment (such as a direct marketing action) on an individual's behaviour.
One in ten rule: for every 10 observations that experienced the outcome you can only adjust for one variable. In other words, one predictive variable can be studied for every ten events.
Non-Parametric Statistics (cute video)
Research & white papers
Double/Debiased Machine Learning for Treatment and Causal Parameters, Chernozhukov et al., 2016
Metalearners for estimating heterogeneous treatment effects using machine learning, Künzel et al, PNAS 2018
A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms, Yoshua Bengio et al., 2019.
semopy: A Python package for Structural Equation Modeling, Meshcheryakov Georgy, Igolkina Anna, 2019
CausalML: Python Package for Causal Machine Learning, Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, Zhenyu Zhao, 2020
Uplift modeling with multiple treatments and general response types, Y Zhao, X Fang, D Simchi-Levi, SIAM 2017
Representation learning for treatment effect estimation from observational data, Liuyi Yao et al., NeurIPS 2018
More papers in Uber and CausalML literature list.
- Mann-Whitney U Test: tests whether there is a difference between two independent groups
- Kruskal–Wallis test: tests whether samples originate from the same distribution
- Chi-Squared test (video): tests whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table
- P-values (video): a number between 0 and 1 giving the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis (e.g. no effect) is true. Values closer to 0 mean stronger evidence against the null.
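All three tests above are available in scipy.stats; a quick run on synthetic data (the samples and the contingency-table counts are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 200)   # control sample
b = rng.normal(0.5, 1.0, 200)   # sample shifted by half a standard deviation

# Mann-Whitney U: difference between two independent groups
u_stat, p_mw = stats.mannwhitneyu(a, b)
# Kruskal-Wallis: do the samples originate from the same distribution?
h_stat, p_kw = stats.kruskal(a, b)
# Chi-squared test on a 2x2 contingency table of observed counts
table = np.array([[30, 70], [50, 50]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

With a genuine shift between the groups, all three p-values come out small, i.e. strong evidence against the respective null hypotheses.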
- The Book Of Why: The New Science Of Cause And Effect, Judea Pearl & Dana MacKenzie, 2018.
- What If, Miguel Hernan & Jamie Robins, 2020.
- T-learners, S-learners and X-learners
- Introduction to Causality in Machine Learning, Alexandre Gonfalonieri, 2020.
- A Step-by-Step Guide in detecting causal relationships using Bayesian Structure Learning in Python, Erdogan Taskesen, 2021
- Using machine learning metrics to evaluate causal inference models, Ehud Karavani, 2020.
- The Consistency Statement in Causal Inference
- Double ML basics
Causal data science (Elias Bareinboim):
Double Machine Learning for Causal and Treatment Effects (Victor Chernozhukov). Points to revisit:
- Prediction-based ML approach is bad (~8:00)
- Double ML approach is good (~9:30)
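The partialling-out idea behind double ML can be sketched as: residualize both the outcome and the treatment on the covariates with cross-fitted ML models, then regress residual on residual. This is only an illustrative sketch with synthetic data and an assumed constant treatment effect (here 2.0); model choices are arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
n = 4000
X = rng.normal(0, 1, (n, 5))
# Treatment depends on X (confounding); true effect theta = 2.0
t = X[:, 0] + rng.normal(0, 1, n)
y = 2.0 * t + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 1, n)

# Cross-fitted nuisance estimates of E[y|X] and E[t|X]
# (cross-fitting avoids the overfitting bias of naive plug-in ML)
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), X, y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), X, t, cv=5)

# Residual-on-residual regression gives the debiased effect estimate
ry, rt = y - y_hat, t - t_hat
theta = (rt @ ry) / (rt @ rt)
```

A naive regression of `y` on `t` here would be biased by the confounding through `X`; the residual-on-residual step is what the "Double ML approach is good" point in the talk refers to.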
Estimating Identifiable Causal Effects through Double Machine Learning (Jin Tian)