Motivation
A core problem that arises in most data-driven personalized decision scenarios is the estimation of heterogeneous treatment effects: what is the effect of an intervention on an outcome of interest, as a function of a set of observable characteristics of the treated sample?
In personalized pricing (my field of work) the goal is to estimate the effect of a price discount on demand as a function of characteristics of the consumer. In this setting we may have an abundance of observational data, where the treatment was chosen via some unknown policy, while the ability to run A/B tests is limited.
So, how can we estimate the effect of an intervention, e.g. a certain price discount or price point?
Methods
This is an overview of methods, from the EconML website.
 Double Machine Learning (see e.g. Chernozhukov2016, Chernozhukov2017, Mackey2017, Nie2017, Chernozhukov2018, Foster2019)
 Causal Forests (see e.g. Wager2018, Athey2019, Oprescu2019)
 Deep Instrumental Variables (see e.g. Hartford2017)
 Non-parametric Instrumental Variables (see e.g. Newey2003)
 Meta-learners (see e.g. Kunzel2017)
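The core double machine learning recipe for the partially linear model can be sketched with plain scikit-learn: predict both the outcome and the treatment from the confounders with flexible ML models (cross-fitted), then regress the outcome residuals on the treatment residuals. The data-generating process below is made up purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                  # confounders
T = X[:, 0] + rng.normal(size=n)             # treatment depends on X (confounded)
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)   # true treatment effect = 2.0

# Stage 1: cross-fitted nuisance predictions E[Y|X] and E[T|X]
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, Y, cv=2)
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), X, T, cv=2)

# Stage 2: regress Y-residuals on T-residuals -> debiased effect estimate
y_res, t_res = Y - y_hat, T - t_hat
theta = (t_res @ y_res) / (t_res @ t_res)
print(theta)  # roughly 2.0, the confounding via X[:, 0] is partialled out
```

A naive regression of Y on T alone would be biased upward here, since X[:, 0] drives both treatment and outcome; the residual-on-residual step is what removes that bias.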
Frameworks
Terminology in causal inference

Observational study: draws inferences from a sample to a population where the independent variable is not under the control of the researcher. This is different from an experiment.

Experiment: a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried

Stable unit treatment value assumption (SUTVA): ensures that only two potential outcomes exist and that (exactly) one of them is observed for each individual. Requires:
 Consistency, i.e. treatment is well defined. For example, if the treatment is exercise, then how much and what kind of exercise counts as a treatment?
 No interference, i.e. the treatment of one individual has no effect on any other individual.

Average Treatment Effect (ATE): measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control.

Conditional Average Treatment Effect (CATE) (video): divide the study data into subgroups (e.g., men and women, or by state), and see if the average treatment effects differ by subgroup. A per-subgroup ATE is called a "conditional average treatment effect" (CATE), i.e. the ATE conditioned on membership in the subgroup. Differing CATEs indicate heterogeneous treatment effects, in which case the CATE is more informative than the overall ATE.
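The ATE and per-subgroup CATE definitions above amount to simple mean differences. A toy sketch with pandas (the data and column names are made up for illustration):

```python
import pandas as pd

# Toy randomized data: a subgroup column, a binary treatment, and an outcome
df = pd.DataFrame({
    "group":   ["men", "men", "men", "men", "women", "women", "women", "women"],
    "treated": [1, 0, 1, 0, 1, 0, 1, 0],
    "outcome": [5.0, 3.0, 6.0, 4.0, 9.0, 3.0, 8.0, 4.0],
})

# ATE: difference in mean outcomes, treated minus control
ate = df[df.treated == 1].outcome.mean() - df[df.treated == 0].outcome.mean()

# CATE: the same mean difference, computed within each subgroup
means = df.groupby(["group", "treated"]).outcome.mean().unstack()
cate = means[1] - means[0]

print(ate)   # 3.5
print(cate)  # men: 2.0, women: 5.0
```

Here the overall ATE of 3.5 hides a clear heterogeneity: the effect is 2.0 for men and 5.0 for women.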

Individual Treatment Effect (ITE): the estimated treatment effect on a specific individual

Heterogeneous treatment effects: treatment affects different individuals differently (heterogeneously)

Markov condition (or assumption): every node in a Bayesian network is conditionally independent of its non-descendants, given its parents.

Meta-algorithms are methods for learning the CATE function t(x):
 S-learner (one step):
 learn a single model µ(x, t) that includes the treatment as a covariate
 t(x) = µ(x,1) – µ(x,0)
 T-learner (two steps):
 learn two models, µ0(x) for the control group and µ1(x) for the treatment group
 t(x) = µ1(x) – µ0(x)
 X-learner (three steps):
 learn the same two models as in the T-learner
 estimate the ITE for the control group as D_i = µ1(X_i) – Y_i, i.e. the effect of not getting treatment compared to hypothetically getting it, and for the treatment group as D_i = Y_i – µ0(X_i), i.e. the effect of getting treatment compared to hypothetically not getting it. Build two models t0(x) and t1(x) using the imputed ITEs as response for the control and treatment group separately.
 t(x) = g(x)t0(x) + (1 – g(x))t1(x), using a weight g(x) ∈ [0,1], e.g. the estimated propensity score at x.
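The T- and X-learner steps above can be sketched with scikit-learn. The model choices and the synthetic data-generating process below are my own assumptions for illustration, not a prescribed implementation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))
T = rng.binomial(1, 0.5, size=n)       # randomized binary treatment
tau_true = 1.0 + X[:, 0]               # heterogeneous effect, depends on X[:, 0]
Y = X[:, 1] + tau_true * T + rng.normal(0, 0.5, n)

# T-learner: one outcome model per arm, CATE is the difference in predictions
mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])
tau_t = mu1.predict(X) - mu0.predict(X)

# X-learner: impute per-individual effects, model them per arm, blend by propensity
d0 = mu1.predict(X[T == 0]) - Y[T == 0]   # imputed ITEs, control group
d1 = Y[T == 1] - mu0.predict(X[T == 1])   # imputed ITEs, treatment group
t0 = GradientBoostingRegressor().fit(X[T == 0], d0)
t1 = GradientBoostingRegressor().fit(X[T == 1], d1)
g = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]  # propensity score g(x)
tau_x = g * t0.predict(X) + (1 - g) * t1.predict(X)
```

Both estimates should average close to the true mean effect of 1.0 and track the heterogeneity in X[:, 0]; the X-learner tends to help most when the treatment and control groups are very unbalanced.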
Three assumptions sufficient to identify the average causal effect are consistency, positivity, and exchangeability.

Consistency:
 an individual’s potential outcome under her observed exposure history is precisely her observed outcome.

Positivity:
 nonzero (i.e. positive) probability of receiving every level of exposure for every combination of confounder values that occurs among individuals in the population

Exchangeability (i.e. no unmeasured confounders and no informative censoring):
 exposed and unexposed subjects, and censored and uncensored subjects have equal distributions of potential outcomes

Propensity score (video): the probability that an individual will receive the treatment, given their covariates. In a randomized trial with two equally sized arms this probability is 50%. The propensity score can be used e.g. for matching, and can be estimated with a multivariate logistic regression model.
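Estimating a propensity score with logistic regression is a one-liner in scikit-learn. The synthetic data below is made up for illustration; the inverse-probability weights at the end are one common downstream use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 2))               # observed confounders
p_true = 1 / (1 + np.exp(-X[:, 0]))       # treatment more likely for high X[:, 0]
T = rng.binomial(1, p_true)

# Propensity score: estimated P(T=1 | X) via logistic regression
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# The scores can be used for matching, stratification, or inverse-probability weighting
w = np.where(T == 1, 1 / e, 1 / (1 - e))  # IPW weights
```

Note that IPW relies on positivity: if some estimated scores are very close to 0 or 1, the weights explode and estimates become unstable.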

Randomized experiments (video): TODO

Controlled experiments (video): TODO

Uplift modeling: predictive modelling technique that directly models the incremental impact of a treatment (such as a direct marketing action) on an individual’s behaviour.

One in ten rule: for every 10 events (observations that experienced the outcome) you can adjust for one variable. In other words, one predictive variable can be studied for every ten events.

Non-Parametric Statistics (cute video)
Research & white papers

Double/Debiased Machine Learning for Treatment and Causal Parameters, Chernozhukov et al., 2016

Metalearners for estimating heterogeneous treatment effects using machine learning, Künzel et al., PNAS 2019

A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms, Yoshua Bengio et al., 2019.

semopy: A Python package for Structural Equation Modeling, Meshcheryakov Georgy, Igolkina Anna, 2019

CausalML: Python Package for Causal Machine Learning, Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, Zhenyu Zhao, 2020

Uplift modeling with multiple treatments and general response types, Y. Zhao, X. Fang, D. Simchi-Levi, SIAM 2017

Representation learning for treatment effect estimation from observational data, Liuyi Yao et al., NeurIPS 2018
More papers in Uber and CausalML literature list.
Statistical tests
 Mann-Whitney U test: tests whether there is a difference between two independent groups
 Kruskal–Wallis test: tests whether samples originate from the same distribution
 Chi-squared test (video): tests whether there is a statistically significant difference between the expected and observed frequencies in one or more categories of a contingency table
 P-values (video): the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis (no effect) is true; a number between 0 and 1, where values closer to 0 mean stronger evidence against the null hypothesis
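All three tests are available in scipy.stats; a quick sketch on made-up data (group labels and the contingency-table numbers are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 200)   # e.g. control-group outcomes
b = rng.normal(0.8, 1.0, 200)   # e.g. treatment-group outcomes (shifted)

# Mann-Whitney U: difference between two independent groups
u_stat, p_u = stats.mannwhitneyu(a, b)

# Kruskal-Wallis: do the samples originate from the same distribution?
h_stat, p_kw = stats.kruskal(a, b)

# Chi-squared: observed vs expected frequencies in a contingency table,
# e.g. converted / not converted counts by treatment arm
table = np.array([[30, 70],
                  [55, 45]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

With the shifted samples and the skewed table above, all three p-values come out far below 0.05, so each test rejects its null hypothesis.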
Books
 The Book Of Why: The New Science Of Cause And Effect, Judea Pearl & Dana MacKenzie, 2018.
 Causal Inference: What If, Miguel Hernán & James Robins, 2020.
Web articles
 T-learners, S-learners and X-learners
 Introduction to Causality in Machine Learning, Alexandre Gonfalonieri, 2020.
 A Step-by-Step Guide in detecting causal relationships using Bayesian Structure Learning in Python, Erdogan Taskesen, 2021
 Using machine learning metrics to evaluate causal inference models, Ehud Karavani, 2020.
 The Consistency Statement in Causal Inference
 Double ML basics
Video lectures
Causal data science (Elias Bareinboim):
Double Machine Learning for Causal and Treatment Effects (Victor Chernozhukov). Points to revisit:
 Predictionbased ML approach is bad (~8:00)
 Double ML approach is good (~9:30)
Estimating Identifiable Causal Effects through Double Machine Learning (Jin Tian)