Causal stuff


A core problem that arises in most data-driven personalized decision scenarios is the estimation of heterogeneous treatment effects: what is the effect of an intervention on an outcome of interest as a function of a set of observable characteristics of the treated sample?

In personalized pricing (my field of work) the goal is to estimate the effect of a price discount on demand as a function of characteristics of the consumer. In this setting we may have an abundance of observational data, where the treatment was chosen via some unknown policy, while the ability to run A/B tests is limited.

So, how can we estimate the effect of an intervention, e.g. a certain price discount or price point?


This is an overview of methods, from the EconML website.


Terminology in causal inference

  • Observational study: draws inferences from a sample to a population where the independent variable is not under the control of the researcher, e.g. because of ethical or practical constraints. This is in contrast to an experiment

  • Experiment: a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried

  • Stable unit treatment value assumption (SUTVA): ensures that, for a binary treatment, only two potential outcomes exist and that exactly one of them is observed for each individual. Requires:

    • Consistency, i.e. treatment is well defined. For example, if the treatment is exercise, then how much and what kind of exercise counts as a treatment?
    • No interference, i.e. the treatment of one individual has no effect on any other individual.
  • Average Treatment Effect (ATE): measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control.

  • Conditional Average Treatment Effect (CATE) (video): divide the study data into subgroups (e.g., men and women, or by state), and see if the average treatment effects differ by subgroup. If they do, the treatment effect is heterogeneous. A per-subgroup ATE is called a "conditional average treatment effect" (CATE), i.e. the ATE conditioned on membership in the subgroup.
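A per-subgroup ATE can be computed directly with pandas. A toy sketch on simulated data (the two-group setup and effect sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "treated": rng.integers(0, 2, size=n),   # randomized binary treatment
})
# Simulated outcome: the true effect is 2.0 in group A and 0.5 in group B
true_effect = np.where(df["group"] == "A", 2.0, 0.5)
df["y"] = true_effect * df["treated"] + rng.normal(0, 1, size=n)

# Mean outcome per (group, treatment) cell; CATE = treated mean - control mean
means = df.groupby(["group", "treated"])["y"].mean().unstack("treated")
cate = means[1] - means[0]
print(cate)  # per-subgroup ATE, i.e. the CATE conditioned on group
```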

  • Individual Treatment Effect (ITE): the treatment effect for a specific individual

  • Heterogenous treatment effects: treatment affects different individuals differently (heterogeneously)

  • Markov condition (or assumption): every node in a Bayesian network is conditionally independent of its nondescendants, given its parents.

  • Meta-algorithms are methods for learning the CATE function t(x):

    • S-learner (one step):
      1. learn a single model µ(x, t) that includes the treatment t as a covariate
      2. t(x) = µ(x,1) – µ(x,0)
    • T-learner (two steps):
      1. learn two models, µ0(x) for control group and µ1(x) for treatment group.
      2. t(x) = µ1(x) – µ0(x)
    • X-learner (three steps):
      1. learn same two models as in T-learner
      2. estimate the ITE for the control group as D_i = µ1(X_i) – Y_i, i.e. the effect of hypothetically getting treatment compared to the observed no-treatment outcome, and for the treatment group as D_i = Y_i – µ0(X_i), i.e. the effect of getting treatment compared to hypothetically not getting it. Build two models t0(x) and t1(x) using these imputed ITEs as the response for the control and treatment group separately.
      3. t(x) = g(x)t0(x) + (1-g(x))t1(x), using a weight function g(x) ∈ [0,1], e.g. the estimated propensity score.
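The three meta-learners above can be sketched in a few lines with scikit-learn. This is a minimal illustration on simulated data with a randomized treatment; the data-generating process and choice of random forests as base learners are assumptions, not a prescription:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 2))
T = rng.integers(0, 2, size=n)                  # randomized binary treatment
tau = 1.0 + X[:, 0]                             # true CATE: 1 + x0
y = X[:, 1] + tau * T + rng.normal(0, 0.5, n)   # observed outcome

# S-learner: one model mu(x, t) with the treatment as a covariate
s = RandomForestRegressor(n_estimators=100, random_state=0)
s.fit(np.column_stack([X, T]), y)
def s_cate(Xq):
    return (s.predict(np.column_stack([Xq, np.ones(len(Xq))]))
            - s.predict(np.column_stack([Xq, np.zeros(len(Xq))])))

# T-learner: separate models mu0, mu1 for control and treatment group
mu0 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[T == 0], y[T == 0])
mu1 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[T == 1], y[T == 1])
def t_cate(Xq):
    return mu1.predict(Xq) - mu0.predict(Xq)

# X-learner: imputed ITEs per group, blended with the propensity score g(x)
d0 = mu1.predict(X[T == 0]) - y[T == 0]         # imputed ITE for controls
d1 = y[T == 1] - mu0.predict(X[T == 1])         # imputed ITE for treated
t0 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[T == 0], d0)
t1 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[T == 1], d1)
g = LogisticRegression().fit(X, T)              # propensity model
def x_cate(Xq):
    p = g.predict_proba(Xq)[:, 1]
    return p * t0.predict(Xq) + (1 - p) * t1.predict(Xq)
```

For real use, EconML and CausalML ship tuned implementations of all three learners; the point here is only how few moving parts each recipe has.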

Three assumptions sufficient to identify the average causal effect are consistency, positivity, and exchangeability.

  • Consistency:

    • an individual’s potential outcome under her observed exposure history is precisely her observed outcome.
  • Positivity:

    • nonzero (i.e. positive) probability of receiving every level of exposure for every combination of values of the confounders that occurs among individuals in the population
  • Exchangeability (i.e. no unmeasured confounders and no informative censoring):

    • exposed and unexposed subjects, and censored and uncensored subjects have equal distributions of potential outcomes
  • Propensity score (video): the probability that an individual receives the treatment given their covariates. In a randomized trial with two equally sized arms this probability is 50%. It can be used e.g. for matching or inverse-probability weighting, and is typically estimated with a multivariable logistic regression model.
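A short sketch of estimating propensity scores with logistic regression on simulated observational data (the confounders, coefficients, and assignment policy are made-up assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 3))                     # observed confounders
# Observational data: treatment probability depends on the confounders
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, p_true)

# Propensity score e(x) = P(T=1 | X=x), estimated by logistic regression
model = LogisticRegression().fit(X, T)
e_hat = model.predict_proba(X)[:, 1]
# e_hat can now be used for matching or inverse-probability weighting
```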

  • Randomized experiments (video): TODO

  • Controlled experiments (video): TODO

  • Uplift modeling: predictive modelling technique that directly models the incremental impact of a treatment (such as a direct marketing action) on an individual’s behaviour.

  • One in ten rule: for every ten events (observations that experienced the outcome) you can adjust for only one variable. In other words, one predictor variable can be studied for every ten events.

  • Non-Parametric Statistics (cute video)

Research & white papers

More papers in Uber and CausalML literature list.

Statistical tests

  • Mann-Whitney U Test: tests whether there is a difference between two independent groups
  • Kruskal–Wallis test: tests whether samples originate from the same distribution
  • Chi-Squared test (video): tests whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table
  • P-values (video): the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one measured. A number between 0 and 1; smaller values indicate stronger evidence against the null hypothesis.
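All three tests above are available in scipy.stats. A small example on simulated data (the group sizes, effect size, and contingency table are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(0.0, 1.0, size=200)
treated = rng.normal(0.5, 1.0, size=200)   # shifted by a hypothetical effect

# Mann-Whitney U: is there a difference between two independent groups?
_, p_mw = stats.mannwhitneyu(control, treated, alternative="two-sided")

# Kruskal-Wallis: do the samples originate from the same distribution?
_, p_kw = stats.kruskal(control, treated)

# Chi-squared: observed vs expected frequencies in a contingency table,
# e.g. conversions per experiment arm (counts are made up)
table = np.array([[30, 70],    # arm 1: 30 converted, 70 did not
                  [50, 50]])   # arm 2: 50 converted, 50 did not
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(p_mw, p_kw, p_chi)  # small p-values: evidence against the null
```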


Web articles

Video lectures

Causal data science (Elias Bareinboim):

Double Machine Learning for Causal and Treatment Effects (Victor Chernozhukov). Points to revisit:

Estimating Identifiable Causal Effects through Double Machine Learning (Jin Tian)