Summary: | Randomized controlled trials (RCTs) have become increasingly popular in both marketing practice and academia. However, RCTs are not always feasible for advertising measurement, necessitating the use of observational methods. We present the first large-scale exploration of two observational methods, double/debiased machine learning (DML) and stratified propensity score matching (SPSM). Specifically, we analyze 663 large-scale experiments at Facebook, each of which is described using over 5,000 user- and experiment-level features. Although DML performs better than SPSM, neither method performs well, even though both use deep learning models to estimate the propensity scores and outcome models. The median absolute difference in lift between the observational and RCT estimates is 115, 107, and 62 percentage points for upper, mid, and lower funnel outcomes, respectively. These are large measurement errors, given that the median RCT lifts are 28%, 19%, and 6% for the same funnel outcomes. We further leverage our large sample of experiments to characterize the circumstances under which each method performs comparatively better. Broadly speaking, however, our results suggest that state-of-the-art observational methods are unable to recover the causal effect of online advertising at Facebook. We conclude that observational methods for estimating ad effectiveness may not work until advertising platforms log auction-specific features for modeling.
|