Graphical Models for Missing Data: Recoverability, Testability and Recent Surprises!

The bulk of the literature on missing data employs procedures that are data-centric rather than process-centric and relies on strong assumptions that are largely untestable (e.g., Missing At Random; Rubin, 1976). As a result, this area of research lacks tools for encoding assumptions about the underlying data-generating process, methods for testing these assumptions, and procedures for deciding whether queries of interest are estimable and, if so, for computing their estimands.

We address these deficiencies by using a graphical representation called a “Missingness Graph”, which portrays the causal mechanisms responsible for missingness. Using this representation, we define the notion of recoverability, i.e., deciding whether there exists a consistent estimator for a given query. We identify graphical conditions for recovering joint and conditional distributions and present algorithms for detecting these conditions in the missingness graph. Our results apply to missing data problems in all three categories: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR), the last of which is relatively unexplored. We further address the question of testability, i.e., whether an assumed model can be subjected to statistical tests despite the missingness in the data.
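As a rough illustration of what recoverability means in the simplest case, the sketch below simulates a fully observed binary variable X and a partially observed binary variable Y, where the missingness of Y depends only on X. Under that assumption, P(Y) can be estimated consistently as the sum over x of P(Y | X = x, Y observed) P(X = x), whereas the mean over complete cases is biased. The variable names, probabilities, and sample size are all hypothetical choices made for this example; it is not the algorithm presented in this work, only a minimal instance of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # hypothetical sample size

# X is fully observed; Y depends on X.
x = rng.integers(0, 2, size=n)
y = (rng.random(n) < np.where(x == 1, 0.8, 0.2)).astype(int)

# Assumed mechanism: missingness of Y depends on X only
# (Y is not a cause of its own missingness in this sketch).
p_missing = np.where(x == 1, 0.7, 0.1)
observed = rng.random(n) >= p_missing

# Ground truth, available only because this is a simulation.
true_mean = y.mean()

# Complete-case estimate: ignores the missingness mechanism.
naive = y[observed].mean()

# Adjusted estimate: sum_x P(Y=1 | X=x, Y observed) * P(X=x).
recovered = sum(
    y[observed & (x == v)].mean() * (x == v).mean() for v in (0, 1)
)

print(f"true={true_mean:.3f} naive={naive:.3f} recovered={recovered:.3f}")
```

The adjusted estimate uses only quantities computable from the incomplete data: the distribution of X and the distribution of Y among cases where Y is observed. If Y also influenced its own missingness, this particular adjustment would no longer be justified.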

Furthermore, viewing the missing data problem from a causal perspective has yielded several surprises. These include recoverability when variables are causes of their own missingness, testability of the MAR assumption, alternatives to iterative procedures such as the EM algorithm, and the indispensability of causal assumptions for large sets of missing data problems.