Beyond prediction: NLP for causal inference

Why do some misleading articles go viral? Does partisan speech affect how people behave? Many pressing questions require understanding the effects of language. These are causal questions: did an article’s writing style cause it to go viral, or would it have gone viral anyway? With text data from social media and news sites, we can build predictors using natural language processing (NLP) techniques, but these methods can confuse correlation with causation. In this talk, I discuss my recent work on NLP methods for making causal inferences from text. Text data present unique challenges for disentangling causal effects from non-causal correlations. I present approaches that address these challenges by extending black-box and probabilistic NLP methods. I outline the validity of these methods for causal inference and demonstrate their applications to online forum comments and consumer complaints. I conclude with my research vision for a data analysis pipeline that bridges causal thinking and machine learning to enable better decision-making and scientific understanding.

Speaker Biography

Dhanya Sridhar is a postdoctoral researcher in the Data Science Institute at Columbia University. She completed her PhD at the University of California, Santa Cruz. Her current research is at the intersection of machine learning and causal inference, focusing on applications to social science. Her thesis research focused on probabilistic models of relational data.