Learning predictive models of gene regulation

Christina Leslie, Columbia University

Studying the behavior of gene regulatory networks by learning from high-throughput genomic data has become one of the central problems in computational systems biology. Most work in this area has focused on learning structure from data – e.g. finding clusters or modules of potentially co-regulated genes, or building a graph of putative regulatory “edges” between genes – and has been successful at generating qualitative hypotheses about regulatory networks.

Instead of adopting the structure learning viewpoint, our focus is to build predictive models of gene regulation that allow us both to make accurate quantitative predictions on new or held-out experiments (test data) and to capture mechanistic information about transcriptional regulation. Our algorithm, called MEDUSA, integrates promoter sequence, mRNA expression, and transcription factor occupancy data to learn gene regulatory programs that predict the differential expression of target genes. Rather than using clustering or correlation of expression profiles to infer regulatory relationships, the algorithm learns to predict up/down expression of target genes by identifying condition-specific regulators and discovering regulatory motifs that may mediate their regulation of targets. We use boosting, a technique from statistical learning, to help avoid overfitting as the algorithm searches through the high-dimensional space of potential regulators and sequence motifs. We will report computational results on the yeast environmental stress response, where MEDUSA achieves high prediction accuracy on held-out experiments and retrieves key stress-related transcriptional regulators, signal transducers, and transcription factor binding sites. We will also describe recent results on the hypoxic response in yeast, where we used MEDUSA to propose the first global model of the oxygen sensing and regulatory network, including new putative context-specific regulators. Together with our experimental collaborators on this project, the Zhang Lab at Columbia University, we are in the process of validating our computational predictions with wet-lab experiments.
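
To make the boosting idea above concrete, here is a minimal sketch of discrete AdaBoost over simple rules of the form "motif is present in the gene's promoter AND regulator is in a given state in this experiment". It is not MEDUSA itself: the gene, motif, regulator, and experiment names, the toy labels, and all function names are hypothetical, and the real algorithm's motif discovery and weak learners are not reproduced here.

```python
import math

# Hypothetical toy data (not from the talk): motifs found in each gene's
# promoter, and the discretized state (+1 up, -1 down) of each candidate
# regulator in each experiment.
PROMOTER_MOTIFS = {
    "g1": {"motif_A"}, "g2": {"motif_A"},
    "g3": {"motif_B"}, "g4": {"motif_B"},
}
REGULATOR_STATE = {
    "e1": {"reg_X": +1, "reg_Y": -1},
    "e2": {"reg_X": -1, "reg_Y": +1},
    "heldout": {"reg_X": +1, "reg_Y": -1},  # never seen during training
}

# Training examples are (gene, experiment) pairs labeled +1 (up) or -1 (down).
EXAMPLES = [("g1", "e1"), ("g2", "e1"), ("g3", "e1"), ("g4", "e1"),
            ("g1", "e2"), ("g2", "e2"), ("g3", "e2"), ("g4", "e2")]
LABELS = [+1, +1, -1, -1, -1, -1, +1, +1]

# Candidate features: every (motif, regulator, regulator state) combination.
FEATURES = [(m, r, s)
            for m in ("motif_A", "motif_B")
            for r in ("reg_X", "reg_Y")
            for s in (+1, -1)]


def fires(feature, gene, experiment):
    """True if the motif is in the promoter and the regulator is in that state."""
    motif, regulator, state = feature
    return (motif in PROMOTER_MOTIFS[gene]
            and REGULATOR_STATE[experiment][regulator] == state)


def stump_predict(stump, gene, experiment):
    """A weak rule: predict `polarity` when the feature fires, else the opposite."""
    feature, polarity = stump
    return polarity if fires(feature, gene, experiment) else -polarity


def train_adaboost(examples, labels, features, rounds):
    """Discrete AdaBoost: returns a list of (alpha, stump) pairs."""
    n = len(examples)
    weights = [1.0 / n] * n
    stumps = [(f, p) for f in features for p in (+1, -1)]
    ensemble = []
    for _ in range(rounds):
        # Pick the weak rule with the lowest weighted training error.
        best_stump, best_err = None, float("inf")
        for stump in stumps:
            err = sum(w for w, (g, e), y in zip(weights, examples, labels)
                      if stump_predict(stump, g, e) != y)
            if err < best_err:
                best_stump, best_err = stump, err
        if best_err >= 0.5:  # no rule is better than chance; stop early
            break
        best_err = max(best_err, 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best_stump))
        # Up-weight misclassified examples, down-weight the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(best_stump, g, e))
                   for w, (g, e), y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, gene, experiment):
    """Sign of the weighted vote of all selected rules."""
    score = sum(alpha * stump_predict(stump, gene, experiment)
                for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    model = train_adaboost(EXAMPLES, LABELS, FEATURES, rounds=3)
    for alpha, (feature, polarity) in model:
        print(f"weight={alpha:.3f} rule={feature} polarity={polarity:+d}")
    for gene in ("g1", "g3"):
        print(gene, "heldout ->", predict(model, gene, "heldout"))
```

On this toy set, three rounds are enough to fit all eight training labels, and each selected rule names a motif and a regulator state, which is the kind of readable regulatory hypothesis the abstract describes. Capping the number of rounds is one simple way to limit model complexity in a sketch like this; in practice the number of rounds would be chosen by checking accuracy on held-out experiments.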

Speaker Biography

Dr. Leslie received her Ph.D. in Mathematics from Berkeley and held an NSERC Postdoctoral Fellowship in the Mathematics Department at Columbia University. She joined the Columbia Computer Science Department in Fall 2000 and moved to the Center for Computational Learning Systems, a new machine learning research center in the School of Engineering at Columbia, in Spring 2004. She is the principal investigator leading the Computational Biology Group (http://www.cs.columbia.edu/compbio) and a faculty member of the Center for Computational Biology and Bioinformatics (C2B2) at Columbia University. Her research lab focuses on the development of machine learning algorithms for studying biological processes at the molecular and systems levels.