Refreshments are available starting at 10:15 a.m. The seminar will begin at 10:30 a.m.
Abstract
Modern genomics produces massive, high-dimensional datasets, yet extracting reliable biological insight remains challenging. A central challenge—and opportunity—in genomics is that we do not directly receive a sanitized data matrix: Our machine learning pipeline starts upstream, with what samples we choose to collect, how we measure them, and how we transform raw signals before inference. These stages are typically handled in isolation, quietly introducing bias, discarding information, and limiting discovery potential. Tavor Baharav’s research addresses this by treating genomics as an end-to-end system, developing rigorous machine learning methods that account for—and leverage—these upstream choices.
In this talk, Baharav will illustrate this approach through his work on reference-free genomic analysis. Alignment of reads to a reference genome, though ubiquitous, fundamentally limits discovery of novel biology that deviates from the reference. To overcome this, Baharav’s team developed SPLASH, a statistical tool that compares raw sequencing reads directly across conditions. SPLASH rediscovers strain-defining mutations in SARS-CoV-2 and identifies previously unannotated tissue-specific transcripts in the octopus genome, enabling discovery without any reference or annotation. Bypassing alignment reshapes the statistical problem: To identify genomic features of interest, his team developed a new statistical test for contingency tables. Aggregating information across the resulting data matrices raised broader methodological and theoretical questions about data integration, leading the team to develop a random matrix theory framework for detecting shared structure across datasets. Together, these results show how rethinking upstream pipeline choices can simultaneously improve biological discovery and yield generalizable statistical insights.
Speaker Biography
Tavor Baharav is a postdoctoral fellow at the Eric and Wendy Schmidt Center at the Broad Institute, working with Rafael Irizarry. His research co-designs the machine learning pipeline for computational genomics, jointly optimizing upstream processing stages with downstream inference. Baharav is broadly interested in high-dimensional statistics, adaptive algorithms and statistical machine learning, as well as their application to problems in computational genomics.
Prior to his postdoctoral work, he earned his PhD in electrical engineering from Stanford University in 2023 under the guidance of David Tse and Julia Salzman, with funding from the NSF Graduate Research Fellowship and the Stanford Graduate Fellowship. Baharav’s research at the intersection of machine learning and genomics has been published in venues ranging from the Conference on Neural Information Processing Systems and the Journal of Machine Learning Research to the Research in Computational Molecular Biology conference and Cell.