Algorithms for genome assembly and disease analytics

Michael Schatz, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory

Computational biology is emerging as one of the exemplar data sciences, with abundant data, complex interactions, and the need for scalable algorithms and statistics. During my presentation, I will describe my research on two major problems. The first is de novo genome assembly, in which the genome of an organism must be computationally reconstructed from millions or billions of short DNA sequences. An emerging assembly strategy is to use PacBio single molecule sequencing to overcome the limitations seen with Illumina and other older technologies. We and others have developed new assembly algorithms that use the long reads (currently averaging over 8,500 bp) to achieve near-perfect assemblies of many microbes and small eukaryotes, and greatly improved assemblies of several significant plant and animal species. Even though the raw sequence data have high error rates (>10%) and a non-uniform error model, the accuracy of the assembled sequences approaches 100%. I’ll summarize the field with a support vector regression based model that can predict the outcome of a genome assembly project today and into the future as read lengths and available coverage improve.

The second major problem I’ll discuss is disease analytics: how we can identify disease-relevant genetic mutations in a population of healthy and affected individuals. Specifically, I will describe my lab’s work examining the genetic components of autism spectrum disorders (ASD) using our new variation detection algorithm, Scalpel. Scalpel uses a hybrid approach of read mapping and de novo assembly to accurately discover insertion/deletion (indel) mutations up to 100 bp long. In a battery of >10,000 simulated and >1,000 experimentally validated indel mutations, Scalpel is significantly more accurate than the other leading algorithms, GATK and SOAPindel. Using Scalpel, we have analyzed the exomes of >800 families (>3,200 individuals) in which one child in each family is affected with ASD, and we see a strong enrichment of “gene killing” de novo mutations associated with the disorder. Finally, I’ll conclude with a brief description of our work using single cell sequencing to study genetic heterogeneity in cancer.
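
As a rough illustration of why assembled sequences can approach 100% accuracy even when each raw read has an error rate above 10%, the sketch below computes how often a simple majority vote over overlapping reads recovers the correct base. This is not the consensus method used by any particular assembler. It assumes errors are independent across reads and uniform across positions, which the abstract notes is not true of the real data, and the function name consensus_accuracy and its parameters are hypothetical.

```python
from math import comb


def consensus_accuracy(coverage: int, error_rate: float) -> float:
    """Probability that strictly more than half of the reads covering
    one position are correct, so that a majority vote is guaranteed
    to pick the right base.

    Simplifying assumption (not from the abstract): each read is wrong
    at this position independently with probability `error_rate`.
    """
    p_correct = 1.0 - error_rate
    needed = coverage // 2 + 1  # smallest strict majority
    return sum(
        comb(coverage, k) * p_correct**k * error_rate ** (coverage - k)
        for k in range(needed, coverage + 1)
    )


if __name__ == "__main__":
    # Compare a single read with increasingly deep coverage at 15% error.
    for cov in (1, 5, 15, 51):
        print(cov, consensus_accuracy(cov, 0.15))
```

Under these assumptions the per-position accuracy climbs quickly with coverage, because the chance that most reads are wrong at the same position shrinks exponentially as more reads are stacked up.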

Speaker Biography

Michael Schatz is an assistant professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. His research interests include developing large-scale sequence analysis methods for sequence alignment, de novo assembly, variation detection, and related analyses. Schatz received his Ph.D. in Computer Science from the University of Maryland in 2010 and his B.S. in Computer Science from Carnegie Mellon University in 2000, with four years at the Institute for Genomic Research in between. For more information, see http://schatzlab.cshl.edu.