Second-generation DNA sequencing instruments are improving rapidly and can now sequence hundreds of billions of nucleotides, enough to cover the human genome hundreds of times over, in about a week for a few thousand dollars. Consequently, sequencing is now a common tool in the study of molecular biology, genetics, and human disease. But with these developments comes a problem: growth in per-sequencer throughput (currently about 4-fold per year) is drastically outpacing growth in computer speed. As the throughput gap widens, the crucial research bottlenecks are increasingly computational: computing, storage, labor, and power.
I will describe two methods and four open-source software tools (Bowtie, Bowtie 2, Crossbow, and Myrna) that tackle this throughput gap using novel algorithms and approaches from data-intensive computing. These tools build primarily on two insights. First, the Burrows-Wheeler Transform and the FM Index, previously used for data compression and exact string matching, can be extended to facilitate fast and memory-efficient alignment of DNA sequences to long reference genomes such as the human genome. Second, those methods can be combined with MapReduce and cloud computing to solve common comparative genomics problems in a manner that addresses “big data” desiderata, including scalability, fault tolerance, and economy.
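To make the first insight concrete, here is a minimal Python sketch of exact string matching with a Burrows-Wheeler Transform and an FM Index. It is an illustration only, not the implementation used in Bowtie or Bowtie 2: the function names (`bwt_and_suffix_array`, `build_fm_index`, `find_exact`) are hypothetical, the index is built by naive suffix sorting, and the tables are stored uncompressed, so it would not scale to a genome-sized reference. It also handles exact matches only and does not attempt the extensions needed for alignment.

```python
def bwt_and_suffix_array(text):
    """Return (BWT, suffix array) of text + '$' using naive suffix sorting."""
    text += "$"  # sentinel; sorts before the letters A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character for each sorted suffix is the one preceding it
    # (text[-1] is the sentinel, which precedes the suffix starting at 0).
    bwt = "".join(text[i - 1] for i in sa)
    return bwt, sa


def build_fm_index(bwt):
    """Build the two lookup tables used by backward search."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1

    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def find_exact(pattern, first, occ, sa):
    """Return sorted start offsets of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(pattern):  # extend the match one character leftward
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


if __name__ == "__main__":
    bwt, sa = bwt_and_suffix_array("GATTACAGATTACA")
    first, occ = build_fm_index(bwt)
    print(find_exact("ATTA", first, occ, sa))
```

Each pattern character costs two table lookups regardless of the reference length, which is what makes the approach attractive for long references. The memory efficiency mentioned above comes from compressed versions of these tables, which this sketch omits.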
Ben Langmead is a Research Associate at the Johns Hopkins Bloomberg School of Public Health, Department of Biostatistics. He is also completing his Ph.D. in Computer Science this semester at the University of Maryland, advised by Steven L. Salzberg. He received an M.Sc. in Computer Science in 2009 from the University of Maryland, advised by Steven L. Salzberg and Mihai Pop. Before graduate school, Ben was employed at Reservoir Labs, Inc., where he worked on developing compiler software and network intrusion detection software for parallel network processor architectures. Ben received a B.A. in Computer Science from Columbia University in 2003.
Ben’s research tackles problems at the intersection of Computer Science and Genomics. At Johns Hopkins, Ben collaborates with biostatisticians, biologists, and other computer scientists to develop methods for analyzing second-generation DNA sequencing data.