Methods for Identifying Variation in Large-Scale Genomic Data

The rise of next-generation sequencing has produced an abundance of data with almost limitless analysis applications. As sequencing technology decreases in cost and increases in throughput, the volume of available data is quickly outpacing improvements in processor speed, so analysis methods must scale accordingly to remain computationally tractable. At the same time, larger datasets and the availability of population-wide data offer a broader context with which to improve accuracy.

This thesis presents three tools that improve the scalability of sequencing data storage and analysis. First, a lossy compression method for RNA-seq alignments offers extreme size reduction without compromising the downstream accuracy of isoform assembly and quantification. Second, I describe a graph genome analysis tool that filters population variants to optimize aligner performance. Finally, I offer several methods for improving the accuracy of copy number variant (CNV) segmentation, including borrowing strength across samples to overcome the limitations of low coverage. Together, these methods form a practical toolkit for improving the scalability and accuracy of genomic analysis.

Speaker Biography

I am a Ph.D. candidate advised by Ben Langmead in the Department of Computer Science at Johns Hopkins University. My research focuses on developing scalable tools for genomic data analysis. I received a B.S. in Computer Science from Harvard University in 2013.