Speaker: Dominik Kempa
Affiliation: Johns Hopkins University
Title: How to store massive sequence collections in a searchable form
Compressed indexing is concerned with the design and construction of data structures to store massive sequences in space close to the size of compressed data, while simultaneously providing searching functionality (such as pattern matching) on the original uncompressed data. These indexes have a huge impact on the analysis of large-scale DNA databases (containing hundreds of thousands of genomes) but until recently the size of many indexes lacked theoretical guarantees and their construction remains a computational bottleneck.
In this talk I will describe my work proving theoretical guarantees on index size as a function of compressed data size, resolving a key open question in this field. Related to that, I will also describe new algorithms for converting between two conceptually distinct compressed representations, Lempel-Ziv and the Burrows-Wheeler Transform. I will show how these new findings enable advanced computation directly on compressed data. I will conclude by describing avenues for future work, including the new algorithms for the construction of compressed indexes that have the potential to cut indexing time by further orders of magnitude, unlocking the ability to search truly enormous text or DNA datasets.