Johns Hopkins computer scientists won a Best Paper Award at the 2025 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, held October 12-15 in Philadelphia. Their winning paper focuses on streamlining taxonomic classification to accelerate progress in metagenomics, which is critical for studying biodiversity, monitoring ecosystems, and discovering new species.
Led by first author and third-year undergraduate computer science student Steven Tan, the work introduces Movi Color, a software tool that can efficiently and accurately index datasets containing tens of thousands of genomes. The project was co-authored by Professor Ben Langmead and former postdoctoral fellows Sina Majidian—now an assistant professor at Chalmers University of Technology—and Mohsen Zakeri, now a GPU software engineer at Roche.
In metagenomics, scientists study DNA sequences obtained from environmental and clinical samples such as soil, water, or the human gut microbiome. Fast and accurate taxonomic classification of these sequences is essential for identifying what organisms are present in these samples.
“We would like to match each DNA fragment to the DNA of, say, different bacteria species to see which may be present in the sample,” explains Zakeri. “Matching each fragment to each species one by one is computationally expensive, as there are millions of DNA fragments in each sample and thousands of bacteria species to match with.”
The conventional solution to this problem is to build a database, or index, comprising different target DNA. Researchers can then query the index with a DNA fragment to find the target species with the best match.
Traditional approaches to building such an index come with trade-offs: k-mer based indexes (which break DNA sequences into smaller, fixed-length substrings) are fast but less accurate, while full-text methods (which index the entire sequence to find exact matches) are more accurate but slower to query.
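To make the k-mer idea concrete, here is a minimal Python sketch of a toy k-mer index and a vote-based classifier. It is an illustration only, not how Movi Color or any particular tool works: the function names, the two tiny "genomes," and the majority-vote rule are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_index(genomes, k):
    """Map each k-mer to the set of genome labels that contain it."""
    index = defaultdict(set)
    for label, seq in genomes.items():
        for km in kmers(seq, k):
            index[km].add(label)
    return index


def classify(read, index, k):
    """Return the label with the most k-mer hits, or None if nothing matches."""
    votes = Counter()
    for km in kmers(read, k):
        # .get avoids creating empty entries for unseen k-mers
        for label in index.get(km, ()):
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data: real references are millions of bases long.
genomes = {"species_A": "ACGTACGTGGA", "species_B": "TTGACCAGTCA"}
index = build_kmer_index(genomes, k=4)
print(classify("CGTGG", index, k=4))
```

Each lookup is a fast dictionary access, which is why k-mer indexes are quick. The fixed length k is also the source of the accuracy trade-off: a read is judged only by k-length pieces, so longer shared stretches carry no extra weight.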

Steven Tan, left, with his award.
“We wanted to combine both the speed and accuracy of these approaches,” says Tan.
To achieve this, the researchers built upon a compressed-index data structure known as the “move structure.” They augmented their original tool, Movi, with “color” information, which roughly tracks which genomes contain similar DNA segments, and developed string-matching algorithms that collect evidence from exact matches to perform classification.
“The colors group together DNA sequences that are similar so that they can be stored more efficiently,” says Zakeri. “This color notion had been defined and used for k-mer based indexes before, but we found a way to define it for our full-text index such that it leads to more accurate classification decisions.”
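As a rough illustration of the color notion in the k-mer setting Zakeri mentions, the Python sketch below gives every distinct set of genomes a single integer ID, so that many k-mers sharing the same set store only a small number. This is not the definition used in Movi Color's full-text index, which the article does not spell out; the function name and the sample data are hypothetical.

```python
def assign_color_ids(kmer_to_genomes):
    """Give each distinct genome set one integer ID and point every k-mer at its ID."""
    color_table = []   # color ID -> frozenset of genome labels
    color_id_of = {}   # frozenset of genome labels -> color ID
    kmer_color = {}    # k-mer -> color ID
    for kmer, genomes in kmer_to_genomes.items():
        key = frozenset(genomes)
        if key not in color_id_of:
            # First time this exact genome set appears: store it once.
            color_id_of[key] = len(color_table)
            color_table.append(key)
        kmer_color[kmer] = color_id_of[key]
    return kmer_color, color_table


# Hypothetical toy input: which genomes contain each k-mer.
kmer_to_genomes = {
    "ACG": {"genome_1", "genome_2"},
    "CGT": {"genome_1", "genome_2"},
    "GTT": {"genome_1"},
    "TTA": {"genome_2", "genome_1"},
}
kmer_color, color_table = assign_color_ids(kmer_to_genomes)
print(len(kmer_to_genomes), "k-mers share", len(color_table), "colors")
```

The saving comes from deduplication: closely related genomes tend to share long runs of sequence, so the number of distinct colors is usually far smaller than the number of indexed positions.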
After exploring many other methods, iterations, and refinements, the researchers found that this approach yielded the best combination of accuracy, speed, and memory usage.
“It’s a challenge to balance scalability, speed, and accuracy when you’re building real-world genomic software,” explains Tan. “Even small design decisions in data structures or compression strategies can have large downstream effects on performance.”
In their paper, the researchers show through experiments on two bacteria datasets that Movi Color can be used to perform taxonomic classification much more accurately than popular k-mer based tools while remaining comparably fast.
While the current version of the tool has a large memory footprint compared to other indexes, the team plans to reduce Movi Color’s index size, either through minimizer digestion or by compressing the color information. Minimizer digestion compresses genomic sequences by selecting a subset of k-mers within a sliding window, reducing the amount of memory needed. The researchers are also considering distributing their software across multiple computers to scale the tool to even larger pangenomes.
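The sliding-window selection behind minimizers can be sketched in a few lines of Python. This is a simplified, assumption-laden version: it picks the lexicographically smallest k-mer in each window of w consecutive k-mers, whereas practical tools typically order k-mers by a hash, and nothing here reflects how the team would integrate the technique into Movi Color.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs, one smallest k-mer per window of w k-mers."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kms) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer wins; ties go to the leftmost position.
        best = min(window, key=lambda i: (kms[i], i))
        chosen.add((best, kms[best]))
    return sorted(chosen)


# Hypothetical toy sequence and parameters.
seq = "ACGTTGCA"
picked = minimizers(seq, k=3, w=3)
total = len(seq) - 3 + 1
print(f"kept {len(picked)} of {total} k-mers:", picked)
```

Because adjacent windows often agree on the same smallest k-mer, only a subset of k-mers is kept, yet every window is still represented by at least one of them.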
“Working on Movi Color showed us how theoretical algorithms can translate into tangible tools with real impact in genomics research,” says Tan, who began working on this project as a first-year undergrad and received an honorable mention for the Computing Research Association’s Outstanding Undergraduate Research Award earlier this year for related work.
This research was funded in part by the National Institutes of Health and the National Human Genome Research Institute.