2-AIN-506, 2-AIN-252: Seminar in Bioinformatics (2), (4)
Summer 2024

Karel Brinda, Leandro Lima, Simone Pignotti, Natalia Quinones-Olvera, Kamil Salikhov, Rayan Chikhi, Gregory Kucherov, Zamin Iqbal, Michael Baym. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv, 2023.


Comprehensive collections approaching millions of sequenced genomes have become 
central information sources in the life sciences. However, the rapid growth of 
these collections makes it effectively impossible to search these data using 
tools such as BLAST and its successors. Here, we present a technique called 
phylogenetic compression, which uses evolutionary history to guide compression 
and efficiently search large collections of microbial genomes using existing 
algorithms and data structures. We show that, when applied to modern diverse 
collections approaching millions of genomes, lossless phylogenetic compression 
improves the compression ratios of assemblies, de Bruijn graphs, and k-mer 
indexes by one to two orders of magnitude. Additionally, we develop a pipeline 
for a BLAST-like search over these phylogeny-compressed reference data, and 
demonstrate it can align genes, plasmids, or entire sequencing experiments 
against all sequenced bacteria until 2019 on ordinary desktop computers within a 
few hours. Phylogenetic compression has broad applications in computational 
biology and may provide a fundamental design principle for future genomics