2-AIN-505, 2-AIN-251: Seminar in Bioinformatics (1), (3)
Winter 2022
Abstrakt

Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, Christina Boucher. MONI: A Pangenomic Index for Finding Maximal Exact Matches. Journal of computational biology, 29(2):169-187. 2022.

Download preprint: not available

Download from publisher: https://www.biorxiv.org/content/10.1101/2021.07.06.451246.full PubMed

Related web page: not available

Bibliography entry: BibTeX

Abstract:

Recently, Gagie et al. proposed a version of the FM-index, called the r-index,
that can store thousands of human genomes on a commodity computer. Then Kuhnle et
al. showed how to build the r-index efficiently via a technique called
prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern
matching. Exact pattern matching can be leveraged to support approximate pattern 
matching, but the r-index itself cannot support efficiently popular and important
queries such as finding maximal exact matches (MEMs). To address this
shortcoming, Bannai et al. introduced the concept of thresholds, and showed that 
storing them together with the r-index enables efficient MEM finding-but they did
not say how to find those thresholds. We present a novel algorithm that applies
PFP to build the r-index and find the thresholds simultaneously and in linear
time and space with respect to the size of the prefix-free parse. Our
implementation called MONI can rapidly find MEMs between reads and large-sequence
collections of highly repetitive sequences. Compared with other read
aligners-PuffAligner, Bowtie2, BWA-MEM, and CHIC- MONI used 2-11 times less
memory and was 2-32 times faster for index construction. Moreover, MONI was less 
than one thousandth the size of competing indexes for large collections of human 
chromosomes. Thus, MONI represents a major advance in our ability to perform MEM 
finding against very large collections of related references.