2-AIN-506 a 2-AIN-252: Seminár z bioinformatiky (2) a (4)
Leto 2018

Guillaume Holley, Roland Wittler, Jens Stoye. Bloom Filter Trie: an alignment-free and reference-free data structure forpan-genome storage. Algorithms Mol Biol, 11:3. 2016.

Download preprint: not available

Download from publisher: not available PubMed

Related web page: not available

Bibliography entry: BibTeX


BACKGROUND: High throughput sequencing technologies have become fast and cheap in
the past years. As a result, large-scale projects started to sequence tens to
several thousands of genomes per species, producing a high number of sequences
sampled from each genome. Such a highly redundant collection of very similar
sequences is called a pan-genome. It can be transformed into a set of sequences
\"colored\" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) 
extracts from the sequences all colored k-mers, strings of length k, and stores
them in vertices. RESULTS: In this paper, we present an alignment-free,
reference-free and incremental data structure for storing a pan-genome as a
C-DBG: the bloom filter trie (BFT). The data structure allows to store and
compress a set of colored k-mers, and also to efficiently traverse the graph.
Bloom filter trie was used to index and query different pangenome datasets.
Compared to another state-of-the-art data structure, BFT was up to two times
faster to build while using about the same amount of main memory. For querying
k-mers, BFT was about 52-66 times faster while using about 5.5-14.3 times less
memory. CONCLUSION: We present a novel succinct data structure called the Bloom
Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie
stores k-mers and their colors based on a new representation of vertices that
compress and index shared substrings. Vertices use basic data structures for
lightweight substrings storage as well as Bloom filters for efficient trie and
graph traversals. Experimental results prove better performance compared to
another state-of-the-art data structure. AVAILABILITY: