Rastislav Rabatin, Broňa Brejová, Tomáš Vinař.
Using Sequence Ensembles for Seeding Alignments of MinION Sequencing Data.
In J. Hlaváčová, ed.,
Information Technologies - Applications and Theory (ITAT),
number 1885 in CEUR-WS,
pp. 48-54, Martinské hole, Slovakia, 2017.
In bioinformatics, sequence similarity search is often solved by seed-and-extend heuristics: we first locate short exact matches (hits) by hashing or other efficient indexing techniques and then extend these hits to longer sequence alignments. Such approaches are effective at finding very similar sequences, but they quickly lose sensitivity when trying to locate weaker similarities. In this paper, we develop seeding strategies for data from the MinION DNA sequencer. This recent technology produces sequencing reads with error rates of up to 30%. Since most of these errors are insertions or deletions, it is difficult to adapt seed-and-extend algorithms to this type of data. We propose to represent each read by an ensemble of sequences sampled from a probabilistic model, instead of a single sequence. Using this extended representation, we were able to design a seeding strategy with 99.9% sensitivity and a very low false positive rate. Our technique can be used to locate the part of the genome corresponding to a particular read, or even to find overlaps between pairs of reads.
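The seeding step described above can be illustrated with a minimal Python sketch. It is not the paper's method: the function names, the toy reference, and the value k = 5 are assumptions chosen for illustration, and the ensemble is written by hand rather than sampled from a probabilistic model. The sketch only shows how exact k-mer hits are found by hashing and why several alternative sequences for one read can recover hits that a single erroneous sequence misses.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def find_hits(index, query, k):
    """Return (query_pos, ref_pos) pairs for all exact k-mer matches."""
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            hits.append((j, i))
    return hits

def ensemble_hits(index, ensemble, k):
    """Reference positions hit by at least one sequence of the ensemble."""
    positions = set()
    for seq in ensemble:
        positions.update(ref_pos for _, ref_pos in find_hits(index, seq, k))
    return positions

# Toy data (hypothetical): the true read fragment is TTGCAAGGC.
K = 5
reference = "ACGTTGCAAGGCTTAC"
index = build_kmer_index(reference, K)

single_read = "TTGCTAGGC"  # one substitution error in the middle
# Hand-written stand-in for sequences sampled from a probabilistic model.
ensemble = ["TTGCTAGGC", "TTGCAAGGC", "TTGAAGGC"]

print(find_hits(index, single_read, K))            # no exact 5-mer survives the error
print(sorted(ensemble_hits(index, ensemble, K)))   # hits recovered via the ensemble
```

In this toy case the single erroneous sequence yields no seed at all, while the ensemble yields hits because one of its members happens to match the reference. A real strategy would also need to control the false positives that extra sequences introduce, which this sketch does not address.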