Rastislav Rabatin, Broňa Brejová, Tomáš Vinař. Using Sequence Ensembles for Seeding Alignments of MinION Sequencing Data. In J. Hlaváčová, ed., Information Technologies - Applications and Theory (ITAT), number 1885 in CEUR-WS, pp. 48-54, Martinské hole, Slovakia, 2017.


Download from publisher: http://ceur-ws.org/Vol-1885/48.pdf




In bioinformatics, sequence similarity search is
often solved by seed-and-extend heuristics: we first locate
short exact matches (hits) by hashing or other efficient indexing
techniques and then extend these hits to longer sequence
alignments. Such approaches are effective at finding
very similar sequences, but they quickly lose sensitivity
when trying to locate weaker similarities.
In this paper, we develop seeding strategies for data
from the MinION DNA sequencer. This recent technology
produces sequencing reads which are prone to high error
rates of up to 30%. Since most of these errors are insertions
or deletions, it is difficult to adapt seed-and-extend
algorithms to this type of data. We propose to represent
each read by an ensemble of sequences sampled from a
probabilistic model, instead of a single sequence. Using
this extended representation, we were able to design a
seeding strategy with 99.9% sensitivity and a very low false
positive rate. Our technique can be used to locate the part
of the genome corresponding to a particular read, or even
to find overlaps between pairs of reads.
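
The seeding idea described above can be illustrated with a minimal Python sketch. It is not the method from the paper: the function names, the toy reference, the value of k, and the hand-written ensemble are all assumptions made for illustration. In particular, the ensemble here is fixed by hand rather than sampled from a probabilistic model, and the simulated errors are substitutions rather than insertions or deletions. The sketch only shows how hashing k-mers yields exact-match hits, and how taking hits from several candidate sequences for the same read can recover seeds that a single erroneous sequence misses.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_hits(index, sequence, k):
    """Return (query_pos, ref_pos) pairs where a k-mer matches exactly."""
    hits = []
    for j in range(len(sequence) - k + 1):
        for i in index.get(sequence[j:j + k], ()):
            hits.append((j, i))
    return hits


def ensemble_hits(index, ensemble, k):
    """Union of reference positions hit by any sequence in the ensemble."""
    positions = set()
    for seq in ensemble:
        positions.update(i for _, i in find_hits(index, seq, k))
    return positions


# Toy data (hypothetical): a short reference and a read with two errors.
reference = "ACGTACGGTCATTGCA"
k = 5
index = build_kmer_index(reference, k)

single = "ACGTTCGGTGATTGCA"  # one candidate sequence for the read
# Hand-written stand-in for an ensemble of alternative sequences for the read.
ensemble = [single, "ACGTACGGTGATTGCA", "ACGTTCGGTCATTGCA"]

single_positions = {i for _, i in find_hits(index, single, k)}
all_positions = ensemble_hits(index, ensemble, k)
print(len(single_positions), "reference positions hit by the single sequence")
print(len(all_positions), "reference positions hit by the ensemble")
```

In this toy case the single sequence yields hits only near the end of the reference, while the ensemble yields hits along its whole length. Taking the union of hits is only one possible way to combine the ensemble; a real strategy would also have to control the false positives that additional sequences introduce.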