Vladimír Boža. Algorithms for high-throughput sequencing data. PhD thesis, Comenius University in Bratislava, 2017. Supervised by Tomáš Vinař.

Download preprint: 17bozath.pdf, 1308Kb

Download from publisher: http://alis.uniba.sk/storage/dpg/dostupne/FM/2017/2017-FM-51063/

Related web page: not available

Bibliography entry: BibTeX

Abstract:

In this thesis, we study several problems related to DNA
sequence assembly.  Our main focus is on handling different
combinations of read types, including short reads with short and
long inserts, and long reads.  We propose the genome assembly by
maximum likelihood (GAML) framework, which handles a variety of
sequencing data in a systematic way by using probabilistic
models.  In particular, GAML optimizes the assembly likelihood score,
which has previously been shown to be strongly correlated with
assembly quality.  During the development of GAML, we have
encountered several interesting problems concerning indexing of
sequencing reads.  We have developed a new data structure, the
CR-index, an index for a collection of short reads that exploits
the property that reads usually originate from a common
superstring.  We also propose the MH-index, an index for long DNA
strings based on the idea of minimizers.  Oxford Nanopore
MinION is a technology producing long reads, which are important
for improving sequence assembly.  By using recurrent neural
networks, we have developed tools for improving base
calling (translation of the raw electrical signal from the
sequencer to DNA bases).  We have also improved approaches
for comparing raw signals from the sequencer to the reference
sequence.
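
As a rough illustration of the general idea of minimizers mentioned above, the following Python sketch selects, from every window of w consecutive k-mers, the smallest k-mer. It is not the MH-index from the thesis. The function name `minimizers`, the parameters `k` and `w`, and the use of plain lexicographic order (with the leftmost k-mer winning ties) are assumptions made for this example only.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Illustrative assumptions: k-mers are compared lexicographically and
    ties are broken by the leftmost position in the window.
    """
    # All k-mers of the sequence, indexed by their start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Pick the smallest k-mer in the window (leftmost on ties).
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


if __name__ == "__main__":
    for position, kmer in minimizers("ACGTACGT", k=3, w=3):
        print(position, kmer)
```

Because neighbouring windows often share their smallest k-mer, the selected set is typically much smaller than the set of all k-mers, which is what makes minimizers attractive as a basis for indexing long strings.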

Keywords: sequence assembly, string indexing, minimizers,
recurrent neural networks