Vladimír Boža. Algorithms for high-throughput sequencing data. PhD thesis, Comenius University in Bratislava, 2017. Supervised by Tomáš Vinař.
Download preprint: 17bozath.pdf, 1308 KB
Download from publisher: http://alis.uniba.sk/storage/dpg/dostupne/FM/2017/2017-FM-51063/
In this thesis, we study several problems related to DNA sequence assembly. Our main focus is on handling different combinations of read types, including short reads with short and long inserts, and long reads. We propose the genome assembly by maximum likelihood (GAML) framework, which handles a variety of sequencing data in a systematic way by using probabilistic models. In particular, GAML optimizes the assembly likelihood score, which has previously been shown to be strongly correlated with assembly quality.
During the development of GAML, we encountered several interesting problems concerning the indexing of sequencing reads. We developed the CR-index, a new data structure for indexing a collection of short reads that exploits the fact that reads usually originate from a common superstring. We also propose the MH-index, an index for long DNA strings based on the idea of minimizers.
Oxford Nanopore MinION is a sequencing technology that produces long reads, which are important for improving sequence assembly. Using recurrent neural networks, we developed tools for improving base calling (the translation of the raw electrical signal from the sequencer into DNA bases). We also improved approaches for comparing raw signals from the sequencer to a reference sequence.
Keywords: sequence assembly, string indexing, minimizers, recurrent neural networks