Martina Višňovská. Alignments on Sequences with Internal Structure. PhD thesis, Comenius University in Bratislava, 2013. Supervised by Broňa Brejová.
Download preprint: 13visnovskath.pdf, 1733Kb
Download from publisher: http://alis.uniba.sk/storage/dpg/dostupne/FM/2014/2014-FM-50702
Related web page: not available
Bibliography entry: BibTeX
Abstract:
Searching for sequence similarity in genomic databases is one of the essential problems in bioinformatics. Genomic sequences evolve by local changes affecting one or several adjacent symbols, as well as by large-scale rearrangements and duplications. In this thesis we address two different problems, one connected to the local changes and the other to the large-scale events. Both problems deal with genomic databases with a rich internal structure consisting of repeating sequences. First, considering only the local changes, we discuss the problem of distinguishing random matches from biologically relevant similarities. The customary approach to this task is to compute the statistical P-value of each match found between a query and the searched database. Biological databases often contain redundant identical or very similar sequences. This fact is not taken into account in P-value estimation, resulting in pessimistic estimates. We propose to use a lower effective database size instead of the real size, and to estimate the effective size of a database by its compressed size. We evaluate our approach on real and simulated databases. Next, we concentrate on large-scale duplications and rearrangements, which lead to mosaic sequences with varying degrees of similarity between regions within a single genome or in genomes of related organisms. Our goal is to segment DNA into regions and to assign these regions to classes so that regions within a single class are similar and there is little or no similarity between regions of different classes. We provide a formal definition of the segmentation problem, prove its NP-hardness, and give two practical heuristic algorithms. We evaluate the algorithms on real and simulated data. Segments found by our algorithm can be used as markers in a wide range of evolutionary studies.
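The idea of replacing the real database size with a compression-based effective size can be illustrated with a small sketch. This is not the thesis's estimator: the choice of zlib as the compressor, the normalization against a random sequence of the same length and alphabet, and the function names `compressed_size`, `effective_size` and `expected_random_matches` are all assumptions made for illustration. The last function uses the standard Karlin-Altschul form E = K·m·n·exp(−λS); the default values of `K` and `lam` are placeholders, not fitted parameters.

```python
import math
import random
import zlib


def compressed_size(sequence: str) -> int:
    """Size in bytes of the zlib-compressed sequence (assumed compressor)."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


def effective_size(database: str, seed: int = 0) -> float:
    """Hypothetical estimator: scale the real size by how much better the
    database compresses than a random sequence of equal length and alphabet."""
    rng = random.Random(seed)
    alphabet = sorted(set(database))
    baseline = "".join(rng.choice(alphabet) for _ in range(len(database)))
    ratio = compressed_size(database) / compressed_size(baseline)
    # Never report an effective size larger than the real one.
    return min(float(len(database)), ratio * len(database))


def expected_random_matches(query_len: int, db_size: float, score: float,
                            K: float = 0.1, lam: float = 1.0) -> float:
    """Karlin-Altschul E-value; K and lam are placeholder values."""
    return K * query_len * db_size * math.exp(-lam * score)


if __name__ == "__main__":
    rng = random.Random(1)
    unit = "".join(rng.choice("ACGT") for _ in range(2000))
    redundant_db = unit * 4  # the same sequence stored four times
    n_real = len(redundant_db)
    n_eff = effective_size(redundant_db)
    print("real size:", n_real, "effective size:", round(n_eff))
    print("E-value, real size:     ", expected_random_matches(100, n_real, 20))
    print("E-value, effective size:", expected_random_matches(100, n_eff, 20))
```

In this toy setting the redundant database compresses far better than a random sequence of the same length, so the effective size falls well below the real size and the resulting E-value is proportionally less pessimistic. A general-purpose compressor with a limited window only detects nearby repeats, so a realistic estimator for large databases would need a compressor suited to long-range redundancy.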