Martina Višňovská. Alignments on Sequences with Internal Structure. PhD thesis, Comenius University in Bratislava, 2013. Supervised by Broňa Brejová.

Download preprint: 13visnovskath.pdf, 1733Kb

Download from publisher: http://alis.uniba.sk/storage/dpg/dostupne/FM/2014/2014-FM-50702

Abstract:

Search for sequence similarity in genomic databases is one of the essential problems
in bioinformatics. Genomic sequences evolve by local changes affecting one or several
adjacent symbols, as well as by large-scale rearrangements and duplications. In this
thesis we address two different problems, one connected to the local changes and the
other to the large-scale events. Both problems deal with genomic databases with
a rich internal structure consisting of repeating sequences.

First, considering only the local changes, we discuss the problem of distinguishing
random matches from biologically relevant similarities. The customary approach to this
task is to compute a statistical P-value for each match found between a query and
the searched database. Biological databases often contain redundant identical or
very similar sequences. This fact is not taken into account in P-value estimation,
resulting in pessimistic estimates. We propose to use a lower effective database
size instead of its real size and to estimate the effective size of a database by its
compressed size. We evaluate our approach on real and simulated databases.

Next, we concentrate on large-scale duplications and rearrangements, which lead to
mosaic sequences with varying degrees of similarity between regions within a single
genome or in genomes of related organisms. Our goal is to segment DNA into regions
and to assign such regions to classes so that regions within a single class are similar
and there is low or no similarity between regions of different classes. We provide
a formal definition of the segmentation problem, prove its NP-hardness, and give two
practical heuristic algorithms. We evaluate the algorithms on real and simulated
data. Segments found by our algorithms can be used as markers in a wide range of
evolutionary studies.
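
Illustrative sketch (not taken from the thesis): the first contribution replaces the real database size with a smaller effective size, estimated from the compressed size of the database, when computing P-values. The Python code below shows one way this idea could look. It rests on several assumptions: zlib stands in for whatever compressor the thesis uses, the conversion of compressed bytes to nucleotides at 2 bits per symbol is a simplification, the Karlin-Altschul form E = K * m * n * exp(-lambda * S) is assumed for the match statistics, and the default values of K and lambda are placeholders rather than fitted parameters. The function names are hypothetical.

```python
import math
import zlib


def effective_length(database: bytes) -> int:
    """Estimate the effective database length in nucleotides.

    Assumption: a nucleotide carries at most 2 bits, so each compressed
    byte is counted as 4 nucleotides. The result is capped at the real
    length, because compression overhead can push the estimate above it.
    """
    compressed_bytes = len(zlib.compress(database, 9))
    return min(len(database), compressed_bytes * 4)


def p_value(score: float, query_len: int, db_len: int,
            k: float = 0.1, lam: float = 1.0) -> float:
    """P-value of a match with the given score.

    Uses the Karlin-Altschul expected count E = K * m * n * exp(-lambda * S)
    and P = 1 - exp(-E). K and lambda are placeholder values; real ones
    depend on the scoring scheme and sequence composition.
    """
    expected = k * query_len * db_len * math.exp(-lam * score)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-expected)


def p_value_effective(score: float, query_len: int, database: bytes,
                      k: float = 0.1, lam: float = 1.0) -> float:
    """P-value computed with the compression-based effective length."""
    return p_value(score, query_len, effective_length(database), k, lam)


if __name__ == "__main__":
    import random

    random.seed(0)
    unit = "".join(random.choice("ACGT") for _ in range(2000))
    redundant_db = (unit * 50).encode("ascii")  # 50 identical copies

    print("real length:     ", len(redundant_db))
    print("effective length:", effective_length(redundant_db))
    print("P (real size):     ", p_value(20, 100, len(redundant_db)))
    print("P (effective size):", p_value_effective(20, 100, redundant_db))
```

For a database made of many copies of one sequence, the compressed size is far below the raw size, so the effective length shrinks and the P-value of a given score becomes less pessimistic. For a database without redundancy the estimate stays close to the real length.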