@ Comenius University in Bratislava

Thu 16 Sep. 2010, 15:00

I-9

**Title:** Odhad efektívnej veľkosti DNA databáz pomocou kompresie

**Speaker:** Martina Višňovská

**Note:** Článok prijatý na konferenciu ITAT 2010

Search for sequence similarity in large-scale databases of DNA and protein sequences is one of the essential problems in bioinformatics. To distinguish random matches from biologically relevant similarities, it is customary to compute statistical P-value of each discovered match. In this context, P-value is the probability that a similarity with a given score or higher would appear by chance in a comparison of a random query and a random database. Note that P-value is a function of the database size, since a high-scoring similarity is more likely to exist by chance in a larger database. Biological databases often contain redundant, identical, or very similar sequences. This fact is not taken into account in P-value estimation, resulting in pessimistic estimates. One way to address this problem is to use a lower effective database size instead of its real size. In this work, we propose to estimate the effective size of a database by its compressed size. An appropriate compression algorithm will effectively store only a single copy of each repeated string, resulting in a file whose size roughly corresponds to the amount of unique sequence in the database. We evaluate our approach on real and simulated databases. Joint work with Michal Nanasi, Tomas Vinar and Brona Brejova