Thu 16 Sep. 2010, 15:00

Title: Estimating the effective size of DNA databases using compression
Speaker: Martina Višňovská
Note: Paper accepted to the ITAT 2010 conference

Searching for sequence similarity in large-scale databases of DNA and
protein sequences is one of the essential problems in
bioinformatics. To distinguish random matches from biologically
relevant similarities, it is customary to compute a statistical P-value
for each discovered match. In this context, the P-value is the probability that
a similarity with a given score or higher would appear by chance in a
comparison of a random query and a random database. Note that the P-value is a
function of the database size, since a high-scoring similarity is more
likely to exist by chance in a larger database.
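
As an illustration of this dependence on database size, the sketch below
uses the Karlin-Altschul approximation familiar from BLAST-style local
alignment statistics, E = K * m * n * exp(-lambda * S) and P = 1 - exp(-E).
The talk does not state which formula is used, so this choice is an
assumption, and the values of lam and k are arbitrary placeholders, not
fitted parameters.

```python
import math


def match_p_value(score, query_len, db_size, lam=0.267, k=0.041):
    """Approximate P-value of a match with the given score or higher.

    Assumes the Karlin-Altschul form E = k * m * n * exp(-lam * score).
    lam and k are illustrative placeholders; real values depend on the
    scoring scheme and the sequence composition.
    """
    expected_hits = k * query_len * db_size * math.exp(-lam * score)
    # P = 1 - exp(-E); expm1 keeps precision when E is very small.
    return -math.expm1(-expected_hits)


# Same score and query, two database sizes (hypothetical numbers).
p_small_db = match_p_value(score=80, query_len=500, db_size=10**6)
p_large_db = match_p_value(score=80, query_len=500, db_size=10**9)
```

With these inputs, p_large_db exceeds p_small_db: the same score is less
significant when the database is larger.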

Biological databases often contain redundant, identical, or very similar
sequences. This redundancy is not taken into account in P-value estimation,
resulting in overly pessimistic (too large) P-values. One way to address this problem is
to use a lower effective database size instead of its real size. In
this work, we propose to estimate the effective size of a database by
its compressed size. An appropriate compression algorithm will
effectively store only a single copy of each repeated string, resulting in
a file whose size roughly corresponds to the amount of unique sequence
in the database. We evaluate our approach on real and simulated
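
The idea of using compressed size as a proxy for the amount of unique
sequence can be sketched as follows. This is a minimal illustration, not
the method from the paper: it assumes the general-purpose LZMA compressor
from the Python standard library (the abstract does not name a compression
algorithm), and the toy database is randomly generated.

```python
import lzma
import random


def compressed_size(sequences):
    """Size in bytes of the LZMA-compressed concatenation of the sequences.

    LZMA is an assumed stand-in for "an appropriate compression algorithm";
    its large dictionary lets it reference repeats that lie far apart.
    """
    data = "\n".join(sequences).encode("ascii")
    return len(lzma.compress(data, preset=9))


# Toy database: one random DNA sequence (hypothetical data, fixed seed).
rng = random.Random(0)
unique_seq = "".join(rng.choices("ACGT", k=20000))

single_db = [unique_seq]
redundant_db = [unique_seq, unique_seq]  # same content stored twice

raw_single = sum(len(s) for s in single_db)
raw_redundant = sum(len(s) for s in redundant_db)
size_single = compressed_size(single_db)
size_redundant = compressed_size(redundant_db)
```

The raw size of redundant_db is twice that of single_db, while its
compressed size grows only slightly, because the second copy is encoded as
references to the first. Turning compressed bytes into an effective number
of bases would need an additional calibration step that is not shown here.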

Joint work with Michal Nanasi, Tomas Vinar and Brona Brejova