Martina Visnovska, Michal Nanasi, Tomas Vinar, Brona Brejova.
**Estimating effective DNA database size via compression**.
In Dana Pardubska, ed.,
*Information Technologies, Applications and Theory (ITAT)*,
volume 683 of *CEUR Workshop Proceedings*,
pp. 63-70, Smrekovica, 2010. Best paper award.

**Download preprint:** 10compress.pdf, 109 KB

**Download from publisher:** not available

**Related www page:** not available

**Bibliography entry:**
BibTeX

**Abstract:**

Searching for sequence similarity in large-scale databases of DNA and protein sequences is one of the essential problems in bioinformatics. To distinguish random matches from biologically relevant similarities, it is customary to compute a statistical P-value for each discovered match. In this context, the P-value is the probability that a similarity with a given score or higher would appear by chance in a comparison of a random query and a random database. Note that the P-value is a function of the database size, since a high-scoring similarity is more likely to occur by chance in a larger database. Biological databases often contain redundant, identical, or very similar sequences. This fact is not taken into account in P-value estimation, resulting in pessimistic estimates. One way to address this problem is to use a lower effective database size instead of the real size. In this work, we propose to estimate the effective size of a database by its compressed size. An appropriate compression algorithm will effectively store only a single copy of each repeated string, resulting in a file whose size roughly corresponds to the amount of unique sequence in the database. We evaluate our approach on real and simulated databases.
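The core idea in the abstract can be illustrated with a small sketch. Here we use Python's standard `zlib` compressor as a stand-in; the paper itself may use a different (possibly DNA-specific) compression algorithm, so the choice of compressor and the `compressed_size` helper are illustrative assumptions, not the authors' implementation. A database built from several identical copies of one sequence compresses to roughly the size of a single copy, so the compressed size tracks the amount of unique sequence rather than the raw length:

```python
import random
import zlib


def compressed_size(sequence: str) -> int:
    """Size in bytes of the zlib-compressed sequence (max compression level)."""
    return len(zlib.compress(sequence.encode("ascii"), 9))


random.seed(0)

# One "unique" random DNA sequence, and a redundant database
# consisting of five identical copies of it.
unique = "".join(random.choice("ACGT") for _ in range(10_000))
redundant = unique * 5

print("raw sizes:       ", len(unique), "vs", len(redundant))
print("compressed sizes:", compressed_size(unique), "vs", compressed_size(redundant))
```

Although the redundant database is five times longer, its compressed size stays close to that of the single unique copy, because the compressor encodes the repeated copies as back-references. This is exactly the behavior that lets compressed size serve as an estimate of effective database size. (Note that zlib's 32 KB window limits how distant a repeat it can detect; a compressor suited to whole databases would need a much larger window.)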

Last update: 12/22/2010