Martina Visnovska, Michal Nanasi, Tomas Vinar, Brona Brejova. Estimating effective DNA database size via compression. In Dana Pardubska, ed., Information Technologies, Applications and Theory (ITAT), 683 volume of CEUR Workshop Proceedings, pp. 63-70, Smrekovica, 2010. Best paper award.

Download preprint: 10compress.pdf, 109Kb

Download from publisher: not available

Related www page: not available

Bibliography entry: BibTeX


Search for sequence similarity in large-scale databases of DNA and protein 
sequences is one of the essential problems in bioinformatics. To 
distinguish random matches from biologically relevant similarities, it is 
customary to compute statistical P-value of each discovered match. In this 
context, P-value is the probability that a similarity with a given score 
or higher would appear by chance in a comparison of a random query and a 
random database. Note that P-value is a function of the database size, 
since a high-scoring similarity is more likely to exist by chance in a 
larger database.

Biological databases often contain redundant, identical, or very similar 
sequences. This fact is not taken into account in P-value estimation, 
resulting in pessimistic estimates. One way to address this problem is to 
use a lower effective database size instead of its real size. In this 
work, we propose to estimate the effective size of a database by its 
compressed size. An appropriate compression algorithm will effectively 
store only a single copy of each repeated string, resulting in a file 
whose size roughly corresponds to the amount of unique sequence in the 
database. We evaluate our approach on real and simulated databases.

Last update: 12/22/2010