Martina Visnovska, Michal Nanasi, Tomas Vinar, Brona Brejova. Estimating effective DNA database size via compression. In Dana Pardubska, ed., Information Technologies, Applications and Theory (ITAT), volume 683 of CEUR Workshop Proceedings, pp. 63-70, Smrekovica, 2010. Best paper award.
Download preprint: 10compress.pdf, 109Kb
Download from publisher: not available
Related www page: not available
Bibliography entry: BibTeX
Abstract:
Searching for sequence similarity in large-scale databases of DNA and protein sequences is one of the essential problems in bioinformatics. To distinguish random matches from biologically relevant similarities, it is customary to compute a statistical P-value for each discovered match. In this context, the P-value is the probability that a similarity with a given score or higher would appear by chance in a comparison of a random query and a random database. Note that the P-value is a function of the database size, since a high-scoring similarity is more likely to exist by chance in a larger database. Biological databases often contain redundant, identical, or very similar sequences. This fact is not taken into account in P-value estimation, resulting in overly pessimistic estimates. One way to address this problem is to use a lower effective database size in place of the real size. In this work, we propose to estimate the effective size of a database by its compressed size. An appropriate compression algorithm will effectively store only a single copy of each repeated string, resulting in a file whose size roughly corresponds to the amount of unique sequence in the database. We evaluate our approach on real and simulated databases.
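The core idea of the abstract can be illustrated with a short sketch: compress the concatenated database and take the compressed size as an estimate of the effective size. This is only an illustration using zlib, not the compression algorithm or experimental setup from the paper; the function name and toy sequences are made up for the example. Note that zlib's 32 KB window limits how distant a repeated sequence it can exploit, so a real database would call for a compressor with longer-range matching.

```python
import zlib

def effective_db_size(sequences, level=9):
    # Illustrative estimate: compressed size of the concatenated database.
    # A compressor stores repeated strings only once, so the result roughly
    # tracks the amount of unique sequence rather than the raw size.
    data = "".join(sequences).encode("ascii")
    return len(zlib.compress(data, level))

# Toy database: one sequence stored once vs. ten identical copies.
unique = "ACGTGGCTAACGTTAGCATCGATCGATCGGATTACA" * 20
db_unique = [unique]
db_redundant = [unique] * 10

raw_size = len("".join(db_redundant))
print(raw_size, effective_db_size(db_redundant), effective_db_size(db_unique))
```

The redundant database is ten times larger in raw size, but its compressed size stays close to that of a single copy, which is exactly the behavior the effective-size estimate relies on.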
Last update: 12/22/2010