Publication details

Martina Visnovska, Michal Nanasi, Tomas Vinar, Brona Brejova. Estimating effective DNA database size via compression. In Information Technologies, Applications and Theory (ITAT), volume 683, pp. 63-70, Smrekovica, Slovakia, 2010. CEUR-WS. Best paper award.

Abstract

Searching for sequence similarity in large-scale databases of DNA and 
protein sequences is one of the essential problems in bioinformatics. To 
distinguish random matches from biologically relevant similarities, it is 
customary to compute a statistical P-value for each discovered match. In 
this context, the P-value is the probability that a similarity with a given 
score or higher would appear by chance in a comparison of a random query 
and a random database. Note that the P-value is a function of the database 
size, since a high-scoring similarity is more likely to exist by chance in 
a larger database.
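
To illustrate how the P-value depends on the database size, the following is a minimal sketch that assumes the standard Karlin-Altschul model of local alignment statistics, in which the expected number of chance hits is E = K * m * n * exp(-lambda * S) and P = 1 - exp(-E). The abstract does not state which statistical model is used, so the model, the function name p_value, and the default values of K and lam are illustrative assumptions only.

```python
import math

def p_value(score, query_len, db_size, K=0.1, lam=0.3):
    """P-value of a match under the Karlin-Altschul model (assumed here).

    K and lam are placeholder values, not parameters from the paper.
    """
    # Expected number of chance matches with this score or higher.
    e_value = K * query_len * db_size * math.exp(-lam * score)
    # P = 1 - exp(-E); expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)

# The same score looks less significant in a larger database.
p_small_db = p_value(score=60, query_len=100, db_size=1e6)
p_large_db = p_value(score=60, query_len=100, db_size=1e9)
```

Under this model, replacing db_size with a smaller effective size lowers the P-value of every match by roughly the same factor whenever E is small.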

Biological databases often contain redundant, identical, or very similar 
sequences. This fact is not taken into account in P-value estimation, 
resulting in pessimistic estimates. One way to address this problem is to 
use a lower effective database size instead of its real size. In this 
work, we propose to estimate the effective size of a database by its 
compressed size. An appropriate compression algorithm will effectively 
store only a single copy of each repeated string, resulting in a file 
whose size roughly corresponds to the amount of unique sequence in the 
database. We evaluate our approach on real and simulated databases.
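
The idea can be sketched on a toy example. The code below is not the compression method evaluated in the paper; it uses LZMA from the Python standard library as a stand-in compressor, and the helper name compressed_size and the toy sequences are assumptions made for illustration. It compares a database holding one random sequence with a database holding the same sequence twice.

```python
import lzma
import random

def compressed_size(sequences):
    """Size in bytes of the LZMA-compressed concatenation of the sequences."""
    data = "".join(sequences).encode("ascii")
    return len(lzma.compress(data, preset=9))

# Toy data: one random DNA sequence, and a database that stores it twice.
rng = random.Random(0)
unique = "".join(rng.choice("ACGT") for _ in range(20000))
nonredundant_db = [unique]
redundant_db = [unique, unique]

real_ratio = sum(map(len, redundant_db)) / sum(map(len, nonredundant_db))
compressed_ratio = compressed_size(redundant_db) / compressed_size(nonredundant_db)

print(f"real size ratio:       {real_ratio:.2f}")
print(f"compressed size ratio: {compressed_ratio:.2f}")
```

The real size doubles, while the compressed size should grow only slightly, because the second copy is encoded as references to the first. Turning a compressed size in bytes into an effective size in bases would need a calibration step, which this sketch leaves out.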

Contact

Mailing address:
Bronislava Brejova
Katedra informatiky
FMFI UK
Mlynska Dolina
842 48 Bratislava
Slovakia

Office: M163

Phone: +421-2-60295217

E-mail: my_surname at dcs.fmph.uniba.sk