Brona Brejova, Daniel G. Brown, Tomas Vinar. Optimal Spaced Seeds for Hidden Markov Models, with Application to Homologous Coding Regions. In R. Baeza-Yates, E. Chavez, M. Crochemore, ed., Combinatorial Pattern Matching, 14th Annual Symposium (CPM), 2676 volume of Lecture Notes in Computer Science, pp. 42-54, Morelia, Michoacan, Mexico, June 25-27 2003. Springer.

Download preprint: 03seeds.ps.gz, 710Kb

Download from publisher: http://www.springerlink.com/link.asp?id=87hkfanyt5ln5td1

Related web page: http://www.bioinformatics.uwaterloo.ca/supplements/03seeds/

Bibliography entry: BibTeX

Abstract:

We study the problem of computing optimal spaced seeds for detecting 
sequences generated by a Hidden Markov model. Inspired by recent work in DNA 
sequence alignment, we have developed such a model for representing the 
conservation between related DNA coding sequences. Our model includes 
positional dependencies and periodic rates of conservation, as well as 
regional deviations in overall conservation rate. We show that, for hidden 
Markov models in general, the probability that a seed is matched in a region 
can be computed efficiently, and use these methods to compute the optimal 
seed for our models. Our experiments on real data show that the optimal seeds 
are substantially more sensitive than the seeds used in the standard 
alignment program BLAST, and also substantially better than those of 
PatternHunter or WABA, both of which use spaced seeds. Our results offer the 
hope of improved gene finding due to fewer missed exons in DNA/DNA 
comparison, and more effective homology search in general, and may have 
applications outside of bioinformatics.