Publication details

Brona Brejova, Daniel G. Brown, Tomas Vinar. Optimal spaced seeds for hidden Markov models, with application to homologous coding regions. In R. Baeza-Yates, E. Chavez, M. Crochemore, ed., Combinatorial Pattern Matching, 14th Annual Symposium (CPM), 2676 volume of Lecture Notes in Computer Science, pp. 42-54, Morelia, Michoacan, Mexico, June 25-27 2003. Springer.
Preprint, 710Kb | Download from publisher | Webpage | Early version | BibTeX | MathSciNet

Abstract

We study the problem of computing optimal spaced seeds for detecting
sequences generated by a Hidden Markov model. Inspired by recent work
in DNA sequence alignment, we have developed such a model for
representing the conservation between related DNA coding
sequences. Our model includes positional dependencies and periodic
rates of conservation, as well as regional deviations in overall
conservation rate. We show that, for hidden Markov models in general,
the probability that a seed is matched in a region can be computed
efficiently, and use these methods to compute the optimal seed for our
models. Our experiments on real data show that the optimal seeds are
substantially more sensitive than the seeds used in the standard
alignment program BLAST, and also substantially better than those of
PatternHunter or WABA, both of which use spaced seeds. Our results
offer the hope of improved gene finding due to fewer missed exons in
DNA/DNA comparison, and more effective homology search in general, and
may have applications outside of bioinformatics.