Tomas Vinar. Enhancements to Hidden Markov Models for Gene Finding and Other Biological Applications. PhD thesis, University of Waterloo, October 2005.

Download preprint: 05thesis.pdf, 1474Kb


In this thesis, we present enhancements of hidden Markov models for
the problem of finding genes in DNA sequences. Genes are the parts of
DNA that serve as a template for synthesis of proteins. Thus, gene
finding is a crucial step in the analysis of DNA sequencing data.

Hidden Markov models are a key tool used in gene finding. This
thesis presents three methods for extending the capabilities of hidden
Markov models to better capture the statistical properties of DNA
sequences. In all three, we encounter a limiting factor that
must be traded off against model accuracy.

First, we build better models for recognizing biological
signals in DNA sequences. Our new models capture
non-adjacent dependencies within
these signals. In this case, the main limiting factor is the amount of
training data: more training data allows more complex models.
Second, we design methods for better representation of length
distributions in hidden Markov models, where we balance the accuracy
of the representation against the running time needed to find
genes in novel sequences. Finally, we show that creating hidden Markov
models with complex topologies may be detrimental to the prediction
accuracy, unless we use more complex prediction algorithms.
However, such algorithms require longer running time, and
in many cases the prediction problem is NP-hard. For gene finding, this means
that incorporating some of the prior biological knowledge into the
model would require impractical running times. Nevertheless, we
demonstrate that our methods can be used to solve
other biological problems in which input sequences are short.
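
The interplay between model topology and prediction algorithm can be illustrated with a toy example. The sketch below is not taken from the thesis: the states, labels, and probabilities are all hypothetical, and it uses brute-force enumeration of state paths, which is feasible only for very short sequences. It assumes a model in which two states (B1, B2) share the label "y", so the probability of that label is split among several paths. The single most probable state path then carries a different labeling than the most probable labeling overall.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy HMM: states B1 and B2 share the label "y".
STATES = ["A", "B1", "B2"]
LABEL = {"A": "x", "B1": "y", "B2": "y"}
START = {"A": 0.4, "B1": 0.3, "B2": 0.3}
TRANS = {
    "A": {"A": 1.0},
    "B1": {"B1": 0.5, "B2": 0.5},
    "B2": {"B1": 0.5, "B2": 0.5},
}
# Uniform emissions, so only the topology drives the result.
EMIT = {s: {c: 0.25 for c in "ACGT"} for s in STATES}


def path_probability(path, seq):
    """Joint probability of one state path and the observed sequence."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        p *= TRANS[prev].get(cur, 0.0) * EMIT[cur][sym]
    return p


def decode(seq):
    """Return (most probable path, most probable labeling) by enumeration."""
    best_path, best_p = None, -1.0
    labeling_prob = defaultdict(float)
    for path in product(STATES, repeat=len(seq)):
        p = path_probability(path, seq)
        if p > best_p:
            best_path, best_p = path, p
        # A labeling's probability is the sum over all paths that produce it.
        labeling_prob[tuple(LABEL[s] for s in path)] += p
    best_labeling = max(labeling_prob, key=labeling_prob.get)
    return best_path, best_labeling


best_path, best_labeling = decode("AC")
print(best_path, best_labeling)
```

In this toy model the best single path stays in state A (labeling "x", "x"), while the labeling "y", "y" has higher total probability because four paths through B1 and B2 contribute to it. The enumeration is exponential in the sequence length, so it serves only as an illustration for short inputs.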

As a model example to evaluate our methods, we built a gene finder,
ExonHunter, that outperforms programs commonly used in genome projects.

Last update: 01/24/2007