Evidence Combination in Hidden Markov Models for Gene Prediction.
PhD thesis, University of Waterloo, November 2005.
This thesis introduces new techniques for finding genes in genomic
sequences. Genes are regions of a genome encoding proteins of an
organism. Identification of genes in a genome is an important step in
the annotation process after a new genome is sequenced. The prediction
accuracy of gene finding can be greatly improved by using experimental
evidence. This evidence includes homologies between the genome and
databases of known proteins, and evolutionary conservation of genomic
sequence in different species.
We propose a flexible framework to incorporate several different
sources of such evidence into a gene finder based on a hidden Markov
model. Various sources of evidence are expressed as partial
probabilistic statements about the annotation of positions in the
sequence, and these are combined with the hidden Markov model to
obtain the final gene prediction. The opportunity to use partial
statements allows us to handle missing information transparently
and to cope with the heterogeneous character of individual sources
of evidence. On the other hand, this feature makes the combination
step more difficult. We present a new method for combining partial
probabilistic statements and prove that it is an extension of
existing methods for combining complete probability statements. We
evaluate the performance of our system and its individual
components on data from the human and fruit fly genomes.
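
To make the idea of combining evidence with a hidden Markov model concrete, here is a minimal sketch in Python. It is not the combination method developed in the thesis: it assumes a much simpler scheme in which each position may carry multiplicative weights for some states, and a position or state with no evidence gets the neutral weight 1. The function name, the state names, and the weighting scheme are all hypothetical, and all model probabilities are assumed to be nonzero.

```python
import math

def viterbi_with_evidence(obs, states, start, trans, emit, evidence):
    """Most probable state path, with per-position evidence weights.

    evidence[t] is None (no source covers position t) or a dict mapping
    some states to positive weights. Missing entries count as weight 1.
    This weighting is an illustrative assumption, not the thesis's method.
    """
    def score(t, s):
        weight = 1.0 if evidence[t] is None else evidence[t].get(s, 1.0)
        return math.log(emit[s][obs[t]]) + math.log(weight)

    # Log-scores of the best path ending in each state at position 0.
    best = {s: math.log(start[s]) + score(0, s) for s in states}
    back = []
    for t in range(1, len(obs)):
        prev, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of state s at position t.
            p = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            best[s] = prev[p] + math.log(trans[p][s]) + score(t, s)
            pointers[s] = p
        back.append(pointers)

    # Trace the best path back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Passing `None` for a position leaves the plain HMM score unchanged there, which is one simple way to handle missing information without special cases in the decoder.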
The use of evolutionary sequence conservation as a source of
evidence in gene finding requires efficient and sensitive tools for
finding similar regions in very long sequences. We present a method
for improving the sensitivity of existing tools for this task by
careful modeling of sequence properties. In particular, we build a
hidden Markov model representing a typical homology between two
protein coding regions and then use this model to optimize the
spaced seed, a component of heuristic similarity search algorithms. The seeds
that we discover significantly improve the accuracy and running
time of similarity search in protein coding regions, and are
directly applicable to our gene finder.
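
As an illustration of what a spaced seed does, the following Python sketch finds seed hits between two sequences. A seed is written here as a string of `1` (positions that must match) and `0` (positions that are ignored). The function name `seed_hits` and the example seeds are assumptions made for this sketch, not the optimized seeds from the thesis.

```python
def seed_hits(a, b, seed):
    """Return (i, j) pairs where a[i:] and b[j:] agree at every '1' of seed."""
    care = [k for k, c in enumerate(seed) if c == "1"]
    span = len(seed)

    # Index every window of `a` by the letters at the positions that must match.
    index = {}
    for i in range(len(a) - span + 1):
        key = "".join(a[i + k] for k in care)
        index.setdefault(key, []).append(i)

    # Look up every window of `b` in the index.
    hits = []
    for j in range(len(b) - span + 1):
        key = "".join(b[j + k] for k in care)
        for i in index.get(key, ()):
            hits.append((i, j))
    return hits
```

For example, with `a = "ACGTACGT"` and `b = "ACTTACGA"`, the contiguous seed `"1111"` yields only the hit `(3, 3)`, while the spaced seed `"1101"` also reports `(0, 0)` and `(4, 0)`, because the mismatch at the ignored position is tolerated. In protein coding regions, a pattern such as `"110110"` would similarly tolerate mismatches at every third position of a codon-aligned window.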