Michal Nánási. Biological sequence annotation with hidden Markov models. Master thesis, Comenius University in Bratislava, 2010. Supervised by Broňa Brejová.

Download preprint: 10nanasith.pdf, 739Kb

Download from publisher: https://stella.uniba.sk/zkp-storage/ddp/dostupne/FM/2010/2010-FM-ojAKhr/

Related web page: http://compbio.fmph.uniba.sk/herd/

Bibliography entry: BibTeX


Hidden Markov models (HMMs) are an important tool for modeling
biological sequences and their annotations. By sequence
annotation we mean an assignment of labels to each symbol
according to its function. For example, in gene finding we want
to distinguish the regions of DNA that encode proteins from
surrounding non-coding sequence. A hidden Markov model defines a
probability distribution over all annotations of a given sequence

HMMs are traditionally decoded by the Viterbi algorithm. Viterbi
algorithm finds the most probable annotation for a subset of
HMMs. In general, the sequence annotation is NP-hard and the
Viterbi algorithm is used as a heuristic algorithm.

Recently it has been shown that other decoding methods give
better results than Viterbi in specifific applications. We
propose a new decoding method that allows uncertainty in region
boundaries by considering all annotations with slightly shifted
boundaries as the same. Our method (the highest expected reward
decoding - HERD) is an extension of the framework of maximum
expected boundary accuracy decoding introduced by Gross et al. 

We evaluate HERD on the problem of detecting viral recombination
in the HIV genome and compare it with an existing tool called
jumping HMM which uses the Viterbi algorithm. HERD has better
prediction accuracy in terms of the feature sensitivity and
specifixcity, where feature is a single-colored block in an

Keywords: hidden Markov models, sequence annotation, viral