Michal Nánási. Biological sequence annotation with hidden Markov models. Master thesis, Comenius University in Bratislava, 2010. Supervised by Broňa Brejová.
Download preprint: 10nanasith.pdf, 739Kb
Download from publisher: https://stella.uniba.sk/zkp-storage/ddp/dostupne/FM/2010/2010-FM-ojAKhr/
Related web page: http://compbio.fmph.uniba.sk/herd/
Bibliography entry: BibTeX
Abstract:
Hidden Markov models (HMMs) are an important tool for modeling biological sequences and their annotations. By sequence annotation we mean an assignment of labels to each symbol according to its function. For example, in gene finding we want to distinguish the regions of DNA that encode proteins from surrounding non-coding sequence. A hidden Markov model defines a probability distribution over all annotations of a given sequence X. HMMs are traditionally decoded by the Viterbi algorithm. Viterbi algorithm finds the most probable annotation for a subset of HMMs. In general, the sequence annotation is NP-hard and the Viterbi algorithm is used as a heuristic algorithm. Recently it has been shown that other decoding methods give better results than Viterbi in specifific applications. We propose a new decoding method that allows uncertainty in region boundaries by considering all annotations with slightly shifted boundaries as the same. Our method (the highest expected reward decoding - HERD) is an extension of the framework of maximum expected boundary accuracy decoding introduced by Gross et al. We evaluate HERD on the problem of detecting viral recombination in the HIV genome and compare it with an existing tool called jumping HMM which uses the Viterbi algorithm. HERD has better prediction accuracy in terms of the feature sensitivity and specifixcity, where feature is a single-colored block in an annotation. Keywords: hidden Markov models, sequence annotation, viral recombination