Bioinformatics Seminar

Tue 15 Mar. 2011, 17:20
I-9

Title: Hawkins et al. Assessing phylogenetic motif models for predicting transcription factor binding sites
Speaker: Jaro Budiš

MOTIVATION: A variety of algorithms have been developed to predict
transcription factor binding sites (TFBSs) within the genome by exploiting
the evolutionary information implicit in multiple alignments of the
genomes of related species. One such approach uses an extension of the
standard position-specific motif model that incorporates phylogenetic
information via a phylogenetic tree and a model of evolution. However,
these phylogenetic motif models (PMMs) have never been rigorously
benchmarked in order to determine whether they lead to better prediction
of TFBSs than obtained using simple position weight matrix scanning.

RESULTS: We evaluate three PMM-based prediction algorithms, each of which
uses a different treatment of gapped alignments, and we compare their
prediction accuracy with that of a non-phylogenetic motif scanning
approach. Surprisingly, all of these algorithms appear to be inferior to
simple motif scanning, when accuracy is measured using a gold standard of
validated yeast TFBSs. However, the PMM scanners perform much better than
simple motif scanning when we abandon the gold standard and consider the
number of statistically significant sites predicted, using column-shuffled
'random' motifs to measure significance. These results suggest that the
common practice of measuring the accuracy of binding site predictors using
collections of known sites may be dangerously misleading since such
collections may be missing 'weak' sites, which are exactly the type of
sites needed to discriminate among predictors. We then extend our previous
theoretical model of the statistical power of PMM-based prediction
algorithms to allow for loss of binding sites during evolution, and show
that it gives a more accurate upper bound on scanner accuracy. Finally,
utilizing our theoretical model, we introduce a new method for predicting
the number of real binding sites in a genome. The results suggest that the
number of true sites for a yeast TF is in general several times greater
than the number of known sites listed in the Saccharomyces cerevisiae
Promoter Database (SCPD). Among the three scanning algorithms that we test, the
MONKEY algorithm has the highest accuracy for predicting yeast TFBSs.
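
ILLUSTRATION: For attendees unfamiliar with the baseline, the following is a
minimal sketch of simple position weight matrix scanning and of building a
column-shuffled 'random' motif, the two non-phylogenetic ingredients mentioned
in the abstract. It is not the authors' implementation and does not cover the
PMM scanners. The function names, the toy counts, the uniform background, and
the pseudocount are all assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log-odds weight matrix.

    Assumes a uniform background and a fixed pseudocount (illustrative choices).
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(pwm, sequence):
    """Score every window of the sequence; return (offset, score) pairs."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][sequence[i + j]] for j in range(width)))
        for i in range(len(sequence) - width + 1)
    ]

def shuffle_columns(pwm, rng):
    """Permute the motif columns to obtain a 'random' motif of equal composition."""
    shuffled = list(pwm)
    rng.shuffle(shuffled)
    return shuffled

def count_hits(pwm, sequence, threshold):
    """Count windows whose score reaches the threshold."""
    return sum(1 for _, score in scan(pwm, sequence) if score >= threshold)

# Toy motif "ACGT" built from hypothetical counts (not data from the paper).
counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
]
pwm = log_odds_pwm(counts)
sequence = "TTACGTGGACGTAA"

hits = scan(pwm, sequence)
best_offset, best_score = max(hits, key=lambda h: h[1])

# Hits of the real motif versus one column-shuffled control at the same threshold.
control = shuffle_columns(pwm, random.Random(0))
real_count = count_hits(pwm, sequence, threshold=best_score)
control_count = count_hits(control, sequence, threshold=best_score)
```

In practice one would average over many shuffled motifs and score whole
promoter regions; comparing real_count with the distribution of control_count
is one simple way to judge whether predicted sites are statistically
significant without relying on a gold standard of known sites.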