2-AIN-506, 2-AIN-252: Seminar in Bioinformatics (2), (4)
Summer 2024

Arnaud Kress, Olivier Poch, Odile Lecompte, Julie D. Thompson. Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events. Front Bioinform, 3:1178926. 2023.

Download preprint: not available

Download from publisher: https://doi.org/10.3389/fbinf.2023.1178926 PubMed

Related web page: not available

Protein annotation errors can have significant consequences in a wide range of 
fields, ranging from protein structure and function prediction to biomedical 
research, drug discovery, and biotechnology. By comparing the domains of 
different proteins, scientists can identify common domains, classify proteins 
based on their domain architecture, and highlight proteins that have evolved 
differently in one or more species or clades. However, genome-wide identification 
of different protein domain architectures involves a complex error-prone pipeline 
that includes genome sequencing, prediction of gene exon/intron structures, and 
inference of protein sequences and domain annotations. Here we developed an 
automated fact-checking approach to distinguish true domain loss/gain events from 
false events caused by errors that occur during the annotation process. Using 
genome-wide ortholog sets and taking advantage of the high-quality human and 
Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss 
events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. 
cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our 
approach allowed us to quantify the impact of errors on estimates of protein 
domain gains and losses, and we show that domain losses are over-estimated 
ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line 
with previous studies of gene-level losses, where issues with genome sequencing 
or gene annotation led to genes being falsely inferred as absent. In addition, we 
show that inconsistent protein domain annotations are a major factor contributing to 
the false events. For the first time, to our knowledge, we show that domain gains 
are also over-estimated by three-fold and two-fold respectively in NHP and NSF 
proteins. Based on our more accurate estimates, we infer that true domain losses 
and gains in NHP with respect to humans are observed at similar rates, while 
domain gains in the more divergent NSF are observed twice as frequently as domain 
losses with respect to S. cerevisiae. This study highlights the need to 
critically examine the scientific validity of protein annotations, and represents 
a significant step toward scalable computational fact-checking methods that may 
one day mitigate the propagation of wrong information in protein databases.
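
To make the notion of a domain gain or loss event concrete, the following is a minimal sketch of how apparent events could be counted from ortholog pairs. It is not the authors' pipeline: the function names, the domain identifiers (DOM_A, DOM_B, DOM_C), and the toy data are all hypothetical, and the fact-checking step that separates true events from annotation errors is not modeled. Each protein is reduced to the set of domains annotated on it, so repeated domains and domain order are ignored.

```python
from collections import Counter


def domain_events(reference, ortholog):
    """Return (lost, gained) domain sets of an ortholog relative to its reference."""
    ref, orth = set(reference), set(ortholog)
    lost = ref - orth      # annotated in the reference, absent in the ortholog
    gained = orth - ref    # annotated in the ortholog, absent in the reference
    return lost, gained


def count_events(pairs):
    """Tally apparent domain losses and gains over (reference, ortholog) pairs."""
    counts = Counter()
    for reference, ortholog in pairs:
        lost, gained = domain_events(reference, ortholog)
        counts["loss"] += len(lost)
        counts["gain"] += len(gained)
    return counts


# Hypothetical toy data: (reference domains, ortholog domains).
pairs = [
    (["DOM_A", "DOM_B"], ["DOM_A"]),           # one apparent loss
    (["DOM_A"], ["DOM_A", "DOM_C"]),           # one apparent gain
    (["DOM_B", "DOM_C"], ["DOM_B", "DOM_C"]),  # identical architecture
]

totals = count_events(pairs)
print(totals["loss"], totals["gain"])
```

Counts obtained this way are only apparent events. The abstract's point is that many of them are artifacts of sequencing, gene prediction, or domain annotation errors, so they would need to be checked before being interpreted as evolutionary changes.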