2-AIN-505, 2-AIN-251: Seminar in Bioinformatics (1), (3)
Winter 2023

Jorge Miguel Silva, Diogo Pratas, Tania Caetano, Sergio Matos. The complexity landscape of viral genomes. Gigascience, 11. 2022.

Download from publisher: https://doi.org/10.1093/gigascience/giac079


BACKGROUND: Viruses are among the shortest yet highly abundant species that 
harbor minimal instructions to infect cells, adapt, multiply, and exist. However, 
with the current substantial availability of viral genome sequences, the 
scientific repertory lacks a complexity landscape that automatically sheds light on 
viral genomes' organization, relation, and fundamental characteristics. RESULTS: 
This work provides a comprehensive landscape of the viral genome's complexity (or 
quantity of information), identifying the most redundant and complex groups 
regarding their genome sequence while providing their distribution and 
characteristics at a large and local scale. Moreover, we identify and quantify 
the abundance of inverted repeats in viral genomes. For this purpose, we measure the 
sequence complexity of each available viral genome using data compression, 
demonstrating that adequate data compressors can efficiently quantify the 
complexity of viral genome sequences, including subsequences better represented 
by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic 
compressor on an extensive viral genomes database, we show that double-stranded 
DNA viruses are, on average, the most redundant viruses while single-stranded DNA 
viruses are the least. In contrast, double-stranded RNA viruses show lower 
redundancy relative to single-stranded RNA. Furthermore, we extend the ability of 
data compressors to quantify local complexity (or information content) in viral 
genomes using complexity profiles, providing for the first time a direct complexity 
analysis of human herpesviruses. We also conceive a features-based classification 
methodology that can accurately distinguish viral genomes at different taxonomic 
levels without direct comparisons between sequences. This methodology combines 
data compression with simple measures such as GC-content percentage and sequence 
length, followed by machine learning classifiers. CONCLUSIONS: This article 
presents methodologies and findings that are highly relevant for understanding 
the patterns of similarity and singularity between viral groups, opening new 
frontiers for studying viral genomes' organization while depicting the complexity 
trends and classification components of these genomes at different taxonomic 
levels. The whole study is supported by an extensive website 
(https://asilab.github.io/canvas/) for comprehending the viral genome 
characterization using dynamic and interactive approaches.
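
As a rough illustration of the features mentioned in the abstract (compression-based complexity, GC-content, and sequence length), the following minimal Python sketch may help. It is not the authors' pipeline: the general-purpose zlib compressor stands in for the specialized genomic compressor used in the paper, the normalization by 2 bits per base assumes a four-letter DNA alphabet, and the function names normalized_compression, gc_content, and genome_features are hypothetical.

```python
import zlib


def normalized_compression(seq: str) -> float:
    """Compressed size in bits divided by 2 bits per base.

    Assumption: zlib is only a stand-in for a dedicated genomic
    compressor, so absolute values are not comparable to the paper's.
    Lower values indicate a more redundant sequence.
    """
    data = seq.upper().encode("ascii")
    if not data:
        raise ValueError("sequence must not be empty")
    compressed_bits = 8 * len(zlib.compress(data, 9))
    return compressed_bits / (2 * len(data))


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    if not seq:
        raise ValueError("sequence must not be empty")
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def genome_features(seq: str) -> dict:
    """Hypothetical feature vector for a downstream classifier."""
    return {
        "length": len(seq),
        "gc": gc_content(seq),
        "nc": normalized_compression(seq),
    }
```

Feature dictionaries of this kind, computed per genome, could then be passed to any standard machine learning classifier; the choice of classifier and any further features are left open here. Because a general-purpose compressor adds fixed overhead, the complexity value is only meaningful for sequences of at least a few thousand bases.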