2-AIN-505, 2-AIN-251: Seminar in Bioinformatics (1) and (3)
Winter 2020
Abstract

Jie Ren, Nathan A. Ahlgren, Yang Young Lu, Jed A. Fuhrman, Fengzhu Sun. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5(1):69. 2017.

Download from publisher: https://doi.org/10.1186/s40168-017-0283-5

Abstract:

BACKGROUND: Identifying viral sequences in mixed metagenomes containing both
viral and host contigs is a critical first step in analyzing the viral component 
of samples. Current tools for distinguishing prokaryotic virus and host contigs
primarily use gene-based similarity approaches. Such approaches can significantly
limit results especially for short contigs that have few predicted proteins or
lack proteins with similarity to previously known viruses. METHODS: We have
developed VirFinder, the first k-mer frequency based, machine learning method for
virus contig identification that entirely avoids gene-based similarity searches. 
VirFinder instead identifies viral sequences based on our empirical observation
that viruses and hosts have discernibly different k-mer signatures. VirFinder's
performance in correctly identifying viral sequences was tested by training its
machine learning model on sequences from host and viral genomes sequenced before 
1 January 2014 and evaluating on sequences obtained after 1 January 2014.
RESULTS: VirFinder had significantly better rates of identifying true viral
contigs (true positive rates (TPRs)) than VirSorter, the current state-of-the-art
gene-based virus classification tool, when evaluated with either contigs
subsampled from complete genomes or assembled from a simulated human gut
metagenome. For example, for contigs subsampled from complete genomes, VirFinder 
had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb
contigs, respectively, at the same false positive rates as VirSorter (0, 0.003,
and 0.006, respectively), thus VirFinder works considerably better for small
contigs than VirSorter. VirFinder furthermore identified several recently
sequenced virus genomes (after 1 January 2014) that VirSorter did not and that
have no nucleotide similarity to previously sequenced viruses, demonstrating
VirFinder's potential advantage in identifying novel viral sequences. Application
of VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis
patients reveals higher viral diversity in healthy individuals than cirrhosis
patients. We also identified contig bins containing crAssphage-like contigs with 
higher abundance in healthy patients and a putative Veillonella genus prophage
associated with cirrhosis patients. CONCLUSIONS: This innovative k-mer based tool
complements gene-based approaches and will significantly improve prokaryotic
viral sequence identification, especially for metagenomic-based studies of viral 
ecology.
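
Illustration (not part of the published abstract): the abstract says that VirFinder classifies contigs by their k-mer signatures but gives no implementation details. The sketch below shows only what a k-mer frequency vector for a contig might look like. The function name `kmer_frequencies`, the default `k=4`, and the choice to skip windows containing ambiguous bases are assumptions made for this example; they are not taken from the paper or from the VirFinder software.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return a dict mapping every ACGT k-mer to its relative frequency in seq.

    Hypothetical helper for illustration only; windows containing
    characters other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter()
    # Slide a window of length k over the sequence and count valid k-mers.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    total = sum(counts.values())
    # Enumerate all 4**k possible k-mers so every contig gets a vector
    # of the same length, with zeros for k-mers that never occur.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return {km: 0.0 for km in all_kmers}
    return {km: counts[km] / total for km in all_kmers}


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGTTTGACA", k=2)
    # Show only the k-mers that actually occur in the toy sequence.
    for kmer, value in sorted(freqs.items()):
        if value > 0:
            print(kmer, round(value, 3))
```

Vectors of this kind, computed for known viral and host sequences, could then serve as input features for a generic supervised classifier; which classifier and which k VirFinder actually uses is not stated in the abstract.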