2-AIN-506, 2-AIN-252: Seminar in Bioinformatics (2), (4)
Summer 2025
Abstract

Qianhui Zhu, Shenghan Gao, Binghan Xiao, Zilong He, Songnian Hu. Plasmer: an Accurate and Sensitive Bacterial Plasmid Prediction Tool Based on Machine Learning of Shared k-mers and Genomic Features. Microbiol Spectr, 11(3):e0464522. 2023.

Download preprint: not available

Download from publisher: https://doi.org/10.1128/spectrum.04645-22 PubMed

Related web page: not available

Abstract:

Identification of plasmids in bacterial genomes is critical in many contexts, 
including horizontal gene transfer, antibiotic resistance genes, host-microbe 
interactions, cloning vectors, and industrial production. There are several in 
silico methods to predict plasmid sequences in assembled genomes. However, 
existing methods have evident shortcomings, such as an imbalance between 
sensitivity and specificity, dependency on species-specific models, and reduced 
performance on sequences shorter than 10 kb, which limit their applicability. In 
this work, we proposed Plasmer, a novel plasmid predictor based on 
machine-learning of shared k-mers and genomic features. Unlike existing k-mer or 
genomic-feature based methods, Plasmer employs the random forest algorithm to 
make predictions using the percent of shared k-mers with plasmid and chromosome 
databases combined with other genomic features, including alignment E value and 
replicon distribution scores (RDS). Plasmer can predict on multiple species and 
has achieved an average area under the curve (AUC) of 0.996 with an accuracy of 
98.4%. Compared to existing methods, tests of both sliding sequences and 
simulated and de novo assemblies have consistently shown that Plasmer has 
superior accuracy and stable performance across long and short contigs above 
500 bp, demonstrating its applicability for fragmented assemblies. Plasmer also 
has excellent and balanced performance on both sensitivity and specificity 
(both >0.95 above 500 bp) with the highest F1-score, which has eliminated the 
bias on sensitivity or specificity that was common in existing methods. Plasmer 
also provides taxonomy classification to help identify the origin of plasmids. 

IMPORTANCE: In this study, we proposed a novel plasmid prediction tool named 
Plasmer. Technically, unlike existing k-mer or genomic features-based methods, 
Plasmer is the first tool to combine the advantages of the percent of shared 
k-mers and the alignment score of genomic features. This has given Plasmer (i) 
evident improvement in performance compared to other methods, with the best 
F1-score and accuracy on sliding sequences, simulated contigs, and de novo 
assemblies; (ii) applicability for contigs above 500 bp with highest accuracy, 
enabling plasmid prediction in fragmented short-read assemblies; (iii) excellent 
and balanced performance between sensitivity and specificity (both >0.95 above 
500 bp) with the highest F1-score, which eliminated the bias on sensitivity or 
specificity that commonly existed in other methods; and (iv) no dependency on 
species-specific training models. We believe that Plasmer provides a more 
reliable alternative for plasmid prediction in bacterial genome assemblies.
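
The "percent of shared k-mers" feature described in the abstract can be illustrated with a minimal Python sketch. This is not Plasmer's implementation: the function names, the value of k, the toy sequences, and the choice to count distinct k-mers are all assumptions made for illustration, since the abstract does not specify them. The sketch only computes the two k-mer features; in the paper these are combined with other genomic features and passed to a random forest classifier, which is omitted here.

```python
def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def percent_shared(query, reference_kmers, k):
    """Percent of the query's distinct k-mers that also occur in reference_kmers."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: no k-mers to compare.
        return 0.0
    return 100.0 * len(query_kmers & reference_kmers) / len(query_kmers)


# Toy "databases" (hypothetical sequences, far smaller than real ones).
K = 4  # assumed value, chosen only to keep the example small
plasmid_db = kmers("ATGCGTACGTTAGC", K)
chromosome_db = kmers("GGGTTTCCCAAAGG", K)

contig = "ATGCGTACG"

# Feature vector for one contig: [% shared with plasmids, % shared with chromosomes].
features = [
    percent_shared(contig, plasmid_db, K),
    percent_shared(contig, chromosome_db, K),
]
print(features)
```

A contig whose k-mers mostly match the plasmid database and rarely match the chromosome database yields a feature pair skewed toward the plasmid side, which is the kind of signal a downstream classifier could use.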