2-AIN-505, 2-AIN-251: Seminar in Bioinformatics (1), (3)
Winter 2023

Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V. Davuluri. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112-2120. 2021.

Download preprint: not available

Download from publisher: https://doi.org/10.1093/bioinformatics/btab083

Related web page: not available


MOTIVATION: Deciphering the language of non-coding DNA is one of the fundamental
problems in genome research. Gene regulatory code is highly complex due to the
existence of polysemy and distant semantic relationships, which previous
informatics methods often fail to capture, especially in data-scarce scenarios.
RESULTS: To address this challenge, we developed a novel pre-trained 
bidirectional encoder representation, named DNABERT, to capture global and
transferable understanding of genomic DNA sequences based on upstream and
downstream nucleotide contexts. We compared DNABERT to the most widely used
programs for genome-wide regulatory element prediction and demonstrated its ease of use,
accuracy and efficiency. We show that the single pre-trained transformers model 
can simultaneously achieve state-of-the-art performance on prediction of 
promoters, splice sites and transcription factor binding sites, after easy 
fine-tuning using small task-specific labeled data. Further, DNABERT enables 
direct visualization of nucleotide-level importance and semantic relationships
within input sequences for better interpretability and accurate identification of 
conserved sequence motifs and functional genetic variant candidates. Finally, we 
demonstrate that DNABERT pre-trained on the human genome can even be readily
applied to other organisms with exceptional performance. We anticipate that the
pre-trained DNABERT model can be fine-tuned for many other sequence analysis
tasks. AVAILABILITY AND IMPLEMENTATION: The source code and the pre-trained and
fine-tuned models for DNABERT are available on GitHub
(https://github.com/jerryji1993/DNABERT). SUPPLEMENTARY INFORMATION: 
Supplementary data are available at Bioinformatics online.
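
Seminar note: the abstract does not say how DNA is turned into model input. As a minimal sketch, assuming the sequence is split into overlapping k-mers (stride 1) and that a contiguous run of tokens is hidden for masked pre-training, the preprocessing might look like the following. The function names `kmer_tokens` and `mask_span`, the `[MASK]` placeholder and the default `k=6` are illustrative choices made here, not the API of the DNABERT repository.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Assumption: k-mer tokens are the model's vocabulary units.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 tokens (none if n < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_span(tokens, start, k=6, mask_token="[MASK]"):
    """Replace up to k consecutive tokens, beginning at `start`, with a mask.

    Overlapping k-mers share nucleotides, so masking a single token would
    leave most of its content visible in the neighbouring tokens; masking
    a contiguous run reduces that leakage. (Hypothetical helper.)
    """
    end = min(start + k, len(tokens))
    return tokens[:start] + [mask_token] * (end - start) + tokens[end:]


if __name__ == "__main__":
    tokens = kmer_tokens("ATGCGTAC", k=3)
    print(tokens)
    print(mask_span(tokens, start=2, k=3))
```

A bidirectional encoder trained to recover the masked tokens has to use the visible k-mers on both sides of the gap, which is one way to read the abstract's "up and downstream nucleotide contexts".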