
Unlocking The Language Of DNA [DNABERT]


Bioinformatics

3 main points
✔️ Developed a pre-training method (DNABERT) that takes global contextual information in genome sequences into account
✔️ Fine-tuned the pre-trained model to achieve SOTA in predicting promoters, splice sites, and transcription factor binding sites
✔️ Applied DNABERT trained on the human genome to the genomes of other species

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
written by Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V. Davuluri
(Submitted on 1 Aug 2021)
Comments: Bioinformatics 2021

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.


Unraveling the language of DNA sequences (genome sequences), which are the blueprints of living organisms, is one of the major goals in biology. In addition to regions that encode genes, DNA sequences contain regions that regulate gene expression in vivo; one example is cis-regulatory elements (CREs).

It is known that the function of a cis-regulatory element changes when the same sequence appears in different contexts, and successfully modeling this polysemy is necessary to unravel the "language" of DNA sequences. Previous studies have applied CNNs and RNNs to DNA sequences. However, CNNs can only use local information because they rely on filters of limited length, and RNNs compress information along the sequence length, which makes it difficult to learn long sequences such as DNA well.

In this paper, we apply BERT, which has shown remarkable results as a pre-training method in natural language processing, to DNA sequences in order to understand the "language" of DNA.

DNABERT

In this section, we elaborate on DNA sequence tokenization, pre-training, and downstream tasks.

DNA sequence tokenization

This subsection describes how DNA sequences are tokenized before they are input into the pre-training model.

A k-mer is a representation similar to n-grams in natural language processing, in which substrings of k characters are extracted from a DNA sequence while sliding a window one character at a time.

We input this k-mer tokenization into the pre-training model. Since the vocabulary changes significantly with the setting of k, four values, k = 3, 4, 5, 6, are tried in the paper. The model pre-trained with each k is referred to as DNABERT-k.
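As a concrete illustration, here is a minimal sketch (not the authors' code) of the k-mer tokenization described above; the example sequence and the choice of k = 6 are arbitrary:

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list:
    """Split a DNA sequence into overlapping k-mers, sliding one base at a time."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Example: the 6-mer tokens that would be fed to DNABERT-6
print(kmer_tokenize("ATGGCTATTC", k=6))
# ['ATGGCT', 'TGGCTA', 'GGCTAT', 'GCTATT', 'CTATTC']
```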

Pre-training

DNABERT is pre-trained only with the masked language modeling (MLM) task; the next sentence prediction (NSP) task used in BERT is not used. In MLM, a certain percentage of the k-mer tokens in a DNA sequence is masked, and the model predicts the tokens at the masked positions. Unlike ordinary BERT, the masked positions must form contiguous spans, since neighboring overlapping k-mers would otherwise reveal the masked bases.
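The following sketch illustrates the idea of masking contiguous spans of k-mer tokens; the span length, masking rate, and span placement are simplifying assumptions for illustration, not the authors' exact procedure:

```python
import random

MASK = "[MASK]"

def mask_contiguous_spans(tokens, span_len=6, mask_rate=0.15):
    """Mask contiguous spans of k-mer tokens for the MLM objective.

    Contiguous spans are used because overlapping k-mers would otherwise
    leak the masked bases to the model.
    """
    tokens = list(tokens)
    labels = [None] * len(tokens)  # original tokens at the masked positions
    target = max(span_len, int(len(tokens) * mask_rate))
    masked = 0
    while masked < target:
        start = random.randrange(0, len(tokens) - span_len + 1)
        for i in range(start, start + span_len):
            if tokens[i] != MASK:
                labels[i] = tokens[i]
                tokens[i] = MASK
                masked += 1
    return tokens, labels
```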

The training data used for pre-training are DNA sequences sampled from the human genome. Two sampling methods are used: one splits the genome into non-overlapping segments, and the other samples segments from random positions so that they may overlap.
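A rough sketch of the two sampling strategies might look like the following; the segment length and the number of samples are illustrative assumptions, not the values used in the paper:

```python
import random

def split_non_overlapping(genome: str, length: int = 510):
    """Strategy 1: cut the genome into consecutive, non-overlapping segments."""
    return [genome[i:i + length] for i in range(0, len(genome) - length + 1, length)]

def sample_with_overlap(genome: str, length: int = 510, n_samples: int = 1000):
    """Strategy 2: draw segments from random positions, so they may overlap."""
    starts = (random.randrange(0, len(genome) - length + 1) for _ in range(n_samples))
    return [genome[s:s + length] for s in starts]
```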

fine-tuning

In fine-tuning, we use the weight parameters obtained in pre-training as a starting point and train for each downstream task. We predict promoters, transcription factor binding sites, and splice sites as downstream tasks. We explain each task in detail.

Promoter prediction

This task is to estimate the proximal promoter region. A proximal promoter is a region of DNA upstream of a gene that is essential for its transcription, and the TATA box in eukaryotes is a typical example. Here, we prepare tasks to predict both TATA and non-TATA promoter regions.

There are two main comparison methods: the first is the deep-learning-based DeePromoter, and the second is PromID, the previous SOTA method. Fine-tuning is performed for each comparison setting.

For the comparison with DeePromoter, the sequence from 249 bp upstream to 50 bp downstream of the transcription start site is used as a positive example. For the TATA promoter task, randomly selected sequences containing a TATA box are used as negative examples; for the non-TATA promoter task, sequences shuffled so that the distribution of adjacent base pairs (dinucleotides) is preserved are used as negative examples.

In the comparison with PromID, a 1001-bp window is scanned, and a prediction is counted as successful when more than half of the predicted region overlaps the region within 500 bp of the transcription start site.
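As a sketch, this success criterion could be written as follows; the coordinate conventions and tie-breaking are assumptions of this illustration:

```python
def prediction_is_successful(pred_start: int, pred_end: int, tss: int, window: int = 500) -> bool:
    """Return True if more than half of the predicted region [pred_start, pred_end)
    overlaps the region within +/- `window` bp of the transcription start site."""
    region_start, region_end = tss - window, tss + window
    overlap = max(0, min(pred_end, region_end) - max(pred_start, region_start))
    return overlap > (pred_end - pred_start) / 2
```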

Prediction of transcription factor binding sites

Transcription factors are proteins that control gene transcription by specifically binding to DNA sequences. The region on the DNA where the transcription factor binds is called the transcription factor binding site.

This task is to predict transcription factor binding sites. We fine-tune on data from the ENCODE database, which experimentally maps transcription factor binding sites on the genome using ChIP-seq, a technique that combines chromatin immunoprecipitation with next-generation sequencing.

Prediction of splice sites

A splice site is the site where an intron is removed and exons are joined during splicing. In this task, we classify splice sites into three classes: 5' ends (donors), 3' ends (acceptors), and non-splice sites.

A splice site is usually marked by the dinucleotide pair GT (donor) and AG (acceptor), but there are splice sites that do not follow this rule, as well as regions that match the pattern but are not splice sites. This makes splice site prediction a difficult task.
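To illustrate why the GT/AG rule alone is not enough, the toy sketch below simply enumerates every canonical dinucleotide match; on real genomic sequence this produces far more candidates than true splice sites (the example sequence is arbitrary):

```python
def gt_ag_candidates(sequence: str):
    """Enumerate positions of the canonical GT (donor) and AG (acceptor) dinucleotides."""
    sequence = sequence.upper()
    donors = [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "AG"]
    return donors, acceptors

donors, acceptors = gt_ag_candidates("AAGGTAAGTCCCTTTAGGTA")
print(donors, acceptors)  # [3, 7, 17] [1, 6, 15] -- most matches are not real splice sites
```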

Performance in downstream tasks

This section describes the results for each task in the fine-tuning described above.

promoter prediction

The following figure shows the performance comparison with DeePromoter; from left to right, the panels show accuracy, F1 score, and Matthews correlation coefficient. DNABERT outperforms DeePromoter, especially on TATA promoters, suggesting that DNABERT captures features beyond the sequence motif of the TATA box.

The figure below shows the performance comparison with other deep learning methods that combine CNN and RNN architectures. The left panel shows the ROC curve and the right panel the PR curve; DNABERT performs best, suggesting that it can model the global features of DNA sequences.

promoter prediction curves

Prediction of transcription factor binding sites

The figure below summarizes the results of transcription factor binding site prediction compared with existing tools. The left panel shows accuracy and the right panel a violin plot of the F1 score. DNABERT-TF is the only method whose mean and median are above 0.9 for both accuracy and F1 score.

tf binding sites

In addition, while other tools performed poorly on low-quality experimental data, DNABERT-TF achieved relatively high recall and few false positives.

Prediction of splice sites

The figure below compares DNABERT with SpliceFinder and other splice site prediction tools. From left to right, the panels show accuracy, F1 score, and Matthews correlation coefficient. DNABERT performs best in the same setting, even though SpliceFinder reconstructs its dataset by recursively adding falsely predicted sequences.

summary

DNABERT is a straightforward application of the MLM task from natural language processing to DNA sequences, yet it shows remarkable performance on a variety of tasks. It would be interesting to see new biological knowledge obtained by analyzing the attention inside DNABERT.
