A BERT-based Model For Predicting The Function Of mRNAs With Genetic Information Is Now Available!

Large Language Models 30/10/2024

3 main points
✔️ m6A-BERT-Deg is proposed as a model that predicts whether an mRNA, the molecule that carries genetic information, will be degraded based on its modification state
✔️ Prediction accuracy improves over both the same model without pre-training and each of the previous models
✔️ Analysis of the contribution of BERT tokens also suggests the possible discovery of a new biological mechanism

Understanding YTHDF2-mediated mRNA Degradation By m6A-BERT-Deg
written by Ting-He Zhang, Sumin Jo, Michelle Zhang, Kai Wang, Shou-Jiang Gao, Yufei Huang
(Submitted on 15 Jan 2024)
Comments: Published on arXiv
Subjects: Molecular Networks (q-bio.MN)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Prerequisite Knowledge 1 (about mRNA)

The genetic information of humans and other living organisms is stored in DNA. To actually control body functions based on that information, however, it must first be copied into a substance called mRNA, and the information in the mRNA must then be used to synthesize proteins.

In other words, the information in DNA must be copied once into another medium, mRNA, before it can be converted into the proteins that control body functions. To illustrate the relationship between the three: DNA is a recipe book for a dish, mRNA is a copy of the recipe, and protein is the finished dish.

Prerequisite Knowledge 2 (Regulation of mRNA function)

Sometimes mRNAs undergo operations in which decorations, called modifications, are added. A typical example is the m6A modification, in which a "methyl group" is added to a nitrogen atom of "A," one of the four units ("A," "G," "C," and "U") that make up the mRNA string.

It was known that such modifications lead mRNAs to attract proteins responsible for regulating their degradation. However, m6A modification does not necessarily result in mRNA degradation, and the detailed mechanism has not yet been fully elucidated.

The regulation of mRNA stability is highly relevant to various cellular and biological processes, including cancer stem cells in acute myeloid leukemia, and its elucidation has been anticipated.

Research Background

In this study, therefore, the authors developed a model called m6A-BERT to predict whether mRNAs with m6A modifications are degraded. Furthermore, using mRNA lifetime (half-life) data, they proposed m6A-BERT-Deg, an improved version obtained by fine tuning this model.

The half-life of an mRNA is related to the rate at which the mRNA is degraded and is an important parameter in understanding the mechanism of degradation.
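As a rough illustration of how half-life relates to the degradation rate, the sketch below assumes simple first-order (exponential) decay. That assumption, and the function names, are not taken from the paper.

```python
import math

def half_life_from_rate(decay_rate: float) -> float:
    """Half-life t_1/2 = ln(2) / k, assuming first-order decay with rate constant k."""
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    return math.log(2) / decay_rate

def fraction_remaining(t: float, half_life: float) -> float:
    """Fraction of the initial mRNA still present after time t."""
    return 0.5 ** (t / half_life)

# A larger decay rate gives a shorter half-life.
print(half_life_from_rate(0.1))
print(half_life_from_rate(0.2))
# After two half-lives, a quarter of the mRNA remains.
print(fraction_remaining(6.0, 3.0))
```

Under this assumption, an increase in half-life corresponds directly to a slower degradation rate, which is why half-life data can serve as a label for degradation.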

The effectiveness of m6A-BERT-Deg was confirmed by its high accuracy compared to other state-of-the-art deep learning-based methods.

Model Structure

Overall Model

The structure of mRNA is like a long, thin chain, with each of the four types of constituent units linked together to form a string. In other words, if each component is represented by a single letter abbreviation, mRNA can be represented in the form of a string.

m6A-BERT is based on the BERT model familiar from natural language processing. Its overall picture is shown in the figure above: pre-training, shown in A, is followed by fine tuning, shown in B, using the data in C (bottom row).

Below are details of each structure of the model.

Tokenization Details

In this model, mRNA character sequences are used as input data for tokenization. Tokenization is performed using a sliding window technique as shown in the figure.

In this method, a portion of an mRNA sequence of a certain width (set to 3 in the above figure) is extracted, and the whole string is tokenized by shifting the extracted portion one step at a time.

Here, a run of three consecutive characters is treated as a single chunk. In other words, "AGC" (the first to third letters) is treated as a single token in the figure. Likewise, "GCG," which corresponds to the second to fourth characters, "CGG," "GGA," and so on are also treated as tokens.
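The sliding-window idea can be sketched in a few lines of Python. The function name is hypothetical, and the input string "AGCGGA" is simply the shortest sequence consistent with the four tokens named above; it is not taken from the paper.

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers, shifting the window by one character."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

tokens = kmer_tokenize("AGCGGA", k=3)
print(tokens)  # ['AGC', 'GCG', 'CGG', 'GGA']
```

Changing `k` changes the granularity of the tokens: a sequence of length n yields n - k + 1 tokens.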

The area shown in red in the figure indicates the site where the m6A modification actually occurs, and the sequence up to 250 characters away from that site is the target of the analysis. [CLS] and [SEP] are special tokens added at the beginning and end, respectively.
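A minimal sketch of this input preparation follows. It assumes the window extends the same distance on both sides of the site and is clipped at the ends of the sequence; the function names and the toy sequence are hypothetical, and a small flank is used so the result is easy to read.

```python
def extract_window(sequence: str, site: int, flank: int = 250) -> str:
    """Return the characters within `flank` positions of `site` (0-based), clipped to the sequence."""
    if not 0 <= site < len(sequence):
        raise IndexError("site is outside the sequence")
    start = max(0, site - flank)
    end = min(len(sequence), site + flank + 1)
    return sequence[start:end]

def add_special_tokens(tokens: list[str]) -> list[str]:
    """Wrap a token list with the [CLS] and [SEP] special tokens."""
    return ["[CLS]"] + tokens + ["[SEP]"]

seq = "CCGGACUAAG"                                  # toy sequence; the modified "A" is at index 4
window = extract_window(seq, site=4, flank=2)       # the article uses a distance of 250
kmers = [window[i:i + 3] for i in range(len(window) - 2)]
print(window)                     # GGACU
print(add_special_tokens(kmers))  # ['[CLS]', 'GGA', 'GAC', 'ACU', '[SEP]']
```

With a flank of 250 and a site far from both ends, the window contains 501 characters.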

The example above uses a width of 3. In the paper, four widths (3, 4, 5, and 6) are used, and pre-training is performed at each granularity (experiments in the paper show that the accuracy is almost the same across these widths).

Pre-training Details

During pre-training, 15% of the tokens obtained from the m6A sequences are randomly masked. That is, these tokens are replaced with [MASK] tokens, as indicated by the black areas in the figure.
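The masking step can be sketched as below. This is a simplified illustration: it masks independently chosen positions and always substitutes [MASK], whereas the actual procedure in the paper may differ (for example by masking contiguous spans of overlapping tokens). The function name and seed are hypothetical.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Replace a random fraction of tokens with [MASK]; return the masked list and the hidden originals."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]   # what the model should predict at this position
        masked[pos] = "[MASK]"
    return masked, labels

tokens = ["AGC", "GCG", "CGG", "GGA", "GAC", "ACU", "CUA", "UAA", "AAG", "AGC",
          "GCU", "CUU", "UUG", "UGA", "GAC", "ACC", "CCA", "CAG", "AGG", "GGU"]
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

During pre-training, the model sees `masked` as input and is trained to recover the entries stored in `labels`.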

Then, the output of the embedding is passed through the Transformer block, which consists of 12 layers, and a classification layer to predict the masked tokens.

Note that pre-training used the m6A-AtlasV2 dataset, which contains mRNA sequences with m6A modifications from 24 tissues and cell lines.

Fine Tuning Details

During fine tuning, a binary classification layer is added to the pre-trained model. This layer outputs 1 if it predicts regulation of mRNA degradation and 0 if it predicts no regulation.
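Conceptually, such a layer maps a vector produced by the pre-trained model to a 0/1 decision. The sketch below assumes a single linear unit followed by a sigmoid and a 0.5 threshold; the article does not specify the exact form of the layer, and the weights and input vector here are made-up toy values.

```python
import math

def classify(pooled, weights, bias, threshold=0.5):
    """Apply a linear layer and a sigmoid to a pooled vector; return (label, probability)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    # Numerically stable sigmoid.
    if logit >= 0:
        prob = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        prob = e / (1.0 + e)
    return (1 if prob >= threshold else 0), prob

pooled = [0.2, -0.4, 0.9]      # hypothetical sequence representation from the encoder
weights = [1.0, 0.5, 2.0]      # hypothetical learned weights
label, prob = classify(pooled, weights, bias=-0.5)
print(label, round(prob, 3))
```

In fine tuning, these weights would be learned together with the pre-trained Transformer parameters rather than set by hand.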

For fine tuning, the dataset is constructed as shown in the figure above. The upper circle shows the number of sites for "mRNA with m6A modification and increased half-life," and the slightly larger circle below shows the number of sites for "mRNA with m6A modification and binding of a certain protein (formally called the YTHDF2 protein) that causes mRNA degradation."

The negative set consists of 485 sites randomly selected from the 7726 sites that fall into the lower circle but not the upper circle (i.e., the protein is bound and the half-life is not increasing). The positive set consists of 485 sites that fall into both circles (the protein is bound and the half-life is increasing).

In other words, among m6A sites bound by the degradation-inducing protein, sites where degradation actually takes place are treated as positive data, and sites where it does not are treated as negative data.
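The two-circle construction amounts to set operations followed by random sampling. The sketch below uses made-up site identifiers and assumes that every site in the overlap is used as a positive while negatives are subsampled to a chosen size; the function name, identifiers, and seed are hypothetical.

```python
import random

def build_dataset(half_life_up, ythdf2_bound, n_negative, seed=0):
    """Label sites in both sets as 1 and a random sample of bound-only sites as 0."""
    positives = sorted(half_life_up & ythdf2_bound)    # bound and half-life increased
    candidates = sorted(ythdf2_bound - half_life_up)   # bound, half-life not increased
    rng = random.Random(seed)
    negatives = sorted(rng.sample(candidates, n_negative))
    return [(site, 1) for site in positives] + [(site, 0) for site in negatives]

half_life_up = {"s1", "s2", "s3", "s9"}                 # toy "upper circle"
ythdf2_bound = {"s2", "s3", "s4", "s5", "s6", "s7"}     # toy "lower circle"
dataset = build_dataset(half_life_up, ythdf2_bound, n_negative=2)
print(dataset)
```

Sampling as many negatives as there are positives keeps the two classes balanced.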

Evaluation Indicators for the Model

Five indices were selected to evaluate model performance: accuracy (ACC), the Matthews correlation coefficient (MCC), AUC, precision, and recall. MCC is an index for binary classification problems that remains informative on unbalanced data sets. Model performance was compared using 5-fold cross-validation.
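For reference, the threshold-based indices can all be derived from the four confusion-matrix counts, and AUC can be computed from the ranking of prediction scores. The sketch below uses the standard textbook definitions with made-up labels and scores; it is not the paper's evaluation code.

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute ACC, precision, recall, and MCC from true and predicted 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]          # toy predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(binary_metrics(y_true, y_pred))
print(auc(y_true, scores))
```

Unlike ACC, MCC uses all four counts in both numerator and denominator, so a model that predicts only the majority class scores 0 rather than a misleadingly high value.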

Experimental Results

The table compares the predictive performance of m6A-BERT-Deg with the baseline models.

In this paper, to demonstrate the effect of pre-training, prediction performance is compared against BERT-baseline, which was trained without pre-training; DNABERT-Deg, in which the existing DNABERT model is fine-tuned with the method presented in this paper; and iDeepMVDeg and CNN+LSTM-Deg as conventional models.

The experiments showed that m6A-BERT-Deg performed the best among all models. In particular, ACC and AUC improved by about 4% compared to the method without pre-training, indicating the effectiveness of pre-training.

Furthermore, the application of m6A-BERT-Deg to the regulation of mRNA degradation was validated using the HEK293T cell line (a cell line is a population of cells that can be grown continuously in vitro) and compared to the results obtained with another method, m6A-express. The paper reports that the model's predictions agreed with these results.

Considerations Obtained by Token Contribution Scores

The authors created a heat map that visualizes the magnitude of the attribution score as color intensity, as shown in the figure. The attribution score measures how much each token contributes to the prediction, and a high score indicates that the token has a significant impact on the prediction.

The upper half of the figure shows the attribution scores for the positive data set and the lower half those for the negative data set. The horizontal axis indicates the position in the sequence relative to the m6A modification site; the number can be read as how many letters away a position is from that site.

In this figure, some areas with higher scores (the scattered blue dots) appear where the horizontal axis is around -100.

The figure also shows that the attribution score at the m6A modification site itself is low (i.e., the token's contribution to the prediction is low), whereas regions somewhat upstream of the m6A site (i.e., at negative positions) have high attribution scores (i.e., the token's contribution is high).

This indicates that the region upstream of the area where the modification is occurring may have a significant impact on the regulation of mRNA degradation.
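One simple way to summarize such a heat map is to average the scores across sequences at each position relative to the m6A site and look for the positions with the largest averages. The sketch below does this on made-up scores with a tiny window; how the paper actually computes and aggregates attribution scores is not described here, so the function names and data are purely illustrative.

```python
def mean_attribution_by_position(attributions, flank):
    """Average per-position scores over sequences; keys are offsets from the m6A site (0 = site)."""
    width = 2 * flank + 1
    if any(len(row) != width for row in attributions):
        raise ValueError("every row must have 2 * flank + 1 scores")
    n = len(attributions)
    return {i - flank: sum(row[i] for row in attributions) / n for i in range(width)}

def top_positions(profile, n=3):
    """Return the n offsets with the largest absolute mean attribution."""
    return sorted(profile, key=lambda pos: abs(profile[pos]), reverse=True)[:n]

# Two toy sequences, each scored at offsets -2..+2 around the site.
attributions = [
    [0.8, 0.1, 0.0, 0.1, 0.2],
    [0.6, 0.3, 0.0, 0.1, 0.0],
]
profile = mean_attribution_by_position(attributions, flank=2)
print(profile)
print(top_positions(profile, n=1))  # [-2]
```

In this toy data the largest mean score sits upstream of the site (offset -2) while the site itself scores 0, mirroring the pattern described for the figure.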

Possibility of elucidating new biological mechanisms

In addition, the paper examines which proteins bind more frequently to certain RNA sequences. The authors point out that some of these proteins promote mRNA stability, and they raise the possibility of a new biological mechanism in which mRNA stability is enhanced because mRNA degradation is prevented.

Summary

In this study, the BERT-based m6A-BERT-Deg was proposed as a model to predict mRNA degradation by m6A modification.

The model is trained by tokenizing mRNA sequences as strings, pre-training to predict masked tokens, and fine tuning with an added binary classification layer to predict degradation.

The model outperformed both the same architecture without pre-training and previous state-of-the-art models. In addition, the accuracy of the model was confirmed by experiments with real cells.

Further analysis using an attribution score based on token contribution revealed high scores upstream of the m6A modification site, suggesting that this region is important for regulating mRNA degradation.

Personally, I think that the best part of the BERT model is that it allows us to fully consider the biological background knowledge by considering the contribution of these embedding layers, and I think it is remarkable that we are now able to elucidate new mechanisms through this kind of consideration.
