[SpliceBERT] A BERT Model That Pre-trains With The Genetic Information Of Multiple Species!
3 main points
✔️ SpliceBERT was developed to outperform conventional methods on the genetics-related task of splicing prediction
✔️ SpliceBERT achieves improved accuracy over conventional methods by pre-training on genetic information in the form of precursor mRNA from 72 vertebrate species and then fine-tuning on human data
✔️ SpliceBERT is based on gene sequences from multiple species, allowing it to capture important evolutionary information.
Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction
(Submitted on 3 February 2023)
Comments: Published on bioRxiv
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
The relationship between human genetics and natural language processing
Information about human heredity is contained in portions (called genes) of a long string of characters called the genome, which resides within human cells. The building blocks of this string are called nucleotides, and there are four types of characters: A (adenine), G (guanine), C (cytosine), and T (thymine).
By regarding the genome as a string and the nucleotides that make it up as characters, analysis based on the genome sequence can be treated much like natural language processing. It is believed that utilizing state-of-the-art language models based on deep learning will make it possible to elucidate the details of human genetic functions and to understand how specific genetic variants relate to disease, questions that have not been fully answered by conventional methods.
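To make the analogy concrete, the sketch below treats a short DNA sequence as a string and tokenizes it one nucleotide per character, the way such language models consume genome sequences. The vocabulary and token ids here are illustrative, not SpliceBERT's actual ones.

```python
# A minimal sketch of treating a DNA sequence as text: one token per
# nucleotide. The vocabulary and token ids are illustrative only.
VOCAB = {"[PAD]": 0, "[MASK]": 1, "A": 2, "G": 3, "C": 4, "T": 5}

def tokenize(sequence: str) -> list[int]:
    """Map each nucleotide character to an integer token id."""
    return [VOCAB[base] for base in sequence.upper()]

print(tokenize("GATTACA"))  # [3, 2, 5, 5, 2, 4, 2]
```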
Challenges in Natural Language Processing with the Human Genome
However, it is considered difficult to apply a language model to a genome sequence in the same way as to a human language. This is because the genome sequence, which contains genetic information, is very long (the human genome is said to consist of approximately 3.2 billion characters), and even the same gene sequence shows different characteristics depending on where it is located on the genome.
Conventional Methods and Research Background
Models such as MMSplice, SpliceAI, and Pangolin were previously developed to analyze abnormal splicing caused by gene mutations.
These models made it possible to identify splice sites in genome sequences and to make predictions about alternative splicing (i.e., the creation of multiple types of mRNA from a single precursor mRNA).
There are many other language models for making such splicing-related predictions, but they were pre-trained on the human genome, so it was unclear whether pre-training on sequences from many species would improve splicing-related predictions.
Another chronic problem in genome analysis is the lack of sufficient data. Self-supervised learning, as used in large-scale language models (LLMs) such as BERT, has been applied to learning protein sequence representations to alleviate this data shortage, but it was not clear whether a similar approach could be used to study splicing.
Model Details
Model Overview
In this study, we developed a model called SpliceBERT, which focuses on precursor mRNA, a molecule obtained from DNA through an operation called transcription, and fine-tuned it on human data after pre-training. A precursor mRNA is a molecule that becomes mRNA through an operation called splicing (and mRNA, in turn, becomes protein through an operation called translation).
This paper shows that SpliceBERT can be used to more accurately predict the branch point at which a precursor mRNA becomes an mRNA (one of the key splicing-related sequences).
Furthermore, SpliceBERT pre-trained with precursor mRNA sequences from multiple species was shown to improve prediction accuracy for sequence- and splicing-related tasks compared to language models that utilize precursor mRNAs from a single species. The results also showed that SpliceBERT, when fine-tuned with human data, outperformed traditional baseline models and language models pre-trained with only human data.
Applications of this model
By utilizing SpliceBERT, we can expect to be able to do things like (1)~(4) in the figure.
(1) Nucleotide properties can be expressed as numerical vectors (embeddings) and then visualized quantitatively, enabling an intuitive understanding of the relationships and patterns among nucleotides (a code sketch follows this list).
(2) The potential effect or impact of a gene mutation can be estimated using unsupervised learning. This can help predict how a mutation will affect the function of a gene (splicing in this paper) when labeled data is lacking.
(3) It can predict splice sites (sites that play an important role in splicing, specifically, the boundary regions in precursor mRNA where non-protein-coding regions are removed) that are common among different species. This is expected to advance research on gene function and evolution across species.
(4) It would be possible to predict branch points in splicing (sites that likewise play an important role in splicing) and analyze how mutations at these locations affect splicing.
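As a sketch of application (1), the snippet below extracts per-nucleotide embedding vectors from a pre-trained model. It assumes the SpliceBERT weights are available in HuggingFace transformers format; the model path and the space-separated input format are assumptions made for illustration.

```python
# Sketch of application (1): per-nucleotide embeddings from the
# pre-trained model. The model path is a placeholder, and the
# space-separated input format is an assumption.
import torch
from transformers import AutoTokenizer, AutoModel

model_path = "path/to/SpliceBERT"  # placeholder, not an official model id
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

sequence = " ".join("ACGTACGTAGCT")  # one token per nucleotide
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
print(hidden.shape)  # each position's vector is that nucleotide's embedding
```

These are the same kind of vectors that get projected to two dimensions for the visualizations discussed later in the article.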
In addition, although omitted in the figure, it is believed that the Attention weights used in the Transformer make it possible to carry out multifaceted observations about evolution that could not be made in the past.
Model Structure
SpliceBERT consists of six Transformer encoder layers, as shown in the figure. Each sequence is tokenized at the nucleotide level, with positional information encoded by a one-hot positional embedding. For pre-training, more than 2 million precursor mRNA sequences from 72 vertebrate species were extracted.
During pre-training, approximately 15% of each sequence is randomly masked, and the model is trained to recover the masked nucleotides using a cross-entropy loss. As with BERT, the pre-trained model can then be applied to a variety of splicing-related downstream tasks.
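A minimal sketch of this masked language modeling objective is shown below: roughly 15% of positions are replaced with a mask token, and the cross-entropy loss is computed only at those positions. The tensor sizes and token ids are illustrative, and random logits stand in for the encoder's output.

```python
# Minimal sketch of the masked language modeling objective: mask ~15%
# of positions and compute cross-entropy only on the masked tokens.
# Sizes and ids are illustrative; random logits stand in for the model.
import torch
import torch.nn.functional as F

MASK_ID, VOCAB_SIZE = 1, 6
tokens = torch.randint(2, VOCAB_SIZE, (8, 512))     # a batch of encoded sequences
mask = torch.rand(tokens.shape) < 0.15              # pick ~15% of positions
inputs = tokens.masked_fill(mask, MASK_ID)          # replace them with [MASK]

logits = torch.randn(8, 512, VOCAB_SIZE)            # stand-in for model(inputs)
loss = F.cross_entropy(logits[mask], tokens[mask])  # loss only on masked positions
print(loss.item())
```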
Experimental results
The upper part of Figure B shows SpliceBERT's masked-prediction accuracy (i.e., how accurately the model predicts the masked portions) in regions of genes with different functions. The lower part of Figure B shows how repetitive the gene sequence is in each region.
The accuracy varies greatly depending on the region of the gene, but the model is particularly accurate in the introns, where there are many repetitive regions. This suggests that the proportion of repeated regions has a significant impact on the MLM task.
Figure C shows the distribution of phastCons100way scores. phastCons100way is a tool that identifies regions that have remained unchanged (i.e., conserved) during evolution.
The tool compares the genome sequences of 100 different species and gives the probability that each nucleotide is conserved (i.e., unchanged across species), with values closer to 1 indicating that the sequence is conserved across multiple species. In this paper, a nucleotide is defined as conserved if this value is greater than 0.8 and non-conserved if it is less than 0.8.
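In other words, the conserved/non-conserved labels used here are just a threshold on the phastCons100way score, as in this small sketch (the scores are made up for illustration):

```python
# The conserved / non-conserved split is a simple threshold at 0.8 on
# the phastCons100way score (values in [0, 1]); scores here are made up.
import numpy as np

scores = np.array([0.05, 0.42, 0.81, 0.97])  # per-nucleotide phastCons scores
is_conserved = scores > 0.8                  # True where the site is conserved
print(is_conserved)                          # [False False  True  True]
```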
Figure D shows the accuracy curves for classification problems using SpliceBERT, its derivative SpliceBERT-human, and a one-hot encoding model; SpliceBERT achieves higher accuracy than the other models.
Figures F and G above compare the performance of SpliceBERT with other models on the task of estimating how strongly a mutation affects splicing. The results show that SpliceBERT performs better than the other methods.
The three figures above show an analysis of the Transformer's Attention weights used in the model.
Figure A shows that donor-acceptor pairs from the same intron (an intron is the portion of the precursor mRNA that is removed during splicing; the sequence at its start is called the donor and the sequence at its end the acceptor, and the two are paired in splicing) have higher Attention weights than the other groups of site pairs.
Figure B also shows that exon regions (an exon is the portion of the precursor mRNA that is not removed during splicing) have higher phastCons scores (i.e., are more conserved) than intron regions.
Figure C shows an analysis of the distribution of Attention weights around donors and acceptors by Transformer layer. Attention is enriched around acceptors and donors, especially in layers 3 to 5, suggesting that these layers may be particularly relevant to the analysis of RNA splicing.
Thus, by combining Figures A~C, it is possible to examine the relationship between Transformer's Attention weights and conserved regions, and to gain deeper insight into evolution.
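For readers who want to reproduce this kind of analysis, the sketch below pulls per-layer attention maps out of a HuggingFace-format checkpoint; the model path and input format are assumptions, not the authors' exact pipeline.

```python
# Sketch of extracting per-layer attention weights for this kind of
# analysis. The checkpoint path and input format are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

model_path = "path/to/SpliceBERT"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, output_attentions=True)

inputs = tokenizer(" ".join("ACGTAGGTACGT"), return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # one tensor per layer

# Each tensor is (batch, num_heads, seq_len, seq_len); averaging over
# heads gives a single attention map per layer, e.g. for layer 3:
layer3 = attentions[2].mean(dim=1)
print(len(attentions), layer3.shape)
```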
Furthermore, in the figure above, we visualized the splice-site embedding vectors in two dimensions using UMAP (Uniform Manifold Approximation and Projection) and found that the splice sites cluster into four patterns, shown in blue, orange, green, and red on the right side of the figure. The results show that SpliceBERT performs better than the conventional DNABERT and one-hot methods.
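A minimal sketch of this kind of visualization with the umap-learn package is shown below; random vectors stand in for the actual splice-site embeddings.

```python
# Sketch of projecting splice-site embeddings to 2-D with UMAP
# (umap-learn package). Random vectors stand in for real embeddings.
import numpy as np
import umap
import matplotlib.pyplot as plt

embeddings = np.random.rand(1000, 512)  # stand-in for splice-site embeddings
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=3)
plt.title("Splice-site embeddings (UMAP)")
plt.show()
```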
The figure above also shows how the F1 score, the harmonic mean of precision and recall and a measure of prediction accuracy, changes when predicting splice-site locations in five different species, compared with conventional models. SpliceBERT performs particularly well on humans, and also performs well on the other species. In addition to splice sites, SpliceBERT also outperforms conventional methods in predicting branch points.
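For reference, the F1 score is the harmonic mean of precision and recall; the toy example below checks this with scikit-learn (the labels are made up):

```python
# F1 is the harmonic mean of precision and recall; toy labels for illustration.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print(f1_score(y_true, y_pred))  # 0.75
print(2 * p * r / (p + r))       # 0.75, the same value
```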
Summary of experimental results
SpliceBERT has been shown to outperform conventional methods on a variety of splicing-related tasks, such as estimating how large an effect a genetic variant has on function and identifying the regions that are important in splicing.
Summary
We have developed SpliceBERT, a pre-trained language model of precursor mRNA sequences from multiple species, to facilitate the study of splicing as it occurs in human cells.
SpliceBERT not only contributes to our understanding of splicing functions, but has also been demonstrated to outperform other language models pre-trained solely on human data.
SpliceBERT is expected to be further improved in the future, as it may have difficulty predicting splicing specific to certain tissues and cell types. For example, it could be made to handle longer sequences by employing knowledge distillation techniques to transfer the LLM's knowledge to a lightweight architecture such as a CNN, or by developing a pre-trained genome sequence language model that does not use the Transformer.
Personally, I believe that performance could improve even further if the similarity between organisms' genomes and their phylogenetic relationships were reflected in pre-training.