Unraveling "Language" In Proteins - ProteinBERT
3 main points
✔️ Adds Gene Ontology (GO) annotation prediction as a pre-training task for protein language models, alongside the traditional MLM task
✔️ Proposes an architecture that is smaller and faster than conventional networks by treating local and global features separately
✔️ Matches or outperforms conventional methods on benchmarks including structure prediction and post-translational modification prediction
ProteinBERT: A universal deep-learning model of protein sequence and function
(Submitted on 25 May 2021)
Copyright: The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Proteins can be thought of as a kind of string, since they are made up of 20 different amino acids linked together. This observation has led to a growing number of studies applying natural language processing techniques to proteins.
One example of such research is protein language modeling. By applying context-aware embedding methods from natural language processing, such as BERT, to proteins, researchers aim to obtain pre-trained models useful for downstream tasks such as secondary structure prediction and post-translational modification prediction. The interpretation of protein language models has also attracted much attention, because it may reveal biological insights that have not been discovered so far.
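As a concrete illustration of treating a protein as a string, a sequence can be encoded for such a model by mapping each amino acid to an integer token. The sketch below is purely illustrative; ProteinBERT's actual vocabulary also includes special and non-standard tokens.

```python
# Character-level tokenization of a protein sequence (illustrative sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
token_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence: str) -> list[int]:
    """Map each amino-acid character to an integer token id."""
    return [token_to_id[aa] for aa in sequence.upper()]

print(encode("MKV"))  # → [10, 8, 17]
```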
However, there are significant differences between proteins and natural language, including:
- Proteins do not have distinct components such as words or sentences
- Proteins show more variation in sequence length than natural language
- Amino acids that are far apart in the sequence may still interact, owing to the protein's three-dimensional structure
Therefore, there is a need to find a way to successfully model the features of such proteins.
To account for these features of proteins, ProteinBERT, which we introduce here, devises an architecture that handles local and global representations separately. This architecture enables faster training with a smaller network than the architectures used in conventional natural language processing.
ProteinBERT follows BERT in natural language processing and performs pre-training by two tasks: Masked Language Modeling (MLM) and Gene Ontology (GO) annotation prediction.
Here, GO is a hierarchical classification of genes based on their function in the cell and their subcellular localization, and annotation is the assignment of GO terms to genes. In this paper, noise is added to the input amino-acid sequence and to the GO annotation labels, and the deep learning model is trained to recover the original input. The two tasks are trained simultaneously, and the overall loss is the sum of the cross-entropy for predicting the token at each position and the binary cross-entropy for predicting the GO annotation labels.
The MLM task of predicting the amino acid at each position can be viewed as learning local features, while the prediction of GO annotations can be viewed as learning global features.
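Under this description, the combined pre-training loss can be sketched in plain NumPy. The function names and array shapes below are illustrative assumptions, not taken from the ProteinBERT codebase.

```python
import numpy as np

def cross_entropy(probs, targets):
    # probs: (L, V) per-position token distributions; targets: (L,) true token ids
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

def binary_cross_entropy(p, y):
    # p: (G,) predicted probabilities per GO term; y: (G,) 0/1 labels
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def pretraining_loss(token_probs, token_targets, go_probs, go_labels):
    """Sum of the two pre-training losses, as described in the paper:
    token-level cross-entropy plus GO-label binary cross-entropy."""
    return (cross_entropy(token_probs, token_targets)
            + binary_cross_entropy(go_probs, go_labels))
```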
Architecture of the deep learning model
The architecture of the deep learning model used in ProteinBERT is shown in the figure below. As mentioned for the pre-training task, ProteinBERT performs self-supervised learning to recover the input, which means that the input and output tensors have the same size.
The network architecture consists of six Transformer-style blocks, following BERT. One-dimensional convolutional layers compute the local features, and fully connected layers compute the global features. The local features are reflected in the global features through the global attention shown in the figure, and the global features influence the local features through a broadcast fully connected layer.
Thus, ProteinBERT is characterized by the explicit separation of learning local and global representations in the network architecture.
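As a rough illustration of this local/global separation, the NumPy sketch below shows one hypothetical block in which a global vector attends over the local positions and is then broadcast back into them. All parameter names and the exact operations are assumptions made for illustration, not the paper's actual layer definitions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_track_block(local_x, global_x, params):
    """One hypothetical local/global block (illustrative sketch):
    local features flow into the global track via attention, and the
    global representation is broadcast back to every local position."""
    # Global attention: the global vector queries the local positions.
    scores = softmax(local_x @ params["W_att"] @ global_x)   # (L,)
    attended = scores @ local_x                               # (d,)
    new_global = np.tanh(global_x + attended @ params["W_g"])
    # Broadcast fully connected layer: add the transformed global
    # representation to every local position.
    new_local = np.tanh(local_x + new_global @ params["W_b"])
    return new_local, new_global
```

A real implementation would stack several such blocks (six in ProteinBERT) and use multiple attention heads; this sketch keeps a single head for clarity.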
Performance in downstream tasks
The effect of pre-training is confirmed by checking the change in performance on downstream tasks after fine-tuning. In fine-tuning, the network parameters obtained by pre-training are used as initial values, and the model is trained on a task different from the pre-training task.
ProteinBERT is evaluated using protein-related benchmarks proposed in previous studies: the four tasks from TAPE, namely secondary structure prediction, homology prediction, fluorescence prediction, and stability prediction.
As comparison methods, the Transformer and LSTM models from TAPE are used. These models are large, with up to 38 million parameters, whereas ProteinBERT is relatively small, with about 16 million parameters.
The results show that pre-training is beneficial for improving performance on downstream tasks, consistent with previous work on protein language models. They also show that ProteinBERT achieves the same or better performance than the conventional methods.
In addition, ProteinBERT periodically changes the sequence length during training to avoid overfitting to a particular length. Although downstream-task performance often decreased as protein sequence length increased, the decrease was not significant, and the model was reported to generalize well with respect to sequence length.
Some benchmark tasks actually performed better on longer sequences, suggesting that these performance variations stem from causes other than sequence length.
Understanding Attention Mechanisms
The global attention described in the architecture above is responsible for reflecting local features in the global features. By analyzing these attention weights, we can therefore find patterns in which parts of the sequence each downstream task focuses on.
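One simple way to carry out such an analysis is to rank sequence positions by their attention weight for each head. The array shapes below are assumptions for illustration, not the paper's actual analysis code.

```python
import numpy as np

def top_positions(attn_weights: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k most-attended sequence positions per head.
    attn_weights: (num_heads, seq_len) global-attention weights (hypothetical shape)."""
    order = np.argsort(attn_weights, axis=-1)[:, ::-1]  # descending by weight
    return order[:, :k]

# Toy example: two heads over a length-4 sequence.
attn = np.array([[0.1, 0.6, 0.2, 0.1],
                 [0.5, 0.1, 0.1, 0.3]])
print(top_positions(attn, k=2))  # → [[1 2]
                                 #    [0 3]]
```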
The paper shows that, although attention-weight patterns vary widely among proteins, some common patterns exist; for example, certain attention heads tend to focus on a particular part of the sequence, such as its upstream region. Comparing attention weights before and after fine-tuning also revealed that the weights in the final layer of the model change significantly.
What did you think? This paper proposed ProteinBERT, a new pre-training method for proteins. Its contribution is that it enables more efficient pre-training by introducing a network architecture that explicitly separates the learning of local features from the learning of global features. This makes it possible to achieve the same or better performance than conventional methods using only a single GPU, which is impressive.
We look forward to future progress in the protein language model.