
Unraveling The Attention Mechanism Of Protein Language Models



3 main points
✔️ Analyzed the attention of Transformers pre-trained as protein language models
✔️ Discovered that complex biological features, such as protein folding, are acquired through language-model training alone
✔️ Confirmed the above findings across multiple architectures (TAPE, etc.) and datasets

BERTology Meets Biology: Interpreting Attention in Protein Language Models
written by Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani
(Submitted on 26 Jun 2020 (v1), last revised 28 Mar 2021 (this version, v3))
Comments: ICLR 2021.

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Biomolecules (q-bio.BM)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

A protein language model is a language model from natural language processing applied to proteins (amino acid sequences). Since much remains unknown about how proteins function in the body, pre-training on large amounts of sequence data, as exemplified by BERT, is expected to yield new insights. The Transformer used in BERT is a widely used architecture in natural language processing, and in recent years there have been increasing attempts to improve the interpretability of models by analyzing the attention mechanism inside the Transformer. The interpretation of BERT in particular is called BERTology.

In this paper, we try to improve the interpretability of the model by analyzing the attention mechanism of models pre-trained on protein sequence information.

Model interpretation methods

In this paper, we apply two main analysis methods to the Transformer-based pre-trained models.

Analysis of Attention Mechanism

Among the amino acid residue pairs whose attention weights exceed a threshold, we check what fraction exhibits a given biological feature.

Specifically, it is computed with the following formula, which is an evaluation index similar to precision:

$$p_{\alpha}(f) = \frac{\sum_{x \in X} \sum_{i,j} f(i,j)\,\alpha_{i,j}(x)}{\sum_{x \in X} \sum_{i,j} \alpha_{i,j}(x)}$$

Here, $X$ is the set of protein sequences, $\alpha_{i,j}(x)$ is the attention weight from residue $i$ to residue $j$ in sequence $x$ (weights below the threshold are ignored), and $f(i,j)$ is an indicator that equals 1 when the residue pair $(i, j)$ exhibits the biological feature of interest.
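
As a minimal sketch of this computation (not the authors' code), assuming we already have one attention matrix per sequence for a given head and a matching binary feature matrix; the threshold value used here is a placeholder, not a detail taken from the paper:

```python
import numpy as np

def attention_agreement(attentions, features, threshold=0.3):
    """Compute p_alpha(f): the proportion of high-confidence attention that
    falls on residue pairs exhibiting a biological feature.

    attentions: list of (L, L) arrays, one attention matrix (single head) per
                protein sequence x in the dataset X.
    features:   list of (L, L) binary arrays, f(i, j) = 1 if the residue pair
                (i, j) has the feature (e.g. is in contact), else 0.
    threshold:  attention weights below this value are ignored (placeholder).
    """
    numerator, denominator = 0.0, 0.0
    for alpha, f in zip(attentions, features):
        alpha = np.where(alpha >= threshold, alpha, 0.0)  # keep confident attention only
        numerator += float((f * alpha).sum())
        denominator += float(alpha.sum())
    return numerator / denominator if denominator > 0 else 0.0

# Toy usage: two short "sequences" with random attention and random features.
rng = np.random.default_rng(0)
atts = [rng.dirichlet(np.ones(10), size=10) for _ in range(2)]   # rows sum to 1
feats = [(rng.random((10, 10)) < 0.05).astype(float) for _ in range(2)]
print(attention_agreement(atts, feats))
```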

Probing tasks

A probing task is an auxiliary task used in natural language processing to improve the interpretability of pre-trained models. It checks what kind of information is contained in the internal representations acquired by the pre-trained model.

Specifically, we feed the representations obtained from the model into a classifier and have it solve some task, checking whether information useful for that task is encoded. There are two types of probes for Transformer-based models: embedding probes and attention probes. Embedding probes target the output of each layer, while attention probes target the attention weights.
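
As an illustrative sketch (assumptions, not the paper's exact setup), an embedding probe can be a simple classifier trained on frozen per-residue representations taken from one layer; the random arrays below stand in for embeddings and labels that would in practice come from a pre-trained protein language model and an annotated dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder data: per-residue embeddings from one Transformer layer
# (n_residues, hidden_dim) and a secondary-structure label per residue
# (e.g. 0 = helix, 1 = strand, 2 = turn/bend).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 128))
labels = rng.integers(0, 3, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

# The probe is deliberately simple: if a linear model can predict the
# property, the information is (linearly) encoded in that layer's embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("macro F1:", f1_score(y_test, probe.predict(X_test), average="macro"))
```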

Analysis results of the attention mechanism

Protein structure (contact maps)

An important feature of a protein's folded, three-dimensional structure is the contact map. A contact map records which pairs of amino acid residues end up spatially close to each other when the protein folds.
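
For concreteness, here is a small sketch of how a binary contact map can be derived from per-residue 3D coordinates; the 8 Å distance cutoff and the minimum sequence separation are common conventions assumed here, not details taken from the paper:

```python
import numpy as np

def contact_map(coords, cutoff=8.0, min_separation=6):
    """Binary contact map from per-residue 3D coordinates.

    coords:         (L, 3) array, e.g. C-alpha (or C-beta) atom positions.
    cutoff:         residues closer than this distance (in angstroms) are
                    considered to be in contact (8 A is a common convention).
    min_separation: ignore pairs that are close along the chain, so the map
                    reflects folding rather than sequence adjacency.
    """
    diff = coords[:, None, :] - coords[None, :, :]            # (L, L, 3)
    dist = np.sqrt((diff ** 2).sum(-1))                       # pairwise distances
    idx = np.arange(len(coords))
    seq_sep = np.abs(idx[:, None] - idx[None, :])
    return ((dist < cutoff) & (seq_sep >= min_separation)).astype(int)

# Toy usage with random coordinates for a 50-residue "protein".
rng = np.random.default_rng(0)
cmap = contact_map(rng.uniform(0, 30, size=(50, 3)))
print(cmap.shape, cmap.sum(), "contacts")
```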

When the contact map was used as the feature $f$ for amino acid pairs and the attention analysis described above was applied, $p_{\alpha}(f)$ took values between 44.5% and 63.2% across the pre-trained models. The attention heads that best reflected the contact maps were located in the deeper layers.

Considering that the background probability of a contact, i.e., the fraction of all residue pairs that are in contact, is only 1.3%, we can say that pre-training a protein language model yields higher-order representations that reflect the contact map.

Binding sites

A binding site is a site where a protein interacts with another molecule. Binding sites are very important features for protein function.

The fraction of attention directed to binding sites, $p_{\alpha}(f)$, ranges from 45.6% to 50.7%, which is very high considering that the background probability of a binding site is 4.8%.

In addition, the majority of attention heads attended to binding sites at a high rate.

One suggested explanation for why binding sites are so prominent in the model, even though they are features that reflect interactions with external molecules, is that binding-site motifs are directly tied to protein function and may therefore be highly conserved structurally.

Post-translational modifications

Post-translational modifications are changes made to a protein after it has been translated from mRNA.

It is known that post-translational modifications play a major role in protein structure and function. The $p_{\alpha}(f)$ in post-translational modifications is 64%, which is very high considering that the background probability of post-translational modifications is 0.8%.

However, the number of attention heads referring to post-translational modification sites was small.

Results of the probing tasks

The following figures show the results of embedding and attention probing at each layer of the pre-trained models.

The orange plots show the results of the embedding probes and the blue plots show the results of the attention probes. The evaluation metrics (y-axis) are the F1 score for secondary structure prediction and precision for binding site and contact prediction. The accuracy of secondary structure prediction for classes such as helix, turn/bend, and strand is good even when the output of relatively low layers is used.

In the case of the embedding probes, the accuracy rises steadily, suggesting that useful information is gradually accumulated layer by layer. For the attention probes, on the other hand, the accuracy increases suddenly in the last layer, indicating that the information is represented differently in the embeddings and in the attention weights.
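
As an illustrative sketch (assumptions, not the paper's exact setup), an attention probe for contact prediction can treat, for each residue pair, the vector of attention weights collected across all layers and heads as the input features of a simple classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

# Placeholder data: for each residue pair, one attention weight per
# (layer, head) combination stacked into a feature vector, plus a binary
# label saying whether the pair is in contact. In practice these features
# would come from the frozen pre-trained model's attention maps.
rng = np.random.default_rng(0)
n_pairs, n_layers, n_heads = 5000, 12, 12
pair_features = rng.random((n_pairs, n_layers * n_heads))
contact_labels = rng.integers(0, 2, size=n_pairs)

# Train on the first 4000 pairs, evaluate precision on the rest.
probe = LogisticRegression(max_iter=1000)
probe.fit(pair_features[:4000], contact_labels[:4000])
preds = probe.predict(pair_features[4000:])
print("precision:", precision_score(contact_labels[4000:], preds))
```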

Summary

How did you like it? In this paper, interpretation methods developed for pre-trained models in natural language processing were applied to protein language models. It is interesting that internal representations of higher-order structure, such as secondary structure and contact maps, can be acquired simply by pre-training on sequences of just 20 kinds of amino acid tokens. Since structural information, which is important for protein function, is conserved through evolution, it appears possible to uncover such hidden signals by solving the masked language modeling (MLM) task on a large amount of data.

Although the main focus of this work was the evaluation of known biological features, biological knowledge we do not yet have may lie dormant in pre-trained models. It is very exciting to use machine learning to probe the mysteries of life, but the interpretability of machine learning models is likely to be a bottleneck.

We'll see what happens next!
