Zero-shot Prediction Using Protein Language Models

Bioinformatics 19/04/2022

3 main points
✔️ Zero-shot prediction of mutation effects in proteins using protein language models
✔️ Achieves performance comparable to existing methods for mutation effect prediction
✔️ Builds one generic pre-trained model that needs no new supervision signals

Language models enable zero-shot prediction of the effects of mutations on protein function
written by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, Alexander Rives
(Submitted on 22 May 2021)
Comments: NeurIPS 2021 Poster
Keywords: Proteins, language modeling, generative biology, zero-shot learning, unsupervised learning, variant prediction

The images used in this article are from the paper, the introductory slides, or were created based on them.

Background

Mutations in a protein sequence can alter the structure and function of the protein, depending on where the mutation occurs. During evolution, mutations are thought to be less likely to persist at sites that are important for protein function, and amino acid residues that are structurally close to each other are thought to constrain one another's mutations.

Can we know how mutations affect protein function? One way to find out is deep mutational scanning, an experimental technique that uses a next-generation sequencer to track changes in the abundance of each mutant before and after a function-based screen.

However, because of the difficulty and cost of the experiments, such scans cover only 20 to 30 proteins at most, which is not enough to study the tens of thousands of proteins in the human genome.

Research has therefore turned to machine learning models that are trained on sequences alone by unsupervised learning and predict the effect of mutations without experiments.

One example, the protein language model, achieved state-of-the-art results by pre-training on a large number of sequences and then fine-tuning on the desired task. The drawback is that a new model must be trained for each task.

In the paper introduced in this article, a generic pre-trained model is obtained by training on a large number of sequences, and zero-shot transfer is then performed for each task without providing any supervision signal.

Zero-Shot Transfer in Protein Language Models

Originally, zero-shot learning referred to a problem setting in which a classifier must predict, at test time, classes that did not appear during training. In natural language processing, the term has been extended to a setting in which a model is transferred to a new task without any additional training.

The authors apply this notion of zero-shot transfer from natural language processing to a protein language model and evaluate the generalization performance of the pre-trained model by transferring it without fitting it to the new task.

The authors applied a pre-trained Masked Language Model (MLM) to the task of ranking the functional activity of proteins to predict how the function of a wild-type protein changes when a mutation is introduced. In this case, only pre-training with a protein language model is required, and no new model training is needed to predict the mutation effect.

Prediction Using Pre-trained Models and Its Evaluation

How can we numerically represent the impact of mutations at each site using protein language models?

The authors quantify the effect of a mutation by inputting the wild-type and mutant amino acid sequences into the pre-trained model, calculating the predicted probability at each site, and taking the log odds ratio between the mutant and wild-type residues. The formula for the log odds ratio is as follows (mt: mutant, wt: wild type):
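For reference, the log odds ratio can be written out as below. This is a reconstruction from the description above, assuming the masked scoring variant in which the mutated positions are masked and the model predicts them from the rest of the sequence. The symbols $T$ (set of mutated positions), $x_t$ (residue at position $t$) and $x_{\setminus T}$ (the sequence with positions in $T$ masked) are notation chosen here.

```latex
\text{score} = \sum_{t \in T} \left[ \log p\left(x_t = x_t^{mt} \mid x_{\setminus T}\right) - \log p\left(x_t = x_t^{wt} \mid x_{\setminus T}\right) \right]
```

A negative score means the model considers the mutant residue less likely than the wild-type residue at that position, and a score of zero means it considers them equally likely.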

mutation score
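A minimal sketch of this scoring step, assuming the per-position probabilities have already been obtained from a masked language model. The function name `mutation_score`, the dictionary layout, and all numbers are hypothetical and chosen only for illustration; they are not the authors' implementation.

```python
import math

def mutation_score(probs, mutations):
    """Sum of log p(mutant) - log p(wild type) over the mutated positions.

    probs:     dict mapping position -> {amino acid: predicted probability}
    mutations: list of (position, wild_type_residue, mutant_residue)
    """
    score = 0.0
    for pos, wt, mt in mutations:
        score += math.log(probs[pos][mt]) - math.log(probs[pos][wt])
    return score

# Made-up model outputs for two positions (only three residues shown each).
toy_probs = {
    3: {"A": 0.60, "G": 0.30, "W": 0.10},
    7: {"L": 0.50, "I": 0.40, "P": 0.10},
}

# Single substitution A -> W at position 3.
single = mutation_score(toy_probs, [(3, "A", "W")])
# Double mutant: contributions of the two positions are added.
double = mutation_score(toy_probs, [(3, "A", "W"), (7, "L", "I")])

print(f"single mutant score: {single:.3f}")
print(f"double mutant score: {double:.3f}")
```

Because the score only reads probabilities the pre-trained model already produces, no task-specific training step appears anywhere in the sketch.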

To evaluate the predictions of the model, experimental data from deep mutational scanning are used as the ground truth. Deep mutational scanning yields the score matrix shown in the lower left of the figure below. The score matrix represents the relative effect of each mutation on the functional activity of the protein encoded by the gene.

deep mutational scanning

In the experiments in the paper, the authors examine the rank correlation between the MLM log odds ratios and the experimentally measured scores to check whether pre-training has extracted information that is useful for each task. Computing the log odds ratio requires no additional training, which is what makes the zero-shot transfer described above possible.
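The evaluation step can be sketched as a Spearman rank correlation between model scores and measured scores. The code below is a small pure-Python illustration; the function names and the five data points are made up for the example and are not values from the paper.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at sorted slot i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up log odds scores and made-up experimental scores for five mutants.
model_scores = [-1.8, -0.2, -3.1, 0.4, -0.9]
dms_scores = [0.35, 0.80, 0.05, 0.70, 0.60]

rho = spearman(model_scores, dms_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rank correlation is a natural fit here because the log odds ratio and the experimental score live on different scales; only the ordering of mutants is compared.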

Comparison with Existing Methods for Mutation Effect Prediction

The figure below shows a comparison between the method proposed in this paper and the conventional model.

variant effect prediction

While EVMutation and DeepSequence require a new model to be trained for each task, ESM-1v (the proposed method) requires no new training. Another distinctive feature of ESM-1v is that, at inference time, it does not need a Multiple Sequence Alignment (MSA) of sequences from the same protein family to be generated with JackHMMer.

Model Performance

The table below shows the results of evaluating the models on 41 deep mutational scanning datasets. Ten of the 41 scans are used as validation datasets and the rest as test datasets. The values in the table are the averages of the absolute values of the Spearman rank correlation coefficients between the ground truth and the predictions.
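Written as a formula, the reported value for one method on one split is the following, where $N$ is the number of datasets in the split and $\rho_i$ is the Spearman coefficient on dataset $i$. Both symbols are notation introduced here, not taken from the paper.

```latex
\overline{|\rho|} = \frac{1}{N} \sum_{i=1}^{N} \left| \rho_i \right|
```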

spearman p

The Position-Specific Scoring Matrix (PSSM) treats each site as independent and cannot account for co-evolution caused by interactions between amino acid residues. EVMutation also accounts for pairwise interactions by using a covariation model, and DeepSequence models higher-order interactions between amino acid residues by using latent variables. The results thus suggest that models which take the interdependence of multiple amino acid residues into account estimate the effect of mutations more accurately.

Prediction with the MSA Transformer performs best, but ESM-1v (the proposed method) with fine-tuning performs similarly.
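To make "treats each site as independent" concrete, here is a simplified site-independent scorer built from column frequencies of a toy alignment. It is an illustrative sketch only: the pseudocount, the function names, and the toy MSA are assumptions made here, and the PSSM baseline in the paper may be constructed differently.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_log_probs(msa, pseudocount=1.0):
    """Per-column log frequencies of the 20 amino acids, with a pseudocount.

    Gap characters and other non-standard symbols are ignored.
    """
    table = []
    for col in range(len(msa[0])):
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        table.append({aa: math.log((counts[aa] + pseudocount) / total)
                      for aa in AMINO_ACIDS})
    return table

def site_independent_score(table, pos, wt, mt):
    """Log odds of mutant vs wild type using only the column at `pos`."""
    return table[pos][mt] - table[pos][wt]

# Toy alignment: column 0 is fully conserved, column 2 varies between V and I.
toy_msa = ["MKVL", "MKIL", "MRVL", "MKVL", "MKV-"]
table = column_log_probs(toy_msa)

conserved = site_independent_score(table, 0, "M", "W")
variable = site_independent_score(table, 2, "V", "I")
print(f"M->W at conserved column 0: {conserved:.3f}")
print(f"V->I at variable column 2:  {variable:.3f}")
```

The score for one position never looks at any other column, so a substitution that is only tolerated together with a compensating change elsewhere cannot be captured. Covariation and latent-variable models address exactly that limitation.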

Even ESM-1v in the zero-shot problem setting succeeds in producing performance comparable to EVMutation, indicating that pre-training with the protein language model extracts some relationship between mutation and function.

The following table compares zero-shot predictions made with existing protein language models. In the table, ☨ denotes the average over five different models and ★ denotes an ensemble of five different models.

zero shot comparison

It can be seen that ESM-1v outperforms other existing methods in zero-shot prediction.

ESM-1v uses the same architecture as ESM-1b and is pre-trained with a standard MLM objective. Although the training settings are almost identical, the authors attribute the difference in performance to how the training data were constructed. In particular, they confirm that the sequence-similarity threshold used to cluster the training data has a significant impact on downstream performance.

Conclusion

How was it? The paper introduced in this article predicts the effect of mutations in proteins in an unsupervised problem setting, something that had not been attempted with conventional protein language models.

The main advantage of zero-shot prediction is that it does not require expensive new model training. As the performance of zero-shot prediction improves, even users with no knowledge of machine learning will be able to easily perform analysis using protein language models.