A BiGRU-Based Model Extending RNNs Improves the Accuracy of Immune Response Prediction!
3 main points
✔️ TripHLApan, the model proposed in this paper, improves accuracy on the task of predicting HLA-peptide binding, a key step in the immune response
✔️ It introduces a BiGRU module, an Attention mechanism, and transfer learning, extending the base RNN model
✔️ Improved performance is observed not only on IEDB, a general dataset, but also on a melanoma dataset derived from skin cancer cells
TripHLApan: predicting HLA molecules binding peptides based on triple coding matrix and transfer learning
written by Meng Wang, Chuqi Lei, Jianxin Wang, Yaohang Li, Min Li
(Submitted on 6 August 2022)
Comments: 25 pages, 7 figures
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Application of deep learning in the field of immunology
People are protected from harmful foreign substances, including viruses and bacteria, by a mechanism called the immune system. The immune system is a system in which various elements of the body, including white blood cells, cooperate to defend it, and it is known to be indispensable for human survival. In recent years, research has aimed to treat cancers that are otherwise difficult to cure by exploiting the mechanisms of the immune system.
One of the most important components of the immune system is the HLA molecule, which induces an immune response by presenting peptides (fragments of substances taken up by a cell) to other cells, making HLA an essential component of the immune response. In recent years, research has focused on elucidating the mechanisms of the immune system and how HLA presents peptides.
In particular, HLA can be classified into several versions based on the sequence of its component units (these different gene versions are called alleles). Accurately predicting which peptides each allele will present is an important clinical challenge.
In addition to experiments on a general dataset, this paper uses data from cells affected by a skin cancer called melanoma to demonstrate the potential clinical application of the model.
Limitations and problems of current tools and research streams
Over the past two decades, a number of tools have been developed to predict HLA-peptide binding. In recent years, in particular, models based on deep learning have been widely used.
However, these models are valid only for a limited number of HLA alleles (versions) and fall short of practical accuracy (HLA is classified into HLA-I and HLA-II, and this shortcoming is considered particularly pronounced for HLA-II).
It is also known that while prediction accuracy is good when the peptides binding to HLA have a common length (e.g., 9 or 10), prediction performance drops sharply for longer peptides because training samples of those lengths are scarce. In addition, current methods do not take full advantage of the relationships within the data (especially sequence context information within the protein sequences) or of biological information.
Therefore, TripHLApan is proposed in this paper to solve these issues.
Model Details
Overall model
The overall workflow of TripHLApan is shown in Figure a.
In the TripHLApan model, peptide sequences and HLA molecules are obtained from the IEDB database and represented as strings for input (each amino acid building block of the HLA and the peptide is represented by a single letter, as shown in the figure). These input data are preprocessed before training, taking into account various properties of the HLA molecules and peptides.
In this experiment, the data were first split so that the training set, the test set, and a set containing alleles absent from the training data (hereinafter, the unseen dataset) do not overlap with each other.
The input data are encoded using three methods: AAindex, BLOSUM62, and embedding. By running these three encodings in parallel, the model can capture latent, multifaceted information, such as biochemical properties and physical binding information, that cannot be obtained from the surface-level sequence alone.
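As a rough illustration, the sketch below shows how a single peptide string could be turned into three parallel matrices. The BLOSUM62 rows and AAindex values here are random toy placeholders, the embedding table is a randomly initialized lookup, and the feature dimensions are my own assumptions; the matrices actually used by TripHLApan may differ.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Toy placeholders: in practice these rows would come from the real BLOSUM62
# matrix and from selected AAindex physicochemical scales.
BLOSUM62_ROWS = np.random.randint(-4, 5, size=(20, 20)).astype(float)  # placeholder
AAINDEX_FEATURES = np.random.rand(20, 5)                               # placeholder (5 scales)
EMBEDDING_TABLE = np.random.randn(20, 8)                               # learned in the real model

def encode_peptide(peptide: str):
    """Return the three parallel encodings (one matrix per scheme)."""
    idx = np.array([AA_TO_IDX[aa] for aa in peptide])
    blosum = BLOSUM62_ROWS[idx]      # (L, 20)  evolutionary substitution profile
    aaindex = AAINDEX_FEATURES[idx]  # (L, 5)   biochemical / physical properties
    embed = EMBEDDING_TABLE[idx]     # (L, 8)   trainable dense representation
    return blosum, aaindex, embed

b, a, e = encode_peptide("SIINFEKL")  # example 8-mer peptide
print(b.shape, a.shape, e.shape)      # (8, 20) (8, 5) (8, 8)
```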
The encoded representations are then used as input to a model called BiGRU.
In addition, the model applies an Attention mechanism to the BiGRU so that it learns which positions in the sequence are the important ones (the reasons for using the BiGRU module and the Attention mechanism are discussed below).
The three matrices obtained in this way are combined and passed through a fully connected layer and a sigmoid layer before the final output. Training with these parallelized encodings allows the properties of the amino acids to be exploited from multiple angles.
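A minimal PyTorch sketch of this parallel architecture is shown below. The layer sizes, the exact attention formulation, and the way the three branches are merged are assumptions made for illustration; the paper's actual implementation may differ.

```python
import torch
import torch.nn as nn

class BranchBiGRU(nn.Module):
    """One encoding branch: BiGRU over the sequence, then attention pooling."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)       # scores each position's importance

    def forward(self, x):                          # x: (batch, length, in_dim)
        h, _ = self.bigru(x)                       # (batch, length, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over positions
        return (w * h).sum(dim=1)                  # weighted sum -> (batch, 2*hidden)

class TripHLApanSketch(nn.Module):
    """Three parallel branches (BLOSUM62 / AAindex / embedding), then FC + sigmoid."""
    def __init__(self, dims=(20, 5, 8), hidden=64):
        super().__init__()
        self.branches = nn.ModuleList(BranchBiGRU(d, hidden) for d in dims)
        self.head = nn.Sequential(
            nn.Linear(3 * 2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),        # binding probability
        )

    def forward(self, xs):                         # xs: list of three (batch, length, dim) tensors
        feats = torch.cat([b(x) for b, x in zip(self.branches, xs)], dim=-1)
        return self.head(feats).squeeze(-1)

model = TripHLApanSketch()
xs = [torch.randn(2, 8, d) for d in (20, 5, 8)]    # two toy samples, length-8 sequences
print(model(xs).shape)                             # torch.Size([2])
```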
Details of the BiGRU model and why we are utilizing this model
The BiGRU (Bidirectional Gated Recurrent Unit) model is an extension of the RNN; one of its most important features is that it processes the sequence information in both the forward and reverse directions.
Unlike the usual RNN, which reads the sequence in only one direction, reading it from both the forward and reverse directions allows the model to better capture the contextual information of the sequence.
BiGRU also introduces a gating mechanism to capture long-term dependencies. In addition, TripHLApan adds an Attention mechanism on top of the BiGRU, which redistributes weights according to the importance of each position in the sequence and thereby fully exploits the information that the context holds.
Thus, by combining the BiGRU and the Attention mechanism, the model can make maximal use of sequence context information when predicting HLA-peptide binding, even when sufficient 3D structural information is not available. According to the paper, one of the greatest advantages of this model is its ability to capture how the termini of peptides that bind directly to HLA affect the binding.
Introduction to Transfer Learning
As shown in Figure b, the model also introduces transfer learning as a countermeasure to the loss of prediction accuracy caused by the lack of training data for long peptides. One reason for introducing transfer learning here is that a special binding pattern is known to occur when the peptide length is 8.
Therefore, the model is first trained on peptides of length 9 to 14 (i.e., relatively long peptides), and the resulting model is then transferred to make predictions for peptides of length 8. This keeps the predictions for peptides longer than 8 from being affected by the length-8 data and prevents overfitting to data of a specific peptide length.
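The sketch below illustrates this kind of length-based transfer learning, reusing the TripHLApanSketch class from the previous sketch: pretrain on longer peptides, then copy the weights, freeze the sequence encoders, and fine-tune for length-8 peptides. The toy data loaders, learning rates, and choice of which layers to freeze are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_toy_loader(length, n=32):
    """Toy data: three encoding tensors per sample plus a binary binding label."""
    xs = [torch.randn(n, length, d) for d in (20, 5, 8)]
    y = torch.randint(0, 2, (n,)).float()
    return DataLoader(TensorDataset(*xs, y), batch_size=8)

def train(model, loader, epochs=3, lr=1e-3):
    """Standard binary cross-entropy training loop (sketch)."""
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for *xs, y in loader:            # y: 1 = binds, 0 = does not bind
            opt.zero_grad()
            loss = loss_fn(model(xs), y)
            loss.backward()
            opt.step()
    return model

# Stage 1: pretrain on longer peptides (lengths 9-14); toy length-12 data stands in here.
base_model = train(TripHLApanSketch(), make_toy_loader(length=12))

# Stage 2: transfer to length-8 peptides - start from the pretrained weights,
# freeze the sequence encoders, and fine-tune only the prediction head.
len8_model = TripHLApanSketch()
len8_model.load_state_dict(base_model.state_dict())
for p in len8_model.branches.parameters():
    p.requires_grad = False
train(len8_model, make_toy_loader(length=8), lr=1e-4)
```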
Experimental results
Figure b shows the results of an experiment measuring the AUC of the BiGRU model under different ratios of positive and negative samples (specifically, ratios of 1:1, 1:5, 1:10, and 1:50 in the four graphs from left to right). The horizontal axis shows the length of the peptides used in the experiment (peptides are grouped by length during training).
Figure b consists of three rows: the top row shows the AUC on the test set, the middle row shows the AUC on the unseen dataset (the dataset containing alleles absent from the training data, as described above), and the bottom row shows the AUC on the unseen dataset when transfer learning is applied. The blue areas of the graphs show the evaluation metrics of the model proposed in this paper, while the other colored areas show those of the conventional models.
The figure shows that the new method outperforms the conventional methods at all peptide lengths, and especially so for longer peptides. The bottom row also demonstrates the effectiveness of transfer learning.
In addition, Figures c and d show the AUPR (area under the precision-recall curve) and top-PPV, which are measures of model performance on imbalanced datasets. These confirm the validity of the model when the data are imbalanced.
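For reference, the sketch below shows how these metrics could be computed with scikit-learn on a class-imbalanced toy example. Top-PPV is implemented here under my own assumption as the precision among the top-k highest-scoring predictions, with k set to the number of true positives; the paper may define it differently.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def top_ppv(y_true, y_score):
    """Precision among the top-k predictions, k = number of positives (assumed definition)."""
    k = int(np.sum(y_true))
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

# Toy example with a 1:10 positive:negative imbalance.
rng = np.random.default_rng(0)
y_true = np.array([1] * 10 + [0] * 100)
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, size=y_true.size), 0, 1)

print("AUC    :", roc_auc_score(y_true, y_score))
print("AUPR   :", average_precision_score(y_true, y_score))
print("top-PPV:", top_ppv(y_true, y_score))
```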
Experimental results on melanoma data
The figure above shows the results of measuring Pearson correlation with several prediction tools on samples carrying various alleles, evaluated on a melanoma dataset (melanoma is a type of skin cancer for which immunotherapy is being considered). The mean Pearson correlation coefficient (PCC) is plotted on the vertical axis, and all of the cell lines used in the experiment are derived from melanoma (a cell line is a population of cells cultured continuously for research purposes).
PCC is a metric that measures how well the predicted frequency of peptide-HLA binding correlates with the observed frequency. The paper reports that TripHLApan achieves high PCC across all peptide lengths and samples.
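PCC itself is straightforward to compute; the sketch below uses toy numbers standing in for observed versus predicted presentation frequencies.

```python
import numpy as np

# Toy stand-ins: observed vs. predicted presentation frequency per peptide.
observed  = np.array([0.12, 0.30, 0.05, 0.45, 0.08])
predicted = np.array([0.10, 0.28, 0.09, 0.40, 0.11])

pcc = np.corrcoef(observed, predicted)[0, 1]  # Pearson correlation coefficient
print(f"PCC = {pcc:.3f}")
```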
HLA is known to be classified into class I and class II according to function. The experiments described so far showed high performance on HLA-I, but as shown in the figure above, TripHLApan also achieves excellent AUC values for HLA-II. This suggests that the model may be particularly valuable for HLA-II, for which previous tools were validated only on a limited number of datasets and could not ensure sufficient prediction accuracy.
Summary
The prediction of HLA-peptide binding was found to be more accurate than with conventional methods by integrating multiple sources of information and encoding them in parallel after appropriate preprocessing based on the biological and statistical properties of the molecules, by combining the BiGRU architecture with an Attention module, and by applying transfer learning. This improvement is thought to come from the ability to exploit biological characteristics and sequence context information from multiple perspectives.
TripHLApan outperformed current state-of-the-art prediction tools on both general datasets and a melanoma (skin cancer) dataset, for both HLA-I and HLA-II.
One remaining issue is that not enough improvement has been observed for peptides of length 9, the most common length in HLA-I binding prediction. In the future, it will therefore be important to find ways to place more emphasis on 3D structure, which is currently not used during learning. Personally, I think it would also be important to make the target peptide length in transfer learning more flexible for the sake of generality, rather than fixing it to a pre-specified value (8 in this case).