[IGModel] Methodology Applying GNN+Attention Mechanism Improves Practicality In Drug Discovery
3 main points
✔️ IGModel, a deep learning model that can simultaneously predict the binding strength and the binding pose between proteins and drug-candidate molecules, is proposed
✔️ By learning physical interactions (how atoms interact with each other), IGModel improves on the performance of previous models
✔️ IGModel is robust to various types of data, including datasets containing new protein structures predicted by AlphaFold2
A New Paradigm for Applying Deep Learning to Protein-Ligand Interaction Prediction
(Submitted on 3 November 2023)
Comments: Published on bioRxiv
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Application of deep learning in drug discovery (discovery of therapeutic drugs) and challenges of conventional methods
Drugs (especially the class called inhibitors, which we deal with here) exert their therapeutic effects by binding to proteins that act harmfully in the body and altering those proteins' function. Therefore, in designing drugs, it is important to optimize the affinity and pharmacological properties of these bindings and to accurately predict protein-drug interactions.
In particular, in recent years, deep learning has increasingly been used to analyze these interactions. Prominent examples include AtomNet, Kdeep, and Pafnucy, which utilize CNNs, and OnionNet, which uses two-dimensional convolutional networks. However, these methods had a low success rate when actually docking proteins and drugs, limiting their practicality.
In addition, conventional models were unable to simultaneously represent RMSD, an indicator of how the protein binds to the drug candidate, and pKd, an indicator of the strength of their interaction. This meant the information could not be examined from multiple perspectives when evaluating drug candidates, and improvement was needed.
Novelty of this model
Therefore, a new model called IGModel was proposed in this paper. By utilizing the geometric information of the protein and the drug candidate that binds to it, this model enables the simultaneous measurement of RMSD (the binding-accuracy indicator mentioned earlier) and pKd (the binding-strength indicator mentioned earlier) within a single framework. IGModel increased the docking success rate on the CASF-2016 benchmark, the PDBbind-CrossDocked-Core and DISCO sets, and datasets containing structures generated by AlphaFold2, demonstrating improved practicality compared to conventional models.
Model Details
Overall Model
IGModel takes a protein, a drug candidate, and their binding site as inputs, and outputs RMSD and pKd, the two binding indicators mentioned above. It consists of an encoder block and a decoder block, the latter comprising an RMSD decoder and a pKd decoder.
The encoder block performs embedding in the latent space based on the input data, and is characterized by the introduction of the EdgeGAT layer described below to better reflect the interaction between the protein and the drug candidate in the model. The decoder block uses two learning modules that output each of the two types of indicators using information from the latent space.
Details of Graph Structure
The graph structure used in this paper represents information about the atoms that make up the protein and the drug candidate as nodes, and information about the interactions between the nodes as edges. The nodes are roughly classified into two types: VR nodes, which represent information about the atoms constituting the protein, and VL nodes, which represent information about the atoms constituting the drug candidate. Different information is embedded in VR and VL nodes.
Specifically, the following information is embedded in the nodes: VL nodes are embedded with information about the constituent elements of the drug candidate (e.g., one-hot vectorized nitrogen (N), carbon (C), and so on). VR nodes, on the other hand, are embedded with information about the chemical properties of the protein, such as which amino acid the atom belongs to, its element type, the magnitude of its charge, whether it is aromatic, and its distance to the alpha carbon.
On the other hand, the edges are embedded with information mainly about the chemical bonds that link atoms together, such as the type of bond (single bond or double bond), presence of a ring structure, steric configuration, and whether the bond is conjugated. In addition, information about the angle between the protein and the drug candidate, as shown in B in the figure above, is also embedded.
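As a rough illustration of the node features described above, the following sketch builds one-hot element vectors for ligand (VL) atoms and richer chemical-property vectors for receptor (VR) atoms. The function names, element list, and feature layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative element vocabulary; the paper's actual vocabulary may differ.
ELEMENTS = ["C", "N", "O", "S", "other"]

def one_hot(value, categories):
    """One-hot encode a categorical value, mapping unknowns to the last slot."""
    vec = np.zeros(len(categories))
    idx = categories.index(value) if value in categories else len(categories) - 1
    vec[idx] = 1.0
    return vec

def ligand_node_features(element):
    # VL node: element type of a drug-candidate atom, one-hot encoded.
    return one_hot(element, ELEMENTS)

def receptor_node_features(element, charge, is_aromatic, dist_to_ca):
    # VR node: element type plus chemical properties of a protein atom
    # (charge, aromaticity, distance to the alpha carbon).
    return np.concatenate([
        one_hot(element, ELEMENTS),
        [charge, float(is_aromatic), dist_to_ca],
    ])

print(ligand_node_features("N"))                      # one-hot for nitrogen
print(receptor_node_features("C", -0.3, True, 1.5))   # element + properties
```

The key point is simply that the two node types carry different feature sets, which is why the model processes them separately.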
Encoder Details
The encoder takes two graphs as input. The first (the upper-left graph in Figure A) contains information on the protein, the drug candidate, and the interrelationship between them, while the second (the lower-left graph in Figure A) represents only the interrelationships at the binding site in a graph structure, incorporating information on the three-dimensional structure and physicochemical properties.
In the EdgeGAT layer of the encoder, nodes and edges are updated as inputs arrive. The EdgeGAT layer is a type of graph neural network that uses an attention mechanism to integrate information when aggregating the features of a node and its neighboring nodes; it is a further development of the graph attention network (GAT) concept.
The EdgeGAT layer takes node and edge features as input and incorporates edge information into the feature representation. This allows node and edge features to be processed iteratively and in parallel with each other. As a result, the relationships between nodes and the attributes of the edges themselves can be properly utilized, and interactions can be modeled more appropriately than without this mechanism.
Note that in this model, the updates are performed twice, with a process called a message passing round introduced between the two updates. A message passing round refers to the transmission of information between nodes of different types.
The detailed mechanism is omitted here, but after the first update is applied to the VL and VR nodes mentioned earlier, the message passing round transfers information between VL nodes, between VR nodes, and between VL and VR nodes. This information transfer between the update rounds allows for a more accurate representation of the interactions between proteins and drug candidates.
After these two updates, interspersed with the message-passing round, three 1024-dimensional vectors are obtained, embedding information about the protein, the drug candidate, and their binding. The three feature vectors are then concatenated. This is the output of the encoder part and the input of the decoder part.
Decoder Details
The decoder part has two learning modules, each consisting of a gMLP layer and a linear layer. The decoder converts the output obtained from the encoder into two 128-dimensional vectors by passing it through the two learning modules.
The gMLP layer is a learning layer that extends the MLP and is a technique that has been used recently in natural language processing models. This layer makes it possible to achieve Transformer-like performance without the use of an attention mechanism.
A unique feature of gMLP is that there is a gating mechanism that is responsible for emphasizing or suppressing certain information. This allows for effective learning because it allows for dynamic determination of how the features of each location are conveyed to the next layer.
In this paper, the RMSD and pKd are output based on the representation obtained by each of the two modules in the decoder. The information is integrated as shown by the downward arrow in the decoder section so that changes in the RMSD can be reflected in the pKd. The pKd decoder also outputs the attenuation factor W, which indicates the attenuation of the value.
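One way to picture the coupling between the two decoders, where the pKd output is attenuated by a factor W that depends on the predicted binding pose, is the sketch below. The exponential form and the function name are purely illustrative guesses; the paper's actual attenuation function may differ.

```python
import numpy as np

def attenuated_pkd(pkd_raw, rmsd_pred, w):
    """Attenuate the raw pKd prediction as the predicted RMSD grows:
    a poorly docked pose should not claim strong binding.
    The exp(-w * RMSD) form here is an illustrative assumption."""
    return pkd_raw * np.exp(-w * rmsd_pred)

# A perfectly docked pose keeps its full pKd; a distant pose is penalized.
print(attenuated_pkd(7.0, 0.0, 0.5))   # 7.0
print(attenuated_pkd(7.0, 4.0, 0.5))   # much smaller
```

Whatever its exact form, the point of the downward arrow in the figure is this dependency: RMSD information flows into, and modulates, the pKd estimate.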
Experimental results
The figure above shows the results of an experiment on predicting protein-drug candidate binding using the CASF-2016 dataset. In A, the correlation between the model's predictions and the actual experimental data is analyzed using the Pearson correlation coefficient; in B, the ability to rank candidates' adequacy as ligands is evaluated using the Spearman correlation coefficient.
Note that the Pearson correlation coefficient measures the strength and direction of a linear relationship between two variables, while the Spearman correlation coefficient determines the correlation between variables based on the rank (ordering) of each variable's values. In addition, C and D compare the docking success rates of the models. Graphs A through D show that IGModel performs better than the other models.
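The difference between the two coefficients is easy to see in code: Spearman correlation is just Pearson correlation computed on ranks, so it rewards correct ordering even when the relationship is nonlinear.

```python
import numpy as np

def pearson(x, y):
    """Strength and direction of the *linear* relationship between x and y."""
    x, y = x - x.mean(), y - y.mean()
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

def spearman(x, y):
    """Pearson correlation computed on the *ranks* of the values
    (valid as written only when there are no ties)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = x ** 3                          # monotonic but nonlinear
print(round(pearson(x, y), 3))      # 0.951: linearity is imperfect
print(round(spearman(x, y), 3))     # 1.0: the ranking agrees perfectly
```

This is why the paper uses Pearson for agreement with measured affinities (panel A) and Spearman for ranking quality (panel B).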
This figure shows the Top1 success rate (left) and the TopN success rate (right) when using IGModel. The TopN success rate is used when there may be more than one valid candidate.
Note that Surflex, Glide, and Vina refer to the docking software used in this experiment. The experiment showed that IGModel significantly outperforms the conventional models in both the Top1 and TopN success rate metrics.
In Figures A~D above, the embedded representation in the latent space output by the EdgeGAT layer is color-coded: by actual RMSD in A, predicted RMSD in B, actual pKd in C, and predicted pKd in D. The first principal component resulting from the principal component analysis is shown on the horizontal axis and the second principal component on the vertical axis.
From this figure, we can see that as RMSD and pKd change (i.e., as the predicted binding accuracy and binding strength change), the points form layer-like patterns. For example, in Figures A through D, you can visually confirm that as the horizontal axis (first principal component) increases, the color (RMSD, representing accuracy) gradually changes from purple to green.
This visualization of the encoded latent space provides an intuitive, easily interpretable view of binding accuracy and binding strength.
Summary
In this paper, a new framework, IGModel, was proposed for predicting protein-drug candidate interactions. Using this deep learning model, it is possible to simultaneously predict the RMSD of the binding pose and the binding strength pKd of the drug candidate at its binding site.
Currently, the weights for the decay of RMSD and binding strength are set manually, but further improvement may be possible in the future by introducing a mechanism that learns the relationship between these two. The author of this article is interested in seeing how performance differs when using AlphaFold3, which was just announced in May 2024.