CNN-based Models Are Now Available To Predict Protein Function Based On Chemical Bonds And Improve Performance

Medical 23/09/2024

3 main points
✔️ DeepSS2GO, a CNN-based model for predicting protein function, is proposed; it improves accuracy and reduces computation time compared with previous methods
✔️ Training integrates the amino acid sequence of the protein, homology information, and the secondary structure, a shape determined by chemical bonds
✔️ In experiments on a dataset covering six species, the model achieved high accuracy in predicting protein function across various domains

DeepSS2GO: protein function prediction from secondary structure
written by Fu V. Song, Jiaqi Su, Sixing Huang, Neng Zhang, Kaiyue Li, Ming Ni, Maofu Liao
(Submitted on 1 April 2024)
Comments: Published on bioRxiv

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Prediction Methods for Protein Function

Predicting the function of a given protein is extremely important for understanding life processes, preventing disease, and developing new drugs. In recent years, models have been developed that annotate the function of a protein based on its constituent units, namely the sequence of amino acids (called the primary structure), or on the three-dimensional structure of the protein (called the tertiary structure).

However, primary structures contain excessive information and show redundancy, which limits their ability to accurately predict the function of a protein from an unknown species. In addition, tertiary structure requires a huge computational cost to reflect the three-dimensional information in the learning process, making it difficult to perform analysis using large-scale data.

Therefore, the authors of this paper proposed DeepSS2GO, a model that learns from three sources of information: the sequence of amino acids (primary structure), structural features that arise from chemical bonds within the protein molecule (called the secondary structure), and protein homology. This model not only shows better prediction performance than conventional algorithms, but also reduces the amount of computation required.

Primary, secondary, and tertiary structures of proteins

To help you better understand the primary, secondary, and tertiary structures of proteins, this paper compares them to the building blocks you see in everyday life.

In the figure, how the building materials, such as fiber and stone, are arranged corresponds to the primary structure, the shape of the blocks made from these materials corresponds to the secondary structure, and the structure of a bridge, tower, or other structure composed of blocks corresponds to the tertiary structure.

The traditional model, "Model-aa", predicts function from features of the primary structure (fibers and gravel). However, it can be difficult to predict the pattern and function of the finished product, a bridge or tower, based solely on the arrangement of the fibers and gravel.

In contrast, the model proposed in this paper, "Model-ss8", predicts the function of the finished product (bridges and towers) from the secondary structure (block features). In other words, the authors claim that protein function can be predicted more accurately by effectively using the secondary structure (blocks of wood or stone) for learning.

The figure also shows that the model is trained on species A and makes its predictions on species B. The authors emphasize that training and predicting across different species in this way makes the predictions highly versatile.

Application of deep learning in predicting protein function

Methods for predicting protein function can be classified according to the sources of information or the algorithms used. Sources include primary structure, tertiary structure, and protein-protein interactions. Algorithms include sequence homology alignment, which does not use deep learning, and deep learning models (such as natural language processing models). In practice, a combination of both is commonly used.

In this experiment, protein function is represented using Gene Ontology (GO), a vocabulary that organizes terms for molecular functions, cellular components, and biological processes as a directed acyclic graph. Since the functions of proteins are not independent of each other but often partly overlap, a graph is used to represent the relationships between them.
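As a rough illustration of why a graph is useful here, the sketch below stores a tiny slice of GO as child-to-parent edges and expands an annotation so that it also includes every more general term. The four GO identifiers are real molecular-function terms, but the slice is a toy subset chosen for illustration, and the helper names `ancestors` and `propagate` are hypothetical, not taken from the paper.

```python
# Toy slice of the Gene Ontology: child term -> set of parent terms.
# Only a handful of molecular-function terms are included for illustration.
PARENTS = {
    "GO:0003674": set(),            # molecular_function (root)
    "GO:0005488": {"GO:0003674"},   # binding
    "GO:0005515": {"GO:0005488"},   # protein binding
    "GO:0003824": {"GO:0003674"},   # catalytic activity
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand a set of annotated terms so it also contains all ancestors."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded


# A protein annotated with "protein binding" is implicitly also a
# "binding" protein and has a "molecular_function".
full_annotation = propagate({"GO:0005515"})
```

Because a specific term implies its more general ancestors, a prediction for a deep node in the graph carries more information than a prediction for a node near the root.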

The model in this paper outputs, for each GO term (corresponding to GO1, GO2, etc. in the diagram of the model's structure below), a score between 0 and 1 indicating how likely the protein is to have the function described by that term.

Model Details

Model Structure

An overall view of the DeepSS2GO model is shown in the figure above.

During training, the data is first pre-processed: the primary structures of the proteins and their annotations are obtained and filtered. Note that in this experiment, protein sequences and annotations are collected from two datasets, SwissProt and CAFA3, and used as input data.

Next, the secondary structure is predicted from the primary structure using the "SPOT1D-LM" algorithm shown in the figure. Note that the primary structure is written with 20 different characters and the secondary structure with 8 different characters.

Each input is represented as a one-hot matrix with 1024 rows, one per position in the sequence. The matrix has 21 columns for the primary structure and 9 columns for the secondary structure, that is, one more column than the number of characters in each alphabet.
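The encoding described above can be sketched as follows. This is a minimal sketch under stated assumptions: the article does not say what the extra column is for, so here column 0 is assumed to be a catch-all for unknown symbols, rows beyond the end of the sequence are assumed to be zero padding, and the letters in `SS8_ALPHABET` follow the common DSSP 8-state convention rather than anything confirmed by the source. The function name `one_hot` is hypothetical.

```python
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
SS8_ALPHABET = "HGIEBTSC"              # assumed DSSP-style 8-state letters
MAX_LEN = 1024


def one_hot(sequence, alphabet, max_len=MAX_LEN):
    """Encode a sequence as a max_len x (len(alphabet) + 1) one-hot matrix.

    Column 0 is an assumed catch-all for unknown symbols; rows past the
    end of the sequence stay all-zero (assumed padding).
    """
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    index = {symbol: i + 1 for i, symbol in enumerate(alphabet)}
    matrix = [[0] * (len(alphabet) + 1) for _ in range(max_len)]
    for row, symbol in enumerate(sequence):
        matrix[row][index.get(symbol, 0)] = 1
    return matrix


aa_matrix = one_hot("MKTAYIAK", AA_ALPHABET)     # 1024 rows x 21 columns
ss8_matrix = one_hot("CCHHHHEC", SS8_ALPHABET)   # 1024 rows x 9 columns
```

Both matrices have the same number of rows, so the two models can share the same convolutional architecture and differ only in the number of input columns.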

The primary structure is processed by "Model-aa", which outputs a prediction score (Pred-aa in the figure). The secondary structure is processed by "Model-ss8", which outputs a prediction score (Pred-ss8 in the figure).

In addition, a tool called "Diamond" is used to assess the homology of the primary structure of the protein, and the score obtained from this step (Pred-bit-score in the figure) is output. (Note that Diamond is not a machine learning model; it is a sequence alignment method from traditional bioinformatics.)
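The article does not spell out how alignment bit scores are turned into a per-term prediction. One common approach in homology-based annotation transfer, shown here purely as an assumption, is to give each GO term the share of the total bit score that comes from similar proteins annotated with that term. The function name `bit_score_prediction`, the protein identifiers, and the numbers are all hypothetical.

```python
def bit_score_prediction(hits, annotations):
    """Turn alignment hits into per-term scores between 0 and 1.

    hits:        dict mapping a similar protein's id to its bit score
    annotations: dict mapping a protein id to its set of GO terms
    """
    total = sum(hits.values())
    if total <= 0:
        return {}
    scores = {}
    for protein, bit_score in hits.items():
        for term in annotations.get(protein, ()):
            scores[term] = scores.get(term, 0.0) + bit_score
    # Normalize so that a term shared by every hit gets a score of 1.
    return {term: value / total for term, value in scores.items()}


# Hypothetical alignment result for one query protein.
hits = {"P1": 300.0, "P2": 100.0}
annotations = {
    "P1": {"GO:0005515", "GO:0005488"},
    "P2": {"GO:0005488"},
}
pred_bit_score = bit_score_prediction(hits, annotations)
```

In this toy case the term carried by both hits scores 1.0, while the term carried only by the stronger hit scores 0.75.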

Next, the three types of prediction scores (Pred-aa, Pred-ss8, and Pred-bit-score), obtained from the primary structure, the secondary structure, and Diamond respectively, are integrated. Pre-specified parameters are used to integrate the three scores.
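The exact integration formula is not given in this article. As a minimal sketch, assuming the pre-specified parameters act as weights in a weighted average, the integration could look like the following. The function name `combine_scores` and the weight values are hypothetical.

```python
def combine_scores(pred_aa, pred_ss8, pred_bit, weights=(0.3, 0.3, 0.4)):
    """Weighted average of three per-term score dictionaries.

    The weights are assumed to be fixed in advance and to sum to 1, so
    the combined score stays between 0 and 1. A term missing from one
    source is treated as having a score of 0 for that source.
    """
    w_aa, w_ss8, w_bit = weights
    if abs(w_aa + w_ss8 + w_bit - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    terms = set(pred_aa) | set(pred_ss8) | set(pred_bit)
    return {
        term: w_aa * pred_aa.get(term, 0.0)
        + w_ss8 * pred_ss8.get(term, 0.0)
        + w_bit * pred_bit.get(term, 0.0)
        for term in terms
    }


# Hypothetical scores for one protein.
pred_aa = {"GO:0005515": 0.6}
pred_ss8 = {"GO:0005515": 0.8, "GO:0003824": 0.2}
pred_bit = {"GO:0005515": 0.5}
combined = combine_scores(pred_aa, pred_ss8, pred_bit)
```

A term supported by all three sources ends up with a high combined score, while a term proposed by only one source is strongly discounted.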

The input data passes through several convolutional layers with various kernel sizes and numbers of filters. The result is then passed through a max-pooling layer, and a sigmoid activation keeps the output within the range of 0 to 1. Note that early stopping is used during training to prevent overfitting.

Here, K represents the width of the kernel, and this experiment was conducted using various values of K.
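To make the data flow concrete, here is a small forward pass written with only the Python standard library. It is a sketch, not the authors' implementation: the kernel widths, the single filter per width, the use of global max pooling, the final dense layer, and the random weights are all assumptions made for illustration.

```python
import math
import random


def conv1d_global_max(matrix, kernel):
    """Slide one kernel (K rows x C columns) along the sequence axis and
    return the largest response (convolution, then global max pooling)."""
    k = len(kernel)
    best = -math.inf
    for start in range(len(matrix) - k + 1):
        response = sum(
            matrix[start + i][c] * kernel[i][c]
            for i in range(k)
            for c in range(len(kernel[i]))
        )
        best = max(best, response)
    return best


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(matrix, kernels, dense_weights, dense_bias):
    """One pooled feature per kernel, then one sigmoid score per GO term."""
    features = [conv1d_global_max(matrix, kernel) for kernel in kernels]
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(dense_weights, dense_bias)
    ]


rng = random.Random(0)
n_channels, seq_len, n_terms = 9, 32, 3   # tiny sizes for the demo
kernel_widths = [4, 8, 16]                # hypothetical values of K

# Random one-hot input standing in for an encoded secondary structure.
matrix = []
for _ in range(seq_len):
    row = [0] * n_channels
    row[rng.randrange(n_channels)] = 1
    matrix.append(row)

kernels = [
    [[rng.uniform(-1, 1) for _ in range(n_channels)] for _ in range(k)]
    for k in kernel_widths
]
dense_weights = [[rng.uniform(-1, 1) for _ in kernels] for _ in range(n_terms)]
dense_bias = [0.0] * n_terms

scores = forward(matrix, kernels, dense_weights, dense_bias)
```

Kernels of different widths respond to motifs of different lengths, which is one reason to run several values of K side by side.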

Experimental results

The figure above shows a score map of Fmax, the maximum over decision thresholds of the harmonic mean of precision and recall, which is used as an index for evaluating the model. The six rows and columns represent different species, including humans. For example, if the vertical axis is HUMAN and the horizontal axis is MOUSE, the human data is used as the test data and the mouse data as the training data.
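For readers who want to see how Fmax is obtained, the sketch below follows the protein-centric definition used in the CAFA evaluations: at each threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all proteins, and the best harmonic mean across thresholds is reported. The function name `fmax`, the threshold grid, and the toy data are assumptions for illustration.

```python
def fmax(predictions, truths, thresholds=None):
    """Maximum over thresholds of the harmonic mean of precision and recall.

    predictions: list of dicts mapping GO term -> score in [0, 1]
    truths:      list of sets of true GO terms, one set per protein
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for scores, true_terms in zip(predictions, truths):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with a prediction at t.
                precisions.append(hits / len(predicted))
            if true_terms:
                recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truths)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with two proteins and placeholder term names.
predictions = [{"A": 0.9, "B": 0.4}, {"A": 0.2, "C": 0.8}]
truths = [{"A"}, {"B", "C"}]
score = fmax(predictions, truths)
```

In this toy case the best threshold lies between 0.4 and 0.8, where precision is 1 and recall is 0.75.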

The darker the color of the score map, the higher the Fmax score (i.e., the better the model performance). Figures A to C show the evaluation results of Model-aa, which is based on the primary structure. Figures D to F show the evaluation results of Model-ss8, which is based on the secondary structure.

Specifically, A, D, and G are evaluated on molecular function, B, E, and H on cellular component, and C, F, and I on biological process. Figures G through I show how much better the model based on secondary structure (Model-ss8) performed compared to the model based on primary structure (Model-aa).

The darker the red color, the better the performance of the model based on the secondary structure. The results of this experiment indicate that using information on the secondary structure of a protein can improve the accuracy of the task of predicting protein function compared to using only primary structure information.

The table above compares the performance of DeepSS2GO with that of existing methods. The bottom row shows the results for DeepSS2GO.

In addition to the Fmax mentioned earlier, the evaluation measures used are AUPR, the area under the precision-recall curve, and Smin, the minimum semantic distance between the predicted and the true annotations.

The AUPR is a useful metric for imbalanced data sets. Because it is computed only from precision and recall, the large number of true negatives does not inflate the score, so it reflects how well the model identifies the minority of positives.

Smin, on the other hand, weights each GO term by its information content and combines two kinds of error: the true terms that the model missed (remaining uncertainty) and the incorrect terms that it predicted (misinformation). In other words, the smaller this value is, the closer the predictions are to the true annotations.

On these metrics (Fmax, AUPR, and Smin), DeepSS2GO was confirmed to perform better than conventional models.

The table above shows the results obtained when only some of the three modules (Model-aa, Model-ss8, and Diamond) are used. The best results are obtained when all three modules are combined.

When only two modules are used, the combination of Model-ss8 and Diamond tends to score highest, indicating that the deep learning model based on secondary structure and Diamond, a traditional bioinformatics approach, complement each other well.

The figure above shows the results of the function predictions made by the proposed method, DeepSS2GO. Each box contains a term describing a function, and a term in an upper box includes, in whole or in part, the terms in the boxes below it.

The colored circles at each node indicate which prediction methods were able to predict that function (i.e., a colored circle next to a box indicates that the corresponding method predicted that function).

The blue circles indicate the prediction results of the proposed method (DeepSS2GO), while the other colored circles indicate the results of existing methods. The figure shows that the proposed method can predict a wide range of protein functions, including the more specific functions in the lower layers.

Summary

DeepSS2GO improves performance in predicting protein function by reducing redundant information in the primary sequence and introducing a learning module that integrates secondary structure features.

The authors used a classical convolutional neural network to demonstrate the effectiveness of the secondary structure in this model, but they consider that the performance of the function prediction model could be further improved by using GNNs and self-supervised learning.

In addition, the algorithm used to predict the secondary structure from the primary structure excludes large proteins longer than 1024 amino acids. In the future, it will be important to introduce secondary structure prediction methods that can handle longer sequences.

Furthermore, the model could be applied to the prediction of polypeptide function in the broader elucidation of disease and in the discovery of drug targets. Personally, I thought that the model might have potential applications in the task of predicting whether or not a particular polypeptide structure is unique.
