Semi-supervised Cooperative Learning To Integrate Multiple Types Of Biological Data

Semi-supervised 19/02/2024

3 main points
✔️ Multi-omics data collectively refers to data on diverse molecules that shape living organisms, but there is little supervised data.
✔️ Proposes semi-supervised cooperative learning, which can improve performance by integrating multi-omics data even when part of it is unlabeled.
✔️ Maximizes the use of diverse data and achieves excellent prediction performance in the analysis of real data on aging.

Semi-supervised Cooperative Learning for Multiomics Data Fusion
written by Daisy Yi Ding, Xiaotao Shen, Michael Snyder, Robert Tibshirani
(Submitted on 2 Aug 2023)
Comments: The 2023 ICML Workshop on Machine Learning for Multimodal Healthcare Data. arXiv admin note: text overlap with arXiv:2112.12337
Subjects: Quantitative Methods (q-bio.QM); Genomics (q-bio.GN); Applications (stat.AP)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Have you ever heard of multi-omics data?

In biology, the study of each of the diverse kinds of molecules that make up living organisms has developed as its own "-omics" field: genomics, epigenomics, transcriptomics, proteomics, ..., radiomics. Collectively these fields are referred to as multi-omics, in the expectation that studying them across disciplines will deepen our understanding of living organisms.

These are just different ways of looking at the same organism, but ultimately what we really want to know is still the same thing: the organism. Thus, there is a need for data analysis technology that integrates different views of the same thing to improve analytical performance.

The paper I am going to explain here is about a technique called cooperative learning, in which different ways of looking at data can be coordinated to achieve more appropriate data analysis.

In the first place, however, collecting such a wide variety of data is difficult. Even when some data are collected, not all of them come with the labels (the objective variable, that is, the values to be predicted) needed for training.

Therefore, a cooperative learning method that can take advantage of semi-supervised data, which is a mixture of labeled and unlabeled data, has been proposed.

Although the application here is multi-omics data, using different types of data together is the same idea as multimodal learning (learning from different types of data such as language, images, and audio), a hot topic around large language models (LLMs), so the method is expected to have a wide range of applications.

In this article, I explain multi-omics data, the proposed method for analyzing it, and the results of its validation.

Multi-omics data

A conceptual diagram of the multi-omics data is shown in Figure 1.

As shown in the figure, the study of the molecules that make up living organisms is diverse. There are five studies listed in the figure alone: Genomics, Epigenomics, Transcriptomics, Proteomics, and Radiomics. Since data exist for each of these disciplines, there are at least five different types of data.

Each is briefly described below.

Genomics is the study of DNA, which contains genetic information. You have probably heard that intelligence is genetic, that the human genome has been decoded, or that the genomes of apes and humans are only slightly different. It is the study of the blueprints of living organisms.

Epigenomics is the study of the chemical modifications that control how genes are read from an organism's blueprint. For example, DNA, which has a helical structure, is wrapped around proteins called histones. One chemical modification of histones is trimethylation, denoted Me3. The presence or absence of such a modification near a gene affects whether or not that gene is read.

Transcriptomics is the study of RNA, which receives and carries protein creation instructions from DNA.

Proteomics is the study of proteins in living organisms.

Radiomics is the study of medical imaging such as MRI and CT images.

Integrating these disciplines allows us to comprehensively trace the process by which an organism's blueprint is read, its proteins are made, and it takes shape as a human organism. For a comprehensive understanding of living organisms, methods for integrating and analyzing data from these disciplines are needed.

Conventional methods: Early Fusion, Late Fusion

The goal here is multi-omics data fusion: integrating and analyzing a variety of biological data (multi-omics data). More specifically, the aim is to integrate different types of data to predict an outcome of interest.

There are two main approaches to such data fusion: early fusion and late fusion.

Early fusion

Early fusion concatenates several different data sets and trains a single predictive model on the result. Figure 2 shows its conceptual diagram.

As shown in the figure, suppose we have explanatory variables X related to genes (View X) and Z related to proteins (View Z). Early fusion learns y=f({X,Z}), predicting the objective variable y from the combined explanatory variables (Combined View).
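As a rough illustration of early fusion, the sketch below concatenates two views column-wise and fits one ordinary least-squares model. The function names and the random toy data are hypothetical and chosen only for this example; the paper itself uses a lasso model rather than plain least squares.

```python
import numpy as np

def fit_early_fusion(X, Z, y):
    """Fit a single least-squares model on the concatenated view [X, Z]."""
    XZ = np.hstack([X, Z])                       # Combined View
    coef, *_ = np.linalg.lstsq(XZ, y, rcond=None)
    return coef

def predict_early_fusion(coef, X, Z):
    """Predict y from both views with the jointly fitted coefficients."""
    return np.hstack([X, Z]) @ coef

# Hypothetical toy data: 50 samples, 3 "gene" features, 2 "protein" features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0, 0.5]) + Z @ np.array([0.3, 0.7])  # noiseless target

coef = fit_early_fusion(X, Z, y)
y_hat = predict_early_fusion(coef, X, Z)
```

Because the toy target is an exact linear function of both views, the fitted model reproduces it; with real data the concatenated matrix is usually far wider than it is tall, which is where regularization becomes necessary.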

Late fusion

Late fusion trains a separate prediction model on each of several data sets and then combines those models to make the final prediction. Figure 3 illustrates the concept.

As shown in the figure, first, a prediction model y=f_X(X) is learned to predict the target variable y from the explanatory variable X for genes (View X), and a prediction model y=f_Z(Z) is learned to predict the target variable y from the explanatory variable Z for proteins (View Z). Late fusion is then used to learn a prediction model y=f(f_X(X),f_Z(Z)) that combines f_X(X) and f_Z(Z) to predict y.
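The two-stage structure of late fusion can be sketched as follows. This is a minimal example under simplifying assumptions: every model is plain least squares, the helper names are hypothetical, and the combiner is fitted on the same samples as the per-view models, whereas in practice it would normally be fitted on held-out predictions.

```python
import numpy as np

def lstsq_fit(A, b):
    """Ordinary least-squares coefficients for the system A @ coef = b."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def fit_late_fusion(X, Z, y):
    """Stage 1: one model per view. Stage 2: combine the two predictions."""
    theta_x = lstsq_fit(X, y)                    # f_X, uses View X only
    theta_z = lstsq_fit(Z, y)                    # f_Z, uses View Z only
    P = np.column_stack([X @ theta_x, Z @ theta_z])
    w = lstsq_fit(P, y)                          # combiner f(f_X(X), f_Z(Z))
    return theta_x, theta_z, w

def predict_late_fusion(model, X, Z):
    theta_x, theta_z, w = model
    return np.column_stack([X @ theta_x, Z @ theta_z]) @ w

# Hypothetical toy data with signal in both views.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Z = rng.normal(size=(60, 2))
y = X[:, 0] + Z[:, 1] + 0.1 * rng.normal(size=60)

model = fit_late_fusion(X, Z, y)
y_hat = predict_late_fusion(model, X, Z)
```

Note that the second stage only sees the two prediction vectors, not the original features, which is why information about how X and Z relate to each other can be lost.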

Proposed method: Semi-supervised cooperative learning with matching penalty

In general, early fusion has the advantage that it can capture interactions between explanatory variables from different views, because it concatenates them. Its disadvantages are that the concatenated input to the prediction model becomes high-dimensional, and that any concatenated explanatory variables unrelated to y hinder prediction.

Conversely, late fusion models each view separately, so unrelated explanatory variables from one view do not degrade the model of another, and the input does not become higher-dimensional through concatenation. However, it risks missing interactions between the different data types.

Thus, early fusion and late fusion have their advantages and disadvantages.

Therefore, it is desirable to have a method that can adaptively adjust the balance between early and late fusion to suit the data. This is achieved through cooperative learning, as described in Technology Point 1.

Furthermore, extending cooperative learning to semi-supervised learning is Technology Point 2, which is the novelty of this paper.

Technology Point 1: Cooperative learning (matching penalty)

A conceptual diagram of cooperative learning is shown in Figure 4.

The min equation in the figure shows the loss function of the prediction model. Cooperative learning is a method of learning a prediction model (in this paper, a linear regression model is specifically considered) by optimizing the parameters of the prediction model to minimize this loss function.

The first term in this equation is the prediction error when the objective variable y is predicted by the sum of the model f_X(X) on the explanatory variable X and the model f_Z(Z) on the explanatory variable Z. With only this first term, that is, when ρ=0 so that the second term vanishes, the method is early fusion, because fitting the sum of two linear models is the same as fitting one linear model on the concatenation of X and Z. Because this term is the square of y - f_X - f_Z, expanding it yields the pairwise products of y, f_X, and f_Z, including the cross term f_X*f_Z. In other words, f_X and f_Z are fitted jointly rather than independently, so the interaction of X and Z is taken into account.

The second term is the square of the difference between f_X(X) and f_Z(Z), so it penalizes disagreement between the two predictions (the matching penalty). Given the basic premise of multi-omics data, that the views are just different ways of looking at the same thing, we want different views (explanatory variables) to make the same prediction, and this term embodies that directly. When ρ=1, the cross term f_X*f_Z that arises in the first term is canceled out by the matching penalty. The end result is then equivalent to fitting f_X and f_Z independently, each minimizing its own prediction error, which is a form of late fusion.
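The equation itself appears only in Figure 4. Written out under the assumption that it matches the description above (the factors of 1/2 are a scaling convention assumed here), the loss and its expansion are as follows. The expansion makes the role of ρ visible: the cross term carries a factor (1 - ρ), so it is present at ρ=0 and disappears at ρ=1.

```latex
\begin{aligned}
L(f_X, f_Z)
  &= \tfrac{1}{2}\,\lVert y - f_X - f_Z \rVert^2
   + \tfrac{\rho}{2}\,\lVert f_X - f_Z \rVert^2 \\
  &= \tfrac{1}{2}\lVert y \rVert^2
   + \tfrac{1+\rho}{2}\bigl(\lVert f_X \rVert^2 + \lVert f_Z \rVert^2\bigr)
   - y^\top f_X - y^\top f_Z
   + (1-\rho)\, f_X^\top f_Z .
\end{aligned}

% Special case \rho = 1: the cross term vanishes and the loss separates.
L(f_X, f_Z)\big|_{\rho=1}
  = \bigl\lVert \tfrac{y}{2} - f_X \bigr\rVert^2
  + \bigl\lVert \tfrac{y}{2} - f_Z \bigr\rVert^2 .
```

At ρ=1 each view is therefore fitted on its own (to y/2 under this scaling) and the final prediction is the sum f_X + f_Z.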

Thus, by changing ρ from 0 to 1, the model changes continuously from early fusion to late fusion. By determining ρ so that it fits the data well through cross-validation, we can achieve a proper balance between early fusion and late fusion.
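For linear models, the loss can be minimized as one ordinary least-squares problem by stacking the matching penalty as extra rows with target 0. The sketch below does this and then picks ρ by k-fold cross-validation. It is a minimal illustration only: the function names, the toy data, and the grid of ρ values are assumptions made for this example, and the lasso penalty used in the paper is left out.

```python
import numpy as np

def fit_cooperative(X, Z, y, rho):
    """Minimize 0.5*||y - X a - Z b||^2 + 0.5*rho*||X a - Z b||^2.

    The penalty is encoded as extra rows [-sqrt(rho) X, sqrt(rho) Z]
    whose target is 0, so a single least-squares solve is enough.
    """
    s = np.sqrt(rho)
    A = np.vstack([np.hstack([X, Z]),
                   np.hstack([-s * X, s * Z])])
    b = np.concatenate([y, np.zeros(X.shape[0])])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    p = X.shape[1]
    return coef[:p], coef[p:]

def select_rho(X, Z, y, rhos, n_folds=5, seed=0):
    """Return the rho with the lowest cross-validated squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_error = []
    for rho in rhos:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            a, b = fit_cooperative(X[train], Z[train], y[train], rho)
            pred = X[val] @ a + Z[val] @ b       # prediction is f_X + f_Z
            errs.append(np.mean((y[val] - pred) ** 2))
        cv_error.append(float(np.mean(errs)))
    return rhos[int(np.argmin(cv_error))], cv_error

# Hypothetical toy data: both views are noisy readouts of one latent factor u.
rng = np.random.default_rng(0)
n = 100
u = rng.normal(size=n)
X = np.outer(u, [1.0, 0.5, 0.0]) + 0.5 * rng.normal(size=(n, 3))
Z = np.outer(u, [0.8, 0.0]) + 0.5 * rng.normal(size=(n, 2))
y = u + 0.3 * rng.normal(size=n)

rhos = [0.0, 0.25, 0.5, 0.75, 1.0]
best_rho, cv_error = select_rho(X, Z, y, rhos)
```

With rho=0 the extra rows are all zero and the fit coincides with early fusion on [X, Z]; with rho=1 the two coefficient blocks decouple, matching the late-fusion end of the range.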

Theoretical analysis has shown that when a latent factor model (latent common structure) exists between different data, including a matching penalty can reduce prediction error.

Technology Point 2: Semi-supervised learning

A conceptual diagram of semi-supervised cooperative learning is shown in Figure 5.

Figure 5. semi-supervised cooperative learning

In the previous explanations, y was given for all rows of X. However, as shown in Figure 5, there is no y for some of the data (No Label in the figure).

To make use of this unlabeled data in training, semi-supervised cooperative learning adds the third term (Unlabeled Data) shown in Figure 5 to the loss function from Figure 4. This term is the matching penalty applied to the predictions made from the explanatory variables of the unlabeled data.

If the basic premise holds, that different views (explanatory variables) should make the same prediction, then models that satisfy it should be closer to the correct prediction. Including this penalty should therefore help prevent overfitting to the small amount of labeled data.
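The extra term can be sketched as additional rows in a stacked least-squares problem. This is a minimal example under stated assumptions: the third term is taken to be (ρ/2) times the squared difference between the two views' predictions on the unlabeled rows, with the same weight ρ as the labeled matching penalty; the function name and toy data are hypothetical; and the lasso penalty from the paper is omitted.

```python
import numpy as np

def fit_semi_supervised_cooperative(X, Z, y, X_unlab, Z_unlab, rho):
    """Least squares on labeled rows plus matching penalties on all rows.

    Minimizes 0.5*||y - X a - Z b||^2
            + 0.5*rho*||X a - Z b||^2              (labeled agreement)
            + 0.5*rho*||X_unlab a - Z_unlab b||^2  (unlabeled agreement).
    """
    s = np.sqrt(rho)
    A = np.vstack([
        np.hstack([X, Z]),                        # prediction error, needs y
        np.hstack([-s * X, s * Z]),               # agreement on labeled rows
        np.hstack([-s * X_unlab, s * Z_unlab]),   # agreement on unlabeled rows
    ])
    b = np.concatenate([y, np.zeros(X.shape[0] + X_unlab.shape[0])])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    p = X.shape[1]
    return coef[:p], coef[p:]

# Hypothetical toy data: 120 samples, only the first 20 have labels.
rng = np.random.default_rng(0)
u = rng.normal(size=120)
X_all = np.outer(u, [1.0, 0.5, 0.0]) + 0.5 * rng.normal(size=(120, 3))
Z_all = np.outer(u, [0.8, 0.0]) + 0.5 * rng.normal(size=(120, 2))
y_all = u + 0.3 * rng.normal(size=120)

n_lab = 20
X, Z, y = X_all[:n_lab], Z_all[:n_lab], y_all[:n_lab]
X_unlab, Z_unlab = X_all[n_lab:], Z_all[n_lab:]

theta_x, theta_z = fit_semi_supervised_cooperative(X, Z, y, X_unlab, Z_unlab, rho=0.5)
```

The unlabeled rows never touch y; they only constrain the two coefficient blocks to agree with each other, which is how they can regularize a model fitted on few labels.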

Evaluation results based on actual data

The paper evaluates the proposed method on transcriptomics and proteomics data on aging. The prediction model is the lasso, a well-known variable selection method based on linear regression.
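To show how the lasso performs variable selection, here is a small sketch that minimizes 0.5*||b - A θ||^2 + λ*||θ||_1. The solver is ISTA (proximal gradient descent), used purely as a simple stand-in; this article does not say which solver the paper uses, and the function names, the single shared λ, and the toy data are assumptions of this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward 0 by t; entries within [-t, t] become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=500):
    """Approximately minimize 0.5*||b - A theta||^2 + lam*||theta||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - b)     # gradient of the squared-error part
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Hypothetical toy data: only features 0 and 3 actually influence b.
rng = np.random.default_rng(0)
A = rng.normal(size=(80, 10))
true_theta = np.zeros(10)
true_theta[[0, 3]] = [2.0, -1.5]
b = A @ true_theta + 0.1 * rng.normal(size=80)

theta = lasso_ista(A, b, lam=5.0)
```

The soft-thresholding step is what sets coefficients exactly to zero, so the nonzero entries of theta act as the selected variables. The same routine could be applied to a stacked matrix and target if a matching penalty is encoded as extra rows.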

Six methods are compared: Separate Proteomics, trained using only proteomics data; Separate Transcriptomics, trained using only transcriptomics data; Early fusion, trained with the early fusion approach; Late fusion, trained with the late fusion approach; Cooperative learning, trained with supervised cooperative learning; and Semi-supervised cooperative learning, the proposed method.

The evaluation metric is MAE (mean absolute error); the smaller the MAE, the better the prediction accuracy.
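For reference, MAE is the average of the absolute differences between the true and predicted values. The numbers in the sketch below are made up for illustration and are not taken from the paper.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical true and predicted values; the errors are 2, 3, and 1.
mae = mean_absolute_error([50, 62, 71], [48, 65, 70])  # -> 2.0
```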

The results of the evaluation are shown in Table 1. (Incidentally, the label "Relative to Late Fusion" in Table 1 appears to be a mistake for "Relative to Early Fusion".)

Table 1. evaluation results based on actual data

As shown in the table, Early fusion and Late fusion are outperformed by Separate Proteomics and Separate Transcriptomics. In other words, the attempt to learn by integrating different types of data actually degraded performance.

In contrast, Cooperative learning outperforms Separate Proteomics and Separate Transcriptomics. Cooperative learning is able to improve prediction performance using different types of data.

Furthermore, the proposed method, Semi-supervised cooperative learning, which applies a matching penalty to predictions made using unlabeled explanatory variable data, shows the best results among the compared methods. In addition, the proposed method reportedly identifies factors involved in the aging process (which are considered correct factors based on previous studies) that have not been identified by previous methods.

Conclusion

In this commentary, semi-supervised cooperative learning for multi-omics data analysis was explained.

Early fusion and late fusion have been proposed for analysis that integrates different types of data, but in some cases, the use of different types of data degraded prediction performance.

Cooperative learning therefore improves performance with different types of data by determining the appropriate balance between early and late fusion from the data.

Furthermore, since labeled multi-omics data are difficult to collect in practice, the paper proposes semi-supervised cooperative learning, which can also use unlabeled multi-omics data. Using the unlabeled data yields a further performance improvement.

The formulation is simple and carries over to multimodal learning and semi-supervised learning, both hot topics in recent years, and it also helps in grasping the essence of both. It looks like a method that is easy to adopt in practice.