From Omics Data, Generate Images! Proposed Image Generation Method For Cancer
3 main points
✔️ With the rise of multifactorial diseases, in which multiple factors are intertwined, omics analysis of genes and proteins is attracting attention; however, the high dimensionality of the data makes accurate analysis difficult with conventional methods such as statistical analysis.
✔️ To address the high dimensionality of omics data, the authors focus on deep learning, especially algorithms from the field of image analysis, and propose OmicsMapNet, an approach that converts the data into 2D images using molecular features and annotations from databases.
✔️ The approach achieves higher accuracy than conventional methods on the classification of a cancer dataset (TCGA), especially for cancers of higher severity.
OmicsMapNet: Transforming omics data to take advantage of Deep Convolutional Neural Network for discovery
written by Shiyong Ma, Zhen Zhang
(Submitted on 14 Apr 2018 (v1), last revised 23 May 2019 (this version, v2))
Comments: Accepted by arXiv.
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Can the high dimensionality in omics data be eliminated by the introduction of image data?
In this paper, the authors analyze high-dimensional omics data by mapping it onto two-dimensional images based on molecular information accumulated in databases, aiming to apply deep-learning image-analysis techniques to large-scale, high-dimensional data. Omics analysis, the comprehensive analysis of biological molecules (genomics for genes, proteomics for proteins, and so on), is currently attracting much attention. Because the analysis of omics data must account for interactions within and between the layers of each omics, approaches that use machine learning to analyze and interpret the data automatically are gaining interest.
To overcome this high dimensionality, the authors investigate deep-learning algorithms, especially those from image analysis, and propose an image-transformation method. Specifically, image data is generated by constructing a treemap from gene expression levels using a cancer dataset and the KEGG database. This method is expected to resolve the high dimensionality of omics data and the resulting difficulty of analysis.
What is Omics?
First of all, I would like to briefly explain Omics because many of you may not be used to hearing about it.
Omics refers to the comprehensive study of biological molecules, and analyzing this information is said to aid disease prediction and drug discovery. There are multiple omics layers, such as genomics (genetic information), transcriptomics (RNA), proteomics (proteins), metabolomics (metabolites), and interactomics (protein-protein interactions), and they interact within and between layers. In current research, the first three layers, which correspond to the central dogma, are considered the mainstream and are being actively studied.
A key characteristic of omics data is that analyzing multiple layers from a network perspective can yield knowledge that cannot be obtained at any single level. In the medical field, omics data can be an effective approach for diseases that resist conventional analysis due to complex factors (e.g., cardiovascular disease) and for diseases whose developmental mechanisms remain unclear (e.g., cancer). In particular, many lifestyle-related diseases, whose patient numbers have risen rapidly in recent years, are called multifactorial diseases: they arise not from a single cause but from a combination of multiple factors, both genetic and environmental, and are therefore difficult to analyze and interpret accurately by targeting any single factor. Against this backdrop, understanding disease through multiple omics layers, from genomics to metabolomics (which is close to environmental factors), is expected to reveal interactions that a single layer would obscure, deepen insight into disease mechanisms, and ultimately lead to prevention and treatment. Moreover, because omics analysis targets factors spanning within and between layers, it is considered difficult to perform manually; to resolve this complexity, approaches that can automatically analyze multiple factors, such as machine learning and deep learning, are expected to become mainstream.
Current status and issues of previous research on omics analysis using machine learning.
While machine learning and deep learning are being applied to omics analysis, the high dimensionality of the datasets remains a recognized problem. Conventional methods focus on a single omics layer, so when multiple omics layers are combined, as in multifactorial diseases, the dimensionality grows rapidly, and handling it becomes the key to efficient analysis. Various approaches to this high dimensionality have been proposed. One is deep learning in the field of image analysis, whose convolutional processing can efficiently analyze large, high-dimensional data. However, this technique requires the input to follow an image format, whereas conventional omics datasets consist mainly of numerical data. This research therefore focuses on converting omics data into images so that such image-analysis techniques can be applied and the high dimensionality resolved.
Purpose of this study
The aim of this research is to transform high-dimensional omics expression data into two-dimensional (2D) images based on functional features, enabling the use of deep-learning image-analysis techniques for efficient analysis.
More specifically, two-dimensional image data is constructed from omics expression data using hierarchical mapping and the functional annotations of genes extracted from the KEGG BRITE database, which covers biological features, in particular the functional hierarchy of KEGG objects (http://www.kegg.jp/), using graph structures that include tree structures.
To validate the method, treemap images are created from the gene expression dataset of The Cancer Genome Atlas (TCGA). As preprocessing, genes with extremely low expression levels are filtered out (threshold: -5). The remaining gene names are matched to KEGG IDs, and when a KEGG ID corresponds to multiple genes in the data matrix, the gene with the highest average expression value is selected (see the figure below).
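The preprocessing step can be sketched as follows. The toy expression matrix, gene names, and gene-to-KEGG mapping below are hypothetical illustrations, not values from the paper.

```python
# Sketch of the preprocessing: filter low-expression genes, map gene
# names to KEGG IDs, and keep the highest-expressed gene per KEGG ID.
# The data and the mapping below are hypothetical.

def preprocess(expr, gene_to_kegg, threshold=-5.0):
    """expr: {gene_name: [log expression per sample]} -> {kegg_id: gene_name}"""
    kept = {}  # kegg_id -> (mean_expr, gene_name)
    for gene, values in expr.items():
        mean_expr = sum(values) / len(values)
        if mean_expr <= threshold:      # drop extremely low expression
            continue
        kegg_id = gene_to_kegg.get(gene)
        if kegg_id is None:             # skip genes without a KEGG annotation
            continue
        # When several genes share one KEGG ID, keep the one with the
        # highest average expression.
        if kegg_id not in kept or mean_expr > kept[kegg_id][0]:
            kept[kegg_id] = (mean_expr, gene)
    return {k: g for k, (_, g) in kept.items()}

expr = {
    "TP53":  [5.1, 4.8, 5.3],
    "TP53B": [2.0, 1.9, 2.1],     # shares a KEGG ID with TP53 (hypothetical)
    "LOWG":  [-6.2, -7.0, -5.9],  # filtered out by the threshold
}
gene_to_kegg = {"TP53": "K04451", "TP53B": "K04451", "LOWG": "K99999"}
print(preprocess(expr, gene_to_kegg))  # {'K04451': 'TP53'}
```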
Conversion of expression data of omics to treemap images.
To convert the omics expression data into a treemap image, KEGG BRITE was used to extract only the gene and protein information related to cancer. Genes were then assigned to the corresponding child nodes of the tree structure based on their KEGG IDs, yielding a five-layer hierarchical tree. Because a gene may carry multiple KEGG functional annotations, such genes appear at multiple positions in the tree. Next, a rectangular treemap was used to arrange the sample genes spatially in the 2D image: each rectangular unit represents one gene, and placing these units in the treemap generates an image of the tree structure. The treemap is generated with the Pivot method (Bederson, Shneiderman, and Wattenberg 2002). After mapping, the treemap is colored by the normalized expression level of each gene to make expression differences apparent. Specifically, for each sample, the highest expression level is mapped to red and the lowest to blue, with linear interpolation in between. The original treemap image was 1024*1024 pixels and was subsampled to 512*512 pixels before being input to the DCNN.
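The per-sample coloring described above can be sketched as a simple linear interpolation between blue (the sample's minimum expression) and red (its maximum). The specific RGB encoding below is an assumption for illustration, not the authors' exact color map.

```python
# Sketch of the per-sample coloring: the sample's lowest expression
# maps to blue, the highest to red, with linear interpolation between.
# The RGB encoding is an assumption for illustration.

def expression_to_rgb(value, lo, hi):
    """Map an expression value to an (R, G, B) tuple in 0..255."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)  # 0 = min, 1 = max
    return (round(255 * t), 0, round(255 * (1 - t)))   # blue -> red

sample = [0.0, 2.5, 5.0]                 # one sample's gene expressions
lo, hi = min(sample), max(sample)
colors = [expression_to_rgb(v, lo, hi) for v in sample]
print(colors)  # [(0, 0, 255), (128, 0, 128), (255, 0, 0)]
```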
Learning and Evaluation
To demonstrate the effectiveness of the OmicsMapNet approach, a comparative analysis was performed: on gene expression data without the 2D treemap transformation, the tumor-grade prediction accuracy of logistic regression and gradient-boosted decision trees (XGBoost) was compared against the proposed method.
In addition, to verify the validity of the learned CNN feature maps, the top 10% of weights in the maps are selected and compared with the generated images for pathway analysis.
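Selecting the top 10% of feature-map weights can be sketched as a simple percentile cut. The weight values below are hypothetical placeholders for a flattened feature map.

```python
# Sketch of selecting the top 10% of feature-map weights for
# comparison with the treemap image. The weights are hypothetical.

def top_fraction_indices(weights, fraction=0.10):
    """Return the indices of the largest `fraction` of weights."""
    k = max(1, int(len(weights) * fraction))
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:k])

weights = [0.02, 0.91, 0.15, 0.40, 0.88, 0.05, 0.73, 0.11, 0.33, 0.64]
print(top_fraction_indices(weights))  # {1}  (top 10% of 10 weights)
```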
Conversion of TCGA LGG&GBM gene expression data
This evaluation uses the KEGG database and the TCGA dataset to illustrate what the generated images look like.
The proposed method, OmicsMapNet, extracts the hierarchical structure of functional annotations from the KEGG BRITE hierarchy files, assigns genes to the corresponding child nodes, and constructs a treemap image. Initially, 20,330 genes were obtained from the gene expression matrix, and 17,715 remained after eliminating genes with extremely low expression levels. These genes were mapped to KEGG IDs, and treemap images were generated with OmicsMapNet for 7,095 genes, using a treemap layout (the spatial arrangement of genes, from a previous study) containing 10,772 gene rectangles (see the figure below). In the treemap, each rectangle represents one gene, and its color represents the normalized gene intensity. In this dataset, RNA-Seq analysis was performed on 667 samples, 607 of which were labeled with a WHO grade, a measure of cancer severity.
Learning and Prediction of Tumor Sample Grades by DCNN
The purpose of this analysis is to clarify the accuracy of training using Deep CNN (DCNN) on the generated images.
The DCNN used in this study (see the figure below) has three convolutional layers and two dense layers; it is trained with the generated treemap image as input and the WHO grade of the corresponding tumor sample as the label. Among the 607 TCGA LGG&GBM samples, the distribution of WHO grades II, III, and IV was 215, 239, and 153, respectively, and performance was evaluated with 10-fold cross-validation. The mean accuracy was 75.09% (95% CI: 70.38-79.79%), with a median of 74.35%. From the ROC curves (see the figure below), the average area under the curve (AUC) for G2 and G3 in this model was 0.86 and 0.83, respectively. In contrast, the average AUC for G4 was 0.99, indicating that G4 could be distinguished from G2 and G3 with higher accuracy.
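The 10-fold cross-validation setup over the 607 labeled samples can be sketched as below. The shuffling and round-robin fold assignment are illustrative; the paper does not specify the authors' exact split, and the model training itself is omitted.

```python
import random

# Sketch of 10-fold cross-validation over the 607 labeled TCGA
# LGG&GBM samples (grades II/III/IV = 215/239/153). The shuffling and
# fold assignment are illustrative, not the authors' exact split.

labels = ["G2"] * 215 + ["G3"] * 239 + ["G4"] * 153
indices = list(range(len(labels)))
random.seed(0)
random.shuffle(indices)

k = 10
folds = [indices[i::k] for i in range(k)]  # round-robin fold assignment

for fold_id, test_idx in enumerate(folds):
    test_set = set(test_idx)
    train_idx = [i for i in indices if i not in test_set]
    # train the DCNN on train_idx, evaluate on test_idx (omitted)

print([len(f) for f in folds])  # 607 samples -> 7 folds of 61 and 3 of 60
```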
Classification of Grade 2 and Grade 3 samples using OmicsMapNet, Logistic Regression, and Gradient Boosting Decision Trees.
The purpose of this evaluation is to compare and validate OmicsMapNet and related methods with respect to their Grade 2 (G2) and Grade 3 (G3) classification performance.
As baseline methods, logistic regression and gradient-boosted decision trees (XGBoost) were used, while the proposed OmicsMapNet used the DCNN architecture and training procedure described above with 10-fold cross-validation (see the figure below). The resulting average AUC values were 0.86 (proposed method), 0.79 (logistic regression), and 0.72 (XGBoost).
Because the input dimensionality is large relative to the number of samples (Grade 2: 215, Grade 3: 239), logistic regression and gradient-boosted decision trees (GBDT) were also applied to subsets of sampled genes to reduce overfitting. For each subset, genes were sampled 50 times, performance was measured with 10-fold cross-validation, and the mean and standard deviation of the AUC were plotted. These results show that OmicsMapNet classifies Grade 2 and Grade 3 samples more accurately than the other benchmark algorithms (see the figure below).
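The repeated-subsampling protocol for the baselines can be sketched as follows. The subset size and the `evaluate_auc` function are hypothetical placeholders: a real run would fit logistic regression or GBDT on the expression values of each gene subset and return the cross-validated AUC.

```python
import random

# Sketch of the benchmark protocol: sample a gene subset 50 times,
# evaluate a classifier on each subset, then report the mean and
# standard deviation of the AUC. `evaluate_auc` is a hypothetical
# placeholder for fitting LR/GBDT with 10-fold cross-validation.

def evaluate_auc(gene_subset):
    # Placeholder: a real implementation would train on the expression
    # values of `gene_subset` and return the cross-validated AUC.
    return 0.70 + random.random() * 0.10  # hypothetical AUC in [0.70, 0.80)

all_genes = [f"gene_{i}" for i in range(7095)]  # genes kept after KEGG mapping
subset_size = 100                               # assumed subset size
rng = random.Random(42)

aucs = []
for _ in range(50):                             # 50 repeated samplings
    subset = rng.sample(all_genes, subset_size)
    aucs.append(evaluate_auc(subset))

mean = sum(aucs) / len(aucs)
std = (sum((a - mean) ** 2 for a in aucs) / len(aucs)) ** 0.5
print(f"mean AUC = {mean:.3f}, std = {std:.3f}")
```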
To overcome the high dimensionality of omics data, the authors investigated the introduction of deep-learning algorithms, especially those for image analysis. The proposed method generates image data by constructing treemaps from gene expression levels in the TCGA cancer dataset and the KEGG database. For evaluation, a DCNN model was trained on the generated images, and its classification accuracy was compared with logistic regression and XGBoost. The results confirm the high classification performance of the proposed method, especially for high-grade cancers. They indicate that images combining gene expression levels with biological information from databases perform well in cancer classification, and that generating images combining other diseases and non-genetic information could have a wide range of applications.
On the other hand, since the dataset used in this study concerns cancer, it is unclear whether the method is effective for other diseases. Because the generated images are based on cancer-specific annotation information, the tree structure may differ for other diseases, and generality across target diseases may be lacking. To address this, evaluating the method on other datasets, especially for rare diseases strongly linked to genes, is expected to demonstrate its validity and robustness.