
Fragrance Science, Proposal Of A Mixture Fragrance Prediction Model Using Graph Neural Networks
3 main points
✔️ Proposed and published a mixture-scent prediction model utilizing graph neural networks
✔️ Data collection and analysis of over 160,000 molecular pairs using the GoodScents dataset
✔️ Validates the model's highly accurate prediction performance and suggests new possibilities for scent design
Olfactory Label Prediction on Aroma-Chemical Pairs
written by Laura Sisson, Aryan Amit Barsainyan, Mrityunjay Sharma, Ritesh Kumar
(Submitted on 26 Dec 2023 (v1), last revised 5 Jun 2024 (this version, v2))
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Our daily lives are filled with food, beverages, hygiene products, and other items that rely on fragrances. However, designing fragrance molecules and creating a desired scent is very labor-intensive and time-consuming. In fragrance research, studies are underway to find explainable characteristics of fragrance molecules and use them to predict scents. However, the world of chemistry is vast: it is said to contain roughly 10^60 molecules, which means there are that many candidate molecules whose scents could be predicted.
In the past, researchers have characterized fragrance molecules based on specific molecular structures, such as aromaticity or particular functional groups. These methods have had some success in benchmarks such as the DREAM Olfaction Prediction Challenge. Now, however, Graph Neural Networks (GNNs) are being used, and predictive performance has improved significantly.
Recently, probabilistic methods and deep learning, rather than manual feature extraction, have become mainstream. New machine learning methods that represent molecules as graphs or text have emerged and are significantly advancing new drug and material development, including molecular property prediction and novel molecule design. In 2022, Lee et al. predicted fragrance labels with high accuracy using graph neural networks and constructed an "Odor Map" from the vector embedding of each molecule. These advances in techniques and datasets for predicting fragrance labels are allowing researchers to gain deeper insights into the relationship between fragrance and molecular structure.
However, research in this area has been limited to the prediction of single molecules. In practical applications, such as many food and hygiene products, molecules are most often used as mixtures, and the nonlinear, complex relationships within mixtures of fragrance molecules are not yet understood. In this paper, we propose a new technique that applies graph neural networks to generate vector embeddings of mixtures of fragrance molecules.
Until now, most graph neural networks used in chemistry have been specialized models for specific prediction tasks, requiring exploration of different architectures depending on the task. In this paper, we extend graph neural networks from the single-molecule prediction task to the prediction task for mixtures of molecules.
It is hoped that this paper will lead to a deeper understanding of the relationship between fragrance and molecular structure and open up new possibilities in fragrance design.
Datasets and Models
To build a dataset of mixtures of fragrance molecules, molecular structures (in SMILES format) and fragrance labels are collected from the GoodScents online chemistry repository. Approximately 3,500 molecules are catalogued on the GoodScents website, and each fragrance molecule has suggested complementary flavorants (blenders) with specific aromas. This allows one to find combinations of molecules that create unique fragrances. In this way, we have collected data on more than 160,000 pairs of molecules.
We use BeautifulSoup in Python to build a crawler that parses GoodScents' fragrance names, odor labels, and proposed blenders; entries whose SMILES are missing or cannot be parsed are excluded. Such invalid entries account for only about 0.05% of the data.
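As a rough illustration of this kind of crawler (the URL handling, tag ids, and CSS selectors below are invented for the example and are not GoodScents' real markup), the scraping step might look like:

```python
# Hypothetical sketch of a GoodScents-style crawler; the tag id and CSS
# classes below are assumptions for illustration, not the site's real markup.
import requests
from bs4 import BeautifulSoup

def scrape_molecule_page(url):
    """Extract SMILES, odor labels, and suggested blenders from one page."""
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")

    # Assumed markup: a <span id="smiles"> holding the SMILES string, plus
    # table cells for odor descriptors and links to recommended blenders.
    smiles_tag = soup.find("span", id="smiles")
    if smiles_tag is None or not smiles_tag.text.strip():
        return None  # skip entries with missing/unparsable SMILES (~0.05% of data)

    odors = [td.text.strip().lower() for td in soup.select("td.odor-label")]
    blenders = [a.get("href") for a in soup.select("a.blender-link")]
    return {"smiles": smiles_tag.text.strip(), "odors": odors, "blenders": blenders}
```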
All pairs of molecules in the database form a metagraph, where each node is a molecular graph and the edges between nodes carry the labels of the blends. To separate training and test data, the metagraph is divided into two components. Each component must contain data covering all labels, and the number of edges removed in the split is kept to a minimum so as to maximize the amount of available data.
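Conceptually, the metagraph can be represented with a standard graph library such as networkx, with molecules as nodes and blended pairs as labeled edges; the sketch below is an illustration of that data structure, not the authors' code:

```python
import networkx as nx

def build_metagraph(pairs):
    """pairs: iterable of (smiles_a, smiles_b, blend_labels) tuples."""
    G = nx.Graph()
    for smiles_a, smiles_b, labels in pairs:
        # Nodes are molecules (keyed by SMILES); each edge is one blended pair
        # whose attribute stores the odor labels of the blend.
        G.add_edge(smiles_a, smiles_b, labels=set(labels))
    return G

# Splitting the metagraph into two components keeps every molecule on one side
# only, so no molecule appears in both training and test pairs.
```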
The generated dataset contains 109 odor labels. Some pairs are unlabeled (marked "no label found for these"), and these have been removed. In addition, "anisic" has been replaced with the more general "anise", the stray "medicinal," (with a trailing comma) has been corrected to "medicinal", and "corn chip" has been replaced with "corn". These modifications result in a final total of 104 notes. In addition, data on single fragrance molecules were obtained from Leffingwell and GoodScents and integrated for use in testing the proposed model's transfer-learning ability.
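The label clean-up described above amounts to a small normalization table plus a filter for the placeholder entry; a minimal sketch (function and variable names are illustrative):

```python
# Normalization table taken from the clean-up described above.
LABEL_FIXES = {
    "anisic": "anise",          # replaced with the more general note
    "medicinal,": "medicinal",  # stray trailing comma corrected
    "corn chip": "corn",
}

def clean_labels(raw_labels):
    """Drop unlabeled pairs and normalize the remaining odor labels."""
    if "no label found for these" in raw_labels:
        return None  # unlabeled pair: removed from the dataset
    return sorted({LABEL_FIXES.get(lbl, lbl) for lbl in raw_labels})
```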
The metagraph is randomly partitioned into sets of molecules, which are then assigned to the training and test splits. This partitioning is repeated until every label appears in at least one training pair and at least one test pair. The Kullback-Leibler divergence between the label distribution of each split and the distribution over the whole graph is used to score candidate partitions, but the authors state that they prioritize maximizing the amount of available data over this similarity. In the end, 44,000 training pairs and 40,000 test pairs were obtained, and 83,000 pairs were discarded; of the 109 odor labels, only 74 appeared in enough molecules to meet the cutoff.
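A candidate partition can be scored by comparing the odor-label distribution on each side with the distribution over the whole metagraph; the following is a minimal sketch of such a Kullback-Leibler scoring step, assuming edges are stored as (molecule, molecule, labels) tuples (the paper's exact algorithm may differ):

```python
import numpy as np
from collections import Counter

def label_distribution(edges, all_labels):
    """Smoothed probability distribution of odor labels over a set of pairs."""
    counts = Counter(lbl for _, _, labels in edges for lbl in labels)
    probs = np.array([counts[lbl] for lbl in all_labels], dtype=float) + 1e-9
    return probs / probs.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

def partition_score(train_edges, test_edges, all_edges, all_labels):
    """Lower is better: both splits should resemble the overall label mix."""
    p_all = label_distribution(all_edges, all_labels)
    return (kl_divergence(label_distribution(train_edges, all_labels), p_all)
            + kl_divergence(label_distribution(test_edges, all_labels), p_all))
```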
Various experiments are then conducted on the separated training and testing components to identify the best model for odor prediction. The experimental procedure is outlined in the figure below. This creates a solid foundation for efficiently collecting and analyzing data and validating model performance.
Figures (a and b) show the nonlinear relationship between the properties of the molecules that make up a fragrance and those of their blends. The same molecules appear in both the single-molecule and blended datasets, but when combined, the molecules create new fragrance notes while other notes are weakened in the blend.
Figure (c) shows a sample of the densest region of the mixture metagraph. Here, 0.5% of the metagraph's nodes are visualized, with 7 training molecules (blue) and 7 test molecules (red). With an average degree of 6 and many pairs/edges per molecule, the metagraph is very dense and difficult to split.
Figure (d) visualizes an overview of the graph partitioning. The partitioning algorithm aims to maximize the number of available pairs without causing a shift in the label distribution.
Figures (e and f) provide an overview of the experiments. Figure (e) shows the overall optimization and learning pipeline used in this paper, and Figure (f) shows the 50:25:25 training/testing/validation split used to optimize the hyperparameters.
Figure (g) shows the graph neural network's prediction for a single odor molecule. Message-passing layers are applied to the molecular graph, followed by a readout layer and a multilayer perceptron (MLP) to predict the final labels. Figure (h) shows the MPNN-GNN prediction for a mixture pair. The two molecular graphs are treated as a single graph, and the readout layer and multilayer perceptron are applied as in Figure (g). Finally, Figure (i) illustrates the GIN-GNN prediction for mixture pairs. The molecular graphs are passed separately through the message-passing and readout layers and then combined in a multilayer perceptron.
In addition, various graph neural networks are trained to predict blended scent labels from pairs of scent molecules. The models we use here are derived from two main architectures.
First, we develop a model based on the Graph Isomorphism Network (GIN). This model generates an embedding for each molecule of a pair independently and combines these embeddings in the final stage to predict the blended pair's labels. Next, we develop a model based on the Message Passing Neural Network (MPNN). In this model, the structures of the two molecules are combined into a single graph before being fed into the message-passing layers.
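As a rough sketch of the GIN-style design (the layer sizes, pooling choice, and the `PairGIN` name are assumptions for illustration, not the authors' exact architecture), the two molecule embeddings are computed independently and only merged in the final MLP, e.g. with PyTorch Geometric:

```python
import torch
from torch import nn
from torch_geometric.nn import GINConv, global_mean_pool

class PairGIN(nn.Module):
    """GIN-style pair model: embed each molecule separately, then combine."""
    def __init__(self, in_dim, hidden=64, n_labels=74):
        super().__init__()
        self.conv1 = GINConv(nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, hidden)))
        self.conv2 = GINConv(nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                           nn.Linear(hidden, hidden)))
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_labels))

    def embed(self, x, edge_index, batch):
        # Message passing over one molecular graph, then readout to one vector.
        h = self.conv2(self.conv1(x, edge_index).relu(), edge_index).relu()
        return global_mean_pool(h, batch)

    def forward(self, mol_a, mol_b):
        z_a = self.embed(mol_a.x, mol_a.edge_index, mol_a.batch)
        z_b = self.embed(mol_b.x, mol_b.edge_index, mol_b.batch)
        # Pair-level prediction: concatenate the two embeddings, then an MLP
        # produces one logit per odor label (apply a sigmoid for probabilities).
        return self.mlp(torch.cat([z_a, z_b], dim=-1))
```

An MPNN-style variant would instead merge the two molecular graphs into a single (disconnected) graph before the message-passing layers and apply one shared readout, which is what later allows it to be reused unchanged for single molecules.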
These models have significantly improved the prediction accuracy of blends of fragrance molecules. Further improvements are expected to lead to more accurate prediction models in the future.
Experiment
To evaluate the predictive ability of each model, we use AUROC for odor labels. To compare results, micro-averages are calculated for all test data. We first evaluate the mixture label predictions: the MPNN-GNN achieves an average AUROC score of 0.77, while the GIN-GNN model achieves a score of 0.76. As a baseline model, we also generated 2048-bit Morgan fingerprints (MFPs) of radius 4 for each molecule pair, concatenated them and fed them into a logistic regression to predict odor labels for the mixture pairs.
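For reference, a baseline of this kind can be sketched with RDKit and scikit-learn; the exact preprocessing in the paper may differ, so treat the following as an illustrative reproduction of the described setup (2048-bit, radius-4 Morgan fingerprints concatenated per pair, one-vs-rest logistic regression, micro-averaged AUROC):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score

def pair_fingerprint(smiles_a, smiles_b, n_bits=2048, radius=4):
    """Concatenate the Morgan fingerprints of the two molecules in a pair."""
    fps = []
    for smi in (smiles_a, smiles_b):
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        fps.append(np.array(fp))
    return np.concatenate(fps)

# X: (n_pairs, 4096) stacked pair fingerprints; Y: (n_pairs, 74) binary labels.
# clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)
# print(roc_auc_score(Y_test, clf.predict_proba(X_test), average="micro"))
```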
We found that GIN-GNN predicted some labels very accurately but performed significantly worse than the baseline on others. In contrast, MPNN-GNN performed consistently well across all labels.
We also evaluated the models' performance on a single-molecule prediction task. To adapt the GIN-GNN model to this task, we generated a graph-level embedding for each molecule and trained a logistic regression classifier to predict the same 74 scent labels. Since the graph-level embedding and the original pair-level embedding have different dimensions, the MLP portion of the architecture was not transferable. For MPNN-GNN, no changes were required other than feeding a single molecule into the message-passing phase, so the entire trained architecture could be reused.
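In outline, reusing the trained encoder amounts to extracting graph-level embeddings and fitting a fresh linear classifier on top; the sketch below assumes the hypothetical `PairGIN` class from the earlier example and standard scikit-learn tooling:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

@torch.no_grad()
def graph_embeddings(model, loader):
    """Run only the shared encoder; the pair-level MLP is not reusable here."""
    model.eval()
    embs = [model.embed(batch.x, batch.edge_index, batch.batch).cpu().numpy()
            for batch in loader]
    return np.vstack(embs)

# Z_train = graph_embeddings(trained_pair_gin, single_molecule_train_loader)
# clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Z_train, Y_train)
```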
For the single-molecule task, MPNN-GNN achieved an average AUROC score of 0.89, while the GIN-GNN and Morgan fingerprint models achieved scores of 0.85 and 0.82, respectively. The fact that all models scored significantly higher on the single-molecule task than on the blended pair task suggests that predicting labels for blended pairs is much more difficult than for single molecules. The authors also suggest that a possible reason for the wider performance gap between MPNN-GNN and GIN-GNN on this task is that the prediction layer of GIN-GNN could not be reused.
Summary
This paper proposes a model that utilizes a graph neural network to accurately predict the nonlinear and complex properties of mixtures of aroma molecules. We show that this graph neural network can be used not only for the task of predicting mixtures of molecules but also for the conventional single-molecule task. The model is also available on GitHub to stimulate further research in this area.
The authors of this paper state that their ultimate research goal is to create a model that can predict continuous labels for mixtures of fragrance molecules blended at various concentrations. They believe that this will contribute to the practical use of odor prediction in fields that deal with odors, such as food, pharmaceuticals, and hygiene products.
However, there is a lack of publicly available datasets on odor molecules needed to make this research feasible; even data on single molecules is still in short supply. Flavor and fragrance companies presumably have recipes for rich mixtures of molecules, but this information is naturally a trade secret and is not expected to be made public. Therefore, the authors of this paper also aim to address the lack of public datasets.
With the development of machine learning, the digitization of various types of perceptual information is being attempted. Among these, digitization related to smell lags behind and is considered difficult. As its wide range of applications shows, smell plays an important role in people's lives and preferences; smelling a favorite scent can help one concentrate or relax. It is hoped that richer publicly available datasets on smell, and the research that builds on them, will help to solve this problem.