Evaluation Of The Performance Of LLMs That Can Understand The Geometric Structure Of Molecules
3 main points
✔️ Large-scale language models are inferior to existing machine learning models for molecular prediction tasks
✔️ Large-scale language models can be used as complementary tools to improve prediction accuracy
✔️ Limitations of large-scale language models for understanding molecular geometry need to be overcome
Benchmarking Large Language Models for Molecule Prediction Tasks
written by Zhiqiang Zhong, Kuangyu Zhou, Davide Mottin
(Submitted on 8 Mar 2024)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
In recent years, machine learning models have become increasingly popular in a variety of fields. Both academia and industry have put tremendous effort into improving the efficiency of machine learning, with the goal of achieving Artificial General Intelligence (AGI). In particular, tremendous advances in generative models such as large-scale language models (LLMs) have revolutionized the field of natural language processing (NLP). Large-scale language models have demonstrated an exceptional ability to understand and generate human-like text and have become indispensable in a wide variety of natural language processing tasks, including machine translation, common sense reasoning, and coding tasks.
A recent breakthrough, in-context learning (ICL), has further improved the adaptability of large-scale language models by acquiring task-specific knowledge during inference, reducing the need for extensive fine tuning. While large-scale language models have demonstrated their effectiveness in a variety of natural language processing applications, their full potential in other areas remains understudied. In particular, large-scale language models struggle with structured data such as graphs and face challenges with domain-specific queries such as biology and chemistry.
To fill this gap, this paper addresses the key research question, "Can large-scale language models effectively handle molecular prediction tasks?" To answer this question, the authors identify key tasks, including classification and regression prediction tasks, and conduct the investigation on six benchmark molecular datasets (ogbg-molbace, ogbg-molbbbp, ogbg-molhiv, ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo).
As shown in the figure below, molecules can be represented in a variety of representation formats, including SMILES strings and geometric structures. However, one of the major limitations of existing large-scale language models is their reliance on unstructured text, which prevents them from incorporating important geometric structures as input. To address this challenge, Fatemi et al. propose a way to encode graph structures into textual descriptions. In this paper, we extend this method to encode both the atomic properties and the graph structure of a molecule into a textual description, as shown in the figure below.
Next, a series of prompts are strategically designed to leverage the various capabilities of the large-scale language model (e.g., domain knowledge, ICL capabilities) to generate responses to the molecular task. These responses are then evaluated in terms of consistency and performance on downstream tasks and compared to those generated by existing machine learning models designed for molecular prediction tasks.
The study found that large-scale language models lack competitive performance compared to existing machine learning models, especially those specialized to capture molecular geometry. While ICL techniques can greatly improve the performance of large-scale language models, they still fall short of existing machine learning models, highlighting the limited ability of current large-scale language models to directly address molecular tasks.
Next, we explore the possibility of integrating the responses of large-scale language models with existing machine learning models and observe significant improvements in a number of scenarios. These results suggest that, at this time, it may be more effective to use large-scale language models as domain knowledge enhancers than to have them directly tackle the molecular prediction task. In addition, they provide insight into the limitations and promising approaches of existing large-scale language models in molecular tasks. It is hoped that this study will provide new insights into the design of an interdisciplinary framework for molecular tasks augmented by large-scale language models.
Method
The purpose of this paper is to evaluate the performance of a large-scale language model for handling challenging prediction tasks on structured molecular data in the field of biology. Molecules can be represented in a variety of formats, including SMILES strings and geometric structures. However, existing large-scale language models are limited by their reliance on unstructured text, which prevents them from incorporating important geometric structures as input. To overcome this limitation, Fatemi et al. propose a method to encode graph structures into textual descriptions. In this paper, we further extend this method to encode both the atomic properties and the graph structure of a molecule into a textual description. This allows for the incorporation of molecular properties that are important for different prediction tasks.
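The idea of encoding atomic properties and graph structure as text can be sketched as follows. The exact wording of the paper's template is not given here, so the sentence format is an assumption for illustration; real pipelines would typically extract atoms and bonds from a SMILES string with a toolkit such as RDKit.

```python
def describe_molecule(atoms, bonds):
    """Encode atom properties and graph connectivity as plain text,
    in the spirit of the paper's extension of Fatemi et al.'s
    graph-to-text encoding (the sentence template is an assumption)."""
    lines = []
    for i, (symbol, charge) in enumerate(atoms):
        lines.append(f"Atom {i} is {symbol} with formal charge {charge}.")
    for u, v, order in bonds:
        lines.append(f"Atom {u} and atom {v} are connected by a {order} bond.")
    return " ".join(lines)

# Toy example: formaldehyde's heavy atoms (hydrogens left implicit).
atoms = [("C", 0), ("O", 0)]
bonds = [(0, 1, "double")]
print(describe_molecule(atoms, bonds))
```

A description like this can then be appended to the SMILES string in a prompt, at the cost of a much larger token count.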
First, let us define the problem setup. A molecule G is represented as G = (𝑆, 𝐺, 𝐷), where 𝑆 is the SMILES string, 𝐺 is the geometric structure, and 𝐷 is the generated description of G's atomic properties and graph structure; 𝑦 ∈ Y denotes the label of G. Given the set of molecules M = {G1, G2, ..., G𝑚}, a subset M_T ⊂ M contains the molecules with known labels. The goal is to predict the unknown label 𝑦𝑢 for every G𝑢 ∈ M_test, where M_test = M \ M_T. Additionally, M_T is split into two subsets, M_train and M_val; M_train serves as the training set and M_val as the validation set. This separation allows for fine-tuning model parameters, reducing overfitting, and validating the machine learning (ML) model before applying it to the test set M_test.
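The partition of the labeled set into training, validation, and test subsets can be sketched as below. The 80/10/10 fractions are illustrative only; the paper itself uses the fixed scaffold splits shipped with the OGB benchmarks rather than a random cut.

```python
def split_molecules(molecules, train_frac=0.8, val_frac=0.1):
    """Partition the molecule set into M_train / M_val / M_test.
    Fractions are illustrative; OGB datasets provide fixed splits."""
    n = len(molecules)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (molecules[:n_train],
            molecules[n_train:n_train + n_val],
            molecules[n_train + n_val:])

mols = list(range(100))          # stand-ins for molecules G_1..G_100
train, val, test = split_molecules(mols)
print(len(train), len(val), len(test))  # 80 10 10
```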
The goal of prompt engineering is to format a question Q so that the large-scale language model (𝑓𝐿𝐿𝑀) returns a corresponding answer 𝐴. The aim of this paper is to provide the large-scale language model with useful and comprehensive knowledge about molecules so that it can make predictions on the test data set. There are several methods to improve the performance of large-scale language models, such as fine-tuning and LoRA, but these usually require access to the internals of the model and are computationally expensive, making them difficult to apply in real-world scenarios. Therefore, this paper targets a black-box setting in which 𝑓𝐿𝐿𝑀 and its parameters are fixed and only text input and output are available. This setting is especially important as the number of proprietary models grows and their hardware requirements increase.
The first set of prompts (IF, IP, IE) provides the SMILES string 𝑆 and description 𝐷 of the molecule and asks the large-scale language model to produce output in the desired format without prior training or knowledge of the task. The instructions to the large-scale language model provide only background information. In particular, IF asks the large-scale language model to provide meaningful insights useful for the prediction task.
IP seeks predictions about molecular properties, IE seeks further explanations, and a large-scale language model should clarify the process of explanation generation and provide useful evidence to understand the predictions. Complementing the IF, IP, and IE descriptions, the IFD, IPD, and IED prompts are also derived. While the descriptions provide more comprehensive information about the characteristics and structural information of the molecular graph, they generate a large number of tokens, which may affect the consistency and constraints of the responses in the large-scale language model.
The next prompt set (FS) provides a small number of examples of the task and the desired output, allowing the large-scale language model to learn from these samples and perform the task on new input. This method is classified as a simple in-context learning (ICL) technique; FS-𝑋 denotes a prompt containing 𝑋 in-context knowledge instances. The paper does not discuss FSD prompts, whose generated descriptions contain so many tokens that they may exceed the input constraints of large language models.
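An FS-𝑋 prompt can be assembled roughly as follows. The instruction wording and answer format are assumptions for illustration, not the paper's exact template.

```python
def build_fs_prompt(examples, query_smiles, k=2):
    """Assemble an FS-k style prompt: k labeled (SMILES, label) examples
    followed by the query molecule. Wording is an illustrative assumption."""
    parts = ["You are an expert in molecular property prediction. "
             "Answer with Yes or No."]
    for smiles, label in examples[:k]:
        parts.append(f"SMILES: {smiles}\nAnswer: {label}")
    parts.append(f"SMILES: {query_smiles}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_fs_prompt([("CCO", "No"), ("c1ccccc1", "Yes")], "CC(=O)O", k=2)
print(prompt)
```

Increasing k adds more in-context examples, which is what distinguishes FS-1, FS-2, and FS-3 in the experiments.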
Recently popular ICL techniques include Chain-of-Thought (CoT), Tree-of-Thought (ToT), Graph-of-Thought (GoT), and Retrieval-Augmented Generation (RAG). These can in theory support complex tasks and incorporate large amounts of contextual knowledge. However, early experiments showed that CoT, ToT, and GoT perform much worse on molecular property prediction tasks; the authors attribute this to the solid domain expertise needed to design appropriate reasoning chains. The RAG implementation was found to be unstable, slow to query, and inferior to FS in performance, which they attribute to the quality of the information retrieval system and plan to discuss in detail in a future study.
In addition, the paper introduces predictive models for generating predictions for the target molecule M𝑡𝑒𝑠𝑡. Here we discuss large-scale language models (LLMs), language models (LMs), and graph neural network (GNN)-based methods to capture molecular information in a comprehensive manner.
The large language model-based methods take as input the prompts generated according to the templates described above and generate responses according to the specified format. In particular, LLMSolo takes as input queries based on the IF, IP, IE, and FS templates, while LLMDuo takes as input queries based on the IFD, IPD, and IED templates.
Language models generate predictions based on available textual information. Examples include SMILES strings, descriptions, and responses provided by large language models. Experimental results show that the performance of language models using descriptions is not competitive with other settings. Therefore, two designs are employed in this paper: one with only SMILES strings as input (LMSolo) and one with SMILES strings and responses provided by a large-scale language model as input (LMDuo).
Graph neural network models are state-of-the-art methods in molecular property prediction tasks because they effectively capture geometric structural information of molecules. In addition, with the assistance of language models, available textual information can be converted into additional features and subsequently fed into the graph neural network model.
In particular, the flexibility of the language model allows textual information to be converted into embeddings, giving graph neural network models the flexibility to incorporate information from different perspectives. In this paper, we employ three designs, GNNSolo, GNNDuo, and GNNTrio, as shown in the figure below (reproduced).
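The Duo/Trio-style fusion can be sketched as concatenating a molecule's graph-level representation with a language-model embedding of the available text. Concatenation as the fusion operator and the vector dimensions are assumptions for illustration.

```python
import numpy as np

def fuse_features(graph_repr, text_emb):
    """GNNDuo/GNNTrio-style input fusion: concatenate the pooled graph
    representation with a text embedding (e.g. of the LLM's response).
    Concatenation is an illustrative choice, not the paper's stated one."""
    return np.concatenate([graph_repr, text_emb])

g = np.array([0.1, 0.4, 0.2])   # e.g. pooled GNN readout for one molecule
t = np.array([0.9, 0.3])        # e.g. DeBERTa embedding of an LLM response
fused = fuse_features(g, t)
print(fused.shape)  # (5,)
```

The fused vector is then passed to the downstream predictor, letting the GNN see information the text channel captured.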
Experiment
Here we present an empirical study and analysis to evaluate the effectiveness of large-scale language models in molecular prediction tasks. The analysis focuses on a particularly challenging molecular graph property prediction task.
First, we discuss the experimental setup. We use six benchmark molecular property prediction datasets commonly used in machine learning studies: ogbg-molbace, ogbg-molbbbp, ogbg-molhiv, ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo. A summary of the collected datasets is presented in the table below.
To investigate the effectiveness of large-scale language models in molecular prediction tasks, we also consider two categories of machine learning models: the first is a language model that takes only textual information as input, for which DeBERTa is used; the second is a graph neural network that captures the geometric structure of molecules along with other available features, for which two classic variants, GCN and GIN, are considered. These frameworks are illustrated in the figure below.
In this paper, we focus on situations where the parameters of the large-scale language model are fixed and the system is available in a black box setting. In this case, Llama-2-7b, Llama-2-13b, GPT-3.5, and GPT-4 are used as large-scale language models, with GPT-3.5 being the primary large-scale language model in most experiments. The data is taken from official APIs or official implementations.
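The black-box setting (text in, text out, no access to weights) can be sketched with a thin wrapper around any chat API. The retry-on-malformed-output logic and the Yes/No format check are assumptions for illustration, not details from the paper; the stub function stands in for an actual API call such as OpenAI's chat endpoint.

```python
def query_black_box(llm, prompt, max_retries=3):
    """Treat the LLM as a black box: send text, receive text.
    Retry when the response does not match the requested Yes/No
    format (retry logic is an illustrative assumption)."""
    for _ in range(max_retries):
        answer = llm(prompt).strip()
        if answer in ("Yes", "No"):
            return answer
    return None  # response never matched the requested format

# Stub standing in for a real API call; replace with the provider's client.
fake_llm = lambda p: "Yes"
print(query_black_box(fake_llm, "Is this molecule active against HIV?"))  # Yes
```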
The machine learning prediction models are implemented according to the official implementation. For example, we adopt the available code for the variant of the graph neural network model on the OGB benchmark leaderboard; for DeBERTa, we adopt the official implementation and incorporate it into the pipeline. For large language models, we invoke the API or official implementation provided by OpenAI with default hyperparameter settings.
Next is the workflow of the evaluation process. An overview is shown in the figure below. The traditional evaluation workflow evaluates the performance of the model in downstream tasks, but this time we also analyze the response consistency of a large language model.
Large-scale language models can hallucinate and may generate responses that deviate from user expectations. For this reason, we calculate the proportion of large-scale language model responses that follow the requested format (response consistency). To ensure fair comparisons, we employ the fixed split provided by Hu et al. This ensures consistency of evaluation conditions across different experiments and allows for meaningful comparisons between models.
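The response-consistency metric amounts to the fraction of responses matching the requested format. A minimal sketch, where the regex is an illustrative stand-in for the paper's actual format check:

```python
import re

def response_consistency(responses, pattern=r"^(Yes|No)\b"):
    """Fraction of LLM responses that follow the requested output
    format (the regex is an assumption for illustration)."""
    ok = sum(1 for r in responses if re.match(pattern, r.strip()))
    return ok / len(responses)

answers = ["Yes", "No", "As an AI model, I cannot...", "Yes, because..."]
print(response_consistency(answers))  # 0.75
```

A low consistency score means many responses had to be discarded or re-parsed before any downstream-task evaluation.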
An initial study uses the ogbg-molhiv dataset to compare various large-scale language models. Prompts are generated according to the templates (IP, IPD, IE, IED, FS-1, FS-2, and FS-3). As shown in the figure below, the GPT models outperform the Llama models on all evaluation metrics, demonstrating consistent performance.
This suggests that the GPT model is superior for molecular prediction tasks. However, we found that the GPT-4 API is 20 times more expensive to use than GPT-3.5 and 10 times slower in response time. Therefore, for performance and computational efficiency reasons, GPT-3.5 is used as the default large-scale language model in this paper.
The table below shows the results of the analysis on the six data sets. The results reveal that LLM is consistently inferior to the three ML models. This suggests that relying on large-scale language models as experts in molecular prediction tasks may be inadequate. They state that there is a need to understand the limitations of large-scale language models and explore alternative approaches to improve prediction accuracy.
Current large-scale language models rely on unstructured text, which limits their ability to incorporate molecular geometric structures as input. To address this limitation, Fatemi et al. propose a way to encode graph structures into text. However, the results in the table above show that adding descriptions to the prompts does not always improve performance, but rather degrades it. They attribute this to the fact that the additional tokens distribute attention and increase complexity in large language models.
The results in the table below (reproduced below) show that models that integrate geometric structure outperform those that do not. Existing large-scale language models have difficulty incorporating geometric information directly into prompts because the number of tokens in the generated explanations exceeds their constraints.
In this paper, we argue that addressing this challenge is important for future research. Possible solutions include token management techniques, sophisticated prompt engineering strategies, or alternative model architectures that can handle a wide range of input representations. This would allow large-scale language models to better capture the geometric complexity of molecules and improve their predictive ability in chemical modeling tasks.
In addition to using large-scale language models directly for molecular prediction tasks, we also explore the potential benefits of integrating them with existing machine learning models. Following the framework shown in the figure below (reproduced below), the input features of machine learning models such as graph neural networks are augmented with responses generated by large-scale language models.
The results in the two tables below show that introducing responses from large-scale language models as additional input features significantly improves prediction performance. This suggests that using responses generated by large-scale language models complements the information captured by traditional machine learning models and improves prediction accuracy. This hybrid approach represents a promising direction for advancing the state-of-the-art in molecular property prediction.
The table below shows the predictive performance for molecular graph properties on the six datasets (ogbg-molbace, ogbg-molbbbp, ogbg-molhiv, ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo), following the Duo pipeline. The classification tasks are evaluated by ROC-AUC (↑: higher is better) and the regression tasks by RMSE (↓: lower is better). The best performance of each model is underlined, and the best overall performance is shown in bold.
The table below also shows the predictive performance for molecular graph properties on the six datasets (ogbg-molbace, ogbg-molbbbp, ogbg-molhiv, ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo), this time following the Trio pipeline. The classification tasks are evaluated by ROC-AUC (↑: higher is better) and the regression tasks by RMSE (↓: lower is better). The best performance of each model is underlined, and the best overall performance is shown in bold.
Summary
This paper provides important insights into the ability of large-scale language models to handle tasks related to molecules. A comprehensive analysis of six benchmark datasets reveals that large-scale language models are less competitive on molecular prediction tasks than existing machine learning models specifically designed to capture molecular geometry. Furthermore, the potential for using large-scale language models as complementary tools was demonstrated, showing that integrating them with existing machine learning models can improve prediction accuracy. This suggests a promising way to effectively combine large-scale language models with traditional machine learning models.
This work highlights the current limitations of large-scale language models in tasks related to molecules, while opening new directions for future research. In particular, exploring innovative methodologies that better integrate large-scale language models with domain-specific knowledge and structural information has the potential to fill the observed performance gaps. In this paper, we provide a better understanding of the strengths and weaknesses of large-scale language models in tasks related to molecules and suggest informed strategies for their practical use in chemistry, biology, and related fields.
In addition to the molecular prediction task, there are many other promising directions for future research. In particular, it is important to address the limitations of large-scale language models in understanding the geometric structure of molecules. The inability of large-scale language models to capture the subtleties of such structures often leads to inaccurate results. Overcoming this limitation and enhancing the understanding of molecular geometric structures in large-scale language models is considered essential to broaden the applicability of large-scale language models in molecular tasks.
While the paper proposes a simple and effective framework for integrating large-scale language models with traditional machine learning models, there is room for further methodological refinement in this regard. Designing a sophisticated framework that seamlessly integrates large-scale language models with existing machine learning models would be a promising direction for future research and could lead to improved prediction performance and model interpretability.
The development of large-scale language models dedicated to molecules is also considered very important. Despite the inferiority of large-scale language models compared to baselines on many tasks, their ability to derive solutions from limited samples indicates the potential for generalized intelligence in the molecular domain. However, current large-scale language models exhibit significant hallucinations on chemistry tasks, suggesting room for improvement. Continued development of large-scale language models and research aimed at reducing hallucinations will be increasingly required to increase their effectiveness in solving real chemical problems.