CACTUS] A Drug Discovery Method That Combines LLM And Chemistry-based Tools

Large Language Models 25/11/2024

3 main points
✔️ CACTUS, an agent that leverages large-scale language models and cheminformatics tools to accelerate drug discovery and molecular property prediction research
✔️ Using large-scale language models, CACTUS performance was evaluated on a set of 1000 chemical questions and achieved significantly higher accuracy than the reference model
✔️ Provides innovation in the discovery and design of therapeutics, catalysts, and materials by integrating advanced computational techniques with models and improving ease of use and explainability

CACTUS: Chemistry Agent Connecting Tool-Usage to Science
written by Andrew D. McNaughton, Gautham Ramalaxmi, Agustin Kruel, Carter R. Knutson, Rohith A. Varikoti, Neeraj Kumar
(Submitted on 2 May 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Large-scale language models aretransformer-basedinfrastructure models that havebeen deployed in a variety of services and have attracted much attention.However, these transformer-basedlarge-scale languagemodels, although trained on large amounts of data, may not be accurate enough in certain areas. Current researchshows that tools that augmentlarge-scale languagemodels can compensate for these shortcomings and increase the efficiency of problem solving. It has also shown that providing prompts for specific tasks can improve the quality and speed of the text produced by the model.Combining these techniques isa framework called theToolAugmentedLanguage Model (TALM) proposed by Parisi et al.This framework achieves better performance than existing models on a set task.

However, it is also clear thatlarge language models, whileseeminglycorrect,have difficulty in ensuring thatthe responses generatedexhibit human-like reasoning and expertise. Errors based on statistics learned from the data by the model can be replicated similarly in different applications. If the underlying model is integrated into a critical system, its failure to do so can have a significant detrimental effect on users.

Large-scale languagemodelssuch as GPT-4, LLaMA, Gemma, MPT, Falcon, and Mistralhave improved performance in many areas, but their limitations become apparent when faced with challenges that require access to dynamic or sensitive data.This underscores the need forlarge-scale languagemodels to leverage external APIs to obtain real-time data and increase their usefulness in real-world applications.

The limitations oflarge-scale languagemodels are even more pronounced, especially in the fields of chemistry, biology, and materials science.The complex nature of chemical data combined with the dynamic context of drug discovery is considered a complex challenge that cannot be addressed by purely computational models. To address this problem,the integration oftools for handling chemical informationwith the cognitive and analytical capabilities oflarge-scale languagemodels is seen as an effective way to addressthis issue.

The technology considered at the forefront of this field is autonomous agents. These agents can utilize a variety oflarge-scale languagemodels forspecific tasksand use APIs and Internet search tools to gather relevant materials and data. For example, integrating agents into platforms that leverage tools such as KNIME and Galaxy can introduce a natural language interface between the user and the analysis. This greatly streamlines the process of scientific discovery and autonomous experimentation.

Inspired by ChemCrow, this paperdevelopsCACTUS(Chemistry AgentConnecting Tool Usage to Science), anadvanced cheminformatics agent to support de novo drug design and molecular discovery.This agent can optimize chemistry research and development workflows by properly determining the best tools for a particular task and the order in which they are applied.

Technique

TALM consists of two main components: external tools and language models. This section describes how to implement the language model agent and the tools used.

When building a TALM, the framework used to implement it is important. This paper utilizes LangChain, a commonly used open source platform. This frameworkuses a set of pre-built Python modules known as"Chain" to simplify the integration of prompts and large-scale language models. It also facilitates integration with popular large-scale language model hosting/inference platforms such as OpenAI API and HuggingFace Transformers.

CACTUS utilizes a custom LangChain implementation of the MRKL agent, which is divided into three parts: tools, LLMChain, and agent classes. The firsttool is a helper function for chemical informatics that utilizes the well-known Python libraryThe second, LLMChain, is a LangChain-specific feature that integrates tools and agents, and is a prompt provided to large language models when performing arbitrary inferences. This facilitates model initialization and parsing of user input; CACTUS provides prompts that explain the steps an agent takes to answer a cheminformatics question.

The third agent class is a function of the LangChain implementation, which interprets user input after the initial prompt and determines the best action to take to resolve the question.CACTUS uses the ReAct framework to determine which tool to use from the tool description, a zero shot It uses a generic implementation of the agent class.

The combination of this tool, LLMChain, and the Zero-Shot Agent allows CACTUS to quickly integrate new tools to create a scalable, large-scale language modeling tool capable of solving a wide variety of cheminformatics questions.

The figure belowshows the general workflow of a CACTUS agent, illustrating howa large-scale languagemodel interprets input to select the correct tool and answer.

Starting with user input, CACTUS follows a standard "chain of sort (CoT)" reasoning method with planning, action, execution, and observation phases to obtain output.

CACTUS also incorporates a wide range of tools that integrate common Python libraries such as RDKit and SciPy to create powerful large-scale language model agents that answer a wide variety of cheminformatics questions. It also provides interfaces to databases such as PubChem, ChEMBL, and ZINC. These tools allow for chat-basedmolecularanalysis,starting with SMILES strings, and extending to information such as molecular descriptors, similarity, absorption, distribution, metabolism, and excretion (ADME) attributes.

The model consists of 10 different tools that provide information on the various descriptors of the entered compounds. The table below lists the tools currently available. It helps to obtain the different physicochemical properties and molecular descriptors of the input compound. These include molecular weight, logarithm of partition coefficient (LogP), topological polar surface area (TPSA), quantitative estimation of drug-like properties (QED), and synthetic ease (SA).

In addition,ACTUScan also estimate pharmacokinetic properties such as C blood-brain barrier permeability and gastrointestinal absorption using the BOILED-Egg method. The model also implements drug-like, PAINS, and Brenk filters to identify structural and toxicity alerts. These tools can identify and screen both currently available and new lead compounds.

Currently, simple SMILES are used as input, but the authors plan to extend this to apply to a variety of user inputs in the future, including compound name, molecular formula, InChI key, CAS number, ChEMBL ID, and ZINC ID.

We also examine the importance of prompts to agents. The LangChain implementation of the large-scale language model agent provides a default prompt that also gives general instructions about the available tools and large-scale language model tasks. However, this is not necessarily optimized to understand domain-specific information, leaving room for improvement, and to test this hypothesis, two scenarios are run: one "minimal prompt" that includes only a description of the tools without modifying the default prompt, and the other a "domain prompt" in which the agent is more tailored to the chemical domain.

Domain-aligned prompts are thought to better interpret questions and increase the effectiveness of responses to user queries. Because we used extensive large-scale language models in our testing, the minimal prompts also included model-specific tokens to ensure that models were not unfairly evaluated against domain prompts.

Also, domain-specific TALMs are difficult to evaluate, but can follow the example of a typical benchmark suite. Thus, we rely on a set of questions that replicate the questions typically encountered by agents and evaluate whether the agent can answer them correctly without requiring additional prompting work by the user.

To evaluate CACTUS, we have created three sets of questions based on the output of the tool. The first set is qualitative questions, which return answers such as Yes/No or True/False; the second is quantitative questions, which return numerical values, which are then interpreted by the agent; the third is an "integrated set," which combines both qualitative and quantitative questions.

The table below shows examples of questions passed as user input to the CACTUS agent. The qualitative and quantitative datasets each contain 500 questions, while the integrated dataset contains 1000 questions. In order to test the ability of the large language modeling agents to perform a wide variety of tasks, most of the testing was done on the integrated dataset.

Experiments and Results

The implementation of CACTUS has contributed much to the field of cheminformatics, providing a powerful and flexible tool for researchers and chemists working in molecular discovery and drug design. Here," a benchmarking study of various 7b parameter models reveals the robustness and efficiency of CACTUS, highlighting its potential to streamline and accelerate the drug discovery process.

CACTUS performance is evaluated using a set of 1000 questions covering 10 different tools. Each 7b-parameter model is also evaluated with and without domain prompts.Responsesare scored as correct (Correct) andmarked asincorrect(Incorrect)if they are incorrect, if they fail to produce an answer, or if they fail to use the provided tool correctly.

This paper does not distinguish between the use of incorrect tools and simply wrong answers. Any failure to provide a consistent answer to a question is also considered incorrect. If additional formatted text containing the correct answer was included, it was acceptable, but this is not the preferred format. This additional information can be programmatically removed or reduced by designing additional prompts. Each type of question in the complete set of questions is asked 100 times, corresponding to 10 different questions for 10 different tools.

This method identifies tools that are more challenging for the model and finds areas for improvement in tool descriptions and model prompts.The results shown inthe previous figure illustratethe importance of domain-specific prompts in improving model response accuracy. This is especially true for qualitative questions. This is consistent with recent research highlighting the role that prompt engineering plays in improving the performance of language models.

In the advancement of AI and its application in scientific inquiry, it is important to analyze the comparative effectiveness of different models that handle domain-specific tasks.

The benchmark analysis presented in the figure below provides important insight into the performance of the different language models when prompted with minimal and domain-specific information.A comprehensive review of performance data across question types reveals that the Gemma-7b and Mistral-7b models exhibit robustness and versatility and perform well regardless of the nature of the prompt.

A comprehensive review of performance data across question types reveals that the Gemma-7b and Mistral-7b models exhibit robustness and versatility and perform well regardless of the nature of the prompt. These consistent accuraciesdemonstrate reliability fora wide range of queries within the molecular sciences, from physicochemical properties such as drug-likeness and blood-brain barrier permeability, to more complex measures such as quantitative drug-likeness estimation (QED).The Falcon-7b model, on the other hand, shows a marked performance difference between minimal and domain prompts. This variation suggests that Falcon-7b requires more detailed prompt tuning to effectively reach its potential. The large differences in performance based on prompt type indicate the sensitivity of the model to input structure and content, which is important in the development of effective query strategies.

Furthermore, as shown in the figure below, smaller models such as the Phi2 and OLMo-1b have shown superior performance on consumer hardware. This demonstrates the potential for democratizing access to powerful cheminformatics tools. This will allow researchers with limited computational resources to take advantage of the capabilities of CACTUS.

The results of this comprehensive model comparison and analysis indicate that there are broad implications for the use of open source models in scientific settings. The ability of models to perform well with domain-specific prompts is particularly promising and suggests that in the right setting, open source models can be highly effective tools.

The adaptability demonstrated by the Gemma-7b and Mistral-7b models demonstrates their broad applicability across a variety of computational settings, from high-performance clusters to more modest research environments. In addition, the ability to effectively prompt open source models allows their use in a variety of scientific contexts. This allows researchers to customize the models to their specific domain, potentially bridging the gap between general AI capabilities and domain expertise.

The flexibility and performance of these models also have a significant impact on scientific research, particularly in areas such as synthetic organic chemistry and drug discovery. For researchers in these fields, the ability to effectively use open source models can accelerate the discovery process, improve prediction accuracy, and optimize computational resources. The insights gained from this benchmarking study can provide a roadmap for selecting and tailoring models to meet specific research needs and maximize support for achieving scientific goals. The benchmark study of selected 7b parameter models demonstrates the advances in AI-driven research tools and highlights the need for prompt optimization and the promise of open source models in a variety of scientific inquiry. This analysis shows the potential for these models to become an integral part of the computational chemist's toolkit, paving the way for innovative breakthroughs in molecular design and drug discovery.

WhileCACTUShasalreadydemonstrated the ability toestimate basic metrics for input chemical compounds, the authors state that in the future they aim to evolve it into a comprehensive open source tool dedicated to the design and discovery of therapeutic agents. And to achieve this goal, the authors state that they plan to integrate the following features.

Physics-based molecular AI/ML model implementation
- They include 3D-scaffold, reinforcement learning, and graph neural networks (GNN). These models, in conjunction with molecular dynamics simulations, quantum chemical calculations, and high-throughput virtual screening, are essential for accurately modeling molecular interactions and predicting the efficacy and safety of therapeutic agents.
Implementation of advanced capabilities to identify compounds that exhibit structural and chemical similarities or fragments that are important for biological activity
- Researchers will be able to explore vast chemical spaces more efficiently and identify lead compounds with a high degree of accuracy. These additional capabilities will greatly improve the agent's ability to understand compound behavior in 3D space and can assist in the development of a comprehensive and effective workflow for therapeutics discovery and materials design.
Additional tools to identify important fragments and compounds with similar structures and chemical properties from extensive chemical databases
- Tools to calculate physicochemical and pharmacokinetic properties and about 60 other descriptors can be added to the agent to help identify quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR) to help screen compounds and identify toxicity groups.

In addition to these technical improvements, CACTUS also states that it aims to make CACTUS more capable of explaining and symbolic reasoning in orderto address a common criticism of large-scale language models: difficulty in reasoning and providing explainable output.By integrating advanced symbolic reasoning capabilities, CACTUS would become more powerful in its predictive and analytical functions, providing users with understandable and logical explanations for its recommendations and predictions. And it is hoped that this capability will automate the process of predicting how a drug candidate molecule will interact with a target, such as a protein, and provide valuable insight into the efficacy of a new compound.

The applications of CACTUS extend beyond drug discovery into other areas such as chemistry, catalysis, and materials science. In catalysis, CACTUS can predict the properties and performance of catalysts based on their structural and chemical properties and assist in the discovery and optimization of new catalysts. Similarly, in materials science, CACTUS can help design new materials with desirable properties by exploring the vast chemical space and identifying promising candidates for further experimental validation.

Future development of CACTUS is directed toward creating an intelligent and comprehensive scientific informatics tool for the discovery and optimization of therapeutics, as well as catalysts and materials. Through the integration of advanced computational techniques and models, and improvements in ease of use and explainability, CACTUS is expected to become an essential resource for the discovery of new, effective, and safe therapeutics, and for the optimization of catalysts and materials.

Summary

This paperintroducesa new open source agent,CACTUS, which leverages large-scale language models and chemical informatics tools to accelerate research in the areas of drug discovery and molecular property prediction. By integrating a variety of computational tools and models, CACTUS provides a comprehensive and easy-to-use platform for researchers and chemists to explore the vast chemical space and identify promising compounds for therapeutic applications.

We evaluated CACTUS' performance on a set of 1000 chemical questions using open-source large-scale language models such as Gemma-7b, Falcon-7b, MPT-7b, Llama2-7b, and Mistral-7b. The results show that CACTUSsignificantly outperformsthe referencelarge-scale languagemodels, with the Gemma-7b and Mistral-7b models in particular achieving the highest accuracy regardless of the prompting strategy used. In addition, the impact of domain-specific prompts and hardware configuration on model performance was investigated, highlighting the importance of prompt engineering and the potential for deploying small models on consumer hardware. shows the potential for broad adoption and increased accessibility of CACTUS.

CACTUS has the potential to revolutionize approaches to drug discovery, catalyst design, and materials science as it continues to be integrated with other computational tools and autonomous discovery platforms. future development of CACTUS will be directed toward creating intelligent and comprehensive cheminformatics tools that ensure high safety and efficacy in the identification and design of therapeutic drugs, catalysts, and materials. and comprehensivecheminformaticstools that ensure high safety and efficacy in the identification and design of therapeutics, catalysts, and materials . Through the integration of advanced computational techniques and models and improvements in ease of use and explainability, the authors aim to make CACTUS an indispensable resource for researchers in a variety of scientific disciplines.

Categories related to this article

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

CACTUS] A Drug Discovery Method That Combines LLM And Chemistry-based Tools

Summary

Technique

Experiments and Results

Summary

Libra] A New Multimodal Design Of Large Language Models Using Separate Vision Systems

Libra] A New Multimodal Design Of Large Language Models Using Separate Vision Systems

Construction And Analysis Of The "TruthEval" Dataset To Expose LLM Weaknesses

Construction And Analysis Of The "TruthEval" Dataset To Expose LLM Weaknesses

SportQA, A New Dataset That Measures The Comprehension Of Sports In Large Language Models

SportQA, A New Dataset That Measures The Comprehension Of Sports In Large Language Models

Proposal For A New Evaluation Method For AI Assistants Based On Human Preferences

Proposal For A New Evaluation Method For AI Assistants Based On Human Preferences

The Future Of Music Education, Flute X GPT And LAUI's Potential To Change Large-Scale Language Models

The Future Of Music Education, Flute X GPT And LAUI's Potential To Change Large-Scale Language Model ...

Prediction Of Handball Results For The 2024 Paris Olympics And Explanation Of The Basis For The Prediction Using LLM

Prediction Of Handball Results For The 2024 Paris Olympics And Explanation Of The Basis For The Pred ...