The LLM Revolution In Chemistry! ChemCrow, An Integrated Engine That Leverages External Tools

3 main points
✔️ ChemCrow Introduction and Purpose: ChemCrow is a new engine that makes large-scale language models specialized for chemistry tasks. It can be combined with specialized tools to broaden applicability, reduce barriers to accessing chemical knowledge, and provide chemistry automation tools for experts and non-experts alike.
✔️ Implementation and Outcomes: ChemCrow has demonstrated the ability to automate specific chemistry tasks such as drug design and material synthesis, achieving multiple results, including screening and synthesis of insect repellents, organocatalysts, and novel dyes.
✔️ Challenges and Future Prospects: ChemCrow has limitations that depend on the quantity and quality of the tools selected, but its capabilities could be greatly expanded in the future with the integration of language-based and image processing tools. With the release of the open source version, it is expected to be used for a wide range of research and development.

ChemCrow: Augmenting large-language models with chemistry tools
written by Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, Philippe Schwaller
(Submitted on 11 Apr 2023 (v1), last revised 2 Oct 2023 (this version, v5))
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (stat.ML)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Over the past few years, large language models have revolutionized various industries through the automation of natural language processing tasks. A prominent example is GitHub Copilot, which appeared in 2021, followed later by StarCoder; both offer code completion that can substantially increase developer productivity. While these advances are primarily based on the Transformer architecture, which is applicable to many natural language processing tasks, the limitations of large language models are also evident: they struggle with simple arithmetic and with chemistry problems. These weaknesses stem from the models' core design, which is to predict the next word.

One approach to address this is to augment large-scale language models with external tools and plug-ins for specific tasks. These specialized tools can increase the accuracy of large-scale language models and broaden their applicability in specific fields. In chemistry, AI systems have been deployed to solve specific problems, such as reaction prediction and molecule generation, but chemistry automation remains a challenging area. This is due to the nature of the experiments, lack of data, and limited range of tools.

These tools are often developed and run in isolated environments, so interoperability and integration pose a significant challenge for experimental chemists. To address this challenge, this paper proposes ChemCrow, a new chemistry engine that simplifies chemistry tasks. ChemCrow leverages specialized tools for tasks such as drug design and materials synthesis, and works by giving specific instructions to a large language model such as GPT-4. The system uses the appropriate tools in response to user-given prompts, keeps track of the current status of the task, and plans next steps.

This approach combines task-related tools with chain-of-thought (CoT) reasoning to help large language models perform more sophisticated reasoning on the way to a final solution. ChemCrow is accessible to chemistry experts as well as non-experts through its interface, reducing barriers to the dissemination of and access to chemical knowledge.

ChemCrow and its performance

ChemCrow finds a suitable molecule from a simple user input, such as "plan and execute synthesis of insect repellents" or "find and synthesize catalysts to accelerate Diels-Alder reactions," plans the synthesis, and executes it on RoboRXN, IBM Research's dedicated cloud-connected platform.

To do this, ChemCrow sequentially queries tools such as LitSearch/WebSearch, Name2SMILES, ReactionPlanner, and finally ReactionExecute, combining the information they return to solve the problem. This shows a large language model agent interacting with the physical world through synthesis planning and execution.
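The tool sequence described above can be pictured as a simple pipeline. The following is a minimal sketch under stated assumptions, not ChemCrow's actual code: every tool function is a hypothetical stub returning a canned value, and the fixed call order is a simplification, since in the real system the language model decides which tool to call next.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the tools named in the article. Each takes and
# returns a string, mirroring how an agent passes text between tools.
def lit_search(query: str) -> str:
    return "DEET"  # canned answer: a well-known insect repellent

def name2smiles(name: str) -> str:
    known = {"DEET": "CCN(CC)C(=O)c1cccc(C)c1"}
    return known[name]

def reaction_planner(smiles: str) -> str:
    return f"plan for {smiles}"

def reaction_execute(plan: str) -> str:
    return f"submitted: {plan}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "LitSearch": lit_search,
    "Name2SMILES": name2smiles,
    "ReactionPlanner": reaction_planner,
    "ReactionExecute": reaction_execute,
}

def run_chain(task: str, order: List[str]) -> List[str]:
    """Feed each tool's output into the next tool and record every step."""
    trace: List[str] = []
    value = task
    for tool_name in order:
        value = TOOLS[tool_name](value)
        trace.append(f"{tool_name} -> {value}")
    return trace

trace = run_chain(
    "find an insect repellent",
    ["LitSearch", "Name2SMILES", "ReactionPlanner", "ReactionExecute"],
)
```

Keeping the tools in a name-to-function registry is what lets a model select a tool by name from its text output.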

Standardized synthesis procedures are key to success. However, predicted procedures are not always directly executable on the RoboRXN platform. Typical problems include "Insufficient Solvent" and "Invalid Purification Action". Addressing these issues requires human intervention to correct the invalid action before attempting synthesis.

ChemCrow can autonomously query synthesis validation data from the platform and iteratively adapt the synthesis procedure (e.g., increase solvent volume) until the procedure is fully valid, without requiring human intervention. This example demonstrates ChemCrow's ability to autonomously adapt and successfully execute standardized synthesis procedures, alleviating laboratory safety concerns and allowing the robotic platform to adapt itself to specific conditions.
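The validate-and-adapt cycle can be sketched as a short loop. This is an illustration only: the `Procedure` fields, the 10 mL solvent threshold, the set of allowed purification actions, and the repair rules are all assumptions made for the example, since the article does not describe the platform's actual validation rules.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Procedure:
    solvent_ml: float
    purification: str

# Hypothetical validator standing in for the platform's validation feedback.
def validate(proc: Procedure) -> List[str]:
    errors: List[str] = []
    if proc.solvent_ml < 10.0:  # assumed threshold, for illustration only
        errors.append("Insufficient Solvent")
    if proc.purification not in {"filter", "extract"}:  # assumed allowed set
        errors.append("Invalid Purification Action")
    return errors

# Hypothetical repair rules: one targeted fix per error type.
def repair(proc: Procedure, error: str) -> Procedure:
    if error == "Insufficient Solvent":
        return Procedure(proc.solvent_ml * 2, proc.purification)
    if error == "Invalid Purification Action":
        return Procedure(proc.solvent_ml, "filter")
    raise ValueError(f"no repair rule for: {error}")

def adapt_until_valid(proc: Procedure, max_rounds: int = 10) -> Procedure:
    """Validate, fix the first reported error, and repeat until no errors remain."""
    for _ in range(max_rounds):
        errors = validate(proc)
        if not errors:
            return proc
        proc = repair(proc, errors[0])
    raise RuntimeError("procedure still invalid after max_rounds")

fixed = adapt_until_valid(Procedure(solvent_ml=2.0, purification="unknown-step"))
```

The `max_rounds` cap is a design choice for the sketch: it guarantees the loop terminates even if a repair rule never resolves an error.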

Human-computer interaction can have particularly beneficial results in the area of chemistry. In this area, decisions are often based on experimental results, and the execution of the experiment itself can be challenging and even beyond the capabilities of state-of-the-art autonomous labs. Here we show how such interactions can lead to the discovery of new chromophores.

In this example, ChemCrow was instructed to train a machine learning model to help screen a library of candidate chromophores. As shown in the figure below, ChemCrow is capable of reading, cleaning, and processing the data, training and evaluating the random forest model, and ultimately providing suggestions based on the given target absorption maximum wavelength of 369 nm and model.
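The last step of this workflow, suggesting a candidate from model predictions, can be sketched as follows. The candidate names and predicted wavelengths are placeholder values invented for the example, not results from the paper; in practice they would come from the trained random forest model.

```python
from typing import Dict, Tuple

TARGET_NM = 369.0  # target absorption maximum wavelength given in the article

def pick_closest(
    predictions: Dict[str, float], target_nm: float = TARGET_NM
) -> Tuple[str, float]:
    """Return the candidate whose predicted absorption maximum is closest to the target."""
    name = min(predictions, key=lambda k: abs(predictions[k] - target_nm))
    return name, predictions[name]

# Placeholder predictions (hypothetical), keyed by candidate label.
predicted = {"candidate_A": 412.0, "candidate_B": 365.5, "candidate_C": 330.0}
best, wavelength = pick_closest(predicted)
```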

The proposed molecule was subsequently synthesized and analyzed, confirming the discovery of a new chromophore. Its properties are close to the target: the measured absorption maximum wavelength is 336 nm, against the requested 369 nm.

In addition, the application of machine learning is expanding in chemistry, and many datasets and benchmarks have been developed. However, these benchmarks often do not accurately assess the ability of language models to solve chemistry-specific challenges. To fill this gap, we are working with chemistry experts to develop a new set of tasks to measure chemical problem-solving ability.

In this new approach, both ChemCrow and GPT-4 (the latter prompted to mimic chemistry expertise) are tested, and the results are assessed both by human experts and by a language model evaluator (EvaluatorGPT), which rates answer quality and gives detailed feedback for each solution.

The adequacy of a ChemCrow run is highly dependent on the quality of the tool and the reasoning process. For example, the ability to plan synthesis is enhanced by the evolution of the underlying synthesis engine, but improper reasoning and input can render even the best tools useless. With this in mind, chemistry experts evaluate each model based on the accuracy of the chemistry, quality of reasoning, and task completion.

The result is shown in the figure below.

On more complex tasks that require chemical reasoning, ChemCrow outperforms GPT-4, which works without tools. GPT-4 gives a good impression in terms of fluency and superficial completeness, but its information is often less accurate than it appears. While GPT-4 may have an advantage in providing answers based on training data, especially on easier tasks, ChemCrow consistently provides superior solutions across a range of objectives and difficulty levels, and is favored by chemistry professionals.

Furthermore, the difference in ratings by humans and EvaluatorGPT is noteworthy. Experts prefer and rate ChemCrow's answers highly, while the EvaluatorGPT, on average, rates GPT-4 as the superior model based on the fluency and superficial completeness of GPT-4's answers. This result suggests that it is difficult for language models to provide reliable ratings when their understanding of prompts is lacking, and that they are not suitable for benchmarking the ability of machine learning models on assessments where factuality plays an important role.

This research highlights the need for new evaluation methods with respect to the application of machine learning in chemistry and opens possibilities for accurately assessing the capabilities of language models in chemical problem solving.

Risk strategy

The implementation and use of large-scale language model-driven chemistry engines, such as ChemCrow, has the potential to support non-expert researchers by combining tools designed by different experts. While these automated platforms are subject to rigorous review by human operators and chemistry experts, it is essential to ensure responsible development and use of large-scale language model agents.

Safety standards worldwide restrict the use of chemical laboratories to chemists with prior training or equivalent qualification. Experiments based on the recommendations of large language model-driven chemistry engines could nonetheless lead to accidents and hazardous situations. Therefore, as shown in the figure below, ChemCrow follows a set of hard-coded guidelines: it checks whether the queried molecule is a known controlled chemical and retrieves other safety information. If the molecule is controlled, the run stops. If not, the run proceeds and the safety information is reused by the model to provide a more complete response that includes safety concerns for the proposed substances and well-founded recommendations on how to handle them safely.
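The stop-or-proceed behavior of these hard-coded guidelines can be sketched as a guard around the task. This is a hedged illustration: `CONTROLLED_CAS`, `ControlledChemicalError`, and `guarded_run` are hypothetical names, and the watch list contains a placeholder entry rather than any real controlled substance.

```python
from typing import Callable, Set

# Hypothetical watch list of CAS numbers; "000-00-0" is a placeholder,
# not a real CAS number.
CONTROLLED_CAS: Set[str] = {"000-00-0"}

class ControlledChemicalError(Exception):
    """Raised to stop a run that involves a controlled chemical."""

def guarded_run(cas_number: str, task: Callable[[str], str], safety_note: str) -> str:
    """Stop if the molecule is controlled; otherwise run the task and append safety info."""
    if cas_number in CONTROLLED_CAS:
        raise ControlledChemicalError(f"run stopped: {cas_number} is on the controlled list")
    result = task(cas_number)
    return f"{result}\nSafety: {safety_note}"

# 58-08-2 is the CAS number of caffeine, used here as a harmless example.
out = guarded_run("58-08-2", lambda cas: f"plan for {cas}", "wear gloves")
```

Raising an exception rather than returning a warning string means the downstream task cannot run by accident once a controlled chemical is detected.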

Through these integrated safety checks and expert review, ChemCrow also provides safety instructions and aims to keep its recommendations in line with safety standards and protocols.

Inadequate knowledge of chemistry in large language model-driven chemistry engines creates the risk of erroneous decision making and problematic experimental results. To mitigate this problem, we improve the engines' understanding of chemistry concepts by integrating expert-designed tools and improving the quality and scope of training data.

In addition, users are encouraged to critically evaluate the information provided and compare it to established literature and expert opinion. This further reduces the risk of relying on incomplete reasoning.

Addressing intellectual property issues is also critical to the responsible development and use of generative AI models such as ChemCrow. Clear guidelines and policies need to be established regarding potential infringement of synthesized chemical structures and materials, their anticipated uses, and proprietary information. By working with legal experts and industry stakeholders, appropriate steps can be taken to address these issues and protect intellectual property.

Addressing the potential drawbacks of ChemCrow and ensuring its safe and responsible application is critical to its success. Integrating expert tools, improving training data, and implementing effective mitigation strategies will minimize risk while maximizing positive impact on the chemical sector. As technology evolves, collaboration and vigilance among developers, users, and industry stakeholders can help address emerging risks and challenges and promote responsible innovation and progress in the area of large-scale language model-driven chemistry engines.

External Tools

ChemCrow uses OpenAI's GPT-4 as a large-scale language model. In addition, external tools are integrated via LangChain. The external tools used in this paper can be easily extended according to need and availability and are categorized as "general tools," "molecular tools," and "chemical reaction tools."

First, the "WebSearch" tool is designed to collect current and relevant information from the Internet. This is accomplished by executing a search query using the SerpAPI and extracting information from the first page of Google search results. Through this process, the language model has access to up-to-date information across a whole range of scientific topics.

Next, the "LitSearch" tool is dedicated to extracting information from scientific documents. This tool efficiently searches scientific papers and other documents to provide accurate and reliable answers to questions. This is accomplished by using OpenAI's embedding technology and the FAISS vector database to search documents and produce summaries of relevant passages.
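The retrieval idea behind this tool, embedding the query and the passages and then ranking passages by vector similarity, can be sketched without external services. In this sketch a toy word-count vector stands in for OpenAI embeddings and a brute-force scan stands in for the FAISS index; both substitutions are assumptions made to keep the example self-contained, and the summarization step is omitted.

```python
import math
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, passages: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k passages most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

passages = [
    "thiourea catalysts accelerate diels-alder reactions",
    "caffeine is a stimulant found in coffee",
    "random forests average many decision trees",
]
hits = search("catalysts for diels-alder reactions", passages, k=1)
```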

The "Python REPL" tool is also a standard tool in LangChain that lets the language model write and run Python code directly. This makes it easy to perform a wide range of tasks from numerical computation to data analysis to training AI models.

Finally, the "Human" tool enables more dynamic problem solving by allowing language models to interact directly with humans and receive instructions. This allows human intuition and judgment to be incorporated into the process, especially in difficult problems or when uncertainty is high.

We also utilize tools that enable analysis and manipulation at the molecular level. These tools can address a variety of challenges faced by researchers, from molecule identification to market price evaluation to structural similarity analysis.

The "Name2SMILES" tool quickly retrieves the SMILES (Simplified Molecular Input Line Entry System) representation of a molecule based on its molecular name or CAS number. This allows you to easily reference a variety of molecules, including common and IUPAC names such as caffeine and atorvastatin, for molecular analysis and manipulation. Database searches are conducted primarily using chem-space, supplemented as necessary by PubChem and OPSIN.

The "Name2CAS" tool identifies Chemical Abstracts Service (CAS) numbers from a variety of molecular representations (common names, IUPAC names, SMILES strings). It uses the PubChem database to convert molecules into unique CAS numbers, allowing researchers easy access to relevant information.

The "SMILES2Price" tool takes the SMILES representation of a molecule as input and reports whether it can be purchased and, if so, its lowest market price. It uses molbloom to check whether the molecule is available for purchase in the ZINC20 database and retrieves market price information via the chem-space API. This tool allows researchers to select the best molecules from an economic point of view.

The "Molecular Similarity" tool evaluates the structural similarity between two molecules using the Tanimoto similarity computed on ECFP2 molecular fingerprints. The resulting score quantifies how similar the molecules are, which is an important indicator when searching for potential analogs in drug discovery and chemical research.
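For two fingerprints A and B viewed as sets of "on" bits, the Tanimoto similarity is |A ∩ B| / |A ∪ B|. A minimal sketch follows, assuming fingerprints are already available as sets of bit indices; computing real ECFP fingerprints requires a cheminformatics toolkit and is outside this example, so the bit sets below are arbitrary illustrations.

```python
from typing import Set

def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)

# Two shared bits out of four distinct bits in total.
score = tanimoto({1, 2, 3}, {2, 3, 4})
```

The score ranges from 0 (no shared bits) to 1 (identical fingerprints).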

The "ModifyMol" tool is designed to explore the chemical space around a molecule by making structural modifications. It applies 50 medicinal chemistry reactions, following the principles of retrosynthesis and forward synthesis, in order to expand synthetic possibilities. Specifically, the SynSpace package is used to derive modified molecules from the SMILES representation of the input molecule through small local modifications.

The "PatentCheck" tool quickly checks whether a molecule is patented, using a library called molbloom that evaluates the patent status of a molecule via a Bloom filter. This tool provides an important step to avoid intellectual property conflicts, especially in the development of new compounds, and helps researchers to proceed with their research and development with confidence.
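A Bloom filter answers "possibly present" or "definitely absent" using a fixed-size bit array, which is why it suits fast lookups against very large molecule lists. The class below is a generic teaching sketch, not molbloom's implementation or API; the array size, the number of hashes, and the use of SHA-256 are arbitrary choices for the example.

```python
import hashlib
from typing import Iterator

class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        # One byte per bit position, chosen for clarity over compactness.
        self.bits = bytearray(size_bits)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

patented = BloomFilter()
caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # SMILES used purely as an example key
patented.add(caffeine)
```

Because a Bloom filter can report false positives, a "present" answer is best treated as a prompt for a proper patent search, while an "absent" answer is reliable with respect to the list that was loaded.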

The "FuncGroups" tool is designed to identify functional groups in molecules. It takes a SMILES representation of a molecule as input and uses predefined SMARTS patterns to confirm the presence of functional groups. This analysis provides valuable insight into the reactivity and properties of molecules, increasing the efficiency of scientific research and drug discovery.

The "SMILES2Weight" tool accurately calculates the molecular weight of a molecule from its SMILES representation, using the RDKit library to derive the molecular weight of a molecule based on an input SMILES string. This information is an important indicator during the synthesis planning and characterization stages and assists in the molecular design process.
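The underlying arithmetic is a sum of atomic weights over all atoms in the molecule. ChemCrow's tool uses RDKit and starts from a SMILES string; to stay dependency-free, the sketch below swaps in a simpler input, a molecular formula such as C8H10N4O2, and a small hand-written table of rounded atomic weights. It does not handle parentheses, charges, or elements missing from the table.

```python
import re
from typing import Dict

# Rounded standard atomic weights for a few common elements (g/mol).
ATOMIC_WEIGHT: Dict[str, float] = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
}

def formula_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula like 'C8H10N4O2' (no parentheses)."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total

caffeine_weight = formula_weight("C8H10N4O2")
```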

These tools enhance the molecular design, analysis, and evaluation process and help scientists make informed decisions more quickly and efficiently.

In addition, one of the most salient issues related to the development of tools such as ChemCrow is safety. One of the proposed risk mitigation strategies is to incorporate tools that allow large-scale language models to assess the potential risks of proposed molecules, reactions, and procedures. To achieve a safe research environment, we have implemented the following three safety tools:

The "ControlledChemicalCheck" tool uses the CAS number of a molecule to check it against a specialized list in order to pre-identify substances recognized as chemical weapons or their precursors. This automated check works when the user requests a synthesis method or experiment for a particular molecule, and immediately stops the operation if the corresponding hazardous substance is detected. This provides the user with critical safety information, allowing them to make safer decisions.

The "ExplosiveCheck" tool determines whether a molecule is explosive based on the Globally Harmonized System (GHS). It searches the PubChem database using the molecule's common name, IUPAC name, or CAS number and checks whether the molecule is classified as explosive. It is automatically invoked when a synthesis method is requested and, if necessary, provides warnings and error messages to help the user make appropriate safety-related decisions.

The "SafetySummary" tool provides a comprehensive safety overview of any molecule, covering four main aspects: operational safety, GHS information, environmental risks, and social impacts. The tool presents this information through a user-friendly interface. Where information is lacking, GPT-4 is used to fill it in while explicitly pointing out the gaps, so that users have safety data that is as complete and accessible as possible.

Through these tools, ChemCrow enhances safety. In addition, we also use the "Chemical Reaction Tool" to identify, predict, plan, and even execute chemical reactions.

NextMove Software's "NameRxn" tool leverages an extensive database of named reactions to identify and classify chemical reactions. By simply entering the SMILES form of a reaction equation, the reaction name and its classification code can be obtained, which facilitates understanding of the reaction mechanism and optimization of experimental conditions.

The "ReactionPredict" tool, driven by IBM Research's RXN4Chemistry API, predicts products from reactants with high accuracy. The tool uses a transformer model dedicated to predicting chemical reactions and retrosynthesis paths, mimicking the abstract reasoning done by chemists.

The "ReactionPlanner" tool plans multi-step synthetic processes and is also based on the RXN4Chemistry API. It translates reaction sequences into machine-interpretable actions, which are then translated back into natural language. The tool plays a key role in designing efficient pathways to synthesize compounds of interest.

"ReactionExecute" works directly with the robotic chemistry lab through ChemCrow to physically execute the planned synthesis. The process involves requesting a synthesis plan, executing it robotically, and even adapting to errors and warnings during execution. Finally, after user permission, the synthesis is initiated, returning a confirmation message upon success.

These tools streamline the entire process, from understanding the reaction to running the experiment.


This study presents the development of ChemCrow, a novel large-scale language model-driven framework for integrating computational tools in chemistry. By combining the advanced reasoning capabilities of large-scale language models with the expert chemical knowledge available from computational tools, ChemCrow is a precursor to chemistry-related large-scale language model agents that can interact with the physical world.

In fact, it has achieved multiple results, including the screening and synthesis of an insect repellent, three organocatalysts, and a new dye with target properties. ChemCrow also has the ability to autonomously solve a wide variety of chemical problems, from simple drug discovery to planning the synthesis of complex substances, and has the potential to become a ChatGPT-like chemistry assistant in the future.

Despite the limitations of the current results due to the quantity and quality of the selected tools, the potential for a broad range of tools not limited to the chemical field is enormous. The capabilities of ChemCrow could be greatly extended with the incorporation of language-based and image processing tools. Furthermore, while the selected evaluation tasks are limited, future research and development could extend and diversify these tasks to unlock the true potential of the system.

Evaluations by chemistry experts showed that ChemCrow outperformed GPT-4 in chemical factuality, reasoning, and completeness of answers. ChemCrow's advantage is particularly pronounced for novel and little-known tasks. On the other hand, while large language model evaluations tend to favor GPT-4, such evaluations are not always as reliable as human evaluations in assessing the true effectiveness of a model in chemical reasoning. This gap indicates a need for improved methods to more accurately assess the unique capabilities of systems like ChemCrow in solving complex real-world chemical problems.

Challenges exist in the evaluation process, but improvements in experimental design can increase the reliability of results. There are a variety of challenges, including the limitations of closed models and the difficulties of large-scale chemical logic testing, but despite these, systems such as ChemCrow serve as valuable assistants in chemistry laboratories and show promising capabilities and potential to address chemistry tasks in a wide range of fields.

The experiments performed in this paper can also be accessed through GitHub. An open source version of the ChemCrow platform is also available. You can access the experimental setup and details of the ChemCrow platform and use it for your own projects and research. This is expected to further facilitate advanced research and development in the prediction, planning, and execution of chemical reactions.
