What Can Large-Scale Language Models Do In Chemistry? The Role And Potential Of LLMs In Chemical Research

3 main points
✔️ Exploring the application of large-scale language models to chemistry: the paper investigates the applicability and potential of large-scale language models in chemistry and develops the first benchmark that evaluates them on a wide range of practical chemistry tasks.
✔️ Evaluating the performance of large-scale language models: five models (GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica) are evaluated on eight tasks that address basic chemistry problems. The results highlight performance differences between generation tasks and classification/ranking tasks, and identify tasks where the models are competitive under certain conditions.

✔️ Potential and need for improvement of large-scale language models in chemistry: the results show the potential of large-scale language models in chemistry, but also indicate that further work is needed to raise their performance.

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
written by Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
(Submitted on 27 May 2023 (v1), last revised 28 Dec 2023 (this version, v3))
Comments: NeurIPS 2023 Datasets and Benchmarks Track camera-ready version
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


In recent years, large-scale language models have been gaining attention for their capabilities in a variety of fields. Starting with natural language processing, they are expanding their impact to a wide range of applications in the scientific domain. While research has been particularly active in areas such as medicine and scientific question answering, their application to chemistry has not yet been fully explored. What large-scale language models can do in chemistry is a fascinating question for both AI researchers and chemists. However, while these models could have a significant impact on the evolution of the field, many challenges remain.

In this paper, we have developed the first comprehensive benchmark that evaluates a wide range of practical chemistry tasks in order to identify the potential of large-scale language models in chemistry and to facilitate their advancement. This effort emerged from a collaboration between AI researchers and chemists and aims to explore the applicability of large-scale language models to a wide variety of chemistry tasks. It includes eight practical tasks that require different capabilities in understanding chemistry tasks, reasoning, and using domain-specific knowledge. These tasks address fundamental chemistry problems such as name prediction, property prediction, yield prediction, and reaction prediction.

The paper demonstrates how large-scale language models can be applied to a wide variety of chemistry problems, providing AI researchers with insights into the strengths and limitations of large-scale language models and chemists with an understanding of which tasks current large-scale language models are suitable for. It also highlights the importance of reliable assessments and standardized methods through the development of an experimental framework. It is hoped that this research will pave the way for the further use of large-scale language models in chemistry and accelerate research and development activities in this area.

Evaluation Process and Settings

The eight tasks were identified under the supervision of Professor Olaf Wiest of the Department of Chemistry at the University of Notre Dame (a co-author), in collaboration with doctoral students at the NSF Center for Computer Assisted Synthesis (C-CAS). Prompts for each task are then generated, evaluated, and selected before being sent to the large-scale language models. The collected responses are qualitatively evaluated by chemists for usefulness in real-world situations and quantitatively analyzed using the selected metrics. The workflow of the evaluation process is shown in the figure below.

The benchmark covers eight practical chemistry tasks and focuses on three basic competencies (understanding, reasoning, and explaining) to assess the chemistry-related ability of large-scale language models. The tasks are summarized in the table below in terms of task type, dataset used for evaluation, and evaluation metrics.

For all tasks, we evaluate performance using popular large-scale language models: GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica. For each task, we also use a standardized zero-shot prompt template that instructs the large-scale language model to act as a chemist.

We also designed task-specific ICL (in-context learning) prompt templates for each chemistry task to probe the capabilities of the large-scale language models in more depth. Each prompt instructs the model to play the role of a chemist, specifies the chemistry task together with its expected input and output, and supplies example input-output pairs.
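The structure of such a prompt can be sketched in a few lines of Python. This is only an illustration: the function name `build_icl_prompt`, the role sentence, and the `SMILES`/`Answer` labels are assumptions made for the example, not the wording of the templates used in the paper. Passing an empty example list gives the zero-shot form.

```python
from typing import List, Tuple


def build_icl_prompt(
    task: str,
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "SMILES",
    output_label: str = "Answer",
) -> str:
    """Assemble a role instruction, a task description, k demonstrations, and the query.

    All wording and labels here are hypothetical placeholders.
    """
    lines = ["You are an expert chemist.", task]
    # Each ICL example is shown as an input line followed by its known output.
    for example_input, example_output in examples:
        lines.append(f"{input_label}: {example_input}")
        lines.append(f"{output_label}: {example_output}")
    # The query uses the same layout but leaves the output empty for the model to fill.
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = build_icl_prompt(
        "Given a molecule, answer Yes if it is toxic and No otherwise.",
        [("CCO", "No")],
        "c1ccccc1",
    )
    print(demo)
```

Keeping the query in the same layout as the demonstrations is a common design choice, since it nudges the model to reply in the format of the examples.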

Two strategies are employed to explore the impact of ICL example quality and quantity on performance. The first is random selection; the second is a scaffold strategy that retrieves the examples most similar to the query molecule. Comparing the two shows how ICL examples are best selected.
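The difference between the two strategies can be sketched as follows. The sketch assumes that every molecule has already been converted into a set of fingerprint bit indices (producing those fingerprints, for instance with a cheminformatics toolkit, is outside its scope), and it ranks candidates by Tanimoto similarity. The function names and the toy fingerprints are hypothetical, and the paper's actual retrieval pipeline may differ in detail.

```python
import random
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_random(pool: Dict[str, Set[int]], k: int, seed: int = 0) -> List[str]:
    """Baseline strategy: pick k examples uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(pool), k)


def select_similar(query_fp: Set[int], pool: Dict[str, Set[int]], k: int) -> List[str]:
    """Similarity strategy: pick the k examples closest to the query."""
    ranked = sorted(pool, key=lambda name: tanimoto(query_fp, pool[name]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy fingerprints, invented for illustration only.
    pool = {"A": {1, 2, 3}, "B": {1, 2, 9}, "C": {7, 8}}
    print(select_similar({1, 2, 3, 4}, pool, k=2))
    print(select_random(pool, k=2))
```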

We also perform a grid search over the number of ICL examples (k) for each task type. An initial validation set is used to identify the best k and selection strategy, and the chosen configuration is then tested on 100 randomly selected test instances. Each evaluation is repeated five times, and the mean and standard deviation of the results are reported.
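A minimal sketch of this protocol is shown below. The `evaluate` callback stands in for "run the model on the validation set with this strategy and k, then return a score" and is hypothetical, as are the toy scores. The sketch uses the sample standard deviation; the article does not say which variant the paper reports.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Sequence, Tuple

Config = Tuple[str, int]  # (selection strategy, number of ICL examples k)


def grid_search(
    evaluate: Callable[[str, int], float],
    strategies: Sequence[str],
    ks: Sequence[int],
) -> Tuple[Config, Dict[Config, float]]:
    """Score every (strategy, k) pair on the validation set and return the best one."""
    scores = {(s, k): evaluate(s, k) for s in strategies for k in ks}
    best = max(scores, key=scores.get)
    return best, scores


def summarize(repeats: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated test runs."""
    return mean(repeats), stdev(repeats)


if __name__ == "__main__":
    # Toy validation scores, invented for illustration only.
    toy = {("random", 4): 0.3, ("random", 8): 0.4, ("scaffold", 4): 0.5, ("scaffold", 8): 0.6}
    best, _ = grid_search(lambda s, k: toy[(s, k)], ["random", "scaffold"], [4, 8])
    print(best)
    print(summarize([0.58, 0.61, 0.60, 0.59, 0.62]))
```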

In this way, we comprehensively assess the ability of large-scale language models to solve chemical tasks and validate their effectiveness quantitatively and qualitatively.

Experimental Analysis

Here we present the key findings of the benchmark analysis, which clarify the limitations of large-scale language models and show how their performance varies across chemistry tasks.

Can Large-Scale Language Models Outperform Existing Models for Chemical Tasks? Many traditional machine learning-based predictive models exist for chemical tasks. For example, the graph neural network-based MolR has been developed for the binary classification problem of predicting molecular properties, UAGNN achieves state-of-the-art performance in yield prediction, and the T5-based MolT5-Large specializes in translating between molecules and text. The paper compares the performance of the GPT models with these existing baselines and identifies the following key findings:

  • Performance advantage: GPT-4 outperformed the other large-scale language models evaluated across the eight tasks.
  • Task-dependent competitiveness: the GPT models were not competitive on tasks that require accurate interpretation of SMILES representations of molecules (e.g., name prediction, reaction prediction, retrosynthesis).
  • Strong capabilities in text-related tasks: in text-related explanatory tasks such as molecule captioning, the GPT models demonstrated strong qualitative and quantitative capabilities.
  • Applicability to classification and ranking: for chemical problems that can be framed as classification or ranking, such as property prediction and yield prediction, the GPT models showed performance competitive with or better than existing baselines built on classical machine learning models.

This analysis provides valuable insight into how the GPT models compare with existing models on chemistry tasks, as well as into their limitations and potential. The performance of the GPT models is also analyzed in detail, and the results are grouped into three categories (see the figure below): non-competitive performance (NC), competitive performance (C), and selectively competitive performance (SC).

Non-Competitive Performance (NC): on some tasks, such as reaction prediction and retrosynthesis, the GPT models perform poorly compared with existing machine learning models trained on large amounts of data. This is attributed to limitations in understanding the SMILES strings of molecules. Both tasks use SMILES strings as input and output, and generating accurate answers appears to be difficult without a deep understanding of the reactants, the products, and the transformation between them. The GPT models also perform poorly on the name prediction task, which indicates that accurate conversion between complex strings such as SMILES, IUPAC names, and molecular formulas is difficult for them.

Competitive Performance (C): For chemical tasks framed as classification or ranking, the GPT models can achieve satisfactory results, because selecting from a fixed set of alternatives is simpler than generating or transforming a molecule. For example, for reactant, solvent, and ligand selection, the models achieve 40% to 50% accuracy. For yield prediction, however, the results were inferior to those of the task-specific baseline model. Nevertheless, performance improved in the few-shot setting, which suggests room for further gains.

Selectively Competitive Performance (SC): the GPT models perform remarkably well on certain tasks. In particular, the F1 score and accuracy were near perfect for the property prediction task on the HIV and ClinTox datasets. This is likely because the required responses are simple "yes" or "no" answers. The language generation capability of the GPT models also leads to strong performance on the text-based molecule design and molecule captioning tasks. Exact-match accuracy remains low, but the outputs are still considered useful when the generated molecules are chemically valid.
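For yes/no tasks of this kind, scoring comes down to mapping each free-text reply to a binary label and computing accuracy and F1. The sketch below shows one way to do that; the parsing rule and the choice to count unparseable replies as wrong are assumptions made for the example, not the paper's evaluation code.

```python
from typing import List, Optional, Tuple


def parse_yes_no(response: str) -> Optional[int]:
    """Map a free-text reply to 1 (yes), 0 (no), or None if it is neither."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return 0
    return None


def accuracy_and_f1(y_true: List[int], y_pred: List[Optional[int]]) -> Tuple[float, float]:
    """Accuracy and F1 for the positive class; None predictions count as errors."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Toy labels and replies, invented for illustration only.
    labels = [1, 0, 1, 0]
    replies = ["Yes", "No.", "no", "maybe"]
    print(accuracy_and_f1(labels, [parse_yes_no(r) for r in replies]))
```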

In general, the GPT model shows remarkable potential in certain tasks in chemistry, but there is still room for improvement in some areas. In particular, future research and development is warranted in understanding complex chemical reactions and in the generation of accurate chemical compounds.

In addition, a comparison of the large-scale language models shows that GPT-4 is superior to Davinci-003, GPT-3.5, Llama, and Galactica in understanding, reasoning, and explaining chemistry, as shown in the table below. This further confirms that GPT-4 outperforms the other models in both basic and realistic scenarios.

We also examine the impact of ICL. The key findings are as follows:

  • The ICL prompt performed better than the zero-shot prompt on all tasks.
  • ICL cases retrieved using scaffold similarity yielded better results on many tasks compared to random sampling.
  • In general, using more ICL examples tended to lead to better performance than using fewer.

These results indicate that the appropriate selection and quantity of ICL examples have a significant impact on learning effectiveness, and highlight the need to develop higher quality ICL examples as a future challenge.

Further experiments were conducted to test whether SELFIES or SMILES is the better molecular representation for the language models. Across the four tasks of molecular property prediction, reaction prediction, molecule design, and molecule captioning, the SELFIES representation performed worse than the SMILES representation. This may be due to the language models' greater familiarity with SMILES. However, SELFIES produced fewer invalid strings, which shows the advantage of its design.

This article reports only some of the experimental results; the original paper provides more comprehensive and detailed results.


This paper identifies the skills needed to apply large-scale language models in chemistry and establishes a detailed benchmark for comparing the performance of five popular models (GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica) on eight widely used chemistry tasks.

The experimental results show that large-scale language models perform poorly compared with existing baselines on generative tasks that require a deep understanding of SMILES representations of molecules, such as reaction prediction, name prediction, and retrosynthesis.

On the other hand, for classification and ranking-style tasks such as yield prediction and reagent selection, large-scale language models showed promising results. Furthermore, for tasks that use text within prompts, such as property prediction and text-based molecule design, and for tasks that require explanation, such as molecule captioning, large-scale language models were found to be competitive under certain conditions.

These findings suggest the potential of large-scale language models in chemistry tasks and the need for further improvements to enhance their performance. It is hoped that incorporating more novel and practical tasks in the future will help bridge the gap between large-scale language models and the chemistry research domain, and that their further potential in chemistry will be explored.
