
ChemLLM: Innovation And Application Of Large-scale Language Models Specific To The Field Of Chemistry



3 main points
✔️ Development of new methods and models: ChemData, a method for converting chemical data into a natural-language instruction format, and ChemLLM, a large-scale language model with chemistry expertise.
✔️ An interactive model with chemistry expertise: ChemLLM outperforms GPT-3.5 and can handle a wide variety of chemistry tasks interactively.
✔️ Expanding applications of large-scale language models in science: ChemLLM opens up new applications for language-processing tasks not only in chemistry, but in science more broadly.

ChemLLM: A Chemical Large Language Model
written by Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Dongzhan Zhou, Shufei Zhang, Mao Su, Hansen Zhong, Yuqiang Li, Wanli Ouyang
(Submitted on 10 Feb 2024)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Large-scale language models (LLMs) are expected to find applications in various fields of science due to their outstanding language understanding and generation capabilities. In particular, their potential in the field of chemistry, ranging from the prediction of molecular properties to the design of experimental protocols, has attracted much attention. However, a challenge is that existing large-scale language models do not fully exploit expertise in the chemical domain.

To address this problem, this paper develops ChemData, a new method for converting chemical data into a natural interactive format, and ChemLLM, a large-scale language model with chemistry expertise. These innovations will not only accelerate the progress of chemistry research, but also contribute to broadening the applicability of large-scale language models across scientific disciplines.

"ChemData" contains 7 million chemical instruction examples, which are highly effective for training large-scale language models. We are also making this dataset publicly available to encourage the development of chemical language models. "ChemLLM" is the first open-source chemical language model that retains chemical expertise while maintaining general natural language processing capabilities. In addition, we show how to effectively incorporate chemical knowledge into ChemLLM through a two-stage instruction tuning pipeline. This approach suggests new directions both for training scientific language models and for developing chemistry-capable language models.

This paper explores the potential of large-scale language models in chemical research to provide the scientific community with new tools and become trusted assistants in solving chemistry-related problems.

ChemData

Building large-scale language models for high-performance chemistry requires comprehensive, high-quality data sets. In this paper, we collect chemistry data from a wide range of Internet sources.

The dataset encompasses a wide range of chemistry domain knowledge and follows three main task categories (Molecule, Reaction, and Domain).

The Molecule category is essential for understanding and identifying molecular structures and their properties. It includes four main areas.

  • Molecular Recognition: involves converting between molecular representations such as SMILES, IUPAC names, and chemical formulas.
  • Molecular Property Prediction: focuses on predicting a wide range of molecular attributes, including solubility.
  • Molecular Generation: focuses on designing and generating SMILES-format molecular structures that meet specific property criteria.
  • Molecular Captioning: focuses on describing molecular features, functions, or related information in natural language.
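To make the molecular-recognition task concrete, here is a minimal sketch of how a seed template might be filled to form the question side of a Q&A pair. The template phrasings and function name are illustrative assumptions; the paper's actual ChemData templates are not reproduced in this article.

```python
import random

# Hypothetical seed templates for the molecular-recognition task; the
# phrasings below are illustrative assumptions, not the paper's templates.
SEED_TEMPLATES = [
    "What is the {target} of the molecule with {source_fmt} {value}?",
    "Convert the {source_fmt} string {value} into its {target}.",
]

def make_recognition_prompt(value: str, source_fmt: str, target: str,
                            rng: random.Random) -> str:
    """Fill one randomly chosen seed template to form a question."""
    return rng.choice(SEED_TEMPLATES).format(
        value=value, source_fmt=source_fmt, target=target)

rng = random.Random(0)
print(make_recognition_prompt("CCO", "SMILES", "IUPAC name", rng))
```

Sampling the template at random is one simple way to get the phrasing diversity the paper emphasizes; ChemData additionally uses GPT-4 to vary wording further.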

The Reaction category is important for deciphering chemical reactions and includes reaction product prediction, reaction yield prediction, reaction condition selection, and retrosynthesis analysis. Each of these tasks plays an important role in understanding the dynamics and outcomes of chemical reactions.

  • Reaction Product Prediction: focuses on predicting the outcome of a chemical reaction based on the reactants involved.
  • Reaction Yield Prediction: focuses on estimating the yield of reaction products.
  • Reaction Condition Selection: focuses on determining the conditions (temperature, pressure, catalyst, solvent) under which a reaction proceeds with maximal yield and efficiency.
  • Retrosynthesis Analysis: focuses on working backward from a target molecule to infer possible reactants and synthetic pathways.

The Domain category encompasses molecular and reaction-centric tasks as well as domain-specific tasks that greatly extend the versatility of large-scale language models. These include cheminformatic programming, domain Q&A, literature translation, and reaction design.

  • Cheminformatic Programming aims to implement the skills to understand and generate cheminformatics codes in large language models and incorporate them into chemical analysis and research workflows.
  • Domain Q&A builds on general chemistry knowledge drawn from textbooks and aims to equip the large-scale language model to address a variety of questions in the chemistry domain, from basic concepts to advanced topics.

Thus, we are collecting a wide range of chemical data and building a foundation for analysis in order to build a large-scale language model for chemistry. This is expected to lead to a deeper understanding of chemistry and an expanded range of its applications.

In addition, developing large-scale language models for chemistry is not a straightforward task, owing to chemistry's unique representation formats and data complexity. This paper introduces an innovative method for converting chemical data into natural language and arranging it into a format suitable for training large-scale language models.

This approach preserves chemistry expertise while transforming the data into a more accessible and interpretable format. Specifically, it uses a strategy called "Play as Playwrights," which leverages Seed Templates to generate single- and multi-turn conversational scenarios, dramatically increasing the diversity of the training data. Although designed for the field of chemistry, this approach is applicable to other scientific domains and opens up new possibilities for the study of large-scale language models.

This approach significantly improves the training efficiency of large-scale language models by converting chemical data into natural language. We start with Seed Templates that convert the data into an intuitive, understandable format while preserving chemical knowledge; using GPT-4, we then generate a variety of Q&A pairs, which are used to construct single-turn conversational examples. The same method applies to tasks such as predicting chemical reactions and describing molecular attributes.
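As a minimal sketch of this conversion step, the following turns one tabular property record into a single-turn chat example. The field names, fixed phrasing, and property values are assumptions for illustration; ChemData reportedly uses GPT-4-generated template variations rather than a single fixed wording.

```python
import json

def record_to_single_turn(record: dict) -> dict:
    """Turn one property record into a single-turn chat example.

    The phrasing here is a fixed stand-in; the actual pipeline varies
    the wording via GPT-4-generated templates.
    """
    question = (
        f"What is the {record['property']} of the molecule "
        f"with SMILES {record['smiles']}?"
    )
    answer = (
        f"The {record['property']} of {record['smiles']} is "
        f"{record['value']} {record['unit']}."
    )
    return {"conversations": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

# Illustrative record only; the value is a placeholder, not real data.
example = record_to_single_turn(
    {"smiles": "CCO", "property": "aqueous solubility",
     "value": "X", "unit": "g/L"}
)
print(json.dumps(example, indent=2))
```

Emitting each record as a user/assistant pair keeps the chemistry content intact while putting it in the conversational format that instruction tuning expects.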

In addition, the multi-turn dialogue generation focuses on training the model's reasoning skills by mimicking inter-expert discussions. Here, we employ a technique called "Play as Playwrights" to create a variety of dialogue scenarios while maintaining content expertise and depth of discussion. This technique enhances the quality of the dialogue and allows for more specialized and in-depth discussions.

For molecule-related tasks, Seed Templates are used to form Q&A pairs appropriate to specific chemistry tasks; one example is converting between molecular names in different formats. Furthermore, when processing chemical reaction data, we design special templates to accommodate the diversity and incompleteness of reaction conditions.

Finally, a condition-masking strategy is employed to strengthen the logical consistency of the multi-turn dialogues. The goal is for the model to make inferences comparable to expert-level analysis. In addition, to provide the model with extensive domain knowledge, we aggregate numerous textbook passages and research topics and synthesize topics to develop deep reading comprehension.
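The condition-masking idea can be sketched as hiding one known reaction condition and asking the model to recover it from the rest. The record fields and question phrasing below are assumptions, not the paper's actual format.

```python
import random

def mask_condition(reaction: dict, rng: random.Random) -> tuple[str, str]:
    """Mask one known condition; return (question, expected answer)."""
    # Only conditions actually recorded for this reaction can be masked.
    known = [k for k, v in reaction["conditions"].items() if v is not None]
    masked = rng.choice(known)
    shown = {k: v for k, v in reaction["conditions"].items() if k != masked}
    question = (
        f"For the reaction {reaction['reactants']} >> {reaction['product']} "
        f"with conditions {shown}, what {masked} should be used?"
    )
    return question, reaction["conditions"][masked]

# Illustrative reaction record (an esterification written in SMILES).
rng = random.Random(1)
q, a = mask_condition(
    {"reactants": "CCBr.CC(=O)[O-]", "product": "CCOC(C)=O",
     "conditions": {"solvent": "DMF", "temperature": "80 C",
                    "catalyst": None}},
    rng,
)
print(q)
print("expected answer:", a)
```

Because the masked field varies from example to example, the model must reason over whichever conditions remain visible, which is what gives the multi-turn dialogues their logical consistency.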


ChemLLM

To improve the capability of language models in specific domains, we introduce a two-stage instruction tuning pipeline. This approach has proven particularly effective in developing ChemLLM, a large-scale language model dedicated to the chemical field. The model is based on InternLM2-Base-7B, which supports both Chinese and English and has a context window of 4096 tokens, well suited to complex tasks.

In the first phase, we use an extensive corpus of 1.7 million diverse examples to enhance the model's ability to understand language. Through this process, we are building a solid foundation for understanding subtle differences in language and dialogue structure, and for absorbing specialized knowledge. In this phase, we leverage datasets such as FireFly, OpenOrca, and UltraChat to give our models a deep understanding of human interaction and its dynamics.

In the next phase, the model is further specialized by integrating ChemData, a proprietary dataset dedicated to the field of chemistry. This phase focuses on increasing the model's capabilities in a variety of subtasks, from understanding chemical terminology to interpreting reaction mechanisms. Thus, the transition from general conversational skills to specific expertise is smooth, and the adaptability and accuracy of the model is greatly enhanced.
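The two stages described above can be sketched schematically. Here `fine_tune` is a stub standing in for a real supervised fine-tuning loop; the corpus names follow the article, but the ordering logic is the only thing this sketch demonstrates.

```python
# Schematic of the two-stage instruction-tuning pipeline. `fine_tune` is a
# stub standing in for a real trainer; it only records which corpus the
# model state was tuned on, in order.
def fine_tune(model_state: list, dataset_name: str) -> list:
    return model_state + [dataset_name]

# Stage 1: general-capability corpora build language and dialogue skills.
model = ["InternLM2-Base-7B"]
for corpus in ["FireFly", "OpenOrca", "UltraChat"]:
    model = fine_tune(model, corpus)

# Stage 2: chemistry specialization on ChemData.
model = fine_tune(model, "ChemData")
print(model)
```

The key design choice is the ordering: general instruction data first, domain data last, so the specialization builds on, rather than competes with, general conversational ability.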

This two-stage approach makes a clear distinction between the general-purpose InternLM2-Chat-7B and ChemLLM, which is specialized for chemistry. It shows how AI technology can be adapted to specialized domains, bridging the gap between general AI capabilities and domain-specific requirements.

Experimental results

The assessment covers three aspects: specialized chemistry tasks, general language proficiency, and multilingual adaptability. Since ChemLLM is designed for the chemistry domain, proficiency in managing chemical complexity and data is naturally important. General language competence is also necessary for comprehensive tasks such as conducting literature reviews and writing reports, which demand a deep understanding of the nuances of different texts. In addition, the model's ability to handle multiple languages is essential for worldwide use, allowing it to support a broad spectrum of users navigating chemical information in a variety of languages. These aspects are central to assessing ChemLLM's performance and to shaping its advancement and integration within chemistry research. In this article, we focus on the core specialized chemistry tasks and general language capabilities.

First, the assessment of specialized chemistry tasks measures the language model's understanding of chemistry through ChemBench, a new benchmark designed specifically for the field. The benchmark comprises three stepwise tasks: molecular name conversion, molecular captioning, and chemical reaction prediction. Each task assesses, step by step, how well the model grasps chemical concepts, from basic chemical knowledge, to understanding molecular properties, to predicting the outcomes of chemical reactions.
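A benchmark of this kind reduces to scoring model answers against references. The following is a minimal sketch of such a harness using exact-match accuracy; ChemBench's real data and metrics are not reproduced here, and the toy items are illustrative only.

```python
# Minimal sketch of a ChemBench-style scoring loop using exact match.
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# Toy items for the molecular name-conversion task (illustrative only).
refs = ["ethanol", "acetic acid"]
preds = ["ethanol", "ethanoic acid"]   # second answer counts as a miss
score = exact_match_accuracy(preds, refs)
print(score)  # 0.5
```

Note that exact match is deliberately strict: "ethanoic acid" is chemically synonymous with "acetic acid" but scores as a miss, which is one reason benchmark metric design matters in this domain.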

The results of a series of comparative analyses performed using ChemBench are shown in the table below, examining the performance of various large-scale language models on chemistry, including GPT-3.5 and GPT-4. ChemLLM outperforms GPT-4 on tasks such as name conversion and molecular captioning, and significantly outperforms other models of similar size. On chemical reaction prediction, ChemLLM outperforms GPT-3.5 and is second only to GPT-4. These results demonstrate how instruction tuning that deeply incorporates chemical knowledge can give language models an advanced understanding of the chemistry domain.

The high performance of ChemLLM, in contrast to the limited performance of the base model, InternLM2-7B-Chat, also underscores the value of incorporating specialized chemical knowledge into the training process. This comparative analysis reveals that chemistry-specific language models significantly outperform generic models, and the evaluation using ChemBench quantifies their ability to solve the various challenges faced by chemical language models, offering a new perspective on the role and potential of language models in chemical research.

Next, we assess general language proficiency using MMLU and GSM8K. MMLU (Massive Multitask Language Understanding) is a rigorous test of language-model proficiency across 57 subjects spanning STEM (science, technology, engineering, and mathematics), the humanities, and the social sciences. This holistic benchmark reveals how much broad knowledge and problem-solving skill a language model possesses. GSM8K is a test set designed to gauge the mathematical abilities of language models: it probes multi-step mathematical reasoning through problems requiring two to eight steps of basic arithmetic.

Despite being specialized for chemical questions, ChemLLM also performs well in universal domains such as general conversation and logical reasoning. This suggests that specialized models can achieve deeper understanding through interdisciplinary knowledge: evaluations on the MMLU benchmark show that ChemLLM outperforms models of similar size, such as ChatGLM3-6B, Qwen-7B, LLaMA2-7B, and Mistral-7B, across a wide range of scientific fields. In particular, strong performance in college-level physics and mathematics demonstrates that chemistry training enhances generalization to adjacent scientific disciplines.


Chemistry-specific training also improves the model's reasoning and ethical decision-making in sections such as formal logic and moral scenarios. ChemLLM outperforms the foundational InternLM2-7B-Chat model by a wide margin, with particularly high scores on the formal logic section.

ChemLLM's strong performance across a wide range of subjects, including the humanities, STEM, and the social sciences, shows that its focus on chemistry-specific tasks does not compromise its general task performance, and may even enhance it. This highlights the model's comprehensive, versatile capabilities and suggests further potential for development.

Summary

At a time when large-scale language models (LLMs) are revolutionizing the field of chemistry, the absence of interactive, chemistry-specialized models has been a challenge. To address this, the paper develops a novel, template-based method for learning chemical knowledge interactively, overcoming previous obstacles by incorporating chemical data into language models in an easily accessible form.

"ChemLLM" is then proposed: the first large-scale language model dedicated to chemistry that can interactively handle a wide variety of chemical tasks, from molecular recognition to reaction prediction, with performance exceeding GPT-3.5 and recognized versatility beyond chemistry.

In addition, ChemLLM has shown excellent performance in special natural language processing tasks in chemistry, such as translation of literature, programming of chemical informatics, and compliance with research ethics. It is hoped that this expertise-infused approach will open new avenues for further applications of large-scale language models in the sciences.

ChemLLM will take dialogue and understanding in the world of chemistry to a new level and will be a valuable tool for professionals as well as students and researchers.
