Evolution Of Mining Chemical Literature With AI Agents Using ChatGPT

Large Language Models 11/11/2024

3 main points
✔️ Propose a new method for AI agents in literature mining in chemistry
✔️ This new method saves a lot of human effort and automates tasks
✔️ Design a new scheme to evaluate the performance of AI agents in literature mining

An Autonomous Large Language Model Agent for Chemical Literature Data Mining
written by Kexin Chen, Hanqun Cao, Junyou Li, Yuyang Du, Menghao Guo, Xin Zeng, Lanqing Li, Jiezhong Qiu, Pheng Ann Heng, Guangyong Chen
(Submitted on 20 Feb 2024)
Comments: Published on arxiv.
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Because of its wide range of applications, chemistry plays a particularly important role in the synthesis of materials and the development of drugs. Research on new materials drives progress in energy, environmental science, and nanotechnology, and also contributes significantly to new drug development and the advancement of the life sciences. However, although a vast amount of data on chemical reactions has been accumulated, the challenge is to use this data effectively to discover new reaction schemes and apply them to materials synthesis and drug development. This is where artificial intelligence is attracting attention.

Artificial intelligence can identify reaction features and patterns by learning from large amounts of existing data, and can then predict the outcome of new reactions. This allows chemists to rapidly screen and evaluate diverse reaction conditions and optimize synthetic pathways. In addition, by combining artificial intelligence with reaction prediction and optimization algorithms, efficient synthetic pathways can be generated and identified. Analyzing chemical reactions and discovering new ones requires extensive expertise in reaction schemes. Uncovering hidden relationships and patterns in the data and finding common features and mechanisms of reactions is critical for chemists to understand the basic principles of reactions and to design new ones.

This requires integration and knowledge management of data on chemical reactions. Through automated data collection, organization, and annotation, artificial intelligence builds a comprehensive database on chemical reactions, making data easily accessible and usable by chemists. This will improve the discoverability and reproducibility of data and enable researchers to better utilize existing knowledge.

However, processing data from the chemical reaction literature still poses several challenges for artificial intelligence techniques. In particular, the data is mostly not organized in a systematic manner, and extracting the essential information from a complex and voluminous literature is a very difficult task. This requires artificial intelligence to have advanced context analysis capabilities and the ability to recognize patterns in text style and content.

With the recent introduction of ChatGPT, a large language model, the use of artificial intelligence in chemistry has advanced to a new level. This has expanded the possibilities of literature mining and opened up new avenues for chemical exploration through artificial intelligence.

Conventional methods of literature information extraction include manual, rule-based, and model-based extraction. However, manual extraction relies on the labor of chemists and increases costs, while rule-based extraction is difficult to adapt to new literature. Model-based extraction also suffers from poor performance because annotated reaction data is scarce.

To address these challenges, the paper proposes an end-to-end framework based on a powerful artificial intelligence agent ("AI agent"). These agents can efficiently exploit large language models through "automatic recognition" and "inferential decision making" to save significant human effort and improve the performance of the models. In addition, the authors have developed a new multitasking literature mining scheme and use ChatGPT to build highly efficient prompts. This enhances the interaction environment with the literature database and also allows prompts to be improved automatically.

To evaluate AI agents, the paper also proposes a new evaluation system that uses precision, recall, and F1 scores to measure their effectiveness in extracting chemical reaction-related information. Furthermore, it validates the effectiveness of AI agents by comparing their performance with that of human experts.

Method

The following is an overview of the framework for an AI agent that performs chemical literature analysis and reaction information extraction based on a large language model.

The first step in developing an AI agent is to acquire a high-quality literature dataset. The authors collect a vast amount of chemical literature from Sci-Hub, with a particular focus on literature related to the Suzuki-Miyaura coupling, a well-known coupling reaction in organic chemistry. To make the collected literature usable as data, they apply Optical Character Recognition (OCR), which converts the PDFs into text for computational processing.

However, OCR can be error-prone for complex layouts and low-quality scans. To account for these errors, the paper introduces a quality control mechanism to ensure the reliability of the dataset. Keywords such as "General Procedure," "Typical Procedure," and "General Experiment" often indicate detailed methodology, so any document that does not contain them is considered to be of insufficient quality and is excluded from the dataset. Similarly, a document in which these keywords appear more than five times is considered unsuitable for the extraction process and is also excluded, since this often indicates an overly complex or cumbersome methodology. This process ultimately resulted in a dataset consisting of 1,000 references.
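The keyword-based quality check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and it assumes that matching is case-insensitive and that "more than five times" refers to the total count across all three keywords.

```python
import re

# Keywords named in the article; the matching rules below are assumptions.
PROCEDURE_KEYWORDS = ("General Procedure", "Typical Procedure", "General Experiment")
MAX_KEYWORD_HITS = 5  # documents with more hits than this are dropped


def count_procedure_keywords(text: str) -> int:
    """Count case-insensitive occurrences of all procedure keywords in the text."""
    return sum(
        len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))
        for keyword in PROCEDURE_KEYWORDS
    )


def passes_quality_check(text: str) -> bool:
    """Keep a document only if it has between 1 and MAX_KEYWORD_HITS keyword hits."""
    hits = count_procedure_keywords(text)
    return 1 <= hits <= MAX_KEYWORD_HITS
```

Applied to the OCR output of each PDF, a filter of this kind discards both documents with no recognizable procedure section and documents with so many procedure sections that extraction becomes unwieldy.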

Next, an AI agent is used to extract the conditions for chemical reactions from the literature. This agent can analyze the literature in the same way as a chemist and efficiently extract the necessary information. First, the AI agent extracts chemical information from standardized text. This is similar to a chemist extracting key data about a reaction from an experimental notebook.

This task extracts information about yields, reactants, catalysts, solvents, and products; the AI agent accomplishes this using a multitasking framework and in-context learning.

The AI agent first identifies the textual passages describing the reaction conditions by searching for keywords and phrases commonly used in the chemical literature. By applying several algorithms to extract information from the identified passages, a dataset containing yield, reactant, catalyst, solvent, and product information for each reaction is obtained. The figure below shows the prompts, example inputs, and example outputs of the AI agent's in-context learning process.
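To make the idea of in-context learning concrete, the sketch below assembles a one-shot prompt (instruction, one worked example, then the new passage) and parses a JSON reply. Everything in it is an assumption made for illustration: the prompt wording, the JSON output format, the example reaction, and the function names are hypothetical and are not the prompts used in the paper. The call to the language model itself is deliberately left out.

```python
import json

# Fields the article says are extracted for each reaction.
FIELDS = ("reactants", "catalyst", "solvent", "product", "yield")

# Hypothetical worked example for the prompt; not taken from the paper.
EXAMPLE_INPUT = (
    "4-Bromotoluene and phenylboronic acid were stirred with Pd(PPh3)4 "
    "in toluene to give 4-methylbiphenyl in 92% yield."
)
EXAMPLE_OUTPUT = {
    "reactants": ["4-bromotoluene", "phenylboronic acid"],
    "catalyst": "Pd(PPh3)4",
    "solvent": "toluene",
    "product": "4-methylbiphenyl",
    "yield": "92%",
}


def build_extraction_prompt(passage: str) -> str:
    """Assemble a one-shot prompt: instruction, worked example, then the new passage."""
    instruction = (
        "Extract the following fields from the reaction description and "
        "answer with a JSON object: " + ", ".join(FIELDS) + "."
    )
    example = f"Text: {EXAMPLE_INPUT}\nJSON: {json.dumps(EXAMPLE_OUTPUT)}"
    query = f"Text: {passage}\nJSON:"
    return "\n\n".join([instruction, example, query])


def parse_extraction(response_text: str) -> dict:
    """Parse the model's JSON reply, filling any missing field with None."""
    data = json.loads(response_text)
    return {field: data.get(field) for field in FIELDS}
```

The string returned by `build_extraction_prompt` would be sent to the model, and the reply passed to `parse_extraction`; ending the prompt with "JSON:" nudges the model to continue in the same format as the worked example.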

Next, AI agents identify "co-references" in the chemical literature. Co-references are used in place of long, complex chemical names, but are difficult for machines to understand; AI agents use GPT's ability to understand context to accurately identify these co-references. Specifically, the context is analyzed in depth and verified against commonly used co-reference patterns. The figure below shows an example prompt, example input, and example output of the AI agent's in-context learning process.

After identifying co-references, the AI agent also maps them to full chemical names. This converts the abbreviations into a complete form so that they can be treated as context-independent information. The agent uses GPT's ability to understand context to identify where co-references are defined, and then analyzes the sentence structure to piece the information together. This mapping is recorded in a structured format and can be updated as needed.

Finally, the AI agent replaces all co-references in the text with the corresponding full chemical names. It does this by creating a dictionary with the co-reference as key and the full chemical name as value, and then processing the text and replacing each co-reference as it is found. The result is a text in which all abbreviations are replaced with full chemical names, making information extraction more accurate and easier.
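The dictionary-based replacement step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the example labels are hypothetical, and it assumes that co-reference labels are short alphanumeric tokens such as "1a" that should only be replaced when they stand alone.

```python
import re


def replace_coreferences(text: str, mapping: dict[str, str]) -> str:
    """Replace each co-reference label in `text` with its full chemical name."""
    if not mapping:
        return text
    # Try longer labels first so a short label never wins over a longer one.
    labels = sorted(mapping, key=len, reverse=True)
    # Lookarounds stop a label from matching inside a longer alphanumeric token.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:"
        + "|".join(re.escape(label) for label in labels)
        + r")(?![A-Za-z0-9])"
    )
    # A single pass over the text, so inserted names are never re-scanned.
    return pattern.sub(lambda match: mapping[match.group(0)], text)
```

For example, with the mapping `{"1a": "4-bromotoluene", "2a": "phenylboronic acid"}`, the sentence "Compound 1a was coupled with 2a." becomes "Compound 4-bromotoluene was coupled with phenylboronic acid.", while an unrelated label such as "11a" is left untouched.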

Experimental Results

AI agents are intended to be efficient supporters of chemists, quickly obtaining high-quality reaction information and reducing time costs. For this reason, it is important to quantitatively measure their performance and compare them to human experts. To investigate the effectiveness of this framework, this paper proposes a novel pipeline for assessing the proficiency of GPT-based literature mining methods.

The evaluation process focuses on assessing the quality of the extracted reactants, reagents, solvents, products, and yields involved in the Suzuki-Miyaura coupling reaction. To quantify this, the authors implement an evaluation scheme using precision, recall, and F1 scores. These metrics assess both how accurately reaction information is extracted and how completely the elements involved in a reaction are retrieved.
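As a reminder of how these three metrics relate, the sketch below computes them for one set of extracted items against an expert-annotated reference set. It is a simplified illustration under stated assumptions: the function name is hypothetical, and items are compared by exact string match, which may differ from the matching criterion used in the paper.

```python
def precision_recall_f1(
    predicted: set[str], reference: set[str]
) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for extracted items against a reference set."""
    true_positives = len(predicted & reference)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference items that were found.
    recall = true_positives / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

Precision penalizes items the agent extracted that the experts did not annotate, recall penalizes annotated items the agent missed, and F1 balances the two.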

The results generated by ChatGPT are retrieved and stored for comparison with results collected by human experts. The paper annotates 17 references and 326 responses to validate the effectiveness of the AI agent. The results, as shown in the table below, achieve on average 90.14% precision, 77.13% recall, and an F1 score of 83.11%.

Since no other open source tools currently exist for extracting chemical reaction data from academic journals, the paper primarily examines the effectiveness and performance of the AI agent in comparison to data collected manually by human chemists. The main metrics evaluated are accuracy, average cost, and average speed. To minimize the uncertainty and randomness of individual chemists, the authors selected 10 graduate students (either master's or doctoral students) specializing in chemistry to perform manual data collection. The results obtained from these chemistry experts are averaged and compared to those of the agent. The table below shows that the AI agent achieves high accuracy and outperforms the human chemists in average cost and average speed.

Summary

This paper presents an AI agent that leverages large language models to automatically extract highly accurate chemical data from the chemical literature. The system performs well in precision, recall, and F1 scores, and streamlines the data collection and analysis process, significantly reducing human effort and improving performance.

AI agents are characterized by their ability to iteratively optimize and generate prompts to deal with the diverse and unstructured information in the literature. This has proven to be as efficient and accurate as an expert in the management and use of chemical data. This technology is expected to revolutionize data processing in chemistry. It is also expected to lay a solid foundation for AI's role in literature mining in chemistry and accelerate progress in various areas of chemistry, such as materials synthesis and new drug discovery.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
