ChemBench] A New Benchmark In Chemistry!
3 main points
✔️ New Benchmarking Framework "ChemBench" Proposed to Better Understand the Capabilities of Large-Scale Linguistic Models in Chemistry
✔️ Modern large-scale linguistic models outperform experts on many problems in chemistry
✔️ Current evaluation methods do not adequately measure the intrinsic capabilities of models, especially performance on problems requiring complex reasoning
Are large language models superhuman chemists?
written by Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Benedict Emoekabu, Aswanth Krishnan, Mara Wilhelmi, Macjonathan Okereke, Juliane Eberhardt, Amir Mohammad Elahi, Maximilian Greiner, Caroline T. Holick, Tanya Gupta, Mehrdad Asgari, Christina Glaubitz, Lea C. Klepsch, Yannik Köster, Jakob Meyer, Santiago Miret, Tim Hoffmann, Fabian Alexander Kreth, Michael Ringleb, Nicole Roesner, Ulrich S. Schubert, Leanne M. Stafast, Dinga Wonanke, Michael Pieler, Philippe Schwaller, Kevin Maik Jablonka
(Submitted on 1 Apr 2024)
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Large-scale language models are machine learning models that learn large amounts of text to generate text. The capabilities of these models are rapidly improving and can now pass the U.S. National Medical Examination. They can also be used in conjunction with tools such as web search and synthetic planners to design chemical reactions and experiment autonomously.
Some consider these models "signs of Artificial General Intelligence (AGI)," while others see them as "stochastic parrots." In other words, they consider them simple systems that simply repeat what they have learned. Nonetheless, large-scale language models have shown the ability to solve a variety of tasks that are not explicitly learned, and economic interest and investment is growing rapidly; by 2032, the market for this sector is estimated to be worth over $1.3 trillion.
Chemists and materials scientists are also increasingly interested in large-scale language models. This is because large-scale language models are being used to predict the properties of molecules and materials, optimize reactions, generate new materials, and extract information. In addition, prototype systems have been developed that autonomously execute physical reactions based on natural language instructions.
With much of the information about chemistry now stored in text, there is still much untapped potential for large-scale language models. For example, many chemical research insightscome not fromdatabases, but from the ability ofchemiststo interpret data. These insights exist as text in scientific papers, and new insights can be gained by manipulating the text. This will lead to a co-pilot system for chemists, allowing them to answer questions and propose new experiments based on vast amounts of information beyond what a human can read.
However, the increasing capabilities of machine learning models for chemistry have raised concerns about the potential for dual use of the technology. For example, techniques that design non-toxic molecules could conversely be used to predict toxic ones. It is important to be aware of these risks and to develop an appropriate framework for evaluation.Currently, however, large language models are evaluated through standardized benchmarks such as BigBench and LM Eval Harness, which include few chemistry-related tasks.
This paper proposes a new benchmarking framework, ChemBench, which reveals the limitations of current state-of-the-art models.ChemBench consists of 7059 question-answer pairs collected from a variety of sources, covering the majority of undergraduate and graduate chemistry curricula in the It covers the majority of undergraduate and graduate chemistry curricula. In addition, 41 chemistry professionals were surveyed to compare the performance of the current model with that of human chemists.
The results show that while current models demonstrate capabilities that exceed those of humans in some aspects, they can be very misleading with respect to safety-related aspects. A carefully crafted, broad benchmark would be an important step forward for progress in this area.
Method
The questions in the dataset are curated from existing exams and exercise sheets, as well as new program-generated questions. Questions are added via pull requests to the GitHub repository and are merged into the corpus only after passing manual review and automated checks.
To ensure that questions are not included in the training data set, we use the same canary string as the BigBench project. This requires developers of large language models to filter this canary string out of the training dataset.The manually curated questions were taken from a variety of sources, including university exams, exercises, and problem sets. A summary of the sources of the manually curated questions is presented in the table below.
Additionally, in addition to manually curated questions, programmatically generated questions are also included. An overview of the sources of the semi-automatically generated questions is provided in the table below.
For consistency,we use different prompt templates for thecompletion andinstruction coordination models. Constraints are imposed on models within the template to receive responses in a specific format, allowing for robust, unbiased, and consistent analysis. Specific models are trained with special annotations and LATEX notation for scientific notation, chemical reactions, and symbols in the text. For example, all SMILES expressions are enclosed in [START SMILES][\END SMILES] in Galactica.The prompt strategy consistently reflects these details on a model-by-model basis, post-processing (adding or removing wrappers) LATEX notation, chemical symbols, chemical equations, and physical units. This step is easily customized in the code base.
The parsing workflow also consists of multiple steps and is primarily based on regular expressions. For an instruction adjustment model, the first step is to identify the [ANSWER][\ANSWER] environment that instructs the model to report the response. For completion models, this step is skipped. From there, we attempt to extract the relevant enumeration letter (for multiple choice questions) or number. In the case of numbers, our regular expressions are designed to accommodate different forms of scientific notation. In our initial testing, we found that the model sometimes returned numbers in word form (e.g., "one" instead of "1"), so we also implemented word-to-digit conversions using regular expressions. When these hard-coded parsing steps failed, a large language model (e.g., Claude 2) was used to parse the completion.
We use custom regular expressions to account for variability in output. We selected a large and diverse subset of 10 questions per topic for all model reports and manually examined where the parsed output did not match the actual answers intended by the model: in 99.76% of cases for MCQ questions and 99.17% of cases for floating-point questions parsing was found to be accurate. The models generating the most frequent errors are pplx-7b-chat and Mixtral-8x7b.
Experiment
The benchmark corpus is created using a wide range of sources, including questions semi-automatically generated from university exam questions and selected datasets from chemistry databases. To ensure quality, all questions are reviewed by at least one chemist in addition to the original curator and automated checks. This large collection of questions covers a wide variety of chemistry topics.For example, the figure below compares the number of questions in each area of chemistry.
The figure below also visualizes the question embedding in a two-dimensional space using principal component analysis (PCA). In this figure, semantically similar questions are placed close together, and points are color-coded based on 11 topics; ChemBench's emphasis on safety-related aspects is clearly demonstrated in this figure.
Many existing benchmarks concentrate on Multiple Choice Questions (MCQs), which do not reflect the reality of chemistry education and research. Therefore, ChemBench samples both MCQ and open-ended questions (6202 MCQ questions and 857 open-ended questions).
For routine evaluations, a small subset of the entire corpus may be practical. For example, Liang et al. report that the cost of an API call for a single evaluation on the widely used HELM benchmark can be over $10,000. To address this issue, we also provide a diverse and representative subset (209 questions) of the entire corpus. This subset is curated so that the topics are more balanced than the entire corpus and is also used to seed a web application for human baseline studies.
Because the text used in chemistry differs from normal natural language, a number of models have been developed to specifically process such text. For example, the Galactica model uses special tokenization and encoding methods for molecules and equations. However, current benchmark suites do not support special handling of scientific information. To address this issue, ChemBench encodes the meaning of various parts of a question or answer. For example, molecules represented in the Simplified Molecular Input Line-Entry System (SMILES) are enclosed in [START SMILES][END SMILES] tags. This allows the model to process SMILES strings differently than other text.
ChemBench is designed to manipulate text completion, as many widely used systems only have access to text completion. This is especially important because of the growing number of tool extension systems that use external tools such as search APIs and code executors to enhance the capabilities of large language models. In this case, thelarge-scale languagemodel thatreturns theprobabilitiesof various tokensis only one part of the overall system, and interpreting those probabilities in the context of the overall system is not clear. However, since text completion is the final output of the system used in real applications, it is used in our evaluation.
To understand the capabilities of current large-scale language models, we are evaluating key models with the ChemBench corpus. This includes the system in conjunction with external tools. A summary of the evaluation results is shown in the figure below, showing the percentage of questions that the models answered correctly.
The worst, best, and average performance of experts is also shown. Remarkably, Claude 3, a state-of-the-art large-scale language model, outperforms humans on this overall measure, more than twice the average performance of experts. Many other models also outperform the average human. In particular, the Galactica model, trained specifically for chemical applications, performed poorly compared to many advanced commercial and open source models, only slightly above the random baseline.
With the growing interest in tool-enhanced systems, it is worth noting that the evaluation results of these systems (GPT-3.5 and tool-enhanced Claude 2) are mediocre. This lack of performance is due to limiting the systems to a maximum of 10 large language model calls. With the default tool-enhanced settings (the so-called ReAct method), the system was unable to identify the correct solution because it repeatedly searched the web and did not find a solution within 10 calls. This observation underscores the importance of the computational cost (in terms of API calls) of the tool extension system as well as its predictive performance.
To gain a more detailed understanding of the model's performance, we also analyze its performance in different areas of chemistry. For this analysis, we have defined several topics and manually categorized all questions in the ChemBench corpus by creating rules. We then calculate the percentage of questions that were answered correctly for each topic by models and humans. In the spider chart, the worst score for each dimension is zero (no correct answers) and the best score is one (all questions answered correctly). Larger colored areas indicate better performance. We can see that this performance varies widely by model and topic.
While many models scored relatively well in polymer chemistry and biochemistry, performance was poor in topics such as chemical safety and analytical chemistry. For example, predicting the number of signals observed in a nuclear magnetic resonance (NMR) spectrum was difficult for the models, with a 10% correct response rate on the GPT-4. On the other hand, humans with learned expertise gave 25% correct answers to the same question. This may be because the human is given a diagram of the compound, whereas the model is given only a SMILES string, which must be used to infer about the symmetry of the compound.
It is important to be able to estimate whether the model can answer a question correctly. If it can do so, it will be less problematic because it will be able to detect errors, even if the answers are incorrect. To investigate this issue, we asked the top-ranked models to estimate their confidence in their ability to answer a question correctly on an ordinal scale. Figure 6 shows that for some models, there is no significant correlation between the estimated difficulty and whether the model answered the question correctly.
In applications where humans may rely on model responses, this is a worrisome observation that underscores the need for critical reasoning in the interpretation of model outputs. For example, for questions about the safety profile of a compound, the GPT-4 reported an average confidence of 3.97 (on a scale of 1 to 5) for the 120 questions it answered correctly and an average confidence of 3.57 for the 667 questions it answered incorrectly Claude 3's verbal confidence estimates were on average more Although they appear to be well calibrated, they can still be misleading in some cases. For example, for questions about the Globally Harmonized System of Classification and Labeling of Chemicals (GHS), Claude 3 returns an average score of 2.39 for correct answers and 2.34 for incorrect answers.
Summary
This paperreveals that large scale language models have surprising capabilities in the field of chemistry. The state-of-the-art modelsoutperform experts onmany topicalproblems inchemistry. However, many limitations still exist. In particular, models are often incorrect in their answers on important topics, and many models have failed to accurately identify their own limitations.
The high performance of the models seen in the evaluations in this paper may also indicate the limitations of the tests used to evaluate the models and the chemists used to evaluate them rather than the models themselves. For example, the model performs well on textbook questions, but struggles on questions that require more complex reasoning. With this in mind, we need to rethink how we teach and assess chemistry. Critical thinking skills will become increasingly important, and large-scale language models will continue to outperform humans in areas of mere problem solving and fact memorization.
Thepaperalso highlights the delicate balance between the breadth and depth of the evaluation framework. Analysis of model performance on different topics shows that model outcomes vary widely across disciplines. Even within the same topic, model performance varies widely depending on the type of problem and the reasoning required to answer it.
Current evaluation frameworks for large-scale language models in chemistry are designed to measure model performance on specific property prediction tasks, but these are inadequate for evaluating systems built for reasoning and scientific applications.As a result, our understanding of the capabilities oflarge-scale languagemodels inchemistryhas been limited.In this paper,weshow thatcarefully crafted benchmarksprovide a means to better understand the capabilities oflarge-scale languagemodels inchemistry. In particular, more focus needs to be placed on developing a human-model interaction framework, especially given the inability of models to accurately identify their own limitations.
Whilethispapershows thatthere are manyareas in which large-scale language model-based systems need further improvement, it also shows that clearly defined metrics are important in many areas of machine learning. Current systems are far from being able to reason like a chemist, but the ChemBench framework is an important step toward this goal.
Categories related to this article