
Paradigm Shift in Chemistry with Large Language Models: Applications to Classification, Regression, and Inverse Design

Large Language Models

3 main points
✔️ Suggests the potential to further expand the range of applications of GPT by representing problems in chemistry in text format
✔️ Confirms that even with fewer data points, GPT outperforms traditional specialized machine learning models
✔️ Proposes more efficient methods for discovering new compounds and designing new materials than conventional approaches

Leveraging Large Language Models for Predictive Chemistry
written by Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega-Guerrero, Berend Smit
(Submitted on 17 Oct 2023)
Comments: Published on ChemRxiv.
Subjects: Theoretical and Computational Chemistry

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Large language models have received particular attention among recent developments in machine learning. Much of this interest stems from their simplicity: given a phrase, they can generate a natural continuation of the text that does not appear to have been written by a machine.

Practical applications in the scientific field, such as writing abstracts for scientific papers and generating code for specific programming tasks, have shown remarkable results. It is also clear that these models can solve simple tabular regression and classification tasks even though they were never explicitly trained to do so.

These results suggest the possibility of finding solutions to scientific questions for which we do not yet have answers. The application of these models is particularly promising in the field of chemistry, where most questions can be expressed in text form. Questions such as "How does changing the metal in a MOF (metal-organic framework) change its stability in water?", which cannot be answered by experiment or theory alone, may thus receive new answers.

In chemistry and materials science, research is always based on limited experimental data. In this context, models such as the Generative Pre-trained Transformer 3 (GPT-3) have been shown to provide meaningful results even with few data points. In this paper, we show that GPT-3 outperforms traditional specialized machine learning models on multiple chemistry-related questions using the data provided to it.

The paper also focuses on the performance of these models, which are pretrained on an extensive corpus of text collected from the Internet and then fine-tuned for specific tasks. This makes the models flexible in application, largely independent of the structure of the prompts.

Through benchmarking on a variety of datasets and applications, we have demonstrated that these models can answer a wide range of scientific questions, from materials characterization to proposed synthesis methods and even materials design. This approach provides new perspectives, especially for problems that machine learning has already addressed.

The figure below shows an overview of the data sets and tasks addressed in this paper.

In this paper, we benchmark GPT-3 on datasets covering the chemical space from molecules to materials to reactions. These datasets cover a variety of tasks, including classification (e.g., predicting a class such as "high" or "low" given a textual representation of a molecule, material, or reaction), regression (predicting floating-point numbers), and inverse design (predicting molecules). The MOF renderings were created with iRASPA.

Classification and regression using large-scale language models

This paper focuses on high-entropy alloys, a new class of structural metals, and explores their potential using the GPT-3 model. The complexity of high-entropy alloys allows for a nearly infinite number of metal combinations, and knowing whether a given combination will form a single solid solution or multiple phases is critical from a practical standpoint.

Specifically, the GPT-3 model was fine-tuned to answer the question "What is the phase of a given high-entropy alloy composition?" by choosing among possible single-phase and multiphase answers. Tuning the model through the OpenAI API took only a few minutes, and for the input "Sm0.75Y0.25" the fine-tuned model answered "1", meaning single phase. This is an example of the remarkable results obtained through fine-tuning.
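As a concrete illustration, below is a minimal sketch of what such a fine-tuning dataset could look like in the JSONL prompt/completion format of OpenAI's legacy fine-tuning API. The prompt wording, the stop markers, and all labels except the Sm0.75Y0.25 example are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a fine-tuning dataset for the phase-classification task.
# The prompt wording, stop markers ("###", "@@@"), and all labels except the
# Sm0.75Y0.25 example are illustrative assumptions, not taken from the paper.
# The JSONL prompt/completion format follows OpenAI's legacy fine-tuning API.
import json

examples = [
    # (composition, label): "1" = single phase, "0" = multiphase
    ("Sm0.75Y0.25", "1"),   # the example discussed in the article
    ("CoCrFeNiMn", "1"),    # placeholder label
    ("AlCoCrCuFeNi", "0"),  # placeholder label
]

with open("hea_phase_train.jsonl", "w") as f:
    for composition, label in examples:
        record = {
            "prompt": f"What is the phase of {composition}?###",
            "completion": f" {label}@@@",
        }
        f.write(json.dumps(record) + "\n")
```

The resulting file can then be uploaded and a fine-tune started through the OpenAI API, which, as noted above, takes only a few minutes for datasets of this size.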

This approach was chosen to allow direct comparison with state-of-the-art machine learning models developed specifically for this task. Interestingly, with only about 50 data points, performance is comparable to the Pei et al. model trained on more than 1,000 data points.

A range of very different properties of molecules, materials, and chemical reactions is also investigated, in the hope that these results will carry over to other properties. The study focuses on applications where traditional machine learning methods have been developed and are accepted as benchmarks in their respective fields. In addition, the results are compared with the top-performing models in the Matbench suite of benchmark tasks.

Comparisons between the fine-tuned GPT-3 model and existing baselines identify the points where the learning curves intersect in the low-data regime, measuring how much data is needed to match or exceed the traditional ML model. As a result, GPT-3 models often achieve comparable results with less data, and this is especially true when the dataset size is limited.

The paper also explores various molecular properties, ranging from the HOMO-LUMO gap and water solubility to performance as an organic photovoltaic material. For materials, it delves into the properties of alloys, metal-organic frameworks, and polymers; for reactions, it examines important cross-coupling reactions in organic chemistry.

While GPT-3 models perform well in low-data regimes, traditional machine learning models tend to catch up as the amount of data increases. This may be because GPT-3 benefits less from the additional data and the correlations it contains. However, the fine-tuning process has not yet been fully optimized, and better tokenization and tuning of learning parameters for the chemical context could lead to further improvements.

As large language models such as OpenAI's GPT-3 and GPT-4 have evolved, the approach in this work has been extended accordingly. Of particular note is that good performance is achieved not only through fine-tuning but also through a technique called "in-context learning," which incorporates examples directly into the prompt. This technique, which amounts to learning at inference time, is particularly effective in modern GPT models.
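As a hedged sketch, the following shows what in-context learning could look like with the OpenAI Python client. The model name, prompt wording, and all labels except the Sm0.75Y0.25 example are illustrative assumptions rather than details from the paper.

```python
# Sketch of in-context (few-shot) learning: labelled examples are placed
# directly in the prompt instead of being used for fine-tuning. The model
# name, prompt wording, and placeholder labels are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_prompt = """Classify each high-entropy alloy as 1 (single phase) or 0 (multiphase).

Composition: CoCrFeNiMn
Phase: 1
Composition: AlCoCrCuFeNi
Phase: 0
Composition: Sm0.75Y0.25
Phase:"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": few_shot_prompt}],
    max_tokens=1,
    temperature=0,  # deterministic output for a classification answer
)
print(response.choices[0].message.content.strip())
```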

Furthermore, this study is not limited to OpenAI models: it shows that excellent results can be achieved on consumer hardware by applying parameter-efficient fine-tuning techniques to large open-source language models. The authors also provide a Python package that makes it easy to apply this approach to new problems.
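For the open-source route, a minimal sketch of parameter-efficient fine-tuning with LoRA via the Hugging Face `peft` library is shown below; the base model and hyperparameters are illustrative choices, not necessarily those used by the authors.

```python
# Sketch of parameter-efficient fine-tuning (LoRA) of an open-source causal
# language model with Hugging Face `peft`. The base model and hyperparameters
# are illustrative choices, not necessarily those used by the authors.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "EleutherAI/gpt-neo-1.3B"  # any causal LM works in principle
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                    # rank of the low-rank adapter matrices
    lora_alpha=16,          # scaling factor for the adapter updates
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Only the small adapter matrices are updated during training, which is what makes fine-tuning large models feasible on consumer hardware.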

How to represent molecules and materials is one of the key issues in ML applications. While IUPAC names are primarily used in the literature, efforts have been made to use dedicated linear encodings such as SMILES and SELFIES. Chemical names might be expected to suit GPT-3 models trained on natural language better than these linear representations, so the paper investigates the effect of different representations on a molecular property prediction task. Interestingly, good results can be obtained regardless of the representation. In particular, the best performance is often obtained when using the IUPAC name of the molecule, which makes fine-tuning GPT-3 for a particular application relatively easy for non-experts.
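To make the comparison of representations concrete, the sketch below produces SMILES, SELFIES, and IUPAC-name variants of the same molecule and embeds each in a hypothetical prompt. The molecule, the prompt wording, and the `selfies`-based conversion are illustrative; the paper's exact pipeline may differ.

```python
# Sketch of preparing alternative text representations of the same molecule.
# The `selfies` package converts SMILES to SELFIES; IUPAC names generally come
# from a database or naming tool, so one is hard-coded here. The molecule and
# prompt wording are illustrative, not taken from the paper.
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"        # aspirin as a SMILES string
selfies_str = sf.encoder(smiles)         # linear SELFIES encoding
iupac_name = "2-acetyloxybenzoic acid"   # looked up, not computed

for label, rep in [("SMILES", smiles), ("SELFIES", selfies_str), ("IUPAC", iupac_name)]:
    print(f"[{label}] What is the water solubility of {rep}?")
```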

Beyond classification, a more advanced challenge is the development of regression models, that is, predicting continuous properties such as the Henry coefficient of gas adsorption in porous materials. Because a pre-trained language model is used, direct prediction of real values is difficult to achieve without modifying the model structure and training methods. However, there are always limits to the accuracy needed in real-world applications; for the Henry coefficient of a material, for example, a precision of 1% (or a specific number of decimal places) is often sufficient.

Given this tolerance, the authors proceed under the assumption that the GPT-3 model can interpolate these numbers, using materials whose Henry coefficients have been rounded to this precision as the training set. One way to view this is as turning the regression task into a classification problem with very fine bins. This more challenging regression task requires more data to tune the GPT-3 model and therefore offers fewer advantages, but it can yield performance approaching the state of the art.
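A minimal sketch of this rounding trick, under the assumption that targets are serialized as fixed-precision text for training and parsed back into floats at prediction time (the material name, prompt wording, and value are illustrative placeholders):

```python
# Sketch of the rounding trick for regression with a language model: targets
# are serialized as fixed-precision text for training, and the model's text
# output is parsed back into a float at prediction time. The material name,
# prompt wording, and value are illustrative placeholders.

def encode_target(value: float, decimals: int = 2) -> str:
    """Round the continuous target to the precision that matters in practice."""
    return f"{round(value, decimals):.{decimals}f}"

def decode_prediction(completion: str) -> float:
    """Parse the model's text completion back into a number."""
    return float(completion.strip())

material = "MOF-5"  # placeholder identifier
y_true = -3.14159   # placeholder log Henry coefficient
record = {
    "prompt": f"What is the logarithm of the Henry coefficient of {material}?###",
    "completion": f" {encode_target(y_true)}@@@",
}
print(record)                       # completion text is ' -3.14@@@'
print(decode_prediction("-3.14"))   # -> -3.14
```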

Another challenging task for machine learning in chemistry is the development of models that can produce molecules with specific properties, known as "inverse design." Two main approaches exist. One is to train generative models such as variational autoencoders or generative adversarial networks when large datasets are available. The other is to use evolutionary techniques such as genetic algorithms to generate new candidate molecules when data are limited.

Of particular note are inverse design efforts utilizing advanced language models such as GPT-3. These models can predict the properties of molecules and materials even with little data, making it possible to propose new materials in the early stages of research. This is especially useful when experimental data is scarce and understanding is limited.

Through the example of molecular photoswitches, the paper shows how GPT-3 can generate accurate answers to specific questions. Inverse design can then be performed simply by reversing question and answer, and the generated molecules are verified to be chemically meaningful.
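The sketch below illustrates this "reverse the question and answer" idea with a hypothetical photoswitch example: the same property-molecule pair yields one record for forward property prediction and one for inverse design. The SMILES string, the wavelength, and the prompt wording are illustrative assumptions, not data from the paper.

```python
# Sketch of "reversing question and answer" for inverse design: the same
# property-molecule pair yields a forward (property-prediction) record and an
# inverted (molecule-generation) record. The SMILES string, wavelength, and
# prompt wording are illustrative assumptions, not data from the paper.

# Forward record: molecule -> property
forward = {
    "prompt": "What is the E/Z transition wavelength of CC1=CC=C(C=C1)N=NC2=CC=CC=C2?###",
    "completion": " 330 nm@@@",
}

# Inverted record: property -> molecule
inverse = {
    "prompt": "Propose a photoswitch molecule with an E/Z transition wavelength of 330 nm.###",
    "completion": " CC1=CC=C(C=C1)N=NC2=CC=CC=C2@@@",
}

print(forward)
print(inverse)
```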

Quantifying the novelty of the generated molecules is another important step. The paper assesses the extent to which the generated molecules contain new structures not included in known databases, confirming GPT-3's ability to propose truly novel compounds. In addition, tuning the softmax temperature during generation trades off diversity against validity: careful tuning of this parameter helps manage the risk of generating diverse, novel, but chemically invalid structures.
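A hedged sketch of such post-hoc checks, using RDKit to test chemical validity and a canonicalized-SMILES lookup to test novelty against a placeholder training set:

```python
# Sketch of post-hoc checks on generated molecules: RDKit parsing tests
# chemical validity, and canonical-SMILES lookup tests novelty against a
# placeholder training set. Sampling temperature would be set at generation
# time: higher values give more diverse but more often invalid strings.
from rdkit import Chem

training_smiles = {"CC1=CC=C(C=C1)N=NC2=CC=CC=C2"}  # placeholder training set
canonical_train = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}

def is_valid(smiles: str) -> bool:
    return Chem.MolFromSmiles(smiles) is not None

def is_novel(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return Chem.MolToSmiles(mol) not in canonical_train  # compare canonical forms

generated = ["CC1=CC=C(C=C1)N=NC2=CC=CC=C2", "c1ccccc1N=Nc1ccccc1", "not-a-molecule"]
for smi in generated:
    print(smi, "valid:", is_valid(smi), "novel:", is_novel(smi))
```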

In this paper, we show that inverse design opens up new possibilities in chemical research. In particular, this approach has the potential to accelerate innovation in chemistry, as fine-tuning natural language models is more accessible than training traditional machine learning models.

Summary

As this paper demonstrates, the GPT-3-based machine learning system performs remarkably well on a wide variety of problems in chemistry. In particular, the system shows excellent results for compounds for which traditional linear representations such as SMILES cannot be used. This suggests that GPT-3 has an excellent ability to extract correlations from text and has the potential to outperform specialized machine learning models without explicit prior training in chemistry.

The range of applications for this technique is broad, and it can be trained and used based on questions formulated in natural language. This approach sets a new standard for future machine learning research, indicating that new models should aim to outperform this simple technique.

The use of GPT-3 is similar to a literature search in a research setting and opens new avenues for chemists to leverage accumulated knowledge. GPT-3 is well suited to discovering correlations in text fragments, and because such correlations are highly relevant to chemistry, it offers new possibilities for chemists and materials scientists.

The paper also states that the next step is to use GPT-3 to further identify these correlations for deeper understanding. GPT-3 is a tool that allows scientists to use the knowledge accumulated over the years more effectively, and incorporating the many scientific results and experimental data not included in the training data could make it even more powerful. This approach has the potential to have a revolutionary impact on the future of chemical research.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
