
Potential For Large-scale Language Models In Chemistry And Chemical Engineering; Usefulness And Limitations Of Code Generation Capabilities


Large Language Models

3 main points
✔️ Large-scale language models for chemistry: Large-scale language models are useful for solving chemistry problems, but their scope and limitations have not been fully explored. The accumulation of expertise in solving chemistry problems may affect the applicability of the models.
✔️ Assessing Code Generation Capability: Large-scale language models, particularly Codex and other GPT-3 variants, are useful for generating code to solve chemistry and chemical engineering problems, though some outputs still require expert evaluation. These models are tested on tasks such as chemical simulation and molecular dynamics.

✔️ Significance and Future Applications: This paper is an important step toward understanding the capabilities and limitations of large-scale language models as tools for chemistry and chemical engineering.

Assessment of chemistry knowledge in large language models that generate code
written by 
Andrew D. White, Glen M. Hocky, Heta A. Gandhi, Mehrad Ansari, Sam Cox, Geemi P. Wellawatte, Subarna Sasmal, Ziyue Yang, Kangxin Liu, Yuvraj Singh, Willmor J. Peña Ccoa

(Submitted on 12 Dec 2022)
Subjects: deep learning, language models, large language models, prompt engineering

The images used in this article are from the paper, the introductory slides, or were created based on them.


Large-scale language models are complex transformer neural networks with billions of parameters, trained on extensive collections of documents. These models perform tasks as diverse as text classification, translation, and retrieval, and have attracted significant attention, especially for automatic text completion. The technology has a wide range of applications, from creating unit tests and documenting functions to generating code, answering questions, and completing chemical reaction equations.

There has long been debate in chemistry about the application of large-scale language models specific to solving particular problems or more general-purpose large-scale language models; the extent to which large-scale language models trained on diverse texts such as GPT-3 and T5 can understand and utilize chemistry expertise and language remains an important question. While early studies have shown that interactions between SMILES notations of molecular structures and natural language are possible, the scope of application of large-scale language models in chemistry and their limitations have not yet been fully explored. In particular, the extensive body of expertise required to solve chemical problems may limit the applicability of large-scale language models.

With regard to code generation, recent research has explored the extent to which large-scale language models are applicable to computational chemistry problems and other programming tasks. Many large-scale language models are decoder-only models that generate a continuation based on textual prompts, while others, such as CodeBERT, specialize in code generation. However, these are primarily used for tasks such as code embedding, classification, and translation into natural language.

This paper assesses the ability of large-scale language models to relate natural language, equations, codes, and chemistry discovery methods. It is important to note that the compelling text generation exhibited by large-scale language models does not necessarily indicate deep understanding or reasoning ability. The goal of this study is to go beyond superficial understanding and reveal the true potential of large-scale language models in chemistry.


This paper delves deeply into knowledge in the fields of chemistry and chemical engineering and conducts a comprehensive survey to measure the performance of large-scale language models on code generation. The study aggregates a classified set of questions covering a wide variety of topics, ranging from chemistry in general to specific disciplines. First, the knowledge areas of chemistry and chemical engineering are subdivided so that each classification contains a minimum of 10 examples. These examples were selected using the extensive teaching and research background of the authors' research team and are accompanied by expert insights and reference solutions.

At the core of this research is the evaluation of several state-of-the-art code-generating large-scale language models with the ability to solve chemistry-related problems. Examples include chemical simulation, molecular dynamics, chemical informatics, and quantum mechanics. These problems are chosen to test the ability of the large-scale language models along with more general questions that might be encountered in a real chemistry or chemical engineering class.

Of particular note, certain problem types (e.g., plot creation) can only be adequately assessed by experts and are difficult to grade automatically. Overall, 84 problem examples were collected, 25 of which required expert evaluation. Focusing on models ranging up to billions of parameters, the study examines the relationship between model scale and accuracy.

This study evaluated several models, including Codex (code-cushman-001), a variant of GPT-3; code-davinci-002 (davinci), derived from GPT-3.5; and text-davinci-003 (davinci3), which adds reinforcement learning from human feedback, among others. These models were chosen to compare their abilities in chemistry problem solving.

We also consider the InCoder model, which focuses on code generation, and the CodeGen model, which is trained on a similar data set but focuses on joint natural language and code synthesis. These models provide valuable points of comparison on specific chemistry-related tasks.

This study provides a deep understanding of the capabilities and limitations of large-scale language models of code generation in chemistry and chemical engineering. It is a valuable step toward future scientific and technological advances and provides insight for researchers and developers to design more effective models.

The davinci model has been shown to perform best on common programming tasks. The InCoder model used in this study was run through the HuggingFace Transformers library, and all models were evaluated against Python versions and packages from June 2021, so that later library changes would not affect the results. This cutoff date matches the training data range of previous studies.

Although davinci was used to test and refine the prompts and their answers during development, prompt engineering could not be avoided entirely. It is important to note that the prompts were designed to probe how much the model knows about chemistry, not simply to elicit the correct answer. The reported accuracies are therefore indicative only, and recent studies show that accuracy can be further improved through prompt engineering.

The evaluation criterion was whether a prompt achieved exact completion, i.e., whether the generated code functioned correctly. In most cases prompts came with unit tests, and expert evaluation was used where automated checks were not possible. Top-k sampling at multiple temperature settings was used to generate completions, and specific parameters were ultimately selected to balance model versatility and accuracy.
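The grading scheme described above can be illustrated with a minimal sketch: execute the model's completion, then run the associated unit test against it. The function name and structure here are illustrative assumptions, not the paper's actual harness, which would also need sandboxing and timeouts.

```python
def passes_unit_test(completion, test_code):
    """Return True if a model completion executes and its unit test passes.

    A minimal sketch of automated exact-completion grading: the generated
    code is executed to define its function(s), then the unit test's
    assertions are run in the same namespace.
    """
    namespace = {}
    try:
        exec(completion, namespace)   # define the generated function
        exec(test_code, namespace)    # run assertions against it
        return True
    except Exception:
        return False
```

For example, `passes_unit_test("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` returns `True`, while a buggy completion fails the test and returns `False`.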

In addition, expert evaluations were conducted through a web interface and under specific conditions. The evaluation process included questions regarding the difficulty of the problem and the accuracy of the solution, as well as an additional comments section to provide detailed feedback. Data from these evaluations provide important insights for future research and development.

This research focuses on the extent to which artificial intelligence can accurately perform programming tasks, particularly in the area of chemistry, and its knowledge and capabilities. As advances in prompt engineering are made, the capabilities of these models are expected to be further extended.

Experimental results

The first section of the paper introduces example problems. To show how a large-scale language model solves a variety of tasks and produces impressive results, the paper presents task outputs for specific categories in the figures below.

To unify the format of the task questions, prompts are set up as functions to be completed. First, code loading numpy, a Python library specialized for numerical computation, is shown, which serves as additional "context". The information required for the task consists of two parts: the names of the input variables "n_steps", "T", and "k", and a comment explaining the purpose of the function. As an example, the paper presents a function that performs Metropolis Monte Carlo sampling in a harmonic potential.

Here, "k" denotes the spring constant, and the model is instructed to generate samples from the energy function U(x) = (1/2) k (x - x0)^2 (with x0 = 0), using reduced units with Boltzmann's constant kB = 1.0.

Even with just a few instructions, the output code is accurate except for the errors marked on certain lines. One such line, where the particle position is resampled uniformly in the range [-1, 1), yields accurate results only under certain conditions of the system. Other outputs from the model instead propose perturbing the positions with Gaussian random numbers, which is an appropriate approach, although the fixed σ² = 1 setting may not be optimal for every spring constant or temperature.
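A correct completion along the lines discussed above might look like the following sketch, which uses the Gaussian-perturbation proposal (valid for any spring constant or temperature, unlike the fixed [-1, 1) resampling). The function name and proposal width are illustrative assumptions, not the paper's exact output.

```python
import numpy as np

def metropolis_mc(n_steps, T, k):
    """Sample positions from U(x) = 0.5 * k * x**2 via Metropolis Monte Carlo.

    Reduced units with Boltzmann's constant kB = 1.0. Proposals perturb the
    current position with a Gaussian step, so detailed balance holds for any
    spring constant k or temperature T.
    """
    kB = 1.0
    beta = 1.0 / (kB * T)
    x = 0.0
    samples = np.empty(n_steps)
    for i in range(n_steps):
        x_new = x + np.random.normal(0.0, 0.5)      # Gaussian proposal step
        dU = 0.5 * k * (x_new**2 - x**2)            # energy change of the move
        if dU <= 0 or np.random.random() < np.exp(-beta * dU):
            x = x_new                               # Metropolis acceptance
        samples[i] = x
    return samples
```

A quick sanity check on such code is that the sample variance should approach kB*T/k, the exact variance of the harmonic Boltzmann distribution.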

The figure below also presents an additional example that demonstrates that the davinci-codex model has internal knowledge of chemistry, especially the general chemistry of phase equilibria.

The model "knows" the proper rearrangement of the Clausius-Clapeyron equation and outputs accurate results when the heat of vaporization ('Hvap') is given in joules/mol. These examples demonstrate that LLMs can not only generate text but also apply specific expertise to draw logical conclusions. Users should be aware that results can vary greatly depending on what is included or explicitly stated in the prompt.
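The rearrangement in question is the integrated Clausius-Clapeyron relation, ln(P2/P1) = -(Hvap/R)(1/T2 - 1/T1). A minimal sketch of what a correct completion computes (the function name and argument order are assumptions, not the paper's prompt):

```python
import math

def vapor_pressure(P1, T1, T2, Hvap):
    """Clausius-Clapeyron: estimate the vapor pressure P2 at temperature T2,
    given a known pressure P1 at T1 and the heat of vaporization Hvap in J/mol.

    Temperatures in kelvin; assumes Hvap is constant over [T1, T2].
    """
    R = 8.314  # gas constant, J/(mol K)
    return P1 * math.exp(-Hvap / R * (1.0 / T2 - 1.0 / T1))
```

For water (Hvap ≈ 40660 J/mol, P1 = 101325 Pa at T1 = 373.15 K), this predicts roughly 70 kPa at 363.15 K, close to the tabulated value.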

The next section of the paper presents an expert evaluation, focusing on the best-performing model, davinci, which has extensive knowledge of equations and general computational methods across areas of chemistry. As shown in the table below, overall accuracy is evaluated across a variety of topics and models, including those that can only be judged by experts.

The results show that the models, including Davinci, are able to provide accurate answers to a wide range of topics, with Davinci faring the best. Prompt engineering has been shown to improve accuracy by approximately 30 percentage points.

However, the average accuracy on topics that can be evaluated by humans is relatively low, reflecting the difficulty of the task. Specifically, the more advanced tasks include NWChem input file creation, Monte Carlo simulation implementation of harmonic oscillators, and complex multi-panel plot generation.

In addition, we have identified a breakdown of task difficulty in terms of individual ratings, as shown in the figure below. In the expert's judgment, the data set contains a balance of easy and difficult prompts.

A key finding is that model accuracy is negatively correlated with perceived prompt difficulty. This is an intuitive result, but not a foregone conclusion.

No randomization or specific controls were implemented during the evaluation process, and evaluators had access to all prompts and all corresponding outputs. We therefore recognize that factors such as the order in which prompts and results are presented on the website could bias the evaluation results. For this reason, the remaining analysis focuses only on prompts whose accuracy can be evaluated automatically against expected answers.

The paper also discusses improving the performance of large-scale language models. By using basic prompt engineering strategies, accuracy can be significantly improved. The paper shows the impact of "context" on accuracy, as illustrated in the figure below.

This context is the code that is added before the prompt and serves as supporting information. Specifically, the "custom" context includes the import of the library associated with the topic and a single example that directs the model to the end of the completion. This approach could be useful for both error prevention and providing context.

For example, importing "rdkit" for cheminformatics implies that the word "structure" means the bonding arrangement of atoms, whereas importing "openmm" implies that "structure" means a 3D atomic arrangement. Providing an example completion that displays the version number of the imported package also teaches the large language model where a completion should terminate.
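The "custom" context strategy described above can be sketched as simple prompt assembly: prepend a topic-relevant import and one tiny finished example before the function stub to be completed. The helper name and exact context text here are illustrative assumptions, not the paper's code.

```python
def build_prompt(package, stub):
    """Assemble a 'custom' context prompt for a code-generating LLM.

    The context imports a topic-relevant package (steering word meanings,
    e.g. what "structure" refers to) and shows one completed example
    (printing the package version) so the model learns where a
    completion should end. The stub is the function to be completed.
    """
    context = (
        f"import {package}\n"
        f"# Example: show which version of the package is available\n"
        f"print({package}.__version__)\n\n"
    )
    return context + stub
```

For instance, `build_prompt("rdkit", "def structure(smiles):")` yields a prompt whose import biases the model toward cheminformatics interpretations of the stub.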

In addition, adding certain phrases to the prompts significantly improves the responses of the large-scale language model. For example, adding "very" repeatedly or stating "this code has no bugs" can be effective. The paper also found that inserting a copyright notice into the prompt significantly improves accuracy at high temperatures, as shown in the figure below, because large language models then tend to select more standard, higher-quality code. A further improvement comes from adding the statement "This is written by an expert Python programmer."

These findings have led to new research that uses metadata such as code popularity to condition large-scale language models. This improves performance without resorting to ad hoc prompt engineering. Interestingly, improvements to the davinci3 model have shown reduced susceptibility to prompt engineering. This leads to the development of large-scale language models that leverage human feedback to provide a more natural and intuitive user experience.

Our research has shown that the performance of large-scale language models can be dramatically improved by using appropriate strategies for prompt engineering. This finding provides a foundation for adopting more sophisticated approaches in future large-scale language model development.

In addition, the paper attempted to assess the extent to which large-scale language models possess knowledge of chemistry and the ability to directly link natural language to molecular structures. The study tested two models in particular, InstructGPT and davinci, with InstructGPT showing better results.

When converting the SMILES representation of a molecule to its name, neither model succeeded, scoring 0% on 100 random (relatively small and simple) molecules selected from PubChem. However, InstructGPT was shown to be able to convert sentences describing a molecule into a SMILES representation. For example, it can link specific functional groups in SMILES to natural language, as shown in the figure below. Although not a perfect match, some correspondence is seen; in the case of phenol, for example, the oxygen is placed near the ring.

Also, InstructGPT was able to associate molecular properties (e.g., hydrophilicity) with SMILES and rarely produced invalid SMILES. However, one invalid character was found for the first molecule shown in the figure above.
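Whether a generated SMILES string is even syntactically valid is easy to check automatically; in practice one would parse it with RDKit's `Chem.MolFromSmiles`. As a self-contained illustration only, here is a stdlib-only screen for the most common syntax errors (unbalanced parentheses or brackets, unpaired ring-closure digits); it is deliberately naive and not a substitute for a real parser.

```python
from collections import Counter

def looks_like_valid_smiles(smiles):
    """Cheap syntactic screen for SMILES strings.

    Checks that parentheses and square brackets are balanced and that
    ring-closure digits occur in pairs. A real validity check would
    chemically parse the string (e.g. RDKit's Chem.MolFromSmiles);
    this sketch only catches gross syntax errors.
    """
    depth = bracket = 0
    rings = Counter()
    for ch in smiles:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
        elif ch == '[':
            bracket += 1
        elif ch == ']':
            bracket -= 1
            if bracket < 0:
                return False
        elif ch.isdigit() and bracket == 0:
            rings[ch] += 1  # ring-closure labels must appear in pairs
    return depth == 0 and bracket == 0 and all(n % 2 == 0 for n in rings.values())
```

For example, phenol's SMILES "Oc1ccccc1" passes this screen, while a truncated "Oc1ccccc" (an unclosed ring) does not.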

This study suggests that InstructGPT and other large-scale language models have the potential to learn and fine-tune the associations between natural language and chemical structures. It is also worth noting that specific models that can translate between molecular structures and natural language have recently been trained from scratch. These developments expand the potential applications of large-scale language models in chemistry.


davinci seems to struggle with computational chemistry reasoning. For prompts such as "high accuracy single point quantum calculations using pyscf," it frequently chooses the relativistic Hartree-Fock method, because "relativistic" is associated with accuracy regardless of the nature of the calculation being made. Likewise, for the "force constant" prompt, where a formula must be rearranged in terms of the harmonic mean of the atomic masses, davinci is unable to find a solution.
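The force-constant rearrangement that trips up the model is standard textbook physics: for a diatomic harmonic oscillator, the vibrational frequency ν satisfies ν = (1/2π)√(k/μ), where μ = m1·m2/(m1 + m2) is the reduced mass (half the harmonic mean of the two masses). A minimal sketch of the rearranged formula, with an assumed function name and SI units:

```python
import math

def force_constant(m1, m2, nu):
    """Force constant k (N/m) of a diatomic harmonic oscillator.

    Rearranges nu = (1/(2*pi)) * sqrt(k/mu) to k = mu * (2*pi*nu)**2,
    where mu = m1*m2/(m1 + m2) is the reduced mass. Masses in kg,
    vibrational frequency nu in Hz.
    """
    mu = m1 * m2 / (m1 + m2)  # reduced mass of the two atoms
    return mu * (2.0 * math.pi * nu) ** 2
```

As a sanity check, carbon monoxide (masses ≈ 12 u and 16 u, ν ≈ 6.43×10¹³ Hz) gives k on the order of 1.9×10³ N/m, consistent with tabulated values.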

davinci also sometimes cites functions that do not exist: for the difficult prompt "return the residual dipolar couplings from a SMILES string," it attempts to use a MolToRDC method that does not actually exist. This illustrates the difficulty LLMs have in making chemical inferences when completing prompts.

It is worth noting that the LLMs could solve many of the benchmark problems when the natural language was Chinese, German, or Spanish. This may help lower the barrier for non-English speakers to use computational tools.

LLMs are now readily available through tools such as Tabnine and Copilot. High accuracy can be expected on computational chemistry problems, but caution is required with difficult prompts. Surprising abilities included generating molecules from natural language and producing accurate results from non-English prompts.
