
The Future Of Chemical Research And New Drug Development Created By Transformers And Large-Scale Language Models


3 main points
✔️ Transformers have had a profound impact on the chemical field and play an important role in the discovery and development of new drugs
✔️ Chemistry tasks can be processed as textual sequences, pointing to greater efficiency in the drug development process
✔️ Task-specific models achieve superior performance in molecular transformation tasks such as reaction prediction and retrosynthesis

Transformers and Large Language Models for Chemistry and Drug Discovery
written by Andres M Bran, Philippe Schwaller
(Submitted on 9 Oct 2023)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)


The images used in this article are from the paper, the introductory slides, or were created based on them.


The field of machine learning has long been concerned with processing and accurately modeling human language. The idea behind this is that language is essential to human reasoning ability. Accurately modeled language models have the potential to enhance a variety of information processing tasks and bring revolutionary advances to multiple industries. In particular, the field of natural language processing is making great strides due to improvements in computing infrastructure, algorithmic breakthroughs, and the proliferation of rich data.

This advancement has also impacted the area of chemistry, which is fundamental to the discovery and development of new drugs. Understanding and accurately modeling the language of chemistry is essential for research and development in the pharmaceutical industry. The application of machine learning techniques to the field of chemistry has enabled the efficient analysis and interpretation of vast amounts of chemical data and literature, facilitating the discovery of new drugs.

Introduced in 2017, the Transformer architecture brought a revolutionary change to natural language processing. The model is built around the attention mechanism, which captures the meaning of words and sub-words in context. Transformers have continued to evolve and have demonstrated superior performance in a wide range of language modeling tasks, including translation, sentiment analysis, and summarization.

In the field of chemistry, this technology is also bringing about a new revolution. Researchers are developing ways to process chemical tasks in the form of text sequences, and the introduction of open data sets and benchmarks is streamlining efforts to address fundamental challenges in the process of developing new drugs. They also aim to bridge the gap between chemistry and natural language modeling by developing multimodal models that incorporate additional data types, such as the spectrum of analytical techniques and synthetic procedures.

Today, the process of developing new drugs is advancing rapidly. The impact of transformer models on the field of chemistry has been profound, and they play a central role in shaping the future of chemistry and drug discovery. This paper provides a brief introduction to textual representations of molecules and reactions, followed by a discussion of single-modality and multimodality task-specific transformers, and finally a discussion of large-scale language models and their potential uses in chemistry and new drug discovery.

Organic Chemistry Modeling

Chemistry resembles language in many aspects. Not only is human language widely used to communicate information, but the rules behind chemical transformations themselves seem to form their own language. Success in accurately understanding and modeling this "language of chemistry" will unlock the complexity of chemistry and open the door to new applications, such as automated retrosynthesis planning and efficient exploration of chemical space.

However, the language of chemistry is distinct from traditional languages such as English and Chinese. In organic chemistry, grammars are constructed based on molecular graphs and reaction conditions, which has been a barrier to the direct application of transformer technology. The key to overcoming this challenge lies in the decades-old traditional approach of representing molecules as linear strings. Indeed, recent years have seen new improvements and proposals in this area, which have the potential to further our understanding of organic chemistry.

Research in organic chemistry can be described as the process of discovering new molecules and reactions, analyzing them, and cataloging them in databases. Researchers use a variety of sources, including scientific papers, patents, handbooks, and, more recently, computational databases. To make it easier to store and retrieve this information, SMILES (Simplified Molecular Input Line Entry System) has been proposed and widely used since the 1980s.

SMILES is a method of representing a molecule as a linear string, starting with a particular atom and enumerating all other atoms in the molecule in sequence. In this representation method, special characters are used to indicate bond types, branching, ring structure, stereochemistry, and other important information to represent the molecule. In this way, the broad spectrum of organic chemistry can be represented as a string.
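Before a sequence model can process a SMILES string, the string is typically split into atom- and bond-level tokens with a regular expression. The sketch below is a simplified version of the regex-based tokenizers commonly used with molecular transformers; the exact pattern varies by implementation.

```python
import re

# Simplified SMILES tokenizer. Multi-character tokens (bracket atoms,
# Br, Cl) must appear before single characters in the alternation.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"          # bracket atoms, e.g. [NH4+], [C@@H]
    r"|Br|Cl"               # two-letter organic-subset atoms
    r"|[BCNOSPFIbcnops]"    # one-letter atoms (lowercase = aromatic)
    r"|%\d{2}"              # two-digit ring-closure labels
    r"|[=#/\\().+\-@:\d])"  # bonds, branches, charges, ring digits
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; the tokens re-join losslessly."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

# Aspirin: branches "(...)", double bonds "=", an aromatic ring "c1ccccc1".
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Treating bracket atoms and two-letter elements as single tokens keeps the vocabulary chemically meaningful, so the model never has to learn that "C" and "l" together mean chlorine.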

However, with the advent of machine learning applications for molecules, it has become clear that the SMILES representation has limitations. For example, its lack of robustness has led to the problem of generating invalid molecules. To address this issue, a new string-based representation method, SELFIES (Self-Referencing Embedded Strings), was introduced. SELFIES guarantees by construction that any given string maps to a valid molecule, and it has been applied in areas such as new drug discovery and molecule generation.

In addition, these molecular text representations make it easy to encode chemical reactions. Individual molecules are separated by a dot ".", while the ">" symbol separates the three sections of a reaction: reactants, agents (catalysts and reagents), and products. A reaction of A and B giving C and D over a catalyst and reagent is thus written "A.B>catalyst.reagent>C.D"; when no agents are listed, the two ">" symbols appear together as ">>". This "reaction SMILES" format is widely used.
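This convention is easy to parse programmatically. The snippet below is a minimal sketch that splits a reaction SMILES into its three sections and then into individual molecules; real toolkits such as RDKit additionally validate the chemistry.

```python
def parse_reaction_smiles(rxn: str) -> dict[str, list[str]]:
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    reactants, agents, products = rxn.split(">")

    def molecules(section: str) -> list[str]:
        # An empty section (e.g. no agents, written ">>") yields no molecules.
        return section.split(".") if section else []

    return {
        "reactants": molecules(reactants),
        "agents": molecules(agents),    # catalysts, reagents, solvents
        "products": molecules(products),
    }

# Fischer esterification: acetic acid + ethanol -> ethyl acetate,
# with sulfuric acid as the agent between the two '>' symbols.
rxn = "CC(=O)O.CCO>OS(=O)(=O)O>CC(=O)OCC"
print(parse_reaction_smiles(rxn))
```

The same three-way split is what sequence models see: the "source sentence" for a forward-prediction model is everything before the last ">", and the "target sentence" is the product section.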

The introduction of the transformer architecture in chemistry also provided a new approach to solving chemistry problems. This technology made it possible to express chemistry problems in the form of a language and transform them into a sequence of tokens, which has led to transformational advances within the chemistry domain. The technology has demonstrated its power in a wide variety of predictive tasks, including forward reaction prediction and retrosynthesis, regression to predict molecular properties, and classification of reactions.

In addition, transformers have applications that go beyond mere manipulation of molecular graphs. By modeling human language, they also succeed in tasks that require a deep understanding of experimental conditions and standard procedures, such as inferring experimental procedures. This opens up unprecedented and diverse problem-solving possibilities in the field of chemistry.

This wide range of applications is made possible by different variations of the transformer architecture. Depending on the specific application, different parts of the architecture are utilized in the form of encoder-decoder models, encoder only, or decoder only. This allows us to develop the best model for different applications, such as transforming from one sequence to another, tasks to extract rich representations from data, generative applications, etc.
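As an illustration only, the three architecture variants map naturally onto different families of chemistry tasks. The pairings below summarize common practice in the literature rather than hard rules.

```python
# Illustrative pairing of transformer variants with typical chemistry tasks.
# These reflect common usage patterns, not fixed requirements.
ARCHITECTURE_USES = {
    "encoder-decoder": "sequence-to-sequence tasks: reaction prediction, retrosynthesis",
    "encoder-only": "representation extraction: property regression, reaction classification",
    "decoder-only": "generative tasks: de novo molecule generation, SMILES completion",
}

for variant, tasks in ARCHITECTURE_USES.items():
    print(f"{variant}: {tasks}")
```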

Transformer architecture changed the world when it dramatically improved translation between languages. However, this technology has not only transcended language barriers, but has also led to revolutionary advances in the realm of chemistry. It has successfully translated chemical reactions and molecular structures from one "language" to another.

A breakthrough in this field came with the Molecular Transformer introduced by Schwaller et al., who framed chemical reaction prediction as the task of "translating" from one form (the SMILES of the precursors) to another (the SMILES of the products). The technique has been very successful in reaction prediction and has established a new standard in the field. It has also been applied to more complex tasks such as retrosynthesis, enhancing researchers' ability to predict the reactants and reagents needed to produce a particular compound.

The range of applications has expanded further with models such as Chemformer, proposed by Irwin et al., which can be pre-trained on a variety of chemical tasks and then specialized for specific applications, extending flexibility and applicability in chemistry problem solving. In addition, research by Tu and Coley has developed a new approach that directly encodes molecules as molecular graphs and translates them into SMILES, further expanding the potential of transformers in chemistry and showing a significant improvement over previous methods.

Representation learning also plays an important role in chemistry. Converting molecules and reactions into vector form has a wide range of applications, including similarity assessment for database searches, reaction yield prediction, and identification of toxic compounds. These applications are crucial in the development process of new drugs.

A study by Wang et al. showed improved accuracy in downstream regression tasks by generating reaction representations and comparing them to traditional manually crafted molecular representations, highlighting the effectiveness of the transformer encoder in chemistry tasks. Another study replaced the decoder portion of the transformer with a classification layer to learn class predictions for chemical reactions. The resulting vector representations were used to visualize and explore a database of chemical reactions, revealing how reactions group by data source and compound properties.

Such applications of unsupervised learning have also been developed in biochemistry, where Rives et al. trained transformer models on unlabeled protein sequences to learn the "language of proteins," allowing them to predict protein properties and protein folding. In addition, these models have shown the ability to generalize beyond naturally occurring proteins, paving the way for de novo generation of new proteins.

Transformers have also been found to build internal representations of chemical reactions and to accurately compute atom mappings across a reaction. RXNMapper, the result of this discovery, outperforms other methods in speed, parallelization, and accuracy. The approach also applies to enzymatic reactions, opening up new avenues for identifying the active site of a protein sequence.

Furthermore, chemical research is a multifaceted process that is not limited to chemical structures. It deals with a wide variety of data types and modalities, from the human language used to describe molecules and experimental results to experimental data presented in the form of numerical sequences and images.

Given this diversity, chemists have proposed tasks that bridge the gap between the molecular world and human language. For example, the "molecule captioning" task describes a particular molecule in natural language, covering features such as molecular properties, origin, and drug interactions in plain English. In addition, new models have been developed that allow inter-conversion between molecules and natural language, supporting a wide range of tasks such as generating molecules from textual queries, predicting reaction outcomes, and retrosynthesis.

The technique can also be applied to predict experimental procedures, which are essential for synthetic process design. Models have been developed to generate specific steps for experimental realization, such as addition of substances, stirring, and purification, where predicted reactions alone are not sufficient.

In addition, research has been conducted to link experimental results to molecular structure, and a transformer model has been trained to predict structure using computationally generated IR spectra. This approach has achieved better results than previous methods in predicting functional groups from IR spectra.

These results show that the transformer architecture has the potential to drive innovation not just in text processing, but in broader areas such as chemistry and biochemistry.

Applications evolving beyond task-specific models

Recent technological advances have rapidly brought attention to foundation models that are pre-trained on large amounts of data. These models acquire extensive knowledge by learning from a wide range of textual data obtained from the Internet. As we have seen, these models have the ability to produce human-like text in a variety of situations, especially with the extension of the transformer architecture. These models can also be tailored for specific purposes with less data.

Notable among these is ChatGPT, a large-scale language model fine-tuned for conversational use. The release of ChatGPT has not only brought machine learning to a mass audience, but has also sparked a profound debate about the nature of intelligence. At the same time, it has raised alarm bells about potential problems, such as the spread of misinformation; ChatGPT's influence and accessibility have prompted a rethinking of how we generate and consume media, and careful consideration of its potential impact.

The success and popularity of ChatGPT stem from its user-friendly interface, which is freely accessible and intuitive for everyone to use, and its usefulness, showing excellent performance even on tasks other than those for which it was trained. These points reveal the power of ChatGPT and similar models, and suggest the potential for further innovative applications.

In addition, advances in machine learning algorithms and growth in training data behind large-scale language models such as ChatGPT have created a new trend that pushes technical limits. As these models grow larger, they perform their learned tasks more effectively. This phenomenon is particularly pronounced for language models and has come to be known as "scaling laws." These laws have become an important tool for researchers to identify performance trends as models scale and to predict the capabilities of large models.
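For reference, the empirical scaling laws reported for language models (e.g., by Kaplan et al.) take approximately power-law form: test loss $L$ falls predictably as parameter count $N$, dataset size $D$, or compute $C$ grows, with empirically fitted constants $N_c, D_c, C_c$ and exponents $\alpha$. A rough statement of the form (the constants vary by setup) is:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

It is this smooth predictability that lets researchers extrapolate a large model's loss from a family of smaller training runs.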

However, the phenomenon of "emergent capabilities," which do not merely enhance existing capabilities, but also emerge entirely new capabilities as the model grows, is attracting attention. These new capabilities are unpredictable in small-scale models and suddenly appear when the model reaches a certain size. For example, while language models with limited computational budgets behave randomly, once they exceed a certain size, their performance on tasks has been observed to improve significantly.

These emergent capabilities include Chain-of-Thought (CoT) reasoning for step-by-step inference and the ability to follow instructions; both often degrade performance in small models but improve it in models beyond a specific size. This allows language models to effectively solve a variety of tasks, including reasoning, from natural language queries without explicit training. These remarkable capabilities of large-scale language models could prove revolutionary in diverse fields, including chemistry.

Large-scale language models in chemistry

The transformer architecture is also gaining attention in the chemical field as a way to precisely encode and accurately process chemical tasks. The vast majority of chemical information is expressed through human language: reasoning in chemistry, such as describing reaction mechanisms or a drug's mode of action, is fundamentally verbal. Yet chemistry also relies on non-textual elements such as graphs and images that cannot be expressed in language alone. This raises the question of whether, and to what extent, large-scale language models can reproduce chemical inference.

In particular, fine-tuning and in-context learning are the primary means of adapting these large, pre-trained language models to specific applications. These techniques have performed well in many applications and illustrate the new learning paradigms that large-scale language models provide. For example, it has been demonstrated that large-scale language models such as GPT-3 can efficiently solve a wide variety of tasks in chemistry and materials science through fine-tuning.

An important application in this field is molecular generation. Until now, models that generate molecules using linear string representations, such as SMILES and SELFIES, have been the norm. However, the use of language models by Flam-Shepherd and Aspuru-Guzik to directly generate 3D atomic positions has opened up new possibilities in this field. These models can generate structures obtained through various forms of training, such as crystals and proteins, and have shown performance comparable to expertly designed state-of-the-art algorithms, while overcoming the limitations of traditional methods.

The application of large-scale language models in chemistry is particularly useful when data are scarce or difficult to obtain. The innovative capabilities of these models are expected to lead to new advances in chemical research. The flexibility offered by these technologies and their ability to rapidly unravel complex correlations in data could fundamentally change the way machine learning is used in the sciences.

In addition, one of the most notable abilities demonstrated by language models is the step-by-step reasoning mentioned earlier. This ability is activated through Chain-of-Thought (CoT) prompting and complements the ability to use tools effectively. These advances have been shown to significantly improve the performance of language models across a wide variety of tasks: CoT prompting directs a model to follow a series of reasoning steps to solve a task, enabling symbolic operations performed much as a human carries out arithmetic while tracking intermediate steps.

The ability to use tools is another important property of language models, allowing them to invoke external computational tools and enrich their knowledge through search engine queries and calculator access. This allows for improved performance of large language models in a range of tasks that were previously inaccessible. This new advance suggests the possibility of combining these capabilities to produce more powerful and useful functionality.

More recently, Modular Reasoning, Knowledge and Language (MRKL) and Reason+Act (ReAct) systems have been developed that combine the CoT and tool-use capabilities of modern large-scale language models. These agents outperform other methods based on large-scale language models by incorporating external tools into the CoT setting. In particular, effective tool use partially solves the single-modality problem of LLMs, enabling processing of different types of input data, real-time decision making in simulation environments, and even interaction with real-world robot platforms.
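The pattern behind these agents can be sketched in a few lines: the model proposes an action, the action invokes an external tool, and the observation is fed back until a final answer is produced. In the minimal sketch below, a scripted trace stands in for the LLM's output, and the tool names and loop structure are illustrative assumptions, not an actual MRKL/ReAct implementation.

```python
import ast
import operator

# A safe calculator "tool": evaluates +,-,*,/ expressions via the AST
# instead of eval(), so arbitrary code cannot run.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

TOOLS = {"calculator": calculator}

def react_loop(steps):
    """Run a scripted (thought, action, input) trace. In a real agent,
    an LLM would generate each step from the accumulated observations."""
    observation = None
    for thought, action, arg in steps:
        if action == "finish":
            return arg if arg is not None else observation
        observation = TOOLS[action](arg)  # feed result back as observation
    return observation

# Scripted trace standing in for LLM output.
trace = [
    ("I need the product of 12 and 7.", "calculator", "12 * 7"),
    ("That is the answer.", "finish", None),
]
print(react_loop(trace))  # → 84
```

The key design point is the feedback loop: the tool's output becomes part of the context for the next reasoning step, which is what lets the agent ground its answers in external computation rather than in its own (possibly hallucinated) arithmetic.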

The deployment of agents in chemistry demonstrates the potential of large-scale language models for chemical applications, despite the tendency of these models to produce erroneous content and inaccurate information. Collaborations involving researchers from around the world have demonstrated 14 use cases, including wrappers that improve the accessibility of computational tools, reaction optimization assistants, and knowledge parsers and synthesizers. These advances could significantly improve the applicability and accessibility of computational applications in chemistry. In particular, the set of computational chemistry tools developed by Bran and Cox et al. demonstrates efficiencies in planning and executing chemistry tasks and curbs the tendency to generate "hallucinated" answers, since the models provide solutions grounded in real-world data. Platforms such as ChemCrow can act as a general assistant for chemists, making computational tools more accessible and accelerating scientific discovery.


The introduction of machine learning, especially transformer architectures, has led to breakthroughs in chemistry and new drug development. Exploiting the similarities between chemistry and language, together with the introduction of open databases and benchmarks, has enabled chemical tasks to be expressed in words and solved. This progress has unfolded in three phases.

In the first phase, task-specific models achieved excellent performance in molecular transformation tasks such as reaction prediction and retrosynthesis; this superior performance made them the standard for many applications.

Attempts were then made to combine various additional information related to chemistry, such as experimental data and natural language, which could be used in even more applications. However, these were still limited to specific tasks.

And recent technological advances in training and tuning large-scale language models have led to research that exploits the extensive capabilities of these models. This includes applications such as regression, classification, molecular generation, and reaction optimization with unprecedented flexibility and performance. In addition, agents incorporating a virtually unlimited range of modalities are realizing tasks ranging from molecular generation to automated organic synthesis.

By leveraging the expressive power and flexibility of natural language, this recent trend in technology aims to bridge the gap between chemical and natural languages. Further development of these technologies promises a future in which machine learning will play an even greater role in accelerating scientific discovery.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
