
ChemChat: The Future Of Large-Scale Language Models And Chemistry, And The Potential Of Integrating External Tools With Chatbots


Large Language Models

3 main points
✔️ Innovations in chemistry through large-scale language models: Large-scale language models can handle the language of chemistry and have achieved excellent results in accelerating the molecular design and exploration process.
✔️ Looking to the Future of Molecular Discovery: Advanced methodologies that leverage the integration of chemistry-oriented tools with large-scale language models accelerate the process of molecular discovery and alleviate the cost and time constraints of molecular synthesis.
✔️ Integrating Chemistry Tools with Chatbots: Chatbot interfaces centered on large-scale language models transform the way chemists interact with chemical data, allowing them to easily perform programming tasks.

Language models in molecular discovery
written by Nikita Janakarajan, Tim Erdmann, Sarath Swaminathan, Teodoro Laino, Jannis Born
(Submitted on 28 Sep 2023)
Subjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Biomolecules (q-bio.BM)


The images used in this article are from the paper, the introductory slides, or were created based on them.


While rapid technological innovation in recent years has fundamentally altered our understanding of biochemical processes, the chemical field still spends enormous amounts of time and money, on the order of 10 years and 300 billion yen, to bring new products to market. This is due to the high failure rate of laboratory-level experiments, the vastness of the chemical search space, and the strong element of luck, including unexpected discoveries. It is common to design a molecule and a synthetic route on the basis of various theories, spend considerable time synthesizing the molecule, and only then find that the desired function cannot be obtained. Through the repetition of such experiments, discoveries that can only be called serendipitous eventually give rise to new products such as pharmaceuticals.

In this context, large-scale language models, technologies that can understand and generate text like humans, have achieved success in a variety of fields. Chemistry is no exception: molecules can also be represented as a language, and this has the potential to accelerate the molecular design and discovery process. In recent years, large-scale language models have shown excellent results in handling chemical languages, from protein folding to the design of small molecules, peptides, and polymers.

But what exactly are large-scale language models? Simply put, they are machine learning models that take a fragment of text and predict what should come next. By learning probability distributions over sequences of words, these models enable tasks such as text generation and language translation.
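The idea of learning a probability distribution over sequences can be sketched with a toy character-level bigram model trained on a handful of SMILES strings. This is a minimal illustration of the principle only, not a real chemical language model:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count character-bigram transitions, with start (^) and end ($) markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in corpus:
        seq = "^" + s + "$"
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # Normalize counts into conditional probabilities P(next char | current char)
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

# Tiny toy corpus: ethanol, acetic acid, propane
model = train_bigram(["CCO", "CC(=O)O", "CCC"])

# After 'C' the corpus contains transitions to C, O, '(' and end-of-string
probs = model["C"]
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

A real chemical language model replaces these bigram counts with a neural network (typically a Transformer), but the objective is the same: model the probability of the next token given the tokens so far.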

Why are language models useful in chemistry? By learning to represent chemical structures, these models facilitate the exploration of chemical space and allow us to design molecules with specific functional properties. Furthermore, by bridging the gap between natural and scientific language, chemists may be able to communicate in natural language the molecular functions they wish to design and, through dialogue, find molecular structures and how to synthesize them.

This paper focuses on the utility of large-scale language models to accelerate molecular discovery (molecular design and exploration). Starting with traditional scientific discovery methods, the paper introduces the combination of molecular generative and molecular property prediction models, as well as tools and libraries for scientific language modeling. Finally, it looks ahead to how natural language models, via chatbots, can be combined with the molecular discovery process to transform the molecular design of the future.

Accelerating the molecular discovery cycle is expected to make a significant contribution to important issues related to human life, especially in the field of drug discovery.

Future molecular discovery methods

Molecular discovery poses a major challenge to the traditional scientific method because a wide variety of properties must be optimized over a vast search space. In the design-make-test-analyze (DMTA) cycle, the cost and time of synthesizing molecules are the bottlenecks that impede research progress. Traditional methods rely on "molecular hypotheses" proposed by medicinal chemists, which are far from exhaustive and often too slow to address pressing global challenges. Streamlining the molecular discovery process has therefore long been urged. The main challenge is to improve the speed and quality with which these molecular hypotheses are evaluated in laboratory-level experiments.

Deep generative models have emerged as a promising means to speed up hypothesis generation and design for molecular discovery. However, even these advanced molecular generation models require efficient methods for large-scale virtual screening to effectively test hypotheses. The Accelerated Molecular Discovery Cycle adds a validation loop to the DMTA cycle, allowing many hypotheses to be evaluated quickly and inexpensively. This new loop enhances the generative model of the design phase and ensures that only promising hypotheses actually proceed to synthesis or physical experiments.

Molecular representation

Molecular representation defines the information that a model can utilize. The term "representation" in this context refers to the way in which the structure and properties of a molecule are represented. With advances in chemical language models (CLMs), text-based representations of molecules are gaining attention.

SMILES is a textual representation of a molecule in which atoms, bonds, branches, aromaticity, and so on are encoded as specific character strings. This representation is well suited to chemical language models because a SMILES string is easy to tokenize (split into units). However, SMILES is non-unique: the same molecule can be written as several different SMILES strings. This non-uniqueness can be exploited for data augmentation in molecular property prediction and molecular generation. On the other hand, models that generate SMILES can also produce invalid strings, and various processing and normalization steps are taken to mitigate this problem.
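The easy tokenization mentioned here is commonly done with a regular expression that splits a SMILES string into chemically meaningful tokens: bracket atoms such as [C@H], two-letter halogens such as Cl and Br, bonds, branches, and ring closures. A minimal sketch based on the widely used pattern from the reaction-prediction literature:

```python
import re

# One SMILES token at a time: bracket atoms ([NH3+]), two-letter halogens
# (Cl, Br), single atoms, bonds, branch parentheses, ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens for a chemical language model."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: the tokens must reassemble into the original string
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize_smiles("CC(=O)Nc1ccc(O)cc1"))  # paracetamol
print(tokenize_smiles("C[C@H](N)C(=O)O"))     # L-alanine; [C@H] is one token
```

Note how the bracket atom [C@H] survives as a single token: splitting it character by character would destroy its chemical meaning.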

Developed as an alternative to SMILES, SELFIES is designed so that invalid molecular representations cannot be generated. It is based on rules that guarantee valid bond valences and tracks branch lengths and ring sizes to avoid open branches and rings. This ensures that molecule generation always yields a syntactically valid string, although syntactic validity alone does not guarantee that the generated molecules are chemically useful.
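The validity problem SELFIES addresses can be made concrete with a toy syntax check. The sketch below checks only two of the ways a generated SMILES string can be malformed (unbalanced branch parentheses and unpaired ring-closure digits); real validity also depends on valence rules and is checked with a cheminformatics toolkit such as RDKit:

```python
def roughly_valid_smiles(s):
    """Toy syntax check: branch parentheses balanced, ring-closure digits paired.
    Illustrative only; real validation (valence, aromaticity) needs RDKit."""
    depth = 0
    open_rings = set()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing a branch that was never opened
                return False
        elif ch.isdigit():
            # a ring-closure digit must appear exactly twice: open, then close
            open_rings ^= {ch}
    return depth == 0 and not open_rings

print(roughly_valid_smiles("c1ccccc1"))  # benzene: valid
print(roughly_valid_smiles("c1ccccc"))   # ring 1 never closed: invalid
print(roughly_valid_smiles("CC(=O"))     # open branch: invalid
```

A free-running generative model can emit strings like the last two; SELFIES sidesteps the issue by defining a grammar in which every token sequence decodes to a closed, valence-correct structure.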

Introduced by IUPAC, InChI (International Chemical Identifier) is a string that hierarchically encodes the structural information of a molecule. For large molecules, this string can be long and complex. To solve this problem, a hash called InChIKey has been developed to facilitate searching and retrieval. However, InChI is not very commonly used in chemical language models.
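The motivation for InChIKey, collapsing a long identifier into a short fixed-length key that is easy to index and search, can be illustrated with a generic hash. This is not the actual InChIKey algorithm (which has its own SHA-256-based format and layout), only the idea of fixed-length keys:

```python
import hashlib

def toy_fixed_key(identifier, length=14):
    """Hash an arbitrarily long identifier to a short fixed-length key.
    Illustrative only; the real InChIKey uses its own SHA-256-based encoding."""
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return digest[:length].upper()

# Standard InChI of ethanol
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
key = toy_fixed_key(inchi)
print(len(key))  # always 14 characters, no matter how long the InChI is
```

Because the key length is constant and the mapping deterministic, such keys work well as database indexes and exact-match search terms, which is exactly the role InChIKey plays.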

These text-based molecular representations play an important role in training chemical language models. Models can learn molecular properties and structures from these representations and use them to generate new molecules or to predict the properties of existing molecules. Each representation method has its own advantages and limitations, and the appropriate one must be chosen for the intended purpose.

Generation Processes

Generative modeling is the process of learning the underlying distribution of data for the purpose of generating new samples and plays an important role in improving the efficiency of drug discovery. There are two types of models that can be used in this technique: conditional generative models, which utilize specific data attributes or labels to generate new samples with desirable characteristics, and unconditional models, which generate molecules that are similar to the training data. In particular, the conditional generative model facilitates goal-oriented hypothesis design and greatly improves the efficiency of drug development.

The figure below shows an example of the process of conditional molecule generation using a language model.

The process begins with the collection and processing of multimodal data, which is compressed into fixed-size latent representations and passed to the molecular generative model. During training, the generated molecules receive in-silico (computational) property predictions that are linked to the generative model through a feedback loop: the in-silico model uses a reward function to steer the generative model toward property- or task-driven molecules. At inference time, the candidate molecules produced by the optimized model proceed through a workflow of laboratory synthesis and subsequent experimental validation to determine their effectiveness for the desired task.
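The feedback loop described above can be sketched as a generate-score-filter loop: a generator proposes candidates, an in-silico property model scores them, and a reward criterion decides which hypotheses survive. Both components below are deliberately fake stand-ins (random strings and a dummy scorer), purely to show the control flow:

```python
import random

random.seed(0)  # deterministic toy run

def toy_generator(n):
    """Stand-in for a generative model: propose random candidate strings."""
    alphabet = "CNO(=)"
    return ["".join(random.choice(alphabet) for _ in range(8)) for _ in range(n)]

def toy_property_model(candidate):
    """Stand-in for an in-silico property predictor: fake score = fraction of C."""
    return candidate.count("C") / len(candidate)

def discovery_loop(rounds=3, batch=50, keep=5):
    best = []
    for _ in range(rounds):
        candidates = toy_generator(batch)
        scored = [(toy_property_model(c), c) for c in candidates]
        # Reward step: retain the top-scoring hypotheses. In a real system this
        # signal is fed back to update the generative model's parameters, and
        # only the survivors proceed to synthesis and physical experiments.
        best = sorted(scored + best, reverse=True)[:keep]
    return best

top = discovery_loop()
assert len(top) == 5
assert top[0][0] >= top[-1][0]  # sorted by descending score
```

The point of the extra in-silico loop is visible even in the toy: hundreds of hypotheses are scored cheaply, and only a handful ever reach the expensive synthesis stage.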

Introduction of advanced software tools for scientific language modeling

The evolution of open source software in the scientific community has led to revolutionary changes, especially in the field of chemistry. This has contributed in many ways to the development of new research methods and the reproducibility of scientific results. Here we focus on software tools useful for molecular discovery, ranging from Python packages to cloud-based web applications.

First are natural language processing models. The success of the Transformer architecture was a breakthrough in natural language processing and has been further amplified by the popularity of the transformers library developed by HuggingFace. These technologies have found applications in a variety of fields, including computer vision, reinforcement learning, and protein structure prediction. HuggingFace provides language models ranging from implementations of new architectures to pre-trained models that can be used for fine-tuning and inference. Some of these models are specific to the life sciences, covering tasks such as molecular property prediction and text-based molecular generation.

GT4SD (Generative Toolkit for Scientific Discovery) is a toolkit designed to streamline scientific discovery. It supports leveraging language models in a wide range of applications, including molecular discovery. GT4SD allows users to run, train, fine-tune, and share state-of-the-art generative models. The toolkit makes a variety of techniques available, including diffusion and graph generation models, enabling researchers to move efficiently on projects such as organic materials design. It provides a command line tool and a model hub for sharing trained models in the cloud. It also offers many property prediction endpoints and pre-trained algorithms for small molecules, proteins, and crystals, along with a free web application and educational notebooks.

Predicting chemical reactions and identifying synthetic routes are key challenges in the study of chemistry. The state-of-the-art technology in this area is the "rxn4chemistry" library provided by IBM's RXN for Chemistry platform. This tool applies natural language processing techniques to the chemistry domain and treats chemical reactions as sequence transformation problems, where atoms, molecules, and reactions are represented as letters, words, and sentences, respectively.

Molecular Transformer (MT) is the core architecture of rxn4chemistry and employs an autoregressive encoder-decoder model. It distinguishes itself from many other models in that it is data-driven, predicts the outcome of chemical reactions without templates, and can directly represent stereochemistry. This allows high performance on regio- and stereoselective reactions, and MT has a wide range of applications, from single-step retrosynthesis to enzymatic reactions.

In addition, RXN for Chemistry enables the exploration of synthesis protocols that can be executed automatically on robotic platforms such as IBM RoboRXN, an innovative advance in chemical reaction modeling and synthesis automation. Using encoder-decoder Transformers, chemical synthesis actions can be extracted from patented experimental procedures or predicted directly from reaction SMILES. These models can also be controlled and monitored on the robotic platform from a web interface. The RXN for Chemistry platform is accessible through the rxn4chemistry Python package, which provides a rich set of language models that researchers can freely access and use for different chemical reaction tasks. It handles complex tasks such as multi-step retrosynthesis planning and also includes models that are not Transformer-based.

In addition, HuggingMolecules is a library dedicated to aggregating, standardizing, and distributing language models for molecular property prediction. It covers many encoder-only CLMs with geometry- and structure-aware biases (e.g., MAT and its successor R-MAT), as well as pure BERT-based models trained on SMILES (e.g., MolBERT and ChemBERTa).

For data processing, RDKit is the standard tool. rxn-chemutils is a library of chemistry-related utilities from RXN for Chemistry, including SMILES standardization features (e.g., normalization and sanitization) and conversion to other representations (e.g., InChI). It harmonizes reaction SMILES and prepares them for consumption by CLMs, and also includes SMILES augmentation (by traversing molecular graphs in a non-canonical order) and tokenization. Another library with a similar focus is pytoda. It supports different languages (e.g., SELFIES and BigSMILES) and tokenization schemes (e.g., SMILES-PE). Similar functionality is available for proteins, including different languages (IUPAC, UniRep, Blosum62) and protein sequence augmentation strategies. For small molecules, proteins, and polymers, dedicated language classes facilitate integration with LMs, store vocabularies, perform online conversions, and feed custom datasets. Datasets exist for predicting molecular properties, drug sensitivity, and protein-ligand affinity, and for self-supervision on small molecules, proteins, and polymers.

The Future of Molecular Discovery

Until a few years ago, the idea of using AI models for scientific knowledge extraction and computational analysis was an ambitious dream, much as a search engine once was. At the core of scientific thinking is the ability to reason, and the day when AI reasons as well as humans has yet to come. But AI can learn and mimic human behavior: large-scale language models like ChatGPT and GitHub Copilot are trained on the vast amounts of data we document. Applied to computational science, this lets non-experts confidently perform computational analysis using well-designed prompts, and lets scientists provide feedback to the model and optimize it. Scientific exploration thus becomes easier for people without scientific backgrounds, who can conduct scientific analysis without specialized training. This development opens the door to a new revolution in molecular discovery. In the future, a chatbot-like interface will handle all computational processes, supporting the entire molecular discovery workflow from the design idea through synthesis planning, material procurement, regular safety checks, and experimental validation.

Traditionally, neural networks trained for a specific task required the development of new models for each new task. Recent advances in large-scale language models, however, are fundamentally changing this approach. "Foundation models" are now capable of performing multiple tasks after training on huge datasets. This has opened up new research directions in natural language processing, such as prompt engineering and in-context learning.

Foundation models are also being introduced in the field of chemistry. Task-specific models that combine natural and chemical languages are being developed, while multi-task models that combine property prediction, reaction prediction, and molecule generation are also emerging. These models have shown superior performance over conventional models by supporting the entire process from natural text to the discovery of new molecules, the proposal of synthetic pathways, and the execution of actual synthetic protocols.

These advances have contributed significantly to the acceleration of scientific inquiry and technological innovation. In the field of molecular discovery, we can expect great advances in the future.

Integration of chemical tools and chatbots

Given the powerful versatility of large language models, it is a natural progression to build chatbot interfaces around them; many similar tools have emerged, such as ChatGPT. These tools have shown excellent performance on simple chemistry tasks and allow chemists to work interactively with chemical data to tackle chemistry tasks. In addition, models developed by computer scientists for drug discovery and materials science are also available through large-scale language models. This allows experts who do not have the programming skills necessary to use these AI models to easily access the latest technologies.

The convenience of such chatbots can be realized by integrating them with existing chemistry software tools such as PubChem, RDKit, and GT4SD. These applications can enhance the use of these models and maximize their potential and value. The figure below shows an example of the use of various chemistry tools through the ChemChat chat interface built in this paper.

In this example, the user first provides a molecular structure and asks to identify the molecule. The user's input is sent to a large-scale language model, and once the model determines that a supported tool such as PubChem can answer the question, the chatbot sends a request to the PubChem API, which returns a brief description of the molecule. The user then asks for the logP partition coefficient and the quantitative estimate of drug-likeness (QED). These properties are calculated via the GT4SD tools, and the results are returned by the chatbot.
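The routing logic in this exchange, where the language model decides which external tool can answer and the chatbot dispatches the request, can be sketched as a simple intent-based dispatcher. The tool functions and return values below are hypothetical stand-ins; the paper's ChemChat wires the model to real PubChem and GT4SD endpoints:

```python
def lookup_molecule(query):
    """Hypothetical stand-in for a PubChem lookup."""
    return {"name": "theobromine", "formula": "C7H8N4O2"}

def compute_properties(query):
    """Hypothetical stand-in for a GT4SD property endpoint.
    Dummy illustrative numbers, not real predictions."""
    return {"logP": 0.5, "QED": 0.6}

TOOLS = {
    "identify": lookup_molecule,
    "properties": compute_properties,
}

def route(intent, payload):
    """The LLM classifies the user's message into an intent; the chatbot
    then calls the matching tool and phrases the result as an answer."""
    tool = TOOLS.get(intent)
    if tool is None:
        return {"error": f"no tool for intent '{intent}'"}
    return tool(payload)

print(route("identify", "theobromine"))
print(route("synthesize", "theobromine"))  # unsupported intent
```

In a production system the intent classification itself is done by the language model, and the tool results are passed back through the model so the final answer reads as natural language rather than raw JSON.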

The combination of existing tools and large-scale language models creates a chatbot assistant for materials science and data visualization that can perform simple programming tasks without requiring the user to know programming or access computing resources.

The chat exchange then asks for three molecules that are similar to theobromine, the molecule identified earlier, and have a logP of about 0.5. When ChemChat returns three candidate SMILES in response, the text results are visualized after some post-processing.

Chatbots are highly useful, as seen in the rapid expansion of ChatGPT. By leveraging large-scale language models, complex chemical information processing can be easily performed. The synergy between existing chemistry tools and natural language capabilities has the potential to revolutionize the way chemistry is researched and experimented with.


In the field of chemistry, even with the development of research into functional molecules and biochemistry, molecular design still requires a great deal of time and high expense due to its complexity. However, the use of large-scale language models has opened up the possibility of dramatically increasing the speed of molecular discovery. These language models have the potential to learn chemical structures and explore the chemical search space efficiently by interpreting and generating text.

The combination of molecular generative and property prediction models, scientific language modeling, and state-of-the-art software tools can drive this acceleration of molecular discovery. Text-based molecular representations such as SMILES, SELFIES, and InChI provide the basis on which models are trained, and conditional generative modeling enables the creation of new molecules.

In addition, the development of open source software, in the form of Python packages and web applications, provides tools to boost molecular discovery. Predicting chemical reactions, identifying synthetic routes, and integrating large-scale language models with chemical tools are key elements shaping the future of molecular discovery.

Such advances are expected to accelerate scientific inquiry and innovation in areas that are deeply relevant to people's lives, including drug discovery. The emergence of chatbot interfaces with large-scale language models at their core, seamlessly integrated with cheminformatics software to facilitate scientific analysis by anyone without specialized knowledge, portends a new revolution in molecular design and discovery.
