[DrugLLM] Few-Shot Molecule Generation Using a Large-scale Language Model
3 main points
✔️ First few-shot molecular optimization using a large language model, generating new molecules from a small number of modification examples
✔️ Uses Group-based Molecular Representation (GMR) to represent molecules, overcoming the challenges of SMILES and achieving efficient molecule generation
✔️ Demonstrates performance that significantly outperforms existing molecular generation models, while also noting areas for improvement, such as hardware limitations and the early stage of zero-shot optimization
DrugLLM: Open Large Language Model for Few-shot Molecule Generation
written by Xianggen Liu, Yan Guo, Haoran Li, Jin Liu, Shudong Huang, Bowen Ke, Jiancheng Lv
(Submitted on 7 May 2024)
Comments: 17 pages, 3 figures
Subjects: Biomolecules (q-bio.BM); Computation and Language (cs.CL); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Small molecules play a very important role in the field of drug discovery due to their ability to bind to specific biological targets and modulate their function. According to U.S. Food and Drug Administration (FDA) approval records for the past decade, small molecules account for 76% of all drugs approved for the market. Small molecules are relatively easy to synthesize and have good bioavailability, making it easy for them to reach their intended targets. However, designing molecules with ideal properties is very difficult and consumes a lot of resources and time. For example, finding an effective drug requires a development process of 9 to 12 years and billions of dollars.
The scope of the search for new molecules is vast: there are estimated to be up to 10^60 synthesizable drug-like molecules. This makes it a major challenge for chemists to identify molecules that interact with biological targets. Modern technology allows us to test more than 10^6 molecules in the laboratory, but larger-scale experiments are prohibitively expensive and impractical. Therefore, computational tools are needed to narrow the scope of the search.
Virtual screening is one such tool, helping to identify promising molecules among millions of existing and virtual molecules. However, high-throughput screening and virtual screening cannot generate new molecules, because they only target known, synthesizable molecules.
As an alternative to exploring this vast candidate pool of molecules, de novo design demonstrates the ability to generate completely novel and unique molecules. Traditional de novo design generates new molecules based on receptor and ligand structures, but recently deep learning and reinforcement learning have shown promise. In particular, methods such as integrated generative and predictive neural networks are being used to generate new molecules.
While these new technologies are advancing, few-shot molecule generation has not yet been fully investigated. Few-shot molecular generation aims to produce new molecules with the expected properties from a limited sample of molecules.
Many current de novo designs require thousands of data points for training, whereas in drug discovery data is typically scarce. Therefore, the ability to generate molecules from only a few examples is critical to the advancement of de novo design technology.
Large-scale language models have made great strides in natural language processing, especially on few-shot learning problems, but there are still challenges when dealing with the language of biology and chemistry. Therefore, in this paper, we propose DrugLLM, a large-scale language model for drug discovery.
DrugLLM uses Group-based Molecular Representation (GMR) to represent molecules and solve problems inherent in SMILES. GMR uses structural groups to build the topological structure of a molecule and converts it into a linear sequence. It also organizes modification sequences according to specific molecular properties. By continuously predicting the next molecule based on its modification history, DrugLLM can learn the relationship between molecular structure and properties.
Data Collection and Preparation
To train and analyze DrugLLM, we build a large dataset using the ZINC and ChEMBL databases. ZINC is a free database containing over 2.3 million compounds available for purchase in 3D formats ready for docking. From this database, we filter drug-like molecules (molecules with the physicochemical properties and structural features commonly found in pharmaceuticals) to obtain 4.5 million molecules. ChEMBL is a comprehensive repository of bioactive compound properties; bioactivity data are collected with its web resource client, following the preprocessing methods of Stanley et al. (2021), excluding non-drug-like compounds and applying standard cleaning and canonicalization. In addition, all molecules are represented by SMILES strings and labeled with specific properties.
To facilitate comparison of properties, only property categories represented by real values are considered. This yields a large dataset consisting of thousands of tables, each containing hundreds of molecules measured on the same property. The collected data are then converted into meaningful text sentences and paragraphs.
For example, a modification between two molecules with similar structures is considered a single sentence, and multiple cases of modification are considered a single paragraph. Modifications within the same paragraph are then assumed to describe the same property change. If the first two modification cases describe increased solubility, then all other sentences in this paragraph are also assumed to be about increased solubility.
This grouping is achieved using a heuristic algorithm. First, for the set of molecules sharing a property, molecules are clustered around randomly selected cluster centers. If the similarity between a molecule and a center is greater than 0.6, the molecule is assigned to that center. The number of cluster centers is dynamically increased until all molecules in the set have been classified.
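The following is a minimal sketch of such a clustering heuristic, assuming RDKit Morgan fingerprints with Tanimoto similarity as the similarity measure; the article only specifies the 0.6 threshold, so the fingerprint choice and the function names are illustrative assumptions.

```python
import random
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """ECFP4-style Morgan fingerprint for one SMILES string (assumed measure)."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

def cluster_molecules(smiles_list, threshold=0.6):
    """Assign each molecule to the first center with similarity > threshold;
    add new centers dynamically until every molecule is classified."""
    fps = {s: fingerprint(s) for s in smiles_list}
    centers, clusters = [], {}
    unassigned = list(smiles_list)
    random.shuffle(unassigned)  # centers emerge from a random order
    for smi in unassigned:
        for center in centers:
            if DataStructs.TanimotoSimilarity(fps[smi], fps[center]) > threshold:
                clusters[center].append(smi)
                break
        else:
            centers.append(smi)   # molecule becomes a new cluster center
            clusters[smi] = [smi]
    return clusters
```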
In addition to molecular modifications for single properties, combinations of multiple properties are also considered. These relate primarily to simple molecular properties that can be computed with Python scripts; for example, LogP, topological polar surface area (TPSA), and combinations thereof are included in the training set. In total, over 25 million modification paragraphs and 200 million molecules are collected to build the training dataset. The dataset contains more than 10,000 different molecular properties, activities, and compositions; in addition to the SMILES molecules, each paragraph also includes an additional property-optimization description to relate the meaning of the properties to the molecular structure.
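As a hedged illustration of such Python-computable properties, the snippet below uses standard RDKit descriptors; the paper does not name its exact scripts, and the aspirin example is purely illustrative.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def simple_properties(smiles: str) -> dict:
    """Compute simple, script-computable properties for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "LogP": Crippen.MolLogP(mol),           # water-octanol partition coefficient
        "TPSA": Descriptors.TPSA(mol),          # topological polar surface area
        "HBA": Descriptors.NumHAcceptors(mol),  # hydrogen bond acceptors
    }

print(simple_properties("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, for illustration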
Group-Based Molecular Representation (GMR)
For molecular representation, we use a framework called Group-based Molecular Representation (GMR). It is intended to improve the interpretability of molecular information by decomposing molecules into structural groups and recording their connection information, so that SMILES strings can be reconstructed on a group basis.
First, the ChEMBL database is used to collect molecular data. Then, using the SMILES representation, we extract information about the ring structures in each molecule and merge intersecting rings to identify specific structural groups. For non-ring portions, all C-C bonds are cleaved and the remaining molecular fragments are treated as independent structural groups. This allows the creation of a comprehensive dictionary that assigns a unique string identifier to each group.
Next, the SMILES strings of individual molecules are split into multiple structural units. Using a breadth-first search algorithm, we check whether the molecule remains connected after each structural group is removed, and record the two atoms at the connection points as features. SMILES normalization is performed on each structural group and on the molecular fragments after splitting, and the corresponding dictionary strings are merged into an encoding string. This is repeated to generate the final molecular encoding.
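To make the decomposition step concrete, here is an illustrative Python sketch under stated assumptions: intersecting rings are merged into ring systems and non-ring C-C bonds are cleaved with RDKit. The function names are hypothetical, and the paper's exact algorithm (including its connection-point bookkeeping) may differ.

```python
from rdkit import Chem

def ring_systems(mol):
    """Merge rings that share atoms into single ring-system groups."""
    systems = []
    for ring in mol.GetRingInfo().AtomRings():
        ring = set(ring)
        overlapping = [s for s in systems if s & ring]
        for s in overlapping:
            systems.remove(s)
            ring |= s
        systems.append(ring)
    return systems

def structural_groups(smiles):
    """Cleave every non-ring C-C bond; the resulting fragments
    approximate the structural groups described in the text."""
    mol = Chem.MolFromSmiles(smiles)
    cuts = [b.GetIdx() for b in mol.GetBonds()
            if not b.IsInRing()
            and b.GetBeginAtom().GetSymbol() == "C"
            and b.GetEndAtom().GetSymbol() == "C"]
    if cuts:
        mol = Chem.FragmentOnBonds(mol, cuts, addDummies=True)
    return Chem.MolToSmiles(mol).split(".")  # one SMILES per fragment
```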
Based on the encoded molecular fragments, each structural group is recombined in the correct position by referring to the recorded connection information. This process is repeated until all structural groups are correctly joined, and finally the original molecular SMILES is decoded. This design ensures the integrity and reversibility of the molecular information.
This effectively manages the detailed structural information of the molecule and improves the accuracy of the analysis.
Experiments and Results
The focus of this paper is on training large-scale language models that can capture the relationship between molecular structures and their corresponding chemical and biological activities. DrugGPT uses SMILES as its molecular representation, while DrugLLM uses group-based molecular representation (GMR) as its primary language representation. GMR overcomes the three main challenges of SMILES notation by using structural groups to represent molecular structures.
The first is the large number of tokens: in SMILES format, each character is treated as a separate token, so the token count is huge and consumes significant computational resources during training. The second is the complexity of assembling valid molecules: small generation errors in SMILES syntax easily produce invalid strings. The third is structural sensitivity: even small changes in the structure of a molecule can result in large differences in the corresponding SMILES representation.
As shown in the figure below, the GMR framework uses unique string identifiers to represent different structural groups, and these identifiers are linked by numeric position data enclosed in slashes. Using GMR, the model can recognize molecular strings in units of structural groups, thereby reducing the input and output tokens. GMR also simplifies the molecular assembly logic by merging and removing cyclic structures, reducing the difficulty of model recognition. In addition, it minimizes differences due to slight structural changes in the SMILES string.
To train DrugLLM, we construct sentences and paragraphs consisting of molecular modifications as training data, as shown in the figure below. Specifically, DrugLLM treats a modification between two molecules with similar structures as a sentence, and a series of such modifications as a paragraph. Molecular modifications within a paragraph must characterize the same property; for example, if the first three modification samples describe an increase in the number of hydrogen bond acceptors, subsequent sentences in that paragraph are also expected to describe an increase in acceptor count. In this way, the content of each paragraph stays focused, and DrugLLM can predict the next token autoregressively based on the previous context. Furthermore, since the corpus covers a variety of molecular properties and each paragraph addresses its own property, DrugLLM must be capable of in-context learning.
However, relevant datasets are rarely available. In this paper, we collect tabular molecular datasets from the ZINC database and the ChEMBL platform and convert them into the corresponding sentences and paragraphs. In total, over 25 million modification paragraphs and 200 million molecules are collected as the training dataset.
The dataset contains more than 10,000 different molecular properties and activities, including the number of hydrogen bond acceptors and topological polar surface area (TPSA). Following the pre-training recipe of state-of-the-art large-scale language models, DrugLLM uses the Transformer architecture. It adopts the LLaMA 7B parameterization and expands the vocabulary by introducing frequently used SMILES tokens, which are split with byte pair encoding (Sennrich et al., 2016). DrugLLM is trained on eight NVIDIA RTX 3090 GPUs for six weeks using the AdamW optimizer. From a machine learning perspective, the paragraphs function as few-shot molecule generation processes, so the trained DrugLLM can perform few-shot molecule generation without further fine-tuning.
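The vocabulary-expansion step can be sketched with the Hugging Face API roughly as follows; the checkpoint name and the token list are placeholders, not the paper's actual configuration or vocabulary.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder checkpoint; the paper builds on LLaMA 7B.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Illustrative frequent SMILES/GMR tokens; the real list would come from
# byte pair encoding statistics over the molecular corpus.
frequent_tokens = ["c1ccccc1", "C(=O)O", "N(C)C"]
num_added = tokenizer.add_tokens(frequent_tokens)

# Give the new token ids trainable embedding vectors.
model.resize_token_embeddings(len(tokenizer))
```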
DrugLLM is a model that uses few-shot learning to optimize physicochemical properties. As shown in the figure below, K-shot learning provides the model with K pairs of modification examples and a benchmark molecule. The goal is to generate new molecules with improved properties based on the modification samples while maintaining structural similarity to the benchmark molecule. Due to input token limitations, the number of example molecular optimizations is limited to a maximum of 9 pairs.
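A K-shot query can then be laid out as modification pairs followed by the benchmark molecule for the model to complete. The template below is an assumption for illustration; the actual prompt format used by DrugLLM is not shown in the article.

```python
def build_kshot_prompt(pairs, benchmark, k=5):
    """pairs: (before, after) molecule strings showing the same property
    change; benchmark: the molecule to be optimized. K is capped at 9,
    matching the input-token limit described above."""
    assert k <= 9, "the paper limits examples to at most 9 pairs"
    lines = [f"{a} -> {b}" for a, b in pairs[:k]]
    lines.append(f"{benchmark} -> ")  # the model completes the optimized molecule
    return "\n".join(lines)
```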
To visualize the structural similarity between the generated and benchmark molecules, a chart is created using UMAP (Uniform Manifold Approximation and Projection). The distributions of the generated molecules (left side) and the original molecules (right side) match, and this distributional similarity, together with the marked improvement in the LogP of the generated molecules, indicates the high performance of the model.
To evaluate DrugLLM's few-shot molecule generation ability, four physicochemical properties are chosen as test tasks: LogP (water-octanol partition coefficient), solubility, synthetic accessibility, and topological polar surface area (TPSA). These properties can be accurately estimated with machine learning-based scripts and are widely used to evaluate molecular generation models.
For comparison, we use a junction tree variational autoencoder (JTVAE), a variational junction tree neural network (VJTNN), and a scaffold-based molecule generator (MoLeR). We also include a random-generation control based on the latent space of JTVAE. The quality of the generated molecules is evaluated based on success rate and molecular similarity; the success rate is the percentage of generated molecules that follow the modification rule demonstrated by the examples. To avoid generation bias, the input context (the language model prompt) describes a balanced mix of property increases and decreases.
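The success-rate metric can be expressed compactly as below; `prop_fn` stands in for the paper's machine learning-based property estimators and is an assumed interface.

```python
def success_rate(benchmarks, generated, prop_fn, increase=True):
    """Fraction of generated molecules whose estimated property moves
    in the direction the few-shot examples demand."""
    successes = 0
    for bench, gen in zip(benchmarks, generated):
        delta = prop_fn(gen) - prop_fn(bench)
        successes += (delta > 0) if increase else (delta < 0)
    return successes / len(benchmarks)
```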
The figure below shows the distribution of several key properties (LogP, solubility, synthetic accessibility, and TPSA) for the original and generated data. These distributions are visualized using kernel density estimation (KDE). This provides further evidence of the model's validity.
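Such a KDE comparison can be reproduced with seaborn roughly as follows; `orig_vals` and `gen_vals` are placeholder arrays of property values, not the paper's data.

```python
import seaborn as sns
import matplotlib.pyplot as plt

def plot_property_kde(orig_vals, gen_vals, name="LogP"):
    """Overlay kernel density estimates for original vs. generated molecules."""
    sns.kdeplot(orig_vals, label="original")
    sns.kdeplot(gen_vals, label="generated")
    plt.xlabel(name)
    plt.legend()
    plt.show()
```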
We also report few-shot generation performance with respect to LogP values, as shown in the figure below: the three baseline molecular generation models, JTVAE, VJTNN, and MoLeR, achieved success rates of about 50%, similar to random generation. DrugLLM, by contrast, showed progressive improvement in few-shot molecule generation, with success rates reaching about 75% as the number of shots increased. Performance comparisons for molecular solubility, synthetic accessibility, and TPSA were similarly consistent.
While it is usually difficult to optimize molecules with small modifications (high similarity), DrugLLM maintains a high success rate even as the required similarity increases, demonstrating its superior few-shot generation performance. Furthermore, DrugLLM-GMR slightly outperforms DrugLLM-SMILES, demonstrating the advantages of GMR in training large models.
Furthermore, as noted above, since DrugLLM has demonstrated few-shot generation with excellent physicochemical properties, we next test its efficacy on the biological activity of molecules. Biological activity is an even more complex and difficult challenge than physicochemical properties; the molecules produced by DrugLLM are usually novel and not recorded in the ChEMBL database. Unlike physicochemical properties, biological activity is difficult to estimate from chemical or physical rules, and the significant time and expense of laboratory experiments make large-scale molecular evaluation difficult. Therefore, this paper uses message-passing neural networks to predict biological activity.
Before building the DrugLLM dataset from the ChEMBL database, all biological activities were scanned, and those with a relatively sufficient number of samples (N ≥ 800) and accurate property prediction (Pearson correlation coefficient r ≥ 0.75) were selected. Ten activities were ultimately chosen, and these were excluded from the training data. Since the Pearson correlation of the predictive model exceeds 0.75, its predictions correlate well with the actual measurements.
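The selection criteria translate into a simple filter; the DataFrame columns here (`n_samples`, `pearson_r`) are hypothetical names for the scanned statistics.

```python
import pandas as pd

def select_activities(df: pd.DataFrame) -> pd.DataFrame:
    """Keep activities with enough samples and a reliable predictor."""
    mask = (df["n_samples"] >= 800) & (df["pearson_r"] >= 0.75)
    return df[mask]
```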
As shown in the table below, the three generation baselines do not achieve meaningful improvements compared to random generation. This indicates that these molecular generation models do not successfully capture modification rules based on limited samples.
In contrast, DrugLLM performs significantly better than the other baselines on most test properties. In particular, DrugLLM generates appropriate molecules that bind to Rho-associated protein kinase 1 (ROCK1) with a 76% success rate. These test properties were not observed during DrugLLM training. The results demonstrate DrugLLM's ability to extract the inherent rules of molecular modification for unknown molecular properties from a limited number of examples.
Summary
In this paper, we address a new computational task: few-shot molecular optimization. This task generates new molecules from a small number of modification samples based on a molecule of interest. Various few-shot learning tasks have been proposed, but there has been little research on few-shot molecule generation. Few-shot molecular optimization requires the model to learn abstract rules from a small number of samples and apply them to new molecules. Current methods, including ChatGPT and other molecular generation models, do not accomplish this task well, but the DrugLLM proposed in this paper shows excellent performance.
DrugLLM is a large-scale language model built on a large amount of small-molecule and biological data. Recent large-scale language models such as ChatGPT, Alpaca, and ChatGLM have excellent capabilities in general natural language generation but lack knowledge of biology and pharmacology. There are also large-scale language models specialized for biology and medicine, but these still use traditional learning strategies and do not address how to understand the language of biology and chemistry or how to perform few-shot learning. In this paper, DrugLLM uses GMR to propose a new method of iterative, context-sensitive molecular modification.
However, there are several limitations to this approach. First, DrugLLM supports a maximum of only nine shots of molecular modification due to hardware limitations. DrugLLM's zero-shot molecular optimization is also still in its early stages and needs improvement; currently, DrugLLM can only optimize molecules based on two known molecular properties. In addition, the current GMR has difficulty representing complex molecules in certain situations and lacks a standardization method.
The authors claim that DrugLLM is the first large-scale language model for few-shot molecular generation and optimization. A large text corpus is constructed from data on molecular properties and biological activities to train DrugLLM in an autoregressive fashion. DrugLLM's superior performance suggests that it has great potential as a powerful computational tool in drug molecule discovery.