Developed a Chemical LLM "LlaSMol" with the Large Dataset SMolInstruct
3 main points
✔️ Built SMolInstruct, a large, high-quality instruction tuning dataset
✔️ Developed LlaSMol, large-scale language models for chemistry tasks using SMolInstruct, and demonstrated their superior performance
✔️ Presented limitations and future research directions in the evaluation of the molecule captioning and molecule generation tasks
LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset
written by Botao Yu, Frazier N. Baker, Ziqi Chen, Xia Ning, Huan Sun
(Submitted on 14 Feb 2024 (v1))
Comments: Accepted by COLM 2024
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Chemistry is a fundamental science that underpins many aspects of modern life, including drug discovery, materials science, and energy production. To facilitate research and applications in this field, deep learning models such as graph neural networks and transformer models have been applied to various chemical tasks such as reaction prediction, retrosynthesis, and property prediction. However, these are often task-specific models that are difficult to adapt to different tasks.
On the other hand, large-scale language models such as GPT-4, the Llama series, and Mistral have emerged as general-purpose base models and have shown tremendous capabilities in natural language processing tasks. However, when applied to chemistry tasks, their capabilities remain limited. For example, one study observed that while GPT-4 outperforms other large-scale language models, it falls short of task-specific deep learning models. In particular, GPT models have been found to perform poorly when accurate understanding of SMILES, a textual representation of molecules, is required.
Furthermore, in addition to directly applying pre-trained large-scale language models, attempts have been made to fine-tune large-scale language models on instruction tuning datasets, but their performance is very poor, not even close to that of state-of-the-art (SoTA) models designed for specific tasks. Given these results, it seems necessary to verify whether large-scale language models can actually perform chemistry tasks effectively or whether there are fundamental limitations.
In this paper, we show that the large-scale language models we developed achieve very high performance on a comprehensive set of chemistry tasks, significantly outperforming the state-of-the-art GPT-4 and Claude 3 Opus. The key to this is the construction of SMolInstruct, a large, comprehensive, high-quality instruction tuning dataset. Based on this dataset, four open-source large-scale language models, Galactica, Llama 2, Code Llama, and Mistral, are fine-tuned with SMolInstruct to build large-scale language models for chemistry called "LlaSMol".
Through comprehensive experiments, we have evaluated these models and have made several interesting findings and suggestions. In particular, the Mistral-based model significantly outperforms the others, showing that the base model has a significant impact on the performance of the chemistry task. In addition, the use of SMILES as a molecular representation ensures the validity of the molecules produced and achieves better performance than using SELFIES.
In addition, we found that the use of standardized SMILES reduces the learning burden and improves performance in model training and applications. While instruction tuning is effective in injecting chemistry task-related knowledge into the model, the dataset plays an important role: training with SMolInstruct achieved significantly better performance than training with a previous dataset, demonstrating the dataset's contribution.
The LlaSMol models still fall short of state-of-the-art models designed specifically for each individual task, but they approach SoTA performance with only 0.58% of their parameters fine-tuned. This result suggests great potential for further improvement and a role as strong foundation models for the chemistry field.
Dataset "SMolInstruct"
Here we present the newly proposed dataset, SMolInstruct, and its construction. SMolInstruct is a large instruction tuning dataset focused on small molecules, containing a total of 14 chemistry tasks.
- Name conversion tasks
- Convert IUPAC name to molecular formula (NC-I2F)
- Convert IUPAC names to SMILES (NC-I2S)
- Convert SMILES to molecular formula (NC-S2F)
- Convert SMILES to IUPAC names (NC-S2I)
These tasks aid in a deeper understanding of molecular structure and representation and form the basis of a large-scale language model of chemistry.
- Property prediction tasks
- PP-ESOL predicts water solubility (Mobley & Guthrie, 2014)
- PP-Lipo predicts the octanol/water partition coefficient (Poole & Poole, 2003)
- PP-BBBP predicts blood-brain barrier permeability (Martins et al., 2012)
- PP-ClinTox predicts human toxicity (Gayvert et al., 2016)
- PP-HIV predicts HIV replication inhibition (Institute, 2004)
- PP-SIDER predicts drug side effects (Kuhn et al., 2015)
These properties are especially important in drug discovery.
- Tasks related to textual description of molecules
- Molecule Captioning (MC) generates a textual description of a given molecule
- Molecule Generation (MG) generates molecules based on a given text description
These tasks require an understanding of the structure and properties of molecules and serve to bridge the gap between natural language and molecules.
- Tasks related to chemical reaction knowledge
- Forward Synthesis (FS) predicts products from reactants and reagents
- Retrosynthesis (RS) predicts reactants from a product
These tasks play an important role in real-world applications. For example, retrosynthesis is integral to synthesis planning, and forward synthesis is used to validate retrosynthesis proposals.
SMolInstruct contains a total of 3.3M samples, each organized as a query-response pair. The query describes the task and task-specific information (e.g., input molecules, textual descriptions, etc.), and the response is a statement containing the answer to the query. For all tasks, SMILES is used as the default representation of molecules, except where the task definition specifies another representation (NC-I2F, NC-I2S, NC-S2F, NC-S2I); SELFIES (Krenn et al., 2019) representations are also provided. SMolInstruct encompasses a wide range of chemistry knowledge and will be an important resource for future research and practical applications.
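As a concrete illustration of this query-response format, the sketch below shows what a forward synthesis sample might look like. The template wording and field names are invented for illustration and are not the paper's actual templates:

```python
# A hypothetical SMolInstruct-style sample for the forward synthesis (FS)
# task: ethanol + acetic acid giving ethyl acetate. The wording is
# illustrative only, not taken from the actual dataset.
sample = {
    "task": "forward_synthesis",
    "query": (
        "What product is formed from the reactants "
        "<SMILES> CCO.CC(=O)O </SMILES> ?"
    ),
    "response": "The product is <SMILES> CCOC(C)=O </SMILES> .",
}
```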
The SMolInstruct dataset is constructed in four steps (data collection, quality control, data splitting, and instruction construction). First, experts are consulted to identify key tasks. Next, the data needed for these tasks are collected from the various sources listed in the table below. Note that "Qry." and "Resp." represent the average length of queries and responses, respectively.
Specifically, the name conversion tasks (NC-I2F, NC-I2S, NC-S2F, and NC-S2I) utilize PubChem (Kim et al., 2019), a comprehensive molecular database. The IUPAC name, SMILES representation, and molecular formula of randomly selected molecules from this database are extracted and reorganized as input-output pairs for the tasks.
The molecular description tasks (MC and MG) use data from ChEBI-20 and Mol-Instructions, both of which contain high-quality molecule-text pair data. The property prediction tasks (PP-ESOL, PP-Lipo, PP-BBBP, PP-ClinTox, PP-HIV, and PP-SIDER) utilize established MoleculeNet datasets, which cover properties important in real-world applications such as drug discovery. For the chemical reaction tasks (FS and RS), reaction data are collected from USPTO-full, an extensive dataset containing over 1M reaction samples extracted from US patents. All of these datasets have been used extensively in previous studies.
To ensure the quality of the dataset, the collected data are rigorously screened. The collected data contain many problematic low-quality samples, which fall into three types:
- Chemically Invalid SMILES
- Some SMILES strings are chemically invalid, deviating from the SMILES grammar or exceeding chemical valence limits. To detect these, the tool RDKit (RDKit, 2023) is used to parse the molecules and flag errors.
- Incorrect or inaccurate information
- Through manual checks, incorrect or inaccurate information recorded in the data is identified and corrected. For example, within the USPTO-full dataset, mislabeled reactants and reagents are corrected by comparing their atom mappings to the products. For the MC and MG tasks, rule sets based on word patterns, length, and keywords are used to filter out textual descriptions that lack relevant information. For PP-SIDER, ambiguously named side effects are filtered out.
- Duplicate samples
- Duplicate samples are detected and removed.
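The paper relies on RDKit for full validity screening (grammar plus valence). As a rough sketch of only the syntactic side of that screening, and emphatically not the paper's actual filter, one can at least check balanced parentheses and paired ring-closure digits, which are necessary but not sufficient conditions for a valid SMILES string:

```python
def plausible_smiles(s: str) -> bool:
    # Minimal syntactic sanity checks: balanced parentheses and paired
    # ring-closure digits. Necessary but NOT sufficient for chemical
    # validity; RDKit's full parsing and valence checks are not replicated.
    depth = 0
    ring_counts = {}
    i = 0
    while i < len(s):
        c = s[i]
        if c == "[":
            # Skip bracket atoms: digits inside (isotopes, charges)
            # are not ring-bond labels.
            j = s.find("]", i)
            if j == -1:
                return False
            i = j
        elif c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:
                return False
        elif c.isdigit():
            ring_counts[c] = ring_counts.get(c, 0) + 1
        i += 1
    return depth == 0 and all(n % 2 == 0 for n in ring_counts.values())
```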
In addition, splitting a multi-task dataset requires careful handling to prevent data leakage between tasks. For example, since FS and RS are inverse tasks, if an FS sample in the training set and an RS sample in the test set describe the same chemical reaction, data leakage could occur and bias the evaluation. Therefore, sample pairs corresponding to the same molecule/reaction across related tasks (FS and RS, MC and MG, and the four NC tasks) are identified and placed together in either the training set or the evaluation set.
There are also samples that share the same input but have different outputs. For example, in the RS task, the same product (same input) may be synthesized from multiple sets of reactants (different outputs). If such samples appear in both the training and test sets, the results may be inflated. Therefore, samples with the same input are placed together, either all inside or all outside the test set.
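The same-input grouping described above can be sketched as follows; the field names and split ratio are assumptions for illustration, not the paper's implementation:

```python
import random

def split_by_input(samples, test_frac=0.1, seed=0):
    # Group samples that share an input so they land in the same split,
    # avoiding train/test leakage for one-to-many tasks such as
    # retrosynthesis (same product, multiple reactant sets).
    groups = {}
    for s in samples:
        groups.setdefault(s["input"], []).append(s)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, int(len(keys) * test_frac))
    test = [s for k in keys[:n_test] for s in groups[k]]
    train = [s for k in keys[n_test:] for s in groups[k]]
    return train, test
```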
To enable a fair comparison with Mol-Instructions (Fang et al., 2023), for tasks shared between both datasets (MC, MG, FS, and RS), its training data are excluded from our test set so that their model can be evaluated directly. After applying these restrictions, samples are randomly split into training, validation, and test sets. However, samples for the PP tasks are split by scaffold following standard practice (Wu et al., 2018).
In addition, to create query-response text pairs for instruction tuning, templates containing queries and corresponding responses were manually created and then paraphrased using GPT-4. All SMILES representations were also standardized, and the data format was unified.
Because the dataset contains many types of sequences (SMILES strings, molecular formulas, numbers, etc.) in addition to natural language text, special tags are used to encapsulate the corresponding segments (e.g., <SMILES>...</SMILES>, <MOLFORMULA>...</MOLFORMULA>, <NUMBER>...</NUMBER>). This design explicitly communicates information types to the model and facilitates answer extraction during evaluation. Note that the figure below shows the statistical distribution of molecules in SMolInstruct.
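With these tags in place, answer extraction at evaluation time reduces to pulling the content out of its tag pair; a minimal sketch:

```python
import re

def extract_tagged(text, tag="SMILES"):
    # Return the content of the first <TAG>...</TAG> span, or None
    # if the model's output contains no such span.
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None
```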
Experiment Summary
Using the "SMolInstruct" dataset proposed in this paper, we created large-scale language models capable of performing chemistry tasks by fine-tuning base models. We named these models "LlaSMol" (large-scale language models for small molecules). The following four large-scale language models are used as base models:
- Galactica 6.7B (Taylor et al., 2022): trained for scientific applications and already exposed to chemistry-related data
- Llama 2 (Touvron et al., 2023b) 7B: A generic large-scale language model
- Code Llama (Roziere et al., 2023) 7B: Based on Llama 2 and further trained on code
- Mistral (Jiang et al., 2023) 7B: A generic large-scale language model
These models were subjected to instruction tuning using the SMolInstruct dataset, and the resulting models are LlaSMolGalactica, LlaSMolLlama 2, LlaSMolCode Llama, and LlaSMolMistral, respectively.
We also compare the LlaSMol models to two groups of models. The first consists of large-scale language models that have not been fine-tuned on SMolInstruct: the four base models (Galactica, Llama 2, Code Llama, and Mistral), plus the current state-of-the-art GPT-4 (OpenAI, 2023) and the latest Claude 3 Opus (Anthropic, 2024). Llama 2, Code Llama, and Mistral are evaluated in a one-shot setting, while GPT-4 and Claude 3 Opus are evaluated in a zero-shot setting. We also compare with Molinst (Fang et al., 2023) and ChemLLM (Zhang et al., 2024), which are tailored specifically for chemistry tasks.
The second group consists of SoTA task-specific models. For NC-I2S and NC-S2I, we compare with STOUT (Rajan et al., 2021), trained on SMILES-IUPAC name pair data. For NC-S2F, we implement a program with RDKit (RDKit, 2023) and report its results. For NC-I2F, we build a STOUT+RDKit baseline that combines STOUT and RDKit. For the PP tasks, we compare with Uni-Mol (Zhou et al., 2023), which incorporates 3D molecular representations and follows a pre-training and fine-tuning paradigm.
For MC and MG, we compare with MolT5 (Edwards et al., 2022) using its released checkpoints. For FS and RS, we retrain RSMILES (Zhong et al., 2022) and Molecular Transformer (Schwaller et al., 2019), transformer encoder-decoder models (Vaswani et al., 2017) adapted to the two tasks.
The following evaluation metrics, commonly used in previous studies, are employed:
- Exact Match (EM): percentage of predictions that exactly match the gold answer
- Fingerprint-based Tanimoto Similarity (FTS): quantifies structural similarity between molecules using the Tanimoto similarity of Morgan fingerprints
- METEOR score: a comprehensive text-based metric for MC that considers both exact matches and semantic similarity
- Root Mean Square Error (RMSE): square root of the mean squared error between predicted and actual values for PP-ESOL and PP-Lipo
- Accuracy (Acc): percentage of correct predictions for binary classification tasks (PP-BBBP, PP-ClinTox, PP-HIV, and PP-SIDER)
- Validity (Valid): percentage of predictions that follow the SMILES grammar and chemical valence rules in tasks with SMILES output (NC-I2S, MG, FS, and RS)
Experimental Results
Here are the main experimental results. Among all large-scale language models, the LlaSMol models show the best performance, demonstrating the effectiveness of the proposed SMolInstruct dataset and fine-tuning. In particular, compared to the base models (Galactica, Llama 2, Code Llama, and Mistral), the LlaSMol models perform significantly better, showing SMolInstruct's effectiveness in improving understanding of molecular representations and task-related knowledge. In addition, LlaSMol significantly outperforms GPT-4 on all tasks and even exceeds Claude 3 Opus on most tasks. It also outperforms the two other chemistry large-scale language models trained on chemistry-oriented data (ChemLLM and Molinst). In particular, it outperforms Molinst, which uses the same base model and LoRA settings, on the shared training tasks (MC, MG, FS, and RS).
The four LlaSMol models also show significant differences in performance, underscoring the strong influence of the base model on downstream tasks. Despite sharing the same training and inference settings and comparable model sizes, LlaSMolMistral consistently and significantly outperforms LlaSMolLlama 2, demonstrating Mistral's potential in chemistry tasks. LlaSMolCode Llama also outperforms LlaSMolLlama 2 on most tasks, suggesting a synergy between Code Llama's programming language knowledge and molecular representations. In addition, LlaSMolGalactica outperforms LlaSMolLlama 2 and LlaSMolCode Llama in many cases, demonstrating the benefit of pre-training on chemistry-related documents.
The LlaSMol models do not yet outperform the SoTA models, but they show potential for further improvement. Specifically, LlaSMolMistral outperforms the SoTA models on PP-ClinTox and PP-SIDER, but not yet on the other tasks. However, compared to previous efforts (Fang et al., 2023; Zhang et al., 2024), LlaSMol significantly closes the performance gap between LLMs and SoTA task-specific models.
Notably, LlaSMolMistral achieves this performance by fine-tuning only a small fraction of its parameters (41.9M, or 0.58% of the total). Since increasing the number of trainable parameters could significantly improve performance, LlaSMolMistral has the potential to outperform task-specific models through more extensive fine-tuning and to serve as a strong base model for chemistry applications.
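As a quick sanity check of the 0.58% figure, assuming Mistral 7B's total of roughly 7.24B parameters (an assumption; the article itself states only the 41.9M and 0.58% numbers):

```python
# Back-of-the-envelope check of the fine-tuned parameter fraction.
# The 7.24e9 total is an assumed value for Mistral 7B, not from the article.
trainable = 41.9e6
total = 7.24e9
fraction_pct = 100 * trainable / total
print(round(fraction_pct, 2))  # ≈ 0.58
```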
These are the main performance results of the different models on SMolInstruct. The detailed comparison and discussion of each model confirm the effectiveness of the dataset and fine-tuning proposed in this paper.
Summary
Large-scale language models (LLMs) have shown potential as versatile assistants, but their performance in chemistry-related tasks remains poor. To address this problem, this paper introduces a large, comprehensive, and high-quality instruction tuning dataset called "SMolInstruct". This dataset consists of 14 tasks that are highly relevant to real-world applications and contains over 3 million carefully selected samples.
SMolInstruct was used to develop LlaSMol, a large-scale language model for performing chemical tasks. Experimental results show that LlaSMol outperforms existing large-scale language models, confirming SMolInstruct's important role in improving performance.
However, several limitations are also apparent. First, the evaluation of the molecule captioning (MC) and molecule generation (MG) tasks cannot accurately assess whether generated descriptions are chemically correct or how well the model can generate molecules. This is likely due to the vagueness of molecular descriptions and the limited data available. In addition, the paper does not examine the generalizability of the models beyond the trained tasks, which is an area for future research.
In addition, the models developed in this paper still do not outperform state-of-the-art (SoTA) task-specific models. This may be due to the small fraction of trainable parameters and suboptimal training procedures. Nevertheless, the paper proposes a high-quality instruction tuning dataset and demonstrates its effectiveness. The dataset and models offer valuable insights for future research.
Future research is expected to address these issues and further improve performance.