The Impact Of Large-Scale Language Models On Scientific Discovery: A Preliminary Study Using GPT-4

Large Language Models 05/02/2024

3 main points
✔️ GPT-4 is also becoming a major contributor to scientific discovery activities.
✔️ A wide range of applications of GPT-4 are presented, including drug discovery, biology, computational chemistry, materials design, and partial differential equations. It also introduces techniques for each of these applications.
✔️ This report summarizes the current shortcomings in the use of GPT-4 and outlines its future prospects.

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
written by Microsoft Research AI4Science, Microsoft Azure Quantum
(Submitted on 13 Nov 2023 (v1), last revised 8 Dec 2023 (this version, v2))
Comments: Accepted on arXiv
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Recent breakthroughs in natural language processing have resulted in powerful large-scale language models (LLMs) that have demonstrated remarkable capabilities in a wide range of areas, including natural language understanding, generation, translation, and tasks beyond language processing.

This report delves into the performance of LLMs in the context of scientific discovery and research, focusing on GPT-4, a state-of-the-art language model. The research covers a variety of scientific disciplines, including drug discovery, biology, computational chemistry (density functional theory (DFT) and molecular dynamics (MD)), materials design, and partial differential equations (PDEs).

Evaluating GPT-4 on scientific tasks is critical to uncover its potential in various research areas, validate domain-specific expertise, accelerate scientific progress, optimize resource allocation, guide future model development, and facilitate interdisciplinary research. The research methodology of this paper consists primarily of expert-led case evaluations, which provide qualitative insights into the model's understanding of complex scientific concepts and relationships.

Preliminary investigations by the authors indicate that GPT-4 shows promising potential for a variety of scientific applications and is well suited to handle complex problem solving and knowledge integration tasks. The authors present an analysis of GPT-4's performance in the aforementioned domains (e.g., drug discovery, biology, computational chemistry, and materials design) and highlight its strengths and limitations. Broadly speaking, the authors assess GPT-4's knowledge base, scientific understanding, scientific numerical capabilities, and various scientific prediction capabilities.

In the fields of biology and materials design, GPT-4 has extensive domain knowledge to address specific requirements. In other areas, such as drug discovery, GPT-4 offers strong property prediction capabilities. However, in research areas such as computational chemistry and PDE, while GPT-4 shows promise in assisting researchers with predictions and calculations, additional efforts are needed to improve its accuracy.

Despite its impressive capabilities, GPT-4 could be improved for quantitative computational tasks; for example, fine-tuning is needed to achieve better accuracy. It should be emphasized that the field of LLMs and large-scale machine learning is advancing rapidly, and future generations of this technology may have capabilities beyond those discussed in this report. In particular, the integration of LLMs with specialized scientific tools and models, along with the development of foundational scientific models, represents two promising avenues of exploration.

Introduction

The rapid development of artificial intelligence (AI) has led to the emergence of sophisticated large-scale language models (LLMs) such as OpenAI's GPT-4, Google's PaLM 2, Anthropic's Claude, and Meta's LLaMA 2. LLMs can transform the way we generate and process information across a variety of domains, and have demonstrated outstanding performance in tasks as diverse as abstraction, comprehension, visualization, coding, mathematics, law, and understanding human motivation and emotion.

LLMs have been successfully integrated not only into the text domain, but also into other domains such as image processing, speech recognition, and even reinforcement learning, demonstrating their adaptability and broad application potential. In addition, LLMs have been used as controllers/orchestrators to coordinate other machine learning models for complex tasks. Among these LLMs, GPT-4 has garnered significant attention for its remarkable capabilities.

Recent papers have even suggested that GPT-4 may show early signs of artificial general intelligence (AGI), and GPT-4 has attracted significant attention in the scientific community, especially in medicine, healthcare, engineering, and social sciences, because of its extraordinary ability in general AI tasks. The main objective of this study is to examine the capabilities of LLMs in the context of natural science research. Because of the broad scope of the natural sciences, it is not possible to cover all sub-disciplines; therefore, the study focuses on selected areas: drug discovery, biology, computational chemistry, materials design, and partial differential equations (PDEs).

The authors' objective is to provide a broad overview of LLM performance and potential applications of LLMs in these specific scientific fields, with a focus on GPT-4, the most advanced LLM. An overview of this report is shown in Figure 1.1.

(This article only discusses techniques for drug discovery. Readers interested in other areas should refer to the original article.)

Figure 1.1: Overview of this report

Drug discovery

Drug discovery is the process of identifying and developing new drug candidates to treat or prevent specific diseases or conditions. This complex and multifaceted field aims to improve human health and well-being by creating safe and effective targeted therapeutics. Drug discovery is an integral part of the pharmaceutical industry and plays an important role in the advancement of medicine. It involves a multidisciplinary process that includes target identification, lead compound optimization, and preclinical testing, ultimately leading to the development of safe and effective drugs. Using GPT-4 in drug discovery has great potential to accelerate the process, reduce discovery and design costs, and improve creativity. In this chapter, we first study GPT-4's knowledge of drug discovery through qualitative tests and then study its predictive capabilities through quantitative tests on several key tasks, including drug-target interaction/binding affinity prediction, molecular property prediction, and retrosynthesis prediction.

The authors see great potential for GPT-4 in drug discovery:

- Broad Knowledge : GPT-4 has a broad understanding of key concepts in drug discovery, including individual drugs, target proteins, general principles of small molecule drugs, and challenges faced at various stages of the drug discovery process. This broad knowledge base allows GPT-4 to provide useful insights and recommendations for a wide range of drug discovery tasks.

- Versatility in key tasks : LLMs such as GPT-4 can assist in several important tasks in drug discovery, including:

- Molecular Manipulation: GPT-4 can modify existing molecular structures to generate new molecular structures, which may lead to the discovery of new drug candidates.

- Drug target binding prediction: GPT-4 can predict interactions between molecules and their target proteins.

- Molecular Property Prediction : GPT-4 can predict various physicochemical and biological properties of molecules.

- Retrosynthesis Prediction: GPT-4 can predict synthetic pathways for target molecules, helping chemists design efficient and cost-effective synthetic strategies (Figure 2.23).

- De novo Molecule Generation: GPT-4 can generate de novo molecules following textual instructions. This de novo molecule generation capability is a valuable tool for identifying new drug candidates that may address unmet medical needs.

- Coding capabilities: GPT-4's powerful coding capabilities could greatly reduce human labor in the future.

GPT-4 is a useful tool to assist in drug discovery research, but it is important to be aware of its limitations and potential errors. To make better use of GPT-4, the authors offer a few tips for researchers:

- SMILES sequence processing challenge: GPT-4 may have trouble processing SMILES sequences directly. To improve model understanding and output, it is better to provide the name of the drug molecule and its description, if possible. Doing so will provide more context to the model and improve its ability to generate relevant and accurate responses.

- Limitations in Quantitative Tasks: While GPT-4 is excellent for qualitative tasks and questions, it may face limitations when it comes to quantitative tasks such as numerical prediction of molecular properties and drug-target binding in the evaluated data set. In such cases, we recommend using the GPT-4 output as a reference and validating it with dedicated AI models and scientific computing tools to obtain reliable conclusions.

- Double-checking generated molecules : When generating new molecules with GPT-4, it is essential to verify the validity and chemical properties of the generated structures.

Key Concepts in Drug Discovery

・Entity Translation

The focus here is on evaluating the performance of GPT-4 in translating drug names, IUPAC nomenclature, chemical formulas, and SMILES expressions.

Drug names, IUPAC nomenclature, chemical formulas, and SMILES strings are important building blocks for understanding and communicating the chemical structure and properties of drug molecules. These representations are essential for researchers to effectively communicate, retrieve, and analyze compounds. Some examples are shown in Figures 2.2 and 2.3.

The first example asks GPT-4 to generate the chemical formula, the IUPAC name, and the SMILES for a given drug name. The input drug is afatinib, a drug used to treat non-small cell lung cancer (NSCLC). As shown in Figure 2.2, GPT-4 correctly outputs the chemical formula C24H25ClFN5O3, and the IUPAC name is also correct, so GPT-4 recognizes the drug afatinib.

However, the SMILES is not correct, so we give guidance and let GPT-4 generate the SMILES again. Unfortunately, as shown in the figure, despite explicitly instructing GPT-4 to "note the number of atoms in each atom type" and to generate the SMILES based on the correct IUPAC name and chemical formula, the SMILES sequences generated in several attempts are still incorrect.

In Figure 2.3, GPT-4 is asked to translate between IUPAC names and SMILES sequences and to output chemical formulas. As shown in the figure, the translation from the SMILES sequence to the IUPAC name is correct, but the translation in the opposite direction is incorrect. Furthermore, the chemical formulas generated are incorrect in both translation directions. These two cases suggest that IUPAC names are easier for GPT-4 to understand and generate than SMILES.

One possible cause is the tokenization method used in GPT-4: because it is based on subword representations, it may not be well suited to SMILES, in which each character carries a specific chemical meaning. This hypothesis could also explain why the generated chemical formulas are not always correct (as shown in Figure 2.3), since a formula depends on exact counts of each atom type.
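Because formula errors come down to atom counts, a generated formula can be checked mechanically. The following is a minimal sketch, not taken from the paper; the helper names `parse_formula` and `heavy_atom_count` are hypothetical, and the parser assumes a simple formula without parentheses, charges, or isotopes.

```python
import re
from collections import Counter

# One element symbol (e.g. "C", "Cl") followed by an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms per element in a simple formula such as 'C24H25ClFN5O3'.

    Assumption: no parentheses, charges, or isotope labels.
    """
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for element, number in TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def heavy_atom_count(counts: Counter) -> int:
    """Number of non-hydrogen atoms."""
    return sum(n for element, n in counts.items() if element != "H")


afatinib = parse_formula("C24H25ClFN5O3")  # formula quoted in the text
print(dict(afatinib))
print(heavy_atom_count(afatinib))
```

Comparing `parse_formula(generated)` with `parse_formula(reference)` gives a quick consistency check on a model's output, although it says nothing about whether the connectivity in a generated SMILES is right.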

Figure 2.2: Example of entity translation. Given a drug name, we generate its chemical formula, IUPAC name, and SMILES. The first molecular graph is the true 2D structure of afatinib, and the next four graphs are translated from the SMILES sequence generated by GPT-4.

Figure 2.3: Translation of IUPAC names and SMILES.

・Memorization of knowledge and information

Next, GPT-4's ability to memorize knowledge and information about drug discovery is assessed. The drug of choice is again afatinib, and the case is shown in Figure 2.4.

First, when we ask GPT-4 for an introduction to the drug afatinib, GPT-4 responds with multiple pieces of information, including the type of molecule, target receptor, FDA approval date, function, route of administration, and side effects. Checked against PubChem and DrugBank, the information GPT-4 provides is clear and correct, which indicates that GPT-4 has knowledge of afatinib.

Next, we ask further questions about the chemical properties of afatinib, which are important to truly understand molecular medicines. In Figure 2.5, GPT-4 answers a variety of properties about afatinib, including molecular weight, solubility, XLogP3, and number of hydrogen bond acceptors. (1) Most of the properties introduced are correct. (2) Some properties are incorrect. In particular, counting-related results such as the number of heavy atoms are incorrect. (3) Interestingly, the SMILES notation for the generated afatinib is also incorrect.

This is in good agreement with the observations of the previous subsection, suggesting that SMILES generation is still a challenge for GPT-4.

Figure 2.4: General and chemical information about the drug afatinib from GPT-4. Most of the knowledge about afatinib is correct.

Figure 2.5: Molecular characterization of the drug afatinib from GPT-4

・Molecular Manipulation

Molecular manipulation is the process of modifying the structure of a molecule to achieve a desired property or function. In the pharmaceutical industry, molecular manipulation can optimize drug candidate compounds to increase efficacy, reduce side effects, and improve pharmacokinetic properties, which is critical for designing potent and safe therapeutics.

Figure 2.7 presents a case in which GPT-4 was asked to assist in drug molecular manipulation. Specifically, Asciminib, a first-in-class allosteric inhibitor of BCR::ABL1 kinase activity, was recently approved for the treatment of patients with chronic myeloid leukemia in chronic phase who have failed two lines of therapy or have the T315I mutation.

When we first ask GPT-4 to modify asciminib by replacing the chlorine (Cl) with the element in the row below it in the periodic table (which is bromine (Br)), GPT-4 identifies the element and successfully makes the replacement. When we further ask GPT-4 to replace the pyrrolidine with a six-membered ring and the alcohol with a fluoride, GPT-4 describes the correct process, but interestingly, the result is wrong.

After many rounds of guidance and correction, GPT-4 eventually produces the desired molecule. Thus, GPT-4 has strong knowledge, but without specific user feedback and step-by-step checks it is quite likely to make errors. There are also cases showing that GPT-4 often generates invalid SMILES.

Figure 2.7: Manipulation of asciminib; GPT-4 attempts the modification as instructed, but the result is incorrect. After many rounds of guidance, GPT-4 finally makes the correct modification.

・Macro questions about drug discovery

The above assessment focuses on individual drugs and molecules. Here we further test GPT-4 on macroscopic questions about drug discovery. In Figure 2.8, we first ask a basic question about Lipinski's Rule of Five.

GPT-4 provides the correct answer and a reasonable explanation of how the rule is used to evaluate the druggability of a compound. In Figure 2.9, GPT-4 is asked to introduce the main challenges in drug discovery. This is a general, broad question with no standard answer; the depth of GPT-4's answer implies that it has macroscopic knowledge of drug discovery.
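For reference, Lipinski's Rule of Five is commonly stated as four thresholds: molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors, with a compound usually considered acceptable if it violates no more than one of them. The sketch below encodes that common statement; it is not code from the paper, the `Descriptors` container and function names are hypothetical, and the descriptor values are made-up illustrations rather than data for a real drug.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    molecular_weight: float   # in daltons
    logp: float               # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> list:
    """Return the names of the Rule-of-Five criteria that are violated."""
    violations = []
    if d.molecular_weight > 500:
        violations.append("molecular_weight > 500")
    if d.logp > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Common convention: at most one violation is tolerated."""
    return len(lipinski_violations(d)) <= max_violations


# Illustrative, made-up descriptor values.
small = Descriptors(350.0, 2.1, 2, 5)
large = Descriptors(720.0, 6.3, 6, 12)
print(passes_rule_of_five(small), lipinski_violations(large))
```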

The last case, in Figure 2.10, asks how AI and machine learning can help in drug discovery. GPT-4 gives a good, well-rounded answer that covers a variety of aspects where AI could be useful, including property prediction (ADMET), drug design (generative models), and knowledge discovery. Overall, GPT-4 has knowledge of the entire drug discovery process and its individual steps.

Figure 2.8: Lipinski's Rule of Five.

Figure 2.9: Major challenges in drug discovery.

Figure 2.10: Proposal for AI for drug discovery.

Drug-target binding

A fundamental concept in pharmacology and drug discovery, drug-target binding is the specific interaction between a drug molecule and its target, usually a protein or receptor in the body. Understanding drug-target binding is essential for designing effective and safe drugs. The strength of the interaction, called binding affinity, is a key determinant of drug potency and efficacy. In general, the higher the binding affinity, the stronger the action on the target and, consequently, the greater the therapeutic effect. Accurately predicting the binding affinity of a drug to its target can greatly expedite the drug discovery pipeline and reduce the time and cost required to bring a new therapy to market. Here we investigate the ability of GPT-4 to predict drug-target interactions (DTI) and affinity scores. We employ a series of benchmark datasets representing a variety of drug candidates and target proteins for quantitative assessment and case studies for qualitative assessment.

・Drug-target affinity prediction

Drug-target affinity (DTA) prediction is a regression problem that estimates an affinity score that quantifies the binding strength between a drug candidate and its target protein.

Setup : BindingDB and DAVIS are two prominent datasets for affinity prediction and exhibit different data distributions, as shown in Figure 2.11. The authors adopted the data processing approach used in previous studies. Due to API call limitations, 1,000 samples are randomly selected for the test set, and the GPT-4 temperature is set to 0.7. Three different settings are evaluated: zero-shot, few-shot, and similarity-based.

Figure 2.11: Label distribution for the BindingDB Ki and DAVIS datasets, where the x-axis represents the log-transformed affinity values and the y-axis displays the frequency ratio corresponding to each affinity value.

Zero-Shot Assessment : The Zero-Shot assessment primarily tests the ability of GPT-4 to understand the key concepts of affinity prediction, as shown in Figure 2.12 - Figure 2.13.

- In Figure 2.12, GPT-4 does not directly perform calculations when asked to estimate drug-target affinity. Instead, it provides step-by-step guidance for estimating binding affinity and provides additional information about the drug and target.

- Figure 2.13 shows an interesting case in which GPT-4 appears to produce a "direct" affinity prediction. Given the SMILES of the drug and the FASTA sequence of the target, GPT-4 seemingly emulates running AutoDock Vina (docking software) and returns an affinity score of -7.2 kcal/mol. However, it is not actually running AutoDock Vina; it is simply fabricating a score. It is therefore important to verify the accuracy of such numerical results generated by GPT-4.

Figure 2.12: Example of zero-shot drug-target affinity (DTA) prediction: GPT-4 does not predict the DTA directly, but rather provides useful information, such as how to calculate the affinity with relevant docking software.

Figure 2.13: An interesting example of zero-shot DTA prediction: GPT-4 appears to be running docking software, but is simply fabricating an affinity score.

Few-Shot Evaluation : To investigate GPT-4's few-shot learning ability for DTA prediction, we provide GPT-4 with few-shot examples (demonstrations). We vary (1) the system prompt (similar to the zero-shot evaluation) and (2) the number of few-shot examples. The few-shot examples are selected either randomly or manually to ensure diversity and quality, and we observe only slight differences in the predicted results between the two.

Figure 2.14 displays the two system prompts and Figure 2.15 shows the few-shot examples. The first system prompt takes the perspective of a drug expert to test whether GPT-4 can estimate affinity; the second asks GPT-4 to act as a machine learning predictor that identifies patterns from the few-shot examples. The results of the few-shot evaluation are shown in Table 1.

The table shows that on the BindingDB Ki dataset, GPT-4 appears to guess affinity scores at random regardless of the prompt or the number of few-shot examples. In contrast, GPT-4 shows some ability on the DAVIS dataset, where a larger number of few-shot examples (5 versus 3) somewhat improves DTA prediction performance. However, the results still fall short of state-of-the-art deep learning models.

Figure 2.14: System messages used in the evaluation shown in Table 1.

Figure 2.15: Few-shot examples used in the few-shot DTA evaluation.

Table 1: Results of the few-shot DTA predictions for the BindingDB Ki and DAVIS datasets, where R represents the Pearson correlation and Si represents the different system prompts shown in Figure 2.14.
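The Pearson correlation R used to score the affinity predictions can be computed directly from paired lists of predicted and reference values. The sketch below is a plain-Python illustration of the standard formula, not the authors' evaluation code; the function name `pearson_r` and the toy values are assumptions.

```python
import math


def pearson_r(predicted, reference):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        # Correlation is undefined when one side is constant.
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


# Toy affinity values, made up for illustration only.
print(pearson_r([5.1, 6.0, 7.2, 8.3], [5.0, 6.4, 6.9, 8.8]))
```

A model that guesses scores at random, as described for BindingDB Ki, would be expected to give a value near zero.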

kNN few-shot evaluation : In the previous evaluations, the few-shot examples were selected manually or randomly, and the same examples (demonstrations) were used for every test case in the 1,000-sample test set. To further assess GPT-4's learning ability, we use k nearest neighbors to select the few-shot examples and perform an additional few-shot evaluation.

Specifically, for each test case, we provide different few-shot examples that are guaranteed to be similar to the test case. This is called kNN few-shot evaluation. In this way, GPT-4 can learn from examples similar to the test case and achieve better affinity predictions.

While there are various methods to obtain k nearest neighbors as few-shot examples, this study employs an embedding-based similarity search by computing the embedding cosine similarity between test cases and cases in the training set (e.g., BindingDB Ki training set, DAVIS training set). The embeddings are derived from the GPT-3 model and API calls are used to retrieve the GPT-3 embeddings for all training and test cases.
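The embedding-based selection described above can be sketched as follows. This is an illustrative outline rather than the authors' code: the function names are hypothetical, and the toy two-dimensional vectors stand in for the GPT-3 embeddings that the study retrieves through API calls.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_knn_examples(test_embedding, train_embeddings, train_examples, k):
    """Return the k training examples whose embeddings are closest to the test case."""
    scored = sorted(
        zip(train_embeddings, train_examples),
        key=lambda pair: cosine_similarity(test_embedding, pair[0]),
        reverse=True,
    )
    return [example for _, example in scored[:k]]


# Toy 2-D vectors standing in for real embeddings (assumption).
train_vecs = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.0]]
train_items = ["pair_a", "pair_b", "pair_c"]
print(select_knn_examples([1.0, 0.0], train_vecs, train_items, k=2))
```

The selected examples would then be formatted into the prompt as demonstrations for that one test case, so each test case sees a different set of demonstrations.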

The results in Table 2 show that few-shot examples selected by similarity can significantly improve the accuracy of DTA predictions. For example, the Pearson correlation can approach 0.5, and providing more similar examples can further improve performance, with an upper limit observed when 30 nearest neighbors are provided. While these results are promising compared to the previous few-shot evaluation, the performance still lags far behind existing models (e.g., SMT-DTA). As a result, GPT-4 still has a long way to go before it can excel in DTA prediction without fine-tuning.

Table 2: kNN-based few-shot DTA prediction results for the DAVIS dataset. Various numbers of k nearest neighbors are selected by GPT-3 embedding of drug and target sequences. p is Pearson correlation.

・Prediction of drug-target interactions

Drug-target interaction (DTI) prediction is another task similar to affinity prediction: instead of outputting a specific affinity value between drug and target, DTI is a binary classification task that outputs a "yes" or "no" answer indicating whether the drug and target have a strong binding affinity. This is presumably a simpler prediction task. We evaluate on a customized BindingDB dataset, randomly selecting 1,000 test cases with 500 positive and 500 negative drug-target pairs. As before, we evaluate the zero-shot, few-shot, and kNN few-shot settings.

Zero-shot and few-shot evaluation : For the zero-shot evaluation, the system prompt is shown in Figure 2.17, and GPT-4 is given the IUPAC name of the compound, its SMILES, the target protein name, and the FASTA sequence. The DTA experiments showed that GPT-4 has a hard time recognizing the mappings between these items. The results show that (1) GPT-4 randomly outputs "Yes" or "No" when asked to output a binary label for the interaction, and its explanations appear unreasonable; (2) GPT-4 sometimes declines to answer whether a drug and target can interact and instead suggests using a docking tool (similar to the DTA prediction); (3) with more stringent prompts, e.g., asking GPT-4 to "check the explanation and answer and give a more confident answer," GPT-4 most often responds that it "cannot answer with confidence whether a compound can interact with a protein," as shown in Figure 2.16.

Figure 2.16: An example of a zero-shot assessment for a drug-target interaction; GPT-4 refuses to respond at a high rate when prompted for self-refinement.

Figure 2.17: System messages used for zero-shot evaluation, few-shot evaluation in Table 3, and kNN few-shot DTI evaluation in Table 4.

For the few-shot evaluation, results are shown in Table 3. Varying the number of randomly sampled few-shot examples over {1, 3, 5, 10, 20}, we observe that the classification results are unstable as the number of few-shot examples increases. Furthermore, the results lag far behind trained deep learning models such as BridgeDTI [96].

Table 3: Few-shot DTI prediction results for the BindingDB dataset, where N is the number of randomly sampled few-shot examples.

kNN few-shot evaluation : Similarly, we evaluate embedding-based kNN few-shot prompting for BindingDB DTI prediction with GPT-4. The embeddings are again derived from GPT-3. For each test case, the number of nearest neighbors k is varied over {1, 5, 10, 20, 30}, and the results are displayed in Table 4. The table shows a clear advantage from incorporating more similar drug-target interaction pairs. For example, from k = 1 to k = 20, the accuracy, precision, recall, and F1 scores improve significantly. GPT-4 even slightly outperforms the robust DTI model BridgeDTI [96], demonstrating the strong learning ability of embedding-based kNN few-shot prompting and the great potential of GPT-4 for DTI prediction. It also shows that GPT embeddings perform well in the binary DTI classification task.

Table 4: kNN-based few-shot DTI prediction results for the BindingDB dataset, where different numbers K of nearest neighbors are selected using GPT-3 embeddings of the drug and target sequences.
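The four classification metrics reported for DTI prediction can be computed from binary labels as in the following sketch. It uses the standard definitions and is not the authors' evaluation script; the function name `binary_metrics` and the toy labels are assumptions, with 1 standing for "yes" (interaction) and 0 for "no".

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, made up for illustration only.
print(binary_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
```

On a balanced test set such as the 500 positive and 500 negative pairs used here, random guessing gives an accuracy near 0.5, which is the reference point for reading Tables 3 and 4.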

Prediction of Molecular Properties

Here we quantify GPT-4's performance on two property prediction tasks selected from MoleculeNet: one predicts the drug's ability to penetrate the blood-brain barrier (BBBP), and the other predicts whether the drug has biological activity with the p53 pathway (Tox21-p53). Both tasks are binary classifications.

The data are split by scaffold: for each molecule in the dataset, its scaffold is extracted. Then, based on the frequency of the scaffolds, the corresponding molecules are assigned to a training set, a validation set, and a test set. This ensures that the molecules in the three sets are structurally different.
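A frequency-based scaffold split of this kind can be sketched as below. This is one common way to implement the idea, not necessarily the exact procedure used in the study. The code assumes that a scaffold string has already been computed for every molecule (for example with a cheminformatics toolkit such as RDKit); the function name, the 80/10/10 fractions, and the toy labels are all assumptions.

```python
from collections import defaultdict


def scaffold_split(scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Split molecules so that no scaffold is shared between sets.

    scaffold_of maps a molecule identifier to its precomputed scaffold string.
    Scaffold groups are assigned whole, most frequent first.
    """
    groups = defaultdict(list)
    for molecule, scaffold in scaffold_of.items():
        groups[scaffold].append(molecule)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n = len(scaffold_of)
    train, valid, test = [], [], []
    for _, members in ordered:
        if len(train) + len(members) <= frac_train * n:
            train.extend(members)
        elif len(valid) + len(members) <= frac_valid * n:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Toy data: ten molecules with made-up scaffold labels A-D.
toy = {f"m{i}": s for i, s in enumerate("AAAAAABBCD")}
train, valid, test = scaffold_split(toy)
print(len(train), len(valid), len(test))
```

Because rare scaffolds end up in the validation and test sets, test molecules tend to look unlike anything in the training set, which matters for the few-shot results discussed later.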

The qualitative studies showed that GPT-4 performs differently on different representations of the same molecule, so the quantitative study also compares representations. First, we test GPT-4 with the molecule's SMILES or IUPAC name. The IUPAC prompt is shown in the top box of Figure 2.18; for the SMILES-based prompt, "IUPAC" is simply replaced with "SMILES." The results are shown in Table 5. In general, GPT-4 with IUPAC names as input achieves better results than GPT-4 with SMILES as input. The authors speculate that this is because IUPAC names represent molecules by explicitly using substructure names, which occur more frequently than SMILES in the training text used by GPT-4.

Inspired by the success of few-shot (or in-context) learning with LLMs in natural language tasks, we conducted a 5-shot evaluation of BBBP using IUPAC names. The prompts are shown in Figure 2.18. For each molecule in the test set, we select the five most similar molecules from the training set based on Morgan fingerprints. Interestingly, compared to the zero-shot setting (row 'IUPAC' in Table 5), the five-shot setting (row 'IUPAC (5-shot)' in Table 5) shows a decrease in precision and accuracy and an increase in recall and F1. This phenomenon can be attributed to the dataset splitting technique. Since scaffold splitting produces large structural differences between the training and test sets, the five most similar molecules chosen as few-shot examples may not actually be similar to the test case. Such structural differences can lead to biased or erroneous predictions.

Figure 2.18: Prompt for BBBP property prediction. Molecules are represented by IUPAC names.
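Selecting the most similar training molecules by fingerprint is typically done with Tanimoto similarity, which the sketch below uses; the source names Morgan fingerprints but not the similarity measure, so that choice is an assumption. Fingerprints are represented here as sets of on-bit indices with made-up values; in practice they would come from a cheminformatics toolkit. The function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def most_similar(query_fp, train_fps, k=5):
    """Return (name, similarity) for the k training molecules closest to the query."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in train_fps.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up fingerprints for illustration only.
train_fps = {
    "mol_x": {1, 2, 3, 4},
    "mol_y": {1, 2},
    "mol_z": {7, 8},
}
for name, similarity in most_similar({1, 2, 3, 4}, train_fps, k=2):
    print(name, similarity)
```

Returning the similarity scores alongside the names makes it easy to spot the situation described above: if even the top-ranked neighbors have low similarity, the demonstrations are unlikely to be informative for that test molecule.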

Table 5: BBBP prediction results. The test set contains 107 positive and 97 negative samples.

In addition to SMILES and IUPAC names, we also test GPT-4 using drug names. We search DrugBank for the molecular SMILES to obtain drug names; of the 204 drugs, 108 drug names are found in DrugBank. The names are entered using the same prompt as in Figure 2.18. The results are shown in the right half of Table 5, which also lists the corresponding results for the same 108 drugs when GPT-4 is given SMILES or IUPAC names. With drug names, all four metrics show significant improvement. A possible explanation is that drug names appear more frequently (than IUPAC names or SMILES) in the GPT-4 training corpus.

The final analysis of BBBP compares GPT-4 with MolXPT, a GPT-based language model specifically trained on molecular SMILES and biomedical literature; MolXPT has 350M parameters and is fine-tuned on MoleculeNet. Notably, it outperforms GPT-4 on the complete test set, with accuracy, precision, recall, and F1 scores of 70.1, 66.7, 86.0, and 75.1, respectively. These results show that a specialized model fine-tuned for molecular property prediction can produce results equal to or better than GPT-4, indicating that there is significant room for improvement in GPT-4.

Results for Tox21-p53 are shown in Table 6. Similarly, GPT-4 with IUPAC names as input outperforms GPT-4 with SMILES, and the five-shot results are much worse than the zero-shot results.

Table 6: Prediction results for the Tox21 SR-p53 set (Tox21-p53 for short). All positive samples (72 samples) and 144 negative samples (twice as many as the positive samples) were randomly selected from the test set for evaluation due to the limited API access quota for GPT-4.

An example of a zero-shot BBBP prediction is shown in Figure 2.19. GPT-4 generates accurate drug descriptions, indications, and targets, leading to reasonable conclusions.

Figure 2.19: Example of BBBP prediction: sufentanil (DrugBank ID: DB00708). The green area is confirmed to be correct.

Retrosynthesis

Retrosynthesis is an important tool in the drug discovery process, allowing chemists to strategically devise synthetic pathways to create desired compounds from simpler starting materials. By breaking down complex molecules into simpler building blocks, chemists can determine the most efficient and cost-effective synthetic route to develop a new drug candidate. As a result, retrosynthesis enables rapid and efficient design and synthesis of new drug candidate compounds.

Here we explore how GPT-4 can help us understand chemical reactions and predict potential reactants to products both qualitatively and quantitatively.

・Understanding of chemical reactions

Two cases are presented to demonstrate GPT-4's ability to understand chemical reactions. In Figure 2.21, GPT-4 is asked to act as an organic chemist and retrosynthesis expert and to explain a given chemical reaction (represented by SMILES sequences). GPT-4 first translates the SMILES sequences into the names of the reactants and then explains the reaction mechanism. GPT-4 fails at the first step: it translates the SMILES CC(=O)cc1ccc2[nH]ccc2c1 to the name 2-acetylindole and the SMILES CC(C)(C)OC(=O)OC(=O)OC(C)(C)C to the name trimethylacetic anhydride. As shown in Figure 2.20, the molecules with these names have molecular graphs very similar to those of the original SMILES, but they are different molecules. As a result, the subsequent explanation goes in the wrong direction.

In Figure 2.22, we ask GPT-4 to think carefully step by step to explain this chemical reaction. This time the explanation goes in the right direction: GPT-4 no longer converts the SMILES sequences into names, but instead describes the functional groups in the molecules in detail. Unfortunately, it is still incorrect: di-tert-butyl dicarbonate does not have three ester (C=O) functional groups, and the explanation of the reaction mechanism is not entirely correct, since isobutene and CO2 are obtained as by-products, not tert-butanoic acid.

Figure 2.20: Two-dimensional molecular graphs of the two true reactants, (a) and (c), and of the molecules corresponding to the names generated by GPT-4, (b) and (d). They are similar but not identical.

Figure 2.21: Example 1 for understanding chemical reactions.

Figure 2.22: Example 2 for understanding chemical reactions.

・Retrosynthesis prediction

Using the widely used USPTO-50K benchmark dataset and a few-shot setup, the authors quantitatively study GPT-4's ability in single-step retrosynthesis prediction (i.e., predicting possible reactants for a given product).

Setup: The USPTO-50K dataset contains 50,037 chemical reactions extracted from US patents. Following the data split used in much of the literature, 40,029 reactions are used as the training set and 5,007 reactions as the test set. Due to API call limitations, only the first 500 samples of the USPTO-50K test set are used for testing. Top-1 accuracy is the evaluation metric and R-SMILES is the main baseline; R-SMILES is a state-of-the-art model specifically designed for retrosynthesis prediction and trained on this dataset.

Few-shot results: The authors considered several aspects when evaluating GPT-4's few-shot capability for retrosynthesis prediction: (1) the number of few-shot examples; (2) the method for obtaining few-shot examples, either (a) random selection or (b) selecting the K nearest neighbors from the training set based on molecular fingerprint similarity; and (3) whether adding IUPAC names to the prompt improves accuracy. Figure 2.23 shows the prompt used for the few-shot evaluation. The results are shown in Table 7:

- GPT-4 reaches a retrosynthesis prediction accuracy of 20.6% in its best setting.

- GPT-4's accuracy improves as more examples are added to the prompt; K = 10 is a good choice.

- Selecting few-shot examples by K-nearest-neighbor search significantly outperforms random selection (20.2% vs. 1.2%).

- Including IUPAC names in the prompt slightly improves accuracy (20.6% vs. 20.2%) and decreases the percentage of invalid SMILES.

- The accuracy of GPT-4 (20.6%) is lower than the accuracy of the domain-specific model (53.6%), which indicates that there is ample room for improving GPT-4 for this particular task.

Figure 2.24 shows an example where GPT-4 failed to predict the correct reactant for a product on the first attempt and finally succeeded after several rounds of guidance and correction. This suggests that GPT-4 has good knowledge, but specific user feedback and step-by-step validation are needed to avoid errors.
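The K-nearest-neighbor selection and the top-1 metric described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: the fingerprints are hand-written sets of bit indices (a real setup would compute them with a cheminformatics toolkit), the toy training entries and the prompt wording are assumptions, and top-1 accuracy is computed by exact string match, which presumes that predictions and references are canonicalized in the same way.

```python
from typing import Dict, List, Set

# Hypothetical toy training set: product SMILES, reactant SMILES, and a
# made-up fingerprint given as the set of "on" bit indices.
TRAIN: List[Dict] = [
    {"product": "CC(=O)OCC", "reactants": "CC(=O)O.CCO", "fp": {1, 2, 3, 4}},
    {"product": "CC(=O)NC", "reactants": "CC(=O)Cl.CN", "fp": {1, 2, 5}},
    {"product": "c1ccccc1Br", "reactants": "c1ccccc1.BrBr", "fp": {7, 8, 9}},
]


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def select_few_shot(query_fp: Set[int], train: List[Dict], k: int) -> List[Dict]:
    """Return the k training reactions most similar to the query product."""
    ranked = sorted(train, key=lambda ex: tanimoto(query_fp, ex["fp"]), reverse=True)
    return ranked[:k]


def build_prompt(product: str, examples: List[Dict]) -> str:
    """Assemble a few-shot prompt (the wording is an assumption, not the paper's)."""
    parts = ["Predict the reactants for the given product."]
    for ex in examples:
        parts.append(f"Product: {ex['product']}\nReactants: {ex['reactants']}")
    parts.append(f"Product: {product}\nReactants:")
    return "\n\n".join(parts)


def top1_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of test cases whose first prediction exactly matches the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    shots = select_few_shot({1, 2, 3}, TRAIN, k=2)
    print(build_prompt("CC(=O)OC", shots))
```

Ranking by similarity puts reactions of the same type in front of the model, which is one plausible reading of why nearest-neighbor examples help so much more than random ones.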

Table 7: Few-shot retrosynthesis prediction results on the USPTO-50K dataset.

Figure 2.23: Example of the few-shot prompt used in the evaluation of few-shot retrosynthesis prediction.

Figure 2.24: Example of retrosynthesis prediction. With multiple rounds of guidance, GPT-4 eventually gives the correct reactants.

Generation of new molecules

This part studies an important application in drug discovery: proposing and generating novel molecules as drug candidates. SARS-CoV-2 uses its spike protein to bind to receptors on the surface of human cells. The authors ask GPT-4 for general guidance on designing protein-based drugs that bind to the spike protein and neutralize the virus. GPT-4 then shows how to design such protein drugs from scratch using a computational tool called Rosetta. GPT-4 gives excellent answers to the authors' questions, which suggests that it can help in designing novel protein drugs.

Figure 2.25: GPT-4 understands how to use computational tools for biological design.

One measure for evaluating a protein drug is its estimated binding affinity to the target. The authors asked GPT-4 how to do this computationally, and GPT-4 explained in detail how to estimate protein binding affinity using a publicly available tool called RosettaDock. It also gave examples of how to interpret the results. Note that although GPT-4 presents a reasonable design process, it is still difficult to computationally predict protein-protein interactions in the highly complex environment of the body, so protein designs still require wet-lab experiments for validation.

Figure 2.26: GPT-4 teaches us how to design such protein drugs from scratch using a computational tool called Rosetta.

Coding support for data processing

This part assesses GPT-4's ability to assist with data processing for drug discovery. Specifically, GPT-4 is asked to generate Python code that processes drug discovery-related data. A significant amount of drug and protein data is stored in sequence formats such as SMILES and FASTA, which can be downloaded from the PubChem and UniProt websites. Examples are shown in Figures 2.27 and 2.28.

Figure 2.27: Coding assistance for downloading the SMILES and formula of a molecule from PubChem by ID.

Figure 2.28: Coding aid for downloading protein sequences from UniProt with ID.

In Figure 2.28, GPT-4 generates correct code that downloads the protein sequence data, adds spaces between residues, and saves the result to a file in the specified format. The molecular processing task (Figure 2.27) asks for both the SMILES and the chemical formula of a molecule. Interestingly, GPT-4 generates a nearly correct URL for the data download, but the way it combines the "SMILES" and "chemical formula" keywords in the URL makes the URL invalid. When informed of the error, GPT-4 identifies the problem as being related to the PubChem REST API call. Instead of fixing the bug, it proposes an alternative solution that uses the "pubchempy" package to download the data, and the resulting code runs successfully. These examples demonstrate that GPT-4 can help generate correct scripts for data processing in drug discovery.
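The two tasks can be sketched as follows. This is not the code GPT-4 produced in the figures; it is a minimal sketch that assumes the PubChem PUG REST pattern in which several properties are requested as a comma-separated list, and the UniProt REST pattern `<accession>.fasta`. The property names `MolecularFormula` and `CanonicalSMILES`, the base URLs, and the helper names are assumptions that should be checked against the current PubChem and UniProt documentation before use.

```python
import urllib.request
from typing import List

# Assumed base URLs of the public REST services.
PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
UNIPROT_BASE = "https://rest.uniprot.org/uniprotkb"


def pubchem_property_url(cid: int, properties: List[str]) -> str:
    """Build a PUG REST URL; multiple properties are joined with commas."""
    joined = ",".join(properties)
    return f"{PUBCHEM_BASE}/compound/cid/{cid}/property/{joined}/JSON"


def uniprot_fasta_url(accession: str) -> str:
    """Build the URL of the FASTA record for one UniProt accession."""
    return f"{UNIPROT_BASE}/{accession}.fasta"


def fasta_to_spaced_sequence(fasta_text: str) -> str:
    """Drop header lines, join the sequence lines, and separate residues by spaces."""
    residues = "".join(
        line.strip() for line in fasta_text.splitlines() if not line.startswith(">")
    )
    return " ".join(residues)


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and decode it as UTF-8 (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # CID 2244 is aspirin; P69905 is used here only as an example accession.
    print(pubchem_property_url(2244, ["MolecularFormula", "CanonicalSMILES"]))
    fasta = fetch_text(uniprot_fasta_url("P69905"))
    with open("P69905_spaced.txt", "w", encoding="utf-8") as handle:
        handle.write(fasta_to_spaced_sequence(fasta))
```

Keeping URL construction separate from the network call makes the part that went wrong in Figure 2.27 easy to inspect and test without downloading anything.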

Biology

The study explores in detail the capabilities of GPT-4 in biological research, focusing on its ability to understand biological language, to reason with embedded biological knowledge, and to design biomolecules and biological experiments. The authors' observations show that GPT-4 has great potential to contribute to biology: it can process complex biological language, perform bioinformatics tasks, and even serve as a scientific assistant in biological design. Its broad grasp of biological concepts and its promise as an assistant in design work underscore the role GPT-4 can play in the development of the field:

- Biological Information Processing: GPT-4 understands how to process files in formats specific to the biological domain, such as MEME, FASTQ, and VCF. In addition, GPT-4 is adept at performing bioinformatics analysis on given tasks and data, such as predicting signal peptides from a given sequence, as shown in Figure 3.4.

- Biological Understanding: GPT-4 has a broad understanding of a variety of biological topics, including consensus sequences, PPI, signaling pathways, and evolutionary concepts.

- Biological Reasoning: GPT-4 has the ability to infer plausible mechanisms from biological observations using embedded biological knowledge.

- Biological Support: GPT-4 has shown potential as a scientific assistant in protein design tasks and in wet-lab experiments, where it can translate experimental protocols for automation purposes.

While GPT-4 is a very powerful tool to support biological research, it has some limitations and makes occasional errors. To better utilize the capabilities of GPT-4, we offer some tips for researchers:

- Understanding FASTA sequences: a notable challenge for GPT-4 is the direct processing of FASTA sequences. When possible, the name of the biomolecule should be provided along with the sequence.

- Inconsistent results: GPT-4's performance on tasks involving a biological entity depends on how much information is available about that entity. Analysis of less-studied entities, such as transcription factors, can produce inconsistent results.

- Understanding Arabic numerals: GPT-4 has difficulty dealing directly with Arabic numerals. It is recommended that Arabic numerals be converted to text.

- Quantitative Calculations: Although GPT-4 is excellent at understanding and processing biological language, it is limited in its ability to perform quantitative calculations. For reliable conclusions, it is recommended to check the results manually or with another computational tool.

- Prompt Sensitivity: GPT-4 responses are inconsistent and largely dependent on the wording of the question.
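Two of the tips above, supplying the biomolecule's name together with its sequence and converting Arabic numerals to text, can be applied mechanically when a prompt is assembled. The sketch below is one possible reading of those tips rather than a procedure taken from the paper: it spells out each digit separately and only handles plain runs of digits, and the prompt layout and function names are hypothetical.

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_out_digits(text: str) -> str:
    """Replace every run of ASCII digits with its digits spelled out one by one."""
    def replace(match: "re.Match[str]") -> str:
        return " ".join(DIGIT_WORDS[d] for d in match.group())
    return re.sub(r"[0-9]+", replace, text)


def build_sequence_prompt(name: str, sequence: str, question: str) -> str:
    """Put the biomolecule name next to its sequence and spell out numerals."""
    return (
        f"Biomolecule name: {name}\n"
        f"Sequence: {sequence}\n"
        f"Question: {spell_out_digits(question)}"
    )


if __name__ == "__main__":
    print(build_sequence_prompt(
        "human hemoglobin subunit alpha",
        "MVLSPADKTNVKAAW",
        "Which residue is at position 12?",
    ))
```

Whether digit-by-digit spelling or full number words works better is an open choice; the point is only that the conversion is cheap to automate and can be compared across prompt variants.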

In summary, GPT-4 shows great potential to advance the field of biology by demonstrating proficiency in understanding and processing biological language, reasoning with embedded knowledge, and assisting with design tasks. Despite some limitations and errors, with proper guidance and refinement GPT-4 has the potential to become an invaluable tool for researchers in the evolving field of biological research.

Computational chemistry (i.e. computer simulation of chemical phenomena)

Computational chemistry is an interdisciplinary field that utilizes computational methods and techniques to address complex problems in chemistry. For a long time, computational chemistry has been an essential tool in the study of molecular systems, providing insight into atomic-level interactions and guiding experimental efforts. The field involves the development and application of theoretical models, computer simulations, and numerical algorithms to study the behavior of molecules, atoms, materials, and physical systems. Computational chemistry plays an important role in understanding molecular structures, chemical reactions, and physical phenomena at both the microscopic and macroscopic levels. In this chapter, we survey the capabilities of GPT-4 in various areas of computational chemistry, including electronic structure methods and molecular dynamics simulations, and provide two practical examples where GPT-4 is useful from different perspectives. In summary, we believe that GPT-4 can assist researchers in computational chemistry from many perspectives with the following capabilities:

- GPT-4 has extensive knowledge of computational chemistry, covering topics such as density functional theory, Feynman diagrams, electronic structure theory, molecular dynamics simulations, and molecular structure generation. It can explain basic concepts as well as summarize important findings and trends in the field.

- Method selection: GPT-4 can recommend the appropriate computational method and software package for a particular research problem, taking into account factors such as system size, timescale, and level of theory.

- Simulation Setup: GPT-4 can assist in preparing input structures for simple molecules and in setting up and suggesting simulation parameters, such as specific symmetry, density functional, time step, ensemble, temperature and pressure control methods, and initial configuration.

- Code Development: GPT-4 can assist in the implementation of new algorithms and functionality into existing computational chemistry and physics software packages.

- Experimental, Computational, and Theoretical Guidance: GPT-4 can assist researchers by providing experimental, computational, and theoretical guidance.

GPT-4 is a powerful tool to support computational chemistry research, but it has some limitations and makes errors. To better utilize GPT-4, we offer some tips for researchers:

- Hallucinations: GPT-4 occasionally generates incorrect information; GPT-4 may struggle with complex logical reasoning. Researchers should independently verify and validate output and suggestions from GPT-4.

- Raw atomic coordinates: GPT-4 is not very good at generating or processing raw atomic coordinates for complex molecules or materials. However, GPT-4 may still work for simple systems with the appropriate prompts, including molecular formula, molecular name, or other ancillary information.

- Exact computation: GPT-4 does not excel at exact computation in the benchmarks evaluated by the authors, and it usually ignores physical priors such as symmetry, equivariance, and invariance. At present, the quantitative numbers returned by GPT-4 may come from a literature search or from the few-shot examples, so it is better to combine GPT-4 with scientific computing packages (e.g., PySCF) or machine learning models (e.g., Graphormer or DiG).

- Practice: GPT-4 only provides guidance and suggestions; it cannot directly perform experiments or run simulations. Researchers must set up and run their own simulations and experiments, or use frameworks built on GPT-4, such as AutoGPT, HuggingGPT, or AutoGen.
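One inexpensive way to act on the advice about physical priors is to test whether a computed quantity respects an invariance it should have. The sketch below checks that an energy function returns the same value after the molecule is rotated and translated. It is an illustration under stated assumptions, not a method from the paper: the Lennard-Jones pair energy in reduced units stands in for any energy model, and the function names, test geometry, and tolerance are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pair_energy(coords: Sequence[Coord], epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """Lennard-Jones energy over all atom pairs; it depends only on distances."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            sr6 = (sigma / r) ** 6
            total += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return total


def rotate_z(coords: Sequence[Coord], angle: float) -> List[Coord]:
    """Rotate all atoms about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def translate(coords: Sequence[Coord], shift: Coord) -> List[Coord]:
    """Shift all atoms by the same vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in coords]


def is_invariant(energy_fn: Callable[[Sequence[Coord]], float],
                 coords: Sequence[Coord], tol: float = 1e-9) -> bool:
    """Check that the energy is unchanged by one rigid rotation plus translation."""
    reference = energy_fn(coords)
    moved = translate(rotate_z(coords, 0.7), (1.0, -2.0, 0.5))
    return abs(energy_fn(moved) - reference) <= tol


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.2, 0.0)]
    print(is_invariant(pair_energy, atoms))
```

A single transformation cannot prove invariance, but a failed check is enough to reject a suggested formula or a generated piece of code before it is used further.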

In summary, GPT-4 shows excellent potential in a variety of areas of computational chemistry, including electronic structure methods, molecular dynamics simulations, and real-world applications. While some limitations and inaccuracies exist, with proper guidance and adjustments GPT-4 has the potential to evolve into a valuable resource for researchers navigating the dynamically expanding field of computational chemistry.

Material Design

This section examines the capabilities of GPT-4 in the area of materials design. The authors have devised a comprehensive set of tasks that encompass a broad range of aspects of the materials design process, from initial conceptualization to subsequent validation and synthesis. The authors' goal is to evaluate GPT-4's expertise and ability to generate meaningful insights and solutions in real-world applications. The tasks designed by the authors cover many aspects, including background knowledge, design principles, candidate identification, candidate structure generation, property prediction, and synthesis condition prediction. By addressing the full spectrum of the design process, the authors aim to provide a comprehensive assessment of GPT-4's proficiency in designing more complex materials, particularly crystalline inorganic materials, organic polymers, and metal-organic frameworks (MOFs). It is important to note that the authors' assessment is primarily focused on providing a qualitative assessment of GPT-4's capabilities in this particular area, and that obtaining statistical scores will only be pursued when feasible.

Based on the authors' evaluation, GPT-4's capabilities in materials design are summarized as follows:

- Information memorization: GPT-4 has an excellent ability to memorize information about inorganic crystals and polymers and to propose design principles. Its textual understanding of the basic rules of materials design is noteworthy. For example, when designing solid electrolyte materials, it can propose ways to increase ionic conductivity and give accurate examples.

- Composition creation: GPT-4 is skilled at generating feasible chemical compositions for new inorganic materials (Figure 5.5).

- Synthesis Planning: GPT-4 demonstrates sufficient ability to plan the synthesis of inorganic materials.

- Coding Support: GPT-4 provides generally useful coding support for materials tasks. For example, it can generate molecular dynamics and DFT inputs for calculating a large number of properties, make correct use of many computational packages, and build automated processing pipelines. Iterative feedback and manual adjustments may be required to fine-tune the generated code.

Despite its capabilities, GPT-4 also has potential limitations in materials science:

- Representation: GPT-4 has difficulty representing and proposing organic polymers and MOFs.

- Structure generation: GPT-4 is limited in its ability to generate structures, especially when accurate atomic coordinates are required.

- Prediction: GPT-4 falls short of accurate quantitative property prediction. For example, when predicting whether a material is metallic or semiconducting, its accuracy is only marginally better than a random guess.

- Synthetic routes: Without additional guidance, GPT-4 struggles to suggest synthetic routes for organic polymeric materials that are not present in the training set.

In conclusion, GPT-4 represents a promising foundation for supporting materials design tasks. Its performance in specific areas such as structure generation and property prediction (on the benchmarks studied by the authors) could be further improved by incorporating additional training data with complementary modalities, such as molecular graphs, or by combining it with dedicated AI models. As LLMs such as GPT-4 continue to advance, they are expected to provide more sophisticated and accurate assistance in materials design, ultimately leading to more efficient and effective materials discovery and development.

Partial differential equations

Partial differential equations (PDEs) are an important and active area of research in mathematics, with widespread applications in a variety of fields, including physics, engineering, biology, and finance. PDEs are mathematical equations that describe the behavior of complex systems involving multiple variables and their partial derivatives. PDEs play an important role in modeling and understanding a wide range of phenomena, from fluid mechanics and heat transfer to electromagnetic fields and collective dynamics.

Here we look at GPT-4's capabilities in several aspects of PDEs: understanding the fundamentals of PDEs, solving PDEs, and assisting AI for PDE research. The authors evaluated the model on various forms of PDEs, including linear equations, nonlinear equations, and stochastic PDEs. The results show that GPT-4 can assist researchers in multiple ways:

- PDE Concepts: GPT-4 demonstrates an understanding of basic PDE concepts, which helps researchers better understand the PDEs they are dealing with. It is also a useful resource for teaching students, helping them to better understand and appreciate the importance of PDEs in their academic pursuits and research activities.

- Relationships between concepts: because this model can identify relationships between concepts, it may help mathematicians broaden their horizons and intuitively grasp the connections between different subfields.

- Solution Recommendations: GPT-4 can recommend appropriate analytical and numerical methods for PDEs of various types and complexities. Depending on the specific problem, it can recommend suitable methods for obtaining exact or approximate solutions.

- Code generation: The model can generate code for numerical solution of PDEs in various programming languages, such as MATLAB and Python, to facilitate implementation of computational solution methods.

- Research Directions: GPT-4 can suggest further research directions and potential extensions, proposing new problems, generalizations, and refinements that may lead to more important and impactful results in the PDE area.

GPT-4 has the potential to support PDE research, but it has some limitations. To better utilize GPT-4, we offer the following recommendations to researchers:

- Output Verification: Although GPT-4 demonstrates a human-like ability to solve partial differential equations and provide explicit solutions, its derivations may be incorrect. Researchers should use caution and verify the output of the model when solving partial differential equations with GPT-4.

- Awareness of hallucinations: GPT-4 may cite literature that does not exist. Researchers should cross-check citations and be aware of this limitation to ensure the accuracy and reliability of the information provided by the model.
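The code-generation and output-verification points can be made concrete with a small example of the kind of solver one might ask an LLM to write, together with the check one should run on it. The sketch below is not taken from the paper. It assumes the 1D heat equation u_t = alpha * u_xx on [0, 1] with zero boundary values and initial condition sin(pi x), solves it with the explicit forward-time centered-space (FTCS) scheme, and compares the result with the known exact solution exp(-alpha * pi^2 * t) * sin(pi x). The function names and grid parameters are illustrative.

```python
import math
from typing import List, Tuple


def solve_heat_ftcs(alpha: float, nx: int, dt: float, steps: int) -> Tuple[List[float], List[float]]:
    """Explicit FTCS scheme for u_t = alpha * u_xx on [0, 1] with u = 0 at both ends."""
    dx = 1.0 / (nx - 1)
    r = alpha * dt / dx ** 2
    if r > 0.5:
        # The explicit scheme is only stable for r <= 1/2.
        raise ValueError(f"unstable step: alpha*dt/dx^2 = {r:.3f} > 0.5")
    x = [i * dx for i in range(nx)]
    u = [math.sin(math.pi * xi) for xi in x]
    u[0] = u[-1] = 0.0
    for _ in range(steps):
        new = u[:]
        for i in range(1, nx - 1):
            new[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
        u = new
    return x, u


def max_error(alpha: float, nx: int, dt: float, steps: int) -> float:
    """Largest pointwise deviation from the exact solution at the final time."""
    x, u = solve_heat_ftcs(alpha, nx, dt, steps)
    t = dt * steps
    decay = math.exp(-alpha * math.pi ** 2 * t)
    return max(abs(ui - decay * math.sin(math.pi * xi)) for xi, ui in zip(x, u))


if __name__ == "__main__":
    print(max_error(alpha=1.0, nx=21, dt=0.001, steps=100))
```

Comparing against a case with a known closed-form solution is a simple way to catch an incorrect derivation or a sign error before the same code is trusted on a problem without one.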

Future Outlook

This study investigated the capabilities and limitations of LLMs in various natural science domains.

The authors' main goal is to provide an initial assessment of GPT-4, a state-of-the-art LLM, and of its potential to contribute to scientific discovery as a valuable resource and tool for researchers in multiple disciplines. Through extensive analysis, the authors highlight GPT-4's proficiency in a number of scientific tasks, ranging from literature synthesis to property prediction and code generation.

Despite its impressive capabilities, it is essential to recognize the limitations of GPT-4 (and similar LLMs), such as difficulties with certain data formats, inconsistent responses, and occasional hallucinations. The authors regard their work as an important first step toward understanding and assessing the potential of GPT-4 in the natural sciences.

By providing a detailed overview of the strengths and weaknesses of GPT-4, the authors aim to help researchers make informed decisions when incorporating GPT-4 (or other LLMs) into their daily work and to ensure optimal application while keeping its limitations in mind. In addition, the authors seek to encourage further exploration and development of GPT-4 and other LLMs to enhance their capacity for scientific discovery. This may require improving the training process, incorporating discipline-specific data and architectures, and integrating specialized techniques tailored to various scientific disciplines.

As the field of artificial intelligence continues to advance, the integration of sophisticated models such as GPT-4 is expected to play an increasingly important role in accelerating scientific research and innovation. The authors hope that their work will serve as a valuable resource for researchers, facilitate collaboration and knowledge sharing, and ultimately contribute to a broader understanding and application of GPT-4 and similar LLMs in the pursuit of scientific breakthroughs. The remaining sections summarize the aspects of LLMs that need improvement for scientific research and discuss potential directions for enhancing or building upon LLMs to advance that pursuit.

Improvement of LLMs

A more detailed and comprehensive approach is needed to further develop LLMs for scientific discovery and to address their limitations. Here we provide a more extensive discussion of the improvements suggested earlier:

- Enhanced SMILES and FASTA sequence processing: LLMs' proficiency in processing SMILES and FASTA sequences can be improved by incorporating specialized training datasets focused on these sequence types, along with dedicated tokens/tokenizers and additional parameters (such as new token embedding parameters). In addition, employing dedicated encoders and decoders for SMILES and FASTA sequences can improve LLMs' comprehension and generation in drug discovery and biological research. Note that only the newly introduced parameters require further training, while the original parameters of the pre-trained LLM can remain frozen.

- Improved Quantitative Task Capabilities: To improve LLMs' capabilities in quantitative tasks, more specialized training datasets for quantitative problems can be integrated, together with domain-specific architectures and multi-task learning, to achieve better performance on tasks such as numerical prediction of drug-target binding and molecular property prediction.

- Improved understanding of less-studied entities: To improve knowledge and understanding of less-studied entities, such as transcription factors, more specialized training data related to these entities should be incorporated. This includes the latest research findings, expert-curated databases, and other resources that can help the model gain a deeper understanding of the topic.

- Molecular and structure generation enhancements: To enhance LLMs' ability to generate innovative and viable chemical compositions and structures, specialized training datasets and methodologies related to molecular and structure generation should be incorporated. Approaches such as learning with physical priors and reinforcement learning can be used to fine-tune LLMs and enhance their ability to generate chemically valid and novel molecules and structures. In addition, specialized models, such as diffusion models for molecule and structure generation, can be developed and combined with LLMs, with the LLM serving as an interface to these models.

- Improving the interpretability and explainability of the model: As LLMs become more sophisticated, it is essential to improve their interpretability and explainability. This will help researchers better understand the LLM's output and trust its recommendations. Employing techniques such as attention-based explanations, feature importance analysis, or counterfactual explanations can provide deeper insights into LLM reasoning and decision-making processes.
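The first item in this list, dedicated tokens and tokenizers for SMILES with new embedding parameters, can be illustrated with a small sketch. It is not the tokenizer proposed in the paper: the regular expression is a simplified, hypothetical pattern that keeps bracket atoms, two-letter halogens, and two-digit ring closures as single tokens, and `extend_vocab` only shows which token ids would be new, i.e. which embedding rows would need training while the pre-trained ones stay frozen.

```python
import re
from typing import Dict, List

# Simplified, illustrative pattern: bracket atoms, Br/Cl, %NN ring closures,
# single letters, single digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%[0-9]{2}|[A-Za-z]|[0-9]|[()=#\-+/\\.:~@*$>]")


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize all of: {smiles!r}")
    return tokens


def extend_vocab(vocab: Dict[str, int], tokens: List[str]) -> List[int]:
    """Add unseen tokens to the vocabulary and return the ids that were newly created.

    In a real model only the embedding rows for these new ids would be trained,
    while the existing rows (and the rest of the network) could remain frozen.
    """
    new_ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            new_ids.append(vocab[token])
    return new_ids


if __name__ == "__main__":
    vocab = {"C": 0, "O": 1, "(": 2, ")": 3, "=": 4}
    tokens = tokenize_smiles("CC(=O)c1ccc2[nH]ccc2c1")
    print(tokens)
    print(extend_vocab(vocab, tokens))
```

Treating `[nH]` or `Cl` as one token instead of several characters keeps chemically atomic units intact, which is the motivation for a sequence-specific tokenizer.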

By addressing these limitations and incorporating the suggested improvements, LLMs can become more powerful and reliable tools for scientific discovery across a variety of disciplines. This will enable researchers to benefit from the advanced capabilities and insights of LLMs, accelerating the pace of research and innovation in drug discovery, materials science, biology, mathematics, and other areas of scientific inquiry.

In addition to the aforementioned aspects, it is essential to address several other considerations that are not limited to the scientific domain but apply to areas such as natural language processing and computer vision in general. These include reducing output variability, mitigating input sensitivity, and minimizing hallucinations. Reducing output variability and input sensitivity is critical to the robustness of LLMs and to their consistency in generating accurate responses across a wide range of tasks.

This can be accomplished by improving the training process, incorporating techniques such as reinforcement learning, and integrating user feedback to improve the LLM's adaptability to diverse inputs and prompts. Minimizing hallucinations is also important, as it directly affects the reliability and credibility of LLM output. Strategies such as contrastive learning, consistency training, and leveraging user feedback can reduce the occurrence of hallucinations and improve the overall quality of the information produced.

Addressing these general considerations will further improve the performance of LLMs and make them more robust and reliable in both scientific and general domains. This will contribute to the development of comprehensive and versatile AI tools that can help researchers and practitioners in various fields achieve their objectives more efficiently and effectively.

Integration of LLMs and scientific tools

There is growing evidence that the capabilities of GPT-4 and other LLMs can be greatly enhanced by integrating external tools and specialized AI models, as demonstrated by systems such as HuggingGPT, AutoGPT, and AutoGen. The authors believe that incorporating specialized computational tools and AI models is even more important for scientific tasks than for general AI tasks, because it can facilitate cutting-edge research and streamline complex problem solving in a variety of scientific domains. A prime example of this approach is the Copilot for Azure Quantum platform. This platform provides a chemistry learning experience specifically designed to enhance scientific discovery and accelerate research productivity in chemistry and materials science. It combines the power of GPT-4 and other LLMs with scientific publications and computational plug-ins, enabling researchers to tackle challenging problems with greater precision and efficiency. By leveraging Copilot for Azure Quantum, researchers have access to a wealth of advanced capabilities tailored to their needs, including a chemistry and materials science data infrastructure that reduces LLM hallucinations and enables on-the-fly information retrieval and insight generation. Other examples include ChemCrow, an LLM agent designed to accomplish chemistry tasks across organic synthesis, drug discovery, and materials design by integrating GPT-4 with 17 expert-designed tools, and ChatMOF, an LLM agent that combines GPT-3.5 with suitable toolkits (table searcher, internet searcher, predictor, generator, etc.) to generate new materials and predict their properties (e.g., for metal-organic frameworks). In conclusion, scientific tools and plug-ins have the potential to significantly enhance the capabilities of GPT-4 and other LLMs in scientific research. This approach not only fosters more accurate and reliable results, but also empowers researchers to tackle complex problems with confidence, ultimately accelerating scientific discovery and fostering innovation across disciplines such as chemistry and materials science.
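The basic mechanism behind such systems, letting the model delegate a quantitative step to a trusted tool, can be sketched in a few lines. The example below does not reproduce how ChemCrow, ChatMOF, or Copilot for Azure Quantum work internally. The `CALL <tool>: <argument>` convention, the `ToolRegistry` class, and the single molar-mass tool are hypothetical, and the mass table covers only four elements with standard atomic weights.

```python
import re
from typing import Callable, Dict

# Standard atomic weights (g/mol) for a deliberately tiny set of elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molar_mass(formula: str) -> float:
    """Molar mass of a simple formula such as 'C2H6O' (no brackets or charges)."""
    matches = list(re.finditer(r"([A-Z][a-z]?)([0-9]*)", formula))
    if not formula or "".join(m.group() for m in matches) != formula:
        raise ValueError(f"unsupported formula: {formula!r}")
    total = 0.0
    for m in matches:
        element, count = m.group(1), int(m.group(2) or "1")
        if element not in ATOMIC_MASS:
            raise ValueError(f"unknown element: {element}")
        total += ATOMIC_MASS[element] * count
    return total


class ToolRegistry:
    """Maps tool names to plain Python callables that take and return text."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, argument: str) -> str:
        if name not in self._tools:
            raise KeyError(f"no tool named {name!r}")
        return self._tools[name](argument)


def route(model_output: str, registry: ToolRegistry) -> str:
    """Run a tool if the model asked for one; otherwise pass the text through."""
    prefix = "CALL "
    if not model_output.startswith(prefix):
        return model_output
    name, _, argument = model_output[len(prefix):].partition(":")
    return registry.call(name.strip(), argument.strip())


if __name__ == "__main__":
    registry = ToolRegistry()
    registry.register("molar_mass", lambda f: f"{molar_mass(f):.3f} g/mol")
    print(route("CALL molar_mass: C2H6O", registry))
```

The number is produced by deterministic code rather than by the language model, which is the design choice that lets tool integration reduce quantitative errors and hallucinated values.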

Building a unified scientific foundation model

GPT-4 is primarily a language-based foundation model trained on huge amounts of textual data. In scientific research, however, there are numerous other valuable data sources besides text. Examples include drug molecule databases, protein databases, and genome databases, which are critical to scientific discovery. These databases contain large molecules such as the titin protein, which is composed of over 30,000 amino acids and approximately 180,000 atoms (and three times as many atomic coordinates). Converting these data sources into text results in very long sequences that are difficult for LLMs to process efficiently. For this reason, the authors believe it is extremely important to develop a scientific foundation model that supports the research and discovery work of natural scientists. While pre-trained models exist that target individual scientific domains and focus on a limited set of tasks, a unified, large-scale scientific foundation model has not yet been established. Existing models include:

- The ESM-x series (ESM-2, ESMFold, MSA Transformer, ESM-1v for variant effect prediction, ESM-IF1 for inverse folding, etc.) are pre-trained protein language models.

- DNABERT-1/2, Nucleotide Transformers, MoDNA, HyenaDNA, and RNA-FM are pre-trained models for DNA and RNA.

- Geneformer is pre-trained on a corpus of approximately 30 million single-cell transcriptomes, enabling context-specific predictions in network biology settings with limited data, such as chromatin and network dynamics.

Inspired by these studies, the authors advocate the development of a unified, large-scale scientific foundation model that can address as many scientific domains and tasks as possible and support multimodal and multiscale inputs. As demonstrated by GPT-4, part of the strength of an LLM lies not only in its scale but also in its breadth. Developing a unified scientific foundation model across domains is therefore an important differentiator from previous domain-specific models and should greatly enhance the effectiveness of the unified model.

This unified model offers several unique features compared to the traditional Large Language Model (LLM):

- Supports a variety of inputs, including multimodal data (text, 1D sequences, 2D graphs, 3D structures), periodic and aperiodic molecular systems, and various biomolecules (proteins, DNA, RNA, omics data, etc.).

- Incorporates physical laws and first principles into model building and learning algorithms (e.g., data cleaning and preprocessing, loss function design, optimizer design). This approach recognizes the fundamental difference between the physical world (and its scientific data) and the general AI world (NLP, CV, speech data). Unlike the latter, the physical world is governed by laws, and scientific data represent (noisy) observations of these fundamental laws.

- Leverages the power of existing LLMs such as GPT-4 to make effective use of textual data in the scientific domain, handle open-domain tasks (unseen during training), and provide a user-friendly interface to assist researchers.

Developing a unified, large-scale scientific foundation model with these characteristics would advance the state of the art in scientific research and discovery, enabling natural scientists to tackle complex problems with greater efficiency and precision.

友安昌幸 (Masayuki Tomoyasu): JDLA G certificate 2020#2, E certificate2021#1 Japan Society of Data Scientists, DS Certificate Japan Society for Innovation Fusion, DX Certification Expert Amiko Consulting LLC, CEO