
The Potential And Challenges Of MatSci-LLM And The Use Of Large-Scale Language Models In Materials Science


Large Language Models

3 main points
✔️ Large-scale language models are transforming not only natural language processing but also computer vision, healthcare, law, finance, and many other fields
✔️ In materials science they can accelerate the discovery and analysis of new materials, but they have limitations in numerical problems and code generation and require additional domain-specific training
✔️ MatSci-LLM could enable automated knowledge-base generation and end-to-end automation of materials design, accelerating scientific discovery, but further development and research in related fields are needed

Are LLMs Ready for Real-World Materials Discovery?
written by Santiago Miret, N M Anoop Krishnan
(Submitted on 7 Feb 2024)
Comments: Published on arXiv.
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

The advent of large-scale language models is fundamentally changing the way technology development and research are conducted. These models are having a profound impact not only on natural language processing but also on many related fields, such as computer vision, where images are generated from text (Zhang et al., 2023). As a result, efforts to integrate their capabilities into a variety of industries are accelerating.

Examples include task automation in healthcare (He et al., 2023), law (Dahl et al., 2024), finance (Wu et al., 2023a), and software engineering (Fan et al., 2023).

Notable among these is the application of large-scale language models to materials science. Their use could accelerate the discovery, synthesis, and analysis of new materials, opening up great possibilities for addressing today's complex societal problems, such as climate change, energy security, sustainable agriculture and manufacturing, personalized medical devices, and access to more powerful computing systems.

Recent research has increasingly introduced large-scale language models into diverse areas of chemistry (Jablonka et al., 2023) and biology (Lin et al., 2023; Hsu et al., 2022; Xu et al., 2023; Cui et al., 2023; Dalla-Torre et al., 2023), but their application in materials science remains slow.

This paper analyzes the current challenges of large-scale language models in materials science and organizes and proposes the requirements for a materials science large language model (MatSci-LLM). It also provides a roadmap showing specific applications of MatSci-LLM for advancing the field.

Failure of Large-Scale Language Models in Materials Science

This paper shows the great potential of large-scale language models in materials science, but it is equally important to understand their limitations for practical applications. The paper presents examples of failures of large-scale language models in tasks such as question answering, code generation, named entity extraction, abstract classification, and composition extraction from the materials literature, demonstrating the need to develop a robust MatSci-LLM.

GPT-4 and LLaMA-2, well-known high-performance large-scale language models, are trained on large amounts of public text data. As such, they are assumed to have acquired some materials science knowledge from Wikipedia and other public sources.

Accordingly, Zaki et al. (2024) created a dataset of 650 questions requiring undergraduate-level knowledge and used it to assess the materials science domain knowledge of large language models.

The results show that GPT-4 achieved a 62% correct-response rate using Chain-of-Thought (CoT) prompting but performed poorly on numerical questions, at 39%. This indicates that current large-scale language models are poor at assigning appropriate numerical values, consolidating context, and solving the original question.

They also found that, compared with human performance on the same exam, GPT-4 with CoT beat the other baselines but reached only about 50% of the level of the top-performing humans and failed to meet the exam's passing criteria.

Code generation is one of the tasks at which large-scale language models generally excel, yet on a materials science-related code generation task, GPT-4 was only 71% accurate. Zaki et al. (2024) conducted a detailed analysis of the questions and code generation tasks with low performance. They found that large-scale language models struggle with complex numerical problems, owing to erroneous numerical assignments, unit conversion errors, and constants dropped during unit conversion. They also found that the models are poor at understanding information related to 3D structures, particularly the symmetries of crystal structures and materials, leading to misinterpretations and inaccurate conclusions.
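The unit-conversion failures described here come down to explicit bookkeeping. As a minimal sketch (the helper function is ours, not from the paper), here is a conversion from eV/atom to kJ/mol with every constant named, the kind of step where dropped constants cause the errors above:

```python
# Named physical constants keep every factor of the conversion visible.
EV_TO_J = 1.602176634e-19   # joules per electronvolt (exact, by SI definition)
AVOGADRO = 6.02214076e23    # particles per mole (exact, by SI definition)

def ev_per_atom_to_kj_per_mol(energy_ev):
    """Convert an energy in eV/atom to kJ/mol, step by step."""
    joules_per_atom = energy_ev * EV_TO_J
    joules_per_mol = joules_per_atom * AVOGADRO
    return joules_per_mol / 1000.0

print(ev_per_atom_to_kj_per_mol(1.0))  # ≈ 96.485 kJ/mol
```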

It is clear that current large-scale language models need further refinement for practical applications in materials science. By learning from more domain-specific information and improving inference capabilities, large-scale language models could become a practical tool.

Large-scale Language Model Infrastructure Based on Domain-Specific Languages in Materials Science

The field of materials science requires technical depth and breadth because it is closely tied to various science and engineering disciplines, including physics, chemistry, and biology. For this reason, domain-specific language models are essential to overcome the challenges inherent in materials science. This paper also discusses the importance of domain-specific language models in materials science.

While domain-specific notations such as the IUPAC nomenclature (Hellwich et al., 2020) exist in chemistry, there is no standard notation in materials science. For example, NaAlSi2O8, Na2O.Al2O3.2SiO2, and SiO2-0.5Na2O-0.5Al2O3 all represent the same material in different contexts. In addition, domain-specific names are sometimes used, such as "soda" or "lime" for Na2CO3 or CaCO3. In cement chemistry, C-S-H refers to calcium silicate hydrate, whereas in standard chemical notation those letters denote carbon, sulfur, and hydrogen. Materials science notation is thus diverse, and large-scale language models require domain-specific training to interpret it in the proper context.
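To make the ambiguity concrete, here is a small sketch (the helper functions are our own, not from the paper) that reduces two of the notations above to the same normalized elemental composition:

```python
import re
from collections import Counter
from fractions import Fraction
from math import lcm

def parse_formula(formula):
    """Count elements in a simple formula such as 'Al2O3'."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[elem] += Fraction(num) if num else Fraction(1)
    return counts

def parse_components(notation):
    """Parse dotted or hyphenated oxide notation, e.g. 'Na2O.Al2O3.2SiO2'
    or 'SiO2-0.5Na2O-0.5Al2O3', into total element counts."""
    # Hyphenated components may carry decimal prefactors, so split on '-'
    # when present; otherwise split on the dots between oxide components.
    parts = notation.split("-") if "-" in notation else notation.split(".")
    total = Counter()
    for part in parts:
        m = re.match(r"^(\d+(?:\.\d+)?)?(.+)$", part)
        coeff = Fraction(m.group(1)) if m.group(1) else Fraction(1)
        for elem, n in parse_formula(m.group(2)).items():
            total[elem] += coeff * n
    return total

def normalized(counts):
    """Reduce element counts to the smallest whole-number ratio."""
    smallest = min(counts.values())
    ratios = {e: c / smallest for e, c in counts.items()}
    scale = lcm(*(r.denominator for r in ratios.values()))
    return {e: int(r * scale) for e, r in sorted(ratios.items())}

a = normalized(parse_components("Na2O.Al2O3.2SiO2"))
b = normalized(parse_components("SiO2-0.5Na2O-0.5Al2O3"))
print(a == b, a)  # True {'Al': 1, 'Na': 1, 'O': 4, 'Si': 1}
```

Both strings reduce to the same Na:Al:Si:O ratio, which is exactly the equivalence a model must learn to recognize from context.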

In addition, certain information may be omitted from research papers. For example, information may be provided only as a reference to previous work, such as "Fracture simulations were performed using the methods described in Griffith et al." In the materials science literature, it is common for experimental and simulation protocols, material compositions, and synthesis conditions to be described by pointing to other papers. Large-scale language models therefore need the ability to gather information from multiple sources and properly interpret and explain the context.

In materials science, text is commonly used to represent 3D or 2D structures. Crystal structures are described using Wyckoff positions (Aroyo et al., 2006), and symbols are context dependent: in crystallography, 4mm denotes a point group, while in the general literature "4 mm" is a unit of distance. Crystals are also represented in the CIF (Crystallographic Information File) format, which contains detailed crystal data. However, current large-scale language models are unable to reliably read, interpret, and generate CIF files, which is a major limitation for the discovery of new materials.
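To illustrate what a CIF encodes, here is a toy fragment for cubic silicon (the lattice constant 5.431 Å and space group Fd-3m are standard values; the parsing helper is a naive sketch of ours, not a substitute for a real CIF library):

```python
import re

# Toy CIF fragment for silicon: standard _cell_* data names, illustrative layout.
cif_text = """\
data_Si
_cell_length_a 5.4310
_cell_length_b 5.4310
_cell_length_c 5.4310
_cell_angle_alpha 90.0
_cell_angle_beta 90.0
_cell_angle_gamma 90.0
_symmetry_space_group_name_H-M 'F d -3 m'
"""

def cell_parameters(cif):
    """Pull the six unit-cell parameters out of a CIF-formatted string."""
    params = {}
    for key in ("a", "b", "c"):
        params[key] = float(re.search(rf"_cell_length_{key}\s+([\d.]+)", cif).group(1))
    for key in ("alpha", "beta", "gamma"):
        params[key] = float(re.search(rf"_cell_angle_{key}\s+([\d.]+)", cif).group(1))
    return params

print(cell_parameters(cif_text))
```

A MatSci-LLM would need to read and generate such files natively, not just extract numbers with patterns as this sketch does.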

Moreover, information about materials can be represented in multiple modalities, including text, tables, figures, and videos. While progress has been made in extracting information, especially in tabular form (Gupta et al., 2023; Zhao et al., 2023), challenges remain in how to inject knowledge into large-scale language models based on the extracted data. Material properties are often described in scientific units, and tables and text must be linked to obtain accurate information.
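Linking tables to text means carrying the units out of the header along with each value. A minimal sketch, with hypothetical row data and a helper name of our own choosing:

```python
import re

def parse_header(header):
    """Split a header like 'Density (g/cm3)' into ('Density', 'g/cm3')."""
    m = re.match(r"(.+?)\s*\(([^)]+)\)\s*$", header)
    return (m.group(1), m.group(2)) if m else (header.strip(), None)

# Hypothetical table row: units live only in the headers.
row = {"Material": "SiO2", "Density (g/cm3)": "2.65", "Melting point (K)": "1986"}

structured = {}
for key, value in row.items():
    name, unit = parse_header(key)
    structured[name] = {"value": value, "unit": unit}

print(structured["Density"])  # {'value': '2.65', 'unit': 'g/cm3'}
```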

Material properties are also represented graphically as experimental results such as Raman analysis, X-ray diffraction (XRD), and scanning electron micrographs. For example, to interpret the statement "The XRD pattern in Figure XY(a) indicates that the sample is amorphous," the text and the figure must be understood together. For a large-scale language model to properly learn this information, a large number of images and corresponding text are required.

Other materials science information may be presented in a combination of text, figures, tables, and video, and further study of large-scale language models is required to link this information appropriately. For example, material failure modes, crystal growth, and thermal response may be presented in video. The ability of large-scale language models to integrate and interpret this information is an important future challenge.

Overcoming these challenges will make large-scale language models an even more useful tool in materials science applications.

Construction of a Multimodal Materials Science Corpus

The performance of a language model is highly dependent on the quality of the dataset used for training. Therefore, dataset creation is a key factor in facilitating progress in a variety of deep learning areas, including computer vision, graph learning, and natural language processing. In materials science in particular, the domain-specific variability of text is high, and the development of multimodal language models requires the development of datasets that combine additional modalities such as figures, tables, and images. This allows for more powerful language modeling by representing scientific information in a variety of modalities.

The gold-standard data for training large-scale language models for materials science is contained primarily in peer-reviewed publications from prestigious editorial houses such as Elsevier, Royal Society, American Society, and Springer Nature. However, access to this valuable textual data is difficult because many of these publications charge fees and restrict public access. Generic language models such as GPT-4 and LLaMA are therefore unlikely to have seen these data, contributing to their poor performance on materials science tasks. There is a movement to make scientific text openly accessible through various preprint servers and portals such as Semantic Scholar, but data from these sources still need cleaning and related processing.

Many well-known journals offer paid, subscription-based text and data mining APIs, but machine-readable formats are generally limited to manuscripts published in the 21st century; many earlier publications are available only as PDFs or scanned files and are poorly machine-readable. As a result, scientific data from before the 21st century is rarely available for training large-scale language models. In addition, many peer-reviewed journals do not permit text and data mining and have no framework for it. Data obtained from preprint servers also require extensive cleaning before they can be used for training.

When data come from multiple sources or modalities (tables, text, images, video, code, etc.), each modality requires an appropriate description. For example, a CIF file for silicon needs a detailed description of the information it contains so that a large-scale language model can not only understand the CIF format but also learn how to interpret that information. However, such large-scale annotations are not currently available, and producing reliable ones requires expert input.

It is not easy to properly link data about multiple entities so that they can be read together in the relevant context. For example, the description of a figure or table in a manuscript may span multiple paragraphs and spill into the supplemental material. This calls for learning schemes that respect the dataset's structure and context, as opposed to standard machine learning approaches.

Another challenge in creating datasets from peer-reviewed publications is the use of external references. Manuscripts cite multiple documents to support the current study. Training data must therefore properly account for external references, reduce hallucinations, and support well-reasoned hypotheses. Many large-scale language models hallucinate when asked for references and may generate fictitious citations when writing scientific manuscripts. This demonstrates the need to properly incorporate external references into the training data.

Addressing these challenges will require close collaboration among publishers, government, industry, and academia. New machine learning solutions for MatSci-LLM are also needed. For example, computer vision techniques to convert scanned documents into text that respects the original format, and new methods to process external references and multimodal data are required. Such solutions could have an impact beyond the field of materials science to the digitization of older historical documents in history, law, finance, etc.

Roadmap for MatSci-LLM Applications

The application of MatSci-LLM offers very exciting opportunities in end-to-end automation of materials design. Automated materials design can accelerate the understanding of complex problems in materials science. The figure below outlines an end-to-end materials discovery framework with MatSci-LLM at its core.

MatSci-LLM has the potential for three breakthrough capabilities: first, augmenting materials science knowledge and improving human understanding through automated knowledge-base generation; second, automating in silico materials design through AI-driven materials generation and high-accuracy simulation; and third, enabling automated labs for real-world materials synthesis and characterization.

Recent studies have used large-scale language models to externalize domain-specific knowledge in structured form, thereby extending the availability of scientific knowledge. For example, Cox et al. generated annotations for over 15,000 protein-coding databases, and Buehler externalized knowledge from large-scale language models as a structured knowledge graph. Scientists can then use this knowledge to deepen their understanding and make corrections and adjustments as needed. Such knowledge bases are a valuable resource for engineering applications across diverse areas of materials science and technology.

MatSci-LLM also has great potential as a human-machine interface. By leveraging the ease of natural language and the text understanding and generation abilities of large-scale language models, MatSci-LLM could streamline complex scientific processes. For example, code generation can assist in the discovery of new materials and the execution of simulation workflows. In Buehler's study, a large-scale language model generated a new molecular compound in SMILES notation and queried an agent that performed the relevant calculations, demonstrating end-to-end design of a polymer material. In this way, MatSci-LLM can also serve as a generative model for new materials, complementing current technology while proposing new material solutions.
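The interface idea can be caricatured as a tool-dispatch loop. Everything in this sketch is hypothetical (tool names, functions, keywords); it only shows how a model's textual request might be routed to simulation back ends:

```python
# Hypothetical stand-ins for real simulation back ends.
def run_md(material):
    return f"MD trajectory for {material}"

def simulate_xrd(material):
    return f"simulated XRD pattern for {material}"

# Registry mapping request keywords to tools (illustrative only).
TOOLS = {
    "molecular dynamics": run_md,
    "xrd": simulate_xrd,
}

def dispatch(request, material):
    """Route a natural-language request to the first matching tool."""
    text = request.lower()
    for keyword, tool in TOOLS.items():
        if keyword in text:
            return tool(material)
    raise ValueError(f"no tool matches request: {request!r}")

print(dispatch("Please compute the XRD pattern", "SiO2"))
```

Real agent frameworks add planning, memory, and error handling; the point here is only the text-to-tool routing at the core of the human-machine interface.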

Bringing simulated materials into reality and matching experimental and simulation results is the ultimate goal of end-to-end materials design, and MatSci-LLM is a powerful tool in accelerating the design and execution of experiments. Recent research has leveraged the human-machine interface to discover and synthesize complex material systems. Automated, self-driving materials labs are also being developed. This allows MatSci-LLM to facilitate collaboration between the various machines involved in the execution of experiments, further accelerating the materials development process.

In addition, MatSci-LLM provides an interface through which human scientists can define design requirements in natural language. This allows experimental workflows to be executed by tool-augmented large language models. As the figure illustrates, MatSci-LLM's ability to execute experimental workflows fully automates the materials discovery cycle, allowing desired materials to be created in the real world.

Summary

The cycle illustrated in the figure above has the potential to yield impactful scientific discoveries across a wide range of materials, by uncovering new physical and chemical relationships through end-to-end automation and augmenting them with human knowledge.

However, the use of large-scale language models in materials science described in this paper presents unique challenges, and further research is needed to make MatSci-LLM an effective scientific assistant. Progress at the interface of many fields, such as machine learning, materials simulation, materials synthesis, materials characterization, and robotics, is also essential for the development of useful research. We look forward to future progress in this field.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
