
SciAssess: A Benchmark For Evaluating LLMs' "Analytical Performance Of Scientific Literature"

Large Language Models

3 main points
✔️ Developed the SciAssess benchmark to assess the ability to analyze scientific literature
✔️ Evaluates the memory, comprehension, and analysis/reasoning abilities of key models: GPT-4, GPT-3.5, and Gemini

✔️ Plans to expand the benchmark's scope and introduce multimodal datasets as future improvements

SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis
written by Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Hongshuai Wang, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Yuqi Yin, Yaqi Li, Linfeng Zhang, Guolin Ke
(Submitted on 4 Mar 2024 (v1))
Comments: Published on arxiv.

Subjects: Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Recent advances in large-scale language models, such as Llama, Gemini, and GPT-4, have garnered significant attention due to their superior natural language understanding and generation capabilities. Evaluating these models is important to identify their limitations and potential and to facilitate further technological advances. To this end, a number of specific benchmarks have been proposed to evaluate the diverse skill sets of large-scale language models, allowing them to be tested on increasingly complex tasks.

Meanwhile, large-scale language models are playing an increasingly important role in scientific research. In particular, in the analysis of scientific literature, large-scale language models have been put to practical use in applications such as literature summarization and knowledge extraction, increasing researchers' productivity. However, existing benchmarks fail to address the complex and comprehensive understanding of scientific literature and scenarios dealing with multimodal data. These benchmarks do not adequately replicate the challenges posed by the scientific literature: domain-specific terminology, complex relational reasoning, and the integration of multimodal information. Filling this gap requires the development of advanced benchmarks that accurately reflect the complexity and specificity of scientific literature analysis.

The following three key elements are considered essential for assessing the ability of large-scale language models in scientific literature analysis:

  1. Model Capabilities: Benchmarks help us understand how these capabilities are acquired and enhanced by identifying desired capabilities and modeling their intrinsic relationships.
  2. Scope and Task: The benchmark should encompass a broad range of scientific areas and select tasks that represent typical challenges and scenarios in each area.
  3. Quality Control: The quality of the benchmark data set must be kept high and serve as a reliable basis for deriving accurate and actionable insights. Each data point must be rigorously validated by domain experts to ensure its accuracy and reliability.

Against this background, this paper proposes a new benchmark specifically designed for scientific literature analysis, SciAssess, which covers a wide variety of tasks and question types and aims to provide a more detailed and rigorous assessment of the capabilities of large-scale language models.

SciAssess assesses a model's ability across three progressive levels: memory, comprehension, and analysis and reasoning. This provides fine-grained and informative assessment results that specifically indicate where a model falls short. It also covers a wide range of tasks relevant to different scientific disciplines, including general chemistry, organic electrolytes, alloy materials, drug discovery, and biology. To ensure representative benchmarks, raw data are carefully collected from publicly available scientific publications and specialized databases, ensuring that SciAssess comprehensively reflects the current state of scientific research. The data also undergo rigorous expert cross-validation to ensure accuracy and reliability, and careful screening is performed so that sensitive information is removed or anonymized, protecting privacy and security. This maintains SciAssess' legal and ethical integrity.

SciAssess aims to reveal the performance of large-scale language models in the area of scientific literature analysis and to identify their strengths and weaknesses. The insights gained from SciAssess are expected to further improve the ability of large-scale language models to analyze scientific literature and ultimately contribute to accelerating scientific discovery and innovation.

Benchmark Data Set

In developing criteria for evaluating large-scale language models in science, this paper carefully designs three elements: model capability, scope and task, and quality control. Drawing on the widely recognized Bloom's taxonomy, the authors developed a benchmark specifically designed for the analysis of scientific literature, called "SciAssess". The assessment covers three key competencies:

  • Memory (L1): refers to the extensive knowledge base of the model and the ability to accurately answer questions about general facts in science
  • Comprehension (L2): Ability to accurately identify and comprehend key information and facts within a given text
  • Analysis and Reasoning (L3): Advanced ability to integrate extracted information with the existing knowledge base, and to use logical reasoning and analysis to draw firm conclusions and predictions

The benchmark covers a variety of scientific disciplines, as shown in the table below. In addition, the following five question formats are designed to evaluate the model: True/False Questions, Choice Questions, Table Extraction, Constrained Generation, and Free Response Generation. Details and specific examples of these question formats are provided below.

General Chemistry

The General Chemistry assessment set is a comprehensive set of tasks designed to assess chemistry-related skills in large language models, ranging from basic knowledge to applied problem solving and research analysis. The set includes five different tasks, each targeting a different aspect of chemistry and academic understanding. Together, these tasks give a complete picture of how well large-scale language models handle academic research in chemistry and the practical application of its principles. All test data are collected from the OpenAI evals repository.

MMLU (Massive Multitask Language Understanding) is a benchmark for measuring model knowledge by assessing knowledge acquired during pretraining in zero-shot and few-shot settings. This makes the benchmark more challenging and similar to the way humans are assessed. Out of its 57 subjects, high school chemistry and college chemistry are selected to assess knowledge recall. For example prompts and responses, see below.
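The exact prompt format used in the paper is shown there as an image. As a rough, hypothetical illustration of how a multiple-choice chemistry question can be posed to a model and scored by exact letter match, consider the sketch below; the question text, option labels, and the `query_model` helper are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch of an MMLU-style multiple-choice evaluation.
# The question, options, and query_model() are illustrative assumptions.

def build_prompt(question: str, options: list[str]) -> str:
    letters = "ABCD"
    lines = ["The following is a multiple-choice question about chemistry.", question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)

def score(prediction: str, gold_letter: str) -> int:
    # Exact-match accuracy on the first letter the model outputs.
    pred = prediction.strip().upper()[:1]
    return int(pred == gold_letter)

prompt = build_prompt(
    "Which species is the conjugate base of HSO4-?",
    ["H2SO4", "SO4^2-", "H3O+", "HSO3-"],
)
# answer = query_model(prompt)              # model call, assumed helper
# total_correct += score(answer, gold_letter="B")
```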

Abstract2Title tests a model's ability to use the abstract of a paper to generate an appropriate title. Large-scale language models must understand the abstract and paraphrase it concisely. The conciseness of the generated title is rated by GPT-4.

Question Extraction aims to assess the ability of large-scale language models to identify, extract, and summarize key research questions from scientific article abstracts. This task requires that large-scale language models deeply understand the content of the abstract and concisely summarize information including background, objectives, methods, results, and conclusions. It tests the model's ability to understand complex and specialized language, identify key focal points within extensive and detailed information, and summarize and reconstruct academic content.

This requires deeper analysis to identify the problem, hypothesis, or issue that the research seeks to solve, rather than just superficial processing of the text. This task is particularly important in assessing the utility of large-scale language models in academic and research settings. Efficiently understanding and extracting the main points of scholarly articles can help in literature reviews, developing research proposals, and identifying trends and gaps in research. This streamlines the process of tackling the vast and ever-growing scientific literature and highlights the potential of large-scale language models to assist researchers, scholars, and students. Responses are rated on a scale of 1 to 5 by GPT-4, similar to the Abstract2Title task.
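The paper uses GPT-4 as the grader for these free-response tasks; the exact grading prompt is shown in the paper as an image. A minimal sketch of such an LLM-as-judge setup, assuming the current `openai` Python client and a hypothetical grading instruction, might look like this:

```python
# Minimal LLM-as-judge sketch: GPT-4 rates a generated title on a 1-5 scale.
# The grading instruction below is a hypothetical stand-in for the paper's actual prompt.
import re
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge(abstract: str, generated_title: str) -> int:
    instruction = (
        "You are grading a paper title generated from an abstract.\n"
        "Rate how concise and faithful the title is on a scale of 1 to 5.\n"
        "Reply with the number only.\n\n"
        f"Abstract:\n{abstract}\n\nGenerated title:\n{generated_title}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    match = re.search(r"[1-5]", text)  # extract the first rating digit
    return int(match.group()) if match else 1
```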

Balancing Equations is designed to assess the ability of large language models to understand and apply stoichiometry and the laws of conservation of mass and energy. Balancing chemical equations involves adjusting the coefficients of reactants and products so that the number of atoms of each element is equal on both sides of the reaction equation, reflecting the conservation of matter.

This task not only tests the ability of the large-scale language model to interpret and understand the symbolic language of chemistry, but also assesses problem-solving skills and expertise-based abilities. To balance a chemical reaction equation, the large-scale language model must identify reactants and products, understand the stoichiometric relationships between them, and apply mathematical reasoning to find coefficients that balance the equation.
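As a rough illustration of what such an evaluation has to verify, the sketch below parses simple chemical formulas and checks that a proposed set of coefficients conserves every element. The parser is deliberately simplified (no nested parentheses) and is an assumption for illustration, not the paper's scoring code.

```python
# Simplified checker for a balanced chemical equation (no nested parentheses).
import re
from collections import Counter

def count_atoms(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C3H8' or 'H2O'."""
    counts = Counter()
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(num) if num else 1
    return counts

def is_balanced(reactants, products) -> bool:
    """reactants/products: lists of (coefficient, formula) pairs."""
    left, right = Counter(), Counter()
    for coeff, formula in reactants:
        for el, n in count_atoms(formula).items():
            left[el] += coeff * n
    for coeff, formula in products:
        for el, n in count_atoms(formula).items():
            right[el] += coeff * n
    return left == right

# C3H8 + 5 O2 -> 3 CO2 + 4 H2O
print(is_balanced([(1, "C3H8"), (5, "O2")], [(3, "CO2"), (4, "H2O")]))  # True
```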

Alloy Materials

Alloy materials are mixtures of two or more metallic elements in certain proportions that possess metallic properties. Alloys are widely used in many fields, including aerospace, automotive manufacturing, construction, and electrical products. Specific properties and requirements can be achieved by adjusting the composition and manufacturing process. Therefore, extracting alloy composition and process values from the literature is very important in alloy design.

The paper also investigates the ability of large-scale language models to extract the information needed for alloy design, designing a comprehensive set of literature-based tasks: alloy composition extraction, process value extraction, process sequence determination, and sample identification. Standard answers for all tasks are manually extracted from literature in different journals and verified by multiple reviewers.

By extracting and structuring alloy composition information from article text and tables, researchers can more effectively use historical data and obtain useful guidance for subsequent design. This task evaluates the ability of large-scale language models to extract alloy compositions (the content of every element) from text and tables. The composition usually appears in one of two places: first, when the elemental content is listed in a table (see table below), and second, when the elemental content is encoded in the alloy name. For example, "Fe30Co20Ni50" indicates an atomic ratio of 30% Fe, 20% Co, and 50% Ni. The goal of this task is to comprehensively extract this information, organize the results into a table, and calculate the agreement score between the standard answer table and the extracted result table. This demonstrates the large-scale language model's ability to integrate, extract, and structure multimodal information.
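As an illustration of the second case (composition encoded in the alloy name), the sketch below parses a name such as "Fe30Co20Ni50" into atomic percentages and computes a simple per-element agreement score against a reference composition. Both the parser and the scoring rule are assumptions for illustration, not the paper's metric.

```python
# Parse an alloy name like "Fe30Co20Ni50" into atomic percentages and
# compare an extracted composition against a reference one.
import re

def parse_alloy_name(name: str) -> dict[str, float]:
    composition = {}
    for element, value in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)", name):
        composition[element] = float(value)
    return composition

def agreement(extracted: dict[str, float], reference: dict[str, float],
              tol: float = 0.5) -> float:
    """Fraction of reference elements whose extracted content is within tol at.%."""
    hits = sum(
        1 for el, ref in reference.items()
        if el in extracted and abs(extracted[el] - ref) <= tol
    )
    return hits / len(reference) if reference else 0.0

ref = {"Fe": 30.0, "Co": 20.0, "Ni": 50.0}
pred = parse_alloy_name("Fe30Co20Ni50")
print(pred)                   # {'Fe': 30.0, 'Co': 20.0, 'Ni': 50.0}
print(agreement(pred, ref))   # 1.0
```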

The properties of an alloy are also determined by its composition and the fabrication process (e.g., processing and heat treatment). In particular, the extraction of heat treatment temperatures is very important. The purpose of this task is to identify the maximum temperature value for the heat treatment of an alloy. To ensure accurate statistical analysis, the prompts are designed in the form of multiple-choice questions. Below is a sample.

Alloy processing requires a clear sequence for each process, so it is important to ensure that the order of the extracted heat treatment processes matches the order used in the experiment. For example, a sample may be further aged after solution treatment to release internal stresses. In this task, the sequential relationship between two heat treatments is analyzed and judged as correct or incorrect; if a specific heat treatment name is not present in the paper, the answer is considered False. This task assesses the large-scale language model's ability to determine processing order from text. The prompt to the model is as follows.

Organic Materials

Organic materials are made from carbon-based molecules and polymers, and their diverse functions favor a wide range of applications. Unlike inorganic materials, their properties are easily modified and adaptable, making them important in fields such as electronics, photonics, sensors, and energy. The vast possibilities of organic chemistry are exploited to promote technological progress.

Here we focus on two subfields of organic functional materials: organic electrolytes and polymer materials. For polymer materials, we evaluate how effectively a large-scale language model extracts key properties associated with polymer materials from the scientific literature. In particular, we design two tasks, one textual and one tabular, using the application of conjugated polymers in organic solar cells as a case study. This allows us to evaluate the model's ability to recognize and identify information about these materials across a variety of tasks.

Organic electrolytes are widely used, especially in lithium-ion batteries. They contain organic solvents, lithium salts, and additives as needed to facilitate ion transfer within the battery, allowing energy to be stored and released. Understanding the solubility of organic electrolytes is critical because it directly affects the efficiency of the electrolytic process, product selectivity, and equipment design. This task investigates the LLM's ability to obtain solubility-related tables. Papers on electrolytes typically report data from various aspects of the system, so integrating multiple tables into an appropriate format is very difficult. The emphasis is therefore on evaluating the model's ability to understand meaning, select the most relevant and largest table related to "solubility" from a large number of choices, and convert it to the specified format. The prompt to the model is as follows.
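A minimal sketch of this selection step, assuming the tables have already been extracted into simple Python structures, might filter on a keyword and keep the largest match. The data structure, keyword rule, and output format are assumptions, not the paper's implementation.

```python
# Pick the largest extracted table whose caption or header row mentions "solubility",
# then emit it in a fixed CSV-like format. Input structure is an assumption:
# each table is a dict with 'caption' (str) and 'rows' (list of lists, header first).
def select_solubility_table(tables):
    candidates = [
        t for t in tables
        if "solubility" in (t["caption"] + " " + " ".join(t["rows"][0])).lower()
    ]
    if not candidates:
        return None
    # "Largest" here means most cells (rows x columns).
    return max(candidates, key=lambda t: len(t["rows"]) * len(t["rows"][0]))

def to_csv(table) -> str:
    return "\n".join(",".join(str(cell) for cell in row) for row in table["rows"])
```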

The composition and properties of organic electrolytes are critical to battery performance, stability, and safety. To further evaluate the model's ability to acquire electrolyte-related information, multiple-choice questions are asked about the physical and chemical properties of the solution system's composition and dissolution reactions. These questions are based on the information presented in the tables in the papers. The prompt to the model is as follows.

Important values such as power conversion efficiency (PCE), open-circuit voltage (VOC), and other electronic characteristics are extracted from the literature. These properties are typically reported in tabular form. Extracting them with large-scale language models demonstrates great potential for the AI community in polymer modeling, for example computer-assisted screening, targeted design, and optimization. The source data are collected from journals such as Nature Communications, Advanced Materials, Nature Photonics, J. Phys. Chem., and Appl. Phys. Lett. The prompt to the model is as follows.

Drug Discovery

This paper also examines the capabilities of large-scale language models in the field of drug discovery. It designs comprehensive tasks related to patent and literature research, focusing on affinity data extraction and patent coverage.

The Affinity Data Extraction task assesses the ability of large-scale language models to extract affinity tables (including molecule tags, SMILES representations, and affinities for different targets). This evaluation tests the ability of large-scale language models to understand complex, domain-specific language, molecules, and tables. Extracting affinity data requires not only superficial processing of text but also deep analysis to match different modalities. As a specific example, the output is shown in the table below.
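One practical detail when scoring such extractions is that the same molecule can be written as many different SMILES strings. The sketch below, assuming RDKit is installed (the paper does not specify its scoring tooling), canonicalizes both the extracted and reference SMILES before comparing them:

```python
# Canonicalize SMILES with RDKit so that extracted and reference molecules
# can be compared regardless of how the SMILES string was written.
from rdkit import Chem

def canonical(smiles: str) -> str | None:
    mol = Chem.MolFromSmiles(smiles)   # returns None if the SMILES is invalid
    return Chem.MolToSmiles(mol) if mol else None

def same_molecule(extracted: str, reference: str) -> bool:
    a, b = canonical(extracted), canonical(reference)
    return a is not None and a == b

print(same_molecule("C(C)O", "CCO"))  # True: both are ethanol
```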

The dataset was carefully selected from PubChem BioAssays to encompass literature from a variety of journals and publication years. Because the original datasets are organized by bioassay number, the source data were merged based on DOI and a subset was carefully sampled. These papers cover a wide range of protein targets and cell lines, with tables presented in a variety of formats.

The molecular determination task evaluates the model's ability to determine whether a molecule (represented in SMILES) is covered in a document. Large-scale language models must recognize all Markush structural formulas and their substituents to determine whether the required molecule is covered.

Biology

MedMCQA is a task designed to assess the ability to understand and reason about medical multiple-choice questions. It consists of clinically relevant questions and knowledge assessments and is intended to measure the ability of an artificial intelligence system. For example, the following prompt is entered into the model.

In order to ensure the quality and ethical standards of the datasets, the following rigorous procedures are taken:

  • Expert validation: To ensure the accuracy and reliability of SciAssess, multiple cross-validations are performed by experts on all tasks. This ensures that the labels on the datasets are accurate and high quality standards are maintained.
  • Screening and Anonymization: SciAssess undergoes thorough screening of sensitive information and all potentially sensitive data identified is deleted or anonymized. This ensures privacy protection and data security.
  • Copyright Compliance: Stringent copyright review procedures are in place for all documents and data to ensure that SciAssess does not infringe intellectual property rights and complies with legal standards and ethical codes.

These procedures ensure data quality, privacy protection, and legal compliance.

Experiment

Three major large-scale language models were evaluated for their ability to analyze scientific literature. The first is GPT-4. OpenAI's GPT-4 excels in text generation and comprehension, and has enhanced capabilities for image processing, code interpretation, and information retrieval, positioning it as a versatile tool that can handle the complexity of scientific texts. The latest version of GPT-4 allows answers to be produced with the help of a code interpreter, and Chain of Thought (CoT) prompting is used to extract the final results. The CoT prompt is as follows.
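The exact CoT prompt is shown in the paper as an image. A rough, hypothetical sketch of the general pattern (ask the model to reason step by step, then extract the final answer from a fixed marker) is given below; the marker text and the `query_model` helper are assumptions, not the paper's wording.

```python
# Hypothetical Chain-of-Thought (CoT) prompting sketch: ask the model to reason
# step by step and end with a machine-readable final answer line.
import re

COT_SUFFIX = (
    "\n\nLet's think step by step. "
    "After your reasoning, give the final result on a new line as 'Final answer: <answer>'."
)

def extract_final_answer(response: str) -> str | None:
    match = re.search(r"Final answer:\s*(.+)", response)
    return match.group(1).strip() if match else None

question = "How many moles of O2 are needed to burn 1 mole of C3H8 completely?"
prompt = question + COT_SUFFIX
# response = query_model(prompt)           # model call, assumed helper
# print(extract_final_answer(response))    # e.g. "5"
```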

Second, OpenAI's GPT-3.5, which precedes GPT-4, stands out for its advanced language processing capabilities and can deal effectively with complex text. Third, Gemini, Google DeepMind's family of models, excels in multimodal understanding that integrates text, code, image, and audio analysis. Its performance on MMLU is particularly noteworthy, with Gemini-1.0-Ultra outperforming the human benchmark. However, because Gemini-1.0-Ultra is not yet available via API, Gemini-1.0-Pro is evaluated here. This model excels in understanding and synthesizing scientific literature and is an advanced tool in academic research, providing insight and increasing productivity in the analysis of scientific literature.

SciAssess is based on an improved version of the framework provided by openai/evals (https://github.com/openai/evals). The paper also incorporates additional features such as model calls (e.g., Gemini), custom tasks and metrics, datasets, and PDF processing modules, with detailed code to be released soon.

The main part of SciAssess focuses on academic literature and uses different methods for processing literature PDFs.

  • GPT-4: Utilizing the web-based ChatGPT4 interface, the original PDF file is uploaded directly to the chat interface to ask questions and take advantage of OpenAI's built-in PDF processing capabilities.
  • GPT-3.5: Use PyPDF2 to convert PDF to text, then enter plain text into the model.
  • Gemini: Because of its superior ability to process text and images simultaneously, it first uses PyPDF2 to extract text from the PDF, then uses PyMuPDF to retrieve the images in the document, arranges them in reading order, and sends both text and images to the model.
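A minimal sketch of the Gemini-style preprocessing described above, using PyPDF2 for text and PyMuPDF for embedded images, is shown below; the exact ordering and prompt assembly used in the paper may differ.

```python
# Extract plain text with PyPDF2 and embedded images with PyMuPDF,
# roughly mirroring the PDF preprocessing described above.
from PyPDF2 import PdfReader
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_images(pdf_path: str) -> list[bytes]:
    images = []
    doc = fitz.open(pdf_path)
    for page in doc:                          # pages iterate in reading order
        for img in page.get_images(full=True):
            xref = img[0]
            images.append(doc.extract_image(xref)["image"])
    return images

# text = extract_text("paper.pdf")
# figures = extract_images("paper.pdf")
# Both text and figures would then be sent to the multimodal model.
```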

Here we focus on the memory, comprehension, and analysis/reasoning performance of the large-scale language models across a variety of scientific fields, in tasks both with and without multimodal content.

A comparison of the overall performance of the large-scale language models in the various scientific disciplines summarized in the table below reveals clear strengths and weaknesses of each model.

GPT-4 consistently outperforms the other models in almost all areas, earning the highest overall average ranking and showing excellent adaptability in understanding complex scientific literature. GPT-3.5 lags behind GPT-4 but shows competence across a wide range of tasks, indicating its robustness. Gemini is third in the overall ranking but shows its strength in specific tasks.

Across the scientific disciplines, GPT-4 performed well in nearly all domains, and in Biology it was rated on par with Gemini. This highlights GPT-4's superior ability to understand scientific literature and its high adaptability. Gemini, while ranked third overall, performed as well as GPT-4 in Biology, indicating its potential strength in certain domains.

In the area of drug discovery, all models scored near zero on the "Tag2Molecule" task, indicating that all models are limited in their ability to handle highly specialized chemical content and complex molecular structural transformations. These findings reveal the strengths and limitations of each model within a particular scientific discipline and provide valuable insight for future model improvements.

Memory (L1) indicates the model's ability to recall previously learned information. In this respect, GPT-4 shows the highest average ranking, proving its superiority. For example, in the "MMLU High School Chemistry" task, GPT-4 demonstrated accurate recall of basic chemistry knowledge, leading the other models with an accuracy score of 0.591. This advantage of GPT-4 may be attributed to its ability to cover more scientific knowledge areas due to its extensive training data set.

Comprehension (L2) measures a model's ability to comprehend complex text and extract important information, and GPT-4 leads in Comprehension, demonstrating outstanding performance on multiple tasks. For example, on the "Abstract2Title" task, GPT-4 ranks at the top with a model rating score of 0.99. This demonstrates a deep understanding of text content and the ability to accurately generate relevant titles.

Analysis and Reasoning (L3) refers to the ability of the model to process complex problems, reason, and generate solutions; GPT-4 has a slight lead in this ability, showing an average ranking of 1.75. This indicates a strong ability to apply knowledge, analyze situations, and draw conclusions. For example, on the "Sample Differentiation" task, GPT-4 achieves an accuracy score of 0.528, well above GPT-3.5 (0.177) and Gemini (0.059).

Summary

SciAssess is designed to rigorously assess the ability of large-scale language models to analyze the scientific literature. The benchmark evaluates the memory, comprehension, and analytical power of large-scale language models in specific scientific fields such as general chemistry, alloy materials, organic materials, drug discovery, and biology. Major models such as GPT-4, GPT-3.5, and Gemini were evaluated in detail to identify each model's strengths and areas in need of improvement. This research provides strong support for the development of large-scale language models in the field of scientific research.

The authors state that in the future they aim to significantly improve the usefulness and effectiveness of the benchmark by further expanding the scientific areas it covers and incorporating more complex multimodal datasets. This is expected to facilitate the use of large-scale language models and provide clear guidelines that will contribute to further scientific research and innovation.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
