Catch up on the latest AI articles

A New Approach To Improving The Performance Of Biomedical NER With Large-Scale Language Models

Large Language Models

3 main points
✔️ Improving the performance of large-scale language models in biomedical NER
✔️ Boosting performance by devising prompting strategies and incorporating definitional knowledge
✔️ Significantly improving accuracy on the NER task, evaluated on an extensive set of biomedical datasets

On-the-fly Definition Augmentation of LLMs for Biomedical NER
written by Monica Munnangi, Sergey Feldman, Byron C Wallace, Silvio Amir, Tom Hope, Aakanksha Naik
(Submitted on 29 Mar 2024)
Comments: To appear at NAACL 2024
Subjects: Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Large-scale language models perform well on zero-shot and few-shot tasks, but there is still room for improvement in named entity recognition (NER) on biomedical text. For example, a study by Gutiérrez et al. (2022) showed that GPT-3 with in-context learning was inferior to smaller fine-tuned models, even when using the same amount of data. Biomedical texts are full of technical terminology and require specialized knowledge to interpret; at the same time, annotation is costly, time-consuming, and difficult, so the availability of labeled data is limited.

Against this backdrop, this paper seeks to improve the performance of large-scale language models with a novel, biomedical-specific approach to knowledge augmentation. The approach focuses on dynamically incorporating definitions of relevant biomedical concepts, allowing the model to correct entity extraction errors during inference. Two prompting strategies, single-turn and iterative, are tested, and definition augmentation is observed to improve performance across a variety of models; GPT-4, for example, improves by an average of 15%.

We also compare definitions from human-curated sources against definitions automatically generated by large-scale language models, and find that the human-curated information yields larger performance gains. These results open a new discussion about how definitional knowledge can help improve the performance of large-scale language models across tasks and domains with limited data.

The figure below outlines the approach with a zero-shot example: an incorrect extraction (red) is corrected (green) once the definition of the extracted entity (yellow) is provided.

Experiment Summary

Several models were used in the experiments, including closed models accessible via APIs (OpenAI's GPT-3.5 and GPT-4, Anthropic's Claude 2) and the open-source Llama 2. Google's PaLM, on the other hand, was excluded because it did not perform well enough in early testing. Note that the evaluation is based on entity-level F1 score.
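
As a rough illustration of entity-level F1, a prediction is counted as correct only when both its span and its entity type match a gold annotation. The snippet below is a minimal sketch under that assumption (the paper's exact matching and aggregation rules may differ).

def entity_f1(gold: set, pred: set) -> float:
    # Entities are represented as (start, end, type) tuples; a true positive
    # requires an exact match on both span and type.
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Example: one correct entity and one boundary error give F1 = 0.5.
gold = {(0, 7, "Chemical"), (20, 26, "Disease")}
pred = {(0, 7, "Chemical"), (20, 30, "Disease")}
print(entity_f1(gold, pred))  # 0.5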

The datasets used in the experiments were selected from the BigBIO benchmark, which draws on a wide range of biomedical disciplines. The benchmark contains more than 100 datasets covering 12 task types and more than 10 languages; NER is a major task category in BigBIO, with 76 datasets. We first exclude clinical and non-English datasets, then select representative datasets for each entity type, narrowing the selection down to 16 datasets that contain particularly interesting information extraction phenomena or challenging cases. These selected datasets are included in the most common biomedical benchmarks and form an ideal foundation for new insights into how large-scale language models can meet the specific demands of the biomedical field.

Experimental Results

First, we examine the performance of large language models on the NER task in both zero-shot and few-shot settings. In addition, we also report the performance of a smaller, fine-tuned model (Flan-T5 XL).

The zero-shot evaluation focuses on two elements: the input format and the output format. The input format defines how the task description and expected categories are provided to the model, while the output format controls how the model structures its results.

There are two approaches to the input format: Text and Schema Def. Text uses a standard prompt that includes a brief description of the task and a list of valid target entity types, while Schema Def uses prompts with additional detailed descriptions of all target entity types, following previous research.

For the output format, we explore two structured formats: JSON and code snippets. JSON structures the data and facilitates post-processing and evaluation, while code snippets represent the results as concrete programming constructs. Both formats have been shown to improve zero-shot information extraction (IE) performance. Using these settings, we evaluate all models except GPT-4.
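
To make these settings concrete, the sketch below shows illustrative prompt templates for the two input formats (Text and Schema Def) and the two output formats (JSON and code). The wording and the Disease/Chemical schema are hypothetical stand-ins, not the paper's exact prompts.

ENTITY_TYPES = ["Disease", "Chemical"]

# Text input: brief task description plus the list of valid entity types.
TEXT_PROMPT = (
    "Extract all biomedical entities of the following types from the text: "
    + ", ".join(ENTITY_TYPES)
    + ".\nText: {passage}"
)

# Schema Def input: adds a detailed description of every target entity type.
SCHEMA_DEF_PROMPT = (
    "Extract all biomedical entities from the text.\n"
    "Entity types:\n"
    "- Disease: any abnormal condition, disorder, or syndrome.\n"
    "- Chemical: any drug, chemical compound, or other substance.\n"
    "Text: {passage}"
)

# Output formats: ask for JSON or for a code-style answer.
JSON_SUFFIX = '\nReturn the result as JSON, e.g. {"Disease": [...], "Chemical": [...]}.'
CODE_SUFFIX = '\nReturn the result as Python lists, e.g. disease = [...]; chemical = [...].'

# Combining one input format with one output format yields the four zero-shot settings.
prompt = TEXT_PROMPT.format(passage="Aspirin can trigger asthma in some patients.") + JSON_SUFFIX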

The few-shot evaluation then takes the input/output format that performed best in the zero-shot setting and validates it on a specific dataset (e.g., CDR). Finally, we evaluate the performance of small models fine-tuned on each dataset.

The table below shows the zero-shot results for GPT-3.5, Claude 2, and Llama 2 on all datasets, broken down by text input with JSON output, text input with code output, definition input with JSON output, and definition input with code output.

For all models and datasets, we found that prompts with added schema definitions reduced performance. For output formats, we found that JSON was preferred for most datasets, with the exception of PICO and CHIA. This observation is consistent across all models.


As expected, few-shot performance tends to improve as the number of shots increases (see table below). Finally, we find that few-shot prompting of large language models significantly outperforms small language models fine-tuned on the same five instances.

Next, we conduct experiments on augmenting prompts with definitions. In-context learning leverages the knowledge a large-scale language model has acquired during pre-training, but this knowledge is sometimes incorrect or missing. To address this problem, methods have been proposed that augment prompts on the fly with relevant factual knowledge to improve accuracy on language understanding tasks.

In particular, dynamically adding definitions of the biomedical concepts appearing in the text to the NER prompts is expected to improve model performance. In the biomedical field, it is important to provide specific information at test time to compensate for areas in which large language models generally perform poorly.

In this experiment, we first build a knowledge base of biomedical concept definitions and map the concepts in the text to it using a commercial entity linker. We then perform inference with augmented prompts: after the initial entity extraction, we ask the model to revise its output using prompts containing the concept definitions. At this stage, entities may be added, deleted, or reassigned to a different type.
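
A minimal sketch of this revision step is shown below, assuming the concept definitions have already been retrieved upstream (e.g. by linking the passage to UMLS with an entity linker). The prompt wording and the fake_llm stand-in are hypothetical, not the paper's actual prompts or components.

import json

def revise_with_definitions(llm, passage, initial_extraction, definitions):
    # `definitions` maps concept names (linked upstream) to definition strings.
    prompt = (
        f"Text: {passage}\n"
        f"Your previous extraction: {json.dumps(initial_extraction)}\n"
        "Definitions of concepts mentioned in the text:\n"
        + "\n".join(f"- {name}: {d}" for name, d in definitions.items())
        + "\nRevise the extraction using these definitions. Entities may be "
          "added, deleted, or assigned a different type. Return JSON."
    )
    return json.loads(llm(prompt))

# Stand-in model call for demonstration only.
fake_llm = lambda p: '{"Chemical": ["aspirin"], "Disease": ["asthma"]}'
revised = revise_with_definitions(
    fake_llm,
    passage="Aspirin can trigger asthma in some patients.",
    initial_extraction={"Chemical": ["aspirin"]},
    definitions={"Asthma": "A chronic respiratory disorder marked by airway inflammation."},
)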

Concept definitions are taken from the Unified Medical Language System (UMLS), but not all concepts are useful: concepts belonging to overly broad categories are excluded, and the focus is placed on more specific categories.

In the zero-shot setting, entities are revised with prompts augmented with a single definition at a time. The few-shot setting attempts more elaborate revisions with prompts containing multiple examples as well as concept definitions, but to avoid increased costs it uses a single-turn approach rather than iterating over the definitions, as sketched below.
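
The difference between the two strategies can be sketched as follows: single-turn packs all available definitions into one revision prompt, while iterative prompting supplies one definition per turn and feeds each revised output into the next prompt. The revise argument below stands for one definition-augmented revision call (e.g. revise_with_definitions above with the model and passage fixed); it is a hypothetical helper, not the paper's implementation.

def single_turn(revise, extraction, definitions):
    # One revision call containing every concept definition at once.
    return revise(extraction, definitions)

def iterative(revise, extraction, definitions):
    # One definition per call; each turn's output becomes the next turn's input.
    for name, definition in definitions.items():
        extraction = revise(extraction, {name: definition})
    return extraction

Iterative prompting issues one model call per definition, which is why the few-shot setting here falls back to the cheaper single-turn variant.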

This approach relies on the model's ability to revise its own output and aims to extract more accurate information through self-verification. We explore whether providing contextual knowledge can support this self-verification process and improve the accuracy of clinical information extraction.

For consistency, all experiments use JSON output and are run with a uniform setup across all datasets. Of particular note is the few-shot setting, where five randomly selected examples are used for each test instance; each experiment is run with three different random seeds, and the average performance is reported.
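
A minimal sketch of this protocol is given below; run_model is a hypothetical callable that prompts the model with the sampled demonstrations and returns an entity-level F1 for the instance, and the per-instance averaging is an illustrative simplification of how scores are aggregated.

import random
import statistics

def evaluate_few_shot(run_model, train_pool, test_set, seeds=(0, 1, 2), k=5):
    per_seed = []
    for seed in seeds:
        rng = random.Random(seed)
        scores = []
        for instance in test_set:
            demos = rng.sample(train_pool, k)        # five random in-context examples
            scores.append(run_model(demos, instance))
        per_seed.append(statistics.mean(scores))
    return statistics.mean(per_seed)                 # average over the three seeds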

The experiment also included an evaluation of GPT-4; given the high API cost, the test set was subsampled to 100 instances.

The two tables below show the performance of GPT-3.5, Claude 2, Llama 2, and GPT-4 with definition augmentation on all datasets, in the zero-shot and few-shot settings respectively. In the zero-shot setting, Llama 2 and GPT-4 consistently achieve significant performance gains with both the single-turn and iterative prompting strategies. Conversely, Claude 2 and GPT-3.5 improved only with iterative prompting, with average performance gains of 12% and 29.5%, respectively.


In the few-shot setting, Claude 2 and GPT-4 improved on five of the six datasets, while Llama 2 and GPT-3.5 improved on three and four datasets, respectively. Overall, GPT-4 with iterative prompting showed the best performance. These results confirm that augmenting prompts with concept definitions improves NER performance.


We also test whether the entity linker alone accounts for much of the observed gains. The entity linker by itself achieves an average F1 of only 1.05 on the same test sets, indicating that it does not. The results in the table below further show that adding candidate entities without their concept definitions has limited effect, and in some cases performance is worse than the zero-shot baseline.

Summary

This paper extensively evaluates the effectiveness of in-context learning with large-scale language models, focusing on named entity recognition (NER) in the biomedical field. We compared different input and output formats and identified the main error types these models commit. We also propose and validate a novel method for rapidly adapting general large-scale language models to biomedical NER tasks by dynamically providing concept definitions from an external knowledge base.

The process uses a series of prompts that allow the model to revise its predictions, drawing on definitions of key concepts to increase accuracy. The model is first asked to extract entities; definitions of the relevant biomedical concepts are then added to prompt it to revise its predictions.

Evaluations across the six datasets show consistent and significant improvements over the baselines, especially in the zero-shot setting. Ablation studies indicate that the model's ability to leverage concept definitions is the key driver of improvement, and that without these definitions no meaningful improvement in predictions is achieved.

Although we considered only datasets from the specialized domain of biomedicine, the approach can also be applied to more general knowledge bases such as Wikidata. This suggests potential benefits in other domains, and further applications of the approach are expected in future research.
