
Biomed-Enriched: Large Biomedical Dataset With LLM Annotation For Clinical And Educational Value

3 main points
✔️ Paragraph-level LLM annotation of PubMed articles enables the extraction of high-quality clinical cases and highly educational text
✔️ Upsampling clinical text and filtering by educational value improve medical QA performance and training efficiency
✔️ The combined strategy BE-All delivers both a performance gain and a reduction in training tokens, and is also effective for multilingual adaptation

Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content
written by Rian Touchent, Nathan Godey, Eric de la Clergerie
(Submitted on 25 Jun 2025)
Comments: Dataset link: this https URL

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Based on the PubMed Central Open Access (PMC-OA) corpus, the authors propose a new biomedical dataset, Biomed-Enriched, built through a staged annotation process with LLMs.

While LLMs generally show high performance on a wide variety of tasks, they lack specialization and terminological accuracy in the medical and biomedical fields.
One reason is that their training data is mainly web-derived, so content from specialized areas is scarce. Clinical data in particular is difficult to release due to privacy restrictions, and non-English data is also scarce.

In this study, the authors first used Llama-3.1-70B-Instruct to annotate 400,000 of the approximately 130 million paragraphs in PMC-OA, and then distilled the labels into XLM-RoBERTa-base to classify the entire corpus.
Each paragraph is assigned a type (research, clinical case, review, etc.), a domain (clinical, biomedical, or other), and an educational value (1-5), enabling the extraction of high-quality clinical cases and multilingual segments.

Experiments show that upsampling clinical text and filtering by educational value improve medical QA performance and training efficiency.

Proposed Methodology

The proposed method, Biomed-Enriched, features precise annotation and data filtering on a paragraph-by-paragraph basis.

In the data collection phase, approximately 4.5 million full-text articles were extracted from PMC-OA, removing non-textual elements and discarding short paragraphs of fewer than 64 tokens.
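As a rough illustration of this filtering step, the sketch below drops short paragraphs by token count; the tokenizer choice and the example inputs are assumptions, since the paper's exact preprocessing is not reproduced here.

```python
# Hypothetical length filter for extracted paragraphs; the paper's
# actual tokenizer and cleaning pipeline may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def keep(paragraph: str, min_tokens: int = 64) -> bool:
    # Discard paragraphs with fewer than 64 tokens.
    return len(tokenizer.encode(paragraph, add_special_tokens=False)) >= min_tokens

paragraphs = ["A 62-year-old woman presented with chest pain ...", "Fig. 2"]
kept = [p for p in paragraphs if keep(p)]
```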

A two-stage annotation was then conducted.
In the first stage, Llama-3.1-70B-Instruct is used to assign a text type (clinical case, research, review, or other), a domain (clinical, biomedical, or other), an educational value (1-5), and a language to 400,000 randomly selected paragraphs.
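As a concrete sketch of this first stage, the call below assumes an OpenAI-compatible endpoint (for example, a local vLLM server) serving Llama-3.1-70B-Instruct; the prompt wording and JSON output schema are illustrative assumptions, not the paper's actual prompt.

```python
# Illustrative first-stage annotation call. The endpoint, prompt, and
# output schema are assumptions; only the label set follows the paper.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

PROMPT = """Classify the following biomedical paragraph.
Return JSON with keys:
- "type": one of "clinical case", "research", "review", "other"
- "domain": one of "clinical", "biomedical", "other"
- "educational_value": an integer from 1 (low) to 5 (high)
- "language": the language of the paragraph

Paragraph:
{paragraph}"""

def annotate(paragraph: str) -> dict:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": PROMPT.format(paragraph=paragraph)}],
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

print(annotate("A 62-year-old woman presented with chest pain ..."))
```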
In the second stage, the resulting annotations are distilled into XLM-RoBERTa-base so that all paragraphs can be classified efficiently. Based on the annotation results, multiple dataset derivatives were constructed: BE-Educational, which retains only paragraphs with an educational value of 3 or higher; BE-Clinical, which upsamples clinical-domain paragraphs by a factor of 10; BE-ClinicalCase, which emphasizes clinical cases; and BE-French, which corrects the multilingual balance.
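A minimal sketch of how two of these derivatives could be assembled from the annotated corpus follows; the record fields are hypothetical names for the labels described above, not the dataset's actual schema.

```python
# Hypothetical construction of BE-Educational (educational value >= 3)
# and BE-Clinical (clinical paragraphs upsampled 10x).
records = [
    {"text": "A 54-year-old man presented with ...", "domain": "clinical",
     "educational_value": 4},
    {"text": "Samples were centrifuged at ...", "domain": "biomedical",
     "educational_value": 2},
]

be_educational = [r for r in records if r["educational_value"] >= 3]

be_clinical = []
for r in records:
    # Repeat clinical-domain paragraphs 10 times; keep the rest once.
    be_clinical.extend([r] * (10 if r["domain"] == "clinical" else 1))
```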

The authors also created "BE-Prefix," which prepends the annotation metadata to each paragraph so that the model learns to associate meta-information with the surrounding text.
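The exact prefix template is not shown in this summary; the sketch below illustrates the idea with a hypothetical format.

```python
# Hypothetical BE-Prefix-style formatting: prepend annotation metadata
# so the model can condition on it during pretraining.
def add_prefix(record: dict) -> str:
    prefix = (f"[type: {record['type']} | domain: {record['domain']} | "
              f"educational value: {record['educational_value']}]")
    return prefix + "\n" + record["text"]

print(add_prefix({
    "type": "clinical case",
    "domain": "clinical",
    "educational_value": 4,
    "text": "A 62-year-old woman presented with chest pain ...",
}))
```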

Experiments

In the evaluation experiments, OLMo2-7B-stage1 was used as the base model and was further trained on 336 billion additional tokens for each Biomed-Enriched-derived dataset.

Comparisons were made against BE-Base (unprocessed PMC-OA) and the various filtered and upsampled variants.
The evaluation benchmarks were the MMLU medical subset, MedQA, MedMCQA, and PubMedQA, as well as FrenchMedMCQA, which measures French adaptation; performance was measured in zero-shot and five-shot settings.
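For readers unfamiliar with this style of evaluation, the sketch below scores a multiple-choice question by the log-likelihood of each answer option under the model and picks the highest-scoring one. The prompt format and checkpoint (the public OLMo-2 release stands in for OLMo2-7B-stage1) are assumptions; the paper's actual harness may differ, and five-shot evaluation simply prepends example Q&A pairs to the prompt.

```python
# Hypothetical zero-shot multiple-choice scoring by answer log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The public OLMo-2 checkpoint stands in for OLMo2-7B-stage1 here.
tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-2-1124-7B", torch_dtype=torch.bfloat16)
model.eval()

def choice_logprob(prompt: str, answer: str) -> float:
    # Sum the log-probabilities of the answer tokens given the prompt.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    # Logits at position i predict the token at position i + 1.
    return sum(logprobs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

question = "Question: Deficiency of which vitamin causes scurvy?\nAnswer:"
choices = [" Vitamin A", " Vitamin B12", " Vitamin C", " Vitamin D"]
print(max(choices, key=lambda c: choice_logprob(question, c)))
```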

The results showed that BE-All, which combines the strategies, performed best with an average score of 61.08%, an improvement of +0.67 points over BE-Base. In particular, clinical upsampling yielded a +4.04-point gain on MMLU Professional Medicine, and the educational-value filter produced stable improvements on MedMCQA and PubMedQA.

In addition, BE-All reached performance comparable to BE-Base with about one-third of the training tokens, confirming its high data efficiency.
Furthermore, BE-French achieved a significant improvement on FrenchMedMCQA, demonstrating the effectiveness of the multilingual adaptation.

