A New Approach To The Medical Field, BioMistral 7B, A Large-scale Language Model Specialized For The Medical Field

Large Language Models 24/05/2024

3 main points
✔️ BioMistral 7B, a new large-scale language model specialized for the medical field, was developed by further pre-training Mistral on PubMed Central.
✔️ The model's generalizability in the medical field was evaluated on benchmarks translated into seven other languages, in addition to an evaluation in English.
✔️ The authors plan a human evaluation of BioMistral 7B's generation quality, and plan to enhance its multilingual and chat capabilities with supervised fine-tuning and direct preference optimization.

BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
written by Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, Richard Dufour
(Submitted on 15 Feb 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

The world of natural language processing is evolving at an impressive rate, and large-scale language models such as ChatGPT and Vicuna are fundamentally changing the way we interface with computers. These advanced models demonstrate human-like reasoning capabilities, from simple text comprehension to complex problem solving.

In particular, open source models such as BLOOM and LLaMA are gaining attention in the medical field, offering new possibilities for innovation in healthcare. However, the introduction of these technologies into the healthcare sector presents unique challenges and opportunities. There are a number of issues that need to be resolved, including data privacy concerns and barriers to the adoption of open source models.

To meet these challenges, this paper develops BioMistral 7B, a model dedicated to the biomedical field. It is derived from Mistral 7B Instruct and built with additional pre-training on PubMed Central. BioMistral 7B expands the potential of open source large-scale language models in the medical field and demonstrates their ability to address a wider range of use cases.

In addition, the authors make the datasets, multilingual benchmarks, preprocessing scripts, and models available on HuggingFace and GitHub under the Apache 2.0 license. This research is expected to be an innovative step in the future of medical technology.

BioMistral Modules

Here are the modules that facilitate the construction of BioMistral 7B.

First, we discuss the pre-training dataset. The authors selected the PMC Open Access subset, a comprehensive yet freely accessible collection of medical research articles, for specialized adaptation to the biomedical field. The choice follows successful prior studies such as PMC-LLaMA, PubMedBERT, and SciFive, which showed significant improvements in language modeling for medical applications. The paper focuses on the subset that is licensed for commercial use, which includes documents under various Creative Commons licenses (CC0, CC BY, CC BY-SA, CC BY-ND). This subset guarantees the reusability of the model's output, even for commercial purposes.

The authors also selected approximately 3 billion tokens, corresponding to about 1.47 million documents, from the pre-processed PubMed Central corpus. This dataset consists primarily of English documents, but also includes documents in nine other languages, including Dutch, German, and French. The approach emphasizes the multilingual part of the dataset by prioritizing non-English documents, with the aim of building a diverse and representative training dataset that reaches the 3 billion token target. The raw text documents are then preprocessed through tokenization and normalization using the Mistral tokenizer.

Next, let's talk about model adaptation. BioMistral 7B uses Mistral 7B Instruct v0.1 as its base model. This model was chosen because it is designed to follow instructions given in prompts and can be fine-tuned for a variety of tasks with limited data, which lets it adapt flexibly to different types of tasks. Optimization uses the AdamW optimizer with a cosine scheduler that adjusts the learning rate over time. The architecture retains the standard transformer features inherited from Mistral, including Grouped-Query Attention, Sliding Window Attention, and Rolling Buffer Cache. These choices are made to ensure high throughput and accuracy.

The authors also introduce a post-tokenization grouping technique to improve the efficiency of pre-training. It concatenates variable-length sequences, each marked with an end-of-sequence token, so that the model's 2,048-token sequences are filled without padding. This significantly reduces the number of sequences and therefore shortens the epoch time.
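The grouping step described above can be illustrated with a minimal sketch. This is not the authors' preprocessing script: the function name pack_sequences, the toy token ids, and the choice to drop the final incomplete block are assumptions made for illustration only.

```python
from typing import Iterable, List


def pack_sequences(
    docs: Iterable[List[int]], eos_id: int, block_size: int = 2048
) -> List[List[int]]:
    """Concatenate tokenized documents and cut the stream into fixed-size blocks.

    Each document is followed by an end-of-sequence token, so document
    boundaries stay visible inside a packed block.
    """
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)
    # Assumption: the trailing remainder shorter than block_size is dropped,
    # so every block is completely filled and no padding is needed.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # hypothetical token ids
    for block in pack_sequences(toy_docs, eos_id=0, block_size=4):
        print(block)
```

With padding, the three toy documents would occupy three sequences that are partly empty. After packing, the same tokens fill three fully used blocks, which is where the saving in epoch time comes from at a block size of 2,048.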

Next, the paper introduces an approach that integrates different pre-trained models to further advance the state of the art in the biomedical field. Traditionally, it has been common to use a different model for each specific application, which increases complexity and cost. Following recent research trends, the authors merge parameters across models to improve performance and out-of-domain generalization, using model merging techniques such as SLERP, TIES, and DARE.

Merging general domain models with domain-specific models is particularly useful for increasing adaptability and accuracy in the biomedical field. This approach allows us to enhance the capabilities of specialized models for a broader range of applications. In addition, it explores new inference possibilities and aims to exceed the performance of traditional models.
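As an illustration of one of the merging techniques named above, the following is a minimal sketch of spherical linear interpolation (SLERP) between two weight vectors. It is not the code used in the paper: real merging tools apply the operation tensor by tensor on full checkpoints, while here each "model" is a short hypothetical list of floats, and the fallback to linear interpolation for nearly parallel vectors is an assumption of this sketch.

```python
import math
from typing import List


def slerp(t: float, a: List[float], b: List[float], eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between weight vectors a and b.

    t = 0 returns a, t = 1 returns b, and values in between follow the arc
    between the two directions instead of the straight line.
    """
    lerp = [(1 - t) * x + t * y for x, y in zip(a, b)]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return lerp  # the angle is undefined for a zero vector
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return lerp  # nearly parallel vectors: fall back to linear interpolation
    w_a = math.sin((1 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]


if __name__ == "__main__":
    general = [1.0, 0.0]  # hypothetical weights of a general-domain model
    medical = [0.0, 1.0]  # hypothetical weights of a domain-specific model
    print(slerp(0.5, general, medical))
```

Unlike plain averaging, which would shrink the two orthogonal toy vectors to a norm of about 0.71, the SLERP midpoint keeps unit norm. Preserving the geometry of the weights in this way is the usual motivation for this method.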

Finally, we discuss quantization. Quantization enables large-scale language models to run on a wider range of devices, facilitating their widespread adoption. Reduced memory usage allows smaller devices to run large language models, increasing the accessibility of the technology. The paper uses two quantization methods: activation-aware weight quantization (AWQ) and BitsandBytes (BnB). AWQ minimizes performance degradation by taking weight importance into account and avoiding the quantization of the most essential weights. This allows for efficient model size reduction while maintaining accuracy. BnB quantization, on the other hand, achieves simplicity and uniformity by assigning a uniform precision of 4 or 8 bits to the entire model. This approach reduces the complexity of the quantization process and makes running large language models on a wider range of devices a reality.
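To make the idea of a uniform bit width concrete, here is a minimal sketch of symmetric round-to-nearest quantization. It is a generic textbook scheme chosen for illustration. It is neither the BitsandBytes nor the AWQ algorithm, and the function names and toy weights are hypothetical.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers of the given bit width with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    toy_weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weight values
    for bits in (8, 4):
        q, scale = quantize_absmax(toy_weights, bits=bits)
        print(bits, q, dequantize(q, scale))
```

Storing small integers plus one scale instead of 16-bit or 32-bit floats is what reduces memory. The cost is a rounding error of at most half a quantization step per weight, and that step grows as the bit width shrinks.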

Assessment Protocol - Benchmarking of English Medical Reasoning Tasks

To evaluate the performance of the BioMistral 7B model, the authors focus on 10 English-language question-answering (QA) tasks selected from four important medical corpora covering a wide range of medical specialties, including genetics, anatomy, and clinical cases. These datasets cover realistic scenarios that medical professionals encounter on a daily basis, the format of medical school entrance exams, and comprehension tests based on PubMed content. A summary of the datasets is provided in the table below.

MMLU (Hendrycks et al., 2021) is a collection of exam questions across 57 subjects, from which six subjects related to medical and clinical knowledge are selected: College Biology, College Medicine, Anatomy, Professional Medicine, Medical Genetics, and Clinical Knowledge. Together they form a medically relevant benchmark of 1,089 questions. Since MMLU lacks training data, MedQA was used for fine-tuning and generalization performance was evaluated on MMLU.

MedQA (Jin et al., 2020) contains a variety of medical knowledge questions presented in USMLE format. The training set consists of 10,178 samples and the test set of 1,273 questions, in both 4- and 5-option formats. MedMCQA (Pal et al., 2022) contains over 193K questions drawn from Indian medical entrance exams, covering 21 medical subjects and 2,400 healthcare topics, with 183K training samples and 4,183 validation questions. Because the test set answer key is unavailable, the validation set is used for evaluation, and for hyperparameter tuning the training set is split into new sets of 146K and 37K samples.

PubMedQA (Jin et al., 2019) contains 211K artificially generated multiple-choice question samples and 1,000 samples labeled by experts. The model is evaluated in the reasoning-required setting, in which it receives a PubMed abstract and a question about it as context and must predict "yes", "no", or "maybe". Fine-tuning was performed on the 211K artificially labeled samples, validation used 500 expert samples, and the evaluation uses the 500 test samples, following the protocols of BigBio (Fries et al., 2022), Chen et al. (2023), and Singhal et al. (2023a).

Evaluation Protocols - Multilingual Evaluation

The authors aim to provide a comprehensive evaluation of BioMistral 7B, covering its performance in multiple languages as well as in English. To date, biomedical language models have been extensively validated in languages such as English, Chinese, French, and Spanish. However, their performance in other languages has not yet been fully evaluated, owing to the lack of biomedical tasks in languages other than English.

To address this issue, the benchmarks are translated into seven languages (Spanish, German, Portuguese, Russian, French, Arabic, and Chinese) by automated translation with GPT-3.5 Turbo through the OpenAI API. While automated translation certainly presents challenges, recent advances have greatly improved the accuracy of these tools, allowing for efficient multilingual evaluation.

The multilingual evaluation is designed in the same way as the 3-shot scenario conducted in English. The questions, choices, and context are translated, while the examples used for few-shot learning are retained. This makes it possible to test the comprehension and adaptability of the model while taking into account the cultural and linguistic characteristics of each language.

Evaluation Protocols - Application of Instruction Prompts

The evaluation of BioMistral 7B strictly follows instruction prompts based on the official guidelines for GPT-4 medical evaluation (Nori et al., 2023a). Each question-answering (QA) task is therefore presented in a multiple-choice question answering (MCQA) format with options from A to D or A to E. For a detailed list of instruction prompts, see the Appendix of the paper.

During inference, the model predicts the next token based on the given input prompt and produces a probability for each token in the vocabulary. To increase the accuracy of the prediction, the vocabulary is narrowed down to only the tokens that correspond to the answer options (in this case, the letters of the choices). This approach reduces the risk of the model generating irrelevant tokens or inaccurate information (hallucinations) and allows for more reliable predictions.
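The vocabulary restriction described above can be sketched in a few lines. This is an illustrative simplification rather than the paper's evaluation code: a real model returns logits indexed by token id, whereas here they are a hypothetical dictionary keyed by token string, and the function name pick_option is invented for the example.

```python
import math
from typing import Dict, List, Tuple


def pick_option(
    logits: Dict[str, float], options: List[str]
) -> Tuple[str, Dict[str, float]]:
    """Keep only the logits of the answer letters and renormalize with softmax."""
    kept = {opt: logits[opt] for opt in options}
    top = max(kept.values())  # subtract the maximum for numerical stability
    exps = {opt: math.exp(value - top) for opt, value in kept.items()}
    total = sum(exps.values())
    probs = {opt: e / total for opt, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


if __name__ == "__main__":
    # Hypothetical next-token logits; "The" scores highest over the full vocabulary.
    next_token_logits = {"A": 1.2, "B": 3.1, "C": 0.3, "D": -0.5, "The": 4.0}
    answer, probabilities = pick_option(next_token_logits, ["A", "B", "C", "D"])
    print(answer, probabilities)
```

In the toy example, unrestricted greedy decoding would emit "The", which is not a valid answer. After restriction, the prediction is always one of the option letters and the probabilities sum to one over the options only.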

Evaluation Protocol - Supervised Fine Tuning (SFT)

Supervised fine-tuning (SFT) is the essential process of fine-tuning a model on annotated data to adapt it to a specific task. To take BioMistral 7B beyond what few-shot learning achieves, SFT is applied to BioMistral 7B and to the existing open source models. The training sets used were selected based on predefined criteria.

However, the challenge is that traditional SFT methods often require large amounts of resources. To address this issue, the authors adopt the QLoRA fine-tuning method and an 8-bit quantization technique. These techniques are cost-effective and make the SFT process more viable. In addition, an improved batching method is used to reduce the time required for fine-tuning. These strategies play an important role in efficiently maximizing BioMistral 7B's performance and adaptability to specific tasks.
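A short calculation shows why low-rank adapters of the kind used in QLoRA are cheap to train. It is a sketch under stated assumptions: the 4096 by 4096 matrix size and the rank of 16 are illustrative values, not settings reported in this article, and the helper names are hypothetical.

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters updated when a dense d_out x d_in weight matrix is fine-tuned."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of a low-rank adapter: A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 16  # assumed sizes, for illustration only
    full = full_trainable_params(d_in, d_out)
    lora = lora_trainable_params(d_in, d_out, rank)
    print(f"full fine-tuning: {full:,} parameters")
    print(f"low-rank adapter: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

Under these assumptions the adapter trains less than one percent of the parameters of the matrix it modifies. The frozen base weights can then be kept in a quantized format, and this combination is what lowers the memory cost of fine-tuning.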

Experimental Results and Discussion

To test the capabilities of the BioMistral 7B model, its performance is first examined in a few-shot learning scenario. For this evaluation, 3-shot in-context learning is performed with 3 samples randomly selected from the training set of each dataset. The results are shown in the table below.
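The 3-shot setup can be sketched as a small prompt builder. The prompt template, the field names question, options, and answer, and the fixed random seed are assumptions made for this illustration. The prompts actually used are listed in the appendix of the paper.

```python
import random
from typing import Dict, List

LETTERS = "ABCDE"


def format_example(example: Dict, with_answer: bool) -> str:
    """Render one multiple-choice question, optionally followed by its answer letter."""
    lines = [f"Question: {example['question']}"]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(example["options"])]
    lines.append(f"Answer: {example['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(
    train: List[Dict], query: Dict, k: int = 3, seed: int = 0
) -> str:
    """Prepend k randomly drawn, solved training examples to the unsolved query."""
    rng = random.Random(seed)  # fixed seed so the same shots are drawn every run
    shots = rng.sample(train, k)
    parts = [format_example(shot, with_answer=True) for shot in shots]
    parts.append(format_example(query, with_answer=False))
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Hypothetical placeholder data, not taken from any of the benchmarks.
    train_set = [
        {"question": f"Example question {i}?", "options": ["w", "x", "y", "z"], "answer": "A"}
        for i in range(5)
    ]
    target = {"question": "Target question?", "options": ["p", "q", "r", "s"], "answer": "B"}
    print(build_few_shot_prompt(train_set, target))
```

The prompt ends at "Answer:", so the next token the model produces is its choice of option letter for the target question.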

The results are encouraging: BioMistral 7B outperforms the existing Mistral 7B Instruct model in 8 of the 10 tasks, demonstrating the effectiveness of adaptation to a specific domain. It also outperforms the other open-source biomedical baseline models in all tasks in this 3-shot learning scenario.

In MedQA, BioMistral 7B shows a marked improvement over MediTron-7B and MedAlpaca 7B, and in MMLU it significantly outperforms existing biomedical large-scale language models. Similarly, in MedMCQA, BioMistral 7B improves on the other biomedical models. On the other hand, in PubMedQA, the performance is reduced due to possible hallucinations resulting from class imbalances.

While GPT-3.5 Turbo is the best model in this entire 3-shot learning scenario, BioMistral 7B demonstrates new possibilities for AI in biomedical applications due to its domain adaptability and excellent performance in the few-shot learning scenario. These results provide important insights into the direction of AI technology in future biomedical applications.

The fine tuning performance of BioMistral 7B is then evaluated in comparison to several baseline models; the performance of the BioMistral model and associated baselines is shown in the table below.

Overall, SFT further improves model performance on nearly all datasets. Comparison of the models shows a similar trend to the few-shot in-context learning evaluation, with BioMistral 7B outperforming Mistral 7B Instruct on 7 of the 10 tasks and outperforming the other open source biomedical baselines on all tasks. BioMistral 7B also improves significantly on PubMedQA.

Summary

In recent years, large-scale language models have shown remarkable versatility and offer potential applications in specific domains such as medicine and health care. Despite the availability of a variety of healthcare-specific open source large-scale language models, adapting general-purpose large-scale language models to the healthcare domain poses significant challenges.

This paper proposes BioMistral 7B, an open source large-scale language model specific to the biomedical domain, built by further pre-training Mistral, a foundation model, on PubMed Central. It shows new possibilities for large-scale language models specialized for the medical domain. The model is a further evolution of Mistral 7B Instruct, based on the high-quality resources of PubMed Central, and utilizes techniques such as quantization and model merging. As a result, BioMistral 7B achieves remarkable performance on multilingual medical evaluation benchmarks compared to existing open source 7B models.

Looking ahead, the authors plan to conduct human evaluations to assess BioMistral 7B's generation quality in more depth. They also plan to expand the model's multilingual and chat capabilities using techniques such as supervised fine-tuning and direct preference optimization.

BioMistral 7B is expected to expand the potential of AI technology in the medical field and further enhance its accuracy and reliability to help solve real-world medical problems.

In addition, the paper makes available the dataset, multilingual evaluation benchmarks, scripts, and all models obtained during the experiment.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
