Me-LLaMA, A New Open-source Large-scale Language Model For The Medical Field
3 main points
✔️ Developed Me-LLaMA, a new large-scale language model with broad medical knowledge
✔️ Proposes a comprehensive large-scale dataset, including continuous pre-training data, instruction tuning data, and the evaluation benchmark MIBE
✔️ Investigates the catastrophic forgetting problem in existing medical large-scale language models; the Me-LLaMA model retains general knowledge while maintaining superior performance
Me LLaMA: Foundation Large Language Models for Medical Applications
written by Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, Xinyu Zhou, Huan He, Lucila Ohno-Machado, Yonghui Wu, Hua Xu, Jiang Bian
(Submitted on 20 Feb 2024 (v1))
Comments: 21 pages, 3 figures, 8 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
The emergence of large-scale language models is a major breakthrough in the quest to improve the quality of patient care and the efficiency of clinical operations. Large-scale language models have tens of billions of parameters, are trained on vast amounts of text data, and can generate human-like responses and perform complex tasks. They show great potential for improving clinical documentation, diagnostic accuracy, and the management of patient care. However, large-scale language models such as ChatGPT and GPT-4 are closed source and difficult to customize for the specific situations required in the medical field.
To address this issue, open-source large-scale language models have been developed in recent years. They are a promising solution, offering unrestricted access and the flexibility to be customized to the specific needs of the medical field. For example, the LLaMA model is at the forefront of open-source large-scale language models in the general domain and offers cutting-edge capabilities. However, because these models are trained primarily on general domain data, they lack the specialized knowledge needed for accurate and reliable medical applications.
To compensate for these shortcomings, open-source large-scale language models are being developed specifically for healthcare by enhancing them with biomedical data. However, existing efforts such as PMC-LLaMA and Meditron focus on the biomedical domain and evaluate only question answering (QA) tasks. Very few studies have used clinical data and evaluated clinical tasks, with GatorTronGPT and Clinical-LLaMA being the exceptions. GatorTronGPT's lack of instruction tuning and its model and data size limitations, and Clinical-LLaMA's limited pre-training on clinical text, have prevented large-scale language models from being fully exploited in a variety of clinical settings. In addition, these models suffer from "catastrophic forgetting," in which previously acquired knowledge is degraded when new medical data is integrated.
To address these challenges, this paper develops Me-LLaMA, a new medical large-scale language model built by continuously pre-training and instruction-tuning the LLaMA2 model on rich biomedical and clinical data.
This provides a comprehensive suite of resources for the study of medical large-scale language models, including a large pre-training dataset, an instruction tuning dataset, and a new medical evaluation benchmark (MIBE).
In evaluations using MIBE, the Me-LLaMA models outperform existing open-source medical large-scale language models in zero-shot, few-shot, and supervised learning settings. With task-specific instruction tuning, the Me-LLaMA models also outperform ChatGPT and GPT-4 on many datasets.
Technique
Me-LLaMA is developed through continuous pre-training and instruction tuning of LLaMA2. This process uses 129B tokens of pre-training data and 214K instruction tuning samples drawn from general, biomedical, and clinical sources.
To adapt the LLaMA2 model to the medical field, the authors construct a mixed continuous pre-training dataset. It contains 129B tokens composed of biomedical literature, clinical notes, and general domain data, balancing domain-specific knowledge with broad contextual understanding and reducing catastrophic forgetting.
- Biomedical Literature
- It contains an extensive collection of biomedical literature obtained from PubMed Central and PubMed Abstracts.
- Clinical Notes
- Anonymized free-text clinical notes from MIMIC-III, MIMIC-IV, and MIMIC-CXR are used to reflect real clinical scenarios and inferences.
- General Domain Data
- To prevent catastrophic forgetting, a subset of the RedPajama dataset is included to replicate LLaMA's pre-training data. The ratio of biomedical, clinical, and general domain data is 15:1:4, keeping a strong focus on the medical field while also incorporating general knowledge (a rough sampling sketch follows this list).
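As a rough illustration of how such a mixture could be sampled in practice, the sketch below weights three placeholder corpora at the 15:1:4 ratio described above. The sampler, corpus contents, and documents are hypothetical; this is not the authors' actual data pipeline.

```python
import random

# Hypothetical sketch: sample pre-training documents at a 15:1:4
# biomedical : clinical : general ratio (ratio from the paper; the
# corpora and sampling code are placeholders, not the authors' pipeline).
MIX_WEIGHTS = {"biomedical": 15, "clinical": 1, "general": 4}

corpora = {
    "biomedical": ["PubMed abstract text ...", "PMC full-text article ..."],
    "clinical": ["De-identified MIMIC note ..."],
    "general": ["RedPajama web snippet ..."],
}

def sample_document(rng: random.Random) -> str:
    """Pick a source according to the mixture weights, then a document from it."""
    sources = list(MIX_WEIGHTS)
    weights = [MIX_WEIGHTS[s] for s in sources]
    source = rng.choices(sources, weights=weights, k=1)[0]
    return rng.choice(corpora[source])

rng = random.Random(0)
batch = [sample_document(rng) for _ in range(8)]  # a tiny illustrative "batch"
print(batch)
```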
The authors also construct a new medical instruction tuning dataset to enhance the model's ability to follow instructions and generalize to diverse medical tasks. This dataset draws on a variety of sources, including biomedical literature, clinical notes, clinical guidelines, wikidoc, knowledge graphs, and general domain data. The diverse tasks it contains are intended to refine the model's ability to process and respond to medical information accurately and in context. After removing noise (e.g., null inputs and responses), the final dataset contains 214,595 high-quality samples.
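The noise-removal step mentioned above (dropping samples with null inputs or responses) might look roughly like the filter below; the record schema ("instruction"/"input"/"output") is an assumption for illustration, not the paper's actual format.

```python
# Minimal sketch of filtering noisy instruction-tuning samples (e.g. null
# inputs or responses). The field names are assumed for illustration and
# may not match the paper's actual data schema.
def is_clean(sample: dict) -> bool:
    instruction = (sample.get("instruction") or "").strip()
    output = (sample.get("output") or "").strip()
    return bool(instruction) and bool(output)

raw_samples = [
    {"instruction": "Summarize the clinical note.", "input": "Patient ...", "output": "Summary ..."},
    {"instruction": "Answer the question.", "input": "", "output": None},  # noisy: null response
]
clean_samples = [s for s in raw_samples if is_clean(s)]
print(len(clean_samples))  # -> 1
```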
Furthermore, while existing research in the medical field focuses primarily on evaluating QA tasks, this paper introduces a new, broader evaluation benchmark covering six tasks: QA, named entity recognition (NER), relation extraction (RE), classification (CF), text summarization (TS), and natural language inference (NLI). These tasks span 12 carefully selected datasets from the biomedical and clinical domains, providing a broad spectrum of evaluation. The general domain QA dataset MMLU is also included to assess forgetting of general domain knowledge.
The models are evaluated in two settings, in-context learning (zero-shot and few-shot) and supervised learning, and compared against baseline models to assess performance and generalization across a variety of tasks.
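To make the two in-context settings concrete, the sketch below builds a zero-shot prompt (question only) and a few-shot prompt (worked examples prepended). The prompt wording is invented for illustration and is not the paper's actual MIBE template.

```python
# Hypothetical sketch of zero-shot vs. few-shot prompt construction for a
# medical QA item; the templates are invented, not the paper's MIBE prompts.
def build_prompt(question: str, examples: list[tuple[str, str]] | None = None) -> str:
    parts = ["Answer the following medical question."]
    for ex_q, ex_a in examples or []:               # few-shot: prepend worked examples
        parts.append(f"Question: {ex_q}\nAnswer: {ex_a}")
    parts.append(f"Question: {question}\nAnswer:")  # the item to be answered
    return "\n\n".join(parts)

zero_shot = build_prompt("Does regular aspirin use reduce cardiovascular risk?")
few_shot = build_prompt(
    "Does regular aspirin use reduce cardiovascular risk?",
    examples=[("Is ibuprofen an NSAID?", "Yes")],
)
print(zero_shot)
print("---")
print(few_shot)
```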
Experimental Results
The table below compares the zero-shot performance of the Me-LLaMA chat models and baseline models on the MIBE tasks. The comparison includes large language models with instruction fine-tuning to enhance instruction-following ability, such as the LLaMA2 chat model.
Among models with 13B parameters, Me-LLaMA 13B-chat outperformed LLaMA2 13B-chat, PMC-LLaMA-chat, and Medalpaca 13B on almost all 12 datasets. The only exception was the clinical question answering (QA) dataset EmrQA, where accuracy decreased slightly. In addition, Me-LLaMA outperformed AlpaCare-13B on 9 of the 12 datasets.
Among models with 70B parameters, Me-LLaMA 70B-chat consistently outperformed Meditron 70B on all 12 datasets and outperformed LLaMA2-70B-chat on 11 datasets. Notably, on the PubMedQA dataset, Me-LLaMA 70B-chat exceeded LLaMA2-70B-chat by approximately 10% in accuracy and 8.0% in Macro-F1 score. Moreover, Me-LLaMA 13B-chat outperformed the much larger LLaMA2-70B-chat on 6 of the 12 datasets (including PubMedQA, MedQA, MedMCQA, 2013 DDI, HoC, and MIMIC-CXR) and was competitive on three others (EmrQA, MTsample, and MedNLI).
The figure below compares the few-shot performance of the Me-LLaMA models with Meditron 70B, the current state-of-the-art medical large-scale language model. The comparison uses Rouge-L scores for the PubMed summarization dataset, accuracy for the three QA datasets, and F1 scores for the other datasets. Given Meditron's limited instruction-following ability, few-shot prompting was used for the comparison: 1-shot for the summarization dataset and 5-shot for the other datasets. As a result, the Me-LLaMA models achieved superior performance on 11 of the 12 datasets (the exception being PubMedQA).
The table below compares the performance of the Me-LLaMA 13B/70B foundation models with other open-source foundation large language models in a supervised setting. The Me-LLaMA 13B model outperforms the similarly sized medical foundation model PMC-LLaMA 13B on 11 of the 12 datasets and LLaMA2 13B on 10 of the 12 datasets (excluding DDI and HoC). In addition, the Me-LLaMA 13B model was competitive with LLaMA2 70B and Meditron 70B on 8 of the 12 datasets (PubMedQA, EmrQA, 2010 i2b2, MTsample, PubMed, MIMIC-CXR, BioNLI, and MedNLI). Among the 70B models, Me-LLaMA 70B achieved the best performance on 9 of the 12 datasets compared with LLaMA2 70B and Meditron 70B (excluding MedMCQA, 2010 i2b2, and PubMed).
In addition, the figure below compares the performance of the Me-LLaMA models with ChatGPT and GPT-4 in zero-shot and task-specific instruction fine-tuning settings. Because privacy concerns restrict sending clinical datasets containing patient information to ChatGPT and GPT-4, the comparison covers eight datasets without these restrictions (PubMedQA, MedQA, MedMCQA, HoC, MTsample, PubMed, BioNLI, and 2013 DDI); the ChatGPT and GPT-4 results for the three QA datasets are taken from the OpenAI paper.
The comparison uses Rouge-L scores for the PubMed summarization dataset, accuracy for the three QA datasets, and Macro-F1 scores for the other datasets. With task-specific instruction tuning, the Me-LLaMA models outperform ChatGPT on 7 of the 8 datasets (excluding PubMed) and outperform GPT-4 on 5 datasets (PubMedQA, HoC, MTsample, BioNLI, and 2013 DDI). In the zero-shot setting, the Me-LLaMA models outperform ChatGPT on five datasets (PubMedQA, MedQA, MedMCQA, BioNLI, and 2013 DDI) but fall behind GPT-4 on seven datasets.
The paper also investigates the impact of continuous pre-training and instruction tuning. The table below compares their effects on the zero-shot performance of the large language models.
Specifically, it contrasts Me-LLaMA 13/70B with its backbone model, LLaMA2 13/70B, in the zero-shot setting, demonstrating the benefits of continuous pre-training. It also compares Me-LLaMA 13/70B with its instruction-tuned, chat-optimized counterpart, Me-LLaMA-13/70B-chat, highlighting the benefits of instruction tuning in the zero-shot context.
Overall, both continuous pre-training and instruction tuning significantly improve the zero-shot capability of the models. For example, the Me-LLaMA 13B model shows performance gains ranging from 0.5% to 13.1% across the datasets compared to the LLaMA2 13B model, demonstrating the benefits of continuous pre-training. Instruction tuning, in turn, yields even greater zero-shot performance gains than continuous pre-training.
Specifically, the Me-LLaMA-70B-chat model showed performance improvements of 3.7% to 41.9% over the Me-LLaMA 70B base model without instruction tuning. This suggests that instruction tuning plays an important role in enhancing a model's ability to leverage context in learning tasks, even without supervised fine-tuning or in-context examples.
In addition, the paper investigates the problem of catastrophic forgetting. Existing medical large-scale language models are compared to assess their vulnerability to catastrophic forgetting (the phenomenon of forgetting old knowledge when learning new data). This issue is particularly important for medical large-scale language models, which need to maintain accurate and consistent knowledge from both the general and medical domains.
The table below compares the performance of various medical large-scale language models and their backbone models, after continuous pre-training, on the general domain dataset MMLU and the medical dataset MedQA.
The Me-LLaMA models show improved performance in both the general and medical domains. In contrast, some models improve only on medical data, while others degrade in both domains after continuous pre-training on medical data. Specifically, Meditron 7/70B improves on the MedQA dataset but declines on the MMLU dataset, while PMC-LLaMA 7/13B declines on both datasets. These results underscore the importance of balancing general and medical data during training to prevent knowledge loss.
Summary
In this paper, the authors develop new medical large-scale language models: Me-LLaMA 13B, Me-LLaMA 70B, and Me-LLaMA-13/70B-chat. These models are built by continuously pre-training the LLaMA2 model and applying instruction tuning, using data from a wide range of biomedical, clinical, and general domains.
Evaluation results show that the Me-LLaMA models outperform existing open-source medical large-scale language models in a variety of learning scenarios and achieve results competitive with leading commercial models such as ChatGPT and GPT-4. This research paves the way for more accurate, reliable, and comprehensive medical large-scale language models and highlights the potential of large-scale language models in medical applications.
However, in the zero-shot setting, medical large-scale language models, including the proposed models, performed poorly on certain tasks (e.g., NER and RE). This may be because the model responses lack the expected conciseness and accuracy. For example, the zero-shot output of Me-LLaMA-13B-chat was problematic on several tasks, often generating redundant sentences in multi-label classification and producing inaccurate numerical responses and irrelevant strings in the NLI task.
In the supervised fine-tuning setting, the Me-LLaMA models performed better than or comparably to previous state-of-the-art large-scale language models on many tasks. However, on the PubMed summarization dataset, performance was significantly lower than that of methods based on pre-trained language models (e.g., BART). This shortcoming is attributed to the low quality of the gold-standard summaries in the dataset, which degrades the quality of model-generated summaries and biases the evaluation metrics.
The study also underscores the importance of data source diversity during the pre-training and instruction tuning phases. High-quality data, meticulously curated from a wide range of sources, forms the foundation of model performance and allows the model to accurately capture a broad range of medical and biomedical concepts. In particular, the balance between medical and general domain data is critical, and the integration of general domain data plays a key role in mitigating the knowledge forgetting problem.
The paper found that a 19:1 mixing ratio of medical to general domain data, as used in the PMC-LLaMA 13B model, resulted in poor performance on both general and biomedical tasks. In contrast, a 4:1 ratio improved performance on both general and medical tasks. This suggests that careful empirical analysis is needed to find the optimal data balance.
Balancing cost and effectiveness between pre-training and instruction tuning of large language models is also important. For example, pre-training the LLaMA2 70B model is very resource intensive, taking about 700 hours on roughly 160 A100 GPUs per epoch. In contrast, instruction tuning takes only about 70 hours on 8 A100 GPUs per epoch and is much more economical than pre-training. This efficiency suggests prioritizing instruction tuning in resource-limited scenarios and underscores its potential for cost-effective model improvement.
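A back-of-the-envelope calculation with the figures quoted above (treated as approximate per-epoch costs) makes the gap explicit:

```python
# Rough per-epoch GPU-hour comparison using the approximate figures above;
# actual costs depend on hardware, parallelism, and training setup.
pretrain_gpu_hours = 700 * 160   # ~112,000 A100 GPU-hours for continued pre-training
instruct_gpu_hours = 70 * 8      # ~560 A100 GPU-hours for instruction tuning

print(pretrain_gpu_hours, instruct_gpu_hours)
print(pretrain_gpu_hours / instruct_gpu_hours)  # -> 200.0, i.e. roughly 200x cheaper
```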
Me-LLaMA models are available as base and chat-optimized versions in 13B and 70B sizes, supporting a wide range of medical applications where the balance between model size and resource availability is critical. The base models provide a robust foundation with extensive medical knowledge and can be adapted to specialized tasks through supervised fine-tuning.
The chat versions, on the other hand, excel at instruction following and in-context learning, and are very effective in zero-shot or few-shot learning scenarios. Large models like the 70B are ideal for comprehensive medical analysis, providing deeper understanding and complex inference capabilities. However, such deployments require substantial computational resources, which can be a challenge in resource-limited settings. The 13B models, by contrast, offer a practical compromise that balances efficiency and effectiveness, opening up possibilities for a wide variety of applications.
It is important to recognize the limitations of the current Me-LLaMA model. As with all existing large-scale language models, they have the potential to generate factual errors and biased information. To mitigate this, future research could incorporate methodologies such as reinforcement learning with human feedback (RLHF).
Another limitation is that the current token processing capacity is limited to 4096 tokens, a constraint inherited from the LLaMA2 model. Addressing this limitation involves extending the model's ability to process longer contexts.
This study is an important step toward the further development and practical application of new medical large-scale language models; the Me-LLaMA models have great potential in medical applications, and further research is expected to demonstrate their utility and effectiveness.