Evaluation And Prospects Of The Large-Scale Language Model "Gemini" In The Medical Domain

Large Language Models

3 main points
✔️ Gemini Overview and Applications: Gemini is a multimodal language model with the ability to understand and generate information from diverse input formats, including text, images, audio, and video in the medical field.
✔️ Gemini Evaluation Methodology and Results: Gemini shows robust understanding across a wide variety of medical topics, but is highly susceptible to hallucinations.

✔️ Future Prospects and Challenges: The study notes room for improvement: the evaluation of Gemini Pro is limited to what the available API exposes, and longer-form questions were not evaluated.

Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations
written by Ankit Pal, Malaikannan Sankarasubbu
(Submitted on 10 Feb 2024)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


In recent years, large-scale language models (LLMs) have made remarkable advances in their ability to understand and generate human language. These advances have spurred breakthroughs in a variety of fields, including linguistics and computer programming. In particular, models such as GPT-3 and PaLM are able to capture complex language patterns by learning from large amounts of textual data. Rapid developments in artificial intelligence technology are driving continuous improvements in LLMs and accelerating innovation in specialized fields. These advances have been achieved incrementally as model size, data volume, and computational power have increased. Many advanced models are built on the transformer architecture and trained with self-supervised learning techniques.

The application of large-scale language models in medicine is a particularly promising direction. These models are expected to bring new insights to medicine by analyzing large volumes of medical literature and integrating new knowledge. Researchers are actively evaluating how large-scale language models can complement medical expertise and enhance medical services.

However, along with promising opportunities, this emerging technological domain also presents significant challenges. For example, it remains an open question whether large-scale language models can process medical knowledge at an expert level, and whether they risk generating incorrect information. Understanding the potential and limitations of these technologies is essential to the responsible use of language models in medicine.

This paper investigates the potential and challenges of large-scale language models in the medical domain, focusing on Google's Gemini model, a state-of-the-art multimodal language model. The paper rigorously evaluates Gemini's capabilities using multiple benchmark tests to identify its strengths and limitations in the medical domain.

This study demonstrates Gemini's robust understanding in a wide variety of medical topics, while also highlighting its limitations in areas where specialized knowledge is required. The study provides deep insights into the application of large-scale language models, including Gemini, to the medical field and highlights its potential strengths and challenges. It is hoped that this will facilitate discussion on the future prospects of AI technology in the medical field.


Here we present an overview of Gemini's architecture and of the benchmarks used to evaluate its reasoning capabilities in the medical field. Gemini is designed to enable complex analysis and inference using a state-of-the-art multimodal architecture trained on Google's TPU hardware.

Gemini Architecture: Gemini is a model based on an advanced transformer decoder that can handle contexts of up to 32,000 tokens and seamlessly combine text, image, and audio data. The model is designed for reliability and efficiency, reducing the impact of hardware failures and data corruption. Gemini's reasoning skills and benchmark scores set a new standard for multimodal AI research.

Benchmarking in Medicine: MultiMedQA is a collection of medical QA datasets for assessing clinical reasoning skills. It includes questions from real-world exams in the United States and India, such as the USMLE and NEET-PG, which require knowledge across disciplines. MedQA and MedMCQA draw their questions from medical licensing exams and present complex clinical reasoning challenges; PubMedQA includes 1,000 questions that require synthesizing insights from research abstracts and assess closed-domain reasoning skills; and the MMLU medical subsets cover a wide range of areas, testing the integration of basic science knowledge and medical understanding.

Special Benchmarks: Med-HALT is a benchmark for assessing dangerous reasoning tendencies, designed according to the medical principle of "first, do no harm". Through the Reasoning Hallucination Test (RHT) and the Memory Hallucination Test (MHT), it assesses the model's ability to reason logically and, when necessary, to admit uncertainty.

The Visual Question Answering (VQA) benchmark uses 100 multiple-choice questions from the New England Journal of Medicine (NEJM) Image Challenge to assess three of Gemini's multimodal reasoning abilities: image comprehension, medical knowledge recall, and step-by-step reasoning.

Together, these benchmarks are used to examine Gemini's reasoning capabilities and its accuracy and reliability in handling medical information.

Experimental results

Here we analyze how well Gemini performed on the MultiMedQA, Med-HALT Hallucinations, and Medical Visual Question Answering (VQA) benchmarks and compare it to other models.

First, let's discuss Gemini's performance on the MultiMedQA benchmark. The chart below shows the MultiMedQA scores for Med-PaLM 2, GPT-4, and Gemini Pro. Gemini Pro has achieved notable results in the MultiMedQA benchmark on a variety of medical topics.

The table below also compares Gemini Pro results with those of Flan-PaLM, Med-PaLM, and Med-PaLM 2. Gemini Pro scored 67.0% on the MedQA (USMLE) dataset, well below the best Med-PaLM 2 score (up to 86.5%) and the 86.1% of GPT-4 (5-shot). This gap of almost 20 percentage points indicates that there is room for Gemini Pro to improve its ability to handle complex, multi-step USMLE-style questions.

In addition, the MedMCQA dataset covers a wide range of subjects, making it a particularly challenging benchmark. Gemini Pro achieved a score of 62.2% on MedMCQA, a significant gap compared to other models on the leaderboard. For example, Med-PaLM 2 scored 72.3% in both its ER (ensemble refinement) and best configurations, indicating stronger comprehension and processing power in this context. The GPT-4 models, including the base and 5-shot versions, also perform better, with scores ranging from 72.4% to 73.7%. These results suggest that Gemini Pro has room for improvement on the MedMCQA dataset.

The PubMedQA dataset uses a yes/no/maybe response format, which poses its own challenges for binary and ternary questions. Gemini Pro scored 70.7% on this dataset, compared with the highest scores of 81.8% for Med-PaLM 2 and 80.4% for 5-shot GPT-4-base. This difference suggests that Gemini Pro needs to improve its handling of binary and ternary responses, as well as of questions drawn from the scientific literature and clinical areas.
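To make the yes/no/maybe format concrete, here is a minimal sketch of how free-text model replies could be mapped onto the three labels and scored. The helper names `normalize_answer` and `accuracy` are hypothetical, and the "first matching keyword" rule is an assumption made for illustration; it is not the scoring procedure used in the paper.

```python
import re
from typing import Iterable, Optional

def normalize_answer(raw: str) -> Optional[str]:
    """Return the first yes/no/maybe token found in a model reply, else None."""
    match = re.search(r"\b(yes|no|maybe)\b", raw.lower())
    return match.group(1) if match else None

def accuracy(predictions: Iterable[str], gold: Iterable[str]) -> float:
    """Percentage of replies whose normalized label equals the gold label."""
    preds, golds = list(predictions), list(gold)
    if not golds or len(preds) != len(golds):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    # A reply with no recognizable label (None) simply counts as wrong.
    correct = sum(normalize_answer(p) == g for p, g in zip(preds, golds))
    return 100.0 * correct / len(golds)

if __name__ == "__main__":
    # Toy replies written for this sketch, not taken from PubMedQA.
    replies = ["Yes, the abstract supports this.", "No.", "unsure", "Maybe"]
    labels = ["yes", "no", "maybe", "maybe"]
    print(f"accuracy: {accuracy(replies, labels):.1f}%")
```

A reply that contains none of the three labels is counted as incorrect rather than dropped, so unparseable answers lower the score.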

Furthermore, on the MMLU clinical knowledge dataset, Gemini Pro's performance was inferior to state-of-the-art models such as Med-PaLM 2 and 5-shot GPT-4. Gemini Pro's overall test set accuracy was 78.6%, well below the 88.7% achieved by both Med-PaLM 2 and 5-shot GPT-4-base. This trend persisted when specific subdomains were analyzed. In the medical genetics assessment, Gemini Pro achieved 81.8% accuracy, while 5-shot GPT-4-base achieved 97.0%. Similarly, on the anatomy assessment, Gemini Pro was 76.9% accurate, more than 8 percentage points lower than the 85.2% accuracy of 5-shot GPT-4-base. Similar performance gaps were seen in other categories, such as Professional Medicine and College Biology, where Gemini Pro could not keep up with the top models. In the College Medicine category, Gemini Pro's score of 79.3% showed reasonable ability, but fell short of the top performance of models such as Med-PaLM 2 and the GPT-4 variants. These results suggest that Gemini Pro has strong underlying capabilities for processing medical data and that its architecture has potential. However, compared with the best results of models such as Med-PaLM 2 and GPT-4, it is clear that there is room for improvement.
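The gaps discussed above can be checked with simple arithmetic. The sketch below uses only accuracy figures quoted in this article; the dictionary layout and the function name `gap_in_points` are illustrative choices, and for MedMCQA the "best" value is taken as the upper end of the GPT-4 range cited above.

```python
# Accuracy figures (percent) quoted in this article:
# (Gemini Pro, best score cited for another model on the same benchmark).
SCORES = {
    "MedQA (USMLE)": (67.0, 86.5),            # best: Med-PaLM 2
    "MedMCQA": (62.2, 73.7),                  # best: top of the GPT-4 range
    "PubMedQA": (70.7, 81.8),                 # best: Med-PaLM 2
    "MMLU Clinical Knowledge": (78.6, 88.7),  # best: Med-PaLM 2 / GPT-4-base
    "MMLU Medical Genetics": (81.8, 97.0),    # best: 5-shot GPT-4-base
    "MMLU Anatomy": (76.9, 85.2),             # best: 5-shot GPT-4-base
}

def gap_in_points(gemini: float, best: float) -> float:
    """Absolute gap in percentage points, rounded to one decimal place."""
    return round(best - gemini, 1)

GAPS = {name: gap_in_points(g, b) for name, (g, b) in SCORES.items()}

if __name__ == "__main__":
    for name, (g, b) in SCORES.items():
        print(f"{name:<26} {g:5.1f} vs {b:5.1f}  gap {GAPS[name]:4.1f} points")
```

Expressed this way, the largest gap is on MedQA (USMLE) and the smallest on MMLU Anatomy. These are differences in percentage points, not relative percentages.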

The paper also performs a comparative analysis with open-source large language models. It uses a variety of state-of-the-art models, including Llama-2-70b, Mistral-7b-v0.1, Mixtral-8x7b-v0.1, Yi-34b, Zephyr-7b-beta, Qwen-72b, and Meditron-70b, and evaluates their zero-shot and few-shot capabilities across medical reasoning tasks. A standardized analysis on the MultiMedQA benchmark quantifies the capabilities and limitations of these publicly available large language models. The figures below illustrate zero-shot and few-shot performance, respectively.

(Zero-shot performance)

(Few-shot performance)

Performance across datasets: A number of open-source models were tested on a variety of medical datasets to evaluate their few-shot and zero-shot capabilities. In the 5-shot benchmark, Qwen-72b consistently outperformed the other models. Its flexibility and ability to absorb knowledge from a small number of good examples suggest that it can bridge the gap between broad, general-purpose AI capabilities and the nuanced requirements of specific medical expertise.

Zero-Shot vs. Few-Shot Prompting: Comparing zero-shot and few-shot results reveals how much worked examples in the prompt affect model performance. Large language models such as Yi-34b and Qwen-72b showed significant performance gains when only a small number of examples were introduced. These results indicate that example-based prompting plays an important role in improving model accuracy and inference performance, especially in specialized fields such as medicine.
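The difference between the two settings comes down to what the prompt contains. The following is a minimal sketch, assuming a plain "Question / options / Answer:" template; the template, the `MCQ` class, and `build_prompt` are hypothetical and are not the prompts used in the paper. The sample questions are toy items written for illustration, not drawn from any benchmark.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

@dataclass
class MCQ:
    question: str
    options: Dict[str, str]        # option letter -> option text
    answer: Optional[str] = None   # gold letter; None for the item to be answered

def format_item(item: MCQ, with_answer: bool) -> str:
    """Render one question; worked examples include the gold answer letter."""
    lines = [f"Question: {item.question}"]
    lines += [f"{letter}. {text}" for letter, text in sorted(item.options.items())]
    lines.append(f"Answer: {item.answer}" if with_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(target: MCQ, examples: Sequence[MCQ] = ()) -> str:
    """Zero-shot when `examples` is empty, k-shot when it holds k solved items."""
    blocks = [format_item(ex, with_answer=True) for ex in examples]
    blocks.append(format_item(target, with_answer=False))
    return "\n\n".join(blocks)

# Toy items for illustration only.
EXAMPLES = [
    MCQ("Which vitamin deficiency causes scurvy?",
        {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"}, "B"),
    MCQ("Which organ produces insulin?",
        {"A": "Liver", "B": "Pancreas", "C": "Kidney", "D": "Spleen"}, "B"),
]
TARGET = MCQ("Which blood cells carry oxygen?",
             {"A": "Platelets", "B": "Red blood cells",
              "C": "White blood cells", "D": "Plasma cells"})

if __name__ == "__main__":
    print(build_prompt(TARGET))            # zero-shot prompt
    print("-" * 40)
    print(build_prompt(TARGET, EXAMPLES))  # 2-shot prompt
```

In both cases the model weights are unchanged; the few-shot prompt only prepends solved examples, which is why the gains reported above come from prompting rather than additional training.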

Model-specific insights: The evaluation showed that each model has its own strengths and weaknesses across the various medical question types and datasets. Gemini Pro performed consistently across multiple datasets and adapts well to different settings, but in certain areas it was not as effective as models such as Yi-34b. Mistral-7b-v0.1, on the other hand, shows great potential on the PubMedQA dataset, suggesting effective analysis and inference from scientific articles. Mixtral-8x7b-v0.1 performed particularly well on MMLU Clinical Knowledge and MMLU College Biology, demonstrating its ability to absorb complex medical information. Qwen-72b handles many types of medical questions well even without any prior examples; its 93.75% accuracy on the MMLU College Biology dataset is the highest among the models compared and indicates a good understanding of complex biological concepts.


While this paper provides a comprehensive benchmark of Gemini's capabilities, several limitations remain for future work to address. First, the evaluation covers only Gemini Pro, the model accessible through the available API, and does not take advantage of the more advanced capabilities of Gemini Ultra. Future studies are expected to leverage the Gemini Ultra API to gain deeper insights.

An additional limitation is that the study does not evaluate longer-form questions. This is an important aspect of MultiMedQA, and future studies should extend the evaluation to it. The use of real-time data and advanced techniques such as retrieval-augmented generation (RAG) could also improve the performance of the model.

The VQA task used a relatively small sample, and future studies will need to examine it on larger data sets. Addressing these limitations will help us understand the potential of Gemini and contribute to the development of more sophisticated AI tools for medical applications.

In summary, the study evaluated Google's Gemini on multiple benchmarks in the medical domain and found that while Gemini shows understanding of a variety of medical topics, it falls short of other leading models in some areas. In particular, it was found to be highly susceptible to hallucinations, making it important to improve its reliability and trustworthiness. This study pioneers the evaluation of multimodal models in medicine and provides a public tool to facilitate future development. Ultimately, AI cannot replace human clinical judgment and empathy, but carefully designed AI assistance can augment expertise and support medicine's mission to heal and serve.
