Improved Diagnostic Accuracy, New Diagnostic Support Through Medically Specialized LLM
3 main points
✔️ Large-scale language models provide highly accurate answers in diagnosis and clinical support
✔️ Suggests that prompt design has a significant impact on the performance of large-scale language models
✔️ Validates the usefulness of large-scale language models through interactions with medical professionals
Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain
written by Burcu Sayin, Pasquale Minervini, Jacopo Staiano, Andrea Passerini
(Submitted on 29 Mar 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Recent studies have shown the effectiveness of large-scale language models in medical AI applications. Their effectiveness is particularly pronounced in diagnostic and clinical support systems, where they have been shown to provide highly accurate answers to a variety of medical queries. These models are sensitive to prompt design, and appropriately designed prompts can effectively correct physicians' incorrect responses.
However, challenges still remain in implementing large-scale language models in clinical settings. For example, the need for advanced prompting techniques for complex tasks has been noted. Also, while existing research focuses on the stand-alone use of large-scale language models, human decision makers (e.g., physicians) must make the final decision in actual medical settings. Understanding how the interaction works when the physician gets assistance from an AI agent is critical to ensuring the utility and reliability of the system.
This paper takes a deep dive into how large-scale language models can be used effectively in the medical field. In particular, it addresses the setting in which a large-scale language model is queried after the physician has already given an opinion, and examines whether the model can provide high-quality answers without simply deferring to the expert's opinion. It also examines how prompt design helps the model correct physician errors and articulate medical reasoning, as well as how the large-scale language model adapts its answers based on physician input.
The study first introduces a binary version of the PubMedQA dataset, enriched with plausible correct and incorrect explanations generated by GPT-4, and demonstrates its usefulness. Second, it highlights the importance of prompt design in shaping the interaction between large-scale language models and medical professionals, namely correcting physician errors, explaining medical reasoning, and adapting based on physician input, and shows its impact on the models' performance. In doing so, it provides key insights into how large-scale language models can work more effectively in medical practice.
Prompt Design
This paper examines the effectiveness of large-scale language models in a medical question-answering task. It evaluates their performance with and without physician-provided answers and explanations. Previous research has shown that prompt design has a significant impact on the responses of large-scale language models, and this study examines that impact through multiple learning scenarios that mimic real medical settings and interactions with experts. The scenarios are as follows (a minimal code sketch of how such simulated physician answers could be generated appears after the list):
- Baseline: Basic question and answer (QA) with no input from the physician
- Case 1: Physician responds with "yes/no" and 4 different scenarios are run depending on the accuracy of the response.
- Case 1a: The physician always gives the correct answer.
- Case 1b: The physician always gives the wrong answer.
- Case 1c: The physician always answers "yes".
- Case 1d: The physician always answers "no".
- Case 2: The physician answers "yes/no" and adds a textual explanation; four different scenarios are run depending on its accuracy.
- Case 2a: The physician always gives the correct answer.
- Case 2b: The physician always gives the wrong answer.
- Case 2c: The physician always answers "yes".
- Case 2d: The physician always answers "no".
- Case 3: Physician responds with "yes/no" and the probability of giving the correct answer fluctuates.
- Simulated differences in physician expertise with different probabilities (70%, 75%, 80%, 85%, 90%, 95%)
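Below is a minimal sketch (not taken from the paper; the function name, scenario labels, and field values are assumptions) of how such simulated physician answers could be generated from the gold labels, covering both the fixed behaviours of Cases 1 and 2 and the probabilistic accuracy levels of Case 3.

```python
import random

def simulate_physician_answer(gold_label: str, scenario: str, accuracy: float = 1.0) -> str:
    """Return a simulated 'yes'/'no' physician answer for one question.

    gold_label : gold answer from binary PubMedQA ('yes' or 'no')
    scenario   : 'correct', 'wrong', 'always_yes', 'always_no' (Cases 1a-1d / 2a-2d)
                 or 'probabilistic' (Case 3)
    accuracy   : probability of answering correctly in the probabilistic case
    """
    flip = {"yes": "no", "no": "yes"}
    if scenario == "correct":        # Case 1a / 2a
        return gold_label
    if scenario == "wrong":          # Case 1b / 2b
        return flip[gold_label]
    if scenario == "always_yes":     # Case 1c / 2c
        return "yes"
    if scenario == "always_no":      # Case 1d / 2d
        return "no"
    if scenario == "probabilistic":  # Case 3, accuracy in {0.70, ..., 0.95}
        return gold_label if random.random() < accuracy else flip[gold_label]
    raise ValueError(f"unknown scenario: {scenario}")

# Example: a physician who is right 85% of the time (one of the Case 3 settings)
answer = simulate_physician_answer("yes", "probabilistic", accuracy=0.85)
```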
The figure below shows a template for the prompt.
For example, in Case 1, the first step is to clarify the task instructions for the large-scale language model, as shown in the figure below.
Next, a simulated conversation is developed with the physician and the large-scale language model, as shown in the figure below.
The order of these conversational examples varies across scenarios. The final prompt is completed with a test input containing the specific question, its context, and the physician's response.
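As an illustration only, a Case 1-style prompt might be assembled roughly as follows. The exact template wording is shown in the paper's figures and is not reproduced here; the instruction text, role labels, and field names below are assumptions.

```python
def build_prompt(instruction: str,
                 fewshot_examples: list[dict],
                 question: str,
                 context: str,
                 physician_answer: str) -> str:
    """Assemble a prompt: task instruction, simulated physician/model
    exchanges as in-context examples, then the test input awaiting the answer."""
    parts = [instruction]
    for ex in fewshot_examples:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Context: {ex['context']}\n"
            f"Physician: {ex['physician_answer']}\n"
            f"Assistant: {ex['model_answer']}"
        )
    # Test input: the model answers after seeing the physician's opinion.
    parts.append(
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Physician: {physician_answer}\n"
        f"Assistant:"
    )
    return "\n\n".join(parts)
```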
Case 2 also uses the GPT-4 API to generate a correct or incorrect explanation for each question, as shown in the figure below. For example, in Case 2a the physician always gives the correct answer, and GPT-4 generates a correct explanation to match. In Case 2c, on the other hand, the physician always answers "yes", and GPT-4 generates a plausible correct or incorrect explanation depending on whether the true answer to the question is "yes" or "no". This mimics how a physician would justify an answer and makes the simulated interaction more realistic.
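A hedged sketch of how such explanations might be requested from the GPT-4 API is shown below; the prompt wording is an assumption, not the authors' exact instruction.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_explanation(question: str, context: str, physician_answer: str) -> str:
    """Ask GPT-4 for a plausible explanation supporting the physician's
    (possibly incorrect) yes/no answer, mimicking how a physician might argue."""
    prompt = (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"A physician answered '{physician_answer}'. "
        "Write a short, plausible medical explanation supporting that answer."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```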
Experiments and Results
This paper seeks to answer the following questions:
- Q1: Can a large-scale language model correct a physician's judgment if necessary?
- Q2: Can a large-scale language model explain the basis for its own answers?
- Q3: Can large language models correct answers based on arguments provided by physicians?
- Q4: Can a large-scale language model that incorporates physician responses outperform either the model alone or the physician alone?
We use the PubMedQA dataset for the experiments. This is a biomedical question-answering dataset built from PubMed abstracts, in which questions are normally answered "yes/no/maybe". For this experiment, the dataset is converted into a binary format ("yes" or "no" only), yielding 445 test examples. Using this data, GPT-4 is asked to generate plausible correct and incorrect explanations for each question.
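A minimal sketch of one way to obtain such a binary subset with the Hugging Face datasets library is shown below; the exact split and filtering used in the paper may differ.

```python
from datasets import load_dataset

# PubMedQA, labeled subset (yes/no/maybe answers derived from PubMed abstracts).
pubmedqa = load_dataset("pubmed_qa", "pqa_labeled", split="train")

# Binarize: keep only questions whose final decision is "yes" or "no".
binary_pubmedqa = pubmedqa.filter(lambda ex: ex["final_decision"] in ("yes", "no"))

# Note: the paper reports 445 test examples after its own split/filtering.
print(len(binary_pubmedqa), "binary examples")
```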
The models used are the medical-domain model Meditron-7B, the conversational Llama2-7B Chat, and Mistral-7B-Instruct (Jiang et al., 2023). The experiments were conducted via the Harness framework, whose source code is available online.
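The paper's experiments go through the evaluation harness, but as a rough, self-contained illustration, a single test input could be sent to one of the 7B models with the transformers library as below. The checkpoint name, instruction text, and example content are assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; the paper may use a different model revision.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Toy test input in the style described above (placeholder text, not from the paper).
prompt = (
    "Answer the biomedical question with 'yes' or 'no'. "
    "Take the physician's opinion into account, but correct it if it is wrong.\n\n"
    "Context: A randomized trial found lower mortality in the treated group.\n"
    "Question: Does the treatment reduce mortality?\n"
    "Physician: no\n"
    "Assistant:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())  # ideally "yes", i.e. the model corrects the physician
```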
The first results concern the importance of prompt design and are shown in the table below. Prompt design has a significant impact on the performance of large-scale language models. In particular, when correcting a physician's incorrect response, a properly designed prompt allows the large-scale language model to correct the physician effectively. For example, in Case 1d the Mistral model achieves notably high accuracy in the scenario where the physician always responds "no", even though "no" is the correct answer for only 38% of the questions. Llama2 and Meditron are also sensitive to prompt changes and perform better in certain scenarios.
The next results concern the models' ability to explain their answers and are shown in the table below. The extent to which each large-scale language model could explain the rationale for its responses was assessed. Meditron was found to maintain a high quality of explanation without being affected by the physician's short answers. Llama2, on the other hand, tended to obtain lower ROUGE-L scores in the cases where the physician responded correctly, while Mistral consistently provided superior explanations across multiple scenarios. These results demonstrate that, under properly constructed prompts, a large-scale language model can produce reliable explanations.
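For reference, a minimal sketch of computing a ROUGE-L score between a model explanation and a reference explanation with the rouge_score package is shown below; this is one common choice, and the paper's exact evaluation setup may differ. The example strings are placeholders.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "The intervention reduced mortality in the treatment group."      # e.g. a reference explanation
prediction = "Mortality was lower in patients who received the intervention."  # model-generated explanation

scores = scorer.score(reference, prediction)
print(scores["rougeL"].fmeasure)  # F-measure of the longest-common-subsequence overlap
```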
The next results examine the degree to which the models rely on the physicians' arguments. They show that when the physician adds an argument to the response, the large-scale language model relies more strongly on it. In Case 2a, Meditron achieves 100% accuracy when the physician consistently provides correct answers and explanations. This reflects Meditron's tendency to focus on the most recent example in the prompt, which yields remarkable performance in certain scenarios.
On the other hand, Llama2 is overly dependent on the arguments provided by the physician in all scenarios, whereas Mistral is more robust and less sensitive to prompt variations. In particular, in Case 2d, Mistral retains more than 75% accuracy in all scenarios, confirming its ability to effectively correct physicians who provide incorrect responses and arguments.
The next set of results relates to the quality and consistency of the explanations. Analysis of the ROUGE-L scores in Case 2 shows that Llama2 and Mistral generate more valid and more extensive explanations when the prompts include physician input. Meditron, in contrast, relies heavily on the physician input, which results in markedly higher-quality explanations. There are also differences in the consistency of the answers provided by each model, with Llama2 and Mistral tending to provide reasonable explanations regardless of the physician's position.
Furthermore, the results show that while a large-scale language model that incorporates expert responses can improve its performance, it is difficult for it to exceed the expert's own ability. The analysis for Case 3 (see table below) shows that while the base performance of the large-scale language models does not vary much from scenario to scenario, there are clear improvements under certain conditions. For example, Meditron exceeded its base performance in scenarios where physician accuracy exceeds 80%, and Llama2 exceeded its base performance in all scenarios where physician accuracy exceeds 85%.
However, the Mistral model was significantly affected by the physician's responses in Case 3, and its performance tended to deteriorate. This suggests that the ability of a large-scale language model depends on the quality of the information provided by the physician.
Furthermore, when an even larger model, such as a 70B model, was tested on prompts that include physician responses, the results were poor. Performance deteriorated when the same prompts were used, indicating that larger models do not necessarily guarantee better results. In particular, the Llama2-70B model achieving less than 55% accuracy on the multiple-choice MedQA dataset suggests that model size is not the key to improved performance.
Summary
Insights from this paper show that prompt design has a significant impact on the performance of large-scale language models, and that while the models are sensitive to changes in prompts, they can effectively correct incorrect physician responses given appropriate instructions and examples.
Also, if the prompts are carefully designed, a large-scale language model demonstrates the ability to explain its responses. In addition, large-scale language models tend to rely on the arguments that physicians provide with their responses, and are heavily influenced by the order of the few-shot examples.
It is also emphasized that a larger model (70B) does not always guarantee superior results, and that prompt quality is the key to improved performance. The results call for further research on prompt design and its impact. This study highlights the role of prompts in the development of medical AI and their impact on the interaction between large-scale language models and medical professionals.