Extracting Critical Information From Medical Documents Using InstructGPT

Natural Language Processing

3 main points
✔️ Proposed a method for automatically retrieving medical information in an interactive manner using InstructGPT
✔️ Experiments on tasks such as identifying abbreviations, extracting group information from medical experiments, and extracting medication information
✔️ Achieves significantly higher accuracy than prior studies in zero-shot and few-shot settings

Large Language Models are Few-Shot Clinical Information Extractors
written by Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, David Sontag
(Submitted on 25 May 2022 (v1), last revised 30 Nov 2022 (this version, v2))
Accepted as a long paper to The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


  The remarkable performance of dialogue models, exemplified by ChatGPT, which was released at the end of November 2022, is a hot topic. These advances have led to a rapid increase in research that approaches existing tasks interactively, and medical natural language processing is no exception.

In this paper, we propose a method for automatically extracting various types of information from medical documents using InstructGPT (here: text-davinci-002, text-davinci-edit-001) and confirm its accuracy. Experiments on four tasks (abbreviation (acronym) identification, coreference resolution, medical experiment group information extraction, and medication information extraction) confirm that the proposed method is significantly more accurate than conventional Zero-shot and Few-shot models.

What are InstructGPT and Prompt?

  InstructGPT is a large-scale language model developed by OpenAI. InstructGPT is not a dialogue-specific model like ChatGPT, but it is often used in a dialogue format, as in this study.

 Large-scale language models often use what is called a prompt, a kind of directive, as input to the model (see "Prompt: Create a list of medications." in the example below). The model then generates a plausible sentence that follows the prompt to create a response.

 The example above is taken from an actual experiment in this paper: we first present a medical record and then give the model the instruction "Create a list of medications." That is all that is required to obtain a list of drugs!

 In the following one-shot example, the model is shown a single input-output pair in advance, which demonstrates the expected output format and further improves the accuracy of the output. Answers that were previously returned as free-form natural language now come back in list form, which is easier to process by computer.
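The difference between a zero-shot and a one-shot prompt can be sketched in a few lines of Python. This is only a minimal illustration: the function name `build_prompt`, the "Prompt:" layout, and the sample texts are assumptions made for this example, not the exact templates used in the paper.

```python
from typing import Optional, Tuple


def build_prompt(note: str, instruction: str,
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Assemble a zero-shot prompt, or a one-shot prompt when `example` is given.

    `example` is a hypothetical (example_note, example_answer) pair.
    """
    parts = []
    if example is not None:
        example_note, example_answer = example
        # One worked input-output pair is placed before the real query.
        parts.append(f"{example_note}\n\nPrompt: {instruction}\n{example_answer}")
    # The real query ends right after the instruction, so the model
    # continues the text from that point with its answer.
    parts.append(f"{note}\n\nPrompt: {instruction}\n")
    return "\n\n".join(parts)


zero_shot = build_prompt("Patient takes aspirin daily.",
                         "Create a list of medications.")
one_shot = build_prompt("Patient takes aspirin daily.",
                        "Create a list of medications.",
                        example=("Started metformin.", '- "metformin"'))
```

Because the exemplar answer is already written as a bulleted, quoted list, the model tends to imitate that format, which is what makes the later post-processing simpler.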

Information extraction methods in this paper

 In this paper, we attempt to use this Prompt technique to acquire information from InstructGPT. The method is very simple: to retrieve the desired information from the output of InstructGPT, we simply post-process the output so that it is in the desired format.

 For example, the following task is to find out what the abbreviation "PA" in a medical record stands for. To do so, we enter the medical document followed by "Expand the abbreviation: [abbreviation]" as the prompt. InstructGPT then returns text that closely resembles the input but with the abbreviation expanded, and all we have to do is extract the expansion, "pulmonary artery" (shown in red in the figure), by post-processing. In this paper, the extraction is done by taking the difference between the output and the original document.
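The diff-based extraction described above could look roughly like the following sketch. It uses Python's standard `difflib` on whitespace-separated words; the function name `expansion_from_diff` and the sample sentences are hypothetical, and the paper's actual post-processing script may work differently.

```python
import difflib
from typing import List


def expansion_from_diff(original: str, rewritten: str) -> List[str]:
    """Return word spans that appear in `rewritten` but not in `original`."""
    a, b = original.split(), rewritten.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark text the model changed or added.
        if tag in ("replace", "insert"):
            spans.append(" ".join(b[j1:j2]))
    return spans


original = "The PA was dilated on imaging."
rewritten = "The pulmonary artery was dilated on imaging."
print(expansion_from_diff(original, rewritten))  # ['pulmonary artery']
```

A word-level diff is a deliberately simple choice here; if the model rewrites other parts of the sentence as well, the function returns several spans and a further selection step would be needed.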

  Similar procedures are used for drug information extraction, coreference resolution (the task of identifying the specific content to which a pronoun or other word refers), and group extraction from medical experiment texts. The information to be extracted is highlighted in blue in the figure below. Post-processing includes converting bulleted text output into a list structure and extracting the text inside quotation marks.
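The two post-processing steps mentioned above, turning a bulleted answer into a list and pulling out quoted text, can be sketched with regular expressions. The helper names `parse_bulleted_list` and `extract_quoted` and the sample outputs are assumptions for illustration, not the paper's own scripts.

```python
import re
from typing import List


def parse_bulleted_list(output: str) -> List[str]:
    """Collect the text of lines that start with a bullet or a number."""
    items = []
    for line in output.splitlines():
        match = re.match(r"^\s*(?:[-*•]|\d+[.)])\s+(.*\S)\s*$", line)
        if match:
            items.append(match.group(1))
    return items


def extract_quoted(output: str) -> List[str]:
    """Return every span enclosed in straight double quotes."""
    return re.findall(r'"([^"]+)"', output)


drugs = parse_bulleted_list("- aspirin\n- metformin 500 mg\n\nNo other drugs.")
referent = extract_quoted('The pronoun refers to "the patient".')
print(drugs)     # ['aspirin', 'metformin 500 mg']
print(referent)  # ['the patient']
```

Lines that do not start with a bullet are simply ignored, which is one reason such heuristics work better when a one-shot exemplar has already fixed the output format.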

 The paper makes two further contributions. First, because the medical document corpus has strict terms of use and cannot be sent to the OpenAI API as is, the authors created new data that can be sent externally (new annotations on existing data). Second, the prompts given to the model were designed by the authors themselves.

Experimental results

 The experiments compare the accuracy of the Zero-shot and One-shot settings with that of existing methods, including models fine-tuned on all available data.

Abbreviation Identification

 We can confirm that even Zero-shot is significantly more accurate than the fine-tuned supervised baseline. However, the baseline is not a strong model trained specifically on medical documents, so we do not know whether the proposed method would also beat such models.

Group extraction

 The results of the group extraction are shown in the "Abstract-level Accuracy" column on the right side of the table, where Zero-shot is significantly better than the prior model.

Coreference resolution

  Both Zero-shot and One-shot are also superior to the previous study. The numbers in parentheses indicate the number of lines in the post-processing script, and they show that Zero-shot requires more complex processing because the output format cannot be controlled. As a result, there seem to be cases where the answer cannot be extracted cleanly from the model output.

 There is also the curious result that, in the One-shot setting, the model is more accurate when the exemplar contains a wrong answer (INCORRECT) than when it contains the correct answer (CORRECT).

Drug information extraction

 Here too, One-shot is more accurate than the supervised model.


In this article, we presented a study on the automatic acquisition of medical information in a dialogue format. The experimental results were very good, and the paper strongly suggests that dialogue-style methods using large-scale language models are useful in the medical domain.

However, there are several challenges. First, generative models are prone to so-called hallucination, in which they output information that is not in the input. This is a poor fit for a method that is to be applied in actual medical practice. If an answer does not exist or cannot be found in the document for a given input, the model should be made to say so.
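One simple safeguard against hallucinated items, which is not taken from the paper and is shown here only as an assumption-laden sketch, is to check that each extracted item literally appears in the input document. The function name `split_grounded` and the sample data are hypothetical.

```python
from typing import List, Tuple


def split_grounded(items: List[str], document: str) -> Tuple[List[str], List[str]]:
    """Separate extracted items into those found in the document and those not."""
    # Normalize case and whitespace so trivial differences do not matter.
    doc = " ".join(document.lower().split())
    grounded, unsupported = [], []
    for item in items:
        key = " ".join(item.lower().split())
        if key and key in doc:
            grounded.append(item)
        else:
            unsupported.append(item)
    return grounded, unsupported


found, suspicious = split_grounded(
    ["Aspirin", "warfarin"],
    "Patient takes aspirin  81 mg daily.",
)
print(found)       # ['Aspirin']
print(suspicious)  # ['warfarin']
```

A check like this only suits extractive tasks such as drug extraction. It would wrongly flag abbreviation expansions, whose whole point is to produce text that is absent from the document.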

Second, post-processing depends heavily on heuristics. The post-processing rules are designed by looking at actual model outputs, so when the output format is less constrained, as is the case with Zero-shot, errors occur frequently.

Nevertheless, recent advances in large language models, such as the appearance of GPT-4, have been remarkable. We can expect to see more research using large language models in the future, including studies that address the issues above.
