JDocQA: A New Large-Scale Dataset for Measuring Question-Answering Ability over Japanese Documents
3 main points
✔️ Development of the JDocQA dataset: A new large-scale dataset focusing on documents containing visual information, designed to measure question-answering ability in Japanese.
✔️ Learning effects of unanswerable questions: Confirms that including questions whose answers cannot be found directly in the document reduces the model's tendency to generate inaccurate responses.
✔️ Evolution of multimodal models: state-of-the-art models such as GPT-4 and InstructBLIP suggest high adaptability and performance in question answering tasks that combine text and images.
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
written by Eri Onami, Shuhei Kurita, Taiki Miyanishi, Taro Watanabe
(Submitted on 28 Mar 2024)
Comments: LREC-COLING2024
Subjects: Computation and Language (cs.CL)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Understanding documents that contain both textual and graphical elements, such as slides, reports, web pages, and brochures, is a necessary skill for intelligent agents that answer questions about multimedia documents. While research on integrating these elements into document understanding has progressed, challenges remain when dealing with Japanese documents. Japanese documents are written in two styles, horizontal (left to right) and vertical (top to bottom), and agents need to understand both.
To address this issue, this paper develops the "JDocQA dataset," which consists of 11,600 question-answer pairs across four question categories, including 1,000 unanswerable questions. It is based on a collection of multi-format Japanese documents, manually labeled with question-answer pairs, making it a large, fully labeled Japanese document question-answering dataset. The dataset is also intended for practical cases in which visual information, as well as the text in the documents, needs to be considered in order to answer a question. In addition, it challenges research to reduce model-generated inaccurate answers, so-called "hallucinations," by including unanswerable questions whose answers are not directly written in the documents.
Recent advances in large-scale language models and multimodal models have greatly expanded the possibilities in this area. In particular, models such as GPT-4 and InstructBLIP can handle both text and images and have shown excellent performance in multimodal tasks. Research is actively underway to adapt these models to more specialized domains and languages, and the JDocQA dataset aims to contribute to advances in document comprehension and question answering tasks, especially in Japanese.
Experiments with the JDocQA dataset have shown that training that includes unanswerable questions helps reduce the tendency of the model to generate incorrect answers.
Dataset Overview
JDocQA is a groundbreaking dataset designed to improve question-answering skills in Japanese. It consists of 5,504 documents that combine textual and graphical elements, such as slides, reports, web pages, and brochures, and contains 11,600 question-answer pairs. Questions are grouped into four categories: yes/no, factoid, numerical, and open-ended, and the questions draw on both the textual and the visual information in the documents.
The statistics for the dataset are as follows:
- Yes/No questions: 1,855
- Factoid questions: 2,052
- Numerical questions: 1,866
- Open-ended questions: 5,827
This dataset is intended for cases where the model requires not only an understanding of the text but also an understanding of the visual information in order to answer questions about the documents. Of particular note is the inclusion of "unanswerable questions" for which no explicit answers are provided in the documents. These are intended to mimic the challenges a model might face in real-world applications and to help curb its tendency to generate inappropriate answers, so-called "hallucinations."
In addition, it includes 1,788 questions that require reference to multiple pages to obtain an answer and 1,000 questions whose correct answer is not mentioned in the text. This allows the model to accommodate a wide variety of question types and makes it possible to assess the ability to understand complex document structures. The table below shows the average length of contexts, questions, and answers in the JDocQA dataset.
The figure below also shows the categories of visual information referenced in questions and answers in the JDocQA dataset.
Additionally, a comparison of document question-answering datasets is shown in the table below.
The JDocQA dataset is a valuable resource for developers of question answering systems. It includes tasks where models generate textual answers based on document context and textual questions, addressing a wide range of user questions encountered in realistic applications. It also provides data indicating the categories of visual information referenced in questions and answers, facilitating the development of multimodal question-answering systems.
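To make the dataset structure described above concrete, the following is a minimal sketch of how a single JDocQA instance might be represented in Python. The field names (question, answer, question_category, is_unanswerable, page_indices, figure_bbox) are illustrative assumptions made for this article, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record layout for one question-answer pair; the released
# JDocQA data may use different field names and file formats.
@dataclass
class JDocQAInstance:
    document_id: str            # source PDF document
    question: str               # question text (Japanese)
    answer: str                 # reference answer, or an "unanswerable" marker
    question_category: str      # "yes/no", "factoid", "numerical", or "open-ended"
    is_unanswerable: bool = False                           # answer not stated in the document
    page_indices: list[int] = field(default_factory=list)   # pages needed (may span several)
    figure_bbox: Optional[tuple[float, float, float, float]] = None  # annotator-specified crop, if any

# Illustrative example only (not taken from the dataset):
example = JDocQAInstance(
    document_id="report_0001",
    question="この報告書で示されている施策はいくつありますか。",
    answer="3",
    question_category="numerical",
    page_indices=[2, 3],
)
```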
How the Dataset Was Created
The overall flow of the JDocQA dataset creation process is shown in the figure below.
First, consider the collection of PDFs. The foundation of this dataset is a wide range of public documents created by Japanese government agencies and local governments. The authors manually collect a wide variety of PDF documents from the digital collections of the National Diet Library, web archiving projects, and the websites of government ministries and agencies. These documents cover a variety of topics, ranging from economic policy to education policy to health and sanitation. They also contain a wealth of visual elements, such as charts and photographs, which the team notes played an important role in developing the question-answering system.
The authors also use the PyPDF2 tool to extract text from the documents. Since text cannot be extracted directly from PDFs created from paper scans, an alternative text source is generated using OCR (optical character recognition). The extracted text is normalized by removing misrecognized symbols, pictograms, and duplicate characters.
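As an illustration of this kind of extraction and normalization step, the sketch below uses PyPDF2's PdfReader (available in recent PyPDF2 releases) together with a simple regex-based cleanup. The specific normalization rules are assumptions made for this example; the paper does not spell out its exact filtering.

```python
import re
from PyPDF2 import PdfReader

def extract_and_normalize(pdf_path: str) -> str:
    """Extract text from a PDF and apply a simple normalization pass."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Illustrative normalization: strip private-use glyphs (pictograms),
    # collapse long runs of a repeated character, and squeeze whitespace.
    text = re.sub(r"[\ue000-\uf8ff]", "", text)
    text = re.sub(r"(.)\1{4,}", r"\1", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# For scanned PDFs with no embedded text layer, an OCR engine would be used
# in place of extract_text(); that step is omitted here.
```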
Next, we turn to the annotations: 43 annotators created question-answer pairs on documents containing rich textual and visual information. For each document, two to four question-answer annotations were created, and the questions were based on both textual and visual information. Annotators were also instructed not to use AI tools. In particular, the inclusion of unanswerable questions increases the realism and utility of the dataset.
Three types of visual input images are also provided for the multimodal model: the first is an image of the full document page, the second is an image of a table or figure cropped using bounding boxes specified by the annotator, and the third is a blank image for ablation studies. According to the authors, this enables detailed analysis of how the model processes visual information and uses it in question answering.
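These three input variants can be sketched with Pillow as below; the bounding-box coordinates and image size are placeholders, and rendering the PDF page to an image (for example with pdf2image) is assumed to have been done beforehand.

```python
from PIL import Image

def make_visual_inputs(page_image_path: str, bbox=None, size=(1024, 1024)):
    """Return the three visual-input variants described for the multimodal model."""
    # (1) full document page
    full_page = Image.open(page_image_path).convert("RGB")

    # (2) table/figure region cropped with the annotator-specified bounding box
    cropped = full_page.crop(bbox) if bbox is not None else None  # bbox = (left, upper, right, lower)

    # (3) blank image used for ablation studies
    blank = Image.new("RGB", size, color="white")

    return full_page, cropped, blank
```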
The development of the JDocQA dataset thus involves multiple steps, from extensive document collection through text extraction and normalization to the annotation of diverse question-answer pairs. This effort enables the development of high-quality multimodal question-answering systems that can be applied in more realistic scenarios.
Experiments and Results
In a series of experiments using the JDocQA dataset, various text-input models were tested and their performance analyzed in detail. These experiments measured how effectively the models were able to answer questions, with a particular focus on how they handled "unanswerable questions." The table below provides a summary of the experimental results.
Looking at the models trained on all instances, those trained on all data, including unanswerable questions, showed superior results compared to the standard models, including GPT-3.5 and GPT-4. This was especially true for the larger models. Interestingly, the use of training data containing unanswerable questions suggests that the models adapt to more realistic scenarios, improving overall performance.
Turning to models trained without unanswerable questions, models trained with these questions excluded were also tested. They performed slightly worse on average scores than those trained with unanswerable questions included. This underscores the importance of including unanswerable questions in the training data in order to reduce the models' tendency toward so-called "hallucination," that is, generating responses based on information that does not exist.
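A minimal sketch of the two training conditions compared here is shown below, assuming instances carry an is_unanswerable flag as in the earlier record sketch; the fine-tuning itself is omitted.

```python
def build_training_sets(instances):
    """Split the data into the two conditions compared in the experiments:
    (a) all instances, and (b) instances with unanswerable questions removed."""
    all_instances = list(instances)
    answerable_only = [ex for ex in instances if not ex.is_unanswerable]
    return all_instances, answerable_only

# Training on all_instances exposes the model to cases where the correct
# behavior is to state that the document does not contain the answer, which
# the experiments associate with fewer hallucinated responses.
```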
Multimodal models such as StableLM-InstructBLIP-Alpha perform particularly well when using cutout images of referenced tables and figures. This suggests that visual input, as well as textual input, plays an important role in question answering tasks.
The impact of different token lengths on model performance is also examined. Models with longer token lengths tend to show better results, but have the disadvantage of being more computationally expensive.
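As a concrete illustration of what varying the token length means in practice, the sketch below trims the document context to a fixed token budget using a Hugging Face tokenizer; the model name and maximum length are placeholders, not the paper's actual configuration.

```python
from transformers import AutoTokenizer

# Placeholder tokenizer for illustration; any Japanese-capable tokenizer works here.
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)

def truncate_context(context: str, question: str, max_length: int = 1024) -> str:
    """Keep the question intact and trim the document context to fit the token budget."""
    question_ids = tokenizer.encode(question, add_special_tokens=False)
    budget = max(max_length - len(question_ids), 0)
    context_ids = tokenizer.encode(context, add_special_tokens=False)[:budget]
    return tokenizer.decode(context_ids) + "\n" + question
```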
The performance of the model for different document types is also analyzed to assess the impact of each type on the model, such as brochures, slides, and report documents. This allows for the development of models that are optimized for specific document types.
The qualitative analysis of the experimental results also provides specific examples of model-generated responses, with particular attention paid to the model's responses to unanswerable questions.
Human evaluations are also conducted to verify the accuracy and reliability of the responses generated.
Overall, these experimental results test a variety of approaches and validate their effectiveness in developing question-answering systems using the JDocQA dataset. In particular, the importance of including unanswerable questions in the training data is emphasized and shown to improve the adaptability of the model in real-world applications.
Summary
This paper provides a new large-scale dataset called the "JDocQA dataset," offering a new perspective on the Japanese question-answering task. Through the fusion of visual and textual information, the goal is to develop models with deeper understanding and response capabilities. In particular, experiments using this dataset have confirmed that the inclusion of "unanswerable questions," for which answers cannot be found directly in the document, has the effect of reducing the number of incorrect answers, or so-called "hallucinations," generated by the model.
Experimental results suggest the utility of the JDocQA dataset for addressing the diverse challenges faced by question-answering systems. It has been shown to effectively address questions across a wide range of categories, from yes/no questions to open-ended questions. Furthermore, it has been shown that accurately predicting unanswerable questions may contribute to improving the overall performance of the model.
The JDocQA dataset contributes to the development of intelligent question-answering systems that can respond appropriately even when questions are not explicitly answered by the text in the document. This will enable applications in more realistic scenarios and further advances in question-answering technology.