
Improving the Accuracy of Long-Form QA Tasks! "R&R": A New Method Combining Reprompting and In-Context Retrieval to Mitigate the "Lost in the Middle" Phenomenon in Large Language Models

Large Language Models

3 main points
✔️ Introduction of the R&R method: a new method combining "reprompting" and "in-context retrieval" is developed to improve the performance of large language models on long-document QA tasks.
✔️ Mitigates "lost in the middle": it reduces the problem of relevant information being missed when it sits in the middle of a document. Large language models tend to be biased toward the beginning or end of a document, or toward context near the key instructions; reprompting shortens the distance between the relevant information and the instructions and thereby improves answer accuracy.
✔️ Optimizes accuracy and cost: combined with a chunk-wise approach, the R&R method improves the performance of large language models even for long contexts. By allowing larger chunks, it minimizes the number of LLM calls and token usage while limiting the loss of accuracy.

Can't Remember Details in Long Documents? You Need Some R&R
written by Devanshu Agrawal, Shang Gao, Martin Gajek
(Submitted on 8 Mar 2024)
Comments: Published on arxiv. For associated code repository see this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

The field of natural language processing has seen significant developments with the rise of large language models. These models allow users to perform all kinds of tasks simply by entering a text "prompt." However, it is well known that in question-answering (QA) tasks over long texts, relevant information in the middle of the input is easily overlooked.

Recently, large language models have been released that support very long contexts, such as GPT-4 Turbo and Claude-2.1, which offer context windows of 128k and 200k tokens, respectively. Although these models accept long contexts, the quality of their responses tends to degrade when input prompts are very long. Liu et al. (2023) found that even with a 16k-token context, accuracy in document-based QA drops significantly when the relevant context is located in the middle of the document, compared to when it appears at the beginning or the end. This phenomenon is known as "lost in the middle."

To address this, this paper proposes a new approach, R&R, which combines "reprompting" and "in-context retrieval." In this approach, the instructions for the question are repeated throughout the document so that the most relevant information can be extracted efficiently. The aim is to improve QA accuracy and the performance of large language models on long contexts.

The paper details how this approach works and suggests effective strategies to mitigate the "lost in the middle" effect in long-document QA tasks. Through experiments, it also demonstrates the potential of "reprompting" and "in-context retrieval" in the use of large language models. This approach could expand the range of applications of NLP technology by increasing accuracy and efficiency when dealing with long documents.

Technique

This paper spotlights document-based question-answering (QA) tasks and proposes an innovative method using large-scale language models.

This approach asks the large-scale language model to answer questions based on the context of a given document. To accomplish this, the prompt is divided into three sections to clarify the instructions. The question and its answer instructions are first tagged, followed by the document itself.

Finally, the instructions are repeated again just before the large-scale language model generates the answer. This repetition is based on a technique that has been used for some time, and is intended to effectively guide the response without information being lost in the document.

It also assumes that the document is divided into "pages." These pages correspond to natural breaks in a document, such as paragraphs or sentences, and are referred to as pages for standardization. Each page is wrapped in <PAGE {p}> . . . </PAGE {p}> tags, where {p} is replaced by the appropriate page number. This approach allows the large language model to process the entire document more efficiently and to accurately extract information relevant to the question.
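As a rough illustration, the prompt might be assembled along the following lines. The PAGE tag format and the repetition of the instructions at the end follow the description above; the helper name and the exact instruction wording are assumptions made for this sketch.

```python
def build_prompt(question: str, pages: list[str]) -> str:
    """Assemble a document-QA prompt: instructions, tagged pages, then the
    instructions repeated just before the model generates its answer."""
    # Hypothetical instruction wording -- not the paper's exact prompt.
    instructions = (
        "Your task is to answer the question below using only the provided document.\n"
        f"Question: {question}"
    )
    # Wrap each page in <PAGE {p}> ... </PAGE {p}> tags, numbered from 1.
    tagged_pages = "\n".join(
        f"<PAGE {p}>\n{text}\n</PAGE {p}>" for p, text in enumerate(pages, start=1)
    )
    # Repeat the instructions at the end so they sit right before the answer.
    return f"{instructions}\n\n{tagged_pages}\n\n{instructions}\nAnswer:"
```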

In addition, we have introduced a technique called "reprompting." It is intended to reduce the "lost in the middle" phenomenon, in which large language models are biased toward the beginning or end of a document or toward context near the key instructions. In reprompting, a reminder block of the form <INSTRUCTIONS_REMINDER> your task is . . . </INSTRUCTIONS_REMINDER>, containing the original instructions almost verbatim, is inserted outside the PAGE blocks every fixed number of tokens in the document. This is expected to reduce the distance between the relevant information and the instructions at any position in the document and improve the accuracy of the model's responses.
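A minimal sketch of how such reminder blocks could be interleaved between pages is shown below. The whitespace-based token count, the default interval, and the exact reminder wording are illustrative assumptions, not the paper's implementation.

```python
def insert_reminders(pages: list[str], instructions: str,
                     interval_tokens: int = 10_000) -> str:
    """Interleave <INSTRUCTIONS_REMINDER> blocks between tagged pages roughly
    every `interval_tokens` tokens, never splitting a PAGE block."""
    reminder = f"<INSTRUCTIONS_REMINDER>\n{instructions}\n</INSTRUCTIONS_REMINDER>"
    parts, tokens_since_reminder = [], 0
    for p, text in enumerate(pages, start=1):
        parts.append(f"<PAGE {p}>\n{text}\n</PAGE {p}>")
        tokens_since_reminder += len(text.split())  # crude whitespace token estimate
        if tokens_since_reminder >= interval_tokens:
            parts.append(reminder)  # placed between pages, outside any PAGE block
            tokens_since_reminder = 0
    return "\n".join(parts)
```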

In addition, we have introduced "in-context retrieval" and chunking. In-context retrieval is based on the idea that extracting the information relevant to a question from a document is generally simpler than answering the question directly, because the extraction step prioritizes recall over precision. The process runs in two stages: first the pages most relevant to the question are identified, and then the question is answered using a shortened document containing only those pages. This design allows large language models to process information efficiently. Moreover, combining "reprompting" with "in-context retrieval" makes it possible to extract important information from the middle of a document without missing it: by repeating the retrieval instructions throughout the document, the model is helped to find relevant pages buried near the middle.
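The two-stage flow could be sketched roughly as follows. The `call_llm` callable, the retrieval and QA instruction texts, and the naive parsing of page numbers are all placeholders rather than the paper's actual prompts.

```python
import re

def tag_pages(pages: list[str]) -> str:
    """Wrap each page in <PAGE {p}> ... </PAGE {p}> tags."""
    return "\n".join(f"<PAGE {p}>\n{t}\n</PAGE {p}>" for p, t in enumerate(pages, 1))

def retrieve_pages(question: str, pages: list[str], call_llm) -> list[str]:
    """Stage 1: ask the model which pages are most relevant to the question."""
    prompt = (
        "List the numbers of the pages most relevant to answering the question.\n"
        f"Question: {question}\n\n{tag_pages(pages)}\n\nRelevant page numbers:"
    )
    hits = {int(n) for n in re.findall(r"\d+", call_llm(prompt))}
    return [t for p, t in enumerate(pages, 1) if p in hits]

def answer_from_pages(question: str, pages: list[str], call_llm) -> str:
    """Stage 2: answer from a shortened document containing only the given pages."""
    prompt = (
        "Answer the question using only the document below.\n"
        f"Question: {question}\n\n{tag_pages(pages)}\n\nAnswer:"
    )
    return call_llm(prompt)

def icr_qa(question: str, pages: list[str], call_llm) -> str:
    """Two-stage in-context retrieval followed by QA on the shortened document."""
    relevant = retrieve_pages(question, pages, call_llm)
    return answer_from_pages(question, relevant, call_llm)
```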

Chunking divides the document into non-overlapping, contiguous chunks and performs "in-context retrieval" on each chunk independently. This allows the most relevant information to be extracted efficiently, reducing the number of LLM calls while maintaining accuracy. If the chunks are large enough, "reprompting" can also be applied within a chunk, further optimizing the balance between accuracy and efficiency. This opens up the possibility of achieving higher performance on more complex documents.
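Reusing the hypothetical retrieve_pages and answer_from_pages helpers from the previous sketch, chunk-wise processing might look like the following; the chunk size and the simple pooling of pages across chunks are illustrative assumptions.

```python
def chunkwise_icr(question: str, pages: list[str], call_llm,
                  chunk_pages: int = 50) -> str:
    """Run stage-1 retrieval independently on contiguous, non-overlapping chunks
    of pages, then answer once over the pooled relevant pages."""
    relevant: list[str] = []
    for start in range(0, len(pages), chunk_pages):
        chunk = pages[start:start + chunk_pages]
        # One retrieval call per chunk; if the chunk is long enough, reminder
        # blocks (reprompting) can also be interleaved within it.
        relevant.extend(retrieve_pages(question, chunk, call_llm))
    # A single final QA call over the aggregated relevant pages.
    return answer_from_pages(question, relevant, call_llm)
```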

Experiments and Results

This paper examines the effectiveness of R&R in document-based question-answering (QA) tasks. The table below summarizes the fuzzy match scores obtained for each dataset and long-context method (excluding chunking) at different document lengths (d). Reprompting broadly outperforms the baseline, especially with GPT-4 Turbo at d = 80k, where R&R tends to produce further accuracy gains.

The additional cost of reprompting is minimal: it consumes about 1.15% more input tokens than the baseline at d = 80k and incurs no additional cost in output tokens. R&R similarly consumes about 1.15% more input tokens than the baseline at d = 80k, but it requires an additional LLM call for the ICR step, which produces an average of 83 output tokens per sample, compared with 43 output tokens for the baseline and reprompting cases. Even so, these results suggest that R&R is effective at extending the usable context range of large language models in document-based QA.
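As a rough, hedged consistency check that combines the numbers above with the 10k-token reprompting interval discussed later, the reported overhead corresponds to reminder blocks on the order of a hundred tokens each:

```python
doc_tokens = 80_000          # document length d
reminder_interval = 10_000   # reprompting interval examined later in the paper
overhead_fraction = 0.0115   # reported extra input tokens relative to the baseline

n_reminders = doc_tokens // reminder_interval            # roughly 8 reminder blocks
tokens_per_reminder = overhead_fraction * doc_tokens / n_reminders
print(round(tokens_per_reminder))                        # ~115 tokens per block
```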

In addition, chunk-wise ICR and chunk-wise R&R (i.e., with reprompting added) were run to compare the benefits of longer contexts with reprompting against shorter contexts with a chunk-based approach. The table below shows the fuzzy match scores for each dataset and method while varying the chunk size (c) at which ICR and R&R were performed.

In general, accuracy tends to decrease as chunk size increases for most data sets, as additional filler context reduces search accuracy. However, the results suggest that reprompting may actually make larger chunks usable, with less loss of accuracy as chunk size increases.

This has important implications for the accuracy/cost tradeoff. Smaller chunks require more LLM calls (one per chunk, plus a QA call after aggregation), more input tokens, and more output tokens. Output tokens are particularly costly, priced at three times the input tokens in GPT-4 Turbo, and LLM runtime grows linearly with the number of output tokens. The results therefore suggest that reprompting mitigates this tradeoff by allowing larger chunks, requiring fewer LLM calls and output tokens while minimizing the loss of accuracy. Furthermore, while reprompting itself requires a small number of additional input tokens, this cost is offset by the reduction in input tokens from using larger chunks.
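To make the tradeoff concrete, here is a back-of-envelope count of LLM calls under the chunk-wise scheme described above (one retrieval call per chunk plus one final QA call); the document length is only an example.

```python
import math

def n_llm_calls(doc_tokens: int, chunk_tokens: int) -> int:
    """Chunk-wise processing: one retrieval call per chunk plus one final QA call."""
    return math.ceil(doc_tokens / chunk_tokens) + 1

# Example: an 80k-token document.
print(n_llm_calls(80_000, 10_000))   # 9 calls with small 10k-token chunks
print(n_llm_calls(80_000, 40_000))   # 3 calls with larger 40k-token chunks
```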

We also adopt this approach for in-context retrieval (ICR), based on the hypothesis that extracting the most relevant pages from a document is easier than direct question answering, because page extraction prioritizes recall over precision. We test this hypothesis by comparing direct document-based QA with the task of "extracting the pages most relevant to answering the question." NQ is excluded from this experiment because its initial pages contain misleading information, and HotPotQA because the relevant context is scattered across multiple pages. On SQuAD and PubMed, however, page extraction proves significantly more accurate than direct question answering at a document length of d = 40k.

With respect to the frequency of reprompts, we validate the choice of every 10k tokens and find that this achieves the highest QA accuracy across all data sets.


For the placement of reprompts, we test the hypothesis that reprompting only immediately before the relevant context significantly improves accuracy. Specifically, inserting a single INSTRUCTIONS_REMINDER block just before the PAGE block marked as containing the "gold passage" achieves better QA accuracy than uniformly reprompting every 10k tokens in three of the three datasets at a document length of d = 40k. This suggests that reprompting works by reducing the distance between the relevant context and the task instructions.

Furthermore, we find that reminder blocks that merely hint at the original instructions perform worse than those that restate them. Likewise, tests in which the reminder block was placed at the start of the document yield significantly worse results than the original reprompting. Together, these results indicate that reprompting is not mere repetition: its benefit comes from shortening the distance between the question and the relevant context through strategic placement of near-verbatim reminders.

Summary

This paper develops a prompt-based method, R&R, to explore how the performance of large language models on long documents can be improved in document-based question-answering (QA) tasks. The method is found to be particularly effective in mitigating "lost in the middle." Furthermore, the results suggest that reprompting works by minimizing the distance between the relevant context and the task instructions.

For extraction-type QA tasks, the chunk-wise approach provides a solid foundation, but R&R can also be performed chunk by chunk. Even in this setting, reprompting proves beneficial: by allowing the use of larger chunks, it reduces the number of LLM calls and token usage while limiting the loss of accuracy. R&R thus balances accuracy and cost, and the flexibility of the chunk-wise approach offers cost savings in practical applications where accuracy is critical.

Future research directions are varied and promising: combining R&R with other prompt-based methods may further improve performance. New approaches such as "in-context chunking" could also be considered to further optimize the accuracy/cost trade-off. Applying reprompting to tasks that require a more comprehensive understanding of the document, such as summarization, could open up new areas of research. Finally, while these are purely prompt-based methods, a deeper understanding of their benefits and limitations could shed light on the behavior of large language models on long texts and provide hints for architectural changes that drive further improvements.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
