
What Part Of The Context Does The Large-scale Language Model Use?


Large Language Models

3 main points
✔️ Examine how large-scale language models exploit long contexts using two experiments
✔️ Experiments show that performance is highest when relevant information is at the beginning or end of the input context
✔️ In addition, performance drops significantly when the relevant information is in the middle of the input context

Lost in the Middle: How Language Models Use Long Contexts
written by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang
(Submitted on 6 Jul 2023)
Comments: 15 pages, 17 figures

Subjects: Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, large-scale language models have become part of human life, used in conversational interfaces and for a variety of tasks such as searching, summarizing, and collaborative writing.

While these language models can take much longer contexts as input than previous models, the extent to which they actually make use of those longer contexts has not been well investigated.

Against this background, this article describes a paper that investigates how language models use their input context, using two experiments (multi-document question answering and key-value retrieval) to examine how they make use of long contexts.

Multi-Document Question Answering

The purpose of this experiment is to investigate how the language model uses the input context. To this end, the paper employs multi-document question answering, a task in which the model must find relevant information in the input context and use it to answer a question.

In addition, the length of the input context and the position of the relevant information within it are varied to measure changes in model performance.

Experimental Setup

In this experiment, the model's input is a question to be answered and k documents (e.g., passages from Wikipedia). To perform the task, the model must locate the document containing the answer within the input context and use it to answer the question.

The figure below is an example. (Relevant text to correctly answer the question is shown in bold.)

In this experiment, we instantiate this task on the NaturalQuestions benchmark (a dataset containing historical queries issued to the Google search engine and human-annotated answers extracted from Wikipedia).

In this task, the authors

  • adjust the length of the input context by increasing or decreasing the number of documents that do not contain the answer
  • adjust the position of the relevant information by changing the order of the documents in the input context

These two adjustments make it possible to evaluate the models under a variety of conditions (a minimal sketch of this prompt construction is shown below).
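The following is a minimal Python sketch of how such an input context might be assembled. The function name, document formatting, and instruction text are illustrative assumptions, not the paper's released code or exact prompt template.

```python
import random

def build_qa_prompt(question, gold_document, distractor_documents,
                    num_documents, gold_position):
    """Assemble a multi-document QA prompt with the answer-bearing ("gold")
    document placed at a chosen position among distractor documents.
    (Illustrative sketch; not the paper's exact template.)"""
    # Sample enough distractors so the context holds num_documents in total.
    distractors = random.sample(distractor_documents, num_documents - 1)

    # Insert the gold document at the requested position
    # (0 = beginning of the context, num_documents - 1 = end).
    documents = (distractors[:gold_position]
                 + [gold_document]
                 + distractors[gold_position:])

    context = "\n\n".join(f"Document [{i + 1}] {doc}"
                          for i, doc in enumerate(documents))
    return ("Answer the question using only the provided documents.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")

# Example: a 20-document context with the gold document in the middle (position 10).
# prompt = build_qa_prompt(question, gold_doc, distractor_pool, 20, 10)
```

Sweeping the gold position from the first to the last slot while keeping the total number of documents fixed is what produces the position curves discussed in the results below.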

Models

In this paper, we experimented with the following six large-scale language models.

  1. MPT-30B-Instruct: pre-trained on 1 trillion tokens using 2048-token sequences; can handle contexts of up to 8192 tokens
  2. LongChat-13B-16K: based on LLaMA-13B with the context window extended to 16K tokens
  3. GPT-3.5-Turbo-0613: GPT model capable of supporting up to 4K token contexts
  4. GPT-3.5-Turbo-16K-0613: GPT model capable of supporting up to 16K tokens in context
  5. Claude-1.3: Model capable of handling up to 8K tokens in context
  6. Claude-1.3-100K: Model capable of supporting up to 100K tokens in context

Standard prompts were used for each model, and experiments were conducted with input contexts containing 10, 20, and 30 documents.

Results

The following figure shows how performance on the multi-document question answering task changes when the position of the relevant information is varied. (The vertical axis shows accuracy, and position 1 on the horizontal axis corresponds to the beginning of the input context.)

The figure shows that performance is highest when the relevant information is placed at the beginning or end of the context, and degrades sharply when the model must use information in the middle of the input context.

In addition, the following figure shows how performance on the multi-document question answering task changes when the length of the input context is varied. (The vertical axis shows accuracy, and the horizontal axis represents the length of the input context.)

From the figure, it can be seen that the performance of the language model in this task decreases as the input context becomes longer.

Key-value Retrieval

Given the poor performance of the language models on the multi-document question answering task when using information in the middle of the input context, a question arises: to what extent can language models retrieve information from their input context at all?

In this paper, we adopted the task of key-value retrieval and conducted experiments to answer this question.

Experimental Setup

In key-value retrieval, the input is a JSON object serialized as a string containing k key-value pairs, and the goal of the task is to return the value associated with a specified key.

Each JSON object thus contains one relevant key-value pair and k-1 irrelevant (distractor) key-value pairs.

An example is shown in the figure below; the model is evaluated on whether the correct value appears in its output. (The relevant key-value pair is shown in bold.)

In this task, the authors

  • adjust the length of the input context by adding or removing random key-value pairs
  • adjust the position of the relevant information by changing the position of the key to be retrieved within the serialized JSON object

These two adjustments again make it possible to evaluate the models under a variety of conditions (a minimal sketch of how such data could be generated is shown below).
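The following is a minimal Python sketch of how such a synthetic key-value retrieval example could be generated. The paper uses random UUIDs for keys and values; the function name and prompt wording here are illustrative assumptions.

```python
import json
import uuid

def build_kv_prompt(num_pairs, gold_position):
    """Generate a key-value retrieval example with the target key placed at a
    chosen position in the serialized JSON object.
    (Illustrative sketch; prompt wording is an assumption.)"""
    keys = [str(uuid.uuid4()) for _ in range(num_pairs)]
    values = [str(uuid.uuid4()) for _ in range(num_pairs)]

    # The pair to retrieve sits at gold_position (0 = start, num_pairs - 1 = end).
    gold_key, gold_value = keys[gold_position], values[gold_position]

    # Serialize all pairs as a JSON object; insertion order is preserved.
    data = json.dumps(dict(zip(keys, values)), indent=1)
    prompt = ("Extract the value corresponding to the specified key "
              "from the JSON object below.\n\n"
              f"{data}\n\nKey: \"{gold_key}\"\nCorresponding value:")
    return prompt, gold_value

# Example: 140 key-value pairs with the target key in the middle (position 70).
# prompt, expected_value = build_kv_prompt(140, 70)
```

Because the expected value is known exactly, evaluation reduces to checking whether it appears in the model's output.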

The language models used in this experiment are the same as those used for multi-document question answering.

Results

The following figure shows how performance on the key-value retrieval task changes when the length of the input context and the position of the relevant information are varied. (The vertical axis shows accuracy, and position 1 on the horizontal axis corresponds to the beginning of the input context.)

In this task as well, except for models that perform it almost perfectly (e.g., Claude-1.3), performance was highest when the relevant information was placed at the beginning or end of the context, and degraded sharply when it was in the middle of the input context.

Summary

In this article, we introduced a paper that investigates how language models use their input context, examining how they make use of long contexts through two experiments (multi-document question answering and key-value retrieval).

The two experiments in this paper demonstrate that the language model performs best when the relevant information is placed at the beginning or end of the context, and that performance degrades rapidly when the relevant information is placed in the middle of the input context.

The experimental results are an important finding that addresses the question of how language models use their input context, and they provide hints for making better use of language models.

Studying these black-box aspects of language models will help deepen our understanding of them, and future developments are worth watching closely.

For those interested, details of the models and experimental results presented here can be found in the paper.

