
Google's High-performance LLM That Compresses Very Long Prompt Sentences To Save Memory


Large Language Models

3 main points
✔️ LLMs have a limit on the length of prompts they can accept, so they cannot summarize very long texts, among other limitations
✔️ Proposes an attention mechanism for LLMs that compresses past prompt content into fixed parameters by introducing a memory component
✔️ Enables processing of prompts of effectively infinite length, and achieves the best performance on a book summarization task

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
written by Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal
(Submitted on 10 Apr 2024)
Comments: 9 pages, 4 figures, 4 tables

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

To understand a text, it is necessary to grasp the overall context and then interpret each individual token (chunk of text) within it.

In a Large Language Model (LLM), the length of context that can be taken into account is called the context window size. To fully understand a prompt, the context window must be large enough for the length of the input prompt; the context window size therefore also determines the length of prompt that can be adequately processed.

In May 2024, OpenAI announced its newest LLM, GPT-4o, with a context window of 128,000 tokens, which according to the OpenAI blog is about 300 pages of text. When ChatGPT was announced in November 2022, the context window was 4,000 tokens, so GPT-4o's context window is 32 times larger than that of the initial ChatGPT.

Context window sizes have thus grown over time, driven by the need to understand and process longer contexts.

For example, if an LLM cannot grasp long contexts, several problems can arise: it may fail to summarize long texts well; it may be unable to take long task descriptions into account in in-context learning and therefore not give the expected answers; it may not be given enough example responses with sufficient variation for in-context learning; and it may not receive enough relevant document information obtained through RAG (Retrieval-Augmented Generation, i.e., embedding retrieved information in the prompt).

In order to clearly communicate what we want the LLM to do, provide it with sufficient information, and reduce hallucination, it is desirable to be able to increase the context window size. Recently, the terms mega-prompt (a very long prompt) and many-shot (giving many teaching examples in in-context learning) have emerged in the context of LLMs. Furthermore, reports are emerging that RAG is better suited than fine-tuning when the goal is to answer with correct facts. In short, LLMs are expected to perform better by taking full advantage of the context window.

Thus, even with a context window equivalent to 300 pages of text, the larger and more complex the task to be solved by in-context learning, the larger the required context window becomes.

Therefore, the context window size (length of prompts that can be processed) is important to the LLM.

What if there were an LLM with an infinitely long context window?

We could then ask about exactly what we want to know without reading very long documents ourselves, and obtain the best possible answer by feeding in all potentially relevant literature, without retraining or fine-tuning the LLM. This opens up the possibility of easily adapting an LLM through in-context learning, without redoing pre-training or fine-tuning.

The Transformer, the machine learning model on which LLMs are based, has the problem that computation and memory usage grow in proportion to the square of the prompt length. The proposed method, Infini-Transformer, compresses past prompt content into fixed parameters and stores it there, thereby limiting memory usage.

We will now introduce the model and evaluation results.

Infini-Transformer: a Transformer with the new attention mechanism Infini-attention

Issue: Transformer's attention mechanism uses a lot of memory to hold keys and values

In the Transformer attention mechanism used in LLMs, each token in the input prompt is understood as follows: for the token being processed (the query), its similarity to the surrounding tokens (the keys) is calculated, and the token's features are updated using the corresponding values weighted by this similarity. In this way, the context (the tokens before and after the token being processed, including that token itself) is taken into account.

The similarity between a query and a key is computed as their inner product. Since the resulting score matrix has size sequence length x sequence length, computation and memory usage grow quadratically with the prompt length.
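As a minimal illustration (not code from the paper), the NumPy sketch below shows standard scaled dot-product attention; the `scores` matrix of shape (N, N) is what grows quadratically with the prompt length N.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (N, d_key); V: (N, d_value). Returns (N, d_value)."""
    d_key = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_key)                # (N, N): quadratic in the sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V

N, d_key, d_value = 1024, 64, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(N, d)) for d in (d_key, d_key, d_value))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1024, 64)
```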

For example, for one LLM with 500 billion model parameters and a context length of 2,048 tokens, the memory needed to hold the keys and values was reported to be 3 terabytes.

If the prompt is so long that these states exceed the machine's memory, it physically cannot be accepted, and if the computation is too heavy, the LLM will not respond at all.

As described above, the conventional attention mechanism builds key and value matrices that grow with the length of the prompt (the input token sequence), so memory usage keeps growing as prompts get longer.

Solution: Infini-attention, an attention mechanism that keeps previous keys and values in compressed memory

Solution Ideas

This paper therefore proposes an attention mechanism that splits the input prompt into segments, processes them in order from the front, and keeps the previous keys and values in a compressed memory with fixed-size parameters. Memory usage depends on the parameters (matrix size) of the compressed memory, and because those parameters have a fixed size, memory usage and computation remain bounded no matter how long the prompt is.

Overall structure of Infini-Attention

Infini-attention converts the input token sequence into a sequence of segments and computes dot-product attention within each segment. The segment sequence is the input token sequence divided into segments of length N, with each segment identified by an index s.

Infini-Attention does not discard the keys and values of previous segments, but keeps them in compressed memory.

The structure is, so to speak, a combination of global attention (attention that takes past segments into account) and local attention (attention within the segment currently being processed). This structure is shown in Figure 1.

Figure 1. Block diagram of Infini-attention.

Local Attention

The purple block in Figure 1 performs local attention. It applies the usual scaled dot-product attention to the query $Q_s$ of the segment being processed: normalized inner products are computed between $Q_s$ and the segment's key matrix $K_s$ (segment length N x key dimensionality), and the resulting weights are multiplied by the segment's value matrix $V_s$ (N x value dimensionality) to obtain the local attention output $A_{dot}$.
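Written out in the notation above, this local step is the standard scaled dot-product attention computed within the segment:

$A_{dot} = \mathrm{softmax}\!\left( \dfrac{Q_s K_s^{\top}}{\sqrt{d_{key}}} \right) V_s$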

Global Attention (Attention with compressed memory)

While the purple block handles only local attention, the green block computes global attention. This block holds the result of attention computed over the keys and values of past segments. Rather than storing all past keys and values, it updates its state each time a segment has been processed, continually refreshing the "past" it represents (Update in Figure 1). The compressed memory is then retrieved (Retrieve in Figure 1) and combined with the local attention output (Concat in Figure 1).

In the memory retrieval step, the content of the compressed memory $A_{mem}$ (an N x value-dimensionality matrix) is read out using the query $Q_s$ of the current segment (segment length N x key dimensionality), the compressed memory matrix $M_{s-1}$ (key dimensionality x value dimensionality), and the normalization term $z_{s-1}$.
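In the paper's formulation, with $\sigma$ a nonlinear activation (ELU + 1), this retrieval takes the form:

$A_{mem} = \dfrac{\sigma(Q_s)\, M_{s-1}}{\sigma(Q_s)\, z_{s-1}}$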

In the memory update step, the compressed memory matrix and the normalization term are updated with quantities computed from the keys and values of the current segment, and they are then used when processing the next segment.

Matrix of compressed memory after update = matrix of compressed memory before update + a matrix product based on an activation function and the key $K$ and value $V$ of the current segment.

This added term is called the associative binding operator.

Infini-Transformer also adopts an existing refinement of this update, the delta rule. With the delta rule, the value that can already be retrieved from memory with the current keys is subtracted from the new values before applying the associative binding operator, so that only the difference is written to memory.
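Concretely, the two update variants can be written as follows, where the normalization term is updated the same way in both cases:

Linear: $M_s \leftarrow M_{s-1} + \sigma(K_s)^{\top} V_s$

Linear + Delta: $M_s \leftarrow M_{s-1} + \sigma(K_s)^{\top} \left( V_s - \dfrac{\sigma(K_s) M_{s-1}}{\sigma(K_s) z_{s-1}} \right)$

Normalization term: $z_s \leftarrow z_{s-1} + \sum_{t=1}^{N} \sigma(K_t)$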

Input/output relationship between input segment, transformer block and compressed memory

Figure 2 illustrates the input/output relationship between input segments, transformer blocks, and compressed memory. The key and value computations in each layer update that layer's compressed memory (blue arrows), which is then used as the compressed memory when processing the next segment (purple arrows). This ensures that the effective context when processing each segment also includes past segments.

Attention on the compressed memory side is not scaled dot-product attention but linear attention, whose computational cost is considered to grow only linearly.

Figure 2. Input/output of Infini-Transformer: input segments (dotted boxes), transformer blocks (gray), and compressed memory (green).

The local attention result $A_{dot}$ and the global attention result $A_{mem}$ are then combined using a learnable parameter that adjusts the trade-off between the two.
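In the paper this combination is a learned gate, with $\beta$ a learnable scalar:

$A = \mathrm{sigmoid}(\beta)\, A_{mem} + \left( 1 - \mathrm{sigmoid}(\beta) \right) A_{dot}$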

The compressed memory kept at each Transformer layer consists of $M_{s-1}$ and $z_{s-1}$, so its size is (key dimensionality x value dimensionality) for $M_{s-1}$ plus (key dimensionality) for $z_{s-1}$. Because this memory usage is fixed, an unbounded number of input segments can in theory be accepted.
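To make the fixed-size memory concrete, here is a minimal single-head NumPy sketch of the segment loop described above. It is a simplification, not the paper's implementation: the multi-head structure, the delta rule, and training of the gate parameter are omitted, and all names are illustrative.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the nonlinearity applied to queries and keys on the memory path.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def infini_attention(segments, d_key, d_value, beta=0.0):
    """segments: list of (Q, K, V) arrays of shape (N, d_key), (N, d_key), (N, d_value).
    The memory M (d_key x d_value) and normalizer z (d_key,) keep the same size
    no matter how many segments are processed."""
    M = np.zeros((d_key, d_value))
    z = np.full(d_key, 1e-6)            # small constant to avoid division by zero early on
    gate = 1.0 / (1.0 + np.exp(-beta))  # sigmoid of the (learnable, here fixed) gate parameter
    outputs = []
    for Q, K, V in segments:
        sQ, sK = elu_plus_one(Q), elu_plus_one(K)
        A_mem = (sQ @ M) / (sQ @ z)[:, None]              # retrieve from compressed memory
        A_dot = softmax(Q @ K.T / np.sqrt(d_key)) @ V     # local attention within the segment
        outputs.append(gate * A_mem + (1.0 - gate) * A_dot)
        M = M + sK.T @ V                                  # linear update (no delta rule here)
        z = z + sK.sum(axis=0)
    return np.concatenate(outputs, axis=0)

# Example: 4 segments of length 128; the memory stays (64, 64) plus (64,) throughout.
rng = np.random.default_rng(0)
segs = [tuple(rng.normal(size=(128, 64)) for _ in range(3)) for _ in range(4)]
print(infini_attention(segs, d_key=64, d_value=64).shape)  # (512, 64)
```

The point of the sketch is that `M` and `z` are the only state carried across segments, so memory usage does not grow with the number of segments processed.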

Reduction in memory usage compared to existing methods

Existing methods have also proposed Transformers that store past input sequences as model state, but the proposed method achieves, for example, a compression ratio 114 times higher than the existing Memorizing Transformers with a memory length of 65K. Infini-Transformer can therefore process long contexts with very low memory usage compared to conventional methods.

Evaluation Results

Passkey task

One task for assessing understanding of long contexts is the passkey task, in which the LLM is given a long prompt with a random number (the passkey) hidden inside it and is instructed to find the passkey, to check whether it can extract it correctly.

The prompts given are five patterns in length: 32K (32,000), 128K (128,000), 256K (256,000), 512K (512,000), and 1M (1,000,000) tokens.

As for where to hide the passkey, three patterns are tried: at the beginning, middle, and end of the prompt.

Both zero-shot solving and solving after 400 steps of fine-tuning (FT) with an input size of 5K tokens were tested.

The methods compared are two variants of the proposed Infini-Transformer: Linear, without the delta rule, and Linear + Delta, with the delta rule.

Figure 3 shows the results of the passkey task evaluation. The numbers separated by / indicate the success rate of passkey extraction when the passkey is placed at the beginning, middle, and end of the prompt, respectively. When the passkey is placed at the end of the prompt, the rate of correct answers is high, but otherwise the success rate is lower.

Figure 3. Passkey task results.

Although the authors do not discuss this, it is presumably a side effect of the compressed memory: because past context is kept only in compressed form, information from the distant past becomes diluted.

The dependence of the success rate on prompt length is gradual: the longer the prompt, the lower the success rate, but there is no strong dependence such as the success rate halving when the prompt length doubles; it decreases slowly.

With fine-tuning, the success rate improves greatly compared to zero-shot. Even if the effective context range is long, that does not mean the information in it is used effectively; fine-tuning may be needed to train the model to make better use of it.

For Infini-Transformer, comparing the variant without the delta rule (Linear) and the variant with it (Linear + Delta), there is little difference in basic performance. Only in the zero-shot 128K case in Figure 3 is there a noticeable difference.

Book summarization

An LLM with 8 billion model parameters was pre-trained for 30,000 steps with an input length of 8,000 tokens and then fine-tuned on a book summarization task. Fine-tuning used an input length of 32,000 tokens, and evaluation used an input length of 500,000 tokens.

The evaluation metric is ROUGE, which measures the degree of overlap between the machine-generated summary and a reference summary; it has several variants, and higher is better.
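As a rough illustration of the idea (not the exact ROUGE variant used in the paper), the snippet below computes a simple ROUGE-1 recall: the fraction of reference unigrams that also appear in the generated summary.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Overlapping unigram count divided by the number of unigrams in the reference."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge1_recall("the book describes a long journey",
                    "the book is about a long journey"))  # 5/7 ≈ 0.714
```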

The evaluation results are shown in Figure 4. The proposed method, Infini-Transformer (Linear + Delta), outperforms the best-performing existing method.

Figure 4. Book summarization task results.

Conclusion

This article has described an LLM that enables the Transformer to handle prompts of effectively infinite length by introducing local attention, which attends within each segment of the input prompt, and global attention, which takes the context of the entire input prompt into account using compressed memory.

The memory usage of a conventional Transformer depends on the length of the input prompt and grows with the square of that length. The memory usage of Infini-Transformer, on the other hand, is independent of the input prompt length: it depends only on the compressed memory matrix $M_{s-1}$ of size key dimensionality x value dimensionality and the normalization term $z_{s-1}$ of size key dimensionality. Thus, even for prompts of unbounded length, memory usage does not become unbounded.

On the task of summarizing the entire content of a book, which requires processing long contexts, Infini-Transformer outperformed existing methods.

In the future, as we ask LLMs to do more and more through in-context learning, the required context length will only grow. The ability to handle effectively unlimited context lengths should therefore be very impactful.

However, even though the proposed Infini-Transformer makes memory usage independent of the length of the input prompt, I am still curious how the optimal segment size should be determined and how much it affects accuracy.

I also wonder about the input prompt lengths used during pre-training and fine-tuning, and how long they need to be for long prompts to be understood sufficiently well.

In the evaluation, fine-tuning with 32,000-token input prompts allowed the model to process 500,000 tokens of context, suggesting that at inference time it can handle context roughly 15 times longer than the input length used at training time. But does such a proportional relationship always hold, or is it specific to the book summarization task? Or does fine-tuning have to be done separately for each task? I am curious.

If such a technology for effectively unlimited input prompt length is adopted by ChatGPT and others, and the input prompt limit is greatly relaxed, the mega-prompt trend will accelerate. On the other hand, it is likely to cause other problems, such as abnormally high server load from huge user prompts. Also, since long input prompts need to be seen during training, preparing appropriate datasets with sufficiently long inputs may become more of a bottleneck.
