A Better Attention Mechanism Improves LLMs' Long-Text Processing!
3 main points
✔️ Establishes a unified evaluation protocol for long-context language models.
✔️ Exact Attention mechanisms achieve the best performance on long-text processing.
✔️ Approximate Attention methods are resource-efficient but less accurate.
A Controlled Study on Long Context Extension and Generalization in LLMs
written by Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
(Submitted on 18 Sep 2024 (v1))
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Background
This paper investigates how large language models (LLMs) can extend their ability to process longer contexts. Traditionally, LLMs have been trained on short-context data, but real-world tasks require comprehending longer texts and documents: learning from textbooks, summarizing novels, and solving problems that draw on many examples all demand long-context understanding.
However, training models on long sequences from scratch is inefficient because it requires enormous computational resources. For this reason, many researchers have developed "context extension" methods that adapt already-trained models. A variety of techniques for handling long texts have been proposed, but they are diverse, and each has its own advantages and disadvantages.
For example, Attention mechanisms for long-text support can be broadly divided into two categories: exact and approximate. Exact Attention mechanisms are highly accurate, but their computational cost is very high. Approximate Attention mechanisms, on the other hand, reduce computational cost but are often less accurate. This paper compares the various methods under a unified evaluation criterion to determine which is most effective.
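The cost gap between the two categories can be made concrete with a back-of-the-envelope count of attention score computations. The sketch below is illustrative only: the window size and head dimension are made-up values, and real kernels differ by constant factors.

```python
# Rough cost comparison: exact (full) attention scales quadratically in
# sequence length n, while a windowed approximation scales linearly.
# Constants and implementation details (kernels, heads) are ignored.

def full_attention_cost(n: int, d: int) -> int:
    """Score computations for exact attention: every query attends to every key."""
    return n * n * d

def windowed_attention_cost(n: int, d: int, w: int) -> int:
    """Each query attends only to the last w keys (one approximate scheme)."""
    return n * min(w, n) * d

n, d, w = 32_768, 128, 1_024  # hypothetical 32k context, window of 1k
print(full_attention_cost(n, d) // windowed_attention_cost(n, d, w))  # → 32
```

At a 32k context with a 1k window, the approximate scheme does 32x less score work, which is exactly the resource/accuracy trade-off the paper evaluates.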
Proposed Method
The approach examined in this paper is "context extension": adapting existing large language models (LLMs) to longer contexts. Regular models see only short texts during training, but real-world applications require processing long documents and large volumes of information.
The paper compares different approaches to handling long contexts without significantly modifying existing models and examines their effectiveness.
First, context extension methods can be divided into "exact Attention mechanisms" and "approximate Attention mechanisms." Exact Attention mechanisms process the full long context strictly, so they are expected to be highly accurate, but they consume more computational resources.
Approximate Attention mechanisms, on the other hand, save computational resources while maintaining some degree of accuracy.
Experiment
In the experiments, several different methods were tested to evaluate how well large language models (LLMs) process long texts. The goal was to compare the effectiveness of the various "context extension" methods against existing models and to quantify model performance on long contexts.
The base model was "LLaMA2-7B," to which the different context extension methods were applied. The main evaluation criteria were "perplexity" and performance on the "Needle in a Haystack" task. Perplexity measures how well a model predicts text; the lower the number, the better the model. The "Needle in a Haystack" task tests how accurately a model can find a specific piece of information in a long document.
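As a reminder of what the perplexity metric measures, it is the exponential of the average negative log-likelihood the model assigns to each token. A minimal sketch, with made-up token probabilities for illustration:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# Lower is better: it means the model assigns higher probability to the text.

def perplexity(token_probs):
    """token_probs: probability the model gave to each actual next token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Illustrative values, not real model outputs.
print(perplexity([0.5, 0.25, 0.125]))  # → 4.0 (inverse geometric mean)
```

Intuitively, a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens at each step.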
Experimental results showed that methods using exact Attention, such as "NTK-RoPE" and "CLEX," performed best on both perplexity and Needle in a Haystack. These methods maintained high accuracy even when the context length was extended to 32k or 64k.
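The core idea behind NTK-RoPE is to extend the context by rescaling RoPE's frequency base instead of retraining the model. The sketch below uses the commonly cited NTK-aware update `new_base = base * s ** (d / (d - 2))`, where `s` is the length-scaling factor; this is an assumption about the general technique, not the paper's exact code.

```python
# NTK-aware RoPE base rescaling: stretch low-frequency position encodings
# so positions beyond the training length stay in-distribution, while
# high-frequency dimensions are left nearly unchanged.

def rope_frequencies(dim: int, base: float = 10_000.0, scale: float = 1.0):
    """Per-pair rotary frequencies, optionally NTK-scaled for longer contexts."""
    if scale > 1.0:
        base = base * scale ** (dim / (dim - 2))  # NTK-aware base adjustment
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

orig = rope_frequencies(128)                 # e.g. trained at 4k context
extended = rope_frequencies(128, scale=8.0)  # e.g. targeting 32k context
# The first (highest-frequency) dimension is identical; the low-frequency
# tail is stretched the most, which is what preserves local accuracy.
```

The head dimension (128) and scale factor (8) are illustrative choices, not values taken from the paper.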
On the other hand, the approximate Attention mechanisms "Landmark Attention" and "LongLoRA" performed well on short contexts, but their accuracy dropped as the context grew longer.
Furthermore, models with exact Attention mechanisms showed consistently good results even in long contexts. In particular, "NTK-32K" handled context lengths up to 32k and was confirmed to maintain a certain level of accuracy even in contexts longer than 64k.
On the other hand, methods such as "LM-Infinite" and "Self-Extend" performed well on short contexts but sometimes missed information in longer texts.
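This "missed information" failure mode follows from the shape of the approximate mask: LM-Infinite-style attention keeps only a few initial "sink" tokens plus a recent window, so tokens in the middle of a long document become invisible. A minimal sketch of such a mask, with illustrative sizes (not the paper's settings):

```python
# Lambda-shaped attention mask: each query sees a few global "sink" tokens
# at the start of the sequence plus a local window of recent tokens.
# Everything in between is dropped, which saves compute but can lose facts
# buried in the middle of a long document.

def lambda_mask(seq_len: int, n_sink: int, window: int):
    """Return, for each query position q, the sorted visible key positions."""
    visible = []
    for q in range(seq_len):
        keys = set(range(min(n_sink, q + 1)))                      # global sinks
        keys |= set(range(max(0, q - window + 1), q + 1))          # local window
        visible.append(sorted(keys))
    return visible

mask = lambda_mask(seq_len=8, n_sink=2, window=3)
print(mask[7])  # → [0, 1, 5, 6, 7]  (middle tokens 2-4 are invisible)
```

A "needle" placed at position 3 would be unreachable from the final query here, which mirrors why these methods underperform on the Needle in a Haystack task at long lengths.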
The NTK-based models also outperformed other methods on the RULER test, a benchmark of complex long-context tasks; "Dynamic NTK" in particular scaled flexibly and produced stable results as the context length increased.
These results provide important guidance on how models should be extended for long-text processing.
Conclusion
The conclusions of this paper offer insight into improving large language models (LLMs) for long contexts, highlighting the superior performance of exact Attention mechanisms, especially on long-context tasks.
Experimental results show that perplexity (a measure of prediction accuracy) is closely related to task success rate, and that an exact Attention mechanism is key when dealing with long contexts. The authors conclude that, depending on the characteristics of the task, one should carefully choose between an exact Attention mechanism and an approximate method.
The paper also discusses directions for future research: developing long-text models will require careful hyperparameter tuning, as well as efforts to achieve equivalent performance with fewer computational resources. This is expected to drive further advances in models that handle long contexts.