
Cross-Layer Attention Significantly Reduces Transformer Memory



3 main points
✔️ Reducing KV Cache Memory with Cross-Layer Attention (CLA)
✔️ Maintaining Accuracy in 1B and 3B Parameter Models Using CLA
✔️ Improving Memory Efficiency by Combining CLA with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
written by William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley
(Submitted on 21 May 2024)
Comments: Published on arXiv.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, transformer models have made great strides in the field of natural language processing, achieving excellent results in a variety of applications. However, a key-value (KV) cache, which has high memory requirements, is essential for efficient inference with large-scale language models. Especially when dealing with long sequences and large batch sizes, its memory consumption becomes very high, posing a practical challenge.

To solve this challenge, many researchers have sought ways to improve the memory efficiency of the KV cache. Among them, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) have been widely adopted as effective means of reducing the size of the KV cache by allowing multiple query heads to share a single key/value head. However, further memory efficiency improvements are needed.

Against this backdrop, researchers at MIT and the MIT-IBM Watson AI Lab proposed a new approach, Cross-Layer Attention (CLA), which shares key and value heads between adjacent layers to further reduce the KV cache size, while maintaining model accuracy.

Related Research

To maximize the performance and efficiency of transformer models, many researchers are exploring different approaches. This paper specifically focuses on related research to improve the memory efficiency of the KV cache. Below is a summary of key related research presented in the paper.

Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)

The most relevant studies are Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which improve the attention mechanism of the transformer model. Shazeer (2019) proposed MQA, in which multiple query heads share a single key/value head. Ainslie et al. (2023) generalized this to Grouped-Query Attention (GQA), an architecture in which query heads are divided into groups and each group shares a single key/value head. This improves memory efficiency while minimizing accuracy loss.
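To make this sharing concrete, here is a minimal PyTorch sketch of grouped-query attention (an illustration with assumed module names and dimensions, not the code from either paper). Setting n_kv_heads to 1 recovers MQA, and setting it equal to n_q_heads recovers standard multi-head attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch: n_q_heads query heads share n_kv_heads key/value heads."""
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        # The key/value projections (and hence the KV cache) shrink with n_kv_heads.
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wo = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Each group of query heads attends with its single shared key/value head.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```

Only k and v need to be cached during autoregressive decoding, so the per-token cache scales with the number of key/value heads rather than the number of query heads.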

KV Cache Compression

Another approach to reducing the size of the KV cache is KV cache compression. Hooper et al. (2024) proposed KVQuant, which quantizes keys and values so they can be stored at low precision. Zhang et al. (2024) also showed how to compress the KV cache to one or two bits by non-uniformly encoding keys and values with a technique called Coupled Quantization.

Delete Unnecessary KV Cache Entries

Another approach is to remove unneeded KV cache entries. Zhang et al. (2023) proposed H2O, which evicts unimportant KV cache entries, and Liu et al. (2023) showed with the Scissorhands technique how to retain only the important tokens during generation. These techniques effectively reduce the memory usage of the KV cache.

Reduction of KV Cache Size Due to Architectural Changes

Cross-Layer Attention (CLA), proposed in this paper, is an approach to reduce KV cache size through architectural changes. It is unique in that CLA performs key/value sharing between adjacent layers, whereas conventional GQA and MQA perform key/value sharing within a single layer. This further reduces the memory footprint of the KV cache while maintaining model accuracy.

Memory-Efficient Training

Much research has also been done to improve memory efficiency during training. Shoeybi et al. (2020) proposed Megatron-LM, a model-parallel technique for efficient training of large neural networks, and Huang et al. (2019) introduced GPipe, which uses pipeline parallelism to optimize training memory usage. CLA is compatible with these techniques and is expected to further improve memory efficiency.

Proposed Method (Cross-Layer Attention)

To solve the memory problem of the KV cache in transformer models, the researchers proposed a new method, Cross-Layer Attention (CLA), which shares key and value heads between adjacent layers to reduce the KV cache size while maintaining model accuracy. This section details the design of CLA and its specific behavior.

Basic Concept of CLA

In traditional transformer architectures, each layer computes its own keys and values and stores them in the KV cache. This method requires large amounts of memory to accommodate long sequences and large batch sizes. In contrast, CLA reduces memory usage by sharing keys and values computed in some layers with neighboring layers.

Specifically, CLA works as follows:

Key/value computation and sharing: Some layers compute their own keys and values and store them in the KV cache. Neighboring layers then reuse these computed keys and values (see Figure 1).

Sharing factor: The number of layers that share a set of keys and values is called the "sharing factor." For example, if the sharing factor is 2, each pair of adjacent layers uses the same keys and values (see Figure 2).

This approach reduces the memory usage of the KV cache by a factor equal to the sharing factor (a code sketch follows Figure 1 below).

Figure 1: Conceptual diagram of Cross-Layer Attention (CLA)
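As a rough sketch of this idea (assuming a sharing factor of 2 and full multi-head projections; the class and variable names are hypothetical, not the authors' implementation), the layers below either produce keys/values or reuse the ones produced by the layer before them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLALayer(nn.Module):
    """One attention layer in a CLA stack. Layers whose index is a multiple of the
    sharing factor own key/value projections; the others reuse the previous keys/values."""
    def __init__(self, d_model=512, n_heads=8, owns_kv=True):
        super().__init__()
        self.owns_kv = owns_kv
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        if owns_kv:  # only KV-producing layers have key/value projections
            self.wk = nn.Linear(d_model, d_model, bias=False)
            self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv=None):
        b, t, _ = x.shape
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q = split(self.wq(x))
        if self.owns_kv:
            k, v = split(self.wk(x)), split(self.wv(x))  # cached once, reused by the next layer
        else:
            k, v = shared_kv  # reuse the keys/values of the producing layer
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1)), (k, v)

# CLA2 stack: every second layer reuses the previous layer's keys/values,
# so only half of the layers contribute entries to the KV cache.
layers = nn.ModuleList(CLALayer(owns_kv=(i % 2 == 0)) for i in range(8))
x, kv = torch.randn(1, 16, 512), None
for layer in layers:
    x_out, kv = layer(x, shared_kv=kv)
    x = x + x_out  # residual connection (normalization and MLP omitted for brevity)
```

In this sketch, layers that reuse keys and values carry no key/value projection weights of their own, which mirrors how CLA trims those projections in addition to shrinking the cache.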

CLA Architecture

The CLA design can be combined with traditional Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). While conventional MQA and GQA share keys and values within the same layer, CLA shares them across multiple layers, allowing further memory reduction.

The specific structure of CLA is as follows:

Key/Value Projections: Some layers perform their own key/value projections and store the results in the KV cache. Other layers reuse these projections.
Combination flexibility: CLA can be combined with MQA or GQA, stacking the benefits of each for greater memory efficiency (a rough size calculation follows Figure 2 below).

Figure 2: CLA configurations with different sharing factors
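To make the combined effect concrete, here is a back-of-the-envelope calculation of per-token KV cache size; the layer count, head dimension, and 16-bit cache precision are illustrative assumptions, not the paper's exact configurations.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2, sharing_factor=1):
    """Per-token KV cache size: keys and values (factor 2) for every
    KV-producing layer; CLA divides the layer count by the sharing factor."""
    kv_layers = n_layers // sharing_factor
    return 2 * kv_layers * n_kv_heads * d_head * bytes_per_value

# Illustrative 32-layer model, head dimension 128, fp16 cache (2 bytes per value).
print(kv_cache_bytes_per_token(32, n_kv_heads=32, d_head=128))                    # MHA: 524288 bytes
print(kv_cache_bytes_per_token(32, n_kv_heads=1,  d_head=128))                    # MQA: 16384 bytes
print(kv_cache_bytes_per_token(32, n_kv_heads=1,  d_head=128, sharing_factor=2))  # MQA + CLA2: 8192 bytes
```

MQA already removes the factor coming from the number of key/value heads; CLA2 then halves the number of layers that contribute cache entries, which compounds with that reduction.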

Experiment

In this study, a series of experiments were conducted using 1B and 3B parameter models to validate the proposed Cross-Layer Attention (CLA) method.

In all experiments, models were trained on the SlimPajama dataset. The GPT-NeoX tokenizer, which uses Byte-Pair Encoding (BPE), was employed. Following the Llama architecture, the models use pre-normalization, SwiGLU activation functions, and rotary position embeddings. Training was performed in the PyTorch framework on NVIDIA H100 GPUs.

Experimental Results for 1B Parameter Model

Various CLA configurations were tested in the 1B parameter model. In particular, the MQA-CLA2 configuration showed excellent performance (see Figure 3).

Figure 3: Experimental results for 1B parameter model

MQA-CLA2 models: MQA-CLA2 models with head dimensions ranging from 64 to 512 reduced KV cache memory while improving accuracy compared to the conventional MQA model. In particular, the model with a head dimension of 128 halved memory usage relative to the conventional MQA model while achieving nearly the same accuracy.
GQA-CLA2 models: Models combining GQA and CLA2 were also tested; the GQA2-CLA2 configuration proved the most effective, showing better accuracy than the other configurations.

Experimental Results for the 3B Parameter Model

Experiments were also conducted on the 3B parameter model to verify the effectiveness of CLA. Again, the MQA-CLA2 configuration was found to be the most effective.

MQA-CLA2 vs. MQA at head dimension 128: After tuning the learning rate, the MQA-CLA2 model showed better accuracy than the conventional MQA model with a head dimension of 128. In particular, a clear difference in perplexity was observed on the Wikitext dataset (see Table 5).

Table 5: Experimental results for the 3B parameter model

Discussion

1. Improved memory efficiency: CLA, especially with a sharing factor of 2, was found to effectively reduce the memory usage of the KV cache while keeping accuracy close to the baseline. This is a significant improvement in memory efficiency compared to previous architectures.

2. Maintained accuracy: CLA reduces memory usage with minimal loss of accuracy, which is especially useful in scenarios with long sequences or large batch sizes.

3. Importance of the learning rate: The results show that tuning the learning rate has a significant impact on model performance, and that a higher learning rate is particularly effective for CLA models. This suggests that CLA not only improves memory efficiency but may also benefit the efficiency of the training process itself.

These results clearly show that CLA is a method that could become the new standard in transformer model design, offering significant advantages in both practicality and efficiency.

Conclusion

In this paper, the authors proposed Cross-Layer Attention (CLA), a new method that reduces the memory usage of the KV cache in transformer models by sharing key and value heads between adjacent layers while maintaining almost the same accuracy. Experimental results on 1B and 3B parameter models show that CLA provides superior performance in both memory efficiency and accuracy.

Future prospects include further optimization and extension of CLA, for example applying it to different model architectures and larger models, and validating its effectiveness on longer sequences. It will also be important to evaluate CLA in real-world applications to further confirm its utility. CLA is an important step in the continued evolution of transformer models.

 