
Mechanism And Effect Of "Representation Shift" Token Compression For FlashAttention

3 main points
✔️ Proposes Representation Shift, a metric that measures token importance by how much each token's representation changes through a layer
✔️ Independent of attention maps, so it applies to FlashAttention as well as CNNs and SSMs
✔️ Experiments demonstrate both accuracy and efficiency, with up to 5.5x inference speedup

Representation Shift: Unifying Token Compression with FlashAttention
written by Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
(Submitted on 1 Aug 2025)
Comments:  International Conference on Computer Vision (ICCV), 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper presents a new approach to the challenge of increasing computational cost in transformer models.

Transformers have become ubiquitous in natural language processing and image/video understanding in recent years, but processing efficiency has become a serious issue as models grow, since the computational cost of the self-attention mechanism scales quadratically with the number of input tokens.
Traditionally, there have been attempts to solve this problem from two directions.
One is the memory efficiency method represented by FlashAttention, and the other is a computation reduction method based on token compression.
However, token compression usually estimates token importance from the attention map, which makes it incompatible with implementations such as FlashAttention that never materialize one.

The authors therefore proposed a new metric, Representation Shift, which defines a token's importance by measuring how much its representation changes as it passes through a layer.
This metric is training-free and model-independent, and can be combined with FlashAttention.
Experimental results show that this method outperforms conventional methods in terms of both efficiency and accuracy, achieving up to 5.5x inference speedup.

Proposed Method

The proposed method, Representation Shift, quantifies how much information a token gains inside the model by measuring, for each token, the difference between its representation at the input and at the output of a layer.

Specifically, the distance between each token's vector before and after the attention or MLP block is computed and used as its importance score; among the distance measures tested, the L2 norm gave the most stable performance.
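The per-token scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea (L2 distance between a token's input and output representations), not the paper's implementation; the function name and list-based tensor layout are assumptions for clarity.

```python
import math

def representation_shift(x_in, x_out):
    """Score each token by the L2 distance between its representation
    before (x_in) and after (x_out) a layer, e.g. the MLP block.

    x_in, x_out: N tokens, each a list of D floats.
    Returns one importance score per token (larger = more change,
    i.e. the layer added more information to that token).
    """
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(tok_in, tok_out)))
        for tok_in, tok_out in zip(x_in, x_out)
    ]
```

Because the score depends only on the layer's input and output activations, it never needs the attention map, which is exactly why it composes with FlashAttention.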
While conventional methods rely on attention maps, this method can estimate token importance independently of the attention mechanism, and thus can be naturally integrated with calculation methods that do not construct attention maps, such as FlashAttention.
The framework is also versatile enough to be applied not only to Transformers but also to CNNs and state space models (SSMs).

The authors also studied the design choices in detail, such as which layer to measure Representation Shift at and which operation (attention or MLP) to base it on.
The results showed that using the amount of change across the MLP block was the most effective.

This design minimizes information loss while eliminating token redundancy.
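Once each token has a shift score, redundancy elimination reduces to keeping the highest-scoring tokens. Below is a hypothetical sketch of that pruning step, assuming scores have already been computed; the helper name, the `keep_ratio` parameter, and the exact selection policy are illustrative, not the paper's code (the video experiments prune roughly 20% of tokens per layer).

```python
def prune_tokens(tokens, scores, keep_ratio=0.8):
    """Keep the top keep_ratio fraction of tokens, ranked by their
    Representation Shift scores, preserving the original token order.

    tokens: list of N token representations (any type).
    scores: list of N importance scores (e.g. from representation shift).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k highest-scoring tokens...
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # ...restored to their original sequence order before gathering.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]
```

Pruning low-shift tokens discards positions the layer barely updated, which is the sense in which the design removes redundancy while minimizing information loss.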

Experiments

The authors conducted extensive experiments on both image classification and video comprehension tasks to validate the effectiveness of the proposed method.

First, for the video tasks, the authors evaluated video-text retrieval and video QA with UMT (Unmasked Teacher) in a setting where 20% of tokens were pruned per layer.
The results showed that Representation Shift combined with FlashAttention was faster and more accurate than existing attention score-based methods, achieving up to 5.5x throughput improvement.
It also showed a better speed/accuracy tradeoff than simply switching to a smaller model.

Next, image classification was validated on ImageNet using DeiT models; combined with FlashAttention, the method achieved a 1.2x increase in inference speed while improving accuracy over conventional attention-based methods.
The authors also applied the method to CNNs and SSMs such as ResNet and Vision Mamba, confirming its effectiveness on these non-Transformer architectures.
In particular, row-by-row token pruning on ResNet-50 achieved a speedup of over 18% while nearly maintaining accuracy.
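For a CNN, the same idea can be applied to spatial rows of a feature map rather than to sequence tokens. The sketch below scores each row of an H x W x C feature map by how much a convolutional block changed it; this is a hedged illustration under assumed names and a nested-list layout, not the authors' ResNet-50 code.

```python
import math

def row_shift_scores(feat_in, feat_out):
    """Score each spatial row of a CNN feature map by the L2 norm of
    its change across a block. Rows with small scores changed little
    and are candidates for pruning.

    feat_in, feat_out: H rows x W cells x C channels (nested lists).
    Returns one score per row.
    """
    scores = []
    for row_in, row_out in zip(feat_in, feat_out):
        # Sum squared channel-wise differences over every cell in the row.
        sq = sum(
            (a - b) ** 2
            for cell_in, cell_out in zip(row_in, row_out)
            for a, b in zip(cell_in, cell_out)
        )
        scores.append(math.sqrt(sq))
    return scores
```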

These experiments demonstrate that Representation Shift is a versatile and powerful token compression standard.

