
Toward AI That Doesn't Forget Images: CoMemo Pioneers Next-Generation Vision-Language Models
3 main points
✔️ CoMemo is proposed to solve the problems of image-information neglect and position encoding in LVLMs
✔️ Image processing uses a dual structure of a Context Path and a Memory Path to both preserve and utilize visual information
✔️ The new RoPE-DHR method maintains the 2D structure even in high-resolution images while minimizing the degradation of positional information
CoMemo: LVLMs Need Image Context with Image Memory
written by Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
(Submitted on 6 Jun 2025)
Comments: ICML 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Overview
LVLMs (Large Vision-Language Models) that integrate vision and language have attracted much attention in recent years. These models incorporate image information into language models, enabling advanced inference that combines the two modalities. However, conventional approaches do not fully exploit image information. In particular, the "lost in the middle" phenomenon, in which information in the middle of a long context is easily ignored by the model, and the degradation of positional information have been persistent problems.
To solve these problems, this paper proposes a new architecture called "CoMemo," which, in addition to conventional autoregressive image processing, introduces an auxiliary "memory path" that lets the model continuously attend to image content while preserving image context. In addition, the newly designed RoPE-DHR (Rotary Position Embedding for Dynamic High Resolution) reduces the weakening of long-distance dependencies while preserving the two-dimensional structure of the image.
On a variety of vision-language tasks, CoMemo outperformed previous models, with particularly strong results in image context understanding, long-form generation, and integrated inference over multiple images.
Proposed Methodology
The core of CoMemo's design is image processing with a dual pathway: Context Path and Memory Path.
The Context Path links image tokens with text tokens and processes them autoregressively, as in conventional models. In contrast, the Memory Path processes image tokens via cross-attention, allowing flexible reference to image information from the text side. This dual structure greatly mitigates the "neglect of image information" and "lack of attention to intermediate positions" problems common in previous models.
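The dual-path idea can be sketched as a single transformer block. This is a minimal illustration, not the paper's implementation: dimensions, the tanh gate, and all names are assumptions; only the structure (causal self-attention plus gated cross-attention to image tokens) reflects the description above.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Minimal sketch of CoMemo's dual-path idea (names and dims illustrative).

    Context path: image tokens interleaved with text tokens go through ordinary
    causal self-attention. Memory path: hidden states additionally cross-attend
    to the image tokens, blended in through a learnable gate.
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gate initialized at zero so the memory path starts "off" (an assumption).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, image_tokens, causal_mask=None):
        # Context path: autoregressive self-attention over the full sequence.
        ctx, _ = self.self_attn(hidden, hidden, hidden, attn_mask=causal_mask)
        hidden = hidden + ctx
        # Memory path: cross-attention from the sequence to the image tokens,
        # scaled by a tanh gate so the two paths stay balanced.
        mem, _ = self.cross_attn(hidden, image_tokens, image_tokens)
        hidden = hidden + torch.tanh(self.gate) * mem
        return hidden
```

Because the gate starts at zero, the block initially behaves like a plain autoregressive layer and the memory path is phased in during training.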
In addition, CoMemo introduces a new position-encoding method called RoPE-DHR, which divides each image into a "thumbnail" and "high-resolution tiles": conventional position encoding is applied to the thumbnail, while each tile inherits the position information of its corresponding thumbnail region. This improves computational efficiency while maintaining two-dimensional positional relationships.
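The position-inheritance scheme can be illustrated with plain position IDs. This is a simplified sketch under assumed layout conventions (row-major IDs, proportional tile-to-thumbnail mapping), not the paper's exact formulation:

```python
def rope_dhr_position_ids(thumb_h: int, thumb_w: int,
                          tile_rows: int, tile_cols: int):
    """Sketch of RoPE-DHR-style position assignment (layout is an assumption).

    Thumbnail patches get ordinary sequential 2D position IDs. High-resolution
    tiles do NOT get fresh IDs; each tile inherits the ID of the thumbnail
    patch covering the same image region, so the 2D layout is preserved and
    IDs never grow with the number of tiles.
    """
    # Thumbnail patches: row-major IDs 0 .. thumb_h*thumb_w - 1
    thumb_ids = [[r * thumb_w + c for c in range(thumb_w)]
                 for r in range(thumb_h)]

    # Each tile maps to the thumbnail patch over the same image region.
    tile_ids = []
    for tr in range(tile_rows):
        for tc in range(tile_cols):
            r = min(thumb_h - 1, tr * thumb_h // tile_rows)
            c = min(thumb_w - 1, tc * thumb_w // tile_cols)
            tile_ids.append(thumb_ids[r][c])
    return thumb_ids, tile_ids
```

Since tile tokens reuse thumbnail IDs, the maximum position ID stays bounded by the thumbnail size no matter how many high-resolution tiles are added, which is what suppresses the long-range decay of rotary embeddings.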
Furthermore, the training method is also carefully designed, employing a three-stage learning strategy. First, the parameters of the memory path and projector are tuned; next, the gate parameters are fixed to balance the two paths; finally, all parameters are fine-tuned. This ensures that the model is not biased toward either pathway and that both are utilized in a balanced manner.
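The staged schedule amounts to toggling which parameter groups receive gradients. A minimal sketch, assuming parameter names contain "memory", "projector", and "gate" (the actual grouping in the paper's codebase may differ):

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze parameters for the three-stage schedule (names assumed)."""
    for name, p in model.named_parameters():
        if stage == 1:
            # Stage 1: tune only the memory path and the projector.
            p.requires_grad = ("memory" in name) or ("projector" in name)
        elif stage == 2:
            # Stage 2: fix the gate parameters to keep the paths balanced.
            p.requires_grad = "gate" not in name
        else:
            # Stage 3: fine-tune everything.
            p.requires_grad = True
```

Calling `configure_stage(model, 1)` before the first phase and advancing the stage between phases reproduces the freeze schedule described above.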
Experiments
In the paper, experiments were conducted on seven benchmarks combining vision and language to validate the effectiveness of CoMemo.
These included image caption generation, long-form generation, multi-image inference, long-context understanding, mathematical reasoning, general VQA (visual question answering), and OCR-related tasks. The models compared are all at the 2B-parameter scale and were trained under uniform conditions.
The results show that CoMemo delivers significant gains over the prior LVLM-S and LVLM-X architectures: +17.2% on image caption generation, +7.0% on long-form generation, and +5.6% on long-context comprehension. In particular, on tasks such as MM-NIAH and MileBench, which require extracting key information from interleaved images and text, CoMemo retained and utilized intermediate information well, whereas conventional methods tend to lose it.
Component ablation experiments also quantitatively verified the impact of RoPE-DHR and the Memory Path on performance, confirming the importance of both elements. In terms of computational efficiency, inference time increases slightly but remains well within a practical range, indicating a high degree of practicality overall.