
Next-Generation VLA Model CogVLA! Instruction-Driven Routing and Efficient Robot Control Inspired by Cognitive Science
3 main points
✔️ CogVLA combines efficiency and performance in a three-stage structure inspired by human cognitive processes
✔️ EFA-Routing, LFP-Routing, and CAtten maintain semantic consistency across vision, language, and action
✔️ Achieves the highest success rates on LIBERO and in real-world experiments, together with significant gains in computational efficiency
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
written by Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie
(Submitted on 28 Aug 2025)
Comments: 23 pages, 8 figures, Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper addresses the challenges of high computational cost and lack of semantic consistency across modalities in Vision-Language-Action (VLA) models, which have received considerable attention in recent years.
Traditional methods have focused primarily on improving computational efficiency within language models, neglecting integrated optimization of vision, language, and action.
As a result, they have encountered problems such as the loss of important information due to visual feature compression and the loss of contextual consistency due to token skipping in the language model.
To solve this, the authors proposed CogVLA, inspired by human cognitive processes.
CogVLA introduces instruction-based routing and sparsification to achieve both efficiency and semantic consistency in the visual-to-action sequence.
Furthermore, through evaluations on the simulation benchmark LIBERO and on real robot tasks, the proposed method outperforms conventional methods and demonstrates significant efficiency improvements.
Proposed Methodology
CogVLA employs a three-stage progressive architecture based on human cognitive science.
First, "EFA-Routing (Encoder-FiLM based Aggregation Routing)" injects instructions into the visual encoder and selectively aggregates and compresses highly relevant visual tokens.
This compresses the visual input to 25% of its original tokens while suppressing instruction-irrelevant features.
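The following is a minimal PyTorch-style sketch of what instruction-conditioned (FiLM-style) token aggregation of this kind could look like. The module name, the scoring head, and the keep_ratio value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EFARoutingSketch(nn.Module):
    """Sketch of FiLM-conditioned visual token selection (hypothetical names)."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        # FiLM generator: instruction embedding -> per-channel scale and shift
        self.film = nn.Linear(dim, 2 * dim)
        # Scalar relevance score per modulated visual token
        self.score = nn.Linear(dim, 1)

    def forward(self, vis_tokens: torch.Tensor, instr_emb: torch.Tensor):
        # vis_tokens: (B, N, D) visual tokens; instr_emb: (B, D) pooled instruction
        gamma, beta = self.film(instr_emb).chunk(2, dim=-1)          # (B, D) each
        modulated = gamma.unsqueeze(1) * vis_tokens + beta.unsqueeze(1)
        scores = self.score(modulated).squeeze(-1)                   # (B, N)
        k = max(1, int(self.keep_ratio * vis_tokens.size(1)))        # keep ~25%
        top = scores.topk(k, dim=1).indices                          # (B, k)
        idx = top.unsqueeze(-1).expand(-1, -1, vis_tokens.size(-1))
        return modulated.gather(1, idx)                              # (B, k, D)
```

For example, EFARoutingSketch(dim=768) applied to 256 visual tokens would return 64 instruction-modulated tokens per image.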
Second, "LFP-Routing (LLM-FiLM based Pruning Routing)" further eliminates visual tokens that are not relevant to instructions inside the language model, reducing the computational load while emphasizing task-relevant meaning.
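A rough sketch of how such instruction-driven pruning inside the language model might be expressed is shown below. The function name, the cosine-similarity criterion, and the keep_ratio are assumptions made for illustration only.

```python
import torch

def lfp_prune_sketch(hidden: torch.Tensor, instr_emb: torch.Tensor,
                     vis_mask: torch.Tensor, keep_ratio: float = 0.5):
    """Illustrative instruction-driven pruning of visual tokens in an LLM layer.
    hidden: (B, T, D) hidden states; vis_mask: (B, T) bool, True for visual
    tokens; instr_emb: (B, D) instruction embedding. Hypothetical helper."""
    # Relevance of each hidden state to the instruction (cosine similarity)
    sim = torch.cosine_similarity(hidden, instr_emb.unsqueeze(1), dim=-1)  # (B, T)
    # Text and action tokens are always kept
    sim = sim.masked_fill(~vis_mask, float("inf"))
    n_vis = vis_mask.sum(dim=1).min().item()
    n_drop = int((1.0 - keep_ratio) * n_vis)
    drop = sim.topk(n_drop, dim=1, largest=False).indices            # (B, n_drop)
    keep = torch.ones_like(vis_mask)
    keep.scatter_(1, drop, False)
    return keep   # boolean mask of tokens retained in subsequent layers
```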
Third, "CAtten (Vision-Language-Action Coupled Attention)" is introduced to generate action sequences from the compressed representation while maintaining logical consistency and temporal integrity.
CAtten applies causal attention over the vision-language prefix while allowing bidirectional parallel decoding over the action tokens, achieving both efficiency and accuracy.
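The sketch below illustrates one way such a coupled attention mask could be built: causal over the vision-language prefix, bidirectional within the action block. The token layout and function name are assumptions, not the paper's exact formulation.

```python
import torch

def catten_mask_sketch(n_vis: int, n_txt: int, n_act: int) -> torch.Tensor:
    """Hypothetical attention mask: causal vision-language prefix, plus an
    action block that attends bidirectionally to itself and the full prefix."""
    T = n_vis + n_txt + n_act
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # causal baseline
    a0 = n_vis + n_txt                                       # start of actions
    mask[a0:, a0:] = True    # action tokens see each other in both directions
    mask[a0:, :a0] = True    # ...and the entire vision-language prefix
    return mask              # True = attention allowed
```

For instance, catten_mask_sketch(64, 32, 8) yields a 104×104 boolean mask in which the last 8 action positions can be decoded in parallel.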
These integrated designs allow CogVLA to achieve efficiency while maintaining cross-modal semantic consistency.
Experimentation
CogVLA was evaluated in the simulation benchmark LIBERO and in a real robot environment.
In LIBERO, 500 trials were conducted on four different task groups: spatial reasoning, object recognition, goal understanding, and long-horizon tasks.
As a result, CogVLA achieved an average success rate of 97.4%, outperforming existing state-of-the-art models.
In addition, the Cobot Agilex ALOHA platform was used in a real-world environment to perform complex tasks such as object placement, drawer manipulation, and T-shirt folding.
The success rate reached 70.0%, far outperforming other methods.
Furthermore, in terms of efficiency, it achieved 2.8x faster inference, 3.1x fewer FLOPs, and 2.5x lower training cost compared to OpenVLA.
Ablation studies also confirmed the effectiveness of the proposed method, with the modules at each stage working in a complementary manner.