
Next Generation VLA Model By CogVLA! Instruction-driven Routing And Efficient Robot Operation Based On Cognitive Science

3 main points
✔️ CogVLA combines efficiency and performance in a three-stage structure inspired by human cognitive processes
✔️ EFA-Routing, LFP-Routing, and CAtten maintain visual, verbal, and behavioral consistency
✔️ LIBERO and real-world experiments achieve highest success rates and significant improvements in computational efficiency

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
written by Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie
(Submitted on 28 Aug 2025)
Comments: 23 pages, 8 figures, Project Page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper addresses the challenges of high computational cost and lack of semantic consistency across modalities in Vision-Language-Action (VLA) models, which have received considerable attention in recent years.

Traditional methods have focused primarily on improving computational efficiency within language models, neglecting integrated optimization of vision, language, and action.
As a result, they have encountered problems such as the loss of important information due to visual feature compression and the loss of contextual consistency due to token skipping in the language model.

To address these issues, the authors propose CogVLA, a model inspired by human cognitive processes.
CogVLA introduces instruction-driven routing and sparsification to achieve both efficiency and semantic consistency along the vision-to-action pipeline.

Furthermore, through evaluations on the simulation benchmark LIBERO and on real robot tasks, the proposed method outperforms conventional methods and demonstrates significant efficiency improvements.

Proposed Methodology

CogVLA employs a three-stage progressive architecture based on human cognitive science.

First, "EFA-Routing (Encoder-FiLM based Aggregation Routing)" injects instructions into the visual encoder and selectively aggregates and compresses highly relevant visual tokens.
This reduces input visual information to 25% and suppresses irrelevant features.
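As a rough illustration, the following is a minimal PyTorch-style sketch of instruction-conditioned aggregation routing of this kind. The module name FiLMAggregationRouter, the dimensions, and the top-k selection rule are assumptions for illustration, not the authors' released implementation.

```python
# Sketch of instruction-conditioned token aggregation in the style of EFA-Routing.
# All names, dimensions, and the top-k rule are illustrative assumptions.
import torch
import torch.nn as nn


class FiLMAggregationRouter(nn.Module):
    """Modulates visual tokens with the instruction (FiLM) and keeps the
    most relevant 25% according to a learned relevance score."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.film = nn.Linear(dim, 2 * dim)   # per-channel scale and shift from the instruction
        self.score = nn.Linear(dim, 1)        # relevance score per visual token
        self.keep_ratio = keep_ratio

    def forward(self, visual_tokens: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D); instr_emb: (B, D) pooled instruction embedding
        gamma, beta = self.film(instr_emb).chunk(2, dim=-1)          # (B, D) each
        modulated = gamma.unsqueeze(1) * visual_tokens + beta.unsqueeze(1)
        scores = self.score(modulated).squeeze(-1)                   # (B, N)
        k = max(1, int(self.keep_ratio * visual_tokens.size(1)))
        idx = scores.topk(k, dim=1).indices                          # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
        return modulated.gather(1, idx)                              # (B, k, D)


# Example: 256 patch tokens compressed to the 64 most instruction-relevant tokens.
router = FiLMAggregationRouter(dim=768)
vis = torch.randn(2, 256, 768)
instr = torch.randn(2, 768)
print(router(vis, instr).shape)  # torch.Size([2, 64, 768])
```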

Second, "LFP-Routing (LLM-FiLM based Pruning Routing)" further eliminates visual tokens that are not relevant to instructions inside the language model, reducing the computational load while emphasizing task-relevant meaning.

Third, "CAtten (Vision-Language-Action Coupled Attention)" is introduced to generate action sequences from the compressed representation while maintaining logical consistency and temporal integrity.
This CAtten applies causal attention between vision and language while allowing bidirectional parallel decoding at the action layer, thus achieving both efficiency and accuracy.
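One simple way to picture this is as a hybrid attention mask: causal over the vision-language prefix, fully connected within the action block. The sketch below builds such a mask; the exact masking scheme in CAtten may differ in detail.

```python
# Hedged sketch of a hybrid attention mask of the kind CAtten implies:
# causal over the vision-language prefix, bidirectional within the action chunk.
import torch


def coupled_attention_mask(num_prefix: int, num_action: int) -> torch.Tensor:
    """Returns an (N, N) boolean mask; True means the query may attend to the key."""
    n = num_prefix + num_action
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Vision-language prefix: standard causal (lower-triangular) attention.
    mask[:num_prefix, :num_prefix] = torch.ones(num_prefix, num_prefix).tril().bool()
    # Action tokens: attend to the whole prefix and to every other action token,
    # which is what permits decoding the action chunk in parallel.
    mask[num_prefix:, :] = True
    return mask


# Example: 4 vision-language tokens followed by a chunk of 3 action tokens.
print(coupled_attention_mask(num_prefix=4, num_action=3).int())
```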

These integrated designs allow CogVLA to achieve efficiency while maintaining cross-modal semantic consistency.

Experimentation

CogVLA was evaluated in the simulation benchmark LIBERO and in a real robot environment.

In LIBERO, 500 trials were conducted on four task suites: spatial reasoning, object recognition, goal understanding, and long-horizon tasks.
As a result, CogVLA achieved an average success rate of 97.4%, outperforming existing state-of-the-art models.

In addition, the Cobot Agilex ALOHA platform was used in a real-world environment to perform complex tasks such as object placement, drawer manipulation, and T-shirt folding.
The success rate reached 70.0%, far outperforming other methods.

Furthermore, in terms of efficiency, CogVLA reduces inference time by 2.8x, FLOPs by 3.1x, and training cost by 2.5x compared to OpenVLA.
Ablation studies also confirmed the effectiveness of the proposed method, with the modules at each stage complementing one another.

