
GenRecal: A General-Purpose Distillation Framework for Lightweight, High-Performance Vision-Language Models

3 main points
✔️ This paper proposes a new framework, GenRecal, to solve the problem of knowledge distillation from large vision language models to small models
✔️ GenRecal enables knowledge distillation between models with different token types through a process called recalibration
✔️ It overcomes the limitations of conventional distillation methods, enabling more efficient creation of smaller models and improved performance across a variety of vision-language models

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
written by Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
(Submitted on 18 Jun 2025)
Comments: Project page: this https URL
Subjects: Computation and Language (cs.CL)

code: 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

This paper proposes a new method for distilling large vision-language models (VLMs) into smaller, more efficient models.

Traditional methods have difficulty performing knowledge distillation between models with different token types. To solve this problem, the authors propose an approach called "Recalibration," which allows knowledge transfer between models whose token representations differ. Specifically, Recalibration improves learning efficiency by adapting the small model's tokens to the large model's representation space.

This method outperforms traditional approaches on numerous benchmarks and is considered useful for developing efficient multimodal AI systems. Because the distillation process is flexible enough to combine different models, it can also be tailored to specific applications. This is expected to open new avenues for building high-performance AI systems even in resource-constrained environments.

Proposed Methodology

In this paper, we propose a new framework called Generation after Recalibration (GenRecal) for effective knowledge distillation from large-scale to small-scale vision-language models (VLMs) with different token types. In this approach, the small-scale and large-scale VLMs are first given the same input to obtain their respective intermediate representations.
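
As a rough illustration of this first step, the sketch below feeds the same image-text input to both models and keeps their last-layer hidden states. The Hugging Face-style forward signature with output_hidden_states is an assumption for illustration, not the paper's actual code.

```python
import torch

@torch.no_grad()
def teacher_features(teacher, pixel_values, input_ids):
    # Assumes a Hugging Face-style forward that can return hidden states;
    # the real GenRecal code may extract intermediate features differently.
    out = teacher(pixel_values=pixel_values, input_ids=input_ids,
                  output_hidden_states=True)
    return out.hidden_states[-1]  # (batch, teacher_tokens, teacher_dim)

def student_features(student, pixel_values, input_ids):
    out = student(pixel_values=pixel_values, input_ids=input_ids,
                  output_hidden_states=True)
    return out.hidden_states[-1]  # (batch, student_tokens, student_dim)
```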

Next, a module called the Recalibrator projects the features of the small-scale model into the representation space of the large-scale model to ensure compatibility. The Recalibrator consists of two projection layers and a decoder block, which match token dimensions and re-attach positional information. Training proceeds in three stages: the first stage trains the Recalibrator alone to align the representations, the second stage starts the distillation, and the final stage fine-tunes the whole model.
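
The following sketch shows what such a module could look like. The layer sizes, the use of a generic transformer block in place of the paper's decoder block, and the stage-freezing helper are all illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class Recalibrator(nn.Module):
    """Illustrative recalibrator: projects student features into the teacher's
    representation space. Layer choices are assumptions, not the paper's design."""
    def __init__(self, student_dim: int, teacher_dim: int,
                 n_heads: int = 8, max_tokens: int = 4096):
        super().__init__()
        self.in_proj = nn.Linear(student_dim, teacher_dim)    # projection layer 1: dimension matching
        self.pos_emb = nn.Embedding(max_tokens, teacher_dim)  # re-attach positional information
        self.block = nn.TransformerEncoderLayer(              # stand-in for the decoder block
            d_model=teacher_dim, nhead=n_heads, batch_first=True)
        self.out_proj = nn.Linear(teacher_dim, teacher_dim)   # projection layer 2

    def forward(self, student_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (batch, student_tokens, student_dim)
        x = self.in_proj(student_feats)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)                              # positional re-injection
        x = self.block(x)
        return self.out_proj(x)                                # features in the teacher's space

def set_stage(stage: int, recalibrator: nn.Module, student: nn.Module) -> None:
    """Toy version of the three-stage schedule: stage 1 trains the Recalibrator
    alone; later stages also unfreeze the student (the teacher stays frozen)."""
    for p in recalibrator.parameters():
        p.requires_grad = True
    for p in student.parameters():
        p.requires_grad = stage >= 2
```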

This structure allows knowledge transfer between different architectures, which is not possible with conventional methods, and carries the high-performance model's accurate reasoning abilities over to the lightweight model.
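
One plausible way to drive this transfer, assuming the recalibrated student features live in the teacher's space, is to combine the student's usual language-modeling loss with a feature-matching term. The loss below is a hedged sketch, not GenRecal's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_step(recalibrated, teacher_feats, student_lm_loss, alpha=1.0):
    # recalibrated, teacher_feats: (batch, tokens, teacher_dim); we assume the two
    # token sequences have already been made comparable in length.
    feat_loss = F.mse_loss(recalibrated, teacher_feats.detach())
    return student_lm_loss + alpha * feat_loss
```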

Experiments

Experiments were conducted with various combinations of teacher and student models to verify that GenRecal outperforms traditional distillation methods.

In particular, powerful teacher models such as InternVL2.5-78B and Qwen2-VL-72B were paired with the smaller InternVL2.5-8B and Qwen2-VL-2B. Evaluations on benchmarks such as MM-Vet and MMMU show that GenRecal significantly outperforms conventional knowledge distillation and simple fine-tuning.

To verify the effectiveness of the Recalibrator, the authors compared performance with and without a regularization term and visualized the feature space with t-SNE, showing that representation alignment is essential for knowledge transfer. Furthermore, stronger teacher models tend to yield more accurate student models, providing multifaceted support for the effectiveness of the method.
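
For readers who want to reproduce this kind of alignment check, the snippet below is a purely illustrative sketch that embeds teacher and recalibrated student features with t-SNE and plots them together.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_alignment(teacher_feats: np.ndarray, student_feats: np.ndarray) -> None:
    # teacher_feats, student_feats: (num_tokens, dim) arrays in the same feature space
    feats = np.concatenate([teacher_feats, student_feats], axis=0)
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
    n = len(teacher_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="teacher")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="student (recalibrated)")
    plt.legend()
    plt.title("t-SNE of teacher vs. recalibrated student features")
    plt.show()
```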

Conclusion

In this paper, we proposed GenRecal, a new framework that enables knowledge distillation between vision-language models (VLMs) with different architectures and token types. At its core, the Recalibrator adapts features from the small-scale model to the representation space of the large-scale model for effective knowledge transfer. A three-stage training mechanism incrementally improves performance, moving from feature alignment to distillation to fine-tuning.

Experiments show that GenRecal outperforms traditional distillation methods and simple fine-tuning, achieving high accuracy across a wide range of benchmarks. Furthermore, the better the teacher model, the better the resulting student model, a trend indicating that the Recalibrator is key to successful distillation.

This study is an important step forward in the development of lightweight, high-performance VLMs.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us