![Libra: A New Multimodal Design of Large Language Models Using a Separate Vision System](https://aisholar.s3.ap-northeast-1.amazonaws.com/media/February2025/libra.png)
Libra: A New Multimodal Design of Large Language Models Using a Separate Vision System
3 main points
✔️ Introducing routed visual experts and cross-modal bridging modules to process visual and linguistic information independently and effectively.
✔️ Hybrid image tokenization and discrete autoregressive modeling improve learning stability of visual data.
✔️ Achieves high performance on VQA and cross-modal benchmarks, showing results comparable to or better than previous models.
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
In recent years, multimodal AI (i.e., models that integrate visual and linguistic information) has been actively studied, driven by the rapid evolution of large language models (LLMs). Systems that combine visual and linguistic information are used in a variety of applications, including image caption generation, visual question answering (VQA), and robot decision making. However, prior work has pointed out that tightly integrating visual processing into the LLM causes the visual information to lose its independence.
In this paper, we propose a new prototype model called "Libra" to address this issue. Its main feature is that the visual system and the language model are designed separately (decoupled), so that both retain their own characteristics while achieving more effective cross-modal understanding.
Most conventional models process visual and textual information in a single, integrated stream. To address this, Libra introduces the "Routed Visual Expert," which processes visual information independently while effectively linking the visual system and the LLM.
Furthermore, to stabilize the representation of visual data, "Discrete Auto-Regressive Modeling" is adopted, enabling more effective learning of visual data. As a result, Libra achieves higher performance with less data than conventional MLLMs (Multimodal Large Language Models).
Experimental results show that Libra achieves performance comparable to existing state-of-the-art MLLMs while using only about 50 million training samples (compared to over 1 billion for conventional models). This is a significant contribution to the design of multimodal learning from a perspective different from conventional approaches.
Related Research
The main issue addressed by this study is how to integrate the visual system with the language model. There have been two main types of approaches:
- Unified models that integrate vision and language
  - Examples: Unified-IO, Flamingo
  - The language model and the visual system are trained jointly as one integrated model
  - Challenge: visual information loses its independence and is overwhelmed in scale and balance by linguistic knowledge
- Pretrain the LLM first, then integrate visual information
  - Examples: BLIP-2, Emu, CogVLM
  - The language model is strengthened first, and visual information is grafted on afterward
  - Challenge: visual information is not adequately represented, resulting in an information imbalance
This paper proposes "decoupled vision-language learning" to overcome the shortcomings of both approaches.
Proposed Method
Libra's design consists of three major elements.
Routed Visual Expert
Libra introduces the Routed Visual Expert, which allows visual information to be processed independently. In this mechanism, a vision-only expert module with its own attention mechanism is added to each layer of the LLM.
As shown in Figure 1, this design
- allocates a dedicated parameter space for vision, separate from the language model (LLaMA2), and
- routes cross-modal processing through a dedicated bridge module (Cross-Modal Bridge).
Together, these choices preserve the independence of visual information.
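To make the idea concrete, here is a minimal PyTorch sketch of a modality-routed transformer block: vision tokens and text tokens are sent to separate attention and feed-forward experts, and a cross-modal bridge lets the vision stream read from the text stream. The module names, the fusion scheme, and the toy dimensions are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class RoutedExpertLayer(nn.Module):
    """Illustrative transformer block with modality-routed experts."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Dedicated parameter spaces: one expert per modality.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.vision_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Cross-modal bridge: vision queries attend over text keys/values.
        self.bridge = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_vision: (seq,) bool mask marking image-token positions.
        v, t = x[:, is_vision], x[:, ~is_vision]
        # Intra-modal self-attention with modality-specific weights.
        v = v + self.vision_attn(v, v, v, need_weights=False)[0]
        t = t + self.text_attn(t, t, t, need_weights=False)[0]
        # Cross-modal bridge: vision tokens read from the text stream.
        v = v + self.bridge(self.norm(v), t, t, need_weights=False)[0]
        # Modality-specific feed-forward experts.
        v = v + self.vision_ffn(v)
        t = t + self.text_ffn(t)
        # Scatter both streams back into their original positions.
        out = x.clone()
        out[:, is_vision], out[:, ~is_vision] = v, t
        return out


# Toy usage: 4 image tokens followed by 6 text tokens.
layer = RoutedExpertLayer(dim=64)
mask = torch.tensor([True] * 4 + [False] * 6)
y = layer(torch.randn(2, 10, 64), mask)  # (2, 10, 64)
```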
Discrete Auto-Regressive Modeling
In conventional visual modeling, continuous image representations are typically used as-is. However, this makes the label space effectively infinite, which destabilizes learning.
Libra solves this problem by converting visual information into discrete tokens, so that each image is modeled as a sequence of "predict the next token" steps, improving learning stability (see Figure 2).
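Conceptually, this turns image modeling into the same next-token cross-entropy objective used for text. Below is a minimal sketch under that assumption; the codebook size and tensor shapes are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F


def discrete_autoregressive_loss(logits: torch.Tensor, image_token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction over discrete image tokens (illustrative sketch).

    logits:          (batch, seq, vocab) model outputs over the image-token vocabulary
    image_token_ids: (batch, seq) indices from quantizing the image (e.g. a VQ codebook)
    """
    # Shift so that position t predicts token t+1, exactly like text LM training.
    pred = logits[:, :-1].reshape(-1, logits.size(-1))
    target = image_token_ids[:, 1:].reshape(-1)
    # A finite label space makes cross-entropy well defined and keeps training stable.
    return F.cross_entropy(pred, target)


# Toy usage with random tensors (shapes and codebook size are placeholders).
logits = torch.randn(2, 16, 8192)
token_ids = torch.randint(0, 8192, (2, 16))
loss = discrete_autoregressive_loss(logits, token_ids)
```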
Hybrid Image Tokenization
Discretizing visual information can cause information loss. Libra therefore introduces a hybrid image tokenization strategy (see Figure 3) that combines
- continuous visual signals and
- discrete image tokens.
This approach allows us to retain the maximum amount of information in the image while taking advantage of CLIP's pre-trained knowledge.
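A minimal sketch of how such a hybrid tokenizer could fuse the two signals is shown below; the sum fusion, the projection layer, and all dimensions are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class HybridImageTokenizer(nn.Module):
    """Illustrative sketch of hybrid image tokenization."""

    def __init__(self, feat_dim: int, codebook_size: int, lm_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, lm_dim)  # discrete side
        self.proj = nn.Linear(feat_dim, lm_dim)              # continuous side

    def forward(self, clip_features: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, patches, feat_dim) continuous signals from a frozen CLIP-style encoder
        # token_ids:     (batch, patches) discrete image-token indices
        continuous = self.proj(clip_features)
        discrete = self.codebook(token_ids)
        # Sum fusion keeps the pretrained CLIP information, while the discrete
        # indices give the autoregressive objective a finite label space.
        return continuous + discrete


# Toy usage (random tensors; dimensions are placeholders, not the paper's values).
tok = HybridImageTokenizer(feat_dim=1024, codebook_size=8192, lm_dim=4096)
feats = torch.randn(2, 256, 1024)
ids = torch.randint(0, 8192, (2, 256))
embeddings = tok(feats, ids)  # (2, 256, 4096)
```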
Experimental Results
Libra was evaluated on the following multimodal benchmarks:
- VQA (Visual Question Answering)
  - The task of answering questions about an image
  - Libra achieves scores comparable to Qwen-VL and LLaVA 1.5 with only about 50 million training samples (see Table 1)
- Image Captioning
  - The task of generating a description of an image
  - Achieved higher accuracy on the Flickr30K and COCO datasets compared to GPT-4V and PaLM-E (see Table 2)
- Multimodal Visual Perception (MVP)
  - A measure of how accurately an MLLM understands visual information
  - Thanks to the independence of its visual information, Libra showed higher accuracy than the other models (see Figure 4)
Conclusion
In this study, we proposed Libra, a new MLLM that can process visual information independently.
It overcomes the challenges of conventional integrated vision-language learning by combining three techniques:
- routed visual experts
- discrete autoregressive modeling
- hybrid image tokenization
Together, these achieve high performance while preserving the uniqueness of visual information.
In the future, it is expected to be used for learning with more diverse data sets and for applications to video data. Personally, I felt that this technology is very promising for use in situations where real-time visual recognition is required, such as in the medical field and in automated driving.