![Libra: A New Multimodal Design of Large Language Models Using a Separate Vision System](https://aisholar.s3.ap-northeast-1.amazonaws.com/media/February2025/libra.png)
Libra: A New Multimodal Design of Large Language Models Using a Separate Vision System
3 main points
✔️ Introducing routed visual experts and cross-modal bridging modules to process visual and linguistic information independently and effectively.
✔️ Hybrid image tokenization and discrete autoregressive modeling improve learning stability of visual data.
✔️ Achieves high performance on VQA and cross-modal benchmarks, showing results comparable to or better than previous models.
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
In recent years, multimodal AI (i.e., models that integrate visual and linguistic information) has been actively studied, driven by the rapid evolution of large language models (LLMs). Systems that combine visual and linguistic information are used in a variety of applications, including image caption generation, visual question answering (VQA), and robot decision making. However, prior work has pointed out that tightly integrating visual processing into the LLM causes the visual information to lose its independence.
In this paper, we propose a new prototype model called "Libra" to address this issue. Its main feature is that the visual system and the language model are designed separately (decoupled), so that both retain their own characteristics while achieving more effective cross-modal understanding.
Most conventional models process visual and textual information in a single, integrated stream. To address this, Libra introduces the "Routed Visual Expert," which processes visual information independently while effectively linking the visual system and the LLM.
Furthermore, to stabilize the representation of visual data, "Discrete Auto-Regressive Modeling" is adopted, enabling more effective learning of visual data. As a result, Libra achieves higher performance with less data than conventional MLLMs (Multimodal Large Language Models).
Experimental results show that Libra achieves performance comparable to existing state-of-the-art MLLMs while using only about 50 million training samples (compared to over 1 billion for conventional models). This is a significant contribution to the design of multimodal learning from a perspective different from conventional approaches.
Related Research
The main issue addressed by this study is how to integrate the visual system with the language model. There have been two main types of approaches:
- Unified models that integrate vision and language
  - Examples: Unified-IO, Flamingo
  - The language model and the visual system are trained jointly as one integrated model
  - Challenge: visual information loses its independence and is overwhelmed in scale and balance by linguistic knowledge
- Pretrain the LLM first, then integrate visual information
  - Examples: BLIP-2, Emu, CogVLM
  - The language model is strengthened first, and visual information is grafted on afterward
  - Challenge: visual information is not adequately represented, resulting in an information imbalance
This paper proposes "decoupled vision-language learning" to overcome the shortcomings of both approaches.
Proposed Method
Libra's design consists of three major elements.
Routed Visual Expert
Libra introduces the Routed Visual Expert, which allows visual information to be processed independently. In this mechanism, a vision-only expert module with its own attention mechanism is added to each layer of the LLM.
As shown in Figure 1, this design
- allocates a dedicated parameter space for vision, separate from the language model (LLaMA2), and
- routes cross-modal processing through a dedicated bridge module (Cross-Modal Bridge).
Together, these choices preserve the independence of visual information.
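To make the idea concrete, here is a minimal PyTorch sketch of a modality-routed transformer block: vision tokens and text tokens are sent to separate attention and feed-forward experts, and a cross-modal bridge lets the vision stream read from the text stream. The module names, the fusion scheme, and the toy dimensions are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class RoutedExpertLayer(nn.Module):
    """Illustrative transformer block with modality-routed experts."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Dedicated parameter spaces: one expert per modality.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.vision_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Cross-modal bridge: vision queries attend over text keys/values.
        self.bridge = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_vision: (seq,) bool mask marking image-token positions.
        v, t = x[:, is_vision], x[:, ~is_vision]
        # Intra-modal self-attention with modality-specific weights.
        v = v + self.vision_attn(v, v, v, need_weights=False)[0]
        t = t + self.text_attn(t, t, t, need_weights=False)[0]
        # Cross-modal bridge: vision tokens read from the text stream.
        v = v + self.bridge(self.norm(v), t, t, need_weights=False)[0]
        # Modality-specific feed-forward experts.
        v = v + self.vision_ffn(v)
        t = t + self.text_ffn(t)
        # Scatter both streams back into their original positions.
        out = x.clone()
        out[:, is_vision], out[:, ~is_vision] = v, t
        return out


# Toy usage: 4 image tokens followed by 6 text tokens.
layer = RoutedExpertLayer(dim=64)
mask = torch.tensor([True] * 4 + [False] * 6)
y = layer(torch.randn(2, 10, 64), mask)  # (2, 10, 64)
```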
Discrete Auto-Regressive Modeling
In conventional visual modeling, continuous image representations are typically used as-is. However, this makes the label space effectively infinite, which destabilizes learning.
Libra solves this problem by converting visual information into discrete tokens, so that each image is modeled as a sequence of "predict the next token" steps, improving learning stability (see Figure 2).
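Conceptually, this turns image modeling into the same next-token cross-entropy objective used for text. Below is a minimal sketch under that assumption; the codebook size and tensor shapes are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F


def discrete_autoregressive_loss(logits: torch.Tensor, image_token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction over discrete image tokens (illustrative sketch).

    logits:          (batch, seq, vocab) model outputs over the image-token vocabulary
    image_token_ids: (batch, seq) indices from quantizing the image (e.g. a VQ codebook)
    """
    # Shift so that position t predicts token t+1, exactly like text LM training.
    pred = logits[:, :-1].reshape(-1, logits.size(-1))
    target = image_token_ids[:, 1:].reshape(-1)
    # A finite label space makes cross-entropy well defined and keeps training stable.
    return F.cross_entropy(pred, target)


# Toy usage with random tensors (shapes and codebook size are placeholders).
logits = torch.randn(2, 16, 8192)
token_ids = torch.randint(0, 8192, (2, 16))
loss = discrete_autoregressive_loss(logits, token_ids)
```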
Hybrid Image Tokenization
Discretizing visual information can cause information loss. Libra therefore introduces a hybrid image tokenization strategy (see Figure 3) that combines
- continuous visual signals and
- discrete image tokens.
This approach allows us to retain the maximum amount of information in the image while taking advantage of CLIP's pre-trained knowledge.
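A minimal sketch of how such a hybrid tokenizer could fuse the two signals is shown below; the sum fusion, the projection layer, and all dimensions are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class HybridImageTokenizer(nn.Module):
    """Illustrative sketch of hybrid image tokenization."""

    def __init__(self, feat_dim: int, codebook_size: int, lm_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, lm_dim)  # discrete side
        self.proj = nn.Linear(feat_dim, lm_dim)              # continuous side

    def forward(self, clip_features: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, patches, feat_dim) continuous signals from a frozen CLIP-style encoder
        # token_ids:     (batch, patches) discrete image-token indices
        continuous = self.proj(clip_features)
        discrete = self.codebook(token_ids)
        # Sum fusion keeps the pretrained CLIP information, while the discrete
        # indices give the autoregressive objective a finite label space.
        return continuous + discrete


# Toy usage (random tensors; dimensions are placeholders, not the paper's values).
tok = HybridImageTokenizer(feat_dim=1024, codebook_size=8192, lm_dim=4096)
feats = torch.randn(2, 256, 1024)
ids = torch.randint(0, 8192, (2, 256))
embeddings = tok(feats, ids)  # (2, 256, 4096)
```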
Experimental Results
Libra was evaluated on the following multimodal benchmarks:
- VQA (Visual Question Answering)
  - The task of answering questions about an image
  - Libra achieves scores comparable to Qwen-VL and LLaVA 1.5 with only about 50 million training samples (see Table 1)
- Image Captioning
  - The task of generating a description of an image
  - Achieved higher accuracy on the Flickr30K and COCO datasets compared to GPT-4V and PaLM-E (see Table 2)
- Multimodal Visual Perception (MVP)
  - A measure of how accurately an MLLM understands visual information
  - Thanks to the independence of its visual information, Libra showed higher accuracy than the other models (see Figure 4)
Conclusion
In this study, we proposed Libra, a new MLLM that can process visual information independently.
It overcomes the challenges of conventional integrated vision-language learning by combining three techniques:
- routed visual experts
- discrete autoregressive modeling
- hybrid image tokenization
Together, these achieve high performance while preserving the uniqueness of visual information.
In the future, it is expected to be used for learning with more diverse data sets and for applications to video data. Personally, I felt that this technology is very promising for use in situations where real-time visual recognition is required, such as in the medical field and in automated driving.