Catch up on the latest AI articles

A Proposal For Mixed Preference Optimization That Revolutionizes The Reasoning Performance Of Multimodal LLMs!

3 main points
✔️ Proposes Mixed Preference Optimization (MPO) to improve the reasoning performance of multimodal large language models
✔️ Reasoning performance improves because models can handle different data modalities more efficiently
✔️ The method shows improved performance on tasks that require advanced reasoning capabilities

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
written by Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai
(Submitted on 15 Nov 2024 (v1), last revised 7 Apr 2025 (this version, v2))
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This paper introduces Mixed Preference Optimization (MPO) as an approach for improving the reasoning performance of multimodal large language models (MLLMs). Specifically, it focuses on ways to integrate inputs from different sources.

While LLMs are typically trained primarily on textual data, they can reportedly achieve more sophisticated reasoning by also using information from other modalities, such as vision. This research addresses the challenges of combining these different modalities.

Mixed Preference Optimization is a technique that adjusts what the model learns to emphasize so that it can answer new reasoning tasks more accurately. The paper shows that this technique improves model performance and demonstrates its effectiveness in both general question answering and complex reasoning tasks.

Experimental results confirm that the technique allows the model to produce more accurate and reliable results across a variety of domains. These findings provide an important foundation for future MLLM development.

Proposed Methodology

This paper proposes a new method, Mixed Preference Optimization (MPO), to improve the reasoning capability of multimodal LLMs. The method aims to improve performance by letting the model learn from a variety of evaluation criteria for a given task.

MPO first constructs a large multimodal preference dataset (MMPR), which is then used to train the model. The dataset is designed to strengthen the model's ability to integrate disparate visual and textual information when making decisions. During training, the model is evaluated on diverse samples and optimized against several criteria at once.
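Concretely, the objective described in the paper combines three criteria into a single loss (sketched here in the paper's notation; the weights $w_p$, $w_q$, $w_g$ are hyperparameters):

$$\mathcal{L}_{\mathrm{MPO}} = w_p\,\mathcal{L}_p + w_q\,\mathcal{L}_q + w_g\,\mathcal{L}_g$$

Here $\mathcal{L}_p$ is a DPO-style preference loss that compares a preferred answer with a rejected one, $\mathcal{L}_q$ is a BCO-style quality loss that scores each answer in absolute terms, and $\mathcal{L}_g$ is the ordinary generation (SFT) loss on the preferred answer.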

Experiments

This study aims to improve the weak reasoning capability of multimodal large language models (MLLMs).

Conventional models are good at integrating text and images, but they perform poorly at chain-of-thought (CoT) reasoning, in which the model derives the correct answer while explaining its reasoning.

The research team first built a new large-scale reasoning preference dataset (MMPR). For tasks with a clear correct answer, they generated a large number of model answers and automatically labeled those close to the correct answer as "good examples" and the outliers as "bad examples."
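As a rough illustration, this correctness-based part of the pipeline might look like the following minimal sketch (the `model.generate` interface and the naive `is_correct` matcher are hypothetical stand-ins, not the paper's actual implementation):

```python
import itertools

def is_correct(answer: str, ground_truth: str) -> bool:
    # Naive stand-in for the paper's answer matching: treat an answer as
    # "close to the correct answer" if the ground-truth string appears in it.
    return ground_truth.strip().lower() in answer.lower()

def build_preference_pairs(model, question, ground_truth, n_samples=16):
    # Sample many candidate answers for a task with a clear correct answer.
    answers = [model.generate(question, temperature=1.0) for _ in range(n_samples)]
    chosen = [a for a in answers if is_correct(a, ground_truth)]        # "good examples"
    rejected = [a for a in answers if not is_correct(a, ground_truth)]  # "bad examples"
    # Every (chosen, rejected) combination yields one preference pair.
    return [
        {"question": question, "chosen": c, "rejected": r}
        for c, r in itertools.product(chosen, rejected)
    ]
```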

The pipeline also incorporates a mechanism for tasks where the correct answer is unknown: an answer is cut off in the middle, the model is asked to complete the rest, and the completed answer is used as a "bad example." Furthermore, the authors propose a new training method called Mixed Preference Optimization (MPO), which learns not only which answer is better but also, at the same time, the quality of the answers and the process of generating them. Both mechanisms are sketched below.
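The following is an illustrative reconstruction, not the paper's code; the `model.complete_without_image` helper, the loss weights, and the reward shift `delta` are assumptions:

```python
import torch
import torch.nn.functional as F

def dropout_ntp_negative(model, question, answer, drop_ratio=0.5):
    # When no ground truth exists: cut a (presumed good) answer in the middle
    # and let the model complete the rest. In the paper the completion is
    # generated without the image input, so it tends to drift; the completed
    # answer serves as the "bad example", the original as the "good" one.
    prefix = answer[: int(len(answer) * drop_ratio)]
    completion = model.complete_without_image(question, prefix)  # hypothetical API
    return prefix + completion

def mpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, sft_token_logps,
             beta=0.1, delta=0.0, w_p=0.8, w_q=0.2, w_g=1.0):
    # Log-ratios of the policy against a frozen reference model.
    r_c = beta * (logp_c - ref_logp_c)   # reward of the chosen answer
    r_r = beta * (logp_r - ref_logp_r)   # reward of the rejected answer
    # (1) Preference loss (DPO-style): which answer is better.
    l_p = -F.logsigmoid(r_c - r_r)
    # (2) Quality loss (BCO-style): absolute quality of each answer,
    #     shifted by a running-mean reward term delta.
    l_q = -F.logsigmoid(r_c - delta) - F.logsigmoid(-(r_r - delta))
    # (3) Generation loss (SFT): the process of producing the chosen answer.
    l_g = -sft_token_logps.mean()
    return w_p * l_p.mean() + w_q * l_q.mean() + w_g * l_g
```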

This approach allows the model to flexibly handle a wide variety of reasoning patterns and greatly improves its reasoning ability. In experiments, the trained model achieved higher accuracy than conventional models on benchmarks such as MathVista.

Conclusion

This paper discusses a new method for improving the reasoning capability of LLMs that handle multiple types of information. LLMs are typically trained on large text datasets, but this paper presents a way to enable more sophisticated reasoning through "Mixed Preference Optimization."

Specifically, the paper develops a method to teach models efficiently, even with limited resources, in order to elicit consistent performance across a variety of tasks and datasets. The method is designed to allow LLMs to handle data in different formats, such as images, in a multifaceted manner. As a result, it has been shown to strengthen reasoning grounded in text, reduce misinformation, and improve response accuracy.

The paper reports that multiple benchmark tests were conducted to evaluate the effectiveness of the proposed approach and that it achieved improved results compared to existing models. This suggests that the approach may contribute to the further development of LLMs.


Reviewer: nakata

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us