[Qwen2-VL] The Latest VLM That Can Process Images and Videos at Different Resolutions
3 main points
✔️ Qwen2-VL efficiently processes images and videos at different resolutions with Naive Dynamic Resolution
✔️ M-RoPE unifies positional information across text, images, and video, enabling complex spatio-temporal tasks
✔️ The 72B model solves a variety of tasks with high accuracy and multilingual support, strengthening integrated image-text processing
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
written by Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
(Submitted on 18 Sep 2024)
Comments: Code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Background
This paper proposes Qwen2-VL, a model that understands both visual and textual inputs. Qwen2-VL is particularly distinctive in its support for dynamic resolution, which allows it to efficiently process images and videos at a variety of resolutions.
Conventional large vision-language models (LVLMs) can only process images at a fixed resolution, so important information is easily lost from higher-resolution inputs. To overcome this problem, Qwen2-VL adapts its processing to the size of the input image and can accurately capture details even in high-resolution images.
Qwen2-VL also employs Multimodal Rotary Position Embedding (M-RoPE), which efficiently integrates positional information from images, videos, and text. This improves the model's ability to understand not only images and text but also complex scenes and motion in video.
Proposed Method
The Qwen2-VL model proposed in this paper is designed to integrate visual and linguistic information for advanced recognition. In particular, a new mechanism called "Naive Dynamic Resolution" allows flexible processing regardless of the resolution of the input image or video. Conventional LVLMs can only handle images at a fixed resolution and sometimes lose details in higher-resolution images; Qwen2-VL solves this problem.
First, Qwen2-VL uses a Vision Transformer (ViT) to process images. This ViT has roughly 675M parameters and is shared across all model scales. It encodes the visual data, and the resulting features are interpreted in combination with the language model (LLM). Even when the input resolution is high, a built-in mechanism compresses the visual features to an appropriate number of tokens so that information is extracted efficiently.
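As a rough illustration, here is a minimal sketch of how dynamic resolution maps an input size to a variable number of visual tokens, assuming the 14x14 patch size and 2x2 token merging described for Qwen2-VL; the helper function itself is hypothetical, not the authors' code.

```python
import math

def visual_token_count(height: int, width: int,
                       patch_size: int = 14, merge_size: int = 2) -> int:
    """Estimate the number of visual tokens for an image of a given size.

    The ViT splits the image into patch_size x patch_size patches, and a
    small MLP then merges each merge_size x merge_size block of adjacent
    patch tokens into one token before it reaches the LLM.
    """
    grid_h = math.ceil(height / patch_size)   # patch grid rows
    grid_w = math.ceil(width / patch_size)    # patch grid columns
    return math.ceil(grid_h / merge_size) * math.ceil(grid_w / merge_size)

# Different resolutions produce different token budgets instead of being
# forced to one fixed size (the paper additionally wraps the visual tokens
# in <|vision_start|> and <|vision_end|> markers).
print(visual_token_count(224, 224))    # 64
print(visual_token_count(1092, 1092))  # 1521
```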
Next, Qwen2-VL employs Multimodal Rotary Position Embedding (M-RoPE). Rather than using only the ordinary 1D positions of text, M-RoPE decomposes positional encoding into temporal, height, and width components, so that text, images, and video are all embedded in one consistent multi-dimensional scheme. This enables the model to handle dynamic data such as video and greatly improves its ability to capture scene changes and the passage of time.
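To make the idea concrete, below is a sketch of how 3D position IDs (temporal, height, width) could be assigned to a mixed text-and-image sequence; the function and its input format are illustrative assumptions, not the released implementation.

```python
import numpy as np

def mrope_position_ids(segments):
    """Assign (temporal, height, width) position IDs to a mixed sequence.

    `segments` uses an illustrative format, e.g.
      {"type": "text",  "length": 3}
      {"type": "image", "grid": (2, 2)}               # one frame, 2x2 token grid
      {"type": "video", "frames": 4, "grid": (2, 2)}  # 4 frames of 2x2 tokens
    """
    t_ids, h_ids, w_ids = [], [], []
    next_pos = 0  # where the next segment's IDs start

    for seg in segments:
        if seg["type"] == "text":
            # Text tokens: all three components share one increasing ID,
            # which makes M-RoPE behave like ordinary 1D RoPE for text.
            ids = list(range(next_pos, next_pos + seg["length"]))
            t_ids += ids; h_ids += ids; w_ids += ids
        else:
            # Visual tokens: the temporal ID is constant within a frame and
            # increments across frames; height/width index the 2D grid.
            gh, gw = seg["grid"]
            for f in range(seg.get("frames", 1)):   # an image is a single frame
                for i in range(gh):
                    for j in range(gw):
                        t_ids.append(next_pos + f)
                        h_ids.append(next_pos + i)
                        w_ids.append(next_pos + j)
        # The following segment starts after the largest ID used so far.
        next_pos = max(t_ids[-1], h_ids[-1], w_ids[-1]) + 1

    return np.array([t_ids, h_ids, w_ids])

# 3 text tokens followed by a 2x2 image grid -> 7 tokens, 3 ID components each.
print(mrope_position_ids([{"type": "text", "length": 3},
                          {"type": "image", "grid": (2, 2)}]).shape)  # (3, 7)
```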
Furthermore, Qwen2-VL adopts a unified image and video understanding scheme that processes images and videos in a consistent manner. This allows it to properly understand not only short clips but also videos of 20 minutes or longer. In particular, it effectively handles the temporal connections between frames as well as detailed positional information within each frame.
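As a small sketch of this unified treatment (the 2 fps sampling and the convention of treating an image as two identical frames follow the paper, while the function itself and the array shapes are assumptions for illustration):

```python
import numpy as np

def to_frame_stack(media: np.ndarray, video_fps: float = 30.0,
                   sample_fps: float = 2.0) -> np.ndarray:
    """Normalize an image or a video into one common frame stack.

    A still image (H, W, C) is duplicated into two identical frames so it
    can pass through the same temporal pipeline as video; a video
    (T, H, W, C) is subsampled to roughly `sample_fps` frames per second.
    """
    if media.ndim == 3:                       # still image
        return np.stack([media, media])       # -> (2, H, W, C)
    step = max(int(round(video_fps / sample_fps)), 1)
    return media[::step]                      # -> subsampled video frames

# An image becomes a 2-frame stack; a 10 s, 30 fps clip keeps ~20 frames.
print(to_frame_stack(np.zeros((64, 64, 3))).shape)        # (2, 64, 64, 3)
print(to_frame_stack(np.zeros((300, 64, 64, 3))).shape)   # (20, 64, 64, 3)
```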
A further strength of Qwen2-VL is that accuracy improves as the model size increases. In particular, the large 72B model can perform highly complex image and video tasks and has demonstrated state-of-the-art performance on a variety of benchmarks; for example, it outperforms many other models on datasets such as DocVQA and MathVista.
Experiment
The experiments in this paper test the performance of Qwen2-VL on a variety of vision-language tasks. Their purpose is to see how the proposed model compares with existing models.
First, the model was evaluated on multiple benchmark datasets covering visual question answering (VQA), document understanding, video comprehension, and mathematical reasoning. For example, on text-recognition-oriented datasets such as DocVQA and InfoVQA, Qwen2-VL achieved accuracy exceeding state-of-the-art models. In particular, the 72B model has been shown to understand text in documents with high accuracy.
The experiments also tested long-video understanding: Qwen2-VL can process videos longer than 20 minutes, comprehend their content, and answer questions accurately. This capability is very useful for long-duration dynamic content, which previous models have struggled with.
The M-RoPE mechanism, which handles positional information for images and text simultaneously, also proves useful for processing video.
In addition, the experiments verified the impact of model size on performance. Running the same tasks with models ranging from small (2B) to large (72B) confirmed that larger models solve the problems with higher accuracy. However, some tasks are already handled well regardless of model size, suggesting that model efficiency should also be taken into account.
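As a practical note, the released checkpoints can be tried through Hugging Face transformers. The snippet below is a minimal sketch assuming a recent transformers version with Qwen2-VL support (the Qwen2VLForConditionalGeneration class) and the public "Qwen/Qwen2-VL-2B-Instruct" checkpoint; it is not the authors' evaluation code.

```python
# Minimal inference sketch; assumes transformers >= 4.45 (Qwen2-VL support),
# accelerate for device_map, and the public Qwen/Qwen2-VL-2B-Instruct checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One image plus a question, in the chat format expected by the processor.
image = Image.open("document.png")   # any local document image (hypothetical path)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the title of this document?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image],
                   padding=True, return_tensors="pt").to(model.device)

# The image is tokenized at (close to) its native resolution thanks to
# dynamic resolution, then the LLM generates the answer.
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```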
Finally, the results are presented in tabular form, making it clear that Qwen2-VL outperforms competing models on many benchmarks. This demonstrates that Qwen2-VL is a very powerful tool for combined vision-language tasks.
Conclusion
The paper concludes that Qwen2-VL demonstrates very strong performance in vision-language processing, further advancing the state of the art. In particular, innovations such as Naive Dynamic Resolution, which allows flexible processing of images and videos regardless of resolution, and M-RoPE, which integrates spatio-temporal positional information, yield results that exceed the limits of previous models.
Experimental results show that Qwen2-VL outperforms other state-of-the-art models on many benchmarks, with the large 72B model performing best on complex tasks. In addition, the model supports not only English and Chinese but also Japanese and other languages, confirming its global applicability.
This technology is expected to play a major role in a variety of future applications requiring the fusion of vision and language. Qwen2-VL also has the potential to be used for agent-style operation of robots and mobile devices, and further development is anticipated.
In conclusion, Qwen2-VL sets a new standard for visual language models with its high performance, scalability, and multilingual compatibility.