
[NVLM] Multimodal LLM Outperforms GPT-4o In Image And Language Tasks

Large Language Models

3 main points
✔️ Proposes NVLM 1.0, a new family of multimodal large language models
✔️ These models handle vision and language tasks simultaneously and outperform previous models
✔️ They perform advanced multimodal tasks such as complex reasoning and OCR more efficiently and effectively

NVLM: Open Frontier-Class Multimodal LLMs
written by Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
(Submitted on 17 Sep 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)


code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Background

There have been two main approaches to building multimodal LLMs: decoder-only architectures (e.g., LLaVA) and cross-attention-based architectures (e.g., Flamingo). NVLM 1.0 compares the pros and cons of these approaches and introduces a new hybrid architecture that improves both training efficiency and multimodal reasoning capability.

The paper also introduces a new technique called 1-D Tile Tag Design, which processes high-resolution images in a tiled format. This greatly improves performance on OCR (Optical Character Recognition)-related tasks and multimodal reasoning.

In addition, the multimodal pre-training and supervised fine-tuning datasets are described in detail, showing that data quality and task diversity are more important than scale.

Technique

The most important feature of NVLM 1.0 is that it is a family of models with three different architectures. These are the decoder-only NVLM-D, the cross-attention-based NVLM-X, and the hybrid architecture NVLM-H, which combines the best of both worlds. This combination allows each model to perform optimally in different types of tasks.

NVLM-D feeds visual features directly into a decoder-only network for unified multimodal reasoning. NVLM-X, on the other hand, uses cross-attention to capture visual information efficiently, giving it an edge in processing high-resolution images. Finally, NVLM-H processes the thumbnail image in the decoder layers and the remaining tiled image information in the cross-attention layers, increasing computational efficiency while retaining the benefits of both.
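
To make the difference between the three variants concrete, the following is a minimal, self-contained sketch of how visual features could be routed in each case. The module names, toy dimensions, and simple attention blocks are illustrative assumptions for this article, not the authors' implementation.

```python
# Conceptual sketch (not the authors' code) of the three NVLM fusion styles.
# Dimensions, module names, and the toy attention blocks are assumptions.
import torch
import torch.nn as nn

D_MODEL = 64  # toy hidden size (assumption)

class DecoderOnlyFusion(nn.Module):
    """NVLM-D style: project image tokens, concatenate them with text tokens,
    and let ordinary self-attention handle the joint sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_MODEL, D_MODEL)  # vision-to-LLM projector
        self.self_attn = nn.MultiheadAttention(D_MODEL, 4, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        seq = torch.cat([self.proj(image_tokens), text_tokens], dim=1)
        out, _ = self.self_attn(seq, seq, seq)  # joint self-attention
        return out

class CrossAttentionFusion(nn.Module):
    """NVLM-X style: text tokens attend to image tokens through gated
    cross-attention; image tokens never enter the decoder sequence."""
    def __init__(self):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(D_MODEL, 4, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable gate

    def forward(self, text_tokens, image_tokens):
        attended, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

class HybridFusion(nn.Module):
    """NVLM-H style: the low-resolution thumbnail goes through the decoder
    path, while the high-resolution tiles are consumed via cross-attention."""
    def __init__(self):
        super().__init__()
        self.decoder_path = DecoderOnlyFusion()
        self.cross_path = CrossAttentionFusion()

    def forward(self, text_tokens, thumbnail_tokens, tile_tokens):
        fused = self.decoder_path(text_tokens, thumbnail_tokens)
        return self.cross_path(fused, tile_tokens)

if __name__ == "__main__":
    text = torch.randn(1, 16, D_MODEL)   # 16 text tokens
    thumb = torch.randn(1, 8, D_MODEL)   # thumbnail image tokens
    tiles = torch.randn(1, 32, D_MODEL)  # tokens from high-resolution tiles
    print(HybridFusion()(text, thumb, tiles).shape)  # torch.Size([1, 24, 64])
```

The point of the hybrid path is visible in the token counts: only the small thumbnail enlarges the decoder sequence, while the many tile tokens are consumed through cross-attention, which keeps the sequence length (and cost) down.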

In addition, NVLM 1.0 introduces a high-resolution image processing technique called "1-D Tile Tag Design." This technique divides an image into multiple tiles and tags each tile so the model can tell them apart, greatly improving accuracy on OCR-related tasks.
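
To illustrate the tile-tag idea, the sketch below shows one way to cut a high-resolution image into a thumbnail plus a row of tiles and to lay out per-tile text tags ahead of the visual tokens. The tag strings, tile size, and helper functions are illustrative assumptions, not the exact scheme used in the paper.

```python
# Minimal sketch (assumptions throughout) of 1-D tile tagging: a high-resolution
# image becomes a global thumbnail plus fixed-size tiles, and a text tag is
# inserted before each tile's features so the LLM knows which flattened block
# of visual tokens came from which tile.
from PIL import Image

TILE_SIZE = 448  # per-tile resolution (assumed value)

def tile_image(img: Image.Image, tile_size: int = TILE_SIZE):
    """Split an image into a grid of fixed-size tiles plus a global thumbnail."""
    cols = max(1, img.width // tile_size)
    rows = max(1, img.height // tile_size)
    resized = img.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size,
                      (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((tile_size, tile_size))  # low-resolution global view
    return thumbnail, tiles

def build_tagged_sequence(num_tiles: int) -> str:
    """Lay out 1-D tile tags in the text stream; <image> marks where the
    corresponding visual tokens would be spliced in."""
    parts = ["<tile_global_thumbnail><image>"]
    parts += [f"<tile_{i + 1}><image>" for i in range(num_tiles)]
    return "".join(parts)

if __name__ == "__main__":
    img = Image.new("RGB", (1344, 896))  # stand-in for a real document scan
    thumbnail, tiles = tile_image(img)
    print(len(tiles), "tiles")           # -> 6 tiles (3 x 2 grid)
    print(build_tagged_sequence(len(tiles)))
```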

These design and data innovations allow NVLM 1.0 not only to achieve high performance on vision and language tasks, but also to maintain, and even improve, performance on text-only tasks.

Experiment

The experiments in this paper evaluate the NVLM 1.0 models on several benchmarks. The evaluation focuses on both combined vision-and-language tasks and text-only tasks, and the three architectures (NVLM-D, NVLM-X, and NVLM-H) were compared to assess the capabilities of each.

First, several benchmarks were used to evaluate combined vision-and-language tasks. These included MMMU for complex multimodal reasoning, MathVista for mathematical reasoning in visual contexts, VQAv2 for image understanding, and OCRBench for OCR capabilities. These tests show how each model performs on different types of tasks.

The NVLM-D model demonstrated high accuracy, especially on OCR tasks and image comprehension, showing an advantage over the other models. The NVLM-X model, on the other hand, used cross-attention to process high-resolution images more efficiently and delivered strong results in inference speed and accuracy, standing out especially in mathematical reasoning and complex visual problems.

These models were also evaluated on text-only tasks to determine whether their text performance degraded after multimodal training. The results confirmed that the NVLM models maintained, or even improved, their performance on text tasks after training.

Experimental results show that NVLM 1.0 performs very well on both vision and language tasks, with particularly strong performance on OCR tasks and in scenarios requiring complex reasoning.

Summary

The conclusion of the paper states that NVLM 1.0 demonstrated high performance on a wide variety of tasks, opening up new possibilities for multimodal large language models. In particular, NVLM 1.0 matched or exceeded the performance of other state-of-the-art models on tasks that require integration of vision and language.

Overall, NVLM 1.0 expands the range of flexible and powerful solutions available for a wide variety of applications, especially advanced tasks that deal with both vision and language. This research will contribute to the future development of multimodal models, and it is hoped that the released model weights and code will facilitate further research and applications.

