
Multimodal GPT-4 And LLaVA Integration Of Advanced Image Understanding And Natural Language Interaction


Computer Vision

3 main points
✔️ Demonstrates that visual instruction tuning using language-only GPT-4 is effective.
✔️ Introduces an automated pipeline for generating data that follows language and image instructions.
✔️ Identifies future work: pre-training on larger scales of image-text data, and new capabilities through improved chat assistants and integration with vision models.

Visual Instruction Tuning
written by Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
(Submitted on 17 Apr 2023 (this version), latest version 11 Dec 2023 (v2))
Comments: project page: this https URL

Subjects:  Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This paper proposes LLaVA, a large multimodal model that uses language-only GPT-4 to generate multimodal language-image instruction-following data and trains on it to connect vision and language understanding. Initial experiments show that LLaVA exhibits excellent multimodal chat capabilities and achieves a high relative score against GPT-4 on a synthetic multimodal instruction-following benchmark. When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 achieved a new state-of-the-art accuracy.


This paper focuses on the development of an artificial intelligence assistant that combines vision and language. In traditional models, each task is solved independently, and language only describes the image content. With the evolution of large language models (LLMs), however, language can now direct a wide variety of tasks. This paper introduces a technique called visual instruction tuning, which generates visual instruction data to build large multimodal models (LMMs). The generated data is used to fine-tune the LMM into a general-purpose instruction-following visual agent, which, combined with GPT-4, achieves superior performance on the ScienceQA multimodal reasoning dataset.

Related Research

This paper focuses on approaches for building agents that follow visual and verbal instructions. Existing work can be broadly classified into end-to-end trained models and systems that coordinate different models through frameworks such as LangChain. The authors also apply the instruction-tuning approach proposed for LLMs in natural language processing (NLP) research to visual tasks in order to build a general-purpose instruction-following visual agent. This improves the effective understanding and generalization of instructions, suggesting applicability to new multimodal tasks.

GPT-assisted generation of visual instruction data

While the community has seen a surge in publicly available image and text data, multimodal instruction-following data remains limited. To address this challenge, the authors propose using ChatGPT/GPT-4 to collect multimodal instruction-following data from widely available image-text pair data.

An approach is proposed that uses GPT-4 to generate natural instruction questions from image-text pairs. Because naive expansion methods lack diversity and deep reasoning, the authors use language-only GPT-4 and ChatGPT as teachers to generate visual instruction-following data. Images are encoded as symbolic representations, so that a text-only model can reason about them, and different types of instruction-following data are generated. The results suggest that GPT-4 consistently provides high-quality instruction-following data and yields better results than ordinary data augmentation.
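As a concrete illustration, the symbolic encoding step can be sketched as follows: image annotations (captions and bounding boxes) are rendered as plain text so that a language-only model can "see" the image. The function name, the example caption, and the box coordinates below are all hypothetical, not taken from the paper's actual prompts.

```python
def build_symbolic_context(captions, boxes):
    """Render an image as text for a language-only model: its captions
    plus object bounding boxes as normalized [x1, y1, x2, y2] coordinates."""
    lines = list(captions)
    for name, (x1, y1, x2, y2) in boxes.items():
        lines.append(f"{name}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]")
    return "\n".join(lines)


# Illustrative annotations for one image (values are made up).
context = build_symbolic_context(
    ["A group of people standing outside of a black vehicle."],
    {"person": (0.68, 0.24, 0.77, 0.69), "suitcase": (0.08, 0.62, 0.34, 0.92)},
)

# The context is then embedded in a prompt asking the teacher model
# (ChatGPT/GPT-4) to generate instruction-following data.
prompt = (
    "You are an AI visual assistant. Based only on the description below, "
    "write a question-answer pair about the image.\n\n" + context
)
```

Because the teacher never sees pixels, the quality of this textual rendering bounds the quality of the generated data, which is why both captions and box coordinates are included.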

Visual instruction tuning


The main goal is to effectively leverage the capabilities of both pre-trained LLMs and visual models. The network architecture is shown in Figure 1.

Vicuna, an instruction-tuned variant of LLaMA, is employed as the LLM fφ(⋅), parameterized by φ, because of its proven effectiveness in instruction-tuning work among open-source, language-only models. CLIP's ViT-L/14 serves as the visual encoder, providing visual features Zv = g(Xv), and a trainable projection matrix W converts these image features into language embedding tokens Hv = W·Zv. This ensures that the image tokens and the language model's word embeddings have the same dimension.

This projection scheme for deriving the visual tokens Hv from images is lightweight and efficient, allowing quick iteration of data-centric experiments. More sophisticated schemes to connect image and language features exist, such as Flamingo's gated cross-attention and BLIP-2's Q-Former, and models such as SAM provide object-level features. Exploring more effective and elaborate architecture designs remains future work.
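A minimal sketch of the projection step described above, using NumPy with illustrative dimensions (the actual feature and embedding sizes depend on the chosen ViT and LLM checkpoints):

```python
import numpy as np

# Illustrative dimensions, not the paper's exact sizes.
d_vision, d_embed, n_patches = 1024, 4096, 256

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(d_embed, d_vision))  # trainable projection

Z_v = rng.normal(size=(n_patches, d_vision))  # grid features Zv = g(Xv)
H_v = Z_v @ W.T  # visual tokens Hv = W·Zv, now in the LLM embedding space
```

The resulting H_v rows are simply prepended to the word-embedding sequence, which is what makes the design so light compared with cross-attention adapters.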


For each image Xv, multi-turn conversation data (X1q, X1a, …, XTq, XTa) is generated, where T is the total number of turns. Treating all of the assistant's answers as the targets, the instruction at turn t is organized as Xtinstruct. This yields the multimodal instruction-following sequence in the unified format shown in Table 2. Instruction tuning of the LLM is then performed on the prediction tokens using the original autoregressive training objective. Specifically, for a target answer Xa of length L, the model maximizes

p(Xa | Xv, Xinstruct) = ∏i=1..L pθ(xi | Xv, Xinstruct,<i, Xa,<i),

where θ are the trainable parameters, and Xinstruct,<i and Xa,<i are the instruction and answer tokens from all turns before the current prediction token xi. Xv appears explicitly in the conditioning to emphasize that all responses are grounded in the image; the system message Xsystem-message and all earlier <STOP> tokens are skipped for readability. Training of the model follows a two-stage instruction-tuning procedure.
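In code, this objective amounts to a standard next-token cross-entropy in which the loss is computed only over answer tokens. The NumPy sketch below assumes the logits have already been produced by the model; the masking convention (1 for answer tokens, 0 for image/instruction tokens) follows the description above.

```python
import numpy as np

def masked_nll(logits, targets, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1.

    logits:    (L, V) next-token scores from the model
    targets:   (L,)   ground-truth token ids
    loss_mask: (L,)   1 for answer tokens Xa, 0 for image/instruction tokens
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return (token_nll * loss_mask).sum() / loss_mask.sum()
```

Masking the instruction tokens means the model is never penalized for "predicting" the user's side of the conversation, only its own answers.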

Training consists of two stages. In the first stage, 595K image-text pairs are filtered from CC3M and converted into instruction-following data using the simple expansion method, with each pair treated as a single-turn conversation: a randomly sampled question serves as the instruction for the image, and the original caption is trained as the expected answer. At this stage, the weights of the visual encoder and the LLM are frozen, and the likelihood is maximized with the projection matrix W as the only trainable parameter.

In the second stage, the visual encoder weights remain frozen, and both the projection layer and the LLM weights of LLaVA are updated; that is, the trainable parameters are the projection matrix W and φ. For the chatbot, training uses the collected language-image instruction-following data, sampling multi-turn and single-turn responses uniformly. For the ScienceQA benchmark, each question is provided with context in the form of natural language or an image, and the assistant produces the reasoning process in natural language before selecting an answer from multiple choices.
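The two-stage freezing schedule can be summarized in a small helper. The component names ("vision_encoder", "projection", "llm") are a hypothetical layout for illustration, not identifiers from the released code:

```python
def trainable_params(stage, params):
    """Select which parameter groups receive gradients in each stage.

    params maps a component name to its parameters (hypothetical layout).
    The vision encoder stays frozen in both stages.
    """
    if stage == 1:       # stage 1: pre-training for feature alignment
        keep = {"projection"}            # train W only
    elif stage == 2:     # stage 2: end-to-end fine-tuning
        keep = {"projection", "llm"}     # train W and the LLM weights
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {name: p for name, p in params.items() if name in keep}
```

Keeping the LLM frozen in stage 1 makes the cheap CC3M alignment step safe: the projection learns to emit tokens the language model already understands before any language weights move.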


Multimodal chatbot

The researchers built a chatbot demo demonstrating LLaVA's image understanding and conversational capabilities. Although LLaVA was trained on only about 80K unique images, it showed reasoning results similar to those of multimodal GPT-4, suggesting that it can understand a scene and respond appropriately while following instructions. Other models (BLIP-2 and OpenFlamingo) focus on describing the image and respond to instructions only in a limited way. For a quantitative evaluation, LLaVA's question-answering ability is compared with that of text-only GPT-4 on images selected from the COCO validation set, with GPT-4 acting as the evaluator. Specific results are presented in Table 3.

Instruction tuning improved the model's ability to follow user instructions by more than 50 points. Adding detailed-description and complex-reasoning questions improved overall performance by 7 points, and performance on conversational questions also improved, suggesting that reasoning ability complements conversational ability. The best relative score, 85.1%, was achieved by combining all three data types. This evaluation protocol provides a benchmark for comprehensively assessing and understanding the capabilities of large multimodal models.
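The relative score reported here can be computed straightforwardly: the judge rates each answer from the candidate model and from the text-only GPT-4 reference, and the candidate's total is reported as a percentage of the reference's total. A minimal sketch, with made-up ratings:

```python
def relative_score(candidate_scores, reference_scores):
    """Candidate total as a percentage of the reference (text-only GPT-4)
    total, following the GPT-4-judged evaluation protocol."""
    return 100.0 * sum(candidate_scores) / sum(reference_scores)

# Made-up per-question ratings on a 1-10 scale.
score = relative_score([8, 9], [10, 10])  # -> 85.0
```

Reporting a ratio rather than a raw score normalizes away the judge's overall strictness, which varies from question to question.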

On the ScienceQA dataset, LLaVA with the new adapter achieved a high accuracy of 90.92%, while GPT-4 reached 82.69%. GPT-4 tends to fail when images or plots are missing, but a complementary scheme that falls back to LLaVA's prediction whenever GPT-4 fails maintained an accuracy of 90.97%. A further scheme, in which GPT-4 is prompted again to produce a final answer when the two models disagree, achieved a new best accuracy of 92.53%. The study suggests new possibilities for model ensembles built on LLMs, and comparing performance under different conditions provides a better understanding of appropriate model configurations for scientific QA tasks.


This paper demonstrates the effectiveness of visual instruction tuning using the language-only GPT-4 model. A new data-generation pipeline was implemented to produce data that follows language and image instructions, and the multimodal model LLaVA was trained on it. With fine-tuning, new state-of-the-art accuracy was achieved on ScienceQA, along with a superior visual chat experience on multimodal chat data. Future prospects include pre-training on larger data scales and connecting to other vision models, which is expected to enable new capabilities and improve performance.

In my opinion, it is clear that this research is contributing to the advancement of multimodal AI. In particular, the combination of GPT-4 and LLaVA has shown promising results in the integration of language and vision, and it will be increasingly interesting to see future studies combining large data sets and models.
