[mPLUG-Owl] Developing An LLM That Can Understand Images And Text
3 main points
✔️ Recent research on large language models (LLMs) has focused on combining multiple sources of information (modalities).
✔️ The training method "mPLUG-Owl" incorporates visual information into LLMs, letting them combine different sources of information and improving their performance.
✔️ mPLUG-Owl uses two stages of training to strengthen the LLM's ability to associate images with text. Experiments show better performance than existing methods, and we expect the approach to have practical applications.
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
written by Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang
(Submitted on 27 Apr 2023)
Comments: Working in Process
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Recent research has focused on combining multiple sources of information (modalities) using LLMs.
Researchers have attempted two approaches to developing LLMs that incorporate visual information: one approach uses textual descriptions of visual information and the other uses an integrated model. However, these approaches have challenges that make it difficult to handle certain multimodal tasks.
Therefore, this paper proposes a new training method, mPLUG-Owl. It is designed to incorporate visual information into LLMs and consists of three parts: the LLM itself, a visual knowledge module, and a visual abstractor module. This modular design allows different sources of information to be combined and improves the performance of the LLM on a range of tasks.
Specifically, a two-stage training method is used to associate images and text. In the first stage, the visual knowledge module and the visual abstractor module are trained with the LLM frozen, so that image and text representations are aligned. In the second stage, a low-rank adaptation (LoRA) module on the LLM and the abstractor module are jointly fine-tuned on language-only and multimodal instruction data, with the visual knowledge module frozen.
Experimental results show that mPLUG-Owl outperforms existing methods. The model also shows an ability to relate multiple images to one another and to understand text that appears in images, which can be applied to real-world problems.
First, LLMs are gaining attention in the field of natural language processing. Models such as BERT, GPT, and T5 have emerged, and large models such as GPT-3 show particularly good performance. This has led to the creation of many new LLMs, contributing to the advancement of natural language processing.
Next, research is presented on multimodal large-scale language models. These models are expected to be able to process not only language but also other sources of information such as visual and audio. Approaches that have been tried include using textual descriptions of visual information and using pre-trained large-scale language models to build a unified model.
Finally, a new model called mPLUG-Owl is introduced. It is characterized by its ability to coordinate representations between vision and language models and to understand language and multimodal instructions. It is expected to show superior performance on a number of novel tasks.
mPLUG-Owl is a multimodal model that combines vision and language, integrating images, text, and other information to understand meaning and generate responses.
Specifically, mPLUG-Owl consists of a visual foundation model, a language foundation model, and a visual abstractor module. The abstractor summarizes visual information into a small number of tokens, which are combined with the text tokens as input to the language model.
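To make the "summarized into tokens" step concrete, here is a minimal sketch in plain Python. It assumes the abstractor behaves like cross-attention from a small set of query vectors over the image patch features, and that the resulting visual tokens are placed in front of the text embeddings. The function names, the dimensions, the random values, and the omission of learned projection matrices are all simplifications for illustration, not the paper's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def abstract_visual_tokens(patch_features, queries):
    """Summarize many patch features into len(queries) visual tokens.

    Each query attends over all patches (single-head cross-attention
    without learned projections, a deliberate simplification).
    """
    dim = len(queries[0])
    tokens = []
    for q in queries:
        weights = softmax([dot(q, p) / math.sqrt(dim) for p in patch_features])
        token = [
            sum(w * p[d] for w, p in zip(weights, patch_features))
            for d in range(dim)
        ]
        tokens.append(token)
    return tokens


def build_llm_input(visual_tokens, text_embeddings):
    """Assumed layout: visual tokens first, then text embeddings."""
    return visual_tokens + text_embeddings


random.seed(0)
DIM = 8  # toy size, far smaller than a real model
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(256)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]
text = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

visual_tokens = abstract_visual_tokens(patches, queries)
llm_input = build_llm_input(visual_tokens, text)
print(len(patches), "patches ->", len(visual_tokens), "visual tokens")
print("LLM input length:", len(llm_input))
```

The point of the sketch is the shape change: however many patches the vision model produces, the language model only sees as many visual tokens as there are queries, which keeps the input sequence short.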
Figure 1: Comparison between the different training paradigms. All of these methods are trained in two stages. Stage 1 represents pre-training, and stage 2 represents instruction tuning.
mPLUG-Owl is trained with a language modeling task: the model learns to predict each subsequent token, and training minimizes the resulting loss.
In addition, mPLUG-Owl training includes a joint instruction tuning stage. This stage uses both text-only and multimodal instruction data to refine the model and improve performance on a variety of tasks.
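The language modeling objective can be illustrated with a small sketch. It assumes the usual formulation of next-token prediction, where the loss is the average negative log-likelihood of the correct token at each position. The function name and the toy logits are made up for illustration.

```python
import math


def next_token_loss(logits_per_step, target_ids):
    """Mean negative log-likelihood of each target token given its logits."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)
        # log of the softmax normalizer, computed stably
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)


# Toy example: vocabulary of 3 tokens, two prediction steps.
targets = [0, 1]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # model has no preference
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # model favors the right token

print(round(next_token_loss(uniform, targets), 4))  # 1.0986, i.e. log(3)
print(next_token_loss(confident, targets) < next_token_loss(uniform, targets))  # True
```

A model that spreads probability evenly over the vocabulary gets a loss of log(vocabulary size); training pushes the loss down by putting more probability on the token that actually comes next.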
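The two training stages differ mainly in which parts of the model are updated. The sketch below writes that schedule down as a plain Python dictionary, following the paper's description: the LLM is frozen in stage 1, and in stage 2 the visual model is frozen while a LoRA adapter on the LLM is trained. The dictionary layout and the component names are hypothetical labels chosen for this example, not identifiers from the authors' code.

```python
# Hypothetical labels for the components; only the frozen/trainable split
# is taken from the paper's description.
STAGES = {
    "stage1_pretraining": {
        "data": "image-text pairs",
        "trainable": {"visual_foundation_model", "visual_abstractor"},
        "frozen": {"llm"},
    },
    "stage2_instruction_tuning": {
        "data": "text-only and multimodal instruction data",
        "trainable": {"visual_abstractor", "llm_lora_adapter"},
        "frozen": {"visual_foundation_model", "llm_base_weights"},
    },
}


def is_trainable(stage, component):
    """Return True if the component receives gradient updates in this stage."""
    return component in STAGES[stage]["trainable"]


for name, cfg in STAGES.items():
    print(name)
    print("  trains: ", sorted(cfg["trainable"]))
    print("  freezes:", sorted(cfg["frozen"]))
```

In this layout the visual abstractor is the only component updated in both stages, which fits its role as the bridge between the vision and language models.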
Thus, mPLUG-Owl integrates multimodal information and performs well in language understanding and response generation tasks.
The experiments in the paper examine how to introduce multimodality into a large language model. For the model setup, ViT-L/14 was chosen as the visual foundation model, with a hidden dimension of 1024 and 24 layers, and it was initialized from the CLIP ViT-L/14 model. Datasets such as LAION-400M, COYO-700M, Conceptual Captions, and MSCOCO were used for training, which ran for 50k steps, approximately 104 billion tokens in total. In addition, ablation studies examined the effect of the two-stage training scheme and of the data modality used for instruction tuning.
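As a quick sanity check on the training budget, the two figures quoted above imply a per-step token count. The calculation below uses only the 50k steps and the approximate 104 billion tokens from the text; the function name is made up for this example, and the result is only as precise as the rounded total.

```python
def tokens_per_step(total_tokens, steps):
    """Average number of tokens processed in one training step."""
    return total_tokens / steps


total_tokens = 104_000_000_000  # approximately 104 billion tokens
steps = 50_000

print(f"{tokens_per_step(total_tokens, steps):,.0f} tokens per step")
# 2,080,000 tokens per step
```

So each step processes roughly two million tokens on average.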
This figure shows the results of comparing the response quality of mPLUG-Owl with other models using a visually related evaluation set called OwlEval. In the figure, the ranking order of response quality is A > B > C > D, showing the performance of each model. The figure includes 82 responses generated by each model, which were manually scored.
The quantitative analysis used OwlEval, a visually related evaluation set, to assess how well the different models answer a variety of questions. The results show that mPLUG-Owl produced better responses than the other models, particularly thanks to its stronger ability to understand both instructions and images.
In the qualitative analysis, specific cases such as knowledge-intensive QA and multi-turn conversations were presented, and mPLUG-Owl was observed to perform better than the other models. In cases involving jokes, mPLUG-Owl showed some ability to understand humor, but errors were observed that are attributed to limitations of the training data.
These results suggest that mPLUG-Owl performs well on multimodal tasks, but they also indicate that there is room for improvement in some areas.
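A manual evaluation of this kind boils down to counting how many responses from each model received each grade. The sketch below shows one way such a tally could be computed, assuming each response gets a single letter grade from A (best) to D (worst). The grades in the example are placeholder values invented for illustration and are not results from the paper.

```python
from collections import Counter

GRADES = "ABCD"  # A is the best grade, D the worst


def tally(grades):
    """Count how many responses received each grade, in A-to-D order."""
    counts = Counter(grades)
    return {g: counts.get(g, 0) for g in GRADES}


def share_at_or_above(grades, threshold="B"):
    """Fraction of responses graded at the threshold or better."""
    cutoff = GRADES.index(threshold)
    good = sum(1 for g in grades if GRADES.index(g) <= cutoff)
    return good / len(grades)


# Placeholder grades for one hypothetical model (not data from the paper).
example = list("AABACBDA")

print(tally(example))              # {'A': 4, 'B': 2, 'C': 1, 'D': 1}
print(share_at_or_above(example))  # 0.75
```

Comparing these per-grade counts across models gives the kind of side-by-side view described for the OwlEval figure.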
This section describes the emerging abilities of mPLUG-Owl and their limitations.
Figure 10 shows visual correlation capabilities across multiple images, with some success in identifying the same person and distinguishing differences in color, but still limited in its ability to relate multiple images.
Figure 11 shows multilingual comprehension skills in Chinese, French, and Japanese, with promising results, but still lacking complete multilingual training.
Figures 16 through 18 show OCR (optical character recognition) capabilities for simple to complex scenes, although there is still room for improvement in the recognition of numbers.
In Figure 12, document understanding and its applications, such as film review and code generation, are explored, but some applications have not yet yielded satisfactory results.
Figures 13 and 14 also show situations where mPLUG-Owl is used to create poetry, lyrics, advertisements, and other artwork, but more research is needed to make it more practical.
In conclusion, the paper proposes mPLUG-Owl, a new training method that improves the multimodal capabilities of large language models (LLMs). mPLUG-Owl takes a modular approach, pairing the underlying LLM with a visual knowledge module and a visual abstractor module to strengthen the alignment between images and text. The approach has demonstrated strong performance in a variety of applications, suggesting its potential for multimodal generation.
In my opinion, this new training method is an important step in the evolution of artificial intelligence, and the combination of visual and verbal information will allow for more diverse and creative generation.