[Versatile Diffusion] Diffusion Model That Integrates Text And Images

Diffusion Model 21/12/2023

3 main points
✔️ Multimodal diffusion model that integrates text and images
✔️ Capturing text and image context information using CLIP
✔️ Sharing information across the model with the Global Layer

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
written by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi
(Submitted on 15 Nov 2022 (v1), last revised 23 Mar 2023 (this version, v3))
Comments: Github link: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

This study proposes Versatile Diffusion (VD), a multimodal diffusion model for images and text. The model can perform the following tasks:

Text-to-Image
Image-to-Text
Image-to-Image
Text-to-Text

In short, any-to-any generation between images and text is possible. It is also possible to edit an image by entering a prompt such as "Make this picture an oil painting."

Versatile Diffusion can be tried out easily on the Hugging Face demo page below. The demo covers all the tasks described in this article, so interested readers are encouraged to try it.

https://huggingface.co/spaces/shi-labs/Versatile-Diffusion

Let's take a look inside the model. Note that Versatile Diffusion will be referred to as "VD" in the following.

Versatile model architecture

The core technology of VD is a "multi-flow multi-modal diffusion model" that can generate various forms of data conditional on the context of images and text.

Here, a "single flow" means using context from a single modality m to generate data of a single modality n. Recent popular Text-to-Image models such as Stable Diffusion and Imagen are single-flow models under this definition.

In the case of VD, it can be called multi-flow because it can perform various generation tasks in addition to Text-to-Image.
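The distinction can be made concrete by treating each flow as a (context modality, output modality) pair. The following is a minimal sketch for illustration only; the names `MODALITIES`, `flow_name`, `single_flow`, and `multi_flow` are hypothetical and do not come from the paper or its code.

```python
from itertools import product

# The two modalities discussed in this article.
MODALITIES = ("text", "image")


def flow_name(context: str, output: str) -> str:
    """Name a single flow: context modality m -> output modality n."""
    return f"{context.capitalize()}-to-{output.capitalize()}"


# A single-flow model (e.g. a Text-to-Image model) covers exactly one pair.
single_flow = [("text", "image")]

# A multi-flow model covers several (context, output) pairs in one network.
multi_flow = list(product(MODALITIES, repeat=2))

for context, output in multi_flow:
    print(flow_name(context, output))
```

With two modalities there are four possible pairs, which matches the four tasks listed in the introduction.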

As shown in the diagram of one step of VD's reverse diffusion process below, the VD model consists of three types of layers: the Global Layer, the Data Layer, and the Context Layer.

The behavior of each layer and the "corresponding layer in the Stable Diffusion model" are summarized in the table below.

Layer	Behavior	Corresponding layer in Stable Diffusion
Global Layer	Always active, independent of flow; shares parameters between different flows; integrates time information	Time Embedding Layer
Data Layer	Activated when the network generates the corresponding output modality	Residual Block (conditioned on time)
Context Layer	Activated when the corresponding context modality is input	Cross-Attention layer (conditioned on text)

Taking Text-to-Image as an example, x_t is passed through the Data Layer for images and the Context Layer for text, yielding x_{t-1} at the next step. Similarly, for Image-to-Image, x_t is passed through the Data Layer for images and the Context Layer for images.
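This routing rule can be sketched in a few lines. It is only an illustration of which layer types are active for a given flow, not the authors' implementation; the `Flow` class, the `active_layers` function, and the layer name strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    context: str  # modality of the conditioning input: "text" or "image"
    output: str   # modality being generated: "text" or "image"


def active_layers(flow: Flow) -> list[str]:
    """Return the (hypothetical) layer names used in one reverse step."""
    return [
        "global",                    # always active, shared by every flow
        f"data[{flow.output}]",      # selected by the output modality
        f"context[{flow.context}]",  # selected by the context modality
    ]


print(active_layers(Flow(context="text", output="image")))   # Text-to-Image
print(active_layers(Flow(context="image", output="image")))  # Image-to-Image
```

The point of the sketch is that the two flows above share the same image Data Layer and differ only in the Context Layer that is switched in.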

Thus, the entire VD network is organized as shown in the figure below.

As shown in the lower right corner of this figure, four types of VD generation flows exist.

Text-to-Image
Image-to-Text
Image-Variation
Text-Variation

The repeated stacking of Data Layers and Context Layers shows that the structure follows the U-Net used in conventional diffusion models.

What makes VD different from previous models is that it uses not only CLIP's text encoder but also CLIP's image encoder. This is because conditioning on images, not just on text, is assumed.

Furthermore, because the Global Layer shares "time information" and "network-wide parameters" across all flows, multimodal generation can be realized in a single model.
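The article does not specify how the time information is encoded. A common choice in diffusion U-Nets is a sinusoidal timestep embedding, and the sketch below assumes that form; the function name, the default `dim`, and the ordering of the sine and cosine halves are assumptions and may differ from the VD code.

```python
import math


def timestep_embedding(t: int, dim: int = 8, max_period: float = 10000.0) -> list[float]:
    """Sinusoidal embedding of a diffusion timestep t (assumed form)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds sines, second half holds cosines.
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb = timestep_embedding(t=10)
print(len(emb))
```

Because the embedding depends only on the timestep and not on the modality, it can be computed once per step and reused by every flow, which is consistent with the Global Layer being always active.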

Source: https://github.com/shi-labs/versatile-diffusion

In VD, images are converted to latent representations by a VAE, and text is converted to latent representations by Optimus, which uses a BERT encoder.

Incidentally, at inference time, pure noise in the latent space of the output modality (image or text) is input to the VD network. The image and text prompts are passed through the corresponding CLIP encoders and used to condition generation.
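The control flow of this inference procedure can be sketched as a loop that starts from noise and repeatedly applies a conditioned step. This is a toy illustration only: `sample` and `toy_step` are hypothetical names, the step function is a placeholder rather than a real denoising network or noise schedule, and the `context` list merely stands in for a CLIP embedding.

```python
import random


def sample(denoise_step, context, num_steps=5, latent_dim=4, seed=0):
    """Run a toy reverse process from x_T (noise) down to x_0."""
    rng = random.Random(seed)
    # x_T: pure Gaussian noise in the latent space of the output modality.
    x = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # Step from t = T down to t = 1; each call maps x_t to x_{t-1}.
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t, context)
    return x


def toy_step(x, t, context):
    # Placeholder for the network: move each latent value halfway
    # toward the context vector. A real model predicts noise instead.
    return [0.5 * (xi + ci) for xi, ci in zip(x, context)]


context = [1.0, 1.0, 1.0, 1.0]  # stands in for an encoded prompt
x0 = sample(toy_step, context)
print(x0)
```

In the actual model, the resulting latent would then be decoded back to an image or text by the corresponding VAE decoder.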

Evaluation experiment

A table comparing the data generated by VD and Stable Diffusion is shown below.

The qualitative results show that VD has strong generation capability. The quantitative evaluation results are as follows.

The FID scores indicate that VD performs better than the other baselines in the Text-to-Image and Image-Variation tasks.
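For reference, FID (Fréchet Inception Distance) compares Gaussian fits to Inception-network features of real and generated images, and a lower score is better. The standard definition is given below, where the subscripts r and g (notation chosen here, not taken from the article) denote the real and generated feature statistics.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here \mu and \Sigma are the mean vector and covariance matrix of the features for each set of images.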

The authors also ran a user study in which subjects voted on which model's generated images they judged to be of better quality. The results are shown below.

Blue represents votes for "the images generated by Stable Diffusion are of better quality," cyan represents the same vote for VD, and gray represents votes for "about the same quality."

The results show that in Text-to-Image, "about the same quality" votes stand out, whereas in Image-Variation, VD is rated higher.

Summary

VD can handle multiple tasks such as Text-to-Image, Image-to-Text, and image variation generation. Other suggested applications include disentangling semantics and style in images and text, and dual- and multi-context blending.

In addition, the possibility of covering more modalities in the future, such as 3D generation, voice, and music, is discussed.

This may be accomplished by preparing encoders for other modalities and building models as in this study. Of course, detailed design and coding appropriate for each modality would be required.