[Versatile Diffusion] Diffusion Model That Integrates Text And Images

3 main points
✔️ Multimodal diffusion model that integrates text and images
✔️ Capturing text and image context information using CLIP
✔️ Sharing information across the model with the Global Layer

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
written by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi
(Submitted on 15 Nov 2022 (v1), last revised 23 Mar 2023 (this version, v3))
Comments: Github link: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This study proposes Versatile Diffusion (VD), a multimodal diffusion model for images and text. With this model, the following tasks can be accomplished:

  • Text-to-Image
  • Image-to-Text
  • Image-to-Image
  • Text-to-Text

In short, any-to-any generation between images and text is possible. It is also possible to edit an image by entering a prompt such as "Make this picture an oil painting."

VD can be tried out easily on its Hugging Face demo page. The demo covers all of the tasks described in the paper, so those interested are encouraged to try it.

Let's take a look inside the model. Note that Versatile Diffusion will be referred to as "VD" in the following.

Versatile model architecture

The core technology of VD is a "multi-flow multi-modal diffusion model" that can generate various forms of data conditional on the context of images and text.

Here, "single flow" refers to using the context of a single modality m to generate data of a single modality n. Recent, widely discussed Text-to-Image models such as Stable Diffusion and Imagen are single-flow models under this definition.

In the case of VD, it can be called multi-flow because it can perform various generation tasks in addition to Text-to-Image.
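The single-flow and multi-flow definitions can be illustrated with a minimal Python sketch. The names `MODALITIES`, `single_flow`, and `ALL_FLOWS` are hypothetical and exist only for this illustration; they are not part of the VD codebase.

```python
from itertools import product

# Modalities handled by VD according to this article.
MODALITIES = ("text", "image")


def single_flow(context, output):
    """Name one flow: generate `output` data conditioned on `context` data."""
    return f"{context}-to-{output}"


# A multi-flow model supports every (context, output) pair.
ALL_FLOWS = [single_flow(c, o) for c, o in product(MODALITIES, repeat=2)]

# A Text-to-Image-only model covers just one of these pairs.
TEXT_TO_IMAGE_ONLY = [single_flow("text", "image")]

print(ALL_FLOWS)
# ['text-to-text', 'text-to-image', 'image-to-text', 'image-to-image']
```

With two modalities there are four possible flows, which matches the four tasks listed at the start of this article.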

As shown in the "Diagram of the Reverse Diffusion Process for One Step of VD" below, the VD model consists of three types of layers: the Global Layer, the Data Layer, and the Context Layer.

The behavior of each layer and the "corresponding layer in the Stable Diffusion model" are summarized in the table below.

Layer | Behavior | Corresponding layer in Stable Diffusion
Global Layer | Always active, independent of the flow; shares parameters between different flows; integrates time information | Time embedding layer
Data Layer | Activated when the network generates the "corresponding output modality" | Residual Block (conditioning by time)
Context Layer | Activated when the "corresponding context modality" is entered | Cross Attention layer (conditioning by text)

Taking Text-to-Image as an example, x_t is passed through the Data Layer for images and the Context Layer for text, producing x_{t-1} for the next step. Similarly, for Image-to-Image, x_t is passed through the Data Layer for images and the Context Layer for images.
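The routing described above can be sketched in plain Python. This is only a toy under stated assumptions: `ToyVDBlock` and its methods are hypothetical names, each layer is replaced by simple arithmetic on a single number, and the order in which the layers are called is an assumption made for illustration. The real layers are neural network modules.

```python
MODALITIES = ("text", "image")


class ToyVDBlock:
    """Toy stand-in for one VD block; records which layers were activated."""

    def __init__(self):
        self.trace = []  # names of the layers used in the last step

    def global_layer(self, x, t):
        # Always active: placeholder for injecting time information.
        self.trace.append("global")
        return x + 0.01 * t

    def data_layer(self, x, output_modality):
        # Selected by the modality being generated (placeholder arithmetic).
        self.trace.append(f"data:{output_modality}")
        return 0.9 * x

    def context_layer(self, x, context, context_modality):
        # Selected by the modality of the conditioning input.
        self.trace.append(f"context:{context_modality}")
        return x + 0.1 * context

    def step(self, x_t, t, context, output_modality, context_modality):
        if output_modality not in MODALITIES or context_modality not in MODALITIES:
            raise ValueError("unknown modality")
        self.trace = []
        h = self.global_layer(x_t, t)
        h = self.data_layer(h, output_modality)
        return self.context_layer(h, context, context_modality)


block = ToyVDBlock()
x_prev = block.step(1.0, t=10, context=0.5,
                    output_modality="image", context_modality="text")
print(block.trace)  # ['global', 'data:image', 'context:text']
```

Switching to Image-to-Image only changes `context_modality` to `"image"`; the global layer and the image data layer are reused, which is the parameter sharing that the table describes.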

Thus, the entire VD network is organized as shown in the figure below.

As shown in the lower right corner of this figure, four types of VD generation flows exist.

  • Text-to-Image
  • Image-to-Text
  • Image-Variation
  • Text-Variation

The repeated alternation of Data Layers and Context Layers shows that the structure follows the U-Net used in conventional diffusion models.

What makes VD different from previous models is that it uses not only the CLIP text encoder but also the CLIP image encoder. This is because VD assumes image conditioning as well as text conditioning.

Furthermore, the Global Layer allows multimodal generation to be realized in a single model, because "time information" and "network-wide parameters" are shared across all flows.


In VD, images are converted to latent representations by a VAE, and text is converted to latent representations by Optimus, a text VAE with a BERT encoder.

At inference time, image and text data consisting of pure noise are input to the VD network. The image and text prompts are passed through the corresponding CLIP encoders and used to condition generation.
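The control flow of inference can be sketched as a simple loop. This is not a real diffusion sampler: `encode_context` and `toy_denoise_step` are hypothetical stand-ins for the CLIP encoder and for one VD denoising step, a single number plays the role of the latent, and encoding the prompt once before the loop is an assumption of this sketch. Only the overall shape is the point: start from noise, encode the prompt, then iterate from t = T down to 1.

```python
import random


def encode_context(prompt):
    """Hypothetical stand-in for a CLIP encoder: maps a prompt to one number."""
    return (sum(ord(ch) for ch in prompt) % 100) / 100.0


def toy_denoise_step(x_t, t, context):
    """Hypothetical stand-in for one VD step: move x_t part-way toward the context."""
    return x_t + (context - x_t) / t


def sample(prompt, num_steps=50, seed=0):
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)             # start from pure noise in latent space
    context = encode_context(prompt)    # encode the prompt (once, by assumption)
    for t in range(num_steps, 0, -1):   # t = T, ..., 1
        x = toy_denoise_step(x, t, context)
    return x  # a real pipeline would decode this latent with the VAE or Optimus


latent = sample("an oil painting of a cat")
```

In this toy, the final step (t = 1) lands on the context value, so the result depends on the prompt rather than on the initial noise. A real sampler keeps some dependence on the noise, which is what produces varied outputs.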

Evaluation experiment

A table comparing the data generated by VD and Stable Diffusion is shown below.

The qualitative results show that VD has high generative capability. In addition, the quantitative evaluation results are as follows.

The FID scores indicate that VD performs better than the other baselines in the Text-to-Image and Image-Variation tasks.
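For readers unfamiliar with FID: it is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images, and lower is better. The full metric uses mean vectors and covariance matrices of Inception features. The sketch below is a deliberately simplified one-dimensional version, where the distance reduces to (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2. The function name and the sample numbers are made up for illustration and are not taken from the paper.

```python
from statistics import fmean, pstdev


def frechet_distance_1d(real, generated):
    """Frechet distance between 1-D Gaussians fitted to two samples.

    Uses the population standard deviation for simplicity.
    """
    mu_r, mu_g = fmean(real), fmean(generated)
    sd_r, sd_g = pstdev(real), pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


real_features = [0.0, 1.0, 2.0]       # made-up feature values
close_features = [0.1, 1.1, 2.1]      # slightly shifted
far_features = [5.0, 9.0, 13.0]       # shifted and more spread out

print(frechet_distance_1d(real_features, close_features))
print(frechet_distance_1d(real_features, far_features))
```

The second distance is much larger than the first, which is the sense in which a lower FID indicates generated data that is statistically closer to the real data.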

The authors also ran a user study in which subjects voted on which model's generated images they thought were of the best quality. The results are shown below.

Blue represents the number of votes for "Stable Diffusion generated the better-quality image," cyan the number of votes for VD, and gray the number of votes for "about the same quality."

The results show that for Text-to-Image, votes for "about the same quality" stand out, whereas for Image-Variation, VD is rated higher.


VD can handle multiple tasks such as Text-to-Image, Image-to-Text, and image variation generation. Other suggested applications include disentangling semantics and style for images and text, and dual- and multi-context blending.

In addition, the possibility of covering more modalities in the future, such as 3D generation, voice, and music, is discussed.

This may be achievable by preparing encoders for the other modalities and building the model in the same way as in this study. Of course, detailed design and implementation suited to each modality would be required.
