[UniD3] Multimodal Discrete Diffusion Model Integrating Image And Text

Diffusion Model

3 main points
✔️ Multimodal Diffusion Model for Any-to-Any Generation
✔️ Treating Images and Text as One Integrated Sequence of Discrete Tokens
✔️ Introducing a Transformer with Mutual Attention for Denoising

Unified Discrete Diffusion for Simultaneous Vision-Language Generation
written by Minghui Hu, Chuanxia Zheng, Heliang Zheng, Tat-Jen Cham, Chaoyue Wang, Zuopeng Yang, Dacheng Tao, Ponnuthurai N. Suganthan
(Submitted on 27 Nov 2022)
Comments: Published on arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Cross-modal models for tasks such as text-to-image, sketch-to-image, and image-to-video generation have developed rapidly. However, each of them is limited to a fixed pair of modalities, such as "text to image" only.

To address this, the study proposes UniD3, a vision-language model that handles different modalities in an integrated manner and uses this for multimodal generation. The method achieves "Any-to-Any" generation, in which any modality can be output for any input modality.

As shown in the figure above, UniD3 enables not only "text to image" but also "image to text" and "unconditional image-text generation".


Let's quickly look at how image-text Any-to-Any generation is achieved. The overall pipeline of UniD3 is as follows.

Specifically, the image and the text are first compressed into discrete token sequences using their respective encoders, "dVAE" and "BPE".

Next, the two token sequences are concatenated and a "Fused Embedding" in a shared embedding space is computed. In this way, image tokens and text tokens can be handled together as a single sequence.
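The fusion step can be pictured with a small NumPy sketch. This is only an illustration under assumptions that are not from the paper: the vocabulary sizes, the embedding dimension, the helper names `fuse_tokens` and `split_tokens`, and the choice of shifting text ids so both modalities share one embedding table are all hypothetical.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
IMAGE_VOCAB = 8   # number of image codebook entries
TEXT_VOCAB = 6    # number of text (BPE) tokens
EMBED_DIM = 4


def fuse_tokens(image_tokens, text_tokens, image_vocab=IMAGE_VOCAB):
    """Concatenate image and text ids into one sequence over a shared vocabulary.

    Text ids are shifted by image_vocab so the two id ranges do not collide.
    """
    image_tokens = np.asarray(image_tokens)
    text_tokens = np.asarray(text_tokens) + image_vocab
    return np.concatenate([image_tokens, text_tokens])


def split_tokens(fused, num_image_tokens, image_vocab=IMAGE_VOCAB):
    """Undo fuse_tokens: recover the separate image and text id sequences."""
    image_tokens = fused[:num_image_tokens]
    text_tokens = fused[num_image_tokens:] - image_vocab
    return image_tokens, text_tokens


# One embedding table covers both modalities (random values stand in for learned ones).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(IMAGE_VOCAB + TEXT_VOCAB, EMBED_DIM))

fused = fuse_tokens([3, 0, 7], [1, 5])
fused_embedding = embedding_table[fused]          # shape (5, EMBED_DIM)
image_ids, text_ids = split_tokens(fused, num_image_tokens=3)
```

Because both modalities index the same table, the denoising network sees a single sequence, and the split at the end recovers per-modality tokens.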

Then, for this Fused Embedding, a Markov transition matrix is used to add noise in the diffusion process, and denoising is performed by the "Unified Transformer with Mutual Attention" in the reverse diffusion process.
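As a rough picture of what a Markov transition matrix does in discrete diffusion, the sketch below builds a mask-and-replace style matrix and applies one noising step. It is a simplified illustration, not the paper's exact formulation: the parameter names `alpha` and `gamma`, the constant (unscheduled) values, and the assumption that a token is only ever replaced by a token of its own modality or by a shared [MASK] are all assumptions made here.

```python
import numpy as np


def transition_matrix(image_vocab, text_vocab, alpha, gamma):
    """Row-stochastic matrix Q with Q[i, j] = P(x_t = j | x_{t-1} = i).

    Id layout: image ids, then text ids, then a single [MASK] id (last index).
    alpha: extra probability of keeping the token unchanged.
    gamma: probability of turning the token into [MASK].
    The remaining mass is spread uniformly over tokens of the same modality.
    """
    size = image_vocab + text_vocab + 1
    mask_id = size - 1
    Q = np.zeros((size, size))
    replace_mass = 1.0 - alpha - gamma
    blocks = [(0, image_vocab), (image_vocab, image_vocab + text_vocab)]
    for start, stop in blocks:
        n = stop - start
        Q[start:stop, start:stop] = replace_mass / n   # uniform replacement
        idx = np.arange(start, stop)
        Q[idx, idx] += alpha                           # keep the token
        Q[start:stop, mask_id] = gamma                 # jump to [MASK]
    Q[mask_id, mask_id] = 1.0                          # [MASK] is absorbing
    return Q


def forward_step(tokens, Q, rng):
    """Sample x_t from x_{t-1} independently for each position."""
    return np.array([rng.choice(Q.shape[0], p=Q[t]) for t in tokens])


rng = np.random.default_rng(0)
Q = transition_matrix(image_vocab=8, text_vocab=6, alpha=0.8, gamma=0.1)
noisy = forward_step(np.array([3, 0, 7, 9, 13]), Q, rng)
```

Repeating `forward_step` many times drives every position toward [MASK], which is the fully noised state that the reverse process starts from.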

The reconstructed Fused Embedding can then be split back into an image embedding and a text embedding to obtain separate tokens for both modalities.

Introduction of Mutual Attention

This study introduces a new attention mechanism called Mutual Attention in the Unified Transformer used for denoising.

The Unified Transformer consists of several Transformer blocks, each containing a Self Attention layer, two Cross Attention layers, and a Feed Forward layer.

Here, the usual Self Attention mechanism is effective at capturing relationships between elements within one modality, but is not good at capturing relationships between different modalities.

Therefore, this study introduces the Mutual Attention mechanism to capture the relationship between modalities even when image and text tokens are combined.

A diagram of the Mutual Attention Block is shown below.

This block first takes as input a "Fused Hidden State with Noise" that combines the image and text tokens.

Next, Self Attention is applied to capture relationships within the fused sequence. The sequence is then split back into the tokens of each modality and passed through two Cross Attention layers. In this way, relationships between the different modalities are captured.

Both tokens are then combined again and passed through the Feed forward Layer to the next Transformer block. By repeating this process, denoising proceeds and finally a "noiseless Fused Hidden State" is obtained.
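The data flow of one block (self attention, split, two cross attentions, recombine, feed forward) can be sketched in NumPy as follows. This is a minimal single-head sketch under several assumptions: learned query/key/value projections, layer normalization, and timestep conditioning are omitted, the residual connections are an assumption, and the function name `mutual_attention_block` and weight `w_ff` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def mutual_attention_block(fused, num_image_tokens, w_ff):
    # 1. Self attention over the whole fused sequence.
    h = fused + attention(fused, fused, fused)
    # 2. Split the sequence back into the two modalities.
    img, txt = h[:num_image_tokens], h[num_image_tokens:]
    # 3. Two cross attentions: image attends to text, text attends to image.
    img_out = img + attention(img, txt, txt)
    txt_out = txt + attention(txt, img, img)
    # 4. Recombine and apply a (one-layer, ReLU) feed forward step.
    h = np.concatenate([img_out, txt_out], axis=0)
    return h + np.maximum(h @ w_ff, 0.0)


rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))   # 3 image tokens + 2 text tokens, dimension 4
w_ff = rng.normal(size=(4, 4))
out = mutual_attention_block(hidden, num_image_tokens=3, w_ff=w_ff)
```

The output has the same shape as the input, so blocks of this kind can be stacked one after another.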

Incidentally, the noised tokens contain [MASK] tokens, which suggests that estimating the masked tokens is what functions as denoising here.

Experimental results

The following experiments were performed to investigate the performance of UniD3:

  • unconditional generation
  • conditional generation

CUB-200 (a dataset containing images and text descriptions of bird species) and MSCOCO (a dataset containing a variety of images and captions) were used in the experiments.

Result of unconditional generation

The results of generation without conditions are as follows.

Here, the image and text are generated simultaneously. The quality of both the generated image and the generated text is good, and the description text remains consistent with the image.

Conditional Generation Results

The following indicators were used to objectively evaluate conditional generation:

  • FID: Realism and diversity of images
  • IS: Realism and diversity of images
  • BLEU-4: Accuracy of Text Captions
  • METEOR: Accuracy of text captions
  • SPICE Score: Accuracy of text captions
  • CLIP Score: Image and text consistency
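To give a feel for what the caption metrics measure, here is a small sketch of the clipped (modified) n-gram precision that BLEU is built on. It is not a full BLEU-4 implementation (there is no brevity penalty and no geometric mean over n = 1 to 4), and the example captions are made up for illustration rather than taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Made-up captions, for illustration only.
candidate = "a small bird with a red head".split()
reference = "a small bird with a red head and white belly".split()
four_gram_precision = modified_precision(candidate, reference, 4)
```

The clipping matters: a degenerate caption that repeats one common word many times gets little credit, because each word is only counted as often as it appears in the reference.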

Based on these objective indicators, a comparison with other models is shown below.


UniD3 can also be used for inpainting and for captioning images. Examples of the results are shown below.

Example of captioning.

"Ref. Captions" are the captions from the original dataset; "Samples" are the captions generated by UniD3's Image-to-Text.

Example of inpainting.

The ochre areas in the left image and the strike-through lines in the text indicate MASK. Inpainting completes the data in these MASK areas. The four results of inpainting are shown on the right.
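The idea of completing only the masked positions can be sketched as below. This is a toy illustration under assumptions: the per-position probabilities `probs` stand in for the output of a denoising model that is not implemented here, the function name `fill_masked` and the id `MASK_ID` are hypothetical, and the actual method denoises iteratively over many steps rather than filling everything in one pass.

```python
import numpy as np


def fill_masked(tokens, probs, mask_id, rng):
    """Sample a token for every masked position; keep unmasked tokens unchanged.

    probs[i] is a (hypothetical) model distribution over the vocabulary at position i.
    """
    tokens = np.asarray(tokens)
    sampled = np.array([rng.choice(probs.shape[1], p=row) for row in probs])
    return np.where(tokens == mask_id, sampled, tokens)


MASK_ID = 5                                   # hypothetical id of the [MASK] token
tokens = np.array([2, MASK_ID, 4, MASK_ID])   # positions 1 and 3 are masked

# Stand-in for model output: uniform over the 5 real tokens, zero mass on [MASK].
probs = np.full((4, 6), 0.2)
probs[:, MASK_ID] = 0.0

rng = np.random.default_rng(0)
samples = [fill_masked(tokens, probs, MASK_ID, rng) for _ in range(4)]
```

Because the masked positions are sampled rather than chosen deterministically, repeated runs can give several different completions of the same input, while the unmasked content stays fixed.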


This research is a pioneering example of "Any-to-Any" generation using a multimodal diffusion model. By building on this kind of research, input and generation for other modalities, such as voice and music, should also become possible.

We will continue to keep a close eye on the development of the Any-to-Any model.
