
[CoDi] Any-to-any Diffusion Model That Can Handle Almost Any Modality



3 main points
✔️ Data can be generated from any combination of input modalities
✔️ Latent Alignment enables conditioning on a shared cross-modal feature space
✔️ Generalizes to modality combinations unseen in training using only a few paired datasets

Any-to-Any Generation via Composable Diffusion
written by Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal
(Submitted on 19 May 2023)
Comments: Project Page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

code :

The images used in this article are from the paper, the introductory slides, or were created based on them.


This article introduces CoDi (Composable Diffusion), an any-to-any diffusion model that can simultaneously input multiple modalities and generate a variety of data.

As noted above, CoDi can take multiple modalities as input simultaneously and generate a variety of data. Specifically, it supports the following 11 generation tasks.

  • Text→Image
  • Image→Text
  • Text→Audio
  • Audio→Text
  • Image→Audio
  • Audio→Image
  • Text→Video
  • Video→Text
  • Text + Image + Audio → Image
  • Text→Image + Text
  • Text→Video + Audio

In other words, it can handle a variety of inputs and outputs. In particular, the ability to handle video and audio and to accept more than one input condition is a highlight of this research. The results of specific generation tasks can be found on the project page below.

CoDi official project page

Before explaining CoDi in detail, we will first discuss the research outline and background.

Research Overview and Background

Conventional multimodal models generate a single output modality from a single input modality, such as text-to-image or text-to-audio, and struggle to handle multiple modalities simultaneously.

CoDi addresses this problem by allowing the generation of any combination of modalities from any combination of modalities. Specifically, it is built from latent diffusion models (LDMs) that have been individually trained for different modalities and then combined.

In addition, each input modality is projected into a shared feature space, and the output model performs generation based on this combined feature.

This approach allows CoDi to seamlessly generate multiple modalities, e.g., "synchronized video and audio" from text.
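As a toy sketch of this idea, conditioning on a combination of inputs can be pictured as projecting each modality into one shared space and merging the results. All dimensions and projection matrices below are made-up stand-ins for CoDi's trained prompt encoders, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
D_COMMON = 8  # dimensionality of the shared feature space (toy value)

# Hypothetical per-modality projections standing in for the trained
# prompt encoders; in CoDi these are learned neural networks.
proj = {
    "text":  rng.standard_normal((16, D_COMMON)),
    "image": rng.standard_normal((32, D_COMMON)),
    "audio": rng.standard_normal((24, D_COMMON)),
}

def combined_condition(inputs):
    """Project each input modality into the common space and average
    the results into a single conditioning vector (equal weighting)."""
    embeddings = [feat @ proj[name] for name, feat in inputs.items()]
    return np.mean(embeddings, axis=0)

# Conditioning on text + image at once yields one vector in the shared space.
cond = combined_condition({
    "text":  rng.standard_normal(16),
    "image": rng.standard_normal(32),
})
print(cond.shape)
```

Because every modality lands in the same space, the generator only ever sees one kind of conditioning vector, regardless of which inputs were supplied.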

In the figure above, lines of the same color represent the corresponding inputs and outputs.

We will see how such multimodal generation is achieved in the next section.

How CoDi works

First, let's touch on the challenges that exist in any-to-any handling of arbitrary modalities, and then look at CoDi's model structure to address these challenges.

Issues in Any-to-Any

Generating arbitrary outputs from arbitrary combinations of input modalities involves enormous requirements in terms of both computation and data. For example, to cover the 11 generation tasks listed earlier, one would normally need to prepare 11 separate diffusion models, one per task.

In that case, 11 huge deep learning models would have to be trained individually, which would be computationally intensive.

In addition, there is little consistent training data for many modality combinations, and training on all possible input-output combinations is not realistic. For example, there is a large amount of text-image pair data, but few video-audio pair data.

Naturally, learning a model for such a combination of modalities is difficult in view of the lack of data.

Model Structure and Learning Methods

To address the issues mentioned earlier, CoDi allows all modalities to be handled in a single diffusion model in an integrated manner.

CoDi's learning and inference methods are shown in the figure below.

Specifically, latent diffusion models (LDMs) for each of the four modalities are first trained separately. These models can be trained independently and in parallel to ensure the quality of generation of a single modality.

Then, in Stage 1 in the figure above, the system is trained to accept a variety of conditional inputs. In this case, a technique called "Bridging Alignment" is used to project each modality into a common feature space.

Bridging Alignment

To achieve Bridging Alignment, a "text-image" contrastive learning model, CLIP, is trained first.

Then, with the CLIP weights frozen, the audio and video prompt encoders are also trained on the audio-text and video-text paired data sets using contrastive learning.

Text is used as the anchor in most of the contrastive learning steps above because text data is abundant and "text-X" pair data is easy to construct.

This method allows the four modalities to be projected into a common feature space and handled in an integrated manner.
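The alignment objective here is CLIP-style contrastive learning. Below is a minimal NumPy sketch of a symmetric InfoNCE loss of the kind used for such alignment; it is a toy illustration, not CoDi's actual training code:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_style_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss: the matched pair (row i with row i)
    should score higher than every mismatched pair, in both directions."""
    logits = (l2_normalize(emb_a) @ l2_normalize(emb_b).T) / temperature
    n = len(logits)

    def ce(lg):  # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))                 # toy "text" embeddings
audio_aligned = text + 0.05 * rng.standard_normal((4, 8))  # matched pairs
loss = clip_style_loss(text, audio_aligned)
print(f"aligned-pair loss: {loss:.4f}")
```

Minimizing this loss pulls each non-text encoder's output toward the frozen CLIP text embedding of its paired caption, which is what places all modalities in one shared space.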

Multimodal generation with Latent Alignment

The goal of the final Stage 2 is to enable cross-attention between the diffusion flows of the different modalities, i.e., to generate two or more modalities simultaneously. To achieve this, conditioning is generated with "Latent Alignment," following the same design as the Bridging Alignment described earlier.

Latent Alignment is a technique that projects the latent variables of each modality into a common latent space.

The procedure for generating conditioning via Latent Alignment is as follows.

  1. The Cross-Attention layers of the image and text diffusion models and their respective environment encoders V are trained on text-image paired data.
  2. The text diffusion model's weights are frozen, and the audio diffusion model's environment encoder and Cross-Attention layers are trained on text-audio paired data.
  3. The audio diffusion model and its environment encoder are frozen, and multimodal generation of video is learned on audio-video paired data.
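The three-stage schedule above can be sketched schematically as follows; module and dataset names are illustrative placeholders, not identifiers from the paper's code:

```python
# Each stage lists (modules trained, modules frozen, paired data used).
# Names are hypothetical labels for the components described in the text.
STAGES = [
    ({"image_cross_attn", "V_image", "text_cross_attn", "V_text"},
     set(), "text-image pairs"),
    ({"audio_cross_attn", "V_audio"},
     {"text_diffusion", "V_text"}, "text-audio pairs"),
    ({"video_cross_attn", "V_video"},
     {"audio_diffusion", "V_audio"}, "audio-video pairs"),
]

for i, (train, freeze, data) in enumerate(STAGES, start=1):
    print(f"Stage {i}: train {sorted(train)} | freeze {sorted(freeze)} | on {data}")
```

Freezing the previously aligned components at each stage is what lets later modalities inherit the common latent space without new paired data for every combination.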

Note that CoDi is trained on multimodal generation for only three data pairings (text-image, text-audio, and audio-video). Nevertheless, combinations of modalities not seen during training, such as simultaneous "text + image + audio" generation, are also possible.

Objective function for multimodal generation of modalities A and B

To generate two or more modalities A and B simultaneously, a Cross-Attention sublayer is added to the U-Net. The latent variables of modality B are then projected into the common latent space by Latent Alignment, as described earlier, and passed through the Cross-Attention layer of modality A's U-Net.
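A minimal NumPy sketch of this conditioning path, assuming toy dimensions: modality A's latent tokens attend to modality B's projected latents through an added cross-attention sublayer with a residual connection:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared latent dimension (toy value)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z_a, z_b_proj, Wq, Wk, Wv):
    """Modality A's latent tokens (queries) attend to modality B's
    latents (keys/values) after V projected them into the shared space."""
    Q, K, V = z_a @ Wq, z_b_proj @ Wk, z_b_proj @ Wv
    attn = softmax(Q @ K.T / np.sqrt(D))
    return z_a + attn @ V  # residual connection, as in a U-Net sublayer

z_a = rng.standard_normal((4, D))   # A's noisy latent tokens z_t^A
z_b = rng.standard_normal((6, D))   # B's latents, already projected by V
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
out = cross_attention(z_a, z_b, Wq, Wk, Wv)
print(out.shape)
```

Because B's latents arrive already projected into the common space, the same cross-attention sublayer works no matter which modality plays the role of B.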

Then the objective function for generating modality A would be
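A hedged reconstruction of this objective (the figure containing the formula is not reproduced here; notation follows the paper, with $\epsilon_\theta$ the denoising U-Net, $C$ the prompt encoder, and $V_B$ the environment encoder projecting modality B's latent $z^B_t$ into the shared space):

```latex
L^{A}_{\mathrm{Cross}}
  = \mathbb{E}_{z,\,\epsilon,\,t}
    \left\| \epsilon
      - \epsilon_{\theta}\!\left(z^{A}_{t},\; t,\; C(y),\; V_{B}(z^{B}_{t})\right)
    \right\|_{2}^{2}
```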

Above, $t$ represents the diffusion time step and $y$ represents the data used for conditioning.

Thus, the diffusion model of modality A is trained with the information of modality B incorporated. In the simultaneous generation of modalities A and B, the overall objective is $L^{A}_{Cross} + L^{B}_{Cross}$.

The environment encoder V is also learned through contrastive learning.

Evaluation experiment

The data sets used in this study are listed in the table below.

The data sets used include image + text (captioned image), audio + text (captioned audio), audio + video (captioned video), and video + text (captioned video).

Training tasks also include single-modality generation, multimodal generation, and contrastive learning to align prompt encoders.


An example of single-modality generation with CoDi is shown below.

Indeed, you can see that it accepts a variety of condition inputs.

The results of quantitative evaluation on standard metrics are shown below.

For single-modality generation, CoDi achieves state-of-the-art results in audio captioning and audio generation. For image captioning, CoDi shows performance comparable to Transformer-based state-of-the-art models.

The study also introduced a new metric, SIM, to measure consistency and integrity among the generated modalities. This metric can quantify the degree of agreement between modalities by calculating the cosine similarity between the generated modality embeddings.
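A toy sketch of how such a score can be computed. CoDi's actual SIM uses embeddings from pretrained encoders; the vectors below are synthetic stand-ins:

```python
import numpy as np

def sim_score(emb_a, emb_b):
    """SIM-style consistency score: cosine similarity between the
    embeddings of two jointly generated modalities."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

rng = np.random.default_rng(0)
video_emb = rng.standard_normal(512)                       # toy embedding
audio_emb = video_emb + 0.1 * rng.standard_normal(512)     # consistent pair
score = sim_score(video_emb, audio_emb)
print(f"SIM-style score: {score:.3f}")
```

A pair of modalities generated jointly and consistently should score close to 1, while unrelated outputs drift toward 0.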

Evaluated in settings such as audio → image + text, image → audio + text, and text → video + audio, CoDi consistently showed stronger cross-modal consistency than generating each modality independently.

Further examples of multimodal generation are shown below.

It is clear that high quality data generation is possible even with multimodal generation.


CoDi can process and simultaneously generate a variety of modalities, including text, images, video, and audio. The ability to produce high-quality, consistent output from a combination of various input modalities is an important step toward making human-computer interaction more realistic.

Such multimodal models could then be used to study general-purpose artificial intelligence.
