[AudioLDM] Text-to-Audio Generation Model Using Latent Diffusion
3 main points
✔️ Using audio-only data to train LDMs improves computational efficiency
✔️ Using CLAP ensures text-audio consistency without requiring paired data
✔️ Various zero-shot tasks can be performed using only trained AudioLDM without fine-tuning
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
written by Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley
(Submitted on 29 Jan 2023 (v1), last revised 9 Sep 2023 (this version, v3))
Comments: Accepted by ICML 2023. Demo and implementation at this https URL. Evaluation toolbox at this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
The proposed model, called "AudioLDM," can generate high-quality audio from text prompts on a single GPU, covering sounds such as:
- Ambient sounds
- Animal noises
- People talking
- Music
AudioLDM leverages the latent diffusion model (LDM) approach used for high-quality image generation to generate audio via a continuous latent representation. Specifically, it combines a mel-spectrogram-based variational autoencoder (VAE) with conditioning on Contrastive Language-Audio Pretraining (CLAP) embeddings to enable audio generation under rich text conditions.
The audio generated in this way can be viewed on the AudioLDM project page below.
Furthermore, this study shows that the trained AudioLDM can perform the following audio manipulation tasks without fine-tuning:
- Audio style transfer
- Audio super-resolution
- Audio inpainting
Let's look at how these are achieved by examining AudioLDM's model structure.
AudioLDM Model Structure
The overall structure of AudioLDM is shown in the figure below.
The solid line in the above figure represents the learning process, and the dotted line represents the inference process.
What is unique here is how CLAP is used: during training, the model is conditioned only on audio data (audio embeddings), while at inference the condition comes from text data (text embeddings).
Effects of using CLAP
CLAP (Contrastive Language-Audio Pretraining) is a contrastive learning model that aligns audio and text in a shared embedding space, ensuring consistency between the two modalities.
Source:https://github.com/microsoft/CLAP
The main benefit of using CLAP is not only maintaining consistency between audio and text, but also compensating for the shortage of training data. Text-to-audio generation normally requires a large number of paired examples, in this case text captions tied to audio clips.
However, such audio-text pairs are difficult to collect at scale, which limits how far the quality of audio generation can be improved. By using a CLAP model that has already been pretrained on a large amount of data, there is no need to train it again on one's own data, and cross-modal information can be obtained efficiently.
The LDM is then conditioned on the audio and text embeddings obtained from CLAP.
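As a rough illustration of what CLAP training does, the sketch below shows a CLAP-style symmetric contrastive loss that pulls matching audio-text pairs together in the shared embedding space. The encoders are replaced by random embeddings here; this is a simplified sketch, not the actual CLAP implementation.

```python
import torch
import torch.nn.functional as F

def clap_style_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders.
    Simplified sketch of a CLAP-style objective, not the official implementation.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = audio_emb @ text_emb.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_a2t = F.cross_entropy(logits, targets)              # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio direction
    return (loss_a2t + loss_t2a) / 2

# Toy usage: random tensors stand in for encoder outputs of 8 audio-text pairs.
print(clap_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```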
Acquisition of latent representations of audio through the LDM
Latent Diffusion Models (LDMs) are diffusion models that operate on latent representations of the data. Specifically, the data is first converted into a latent representation by the VAE's encoder, noise is added to this latent representation in the diffusion (forward) process, and the model is trained to reconstruct the latent representation in the denoising (reverse) process.
In this way, the model works with a latent representation of lower dimensionality than the raw data, making generation more efficient. The generated latent representation is passed through the VAE's decoder at the end to obtain the generated data.
In this study as well, the mel-spectrogram of the audio data is compressed into a latent representation by the VAE during training. During inference, the generated latent representation is converted back into a mel-spectrogram by the VAE decoder, and the mel-spectrogram is then passed through a module called a vocoder to output the raw audio waveform.
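To make this concrete, the sketch below shows the forward (noising) step that the LDM applies to a VAE latent of the mel-spectrogram during training. The latent shape and the linear noise schedule are illustrative assumptions, not the paper's exact configuration.

```python
import torch

# Stand-in for the output of VAE.encode(mel): a batch of mel-spectrogram latents.
z0 = torch.randn(4, 8, 64, 16)

# A common linear noise schedule (the paper's exact schedule may differ).
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def q_sample(z0, t, noise=None):
    """Diffuse clean latents to step t: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise, noise

t = torch.randint(0, T, (z0.size(0),))
z_t, eps = q_sample(z0, t)
# A U-Net is trained to predict eps from (z_t, t, CLAP embedding); at inference the
# denoised latent is decoded by the VAE and turned into a waveform by the vocoder.
```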
Augmentation of audio data
In this study, the audio data is augmented with mixup to address the shortage of audio data and improve the model's performance.
Specifically, new audio data $x_{1,2}$ is generated from existing audio data $x_1$ and $x_2$ according to the following formula:
$$x_{1,2} = \lambda x_1 + (1 - \lambda) x_2$$
Here, $\lambda$ is sampled from the Beta distribution $B(5, 5)$. The augmented audio has no text caption tied to it, but this is not a problem because text is not used during training.
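A minimal mixup sketch (the waveforms below are random placeholders for real training clips of equal length):

```python
import numpy as np

rng = np.random.default_rng()

def mixup_audio(x1, x2, alpha=5.0):
    """Blend two equal-length waveforms with a mixing weight drawn from Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)        # lambda ~ Beta(5, 5) by default
    return lam * x1 + (1.0 - lam) * x2

# Toy usage: two one-second clips at 16 kHz standing in for real training audio.
x12 = mixup_audio(np.random.randn(16000), np.random.randn(16000))
```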
Other audio manipulation tasks
Once AudioLDM has been trained, various tasks such as inpainting can be solved in a zero-shot fashion. The inference procedure for such tasks is shown in the figure below.
In (b), inpainting and super-resolution, it is possible to repair missing parts of the audio and improve the resolution of the audio data.
In (c), style conversion, it is possible to convert, for example, from "calm music" to "energetic music".
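As an illustration of the zero-shot idea, inpainting can be sketched roughly as follows: at every denoising step, the known region of the latent is overwritten with a noised copy of the original at the matching noise level, so only the masked region is actually generated. The denoise_step callable below is a hypothetical stand-in for the trained LDM's sampler, not AudioLDM's actual code.

```python
import torch

def zero_shot_inpaint(z_known, mask, denoise_step, alpha_bars):
    """Mask-guided inpainting with a trained diffusion model (sketch).

    z_known: latent of the corrupted audio; mask: 1 where the latent is known, 0 where it
    must be generated. denoise_step(z_t, t) is a hypothetical callable returning z_{t-1}.
    """
    T = alpha_bars.shape[0]
    z = torch.randn_like(z_known)                             # start from pure noise
    for t in reversed(range(T)):
        z = denoise_step(z, t)                                # one reverse-diffusion step
        a_bar = alpha_bars[t]
        noised_known = a_bar.sqrt() * z_known + (1 - a_bar).sqrt() * torch.randn_like(z_known)
        z = mask * noised_known + (1 - mask) * z              # keep the known region
    return mask * z_known + (1 - mask) * z                    # restore the clean known region
```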
Evaluation experiment
Datasets
Four datasets were used in the AudioLDM study: AudioSet (AS), AudioCaps (AC), Freesound (FS), and BBC Sound Effect library (SFX).
AS is large, containing 527 labels and over 5,000 hours of audio, while AC is a smaller dataset containing about 49,000 audio clips paired with text descriptions. However, these datasets consist mainly of audio taken from YouTube, so the audio quality is not guaranteed.
Therefore, additional high-quality audio data was collected from Freesound and the BBC SFX library.
AC and AS were used to evaluate the model. Each audio clip in AC has five text captions, one of which was randomly selected as the text condition. From AS, 10% of the audio samples were randomly selected as a separate evaluation set, and since AS provides labels rather than text descriptions, the concatenation of the labels was used as the condition.
Evaluation Method
This study employs a comprehensive evaluation method that includes both objective and subjective evaluations to assess the performance of AudioLDM.
The following evaluation metrics are used for the objective evaluation:
- Frechet distance (FD; a computation sketch follows this list)
- Inception score (IS)
- Kullback-Leibler (KL) divergence
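Of these, the Frechet distance measures how far the distribution of generated audio lies from that of reference audio in an embedding space. A minimal sketch, assuming the embeddings have already been extracted with some pretrained audio embedding model:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_gen, emb_ref):
    """Frechet distance between Gaussian fits of two embedding sets (rows = samples)."""
    mu1, mu2 = emb_gen.mean(axis=0), emb_ref.mean(axis=0)
    sigma1 = np.cov(emb_gen, rowvar=False)
    sigma2 = np.cov(emb_ref, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):              # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy usage: random features stand in for embeddings of generated and reference audio.
print(frechet_distance(np.random.randn(200, 128), np.random.randn(200, 128)))
```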
The subjective evaluation was conducted by six audio experts. Specifically, they answered a questionnaire rating each sample on overall audio quality (OVL) and relevance to the input text (REL).
DiffSound and AudioGen, which also perform text-to-audio generation, are used as comparison models.
AudioLDM was trained in two sizes, a small model (AudioLDM-S) and a large model (AudioLDM-L). In addition, an AudioLDM-L-Full model was trained on all of the datasets to investigate the effect of the amount of training data.
Result
The results of the comparative evaluation are shown in the table below.
Overall, "AudioLDM-L-Full" has the best performance. This means that AudioLDM, with a large number of parameters and trained on all datasets, is the most accurate model.
Appropriate data for conditioning
The following table compares the performance of training AudioLDM with only audio embeddings as the condition versus using both text and audio embeddings.
These results indicate that audio information is more effective than text information as the condition during training.
Based on these results, the final AudioLDM training also conditions the LDM only on the audio embeddings obtained from CLAP's audio encoder.
Appropriate number of sampling steps
In AudioLDM, DDIM is used as the sampling method. The appropriate number of DDIM steps can be found in the table below.
From the table above, we can see that the appropriate number of sampling steps is between 100 and 200.
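For reference, a single deterministic DDIM update (eta = 0) looks roughly like the sketch below; using 100 to 200 sampling steps simply means performing that many such updates over a subset of the trained timesteps, trading speed for quality. The noise schedule here is the same illustrative assumption as in the earlier sketch.

```python
import torch

# Illustrative linear noise schedule over the T timesteps used at training time.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddim_step(z_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update (eta = 0) from one kept timestep to the previous one."""
    z0_pred = (z_t - (1 - a_bar_t).sqrt() * eps_pred) / a_bar_t.sqrt()   # predicted clean latent
    return a_bar_prev.sqrt() * z0_pred + (1 - a_bar_prev).sqrt() * eps_pred

# Sampling with 200 DDIM steps only visits 200 of the 1000 trained timesteps.
n_steps = 200
timesteps = torch.linspace(T - 1, 0, n_steps).long()
```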
Summary
In this study, AudioLDM, which can generate audio from text prompts, was proposed, and it achieved SOTA performance in the text-to-audio field.
The following three main limitations were identified in this study:
- The sampling rate of the generated audio is insufficient for music generation
- Each module is trained separately, which may cause misalignment
- The risk of spreading misinformation through generated fake audio
Going forward, the authors need to explore approaches such as higher sampling rates and end-to-end fine-tuning.
Finally, the source code for AudioLDM is available on GitHub and Hugging Face, so anyone interested can try running it locally.
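For example, assuming the Hugging Face diffusers integration of AudioLDM (the AudioLDMPipeline class, the cvssp/audioldm-s-full-v2 checkpoint name, and the argument names below are based on the public integration and may differ between library versions), generation from a text prompt looks roughly like this:

```python
import torch
from diffusers import AudioLDMPipeline
import soundfile as sf

# Checkpoint name and arguments are assumptions based on the public diffusers integration;
# check the current documentation for the exact interface.
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

audio = pipe(
    "A hammer is hitting a wooden surface",
    num_inference_steps=200,        # cf. the sampling-step analysis above
    audio_length_in_s=5.0,
).audios[0]

sf.write("output.wav", audio, samplerate=16000)   # AudioLDM generates 16 kHz audio
```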