[AudioLDM] Text-to-Audio Generation Model Using Latent Diffusion
3 main points
✔️ Using audio-only data to train LDMs improves computational efficiency
✔️ Using CLAP ensures text-audio consistency without requiring paired data
✔️ Various zero-shot tasks can be performed using only trained AudioLDM without fine-tuning
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
written by Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley
(Submitted on 29 Jan 2023 (v1), last revised 9 Sep 2023 (this version, v3))
Comments: Accepted by ICML 2023. Demo and implementation at this https URL. Evaluation toolbox at this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
The proposed model, called "AudioLDM," can generate high-quality audio from text prompts on a single GPU, covering sounds such as:
- Ambient sounds
- Animal noises
- People talking
- Music
AudioLDM leverages the latent diffusion model (LDM) approach used for high-quality image generation to generate audio via a continuous latent representation. Specifically, it combines a mel-spectrogram-based variational autoencoder (VAE) with conditioning on Contrastive Language-Audio Pretraining (CLAP) embeddings to enable audio generation under rich text conditions.
The audio generated in this way can be viewed on the AudioLDM project page below.
Furthermore, this study shows that the trained AudioLDM can perform the following audio manipulation tasks without fine-tuning:
- Audio style transfer
- Audio super-resolution
- Audio inpainting
Let's look at how these are achieved by examining AudioLDM's model structure.
AudioLDM Model Structure
The overall structure of AudioLDM is shown in the figure below.
The solid line in the above figure represents the learning process, and the dotted line represents the inference process.
What is unique here is how CLAP is used: during training, the model is conditioned only on audio data (audio embeddings), while at inference the condition comes from text data (text embeddings).
Effects of using CLAP
CLAP (Contrastive Language-Audio Pretraining) is a contrastive learning model that aligns audio and text in a shared embedding space, ensuring consistency between the two modalities.
Source:https://github.com/microsoft/CLAP
The main benefit of using CLAP is not only maintaining consistency between audio and text, but also compensating for the shortage of training data. Text-to-audio generation normally requires a large number of paired examples, in this case text captions tied to audio clips.
However, such audio-text pairs are difficult to collect at scale, which limits how far the quality of audio generation can be improved. By using a CLAP model that has already been pretrained on a large amount of data, there is no need to train it again on one's own data, and cross-modal information can be obtained efficiently.
The LDM is then conditioned on the audio and text embeddings obtained from CLAP.
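As a rough illustration of what CLAP training does, the sketch below shows a CLAP-style symmetric contrastive loss that pulls matching audio-text pairs together in the shared embedding space. The encoders are replaced by random embeddings here; this is a simplified sketch, not the actual CLAP implementation.

```python
import torch
import torch.nn.functional as F

def clap_style_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders.
    Simplified sketch of a CLAP-style objective, not the official implementation.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = audio_emb @ text_emb.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_a2t = F.cross_entropy(logits, targets)              # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio direction
    return (loss_a2t + loss_t2a) / 2

# Toy usage: random tensors stand in for encoder outputs of 8 audio-text pairs.
print(clap_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```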
Acquisition of latent representations of audio through the LDM
Latent Diffusion Models (LDMs) are diffusion models that operate on latent representations of the data. Specifically, the data is first converted into a latent representation by the VAE's encoder, noise is added to this latent representation in the diffusion (forward) process, and the model is trained to reconstruct the latent representation in the denoising (reverse) process.
In this way, the model works with a latent representation of lower dimensionality than the raw data, making generation more efficient. The generated latent representation is passed through the VAE's decoder at the end to obtain the generated data.
In this study as well, the mel-spectrogram of the audio data is compressed into a latent representation by the VAE during training. During inference, the generated latent representation is converted back into a mel-spectrogram by the VAE decoder, and the mel-spectrogram is then passed through a module called a vocoder to output the raw audio waveform.
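To make this concrete, the sketch below shows the forward (noising) step that the LDM applies to a VAE latent of the mel-spectrogram during training. The latent shape and the linear noise schedule are illustrative assumptions, not the paper's exact configuration.

```python
import torch

# Stand-in for the output of VAE.encode(mel): a batch of mel-spectrogram latents.
z0 = torch.randn(4, 8, 64, 16)

# A common linear noise schedule (the paper's exact schedule may differ).
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def q_sample(z0, t, noise=None):
    """Diffuse clean latents to step t: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise, noise

t = torch.randint(0, T, (z0.size(0),))
z_t, eps = q_sample(z0, t)
# A U-Net is trained to predict eps from (z_t, t, CLAP embedding); at inference the
# denoised latent is decoded by the VAE and turned into a waveform by the vocoder.
```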
Augmentation of audio data
In this study, the audio data is augmented with mixup to address the shortage of audio data and improve the model's performance.
Specifically, new audio data $x_{1,2}$ is generated from existing audio data $x_1$ and $x_2$ according to the following formula:
$$x_{1,2} = \lambda x_1 + (1 - \lambda) x_2$$
Here, $\lambda$ is sampled from the Beta distribution $B(5, 5)$. The augmented audio has no text caption tied to it, but this is not a problem because text is not used during training.
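A minimal mixup sketch (the waveforms below are random placeholders for real training clips of equal length):

```python
import numpy as np

rng = np.random.default_rng()

def mixup_audio(x1, x2, alpha=5.0):
    """Blend two equal-length waveforms with a mixing weight drawn from Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)        # lambda ~ Beta(5, 5) by default
    return lam * x1 + (1.0 - lam) * x2

# Toy usage: two one-second clips at 16 kHz standing in for real training audio.
x12 = mixup_audio(np.random.randn(16000), np.random.randn(16000))
```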
Other audio manipulation tasks
Once AudioLDM has been trained, various tasks such as inpainting can be solved in a zero-shot fashion. The inference procedure for such tasks is shown in the figure below.
In (b), inpainting and super-resolution, it is possible to repair missing parts of the audio and improve the resolution of the audio data.
In (c), style conversion, it is possible to convert, for example, from "calm music" to "energetic music".
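As an illustration of the zero-shot idea, inpainting can be sketched roughly as follows: at every denoising step, the known region of the latent is overwritten with a noised copy of the original at the matching noise level, so only the masked region is actually generated. The denoise_step callable below is a hypothetical stand-in for the trained LDM's sampler, not AudioLDM's actual code.

```python
import torch

def zero_shot_inpaint(z_known, mask, denoise_step, alpha_bars):
    """Mask-guided inpainting with a trained diffusion model (sketch).

    z_known: latent of the corrupted audio; mask: 1 where the latent is known, 0 where it
    must be generated. denoise_step(z_t, t) is a hypothetical callable returning z_{t-1}.
    """
    T = alpha_bars.shape[0]
    z = torch.randn_like(z_known)                             # start from pure noise
    for t in reversed(range(T)):
        z = denoise_step(z, t)                                # one reverse-diffusion step
        a_bar = alpha_bars[t]
        noised_known = a_bar.sqrt() * z_known + (1 - a_bar).sqrt() * torch.randn_like(z_known)
        z = mask * noised_known + (1 - mask) * z              # keep the known region
    return mask * z_known + (1 - mask) * z                    # restore the clean known region
```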
Evaluation experiment
Datasets
Four datasets were used in the AudioLDM study: AudioSet (AS), AudioCaps (AC), Freesound (FS), and BBC Sound Effect library (SFX).
AS is large, containing 527 labels and over 5,000 hours of audio, while AC is a smaller dataset containing about 49,000 audio clips paired with text descriptions. However, these datasets consist mainly of audio taken from YouTube, so the audio quality is not guaranteed.
Therefore, additional high-quality audio data was collected from Freesound and the BBC SFX library.
AC and AS were used to evaluate the model. Each audio clip in AC has five text captions, one of which was randomly selected as the text condition. From AS, 10% of the audio samples were randomly selected as a separate evaluation set, and since AS provides labels rather than text descriptions, the concatenation of the labels was used as the condition.
Evaluation Method
This study employs a comprehensive evaluation method that includes both objective and subjective evaluations to assess the performance of AudioLDM.
The following evaluation metrics are used for the objective evaluation:
- Frechet distance (FD; a computation sketch follows this list)
- Inception score (IS)
- Kullback-Leibler (KL) divergence
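Of these, the Frechet distance measures how far the distribution of generated audio lies from that of reference audio in an embedding space. A minimal sketch, assuming the embeddings have already been extracted with some pretrained audio embedding model:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_gen, emb_ref):
    """Frechet distance between Gaussian fits of two embedding sets (rows = samples)."""
    mu1, mu2 = emb_gen.mean(axis=0), emb_ref.mean(axis=0)
    sigma1 = np.cov(emb_gen, rowvar=False)
    sigma2 = np.cov(emb_ref, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):              # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy usage: random features stand in for embeddings of generated and reference audio.
print(frechet_distance(np.random.randn(200, 128), np.random.randn(200, 128)))
```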
The subjective evaluation was conducted by six audio experts. Specifically, they answered a questionnaire rating each sample on overall audio quality (OVL) and relevance to the input text (REL).
DiffSound and AudioGen, which also perform text-to-audio generation, are used as comparison models.
AudioLDM was trained in two sizes, a small model (AudioLDM-S) and a large model (AudioLDM-L). In addition, an AudioLDM-L-Full model was trained on all of the datasets to investigate the effect of the amount of training data.
Result
The results of the comparative evaluation are shown in the table below.
Overall, "AudioLDM-L-Full" has the best performance. This means that AudioLDM, with a large number of parameters and trained on all datasets, is the most accurate model.
Appropriate data for conditioning
The following table compares the performance of training AudioLDM with only audio embeddings as the condition versus using both text and audio embeddings.
These results indicate that audio information is more effective than text information as the condition during training.
Based on these results, the final AudioLDM training also conditions the LDM only on the audio embeddings obtained from CLAP's audio encoder.
Appropriate number of sampling steps
In AudioLDM, DDIM is used as the sampling method. The appropriate number of DDIM steps can be found in the table below.
From the table above, we can see that the appropriate number of sampling steps is between 100 and 200.
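For reference, a single deterministic DDIM update (eta = 0) looks roughly like the sketch below; using 100 to 200 sampling steps simply means performing that many such updates over a subset of the trained timesteps, trading speed for quality. The noise schedule here is the same illustrative assumption as in the earlier sketch.

```python
import torch

# Illustrative linear noise schedule over the T timesteps used at training time.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddim_step(z_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update (eta = 0) from one kept timestep to the previous one."""
    z0_pred = (z_t - (1 - a_bar_t).sqrt() * eps_pred) / a_bar_t.sqrt()   # predicted clean latent
    return a_bar_prev.sqrt() * z0_pred + (1 - a_bar_prev).sqrt() * eps_pred

# Sampling with 200 DDIM steps only visits 200 of the 1000 trained timesteps.
n_steps = 200
timesteps = torch.linspace(T - 1, 0, n_steps).long()
```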
Summary
In this study, AudioLDM, which can generate audio from text prompts, was proposed, and it achieved SOTA performance in the text-to-audio field.
The following three main limitations were identified in this study:
- The sampling rate of the generated audio is insufficient for music generation
- Each module is trained separately, which may cause misalignment
- The risk of spreading misinformation through generated fake audio
Going forward, the authors need to explore approaches such as higher sampling rates and end-to-end fine-tuning.
Finally, the source code for AudioLDM is available on GitHub and Hugging Face, so anyone interested can try running it locally.
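For example, assuming the Hugging Face diffusers integration of AudioLDM (the AudioLDMPipeline class, the cvssp/audioldm-s-full-v2 checkpoint name, and the argument names below are based on the public integration and may differ between library versions), generation from a text prompt looks roughly like this:

```python
import torch
from diffusers import AudioLDMPipeline
import soundfile as sf

# Checkpoint name and arguments are assumptions based on the public diffusers integration;
# check the current documentation for the exact interface.
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

audio = pipe(
    "A hammer is hitting a wooden surface",
    num_inference_steps=200,        # cf. the sampling-step analysis above
    audio_length_in_s=5.0,
).audios[0]

sf.write("output.wav", audio, samplerate=16000)   # AudioLDM generates 16 kHz audio
```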