[Make-An-Audio] A Prompt-Enhanced Diffusion Model for Audio Generation
3 main points
✔️ Pseudo-labeling to address the shortage of audio-text pair data
✔️ Introduction of a spectrogram autoencoder
✔️ Also applied to personalized generation, audio inpainting, and Data-to-Audio
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
written by Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao
(Submitted on 30 Jan 2023)
Comments: Audio samples are available at this https URL
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
This paper proposes Make-An-Audio, a diffusion model that generates audio from text. The model takes a text prompt as input and generates audio that matches it.
For example, if you enter the prompt "cat meowing and a young woman's voice," the output will be audio of a cat meowing and a woman talking.
Audio samples generated in this study can be heard on the following GitHub page
https://text-to-audio.github.io/
This research was accepted to ICML 2023, one of the most competitive international conferences in machine learning, so if you are interested in generative AI, this is a paper worth checking out.
The challenge of audio generation is a lack of data
Deep generative models have recently become capable of faithfully representing data described in text prompts, thanks to large training data and powerful models. In particular, Text-to-Image and Text-to-Video can generate a wide variety of high-quality works with unprecedented ease.
However, when it comes to Text-to-Audio (audio generation), it currently lags behind image and video generation.
There are two reasons for this
- There is far less paired "text" and "audio" data
- Modeling long, continuous signal data is extremely complex
For example, with images, it is easy to collect large amounts of text-labeled data.
Audio on its own can be collected at scale, but audio paired with text labels is extremely scarce.
For these reasons, Text-to-Audio is not as accurate as image and video generation.
Therefore, this study proposed the following three solutions to the problem.
- Make-An-Audio with Prompt Enhanced Diffusion Model
- Pseudo-labeling to generate "text closely related to the audio" (addresses the data shortage)
- A spectrogram autoencoder that predicts self-supervised representations (reduces the modeling complexity)
In this way, Make-An-Audio is conceptually simple, yet delivers surprisingly powerful results.
In addition to text, this study also addresses the following three settings of audio generation
- Personalized Text-To-Audio Generation
- Audio Inpainting
- Visual-To-Audio Generation
Now let's look at some specific methods.
Technique
In this section, we will review the methodology used in this study. First, an overview of the overall methodology.
Methodology Overview
The process of audio generation in Make-An-Audio is shown in the figure below.
Each element in the above structure can be broken down into the following five categories
- Enhanced pseudo-prompts to alleviate data shortage issues
- Text encoder trained with contrastive learning (CLAP)
- Spectrogram autoencoder for predicting self-supervised representations
- Diffusion model conditioned on the text embedding
- Neural vocoder for converting mel spectrograms to raw waveforms
The following sections describe these elements in detail.
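Before going into each component, here is a rough sketch of how they fit together at inference time. This is a minimal pseudocode sketch with hypothetical module names (text_encoder, latent_diffusion, spectrogram_decoder, vocoder), not the authors' actual implementation.

```python
# Minimal sketch of the Make-An-Audio inference flow.
# All module names below are hypothetical placeholders.

def generate_audio(prompt: str):
    # 1. Encode the text prompt with a frozen encoder (CLAP or T5-Large)
    text_emb = text_encoder(prompt)

    # 2. Sample a spectrogram latent with the text-conditioned diffusion model
    z = latent_diffusion.sample(condition=text_emb)

    # 3. Decode the latent into a mel spectrogram
    mel = spectrogram_decoder(z)

    # 4. Convert the mel spectrogram to a raw waveform with a neural vocoder
    return vocoder(mel)
```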
Pseudo-labeling (Distill-then-Reprogram)
This section describes the "Distill-then-Reprogram" method of applying pseudo-labeling to resolve Text-Audio pair data shortages.
As the name implies, this technique involves distillation and then reprogramming.
Let's look at each phase in turn.
・Expert Distillation
This phase takes "audio data without text labels" as input.
Two pre-trained expert models are then used for prompt generation
- Automatic audio captioning model: generates varied text describing the content of the audio
- Audio-text retrieval model: searches a database for text associated with similar audio data
Together, these two experts distill knowledge about the input audio and produce text that is relevant to it.
・Dynamic Reprogramming
Next, using the text just generated by Expert Distillation as X, the following four steps are used to expand the data.
- Store audio clips with very short text labels (denoted &) in a database
- Sample 0-2 audio clips from the database
- Combine the sampled clips with the original audio
- Combine X and & according to a "template" (a sketch is given after the quoted rules below)
Here, & refers to short, mostly single-word texts such as "Birds" or "Footsteps".
The template looks like the following
The detailed rules for the above template are given in the paper as follows
Specifically, we replace X and &, respectively, with the natural language of sampled data and the class label of sampled events from the database.
For verb (denoted as v), we have {'hearing', 'noticing', 'listening to', 'appearing'};
for adjective (denoted as a), we have {'clear', 'noisy', 'close-up', 'weird', 'clean'};
for noun (denoted as n), we have {'audio', 'sound', 'voice'};
for numeral/quantifier (denoted as q), we have {'a', 'the', 'some'}.
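As an illustration, the reprogramming step can be sketched as follows. This is a minimal sketch under the rules quoted above: the 0-2 event sampling and the word lists follow the paper, but the function names, the database format, and the exact template wording are my own assumptions.

```python
import random

# Word lists quoted from the paper's template rules
VERBS = ['hearing', 'noticing', 'listening to', 'appearing']
ADJECTIVES = ['clear', 'noisy', 'close-up', 'weird', 'clean']
NOUNS = ['audio', 'sound', 'voice']
QUANTIFIERS = ['a', 'the', 'some']

def reprogram(x: str, event_db: list[str]) -> str:
    """Combine a distilled caption X with 0-2 sampled event labels (&)."""
    sampled = random.sample(event_db, k=random.randint(0, 2))
    prompt = x
    for label in sampled:
        # One possible template instantiation: "X, and (v) (q) (a) (n) of &"
        clause = (f"and {random.choice(VERBS)} {random.choice(QUANTIFIERS)} "
                  f"{random.choice(ADJECTIVES)} {random.choice(NOUNS)} of {label}")
        prompt = f"{prompt}, {clause}"
    return prompt

# Example: reprogram("a cat is meowing", ["Birds", "Footsteps", "Thunder"])
```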
Text encoder
In the field of text-to-data (image, audio, music etc.), the two most common methods for encoding text prompts are
- Cross-modal contrastive learning (MuLan, CLIP, CLAP, etc.)
- Encoding with a pre-trained LLM (T5, FLAN-T5, etc.)
This study also uses these pre-trained, frozen models to encode text prompts. Specifically, CLAP is used as the contrastive-learning model and T5-Large as the pre-trained LLM.
To jump ahead to the results: CLAP and T5-Large achieved similar scores in the benchmark evaluation, but CLAP may be more efficient because it does not require the offline computation of embeddings that the LLM does.
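As an example of the frozen-encoder setup, the sketch below obtains text embeddings from a pre-trained T5-Large encoder with Hugging Face transformers. This is an assumption for illustration; the paper does not tie itself to this library, and the CLAP branch is omitted here.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load a pre-trained T5-Large encoder and freeze its weights
tokenizer = AutoTokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large").eval()
for p in encoder.parameters():
    p.requires_grad = False

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, return_tensors="pt")
    # Token-level embeddings used as the conditioning sequence for the diffusion model
    return encoder(**tokens).last_hidden_state

# emb = encode_prompt("a cat meowing and a young woman's voice")
```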
Spectrogram Auto Encoder
Here, the input audio is represented as a mel spectrogram x. The spectrogram autoencoder consists of the following components
- Encoder E, which takes x as input and outputs a latent representation z
- Decoder G, which reconstructs the mel spectrogram x' from z
- A multi-window discriminator
The whole system is trained end-to-end to minimize the following losses
- Reconstruction loss: improves training efficiency and the fidelity of the generated spectrograms
- GAN loss: the discriminator and generator are trained adversarially
- KL-penalty loss: pushes the latent z toward a standard normal distribution and avoids an overly high-variance latent space
Thus, Make-An-Audio predicts a self-supervised representation instead of raw waveforms. This greatly reduces the difficulty of modeling long continuous data while preserving high-level semantic understanding.
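To make the three loss terms concrete, here is a minimal training-step sketch. The encoder E, decoder G, and discriminator D are hypothetical modules, and the loss weights are placeholders rather than the paper's actual values.

```python
import torch
import torch.nn.functional as F

def autoencoder_step(x_mel, E, G, D, lambda_gan=1.0, lambda_kl=1e-6):
    """One (generator-side) training step of the spectrogram autoencoder - a sketch."""
    # Encode to a Gaussian latent and reconstruct the mel spectrogram
    mu, logvar = E(x_mel)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    x_rec = G(z)

    # Reconstruction loss: fidelity of the generated spectrogram
    loss_rec = F.l1_loss(x_rec, x_mel)

    # GAN loss (generator side): try to fool the multi-window discriminator
    loss_gan = -D(x_rec).mean()

    # KL penalty: keep the latent close to a standard normal distribution
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    return loss_rec + lambda_gan * loss_gan + lambda_kl * loss_kl
```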
Latent Diffusion
The Latent Diffusion Model is used here: forward and reverse diffusion processes are applied to the latent representations, conditioned on the text embedding. The loss function is as follows
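In standard latent-diffusion notation, this objective can be written as

$$
\mathcal{L}_\theta = \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}\left[\, \big\| \epsilon - \epsilon_\theta(z_t, t, c) \big\|_2^2 \,\right]
$$

where $z_t$ is the noised latent at diffusion step $t$, $\epsilon$ is the sampled Gaussian noise, $\epsilon_\theta$ is the denoising network, and $c$ is the text embedding used as the condition.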
Data-To-Audio
This section describes Data-to-Audio generation, which applies and generalizes the ideas presented so far. Data-to-Audio here refers to the following three tasks
- Personalized Text-To-Audio Generation
- Audio Inpainting
- Visual-To-Audio Generation
Personalized Text-To-Audio Generation
Recently, "personalization" has appeared in the vision and graphics fields, allowing personal, creative customization such as incorporating one's own unique objects.
This kind of personalization is also incorporated in this research. For example, given audio of "thunder" and the prompt "baby crying," the model can generate audio that sounds like "a baby crying on a stormy day." It can also be used broadly for mixing and editing audio, such as adding background sounds to existing audio or inserting new sound objects.
Audio Inpainting
Audio Inpainting is the task of restoring audio by reconstructing damaged portions of a digital audio signal.
Inpainting is a technique originally used to restore parts of an image: it reconstructs masked regions so that scratches, overlaid text, and the like can be repaired naturally. The authors extend this technique to audio.
In this study, the audio spectrogram is treated as an image, and inpainting is applied to that spectrogram image.
Specifically, Make-An-Audio is fine-tuned for Audio Inpainting, with irregular masks (thick, medium, and thin) applied to the audio spectrogram.
Masking here simulates the damaged portions of the audio that the model must fill in.
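As a concrete illustration, a frame-based mask over a mel spectrogram can be sketched as below. This is my own NumPy example, not the authors' masking code; the irregular thick/medium/thin masks would be drawn as random blobs instead of contiguous frame spans.

```python
import numpy as np

def frame_mask(mel: np.ndarray, mask_ratio: float = 0.3):
    """Mask a contiguous fraction of time frames in a mel spectrogram of shape (n_mels, n_frames)."""
    n_frames = mel.shape[1]
    n_masked = int(n_frames * mask_ratio)
    start = np.random.randint(0, n_frames - n_masked + 1)

    mask = np.ones_like(mel)
    mask[:, start:start + n_masked] = 0.0  # zeros mark the "damaged" region

    # The inpainting model is trained to reconstruct mel from the masked input
    return mel * mask, mask
```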
Visual-To-Audio Generation
Recent deep generative models are making progress on generating audio that matches the content of images and videos. To push this direction further, this study extends Make-An-Audio to audio generation from visual information. Specifically, there are two settings
- Image-to-Audio
- Video-to-Audio
The Visual-To-Audio inference process for this study is as follows
Here, this study utilizes the following two ideas
- Contrastive Learning (CLIP)
- A CLIP-guided Text-to-Audio model
CLIP performs cross-modal learning of images and text via contrastive learning, and consists of an image encoder and a text encoder.
・Image-to-Audio
Using these components, Image-to-Audio in this study proceeds as follows.
- An image is given as input
- The image passes through CLIP's image encoder
- The CLIP embedding passes through Make-An-Audio's Transformer
- It conditions the U-Net through its cross-attention layers
- The resulting latent passes through Make-An-Audio's audio decoder
- The mel spectrogram passes through Make-An-Audio's vocoder
・Video-to-Audio
A video is a collection of image frames. Video-to-Audio therefore picks four frames uniformly from the video, average-pools their CLIP image features, and feeds the result to Make-An-Audio's Transformer layer.
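A minimal sketch of this frame-pooling step is shown below, using the openai `clip` package. The choice of library, the `ViT-B/32` checkpoint, and the function name are assumptions for illustration; the paper does not specify an implementation.

```python
import clip
import torch

model, preprocess = clip.load("ViT-B/32", device="cpu")

@torch.no_grad()
def video_condition(frames):
    """frames: list of PIL images extracted from the video."""
    # Pick 4 frames uniformly across the video
    idx = torch.linspace(0, len(frames) - 1, steps=4).long()
    batch = torch.stack([preprocess(frames[i]) for i in idx])

    # Encode with CLIP's image encoder and average-pool the features
    feats = model.encode_image(batch)       # shape (4, d)
    return feats.mean(dim=0, keepdim=True)  # shape (1, d) conditioning vector
```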
Experiment
This section presents the quantitative evaluation methodology for this study.
Evaluation metrics
In this study, the model is evaluated with objective and subjective measures of "audio quality" and "text-audio alignment".
The following indicators are used as objective indicators
- Melception-based FID: automatic measure of generation quality
- KL divergence: audio fidelity
- CLAP score: accuracy of the text-audio correspondence (see the sketch below)
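For reference, the CLAP score is essentially the cosine similarity between the CLAP audio embedding and the CLAP text embedding; a sketch (assuming the embeddings have already been computed by frozen CLAP encoders) looks like this:

```python
import torch.nn.functional as F

def clap_score(audio_emb, text_emb) -> float:
    """Cosine similarity between a CLAP audio embedding and a CLAP text embedding (1-D tensors)."""
    return F.cosine_similarity(audio_emb, text_emb, dim=-1).item()
```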
For the subjective metrics, subjects were recruited via crowdsourcing and given MOS (Mean Opinion Score) tests on a 20-100 Likert scale. The following listening tests were conducted to compute the scores (each reported with a 95% confidence interval).
| Metric | Meaning | Experimental details |
|---|---|---|
| MOS-Q | Audio quality | Subjects are instructed to focus on sound quality and naturalness when rating |
| MOS-F | Text-audio consistency | Subjects are shown the audio and the prompt, asked "Does the natural language description closely match the audio?", and answer "completely," "mostly," or "somewhat" on a 20-100 Likert scale |
A screenshot of the instructions given to the subjects is shown below.
Text-to-Audio results
Here is a comparison, on objective and subjective measures, with Diffsound, the only publicly available Text-to-Audio baseline. The results are as follows
From these results, we can say the following three things
- In audio quality, Make-An-Audio achieved the best scores, with an FID of 4.61 and a KL of 2.79
- Make-An-Audio achieves the highest CLAP score for text-audio consistency
- Subjects gave it the highest MOS-Q and MOS-F ratings, at 72.5 and 78.6 respectively
Comparative experiments were also conducted with the following text encoders
- BERT
- T5-Large
- CLIP
- CLAP
The text encoder weights are frozen during generation. From the results table, three observations can be made.
- CLIP is trained on text-image pairs and may be less useful for Text-to-Audio
- CLAP and T5-Large achieve similar performance
- CLAP does not require the offline computation of embeddings that the LLM does, and may be more computationally efficient (only 59% of the parameters)
Audio Inpainting
The Audio Inpainting evaluation includes the following six masking methods.
- Thick masking (Irregular)
- Medium masking (Irregular)
- Thin masking (Irregular)
- Masking 30% of data (Frame)
- Masking 50% of data (Frame)
- Masking 70% of data (Frame)
We also randomly masked wide and narrow areas and used FID and KL indicators to measure performance.
The results are as follows
The results show that, regardless of the masking strategy, the larger the area masked during training, the better the accuracy.
For mask areas of similar size, frame-based masking consistently outperforms irregular masking.
An example of Audio Inpainting results is shown below.
The figure shows, from top to bottom, the damaged input audio (Input), the result of audio inpainting (Result), and the undamaged ground-truth audio (GT). Given an input with missing regions, the model reconstructs audio as close as possible to the ground truth GT.
As the results show, the audio is correctly filled in and reconstructed for masked regions of different shapes.
Visual-to-Audio
The Visual-to-Audio results are as follows
The results show that it can be generalized successfully to images and videos.
Personalized Text-To-Audio Generation
In Personalized Text-To-Audio generation, there is a trade-off between Faithfulness (consistency with the text description) and Realism (sound quality), as shown in the figure on the left below.
The figure on the right shows that as T increases, a larger amount of noise is added to the original audio, making the generated samples more realistic but less faithful.
For comparison, Realism is measured by the 1-MSE distance between the generated and the initial audio, while Faithfulness is measured by the CLAP score between the generated samples and the text prompts.
Future Outlook
In this article, we introduced Make-An-Audio, a prompt-enhanced diffusion model for Text-to-Audio generation. Make-An-Audio enables high-quality audio generation, and this research will undoubtedly serve as a foundation for future audio synthesis work. It should also help reduce the labor required to create short videos and digital art.
The authors also note the study's limitations: "Latent Diffusion Models usually require more computational resources and may show degradation as the training data decreases." Therefore, one future direction is to develop a lightweight and fast diffusion model to speed up generation.