
Make-An-Audio: Prompt-Enhanced Diffusion Model for Audio Generation

Diffusion Model

3 main points
✔️ Pseudo Labeling to Solve Audio-Text Pair Data Shortage
✔️ Introducing Spectrogram Auto Encoder
✔️ Also Applied to Personalized Generation, Audio Inpainting, and Data-to-Audio

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
written by Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao
(Submitted on 30 Jan 2023)
Comments: Audio samples are available at this https URL

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)


The images used in this article are from the paper, the introductory slides, or were created based on them.


This paper proposes Make-An-Audio, a diffusion model that generates audio from text. The model takes a text prompt as input and generates audio that matches the description.

For example, if you enter the prompt "cat meowing and a young woman's voice," the output will be audio of a cat meowing and a woman talking.

Audio samples generated in this study can be heard on the project's GitHub page.

This research was accepted to ICML 2023, one of the most competitive international conferences in machine learning. If you are interested in generative AI, this is a paper worth checking out.

The challenge of audio generation: lack of data

Deep generative models have recently become capable of faithfully representing data described in text prompts, thanks to large training data and powerful models. In particular, Text-to-Image and Text-to-Video can generate a wide variety of high-quality works with unprecedented ease.

However, when it comes to Text-to-Audio (audio generation), it currently lags behind image and video generation.

There are two reasons for this:

  • There is far less paired "text-audio" data
  • Modeling long continuous waveform signals is extremely complex

For example, with image data, you can easily collect a large amount of "text-labeled" image data.

Audio data by itself is relatively easy to collect, but audio paired with text labels is extremely scarce.

For these reasons, Text-to-Audio is not as accurate as image and video generation.

Therefore, this study proposed the following three solutions to the problem.

  • Make-An-Audio, a prompt-enhanced diffusion model
  • Pseudo-labeling to generate text closely matched to the audio (addresses the data shortage)
  • A spectrogram autoencoder that predicts a self-supervised representation (addresses the modeling complexity)

In this way, Make-An-Audio is conceptually simple, yet delivers surprisingly powerful results.

In addition to text, this study also addresses the following three patterns of audio generation:

  • Personalized Text-To-Audio Generation
  • Audio Inpainting
  • Visual-To-Audio Generation

Now let's look at some specific methods.


This section reviews the methodology used in this study, starting with an overview of the overall pipeline.

Methodology Overview

The process of audio generation in Make-An-Audio is shown in the figure below.

Each element in the above structure can be broken down into the following five components:

  • Enhanced pseudo-prompts to alleviate the data shortage
  • A text encoder trained with contrastive learning (CLAP)
  • A spectrogram autoencoder for predicting self-supervised representations
  • A diffusion model conditioned on the text embedding
  • A neural vocoder for converting mel spectrograms to raw waveforms

The following sections describe these elements in detail.

Pseudo-labeling (Distill-then-Reprogram)

This section describes the "Distill-then-Reprogram" method of applying pseudo-labeling to resolve Text-Audio pair data shortages.

As the name implies, this technique involves distillation and then reprogramming.

Let's look at each phase in turn.

・Expert Distillation

This phase first takes as input "Audio data without text labels".

We then have the following two pre-trained expert models for prompt generation

  • An automatic audio captioning model: generates diverse text describing the content of the audio
  • An audio-text retrieval model: searches a database for similar audio data

Here, the two expert models work together to extract knowledge about the input Audio data and generate text that is relevant to that Audio data.

・Dynamic Reprogramming

Next, using the text just generated by Expert Distillation as X, the following four steps are used to expand the data.

  1. Store audio clips with very short text labels (&) in a database
  2. Sample 0-2 audio clips from the database
  3. Concatenate the sampled audio clips
  4. Combine X and & according to a "template"

Here, & refers to short, mostly one-word labels such as "Birds" or "Footsteps".

The template also includes the following

The detailed rules for the above template were published in this paper as follows

Specifically, we replace X and &, respectively, with the natural language of sampled data and the class label of sampled events from the database.

For verbs (denoted v), we have {'hearing', 'noticing', 'listening to', 'appearing'};

for adjectives (denoted a), we have {'clear', 'noisy', 'close-up', 'weird', 'clean'};

for nouns (denoted n), we have {'audio', 'sound', 'voice'};

for numerals/quantifiers (denoted q), we have {'a', 'the', 'some'};
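The dynamic reprogramming steps above can be sketched as follows. This is a minimal illustration, assuming a tiny in-memory database and a single made-up template; the real template set and sampling details are in the paper.

```python
import random

# Toy word pools, following the paper's verb/adjective/noun/quantifier sets.
ADJS = ['clear', 'noisy', 'close-up', 'weird', 'clean']
NOUNS = ['audio', 'sound', 'voice']
QUANTS = ['a', 'the', 'some']

# Toy database of (class label &, audio clip) pairs; clips are sample lists.
DB = [("Birds", [0.1, 0.2]), ("Footsteps", [0.3, 0.4]), ("Rain", [0.5, 0.6])]

def reprogram(x_caption, x_audio, rng=random):
    """Return an augmented (caption, audio) pair from an original pair."""
    k = rng.randint(0, 2)                          # sample 0-2 extra clips
    sampled = [rng.choice(DB) for _ in range(k)]
    audio = list(x_audio)
    for _, clip in sampled:
        audio += clip                              # concatenate audio clips
    labels = ' and '.join(lbl for lbl, _ in sampled)
    if labels:                                     # fill a template with X and &
        caption = (f"{rng.choice(QUANTS)} {rng.choice(ADJS)} "
                   f"{rng.choice(NOUNS)} of {labels} while {x_caption}")
    else:
        caption = x_caption
    return caption, audio

caption, audio = reprogram("a dog barking in the distance", [0.7, 0.8])
```

Because the sampled clips carry their own short class labels, each augmented pair keeps its text and audio aligned by construction.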

Text encoder

In the field of text-to-data generation (image, audio, music, etc.), the two most common methods for encoding text prompts are:

  • Cross-modal contrastive learning (MuLan, CLIP, CLAP, etc.)
  • Encoding with a pre-trained LLM (T5, FLAN-T5, etc.)

This study also uses these pre-trained, frozen models to encode text prompts. Specifically, it uses CLAP as the contrastive learning model and T5-Large as the pre-trained LLM.

To preview the results, both CLAP and T5-Large achieved similar scores in the benchmark evaluation, but CLAP may be more efficient because it does not require the offline computation of embeddings that the LLM does.

Spectrogram Auto Encoder

Here, the input audio is represented as a mel spectrogram x. The spectrogram autoencoder is configured as follows:

  • Encoder E that takes x as input and outputs the latent representation z
  • Decoder G to reconstruct the mel spectrogram signal x ' from z
  • Multi-window discriminator

This entire system is trained end-to-end to minimize:

  • Reconstruction loss: improves learning efficiency and the fidelity of generated spectrograms
  • GAN loss: the discriminator and generator are trained adversarially
  • KL-penalty loss: pushes the encoder's latents toward a standard normal distribution, avoiding an overly dispersed latent space

Thus, Make-An-Audio predicts a self-supervised representation instead of a waveform. This greatly reduces the challenges of modeling long continuous data and guarantees high-quality semantic understanding.
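The three loss terms can be illustrated on toy tensors as below. This is a sketch under assumptions: a diagonal-Gaussian encoder output (mu, log_var), placeholder network outputs, and arbitrary loss weights; only the loss terms themselves follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(80, 32))                 # toy mel spectrogram (bins x frames)
x_rec = x + 0.1 * rng.normal(size=x.shape)    # pretend decoder output G(E(x))
mu = rng.normal(scale=0.1, size=(8, 4))       # encoder latent mean
log_var = rng.normal(scale=0.1, size=(8, 4))  # encoder latent log-variance
d_fake = rng.uniform(size=(4,))               # discriminator scores on x_rec

# Reconstruction (L2) loss between input and reconstructed spectrogram
recon_loss = np.mean((x - x_rec) ** 2)

# KL penalty of N(mu, var) against the standard normal N(0, I)
kl_loss = 0.5 * np.mean(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Non-saturating generator-side GAN loss (one common formulation)
gan_loss = -np.mean(np.log(d_fake + 1e-8))

total = recon_loss + 1e-2 * kl_loss + 1e-1 * gan_loss  # toy weights
```

The KL term is what keeps the latent space compact enough for the diffusion model to operate on.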

Latent Diffusion

The Latent Diffusion Model is used here: diffusion and reverse diffusion processes are applied to the latent representation, conditioned on the text embedding. The loss function is as follows:
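The loss appears as a figure in the original article; for reference, the standard latent-diffusion training objective of this form (noise prediction with text condition c) is:

```latex
\mathcal{L}_{\mathrm{LDM}}
= \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, c\right) \right\rVert_2^2 \right]
```

Here z_t is the noised latent at diffusion step t, and ε_θ is the U-Net that predicts the added noise given the text condition c.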


This section describes Data-to-Audio generation, which is an application and generalization of ideas from this research to date. Data-to-Audio in this context refers to the following three types of data

  • Personalized Text-To-Audio Generation
  • Audio Inpainting
  • Visual-To-Audio Generation

Personalized Text-To-Audio Generation

More recently, "personalization" has been seen in the visual and graphics fields, allowing for personal and creative customization, such as incorporating unique objects.

This personalization is also incorporated in this research. For example, given "thunder" audio and the prompt "baby crying," the model can generate audio that sounds like "a baby crying on a thundery day." It can also be used extensively for mixing and editing, such as adding background sounds to existing audio or inserting new sound objects.

Audio Inpainting

Audio Inpainting is the task of restoring audio by reconstructing damaged portions of a digital audio signal.

Inpainting originally refers to restoring part of an image: reconstructing masked regions so that scratches, overlaid text, and so on can be removed naturally. This study extends the technique to audio.

To extend the technique to audio, this study treats the audio spectrogram as an image and applies inpainting to the spectrogram.

Specifically, Make-An-Audio is fine-tuned for Audio Inpainting, and irregular masks (thick, medium, and thin masks) are given to the audio spectrogram.

Masking here simulates the damaged portions of the audio that the model must learn to fill in.

Visual-To-Audio Generation

Recent deep generative models are advancing the above "generation of audio in line with the content of images and videos". Therefore, to further this research, Make-An-Audio is extended for "audio generation from visual information" in this study. Specifically, there are two ways

  • Image-to-Audio
  • Video-to-Audio

The Visual-To-Audio inference process for this study is as follows

Here, this study utilizes the following two ideas

  • Contrastive Learning (CLIP)
  • Text-to-Audio model with CLIP guide

CLIP is "cross-modal learning of images and text" using contrastive learning, and consists of an image encoder and a text encoder.


Using this technology, the Image-to-Audio for this study is the following procedure.

  1. An image is input
  2. The image passes through CLIP's image encoder
  3. Then through Make-An-Audio's Transformer
  4. Then through the cross-attention layers of the U-Net
  5. Then through Make-An-Audio's audio decoder
  6. Finally through Make-An-Audio's vocoder
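The data flow above can be sketched with stub functions. All shapes and function bodies here are illustrative assumptions, not the actual Make-An-Audio architecture; the point is the order in which the components are chained.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_image_encoder(image):
    # stub: map an image to a 512-d CLIP-style embedding
    return rng.normal(size=512)

def transformer(embedding):
    # stub: project the condition into tokens for cross-attention
    return embedding.reshape(1, 512)

def unet_diffusion(cond_tokens, steps=4):
    # stub: iterative denoising of a latent, conditioned via cross-attention
    z = rng.normal(size=(8, 16))
    for _ in range(steps):
        z = z - 0.1 * z            # placeholder denoising update
    return z

def audio_decoder(z):
    # stub: decode the latent to a mel spectrogram (80 bins x 32 frames)
    return np.tile(z, (10, 2))

def vocoder(mel):
    # stub: mel spectrogram -> raw waveform
    return mel.flatten()

image = rng.uniform(size=(224, 224, 3))
wave = vocoder(audio_decoder(unet_diffusion(transformer(clip_image_encoder(image)))))
```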


A video is a collection of image frames. Video-to-Audio therefore samples four frames uniformly from the video, mean-pools their CLIP image features, and feeds the result to Make-An-Audio's Transformer layer.
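The frame sampling and pooling step can be sketched as below. The per-frame features are toy vectors standing in for real CLIP outputs; only the uniform sampling and mean pooling follow the text.

```python
import numpy as np

def uniform_frame_indices(n_frames, k=4):
    # evenly spaced frame indices across the whole video
    return np.linspace(0, n_frames - 1, k).round().astype(int)

def video_condition(frame_features):
    """frame_features: (n_frames, d) per-frame CLIP features -> (d,) condition."""
    idx = uniform_frame_indices(len(frame_features))
    return frame_features[idx].mean(axis=0)

feats = np.arange(20, dtype=float).reshape(10, 2)  # toy 10-frame video, d=2
cond = video_condition(feats)                      # pooled condition vector
```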


This section presents the quantitative evaluation methodology for this study.

Evaluation metrics

In this study, the model is evaluated using objective and subjective measures of "audio quality" and "text-audio alignment accuracy."

The following indicators are used as objective indicators

  • Melception-based FID: automatic measure of generation quality
  • KL divergence: audio fidelity
  • CLAP score: accuracy of text-audio alignment
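A CLAP-style score is essentially a cosine similarity between text and audio embeddings in the shared contrastive space. A minimal sketch on toy vectors (not real CLAP outputs):

```python
import numpy as np

def clap_score(text_emb, audio_emb):
    """Cosine similarity between a text embedding and an audio embedding."""
    text_emb = np.asarray(text_emb, dtype=float)
    audio_emb = np.asarray(audio_emb, dtype=float)
    cos = text_emb @ audio_emb / (np.linalg.norm(text_emb) * np.linalg.norm(audio_emb))
    return float(cos)

score_match = clap_score([1.0, 0.0], [1.0, 0.0])     # aligned pair
score_mismatch = clap_score([1.0, 0.0], [0.0, 1.0])  # orthogonal pair
```

A higher score indicates that the generated audio better matches the text prompt.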

For the subjective evaluation, we crowdsourced subjects and administered a MOS (Mean Opinion Score) test on a 20-100 Likert scale. The following psychological experiments were conducted to calculate the metrics (with 95% confidence intervals):

  • MOS-Q (sound quality): subjects were instructed to focus on sound quality and naturalness when rating
  • MOS-F (text-audio consistency): subjects were shown the audio and the prompt, asked "Does the natural language description closely match the audio?", and answered "completely," "mostly," or "somewhat" on a 20-100 Likert scale

A screenshot of the instructions given to the subjects is shown below.

Text-to-Audio results

Here is a comparison in objective and subjective measures with Diffsound, the only publicly available Text-to-Audio baseline. The results are as follows

From these results, we can say the following three things

  • In terms of audio quality, Make-An-Audio achieved the best scores: FID 4.61 and KL 2.79
  • Make-An-Audio achieved the highest CLAP score for text-audio consistency
  • Subjects gave it the highest MOS-Q and MOS-F ratings, at 72.5 and 78.6 respectively

At this time, comparative experiments were also conducted with the following encoders

  • BERT
  • T5-Large
  • CLIP
  • CLAP

The text encoder weights for the generation are frozen. From the above results table, the following three considerations can be made.

  • CLIP is trained for text-image pairs and may not be useful for Text-to-Audio
  • CLAP and T5-Large achieve similar performance
  • CLAP does not require the LLM's offline computation of embeddings and may be more computationally efficient (with only 59% of the parameters)

Audio Inpainting

The Audio Inpainting evaluation includes the following six masking methods.

  • Thick masking (Irregular)
  • Medium masking (Irregular)
  • Thin masking (Irregular)
  • Masking 30% of data (Frame)
  • Masking 50% of data (Frame)
  • Masking 70% of data (Frame)

We also randomly masked wide and narrow areas and measured performance with the FID and KL metrics.
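The frame-based masking can be sketched as below: zero out a given fraction of time frames of the spectrogram. The irregular (thick/medium/thin) masks are not reproduced here; the ratios follow the text (30/50/70%).

```python
import numpy as np

def frame_mask(mel, ratio, rng):
    """Zero out `ratio` of the time frames of a (bins, frames) spectrogram."""
    masked = mel.copy()
    n_frames = mel.shape[1]
    n_masked = int(round(n_frames * ratio))
    idx = rng.choice(n_frames, size=n_masked, replace=False)  # frames to hide
    masked[:, idx] = 0.0
    return masked, idx

rng = np.random.default_rng(0)
mel = np.ones((80, 100))                 # toy spectrogram: 80 bins, 100 frames
masked, idx = frame_mask(mel, 0.5, rng)  # 50% frame masking
```

The inpainting model is then trained to reconstruct the original spectrogram from the masked version.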

The results are as follows

The results show that regardless of the masking method, the larger the area to be masked during learning, the better the accuracy.

For mask areas of similar size, the Frame base consistently outperforms the Irregular base.

An example of Audio Inpainting results is shown below.

The above figure shows, from top to bottom: the input audio with missing regions (Input), the audio inpainting result (Result), and the real, intact audio (GT). Given the damaged input, the model learns to reconstruct audio as close as possible to the intact ground truth GT.

As the results show, the audio is correctly filled in and reconstructed across the different shapes of masked regions.


The Visual-to-Audio results are as follows

The results show that it can be generalized successfully to images and videos.

Personalized Text-To-Audio Generation

In Personalized Text-To-Audio, we see a tradeoff between Faithfulness (text-to-caption consistency) and Realism (sound quality), as shown in the figure on the left below.

The figure on the right shows that as T increases, a large amount of noise is added to the original audio, making the generated samples more realistic but losing Faithfulness.

For comparison, Realism is measured by the 1-MSE distance between the generated and initial audio, while Faithfulness is measured by the CLAP score between the generated samples and the text prompt.

Future Outlook

In this article, we introduced Make-An-Audio, a prompt-enhanced diffusion model for Text-to-Audio generation. Make-An-Audio enables highly accurate audio generation, and this research will likely serve as a foundation for future audio synthesis work. It may also help reduce the labor required to create short videos and digital art.

The authors also note a limitation: latent diffusion models usually require substantial computational resources and may degrade as training data decreases. One future direction is therefore to develop a lightweight, fast diffusion model to speed up generation.
