[Moûsai] A Diffusion Model for High-Quality Music Generation from Text Input

Diffusion Model 04/10/2023

3 main points
✔️ diffusion model for generating music from text
✔️ can generate long, high-quality music in real time
✔️ introduces a new diffusion model called the diffusion magnitude autoencoder

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion
written by Flavio Schneider, Zhijing Jin, Bernhard Schölkopf
(Submitted on 27 Jan 2023 (v1), last revised 30 Jan 2023 (this version, v2))
Comments: Music samples for this paper: this https URL; all music samples for all models: this https URL; and codes: this https URL
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

This article describes Moûsai, a diffusion model that generates music from text prompts. For example, entering a prompt such as "African Drums, Rythm, (Deluxe Edition), 2 of 4" produces an African drum rhythm.

Moûsai was also used as the basis for Stable Audio, the audio counterpart of Stable Diffusion. It has also drawn attention because it uses a diffusion model, a class of deep generative model that has become a hot topic in recent years.

First, before we get into the details of Moûsai, let's look at some of the recent trends and issues in the field of music generation.

Trends in Music Generation in Recent Years

In recent years, generative AI models have emerged in the fields of image and text, such as Stable Diffusion and GPT. In this context, AI for music generation has also been receiving a lot of attention.

In the field of music generation AI, Transformer-based autoregressive models have been the mainstream, but music generation with diffusion models has recently become popular as well. This study follows that trend and uses a diffusion model.

To understand why this research matters, it helps to know the challenges specific to music generation. Let's review them first.

Music Generation Challenges

The following problems have been identified with music generation by generative AI:

Music longer than 1 minute cannot be generated
Sound quality is low
The generated music always feels the same (no diversity)

In other words, with conventional generative models, attempts to generate more than one minute of music either fall apart musically or suffer from low sound quality to begin with. The objective of this research is to solve these problems.

Moûsai vs. traditional music generation model

Next, let us look at the position of this study in the field of music generation. Below is a table comparing Moûsai with traditional music generation models.

The columns of this table are as follows:

Model: Model name
Sample Rate: Sampling rate of the generated audio (a higher rate means higher sound quality)
Ctx. Len.: Length (duration) of the generated music
Input: Type of input
Music (Diverse): Whether the model generates music, and whether that music is diverse across genres
Example: Example of generated music
Infer. Time: Time required for generation (inference time)
Data: Size of the dataset

The bottom row of this table is Moûsai. This row shows that Moûsai has the following characteristics:

Capable of generating music at a sample rate as high as 48 kHz
Capable of generating more than 1 minute of music
Capable of generating music in a variety of genres
Takes text as input

Taken together, these characteristics show that Moûsai addresses the problems in the field of music generation listed above. The next section explains how it does so.

Moûsai Model Structure

In this section, we will first look at the overall model of Moûsai and then discuss the individual components in detail.

Overall music generation process

The overall music generation process (inference process) of this study is as follows:

The top left of this figure shows the text prompt given as input. Once the text prompt has been entered, it passes through each step until audio is finally output at the bottom right. The music is generated in the following steps:

A text prompt is entered.
A prompt embedding is generated by T5 (a Transformer-based text encoder).
A latent variable (Latent) is generated from noise, conditioned on that embedding.
Audio is generated from noise, conditioned on the generated latent variable.

The components of this model are as follows:

TextEncoder (top)
DiffusionGenerator (middle)
DiffusionDecoder (bottom)

The authors group these components into the following two stages:

DiffusionGenerator+DiffusionDecoder=Diffusion Magnitude autoencoder (2nd stage)
TextEncoder+DiffusionGenerator=Latent text-to-audio Diffusion (1st stage)

From this we can say that Moûsai is a two-stage cascade model. The next sections look at each of these stages in more detail. Because the correspondences are somewhat complicated, each stage is explained in relation to the model as a whole.
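The two-stage cascade can be sketched in a few lines of Python. This is only an illustration of the data flow (text → embedding → latent → audio), not the authors' implementation: every function name, the shapes, the compression factor, and the toy "denoiser" (which stands in for a trained U-Net sampler) are assumptions made for this example.

```python
import numpy as np

# All names, shapes, and the toy denoiser below are hypothetical stand-ins.
EMBED_DIM = 8            # size of the toy text embedding
LATENT_SHAPE = (4, 16)   # (channels, frames) of the toy latent
COMPRESSION = 64         # assumed number of audio samples per latent frame


def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for the T5 text encoder: map a prompt to a fixed-size vector."""
    seed = sum(prompt.encode("utf-8")) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)


def toy_denoise(x: np.ndarray, target: np.ndarray, steps: int = 10) -> np.ndarray:
    """Placeholder for iterative denoising: move x halfway to `target` each step."""
    for _ in range(steps):
        x = x + 0.5 * (target - x)
    return x


def diffusion_generator(text_emb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stage 1: start from noise and produce a latent, conditioned on the text."""
    noise = rng.standard_normal(LATENT_SHAPE)
    target = np.full(LATENT_SHAPE, np.tanh(text_emb.mean()))
    return toy_denoise(noise, target)


def diffusion_decoder(latent: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stage 2: start from noise and produce a waveform, conditioned on the latent."""
    n_samples = latent.shape[1] * COMPRESSION
    noise = rng.standard_normal(n_samples)
    target = np.repeat(latent.mean(axis=0), COMPRESSION)
    return toy_denoise(noise, target)


def generate(prompt: str, seed: int = 0) -> np.ndarray:
    """Run the full cascade: prompt -> embedding -> latent -> audio."""
    rng = np.random.default_rng(seed)
    text_emb = encode_text(prompt)
    latent = diffusion_generator(text_emb, rng)
    return diffusion_decoder(latent, rng)


audio = generate("African Drums, Rythm, (Deluxe Edition), 2 of 4")
print(audio.shape)
```

The point of the sketch is that the expensive text-conditioned generation happens on a short latent sequence, and only the second stage works at full audio resolution.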

Diffusion Magnitude autoencoder (Stage 2)

First is the Diffusion Magnitude autoencoder. This is the final step in the overall process: the stage in which the DiffusionGenerator and DiffusionDecoder are combined to actually output audio. It corresponds to "Step 3 → Step 4" in the generation flow described above.

The Diffusion Magnitude autoencoder is an extended version of the diffusion autoencoder, a type of diffusion model that derives a latent variable from the data and performs denoising in the reverse diffusion process conditioned on that latent variable.

The training process for the Diffusion Magnitude autoencoder is as follows:

Here, the raw audio is transformed into a spectrogram by a short-time Fourier transform (STFT), and its magnitude is passed through a 1D convolutional encoder to obtain a latent variable. At the same time, the raw audio is noised by the diffusion process and denoised by a UNet to reconstruct the original audio. The latent variable obtained from the spectrogram is used as the condition for denoising in the reverse diffusion process.
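The two inputs of this training step can be illustrated with NumPy. The sketch below computes an STFT magnitude spectrogram and applies one forward-diffusion noising step to the waveform. The window size, hop length, sample rate, and the cosine/sine noise schedule are assumptions chosen for the example, and the 1D convolutional encoder and the UNet are omitted.

```python
import numpy as np


def stft_magnitude(wave: np.ndarray, n_fft: int = 256, hop: int = 64) -> np.ndarray:
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps the non-negative frequencies; transpose to (freq_bins, frames).
    return np.abs(np.fft.rfft(frames, axis=1)).T


def add_noise(wave: np.ndarray, t: float, rng: np.random.Generator):
    """Forward diffusion at noise level t in [0, 1] (assumed cos/sin schedule)."""
    alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    noise = rng.standard_normal(wave.shape)
    return alpha * wave + sigma * noise, noise


rng = np.random.default_rng(0)
sample_rate = 8000  # toy value, far below the 48 kHz used by Moûsai
wave = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)  # 1 s, 440 Hz

magnitude = stft_magnitude(wave)             # would feed the 1D conv encoder
noisy_wave, noise = add_noise(wave, 0.5, rng)  # would feed the UNet denoiser
print(magnitude.shape, noisy_wave.shape)
```

During training, the denoiser would receive `noisy_wave` together with the latent encoded from `magnitude` and be trained to recover the clean signal.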

The Diffusion Magnitude autoencoder corresponds to the blue box below in the overall Moûsai model.

This is the training process of the Diffusion Magnitude autoencoder, which corresponds to the second stage of Moûsai. At generation time, however, there is no input audio from which to compute the latent variable that conditions the denoising, so the latent variable itself must be generated by a diffusion model conditioned on the prompt text embedding. This is discussed in the next section.

Latent text-to-audio Diffusion (Stage 1)

Next is Latent text-to-audio Diffusion. This is the stage in which the TextEncoder and DiffusionGenerator are combined to obtain the latent variable that is passed on to the second stage. It corresponds to "Step 1 to Step 3" in the generation flow described above.

Latent text-to-audio Diffusion is an extended version of Latent Diffusion, a technique also used in Stable Diffusion. Specifically, Latent Diffusion is a type of diffusion model that uses a VAE to obtain latent variables of the data, and then applies the diffusion and reverse diffusion processes to those latent variables.

The training process for Latent text-to-audio Diffusion is as follows:

First, as before, the raw audio is converted to a spectrogram by STFT, and its magnitude is passed through the 1D convolutional encoder to obtain a latent variable. At the same time, the text prompt is passed through the Transformer-based T5 to create a text embedding.

The latent variable is then noised in the diffusion process and denoised by a UNet to reconstruct the original latent variable. The text embedding is used as the condition for denoising in the reverse diffusion process.
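A single training step of such a text-conditioned latent denoiser might look like the sketch below. A linear map stands in for the UNet, a noise-prediction loss and a cosine/sine schedule are assumed for simplicity (the paper's actual parameterization may differ), and the latent and text embedding are random placeholders rather than real encoder outputs.

```python
import numpy as np

LATENT_DIM, TEXT_DIM = 16, 8

# Hypothetical linear "denoiser" weights standing in for the UNet.
_init = np.random.default_rng(0)
W_latent = 0.1 * _init.standard_normal((LATENT_DIM, LATENT_DIM))
W_text = 0.1 * _init.standard_normal((TEXT_DIM, LATENT_DIM))


def predict_noise(noisy_latent: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Predict the added noise from the noisy latent and the text condition."""
    return noisy_latent @ W_latent + text_emb @ W_text


def training_loss(latent: np.ndarray, text_emb: np.ndarray,
                  rng: np.random.Generator) -> float:
    """One denoising step: noise the latent, predict the noise, return the MSE."""
    t = rng.uniform(0.0, 1.0)  # random noise level
    alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    noise = rng.standard_normal(latent.shape)
    noisy_latent = alpha * latent + sigma * noise
    predicted = predict_noise(noisy_latent, text_emb)
    return float(np.mean((predicted - noise) ** 2))


rng = np.random.default_rng(1)
latent = rng.standard_normal(LATENT_DIM)   # placeholder for the encoder output
text_emb = rng.standard_normal(TEXT_DIM)   # placeholder for the T5 embedding
loss = training_loss(latent, text_emb, rng)
print(f"toy denoising loss: {loss:.3f}")
```

The key difference from the second stage is that diffusion here runs on the compact latent rather than on the waveform, which is what makes long contexts tractable.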

Latent text-to-audio diffusion corresponds to the blue box below in Moûsai's overall model.

This is the training process of Latent text-to-audio Diffusion.

Evaluation experiment

By building the model described above, the authors address the problems inherent in music generation. But how did they quantitatively evaluate the music that Moûsai generates?

Here are some of the experiments the authors performed in this study.

Dataset

Before describing the experimental details of this study, we will first briefly touch on the data set used.

A total of 2500 hours of music data was used in this study (details of the music used are not available). In addition, texts corresponding to those music pieces are also used. Those texts consist of metadata such as song titles, artist names, genres, etc.

Assessing Diversity and Text Relevance

First, the authors conducted a listening experiment with three subjects to quantitatively evaluate the diversity of the music generated by Moûsai and the relevance between the text and the generated songs.

In the experiment, two models, Moûsai and Riffusion, each generated music in four genres from the same prompts. The following prompts were used.

Subjects were asked to listen to the generated music and classify each song into one of the four genres. The table below shows how often subjects correctly identified the genre of the music generated by each model.

The Moûsai result on the left has higher values on the diagonal of the confusion matrix, which means that the music generated by Moûsai was more often classified into the intended genre. The Riffusion result on the right, on the other hand, shows that most songs were perceived as Pop.

In other words, Moûsai generates music that is recognizably in the requested genre, whereas Riffusion's output tends to sound like Pop no matter which genre is requested.

This shows that Moûsai generates music that better reflects the prompt and the genre.
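Reading such a confusion matrix comes down to comparing the diagonal with the total count. The sketch below shows the calculation with NumPy. The two matrices are illustrative placeholders, not the paper's results, and the row/column convention (rows are intended genres, columns are the genres chosen by listeners) is an assumption.

```python
import numpy as np


def genre_accuracy(confusion) -> float:
    """Fraction of judgments on the diagonal (intended genre == perceived genre)."""
    confusion = np.asarray(confusion, dtype=float)
    return float(np.trace(confusion) / confusion.sum())


# Illustrative placeholder counts only, NOT the values reported in the paper.
# Rows: intended genre, columns: genre chosen by the listener.
well_separated = [
    [4, 1, 0, 0],
    [0, 5, 0, 0],
    [1, 0, 4, 0],
    [0, 0, 0, 5],
]
collapsed_to_one_genre = [
    [0, 0, 0, 5],
    [0, 0, 0, 5],
    [0, 0, 0, 5],
    [0, 0, 0, 5],
]

print(genre_accuracy(well_separated))
print(genre_accuracy(collapsed_to_one_genre))
```

A model whose output is always heard as the same genre cannot score above chance (one in four here), which is the pattern the article describes for Riffusion.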

Sound quality evaluation

To assess the sound quality of Moûsai's output, the authors then compared the mel-spectrogram of real music data with the mel-spectrogram of music generated by Moûsai. The results are shown below.

The top row is the mel-spectrogram of real music data and the bottom row is the mel-spectrogram of music generated by Moûsai. The results show that Moûsai's mel-spectrogram closely resembles the real one.

This suggests that Moûsai can generate music whose quality is comparable to that of real music.

Summary

In this article, I explained Moûsai, a diffusion model for music generation. Although there are a number of other AI models that generate music from text, I felt that Moûsai's quality is among the best. I have also heard that this paper came out of a master's thesis, which makes its quality all the more impressive.

However, as noted in the paper's Future Work, I also felt that conditioning on inputs other than text (e.g., music generation from humming) would be useful.

Finally, the source code for this study is publicly available, so those who are interested are encouraged to try it.