[Mustango] Music Generation Model Utilizing Domain Knowledge Of Music

3 main points
✔️ Proposes a Text-to-Music model called Mustango that leverages music domain knowledge
✔️ Introduces a music-specific UNet called MuNet
✔️ Uses a dataset of over 52,000 samples, augmented by leveraging music knowledge

Mustango: Toward Controllable Text-to-Music Generation
written by Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, Soujanya Poria
(Submitted on 16 Mar 2024)
Comments: NAACL 2024

Subjects: Audio and Speech Processing (eess.AS)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Mustango: a music generation model that leverages domain knowledge

This paper, "Mustango: Toward Controllable Text-to-Music Generation", proposes a Text-to-Music model called Mustango that uses music domain knowledge to generate music from text instructions.

Experiments in this study show that Mustango significantly outperforms other models at generating music under text guidance.

Mustango is a generative model grounded in music theory, with the potential to expand the range of creative activities.

Mustango Model Structure

Recent developments in diffusion models have greatly improved the performance of Text-to-Music generation models.

However, existing models offer little fine-grained control over musical aspects of the generated music, such as tempo, chord progression, and key.

In this study, Mustango is proposed as a model that can control these musical aspects during generation. Specifically, the model accepts not only general textual instructions but also text prompts that describe musical elements such as:

  • chord progression
  • beat
  • tempo
  • key

Mustango's model structure is as follows.

The basic structure is a Latent Diffusion Model, which converts an audio waveform → mel-spectrogram → latent representation (compressed by a VAE) and applies the diffusion model to the latent representation. In this research, MuNet, a version of UNet specialized for music generation, is used as the diffusion model.

At generation time, after denoising by MuNet, music is produced by converting the latent representation → mel-spectrogram (reconstructed by the VAE) → audio waveform (reconstructed by HiFi-GAN).
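The two data paths described above can be summarized in a small sketch. The stage functions here are hypothetical placeholders that only record which transformation was applied; they are not Mustango's actual modules, and the stage names are chosen for illustration.

```python
# Hypothetical placeholders: each stage only records its name so that the
# order of operations is visible. Real stages would be neural networks.
def make_stage(name):
    def stage(trace):
        return trace + [name]
    return stage

# Encoding path: audio is compressed into the latent space
# in which the diffusion model operates.
ENCODE_PATH = [
    make_stage("waveform_to_mel"),
    make_stage("vae_encode"),        # mel-spectrogram -> latent
]

# Generation path: denoise a latent, then decode it back to audio.
GENERATE_PATH = [
    make_stage("munet_denoise"),     # conditioned denoising in latent space
    make_stage("vae_decode"),        # latent -> mel-spectrogram
    make_stage("hifigan_vocoder"),   # mel-spectrogram -> waveform
]

def run(path):
    trace = []
    for stage in path:
        trace = stage(trace)
    return trace

print(run(ENCODE_PATH))
print(run(GENERATE_PATH))
```

Working in the compressed latent space rather than on raw waveforms is what keeps the diffusion step tractable; the vocoder is only needed once, at the very end.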

MuNet Conditioning

As mentioned earlier, MuNet is a music-specific variant of UNet that serves as the denoiser in the diffusion model. MuNet is conditioned by the following procedure.

  1. Obtain an embedding of the input text with a text encoder (FLAN-T5)
  2. Extract beat and chord features using the beat and chord encoders
  3. Integrate the text embedding, beat features, and chord features via cross-attention, in that order

The beat encoder (DeBERTa Large) encodes beat counts and beat intervals from text prompts.

The chord encoder (FLAN-T5 Large) encodes chord progressions from text prompts and beat information.
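The ordering in Step 3 can be illustrated with a toy example. This is a minimal sketch, assuming single-head cross-attention with identity projections and a residual connection; the real MuNet presumably uses learned projections and multiple heads, and the function names below are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head cross-attention with identity projections (a simplification)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every context vector.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        # Residual connection: add the attended context to the query.
        out.append([x + y for x, y in zip(q, attended)])
    return out

def condition(latent, text_emb, beat_emb, chord_emb):
    # Order taken from the procedure above: text, then beat, then chord.
    for context in (text_emb, beat_emb, chord_emb):
        latent = cross_attention(latent, context)
    return latent

latent = [[0.0, 0.0], [1.0, 0.0]]
text_emb = [[1.0, 0.0]]
beat_emb = [[0.0, 1.0]]
chord_emb = [[1.0, 1.0]]
result = condition(latent, text_emb, beat_emb, chord_emb)
print(result)  # [[2.0, 2.0], [3.0, 2.0]]
```

Because each conditioning signal here has a single vector, every attention weight is 1 and the output is simply the latent plus the three condition vectors, which makes the sequential structure easy to verify by hand.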

Construction of a large dataset "MusicBench"

In the field of music generation from textual instructions, the shortage of paired "text-music" datasets has also been a problem. For example, MusicCaps, a benchmark dataset frequently used in music generation in recent years, contains only about 5,000 samples.

This shortage of data hinders further improvement of music generation models.

To compensate, this study constructs a large dataset, MusicBench, by applying its own data augmentation method to MusicCaps.

Specifically, MusicBench is built from 5,479 MusicCaps samples using the following procedure.

  1. Split MusicCaps into TrainA and TestA
  2. Extract beat, chord, key, and tempo information from the TrainA and TestA music data
  3. Create TrainB and TestB by adding text describing the musical features from Step 2 to the captions of TrainA and TestA
  4. Paraphrase the TrainB captions with ChatGPT to create TrainC
  5. Extract 3,413 samples from TrainA by excluding samples with low sound quality
  6. Augment the music data from Step 5 by changing pitch, tempo, and volume, generating 37,000 samples
  7. Add 0-4 random caption sentences to each sample from Step 6
  8. Paraphrase the captions from Step 7 with ChatGPT
  9. Combine TrainA, TrainB, TrainC, and the augmented data from Steps 5-8
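Steps 3 and 7 can be sketched as simple caption manipulation. This is an illustrative sketch only: the sentence templates, the feature dictionary keys, and the assumption that the 0-4 random sentences in Step 7 are drawn from the four feature descriptions are all hypothetical and not taken from the paper.

```python
import random

def feature_sentences(features):
    # Hypothetical templates; the wording used for MusicBench may differ.
    return [
        f"The tempo is {features['tempo_bpm']} BPM.",
        f"The key is {features['key']}.",
        f"The chord progression is {', '.join(features['chords'])}.",
        f"The beat count is {features['beats_per_bar']}.",
    ]

def enrich_caption(caption, features):
    """Step 3: append every feature sentence to the caption."""
    return " ".join([caption] + feature_sentences(features))

def enrich_caption_randomly(caption, features, rng):
    """Step 7: append a random subset of 0 to 4 feature sentences."""
    sentences = feature_sentences(features)
    k = rng.randint(0, len(sentences))  # inclusive on both ends
    return " ".join([caption] + rng.sample(sentences, k))

features = {"tempo_bpm": 120, "key": "A minor",
            "chords": ["Am", "F", "C", "G"], "beats_per_bar": 4}
rng = random.Random(0)
print(enrich_caption("A mellow acoustic guitar piece.", features))
print(enrich_caption_randomly("A mellow acoustic guitar piece.", features, rng))
```

Varying how many feature sentences appear in a caption would expose a model to prompts with and without explicit musical controls, which is one plausible motivation for the random count.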

These steps yield MusicBench, a large dataset whose final training set contains 52,768 samples (11 times larger than MusicCaps).

Incidentally, the paraphrase using ChatGPT uses the following prompts.

Music feature extraction model

In Step 2 above, four kinds of musical features (beats and downbeats, chords, key, and tempo) are extracted from the music data and appended to the existing text prompts.

A model called BeatNet is used to extract beats and downbeats.

Tempo (BPM) is estimated by averaging the reciprocal of the time interval between beats.
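The tempo estimate can be written out in a few lines. This is a minimal sketch, assuming beat timestamps are given in seconds (so each reciprocal is multiplied by 60 to convert beats per second into beats per minute); the function name is hypothetical.

```python
def estimate_bpm(beat_times):
    """Estimate tempo from beat timestamps given in seconds."""
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if not intervals:
        raise ValueError("need at least two beats")
    # The reciprocal of each inter-beat interval is a rate in beats per
    # second; multiply by 60 for beats per minute, then average.
    rates = [60.0 / dt for dt in intervals]
    return sum(rates) / len(rates)

beats = [0.0, 0.5, 1.0, 1.5, 2.0]
print(estimate_bpm(beats))  # 120.0
```

Note that averaging the reciprocals is not the same as taking the reciprocal of the average interval; the two agree only when the beats are evenly spaced, as in this example.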

Chord progressions are extracted using a model called Chordino, and keys are extracted using Essentia's KeyExtractor algorithm.

Augmentation method for music data and text data

In Step 6 above, data augmentation is applied to the music data by changing pitch, tempo, and volume. These three characteristics are changed as follows.

  • Shift the pitch within ±3 semitones using PyRubberband
  • Change the tempo by ±5-25%
  • Apply gradual volume changes (both crescendo and decrescendo)
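The three augmentations listed above can be sketched as parameter sampling plus a gain ramp. This is an illustration under stated assumptions: whole-semitone shifts, uniform sampling of the tempo change, and the 0.3 to 1.0 gain range are choices made here, not values from the paper. The pitch and tempo changes themselves would be applied to the waveform by a tool such as PyRubberband and are omitted.

```python
import random

def sample_augmentation(rng):
    # Pitch shift: assumed to be a whole number of semitones in [-3, 3],
    # excluding 0 so that the audio actually changes.
    semitones = rng.choice([-3, -2, -1, 1, 2, 3])
    # Tempo change: 5-25% faster or slower, expressed as a speed ratio.
    percent = rng.uniform(5.0, 25.0)
    rate = 1.0 + rng.choice([-1, 1]) * percent / 100.0
    # Volume: gradual increase or decrease over the clip.
    direction = rng.choice(["crescendo", "decrescendo"])
    return {"semitones": semitones, "rate": rate, "volume": direction}

def apply_volume_ramp(samples, direction, low=0.3, high=1.0):
    """Linear gain ramp; the low/high gain values are illustrative assumptions."""
    n = len(samples)
    if n < 2:
        return list(samples)
    out = []
    for i, s in enumerate(samples):
        t = i / (n - 1)  # 0.0 at the start of the clip, 1.0 at the end
        if direction == "crescendo":
            gain = low + (high - low) * t
        else:
            gain = high - (high - low) * t
        out.append(s * gain)
    return out

params = sample_augmentation(random.Random(1))
print(params)
print(apply_volume_ramp([1.0, 1.0, 1.0, 1.0, 1.0], params["volume"]))
```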

The text prompts that accompany the augmented music data are also updated to match the augmented audio.
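Keeping the text consistent with the audio amounts to recomputing the musical metadata after each transformation. The sketch below assumes keys are written as a sharp-based tonic plus a mode (for example "A minor") and that tempo is rounded to a whole BPM; both conventions and all function names are hypothetical.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose_key(key, semitones):
    """Transpose a key such as 'A minor' by a number of semitones."""
    tonic, mode = key.split(" ", 1)
    index = (PITCH_CLASSES.index(tonic) + semitones) % 12
    return f"{PITCH_CLASSES[index]} {mode}"

def scale_tempo(bpm, rate):
    """A speed ratio of 1.10 makes the music 10% faster."""
    return round(bpm * rate)

def update_metadata(meta, semitones, rate):
    return {"key": transpose_key(meta["key"], semitones),
            "tempo_bpm": scale_tempo(meta["tempo_bpm"], rate)}

print(update_metadata({"key": "A minor", "tempo_bpm": 120}, 3, 1.10))
# {'key': 'C minor', 'tempo_bpm': 132}
```

Chord symbols in the caption would need the same transposition as the key, and flat spellings such as "Bb" are not handled by this simplified table.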

Effectiveness of this method

To validate the quality of the music generated by Mustango and the validity of the MusicBench dataset, evaluations were conducted using both objective and subjective measures.

Evaluation in Objective Indicators

The objective evaluation uses Fréchet Distance (FD), Fréchet Audio Distance (FAD), and KL divergence to assess the quality of the generated music.
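For intuition, FD and FAD are typically computed by fitting a Gaussian to embeddings of reference audio and of generated audio and measuring the Fréchet distance between the two Gaussians, ||μ1 - μ2||² + Tr(Σ1 + Σ2 - 2(Σ1Σ2)^(1/2)); the two metrics differ in the embedding model used. KL divergence is typically computed between class-probability outputs of an audio classifier. The sketch below is a simplification that assumes diagonal covariances, in which case the trace term reduces to a sum of squared differences of standard deviations; it is not the evaluation code used in the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def frechet_distance_diagonal(mean1, std1, mean2, std2):
    """Fréchet distance between two Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mean1, mean2))
    # For diagonal covariances the trace term equals sum((s1 - s2)^2).
    cov_term = sum((a - b) ** 2 for a, b in zip(std1, std2))
    return mean_term + cov_term

print(frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))  # 25.0
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

For all three metrics, lower values indicate that the generated audio is statistically closer to the reference audio.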

For the evaluation, test data from TestA, TestB, and FMACaps were used.

The results are as follows

The fact that the Tango model trained on MusicCaps is inferior to the other models demonstrates the effectiveness of MusicBench. We also see that the pre-trained Tango and Mustango models fine-tuned on MusicBench perform comparably on FD and KL, but Mustango performs significantly better on FAD.

In addition, Mustango outperforms MusicGen and AudioLDM2 in FAD and KL on all test sets.

In addition to this evaluation, nine metrics were defined for musical characteristics such as tempo, key, chord, and beat, to measure whether the generated music expresses these characteristics as directed by the text.

For the evaluation, TestB and FMACaps test data were used.

The results are as follows

On TestB, all models except MusicGen perform comparably on Tempo, and all models perform similarly on Beat.

As for Key, the models trained on MusicBench significantly outperform the models trained on MusicCaps. Among them, Mustango outperforms all other models on TestB and ranks second on FMACaps.

With respect to Chord, Mustango significantly outperforms all other models.

These results indicate that Mustango is the most effective model for controlling chord progressions.

Evaluation in subjective indicators

For the subjective evaluation, a survey was conducted with general listeners and experts (with at least five years of music education).

The first round compares Mustango with Tango, and the second round compares Mustango with MusicGen and AudioLDM2.

The results are as follows

In the first round, Tango trained with MusicCaps was inferior to the model trained with MusicBench on all measures, indicating the effectiveness of MusicBench.

Across both rounds, we also find that Mustango performs best on many measures.


In this article, we introduced research on Mustango, a music generation AI that leverages music domain knowledge.

One limitation of this study is that the current Mustango can generate at most 10 seconds of music due to computational resource constraints.

The authors also note that the current Mustango primarily handles Western musical forms and has little ability to generate music from other cultures.

Therefore, as future research, they plan to "generate longer music" and to "apply it to more diverse musical genres, such as non-Western music".

Personal Opinion

Although Mustango achieves SOTA on many metrics, my impression is that it is still inferior to other models in some respects.

Nevertheless, the MusicBench dataset constructed in this study appears to have been shown to be valid, so it should serve as a useful benchmark for future studies.

In this regard, I believe this study makes great strides toward resolving the data shortage in the music generation field.
