[MusicLDM] Text-to-Music Model With Low Risk Of Plagiarism

Diffusion Model 22/01/2024

3 main points
✔️ Music Generation Model Leveraging Contrastive Learning and Latent Diffusion Models
✔️ Applying the AudioLDM audio generation model architecture to the music field
✔️ Introducing a data augmentation strategy to reduce plagiarism risk

MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies
written by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov
(Submitted on 3 Aug 2023)
Comments: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

This study proposes MusicLDM, a diffusion-based Text-to-Music model. Music generated by MusicLDM can be heard on the official project page below.

Source: MusicLDM project page

Let's begin by looking at the background of this study.

Research Background

Text-conditioned generation tasks have attracted much attention in recent years and have been applied to a variety of modalities, including Text-to-Image, Text-to-Video, and Text-to-Audio. In Text-to-Audio in particular, models such as "AudioGen," "AudioLDM," and "Make-an-Audio," several of them diffusion-based, have attracted attention for their high generation quality.

Text-to-Music models using diffusion models have also been actively studied in recent years, and a number of high-performance models have been published.

Challenges specific to music generation

However, the Text-to-Music field faces two main music-specific challenges:

Lack of music-text pair data
Risk of unintentional plagiarism of AI-generated music

Compared to other modalities such as Text-to-Image, the available Text-Music pair data is relatively scarce, making it difficult to train high-quality conditional models. In addition, because music involves many different concepts, such as "melody," "harmony," "rhythm," and "timbre," a large and diverse training set is especially needed that reflects these concepts well.

An additional concern associated with Text-to-Music generation is the risk of plagiarism or lack of novelty in the generated output.

This is because music is often protected by copyright, and generating new music that is too similar to existing music can lead to legal problems. It is therefore important to develop a Text-to-Music model that can generate diverse, novel music while avoiding plagiarism, even when trained on a relatively small dataset.

Solving challenges with unique data augmentation strategies

To address these challenges, this study proposes two new mixup strategies designed specifically for music generation:

Beat-Synchronous Audio Mixup (BAM)
Beat-Synchronous Latent Mixup (BLM)

Each first analyzes the music data used for training, aligns the beats, and then either interpolates the audio directly (BAM) or encodes it and then interpolates in latent space (BLM).

The model is then trained on the augmented training data. A pre-trained CLAP model is then used to test the generated music for plagiarism and novelty.

Experiments show that these mixup augmentation strategies significantly reduce the risk of plagiarism in the generated output. In addition, the mixup not only preserves the correspondence between music and text but also improves the overall quality of the generated audio.

MusicLDM Model Structure

First, let's look at the MusicLDM architecture.

This model is an architecture built by adapting the "Stable Diffusion" architecture for image generation and the "AudioLDM" architecture for audio generation to the music domain.

Specifically, it consists of the following modules:

"U-Net," which plays the role of the latent diffusion model
"VAE," which compresses the input audio into a latent representation and converts latent representations back to audio
"Hifi-GAN," which converts mel spectrograms to audio waveforms
"CLAP," a contrastive audio-text model used to generate embeddings

In training, STFT and a mel filter bank (MelFB) are first applied to the input audio waveform $x$ to convert it into a mel spectrogram. The mel spectrogram is treated as image data and passed through the VAE encoder to compute a latent representation of the audio. The diffusion model is then applied by feeding that latent representation into the U-Net.
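The waveform-to-mel-spectrogram front end described above can be sketched with NumPy alone. This is only an illustration under stated assumptions, not MusicLDM's actual preprocessing: the FFT size, hop size, number of mel bands, and the HTK-style mel formula are all choices made for this example, and a synthetic 440 Hz tone stands in for music.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other mel formulas exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(x, sr=16000, n_fft=1024, hop=160, n_mels=64):
    """STFT (Hann window) -> power spectrum -> mel filter bank -> log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)
    return np.log(mel + 1e-10)

# One second of a 440 Hz tone at 16 kHz as a stand-in for a music clip.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
mel_spec = log_mel_spectrogram(x, sr=sr)
print(mel_spec.shape)
```

The resulting 2-D array (time frames by mel bands) is what would be treated as an image and handed to the VAE encoder.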

During training, the input audio or text is embedded by CLAP, and the embedding is passed to the U-Net as the condition. During inference, only text is used as input.

Retraining of each module

CLAP, which this model also uses, is pre-trained on paired text-audio datasets dominated by sound events, sound effects, and natural sounds. It is therefore retrained on a paired dataset of text and music to improve the correspondence between music and its text.

In addition, the Hifi-GAN vocoder is retrained on music data to achieve high-quality conversion from mel spectrograms to music waveforms.

Improved AudioLDM Conditioning

In AudioLDM, on which this model is based, the model is given only audio embeddings as the condition during training. Such Audio-to-Audio training is essentially an approximation of Text-to-Audio generation.

However, although CLAP is trained to learn text and audio embeddings jointly, it does not explicitly enforce that the two kinds of embeddings are similarly distributed in the latent space. As a result, it is difficult to generate coherent Text-to-Audio output with Audio-to-Audio training alone.

The problem becomes more acute when the available text-music pair data is limited. In other words, relying solely on audio-embedding conditioning means ignoring the available text data and not using the full potential of the dataset.

Therefore, this study implements the following two approaches:

Text-to-Audio training, which uses text embeddings as the condition during training
Audio-to-Audio training followed by fine-tuning for generation conditioned on text embeddings

Data augmentation strategies to avoid plagiarism problems

As mentioned earlier, this study uses its own data augmentation technique to address both the scarcity of music-text pair data and the risk of plagiarism in the generated music.

The strategy is to mix two songs $x_1$ and $x_2$ in a certain ratio, as shown in the middle figure above.

Before mixing, the songs are analyzed with Beat Transformer and grouped so that songs with the same tempo (beats per minute) end up together, as shown on the left in the figure above. This avoids the chaotic result that mixing two songs with different tempos would produce.

The starting positions of the two songs are then aligned by comparing their downbeat maps, and one of the following two mixup strategies is applied:

Beat-Synchronous Audio Mixup (BAM)
Beat-Synchronous Latent Mixup (BLM)
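The tempo grouping and downbeat alignment that precede either strategy can be sketched roughly as follows. This is a simplified illustration, not the paper's pipeline: the track dictionaries, the `bin_width` used to decide that two tempos are "the same," and the choice to align on the first downbeat only are all assumptions, and the tempo and downbeat values are hand-written here rather than estimated by Beat Transformer.

```python
import numpy as np
from collections import defaultdict

def group_by_tempo(tracks, bin_width=10.0):
    """Bucket tracks whose tempo (BPM) falls into the same bin (hypothetical rule)."""
    groups = defaultdict(list)
    for track in tracks:
        groups[int(track["bpm"] // bin_width)].append(track)
    return groups

def align_to_downbeat(track_a, track_b, sr):
    """Trim both waveforms to start at their first downbeat and share one length."""
    start_a = int(round(track_a["downbeats"][0] * sr))
    start_b = int(round(track_b["downbeats"][0] * sr))
    a = track_a["audio"][start_a:]
    b = track_b["audio"][start_b:]
    n = min(len(a), len(b))
    return a[:n], b[:n]

# Toy data: 10-second clips at a low sampling rate, with made-up beat annotations.
sr = 100
rng = np.random.default_rng(0)
tracks = [
    {"name": "a", "bpm": 120.0, "downbeats": [0.5, 2.5], "audio": rng.standard_normal(1000)},
    {"name": "b", "bpm": 124.0, "downbeats": [0.2, 2.1], "audio": rng.standard_normal(1000)},
    {"name": "c", "bpm": 90.0, "downbeats": [0.0, 2.7], "audio": rng.standard_normal(1000)},
]

groups = group_by_tempo(tracks)
same_tempo = groups[12]  # tracks "a" and "b" fall in the 120-129 BPM bin
x1, x2 = align_to_downbeat(same_tempo[0], same_tempo[1], sr)
print(len(x1), len(x2))
```

Only pairs drawn from the same tempo group, trimmed this way, would be passed on to the mixing step.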

Beat-Synchronous Audio Mixup (BAM)

BAM generates new song data $x$ using song $x_1$ and song $x_2$ according to the following formula.

$x=\lambda x_1+(1-\lambda) x_2$

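Here, $\lambda$ is sampled randomly from $\mathrm{Beta}(5, 5)$, a symmetric distribution centered on 0.5, so the two songs are usually mixed in roughly comparable proportions.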
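As a minimal sketch of the BAM formula, the following NumPy snippet mixes two waveforms with a ratio drawn from $\mathrm{Beta}(5, 5)$. It assumes the two inputs are already tempo-matched, downbeat-aligned, and of equal length; the function name and the synthetic sine tones are illustrative only.

```python
import numpy as np

def beat_synchronous_audio_mixup(x1, x2, rng, alpha=5.0, beta=5.0):
    """Return x = lam * x1 + (1 - lam) * x2 with lam ~ Beta(alpha, beta)."""
    lam = rng.beta(alpha, beta)
    return lam * x1 + (1.0 - lam) * x2, lam

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
x1 = np.sin(2 * np.pi * 220.0 * t)  # stand-in for song 1
x2 = np.sin(2 * np.pi * 330.0 * t)  # stand-in for song 2

x, lam = beat_synchronous_audio_mixup(x1, x2, rng)
print(x.shape, 0.0 < lam < 1.0)
```

Because the mix is a convex combination, the output stays within the amplitude range of the two inputs.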

Beat-Synchronous Latent Mixup (BLM)

BLM is similar to BAM, but it mixes the latent variables of the songs $x_1$ and $x_2$ rather than the audio itself. Specifically, the songs $x_1$ and $x_2$ are transformed into latent variables $y_1$ and $y_2$ by the VAE encoder. The two latent variables are then combined into a new latent variable $y$ according to the following formula.

$y=\lambda y_1+(1-\lambda) y_2$

The $y$ thus generated is passed through a VAE decoder to convert it to a mel spectrogram, which is then passed through Hifi-GAN to generate new song data $x$.
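The BLM steps can be illustrated with a toy stand-in for the VAE. In the sketch below, `toy_encode` and `toy_decode` are hypothetical linear maps, not MusicLDM's VAE, and the Hifi-GAN vocoder step is omitted entirely; the only point is to show where the interpolation happens.

```python
import numpy as np

rng = np.random.default_rng(0)
MEL_DIM, LATENT_DIM = 64, 8

# Hypothetical linear "VAE": a fixed random projection and its transpose.
W = rng.standard_normal((MEL_DIM, LATENT_DIM)) / np.sqrt(MEL_DIM)

def toy_encode(mel):
    """Stand-in for the VAE encoder: (frames, MEL_DIM) -> (frames, LATENT_DIM)."""
    return mel @ W

def toy_decode(latent):
    """Stand-in for the VAE decoder: (frames, LATENT_DIM) -> (frames, MEL_DIM)."""
    return latent @ W.T

def beat_synchronous_latent_mixup(mel1, mel2, rng, alpha=5.0, beta=5.0):
    """Encode both inputs, interpolate the latents, and decode the result."""
    lam = rng.beta(alpha, beta)
    y1, y2 = toy_encode(mel1), toy_encode(mel2)
    y = lam * y1 + (1.0 - lam) * y2
    return toy_decode(y), y, lam

# Random arrays stand in for two beat-aligned mel spectrograms.
mel1 = rng.standard_normal((100, MEL_DIM))
mel2 = rng.standard_normal((100, MEL_DIM))
mixed_mel, y, lam = beat_synchronous_latent_mixup(mel1, mel2, rng)
print(mixed_mel.shape, y.shape)
```

With a linear encoder and decoder, mixing latents gives the same result as mixing the inputs first and then passing the mix through the model; the two only differ when the encoder and decoder are nonlinear, as in a real VAE.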

Difference between BAM and BLM

The right side of the above figure shows how BAM and BLM interpolate in the feature space of audio signals. In this feature space, "●" represents the feature points of music data and "△" represents the feature points of other audio signals such as natural sounds, speech, and noise. In the VAE pre-training process, a latent space is constructed to encode and decode the music data.

The goal of the VAE is to map the original feature space onto a low-dimensional manifold by learning the distribution of latent variables that best represents the original data. This manifold is designed to capture the underlying structure of the music data.

Therefore, any feature point within this manifold is considered a valid representation of music.

As shown above right, BAM linearly combines two points in audio space to form a new point on the red line. BLM, represented by the blue line, performs a similar operation, but the new point is formed in the VAE latent space and is then decoded onto the music manifold in audio space.

Pros and Cons of BAM and BLM

BAM and BLM each have their advantages and disadvantages.

BAM applies mixup in the original feature space and achieves smooth interpolation between feature points, but it cannot guarantee that the resulting samples are reasonable music samples lying within the music manifold.

BLM, by contrast, augments the data within the music manifold and produces robust and diverse latent representations. However, BLM is computationally expensive because it requires the VAE decoder and Hifi-GAN to convert the latent features back into audio. Furthermore, BLM may be ineffective if the VAE's latent space is poorly defined or if other latent features are present in it.

Experiment

Generative Capacity Results

The quality of the music generated by MusicLDM was evaluated using Frechet distance (FD), inception score (IS), and Kullback-Leibler divergence (KL).

FD measures the similarity between the generated music and the target, computed on embeddings from the audio embedding models VGGish and PANN. IS measures the diversity and quality of the generated music, and KL evaluates the average similarity between individual generated samples and real music.
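For reference, the standard Frechet distance between two sets of embeddings fits a Gaussian to each set and computes $\|\mu_a-\mu_b\|^2 + \mathrm{Tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2})$. The sketch below implements that textbook formula in NumPy; it is not the paper's evaluation code, and random vectors stand in for the VGGish or PANN embeddings, which are not computed here.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, dim).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_emb = rng.standard_normal((200, 4))             # stand-in for real-music embeddings
shifted_emb = real_emb + np.array([2.0, 0.0, 0.0, 0.0])  # same spread, shifted mean

fd_same = frechet_distance(real_emb, real_emb)
fd_shifted = frechet_distance(real_emb, shifted_emb)
print(fd_same, fd_shifted)
```

In this toy case the two sets share a covariance, so the distance reduces to the squared distance between the means. Lower FD means the generated distribution is closer to the target distribution.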

Across all metrics, MusicLDM performs better than other baseline models.

Text-music consistency + effectiveness of data augmentation strategies

The text and music consistency test computes the inner product between the true text embedding obtained from the test set and the audio embedding obtained from the music generated by the model. The text and audio embeddings are calculated by the CLAP model.
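The consistency score can be sketched as an average of paired dot products. In this illustration the embeddings are small hand-written vectors rather than CLAP outputs, and both the L2 normalization (which turns the dot product into cosine similarity) and the averaging over the test set are assumptions of the sketch.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def text_audio_similarity(text_emb, audio_emb):
    """Mean dot product between each text embedding and the embedding of
    the audio generated from that text. Both arrays have shape (n, dim)."""
    t = l2_normalize(text_emb)
    a = l2_normalize(audio_emb)
    return float(np.mean(np.sum(t * a, axis=1)))

# Stand-ins for CLAP embeddings of 3 captions and the 3 generated clips.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
audio_emb_good = text_emb * 3.0   # same directions as the text embeddings
audio_emb_bad = -text_emb         # opposite directions

print(text_audio_similarity(text_emb, audio_emb_good))
print(text_audio_similarity(text_emb, audio_emb_bad))
```

A higher score indicates that the generated music sits closer to its caption in the shared embedding space.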

The effectiveness test of the data augmentation strategy measures the extent to which the model copies samples directly from the training set.

First, the dot product between the audio embedding of each generated output and every audio embedding in the training set is computed, and the maximum value is returned, i.e., the similarity to the nearest neighbor in the training set.

Next, the percentage of generated outputs whose nearest-neighbor similarity is greater than or equal to a threshold is calculated. This is called the nearest-neighbor audio similarity ratio, written $SIM_{AA}$@90 for a threshold of 0.9 and $SIM_{AA}$@95 for a threshold of 0.95. The lower this ratio, the lower the risk of plagiarism.
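The two steps above, nearest-neighbor similarity followed by thresholding, can be sketched as follows. Hand-written vectors stand in for CLAP audio embeddings, and normalizing the embeddings to unit length before taking dot products is an assumption of this sketch.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def nearest_neighbor_similarity(gen_emb, train_emb):
    """For each generated embedding, the maximum dot product with any
    training-set embedding. Shapes: (n_gen, dim) and (n_train, dim)."""
    sims = l2_normalize(gen_emb) @ l2_normalize(train_emb).T  # (n_gen, n_train)
    return sims.max(axis=1)

def similarity_ratio(gen_emb, train_emb, threshold):
    """Fraction of generated outputs whose nearest neighbor is at or above threshold."""
    return float(np.mean(nearest_neighbor_similarity(gen_emb, train_emb) >= threshold))

train_emb = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
gen_emb = np.array([[2.0, 0.0, 0.0],    # same direction as a training item
                    [0.6, 0.8, 0.0],    # moderately similar (0.8)
                    [0.0, 0.0, 1.0],    # unrelated (0.0)
                    [0.05, 1.0, 0.0]])  # near-copy of a training item

print(nearest_neighbor_similarity(gen_emb, train_emb))
print(similarity_ratio(gen_emb, train_emb, 0.90))
print(similarity_ratio(gen_emb, train_emb, 0.95))
```

In this toy set, two of the four generated items are near-copies of training items, so both ratios come out at one half.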

The two figures below show a pair of examples with high (top) and low (bottom) similarity scores.

Examples of high similarity scores.

Example of low similarity score.

The results of these tests of text-music consistency and the effectiveness of the data augmentation strategies are shown under Objective Metrics below.

The original MusicLDM (without mixup) achieved the highest text-audio similarity score, but also showed the highest (worst) nearest-neighbor audio similarity ratio. This indicates that the model without mixup tends to copy the training data.

MusicLDM with a simple mixup strategy achieved the lowest nearest-neighbor similarity ratio, but with poor text-audio consistency.

MusicLDM with BAM or BLM strikes a good balance between the nearest-neighbor audio similarity ratio and text-audio similarity.

Overall, mixup is effective as a data augmentation technique that helps the model generate more novel music, but simple mixup can degrade generation quality.

BLM is considered the most effective mixup strategy in terms of the quality, relevance, and novelty of the generated audio. This indicates that mixing in the latent space is more effective than mixing directly in audio space.

Subjective test results

In addition to the objective evaluation, the study also ran a subjective listening test on four models (MuBERT, the original MusicLDM, MusicLDM with BAM, and MusicLDM with BLM) to assess the actual listening experience of the generated music.

Here, 15 subjects are asked to listen to six pieces of generated music, randomly selected from the test set. Subjects are asked to rate the music in terms of quality, consistency with the text, and musicality.

The results are shown under Subjective Listening Test on the right side of the figure below.

The MusicLDM samples using the BAM or BLM mixup strategy achieve better text consistency and quality than the MuBERT or original MusicLDM samples.

Because MuBERT samples are synthesized from real music samples, they achieve the highest musicality scores.

Summary

This article introduced MusicLDM, a Text-to-Music model. Experimental results show that BLM is an effective Text-to-Music mix-up strategy.

Another issue suggested by this study is the low quality of the training data.

MusicLDM is trained on music data with a sampling rate of 16 kHz, while most standard music production uses 44.1 kHz. This low sampling rate of the training data reduces the quality of the generated music. Combined with the poor performance of the Hifi-GAN vocoder at high sampling rates, this hinders practical Text-to-Music applications, and further improvements will be needed.

Furthermore, while beat information is important for aligning music, there is room to take other musical factors, such as key signature and instrumentation, into account in the data augmentation strategy.