[LP-MusicCaps] Automatic Generation Of Music Captions Using LLM


3 main points
✔️ Automatic music caption generation using LLM
✔️ Solving the music-text pair data shortage
✔️ Creating LP-MusicCaps, a large music data set with language

LP-MusicCaps: LLM-Based Pseudo Music Captioning
written by SeungHeon Doh, Keunwoo Choi, Jongpil Lee, Juhan Nam
(Submitted on 31 Jul 2023)
Comments: Accepted for publication at the 24th International Society for Music Information Retrieval Conference (ISMIR 2023)

Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Outline of this study

This study proposes a method for automatically generating captions for music tracks using a large language model (LLM). The task of generating natural language captions for music is called "music captioning" in music information retrieval (MIR).

Automatic music captioning aims to generate a natural language description for a given music track. For example, for "Lose Yourself" by Eminem, it might generate a description such as "A powerful rap song with an impressive melody and a tremendous rhyme scheme."

First, to understand why this study was conducted, let us look at the problems in existing research and the issues in the field of generating music from text.

Lack of paired data for "music/text" is a problem

There are many existing studies in the field of music captioning, proposing both track-level and playlist-level captioning. However, the challenge common to all of them has been insufficient accuracy due to the lack of large public datasets.

Such a lack of data can also reduce the quality of the music produced by text-to-music models.

Purpose of this study

To address this data shortage, this study proposes a method that uses an LLM for music caption generation. The main objective is to resolve the shortage by using an LLM to generate captions that are semantically and grammatically accurate, clean, and rich in vocabulary.

Specifically, the procedure is as follows:

  1. Create a large dataset, "LP-MusicCaps", using pseudo-labeling with an LLM
  2. Train a Transformer-based music captioning model on the above dataset
  3. Evaluate the model with zero-shot and transfer learning

First, let's take a look at LP-MusicCaps, a large paired music/text dataset.

LP-MusicCaps Overview

This dataset is constructed by using LLM to generate new music/text pairs from an existing music/text pair dataset.

The following three existing datasets are used:

  • MusicCaps (MC): Contains 5,521 pieces of music, each tagged and labeled with a long-form description written by a music expert
  • MagnaTagATune (MTT): Contains 26k music clips, each tagged with characteristics such as genre, instrument, vocals, mood, perceived tempo, origin, and acoustic characteristics
  • Million Song Dataset (MSD): Contains 0.52 million 30-second clips and a 1,054-tag vocabulary covering genre, style, instrument, vocals, mood, theme, culture, etc.

Each dataset contains multiple tags for a single piece of music, as shown in the "Example of aspect lists" at the bottom of the figure below.

Source: https://arxiv.org/abs/2301.11325

This is an example of MusicCaps.

These tags were used for pseudo-labeling to increase the amount of text data, which in turn increased the size of the paired dataset. On average, 10.7, 3.3, and 10.2 tags per track were used for pseudo-caption generation in the respective datasets.

This ultimately yielded the following three new datasets:

  • LP-MusicCaps-MC: Created by pseudo-labeling using MC tags; 22k captions for 6k music tracks
  • LP-MusicCaps-MTT: Created by pseudo-labeling using MTT tags; 88k captions for 22k music tracks
  • LP-MusicCaps-MSD: Created by pseudo-labeling using MSD tags; 2.2M captions for 0.5M music tracks

Incidentally, the table below compares datasets in the music and audio fields, including those listed above.

Together, the three LP-MusicCaps datasets yield 0.5M music tracks and 2.2M captions. Note that "C/A" in this table represents the number of captions per piece of audio.

Next, let's take a closer look at the "pseudo-labeling technique" that played a role in creating these data sets.

Pseudo-labeling method using LLM

The key points of pseudo-labeling with an LLM are as follows:

  • Use "tag data" from existing music tagging datasets
  • Add task instructions
  • Enter the "tag data" and "task instructions" as a prompt to the LLM
  • Use GPT-3.5 Turbo as the LLM

First, the "existing tag data" and "task instructions" are entered as a prompt into GPT-3.5 Turbo. The task instructions here are typical instructions for an LLM, for example, "Describe this song" or "Summarize the content of this song."

This process is illustrated in the figure below.

In this way, pseudo-text labels were generated and added to the new data set, thus creating a large data set.
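As a rough sketch of this step, a single pseudo-caption could be generated along the following lines. The tag list is a made-up example, and the OpenAI Python SDK call is only an assumption about how one might query GPT-3.5 Turbo; it is not the authors' actual code.

```python
# Minimal sketch of the pseudo-labeling step: tag data + task instruction -> LLM prompt.
# The tag list and model call are illustrative assumptions, not the paper's released code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tag list taken from an existing music tagging dataset
input_tags = ["hip hop", "male vocal", "fast tempo", "energetic", "electric piano"]

# Task instruction (here, the "Writing" task) with the tag data appended
prompt = (
    "Write a song description sentence including the following attributes. "
    + ", ".join(input_tags)
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

pseudo_caption = response.choices[0].message.content
print(pseudo_caption)  # the generated pseudo-caption for this track
```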

GPT-3.5 Turbo was pre-trained on large datasets with enormous computing power, and was further fine-tuned with reinforcement learning from human feedback (RLHF) so that it follows instructions well, which makes it very powerful for this task.

How to design task instructions (prompts)

The task instructions entered as prompts into GPT-3.5 Turbo cover the following four tasks:

  • Writing
  • Summary
  • Paraphrase
  • Attribute Prediction

The actual prompts for each task instruction follow the formats below:

Writing: Write a song description sentence including the following attributes. {input tags}

Summary: Write a single sentence that summarizes a song with the following attributes. Don't write the artist name or album name. {input tags}

Paraphrase: Write a song description sentence including the following attributes. Creative paraphrasing is acceptable. {input tags}

Attribute Prediction: Write the answer as a Python dictionary with new_attribute and description as keys. For description, write a song description sentence including the following attributes and new attributes. {input tags}
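As an illustrative sketch, these templates could be represented and combined with a tag list as follows; the helper function and the example tags are hypothetical, not part of the paper's released code.

```python
# Sketch of the four prompt templates described above, with the tag list
# appended at the end. Helper names and example tags are made up.
TASK_TEMPLATES = {
    "writing": "Write a song description sentence including the following attributes. ",
    "summary": ("Write a single sentence that summarizes a song with the following "
                "attributes. Don't write the artist name or album name. "),
    "paraphrase": ("Write a song description sentence including the following attributes. "
                   "Creative paraphrasing is acceptable. "),
    "attribute_prediction": ("Write the answer as a Python dictionary with new_attribute "
                             "and description as keys. For description, write a song "
                             "description sentence including the following attributes "
                             "and new attributes. "),
}

def build_prompt(task: str, tags: list[str]) -> str:
    """Append the comma-separated tag list to the chosen task instruction."""
    return TASK_TEMPLATES[task] + ", ".join(tags)

print(build_prompt("summary", ["classical", "solo piano", "calm", "slow tempo"]))
```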

In this way, a sentence with the tag list appended to the end of each task instruction is entered into GPT-3.5 Turbo as a prompt. An example of captions generated in this manner is shown below.

This is an example of applying pseudo-labeling to MusicCaps. The "Input tags" at the top are the list of tags contained in MusicCaps, and the "Ground Truth" at the bottom is the long-form description contained in MusicCaps.

In this way, pseudo-captions were created for LP-MusicCaps, a large paired music/text dataset. The next section details how the validity of these captions was evaluated and the results.

Objective evaluation of LP-MusicCaps captions

Dataset for evaluation

The evaluation of LP-MusicCaps was conducted using the MusicCaps dataset, which was created in a Google study. This dataset consists of the following three components:

  • Music data
  • Tag list
  • Long-form description (written by a music expert)

Counting one set of these three elements as one example, the dataset contains 5.5K examples. Each piece of music has one tag list and one long-form description.

This long-form description is used as the ground-truth caption.

Evaluation metrics

The dataset is evaluated by measuring the similarity between the generated captions and the ground-truth captions using the following objective metrics:

  • BLEU1, BLEU2, BLEU3, BLEU4: Measure n-gram overlap between generated and ground-truth captions; the n-gram size matches the metric (1-grams for BLEU1, 2-grams for BLEU2, and so on)
  • METEOR: Based on n-gram overlap, measures precision and recall while taking word alignment into account
  • ROUGE-L: Measures the longest common subsequence between generated and ground-truth captions
  • BERT-S: Computes the similarity between BERT embeddings of the generated and ground-truth captions; it is more robust to synonyms, paraphrases, and changes in word order, and captures "semantic similarity between captions" better than n-gram metrics

In addition, the diversity of the generated captions was assessed by measuring the number of distinct words and the proportion of novel vocabulary in the captions.
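As a rough illustration of how such caption-similarity metrics can be computed, the sketch below uses the Hugging Face evaluate library with made-up caption strings; it is not the authors' evaluation code.

```python
# Sketch of computing caption-similarity metrics with the Hugging Face "evaluate"
# library. The example captions are invented; this is not the paper's evaluation code.
import evaluate

predictions = ["A powerful rap song with a fast tempo and energetic male vocals."]
references = ["This is an energetic hip hop track with fast rapping and a driving beat."]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=references, max_order=2))  # BLEU2
print(meteor.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))              # includes rougeL
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```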

Results

The results of evaluating the dataset against the above objective metrics are as follows:

The above results show that captions generated with LP-MusicCaps' "Writing" and "Summary" task instructions score best on many metrics.

Music caption generation model

The music caption generation model utilizes a Transformer-based cross-modal Encoder-Decoder, whose structure is shown in the figure below.

The model takes a log-mel spectrogram as input, which passes through convolutional layers and is then processed by the Transformer blocks of the encoder. In the decoder, the encoder output is fed through a cross-modal attention layer so that music and text are considered together, and the caption is generated by predicting the next token.
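A highly simplified sketch of this kind of cross-modal encoder-decoder is shown below; the layer counts, dimensions, and vocabulary size are arbitrary assumptions, not the configuration used in the paper.

```python
# Simplified sketch of a cross-modal encoder-decoder captioner in PyTorch.
# All hyperparameters (dims, layer counts, vocab size) are illustrative only,
# and positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class MusicCaptioner(nn.Module):
    def __init__(self, n_mels=128, d_model=512, vocab_size=30000, n_layers=4):
        super().__init__()
        # Convolutional front-end over the log-mel spectrogram
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Transformer encoder over audio features; decoder attends to it
        # through cross-attention while predicting caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, num_encoder_layers=n_layers,
            num_decoder_layers=n_layers, batch_first=True,
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, log_mel, caption_tokens):
        # log_mel: (batch, n_mels, time); caption_tokens: (batch, seq_len)
        audio_feats = self.conv(log_mel).transpose(1, 2)   # (batch, time', d_model)
        tgt = self.token_embed(caption_tokens)              # (batch, seq_len, d_model)
        causal_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src=audio_feats, tgt=tgt, tgt_mask=causal_mask)
        return self.lm_head(out)                             # next-token logits

# Example forward pass with random data
model = MusicCaptioner()
logits = model(torch.randn(2, 128, 1000), torch.randint(0, 30000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 30000])
```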

This model has also been trained on LP-MusicCaps, mentioned earlier.

The same metrics used to evaluate LP-MusicCaps were also used to evaluate this model. The results are as follows:

Its strong scores on a number of metrics indicate the model's effectiveness.

Summary of this study

The pseudo-caption generation approach proposed in this study was developed to address the shortage of data for music caption generation. This approach uses an LLM to generate pseudo-captions from tags.

Another significant achievement is the creation of the LP-MusicCaps dataset, which improves the accuracy of automatic music caption generation. Training the music captioning model on this dataset has been shown to improve the model's generalization performance.

In addition, these datasets and music caption generation models are publicly available on GitHub and Hugging Face and can be used by anyone as open source.

Hugging Face page of the LP-MusicCaps dataset

GitHub page for music caption generation model
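If the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the datasets library as sketched below; the repository ID is a placeholder, so refer to the linked page for the actual one.

```python
# Hypothetical example of loading an LP-MusicCaps split with the Hugging Face
# "datasets" library; the repository ID is a placeholder, not a confirmed path.
from datasets import load_dataset

dataset = load_dataset("your-org/LP-MusicCaps-MC")  # replace with the actual repo ID
print(dataset)
print(dataset["train"][0])  # one music/caption example
```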

Therefore, they will reduce the cost and time required to collect music-text datasets, and will further facilitate research linking music and language.
