[CLAP] Contrastive Learning Model Of Speech And Text


3 main points
✔️ Introduces a contrastive learning model for audio and text
✔️ A large-scale captioned audio dataset is also released
✔️ Achieves SoTA in Text-to-Audio retrieval and audio classification

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
written by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov
(Submitted on 12 Nov 2022 (v1), last revised 8 Apr 2023 (this version, v3))
Comments: Published on arXiv.

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Recently, a contrastive learning model called Contrastive Language-Image Pretraining (CLIP) was proposed in the image domain. By learning the correspondence between text and images projected into a shared latent space, it can be applied to both Text-to-Image and Image-to-Text tasks.

The Contrastive Language-Audio Pretraining (CLAP) model in this study applies the CLIP approach to the audio domain.

Along with text and images, audio is one of the most important modalities. There is therefore a growing need for models that can learn useful representations of audio without requiring large amounts of labeled data.

Previous studies have proposed contrastive learning between audio and text, but all have been incomplete. This is mainly attributed to the following four points.

  • Lack of audio-text pair data
  • Inappropriate choice of encoders and models
  • Difficulty processing variable-length audio data
  • Limited ability to generalize to downstream tasks (Text-to-Audio only)

To solve these problems, the authors of this study build a large dataset and a contrastive learning model.

First, we would like to introduce the dataset LAION-Audio-630K, which was constructed for this study.

Construction of a large data set "LAION-Audio-630K"

To overcome the lack of paired audio and text data, the authors constructed their own dataset, LAION-Audio-630K, and used it to train the model. This dataset contains 633,526 audio-text pairs (4,325.39 hours in total), the largest of its kind at the time of the paper's publication.

The sizes of the previously published data sets and the LAION-Audio-630K are compared in the table below.

To begin, audio data including "human activity," "natural sounds," and "sound effects" and associated textual descriptions were collected from eight publicly available sources.

Information about each source and the data obtained can be found in the table below.

The following figure shows the distribution of audio lengths for two of the above sources, "Epidemic Sound" and "Freesound".

As can be seen from this distribution, the length of audio data varies widely from sample to sample. This large variation is one factor that makes training difficult. With images, there is no problem as long as all training images are resized to a uniform size, but audio requires some extra ingenuity.

Model structure of CLAP

This section describes the model structure of CLAP. The model structure is as follows.

First, the raw audio and text data are embedded by the following encoders, respectively.

Audio encoders:
  • PANN (CNN-based model)
  • HTSAT (transformer-based)

Text encoder:
  • CLIP transformer (text encoder for CLIP)



An MLP is then applied to each encoder output to obtain an "audio embedding $E^a$" and a "text embedding $E^t$" of the same dimension.

Then, using $E^a$ and $E^t$ for each pair of data, the model is trained to minimize the following loss function.
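The loss has the symmetric contrastive form familiar from CLIP. A sketch in that standard form is given here. The batch size $N$, the indices $i$ and $j$, the leading minus sign (which makes this a quantity to minimize), and the assumption that the embeddings are L2-normalized so that the dot product equals the cosine similarity are all notational assumptions and may differ from the paper's exact presentation:

$$
L = -\frac{1}{2N} \sum_{i=1}^{N} \left( \log \frac{\exp(E_i^a \cdot E_i^t / \tau)}{\sum_{j=1}^{N} \exp(E_i^a \cdot E_j^t / \tau)} + \log \frac{\exp(E_i^t \cdot E_i^a / \tau)}{\sum_{j=1}^{N} \exp(E_i^t \cdot E_j^a / \tau)} \right)
$$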

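where $\tau$ represents the temperature parameter, which is learned during training. If the model were perfectly optimized, the matrix of cosine similarities between the audio and text embeddings in a batch would have 1 in every diagonal entry and small values elsewhere.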

A diagonal entry of 1 means that the audio and text embeddings of a matched pair have a cosine similarity of 1, i.e., they carry the same meaning. In other words, the model learns to pull the embeddings of a matched audio-text pair (diagonal entries) closer together and to push the embeddings of mismatched pairs (off-diagonal entries) farther apart.

Of course, it is almost impossible for the cosine similarity between the same pair to be perfectly 1.
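As a rough illustration of this training objective, the following NumPy sketch computes a batch similarity matrix and a symmetric contrastive loss. It is not the authors' implementation: the function names, the fixed `tau=0.07` (the article says the temperature is learned), and the toy embeddings are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, tau=0.07):
    """Return (loss, similarity matrix) for N matched audio-text pairs.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    tau is fixed here for simplicity (an assumption of this sketch).
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    sim = a @ t.T                      # (N, N) cosine similarities
    logits = sim / tau
    idx = np.arange(len(a))
    # Audio-to-text: each row should pick out its own column.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each column should pick out its own row.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a), sim

# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
toy = np.eye(4)
loss_matched, sim_matched = symmetric_contrastive_loss(toy, toy)
loss_mismatched, _ = symmetric_contrastive_loss(toy, toy[[1, 0, 3, 2]])
print(sim_matched)
print(loss_matched, loss_mismatched)
```

In the toy check, the aligned batch yields an identity similarity matrix and a much smaller loss than the mismatched batch, which is the behavior described above.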

The trained CLAP model can then be used for downstream tasks such as Text-to-Audio Retrieval and audio classification.


For Text-to-Audio Retrieval, an audio search is performed with the following steps.

  1. A text query is entered and embedded by the text encoder.
  2. The cosine similarity between the text embedding and the embedding of every audio clip in the audio database is calculated.
  3. The audio embedding with the highest cosine similarity is selected.

This procedure allows you to search the database for audio that corresponds to the query text.
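The retrieval steps above can be sketched as follows. This is a minimal illustration that assumes the embeddings have already been produced by the encoders; `retrieve_audio`, `top_k`, and the toy vectors are hypothetical names and values, not part of CLAP itself.

```python
import numpy as np

def retrieve_audio(text_emb, audio_embs, top_k=1):
    """Rank database audio embeddings by cosine similarity to a text embedding.

    text_emb:   shape (D,), embedding of the query text
    audio_embs: shape (M, D), one embedding per audio clip in the database
    Returns the indices of the top_k clips and their similarity scores.
    """
    q = text_emb / np.linalg.norm(text_emb)
    db = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = db @ q                         # cosine similarity per clip
    order = np.argsort(-scores)[:top_k]     # highest similarity first
    return order, scores[order]

# Toy database of three audio embeddings and one query embedding.
audio_db = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.6, 0.8, 0.0]])
query = np.array([0.5, 0.9, 0.0])

indices, scores = retrieve_audio(query, audio_db, top_k=2)
print(indices, scores)
```

In a real system the database embeddings would be computed once and cached, so that each query only needs one pass through the text encoder and one matrix-vector product.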

Techniques for handling audio and text data

Each type of data input to the model is given its own special processing.

・Audio data

Unlike RGB image data, which can be resized to a uniform resolution, audio is inherently variable in length. Traditionally, the entire audio is fed to the audio encoder, and the audio embeddings obtained per frame or per chunk are averaged (slice and vote).

However, this method is computationally inefficient for long audio.
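The slice-and-vote approach can be sketched as follows. This is only an illustration: `encode_chunk` is a hypothetical stand-in for a real audio encoder, and zero-padding the final chunk is an assumption of the sketch. The number of encoder calls grows with the length of the audio, which is the source of the inefficiency.

```python
import numpy as np

def slice_and_vote(waveform, encode_chunk, chunk_len):
    """Encode a waveform chunk by chunk and average the chunk embeddings."""
    if len(waveform) == 0:
        raise ValueError("waveform must not be empty")
    embeddings = []
    for start in range(0, len(waveform), chunk_len):
        chunk = waveform[start:start + chunk_len]
        if len(chunk) < chunk_len:
            # Assumption: pad the last, shorter chunk with zeros.
            chunk = np.pad(chunk, (0, chunk_len - len(chunk)))
        embeddings.append(encode_chunk(chunk))
    return np.mean(embeddings, axis=0)

calls = []

def toy_encoder(chunk):
    """Hypothetical encoder: a 2-dimensional 'embedding' of simple statistics."""
    calls.append(len(chunk))
    return np.array([chunk.mean(), chunk.std()])

waveform = np.ones(25)
embedding = slice_and_vote(waveform, toy_encoder, chunk_len=10)
print(embedding, "encoder calls:", len(calls))
```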

Therefore, this study combines coarse global information with randomly sampled local information, so that training and inference run in constant computation time for audio inputs of different lengths.
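One way to picture this is a function that turns any waveform into a fixed number of fixed-length views: one coarse global view plus a few randomly cropped local views. The sketch below covers only this input preparation, not the fusion inside the model, and every detail in it is an assumption made for illustration: the function name, subsampling as a crude stand-in for downsampling, three local crops, and zero-padding for short clips.

```python
import numpy as np

def make_fixed_size_inputs(waveform, target_len, num_local=3, rng=None):
    """Build one global view and num_local local views, each of target_len samples."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(waveform)
    if n <= target_len:
        # Assumption: short clips are zero-padded and reused for every view.
        padded = np.pad(waveform, (0, target_len - n))
        return padded, np.stack([padded] * num_local)

    # Global view: evenly spaced samples covering the whole clip (coarse).
    idx = np.linspace(0, n - 1, target_len).astype(int)
    global_view = waveform[idx]

    # Local views: one random full-resolution crop from each region of the clip.
    max_start = n - target_len
    bounds = np.linspace(0, max_start, num_local + 1).astype(int)
    local_views = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        start = int(rng.integers(lo, hi + 1))   # hi is inclusive here
        local_views.append(waveform[start:start + target_len])
    return global_view, np.stack(local_views)

long_wave = np.arange(100, dtype=float)
g_long, l_long = make_fixed_size_inputs(long_wave, 10, rng=np.random.default_rng(0))
short_wave = np.ones(4)
g_short, l_short = make_fixed_size_inputs(short_wave, 10)
print(g_long.shape, l_long.shape, g_short.shape, l_short.shape)
```

Because the output shapes do not depend on the input length, the cost of the encoder stays constant regardless of how long the original audio is.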

・Text data

Some datasets include labels or tags as keywords for the corresponding audio.

As shown in the figure below, the data is augmented with "Keyword-to-Caption (Label-to-Caption)": a pre-trained T5 model generates captions from these keywords.

An example of a caption generated by Keyword-to-Caption using T5 is shown below.

In short, when you enter a few keywords, T5 generates natural sentences that include those keywords.

The text thus generated is used for training data.

Evaluation experiment

First, to find the best encoder combination, two audio encoders and three text encoders were tested in every combination on Text-to-Audio and Audio-to-Text retrieval.

The results are as follows

As can be seen from the results, HTSAT is the more accurate audio encoder, and RoBERTa or BERT tends to be more accurate as the text encoder.

The subsequent Text-to-Audio comparison experiments use the HTSAT-RoBERTa combination.

Continuing with the comparison experiment in Text-to-Audio, the results are as follows.

In addition, experiments in audio classification have shown the following results.

These results show that CLAP achieves SoTA in tasks such as Text-to-Audio and Audio Classification.

Two important points are as follows.

  • The choice of audio and text encoders affects the performance of the model.
  • The results demonstrate the model's generalization performance across different datasets, along with the trade-offs involved.


The CLAP model in this study could be applied to a variety of downstream tasks, such as sound source separation and audio captioning.

Incidentally, at the time of this writing, many studies on audio generation using CLAP continue to be published. This research is therefore expected to remain important in the audio and multimodal fields, so keep an eye on it.
