
CLAP-IPA: Acquisition Of Multilingual Phonetic Expressions By Contrastive Learning Of Speech And IPA Sequences


Natural Language Processing

3 main points
✔️ Building a basic model for multilingual speech processing (CLAP-IPA) by learning to contrast speech signals with their corresponding IPA (International Phonetic Alphabet) sequences
✔️ Recorded high performance in Keyword Spotting and Forced Alignment in multilingual settings
✔️ Training on IPA yields high performance even under zero-shot conditions

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
written by Jian Zhu, Changbing Yang, Farhan Samir, Jahurul Islam
(Submitted on 14 Nov 2023)
Comments: NAACL 2024 Main Conference

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

code: 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Spoken language processing is a research field that aims to realize various natural language processing tasks using speech as input. With the recent emergence of high-performance multimodal models such as GPT-4o, it may seem that few challenges remain in spoken language processing. In reality, however, many challenges remain. In this article, we focus on one of them: the multilingualization of models.

Multilingualization of models refers to building models that work in a variety of languages. It is relatively easy to build models in languages with ample data resources, such as English and Japanese. On the other hand, there are many languages in the world that lack data resources. Building models that work in such languages will contribute to the realization of an inclusive AI society.

CLAP-IPA" presented in this article is a study that directly addresses the issue of multilingualization of the model. The key point is that it uses the IPA (International Phonetic Alphabet ) system to describe sounds in all languages of the world; for example, "IPA" is written as [aɪ- piː-eɪ].

IPA is very compatible with multilingualization. Suppose that, instead of IPA, we used the characters we normally read. English is written in the Latin alphabet, whereas Japanese is written in kanji and kana, so models for each language would be built on different symbol systems. This also affects the architecture of the model itself, making it difficult to construct a consistent model across multiple languages. In addition, some languages in the world have no fixed writing system (e.g., Swiss German and Arabic dialects), which may make such an approach impossible in the first place. IPA, on the other hand, allows a consistent, language-independent description, which naturally enables multilingualization of the model.

CLAP-IPA is based on a previous study called CLAP. In a nutshell, the idea is to build a better phonetic representation by contrastively learning "speech" and "its symbolic representation." As a result, it recorded high performance in two tasks, Keyword Spotting and Forced Alignment. Furthermore, the benefit of using IPA proved to be that high performance is maintained especially under zero-shot conditions.

What is IPA (International Phonetic Alphabet)?

IPA is a symbol system for describing the "sound" of any language in the world (it happens to share its name with a certain style of beer, so the PDF of the paper includes a beer pictogram in its title). The details of IPA are covered in the Wikipedia article, so I will limit myself to a brief description here.

IPA has two types of notation: a broad (simplified) notation enclosed in slashes // and a precise notation enclosed in brackets []. The broad notation differs depending on the target language, so the latter is more convenient when describing sounds in a language-independent manner. Therefore, in the following, we will use the bracketed notation (e.g., [a], [p]).

There are many different types of "sounds" that we speak. For example, the English word "dog" consists of the three sounds [d], [ɒ], and [ɡ]. In contrast, the Japanese word for "tool" (dōgu) strictly consists of five sounds: [d], [o], [ɯ], [ɡ], and [ɯ]. Although they are two different languages, if you look at them at the sound level, you will see that they actually share some parts; specifically, the [d] and [ɡ] in each word can be considered the same sounds. It is intuitively clear that different languages have something in common when viewed at the sound level. In this way, IPA can be seen as a catalog in which the same sound is always written with the same symbol (here, [d] and [ɡ]).

The aforementioned Wikipedia article also has a table listing the IPA symbols, which you can check out if you are interested.

Details of The Proposed Model (CLAP-IPA)

Creation of the Dataset (IPAPACK)

Prior to training the model, a dataset consisting of pairs of speech and IPA sequences must be created. We first prepare a multilingual dataset consisting of pairs of speech and its transcriptions. In the paper, three datasets are used: FLEURS, MSWC, and DORECO. The transcriptions at this point are ordinary text, not IPA sequences. The transcribed text is then converted into IPA sequences using a G2P (Grapheme-to-Phoneme) system, where a grapheme is ordinary written text and a phoneme can here be taken to mean an IPA symbol. In other words, G2P is a system that converts ordinary text into IPA sequences.
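As an illustration, here is a minimal G2P sketch using the open-source Epitran library. This is an assumption for illustration only; the paper's actual G2P toolchain may differ, and coverage and accuracy vary by language.

```python
# Minimal G2P sketch (illustrative): convert ordinary text to an IPA string.
# Epitran is an open-source G2P library; it is used here only as an example
# and is not necessarily the toolchain used in the paper.
import epitran

# "spa-Latn" = Spanish written in the Latin script; other languages have their own codes.
epi = epitran.Epitran("spa-Latn")

text = "perro"                 # ordinary transcription (graphemes)
ipa = epi.transliterate(text)  # corresponding IPA sequence
print(ipa)
```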

This resulted in the dataset IPAPACK, whose statistics are shown below: VoxCommunis, a dataset from a prior study, covers 38 languages, while IPAPACK covers as many as 115 languages.

However, G2P is not highly reliable for every language, and its output inherently requires human validation. Although human validation was performed for some languages, it is practically impossible to validate all 115 languages. Therefore, the authors acknowledge that IPAPACK may contain errors, and this is one of the major limitations of this study.

Model Training

CLAP, from which CLAP-IPA is derived, stands for "Contrastive Language-Audio Pretraining." Many of you may have heard of the CLIP (Contrastive Language-Image Pretraining) model in image processing, and CLAP can be considered its audio counterpart.

For a batch $\mathcal{B}=\{(P_i, S_i)\}$ consisting of IPA-sequence and speech pairs, let $x_i=f_P(P_i)$ be the representation vector of the IPA sequence and $y_i=f_S(S_i)$ be the representation vector of the speech. Here, a BERT-style model is used as $f_P$ (the IPA-sequence encoder) and Whisper is used as $f_S$ (the speech encoder).

Under this, we calculate the SigLIP loss defined below.
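The loss itself appears as an image in the original article; for reference, the sigmoid loss of SigLIP, which CLAP-IPA adopts, takes the following form, with a learnable temperature $t$ and bias $b$:

$$\mathcal{L} = -\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\sum_{j=1}^{|\mathcal{B}|}\log\sigma\big(z_{ij}\,(t\,x_i\cdot y_j + b)\big)$$

where $\sigma$ is the logistic sigmoid and $z_{ij}$ indicates whether $(P_i, S_j)$ is a matched pair.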

Here, $z_{ij}$ takes the value $1$ for matched (correct) pairs and $-1$ for all other pairs. Intuitively, learning proceeds in the direction of increasing the inner product $x_i\cdot y_j$ for the correct pairs and decreasing it for all other pairs. In other words, the model learns so that the embedding vectors of matched pairs become more similar.

Note that the CLAP of the previous study employs a softmax loss, unlike CLAP-IPA, and the sigmoid loss adopted by CLAP-IPA is known to have better properties. If you are interested, please refer to the SigLIP paper.
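As a rough illustration of this sigmoid loss (a sketch based on the SigLIP formulation above, not code from the paper), it can be written in PyTorch as follows:

```python
# Illustrative PyTorch sketch of the sigmoid (SigLIP-style) contrastive loss.
# x: (B, d) IPA-sequence embeddings, y: (B, d) speech embeddings;
# t (temperature) and b (bias) are learnable scalars.
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(x, y, t, b):
    logits = t * (x @ y.T) + b                           # pairwise scaled similarities
    z = 2 * torch.eye(x.size(0), device=x.device) - 1    # +1 on the diagonal (matched pairs), -1 elsewhere
    return -F.logsigmoid(z * logits).sum() / x.size(0)   # average the per-pair losses over the batch
```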

Experiment Details

The performance of CLAP-IPA is evaluated on two tasks: Keyword Spotting (KWS) and Forced Alignment.

Keyword Spotting (KWS)

KWS is a binary classification task that takes a keyword and audio as input and determines whether or not the keyword is present in the audio. For example, when you say "Hey Siri" to Siri, you start an interaction with Siri, which is exactly what KWS is doing.

Using CLAP-IPA for KWS is very simple. The given keyword is converted into an IPA sequence to obtain its CLAP-IPA embedding $x$, and the similarity between $x$ and the embedding $y$ of the input audio is measured. If the similarity exceeds a threshold, the keyword is judged to be contained in the speech.
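To make this concrete, here is a hypothetical sketch of the procedure. The encoder functions `encode_ipa` and `encode_speech` are placeholders for the two CLAP-IPA encoders, and the threshold is arbitrary; none of this is the paper's actual API.

```python
# Hypothetical sketch of keyword spotting with CLAP-IPA-style embeddings.
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def keyword_spotting(keyword_ipa, audio, encode_ipa, encode_speech, threshold=0.5):
    x = encode_ipa(keyword_ipa)   # embedding of the keyword's IPA sequence
    y = encode_speech(audio)      # embedding of the input audio
    return cosine_similarity(x, y) > threshold  # True = "keyword is in the speech"
```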

Forced Alignment

Forced Alignment is the task of estimating the time span of each phoneme or word in a given speech utterance, and the authors found that it emerges naturally from CLAP-IPA. That is, when computing the similarity matrix between the speech and its IPA sequence, the similarity was observed to be high over the time span of the corresponding word or phoneme. The following figure helps to illustrate this.

Both matrices are similarity matrices between the speech and its IPA sequence (strictly speaking, the word/phoneme sequence; see the paper for details). The top figure shows the matrix computed in the zero-shot setting, i.e., with pre-trained CLAP-IPA, and the bottom figure shows the matrix after fine-tuning on alignment. You can see that even in the zero-shot setting there is some correspondence between the two modalities.
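To make the idea concrete, the following is a hypothetical sketch of how such a frame-by-phone similarity matrix could be turned into a rough alignment. The frame and phone embeddings are assumed to come from CLAP-IPA-style encoders, and the simple argmax step stands in for the paper's actual alignment procedure.

```python
# Hypothetical sketch: build a (frames x phones) similarity matrix and read off
# a rough alignment. Not the paper's exact procedure.
import numpy as np

def similarity_matrix(frame_embs: np.ndarray, phone_embs: np.ndarray) -> np.ndarray:
    """frame_embs: (T, d) speech-frame embeddings; phone_embs: (N, d) IPA-symbol embeddings.
    Returns a (T, N) matrix of cosine similarities."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    p = phone_embs / np.linalg.norm(phone_embs, axis=1, keepdims=True)
    return f @ p.T

def rough_alignment(sim: np.ndarray) -> np.ndarray:
    """Assign each frame to its most similar phone; a monotonicity constraint
    (e.g., dynamic programming) would normally be added on top."""
    return sim.argmax(axis=1)
```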

Experimental Results

First, we show the KWS results. The following are results on the English LibriPhrase dataset.

Let me add a few explanations. First, Easy/Hard at the top indicates the difficulty of the task. For example, when the keyword is "friend", the Easy setting asks the model to distinguish it from a completely different word such as "guard", while the Hard setting asks it to distinguish it from a similarly pronounced word such as "find". The first three methods are prior studies; CLAP-IPA-TEXT is trained on regular text instead of IPA, and CLAP-IPA-PHONE is trained on phoneme sequences. What these have in common is that the transcriptions are language-dependent because IPA is not used. The remaining five models are the proposed models: CLAP-IPA-FLEURS and CLAP-IPA-VC use less training data, while the remaining models use the full training data. Model sizes are in the order tiny < base < small.

Looking at the results, CLAP-IPA's performance stands out on the Easy task, while the prior studies outperform it on the Hard task. Also, comparing CLAP-IPA-TEXT and CLAP-IPA-PHONE with the other CLAP-IPA models (i.e., models trained on IPA sequences), the latter show better overall performance. This demonstrates the usefulness of IPA in multilingual settings.

Below are the results of the experiment on unseen languages, i.e., languages that were not included in the pre-training data.

The performance of the bottom five models, which are trained on IPA sequences, is particularly outstanding.

The following are the results for Forced Alignment. Below are the results on the English dataset TIMIT, showing alignment performance at the word and phoneme level.

The top six rows are prior studies and the bottom six rows are the proposed method. The word-level results for the prior studies are not shown, probably because performance is already quite high for models trained for Forced Alignment in English. The CLAP-IPA results show that even the zero-shot performance, with no training on forced alignment, is good, and that it improves greatly with fine-tuning.

Below is a comparison of the results between the SEEN and UNSEEN languages.

The improvement on unseen languages may well be a coincidence, but it is quite remarkable that performance at least does not degrade. Here again, the advantage of training with IPA is clearly apparent.

Summary

In this study, a method for acquiring multilingual phonetic representations by contrastive learning between speech and IPA sequences was proposed. Its performance under zero-shot conditions (unseen languages) is particularly remarkable, and we believe the advantages of IPA are fully demonstrated.

IPA does not often appear in machine learning models. The reason, as mentioned earlier, is that there is still no reliable multilingual converter to IPA; the IPAPACK created in this study likely contains a certain number of errors. Nevertheless, the great strength of this paper is that the authors actually trained at scale and verified the performance. IPA may well come into the limelight in the future, spurring further research on G2P and on using IPA for training.

