
[Google × Meta] XLS-R Large-scale Model For Speech Recognition And Speech Translation



3 main points
✔️ XLS-R, a large-scale cross-lingual speech representation model
✔️ Significant performance improvements in speech translation and speech recognition
✔️ Comparison of large-scale cross-lingual models with single-language models

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
written by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli
[Submitted on 17 Nov 2021 (v1), last revised 16 Dec 2021 (this version, v3)]
comments: To appear at IEEE ICASSP 2022
subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Nice to meet you all!

I am Ogasawara, a new writer for AI-SCHILAR.

The paper presented here is this one:

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

As I summarized in the main points at the beginning of this article, the goal is to improve the accuracy of speech recognition and speech translation by significantly scaling up the wav2vec 2.0 model.

I wonder what kind of methods are being used! Let's learn about them together, little by little!

I will try to introduce them in as concise a manner as possible, so please bear with me until the end.

Outline of this study

In this paper, the authors propose XLS-R, a large-scale model for multilingual speech representation learning based on wav2vec 2.0. It is very large scale indeed: the biggest model has up to 2 billion parameters and is trained on roughly 436,000 hours of public speech data in 128 languages.

This is a very large project, and it is only possible because two of the world's largest companies, Meta and Google, are involved in this paper.

One feature of the evaluation is that performance is measured not only on speech recognition but across a wide range of tasks and languages. The results show the best accuracy to date on speech translation and a significant reduction in error rate for speech recognition.

This paper does not propose a new algorithm or model design, so let's take our time and find out through what process these results were obtained!

Terms to keep in mind

Wav2vec2.0

This model was developed by Facebook (now Meta). It is characterized by its end-to-end design, which maps the speech signal directly to text. Another impressive feature is that it is trained by self-supervised learning, which means that the first stage of training needs only a large amount of unlabeled speech.

Good results can then be achieved by fine-tuning with a small amount of labeled data for the task that we, the users, actually want to perform.
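As a concrete illustration (not something from the paper itself), here is a minimal sketch of the "pretrain first, fine-tune later" idea, assuming the Hugging Face transformers library and the facebook/wav2vec2-xls-r-300m checkpoint published on the Hugging Face Hub. It simply loads the pretrained model and turns raw 16 kHz audio into contextual speech representations; no labels are involved at this stage.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

# Load a pretrained XLS-R checkpoint (300M-parameter variant) from the Hugging Face Hub
extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
model.eval()

# One second of 16 kHz audio; in practice this would be real speech in any language
waveform = torch.zeros(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state

print(features.shape)  # (batch, frames, hidden_size), roughly (1, 49, 1024) for this checkpoint
```

These frame-level representations are what the second stage, fine-tuning, builds on.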

Fine tuning

I am sure many readers are already familiar with this, but since it is important, let me add an explanation just in case. Simply put, fine-tuning means customizing a completed model to fit the task you want to perform.

For example, suppose you have bought a pre-built deck for a card game. You have played against your friend with this deck many times, but you just can't win. So, in order to beat your friend's deck, you buy a powerful counter card and add it to your deck.

In short, it is fine to think of fine-tuning as tweaking a model so that it works better for your own use case.
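To make the "tweak it for your own task" idea concrete, here is a hedged sketch of fine-tuning a pretrained XLS-R checkpoint for speech recognition with a CTC head, again assuming the Hugging Face transformers library. The vocabulary size and the dummy batch are purely illustrative; a real setup would use a character tokenizer and labeled audio for the target language.

```python
import torch
from transformers import Wav2Vec2ForCTC

# Attach a randomly initialized CTC head on top of the pretrained encoder.
# vocab_size=32 is a placeholder; it would normally match your tokenizer.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    vocab_size=32,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the convolutional feature encoder frozen

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Dummy batch: two 1-second utterances at 16 kHz with short label sequences
input_values = torch.randn(2, 16000)
labels = torch.randint(1, 32, (2, 12))  # character indices; -100 would mark padding

loss = model(input_values=input_values, labels=labels).loss  # CTC loss
loss.backward()
optimizer.step()
```

Only a comparatively small amount of labeled data is needed at this stage, which is exactly the appeal of the pretrain-then-fine-tune recipe.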

Pre-training

As explained for wav2vec 2.0, this kind of model requires two stages of training. Pre-training refers to the first stage. In this model, it takes the form of self-supervised learning on a large amount of unlabeled speech data.
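The self-supervised objective of wav2vec 2.0 works roughly like this: parts of the latent speech sequence are masked, and the model must pick the true quantized latent for each masked position out of a set of distractors. Below is a minimal PyTorch sketch of that contrastive idea; it is a simplification for intuition, not the authors' implementation, and the tensor shapes and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    """Simplified wav2vec 2.0-style contrastive objective.

    context:   (T, D) transformer outputs at masked positions
    quantized: (T, D) true quantized latents for those positions
    negatives: (T, K, D) K distractor latents sampled from other positions
    """
    pos = F.cosine_similarity(context, quantized, dim=-1)               # (T,)
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)  # (T, K)

    # The positive target sits at index 0; the model must rank it above the distractors
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature    # (T, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for real latents
T, K, D = 20, 10, 256
print(contrastive_loss(torch.randn(T, D), torch.randn(T, D), torch.randn(T, K, D)))
```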

Speech corpus dataset

Speech data sets are created by having a company or volunteers read texts composed with a good balance of phonemes and other elements. English data sets in particular are available in wide variety, and their total recording time is very large. The availability of such data sets is the key to improving speech recognition technology, and it shows how strong the U.S. is, home to global giants such as Google and Meta.

In Japan, too, some corpora exist, but they are not very plentiful: many are old and many require a fee. In recent years, however, a free and open corpus called the ITA Corpus has appeared, and it is widely used by researchers and creators. Zundamon is a famous example.

Got it so far? A quick review

There are only three important things!

Let's just hold on to this!

  1. It is based on wav2vec 2.0.
  2. The number of parameters and the amount of training data are, in any case, huge.
  3. The accuracy of the model is significantly improved.

As long as you have these three things in mind, the rest will be fine!

In the next section, we will look at the experiment.

This is where we start! About the Experiment

Thank you very much for reading this long explanation of the basics. Next, I will finally explain the most interesting part of the paper: the experiments.

Experimental setup

  1. Pre-train with the wav2vec 2.0 setup, adjusting the number of parameters for each model size.
  2. Train on a large number of very powerful GPUs.
  3. Use a multilingual corpus, balancing how often each language is sampled (see the sketch just below this list).
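On point 3: if you sampled training data purely in proportion to hours, high-resource languages such as English would dominate the 128-language mix. A common fix, and the kind of balancing this line of work relies on, is temperature-based sampling over languages. The sketch below illustrates the idea; the alpha value and the toy numbers are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def language_sampling_probs(hours_per_language, alpha=0.5):
    """Temperature-based sampling over languages.

    hours_per_language: dict mapping language code -> hours of unlabeled audio.
    alpha < 1 upsamples low-resource languages relative to their raw share of data.
    """
    hours = np.array(list(hours_per_language.values()), dtype=float)
    probs = (hours / hours.sum()) ** alpha
    return dict(zip(hours_per_language, probs / probs.sum()))

# Toy example: English dominates the raw hours, but its sampling share shrinks
print(language_sampling_probs({"en": 70000, "de": 3000, "sw": 100}))
```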

Now we will conduct fine tuning for each task and evaluate the results of the experiment.

What are the results of the experiment?

Speech translation

For speech translation from other languages into English, significant improvements were achieved at all resource levels. Increasing the model size also improved benchmark performance in most cases.

Also, when translating from English into other languages, a large multilingual model performs as well as an English-only pre-trained model. This indicates that, given sufficient capacity, a multilingual model can match a single-language pre-trained model.

Speech recognition

Unlike the translation task, this task showed significant improvements mainly for small and medium amounts of training data.
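The speech recognition improvements are reported as reductions in error rate. As a reminder of what that metric means, here is a small, self-contained word error rate (WER) calculation; it illustrates the general metric, not the exact evaluation scripts used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2 errors / 6 words ≈ 0.33
```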

Summary of the Paper

Thank you all for your hard work. What I introduced this time was an attempt to significantly scale up wav2vec 2.0 and evaluate its performance on a wide range of tasks and languages. Frankly, I felt this research is a world away from what an average graduate student can do. It is difficult to collect such a large amount of audio data, and I do not have super-high-performance GPUs that would make training on that much data possible.

And roughly 436,000 hours of training data; we probably would not reach that even if we collected every speech data set that exists in Japan. However, I did discover one encouraging thing: even at such a huge scale, there is a limit to how much the speech recognition rate keeps growing.

I guess both quality and quantity are important.

The results of this study can be summarized as follows

  1. First, the scale is too large to reproduce (*the pretrained models are publicly available).
  2. Accuracy improved for both translation tasks and speech recognition tasks.

These are the two major results of this work.

A little chat with a chick writer, Ogasawara

Today I want to talk about how AI is nothing like a superhuman or Doraemon.

If you have read this far, you are probably already well aware of that fact. But the general public knows too little about AI.

Looking around the Internet, I see people writing things like "surely AI can do anything" and "then there will be no work left for me," and I am a bit stunned. Know your enemy: the more you know, the less afraid and the less anxious you will be. You are holding a smartphone or a mouse right now, right?

What I wanted to say is how scary it is to be indifferent and to stop thinking.

See you in the next article.

This is Ogasawara, a newbie chick writer~.

See you later!

