[Zero-Shot Transfer Learning] Innovative Technology For Speech Recognition Of Unlearned Languages From Multilingual Corpus Data!

3 main points
✔️ Highly accurate speech recognition even without labeled data for the target language
✔️ A simple but innovative approach
✔️ An effective way of using multilingual data

Simple and Effective Zero-shot Cross-lingual Phoneme Recognition
written by Qiantong Xu, Alexei Baevski, Michael Auli
[Submitted on 23 Sep 2021]
subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Nice to meet you all!

I am Ogasawara, a new writer for AI-SCHOLAR.

The paper presented here is "Simple and Effective Zero-shot Cross-lingual Phoneme Recognition."

As summarized in the main points at the beginning of this article, the goal is to fine-tune wav2vec2.0, which has been pre-trained on multiple languages, so that it can recognize languages it was never trained on.

I wonder what kind of methods are being used! Let's learn together, little by little!

I will try to introduce them as concisely as possible, so please stay with me until the end.

Why do this study?

There are many languages in the world, but only a few of them have been studied in speech recognition. Research has not progressed for most languages because current models require large amounts of labeled speech data.

Recent progress in speech recognition research has shown that a small amount of labeled training data can be enough to reach good accuracy, but a major drawback remains: a model must still be prepared for each language.

Therefore, this study aims at zero-shot transcription: a model fine-tuned on phoneme labels from multiple languages transcribes languages for which it has seen no labeled data.

How to approach

  1. Self-supervised pre-training on multilingual data
  2. Fine-tuning on multiple languages to recognize phonemes
  3. During inference, use a phoneme mapping from the training-language phonemes to the target-language phonemes
  4. Test the fine-tuned model on languages it was not trained on
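
Step 3 can be illustrated with a small sketch. It assumes that each phoneme is described by a handful of articulatory features and that a training-language phoneme missing from the target language is mapped to the target phoneme that differs in the fewest features. The feature table, the phoneme inventories, and the function names are made up for illustration; this is not the paper's actual mapping.

```python
# Hypothetical articulatory features per phoneme (illustrative only).
FEATURES = {
    "p": {"consonant", "bilabial", "plosive", "voiceless"},
    "b": {"consonant", "bilabial", "plosive", "voiced"},
    "t": {"consonant", "alveolar", "plosive", "voiceless"},
    "s": {"consonant", "alveolar", "fricative", "voiceless"},
    "z": {"consonant", "alveolar", "fricative", "voiced"},
    "i": {"vowel", "front", "close", "unrounded"},
    "y": {"vowel", "front", "close", "rounded"},
    "u": {"vowel", "back", "close", "rounded"},
}


def feature_distance(a, b):
    """Count the features that differ between two phonemes."""
    return len(FEATURES[a] ^ FEATURES[b])


def build_mapping(train_phonemes, target_phonemes):
    """Map every training phoneme to a phoneme of the target language."""
    mapping = {}
    for src in train_phonemes:
        if src in target_phonemes:
            # The target language has this phoneme: keep it unchanged.
            mapping[src] = src
        else:
            # Otherwise fall back to the closest target phoneme.
            mapping[src] = min(
                target_phonemes, key=lambda tgt: feature_distance(src, tgt)
            )
    return mapping


def map_sequence(predicted, mapping):
    """Rewrite a predicted phoneme sequence in the target inventory."""
    return [mapping[p] for p in predicted]


train = ["p", "b", "t", "s", "i", "u"]   # phonemes seen during fine-tuning
target = ["p", "z", "y", "u"]            # phonemes of the unseen language
mapping = build_mapping(train, target)
print(map_sequence(["s", "i", "p"], mapping))
```

With such a table, the fine-tuned model never needs labeled audio from the target language; only its phoneme inventory is required.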

Check basic terminology

What is a phoneme?

Phonemes come up so frequently in speech recognition research that the term is worth knowing. A phoneme is the smallest unit of sound that distinguishes one word from another in a language; for example, /p/ and /b/ distinguish "pat" from "bat" in English. For this article, it is enough to remember it as the smallest unit of pronunciation.

What is fine tuning?

Fine-tuning means additionally training a model that has already been trained in advance, using your own data for your own purposes and tasks. By doing this, you can adapt a general-purpose model into one that suits your use.
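
To make the idea concrete, here is a toy sketch in plain Python. It assumes one common variant of fine-tuning: the pre-trained part is frozen, and only a small new output layer is trained on the new data. The function `pretrained_features` is a hypothetical stand-in for a real encoder, and in practice many more weights are often updated than in this sketch.

```python
def pretrained_features(x):
    """Stand-in for a frozen pre-trained encoder; it is never updated."""
    return [x, x * x]


def fine_tune(data, steps=2000, lr=0.1):
    """Train only a small linear head on top of the frozen features."""
    w = [0.0, 0.0]  # weights of the new head
    b = 0.0         # bias of the new head
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for x, y in data:
            f = pretrained_features(x)
            err = (w[0] * f[0] + w[1] * f[1] + b) - y
            grad_w[0] += err * f[0]
            grad_w[1] += err * f[1]
            grad_b += err
        # Gradient descent step on the mean squared error.
        w = [w[i] - lr * grad_w[i] / n for i in range(2)]
        b -= lr * grad_b / n
    return w, b


# A small task-specific dataset: y = 2 * x^2 + 1.
data = [(x, 2 * x * x + 1) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = fine_tune(data)
print(w, b)
```

Only five examples are needed here because the frozen features already capture the useful structure, which is the same reason a pre-trained model needs little data for fine-tuning.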

What is wav2vec2.0?

This is a pre-trained model, that is, a model that was already trained on a huge amount of audio data when it was built. The amount of training data is truly enormous; you could not reproduce it in a single graduate school lab. The advantage is that only a small amount of data is needed for fine-tuning, because the model has been trained on a huge amount of data in advance.
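
wav2vec2.0 models are commonly fine-tuned with a CTC objective, in which the model emits one label per short audio frame, including a special blank label. The sketch below assumes that setup and shows the usual greedy decoding rule: merge repeated labels, then drop the blanks. The blank name `<blank>` and the function name are assumptions for illustration, not fairseq's API.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed name for the CTC blank symbol


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


frames = ["<blank>", "k", "k", "<blank>", "a", "a", "a",
          "<blank>", "<blank>", "t", "t"]
print(ctc_greedy_decode(frames))
```

The blank label is what lets the model output the same phoneme twice in a row: `["a", "<blank>", "a"]` decodes to two separate phonemes rather than one.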

Experimental setup

About the Learning Model

The model used in this study is wav2vec2.0 XLSR-53, a multilingual model pre-trained on speech in 53 languages.

About the Data Set

Three major multilingual speech corpora are used. Examples of languages include Dutch, French, German, Italian, and Portuguese.

In addition to these, many other languages are used, and the total duration of the audio is very large.

Training must have taken a long time on a very high-performance computer.

About Model Learning

The model is implemented in fairseq, an open-source toolkit for building machine learning models, published by Meta (formerly Facebook) on GitHub.

Anyone with knowledge of Python and a little English can use it for free to build machine learning models, so please take a look if you are interested.

Back to the main topic: the study uses the XLSR-53 model, which has been pre-trained on about 56,000 hours of speech. I will not go into the training parameters here.

Do you understand so far? A quick review

There are only three important things!

Let's just hold on to these!

  1. Train on multilingual datasets and attempt to transcribe unlearned languages
  2. Use wav2vec2.0 XLSR-53
  3. Large-scale training with delicate parameter tuning is required

As long as you have these three things in mind, the rest will be fine!

What are the results of the experiment?

Comparison with unsupervised methods

Now for the first experiment: let's compare zero-shot transfer learning with unsupervised wav2vec2.0. The same model is used for both.

The results show that zero-shot transfer learning and the unsupervised model are about equal in performance. This is honestly surprising. If this holds up, practical use in a wide variety of languages becomes realistic.
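
Performance in phoneme recognition is usually reported as phoneme error rate (PER): the edit distance between the predicted and the reference phoneme sequences, divided by the length of the reference. The sketch below assumes that metric; the function names and example sequences are illustrative and not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = ["h", "e", "l", "o"]
predicted = ["h", "l", "o"]  # one phoneme was missed
print(phoneme_error_rate(reference, predicted))
```

A lower value is better; two systems with similar PER on the same test set can be considered about equal in performance.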

This technology will become very important as IoT devices become more and more widespread.

Comparison with other zero-shot methods

Let's compare its performance with the model from the work that preceded this study. Here again, the authors demonstrate how practical the zero-shot approach is, especially from a company's point of view: it is much less data-intensive than building individual models. In some cases, the results even outperform those of supervised models, which makes this a truly innovative approach.

The bottleneck, however, is that training takes a huge amount of time, so the work is difficult to reproduce without a university or company that has a supercomputer.

Summary of the paper

Thank you all for your hard work. What I presented this time was zero-shot transfer learning for unlearned languages using multilingual data. English and other major languages have been the focus of most research in speech recognition.

There are so many languages in the world that it is very expensive and labor-intensive to build models for each of them.

In that respect, this zero-shot method has great potential. What did you all think?

The two major results of this study can be summarized as follows:

  1. No need to create a model specifically for each unlearned language
  2. High accuracy, not inferior to supervised and unsupervised models

A little chat with a fledgling writer, Ogasawara

Experiments in information systems are as simple as can be.

Once you've made a hypothesis and written a program, you can leave it alone until it finishes. Writing the program, though, is a messy process. Even so, I am happy when the results are in line with my hypothesis, and even if they are not, it is interesting to think about why they turned out the way they did.

I often build AI-related programs for speech recognition, so I use a lot of libraries, but I have been using them as black boxes without really understanding what is inside. I don't think that is a good idea, but there are many kinds of libraries, and each one has many usable functions. It would be a bit tedious to understand them all, wouldn't it?

I wonder how engineers deal with libraries?

Libraries are extremely useful, but it is common for programmers to struggle with fixing an error when its cause lies inside a library.

See you in the next article.

This is Ogasawara, a fledgling writer~.

See you later!
