[WavLM] Surpassing All Previous Speech Recognition Models! What Are Its Structure and Performance?
3 main points
✔️ Addresses speech tasks involving multiple (overlapping) speakers
✔️ Improves performance through masked prediction during training and a vastly expanded amount of training data
✔️ Achieves high performance on a wide variety of tasks, not just speech recognition
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
written by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
[Submitted on 26 Oct 2021 (v1), last revised 17 Jun 2022 (this version, v5)]
Comments: Submitted to the Journal of Selected Topics in Signal Processing (JSTSP)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The images used in this article are from the paper, the introductory slides, or were created based on them.
What Is WavLM, the "Super All-Rounder"?
Read at Least This Part! Super Summary of The Paper!
WavLM is a relatively new artificial intelligence model that can perform a wide variety of speech processing tasks. Unlike conventional approaches, whether supervised, unsupervised, or self-supervised models such as wav2vec 2.0, WavLM can handle not only speech recognition but also speaker identification, speech separation, and many other speech-related tasks.
In recent years, self-supervised learning (e.g., wav2vec 2.0) has been very successful in speech recognition, but its application to other speech processing tasks has been limited. Since speech carries a wide variety of information, such as speaker characteristics and emotions, the challenge was to develop a "dream model" that could be applied to all of these tasks.
This research aims to develop a general-purpose pre-training model that can perform a variety of tasks such as speech recognition and speaker identification with a single model.
WavLM has demonstrated performance exceeding that of previous models on SUPERB, the standard benchmark suite for speech processing tasks, and has also achieved significant accuracy improvements on a variety of other tasks.
Whereas earlier models were specialized for particular tasks, WavLM performs well across all of them, and this is largely due to its training method of predicting masked regions and its far more diverse and larger training data.
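To get a concrete feel for how such a general-purpose model is used, here is a minimal sketch of extracting features with a publicly released WavLM checkpoint through the Hugging Face transformers library. The checkpoint name and the dummy audio are assumptions for illustration; in SUPERB-style evaluation the pre-trained model is kept frozen and only a small task-specific head is trained on top of these features.

```python
# A minimal sketch of using a pre-trained WavLM checkpoint as a frozen
# feature extractor, the way SUPERB-style downstream tasks consume it.
# Assumes the `transformers` and `torch` packages and the publicly
# released `microsoft/wavlm-base-plus` checkpoint.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()  # the upstream stays frozen; only a downstream head would be trained

waveform = torch.randn(16000)  # 1 second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# (batch, frames, hidden) features that a downstream head would consume
print(outputs.last_hidden_state.shape)
```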
This research shows the potential of a general-purpose model that can extract diverse information from speech and apply it to a variety of tasks, and it is proof that we are one step closer to that dream model.
The success of this model points toward smaller, more versatile models that understand human speech from many angles and can be applied in many ways, and, in the future, models that handle these tasks even more naturally and effectively!
Is The Structure of WavLM Similar to The Structure of HuBERT?
First, I need to explain a little about HuBERT.
HuBERT is a self-supervised model developed by Meta, and its distinguishing feature is how it creates training targets: frame-level speech features are clustered with k-means, and the cluster IDs serve as pseudo-labels. It also requires thousands to tens of thousands of hours of pre-training, making it an extremely powerful model that is virtually impossible for an individual or even a university lab to reproduce.
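To make the k-means part concrete, here is a rough, illustrative sketch of how frame-level features can be clustered so that the cluster IDs become discrete pseudo-labels. This is not the paper's exact pipeline; the file name, the choice of MFCC features, and the cluster count are all assumptions.

```python
# Sketch of HuBERT-style pseudo-label generation: cluster frame-level
# features (here MFCCs) and use the cluster IDs as the discrete targets
# the model must predict at masked positions.
import librosa
from sklearn.cluster import KMeans

# Hypothetical audio file, resampled to 16 kHz
waveform, sr = librosa.load("example.wav", sr=16000)

# Frame-level MFCC features: shape (n_frames, n_mfcc)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13).T

# Cluster the frames; 100 clusters is an arbitrary choice for this sketch
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc)

# One discrete pseudo-label per frame -- these become prediction targets
pseudo_labels = kmeans.labels_
print(pseudo_labels[:20])
```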
Now look at the figure below. This figure shows the structure of HuBERT.
Briefly, the input audio passes through a CNN encoder and is then processed by a Transformer, but there is one key feature here: a portion of the encoded input is masked (hidden) before it reaches the Transformer.
In other words, the model is not given the complete audio information; part of it is deliberately made missing. At the output of the Transformer, the model is then forced to predict exactly what was hidden from it.
By training on this task of deliberately creating missing data and then predicting it, the model becomes far more powerful than conventional models.
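As a toy illustration of this mask-and-predict idea, the sketch below hides a span of frame features, runs everything through a small Transformer, and computes the loss only on the hidden frames. All sizes and modules are made up for illustration and are much smaller than the real model.

```python
# Toy sketch of the mask-and-predict objective described above.
import torch
import torch.nn as nn

frames, dim, n_clusters = 100, 256, 100
features = torch.randn(1, frames, dim)               # stand-in for CNN-encoded frames
targets = torch.randint(0, n_clusters, (1, frames))  # k-means pseudo-labels per frame

# Hide a contiguous span of frames by overwriting it with a "mask" vector
mask = torch.zeros(1, frames, dtype=torch.bool)
mask[:, 40:60] = True
features[mask] = torch.randn(dim)                    # stands in for a learned mask embedding

# A small stand-in for the Transformer encoder plus a prediction head
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(dim, n_clusters)

logits = head(encoder(features))                     # (1, frames, n_clusters)
# The loss is computed only on the frames the model never actually saw
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
print(loss.item())
```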
Back to the topic at hand: WavLM, the model introduced in this article, has almost the same structure. Of course, if you dig into the paper you will find mathematical and structural differences, but I have omitted them here so that the overall picture is easy to grasp.
On Which Datasets was WavLM Trained and to What Extent?
The table above shows some of the experimental results. The details may not be obvious yet, but you can already see that WavLM performs better than the other models, can't you?
Wouldn't you like to know how such an amazing model is trained? It's quite something.
Now, let's dive in.
The datasets used are:
| Libri-Light | 60,000 hours |
| GigaSpeech | 10,000 hours |
| VoxPopuli | 24,000 hours |
This alone is staggering, roughly 94,000 hours in total. It is on a level that makes me want to back away slowly. I usually experiment with a little over an hour of data, so we are worlds apart.
Then, what about learning methods?
- Mask a portion of the input audio, and mix artificial noise or another speaker's voice into it
- Train the model to simultaneously predict the speech content of the masked portion and to ignore (denoise) the added interference
- Use a Transformer to accurately capture the sequential structure of the speech
Training follows these three steps. If you are a little confused, it may help to take another look at the diagram explaining the structure; a rough code sketch of the same idea follows below.
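To give a feel for the "noisy input, clean target" part of this recipe, here is a simplified, illustrative sketch of mixing an interfering signal into part of an utterance while the prediction targets stay tied to the original clean audio. The function and its parameters are my own simplification, not the paper's code.

```python
# Rough sketch of mixing noise or an interfering speaker into part of the
# input while keeping the clean utterance as the source of the prediction
# targets. Values (lengths, SNR) are illustrative only.
import torch

def mix_in_interference(clean: torch.Tensor, interference: torch.Tensor,
                        snr_db: float = 5.0) -> torch.Tensor:
    """Overlay `interference` onto a random region of `clean` at roughly
    the given signal-to-noise ratio (a simplified form of utterance mixing)."""
    length = clean.shape[0] // 2                          # corrupt up to half the utterance
    start = torch.randint(0, clean.shape[0] - length, (1,)).item()
    segment = interference[:length]

    clean_power = clean.pow(2).mean()
    seg_power = segment.pow(2).mean() + 1e-8
    scale = torch.sqrt(clean_power / (seg_power * 10 ** (snr_db / 10)))

    noisy = clean.clone()
    noisy[start:start + length] += scale * segment
    return noisy

clean = torch.randn(16000)   # main speaker (dummy 1 s @ 16 kHz)
other = torch.randn(16000)   # interfering speaker or noise
noisy_input = mix_in_interference(clean, other)
# `noisy_input` is what the model hears; the masked-prediction targets are
# still computed from `clean`, so the model learns to "denoise" as it predicts.
```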
Finally, as for training time, there is no clear description of which GPUs were used or how long training took, so I can only speculate, but I suspect they used a supercomputer-class machine and spent many days on it.
All I can say is that it is amazing, because I would never be able to reproduce it. Not just me; even a university professor might have a tough time. I really have to take my hat off to the funding and research power of these overseas companies.
What Impact Did WavLM Have?
Finally, let's look at this table, which has appeared many times in this article, and see what impact this model has had on the results.
Yes, let's start with the conclusion. What this table shows is the word error rate on a speech separation task! Since this is an error rate, the lower the number, the better the performance. (I once got this backwards and embarrassed myself in front of my teacher when reporting results...)
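If you want to see what the word error rate actually measures, here is a tiny example using the jiwer package; the sentences are made up.

```python
# The word error rate (WER) counts substitutions, deletions, and insertions
# against a reference transcript, so lower is better.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print(jiwer.wer(reference, hypothesis))  # 2 errors / 9 words, roughly 0.22
```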
Do you know what speech separation is? Suppose two people speak at the same time. If you were Prince Shotoku, who legend says could listen to many people at once, you might be able to follow both, but it is impossible for the average person, and of course the same is true for speech recognition models. In such cases, the speech separation task is used to pull apart the two overlapping voices.
The word error rate then tells us, for each separated voice, whether the transcription sounds right and whether there are any major mistakes.
If you look at the table, you can see that WavLM's values are lower than those of the other models. That is how accurate this model's speech separation is.
In this paper, experiments were conducted on a wide variety of speech-related tasks, and WavLM outperformed the other models on all of them. I would love to introduce every result, but that would take considerable space, so I am presenting only this table, which is the easiest to understand.
If you're curious, I've posted a link to the paper so you can check it out!
A Little Chat with Rookie Writer Ogasawara
We are looking for companies and graduate students who are interested in conducting collaborative research!
His specialty is speech recognition (experimental), especially for speakers with dysarthria.
Resources in this field are limited, and there is only so much one person can tackle alone.
Would anyone like to join us in solving social issues with the latest technology?