Artificial Intelligence Developed By Meta! How Well Does The "HuBERT" Model, Which Is Different From Conventional Self-supervised Learning Models, Perform?
3 main points
✔️ Loss function to predict only masked regions
✔️ Leveraging cluster ensembles
✔️ Iterative teacher label refinement
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
written by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed
[Submitted on 14 Jun 2021]
Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Nice to meet you all!
I am Ogasawara, a new writer for AI-SCHILAR.
The paper presented here is this one:
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
As I summarized the main points at the beginning of this article, the goal seems to be to demonstrate the usefulness of the HuBERT model for solving the problems inherent in self-supervised learning models.
I wonder what kind of methods are being used! Let's learn about them together, little by little~~!
I will try to introduce them in as concise a manner as possible, so please bear with me until the end.
Outline of this study
Traditional self-supervised learning models for speech have three main problems.
- Each input utterance contains multiple sound units.
- There is no lexicon of sound units during the pre-training phase.
- Sound units have variable length and no explicit segmentation.
The model proposed in this paper to solve these problems is HuBERT. It aims to improve recognition accuracy by taking a particular approach to the loss function: the prediction loss is applied only over the masked regions.
Test results for this model show a relative improvement in word error rate (WER) of up to 19%.
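To make that first main point concrete, here is a minimal PyTorch sketch of a loss computed only over the masked frames. This is my own illustration under simplified assumptions (the function name and tensor shapes are made up for this example), not the authors' code.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(logits, target_units, mask):
    """Cross-entropy computed only where the input was masked.

    logits:       (batch, frames, num_clusters) model predictions
    target_units: (batch, frames) k-means cluster IDs used as teacher labels
    mask:         (batch, frames) True at frames that were masked
    """
    masked_logits = logits[mask]          # (num_masked, num_clusters)
    masked_targets = target_units[mask]   # (num_masked,)
    return F.cross_entropy(masked_logits, masked_targets)
```

The key design choice is that unmasked frames contribute nothing to the loss, which forces the model to infer the hidden units from surrounding context rather than just copying its input.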
Proposed Method
Figure (a) above shows the method proposed in this paper.
I know it is difficult to understand this all by looking at it, so let's unpack it one at a time.
The goal of this article is to be able to look at and understand this model after reading it.
I'm going to chew it up and explain it to you as I mentioned earlier, so please follow me to the end!
CNN (Convolutional Neural Network)
Many readers may be familiar with this one. It's famous and often used in the imaging field. To briefly explain how it works, a mechanism called the convolution layer is incorporated into the structure of a neural network.
One of the advantages of the convolutional layer is that it can extract high-level features while retaining information from the input source. In the method in this paper, the authors incorporate it as a feature extractor.
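As a rough illustration of what a convolutional feature extractor over a raw waveform might look like, here is a toy sketch (the layer sizes and strides are placeholders; the real HuBERT/wav2vec 2.0 front end uses a deeper, specific stack of strided convolutions):

```python
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    """Toy 1-D convolutional front end: raw waveform -> frame-level features."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, feature_dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(feature_dim, feature_dim, kernel_size=3, stride=2), nn.GELU(),
            nn.Conv1d(feature_dim, feature_dim, kernel_size=3, stride=2), nn.GELU(),
        )

    def forward(self, waveform):           # waveform: (batch, samples)
        x = waveform.unsqueeze(1)          # -> (batch, 1, samples)
        x = self.layers(x)                 # -> (batch, feature_dim, frames)
        return x.transpose(1, 2)           # -> (batch, frames, feature_dim)
```

The strided convolutions downsample the waveform into a much shorter sequence of feature frames, which is what gets passed on to the Transformer.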
Transformer
In brief, the Transformer is a model, distinct from both CNNs and RNNs, built around a mechanism called Attention. Attention is a score that indicates which other words to focus on when the meaning of a word in a sentence is ambiguous.
In the long history of machine learning research, this model is so new and so outstanding that it can fairly be called a revolution in the field. BERT and its family of models, along with many other excellent models, are evolutions that incorporate this mechanism.
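To give a concrete feel for the Attention score mentioned above, here is a minimal sketch of scaled dot-product attention, the core operation inside a Transformer layer. It is a plain illustration, not the exact formulation or hyperparameters used in the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """query/key/value: (batch, frames, dim).
    Returns a weighted mix of `value`, where the weights say how strongly
    each frame should attend to every other frame in the sequence."""
    dim = query.size(-1)
    scores = query @ key.transpose(-2, -1) / dim ** 0.5   # (batch, frames, frames)
    weights = F.softmax(scores, dim=-1)                    # attention weights
    return weights @ value
```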
About Masking
Look again at Figure (a): between the CNN and the Transformer you can see a spot marked MSK. This is the masking step. Explaining it in full detail would take too long, so briefly: rather than passing on all of the features output by the CNN encoder, randomly selecting and masking part of them before passing them on tends to produce better learning results.
This technique is often used not only for speech, but also for images and natural language.
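As a toy illustration of this masking idea (my own simplified version; the paper uses span masking with specific probabilities and a learned mask embedding), the step could look roughly like this:

```python
import torch

def random_mask(features, mask_embedding, mask_prob=0.08, span=10):
    """Randomly mask spans of frames by overwriting them with a mask embedding.

    features:       (batch, frames, dim) output of the CNN feature extractor
    mask_embedding: (dim,) learned vector that replaces masked frames
    Returns the masked features and a boolean map of which frames were replaced.
    """
    batch, frames, _ = features.shape
    mask = torch.zeros(batch, frames, dtype=torch.bool, device=features.device)
    starts = torch.rand(batch, frames) < mask_prob        # candidate span starts
    for b, t in starts.nonzero():
        mask[b, t:t + span] = True
    masked = features.clone()
    masked[mask] = mask_embedding                          # overwrite masked frames
    return masked, mask
```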
With the prerequisite knowledge in place
So far I have covered the preliminary knowledge, but have you all been able to keep up?
A paper looks very difficult, and in fact it requires a very high level of knowledge to understand from the text alone.
But researchers also do a lot to make their ideas and results easy to grasp. One of these is the figures in the paper. Let's decipher Figure (a), shall we? The way to read this figure is from bottom to top.
- Speech audio is input
- The input speech waveform is passed to the CNN and converted into a feature representation.
- The transformed feature representation is sent to the transformer
- Part of the feature sequence is masked before being fed into the transformer
- The transformer is trained to predict the acoustic units obtained by k-means at the masked positions
This is the flow of the HuBERT model. Do you understand it? I know some of you may have thought, "What the heck is this?" when you first saw the figure.
In this article, I have omitted a detailed explanation of the mechanism and formulas, since the purpose is to give you a rough idea of what the model is like. The flow is simple, but the model is very elaborate. The paper also includes a mathematical explanation of the model, so if you are interested, I recommend reading the original.
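Putting the pieces together, the bottom-to-top flow of Figure (a) could be sketched as follows. This is a conceptual skeleton, not the authors' implementation: it reuses the toy helpers from the earlier sketches (ConvFeatureExtractor, random_mask, masked_prediction_loss), and the class name and hyperparameters are placeholders of my own.

```python
import torch
import torch.nn as nn

class ToyHuBERT(nn.Module):
    """Waveform -> CNN features -> mask -> Transformer -> predict cluster units."""
    def __init__(self, feature_dim=512, num_clusters=100, num_layers=4):
        super().__init__()
        self.feature_extractor = ConvFeatureExtractor(feature_dim)  # CNN sketch above
        self.mask_embedding = nn.Parameter(torch.randn(feature_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(feature_dim, num_clusters)

    def forward(self, waveform, target_units):
        # target_units: per-frame k-means IDs, assumed already aligned to the CNN frame rate
        features = self.feature_extractor(waveform)                 # (batch, frames, dim)
        masked, mask = random_mask(features, self.mask_embedding)   # masking sketch above
        hidden = self.encoder(masked)
        logits = self.classifier(hidden)                            # (batch, frames, clusters)
        # loss is computed only on the masked frames, as in the outline above
        return masked_prediction_loss(logits, target_units, mask)
```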
Did you follow? A quick recap
There are only three important things!
Let's just hold on to this!
- HuBERT is a BERT-style model built around hidden units (Hidden-Unit BERT)
- The architecture is simple: just a feature extractor and a Transformer
- The mathematical approach runs deep and rewards closer study
As long as you have these three things in mind, the rest will be fine!
In the next section, we will look at the experiment.
This is where we start! About the Experiment
Thank you very much for reading this long explanation of the basics. Next, I will finally explain the most interesting part of the paper, the experiments.
Experimental setup
Now let's talk about the experimental setup. In this experiment, the authors use 960 hours of Librispeech and 60,000 hours of Libri-light for pre-training the model. For fine-tuning, they use Libri-light (10 minutes, 1 hour, 10 hours) or Librispeech (100 hours, 960 hours) again. Finally, k-means cluster assignments are used as the teacher labels.
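For the teacher labels, here is a minimal sketch of how per-frame k-means cluster IDs might be generated from MFCC features using librosa and scikit-learn. It only illustrates the idea; the paper's actual pipeline (frame rates, feature choices, cluster counts, and the iterative refinement of labels) differs.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

def make_teacher_labels(wav_paths, n_clusters=100, sr=16000):
    """Extract MFCC frames from each file, fit k-means, return per-frame cluster IDs."""
    per_utt_frames = []
    for path in wav_paths:
        wav, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)  # (13, frames)
        per_utt_frames.append(mfcc.T)                          # (frames, 13)
    all_frames = np.concatenate(per_utt_frames)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(all_frames)
    # Per-utterance cluster IDs serve as pseudo-labels for masked prediction
    return [kmeans.predict(f) for f in per_utt_frames], kmeans
```

In later iterations, the same clustering step can be rerun on features taken from a partially trained HuBERT model instead of MFCCs, which is the "iterative teacher label refinement" mentioned in the main points.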
Three model configurations are designed: HuBERT BASE, LARGE, and X-LARGE. They are based on the wav2vec 2.0 architecture and have 95M, 317M, and 964M parameters, respectively.
What are the results of the experiment?
Low-resource (Libri-light: 10 minutes to 100 hours) evaluation
Experimental results in the low-resource setting show that HuBERT LARGE and X-LARGE outperform wav2vec 2.0. Improvement is seen even with as little as 10 minutes of labeled data.
High-resource (Librispeech: 960 hours) evaluation
HuBERT LARGE achieved results comparable to wav2vec 2.0, while X-LARGE showed a relative WER improvement of up to 13%.
Quality analysis of teacher labels
The k-means clustering is stable and shows a slight performance improvement as the amount of data increases. Clustering on features from the HuBERT model also produced teacher labels of significantly better quality than clustering on MFCCs.
Summary of the Paper
Thank you all for your hard work. This article introduced the HuBERT model, which addresses the problems of self-supervised learning models. For me, the results were very interesting: in the low-resource setting, LARGE had an advantage over wav2vec 2.0, but as the amount of data grew, the two models became comparable.
After all, you can't know how research will turn out unless you try. It was a very good paper, and it overturned my own expectations.
The results of this study can be summarized in two major points:
- The ability to generate better-quality teacher labels than conventional MFCC features, mainly through its feature extraction
- A WER improvement over wav2vec 2.0
A little chat with a chick writer, Ogasawara
Participate in academic conferences~!
The big event in the master's program is the "conference presentation"! I'm both looking forward to it and a little anxious about it, but mostly looking forward to it.
It is a very valuable opportunity to have my research results listened to seriously and to get a response from an expert, so I have to make sure I make the most of it.
See you in the next article.
This is Ogasawara, a newbie chick writer~.
See you later!