[Unit-DSR] Normalizing Dysarthric Speech to Normal Speech with HuBERT
3 main points
✔️ An innovative speech unit-based method for dysarthric speech reconstruction
✔️ Highly versatile and efficiently trainable HuBERT model
✔️ High functionality through a simple two-module structure
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization
written by Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng
(Submitted on 26 Jan 2024)
Comments: Accepted to ICASSP 2024
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Nice to meet you all!
I am Ogasawara, a new writer for AI-SCHILAR.
The paper I am introducing here is
"UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization."
As summarized in the introduction, Unit-DSR is an innovative speech reconstruction method built on the "HuBERT" model developed by Meta, and it aims to remove communication barriers by converting dysarthric speech into natural, intelligible speech.
I wonder what kind of methods are being used! Let's learn together, little by little~~!
I will try to introduce them in as concise a manner as possible, so please bear with me until the end.
Proposed Method
Figure (a) above shows the method proposed in this paper [Unit-DSR].
I know it is hard to take all of this in at a glance, so let's unpack it one piece at a time.
There are only "two" things that are important in Unit-DSR!
To begin with, Unit-DSR is a model built from a speech unit normalizer and a Unit HiFi-GAN.
To elaborate a bit more,
- Speech unit normalizer: converts the speech of dysarthric speakers into normal speech patterns, making the data easier to handle.
- Unit HiFi-GAN: generates speech directly from the unit sequence produced by the speech unit normalizer.
If you remember these two things, you will be able to understand the rest much better!
"HuBERT" is the cornerstone of this study
Now that you've learned the important parts of the proposed method, let's take a deeper look at the model~!
We will proceed slowly, so please follow us closely.
First, let me introduce HuBERT.
Simply put, HuBERT is an evolution of BERT.
In more detail, it is a self-supervised learning model: it creates pseudo-labels by clustering speech features with the k-means method and applies the prediction loss only to the masked regions.
In my opinion, the key point is that it is a self-supervised learning model. This is because it is very difficult to collect speech from people with dysarthria; for many people with cerebral palsy in particular, the act of "speaking" itself is very taxing.
Self-supervised learning is particularly important in the field of dysarthric speech recognition, since many recent AI models require vast amounts of training data. The source code for HuBERT is available on GitHub, so if you are interested, try implementing it yourself for a deeper understanding.
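To give a feel for the "prediction loss only on masked regions" idea, here is a tiny conceptual sketch in PyTorch. This is my own illustration, not HuBERT's actual implementation; the feature dimensions, the masked span, and the small encoder are all placeholder choices.

```python
import torch
import torch.nn as nn

# Toy setup: 1 utterance, 100 frames, 768-dim features, 100 k-means pseudo-labels.
frames = torch.randn(1, 100, 768)                  # frame-level acoustic features
pseudo_labels = torch.randint(0, 100, (1, 100))    # cluster IDs from k-means

# A deliberately tiny "encoder" standing in for HuBERT's transformer.
encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 100))

# Mask a contiguous span of frames; the model must predict their pseudo-labels.
mask = torch.zeros(1, 100, dtype=torch.bool)
mask[:, 10:30] = True
masked_frames = frames.clone()
masked_frames[mask] = 0.0                          # crude masking, for illustration only

logits = encoder(masked_frames)                    # shape (1, 100, 100)

# Cross-entropy computed ONLY over the masked positions.
loss = nn.functional.cross_entropy(logits[mask], pseudo_labels[mask])
print(loss.item())
```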
It's not hard! CTC Loss function
A loss function measures the "discrepancy" between the predicted value and the correct answer. In this method, CTC loss is used.
What is CTC Loss?
For those who are not familiar with this function, it is a loss function that is often used to label time series data.
It is often used in the field of speech recognition, so please just remember that it is "often used for labeling time series data"!
As a supplement, the main benefit of this loss is that it automatically finds an appropriate alignment even when the input sequence (audio) and the output sequence (text) have different lengths.
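To make this concrete, here is a minimal PyTorch sketch of CTC loss. It is not code from the paper; the tensor shapes and label sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 21             # time steps, batch size, classes (including the blank)
S = 30                          # maximum target length

ctc_loss = nn.CTCLoss(blank=0)

# Model outputs: log-probabilities of shape (T, N, C).
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Targets: label IDs (no blanks), plus the true lengths of inputs and targets.
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, S, (N,), dtype=torch.long)

# CTC sums over all possible alignments, so unequal lengths are handled automatically.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```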
Finally, we move on to the explanation of the diagram!
So far, this has been a prelude to acquiring the knowledge to understand this diagram.
Now, here is a fun time to understand the diagram.
Let's read it together!
Let's start with the blue "start" on the left.
- The subject's speech (able-bodied or dysarthric) is taken as input.
- It is fed into the HuBERT model, whose weights initialize the speech unit normalizer, while the extracted features also go to the k-means model (the latter path is what is explained here).
- In the k-means model, the speech is converted into a sequence of discrete numbers (a unit sequence).
- The unit sequence is read and consecutive duplicate units are removed.
- The result is used as the target for the CTC loss.
This sequence of processes represents the extraction of normalization units.
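As a rough illustration of this unit-extraction idea, here is a minimal sketch. It is not the paper's code: the features are random stand-ins for HuBERT features, and the cluster count is an arbitrary placeholder.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level HuBERT features of shape (num_frames, feature_dim).
features = np.random.randn(200, 768)

# 1) Cluster the frames into discrete units with k-means.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)                # e.g. [13, 13, 13, 42, 42, 7, ...]

# 2) Remove consecutive duplicates to get the unit sequence
#    that serves as the CTC target for the speech unit normalizer.
dedup_units = [u for u, _ in itertools.groupby(units.tolist())]
print(dedup_units[:20])
```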
Next, the green start on the right.
- A randomly selected speech sample (able-bodied or dysarthric) is taken as input.
- It is fed into the speech unit normalizer.
- The speech waveform is converted into a sequence of normalized units.
- The Unit HiFi-GAN generates the speech waveform from those units.
This is the general flow of the proposed method.
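Putting the two modules together, the inference path can be sketched roughly as below. The function and the stand-in callables are hypothetical names I made up for illustration, not the paper's actual API.

```python
import numpy as np

def reconstruct(waveform, normalizer, vocoder):
    """Dysarthric waveform -> normalized unit sequence -> reconstructed waveform."""
    units = normalizer(waveform)       # speech unit normalizer (HuBERT-based)
    return vocoder(units)              # Unit HiFi-GAN generates audio from the units

# Dummy stand-ins so the sketch runs end to end:
dummy_normalizer = lambda w: np.random.randint(0, 100, size=50)   # fake unit sequence
dummy_vocoder = lambda u: np.zeros(16000)                          # fake 1-second waveform at 16 kHz
out = reconstruct(np.zeros(16000), dummy_normalizer, dummy_vocoder)
print(out.shape)
```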
The detailed theory is somewhat difficult to follow, so we would like you to grasp the general framework of this method first.
Anyway, I have tried to break everything down as much as possible, and I hope you are following along.
So let's take a moment here to reflect on what we have learned!
Do you understand? A quick review
There are only three important things!
Let's just hold on to this!
- Unit-DSR is a model built from the speech unit normalizer and Unit HiFi-GAN
- HuBERT is an evolution of BERT
- A loss function measures the "discrepancy" between the predicted value and the correct answer.
As long as you have these three things in mind, the rest will be fine!
In the next section, we will look at the experiment.
This is where we start! About the Experiment
Thank you very much for reading this long explanation of the basics. Next, I will finally explain the most interesting part of the paper, the experiment.
Database used
The corpus UASpeech was used in the development of this system.
A unique feature of this corpus is that it includes not only the speech of normal people, but also that of people with dysarthria.
As a side note, there are various speech databases available in Japan, including ones based on readings of the ITA corpus, but all of them are built from the voices of healthy speakers; as far as I could tell, there are none that record the voices of people with speech disabilities.
As I mentioned at the beginning of this article, it is extremely difficult to collect speech from people with dysarthria, let alone build a database of it, but I believe that research on the speech of people with disabilities will not progress without databases that are accessible to everyone.
I sincerely hope that database research will progress in Japan as well.
System design used
The system used in this experiment is the Unit-DSR system shown earlier in Figure (a).
The paper includes detailed explanations of the experimental conditions, such as parameter tuning and the details of each layer, but since the purpose of this article is to give you a general idea of the paper, I will omit them.
If you are interested, we have included the URL of the paper and encourage you to read it for yourself!
What are the results of the experiment?
Is the performance of this proposed method improved compared to the previous one?
Let us look at the results of the experiment from two perspectives.
1: Content Restoration
In assessing this item, the MOS test and the speech recognition test are used.
The MOS test collects subjective data: 20 listeners were recruited and asked to compare randomly selected reconstructed speech with the original speech and rate how similar the two sounded.
Next was the speech recognition test. This test was conducted to collect objective data, and the word error rate was measured using a speech recognition model called Jasper.
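For reference, the word error rate (WER) is the word-level edit distance between the recognizer's output and the reference transcript, divided by the number of reference words. Here is a minimal sketch of the calculation (my own illustration, not the paper's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("open the window please", "open a window"))  # 0.5
```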
The image above shows the results. The area highlighted in yellow is the system of this experiment.
In conclusion, the system clearly demonstrates its usefulness in reconstructing speech with accurate content and pronunciation.
However, the challenge is that the reconstructed speech still contains many phoneme errors, and the results of speech recognition tests have been poor.
Still, it is great and gratifying that the authors were able to show content-recovery accuracy significantly better than that of previous models.
2: Will changes in the sound source environment affect accuracy?
This evaluation item examines the extent to which changes in the distribution of input speech affect the reconstructed normalization units.
Specifically, the playback rate of dysarthric speech is varied to simulate changes in the patient's speech rate. Noise is then intentionally added to account for various recording conditions.
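As a rough idea of this kind of perturbation, here is a sketch using librosa. The file path, stretch rates, and signal-to-noise ratio are placeholder values, not the paper's exact settings.

```python
import numpy as np
import librosa

# Load a waveform (the path and sampling rate are placeholders).
y, sr = librosa.load("dysarthric_sample.wav", sr=16000)

# 1) Simulate a change in speaking rate by time-stretching the audio.
y_fast = librosa.effects.time_stretch(y, rate=1.2)   # 20% faster
y_slow = librosa.effects.time_stretch(y, rate=0.8)   # 20% slower

# 2) Simulate a noisy recording condition by adding white noise
#    at a chosen signal-to-noise ratio (20 dB here, an arbitrary choice).
snr_db = 20.0
noise = np.random.randn(len(y))
scale = np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
y_noisy = y + scale * noise
```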
This evaluation experiment confirmed that the Unit-DSR system is robust to fluctuations in the input speech distribution.
For my part, I am very excited about the results.
This is because being resistant to noise greatly improves the possibility of practical use in everyday conversation and communication with people on the go.
Conventional models are susceptible to noise, especially in the case of dysarthric speech, and are only accurate in controlled settings such as an experimental recording environment, so I must say this is a very innovative technology.
Summary of the Paper
The Unit-DSR system proposed in this paper normalizes dysarthric speech into normal speech patterns and generates waveforms directly from speech units.
The results of this study can be summarized as follows
- It is the first to introduce speech units into the DSR task, achieving better performance than previous models
- It significantly improves training efficiency by using HuBERT, with its high adaptive capacity
These are the two major results.
A little chat with a chick writer, Ogasawara
Well, this paper was really eye-opening because it contained very groundbreaking content!
Two points that I think are great about this paper are
- Self-supervised learning models are used to eliminate the lack of training data.
- We advocate a model that is robust in noisy environments.
That's the point.
I tend to read papers written in Japanese rather than English, and many of them mention the above two issues. I think it is really great that this paper has found a solution to one of these issues.
That said, this experiment is very large-scale. It uses speech from multiple dysarthric speakers, as well as a speech database of thousands of hours.
It is unfortunate that further developing this experiment seems somewhat difficult, since training on this amount of data would require several expensive GPUs and substantial funding.
Let's cut the chit-chat around here, shall we?
Well, thank you so much to all the readers who have read to the end.
As this is an article by a newbie writer, there may have been some parts here and there that were difficult to read or understand.
Nevertheless, I am very happy if I was able to give some interesting knowledge to everyone who read to the end.
See you in the next article.
This is Ogasawara, a newbie chick writer~.
See you later!