
[Be Who You Are: ....] Leaving No Dysarthric or Elderly Speakers Behind; A2A Inversion Improves Speech Recognition Rates

Speech Recognition For The Dysarthric

3 main points
✔️ Combining self-supervised learning (SSL) with conventional speech recognition techniques greatly improves recognition accuracy for dysarthric speakers and the elderly
✔️ For the input features, fusing a variety of conventional methods is effective
✔️ Error rates improve significantly, by up to around 30%

Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition
written by Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu
[Submitted on 3 Jul 2024]
Comments:   IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects:   Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)

The images used in this article are from the paper, the introductory slides, or were created based on them.

For A Society Where People with Disabilities and The Elderly can Be Themselves...

Read at Least This Part! Super Summary of The Paper!

Isn't speech recognition convenient? No typing, no flick input. These days people talk about smart homes, and almost anything is possible as long as you say the right words. But what about dysarthric and elderly people whose pronunciation is affected? Current systems are built for typical speakers, so these users cannot benefit from the latest technology. This paper is a sincere attempt to address that issue.

As you probably know, self-supervised learning (SSL) models have been performing well across a wide range of speech-related tasks in the speech recognition community. However, feeding dysarthric or elderly speech directly into these models does not give satisfactory results, because of the scarcity of such data and the differences in its acoustic characteristics. This study therefore aims to make effective use of SSL models while building a system specialized for these voices.

Broadly speaking, the problem this study tackles is improving speech recognition accuracy for people with pronunciation difficulties. That is a little too big, so let's break it down. Two factors drive the problem: lack of data and differences in speech characteristics. Dysarthric speech, for example, tends to have missing consonants and unclear, irregular pronunciation; it differs considerably from the speech of typical speakers.

In this study, an approach that combines the SSL model with several conventional methods is proposed. It yields significant error rate reductions of up to around 30% across four data sets.
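To give a flavour of what fusing SSL outputs with conventional input features can look like, here is a minimal sketch (my own illustration, not the authors' front end) that concatenates 80-dimensional filterbank features with HuBERT encoder outputs. The file name, checkpoint, dimensions, and interpolation-based frame alignment are all assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of fusing conventional filterbank
# features with SSL encoder outputs at the input-feature level.
# Assumes a 16 kHz mono file "utterance.wav"; names and dims are illustrative.
import torch
import torchaudio
from transformers import HubertModel

wav, sr = torchaudio.load("utterance.wav")                        # (1, samples), assumed 16 kHz
fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=80)   # (frames, 80), ~10 ms hop

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
with torch.no_grad():
    ssl = hubert(wav).last_hidden_state.squeeze(0)                # (frames', 768), ~20 ms hop

# Match the two frame rates by simple interpolation, then concatenate
# into one fused feature matrix for the downstream recognizer.
ssl_aligned = torch.nn.functional.interpolate(
    ssl.T.unsqueeze(0), size=fbank.shape[0], mode="linear"
).squeeze(0).T
fused = torch.cat([fbank, ssl_aligned], dim=-1)                   # (frames, 80 + 768)
```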

Before this study, the usual approach was to adapt the SSL model directly to dysarthric or elderly speech, but that alone did not deliver sufficient performance. Here, by flexibly incorporating conventional methods while exploiting the strengths of the SSL model, higher performance was achieved. The improvement was particularly large for speakers with severe impairments.

The results may be applied primarily to speech recognition tasks where data is scarce or where feature differences from standard data are significant.

Finally, this is an important study that will greatly contribute to the realization of a society in which both the disabled and the elderly can communicate in their own way, in other words, can be themselves.

What is The A2A Inversion Model? Why is It Valid?

My greatest thanks to you for reading this far!

Now that you've read this far, you must be interested in this paper, right? Let's take it a little further from here...

Now look at the diagram above. No one should be able to understand this in an instant. I will take my time to explain it in as much detail as possible. I think this is a very important and interesting part of the paper.

First of all, let me explain what the A2A model is. Simply put, it converts one representation of speech into another; in this paper, it converts speech (acoustic) features into articulatory features. As a side note, articulation refers to the movements of the tongue and lips when pronouncing a word.

Now, let's go through the architectural workflow in an orderly fashion. First, let's take a quick look at the flow.

  1. Three-step fine-tuning of the HuBERT encoder
  2. Training the A2A model
  3. Generating articulatory features with the trained A2A model (inversion)

It looks like this. Let's take a closer look.

First, the HuBERT encoder is fine-tuned in three stages with three sets of data: first on typical (non-dysarthric) speech, then on dysarthric speech, and finally on typical-speaker data that comes with articulatory recordings. The reason for this laborious process is that exposing the model to progressively more varied data lets it cover a wider range of speakers and tasks.
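As a rough illustration of what such staged fine-tuning could look like, here is a minimal sketch (not the paper's pipeline) using a HuBERT-based CTC recognizer. The three data loaders are hypothetical placeholders.

```python
# Minimal sketch (not the paper's pipeline) of three-stage fine-tuning of a
# HuBERT-based CTC recognizer. The three DataLoaders are placeholders and
# must be built from the corresponding corpora.
import torch
from transformers import HubertForCTC

model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fine_tune_stage(loader, epochs=1):
    """Continue training the same encoder on one more domain of data."""
    model.train()
    for _ in range(epochs):
        for batch in loader:  # batch: {"input_values": waveforms, "labels": token ids}
            loss = model(input_values=batch["input_values"],
                         labels=batch["labels"]).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Placeholders: replace with real DataLoaders over each corpus.
typical_loader, dysarthric_loader, articulatory_domain_loader = [], [], []

# Stage 1: typical speech; Stage 2: dysarthric/elderly speech;
# Stage 3: typical-speaker data that also has articulatory recordings.
for stage_loader in (typical_loader, dysarthric_loader, articulatory_domain_loader):
    fine_tune_stage(stage_loader)
```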

Next, to train the A2A model, acoustic features extracted from the typical speakers' recordings are used as input, and the corresponding articulatory measurements are the training targets. This lets the model learn the mapping from acoustic features to articulatory features.
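The sketch below shows one simple way such an A2A regression model could be set up: a small recurrent network trained with an MSE loss on paired acoustic and articulatory data. The architecture, dimensions, and the `paired_loader` are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch (illustrative, not the paper's architecture) of an A2A
# inversion model: frame-level acoustic/SSL features -> articulatory trajectories.
import torch
import torch.nn as nn

class A2AInverter(nn.Module):
    def __init__(self, acoustic_dim=1024, articulatory_dim=12, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(acoustic_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, articulatory_dim)

    def forward(self, feats):            # feats: (batch, frames, acoustic_dim)
        h, _ = self.rnn(feats)
        return self.proj(h)              # (batch, frames, articulatory_dim)

a2a = A2AInverter()
optim = torch.optim.Adam(a2a.parameters(), lr=1e-4)
mse = nn.MSELoss()

# paired_loader should yield (acoustic_feats, articulatory_targets) pairs taken
# from typical speakers recorded with both a microphone and articulatory sensors.
paired_loader = []  # placeholder: build a real DataLoader over the parallel corpus
for acoustic_feats, articulatory_targets in paired_loader:
    loss = mse(a2a(acoustic_feats), articulatory_targets)
    optim.zero_grad()
    loss.backward()
    optim.step()
```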

Finally, in the inversion step, acoustic features extracted from the dysarthric or elderly data sets are fed into the trained A2A model, which generates the corresponding articulatory features.
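Continuing the sketch above, applying the trained model is just a forward pass without gradients; concatenating the generated articulatory features with the acoustic features is one plausible way to feed them to the recognizer, not necessarily the paper's exact fusion scheme.

```python
# Minimal sketch of applying the trained A2A model to dysarthric/elderly speech.
a2a.eval()
dysarthric_feature_loader = []  # placeholder: acoustic features of the target speakers
with torch.no_grad():
    for dysarthric_feats in dysarthric_feature_loader:       # (batch, frames, acoustic_dim)
        pseudo_articulatory = a2a(dysarthric_feats)           # generated articulatory features
        # One possible fusion: concatenate along the feature axis and
        # pass the result to the downstream ASR acoustic model.
        fused = torch.cat([dysarthric_feats, pseudo_articulatory], dim=-1)
```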

Well, that's roughly how it goes. I hope you were able to grasp it, even if only somewhat. The paper describes the theory and structure in much more depth, but for now I would like you to take away at least these core points.

What's so Great about The A2A Model?

Let us now come to the summary. To tell the truth, in this paper, the A2A model is only one of the proposed methods.

The original paper is a huge report of research results, 16 pages long. Online articles and write-ups of more than 3,000 words do not mix well, so choices had to be made. There are many interesting methods and results, but this time I chose to introduce the A2A model in some depth.

The innovative aspect of this approach is that, by exploiting different data sources in stages, it makes it possible to estimate articulatory features even for dysarthric speech. This is very useful for dysarthric speakers, for whom data is limited.

We look forward to further development of this research to realize a society in which both the disabled and the elderly can communicate with each other in their own way!

Finally, the paper contains many interesting results from its various experiments. However, because so many models are compared across so many tasks, the results tables are extremely dense. What I wanted to convey in this article are two things: the current state of speech recognition for dysarthric and elderly people, and the A2A inversion approach. Since those two points have been covered, I will not go into the experimental conditions and results here.

If you are interested, please access the original paper from the link to the paper at the beginning of the article!

A Little Chat with the Fledgling Writer, Ogasawara

We are looking for companies and graduate students who are interested in conducting collaborative research!

My specialty is speech recognition (experimental work), especially for dysarthric speakers.

This field has limited resources available, and there will always be a limit to what one person can tackle alone.

Who would like to join us in solving social issues using the latest technology?


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
