
For Everyone To Enjoy The Convenience... Speaker Adaptation Of Dysarthric Speech Using Whisper

Speech Recognition For The Dysarthric

3 main points
✔️ Proposed a speaker adaptation method for the Whisper model using P-Tuning
✔️ The proposed method improves CER by 13% relative to the baseline
✔️ The proposed method is highly flexible and shows improved performance in a variety of configurations

Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition
written by Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian
[Submitted on 14 Jun 2024]
Comments:   Accepted by INTERSPEECH 2024
Subjects:   Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

So That Everyone Can Enjoy The Benefits of Science and Technology...

Read at Least This Part! Super Summary of The Paper!

Don't you think the accuracy of voice recognition has improved tremendously in recent years? Take Siri on iOS or Google Assistant on Android: you can just say "Put on some music!" or "What's the weather like today?" and they understand you.

Against this backdrop, Whisper, a high-performance large-scale speech recognition model from OpenAI (the company well known for ChatGPT), appeared and attracted attention for a while, as its performance surpassed even that of Google's models.

This paper is a study of speaker adaptation using the Whisper model. The authors attempted to improve recognition accuracy by adapting the model to the voices of people who have pronunciation difficulties and therefore cannot use speech recognition well.

People with dysarthria have difficulty controlling the muscles involved in speech production, mainly due to damage to the nervous system, which results in slurred and unstable pronunciation. This also makes it difficult to collect large-scale data for machine learning, which is one of the reasons why research on dysarthric speech recognition has not progressed.

The paper addresses two major issues: improving the recognition rate of speech from speakers with dysarthria, and proposing an effective speaker adaptation method under limited data.

As a result, the proposed method successfully reduced the character error rate (CER) by 13% relative to the Whisper baseline. The method was also shown to be particularly effective for severely dysarthric speech.

It was previously thought that speech recognition for dysarthric speakers required specialized models and complex speaker adaptation methods. However, this study shows that higher recognition accuracy can be achieved by combining a large-scale pre-trained model (Whisper) with an efficient adaptation method.

We need to find simpler adaptation and recognition methods that help people with and without dysarthria communicate, so that everyone can benefit equally from science and technology.

How to Incorporate The Speaker Adaptation Algorithm into Whisper...

My greatest thanks to you for reading this far!

Now that you've read this far, you're interested in this paper, right? Let's take it a little further from here...

Now look at the diagram above. No one could be expected to understand it at a glance, so I will take my time and explain it in as much detail as possible. I think this is a very important and interesting part of the paper.

I'll briefly introduce the flow of this architecture.

  1. Input processing
  2. Prompt generation
  3. Whisper model processing
  4. Decoding
  5. Adaptation mechanism

The first step is input processing. As you might expect, speech features are the input.
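Concretely, Whisper takes log-Mel spectrogram features as its input. As a minimal sketch (the file name is my own placeholder), this is how the openai-whisper package computes them:

```python
import whisper  # pip install openai-whisper

audio = whisper.load_audio("speech.wav")  # load and resample to 16 kHz mono
audio = whisper.pad_or_trim(audio)        # pad/trim to Whisper's 30-second window
mel = whisper.log_mel_spectrogram(audio)  # (80, 3000) log-Mel feature matrix
```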

These speech features are fed to a prompt generator, a Perceiver-based module (I'll spare you the fine details), which produces a speaker prompt. The features are also passed to the Whisper model itself, where they go through two convolution layers before being handed to the Transformer encoder.

Now, when the encoder finishes processing, its output is passed to the decoder, which generates the text output.

Speaker prompts are inserted before and after the input to provide speaker-specific information to the model, allowing it to adapt to individual speaker characteristics and improve recognition accuracy.
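To make this more concrete, here is a minimal PyTorch sketch of the general idea, under my own assumptions: the module name, dimensions, and prompt length are all illustrative, not the authors' code. A small Perceiver-style module lets a fixed set of learnable latent queries cross-attend to the variable-length speech features, producing a fixed-length speaker prompt that is prepended to the encoder input.

```python
import torch
import torch.nn as nn

class PerceiverPromptGenerator(nn.Module):
    """Toy Perceiver-style prompt generator (illustrative, not the authors' code).

    A fixed number of learnable latent queries cross-attend to variable-length
    speech features, yielding a fixed-length speaker prompt.
    """

    def __init__(self, d_model=512, n_prompt=16, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_prompt, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, speech_feats):
        # speech_feats: (batch, time, d_model), e.g. post-convolution features
        batch = speech_feats.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        prompt, _ = self.cross_attn(queries, speech_feats, speech_feats)
        return prompt + self.ffn(prompt)  # (batch, n_prompt, d_model)

# Prepend the speaker prompt to the (frozen) encoder's input sequence.
generator = PerceiverPromptGenerator()
feats = torch.randn(2, 1500, 512)                  # dummy speech features
prompt = generator(feats)                          # (2, 16, 512)
encoder_input = torch.cat([prompt, feats], dim=1)  # prompt-conditioned input
```

The appeal of this P-Tuning-style setup is that only the prompt generator and its latent queries would be trained while Whisper itself stays frozen, which keeps adaptation lightweight even with the limited dysarthric data mentioned earlier.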

I hope you have managed to grasp at least the gist, even if my explanation was rather loose. Understanding this kind of architecture in depth is genuinely difficult, but much of its essence shows up even at the surface level, so getting the general picture is a fine place to start.

Is Dysarthric Speech Now Recognized...?

Finally, let's look at the results of this study, shall we? Look at the table above. It compares the performance of different models and different adaptation methods for speech recognition of dysarthric speech.

Five models are compared here. The proposed model is the one called Whisper-PP, on the bottom row.

In conclusion, the proposed method showed the best performance of all the models compared. The regular Whisper performed next best, so Whisper itself may already be a good match for dysarthric speech.

For the recognition of severely dysarthric speech (speaker FJ1), the proposed method remained strong, improving CER by 7%. Although Conformer is the latest speech recognition model, it struggled with this speech; being the newest model is no guarantee of robustness to dysarthric speech...

The models used in the experiment are as follows: Conformer, the latest speech recognition model; TDNN, a time-delay neural network; the original Whisper; Whisper with another speaker adaptation method that was tried; and this study's model (Whisper-PP).

All evaluations use CER (Character Error Rate), a measure of how many character-level mistakes the recognition result contains compared with the reference text. Since it is an error rate, the lower the number, the better the performance.
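As a toy illustration (my own example, not from the paper), CER is the character-level Levenshtein edit distance between the recognition result and the reference, divided by the reference length:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,                            # deletion
                            curr[j - 1] + 1,                        # insertion
                            prev[j - 1] + (ref_char != hyp_char)))  # substitution
        prev = curr
    return prev[-1] / len(reference)

# One substituted character out of six -> CER of about 0.167
print(cer("今天天气很好", "今天天起很好"))
```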

The overall experimental results show that the proposed method outperforms other models in all tasks. This result shows how accurately and efficiently the speaker adaptation method is able to extract speaker features and integrate them into the model!

To Ensure That Disabled and Able-Bodied People Enjoy Equal Convenience, Regardless of Disability...

This may be a bit radical, but there is no such thing as true equality in this world. There are those who loudly preach that this is the age of diversity and that everyone is equal, but that is nothing more than an idealistic theory.

After all, there are a wide variety of people in this world, including both able-bodied people and people with disabilities, and there is a clear difference between the two. What is truly needed in this world is not equality, but consideration or a kind heart to extend a helping hand.

Having said that, I would like everyone to be able to enjoy at least the conveniences born of science and technology equally, and I believe this is something we must make happen.

Researchers are exploring every day how to realize a society where everyone can enjoy that convenience equally. It's cool, and I admire them!

So, in this article, we looked at a speaker adaptation method that enables proper recognition of dysarthric speech. I hope you all got at least a little something out of it.

See you then! See you in the next article~~!

A Little Chat with Fledgling Writer Ogasawara

We are looking for companies and graduate students who are interested in conducting collaborative research!

My specialty is speech recognition (experimental research), especially for people with dysarthria.

This field has limited resources available, and there will always be a limit to what one person can tackle alone.

Who would like to join us in solving social issues using the latest technology?


If you have any suggestions for improving the content of this article, please contact the AI-SCHOLAR editorial team through the contact form.
