
You're Using Wav2vec2 For This? It Makes Feature Extraction Of Dysarthric Speech More Efficient!

Speech Recognition For The Dysarthric

3 main points
✔️ Feature extraction using wav2vec2.0 improved accuracy in detection and severity classification of dysarthria from speech
✔️ Features from the first layer of wav2vec2 were most effective in detection
✔️ Features from the last layer of wav2vec2 were most accurate for severity classification

Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech
written by Farhad Javanmardi, Saska Tirronen, Manila Kodali, Sudarsana Reddy Kadiri, Paavo Alku
(Submitted on 25 Sep 2023, last revised 17 Oct 2023)
Comments: copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Efficient Feature Extraction with Wav2vec2!

Read at Least This Part! Super Summary of The Paper!

I read papers and write articles on dysarthric speech recognition, so I always start out with the same question: "Do you know anyone with dysarthria?"

This time I wanted to open with a new twist... a twist... but I couldn't think of one...

Hi. So, to get back on track: have you ever heard of dysarthria? If you are reading this article, you probably already have at least some interest in it, so I'll keep the introduction brief.

Dysarthria is a disorder in which a person understands language but cannot pronounce it correctly because of problems with the speech organs. There are various types, acquired or congenital, paralytic or not, and each shows very different speech tendencies, which has made research on these disorders quite difficult.

As I just mentioned, dysarthria comes in many different forms. This means that even the most skilled physician finds it extremely difficult to judge the severity of the symptoms just by listening to and comparing a patient's pronunciation.

So, in this article, let's work through a paper on detecting dysarthric speech and classifying its severity using features extracted with wav2vec2!

The key word in this article is paralytic dysarthria. This is a speech disorder caused primarily by damage to the nervous system, and it has a significant impact on patients' daily lives and quality of life.

Traditionally, the evaluation of this disorder has been left to the subjective judgment of physicians and other specialists, but this is time-consuming, costly, and the judgments vary with the evaluator's experience.

This research addresses the challenge of detecting paralytic dysarthria directly from speech signals and automatically classifying its severity.

The results show that using features extracted from the wav2vec2 model raised accuracy to 93.95% in the detection task and to 44.56% in the severity classification task. In particular, features from the early layers of the model were more effective for detecting the disorder, while features from the later layers were more effective for classifying its severity.

Previous studies used acoustic features such as spectrograms and MFCCs, which represent the speech signal in a hand-crafted, tangible form. By using a pre-trained model as a feature extractor instead, this research obtains more sophisticated, abstract features and surpasses the performance of those earlier approaches.
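To make the contrast concrete, here is a minimal sketch of the kind of hand-crafted feature earlier studies relied on: an utterance-level vector of averaged MFCCs. The use of librosa and all parameter values here are my own assumptions for illustration, not the paper's baseline setup.

import numpy as np
import librosa  # assumed here purely for illustration

# Dummy 1-second clip at 16 kHz standing in for a real recording
waveform = np.random.randn(16000).astype(np.float32)

# 13 MFCC coefficients per frame, then averaged over time into one vector
mfcc = librosa.feature.mfcc(y=waveform, sr=16000, n_mfcc=13)  # shape (13, frames)
utterance_vector = mfcc.mean(axis=1)                          # shape (13,)
print(utterance_vector.shape)

In contrast, the wav2vec2 features used in the paper are learned from large amounts of unlabeled speech rather than designed by hand (see the sketch in the architecture section below).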

Now, what do you think of this brief introduction so far? It was quite surprising to see wav2vec2, which is mainly used in the field of speech recognition, employed as a feature extractor.

And in the detection task, the detection rate is almost 95%! That is almost too good to be true. Severity classification, at only about 45%, is hard to leave entirely to the automatic system, but I think it is more than sufficient as an auxiliary tool for human evaluators.

Next, after a quick look at the architecture of wav2vec2, let's dig a little deeper into the paper.

Let's Take A Look at The Architecture of Wav2vec2...

The flow of the proposed method is shown in the figure, which is much more compact and easier to follow this time.

In this article, I will not explain the architecture of wav2vec2 itself, so if you do not understand it or have forgotten it, I recommend reviewing it after reading this article. The model is used very often, and most other transformer-based speech recognition models share a similar structure.

Let's look at the flow in this figure, starting with (a), the detection system.

  1. An audio signal is input.
  2. Features are extracted from the audio by the wav2vec2 feature extractor.
  3. The features are classified with an SVM (Support Vector Machine).
  4. The classification result predicts whether the speaker is healthy or has paralytic dysarthria.

That is the prediction flow. It is very simple: the extracted features are fed into a classification task, and the result is the prediction (a minimal code sketch covering both systems follows after the severity flow below).

Next, let's look at (b), the severity classification system.

  1. An audio signal is input.
  2. Features are extracted from the audio by the wav2vec2 feature extractor.
  3. The features are classified with an SVM (Support Vector Machine).
  4. The classification result determines the severity level.

Huh, it's almost the same as (a), isn't it? That's right: both detection and severity determination are, in the end, just classification tasks.
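As promised above, here is a minimal sketch of both flows under some assumptions of mine: the HuggingFace facebook/wav2vec2-base checkpoint, scikit-learn's SVC, mean-pooled last-layer features, and random dummy audio and labels in place of the paper's data. It is an illustration of the idea, not the authors' implementation.

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

model_name = "facebook/wav2vec2-base"                     # assumed checkpoint
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

def utterance_embedding(waveform, sr=16000):
    """Mean-pool wav2vec2 frame features into one fixed-size vector per utterance."""
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        frames = model(**inputs).last_hidden_state        # (1, num_frames, 768)
    return frames.mean(dim=1).squeeze(0).numpy()          # (768,)

# Dummy data: four 1-second clips standing in for real recordings
waveforms = [np.random.randn(16000).astype(np.float32) for _ in range(4)]
X = np.stack([utterance_embedding(w) for w in waveforms])

# (a) Detection: binary labels, 0 = healthy, 1 = dysarthric
detection_clf = SVC(kernel="rbf").fit(X, np.array([0, 1, 0, 1]))

# (b) Severity: the same features with multi-class labels (levels are illustrative)
severity_clf = SVC(kernel="rbf").fit(X, np.array([0, 1, 2, 3]))

print(detection_clf.predict(X[:1]), severity_clf.predict(X[:1]))

The only difference between the two systems in this sketch is the label set handed to the SVM, which matches the paper's description of (a) and (b).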

However, as noted in the opening summary, the experiments did reveal one clear difference: features from the first layer of wav2vec2 were more effective for detection, while features from the final layer were more effective for severity classification.

Transformer models such as wav2vec2 are stacks of layers that take on different roles, so even among the layers that extract features, which kind of feature each layer captures is probably divided up (my guess).
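If you want to experiment with this yourself, the hidden states of every layer can be requested from the model. The sketch below is again my own illustration with dummy audio; the layer indices are only examples of "first" and "last".

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"                     # assumed checkpoint
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

waveform = np.random.randn(16000).astype(np.float32)      # dummy 1-second clip
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[0] is the projected convolutional-encoder output,
# [1] the first transformer layer, [-1] the last; each is (1, num_frames, 768).
first_layer_features = out.hidden_states[1].mean(dim=1)   # e.g. for detection
last_layer_features = out.hidden_states[-1].mean(dim=1)   # e.g. for severity
print(first_layer_features.shape, last_layer_features.shape)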

In this article, we introduced a paper showing that by using wav2vec2 as a feature extractor to efficiently extract features of dysarthric speech, accuracy reached 93.95% in the dysarthric speech detection task and 44.56% in severity classification.

The detection score looks good enough to put into practice, and the classification is accurate enough to be useful for assisting specialists.

In Japan, not much research has been done on detecting these disorders and classifying their severity, so I wonder whether features could be extracted in the same way for Japanese speech.

See you in the next article!

A Little Chat with Fledgling Writer Ogasawara

We are looking for companies and graduate students who are interested in conducting collaborative research!

My specialty is speech recognition (experimental research), especially with dysarthric speakers.

This field has limited resources available, and there will always be a limit to what one person can tackle alone.

Who would like to join us in solving social issues using the latest technology?


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
