I Want to Use a Voice Activation System Even with Dysarthria! A Corpus for Voice Activation Research, and What Is a Voice Activation System?
3 main points
✔️ A corpus of Mandarin dysarthric speech (MDSC) has been built and released as a public resource for voice activation research.
✔️ A comprehensive experimental analysis using MDSC clarified the challenges voice activation systems face with dysarthric speech.
✔️ A voice activation system customized for dysarthric speakers was proposed; it is robust to differences in intelligibility and shows excellent performance.
Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design
written by Ming Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Jianxing Yang, Ming Li, Chin-Hui Lee
[Submitted on 14 Jun 2024]
Comments: to be published in Interspeech 2024
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Nice to meet you all!
I am Ogasawara, a new writer for AI-SCHOLAR.
The paper I am introducing here is
"Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design".
As the main points at the top of this article suggest, the purpose of this work is to build and release a corpus of Mandarin dysarthric speech and to design a voice activation system for dysarthric speakers.
I wonder what kind of methods they use! Let's learn together, little by little!
I will introduce the paper as concisely as I can, so please stay with me until the end.
Outline of This Study
Voice control of devices has become commonplace with the spread of smart home technologies such as SwitchBot and Amazon's Alexa. However, people with dysarthria, whose pronunciation is affected, benefit far less from them.
In this study, the authors therefore built and released the Mandarin Dysarthria Speech Corpus (MDSC) and used it to run experiments and analyses on voice activation for dysarthric speakers.
The analysis showed that dysarthric speech varies greatly from person to person and that the available data are limited. The proposed system adapted to a speaker's voice with only about 3 minutes of speech and was robust to differences in intelligibility, although speakers with severe dysarthria still need further work.
Let's Keep It in Mind.
What is Dysarthria?
Dysarthria is a disorder in which congenital or acquired factors prevent a person from pronouncing words correctly, even though they understand the language. Acquired factors include, for example, stroke and neuromuscular disease.
Speech characteristics vary greatly from person to person, but in general, speech intelligibility is reduced and the spoken word is difficult to understand. This makes interpersonal communication extremely difficult.
What is Voice Activation?
Voice activation is a technology that wakes a device when it hears a specific phrase such as "Hey Siri" or "OK Google". If the wake word is not pronounced correctly, the device will not respond.
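To see why mispronunciation prevents wake-up, here is a minimal sketch of the gating logic (my own illustration under simple assumptions, not the paper's system; `WAKE_WORD`, `THRESHOLD`, and `should_wake` are all hypothetical names): an upstream recognizer produces a transcript and a confidence score, and the device wakes only if both look right.

```python
# Minimal wake-word gating sketch (illustrative only; not the paper's system).
# Assumption: an upstream recognizer supplies a transcript and a confidence.

WAKE_WORD = "hey siri"   # hypothetical wake phrase
THRESHOLD = 0.8          # hypothetical acceptance threshold

def should_wake(transcript: str, confidence: float) -> bool:
    """Wake only if the decoded phrase matches and the score is high enough."""
    return transcript.strip().lower() == WAKE_WORD and confidence >= THRESHOLD

print(should_wake("Hey Siri", 0.95))  # True: clear pronunciation
print(should_wake("hey seal", 0.95))  # False: decoded to the wrong words
print(should_wake("hey siri", 0.40))  # False: low confidence on atypical speech
```

Dysarthric speech tends to fail one of these two checks, and that is exactly the gap this paper tries to close.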
What are PER and WER?
PER is the phoneme error rate, measured over phonemes, the smallest units of pronunciation, and WER is the word error rate, measured over words. Both count the substitutions, deletions, and insertions needed to turn the recognized sequence into the correct one, divided by the length of the correct sequence.
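As a concrete example, here is a small sketch of how both rates can be computed (my own illustration, not code from the paper); only the tokenization differs between WER (words) and PER (phonemes):

```python
# Minimal WER/PER sketch (my own illustration, not code from the paper).
# Both metrics are (substitutions + deletions + insertions) / reference length;
# WER tokenizes by words, PER by phonemes.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def error_rate(ref: list[str], hyp: list[str]) -> float:
    return edit_distance(ref, hyp) / len(ref)

# WER: one wrong word out of four -> 0.25.
print(error_rate("turn on the light".split(), "turn off the light".split()))
# PER: the same computation over phoneme sequences (here, toy pinyin phones).
print(error_rate("n i h ao".split(), "l i h ao".split()))  # 0.25
```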
Do You Understand? A Quick Recap
There are only three important things!
Let's just hold on to these!
- Dysarthria is a disorder in which a person understands the language but cannot pronounce it correctly
- Voice activation is like Siri waking up to "Hey Siri!"
- PER and WER are commonly used as evaluation metrics
As long as you have these three things in mind, the rest will be fine!
What is MDSC? What Kind of Corpus is It?
The authors developed and published a corpus of Mandarin dysarthric speech in this study. Aren't you curious what it actually contains?
If not, you can move on to the next topic, but I am going to dig a little deeper here, so if you are interested, please stay with me.
Objective
This corpus provides speech from Mandarin dysarthric speakers for research on voice activation systems. Because it is intended for that purpose, the recorded words are chosen to match it.
Frankly, its use is quite limited, and since it is in Chinese, it may not feel relevant to our lives. But the idea of building a corpus specifically to design a voice activation system is very instructive. The same field will not develop in Japanese either unless someone takes it on.
Features and Contents
This corpus contains about 10 hours of dysarthric speech and about 8 hours of speech from speakers without dysarthria. That is quite a lot of data, as you would expect for Chinese, a language with an enormous number of speakers.
I would like to see a corpus (preferably an open one) of dysarthric speech in Japanese as well, but Japanese has fewer speakers than English or Chinese, and, though it may be harsh to say, I feel Japan lags behind other countries in social inclusion of people with disabilities. The reality is that this makes it hard to gather participants.
That was a bit of a tangent, so let me get back on track. The recorded words consist of the wake words at the heart of the voice activation system plus other commands, about 360 words in total.
This is Where We Start! About The Experiment
Thank you for bearing with this long explanation of the basics. Next, I will finally explain the most interesting part of the paper: the experiments.
What Kind of Experiment?
This is an experiment to evaluate the performance of a voice activation system adapted to individual dysarthric speakers.
Experimental Setup
1: Data set
This experiment uses the MDSC explained earlier, a dataset built for voice activation research that contains approximately 16 hours of Mandarin speech.
2: Subjects
Dysarthric speakers selected from the MDSC
3: Evaluation index
PER and WER will be used to evaluate system performance.
What are The Results of The Experiment?
Now, time for the ceremony where we announce the experimental results! Here we go!
Poof! You can see something like three models, but the one this paper focuses on is the black bar, the SDD model. Keep an eye on it.
See! These are the experimental results. The figure is organized very clearly, but let me explain a few important points.
- Superiority of the SDD model: SDD shows the lowest score for every speaker. Since these are error rates, lower is better. The model also appears to adapt to a speaker's voice with only about 3 minutes of speech.
- Differences in improvement by intelligibility: the SDD model's improvement rate is highest for speakers with moderate intelligibility, which indicates that speaker adaptation is especially effective for them. There is, of course, a challenge: the improvement for the least intelligible speakers is smaller than that for the moderately intelligible ones.
- Need for the SDD model: in the SID graph, the red bar is a model trained on all the dysarthric speech together. It also improves in many cases, but its performance is limited, and the improvement rate actually worsens for speakers with high intelligibility. In contrast, the individually adapted SDD model improves for every speaker, which should make the need for SDD clear.
There you have it: the three most important results of this research. From them, I believe we can see both the value of the Mandarin dysarthric speech corpus and the need for an SDD system.
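To make "individually adapted" concrete, here is a hypothetical sketch of what speaker-dependent adaptation generally looks like (my own PyTorch illustration, not the authors' recipe; the model, file name, and data here are all placeholders): start from a model trained on many speakers, then fine-tune briefly on a few minutes of enrollment speech from the target speaker.

```python
# Hypothetical speaker-adaptation sketch (not the authors' recipe):
# fine-tune a pre-trained wake-word model on ~3 minutes of one speaker's data.
import torch
import torch.nn as nn

# Placeholder classifier: 40-dim filterbank frames -> wake / not-wake.
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 2))
# In practice you would load speaker-independent weights first, e.g.:
# model.load_state_dict(torch.load("speaker_independent.pt"))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR: few data
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a few minutes of the target speaker's labeled frames.
features = torch.randn(512, 40)
labels = torch.randint(0, 2, (512,))

model.train()
for epoch in range(5):  # a few passes suffice for light adaptation
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

The point of this style of adaptation is that only a tiny amount of per-speaker data is needed, which matches the roughly 3 minutes reported above.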
Summary of The Paper
This research targets Chinese, so some of you may be thinking, "What does that have to do with my life?" But that is not true. If it can be done in another language, we should be able to obtain the same results in Japanese. Japanese inevitably has fewer speakers than Chinese, and we would have to find dysarthric people among them and ask whether they would be willing to become subjects. Since I am a student, I cannot offer an honorarium, and I am not sure anyone would take me up on the offer. But someone has to do it, or this field will never develop. That is what I want to say.
A Little Chat with Ogasawara, a Fledgling Writer
We are looking for companies and graduate students interested in joint research!
My specialty is speech recognition (experimental), especially for dysarthric speech.
This field has limited resources available, and there is a limit to what one person can tackle alone.
Would you like to join me in solving social issues with the latest technology?