
The Time Has Come For Everyone To Speak English! Zero-shot Text-to-speech Technology For Multiple Languages Makes It Easy For Anyone To Pronounce English Like A Native Speaker!
3 main points
✔️ Proposes a zero-shot Voice Transfer (VT) module that can be integrated into multilingual TTS systems
✔️ The proposed VT module can transfer a voice from a single short sample to other languages with high quality and high speaker fidelity
✔️ It can also restore the voices of speakers with dysarthria
Zero-shot Cross-lingual Voice Transfer for TTS
written by Fadi Biadsy, Youzheng Chen, Isaac Elias, Kyle Kastner, Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran
[Submitted on 20 Sep 2024]
Comments: Submitted to ICASSP
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Speech Synthesis Is a Gem in The Rough, Full of Possibilities...
Read at Least This Part! Super Summary of The Paper!
"I'll speak English fluently and have fun interacting with the locals!" It has been two years since I started studying English with that enthusiasm. My resistance to reading and listening to English has faded, but I still can't speak it.
English pronunciation trips me up, and I freeze. ... This is my own experience, but I'm sure many readers actually feel the same way.
English pronunciation is very different from Japanese, so I understand the unease about whether you can really enjoy talking with local people with pronunciation like that.
In this article, I will introduce TTS (text-to-speech) technology, which is sure to play an active role as a communication tool that eases some of those worries!
This module, developed by a Google team, needs only a few seconds of a voice sample to synthesize speech in another language while preserving the characteristics of that person's voice.
Isn't that awesome? The sample needed is only a few seconds of audio. Conventional synthesis technologies often require at least two hours, so I must say this is revolutionary.
TTS models have existed for some time, but they were based on synthesis within the same language, and transferring a voice across languages remained a technical challenge.
To synthesize speech across multiple languages successfully, the study tackled three challenges:
- Generating high-quality audio from only a small number of audio samples
- Transferring voice features from language A to language B
- Restoring the voices of speakers with speech impediments
The word "speech impediment" has been mentioned here, so I would like to add something. A speech impediment is a disorder in which a person has some problem in the vocal organs or cranial nerves and is unable to pronounce words correctly. Since it is so far removed from normal speech, it can be considered as another language in research.
Now, the main result of this research is the development of a zero-shot VT module that can easily be integrated into a multilingual TTS system. As for synthesis accuracy, the authors generated speech in nine languages from a single short speech sample and achieved an average speaker similarity of 73%. They also demonstrated that high-quality speech can be synthesized from the speech of speakers with speech impediments.
Amazing, isn't it? It synthesizes a voice with over 70% similarity from just a few seconds of speech. This is not only a means of communication; it could also serve various welfare applications, such as restoring the voice of a patient who has lost the ability to speak after a pharyngectomy.
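As an aside, "speaker similarity" in this kind of work is commonly scored as the cosine similarity between speaker embeddings of the reference recording and the synthesized audio. Below is a minimal sketch of that idea; the embedding dimensionality and the exact protocol behind the paper's 73% figure are my assumptions, not details taken from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1] between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 256-dim speaker embeddings; in practice these come from a
# pretrained speaker encoder applied to the reference and synthesized audio.
rng = np.random.default_rng(0)
ref_embedding = rng.normal(size=256)
synth_embedding = ref_embedding + 0.5 * rng.normal(size=256)

print(f"speaker similarity: {cosine_similarity(ref_embedding, synth_embedding):.2f}")
```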
In previous studies, synthesizing high-quality speech required a large number of samples, and transferring voices between languages was difficult. This research eases those limitations and greatly expands the possibilities of speech synthesis.
How a voice sounds matters in communication, doesn't it? When you are enjoying a conversation with someone, a cold, mechanical voice is not much of a communication aid.
Now, from the next chapter, we will look at the architecture of this VT module. If you want to know more about the technology, you can't escape from the architecture.
Let's Take A Look at The Architecture of The VT Module...
Here is the architecture of the VT module. Before diving in, let's take a moment to review what a module is. In simple terms, a module is a customizable part that can be incorporated into a model. For following this article, that level of understanding is plenty.
I'll walk through it step by step. First, the inputs: the sample voice goes into the Speaker Encoder, and the text to be synthesized goes into the Text Encoder.
Inside the encoders, the input text is converted into an easily processed form, and speaker features are extracted from the speech. Transformer layers are used for the feature extraction.
The Bottleneck Layer then distills the speaker features from the Speaker Encoder's output.
(The speaker's features really do seem to be carefully extracted!)
The Duration Predictor and Upsampler predict how long each piece of text should last and stretch the extracted features to match that prediction.
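To make the upsampling step concrete, here is a tiny sketch (my own illustration, not the paper's code) of what duration-based upsampling typically looks like: each text token's feature vector is repeated for as many frames as its predicted duration says it lasts.

```python
import torch

# 4 text tokens, each with an 8-dimensional feature vector (dummy values).
token_features = torch.randn(4, 8)
# Hypothetical predicted durations: the number of frames per token.
predicted_durations = torch.tensor([3, 1, 4, 2])

# Upsample: repeat each token's features along the time axis.
frame_features = torch.repeat_interleave(token_features, predicted_durations, dim=0)
print(frame_features.shape)  # torch.Size([10, 8]) -- 3 + 1 + 4 + 2 frames
```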
The Feature Decoder consists of six layers in total and generates the voice features.
(Six layers! No wonder it can consistently produce high-quality audio.)
Finally, the WaveFit vocoder takes the features generated by the previous layer and produces the final speech waveform.
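Putting it all together, here is a rough PyTorch sketch of the data flow just described. Every layer size, the mean pooling, and the mel-spectrogram output are assumptions I made for illustration; the real VT module (and the WaveFit vocoder, reduced here to a comment) is considerably more sophisticated.

```python
import torch
import torch.nn as nn

class VTModuleSketch(nn.Module):
    """A toy stand-in for the VT module's data flow, not the real design."""

    def __init__(self, vocab_size=100, d_model=128, n_mels=80):
        super().__init__()
        # Text Encoder: token embeddings followed by a Transformer layer.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Speaker Encoder: a Transformer layer over reference-audio frames.
        self.speaker_encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Bottleneck Layer: squeezes speaker info into a compact vector.
        self.bottleneck = nn.Linear(d_model, 16)
        self.expand = nn.Linear(16, d_model)
        # Duration Predictor: one (log-)duration per text token.
        self.duration_predictor = nn.Linear(d_model, 1)
        # Feature Decoder: six layers, as described above.
        self.feature_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=6,
        )
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, text_ids, ref_frames):
        # Encode the text and the reference speech.
        text = self.text_encoder(self.text_embed(text_ids))   # (1, n_tokens, d)
        spk = self.speaker_encoder(ref_frames).mean(dim=1)    # pool over time
        spk = self.expand(self.bottleneck(spk))               # through the bottleneck
        # Predict per-token durations and upsample tokens to frame rate.
        dur = self.duration_predictor(text).squeeze(-1).exp().round().clamp(min=1).long()
        frames = torch.repeat_interleave(text[0], dur[0], dim=0).unsqueeze(0)
        # Condition on the speaker and decode acoustic features.
        mel = self.to_mel(self.feature_decoder(frames + spk.unsqueeze(1)))
        # In the paper, the WaveFit vocoder turns these features into a waveform.
        return mel

model = VTModuleSketch()
text_ids = torch.randint(0, 100, (1, 12))  # 12 text tokens
ref_frames = torch.randn(1, 50, 128)       # a few seconds of reference frames
print(model(text_ids, ref_frames).shape)   # (1, total predicted frames, 80)
```

If I had to guess at the design intent, squeezing the speaker representation through a narrow bottleneck presumably keeps voice identity compact while discouraging the model from copying language-specific details of the reference audio.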
And that's the flow! Did you get a general picture of it?
The great thing about this module is that it can easily be incorporated into existing multilingual TTS models. Well, "easily" is relative: the structure of a TTS model is complex, and the actual programs are probably even more convoluted, so we may never get around to it ourselves. ...
Normally this is where I would introduce and discuss the results, but the tables are hard to present here, so for this paper it is enough to keep the following points in mind. (Though this will repeat the summary...)
This is not a TTS model but a module. A module is like a custom part you add to an assembled Gundam plastic model kit, or an item you attach to a Kamen Rider's transformation belt to boost its performance.
While previous research needed large amounts of speech data to achieve high quality, this module needs only a few seconds of a speech sample to produce high-quality speech in nine languages with an average speaker similarity of over 70%.
Furthermore, speech synthesis is expected to extend to welfare applications, such as restoring the voices of speakers with speech impediments.
Well, I guess that's about it. Speech synthesis is fun, but building your own system is extremely difficult. I tried with a GAN, but the output was so noisy it didn't even sound like a voice, and the results were disappointing.
I have nothing but respect for the engineers and researchers who can develop a model from scratch!
With that, thank you so much to all the readers who have stayed with me to the end.
This has been Asa. See you soon!
A Little Chat with Fledgling Writer Ogasawara
We are looking for companies and graduate students who are interested in conducting collaborative research!
My specialty is speech recognition (experimental), especially for speakers with dysarthria.
This field has limited resources available, and there is a limit to what one person can tackle alone.
Would you like to join us in solving social issues with the latest technology?