

Speech Processing Model That Defies Common Sense! The Amazing Performance Of The Speech Processing Model "SpeechT5" Developed By Microsoft


3 main points
✔️ SpeechT5 is an encoder-decoder model that can handle both speech and text
✔️ Pre-trained with large speech and text data and applicable to a variety of spoken language processing tasks
✔️ Uses speech and text information cross-modally rather than separately

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
written by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei
[Submitted on 14 Oct 2021 (v1), last revised 24 May 2022 (this version, v3)]
Comments:   Accepted by ACL 2022 main conference
Subjects:   Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Defies Common Sense! The Many Possibilities of SpeechT5

Read at Least This Part! Super Summary of The Paper!

Have you ever had a complex about your voice? I have. I have a congenital disability that makes it hard for me to pronounce words clearly, and my voice is muffled, so it's not uncommon for me to think, "Oh, it would be so cool if I could sound as sharp and clear as that voice actor."

With the recent anime boom, voice actors have come into the spotlight, and even those with super cute voices that everyone envies often say they have complexes about their voices and feel theirs are different from those around them.

Now, the voice is something very familiar to all of us, and I think speaking more coolly or more prettily is an eternal challenge for human beings.

Here is a research paper in the fields of speech synthesis and voice conversion that sincerely confronts this eternal challenge.

The SpeechT5 model, developed and released by Microsoft, can handle speech and text simultaneously, and is designed to cover speech processing tasks such as speech recognition and speech synthesis in a single model.

Conventional speech processing models are often trained on speech alone, and the importance of text data tends to be overlooked. In addition, model development has usually focused on the encoder, leaving the decoder without pre-training.

Therefore, SpeechT5 aimed to develop a model that could effectively utilize both speech and text data and perform all speech processing tasks at a high level.

As a result of the study, the model significantly outperformed existing models on a variety of speech processing tasks. In particular, it outperformed even wav2vec2 and HuBERT, which are relatively high-performing models for speech recognition.

The traditional approach is not to build a model that can do everything, but to build a specialist for a particular task. After all, a jack-of-all-trades tends to do everything halfway, doesn't it?

However, while SpeechT5 is a model that can do a bit of everything, it is no jack-of-all-trades-master-of-none: it performs every job to a high standard, once again demonstrating the potential of a model that can work in multiple roles.

I have actually tried this model on a text-to-speech task, and it really does speak English fluently! It is a shame that it is currently only available in English, but I felt it is a dependable, robust model that can be trusted with English.

Now, in the next chapter, I will discuss the architecture of this model a little more in depth.

Let's Take a Look at The Architecture of SpeechT5...

Now let's look at the architecture of SpeechT5. After all, architecture is an unavoidable part of learning a model.

We will look at it slowly, so please take your time to understand and follow along!

  1. The speech information and its textual counterpart are passed to the encoder as input.
  2. The information processed by the encoder is passed to the decoder.
  3. The decoder receives not only the information from the encoder, but also the speech signal and its textual counterpart.
  4. After processing this information, the decoder passes its output to the speech processing mechanism and the text processing mechanism, respectively.

This is the sequence of events. It wasn't that difficult, was it? A word of caution here: speech information and speech signal information are completely different things.

To put it plainly, speech information is something easily understood by humans (such as human speech itself), while speech signal information is something easily understood by machines (such as numerical values).

Strictly speaking, text information is also converted inside the model into a form that is easier for the model to process, but that is another story.
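To make this flow a little more concrete, here is a deliberately toy sketch in PyTorch of a SpeechT5-style model: modality-specific pre-nets and post-nets wrapped around a single shared encoder-decoder. Every layer size and module name here is my own illustrative assumption, not the actual implementation described in the paper.

```python
# Toy sketch of a SpeechT5-style unified encoder-decoder (not the real implementation).
# Speech and text each get their own pre-net/post-net; the Transformer backbone is shared.
import torch
import torch.nn as nn

class ToySpeechT5(nn.Module):
    def __init__(self, d_model=256, vocab_size=100, n_mels=80):
        super().__init__()
        # Modality-specific pre-nets map raw inputs into the shared hidden space
        self.speech_prenet = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU())
        self.text_prenet = nn.Embedding(vocab_size, d_model)
        # Shared encoder-decoder backbone used by every task
        self.backbone = nn.Transformer(d_model=d_model, nhead=4,
                                       num_encoder_layers=2, num_decoder_layers=2,
                                       batch_first=True)
        # Modality-specific post-nets turn hidden states back into speech or text
        self.speech_postnet = nn.Linear(d_model, n_mels)    # predicts mel frames
        self.text_postnet = nn.Linear(d_model, vocab_size)  # predicts token logits

    def forward(self, src, tgt, src_modality="speech", tgt_modality="text"):
        enc_in = self.speech_prenet(src) if src_modality == "speech" else self.text_prenet(src)
        dec_in = self.speech_prenet(tgt) if tgt_modality == "speech" else self.text_prenet(tgt)
        hidden = self.backbone(enc_in, dec_in)
        return self.speech_postnet(hidden) if tgt_modality == "speech" else self.text_postnet(hidden)

# ASR-like direction: mel-spectrogram in, token logits out
model = ToySpeechT5()
mel = torch.randn(1, 50, 80)             # (batch, frames, mel bins)
tokens = torch.randint(0, 100, (1, 12))  # (batch, token length)
logits = model(mel, tokens, "speech", "text")
print(logits.shape)                      # torch.Size([1, 12, 100])
```

Swapping which pre-net and post-net are used is what lets one backbone serve speech recognition (speech in, text out), speech synthesis (text in, speech out), voice conversion (speech in, speech out), and so on.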

This is a very simplified explanation of the architecture. In reality there are further mathematical settings and explanations, but they will not fit within roughly 3,000 characters, so I will omit them here.

Now that we have touched on the architecture, let's go a little further into the results!

As I said at the beginning, this is a multi-task model. There are therefore many result tables, but here let us look at the most straightforward one: speech recognition.

The bottom row, SpeechT5, is the most noteworthy. WER is an indicator of model performance: it compares the speech recognition output with the original reference text and measures how many words were recognized incorrectly.
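For reference, here is a minimal sketch of how WER can be computed: a word-level edit distance between the reference transcript and the recognition output, divided by the number of reference words. The example sentences are made up.

```python
# Word error rate (WER): word-level edit distance between reference and hypothesis,
# divided by the number of reference words. Lower is better.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```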

Since it is an error rate, the lower the number, the better. Now, what are the results?

The results are obvious. You can see that the values are lower than those of the other models, i.e., it has very high speech recognition performance.

In my own research I measure and compare the performance of speech recognition models on dysarthric speech, and Wav2vec2 and HuBERT are both very good models; about the only model that comes close to surpassing them is Whisper.

Well, SpeechT5 can handle multiple tasks, while Whisper is a multilingual model that supports many languages, so the two cannot simply be compared.

As a researcher, I would like to examine which model is stronger on dysarthric speech.

This is the end of this article~.

This model is relatively easy for anyone to try, as links to Colab notebooks are available on Hugging Face and other sites, so if you are interested, please experience its performance for yourself.
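As a pointer, here is a minimal text-to-speech sketch using the Hugging Face transformers integration of SpeechT5, as I understand it from the public model cards; the model names, the x-vector dataset, and the 16 kHz sampling rate are assumptions worth double-checking there.

```python
# Minimal SpeechT5 text-to-speech sketch with Hugging Face transformers
# (model names and sampling rate taken from the public model cards; please verify).
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="SpeechT5 handles speech and text in one model.", return_tensors="pt")

# Speaker identity is given as a 512-dim x-vector; here we borrow one from an
# x-vector dataset published on the Hub (any 512-dim x-vector should work).
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("speecht5_demo.wav", speech.numpy(), samplerate=16000)
```

Running it should produce a short English WAV file; the speaker embedding controls which voice it speaks in.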

After all, you learn best by doing things with your own hands.

A Little Chat with Fledgling Writer Ogasawara

We are looking for companies and graduate students who are interested in conducting collaborative research!

My specialty is speech recognition (experimental research), especially speech recognition for people with dysarthria.

This field has limited resources available, and there will always be a limit to what one person can tackle alone.

Who would like to join us in solving social issues using the latest technology?

