Conformer: Transformer Applied To Speech Recognition! Transformer X CNN By Google Is Too Awesome!

Voice Recognition 19/11/2020

3 main points
✔️ Conformer, a model combining Transformer and CNN, is applied to speech recognition
✔️ The convolution module was found to be the most important component of Conformer
✔️ Best accuracy among previously published speech recognition models on LibriSpeech

Conformer: Convolution-augmented Transformer for Speech Recognition
written by Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang
(Submitted on 16 May 2020)
Comments: Accepted at Interspeech 2020
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

Summary

This paper from Google applies the Transformer, which has become popular in machine learning and especially in natural language processing, to speech recognition. Until recently, RNN-based models were the best performers in speech recognition, but Transformer-based and CNN-based models have started to give better results. The paper combines the Transformer with CNNs and calls the resulting model Conformer. Conformer significantly outperforms previous work.

Introduction

In recent years, the accuracy of neural network-based speech recognition systems has improved dramatically. RNNs, which can efficiently model the temporal dependencies in speech, have been at the forefront of this trend. More recently, Transformers based on Self-Attention have emerged, because they can capture longer-range dependencies and can be trained more efficiently. CNNs have also had success, capturing local context through the local receptive field of each layer.

However, Self-Attention and CNNs each have limitations when used alone. The Transformer is good at modeling global context, i.e., long-range dependencies, but less good at extracting local context, i.e., fine-grained local relationships. CNNs are the opposite: as seen in computer vision, they are good at extracting local information within small windows, but they need many layers or parameters to capture relationships over a wider span.
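The "many layers" cost for CNNs can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes plain stacked 1-D convolutions with stride 1 and no dilation, and the function names and the 1000-frame utterance length are hypothetical, not taken from the paper. Under those assumptions, each extra layer widens the receptive field by `kernel_size - 1` frames, whereas a single Self-Attention layer already connects every frame to every other frame.

```python
import math


def conv_receptive_field(num_layers, kernel_size):
    """Frames seen by one output of a stack of stride-1, undilated 1-D convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_layers_needed(seq_len, kernel_size):
    """Smallest number of stacked conv layers whose receptive field spans seq_len frames."""
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2 to widen the receptive field")
    if seq_len <= 1:
        return 0
    return math.ceil((seq_len - 1) / (kernel_size - 1))


if __name__ == "__main__":
    frames = 1000  # hypothetical utterance length in frames
    for k in (3, 32):
        n = conv_layers_needed(frames, k)
        print(f"kernel {k}: {n} layers, receptive field {conv_receptive_field(n, k)}")
```

Real CNN acoustic models usually reduce this depth with striding, pooling, or dilation, so the numbers are an upper bound for the plain case rather than a statement about any specific model.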

For this reason, recent research increasingly combines CNNs and Self-Attention, which performs better than either one alone. Using them together captures both local and global context.

This study combines CNNs and the Transformer and applies the result to speech recognition. It is based on the hypothesis that capturing both local and global information makes the model more parameter-efficient. The authors propose a new way of combining Self-Attention and convolution.
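To make the idea of combining the two operations tangible, here is a minimal toy sketch in pure Python. It is not the Conformer block from the paper: the projections are identity, there is a single attention head, there is no normalization or feed-forward layer, and the names `self_attention`, `local_conv1d`, and `toy_block` are hypothetical. It only shows one global mixing step (Self-Attention) followed by one local mixing step (a 1-D convolution over time), each with a residual connection.

```python
import math


def self_attention(x):
    """Toy single-head dot-product self-attention with identity projections.

    x is a list of T feature vectors; every output frame mixes all T frames.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def local_conv1d(x, kernel):
    """1-D convolution over time with zero padding (odd kernel length assumed).

    The same kernel filters every channel independently, so each output frame
    depends only on its len(kernel) nearest neighbours.
    """
    T, d, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(T):
        row = [0.0] * d
        for i, w in enumerate(kernel):
            s = t + i - half
            if 0 <= s < T:
                for j in range(d):
                    row[j] += w * x[s][j]
        out.append(row)
    return out


def add(x, y):
    """Element-wise sum of two sequences of vectors (residual connection)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def toy_block(x, kernel):
    x = add(x, self_attention(x))        # global context
    x = add(x, local_conv1d(x, kernel))  # local context
    return x


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(toy_block(frames, [0.25, 0.5, 0.25]))
```

Changing the last frame of the input alters the first output of `self_attention` but not the first output of `local_conv1d` with a kernel of length 3, which is the local versus global contrast described above.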

The model, named Conformer, achieved the best results on the LibriSpeech dataset, outperforming the Transformer Transducer, the best previously published model, by 15% (relative). The authors also experimented with models of 10, 30, and 118 million parameters, and even the mid-sized 30-million-parameter model already outperformed the Transformer Transducer.


山下夏生
