
Conformer: Transformer Applied To Speech Recognition! Transformer X CNN By Google Is Too Awesome!



3 main points
✔️ Conformer, a model that combines Transformer and CNN, is applied to speech recognition
✔️ The convolution module was found to be the most important component of Conformer
✔️ Best accuracy among existing speech recognition studies

Conformer: Convolution-augmented Transformer for Speech Recognition
written by Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang
(Submitted on 16 May 2020)

Comments: Accepted at Interspeech 2020
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)


This paper from Google applies the Transformer, which has become popular in machine learning and especially in natural language processing, to speech recognition. Until recently, RNN-based models were the strongest in speech recognition, but Transformer- and CNN-based models have started to give better results. The authors call their combination of Transformer and CNN the Conformer. Not surprisingly, the Conformer significantly outperforms previous work.


In recent years, the accuracy of neural-network-based speech recognition systems has improved dramatically. RNNs, which model the temporal dependencies of speech efficiently, have led this trend. More recently, Transformers based on Self-Attention have emerged, because they capture longer-range dependencies and train more efficiently. CNNs have also been successful, capturing local context through the local receptive field of each layer.

However, Self-Attention and CNNs each have limitations when used alone. The Transformer is good at modeling global context, i.e., long-range dependencies, but less good at extracting local context, i.e., fine-grained relationships between neighboring frames. The CNN is the opposite: as seen in computer vision, it is good at extracting local information from small neighborhoods, but it needs many layers and parameters to capture relationships over a wider span.
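The "many layers" point can be made concrete with a little arithmetic. The sketch below assumes stride-1, undilated 1-D convolutions, for which a stack of L layers with kernel size k sees 1 + L(k - 1) input frames. The function names and the 1000-frame utterance are illustrative choices, not values from the paper.

```python
import math


def conv_receptive_field(num_layers: int, kernel_size: int) -> int:
    """Input frames seen by one output position after stacking
    stride-1, undilated 1-D convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_layers_to_cover(num_frames: int, kernel_size: int) -> int:
    """Smallest number of stacked layers whose receptive field
    spans num_frames frames."""
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    return max(0, math.ceil((num_frames - 1) / (kernel_size - 1)))


# Hypothetical utterance of 1000 frames and a kernel of size 3.
print(conv_layers_to_cover(1000, 3))  # 500
```

A single Self-Attention layer, by contrast, lets every frame attend to every other frame, which is why it handles global context well but has no built-in preference for nearby frames.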

For this reason, recent research increasingly combines CNNs and Self-Attention and reports better results than either achieves alone. Using both together makes it possible to capture local and global context at the same time.

This study combines the CNN and the Transformer and applies the result to speech recognition. It is based on the hypothesis that capturing both local and global interactions is important for parameter efficiency. On that basis, the authors propose a new way of combining Self-Attention and Convolution.
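To give a feel for what combining the two operations means, here is a toy sketch in plain Python. It is not the Conformer block from the paper: it assumes single-head attention with identity projections, a fixed odd-length kernel shared by all channels, and simple residual additions, and every function name is hypothetical. It only shows a global mixing step followed by a local one.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(x):
    """Row t holds the weights frame t places on every frame (global)."""
    scale = math.sqrt(len(x[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in x])
        for q in x
    ]


def self_attention(x):
    """Single-head attention with identity projections (a simplification)."""
    dim = len(x[0])
    return [
        [sum(w * v[d] for w, v in zip(row, x)) for d in range(dim)]
        for row in attention_weights(x)
    ]


def local_conv(x, kernel):
    """Per-channel 1-D convolution over time with zero padding (local).
    Assumes an odd-length kernel; cross-correlation form."""
    half = len(kernel) // 2
    num_frames, dim = len(x), len(x[0])
    out = []
    for t in range(num_frames):
        frame = [0.0] * dim
        for j, w in enumerate(kernel):
            s = t + j - half
            if 0 <= s < num_frames:
                for d in range(dim):
                    frame[d] += w * x[s][d]
        out.append(frame)
    return out


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def toy_block(x, kernel):
    h = add(x, self_attention(x))         # global mixing plus residual
    return add(h, local_conv(h, kernel))  # local mixing plus residual


# Four frames with two features each (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
output = toy_block(frames, [0.25, 0.5, 0.25])
print(len(output), len(output[0]))  # 4 2
```

The output has the same shape as the input, so blocks of this kind can be stacked. The attention step lets each frame draw on the whole sequence, while the convolution step only blends each frame with its immediate neighbors.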

The resulting model, Conformer, achieved the best results on the LibriSpeech dataset, outperforming the Transformer Transducer, the best previously published model, by 15% in relative terms. The authors also experimented with three model sizes of roughly 10, 30, and 118 million parameters, and even the medium model with 30 million parameters already outperformed the Transformer Transducer.
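If the 15% figure is read as a relative reduction in word error rate (WER), which is an assumption here, it can be computed as below. The two WER values are made-up numbers chosen only to illustrate the formula. They are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent, for illustration only.
print(round(relative_improvement(4.0, 3.4), 2))  # 0.15
```

In this made-up case, a 15% relative improvement is an absolute drop of only 0.6 WER points, so the two readings should not be confused.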
