This Is The SoTA Paper On Speech Recognition! What A Study By Google That Pushes The Limits Of Semi-supervised Learning!

Voice Recognition 25/06/2021

3 main points
✔️ Google published a SoTA paper on speech recognition
✔️ Based on the Transformer-based speech recognition model Conformer
✔️ Combines best practices of self-training and semi-supervised learning

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
written by Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu
(Submitted on 20 Oct 2020)
Comments: Accepted by NeurIPS SAS 2020 Workshop
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)

Introduction

In recent years, semi-supervised learning has significantly improved the performance of speech recognition. The goal of semi-supervised learning is to improve on supervised learning with labeled data by also exploiting a large amount of unlabeled data. This paper achieves SoTA on the LibriSpeech dataset, as shown in the figure above, by combining two recently developed techniques: pre-training and self-training. The unlabeled data for semi-supervised learning comes from Libri-Light, which is built from the free LibriVox audiobooks.

Let's first review iterative self-training. In self-training, a model is first trained on labeled data, and that model is then used to assign labels (pseudo-labels) to unlabeled data. A new model is trained on the pseudo-labeled data, and the process can be repeated. The model that assigns labels is called the teacher model, and the model that learns from the pseudo-labels is called the student model.

In pre-training, on the other hand, the model is first trained on a pre-training task, typically on unlabeled data, and then fine-tuned on supervised data for the target task. In image recognition, for example, models trained on ImageNet's image classification task are used as initial parameters for other tasks. In natural language processing, BERT is first trained to predict masked words in a sentence from the surrounding context.

This paper proposes a method that combines iterative self-training and pre-training: several models are pre-trained and then used as the initial models for iterative self-training. The unlabeled dataset therefore plays two roles. It is the pre-training dataset, and it is also the dataset for which pseudo-labels are generated to train the student models. The idea itself has been widely studied in image recognition; here it is applied to speech recognition.

Proposed method

Model Architecture: Conformer

The model architecture is based on Conformer, a Transformer-based speech recognition model. Conformer is explained in detail in this article, if you are interested.

The speech recognition network is a sequence transducer consisting of an LSTM decoder and a Conformer encoder. The main component of the encoder is a stack of "Conformer Blocks", each consisting of multi-headed self-attention, depth-wise convolution, and feed-forward layers. This Conformer encoder is illustrated in the figure above.

Pre-training with wav2vec 2.0

The paper pre-trains the Conformer encoder in the same way as wav2vec 2.0, using speech from unlab-60k, a subset of Libri-Light. wav2vec 2.0 is described in detail in this article (https://ai-scholar.tech/articles/voice-recognition/wav2vec), so please have a look.

The Conformer encoder can be divided into two parts: a "feature encoder" consisting of the convolution subsampling block, and a "context network" consisting of a linear layer and a stack of Conformer blocks. The features encoded by the convolution subsampling block follow two paths. On one path they are masked and fed to the rest of the network to produce context vectors; on the other they pass through a linear layer to produce target context vectors. The wav2vec 2.0 pre-training then optimizes the contrastive loss between the context vectors at the masked positions and the target context vectors. This mechanism is illustrated in the figure above.
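To make the contrastive objective more concrete, here is a minimal sketch in plain Python. It assumes cosine similarity scaled by a temperature, as in the original wav2vec 2.0 formulation; the function names, the temperature of 0.1, and the toy vectors are illustrative assumptions and are not taken from the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among the distractors.

    context:     context vector at a masked position
    target:      target context vector for the same position
    distractors: target vectors sampled from other positions (assumed input)
    """
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[0]


# Toy usage: the loss is small when the context matches its target
# and large when it matches a distractor instead.
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
confused = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < confused)
```

Minimizing this loss pushes each masked context vector toward its own target and away from targets at other positions.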

Noisy Student Training with SpecAugment

The paper uses the noisy student training (NST) pipeline to further train the models pre-trained with wav2vec 2.0. In NST, the teacher model is obtained by shallow-fusing the entire ASR model with a language model, and it generates transcripts for the unlabeled data by running inference on speech that has not been augmented. The teacher-labeled data, after filtering and balancing, is then used to train the next-generation ASR model. The input data for the student model is augmented with adaptive SpecAugment. Experiments show that, to achieve SoTA performance on LibriSpeech, it is best to use all of the data generated by the teacher model, without filtering or balancing.
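As a rough illustration of the augmentation applied to the student's input, the sketch below masks random frequency bands and time spans of a spectrogram, with the maximum time-mask width scaled by utterance length (the "adaptive" part). The function name, parameter names, and default values are assumptions chosen for illustration; they are not the settings used in the paper.

```python
import random


def spec_augment(spectrogram, num_freq_masks=2, max_freq_width=3,
                 num_time_masks=2, time_mask_ratio=0.05, rng=None):
    """Return a masked copy of `spectrogram` (list of frames, each a list of bins)."""
    rng = rng or random.Random()
    num_frames = len(spectrogram)
    num_bins = len(spectrogram[0])
    out = [list(frame) for frame in spectrogram]  # leave the input untouched

    # Frequency masking: zero a random band of bins in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_bins))
        start = rng.randint(0, num_bins - width)
        for frame in out:
            for f in range(start, start + width):
                frame[f] = 0.0

    # Adaptive time masking: the maximum width grows with utterance length.
    max_time_width = int(time_mask_ratio * num_frames)
    for _ in range(num_time_masks):
        width = rng.randint(0, max_time_width)
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * num_bins
    return out


# Toy usage: a 100-frame, 8-bin spectrogram of ones.
toy = [[1.0] * 8 for _ in range(100)]
masked = spec_augment(toy, rng=random.Random(0))
print(sum(1 for frame in masked if not any(frame)), "frames fully masked")
```

The teacher sees the clean audio while the student sees the masked version, which is what makes the student "noisy".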

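To summarize, let S be the labeled LibriSpeech dataset, U the unlabeled Libri-Light dataset, and LM the language model trained on the LibriSpeech language model corpus. A set of models is then trained as follows. First, a pre-trained model is fine-tuned on S with SpecAugment. The resulting model is fused with LM and used to label U, the pseudo-labeled U is mixed with S, and a new pre-trained model is fine-tuned on the mixture with SpecAugment. These last steps are repeated, with each student serving as the next teacher.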
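The loop described above can be sketched as follows. This is only a schematic under stated assumptions: `fine_tune`, `fuse_with_lm`, and `transcribe` are hypothetical callables supplied by the caller, and the toy stand-ins at the bottom exist only so the sketch runs; none of these names come from the paper.

```python
def noisy_student_training(pretrained_checkpoints, labeled_s, unlabeled_u,
                           fine_tune, fuse_with_lm, transcribe):
    """Run one generation per pre-trained checkpoint and return all models.

    fine_tune(checkpoint, dataset) -> model fine-tuned with augmentation
    fuse_with_lm(model)            -> teacher (ASR model shallow-fused with LM)
    transcribe(teacher, audio)     -> pseudo-label for un-augmented audio
    """
    # Generation 0: fine-tune the first pre-trained model on labeled data only.
    model = fine_tune(pretrained_checkpoints[0], labeled_s)
    generations = [model]

    for checkpoint in pretrained_checkpoints[1:]:
        teacher = fuse_with_lm(model)
        # Label all of U with the teacher; no filtering or balancing here.
        pseudo_labeled = [(audio, transcribe(teacher, audio))
                          for audio in unlabeled_u]
        mixed = labeled_s + pseudo_labeled
        # The student starts from a fresh pre-trained checkpoint.
        model = fine_tune(checkpoint, mixed)
        generations.append(model)
    return generations


# Toy stand-ins (hypothetical) that only record what each step received.
def toy_fine_tune(checkpoint, dataset):
    return {"init": checkpoint, "train_size": len(dataset)}


def toy_fuse(model):
    return {"asr": model, "lm": "LM"}


def toy_transcribe(teacher, audio):
    return "pseudo transcript for " + audio


models = noisy_student_training(
    ["ckpt0", "ckpt1", "ckpt2"], [("s1.wav", "hello world")],
    ["u1.wav", "u2.wav"], toy_fine_tune, toy_fuse, toy_transcribe)
print(len(models), "generations trained")
```

Passing the training steps in as callables keeps the control flow of the pipeline visible without committing to any particular training framework.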

Experimental results

The WERs (%) obtained in the LibriSpeech experiments are shown in the figure above. The figure compares models trained without unlabeled data (Baseline), models trained without pre-training (NST only), models pre-trained and then fine-tuned on supervised data (pre-training only), and models trained with the Semi-Supervised Learning (SSL) pipeline proposed in this paper. The results show that the generation-3 Conformer XXL model trained with the proposed SSL pipeline achieves the best performance.
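For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below is a minimal plain-Python version; the function name and the example sentences are illustrative assumptions, and real evaluations usually normalize the text first.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            deletion = prev[j] + 1       # a reference word is missing
            insertion = curr[j - 1] + 1  # an extra word was recognized
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Toy usage: one missing word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, and a WER above 100% is possible when the hypothesis contains many inserted words.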

Conclusion

This paper combines recent advances in architecture, augmentation, and especially semi-supervised learning to achieve SoTA on the LibriSpeech speech recognition task.

山下夏生
