
BERT For Speech Recognition



3 main points
✔️ A simple approach to fine-tune BERT for Speech Recognition.
✔️ A novel method to harness the potential of powerful language models like BERT.
✔️ Testing and benchmarking on a Mandarin speech recognition dataset.

Speech Recognition by Simply Fine-tuning BERT
written by Wen-Chin Huang, Chia-Hua Wu, Shang-Bao Luo, Kuan-Yu Chen, Hsin-Min Wang, Tomoki Toda
(Submitted on 30 Jan 2021)
Comments: Accepted to ICASSP 2021.

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)




Conventional speech recognition systems comprise a language model (LM), an acoustic model, and a lexicon. More recently, end-to-end automatic speech recognition (ASR) systems have emerged that consist of a single unified model, but these need to be trained on large volumes of labeled data. In parallel, language models such as BERT and GPT have grown rapidly in capacity thanks to self-supervised training. These LMs are trained on large corpora of unlabeled text, which is relatively easy to collect from the Internet.

LMs have played an important role in conventional speech recognition, and adding an LM to an end-to-end ASR system has also been found to improve performance. For humans, too, a richer vocabulary makes it easier to recognize more words. It is therefore natural to ask how much additional knowledge an LM needs in order to work as an ASR system. A powerful language model can already predict the likely next word in a sentence, so only a few clues from the speech signal should be enough to recognize the words. Based on this intuition, the paper fine-tunes a BERT model to recognize speech.

What is BERT?

BERT is a language model built from a stack of transformer encoder layers. Its vocabulary is derived from the training corpus and consists of WordPiece units: frequent words are kept whole, while rarer words are split into smaller subword pieces. The vocabulary also includes special tokens such as [SEP], which separates sentences, and [CLS], which is placed at the start of every input and whose output is used for classification tasks. Each vocabulary entry is mapped to an integer ID. During pre-training, about 15% of the tokens in a sentence are masked, and BERT is optimized to predict them by outputting a probability distribution over the whole vocabulary at each masked position.
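The masking step can be illustrated with a minimal Python sketch. The function name `mask_tokens` and its parameters are hypothetical, and the sketch is simplified: it always substitutes `[MASK]`, whereas the original BERT recipe replaces only 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token at masked positions and None elsewhere.
    Simplification: every selected token becomes [MASK].
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so every example yields a training signal.
    n_to_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the model is trained to fill in the words that were hidden from it".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model only receives `masked`; the loss is computed at the positions where `labels` is not `None`.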

BERT has been fine-tuned for a variety of tasks such as sentiment analysis, question answering, text summarization, neural machine translation, and more. During fine-tuning, small task-specific changes are made to the model, typically to its input or output layers, and it is trained further on task-specific data.

BERT for speech recognition

We assume that we have an ASR dataset such as LibriSpeech, $D = \{X_i, Y_i\}_{i=1}^{N}$. Here, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{ik}\}$ denotes the acoustic features of the $i$-th utterance and $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{il}\}$ is the corresponding text. Each acoustic feature has dimension $d$, and the vocabulary size is $V$.

Fine-Tuning the Probabilistic Language Model: BERT-LM 

Fine-tuning the probabilistic LM is quite simple: the model is trained to always predict the next word in the sentence. At $t=1$, the input sequence is just the [CLS] token, from which the model predicts the first word $y_1$. The predicted word is then appended to the input, and the sequence is passed through the model again to obtain the next word, and so on until the sentence is complete. In short, every word is predicted from [CLS] and the words that precede it.
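Written out under the standard autoregressive factorization, the idea looks roughly as follows. The notation is an assumption reconstructed from the description above rather than a copy of the paper's equation; $l$ is the sentence length and the maximum runs over the $V$ vocabulary entries.

$$
P(Y) = \prod_{t=1}^{l} P\left(y_t \mid \texttt{[CLS]}, y_1, \ldots, y_{t-1}\right),
\qquad
\hat{y}_t = \arg\max_{y} P\left(y \mid \texttt{[CLS]}, \hat{y}_1, \ldots, \hat{y}_{t-1}\right)
$$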

Fine-Tuning the Automatic Speech Recognition Model: BERT-ASR 

The BERT-ASR model is fine-tuned in much the same way as BERT-LM. As shown in the second figure, the model is trained to predict the next word in the sequence, just like the language model. The difference, depicted in the upper figure, is that an acoustic embedding is added to the three embeddings BERT already uses (token, position, and segment embeddings). These acoustic embeddings are produced by an acoustic encoder, which is described later.
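A minimal sketch of this summation in plain Python is shown below. The function names and the tiny three-dimensional vectors are purely illustrative assumptions; in a real model each embedding would have BERT's hidden size and would come from learned lookup tables or from the acoustic encoder.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    size = len(vectors[0])
    if any(len(v) != size for v in vectors):
        raise ValueError("all embeddings must share the same dimension")
    return [sum(values) for values in zip(*vectors)]

def input_embeddings(token_emb, position_emb, segment_emb, acoustic_emb):
    """Sum the four embeddings position by position."""
    streams = (token_emb, position_emb, segment_emb, acoustic_emb)
    if len({len(s) for s in streams}) != 1:
        raise ValueError("all embedding sequences must have the same length")
    return [add_vectors(tok, pos, seg, ac) for tok, pos, seg, ac in zip(*streams)]

# Two input positions, embedding dimension 3 (toy values).
token_emb = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
position_emb = [[0.5, 0.5, 0.5], [0.25, 0.25, 0.25]]
segment_emb = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
acoustic_emb = [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

summed = input_embeddings(token_emb, position_emb, segment_emb, acoustic_emb)
print(summed)
```

Because the acoustic information enters as one more additive term, the rest of the BERT architecture does not have to change.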


Align text and sound features

One assumption made here is that the frames of the acoustic features $X$ have already been aligned with their respective words, i.e. a range of consecutive frames in $X$ has been assigned to each word in $Y$. In practice, this alignment can be obtained with an HMM/DNN model. For ASR, the next-word prediction of BERT-LM (equation (1) in the paper) is extended so that each prediction also depends on the acoustic features, where $F_t$ denotes the segment of acoustic features corresponding to the word $y_t$.
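One way to write this down is sketched below. It assumes that the prediction of $y_t$ is conditioned on the aligned segments up to and including $F_t$; this is a reading of the description above, not a verbatim copy of the paper's equation.

$$
P(Y \mid X) = \prod_{t=1}^{l} P\left(y_t \mid \texttt{[CLS]}, y_1, \ldots, y_{t-1}, F_1, \ldots, F_t\right)
$$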

Acoustic encoder

We introduce two types of acoustic encoders that transform the acoustic features into acoustic embeddings.

  1. Average Encoder
    This is a rather simple approach in which the acoustic features of each segment are averaged along the time axis. The averaged vector is then passed through a linear layer that distills the information and matches its dimension to that of the other embeddings (token, position, and segment embeddings).

  2. Conv1d-Resnet
    The average encoder discards the temporal dependencies between the frames of the acoustic features. To capture those dependencies in the embeddings, the features are passed through a series of residual blocks rather than simply being averaged.
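The two encoders can be sketched in plain Python as follows. This is a toy illustration under several assumptions that do not come from the paper: the convolution applies one shared kernel to every feature channel, each residual block is `x + relu(conv(x))`, and the output of the residual blocks is averaged over time and projected with the same linear layer as the average encoder. All function names are hypothetical.

```python
def average_encoder(segment, weight, bias):
    """Average a segment over time, then apply a linear layer (d -> h).

    segment: list of frames, each a list of d floats.
    weight:  h rows of d floats; bias: h floats.
    """
    d = len(segment[0])
    mean = [sum(frame[j] for frame in segment) / len(segment) for j in range(d)]
    return [sum(w * m for w, m in zip(row, mean)) + b for row, b in zip(weight, bias)]

def temporal_conv(frames, kernel):
    """Zero-padded 1-D convolution along time (odd kernel length assumed).

    The same kernel is applied to every feature channel (a simplification).
    """
    half = len(kernel) // 2
    n, d = len(frames), len(frames[0])
    out = []
    for t in range(n):
        row = []
        for j in range(d):
            acc = 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < n:  # frames outside the segment count as zeros
                    acc += w * frames[s][j]
            row.append(acc)
        out.append(row)
    return out

def residual_block(frames, kernel):
    """y = x + relu(conv(x)); the skip connection keeps the input shape."""
    conv = temporal_conv(frames, kernel)
    return [[x + max(c, 0.0) for x, c in zip(f_row, c_row)]
            for f_row, c_row in zip(frames, conv)]

def conv_resnet_encoder(segment, kernels, weight, bias):
    """Apply a series of residual blocks, then pool and project (assumed)."""
    hidden = segment
    for kernel in kernels:
        hidden = residual_block(hidden, kernel)
    return average_encoder(hidden, weight, bias)

# One word segment with 3 frames of dimension d=2, projected to h=3.
segment = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]

avg_emb = average_encoder(segment, weight, bias)
res_emb = conv_resnet_encoder(segment, [[0.0, 1.0, 0.0]], weight, bias)
print(avg_emb)
print(res_emb)
```

The average encoder produces the same embedding for any reordering of the frames, while the convolution in the residual blocks mixes neighbouring frames and is therefore sensitive to their order.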


BERT-ASR and BERT-LM were both fine-tuned and tested on the AISHELL-1 dataset, a corpus for Mandarin speech recognition. A BERT model pre-trained on Chinese Wikipedia was used as the starting point. The alignment between acoustic features and text was obtained with an HMM/DNN model, also trained on the AISHELL-1 training set. Two decoding scenarios were tested: oracle decoding (Orac.), where the alignment is assumed to be available, and practical decoding (Prac.), where the alignment is assumed to be linear (25 frames per word). The beam size was set to 10 in both cases. The results are shown below:
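The practical decoding scenario can be sketched as follows: the frames are cut into fixed segments of 25 frames, and a beam search keeps the best partial transcriptions at every step. Everything here is a hedged illustration: `linear_segments`, `beam_search`, and the toy scorer are hypothetical names, the scorer stands in for the fine-tuned model, and decoding simply runs for a fixed number of steps.

```python
import math

def linear_segments(num_frames, frames_per_word=25):
    """Return (start, end) frame ranges, end exclusive, one per assumed word."""
    return [(start, min(start + frames_per_word, num_frames))
            for start in range(0, num_frames, frames_per_word)]

def beam_search(step_log_probs, num_steps, beam_size=10):
    """Keep the beam_size highest-scoring prefixes at every step.

    step_log_probs(prefix, t) is a hypothetical scorer that returns
    {token: log_prob} for step t given the prefix decoded so far.
    """
    beams = [((), 0.0)]
    for t in range(num_steps):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix, t).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

def toy_scorer(prefix, t):
    """Stand-in for the model: fixed per-step distributions over two tokens."""
    table = [{"a": 0.6, "b": 0.4}, {"a": 0.1, "b": 0.9}]
    return {tok: math.log(p) for tok, p in table[t].items()}

segments = linear_segments(60)
beams = beam_search(toy_scorer, num_steps=2, beam_size=2)
print(segments)
print(beams[0][0])
```

In an actual system the number of decoding steps would follow from the segments, and the scorer at step `t` would see the acoustic embedding of segment `t` together with the prefix.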


Results on the AISHELL-1 dataset. CER: character error rate; PPL: perplexity.

The results show that the Conv1d-Resnet encoder outperforms the average encoder, and that oracle decoding achieves better results than practical decoding with linear alignment. For further details, please refer to the original paper.
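For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally computed: CER is the character-level edit distance divided by the reference length, and perplexity is the exponential of the negative mean log-probability per token. The function names and the example strings are illustrative and are not taken from the paper's evaluation code.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

def perplexity(log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# One wrong character out of six.
error_rate = cer("今天天气很好", "今天天汽很好")
ppl = perplexity([math.log(0.25)] * 4)
print(error_rate)
print(ppl)
```

Lower is better for both metrics.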


This paper presents the novel idea of fine-tuning a language model like BERT for speech recognition. The results on AISHELL-1 are surprisingly good, but there is still a long way to go. Some of the errors may be attributed to the fact that Mandarin is written with characters and the same sound can map to several different characters. In addition, the model has very little context at the beginning of a sentence, when few or no words have been predicted; errors made there propagate and degrade performance. Nevertheless, it would be interesting to see how this approach works for other languages and with other language models such as ALBERT and GPT.

Thapa Samrat
I am a second-year international student from Nepal, currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.
