
The Secrets Of Speech Recognition Technology


3 main points
✔️ Proposes a method for determining which information is important in a speech recognition system.

✔️ Shows that speech recognition systems learn not only speech features but also other information, such as speaker characteristics and emotions.

✔️ To gain even more insight, the authors plan to add new tasks, such as accent and age, to better understand which information the acoustic model encodes.

Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems
written by Quentin Raymondaud, Mickael Rouvier, Richard Dufour
(Submitted on 29 Feb 2024)
Comments: Published on arxiv.

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Speech recognition technology based on deep learning has made great progress, and speech recognition systems have become far more accurate as a result. However, these systems are highly complex, and it is difficult to understand which information is used where. This paper therefore proposes a method for identifying which information is important in a speech recognition system. Specifically, it uses the intermediate representations of the system to evaluate, through a set of probing tasks, what information each stage encodes.

Through a variety of experiments, the authors show that speech recognition systems learn not only speech features but also other information, such as speaker characteristics and emotions. They also find that information not needed for speech recognition tends to be suppressed in the higher layers. In other words, deep-learning-based speech recognition systems learn more than just speech. This enables more accurate recognition, but it also means the mechanism is very complex, making it difficult to understand which information is used and how.


Recently, speech recognition technology has made significant advances, most notably through the integration of deep learning approaches with large amounts of speech data at both the acoustic and linguistic levels. The transition from classical speech recognition systems to deep neural networks (DNNs) has greatly improved recognition performance. However, it is still difficult to understand what DNNs learn. Whereas earlier work focused on speech features and phonemes, the latest research seeks to better understand how speech recognition systems process information. In particular, focusing on the acoustic model within a speech recognition system and investigating what information is processed, and in which layers, could lead to the development of better speech recognition technology.

Proposed Method

Acoustic model architecture

Acoustic models are the core element of automatic speech recognition (ASR) systems and are trained to recognize basic speech units (usually phonemes) from a given speech signal. Processing speech signals is complex because they carry many kinds of information, including linguistic content, background noise, and speaker characteristics. Accuracy has therefore improved as acoustic models moved away from classical approaches toward architectures based on deep neural networks (DNNs). The TDNN-F (factorized time-delay neural network) architecture is one example of this evolution: it is designed to process the complex information in the signal and performs well on speech recognition tasks.
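The "factorized" part of TDNN-F refers to replacing each large layer weight matrix with a product of two smaller matrices through a low-rank bottleneck, which cuts parameters while keeping the layer's context window. The following is a minimal numpy sketch of that idea with hypothetical dimensions; it omits the semi-orthogonal constraint that the original TDNN-F recipe additionally places on the first factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# A standard TDNN layer maps a spliced context window of frames to a
# hidden vector with one large weight matrix.
context, in_dim, out_dim = 3, 512, 512           # hypothetical sizes
full_params = (context * in_dim) * out_dim       # parameters of the full matrix

# TDNN-F factorizes that matrix into two smaller ones through a
# low-rank linear bottleneck (rank 128 here, chosen for illustration).
bottleneck = 128
W1 = rng.standard_normal((bottleneck, context * in_dim)) * 0.01
W2 = rng.standard_normal((out_dim, bottleneck)) * 0.01
fact_params = W1.size + W2.size                  # far fewer parameters

# Forward pass for one spliced input frame.
x = rng.standard_normal(context * in_dim)
h = W2 @ (W1 @ x)                                # factorized projection

print(full_params, fact_params, h.shape)
```

With these sizes the factorized form uses roughly a third of the parameters of the full matrix, which is the main efficiency argument for the architecture.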

Proposed Protocol

The study uses multiple classification tasks to reveal in which hidden layers of the acoustic model specific information resides, and how it varies with the depth of the network. This makes it possible to understand what information each layer of the acoustic model extracts and for which tasks it is useful, providing insights that could help improve speech recognition systems. The following figure shows the proposed protocol for probing the information in the acoustic model.
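The probing protocol boils down to: freeze the acoustic model, extract the activations of one hidden layer for each utterance, and train a simple classifier to predict some property (speaker gender, speaking rate, etc.) from those activations; high classifier performance means the layer encodes that property. The following is a simplified numpy stand-in for that protocol, using synthetic activations and a least-squares linear probe rather than the classifiers used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden-layer activations extracted from a
# frozen acoustic model: 200 utterances, 64-dim vector each, labelled
# with a hypothetical binary property (e.g. speaker gender).
n, dim = 200, 64
labels = rng.integers(0, 2, size=n)

def probe_accuracy(feats, labels):
    """Fit a least-squares linear probe and report training accuracy.
    (A simplified stand-in for the per-layer, per-task classifiers
    trained in the paper.)"""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # add a bias column
    y = labels * 2.0 - 1.0                            # map {0,1} -> {-1,+1}
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    preds = (X @ w > 0).astype(int)
    return (preds == labels).mean()

# "Layer A" activations encode the property; "layer B" is pure noise.
informative = rng.standard_normal((n, dim)) + 3.0 * labels[:, None]
uninformative = rng.standard_normal((n, dim))

acc_a = probe_accuracy(informative, labels)
acc_b = probe_accuracy(uninformative, labels)
print(acc_a, acc_b)  # the informative layer should score much higher
```

Comparing the probe's score across layers is exactly how the paper localizes where each kind of information lives in the network.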

Research Tasks

Speaker verification evaluates the ability to identify the speaker from speech. The speaking-rate task examines how well the model's representations capture variations in speaking rate, and the speaker-gender task evaluates how accurately the acoustic model's representations predict the speaker's gender. In addition, the acoustic-environment task estimates the environment in which the speech was recorded, and the speech emotion recognition task evaluates the ability to estimate emotions from speech. Together, these tasks provide valuable insight into what information the acoustic model encodes and how useful it is.


The table above shows performance on the different probing tasks (tasks designed to test for specific information). Performance is expressed as EER (Equal Error Rate) for speaker verification and as accuracy for the other tasks, comparing each hidden layer of TDNN-F against an MFCC (acoustic feature) baseline. The results show that vector representations from the hidden layers usually yield better classification results than the traditional MFCCs. The exception is speaker verification, where MFCCs are superior: speaker-identity information tends to be suppressed during training, in contrast to the information needed for the other tasks. The same tendency is observed in self-supervised learning models, suggesting that speaker-identity information is not useful for phoneme identification and is therefore suppressed. Overall, this indicates that the hidden layers of the acoustic model contain structured information that is useful for different tasks.
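The EER used for speaker verification is the operating point at which the false-acceptance rate (impostors accepted) equals the false-rejection rate (genuine speakers rejected); lower is better. A small self-contained sketch of how it can be computed from raw verification scores, using toy Gaussian-distributed scores rather than real trial data:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return the EER: the error rate at the threshold where the
    false-acceptance rate (FAR) equals the false-rejection rate (FRR).
    labels: 1 for genuine (same-speaker) trials, 0 for impostor trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = 1.0, 1.0
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[labels == 0])    # impostors wrongly accepted
        frr = np.mean(~accept[labels == 1])   # genuines wrongly rejected
        if abs(far - frr) < best_gap:         # closest FAR/FRR crossing
            best_gap = abs(far - frr)
            eer = (far + frr) / 2.0
    return eer

# Toy verification scores: genuine trials score higher on average.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 500)
impostor = rng.normal(0.0, 1.0, 500)
scores = np.concatenate([genuine, impostor])
labels = np.concatenate([np.ones(500, int), np.zeros(500, int)])
print(equal_error_rate(scores, labels))  # roughly 0.16 for this toy setup
```

This threshold-sweep formulation makes clear why EER, unlike plain accuracy, is insensitive to the genuine/impostor class balance, which is why it is the standard metric for the verification task.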

Furthermore, the results show that information is encoded and suppressed differently depending on network depth. The lower hidden layers best capture ambient noise and achieve the best performance on the acoustic-environment task, whereas the mid-level hidden layers perform best on tasks such as speaker gender and speaking rate. These findings are important for understanding how acoustic models process information across different tasks.


In this paper, a protocol was proposed to investigate the information contained in the acoustic models used in speech recognition systems. A variety of speech-oriented tasks were used to study neural-based acoustic models in detail. The study analyzed the performance of the TDNN-F acoustic model at its various hidden layers to understand the information contained at different levels, covering speaker, acoustic environment, and speech properties. For example, it showed that information related to gender, speaking rate, speaker identity, emotion, and sentiment is encoded. The results highlight that information is encoded in different ways within the acoustic model: in the lower layers information becomes increasingly structured and probe performance keeps improving, but eventually information tends to be suppressed.

In future work, the authors plan to add new tasks, such as accent and age, to better understand which information the acoustic model encodes. They also hope to examine the representations of other acoustic models, such as wav2vec.

