
A New Wave Of Multi-Speaker Speech Recognition! DiCoW And DiariZen Combine For A High-Accuracy System
3 main points
✔️ Proposed a multi-speaker speech recognition system combining DiCoW and DiariZen to achieve high accuracy
✔️ Integrates speaker diarization and ASR, performing robustly in unseen domains and taking 2nd place in the challenge
✔️ Improves recognition performance by adding a VAD model to handle label mismatches in the training data
BUT System for the MLC-SLM Challenge
written by Alexander Polok, Jiangyu Han, Dominik Klement, Samuele Cornell, Jan Černocký, Lukáš Burget
(Submitted on 16 Jun 2025)
Comments: Published on arXiv.
Subjects: Audio and Speech Processing (eess.AS)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
This paper proposes a system that combines two models, DiCoW and DiariZen, to address the challenge of multilingual, multi-speaker automatic speech recognition (ASR).
DiCoW is based on the Whisper model and performs speech recognition conditioned on frame-level speaker information. DiariZen, on the other hand, is a speaker diarization pipeline built on Pyannote.
The authors first applied both models, in their pre-trained state, to the multilingual data and verified their generalization to unseen domains. The results showed that DiariZen outperformed the baseline Pyannote model. The models were then fine-tuned on the MLC-SLM Challenge data to improve recognition accuracy. Ultimately, the proposed system placed second in Task 2 of the challenge and reportedly demonstrated strong robustness to speaker diversity and data inconsistencies.
Proposed Methodology
The proposed method consists of two major components.
The first is DiariZen, which divides the audio into chunks and performs local end-to-end neural diarization (EEND) on each. By clustering the speaker embeddings obtained from these chunks, it maps speaker identities across chunks to produce the overall diarization, as sketched below.
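As a rough illustration of this chunk-then-cluster design, here is a minimal sketch. The function names, chunk length, embedding size, and clustering threshold are all illustrative assumptions, not DiariZen's actual API, and the local EEND model is replaced by a placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical sketch of a DiariZen-style chunked diarization pipeline
# (not the real DiariZen API): run a local EEND model on each chunk,
# then cluster speaker embeddings to unify speaker labels across chunks.

def run_local_eend(chunk: np.ndarray, max_speakers: int = 4):
    """Placeholder for a local EEND model. Returns per-frame speaker
    activity probabilities (frames x max_speakers) and one embedding
    per chunk-local speaker slot."""
    n_frames = max(len(chunk) // 160, 1)              # assume 10 ms hop at 16 kHz
    activities = np.random.rand(n_frames, max_speakers)
    embeddings = np.random.randn(max_speakers, 192)   # assumed embedding size
    return activities, embeddings

def diarize(audio: np.ndarray, sr: int = 16000, chunk_sec: float = 8.0):
    chunk_len = int(chunk_sec * sr)
    local_activities, all_embeddings = [], []
    for start in range(0, len(audio), chunk_len):
        acts, embs = run_local_eend(audio[start:start + chunk_len])
        local_activities.append(acts)
        all_embeddings.append(embs)
    # Agglomerative clustering maps each chunk-local speaker slot to a
    # global speaker identity shared across the whole recording.
    embeddings = np.concatenate(all_embeddings)
    global_labels = fcluster(
        linkage(embeddings, method="average", metric="cosine"),
        t=0.7, criterion="distance",
    )
    return local_activities, global_labels

local_acts, labels = diarize(np.random.randn(16000 * 30))  # 30 s of dummy audio
```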
The second is DiCoW, which feeds the Whisper architecture probabilistic STNO masks (silence, target speaker, non-target speaker, overlap) representing frame-level speaker activity, and dynamically transforms the input representation at each Transformer layer. These transformations, called Frame-Level Diarization-Dependent Transformations (FDDT), let the model learn in a speaker-conditioned manner.
This enables ASR that relies neither on speaker embeddings nor on pre-enrolled speech, but only on probabilistic speaker activity information, as the sketch below illustrates.
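To make the FDDT idea concrete, here is a minimal PyTorch sketch, assuming four learned affine transforms, one per STNO class, mixed by the frame-level STNO probabilities. The module name, shapes, and placement within the encoder are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class FDDT(nn.Module):
    """Illustrative Frame-Level Diarization-Dependent Transformation:
    one learned affine transform per STNO class, blended per frame by
    the STNO probabilities (shapes and naming are assumptions)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.transforms = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(4)]
        )

    def forward(self, hidden: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, frames, d_model) encoder states
        # stno:   (batch, frames, 4) probabilities over
        #         [silence, target, non-target, overlap]
        out = torch.zeros_like(hidden)
        for k, transform in enumerate(self.transforms):
            # Weight each class-specific transform by that class's
            # frame-level probability and sum the contributions.
            out = out + stno[..., k:k + 1] * transform(hidden)
        return out

# Usage: condition Whisper-style encoder states on STNO probabilities.
fddt = FDDT(d_model=512)
h = torch.randn(2, 100, 512)
stno = torch.softmax(torch.randn(2, 100, 4), dim=-1)
h_conditioned = fddt(h, stno)
```

Because the conditioning signal is a probability distribution over STNO classes rather than a speaker embedding, the same module can handle speakers never seen at enrollment time.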
Experiments
In the experiments, the authors first evaluated the diarization performance of DiariZen and Pyannote, both in unseen domains and after fine-tuning. The results showed that DiariZen reached a DER (diarization error rate) of 12.7% after fine-tuning, lower than Pyannote's 16.4%. They then evaluated the speech recognition performance of DiCoW and found that even the pre-trained model achieved a tcpWER (a time-constrained, multi-speaker word error rate) substantially better than the baseline. After fine-tuning, DiCoW reached a tcpWER below 20% for many languages.
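For reference, the DER quoted above counts missed speech, false alarms, and speaker confusion relative to the total reference speech. The following is a simplified frame-level sketch; real DER scoring additionally applies an optimal reference-to-hypothesis speaker mapping and a forgiveness collar, which this toy version omits.

```python
import numpy as np

def simple_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Toy frame-level DER: ref/hyp hold integer speaker labels per
    frame, with 0 meaning silence. Omits the optimal speaker mapping
    and collar used in real DER scoring."""
    speech = ref != 0
    missed = np.sum(speech & (hyp == 0))                     # missed speech
    false_alarm = np.sum(~speech & (hyp != 0))               # false alarm
    confusion = np.sum(speech & (hyp != 0) & (ref != hyp))   # wrong speaker
    return (missed + false_alarm + confusion) / max(speech.sum(), 1)

ref = np.array([0, 1, 1, 2, 2, 0])
hyp = np.array([0, 1, 2, 2, 0, 1])
print(f"DER = {simple_der(ref, hyp):.2%}")  # 1 miss + 1 FA + 1 confusion over 4 speech frames
```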
However, the training data contained label mismatches and incorrectly annotated silent segments, which hurt performance in some languages. To address this, the authors introduced an approach that uses an external voice activity detection (VAD) model alongside the diarization output to strengthen silence detection. This approach showed significant improvements in recognition performance on development data close to the test conditions.
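One simple way such a fusion could look, assuming the diarizer emits frame-level STNO probabilities, is to force frames the VAD marks as silent toward the silence class. The fusion rule and threshold below are illustrative guesses, not the authors' exact method.

```python
import numpy as np

def fuse_vad(stno_probs: np.ndarray, vad_speech: np.ndarray,
             silence_threshold: float = 0.3) -> np.ndarray:
    """stno_probs: (frames, 4) probabilities over
    [silence, target, non-target, overlap] from the diarizer.
    vad_speech: (frames,) speech probability from an external VAD."""
    fused = stno_probs.copy()
    silent = vad_speech < silence_threshold
    # Frames the VAD deems silent get all probability mass on "silence",
    # overriding mislabeled silent segments in the diarization output.
    fused[silent] = np.array([1.0, 0.0, 0.0, 0.0])
    return fused
```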