
Facebook AI Has Developed A New Voice Separation Model With RNN! Extracting Only Your Voice From A Large Group Of People's Conversation!

Voice Recognition

3 main points
✔️ Facebook AI develops a new supervised speech separation model using RNNs
✔️ Proposes a new loss for training speech separation networks
✔️ Proposes a model selection method for an unknown number of speakers

Voice Separation with an Unknown Number of Multiple Speakers
written by Eliya Nachmani, Yossi Adi, Lior Wolf
(Submitted on 29 Feb 2020 (v1), last revised 1 Sep 2020 (this version, v4))
Comments: Accepted to ICML 2020
Subjects: Audio and Speech Processing (eess.AS)
  
 

1. Introduction

Facebook's AI research team has announced new research results on speech separation. Speech separation is the task of extracting the voice of a specific person even when multiple people are talking at the same time. While most early work was based on unsupervised separation of sources recorded by multiple microphones, such as independent component analysis, this work addresses supervised speech separation from a single microphone, a setting whose performance has been dramatically improved by deep neural networks.

Existing research has relied on mask processing: a filter that passes only the sound source of speaker A is estimated in advance and multiplied with the input signal to extract speaker A alone. However, the more voices there are to separate, the more features must be extracted, which is a limitation of mask-based methods. In this study, we use an RNN-based model that separates voices without any mask processing. We also propose a new loss for training this RNN-based separation network and show that the new loss improves the performance of the baseline methods as well.
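To make the mask-based baseline concrete, here is a minimal NumPy sketch of the mechanism described above. It is illustrative only: the spectrogram shapes, the random stand-in for the estimated mask, and the sum-to-one constraint are assumptions, not the paper's implementation (the proposed model avoids masks entirely).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed magnitude-spectrogram size; a real system would compute |STFT| of the mixture.
n_freq, n_frames = 257, 100
mixture_spec = rng.random((n_freq, n_frames))   # |STFT| of the mixed recording (stand-in)
mask_a = rng.random((n_freq, n_frames))         # mask for speaker A, values in [0, 1] (stand-in)

# Mask processing: an element-wise filter applied to the mixture.
speaker_a_spec = mask_a * mixture_spec

# With k speakers, k masks must be estimated; they are often constrained to
# sum to 1 at every time-frequency bin, so each bin's energy is divided up.
k = 3
masks = rng.random((k, n_freq, n_frames))
masks /= masks.sum(axis=0, keepdims=True)
separated = masks * mixture_spec                # shape: (k, n_freq, n_frames)
print(separated.shape)                          # (3, 257, 100)
```

The sketch also hints at the limitation mentioned above: every additional speaker requires an additional mask (output channel) to be estimated from the same single-microphone mixture.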

As in the state-of-the-art methods, we train one model for each number of speakers. However, whereas the performance of existing methods degrades noticeably as the number of speakers increases, the performance of our method degrades much more slowly.

We also propose a method that uses a training-free activity detector to deal with an unknown number of speakers (a situation where the number of speakers is not known in advance).
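One way to picture such a selection scheme is sketched below. This is a rough sketch under assumptions: the model interface, the energy-based activity check, and the threshold are placeholders for illustration, not the authors' exact procedure.

```python
import numpy as np

def is_active(channel, threshold=1e-3):
    """Training-free activity check: mean energy above a small threshold (assumed)."""
    return float(np.mean(np.square(channel))) > threshold

def select_and_separate(mixture, models):
    """`models` maps a speaker count k to a function mixture -> k separated waveforms."""
    for k in sorted(models, reverse=True):        # try the model for the largest k first
        channels = models[k](mixture)             # k candidate separated signals
        if all(is_active(c) for c in channels):   # every output channel contains speech
            return k, channels
    k_min = min(models)                           # fall back to the smallest model
    return k_min, models[k_min](mixture)
```

The appeal of this kind of scheme is that the activity detector itself needs no training: the per-speaker separation models do the heavy lifting, and the detector only decides which model's output to trust.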

 
