
Facebook AI Has Developed A New Voice Separation Model With RNN! Extracting Only Your Voice From A Large Group Of People's Conversation!

Voice Recognition

3 main points
✔️ Facebook AI develops a new supervised speech separation model using RNNs
✔️ Proposes a new loss for training speech separation networks
✔️ Proposes a model selection method for an unknown number of speakers

Voice Separation with an Unknown Number of Multiple Speakers
written by Eliya Nachmani, Yossi Adi, Lior Wolf
(Submitted on 29 Feb 2020 (v1), last revised 1 Sep 2020 (this version, v4))
Comments: Accepted to ICML 2020
Subjects: Audio and Speech Processing (eess.AS)
  
 

1. Introduction

Facebook's AI research team has announced new research results on speech separation. Speech separation is the task of extracting the voice of a specific person even when multiple people are talking at the same time. While most early work was based on unsupervised separation of sources recorded by multiple microphones, such as independent component analysis, this work addresses supervised speech separation from a single microphone, a setting whose performance has been dramatically improved by deep neural networks.

Existing research has relied on mask processing: a filter that passes only the sound source of speaker A is estimated in advance and multiplied with the input signal to extract speaker A alone. However, the more voices there are to separate, the more features must be extracted, which is a limitation of mask-based methods. In this study, we use an RNN-based model that separates voices without any mask processing. We also propose a new loss for training this RNN-based separation network and show that the new loss improves the performance of the baseline methods as well.
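To make the mask-based baseline concrete, here is a minimal NumPy sketch of the mechanism described above. It is illustrative only: the spectrogram shapes, the random stand-in for the estimated mask, and the sum-to-one constraint are assumptions, not the paper's implementation (the proposed model avoids masks entirely).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed magnitude-spectrogram size; a real system would compute |STFT| of the mixture.
n_freq, n_frames = 257, 100
mixture_spec = rng.random((n_freq, n_frames))   # |STFT| of the mixed recording (stand-in)
mask_a = rng.random((n_freq, n_frames))         # mask for speaker A, values in [0, 1] (stand-in)

# Mask processing: an element-wise filter applied to the mixture.
speaker_a_spec = mask_a * mixture_spec

# With k speakers, k masks must be estimated; they are often constrained to
# sum to 1 at every time-frequency bin, so each bin's energy is divided up.
k = 3
masks = rng.random((k, n_freq, n_frames))
masks /= masks.sum(axis=0, keepdims=True)
separated = masks * mixture_spec                # shape: (k, n_freq, n_frames)
print(separated.shape)                          # (3, 257, 100)
```

The sketch also hints at the limitation mentioned above: every additional speaker requires an additional mask (output channel) to be estimated from the same single-microphone mixture.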

As in the state-of-the-art methods, we train one model for each number of speakers. However, whereas the performance of existing methods degrades noticeably as the number of speakers increases, the performance of our method degrades much more slowly.

We also propose a method that uses a training-free activity detector to deal with an unknown number of speakers (a situation where the number of speakers is not known in advance).
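One way to picture such a selection scheme is sketched below. This is a rough sketch under assumptions: the model interface, the energy-based activity check, and the threshold are placeholders for illustration, not the authors' exact procedure.

```python
import numpy as np

def is_active(channel, threshold=1e-3):
    """Training-free activity check: mean energy above a small threshold (assumed)."""
    return float(np.mean(np.square(channel))) > threshold

def select_and_separate(mixture, models):
    """`models` maps a speaker count k to a function mixture -> k separated waveforms."""
    for k in sorted(models, reverse=True):        # try the model for the largest k first
        channels = models[k](mixture)             # k candidate separated signals
        if all(is_active(c) for c in channels):   # every output channel contains speech
            return k, channels
    k_min = min(models)                           # fall back to the smallest model
    return k_min, models[k_min](mixture)
```

The appeal of this kind of scheme is that the activity detector itself needs no training: the per-speaker separation models do the heavy lifting, and the detector only decides which model's output to trust.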

 
