[VoiceCraft] A Language Model That Synthesizes Natural Speech At The Highest Level In The Industry

Speech Synthesis

3 main points
✔️ Token-infilling neural codec language model with a Transformer decoder
✔️ Achieves state-of-the-art performance in both speech editing and zero-shot speech synthesis (TTS)
✔️ Introduces REALEDIT, a high-quality dataset for evaluating speech editing

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
written by 
Puyuan Peng, Po-Yao Huang, Daniel Li
(Submitted on 25 Mar 2024)
Comments: Published on arxiv.


The images used in this article are from the paper, the introductory slides, or were created based on them.

VoiceCraft: Achieves SOTA in Speech Editing and Synthesis

This article covers the paper presenting VoiceCraft, a model that achieves SOTA in both speech editing and zero-shot speech synthesis (TTS). Following the paper, we abbreviate Text-to-Speech as TTS.

The key points of this study are as follows

  • Problem setting: Develop a unified model for speech editing and zero-shot speech synthesis (TTS)
  • Solution 1: Propose a token-infilling neural codec language model, VoiceCraft.
  • Solution 2: Train and evaluate VoiceCraft on both speech editing and speech synthesis tasks.
  • Point: VoiceCraft enables speech editing and synthesis at the industry's highest level.

In short, VoiceCraft enables speech editing so natural that it is indistinguishable from the sample voice, while at the same time outperforming previous state-of-the-art models in zero-shot TTS.

Incidentally, the VoiceCraft code and model weights are available on GitHub for the purpose of promoting research in speech synthesis and AI safety.

Research Background

Neural codec language model

In recent years, there has been a lot of research on speech synthesis using neural codec language models.

A neural codec language model is a method of generating speech in the same manner as language generation by converting a speech signal into a sequence of discrete tokens and applying the language model to the sequence.

It is unique in that it does not use a mel spectrogram as an intermediate representation, but instead uses speech tokens.
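To make the idea concrete, here is a minimal sketch of what "speech as discrete tokens" looks like. This is schematic only: the frame rate, number of codebooks, and codebook size below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Schematic: a neural codec (e.g., an Encodec-style residual vector quantizer)
# turns a waveform into K parallel streams of integer tokens, and the language
# model predicts those tokens the way it would predict words, instead of
# predicting a mel spectrogram. All values below are illustrative assumptions.
frame_rate = 50          # codec frames per second (assumed)
K = 4                    # number of residual codebooks (assumed)
codebook_size = 1024     # entries per codebook (assumed)

seconds = 2.0
num_frames = int(seconds * frame_rate)  # 100 frames for 2 s of audio
tokens = np.random.default_rng(0).integers(0, codebook_size, size=(K, num_frames))
print(tokens.shape)  # (4, 100): K codebooks x T time steps
```

Each column of `tokens` is one codec frame; the language model operates on this integer matrix rather than on a continuous spectrogram.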

Zero-shot TTS and audio editing

In zero-shot TTS, you provide a short voice sample and the text you want read aloud; the AI then reads that text in the voice of the sample speaker.

Speech editing, on the other hand, is the task of changing words or phrases within a speech sample so that the result still sounds natural. The parts of the speech outside the edited span must remain unchanged, preserving accent, intonation, and so on.

It is easier to understand this area by referring to the official Demo page.

Various models have been developed for TTS and speech editing, but few can perform both zero-shot TTS and speech editing in a unified manner.

There is also a lack of "more realistic speech data" that includes a variety of accents, speaking styles, recording conditions, and noise.

VoiceCraft's main methods

VoiceCraft achieves speech editing and TTS by rearranging the output tokens of a neural codec language model (NCLM) and then performing autoregressive sequence prediction with a decoder-only Transformer.

The procedure for rearranging tokens has two steps:

  1. Causal masking
  2. Delayed stacking

The first step, causal masking, takes a continuous speech waveform as input and quantizes it with Encodec into a token sequence X. During training, spans of tokens in X are randomly masked and moved to the end of the sequence.

The second step, delayed stacking, shifts each codebook's token stream so that the tokens line up diagonally: when predicting codebook k at time step t of the rearranged sequence Y, the model can condition on codebook k-1.
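The two steps above can be sketched on a toy token matrix. This is a simplified illustration under assumptions: a single masked span, `-1` standing in for the special mask tokens, and none of the paper's additional special tokens (end-of-span markers, etc.).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token matrix: K codebooks x T time steps (Encodec-style).
K, T = 4, 10
Y = rng.integers(0, 1024, size=(K, T))

MASK = -1  # stand-in for a special mask token (simplification)

# Step 1, causal masking: cut a span out of the sequence, leave a mask
# token in its place, and move the span to the end (after another mask
# token marking the boundary).
start, length = 3, 4
masked_span = Y[:, start:start + length]
prefix, suffix = Y[:, :start], Y[:, start + length:]
mask_col = np.full((K, 1), MASK)
rearranged = np.concatenate([prefix, mask_col, suffix, mask_col, masked_span], axis=1)

# Step 2, delayed stacking: shift codebook k right by k steps so that,
# at any column, codebook k sits one step "behind" codebook k-1 and can
# be conditioned on it during prediction.
Tr = rearranged.shape[1]
Z = np.full((K, Tr + K - 1), MASK)
for k in range(K):
    Z[k, k:k + Tr] = rearranged[k]
```

After delayed stacking, reading `Z` column by column visits the codebooks along diagonals of `rearranged`, which is what lets the decoder predict all K codebooks with a single autoregressive pass.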

Modeling with Transformer decoder

The resulting token sequence Z is then modeled autoregressively by the Transformer decoder. The transcript W of the audio is concatenated with Z and fed in, so that generation is conditioned on the text.

At inference time for the speech editing task, the span to be edited is identified, replaced with mask tokens, and the replacement tokens are generated autoregressively.

Zero-shot TTS, on the other hand, concatenates the prompt audio's transcript, the target transcript, and the prompt audio, and the model continues the audio-token sequence from there.
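The zero-shot TTS conditioning can be sketched as a simple concatenation. The helper name `build_tts_prompt` and the token values are hypothetical; the real model additionally uses special separator tokens and learned embeddings rather than raw integer lists.

```python
# Hypothetical sketch of the zero-shot TTS conditioning sequence.
def build_tts_prompt(prompt_text_tokens, target_text_tokens, prompt_audio_tokens):
    # The transcript of the voice prompt and the target transcript form one
    # text condition, followed by the prompt's audio tokens; the decoder then
    # continues the audio-token stream, reading the target text in the prompt
    # speaker's voice.
    text_condition = prompt_text_tokens + target_text_tokens
    return text_condition + prompt_audio_tokens

condition = build_tts_prompt([5, 8], [9, 2, 7], [101, 102, 103])
# The decoder generates new audio tokens autoregressively after `condition`.
```

Because the target transcript appears before the prompt audio in the condition, continuing the audio stream naturally produces speech for the new text in the prompt speaker's voice.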

The key point in both cases is that the token rearrangement allows natural speech synthesis that takes bidirectional context into account.

VoiceCraft vs. existing models

In this study, experiments compare VoiceCraft's performance with existing models on the speech editing and zero-shot TTS tasks.

Speech editing experiments

Here, in particular, validation uses "more realistic speech data" that includes a variety of accents, speaking styles, recording conditions, and background noise.

Specifically, for the speech editing task the authors use a newly created dataset, REALEDIT, which contains 310 real audio recordings collected from audiobooks, YouTube videos, and podcasts; the span to be edited ranges from one to 16 words.

The evaluation compares VoiceCraft with the best-performing existing model, FluentSpeech. WER (Word Error Rate) is used as the quantitative metric and MOS (Mean Opinion Score) as the qualitative evaluation.
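For readers unfamiliar with WER: it is the word-level edit distance between the reference transcript and the recognized transcript, divided by the number of reference words. A minimal implementation (my own sketch, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the dog sat"))  # one substitution out of 3 words
```

In the paper's setting, the hypothesis comes from running a speech recognizer on the synthesized audio, so a lower WER means the generated speech is more intelligible and faithful to the target text.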

The results are as follows

VoiceCraft outperforms FluentSpeech on all MOS metrics.

In addition, 48% of the time human listeners could not distinguish audio edited by VoiceCraft from the original, unedited recording.

Zero-shot speech synthesis (TTS) experiments

Here is a comparison of VoiceCraft with VALL-E, XTTS v2, YourTTS, and FluentSpeech.

WER and SIM (speaker similarity to the original voice) are used as quantitative metrics, and MOS as the qualitative evaluation.

The results are as follows

VoiceCraft outperforms the other models in SIM and in all MOS metrics.

VoiceCraft is a state-of-the-art model in the field of speech synthesis

This article presented VoiceCraft's work in achieving SOTA in both speech editing and zero-shot speech synthesis (TTS).

One limitation of this study is that long silences and scratchy noise can occur during generation.

In addition, the authors note that advances in speech synthesis technology have increased the risk of voice forgery and abuse, which calls for more research on watermarking and deepfake detection for models like VoiceCraft.

Personal Opinion

Since the VoiceCraft code and models are publicly available, we can expect further performance improvements and the development of innovative models based on VoiceCraft.

On the other hand, we feel the risk of abuse, such as fraud through voice forgery, cannot be ignored. After all, the voice generated by VoiceCraft is indistinguishable from that of the person whose voice was input.

There is therefore a concern that scams will increase in which a forged voice makes it sound as if a person is speaking, tricking their relatives into transferring money.

It will be even more necessary to deal with such risks in the future.

