[HiFi-GAN] GAN-based Vocoder Capable of Generating 22.05 kHz Audio on a Single GPU
3 main points
✔️ Proposed HiFi-GAN neural vocoder for high-quality and efficient speech synthesis
✔️ 22.05 kHz speech can be generated on a single V100 GPU
✔️ Demonstrated applicability to various end-to-end speech synthesis tasks
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
written by Jungil Kong, Jaehyeon Kim, Jaekyoung Bae
(Submitted on 23 Oct 2020)
Comments: NeurIPS 2020. Code available at this https URL
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Proposed HiFi-GAN to generate raw waveforms from intermediate representations
The paper proposes HiFi-GAN, a model that can efficiently generate high-quality speech waveforms from intermediate representations such as mel-spectrograms.
The key points of this study are as follows
- Challenge: GAN-based speech waveform generation methods fall short of autoregressive and flow-based models in quality.
- Solution: HiFi-GAN, a GAN-based vocoder for efficient and high-quality speech synthesis.
- Point: 22.05 kHz high-quality audio can now be generated on a single V100 GPU.
In other words, high-quality raw speech can now be efficiently generated from an intermediate representation of speech called a mel-spectrogram.
Background on Neural Vocoders and the Speech Synthesis Field
In recent years, speech synthesis technology has advanced rapidly with the development of deep learning.
Most neural speech synthesis models employ a two-stage pipeline
- Predict intermediate representations such as mel-spectrograms from text
- Synthesize raw waveforms from intermediate representations
This paper focuses on the design of the second-stage model, which "efficiently generates high-quality speech waveforms from the mel-spectrogram."
Incidentally, this second stage model is often referred to as a "neural vocoder," and has been the subject of various studies.
| Previous work | Approach | Problem |
| --- | --- | --- |
| WaveNet | High-quality speech synthesis using convolutional neural networks | Slow generation due to autoregressive sampling |
| Flow-based models (Parallel WaveNet, WaveGlow) | Parallel computation for faster synthesis | Large number of parameters |
| GAN-based models (MelGAN) | Compact model enables fast synthesis | Quality falls short of autoregressive and flow-based models |
Incidentally, a point common across the speech synthesis field is that modeling the periodic patterns of speech is important, since speech is composed of sinusoidal signals with various periods.
Proposed Method: Overview of HiFi-GAN
HiFi-GAN is a GAN-based generative model; its overall architecture is shown in the figure below.
Source: https://pytorch.org/hub/nvidia_deeplearningexamples_hifigan/
Specifically, it consists of one Generator and two Discriminators: Multi-Period Discriminator (MPD) and Multi-Scale Discriminator (MSD).
Generator
The Generator of HiFi-GAN is a fully convolutional neural network.
It takes a mel-spectrogram as input and repeatedly upsamples it with transposed convolutions until the length of the output sequence matches the temporal resolution of the raw waveform.
Each transposed convolution is followed by a multi-receptive field fusion (MRF) module.
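As a rough sketch of this upsampling structure, the following PyTorch snippet shows how transposed convolutions expand the mel frame rate up to the waveform sampling rate. This is a minimal illustration, not the authors' implementation: the class name and channel counts are ours, and the MRF modules are omitted (see the next subsection). The upsampling rates 8, 8, 2, 2 multiply to 256, matching the hop size of the paper's V1 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneratorSkeleton(nn.Module):
    """Upsampling skeleton only; the MRF module that follows each
    transposed convolution is omitted here."""
    def __init__(self, mel_channels=80, base_channels=512,
                 upsample_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(mel_channels, base_channels, 7, padding=3)
        ups, ch = [], base_channels
        for r in upsample_rates:
            # kernel 2r, stride r, padding r/2 expands the length exactly r-fold
            ups.append(nn.ConvTranspose1d(ch, ch // 2, 2 * r, stride=r,
                                          padding=r // 2))
            ch //= 2
        self.ups = nn.ModuleList(ups)
        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, mel):                  # mel: (batch, 80, frames)
        x = self.pre(mel)
        for up in self.ups:
            x = up(F.leaky_relu(x, 0.1))
            # ... an MRF module would be applied here ...
        return torch.tanh(self.post(x))      # (batch, 1, frames * 256)
```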
Multi-receptive field fusion (MRF)
The MRF module is designed to capture patterns of various lengths in parallel. Specifically, the MRF module returns the sum of the outputs of multiple Residual Blocks.
Each Residual Block has a different kernel size and dilation rate selected to form a variety of receptive field patterns.
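A minimal PyTorch sketch of the MRF idea follows. It is simplified relative to the paper: the actual residual blocks stack two convolutions per dilation, whereas this version uses one; the kernel sizes (3, 7, 11) and dilations (1, 3, 5) follow the V1 defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Dilated convolutions with residual connections; each MRF branch
    uses a different kernel size to get a different receptive field."""
    def __init__(self, channels, kernel_size, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=(kernel_size - 1) * d // 2)  # same-length output
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))  # residual connection
        return x

class MRF(nn.Module):
    """Runs residual blocks with different receptive fields in parallel
    and sums their outputs (averaged here for scale stability)."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.blocks = nn.ModuleList([ResBlock(channels, k)
                                     for k in kernel_sizes])

    def forward(self, x):
        return sum(b(x) for b in self.blocks) / len(self.blocks)
```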
Discriminator
HiFi-GAN uses the following two discriminators
- Multi-Period Discriminator (MPD)
- Multi-Scale Discriminator (MSD)
MPD consists of multiple sub-discriminators, each of which receives only equally spaced sampled signals from the input speech. This allows each sub-discriminator to focus on different periodic patterns in the input speech and to capture the various periodic structures inherent in speech.
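Sampling every p-th value is typically implemented by reshaping the 1D waveform into a 2D array of width p and applying 2D convolutions with kernels of shape (k, 1). The sketch below illustrates one sub-discriminator under that assumption; the channel counts and depth are illustrative only, while the periods used in the paper are 2, 3, 5, 7, and 11.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodSubDiscriminator(nn.Module):
    """Reshapes a 1D waveform into a (time/p, p) 2D grid so that (k, 1)
    convolutions see only every p-th sample along the time axis."""
    def __init__(self, period):
        super().__init__()
        self.period = period
        self.convs = nn.ModuleList([
            nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.Conv2d(32, 64, (5, 1), stride=(3, 1), padding=(2, 0)),
        ])
        self.out = nn.Conv2d(64, 1, (3, 1), padding=(1, 0))

    def forward(self, x):              # x: (batch, 1, T)
        b, c, t = x.shape
        if t % self.period:            # pad so T is divisible by the period
            pad = self.period - t % self.period
            x = F.pad(x, (0, pad), mode="reflect")
            t += pad
        x = x.view(b, c, t // self.period, self.period)
        feats = []
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
            feats.append(x)            # kept for the feature matching loss
        return self.out(x), feats
```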
MSD captures continuous patterns and long-term dependencies by evaluating the input speech at different scales. Specifically, the MSD consists of three sub-discriminators that take three versions of the audio as input: the raw waveform, the waveform downsampled by 1/2, and the waveform downsampled by 1/4.
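A corresponding sketch of the MSD, with the caveat that the sub-discriminator here is far shallower than the paper's (which uses deep grouped 1D convolutions); the average pooling implements the successive /2 downsampling described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleSubDiscriminator(nn.Module):
    """Simplified 1D-convolutional sub-discriminator."""
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(1, 64, 15, stride=1, padding=7),
            nn.Conv1d(64, 128, 41, stride=4, padding=20),
        ])
        self.out = nn.Conv1d(128, 1, 3, padding=1)

    def forward(self, x):
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
        return self.out(x)

class MultiScaleDiscriminator(nn.Module):
    """Evaluates raw audio, then /2- and /4-downsampled versions."""
    def __init__(self):
        super().__init__()
        self.subs = nn.ModuleList([ScaleSubDiscriminator() for _ in range(3)])
        self.pool = nn.AvgPool1d(4, stride=2, padding=2)

    def forward(self, x):              # x: (batch, 1, T)
        scores = []
        for sub in self.subs:
            scores.append(sub(x))
            x = self.pool(x)           # halve the sampling rate for the next scale
        return scores
```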
Thus, by combining MPD and MSD, HiFi-GAN can evaluate the generated speech from multiple perspectives, from fine-grained periodic features to global, continuous features.
Loss function
The HiFi-GAN study uses the following four loss functions
- GAN Loss (Adversarial Loss)
- Mel-Spectrogram Loss
- Feature Matching Loss
- Final Loss Function
GAN Loss (Adversarial Loss)
For the GAN loss (adversarial loss), MPD and MSD are treated together as the discriminator, and the LSGAN objective is used. The discriminator learns to classify real speech as 1 and generated speech as 0, while the generator learns to deceive the discriminator.
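Assuming `real_scores` and `fake_scores` are lists containing the outputs of every MPD and MSD sub-discriminator, the LSGAN objectives can be sketched as:

```python
import torch

def discriminator_loss(real_scores, fake_scores):
    """LSGAN objective: push scores toward 1 for real audio and toward 0
    for generated audio, summed over all sub-discriminators."""
    loss = 0.0
    for dr, df in zip(real_scores, fake_scores):
        loss = loss + torch.mean((dr - 1.0) ** 2) + torch.mean(df ** 2)
    return loss

def generator_adversarial_loss(fake_scores):
    """The generator tries to make every sub-discriminator output 1
    on generated audio."""
    return sum(torch.mean((df - 1.0) ** 2) for df in fake_scores)
```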
Mel-Spectrogram Loss
In addition to the GAN loss, a mel-spectrogram loss is introduced to improve the training efficiency of the generator and the quality of the generated speech.
Specifically, it is defined as the L1 distance between the mel-spectrogram of the waveform synthesized by the generator and the mel-spectrogram of the real waveform.
This loss helps the generator synthesize natural waveforms that correspond to the input conditions and stabilizes training from the early stages of adversarial learning.
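A sketch of this loss, assuming `mel_fn` is a waveform-to-mel transform such as `torchaudio.transforms.MelSpectrogram`:

```python
import torch.nn.functional as F

def mel_spectrogram_loss(mel_fn, real_wav, fake_wav):
    """L1 distance between the mel-spectrograms of the real and the
    generated waveform."""
    return F.l1_loss(mel_fn(fake_wav), mel_fn(real_wav))
```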
Feature Matching Loss
Feature matching loss measures the similarity between the discriminator's features for real and generated samples.
Specifically, it extracts the intermediate features of the discriminator and computes the L1 distance between the real and conditionally generated samples in each feature space.
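A sketch, assuming `real_feats` and `fake_feats` are lists of intermediate feature maps collected from the discriminator (as in the MPD sketch above):

```python
import torch.nn.functional as F

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator feature maps for real and
    generated samples, summed over layers; real features are detached
    so only the generator receives gradients."""
    return sum(F.l1_loss(ff, fr.detach())
               for fr, ff in zip(real_feats, fake_feats))
```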
Final Loss Function
The final generator loss of HiFi-GAN is expressed as a weighted sum of the three losses above, while the discriminator is trained with the adversarial loss alone.
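In the paper's notation, with the weights set to λ_fm = 2 and λ_mel = 45:

```latex
\mathcal{L}_G = \mathcal{L}_{Adv}(G; D)
              + \lambda_{fm}\,\mathcal{L}_{FM}(G; D)
              + \lambda_{mel}\,\mathcal{L}_{Mel}(G),
\qquad
\mathcal{L}_D = \mathcal{L}_{Adv}(D; G)
```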
Effectiveness of this method
Experimental Details
The following four experiments were conducted to evaluate HiFi-GAN's speech synthesis quality and synthesis speed.
- Subjective evaluation and speed comparison with other state-of-the-art models (WaveNet, WaveGlow, MelGAN)
- Investigation of the impact of each component of the HiFi-GAN (MPD, MRF, mel spectrogram loss) on the quality
- Investigation of generalization performance in speech synthesis
- End-to-end speech synthesis experiments
Subjective evaluation and speed comparison with other latest models
Fifty utterances were randomly selected from LJSpeech to measure subjective quality (Mean Opinion Score, MOS) and synthesis speed.
The results show that HiFi-GAN achieves a higher MOS than other models such as WaveNet, WaveGlow, and MelGAN. Moreover, HiFi-GAN V3 synthesizes speech 13.44 times faster than real time on CPU.
Study of the impact of each HiFi-GAN component on quality
To investigate the effect of each component of HiFi-GAN (MPD, MRF, mel-spectrogram loss) on voice quality, they removed each component from the V3 configuration and compared MOS. They also examined the effect of adding MPD to MelGAN.
The results show that MPD, MRF, and the mel-spectrogram loss all contribute to the performance improvement. In particular, removing MPD degrades quality considerably.
In addition, introducing MPD into the MelGAN model yields a significant improvement.
Investigation of generalization performance in speech synthesis
Audio from nine speakers was excluded from the training data, and MOS was measured by converting these unseen speakers' recordings to mel-spectrograms and resynthesizing them with HiFi-GAN.
The results show that all three HiFi-GAN variants outperform the autoregressive and flow-based models, demonstrating strong generalization to unseen speakers.
End-to-end speech synthesis experiments
HiFi-GAN is combined with the Text-to-Spectrogram model "Tacotron2" to evaluate the performance of end-to-end speech synthesis.
Specifically, the mel-spectrogram generated by Tacotron2 is fed into HiFi-GAN and MOS is measured. The effect of fine-tuning is also verified.
The results show that the speech synthesis model combining Tacotron2 and HiFi-GAN outperforms WaveGlow. With fine-tuning, V1 achieved a MOS of 4.18, nearly matching the quality of human speech.
Summary
In this article, we introduced our research on HiFi-GAN, a GAN model that enables efficient and high-quality speech synthesis.
The limitations of this study include the following three points
- Applicability to more diverse speakers and languages is unknown
- The emotional and prosodic expressiveness of the voice has not been adequately tested.
- Speech synthesis performance in a limited computing resource environment has not been evaluated
As future research, they plan to develop extended versions of HiFi-GAN that address the above issues, and to pursue smaller, more efficient models that can be trained on small datasets.
Personal Opinion
I thought it was a great idea to focus on the periodic characteristics of speech and propose the MPD. This idea could also be applied to other time-series generation models, not just audio.
Incidentally, HiFi-GAN is often used as the vocoder at the final stage of current music generation pipelines built on diffusion models and the like.
For example
Music data → converted to a mel-spectrogram → compressed by a VAE → generated by a diffusion model → decoded back to a mel-spectrogram by the VAE → raw music data generated through HiFi-GAN
I also had the impression that the model does not take much time to run.