Text to Speech Methods That Run on Limited Computational Resources
3 main points
✔️ Proposed a way to design Text to Speech models for resource-limited devices such as mobile phones
✔️ Reduced model size and inference latency compared with manually designed lightweight models by using NAS
✔️ Successfully designed lighter and faster models automatically without compromising voice quality
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search
written by Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong Chen, Tie-Yan Liu
(Submitted on 8 Feb 2021)
Comments: ICASSP 2021
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Text to Speech, which synthesizes natural speech from text, has been deployed in many services such as voice navigation and newscasting. Neural network-based Text to Speech models have significantly improved speech quality over traditional approaches, but most of them are autoregressive, and their large inference latency makes them difficult to use on end devices such as mobile phones. Non-autoregressive models have significantly faster inference than autoregressive models, but their model size, computational cost, and power consumption are still large. Therefore, non-autoregressive models are also difficult to deploy on end devices such as mobile phones, and this is the problem the paper targets.
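To make the latency gap concrete, the following toy sketch (our illustration, not code from the paper) contrasts the two decoding styles: an autoregressive decoder must run one sequential step per output frame, while a non-autoregressive decoder emits all frames in a single parallel pass.

```python
import torch
import torch.nn as nn

hidden, frames = 256, 500              # ~ a few seconds of mel-spectrogram frames
ar_step = nn.GRUCell(hidden, hidden)   # stand-in for one autoregressive decoder step
parallel = nn.Linear(hidden, hidden)   # stand-in for a non-autoregressive decoder

# Autoregressive: `frames` sequential steps, each depending on the previous one,
# so the work cannot be parallelized across time.
x = torch.zeros(1, hidden)
h = torch.zeros(1, hidden)
for _ in range(frames):
    h = ar_step(x, h)

# Non-autoregressive: one batched pass computes all frames at once.
xs = torch.zeros(frames, hidden)
ys = parallel(xs)
```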
A possible approach to running these models on end devices such as mobile phones is to make the neural networks lightweight. Many techniques exist for designing lightweight and efficient neural networks, such as quantization and pruning, and they have been very successful at compressing large models into smaller ones at low computational cost. However, most of these methods are designed for the convolutional neural networks used in computer vision and rely on domain-specific knowledge and properties, so they do not transfer directly to the recurrent and attention networks used in Text to Speech models. For example, manually reducing the depth or width of such a network leads to severe performance degradation.
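For reference, here is a minimal sketch of two of these generic compression techniques using standard PyTorch utilities; it is purely illustrative and not the method proposed in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for some larger network.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))

# Pruning: zero out the 50% of weights with the smallest L1 magnitude.
prune.l1_unstructured(model[0], name="weight", amount=0.5)
prune.remove(model[0], "weight")  # bake the mask into the weight tensor

# Dynamic quantization: store Linear weights in int8 to shrink the model.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```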
Given these problems, the authors turned to Neural Architecture Search (NAS) to design optimal architectures automatically. Applying NAS to a new domain or task requires designing three things: the search space, the search algorithm, and the evaluation metric. In this paper, the authors designed all three for Text to Speech, aiming at models that run on end devices with limited computational resources, such as mobile phones.
Proposed Method
Analysis of the Current Model
FastSpeech is a leading model in the field of Text to Speech, so we adopt it as the backbone in this paper. First, we analyze the structure of FastSpeech to see which parts account for the most parameters.
The table below shows the number of parameters for each FastSpeech module.
From the table, it can be seen that the encoder and decoder occupy most of the parameters in FastSpeech. Therefore, the authors mainly aim to reduce the size of the encoder and decoder, and they search for the encoder and decoder architectures with NAS. The predictors, in contrast, do not account for many parameters, though they still occupy a certain fraction of the inference time. So, rather than searching their architecture, the authors manually redesign these variance predictors with more lightweight operations.
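This kind of per-module breakdown is easy to reproduce for any PyTorch model; the modules below are simplified stand-ins for FastSpeech components (not the authors' implementation), so only the counting pattern matters.

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Simplified stand-ins for the FastSpeech modules; sizes are illustrative.
model = nn.ModuleDict({
    "encoder": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=2), num_layers=4),
    "decoder": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=2), num_layers=4),
    "duration_predictor": nn.Sequential(
        nn.Conv1d(256, 256, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv1d(256, 1, kernel_size=1)),
})

for name, module in model.items():
    print(f"{name}: {count_parameters(module):,} parameters")
```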
Design of the Search Space
FastSpeech has four transformer blocks in both the encoder and the decoder, each containing a multi-head self-attention mechanism and a feed-forward network. We adopt this encoder-decoder framework as the backbone of the network. We also consider the multi-head self-attention mechanism and the feed-forward network in each transformer block separately and treat them as separate candidate operations.
Using the backbone network as described above, we set up the search space as follows.
- LSTM is not considered because of its slow inference speed
- Multi-head self-attention (MHSA) with the number of heads chosen from {2, 4, 8}
- Depthwise separable convolution (SepConv) with the kernel size chosen from {1, 5, 9, 13, 17, 21, 25}
Therefore, the total number of candidate operations is 11: one feed-forward network, three multi-head self-attention variants, and seven SepConv variants.
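The sketch below enumerates this 11-operation candidate set; the SepConv module is a standard depthwise separable 1D convolution, and hyperparameters such as the channel width (256) are our assumptions, not values from the paper.

```python
import torch.nn as nn

class SepConv(nn.Module):
    """Depthwise separable 1D convolution: per-channel conv + 1x1 mixing conv."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

def candidate_ops(channels: int = 256):
    ops = {"ffn": nn.Linear(channels, channels)}      # 1 feed-forward network
    for heads in (2, 4, 8):                           # 3 MHSA variants
        ops[f"mhsa_{heads}"] = nn.MultiheadAttention(channels, heads)
    for k in (1, 5, 9, 13, 17, 21, 25):               # 7 SepConv variants
        ops[f"sepconv_{k}"] = SepConv(channels, k)
    return ops

assert len(candidate_ops()) == 11
```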
Search Algorithm
There are many search algorithms for neural architectures, but the authors adopted a GBDT-based method (GBDT-NAS). GBDT-NAS speeds up architecture evaluation by training a gradient boosting decision tree (GBDT) to predict the accuracy of candidate architectures. The architectures ranked highest by the GBDT are then actually trained on the training set and validated on the dev set to find the best-performing architecture.
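A minimal sketch of this predict-then-verify loop is shown below, assuming each architecture is encoded as a fixed-length vector of operation indices and using LightGBM as the GBDT; the encoding, pool sizes, and hyperparameters are our assumptions, not the authors' settings.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_layers, n_ops = 8, 11  # e.g., one of 11 candidate ops per layer

def sample_arch():
    return rng.integers(0, n_ops, size=n_layers)

# 1) Train the GBDT predictor on a small set of fully evaluated architectures.
evaluated = np.stack([sample_arch() for _ in range(100)])
scores = rng.random(100)  # stand-in for measured dev-set performance
predictor = lgb.LGBMRegressor(n_estimators=100).fit(evaluated, scores)

# 2) Cheaply score a large pool of candidate architectures with the predictor.
pool = np.stack([sample_arch() for _ in range(10000)])
predicted = predictor.predict(pool)

# 3) Fully train and validate only the top-ranked candidates.
top_candidates = pool[np.argsort(predicted)[-10:]]
print(top_candidates)
```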
Experiments
Dataset
We used the LJSpeech dataset, which contains 13,100 pairs of text and speech. We divided this dataset into three parts: 12,900 samples for the training set, 100 samples for the dev set, and 100 samples for the test set.
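The paper's summary here does not spell out how the split was made, but assuming the standard LJSpeech-1.1 layout (a pipe-delimited metadata.csv), a simple sequential split would look like this:

```python
import csv

# LJSpeech metadata.csv rows: id | raw text | normalized text
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE))

assert len(rows) == 13100
train, dev, test = rows[:12900], rows[12900:13000], rows[13000:]
print(len(train), len(dev), len(test))  # 12900 100 100
```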
Results
Audio Quality
To evaluate the quality of the synthesized speech, we performed a comparative mean opinion score (CMOS) evaluation on the test set. The results are shown in the table below.
We compare the proposed method (LightSpeech), standard FastSpeech, and a manually designed lightweight FastSpeech. The table shows that, with roughly the same number of parameters as the manually designed lightweight model, LightSpeech achieves better speech quality (CMOS), comparable to that of standard FastSpeech.
Speedup and Computational Complexity
The table below shows the measured speedup and computational cost.
The table shows that the architecture discovered by LightSpeech achieves 15 times the compression ratio, 16 times fewer MACs, and 6.5 times faster inference on the CPU compared with FastSpeech 2. This makes it far more realistic to deploy in resource-constrained scenarios.
Summary
In this paper, the authors proposed LightSpeech, which leverages NAS to discover lightweight and fast Text to Speech models. Experiments show that the discovered architecture achieves 15 times the compression ratio, 16 times fewer MACs, and 6.5 times faster inference on the CPU, with sound quality comparable to FastSpeech 2.