[Technology disclosure] Alibaba Speech AI: KAN-TTS speech synthesis technology - Alibaba Cloud Developer Community

Source: Alibaba Speech AI official account

Keywords: speech synthesis, Knowledge-aware Neural TTS, KAN-TTS, online Chinese real-time speech synthesis

In recent years, end-to-end (End2end) technology has developed rapidly and been studied extensively in many fields. In speech synthesis, researchers have likewise proposed speech synthesis systems based on End2end technology [1]. An End2end speech synthesis system requires only text and the corresponding wav data; with it, anyone can leverage powerful deep learning techniques to obtain good synthesized speech.

By comparison, computer-based speech synthesis technology, which started around 1960, has accumulated deep domain knowledge in every module of the system over the past 60 or so years, covering speech signal processing, text analysis, and modeling. This domain knowledge is built on in-depth research into the human vocalization mechanism, auditory perception, linguistics, and related areas, and its combined action constructs the traditional speech synthesis system [3][5].

The preceding figure compares the block diagrams of a traditional speech synthesis system and an End2end speech synthesis system. In the traditional system, the input text passes through multiple modules that draw on multiple kinds of domain knowledge to produce rich linguistic context information. The back-end model then predicts acoustic features from these results, and a vocoder finally converts them into synthesized speech. In the End2end system, the input text passes only through a text normalization module to form a complete character sequence, which is fed directly to the back-end model for modeling and prediction. Most of the domain knowledge in the traditional system is ignored by the End2end system.
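The two pipeline shapes described above can be sketched as function compositions. Every function below is an illustrative stub invented for this sketch, not Alibaba's actual implementation; the point is only the difference in how many modules sit between text and waveform.

```python
# Toy sketch of the two pipeline shapes. All functions are stubs.

def text_normalize(text):
    """Shared front-end step: clean and normalize the input text (stubbed)."""
    return text.strip()

# --- Traditional pipeline: multiple knowledge-driven modules ---
def linguistic_analysis(chars):
    """Word segmentation, polyphone disambiguation, prosody prediction (stubbed)."""
    return {"chars": chars, "phones": list(chars), "prosody": "L1"}

def acoustic_model_traditional(context):
    """Predict acoustic features from rich linguistic context (stubbed)."""
    return ["frame"] * len(context["phones"])

def vocoder(features):
    """Convert acoustic features to a waveform (stubbed)."""
    return b"\x00" * len(features)

def traditional_tts(text):
    chars = text_normalize(text)
    context = linguistic_analysis(chars)      # rich domain knowledge
    features = acoustic_model_traditional(context)
    return vocoder(features)

# --- End2end pipeline: characters go straight to the back-end model ---
def acoustic_model_end2end(chars):
    """Predict features directly from the character sequence (stubbed)."""
    return ["frame"] * len(chars)

def end2end_tts(text):
    chars = text_normalize(text)              # the only front-end module kept
    features = acoustic_model_end2end(chars)
    return vocoder(features)
```

Both stubs end in the same vocoder call; what differs is that the End2end path discards every intermediate analysis module.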

Driven by the good intentions of researchers, End2end speech synthesis uses only a very small part of this accumulated domain knowledge, in the hope of abandoning it entirely and obtaining a sufficiently good speech synthesis system through powerful models and massive data. In practice, however, speech synthesized by End2end systems suffers from a series of problems, which hinders productization of the latest speech synthesis technology.

The past life of KAN-TTS: which is better, traditional speech synthesis or End2end technology?

In recent years, End2end technology [1] (that is, end-to-end technology) has developed rapidly in the field of speech synthesis. It abandons the multi-model, multi-module front-end/back-end framework of traditional speech synthesis and adopts a unified model that predicts the output speech directly from text-level input. Given massive amounts of data, researchers can use this technology to build an End2end speech synthesis system almost "out of the box" and obtain good synthesis results.

On the other hand, traditional speech synthesis evolved from HMM-based parametric/concatenative systems [3] to later systems based on deep neural networks [5], and it still occupies an absolutely dominant position in mainstream speech synthesis products today. Although this technology sounds noticeably mechanical and has poorer audio quality, its synthesized speech is more stable than that of End2end systems, with clear advantages in polyphone pronunciation disambiguation and pause control in particular.

The present life of KAN-TTS: a hybrid system is born

Different systems adopt different domain knowledge, resulting in different sound quality, naturalness, and stability. In terms of final results, we hope to combine the clearly better sound quality and naturalness of the End2end system with the stability of the traditional speech synthesis system.

Therefore, by combining the traditional and End2end speech synthesis systems, we built a hybrid system: Knowledge-aware Neural TTS (KAN-TTS). This technology deeply integrates domain knowledge with End2end speech synthesis techniques; building on the traditional system, it makes full use of various kinds of domain knowledge within an End2end-style framework, yielding an online Chinese real-time speech synthesis system with high performance and stability.

Compared with the traditional speech synthesis system and the End2end system, KAN-TTS differs mainly in the following respects:

1. Linguistic domain knowledge: in KAN-TTS, we use large amounts of text-related data to build highly stable domain-knowledge analysis modules. For example, in the polyphone disambiguation module, we train the disambiguation model on millions of text/pronunciation pairs containing polyphonic characters to obtain more accurate pronunciations. If training were based entirely on voice data, as in an End2end system, covering polyphonic characters alone would require thousands of hours of speech. For a speech synthesis field where typical datasets range from a few hours to dozens of hours, that is unacceptable.

2. Acoustic model: in KAN-TTS, considering the rapid development of deep learning and the synthesis quality of End2end models, we also adopt a seq2seq model as the acoustic model; combined with massive data, this further improves the quality and stability of the overall model.
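At the core of a seq2seq acoustic model is an attention step that aligns each output frame with the input sequence. The NumPy sketch below shows one decoder step of plain dot-product attention; the actual attention variant used in KAN-TTS is not disclosed in this article, so this is a generic illustration.

```python
import numpy as np

def attention_step(query, encoder_states):
    """One decoder step of dot-product attention: score each encoder
    state against the query, normalize with a softmax over time, and
    return the context vector as the weighted sum of encoder states."""
    d = query.shape[-1]
    scores = encoder_states @ query / np.sqrt(d)   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over timesteps
    context = weights @ encoder_states             # (d,)
    return context, weights

rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 8))    # 6 encoder timesteps, hidden dim 8
ctx, w = attention_step(enc[2], enc) # weights form a distribution over steps
```

The decoder runs this step once per output frame; stacking the resulting weight vectors gives the familiar alignment matrix between text and spectrogram frames.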

3. Acoustic features and vocoder: in the KAN-TTS system, we adopt an FFT spectrum similar to that of the End2end system as the acoustic feature, which loses less information, together with a more powerful vocoder to restore the waveform, giving a clear advantage in sound quality.
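An FFT spectrum of the kind mentioned above is just a short-time Fourier transform of the waveform. The sketch below computes a magnitude spectrogram with NumPy; the frame sizes and sample rate are common illustrative choices, not values stated in the article.

```python
import numpy as np

def stft_magnitude(wav, n_fft=1024, hop=256):
    """Magnitude FFT spectrogram via a Hann-windowed short-time FFT.
    Returns an array of shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wav) - n_fft + 1, hop):
        frame = wav[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A 440 Hz tone sampled at 16 kHz: its energy should concentrate
# in the FFT bin nearest 440 Hz.
sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
peak_hz = spec[0].argmax() * sr / 1024   # frequency of the strongest bin
```

Because this representation keeps the full spectral envelope rather than a compressed parametric encoding, the vocoder has more information to reconstruct the waveform from, which is the sound-quality advantage the text refers to.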

The figure above shows the basic block diagram of KAN-TTS. After a series of experiments and comparisons, we finally adopted the domain knowledge shown in the figure as the input of the back-end model.

KAN-TTS: More domain knowledge

In addition to the deep integration of the traditional and End2end systems on which KAN-TTS is built, we also integrate domain knowledge from many other areas, most importantly Chinese-specific linguistic knowledge, acoustic space construction based on massive voice data, and transfer learning techniques for specific speakers and specific styles.

1. Model building based on massive data: to make the best use of broader data, we use hundreds of hours of data from hundreds of speakers to build a multi-speaker speech synthesis system. By contrast, in traditional speech synthesis systems the data volume for a single speaker usually ranges from a few hours to dozens of hours. A system built from the data of a large number of speakers provides a more stable synthesis effect and lays the foundation for a highly stable speech synthesis product.

2. Transfer methods for specific data: although the speech synthesis system built from a large number of speakers with diverse pronunciations is highly stable, for a particular speaker or a particular style there is still a gap between its output and a real recording. Therefore, drawing on research into training-data proportions in other fields, we further apply transfer learning for specific speakers and styles on top of the multi-speaker model. Experiments show that, with this additional transfer learning, the synthesis quality improves further and approaches that of a real recording.
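The adaptation scheme above (pretrain on pooled multi-speaker data, then continue training on a little target-speaker data) can be sketched with a toy linear model. All data shapes, learning rates, and the linear "acoustic model" are invented for illustration; the real system fine-tunes a neural acoustic model.

```python
import numpy as np

rng = np.random.default_rng(42)

def train(W, X, Y, lr, steps):
    """Plain gradient descent on mean-squared error for a linear map."""
    for _ in range(steps):
        grad = X.T @ (X @ W - Y) / len(X)
        W = W - lr * grad
    return W

# "Multi-speaker" data: many samples from a shared mapping.
X_multi = rng.standard_normal((500, 4))
W_shared = rng.standard_normal((4, 2))
Y_multi = X_multi @ W_shared

# Target speaker: a slightly shifted mapping, but only a few samples.
X_tgt = rng.standard_normal((20, 4))
Y_tgt = X_tgt @ (W_shared + 0.3)

W = train(np.zeros((4, 2)), X_multi, Y_multi, lr=0.1, steps=200)  # pretrain
err_before = np.mean((X_tgt @ W - Y_tgt) ** 2)
W = train(W, X_tgt, Y_tgt, lr=0.05, steps=200)                    # fine-tune
err_after = np.mean((X_tgt @ W - Y_tgt) ** 2)
```

The fine-tuning stage starts from the multi-speaker solution instead of from scratch, which is what lets a small target dataset close most of the remaining gap.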

KAN-TTS: engineering optimization for non-heterogeneous computing

With the development of deep learning, models' capabilities have become more and more powerful, and the accompanying computing requirements have grown accordingly. In recent years, many companies have adopted heterogeneous computing for model inference, for example high-performance or dedicated GPUs, or even dedicated chip technologies such as FPGAs and ASICs, to accelerate inference and meet actual service requirements.

We carefully compared different inference solutions. Taking into account the requirements of our final usage scenarios, the need for rapid scaling, and the deployment capabilities of different machines, we ultimately chose to run inference in a non-heterogeneous form, that is, without any heterogeneous computing modules such as GPUs, FPGAs, or ASICs.

Based on the characteristics of KAN-TTS and the requirements of speech synthesis services, we made several targeted optimizations, including algorithm optimization at the model level and framework and instruction-set optimization at the engineering level. The results after this series of optimizations are shown in the following figure:

RTF, the Real Time Factor borrowed from speech recognition, measures the computing time required to synthesize 1 s of speech. QPS is the number of service requests that can be supported simultaneously.
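RTF as defined above is simply wall-clock synthesis time divided by the duration of the audio produced. The sketch below measures it around a placeholder synthesizer; the `synthesize` callable and the 10 ms stand-in workload are illustrative, not the actual KAN-TTS engine.

```python
import time

def real_time_factor(synthesize, text, audio_seconds):
    """RTF = wall-clock time to synthesize / duration of audio produced.
    RTF < 1 means the system runs faster than real time."""
    start = time.perf_counter()
    synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Toy stand-in: a "synthesizer" that works for about 10 ms
# to produce 1 s of audio, i.e. an RTF of roughly 0.01.
rtf = real_time_factor(lambda t: time.sleep(0.01), "你好", audio_seconds=1.0)
```

For a service, QPS then follows from RTF and the number of worker threads: the lower the RTF, the more concurrent requests each CPU core can sustain.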

Several advantages of KAN-TTS

1. KAN-TTS vs. the End2end system: from our actual practice, we find that the biggest problems of the End2end system fall into two categories: missing words and mispronounced polyphonic characters. Because the input of the End2end system is Chinese characters, and the number of distinct Chinese characters is large, the training data has poor coverage and an uneven distribution, resulting in a large number of sentences with missing words. In addition, for the same reason, the volume of speech data is always far smaller than that of text data; with current speech data, polyphone coverage in the End2end system is also poor, so there are many pronunciation errors for polyphonic characters.

The above figure compares the End2end system and KAN-TTS on the two problems of missing words and polyphone pronunciation errors; the polyphone errors are illustrated by the case of "Wei".

As shown in the preceding figure, KAN-TTS significantly surpasses the End2end system on both issues. The main reason is that KAN-TTS incorporates the traditional speech synthesis system and makes full use of domain knowledge in many respects; in terms of the stability of synthesized speech, it therefore achieves results similar to those of traditional systems.

2. KAN-TTS vs. the traditional speech synthesis system

The above figure shows, for one speaker's data, how the quality of KAN-TTS changes under different improvements. MOS (Mean Opinion Score) is the subjective test scoring standard in speech synthesis; the full score is 5, and higher is better. To measure the actual effect of the technology, we use a MOS% comparison: taking the recording's score as the denominator, we divide each system's MOS score by the recording's score, measuring how close each system's subjective score is to the recording. The closer to 100% the better, and the recording itself always scores 100%. From the figure, the traditional concatenative and parametric systems [5] reach about 85%-90%; the difference between them is related to pronunciation style and data volume. With KAN-TTS, even based on single-speaker data alone, closeness exceeds 95%; after multi-speaker training and transfer learning are adopted, naturalness closeness exceeds 97%.
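The MOS% metric described above is a one-line ratio. The MOS values below are illustrative only; the actual scores are those shown in the figure.

```python
def mos_percent(system_mos, recording_mos):
    """MOS% = system MOS / recording MOS, as a percentage.
    The recording itself is always 100% by construction."""
    return 100.0 * system_mos / recording_mos

recording = 4.5                                  # hypothetical recording MOS
assert mos_percent(recording, recording) == 100.0
traditional = mos_percent(4.0, recording)        # ~88.9%, in the 85-90% band
```

Normalizing by the recording score lets systems tested with different listener panels or speakers be compared on one axis.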

3. Low-cost boutique customization (voice cloning)

Customizing TTS voices for specific speakers or styles is a continuing practical business requirement in the TTS field. For example, customers customize TTS with a distinctive voice style for their own exclusive IP, expecting a user experience different from competing products. In traditional TTS customization, due to the limits of the technical framework, a full boutique customization requires about 20,000 sentences (20 hours) of data. Given the high quality standards of TTS data recording, 20,000 sentences usually corresponds to a recording cycle of more than half a year, during which the speaker must continuously deliver high-quality, reliable recordings; this brings certain risks to a customization project (the speaker's condition, such as a cold or fever, directly affects vocal performance).

With KAN-TTS, because we adopt a new generation of speech synthesis technology, with more powerful models and models trained on data from hundreds of speakers, we can build better TTS voices from a small amount of data.

The preceding figure shows the customization results for different data volumes under the KAN-TTS framework. Even when the data volume is under 2 hours (2,000 sentences), KAN-TTS-based customization achieves a good result that differs little from 10 hours of data, clearly above 95% and close to a real recording.

Compared with traditional customization, KAN-TTS-based customization reduces the required data to one tenth of the previous amount, while the customization cycle shortens from more than half a year to about one month, and the result improves significantly from the traditional TTS quality to the high quality of KAN-TTS. We have now launched a KAN-TTS-based customization service; see the following page for details and the customization process.

https://ai.aliyun.com/nls/customtts? spm=5176.12061040.1228967.1.11b04779xrAzng


Knowledge-aware Neural TTS (KAN-TTS) combines our latest speech technology, massive text and acoustic data, and large-scale computing capability to improve speech synthesis.

By deeply integrating traditional speech synthesis technology with the End2end system and combining various kinds of domain knowledge, we provide an online real-time speech synthesis service with high performance and stability. Considering customers' actual needs, we also adopted a fully CPU-based service deployment method and introduced low-cost boutique customization, supporting customer-side deployment and voice customization according to actual needs. You can now experience the synthesis effect of Knowledge-aware Neural TTS (KAN-TTS) on the Alibaba Cloud official website at https://ai.aliyun.com/nls/tts.

In future work, we will further improve the speech synthesis technology based on KAN-TTS technology to provide better speech synthesis services.

[1] Yuxuan Wang, RJ Skerry-Ryan, et al., "Tacotron: Towards End-to-End Speech Synthesis", Interspeech 2017.

[2] Jonathan Shen, Ruoming Pang, et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", ICASSP 2018.

[3] K Tokuda, T Yoshimura, T Masuko, T Kobayashi, T Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis", ICASSP 2000.

[4] ZH Ling, RH Wang, "HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion", ICASSP 2007.

[5] Heiga Zen, Andrew Senior, Mike Schuster, "Statistical Parametric Speech Synthesis Using Deep Neural Networks", ICASSP 2013.

[6] Changhao Shan, Lei Xie, Kaisheng Yao, "A Bi-directional LSTM Approach for Polyphone Disambiguation in Mandarin Chinese", ISCSLP 2016.

[7] Chuang Ding, Lei Xie, Jie Yan, Weini Zhang, Yang Liu, "Automatic Prosody Prediction for Chinese Speech Synthesis using BLSTM-RNN and Embedding Features", ASRU 2015.

[8] Ming Lei, Yijian Wu, Frank K. Soong, Zhen-Hua Ling, Lirong Dai, "A Hierarchical F0 Modeling Method for HMM-Based Speech Synthesis", Interspeech 2010.

[9] Zhen-Hua Ling, Zhi-Guo Wang, Li-Rong Dai, "Statistical Modeling of Syllable-Level F0 Features for HMM-based Unit Selection Speech Synthesis", ISCSLP 2010.

[10] T. Drugman and T. Dutoit, "The deterministic plus stochastic model of the residual signal and its applications", IEEE Trans. Audio, Speech and Language Processing, vol. 20, no. 3, pp. 968-981, March 2012.

[11] T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, and P. Alku, "HMM-based speech synthesis utilizing glottal inverse filtering", IEEE Trans. on Audio, Speech, and Lang. Proc., vol. 19, no. 1, pp. 153-165, Jan. 2011.

[12] Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan, "Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis", https://arxiv.org/abs/1808.10128
