Interspeech Paper Interpretation | Fast Learning for Non-Parallel Many-to-Many Voice Conversion with Residual Star Generative Adversarial Networks - Alibaba Cloud Developer Community

INTERSPEECH 2019, the 20th annual conference of the International Speech Communication Association (ISCA), will be held in Graz, Austria, from September 15 to 19, 2019. Interspeech is the world's largest and most comprehensive top-level conference in the field of speech. Nearly 2,000 participants from industry and academia will take part in keynote speeches, tutorials, paper presentations, and exhibitions. Alibaba has 8 papers accepted this year. This article interprets one of them, "Fast Learning for Non-Parallel Many-to-Many Voice Conversion with Residual Star Generative Adversarial Networks", by Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, and Bin Ma.

Article interpretation

The main goal of voice conversion (VC) is to convert the voice of a source speaker into the voice of a target speaker while keeping the linguistic content of the original sample unchanged. Voice conversion systems have many applications, such as speech enhancement, speaking assistance, and personalized text-to-speech (TTS) synthesis. Current systems with good performance, such as methods based on Gaussian mixture models (GMMs) and methods based on neural networks (NNs), generally rely on parallel training data. This restricts them to scenarios where parallel data can be collected and to one-to-one conversion within the same language. When parallel data is difficult to collect, as in cross-lingual or many-to-many voice conversion, the need for parallel training data greatly limits the usefulness of these methods in practice.

Recently, StarGAN, a model based on generative adversarial networks (GANs), has been applied to voice conversion. It can learn many-to-many domain mappings and can be trained without parallel data. Using only acoustic features and domain (speaker) information as input, it has achieved relatively successful many-to-many conversion between different speakers. Building on this StarGAN-VC method, the paper proposes a fast learning framework that adds a residual training mechanism. The method is called Res-StarGAN-VC. Its main idea follows from the fact that the source and target speech features share the same linguistic content during conversion: by adding shortcut connections from the generator's input to its output, the generator only has to learn a residual mapping, that is, the difference between the source and target features.

Experiments show that these shortcut connections accelerate network learning without adding parameters or computational complexity, and that they help generate high-quality fake samples at the very beginning of adversarial training, which improves training quality. Experimental results and subjective evaluations show that, on both mono-lingual and cross-lingual many-to-many voice conversion tasks, the proposed method offers (1) faster convergence in adversarial training and (2) clearer pronunciation and better speaker similarity than the StarGAN-VC baseline.
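To make the shortcut idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the tiny two-layer network, the feature dimension, the number of speakers, and all function names are assumptions chosen for illustration (the paper's generator is a different, larger network). The sketch only shows that adding the input to the generator's output introduces no new parameters, and that a freshly initialized residual generator already produces outputs close to the source features, which carry the shared linguistic content.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
FEAT_DIM = 36      # acoustic feature dimension per frame
NUM_SPEAKERS = 4   # number of speaker domains
HIDDEN = 64        # hidden units of the toy generator


def init_params(scale=0.01):
    """Small random weights, as at the start of training."""
    return {
        "W1": rng.normal(0.0, scale, (FEAT_DIM + NUM_SPEAKERS, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0.0, scale, (HIDDEN, FEAT_DIM)),
        "b2": np.zeros(FEAT_DIM),
    }


def plain_generator(params, x, target_id):
    """Toy generator: maps (features, one-hot target speaker) to features."""
    code = np.zeros((x.shape[0], NUM_SPEAKERS))
    code[:, target_id] = 1.0
    h = np.tanh(np.concatenate([x, code], axis=1) @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]


def residual_generator(params, x, target_id):
    """Same network plus an identity shortcut from input to output.

    The network now only has to model the difference between source
    and target features; the shortcut itself has no parameters.
    """
    return x + plain_generator(params, x, target_id)


def num_params(params):
    return sum(p.size for p in params.values())


params = init_params()
frames = rng.normal(0.0, 1.0, (100, FEAT_DIM))  # stand-in for source features

plain_out = plain_generator(params, frames, target_id=2)
res_out = residual_generator(params, frames, target_id=2)

# Distance of each output from the source features at initialization.
plain_err = float(np.mean((plain_out - frames) ** 2))
res_err = float(np.mean((res_out - frames) ** 2))

print("parameters (identical for both variants):", num_params(params))
print("MSE to source, plain generator:   ", plain_err)
print("MSE to source, residual generator:", res_err)
```

Both variants share the same `params` dictionary, so the parameter count is identical by construction. At initialization the residual output stays near the input frames, whereas the plain generator's output is close to zero and unrelated to the input; this is one plausible reading of why the shortcut helps produce reasonable fake samples early in adversarial training.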

Abstract

This paper proposes a fast learning framework for non-parallel many-to-many voice conversion with residual Star Generative Adversarial Networks (StarGAN). In addition to the state-of-the-art StarGAN-VC approach that learns an unreferenced mapping between a group of speakers' acoustic features for non-parallel many-to-many voice conversion, our method, which we call Res-StarGAN-VC, presents an enhancement by incorporating a residual mapping. The idea is to leverage the shared linguistic content between source and target features during conversion. The residual mapping is realized by using identity shortcut connections from the input to the output of the generator in Res-StarGAN-VC. Such shortcut connections accelerate the learning process of the network with no increase of parameters and computational complexity. They also help generate high-quality fake samples at the very beginning of the adversarial training. Experiments and subjective evaluations show that the proposed method offers (1) significantly faster convergence in adversarial training and (2) clearer pronunciations and better speaker similarity of converted speech, compared to the StarGAN-VC baseline on both mono-lingual and cross-lingual many-to-many voice conversion tasks.

Index Terms: voice conversion (VC), non-parallel VC, many-to-many VC, generative adversarial networks (GANs), StarGAN-VC, Res-StarGAN-VC
