Has speech synthesis reached a jump point? A summary of the latest research on deep neural networks transforming TTS

From: Machine Heart, 2021-11-28

Introduction: In recent years, deep neural networks have transformed the ability of computers to understand natural speech, for example in speech recognition and machine translation. Generating speech by computer (speech synthesis, or text-to-speech, TTS), however, is still largely based on concatenative TTS, and speech synthesized by this traditional method falls well short in naturalness and listening comfort. Can deep neural networks advance speech synthesis as they advanced speech recognition? This has become one of the research topics in artificial intelligence.

In 2016, DeepMind proposed WaveNet, which attracted great attention in the industry. WaveNet generates raw audio waveforms directly and achieves excellent results in text-to-speech and general audio generation. In practice, however, it is computationally expensive and cannot be deployed directly in products.

There is therefore still much work to be done in speech synthesis. Speech synthesis has two main goals: intelligibility and naturalness. Intelligibility refers to the clarity of the synthesized audio, in particular the extent to which a listener can extract the original message. Naturalness describes what intelligibility does not capture, such as overall ease of listening, consistency of style, and subtle regional or language-specific nuances.

Last year the industry focused on speech recognition; this year speech synthesis has become one of the important research areas of the deep learning community. Not long into 2017, Machine Heart has already covered three research papers on the subject: Baidu's Deep Voice, Char2Wav proposed by Yoshua Bengio's team, and Google's Tacotron.

Before introducing this year's latest research results, let's first review DeepMind's WaveNet.

WaveNet is inspired by the two-dimensional PixelCNN, adapted here to one-dimensional audio.

The animation above shows the structure of WaveNet. It is a fully convolutional neural network whose convolutional layers use different dilation factors, so that its receptive field grows exponentially with depth and can cover thousands of timesteps.
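The growth of the receptive field can be checked with a little arithmetic. The sketch below assumes causal convolutions with kernel size 2 and dilations that double from 1 to 512 within a block; these values and the helper name `receptive_field` are illustrative assumptions, not details taken from this article.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in timesteps) of a stack of dilated causal convolutions."""
    # A layer with dilation d and kernel size k reaches (k - 1) * d steps further back.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# One block whose dilation doubles at every layer: 1, 2, 4, ..., 512 (assumed values).
block = [2 ** i for i in range(10)]

print(receptive_field(block))       # 1024
print(receptive_field(block * 3))   # 3070: three stacked blocks
print(receptive_field([1] * 10))    # 11: ten layers without dilation
```

With the same ten layers, doubling the dilation covers 1024 timesteps instead of 11, which is why a moderately deep stack can see thousands of timesteps.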

At training time, the input sequences are real waveforms recorded from human speakers. After training, the network can be sampled to generate synthetic speech. At each sampling step, a value is drawn from the probability distribution computed by the network; this value is then fed back into the input, and a new prediction is made for the next step. Building up samples one step at a time like this is computationally expensive, which is the practical problem mentioned above.
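The sampling loop described above can be sketched as follows. `toy_model` is a hypothetical stand-in for the trained network (it is not WaveNet), and the four quantization levels and the 16 kHz sample rate are assumptions chosen only to keep the example small.

```python
import random


def toy_model(history, num_levels=4):
    """Hypothetical stand-in for the trained network: returns a probability
    distribution over the next quantized sample value, given all past samples."""
    last = history[-1] if history else 0
    # Favour values close to the previous one; a real model learns this from data.
    weights = [1.0 / (1 + abs(level - last)) for level in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    rng = random.Random(seed)
    samples, forward_passes = [], 0
    for _ in range(num_samples):
        probs = toy_model(samples)      # one network evaluation per output sample
        forward_passes += 1
        value = rng.choices(range(len(probs)), weights=probs)[0]
        samples.append(value)           # fed back as input for the next step
    return samples, forward_passes


samples, passes = generate(16000)       # one second of audio at an assumed 16 kHz
print(len(samples), passes)             # 16000 16000
```

Each output sample needs its own network evaluation, and the steps cannot run in parallel because every step depends on the previous one.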

Another point worth mentioning is that, to convert text into speech, WaveNet must be told what the text contains. In the DeepMind paper, the researchers did this by converting the text into a sequence of linguistic and phonetic features (including information about the current phoneme, syllable, word and so on).

I have just described the challenges WaveNet faces in practical applications; there is still much room for improvement in applying deep neural networks to speech synthesis. Next, I will introduce the three latest research results in this field.

Baidu Deep Voice

In February 2017, Baidu Research proposed Deep Voice, a high-quality text-to-speech system built entirely from deep neural networks.

In their research blog, the Baidu researchers said that the biggest obstacle to building a text-to-speech system today is the speed of audio synthesis, and that their system already achieves real-time speech synthesis, 400 times faster than previous implementations of WaveNet inference.

According to the authors, the contributions of the Deep Voice paper are:

  • Inspired by the traditional text-to-speech pipeline, Deep Voice adopts the same structure but replaces every component with a neural network and uses simpler features. This makes the system easier to adapt to new datasets, voices, and domains without manual data annotation or additional feature engineering.
  • Deep Voice lays the foundation for truly end-to-end speech synthesis. Such an end-to-end system has no complicated processing pipeline and does not rely on hand-engineered features as input or on pre-training.

As shown in the preceding figure, the TTS system contains five modules:

  • A grapheme-to-phoneme model;
  • A segmentation model for locating phoneme boundaries in a speech dataset;
  • A phoneme duration model for predicting the temporal duration of each phoneme in the phoneme sequence;
  • A fundamental frequency model for predicting whether each phoneme is voiced;
  • An audio synthesis model that combines the outputs of the other components to produce audio.

In Baidu's work, the researchers replace each component of the classic TTS pipeline with a corresponding neural network. For implementation details, readers can refer to the paper.
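How such modules might fit together when synthesizing speech can be sketched with toy stand-ins. Everything below is hypothetical: the lexicon, the fixed durations and pitch values, and the function names are placeholders for trained networks, not Baidu's implementation. The sketch also assumes that the segmentation model is needed only to label training data, so it does not appear in the synthesis path.

```python
# Toy stand-ins for the trained models; every name and value here is hypothetical.
TOY_LEXICON = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}
VOICED = {"AY", "DH", "EH", "R"}


def grapheme_to_phoneme(text):
    """Text -> phoneme sequence (a lookup table instead of a neural model)."""
    return [p for word in text.lower().split() for p in TOY_LEXICON[word]]


def predict_durations(phonemes):
    """Phoneme sequence -> duration of each phoneme in seconds."""
    return [0.12 if p in VOICED else 0.06 for p in phonemes]


def predict_f0(phonemes):
    """Phoneme sequence -> fundamental frequency in Hz, or None if unvoiced."""
    return [120.0 if p in VOICED else None for p in phonemes]


def synthesize(phonemes, durations, f0, sample_rate=16000):
    """Combine the upstream predictions; this toy only counts output samples."""
    assert len(phonemes) == len(durations) == len(f0)
    return sum(round(d * sample_rate) for d in durations)


def text_to_speech(text):
    phonemes = grapheme_to_phoneme(text)
    return synthesize(phonemes, predict_durations(phonemes), predict_f0(phonemes))


print(text_to_speech("hi there"))   # number of waveform samples to generate
```

The point of the sketch is the data flow: each stage consumes the phoneme sequence, and the final stage is conditioned on all upstream predictions.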

Baidu said in its research blog: "Deep learning has transformed many fields, including computer vision and speech recognition. We believe that speech synthesis has now reached a jump point."


Industry | Baidu proposes Deep Voice: a real-time neural speech synthesis system

End-to-end speech synthesis model Char2Wav

In February, researchers from the Indian Institute of Technology Kanpur, INRS-EMT, and the Canadian Institute for Advanced Research (CIFAR) published a paper on arXiv introducing Char2Wav, their work on end-to-end speech synthesis.

In this paper, the authors propose Char2Wav, an end-to-end model for speech synthesis. Char2Wav consists of two components: a reader and a neural vocoder.

The reader is an encoder-decoder model with attention. The encoder is a bidirectional recurrent neural network (RNN) that takes text or phonemes as input, while the decoder is a recurrent neural network with attention that produces vocoder acoustic features. The neural vocoder is a conditional extension of SampleRNN that generates raw waveform samples from intermediate representations.

Char2Wav: an attention-based recurrent sequence generator (ARSG) is a recurrent neural network that generates a sequence Y = (y1, ..., yT) conditioned on an input sequence X. X is preprocessed by an encoder, which outputs a sequence h = (h1, ..., hL). In this work, the output Y is a sequence of acoustic features, while X is the text or phoneme sequence to be synthesized. The encoder is a bidirectional recurrent network.
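A single attention step over the encoder outputs h = (h1, ..., hL) can be sketched as below. Dot-product scoring is used here purely for brevity and is an assumption; the attention mechanism in the Char2Wav paper may differ. The function names and toy vectors are illustrative.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights over h_1..h_L and the resulting context vector."""
    # Dot-product scoring: an assumption made for brevity, not the paper's method.
    scores = [sum(q * k for q, k in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy encoder outputs h_1..h_3
weights, context = attend([0.0, 2.0], h)    # toy decoder state
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

At every output step the decoder recomputes these weights, so each acoustic feature frame y_t can draw on a different part of the input text.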

The authors said that the work was greatly influenced by the research of Alex Graves (Graves, 2013; 2015). In a guest lecture, Graves demonstrated a speech synthesis model using an attention mechanism, but that work has not been published as a paper.

In addition, unlike traditional speech synthesis models, Char2Wav learns to generate audio directly from text. This is consistent with Baidu's Deep Voice system.


Yoshua Bengio et al. propose Char2Wav: end-to-end speech synthesis

Google's end-to-end text-to-speech synthesis model Tacotron

Not long ago, Google scientist Wang Yuxuan (the first author) and colleagues proposed Tacotron, a new end-to-end speech synthesis system. It takes characters as input and outputs the corresponding raw spectrogram, which is then passed to the Griffin-Lim reconstruction algorithm to generate speech. The authors also said they proposed several key techniques that make the sequence-to-sequence framework perform well on this difficult task.

In tests, Tacotron achieved a mean opinion score (MOS) of 3.82 out of 5 on US English, outperforming in naturalness a parametric system already used in production. In addition, because Tacotron generates speech at the frame level, it is much faster than sample-level autoregressive methods.
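A rough count of sequential generation steps illustrates why frame-level generation is faster. The 16 kHz sample rate, the 12.5 ms frame shift, and the option of emitting several frames per decoder step are assumptions made for this sketch, not figures from this article.

```python
import math


def sample_level_steps(duration_s, sample_rate=16000):
    """Autoregressive steps when each step emits one waveform sample."""
    return round(duration_s * sample_rate)


def frame_level_steps(duration_s, frame_shift_ms=12.5, frames_per_step=1):
    """Decoder steps when each step emits one or more spectrogram frames."""
    frames = math.ceil(duration_s * 1000 / frame_shift_ms)
    return math.ceil(frames / frames_per_step)


print(sample_level_steps(1.0))                     # 16000
print(frame_level_steps(1.0))                      # 80
print(frame_level_steps(1.0, frames_per_step=2))   # 40
```

Step counts are only part of the cost, since each frame-level step does more work, but under these assumptions the sequential dependency chain is shorter by a factor of a few hundred.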

Model architecture: the model takes characters as input, outputs the corresponding raw spectrogram, and passes it to the Griffin-Lim reconstruction algorithm to generate speech.


Academic | Google's end-to-end speech synthesis system Tacotron: speech synthesis directly from characters


Natural voice interaction with machines has long been our dream. Speech recognition has reached quite high accuracy, but the voice interaction loop involves more than recognition: natural speech synthesis is an equally important research area.

Having improved the accuracy of speech recognition, deep neural networks also have great potential to advance speech synthesis. Since the start of 2017 we have seen the research results introduced above (there are, of course, omissions). We believe that speech synthesis has reached a "jump point", as the Baidu blog puts it, and we expect more research results to appear that enable more natural interaction with machines.
