From: Machine Heart, 2021-11-28
In 2016, DeepMind proposed WaveNet, which attracted wide attention in the industry. WaveNet generates raw audio waveforms directly and achieves excellent results in text-to-speech and general audio generation. In practice, however, one of its problems is its heavy computational cost, which prevents it from being applied directly in products.
Therefore, there is still much work to be done in the field of speech synthesis. Speech synthesis has two main goals: intelligibility and naturalness. Intelligibility refers to the clarity of the synthesized audio, in particular the extent to which a listener can extract the original message. Naturalness describes qualities that intelligibility does not capture directly, such as overall ease of listening, consistency of style, and subtle regional or language nuances.
Last year the industry focused on speech recognition; this year speech synthesis has become one of the important research areas in the deep learning community. In early 2017 alone, Machine Heart has already covered three research papers on this subject: Baidu's Deep Voice, Char2Wav from Yoshua Bengio's team, and Google's Tacotron.
Before introducing this year's latest results, let's review DeepMind's WaveNet.
WaveNet is inspired by the two-dimensional PixelCNN, adapted here to one dimension.
The animation above shows the structure of WaveNet. It is a fully convolutional neural network whose convolutional layers have different dilation factors, so the receptive field grows exponentially with depth and can cover thousands of timesteps.
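To make that exponential growth concrete, here is a small plain-Python sketch that computes the receptive field of a stack of dilated causal convolutions with filter size 2 and a 1, 2, 4, ..., 512 dilation schedule repeated three times (the exact schedule here is only illustrative):

```python
# Receptive field of stacked dilated causal convolutions (filter size 2).
# The dilation schedule below (1, 2, 4, ..., 512, repeated three times)
# is illustrative of the kind used in the WaveNet paper.
filter_size = 2
dilations = [2 ** i for i in range(10)] * 3   # 1..512, three stacks

receptive_field = 1
for d in dilations:
    receptive_field += d * (filter_size - 1)

print(receptive_field)           # 3070 samples
print(receptive_field / 16000)   # ~0.19 s of audio at 16 kHz
```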
At training time, the input sequences are real waveforms recorded from human speakers. After training, the network is sampled to generate synthetic utterances. At each timestep of sampling, a value is drawn from the probability distribution computed by the network; that value is then fed back into the input, and a new prediction is made for the next step. Building samples one step at a time like this incurs a high computational cost, which is exactly the practical problem mentioned above.
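The sampling loop itself can be sketched in a few lines of NumPy. The `predict_distribution` function below is only a stand-in for the trained network, not WaveNet's actual code; the point is that every new sample requires a full forward pass before the next one can be drawn, which is what makes naive generation so expensive.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_distribution(history):
    """Stand-in for the trained WaveNet: returns a categorical distribution
    over 256 mu-law quantized amplitude values given the past samples."""
    logits = rng.normal(size=256)           # dummy logits, for illustration only
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

samples = [128]                             # start from silence (middle of the 8-bit range)
for _ in range(16000):                      # one second of 16 kHz audio
    probs = predict_distribution(samples)   # one full forward pass per sample...
    nxt = rng.choice(256, p=probs)          # ...then draw a single value
    samples.append(int(nxt))                # feed it back as input for the next step
```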
Another point to be mentioned is that in order to use WaveNet to convert text into speech, it is necessary to recognize what is in the text. In DeepMind paper, researchers did this by converting the text into a sequence of language and phonetic features (including current phoneme, syllable, word and other information).
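As a rough illustration (the field names below are hypothetical and much simpler than the actual feature set in the paper), one frame of such a conditioning sequence might look like:

```python
# Hypothetical linguistic/phonetic conditioning features for one time step;
# the real feature set used in the WaveNet paper is considerably richer.
frame_features = {
    "current_phoneme": "ae",     # phoneme identity
    "position_in_syllable": 2,   # where this frame falls inside its syllable
    "word": "cat",               # word-level context
    "log_f0": 4.9,               # fundamental frequency, log scale
}
```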
I have just mentioned the challenges WaveNet faces in practical applications; there is still much room for improvement in applying deep neural networks to speech synthesis. Next, I will introduce the three latest research results in this field.
Baidu Deep Voice
In February 2017, Baidu Research proposed the Deep Voice system, a high-quality text-to-speech system built entirely from deep neural networks.
In the research blog, Baidu's researchers said that the biggest obstacle to building a text-to-speech system today is the speed of audio synthesis, and that their system already achieves real-time speech synthesis, running inference 400 times faster than a previous WaveNet implementation.
According to the authors, the contributions of the Deep Voice paper are:
- Inspired by the traditional text-to-speech pipeline, Deep Voice adopts the same structure but replaces every component with a neural network and uses simpler features. This makes the system easier to adapt to new datasets, voices, and domains without any manual data annotation or additional feature engineering.
- Deep Voice lays the foundation for truly end-to-end speech synthesis: a system with no complicated processing pipeline that relies neither on hand-engineered features as input nor on pre-training.
As shown in the figure above, the TTS system contains five modules:
- a grapheme-to-phoneme model that converts text into phonemes;
- a segmentation model that locates phoneme boundaries in the speech dataset;
- a phoneme duration model that predicts how long each phoneme in the sequence lasts;
- a fundamental frequency (F0) model that predicts whether each phoneme is voiced;
- an audio synthesis model that combines the outputs of the above four components to generate audio.
In Baidu's research, the researchers replace each component of the classic TTS pipeline with a corresponding neural network. Readers can refer to the paper for the specific implementations; a rough sketch of how the stages chain together is given below.
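This sketch shows the five stages at inference time as plain Python stubs. All function names and return values are hypothetical stand-ins, not Baidu's actual code, and the segmentation model is omitted here since it is mainly used during training to align phonemes with audio.

```python
from typing import List, Tuple
import numpy as np

def grapheme_to_phoneme(text: str) -> List[str]:
    """Stand-in for the grapheme-to-phoneme network."""
    return ["HH", "AH", "L", "OW"]                  # e.g. "hello" (hard-coded for the sketch)

def predict_durations(phonemes: List[str]) -> List[float]:
    """Stand-in for the phoneme duration model: seconds per phoneme."""
    return [0.08 for _ in phonemes]

def predict_f0(phonemes: List[str]) -> List[Tuple[bool, float]]:
    """Stand-in for the fundamental frequency model: (voiced?, F0 in Hz)."""
    return [(p not in {"HH"}, 180.0) for p in phonemes]

def synthesize(phonemes, durations, f0) -> np.ndarray:
    """Stand-in for the audio synthesis network that consumes the other outputs."""
    total = sum(durations)
    return np.zeros(int(16000 * total))             # silent placeholder waveform at 16 kHz

phonemes = grapheme_to_phoneme("hello")
durations = predict_durations(phonemes)
f0 = predict_f0(phonemes)
audio = synthesize(phonemes, durations, f0)
```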
Baidu said in its research blog: "Deep learning has transformed many fields, including computer vision and speech recognition. We believe speech synthesis has now reached a jump point."
See:
industry | Baidu proposes Deep Voice: real-time neural speech synthesis system
End-to-end speech synthesis model Char2Wav
In February, researchers from the Indian Institute of Technology Kanpur, INRS-EMT, and the Canadian Institute for Advanced Research (CIFAR) published a paper on arXiv introducing Char2Wav, their end-to-end speech synthesis model.
In this paper, the authors propose Char2Wav, an end-to-end model for speech synthesis. Char2Wav consists of two components: a reader and a neural vocoder.
The reader is an encoder-decoder model with attention. The encoder is a bidirectional recurrent neural network (RNN) that takes text or phonemes as input, while the decoder is a recurrent neural network with attention that produces vocoder acoustic features. The neural vocoder is a conditional extension of SampleRNN that generates raw waveform samples from this intermediate representation.
In Char2Wav, the attention-based recurrent sequence generator (ARSG) is a recurrent neural network that generates a sequence Y = (y1, ..., yT) conditioned on an input sequence X. X is preprocessed by an encoder that outputs a sequence h = (h1, ..., hL). In this study, the output Y is a sequence of acoustic features, while X is the text or phoneme sequence to be synthesized. The encoder is a bidirectional recurrent network.
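A minimal NumPy sketch of the attention step inside such a generator is given below. It uses generic additive attention purely for illustration; the ARSG in the Char2Wav paper follows Graves' location-based attention instead.

```python
import numpy as np

rng = np.random.default_rng(0)

L, H, S = 12, 16, 16              # input length, encoder dim, decoder state dim
h = rng.normal(size=(L, H))       # encoder outputs h_1..h_L (from the bidirectional RNN)
s = rng.normal(size=S)            # current decoder state

# Illustrative additive attention parameters (random for the sketch).
W = rng.normal(size=(H, 32))
U = rng.normal(size=(S, 32))
v = rng.normal(size=32)

scores = np.tanh(h @ W + s @ U) @ v    # one score per input position
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                   # attention weights over h_1..h_L

context = alpha @ h                    # weighted summary of the input, fed to the
                                       # decoder to predict the next acoustic frame
```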
The authors say the work was strongly influenced by the research of Alex Graves (Graves, 2013; 2015). In a guest lecture, Graves showed a speech synthesis model using an attention mechanism, but that work has not been published as a paper.
In addition, unlike traditional speech synthesis models, Char2Wav learns to generate audio directly from text. In this respect it is in the same spirit as Baidu's Deep Voice system.
See:
Yoshua Bengio et al proposed Char2Wav: end-to-end speech synthesis
Google's end-to-end text-to-speech synthesis model Tacotron
Not long ago, Google scientist Yuxuan Wang (the first author) and colleagues proposed Tacotron, a new end-to-end speech synthesis system that takes characters as input and outputs the corresponding raw spectrogram, which is then passed to the Griffin-Lim reconstruction algorithm to generate speech directly. The authors also say they propose several key techniques that make the sequence-to-sequence framework perform well on this difficult task.
In terms of test results, Tacotron achieved a mean opinion score of 3.82 (out of 5) on US English, surpassing a parametric system already used in production in terms of naturalness. In addition, because Tacotron generates speech at the frame level, it is much faster than sample-level autoregressive methods.
Model architecture: the model takes characters as input, outputs the corresponding raw spectrogram, and then feeds it to the Griffin-Lim reconstruction algorithm to generate speech.
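As a rough illustration of that last step, the snippet below reconstructs a waveform from a magnitude spectrogram using librosa's Griffin-Lim implementation; it assumes librosa is installed and uses a synthetic sine tone in place of a Tacotron-predicted spectrogram.

```python
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)        # 1 s synthetic tone as a stand-in signal

S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))    # magnitude spectrogram
y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256)   # iterative phase recovery

print(y_rec.shape)   # roughly the original number of samples
```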
See:
Academic | Google's end-to-end speech synthesis system Tacotron: speech synthesis directly from characters
Summary
For a long time, natural spoken interaction with machines has been a dream of ours. Although speech recognition has reached fairly high accuracy, the loop of speech interaction involves more than recognition: natural speech synthesis is also a very important research area.
Having driven up the accuracy of speech recognition, deep neural networks also show great potential for advancing speech synthesis. Since the start of 2017, we have already seen research results such as those introduced above (no doubt with some omissions). We believe speech synthesis has reached the "jump point" described in Baidu's blog, and we expect more new results to appear and enable more natural interaction with machines.