From: Speech Laboratory of DAMO Academy, 2021-07-09
Source: Alibaba voice AI public account
TTS (Text-To-Speech, or speech synthesis) is a small but beautiful pearl in the AI field. It gives intelligent applications and smart hardware a voice. As the speaking end of a voice solution, it can do what you often hear in everyday life: a host reading the news, a teacher giving a lesson, a celebrity voicing navigation prompts. It can also be customized into distinctive voices that read novels, recite poems or narrate videos in a funny, gentle or passionate tone. This article introduces the voice customization product built on KAN-TTS, Alibaba's latest speech synthesis technology.
What is speech synthesis? Speech synthesis is a technology that converts text into natural, fluent speech. It is now widely used in entertainment, education and human-computer interaction, typically for voice navigation, voice assistants, telephone customer service, dubbing for films and games, and audiobooks. Different application scenarios call for different voices, and voice model customization products have emerged to meet that demand. Voice model customization means using speech synthesis technology to build voice models of different genders, ages, styles and emotions, so as to meet the needs of different businesses and scenarios.
Since deep learning was introduced into speech recognition in 2010, it has played an important role in advancing speech technology. Its adoption in TTS, however, was slow. Only in 2016 and 2017, with Google's WaveNet and Tacotron and MILA's Char2Wav, did the power of deep learning reach TTS as a whole, bringing significant advances in sound quality, expressiveness and ease of modeling. In the past two years, these research results have started to move into practical products, and commercial TTS applications have developed rapidly. For example, Google Cloud launched a TPU-based WaveNet solution in 2018, and Microsoft Azure launched a GPU-based fully neural solution in the same year. Alibaba Cloud also launched its all-neural product solution in 2018. To meet actual customer and business expansion needs, and after extensive optimization, this solution is currently the only all-neural product solution in the industry that runs entirely on CPUs.
Once this newer, better technology went online, Alibaba Group customer service and Ant Financial customer service became its first customers. Both their business volume and their technical requirements are far above the industry average, which is further evidence that Alibaba's latest KAN-TTS technical framework is ready for practical use. In 2019, Tmall Genie launched a personalized voice customization service that is also built on KAN-TTS. It lets parents record 10 minutes of voice data on a mobile phone, customize a voice of their own, and synthesize stories for their children in that voice.
Beyond internal use within Alibaba Group, Alibaba Cloud launched a new-generation voice model customization service based on KAN-TTS in 2019, which is fast and low-cost. It was successfully deployed in the mobile app of First Finance and Economics: from a small amount of data provided by the customer for a financial news anchor, a high-quality synthesized voice was customized, giving users of the First Finance and Economics app a high-quality news listening experience.
As the technology has improved and commercial adoption has grown, the advantages of Alibaba's voice model customization service based on the KAN-TTS technical framework have become more evident. Broadly speaking, the market wants products that are both inexpensive and high in quality. That is exactly where voice model customization under KAN-TTS excels.
1. Lower costs. With traditional voice model customization, the limits of the technical framework mean that a full customization requires about 20,000 sentences (20 hours) of data. Under the high standards that voice recording demands, 20,000 sentences usually take more than half a year to record, and the speaker has to deliver consistently high-quality, reliable recordings throughout. During that time you keep paying for the voice talent, the recording studio, the recording engineer, data processing and other expenses. A recording period this long also increases the risk of the customization project. For example, a speaker with a cold or fever cannot perform at their normal vocal level, or the recording studio may become unavailable because of renovation. With the powerful model structure of KAN-TTS and data from hundreds of speakers, we can build better TTS voices from a small amount of data. We have also developed a corpus selection tool that covers as many scenarios as possible with a small amount of data, further reducing the amount of recording required.
The preceding figure shows the customization results obtained with different data volumes under the KAN-TTS framework. Even with less than 2 hours of data (2,000 sentences), KAN-TTS-based customization achieves good results that differ little from those obtained with 10 hours: the score is clearly above 95% and close to real human recordings. Compared with traditional customization, KAN-TTS-based customization reduces the required data to about one tenth of the previous amount (from roughly 20 hours to roughly 2 hours), and the customization cycle shrinks from more than half a year to about one month.
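The article does not describe how the corpus selection tool mentioned under point 1 works. As a rough illustration of the general idea only (covering as many scenarios as possible with few sentences), the sketch below uses a standard greedy coverage heuristic. It is an assumption, not the KAN-TTS implementation: the function names, the sentence IDs and the "units" (stand-ins for whatever a real tool would track, such as phonemes, tones or scenario tags) are all hypothetical.

```python
from typing import Dict, List, Set


def select_corpus(candidates: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily pick up to `budget` sentences, each adding the most uncovered units.

    `candidates` maps a hypothetical sentence ID to the set of units it contains.
    This is a generic greedy set-cover heuristic, not the actual KAN-TTS tool.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Iterate in sorted order so ties go to the smallest sentence ID.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # No remaining sentence adds anything new.
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


def coverage(candidates: Dict[str, Set[str]], chosen: List[str]) -> float:
    """Fraction of all units in `candidates` covered by the chosen sentences."""
    all_units = set().union(*candidates.values())
    got = set().union(*(candidates[s] for s in chosen))
    return len(got) / len(all_units) if all_units else 1.0


# Toy data: four candidate sentences and the (made-up) units each one contains.
toy = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a", "b"},
    "s4": {"e"},
}
picked = select_corpus(toy, budget=2)
print(picked, coverage(toy, picked))
```

With a fixed recording budget, a heuristic of this kind favors sentences that add coverage and skips those that repeat what is already covered, which is one plausible way a smaller script can still span many scenarios.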
2. Greater expressiveness. Voices produced by traditional voice model customization are relatively stiff and monotonous, and it is hard to tune them into voice products that suit different scenarios, needs, personalities and characteristics. Voice model customization based on KAN-TTS technology stands out in this respect. It can flexibly produce voices that better fit the scenario, in the style the customer asks for. For example, news products require accurate, full and formal pronunciation, while customer service voices should be kind and natural, focus on communication, and sometimes feel more approachable with a slight accent. KAN-TTS technology better captures the unique characteristics of each person's voice, so it can synthesize a voice that is distinctly yours and meet personalized needs.
Backed by the latest KAN-TTS technology, Alibaba Cloud's premium voice customization products have kept exploring the characteristics of voices in different application scenarios and have developed a set of product capabilities for customizing high-quality, expressive voice models from small amounts of data. Our products are already deployed in news broadcasting, novel reading, smart hardware and other scenarios. For more cases, see the official website (https://ai.aliyun.com/nls/customtts).
Finally, where will higher-level speech synthesis products go from here?
From the perspective of synthesis technology, the goal is of course a sound closer to a real person, finer sound quality, more natural pronunciation and intonation, and better adaptation to the scenario. Voice models customized under the KAN-TTS technical framework have already made great progress in all four respects.
From the perspective of the barrier to entry, most recording for high-quality voice model customization still has to be done in a professional recording studio, with professional equipment and under the guidance of a professional recording director. The next goal in making speech synthesis technology accessible to everyone is to lower that barrier, so that ordinary people can record with ordinary equipment in an ordinary environment while the collected recordings still meet the requirements of voice model customization.
From the perspective of application scenarios, as applications become more widespread, users are no longer satisfied with speech synthesis that merely sounds friendly and natural. For a growing share of consumers, having a personalized voice is becoming a factor in purchasing decisions. As the technology improves and market demand develops, personalized TTS and emotional TTS will be applied in a range of niche scenarios, such as paid knowledge content, celebrity IP, smart hardware, and physical or virtual robots. For customers who have large amounts of text content, such as books and UGC, and who have their own audio content, such as a strong IP or IP channels, speech synthesis may be the most suitable choice. Voice model products customized under the KAN-TTS technical framework not only offer high quality, high efficiency and low cost, but also support more flexible forms of cooperation: they can be delivered as cloud or on-premises TTS services, used to customize IP voices, or developed jointly as a "sound factory".
While pursuing world-class technology, Alibaba voice continues to provide customers with high-quality customized voice services, and is committed to combining technological innovation with the practical application of its results, so as to better meet customers' personalized needs.