Intelligent Speech Interaction is developed based on state-of-the-art technologies such as speech recognition, speech synthesis, and natural language understanding. Enterprises can integrate Intelligent Speech Interaction into their applications to enable the applications to listen to, understand, and speak to users. This way, users can enjoy an immersive human-computer interaction experience. Intelligent Speech Interaction is suitable for various scenarios, including intelligent Q&A, intelligent quality inspection, real-time recording for court trials, real-time subtitling for speeches, and transcription of audio recordings. Intelligent Speech Interaction has been applied to many fields such as finance, insurance, justice, and e-commerce.
Intelligent Speech Interaction V2.0 has been released. The new version provides easy-to-use SDKs and a feature-rich console where you can use features such as the self-learning platform to improve speech recognition performance. You are welcome to activate Intelligent Speech Interaction.
Short sentence recognition
Short sentence recognition recognizes speech that lasts no longer than one minute. The service applies to short speech interaction scenarios, such as voice search, voice command control, and voice short messages. It can also be integrated into various mobile apps, smart home appliances, and smart voice assistants. For more information, see Overview.
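The following Python sketch illustrates how an application might send a short audio clip to a RESTful short sentence recognition interface. The endpoint URL, parameter names, headers, and response fields shown here are assumptions made for illustration only; see the API reference of the service for the exact interface, and replace the placeholder appkey and token with your own values.

```python
# Minimal sketch of a short sentence recognition request (assumed RESTful interface).
# The endpoint, parameter names, headers, and response layout are illustrative
# assumptions; check the official API reference before use.
import requests

APPKEY = "your-appkey"       # placeholder: project appkey from the console
TOKEN = "your-access-token"  # placeholder: access token obtained in advance

# Assumed gateway endpoint for short sentence recognition.
URL = "https://nls-gateway.cn-shanghai.aliyuncs.com/stream/v1/asr"

def recognize_short_audio(path: str) -> str:
    """Send a short (no longer than one minute) audio clip and return the recognized text."""
    with open(path, "rb") as f:
        audio = f.read()

    params = {
        "appkey": APPKEY,
        "format": "pcm",       # assumed audio format parameter
        "sample_rate": 16000,  # assumed sample rate parameter
    }
    headers = {
        "X-NLS-Token": TOKEN,                       # assumed token header
        "Content-Type": "application/octet-stream",
    }
    resp = requests.post(URL, params=params, headers=headers, data=audio, timeout=10)
    resp.raise_for_status()
    # Assumed response shape: {"status": ..., "result": "<recognized text>", ...}
    return resp.json().get("result", "")

if __name__ == "__main__":
    print(recognize_short_audio("hello.pcm"))
```

In practice, the SDKs mentioned at the beginning of this topic wrap requests of this kind for you.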
Benefits
High recognition accuracy
Uses the word-level latency-controlled bidirectional long short-term memory (LC-BLSTM) and deep feedforward sequential memory network connectionist temporal classification (DFSMN-CTC) models that are created by Alibaba Cloud. Compared with the traditional CTC method in the industry, the two innovative models reduce the error rate by 20%. This greatly improves the accuracy of speech recognition.
High decoding rate
Utilizes the low frame rate (LFR) decoding technology that is created by Alibaba Cloud to increase the decoding rate by more than three times without compromising recognition accuracy. This greatly shortens feedback time and improves user experience.
Original self-learning platform
Provides the self-learning platform for you to customize field-specific models to maximize recognition accuracy.
Extensive field coverage
Applies to various fields, such as finance, insurance, justice, e-commerce, and smart home appliances.
Scenarios
Voice search
Allows users to conduct voice searches in various scenarios, such as map navigation and browser search. You can integrate short sentence recognition into any mobile app so that users can perform hands-free searches by using voice commands.
Voice command control
Allows you to control smart devices in a quick and convenient manner by using voice commands. For example, you can use voice commands to turn the air conditioner on or off and change the TV channel. You can integrate short sentence recognition into smart devices such as smart home appliances.
Voice short messages
Allows you to convert voice short messages to text in messaging services. For example, you can use short sentence recognition to convert your speech to a text message when typing is inconvenient.
Real-time speech recognition
Real-time speech recognition recognizes audio streams of various lengths in real time, so that text is output as the speaker talks. The built-in intelligent sentence breaking feature identifies the start and end time of each sentence. Real-time speech recognition applies to scenarios such as real-time creation of subtitles in live videos, real-time meeting recording, real-time recording of court trials, and smart voice assistants. For more information, see Overview.
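Real-time recognition returns intermediate and sentence-level results while the audio is still streaming. The sketch below is not the service protocol; it only shows, under assumed result fields named text, begin_time, and end_time (in milliseconds), how an application could turn the sentence boundaries produced by intelligent sentence breaking into subtitle lines.

```python
# Illustrative sketch: turning sentence-level results (as produced by intelligent
# sentence breaking) into SRT-style subtitle lines. The field names text,
# begin_time, and end_time are assumptions, not the actual wire format.

def ms_to_timestamp(ms: int) -> str:
    """Convert milliseconds to an SRT timestamp such as 00:01:02,345."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, millis = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{millis:03d}"

def sentences_to_srt(sentences: list[dict]) -> str:
    """Build an SRT subtitle document from a list of sentence-end results."""
    blocks = []
    for i, sent in enumerate(sentences, start=1):
        start = ms_to_timestamp(sent["begin_time"])
        end = ms_to_timestamp(sent["end_time"])
        blocks.append(f"{i}\n{start} --> {end}\n{sent['text']}\n")
    return "\n".join(blocks)

if __name__ == "__main__":
    # Example sentence-end results with the assumed fields.
    results = [
        {"text": "Welcome to the launch event.", "begin_time": 0, "end_time": 2300},
        {"text": "Let's start with a short demo.", "begin_time": 2600, "end_time": 4800},
    ]
    print(sentences_to_srt(results))
```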
Benefits
High recognition accuracy
Uses the word-level LC-BLSTM and DFSMN-CTC models that are created by Alibaba Cloud. Compared with the traditional CTC method in the industry, the two innovative models reduce the error rate by 20%. This greatly improves the accuracy of speech recognition.
High decoding rate
Utilizes the LFR decoding technology that is created by Alibaba Cloud to increase the decoding rate by more than three times without compromising recognition accuracy. This greatly shortens feedback time and improves user experience.
Original self-learning platform
Provides the self-learning platform for you to customize field-specific models to maximize the recognition accuracy.
Extensive field coverage
Applies to various fields, such as finance, insurance, justice, e-commerce, and smart home appliances.
Scenarios
Real-time creation of subtitles in live video
Recognizes speech data from live speeches and videos and converts it to subtitles in real time. You can also use this service to manage the subtitles.
Real-time meeting recording
Recognizes speech data in a conference and converts the speech data to text in real time. This service is especially suitable for remote scenarios such as video conferences.
Real-time recording of court trials
Recognizes speech data of all parties involved in a court trial and converts the speech data to text for all parties to view on the trial page. This reduces the workload of court clerks.
Real-time recording of customer service calls
Recognizes speech data in calls made from and to a call center and converts the speech data to text in real time. This facilitates real-time quality assurance and monitoring.
Recording file recognition
Recording file recognition recognizes recording files that you upload. This service applies to scenarios such as the quality assurance of call centers, recording of court trials in databases, meeting minute summarization, and medical record filing. For more information, see Overview.
If you use the free trial edition, the system completes the recognition and returns the recognized text within 24 hours after you upload a recording file. If you use a paid edition, the system returns the recognized text within 6 hours. However, if the total duration of the recording files that you upload within half an hour exceeds 500 hours, the system requires more time to complete the recognition. If you need to convert a large amount of speech data to text at a time, contact Alibaba Cloud pre-sales staff.
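Recording file recognition is asynchronous: you submit a recording file, then query for the result when it becomes available. The following sketch only illustrates this submit-then-poll pattern; submit_recognition_task and query_recognition_result are hypothetical helpers that stand in for the service's task submission and query operations, whose real names and parameters are documented in the API reference.

```python
# Sketch of the asynchronous submit-then-poll workflow for recording file recognition.
# submit_recognition_task() and query_recognition_result() are hypothetical helpers;
# the real task submission and query operations are described in the API reference.
import time

def submit_recognition_task(file_url: str) -> str:
    """Hypothetical helper: submit a recording file URL and return a task ID."""
    raise NotImplementedError("Call the service's task submission operation here.")

def query_recognition_result(task_id: str) -> dict:
    """Hypothetical helper: return {'status': 'RUNNING' | 'SUCCESS' | 'FAILED', 'text': ...}."""
    raise NotImplementedError("Call the service's result query operation here.")

def transcribe_recording(file_url: str, poll_interval_s: int = 300) -> str:
    """Submit a recording file and poll until the recognized text is available."""
    task_id = submit_recognition_task(file_url)
    while True:
        result = query_recognition_result(task_id)
        if result["status"] == "SUCCESS":
            return result["text"]
        if result["status"] == "FAILED":
            raise RuntimeError(f"Recognition task {task_id} failed.")
        # As noted above, results can take up to 6 hours (paid editions) or
        # 24 hours (free trial), so poll at a relaxed interval.
        time.sleep(poll_interval_s)
```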
Benefits
High recognition accuracy
Uses the word-level LC-BLSTM and DFSMN-CTC models that are created by Alibaba Cloud. Compared with the traditional CTC method in the industry, the two innovative models reduce the error rate by 20%. This greatly improves the accuracy of speech recognition.
High decoding rate
Utilizes the LFR decoding technology that is created by Alibaba Cloud to increase the decoding rate by more than three times without compromising recognition accuracy. This greatly shortens feedback time and improves user experience.
Original self-learning platform
Provides the self-learning platform for you to customize field-specific models to maximize the recognition accuracy.
Extensive field coverage
Applies to various fields, such as finance, insurance, justice, e-commerce, and smart home appliances.
Scenarios
Quality assurance of call centers
Recognizes the uploaded recording files of a call center, converts the speech data to text, and then detects illegal or sensitive words in the text.
Recording of court trials in databases
Recognizes the uploaded recording files of court trials, converts the speech data to text, and then stores the text in databases.
Meeting minute summarization
Recognizes the uploaded recording files of a meeting and automatically summarizes the meeting minutes. The meeting minutes can also be manually summarized.
Medical record filing
Allows doctors to record medical operations as speech, recognizes the recording files, and then converts the speech data to text. This improves the efficiency of medical record filing.
Speech synthesis
Speech synthesis is developed based on the deep learning technology to convert text to natural-sounding and fluent speech. The service provides multiple speakers and allows you to adjust the speed, intonation, and volume of the generated speech. Speech synthesis applies to scenarios such as intelligent customer service, speech interaction, audiobook reading, and accessible broadcast. For more information, see Overview.
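As a rough illustration of how an application might request synthesized speech and adjust the speaker, speed, intonation, and volume, the Python sketch below posts text to an assumed RESTful endpoint. The URL, parameter names, and value ranges are assumptions made for illustration; consult the speech synthesis API reference for the exact interface.

```python
# Illustrative speech synthesis request (assumed RESTful interface).
# The endpoint, parameter names, and value ranges are assumptions; check the API reference.
import requests

APPKEY = "your-appkey"       # placeholder: project appkey from the console
TOKEN = "your-access-token"  # placeholder: access token obtained in advance

# Assumed gateway endpoint for speech synthesis.
URL = "https://nls-gateway.cn-shanghai.aliyuncs.com/stream/v1/tts"

def synthesize(text: str, out_path: str = "output.wav") -> None:
    """Request synthesized speech for the given text and save the returned audio."""
    body = {
        "appkey": APPKEY,
        "token": TOKEN,
        "text": text,
        "format": "wav",                  # assumed output format parameter
        "sample_rate": 16000,             # assumed sample rate parameter
        "voice": "example_female_voice",  # placeholder speaker name
        "speech_rate": 0,                 # assumed speed setting, 0 = default
        "pitch_rate": 0,                  # assumed intonation setting, 0 = default
        "volume": 50,                     # assumed volume setting
    }
    resp = requests.post(URL, json=body, timeout=30)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

if __name__ == "__main__":
    synthesize("Hello, welcome to the speech synthesis service.")
```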
Benefits
Cutting-edge technology
Builds a multi-level automatic prediction model based on deep learning and uses acoustic and linguistic parameters to generate multi-level rhythm pauses. This way, the service generates natural-sounding rhythms.
Extensive field coverage
Uses speech libraries collected in multiple fields, such as smart home appliances, in-vehicle devices, navigation, finance, banking, insurance, securities, operators, logistics, real estate, and education, to enable more accurate pronunciation for words in various fields.
Natural sounding
Generates natural-sounding, full-bodied, cadenced, and expressive speech whose mean opinion score (MOS) reaches the top level in the industry. The synthesis models are trained on large amounts of speech data.
Deep customization
Allows you to customize a speaker library to meet your personalized application needs. The speech synthesis service provides various speakers with different voices, such as standard male and female voices, a gentle female voice, and a sweet female voice. The service also allows you to synthesize speech based on the Speech Synthesis Markup Language (SSML) and to adjust the speed, intonation, and volume of the generated speech.
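For SSML-based synthesis, the request carries marked-up text instead of plain text. The snippet below shows what such input might look like, using the common SSML elements <speak> and <break>; the set of elements and attributes that the service actually supports is defined in its SSML reference, so treat this as an illustrative assumption.

```python
# Illustrative SSML input for speech synthesis. Which elements and attributes are
# supported is defined by the service's SSML reference; <speak> and <break> are
# shown here only as common examples.
ssml_text = (
    "<speak>"
    "Your order has been shipped."
    '<break time="500ms"/>'
    "It is expected to arrive within three days."
    "</speak>"
)

# The marked-up string would then be passed as the synthesis text, for example
# with the synthesize() sketch shown earlier in this topic:
# synthesize(ssml_text)
```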
Scenarios
Intelligent customer service
Synthesizes speech for intelligent customer service in multiple industries and scenarios. This improves customer satisfaction and the efficiency of customer service and reduces the labor cost in the call center.
Smart devices
Applies human-like voices to smart home appliances, smart speakers, in-vehicle devices, and wearable devices.
Audiobook reading
Uses expressive voices to tell stories, read novels, and broadcast news, so that users can listen instead of reading.
Accessible broadcast
Converts text to fluent, natural-sounding speech for users of all ages and physical conditions, which improves accessibility.
Speech synthesis speaker customization
The speech synthesis speaker customization service is based on deep learning technology and allows you to quickly customize text-to-speech (TTS) speakers by using a small amount of training data. You can use the custom speakers for speech synthesis both in the Intelligent Speech Interaction console and on your smart devices.
If you need to customize speakers or further understand the customization process, contact us at nls_support@service.aliyun.com.
Benefits
Cutting-edge technology
Utilizes Knowledge-Aware Neural TTS (KAN-TTS), a speech synthesis technology launched by Alibaba Cloud, together with deep neural network (DNN) and machine learning technologies, to convert text to natural-sounding, full-bodied, cadenced, and expressive speech. The synthesized speech is almost indistinguishable from a human voice recording.
Low demand for data volume
Customizes a Mandarin Chinese TTS speaker by using a minimum of 2,000 sentences as the training corpus. If you provide a training corpus in both Chinese and English, you can customize a speaker that supports both languages.
Cost-effectiveness
Greatly reduces the time required for recording and annotation, and lowers costs, because only a small amount of training data is required.
Deep customization
Creates custom TTS speakers based on the training corpus that you upload. The service also provides a large number of preset speakers with different tones and styles, which are created from high-quality recording data collected in top recording studios.
Scenarios
Intelligent customer service
Synthesizes speech for intelligent customer service in multiple industries and scenarios. This improves customer satisfaction and the efficiency of customer service and reduces the labor cost in the call center.
Smart devices
Applies human-like voices to smart home appliances, smart speakers, in-vehicle devices, and wearable devices.
Audiobook reading
Uses expressive voices to tell stories, read novels, and broadcast news, so that users can listen instead of reading.
Accessible broadcast
Converts text to fluent, natural-sounding speech for users of all ages and physical conditions, which improves accessibility.
Self-learning platform
The self-learning platform provides hotword training and custom linguistic models to help you improve the performance of speech recognition.
Benefits
Easy to use
Provides a revolutionary self-service solution for speech optimization that requires only a few manual operations. This simplifies the speech optimization process and improves the accuracy of speech recognition.
Fast
Allows you to optimize, test, and publish custom linguistic models within minutes. The platform can also optimize business-specific hotwords in real time. In contrast, traditional customization and optimization may take several weeks or even months to deliver.
Accurate
Delivers high speech recognition accuracy that has been verified by many internal and external partners and projects. For example, the self-learning platform has helped many projects resolve availability issues and achieve better optimization results than the traditional optimization methods used by competitors.
Scenarios
Hotword training
Allows you to add business-specific hotwords to a vocabulary to improve the accuracy of speech recognition when the default recognition results do not meet your expectations.
Custom linguistic models
Allows you to upload business-specific corpora to train custom linguistic models that achieve higher recognition accuracy in the related business field, such as justice or finance.
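After you publish a hotword vocabulary or a custom linguistic model on the self-learning platform, you typically reference it by its ID when you call a recognition service. The sketch below extends the assumed short sentence recognition request shown earlier in this topic with two illustrative parameters, vocabulary_id and customization_id; the actual parameter names and where to set them are defined in the API reference of each recognition service.

```python
# Illustrative sketch: applying self-learning platform artifacts in a recognition
# request. The parameter names vocabulary_id and customization_id are assumptions
# made for illustration; confirm them in each service's API reference.
import requests

URL = "https://nls-gateway.cn-shanghai.aliyuncs.com/stream/v1/asr"  # assumed endpoint

def recognize_with_custom_assets(audio: bytes, appkey: str, token: str,
                                 vocabulary_id: str, customization_id: str) -> str:
    """Recognize audio while applying a hotword vocabulary and a custom linguistic model."""
    params = {
        "appkey": appkey,
        "format": "pcm",
        "sample_rate": 16000,
        "vocabulary_id": vocabulary_id,        # ID of the published hotword vocabulary
        "customization_id": customization_id,  # ID of the published custom linguistic model
    }
    headers = {"X-NLS-Token": token, "Content-Type": "application/octet-stream"}
    resp = requests.post(URL, params=params, headers=headers, data=audio, timeout=10)
    resp.raise_for_status()
    return resp.json().get("result", "")
```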
References
Pricing: describes the billing methods of Intelligent Speech Interaction.
Quick Start: describes how to get started with Intelligent Speech Interaction.
Developer Guide: describes the terms related to Intelligent Speech Interaction and how to use the service, for example, how to obtain an access token.
Console User Guide: describes the features provided in the Intelligent Speech Interaction console.
Documentation of a speech service: describes how to use a specific speech service, such as short sentence recognition, real-time speech recognition, recording file recognition, and speech synthesis.
Self-learning Platform: describes how to improve the recognition effect by using the hotword training and custom linguistic model features provided by the self-learning platform.
Best Practices: provides the best practices of using Intelligent Speech Interaction.
FAQ: provides answers to commonly asked questions about Intelligent Speech Interaction.