Intelligent Speech Interaction

Last Updated: Oct 10, 2019

Overview

Intelligent Speech Interaction is developed by Alibaba Cloud based on technologies such as speech recognition, speech synthesis, and natural language understanding. Enterprises can integrate Intelligent Speech Interaction into their products to give them the ability to “listen, speak, and understand users,” providing users with an intelligent human-computer interaction experience. The service suits scenarios such as smart Q&A, smart quality inspection, real-time recording of court trials, real-time subtitling of speeches, and interview recording and transcription. It has been applied in many fields, including finance, insurance, justice, and e-commerce.

Alibaba Cloud has released Intelligent Speech Interaction V2.0. The new version allows you to use tools such as the customization platform to improve speech recognition accuracy. It also provides a feature-rich management console and easy-to-use SDKs. You are welcome to activate Intelligent Speech Interaction.

Services provided by Intelligent Speech Interaction

Short sentence recognition

You can use this service to recognize short speech that lasts up to 1 minute. The service applies to short speech interaction scenarios, such as voice search, voice command control, and voice short message service. It can be integrated into various applications, smart home appliances, smart assistants, and other products. For more information, see API reference for the short sentence recognition service.
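Because the service targets speech of up to 1 minute, a client may want to check a clip's length before submitting it. The following is a minimal sketch, assuming the audio is an uncompressed PCM WAV file. It uses only the Python standard library and does not call the service. The function names, the `silent_wav` demo helper, and the 16 kHz sample rate are illustrative assumptions, not part of the Intelligent Speech Interaction SDK.

```python
import io
import wave

MAX_SECONDS = 60.0  # the 1-minute limit described above


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV file given a path or binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_short_sentence_limit(source, limit: float = MAX_SECONDS) -> bool:
    """Return True if the clip is no longer than `limit` seconds."""
    return wav_duration_seconds(source) <= limit


def silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demo helper, hypothetical)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(fits_short_sentence_limit(silent_wav(2)))
    print(fits_short_sentence_limit(silent_wav(61)))
```

Clips that fail this check may be better suited to the real-time speech recognition or recording file recognition services described in this topic.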

Benefits:

  • High recognition accuracy
    Compared with the traditional connectionist temporal classification (CTC) method in the industry, Alibaba Cloud’s original word-level LC-BLSTM or DFSMN-CTC modeling reduces the error rate by 20%, greatly improving the accuracy of speech recognition.
  • Ultra-high decoding rate
    Alibaba Cloud’s original low frame rate (LFR) decoding technology can increase the decoding rate by more than three times without compromising recognition accuracy, greatly shortening feedback time and improving user experience.
  • Original model optimization tool
    You can use the model optimization tool to customize field-specific models to maximize the recognition accuracy.
  • Extensive field coverage
    The service is widely used in many fields such as finance, insurance, justice, e-commerce, and smart home appliances.

Scenarios:

  • Voice search
    Supports voice search in various scenarios, such as map navigation and browser search. You can integrate voice search into any mobile application for hands-free operation.
  • Voice command control
    Controls smart devices by voice commands to achieve quick and convenient operations, such as turning air conditioners on or off and switching TV channels. You can integrate voice commands into smart devices such as smart home appliances.
  • Voice short message service
    Sends or receives short messages by voice. Instead of typing, you can use the voice short message service to quickly convert a voice message to text.

Real-time speech recognition

You can use this service to recognize audio streams of varying length in real time, so that text is output as the speaker talks. The built-in intelligent sentence breaking feature detects the start and end time of each sentence. The service applies to scenarios such as real-time creation of subtitles in live videos, real-time meeting recording, real-time recording of court trials, and smart voice assistants. For more information, see API reference for the real-time speech recognition service.

Benefits:

  • High recognition accuracy
    Compared with the traditional CTC method in the industry, Alibaba Cloud’s original word-level LC-BLSTM or DFSMN-CTC modeling reduces the error rate by 20%, greatly improving the accuracy of speech recognition.
  • Ultra-high decoding rate
    Alibaba Cloud’s original LFR decoding technology can increase the decoding rate by more than three times without compromising recognition accuracy, greatly shortening feedback time and improving user experience.
  • Original model optimization tool
    You can use the model optimization tool to customize field-specific models to maximize the recognition accuracy.
  • Extensive field coverage
    The service is widely used in many fields such as finance, insurance, justice, e-commerce, and smart home appliances.

Scenarios:

  • Real-time creation of subtitles in live videos
    Converts audio to subtitles in real time for live speeches and videos. You can also monitor the subtitle content.
  • Real-time meeting recording
    Converts the audio in a meeting to text in real time, which is especially suitable for long-distance scenarios such as video conferences.
  • Real-time recording of court trials
    Converts the speech of all parties involved in a court trial to text, which can be viewed on the trial page. This helps reduce the workload of court clerks.
  • Real-time recording of customer service calls
    Converts the voice data in a call center to text in real time, achieving real-time quality assurance and monitoring.
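The per-sentence start and end times produced by intelligent sentence breaking map naturally onto subtitle cues. The following is a minimal sketch, assuming each recognized sentence is available as a start offset, an end offset (both in milliseconds), and its text. The `Sentence` class and its field names are hypothetical and do not reflect the actual response format of the service. The sketch renders the sentences in the widely used SubRip (SRT) subtitle layout.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    # Hypothetical fields; the real service response may use different names.
    begin_ms: int  # sentence start offset in milliseconds
    end_ms: int    # sentence end offset in milliseconds
    text: str


def srt_time(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(sentences) -> str:
    """Render sentences as numbered SRT cues separated by blank lines."""
    cues = []
    for index, s in enumerate(sentences, start=1):
        cues.append(
            f"{index}\n{srt_time(s.begin_ms)} --> {srt_time(s.end_ms)}\n{s.text}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([
        Sentence(0, 1500, "Welcome, everyone."),
        Sentence(1800, 4200, "Let's get started."),
    ]))
```

In a live scenario, cues would be emitted one at a time as each sentence ends rather than collected into a single file.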

Recording file recognition

You can use this service to recognize the recording files uploaded by users. The service recognizes the recording files and returns the recognized text within 24 hours after the recording files are uploaded. The service applies to scenarios such as call center quality assurance, recording of court trials in databases, meeting minutes summarization, and medical record filing in hospitals. For more information, see API reference for the recording file recognition service.

Benefits:

  • High recognition accuracy
    Compared with the traditional CTC method in the industry, Alibaba Cloud’s original word-level LC-BLSTM or DFSMN-CTC modeling reduces the error rate by 20%, greatly improving the accuracy of speech recognition.
  • Ultra-high decoding rate
    Alibaba Cloud’s original LFR decoding technology can increase the decoding rate by more than three times without compromising recognition accuracy, greatly shortening feedback time and improving user experience.
  • Original model optimization tool
    You can use the model optimization tool to customize field-specific models to maximize the recognition accuracy.
  • Extensive field coverage
    The service is widely used in many fields such as finance, insurance, justice, e-commerce, and smart home appliances.

Scenarios:

  • Call center quality assurance
    Transcribes the recording files uploaded by a call center to text, and searches the text for prohibited speech or sensitive words.
  • Recording of court trials in databases
    Transcribes the uploaded recording files of court trials to text, and stores the text in databases.
  • Meeting minutes summarization
    Transcribes meeting audio files to text, which can then be summarized into meeting minutes manually or automatically.
  • Medical record filing in hospitals
    Records a doctor’s medical operation by voice and transcribes the recording file to text, improving the efficiency of medical record filing.
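In the call center quality assurance scenario, the search for sensitive words runs over the returned text. The following is a minimal sketch, assuming the transcript has already been obtained as a plain string. The function name and the word list are illustrative assumptions and are not part of the service. Matching is a case-insensitive substring search, which also works for languages written without spaces between words.

```python
import re


def find_sensitive_words(transcript: str, sensitive_words):
    """Return (word, character_offset) pairs for every match, ordered by offset."""
    hits = []
    for word in sensitive_words:
        # re.escape treats the word as literal text, not as a pattern.
        for match in re.finditer(re.escape(word), transcript, flags=re.IGNORECASE):
            hits.append((word, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    text = "I guarantee a refund. Guarantee!"
    for word, offset in find_sensitive_words(text, ["guarantee", "refund"]):
        print(f"{offset}: {word}")
```

A production quality assurance system would likely add per-speaker attribution and rule management on top of a scan like this.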

Speech synthesis

This service uses advanced deep learning technology to convert text to natural-sounding speech. Currently, the service provides you with a variety of tones to choose from, and allows you to adjust the speech speed, intonation, and volume. The service applies to scenarios such as intelligent customer service, speech interaction, literature audio reading, and barrier-free broadcasting. For more information, see API reference for the speech synthesis service.

Benefits:

  • Cutting-edge technology
    The service models multi-level rhythm pauses to synthesize speech with a natural rhythm. It uses a multi-level automatic prediction model that is based on deep learning and built on acoustic and linguistic parameters.
  • Extensive field coverage
    Based on the speech libraries collected in many fields such as smart home appliances, on-vehicle devices, navigation, finance, banking, insurance, securities, operators, logistics, real estate, and education, Alibaba Cloud’s speech synthesis technology enables more accurate pronunciation for words in various fields and industries.
  • Natural sounding
    A large amount of audio data is used as the training corpus, making the synthesized speech natural, full-bodied, and rich in cadence and expressiveness. The mean opinion score (MOS) reaches the top level in the industry.
  • Deep customization
    You can customize a speech library to meet your personalized application needs. The speech synthesis service provides multiple styles for you to choose from, such as standard male or female voices, gentle female voices, and sweet female voices. You can synthesize speech based on Speech Synthesis Markup Language (SSML) and dynamically adjust parameters such as the volume, speed, and pitch.
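To illustrate adjusting volume, speed, and pitch through SSML, the following is a minimal sketch that builds an SSML document with a `<prosody>` element. It assumes the `rate`, `pitch`, and `volume` attributes defined in the W3C SSML specification. The set of tags and attribute values that the speech synthesis service actually accepts may differ, so check the API reference before relying on it. The `build_ssml` function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_ssml(text: str, rate: str = "medium", pitch: str = "medium",
               volume: str = "medium") -> str:
    """Wrap text in <speak><prosody ...> using W3C SSML attribute names."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(
        speak, "prosody", {"rate": rate, "pitch": pitch, "volume": volume}
    )
    # ElementTree escapes characters such as & and < when serializing.
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Your order has shipped.", rate="slow", volume="loud"))
```

Building the markup with an XML library instead of string concatenation keeps the document well-formed even when the input text contains reserved characters.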

Scenarios:

  • Intelligent customer service
    Provides speech synthesis for intelligent customer service in multiple industries and scenarios. This improves the work efficiency of customer service, ensures customer satisfaction, and reduces the labor cost of a call center.
  • Smart device
    Gives smart home appliances, smart speakers, on-vehicle devices, and wearable devices the most suitable voice.
  • Literature audio reading
    Uses expressive voices to tell stories, read novels, and broadcast news, so that users can listen instead of read.
  • Barrier-free broadcasting
    Converts text to fluent, natural-sounding speech for all users, including people with disabilities and users of any age.

Speech synthesis sound customization

This service provides you with the ability to customize Text-to-Speech (TTS) sounds. With advanced deep learning technology, you can use less data to quickly and efficiently customize personalized synthesis sounds. In this way, you can enable your service or device to make natural and smooth TTS sounds.

You can experience the sample customized sound and learn about the process of sound customization on the Alibaba Cloud official website. If you have any requirements or questions, contact us at nls_support@service.aliyun.com.

Benefits:

  • Cutting-edge technology
    Based on the latest Knowledge-Aware Neural TTS (KAN-TTS) speech synthesis technology, deep neural network (DNN), and machine learning, the service converts text to produce speech that is natural, full-bodied, and rich in cadence and expressiveness. The synthesized speech is almost indistinguishable from the voice recording of a human.
  • Low data volume threshold
    You need as few as 2,000 high-quality sentences to synthesize natural-sounding speech in Mandarin Chinese. With English corpora added, you can implement mixed speech synthesis in both Chinese and English.
  • Lower cost
    Due to the low data volume threshold, the cost for recording and markup is greatly reduced.
  • Deep customization
    The service allows you to specify the data used to synthesize TTS sounds. The service also provides a large number of candidate speakers and a variety of tones and styles for you to choose from, and ensures high-quality recording data by collecting it in a top recording studio.

Scenarios:

  • Intelligent customer service
    Provides speech synthesis for intelligent customer service in multiple industries and scenarios. This improves the work efficiency of customer service, ensures customer satisfaction, and reduces the labor cost of a call center.
  • Smart device
    Gives smart home appliances, smart speakers, on-vehicle devices, and wearable devices the most suitable voice.
  • Literature audio reading
    Uses expressive voices to tell stories, read novels, and broadcast news, so that users can listen instead of read.
  • Barrier-free broadcasting
    Converts text to fluent, natural-sounding speech for all users, including people with disabilities and users of any age.

Customization platform

The customization platform provides hotword training and custom models to help you improve the recognition accuracy of the preceding recognition services.

Benefits:

  • Easy to use
    The customization platform supports self-service speech optimization through one click. This greatly lowers the threshold for intelligent speech optimization and allows business users without a technical background to significantly improve speech recognition accuracy in their business.
  • Fast
    The customization platform allows you to optimize, test, and publish business-tailored models within several minutes and optimize business-related hotwords in real time. This shortens the delivery period of traditional customization and optimization, which may last several weeks or even months.
  • Accurate
    The optimization effect of the customization platform has been verified by many internal and external partners and projects. It has helped many projects resolve availability issues and achieve results that are difficult to reach with traditional optimization methods.

Scenarios:

  • Hotword: In speech recognition services, if some of your business-specific words cannot be recognized by default, you can add these words as hotwords to the vocabulary to improve the recognition result of the words.
  • Custom model: You can upload business-related text corpora to train custom models, achieving higher recognition accuracy in your business fields, such as justice and finance.

Learning path

  1. Billing method: Introduces the billing methods of Intelligent Speech Interaction.
  2. Quick start: Describes how to use Intelligent Speech Interaction.
  3. Developer guide: Introduces related terms, how to obtain access tokens, and so on.
  4. Console guide: Introduces the features provided by the console.
  5. Services (short sentence recognition, real-time speech recognition, recording file recognition, and speech synthesis): Select a service based on your business requirements.
  6. Customization platform: Improves the recognition accuracy through the hotword and custom model features provided by the customization platform.
  7. Best practices: Introduces the best practices of Intelligent Speech Interaction.
  8. FAQ: Introduces solutions to common problems.