Transcribe Speech to Text in Real-Time Using Alibaba Cloud Intelligent Speech Interaction

This article explains how to transcribe speech to text in real-time using Alibaba Cloud Intelligent Speech Interaction.

By Alain Francois

Sometimes you attend a live video conference or course but are unfamiliar with the language, or you cannot hear the audio properly because of your surroundings. In these cases, a written transcription is the only way to follow along. Alibaba Cloud offers its customers real-time speech technology through its Intelligent Speech Interaction solution.

What Is Intelligent Speech Interaction?

Alibaba Cloud Intelligent Speech Interaction is a service built on state-of-the-art technologies, such as speech recognition, speech synthesis, and natural language understanding. It lets enterprises integrate speech capabilities into their products so that the products can listen to, understand, and converse with users, providing an immersive human-computer interaction experience. The service is suitable for various scenarios, such as intelligent Q&A, real-time recording for court trials, real-time subtitling for speeches, and transcription of audio recordings. It is currently available in Mandarin, Cantonese, English, Japanese, Korean, French, and Indonesian.

Intelligent Speech Interaction Features

Intelligent Speech Interaction provides the services and features below:

  • Short sentence recognition recognizes short speech that lasts up to one minute. This service applies to scenarios such as chat conversations and voice search. It offers a high decoding rate and high recognition accuracy.
  • Real-time speech recognition recognizes audio streams of any length in real-time, including speech data streams that last longer than one minute. This service applies to uninterrupted speech recognition scenarios, such as speeches during conferences and livestreaming.
  • Recording file recognition recognizes the recording files you upload. This service applies to scenarios where real-time recognition is not required, such as quality assurance in call centers, recording of court trials in databases, and meeting minute summarization.
  • Speech synthesis converts text to natural-sounding, fluent speech. This service provides a variety of speakers in different languages, dialects, and voices. You can specify the speaker of the synthesized speech based on your business requirements. This service applies to virtual conversation scenarios, such as intelligent customer services and outbound voice calls.
  • Long-text-to-speech synthesis converts long text of up to 100,000 characters to natural-sounding speech. It also lets you quickly customize text-to-speech (TTS) speakers using a small amount of training data. This service applies to scenarios where you need the system to read literature and news aloud.
  • The self-learning platform provides hotword training and custom linguistic models to help you improve the accuracy of the preceding recognition services.
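
The choice between the three recognition services above can be sketched as a small decision function. This is only an illustration of the one-minute threshold and the live versus uploaded distinction described in the list. The names Service and choose_service are hypothetical and not part of any Alibaba Cloud SDK, and treating every uploaded file as a recording-file job is a simplifying assumption.

```python
from enum import Enum
from typing import Optional


class Service(Enum):
    SHORT_SENTENCE = "short sentence recognition"
    REAL_TIME = "real-time speech recognition"
    RECORDING_FILE = "recording file recognition"


def choose_service(is_live: bool, duration_seconds: Optional[float] = None) -> Service:
    """Pick a recognition service (hypothetical helper, not an SDK call).

    is_live: True when text is needed while the audio is still arriving.
    duration_seconds: expected length, or None for an open-ended stream.
    """
    if not is_live:
        # Uploaded recordings where real-time results are not required.
        return Service.RECORDING_FILE
    if duration_seconds is not None and duration_seconds <= 60:
        # Short utterances of up to one minute, such as voice search.
        return Service.SHORT_SENTENCE
    # Longer or open-ended streams, such as conference speeches.
    return Service.REAL_TIME
```

For example, choose_service(True, 30) selects short sentence recognition, while a live stream of unknown length falls through to real-time speech recognition.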

To transcribe speech to text in real-time, we need to configure the Speech Recognition service of Intelligent Speech Interaction.

Speech Recognition Scenarios

The Speech Recognition service fits scenarios such as the following:

  • Real-time creation of subtitles in live video recognizes speech data and converts the speech data to subtitles in real-time for live speeches and videos. You can also use this service to manage the subtitles.
  • Real-time meeting recording recognizes speech data during a conference and converts it to text in real-time. This service is especially suitable for remote scenarios, such as video conferences.
  • Real-time recording of court trials recognizes the speech data of all parties involved in a court trial and converts the speech data to text for all parties to view on the trial page. This reduces the workload of court clerks.
  • Real-time recording of customer service calls recognizes speech data in calls made from and to a call center and converts it to text in real-time.
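
All of the scenarios above depend on sending audio to a recognizer in small pieces while the speaker is still talking. The sketch below shows only that framing step, under stated assumptions: the audio is raw 16-bit mono PCM at 16 kHz and each frame covers 100 ms. These values and the helper names are illustrative and are not taken from the Intelligent Speech Interaction documentation.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumed sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
FRAME_MS = 100           # assumed frame length in milliseconds


def frame_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     bytes_per_sample: int = BYTES_PER_SAMPLE,
                     frame_ms: int = FRAME_MS) -> int:
    """Number of bytes of PCM audio covered by one frame."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

A streaming client would pass each frame to the recognizer as it is produced and display the partial text that comes back, which is what makes live subtitles possible.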

Running a Real-Time Speech Recognition Service

To use the real-time speech recognition service, you first need to activate the Intelligent Speech Interaction service.

Go to the Alibaba Cloud console. If you don't have an account yet, you can create one.

Activating Speech Recognition Service

Log in to your Alibaba Cloud account and go to the Speech Interaction service:


A pop-up message asks you to activate the service.


Alibaba Cloud offers a free trial for the Intelligent Speech Interaction service. It supports two concurrent calls at most and provides public concurrent service resources. Activate the service:


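Because the free trial supports at most two concurrent calls, a client that transcribes several streams at once may want to cap its own parallelism. The following is a minimal sketch of that idea using a semaphore. The names run_with_limit and run_all are hypothetical, and the task passed in stands for whatever recognition call your application makes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_CALLS = 2  # free-trial limit stated above

_slots = threading.BoundedSemaphore(MAX_CONCURRENT_CALLS)


def run_with_limit(task, item):
    """Run task(item) while holding one of the two available slots."""
    with _slots:
        return task(item)


def run_all(task, items, workers=8):
    """Submit every item, but let only two tasks run at the same time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(lambda item: run_with_limit(task, item), items))
```

Extra work waits for a free slot instead of being sent to the service immediately.
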
After that, you will be notified that the order is completed:


Configuring Speech Recognition Service

Now, you need to create a project:


Add a project name and a description. After creating the project, go to the project settings to select the service to use:


You will see the different services. You need to configure the Speech Recognition Service:


You will be asked to select a model to configure for the speech recognition service:


We will enable the _English Speech Recognition Model_. You can upload an audio file to test the real-time transcription. We will upload the audio track of this Alibaba Cloud video about artificial intelligence:


The service performs a real-time transcription in the test window. You can confirm its use:


Now, the service is configured, and you can publish it. If your audio contains names of people, places, or enterprises, you can use hotwords to improve the recognition of that vocabulary. You can add hotwords before validating the service:



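Before adding hotwords, it can help to tidy the candidate list so that the same name is not entered twice with different spacing or capitalization. The sketch below is one way to do that. clean_hotwords is a hypothetical helper, and the format and limits that the console accepts for hotwords are not covered here.

```python
def clean_hotwords(candidates):
    """Trim, collapse whitespace, and drop empty or duplicate entries.

    Duplicates are detected case-insensitively; the first spelling wins.
    """
    seen = set()
    cleaned = []
    for raw in candidates:
        word = " ".join(raw.split())  # trim and collapse inner whitespace
        key = word.casefold()
        if word and key not in seen:
            seen.add(key)
            cleaned.append(word)
    return cleaned
```

The input order is preserved, so the most important names can stay at the top of the list.
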
You are billed based on your usage of the service for processing speech or text data, plus any additional features or resources of Intelligent Speech Interaction that you use:


Please check the pricing and billing methods page for more information.


You may use Intelligent Speech Interaction in multiple business scenarios, such as customer service and court scenarios. The required service capabilities may vary with each scenario.
