Intelligent Speech Interaction: Concepts

Last Updated: Aug 17, 2022

This topic introduces terms and concepts that are related to Intelligent Speech Interaction to help you understand this service.

Audio sample rate

The audio sample rate is the number of samples that a recording device captures from an audio signal per second. Sound that is sampled at a higher rate can be reproduced more faithfully and naturally.

Intelligent Speech Interaction supports audio sample rates of 8 kHz and 16 kHz. Telephone workloads use 8 kHz and all other workloads use 16 kHz.

If the audio sample rate of your speech data is higher than 16 kHz, you must convert the audio sample rate to 16 kHz so that Intelligent Speech Interaction can process your speech data. If the audio sample rate of your speech data is 8 kHz, do not convert the audio sample rate to 16 kHz. In this case, configure your project to use an 8 kHz model.
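The rule above can be written as a small decision function. This is a minimal sketch: the function name and the returned strings are illustrative and are not part of any Intelligent Speech Interaction SDK. Rates below 16 kHz other than 8 kHz are not covered by this topic, so the sketch rejects them rather than guessing.

```python
def plan_for_sample_rate(sample_rate_hz: int) -> str:
    """Return the action described in this topic for a given input sample rate.

    Hypothetical helper for illustration only.
    """
    if sample_rate_hz == 8000:
        # 8 kHz audio is used as-is with an 8 kHz model; do not upsample it.
        return "use 8 kHz model"
    if sample_rate_hz == 16000:
        return "use 16 kHz model"
    if sample_rate_hz > 16000:
        # Higher rates must be converted down to 16 kHz before processing.
        return "convert to 16 kHz, then use 16 kHz model"
    # Assumption: other rates are treated as unsupported because the topic
    # does not say how to handle them.
    raise ValueError(f"sample rate {sample_rate_hz} Hz is not covered by this topic")


print(plan_for_sample_rate(44100))  # convert to 16 kHz, then use 16 kHz model
```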

Audio bit depth

The audio bit depth is the number of bits of data in each sample. It determines how finely the amplitude of the sound is resolved and corresponds directly to the resolution of a sound card. A higher audio bit depth indicates a higher resolution and higher sound quality.

In most cases, Intelligent Speech Interaction uses a bit depth of 16 bits to capture audio data, so each sample is stored as two 8-bit bytes. At a sample rate of 16 kHz, an audio signal is therefore recorded and digitized as 16,000 samples per second, with two bytes per sample.

Each sample records the amplitude of the sampled signal. The precision of the sample depends on the audio bit depth.

  • An 8-bit byte represents 256 possible values. This means the amplitude values can be divided into 256 discrete sample values.

  • Two 8-bit bytes (16 bits) represent 65,536 possible values. This means the amplitude values can be divided into 65,536 discrete sample values.

    CD audio uses this bit depth.
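The arithmetic behind these figures can be checked in a few lines. The helper names below are illustrative only, and the data rate assumes uncompressed mono PCM.

```python
def quantization_levels(bit_depth: int) -> int:
    """Number of discrete amplitude values that a sample of this depth can hold."""
    return 2 ** bit_depth


def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Raw data rate of uncompressed PCM audio, in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


print(quantization_levels(8))            # 256
print(quantization_levels(16))           # 65536
print(pcm_bytes_per_second(16000, 16))   # 32000
```

At 16 kHz and 16 bits, one second of mono audio therefore occupies 32,000 bytes before any compression.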

Audio coding format

The audio coding format is a content representation format for storing and transmitting audio data. Note that the audio coding format is different from the audio file format. For example, the header of a WAV file declares the audio coding format of the data that the file contains, which can be pulse-code modulation (PCM) or adaptive multi-rate (AMR).


Before you call an Intelligent Speech Interaction service, make sure that the service supports the audio coding format of your speech data.
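To see the difference between the file format and the coding format, you can read the format tag from the header of a WAV file. The following sketch uses only the Python standard library. The function names are illustrative. The only format tag it names is 1, which identifies PCM in the WAV format. Check the documentation of the service that you call for the coding formats that it actually accepts.

```python
import io
import struct
import wave

WAVE_FORMAT_PCM = 0x0001  # format tag that identifies uncompressed PCM


def wav_format_tag(data: bytes) -> int:
    """Return the audio format tag stored in the 'fmt ' chunk of a WAV file."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    offset = 12
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, offset)
        if chunk_id == b"fmt ":
            # The first field of the 'fmt ' chunk is the 16-bit format tag.
            (format_tag,) = struct.unpack_from("<H", data, offset + 8)
            return format_tag
        # Skip this chunk; chunks are padded to an even number of bytes.
        offset += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no 'fmt ' chunk found")


def make_pcm_wav() -> bytes:
    """Build a short 16 kHz, 16-bit, mono PCM WAV file in memory (test data)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(16000)
        writer.writeframes(b"\x00\x00" * 160)  # 10 ms of silence
    return buf.getvalue()


print(wav_format_tag(make_pcm_wav()) == WAVE_FORMAT_PCM)  # True
```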

Sound channel

Sound channels are the separate audio signals that are captured at different spatial locations when sound is recorded. The number of sound channels equals the number of sound sources during the recording process. Most audio data is mono or binaural (stereo).


All Intelligent Speech Interaction services except the recording file recognition service support only mono speech data. If your speech data is binaural or multi-channel, convert it to mono before you call these services.
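One common way to convert two-channel audio to mono is to average the two channels sample by sample. The sketch below assumes raw 16-bit little-endian PCM with interleaved left and right samples. The function name is illustrative, and averaging is only one possible downmix strategy, not a method that this topic prescribes.

```python
import struct


def stereo16_to_mono(frames: bytes) -> bytes:
    """Downmix interleaved 16-bit little-endian stereo PCM to mono by averaging."""
    if len(frames) % 4 != 0:
        raise ValueError("stereo 16-bit PCM must be a multiple of 4 bytes long")
    count = len(frames) // 2
    samples = struct.unpack(f"<{count}h", frames)
    # Samples alternate left, right, left, right, ...
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)


stereo = struct.pack("<4h", 100, 200, -100, -300)
print(struct.unpack("<2h", stereo16_to_mono(stereo)))  # (150, -200)
```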

Inverse text normalization

Inverse text normalization (ITN) converts the spoken form of recognized speech into readable written text. ITN uses standardized formats to display objects such as numbers, amounts of money, dates, and addresses. The following table lists some examples.

Original speech              Recognition result after ITN is enabled
Twenty percent               20%
May the eleventh             May 11
Please dial one one zero.    Please dial 110.
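To make the idea concrete, the "Please dial one one zero." example can be reproduced with a toy rule that replaces runs of spoken digits with numerals. This is only an illustration of what ITN does. It is not how Intelligent Speech Interaction implements ITN, and the function name is hypothetical. The rule is deliberately naive: it would also turn "one of them" into "1 of them".

```python
import re

DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
_WORD = "|".join(DIGIT_WORDS)
# One digit word followed by any number of space-separated digit words.
_RUN = re.compile(rf"\b(?:{_WORD})(?: (?:{_WORD}))*\b", re.IGNORECASE)


def normalize_digit_runs(text: str) -> str:
    """Replace each run of spoken digit words with the corresponding numerals."""
    def to_digits(match: re.Match) -> str:
        return "".join(DIGIT_WORDS[word.lower()] for word in match.group(0).split(" "))
    return _RUN.sub(to_digits, text)


print(normalize_digit_runs("Please dial one one zero."))  # Please dial 110.
```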


Appkey

An appkey uniquely identifies a project that is created in the Intelligent Speech Interaction console. When you call an Intelligent Speech Interaction service for a project, you must provide the appkey of the project. The service then uses the appkey to obtain the configuration of the project.

Intelligent Speech Interaction can provide speech interaction services in multiple business scenarios, for example, customer service hotlines and mobile phone inputs. The service capabilities vary based on the scenario. To obtain optimal results, make sure that the configurations of the project meet the requirements of the business scenario.

AccessKey pair

An AccessKey pair is an identity credential for applications to call Alibaba Cloud API operations. You can create and view your AccessKey pair on the Security Management page.

An AccessKey pair consists of an AccessKey ID and an AccessKey secret. The AccessKey ID is used to identify you as a user. The AccessKey secret is used to encrypt the signature string of your access request. This can prevent data from being tampered with. You must use the AccessKey ID and the AccessKey secret together. The AccessKey secret is similar to a logon password. Keep the AccessKey secret confidential.
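The following sketch shows in general terms how a shared secret lets a server detect tampering. It uses HMAC-SHA256 from the Python standard library purely as a stand-in. The signature method that Alibaba Cloud API operations actually use is not described in this topic, and the function names, secret, and request string below are all made up for illustration.

```python
import hashlib
import hmac


def sign(secret: str, string_to_sign: str) -> str:
    """Compute a keyed digest of the request string (illustrative, HMAC-SHA256)."""
    mac = hmac.new(secret.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()


def verify(secret: str, string_to_sign: str, signature: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(sign(secret, string_to_sign), signature)


secret = "example-secret"            # placeholder, never hard-code a real secret
request = "Action=Demo&Version=1"    # placeholder request string
signature = sign(secret, request)

print(verify(secret, request, signature))                   # True
print(verify(secret, "Action=Other&Version=1", signature))  # False
```

Because only the holder of the secret can produce a matching signature, a request that is modified in transit no longer verifies.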

Access token

An access token is a credential that you use to call Intelligent Speech Interaction services. An access token has a validity period. You can use your AccessKey ID and AccessKey secret to obtain an access token.


If you call Intelligent Speech Interaction services on a device such as a mobile phone, you can obtain an access token from the server and send it to the device. This prevents your AccessKey pair from being disclosed.
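Because an access token expires, the server that hands out tokens typically caches the current token and requests a new one shortly before the validity period ends. The sketch below shows that pattern. The class name `TokenCache`, the `fetch_token` callback, and the 60-second safety margin are assumptions for illustration. The callback stands in for whatever call your server makes with the AccessKey pair, and it is assumed to return the token together with its expiry time as a Unix timestamp.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Cache an access token on the server and refresh it before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        margin_seconds: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch_token = fetch_token  # hypothetical: returns (token, expires_at)
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a token that is valid for at least `margin_seconds` more."""
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch_token()
        return self._token
```

The device receives only the value that `get()` returns, so the AccessKey pair never leaves the server.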

Intermediate result

You can specify whether to return intermediate results when you call an Intelligent Speech Interaction service.

  • If the relevant parameter is set to false, the server returns a final result only after it completes the recognition task.

  • If the relevant parameter is set to true, the server returns the final result after it completes the recognition task, and also returns the intermediate results while you are speaking.

Assume that the final result of a recognition task for a piece of speech data is "Hello welcome to Alibaba Group". If you enable intermediate results, the server may return the following results while you are speaking:

Hello welcome
Hello welcome to
Hello welcome to Alibaba
Hello welcome to Alibaba Group

  • The server may correct the previous intermediate result when it returns the current intermediate result.

  • The current intermediate result does not always have one more word than the previous intermediate result. The number of incremental words is not fixed.
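A practical consequence of these two points is that a client should overwrite the displayed text with each new result instead of appending to it. The sketch below shows one way to do that. The class and method names are illustrative and do not correspond to callbacks in any Intelligent Speech Interaction SDK.

```python
class LiveTranscript:
    """Hold the text to display while a recognition task is running."""

    def __init__(self) -> None:
        self.current = ""
        self.is_final = False

    def on_intermediate(self, text: str) -> None:
        # Overwrite instead of appending: the server may revise earlier words,
        # and a result may add any number of words.
        if not self.is_final:
            self.current = text

    def on_final(self, text: str) -> None:
        self.current = text
        self.is_final = True


transcript = LiveTranscript()
for partial in ["Hello welcome", "Hello welcome to", "Hello welcome to Alibaba"]:
    transcript.on_intermediate(partial)
transcript.on_final("Hello welcome to Alibaba Group")
print(transcript.current)  # Hello welcome to Alibaba Group
```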


Task ID

The Alibaba Cloud SDK generates a task ID for each call request. Each task has a unique task ID. If an error occurs, you can use the task ID for troubleshooting.