This topic introduces some terms and concepts related to Intelligent Speech Interaction to help you understand this product.
The audio sampling rate is the number of samples that a recording device takes from an audio signal per second. Audio that is sampled at a higher rate reproduces the original sound more faithfully and naturally.
Currently, Intelligent Speech Interaction supports only two audio sampling rates: 8 kHz and 16 kHz. Telephony scenarios use 8 kHz, and all other scenarios use 16 kHz. When you create a project in the Intelligent Speech Interaction console, the project uses a 16 kHz model by default and can process only speech data sampled at 16 kHz. If your speech data is sampled at 8 kHz, edit the project to use an 8 kHz model.
When calling an Intelligent Speech Interaction service, you need to specify the audio sampling rate. The specified audio sampling rate must match both your speech data and project configuration. Otherwise, you may obtain unsatisfactory recognition results. If the audio sampling rate of your speech data is higher than 16 kHz, you need to convert it to 16 kHz so that Intelligent Speech Interaction can process your speech data. If the audio sampling rate of your speech data is 8 kHz, do not convert it to 16 kHz. Instead, edit your project to use an 8 kHz model.
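The rules above can be sketched as a small decision helper. This is a minimal illustration using only the Python standard library. The function names `plan_for_sample_rate`, `wav_sample_rate`, and `make_silent_wav` are hypothetical and are not part of any Intelligent Speech Interaction SDK. Rates below 16 kHz other than 8 kHz are not covered by the text, so the sketch reports them as unsupported.

```python
import io
import wave


def plan_for_sample_rate(rate_hz: int) -> str:
    """Map a sampling rate to the action described in the text above."""
    if rate_hz == 16000:
        return "use the 16 kHz model"
    if rate_hz == 8000:
        # The text says not to upsample 8 kHz audio to 16 kHz.
        return "use the 8 kHz model"
    if rate_hz > 16000:
        return "downsample to 16 kHz, then use the 16 kHz model"
    # Assumption: other rates are not discussed in the text.
    return "unsupported"


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sampling rate stored in the header of a WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def make_silent_wav(rate_hz: int, frames: int = 160) -> bytes:
    """Build a short silent 16-bit mono WAV in memory (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * frames)
    return buf.getvalue()


if __name__ == "__main__":
    for rate in (8000, 16000, 44100):
        detected = wav_sample_rate(make_silent_wav(rate))
        print(detected, "->", plan_for_sample_rate(detected))
```

Checking the rate before calling a service makes a mismatch between the speech data and the project configuration visible early, instead of surfacing later as poor recognition results.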
The audio bit depth is the number of bits of data in each sample. It determines how finely the amplitude of the sound can be represented and directly corresponds to the resolution of a sound card. A higher audio bit depth indicates a higher resolution and higher sound quality.
Each sample records the amplitude of the sampled signal. The sampling precision depends on the audio bit depth.
- 8 bits (one byte) represent 256 possible values, so each sample can take one of 256 discrete amplitude values.
- 16 bits (two bytes) represent 65,536 possible values, so each sample can take one of 65,536 discrete amplitude values. This audio bit depth is used for CDs.
- 32 bits (four bytes) represent 4,294,967,296 possible values, so each sample can take one of 4,294,967,296 discrete amplitude values. In most cases, this audio bit depth is unnecessary.
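The counts in the list above follow from the fact that n bits can encode 2^n distinct values. The short sketch below reproduces them and also shows how the bit depth and sampling rate together determine the size of uncompressed audio. The function names are illustrative only.

```python
def amplitude_levels(bit_depth: int) -> int:
    """Number of discrete amplitude values that one sample can take."""
    return 2 ** bit_depth


def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


if __name__ == "__main__":
    for bits in (8, 16, 32):
        print(bits, "bits:", amplitude_levels(bits), "levels")
    # 16 kHz, 16-bit, mono audio
    print(pcm_bytes_per_second(16000, 16), "bytes per second")
```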
The audio coding format is the representation used to store and transmit audio data. Note that the audio coding format is different from the audio file format. For example, WAV is a file format, and the header of a WAV file specifies the coding format of the audio data that the file contains, such as pulse-code modulation (PCM) or adaptive multi-rate (AMR).
Audio coding formats are a complex subject, so this topic introduces them only briefly. Before calling an Intelligent Speech Interaction service, ensure that the service supports the audio coding format of your speech data.
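To make the distinction between file format and coding format concrete, the following sketch reads the format tag from the `fmt ` chunk of a WAV (RIFF) file. It is a minimal illustration, not a full WAV parser. The function name `wav_format_tag` is hypothetical, and the name table lists only a few well-known tags.

```python
import io
import struct
import wave

# A few well-known WAVE format tags. This table is deliberately incomplete.
FORMAT_NAMES = {
    0x0001: "PCM",
    0x0003: "IEEE float",
    0x0006: "A-law",
    0x0007: "mu-law",
}


def wav_format_tag(wav_bytes: bytes) -> int:
    """Return the coding format tag declared in the WAV header."""
    if wav_bytes[0:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    offset = 12
    # Walk the chunk list: 4-byte ID, 4-byte little-endian size, then the data.
    while offset + 8 <= len(wav_bytes):
        chunk_id, chunk_size = struct.unpack_from("<4sI", wav_bytes, offset)
        if chunk_id == b"fmt ":
            (tag,) = struct.unpack_from("<H", wav_bytes, offset + 8)
            return tag
        # Chunk data is padded to an even number of bytes.
        offset += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no fmt chunk found")


if __name__ == "__main__":
    # Build a tiny PCM WAV in memory and inspect its header.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16)
    tag = wav_format_tag(buf.getvalue())
    print(tag, FORMAT_NAMES.get(tag, "unknown"))
```

Two files that both end in `.wav` can therefore carry differently coded audio, which is why the coding format needs to be checked separately from the file extension.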
Sound channels are separate audio signals that are collected at different spatial locations when sound is recorded. The number of sound channels therefore equals the number of sound sources during recording. Common audio data is mono or binaural (stereo).
All recognition services of Intelligent Speech Interaction except the recording file recognition service support only mono speech data. If your speech data is binaural or multi-channel, you need to convert it to mono before recognition.
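One common way to convert binaural data to mono is to average the left and right samples of each frame. The sketch below does this for interleaved 16-bit little-endian PCM. It is an illustration under those assumptions, not the conversion that any particular tool performs, and the function name `stereo16_to_mono` is hypothetical.

```python
import struct


def stereo16_to_mono(frames: bytes) -> bytes:
    """Average interleaved left/right 16-bit little-endian PCM samples into mono."""
    if len(frames) % 4 != 0:
        raise ValueError("expected whole stereo frames of 4 bytes each")
    count = len(frames) // 2
    samples = struct.unpack("<%dh" % count, frames)
    # Samples are interleaved as L, R, L, R, ...
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count, 2)]
    return struct.pack("<%dh" % len(mono), *mono)


if __name__ == "__main__":
    stereo = struct.pack("<4h", 100, 200, -100, 100)
    print(struct.unpack("<2h", stereo16_to_mono(stereo)))
```

Averaging keeps the result within the 16-bit range. If each channel carries a different speaker, for example the two sides of a phone call, processing the channels separately may be a better fit than mixing them.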
Inverse text normalization (ITN) converts the spoken-form text of a recognition result into a readable written form. ITN uses standardized formats to display objects such as numbers, amounts of money, dates, and addresses. The following table lists some examples.
|Original speech|Recognition result after ITN is enabled|
|---|---|
|One thousand six hundred and eighty yuan|RMB 1,680|
|May the eleventh|May 11|
|Please dial one one zero.|Please dial 110.|
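To give a feel for what ITN does, the toy sketch below handles only the last row of the table: it collapses runs of spoken digits into a digit string. It is not the algorithm that Intelligent Speech Interaction uses, which is not described here. A real ITN system also handles amounts, dates, and context, for example when "one" should stay a word. The function name is hypothetical.

```python
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalize_digit_runs(text: str) -> str:
    """Replace consecutive spoken digits with a single digit string (toy example)."""
    out = []
    run = ""
    for word in text.split():
        core = word.rstrip(".,!?")
        tail = word[len(core):]  # trailing punctuation, if any
        digit = DIGITS.get(core.lower())
        if digit is not None:
            run += digit
            if tail:
                # Punctuation ends the current run of digits.
                out.append(run + tail)
                run = ""
        else:
            if run:
                out.append(run)
                run = ""
            out.append(word)
    if run:
        out.append(run)
    return " ".join(out)


if __name__ == "__main__":
    print(normalize_digit_runs("Please dial one one zero."))
```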
You can create multiple projects in the Intelligent Speech Interaction console. Each project is uniquely identified by an appkey. When you call an Intelligent Speech Interaction service, you must provide the appkey of the relevant project. Then, the service can obtain the specific configuration information about the project based on the appkey.
You may use Intelligent Speech Interaction in multiple business scenarios, for example, a customer service hotline scenario and a mobile phone input method scenario. Required service capabilities may vary with the scenario. You can obtain optimal results only when the configuration of a project matches the corresponding business scenario. Therefore, you need to create a project for each business scenario in the Intelligent Speech Interaction console and configure each project properly.
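In application code, the one-project-per-scenario rule often ends up as a small lookup table. The sketch below is one hypothetical way to organize it. The scenario names, the placeholder appkeys, and the `ProjectConfig` type are assumptions made for illustration, and real appkeys come from the console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectConfig:
    appkey: str          # placeholder value; the real appkey comes from the console
    sample_rate_hz: int  # must match the model configured for the project


# Hypothetical mapping from business scenario to console project.
PROJECTS = {
    "customer_service_hotline": ProjectConfig("<hotline-appkey>", 8000),
    "mobile_input_method": ProjectConfig("<input-method-appkey>", 16000),
}


def project_for(scenario: str) -> ProjectConfig:
    """Look up the project for a scenario, failing loudly for unknown scenarios."""
    try:
        return PROJECTS[scenario]
    except KeyError:
        raise ValueError("no project configured for scenario: " + scenario) from None


if __name__ == "__main__":
    print(project_for("customer_service_hotline"))
```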
An AccessKey is a credential that your application uses to call Alibaba Cloud API operations. An application that presents this credential has full permissions under your Alibaba Cloud account, so you must keep your AccessKey secure. An AccessKey consists of an AccessKey ID and an AccessKey secret, which must be used as a pair. The AccessKey ID identifies you as a user. The AccessKey secret is used to generate the signature for your access requests so that the requests cannot be tampered with. The AccessKey secret is similar to your logon password and must be kept confidential. You can create and view your AccessKey on the Security Management page.
An access token is a credential for you to call an Intelligent Speech Interaction service. An access token has a validity period. You can use your AccessKey ID and AccessKey secret to obtain an access token. To call an Intelligent Speech Interaction service on a device such as a mobile phone, you can obtain an access token on the server and send it to the device. This effectively prevents your AccessKey from being disclosed.
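Because an access token has a validity period, a server that hands tokens to devices usually caches the token and renews it shortly before it expires. The sketch below shows that pattern. The `fetch_token` callable is a stand-in for whatever code exchanges the AccessKey ID and AccessKey secret for a token, and its return shape (token, expiry as a Unix timestamp) is an assumption, not the actual API.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Cache an access token and refresh it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        refresh_margin_s: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        # fetch_token is hypothetical: it returns (token, expiry Unix timestamp).
        self._fetch = fetch_token
        self._margin = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a token that is valid for at least refresh_margin_s more seconds."""
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
        return self._token


if __name__ == "__main__":
    def demo_fetch() -> Tuple[str, float]:
        # Stand-in for the real token request made on the server.
        return "demo-token", time.time() + 3600

    cache = TokenCache(demo_fetch)
    print(cache.get())
```

Only the token leaves the server in this design. The AccessKey ID and AccessKey secret stay inside `fetch_token`, which is what keeps them off the device.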
You can specify whether to return intermediate results when calling an Intelligent Speech Interaction service.
- If you disable intermediate results, the server returns only the final result after it completes the recognition task.
- If you enable intermediate results, the server returns intermediate results while you are speaking and then returns the final result after it completes the recognition task.
Assume that the final result of a recognition task for a piece of speech data is “I bought two pairs of shoes.” If you enable intermediate results, the server may return the following results while you are speaking:
I bought two
I bought two pears
I bought two pairs of shoes.
- The server may correct the previous intermediate result when it returns the current intermediate result. For example, “I bought two pears” is corrected to “I bought two pairs of shoes.”
- The number of words added from one intermediate result to the next is not fixed. That is, the current intermediate result does not always contain exactly one more word than the previous one. For example, the result changes from “I bought two pears” to “I bought two pairs of shoes.”
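A practical consequence of the two points above is that a client should replace the displayed text with each new intermediate result instead of appending to it. The sketch below shows that handling. The class and method names are hypothetical and do not correspond to SDK callbacks.

```python
from typing import Optional


class LiveTranscript:
    """Track the text to display while recognition is in progress."""

    def __init__(self) -> None:
        self.current = ""
        self.final: Optional[str] = None

    def on_intermediate(self, text: str) -> None:
        # Replace rather than append: the server may revise earlier words.
        self.current = text

    def on_final(self, text: str) -> None:
        self.current = text
        self.final = text


if __name__ == "__main__":
    transcript = LiveTranscript()
    for partial in ("I bought two", "I bought two pears"):
        transcript.on_intermediate(partial)
        print("partial:", transcript.current)
    transcript.on_final("I bought two pairs of shoes.")
    print("final:", transcript.final)
```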
The SDK automatically generates a unique task ID for each request to call an Intelligent Speech Interaction service. You need to record the task ID. If an error occurs, you can open a ticket to submit the task ID for troubleshooting.
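One simple way to record task IDs is to write them to the application log together with the outcome of each call. The sketch below assumes that your code has already obtained the task ID from the SDK. The logger name, the function name `record_task`, and the log line format are illustrative choices.

```python
import logging

logger = logging.getLogger("speech_client")


def record_task(task_id: str, succeeded: bool, detail: str = "") -> str:
    """Log the task ID with its outcome and return the logged line."""
    status = "ok" if succeeded else "error"
    line = "task_id=%s status=%s" % (task_id, status)
    if detail:
        line += " detail=%s" % detail
    # Failed calls are logged at ERROR level so that they are easy to find later.
    logger.log(logging.INFO if succeeded else logging.ERROR, line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    record_task("example-task-id", False, "timeout")
```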