If you do not want to use the SDKs for Intelligent Speech Interaction, or the SDKs for Java, C, or C++ cannot meet your business requirements, you can develop custom programs to access Intelligent Speech Interaction.
Overview
Intelligent Speech Interaction uses the WebSocket protocol to convert voice messages into text in real time, and supports long voice messages. Commands and events are transmitted as WebSocket text frames, whereas audio streams must be uploaded to the server as WebSocket binary frames. The calling sequence must follow the WebSocket protocol. For more information, see Data Frames.
Supported input format: uncompressed PCM or WAV files with 16-bit sampling and mono channel.
Supported audio sampling rates: 8,000 Hz and 16,000 Hz.
You can specify whether to return intermediate results, whether to add punctuation marks during post-processing, and whether to convert Chinese numerals to Arabic numerals.
You can select linguistic models to recognize voice messages in different languages when you manage projects in the Intelligent Speech Interaction console. For more information, see Manage projects.
Authentication
The server uses temporary access tokens for authentication. When you make a request, you must include the access token in the URL. For more information about how to obtain an access token, see Obtain an access token. After you obtain an access token, you can access Intelligent Speech Interaction in one of the following ways.
Access type | Description | URL |
Access from external networks | You can use the URL to access Intelligent Speech Interaction from all servers. | wss://nls-gateway-ap-southeast-1.aliyuncs.com/ws/v1?token=<your token> |
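The access URL carries the token as a query parameter. The following sketch shows one way to assemble it; `build_gateway_url` is a hypothetical helper, and the token value in the usage line is a placeholder.

```python
from urllib.parse import urlencode

# Endpoint host taken from the table above.
GATEWAY = "wss://nls-gateway-ap-southeast-1.aliyuncs.com/ws/v1"

def build_gateway_url(token: str) -> str:
    """Append the temporary access token as the token query parameter."""
    return f"{GATEWAY}?{urlencode({'token': token})}"

url = build_gateway_url("your-token")
```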
Interaction process
The commands and audio streams must be sent in the order shown in the following figure. Otherwise, the interaction with the server fails.
Commands
The request commands start or stop a speech recognition task. You must send each command in the JSON format as a text frame. A command consists of the Header and Payload sections. The Header section, which carries basic information about the request, uses a unified format, whereas the Payload section uses different formats for different commands.
1. The Header section
The Header section consists of the following parameters.
Parameter | Type | Required | Description |
appkey | String | Yes | The AppKey of your project that is created in the Intelligent Speech Interaction console. |
message_id | String | Yes | The 32-character ID of the request. The ID is randomly generated and unique. |
task_id | String | Yes | The 32-character ID of the speech recognition session. The ID is unique and must remain unchanged throughout the session. |
namespace | String | Yes | The name of the service to be accessed. Set the value to SpeechTranscriber. |
name | String | Yes | The name of the command. Valid values: StartTranscription and StopTranscription. For more information, see The StartTranscription command and The StopTranscription command. |
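The Header fields above can be assembled as follows. `make_header` is a hypothetical helper; `uuid4().hex` is one convenient way to produce the required 32-character random IDs.

```python
import uuid

def make_header(appkey: str, name: str, task_id: str) -> dict:
    """Build the common Header section for a SpeechTranscriber command.

    message_id is freshly generated per request; task_id must stay the
    same for every command in one recognition session.
    """
    return {
        "appkey": appkey,
        "message_id": uuid.uuid4().hex,  # 32 hex characters
        "task_id": task_id,
        "namespace": "SpeechTranscriber",
        "name": name,
    }
```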
2. The StartTranscription command
The following table describes the parameters in the Payload section.
Parameter | Type | Required | Description |
format | String | No | The audio coding format. Supported format: uncompressed PCM or WAV files with 16-bit sampling and mono channel. |
sample_rate | Integer | No | The audio sampling rate. The default rate is 16,000 Hz. After you set this parameter, you must specify a model that is applicable to the scenario and audio sampling rate for your project in the Intelligent Speech Interaction console. |
enable_intermediate_result | Boolean | No | Specifies whether to return intermediate results. Default value: false. |
enable_punctuation_prediction | Boolean | No | Specifies whether to add punctuation marks during post-processing. Default value: false. |
enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN) during post-processing. Default value: false. If you set this parameter to true, Chinese numerals are converted to Arabic numerals. Important ITN does not apply to words. |
customization_id | String | No | The ID of the custom linguistic model. |
vocabulary_id | String | No | The vocabulary ID of custom popular words. |
max_sentence_silence | Integer | No | The threshold for determining the end of a sentence. If the silence duration exceeds the specified threshold, the system determines the end of a sentence. Unit: milliseconds. Valid values: 200 to 2000. Default value: 800. |
enable_words | Boolean | No | Specifies whether to return information about words. Default value: false. |
enable_ignore_sentence_timeout | Boolean | No | Specifies whether to ignore the recognition timeout of a single sentence in real-time speech recognition. Default value: false. |
disfluency | Boolean | No | Specifies whether to enable disfluency detection to remove modal particles and repetitive speech. Default value: false. |
speech_noise_threshold | Float | No | The threshold for recognizing audio streams as noise. Valid values: -1 to 1. A value closer to -1 indicates that an audio stream is more likely to be recognized as normal speech. A value closer to 1 indicates that an audio stream is more likely to be recognized as noise.
Note This parameter is an advanced parameter. Proceed with caution. We recommend that you run tests to find a proper value. |
enable_semantic_sentence_detection | Boolean | No | Specifies whether to enable semantic sentence segmentation. Default value: false. |
Sample code:
{
"header": {
"message_id": "05450bf69c53413f8d88aed1ee60****",
"task_id": "640bc797bb684bd6960185651307****",
"namespace": "SpeechTranscriber",
"name": "StartTranscription",
"appkey": "17d4c634****"
},
"payload": {
        "format": "pcm",
"sample_rate": 16000,
"enable_intermediate_result": true,
"enable_punctuation_prediction": true,
"enable_inverse_text_normalization": true
}
}
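A StartTranscription message like the sample above can be generated programmatically. The following is a minimal sketch: `start_transcription` is a hypothetical helper whose defaults mirror the Payload table, with keyword arguments overriding them.

```python
import json
import uuid

def start_transcription(appkey: str, task_id: str, **overrides) -> str:
    """Serialize a StartTranscription command as a JSON text frame."""
    payload = {
        # Defaults follow the Payload table above.
        "format": "pcm",
        "sample_rate": 16000,
        "enable_intermediate_result": False,
        "enable_punctuation_prediction": False,
        "enable_inverse_text_normalization": False,
    }
    payload.update(overrides)
    return json.dumps({
        "header": {
            "appkey": appkey,
            "message_id": uuid.uuid4().hex,
            "task_id": task_id,
            "namespace": "SpeechTranscriber",
            "name": "StartTranscription",
        },
        "payload": payload,
    })
```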
3. The StopTranscription command
The StopTranscription command stops a speech recognition task. The command carries no parameters, so the Payload section is empty. Sample code:
{
"header": {
"message_id": "05450bf69c53413f8d88aed1ee60****",
"task_id": "640bc797bb684bd6960185651307****",
"namespace": "SpeechTranscriber",
"name": "StopTranscription",
"appkey": "17d4c634****"
}
}
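Between StartTranscription and StopTranscription, the client streams audio to the server as binary frames. The sketch below splits raw PCM into fixed-size frames; the 3,200-byte chunk size (100 ms of 16-bit, 16,000 Hz mono PCM) and the commented `websocket-client` send call are assumptions for illustration, not requirements from this document.

```python
CHUNK_BYTES = 3200  # assumed chunk size: 100 ms of 16-bit 16 kHz mono PCM

def iter_chunks(pcm: bytes, size: int = CHUNK_BYTES):
    """Split a raw PCM buffer into fixed-size chunks for binary frames."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]

# Hypothetical send loop, assuming an open websocket-client connection:
# for frame in iter_chunks(audio_bytes):
#     ws.send(frame, opcode=ABNF.OPCODE_BINARY)
```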
Events
1. The TranscriptionStarted event
The TranscriptionStarted event indicates that the server is ready to recognize speech and that the client can start to send audio streams.
Parameter | Type | Description |
session_id | String | If session_id is set in the request from the client, the same value is returned. Otherwise, a randomly generated, unique 32-character ID is returned. |
Sample code:
{
"header": {
"message_id": "05450bf69c53413f8d88aed1ee60****",
"task_id": "640bc797bb684bd6960185651307****",
"namespace": "SpeechTranscriber",
"name": "TranscriptionStarted",
"status": 20000000,
"status_message": "GATEWAY|SUCCESS|Success."
},
"payload": {
"session_id": "1231231dfdf****"
}
}
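Every server event arrives as a text frame in which header.name identifies the event type and payload carries its fields. A minimal parsing sketch (`parse_event` is a hypothetical helper):

```python
import json

def parse_event(raw: str):
    """Return the event name from header.name and the payload dict."""
    event = json.loads(raw)
    return event["header"]["name"], event.get("payload", {})
```

A client would typically dispatch on the returned name: start streaming audio on TranscriptionStarted, collect text on SentenceEnd, and close on TranscriptionCompleted.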
2. The SentenceBegin event
The SentenceBegin event indicates that the server detects the start of a sentence.
Parameter | Type | Description |
index | Integer | The sequence number of the sentence, which starts from 1. |
time | Integer | The start time of the sentence relative to the start of the audio stream. Unit: milliseconds. |
Sample code:
{
"header": {
"message_id": "05450bf69c53413f8d88aed1ee60****",
"task_id": "640bc797bb684bd6960185651307****",
"namespace": "SpeechTranscriber",
"name": "SentenceBegin",
"status": 20000000,
"status_message": "GATEWAY|SUCCESS|Success."
},
"payload": {
"index": 1,
"time": 320
}
}
3. The TranscriptionResultChanged event
The TranscriptionResultChanged event indicates that the recognition result has changed.
Parameter | Type | Description |
index | Integer | The sequence number of the sentence, which starts from 1. |
time | Integer | The duration of the processed audio stream. Unit: milliseconds. |
result | String | The recognition result. |
words | List<Word> | The information about words. |
status | Integer | The status code. |
Word structure:
Parameter | Type | Description |
text | String | The text content. |
startTime | Integer | The start time of the word. Unit: milliseconds. |
endTime | Integer | The end time of the word. Unit: milliseconds. |
Sample code:
{
"header":{
"message_id":"05450bf69c53413f8d88aed1ee60****",
"task_id":"640bc797bb684bd6960185651307****",
"namespace":"SpeechTranscriber",
"name":"TranscriptionResultChanged",
"status":20000000,
"status_message":"GATEWAY|SUCCESS|Success."
},
"payload":{
"index":1,
"time":1800,
"result":"Double Eleven this year",
"words":[
{
"text":"this year",
"startTime":1,
"endTime":2
},
{
"text":"Double Eleven",
"startTime":2,
"endTime":3
}
]
}
}
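When enable_words is set to true, the payload carries the Word list shown above. The following sketch extracts the words ordered by startTime; `sentence_words` is a hypothetical helper.

```python
import json

def sentence_words(raw: str):
    """Return (text, startTime, endTime) tuples sorted by startTime."""
    payload = json.loads(raw)["payload"]
    return sorted(
        ((w["text"], w["startTime"], w["endTime"]) for w in payload.get("words", [])),
        key=lambda item: item[1],
    )
```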
4. The SentenceEnd event
The SentenceEnd event indicates that the server detects the end of a sentence.
Parameter | Type | Description |
index | Integer | The sequence number of the sentence, which starts from 1. |
time | Integer | The duration of the processed audio stream. Unit: milliseconds. |
begin_time | Integer | The time of the SentenceBegin event that corresponds to the sentence. Unit: milliseconds. |
result | String | The recognition result. |
confidence | Double | The accuracy level of the result. Valid values: 0.0 to 1.0. A larger value indicates a higher accuracy level. |
words | List<Word> | The information about words. |
status | Integer | The status code. Default value: 20000000. |
stash_result | StashResult | The temporarily stored result. After semantic sentence segmentation is enabled, the intermediate result of the next unsegmented sentence is returned. |
StashResult structure:
Parameter | Type | Description |
sentenceId | Integer | The sequence number of the sentence, which starts from 1. |
beginTime | Integer | The start time of the sentence. Unit: milliseconds. |
text | String | The transcription content. |
currentTime | Integer | The time of the audio stream that is being processed. Unit: milliseconds. |
Sample code:
{
"header": {
"message_id": "05450bf69c53413f8d88aed1ee60****",
"task_id": "640bc797bb684bd6960185651307****",
"namespace": "SpeechTranscriber",
"name": "SentenceEnd",
"status": 20000000,
"status_message": "GATEWAY|SUCCESS|Success."
},
"payload": {
"index": 1,
"time": 3260,
"begin_time": 1800,
"result": "I want to buy a television this Double Eleven"
}
}
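SentenceEnd events carry the final text per sentence, so a client can accumulate the full transcript from them. The sketch below appends each result and derives the sentence duration from time minus begin_time; `collect_sentence` is a hypothetical helper.

```python
import json

def collect_sentence(raw: str, transcript: list) -> int:
    """Append the final result of a SentenceEnd event to transcript.

    Returns the sentence duration in milliseconds, computed as the
    difference between time and begin_time from the payload.
    """
    payload = json.loads(raw)["payload"]
    transcript.append(payload["result"])
    return payload["time"] - payload["begin_time"]
```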
5. The TranscriptionCompleted event
The TranscriptionCompleted event indicates that the speech recognition task is stopped. Sample code:
{
"header": {
"message_id": "05450bf69c53413f8d88aed1ee60****",
"task_id": "640bc797bb684bd6960185651307****",
"namespace": "SpeechTranscriber",
"name": "TranscriptionCompleted",
"status": 20000000,
"status_message": "GATEWAY|SUCCESS|Success."
}
}