
Intelligent Speech Interaction:WebSocket

Last Updated:Oct 30, 2023

If you do not want to use the SDKs for Intelligent Speech Interaction, or if the Java, C, or C++ SDKs cannot meet your business requirements, you can develop custom programs to access Intelligent Speech Interaction.

Overview

Intelligent Speech Interaction uses the WebSocket protocol to convert voice messages into text in real time. Long voice messages are supported. Commands and events are transmitted as WebSocket text frames, whereas audio streams must be uploaded to the server as WebSocket binary frames. The calling sequence must meet the requirements of the WebSocket protocol. For more information, see Data Frames.

  • Supported input format: uncompressed PCM or WAV files with 16-bit sampling and mono channel.

  • Supported audio sampling rates: 8,000 Hz and 16,000 Hz.

  • Allows you to specify whether to return intermediate results, add punctuation marks during post-processing, and convert Chinese numerals to Arabic numerals.

  • Allows you to select linguistic models to recognize voice messages in different languages when you manage projects in the Intelligent Speech Interaction console. For more information, see Manage projects.

Authentication

The server uses temporary access tokens for authentication. When you make a request, you must include the access token in the URL. For more information about how to obtain an access token, see Obtain an access token. After you obtain an access token, you can access Intelligent Speech Interaction in one of the following ways.

| Access type | Description | URL |
| --- | --- | --- |
| Access from external networks | You can use the URL to access Intelligent Speech Interaction from all servers. | wss://nls-gateway-ap-southeast-1.aliyuncs.com/ws/v1?token=&lt;your token&gt; |
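With a token in hand, the connection URL can be assembled on the client. The following is a minimal Python sketch; `build_ws_url` is a hypothetical helper name, and the token value is a placeholder:

```python
from urllib.parse import quote

# Endpoint from the table above; the token value is a placeholder.
GATEWAY = "wss://nls-gateway-ap-southeast-1.aliyuncs.com/ws/v1"

def build_ws_url(token: str) -> str:
    """Append the temporary access token as a query parameter,
    percent-encoding it in case it contains reserved characters."""
    return f"{GATEWAY}?token={quote(token, safe='')}"

url = build_ws_url("<your token>")
```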

Interaction process

The commands and audio streams must be sent in order as shown in the following figure. Otherwise, interaction with the server fails.

[Figure: the required order of commands and audio streams between the client and the server]

Commands

A command starts or stops a speech recognition task. You must send commands in the JSON format as WebSocket text frames. A command consists of the Header and Payload sections: the Header section, which carries the basic information about the request, uses a unified format for all commands, whereas the format of the Payload section varies by command.

1. The Header section

The Header section consists of the following parameters.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| appkey | String | Yes | The AppKey of the project that you create in the Intelligent Speech Interaction console. |
| message_id | String | Yes | The ID of the request: a randomly generated, unique 32-character string. |
| task_id | String | Yes | The ID of the speech recognition session: a unique 32-character string that must remain unchanged throughout the session. |
| namespace | String | Yes | The name of the service to be accessed. Set the value to SpeechTranscriber. |
| name | String | Yes | The name of the command. Valid values: StartTranscription and StopTranscription. For more information, see The StartTranscription command and The StopTranscription command. |
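The 32-character random IDs can be generated with `uuid4().hex`. The sketch below builds a Header section; `make_header` is a hypothetical helper name:

```python
import uuid

def make_header(appkey: str, name: str, task_id: str) -> dict:
    """Build the unified Header section. message_id is a fresh
    32-character random string per request; task_id stays fixed
    for the whole session."""
    return {
        "appkey": appkey,
        "message_id": uuid.uuid4().hex,  # 32 hex characters
        "task_id": task_id,
        "namespace": "SpeechTranscriber",
        "name": name,
    }

# One task_id is generated once and reused for the whole session:
task_id = uuid.uuid4().hex
header = make_header("your-appkey", "StartTranscription", task_id)
```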

2. The StartTranscription command

The following table describes the parameters in the Payload section.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| format | String | No | The audio coding format. Supported formats: uncompressed PCM or WAV files with 16-bit sampling and a mono channel. |
| sample_rate | Integer | No | The audio sampling rate. Default value: 16000. Unit: Hz. If you set this parameter, you must specify a model that matches the scenario and the audio sampling rate for your project in the Intelligent Speech Interaction console. |
| enable_intermediate_result | Boolean | No | Specifies whether to return intermediate results. Default value: false. |
| enable_punctuation_prediction | Boolean | No | Specifies whether to add punctuation marks during post-processing. Default value: false. |
| enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN) during post-processing. Default value: false. If you set this parameter to true, Chinese numerals are converted to Arabic numerals. Important: ITN is not applied to word-level results. |
| customization_id | String | No | The ID of the custom linguistic model. |
| vocabulary_id | String | No | The ID of the custom vocabulary of popular words. |
| max_sentence_silence | Integer | No | The silence threshold for detecting the end of a sentence. If the silence duration exceeds this threshold, the server determines that the sentence has ended. Unit: milliseconds. Valid values: 200 to 2000. Default value: 800. |
| enable_words | Boolean | No | Specifies whether to return word-level information. Default value: false. |
| enable_ignore_sentence_timeout | Boolean | No | Specifies whether to ignore the recognition timeout of a single sentence in real-time speech recognition. Default value: false. |
| disfluency | Boolean | No | Specifies whether to enable disfluency detection to remove modal particles and repeated words. Default value: false. |
| speech_noise_threshold | Float | No | The threshold for classifying audio as noise. Valid values: -1 to 1. A value closer to -1 makes the audio more likely to be classified as normal speech; a value closer to 1 makes it more likely to be classified as noise. Note: This is an advanced parameter. Proceed with caution, and run tests to find a proper value. |
| enable_semantic_sentence_detection | Boolean | No | Specifies whether to enable semantic sentence segmentation. Default value: false. |

Sample code:

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechTranscriber",
        "name": "StartTranscription",
        "appkey": "17d4c634****"
    },
    "payload": {
        "format": "opus",
        "sample_rate": 16000,
        "enable_intermediate_result": true,
        "enable_punctuation_prediction": true,
        "enable_inverse_text_normalization": true
    }
}
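A StartTranscription text frame like the sample above can be produced as follows. This is a sketch: `start_transcription_frame` is a hypothetical helper name, and the payload values are illustrative choices, not required settings:

```python
import json
import uuid

def start_transcription_frame(appkey: str, task_id: str) -> str:
    """Serialize a StartTranscription command to the JSON string
    that is sent to the server as a WebSocket text frame."""
    command = {
        "header": {
            "appkey": appkey,
            "message_id": uuid.uuid4().hex,
            "task_id": task_id,
            "namespace": "SpeechTranscriber",
            "name": "StartTranscription",
        },
        "payload": {
            "format": "pcm",          # illustrative; match your audio source
            "sample_rate": 16000,
            "enable_intermediate_result": True,
            "enable_punctuation_prediction": True,
            "enable_inverse_text_normalization": True,
        },
    }
    return json.dumps(command)
```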

3. The StopTranscription command

The StopTranscription command stops a speech recognition task. This command requires no parameters, so the Payload section is left empty. Sample code:

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechTranscriber",
        "name": "StopTranscription",
        "appkey": "17d4c634****"
    }
}
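Between StartTranscription and StopTranscription, the client uploads audio as WebSocket binary frames. A common approach, sketched below under the assumption of 16-bit mono PCM, is to send roughly 100 ms of audio per frame; the chunk duration is an assumption for illustration, not a documented service requirement:

```python
def chunk_bytes(sample_rate_hz: int, chunk_ms: int = 100,
                bytes_per_sample: int = 2) -> int:
    """Number of bytes of mono 16-bit PCM covering chunk_ms of audio."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000

def iter_chunks(pcm: bytes, sample_rate_hz: int = 16000):
    """Split raw PCM into ~100 ms slices. Each slice is sent as one
    binary frame, in order, between StartTranscription and
    StopTranscription."""
    size = chunk_bytes(sample_rate_hz)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```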

Events

1. The TranscriptionStarted event

The TranscriptionStarted event indicates that the server is ready to recognize speech and that the client can start to send audio streams.

| Parameter | Type | Description |
| --- | --- | --- |
| session_id | String | If session_id is set in the client request, the same value is returned. Otherwise, a randomly generated, unique 32-character ID is returned. |

Sample code:

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechTranscriber",
        "name": "TranscriptionStarted",
        "status": 20000000,
        "status_message": "GATEWAY|SUCCESS|Success."
    },
    "payload": {
        "session_id": "1231231dfdf****"
    }
}
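Before streaming audio, the client should confirm that the received frame is a successful TranscriptionStarted event. A minimal check on the frame text, where `is_started` is a hypothetical helper name:

```python
import json

def is_started(text_frame: str) -> bool:
    """Return True if the text frame is a TranscriptionStarted
    event with the success status code 20000000."""
    msg = json.loads(text_frame)
    header = msg.get("header", {})
    return (header.get("name") == "TranscriptionStarted"
            and header.get("status") == 20000000)
```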

2. The SentenceBegin event

The SentenceBegin event indicates that the server detects the start of a sentence.

| Parameter | Type | Description |
| --- | --- | --- |
| index | Integer | The sequence number of the sentence, which starts from 1. |
| time | Integer | The offset of the start of the sentence relative to the start of the audio stream. Unit: milliseconds. |

Sample code:

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechTranscriber",
        "name": "SentenceBegin",
        "status": 20000000,
        "status_message": "GATEWAY|SUCCESS|Success."
    },
    "payload": {
        "index": 1,
        "time": 320
    }
}

3. The TranscriptionResultChanged event

The TranscriptionResultChanged event indicates that the recognition result has changed.

| Parameter | Type | Description |
| --- | --- | --- |
| index | Integer | The sequence number of the sentence, which starts from 1. |
| time | Integer | The duration of the processed audio stream. Unit: milliseconds. |
| result | String | The recognition result. |
| words | Word | The information about words. |
| status | Integer | The status code. |

Word structure:

| Parameter | Type | Description |
| --- | --- | --- |
| text | String | The text content. |
| startTime | Integer | The start time of the word. |
| endTime | Integer | The end time of the word. |

Sample code:

{
    "header":{
        "message_id":"05450bf69c53413f8d88aed1ee60****",
        "task_id":"640bc797bb684bd6960185651307****",
        "namespace":"SpeechTranscriber",
        "name":"TranscriptionResultChanged",
        "status":20000000,
        "status_message":"GATEWAY|SUCCESS|Success."
    },
    "payload":{
        "index":1,
        "time":1800,
        "result":"Double Eleven this year",
        "words":[
            {
                "text":"this year",
                "startTime":1,
                "endTime":2
            },
            {
              "text":"Double Eleven",
              "startTime":2,
              "endTime":3
            }
        ]
    }
}

4. The SentenceEnd event

The SentenceEnd event indicates that the server detects the end of a sentence.

| Parameter | Type | Description |
| --- | --- | --- |
| index | Integer | The sequence number of the sentence, which starts from 1. |
| time | Integer | The duration of the processed audio stream. Unit: milliseconds. |
| begin_time | Integer | The time of the SentenceBegin event that corresponds to this sentence. Unit: milliseconds. |
| result | String | The recognition result. |
| confidence | Double | The confidence level of the result. Valid values: 0.0 to 1.0. A larger value indicates a higher confidence level. |
| words | Word | The information about words. |
| status | Integer | The status code. Default value: 20000000. If enable_ignore_sentence_timeout is set to true and a sentence times out, the error code 51040104 is returned and the connection is maintained. |
| stash_result | StashResult | The temporarily stored result. If semantic sentence segmentation is enabled, the intermediate result of the next sentence, which has not yet been segmented, is returned. |

StashResult structure:

| Parameter | Type | Description |
| --- | --- | --- |
| sentenceId | Integer | The sequence number of the sentence, which starts from 1. |
| beginTime | Integer | The start time of the sentence. |
| text | String | The transcription content. |
| currentTime | Integer | The time of the audio stream that is being processed. |

Sample code:

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechTranscriber",
        "name": "SentenceEnd",
        "status": 20000000,
        "status_message": "GATEWAY|SUCCESS|Success."
    },
    "payload": {
        "index": 1,
        "time": 3260,
        "begin_time": 1800,
        "result": "I want to buy a television this Double Eleven"
    }
}
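The `result` field of each SentenceEnd event carries the final text of one sentence. A sketch of assembling a full transcript from collected SentenceEnd payloads, sorted by `index` and joined with spaces (appropriate for English output; `assemble_transcript` is a hypothetical helper name):

```python
def assemble_transcript(sentence_end_payloads: list) -> str:
    """Join the final per-sentence results in index order.
    Sentences are joined with spaces, which suits English text."""
    ordered = sorted(sentence_end_payloads, key=lambda p: p["index"])
    return " ".join(p["result"] for p in ordered)
```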

5. The TranscriptionCompleted event

The TranscriptionCompleted event indicates that the speech recognition task is stopped. Sample code:

{
    "header": {
        "message_id": "05450bf69c53413f8d88aed1ee60****",
        "task_id": "640bc797bb684bd6960185651307****",
        "namespace": "SpeechTranscriber",
        "name": "TranscriptionCompleted",
        "status": 20000000,
        "status_message": "GATEWAY|SUCCESS|Success."
    }
}
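All server events share the `header.name` field, so a client can route incoming text frames to per-event handlers. A minimal dispatcher sketch (`dispatch` is a hypothetical helper name; unknown event names are ignored):

```python
import json

def dispatch(text_frame: str, handlers: dict) -> None:
    """Route a server event to the handler registered under
    header.name. Each handler receives the payload section."""
    msg = json.loads(text_frame)
    name = msg.get("header", {}).get("name")
    handler = handlers.get(name)
    if handler:
        handler(msg.get("payload", {}))
```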