Intelligent Speech Interaction: API reference

Last Updated: Mar 18, 2024

The real-time speech recognition service recognizes long-duration speech data streams. It applies to scenarios that require uninterrupted speech recognition, such as conference speeches and live streaming.

Features

  • Supports pulse-code modulation (PCM) encoded 16-bit mono audio files.

  • Supports the audio sampling rates of 8,000 Hz and 16,000 Hz.

  • Allows you to specify whether to return intermediate results, whether to add punctuation marks during post-processing, and whether to convert Chinese numerals to Arabic numerals.

  • Allows you to select linguistic models to recognize speeches in different languages when you manage projects in the Intelligent Speech Interaction console. For more information, see Manage projects.

    Currently supported languages and dialect models include:

    • 16K ASR (General Purpose): Chinese Mandarin, English, Cantonese, Japanese, Korean, Russian, Indonesian, Vietnamese, Malay, Thai, Hindi, Arabic, Kazakh, German, Spanish, and French.

    • 8K ASR (Telephony Channel): Chinese Mandarin, English, Cantonese, Indonesian, and Filipino.

Endpoints

| Access type | Description | URL |
| --- | --- | --- |
| External access from the Internet | This endpoint allows you to access the real-time speech recognition service from any host over the Internet. By default, the Internet access URL is built into the SDK. | wss://nls-gateway.cn-shanghai.aliyuncs.com/ws/v1 |
| Internal access from an Elastic Compute Service (ECS) instance in the China (Shanghai) region | This endpoint allows you to access the real-time speech recognition service over the internal network from an ECS instance in the China (Shanghai) region. An ECS instance on the classic network cannot access AnyTunnel virtual IP addresses (VIPs), and therefore cannot access the service over the internal network. To access the service through an AnyTunnel VIP, create a virtual private cloud (VPC) and access the service from within the VPC. | ws://nls-gateway.cn-shanghai-internal.aliyuncs.com:80/ws/v1 |

Note

  • Access from an ECS instance over the internal network does not consume Internet access traffic.

  • For more information about the network types of ECS instances, see Network types.

Interaction process

(Figure: the interaction process between the client and the server over the WebSocket connection)

Note

The server adds the task_id parameter to the header of every response to indicate the ID of the recognition task. Record the value of this parameter. If an error occurs, submit a ticket that includes the task ID and the error message.

1. Authenticate the client

To establish a WebSocket connection with the server, the client must use a token for authentication. For more information about how to obtain the token, see Obtain an access token.
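
If you do not use the SDK, you can establish the WebSocket connection yourself. The following minimal Python sketch uses the third-party websocket-client package and assumes that the token is passed in an X-NLS-Token header during the WebSocket handshake; YOUR_TOKEN is a placeholder.

    # Minimal connection sketch (not the official SDK). Assumes the token is
    # passed in the X-NLS-Token header of the WebSocket handshake.
    import websocket  # pip install websocket-client

    URL = "wss://nls-gateway.cn-shanghai.aliyuncs.com/ws/v1"
    TOKEN = "YOUR_TOKEN"  # obtained as described in "Obtain an access token"

    ws = websocket.create_connection(URL, header={"X-NLS-Token": TOKEN})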

2. Start and confirm recognition

The client sends a request to start real-time speech recognition, and the server confirms that the request is valid. Set the request parameters by using the set method of the SpeechTranscriber object in the SDK. The following table describes the request parameters; a sample start request appears after the table.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| appkey | String | Yes | The appkey of your project created in the Intelligent Speech Interaction console. |
| format | String | No | The audio coding format. Default value: PCM. The real-time speech recognition service supports uncompressed PCM or WAV encoded 16-bit mono audio. |
| sample_rate | Integer | No | The audio sampling rate. Unit: Hz. Default value: 16000. If you set this parameter, you must select a model or scene that supports this sampling rate for your project in the Intelligent Speech Interaction console. |
| enable_intermediate_result | Boolean | No | Specifies whether to return intermediate results. Default value: false. |
| enable_punctuation_prediction | Boolean | No | Specifies whether to add punctuation marks during post-processing. Default value: false. |
| enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN) during post-processing. Default value: false. If you set this parameter to true, Chinese numerals are converted to Arabic numerals. Note: ITN is not applied to word-level information. |
| customization_id | String | No | The ID of the custom linguistic model. |
| vocabulary_id | String | No | The vocabulary ID of custom hotwords. |
| max_sentence_silence | Integer | No | The silence threshold for detecting the end of a sentence. If the silence duration exceeds this threshold, the server determines that the sentence has ended. Unit: milliseconds. Valid values: 200 to 2000. Default value: 800. |
| enable_words | Boolean | No | Specifies whether to return word-level information. Default value: false. |
| enable_ignore_sentence_timeout | Boolean | No | Specifies whether to ignore single-sentence recognition timeouts during real-time recognition. Default value: false. |
| disfluency | Boolean | No | Specifies whether to enable disfluency detection, which removes filler words and repeated speech. Default value: false. |
| vad_model | String | No | The ID of the voice activity detection (VAD) model used by the server. By default, this parameter is not passed in requests. |
| speech_noise_threshold | Float | No | The threshold for classifying audio streams as noise. Valid values: -1 to 1. The closer the value is to -1, the more likely an audio stream is classified as normal speech; the closer the value is to 1, the more likely an audio stream is classified as noise. Note: This is an advanced parameter. Adjust it with caution and test after each adjustment. |
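
For reference, the following sketch continues the connection example above and shows what a start request might look like over the raw WebSocket protocol. The envelope mirrors the server responses shown later in this topic, but the instruction name StartTranscription and the exact header fields are assumptions if you build the message yourself; the SDK normally assembles and sends this message for you.

    # Sketch of a start request, assuming the same header/payload envelope as
    # the server responses shown below. YOUR_APPKEY is a placeholder.
    import json
    import uuid

    start_request = {
        "header": {
            "namespace": "SpeechTranscriber",
            "name": "StartTranscription",  # assumed instruction name
            "appkey": "YOUR_APPKEY",
            "message_id": uuid.uuid4().hex,
            "task_id": uuid.uuid4().hex,
        },
        "payload": {
            "format": "pcm",
            "sample_rate": 16000,
            "enable_intermediate_result": True,
            "enable_punctuation_prediction": True,
            "enable_inverse_text_normalization": True,
        },
    }
    ws.send(json.dumps(start_request))  # ws is the connection opened earlier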

3. Send and recognize audio data

The client cyclically sends audio data to the server and continuously receives recognition results from the server. A minimal send loop is sketched below. The server returns the following messages during recognition; a sketch of a receive handler appears after their descriptions.
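
The sketch below continues the example above: it streams a local PCM file in binary WebSocket frames at roughly real-time speed. The file name and frame size are illustrative.

    # Sketch of the audio send loop: stream 16,000 Hz 16-bit mono PCM in small
    # binary frames, pacing the stream near real time.
    import time

    CHUNK = 3200  # 3,200 bytes = 100 ms of 16,000 Hz 16-bit mono PCM

    with open("sample.pcm", "rb") as f:  # hypothetical input file
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            ws.send_binary(chunk)  # audio is sent as binary frames
            time.sleep(0.1)        # roughly real-time pacing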

  • The SentenceBegin message indicates that the server detects the beginning of a sentence. The real-time speech recognition service uses VAD to determine the beginning and end of a sentence. For example, the server returns the following response:

    {
            "header": {
                    "namespace": "SpeechTranscriber",
                    "name": "SentenceBegin",
                    "status": 20000000,
                    "message_id": "a426f3d4618447519c9d85d1a0d1****",
                    "task_id": "5ec521b5aa104e3abccf3d361822****",
                    "status_text": "Gateway:SUCCESS:Success."
            },
            "payload": {
                    "index": 1,
                    "time": 0
            }
    }

    The following table describes the parameters in the header object.

    | Parameter | Type | Description |
    | --- | --- | --- |
    | namespace | String | The namespace of the message. |
    | name | String | The name of the message. SentenceBegin indicates that the server detects the beginning of a sentence. |
    | status | Integer | The status code, which indicates whether the request is successful. For more information, see the "Service status codes" section of this topic. |
    | status_text | String | The status message. |
    | task_id | String | The globally unique identifier (GUID) of the recognition task. Record this value to facilitate troubleshooting. |
    | message_id | String | The ID of the message. |

    The following table describes the result parameters in the payload object.

    | Parameter | Type | Description |
    | --- | --- | --- |
    | index | Integer | The sequence number of the sentence, starting from 1. |
    | time | Integer | The duration of the processed audio data. Unit: milliseconds. |

  • The TranscriptionResultChanged message indicates that an intermediate result is obtained. If the enable_intermediate_result parameter is set to true, the server sends multiple TranscriptionResultChanged messages to return intermediate results. For example, the server returns the following intermediate results:

    {
            "header": {
                    "namespace": "SpeechTranscriber",
                    "name": "TranscriptionResultChanged",
                    "status": 20000000,
                    "message_id": "dc21193fada84380a3b6137875ab****",
                    "task_id": "5ec521b5aa104e3abccf3d361822****",
                    "status_text": "Gateway:SUCCESS:Success."
            },
            "payload": {
                    "index": 1,
                    "time": 1835,
                    "result": "Sky in Beijing",
                    "confidence": 1.0,
                    "words": [{
                            "text": "Sky",
                            "startTime": 630,
                            "endTime": 930
                    }, {
                            "text":"in",
                            "startTime": 930,
                            "endTime": 1110
                    }, {
                            "text": "Beijing",
                            "startTime": 1110,
                            "endTime": 1140
                    }]
            }
    }       

    The parameters in the header object of the TranscriptionResultChanged message are similar to the parameters in the header object of the SentenceBegin message. The value of the name parameter is TranscriptionResultChanged, which indicates that an intermediate result is obtained.

    The following table describes the result parameters in the payload object.

    | Parameter | Type | Description |
    | --- | --- | --- |
    | index | Integer | The sequence number of the sentence, starting from 1. |
    | time | Integer | The duration of the processed audio data. Unit: milliseconds. |
    | result | String | The recognition result of the sentence. |
    | words | List<Word> | The word information of the sentence. Returned only when the enable_words parameter is set to true. |
    | confidence | Double | The confidence level of the recognition result. Valid values: 0.0 to 1.0. A larger value indicates a higher confidence level. |

  • The SentenceEnd message indicates that the server detects the end of a sentence and returns the recognition result of the sentence. For example, the server returns the following response:

    {
            "header": {
                    "namespace": "SpeechTranscriber",
                    "name": "SentenceEnd",
                    "status": 20000000,
                    "message_id": "c3a9ae4b231649d5ae05d4af36fd****",
                    "task_id": "5ec521b5aa104e3abccf3d361822****",
                    "status_text": "Gateway:SUCCESS:Success."
            },
            "payload": {
                    "index": 1,
                    "time": 1820,
                    "begin_time": 0,
                    "result": "Weather in Beijing",
                    "confidence": 1.0,
                    "words": [{
                            "text": "Weather",
                            "startTime": 630,
                            "endTime": 930
                    }, {
                            "text":"in",
                            "startTime": 930,
                            "endTime": 1110
                    }, {
                            "text": "Beijing",
                            "startTime": 1110,
                            "endTime": 1380
                    }]
            }
    }
    

    The parameters in the header object of the SentenceEnd message are similar to the parameters in the header object of the SentenceBegin message. The value of the name parameter is SentenceEnd, which indicates that the server detects the end of a sentence.

    The following table describes the result parameters in the payload object.

    | Parameter | Type | Description |
    | --- | --- | --- |
    | index | Integer | The sequence number of the sentence, starting from 1. |
    | time | Integer | The duration of the processed audio data. Unit: milliseconds. |
    | begin_time | Integer | The time at which the server returned the SentenceBegin message for this sentence. Unit: milliseconds. |
    | result | String | The recognition result of the sentence. |
    | words | List<Word> | The word information of the sentence. Returned only when the enable_words parameter is set to true. |
    | confidence | Double | The confidence level of the recognition result. Valid values: 0.0 to 1.0. A larger value indicates a higher confidence level. |

    The following table describes the parameters of each Word object in the words list.

    | Parameter | Type | Description |
    | --- | --- | --- |
    | text | String | The text of the word. |
    | startTime | Integer | The start time of the word in the sentence. Unit: milliseconds. |
    | endTime | Integer | The end time of the word in the sentence. Unit: milliseconds. |
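
To consume these messages without the SDK, a client can parse each text frame and dispatch on the name field in the header, as in the following sketch. It assumes the message structures shown above; the SDK surfaces the same results through its own callbacks.

    # Sketch of a receive handler that dispatches on the message name.
    import json

    def handle_message(raw):
        msg = json.loads(raw)
        header = msg["header"]
        payload = msg.get("payload", {})
        if header["status"] != 20000000:
            raise RuntimeError(f"task {header['task_id']}: {header['status_text']}")
        name = header["name"]
        if name == "SentenceBegin":
            print(f"sentence {payload['index']} began at {payload['time']} ms")
        elif name == "TranscriptionResultChanged":
            print("intermediate result:", payload["result"])
        elif name == "SentenceEnd":
            print("final result:", payload["result"],
                  "confidence:", payload["confidence"])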

4. Complete the recognition task

The client notifies the server that all audio data is sent. The server completes the recognition task and notifies the client that the task is completed.
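
If you drive the protocol yourself, the stop notification can reuse the same envelope, as sketched below; the instruction name StopTranscription is an assumption based on the message naming above, and the SDK normally sends this message for you.

    # Sketch of ending the task: tell the server that all audio has been sent,
    # then wait for the server's completion message before closing.
    import json
    import uuid

    stop_request = {
        "header": {
            "namespace": "SpeechTranscriber",
            "name": "StopTranscription",  # assumed instruction name
            "appkey": "YOUR_APPKEY",
            "message_id": uuid.uuid4().hex,
            "task_id": start_request["header"]["task_id"],  # same task as the start request
        },
        "payload": {},
    }
    ws.send(json.dumps(stop_request))
    print(ws.recv())  # expect a completion message from the server
    ws.close()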

Service status codes

Each response contains a status field, which indicates the service status code. The following tables describe common error codes, gateway error codes, configuration error codes, and real-time speech recognition error codes.

Common error codes

| Error code | Description | Solution |
| --- | --- | --- |
| 40000001 | The client fails authentication. | Check whether the token used by the client is correct and valid. |
| 40000002 | The request is invalid. | Check whether the request sent by the client meets requirements. |
| 403 | The token has expired or the request contains invalid parameters. | Check whether the token used by the client is valid. Then, check whether the parameter values are valid. |
| 40000004 | The idle connection of the client times out. | Check whether the client stopped sending data to the server for an extended period, for example, more than 10 seconds. |
| 40000005 | The number of requests exceeds the upper limit. | Check whether the number of concurrent connections or the queries per second (QPS) exceeds the upper limit. If so, we recommend that you upgrade Intelligent Speech Interaction from the trial edition to Commercial Edition, or purchase more resources for higher concurrency if you already use Commercial Edition. |
| 40000000 | A client error has occurred. This is the default client error code. | Resolve the error based on the error message, or submit a ticket. |
| 41010120 | A client timeout error has occurred. | Check whether the client stopped sending data to the server for at least 10 consecutive seconds. |
| 50000000 | A server error has occurred. This is the default server error code. | If the error occurs occasionally, ignore it. If it occurs repeatedly, submit a ticket. |
| 50000001 | An internal call error has occurred. | If the error occurs occasionally, ignore it. If it occurs repeatedly, submit a ticket. |
| 52010001 | An internal call error has occurred. | If the error occurs occasionally, ignore it. If it occurs repeatedly, submit a ticket. |

Gateway error codes

| Error code | Description | Solution |
| --- | --- | --- |
| 40010001 | The method is not supported. | If you use the SDK, submit a ticket. |
| 40010002 | The instruction is not supported. | If you use the SDK, submit a ticket. |
| 40010003 | The instruction format is invalid. | If you use the SDK, submit a ticket. |
| 40010004 | The client is unexpectedly disconnected. | Check whether the client was disconnected before the server completed the requested task. |
| 40010005 | The task is in a state that does not support the instruction. | Check whether the instruction is supported by the task in its current state. |

Configuration error codes

| Error code | Description | Solution |
| --- | --- | --- |
| 40020105 | The application does not exist. | Check whether the specified appkey is correct and whether the application exists. |
| 40020106 | The specified appkey and token do not match. | Check whether the appkey of the application is correct and belongs to the same Alibaba Cloud account as the token. |
| 40020503 | RAM user authentication fails. | Use your Alibaba Cloud account to authorize the RAM user to access the open platform (POP) API. |

Real-time speech recognition error codes

| Error code | Description | Solution |
| --- | --- | --- |
| 41040201 | The client does not send data to the server for 10 consecutive seconds. | Check the network connection, and check whether the client has run out of audio data to send. |
| 41040202 | The client sends data at too high a rate and consumes all server resources. | Check whether the client sends data at an appropriate rate, for example, at a real-time factor of 1:1. |
| 41040203 | The client sends audio data in an invalid audio coding format. | Convert the audio data to a coding format supported by the SDK. |
| 41040204 | The client does not call methods in the correct order. | Check whether the client calls the method that sends the request before it calls other methods. |
| 41040205 | The specified max_sentence_silence parameter is invalid. | Check whether the value of the max_sentence_silence parameter is in the range of 200 to 2000. |
| 41050008 | The specified audio sampling rate does not match that of the selected model. | Check whether the sampling rate specified in the service call matches that of the automatic speech recognition (ASR) model bound to the appkey of your project in the console. |
| 51040101 | An internal error has occurred on the server. | Resolve the error based on the error message. |
| 51040103 | The real-time speech recognition service is unavailable. | Check whether the number of real-time speech recognition tasks exceeds the upper limit. |
| 51040104 | The request for real-time speech recognition times out. | Check the logs of the real-time speech recognition service. |
| 51040105 | The real-time speech recognition service fails to be called. | Check whether the real-time speech recognition service is enabled and whether the port works properly. |
| 51040106 | Load balancing for the real-time speech recognition service fails, and the client cannot obtain the IP address of the service. | Check whether the real-time speech recognition server in the VPC works properly. |