
API reference

Last Updated: Nov 29, 2019

Features

You can use the real-time speech recognition SDK to recognize speech data streams that last for a long time. The SDK applies to uninterrupted speech recognition scenarios such as conference speeches and live streaming. It has the following features:

  • Supports the pulse-code modulation (PCM) audio coding format (uncompressed PCM or WAV files) with 16-bit mono audio.
  • Supports the following audio sampling rates: 8,000 Hz and 16,000 Hz.
  • Allows you to specify whether to return intermediate results, whether to add punctuation marks during post-processing, and whether to convert Chinese numerals to Arabic numerals.
  • Recognizes multiple languages. You can specify the language to be recognized by selecting a model.

Endpoint

| Access type | Description | URL |
| --- | --- | --- |
| External access from the Internet | This endpoint allows you to access the real-time speech recognition service from any host over the Internet. The Internet access URL is built into the SDK by default, so you do not need to set it manually. | wss://nls-gateway-ap-southeast-1.aliyuncs.com/ws/v1 |

Interaction process

(Figure: interaction process between the client and the server)

Note: The server adds the task_id field to the response header for all responses to indicate the ID of the recognition task. You need to record the value of this field. If an error occurs, you can open a ticket to submit the task ID and error message.

0. Authenticate the client

To establish a WebSocket connection with the server, the client must use a token for authentication. For more information about how to obtain the token, see Obtain a token.
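Once you have a token, the client presents it during the WebSocket handshake. The following is a minimal sketch in Python; the `X-NLS-Token` header name is an assumption about how the gateway expects the token (the SDK normally sets this for you):

```python
# Sketch: prepare the extra headers for the WebSocket upgrade request.
# Assumption: the token is passed in an "X-NLS-Token" request header.

ENDPOINT = "wss://nls-gateway-ap-southeast-1.aliyuncs.com/ws/v1"

def build_handshake_headers(token: str) -> dict:
    """Return the extra headers to send with the WebSocket handshake."""
    return {"X-NLS-Token": token}

headers = build_handshake_headers("your-token-here")
print(headers)
```

Pass these headers to your WebSocket client library when connecting to the endpoint URL.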

1. Start and confirm recognition

The client sends a recognition request. The server confirms that the request is valid. You need to use the relevant set method of the SpeechTranscriber object to set common request parameters. The following table describes request parameters.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| appkey | String | Yes | The appkey of a project created in the Intelligent Speech Interaction console. |
| format | String | No | The audio coding format. Valid value: pcm (uncompressed PCM or WAV files, 16-bit mono). Default value: pcm. |
| sample_rate | Integer | No | The audio sampling rate, in Hz. Default value: 16000. Select a model that supports the audio sampling rate for your project in the console. |
| enable_intermediate_result | Boolean | No | Specifies whether to return intermediate results. Default value: false. |
| enable_punctuation_prediction | Boolean | No | Specifies whether to add punctuation marks during post-processing. Default value: false. |
| enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN) during post-processing. If this parameter is set to true, Chinese numerals are converted to Arabic numerals. Default value: false. |
| customization_id | String | No | The ID of the custom model. |
| vocabulary_id | String | No | The ID of the custom vocabulary (hotwords). |
| max_sentence_silence | Integer | No | The silence threshold for detecting the end of a sentence. Unit: milliseconds. Valid values: 200 to 2000. |
| enable_words | Boolean | No | Specifies whether to return word-level information. Default value: false. |
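If you work at the protocol level rather than through the SDK's set methods, the start request can be sketched as a JSON message whose payload carries the parameters above. The header layout mirrors the server responses shown later; the message name `StartTranscription` is an assumption about the wire format:

```python
import json
import uuid

def build_start_request(appkey: str, task_id: str) -> str:
    """Sketch of the message that starts a recognition task.

    Payload field names come from the parameter table above; the
    header layout (namespace/name/message_id/task_id) mirrors the
    server responses in this document and is an assumption about
    the exact wire format.
    """
    message = {
        "header": {
            "namespace": "SpeechTranscriber",
            "name": "StartTranscription",  # assumed message name
            "appkey": appkey,
            "task_id": task_id,
            "message_id": uuid.uuid4().hex,
        },
        "payload": {
            "format": "pcm",
            "sample_rate": 16000,
            "enable_intermediate_result": True,
            "enable_punctuation_prediction": True,
            "enable_inverse_text_normalization": True,
        },
    }
    return json.dumps(message)
```

The serialized string is sent as a text frame over the established WebSocket connection.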

2. Send and recognize audio data

The client cyclically sends audio data and continuously receives recognition results from the server.
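The sending loop can be sketched as slicing the PCM stream into fixed-size chunks and sending one chunk at a time, pacing the loop to roughly real time. The 3200-byte chunk size (100 ms of 16 kHz, 16-bit mono audio) is an assumption; follow your SDK's recommendation:

```python
def iter_pcm_chunks(pcm: bytes, chunk_size: int = 3200):
    """Yield fixed-size audio chunks for cyclic sending.

    3200 bytes = 100 ms of 16 kHz, 16-bit mono PCM (assumed chunk
    size). Sending one chunk per 100 ms keeps the transmission rate
    near a real-time factor of 1:1, which avoids error 41040202.
    """
    for offset in range(0, len(pcm), chunk_size):
        yield pcm[offset:offset + chunk_size]
```

Each chunk is sent as a binary WebSocket frame, with a short sleep between sends to match the audio's real duration.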

  • The SentenceBegin message indicates that the server detects the beginning of a sentence. Real-time speech recognition uses voice activity detection (VAD) to determine the beginning and end of a sentence. For example, the server returns the following response:
```json
{
    "header": {
        "namespace": "SpeechTranscriber",
        "name": "SentenceBegin",
        "status": 20000000,
        "message_id": "a426f3d4618447519c9d85d1a0d15bf6",
        "task_id": "5ec521b5aa104e3abccf3d3618222547",
        "status_text": "Gateway:SUCCESS:Success."
    },
    "payload": {
        "index": 1,
        "time": 0
    }
}
```

The following table describes the parameters in the header object.

| Parameter | Type | Description |
| --- | --- | --- |
| namespace | String | The namespace of the message. |
| name | String | The name of the message. SentenceBegin indicates that the server detects the beginning of a sentence. |
| status | Integer | The status code, which indicates whether the request is successful. For more information, see Service status codes. |
| status_text | String | The status message. |
| task_id | String | The GUID of the task. Record the value of this field to facilitate troubleshooting. |
| message_id | String | The ID of the message. |

The following table describes the parameters in the payload object.

| Parameter | Type | Description |
| --- | --- | --- |
| index | Integer | The sequence number of the sentence, which starts from 1. |
| time | Integer | The duration of the currently processed audio stream. Unit: milliseconds. |
  • The TranscriptionResultChanged message indicates that an intermediate recognition result is obtained. This message is returned only when the enable_intermediate_result parameter is set to true. The server can return multiple TranscriptionResultChanged messages for the intermediate results of one sentence. For example, the server returns the following response:
```json
{
    "header": {
        "namespace": "SpeechTranscriber",
        "name": "TranscriptionResultChanged",
        "status": 20000000,
        "message_id": "dc21193fada84380a3b6137875ab9178",
        "task_id": "5ec521b5aa104e3abccf3d3618222547",
        "status_text": "Gateway:SUCCESS:Success."
    },
    "payload": {
        "index": 1,
        "time": 1835,
        "result": "Weather in",
        "confidence": 1.0,
        "words": [{
            "text": "Weather",
            "startTime": 630,
            "endTime": 930
        }, {
            "text": "in",
            "startTime": 930,
            "endTime": 1110
        }, {
            "text": "Beijing",
            "startTime": 1110,
            "endTime": 1140
        }]
    }
}
```

For more information about the parameters in the header object, see the preceding table. The following table describes the parameters in the payload object.

| Parameter | Type | Description |
| --- | --- | --- |
| index | Integer | The sequence number of the sentence, which starts from 1. |
| time | Integer | The duration of the currently processed audio stream. Unit: milliseconds. |
| result | String | The recognition result of the sentence. |
| words | List<Word> | The word information of the sentence. The word information is returned only when the enable_words parameter is set to true. |
| confidence | Double | The confidence level of the recognition result for the sentence. Valid values: [0.0, 1.0]. A larger value indicates a higher confidence level. |
  • The SentenceEnd message indicates that the server detects the end of a sentence and returns the recognition result of the sentence. For example, the server returns the following response:
```json
{
    "header": {
        "namespace": "SpeechTranscriber",
        "name": "SentenceEnd",
        "status": 20000000,
        "message_id": "c3a9ae4b231649d5ae05d4af36fd1c8a",
        "task_id": "5ec521b5aa104e3abccf3d3618222547",
        "status_text": "Gateway:SUCCESS:Success."
    },
    "payload": {
        "index": 1,
        "time": 1820,
        "begin_time": 0,
        "result": "Weather in Beijing.",
        "confidence": 1.0,
        "words": [{
            "text": "Weather",
            "startTime": 630,
            "endTime": 930
        }, {
            "text": "in",
            "startTime": 930,
            "endTime": 1110
        }, {
            "text": "Beijing",
            "startTime": 1110,
            "endTime": 1380
        }]
    }
}
```

For more information about the parameters in the header object, see the preceding table. The following table describes the parameters in the payload object.

| Parameter | Type | Description |
| --- | --- | --- |
| index | Integer | The sequence number of the sentence, which starts from 1. |
| time | Integer | The duration of the currently processed audio stream. Unit: milliseconds. |
| begin_time | Integer | The time when the server returned the SentenceBegin message for the sentence. Unit: milliseconds. |
| result | String | The recognition result of the sentence. |
| words | List<Word> | The word information of the sentence. The word information is returned only when the enable_words parameter is set to true. |
| confidence | Double | The confidence level of the recognition result for the sentence. Valid values: [0.0, 1.0]. A larger value indicates a higher confidence level. |

The following table describes the parameters in the word object.

| Parameter | Type | Description |
| --- | --- | --- |
| text | String | The text of the word. |
| startTime | Integer | The start time of the word in the sentence. Unit: milliseconds. |
| endTime | Integer | The end time of the word in the sentence. Unit: milliseconds. |
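The three message types above can be handled with a simple dispatcher on the header's name field. This sketch only summarizes each message; a real client would feed the results into your application logic:

```python
import json

def handle_message(raw: str) -> str:
    """Dispatch a server response by header.name and summarize it."""
    msg = json.loads(raw)
    name = msg["header"]["name"]
    payload = msg.get("payload", {})
    if name == "SentenceBegin":
        return f"sentence {payload['index']} started at {payload['time']} ms"
    if name == "TranscriptionResultChanged":
        return f"partial: {payload['result']}"
    if name == "SentenceEnd":
        return f"final: {payload['result']} (confidence {payload['confidence']})"
    return f"unhandled message: {name}"
```

Checking `header.status` against 20000000 before reading the payload is also advisable, so that error responses are not treated as recognition results.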

3. Stop and complete recognition

The client notifies the server that all audio data is sent. The server completes the recognition task and notifies the client that the task is completed.
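At the protocol level, the stop notification can be sketched as one more JSON message; the message name `StopTranscription` is an assumption about the wire format:

```python
import json
import uuid

def build_stop_request(appkey: str, task_id: str) -> str:
    """Sketch of the message that tells the server all audio was sent.

    The header layout mirrors the other messages in this document;
    the name "StopTranscription" is an assumed message name.
    """
    return json.dumps({
        "header": {
            "namespace": "SpeechTranscriber",
            "name": "StopTranscription",  # assumed message name
            "appkey": appkey,
            "task_id": task_id,
            "message_id": uuid.uuid4().hex,
        },
    })
```

After sending this message, the client keeps the connection open until the server confirms that the task is completed.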

Service status codes

Each response message contains a status field, which indicates the service status code. The following tables describe the service status codes.

Common errors

| Error code | Description | Solution |
| --- | --- | --- |
| 40000001 | The error message returned because the client fails authentication. | Check whether the token used by the client is correct and valid. |
| 40000002 | The error message returned because the message is invalid. | Check whether the message sent by the client meets relevant requirements. |
| 403 | The error message returned because the token expires or the request contains incorrect parameters. | Check whether the token used by the client is valid. Then, check the request parameter settings. |
| 40000004 | The error message returned because the idle status of the client times out. | Check whether the client has stopped sending data to the server for a long time. |
| 40000005 | The error message returned because the number of requests exceeds the upper limit. | Check whether the number of concurrent connections or the queries per second (QPS) exceeds the upper limit. |
| 40000000 | The error message returned because a client error has occurred. This is the default client error code. | Resolve the error according to the error message or open a ticket. |
| 50000000 | The error message returned because a server error has occurred. This is the default server error code. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, open a ticket. |
| 50000001 | The error message returned because an internal call error has occurred. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, open a ticket. |

Gateway errors

| Error code | Description | Solution |
| --- | --- | --- |
| 40010001 | The error message returned because the method is not supported. | If you use the SDK, open a ticket. |
| 40010002 | The error message returned because the instruction is not supported. | If you use the SDK, open a ticket. |
| 40010003 | The error message returned because the instruction is invalid. | If you use the SDK, open a ticket. |
| 40010004 | The error message returned because the client is disconnected. | Check whether the client is disconnected before the server completes the requested task. |
| 40010005 | The error message returned because the task status is incorrect. | Check whether the instruction is supported in the current task status. |

Metadata errors

| Error code | Description | Solution |
| --- | --- | --- |
| 40020105 | The error message returned because the application does not exist. | Check whether the application specified by the appkey exists. |
| 40020106 | The error message returned because the appkey and token do not match. | Check whether the appkey is correct and belongs to the same account as the token. |
| 40020503 | The error message returned because RAM user authentication fails. | Use your Alibaba Cloud account to authorize the RAM user to access the POP API. |

Real-time speech recognition errors

| Error code | Description | Solution |
| --- | --- | --- |
| 41040201 | The error message returned because the client has not sent data for 10 seconds. | Check the network connection, or check whether the client intentionally has no data to send. |
| 41040202 | The error message returned because the client sends data at a high transmission rate and consumes all resources of the server. | Check whether the client sends data at an appropriate transmission rate, for example, at a real-time factor of 1:1. |
| 41040203 | The error message returned because the client sends speech data in an incorrect audio coding format. | Convert the audio data into an audio coding format supported by the SDK. |
| 41040204 | The error message returned because the client calls methods in an incorrect order. | Check whether the client calls the relevant method to send a request before calling other methods. |
| 41040205 | The error message returned because the specified max_sentence_silence parameter is invalid. | Check whether the value of the max_sentence_silence parameter is in the range of 200 to 2000. |
| 51040101 | The error message returned because an internal error has occurred on the server. | Resolve the error according to the error message. |
| 51040102 | The error message returned because the automatic speech recognition (ASR) service is unavailable. | Resolve the error according to the error message. |
| 51040103 | The error message returned because the real-time speech recognition service is unavailable. | Check whether the number of real-time speech recognition tasks exceeds the upper limit. |
| 51040104 | The error message returned because the request for real-time speech recognition times out. | Check the real-time speech recognition logs. |
| 51040105 | The error message returned because the real-time speech recognition service fails to be called. | Check whether the real-time speech recognition service is enabled and whether the port works properly. |
| 51040106 | The error message returned because the load balancing of the real-time speech recognition service fails and the client fails to obtain the IP address of the real-time speech recognition service. | Check whether the real-time speech recognition server in Virtual Private Cloud (VPC) works properly. |
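The status codes above follow a visible pattern: 20000000 means success, codes whose first digit is 4 are client-side errors, and codes whose first digit is 5 are server-side errors. A small helper that applies this pattern (inferred from the tables above, not an official contract):

```python
def classify_status(status: int) -> str:
    """Coarse classification of a service status code.

    20000000 means success; codes starting with 4 indicate client-side
    errors and codes starting with 5 indicate server-side errors. This
    grouping is inferred from the error tables in this document.
    """
    if status == 20000000:
        return "success"
    first = str(status)[0]
    if first == "4":
        return "client error"
    if first == "5":
        return "server error"
    return "unknown"
```

Such a helper is useful for deciding whether to fix the request (client error), retry or open a ticket (server error), or continue normally.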