Intelligent Speech Interaction: API reference

Last Updated: Feb 28, 2024

You can use the recording file recognition service to recognize recording files. The service does not recognize recording files in real time. To recognize a recording file, you must submit an HTTP or HTTPS URL from which the file can be downloaded. You cannot submit a local file directly.

Features

  • Recognizes single-track recording files in WAV and MP3 formats.

  • Supports two call methods: polling and callback.

  • Supports custom linguistic models and hotwords.

  • Recognizes multiple languages, such as Chinese Mandarin, Chinese dialects, and English.

Call limits

  • The access permissions on recording files that you want to recognize must be public. The URL of each recording file can contain the domain name, but not the IP address. In addition, the URL cannot contain spaces.

    Valid URL:

    • https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav

    Invalid URLs:

    • http://127.0.0.1/sample.wav

    • D:\files\sample.wav

  • The maximum file size is 512 MB.

  • If you use the free trial edition, the server completes the recognition task and returns the recognition result within 24 hours after you send a recording file recognition request. If you use Commercial Edition, the server completes the recognition task and returns the recognition result within 3 hours after you send a recording file recognition request. The server retains the recognition result for 72 hours.

    Note

    The preceding time limits do not apply if the recording files that you upload within 30 minutes exceed 500 hours in length. If you need to recognize a large amount of audio data, contact the Alibaba Cloud pre-sales staff.

  • You can use the free trial edition to recognize recording files that are up to 2 hours in length on each calendar day.
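The URL rules in the call limits above can be checked on the client before a request is sent. The following is a minimal sketch that uses only the Python standard library. The function name check_file_link and the problem messages are illustrative assumptions, not part of the service, and the check does not verify that the file is publicly readable or within the size limit.

```python
import ipaddress
from urllib.parse import urlsplit


def check_file_link(url: str) -> list[str]:
    """Return a list of problems found in a recording file URL (empty if none)."""
    problems = []
    if " " in url:
        problems.append("URL contains spaces")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("URL must use HTTP or HTTPS")
    host = parts.hostname or ""
    if not host:
        problems.append("URL has no host name")
    else:
        try:
            ipaddress.ip_address(host)
            problems.append("host is an IP address, not a domain name")
        except ValueError:
            pass  # not an IP literal, so treat it as a domain name
    return problems
```

An empty list means that the URL passed these local checks only. The server can still reject the file for other reasons, such as access permissions.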

Procedure

    1. Check the format and audio sampling rate of your recording file. Select an appropriate scenario and model in the Intelligent Speech Interaction console based on your business scenario.

    2. Store the recording file in Alibaba Cloud Object Storage Service (OSS).

      If the access permissions on the recording file are public, directly obtain the OSS URL of the recording file. For more information, see Public read object. If the access permissions on the recording file are private, use the SDK to generate an OSS URL that has a validity period. For more information, see Private object.

      Note

      You can also build a file server and store the recording file on it. To download the recording file from the file server, make sure that the length indicated by the Content-Length field in the HTTP response header is the same as the length of data in the response body. Otherwise, the recording file fails to be downloaded.

    3. Send a recording file recognition request from the client.

      If the request is successful, the server returns the task ID. You can use the task ID to poll the recognition result.

    4. Send a request from the client to query the recognition result.

      The client queries the recognition result based on the task ID that is obtained in Step 3. The server retains the recognition result for 72 hours.
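Steps 3 and 4 can be sketched as a submit-then-poll loop. In this sketch, submit and query are hypothetical callables that you supply: they are assumed to wrap the actual signed HTTP requests and to return the parsed JSON response as a dictionary. The 10-second interval and the limit of 360 polls are illustrative defaults, not values recommended by the service.

```python
import time
from typing import Callable


def recognize(submit: Callable[[], dict],
              query: Callable[[str], dict],
              interval_s: float = 10.0,
              max_polls: int = 360) -> dict:
    """Submit a task, then poll by task ID until it is no longer pending."""
    submitted = submit()
    if submitted.get("StatusText") != "SUCCESS":
        raise RuntimeError(f"submit failed: {submitted}")
    task_id = submitted["TaskId"]

    for _ in range(max_polls):
        result = query(task_id)
        if result.get("StatusText") not in ("RUNNING", "QUEUEING"):
            return result  # finished: either a success or an error status
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} still pending after {max_polls} polls")
```

Passing the transport functions in as arguments keeps the polling logic independent of the SDK or HTTP client that you use.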

API call methods

  • The recording file recognition service provides the Alibaba Cloud pctowap open platform (POP) API that can be called in a remote procedure call (RPC) style. To call an API operation, the client encapsulates parameters in a request and uses an HTTP method to send the request. The server returns the result in a response. You must store recording files that you want to recognize on a server and make sure that each file can be accessed by using a URL. We recommend that you store recording files in Alibaba Cloud OSS.

  • The recording file recognition POP API supports two operations: use the POST method to send a recording file recognition request and use the GET method to query the recording file recognition result.

    • Operation to send a recording file recognition request:

      • If you use the polling method, you can send a recording file recognition request and obtain the task ID for subsequent recognition result polling.

      • If you use the callback method, you can send a recording file recognition request and a callback URL. If the request is successful, the server uses the POST method to send the recognition result to the callback URL. Make sure that the callback URL can receive a POST request.

      Note

      In earlier versions of the recording file recognition service (2.0 by default), the recognition result obtained by the callback method differs from that obtained by the polling method. The differences lie in the style and fields of the JSON string. In version 4.0, the recording file recognition service updates the recognition result obtained by the callback method to a camelCase JSON string. This produces the same recognition result as that obtained by the polling method.

      If you have activated the recording file recognition service without setting the version to 4.0, its version is 2.0 by default. You can continue to use this version. If you are a new user, set the version of the recording file recognition service to 4.0.
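For the callback method, the callback URL must accept a POST request. The following is a minimal sketch of such a receiver built on the Python standard library. It assumes that the POST body is the version 4.0 result as a JSON string with a TaskId field. The names parse_callback, CallbackHandler, and serve, and port 8080, are illustrative assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(body: bytes) -> dict:
    """Decode a callback body and check that it looks like a version 4.0 result."""
    payload = json.loads(body)
    if not isinstance(payload, dict) or "TaskId" not in payload:
        raise ValueError("callback body is not a recognition result")
    return payload


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = parse_callback(self.rfile.read(length))
        except ValueError:
            self.send_response(400)  # body was not valid JSON or not a result
            self.end_headers()
            return
        print(result["TaskId"], result.get("StatusText"))
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (hypothetical port)."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

In production, the receiver must be reachable through a domain name, because the callback URL cannot contain an IP address.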

      Request parameters:

      When you send a recording file recognition request, you must set the request parameters and add them to the request body as a JSON string. The following example shows request parameters in JSON format. The optional valid_times parameter specifies the valid time periods of an audio track that actually require speech recognition.

      {
          "appkey": "your-appkey",
          "file_link": "https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav",
          "auto_split": false,
          "version": "4.0",
          "enable_words": false,
          "enable_sample_rate_adaptive": true,
          "valid_times": [
              {
                  "begin_time": 200,
                  "end_time": 2000,
                  "channel_id": 0
              }
          ]
      }

      | Parameter | Type | Required | Description |
      | --- | --- | --- | --- |
      | appkey | String | Yes | The appkey of your project in the Intelligent Speech Interaction console. |
      | file_link | String | Yes | The URL of the recording file. Make sure that the scenario and model of the project created in the Intelligent Speech Interaction console are suitable for the recording file. |
      | version | String | Yes | The version of the recording file recognition service. Default value: 2.0. Set this parameter to 4.0. |
      | enable_words | Boolean | No | Specifies whether to return the recognition results of words. Default value: false. This parameter takes effect only when the version parameter is set to 4.0. |
      | enable_sample_rate_adaptive | Boolean | No | Specifies whether to automatically downsample an audio file with a sampling rate that is greater than 16,000 Hz. Default value: false. This parameter takes effect only when the version parameter is set to 4.0. |
      | enable_callback | Boolean | No | Specifies whether to enable the callback method. Default value: false. |
      | callback_url | String | No | The callback URL. You must specify this parameter if you set the enable_callback parameter to true. The callback URL can be an HTTP or HTTPS URL. It can contain the domain name, but not the IP address. |
      | auto_split | Boolean | No | Specifies whether to enable automatic track splitting. If you enable automatic track splitting, the server can identify the speaker of each sentence in a conversation between two parties based on the ChannelId parameter in the recognition result of the sentence. Usually, the value of the ChannelId parameter is 1 for the first speaker in the conversation. Only mono audio files with a sampling rate of 8,000 Hz are supported. |
      | enable_unify_post | Boolean | No | Specifies whether to enable post-processing. Default value: false. Note: The auto_split and enable_unify_post parameters cannot be both set to true. |
      | enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN). Valid values: true and false. Default value: false. If you set this parameter to true, Chinese numerals are converted to Arabic numerals. This parameter takes effect only when the version parameter is set to 4.0 and the enable_unify_post parameter is set to true. Note: ITN is not implemented on words. |
      | enable_disfluency | Boolean | No | Specifies whether to enable disfluency detection. Default value: false. This parameter takes effect only when the version parameter is set to 4.0 and the enable_unify_post parameter is set to true. |
      | valid_times | List< ValidTime > | No | The valid time period that truly requires speech recognition in the total length of an audio track. |
      | max_end_silence | Integer | No | The maximum duration of end silence. Default value: 450. Unit: milliseconds. |
      | max_single_segment_time | Integer | No | The maximum duration of a single sentence. Minimum value: 10000. Default value: 20000. Unit: milliseconds. |
      | customization_id | String | No | The ID of the custom linguistic model that is created by using the POP API. This parameter is not specified by default. |
      | class_vocabulary_id | String | No | The ID of the created categorized hotword vocabulary. This parameter is not specified by default. |
      | vocabulary_id | String | No | The ID of the created extensive hotword vocabulary. This parameter is not specified by default. |

      The following table describes the parameters in the ValidTime object.

      | Parameter | Type | Required | Description |
      | --- | --- | --- | --- |
      | begin_time | Int | Yes | The start time offset of the valid time period. Unit: milliseconds. |
      | end_time | Int | Yes | The end time offset of the valid time period. Unit: milliseconds. |
      | channel_id | Int | Yes | The sequence number of the audio track to which the setting of the valid time period applies. The value starts from 0. |
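The request parameters described above carry a few constraints that can be enforced before the JSON string is built. The following is a minimal sketch, assuming a hypothetical helper named build_task_body. It fixes the version parameter to 4.0 unless the caller overrides it, and it checks only the rules stated in the tables: auto_split and enable_unify_post cannot both be true, callback_url is required when enable_callback is true, and each ValidTime entry needs all three fields.

```python
import json


def build_task_body(appkey: str, file_link: str, **options) -> str:
    """Serialize request parameters to the JSON string sent in the request body."""
    params = {"appkey": appkey, "file_link": file_link, "version": "4.0"}
    params.update(options)

    if params.get("auto_split") and params.get("enable_unify_post"):
        raise ValueError("auto_split and enable_unify_post cannot both be true")
    if params.get("enable_callback") and not params.get("callback_url"):
        raise ValueError("callback_url is required when enable_callback is true")
    for period in params.get("valid_times", []):
        missing = {"begin_time", "end_time", "channel_id"} - period.keys()
        if missing:
            raise ValueError(f"valid_times entry is missing {sorted(missing)}")
    return json.dumps(params)
```

For example, build_task_body("your-appkey", file_link, enable_words=True) returns a JSON string that contains the three required parameters plus enable_words.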

      Response parameters:

      The server returns a response to the recording file recognition request. The response includes response parameters in the format of a JSON string. For example, the server returns the following response:

      {
              "TaskId": "4b56f0c4b7e611e88f34c33c2a60****",
              "RequestId": "E4B183CC-6CFE-411E-A547-D877F7BD****",
              "StatusText": "SUCCESS",
              "StatusCode": 21050000
      }

      HTTP status code 200 indicates that the request is successful. For more information, see HTTP status codes.

      | Parameter | Type | Required | Description |
      | --- | --- | --- | --- |
      | TaskId | String | Yes | The ID of the recognition task. |
      | RequestId | String | Yes | The ID of the request. This parameter is used only for debugging. |
      | StatusCode | Int | Yes | The status code. |
      | StatusText | String | Yes | The status message. |

    • Operation to query the recording file recognition result:

      If the recording file recognition request that you send is successful, the server returns the task ID. You can use the task ID to poll the recognition result.

      Request parameters:

      After the server returns the response to the recording file recognition request, you can use the task ID in the response as a parameter to query the recognition result. When you call the query operation, you must set a polling interval.

      Important

      The query operation supports up to 100 queries per second (QPS). If the QPS exceeds 100, the following error may be returned: Throttling.User : Request was denied due to user flow control. We recommend that you set a longer polling interval.

      | Parameter | Type | Required | Description |
      | --- | --- | --- | --- |
      | TaskId | String | Yes | The ID of the recognition task. |

      Response parameters:

      The server returns a response to the query request for the recording file recognition result. The response includes response parameters in the format of a JSON string.

      • The following sample success response shows the recognition result of the single-track recording file nls-sample-16k.wav:

        {
                "TaskId": "d429dd7dd75711e89305ab6170fe****",
                "RequestId": "9240D669-6485-4DCC-896A-F8B31F94****",
                "StatusText": "SUCCESS",
                "BizDuration": 2956,
                "SolveTime": 1540363288472,
                "StatusCode": 21050000,
                "Result": {
                        "Sentences": [{
                                "EndTime": 2365,
                                "SilenceDuration": 0,
                                "BeginTime": 340,
                                "Text": "Weather in Beijing",
                                "ChannelId": 0,
                                "SpeechRate": 177,
                                "EmotionValue": 5.0
                        }]
                }
        }

        Assume that you set the enable_callback parameter to true, specify the callback_url parameter, and set the version parameter to 4.0. The following response shows the recognition result that is obtained by the callback method:

        {
                "Result": {
                        "Sentences": [{
                                "EndTime": 2365,
                                "SilenceDuration": 0,
                                "BeginTime": 340,
                                "Text": "Weather in Beijing",
                                "ChannelId": 0,
                                "SpeechRate": 177,
                                "EmotionValue": 5.0
                        }]
                },
                "TaskId": "36d01b244ad811e9952db7bb7ed2****",
                "StatusCode": 21050000,
                "StatusText": "SUCCESS",
                "RequestTime": 1553062810452,
                "SolveTime": 1553062810831,
                "BizDuration": 2956
        }
        Note
        • The value of the RequestTime parameter is a timestamp that indicates when the recording file recognition request is sent, in milliseconds. For example, a value of 1553062810452 indicates 14:20:10 on March 20, 2019, UTC+8.

        • The value of the SolveTime parameter is a timestamp that indicates when the recording file recognition task is completed, in milliseconds.

      • The following response shows that the task is queuing:

        {
                "TaskId": "c7274235b7e611e88f34c33c2a60****",
                "RequestId": "981AD922-0655-46B0-8C6A-5C836822****",
                "StatusText": "QUEUEING",
                "StatusCode": 21050002
        }

      • The following response shows that the task is running:

        {
                "TaskId": "c7274235b7e611e88f34c33c2a60****",
                "RequestId": "8E908ED2-867F-457E-82BF-4756194A****",
                "StatusText": "RUNNING",
                "BizDuration": 0,
                "StatusCode": 21050001
        }

      • The following sample error response shows that the recording file fails to be downloaded:

        {
                "TaskId": "4cf25b7eb7e711e88f34c33c2a60****",
                "RequestId": "098BF27C-4CBA-45FF-BD11-3F532F26****",
                "StatusText": "FILE_DOWNLOAD_FAILED",
                "BizDuration": 0,
                "SolveTime": 1536906469146,
                "StatusCode": 41050002
        }
        Note

        For more information, see the error codes and solutions in the "Service status codes" section of this topic.

      HTTP status code 200 indicates that the request is successful. For more information, see HTTP status codes.

      | Parameter | Type | Required | Description |
      | --- | --- | --- | --- |
      | TaskId | String | Yes | The ID of the recognition task. |
      | StatusCode | Int | Yes | The status code. |
      | StatusText | String | Yes | The status message. |
      | RequestId | String | Yes | The ID of the request. This parameter is used for debugging. |
      | Result | Object | Yes | The recognition result object. |
      | Sentences | List< SentenceResult > | Yes | The recognition results of sentences. This parameter is returned only when the value of the StatusText parameter is SUCCESS. |
      | Words | List< WordResult > | No | The recognition results of words. This parameter is returned only when the enable_words parameter is set to true and the version parameter is set to 4.0. |
      | BizDuration | Long | Yes | The total duration of the recording file that is recognized. Unit: milliseconds. |
      | SolveTime | Long | Yes | The timestamp that indicates when the recording file recognition task is completed. Unit: milliseconds. |
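The SolveTime value, and the RequestTime value in callback results, are millisecond timestamps. The following short sketch converts one to a readable time in UTC+8, using the sample values from the callback response above. The helper name ms_to_utc8 is an illustrative assumption.

```python
from datetime import datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))


def ms_to_utc8(timestamp_ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware datetime in UTC+8."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=UTC8)


request_time = 1553062810452  # RequestTime from the sample callback result
solve_time = 1553062810831    # SolveTime from the same result

print(ms_to_utc8(request_time).strftime("%Y-%m-%d %H:%M:%S"))  # 2019-03-20 14:20:10
print(solve_time - request_time)  # 379 (milliseconds between request and completion)
```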

      The following table describes the parameters in the recognition result of each sentence.

      | Parameter | Type | Required | Description |
      | --- | --- | --- | --- |
      | ChannelId | Int | Yes | The ID of the audio track to which the sentence belongs. |
      | BeginTime | Int | Yes | The start time offset of the sentence. Unit: milliseconds. |
      | EndTime | Int | Yes | The end time offset of the sentence. Unit: milliseconds. |
      | Text | String | Yes | The recognition result of the sentence. |
      | EmotionValue | Int | Yes | The emotion value. The value is equal to the volume decibel value divided by 10. Valid values: [1,10]. A greater value indicates a stronger emotion. |
      | SilenceDuration | Int | Yes | The silence duration between the current and the previous sentences. Unit: seconds. |
      | SpeechRate | Int | Yes | The average speech rate of the sentence. Unit: words per minute. |

      • Recognition results of words

        If the enable_words parameter is set to true and the version parameter is set to 4.0, the server returns the recognition results of words in the response. The recognition results of words obtained by the polling method are the same as those obtained by the callback method. The following response shows the recognition results that are obtained by the polling method:

        {
                "StatusCode": 21050000,
                "Result": {
                        "Sentences": [{
                                "SilenceDuration": 0,
                                "EmotionValue": 5.0,
                                "ChannelId": 0,
                                "Text": "Weather in Beijing",
                                "BeginTime": 340,
                                "EndTime": 2365,
                                "SpeechRate": 177
                        }],
                        "Words": [{
                                "ChannelId": 0,
                                "Word": "Weather",
                                "BeginTime": 640,
                                "EndTime": 940
                        }, {
                                "ChannelId": 0,
                                "Word": "in",
                                "BeginTime": 940,
                                "EndTime": 1120
                        }, {
                                "ChannelId": 0,
                                "Word": "Beijing",
                                "BeginTime": 1120,
                                "EndTime": 2020
                        }]
                },
                "SolveTime": 1553236968873,
                "StatusText": "SUCCESS",
                "RequestId": "027B126B-4AC8-4C98-9FEC-A031158F****",
                "TaskId": "b505e78c4c6d11e9a213e11db149****",
                "BizDuration": 2956
        }
        

        The following table describes the parameters in the recognition result of each word.

        | Parameter | Type | Required | Description |
        | --- | --- | --- | --- |
        | BeginTime | Int | Yes | The start time of the word. Unit: milliseconds. |
        | EndTime | Int | Yes | The end time of the word. Unit: milliseconds. |
        | ChannelId | Int | Yes | The ID of the audio track to which the word belongs. |
        | Word | String | Yes | The recognition result of the word. |
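The sentence and word results described above can be turned into a simple timed transcript. The following is a minimal sketch that reads only fields documented in the tables. The function names transcript_lines and word_durations and the output format are illustrative assumptions.

```python
def transcript_lines(response: dict) -> list[str]:
    """Format each sentence as '[start-end] chN: text', with times in seconds."""
    sentences = response.get("Result", {}).get("Sentences", [])
    lines = []
    for s in sorted(sentences, key=lambda s: s["BeginTime"]):
        start = s["BeginTime"] / 1000
        end = s["EndTime"] / 1000
        lines.append(f"[{start:.3f}-{end:.3f}] ch{s['ChannelId']}: {s['Text']}")
    return lines


def word_durations(response: dict) -> list[tuple[str, int]]:
    """Return (word, duration in milliseconds) pairs; empty if Words is absent."""
    words = response.get("Result", {}).get("Words", [])
    return [(w["Word"], w["EndTime"] - w["BeginTime"]) for w in words]
```

Because the Words list is returned only when the enable_words parameter is set to true, word_durations returns an empty list for responses that contain sentences only.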

Service status codes

The following table describes the normal status codes.

    | Status code | Status message | Description | Solution |
    | --- | --- | --- | --- |
    | 21050000 | SUCCESS | The request is successful after you use the POST method to send a recording file recognition request or the GET method to query the recording file recognition result. | No solution is required. |
    | 21050001 | RUNNING | The recording file recognition task is running. | Use the GET method to send the query request for the recognition result later. |
    | 21050002 | QUEUEING | The recording file recognition task is queuing. | Use the GET method to send the query request for the recognition result later. |
    | 21050003 | SUCCESS_WITH_NO_VALID_FRAGMENT | The query request for the recognition result is successful, but the server does not detect any speech data. | Check whether the recording file contains speech data or the duration of speech data is too short. |

The following table describes the error codes.

    Note

    Status codes that start with 4 indicate client errors, whereas those that start with 5 indicate server errors.

    | Status code | Status message | Description | Solution |
    | --- | --- | --- | --- |
    | 41050001 | USER_BIZDURATION_QUOTA_EXCEED | The total duration of the recording files that you want to recognize exceeds the quota for the day. | If you need to recognize a large amount of audio data, send an email to nls_support@service.aliyun.com. |
    | 41050002 | FILE_DOWNLOAD_FAILED | The recording file fails to be downloaded. | Check whether the URL of the recording file is correct or whether the recording file can be accessed and downloaded over the Internet. |
    | 41050003 | FILE_CHECK_FAILED | The format of the recording file is invalid. | Check whether the recording file is a single-track or dual-track file in WAV or MP3 format. |
    | 41050004 | FILE_TOO_LARGE | The recording file is too large. | Check whether the recording file is larger than 512 MB in size. |
    | 41050005 | FILE_NORMALIZE_FAILED | The recording file fails to be normalized. | Check whether the recording file is damaged or cannot be played. |
    | 41050006 | FILE_PARSE_FAILED | The recording file fails to be parsed. | Check whether the recording file is damaged or cannot be played. |
    | 41050007 | MKV_PARSE_FAILED | The MKV parsing fails. | Check whether the recording file is damaged or cannot be played. |
    | 41050008 | UNSUPPORTED_SAMPLE_RATE | The audio sampling rate is not supported. | Check whether the audio sampling rate of the recording file is the same as the sampling rate in the automatic speech recognition (ASR) model that is bound to the appkey of your project in the Intelligent Speech Interaction console. |
    | 41050009 | UNSUPPORTED_ASR_GROUP | The ASR group is not supported. | Check whether the appkey belongs to the same Alibaba Cloud account as the AccessKey pair. |
    | 41050010 | FILE_TRANS_TASK_EXPIRED | The recording file recognition task expires. | Check whether the task ID exists or expires. |
    | 41050011 | REQUEST_INVALID_FILE_URL_VALUE | The specified file_link parameter is invalid. | Check whether the file_link parameter is specified in a correct format. |
    | 41050012 | REQUEST_INVALID_CALLBACK_VALUE | The specified callback_url parameter is invalid. | Check whether the callback_url parameter is specified in a correct format. |
    | 41050013 | REQUEST_PARAMETER_INVALID | The request parameters are invalid. | Check whether the request body is a valid JSON string. |
    | 41050014 | REQUEST_EMPTY_APPKEY_VALUE | The appkey parameter is not specified. | Check whether the appkey parameter is specified. |
    | 41050015 | REQUEST_APPKEY_UNREGISTERED | The specified appkey parameter is invalid. | Check whether the appkey that is indicated by the appkey parameter is valid or whether the appkey belongs to the same Alibaba Cloud account as the specified AccessKey ID. |
    | 41050021 | RAM_CHECK_FAILED | The RAM user authentication fails. | Check whether the RAM user is authorized to call the Intelligent Speech Interaction API. |
    | 41050023 | CONTENT_LENGTH_CHECK_FAILED | The specified content-length field is invalid. | When you download the recording file, check whether the length that is indicated by the content-length field in the HTTP response header is the same as the actual length of the recording file. |
    | 41050024 | FILE_404_NOT_FOUND | The recording file that you want to download does not exist. | Check whether the recording file that you want to download exists. |
    | 41050025 | FILE_403_FORBIDDEN | You are not authorized to download the recording file. | Check whether you are authorized to download the recording file. |
    | 41050026 | FILE_SERVER_ERROR | A file server error occurs. | Check whether the server where the recording file is stored works properly. |
    | 51050000 | INTERNAL_ERROR | An internal error occurs. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
    | 51050001 | VAD_FAILED | The voice activity detection (VAD) fails. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
    | 51050002 | RECOGNIZE_FAILED | The ASR fails. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
    | 51050003 | RECOGNIZE_INTERRUPT | The ASR is interrupted. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
    | 51050004 | OFFER_INTERRUPT | The recognition task is prevented from being written to the queue. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
    | 51050005 | FILE_TRANS_TIMEOUT | The recognition task fails due to a timeout. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
    | 51050006 | FRAGMENT_FAILED | The multi-channel audio data fails to be converted to mono audio data. | If the error code is occasionally returned, ignore it. If the error code is returned multiple times, submit a ticket. |
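The status codes above fall into a few groups that a client usually handles differently. The following is a minimal sketch of that grouping. The function name classify_status and the returned labels are illustrative assumptions, and treating 21050003 as a finished task is a design choice of this sketch, because the query succeeded even though no speech was detected.

```python
def classify_status(status_code: int) -> str:
    """Map a service status code to a coarse handling category."""
    if status_code in (21050000, 21050003):
        return "done"          # finished; 21050003 means no speech was detected
    if status_code in (21050001, 21050002):
        return "pending"       # running or queuing: query again later
    leading = str(status_code)[0]
    if leading == "4":
        return "client_error"  # fix the request or the recording file
    if leading == "5":
        return "server_error"  # retry; submit a ticket if it persists
    return "unknown"
```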

Earlier versions

If you have activated the recording file recognition service without setting the version to 4.0, its version is 2.0 by default. In version 2.0, the recognition result obtained by the callback method differs from that obtained by the polling method. The differences lie in the style and fields of the JSON string. Assume that you set the enable_callback parameter to true and specify the callback_url parameter. The following response shows the recognition result that is obtained by the callback method:

{
        "result": [{
                "begin_time": 340,
                "channel_id": 0,
                "emotion_value": 5.0,
                "end_time": 2365,
                "silence_duration": 0,
                "speech_rate": 177,
                "text": "Weather in Beijing"
        }],
        "task_id": "3f5d4c0c399511e98dc025f34473****",
        "status_code": 21050000,
        "status_text": "SUCCESS",
        "request_time": 1551164878830,
        "solve_time": 1551164879230,
        "biz_duration": 2956
}
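If an existing integration still receives version 2.0 callbacks, one option is to convert them to the version 4.0 shape so that polling and callback results share one code path. The following is a minimal sketch based only on the two sample responses in this topic. The function names are illustrative assumptions, and the mapping has not been checked against fields that the samples do not show.

```python
def to_upper_camel(name: str) -> str:
    """Convert a snake_case key such as 'begin_time' to 'BeginTime'."""
    return "".join(part.capitalize() for part in name.split("_"))


def upgrade_callback(v2: dict) -> dict:
    """Reshape a version 2.0 callback result into the version 4.0 layout."""
    out = {to_upper_camel(k): v for k, v in v2.items() if k != "result"}
    # Version 2.0 puts the sentence list directly under "result";
    # version 4.0 nests it under Result.Sentences.
    out["Result"] = {
        "Sentences": [
            {to_upper_camel(k): v for k, v in sentence.items()}
            for sentence in v2.get("result", [])
        ]
    }
    return out
```

Applied to the sample above, the converted result contains TaskId, StatusCode, StatusText, RequestTime, SolveTime, BizDuration, and a Result.Sentences list whose entries use keys such as BeginTime and Text.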