Intelligent Media Services: Access STT models

Last Updated: Apr 14, 2025

Real-time workflows can connect to your self-developed speech-to-text (STT) models over a standard protocol.

Before you begin

  • To use a self-developed model, you must first be added to the whitelist. To request access, contact us by joining the DingTalk Group.

  • You must deploy a publicly accessible STT model service that supports the WebSocket protocol. The service must follow the input and output specifications described in steps 3 and 4 of the Procedure section.

Procedure

This section describes how to establish a connection between your STT service and Alibaba Cloud:

  1. Configure the following parameters for the Speech-to-text Node in the console.

    | Parameter | Type | Required | Description | Example |
    |---|---|---|---|---|
    | WebSocket URL | String | Yes | The WebSocket address for your STT model interface. | wss://example.com/asr/ws |
    | ApiKey | String | Yes | API authentication information. | AUJH-pf**************HBLKrI |
    | Custom parameters | String | No | Custom parameters. | key_a=value_a |

  2. After configuration is complete, Alibaba Cloud establishes a connection with your service according to the following rules:

    Assuming the WebSocket URL is wss://example.com/asr/ws, Alibaba Cloud sends a request in the following format: wss://example.com/asr/ws?{request parameters}.

    The following table lists the request parameters:

    | Parameter | Type | Required | Description |
    |---|---|---|---|
    | session_id | string | Yes | The identifier for this speech recognition connection. |
    | token | string | Yes | A signature string derived from the session_id. For the calculation method, see Token calculation. |
    | language | string | No | The source language. Defaults to Chinese and English. |

    Request example:

    Without custom parameters

    wss://example.com/asr/ws?session_id=992204bfdca241e78dca2872625cf99f&token=muebPMT%2BnLe*********UJY4%3D&language=cn

    With custom parameters

    If you have set custom parameters in the console, such as key_a=value_a and key_b=value_b, these parameters are concatenated to the end of the URL:

    wss://example.com/asr/ws?session_id=992204bfdca241e78dca2872625cf99f&token=muebPMT%2BnLeTr**********4%3D&language=cn&key_a=value_a&key_b=value_b
  3. After the connection is established, Alibaba Cloud transmits the PCM audio data to your model according to the following specifications:

    | Audio format | Data format | Sound channel | Sample rate | Protocol |
    |---|---|---|---|---|
    | PCM | S16LE | Mono | 16 kHz | WebSocket |

  4. After speech recognition is complete, your service returns the data to Alibaba Cloud in JSON format according to the following specifications:

    | Parameter | Type | Required | Description |
    |---|---|---|---|
    | session_id | String | Yes | The identifier for this speech recognition connection, consistent with the session_id in the request. |
    | name | String | Yes | Message type. Valid values: start (notification message after the WebSocket connection is established), result (speech recognition result), error (server exception notification). |
    | code | Int | Yes | The returned code. A value of 0 indicates that the request is successful. All other values indicate that it failed. |
    | message | String | Yes | The returned message content. If the request failed, describe the failure reason in this field. |
    | result_type | Int | No | Required when name is result. Valid values: 0 (temporary result), 1 (final result). |
    | payload | Object | No | Required when name is result. Contains the fields below. |
    | payload.result | String | No | All recognized results. |
    | payload.begin_time | Int | No | Start time of the recognized speech, in milliseconds. |
    | payload.end_time | Int | No | End time of the recognized speech, in milliseconds. |

  5. End the transmission by sending a binary message:

    ws.send(bytes("{\"stop_session\": true}", encoding='utf-8'))
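
The specifications in steps 2 through 5 can be sketched as a few server-side helpers. This is a minimal illustration that uses only the Python standard library and is not tied to any particular WebSocket framework. The helper names (parse_request_params, pcm_duration_ms, build_message, is_stop_message) are hypothetical. The sketch also assumes that the stop message reaches your service as a binary frame on the same connection as the audio, which the steps above do not state explicitly.

```python
import json
from urllib.parse import parse_qs, urlsplit

SAMPLE_RATE_HZ = 16_000  # from the audio specification in step 3
BYTES_PER_SAMPLE = 2     # S16LE: 16-bit little-endian samples, mono

# Parameters documented in step 2; anything else is treated as custom.
KNOWN_PARAMS = {"session_id", "token", "language"}


def parse_request_params(url):
    """Split the query string into documented and custom parameters."""
    flat = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    known = {k: v for k, v in flat.items() if k in KNOWN_PARAMS}
    custom = {k: v for k, v in flat.items() if k not in KNOWN_PARAMS}
    return known, custom


def pcm_duration_ms(chunk):
    """Duration of a raw PCM chunk in milliseconds (16 kHz, 16-bit, mono)."""
    return len(chunk) * 1000 // (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def build_message(session_id, name, code=0, message="",
                  result_type=None, payload=None):
    """Serialize a response message using the fields listed in step 4."""
    msg = {"session_id": session_id, "name": name,
           "code": code, "message": message}
    if name == "result":
        # result_type and payload are required only for result messages.
        msg["result_type"] = result_type
        msg["payload"] = payload
    return json.dumps(msg)


def is_stop_message(data):
    """Return True if a binary frame is the stop_session message, not audio."""
    try:
        return json.loads(data.decode("utf-8")).get("stop_session") is True
    except (UnicodeDecodeError, ValueError, AttributeError):
        # Raw PCM bytes are usually not valid UTF-8 JSON objects.
        return False
```

For example, a final result for a 100 ms utterance could be produced with build_message(session_id, "result", result_type=1, payload={"result": "hello", "begin_time": 0, "end_time": 100}). Whether responses should be sent as text or binary frames is not specified here, so confirm that detail before relying on this sketch.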

Token calculation

The token is calculated as follows:

  1. Calculate the MD5 hash of the session_id and take its hexadecimal digest.

  2. Sign the MD5 hex digest with HMAC-SHA1, using the api_key as the key.

  3. Encode the resulting signature using Base64.

  4. URL-encode the Base64 string.

Code sample

import base64
import hashlib
import hmac
from urllib.parse import quote


def calc_token(api_key='12345678', session_id='992204bfdca241e78dca2872625cf99f'):
    # Step 1: Calculate the MD5 hash of the session_id (hexadecimal digest).
    # Sample result: f481faf07ec18481bc275a3ef3d61ea0
    base_string = hashlib.md5(session_id.encode('utf-8')).hexdigest().encode('utf-8')
    # Step 2: Sign the MD5 hex digest with HMAC-SHA1, using the api_key as the key.
    # Sample result: b'\x9a\xe7\x9b<\xc4\xfe\x9c\xb7\x93\xae\xbaY\xc3\x91|!\x8b\x14%\x8e'
    digest = hmac.new(api_key.encode('utf-8'), base_string, hashlib.sha1).digest()
    # Step 3: Encode the signature using Base64.
    # Sample result: muebPMT+nLeTrrpZw5F8IYsUJY4=
    token = base64.b64encode(digest).decode('utf-8')
    # Step 4: URL-encode the result.
    # Sample result: muebPMT%2BnLeTrrpZw5F8IYsUJY4%3D
    return quote(token)
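
On the receiving side, your service can recompute the signature from the session_id in the request and compare it with the token it received. The following is a minimal sketch of that check, assuming the service stores the same ApiKey that is configured in the console. The function names expected_signature and verify_request are hypothetical. Because parse_qs already URL-decodes query values, the comparison is made against the Base64 form (steps 1 to 3) rather than the URL-encoded form.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlsplit


def expected_signature(api_key, session_id):
    """Steps 1 to 3 of the token calculation (Base64, before URL-encoding)."""
    md5_hex = hashlib.md5(session_id.encode("utf-8")).hexdigest().encode("utf-8")
    digest = hmac.new(api_key.encode("utf-8"), md5_hex, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("utf-8")


def verify_request(url, api_key):
    """Return True if the token in the request URL matches the session_id."""
    query = parse_qs(urlsplit(url).query)
    session_id = query.get("session_id", [""])[0]
    token = query.get("token", [""])[0]
    if not session_id or not token:
        return False
    expected = expected_signature(api_key, session_id)
    # Constant-time comparison on bytes avoids leaking timing information.
    return hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8"))
```

A service might call verify_request on the connection URL during the WebSocket handshake and reply with an error message (name set to error, non-zero code) when it returns False.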