Real-time workflows support accessing your self-developed speech-to-text (STT) models using standard protocols.
Before you begin
Using a self-developed model requires whitelisting. For more information, contact us by joining the DingTalk group.
You must deploy a publicly accessible STT model service that supports the WebSocket protocol. Your service must adhere to the input and output specifications described in steps 3 and 4 of the following procedure.
Procedure
This section describes how to establish a connection between your STT service and Alibaba Cloud:
Configure the following parameters for the Speech-to-text Node in the console.
| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| WebSocket URL | String | Yes | The WebSocket address for your STT model interface. | wss://example.com/asr/ws |
| ApiKey | String | Yes | API authentication information. | AUJH-pf**************HBLKrI |
| Custom parameters | String | No | Custom parameters. | key_a=value_a |
After configuration is complete, Alibaba Cloud establishes a connection with your service according to the following rules:
Assuming the WebSocket URL is wss://example.com/asr/ws, Alibaba Cloud sends a request in the following format:
wss://example.com/asr/ws?{request parameters}
The following table lists the request parameters:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| session_id | String | Yes | The identifier for this speech recognition connection. |
| token | String | Yes | A string serving as a signature derived from the session_id. For the calculation method, see Token calculation. |
| language | String | No | The source language. It defaults to Chinese and English. |
Request example:
Without custom parameters
wss://example.com/asr/ws?session_id=992204bfdca241e78dca2872625cf99f&token=muebPMT%2BnLe*********UJY4%3D&language=cn
With custom parameters
If you have set custom parameters in the console, such as key_a=value_a and key_b=value_b, these parameters are appended to the end of the URL:
wss://example.com/asr/ws?session_id=992204bfdca241e78dca2872625cf99f&token=muebPMT%2BnLeTr**********4%3D&language=cn&key_a=value_a&key_b=value_b
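The URL layout above can be sketched in Python as follows. This is a minimal illustration, not Alibaba Cloud's implementation: the helper names build_request_url and parse_request_url are hypothetical, and passing custom parameters as a dictionary is an assumption made for the example. The token value is the sample from the Token calculation section.

```python
from urllib.parse import parse_qsl, urlsplit


def build_request_url(base_url, session_id, token, language=None, custom_params=None):
    """Assemble a connection URL in the parameter order shown in the examples.

    `token` is assumed to be URL-encoded already (the last step of the
    token calculation), so it is inserted verbatim, not encoded again.
    """
    parts = [f"session_id={session_id}", f"token={token}"]
    if language:
        parts.append(f"language={language}")
    # Assumption: custom parameters are plain key=value pairs joined with "&".
    for key, value in (custom_params or {}).items():
        parts.append(f"{key}={value}")
    return f"{base_url}?{'&'.join(parts)}"


def parse_request_url(url):
    """Service side: recover the query parameters with values URL-decoded."""
    return dict(parse_qsl(urlsplit(url).query))


url = build_request_url(
    "wss://example.com/asr/ws",
    "992204bfdca241e78dca2872625cf99f",
    "muebPMT%2BnLeTrrpZw5F8IYsUJY4%3D",
    language="cn",
    custom_params={"key_a": "value_a", "key_b": "value_b"},
)
params = parse_request_url(url)
```

Most WebSocket server frameworks perform the query parsing for you; the point of parse_request_url is only to show that the token arrives in its Base64 form once the URL encoding has been removed.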
After the connection is established, Alibaba Cloud transmits the PCM audio data to your model according to the following specifications:
| Audio format | Data format | Sound channel | Sample rate | Protocol |
| --- | --- | --- | --- | --- |
| PCM | S16LE | Mono | 16 kHz | WebSocket |
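With this format, one second of audio occupies 16,000 samples × 2 bytes × 1 channel = 32,000 bytes. The following sketch shows one way a service might decode a received binary frame into signed 16-bit samples and estimate its duration. The function names are hypothetical, and the frame size used by Alibaba Cloud is not specified in this document, so the code makes no assumption about it.

```python
import struct

SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # S16LE: signed 16-bit little-endian
CHANNELS = 1              # mono


def decode_pcm_s16le(frame: bytes) -> list[int]:
    """Convert raw S16LE bytes into a list of integer samples.

    A trailing odd byte (an incomplete sample) is ignored.
    """
    sample_count = len(frame) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{sample_count}h", frame[: sample_count * BYTES_PER_SAMPLE]))


def duration_ms(frame: bytes) -> float:
    """Playback duration of a frame in milliseconds."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return len(frame) * 1000 / bytes_per_second
```

For example, a 3,200-byte frame corresponds to 100 ms of audio.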
After speech recognition is complete, your service returns the data to Alibaba Cloud in JSON format according to the following specifications:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| session_id | String | Yes | The identifier for this speech recognition connection, consistent with the session_id in the request. |
| name | String | Yes | Message type. Valid values: start (notification message after the WebSocket connection is established), result (speech recognition result), error (server exception notification). |
| code | Int | Yes | The returned code. A value of 0 indicates that the request is successful. All other values indicate that it failed. |
| message | String | Yes | The returned message content. If the request failed, describe the failure reason in this field. |
| result_type | Int | No | Required when name is result. Valid values: 0 (temporary result), 1 (final result). |
| payload | Object | No | Required when name is result. |
| result | String | No | All recognized results. |
| begin_time | Int | No | Start time of the recognized speech, in milliseconds. |
| end_time | Int | No | End time of the recognized speech, in milliseconds. |
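As an illustration of these fields, the sketch below builds the three message types as JSON strings. It rests on two assumptions that this document does not confirm: that result, begin_time, and end_time are nested inside the payload object (their position after payload in the table suggests this), and that "success" is an acceptable message value for successful responses. The helper names are hypothetical.

```python
import json


def make_message(session_id, name, code=0, message="success", **extra):
    """Build a response with the four required fields plus any extras."""
    body = {"session_id": session_id, "name": name, "code": code, "message": message}
    body.update(extra)
    return json.dumps(body, ensure_ascii=False)


def start_message(session_id):
    """Sent once the WebSocket connection is established."""
    return make_message(session_id, "start")


def result_message(session_id, text, begin_ms, end_ms, final):
    """Recognition result: result_type 0 is temporary, 1 is final."""
    return make_message(
        session_id,
        "result",
        result_type=1 if final else 0,
        # Assumption: these three fields live inside `payload`.
        payload={"result": text, "begin_time": begin_ms, "end_time": end_ms},
    )


def error_message(session_id, code, reason):
    """Server exception: any non-zero code signals failure."""
    return make_message(session_id, "error", code=code, message=reason)


final = result_message("992204bfdca241e78dca2872625cf99f", "hello world", 0, 1200, final=True)
```

Passing ensure_ascii=False keeps recognized text in languages such as Chinese readable in the serialized output; either setting produces valid JSON.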
End the transmission by sending a binary message:
ws.send(bytes("{\"stop_session\": true}", encoding='utf-8'))
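Because both the audio data and this stop message travel as binary WebSocket messages, the receiving side needs a way to tell them apart. The following is one possible check, offered as a sketch: the function name is hypothetical, and this document does not prescribe how the distinction should be made.

```python
import json


def is_stop_message(frame: bytes) -> bool:
    """Return True if a binary frame is the {"stop_session": true} control message."""
    # Cheap pre-check: the control message is a JSON object, so it starts with "{".
    if not frame.startswith(b"{"):
        return False
    try:
        obj = json.loads(frame.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not valid UTF-8 JSON, so treat the frame as audio data.
        return False
    return isinstance(obj, dict) and obj.get("stop_session") is True
```

Frames for which this returns False would be handled as PCM audio.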
Token calculation
The token is calculated as follows:
1. Calculate the MD5 hash of the session_id and take its hexadecimal digest.
2. Use the api_key as the key to compute the HMAC-SHA1 digest of the MD5-hashed session_id.
3. Encode the resulting data using Base64.
4. URL-encode the result.
Code sample
import hashlib
import hmac
import base64
from urllib.parse import quote

def calc_token():
    api_key = '12345678'
    session_id = '992204bfdca241e78dca2872625cf99f'

    # Step 1: Calculate the MD5 hash of the session_id (hexadecimal digest).
    # Sample result: f481faf07ec18481bc275a3ef3d61ea0
    md5 = hashlib.md5()
    md5.update(session_id.encode('utf-8'))
    base_string = md5.hexdigest().encode('utf-8')

    # Step 2: Use the api_key as the key to compute the HMAC-SHA1 digest of the MD5-hashed session_id.
    # Sample result: b'\x9a\xe7\x9b<\xc4\xfe\x9c\xb7\x93\xae\xbaY\xc3\x91|!\x8b\x14%\x8e'
    token = hmac.new(api_key.encode('utf-8'), base_string, hashlib.sha1).digest()

    # Step 3: Encode the resulting data using Base64.
    # Sample result: muebPMT+nLeTrrpZw5F8IYsUJY4=
    token = base64.b64encode(token)

    # Step 4: URL-encode the result.
    # Sample result: muebPMT%2BnLeTrrpZw5F8IYsUJY4%3D
    token = quote(str(token, 'utf-8'))
    return token
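On the service side, the same calculation can be used to check the token received in the connection request. The sketch below is an assumption about how such a check might look; this document does not state that verification is required, and the names expected_signature and verify_token are hypothetical. It compares against the Base64 form of the token, because most WebSocket frameworks URL-decode query parameters before handing them to your code.

```python
import base64
import hashlib
import hmac
from urllib.parse import unquote


def expected_signature(session_id: str, api_key: str) -> str:
    """Steps 1 to 3 of the token calculation (Base64 form, before URL encoding)."""
    md5_hex = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    digest = hmac.new(api_key.encode("utf-8"), md5_hex.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_token(session_id: str, received_token: str, api_key: str, url_encoded: bool = False) -> bool:
    """Check a received token against the value computed from session_id.

    Set url_encoded=True if the token is still percent-encoded, that is,
    taken from the raw URL rather than from parsed query parameters.
    """
    if url_encoded:
        received_token = unquote(received_token)
    expected = expected_signature(session_id, api_key)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged token were correct.
    return hmac.compare_digest(expected.encode("utf-8"), received_token.encode("utf-8"))
```

A service could reject the connection, or reply with an error message, when verify_token returns False.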