This document applies only to the China (Beijing) region. To use the models, you must use a China (Beijing) region API key.
Access the real-time speech recognition service through WebSocket.
The DashScope SDK supports only Java and Python. For other languages, use the WebSocket connection described here.
User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Paraformer.
WebSocket provides full-duplex communication: the client and server establish a persistent connection with a single handshake, allowing both parties to push data to each other, providing better real-time performance.
WebSocket libraries are available for most languages (Go: gorilla/websocket, PHP: Ratchet, Node.js: ws). Familiarize yourself with WebSocket basics before starting.
Prerequisites
You have activated Model Studio and created an API key. Export the key as an environment variable (do not hard-code it) to reduce security risks.
For temporary access or strict control over high-risk operations (accessing/deleting sensitive data), use a temporary authentication token instead.
Compared with long-term API keys, temporary tokens are more secure (60-second lifespan) and reduce API key leakage risk.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Model availability
Feature | paraformer-realtime-v2 | paraformer-realtime-8k-v2 |
Scenarios | Scenarios such as live streaming and meetings | Recognition scenarios for 8 kHz audio, such as telephone customer service and voicemail |
Sample rate | Any | 8 kHz |
Languages | Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese. | Chinese |
Punctuation prediction | ✅ Default (no configuration needed) | ✅ Default (no configuration needed) |
Inverse text normalization (ITN) | ✅ Default (no configuration needed) | ✅ Default (no configuration needed) |
Specify recognition language | ✅ Specify using the language_hints parameter | ❌ |
Emotion recognition | ❌ | ✅ (only when semantic_punctuation_enabled is false and sentence_end is true) |
Interaction flow
The client sends two message types to the server: JSON instructions and binary audio (must be single-channel). Server responses are called events.
The interaction flow is as follows:
Establish a connection: The client connects to the server via WebSocket.
Start a task:
The client sends the run-task instruction to start the task.
The client receives the task-started event from the server, which indicates that the task has started successfully.
Send an audio stream:
The client starts sending binary audio and simultaneously receives the result-generated event from the server, which contains the speech recognition result.
Notify the server to end the task:
The client sends the finish-task instruction to notify the server to end the task, and continues to receive the result-generated event returned by the server.
End the task:
The client receives the task-finished event from the server, which marks the end of the task.
Close the connection: The client closes the WebSocket connection.
URL
The WebSocket URL is as follows:
wss://dashscope.aliyuncs.com/api-ws/v1/inference
Headers
Parameter | Type | Required | Description |
Authorization | string | Yes | The authentication token, in the format bearer <your-api-key>. |
user-agent | string | No | The client identifier. It helps the server track the request source. |
X-DashScope-WorkSpace | string | No | Model Studio workspace ID. |
X-DashScope-DataInspection | string | No | Specifies whether to enable the data compliance check. Default: |
Instructions (Client → Server)
Instructions are JSON messages (Text Frames) sent from client to server to control task start/end and mark boundaries. Binary audio (single-channel) is sent separately, not in instructions.
Send instructions in this order (out-of-order sends may fail):
1. run-task instruction - Starts the task; save the returned task_id for step 3.
2. Binary audio (mono) - Send after receiving the task-started event.
3. finish-task instruction - Ends the task after all audio is sent, using the same task_id from step 1.
1. run-task instruction: Start a task
This instruction starts a speech recognition task. Use the same task_id when sending the finish-task instruction.
When to send: After the WebSocket connection is established.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // random uuid
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "asr",
"function": "recognition",
"model": "paraformer-realtime-v2",
"parameters": {
"format": "pcm", // Audio format
"sample_rate": 16000, // Sample rate
"disfluency_removal_enabled": false, // Filter disfluent words
"language_hints": [
"en"
] // Specify language, only supported by the paraformer-realtime-v2 model
    },
    "input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | The instruction type. Fixed value: "run-task". |
header.task_id | string | Yes | The current task ID. A 32-character universally unique identifier (UUID), consisting of 32 randomly generated letters and digits. It can include hyphens (for example, 2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx). When you later send the finish-task instruction, use the same task_id that you used for the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameters:
Parameter | Type | Required | Description |
payload.task_group | string | Yes | Fixed string: "audio". |
payload.task | string | Yes | Fixed string: "asr". |
payload.function | string | Yes | Fixed string: "recognition". |
payload.model | string | Yes | The name of the model. For a list of supported models, see Model List. |
payload.input | object | Yes | Fixed format: {}. |
payload.parameters | |||
format | string | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr. Important opus/speex: Must be encapsulated in Ogg. wav: Must be PCM encoded. amr: Only the AMR-NB type is supported. |
sample_rate | integer | Yes | The audio sampling rate in Hz. This parameter varies by model:
- paraformer-realtime-v2: supports any sample rate.
- paraformer-realtime-8k-v2: supports only 8000 Hz. |
disfluency_removal_enabled | boolean | No | Specifies whether to filter out disfluent words:
- false (default): does not filter.
- true: filters. |
language_hints | array[string] | No | The language code for recognition. If you cannot determine the language in advance, leave this parameter unset for automatic detection. Currently supported language codes:
- zh: Chinese
- en: English
- ja: Japanese
- ko: Korean
- de: German
- fr: French
- ru: Russian
This parameter only applies to models that support multiple languages (see Model List). |
semantic_punctuation_enabled | boolean | No | Specifies whether to enable semantic sentence segmentation (disabled by default):
- false (default): sentences are segmented by voice activity detection (VAD).
- true: sentences are segmented semantically.
Semantic segmentation provides higher accuracy and is ideal for meeting transcription. VAD segmentation has lower latency and is ideal for interactive scenarios. Applies to v2 and later models. |
max_sentence_silence | integer | No | The VAD sentence segmentation silence threshold (ms). If silence after a speech segment exceeds this value, the sentence ends. Range: 200-6000 ms. Default: 800 ms. Applies only when |
multi_threshold_mode_enabled | boolean | No | Specifies whether to prevent VAD from over-segmenting long sentences (disabled by default). Applies only when |
punctuation_prediction_enabled | boolean | No | Specifies whether to automatically add punctuation to results (enabled by default):
- true (default): adds punctuation to the recognition results.
- false: does not add punctuation.
Applies to v2 and later models only. |
heartbeat | boolean | No | Specifies whether to maintain a persistent server connection during long periods of silence:
- false (default): the server may close the connection during long silence.
- true: the connection is kept alive as long as the client continuously sends silent audio.
Applies to v2 and later models only. |
inverse_text_normalization_enabled | boolean | No | Specifies whether to enable Inverse Text Normalization (ITN). When enabled, Chinese numerals are converted to Arabic numerals (enabled by default). Applies to v2 and later models only. |
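Putting the tables above together, a minimal helper can assemble a valid run-task instruction. This is a sketch: the function name and parameter defaults are illustrative, not part of the API.

```python
import json
import uuid

def build_run_task(model="paraformer-realtime-v2",
                   audio_format="pcm",
                   sample_rate=16000):
    """Return (task_id, run-task JSON string) following the parameter tables above."""
    task_id = uuid.uuid4().hex  # 32 random hex characters; hyphenated UUIDs also work
    instruction = {
        "header": {
            "action": "run-task",
            "task_id": task_id,
            "streaming": "duplex",
        },
        "payload": {
            "task_group": "audio",
            "task": "asr",
            "function": "recognition",
            "model": model,
            "parameters": {
                "format": audio_format,
                "sample_rate": sample_rate,
            },
            "input": {},
        },
    }
    return task_id, json.dumps(instruction)
```

Keep the returned task_id: the finish-task instruction must reuse it.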
2. finish-task instruction: End a task
This instruction ends the speech recognition task. The client sends this instruction after all audio has been sent.
When to send: After the audio is completely sent.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | The instruction type. Fixed value: "finish-task". |
header.task_id | string | Yes | The current task ID. Must be the same as the task_id you used to send the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameters:
Parameter | Type | Required | Description |
payload.input | object | Yes | Fixed format: {}. |
Binary audio (Client → Server)
Send audio after receiving the task-started event. You can use real-time audio (microphone) or file audio, which must be single-channel. Upload the audio via the WebSocket binary channel, and we recommend sending 100 ms of audio every 100 ms.
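The recommended pacing of 100 ms of audio every 100 ms translates into a chunk size that depends on the sample rate. A sketch of the arithmetic, assuming raw 16-bit mono PCM (the helper name is illustrative):

```python
def pcm_chunk_size(sample_rate, chunk_ms=100, sample_width=2, channels=1):
    """Number of bytes in one chunk of raw PCM audio.

    bytes = samples_per_second * bytes_per_sample * channels * seconds
    """
    return sample_rate * sample_width * channels * chunk_ms // 1000

# 16 kHz, 16-bit mono: one 100 ms chunk is 3200 bytes; at 8 kHz it is 1600 bytes.
```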
Events (Server → Client)
Events are JSON messages from server to client representing different processing stages.
1. task-started event: Task has started
The task-started event confirms successful task start. Wait for this event before sending audio or the finish-task instruction — otherwise the task will fail.
The payload of the task-started event has no content.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "task-started". |
header.task_id | string | The task_id generated by the client. |
2. result-generated event: Speech recognition result
While the client sends the audio for recognition and the finish-task instruction, the server continuously returns the result-generated event, which contains the speech recognition result.
You can determine whether the result is intermediate or final by checking whether payload.output.sentence.end_time in the result-generated event is null.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"begin_time": 170,
"end_time": null,
"text": "Okay, I got it",
"heartbeat": false,
"sentence_end": true,
"emo_tag": "neutral", // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
"emo_confidence": 0.914, // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
"words": [
{
"begin_time": 170,
"end_time": 295,
"text": "Okay",
"punctuation": ","
},
{
"begin_time": 295,
"end_time": 503,
"text": "I",
"punctuation": ""
},
{
"begin_time": 503,
"end_time": 711,
"text": "got",
"punctuation": ""
},
{
"begin_time": 711,
"end_time": 920,
"text": "it",
"punctuation": ""
}
]
}
},
"usage": {
"duration": 3
}
}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "result-generated". |
header.task_id | string | The task_id generated by the client. |
payload parameters:
Parameter | Type | Description |
output | object | output.sentence is the recognition result. See the following text for details. |
usage | object | The usage information. See the format of payload.usage below. |
The format of payload.usage is as follows:
Parameter | Type | Description |
duration | integer | The billable duration of the task, in seconds. |
The format of payload.output.sentence is as follows:
Parameter | Type | Description |
begin_time | integer | The start time of the sentence, in ms. |
end_time | integer | null | The end time of the sentence, in ms. If this is an intermediate recognition result, the value is null. |
text | string | The recognized text. |
words | array | Character timestamp information. |
heartbeat | boolean | null | If this value is true, you can skip processing the recognition result. |
sentence_end | boolean | Indicates whether the given sentence has ended. |
emo_tag | string | The emotion of the current sentence. Valid values: positive, negative, and neutral.
Emotion recognition has the following constraints:
- It is supported only by the paraformer-realtime-8k-v2 model.
- semantic_punctuation_enabled must be false (the default).
- The field is returned only when sentence_end is true. |
emo_confidence | number | The confidence level of the emotion recognized in the current sentence. The value ranges from 0.0 to 1.0. A larger value indicates a higher confidence level.
Emotion recognition has the following constraints:
- It is supported only by the paraformer-realtime-8k-v2 model.
- semantic_punctuation_enabled must be false (the default).
- The field is returned only when sentence_end is true. |
payload.output.sentence.words is a list of character timestamps, where each word has the following format:
Parameter | Type | Description |
begin_time | integer | The start time of the character, in ms. |
end_time | integer | The end time of the character, in ms. |
text | string | The character. |
punctuation | string | The punctuation mark. |
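The end_time and heartbeat fields described above can be combined into a small event handler. This is a sketch; the function name and return shape are illustrative:

```python
def handle_result_generated(event):
    """Classify a result-generated event.

    Returns (is_final, text). Per the field tables above, end_time is null
    for intermediate results, and heartbeat results can be skipped entirely.
    """
    sentence = event["payload"]["output"]["sentence"]
    if sentence.get("heartbeat"):
        return False, ""  # heartbeat message: nothing to process
    return sentence.get("end_time") is not None, sentence.get("text", "")
```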
3. task-finished event: Task has ended
When you receive the task-finished event, the task has ended. Close the WebSocket connection and terminate the program.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {}
},
"payload": {
"output": {},
"usage": null
}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "task-finished". |
header.task_id | string | The task_id generated by the client. |
4. task-failed event: Task failed
If you receive a task-failed event, close the connection and analyze the error message. Fix your code if the failure is due to a programming issue.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "CLIENT_ERROR",
"error_message": "request timeout after 23 seconds.",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. Fixed value: "task-failed". |
header.task_id | string | The task_id generated by the client. |
header.error_code | string | A description of the error type. |
header.error_message | string | The specific reason for the error. |
Connection overhead and reuse
The WebSocket service supports connection reuse to reduce overhead.
The server starts a task on receiving a run-task instruction. After the client sends a finish-task instruction and receives the task-finished event, the connection can be reused by sending a new run-task instruction.
Different tasks within a reused connection must use different task_ids.
If a failure occurs during task execution, the service will still return a task-failed event and close the connection. This connection cannot be reused.
If no new task is started within 60 seconds after a task ends, the connection automatically times out and disconnects.
Code examples
These examples show a basic implementation; adapt them to your scenarios. WebSocket clients typically use asynchronous programming or multithreading to send and receive messages simultaneously.
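The full interaction flow can be sketched in Python. This is a minimal, hedged sketch, not a definitive implementation: it assumes the third-party websocket-client package (pip install websocket-client), a DASHSCOPE_API_KEY environment variable, and 16 kHz 16-bit mono PCM input; the helper names are illustrative.

```python
import json
import os
import threading
import uuid

WS_URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"

def instruction(action, task_id):
    """Build a run-task or finish-task instruction as documented above."""
    payload = {"input": {}}
    if action == "run-task":
        payload.update({
            "task_group": "audio", "task": "asr", "function": "recognition",
            "model": "paraformer-realtime-v2",
            "parameters": {"format": "pcm", "sample_rate": 16000},
        })
    return json.dumps({
        "header": {"action": action, "task_id": task_id, "streaming": "duplex"},
        "payload": payload,
    })

def recognize(pcm_chunks):
    """Stream PCM chunks to the service and return the final sentence texts."""
    import websocket  # third-party websocket-client package

    ws = websocket.create_connection(
        WS_URL, header={"Authorization": "bearer " + os.environ["DASHSCOPE_API_KEY"]})
    task_id = uuid.uuid4().hex
    started, results = threading.Event(), []

    def receive():
        while True:
            msg = json.loads(ws.recv())
            event = msg["header"]["event"]
            if event == "task-started":
                started.set()
            elif event == "result-generated":
                sentence = msg["payload"]["output"]["sentence"]
                if sentence.get("end_time") is not None:  # final result
                    results.append(sentence["text"])
            elif event in ("task-finished", "task-failed"):
                started.set()  # unblock the sender if the task never started
                return

    receiver = threading.Thread(target=receive, daemon=True)
    ws.send(instruction("run-task", task_id))
    receiver.start()
    started.wait()  # never send audio before task-started arrives
    for chunk in pcm_chunks:
        ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
    ws.send(instruction("finish-task", task_id))
    receiver.join()
    ws.close()
    return results
```

In a production client you would also pace the audio (about 100 ms per chunk), handle task-failed explicitly, and reuse the connection for subsequent tasks with fresh task_ids.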
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How to maintain a persistent connection with the server during long periods of silence?
Set the heartbeat parameter to true and continuously send silent audio to the server.
Silent audio: audio with no sound signal. Generate it with editing software (Audacity, Adobe Audition) or FFmpeg.
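Silence can also be generated directly in code. As a sketch, assuming the raw 16-bit mono PCM format described above (the helper name is illustrative), a silent buffer is simply zero bytes:

```python
def silent_pcm(duration_ms, sample_rate=16000, sample_width=2):
    """Raw mono PCM silence: every sample byte is zero."""
    return b"\x00" * (sample_rate * sample_width * duration_ms // 1000)

# For example, send silent_pcm(100) every 100 ms over the binary channel
# during pauses to keep the connection alive when heartbeat is enabled.
```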
Q: How to convert an audio format to the required format?
You can use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites an existing file (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext
# Example: WAV to MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus
Q: Can I view the time range for each sentence?
Yes. Results include start/end timestamps for each sentence to determine time ranges.
Q: Why use WebSocket instead of HTTP/HTTPS? Why not provide a RESTful API?
The Speech Service uses WebSocket instead of HTTP/HTTPS or RESTful APIs because it requires full-duplex communication. WebSocket allows both the server and client to proactively push data, such as real-time progress updates for synthesis or recognition. RESTful APIs over HTTP only support client-initiated request-response cycles and cannot meet real-time interaction requirements.
Q: How to recognize a local file (recording file)?
Convert the local file into a binary audio stream and upload it through the WebSocket binary channel using the send method. For a complete example, see the Code examples section.
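Reading a local file in fixed-size chunks can be sketched as follows. The chunk size assumes 16 kHz 16-bit mono PCM (3200 bytes per 100 ms); the helper name is illustrative:

```python
def stream_file(path, chunk_size=3200):
    """Yield fixed-size binary chunks of a local audio file.

    3200 bytes is 100 ms of 16 kHz 16-bit mono PCM; each chunk would be
    sent over the WebSocket binary channel after task-started arrives.
    """
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```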
Troubleshooting
If an error occurs in your code, refer to Error codes for troubleshooting.
Q: Why is there no recognition result?
Verify that the audio format and sampleRate/sample_rate match the parameter constraints. Common errors:
- The audio file has a .wav extension but is actually in MP3 format, and the format parameter is incorrectly set to mp3.
- The audio sample rate is 3600 Hz, but the sampleRate/sample_rate parameter is incorrectly set to 48000.
Use ffprobe to check audio info (container, encoding, sample rate, channels):
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio. For example, the audio is in Chinese, but language_hints is set to en (English).