This document applies only to the China (Beijing) region. To use the models, you must use a China (Beijing) region API key.
This topic describes how to access the real-time speech recognition service through a WebSocket connection.
The DashScope SDK currently supports only Java and Python. To develop Paraformer real-time speech recognition applications in other programming languages, you can communicate with the service through a WebSocket connection.
User guide: For model descriptions and selection guidance, see Real-time speech recognition - Fun-ASR/Paraformer.
WebSocket is a network protocol that supports full-duplex communication. The client and server establish a persistent connection with a single handshake, which allows both parties to actively push data to each other. This provides significant advantages in real-time performance and efficiency.
For common programming languages, many ready-to-use WebSocket libraries and examples are available, such as:
Go: gorilla/websocket
PHP: Ratchet
Node.js: ws
Familiarize yourself with the basic principles and technical details of WebSocket before you begin development.
Prerequisites
You have activated Alibaba Cloud Model Studio and created an API key. To prevent security risks, export the API key as an environment variable instead of hard-coding it in your code.
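For example, if you export the key in an environment variable named DASHSCOPE_API_KEY (the variable name assumed in the sketches in this topic; adjust it to your setup), you can read it in Python as follows:

```python
import os

# Assumption: the API key was exported as the DASHSCOPE_API_KEY environment variable.
api_key = os.environ["DASHSCOPE_API_KEY"]
```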
To grant temporary access permissions to third-party applications or users, or if you want to strictly control high-risk operations such as accessing or deleting sensitive data, we recommend that you use a temporary authentication token.
Compared with long-term API keys, temporary authentication tokens are more secure because they are short-lived (60 seconds). They are suitable for temporary call scenarios and can effectively reduce the risk of API key leakage.
To use a temporary token, replace the API key used for authentication in your code with the temporary authentication token.
Model availability
Feature | paraformer-realtime-v2 | paraformer-realtime-8k-v2 |
Scenarios | Scenarios such as live streaming and meetings | Recognition scenarios for 8 kHz audio, such as telephone customer service and voicemail |
Sample rate | Any | 8 kHz |
Languages | Chinese (including Mandarin and various dialects), English, Japanese, Korean, German, French, and Russian. Supported Chinese dialects: Shanghainese, Wu, Minnan, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, and Cantonese | Chinese |
Punctuation prediction | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
Inverse text normalization (ITN) | ✅ Supported by default. No configuration is required. | ✅ Supported by default. No configuration is required. |
Custom vocabulary | ✅ See Custom hotwords | ✅ See Custom hotwords |
Specify recognition language | ✅ Specify using the language_hints parameter | ❌ |
Emotion recognition | ❌ | ✅ See the emo_tag and emo_confidence fields of the result-generated event |
Interaction flow
The client sends two types of messages to the server: instructions in JSON format and binary audio (must be single-channel audio). Messages returned from the server to the client are called events.
The interaction flow between the client and the server, in chronological order, is as follows:
Establish a connection: The client establishes a WebSocket connection with the server.
Start a task:
The client sends the run-task instruction to start the task.
The client receives the task-started event from the server, which indicates that the task has started successfully.
Send an audio stream:
The client starts sending binary audio and simultaneously receives the result-generated event from the server, which contains the speech recognition result.
Notify the server to end the task:
The client sends the finish-task instruction to notify the server to end the task, and continues to receive the result-generated event returned by the server.
End the task:
The client receives the task-finished event from the server, which marks the end of the task.
Close the connection: The client closes the WebSocket connection.
URL
The WebSocket URL is as follows:
wss://dashscope.aliyuncs.com/api-ws/v1/inference
Headers
Parameter | Type | Required | Description |
Authorization | string | Yes | The authentication token. The format is `bearer <your API key>`. |
user-agent | string | No | The client identifier. This helps the server track the source of the request. |
X-DashScope-WorkSpace | string | No | Model Studio workspace ID. |
X-DashScope-DataInspection | string | No | Specifies whether to enable the data compliance check feature. This feature is disabled by default. |
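The following is a minimal connection sketch that sets these headers. It assumes the third-party websocket-client package (pip install websocket-client) and the DASHSCOPE_API_KEY environment variable; any WebSocket library that lets you set handshake headers works the same way.

```python
import os

import websocket  # third-party package: websocket-client

URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"

headers = [
    "Authorization: bearer " + os.environ["DASHSCOPE_API_KEY"],
    "user-agent: my-asr-client/1.0",             # optional client identifier (example value)
    # "X-DashScope-WorkSpace: <workspace id>",   # optional Model Studio workspace ID
]

# create_connection performs the WebSocket handshake with the headers above.
ws = websocket.create_connection(URL, header=headers)
```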
Instructions (Client → Server)
Instructions are messages sent from the client to the server. They are in JSON format, sent as Text Frames, and are used to control the start and end of a task and to mark task boundaries.
The binary audio (must be single-channel) sent from the client to the server is not included in any instruction and must be sent separately.
Send instructions in the following strict order. Otherwise, the task may fail:
1. Send the run-task instruction: Starts the speech recognition task. The task_id used in this instruction is also required for the subsequent finish-task instruction and must be the same.
2. Send binary audio (mono): Sends the audio for recognition. You must send the audio only after you receive the task-started event from the server.
3. Send the finish-task instruction: Ends the speech recognition task. Send this instruction after all audio has been sent.
1. run-task instruction: Start a task
This instruction starts a speech recognition task. The same task_id must also be used later when you send the finish-task instruction.
When to send: After the WebSocket connection is established.
Example:
{
"header": {
"action": "run-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", // random uuid
"streaming": "duplex"
},
"payload": {
"task_group": "audio",
"task": "asr",
"function": "recognition",
"model": "paraformer-realtime-v2",
"parameters": {
"format": "pcm", // Audio format
"sample_rate": 16000, // Sample rate
"vocabulary_id": "vocab-xxx-24ee19fa8cfb4d52902170a0xxxxxxxx", // Hotword ID supported by paraformer-realtime-v2
"disfluency_removal_enabled": false, // Filter disfluent words
"language_hints": [
"en"
] // Specify language, only supported by the paraformer-realtime-v2 model
},
"resources": [ // If you are not using the custom vocabulary feature, do not pass the resources parameter
{
"resource_id": "xxxxxxxxxxxx", // Hotword ID supported by paraformer-realtime-v2
"resource_type": "asr_phrase"
}
],
"input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "run-task". |
header.task_id | string | Yes | The current task ID: a universally unique identifier (UUID) consisting of 32 randomly generated letters and digits. It can include hyphens (for example, 2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx). When you later send the finish-task instruction, use the same task_id that you used for the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameters:
Parameter | Type | Required | Description |
payload.task_group | string | Yes | Fixed string: "audio". |
payload.task | string | Yes | Fixed string: "asr". |
payload.function | string | Yes | Fixed string: "recognition". |
payload.model | string | Yes | The name of the model. For a list of supported models, see Model List. |
payload.input | object | Yes | Fixed format: {}. |
payload.parameters | |||
format | string | Yes | The format of the audio to be recognized. Supported audio formats: pcm, wav, mp3, opus, speex, aac, and amr. Important: opus/speex must be encapsulated in Ogg; wav must be PCM encoded; amr supports only the AMR-NB type. |
sample_rate | integer | Yes | The audio sample rate, in Hz. This parameter varies by model: paraformer-realtime-v2 supports any sample rate; paraformer-realtime-8k-v2 supports only 8000 Hz. |
vocabulary_id | string | No | The ID of the hotword vocabulary. This parameter takes effect only when it is set. Use this field to set the hotword ID for v2 and later models. The hotword information for this hotword ID is applied to the speech recognition request. For more information, see Custom vocabulary. |
disfluency_removal_enabled | boolean | No | Specifies whether to filter out disfluent words: true: filters out disfluent words. false (default): does not filter out disfluent words. |
language_hints | array[string] | No | The language code of the language to be recognized. If you cannot determine the language in advance, you can leave this parameter unset; the model then detects the language automatically. Currently supported language codes: zh (Chinese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), and ru (Russian). This parameter applies only to models that support multiple languages (see Model List). |
semantic_punctuation_enabled | boolean | No | Specifies whether to enable semantic sentence segmentation. This feature is disabled by default. true: uses semantic sentence segmentation. false (default): uses VAD (voice activity detection) sentence segmentation. Semantic sentence segmentation provides higher accuracy and is suitable for meeting transcription scenarios. VAD sentence segmentation has lower latency and is suitable for interactive scenarios. By adjusting the semantic_punctuation_enabled parameter, you can switch between the two segmentation modes to suit your scenario. This parameter is effective only for v2 and later models. |
max_sentence_silence | integer | No | The silence duration threshold for VAD sentence segmentation, in ms. If the silence duration after a speech segment exceeds this threshold, the system determines that the sentence has ended. The parameter ranges from 200 ms to 6000 ms. The default value is 800 ms. This parameter is effective only when the semantic_punctuation_enabled parameter is false (VAD sentence segmentation). |
multi_threshold_mode_enabled | boolean | No | If this parameter is set to `true`, it prevents VAD from producing sentences that are too long. This feature is disabled by default. This parameter is effective only when the semantic_punctuation_enabled parameter is false (VAD sentence segmentation). |
punctuation_prediction_enabled | boolean | No | Specifies whether to automatically add punctuation to the recognition results: true (default): adds punctuation. false: does not add punctuation. This parameter is effective only for v2 and later models. |
heartbeat | boolean | No | Controls whether to maintain a persistent connection with the server: true: keeps the connection alive during long periods of silence, provided that the client continuously sends silent audio to the server. false (default): the connection may time out and close during long periods of silence. This parameter is effective only for v2 and later models. |
inverse_text_normalization_enabled | boolean | No | Specifies whether to enable Inverse Text Normalization (ITN). This feature is enabled by default (`true`). When enabled, Chinese numerals are converted to Arabic numerals. This parameter is effective only for v2 and later models. |
payload.resources (This is a list. Do not pass this parameter if you are not using the custom vocabulary feature.) | |||
resource_type | string | No | Fixed string "asr_phrase". |
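As a reference, the following sketch builds and sends a run-task instruction over an open connection. The ws object comes from the connection sketch above; the model and parameter values are examples, not recommendations.

```python
import json
import uuid

task_id = uuid.uuid4().hex  # 32-character task ID; reuse it later in finish-task

run_task = {
    "header": {
        "action": "run-task",
        "task_id": task_id,
        "streaming": "duplex",
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "paraformer-realtime-v2",
        "parameters": {"format": "pcm", "sample_rate": 16000},
        "input": {},
    },
}

# Instructions are JSON documents sent as text frames.
ws.send(json.dumps(run_task))
```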
2. finish-task instruction: End a task
This instruction ends the speech recognition task. The client sends this instruction after all audio has been sent.
When to send: After the audio is completely sent.
Example:
{
"header": {
"action": "finish-task",
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"streaming": "duplex"
},
"payload": {
"input": {}
}
}
header parameters:
Parameter | Type | Required | Description |
header.action | string | Yes | The instruction type. For this instruction, the value is fixed as "finish-task". |
header.task_id | string | Yes | The current task ID. Must be the same as the task_id you used to send the run-task instruction. |
header.streaming | string | Yes | Fixed string: "duplex" |
payload parameters:
Parameter | Type | Required | Description |
payload.input | object | Yes | Fixed format: {}. |
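A minimal sketch of this step, assuming the ws connection and task_id from the earlier sketches and that all audio has already been sent:

```python
import json

finish_task = {
    "header": {
        "action": "finish-task",
        "task_id": task_id,   # must match the task_id used in run-task
        "streaming": "duplex",
    },
    "payload": {"input": {}},
}
ws.send(json.dumps(finish_task))

# Keep receiving events until the server reports that the task has ended or failed.
while True:
    event = json.loads(ws.recv())
    if event["header"]["event"] in ("task-finished", "task-failed"):
        break
```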
Binary audio (Client → Server)
The client must send the audio stream for recognition after receiving the task-started event.
You can send a real-time audio stream (for example, from a microphone) or an audio stream from a recording file. The audio must be single-channel.
The audio is uploaded through the WebSocket binary channel. We recommend sending 100 ms of audio every 100 ms.
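For 16 kHz, 16-bit, mono PCM, 100 ms of audio is 16000 × 2 × 0.1 = 3200 bytes. The following sketch paces such a stream; it assumes an open websocket-client connection ws, a received task-started event, and a hypothetical raw PCM file named audio.pcm.

```python
import time

import websocket

CHUNK_BYTES = 3200  # 100 ms of 16 kHz, 16-bit, mono PCM

with open("audio.pcm", "rb") as f:          # hypothetical raw PCM file
    while chunk := f.read(CHUNK_BYTES):
        # Audio is sent over the binary channel, not as a JSON instruction.
        ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
        time.sleep(0.1)                     # pace the stream in real time
```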
Events (Server → Client)
Events are messages returned from the server to the client. They are in JSON format and represent different processing stages.
1. task-started event: Task has started
The task-started event from the server indicates that the task has started successfully. You must wait to receive this event before you send the audio for recognition or the finish-task instruction. If you send them before receiving this event, the task will fail.
The payload of the task-started event has no content.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-started",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. For this event, the value is fixed as "task-started". |
header.task_id | string | The task_id generated by the client. |
2. result-generated event: Speech recognition result
While the client sends the audio for recognition and the finish-task instruction, the server continuously returns the result-generated event, which contains the speech recognition result.
You can determine whether the result is intermediate or final by checking whether payload.output.sentence.end_time in the result-generated event is null.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "result-generated",
"attributes": {}
},
"payload": {
"output": {
"sentence": {
"begin_time": 170,
"end_time": null,
"text": "Okay, I got it",
"heartbeat": false,
"sentence_end": true,
"emo_tag": "neutral", // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
"emo_confidence": 0.914, // This field is displayed only when the model parameter is set to paraformer-realtime-8k-v2, the semantic_punctuation_enabled parameter is false, and sentence_end in the result-generated event is true
"words": [
{
"begin_time": 170,
"end_time": 295,
"text": "Okay",
"punctuation": ","
},
{
"begin_time": 295,
"end_time": 503,
"text": "I",
"punctuation": ""
},
{
"begin_time": 503,
"end_time": 711,
"text": "got",
"punctuation": ""
},
{
"begin_time": 711,
"end_time": 920,
"text": "it",
"punctuation": ""
}
]
}
},
"usage": {
"duration": 3
}
}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. For this event, the value is fixed as "result-generated". |
header.task_id | string | The task_id generated by the client. |
payload parameters:
Parameter | Type | Description |
output | object | output.sentence is the recognition result. See the following text for details. |
usage | object | The billing information. When the result is a final sentence (sentence_end is true), this field contains the billable duration. Otherwise, this field is null. |
The format of payload.usage is as follows:
Parameter | Type | Description |
duration | integer | The billable duration of the task, in seconds. |
The format of payload.output.sentence is as follows:
Parameter | Type | Description |
begin_time | integer | The start time of the sentence, in ms. |
end_time | integer or null | The end time of the sentence, in ms. If this is an intermediate recognition result, the value is null. |
text | string | The recognized text. |
words | array | Character timestamp information. |
heartbeat | boolean or null | If this value is true, you can skip processing the recognition result. |
sentence_end | boolean | Indicates whether the given sentence has ended. |
emo_tag | string | The emotion of the current sentence, for example, neutral. Emotion recognition has the following constraints: it is available only when the model parameter is set to paraformer-realtime-8k-v2 and the semantic_punctuation_enabled parameter is false, and the field is returned only when sentence_end is true. |
emo_confidence | number | The confidence level of the emotion recognized in the current sentence. The value ranges from 0.0 to 1.0. A larger value indicates a higher confidence level. Emotion recognition has the following constraints: it is available only when the model parameter is set to paraformer-realtime-8k-v2 and the semantic_punctuation_enabled parameter is false, and the field is returned only when sentence_end is true. |
payload.output.sentence.words is a list of character timestamps, where each word has the following format:
Parameter | Type | Description |
begin_time | integer | The start time of the character, in ms. |
end_time | integer | The end time of the character, in ms. |
text | string | The character. |
punctuation | string | The punctuation mark. |
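The following sketch shows one way a client might consume this event, treating results whose end_time is null as intermediate and the rest as final. The handler name is hypothetical.

```python
import json

def handle_result_generated(message: str) -> None:
    """Print intermediate results and final sentences from a result-generated event."""
    event = json.loads(message)
    sentence = event["payload"]["output"]["sentence"]
    if sentence.get("heartbeat"):
        return                               # heartbeat result: nothing to process
    if sentence.get("end_time") is None:
        print("partial:", sentence["text"])  # intermediate recognition result
    else:
        print("final  :", sentence["text"])  # final result for this sentence
```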
3. task-finished event: Task has ended
When you receive the task-finished event from the server, the task has ended. At this point, you can close the WebSocket connection and terminate the program.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-finished",
"attributes": {}
},
"payload": {
"output": {},
"usage": null
}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. For this event, the value is fixed as "task-finished". |
header.task_id | string | The task_id generated by the client. |
4. task-failed event: Task failed
If you receive a task-failed event, the task has failed. You must close the WebSocket connection and handle the error. Analyze the error message to determine the cause. If the failure is due to a programming issue, adjust your code to fix it.
Example:
{
"header": {
"task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
"event": "task-failed",
"error_code": "CLIENT_ERROR",
"error_message": "request timeout after 23 seconds.",
"attributes": {}
},
"payload": {}
}
header parameters:
Parameter | Type | Description |
header.event | string | The event type. For this event, the value is fixed as "task-failed". |
header.task_id | string | The task_id generated by the client. |
header.error_code | string | A description of the error type. |
header.error_message | string | The specific reason for the error. |
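Putting the four events together, a client typically routes each incoming message by header.event and stops on task-finished or task-failed. The dispatcher below is a sketch; dispatch_event and handle_result_generated are hypothetical helper names.

```python
import json

def dispatch_event(ws, message: str) -> None:
    """Route a server event to the matching handler and close the connection when done."""
    event = json.loads(message)
    name = event["header"]["event"]
    if name == "task-started":
        print("task started; audio can now be sent")
    elif name == "result-generated":
        handle_result_generated(message)     # see the sketch in the previous section
    elif name == "task-finished":
        ws.close()
    elif name == "task-failed":
        ws.close()
        raise RuntimeError(
            f"{event['header']['error_code']}: {event['header']['error_message']}"
        )
```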
Connection overhead and reuse
The WebSocket service supports connection reuse to improve resource utilization and avoid connection establishment overhead.
The server starts a new task when it receives a run-task instruction from the client. When the client sends a finish-task instruction, the server returns a task-finished event to end the task. After the task ends, the WebSocket connection can be reused. The client can start another task by sending a new run-task instruction.
Different tasks within a reused connection must use different task_ids.
If a failure occurs during task execution, the service returns a task-failed event and closes the connection. This connection cannot be reused.
If no new task is started within 60 seconds after a task ends, the connection automatically times out and disconnects.
Code examples
The code examples provide a basic implementation to help you run the service. You must develop the code for your specific business scenarios.
When writing WebSocket client code, asynchronous programming is typically used so that messages can be sent and received at the same time. A minimal end-to-end sketch of this pattern is shown below.
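The sketch uses the third-party websocket-client package and assumes the DASHSCOPE_API_KEY environment variable and a hypothetical 16 kHz, 16-bit, mono PCM file named audio.pcm; error handling and reconnection are omitted.

```python
import json
import os
import threading
import time
import uuid

import websocket  # third-party package: websocket-client

URL = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"
TASK_ID = uuid.uuid4().hex
started = threading.Event()      # set when the task-started event arrives

def send_audio(ws):
    started.wait()               # never send audio before task-started
    with open("audio.pcm", "rb") as f:
        while chunk := f.read(3200):                            # 100 ms of audio
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(0.1)
    ws.send(json.dumps({
        "header": {"action": "finish-task", "task_id": TASK_ID, "streaming": "duplex"},
        "payload": {"input": {}},
    }))

def on_open(ws):
    ws.send(json.dumps({
        "header": {"action": "run-task", "task_id": TASK_ID, "streaming": "duplex"},
        "payload": {
            "task_group": "audio", "task": "asr", "function": "recognition",
            "model": "paraformer-realtime-v2",
            "parameters": {"format": "pcm", "sample_rate": 16000},
            "input": {},
        },
    }))
    threading.Thread(target=send_audio, args=(ws,), daemon=True).start()

def on_message(ws, message):
    event = json.loads(message)
    name = event["header"]["event"]
    if name == "task-started":
        started.set()
    elif name == "result-generated":
        print(event["payload"]["output"]["sentence"]["text"])
    elif name in ("task-finished", "task-failed"):
        if name == "task-failed":
            print("error:", event["header"]["error_message"])
        ws.close()

app = websocket.WebSocketApp(
    URL,
    header=["Authorization: bearer " + os.environ["DASHSCOPE_API_KEY"]],
    on_open=on_open,
    on_message=on_message,
)
app.run_forever()
```

In this sketch, the audio is sent from a separate thread while run_forever drives the receiving callbacks, which is one common way to achieve the concurrent send and receive behavior described above.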
Error codes
If an error occurs, see Error messages for troubleshooting.
If the problem persists, join the developer group to report the issue. Provide the Request ID to help us investigate the issue.
FAQ
Features
Q: How to maintain a persistent connection with the server during long periods of silence?
You can set the heartbeat request parameter to true and continuously send silent audio to the server.
Silent audio refers to content in an audio file or data stream that has no sound signal. You can generate silent audio in several ways, for example with audio editing software such as Audacity or Adobe Audition, or with command-line tools such as FFmpeg.
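For 16-bit PCM, silence is simply all-zero samples, so it can also be generated directly in code. The sketch below assumes an open websocket-client connection ws, a running task whose run-task parameters set "heartbeat": true, and the same 16 kHz mono PCM format as the earlier examples.

```python
import time

import websocket

SILENCE_100MS = b"\x00" * 3200   # 100 ms of 16 kHz, 16-bit, mono PCM silence

# Send silent audio while there is no real speech, so the connection stays alive.
for _ in range(50):              # about 5 seconds of silence
    ws.send(SILENCE_100MS, opcode=websocket.ABNF.OPCODE_BINARY)
    time.sleep(0.1)
```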
Q: How to convert an audio format to the required format?
You can use the FFmpeg tool. For more information, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: audio.wav
# -c:a: Specifies the audio encoder. Examples: aac, libmp3lame, pcm_s16le
# -b:a: Specifies the bit rate (controls audio quality). Examples: 192k, 320k
# -ar: Specifies the sample rate. Examples: 44100 (CD), 48000, 16000
# -ac: Specifies the number of sound channels. Examples: 1 (mono), 2 (stereo)
# -y: Overwrites an existing file (no value needed).
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac num_channels output.ext
# Example: WAV to MP3 (maintain original quality)
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# Example: MP3 to WAV (16-bit PCM standard format)
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# Example: M4A to AAC (extract/convert Apple audio)
ffmpeg -i input.m4a -c:a copy output.aac # Directly extract without re-encoding
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac # Re-encode to improve quality
# Example: FLAC lossless to Opus (high compression)
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Q: Can I view the time range for each sentence?
Yes, you can. The speech recognition results include the start and end timestamps for each sentence. You can use these timestamps to determine the time range of each sentence.
Q: Why use the WebSocket protocol instead of the HTTP/HTTPS protocol? Why not provide a RESTful API?
Voice Service uses WebSocket instead of HTTP, HTTPS, or RESTful because it requires full-duplex communication. WebSocket allows the server and client to actively exchange data in both directions, such as pushing real-time speech synthesis or recognition progress. In contrast, HTTP-based RESTful APIs only support a one-way, client-initiated request-response model, which is unsuitable for real-time interaction.
Q: How to recognize a local file (recording file)?
Convert the local file into a binary audio stream and upload the stream for recognition through the binary channel of the WebSocket. You can typically do this using the send method of a WebSocket library. A code snippet is shown below. For a complete example, see Code examples.
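A minimal sketch of this, assuming an open websocket-client connection ws, a received task-started event, and a hypothetical recording named recording.wav whose container and sample rate match the format and sample_rate set in run-task:

```python
import time

import websocket

with open("recording.wav", "rb") as f:      # hypothetical local recording
    while chunk := f.read(3200):            # roughly 100 ms of audio per chunk
        ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
        time.sleep(0.1)

# After the whole file has been sent, send the finish-task instruction.
```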
Troubleshooting
If an error occurs in your code, refer to Error codes for troubleshooting.
Q: Why is there no recognition result?
Check whether the audio format and sampleRate/sample_rate in the request parameters are set correctly and meet the parameter constraints. The following are common errors:
The audio file has a .wav extension but is actually in MP3 format, and the format parameter is incorrectly set to wav.
The audio sample rate is 3600 Hz, but the sampleRate/sample_rate parameter is incorrectly set to 48000.
You can use the ffprobe tool to obtain audio information, such as the container, encoding, sample rate, and sound channels:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
When you use the paraformer-realtime-v2 model, check whether the language set in language_hints matches the actual language of the audio. For example, the audio is in Chinese, but language_hints is set to en (English).
If all the preceding checks pass, you can use custom hotwords to improve the recognition of specific words.