Paraformer real-time speech recognition client events - Alibaba Cloud Model Studio

Two WebSocket client events control a Paraformer real-time speech recognition task: run-task starts the task with the model and audio settings, and finish-task ends the task after the audio stream completes. This page describes the message structure and field semantics of both events.

User guide: For model details and selection guidance, see Speech-to-text.

Event flow: For the event interaction sequence, see WebSocket API.

run-task

Description: Starts a speech recognition task and configures parameters such as the model, audio format, and sample rate.

When to send: Immediately after the WebSocket connection is established.

Response event: The server returns the task-started event. Wait for this event before sending audio data.

header object (Required)

Properties

action string (Required)

Instruction type. Set to run-task.

task_id string (Required)

Client-generated task ID in UUID format. Used to correlate subsequent events with this task.

streaming string (Required)

Streaming mode. Set to duplex.

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "asr",
        "function": "recognition",
        "model": "paraformer-realtime-v2",
        "parameters": {
            "format": "pcm",
            "sample_rate": 16000,
            "disfluency_removal_enabled": false,
            "language_hints": [
                "en"
            ]
        },
        "input": {}
    }
}
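The message above can also be assembled programmatically. The following is a minimal Python sketch, not an official SDK call: build_run_task is a hypothetical helper name, and the only inputs it uses are the field names and fixed values documented on this page. It generates the task_id with the standard uuid module and returns it so that later events can reuse it.

```python
import json
import uuid


def build_run_task(model, audio_format, sample_rate, **optional_parameters):
    """Return (task_id, message_text) for a run-task event.

    build_run_task is a hypothetical helper name, not part of any SDK.
    """
    task_id = str(uuid.uuid4())  # client-generated ID in UUID format
    message = {
        "header": {
            "action": "run-task",
            "task_id": task_id,
            "streaming": "duplex",
        },
        "payload": {
            "task_group": "audio",
            "task": "asr",
            "function": "recognition",
            "model": model,
            "parameters": {
                "format": audio_format,
                "sample_rate": sample_rate,
                **optional_parameters,  # e.g. language_hints, vocabulary_id
            },
            "input": {},
        },
    }
    return task_id, json.dumps(message)


# Reproduces the sample message, except that task_id is freshly generated.
task_id, run_task_text = build_run_task(
    "paraformer-realtime-v2", "pcm", 16000,
    disfluency_removal_enabled=False, language_hints=["en"],
)
```

Keep the returned task_id: the finish-task event must carry the same value.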

payload object (Required)

Properties

task_group string (Required)

Task group. Set to audio.

task string (Required)

Task type. Set to asr.

function string (Required)

Function type. Set to recognition.

model string (Required)

Model name, for example paraformer-realtime-v2 or paraformer-realtime-8k-v2.

input object (Required)

Set to {}.

parameters object (Required)

Speech recognition parameters.

Properties

format string (Required)

Audio format.

Valid values:

pcm
wav
mp3
opus
speex
aac
amr

Important

Paraformer enforces the following constraints:

opus and speex: Must use Ogg encapsulation.
wav: Must use PCM encoding.
amr: Only AMR-NB is supported.

sample_rate integer (Required)

Sample rate, in Hz.

Valid values:

Paraformer (varies by model):
- paraformer-realtime-v2: Any sample rate.
- paraformer-realtime-8k-v2: 8000 Hz only.
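A client can catch mismatched format and sample_rate values before opening a connection. The sketch below is one possible pre-flight check in Python. check_audio_settings and both lookup tables are hypothetical names, and the tables cover only the formats and the two models listed on this page. It does not inspect the audio itself, so the container constraints (Ogg for opus and speex, PCM-encoded wav, AMR-NB) still have to be verified separately.

```python
VALID_FORMATS = {"pcm", "wav", "mp3", "opus", "speex", "aac", "amr"}

# Only the models named on this page are covered; None means any sample rate.
MODEL_SAMPLE_RATES = {
    "paraformer-realtime-v2": None,
    "paraformer-realtime-8k-v2": 8000,
}


def check_audio_settings(model, audio_format, sample_rate):
    """Raise ValueError if the settings contradict the constraints listed here."""
    if audio_format not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    if model not in MODEL_SAMPLE_RATES:
        raise ValueError(f"model not covered by this sketch: {model!r}")
    required = MODEL_SAMPLE_RATES[model]
    if required is not None and sample_rate != required:
        raise ValueError(f"{model} requires {required} Hz, got {sample_rate}")


check_audio_settings("paraformer-realtime-v2", "pcm", 16000)    # passes
check_audio_settings("paraformer-realtime-8k-v2", "wav", 8000)  # passes
```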

vocabulary_id string (Optional)

Hotword vocabulary ID.

disfluency_removal_enabled boolean (Optional)

Important

Only Paraformer supports this parameter.

Whether to filter out filler words.

Default: false.

language_hints array[string] (Optional)

Language codes for the audio to recognize. No default value. If not set, the model detects the language automatically.

Valid values:

Paraformer:
- zh: Chinese
- en: English
- ja: Japanese
- yue: Cantonese
- ko: Korean
- de: German
- fr: French
- ru: Russian
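Because language_hints has no default, omitting the key and sending a list are the two meaningful cases. The following Python sketch illustrates one way to handle both. language_hints_parameter is a hypothetical helper name, and the set of codes is copied from the list above.

```python
SUPPORTED_LANGUAGE_HINTS = {"zh", "en", "ja", "yue", "ko", "de", "fr", "ru"}


def language_hints_parameter(hints=None):
    """Return a dict to merge into parameters.

    An empty dict leaves language_hints unset, so the model detects the
    language automatically.
    """
    if not hints:
        return {}
    unknown = [code for code in hints if code not in SUPPORTED_LANGUAGE_HINTS]
    if unknown:
        raise ValueError(f"unsupported language hints: {unknown}")
    return {"language_hints": list(hints)}


parameters = {"format": "pcm", "sample_rate": 16000}
parameters.update(language_hints_parameter(["zh", "en"]))
```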

semantic_punctuation_enabled boolean (Optional)

Important

Only Paraformer v2 supports this parameter.

Whether to enable semantic-based sentence segmentation.

Default: false.

true: Enables semantic-based segmentation and disables Voice Activity Detection (VAD)-based segmentation.
false (default): Enables VAD-based segmentation and disables semantic-based segmentation.

Semantic-based segmentation is more accurate and suits meeting transcription. VAD-based segmentation has lower latency and suits interactive scenarios.

max_sentence_silence integer (Optional)

Important

Only Paraformer v2 supports this parameter.
Takes effect only when semantic_punctuation_enabled is set to false.

Silence threshold for VAD-based sentence segmentation, in milliseconds. The system ends the current sentence when silence after a speech segment exceeds this threshold.

Default: 1300.

Valid range: [200, 6000].

multi_threshold_mode_enabled boolean (Optional)

Important

Only Paraformer v2 supports this parameter.
Takes effect only when semantic_punctuation_enabled is set to false.

Whether to enable multi-threshold mode. When enabled, this mode prevents VAD-based segmentation from producing overly long segments.

Default: false.
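The three segmentation settings interact: max_sentence_silence and multi_threshold_mode_enabled take effect only when semantic_punctuation_enabled is false. The Python sketch below encodes that rule together with the [200, 6000] range. segmentation_parameters is a hypothetical helper name, and leaving the VAD-only keys out in semantic mode is a choice made for this sketch, not documented server behavior.

```python
def segmentation_parameters(semantic=False, max_sentence_silence=None,
                            multi_threshold_mode=None):
    """Return the segmentation-related entries for the parameters object."""
    if semantic:
        # VAD-only settings take no effect in semantic mode, so omit them.
        return {"semantic_punctuation_enabled": True}

    parameters = {"semantic_punctuation_enabled": False}
    if max_sentence_silence is not None:
        if not 200 <= max_sentence_silence <= 6000:
            raise ValueError("max_sentence_silence must be in [200, 6000] ms")
        parameters["max_sentence_silence"] = max_sentence_silence
    if multi_threshold_mode is not None:
        parameters["multi_threshold_mode_enabled"] = bool(multi_threshold_mode)
    return parameters


# Interactive scenario: VAD-based segmentation with a shorter silence threshold.
interactive = segmentation_parameters(max_sentence_silence=800)
# Meeting transcription: semantic-based segmentation.
meeting = segmentation_parameters(semantic=True)
```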

punctuation_prediction_enabled boolean (Optional)

Important

Only Paraformer v2 supports this parameter.

Whether to add punctuation to the recognition results.

Default: true.

heartbeat boolean (Optional)

Important

Only Paraformer v2 supports this parameter.

Whether to enable heartbeat packets.

Default: false.

true: Keeps the connection alive when only silent audio is being sent.
false (default): The connection times out and closes after 60 seconds of silent audio.

inverse_text_normalization_enabled boolean (Optional)

Important

Only Paraformer v2 supports this parameter.

Whether to enable Inverse Text Normalization (ITN). When enabled, Chinese numerals are converted to Arabic numerals.

Default: true.

finish-task

Description: Notifies the server that all audio data has been sent and asks it to end the task.

When to send: After all audio data has been sent.

Response event: The server returns the task-finished event.

header object (Required)

Properties

action string (Required)

Instruction type. Set to finish-task.

task_id string (Required)

Client-generated task ID in UUID format. Must match the task_id used in the run-task event.

streaming string (Required)

Streaming mode. Set to duplex.

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}
    }
}
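A finish-task message differs from run-task only in its action and its empty payload. The Python sketch below builds it from an existing task ID. build_finish_task is a hypothetical helper name. The uuid.UUID call is an optional sanity check that the ID is in UUID format, and it would reject the masked placeholder ID shown in the sample above.

```python
import json
import uuid


def build_finish_task(task_id):
    """Return the finish-task message text for a task started earlier."""
    uuid.UUID(task_id)  # raises ValueError if task_id is not in UUID format
    message = {
        "header": {
            "action": "finish-task",
            "task_id": task_id,  # must match the run-task event
            "streaming": "duplex",
        },
        "payload": {"input": {}},
    }
    return json.dumps(message)


example_task_id = str(uuid.uuid4())  # in practice, reuse the run-task ID
finish_task_text = build_finish_task(example_task_id)
```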

payload object (Required)

Properties

input object (Required)

Set to {}.
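Taken together, the two events bracket the audio stream: run-task, wait for task-started, send audio, finish-task, wait for task-finished. The Python sketch below shows that ordering without tying it to a particular WebSocket library. The caller supplies send and receive callables for an already open connection. The names recognize and wait_for_event are hypothetical. The sketch also assumes that each server event is JSON text with its name in header.event and that failures arrive as task-failed. Neither detail is specified on this page, so check them against the WebSocket API reference.

```python
import json
import uuid


def wait_for_event(receive, expected):
    """Read server events until `expected` arrives; return every event read.

    Assumption: each server event is JSON text with its name in header.event,
    and a failed task is reported as "task-failed".
    """
    events = []
    while True:
        event = json.loads(receive())
        events.append(event)
        name = event.get("header", {}).get("event")
        if name == expected:
            return events
        if name == "task-failed":
            raise RuntimeError(f"task failed: {event}")


def recognize(send, receive, audio_chunks, model, parameters):
    """Run one task over an open connection using caller-supplied callables."""
    task_id = str(uuid.uuid4())

    def client_event(action, payload):
        header = {"action": action, "task_id": task_id, "streaming": "duplex"}
        return json.dumps({"header": header, "payload": payload})

    send(client_event("run-task", {
        "task_group": "audio", "task": "asr", "function": "recognition",
        "model": model, "parameters": parameters, "input": {},
    }))
    wait_for_event(receive, "task-started")  # no audio before this event
    for chunk in audio_chunks:
        send(chunk)  # audio bytes; framing is left to the transport
    send(client_event("finish-task", {"input": {}}))
    return wait_for_event(receive, "task-finished")
```

This version reads server events only after all audio has been sent, which keeps the ordering easy to follow. A production client would normally read events concurrently while streaming audio.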