Alibaba Cloud Model Studio: Client events for Qwen-ASR-Realtime

Last Updated: Mar 15, 2026

This page documents client-to-server events for the Qwen-ASR Realtime WebSocket API. Each section covers an event type, its parameters, and server responses.

For a feature overview and complete sample code, see Real-time speech recognition - Qwen. For server-to-client events, see Server events for Qwen-ASR-Realtime.

Event lifecycle

A typical session follows this sequence:

  1. Establish a WebSocket connection.

  2. Send session.update to configure audio format, language, and VAD settings.

  3. Send input_audio_buffer.append repeatedly to stream audio data.

  4. In Manual mode, send input_audio_buffer.commit to trigger recognition for a complete utterance. In VAD mode, the server triggers recognition automatically.

  5. Send session.finish to end the session, then disconnect after receiving the session.finished response.
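The sequence above can be sketched as a generator that yields client events in lifecycle order, independent of any WebSocket library. This is a minimal illustration, not the official client: the helper names (new_event, session_events) and the UUID-based event_id scheme are assumptions, and actually sending each event over the connection is left to your WebSocket client.

```python
import base64
import json
import uuid
from typing import Iterable, Iterator, Optional


def new_event(event_type: str, **fields) -> dict:
    """Build a client event. The UUID-based event_id scheme is an assumption."""
    return {"event_id": f"event_{uuid.uuid4().hex}", "type": event_type, **fields}


def session_events(chunks: Iterable[bytes],
                   turn_detection: Optional[dict]) -> Iterator[dict]:
    """Yield client events in lifecycle order for a single utterance."""
    # Step 2: configure the session. turn_detection=None selects Manual mode.
    yield new_event("session.update", session={"turn_detection": turn_detection})
    # Step 3: stream audio as Base64-encoded chunks.
    for chunk in chunks:
        audio = base64.b64encode(chunk).decode("ascii")
        yield new_event("input_audio_buffer.append", audio=audio)
    # Step 4: only Manual mode commits explicitly; in VAD mode the server decides.
    if turn_detection is None:
        yield new_event("input_audio_buffer.commit")
    # Step 5: ask the server to end the session.
    yield new_event("session.finish")


events = list(session_events([b"\x00\x01" * 4, b"\x02\x03" * 4], turn_detection=None))
print([event["type"] for event in events])
print(json.dumps(events[0]))  # None is serialized as JSON null
```

After the last event is sent, the client would keep reading until session.finished arrives before closing the connection.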

session.update

Configures the session. Send this event immediately after establishing the WebSocket connection to set the audio format, the language, and the VAD parameters. If you omit it, the default values apply.

The server responds with a session.updated event on success.

Parameters

Parameter | Type | Required | Description
type | string | Yes | Fixed value: session.update.
event_id | string | Yes | A unique event ID.
session | object | Yes | Session configuration object. See the session configuration table below.

Session configuration

Parameter | Type | Required | Description
input_audio_format | string | No | Audio encoding format. Valid values: pcm, opus. Default: pcm.
sample_rate | integer | No | Audio sampling rate in Hz. Valid values: 16000, 8000. Default: 16000. With 8000, the server upsamples the audio to 16,000 Hz, which adds a minor delay. Use 8000 only for audio that is natively 8,000 Hz, such as telephony.
input_audio_transcription | object | No | Speech recognition settings.
input_audio_transcription.language | string | No | Language of the audio. See the supported languages table below.
input_audio_transcription.corpus.text | string | No | Context text for contextual biasing: background text, entity vocabularies, or reference material that improves recognition accuracy. Maximum: 10,000 tokens.
turn_detection | object | No | VAD configuration. Set to null for Manual mode. If present, VAD mode is enabled.
turn_detection.type | string | Required when turn_detection is set | Fixed value: server_vad.
turn_detection.threshold | float | No | VAD sensitivity threshold. Default: 0.2. Valid range: [-1, 1]. Lower values increase sensitivity and may trigger on background noise. Higher values reduce sensitivity and avoid false triggers in noisy environments. See recommended VAD presets below.
turn_detection.silence_duration_ms | integer | No | Duration of silence, in milliseconds, that marks the end of an utterance. Default: 800. Valid range: [200, 6000]. Shorter durations (e.g., 300 ms) speed up responses but may split an utterance at natural pauses. Longer durations (e.g., 1,200 ms) handle pauses better but increase latency. See recommended VAD presets below.

Recommended VAD presets

Use these presets as starting points. Adjust based on your results:

Preset | threshold | silence_duration_ms | Best for
Low latency | 0.0 | 400 | Fast-paced interactions like voice commands or agent assist, where quick responses matter more than handling long pauses
Balanced (default) | 0.2 | 800 | General-purpose transcription with a balance between responsiveness and accuracy
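The documented value ranges and presets lend themselves to client-side validation before a session.update event is sent. The following is a minimal sketch under that assumption; VAD_PRESETS, vad_config, and build_session_update are hypothetical helper names, not part of any SDK, and the server remains the authority on what it accepts.

```python
from typing import Optional

# Preset values copied from the table above; the dictionary keys are assumptions.
VAD_PRESETS = {
    "low_latency": {"threshold": 0.0, "silence_duration_ms": 400},
    "balanced": {"threshold": 0.2, "silence_duration_ms": 800},
}


def vad_config(threshold: float = 0.2, silence_duration_ms: int = 800) -> dict:
    """Return a turn_detection object after checking the documented ranges."""
    if not -1 <= threshold <= 1:
        raise ValueError("threshold must be within [-1, 1]")
    if not 200 <= silence_duration_ms <= 6000:
        raise ValueError("silence_duration_ms must be within [200, 6000]")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "silence_duration_ms": silence_duration_ms,
    }


def build_session_update(event_id: str,
                         language: Optional[str] = None,
                         turn_detection: Optional[dict] = None,
                         audio_format: str = "pcm",
                         sample_rate: int = 16000) -> dict:
    """Build a session.update event. turn_detection=None selects Manual mode."""
    if audio_format not in ("pcm", "opus"):
        raise ValueError("input_audio_format must be 'pcm' or 'opus'")
    if sample_rate not in (16000, 8000):
        raise ValueError("sample_rate must be 16000 or 8000")
    session = {
        "input_audio_format": audio_format,
        "sample_rate": sample_rate,
        "turn_detection": turn_detection,  # serialized as JSON null in Manual mode
    }
    if language is not None:
        session["input_audio_transcription"] = {"language": language}
    return {"event_id": event_id, "type": "session.update", "session": session}


event = build_session_update(
    "event_123",
    language="zh",
    turn_detection=vad_config(**VAD_PRESETS["low_latency"]),
)
print(event)
```

Validating locally surfaces an out-of-range value as a ValueError at build time rather than as a server-side rejection after the connection is open.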

Supported languages

Code | Language
zh | Chinese (Mandarin, Sichuanese, Minnan, and Wu)
yue | Cantonese
en | English
ja | Japanese
de | German
ko | Korean
ru | Russian
fr | French
pt | Portuguese
ar | Arabic
it | Italian
es | Spanish
hi | Hindi
id | Indonesian
th | Thai
tr | Turkish
uk | Ukrainian
vi | Vietnamese
cs | Czech
da | Danish
fil | Filipino
fi | Finnish
is | Icelandic
ms | Malay
no | Norwegian
pl | Polish
sv | Swedish

Example

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "input_audio_format": "pcm",
        "sample_rate": 16000,
        "input_audio_transcription": {
            "language": "zh"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.0,
            "silence_duration_ms": 400
        }
    }
}

input_audio_buffer.append

Streams an audio chunk to the server's input buffer. This is the core event for sending audio data.

Behavior differs by interaction mode:

  • VAD mode: The server monitors the buffer for voice activity and automatically triggers recognition.

  • Manual mode: The client controls utterance boundaries. Send smaller chunks for lower latency.

Important

The audio field contains Base64-encoded data. In Manual mode, the maximum size per event is 15 MiB. The server does not send a confirmation response for this event.

Parameters

Parameter | Type | Required | Description
type | string | Yes | Fixed value: input_audio_buffer.append.
event_id | string | Yes | A unique event ID.
audio | string | Yes | Base64-encoded audio data.

Example

{
    "event_id": "event_2728",
    "type": "input_audio_buffer.append",
    "audio": "<Base64-encoded-audio-data>"
}
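To illustrate how raw audio becomes a series of append events, the sketch below slices a PCM byte string into fixed-duration chunks and Base64-encodes each one. Several details are assumptions rather than documented behavior: 16-bit mono samples, a 100 ms chunk duration, the helper names, and the choice to measure the 15 MiB limit against the serialized event (the page does not say whether the limit applies to the raw audio or the encoded payload).

```python
import base64
import json
from typing import Iterator

BYTES_PER_SAMPLE = 2                 # assumption: 16-bit mono PCM
MAX_EVENT_BYTES = 15 * 1024 * 1024   # 15 MiB per event (documented for Manual mode)


def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split raw PCM into chunks of roughly chunk_ms milliseconds each."""
    chunk_bytes = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def append_event(event_id: str, chunk: bytes) -> dict:
    """Wrap one audio chunk in an input_audio_buffer.append event."""
    event = {
        "event_id": event_id,
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    }
    # Conservative check: measure the whole serialized event against the limit.
    if len(json.dumps(event).encode("utf-8")) > MAX_EVENT_BYTES:
        raise ValueError("event exceeds 15 MiB; send smaller chunks")
    return event


one_second_of_silence = b"\x00" * (16000 * BYTES_PER_SAMPLE)
events = [
    append_event(f"event_{index}", chunk)
    for index, chunk in enumerate(pcm_chunks(one_second_of_silence))
]
print(len(events), "append events")
```

Smaller chunks reduce latency at the cost of more events; the 100 ms value here is only a starting point for experimentation.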

input_audio_buffer.commit

Triggers recognition for all audio in the buffer as a single utterance. Use in Manual mode when your application controls utterance boundaries (e.g., push-to-talk).

This event is disabled in VAD mode, where the server triggers recognition automatically.

The server responds with an input_audio_buffer.committed event on success.

Parameters

Parameter | Type | Required | Description
type | string | Yes | Fixed value: input_audio_buffer.commit.
event_id | string | Yes | A unique event ID.

Example

{
    "event_id": "event_789",
    "type": "input_audio_buffer.commit"
}
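Because the commit event is only valid in Manual mode, a client may want to guard against sending it when VAD is configured. The sketch below shows one way to do that; commit_event is a hypothetical helper, and it assumes the client keeps the turn_detection value it last sent in session.update.

```python
from typing import Optional


def commit_event(event_id: str, turn_detection: Optional[dict]) -> dict:
    """Build input_audio_buffer.commit, refusing to do so when VAD is enabled."""
    # turn_detection is None in Manual mode; any object means VAD mode.
    if turn_detection is not None:
        raise RuntimeError("input_audio_buffer.commit is disabled in VAD mode")
    return {"event_id": event_id, "type": "input_audio_buffer.commit"}


# Push-to-talk: build the commit when the user releases the talk button.
print(commit_event("event_789", turn_detection=None))
```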

session.finish

Ends the session. The server's response depends on whether speech was detected.

After receiving session.finished, disconnect the WebSocket connection.

Parameters

Parameter | Type | Required | Description
type | string | Yes | Fixed value: session.finish.
event_id | string | Yes | A unique event ID.

Example

{
    "event_id": "event_341",
    "type": "session.finish"
}
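The shutdown handshake can be sketched as a loop that keeps reading server messages after session.finish is sent and stops only when session.finished arrives. This is an illustration under assumptions: drain_until_finished is a hypothetical helper, incoming messages are modeled as an iterable of JSON strings (in practice they would come from your WebSocket client), and only the type field of each server event is inspected.

```python
import json
from typing import Iterable, List


def drain_until_finished(messages: Iterable[str]) -> List[dict]:
    """Collect server events until session.finished, then return them all."""
    received = []
    for raw in messages:
        event = json.loads(raw)
        received.append(event)
        if event.get("type") == "session.finished":
            return received  # now it is safe to close the connection
    # The stream ended without the expected final event.
    raise ConnectionError("connection closed before session.finished arrived")


# Simulated server stream; real payloads carry more fields than shown here.
incoming = [
    json.dumps({"type": "session.updated"}),
    json.dumps({"type": "session.finished"}),
]
print([event["type"] for event in drain_until_finished(incoming)])
```

Reading until session.finished, rather than closing right after sending session.finish, avoids dropping any events the server emits while the session is ending.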