This page documents client-to-server events for the Qwen-ASR Realtime WebSocket API. Each section covers an event type, its parameters, and server responses.
For a feature overview and complete sample code, see Real-time speech recognition - Qwen. For server-to-client events, see Server events for Qwen-ASR-Realtime.
Event lifecycle
A typical session follows this sequence:
- Establish a WebSocket connection.
- Send session.update to configure audio format, language, and VAD settings.
- Send input_audio_buffer.append repeatedly to stream audio data.
- In Manual mode, send input_audio_buffer.commit to trigger recognition for a complete utterance. In VAD mode, the server triggers recognition automatically.
- Send session.finish to end the session, then disconnect after receiving the session.finished response.
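The sequence above can be sketched as a generator that yields the client events in order. This is a minimal illustration, not an official client: the helper names (new_event_id, session_events) and the event_id scheme are assumptions, and the actual WebSocket transport is left out so the ordering logic stays testable on its own.

```python
import json
import uuid
from typing import Iterable, Iterator


def new_event_id() -> str:
    # Hypothetical ID scheme; the API only requires that each ID be unique.
    return f"event_{uuid.uuid4().hex}"


def session_events(
    session: dict, audio_chunks_b64: Iterable[str], manual_mode: bool
) -> Iterator[str]:
    """Yield the JSON text of each client event for one session, in lifecycle order."""

    def event(event_type: str, **fields) -> str:
        return json.dumps({"event_id": new_event_id(), "type": event_type, **fields})

    # 1. Configure the session.
    yield event("session.update", session=session)
    # 2. Stream the audio, one Base64 string per append event.
    for chunk in audio_chunks_b64:
        yield event("input_audio_buffer.append", audio=chunk)
    # 3. Only Manual mode commits explicitly; in VAD mode the server decides.
    if manual_mode:
        yield event("input_audio_buffer.commit")
    # 4. Ask the server to end the session.
    yield event("session.finish")
```

Each yielded string would be sent as one WebSocket text message. Disconnecting still has to wait for the session.finished response from the server.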
session.update
Configures the session. Send this event immediately after establishing the WebSocket connection to set the audio format, the language, and the VAD parameters. If you omit it, the default values apply.
The server responds with a session.updated event on success.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Fixed value: session.update. |
| event_id | string | Yes | A unique event ID. |
| session | object | Yes | Session configuration object. See the session configuration table below. |
Session configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_audio_format | string | No | Audio encoding format. Valid values: pcm, opus. Default: pcm. |
| sample_rate | integer | No | Audio sampling rate in Hz. Valid values: 16000, 8000. Default: 16000. With 8000, the server upsamples the audio to 16,000 Hz, which adds a minor delay. Use 8000 only for audio natively sampled at 8,000 Hz, such as telephony. |
| input_audio_transcription | object | No | Speech recognition settings. |
| input_audio_transcription.language | string | No | Language of the audio. See the supported languages table below. |
| input_audio_transcription.corpus.text | string | No | Context text for contextual biasing -- background text, entity vocabularies, or reference material that improves recognition accuracy. Maximum: 10,000 tokens. |
| turn_detection | object | No | VAD configuration. Set to null for Manual mode. If present, VAD mode is enabled. |
| turn_detection.type | string | Required when turn_detection is set | Fixed value: server_vad. |
| turn_detection.threshold | float | No | VAD sensitivity threshold. Default: 0.2. Valid range: [-1, 1]. Lower values increase sensitivity (may trigger on background noise). Higher values reduce sensitivity and avoid false triggers in noisy environments. See recommended VAD presets below. |
| turn_detection.silence_duration_ms | integer | No | Silence duration in milliseconds marking utterance end. Default: 800. Valid range: [200, 6000]. Shorter durations (e.g., 300 ms) speed up responses but may split natural pauses. Longer durations (e.g., 1,200 ms) handle pauses better but increase latency. See recommended VAD presets below. |
Recommended VAD presets
Use these presets as starting points. Adjust based on your results:
| Preset | threshold | silence_duration_ms | Best for |
|---|---|---|---|
| Low latency | 0.0 | 400 | Fast-paced interactions like voice commands or agent assist, where quick responses matter more than handling long pauses |
| Balanced (default) | 0.2 | 800 | General-purpose transcription with a balance between responsiveness and accuracy |
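One way to apply these presets is a small builder that starts from a preset, allows overrides, and checks values against the documented ranges before sending. The sketch below is an assumption-laden illustration: VAD_PRESETS, build_session_update, and the event_id scheme are hypothetical names, and the client-side validation only mirrors the ranges listed in the session configuration table.

```python
import uuid
from typing import Optional

# Values copied from the presets table above.
VAD_PRESETS = {
    "low_latency": {"threshold": 0.0, "silence_duration_ms": 400},
    "balanced": {"threshold": 0.2, "silence_duration_ms": 800},
}


def build_session_update(
    preset: str = "balanced",
    *,
    language: Optional[str] = None,
    sample_rate: int = 16000,
    audio_format: str = "pcm",
    threshold: Optional[float] = None,
    silence_duration_ms: Optional[int] = None,
) -> dict:
    """Build a session.update event in VAD mode from a preset plus overrides."""
    vad = dict(VAD_PRESETS[preset])
    if threshold is not None:
        vad["threshold"] = threshold
    if silence_duration_ms is not None:
        vad["silence_duration_ms"] = silence_duration_ms

    # Client-side checks that mirror the documented valid values.
    if not -1 <= vad["threshold"] <= 1:
        raise ValueError("threshold must be within [-1, 1]")
    if not 200 <= vad["silence_duration_ms"] <= 6000:
        raise ValueError("silence_duration_ms must be within [200, 6000]")
    if sample_rate not in (16000, 8000):
        raise ValueError("sample_rate must be 16000 or 8000")
    if audio_format not in ("pcm", "opus"):
        raise ValueError("input_audio_format must be pcm or opus")

    session = {
        "input_audio_format": audio_format,
        "sample_rate": sample_rate,
        "turn_detection": {"type": "server_vad", **vad},
    }
    if language is not None:
        session["input_audio_transcription"] = {"language": language}
    # Hypothetical ID scheme; the API only requires a unique event_id.
    return {
        "event_id": f"event_{uuid.uuid4().hex}",
        "type": "session.update",
        "session": session,
    }
```

For example, build_session_update("low_latency", language="zh") produces a session object with threshold 0.0 and silence_duration_ms 400. Serialize the result with json.dumps before sending it.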
Supported languages
| Code | Language |
|---|---|
| zh | Chinese (Mandarin, Sichuanese, Minnan, and Wu) |
| yue | Cantonese |
| en | English |
| ja | Japanese |
| de | German |
| ko | Korean |
| ru | Russian |
| fr | French |
| pt | Portuguese |
| ar | Arabic |
| it | Italian |
| es | Spanish |
| hi | Hindi |
| id | Indonesian |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| vi | Vietnamese |
| cs | Czech |
| da | Danish |
| fil | Filipino |
| fi | Finnish |
| is | Icelandic |
| ms | Malay |
| no | Norwegian |
| pl | Polish |
| sv | Swedish |
Example
{
  "event_id": "event_123",
  "type": "session.update",
  "session": {
    "input_audio_format": "pcm",
    "sample_rate": 16000,
    "input_audio_transcription": {
      "language": "zh"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.0,
      "silence_duration_ms": 400
    }
  }
}
input_audio_buffer.append
Streams an audio chunk to the server's input buffer -- the core event for sending audio data.
Behavior differs by interaction mode:
- VAD mode: The server monitors the buffer for voice activity and automatically triggers recognition.
- Manual mode: The client controls utterance boundaries. Send smaller chunks for lower latency.
The audio field contains Base64-encoded data. In Manual mode, the maximum size per event is 15 MiB. The server does not send a confirmation response.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Fixed value: input_audio_buffer.append. |
| event_id | string | Yes | A unique event ID. |
| audio | string | Yes | Base64-encoded audio data. |
Example
{
  "event_id": "event_2728",
  "type": "input_audio_buffer.append",
  "audio": "<Base64-encoded-audio-data>"
}
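To illustrate how raw audio might be turned into append events, the sketch below splits a PCM byte string into fixed-duration chunks and Base64-encodes each one. Several details are assumptions rather than documented behavior: 16-bit mono samples, a 100 ms chunk size, the helper names, and the choice to measure the 15 MiB limit against the whole serialized event (the page does not say whether the limit applies to the raw audio or to the encoded payload).

```python
import base64
import json
import uuid
from typing import Iterator

MAX_EVENT_BYTES = 15 * 1024 * 1024  # 15 MiB limit documented for Manual mode


def pcm_chunks(
    pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100, bytes_per_sample: int = 2
) -> Iterator[bytes]:
    """Split raw PCM into chunks of roughly chunk_ms milliseconds each."""
    # Assumes 16-bit mono PCM (2 bytes per sample); adjust for other layouts.
    step = sample_rate * bytes_per_sample * chunk_ms // 1000
    if step <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


def append_event(chunk: bytes) -> str:
    """Build the JSON text of one input_audio_buffer.append event."""
    payload = json.dumps({
        "event_id": f"event_{uuid.uuid4().hex}",  # hypothetical ID scheme
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })
    # Conservative check: measure the full serialized message against the limit.
    if len(payload.encode("utf-8")) > MAX_EVENT_BYTES:
        raise ValueError("append event exceeds 15 MiB")
    return payload
```

With the defaults, one second of 16 kHz audio yields ten chunks of 3,200 bytes. Since the server sends no confirmation for append events, the client can send them back to back without waiting for a reply.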
input_audio_buffer.commit
Triggers recognition for all audio in the buffer as a single utterance. Use in Manual mode when your application controls utterance boundaries (e.g., push-to-talk).
This event is disabled in VAD mode.
The server responds with an input_audio_buffer.committed event on success.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Fixed value: input_audio_buffer.commit. |
| event_id | string | Yes | A unique event ID. |
Example
{
  "event_id": "event_789",
  "type": "input_audio_buffer.commit"
}
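Because commit only applies in Manual mode, a push-to-talk client would first configure the session with turn_detection set to null and then commit when the user releases the button. The following is a minimal sketch under those assumptions; the function names and the event_id scheme are hypothetical.

```python
import json
import uuid


def _event(event_type: str, **fields) -> str:
    # Hypothetical ID scheme; the API only requires a unique event_id.
    return json.dumps({"event_id": f"event_{uuid.uuid4().hex}", "type": event_type, **fields})


def manual_mode_session_update(language: str = "en") -> str:
    """session.update that selects Manual mode."""
    # Python None serializes to JSON null, which the docs describe as Manual mode.
    return _event(
        "session.update",
        session={
            "input_audio_transcription": {"language": language},
            "turn_detection": None,
        },
    )


def commit_event() -> str:
    """input_audio_buffer.commit, sent when the utterance is complete."""
    return _event("input_audio_buffer.commit")
```

A push-to-talk flow would send manual_mode_session_update() once, stream append events while the button is held, and send commit_event() on release.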
session.finish
Ends the session. The server's response depends on whether speech was detected:
- Speech detected: The server completes final recognition, sends a conversation.item.input_audio_transcription.completed event with the result, then sends a session.finished event.
- No speech detected: The server sends session.finished directly.
After receiving session.finished, disconnect the WebSocket connection.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Fixed value: session.finish. |
| event_id | string | Yes | A unique event ID. |
Example
{
  "event_id": "event_341",
  "type": "session.finish"
}
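After sending session.finish, a client needs to keep reading until session.finished arrives, since a final transcription event may come first. The sketch below shows that loop over any iterable of incoming JSON messages. The function name and the ConnectionError choice are assumptions, and the events are collected whole because the fields of the completed event are documented on the server events page, not here.

```python
import json
from typing import Iterable, List

COMPLETED = "conversation.item.input_audio_transcription.completed"


def drain_until_finished(messages: Iterable[str]) -> List[dict]:
    """Collect final transcription events until session.finished arrives."""
    completed = []
    for raw in messages:
        event = json.loads(raw)
        event_type = event.get("type")
        if event_type == COMPLETED:
            # Speech was detected: keep the final recognition result.
            completed.append(event)
        elif event_type == "session.finished":
            # Safe to close the WebSocket connection now.
            return completed
    # The stream ended without the expected terminal event.
    raise ConnectionError("stream ended before session.finished")
```

An empty return value corresponds to the no-speech case, where the server sends session.finished directly.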