Client events

This document describes the events that a client sends to the server during a WebSocket session with the Qwen-ASR Realtime API.

User guide: For an overview of the model, its features, and complete sample code, see Real-time speech recognition - Qwen.

session.update

Updates the session configuration. Send this event immediately after establishing a WebSocket connection. If you do not send this event, the system uses the default configurations.

After the server successfully processes this event, it sends a session.updated event as confirmation.

Parameter	Type	Required	Description
type	string	Yes	The event type. The value is fixed to `session.update`.
event_id	string	Yes	The event ID.
session	object	Yes	An object that contains the session configuration.
session.input_audio_format	string	No	The audio format. Supported formats are `pcm` and `opus`. Default: `pcm`.
session.sample_rate	integer	No	The audio sampling rate in Hz. Supported values are `16000` and `8000`. Default: `16000`. If you set this parameter to `8000`, the server upsamples the audio to 16000 Hz before recognition. This might introduce a minor delay. Use this value only if the source audio is 8000 Hz, such as audio from a telephone line.
session.input_audio_transcription	object	No	Configurations related to speech recognition.
session.input_audio_transcription.language	string	No	The source language of the audio. Supported values: `zh`: Chinese (Mandarin, Sichuanese, Minnan, and Wu); `yue`: Cantonese; `en`: English; `ja`: Japanese; `de`: German; `ko`: Korean; `ru`: Russian; `fr`: French; `pt`: Portuguese; `ar`: Arabic; `it`: Italian; `es`: Spanish; `hi`: Hindi; `id`: Indonesian; `th`: Thai; `tr`: Turkish; `uk`: Ukrainian; `vi`: Vietnamese; `cs`: Czech; `da`: Danish; `fil`: Filipino; `fi`: Finnish; `is`: Icelandic; `ms`: Malay; `no`: Norwegian; `pl`: Polish; `sv`: Swedish.
session.input_audio_transcription.corpus.text	string	No	The context used to customize recognition results. You can provide background text, entity vocabularies, and other reference information. Length limit: 10,000 tokens. For more information, see Contextual biasing.
session.turn_detection	object	No	The Voice Activity Detection (VAD) configuration. If this parameter is set to an object, VAD mode is enabled. Set this parameter to `null` to disable VAD mode and use manual (non-VAD) mode.
session.turn_detection.type	string	Conditional	Required when `turn_detection` is set. The value is fixed to `server_vad`.
session.turn_detection.threshold	float	No	The VAD detection threshold. Recommended value: `0.0`. Default: `0.2`. Valid values: `[-1, 1]`. A lower threshold increases VAD sensitivity, which might cause background noise to be mistaken for speech. A higher threshold decreases sensitivity and helps reduce false triggers in noisy environments.
session.turn_detection.silence_duration_ms	integer	No	The VAD endpointing threshold in milliseconds (ms). A period of silence that exceeds this threshold is considered the end of a statement. Recommended value: `400`. Default: `800`. Valid values: `[200, 6000]`. A lower value, such as 300 ms, allows the model to respond faster but may cause unnatural segmentation at normal pauses. A higher value, such as 1200 ms, can better handle pauses within long sentences but increases the overall response latency.

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "input_audio_format": "pcm",
        "sample_rate": 16000,
        "input_audio_transcription": {
            "language": "zh"
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.0,
            "silence_duration_ms": 400
        }
    }
}
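The parameter constraints above can be checked on the client before the event is sent. The following is a minimal Python sketch, not an official SDK helper: the function name `build_session_update` and the UUID-based `event_id` format are assumptions made for illustration, while the field names, allowed values, and defaults are taken from the table above.

```python
import json
import uuid


def build_session_update(
    language=None,
    sample_rate=16000,
    audio_format="pcm",
    use_vad=True,
    threshold=0.2,
    silence_duration_ms=800,
):
    """Return a session.update event as a JSON string (hypothetical helper)."""
    if audio_format not in ("pcm", "opus"):
        raise ValueError("input_audio_format must be 'pcm' or 'opus'")
    if sample_rate not in (8000, 16000):
        raise ValueError("sample_rate must be 8000 or 16000")

    session = {"input_audio_format": audio_format, "sample_rate": sample_rate}
    if language is not None:
        session["input_audio_transcription"] = {"language": language}

    if use_vad:
        if not -1 <= threshold <= 1:
            raise ValueError("threshold must be in [-1, 1]")
        if not 200 <= silence_duration_ms <= 6000:
            raise ValueError("silence_duration_ms must be in [200, 6000]")
        session["turn_detection"] = {
            "type": "server_vad",
            "threshold": threshold,
            "silence_duration_ms": silence_duration_ms,
        }
    else:
        # None is serialized as JSON null, which disables VAD mode.
        session["turn_detection"] = None

    event = {
        "event_id": f"event_{uuid.uuid4().hex}",  # assumed ID format
        "type": "session.update",
        "session": session,
    }
    return json.dumps(event)
```

For example, `build_session_update(language="zh", threshold=0.0, silence_duration_ms=400)` produces a payload with the same session settings as the JSON sample above, and `build_session_update(use_vad=False)` produces one with `"turn_detection": null`.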

input_audio_buffer.append

Appends an audio data block to the server's input buffer. This is the core event for streaming audio.

Differences across scenarios:

VAD mode: The audio buffer is used for voice activity detection. The server automatically decides when to submit the audio for recognition.
Non-VAD mode: The client can control the amount of audio data in each event. The maximum size of the audio field in a single input_audio_buffer.append event is 15 MiB. Stream smaller audio blocks for faster responses.

Important: The server does not send any confirmation response for the input_audio_buffer.append event.

Parameter	Type	Required	Description
type	string	Yes	The event type. The value must be `input_audio_buffer.append`.
event_id	string	Yes	The unique ID for the event.
audio	string	Yes	The Base64-encoded audio data.

{
  "event_id": "event_2728",
  "type": "input_audio_buffer.append",
  "audio": "<Base64-encoded audio data>"
}
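Streaming small blocks, as recommended above, can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the helper name `iter_append_events` is hypothetical, the default block size of 3,200 bytes corresponds to 100 ms of audio only if the input is 16-bit mono PCM at 16000 Hz, and the 15 MiB limit is applied to the Base64 string, which is the more conservative reading of the limit.

```python
import base64
import json
import uuid

# 15 MiB limit on the audio field, applied here to the Base64 string (assumption).
MAX_AUDIO_FIELD_BYTES = 15 * 1024 * 1024


def iter_append_events(pcm, chunk_bytes=3200):
    """Yield input_audio_buffer.append events (JSON strings) for raw audio bytes."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        encoded = base64.b64encode(chunk).decode("ascii")
        if len(encoded) > MAX_AUDIO_FIELD_BYTES:
            raise ValueError("audio field exceeds 15 MiB; use a smaller chunk_bytes")
        yield json.dumps({
            "event_id": f"event_{uuid.uuid4().hex}",  # assumed ID format
            "type": "input_audio_buffer.append",
            "audio": encoded,
        })
```

Each yielded string is sent as one WebSocket text message. Because the server does not acknowledge these events, the caller does not wait for a response between messages.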

input_audio_buffer.commit

In non-VAD mode, this event manually triggers recognition. It notifies the server that the client has finished sending a complete utterance. The server then recognizes all audio data in the current buffer as a single unit.

This event is not available in VAD mode.

After successful processing, the server sends the input_audio_buffer.committed event as confirmation.

Parameter	Type	Required	Description
type	string	Yes	The event type. The value is `input_audio_buffer.commit`.
event_id	string	Yes	The event ID.

{
  "event_id": "event_789",
  "type": "input_audio_buffer.commit"
}
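In non-VAD mode, one utterance therefore maps to a run of input_audio_buffer.append events followed by a single input_audio_buffer.commit. The Python sketch below illustrates that ordering. The generator name `manual_turn_events` and its `vad_enabled` flag are hypothetical, and the UUID-based `event_id` format is an assumption.

```python
import base64
import uuid


def _new_event_id():
    # Assumed ID format; the API only requires an event ID string.
    return f"event_{uuid.uuid4().hex}"


def manual_turn_events(chunks, vad_enabled=False):
    """Yield append events for each audio chunk, then one commit event."""
    if vad_enabled:
        # The commit event is not available in VAD mode.
        raise ValueError("input_audio_buffer.commit cannot be used in VAD mode")
    for chunk in chunks:
        yield {
            "event_id": _new_event_id(),
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
    # Tell the server that the utterance is complete.
    yield {"event_id": _new_event_id(), "type": "input_audio_buffer.commit"}
```

Each yielded dictionary is serialized with `json.dumps` before it is sent. After sending the commit event, the client can wait for input_audio_buffer.committed before starting the next utterance.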

session.finish

Ends the current session.

Server response flow:

If speech is detected: After completing the final speech recognition, the server sends the conversation.item.input_audio_transcription.completed event, which contains the recognition result. The server then sends the session.finished event to indicate that the session has ended.
If no speech is detected: The server directly sends a session.finished event.

After the client receives the session.finished event, it must disconnect.

Parameter	Type	Required	Description
type	string	Yes	The event type. The value is `session.finish`.
event_id	string	Yes	The event ID.

{
  "event_id": "event_341",
  "type": "session.finish"
}
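The response flow after session.finish can be handled with a small loop on the client. The following Python sketch assumes that server messages arrive as JSON strings from some iterable, such as a WebSocket receive loop. The function name `collect_until_finished` is hypothetical, and the sketch returns whole events because the fields of the completed event are not described in this document.

```python
import json

COMPLETED = "conversation.item.input_audio_transcription.completed"
FINISHED = "session.finished"


def collect_until_finished(server_messages):
    """Read server messages until session.finished arrives.

    Returns (completed_events, finished), where finished is False if the
    message stream ended before session.finished was received.
    """
    completed = []
    for raw in server_messages:
        event = json.loads(raw)
        event_type = event.get("type")
        if event_type == COMPLETED:
            completed.append(event)
        elif event_type == FINISHED:
            # The client must disconnect after this event.
            return completed, True
    return completed, False
```

When the function returns `finished=True`, the caller closes the WebSocket connection. An empty `completed_events` list corresponds to the case in which no speech was detected.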