
Alibaba Cloud Model Studio: Client events for Qwen-ASR-Realtime

Last Updated: Jan 16, 2026

This document describes the events that a client sends to the server during a WebSocket session with the Qwen-ASR Realtime API.

User guide: For an overview of the model, its features, and complete sample code, see Real-time speech recognition - Qwen.

session.update

Updates the session configuration. Send this event immediately after establishing the WebSocket connection. If you do not send it, the system uses the default configuration.

After the server successfully processes this event, it sends a session.updated event as confirmation.

Parameters:

  • type (string, required): The event type. The value is fixed to session.update.

  • event_id (string, required): The event ID.

  • session (object, required): An object that contains the session configuration.

  • session.input_audio_format (string, optional): The audio format. Supported formats are pcm and opus. Default: pcm.

  • session.sample_rate (integer, optional): The audio sampling rate in Hz. Supported values are 16000 and 8000. Default: 16000. If you set this parameter to 8000, the server upsamples the audio to 16000 Hz before recognition, which might introduce a minor delay. Use 8000 only if the source audio is 8000 Hz, such as audio from a telephone line.

  • session.input_audio_transcription (object, optional): Configurations related to speech recognition.

  • session.input_audio_transcription.language (string, optional): The source language of the audio. Valid values:

      • zh: Chinese (Mandarin, Sichuanese, Minnan, and Wu)
      • yue: Cantonese
      • en: English
      • ja: Japanese
      • de: German
      • ko: Korean
      • ru: Russian
      • fr: French
      • pt: Portuguese
      • ar: Arabic
      • it: Italian
      • es: Spanish
      • hi: Hindi
      • id: Indonesian
      • th: Thai
      • tr: Turkish
      • uk: Ukrainian
      • vi: Vietnamese
      • cs: Czech
      • da: Danish
      • fil: Filipino
      • fi: Finnish
      • is: Icelandic
      • ms: Malay
      • no: Norwegian
      • pl: Polish
      • sv: Swedish

  • session.input_audio_transcription.corpus.text (string, optional): The recognition context. Provide background text, entity vocabularies, or other reference information to get customized recognition results. Length limit: 10,000 tokens. For more information, see Contextual biasing.

  • session.turn_detection (object, optional): The Voice Activity Detection (VAD) configuration. If this parameter is set, VAD mode is enabled. Set it to `null` to disable VAD mode and use manual mode instead.

  • session.turn_detection.type (string, required when turn_detection is set): The value is fixed to server_vad.

  • session.turn_detection.threshold (float, optional): The VAD detection threshold. Valid values: [-1, 1]. Default: 0.2. Recommended value: 0.0. A lower threshold increases VAD sensitivity, which might cause background noise to be mistaken for speech. A higher threshold decreases sensitivity and helps reduce false triggers in noisy environments.

  • session.turn_detection.silence_duration_ms (integer, optional): The VAD endpointing threshold in milliseconds (ms). A period of silence that exceeds this threshold is treated as the end of an utterance. Valid values: [200, 6000]. Default: 800. Recommended value: 400. A lower value, such as 300 ms, lets the model respond faster but may segment speech unnaturally at normal pauses. A higher value, such as 1200 ms, handles pauses within long sentences better but increases overall response latency.

{
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "input_audio_format": "pcm",
        "sample_rate": 16000,
        "input_audio_transcription": {
            "language": "zh",
            "corpus": {
              "text": "ASR corpus to improve model recognition performance"
            }
        },
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.0,
            "silence_duration_ms": 400
        }
    }
}
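
For orientation, here is a minimal sketch of a client establishing the connection and sending this event, using the third-party websocket-client package. The endpoint URL and the Authorization header are placeholders (assumptions, not part of this document); substitute the values from your own Model Studio setup.

import json

from websocket import create_connection  # pip install websocket-client

# Both values below are placeholders, not part of this document: take the
# real endpoint and API key from your Model Studio account.
WS_URL = "wss://<your-qwen-asr-realtime-endpoint>"
API_KEY = "<your-api-key>"

ws = create_connection(WS_URL, header={"Authorization": f"Bearer {API_KEY}"})

# Send session.update immediately after the connection is established.
ws.send(json.dumps({
    "event_id": "event_123",
    "type": "session.update",
    "session": {
        "input_audio_format": "pcm",
        "sample_rate": 16000,
        "input_audio_transcription": {"language": "zh"},
        # Use "turn_detection": None (serialized as null) for manual mode.
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.0,
            "silence_duration_ms": 400,
        },
    },
}))

print(ws.recv())  # the server should confirm with a session.updated event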

input_audio_buffer.append

Appends an audio data block to the server's input buffer. This is the core event for streaming audio.

Differences across scenarios:

  • VAD mode: The audio buffer is used for voice activity detection. The server automatically decides when to submit the audio for recognition.

  • Non-VAD mode: The client can control the amount of audio data in each event. The maximum size of the audio field in a single input_audio_buffer.append event is 15 MiB. Stream smaller audio blocks for faster responses.

Important: The server does not send any confirmation response for the input_audio_buffer.append event.

Parameters:

  • type (string, required): The event type. The value must be input_audio_buffer.append.

  • event_id (string, required): The unique ID for the event.

  • audio (string, required): The Base64-encoded audio data.

{
  "event_id": "event_2728",
  "type": "input_audio_buffer.append",
  "audio": "<Base64-encoded audio data>"
}
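
The sketch below shows one way a client might build these events: it splits raw 16-bit, 16000 Hz PCM audio into roughly 100 ms chunks and Base64-encodes each one. The chunk size and the event_id scheme are illustrative assumptions, not API requirements; the only hard limit stated above is 15 MiB of audio per event.

import base64
import json

# Illustrative chunking for 16-bit PCM at 16000 Hz: ~100 ms per event.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE // 10

def append_events(pcm_bytes):
    """Yield input_audio_buffer.append events for a raw PCM byte stream."""
    for i in range(0, len(pcm_bytes), CHUNK_BYTES):
        chunk = pcm_bytes[i:i + CHUNK_BYTES]
        yield json.dumps({
            "event_id": f"event_append_{i // CHUNK_BYTES}",
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })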

input_audio_buffer.commit

In non-VAD mode, this event manually triggers recognition. It notifies the server that the client has finished sending a complete utterance. The server then recognizes all audio data in the current buffer as a single unit.

This event is not available in VAD mode.

After successful processing, the server sends the input_audio_buffer.committed event as confirmation.

Parameters:

  • type (string, required): The event type. The value is input_audio_buffer.commit.

  • event_id (string, required): The event ID.

{
  "event_id": "event_789",
  "type": "input_audio_buffer.commit"
}
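
A minimal manual-mode flow might look like the sketch below, reusing the ws connection and the append_events helper from the earlier sketches; pcm_bytes is an assumed variable holding one utterance of raw PCM audio.

import json

# Stream one utterance, then commit the buffer as a single unit.
for event in append_events(pcm_bytes):
    ws.send(event)

ws.send(json.dumps({
    "event_id": "event_789",
    "type": "input_audio_buffer.commit",
}))

# Watch the incoming events for input_audio_buffer.committed as confirmation.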

session.finish

Ends the current session.

Server response flow: After the server processes this event, it sends a session.finished event. After the client receives the session.finished event, it must disconnect.

Parameters:

  • type (string, required): The event type. The value is session.finish.

  • event_id (string, required): The event ID.

{
  "event_id": "event_341",
  "type": "session.finish"
}
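
As a closing sketch (again assuming the ws connection from the earlier examples), a client can send session.finish and then drain incoming events until the session.finished confirmation arrives:

import json

# Ask the server to end the session.
ws.send(json.dumps({"event_id": "event_341", "type": "session.finish"}))

# Keep reading until the server confirms with session.finished;
# only then is it safe to disconnect.
while True:
    event = json.loads(ws.recv())
    if event.get("type") == "session.finished":
        break

ws.close()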