This document describes the events that a client sends to the server during a WebSocket session with the Qwen-ASR Realtime API.
User guide: For an overview of the model, its features, and complete sample code, see Real-time speech recognition - Qwen.
session.update
Updates the session configuration. Send this event immediately after establishing a WebSocket connection. If you do not send this event, the system uses the default configurations.
After the server successfully processes this event, it sends a session.updated event as confirmation.
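As a rough illustration of the handshake, the sketch below opens a connection and sends session.update before any audio is streamed. The endpoint URL, authorization header, and the fields inside the session object (input_audio_format, sample_rate) are placeholders assumed for this example, not the documented schema; the top-level type field follows the common realtime-event convention and is likewise an assumption. The websocket-client package is used only for brevity.

```python
# Minimal sketch -- endpoint, header, and payload fields are assumptions,
# not the authoritative schema. See the Qwen user guide for the real values.
import json
import websocket  # pip install websocket-client

WS_URL = "wss://example.com/api-ws/v1/realtime"   # placeholder endpoint
API_KEY = "your-api-key"                          # placeholder credential

ws = websocket.create_connection(WS_URL, header=[f"Authorization: Bearer {API_KEY}"])

# Send session.update immediately after connecting; otherwise defaults apply.
session_update = {
    "type": "session.update",             # event name as documented
    "session": {                          # assumed wrapper object
        "input_audio_format": "pcm16",    # assumed field
        "sample_rate": 16000,             # assumed field
    },
}
ws.send(json.dumps(session_update))

# The server confirms with a session.updated event.
print(ws.recv())
```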
input_audio_buffer.append
Appends an audio data block to the server's input buffer. This is the core event for streaming audio.
Differences across scenarios:
VAD mode: The audio buffer is used for voice activity detection. The server automatically decides when to submit the audio for recognition.
Non-VAD mode: The client can control the amount of audio data in each event. The maximum size of the audio field in a single input_audio_buffer.append event is 15 MiB. Stream smaller audio blocks for faster responses.
Important: The server does not send any confirmation response for the input_audio_buffer.append event.
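Continuing from the connection sketch above (reusing ws), the loop below streams a local PCM file in small chunks. The chunk size, file name, and base64 encoding of the audio field are assumptions made for illustration; confirm the expected audio format and encoding in the user guide.

```python
# Stream a local PCM file in small chunks; base64 encoding of the `audio`
# field is an assumption -- verify the expected encoding in the user guide.
import base64
import json

CHUNK_SIZE = 3200  # e.g. ~100 ms of 16 kHz 16-bit mono PCM; well under the 15 MiB limit

with open("speech.pcm", "rb") as f:
    while chunk := f.read(CHUNK_SIZE):
        event = {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
        ws.send(json.dumps(event))  # ws from the connection sketch above
        # No confirmation is sent for append events, so there is nothing to read here.
```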
input_audio_buffer.commit
In non-VAD mode, this event manually triggers recognition. It notifies the server that the client has finished sending a complete utterance. The server then recognizes all audio data in the current buffer as a single unit.
Disabled in: VAD mode.
After successful processing, the server sends the input_audio_buffer.committed event as confirmation.
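In non-VAD mode, a commit after the last audio chunk might look like the following, again reusing ws and the assumed event shape:

```python
# Commit the buffered audio once the utterance is complete (non-VAD mode only).
import json

ws.send(json.dumps({"type": "input_audio_buffer.commit"}))  # ws from the earlier sketch
# The server acknowledges with input_audio_buffer.committed.
print(ws.recv())
```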
session.finish
Ends the current session.
Server response flow:
If speech is detected: After completing the final speech recognition, the server sends the conversation.item.input_audio_transcription.completed event, which contains the recognition result. The server then sends the session.finished event to indicate that the session has ended.
If no speech is detected: The server directly sends a session.finished event.
After the client receives the session.finished event, it must disconnect.
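A possible shutdown sequence, reusing ws and the assumed event shape, is sketched below: send session.finish, read server events until session.finished arrives, then close the connection. The internal structure of the completed transcription event is not specified here and is treated opaquely.

```python
# End the session and drain server events until session.finished arrives.
import json

ws.send(json.dumps({"type": "session.finish"}))  # ws from the earlier sketch

while True:
    event = json.loads(ws.recv())
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        print("final result event:", event)  # recognition result (payload shape assumed)
    if event.get("type") == "session.finished":
        break                                # session over; the client must now disconnect

ws.close()
```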