
Alibaba Cloud Model Studio: Interaction flow for Qwen-ASR-Realtime

Last Updated: Jan 16, 2026

Qwen's real-time speech recognition service accepts and transcribes real-time audio streams over the WebSocket protocol. It supports two interaction flows: Voice Activity Detection (VAD) mode and Manual mode.

User guide: For an overview of the model, its features, and complete sample code, see Real-time speech recognition - Qwen.

URL

When constructing the URL, replace <model_name> with the name of the desired model.

wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=<model_name>

Headers

"Authorization": "bearer <your_dashscope_api_key>"

VAD mode (default)

The server automatically detects the start and end of speech and uses them to segment sentences. The client continuously sends the audio stream, and the server returns the final recognition result for each sentence once it detects the end of that sentence. This mode suits scenarios such as real-time conversations and meeting minutes.

How to enable: Configure the session.turn_detection parameter in the client's session.update event; see the sketch after the figure below.

(Figure: interaction flow in VAD mode)
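A hedged sketch of enabling VAD mode on an open connection ws from the sketch above. Only session.turn_detection is named in this document; the server_vad type and the input_audio_format field follow the OpenAI-style realtime event schema this endpoint resembles and should be verified against the API reference:

import json

async def enable_vad(ws):
    # Let the server detect speech start/end and segment sentences.
    # Field names beyond turn_detection are assumptions, not confirmed here.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "input_audio_format": "pcm16",             # assumed field
            "turn_detection": {"type": "server_vad"},  # assumed value
        },
    }))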

Manual mode

The client controls sentence segmentation by sending the audio for a complete sentence and then sending input_audio_buffer.commit to the server. This mode is suitable for scenarios where the client can clearly determine sentence boundaries, such as sending voice messages in chat applications.

How to enable: Set the session.turn_detection parameter to null in the client's session.update event; see the sketch after the figure below.

(Figure: interaction flow in Manual mode)
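A hedged sketch of the Manual-mode flow on an open connection ws from the connection sketch above. input_audio_buffer.commit is named in this document; the input_audio_buffer.append event and its base64-encoded audio field are assumptions based on the OpenAI-style event schema and should be verified against the API reference:

import base64
import json

async def send_sentence(ws, pcm_chunks):
    # Disable server-side VAD so the client controls sentence segmentation.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},  # serialized as JSON null
    }))
    # Stream the audio for one complete sentence (append event assumed).
    for chunk in pcm_chunks:
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    # Tell the server the sentence is complete; the final result follows.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))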