Qwen's real-time speech recognition service uses the WebSocket protocol to accept and transcribe real-time audio streams. It supports interaction flows in Voice Activity Detection (VAD) mode and Manual mode.
User guide: For an overview of the model, its features, and complete sample code, see Real-time speech recognition - Qwen.
URL
When constructing the URL, replace <model_name> with the name of the desired model.
wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=<model_name>
Headers
"Authorization": "bearer <your_dashscope_api_key>"
VAD mode (default)
The server automatically detects the start and end of speech for sentence segmentation. The client continuously sends the audio stream, and the server returns the final recognition result after it detects the end of a sentence. This mode is suitable for scenarios such as real-time conversations and meeting minutes.
How to enable: Configure the session.turn_detection parameter in the client's session.update event.
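As a rough sketch of what this looks like on the wire (reusing the ws connection from the snippet above): only the session.turn_detection parameter is documented here, so the "server_vad" value and the absence of other session fields are assumptions.

```python
import json

# Hypothetical session.update payload enabling server-side VAD.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "turn_detection": {"type": "server_vad"},  # assumed value for illustration
    },
}))
```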
Interaction flow (a client-side sketch follows this list):
1. The client sends input_audio_buffer.append to append audio to the buffer.
2. The server returns input_audio_buffer.speech_started when it detects speech. Note: If the client sends session.finish to end the session before receiving this event, the server returns a server event (see Server events), and the client must then disconnect.
3. The client continues to send input_audio_buffer.append.
4. After all audio has been sent, the client sends session.finish to the server to end the current session.
5. When the server detects the end of speech, it returns input_audio_buffer.speech_stopped.
6. The server returns input_audio_buffer.committed.
7. The server returns conversation.item.created.
8. The server returns conversation.item.input_audio_transcription.text, which contains real-time speech recognition results.
9. The server returns conversation.item.input_audio_transcription.completed, which contains the final speech recognition result.
10. The server returns a final server event (see Server events) to notify the client that the recognition process is complete. The client must then disconnect.
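Putting these steps together, a VAD-mode client might look like the sketch below. It reuses the ws connection from the earlier snippets; the base64 "audio" field name, the session.finished closing event, and the read_pcm_chunks helper are assumptions for illustration.

```python
import base64
import json

def read_pcm_chunks(path="audio.pcm", chunk_bytes=3200):
    # Hypothetical helper: stream raw PCM audio from a local file in small chunks.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            yield chunk

# Steps 1-3: keep appending audio to the buffer.
for chunk in read_pcm_chunks():
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),  # assumed base64 "audio" field
    }))

# Step 4: all audio has been sent, end the session.
ws.send(json.dumps({"type": "session.finish"}))

# Steps 5-10: read server events until the closing event, then disconnect.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "conversation.item.input_audio_transcription.text":
        print("partial:", event)
    elif event["type"] == "conversation.item.input_audio_transcription.completed":
        print("final:", event)
    elif event["type"] == "session.finished":  # assumption: name of the closing server event
        break
ws.close()
```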
Manual mode
The client controls sentence segmentation by sending the audio for a complete sentence and then sending input_audio_buffer.commit to the server. This mode is suitable for scenarios where the client can clearly determine sentence boundaries, such as sending voice messages in chat applications.
How to enable: Set the session.turn_detection parameter to null in the client's session.update event.
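On the wire this could look like the snippet below, again reusing the ws connection from the earlier sketches; any other session fields you need would go in the same payload.

```python
import json

# Disable server-side VAD so the client controls sentence segmentation.
ws.send(json.dumps({
    "type": "session.update",
    "session": {"turn_detection": None},  # serialized as JSON null
}))
```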
Interaction flow (a client-side sketch follows this list):
1. The client appends audio to the buffer by sending input_audio_buffer.append.
2. The client submits the input audio buffer by sending input_audio_buffer.commit. This submission creates a new user message item in the conversation.
3. The client sends session.finish to the server to end the current session.
4. The server returns input_audio_buffer.committed.
5. The server returns conversation.item.input_audio_transcription.text, which contains real-time speech recognition results.
6. The server returns conversation.item.input_audio_transcription.completed, which contains the final speech recognition result.
7. The server returns a final server event (see Server events) to notify the client that the recognition is complete. The client must then disconnect.
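A manual-mode client could then look like the sketch below: the audio for one complete sentence is appended, committed, and the session is finished. The local file path, the base64 "audio" field, and the session.finished closing event are assumptions for illustration.

```python
import base64
import json

# Read the audio for one complete sentence (hypothetical local PCM file).
with open("sentence.pcm", "rb") as f:
    sentence_pcm = f.read()

# Append the sentence, then commit it to control segmentation explicitly.
ws.send(json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(sentence_pcm).decode("ascii"),  # assumed base64 "audio" field
}))
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

# End the session once all sentences have been committed.
ws.send(json.dumps({"type": "session.finish"}))

# Read server events until the closing event, then disconnect.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "conversation.item.input_audio_transcription.completed":
        print("final:", event)
    elif event["type"] == "session.finished":  # assumption: name of the closing server event
        break
ws.close()
```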