Qwen's real-time speech recognition service uses the WebSocket protocol to accept and transcribe real-time audio streams. It supports interaction flows in Voice Activity Detection (VAD) mode and Manual mode.
User guide: For an overview of the model, its features, and complete sample code, see Real-time speech recognition - Qwen.
URL
When constructing the URL, replace <model_name> with the name of the desired model.
wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=<model_name>
Headers
"Authorization": "bearer <your_dashscope_api_key>"
VAD mode (default)
The server automatically detects the start and end of speech for sentence segmentation. The client continuously sends the audio stream, and the server returns the final recognition result after it detects the end of a sentence. This mode is suitable for scenarios such as real-time conversations and meeting minutes.
How to enable: Configure the session.turn_detection parameter in the client's session.update event.
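As a rough sketch of what this looks like on the wire (reusing the ws connection from the snippet above): only the session.turn_detection parameter is documented here, so the "server_vad" value and the absence of other session fields are assumptions.

```python
import json

# Hypothetical session.update payload enabling server-side VAD.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "turn_detection": {"type": "server_vad"},  # assumed value for illustration
    },
}))
```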
Interaction flow (a client-side sketch follows this list):
1. The client sends input_audio_buffer.append to append audio to the buffer.
2. The server returns input_audio_buffer.speech_started when it detects speech. Note: If the client sends session.finish to end the session before receiving this event, the server returns a server event (see Server events), and the client must then disconnect.
3. The client continues to send input_audio_buffer.append.
4. After all audio has been sent, the client sends session.finish to the server to end the current session.
5. When the server detects the end of speech, it returns input_audio_buffer.speech_stopped.
6. The server returns input_audio_buffer.committed.
7. The server returns conversation.item.created.
8. The server returns conversation.item.input_audio_transcription.text, which contains real-time speech recognition results.
9. The server returns conversation.item.input_audio_transcription.completed, which contains the final speech recognition result.
10. The server returns a final server event (see Server events) to notify the client that the recognition process is complete. The client must then disconnect.
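Putting these steps together, a VAD-mode client might look like the sketch below. It reuses the ws connection from the earlier snippets; the base64 "audio" field name, the session.finished closing event, and the read_pcm_chunks helper are assumptions for illustration.

```python
import base64
import json

def read_pcm_chunks(path="audio.pcm", chunk_bytes=3200):
    # Hypothetical helper: stream raw PCM audio from a local file in small chunks.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            yield chunk

# Steps 1-3: keep appending audio to the buffer.
for chunk in read_pcm_chunks():
    ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),  # assumed base64 "audio" field
    }))

# Step 4: all audio has been sent, end the session.
ws.send(json.dumps({"type": "session.finish"}))

# Steps 5-10: read server events until the closing event, then disconnect.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "conversation.item.input_audio_transcription.text":
        print("partial:", event)
    elif event["type"] == "conversation.item.input_audio_transcription.completed":
        print("final:", event)
    elif event["type"] == "session.finished":  # assumption: name of the closing server event
        break
ws.close()
```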
Manual mode
The client controls sentence segmentation by sending the audio for a complete sentence and then sending input_audio_buffer.commit to the server. This mode is suitable for scenarios where the client can clearly determine sentence boundaries, such as sending voice messages in chat applications.
How to enable: Set the session.turn_detection parameter to null in the client's session.update event.
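On the wire this could look like the snippet below, again reusing the ws connection from the earlier sketches; any other session fields you need would go in the same payload.

```python
import json

# Disable server-side VAD so the client controls sentence segmentation.
ws.send(json.dumps({
    "type": "session.update",
    "session": {"turn_detection": None},  # serialized as JSON null
}))
```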
Interaction flow (a client-side sketch follows this list):
1. The client appends audio to the buffer by sending input_audio_buffer.append.
2. The client submits the input audio buffer by sending input_audio_buffer.commit. This submission creates a new user message item in the conversation.
3. The client sends session.finish to the server to end the current session.
4. The server returns input_audio_buffer.committed.
5. The server returns conversation.item.input_audio_transcription.text, which contains real-time speech recognition results.
6. The server returns conversation.item.input_audio_transcription.completed, which contains the final speech recognition result.
7. The server returns a final server event (see Server events) to notify the client that the recognition is complete. The client must then disconnect.
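A manual-mode client could then look like the sketch below: the audio for one complete sentence is appended, committed, and the session is finished. The local file path, the base64 "audio" field, and the session.finished closing event are assumptions for illustration.

```python
import base64
import json

# Read the audio for one complete sentence (hypothetical local PCM file).
with open("sentence.pcm", "rb") as f:
    sentence_pcm = f.read()

# Append the sentence, then commit it to control segmentation explicitly.
ws.send(json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(sentence_pcm).decode("ascii"),  # assumed base64 "audio" field
}))
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

# End the session once all sentences have been committed.
ws.send(json.dumps({"type": "session.finish"}))

# Read server events until the closing event, then disconnect.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "conversation.item.input_audio_transcription.completed":
        print("final:", event)
    elif event["type"] == "session.finished":  # assumption: name of the closing server event
        break
ws.close()
```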