Stream audio to Qwen-ASR-Realtime over WebSocket and receive real-time transcription results via the DashScope Python SDK.
For an overview of supported models, features, and complete sample code, see Real-time speech recognition.
Prerequisites
Before you begin, make sure that you have:
- DashScope SDK 1.25.6 or later
- An API key
- An understanding of the interaction flow
Request parameters
OmniRealtimeConversation constructor
Create an OmniRealtimeConversation instance with the following parameters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Model to use. |
| callback | OmniRealtimeCallback | Yes | Callback object that handles server-side events. |
| url | str | Yes | WebSocket endpoint. Chinese mainland: wss://dashscope.aliyuncs.com/api-ws/v1/realtime. International: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime. |
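Putting the constructor parameters together, a minimal sketch might look like the following. The model name is a placeholder (pick one from the supported-model overview), and endpoint_for, build_conversation, and PrintingCallback are illustrative helpers, not SDK names; the SDK import is deferred into the function so the endpoint helper stands on its own:

```python
def endpoint_for(region: str) -> str:
    """Map a deployment region to the WebSocket endpoint from the table above."""
    endpoints = {
        "cn": "wss://dashscope.aliyuncs.com/api-ws/v1/realtime",
        "intl": "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
    }
    return endpoints[region]


def build_conversation(region: str = "intl"):
    # SDK import kept local so endpoint_for() works without dashscope installed.
    from dashscope.audio.qwen_omni import (
        OmniRealtimeCallback,
        OmniRealtimeConversation,
    )

    class PrintingCallback(OmniRealtimeCallback):
        """Minimal callback that logs each server event type."""

        def on_open(self) -> None:
            print("WebSocket connection opened")

        def on_event(self, message: dict) -> None:
            print("server event:", message.get("type"))

        def on_close(self, close_status_code, close_msg) -> None:
            print("connection closed:", close_status_code, close_msg)

    return OmniRealtimeConversation(
        model="qwen3-asr-flash-realtime",  # placeholder: use a supported model
        callback=PrintingCallback(),
        url=endpoint_for(region),
    )
```

The API key is read by the SDK from the DASHSCOPE_API_KEY environment variable, so it does not appear in the constructor.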
Session configuration
After connecting, call update_session to configure session parameters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| output_modalities | List[MultiModality] | Yes | Output modality. Fixed to [MultiModality.TEXT]. |
| enable_turn_detection | bool | No | Enables server-side Voice Activity Detection (VAD). Default: True. When False, call commit() manually to trigger recognition. |
| turn_detection_type | str | No | Server-side VAD type. Fixed to server_vad. |
| turn_detection_threshold | float | No | VAD sensitivity threshold. Default: 0.2. Recommended: 0.0. Valid range: [-1, 1]. Lower values increase sensitivity but may trigger on background noise; higher values reduce false triggers in noisy environments. |
| turn_detection_silence_duration_ms | int | No | Silence duration (ms) that marks the end of a statement. Default: 800. Recommended: 400. Valid range: [200, 6000]. Lower values (e.g., 300 ms) respond faster but may split speech at natural pauses; higher values (e.g., 1200 ms) handle long in-sentence pauses better at the cost of latency. |
| transcription_params | TranscriptionParams | No | Speech recognition configuration. See TranscriptionParams. |
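A hedged sketch of a session update follows. The keyword arguments are assumed to match the parameter names in the table above; validate_vad and configure_session are illustrative helpers (not SDK functions) that encode the documented value ranges:

```python
def validate_vad(threshold: float, silence_ms: int) -> None:
    """Raise ValueError if VAD settings fall outside the documented ranges."""
    if not -1.0 <= threshold <= 1.0:
        raise ValueError("turn_detection_threshold must be in [-1, 1]")
    if not 200 <= silence_ms <= 6000:
        raise ValueError("turn_detection_silence_duration_ms must be in [200, 6000]")


def configure_session(conversation,
                      threshold: float = 0.0,
                      silence_ms: int = 400) -> None:
    """Apply the recommended VAD settings after connect()."""
    from dashscope.audio.qwen_omni import MultiModality  # assumed import path

    validate_vad(threshold, silence_ms)
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],  # fixed to TEXT for ASR
        enable_turn_detection=True,              # server-side VAD (default)
        turn_detection_type="server_vad",
        turn_detection_threshold=threshold,
        turn_detection_silence_duration_ms=silence_ms,
    )
```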
TranscriptionParams
Configure speech recognition settings with the TranscriptionParams constructor.
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | str | No | Source language of the audio. Supported values: zh (Chinese: Mandarin, Sichuanese, Minnan, Wu), yue (Cantonese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), pt (Portuguese), it (Italian), ru (Russian), ar (Arabic), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese), cs (Czech), da (Danish), fi (Finnish), fil (Filipino), is (Icelandic), ms (Malay), no (Norwegian), pl (Polish), sv (Swedish). |
| sample_rate | int | No | Audio sampling rate in Hz. Default: 16000. Supported: 16000, 8000. With 8000, the server upsamples to 16,000 Hz before recognition, which may add minor latency. Use 8000 only for 8 kHz source audio, such as telephone recordings. |
| input_audio_format | str | No | Audio format. Default: pcm. Supported: pcm, opus. |
| corpus_text | str | No | Background text, entity vocabularies, or other reference information for contextual biasing. Max: 10,000 tokens. For details, see Contextual biasing. |
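As an illustration of the table above, the helper below (not part of the SDK) rejects unsupported values before constructing TranscriptionParams; the import path is assumed, and the corpus_text value is a hypothetical biasing vocabulary:

```python
# Language codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "zh", "yue", "en", "ja", "ko", "de", "fr", "es", "pt", "it", "ru", "ar",
    "hi", "id", "th", "tr", "uk", "vi", "cs", "da", "fi", "fil", "is", "ms",
    "no", "pl", "sv",
}


def make_transcription_params(language: str = "en", sample_rate: int = 16000):
    """Validate inputs against the documented values, then build the params."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language}")
    if sample_rate not in (16000, 8000):
        raise ValueError("sample_rate must be 16000 or 8000")

    from dashscope.audio.qwen_omni import TranscriptionParams  # assumed path

    return TranscriptionParams(
        language=language,
        sample_rate=sample_rate,       # use 8000 only for telephone-grade audio
        input_audio_format="pcm",
        corpus_text="DashScope, Qwen, ASR",  # hypothetical biasing vocabulary
    )
```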
Key interfaces
OmniRealtimeConversation class
```python
from dashscope.audio.qwen_omni import OmniRealtimeConversation
```
| Method | Server response event | Description |
|---|---|---|
| connect() | session.created, session.updated | Opens a WebSocket connection to the server. |
| update_session(...) | session.updated | Configures the session. Call after connect(). If omitted, defaults apply. See Session configuration for parameters. |
| append_audio(audio_b64: str) | None | Sends a Base64-encoded audio chunk to the server input buffer. With enable_turn_detection=True, the server detects speech boundaries and commits automatically. With enable_turn_detection=False, the client controls commit timing (max 15 MiB per event). Smaller chunks improve VAD responsiveness. |
| commit() | input_audio_buffer.committed | Commits buffered audio for recognition. Returns an error if the buffer is empty. Disabled when enable_turn_detection=True. |
| end_session(timeout: int = 20) | session.finished | Ends the session after the server completes final recognition. In VAD mode (default), call after sending all audio. In manual mode, call after commit(). Async variant: end_session_async(). |
| close() | None | Terminates the task and closes the connection. |
| get_session_id() | None | Returns the current session ID. |
| get_last_response_id() | None | Returns the most recent response ID. |
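The methods above combine into a manual-commit flow (VAD off) roughly like this sketch. chunk_and_encode and transcribe_file are illustrative helpers, and the 100 ms pacing is a convention for streaming roughly in real time, not an SDK requirement:

```python
import base64
import time


def chunk_and_encode(pcm: bytes, chunk_size: int = 3200) -> list[str]:
    """Split raw PCM into chunks and Base64-encode each one.

    3200 bytes is 100 ms of 16 kHz / 16-bit mono audio.
    """
    return [
        base64.b64encode(pcm[i:i + chunk_size]).decode("ascii")
        for i in range(0, len(pcm), chunk_size)
    ]


def transcribe_file(conversation, path: str) -> None:
    """Manual-commit flow: stream a PCM file, then commit and finish."""
    from dashscope.audio.qwen_omni import MultiModality  # assumed import path

    with open(path, "rb") as f:
        pcm = f.read()

    conversation.connect()
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],
        enable_turn_detection=False,  # manual mode: client controls commits
    )
    for chunk in chunk_and_encode(pcm):
        conversation.append_audio(chunk)
        time.sleep(0.1)              # pace chunks roughly in real time
    conversation.commit()            # trigger recognition of buffered audio
    conversation.end_session()       # wait for final results (session.finished)
    conversation.close()
```

With enable_turn_detection=True (the default), drop the commit() call: the server commits automatically at detected speech boundaries, and end_session() is called once all audio has been sent.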
OmniRealtimeCallback interface
Subclass OmniRealtimeCallback and implement its methods to handle server events.
```python
from dashscope.audio.qwen_omni import OmniRealtimeCallback
```
| Method | Parameters | Description |
|---|---|---|
| on_open() | None | Called when the WebSocket connection is established. |
| on_event(message: dict) | message: a server event | Called when a server event is received. |
| on_close(close_status_code, close_msg) | close_status_code: status code; close_msg: log message | Called when the WebSocket connection is closed. |
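A sketch of a concrete callback follows. To keep the event-handling logic self-contained it is shown without the SDK base class; in real use, derive it from OmniRealtimeCallback instead. The only event type assumed here is session.finished, taken from the method table above:

```python
import threading


def is_session_finished(message: dict) -> bool:
    """True for the server event that marks the end of a session."""
    return message.get("type") == "session.finished"


class TranscriptionCallback:  # in real use: subclass OmniRealtimeCallback
    """Collects server events and signals when the session has finished."""

    def __init__(self) -> None:
        self.events: list[dict] = []
        self.finished = threading.Event()

    def on_open(self) -> None:
        print("connection opened")

    def on_event(self, message: dict) -> None:
        self.events.append(message)          # keep every event for inspection
        if is_session_finished(message):
            self.finished.set()              # unblock a waiting caller

    def on_close(self, close_status_code, close_msg) -> None:
        print("connection closed:", close_status_code, close_msg)
```

A caller can then block on callback.finished.wait(timeout) instead of polling, and read transcription results out of the accumulated events.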