Stream audio to Qwen-ASR-Realtime over WebSocket and receive real-time transcription results via the DashScope Python SDK.
For an overview of supported models, features, and complete sample code, see Real-time speech recognition.
Prerequisites
Before you begin, make sure that you have:
- DashScope SDK 1.25.6 or later
- An API key
- An understanding of the interaction flow
Request parameters
OmniRealtimeConversation constructor
Create an OmniRealtimeConversation instance with the following parameters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Model to use. |
| callback | OmniRealtimeCallback | Yes | Callback object that handles server-side events. |
| url | str | Yes | WebSocket endpoint. Chinese mainland: wss://dashscope.aliyuncs.com/api-ws/v1/realtime. International: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime. |
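Putting the constructor parameters together, a minimal sketch might look like the following. The model name is a placeholder (pick one from the supported-model overview), and endpoint_for, build_conversation, and PrintingCallback are illustrative helpers, not SDK names; the SDK import is deferred into the function so the endpoint helper stands on its own:

```python
def endpoint_for(region: str) -> str:
    """Map a deployment region to the WebSocket endpoint from the table above."""
    endpoints = {
        "cn": "wss://dashscope.aliyuncs.com/api-ws/v1/realtime",
        "intl": "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
    }
    return endpoints[region]


def build_conversation(region: str = "intl"):
    # SDK import kept local so endpoint_for() works without dashscope installed.
    from dashscope.audio.qwen_omni import (
        OmniRealtimeCallback,
        OmniRealtimeConversation,
    )

    class PrintingCallback(OmniRealtimeCallback):
        """Minimal callback that logs each server event type."""

        def on_open(self) -> None:
            print("WebSocket connection opened")

        def on_event(self, message: dict) -> None:
            print("server event:", message.get("type"))

        def on_close(self, close_status_code, close_msg) -> None:
            print("connection closed:", close_status_code, close_msg)

    return OmniRealtimeConversation(
        model="qwen3-asr-flash-realtime",  # placeholder: use a supported model
        callback=PrintingCallback(),
        url=endpoint_for(region),
    )
```

The API key is read by the SDK from the DASHSCOPE_API_KEY environment variable, so it does not appear in the constructor.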
Session configuration
After connecting, call update_session to configure session parameters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| output_modalities | List[MultiModality] | Yes | Output modality. Fixed to [MultiModality.TEXT]. |
| enable_turn_detection | bool | No | Enables server-side Voice Activity Detection (VAD). Default: True. When False, call commit() manually to trigger recognition. |
| turn_detection_type | str | No | Server-side VAD type. Fixed to server_vad. |
| turn_detection_threshold | float | No | VAD sensitivity threshold. Default: 0.2. Recommended: 0.0. Valid range: [-1, 1]. Lower values increase sensitivity but may trigger on background noise; higher values reduce false triggers in noisy environments. |
| turn_detection_silence_duration_ms | int | No | Silence duration (ms) that marks the end of a statement. Default: 800. Recommended: 400. Valid range: [200, 6000]. Lower values (e.g., 300 ms) respond faster but may split speech at natural pauses; higher values (e.g., 1200 ms) handle long in-sentence pauses better at the cost of latency. |
| transcription_params | TranscriptionParams | No | Speech recognition configuration. See TranscriptionParams. |
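A hedged sketch of a session update follows. The keyword arguments are assumed to match the parameter names in the table above; validate_vad and configure_session are illustrative helpers (not SDK functions) that encode the documented value ranges:

```python
def validate_vad(threshold: float, silence_ms: int) -> None:
    """Raise ValueError if VAD settings fall outside the documented ranges."""
    if not -1.0 <= threshold <= 1.0:
        raise ValueError("turn_detection_threshold must be in [-1, 1]")
    if not 200 <= silence_ms <= 6000:
        raise ValueError("turn_detection_silence_duration_ms must be in [200, 6000]")


def configure_session(conversation,
                      threshold: float = 0.0,
                      silence_ms: int = 400) -> None:
    """Apply the recommended VAD settings after connect()."""
    from dashscope.audio.qwen_omni import MultiModality  # assumed import path

    validate_vad(threshold, silence_ms)
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],  # fixed to TEXT for ASR
        enable_turn_detection=True,              # server-side VAD (default)
        turn_detection_type="server_vad",
        turn_detection_threshold=threshold,
        turn_detection_silence_duration_ms=silence_ms,
    )
```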
TranscriptionParams
Configure speech recognition settings with the TranscriptionParams constructor.
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | str | No | Source language of the audio. Supported values: zh (Chinese: Mandarin, Sichuanese, Minnan, Wu), yue (Cantonese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), pt (Portuguese), it (Italian), ru (Russian), ar (Arabic), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese), cs (Czech), da (Danish), fi (Finnish), fil (Filipino), is (Icelandic), ms (Malay), no (Norwegian), pl (Polish), sv (Swedish). |
| sample_rate | int | No | Audio sampling rate in Hz. Default: 16000. Supported: 16000, 8000. With 8000, the server upsamples to 16,000 Hz before recognition, which may add minor latency. Use 8000 only for 8 kHz source audio, such as telephone recordings. |
| input_audio_format | str | No | Audio format. Default: pcm. Supported: pcm, opus. |
| corpus_text | str | No | Background text, entity vocabularies, or other reference information for contextual biasing. Max: 10,000 tokens. For details, see Contextual biasing. |
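As an illustration of the table above, the helper below (not part of the SDK) rejects unsupported values before constructing TranscriptionParams; the import path is assumed, and the corpus_text value is a hypothetical biasing vocabulary:

```python
# Language codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "zh", "yue", "en", "ja", "ko", "de", "fr", "es", "pt", "it", "ru", "ar",
    "hi", "id", "th", "tr", "uk", "vi", "cs", "da", "fi", "fil", "is", "ms",
    "no", "pl", "sv",
}


def make_transcription_params(language: str = "en", sample_rate: int = 16000):
    """Validate inputs against the documented values, then build the params."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language}")
    if sample_rate not in (16000, 8000):
        raise ValueError("sample_rate must be 16000 or 8000")

    from dashscope.audio.qwen_omni import TranscriptionParams  # assumed path

    return TranscriptionParams(
        language=language,
        sample_rate=sample_rate,       # use 8000 only for telephone-grade audio
        input_audio_format="pcm",
        corpus_text="DashScope, Qwen, ASR",  # hypothetical biasing vocabulary
    )
```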
Key interfaces
OmniRealtimeConversation class
```python
from dashscope.audio.qwen_omni import OmniRealtimeConversation
```
| Method | Server response event | Description |
|---|---|---|
| connect() | session.created, session.updated | Opens a WebSocket connection to the server. |
| update_session(...) | session.updated | Configures the session. Call after connect(). If omitted, defaults apply. See Session configuration for parameters. |
| append_audio(audio_b64: str) | None | Sends a Base64-encoded audio chunk to the server input buffer. With enable_turn_detection=True, the server detects speech boundaries and commits automatically. With enable_turn_detection=False, the client controls commit timing (max 15 MiB per event). Smaller chunks improve VAD responsiveness. |
| commit() | input_audio_buffer.committed | Commits buffered audio for recognition. Returns an error if the buffer is empty. Disabled when enable_turn_detection=True. |
| end_session(timeout: int = 20) | session.finished | Ends the session after the server completes final recognition. In VAD mode (default), call after sending all audio. In manual mode, call after commit(). Async variant: end_session_async(). |
| close() | None | Terminates the task and closes the connection. |
| get_session_id() | None | Returns the current session ID. |
| get_last_response_id() | None | Returns the most recent response ID. |
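The methods above combine into a manual-commit flow (VAD off) roughly like this sketch. chunk_and_encode and transcribe_file are illustrative helpers, and the 100 ms pacing is a convention for streaming roughly in real time, not an SDK requirement:

```python
import base64
import time


def chunk_and_encode(pcm: bytes, chunk_size: int = 3200) -> list[str]:
    """Split raw PCM into chunks and Base64-encode each one.

    3200 bytes is 100 ms of 16 kHz / 16-bit mono audio.
    """
    return [
        base64.b64encode(pcm[i:i + chunk_size]).decode("ascii")
        for i in range(0, len(pcm), chunk_size)
    ]


def transcribe_file(conversation, path: str) -> None:
    """Manual-commit flow: stream a PCM file, then commit and finish."""
    from dashscope.audio.qwen_omni import MultiModality  # assumed import path

    with open(path, "rb") as f:
        pcm = f.read()

    conversation.connect()
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],
        enable_turn_detection=False,  # manual mode: client controls commits
    )
    for chunk in chunk_and_encode(pcm):
        conversation.append_audio(chunk)
        time.sleep(0.1)              # pace chunks roughly in real time
    conversation.commit()            # trigger recognition of buffered audio
    conversation.end_session()       # wait for final results (session.finished)
    conversation.close()
```

With enable_turn_detection=True (the default), drop the commit() call: the server commits automatically at detected speech boundaries, and end_session() is called once all audio has been sent.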
OmniRealtimeCallback interface
Subclass OmniRealtimeCallback and implement its methods to handle server events.
```python
from dashscope.audio.qwen_omni import OmniRealtimeCallback
```
| Method | Parameters | Description |
|---|---|---|
| on_open() | None | Called when the WebSocket connection is established. |
| on_event(message: dict) | message: a server event | Called when a server event is received. |
| on_close(close_status_code, close_msg) | close_status_code: status code; close_msg: log message | Called when the WebSocket connection is closed. |
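A sketch of a concrete callback follows. To keep the event-handling logic self-contained it is shown without the SDK base class; in real use, derive it from OmniRealtimeCallback instead. The only event type assumed here is session.finished, taken from the method table above:

```python
import threading


def is_session_finished(message: dict) -> bool:
    """True for the server event that marks the end of a session."""
    return message.get("type") == "session.finished"


class TranscriptionCallback:  # in real use: subclass OmniRealtimeCallback
    """Collects server events and signals when the session has finished."""

    def __init__(self) -> None:
        self.events: list[dict] = []
        self.finished = threading.Event()

    def on_open(self) -> None:
        print("connection opened")

    def on_event(self, message: dict) -> None:
        self.events.append(message)          # keep every event for inspection
        if is_session_finished(message):
            self.finished.set()              # unblock a waiting caller

    def on_close(self, close_status_code, close_msg) -> None:
        print("connection closed:", close_status_code, close_msg)
```

A caller can then block on callback.finished.wait(timeout) instead of polling, and read transcription results out of the accumulated events.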