
Alibaba Cloud Model Studio: Qwen-ASR-Realtime Python SDK - API reference

Last Updated: Mar 15, 2026

Stream audio to Qwen-ASR-Realtime over WebSocket and receive real-time transcription results via the DashScope Python SDK.

For an overview of supported models, features, and complete sample code, see Real-time speech recognition.

Prerequisites

Before you begin, make sure that you have:

Obtained a Model Studio API key and configured it in the DASHSCOPE_API_KEY environment variable.

Installed the latest DashScope Python SDK (pip install -U dashscope).

Request parameters

OmniRealtimeConversation constructor

Create an OmniRealtimeConversation instance with the following parameters.


from dashscope.audio.qwen_omni import OmniRealtimeConversation, OmniRealtimeCallback

class MyCallback(OmniRealtimeCallback):
    """Callback for real-time recognition"""
    def __init__(self, conversation):
        self.conversation = conversation
        self.handlers = {
            'session.created': self._handle_session_created,
            'conversation.item.input_audio_transcription.completed': self._handle_final_text,
            'conversation.item.input_audio_transcription.text': self._handle_stash_text,
            'input_audio_buffer.speech_started': lambda r: print('======Speech Start======'),
            'input_audio_buffer.speech_stopped': lambda r: print('======Speech Stop======')
        }

    def on_open(self):
        print('Connection opened')

    def on_close(self, code, msg):
        print(f'Connection closed, code: {code}, msg: {msg}')

    def on_event(self, response):
        try:
            handler = self.handlers.get(response['type'])
            if handler:
                handler(response)
        except Exception as e:
            print(f'[Error] {e}')

    def _handle_session_created(self, response):
        print(f"Start session: {response['session']['id']}")

    def _handle_final_text(self, response):
        print(f"Final recognized text: {response['transcript']}")

    def _handle_stash_text(self, response):
        print(f"Got stash result: {response['stash']}")

conversation = OmniRealtimeConversation(
    model='qwen3-asr-flash-realtime',
    # The following URL is for the Chinese mainland. For regions outside
    # the Chinese mainland, use
    # wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime instead.
    url='wss://dashscope.aliyuncs.com/api-ws/v1/realtime',
    callback=MyCallback(conversation=None)  # Temporarily pass None and inject it later.
)
# Inject the conversation instance into the callback.
conversation.callback.conversation = conversation
Parameter | Type | Required | Description
model | str | Yes | Model to use.
callback | OmniRealtimeCallback | Yes | Callback object that handles server-side events.
url | str | Yes | WebSocket endpoint. Chinese mainland: wss://dashscope.aliyuncs.com/api-ws/v1/realtime. International: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime

Session configuration

After connecting, call update_session to configure session parameters.


from dashscope.audio.qwen_omni import TranscriptionParams, MultiModality

transcription_params = TranscriptionParams(
    language='zh',
    sample_rate=16000,
    input_audio_format="pcm"
)

conversation.update_session(
    output_modalities=[MultiModality.TEXT],
    enable_turn_detection=True,
    turn_detection_type="server_vad",
    turn_detection_threshold=0.0,
    turn_detection_silence_duration_ms=400,
    enable_input_audio_transcription=True,
    transcription_params=transcription_params
)
Parameter | Type | Required | Description
output_modalities | List[MultiModality] | Yes | Output modality. Fixed to [MultiModality.TEXT].
enable_turn_detection | bool | No | Enables server-side Voice Activity Detection (VAD). Default: True. When False, call commit() manually to trigger recognition.
turn_detection_type | str | No | Server-side VAD type. Fixed to server_vad.
turn_detection_threshold | float | No | VAD sensitivity threshold. Default: 0.2. Recommended: 0.0. Valid range: [-1, 1]. Lower values increase sensitivity but may trigger on background noise; higher values reduce false triggers in noisy environments.
turn_detection_silence_duration_ms | int | No | Silence duration (ms) that marks the end of a statement. Default: 800. Recommended: 400. Valid range: [200, 6000]. Lower values (for example, 300 ms) respond faster but may split speech at natural pauses; higher values (for example, 1200 ms) handle long-sentence pauses better at the cost of latency.
transcription_params | TranscriptionParams | No | Speech recognition configuration. See TranscriptionParams.
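When enable_turn_detection is False, the server does not segment speech for you: the client streams audio, then calls commit() to trigger recognition of everything buffered so far. The sketch below illustrates this manual mode; the helper names (pcm_chunks, recognize_file), the chunk size, and the file path are illustrative assumptions, not part of the SDK.

```python
import base64
import time

CHUNK_BYTES = 3200  # 100 ms of 16 kHz, 16-bit mono PCM

def pcm_chunks(raw: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split raw PCM bytes into Base64-encoded chunks for append_audio()."""
    for start in range(0, len(raw), chunk_bytes):
        yield base64.b64encode(raw[start:start + chunk_bytes]).decode('ascii')

def recognize_file(conversation, path: str):
    """Manual mode: send one utterance, then commit it explicitly.

    Assumes `conversation` was created and configured as shown above,
    with enable_turn_detection=False.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    for chunk in pcm_chunks(raw):
        conversation.append_audio(chunk)
        time.sleep(0.1)          # pace the stream roughly in real time
    conversation.commit()        # trigger recognition of the buffered audio
    conversation.end_session()   # wait for the final result, then finish
```

Because commit() recognizes the whole buffer at once, manual mode suits cases where utterance boundaries are already known, such as push-to-talk input.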

TranscriptionParams

Configure speech recognition settings with the TranscriptionParams constructor.


transcription_params = TranscriptionParams(
    language='zh',
    sample_rate=16000,
    input_audio_format="pcm"
)
Parameter | Type | Required | Description
language | str | No | Source language of the audio. Supported values: zh (Chinese: Mandarin, Sichuanese, Minnan, Wu), yue (Cantonese), en (English), ja (Japanese), ko (Korean), de (German), fr (French), es (Spanish), pt (Portuguese), it (Italian), ru (Russian), ar (Arabic), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese), cs (Czech), da (Danish), fi (Finnish), fil (Filipino), is (Icelandic), ms (Malay), no (Norwegian), pl (Polish), sv (Swedish)
sample_rate | int | No | Audio sampling rate in Hz. Default: 16000. Supported: 16000, 8000. With 8000, the server upsamples to 16,000 Hz before recognition, which may add minor latency. Use 8000 only for 8 kHz source audio, such as telephone recordings.
input_audio_format | str | No | Audio format. Default: pcm. Supported: pcm, opus.
corpus_text | str | No | Background text, entity vocabularies, or other reference information for contextual biasing. Maximum: 10,000 tokens. For details, see Contextual biasing.
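corpus_text accepts free-form reference text that biases recognition toward domain terms. A minimal sketch of passing a vocabulary through it; the product names below are illustrative placeholders, not a required format.

```python
from dashscope.audio.qwen_omni import TranscriptionParams

transcription_params = TranscriptionParams(
    language='zh',
    sample_rate=16000,
    input_audio_format='pcm',
    # Plain text listing entity names the model should prefer
    # (illustrative; replace with your own domain vocabulary).
    corpus_text='Product names: Qwen3-ASR, DashScope, Model Studio.'
)
```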

Key interfaces

OmniRealtimeConversation class

from dashscope.audio.qwen_omni import OmniRealtimeConversation
Method | Server response event | Description
connect() | session.created, session.updated | Opens a WebSocket connection to the server.
update_session(...) | session.updated | Configures the session. Call after connect(). If omitted, defaults apply. See Session configuration for parameters.
append_audio(audio_b64: str) | None | Sends a Base64-encoded audio chunk to the server input buffer. With enable_turn_detection=True, the server detects speech boundaries and commits automatically. With enable_turn_detection=False, the client controls commit timing (maximum 15 MiB per event). Smaller chunks improve VAD responsiveness.
commit() | input_audio_buffer.committed | Commits buffered audio for recognition. Returns an error if the buffer is empty. Disabled when enable_turn_detection=True.
end_session(timeout: int = 20) | session.finished | Ends the session after the server completes final recognition. In VAD mode (default), call it after sending all audio. In manual mode, call it after commit(). Async variant: end_session_async().
close() | None | Terminates the task and closes the connection.
get_session_id() | None | Returns the current session ID.
get_last_response_id() | None | Returns the most recent response ID.
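Putting the methods together, the sketch below shows the default VAD-mode lifecycle for a local PCM file: connect, configure, stream, end. The file path, chunk size, the chunk_seconds pacing helper, and the MyCallback class from the constructor example are assumptions; end_session() blocks until the final transcript arrives.

```python
import base64
import time

def chunk_seconds(chunk_bytes: int, sample_rate: int = 16000,
                  sample_width: int = 2) -> float:
    """Playback duration of one raw PCM chunk, used to pace streaming."""
    return chunk_bytes / (sample_rate * sample_width)

def transcribe_pcm_file(path: str, chunk_bytes: int = 3200):
    """Full lifecycle in VAD mode: connect, configure, stream, finish."""
    # Imported here so the pacing helper above works without the SDK installed.
    from dashscope.audio.qwen_omni import (
        OmniRealtimeConversation, MultiModality, TranscriptionParams)

    conversation = OmniRealtimeConversation(
        model='qwen3-asr-flash-realtime',
        url='wss://dashscope.aliyuncs.com/api-ws/v1/realtime',
        callback=MyCallback(conversation=None),  # callback class from above
    )
    conversation.callback.conversation = conversation
    conversation.connect()
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],
        enable_turn_detection=True,  # server VAD commits automatically
        transcription_params=TranscriptionParams(
            language='zh', sample_rate=16000, input_audio_format='pcm'),
    )
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_bytes):
            conversation.append_audio(base64.b64encode(chunk).decode('ascii'))
            time.sleep(chunk_seconds(chunk_bytes))  # stream roughly in real time
    conversation.end_session()  # wait for the final transcript
    conversation.close()
```

Pacing the stream at roughly real time keeps VAD boundary detection accurate; dumping the whole file at once can delay or merge speech_started/speech_stopped events.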

OmniRealtimeCallback interface

Subclass OmniRealtimeCallback and implement its methods to handle server events.

from dashscope.audio.qwen_omni import OmniRealtimeCallback
Method | Parameters | Description
on_open() | None | Called when the WebSocket connection is established.
on_event(message: dict) | message: a server event | Called when a server event is received.
on_close(close_status_code, close_msg) | close_status_code: status code; close_msg: log message | Called when the WebSocket connection is closed.