
Alibaba Cloud Model Studio: Python SDK

Last Updated: Mar 15, 2026

This topic describes the key interfaces and request parameters of the DashScope Python SDK for Qwen-Omni-Realtime.

Prerequisites

SDK version 1.23.9+ is required. See Real-time multimodal interaction flow.

Getting started

Visit GitHub to download the sample code. We provide sample code for three calling methods:

  1. Audio conversation example: Captures real-time audio from a microphone, enables VAD mode for automatic speech start and end detection (enable_turn_detection = True), and supports voice interruption.

    Use headphones for audio playback to prevent echoes from triggering voice interruption.
  2. Audio and video conversation example: Captures real-time audio and video from a microphone and a camera, enables VAD mode for automatic speech start and end detection (enable_turn_detection = True), and supports voice interruption.

    Use headphones for audio playback to prevent echoes from triggering voice interruption.
  3. Local call example: Uses local audio and images as input and enables manual mode, in which the client controls the pace of sending (enable_turn_detection = False).

Request parameters

Set the following request parameters using the constructor method (__init__) of the OmniRealtimeConversation class.

model (str): The Qwen-Omni model name. See Model list.

callback (OmniRealtimeCallback): The callback object instance that handles server-side events.

url (str): The call address:

  • Singapore region: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime

  • Beijing region: wss://dashscope.aliyuncs.com/api-ws/v1/realtime

Configure the following request parameters using the update_session method.

output_modalities (list[MultiModality]): The model's output modalities. Set to [MultiModality.TEXT] for text only, or [MultiModality.TEXT, MultiModality.AUDIO] for both text and audio.

voice (str): The voice for the audio generated by the model. For a list of supported voices, see Voice list.

Default voices:

  • Qwen3-Omni-Flash-Realtime: "Cherry"

  • Qwen-Omni-Turbo-Realtime: "Chelsie"

input_audio_format (AudioFormat): The input audio format. Currently, only PCM_16000HZ_MONO_16BIT is supported.

output_audio_format (AudioFormat): The format of the model's output audio. Currently, only PCM is supported.

smooth_output (bool): Supported only by the Qwen3-Omni-Flash-Realtime series.

  • True: Returns a conversational response.

  • False: Returns a more formal, written-style response. However, this may result in poor quality if the content is difficult to read aloud.

  • None: The model automatically selects a conversational or formal response style.

instructions (str): A system message that sets the model's objective or role.

For example: You are an AI agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies. Be accurate and friendly. Always respond with a professional and helpful attitude. Do not provide unverified information or information outside the hotel's scope of services.

enable_input_audio_transcription (bool): Specifies whether to enable speech recognition for the input audio.

input_audio_transcription_model (str): The speech recognition model used to transcribe input audio. Currently, only gummy-realtime-v1 is supported.

turn_detection_type (str): The server-side Voice Activity Detection (VAD) type. This is fixed to server_vad.

turn_detection_threshold (float): The VAD detection threshold. Increase this value in noisy environments and decrease it in quiet environments.

  • The closer the value is to -1, the more likely noise is to be detected as speech.

  • The closer the value is to 1, the less likely noise is to be detected as speech.

Default value: 0.5. Valid values: [-1.0, 1.0].

turn_detection_silence_duration_ms (int): The duration of silence, in milliseconds, that marks the end of speech. When this duration is exceeded, the model triggers a response. Default value: 800. Valid values: [200, 6000].

temperature (float): The sampling temperature, which controls content diversity. Higher values increase diversity; lower values increase determinism.

Valid values: [0, 2).

Because both temperature and top_p control content diversity, set only one of them.

Default values:

  • qwen3-omni-flash-realtime series: 0.9

  • qwen-omni-turbo-realtime series: 1.0

qwen-omni-turbo models do not support modifying this parameter.

top_p (float): The nucleus sampling probability threshold, which controls content diversity. Higher values increase diversity; lower values increase determinism.

Valid values: (0, 1.0].

Because both temperature and top_p control content diversity, set only one of them.

Default values:

  • qwen3-omni-flash-realtime series: 1.0

  • qwen-omni-turbo-realtime series: 0.01

qwen-omni-turbo models do not support modifying this parameter.

top_k (integer): The size of the candidate set for sampling during generation. For example, if you set this parameter to 50, only the 50 tokens with the highest scores form the candidate set for random sampling. A larger value increases randomness; a smaller value increases determinism. If you set this parameter to None or a value greater than 100, the top_k policy is disabled and only the top_p policy takes effect.

The value must be greater than or equal to 0.

Default values:

  • qwen3-omni-flash-realtime series: 50

  • qwen-omni-turbo-realtime series: 20

qwen-omni-turbo models do not support modifying this parameter.
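The top_k and top_p policies described above can be illustrated with a small, self-contained sketch. This is plain Python for illustration only, not SDK code, and the helper names are made up:

```python
from typing import Dict, Optional

def filter_top_k(token_scores: Dict[str, float], top_k: Optional[int]) -> Dict[str, float]:
    """Keep only the top_k highest-scoring candidate tokens.

    Per the parameter description above, top_k = None (or any value greater
    than 100) disables the policy and leaves the candidate set unchanged."""
    if top_k is None or top_k > 100:
        return dict(token_scores)
    ranked = sorted(token_scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])

def filter_top_p(token_probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}
print(sorted(filter_top_k(probs, 2)))    # the two highest-scoring tokens
print(sorted(filter_top_p(probs, 0.8)))  # smallest set reaching 80% mass
```

Because both policies shrink the candidate set before random sampling, tightening either one makes generation more deterministic.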

max_tokens (integer): The maximum number of tokens to return for the current request.

max_tokens does not affect the model's generation process; if the output exceeds max_tokens, the response is truncated.

The default value and the maximum value are both the maximum output length of the model. For the maximum output length of each model, see Model list.

max_tokens is suitable for scenarios that require limiting the word count (such as generating summaries or keywords), controlling costs, or reducing response time.

qwen-omni-turbo models do not support modifying this parameter.

repetition_penalty (float): The degree to which repetition in consecutive sequences is penalized during generation. A higher repetition_penalty value reduces repetition. A value of 1.0 means no penalty. There is no strict value range, but the value must be greater than 0.

Default value: 1.05.

qwen-omni-turbo models do not support modifying this parameter.

presence_penalty (float): Controls the repetition of the content that the model generates.

Default value: 0.0. Valid values: [-2.0, 2.0]. A positive value reduces repetition; a negative value increases it.

Scenarios:

  • A higher presence_penalty suits scenarios that require diversity, fun, or creativity, such as creative writing or brainstorming.

  • A lower presence_penalty suits scenarios that require consistency or technical terminology, such as technical or other formal documents.

qwen-omni-turbo models do not support modifying this parameter.

seed (integer): Makes model generation more deterministic, helping ensure consistent results across runs.

If you pass the same seed value in each model call and keep all other parameters unchanged, the model returns the same result as much as possible.

Valid values: 0 to 2³¹ - 1. Default value: -1.

qwen-omni-turbo models do not support modifying this parameter.

Key interfaces

OmniRealtimeConversation class

Import the OmniRealtimeConversation class using the statement from dashscope.audio.qwen_omni import OmniRealtimeConversation.

Each method below is listed with its signature, the server response events it triggers (delivered via callback), and a description.

def connect(self) -> None

Events: session.created (Session created), session.updated (Session configuration updated).

Creates a connection with the server.

def update_session(self,
                   output_modalities: list[MultiModality],
                   voice: str,
                   input_audio_format: AudioFormat = AudioFormat.PCM_16000HZ_MONO_16BIT,
                   output_audio_format: AudioFormat = AudioFormat.PCM_24000HZ_MONO_16BIT,
                   enable_input_audio_transcription: bool = True,
                   input_audio_transcription_model: str = None,
                   enable_turn_detection: bool = True,
                   turn_detection_type: str = 'server_vad',
                   prefix_padding_ms: int = 300,
                   turn_detection_threshold: float = 0.2,
                   turn_detection_silence_duration_ms: int = 800,
                   turn_detection_param: dict = None,
                   smooth_output: bool = True,
                   **kwargs) -> None

Events: session.updated (Session configuration updated).

Updates the session configuration. After establishing the connection, the server returns default configurations. Call this method immediately to override the defaults. The server validates the parameters and returns an error if they are invalid. For parameter settings, see the "Request parameters" section.

def append_audio(self, audio_b64: str) -> None

Events: none.

Appends Base64-encoded audio to the cloud input buffer (temporary storage until commit).

  • If turn_detection is enabled, the audio buffer is used for voice detection, and the server decides when to commit.

  • If turn_detection is disabled, the client can choose how much audio to place in each event, up to a maximum of 15 MiB. For example, streaming smaller blocks of data from the client can make VAD more responsive.
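The append_audio call expects Base64-encoded audio. The sketch below chunks raw PCM bytes into fixed-size blocks and Base64-encodes each one; the helper name and the 100 ms chunk size are illustrative, not part of the SDK:

```python
import base64

# 100 ms of 16 kHz mono 16-bit PCM: 16000 samples/s * 2 bytes * 0.1 s = 3200 bytes
CHUNK_BYTES = 3200

def pcm_to_base64_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> list:
    """Split raw PCM audio into chunks and Base64-encode each chunk.

    Each returned string could then be passed to a call like
    conversation.append_audio(audio_b64=chunk)."""
    return [
        base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii")
        for i in range(0, len(pcm), chunk_bytes)
    ]

# 0.5 s of silence at 16 kHz mono 16-bit PCM -> five 100 ms chunks
silence = b"\x00" * 16000
print(len(pcm_to_base64_chunks(silence)))  # 5
```

Smaller chunks sent more frequently give server-side VAD more opportunities to detect the start and end of speech.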

def append_video(self, video_b64: str) -> None

Events: none.

Adds Base64-encoded image data to the cloud video buffer (from a local file or a real-time stream).

The following limits apply to image input:

  • The image format must be JPG or JPEG. The recommended image resolution is 480p or 720p, with a maximum of 1080p.

  • The size of a single image cannot exceed 500 KB before Base64 encoding.

  • The image data must be Base64-encoded.

  • Send images to the server at a frequency of 1 image per second.
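The size limit above can be checked client-side before sending. A minimal sketch (the helper name is made up; the 500 KB threshold comes from the limits listed above):

```python
import base64

MAX_IMAGE_BYTES = 500 * 1024  # 500 KB limit, measured before Base64 encoding

def encode_image_for_append(image_bytes: bytes) -> str:
    """Validate the size limit and Base64-encode a JPG/JPEG image payload."""
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError(
            f"image is {len(image_bytes)} bytes; must be <= {MAX_IMAGE_BYTES} "
            "bytes before Base64 encoding"
        )
    return base64.b64encode(image_bytes).decode("ascii")
```

Per the frequency guidance above, send one encoded image per second, for example from a timer in the capture loop.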

def clear_appended_audio(self) -> None

Events: input_audio_buffer.cleared (the server cleared the received audio).

Clears the audio from the cloud buffer.

def commit(self) -> None

Events: input_audio_buffer.committed (the server received the committed audio).

Commits the audio and video added via the append methods. Returns an error if the buffer is empty.

  • If turn_detection is enabled, the client does not need to send this event; the server automatically commits the audio buffer.

  • If turn_detection is disabled, the client must commit the audio buffer to create a user message item.

Note ⚠️:

  1. If audio transcription is configured for the session using input_audio_transcription, the system transcribes the audio.

  2. Committing the input audio buffer does not create a response from the model.

def create_response(self,
        instructions: str = None,
        output_modalities: list[MultiModality] = None) -> None

Events:

  • response.created: Server starts generating a response

  • response.output_item.added: New output content is available in the response

  • conversation.item.created: Conversation item created

  • response.content_part.added: New output content added to the assistant message item

  • response.audio_transcript.delta: Incrementally generated transcribed text

  • response.audio.delta: Incrementally generated audio from the model

  • response.audio_transcript.done: Text transcription completed

  • response.audio.done: Audio generation completed

  • response.content_part.done: Streaming of text or audio content for the assistant message is complete

  • response.output_item.done: Streaming of the entire output item for the assistant message is complete

  • response.done: Response completed

Instructs the server to create a model response. This happens automatically when turn_detection is enabled.

def cancel_response(self) -> None

Events: none.

Cancels the in-progress response (returns an error if none exists).

def close(self) -> None

Events: none.

Terminates the task and closes the connection.

def get_session_id(self) -> str

Events: none.

Gets the session_id of the current task.

def get_last_response_id(self) -> str

Events: none.

Gets the response_id of the last response.

def get_last_first_text_delay(self)

Events: none.

Gets the first-packet text latency of the last response.

def get_last_first_audio_delay(self)

Events: none.

Gets the first-packet audio latency of the last response.

Callback interface (OmniRealtimeCallback)

The server returns events and data via callbacks. Implement callback methods to handle server responses.

Import the interface using the statement from dashscope.audio.qwen_omni import OmniRealtimeCallback.

def on_open(self) -> None

Parameters: none. Return value: None.

Called immediately after the connection is established.

def on_event(self, message: str) -> None

Parameters: message (a server response event). Return value: None.

Handles interface call responses and model-generated text/audio. See Server events.

def on_close(self, close_status_code, close_msg) -> None

Parameters: close_status_code (the status code for closing the WebSocket) and close_msg (the closing message for the WebSocket). Return value: None.

Called after the connection is closed.
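A minimal callback implementation might look like the following. To keep the snippet self-contained and runnable without the SDK installed, it defines a stand-in base class; in real code you would subclass OmniRealtimeCallback imported from dashscope.audio.qwen_omni instead. The event payload shape assumed here (a JSON string with "type" and "delta" fields) is inferred from the event names listed above, not confirmed by this document:

```python
import json

# Stand-in base class so this sketch runs without the SDK installed; in real
# code, subclass OmniRealtimeCallback from dashscope.audio.qwen_omni.
class OmniRealtimeCallback:
    def on_open(self) -> None: ...
    def on_event(self, message: str) -> None: ...
    def on_close(self, close_status_code, close_msg) -> None: ...

class MyCallback(OmniRealtimeCallback):
    def __init__(self) -> None:
        self.transcript_parts = []

    def on_open(self) -> None:
        print("connection established")

    def on_event(self, message: str) -> None:
        # Assumed payload shape: each event is a JSON string whose "type"
        # field matches the server event names listed above, and delta
        # events carry their increment in a "delta" field.
        event = json.loads(message)
        if event.get("type") == "response.audio_transcript.delta":
            self.transcript_parts.append(event.get("delta", ""))
        elif event.get("type") == "response.done":
            print("transcript:", "".join(self.transcript_parts))

    def on_close(self, close_status_code, close_msg) -> None:
        print("closed:", close_status_code, close_msg)
```

The callback instance is passed to the OmniRealtimeConversation constructor via its callback parameter, as described in the "Request parameters" section.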

FAQ

Q: How do I align input audio and images?

Audio serves as the input timeline. Images are inserted into the audio stream when sent, at any point on the timeline.

In real-time interaction, you can enable or disable video input at any time.

Q: What is the recommended frequency for inputting images and audio?

For real-time interaction: send images at 1-2 fps and audio in 100 ms packets.
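As a quick arithmetic check on these rates: at the supported input format of 16 kHz mono 16-bit PCM, a 100 ms packet holds 16000 × 2 × 0.1 = 3200 bytes of raw audio. The helper below is illustrative only, not SDK code:

```python
SAMPLE_RATE_HZ = 16000   # PCM_16000HZ_MONO_16BIT input format
BYTES_PER_SAMPLE = 2     # 16-bit mono
PACKET_MS = 100          # recommended audio packet duration

def packet_bytes(sample_rate_hz: int, bytes_per_sample: int, packet_ms: int) -> int:
    """Raw PCM bytes in one audio packet of the given duration."""
    return sample_rate_hz * bytes_per_sample * packet_ms // 1000

print(packet_bytes(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, PACKET_MS))  # 3200
```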

Q: What are the differences between the two modes of the turn_detection switch?

Currently, only server_vad mode is supported when turn_detection is enabled:

  • If turn_detection is enabled:

    • Input state: Cloud-based VAD detects the end of a sentence in the input audio, then automatically triggers model inference and returns a text/audio response.

    • Response state: Audio and video input continues uninterrupted while the model responds. After the response completes, the session returns to the input state.

    • Interruption: User speech during the model's response triggers an interruption, immediately stopping the response and switching back to the input state.

  • If turn_detection is disabled:

    • Manually determine when the input turn ends, then trigger model inference via commit and create_response.

    • During the model's response state, you must stop the audio and video input. You can resume input for the next turn only after the model has finished responding.

    • You must use the cancel_response method to interrupt the model's response.

Note that when turn_detection is enabled, you can still actively trigger a response using commit and create_response, and actively interrupt it using cancel_response.

Q: Why do I need to select another model for input_audio_transcription?

Qwen-Omni generates responses to input, not direct audio transcriptions. Integrate a separate ASR model for transcription.