Qwen-Omni-Realtime is a real-time audio and video chat model developed by Qwen. It processes streaming audio and image inputs, such as continuous image frames extracted from a video stream, and provides high-quality text and audio outputs in real time.
How to use
1. Establish a connection
The Qwen-Omni-Realtime model is accessed over the WebSocket protocol. You can establish a connection with the following native WebSocket sample code (Python) or with the DashScope SDK.
Native WebSocket connection
The connection requires the following configuration items:
Configuration item | Description |
Endpoint | China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime; International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
Query parameter | The query parameter is model. Set it to the name of the model that you want to access, for example: model=qwen3-omni-flash-realtime |
Request header | Use a Bearer token for authentication: Authorization: Bearer DASHSCOPE_API_KEY, where DASHSCOPE_API_KEY is the API key that you obtained from Model Studio. |
# pip install websocket-client
import json
import websocket
import os
API_KEY=os.getenv("DASHSCOPE_API_KEY")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-omni-flash-realtime"
headers = [
"Authorization: Bearer " + API_KEY
]
def on_open(ws):
print(f"Connected to server: {API_URL}")
def on_message(ws, message):
data = json.loads(message)
print("Received event:", json.dumps(data, indent=2))
def on_error(ws, error):
print("Error:", error)
ws = websocket.WebSocketApp(
API_URL,
header=headers,
on_open=on_open,
on_message=on_message,
on_error=on_error
)
ws.run_forever()
DashScope SDK
# SDK version 1.23.9 or later
import os
import json
from dashscope.audio.qwen_omni import OmniRealtimeConversation,OmniRealtimeCallback
import dashscope
# The API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured an API key, change the following line to dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
class PrintCallback(OmniRealtimeCallback):
def on_open(self) -> None:
print("Connected Successfully")
def on_event(self, response: dict) -> None:
print("Received event:")
print(json.dumps(response, indent=2, ensure_ascii=False))
def on_close(self, close_status_code: int, close_msg: str) -> None:
print(f"Connection closed (code={close_status_code}, msg={close_msg}).")
callback = PrintCallback()
conversation = OmniRealtimeConversation(
model="qwen3-omni-flash-realtime",
callback=callback,
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
)
try:
conversation.connect()
print("Conversation started. Press Ctrl+C to exit.")
conversation.thread.join()
except KeyboardInterrupt:
    conversation.close()
// SDK version 2.20.9 or later
import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import java.util.concurrent.CountDownLatch;
public class Main {
public static void main(String[] args) throws InterruptedException, NoApiKeyException {
CountDownLatch latch = new CountDownLatch(1);
OmniRealtimeParam param = OmniRealtimeParam.builder()
.model("qwen3-omni-flash-realtime")
.apikey(System.getenv("DASHSCOPE_API_KEY"))
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
.url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
.build();
OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
@Override
public void onOpen() {
System.out.println("Connected Successfully");
}
@Override
public void onEvent(JsonObject message) {
System.out.println(message);
}
@Override
public void onClose(int code, String reason) {
System.out.println("connection closed code: " + code + ", reason: " + reason);
latch.countDown();
}
});
conversation.connect();
latch.await();
conversation.close(1000, "bye");
System.exit(0);
}
}
2. Configure the session
To configure the session parameters, send the client event session.update:
{
// The event ID, generated by the client.
"event_id": "event_ToPZqeobitzUJnt3QqtWg",
// The event type. This is fixed to session.update.
"type": "session.update",
// Session configuration.
"session": {
// The output modalities. Supported values are ["text"] (text only) or ["text","audio"] (text and audio).
"modalities": [
"text",
"audio"
],
// The voice for the audio output.
"voice": "Cherry",
// The input audio format. Only pcm16 is supported.
"input_audio_format": "pcm16",
// The output audio format. Only pcm24 is supported.
"output_audio_format": "pcm24",
// The system message, used to set the model's goal or role.
"instructions": "You are an AI customer service specialist for a five-star hotel. Please answer customer inquiries about room types, facilities, prices, and booking policies accurately and in a friendly manner. Always respond with a professional and helpful attitude, and do not provide unverified information or information beyond the scope of the hotel's services.",
// Specifies whether to enable voice activity detection. To enable it, pass a configuration object. The server will automatically detect the start and end of speech based on this object.
// Set to null to let the client decide when to trigger a model response.
"turn_detection": {
// The VAD type. Must be set to server_vad.
"type": "server_vad",
// The VAD detection threshold. Increase this value in noisy environments and decrease it in quiet environments.
"threshold": 0.5,
// The duration of silence that indicates the end of speech. The model response is triggered after this duration is exceeded.
"silence_duration_ms": 800
}
}
}
3. Input audio and images
The client sends Base64-encoded audio and image data to the server buffer using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required, and image input is optional.
Images can be from local files or captured in real time from a video stream.
When server-side Voice Activity Detection (VAD) is enabled, the server automatically submits the data and triggers a response after it detects the end of speech. When VAD is disabled (manual mode), the client must call the input_audio_buffer.commit event to submit the data after the data is sent.
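The following is a minimal sketch of these append events. It assumes an already-open websocket-client connection ws from step 1; the file names are illustrative, and the audio and image payload field names are assumptions, so check the API reference for the exact event schema.
# Minimal sketch: append Base64-encoded audio and image data to the server buffers.
# Assumes `ws` is the open websocket-client connection from step 1; file names are illustrative.
import base64
import json

with open("audio_chunk.pcm", "rb") as f:  # 16-bit, 16 kHz mono PCM (pcm16)
    audio_b64 = base64.b64encode(f.read()).decode()
ws.send(json.dumps({
    "type": "input_audio_buffer.append",
    "audio": audio_b64  # payload field name assumed
}))

with open("frame.jpg", "rb") as f:  # optional image, for example a frame captured from a video stream
    image_b64 = base64.b64encode(f.read()).decode()
ws.send(json.dumps({
    "type": "input_image_buffer.append",
    "image": image_b64  # payload field name assumed
}))

# In manual mode (VAD disabled), commit the buffers once the turn's input is complete:
# ws.send(json.dumps({"type": "input_audio_buffer.commit"}))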
4. Receive model responses
The format of the model response depends on the configured output modalities.
Text only
You can receive streaming text through the response.text.delta event and retrieve the full text with the response.text.done event.
Text and audio
Text: You can receive streaming text through the response.audio_transcript.delta event and retrieve the full text with the response.audio_transcript.done event.
Audio: You can retrieve Base64-encoded streaming audio output data through the response.audio.delta event. The response.audio.done event marks the completion of audio data generation.
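The following is a minimal sketch of dispatching these events in an on_message handler for the native WebSocket example from step 1. It assumes that the streamed content of each delta event is carried in a delta field; check the API reference for the exact payload schema.
# Minimal sketch: handle the response events for both output modalities.
import base64
import json

def on_message(ws, message):
    event = json.loads(message)
    event_type = event.get("type")
    if event_type == "response.text.delta":
        # Text-only output: streamed text fragments.
        print(event.get("delta", ""), end="", flush=True)
    elif event_type == "response.audio_transcript.delta":
        # Text-and-audio output: streamed transcript of the audio.
        print(event.get("delta", ""), end="", flush=True)
    elif event_type == "response.audio.delta":
        # Text-and-audio output: Base64-encoded streamed audio.
        pcm_chunk = base64.b64decode(event.get("delta", ""))
        # Feed pcm_chunk to an audio player (24 kHz, 16-bit mono PCM).
    elif event_type in ("response.text.done", "response.audio_transcript.done"):
        # These events carry the full text of the response.
        print()
    elif event_type == "response.audio.done":
        # Audio generation for this response is complete.
        pass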
Model list
Qwen3-Omni-Flash-Realtime is the latest real-time multimodal model from Qwen. Compared to the previous generation model, Qwen-Omni-Turbo-Realtime, which will no longer be updated, Qwen3-Omni-Flash-Realtime offers the following improvements:
Supported languages
The number of supported languages has increased to 10. The supported languages are Chinese (Mandarin and various major dialects such as Shanghainese, Cantonese, and Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, and Korean. Qwen-Omni-Turbo-Realtime supports only Chinese (Mandarin) and English.
Supported voices
The number of supported voices has increased to 17. Qwen-Omni-Turbo-Realtime supports only four voices. For more information, see the voice list.
International (Singapore)
Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Free quota |
qwen3-omni-flash-realtime (equivalent to qwen3-omni-flash-realtime-2025-09-15) | Stable | 65,536 | 49,152 | 16,384 | 1 million tokens each, regardless of modality. Valid for 90 days after you activate Model Studio. |
qwen3-omni-flash-realtime-2025-09-15 | Snapshot | 65,536 | 49,152 | 16,384 | Same as the stable version |
China (Beijing)
Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Free quota |
qwen3-omni-flash-realtime (equivalent to qwen3-omni-flash-realtime-2025-09-15) | Stable | 65,536 | 49,152 | 16,384 | No free quota |
qwen3-omni-flash-realtime-2025-09-15 | Snapshot | 65,536 | 49,152 | 16,384 | No free quota |
Getting Started
Before you begin, you must obtain an API key and configure it as an environment variable.
You can choose a programming language that you are familiar with and follow these steps to quickly try the real-time conversation feature with the Qwen-Omni-Realtime model.
DashScope Python SDK
Prepare the runtime environment
Your Python version must be 3.10 or later.
First, install PyAudio for your operating system.
macOS
brew install portaudio && pip install pyaudio
Debian/Ubuntu
If you are not using a virtual environment, you can install PyAudio directly through the system package manager:
sudo apt-get install python3-pyaudio
If you are using a virtual environment, you must first install the compilation dependencies:
sudo apt update
sudo apt install -y python3-dev portaudio19-dev
Then, in the activated virtual environment, install the required dependencies using pip:
pip install pyaudio
CentOS
sudo yum install -y portaudio portaudio-devel && pip install pyaudio
Windows
pip install pyaudio
After the installation is complete, you can install the dependencies using pip:
pip install websocket-client dashscope
Choose an interaction mode
VAD mode (automatic detection of speech start and end)
The server automatically determines when the user starts and stops speaking and then responds.
Manual mode (press to talk, release to send)
The client controls the start and end of speech. After the user finishes speaking, the client must send a message to the server.
VAD mode
Create a new Python file named vad_dash.py and copy the following code into the file:
You can run vad_dash.py to have a real-time conversation with the Qwen-Omni-Realtime model through your microphone. The system detects the start and end of your audio and automatically sends the audio to the server without manual intervention.
Manual mode
Create a new Python file named manual_dash.py and copy the following code into the file:
You can run manual_dash.py, press Enter to start speaking, and press Enter again to retrieve the model's audio response.
DashScope Java SDK
Choose an interaction mode
VAD mode (automatic detection of speech start and end)
The Realtime API automatically determines when the user starts and stops speaking and then responds.
Manual mode (press to talk, release to send)
The client controls the start and end of speech. After the user finishes speaking, the client must send a message to the server.
VAD mode
You can run the OmniServerVad.main() method to have a real-time conversation with the Qwen-Omni-Realtime model through your microphone. The system detects the start and end of your audio and automatically sends the audio to the server without manual intervention.
Manual mode
You can run the OmniWithoutServerVad.main() method. Press Enter to start recording, and press Enter again during recording to stop recording and send the audio. The model response is then received and played.
WebSocket (Python)
Prepare the runtime environment
Your Python version must be 3.10 or later.
First, install PyAudio for your operating system.
macOS
brew install portaudio && pip install pyaudio
Debian/Ubuntu
sudo apt-get install python3-pyaudio or pip install pyaudio
We recommend that you use pip install pyaudio. If the installation fails, first install the portaudio dependency for your operating system.
CentOS
sudo yum install -y portaudio portaudio-devel && pip install pyaudio
Windows
pip install pyaudio
After the installation is complete, you can install the WebSocket-related dependencies using pip:
pip install websockets==15.0.1
Create a client
Create a new Python file named omni_realtime_client.py and copy the following code into the file:
Choose an interaction mode
VAD mode (automatic detection of speech start and end)
The Realtime API automatically determines when the user starts and stops speaking and then responds.
Manual mode (press to talk, release to send)
The client controls the start and end of speech. After the user finishes speaking, the client must send a message to the server.
VAD mode
In the same directory as omni_realtime_client.py, create another Python file named vad_mode.py and copy the following code into the file:
You can run vad_mode.py to have a real-time conversation with the Qwen-Omni-Realtime model through your microphone. The system detects the start and end of your audio and automatically sends the audio to the server without manual intervention.
Manual mode
In the same directory as omni_realtime_client.py, create another Python file named manual_mode.py and copy the following code into the file:
You can run manual_mode.py, press Enter to start speaking, and press Enter again to retrieve the model's audio response.
Interaction flow
VAD mode
To enable VAD mode, set the session.turn_detection parameter of the session.update event to a configuration object whose type is server_vad. In this mode, the server automatically detects the start and end of speech and responds accordingly. This mode is suitable for voice call scenarios. A minimal end-to-end sketch of this flow is shown after the lifecycle table below.
The interaction flow is as follows:
The server detects the start of speech and sends the input_audio_buffer.speech_started event.
The client can send input_audio_buffer.append and input_image_buffer.append events at any time to add audio and images to the buffer.
Before you can send an input_image_buffer.append event, you must first send an input_audio_buffer.append event.
The server detects the end of speech and sends the input_audio_buffer.speech_stopped event.
The server sends the input_audio_buffer.committed event to commit the audio buffer.
The server sends the conversation.item.created event, which contains the user message item created from the buffer.
Lifecycle | Client events | Server events |
Session initialization | session.update (configure the session) | Session created; session configuration updated |
User audio input | input_audio_buffer.append (add audio to the buffer); input_image_buffer.append (add an image to the buffer) | input_audio_buffer.speech_started (speech start detected); input_audio_buffer.speech_stopped (speech end detected); input_audio_buffer.committed (server received the submitted audio) |
Server audio output | None | Server starts generating a response; new output content in the response; conversation.item.created (conversation item created); new output content added to the assistant message; response.audio_transcript.delta (incrementally generated transcribed text); response.audio.delta (incrementally generated audio from the model); response.audio_transcript.done (text transcription complete); response.audio.done (audio generation complete); streaming of text or audio content for the assistant message is complete; streaming of the entire output item for the assistant message is complete; response complete |
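Putting the flow above together, a compact VAD-mode client can be built directly on these events. The following sketch uses websocket-client and pyaudio with the endpoint, model name, and audio parameters from the earlier examples; the audio and delta payload field names are assumptions rather than confirmed schema.
# Minimal VAD-mode sketch: stream microphone audio to the server and play the audio it returns.
# pip install websocket-client pyaudio
import base64
import json
import os
import threading

import pyaudio
import websocket

API_KEY = os.getenv("DASHSCOPE_API_KEY")
# Singapore endpoint; for the Beijing region, use wss://dashscope.aliyuncs.com/api-ws/v1/realtime
URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-omni-flash-realtime"

pa = pyaudio.PyAudio()
# 16 kHz, 16-bit mono input matches pcm16; 24 kHz output matches pcm24.
mic = pa.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1600)
speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

def on_open(ws):
    # Configure the session with server-side VAD, then start streaming microphone audio.
    ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": "Cherry",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm24",
            "turn_detection": {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 800}
        }
    }))
    def stream_microphone():
        while ws.keep_running:
            chunk = mic.read(1600, exception_on_overflow=False)  # roughly 100 ms of audio
            ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode()  # payload field name assumed
            }))
    threading.Thread(target=stream_microphone, daemon=True).start()

def on_message(ws, message):
    event = json.loads(message)
    event_type = event.get("type")
    if event_type == "input_audio_buffer.speech_started":
        print("\n[speech started]")
    elif event_type == "response.audio_transcript.delta":
        print(event.get("delta", ""), end="", flush=True)        # streamed transcript
    elif event_type == "response.audio.delta":
        speaker.write(base64.b64decode(event.get("delta", "")))  # streamed audio playback

ws = websocket.WebSocketApp(
    URL,
    header=["Authorization: Bearer " + API_KEY],
    on_open=on_open,
    on_message=on_message
)
ws.run_forever()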
Manual mode
You can set the session.turn_detection parameter of the session.update event to null to enable manual mode. In this mode, the client requests a server response by explicitly sending the input_audio_buffer.commit and response.create events. This mode is suitable for push-to-talk scenarios, such as sending voice messages in a chat application.
The interaction flow is as follows:
The client can send input_audio_buffer.append and input_image_buffer.append events at any time to add audio and images to the buffer.
Before sending an input_image_buffer.append event, you must send at least one input_audio_buffer.append event.
The client sends the input_audio_buffer.commit event to commit the audio and image buffers, which informs the server that all user input for the current turn has been sent.
The server responds with the input_audio_buffer.committed event.
The client sends the response.create event and waits for the model's output from the server.
The server responds with the conversation.item.created event.
Lifecycle | Client events | Server events |
Session initialization | session.update (configure the session) | Session created; session configuration updated |
User audio input | input_audio_buffer.append (add audio to the buffer); input_image_buffer.append (add an image to the buffer); input_audio_buffer.commit (submit audio and images to the server); response.create (create a model response) | input_audio_buffer.committed (server received the submitted audio) |
Server audio output | Clear the audio from the buffer | Server starts generating a response; new output content in the response; conversation.item.created (conversation item created); new output content added to the assistant message item; response.audio_transcript.delta (incrementally generated transcribed text); response.audio.delta (incrementally generated audio from the model); response.audio_transcript.done (text transcription complete); response.audio.done (audio generation complete); streaming of text or audio content for the assistant message is complete; streaming of the entire output item for the assistant message is complete; response complete |
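The manual-mode sequence above can be sketched as follows, again against an open websocket-client connection ws from step 1. The session is configured once with turn_detection set to null, and each turn ends with an explicit commit followed by response.create.
# Minimal manual-mode sketch: disable server-side VAD and end each turn explicitly.
import json

# Configure the session once with VAD disabled; the client decides when a turn ends.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],
        "voice": "Cherry",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        "turn_detection": None  # serialized as null
    }
}))

def finish_user_turn(ws):
    # Tell the server that all audio and image input for this turn has been sent.
    ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
    # Ask the model to generate a response for the committed input.
    ws.send(json.dumps({"type": "response.create"}))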
API reference
Billing and throttling
Billing rules
The Qwen-Omni-Realtime model is billed based on the number of tokens that correspond to different modalities, such as audio and image. For more information about billing, see the model list.
Throttling
For more information about model throttling rules, see Throttling.
Error codes
If a call fails, see Error messages for troubleshooting.
Voice list
Qwen3-Omni-Flash-Realtime
Name | voice parameter | Description | Supported languages |
Cherry | Cherry | A cheerful, friendly, and natural young woman's voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Ethan | Ethan | Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Nofish | Nofish | A designer who does not use retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Jennifer | Jennifer | A premium, cinematic American English female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Ryan | Ryan | A rhythmic, dramatic voice with realism and tension. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Katerina | Katerina | A mature and rhythmic female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Elias | Elias | Explains complex topics with academic rigor and clear storytelling. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Shanghai-Jada | Jada | A lively woman from Shanghai. | Chinese (Shanghainese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Beijing-Dylan | Dylan | A teenager who grew up in the hutongs of Beijing. | Chinese (Beijing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Sichuan-Sunny | Sunny | A sweet female voice from Sichuan. | Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Nanjing-Li | Li | A patient yoga teacher. | Chinese (Nanjing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Shaanxi-Marcus | Marcus | A sincere and deep voice from Shaanxi. | Chinese (Shaanxi dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Man Nan-Roy | Roy | A humorous and lively young male voice with a Minnan accent. | Chinese (Min Nan), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Tianjin-Peter | Peter | A voice for the straight man in Tianjin crosstalk. | Chinese (Tianjin dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Cantonese-Rocky | Rocky | A witty and humorous male voice for online chats. | Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai | |
Sichuan-Eric | Eric | An unconventional and refined male voice from Chengdu, Sichuan. | Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Qwen-Omni-Turbo-Realtime
Name | voice parameter | Description | Supported languages |
Cherry | Cherry | A sunny, friendly, and genuine young woman. | Chinese, English | |
Serena | Serena | Kind young woman. | Chinese, English | |
Ethan | Ethan | Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice. | Chinese, English | |
Chelsie | Chelsie | An anime-style virtual girlfriend voice. | Chinese, English |