
Alibaba Cloud Model Studio: Qwen-Omni-Realtime

Last Updated: Mar 31, 2026

Qwen-Omni-Realtime is a real-time audio and video chat model. It processes streaming audio and image inputs, such as continuous image frames extracted from a video stream, and generates text and audio output in real time.

Supported regions: Singapore and Beijing. API keys differ by region; use the key for the region you call.

How to use

1. Establish a connection

Connect to Qwen-Omni-Realtime over WebSocket, using either a native WebSocket client or the DashScope SDK.

Note

A single WebSocket session lasts up to 120 minutes before closing automatically.
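Because a session closes automatically after 120 minutes, a long-running client needs to reconnect. A minimal sketch, assuming the websocket-client setup shown below; the `make_ws` factory and the retry policy are illustrative, not part of the API:

```python
import time

def run_with_reconnect(make_ws, max_retries=None, delay=3):
    """Keep a realtime session alive across the 120-minute limit.

    make_ws is a caller-supplied factory (hypothetical helper, not an
    SDK API) that returns a fresh WebSocketApp. run_forever() returns
    whenever the server closes the connection, so we open a new one.
    """
    attempts = 0
    while max_retries is None or attempts < max_retries:
        ws = make_ws()
        ws.run_forever()      # blocks until the connection closes
        attempts += 1
        time.sleep(delay)     # brief back-off before reconnecting
```

Note that conversation history is not carried over to the new connection; re-send the session configuration after each reconnect.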

Native WebSocket connection

You need the following configuration items:

Configuration item

Description

Endpoint

China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime

International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime

Query parameter

Set the model query parameter to the model name, for example ?model=qwen3.5-omni-plus-realtime.

Request header

Authenticate with Bearer Token: Authorization: Bearer DASHSCOPE_API_KEY

DASHSCOPE_API_KEY is your API key from Model Studio.

# pip install websocket-client
import json
import websocket
import os

API_KEY = os.getenv("DASHSCOPE_API_KEY")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-omni-plus-realtime"

headers = [
    "Authorization: Bearer " + API_KEY
]

def on_open(ws):
    print(f"Connected to server: {API_URL}")
def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))
def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)

ws.run_forever()

DashScope SDK

Python

# DashScope Python SDK version 1.23.9 or later
import os
import json
from dashscope.audio.qwen_omni import OmniRealtimeConversation, OmniRealtimeCallback
import dashscope
# The API keys for Singapore and Beijing differ. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured an API key, replace the next line with dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

class PrintCallback(OmniRealtimeCallback):
    def on_open(self) -> None:
        print("Connected Successfully")
    def on_event(self, response: dict) -> None:
        print("Received event:")
        print(json.dumps(response, indent=2, ensure_ascii=False))
    def on_close(self, close_status_code: int, close_msg: str) -> None:
        print(f"Connection closed (code={close_status_code}, msg={close_msg}).")

callback = PrintCallback()
conversation = OmniRealtimeConversation(
    model="qwen3.5-omni-plus-realtime",
    callback=callback,
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
)
try:
    conversation.connect()
    print("Conversation started. Press Ctrl+C to exit.")
    conversation.thread.join()
except KeyboardInterrupt:
    conversation.close()
Java

// DashScope Java SDK version 2.20.9 or later
import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import java.util.concurrent.CountDownLatch;

public class Main {
    public static void main(String[] args) throws InterruptedException, NoApiKeyException {
        CountDownLatch latch = new CountDownLatch(1);
        OmniRealtimeParam param = OmniRealtimeParam.builder()
                .model("qwen3.5-omni-plus-realtime")
                .apikey(System.getenv("DASHSCOPE_API_KEY"))
                // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                .build();

        OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
            @Override
            public void onOpen() {
                System.out.println("Connected Successfully");
            }
            @Override
            public void onEvent(JsonObject message) {
                System.out.println(message);
            }
            @Override
            public void onClose(int code, String reason) {
                System.out.println("connection closed code: " + code + ", reason: " + reason);
                latch.countDown();
            }
        });
        conversation.connect();
        latch.await();
        conversation.close(1000, "bye");
        System.exit(0);
    }
}

2. Configure the session

Send the client event session.update:

{
    // The ID of this event, generated by the client.
    "event_id": "event_ToPZqeobitzUJnt3QqtWg",
    // The event type. This is fixed to session.update.
    "type": "session.update",
    // Session configuration.
    "session": {
        // Output modalities. Supported values are ["text"] (text only) or ["text","audio"] (text and audio).
        "modalities": [
            "text",
            "audio"
        ],
        // Voice for output audio.
        "voice": "Cherry",
        // Input audio format. Only pcm is supported.
        "input_audio_format": "pcm",
        // Output audio format. Only pcm is supported.
        "output_audio_format": "pcm",
        // System message. Sets the model's goal or role.
        "instructions": "You are an AI customer service agent for a five-star hotel. Answer customer inquiries about room types, facilities, prices, and booking policies accurately and friendly. Always respond with a professional and helpful attitude. Do not provide unconfirmed information or information beyond the scope of the hotel's services.",
        // Server-side voice activity detection. Pass a configuration object to have the server detect speech start/end automatically.
        // Set to null to let the client decide when to trigger a model response.
        "turn_detection": {
            // VAD type. Must be set to server_vad.
            "type": "server_vad",
            // VAD detection threshold. Increase it in noisy environments and decrease it in quiet ones.
            "threshold": 0.5,
            // Duration of silence (in ms) that marks the end of speech. When silence exceeds this value, the model responds.
            "silence_duration_ms": 800
        }
    }
}
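With the native-WebSocket connection, this event is built as an ordinary dict and sent with `ws.send(json.dumps(...))` once the socket opens. A minimal sketch; the helper function itself is illustrative, not part of the SDK, and the field names mirror the JSON example above:

```python
import uuid

def build_session_update(voice="Cherry", modalities=("text", "audio"),
                         instructions="", vad=True):
    """Build a session.update event matching the JSON example above.

    Hypothetical helper for illustration only. Pass vad=False to set
    turn_detection to null (manual mode).
    """
    session = {
        "modalities": list(modalities),
        "voice": voice,
        "input_audio_format": "pcm",
        "output_audio_format": "pcm",
        "instructions": instructions,
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 800,
        } if vad else None,
    }
    return {
        "event_id": "event_" + uuid.uuid4().hex,  # client-generated ID
        "type": "session.update",
        "session": session,
    }

# In the native-WebSocket example above, send it when the socket opens:
# def on_open(ws):
#     ws.send(json.dumps(build_session_update(instructions="You are a helpful assistant.")))
```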

3. Input audio and images

Send Base64-encoded audio (required) and image (optional) data to the server buffer using the input_audio_buffer.append and input_image_buffer.append events.

Images can come from local files or be captured in real time from a video stream.

When server-side VAD is enabled, the server automatically commits the buffered data and triggers a response when speech ends. When VAD is disabled (manual mode), the client must send the input_audio_buffer.commit event to submit the data.
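The append/commit flow can be sketched as plain JSON events. The `audio` field name follows the audio examples later in this guide; the `image` field name for input_image_buffer.append is an assumption for illustration:

```python
import base64

def audio_append_event(pcm_bytes: bytes) -> dict:
    """Wrap raw 16 kHz/16-bit mono PCM in an input_audio_buffer.append event."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    }

def image_append_event(image_bytes: bytes) -> dict:
    """Wrap an image frame (e.g. a JPEG extracted from a video stream)
    in an input_image_buffer.append event. Field name assumed."""
    return {
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }

def commit_event() -> dict:
    """In manual mode, commit the buffered audio to trigger processing."""
    return {"type": "input_audio_buffer.commit"}

# Usage with the native-WebSocket connection:
# ws.send(json.dumps(audio_append_event(chunk)))
# ws.send(json.dumps(commit_event()))   # manual mode only
```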

4. Receive model responses

The model response format depends on the configured output modalities.
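A client typically dispatches on the event type. A minimal sketch: the response.audio.delta and response.audio_transcript.delta names appear in the callback examples later in this guide; response.text.delta for text-only output is an assumption for illustration:

```python
import base64

def handle_event(event: dict, play_audio=None):
    """Route a server event by type and return a (kind, payload) tuple.

    play_audio is an optional caller-supplied callback (hypothetical)
    that receives decoded PCM bytes for playback.
    """
    t = event.get("type")
    if t == "response.text.delta":              # text-only modality (assumed name)
        return ("text", event["delta"])
    if t == "response.audio_transcript.delta":  # transcript of spoken output
        return ("transcript", event["delta"])
    if t == "response.audio.delta":             # Base64-encoded PCM audio
        pcm = base64.b64decode(event["delta"])
        if play_audio:
            play_audio(pcm)
        return ("audio", pcm)
    if t == "response.done":                    # current turn is complete
        return ("done", None)
    return ("other", event)
```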

Model selection

Note

Qwen3.5-Omni-Realtime is in preview. Model invocation is temporarily free, but tool calling still incurs fees. See Billing details.

Qwen3.5-Omni-Realtime is the latest real-time multimodal model. Compared to Qwen3-Omni-Flash-Realtime, it offers:

  • Intelligence level

    Significantly improved intelligence, matching Qwen3.5-Plus.

  • Web search

    Supports web search natively. The model autonomously decides whether to search. See Web search.

  • Semantic interruption

    Automatically identifies conversation intent to avoid interruptions from filler sounds and meaningless background noise.

  • Voice control

    Control volume, speaking rate, and emotion using voice commands such as "speak faster," "speak louder," or "speak cheerfully."

  • Languages supported

    Supports speech recognition in 113 languages and dialects, and speech synthesis in 36 languages and dialects.

  • Voice options

    Offers 55 voices (47 multilingual + 8 dialect-specific). See Voice list.

See Model list for model names, context, pricing, and snapshot versions. For concurrency throttling, see Rate limits.

Getting started

Get an API key and set it as an environment variable.

Select a programming language and follow these steps to start a real-time conversation.

DashScope Python SDK

  • Prepare the runtime environment

Your Python version must be 3.10 or later.

First, install pyaudio based on your operating system.

macOS

brew install portaudio && pip install pyaudio

Debian/Ubuntu

  • If you are not using a virtual environment, install directly using the system package manager:

    sudo apt-get install python3-pyaudio
  • If you are using a virtual environment, first install compilation dependencies:

    sudo apt update
    sudo apt install -y python3-dev portaudio19-dev

    Then install pyaudio using pip in the virtual environment.

    pip install pyaudio

CentOS

sudo yum install -y portaudio portaudio-devel && pip install pyaudio

Windows

pip install pyaudio

Then install the remaining Python dependencies with pip:

pip install websocket-client dashscope
  • Select an interaction mode

    • VAD mode (Voice Activity Detection, automatic speech start/end detection)

      The server automatically detects when the user starts and stops speaking and responds.

    • Manual mode (press-to-talk, release-to-send)

      The client controls speech timing. After the user finishes speaking, the client sends a message to the server.

    VAD mode

    Create a file named vad_dash.py with the following code:

    vad_dash.py

    # Dependencies: dashscope >= 1.23.9, pyaudio
    import os
    import base64
    import time
    import pyaudio
    from dashscope.audio.qwen_omni import MultiModality, AudioFormat, OmniRealtimeCallback, OmniRealtimeConversation
    import dashscope
    
    # Configuration parameters: endpoint, API key, voice, model, model role
    # Specify the region. Set to 'intl' for International (Singapore) or 'cn' for China (Beijing).
    region = 'intl'
    base_domain = 'dashscope-intl.aliyuncs.com' if region == 'intl' else 'dashscope.aliyuncs.com'
    url = f'wss://{base_domain}/api-ws/v1/realtime'
    # Configure the API key. If you have not set an environment variable, replace the next line with dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.getenv('DASHSCOPE_API_KEY')
    # Specify the voice
    voice = 'Cherry'
    # Specify the model
    model = 'qwen3.5-omni-plus-realtime'
    # Specify the model role
    instructions = "You are Xiaoyun, a personal assistant. Please answer the user's questions in a humorous and witty way."
    class SimpleCallback(OmniRealtimeCallback):
        def __init__(self, pya):
            self.pya = pya
            self.out = None
        def on_open(self):
            # Initialize the audio output stream
            self.out = self.pya.open(
                format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True
            )
        def on_event(self, response):
            if response['type'] == 'response.audio.delta':
                # Play the audio
                self.out.write(base64.b64decode(response['delta']))
            elif response['type'] == 'conversation.item.input_audio_transcription.completed':
                # Print the transcribed text
                print(f"[User] {response['transcript']}")
            elif response['type'] == 'response.audio_transcript.done':
                # Print the assistant's reply text
                print(f"[LLM] {response['transcript']}")
    
    # 1. Initialize the audio device
    pya = pyaudio.PyAudio()
    # 2. Create the callback function and session
    callback = SimpleCallback(pya)
    conv = OmniRealtimeConversation(model=model, callback=callback, url=url)
    # 3. Establish the connection and configure the session
    conv.connect()
    conv.update_session(output_modalities=[MultiModality.AUDIO, MultiModality.TEXT], voice=voice, instructions=instructions)
    # 4. Initialize the audio input stream
    mic = pya.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True)
    # 5. Main loop to process audio input
    print("Conversation started. Speak into the microphone (Ctrl+C to exit)...")
    try:
        while True:
            audio_data = mic.read(3200, exception_on_overflow=False)
            conv.append_audio(base64.b64encode(audio_data).decode())
            time.sleep(0.01)
    except KeyboardInterrupt:
        # Clean up resources
        conv.close()
        mic.close()
        callback.out.close()
        pya.terminate()
        print("\nConversation ended")

    Run vad_dash.py. The system detects speech start/end and responds automatically.

    Manual mode

    Create a file named manual_dash.py with the following code:

    manual_dash.py

    # Dependencies: dashscope >= 1.23.9, pyaudio.
    import os
    import base64
    import sys
    import threading
    import pyaudio
    from dashscope.audio.qwen_omni import *
    import dashscope
    
    # If you have not set an environment variable, replace the next line with your API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.getenv('DASHSCOPE_API_KEY')
    voice = 'Cherry'
    
    class MyCallback(OmniRealtimeCallback):
        """Minimal callback: Initializes the speaker upon connection and plays returned audio directly."""
        def __init__(self, ctx):
            super().__init__()
            self.ctx = ctx
    
        def on_open(self) -> None:
            # Initialize PyAudio and the speaker (24k/mono/16bit) after connection.
            print('connection opened')
            try:
                self.ctx['pya'] = pyaudio.PyAudio()
                self.ctx['out'] = self.ctx['pya'].open(
                    format=pyaudio.paInt16,
                    channels=1,
                    rate=24000,
                    output=True
                )
                print('audio output initialized')
            except Exception as e:
                print('[Error] audio init failed: {}'.format(e))
    
        def on_close(self, close_status_code, close_msg) -> None:
            print('connection closed with code: {}, msg: {}'.format(close_status_code, close_msg))
            sys.exit(0)
    
        def on_event(self, response: dict) -> None:
            try:
                t = response['type']
                handlers = {
                    'session.created': lambda r: print('start session: {}'.format(r['session']['id'])),
                    'conversation.item.input_audio_transcription.completed': lambda r: print('question: {}'.format(r['transcript'])),
                    'response.audio_transcript.delta': lambda r: print('llm text: {}'.format(r['delta'])),
                    'response.audio.delta': self._play_audio,
                    'response.done': self._response_done,
                }
                h = handlers.get(t)
                if h:
                    h(response)
            except Exception as e:
                print('[Error] {}'.format(e))
    
        def _play_audio(self, response):
            # Decode Base64 and write to output stream for playback.
            if self.ctx['out'] is None:
                return
            try:
                data = base64.b64decode(response['delta'])
                self.ctx['out'].write(data)
            except Exception as e:
                print('[Error] audio playback failed: {}'.format(e))
    
        def _response_done(self, response):
            # Mark the current turn complete for the main loop to wait.
            if self.ctx['conv'] is not None:
                print('[Metric] response: {}, first text delay: {}, first audio delay: {}'.format(
                    self.ctx['conv'].get_last_response_id(),
                    self.ctx['conv'].get_last_first_text_delay(),
                    self.ctx['conv'].get_last_first_audio_delay(),
                ))
            if self.ctx['resp_done'] is not None:
                self.ctx['resp_done'].set()
    
    def shutdown_ctx(ctx):
        """Safely release audio and PyAudio resources."""
        try:
            if ctx['out'] is not None:
                ctx['out'].close()
                ctx['out'] = None
        except Exception:
            pass
        try:
            if ctx['pya'] is not None:
                ctx['pya'].terminate()
                ctx['pya'] = None
        except Exception:
            pass
    
    
    def record_until_enter(pya_inst: pyaudio.PyAudio, sample_rate=16000, chunk_size=3200):
        """Press Enter to stop recording and return PCM bytes."""
        frames = []
        stop_evt = threading.Event()
    
        stream = pya_inst.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=sample_rate,
            input=True,
            frames_per_buffer=chunk_size
        )
    
        def _reader():
            while not stop_evt.is_set():
                try:
                    frames.append(stream.read(chunk_size, exception_on_overflow=False))
                except Exception:
                    break
    
        t = threading.Thread(target=_reader, daemon=True)
        t.start()
        input()  # User presses Enter again to stop recording.
        stop_evt.set()
        t.join(timeout=1.0)
        try:
            stream.close()
        except Exception:
            pass
        return b''.join(frames)
    
    
    if __name__ == '__main__':
        print('Initializing ...')
        # Runtime context: Stores audio and session handles.
        ctx = {'pya': None, 'out': None, 'conv': None, 'resp_done': threading.Event()}
        callback = MyCallback(ctx)
        conversation = OmniRealtimeConversation(
            model='qwen3.5-omni-plus-realtime',
            callback=callback,
            # The following is the URL for the International (Singapore) region. If you use a model in China (Beijing), replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
            url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
        )
        try:
            conversation.connect()
        except Exception as e:
            print('[Error] connect failed: {}'.format(e))
            sys.exit(1)
    
        ctx['conv'] = conversation
        # Session configuration: Enable text and audio output (disable server-side VAD, switch to manual recording).
        conversation.update_session(
            output_modalities=[MultiModality.AUDIO, MultiModality.TEXT],
            voice=voice,
            enable_input_audio_transcription=True,
            # The model for transcribing input audio. Only gummy-realtime-v1 is supported.
            input_audio_transcription_model='gummy-realtime-v1',
            enable_turn_detection=False,
            instructions="You are Xiaoyun, a personal assistant. Please answer the user's questions accurately and in a friendly manner, always responding with a helpful attitude."
        )
    
        try:
            turn = 1
            while True:
                print(f"\n--- Turn {turn} ---")
                print("Press Enter to start recording (enter q to exit)...")
                user_input = input()
                if user_input.strip().lower() in ['q', 'quit']:
                    print("User requested to exit...")
                    break
                print("Recording... Press Enter again to stop.")
                if ctx['pya'] is None:
                    ctx['pya'] = pyaudio.PyAudio()
                recorded = record_until_enter(ctx['pya'])
                if not recorded:
                    print("No valid audio was recorded. Please try again.")
                    continue
                print(f"Successfully recorded audio: {len(recorded)} bytes. Sending...")
    
                # Send in 3200-byte chunks (corresponding to 16k/16bit/100ms).
                chunk_size = 3200
                for i in range(0, len(recorded), chunk_size):
                    chunk = recorded[i:i+chunk_size]
                    conversation.append_audio(base64.b64encode(chunk).decode('ascii'))
    
                print("Sending complete. Waiting for model response...")
                ctx['resp_done'].clear()
                conversation.commit()
                conversation.create_response()
                ctx['resp_done'].wait()
                print('Audio playback complete.')
                turn += 1
        except KeyboardInterrupt:
            print("\nProgram interrupted by user.")
        finally:
            shutdown_ctx(ctx)
            print("Program exited.")

    Run manual_dash.py. Press Enter to speak, then press Enter again to receive the model's audio response.

DashScope Java SDK

Select an interaction mode

  • VAD mode (Voice Activity Detection, automatic speech start/end detection)

    The Realtime API automatically detects speech timing and responds.

  • Manual mode (press-to-talk, release-to-send)

    The client controls speech timing. After the user finishes speaking, the client sends a message to the server.

VAD mode

OmniServerVad.java

import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.*;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class OmniServerVad {
    static class SequentialAudioPlayer {
        private final SourceDataLine line;
        private final Queue<byte[]> audioQueue = new ConcurrentLinkedQueue<>();
        private final Thread playerThread;
        private final AtomicBoolean shouldStop = new AtomicBoolean(false);

        public SequentialAudioPlayer() throws LineUnavailableException {
            AudioFormat format = new AudioFormat(24000, 16, 1, true, false);
            line = AudioSystem.getSourceDataLine(format);
            line.open(format);
            line.start();

            playerThread = new Thread(() -> {
                while (!shouldStop.get()) {
                    byte[] audio = audioQueue.poll();
                    if (audio != null) {
                        line.write(audio, 0, audio.length);
                    } else {
                        try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                    }
                }
            }, "AudioPlayer");
            playerThread.start();
        }

        public void play(String base64Audio) {
            try {
                byte[] audio = Base64.getDecoder().decode(base64Audio);
                audioQueue.add(audio);
            } catch (Exception e) {
                System.err.println("Audio decoding failed: " + e.getMessage());
            }
        }

        public void cancel() {
            audioQueue.clear();
            line.flush();
        }

        public void close() {
            shouldStop.set(true);
            try { playerThread.join(1000); } catch (InterruptedException ignored) {}
            line.drain();
            line.close();
        }
    }

    public static void main(String[] args) {
        try {
            SequentialAudioPlayer player = new SequentialAudioPlayer();
            AtomicBoolean userIsSpeaking = new AtomicBoolean(false);
            AtomicBoolean shouldStop = new AtomicBoolean(false);

            OmniRealtimeParam param = OmniRealtimeParam.builder()
                    .model("qwen3.5-omni-plus-realtime")
                    .apikey(System.getenv("DASHSCOPE_API_KEY"))
                    // The following is the URL for the International (Singapore) region. If you use a model in China (Beijing), replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                    .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                    .build();

            OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
                @Override public void onOpen() {
                    System.out.println("Connection established");
                }
                @Override public void onClose(int code, String reason) {
                    System.out.println("Connection closed (" + code + "): " + reason);
                    shouldStop.set(true);
                }
                @Override public void onEvent(JsonObject event) {
                    handleEvent(event, player, userIsSpeaking);
                }
            });

            conversation.connect();
            conversation.updateSession(OmniRealtimeConfig.builder()
                    .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
                    .voice("Cherry")
                    .enableTurnDetection(true)
                    .enableInputAudioTranscription(true)
                    .parameters(Map.of("instructions",
                            "You are an AI customer service agent for a five-star hotel. Answer customer inquiries about room types, facilities, prices, and booking policies accurately and friendly. Always respond with a professional and helpful attitude. Do not provide unconfirmed information or information beyond the scope of the hotel's services."))
                    .build()
            );

            System.out.println("Please start speaking (automatic detection of speech start/end, press Ctrl+C to exit)...");
            AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
            TargetDataLine mic = AudioSystem.getTargetDataLine(format);
            mic.open(format);
            mic.start();

            ByteBuffer buffer = ByteBuffer.allocate(3200);
            while (!shouldStop.get()) {
                int bytesRead = mic.read(buffer.array(), 0, buffer.capacity());
                if (bytesRead > 0) {
                    try {
                        // Encode only the bytes actually read; the rest of the buffer may hold stale data.
                        conversation.appendAudio(Base64.getEncoder().encodeToString(Arrays.copyOf(buffer.array(), bytesRead)));
                    } catch (Exception e) {
                        if (e.getMessage() != null && e.getMessage().contains("closed")) {
                            System.out.println("Conversation closed. Stopping recording.");
                            break;
                        }
                    }
                }
                Thread.sleep(20);
            }

            conversation.close(1000, "Normal exit");
            player.close();
            mic.close();
            System.out.println("\nProgram exited.");

        } catch (NoApiKeyException e) {
            System.err.println("API KEY not found: Please set the DASHSCOPE_API_KEY environment variable.");
            System.exit(1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void handleEvent(JsonObject event, SequentialAudioPlayer player, AtomicBoolean userIsSpeaking) {
        String type = event.get("type").getAsString();
        switch (type) {
            case "input_audio_buffer.speech_started":
                System.out.println("\n[User started speaking]");
                player.cancel();
                userIsSpeaking.set(true);
                break;
            case "input_audio_buffer.speech_stopped":
                System.out.println("[User stopped speaking]");
                userIsSpeaking.set(false);
                break;
            case "response.audio.delta":
                if (!userIsSpeaking.get()) {
                    player.play(event.get("delta").getAsString());
                }
                break;
            case "conversation.item.input_audio_transcription.completed":
                System.out.println("User: " + event.get("transcript").getAsString());
                break;
            case "response.audio_transcript.delta":
                System.out.print(event.get("delta").getAsString());
                break;
            case "response.done":
                System.out.println("Response complete");
                break;
        }
    }
}

Run OmniServerVad.main(). The system detects speech start/end and responds automatically.

Manual mode

OmniWithoutServerVad.java

// DashScope Java SDK version 2.20.9 or later

import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.*;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class OmniWithoutServerVad {
    // RealtimePcmPlayer class definition starts
    public static class RealtimePcmPlayer {
        private int sampleRate;
        private SourceDataLine line;
        private AudioFormat audioFormat;
        private Thread decoderThread;
        private Thread playerThread;
        private AtomicBoolean stopped = new AtomicBoolean(false);
        private Queue<String> b64AudioBuffer = new ConcurrentLinkedQueue<>();
        private Queue<byte[]> RawAudioBuffer = new ConcurrentLinkedQueue<>();

        // Constructor initializes audio format and audio line.
        public RealtimePcmPlayer(int sampleRate) throws LineUnavailableException {
            this.sampleRate = sampleRate;
            this.audioFormat = new AudioFormat(this.sampleRate, 16, 1, true, false);
            DataLine.Info info = new DataLine.Info(SourceDataLine.class, audioFormat);
            line = (SourceDataLine) AudioSystem.getLine(info);
            line.open(audioFormat);
            line.start();
            decoderThread = new Thread(new Runnable() {
                @Override
                public void run() {
                    while (!stopped.get()) {
                        String b64Audio = b64AudioBuffer.poll();
                        if (b64Audio != null) {
                            byte[] rawAudio = Base64.getDecoder().decode(b64Audio);
                            RawAudioBuffer.add(rawAudio);
                        } else {
                            try {
                                Thread.sleep(100);
                            } catch (InterruptedException e) {
                                throw new RuntimeException(e);
                            }
                        }
                    }
                }
            });
            playerThread = new Thread(new Runnable() {
                @Override
                public void run() {
                    while (!stopped.get()) {
                        byte[] rawAudio = RawAudioBuffer.poll();
                        if (rawAudio != null) {
                            try {
                                playChunk(rawAudio);
                            } catch (IOException e) {
                                throw new RuntimeException(e);
                            } catch (InterruptedException e) {
                                throw new RuntimeException(e);
                            }
                        } else {
                            try {
                                Thread.sleep(100);
                            } catch (InterruptedException e) {
                                throw new RuntimeException(e);
                            }
                        }
                    }
                }
            });
            decoderThread.start();
            playerThread.start();
        }

        // Play an audio chunk and block until playback completes.
        private void playChunk(byte[] chunk) throws IOException, InterruptedException {
            if (chunk == null || chunk.length == 0) return;

            int bytesWritten = 0;
            while (bytesWritten < chunk.length) {
                bytesWritten += line.write(chunk, bytesWritten, chunk.length - bytesWritten);
            }
            int audioLengthMs = chunk.length / (this.sampleRate * 2 / 1000);
            // Wait for the buffered audio to finish playing; never sleep a negative duration.
            Thread.sleep(Math.max(0, audioLengthMs - 10));
        }

        public void write(String b64Audio) {
            b64AudioBuffer.add(b64Audio);
        }

        public void cancel() {
            b64AudioBuffer.clear();
            RawAudioBuffer.clear();
        }

        public void waitForComplete() throws InterruptedException {
            while (!b64AudioBuffer.isEmpty() || !RawAudioBuffer.isEmpty()) {
                Thread.sleep(100);
            }
            line.drain();
        }

        public void shutdown() throws InterruptedException {
            stopped.set(true);
            decoderThread.join();
            playerThread.join();
            if (line != null && line.isRunning()) {
                line.drain();
                line.close();
            }
        }
    } // RealtimePcmPlayer class definition ends
    // Add a recording method
    private static void recordAndSend(TargetDataLine line, OmniRealtimeConversation conversation) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[3200];
        AtomicBoolean stopRecording = new AtomicBoolean(false);

        // Start a thread to listen for the Enter key.
        Thread enterKeyListener = new Thread(() -> {
            try {
                System.in.read();
                stopRecording.set(true);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        enterKeyListener.start();

        // Recording loop
        while (!stopRecording.get()) {
            int count = line.read(buffer, 0, buffer.length);
            if (count > 0) {
                out.write(buffer, 0, count);
            }
        }

        // Send the recorded data.
        byte[] audioData = out.toByteArray();
        String audioB64 = Base64.getEncoder().encodeToString(audioData);
        conversation.appendAudio(audioB64);
        out.close();
    }

    public static void main(String[] args) throws InterruptedException, LineUnavailableException {
        OmniRealtimeParam param = OmniRealtimeParam.builder()
                .model("qwen3.5-omni-plus-realtime")
                // The API keys for Singapore and Beijing differ. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured an environment variable, replace the next line with your Model Studio API key: .apikey("sk-xxx")
                .apikey(System.getenv("DASHSCOPE_API_KEY"))
                //The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
                .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                .build();
        AtomicReference<CountDownLatch> responseDoneLatch = new AtomicReference<>(null);
        responseDoneLatch.set(new CountDownLatch(1));

        RealtimePcmPlayer audioPlayer = new RealtimePcmPlayer(24000);
        final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
        OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
            @Override
            public void onOpen() {
                System.out.println("connection opened");
            }
            @Override
            public void onEvent(JsonObject message) {
                String type = message.get("type").getAsString();
                switch(type) {
                    case "session.created":
                        System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
                        break;
                    case "conversation.item.input_audio_transcription.completed":
                        System.out.println("question: " + message.get("transcript").getAsString());
                        break;
                    case "response.audio_transcript.delta":
                        System.out.println("got llm response delta: " + message.get("delta").getAsString());
                        break;
                    case "response.audio.delta":
                        String recvAudioB64 = message.get("delta").getAsString();
                        audioPlayer.write(recvAudioB64);
                        break;
                    case "response.done":
                        System.out.println("======RESPONSE DONE======");
                        if (conversationRef.get() != null) {
                            System.out.println("[Metric] response: " + conversationRef.get().getResponseId() +
                                    ", first text delay: " + conversationRef.get().getFirstTextDelay() +
                                    " ms, first audio delay: " + conversationRef.get().getFirstAudioDelay() + " ms");
                        }
                        responseDoneLatch.get().countDown();
                        break;
                    default:
                        break;
                }
            }
            @Override
            public void onClose(int code, String reason) {
                System.out.println("connection closed code: " + code + ", reason: " + reason);
            }
        });
        conversationRef.set(conversation);
        try {
            conversation.connect();
        } catch (NoApiKeyException e) {
            throw new RuntimeException(e);
        }
        OmniRealtimeConfig config = OmniRealtimeConfig.builder()
                .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
                .voice("Cherry")
                .enableTurnDetection(false)
                // Set the model role.
                .parameters(new HashMap<String, Object>() {{
                    put("instructions","You are Xiaoyun, a personal assistant. Please answer the user's questions accurately and in a friendly manner, always with a helpful attitude.");
                }})
                .build();
        conversation.updateSession(config);

        // Add microphone recording functionality.
        AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

        if (!AudioSystem.isLineSupported(info)) {
            System.out.println("Line not supported");
            return;
        }

        TargetDataLine line = null;
        try {
            line = (TargetDataLine) AudioSystem.getLine(info);
            line.open(format);
            line.start();

            while (true) {
                System.out.println("Press Enter to start recording...");
                System.in.read();
                System.out.println("Recording started. Please speak... Press Enter again to stop recording and send.");
                recordAndSend(line, conversation);
                conversation.commit();
                conversation.createResponse(null, null);
                // Wait for the current response to finish before starting the next turn.
                responseDoneLatch.get().await();
                // Reset the latch for the next turn.
                responseDoneLatch.set(new CountDownLatch(1));
            }
        } catch (LineUnavailableException | IOException e) {
            e.printStackTrace();
        } finally {
            if (line != null) {
                line.stop();
                line.close();
            }
        }
    }
}

Run OmniWithoutServerVad.main(). Press Enter to start recording, then press Enter again to stop and send.
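The sleep duration in playChunk comes from basic PCM arithmetic: for 16-bit mono audio, one millisecond occupies sampleRate * 2 / 1000 bytes. A standalone sketch of that calculation (plain Python, no audio dependencies):

```python
def pcm_duration_ms(num_bytes: int, sample_rate: int,
                    channels: int = 1, sample_width: int = 2) -> float:
    """Duration in milliseconds of a raw PCM chunk (16-bit samples by default)."""
    bytes_per_second = sample_rate * channels * sample_width
    return num_bytes / bytes_per_second * 1000

# 48,000 bytes of 24 kHz, 16-bit mono PCM is exactly one second of audio.
print(pcm_duration_ms(48000, 24000))  # 1000.0
```

The same formula explains the recorder's 3200-byte buffers: at 16 kHz, 16-bit mono, each buffer holds 100 ms of audio.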

WebSocket (Python)

  • Prepare the runtime environment

    Your Python version must be 3.10 or later.

    First, install pyaudio based on your operating system.

    macOS

    brew install portaudio && pip install pyaudio

    Debian/Ubuntu

    sudo apt-get install python3-pyaudio

    or

    pip install pyaudio

    We recommend pip install pyaudio. If the installation fails, install the portaudio dependency for your OS first.

    CentOS

    sudo yum install -y portaudio portaudio-devel && pip install pyaudio

    Windows

    pip install pyaudio

    After installation, install WebSocket-related dependencies using pip:

    pip install websockets==15.0.1
  • Create the client

    Create a file named omni_realtime_client.py with the following code:

    omni_realtime_client.py

    import asyncio
    import websockets
    import json
    import base64
    import time
    from typing import Optional, Callable, List, Dict, Any
    from enum import Enum
    
    class TurnDetectionMode(Enum):
        SERVER_VAD = "server_vad"
        MANUAL = "manual"
    
    class OmniRealtimeClient:
    
        def __init__(
                self,
                base_url,
                api_key: str,
                model: str = "",
                voice: str = "Ethan",
                instructions: str = "You are a helpful assistant.",
                turn_detection_mode: TurnDetectionMode = TurnDetectionMode.SERVER_VAD,
                on_text_delta: Optional[Callable[[str], None]] = None,
                on_audio_delta: Optional[Callable[[bytes], None]] = None,
                on_input_transcript: Optional[Callable[[str], None]] = None,
                on_output_transcript: Optional[Callable[[str], None]] = None,
                extra_event_handlers: Optional[Dict[str, Callable[[Dict[str, Any]], None]]] = None
        ):
            self.base_url = base_url
            self.api_key = api_key
            self.model = model
            self.voice = voice
            self.instructions = instructions
            self.ws = None
            self.on_text_delta = on_text_delta
            self.on_audio_delta = on_audio_delta
            self.on_input_transcript = on_input_transcript
            self.on_output_transcript = on_output_transcript
            self.turn_detection_mode = turn_detection_mode
            self.extra_event_handlers = extra_event_handlers or {}
    
            # Current response status
            self._current_response_id = None
            self._current_item_id = None
            self._is_responding = False
            # Input/output transcript printing status
            self._print_input_transcript = True
            self._output_transcript_buffer = ""
    
        async def connect(self) -> None:
            """Establish a WebSocket connection with the Realtime API."""
            url = f"{self.base_url}?model={self.model}"
            headers = {
                "Authorization": f"Bearer {self.api_key}"
            }
            self.ws = await websockets.connect(url, additional_headers=headers)
    
            # Session configuration
            session_config = {
                "modalities": ["text", "audio"],
                "voice": self.voice,
                "instructions": self.instructions,
                "input_audio_format": "pcm",
                "output_audio_format": "pcm",
                "input_audio_transcription": {
                    "model": "gummy-realtime-v1"
                }
            }
    
            if self.turn_detection_mode == TurnDetectionMode.MANUAL:
                session_config['turn_detection'] = None
                await self.update_session(session_config)
            elif self.turn_detection_mode == TurnDetectionMode.SERVER_VAD:
                session_config['turn_detection'] = {
                    "type": "server_vad",
                    "threshold": 0.1,
                    "prefix_padding_ms": 500,
                    "silence_duration_ms": 900
                }
                await self.update_session(session_config)
            else:
                raise ValueError(f"Invalid turn detection mode: {self.turn_detection_mode}")
    
        async def send_event(self, event) -> None:
            event['event_id'] = "event_" + str(int(time.time() * 1000))
            await self.ws.send(json.dumps(event))
    
        async def update_session(self, config: Dict[str, Any]) -> None:
            """Update the session configuration."""
            event = {
                "type": "session.update",
                "session": config
            }
            await self.send_event(event)
    
        async def stream_audio(self, audio_chunk: bytes) -> None:
            """Stream raw audio data to the API."""
            # Only 16-bit, 16 kHz, mono PCM is supported.
            audio_b64 = base64.b64encode(audio_chunk).decode()
            append_event = {
                "type": "input_audio_buffer.append",
                "audio": audio_b64
            }
            await self.send_event(append_event)
    
        async def commit_audio_buffer(self) -> None:
            """Commit the audio buffer to trigger processing."""
            event = {
                "type": "input_audio_buffer.commit"
            }
            await self.send_event(event)
    
        async def append_image(self, image_chunk: bytes) -> None:
            """Append image data to the image buffer.
            Image data can come from local files or a real-time video stream.
            Note:
                - The image format must be JPG or JPEG. We recommend 480p or 720p resolution. Maximum supported resolution is 1080p.
                - A single image must not exceed 500 KB.
                - Encode image data to Base64 before sending.
                - We recommend sending images at no more than 1 frame per second.
                - You must send audio data at least once before sending image data.
            """
            image_b64 = base64.b64encode(image_chunk).decode()
            event = {
                "type": "input_image_buffer.append",
                "image": image_b64
            }
            await self.send_event(event)
    
        async def create_response(self) -> None:
            """Request the API to generate a response (only needed in manual mode)."""
            event = {
                "type": "response.create"
            }
            await self.send_event(event)
    
        async def cancel_response(self) -> None:
            """Cancel the current response."""
            event = {
                "type": "response.cancel"
            }
            await self.send_event(event)
    
        async def handle_interruption(self):
            """Handle user interruption of the current response."""
            if not self._is_responding:
                return
            # 1. Cancel the current response.
            if self._current_response_id:
                await self.cancel_response()
    
            self._is_responding = False
            self._current_response_id = None
            self._current_item_id = None
    
        async def handle_messages(self) -> None:
            try:
                async for message in self.ws:
                    event = json.loads(message)
                    event_type = event.get("type")
                    if event_type == "error":
                        print("Error:", event['error'])
                        continue
                    elif event_type == "response.created":
                        self._current_response_id = event.get("response", {}).get("id")
                        self._is_responding = True
                    elif event_type == "response.output_item.added":
                        self._current_item_id = event.get("item", {}).get("id")
                    elif event_type == "response.done":
                        self._is_responding = False
                        self._current_response_id = None
                        self._current_item_id = None
                    elif event_type == "input_audio_buffer.speech_started":
                        print("Speech start detected")
                        if self._is_responding:
                            print("Handling interruption")
                            await self.handle_interruption()
                    elif event_type == "input_audio_buffer.speech_stopped":
                        print("Speech end detected")
                    elif event_type == "response.text.delta":
                        if self.on_text_delta:
                            self.on_text_delta(event["delta"])
                    elif event_type == "response.audio.delta":
                        if self.on_audio_delta:
                            audio_bytes = base64.b64decode(event["delta"])
                            self.on_audio_delta(audio_bytes)
                    elif event_type == "conversation.item.input_audio_transcription.completed":
                        transcript = event.get("transcript", "")
                        print(f"User: {transcript}")
                        if self.on_input_transcript:
                            await asyncio.to_thread(self.on_input_transcript, transcript)
                            self._print_input_transcript = True
                    elif event_type == "response.audio_transcript.delta":
                        if self.on_output_transcript:
                            delta = event.get("delta", "")
                            if not self._print_input_transcript:
                                self._output_transcript_buffer += delta
                            else:
                                if self._output_transcript_buffer:
                                    await asyncio.to_thread(self.on_output_transcript, self._output_transcript_buffer)
                                    self._output_transcript_buffer = ""
                                await asyncio.to_thread(self.on_output_transcript, delta)
                    elif event_type == "response.audio_transcript.done":
                        print(f"LLM: {event.get('transcript', '')}")
                        self._print_input_transcript = False
                    elif event_type in self.extra_event_handlers:
                        self.extra_event_handlers[event_type](event)
            except websockets.exceptions.ConnectionClosed:
                print("Connection closed")
            except Exception as e:
                print("Error in message handling:", str(e))

        async def close(self) -> None:
            """Close the WebSocket connection."""
            if self.ws:
                await self.ws.close()
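    The image constraints documented in append_image can be enforced client-side before a frame goes on the wire. A minimal sketch, assuming frame_bytes is an in-memory JPEG (the magic-byte check and 500 KB limit mirror the notes above; resolution is not checked here):

```python
import base64

MAX_IMAGE_BYTES = 500 * 1024  # per-image limit noted in append_image

def build_image_event(frame_bytes: bytes) -> dict:
    """Validate a JPEG frame and wrap it in an input_image_buffer.append event."""
    if not frame_bytes.startswith(b"\xff\xd8\xff"):
        raise ValueError("frame must be JPG/JPEG")
    if len(frame_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("frame exceeds 500 KB")
    return {
        "type": "input_image_buffer.append",
        "image": base64.b64encode(frame_bytes).decode(),
    }
```

    When streaming video, also throttle to at most one frame per second and append at least one audio chunk before the first image.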
  • Select an interaction mode

    • VAD mode (Voice Activity Detection, automatic speech start/end detection)

      The Realtime API automatically detects speech timing and responds.

    • Manual mode (press-to-talk, release-to-send)

      The client controls speech timing. After the user finishes speaking, the client sends a message to the server.
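    Under manual mode, one turn is a fixed event sequence: one or more input_audio_buffer.append events, then input_audio_buffer.commit, then response.create (these are the same events the client above sends from stream_audio, commit_audio_buffer, and create_response). A connection-free sketch that builds that sequence from raw PCM chunks:

```python
import base64

def manual_turn_events(pcm_chunks):
    """Build the ordered event list for one manual-mode turn."""
    events = [
        {"type": "input_audio_buffer.append",
         "audio": base64.b64encode(chunk).decode()}
        for chunk in pcm_chunks
    ]
    events.append({"type": "input_audio_buffer.commit"})
    events.append({"type": "response.create"})
    return events
```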

    VAD mode

    In the same directory as omni_realtime_client.py, create vad_mode.py with the following code:

    vad_mode.py

    # -*- coding: utf-8 -*-
    import os, asyncio, pyaudio, queue, threading
    from omni_realtime_client import OmniRealtimeClient, TurnDetectionMode
    
    # Audio player class (handles interruptions)
    class AudioPlayer:
        def __init__(self, pyaudio_instance, rate=24000):
            self.stream = pyaudio_instance.open(format=pyaudio.paInt16, channels=1, rate=rate, output=True)
            self.queue = queue.Queue()
            self.stop_evt = threading.Event()
            self.interrupt_evt = threading.Event()
            threading.Thread(target=self._run, daemon=True).start()
    
        def _run(self):
            while not self.stop_evt.is_set():
                try:
                    data = self.queue.get(timeout=0.5)
                    if data is None: break
                    if not self.interrupt_evt.is_set(): self.stream.write(data)
                    self.queue.task_done()
                except queue.Empty: continue
    
        def add_audio(self, data): self.queue.put(data)
        def handle_interrupt(self): self.interrupt_evt.set(); self.queue.queue.clear()
        def stop(self): self.stop_evt.set(); self.queue.put(None); self.stream.stop_stream(); self.stream.close()
    
    # Record from microphone and send
    async def record_and_send(client):
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=3200)
        print("Recording started. Please speak...")
        try:
            while True:
                audio_data = stream.read(3200)
                await client.stream_audio(audio_data)
                await asyncio.sleep(0.02)
        finally:
            stream.stop_stream(); stream.close(); p.terminate()
    
    async def main():
        p = pyaudio.PyAudio()
        player = AudioPlayer(pyaudio_instance=p)
    
        client = OmniRealtimeClient(
            # The following is the base_url for the International (Singapore) region. The base_url for the China (Beijing) region is wss://dashscope.aliyuncs.com/api-ws/v1/realtime
            base_url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
            api_key=os.environ.get("DASHSCOPE_API_KEY"),
            model="qwen3.5-omni-plus-realtime",
            voice="Cherry",
            instructions="You are Xiaoyun, a witty and humorous assistant.",
            turn_detection_mode=TurnDetectionMode.SERVER_VAD,
            on_text_delta=lambda t: print(f"\nAssistant: {t}", end="", flush=True),
            on_audio_delta=player.add_audio,
        )
    
        await client.connect()
        print("Connection successful. Starting real-time conversation...")
    
        # Run concurrently
        await asyncio.gather(client.handle_messages(), record_and_send(client))
    
    if __name__ == "__main__":
        try:
            asyncio.run(main())
        except KeyboardInterrupt:
            print("\nProgram exited.")

    Run vad_mode.py. The system detects speech start/end and responds automatically.
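    Detection sensitivity in VAD mode is controlled by the turn_detection block that connect() sends in session.update. A sketch that builds that payload with the values used above, so the individual parameters can be tuned:

```python
def vad_session_update(threshold=0.1, prefix_padding_ms=500, silence_duration_ms=900):
    """session.update payload enabling server-side voice activity detection."""
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,                      # speech-detection sensitivity
                "prefix_padding_ms": prefix_padding_ms,      # audio kept before detected speech
                "silence_duration_ms": silence_duration_ms,  # silence that ends a turn
            }
        },
    }
```

    Raising silence_duration_ms makes the model wait longer before treating a pause as end of turn; raising threshold makes speech detection less sensitive to background noise.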

    Manual mode

    In the same directory as omni_realtime_client.py, create manual_mode.py with the following code:

    manual_mode.py

    # -*- coding: utf-8 -*-
    import os
    import asyncio
    import time
    import threading
    import queue
    import pyaudio
    from omni_realtime_client import OmniRealtimeClient, TurnDetectionMode
    
    
    class AudioPlayer:
        """Real-time audio player class"""
    
        def __init__(self, sample_rate=24000, channels=1, sample_width=2):
            self.sample_rate = sample_rate
            self.channels = channels
            self.sample_width = sample_width  # 2 bytes for 16-bit
            self.audio_queue = queue.Queue()
            self.is_playing = False
            self.play_thread = None
            self.pyaudio_instance = None
            self.stream = None
            self._lock = threading.Lock()  # Add a lock for synchronized access
            self._last_data_time = time.time()  # Record the time the last data was received
            self._response_done = False  # Add a flag to indicate response completion
            self._waiting_for_response = False  # Flag to indicate if waiting for a server response
            # Record the time the last data was written to the audio stream and the duration of the most recent audio chunk for more accurate playback end detection
            self._last_play_time = time.time()
            self._last_chunk_duration = 0.0
    
        def start(self):
            """Start the audio player"""
            with self._lock:
                if self.is_playing:
                    return
    
                self.is_playing = True
    
                try:
                    self.pyaudio_instance = pyaudio.PyAudio()
    
                    # Create an audio output stream
                    self.stream = self.pyaudio_instance.open(
                        format=pyaudio.paInt16,  # 16-bit
                        channels=self.channels,
                        rate=self.sample_rate,
                        output=True,
                        frames_per_buffer=1024
                    )
    
                    # Start the playback thread
                    self.play_thread = threading.Thread(target=self._play_audio)
                    self.play_thread.daemon = True
                    self.play_thread.start()
    
                    print("Audio player started")
                except Exception as e:
                    print(f"Failed to start audio player: {e}")
                    self._cleanup_resources()
                    raise
    
        def stop(self):
            """Stop the audio player"""
            with self._lock:
                if not self.is_playing:
                    return
    
                self.is_playing = False
    
            # Clear the queue
            while not self.audio_queue.empty():
                try:
                    self.audio_queue.get_nowait()
                except queue.Empty:
                    break
    
            # Wait for the playback thread to finish (wait outside the lock to avoid deadlock)
            if self.play_thread and self.play_thread.is_alive():
                self.play_thread.join(timeout=2.0)
    
            # Acquire the lock again to clean up resources
            with self._lock:
                self._cleanup_resources()
    
            print("Audio player stopped")
    
        def _cleanup_resources(self):
            """Clean up audio resources (must be called within the lock)"""
            try:
                # Close the audio stream
                if self.stream:
                    if not self.stream.is_stopped():
                        self.stream.stop_stream()
                    self.stream.close()
                    self.stream = None
            except Exception as e:
                print(f"Error closing audio stream: {e}")
    
            try:
                if self.pyaudio_instance:
                    self.pyaudio_instance.terminate()
                    self.pyaudio_instance = None
            except Exception as e:
                print(f"Error terminating PyAudio: {e}")
    
        def add_audio_data(self, audio_data):
            """Add audio data to the playback queue"""
            if self.is_playing and audio_data:
                self.audio_queue.put(audio_data)
                with self._lock:
                    self._last_data_time = time.time()  # Update the time the last data was received
                    self._waiting_for_response = False  # Data received, no longer waiting
    
        def stop_receiving_data(self):
            """Mark that no more new audio data will be received"""
            with self._lock:
                self._response_done = True
                self._waiting_for_response = False  # Response ended, no longer waiting
    
        def prepare_for_next_turn(self):
            """Reset the player state for the next conversation turn."""
            with self._lock:
                self._response_done = False
                self._last_data_time = time.time()
                self._last_play_time = time.time()
                self._last_chunk_duration = 0.0
                self._waiting_for_response = True  # Start waiting for the next response
    
            # Clear any remaining audio data from the previous turn
            while not self.audio_queue.empty():
                try:
                    self.audio_queue.get_nowait()
                except queue.Empty:
                    break
    
        def is_finished_playing(self):
            """Check if all audio data has been played"""
            with self._lock:
                queue_size = self.audio_queue.qsize()
                time_since_last_data = time.time() - self._last_data_time
                time_since_last_play = time.time() - self._last_play_time
    
                # ---------------------- Smart end detection ----------------------
                # 1. Preferred: If the server has marked completion and the playback queue is empty.
                #    Wait for the most recent audio chunk to finish playing (chunk duration + 0.1s tolerance).
                if self._response_done and queue_size == 0:
                    min_wait = max(self._last_chunk_duration + 0.1, 0.5)  # Wait at least 0.5s
                    if time_since_last_play >= min_wait:
                        return True
    
                # 2. Fallback: If no new data has been received for a long time and the playback queue is empty.
                #    This logic serves as a safeguard if the server does not explicitly send `response.done`.
                if not self._waiting_for_response and queue_size == 0 and time_since_last_data > 1.0:
                    print("\n(No new audio received for a while, assuming playback is finished)")
                    return True
    
                return False
    
        def _play_audio(self):
            """Worker thread for playing audio data"""
            while True:
                # Check if it should stop
                with self._lock:
                    if not self.is_playing:
                        break
                    stream_ref = self.stream  # Get a reference to the stream
    
                try:
                    # Get audio data from the queue, with a timeout of 0.1 seconds
                    audio_data = self.audio_queue.get(timeout=0.1)
    
                    # Check the status and stream validity again
                    with self._lock:
                        if self.is_playing and stream_ref and not stream_ref.is_stopped():
                            try:
                                # Play the audio data
                                stream_ref.write(audio_data)
                                # Update the latest playback information
                                self._last_play_time = time.time()
                                self._last_chunk_duration = len(audio_data) / (
                                            self.channels * self.sample_width) / self.sample_rate
                            except Exception as e:
                                print(f"Error writing to audio stream: {e}")
                                break
    
                    # Mark this data block as processed
                    self.audio_queue.task_done()
    
                except queue.Empty:
                    # Continue waiting if the queue is empty
                    continue
                except Exception as e:
                    print(f"Error playing audio: {e}")
                    break
    
    
    class MicrophoneRecorder:
        """Real-time microphone recorder"""
    
        def __init__(self, sample_rate=16000, channels=1, chunk_size=3200):
            self.sample_rate = sample_rate
            self.channels = channels
            self.chunk_size = chunk_size
            self.pyaudio_instance = None
            self.stream = None
            self.frames = []
            self._is_recording = False
            self._record_thread = None
    
        def _recording_thread(self):
            """Recording worker thread"""
            # Continuously read data from the audio stream while _is_recording is True
            while self._is_recording:
                try:
                    # Use exception_on_overflow=False to avoid crashing due to buffer overflow
                    data = self.stream.read(self.chunk_size, exception_on_overflow=False)
                    self.frames.append(data)
                except (IOError, OSError) as e:
                    # Reading from the stream might raise an error when it's closed
                    print(f"Error reading from recording stream, it might be closed: {e}")
                    break
    
        def start(self):
            """Start recording"""
            if self._is_recording:
                print("Recording is already in progress.")
                return
    
            self.frames = []
            self._is_recording = True
    
            try:
                self.pyaudio_instance = pyaudio.PyAudio()
                self.stream = self.pyaudio_instance.open(
                    format=pyaudio.paInt16,
                    channels=self.channels,
                    rate=self.sample_rate,
                    input=True,
                    frames_per_buffer=self.chunk_size
                )
    
                self._record_thread = threading.Thread(target=self._recording_thread)
                self._record_thread.daemon = True
                self._record_thread.start()
                print("Microphone recording started...")
            except Exception as e:
                print(f"Failed to start microphone: {e}")
                self._is_recording = False
                self._cleanup()
                raise
    
        def stop(self):
            """Stop recording and return the audio data"""
            if not self._is_recording:
                return None
    
            self._is_recording = False
    
            # Wait for the recording thread to exit safely
            if self._record_thread:
                self._record_thread.join(timeout=1.0)
    
            self._cleanup()
    
            print("Microphone recording stopped.")
            return b''.join(self.frames)
    
        def _cleanup(self):
            """Safely clean up PyAudio resources"""
            if self.stream:
                try:
                    if self.stream.is_active():
                        self.stream.stop_stream()
                    self.stream.close()
                except Exception as e:
                    print(f"Error closing audio stream: {e}")
    
            if self.pyaudio_instance:
                try:
                    self.pyaudio_instance.terminate()
                except Exception as e:
                    print(f"Error terminating PyAudio instance: {e}")
    
            self.stream = None
            self.pyaudio_instance = None
    
    
    async def interactive_test():
        """
        Interactive test script: Allows for multi-turn conversations, with audio and images sent in each turn.
        """
        # ------------------- 1. Initialization and connection (one-time) -------------------
        # The API keys for Singapore and Beijing differ. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        api_key = os.environ.get("DASHSCOPE_API_KEY")
        if not api_key:
            print("Please set the DASHSCOPE_API_KEY environment variable.")
            return
    
        print("--- Real-time Multimodal Audio/Video Chat Client ---")
        print("Initializing audio player and client...")
    
        audio_player = AudioPlayer()
        audio_player.start()
    
        def on_audio_received(audio_data):
            audio_player.add_audio_data(audio_data)
    
        def on_response_done(event):
            print("\n(Received response end marker)")
            audio_player.stop_receiving_data()
    
        realtime_client = OmniRealtimeClient(
            # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with wss://dashscope.aliyuncs.com/api-ws/v1/realtime
            base_url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
            api_key=api_key,
            model="qwen3.5-omni-plus-realtime",
            voice="Ethan",
            instructions="You are Xiaoyun, a personal assistant. Please answer the user's questions accurately and in a friendly manner, always responding with a helpful attitude.", # Set the model role
            on_text_delta=lambda text: print(f"Assistant reply: {text}", end="", flush=True),
            on_audio_delta=on_audio_received,
            turn_detection_mode=TurnDetectionMode.MANUAL,
            extra_event_handlers={"response.done": on_response_done}
        )
    
        message_handler_task = None
        try:
            await realtime_client.connect()
            print("Connected to the server. Enter 'q' or 'quit' to exit at any time.")
            message_handler_task = asyncio.create_task(realtime_client.handle_messages())
            await asyncio.sleep(0.5)
    
            turn_counter = 1
            # ------------------- 2. Multi-turn conversation loop -------------------
            while True:
                print(f"\n--- Turn {turn_counter} ---")
                audio_player.prepare_for_next_turn()
    
                recorded_audio = None
                image_paths = []
    
                # --- Get user input: Record from microphone ---
                loop = asyncio.get_event_loop()
                recorder = MicrophoneRecorder(sample_rate=16000)  # 16k sample rate is recommended for speech recognition
    
                print("Ready to record. Press Enter to start recording (or enter 'q' to exit)...")
                user_input = await loop.run_in_executor(None, input)
                if user_input.strip().lower() in ['q', 'quit']:
                    print("User requested to exit...")
                    return
    
                try:
                    recorder.start()
                except Exception:
                    print("Could not start recording. Please check your microphone permissions and device. Skipping this turn.")
                    continue
    
                print("Recording... Press Enter again to stop.")
                await loop.run_in_executor(None, input)
    
                recorded_audio = recorder.stop()
    
                if not recorded_audio or len(recorded_audio) == 0:
                    print("No valid audio was recorded. Please start this turn again.")
                    continue
    
                # --- Get image input (optional) ---
                # The image input feature below is commented out and temporarily disabled. To enable it, uncomment the code below.
                # print("\nEnter the absolute path of an [image file] on each line (optional). When finished, enter 's' or press Enter to send the request.")
                # while True:
                #     path = input("Image path: ").strip()
                #     if path.lower() == 's' or path == '':
                #         break
                #     if path.lower() in ['q', 'quit']:
                #         print("User requested to exit...")
                #         return
                #
                #     if not os.path.isabs(path):
                #         print("Error: Please enter an absolute path.")
                #         continue
                #     if not os.path.exists(path):
                #         print(f"Error: File not found -> {path}")
                #         continue
                #     image_paths.append(path)
                #     print(f"Image added: {os.path.basename(path)}")
    
                # --- 3. Send data and get response ---
                print("\n--- Input Confirmation ---")
                print(f"Audio to process: 1 (from microphone), Images: {len(image_paths)}")
                print("------------------")
    
                # 3.1 Send the recorded audio
                try:
                    print(f"Sending microphone recording ({len(recorded_audio)} bytes)")
                    await realtime_client.stream_audio(recorded_audio)
                    await asyncio.sleep(0.1)
                except Exception as e:
                    print(f"Failed to send microphone recording: {e}")
                    continue
    
                # 3.2 Send all image files
                # The image sending code below is commented out and temporarily disabled.
                # for i, path in enumerate(image_paths):
                #     try:
                #         with open(path, "rb") as f:
                #             data = f.read()
                #         print(f"Sending image {i+1}: {os.path.basename(path)} ({len(data)} bytes)")
                #         await realtime_client.append_image(data)
                #         await asyncio.sleep(0.1)
                #     except Exception as e:
                #         print(f"Failed to send image {os.path.basename(path)}: {e}")
    
                # 3.3 Submit and wait for response
                print("Submitting all inputs, requesting server response...")
                await realtime_client.commit_audio_buffer()
                await realtime_client.create_response()
    
                print("Waiting for and playing server response audio...")
                start_time = time.time()
                max_wait_time = 60
                while not audio_player.is_finished_playing():
                    if time.time() - start_time > max_wait_time:
                        print(f"\nWait timed out ({max_wait_time} seconds). Moving to the next turn.")
                        break
                    await asyncio.sleep(0.2)
    
                print("\nAudio playback for this turn is complete!")
                turn_counter += 1
    
        except (asyncio.CancelledError, KeyboardInterrupt):
            print("\nProgram was interrupted.")
        except Exception as e:
            print(f"An unhandled error occurred: {e}")
        finally:
            # ------------------- 4. Clean up resources -------------------
            print("\nClosing connection and cleaning up resources...")
            if message_handler_task and not message_handler_task.done():
                message_handler_task.cancel()
    
            # ws.close is a method and therefore always truthy; check the closed state instead
            if 'realtime_client' in locals() and realtime_client.ws and not realtime_client.ws.closed:
                await realtime_client.close()
                print("Connection closed.")
    
            audio_player.stop()
            print("Program exited.")
    
    
    if __name__ == "__main__":
        try:
            asyncio.run(interactive_test())
        except KeyboardInterrupt:
            print("\nProgram was forcibly exited by the user.")

    Run manual_mode.py. Press Enter to speak, then press Enter again to receive the model's audio response.

Interaction flow

VAD mode

Set session.turn_detection to "server_vad" in session.update to enable VAD mode. The server automatically detects speech start/end and responds. Suitable for voice calls.

The interaction flow is as follows:

  1. The server detects speech start and sends the input_audio_buffer.speech_started event.

  2. The client can send input_audio_buffer.append and input_image_buffer.append events at any time to append audio and images to the buffer.

    Before sending an input_image_buffer.append event, you must send at least one input_audio_buffer.append event.
  3. The server detects speech end and sends the input_audio_buffer.speech_stopped event.

  4. The server sends the input_audio_buffer.committed event to commit the audio buffer.

  5. The server sends a conversation.item.created event containing the user message item created from the buffer.

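The mode switch that precedes this flow can be sketched as a session.update payload. This is a minimal sketch: the turn_detection value follows the prose above, and the other session fields mirror the WebSocket examples in this topic; your application may need additional fields.

```python
import json

# Sketch of a session.update event that enables VAD mode: turn_detection
# is set to "server_vad" so the server detects speech start/end itself.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],
        "input_audio_format": "pcm",
        "output_audio_format": "pcm",
        "turn_detection": "server_vad",
    },
}
payload = json.dumps(session_update)
print(payload)  # in practice: ws.send(payload)
```
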
Lifecycle

Session initialization
  Client events:
    session.update: Session configuration
  Server events:
    session.created: Session created
    session.updated: Session configuration updated

User audio input
  Client events:
    input_audio_buffer.append: Add audio to the buffer
    input_image_buffer.append: Add an image to the buffer
  Server events:
    input_audio_buffer.speech_started: Speech start detected
    input_audio_buffer.speech_stopped: Speech end detected
    input_audio_buffer.committed: Server received the submitted audio

Server audio output
  Client events: None
  Server events:
    response.created: Server starts generating a response
    response.output_item.added: New output content during response
    conversation.item.created: Conversation item created
    response.content_part.added: New output content added to the assistant message
    response.audio_transcript.delta: Incrementally generated transcribed text
    response.audio.delta: Incrementally generated audio from the model
    response.audio_transcript.done: Text transcription complete
    response.audio.done: Audio generation complete
    response.content_part.done: Streaming of text or audio content for the assistant message is complete
    response.output_item.done: Streaming of the entire output item for the assistant message is complete
    response.done: Response complete

Manual mode

Set session.turn_detection to null in session.update to enable Manual mode. The client explicitly sends input_audio_buffer.commit and response.create to request a response. Suitable for push-to-talk scenarios, such as voice messages in chat applications.

The interaction flow is as follows:

  1. The client can send input_audio_buffer.append and input_image_buffer.append events at any time to append audio and images to the buffer.

    Before sending an input_image_buffer.append event, you must send at least one input_audio_buffer.append event.
  2. The client sends the input_audio_buffer.commit event to submit the audio and image buffers, signaling to the server that all user input (audio and images) for the current turn has been sent.

  3. The server responds with an input_audio_buffer.committed event.

  4. The client sends the response.create event, waiting for the server to return the model's output.

  5. The server responds with a conversation.item.created event.

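One manual-mode turn can be sketched as a sequence of client events. This is a minimal sketch following steps 1, 2, and 4 above; the audio bytes are a placeholder for real microphone input, and optional input_image_buffer.append events are omitted.

```python
import base64
import json

# Placeholder for raw 16-bit PCM audio captured from a microphone:
# 100 ms of silence at 16 kHz, mono.
pcm_chunk = b"\x00\x00" * 1600

turn_events = [
    {"type": "input_audio_buffer.append",
     "audio": base64.b64encode(pcm_chunk).decode()},  # append audio to the buffer
    {"type": "input_audio_buffer.commit"},            # all input for this turn is sent
    {"type": "response.create"},                      # explicitly request the model's output
]
for event in turn_events:
    print(json.dumps(event))  # in practice: ws.send(json.dumps(event))
```
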
Lifecycle

Session initialization
  Client events:
    session.update: Session configuration
  Server events:
    session.created: Session created
    session.updated: Session configuration updated

User audio input
  Client events:
    input_audio_buffer.append: Add audio to the buffer
    input_image_buffer.append: Add an image to the buffer
    input_audio_buffer.commit: Submit audio and images to the server
    response.create: Create a model response
  Server events:
    input_audio_buffer.committed: Server received the submitted audio

Server audio output
  Client events:
    input_audio_buffer.clear: Clear the audio from the buffer
  Server events:
    response.created: Server starts generating a response
    response.output_item.added: New output content during response
    conversation.item.created: Conversation item created
    response.content_part.added: New output content added to the assistant message item
    response.audio_transcript.delta: Incrementally generated transcribed text
    response.audio.delta: Incrementally generated audio from the model
    response.audio_transcript.done: Text transcription complete
    response.audio.done: Audio generation complete
    response.content_part.done: Streaming of text or audio content for the assistant message is complete
    response.output_item.done: Streaming of the entire output item for the assistant message is complete
    response.done: Response complete

Web search

Web search lets the model reply using real-time retrieved data for scenarios that need up-to-date information, such as stock prices or weather forecasts. The model autonomously decides whether to search.

Only the Qwen3.5-Omni-Realtime model supports web search. It is disabled by default; enable it with the session.update event.
For billing details, see the agent policy in the Billing details section.

How to enable

In the session.update event, add these parameters:

  • enable_search: Set to true to enable web search.

  • search_options.enable_source: Set to true to return a list of search result sources.

For full parameter details, see session.update.

Response format

After you enable web search, the response.done event includes a new plugins field in the usage object. This field records search usage metrics:

{
    "usage": {
        "total_tokens": 2937,
        "input_tokens": 2554,
        "output_tokens": 383,
        "input_tokens_details": {
            "text_tokens": 2512,
            "audio_tokens": 42
        },
        "output_tokens_details": {
            "text_tokens": 90,
            "audio_tokens": 293
        },
        "plugins": {
            "search": {
                "count": 1,
                "strategy": "agent"
            }
        }
    }
}

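A minimal sketch of reading these metrics from a parsed response.done event; the payload below is trimmed to reuse only the fields shown above:

```python
import json

# A response.done payload reduced to the usage fields relevant here,
# matching the response format shown above.
raw = """
{"type": "response.done",
 "response": {"usage": {"total_tokens": 2937,
                        "plugins": {"search": {"count": 1, "strategy": "agent"}}}}}
"""

event = json.loads(raw)
search = event["response"]["usage"].get("plugins", {}).get("search")
if search:  # usage.plugins.search is only present when web search ran
    print(f"search count={search['count']}, strategy={search['strategy']}")
```
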
Code examples

The following examples show how to enable web search.

DashScope Python SDK

In the update_session call, pass the enable_search and search_options parameters:

import os
import base64
import time
import json
import pyaudio
from dashscope.audio.qwen_omni import MultiModality, AudioFormat, OmniRealtimeCallback, OmniRealtimeConversation
import dashscope

dashscope.api_key = os.getenv('DASHSCOPE_API_KEY')
url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime'
model = 'qwen3.5-omni-plus-realtime'
voice = 'Tina'

class SearchCallback(OmniRealtimeCallback):
    def __init__(self, pya):
        self.pya = pya
        self.out = None
    def on_open(self):
        self.out = self.pya.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    def on_event(self, response):
        if response['type'] == 'response.audio.delta':
            self.out.write(base64.b64decode(response['delta']))
        elif response['type'] == 'conversation.item.input_audio_transcription.completed':
            print(f"[User] {response['transcript']}")
        elif response['type'] == 'response.audio_transcript.done':
            print(f"[LLM] {response['transcript']}")
        elif response['type'] == 'response.done':
            usage = response.get('response', {}).get('usage', {})
            plugins = usage.get('plugins', {})
            if plugins.get('search'):
                print(f"[Search] count={plugins['search']['count']}, strategy={plugins['search']['strategy']}")

pya = pyaudio.PyAudio()
callback = SearchCallback(pya)
conv = OmniRealtimeConversation(model=model, callback=callback, url=url)
conv.connect()
conv.update_session(
    output_modalities=[MultiModality.AUDIO, MultiModality.TEXT],
    voice=voice,
    instructions="You are Xiao Yun, a personal assistant",
    enable_search=True,
    search_options={'enable_source': True}
)
mic = pya.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True)
print("Web search is enabled. Speak into the microphone (press Ctrl+C to exit)...")
try:
    while True:
        audio_data = mic.read(3200, exception_on_overflow=False)
        conv.append_audio(base64.b64encode(audio_data).decode())
        time.sleep(0.01)
except KeyboardInterrupt:
    conv.close()
    mic.close()
    callback.out.close()
    pya.terminate()
    print("\nConversation ended")

DashScope Java SDK

In updateSession, pass web search settings through the parameters map:

import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.*;
import java.nio.ByteBuffer;
import java.util.*;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class OmniSearch {
    static class SequentialAudioPlayer {
        private final SourceDataLine line;
        private final Queue<byte[]> audioQueue = new ConcurrentLinkedQueue<>();
        private final Thread playerThread;
        private final AtomicBoolean shouldStop = new AtomicBoolean(false);

        public SequentialAudioPlayer() throws LineUnavailableException {
            AudioFormat format = new AudioFormat(24000, 16, 1, true, false);
            line = AudioSystem.getSourceDataLine(format);
            line.open(format);
            line.start();
            playerThread = new Thread(() -> {
                while (!shouldStop.get()) {
                    byte[] audio = audioQueue.poll();
                    if (audio != null) {
                        line.write(audio, 0, audio.length);
                    } else {
                        try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                    }
                }
            }, "AudioPlayer");
            playerThread.start();
        }

        public void play(String base64Audio) {
            audioQueue.add(Base64.getDecoder().decode(base64Audio));
        }
        public void close() {
            shouldStop.set(true);
            try { playerThread.join(1000); } catch (InterruptedException ignored) {}
            line.drain();
            line.close();
        }
    }

    public static void main(String[] args) {
        try {
            SequentialAudioPlayer player = new SequentialAudioPlayer();
            AtomicBoolean shouldStop = new AtomicBoolean(false);

            OmniRealtimeParam param = OmniRealtimeParam.builder()
                    .model("qwen3.5-omni-plus-realtime")
                    .apikey(System.getenv("DASHSCOPE_API_KEY"))
                    .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                    .build();

            OmniRealtimeConversation conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
                @Override public void onOpen() {
                    System.out.println("Connection established");
                }
                @Override public void onClose(int code, String reason) {
                    System.out.println("Connection closed");
                    shouldStop.set(true);
                }
                @Override public void onEvent(JsonObject event) {
                    String type = event.get("type").getAsString();
                    if ("response.audio.delta".equals(type)) {
                        player.play(event.get("delta").getAsString());
                    } else if ("response.audio_transcript.done".equals(type)) {
                        System.out.println("[LLM] " + event.get("transcript").getAsString());
                    } else if ("response.done".equals(type)) {
                        JsonObject response = event.getAsJsonObject("response");
                        if (response != null && response.has("usage")) {
                            JsonObject usage = response.getAsJsonObject("usage");
                            if (usage.has("plugins")) {
                                JsonObject plugins = usage.getAsJsonObject("plugins");
                                if (plugins.has("search")) {
                                    JsonObject search = plugins.getAsJsonObject("search");
                                    System.out.println("[Search] count=" + search.get("count").getAsInt()
                                            + ", strategy=" + search.get("strategy").getAsString());
                                }
                            }
                        }
                    }
                }
            });

            conversation.connect();
            conversation.updateSession(OmniRealtimeConfig.builder()
                    .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
                    .voice("Tina")
                    .enableTurnDetection(true)
                    .enableInputAudioTranscription(true)
                    .parameters(Map.of(
                            "instructions", "You are Xiao Yun, a personal assistant",
                            "enable_search", true,
                            "search_options", Map.of("enable_source", true)
                    ))
                    .build()
            );

            System.out.println("Web search is enabled. Start speaking (press Ctrl+C to exit)...");
            AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
            TargetDataLine mic = AudioSystem.getTargetDataLine(format);
            mic.open(format);
            mic.start();

            ByteBuffer buffer = ByteBuffer.allocate(3200);
            while (!shouldStop.get()) {
                int bytesRead = mic.read(buffer.array(), 0, buffer.capacity());
                if (bytesRead > 0) {
                    // Encode only the bytes actually read, not the whole buffer
                    conversation.appendAudio(Base64.getEncoder().encodeToString(Arrays.copyOf(buffer.array(), bytesRead)));
                }
                Thread.sleep(20);
            }

            conversation.close(1000, "Normal end");
            player.close();
            mic.close();
        } catch (NoApiKeyException e) {
            System.err.println("API key not found: Set the DASHSCOPE_API_KEY environment variable");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

WebSocket (Python)

In the JSON payload for session.update, add the enable_search and search_options fields:

import json
import os
import websocket
import base64
import pyaudio
import threading

API_KEY = os.getenv("DASHSCOPE_API_KEY")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-omni-plus-realtime"

pya = pyaudio.PyAudio()
out_stream = pya.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

def on_open(ws):
    ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": "Tina",
            "instructions": "You are Xiao Yun, a personal assistant",
            "input_audio_format": "pcm",
            "output_audio_format": "pcm",
            "enable_search": True,
            "search_options": {
                "enable_source": True
            }
        }
    }))
    print("Web search is enabled. Speak into the microphone...")
    def send_audio():
        mic = pya.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True)
        try:
            while True:
                audio = mic.read(3200, exception_on_overflow=False)
                ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(audio).decode()
                }))
        except Exception:
            mic.close()
    threading.Thread(target=send_audio, daemon=True).start()

def on_message(ws, message):
    event = json.loads(message)
    if event["type"] == "response.audio.delta":
        out_stream.write(base64.b64decode(event["delta"]))
    elif event["type"] == "response.audio_transcript.done":
        print(f"[LLM] {event['transcript']}")
    elif event["type"] == "response.done":
        usage = event.get("response", {}).get("usage", {})
        plugins = usage.get("plugins", {})
        if plugins.get("search"):
            print(f"[Search] count={plugins['search']['count']}, strategy={plugins['search']['strategy']}")

def on_error(ws, error):
    print(f"Error: {error}")

headers = ["Authorization: Bearer " + API_KEY]
ws = websocket.WebSocketApp(API_URL, header=headers, on_open=on_open, on_message=on_message, on_error=on_error)
ws.run_forever()

API reference

Billing and rate limiting

Billing rules

Qwen-Omni-Realtime bills based on token usage per modality (audio and images). For details, see Model list.

Rules for converting audio and images to tokens

Audio

  • Qwen3.5-Omni-Realtime: Total tokens = Audio duration (in seconds) × 7

  • Qwen3-Omni-Flash-Realtime: Total tokens = Audio duration (in seconds) × 12.5

  • Qwen-Omni-Turbo-Realtime: Total tokens = Audio duration (in seconds) × 25. If the audio duration is less than 1 second, it is calculated as 1 second.

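The audio conversion rules above can be sketched as a small helper. The per-second rates come from the list above; rounding a fractional result up to a whole token is an assumption, not documented behavior.

```python
import math

# Tokens consumed per second of input audio, per the conversion rules above
AUDIO_TOKENS_PER_SECOND = {
    "qwen3.5-omni-realtime": 7,
    "qwen3-omni-flash-realtime": 12.5,
    "qwen-omni-turbo-realtime": 25,
}

def audio_tokens(model: str, duration_seconds: float) -> int:
    # Qwen-Omni-Turbo-Realtime bills audio shorter than 1 second as 1 second
    if model == "qwen-omni-turbo-realtime" and duration_seconds < 1:
        duration_seconds = 1
    # Rounding up to a whole token is an assumption
    return math.ceil(duration_seconds * AUDIO_TOKENS_PER_SECOND[model])

print(audio_tokens("qwen3.5-omni-realtime", 10))      # 70
print(audio_tokens("qwen-omni-turbo-realtime", 0.5))  # 25
```
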
Image

  • Qwen3.5-Omni-Plus-Realtime model: 1 token per 32×32 pixels

  • Qwen3-Omni-Flash-Realtime model: 1 token per 32×32 pixels

  • Qwen-Omni-Turbo-Realtime model: 1 token per 28×28 pixels

An image requires a minimum of 4 tokens and supports a maximum of 1,280 tokens. You can use the following code to estimate the total number of tokens consumed by an image:

# Install the Pillow library using the following command: pip install Pillow
from PIL import Image
import math

# For the Qwen-Omni-Turbo-Realtime model, the zoom factor is 28.
# factor = 28
# For the Qwen3-Omni-Flash-Realtime and Qwen3.5-Omni-Realtime models, the zoom factor is 32.
factor = 32

def token_calculate(image_path='', duration=10):
    """
    :param image_path: The path of the image.
    :param duration: The duration of the session connection.
    :return: The number of tokens for the image.
    """
    if len(image_path) > 0:
        # Open the specified PNG image file.
        image = Image.open(image_path)
        # Get the original dimensions of the image.
        height = image.height
        width = image.width
        print(f"Image dimensions before scaling: height={height}, width={width}")
        # Adjust the height to be an integer multiple of the factor.
        h_bar = round(height / factor) * factor
        # Adjust the width to be an integer multiple of the factor.
        w_bar = round(width / factor) * factor
        # Lower limit for image tokens: 4 tokens.
        min_pixels = factor * factor * 4
        # Upper limit for image tokens: 1,280 tokens.
        max_pixels = 1280 * factor * factor
        # Scale the image to ensure the total number of pixels is within the range [min_pixels, max_pixels].
        if h_bar * w_bar > max_pixels:
            # Calculate the scaling factor beta so that the total number of pixels of the scaled image does not exceed max_pixels.
            beta = math.sqrt((height * width) / max_pixels)
            # Recalculate the adjusted height to ensure it is an integer multiple of the factor.
            h_bar = math.floor(height / beta / factor) * factor
            # Recalculate the adjusted width to ensure it is an integer multiple of the factor.
            w_bar = math.floor(width / beta / factor) * factor
        elif h_bar * w_bar < min_pixels:
            # Calculate the scaling factor beta so that the total number of pixels of the scaled image is not less than min_pixels.
            beta = math.sqrt(min_pixels / (height * width))
            # Recalculate the adjusted height to ensure it is an integer multiple of the factor.
            h_bar = math.ceil(height * beta / factor) * factor
            # Recalculate the adjusted width to ensure it is an integer multiple of the factor.
            w_bar = math.ceil(width * beta / factor) * factor
        print(f"Image dimensions after scaling: height={h_bar}, width={w_bar}")
        # Calculate the number of tokens for the image: total pixels divided by (factor × factor).
        token = int((h_bar * w_bar) / (factor * factor))
        print(f"Number of tokens after scaling: {token}")
        total_token = token * math.ceil(duration / 2)
        print(f"Total number of tokens: {total_token}")
        return total_token
    else:
        print("Error: image_path is empty. Cannot calculate tokens.")
        return 0

if __name__ == "__main__":
    total_token = token_calculate(image_path="xxx/test.jpg", duration=10)

Rate limiting

See Rate limiting.

Error codes

If the model call fails and returns an error message, see Error messages for resolution.

Voice list

Set the voice request parameter to one of the voice names below.

qwen3.5-omni-realtime

All of the following voices support these languages: Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, and Persian. Cindy speaks Chinese with a Taiwanese accent; all other voices speak Mandarin Chinese.

  • Tina: A voice like warm milk tea—sweet and cozy, yet sharp when solving problems

  • Cindy: A sweet-talking young woman from Taiwan

  • Liora Mira: A gentle voice that weaves warmth into everyday life

  • Sunnybobi: A cheerful, socially awkward neighbor girl

  • Raymond: A clear-voiced, takeout-loving homebody

  • Ethan: Standard Mandarin with a slight northern accent. Bright, warm, energetic, and youthful

  • Theo Calm: Conveys understanding in silence and healing through words

  • Serena: A gentle young woman

  • Harvey: A voice that carries the weight of time—deep, mellow, and scented with coffee and old books

  • Maia: A blend of intellect and gentleness

Evan

Voice name: Evan

Description: A college student—youthful and endearing

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Qiao

Voice name: Qiao

Description: Not just cute—she’s sweet on the surface and full of personality underneath

Chinese (Taiwanese accent), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Momo

Voice name: Momo

Description: Playful and mischievous—here to cheer you up

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Wil

Voice name: Wil

Description: A young man from Shenzhen who speaks with a Hong Kong–Taiwan accent

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Angel

Voice name: Angel

Description: Slightly Taiwanese-accented—and very sweet

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Li Cassian

Voice name: Li Cassian

Description: Speaks with restraint—three parts silence, seven parts reading the room

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Mia

Voice name: Mia

Description: A lifestyle artist who shares slow-living aesthetics and daily comfort through a soothing voice

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Joyner

Voice name: Joyner

Description: Funny, exaggerated, and down-to-earth

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Gold

Voice name: Gold

Description: A West Coast Black rapper

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Katerina

Voice name: Katerina

Description: A mature, commanding voice with rich rhythm and resonance

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Ryan

Voice name: Ryan

Description: High-energy delivery with strong dramatic presence—realism meets intensity

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Jennifer

Voice name: Jennifer

Description: A premium, cinematic-quality American female voice

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Aiden

Voice name: Aiden

Description: An American young man skilled in cooking

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Mione

Voice name: Mione

Description: A mature, intelligent British neighbor girl

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Sunny

Voice name: Sichuan–Sunny

Description: A sweet Sichuan girl who warms your heart

Chinese (Sichuan dialect), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Dylan

Voice name: Beijing–Dylan

Description: A youth raised in Beijing’s hutongs

Chinese (Beijing dialect), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Eric

Voice name: Sichuan–Eric

Description: A lively Chengdu man from Sichuan

Chinese (Sichuan dialect), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Peter

Voice name: Tianjin–Peter

Description: A Tianjin-style xiangsheng performer—professional foil

Chinese (Tianjin dialect), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Joseph Chen

Voice name: Joseph Chen

Description: A longtime overseas Chinese from Southeast Asia with a warm, nostalgic voice

Chinese (Hokkien), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Marcus

Voice name: Shaanxi–Marcus

Description: Broad face, few words, sincere heart, deep voice—the true flavor of Shaanxi

Chinese (Shaanxi dialect), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Li

Voice name: Nanjing–Li

Description: A grumpy uncle

Chinese (Nanjing dialect), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Rocky

Voice name: Cantonese–Rocky

Description: A witty and humorous online chat companion

Chinese (Cantonese), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Sohee

Voice name: Sohee

Description: A warm, cheerful, emotionally expressive Korean unnie

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Lenn

Voice name: Lenn

Description: Rational at core, rebellious in detail—a German youth who wears suits and listens to post-punk

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Ono Anna

Voice name: Ono Anna

Description: A clever, playful childhood friend

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Sonrisa

Voice name: Sonrisa

Description: A warm, outgoing Latin American woman

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Bodega

Voice name: Bodega

Description: A warm, enthusiastic Spanish man

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Emilien

Voice name: Emilien

Description: A romantic French big brother

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Andre

Voice name: Andre

Description: A magnetic, natural, and steady male voice

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Radio Gol

Voice name: Radio Gol

Description: A passionate football commentator who narrates games with poetic flair

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Alek

Voice name: Alek

Description: Cold like the Russian spirit—yet warm as wool beneath a coat

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Rizky

Voice name: Rizky

Description: A young Indonesian man with a distinctive voice

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Roya

Voice name: Roya

Description: A sporty girl with a free-spirited heart

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Arda

Voice name: Arda

Description: Neither high nor low—clean, crisp, and gently warm

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Hana

Voice name: Hana

Description: A mature Vietnamese woman who loves dogs

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Dolce

Voice name: Dolce

Description: A laid-back Italian man

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Jakub

Voice name: Jakub

Description: A charismatic, artistic young man from a Polish town

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Griet

Voice name: Griet

Description: A mature, artistic Dutch woman

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Eliška

Voice name: Eliška

Description: Every word carries Central European craftsmanship and warmth

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Marina

Voice name: Marina

Description: A girl raised in a multicultural city

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Siiri

Voice name: Siiri

Description: Reserved and gentle—with a calm, lake-like speaking pace

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Ingrid

Voice name: Ingrid

Description: A woman from rural Norway

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Sigga

Voice name: Sigga

Description: An intellectual young woman from an Icelandic town

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Bea

Voice name: Bea

Description: A sweet Filipino woman who loves coffee

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Chloe

Voice name: Chloe

Description: A Malaysian office worker

Chinese (Mandarin), Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

qwen3-omni-flash-realtime-2025-12-01

Voice name

voice parameter

Description

Languages supported

Cherry

Cherry

A sunny, positive, friendly, and natural young woman

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Serena

Serena

A gentle young woman

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Ethan

Ethan

Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Chelsie

Chelsie

A two-dimensional virtual girlfriend

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Momo

Momo

Playful and mischievous, cheering you up

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Vivian

Vivian

Confident, cute, and slightly feisty

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Moon

Moon

Effortlessly cool Moon White

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Maia

Maia

A blend of intellect and gentleness

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Kai

Kai

A soothing audio spa for your ears

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Nofish

Nofish

A designer who cannot pronounce retroflex sounds

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Bella

Bella

A little girl who drinks but never throws punches when drunk

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Jennifer

Jennifer

A premium, cinematic-quality American English female voice

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Ryan

Ryan

Full of rhythm, bursting with dramatic flair, balancing authenticity and tension

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Katerina

Katerina

A mature-woman voice with rich, memorable rhythm

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Aiden

Aiden

An American English young man skilled in cooking

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Eldric Sage

Eldric Sage

A calm and wise elder—weathered like a pine tree, yet clear-minded as a mirror

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Mia

Mia

Gentle as spring water, obedient as fresh snow

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Mochi

Mochi

A clever, quick-witted young adult—childlike innocence remains, yet wisdom shines through

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Bellona

Bellona

A powerful, clear voice that brings characters to life—so stirring it makes your blood boil. With heroic grandeur and perfect diction, this voice captures the full spectrum of human expression

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Vincent

Vincent

A uniquely raspy, smoky voice—just one line evokes armies and heroic tales

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Bunny

Bunny

A little girl overflowing with "cuteness"

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Neil

Neil

A flat baseline intonation with precise, clear pronunciation—the most professional news anchor

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Elias

Elias

Maintains academic rigor while using storytelling techniques to turn complex knowledge into digestible learning modules

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Arthur

Arthur

A simple, earthy voice steeped in time and tobacco smoke—slowly unfolding village stories and curiosities

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Nini

Nini

A soft, clingy voice like sweet rice cakes—those drawn-out calls of “Big Brother” are so sweet they melt your bones

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Ebona

Ebona

Her whisper is like a rusty key slowly turning in the darkest corner of your mind—where childhood shadows and unknown fears hide

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Seren

Seren

A gentle, soothing voice to help you fall asleep faster. Good night, sweet dreams

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Pip

Pip

A playful, mischievous boy full of childlike wonder—is this your memory of Shin-chan?

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Stella

Stella

Normally a cloyingly sweet, dazed teenage-girl voice—but when shouting “I represent the moon to defeat you!”, she instantly radiates unwavering love and justice

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Bodega

Bodega

A passionate Spanish man

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Sonrisa

Sonrisa

A cheerful, outgoing Latin American woman

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Alek

Alek

Cold like the Russian spirit, yet warm like wool coat lining

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Dolce

Dolce

A laid-back Italian man

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Sohee

Sohee

A warm, cheerful, emotionally expressive Korean unnie

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Ono Anna

Ono Anna

A clever, spirited childhood friend

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Lenn

Lenn

Rational at heart, rebellious in detail—a German youth who wears suits and listens to post-punk

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Emilien

Emilien

A romantic French big brother

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Andre

Andre

A magnetic, natural, and steady male voice

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Radio Gol

Radio Gol

A passionate football commentator who narrates games with poetic flair

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Shanghai - Jada

Jada

A fast-paced, energetic Shanghai auntie

Shanghainese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Beijing - Dylan

Dylan

A young man raised in Beijing’s hutongs

Beijing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Nanjing - Li

Li

A patient yoga teacher

Nanjing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Shaanxi - Marcus

Marcus

Broad face, few words, sincere heart, deep voice—the authentic Shaanxi flavor

Shaanxi dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Southern Min - Roy

Roy

A humorous, straightforward, lively Taiwanese guy

Southern Min, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Tianjin - Peter

Peter

A Tianjin-style xiangsheng (crosstalk) performer, a professional foil

Tianjin dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Sichuan - Sunny

Sunny

A Sichuan girl sweet enough to melt your heart

Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Sichuan - Eric

Eric

A lively Sichuanese man from Chengdu

Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Cantonese - Rocky

Rocky

A witty, humorous Cantonese chat companion (A Qiang)

Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Cantonese - Kiki

Kiki

A sweet Hong Kong girl best friend

Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

qwen3-omni-flash-realtime, qwen3-omni-flash-realtime-2025-09-15

Voice name

voice parameter

Description

Languages supported

Cherry

Cherry

A sunny, positive, friendly, and natural young woman

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Ethan

Ethan

Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Nofish

Nofish

A designer who cannot pronounce retroflex sounds

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Jennifer

Jennifer

A premium, cinematic-quality American English female voice

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Ryan

Ryan

Full of rhythm, bursting with dramatic flair, balancing authenticity and tension

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Katerina

Katerina

A mature-woman voice with rich, memorable rhythm

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Elias

Elias

Maintains academic rigor while using storytelling techniques to turn complex knowledge into digestible learning modules

Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Shanghai - Jada

Jada

A fast-paced, energetic Shanghai auntie

Shanghainese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Beijing - Dylan

Dylan

A young man raised in Beijing’s hutongs

Beijing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Sichuan - Sunny

Sunny

A Sichuan girl sweet enough to melt your heart

Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Nanjing - Li

Li

A patient yoga teacher

Nanjing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Shaanxi - Marcus

Marcus

Broad face, few words, sincere heart, deep voice—the authentic Shaanxi flavor

Shaanxi dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Southern Min - Roy

Roy

A humorous, straightforward, lively Taiwanese guy

Southern Min, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Tianjin - Peter

Peter

A Tianjin-style xiangsheng (crosstalk) performer, a professional foil

Tianjin dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Cantonese - Rocky

Rocky

A witty, humorous Cantonese chat companion (A Qiang)

Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Cantonese - Kiki

Kiki

A sweet Hong Kong girl best friend

Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Sichuan - Eric

Eric

A lively Sichuanese man from Chengdu

Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Qwen-Omni-Turbo-Realtime

Voice name

voice parameter

Description

Languages supported

Cherry

Cherry

A sunny, positive, friendly, and natural young woman

Chinese, English

Serena

Serena

A gentle young woman

Chinese, English

Ethan

Ethan

Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant

Chinese, English

Chelsie

Chelsie

A two-dimensional virtual girlfriend

Chinese, English