
Alibaba Cloud Model Studio: Real-time speech synthesis - Qwen

Last Updated: Nov 06, 2025

The Qwen real-time speech synthesis model offers low-latency speech synthesis with streaming text input and audio output. It provides various human-like voices, supports multiple languages and dialects, and lets you use the same voice for different languages. The model also automatically adjusts its tone and smoothly handles complex text.

Compared to Speech synthesis - Qwen, Qwen real-time speech synthesis supports the following features:

  • Streaming text input

    Seamlessly integrates with the streaming output of Large Language Models (LLMs). Audio is synthesized as text is generated, which improves the real-time performance of interactive voice applications.

  • Bidirectional communication

    It uses the WebSocket protocol for streaming text input and audio output. This method avoids the overhead of establishing multiple connections and significantly reduces latency.

Supported models

The supported models are Qwen3-TTS Realtime and Qwen-TTS Realtime.

Qwen3-TTS Realtime provides 17 voices, supports synthesis for multiple languages and dialects, and lets you customize the format, sample rate, speech rate, volume, pitch, and bitrate of the output audio.

Qwen-TTS Realtime provides only 7 voices, supports only Chinese and English, and does not allow you to customize the format, sample rate, speech rate, volume, pitch, or bitrate of the output audio.

International (Singapore)

| Model | Version | Unit price | Supported languages | Free quota (Note) |
| --- | --- | --- | --- | --- |
| qwen3-tts-flash-realtime (current capabilities are equivalent to qwen3-tts-flash-realtime-2025-09-18) | Stable | $0.13 per 10,000 characters | Chinese (Mandarin, Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Minnan, Tianjin, Cantonese), English, Spanish, Russian, Italian, French, Korean, Japanese, German, Portuguese | 2,000 characters for each model. Validity: 90 days after Alibaba Cloud Model Studio activation |
| qwen3-tts-flash-realtime-2025-09-18 | Snapshot | Same as Stable | Same as Stable | Same as Stable |

Qwen3-TTS is billed based on the number of input characters. The billing rules are as follows (a counting sketch follows the list):

  • 1 Chinese character = 2 characters

  • 1 English letter, 1 punctuation mark, or 1 space = 1 character
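
The following sketch illustrates this counting rule. Treating code points in the CJK Unified Ideographs block as Chinese characters is an assumption; the exact ranges used by the billing system are not documented here.

def billable_characters(text: str) -> int:
    """Estimate billable characters: Chinese characters count as 2, everything else as 1."""
    count = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs (assumed range)
            count += 2
        else:  # letters, punctuation marks, spaces
            count += 1
    return count

# 2 Chinese characters (4) + 1 comma + 1 space + 5 letters = 11 characters
print(billable_characters("你好, world"))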

Mainland China (Beijing)

Qwen3-TTS Realtime

| Model | Version | Unit price | Supported languages |
| --- | --- | --- | --- |
| qwen3-tts-flash-realtime (current capabilities are equivalent to qwen3-tts-flash-realtime-2025-09-18) | Stable | $0.143353 per 10,000 characters | Chinese (Mandarin, Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Minnan, Tianjin, Cantonese), English, Spanish, Russian, Italian, French, Korean, Japanese, German, Portuguese |
| qwen3-tts-flash-realtime-2025-09-18 | Snapshot | Same as Stable | Same as Stable |

Qwen3-TTS is billed based on the number of input characters. The billing rules are as follows:

  • 1 Chinese character = 2 characters

  • 1 English letter, 1 punctuation mark, or 1 space = 1 character

Qwen-TTS Realtime

| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per 1,000 tokens) | Output cost (per 1,000 tokens) | Supported languages |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-tts-realtime (current capabilities are equivalent to qwen-tts-realtime-2025-07-15) | Stable | 8,192 | 512 | 7,680 | $0.345 | $1.721 | Chinese, English |
| qwen-tts-realtime-latest (current capabilities are equivalent to qwen-tts-realtime-2025-07-15) | Latest | 8,192 | 512 | 7,680 | $0.345 | $1.721 | Chinese, English |
| qwen-tts-realtime-2025-07-15 | Snapshot | 8,192 | 512 | 7,680 | $0.345 | $1.721 | Chinese, English |

Audio-to-token conversion rule: 1 second of audio corresponds to 50 tokens. Audio shorter than 1 second is counted as 50 tokens.
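
As an example, the following sketch estimates billed tokens from a clip's duration. Rounding up to whole seconds is an assumption: the documented rule only gives the per-second rate and the sub-second minimum.

import math

def audio_tokens(duration_seconds: float) -> int:
    # 50 tokens per second of audio; partial seconds assumed to round up
    return max(1, math.ceil(duration_seconds)) * 50

print(audio_tokens(0.4))   # 50 (audio shorter than 1 second counts as 50 tokens)
print(audio_tokens(12.0))  # 600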

Access methods

The Qwen real-time speech synthesis API is based on the WebSocket protocol. If you use Java or Python, you can use the DashScope SDK to avoid handling WebSocket details. You can also use a WebSocket library in any language to connect:

  • Endpoint URL

    Mainland China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime

    International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime

  • Query parameters

    The query parameter is `model`. You must specify the name of the model that you want to access. For more information, see Supported models.

  • Header

    Use a Bearer Token for authentication: `Authorization: Bearer DASHSCOPE_API_KEY`

    `DASHSCOPE_API_KEY` is the API key that you obtained from Alibaba Cloud Model Studio.

You can use the following code to establish a WebSocket connection with the Realtime API.

Establish a WebSocket connection

# pip install websocket-client
import json
import websocket
import os

# API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: API_KEY="sk-xxx"
API_KEY = os.getenv("DASHSCOPE_API_KEY")
if not API_KEY:
    raise ValueError("Please set the DASHSCOPE_API_KEY environment variable")
# The following is the URL for the Singapore region. If you use a model in the Beijing region, you must replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime"

headers = [
    "Authorization: Bearer " + API_KEY
]

def on_open(ws):
    print(f"Connected to server: {API_URL}")
def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))
def on_error(ws, error):
    print("Error:", error)

ws = websocket.WebSocketApp(
    API_URL,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)

ws.run_forever()

After the connection is established, the server returns the following session.created event:

{
    "event_id": "event_xxx",
    "type": "session.created",
    "session": {
        "object": "realtime.session",
        "mode": "server_commit",
        "model": "qwen3-tts-flash-realtime",
        "voice": "Cherry",
        "response_format": "pcm",
        "sample_rate": 24000,
        "id": "sess_xxx"
    }
}

Getting started

Before you run the code, you must obtain and configure an API key.

Your Python version must be 3.10 or later.

Follow these steps to quickly test the real-time audio synthesis feature of the Realtime API.

  1. Prepare the runtime environment

    You can install pyaudio based on your operating system.

    macOS

    brew install portaudio && pip install pyaudio

    Debian/Ubuntu

    sudo apt-get install python3-pyaudio
    
    or
    
    pip install pyaudio

    CentOS

    sudo yum install -y portaudio portaudio-devel && pip install pyaudio

    Windows

    pip install pyaudio

    After the installation is complete, you can use pip to install WebSocket-related dependencies:

    pip install websocket-client==1.8.0 websockets
  2. Create a client

    You can create a local Python file named tts_realtime_client.py and copy the following code into the file:

    tts_realtime_client.py

    # -*- coding: utf-8 -*-
    
    import asyncio
    import websockets
    import json
    import base64
    import time
    from typing import Optional, Callable, Dict, Any
    from enum import Enum
    
    
    class SessionMode(Enum):
        SERVER_COMMIT = "server_commit"
        COMMIT = "commit"
    
    
    class TTSRealtimeClient:
        """
        A client for interacting with the TTS Realtime API.
    
        This class provides methods to connect to the TTS Realtime API, send text data, get audio output, and manage the WebSocket connection.
    
        Attributes:
            base_url (str):
                The base URL of the Realtime API.
            api_key (str):
                The API key for identity verification.
            voice (str):
                The voice used by the server for speech synthesis.
            mode (SessionMode):
                The session mode. Valid values: `server_commit` and `commit`.
            audio_callback (Callable[[bytes], None]):
                The callback function that receives audio data.
            language_type (str):
                The language of the synthesized speech. Valid values: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Auto.
        """
    
        def __init__(
                self,
                base_url: str,
                api_key: str,
                voice: str = "Cherry",
                mode: SessionMode = SessionMode.SERVER_COMMIT,
                audio_callback: Optional[Callable[[bytes], None]] = None,
            language_type: str = "Auto"):
            self.base_url = base_url
            self.api_key = api_key
            self.voice = voice
            self.mode = mode
            self.ws = None
            self.audio_callback = audio_callback
            self.language_type = language_type
    
            # Current response status
            self._current_response_id = None
            self._current_item_id = None
            self._is_responding = False
    
    
        async def connect(self) -> None:
            """Establishes a WebSocket connection with the TTS Realtime API."""
            headers = {
                "Authorization": f"Bearer {self.api_key}"
            }
    
            self.ws = await websockets.connect(self.base_url, additional_headers=headers)
    
            # Sets the default session configuration.
            await self.update_session({
                "mode": self.mode.value,
                "voice": self.voice,
                "language_type": self.language_type,
                "response_format": "pcm",
                "sample_rate": 24000
            })
    
    
        async def send_event(self, event) -> None:
            """Sends an event to the server."""
            event['event_id'] = "event_" + str(int(time.time() * 1000))
            print(f"Sending event: type={event['type']}, event_id={event['event_id']}")
            await self.ws.send(json.dumps(event))
    
    
        async def update_session(self, config: Dict[str, Any]) -> None:
            """Updates the session configuration."""
            event = {
                "type": "session.update",
                "session": config
            }
            print("Updating session configuration: ", event)
            await self.send_event(event)
    
    
        async def append_text(self, text: str) -> None:
            """Sends text data to the API."""
            event = {
                "type": "input_text_buffer.append",
                "text": text
            }
            await self.send_event(event)
    
    
        async def commit_text_buffer(self) -> None:
            """Commits the text buffer to trigger processing."""
            event = {
                "type": "input_text_buffer.commit"
            }
            await self.send_event(event)
    
    
        async def clear_text_buffer(self) -> None:
            """Clears the text buffer."""
            event = {
                "type": "input_text_buffer.clear"
            }
            await self.send_event(event)
    
    
        async def finish_session(self) -> None:
            """Finishes the session."""
            event = {
                "type": "session.finish"
            }
            await self.send_event(event)
    
    
        async def handle_messages(self) -> None:
            """Handles messages from the server."""
            try:
                async for message in self.ws:
                    event = json.loads(message)
                    event_type = event.get("type")
    
                    if event_type != "response.audio.delta":
                        print(f"Received event: {event_type}")
    
                    if event_type == "error":
                        print("Error: ", event.get('error', {}))
                        continue
                    elif event_type == "session.created":
                        print("Session created, ID: ", event.get('session', {}).get('id'))
                    elif event_type == "session.updated":
                        print("Session updated, ID: ", event.get('session', {}).get('id'))
                    elif event_type == "input_text_buffer.committed":
                        print("Text buffer committed, item ID: ", event.get('item_id'))
                    elif event_type == "input_text_buffer.cleared":
                        print("Text buffer cleared")
                    elif event_type == "response.created":
                        self._current_response_id = event.get("response", {}).get("id")
                        self._is_responding = True
                        print("Response created, ID: ", self._current_response_id)
                    elif event_type == "response.output_item.added":
                        self._current_item_id = event.get("item", {}).get("id")
                        print("Output item added, ID: ", self._current_item_id)
                    # Handles audio delta
                    elif event_type == "response.audio.delta" and self.audio_callback:
                        audio_bytes = base64.b64decode(event.get("delta", ""))
                        self.audio_callback(audio_bytes)
                    elif event_type == "response.audio.done":
                        print("Audio generation complete")
                    elif event_type == "response.done":
                        self._is_responding = False
                        self._current_response_id = None
                        self._current_item_id = None
                        print("Response complete")
                    elif event_type == "session.finished":
                        print("Session finished")
    
            except websockets.exceptions.ConnectionClosed:
                print("Connection closed")
            except Exception as e:
                print("Error processing message: ", str(e))
    
    
        async def close(self) -> None:
            """Closes the WebSocket connection."""
            if self.ws:
                await self.ws.close()
  3. Select a speech synthesis mode

    The Realtime API supports the following two modes (a sketch for selecting the mode follows the list):

    • server_commit mode

      In this mode, the client only sends text. The server intelligently determines how to segment the text and when to synthesize it. This mode is suitable for low-latency scenarios where you do not need to manually control the synthesis rhythm, such as GPS navigation.

    • commit mode

      In this mode, the client first adds text to a buffer and then actively triggers the server to synthesize the specified text. This mode is suitable for scenarios that require fine-grained control over sentence breaks and pauses, such as news broadcasting.
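
    Both modes are set through the session configuration. The following minimal sketch selects a mode with the TTSRealtimeClient from step 2 (the endpoint URL and API key are the ones described in Access methods):

    import os
    from tts_realtime_client import TTSRealtimeClient, SessionMode

    client = TTSRealtimeClient(
        base_url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        mode=SessionMode.COMMIT,  # or SessionMode.SERVER_COMMIT
    )
    # client.connect() then sends the mode to the server as session.mode.
    # The mode can also be changed on an open session with a session.update event:
    # await client.update_session({"mode": "server_commit"})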

    server_commit mode

    In the same directory as tts_realtime_client.py, you can create another Python file named server_commit.py and copy the following code into the file:

    server_commit.py

    import os
    import asyncio
    import logging
    import wave
    from tts_realtime_client import TTSRealtimeClient, SessionMode
    import pyaudio
    
    # QwenTTS service configuration
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, you must replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime
    URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime"
    # API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured environment variables, replace the following line with your Model Studio API key: API_KEY="sk-xxx"
    API_KEY = os.getenv("DASHSCOPE_API_KEY")
    
    if not API_KEY:
        raise ValueError("Please set DASHSCOPE_API_KEY environment variable")
    
    # Collects audio data
    _audio_chunks = []
    # Real-time playback related
    _AUDIO_SAMPLE_RATE = 24000
    _audio_pyaudio = pyaudio.PyAudio()
    _audio_stream = None  # Will be opened at runtime
    
    def _audio_callback(audio_bytes: bytes):
        """TTSRealtimeClient audio callback: Real-time playback and caching"""
        global _audio_stream
        if _audio_stream is not None:
            try:
                _audio_stream.write(audio_bytes)
            except Exception as exc:
                logging.error(f"PyAudio playback error: {exc}")
        _audio_chunks.append(audio_bytes)
        logging.info(f"Received audio chunk: {len(audio_bytes)} bytes")
    
    def _save_audio_to_file(filename: str = "output.wav", sample_rate: int = 24000) -> bool:
        """Saves the collected audio data to a WAV file"""
        if not _audio_chunks:
            logging.warning("No audio data to save")
            return False
    
        try:
            audio_data = b"".join(_audio_chunks)
            with wave.open(filename, 'wb') as wav_file:
                wav_file.setnchannels(1)  # Mono
                wav_file.setsampwidth(2)  # 16-bit
                wav_file.setframerate(sample_rate)
                wav_file.writeframes(audio_data)
            logging.info(f"Audio saved to: {filename}")
            return True
        except Exception as exc:
            logging.error(f"Failed to save audio: {exc}")
            return False
    
    async def _produce_text(client: TTSRealtimeClient):
        """Sends text fragments to the server"""
        text_fragments = [
            "Alibaba Cloud's Model Studio is a one-stop platform for large model development and application building.",
            "Both developers and business personnel can be deeply involved in the design and construction of large model applications.", 
            "You can develop a large model application in 5 minutes through simple interface operations,",
            "or train an exclusive model in a few hours, so you can focus more on application innovation.",
        ]
    
        logging.info("Sending text fragments…")
        for text in text_fragments:
            logging.info(f"Sending fragment: {text}")
            await client.append_text(text)
            await asyncio.sleep(0.1)  # Slight delay between fragments
    
        # Waits for the server to complete internal processing before finishing the session
        await asyncio.sleep(1.0)
        await client.finish_session()
    
    async def _run_demo():
        """Runs the full demo"""
        global _audio_stream
        # Opens the PyAudio output stream
        _audio_stream = _audio_pyaudio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=_AUDIO_SAMPLE_RATE,
            output=True,
            frames_per_buffer=1024
        )
    
        client = TTSRealtimeClient(
            base_url=URL,
            api_key=API_KEY,
            voice="Cherry",
            language_type="Chinese", # Set this parameter to the same language as the text to get correct pronunciation and a natural tone.
            mode=SessionMode.SERVER_COMMIT,
            audio_callback=_audio_callback
        )
    
        # Establishes the connection
        await client.connect()
    
        # Executes message handling and text sending in parallel
        consumer_task = asyncio.create_task(client.handle_messages())
        producer_task = asyncio.create_task(_produce_text(client))
    
        await producer_task  # Waits for text sending to complete
    
        # Waits for an additional period to ensure all audio data is received
        await asyncio.sleep(5)
    
        # Closes the connection and cancels the consumer task
        await client.close()
        consumer_task.cancel()
    
        # Closes the audio stream
        if _audio_stream is not None:
            _audio_stream.stop_stream()
            _audio_stream.close()
        _audio_pyaudio.terminate()
    
        # Saves the audio data
        os.makedirs("outputs", exist_ok=True)
        _save_audio_to_file(os.path.join("outputs", "qwen_tts_output.wav"))
    
    def main():
        """Sync entry point"""
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s [%(levelname)s] %(message)s',
            datefmt='%Y-%m-%d %H:%M:%S'
        )
        logging.info("Starting QwenTTS Realtime Client demo…")
        asyncio.run(_run_demo())
    
    if __name__ == "__main__":
        main() 

    You can run server_commit.py to hear the audio generated in real time by the Realtime API.

    commit mode

    In the same directory as tts_realtime_client.py, you can create another Python file named commit.py and copy the following code into the file:

    commit.py

    import os
    import asyncio
    import logging
    import wave
    from tts_realtime_client import TTSRealtimeClient, SessionMode
    import pyaudio
    
    # QwenTTS service configuration
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, you must replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime
    URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime"
    # API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured environment variables, replace the following line with your Model Studio API key: API_KEY="sk-xxx"
    API_KEY = os.getenv("DASHSCOPE_API_KEY")
    
    if not API_KEY:
        raise ValueError("Please set DASHSCOPE_API_KEY environment variable")
    
    # Collects audio data
    _audio_chunks = []
    _AUDIO_SAMPLE_RATE = 24000
    _audio_pyaudio = pyaudio.PyAudio()
    _audio_stream = None
    
    def _audio_callback(audio_bytes: bytes):
        """TTSRealtimeClient audio callback: Real-time playback and caching"""
        global _audio_stream
        if _audio_stream is not None:
            try:
                _audio_stream.write(audio_bytes)
            except Exception as exc:
                logging.error(f"PyAudio playback error: {exc}")
        _audio_chunks.append(audio_bytes)
        logging.info(f"Received audio chunk: {len(audio_bytes)} bytes")
    
    def _save_audio_to_file(filename: str = "output.wav", sample_rate: int = 24000) -> bool:
        """Saves the collected audio data to a WAV file"""
        if not _audio_chunks:
            logging.warning("No audio data to save")
            return False
    
        try:
            audio_data = b"".join(_audio_chunks)
            with wave.open(filename, 'wb') as wav_file:
                wav_file.setnchannels(1)  # Mono
                wav_file.setsampwidth(2)  # 16-bit
                wav_file.setframerate(sample_rate)
                wav_file.writeframes(audio_data)
            logging.info(f"Audio saved to: {filename}")
            return True
        except Exception as exc:
            logging.error(f"Failed to save audio: {exc}")
            return False
    
    async def _user_input_loop(client: TTSRealtimeClient):
        """Continuously gets user input and sends text. When the user enters empty text, it sends a commit event and ends the current session."""
        print("Enter text (press Enter directly to send a commit event and end the current session, or press Ctrl+C or Ctrl+D to end the program):")
        
        while True:
            try:
                user_text = input("> ")
                if not user_text:  # User input is empty
                    # Empty input is treated as the end of a conversation: commit buffer -> finish session -> break loop
                    logging.info("Empty input, sending commit event and finishing the current session")
                    await client.commit_text_buffer()
                    # Waits for a short period for the server to process the commit to prevent losing audio due to premature session termination
                    await asyncio.sleep(0.3)
                    await client.finish_session()
                    break  # Exits the user input loop directly, no need to press Enter again
                else:
                    logging.info(f"Sending text: {user_text}")
                    await client.append_text(user_text)
                    
            except EOFError:  # User pressed Ctrl+D
                break
            except KeyboardInterrupt:  # User pressed Ctrl+C
                break
        
        # Finishes the session
        logging.info("Finishing session...")

    async def _run_demo():
        """Runs the full demo"""
        global _audio_stream
        # Opens the PyAudio output stream
        _audio_stream = _audio_pyaudio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=_AUDIO_SAMPLE_RATE,
            output=True,
            frames_per_buffer=1024
        )
    
        client = TTSRealtimeClient(
            base_url=URL,
            api_key=API_KEY,
            voice="Cherry",
            language_type="Chinese",  # Set this parameter to the same language as the text to get correct pronunciation and a natural tone.
            mode=SessionMode.COMMIT,  # Change to COMMIT mode
            audio_callback=_audio_callback
        )
    
        # Establishes the connection
        await client.connect()
    
        # Executes message handling and user input in parallel
        consumer_task = asyncio.create_task(client.handle_messages())
        producer_task = asyncio.create_task(_user_input_loop(client))
    
        await producer_task  # Waits for user input to complete
    
        # Waits for an additional period to ensure all audio data is received
        await asyncio.sleep(5)
    
        # Closes the connection and cancels the consumer task
        await client.close()
        consumer_task.cancel()
    
        # Closes the audio stream
        if _audio_stream is not None:
            _audio_stream.stop_stream()
            _audio_stream.close()
        _audio_pyaudio.terminate()
    
        # Saves the audio data
        os.makedirs("outputs", exist_ok=True)
        _save_audio_to_file(os.path.join("outputs", "qwen_tts_output.wav"))
    
    def main():
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s [%(levelname)s] %(message)s',
            datefmt='%Y-%m-%d %H:%M:%S'
        )
        logging.info("Starting QwenTTS Realtime Client demo…")
        asyncio.run(_run_demo())
    
    if __name__ == "__main__":
        main() 

    Run commit.py. You can enter text to synthesize multiple times; press Enter on an empty line to commit the buffer and end the session, and the audio returned by the Realtime API plays through your speakers.

Interaction flow

server_commit mode

You can set the session.mode of the session.update event to "server_commit" to enable this mode. The server will then intelligently handle text segmentation and synthesis timing.

The interaction flow is as follows:

  1. The client sends the session.update event, and the server responds with the session.created and session.updated events.

  2. The client sends the input_text_buffer.append event to append text to the server-side buffer. After all text has been sent, the client sends the session.finish event to indicate that there is no more input.

  3. The server intelligently handles text segmentation and synthesis timing, and returns the response.created, response.output_item.added, response.content_part.added, and response.audio.delta events.

  4. After the server completes the response, it returns the response.audio.done, response.content_part.done, response.output_item.done, and response.done events.

  5. The server sends the session.finished event to end the session.

| Lifecycle | Client events | Server events |
| --- | --- | --- |
| Session initialization | session.update (session configuration) | session.created (session created); session.updated (session configuration updated) |
| User text input | input_text_buffer.append (appends text to the server); input_text_buffer.commit (immediately synthesizes the text cached on the server); session.finish (notifies the server that there is no more text input) | input_text_buffer.committed (server received the submitted text) |
| Server audio output | None | response.created (server starts generating a response); response.output_item.added (new output content is available in the response); response.content_part.added (new output content is added to the assistant message); response.audio.delta (incrementally generated audio from the model); response.content_part.done (streaming of text or audio content for the assistant message is complete); response.output_item.done (streaming of the entire output item for the assistant message is complete); response.audio.done (audio generation is complete); response.done (response is complete) |

commit mode

You can set the session.mode of the session.update event to "commit" to enable this mode. In this mode, the client must actively submit the text buffer to the server to obtain a response.

The interaction flow is as follows (a scripted sketch of these steps follows the list):

  1. The client sends a session.update event, and the server responds with session.created and session.updated events.

  2. The client sends the input_text_buffer.append event to append text to the server-side buffer.

  3. The client sends the input_text_buffer.commit event to commit the buffer to the server, and sends the session.finish event to indicate that there is no further text input.

  4. The server sends the response.created event and begins to generate the response.

  5. The server sends the response.output_item.added, response.content_part.added, and response.audio.delta events.

  6. When the server finishes responding, it returns the response.audio.done, response.content_part.done, response.output_item.done, and response.done events.

  7. The server sends the session.finished event to end the session.
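
The commit.py demo drives this flow from interactive input, but the same steps can be scripted. The following minimal sketch uses the TTSRealtimeClient from the Getting started section and discards the audio; replace the callback with playback or file writing as needed:

import os
import asyncio
from tts_realtime_client import TTSRealtimeClient, SessionMode

async def synthesize(text: str):
    client = TTSRealtimeClient(
        base_url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        mode=SessionMode.COMMIT,
        audio_callback=lambda pcm: None,  # discard audio in this sketch
    )
    await client.connect()              # step 1: session.update -> session.created/session.updated
    consumer = asyncio.create_task(client.handle_messages())
    await client.append_text(text)      # step 2: input_text_buffer.append
    await client.commit_text_buffer()   # step 3: input_text_buffer.commit
    await asyncio.sleep(0.3)            # let the server process the commit
    await client.finish_session()       # step 3: session.finish
    await asyncio.sleep(5)              # steps 4-7: wait for response.* and session.finished
    await client.close()
    consumer.cancel()

asyncio.run(synthesize("Hello from commit mode."))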

| Lifecycle | Client events | Server events |
| --- | --- | --- |
| Session initialization | session.update (session configuration) | session.created (session created); session.updated (session configuration updated) |
| User text input | input_text_buffer.append (appends text to the buffer); input_text_buffer.commit (commits the buffer to the server); input_text_buffer.clear (clears the buffer); session.finish (notifies the server that there is no more text input) | input_text_buffer.committed (server received the committed text) |
| Server audio output | None | response.created (server starts generating a response); response.output_item.added (new output content is available in the response); response.content_part.added (new output content is added to the assistant message); response.audio.delta (incrementally generated audio from the model); response.content_part.done (streaming of text or audio content for the assistant message is complete); response.output_item.done (streaming of the entire output item for the assistant message is complete); response.audio.done (audio generation is complete); response.done (response is complete) |

API reference

Real-time speech synthesis - Qwen API reference

Supported voices

Different models support different voices. When you use a model, set the voice request parameter to the corresponding value in the voice parameter column of the following tables:
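
For example, with the TTSRealtimeClient from the Getting started section, you pass the value from the voice parameter column at construction time or in a session.update event. A minimal sketch (Dylan is one of the Qwen3-TTS Realtime voices listed below):

import os
from tts_realtime_client import TTSRealtimeClient

client = TTSRealtimeClient(
    base_url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    voice="Dylan",  # the Beijing-Dylan voice
)
# Or, on an open session:
# await client.update_session({"voice": "Dylan"})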

Qwen3-TTS Realtime

| Name | voice parameter | Description | Supported languages |
| --- | --- | --- | --- |
| Cherry | Cherry | A cheerful, friendly, and natural young woman's voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Ethan | Ethan | Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Nofish | Nofish | A designer who does not use retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Jennifer | Jennifer | A premium, cinematic American English female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Ryan | Ryan | A rhythmic, dramatic voice with realism and tension. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Katerina | Katerina | A mature and rhythmic female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Elias | Elias | Explains complex topics with academic rigor and clear storytelling. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Shanghai-Jada | Jada | A lively woman from Shanghai. | Chinese (Shanghainese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Beijing-Dylan | Dylan | A teenager who grew up in the hutongs of Beijing. | Chinese (Beijing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Sichuan-Sunny | Sunny | A sweet female voice from Sichuan. | Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Nanjing-Li | Li | A patient yoga teacher. | Chinese (Nanjing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Shaanxi-Marcus | Marcus | A sincere and deep voice from Shaanxi. | Chinese (Shaanxi dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Minnan-Roy | Roy | A humorous and lively young male voice with a Minnan accent. | Chinese (Minnan), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Tianjin-Peter | Peter | A voice for the straight man in Tianjin crosstalk. | Chinese (Tianjin dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Cantonese-Rocky | Rocky | A witty and humorous male voice for online chats. | Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
| Sichuan-Eric | Eric | An unconventional and refined male voice from Chengdu, Sichuan. | Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |

Qwen-TTS Realtime

| Name | voice parameter | Description | Supported languages |
| --- | --- | --- | --- |
| Cherry | Cherry | A sunny, friendly, and genuine young woman. | Chinese, English |
| Serena | Serena | A kind young woman. | Chinese, English |
| Ethan | Ethan | Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice. | Chinese, English |
| Chelsie | Chelsie | An anime-style virtual girlfriend voice. | Chinese, English |