The Qwen real-time speech synthesis model offers low-latency speech synthesis with streaming text input and audio output. It provides various human-like voices, supports multiple languages and dialects, and lets you use the same voice for different languages. The model also automatically adjusts its tone and smoothly handles complex text.
Compared to Speech synthesis - Qwen, Qwen real-time speech synthesis supports the following features:
Streaming text input
Seamlessly integrates with the streaming output of Large Language Models (LLMs). Audio is synthesized as text is generated, which improves the real-time performance of interactive voice applications.
Bidirectional communication
It uses the WebSocket protocol for streaming text input and audio output. This method avoids the overhead of establishing multiple connections and significantly reduces latency.
Supported models
The supported models are Qwen3-TTS Realtime and Qwen-TTS Realtime.
Qwen3-TTS Realtime provides 17 voices, supports synthesis for multiple languages and dialects, and lets you customize the format, sample rate, speech rate, volume, pitch, and bitrate of the output audio.
Qwen-TTS Realtime provides only 7 voices, supports only Chinese and English, and does not allow you to customize the format, sample rate, speech rate, volume, pitch, or bitrate of the output audio.
International (Singapore)
Model | Version | Unit price | Supported languages | Free quota (Note) |
qwen3-tts-flash-realtime (current capabilities are equivalent to qwen3-tts-flash-realtime-2025-09-18) | Stable | $0.13 per 10,000 characters | Chinese (Mandarin, Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Minnan, Tianjin, Cantonese), English, Spanish, Russian, Italian, French, Korean, Japanese, German, Portuguese | 2,000 characters for each model. Validity: 90 days after Alibaba Cloud Model Studio activation |
qwen3-tts-flash-realtime-2025-09-18 | Snapshot | Same as the stable version | Same as the stable version | Same as the stable version |
Qwen3-TTS is billed based on the number of input characters. The billing rules are as follows:
1 Chinese character = 2 characters
1 English letter, 1 punctuation mark, or 1 space = 1 character
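The rules above can be sketched in a few lines of Python. Treating a "Chinese character" as a code point in the CJK Unified Ideographs block is an assumption for this illustration; the billing system may classify characters differently.

```python
def billable_characters(text: str) -> int:
    """Estimate billable characters: Chinese characters count as 2,
    everything else (letters, punctuation, spaces) counts as 1."""
    total = 0
    for ch in text:
        # Assumption: "Chinese character" = CJK Unified Ideographs block.
        if "\u4e00" <= ch <= "\u9fff":
            total += 2
        else:
            total += 1
    return total

print(billable_characters("Hello!"))  # 6 characters, 1 each -> 6
print(billable_characters("你好"))     # 2 Chinese characters, 2 each -> 4
```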
Mainland China (Beijing)
Qwen3-TTS Realtime
Model | Version | Unit price | Supported languages |
qwen3-tts-flash-realtime (current capabilities are equivalent to qwen3-tts-flash-realtime-2025-09-18) | Stable | $0.143353 per 10,000 characters | Chinese (Mandarin, Beijing, Shanghai, Sichuan, Nanjing, Shaanxi, Minnan, Tianjin, Cantonese), English, Spanish, Russian, Italian, French, Korean, Japanese, German, Portuguese |
qwen3-tts-flash-realtime-2025-09-18 | Snapshot | Same as the stable version | Same as the stable version |
Qwen3-TTS is billed based on the number of input characters. The billing rules are as follows:
1 Chinese character = 2 characters
1 English letter, 1 punctuation mark, or 1 space = 1 character
Qwen-TTS Realtime
Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per 1,000 tokens) | Output cost (per 1,000 tokens) | Supported languages |
qwen-tts-realtime (current capabilities are equivalent to qwen-tts-realtime-2025-07-15) | Stable | 8,192 | 512 | 7,680 | $0.345 | $1.721 | Chinese, English |
qwen-tts-realtime-latest (current capabilities are equivalent to qwen-tts-realtime-2025-07-15) | Latest | 8,192 | 512 | 7,680 | $0.345 | $1.721 | Chinese, English |
qwen-tts-realtime-2025-07-15 | Snapshot | 8,192 | 512 | 7,680 | $0.345 | $1.721 | Chinese, English |
Audio-to-token conversion rule: 1 second of audio corresponds to 50 tokens. Audio shorter than 1 second is counted as 50 tokens.
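The conversion rule can be sketched as follows. The rule only defines the sub-second case explicitly; rounding partial seconds above 1 second up to the next whole second is an assumption for this illustration.

```python
import math

def audio_tokens(duration_seconds: float) -> int:
    """Estimate tokens for a clip: 1 second of audio = 50 tokens,
    and anything shorter than 1 second still counts as 50 tokens."""
    if duration_seconds <= 1:
        return 50
    # Assumption: partial seconds are rounded up to the next whole second.
    return math.ceil(duration_seconds) * 50

print(audio_tokens(0.4))  # shorter than 1 second -> 50
print(audio_tokens(3.0))  # 3 seconds -> 150
```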
Access methods
The Qwen real-time speech synthesis API is based on the WebSocket protocol. If you use Java or Python, you can use the DashScope SDK to avoid handling WebSocket details. You can also use a WebSocket library in any language to connect:
Endpoint URL
Mainland China (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime
International (Singapore): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime
Query parameters
The query parameter is `model`. You must specify the name of the model that you want to access. For more information, see Supported models.
Header
Use a Bearer Token for authentication: `Authorization: Bearer DASHSCOPE_API_KEY`
`DASHSCOPE_API_KEY` is the API key that you obtained from Alibaba Cloud Model Studio.
You can use the following code to establish a WebSocket connection with the Qwen-TTS Realtime API.
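A minimal connection sketch using the websocket-client library (installed in the next section) is shown below. It combines the endpoint URL, `model` query parameter, and Bearer-token header described above; the helper function names and the `connect` wrapper are illustrative, not part of any official SDK.

```python
import os

# Singapore endpoint; use wss://dashscope.aliyuncs.com/api-ws/v1/realtime
# for the Mainland China (Beijing) region.
BASE_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

API_KEY = os.environ.get("DASHSCOPE_API_KEY", "")

def build_url(model: str) -> str:
    # The model name goes in the `model` query parameter.
    return f"{BASE_URL}?model={model}"

def build_headers(api_key: str) -> list[str]:
    # Bearer-token authentication.
    return [f"Authorization: Bearer {api_key}"]

def connect(model: str, api_key: str) -> None:
    """Open the WebSocket and print incoming server events (blocks)."""
    import websocket  # pip install websocket-client==1.8.0
    ws = websocket.WebSocketApp(
        build_url(model),
        header=build_headers(api_key),
        on_open=lambda ws: print("connected"),
        on_message=lambda ws, msg: print("server event:", msg),
    )
    ws.run_forever()

# Example usage (requires a valid API key):
# connect("qwen3-tts-flash-realtime", API_KEY)
```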
Getting started
Before you run the code, you must obtain and configure an API key.
Your Python version must be 3.10 or later.
Follow these steps to quickly test the real-time audio synthesis feature of the Realtime API.
Prepare the runtime environment
You can install pyaudio based on your operating system.
macOS
brew install portaudio && pip install pyaudio

Debian/Ubuntu
sudo apt-get install python3-pyaudio or pip install pyaudio

CentOS
sudo yum install -y portaudio portaudio-devel && pip install pyaudio

Windows
pip install pyaudio

After the installation is complete, use pip to install the WebSocket-related dependencies:
pip install websocket-client==1.8.0 websockets

Create a client
Create a local Python file named tts_realtime_client.py and copy the following code into the file:

Select a speech synthesis mode
The Realtime API supports the following two modes:
server_commit mode
In this mode, the client only sends text. The server intelligently determines how to segment the text and when to synthesize it. This mode is suitable for low-latency scenarios where you do not need to manually control the synthesis rhythm, such as GPS navigation.
commit mode
In this mode, the client first adds text to a buffer and then actively triggers the server to synthesize the specified text. This mode is suitable for scenarios that require fine-grained control over sentence breaks and pauses, such as news broadcasting.
server_commit mode
In the same directory as tts_realtime_client.py, create another Python file named server_commit.py and copy the following code into the file. Run server_commit.py to hear the audio generated in real time by the Realtime API.

commit mode
In the same directory as tts_realtime_client.py, create another Python file named commit.py and copy the following code into the file. Run commit.py and enter the text to synthesize; you can enter text multiple times. If you press the Enter key without entering any text, you will hear the audio returned by the Realtime API from your speakers.
Interaction flow
server_commit mode
You can set the session.mode of the session.update event to "server_commit" to enable this mode. The server will then intelligently handle text segmentation and synthesis timing.
The interaction flow is as follows:
1. The client sends the session.update event, and the server responds with the session.created and session.updated events.
2. The client sends the input_text_buffer.append event to append text to the server-side buffer.
3. The server intelligently handles text segmentation and synthesis timing, and returns the response.created, response.output_item.added, response.content_part.added, and response.audio.delta events.
4. After the server completes the response, it returns the response.audio.done, response.content_part.done, response.output_item.done, and response.done events.
5. The server sends the session.finished event to end the session.
Lifecycle | Client events | Server events |
Session initialization | session.update Session configuration | session.created Session created session.updated Session configuration updated |
User text input | input_text_buffer.append Appends text to the server input_text_buffer.commit Immediately synthesizes the text cached on the server session.finish Notifies the server that there is no more text input | input_text_buffer.committed Server received the submitted text |
Server audio output | None | response.created Server starts generating a response response.output_item.added New output content is available in the response response.content_part.added New output content is added to the assistant message response.audio.delta Incrementally generated audio from the model response.content_part.done Streaming of text or audio content for the assistant message is complete response.output_item.done Streaming of the entire output item for the assistant message is complete response.audio.done Audio generation is complete response.done Response is complete |
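The client side of this flow can be sketched as JSON event payloads. The event type names and the session.mode field come from this guide; the exact JSON layout (nesting "mode" under a "session" object, a "text" field on the append event) is an assumption for illustration.

```python
import json

def session_update(mode: str = "server_commit") -> str:
    # Configures the session; mode "server_commit" lets the server decide
    # text segmentation and synthesis timing.
    return json.dumps({"type": "session.update", "session": {"mode": mode}})

def append_text(text: str) -> str:
    # Appends text to the server-side buffer.
    return json.dumps({"type": "input_text_buffer.append", "text": text})

def session_finish() -> str:
    # Tells the server there is no more text input.
    return json.dumps({"type": "session.finish"})

# In server_commit mode the client only appends text; the server then
# streams response.audio.delta events back until response.done and
# session.finished arrive.
for event in (session_update(), append_text("Hello, world."), session_finish()):
    print(event)
```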
commit mode
You can set the session.mode of the session.update event to "commit" to enable this mode. In this mode, the client must actively submit the text buffer to the server to obtain a response.
The interaction flow is as follows:
1. The client sends the session.update event, and the server responds with the session.created and session.updated events.
2. The client sends the input_text_buffer.append event to append text to the server-side buffer.
3. The client sends the input_text_buffer.commit event to commit the buffer to the server, and sends the session.finish event to indicate that there is no further text input.
4. The server sends the response.created event and begins to generate the response.
5. The server sends the response.output_item.added, response.content_part.added, and response.audio.delta events.
6. When the server finishes responding, it returns the response.audio.done, response.content_part.done, response.output_item.done, and response.done events.
7. The server sends the session.finished event to end the session.
Lifecycle | Client events | Server events |
Session initialization | session.update Session configuration | session.created Session created session.updated Session configuration updated |
User text input | input_text_buffer.append Appends text to the buffer input_text_buffer.commit Commits the buffer to the server input_text_buffer.clear Clears the buffer | input_text_buffer.committed Server received the committed text |
Server audio output | None | response.created Server starts generating a response response.output_item.added New output content is available in the response response.content_part.added New output content is added to the assistant message response.audio.delta Incrementally generated audio from the model response.content_part.done Streaming of text or audio content for the assistant message is complete response.output_item.done Streaming of the entire output item for the assistant message is complete response.audio.done Audio generation is complete response.done Response is complete |
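The commit-specific client events from the lifecycle table can be sketched the same way. Only the event type names come from this guide; any other field (such as "text" on the append event) is an assumption for illustration.

```python
import json

def append_text(text: str) -> str:
    # Appends text to the buffer without triggering synthesis.
    return json.dumps({"type": "input_text_buffer.append", "text": text})

def commit_buffer() -> str:
    # Explicitly asks the server to synthesize the buffered text, which
    # gives the client fine-grained control over sentence breaks and pauses.
    return json.dumps({"type": "input_text_buffer.commit"})

def clear_buffer() -> str:
    # Discards the buffered text instead of synthesizing it.
    return json.dumps({"type": "input_text_buffer.clear"})

# Typical sequence: append one sentence, then commit it so the server
# synthesizes exactly that text before the next sentence is appended.
print(append_text("Good evening, and welcome to the news."))
print(commit_buffer())
```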
API reference
Supported voices
Different models support different voices. When you use a model, you can set the voice request parameter to the corresponding value in the voice parameter column of the following table:
Qwen3-TTS Realtime
Name | voice parameter | Description | Supported languages |
Cherry | Cherry | A cheerful, friendly, and natural young woman's voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Ethan | Ethan | Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Nofish | Nofish | A designer who does not use retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Jennifer | Jennifer | A premium, cinematic American English female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Ryan | Ryan | A rhythmic, dramatic voice with realism and tension. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Katerina | Katerina | A mature and rhythmic female voice. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Elias | Elias | Explains complex topics with academic rigor and clear storytelling. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Shanghai-Jada | Jada | A lively woman from Shanghai. | Chinese (Shanghainese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Beijing-Dylan | Dylan | A teenager who grew up in the hutongs of Beijing. | Chinese (Beijing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Sichuan-Sunny | Sunny | A sweet female voice from Sichuan. | Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Nanjing-Li | Li | A patient yoga teacher. | Chinese (Nanjing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Shaanxi-Marcus | Marcus | A sincere and deep voice from Shaanxi. | Chinese (Shaanxi dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Minnan-Roy | Roy | A humorous and lively young male voice with a Minnan accent. | Chinese (Minnan), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Tianjin-Peter | Peter | A voice for the straight man in Tianjin crosstalk. | Chinese (Tianjin dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Cantonese-Rocky | Rocky | A witty and humorous male voice for online chats. | Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
Sichuan-Eric | Eric | An unconventional and refined male voice from Chengdu, Sichuan. | Chinese (Sichuanese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai |
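As an illustration, a session configuration that selects one of these voices might look like the following sketch. Placing "voice" inside the session.update payload is an assumption based on the voice request parameter described above; only the values in the voice parameter column (for example, "Cherry" or "Ethan") come from this guide.

```python
import json

def session_update_with_voice(voice: str = "Cherry") -> str:
    # Hypothetical layout: the voice request parameter is assumed to sit
    # alongside other session settings in the session.update event.
    return json.dumps({
        "type": "session.update",
        "session": {"mode": "server_commit", "voice": voice},
    })

print(session_update_with_voice("Ethan"))
```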
Qwen-TTS Realtime
Name | voice parameter | Description | Supported languages |
Cherry | Cherry | A sunny, friendly, and genuine young woman. | Chinese, English |
Serena | Serena | A kind young woman. | Chinese, English |
Ethan | Ethan | Standard Mandarin with a slight northern accent. A bright, warm, and energetic voice. | Chinese, English |
Chelsie | Chelsie | An anime-style virtual girlfriend voice. | Chinese, English |