Real-time speech synthesis converts text into natural speech over a WebSocket connection. Model Studio offers the CosyVoice and Qwen-TTS model families, which support streaming input and output, voice cloning, voice design, and fine-grained audio control for use cases such as voice assistants, audiobooks, and intelligent customer service.
Overview
Low-latency real-time speech synthesis over WebSocket, built for voice assistants, intelligent customer service, live captioning, and other scenarios that require instant responses.
Streaming input and output (full-duplex WebSocket) with low time to first audio, ideal for real-time conversations such as voice assistants and intelligent customer service
Adjustable speech rate, pitch, volume, and bitrate for fine-grained voice control
Compatible with mainstream audio formats (PCM, WAV, MP3, Opus) and supports up to 48 kHz sample rate output
Supports Instruction-based control, which lets natural-language instructions shape voice expressiveness
Supports voice customization through Voice cloning and Voice Design
If you don't need real-time output, use Speech synthesis - Qwen (HTTP API), which is better suited for batch scenarios such as audiobooks and courseware dubbing. For model selection guidance, see Speech synthesis.
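To put the supported formats in perspective: raw 16-bit mono PCM has a fixed byte rate of sample rate × 2 bytes per second, which is useful when sizing buffers or estimating playback time for streamed audio. A minimal helper (illustrative only; these function names are not part of any SDK):

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int = 16, channels: int = 1) -> int:
    """Byte rate of raw PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

def pcm_duration_ms(num_bytes: int, sample_rate_hz: int) -> float:
    """Playback duration in milliseconds of a 16-bit mono PCM buffer."""
    return num_bytes * 1000 / pcm_bytes_per_second(sample_rate_hz)

# 1 second of 24 kHz 16-bit mono PCM is 48,000 bytes
print(pcm_duration_ms(48000, 24000))  # -> 1000.0
```

The same arithmetic appears later in the Java player examples, which convert chunk sizes to playback durations before sleeping.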
Prerequisites
If you call the service through the DashScope SDK, install the latest SDK.
Quick start
The following examples demonstrate speech synthesis for each model family. For more language examples and detailed parameter descriptions, see the API reference of each model.
CosyVoice
cosyvoice-v3.5-plus and cosyvoice-v3.5-flash are currently available only in the Beijing region and support only voice design and voice cloning scenarios (no system voices). Before using these models, create a target voice by following Voice cloning or Voice Design. After you create the voice, set the voice field in your code to your voice ID and set the model field to the corresponding model.
The following example shows how to synthesize speech with a system voice (see Voice list).
Python
# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import *
# The API Key differs between the Singapore and Beijing regions. Obtain an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API Key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# The following is the Singapore region URL. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# Model
# Different model versions require corresponding voice types:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the target language. For details, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"
# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send text for synthesis and obtain binary audio
audio = synthesizer.call("How is the weather today?")
# The first text submission requires establishing a WebSocket connection, so the first-packet latency includes connection setup time
print('[Metric] requestId: {}, first package delay: {} ms'.format(
synthesizer.get_last_request_id(),
synthesizer.get_first_package_delay()))
# Save audio to a local file
with open('output.mp3', 'wb') as f:
    f.write(audio)
Java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
public class Main {
// Model
// Different model versions require corresponding voice types:
// cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
// cosyvoice-v2: Use voices such as longxiaochun_v2.
// Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the target language. For details, see the CosyVoice voice list.
private static String model = "cosyvoice-v3-flash";
// Voice
private static String voice = "longanyang";
public static void streamAudioDataToSpeaker() {
// Request parameters
SpeechSynthesisParam param =
SpeechSynthesisParam.builder()
// The API Key differs between the Singapore and Beijing regions. Obtain an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model(model) // Model
.voice(voice) // Voice
.build();
// Synchronous mode: disable callback (second parameter is null)
SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
ByteBuffer audio = null;
try {
// Block until audio is returned
audio = synthesizer.call("How is the weather today?");
} catch (Exception e) {
throw new RuntimeException(e);
} finally {
// Close the WebSocket connection after the task completes
synthesizer.getDuplexApi().close(1000, "bye");
}
if (audio != null) {
// Save the audio data to a local file "output.mp3"
File file = new File("output.mp3");
// The first text submission requires establishing a WebSocket connection, so the first-packet latency includes connection setup time
// Note: getFirstPackageDelay() requires dashscope-sdk-java 2.18.0 or later
System.out.println(
"[Metric] requestId: "
+ synthesizer.getLastRequestId()
+ ", first package delay (ms): "
+ synthesizer.getFirstPackageDelay());
try (FileOutputStream fos = new FileOutputStream(file)) {
fos.write(audio.array());
} catch (IOException e) {
throw new RuntimeException(e);
}
}
}
public static void main(String[] args) {
// The following is the Singapore region URL. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
streamAudioDataToSpeaker();
System.exit(0);
}
}
Qwen-TTS
Synthesize speech with a system voice
The following example shows how to synthesize speech with a system voice (see Supported voices).
To use Instruction-based control, set model to qwen3-tts-instruct-flash-realtime and pass the instruction text through the instructions parameter.
Python
Server commit mode
import os
import base64
import threading
import time
import dashscope
from dashscope.audio.qwen_tts_realtime import *
qwen_tts_realtime: QwenTtsRealtime = None
text_to_synthesize = [
'Right? I love supermarkets like this.',
'Especially during Chinese New Year,',
'I go shopping at supermarkets.',
'And I feel',
'absolutely thrilled!',
'I want to buy so many things!'
]
def init_dashscope_api_key():
"""
Set your DashScope API key. More information:
https://github.com/aliyun/alibabacloud-bailian-speech-demo/blob/master/PREREQUISITES.md
"""
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
if 'DASHSCOPE_API_KEY' in os.environ:
dashscope.api_key = os.environ[
'DASHSCOPE_API_KEY'] # Load API key from environment variable DASHSCOPE_API_KEY
else:
dashscope.api_key = 'your-dashscope-api-key' # Set API key manually
class MyCallback(QwenTtsRealtimeCallback):
def __init__(self):
self.complete_event = threading.Event()
self.file = open('result_24k.pcm', 'wb')
def on_open(self) -> None:
print('connection opened, init player')
def on_close(self, close_status_code, close_msg) -> None:
self.file.close()
print('connection closed with code: {}, msg: {}, destroy player'.format(close_status_code, close_msg))
def on_event(self, response: dict) -> None:
try:
global qwen_tts_realtime
type = response['type']
if 'session.created' == type:
print('start session: {}'.format(response['session']['id']))
if 'response.audio.delta' == type:
recv_audio_b64 = response['delta']
self.file.write(base64.b64decode(recv_audio_b64))
if 'response.done' == type:
print(f'response {qwen_tts_realtime.get_last_response_id()} done')
if 'session.finished' == type:
print('session finished')
self.complete_event.set()
except Exception as e:
print('[Error] {}'.format(e))
return
def wait_for_finished(self):
self.complete_event.wait()
if __name__ == '__main__':
init_dashscope_api_key()
print('Initializing ...')
callback = MyCallback()
qwen_tts_realtime = QwenTtsRealtime(
# To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime
model='qwen3-tts-flash-realtime',
callback=callback,
# This URL is for the Singapore region. If you use the Beijing region, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime'
)
qwen_tts_realtime.connect()
qwen_tts_realtime.update_session(
voice = 'Cherry',
response_format = AudioFormat.PCM_24000HZ_MONO_16BIT,
# To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime
# instructions='Speak quickly with a rising intonation, suitable for introducing fashion products.',
# optimize_instructions=True,
mode = 'server_commit'
)
for text_chunk in text_to_synthesize:
print(f'send text: {text_chunk}')
qwen_tts_realtime.append_text(text_chunk)
time.sleep(0.1)
qwen_tts_realtime.finish()
callback.wait_for_finished()
print('[Metric] session: {}, first audio delay: {}'.format(
qwen_tts_realtime.get_session_id(),
qwen_tts_realtime.get_first_audio_delay(),
))
Commit mode
import base64
import os
import threading
import dashscope
from dashscope.audio.qwen_tts_realtime import *
qwen_tts_realtime: QwenTtsRealtime = None
text_to_synthesize = [
'This is the first sentence.',
'This is the second sentence.',
'This is the third sentence.',
]
def init_dashscope_api_key():
"""
Set your DashScope API key. More information:
https://github.com/aliyun/alibabacloud-bailian-speech-demo/blob/master/PREREQUISITES.md
"""
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
if 'DASHSCOPE_API_KEY' in os.environ:
dashscope.api_key = os.environ[
'DASHSCOPE_API_KEY'] # Load API key from environment variable DASHSCOPE_API_KEY
else:
dashscope.api_key = 'your-dashscope-api-key' # Set API key manually
class MyCallback(QwenTtsRealtimeCallback):
def __init__(self):
super().__init__()
self.response_counter = 0
self.complete_event = threading.Event()
self.file = open(f'result_{self.response_counter}_24k.pcm', 'wb')
def reset_event(self):
self.response_counter += 1
self.file = open(f'result_{self.response_counter}_24k.pcm', 'wb')
self.complete_event = threading.Event()
def on_open(self) -> None:
print('connection opened, init player')
def on_close(self, close_status_code, close_msg) -> None:
print('connection closed with code: {}, msg: {}, destroy player'.format(close_status_code, close_msg))
def on_event(self, response: dict) -> None:
try:
global qwen_tts_realtime
type = response['type']
if 'session.created' == type:
print('start session: {}'.format(response['session']['id']))
if 'response.audio.delta' == type:
recv_audio_b64 = response['delta']
self.file.write(base64.b64decode(recv_audio_b64))
if 'response.done' == type:
print(f'response {qwen_tts_realtime.get_last_response_id()} done')
self.complete_event.set()
self.file.close()
if 'session.finished' == type:
print('session finished')
self.complete_event.set()
except Exception as e:
print('[Error] {}'.format(e))
return
def wait_for_response_done(self):
self.complete_event.wait()
if __name__ == '__main__':
init_dashscope_api_key()
print('Initializing ...')
callback = MyCallback()
qwen_tts_realtime = QwenTtsRealtime(
# To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime
model='qwen3-tts-flash-realtime',
callback=callback,
# This URL is for the Singapore region. If you use the Beijing region, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/realtime
url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime'
)
qwen_tts_realtime.connect()
qwen_tts_realtime.update_session(
voice = 'Cherry',
response_format = AudioFormat.PCM_24000HZ_MONO_16BIT,
# To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime
# instructions='Speak quickly with a rising intonation, suitable for introducing fashion products.',
# optimize_instructions=True,
mode = 'commit'
)
print(f'send text: {text_to_synthesize[0]}')
qwen_tts_realtime.append_text(text_to_synthesize[0])
qwen_tts_realtime.commit()
callback.wait_for_response_done()
callback.reset_event()
print(f'send text: {text_to_synthesize[1]}')
qwen_tts_realtime.append_text(text_to_synthesize[1])
qwen_tts_realtime.commit()
callback.wait_for_response_done()
callback.reset_event()
print(f'send text: {text_to_synthesize[2]}')
qwen_tts_realtime.append_text(text_to_synthesize[2])
qwen_tts_realtime.commit()
callback.wait_for_response_done()
qwen_tts_realtime.finish()
print('[Metric] session: {}, first audio delay: {}'.format(
qwen_tts_realtime.get_session_id(),
qwen_tts_realtime.get_first_audio_delay(),
))
Java
Server commit mode
appendText()
import com.alibaba.dashscope.audio.qwen_tts_realtime.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.AudioSystem;
import java.io.*;
import java.util.Base64;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
public class Main {
static String[] textToSynthesize = {
"Right? I really love this kind of supermarket.",
"Especially during the Chinese New Year.",
"Going to the supermarket.",
"It just makes me feel.",
"Super, super happy!",
"I want to buy so many things!"
};
public static QwenTtsRealtimeAudioFormat ttsFormat = QwenTtsRealtimeAudioFormat.PCM_24000HZ_MONO_16BIT;
// Real-time PCM audio player
public static class RealtimePcmPlayer {
private int sampleRate;
private SourceDataLine line;
private AudioFormat audioFormat;
private Thread decoderThread;
private Thread playerThread;
private AtomicBoolean stopped = new AtomicBoolean(false);
private Queue<String> b64AudioBuffer = new ConcurrentLinkedQueue<>();
private Queue<byte[]> RawAudioBuffer = new ConcurrentLinkedQueue<>();
private ByteArrayOutputStream totalAudioStream = new ByteArrayOutputStream();
// Initialize the audio format and audio line.
public RealtimePcmPlayer(int sampleRate) throws LineUnavailableException {
this.sampleRate = sampleRate;
this.audioFormat = new AudioFormat(this.sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(SourceDataLine.class, audioFormat);
line = (SourceDataLine) AudioSystem.getLine(info);
line.open(audioFormat);
line.start();
decoderThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
String b64Audio = b64AudioBuffer.poll();
if (b64Audio != null) {
byte[] rawAudio = Base64.getDecoder().decode(b64Audio);
RawAudioBuffer.add(rawAudio);
// Write audio data to totalAudioStream.
try {
totalAudioStream.write(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
playerThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
byte[] rawAudio = RawAudioBuffer.poll();
if (rawAudio != null) {
try {
playChunk(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
decoderThread.start();
playerThread.start();
}
// Play an audio chunk and block until playback completes.
private void playChunk(byte[] chunk) throws IOException, InterruptedException {
if (chunk == null || chunk.length == 0) return;
int bytesWritten = 0;
while (bytesWritten < chunk.length) {
bytesWritten += line.write(chunk, bytesWritten, chunk.length - bytesWritten);
}
int audioLengthMs = chunk.length / (this.sampleRate * 2 / 1000);
// Wait for the buffered audio to finish playing; never sleep a negative duration for tiny chunks.
Thread.sleep(Math.max(0, audioLengthMs - 10));
}
public void write(String b64Audio) {
b64AudioBuffer.add(b64Audio);
}
public void cancel() {
b64AudioBuffer.clear();
RawAudioBuffer.clear();
}
public void waitForComplete() throws InterruptedException {
while (!b64AudioBuffer.isEmpty() || !RawAudioBuffer.isEmpty()) {
Thread.sleep(100);
}
line.drain();
}
public void shutdown() throws InterruptedException, IOException {
stopped.set(true);
decoderThread.join();
playerThread.join();
// Save the complete audio file.
File file = new File("TotalAudio_"+ttsFormat.getSampleRate()+"."+ttsFormat.getFormat());
try (FileOutputStream fos = new FileOutputStream(file)) {
fos.write(totalAudioStream.toByteArray());
}
if (line != null && line.isRunning()) {
line.drain();
line.close();
}
}
}
public static void main(String[] args) throws InterruptedException, LineUnavailableException, IOException {
QwenTtsRealtimeParam param = QwenTtsRealtimeParam.builder()
// To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime.
.model("qwen3-tts-flash-realtime")
// Singapore endpoint. For China (Beijing), use wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
.url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
// API keys differ between Singapore and China (Beijing). See https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
AtomicReference<CountDownLatch> completeLatch = new AtomicReference<>(new CountDownLatch(1));
final AtomicReference<QwenTtsRealtime> qwenTtsRef = new AtomicReference<>(null);
// Create a real-time audio player instance.
RealtimePcmPlayer audioPlayer = new RealtimePcmPlayer(24000);
QwenTtsRealtime qwenTtsRealtime = new QwenTtsRealtime(param, new QwenTtsRealtimeCallback() {
@Override
public void onOpen() {
// Handle connection establishment.
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
// Handle session creation.
if (message.has("session")) {
String eventId = message.get("event_id").getAsString();
String sessionId = message.get("session").getAsJsonObject().get("id").getAsString();
System.out.println("[onEvent] session.created, session_id: "
+ sessionId + ", event_id: " + eventId);
}
break;
case "response.audio.delta":
String recvAudioB64 = message.get("delta").getAsString();
// Play audio in real time.
audioPlayer.write(recvAudioB64);
break;
case "response.done":
// Handle response completion.
break;
case "session.finished":
// Handle session termination.
completeLatch.get().countDown();
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
// Handle connection closure.
}
});
qwenTtsRef.set(qwenTtsRealtime);
try {
qwenTtsRealtime.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
.voice("Cherry")
.responseFormat(ttsFormat)
.mode("server_commit")
// To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime.
// .instructions("")
// .optimizeInstructions(true)
.build();
qwenTtsRealtime.updateSession(config);
for (String text:textToSynthesize) {
qwenTtsRealtime.appendText(text);
Thread.sleep(100);
}
qwenTtsRealtime.finish();
completeLatch.get().await();
qwenTtsRealtime.close();
// Wait for audio playback to complete, then shut down the player.
audioPlayer.waitForComplete();
audioPlayer.shutdown();
System.exit(0);
}
}
Commit mode
commit()
import com.alibaba.dashscope.audio.qwen_tts_realtime.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.AudioSystem;
import java.io.*;
import java.util.Base64;
import java.util.Queue;
import java.util.Scanner;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
public class Main {
public static QwenTtsRealtimeAudioFormat ttsFormat = QwenTtsRealtimeAudioFormat.PCM_24000HZ_MONO_16BIT;
// Real-time PCM audio player
public static class RealtimePcmPlayer {
private int sampleRate;
private SourceDataLine line;
private AudioFormat audioFormat;
private Thread decoderThread;
private Thread playerThread;
private AtomicBoolean stopped = new AtomicBoolean(false);
private Queue<String> b64AudioBuffer = new ConcurrentLinkedQueue<>();
private Queue<byte[]> RawAudioBuffer = new ConcurrentLinkedQueue<>();
private ByteArrayOutputStream totalAudioStream = new ByteArrayOutputStream();
// Initialize the audio format and audio line.
public RealtimePcmPlayer(int sampleRate) throws LineUnavailableException {
this.sampleRate = sampleRate;
this.audioFormat = new AudioFormat(this.sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(SourceDataLine.class, audioFormat);
line = (SourceDataLine) AudioSystem.getLine(info);
line.open(audioFormat);
line.start();
decoderThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
String b64Audio = b64AudioBuffer.poll();
if (b64Audio != null) {
byte[] rawAudio = Base64.getDecoder().decode(b64Audio);
RawAudioBuffer.add(rawAudio);
// Write audio data to totalAudioStream.
try {
totalAudioStream.write(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
playerThread = new Thread(new Runnable() {
@Override
public void run() {
while (!stopped.get()) {
byte[] rawAudio = RawAudioBuffer.poll();
if (rawAudio != null) {
try {
playChunk(rawAudio);
} catch (IOException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
} else {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
});
decoderThread.start();
playerThread.start();
}
// Play an audio chunk and block until playback completes.
private void playChunk(byte[] chunk) throws IOException, InterruptedException {
if (chunk == null || chunk.length == 0) return;
int bytesWritten = 0;
while (bytesWritten < chunk.length) {
bytesWritten += line.write(chunk, bytesWritten, chunk.length - bytesWritten);
}
int audioLengthMs = chunk.length / (this.sampleRate * 2 / 1000);
// Wait for the buffered audio to finish playing; never sleep a negative duration for tiny chunks.
Thread.sleep(Math.max(0, audioLengthMs - 10));
}
public void write(String b64Audio) {
b64AudioBuffer.add(b64Audio);
}
public void cancel() {
b64AudioBuffer.clear();
RawAudioBuffer.clear();
}
public void waitForComplete() throws InterruptedException {
// Wait for all buffered audio data to finish playing.
while (!b64AudioBuffer.isEmpty() || !RawAudioBuffer.isEmpty()) {
Thread.sleep(100);
}
// Wait for the audio line to drain.
line.drain();
}
public void shutdown() throws InterruptedException {
stopped.set(true);
decoderThread.join();
playerThread.join();
// Save the complete audio file.
File file = new File("TotalAudio_"+ttsFormat.getSampleRate()+"."+ttsFormat.getFormat());
try (FileOutputStream fos = new FileOutputStream(file)) {
fos.write(totalAudioStream.toByteArray());
} catch (FileNotFoundException e) {
throw new RuntimeException(e);
} catch (IOException e) {
throw new RuntimeException(e);
}
if (line != null && line.isRunning()) {
line.drain();
line.close();
}
}
}
public static void main(String[] args) throws InterruptedException, LineUnavailableException, FileNotFoundException {
Scanner scanner = new Scanner(System.in);
QwenTtsRealtimeParam param = QwenTtsRealtimeParam.builder()
// To use instruction control, replace the model with qwen3-tts-instruct-flash-realtime.
.model("qwen3-tts-flash-realtime")
// Singapore endpoint. For China (Beijing), use wss://dashscope.aliyuncs.com/api-ws/v1/realtime.
.url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
// API keys differ between Singapore and China (Beijing). See https://www.alibabacloud.com/help/zh/model-studio/get-api-key.
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
AtomicReference<CountDownLatch> completeLatch = new AtomicReference<>(new CountDownLatch(1));
// Create a real-time player instance.
RealtimePcmPlayer audioPlayer = new RealtimePcmPlayer(24000);
final AtomicReference<QwenTtsRealtime> qwenTtsRef = new AtomicReference<>(null);
QwenTtsRealtime qwenTtsRealtime = new QwenTtsRealtime(param, new QwenTtsRealtimeCallback() {
@Override
public void onOpen() {
System.out.println("connection opened");
System.out.println("Enter text and press Enter to send. Enter 'quit' to exit the program.");
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
break;
case "response.audio.delta":
String recvAudioB64 = message.get("delta").getAsString();
// Play audio in real time (the player decodes the base64 chunk itself).
audioPlayer.write(recvAudioB64);
break;
case "response.done":
System.out.println("response done");
// Wait for audio playback to complete.
try {
audioPlayer.waitForComplete();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
// Prepare for the next input.
completeLatch.get().countDown();
break;
case "session.finished":
System.out.println("session finished");
if (qwenTtsRef.get() != null) {
System.out.println("[Metric] response: " + qwenTtsRef.get().getResponseId() +
", first audio delay: " + qwenTtsRef.get().getFirstAudioDelay() + " ms");
}
completeLatch.get().countDown();
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
System.out.println("connection closed code: " + code + ", reason: " + reason);
try {
// Wait for playback to complete, then shut down the player.
audioPlayer.waitForComplete();
audioPlayer.shutdown();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
});
qwenTtsRef.set(qwenTtsRealtime);
try {
qwenTtsRealtime.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
.voice("Cherry")
.responseFormat(ttsFormat)
.mode("commit")
// To use instruction control, uncomment the following lines and replace the model with qwen3-tts-instruct-flash-realtime.
// .instructions("")
// .optimizeInstructions(true)
.build();
qwenTtsRealtime.updateSession(config);
// Read user input in a loop.
while (true) {
System.out.print("Enter the text to synthesize: ");
String text = scanner.nextLine();
// Exit when the user enters 'quit'.
if ("quit".equalsIgnoreCase(text.trim())) {
System.out.println("Closing the connection...");
qwenTtsRealtime.finish();
completeLatch.get().await();
break;
}
// Skip empty input.
if (text.trim().isEmpty()) {
continue;
}
// Re-initialize the countdown latch.
completeLatch.set(new CountDownLatch(1));
// Send the text.
qwenTtsRealtime.appendText(text);
qwenTtsRealtime.commit();
// Wait for the current synthesis to complete.
completeLatch.get().await();
}
// Clean up resources.
audioPlayer.waitForComplete();
audioPlayer.shutdown();
scanner.close();
System.exit(0);
}
}
Advanced features
The following features provide fine-grained control over speech synthesis.
Qwen-TTS interaction modes
The Qwen-TTS Realtime API provides two WebSocket interaction modes, switched through the session.mode parameter:
server_commit mode: The server intelligently handles text segmentation and synthesis timing. This mode suits continuous synthesis of large text blocks. The client only needs to append text without managing segmentation or submission.
commit mode: The client manually submits the text buffer to trigger synthesis. This mode suits scenarios that require precise control over synthesis timing, such as turn-by-turn synthesis in conversational AI.
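In both modes, the server returns audio through `response.audio.delta` events carrying base64-encoded chunks (raw PCM in the examples in this topic), so a receiving handler only needs to decode each delta and append the bytes to its output buffer. A minimal sketch of that decoding step, following the event shape used in the examples above:

```python
import base64

def handle_audio_delta(event: dict, out_buffer: bytearray) -> None:
    """Decode a base64 audio delta event and append the raw bytes; ignore other events."""
    if event.get('type') == 'response.audio.delta':
        out_buffer.extend(base64.b64decode(event['delta']))

buf = bytearray()
handle_audio_delta({'type': 'response.audio.delta',
                    'delta': base64.b64encode(b'\x00\x01\x02\x03').decode()}, buf)
print(len(buf))  # -> 4
```

This is the same logic the `on_event` callbacks in the Python examples and the `RealtimePcmPlayer` decoder thread in the Java examples implement.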
Set the interaction mode in the SDK:
Python SDK: Set the mode parameter in the update_session method.
qwen_tts_realtime.update_session(
    voice='Cherry',
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
    mode='server_commit'
)
Java SDK: Set the mode parameter through QwenTtsRealtimeConfig.builder().
QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
    .voice("Cherry")
    .responseFormat(ttsFormat)
    .mode("server_commit")
    .build();
qwenTtsRealtime.updateSession(config);
For complete SDK code examples, see Python SDK and Java SDK. For details on the WebSocket event lifecycle and connection reuse, see Realtime speech synthesis - Qwen API reference.
Instruction-based control
Instruction-based control lets you precisely shape the vocal expression through natural language descriptions, without adjusting complex audio parameters. Describe the desired tone, speed, emotion, or timbre in plain text to produce the corresponding speech effect.
Supported models:
CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash
Different models have different instruction format requirements:
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Accept arbitrary instructions to control synthesis effects such as emotion and speed.
cosyvoice-v3-flash with Voice Design or Voice Clone timbres: Accept arbitrary instructions to control synthesis effects.
cosyvoice-v3-flash with system voices: Instructions must follow a fixed format. For details, see CosyVoice voice list.
Qwen-TTS: Only the Qwen3-TTS-Instruct-Flash-Realtime series models are supported.
How to use:
CosyVoice: Specify instructions through the instructions parameter, for example, "Fast pace with a rising intonation, suitable for introducing fashion products."
Qwen-TTS: Specify instructions through the instructions parameter, for example, "Fast pace with a rising intonation, suitable for introducing fashion products."
Supported languages for instruction text:
CosyVoice:
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese, English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.
cosyvoice-v3-flash: Chinese, English, French, German, Japanese, Korean, and Russian.
Qwen-TTS: Chinese and English only.
Instruction text length limits:
CosyVoice: Up to 100 characters. Chinese characters (including Simplified/Traditional Chinese, Japanese Kanji, and Korean Hanja) count as 2 characters each. Other characters (punctuation, letters, numbers, Japanese Kana, Korean Hangul, etc.) count as 1 character each.
Qwen-TTS: Up to 1,600 tokens.
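Because the CosyVoice limit uses weighted counting, a quick client-side pre-check can catch over-long instructions before sending. The sketch below treats CJK ideographs as 2 and everything else as 1, per the rule above; the exact Unicode ranges the service uses are an assumption:

```python
def cosyvoice_instruction_length(text: str) -> int:
    """Weighted length: CJK ideographs count as 2, all other characters as 1."""
    total = 0
    for ch in text:
        # CJK Unified Ideographs plus Extension A (covers Chinese hanzi,
        # Japanese kanji, Korean hanja); kana and Hangul fall through to 1.
        if '\u4e00' <= ch <= '\u9fff' or '\u3400' <= ch <= '\u4dbf':
            total += 2
        else:
            total += 1
    return total

print(cosyvoice_instruction_length('你好ab'))  # -> 6 (2 + 2 + 1 + 1)
```

Reject or truncate instructions whose weighted length exceeds 100 before calling the CosyVoice API.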
Use cases:
Audiobook and radio drama voiceover
Advertising and promotional voiceover
Game character and animation voiceover
Emotionally expressive voice assistants
Documentary narration and news broadcasting
Tips for writing high-quality voice descriptions:
Core principles:
Be specific, not vague: Use words that describe concrete vocal qualities, such as "deep," "crisp," or "slightly fast." Avoid subjective, low-information terms like "nice" or "normal."
Be multidimensional, not single-faceted: A good description combines multiple dimensions (pitch, speed, emotion, etc.). Describing only one dimension (e.g., "high pitch") is too broad to produce a distinctive effect.
Be objective, not subjective: Focus on the physical and perceptual qualities of the voice, not personal preferences. For example, use "slightly high pitch with energy" rather than "my favorite voice."
Be original, not imitative: Describe the vocal qualities you want, rather than requesting imitation of specific public figures (such as celebrities or actors). Imitation requests involve copyright risks and are not supported.
Be concise, not redundant: Make every word count. Avoid repeating synonyms or stacking meaningless intensifiers (e.g., "a very very great voice").
Description dimensions: Combining multiple dimensions creates richer expression effects.
Pitch: High, mid, low, slightly high, slightly low
Speed: Fast, moderate, slow, slightly fast, slightly slow
Emotion: Cheerful, calm, gentle, serious, lively, composed, soothing
Timbre: Magnetic, crisp, husky, mellow, sweet, rich, powerful
Use case: News broadcasting, advertising, audiobook, animation character, voice assistant, documentary narration
Examples:
Standard broadcasting style: Clear and precise articulation with standard pronunciation
Emotional escalation: Volume rising rapidly from normal conversation to a shout; straightforward personality with externalized, easily agitated emotions
Special emotional state: Slightly slurred pronunciation from a teary voice, slightly husky, with noticeable tension from a sobbing tone
Advertising voiceover style: Slightly high pitch, moderate speed, energetic and engaging, suitable for advertising
Gentle soothing style: Slightly slow speed, soft and sweet tone, warm and comforting like a caring friend
WebSocket direct connection examples
If you don't use the DashScope SDK, connect directly to the server through the raw WebSocket protocol for speech synthesis. The following examples provide minimal working implementations. You need to develop the actual business logic yourself. For the WebSocket protocol specification (server endpoint, request headers, interaction flow) of each model, see the corresponding API reference.
Connection reuse
WebSocket services support connection reuse to improve resource efficiency and avoid the overhead of repeatedly establishing connections. After a synthesis task completes, the WebSocket connection can be reused for the next task without reconnecting.
Reuse flow:
CosyVoice / Sambert: The client sends finish-task (CosyVoice); after the task completes, the server returns a task-finished event to end the task. The client can then send a new run-task event to start the next task.
Qwen-TTS: After the client sends session.finish, the server returns a session.finished event to end the session. The client can then establish a new session for the next synthesis task.
Wait for the server to return the completion event (task-finished or session.finished) before starting a new task.
CosyVoice and Sambert require a different task_id for each task on a reused connection.
If a failure occurs during task execution, the server returns an error event and closes the connection. The connection can't be reused after this.
If no new task starts within 60 seconds after a task ends, the connection times out and disconnects automatically.
For details on the events for each model, see the CosyVoice API reference and the Qwen-TTS API reference.
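The reuse flow above can be sketched as message builders for two consecutive tasks on one connection. Apart from the documented run-task / finish-task actions and the task_id requirement, the envelope fields are assumptions; the send/receive lines are left as comments because they depend on your WebSocket client.

```python
import json
import uuid

def run_task(task_id: str, model: str, text: str) -> str:
    """Illustrative run-task message; field layout is an assumption."""
    return json.dumps({
        "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"model": model, "input": {"text": text}},
    })

def finish_task(task_id: str) -> str:
    """Illustrative finish-task message ending the current task."""
    return json.dumps({"header": {"action": "finish-task", "task_id": task_id}})

# Each task on a reused connection needs a fresh task_id:
first, second = uuid.uuid4().hex, uuid.uuid4().hex
assert first != second

# ws.send(run_task(first, "cosyvoice-v3-flash", "Hello"))
# ws.send(finish_task(first))
# ... wait for the task-finished event, then reuse the SAME connection:
# ws.send(run_task(second, "cosyvoice-v3-flash", "World"))
```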
High-concurrency best practices
In high-concurrency scenarios, creating and destroying a WebSocket connection for each request produces significant overhead. The DashScope SDK includes built-in connection pooling and object pooling to reuse connections and objects, significantly reducing latency and resource consumption.
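If you manage raw WebSocket connections yourself rather than using the SDK, the pooling idea can be sketched as a toy checkout/return pool. This is illustrative only; the SDK's built-in pooling needs no code from you, and a production pool would also handle health checks and the 60-second idle timeout.

```python
import queue

class SimplePool:
    """Toy object pool showing the checkout/return discipline that
    connection pooling applies to WebSocket connections."""

    def __init__(self, factory, size: int):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())  # pre-dial `size` connections

    def acquire(self, timeout: float = 5.0):
        # Block for an idle connection instead of dialing a new one.
        return self._idle.get(timeout=timeout)

    def release(self, conn) -> None:
        # Hand the connection back for the next request to reuse.
        self._idle.put(conn)
```

Usage: acquire a connection, run one synthesis task on it, then release it so the next request skips the WebSocket handshake.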
Supported regions
Available models vary by deployment region:
International
If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.
To call the following models, select an API Key from the Singapore region:
CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
Qwen-TTS:
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime (stable, currently equivalent to qwen3-tts-instruct-flash-realtime-2026-01-22), qwen3-tts-instruct-flash-realtime-2026-01-22 (latest snapshot)
Qwen3-TTS-VD-Realtime: qwen3-tts-vd-realtime-2026-01-15 (latest snapshot), qwen3-tts-vd-realtime-2025-12-16 (snapshot)
Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime (stable, currently equivalent to qwen3-tts-flash-realtime-2025-11-27), qwen3-tts-flash-realtime-2025-11-27 (latest snapshot), qwen3-tts-flash-realtime-2025-09-18 (snapshot)
Chinese mainland
If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).
To call the following models, select an API Key from the Beijing region:
CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
Qwen-TTS:
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime (stable, currently equivalent to qwen3-tts-instruct-flash-realtime-2026-01-22), qwen3-tts-instruct-flash-realtime-2026-01-22 (latest snapshot)
Qwen3-TTS-VD-Realtime: qwen3-tts-vd-realtime-2026-01-15 (latest snapshot), qwen3-tts-vd-realtime-2025-12-16 (snapshot)
Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime (stable, currently equivalent to qwen3-tts-flash-realtime-2025-11-27), qwen3-tts-flash-realtime-2025-11-27 (latest snapshot), qwen3-tts-flash-realtime-2025-09-18 (snapshot)
Qwen-TTS-Realtime: qwen-tts-realtime (stable, currently equivalent to qwen-tts-realtime-2025-07-15), qwen-tts-realtime-latest (latest, currently equivalent to qwen-tts-realtime-2025-07-15), qwen-tts-realtime-2025-07-15 (snapshot)
Supported voices
Different models support different voices. Set the voice request parameter to the value in the voice parameter column of the voice list.
Qwen-TTS voice list:
Each entry below lists the voice parameter value, the voice name and description, the supported languages, and the supported models.
Cherry
Voice name: Cherry
Description: A sunny, positive, friendly, and natural young woman (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Qwen-TTS-Realtime: qwen-tts-realtime, qwen-tts-realtime-latest, qwen-tts-realtime-2025-07-15
Serena
Voice name: Serena
Description: A gentle young woman (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Qwen-TTS-Realtime: qwen-tts-realtime, qwen-tts-realtime-latest, qwen-tts-realtime-2025-07-15
Ethan
Voice name: Ethan
Description: Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Qwen-TTS-Realtime: qwen-tts-realtime, qwen-tts-realtime-latest, qwen-tts-realtime-2025-07-15
Chelsie
Voice name: Chelsie
Description: A two-dimensional virtual girlfriend (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Qwen-TTS-Realtime: qwen-tts-realtime, qwen-tts-realtime-latest, qwen-tts-realtime-2025-07-15
Momo
Voice name: Momo
Description: Playful and mischievous, cheering you up (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Vivian
Voice name: Vivian
Description: Confident, cute, and slightly feisty (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Moon
Voice name: Moon
Description: A bold and handsome man named Yuebai (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Maia
Voice name: Maia
Description: A blend of intellect and gentleness (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Kai
Voice name: Kai
Description: A soothing audio spa for your ears (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Nofish
Voice name: Nofish
Description: A designer who cannot pronounce retroflex sounds (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Bella
Voice name: Bella
Description: A little girl who drinks but never throws punches when drunk (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Jennifer
Voice name: Jennifer
Description: A premium, cinematic-quality American English female voice (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Ryan
Voice name: Ryan
Description: Full of rhythm, bursting with dramatic flair, balancing authenticity and tension (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Katerina
Voice name: Katerina
Description: A mature-woman voice with rich, memorable rhythm (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Aiden
Voice name: Aiden
Description: An American English young man skilled in cooking (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Eldric Sage
Voice name: Eldric Sage
Description: A calm and wise elder—weathered like a pine tree, yet clear-minded as a mirror (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Mia
Voice name: Mia
Description: Gentle as spring water, obedient as fresh snow (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Mochi
Voice name: Mochi
Description: A clever, quick-witted young adult—childlike innocence remains, yet wisdom shines through (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Bellona
Voice name: Bellona
Description: A powerful, clear voice that brings characters to life—so stirring it makes your blood boil. With heroic grandeur and perfect diction, this voice captures the full spectrum of human expression.
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Vincent
Voice name: Vincent
Description: A uniquely raspy, smoky voice—just one line evokes armies and heroic tales (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Bunny
Voice name: Bunny
Description: A little girl overflowing with "cuteness" (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Neil
Voice name: Neil
Description: A flat baseline intonation with precise, clear pronunciation—the most professional news anchor (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Elias
Voice name: Elias
Description: Maintains academic rigor while using storytelling techniques to turn complex knowledge into digestible learning modules (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Arthur
Voice name: Arthur
Description: A simple, earthy voice steeped in time and tobacco smoke—slowly unfolding village stories and curiosities (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Nini
Voice name: Nini
Description: A soft, clingy voice like sweet rice cakes—those drawn-out calls of “Big Brother” are so sweet they melt your bones (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Seren
Voice name: Seren
Description: A gentle, soothing voice to help you fall asleep faster. Good night, sweet dreams (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Pip
Voice name: Pip
Description: A playful, mischievous boy full of childlike wonder—is this your memory of Shin-chan? (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Stella
Voice name: Stella
Description: Normally a cloyingly sweet, dazed teenage-girl voice—but when shouting “I represent the moon to defeat you!”, she instantly radiates unwavering love and justice (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Instruct-Flash-Realtime: qwen3-tts-instruct-flash-realtime, qwen3-tts-instruct-flash-realtime-2026-01-22
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Bodega
Voice name: Bodega
Description: A passionate Spanish man (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Sonrisa
Voice name: Sonrisa
Description: A cheerful, outgoing Latin American woman (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Alek
Voice name: Alek
Description: Cold like the Russian spirit, yet warm like wool coat lining (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Dolce
Voice name: Dolce
Description: A laid-back Italian man (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Sohee
Voice name: Sohee
Description: A warm, cheerful, emotionally expressive Korean unnie (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Ono Anna
Voice name: Ono Anna
Description: A clever, spirited childhood friend (female)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Lenn
Voice name: Lenn
Description: Rational at heart, rebellious in detail—a German youth who wears suits and listens to post-punk
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Emilien
Voice name: Emilien
Description: A romantic French big brother (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Andre
Voice name: Andre
Description: A magnetic, natural, and steady male voice
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Radio Gol
Voice name: Radio Gol
Description: Football poet Radio Gol! Today I’ll commentate on football using my name (male)
Chinese (Mandarin), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27
Jada
Voice name: Shanghai - Jada
Description: A fast-paced, energetic Shanghai auntie (female)
Shanghainese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Dylan
Voice name: Beijing - Dylan
Description: A young man raised in Beijing’s hutongs (male)
Beijing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Li
Voice name: Nanjing - Li
Description: A patient yoga teacher (male)
Nanjing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Marcus
Voice name: Shaanxi - Marcus
Description: Broad face, few words, sincere heart, deep voice—the authentic Shaanxi flavor (male)
Shaanxi dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Roy
Voice name: Southern Min - Roy
Description: A humorous, straightforward, lively Taiwanese guy (male)
Southern Min, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Peter
Voice name: Tianjin - Peter
Description: Tianjin-style crosstalk, professional foil (male)
Tianjin dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Sunny
Voice name: Sichuan - Sunny
Description: A Sichuan girl sweet enough to melt your heart (female)
Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Eric
Voice name: Sichuan - Eric
Description: A Sichuanese man from Chengdu who stands out in everyday life (male)
Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Rocky
Voice name: Cantonese - Rocky
Description: A humorous, witty A Qiang providing live chat (male)
Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
Kiki
Voice name: Cantonese - Kiki
Description: A sweet Hong Kong girl best friend (female)
Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Qwen3-TTS-Flash-Realtime: qwen3-tts-flash-realtime, qwen3-tts-flash-realtime-2025-11-27, qwen3-tts-flash-realtime-2025-09-18
API reference
FAQ
Q: How do I fix incorrect pronunciation in speech synthesis? How do I control the pronunciation of polyphonic characters?
Replace the polyphonic character with a homophone to quickly fix the pronunciation issue.
Use SSML to control pronunciation.
Q: How do I troubleshoot silent audio when using a cloned voice?
Verify the voice status
Call the Voice cloning/design API and confirm that the voice status is OK.
Check model version consistency
Make sure the target_model parameter used during voice cloning matches the model parameter used during speech synthesis. For example, if you used cosyvoice-v3-plus for cloning, you must also use cosyvoice-v3-plus for synthesis.
Verify source audio quality
Check whether the source audio used for voice cloning meets the requirements in the Voice cloning/design API:
Audio duration: 10-20 seconds
Clear audio quality
No background noise
Check request parameters
Confirm that the voice parameter in your speech synthesis request is set to the cloned voice ID.
Q: What should I do if the cloned voice produces unstable or incomplete speech?
If the synthesized speech from a cloned voice exhibits any of the following issues:
Incomplete playback that only reads part of the text
Inconsistent synthesis quality
Abnormal pauses or silent segments in the speech
Possible cause: The source audio quality doesn't meet the requirements.
Solution: Check whether the source audio meets the following requirements. We recommend re-recording based on the Recording guide for voice cloning.
Check audio continuity: Make sure the speech content in the source audio is continuous, with no pauses or silent segments exceeding 2 seconds. Noticeable gaps in the audio can cause the model to misidentify silence or noise as voice characteristics, which degrades synthesis quality.
Check speech activity ratio: Make sure active speech accounts for more than 60% of the total audio duration. Excessive background noise or non-speech segments interfere with voice feature extraction.
Verify audio quality details:
Audio duration: 10-20 seconds (approximately 15 seconds recommended)
Clear pronunciation with steady speaking pace
No background noise, echo, or distortion
Concentrated voice energy with no long silent segments
Q: Why does the actual duration differ from the duration displayed in the WAV file header?
Speech synthesis uses a streaming mechanism that returns data progressively as it's generated. The duration in the saved WAV file header is an estimate and may contain inaccuracies. For precise duration, set format to pcm, wait for the complete synthesis result, and then add the WAV file header yourself.
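For example, using Python's standard wave module, you can wrap the collected PCM bytes in a WAV container whose header reflects the true length. The sample rate, channel count, and sample width must match what you requested at synthesis time (the values below are placeholders):

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 22050, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM (16-bit by default) in a WAV container with an
    accurate header computed from the actual number of frames."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        w.writeframes(pcm)  # header sizes are finalized on close
    return buf.getvalue()
```

Collect the full PCM stream first, then call pcm_to_wav once; the resulting file reports the exact duration.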
Q: Why won't the audio play?
Troubleshoot based on the following scenarios:
Audio saved as a complete file (such as xx.mp3)
Audio format consistency: Make sure the audio format specified in the request parameters matches the file extension. For example, if the request format is set to
wavbut the file is saved as.mp3, playback may fail.Player compatibility: Confirm that your player supports the audio file's format and sample rate. Some players don't support high sample rates or specific audio encodings.
Streaming audio playback
Save the audio stream as a complete file and try playing it with a media player. If the file won't play, refer to the troubleshooting steps in Scenario 1.
If the file plays correctly, the issue is likely in your streaming playback implementation. Confirm that your player supports streaming playback. Common tools and libraries that support streaming playback include ffmpeg, pyaudio (Python), AudioFormat (Java), and MediaSource (JavaScript).
Q: Why is audio playback stuttering?
Troubleshoot with the following steps:
Check text send rate: Make sure the interval between text segments is appropriate, so that the next segment is sent before the previous audio finishes playing.
Check callback function performance:
Confirm that the callback function doesn't contain excessive business logic that causes blocking.
The callback function runs on the WebSocket thread. Blocking it delays network packet reception and causes audio stutter.
Write audio data to a separate audio buffer and process it in a different thread to avoid blocking the WebSocket thread.
Check network stability: Make sure your network connection is stable. Network fluctuations can cause audio transmission interruptions or delays.
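The buffering pattern from step 2 can be sketched as follows: the WebSocket callback only enqueues, and a separate thread drains the queue. Here a plain list stands in for the audio device; in practice you would write each chunk to an audio output stream.

```python
import queue
import threading

audio_q = queue.Queue()
played = []  # stand-in for the audio device sink

def on_data(chunk: bytes) -> None:
    """Runs on the WebSocket thread: only enqueue, never decode or play here."""
    audio_q.put(chunk)

def player() -> None:
    """Separate thread drains the buffer and feeds the audio output."""
    while True:
        chunk = audio_q.get()
        if chunk is None:  # sentinel: synthesis finished
            break
        played.append(chunk)  # replace with a write to your audio stream

t = threading.Thread(target=player, daemon=True)
t.start()
```

Because on_data returns immediately, the WebSocket thread keeps receiving packets while playback proceeds independently.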
Q: Why is speech synthesis taking a long time?
Troubleshoot with the following steps:
Check input interval
For streaming speech synthesis, confirm that the interval between text segments isn't too long (for example, waiting several seconds after sending one segment before sending the next). Long intervals increase the total synthesis time.
Analyze performance metrics
First-packet latency: typically around 500 ms.
RTF (Real-Time Factor = total synthesis time / audio duration): should be less than 1.0 under normal conditions.
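As a quick sanity check, RTF is a one-line computation over your own timing measurements:

```python
def rtf(total_synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: synthesis wall-clock time divided by the
    duration of the audio produced. Below 1.0 means synthesis is
    faster than real-time playback (healthy)."""
    return total_synthesis_seconds / audio_seconds
```

For example, taking 2 seconds to synthesize 8 seconds of audio gives an RTF of 0.25, well under the 1.0 threshold.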
Q: How do I restrict an API key to speech synthesis only (permission isolation)?
Create a new workspace and authorize only specific models to limit the scope of your API key. For details, see Manage workspaces.