Non-real-time speech synthesis converts text to speech (TTS) through an HTTP API. It suits latency-tolerant scenarios such as audiobook production, e-learning narration, and content production. The service offers a wide selection of voices, multilingual support, voice cloning, and voice design.
Overview
Convert complete text to speech files through an HTTP API. Two output modes are available: non-streaming and streaming.
-
Non-streaming returns an audio file URL that expires after 24 hours. Streaming returns PCM audio data in chunks.
-
Supports multiple languages, including Chinese dialects.
-
Supports voice cloning and voice design for custom voice creation.
-
Supports instruction control, which lets you shape speech expressiveness through natural-language instructions.
For low-latency streaming synthesis, see Real-time speech synthesis. To choose a model, see Speech synthesis.
Prerequisites
Before you begin, complete the following:
-
(Optional) To call the API through the DashScope SDK, install the latest SDK version.
Quick start
The examples below demonstrate synthesis with the Qwen-TTS model family. For more code examples and parameter details, see the API reference.
Qwen-TTS
All examples in this section use a built-in voice.
Non-streaming output
In non-streaming mode, the response includes a url field pointing to the synthesized audio file. The URL expires after 24 hours.
Python
import os
import dashscope
# Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
dashscope.base_http_api_url = 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1'
text = "Today is a wonderful day to build something people love!"
# SpeechSynthesizer usage: dashscope.audio.qwen_tts.SpeechSynthesizer.call(...)
response = dashscope.MultiModalConversation.call(
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text=text,
    voice="Cherry",
    language_type="English",  # It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
    # instructions='Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,
    stream=False
)
print(response)
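The Java example below downloads the synthesized file, while the Python example only prints the response. As a rough sketch of the same download step in Python, the helper below uses only the standard library. The function name `download_audio` is hypothetical, and the attribute path `response.output.audio.url` in the usage comment is an assumption inferred from the `url` field described above; check the API reference for the exact response shape.

```python
import shutil
import urllib.request
from pathlib import Path


def download_audio(audio_url: str, dest: str = "downloaded_audio.wav") -> Path:
    """Copy the file at audio_url to dest and return the local path."""
    dest_path = Path(dest)
    # urlopen handles https:// URLs as well as file:// URLs (useful for local tests).
    with urllib.request.urlopen(audio_url) as src, open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out)
    return dest_path


# Hypothetical usage with the response above (attribute path is an assumption):
# download_audio(response.output.audio.url)
```

Because the URL expires after 24 hours, downloading the file soon after synthesis avoids a second API call.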
Java
Import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:
Maven
Add the following to pom.xml:
<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.13.1</version>
</dependency>
Gradle
Add the following to build.gradle:
// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")
import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void call() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        String audioUrl = result.getOutput().getAudio().getUrl();
        System.out.print(audioUrl);
        // Download the audio file to local storage
        try (InputStream in = new URL(audioUrl).openStream();
             FileOutputStream out = new FileOutputStream("downloaded_audio.wav")) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            System.out.println("\nAudio file downloaded to local storage: downloaded_audio.wav");
        } catch (Exception e) {
            System.out.println("\nError downloading audio file: " + e.getMessage());
        }
    }
    public static void main(String[] args) {
        // Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
        Constants.baseHttpApiUrl = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";
        try {
            call();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
cURL
# ======= IMPORTANT =======
# Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
# === Remove this comment before running ===
curl -X POST 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-tts-flash",
    "input": {
      "text": "Today is a wonderful day to build something people love!",
      "voice": "Cherry",
      "language_type": "English"
    }
  }'
Streaming output
In streaming mode, audio data is returned incrementally as Base64-encoded PCM segments. The last packet includes a URL for the complete audio file.
Python
# coding=utf-8
#
# Installation instructions for pyaudio:
# APPLE Mac OS X
# brew install portaudio
# pip install pyaudio
# Debian/Ubuntu
# sudo apt-get install python-pyaudio python3-pyaudio
# or
# pip install pyaudio
# CentOS
# sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
# python -m pip install pyaudio
import os
import dashscope
import pyaudio
import time
import base64
import numpy as np
# The following is the Singapore region URL. To use models in the Beijing region, replace the URL with: https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1'
p = pyaudio.PyAudio()
# Create an audio stream
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True)
text = "Today is a wonderful day to build something people love!"
response = dashscope.MultiModalConversation.call(
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    text=text,
    voice="Cherry",
    language_type="English",  # It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
    # instructions='Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,
    stream=True
)
for chunk in response:
    if chunk.output is not None:
        audio = chunk.output.audio
        if audio.data is not None:
            # Each chunk carries Base64-encoded 16-bit PCM samples
            pcm_bytes = base64.b64decode(audio.data)
            audio_np = np.frombuffer(pcm_bytes, dtype=np.int16)
            # Play the audio data directly
            stream.write(audio_np.tobytes())
        if chunk.output.finish_reason == "stop":
            print(f"finish at: {chunk.output.audio.expires_at}")
# Give the audio device time to play the last buffered samples
time.sleep(0.8)
# Clean up resources
stream.stop_stream()
stream.close()
p.terminate()
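If you want to keep the streamed audio rather than play it, the Base64 PCM segments can be collected into a WAV file. The following is a minimal sketch using the standard `wave` module. It assumes the format used by the playback example above (24 kHz, mono, 16-bit samples); the function name `write_pcm_chunks_to_wav` is hypothetical.

```python
import base64
import wave


def write_pcm_chunks_to_wav(b64_chunks, path, sample_rate=24000):
    """Decode Base64 PCM chunks and write them to a mono 16-bit WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono, as in the playback example
        wav_file.setsampwidth(2)       # 16-bit samples = 2 bytes
        wav_file.setframerate(sample_rate)
        for b64_chunk in b64_chunks:
            if b64_chunk:              # skip packets that carry no audio data
                wav_file.writeframes(base64.b64decode(b64_chunk))
    return path


# Hypothetical usage with the streaming response above:
# write_pcm_chunks_to_wav(
#     (c.output.audio.data for c in response if c.output is not None),
#     "streamed_audio.wav",
# )
```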
Java
Import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:
Maven
Add the following to pom.xml:
<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.13.1</version>
</dependency>
Gradle
Add the following to build.gradle:
// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")
// Install the latest version of the DashScope SDK
import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
import javax.sound.sampled.*;
import java.util.Base64;
public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void streamCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Alibaba Cloud Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(r -> {
            try {
                // 1. Get the Base64-encoded audio data
                String base64Data = r.getOutput().getAudio().getData();
                if (base64Data == null || base64Data.isEmpty()) {
                    // Skip packets without audio data (for example, one that carries only the final URL)
                    return;
                }
                byte[] audioBytes = Base64.getDecoder().decode(base64Data);
                // 2. Configure the audio format (adjust according to the audio format returned by the API)
                AudioFormat format = new AudioFormat(
                        AudioFormat.Encoding.PCM_SIGNED,
                        24000, // Sample rate (must match the format returned by the API)
                        16, // Bits per sample
                        1, // Number of channels
                        2, // Frame size (bytes)
                        24000, // Frame rate (must match the sample rate)
                        false // Big-endian (false = little-endian)
                );
                // 3. Play the audio data in real time
                DataLine.Info info = new DataLine.Info(SourceDataLine.class, format);
                try (SourceDataLine line = (SourceDataLine) AudioSystem.getLine(info)) {
                    if (line != null) {
                        line.open(format);
                        line.start();
                        line.write(audioBytes, 0, audioBytes.length);
                        line.drain();
                    }
                }
            } catch (LineUnavailableException e) {
                e.printStackTrace();
            }
        });
    }
    public static void main(String[] args) {
        // The following is the Singapore region URL. To use models in the Beijing region, replace the URL with: https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1";
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
cURL
# ======= IMPORTANT =======
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# Note: API Keys differ between the Singapore and Beijing regions. To obtain an API Key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Remove this comment before running ===
curl -X POST 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -H 'X-DashScope-SSE: enable' \
  -d '{
    "model": "qwen3-tts-flash",
    "input": {
      "text": "Today is a wonderful day to build something people love!",
      "voice": "Cherry",
      "language_type": "English"
    }
  }'
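When you call the HTTP endpoint directly with the `X-DashScope-SSE: enable` header, the response arrives as server-sent events. The sketch below shows one way to pull the audio bytes out of such a stream in Python. It relies on an assumption: that each `data:` line holds a JSON object whose audio payload sits at `output.audio.data`, mirroring the attribute names used by the SDK examples. The function name `iter_sse_audio` is hypothetical; verify the event layout against the API reference before relying on it.

```python
import base64
import json


def iter_sse_audio(lines):
    """Yield decoded PCM bytes from SSE lines that carry audio data."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip id:, event:, comment, and blank lines
        payload = json.loads(line[len("data:"):])
        # Assumed layout: {"output": {"audio": {"data": "<base64 PCM>"}}}
        audio = (payload.get("output") or {}).get("audio") or {}
        data = audio.get("data")
        if data:
            yield base64.b64decode(data)
```

The generator accepts any iterable of text or byte lines, so it can wrap the line iterator of whichever HTTP client you use.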
Advanced features
Instruction control
Instruction-based control lets you shape tone, speed, emotion, and timbre through natural language descriptions, without adjusting complex audio parameters.
Supported models: Qwen3-TTS-Instruct-Flash family
Usage: Pass the instruction text in the instructions parameter.
Supported instruction languages: Chinese and English.
Maximum instruction length: 1,600 tokens.
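As an illustration of the usage described above, the sketch below assembles the keyword arguments for an instruction-controlled call, based on the commented-out lines in the Quick start examples. The helper name `build_instruct_call_kwargs` is hypothetical, and the defaults (voice `Cherry`, `optimize_instructions=True`) simply mirror those examples rather than any documented recommendation.

```python
import os

INSTRUCT_MODEL = "qwen3-tts-instruct-flash"


def build_instruct_call_kwargs(text, instructions, voice="Cherry",
                               language_type="English", optimize=True):
    """Return keyword arguments for an instruction-controlled synthesis call."""
    if not instructions.strip():
        raise ValueError("instructions must not be empty")
    return {
        "model": INSTRUCT_MODEL,
        "api_key": os.getenv("DASHSCOPE_API_KEY"),
        "text": text,
        "voice": voice,
        "language_type": language_type,
        "instructions": instructions,
        "optimize_instructions": optimize,
        "stream": False,
    }


# Usage, following the Quick start examples:
# response = dashscope.MultiModalConversation.call(
#     **build_instruct_call_kwargs("Hello!", "Calm male voice with a slow pace"))
```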
Use cases:
-
Audiobook and radio drama voiceover
-
Advertising and promotional voiceover
-
Game character and animation voiceover
-
Emotionally expressive voice assistants
-
Documentary narration and news broadcasting
Tips for writing high-quality voice descriptions:
-
Core principles:
-
Be specific, not vague: Use words that describe concrete vocal qualities, such as "deep," "crisp," or "slightly fast." Avoid subjective or vague terms like "nice" or "normal."
-
Be multidimensional, not single-faceted: A good description covers multiple dimensions (gender, age, emotion, etc.). Writing only "female voice" is too broad to produce a distinctive timbre.
-
Be objective, not subjective: Focus on the physical and perceptual qualities of the voice. For example, use "slightly high pitch with energy" rather than "my favorite voice."
-
Be original, not imitative: Describe the vocal qualities you want, rather than requesting imitation of specific public figures (such as celebrities or actors). The model does not support imitation, and such requests may carry copyright risks.
-
Be concise, not redundant: Make every word count. Avoid repeating synonyms or stacking meaningless modifiers.
-
Description dimensions:
Combining the following dimensions produces more accurate results. The more dimensions described, the more precise the output.
| Dimension | Example descriptions |
| --- | --- |
| Gender | Male, female, neutral |
| Age | Child (5-12), teenager (13-18), young adult (19-35), middle-aged (36-55), elderly (55+) |
| Pitch | High, mid, low, slightly high, slightly low |
| Speed | Fast, moderate, slow, slightly fast, slightly slow |
| Emotion | Cheerful, calm, gentle, serious, lively, composed, soothing |
| Timbre | Magnetic, crisp, husky, mellow, sweet, rich, powerful |
| Use case | News broadcasting, advertising, audiobook, animation character, voice assistant, documentary narration |
-
Examples:
-
Standard broadcasting style: Clear and precise articulation with standard pronunciation
-
Young, lively female voice with a slightly fast pace and a noticeable rising intonation, suitable for introducing fashion products
-
Calm middle-aged male voice with a slow pace, deep and magnetic timbre, suitable for reading news or narrating documentaries
-
Gentle, intellectual female voice, around 30 years old, with a calm tone, suitable for audiobook reading
-
Cute child voice, about 8-year-old girl, slightly childish speech, suitable for animation character voiceover
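The dimensions and examples above suggest a simple pattern: name the voice traits, then pitch, pace, timbre, and intended use. The sketch below composes a description from whichever dimensions you supply. It is only an illustrative helper (the name `build_voice_description` and the phrase template are assumptions, not part of the API); the resulting string is what you would pass in the instructions parameter.

```python
def build_voice_description(gender=None, age=None, pitch=None, speed=None,
                            emotion=None, timbre=None, use_case=None):
    """Compose a multidimensional voice description from optional traits."""
    traits = [t for t in (emotion, age, gender) if t]
    parts = []
    if traits:
        parts.append(" ".join(traits) + " voice")
    if pitch:
        parts.append(f"{pitch} pitch")
    if speed:
        parts.append(f"{speed} pace")
    if timbre:
        parts.append(f"{timbre} timbre")
    if use_case:
        parts.append(f"suitable for {use_case}")
    if not parts:
        raise ValueError("describe at least one dimension")
    text = ", ".join(parts)
    return text[0].upper() + text[1:]


print(build_voice_description(
    gender="male", age="middle-aged", speed="slow", emotion="calm",
    timbre="deep and magnetic", use_case="narrating documentaries"))
# prints: Calm middle-aged male voice, slow pace, deep and magnetic timbre, suitable for narrating documentaries
```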
Dialects
This section explains how to make the model output speech in a Chinese dialect (for example, Henan or Sichuan). Settings vary by model and by voice type.
Qwen-TTS
-
Built-in voices: Use a built-in voice that supports dialects. See the voice list for details.
-
Voice cloning: Dialects are not supported.
-
Voice design: Dialects are not supported.
-
Supported dialects: See the Supported languages column for each model in Qwen3-TTS.
Supported models and regions
Singapore
To call the following models, use an API key from the Singapore region:
-
Qwen-TTS:
-
Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)
-
Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)
-
Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
-
Qwen3-TTS-Flash: qwen3-tts-flash (stable, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18
China (Beijing)
To call the following models, use an API key from the Beijing region:
-
Qwen-TTS:
-
Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)
-
Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)
-
Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
-
Qwen3-TTS-Flash: qwen3-tts-flash (stable, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18
-
Qwen-TTS: qwen-tts (stable, currently equivalent to qwen-tts-2025-04-10), qwen-tts-latest (latest, currently equivalent to qwen-tts-2025-05-22), qwen-tts-2025-05-22 (snapshot), qwen-tts-2025-04-10 (snapshot)
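Because both the base URL and the available models depend on the region, it can help to centralize that mapping in code. The sketch below encodes the two lists above together with the region URLs shown in the Quick start examples. The names `base_url_for` and `is_model_available` are hypothetical, and the model sets are a snapshot of this page that will need updating as models change.

```python
BASE_URL_TEMPLATES = {
    "singapore": "https://{workspace_id}.ap-southeast-1.maas.aliyuncs.com/api/v1",
    "beijing": "https://{workspace_id}.cn-beijing.maas.aliyuncs.com/api/v1",
}

_QWEN3_TTS_MODELS = {
    "qwen3-tts-instruct-flash", "qwen3-tts-instruct-flash-2026-01-26",
    "qwen3-tts-vd-2026-01-26", "qwen3-tts-vc-2026-01-22",
    "qwen3-tts-flash", "qwen3-tts-flash-2025-11-27", "qwen3-tts-flash-2025-09-18",
}

MODELS_BY_REGION = {
    "singapore": frozenset(_QWEN3_TTS_MODELS),
    # The original Qwen-TTS models are listed for the Beijing region only.
    "beijing": frozenset(_QWEN3_TTS_MODELS | {
        "qwen-tts", "qwen-tts-latest", "qwen-tts-2025-05-22", "qwen-tts-2025-04-10",
    }),
}


def base_url_for(region, workspace_id):
    """Return the HTTP API base URL for a region and workspace ID."""
    return BASE_URL_TEMPLATES[region.lower()].format(workspace_id=workspace_id)


def is_model_available(model, region):
    """Check whether a model name appears in the list for the given region."""
    return model in MODELS_BY_REGION[region.lower()]
```

Remember that each region also requires its own API key, so a region switch changes the URL and the key together.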
Built-in voices
Voices vary by model. Set the voice parameter to the value in the voice parameter column of the tables below.
API reference
FAQ
How long does the audio file URL remain valid?
The audio file URL expires 24 hours after it is generated. To get a new URL, call the API again.
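If your application stores audio URLs, it can track the 24-hour window itself. The following sketch assumes you record the time at which each URL was generated; the helper names `url_expiry` and `is_url_expired` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

URL_TTL = timedelta(hours=24)  # validity window stated above


def url_expiry(generated_at):
    """Return the moment a URL generated at generated_at stops being valid."""
    return generated_at + URL_TTL


def is_url_expired(generated_at, now=None):
    """Return True if the URL should be treated as expired."""
    now = now or datetime.now(timezone.utc)
    return now >= url_expiry(generated_at)
```

When `is_url_expired` returns True, call the synthesis API again to obtain a fresh URL.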