Alibaba Cloud Model Studio: Speech synthesis - Qwen

Last Updated: May 12, 2026

Non-real-time speech synthesis converts text to speech (TTS) through an HTTP API, making it suitable for latency-tolerant scenarios such as audiobook production, e-learning narration, and batch content production. The service supports the Qwen-TTS model families and offers a wide selection of voices, multilingual support, voice cloning, and voice design.

Overview

Convert text to speech files through an HTTP API. This approach suits latency-tolerant scenarios such as audiobook production, e-learning narration, and batch content production.

  • Submit the complete text to the HTTP API and receive the audio output. Streaming output (play audio while it is still being synthesized) is also supported.

  • Supports multiple languages, including Chinese dialects.

  • Supports voice cloning and voice design for custom voice creation.

  • Supports instruction control, which lets you shape speech expressiveness through natural-language instructions.

For real-time, low-latency speech synthesis, see Real-time speech synthesis (WebSocket API). To choose a model, see Speech synthesis.

Prerequisites

You have obtained a Model Studio API key for the region you call (API keys differ between the Singapore and Beijing regions), preferably configured it as the DASHSCOPE_API_KEY environment variable, and installed the latest version of the DashScope SDK for the Python and Java examples.

Quick start

The following examples demonstrate how to synthesize speech with each model family. For more language examples and detailed parameter descriptions, see the API reference for each model.

Qwen-TTS

The following examples show how to synthesize speech with a built-in voice.

Non-streaming output

In non-streaming mode, the response includes a url field pointing to the synthesized audio file. The URL expires after 24 hours.

Python

import os
import dashscope

# The following is the Singapore region URL. To use models in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

text = "Today is a wonderful day to build something people love!"
# qwen3-tts models are called through the MultiModalConversation interface
response = dashscope.MultiModalConversation.call(
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text=text,
    voice="Cherry",
    language_type="English", # It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
    # instructions='Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,
    stream=False
)
print(response)
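
If you only need the file, you can pull the URL out of the response and download it. A minimal sketch using only the standard library, assuming the response exposes the URL as output.audio.url (mirroring the Java example below):

import urllib.request

# Extract the temporary audio URL from the response and save the file locally
audio_url = response.output.audio.url
urllib.request.urlretrieve(audio_url, "downloaded_audio.wav")
print("Audio file downloaded to local storage: downloaded_audio.wav")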

Java

Import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:

Maven

Add the following to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add the following to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")
import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void call() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        String audioUrl = result.getOutput().getAudio().getUrl();
        System.out.print(audioUrl);

        // Download the audio file to local storage
        try (InputStream in = new URL(audioUrl).openStream();
             FileOutputStream out = new FileOutputStream("downloaded_audio.wav")) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            System.out.println("\nAudio file downloaded to local storage: downloaded_audio.wav");
        } catch (Exception e) {
            System.out.println("\nError downloading audio file: " + e.getMessage());
        }
    }
    public static void main(String[] args) {
        // The following is the Singapore region URL. To use models in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        try {
            call();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

# ======= IMPORTANT =======
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# Note: API keys differ between the Singapore and Beijing regions. To obtain an API key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Today is a wonderful day to build something people love!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'
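
For reference, a successful non-streaming response is a JSON object in roughly the following shape. This is an illustrative sketch: the values are placeholders, and only the output.audio.url, output.audio.expires_at, and output.finish_reason fields are relied on by the examples in this topic.

{
    "output": {
        "finish_reason": "stop",
        "audio": {
            "url": "https://...",
            "expires_at": 1767168000
        }
    },
    "request_id": "..."
}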

Streaming output

In streaming mode, audio data is returned incrementally as Base64-encoded PCM segments. The last packet includes a URL for the complete audio file.

Python

# coding=utf-8
#
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import dashscope
import pyaudio
import time
import base64
import numpy as np

# The following is the Singapore region URL. To use models in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

p = pyaudio.PyAudio()
# Create an audio stream
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True)

text = "Today is a wonderful day to build something people love!"
response = dashscope.MultiModalConversation.call(
    # The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API Key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use the instruction control feature, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    text=text,
    voice="Cherry",
    language_type="English", # It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
    # To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
    # instructions='Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,
    stream=True
)

for chunk in response:
    if chunk.output is not None:
        audio = chunk.output.audio
        if audio.data is not None:
            wav_bytes = base64.b64decode(audio.data)
            audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
            # Play the audio data directly
            stream.write(audio_np.tobytes())
        if chunk.output.finish_reason == "stop":
            print("finish at:", chunk.output.audio.expires_at)
time.sleep(0.8)
# Clean up resources
stream.stop_stream()
stream.close()
p.terminate()
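
To save the streamed audio instead of (or in addition to) playing it, collect the decoded PCM bytes inside the loop and write them to a WAV file. A minimal sketch with the standard-library wave module, assuming the same 24 kHz, 16-bit, mono PCM format as above:

import wave

pcm_chunks = []  # append wav_bytes to this list inside the streaming loop

def save_wav(chunks, path="output.wav"):
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit samples (2 bytes)
        f.setframerate(24000)  # 24 kHz sample rate
        f.writeframes(b"".join(chunks))

save_wav(pcm_chunks)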

Java

Import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:

Maven

Add the following to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add the following to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")
// Install the latest version of the DashScope SDK
import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
import javax.sound.sampled.*;
import java.util.Base64;

public class Main {
    // To use the instruction control feature, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void streamCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API Keys for the Singapore and Beijing regions are different. Get an API Key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If you have not configured an environment variable, replace the following line with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
                .languageType("English") // It is recommended to match the language of the text to ensure correct pronunciation and natural intonation.
                // To use the instruction control feature, uncomment the following lines and replace model with qwen3-tts-instruct-flash
                // .parameter("instructions","Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(r -> {
            try {
                // 1. Get the Base64-encoded audio data (the final packet carries a URL instead of audio data)
                String base64Data = r.getOutput().getAudio().getData();
                if (base64Data == null) {
                    return; // Skip packets without audio data, such as the final packet
                }
                byte[] audioBytes = Base64.getDecoder().decode(base64Data);

                // 2. Configure the audio format (adjust according to the audio format returned by the API)
                AudioFormat format = new AudioFormat(
                        AudioFormat.Encoding.PCM_SIGNED,
                        24000, // Sample rate (must match the format returned by the API)
                        16,    // Bits per sample
                        1,     // Number of channels
                        2,     // Frame size (bytes)
                        24000, // Frame rate (must match the sample rate)
                        false  // bigEndian flag: false means little-endian PCM
                );

                // 3. Play the audio data in real time
                DataLine.Info info = new DataLine.Info(SourceDataLine.class, format);
                try (SourceDataLine line = (SourceDataLine) AudioSystem.getLine(info)) {
                    if (line != null) {
                        line.open(format);
                        line.start();
                        line.write(audioBytes, 0, audioBytes.length);
                        line.drain();
                    }
                }
            } catch (LineUnavailableException e) {
                e.printStackTrace();
            }
        });
    }
    public static void main(String[] args) {
        // The following is the Singapore region URL. To use models in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

# ======= IMPORTANT =======
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# Note: API keys differ between the Singapore and Beijing regions. To obtain an API key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Today is a wonderful day to build something people love!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'
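
With the X-DashScope-SSE: enable header, the response body is a server-sent event stream. Each event's data payload is a JSON object; intermediate packets carry a Base64-encoded PCM segment in output.audio.data, and the final packet carries the audio URL. A rough, illustrative sketch of the stream:

data: {"output":{"audio":{"data":"<Base64-encoded PCM segment>"}}}
...
data: {"output":{"finish_reason":"stop","audio":{"url":"https://...","expires_at":1767168000}}}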

Advanced features

Instruction control

Instruction-based control lets you precisely shape the vocal expression through natural language descriptions, without adjusting complex audio parameters. Describe the desired tone, speed, emotion, or timbre in plain text to produce the corresponding speech effect.

Supported models: Qwen3-TTS-Instruct-Flash family

Usage: Pass the instruction in the instructions parameter (see the sketch below). For example: "Speak quickly with a noticeable rising tone, as if you're introducing a fashion item."

Supported instruction languages: Chinese and English.

Instruction text length limit: Up to 1,600 tokens.
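
The quick-start Python example can be adapted for instruction control as follows. This is a sketch based on the commented-out lines in the quick start; the instruction text and input text are illustrative:

import os
import dashscope

# Singapore region endpoint; for the Beijing region use: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-tts-instruct-flash",  # Instruction control requires an instruct model
    text="This season's collection is all about bold colors and lightweight fabrics!",
    voice="Cherry",
    language_type="English",
    # Natural-language instruction describing the desired delivery
    instructions="Speak at a relatively fast speed with a noticeable rising intonation, suitable for introducing fashion products.",
    optimize_instructions=True,
    stream=False
)
print(response)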

Use cases:

  • Audiobook and radio drama voiceover

  • Advertising and promotional voiceover

  • Game character and animation voiceover

  • Emotionally expressive voice assistants

  • Documentary narration and news broadcasting

Tips for writing high-quality voice descriptions:

  • Core principles:

    1. Be specific, not vague: Use words that describe concrete vocal qualities, such as "deep," "crisp," or "slightly fast." Avoid subjective, low-information terms like "nice" or "normal."

    2. Be multidimensional, not single-faceted: A good description combines multiple dimensions (pitch, speed, emotion, etc.). Describing only one dimension (e.g., "high pitch") is too broad to produce a distinctive effect.

    3. Be objective, not subjective: Focus on the physical and perceptual qualities of the voice, not personal preferences. For example, use "slightly high pitch with energy" rather than "my favorite voice."

    4. Be original, not imitative: Describe the vocal qualities you want, rather than requesting imitation of specific public figures (such as celebrities or actors). Imitation requests involve copyright risks and are not supported.

    5. Be concise, not redundant: Make every word count. Avoid repeating synonyms or stacking meaningless intensifiers (e.g., "a very very great voice").

  • Description dimensions: Combining multiple dimensions creates richer expression effects.

    • Pitch: high, mid, low, slightly high, slightly low

    • Speed: fast, moderate, slow, slightly fast, slightly slow

    • Emotion: cheerful, calm, gentle, serious, lively, composed, soothing

    • Timbre: magnetic, crisp, husky, mellow, sweet, rich, powerful

    • Use case: news broadcasting, advertising, audiobook, animation character, voice assistant, documentary narration

  • Examples:

    • Standard broadcasting style: Clear and precise articulation with standard pronunciation

    • Emotional escalation: Volume rising rapidly from normal conversation to a shout; straightforward personality with externalized, easily agitated emotions

    • Special emotional state: Slightly slurred pronunciation from a teary voice, slightly husky, with noticeable tension from a sobbing tone

    • Advertising voiceover style: Slightly high pitch, moderate speed, energetic and engaging, suitable for advertising

    • Gentle soothing style: Slightly slow speed, soft and sweet tone, warm and comforting like a caring friend

Supported scope

Model availability varies by deployment region:

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

To call the following models, use an API key from the Singapore region:

  • Qwen-TTS:

    • Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)

    • Qwen3-TTS-Flash: qwen3-tts-flash (stable, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

To call the following models, use an API key from the Beijing region:

  • Qwen-TTS:

    • Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)

    • Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)

    • Qwen3-TTS-Flash: qwen3-tts-flash (stable, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18

    • Qwen-TTS: qwen-tts (stable, currently equivalent to qwen-tts-2025-04-10), qwen-tts-latest (latest, currently equivalent to qwen-tts-2025-05-22), qwen-tts-2025-05-22 (snapshot), qwen-tts-2025-04-10 (snapshot)

Built-in voices

Voices vary by model. To specify a voice, set the voice parameter to the voice's identifier (for example, Cherry). For the full list of built-in voices that each model supports, see that model's API reference.

API reference

FAQ

Q: How long does the audio file URL remain valid?

A: The audio file URL expires 24 hours after it's generated. To get a new URL, call the API again.