Non-real-time speech synthesis - Alibaba Cloud Model Studio

Overview

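Convert complete text into a speech file over the HTTP API. Both non-streaming and streaming output modes are supported.

Non-streaming mode returns the URL of an audio file, valid for 24 hours; streaming mode returns PCM audio data chunk by chunk.
Supports multiple languages, including Chinese dialects.
Supports voice cloning and voice design for creating custom voices.
Supports instruction control: natural-language instructions control the expressiveness of the speech.

For low-latency streaming scenarios, see Real-time speech synthesis. For model selection guidance, see Speech synthesis.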

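Prerequisites

Before you begin, make sure you have completed the following:

Configure an API key and set it as an environment variable
(Optional) If you call the service through the DashScope SDK, install the latest version of the SDK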

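Quick start

The tabs below demonstrate speech synthesis for different model families. For samples in more languages and detailed parameter descriptions, see the API reference.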

Qwen-TTS

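All samples in this section use system voices.

Non-streaming output

In non-streaming mode, the response contains a url field that points to the synthesized audio file. The URL is valid for 24 hours.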

Python

import os
import dashscope

# The following is the configuration for the Singapore region.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

text = "Today is a wonderful day to build something people love!"
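# To use the SpeechSynthesizer interface instead: dashscope.audio.qwen_tts.SpeechSynthesizer.call(...)
response = dashscope.MultiModalConversation.call(
    # To use instruction control, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If the environment variable is not configured, replace the next line with your Alibaba Cloud Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text=text,
    voice="Cherry",
    language_type="English", # Recommended to match the language of the text for correct pronunciation and natural intonation.
    # To use instruction control, uncomment the lines below and replace model with qwen3-tts-instruct-flash
    # instructions='Fast speaking rate with a clearly rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,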
    stream=False
)
print(response)
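
The Python sample above only prints the response. As a minimal sketch, the audio file could be saved locally while the URL is still valid. The helper name download_audio is hypothetical, and the attribute path response.output.audio.url is an assumption that mirrors the Java sample's getOutput().getAudio().getUrl(); check both against the API reference.

```python
import shutil
import urllib.request


def download_audio(url: str, path: str) -> int:
    """Stream the file at `url` into `path` and return the number of bytes written."""
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
        return out.tell()


# Hypothetical usage with the non-streaming response from the sample above:
# audio_url = response.output.audio.url  # assumed attribute path
# download_audio(audio_url, "downloaded_audio.wav")
```

Because the URL expires after 24 hours, downloading right after the call avoids having to synthesize the text again later.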

Java

The Gson dependency is required. Add it with Maven or Gradle as follows:

Maven

Add to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class Main {
    // To use instruction control, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void call() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
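                // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If the environment variable is not configured, replace the next line with your Alibaba Cloud Model Studio API key: .apiKey("sk-xxx")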
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
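                .languageType("English") // Recommended to match the language of the text for correct pronunciation and natural intonation.
                // To use instruction control, uncomment the lines below and replace MODEL with qwen3-tts-instruct-flash
                // .parameter("instructions","Fast speaking rate with a clearly rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)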
                .build();
        MultiModalConversationResult result = conv.call(param);
        String audioUrl = result.getOutput().getAudio().getUrl();
        System.out.print(audioUrl);

        // Download the audio file to the local machine
        try (InputStream in = new URL(audioUrl).openStream();
             FileOutputStream out = new FileOutputStream("downloaded_audio.wav")) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            System.out.println("\nAudio file downloaded to: downloaded_audio.wav");
        } catch (Exception e) {
            System.out.println("\nError downloading audio file: " + e.getMessage());
        }
    }
    public static void main(String[] args) {
        // The following is the configuration for the Singapore region.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        try {
            call();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

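# ======= Important =======
# The following is the configuration for the Singapore region.
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===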

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Today is a wonderful day to build something people love!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'

Streaming output

In streaming mode, audio data is returned chunk by chunk as Base64-encoded PCM, and the final packet contains the URL of the complete audio.

Python

# coding=utf-8
#
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import dashscope
import pyaudio
import time
import base64
import numpy as np

# The following is the configuration for the Singapore region.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

p = pyaudio.PyAudio()
# Create the audio output stream
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=24000,
                output=True)

text = "Today is a wonderful day to build something people love!"
response = dashscope.MultiModalConversation.call(
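    # API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If the environment variable is not configured, replace the next line with your Alibaba Cloud Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # To use instruction control, replace model with qwen3-tts-instruct-flash
    model="qwen3-tts-flash",
    text=text,
    voice="Cherry",
    language_type="English", # Recommended to match the language of the text for correct pronunciation and natural intonation.
    # To use instruction control, uncomment the lines below and replace model with qwen3-tts-instruct-flash
    # instructions='Fast speaking rate with a clearly rising intonation, suitable for introducing fashion products.',
    # optimize_instructions=True,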
    stream=True
)

for chunk in response:
    if chunk.output is not None:
        audio = chunk.output.audio
        if audio.data is not None:
            wav_bytes = base64.b64decode(audio.data)
            audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
            # Play the audio data directly
            stream.write(audio_np.tobytes())
        if chunk.output.finish_reason == "stop":
            print(f"finish at: {chunk.output.audio.expires_at}")
time.sleep(0.8)
# Release resources
stream.stop_stream()
stream.close()
p.terminate()
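
The sample above plays the stream but does not keep it. As a minimal sketch, the same Base64 chunks could be written to a WAV file with the standard wave module. The helper name write_wav is hypothetical, and the format (24000 Hz, 16-bit, mono) is assumed from the playback settings in the sample rather than confirmed by the API reference.

```python
import base64
import wave
from typing import Iterable, Optional


def write_wav(b64_chunks: Iterable[Optional[str]], path: str, sample_rate: int = 24000) -> int:
    """Decode Base64 PCM chunks into a 16-bit mono WAV file; return the frame count."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono, as in the playback sample
        wav.setsampwidth(2)            # 2 bytes per sample (16-bit PCM)
        wav.setframerate(sample_rate)  # assumed to match the stream
        for chunk in b64_chunks:
            if not chunk:              # skip packets without audio data
                continue
            pcm = base64.b64decode(chunk)
            wav.writeframes(pcm)
            frames += len(pcm) // 2
    return frames


# Hypothetical usage with the streaming response from the sample above:
# write_wav((c.output.audio.data for c in response if c.output is not None), "output.wav")
```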

Java

The Gson dependency is required. Add it with Maven or Gradle as follows:

Maven

Add to pom.xml:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle

Add to build.gradle:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

// Install the latest version of the DashScope SDK
import com.alibaba.dashscope.aigc.multimodalconversation.AudioParameters;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
import javax.sound.sampled.*;
import java.util.Base64;

public class Main {
    // To use instruction control, replace MODEL with qwen3-tts-instruct-flash
    private static final String MODEL = "qwen3-tts-flash";
    public static void streamCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
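                // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If the environment variable is not configured, replace the next line with your Alibaba Cloud Model Studio API key: .apiKey("sk-xxx")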
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL)
                .text("Today is a wonderful day to build something people love!")
                .voice(AudioParameters.Voice.CHERRY)
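                .languageType("English") // Recommended to match the language of the text for correct pronunciation and natural intonation.
                // To use instruction control, uncomment the lines below and replace MODEL with qwen3-tts-instruct-flash
                // .parameter("instructions","Fast speaking rate with a clearly rising intonation, suitable for introducing fashion products.")
                // .parameter("optimize_instructions",true)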
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(r -> {
            try {
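                // 1. Get the Base64-encoded audio data
                String base64Data = r.getOutput().getAudio().getData();
                byte[] audioBytes = Base64.getDecoder().decode(base64Data);

                // 2. Configure the audio format (adjust to match the format returned by the API)
                AudioFormat format = new AudioFormat(
                        AudioFormat.Encoding.PCM_SIGNED,
                        24000, // sample rate (must match the format returned by the API)
                        16,    // sample size in bits
                        1,     // number of channels
                        2,     // frame size in bytes
                        24000, // frame rate (must match the sample rate)
                        false  // bigEndian = false, i.e. little-endian
                );

                // 3. Play the audio data in real time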
                DataLine.Info info = new DataLine.Info(SourceDataLine.class, format);
                try (SourceDataLine line = (SourceDataLine) AudioSystem.getLine(info)) {
                    if (line != null) {
                        line.open(format);
                        line.start();
                        line.write(audioBytes, 0, audioBytes.length);
                        line.drain();
                    }
                }
            } catch (LineUnavailableException e) {
                e.printStackTrace();
            }
        });
    }
    public static void main(String[] args) {
        // The following is the configuration for the Singapore region.
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

cURL

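# ======= Important =======
# The following is the configuration for the Singapore region.
# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===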

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-tts-flash",
    "input": {
        "text": "Today is a wonderful day to build something people love!",
        "voice": "Cherry",
        "language_type": "English"
    }
}'
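
With the X-DashScope-SSE: enable header, the response arrives as server-sent events rather than a single JSON body. The following is a rough sketch of how such a stream might be consumed. It assumes each event carries its JSON payload on a data: line and that the payload mirrors the SDK objects used in the Python sample (output.audio.data); both are assumptions, and the helper names are hypothetical.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed JSON payload of every `data:` line in an SSE stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])


def collect_pcm(lines: Iterable[str]) -> bytes:
    """Concatenate decoded PCM from all events (assumed path: output.audio.data)."""
    pcm = bytearray()
    for payload in iter_sse_payloads(lines):
        audio = (payload.get("output") or {}).get("audio") or {}
        if audio.get("data"):
            pcm += base64.b64decode(audio["data"])
    return bytes(pcm)
```

The lines argument can be any iterable of decoded text lines, for example the output of the cURL command above saved to a file and read line by line.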

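Advanced features

Instruction control

Instruction control uses natural-language descriptions to control the pitch, speaking rate, emotion, and timbre of the speech, with no need to tune complex audio parameters.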

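Instruction specifications by model:

Qwen-TTS

Supported models: only the Qwen3-TTS-Instruct-Flash series.

Usage: pass the instruction text through the instructions parameter.

Supported languages for instruction text: Chinese and English only.

Instruction text length limit: no more than 1,600 tokens.

Applicable scenarios:

Audiobook and radio drama dubbing
Advertising and promotional video voice-overs
Game character and animation dubbing
Emotionally expressive intelligent voice assistants
Documentaries and news broadcasts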

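How to write a high-quality voice description:

Core principles:
1. Specific rather than vague: use words that describe concrete vocal qualities, such as "deep", "crisp", or "fairly fast-paced"; avoid subjective or vague wording such as "nice" or "ordinary".
2. Multidimensional rather than one-dimensional: a good description usually covers several dimensions (such as gender, age, and emotion). Writing only "female voice" is too broad to produce a distinctive voice.
3. Objective rather than subjective: focus on the physical and perceptual characteristics of the voice. For example, write "high-pitched and energetic" instead of "my favorite voice".
4. Original rather than imitative: describe the qualities of the voice rather than asking the model to imitate a specific person (such as a celebrity or actor). The model does not support imitation, and such requests may carry copyright risk.
5. Concise rather than redundant: make sure every word serves a clear purpose, and avoid repeated synonyms or meaningless modifiers.

Description dimension reference:

Combine the following dimensions when describing a voice. The more dimensions you cover, the more precise the result.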

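Dimension	Sample descriptions
Gender	Male, female, neutral
Age	Child (5-12 years), teenager (13-18 years), young adult (19-35 years), middle-aged (36-55 years), elderly (55 years and older)
Pitch	High, medium, low, fairly high, fairly low
Speaking rate	Fast, medium, slow, fairly fast, fairly slow
Emotion	Cheerful, calm, gentle, serious, lively, composed, soothing
Characteristics	Magnetic, crisp, husky, mellow, sweet, rich, powerful
Use case	News broadcast, advertising voice-over, audiobook, animated character, voice assistant, documentary narration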

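Samples:
- Standard broadcast style: clear, precise articulation with well-rounded pronunciation
- A young, lively female voice with a fast speaking rate and a clearly rising intonation, suitable for introducing fashion products
- A calm middle-aged man with a slow speaking rate and a deep, magnetic voice, suitable for reading news or narrating documentaries
- A gentle, intellectual woman of about 30 with an even tone, suitable for audiobook narration
- A cute child's voice, a girl of about 8 who sounds slightly childlike, suitable for dubbing animated characters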
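
The dimension table and samples above suggest a simple way to assemble a description programmatically. The following is only an illustrative sketch: build_instruction and its parameter names are hypothetical, and the wording it produces is one possible phrasing, not a format the model requires.

```python
from typing import Optional


def build_instruction(
    gender: Optional[str] = None,
    age: Optional[str] = None,
    pitch: Optional[str] = None,
    pace: Optional[str] = None,
    emotion: Optional[str] = None,
    traits: Optional[str] = None,
    use_case: Optional[str] = None,
) -> str:
    """Join whichever dimensions were provided into one description string."""
    parts = [gender, age, pitch, pace, emotion, traits]
    text = ", ".join(p.strip() for p in parts if p and p.strip())
    if use_case and use_case.strip():
        suffix = f"suitable for {use_case.strip()}"
        text = f"{text}; {suffix}" if text else suffix.capitalize()
    return text


instruction = build_instruction(
    gender="female",
    age="about 30 years old",
    pace="moderate pace",
    emotion="gentle",
    use_case="audiobook narration",
)
# The result can then be passed as the instructions parameter shown in the samples.
```

Omitted dimensions are simply skipped, so the same helper covers both short and detailed descriptions.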

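Dialects

This section describes how to make the model output speech in a Chinese dialect (such as Henan dialect or Sichuan dialect). The setup differs by model and voice type.

Qwen-TTS

System voices: use a system voice that supports dialects; see the Qwen-TTS voice list.
Voice cloning voices: dialects are not supported.
Voice design voices: dialects are not supported.

For the specific dialects supported, see "Supported languages" for each model in Qwen3-TTS.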

Supported models and regions

Singapore

When calling the following models, use an API key for the Singapore region:

Qwen-TTS:
- Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable version, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)
- Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)
- Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
- Qwen3-TTS-Flash: qwen3-tts-flash (stable version, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18

China (Beijing)

When calling the following models, use an API key for the Beijing region:

Qwen-TTS:
- Qwen3-TTS-Instruct-Flash: qwen3-tts-instruct-flash (stable version, currently equivalent to qwen3-tts-instruct-flash-2026-01-26), qwen3-tts-instruct-flash-2026-01-26 (latest snapshot)
- Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)
- Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
- Qwen3-TTS-Flash: qwen3-tts-flash (stable version, currently equivalent to qwen3-tts-flash-2025-11-27), qwen3-tts-flash-2025-11-27, qwen3-tts-flash-2025-09-18
- Qwen-TTS: qwen-tts (stable version, currently equivalent to qwen-tts-2025-04-10), qwen-tts-latest (latest version, currently equivalent to qwen-tts-2025-05-22), qwen-tts-2025-05-22 (snapshot), qwen-tts-2025-04-10 (snapshot)
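
Because model availability and API keys are tied to a region, it can help to check the combination before sending a request. The sketch below encodes the two lists above as plain data. The Singapore base URL comes from the samples on this page; the Beijing base URL is not stated here and is an assumption, and base_url_for is a hypothetical helper.

```python
COMMON_MODELS = {
    "qwen3-tts-instruct-flash",
    "qwen3-tts-instruct-flash-2026-01-26",
    "qwen3-tts-vd-2026-01-26",
    "qwen3-tts-vc-2026-01-22",
    "qwen3-tts-flash",
    "qwen3-tts-flash-2025-11-27",
    "qwen3-tts-flash-2025-09-18",
}
BEIJING_ONLY_MODELS = {
    "qwen-tts",
    "qwen-tts-latest",
    "qwen-tts-2025-05-22",
    "qwen-tts-2025-04-10",
}

REGIONS = {
    "singapore": ("https://dashscope-intl.aliyuncs.com/api/v1", COMMON_MODELS),
    # Assumption: this endpoint is not given on this page; verify before use.
    "beijing": ("https://dashscope.aliyuncs.com/api/v1", COMMON_MODELS | BEIJING_ONLY_MODELS),
}


def base_url_for(model: str, region: str) -> str:
    """Return the base URL for `region`, or raise if `model` is not listed there."""
    base_url, models = REGIONS[region]
    if model not in models:
        raise ValueError(f"{model} is not listed for region {region}")
    return base_url
```

The returned value could be assigned to dashscope.base_http_api_url, as the samples do for Singapore.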

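Supported system voices

Supported voices differ by model. Set the voice request parameter to a value from the voice parameter column of the table below.

Qwen-TTS voice list

API reference

Non-real-time speech synthesis - Qwen API reference

FAQ

Q: How long is the audio file link valid?

A: The audio file link is valid for 24 hours after it is generated. After the link expires, call the API again to get a new link.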
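
A cached link can be checked before reuse. The streaming sample prints an expires_at field; the sketch below assumes that value is a Unix timestamp in seconds, which this page does not confirm, and link_expired is a hypothetical helper.

```python
import time
from typing import Optional


def link_expired(expires_at: int, now: Optional[float] = None, margin: int = 60) -> bool:
    """Return True if the link has expired or will expire within `margin` seconds.

    Assumes `expires_at` is a Unix timestamp in seconds (unverified).
    """
    current = time.time() if now is None else now
    return current >= expires_at - margin
```

When the check returns True, calling the API again yields a fresh link, as described in the answer above.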