
Alibaba Cloud Model Studio: Real-Time Speech Synthesis - CosyVoice

Updated: Jan 05, 2026

Speech synthesis, also known as text-to-speech (TTS), converts text into natural-sounding speech. Built on machine learning, the technology learns the prosody, intonation, and pronunciation rules of a language from large volumes of speech samples, so that it can produce lifelike speech from any text input.

Core features

  • Generates high-fidelity speech in real time, with natural delivery in Chinese, English, and other languages

  • Provides voice cloning to quickly build personalized custom voices

  • Supports streaming input and output for low-latency, real-time interaction

  • Lets you adjust speech rate, pitch, volume, and bit rate for fine-grained control over delivery (see the sketch after this list)

  • Compatible with mainstream audio formats, with output sample rates up to 48 kHz
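
As a rough illustration of the tunable delivery parameters, the sketch below sets volume, speech rate, and pitch when constructing the synthesizer. It assumes the volume, speech_rate, and pitch_rate constructor parameters of the tts_v2 SpeechSynthesizer; verify the exact names and value ranges against the API reference for your SDK version.

Python

# coding=utf-8
from dashscope.audio.tts_v2 import SpeechSynthesizer

# A minimal sketch; volume/speech_rate/pitch_rate are assumed parameter names,
# to be checked against the API reference for your SDK version.
synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="longanyang",
    volume=80,        # louder than the default
    speech_rate=1.2,  # 20% faster than the default rate
    pitch_rate=0.9,   # slightly lower pitch
)
audio = synthesizer.call("今天天氣怎麼樣?")
with open("tuned.mp3", "wb") as f:
    f.write(audio)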

Availability

  • Supported regions: the Beijing region only; an API Key for that region is required (see the endpoint sketch after this list)

  • Supported models: cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2

  • Supported voices: see the CosyVoice voice list
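
Because only the Beijing region is supported, requests must go to that region's endpoints. The sketch below shows an explicit override; the attribute names and URLs are assumptions (they are the mainland SDK defaults in my understanding), so verify them against the region documentation before relying on them.

Python

import dashscope

# Assumed attribute names and Beijing-region endpoints; these are typically the
# SDK defaults for the mainland edition, so override only if needed.
dashscope.base_http_api_url = "https://dashscope.aliyuncs.com/api/v1"
dashscope.base_websocket_api_url = "wss://dashscope.aliyuncs.com/api-ws/v1/inference"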

Model selection

| Scenario | Recommended model | Reason | Notes |
| --- | --- | --- | --- |
| Brand-voice customization / personalized voice cloning | cosyvoice-v3-plus | Strongest voice cloning, with 48 kHz high-quality output; high quality plus cloning builds a lifelike brand voiceprint | Higher cost ($0.286706 per 10,000 characters); best reserved for core scenarios |
| Intelligent customer service / voice assistants | cosyvoice-v3-flash | Lowest cost ($0.14335 per 10,000 characters); supports streaming interaction and emotional expression; fast response and strong price-performance | - |
| Dialect broadcasting | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports dialects such as Northeastern Mandarin and Hokkien; suited to regional content broadcasts | cosyvoice-v3-plus costs more ($0.286706 per 10,000 characters) |
| Education apps (including formula reading) | cosyvoice-v2, cosyvoice-v3-flash, cosyvoice-v3-plus | Supports LaTeX formula-to-speech; suited to math, physics, and chemistry lessons | cosyvoice-v2 and cosyvoice-v3-plus cost more ($0.286706 per 10,000 characters) |
| Structured broadcasts (news/announcements) | cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2 | Supports SSML control of speed, pauses, and pronunciation for more professional delivery | Requires extra development of SSML generation logic; emotion settings are not supported |
| Precise speech-text alignment (subtitle generation, lesson playback, dictation training) | cosyvoice-v3-flash, cosyvoice-v3-plus, cosyvoice-v2 | Supports timestamp output to synchronize synthesized speech with the source text | Timestamps must be explicitly enabled; they are off by default |
| Multilingual products for overseas markets | cosyvoice-v3-flash, cosyvoice-v3-plus | Supports multiple languages | Sambert does not support streaming input and costs more than cosyvoice-v3-flash |

For more details, see the model feature comparison below.
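
For the brand-voice scenario in the table, a minimal sketch of requesting cosyvoice-v3-plus at 48 kHz follows. It assumes the AudioFormat.WAV_48000HZ_MONO_16BIT enum value and the longanyang voice; verify both against the documented audio formats and the voice list.

Python

# coding=utf-8
from dashscope.audio.tts_v2 import AudioFormat, SpeechSynthesizer

# A minimal sketch; the 48 kHz WAV enum value and the voice name are assumptions
# to verify against the documented format and voice lists.
synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-plus",
    voice="longanyang",
    format=AudioFormat.WAV_48000HZ_MONO_16BIT,
)
audio = synthesizer.call("歡迎收聽本期品牌節目。")
with open("brand.wav", "wb") as f:
    f.write(audio)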

Quick start

The following sample code calls the API. For more code samples covering common scenarios, see GitHub.

You must first obtain an API Key and configure it as an environment variable. If you call the service through an SDK, you must also install the DashScope SDK.
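
As a quick sanity check before running the samples, the sketch below fails fast when the key is missing. The DashScope SDK reads the DASHSCOPE_API_KEY environment variable automatically, so an explicit dashscope.api_key assignment is only needed when that variable is absent.

Python

import os

# Fail fast if the API Key environment variable is not set.
if not os.getenv("DASHSCOPE_API_KEY"):
    raise RuntimeError(
        "Set DASHSCOPE_API_KEY before running the samples, "
        "or assign dashscope.api_key explicitly.")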


Save the synthesized audio to a file

Python

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

# If the API Key is not configured as an environment variable,
# replace your-api-key with your own API Key and uncomment the next line
# dashscope.api_key = "your-api-key"

# Model
# Each model version must be paired with voices of the matching version:
# cosyvoice-v3-flash/cosyvoice-v3-plus: use voices such as longanyang.
# cosyvoice-v2: use voices such as longxiaochun_v2.
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer, passing request parameters such as the model and voice to the constructor
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to synthesize and receive the binary audio
audio = synthesizer.call("今天天氣怎麼樣?")
# Sending the first text establishes the WebSocket connection,
# so first-packet latency includes the connection setup time
print('[Metric] requestId: {}, first-packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio to a local file
with open('output.mp3', 'wb') as f:
    f.write(audio)

Java

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Model
    // Each model version must be paired with voices of the matching version:
    // cosyvoice-v3-flash/cosyvoice-v3-plus: use voices such as longanyang.
    // cosyvoice-v2: use voices such as longxiaochun_v2.
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If the API Key is not configured as an environment variable,
                        // uncomment the next line and replace your-api-key with your own API Key
                        // .apiKey("your-api-key")
                        .model(model) // model
                        .voice(voice) // voice
                        .build();

        // Synchronous mode: disable the callback (pass null as the second argument)
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until the audio is returned
            audio = synthesizer.call("今天天氣怎麼樣?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection when the task ends
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to the local file "output.mp3"
            File file = new File("output.mp3");
            // Sending the first text establishes the WebSocket connection,
            // so first-packet latency includes the connection setup time
            System.out.println(
                    "[Metric] requestId: "
                            + synthesizer.getLastRequestId()
                            + ", first-packet latency (ms): "
                            + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Convert LLM-generated text to speech in real time and play it through the speaker

The following code shows how to play, on a local device, the text content returned in real time by the Qwen large language model (qwen-turbo).

Python

Before running the Python sample, install the third-party audio playback library with pip.

# coding=utf-8
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *


from http import HTTPStatus
from dashscope import Generation

# If the API Key is not configured as an environment variable,
# uncomment the next line and replace apiKey with your own API Key
# dashscope.api_key = "apiKey"

# Each model version must be paired with voices of the matching version:
# cosyvoice-v3-flash/cosyvoice-v3-plus: use voices such as longanyang.
# cosyvoice-v2: use voices such as longxiaochun_v2.
model = "cosyvoice-v3-flash"
voice = "longanyang"


class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("websocket is open.")
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("speech synthesis task complete successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        # stop player
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        print(f"recv speech synthsis message {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self._stream.write(data)


def synthesizer_with_llm():
    callback = Callback()
    synthesizer = SpeechSynthesizer(
        model=model,
        voice=voice,
        format=AudioFormat.PCM_22050HZ_MONO_16BIT,
        callback=callback,
    )

    messages = [{"role": "user", "content": "請介紹一下你自己"}]
    responses = Generation.call(
        model="qwen-turbo",
        messages=messages,
        result_format="message",  # set result format as 'message'
        stream=True,  # enable stream output
        incremental_output=True,  # enable incremental output 
    )
    for response in responses:
        if response.status_code == HTTPStatus.OK:
            print(response.output.choices[0]["message"]["content"], end="")
            synthesizer.streaming_call(response.output.choices[0]["message"]["content"])
        else:
            print(
                "Request id: %s, Status code: %s, error code: %s, error message: %s"
                % (
                    response.request_id,
                    response.status_code,
                    response.code,
                    response.message,
                )
            )
    synthesizer.streaming_complete()
    print('requestId: ', synthesizer.get_last_request_id())


if __name__ == "__main__":
    synthesizer_with_llm()

Java

import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.sound.sampled.*;

public class Main {
    // Each model version must be paired with voices of the matching version:
    // cosyvoice-v3-flash/cosyvoice-v3-plus: use voices such as longanyang.
    // cosyvoice-v2: use voices such as longxiaochun_v2.
    private static String model = "cosyvoice-v3-flash";
    private static String voice = "longanyang";
    public static void process() throws NoApiKeyException, InputRequiredException {
        // Playback thread
        class PlaybackRunnable implements Runnable {
            // Audio format for local playback. Configure it to match your device,
            // the synthesized audio parameters, and your platform. Here it is set
            // to 22050 Hz, 16-bit, mono; choose the sample rate and format that
            // match the model output and your device's capabilities.
            private AudioFormat af = new AudioFormat(22050, 16, 1, true, false);
            private DataLine.Info info = new DataLine.Info(SourceDataLine.class, af);
            private SourceDataLine targetSource = null;
            private AtomicBoolean runFlag = new AtomicBoolean(true);
            private ConcurrentLinkedQueue<ByteBuffer> queue =
                    new ConcurrentLinkedQueue<>();

            // Prepare the player
            public void prepare() throws LineUnavailableException {
                targetSource = (SourceDataLine) AudioSystem.getLine(info);
                targetSource.open(af, 4096);
                targetSource.start();
            }

            public void put(ByteBuffer buffer) {
                queue.add(buffer);
            }

            // Stop playback
            public void stop() {
                runFlag.set(false);
            }

            @Override
            public void run() {
                if (targetSource == null) {
                    return;
                }

                while (runFlag.get()) {
                    if (queue.isEmpty()) {
                        try {
                            Thread.sleep(100);
                        } catch (InterruptedException e) {
                        }
                        continue;
                    }

                    ByteBuffer buffer = queue.poll();
                    if (buffer == null) {
                        continue;
                    }

                    byte[] data = buffer.array();
                    targetSource.write(data, 0, data.length);
                }

                // Play all remaining cache
                if (!queue.isEmpty()) {
                    ByteBuffer buffer = null;
                    while ((buffer = queue.poll()) != null) {
                        byte[] data = buffer.array();
                        targetSource.write(data, 0, data.length);
                    }
                }
                // Release the player
                targetSource.drain();
                targetSource.stop();
                targetSource.close();
            }
        }

        // Create a subclass inheriting from ResultCallback<SpeechSynthesisResult>
        // to implement the callback interface
        class ReactCallback extends ResultCallback<SpeechSynthesisResult> {
            private PlaybackRunnable playbackRunnable = null;

            public ReactCallback(PlaybackRunnable playbackRunnable) {
                this.playbackRunnable = playbackRunnable;
            }

            // Callback when the service side returns the streaming synthesis result
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                // Get the binary data of the streaming result via getAudio
                if (result.getAudioFrame() != null) {
                    // Stream the data to the player
                    playbackRunnable.put(result.getAudioFrame());
                }
            }

            // Callback when the service side completes the synthesis
            @Override
            public void onComplete() {
                // Notify the playback thread to end
                playbackRunnable.stop();
            }

            // Callback when an error occurs
            @Override
            public void onError(Exception e) {
                // Tell the playback thread to end
                System.out.println(e);
                playbackRunnable.stop();
            }
        }

        PlaybackRunnable playbackRunnable = new PlaybackRunnable();
        try {
            playbackRunnable.prepare();
        } catch (LineUnavailableException e) {
            throw new RuntimeException(e);
        }
        Thread playbackThread = new Thread(playbackRunnable);
        // Start the playback thread
        playbackThread.start();
        /*******  Call the Generative AI Model to get streaming text *******/
        // Prepare for the LLM call
        Generation gen = new Generation();
        Message userMsg = Message.builder()
                .role(Role.USER.getValue())
                .content("請介紹一下你自己")
                .build();
        GenerationParam genParam =
                GenerationParam.builder()
                        // If the API Key is not configured as an environment variable,
                        // uncomment the next line and replace apikey with your own API Key
                        // .apiKey("apikey")
                        .model("qwen-turbo")
                        .messages(Arrays.asList(userMsg))
                        .resultFormat(GenerationParam.ResultFormat.MESSAGE)
                        .topP(0.8)
                        .incrementalOutput(true)
                        .build();
        // Prepare the speech synthesis task
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // If the API Key is not configured as an environment variable,
                        // uncomment the next line and replace apikey with your own API Key
                        // .apiKey("apikey")
                        .model(model)
                        .voice(voice)
                        .format(SpeechSynthesisAudioFormat
                                .PCM_22050HZ_MONO_16BIT)
                        .build();
        SpeechSynthesizer synthesizer =
                new SpeechSynthesizer(param, new ReactCallback(playbackRunnable));
        Flowable<GenerationResult> result = gen.streamCall(genParam);
        result.blockingForEach(message -> {
            String text =
                    message.getOutput().getChoices().get(0).getMessage().getContent();
            // Check for null before trimming to avoid a NullPointerException
            if (text != null && !text.trim().isEmpty()) {
                System.out.println("LLM output: " + text);
                synthesizer.streamingCall(text);
            }
        });
        synthesizer.streamingComplete();
        System.out.print("requestId: " + synthesizer.getLastRequestId());
        try {
            // Wait for the playback thread to finish playing all
            playbackThread.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws NoApiKeyException, InputRequiredException {
        process();
        System.exit(0);
    }
}

API reference

Model feature comparison

| Feature | cosyvoice-v3-plus | cosyvoice-v3-flash | cosyvoice-v2 |
| --- | --- | --- | --- |
| Supported languages | Varies by voice: Chinese (Mandarin, Cantonese, and the Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Hokkien, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, and Yunnan dialects), English, French, German, Japanese, Korean, Russian | Same as cosyvoice-v3-plus | Varies by voice: Chinese, English (British and American), Korean, Japanese |
| Audio formats | pcm, wav, mp3, opus | pcm, wav, mp3, opus | pcm, wav, mp3, opus |
| Sample rates | 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz | Same as cosyvoice-v3-plus | Same as cosyvoice-v3-plus |
| Voice cloning | Supported; see the CosyVoice voice cloning API | Supported | Supported |
| SSML | Supported; see the SSML markup reference (cloned voices and marked system voices only) | Supported | Supported |
| LaTeX | Supported; see LaTeX equation-to-speech | Supported | Supported |
| Volume control | Supported | Supported | Supported |
| Speech rate control | Supported | Supported | Supported |
| Pitch control | Supported | Supported | Supported |
| Bit rate control | Supported (opus format only) | Supported (opus format only) | Supported (opus format only) |
| Timestamps | Supported; off by default (cloned voices and marked system voices only) | Supported | Supported |
| Instruct control | Supported (cloned voices and marked system voices only) | Supported (cloned voices and marked system voices only) | Not supported |
| Streaming input | Supported | Supported | Supported |
| Streaming output | Supported | Supported | Supported |
| Rate limit (RPS) | 3 | 3 | 3 |
| Access methods | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API | Java/Python SDK, WebSocket API |
| Price | $0.286706 per 10,000 characters | $0.14335 per 10,000 characters | $0.286706 per 10,000 characters |
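
As a rough illustration of SSML input, the sketch below wraps the text in a <speak> element and inserts a pause with <break>. The tags shown follow common SSML usage; the SSML markup reference defines the tags and voices actually supported, so treat them as assumptions to verify there.

Python

# coding=utf-8
from dashscope.audio.tts_v2 import SpeechSynthesizer

# A minimal SSML sketch; confirm supported tags and voices in the SSML reference.
ssml_text = '<speak>歡迎收聽。<break time="500ms"/>以下是今天的新聞。</speak>'

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call(ssml_text)
with open("ssml_output.mp3", "wb") as f:
    f.write(audio)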

FAQ

Q: What if the synthesized speech mispronounces a word? How do I control the pronunciation of polyphonic characters?

  • Replace the polyphonic character with another character that has the intended pronunciation; this is the quickest fix.

  • Use SSML markup to control the pronunciation (see the sketch below).
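
As a hedged sketch of the SSML approach, the snippet below pins the reading of a polyphonic character with a <phoneme> tag. The tag and its alphabet/ph attributes are assumptions modeled on common SSML usage; consult the SSML markup reference for the exact syntax and the voices that support it.

Python

# coding=utf-8
from dashscope.audio.tts_v2 import SpeechSynthesizer

# Hypothetical <phoneme> usage: force the surname 樂 to read yue4 instead of le4.
# Verify the tag and attribute names against the SSML markup reference.
ssml_text = '<speak>他姓<phoneme alphabet="py" ph="yue4">樂</phoneme>。</speak>'

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call(ssml_text)
with open("phoneme_demo.mp3", "wb") as f:
    f.write(audio)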