Alibaba Cloud Model Studio: Real-time speech synthesis - CosyVoice

Last Updated: Feb 11, 2026

Speech synthesis (Text-to-Speech, TTS) converts text into natural speech. This document covers supported models, call methods, and parameter configurations for real-time speech synthesis.

Core features

  • Generates high-fidelity speech in real time, supporting natural pronunciation in multiple languages, including Chinese and English.

  • Provides voice cloning to quickly customize personalized timbres.

  • Supports streaming input and output with low-latency response for real-time interaction.

  • Adjusts speech rate, pitch, volume, and bitrate for fine-grained control.

  • Supports mainstream audio formats with up to 48 kHz sample rate output.

Supported models

International

In International deployment mode, access points and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled globally (excluding Mainland China).

When invoking the following models, select an API key for the Singapore region:

  • CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash

Mainland China

In Mainland China deployment mode, access points and data storage are located in the Beijing region. Model inference computing resources are limited to Mainland China.

When invoking the following models, select an API key for the Beijing region:

  • CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2

See Model list.

Model selection

  • Brand voice customization / Personalized voice cloning service
    Recommended: cosyvoice-v3-plus
    Reason: Strongest voice cloning capability and support for 48 kHz high-quality audio output. High-quality audio and voice cloning create a human-like brand voiceprint.
    Notes: Higher cost ($0.286706/10,000 characters). Use for core scenarios.

  • Intelligent customer service / Voice assistant
    Recommended: cosyvoice-v3-flash
    Reason: Lowest cost ($0.14335/10,000 characters). Supports streaming interaction and emotional expression, with fast response and high cost-effectiveness.

  • Dialect broadcast system
    Recommended: cosyvoice-v3-flash, cosyvoice-v3-plus
    Reason: Supports multiple dialects such as Northeastern Mandarin and Minnan, suitable for local content broadcasting.
    Notes: cosyvoice-v3-plus has a higher cost ($0.286706/10,000 characters).

  • Educational applications (including formula reading)
    Recommended: cosyvoice-v2, cosyvoice-v3-flash, cosyvoice-v3-plus
    Reason: Supports LaTeX formula-to-speech, suitable for explaining math, physics, and chemistry courses.
    Notes: cosyvoice-v2 and cosyvoice-v3-plus have higher costs ($0.286706/10,000 characters).

  • Structured voice broadcasting (news/announcements)
    Recommended: cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
    Reason: Supports SSML to control speech rate, pauses, and pronunciation, enhancing broadcast professionalism.
    Notes: Requires additional development for SSML generation logic. Does not support emotion settings.

  • Precise speech-text alignment (such as caption generation, teaching playback, dictation training)
    Recommended: cosyvoice-v3-flash, cosyvoice-v3-plus, cosyvoice-v2
    Reason: Supports timestamp output for synchronizing synthesized speech with the original text.
    Notes: The timestamp feature is disabled by default and must be explicitly enabled.

  • Multilingual overseas products
    Recommended: cosyvoice-v3-flash, cosyvoice-v3-plus
    Reason: Supports multiple languages.
    Notes: Sambert does not support streaming input and is more expensive than cosyvoice-v3-flash.

Capabilities vary across regions and models. Review Model feature comparison before selecting a model.

Getting started

Sample code for calling an API is provided below. For more code examples, see GitHub.

You must obtain an API key and set the API key as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.
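
For reference, a minimal environment setup on Linux or macOS might look like the following (the dashscope package name and the DASHSCOPE_API_KEY variable match the sample code below; replace the placeholder key with your own):

# Install or upgrade the DashScope Python SDK
#   pip install -U dashscope
# Expose the API key to the sample code through an environment variable
#   export DASHSCOPE_API_KEY="sk-xxx"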

CosyVoice

Save synthesized audio to a file

Python

# coding=utf-8

import os
import dashscope
from dashscope.audio.tts_v2 import *

# API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
# Different model versions require corresponding voices:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For details, see CosyVoice Voice List.
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio.
audio = synthesizer.call("How is the weather today?")
# The first time you send text, a WebSocket connection is established. Therefore, the first packet delay includes the connection establishment time.
print('[Metric] requestId: {} , first packet delay: {} milliseconds'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio locally.
with open('output.mp3', 'wb') as f:
    f.write(audio)

Java

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Model
    // Different model versions require corresponding voices:
    // cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
    // cosyvoice-v2: Use voices such as longxiaochun_v2.
    // Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For details, see CosyVoice Voice List.
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // API keys differ between the Singapore and Beijing regions. Get an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                        // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();

        // Synchronous mode: Disable callback (second parameter is null).
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until audio returns.
            audio = synthesizer.call("How is the weather today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection when the task ends.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to the local file "output.mp3".
            File file = new File("output.mp3");
            // The first time you send text, a WebSocket connection is established. Therefore, the first packet delay includes the connection establishment time.
            System.out.println(
                    "[Metric] requestId: "
                            + synthesizer.getLastRequestId()
                            + ", first packet delay (milliseconds): "
                            + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Convert text from an LLM to speech in real time and play it through a speaker

The following code shows how to convert streaming text output from the Qwen large language model (qwen-turbo) to speech in real time and play it through a local speaker.

Python

Before running the Python example, install the third-party audio playback library pyaudio using pip.

# coding=utf-8
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *


from http import HTTPStatus
from dashscope import Generation

# API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you have not configured the environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following URL is for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Different model versions require corresponding voice versions:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
# Each voice supports different languages. To synthesize non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
voice = "longanyang"


class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("websocket is open.")
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("speech synthesis task complete successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        # stop player
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        print(f"recv speech synthsis message {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self._stream.write(data)


def synthesizer_with_llm():
    callback = Callback()
    synthesizer = SpeechSynthesizer(
        model=model,
        voice=voice,
        format=AudioFormat.PCM_22050HZ_MONO_16BIT,
        callback=callback,
    )

    messages = [{"role": "user", "content": "Please introduce yourself"}]
    responses = Generation.call(
        model="qwen-turbo",
        messages=messages,
        result_format="message",  # set result format as 'message'
        stream=True,  # enable stream output
        incremental_output=True,  # enable incremental output 
    )
    for response in responses:
        if response.status_code == HTTPStatus.OK:
            print(response.output.choices[0]["message"]["content"], end="")
            synthesizer.streaming_call(response.output.choices[0]["message"]["content"])
        else:
            print(
                "Request id: %s, Status code: %s, error code: %s, error message: %s"
                % (
                    response.request_id,
                    response.status_code,
                    response.code,
                    response.message,
                )
            )
    synthesizer.streaming_complete()
    print('requestId: ', synthesizer.get_last_request_id())


if __name__ == "__main__":
    synthesizer_with_llm()

API reference

Model feature comparison

International

In the international deployment mode, the endpoint and data storage are both located in the Singapore region. Model inference computing resources are dynamically scheduled worldwide, excluding Mainland China.

  • Supported languages
    cosyvoice-v3-plus: varies by system voice; Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean
    cosyvoice-v3-flash: varies by system voice; Chinese (Mandarin), English

  • Audio format: pcm, wav, mp3, opus

  • Audio sampling rate: 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz

  • Voice cloning: Not supported

  • SSML: Supported. This feature is available for cloned voices and system voices marked as SSML-compatible in the voice list. See Introduction to SSML.

  • LaTeX: Supported. See Convert LaTeX formulas to speech.

  • Volume adjustment: Supported. See the volume parameter.

  • Speech rate adjustment: Supported. See the speech_rate parameter (speechRate in the Java SDK).

  • Pitch adjustment: Supported. See the pitch_rate parameter (pitchRate in the Java SDK).

  • Bitrate adjustment: Supported for audio in the opus format only. See the bit_rate parameter (bitRate in the Java SDK).

  • Timestamp: Supported; disabled by default and can be enabled. This feature is available for cloned voices and system voices marked as timestamp-compatible in the voice list. See the word_timestamp_enabled parameter (enableWordTimestamp in the Java SDK).

  • Instruction control (Instruct): Supported. This feature is available for cloned voices and system voices marked as Instruct-compatible in the voice list. See the instruction parameter.

  • Streaming input: Supported

  • Streaming output: Supported

  • Rate limit (RPS): 3

  • Connection type: Java/Python SDK, WebSocket API

  • Price
    cosyvoice-v3-plus: $0.26/10,000 characters
    cosyvoice-v3-flash: $0.13/10,000 characters
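
As a reference for the adjustment parameters listed above, the following minimal Python sketch passes format, volume, speech_rate, and pitch_rate when constructing the synthesizer. The parameter names come from this comparison; the specific values and the AudioFormat member used here are illustrative assumptions, so check the API reference for exact ranges and supported formats.

# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer, AudioFormat

dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# Singapore region endpoint; for the Beijing region, use wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Assumption: volume, speech_rate, and pitch_rate are accepted by the constructor
# alongside model, voice, and format. The values below are only for illustration.
synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="longanyang",
    format=AudioFormat.WAV_48000HZ_MONO_16BIT,  # 48 kHz output; see the sampling rates above
    volume=60,          # louder than the default
    speech_rate=1.1,    # slightly faster than normal
    pitch_rate=1.0,     # default pitch
)

audio = synthesizer.call("This sentence demonstrates volume, speech rate, and pitch adjustment.")
with open('output_adjusted.wav', 'wb') as f:
    f.write(audio)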

Mainland China

In the Mainland China deployment mode, the endpoint and data storage are both located in the Beijing region. Model inference computing resources are restricted to Mainland China.

  • Supported languages
    cosyvoice-v3-plus:
      System voices (varies by voice): Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean
      Cloned voices: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian
    cosyvoice-v3-flash:
      System voices (varies by voice): Chinese (Mandarin), English
      Cloned voices: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian
    cosyvoice-v2:
      System voices (varies by voice): Chinese (Mandarin), English, Korean, Japanese
      Cloned voices: Chinese (Mandarin), English

  • Audio format: pcm, wav, mp3, opus

  • Audio sampling rate: 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz

  • Voice cloning: Supported. See CosyVoice Voice Cloning API. The languages supported by voice cloning are as follows:
    cosyvoice-v2: Chinese (Mandarin), English
    cosyvoice-v3-flash, cosyvoice-v3-plus: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian

  • SSML: Supported. This feature is available for cloned voices and system voices marked as SSML-compatible in the voice list. See Introduction to SSML.

  • LaTeX: Supported. See Convert LaTeX formulas to speech.

  • Volume adjustment: Supported. See the volume parameter.

  • Speech rate adjustment: Supported. See the speech_rate parameter (speechRate in the Java SDK).

  • Pitch adjustment: Supported. See the pitch_rate parameter (pitchRate in the Java SDK).

  • Bitrate adjustment: Supported for audio in the opus format only. See the bit_rate parameter (bitRate in the Java SDK).

  • Timestamp: Supported; disabled by default and can be enabled. This feature is available for cloned voices and system voices marked as timestamp-compatible in the voice list. See the word_timestamp_enabled parameter (enableWordTimestamp in the Java SDK).

  • Instruction control (Instruct)
    cosyvoice-v3-plus, cosyvoice-v3-flash: Supported. This feature is available for cloned voices and system voices marked as Instruct-compatible in the voice list. See the instruction parameter.
    cosyvoice-v2: Not supported

  • Streaming input: Supported

  • Streaming output: Supported

  • Rate limit (RPS): 3

  • Connection type: Java/Python SDK, WebSocket API

  • Price
    cosyvoice-v3-plus: $0.286706/10,000 characters
    cosyvoice-v3-flash: $0.14335/10,000 characters
    cosyvoice-v2: $0.286706/10,000 characters
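
As an illustration of the timestamp feature listed above, the sketch below enables word-level timestamps and prints the messages received in the callback, which is where timestamp details are delivered. It assumes that word_timestamp_enabled (the parameter named in this comparison) is passed as a constructor argument; confirm the exact parameter placement and message format in the API reference.

# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback, AudioFormat

dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# Beijing region endpoint; for the Singapore region, use wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'


class TimestampCallback(ResultCallback):
    def on_open(self):
        print("connection opened")

    def on_complete(self):
        print("synthesis completed")

    def on_error(self, message: str):
        print(f"synthesis failed: {message}")

    def on_close(self):
        print("connection closed")

    def on_event(self, message):
        # Timestamp details arrive in the synthesis event messages.
        print(f"event: {message}")

    def on_data(self, data: bytes) -> None:
        # Binary audio chunks; ignored in this sketch.
        pass


synthesizer = SpeechSynthesizer(
    model="cosyvoice-v3-flash",
    voice="longanyang",
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    word_timestamp_enabled=True,  # assumption: disabled by default, enabled here by name
    callback=TimestampCallback(),
)
synthesizer.streaming_call("Timestamps align the synthesized audio with the original text.")
synthesizer.streaming_complete()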

Supported system voices

CosyVoice voice list

FAQ

Q: What if speech synthesis mispronounces words? How do I control the pronunciation of homographs?

  • You can replace polyphonic characters with homophones to quickly fix pronunciation problems.

  • Use SSML markup language to control pronunciation, as shown in the sketch below.
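
A minimal sketch of the SSML approach is shown below. It assumes that the SSML document is passed directly as the text argument of a synchronous call and that the chosen voice is marked as SSML-compatible in the voice list; the <speak> and <break> tags are used here only as an illustration, and the full tag set is described in Introduction to SSML.

# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer

dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# Beijing region endpoint; for the Singapore region, use wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'

# Assumption: the SSML document is sent as ordinary text wrapped in <speak> tags,
# and the voice below is only a placeholder for one marked as SSML-compatible.
ssml_text = (
    '<speak>'
    'The next word is followed by a deliberate pause.'
    '<break time="500ms"/>'
    'Pauses and pronunciation can be controlled with SSML markup.'
    '</speak>'
)

synthesizer = SpeechSynthesizer(model="cosyvoice-v2", voice="longxiaochun_v2")
audio = synthesizer.call(ssml_text)
with open('ssml_output.mp3', 'wb') as f:
    f.write(audio)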

Q: How do I troubleshoot if audio generated with a cloned voice has no sound?

  1. Check the voice status

    Call the Query a specific voice operation to check whether the voice's status is OK.

  2. Check model version consistency

    Ensure that the target_model parameter used for voice cloning is identical to the model parameter used for speech synthesis (see the sketch after this list). For example:

    • Use cosyvoice-v3-plus for cloning.

    • You must also use cosyvoice-v3-plus for synthesis.

  3. Verify source audio quality

    Verify that the source audio used for voice cloning meets the audio requirements:

    • Audio duration: 10-20 seconds

    • Clear sound quality

    • No background noise

  4. Check request parameters

    Confirm that the speech synthesis request parameter voice is set to the cloned voice ID.
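
The sketch below illustrates the model-consistency check from step 2: one variable holds the model name so that cloning (target_model) and synthesis (model) cannot drift apart. It assumes the VoiceEnrollmentService class and its create_voice method from the DashScope Python SDK, a placeholder recording URL, and the Beijing region endpoint (voice cloning is listed as supported in the Mainland China deployment mode); treat it as a sketch rather than a complete cloning workflow.

# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer, VoiceEnrollmentService

dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# Beijing region endpoint
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'

# Use one variable for the model so cloning and synthesis stay consistent.
model = "cosyvoice-v3-plus"

# Clone a voice from a short, clean recording (the URL is a placeholder).
service = VoiceEnrollmentService()
voice_id = service.create_voice(
    target_model=model,  # must match the synthesis model below
    prefix="demo",       # prefix used in the generated voice ID
    url="https://example.com/sample-voice.wav",
)
print("cloned voice id:", voice_id)

# Synthesize with the same model and the cloned voice ID.
synthesizer = SpeechSynthesizer(model=model, voice=voice_id)
audio = synthesizer.call("Testing the cloned voice with a matching model version.")
with open('cloned_voice_output.mp3', 'wb') as f:
    f.write(audio)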

Q: How do I handle unstable synthesis effects or incomplete speech after voice cloning?

If the synthesized speech after voice cloning has these issues:

  • Incomplete speech playback, where only part of the text is read.

  • Unstable synthesis effect or inconsistent quality.

  • Speech contains abnormal pauses or silent segments.

Possible reason: The source audio quality does not meet requirements.

Solution: Check whether the source audio meets these requirements. We recommend re-recording by following the Recording guide.

  • Check audio continuity: Ensure continuous speech content in the source audio. Avoid long pauses or silent segments (over 2 seconds). If the audio contains obvious blank segments, the model may interpret silence or noise as part of the voice characteristics, affecting the generation quality.

  • Check speech activity ratio: Ensure effective speech accounts for more than 60% of the total audio duration. Excessive background noise or non-speech segments can interfere with voice characteristic extraction.

  • Verify audio quality details:

    • Audio duration: 10-20 seconds (15 seconds recommended)

    • Clear pronunciation, stable speech rate

    • No background noise, echo, or static

    • Concentrated speech energy, no long silent segments