
Alibaba Cloud Model Studio: Real-time speech synthesis - CosyVoice

Last Updated: Mar 31, 2026

Speech synthesis, also known as Text-to-Speech (TTS), learns the rhythm, intonation, and pronunciation patterns of a language and generates human-like speech from text input.

Core features

  • Generates high-fidelity speech in real time with support for multiple languages, including Chinese and English.

  • Offers two voice customization methods: voice cloning and voice design.

  • Supports streaming input and output with low latency, ideal for real-time interactive scenarios.

  • Allows adjustment of speech rate, pitch, volume, and bitrate for fine-grained control over voice output.

  • Compatible with mainstream audio formats with output sample rates up to 48 kHz.
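As a minimal sketch of the fine-grained controls listed above, the following helper clamps rate, pitch, and volume before they are passed to a request. The helper name and the exact ranges (0.5-2.0 multipliers for rate and pitch, 0-100 for volume) are assumptions for illustration only; confirm the valid ranges for your model in the parameter reference.

```python
# Hypothetical helper (not part of the SDK): clamp synthesis controls to
# assumed valid ranges before building a request.
def clamp_synthesis_params(rate=1.0, pitch=1.0, volume=50):
    """Return (rate, pitch, volume) clamped to assumed valid ranges."""
    rate = min(max(rate, 0.5), 2.0)    # speech-rate multiplier (assumed 0.5-2.0)
    pitch = min(max(pitch, 0.5), 2.0)  # pitch multiplier (assumed 0.5-2.0)
    volume = min(max(volume, 0), 100)  # volume level (assumed 0-100)
    return rate, pitch, volume
```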

Availability

Supported models:

International

In the international deployment mode, access points and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled worldwide, excluding the Chinese mainland.

When you invoke the following models, select the API key for the Singapore region.

  • CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash

Chinese mainland

In the Chinese mainland deployment mode, access points and data storage are located in the Beijing region. Model inference computing resources are limited to the Chinese mainland.

When you invoke the following models, select an API key for the Beijing region:

  • CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2

For more information, see the Model list.
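The two deployment modes use different base URLs, which the code samples later in this topic switch by hand. The following sketch collects the endpoints quoted in this topic into a single lookup; the dictionary and its region keys are illustrative names, not official identifiers.

```python
# Base URLs for each deployment mode, as quoted in this topic.
ENDPOINTS = {
    "international": {  # Singapore region
        "websocket": "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference",
        "http": "https://dashscope-intl.aliyuncs.com/api/v1",
    },
    "chinese_mainland": {  # Beijing region
        "websocket": "wss://dashscope.aliyuncs.com/api-ws/v1/inference",
        "http": "https://dashscope.aliyuncs.com/api/v1",
    },
}

def endpoint(region, protocol="websocket"):
    """Look up the base URL for a deployment mode and protocol."""
    return ENDPOINTS[region][protocol]
```

Remember that the API keys for the two regions also differ, so the key must match the endpoint you select.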

Model selection

  • Scenario: Voice customization for brand identity, exclusive voice, or extended system voices (based on a text description)
    Recommended: cosyvoice-v3.5-plus
    Reason: Supports voice design, which creates customized voices from text descriptions without audio samples. Ideal for designing brand-exclusive voices from scratch.
    Notes: cosyvoice-v3.5-plus is available only in the Beijing region and does not support system voices.

  • Scenario: Voice customization for brand identity, exclusive voice, or extended system voices (based on audio samples)
    Recommended: cosyvoice-v3.5-plus
    Reason: Supports voice cloning, which quickly clones voices from real audio samples to create human-like brand voiceprints with high fidelity and consistency.
    Notes: cosyvoice-v3.5-plus is available only in the Beijing region and does not support system voices.

  • Scenario: Intelligent customer service / voice assistant
    Recommended: cosyvoice-v3-flash, cosyvoice-v3.5-flash
    Reason: Lower cost than the plus models, with support for streaming interaction and emotional expression, delivering fast responses at an affordable price point.
    Notes: cosyvoice-v3.5-flash is available only in the Beijing region and does not support system voices.

  • Scenario: Regional dialect broadcasting
    Recommended: cosyvoice-v3.5-plus
    Reason: Supports multiple Chinese dialects, such as Northeastern Mandarin and Minnan, making it ideal for localized content broadcasting.
    Notes: cosyvoice-v3.5-plus is available only in the Beijing region and does not support system voices.

  • Scenario: Educational applications (including formula reading)
    Recommended: cosyvoice-v2, cosyvoice-v3-flash, cosyvoice-v3-plus
    Reason: Supports LaTeX formula-to-speech conversion, ideal for mathematics, physics, and chemistry instruction.
    Notes: cosyvoice-v2 and cosyvoice-v3-plus have higher costs ($0.286706 per 10,000 characters).

  • Scenario: Structured voice broadcasting (news/announcements)
    Recommended: cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
    Reason: Supports SSML for controlling speech rate, pauses, and pronunciation to enhance broadcast professionalism.
    Notes: You must implement the SSML generation logic yourself. These models do not support emotion settings.

  • Scenario: Precise speech-text alignment, such as caption generation, lesson playback, and dictation practice
    Recommended: cosyvoice-v3-flash, cosyvoice-v3-plus, cosyvoice-v2
    Reason: Supports timestamp output to synchronize the synthesized speech with the original text.
    Notes: You must enable the timestamp feature manually.

  • Scenario: Multilingual international products
    Recommended: cosyvoice-v3-flash, cosyvoice-v3-plus
    Reason: Supports multiple languages.
    Notes: Capabilities vary by region and model. Before selecting a model, review the Compare models section.

Getting started

The following examples demonstrate how to invoke the API. For more code examples covering common scenarios, see GitHub.

Get an API key and export the API key as an environment variable. If you use an SDK to make calls, install the DashScope SDK.
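For example, on Linux or macOS the setup can look like this (sk-xxx is a placeholder; use the API key for the region you call):

```shell
# Export the API key so the SDK can read it from the environment.
export DASHSCOPE_API_KEY="sk-xxx"
# Install the DashScope Python SDK (run once):
# pip install -U dashscope
```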

CosyVoice

Important

The cosyvoice-v3.5-plus and cosyvoice-v3.5-flash models are currently available only in the Beijing region and are designed specifically for voice design and voice cloning scenarios. They do not support system voices. Before using them for speech synthesis, create your voice with the CosyVoice voice cloning/design API. Then set the voice field in your code to your voice ID and specify the corresponding model in the model field.

Use system voices

The following example demonstrates how to perform speech synthesis using system voices. For more information, see the Voice list.

Save synthesized audio to a file

Python

# coding=utf-8

import os
import dashscope
from dashscope.audio.tts_v2 import *

# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
# Different model versions require corresponding voices:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio.
audio = synthesizer.call("How is the weather today?")
# The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
print('[Metric] Request ID: {}, First packet delay: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))

# Save the audio locally.
with open('output.mp3', 'wb') as f:
    f.write(audio)

Java

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Model
    // Different model versions require corresponding voices:
    // cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
    // cosyvoice-v2: Use voices such as longxiaochun_v2.
    // Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
    private static String model = "cosyvoice-v3-flash";
    // Voice
    private static String voice = "longanyang";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                        // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                        .model(model) // Model
                        .voice(voice) // Voice
                        .build();

        // Synchronous mode: Disable callback (second parameter is null).
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until audio returns.
            audio = synthesizer.call("How is the weather today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection when the task ends.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to the local file "output.mp3".
            File file = new File("output.mp3");
            // The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
            System.out.println(
                    "[Metric] Request ID: "
                            + synthesizer.getLastRequestId()
                            + ", First packet delay (ms): "
                            + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}

Convert LLM-generated text to speech in real time and play it through speakers

The following code shows how to play text content returned in real time from the Qwen large language model (qwen-turbo) on a local device.

Python

Before you run the Python example, install the third-party audio playback library pyaudio using pip.

# coding=utf-8
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *


from http import HTTPStatus
from dashscope import Generation

# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Different model versions require corresponding voices:
# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# cosyvoice-v2: Use voices such as longxiaochun_v2.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
voice = "longanyang"


class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("websocket is open.")
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("speech synthesis task completed successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        # stop player
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        print(f"received speech synthesis message: {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self._stream.write(data)


def synthesizer_with_llm():
    callback = Callback()
    synthesizer = SpeechSynthesizer(
        model=model,
        voice=voice,
        format=AudioFormat.PCM_22050HZ_MONO_16BIT,
        callback=callback,
    )

    messages = [{"role": "user", "content": "Please introduce yourself"}]
    responses = Generation.call(
        model="qwen-turbo",
        messages=messages,
        result_format="message",  # set result format as 'message'
        stream=True,  # enable stream output
        incremental_output=True,  # enable incremental output 
    )
    for response in responses:
        if response.status_code == HTTPStatus.OK:
            print(response.output.choices[0]["message"]["content"], end="")
            synthesizer.streaming_call(response.output.choices[0]["message"]["content"])
        else:
            print(
                "Request id: %s, Status code: %s, error code: %s, error message: %s"
                % (
                    response.request_id,
                    response.status_code,
                    response.code,
                    response.message,
                )
            )
    synthesizer.streaming_complete()
    print('requestId: ', synthesizer.get_last_request_id())


if __name__ == "__main__":
    synthesizer_with_llm()


Use cloned voices


Voice cloning and speech synthesis are two separate but related steps that follow a "create then use" workflow:

  1. Prepare an audio recording file.

    Upload an audio file that meets the requirements specified in Voice cloning: Input audio formats to a publicly accessible location, such as Object Storage Service (OSS), and ensure the URL is publicly accessible.

  2. Create a voice.

    Call the Create voice API. Specify target_model or targetModel to define the speech synthesis model to be used with the created voice.

    If you already have a voice (you can check by calling the Query voice list API), you can skip this step and proceed to the next one.

  3. Use the voice for speech synthesis.

    After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID:

    • This voice_id or voiceID can be used as the voice parameter in the speech synthesis API or various language SDKs for subsequent text-to-speech conversion.

    • Multiple invocation modes are supported, including non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model specified for synthesis must match the target_model or targetModel used when creating the voice, or synthesis will fail.

Sample code:

import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Prepare the environment.
# We recommend that you configure the API key using an environment variable.
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# The following is the WebSocket URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# The following is the HTTP URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'


# 2. Define the cloning parameters.
TARGET_MODEL = "cosyvoice-v3.5-plus" 
# Give the voice a meaningful prefix.
VOICE_PREFIX = "myvoice" # Only digits and lowercase letters are allowed. The prefix must be less than 10 characters in length.
# A publicly accessible audio URL.
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav" # This is a sample URL. Replace it with your own.

# 3. Create a voice (asynchronous task).
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise e
# 4. Poll for the voice status.
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10 # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
        status = voice_info.get("status")
        print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")
        
        if status == "OK":
            print("Voice is ready for synthesis.")
            break
        elif status == "UNDEPLOYED":
            print(f"Voice processing failed with status: {status}. Please check audio quality or contact support.")
            raise RuntimeError(f"Voice processing failed with status: {status}")
        # For intermediate statuses such as "DEPLOYING", continue to wait.
        time.sleep(poll_interval)
    except Exception as e:
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
else:
    print("Polling timed out. The voice is not ready after several attempts.")
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis.
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations, you have successfully cloned and synthesized your own voice!"
    
    # The call() method returns binary audio data.
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")

    # 6. Save the audio file.
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")

except Exception as e:
    print(f"Error during speech synthesis: {e}")

Use designed voices


Voice design and speech synthesis are two separate but related steps that follow a "create then use" workflow:

  1. Prepare the voice description and preview text for voice design.

    • Voice description (voice_prompt): Describes the features of the target voice. For more information, see Voice design: Write high-quality voice descriptions.

    • Preview text (preview_text): The content that the target voice will read for the preview audio, for example, "Hello everyone, welcome."

  2. Call the Create voice API to create a custom voice and retrieve the voice name and preview audio.

    Specify target_model to define the speech synthesis model to be used with the created voice.

    Listen to the preview audio to determine if it meets your expectations. If it does, proceed to the next step. Otherwise, redesign the voice.

    If you already have a voice (you can check by calling the Query voice list API), you can skip this step and proceed to the next one.

  3. Use the voice for speech synthesis.

    After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID:

    • This voice_id or voiceID can be used directly as the voice parameter in the speech synthesis API or various language SDKs for subsequent text-to-speech conversion.

    • Multiple invocation modes are supported, including non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model specified during synthesis must match the target_model or targetModel used when creating the voice, or synthesis will fail.

Sample code:

  1. Generate a custom voice and preview the result. If you are satisfied with the result, proceed to the next step. Otherwise, regenerate the voice.

    Python

    import requests
    import base64
    import os
    
    def create_voice_and_play():
        # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        # If you have not configured environment variables, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        if not api_key:
            print("Error: The DASHSCOPE_API_KEY environment variable is not found. Set the API key.")
            return None, None, None
        
        # Prepare the request data.
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "voice-enrollment",
            "input": {
                "action": "create_voice",
                "target_model": "cosyvoice-v3.5-plus",
                "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
                "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
                "prefix": "announcer"
            },
            "parameters": {
                "sample_rate": 24000,
                "response_format": "wav"
            }
        }
        
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        
        try:
            # Send the request.
            response = requests.post(
                url,
                headers=headers,
                json=data,
                timeout=60  # Add a timeout setting.
            )
            
            if response.status_code == 200:
                result = response.json()
                
                # Get the voice ID.
                voice_id = result["output"]["voice_id"]
                print(f"Voice ID: {voice_id}")
                
                # Get the preview audio data.
                base64_audio = result["output"]["preview_audio"]["data"]
                
                # Decode the Base64 audio data.
                audio_bytes = base64.b64decode(base64_audio)
                
                # Save the audio file to your on-premises device.
                filename = f"{voice_id}_preview.wav"
                
                # Write the audio data to a local file.
                with open(filename, 'wb') as f:
                    f.write(audio_bytes)
                
                print(f"The audio is saved to the local file: {filename}")
                print(f"File path: {os.path.abspath(filename)}")
                
                return voice_id, audio_bytes, filename
            else:
                print(f"Request failed. Status code: {response.status_code}")
                print(f"Response content: {response.text}")
                return None, None, None
                
        except requests.exceptions.RequestException as e:
            print(f"A network request error occurred: {e}")
            return None, None, None
        except KeyError as e:
            print(f"The response data is in an invalid format. The required field is missing: {e}")
            print(f"Response content: {response.text if 'response' in locals() else 'No response'}")
            return None, None, None
        except Exception as e:
            print(f"An unknown error occurred: {e}")
            return None, None, None
    
    if __name__ == "__main__":
        print("Creating the voice...")
        voice_id, audio_data, saved_filename = create_voice_and_play()
        
        if voice_id:
            print(f"\nVoice '{voice_id}' is created.")
            print(f"The audio file is saved: '{saved_filename}'")
            print(f"File size: {os.path.getsize(saved_filename)} bytes")
        else:
            print("\nFailed to create the voice.")

    Java

    This example requires the Gson dependency. If you use Maven or Gradle, add it as follows:

    Maven

    Add the following to your pom.xml file:

    <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.13.1</version>
    </dependency>

    Gradle

    Add the following to your build.gradle file:

    // https://mvnrepository.com/artifact/com.google.code.gson/gson
    implementation("com.google.code.gson:gson:2.13.1")

    Sample code:

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;
    
    public class Main {
        public static void main(String[] args) {
            Main example = new Main();
            example.createVoice();
        }
    
        public void createVoice() {
            // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
            // If you have not configured environment variables, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create a JSON request body string.
            String jsonBody = "{\n" +
                    "    \"model\": \"voice-enrollment\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"create_voice\",\n" +
                    "        \"target_model\": \"cosyvoice-v3.5-plus\",\n" +
                    "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                    "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                    "        \"prefix\": \"announcer\"\n" +
                    "    },\n" +
                    "    \"parameters\": {\n" +
                    "        \"sample_rate\": 24000,\n" +
                    "        \"response_format\": \"wav\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set the request method and headers.
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send the request body.
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get the response.
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response content.
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse the JSON response.
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                    JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
    
                    // Get the voice name.
                    String voiceId = outputObj.get("voice_id").getAsString();
                    System.out.println("Voice ID: " + voiceId);
    
                    // Get the Base64-encoded audio data.
                    String base64Audio = previewAudioObj.get("data").getAsString();
    
                    // Decode the Base64 audio data.
                    byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
    
                    // Save the audio to a local file.
                    String filename = voiceId + "_preview.wav";
                    saveAudioToFile(audioBytes, filename);
    
                    System.out.println("The audio is saved to the local file: " + filename);
    
                } else {
                    // Read the error response.
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed. Status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("A request error occurred: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    
        private void saveAudioToFile(byte[] audioBytes, String filename) {
            try {
                File file = new File(filename);
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audioBytes);
                }
                System.out.println("The audio is saved to: " + file.getAbsolutePath());
            } catch (IOException e) {
                System.err.println("An error occurred while saving the audio file: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
  2. Use the custom voice you generated in the previous step for speech synthesis.

    This step reuses the non-streaming call example code. Replace the voice parameter with the custom voice that you generated through voice design.

    Key principle: The model used for voice design (target_model) must match the model used for subsequent speech synthesis (model), or synthesis will fail.

    Python

    # coding=utf-8
    
    import dashscope
    from dashscope.audio.tts_v2 import *
    import os
    
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
    
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
    
    # Use the same model for voice design and speech synthesis.
    model = "cosyvoice-v3.5-plus"
    # Replace the voice parameter with the custom voice generated by voice design.
    voice = "your_voice"
    
    # Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
    synthesizer = SpeechSynthesizer(model=model, voice=voice)
    # Send the text to be synthesized and get the binary audio.
    audio = synthesizer.call("How is the weather today?")
    # The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
    print('[Metric] Request ID: {}, First packet delay: {} ms'.format(
        synthesizer.get_last_request_id(),
        synthesizer.get_first_package_delay()))
    
    # Save the audio locally.
    with open('output.mp3', 'wb') as f:
        f.write(audio)

    Java

    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
    import com.alibaba.dashscope.utils.Constants;
    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    
    public class Main {
        // Use the same model for voice design and speech synthesis.
        private static String model = "cosyvoice-v3.5-plus";
        // Replace the voice parameter with the custom voice generated by voice design.
        private static String voice = "your_voice_id";
    
        public static void streamAudioDataToSpeaker() {
            // Request parameters
            SpeechSynthesisParam param =
                    SpeechSynthesisParam.builder()
                            // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                            // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                            .model(model) // Model
                            .voice(voice) // Voice
                            .build();
    
            // Synchronous mode: Disable callback (second parameter is null).
            SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
            ByteBuffer audio = null;
            try {
                // Block until audio returns.
                audio = synthesizer.call("How is the weather today?");
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // Close the WebSocket connection when the task ends.
                synthesizer.getDuplexApi().close(1000, "bye");
            }
            if (audio != null) {
                // Save the audio data to the local file "output.mp3".
                File file = new File("output.mp3");
                // The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
                System.out.println(
                        "[Metric] Request ID: "
                                + synthesizer.getLastRequestId()
                                + ", First packet delay (ms): "
                                + synthesizer.getFirstPackageDelay());
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audio.array());
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    
        public static void main(String[] args) {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
            Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
            streamAudioDataToSpeaker();
            System.exit(0);
        }
    }

Voice cloning: Input audio format

Important

Not supported in the Singapore region.

High-quality input audio is the foundation for achieving excellent cloning results.

Supported formats: WAV (16-bit), MP3, M4A

Audio duration: Recommended 10 to 20 seconds; maximum 60 seconds

File size: ≤ 10 MB

Sample rate: ≥ 16 kHz

Sound channel: Mono or stereo. For stereo audio, only the first channel is processed, so make sure that the first channel contains a clear human voice.

Content: The audio must contain at least 5 seconds of continuous, clear speech. The rest of the audio may contain only short pauses (≤ 2 seconds). The entire segment should be free of background music, noise, and other voices. Use normal spoken audio as input; do not upload songs or singing, because they reduce the accuracy and usability of the cloned voice.
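
Most of these constraints can be checked programmatically before upload. The following sketch uses only the Python standard library and covers the measurable items for 16-bit WAV input; the function name is illustrative, and content-level checks (noise, background music, other voices) still require listening:

```python
import os
import wave

def precheck_wav(path):
    """Check a WAV file against the measurable voice-cloning input limits.

    Returns a list of problems; an empty list means the basic checks passed.
    Content quality (noise, background music, other voices) still needs
    manual review.
    """
    problems = []
    if os.path.getsize(path) > 10 * 1024 * 1024:   # file size <= 10 MB
        problems.append("file larger than 10 MB")
    with wave.open(path, "rb") as wf:
        if wf.getframerate() < 16000:              # sample rate >= 16 kHz
            problems.append("sample rate below 16 kHz")
        duration = wf.getnframes() / wf.getframerate()
        if duration > 60:                          # hard maximum: 60 seconds
            problems.append("longer than 60 seconds")
        elif not 10 <= duration <= 20:             # recommended window
            problems.append("outside the recommended 10-20 s range")
        if wf.getnchannels() > 2:                  # mono or stereo only
            problems.append("more than two channels")
    return problems
```

Note that the standard wave module reads only WAV; MP3 and M4A inputs would need a third-party decoder.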

Voice design: How to write high-quality voice descriptions

Important

Not supported in the Singapore region.

Limitations

When writing voice descriptions (voice_prompt), follow these technical constraints:

  • Length limit: The content of voice_prompt must not exceed 500 characters.

  • Supported languages: The description text supports only Chinese and English.
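
Both limits are easy to enforce client-side before sending a request. A minimal sketch, assuming that "Chinese and English" can be approximated as common CJK ideographs plus ASCII text; the helper name is my own, not part of the API:

```python
def validate_voice_prompt(prompt: str) -> None:
    """Raise ValueError if the description breaks the documented limits:
    at most 500 characters, Chinese and English text only."""
    if len(prompt) > 500:
        raise ValueError("voice_prompt exceeds 500 characters")
    for ch in prompt:
        is_cjk = "\u4e00" <= ch <= "\u9fff"   # common CJK ideographs
        if not (is_cjk or ch.isascii()):      # English text, digits, punctuation
            raise ValueError(f"unsupported character: {ch!r}")

validate_voice_prompt("Calm middle-aged male, slow speech rate")  # passes
```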

Core principles

A high-quality voice description (voice_prompt) is essential for creating your ideal voice. It serves as the blueprint for voice design and directly guides the model to generate sounds with specific characteristics.

Follow these core principles when describing voices:

  1. Be specific, not vague: Use words that describe concrete sound qualities, such as "deep," "crisp," or "fast-paced." Avoid subjective, uninformative terms such as "nice-sounding" or "ordinary."

  2. Be multidimensional, not single-dimensional: Excellent descriptions typically combine multiple dimensions, such as gender, age, and emotion. Single-dimensional descriptions, such as "female voice," are too broad to generate distinctive voices.

  3. Be objective, not subjective: Focus on the physical and perceptual characteristics of the sound itself, not your personal preferences. For example, use "high-pitched with energetic delivery" instead of "my favorite voice."

  4. Be original, not imitative: Describe sound characteristics rather than requesting imitation of specific individuals, such as celebrities or actors. Such requests pose copyright risks, and the model does not support direct imitation.

  5. Be concise, not redundant: Ensure every word adds meaning. Avoid repeating synonyms or using meaningless intensifiers, such as "very very nice voice."

Dimension examples

  • Gender: Male, female, neutral

  • Age: Child (5-12 years), teenager (13-18 years), young adult (19-35 years), middle-aged (36-55 years), senior (55+ years)

  • Pitch: High, medium, low, slightly high, slightly low

  • Speech rate: Fast, medium, slow, slightly fast, slightly slow

  • Emotion: Cheerful, calm, gentle, serious, lively, cool, soothing

  • Characteristics: Magnetic, crisp, raspy, mellow, sweet, rich, powerful

  • Purpose: News broadcasting, advertisement voice-over, audiobooks, animated characters, voice assistants, documentary narration

Example comparison

✅ Good cases

  • "Young and lively female voice, fast speech rate with noticeable rising intonation, suitable for introducing fashion products."

    Analysis: This description combines age, personality, speech rate, and intonation, and specifies the use case, creating a clear voice profile.

  • "Calm middle-aged male, slow speech rate, deep and magnetic voice quality, suitable for reading news or documentary narration."

    Analysis: This description clearly defines gender, age range, speech rate, voice quality, and intended use.

  • "Cute child's voice, approximately 8-year-old girl, slightly childish speech, suitable for animated character dubbing."

    Analysis: This description pinpoints the specific age and voice quality (childishness) and has a clear purpose.

  • "Gentle and intellectual female, around 30 years old, calm tone, suitable for audiobook narration."

    Analysis: This description effectively conveys voice emotion and style through terms such as "intellectual" and "calm."

❌ Bad cases and suggestions

  • Bad case: "Nice-sounding voice"

    Main issue: This description is too vague and subjective, and lacks actionable detail.

    Suggestion: Add specific dimensions, such as "Clear-toned young female voice with gentle intonation."

  • Bad case: "Voice like a celebrity"

    Main issue: This poses a copyright risk, and the model does not support direct imitation.

    Suggestion: Extract the voice characteristics into the description, such as "Mature, magnetic, steady-paced male voice."

  • Bad case: "Very very very nice female voice"

    Main issue: This description is redundant. Repeating words does not help define the voice.

    Suggestion: Remove the repetition and add effective descriptors, such as "A 20- to 24-year-old female voice with a light, cheerful tone, lively pitch, and sweet quality."

  • Bad case: "123456"

    Main issue: This is invalid input. It cannot be parsed as voice characteristics.

    Suggestion: Provide a meaningful text description. See the recommended examples above.

API reference

Compare models

International

In the international deployment mode, access points and data storage are located in the Singapore region. Model inference computing resources are dynamically scheduled worldwide, excluding the Chinese mainland.

Feature

cosyvoice-v3-plus

cosyvoice-v3-flash

Supported languages

Varies by system voice: Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean

Varies by system voice: Chinese (Mandarin), English

Audio format

pcm, wav, mp3, opus

Audio sample rate

8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz

Voice cloning

Not supported

Voice design

Not supported

SSML

Supported

This feature applies to cloned voices and system voices in the Voice list marked as supporting SSML.
For usage instructions, see SSML

LaTeX

Supported

For usage instructions, see LaTeX formula-to-speech

Volume adjustment

Supported

See request parameter volume

Speech rate adjustment

Supported

See request parameter speech_rate
In the Java SDK, this parameter is speechRate

Pitch adjustment

Supported

For usage, see the request parameter pitch_rate
In the Java SDK, this parameter is pitchRate

Bitrate adjustment

Supported

Only the opus audio format supports this feature.
For usage instructions, see the request parameter bit_rate
In the Java SDK, this parameter is bitRate

Timestamp

Supported. Disabled by default; can be enabled.

This feature applies to cloned voices and system voices in the Voice list marked as supporting timestamps.
For usage instructions, see request parameter word_timestamp_enabled
In the Java SDK, this parameter is enableWordTimestamp

Instruction control (Instruct)

Not supported

Supported

This feature applies to system voices in the Voice list marked as supporting Instruct.
For more information, see request parameter instruction

Streaming input

Supported

Streaming output

Supported

Rate limiting (RPS)

3

Connection type

Java/Python SDK, WebSocket API

Price

$0.26 per 10,000 characters

$0.13 per 10,000 characters

Chinese mainland

In the Chinese mainland deployment mode, access points and data storage are located in the Beijing region. Model inference computing resources are limited to the Chinese mainland.

Feature

cosyvoice-v3.5-plus

cosyvoice-v3.5-flash

cosyvoice-v3-plus

cosyvoice-v3-flash

cosyvoice-v2

Supported languages

No system voices. Cloned voices support the following languages: Chinese (Mandarin, Cantonese, Henan, Hubei, Minnan, Ningxia, Shaanxi, Shandong, Shanghai, Sichuan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.

Designed voices support the following languages: Chinese (Mandarin) and English.

System voices (varies by voice): Chinese (Mandarin, Northeastern, Minnan, Shaanxi), English, Japanese, Korean

Cloned voices: Chinese (Mandarin), English, French, German, Japanese, Korean, and Russian.

System voices (varies by voice): Chinese (Mandarin), English

Cloned voices: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.

System voices (varies by voice): Chinese (Mandarin), English, Korean, Japanese

Cloned voices: Chinese (Mandarin) and English.

Audio format

pcm, wav, mp3, opus

Audio sample rate

8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz

Voice cloning

Supported

For usage instructions, see CosyVoice voice cloning/design API
The following languages are supported for voice cloning:
cosyvoice-v2: Chinese (Mandarin) and English.
cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.
cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, and Russian.
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan, Hubei, Minnan, Ningxia, Shaanxi, Shandong, Shanghai, Sichuan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, and Vietnamese.

Voice design

Supported

For usage instructions, see CosyVoice voice cloning/design API
The following languages are supported for voice design: Chinese and English.

Not supported

SSML

Supported

This feature applies to cloned voices and system voices in the Voice list marked as supporting SSML.
For usage instructions, see SSML

LaTeX

Supported

For usage instructions, see LaTeX formula-to-speech

Volume adjustment

Supported

For usage instructions, see the request parameter volume

Speech rate adjustment

Supported

For usage, see the request parameter speech_rate
In the Java SDK, this parameter is speechRate

Pitch adjustment

Supported

For more information, see the request parameter pitch_rate
In the Java SDK, this parameter is pitchRate

Bitrate adjustment

Supported

Only the opus audio format supports this feature.
For usage instructions, see request parameter bit_rate
In the Java SDK, this parameter is bitRate

Timestamp

Supported. Disabled by default; can be enabled.

This feature applies to cloned voices and system voices in the Voice list marked as supporting timestamps.
See request parameter word_timestamp_enabled
In the Java SDK, this parameter is enableWordTimestamp

Instruction control (Instruct)

Supported

This feature applies to cloned voices and system voices in the Voice list marked as supporting Instruct
Suitable for scenarios that require exaggerated expressiveness, such as video dubbing and audiobook narration. If you want to preserve the original timbre and prosody, you do not need to enable this feature.
Instruct commands may not take effect if they conflict with the inherent style of the voice. For example, applying a sad instruction to a cheerful voice may not produce the expected result.
For usage instructions, see request parameter instruction

Not supported

Supported

This feature applies to cloned voices and system voices in the Voice list marked as supporting Instruct.
For usage instructions, see the request parameter instruction

Not supported

Streaming input

Supported

Streaming output

Supported

Rate limiting (RPS)

3

Connection type

Java/Python SDK, WebSocket API

Price

$0.22 per 10,000 characters

$0.116 per 10,000 characters

$0.286706 per 10,000 characters

$0.14335 per 10,000 characters

$0.286706 per 10,000 characters

System voices

CosyVoice voice list

FAQ

Q: What should I do if speech synthesis produces incorrect pronunciations? How can I control the pronunciation of characters with multiple pronunciations?

  • As a quick fix, replace characters that have multiple pronunciations with homophones that are pronounced as intended.

  • Use the Speech Synthesis Markup Language (SSML) to control pronunciation.

Q: How do I troubleshoot silent audio output from a cloned voice?

  1. Confirm the voice status.

    Call the CosyVoice voice cloning/design API and check whether the voice status is OK.

  2. Check model version consistency.

    Ensure the target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example:

    • When cloning, use cosyvoice-v3-plus.

    • Also use cosyvoice-v3-plus for synthesis.

  3. Verify source audio quality.

    Check whether the source audio used for voice cloning meets the requirements specified in the CosyVoice voice cloning/design API:

    • Audio duration: 10 to 20 seconds

    • Clear audio quality

    • No background noise

  4. Check the request parameters.

    Confirm that the voice parameter in the speech synthesis request is set to the ID of the cloned voice.

Q: What should I do if the synthesis effect is unstable or the speech is incomplete after voice cloning?

If the synthesized speech after voice cloning has the following issues:

  • Incomplete playback where only part of the text is spoken

  • Inconsistent synthesis quality

  • Abnormal pauses or silent segments in the speech

Possible cause: The source audio quality does not meet the requirements.

Solution: Check whether the source audio meets the following requirements. If not, consider re-recording the audio by following the Recording operation guide:

  • Check audio continuity: Ensure the source audio contains uninterrupted speech with no pauses or silent segments longer than 2 seconds. If the audio contains significant silent gaps, the model may treat the silence or noise as part of the voice profile, degrading output quality.

  • Check speech activity ratio: Ensure active speech comprises at least 60% of the total audio duration. Excessive background noise or non-speech segments can interfere with voice feature extraction.

  • Verify the audio quality details:

    • Audio duration: 10 to 20 seconds (15 seconds is recommended)

    • Clear pronunciation and a stable speech rate

    • No background noise, echo, or static

    • Consistent speech levels with no long silent gaps
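
The continuity and activity-ratio checks above can be approximated offline with a per-frame RMS energy threshold. A rough sketch for 16-bit mono WAV input; the threshold and frame size are assumptions to tune, and this is a crude heuristic rather than a real voice-activity detector:

```python
import array
import wave

def speech_activity(path, frame_ms=50, threshold=500):
    """Estimate the active-speech ratio and the longest silent gap (in
    seconds) of a 16-bit mono WAV file using per-frame RMS energy."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    frame_len = max(1, rate * frame_ms // 1000)
    active = n_frames = gap = longest_gap = 0
    for start in range(0, len(samples), frame_len):
        chunk = samples[start:start + frame_len]
        rms = (sum(s * s for s in chunk) / len(chunk)) ** 0.5
        n_frames += 1
        if rms >= threshold:   # frame contains audible signal
            active += 1
            gap = 0
        else:                  # frame is (near) silent
            gap += 1
            longest_gap = max(longest_gap, gap)
    ratio = active / n_frames if n_frames else 0.0
    return ratio, longest_gap * frame_ms / 1000

# Flag audio that breaks the guidance above:
# ratio < 0.6 (less than 60% speech) or longest gap > 2 seconds.
```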