
Alibaba Cloud Model Studio:CosyVoice Voice Cloning and Voice Design API

Last Updated:Mar 06, 2026

The CosyVoice voice cloning service uses generative large speech models to create highly similar, natural-sounding custom voices from just 10–20 seconds of sample audio, with no traditional training required. Voice design generates custom voices from text descriptions and supports multilingual, multidimensional definitions of voice characteristics. Use cases include ad narration, character voice creation, and audiobook production. Voice cloning (or voice design) and speech synthesis are two sequential steps: you first create a voice, then use it for synthesis. This document covers the parameters and interfaces for voice cloning and voice design. For speech synthesis, see Real-time Speech Synthesis – CosyVoice/Sambert.

User Guide: For model introductions and selection guidance, see Real-time Speech Synthesis – CosyVoice/Sambert.

Important
  • This document covers only the CosyVoice voice cloning and voice design APIs. If you use Qwen models, see Voice Cloning (Qwen) and Voice Design (Qwen).

  • CosyVoice voice design uses the FunAudioGen-VD model.

Supported Models

  • Voice cloning:

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash

    • cosyvoice-v3-plus, cosyvoice-v3-flash

    • cosyvoice-v2

  • Voice design:

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash

    • cosyvoice-v3-plus, cosyvoice-v3-flash

Supported Languages

  • Voice cloning: Depends on the target speech synthesis model (specified by the target_model/targetModel parameter):

    • cosyvoice-v2: Chinese (Mandarin), English

    • cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese

    • cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan, Hubei, Minnan, Ningxia, Shaanxi, Shandong, Shanghai, Sichuan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese

    Voice cloning does not currently support Spanish, Italian, or other languages.

  • Voice design: Chinese, English.
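For client-side pre-checks, the language lists above can be captured in a small lookup table keyed by the ISO-style codes used later by the language_hints parameter. A minimal sketch; the table and helper name are illustrative, not part of the API:

```python
# Language support per target synthesis model, transcribed from the lists above.
# Dialects (Cantonese, Sichuan, etc.) fall under "zh" for language purposes.
SUPPORTED_LANGUAGES = {
    "cosyvoice-v2": {"zh", "en"},
    "cosyvoice-v3-plus": {"zh", "en", "fr", "de", "ja", "ko", "ru"},
    "cosyvoice-v3-flash": {"zh", "en", "fr", "de", "ja", "ko", "ru", "pt", "th", "id", "vi"},
    "cosyvoice-v3.5-plus": {"zh", "en", "fr", "de", "ja", "ko", "ru", "pt", "th", "id", "vi"},
    "cosyvoice-v3.5-flash": {"zh", "en", "fr", "de", "ja", "ko", "ru", "pt", "th", "id", "vi"},
}

def supports_language(target_model: str, lang: str) -> bool:
    """Return True if the language code is listed for the given target model."""
    return lang in SUPPORTED_LANGUAGES.get(target_model, set())

print(supports_language("cosyvoice-v3-plus", "th"))   # False
print(supports_language("cosyvoice-v3.5-plus", "th")) # True
```

Checking locally avoids a failed API round trip when a request targets an unsupported model/language pair.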

Quick Start: From Voice Cloning to Speech Synthesis

Voice cloning and speech synthesis are two independent yet closely related steps. Follow the “create first, use later” workflow:

  1. Prepare an audio recording file

    Upload an audio file that meets the input audio format requirements for voice cloning to a publicly accessible location, such as Alibaba Cloud OSS. You must ensure the URL is publicly accessible.

  2. Create a voice

    Call the Create voice API. You must specify target_model/targetModel to declare which speech synthesis model will drive the created voice.

    If you already have a created voice (check using the List voices API), skip this step and go to the next one.

  3. Use the voice for speech synthesis

    After successfully creating a voice using the Create voice API, the system returns a voice_id/voiceID:

    • You can use this voice_id/voiceID directly as the voice parameter in the speech synthesis API or in SDKs for text-to-speech.

    • It supports multiple invocation modes: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model used during synthesis must match the target_model/targetModel used when creating the voice. Otherwise, synthesis fails.

Sample code:

import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Environment preparation
# We recommend configuring your API key as an environment variable
# API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you haven’t configured an environment variable, replace the next line with: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# This is the Singapore-region WebSocket URL. If you use a Beijing-region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# This is the Singapore-region HTTP URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'


# 2. Define cloning parameters
TARGET_MODEL = "cosyvoice-v3.5-plus" 
# Give your voice a meaningful prefix
VOICE_PREFIX = "myvoice" # Letters and numbers only, up to 10 characters.
# Publicly accessible audio URL
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav" # Example URL. Replace with your own.

# 3. Create a voice (asynchronous task)
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise e
# 4. Poll for voice status
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10 # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
        status = voice_info.get("status")
    except Exception as e:
        # Transient query errors: log and retry on the next interval
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
        continue
    print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")

    if status == "OK":
        print("Voice is ready for synthesis.")
        break
    elif status == "UNDEPLOYED":
        # Raised outside the try block above so the failure is not swallowed by the retry handler
        raise RuntimeError(f"Voice processing failed with status: {status}. Check audio quality or contact support.")
    # Keep waiting for intermediate statuses like "DEPLOYING"
    time.sleep(poll_interval)
else:
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations! You’ve successfully cloned and synthesized your own voice!"
    
    # The call() method returns binary audio data
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")

    # 6. Save audio file
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")

except Exception as e:
    print(f"Error during speech synthesis: {e}")

Quick Start: From Voice Design to Speech Synthesis

Voice design and speech synthesis are two independent but closely linked steps. Follow the “create first, use later” workflow:

  1. Prepare the voice description and preview text for voice design.

    • Voice prompt (voice_prompt): Defines the target voice’s characteristics. For more information, see “Voice design: How to write high-quality voice prompts?”.

    • Preview text (preview_text): The text read aloud in the preview audio (such as “Hello everyone, welcome to listen.”).

  2. Call the Create voice API to create a custom voice and obtain the voice name and preview audio.

    You must specify target_model to declare which speech synthesis model will drive the created voice.

    You can listen to the preview audio to check if it meets expectations. If it meets expectations, proceed to the next step. Otherwise, redesign it.

    If you already have a created voice (check using the List voices API), skip this step and go to the next one.

  3. Use the voice for speech synthesis

    After successfully creating a voice using the Create voice API, the system returns a voice_id/voiceID:

    • You can use this voice_id/voiceID directly as the voice parameter in the speech synthesis API or in SDKs for text-to-speech.

    • It supports multiple invocation modes: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model used during synthesis must match the target_model/targetModel used when creating the voice. Otherwise, synthesis fails.

Sample code:

  1. Create a custom voice and preview it. If you are satisfied, proceed. Otherwise, recreate it.

    Python

    import requests
    import base64
    import os
    
    def create_voice_and_play():
        # API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If you haven’t configured an environment variable, replace the next line with: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        if not api_key:
            print("Error: DASHSCOPE_API_KEY environment variable not found. Please set your API key first.")
            return None, None, None
        
        # Prepare request data
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "voice-enrollment",
            "input": {
                "action": "create_voice",
                "target_model": "cosyvoice-v3.5-plus",
                "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
                "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
                "prefix": "announcer"
            },
            "parameters": {
                "sample_rate": 24000,
                "response_format": "wav"
            }
        }
        
        # This is the Singapore-region URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        
        try:
            # Send request
            response = requests.post(
                url,
                headers=headers,
                json=data,
                timeout=60  # Add timeout setting
            )
            
            if response.status_code == 200:
                result = response.json()
                
                # Get voice ID
                voice_id = result["output"]["voice_id"]
                print(f"Voice ID: {voice_id}")
                
                # Get preview audio data
                base64_audio = result["output"]["preview_audio"]["data"]
                
                # Decode Base64 audio data
                audio_bytes = base64.b64decode(base64_audio)
                
                # Save audio file locally
                filename = f"{voice_id}_preview.wav"
                
                # Write audio data to local file
                with open(filename, 'wb') as f:
                    f.write(audio_bytes)
                
                print(f"Audio saved to local file: {filename}")
                print(f"File path: {os.path.abspath(filename)}")
                
                return voice_id, audio_bytes, filename
            else:
                print(f"Request failed. Status code: {response.status_code}")
                print(f"Response: {response.text}")
                return None, None, None
                
        except requests.exceptions.RequestException as e:
            print(f"Network request error: {e}")
            return None, None, None
        except KeyError as e:
            print(f"Response format error. Missing required field: {e}")
            print(f"Response: {response.text if 'response' in locals() else 'No response'}")
            return None, None, None
        except Exception as e:
            print(f"Unknown error: {e}")
            return None, None, None
    
    if __name__ == "__main__":
        print("Starting voice creation...")
        voice_id, audio_data, saved_filename = create_voice_and_play()
        
        if voice_id:
            print(f"\nSuccessfully created voice '{voice_id}'")
            print(f"Audio file saved: '{saved_filename}'")
            print(f"File size: {os.path.getsize(saved_filename)} bytes")
        else:
            print("\nVoice creation failed")

    Java

    This example requires the Gson dependency. If you use Maven or Gradle, add it as follows:

    Maven

    Add the following to your pom.xml:

    <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.13.1</version>
    </dependency>

    Gradle

    Add the following to your build.gradle:

    // https://mvnrepository.com/artifact/com.google.code.gson/gson
    implementation("com.google.code.gson:gson:2.13.1")

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;
    
    public class Main {
        public static void main(String[] args) {
            Main example = new Main();
            example.createVoice();
        }
    
        public void createVoice() {
            // API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If you haven’t configured an environment variable, replace the next line with: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create JSON request body string
            String jsonBody = "{\n" +
                    "    \"model\": \"voice-enrollment\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"create_voice\",\n" +
                    "        \"target_model\": \"cosyvoice-v3.5-plus\",\n" +
                    "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                    "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                    "        \"prefix\": \"announcer\"\n" +
                    "    },\n" +
                    "    \"parameters\": {\n" +
                    "        \"sample_rate\": 24000,\n" +
                    "        \"response_format\": \"wav\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // This is the Singapore-region URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set request method and headers
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send request body
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get response
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read response content
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse JSON response
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                    JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
    
                    // Get voice ID
                    String voiceId = outputObj.get("voice_id").getAsString();
                    System.out.println("Voice ID: " + voiceId);
    
                    // Get Base64-encoded audio data
                    String base64Audio = previewAudioObj.get("data").getAsString();
    
                    // Decode Base64 audio data
                    byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
    
                    // Save audio to local file
                    String filename = voiceId + "_preview.wav";
                    saveAudioToFile(audioBytes, filename);
    
                    System.out.println("Audio saved to local file: " + filename);
    
                } else {
                    // Read error response
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed. Status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("Request error: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    
        private void saveAudioToFile(byte[] audioBytes, String filename) {
            try {
                File file = new File(filename);
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audioBytes);
                }
                System.out.println("Audio saved to: " + file.getAbsolutePath());
            } catch (IOException e) {
                System.err.println("Error saving audio file: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
  2. Use the custom voice created in the previous step for speech synthesis.

    This example builds on the non-streaming call example. You must replace the voice parameter with the custom voice ID generated by voice design.

    Key rule: The model used for voice design (target_model) must match the model used for speech synthesis (model). Otherwise, synthesis fails.

    Python

    # coding=utf-8
    
    import dashscope
    from dashscope.audio.tts_v2 import *
    import os
    
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
    
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
    
    # The same model must be used for voice design and speech synthesis.
    model = "cosyvoice-v3.5-plus"
    # Replace the voice parameter with the custom voice generated from voice design.
    voice = "your_voice_id"
    
    # Instantiate SpeechSynthesizer and pass request parameters such as model and voice to the constructor.
    synthesizer = SpeechSynthesizer(model=model, voice=voice)
    # Send the text to be synthesized to obtain the binary audio.
    audio = synthesizer.call("How is the weather today?")
    # When you send text for the first time, a WebSocket connection needs to be established. Therefore, the first-package latency includes the connection establishment time.
    print('[Metric] Request ID: {}, First-package latency: {} ms'.format(
        synthesizer.get_last_request_id(),
        synthesizer.get_first_package_delay()))
    
    # Save the audio to a local file.
    with open('output.mp3', 'wb') as f:
        f.write(audio)

    Java

    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
    import com.alibaba.dashscope.utils.Constants;
    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    
    public class Main {
        // Use the same model for voice design and speech synthesis
        private static String model = "cosyvoice-v3.5-plus";
        // Replace the voice parameter with the custom voice ID generated by voice design
        private static String voice = "your_voice_id";
    
        public static void streamAudioDataToSpeaker() {
            // Request parameters
            SpeechSynthesisParam param =
                    SpeechSynthesisParam.builder()
                            // API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                            // If you haven’t configured an environment variable, replace the next line with: .apiKey("sk-xxx")
                            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                            .model(model) // Model
                            .voice(voice) // Voice
                            .build();
    
            // Sync mode: disable callback (second parameter is null)
            SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
            ByteBuffer audio = null;
            try {
                // Block until audio returns
                audio = synthesizer.call("How is the weather today?");
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // Close WebSocket connection after task ends
                synthesizer.getDuplexApi().close(1000, "bye");
            }
            if (audio != null) {
                // Save audio data to local file “output.mp3”
                File file = new File("output.mp3");
                // The first packet delay includes WebSocket connection setup time
                System.out.println(
                        "[Metric] requestId: "
                                + synthesizer.getLastRequestId()
                                + ", first packet delay (ms): "
                                + synthesizer.getFirstPackageDelay());
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audio.array());
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    
        public static void main(String[] args) {
            // This is the Singapore-region URL. If you use a Beijing-region model, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
            Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
            streamAudioDataToSpeaker();
            System.exit(0);
        }
    }

API Reference

Use the same Alibaba Cloud account for all API operations.

Important

The Java and Python DashScope SDKs do not support voice design. For voice design, use the RESTful API.

Create Voice

RESTful API

  • URL

    Mainland China:

    POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

    International:

    POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
  • Request Headers

    Parameter

    Type

    Required

    Description

    Authorization

    string

    Yes

    Authentication token. Format: Bearer <your_api_key>. Replace "<your_api_key>" with your actual API key.

    Content-Type

    string

    Yes

    Media type of data in the request body. Fixed value: application/json.

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs.

    Important

    Note the difference between these parameters:

    • model: Voice cloning/design model. Fixed value: voice-enrollment

    • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

    Voice Cloning

    {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "prefix": "myvoice",
            "url": "https://yourAudioFileUrl",
            "language_hints": ["zh"]
        }
    }

    Voice Design

    {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer",
            "language_hints": ["zh"]
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }
  • Request Parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: create_voice.

    target_model

    string

    -

    Yes

    Speech synthesis model that drives the voice (see Supported Models).

    Must match the speech synthesis model used later. Otherwise, synthesis fails.

    url

    string

    -

    Conditional

    Important

    Required only for voice cloning

    Publicly accessible URL of the audio file used for voice cloning.

    For audio format details, see the input audio format requirements for voice cloning.

    For recording guidance, see Recording Guide.

    voice_prompt

    string

    -

    Conditional

    Important

    Required only for voice design

    Voice description. Maximum length: 500 characters.

    Chinese and English only.

    For guidance on writing voice descriptions, see "Voice design: How to write high-quality voice prompts?".

    preview_text

    string

    -

    Conditional

    Important

    Required only for voice design

    Text for the preview audio. Maximum length: 200 characters.

    Supported languages: Chinese (zh), English (en).

    prefix

    string

    -

    Yes

    Name for the voice (letters and numbers only; up to 10 characters). Use identifiers related to role or scenario.

    This keyword appears in the final voice name. For example, if the keyword is "announcer", the final voice names are:
    Voice cloning: cosyvoice-v3.5-plus-announcer-8aae0c0397fa408ca60c29cf******
    Voice design: cosyvoice-v3.5-plus-vd-announcer-8aae0c0397fa408ca60c29cf******

    language_hints

    array[string]

    ["zh"]

    No

    Specifies the language of the input: for voice cloning, the language of the sample audio used to extract target timbre features; for voice design, the language preference of the generated voice. Available only for the cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Note: This parameter is an array, but only the first element is processed. Pass only one value.

    Functionality:

    Voice Cloning

    This parameter helps the model identify the language of the sample audio (original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores this hint and automatically detects the language based on the audio content.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice Design

    Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.

    If used, the language must match the preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

    max_prompt_audio_length

    float

    10.0

    No

    Important

    Available only for voice cloning

    Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.

    Valid range: [3.0, 30.0].

    enable_preprocess

    boolean

    false

    No

    Important

    Available only for voice cloning

    Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.

    Valid values:

    • true: Enable

    • false: Disable

    sample_rate

    int

    24000

    No

    Important

    Available only for voice design

    Sample rate (Hz) of the preview audio generated by voice design.

    Valid values:

    • 16000

    • 24000

    • 48000

    response_format

    string

    wav

    No

    Important

    Available only for voice design

    Format of the preview audio generated by voice design.

    Valid values:

    • pcm

    • wav

    • mp3

  • Response Parameters

    View response examples

    Voice Cloning

    {
        "output": {
            "voice_id": "yourVoiceId"
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Voice Design

    {
        "output": {
            "preview_audio": {
                "data": "{base64_encoded_audio}",
                "sample_rate": 24000,
                "response_format": "wav"
            },
            "target_model": "cosyvoice-v3.5-plus",
            "voice_id": "yourVoice"
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Key parameters:

    Parameter

    Type

    Description

    voice_id

    string

    Voice ID. Use directly as the voice parameter in the speech synthesis API.

    data

    string

    Preview audio data generated by voice design, returned as a Base64-encoded string.

    sample_rate

    int

    Sample rate (Hz) of the preview audio generated by voice design. Matches the sample rate used when creating the voice. Default: 24000 Hz if unspecified.

    response_format

    string

    Format of the preview audio generated by voice design. Matches the format used when creating the voice. Default: wav if unspecified.

    target_model

    string

    Speech synthesis model that drives the voice (see Supported Models).

    Must match the speech synthesis model used later. Otherwise, synthesis fails.

    request_id

    string

    Request ID.

    count

    integer

    Number of "create voice" operations in this request.

    Always 1 for voice creation.
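The `data` field in the voice-design response is Base64-encoded, so it must be decoded before the preview audio can be saved or played. A minimal sketch (the helper name is ours, not part of the SDK):

```python
import base64

def save_preview_audio(data_b64: str, path: str) -> int:
    """Decode the Base64 `data` field from a voice-design response and
    write the raw audio bytes to `path`. Returns the number of bytes written."""
    audio_bytes = base64.b64decode(data_b64)
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)

# Typical use with a parsed response dict:
# save_preview_audio(resp["output"]["preview_audio"]["data"], "preview.wav")
```

Note that for `wav` and `mp3` the decoded bytes are a complete file, while `pcm` is headerless sample data at the requested sample rate.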

  • Sample Code

    Important

    Note the difference between these parameters:

    • model: Voice cloning/design model. Fixed value: voice-enrollment

    • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

    Voice Cloning

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "prefix": "myvoice",
            "url": "https://yourAudioFileUrl"
        }
    }'

    Voice Design

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }'

Python SDK

Interface Description

Before using this interface, install the latest DashScope SDK.

def create_voice(self, target_model: str, prefix: str, url: str, language_hints: List[str] = None) -> str:
    '''
    Create a new custom voice.
    param: target_model Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
    param: prefix Name for the voice (letters, numbers, and underscores only; up to 10 characters). Use identifiers related to role or scenario. This keyword appears in the cloned voice name. Format: model-name-prefix-unique-id, e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx.
    param: url Publicly accessible URL of the audio file used for voice cloning.
    param: language_hints Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
                Helps the model identify the language of the reference audio (original sample), improving voice feature extraction and cloning quality.
                If the language hint does not match the actual audio (e.g., "en" for Chinese audio), the system ignores the hint and detects the language automatically.
                Valid values (by model):
                    cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
                    cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
                This parameter is an array, but only the first element is processed. Pass only one value.
    param: max_prompt_audio_length Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
                Valid range: [3.0, 30.0].
    param: enable_preprocess Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
    return: voice_id Voice ID. Use directly as the voice parameter in the speech synthesis API.
    '''
Important
  • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

  • language_hints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Functionality:

    Voice Cloning

    This parameter helps the model identify the language of the sample audio (original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores this hint and automatically detects the language based on the audio content.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice Design

    Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.

    If used, the language must match the preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Avoid frequent calls. Each call creates a new voice. After reaching your quota limit, you cannot create more.
voice_id = service.create_voice(
    target_model='cosyvoice-v3.5-plus',
    prefix='myvoice',
    url='https://your-audio-file-url',  # keep the trailing comma if you uncomment the options below
    # language_hints=['zh'],
    # max_prompt_audio_length=10.0,
    # enable_preprocess=False
)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")
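Because an invalid prefix fails only at request time, it can help to check it locally before calling create_voice. A hypothetical helper (not part of the DashScope SDK) enforcing the documented rule, letters, numbers, and underscores only, up to 10 characters:

```python
import re

# Documented prefix rule: letters, numbers, and underscores only; up to 10 characters.
_PREFIX_RE = re.compile(r"^[A-Za-z0-9_]{1,10}$")

def is_valid_prefix(prefix: str) -> bool:
    """Return True if `prefix` satisfies the documented naming rule."""
    return _PREFIX_RE.fullmatch(prefix) is not None
```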

Java SDK

Interface Description

Before using this interface, install the latest DashScope SDK.

/**
 * Create a new custom voice.
 *
 * @param targetModel Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
 * @param prefix Name for the voice (letters, numbers, and underscores only; up to 10 characters). Use identifiers related to role or scenario. This keyword appears in the cloned voice name. Format: model-name-prefix-unique-id, e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx.
 * @param url Publicly accessible URL of the audio file used for voice cloning.
 * @param customParam Custom parameters. Specify languageHints and maxPromptAudioLength here.
 *              languageHints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
 *                  Helps the model identify the language of the reference audio (original sample), improving voice feature extraction and cloning quality.
 *                  If the language hint does not match the actual audio (e.g., "en" for Chinese audio), the system ignores the hint and detects the language automatically.
 *                  Valid values (by model):
 *                      cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
 *                      cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
 *                  Only the first element is processed. Pass only one value.
 *              maxPromptAudioLength: Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
 *                  Valid range: [3.0, 30.0].
 *              enable_preprocess: Configure this parameter using the generic parameter "parameter". Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
 * @return Voice New voice. Call Voice.getVoiceId() to get the voice ID. Use directly as the voice parameter in the speech synthesis API.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
Important
  • targetModel: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

  • languageHints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Functionality:

    Voice Cloning

    This parameter helps the model identify the language of the sample audio (original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores this hint and automatically detects the language based on the audio content.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice Design

    Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.

    If used, the language must match the preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

Request Example

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Collections;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        String targetModel = "cosyvoice-v3.5-plus";
        String prefix = "myvoice";
        String fileUrl = "https://your-audio-file-url";
        String cloneModelName = "voice-enrollment";

        try {
            VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
            Voice myVoice = service.createVoice(
                    targetModel,
                    prefix,
                    fileUrl,
                    VoiceEnrollmentParam.builder()
                            .model(cloneModelName)
                            .languageHints(Collections.singletonList("zh"))
                            // .maxPromptAudioLength(10.0f)
                            // .parameter("enable_preprocess", false)
                            .build());

            logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
            logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
        } catch (Exception e) {
            logger.error("Failed to create voice", e);
        }
    }
}

List Voices

Query the list of created voices with pagination.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs.

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "list_voice",
        "prefix": "announcer",
            "page_size": 10,
            "page_index": 0
        }
    }
  • Request Parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: list_voice.

    prefix

    string

    -

    No

    Same prefix used when creating the voice. Letters, numbers, and underscores only; up to 10 characters.

    page_index

    integer

    0

    No

    Page index. Must be greater than or equal to 0.

    page_size

    integer

    10

    No

    Number of items per page. Valid range: [0, 1000].

  • Response Parameters

    View response examples

    Voice Cloning

    {
        "output": {
            "voice_list": [
                {
                    "gmt_create": "2024-12-11 13:38:02",
                    "voice_id": "yourVoiceId",
                    "gmt_modified": "2024-12-11 13:38:02",
                    "status": "OK"
                }
            ]
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Voice Design

    {
        "output": {
            "voice_list": [
                {
                    "gmt_create": "2025-12-10 17:04:54",
                    "gmt_modified": "2025-12-10 17:04:54",
                    "preview_text": "Dear listeners, hello everyone. Welcome to today's show.",
                    "target_model": "cosyvoice-v3.5-plus",
                    "voice_id": "yourVoice1",
                    "status": "OK",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary; deep and magnetic, steady pace"
                },
                {
                    "gmt_create": "2025-12-10 15:31:35",
                    "gmt_modified": "2025-12-10 15:31:35",
                    "language": "zh",
                    "preview_text": "Dear listeners, hello everyone",
                    "target_model": "cosyvoice-v3.5-plus",
                    "voice_id": "yourVoice2",
                    "status": "OK",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary."
                }
            ]
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Key parameters:

    Parameter

    Type

    Description

    voice_id

    string

    Voice ID. Use directly as the voice parameter in the speech synthesis API.

    target_model

    string

    Speech synthesis model that drives the voice (see Supported Models).

    Must match the speech synthesis model used later. Otherwise, synthesis fails.

    gmt_create

    string

    Time the voice was created.

    gmt_modified

    string

    Time the voice was last modified.

    voice_prompt

    string

    Voice description.

    preview_text

    string

    Preview text.

    request_id

    string

    Request ID.

    status

    string

    Voice status:

    • DEPLOYING: Under review

    • OK: Approved and ready to use

    • UNDEPLOYED: Rejected and unavailable

  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "list_voice",
            "prefix": "announcer",
            "page_size": 10,
            "page_index": 0
        }
    }'

Python SDK

Interface Description

def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
    '''
    Query all created voices
    param: prefix Custom prefix for the voice (letters, numbers, and underscores only; up to 10 characters).
    param: page_index Page index to query
    param: page_size Page size to query
    return: List[dict] Voice list containing ID, creation time, modification time, and status for each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
    Voice statuses:
        DEPLOYING: Under review
        OK: Approved and ready to use
        UNDEPLOYED: Rejected and unavailable
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Filter by prefix, or set to None to query all
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")

Response Example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]

Response Parameters

Parameter

Type

Description

voice_id

string

Voice ID. Use directly as the voice parameter in the speech synthesis API.

target_model

string

Speech synthesis model that drives the voice (see Supported Models).

Must match the speech synthesis model used later. Otherwise, synthesis fails.

gmt_create

string

Time the voice was created.

gmt_modified

string

Time the voice was last modified.

voice_prompt

string

Voice description.

preview_text

string

Preview text.

request_id

string

Request ID.

status

string

Voice status:

  • DEPLOYING: Under review

  • OK: Approved and ready to use

  • UNDEPLOYED: Rejected and unavailable
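To collect every voice across pages, keep fetching until a page comes back shorter than the page size. A sketch with the page fetcher injected as a callable, so it works with `list_voices` or any stub (the helper name is ours, not SDK API):

```python
def iter_all_voices(fetch_page, page_size=10):
    """Yield every voice, walking pages until one comes back short or empty.

    `fetch_page(page_index, page_size)` returns the voice list for one page,
    e.g. lambda i, n: service.list_voices(prefix='myvoice', page_index=i, page_size=n).
    """
    page_index = 0
    while True:
        page = fetch_page(page_index, page_size)
        yield from page
        if len(page) < page_size:   # short or empty page: no more data
            break
        page_index += 1
```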

Java SDK

Interface Description

// Voice statuses:
//        DEPLOYING: Under review
//        OK: Approved and ready to use
//        UNDEPLOYED: Rejected and unavailable
/**
 * Query all created voices. Default page index is 0, default page size is 10.
 *
 * @param prefix Custom prefix for the voice (letters, numbers, and underscores only; up to 10 characters). Can be null.
 * @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException 

/**
 * Query all created voices.
 *
 * @param prefix Custom prefix for the voice (letters, numbers, and underscores only; up to 10 characters).
 * @param pageIndex Page index to query.
 * @param pageSize Page size to query.
 * @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException

Request Example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String prefix = "myvoice"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Query voices
        Voice[] voices = service.listVoice(prefix, 0, 10);
        logger.info("List successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voices Details: {}", new Gson().toJson(voices));
    }
}

Response Example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]

Response Parameters

Parameter

Type

Description

voice_id

string

Voice ID. Use directly as the voice parameter in the speech synthesis API.

target_model

string

Speech synthesis model that drives the voice (see Supported Models).

Must match the speech synthesis model used later. Otherwise, synthesis fails.

gmt_create

string

Time the voice was created.

gmt_modified

string

Time the voice was last modified.

voice_prompt

string

Voice description.

preview_text

string

Preview text.

request_id

string

Request ID.

status

string

Voice status:

  • DEPLOYING: Under review

  • OK: Approved and ready to use

  • UNDEPLOYED: Rejected and unavailable

Query Specific Voice

Query detailed information about a specific voice by its voice ID.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs.

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "query_voice",
            "voice_id": "yourVoiceID"
        }
    }
  • Request Parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: query_voice.

    voice_id

    string

    -

    Yes

    ID of the voice to query.

  • Response Parameters

    View response examples

    Voice Cloning

    {
        "output": {
            "gmt_create": "2024-12-11 13:38:02",
            "resource_link": "https://yourAudioFileUrl",
            "target_model": "cosyvoice-v3.5-plus",
            "gmt_modified": "2024-12-11 13:38:02",
            "status": "OK"
        },
        "usage": {
            "count": 1
        },
        "request_id": "2450f969-d9ea-9483-bafc-************"
    }

    Voice Design

    {
        "output": {
            "gmt_create": "2025-12-10 14:54:09",
            "gmt_modified": "2025-12-10 17:47:48",
            "preview_text": "Dear listeners, hello everyone",
            "target_model": "cosyvoice-v3.5-plus",
            "status": "OK",
            "voice_id": "yourVoice",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary."
        },
        "usage": {},
        "request_id": "yourRequestId"
    }

    For parameter descriptions, see the List Voices API.

  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "query_voice",
            "voice_id": "yourVoiceID"
        }
    }'

Python SDK

Interface Description

def query_voice(self, voice_id: str) -> List[str]:
    '''
    Query details for a specific voice
    param: voice_id ID of the voice to query
    return: List[str] Voice details, including status, creation time, audio link, etc.
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'

voice_details = service.query_voice(voice_id=voice_id)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")

Response Example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3.5-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}
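A voice in the DEPLOYING state is not yet usable for synthesis. A polling sketch built on any query callable that returns a mapping with a `status` field, as in the response example above (the helper and its defaults are ours, not SDK API):

```python
import time

def wait_until_ready(query, voice_id, timeout_s=600, interval_s=10):
    """Poll `query(voice_id)` until the voice leaves the DEPLOYING state.

    Returns the final status: "OK" (approved) or "UNDEPLOYED" (rejected).
    Raises TimeoutError if the voice is still DEPLOYING after `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = query(voice_id)["status"]
        if status != "DEPLOYING":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"voice {voice_id} still DEPLOYING after {timeout_s}s")
        time.sleep(interval_s)
```

For example, `query` could wrap `VoiceEnrollmentService().query_voice` if its result exposes the `status` field shown above.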

Response Parameters

For parameter descriptions, see the List Voices API.

Java SDK

Interface Description

/**
 * Query details for a specific voice
 *
 * @param voiceId ID of the voice to query
 * @return Voice Voice details, including status, creation time, audio link, etc.
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException

Request Example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        Voice voice = service.queryVoice(voiceId);
        
        logger.info("Query successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voice Details: {}", new Gson().toJson(voice));
    }
}

Response Example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3.5-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}

Response Parameters

For parameter descriptions, see the List Voices API.

Update Voice (Voice Cloning Only)

Update an existing voice with a new audio file.

Important

This feature is not supported for voice design.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs:

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "update_voice",
            "voice_id": "yourVoiceId",
            "url": "https://yourAudioFileUrl"
        }
    }
  • Request Parameters

    Parameter | Type | Default | Required | Description
    model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
    action | string | - | Yes | Action type. Fixed value: update_voice.
    voice_id | string | - | Yes | ID of the voice to update.
    url | string | - | Yes | URL of the audio file used to update the voice. The URL must be publicly accessible.

  • Response Parameters

    View response example

    {
        "output": {},
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }
  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "update_voice",
            "voice_id": "yourVoiceId",
            "url": "https://yourAudioFileUrl"
        }
    }'
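For languages without an official SDK, you can assemble the same request yourself. The following is a hedged Python sketch that builds the endpoint, headers, and body for the update_voice action; the helper name `build_voice_request` is my own, not part of any SDK, and the URL shown is the Singapore-region endpoint.

```python
import json

def build_voice_request(action: str, api_key: str, **fields) -> dict:
    """Assemble the pieces of a voice-enrollment RESTful call.

    Illustrative helper (not part of any SDK): `action` is one of the
    actions in this document (create_voice, update_voice, delete_voice,
    ...); extra keyword fields (voice_id, url, ...) go into "input".
    """
    return {
        # Singapore-region endpoint; use the Beijing URL for Beijing-region models.
        "url": "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": "voice-enrollment",  # fixed value, do not change
            "input": {"action": action, **fields},
        }),
    }

req = build_voice_request(
    "update_voice",
    api_key="your-api-key",
    voice_id="yourVoiceId",
    url="https://yourAudioFileUrl",
)
# POST req["body"] to req["url"] with req["headers"] using any HTTP client.
```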

Python SDK

Interface Description

def update_voice(self, voice_id: str, url: str) -> None:
    '''
    Update a voice
    param: voice_id Voice ID
    param: url URL of the audio file for voice cloning
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.update_voice(
    voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
    url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")

Java SDK

Interface Description

/**
 * Update a voice
 *
 * @param voiceId Voice to update
 * @param url URL of the audio file for voice cloning
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public void updateVoice(String voiceId, String url)
    throws NoApiKeyException, InputRequiredException

Request Example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String fileUrl = "https://your-audio-file-url";  // Replace with your actual value
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Update voice
        service.updateVoice(voiceId, fileUrl);
        logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
    }
}

Delete Voice

To free up quota, delete voices you no longer need. This action is irreversible.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The full request body is shown below. All of the following fields are required:

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "delete_voice",
            "voice_id": "yourVoiceID"
        }
    }
  • Request Parameters

    Parameter | Type | Default | Required | Description
    model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
    action | string | - | Yes | Action type. Fixed value: delete_voice.
    voice_id | string | - | Yes | ID of the voice to delete.

  • Response Parameters

    View response example

    {
        "output": {},
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }
  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "delete_voice",
            "voice_id": "yourVoiceID"
        }
    }'

Python SDK

Interface Description

def delete_voice(self, voice_id: str) -> None:
    '''
    Delete a voice
    param: voice_id Voice to delete
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")

Java SDK

Interface Description

/**
 * Delete a voice
 *
 * @param voiceId Voice to delete
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException 

Request Example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Delete voice
        service.deleteVoice(voiceId);
        logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
    }
}

Voice Quotas and Automatic Cleanup Rules

  • Total limit: 1000 voices

    The API does not return your current voice count directly. To check how many voices you have, query your voice list and count the results yourself.
  • Automatic cleanup: If a voice has not been used for any speech synthesis requests in the past year, the system automatically deletes it.
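Counting your voices therefore means paging through the voice list until a short page is returned. Below is a minimal sketch of that paging loop; `fetch_page` is a placeholder for a thin wrapper around your actual voice-list call, and the demo uses an in-memory stand-in instead of the real service.

```python
def count_voices(fetch_page, page_size: int = 100) -> int:
    """Count voices by paging until a short (or empty) page is returned.

    `fetch_page(page_index, page_size)` is assumed to return the list of
    voices on that page; this helper only implements the paging loop.
    """
    total, page_index = 0, 0
    while True:
        page = fetch_page(page_index, page_size)
        total += len(page)
        if len(page) < page_size:  # last page reached
            return total
        page_index += 1

# Demo with an in-memory stand-in for the service: 250 fake voice IDs.
fake_store = [f"voice-{i}" for i in range(250)]

def fake_fetch(page_index, page_size):
    start = page_index * page_size
    return fake_store[start:start + page_size]

print(count_voices(fake_fetch))  # 250
```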

Billing

  • Voice cloning/design: Creating, querying, updating, and deleting voices is free.

  • Speech synthesis using custom voices: Billed based on the number of text characters. For more information, see Real-time Speech Synthesis – CosyVoice/Sambert.

Copyright and Legality

You are responsible for the ownership and legal right to use any voice you provide. Read the Terms of Service.

Error Codes

If you encounter an error, see Error Messages for troubleshooting.

FAQ

Features

Q: How do I adjust the speed and volume of a custom voice?

Adjust them the same way you adjust a preset voice. Pass the corresponding parameters when calling the speech synthesis API. For example, use speech_rate (Python) or speechRate (Java) to adjust speed, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).

Q: How do I call the API using languages other than Java and Python (such as Go, C#, or Node.js)?

For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.

Troubleshooting

If you encounter code errors, troubleshoot using the information in Error Codes.

Q: What should I do if the synthesized audio from a cloned voice contains extra content?

If you find extra characters or noise in the synthesized audio from a cloned voice, follow these steps to troubleshoot:

  1. Check the source audio quality

    The quality of the cloned audio directly affects the synthesis result. Ensure the source audio meets these requirements:

    • No background noise or static

    • Clear sound quality (sample rate ≥ 16 kHz recommended)

    • Audio format: WAV is better than MP3 (avoid lossy compression)

    • Mono (stereo may cause interference)

    • No silent segments or long pauses

    • Moderate speech rate (a fast rate affects feature extraction)

  2. Check the input text

    Confirm the input text does not contain special symbols or markers:

    • Avoid special symbols such as **, "", and ''

    • Unless the text intentionally contains LaTeX formulas, preprocess it to filter out such symbols.

  3. Verify voice cloning parameters

    Ensure the language parameter (language_hints/languageHints) is set correctly when creating the voice.

  4. Try cloning again

    Use a higher-quality source audio file to clone the voice again and test the result.

  5. Compare with system voices

    Test the same text with a preset system voice to confirm if the issue is specific to the cloned voice.
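The text preprocessing in step 2 can be sketched with a small filter. This is illustrative only, not part of the API: it strips Markdown emphasis markers and quote characters and collapses leftover whitespace; skip it for text that intentionally contains LaTeX formulas.

```python
import re

def sanitize_tts_text(text: str) -> str:
    """Illustrative pre-processing pass (not part of the API): removes
    Markdown emphasis markers and straight/curly quote characters that
    can leak into synthesized audio, then collapses extra whitespace."""
    text = re.sub(r"\*{1,2}", "", text)                        # ** and * markers
    text = re.sub(r"[\"'\u201c\u201d\u2018\u2019]", "", text)  # straight and curly quotes
    return re.sub(r"\s{2,}", " ", text).strip()                # tidy leftover whitespace

print(sanitize_tts_text('He said **"hello"** to everyone.'))  # He said hello to everyone.
```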

Q: How do I troubleshoot if the audio generated from a cloned voice is silent?

  1. Check Voice Status

    Call the Query Specific Voice API to check if the voice status is OK.

  2. Check for model version consistency

    Ensure the target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example:

    • If you cloned the voice with target_model set to cosyvoice-v3-plus,

    • you must also specify cosyvoice-v3-plus as the model during synthesis.

  3. Verify source audio quality

    Check if the source audio used for cloning meets the voice cloning input audio format requirements:

    • Audio duration: 10–20 seconds

    • Clear sound quality

    • No background noise

  4. Check request parameters

    Confirm the voice parameter is set to the cloned voice's ID during speech synthesis.
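The model-consistency check in step 2 can be automated with a simple prefix test. This assumes the voice-ID naming convention seen in this document's request examples (IDs beginning with the target model, e.g. cosyvoice-v3-plus-myvoice-xxx); when in doubt, confirm via the target_model field returned by the Query Specific Voice API instead.

```python
def model_matches_voice(voice_id: str, synthesis_model: str) -> bool:
    """Heuristic check that a cloned voice was enrolled for the model now
    being used for synthesis. Assumes voice IDs start with the
    target_model used at enrollment (as in this document's examples);
    verify the convention for your account before relying on it."""
    return voice_id.startswith(synthesis_model + "-")

print(model_matches_voice("cosyvoice-v3-plus-myvoice-xxx", "cosyvoice-v3-plus"))   # True
print(model_matches_voice("cosyvoice-v3-plus-myvoice-xxx", "cosyvoice-v3-flash"))  # False
```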

Q: What should I do if the synthesized speech from a cloned voice is unstable or incomplete?

If the synthesized speech from a cloned voice has these issues:

  • Incomplete playback; only part of the text is read

  • Unstable synthesis quality; sometimes good, sometimes bad

  • Abnormal pauses or silent segments in the audio

Possible cause: The source audio quality does not meet requirements.

Solution: Check whether the source audio meets the following requirements; if it does not, rerecord the audio following the Recording Guide.

  • Check audio continuity: Ensure the speech in the source audio is continuous. Avoid long pauses or silent segments (over 2 seconds). Obvious blank segments can cause the model to treat silence or noise as part of the voice's features, affecting the result.

  • Check the speech activity ratio: Ensure that active speech makes up more than 60% of the total audio duration. Too much background noise or non-speech segments will interfere with voice feature extraction.

  • Verify audio quality details:

    • Audio duration: 10–20 seconds (15 seconds recommended)

    • Clear pronunciation and steady speech rate

    • No background noise, echo, or static

    • Concentrated speech energy with no long silent segments
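The speech-activity-ratio check above can be approximated locally before you upload a sample. The following is a rough energy-based sketch using only the Python standard library; the frame size and RMS threshold are my own illustrative choices, and a production check would use a proper voice-activity detector.

```python
import math
import struct
import wave

def speech_activity_ratio(path: str, frame_ms: int = 30, rms_threshold: int = 500) -> float:
    """Rough energy-based estimate of the active-speech ratio of a mono
    16-bit PCM WAV file. Flags each frame whose RMS energy exceeds a
    fixed threshold; a real check would use a proper VAD."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2, "expect mono 16-bit PCM"
        rate = w.getframerate()
        samples = struct.unpack(f"<{w.getnframes()}h", w.readframes(w.getnframes()))
    frame_len = max(1, rate * frame_ms // 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = sum(
        1 for f in frames
        if math.sqrt(sum(s * s for s in f) / len(f)) > rms_threshold
    )
    return active / len(frames) if frames else 0.0
```

A ratio below roughly 0.6 suggests the sample contains too much silence or noise for reliable feature extraction.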

Q: Why can't I find the VoiceEnrollmentService class?

Your SDK version is too old. Install the latest SDK.

Q: What should I do if the voice cloning result is poor, with noise or unclear audio?

This is usually because of low-quality input audio. Rerecord and upload the audio, strictly following the Recording Guide.

Q: Why is there a long silence at the beginning or an abnormal total duration when I synthesize very short text (like a single word) with a cloned voice?

The voice cloning model learns the pauses and rhythm from the sample audio. If the original recording has a long initial silence or pause, the synthesis result might retain a similar pattern. For single words or very short text, this silence ratio is amplified, making it seem like "the audio is long but mostly silent." Avoid long silences when recording sample audio. Use complete sentences or longer text for synthesis. If you must synthesize a single word, add some context before or after it, or use a homophone to avoid extreme cases.