
Alibaba Cloud Model Studio: Voice cloning/design API

Last Updated: Mar 31, 2026

CosyVoice voice cloning creates natural-sounding custom voices from 10-20 seconds of audio with no traditional training required. Voice design generates custom voices from text descriptions with support for multilingual and multidimensional voice features. This topic covers voice cloning/design APIs and parameters. For speech synthesis, see Real-time speech synthesis – CosyVoice.

User guide: For model overviews and selection recommendations, see Real-time speech synthesis – CosyVoice.

Important
  • CosyVoice voice design is powered by the FunAudioGen-VD model.

  • Voice designs created with identical prompts may produce different voices. We recommend generating multiple results and selecting the best one.

  • This topic describes CosyVoice voice cloning/design APIs. If you use Qwen models, see Qwen voice cloning and Qwen voice design.

Supported models

  • Voice cloning:

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash

    • cosyvoice-v3-plus, cosyvoice-v3-flash

    • cosyvoice-v2

  • Voice design:

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash

    • cosyvoice-v3-plus, cosyvoice-v3-flash


Supported languages

  • Voice cloning: Depends on the speech synthesis model associated with the voice (specified by the target_model/targetModel parameter):

    • cosyvoice-v2: Chinese (Mandarin), English

    • cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeast dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Jiangxi dialect, Minnan dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Shanghai dialect, Sichuan dialect, Tianjin dialect, Yunnan dialect), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese

    • cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan dialect, Hubei dialect, Minnan dialect, Ningxia dialect, Shaanxi dialect, Shandong dialect, Shanghai dialect, Sichuan dialect), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese

    Voice cloning does not currently support other languages (Spanish, Italian, etc.).

  • Voice design: Chinese, English.

Getting started: from voice cloning to speech synthesis


Voice cloning and speech synthesis are two separate but related steps that follow a "create then use" workflow:

  1. Prepare an audio recording file.

    Upload an audio file that meets the requirements in Voice cloning: Input audio formats to a publicly accessible location, such as Object Storage Service (OSS), and verify that the URL can be reached without authentication.

  2. Create a voice.

    Call the Create voice API. Specify target_model or targetModel to define the speech synthesis model to be used with the created voice.

    If you already have a voice (you can check by calling the Query voice list API), you can skip this step and proceed to the next one.

  3. Use the voice for speech synthesis.

    After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID:

    • This voice_id or voiceID can be used as the voice parameter in the speech synthesis API or various language SDKs for subsequent text-to-speech conversion.

    • Multiple invocation modes are supported, including non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model specified for synthesis must match the target_model or targetModel used when creating the voice, or synthesis will fail.

Sample code:

import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Prepare the environment.
# We recommend that you configure the API key using an environment variable.
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# The following is the WebSocket URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# The following is the HTTP URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'


# 2. Define the cloning parameters.
TARGET_MODEL = "cosyvoice-v3.5-plus" 
# Give the voice a meaningful prefix.
VOICE_PREFIX = "myvoice" # Only digits and lowercase letters are allowed. The prefix must be less than 10 characters in length.
# A publicly accessible audio URL.
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav" # This is a sample URL. Replace it with your own.

# 3. Create a voice (asynchronous task).
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise e
# 4. Poll for the voice status.
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10 # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
        status = voice_info.get("status")
        print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")
        
        if status == "OK":
            print("Voice is ready for synthesis.")
            break
        elif status == "UNDEPLOYED":
            print(f"Voice processing failed with status: {status}. Please check audio quality or contact support.")
            raise RuntimeError(f"Voice processing failed with status: {status}")
        # For intermediate statuses such as "DEPLOYING", continue to wait.
        time.sleep(poll_interval)
    except Exception as e:
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
else:
    print("Polling timed out. The voice is not ready after several attempts.")
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis.
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations, you have successfully cloned and synthesized your own voice!"
    
    # The call() method returns binary audio data.
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")

    # 6. Save the audio file.
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")

except Exception as e:
    print(f"Error during speech synthesis: {e}")
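The VOICE_PREFIX in the sample above must follow the stated constraint (digits and lowercase letters only, fewer than 10 characters). A minimal client-side check before submitting, assuming exactly that rule (the helper name is illustrative, not part of the SDK):

```python
import re

def is_valid_prefix(prefix: str) -> bool:
    """Digits and lowercase letters only, fewer than 10 characters,
    per the constraint noted in the cloning sample above."""
    return bool(re.fullmatch(r"[a-z0-9]{1,9}", prefix))

print(is_valid_prefix("myvoice"))   # valid
print(is_valid_prefix("MyVoice"))   # invalid: uppercase letters
```

Validating the prefix locally avoids a round trip that would fail server-side with an invalid-parameter error.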

Getting started: from voice design to speech synthesis


Voice design and speech synthesis are two separate but related steps that follow a "create then use" workflow:

  1. Prepare the voice description and preview text for voice design.

    • Voice description (voice_prompt): Describes the features of the target voice. For more information, see Voice design: Write high-quality voice descriptions.

    • Preview text (preview_text): The content that the target voice will read for the preview audio, for example, "Hello everyone, welcome."

  2. Call the Create voice API to create a custom voice and retrieve the voice name and preview audio.

    Specify target_model to define the speech synthesis model to be used with the created voice.

    Listen to the preview audio to determine if it meets your expectations. If it does, proceed to the next step. Otherwise, redesign the voice.

    If you already have a voice (you can check by calling the Query voice list API), you can skip this step and proceed to the next one.

  3. Use the voice for speech synthesis.

    After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID:

    • This voice_id or voiceID can be used directly as the voice parameter in the speech synthesis API or various language SDKs for subsequent text-to-speech conversion.

    • Multiple invocation modes are supported, including non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model specified during synthesis must match the target_model or targetModel used when creating the voice, or synthesis will fail.

Sample code:

  1. Generate a custom voice and preview the result. If you are satisfied with the result, proceed to the next step. Otherwise, regenerate the voice.

    Python

    import requests
    import base64
    import os
    
    def create_voice_and_play():
        # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        # If you have not configured environment variables, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        if not api_key:
            print("Error: The DASHSCOPE_API_KEY environment variable is not found. Set the API key.")
            return None, None, None
        
        # Prepare the request data.
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "voice-enrollment",
            "input": {
                "action": "create_voice",
                "target_model": "cosyvoice-v3.5-plus",
                "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
                "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
                "prefix": "announcer"
            },
            "parameters": {
                "sample_rate": 24000,
                "response_format": "wav"
            }
        }
        
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        
        try:
            # Send the request.
            response = requests.post(
                url,
                headers=headers,
                json=data,
                timeout=60  # Add a timeout setting.
            )
            
            if response.status_code == 200:
                result = response.json()
                
                # Get the voice ID.
                voice_id = result["output"]["voice_id"]
                print(f"Voice ID: {voice_id}")
                
                # Get the preview audio data.
                base64_audio = result["output"]["preview_audio"]["data"]
                
                # Decode the Base64 audio data.
                audio_bytes = base64.b64decode(base64_audio)
                
                # Save the audio file to your on-premises device.
                filename = f"{voice_id}_preview.wav"
                
                # Write the audio data to a local file.
                with open(filename, 'wb') as f:
                    f.write(audio_bytes)
                
                print(f"The audio is saved to the local file: {filename}")
                print(f"File path: {os.path.abspath(filename)}")
                
                return voice_id, audio_bytes, filename
            else:
                print(f"Request failed. Status code: {response.status_code}")
                print(f"Response content: {response.text}")
                return None, None, None
                
        except requests.exceptions.RequestException as e:
            print(f"A network request error occurred: {e}")
            return None, None, None
        except KeyError as e:
            print(f"The response data is in an invalid format. The required field is missing: {e}")
            print(f"Response content: {response.text if 'response' in locals() else 'No response'}")
            return None, None, None
        except Exception as e:
            print(f"An unknown error occurred: {e}")
            return None, None, None
    
    if __name__ == "__main__":
        print("Creating the voice...")
        voice_id, audio_data, saved_filename = create_voice_and_play()
        
        if voice_id:
            print(f"\nVoice '{voice_id}' is created.")
            print(f"The audio file is saved: '{saved_filename}'")
            print(f"File size: {os.path.getsize(saved_filename)} bytes")
        else:
            print("\nFailed to create the voice.")

    Java

    You need to import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:

    Maven

    Add the following to your pom.xml file:

    <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.13.1</version>
    </dependency>

    Gradle

    Add the following to your build.gradle file:

    // https://mvnrepository.com/artifact/com.google.code.gson/gson
    implementation("com.google.code.gson:gson:2.13.1")

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;
    
    public class Main {
        public static void main(String[] args) {
            Main example = new Main();
            example.createVoice();
        }
    
        public void createVoice() {
            // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
            // If you have not configured environment variables, replace the following line with your Model Studio API key: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create a JSON request body string.
            String jsonBody = "{\n" +
                    "    \"model\": \"voice-enrollment\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"create_voice\",\n" +
                    "        \"target_model\": \"cosyvoice-v3.5-plus\",\n" +
                    "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                    "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                    "        \"prefix\": \"announcer\"\n" +
                    "    },\n" +
                    "    \"parameters\": {\n" +
                    "        \"sample_rate\": 24000,\n" +
                    "        \"response_format\": \"wav\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set the request method and headers.
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send the request body.
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get the response.
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read the response content.
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse the JSON response.
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                    JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
    
                    // Get the voice name.
                    String voiceId = outputObj.get("voice_id").getAsString();
                    System.out.println("Voice ID: " + voiceId);
    
                    // Get the Base64-encoded audio data.
                    String base64Audio = previewAudioObj.get("data").getAsString();
    
                    // Decode the Base64 audio data.
                    byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
    
                    // Save the audio to a local file.
                    String filename = voiceId + "_preview.wav";
                    saveAudioToFile(audioBytes, filename);
    
                    System.out.println("The audio is saved to the local file: " + filename);
    
                } else {
                    // Read the error response.
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed. Status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("A request error occurred: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    
        private void saveAudioToFile(byte[] audioBytes, String filename) {
            try {
                File file = new File(filename);
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audioBytes);
                }
                System.out.println("The audio is saved to: " + file.getAbsolutePath());
            } catch (IOException e) {
                System.err.println("An error occurred while saving the audio file: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
  2. Use the custom voice you generated in the previous step for speech synthesis.

    This step is based on the non-streaming call sample code. Replace the voice parameter with the custom voice generated through voice design.

    Key principle: The model used for voice design (target_model) must match the model used for subsequent speech synthesis (model), or synthesis will fail.

    Python

    # coding=utf-8
    
    import dashscope
    from dashscope.audio.tts_v2 import *
    import os
    
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
    
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
    
    # Use the same model for voice design and speech synthesis.
    model = "cosyvoice-v3.5-plus"
    # Replace the voice parameter with the custom voice generated by voice design.
    voice = "your_voice"
    
    # Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
    synthesizer = SpeechSynthesizer(model=model, voice=voice)
    # Send the text to be synthesized and get the binary audio.
    audio = synthesizer.call("How is the weather today?")
    # The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
    print('[Metric] Request ID: {}, First packet delay: {} ms'.format(
        synthesizer.get_last_request_id(),
        synthesizer.get_first_package_delay()))
    
    # Save the audio locally.
    with open('output.mp3', 'wb') as f:
        f.write(audio)

    Java

    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
    import com.alibaba.dashscope.utils.Constants;
    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    
    public class Main {
        // Use the same model for voice design and speech synthesis.
        private static String model = "cosyvoice-v3.5-plus";
        // Replace the voice parameter with the custom voice generated by voice design.
        private static String voice = "your_voice_id";
    
        public static void streamAudioDataToSpeaker() {
            // Request parameters
            SpeechSynthesisParam param =
                    SpeechSynthesisParam.builder()
                            // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                            // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                            .model(model) // Model
                            .voice(voice) // Voice
                            .build();
    
            // Synchronous mode: Disable callback (second parameter is null).
            SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
            ByteBuffer audio = null;
            try {
                // Block until audio returns.
                audio = synthesizer.call("How is the weather today?");
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // Close the WebSocket connection when the task ends.
                synthesizer.getDuplexApi().close(1000, "bye");
            }
            if (audio != null) {
                // Save the audio data to the local file "output.mp3".
                File file = new File("output.mp3");
                // The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
                System.out.println(
                        "[Metric] Request ID: "
                                + synthesizer.getLastRequestId()
                                + ", First packet delay (ms): "
                                + synthesizer.getFirstPackageDelay());
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audio.array());
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    
        public static void main(String[] args) {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
            Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
            streamAudioDataToSpeaker();
            System.exit(0);
        }
    }

API reference

Use the same Alibaba Cloud account for all API operations.

Important

The Java and Python DashScope SDKs do not support voice design. For voice design, use the RESTful API.

Create voice

RESTful API

  • URL

    Chinese Mainland:

    POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

    International:

    POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
  • Request headers

    Parameter

    Type

    Required

    Description

    Authorization

    string

    Yes

    Authentication token. Format: Bearer <your_api_key>. Replace "<your_api_key>" with your actual API key.

    Content-Type

    string

    Yes

    Media type of data in the request body. Fixed value: application/json.

  • Request body

    The request body contains all parameters (omit optional fields as needed).

    Important

    Note the difference between these parameters:

    • model: Voice cloning/design model. Fixed value: voice-enrollment

    • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

    Voice cloning

    {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "prefix": "myvoice",
            "url": "https://yourAudioFileUrl",
            "language_hints": ["zh"]
        }
    }

    Voice design

    {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer",
            "language_hints": ["zh"]
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }
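The two request bodies above differ only in their input fields. As an illustration of which fields each mode requires (the helper function is hypothetical, not part of any SDK), a sketch that assembles either payload:

```python
def build_create_voice_body(target_model, prefix, *, url=None,
                            voice_prompt=None, preview_text=None,
                            language_hints=("zh",)):
    """Build the Create voice request body for voice cloning or voice design,
    mirroring the two JSON examples above."""
    body = {
        "model": "voice-enrollment",  # fixed value for this API
        "input": {
            "action": "create_voice",
            "target_model": target_model,  # must match the synthesis model used later
            "prefix": prefix,
            "language_hints": list(language_hints),
        },
    }
    if url is not None:
        # Voice cloning: a publicly accessible audio URL.
        body["input"]["url"] = url
    elif voice_prompt is not None and preview_text is not None:
        # Voice design: description plus preview text; "parameters" controls the preview audio.
        body["input"]["voice_prompt"] = voice_prompt
        body["input"]["preview_text"] = preview_text
        body["parameters"] = {"sample_rate": 24000, "response_format": "wav"}
    else:
        raise ValueError("Provide url (cloning) or voice_prompt and preview_text (design).")
    return body

cloning = build_create_voice_body("cosyvoice-v3.5-plus", "myvoice",
                                  url="https://yourAudioFileUrl")
design = build_create_voice_body("cosyvoice-v3.5-plus", "announcer",
                                 voice_prompt="A composed middle-aged male announcer...",
                                 preview_text="Welcome to the evening news.")
```

The key distinction encoded here: model is always the fixed value voice-enrollment, while target_model names the synthesis model that will drive the voice.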
  • Request parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: create_voice.

    target_model

    string

    -

    Yes

    Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.

    url

    string

    -

    Conditional

    Important

    Required only for voice cloning

    Publicly accessible URL of the audio file used for voice cloning.

    For audio format details, see Voice cloning: Input audio formats.

    For recording guidance, see Recording Guide.

    voice_prompt

    string

    -

    Conditional

    Important

    Required only for voice design

    Voice description. Maximum length: 500 characters.

    Chinese and English only.

    For guidance on writing voice descriptions, see Voice design: Write high-quality voice descriptions.

    preview_text

    string

    -

    Conditional

    Important

    Required only for voice design

    Text for the preview audio. Maximum length: 200 characters.

    Supported languages: Chinese (zh), English (en).

    prefix

    string

    -

    Yes

    Voice name prefix (digits and lowercase letters only, fewer than 10 characters). Use role or scenario identifiers.

    This keyword appears in the final voice name. For example, if the keyword is "announcer", the final voice names are:
    Voice cloning: cosyvoice-v3.5-plus-announcer-8aae0c0397fa408ca60c29cf******
    Voice design: cosyvoice-v3.5-plus-vd-announcer-8aae0c0397fa408ca60c29cf******

    language_hints

    array[string]

    ["zh"]

    No

    Specifies the sample audio language for voice feature extraction. Available for cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This parameter is an array, but only the first element is processed—pass only one value.

    Functionality:

    Voice cloning

    Helps identify sample audio language to improve voice feature extraction and cloning quality. If the hint doesn't match actual audio language (e.g., en for Chinese audio), the system ignores it and auto-detects the language.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice design

    Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

    max_prompt_audio_length

    float

    10.0

    No

    Important
    • This parameter is only available for voice cloning scenarios.

    • Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.

    The maximum reference audio duration (seconds) after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash. Valid range: [3.0, 30.0].

    enable_preprocess

    boolean

    false

    No

    Important
    • This parameter is only available for voice cloning scenarios.

    • If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.

    • For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.

    Enables audio preprocessing (noise reduction, audio enhancement, volume normalization) before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash.

    Valid values:

    • true: Enable

    • false: Disable

    sample_rate

    int

    24000

    No

    Important

    Available only for voice design

    The preview audio sample rate (Hz) for voice design.

    Valid values:

    • 16000

    • 24000

    • 48000

    response_format

    string

    wav

    No

    Important

    Available only for voice design

    The preview audio format for voice design.

    Valid values:

    • pcm

    • wav

    • mp3

  • Response parameters

    View response examples

    Voice cloning

    {
        "output": {
            "voice_id": "yourVoiceId"
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Voice design

    {
        "output": {
            "preview_audio": {
                "data": "{base64_encoded_audio}",
                "sample_rate": 24000,
                "response_format": "wav"
            },
            "target_model": "cosyvoice-v3.5-plus",
            "voice_id": "yourVoice"
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Key parameters:

    Parameter

    Type

    Description

    voice_id

    string

    Voice ID. Use directly as the voice parameter in the speech synthesis API.

    data

    string

    The preview audio data from voice design (Base64-encoded).

    sample_rate

    int

    The preview audio sample rate (Hz) from voice design. This value matches the creation request. Default: 24000 Hz.

    response_format

    string

    The preview audio format from voice design. This value matches the creation request. Default: wav.

    target_model

    string

    Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.

    request_id

    string

    Request ID.

    count

    integer

    Number of "create voice" operations in this request. Always 1 for voice creation.
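    The `data` field in the voice design response is Base64-encoded audio bytes. A minimal sketch for decoding and saving the preview (the sample payload below is a placeholder, not real audio):

    ```python
    import base64

    def save_preview_audio(data_b64: str, path: str) -> int:
        """Decode the Base64 `data` field from a voice design response and write it to disk."""
        audio_bytes = base64.b64decode(data_b64)
        with open(path, "wb") as f:
            f.write(audio_bytes)
        return len(audio_bytes)

    # Placeholder payload; in practice pass response["output"]["preview_audio"]["data"].
    sample = base64.b64encode(b"RIFF....WAVE").decode()
    print(save_preview_audio(sample, "preview.wav"))  # 12 (bytes written)
    ```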

  • Sample code

    Important

    Note the difference between these parameters:

    • model: Voice cloning/design model. Fixed value: voice-enrollment

    • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

    Voice cloning

    If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

    # This is the Singapore region URL. For Beijing region: use dashscope.aliyuncs.com with a different regional API key
    # Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "prefix": "myvoice",
            "url": "https://yourAudioFileUrl"
        }
    }'
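    For reference, the same cloning request can be issued from Python's standard library. This sketch mirrors the curl payload above; the endpoint and field names come from that example, and the audio URL is a placeholder:

    ```python
    import json
    import urllib.request

    ENDPOINT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"

    def build_clone_body(target_model: str, prefix: str, audio_url: str) -> dict:
        # Mirrors the curl request body shown above.
        return {
            "model": "voice-enrollment",
            "input": {
                "action": "create_voice",
                "target_model": target_model,
                "prefix": prefix,
                "url": audio_url,
            },
        }

    def create_voice_rest(api_key: str, target_model: str, prefix: str, audio_url: str) -> str:
        data = json.dumps(build_clone_body(target_model, prefix, audio_url)).encode()
        req = urllib.request.Request(
            ENDPOINT,
            data=data,
            headers={"Authorization": f"Bearer {api_key}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["output"]["voice_id"]
    ```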

    Voice design

    If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

    # This is the Singapore region URL. For Beijing region: use dashscope.aliyuncs.com with a different regional API key
    # Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }'

Python SDK

Interface description

Before using this API, install the latest DashScope SDK.

def create_voice(self, target_model: str, prefix: str, url: str, language_hints: List[str] = None, max_prompt_audio_length: float = None, enable_preprocess: bool = None) -> str:
    '''
    Create a new custom voice.
    param: target_model Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
    param: prefix The voice name (letters, numbers, underscores only; max 10 characters). Use role or scenario identifiers. Format: model-name-prefix-unique-id (e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx).
    param: url Publicly accessible URL of the audio file used for voice cloning.
    param: language_hints The reference audio language for voice feature extraction. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus.
                Helps identify sample audio language to improve voice feature extraction and cloning quality.
                If the hint doesn't match actual audio, the system ignores it and auto-detects the language.
                Valid values (by model):
                    cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
                    cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
                This parameter is an array, but only the first element is processed. Pass only one value.
    param: max_prompt_audio_length The maximum reference audio duration (seconds) after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
                Valid range: [3.0, 30.0]. Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
    param: enable_preprocess Enables audio preprocessing. When enabled, the system performs noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
                If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.
                For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
    return: voice_id The voice ID. Use directly as the voice parameter in the speech synthesis API.
    '''
Important
  • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

  • language_hints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Functionality:

    Voice cloning

    Helps identify sample audio language to improve voice feature extraction and cloning quality. If the hint doesn't match actual audio language (e.g., en for Chinese audio), the system ignores it and auto-detects the language.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice design

    Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Avoid frequent calls. Each call creates a new voice. After reaching your quota limit, you cannot create more.
voice_id = service.create_voice(
    target_model='cosyvoice-v3.5-plus',
    prefix='myvoice',
    url='https://your-audio-file-url',
    # language_hints=['zh'],
    # max_prompt_audio_length=10.0,
    # enable_preprocess=False
)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")
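Because synthesis fails when the voice and model don't match, a cheap guard is to check the documented voice ID format (model-name-prefix-unique-id) before calling the synthesis API. This helper is a hypothetical convenience, not part of the SDK:

```python
def voice_matches_model(voice_id: str, target_model: str) -> bool:
    # Cloned voice IDs start with the target model name,
    # e.g. cosyvoice-v3-plus-myvoice-xxxxxxxx.
    return voice_id.startswith(target_model + "-")

assert voice_matches_model("cosyvoice-v3-plus-myvoice-abc123", "cosyvoice-v3-plus")
assert not voice_matches_model("cosyvoice-v3-plus-myvoice-abc123", "cosyvoice-v2")
```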

Java SDK

Interface description

Before using this API, install the latest DashScope SDK.

/**
 * Create a new custom voice.
 *
 * @param targetModel Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
 * @param prefix The voice name (letters, numbers, underscores only; max 10 characters). Use role or scenario identifiers. Format: model-name-prefix-unique-id (e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx).
 * @param url Publicly accessible URL of the audio file used for voice cloning.
 * @param customParam Custom parameters. Specify languageHints and maxPromptAudioLength here.
 *              languageHints: The reference audio language for voice feature extraction. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus.
 *                  Helps identify sample audio language to improve voice feature extraction and cloning quality.
 *                  If hint doesn't match actual audio, the system ignores it and auto-detects the language.
 *                  Valid values (by model):
 *                      cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
 *                      cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
 *                  Only the first element is processed. Pass only one value.
 *              maxPromptAudioLength: The maximum reference audio duration (seconds) after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
 *                  Valid range: [3.0, 30.0]. Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
 *              enable_preprocess: Configured through the generic parameter. Enables audio preprocessing. When enabled, the system performs noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
 *                  If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.
 *                  For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
 * @return Voice New voice. Call Voice.getVoiceId() to get the voice ID. Use directly as the voice parameter in the speech synthesis API.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
Important
  • targetModel: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

  • languageHints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Functionality:

    Voice cloning

    Helps identify sample audio language to improve voice feature extraction and cloning quality. If the hint doesn't match actual audio language (e.g., en for Chinese audio), the system ignores it and auto-detects the language.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice design

    Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

Request example

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Collections;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        String targetModel = "cosyvoice-v3.5-plus";
        String prefix = "myvoice";
        String fileUrl = "https://your-audio-file-url";
        String cloneModelName = "voice-enrollment";

        try {
            VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
            Voice myVoice = service.createVoice(
                    targetModel,
                    prefix,
                    fileUrl,
                    VoiceEnrollmentParam.builder()
                            .model(cloneModelName)
                            .languageHints(Collections.singletonList("zh"))
                            // .maxPromptAudioLength(10.0f)
                            // .parameter("enable_preprocess", false)
                            .build());

            logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
            logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
        } catch (Exception e) {
            logger.error("Failed to create voice", e);
        }
    }
}

List voices

Query created voices with pagination.

RESTful API

  • URL and Request headers are the same as the Create voice API

  • Request body

    The request body contains all parameters. Omit optional fields as needed.

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "list_voice",
            "prefix": "announcer",
            "page_size": 10,
            "page_index": 0
        }
    }
  • Request parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: list_voice.

    prefix

    string

    -

    No

    The same prefix used when creating the voice (letters/numbers only, max 10 characters).

    page_index

    integer

    0

    No

    Page index (≥ 0).

    page_size

    integer

    10

    No

    The number of items per page. Valid range: [0, 1000].

  • Response parameters

    View response examples

    Voice cloning

    {
        "output": {
            "voice_list": [
                {
                    "gmt_create": "2024-12-11 13:38:02",
                    "voice_id": "yourVoiceId",
                    "gmt_modified": "2024-12-11 13:38:02",
                    "status": "OK"
                }
            ]
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Voice design

    {
        "output": {
            "voice_list": [
                {
                    "gmt_create": "2025-12-10 17:04:54",
                    "gmt_modified": "2025-12-10 17:04:54",
                    "preview_text": "Dear listeners, hello everyone. Welcome to today's show.",
                    "target_model": "cosyvoice-v3.5-plus",
                    "voice_id": "yourVoice1",
                    "status": "OK",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary., deep and magnetic, steady pace"
                },
                {
                    "gmt_create": "2025-12-10 15:31:35",
                    "gmt_modified": "2025-12-10 15:31:35",
                    "language": "zh",
                    "preview_text": "Dear listeners, hello everyone",
                    "target_model": "cosyvoice-v3.5-plus",
                    "voice_id": "yourVoice2",
                    "status": "OK",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary."
                }
            ]
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Key parameters:

    Parameter

    Type

    Description

    voice_id

    string

    Voice ID. Use directly as the voice parameter in the speech synthesis API.

    target_model

    string

    Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.

    gmt_create

    string

    Time the voice was created.

    gmt_modified

    string

    Time the voice was last modified.

    voice_prompt

    string

    Voice description.

    preview_text

    string

    Preview text.

    request_id

    string

    Request ID.

    status

    string

    Voice status:

    • DEPLOYING: Under review

    • OK: Ready to use

    • UNDEPLOYED: Unavailable

  • Sample code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

    # This is the Singapore region URL. For Beijing region: use dashscope.aliyuncs.com with a different regional API key
    # Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "list_voice",
            "prefix": "announcer",
            "page_size": 10,
            "page_index": 0
        }
    }'

Python SDK

Interface description

def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
    '''
    Query all created voices
    param: prefix Custom prefix for the voice (lowercase letters and numbers only; fewer than 10 characters).
    param: page_index Page index to query
    param: page_size Page size to query
    return: List[dict] Voice list containing ID, creation time, modification time, and status for each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
    Voice statuses:
        DEPLOYING: Under review
        OK: Approved and ready to use
        UNDEPLOYED: Rejected and unavailable
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Filter by prefix, or set to None to query all
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")

Response example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]
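`list_voices` returns one page at a time, so collecting every voice means paging until a short page comes back. A sketch of that loop, with a fake fetcher standing in for `service.list_voices` so it runs without network access:

```python
def list_all_voices(fetch_page, page_size: int = 10) -> list:
    """Collect every voice by paging until a short page is returned.

    fetch_page(page_index, page_size) stands in for
    service.list_voices(prefix=..., page_index=..., page_size=...).
    """
    voices, page_index = [], 0
    while True:
        page = fetch_page(page_index, page_size)
        voices.extend(page)
        if len(page) < page_size:
            return voices
        page_index += 1

# Fake fetcher simulating 25 stored voices, for demonstration only.
fake = [{"voice_id": f"v{i}", "status": "OK"} for i in range(25)]
print(len(list_all_voices(lambda i, n: fake[i * n:(i + 1) * n])))  # 25
```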

Response parameters

Parameter

Type

Description

voice_id

string

Voice ID. Use directly as the voice parameter in the speech synthesis API.

target_model

string

Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.

gmt_create

string

Time the voice was created.

gmt_modified

string

Time the voice was last modified.

voice_prompt

string

Voice description.

preview_text

string

Preview text.

request_id

string

Request ID.

status

string

Voice status:

  • DEPLOYING: Under review

  • OK: Ready to use

  • UNDEPLOYED: Unavailable

Java SDK

Interface description

// Voice statuses:
//        DEPLOYING: Under review
//        OK: Approved and ready to use
//        UNDEPLOYED: Rejected and unavailable
/**
 * Query all created voices. Default page index is 0, default page size is 10.
 *
 * @param prefix Custom prefix for the voice (lowercase letters and numbers only; fewer than 10 characters). Can be null.
 * @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException 

/**
 * Query all created voices.
 *
 * @param prefix Custom prefix for the voice (lowercase letters and numbers only; fewer than 10 characters).
 * @param pageIndex Page index to query.
 * @param pageSize Page size to query.
 * @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException

Request example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String prefix = "myvoice"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Query voices
        Voice[] voices = service.listVoice(prefix, 0, 10);
        logger.info("List successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voices Details: {}", new Gson().toJson(voices));
    }
}

Response example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]

Response parameters

Parameter

Type

Description

voice_id

string

Voice ID. Use directly as the voice parameter in the speech synthesis API.

target_model

string

Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.

gmt_create

string

Time the voice was created.

gmt_modified

string

Time the voice was last modified.

voice_prompt

string

Voice description.

preview_text

string

Preview text.

request_id

string

Request ID.

status

string

Voice status:

  • DEPLOYING: Under review

  • OK: Ready to use

  • UNDEPLOYED: Unavailable

Query specific voice

Get detailed information about a specific voice by name.

RESTful API

  • URL and Request headers are the same as the Create voice API

  • Request body

    The request body contains all parameters. Omit optional fields as needed.

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "query_voice",
            "voice_id": "yourVoiceID"
        }
    }
  • Request parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: query_voice.

    voice_id

    string

    -

    Yes

    ID of the voice to query.

  • Response parameters

    View response examples

    Voice cloning

    {
        "output": {
            "gmt_create": "2024-12-11 13:38:02",
            "resource_link": "https://yourAudioFileUrl",
            "target_model": "cosyvoice-v3.5-plus",
            "gmt_modified": "2024-12-11 13:38:02",
            "status": "OK"
        },
        "usage": {
            "count": 1
        },
        "request_id": "2450f969-d9ea-9483-bafc-************"
    }

    Voice design

    {
        "output": {
            "gmt_create": "2025-12-10 14:54:09",
            "gmt_modified": "2025-12-10 17:47:48",
            "preview_text": "Dear listeners, hello everyone",
            "target_model": "cosyvoice-v3.5-plus",
            "status": "OK",
            "voice_id": "yourVoice",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary."
        },
        "usage": {},
        "request_id": "yourRequestId"
    }

    For parameter descriptions, see the List voices API.

  • Sample code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "query_voice",
            "voice_id": "yourVoiceID"
        }
    }'

Python SDK

Interface description

def query_voice(self, voice_id: str) -> dict:
    '''
    Query details for a specific voice
    param: voice_id ID of the voice to query
    return: dict Voice details, including status, creation time, audio link, etc.
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'

voice_details = service.query_voice(voice_id=voice_id)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")

Response example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3.5-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}
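Newly created voices may sit in DEPLOYING while under review, so a common pattern is to poll until the status changes before synthesizing. A sketch with the query function injected so the same loop works with `service.query_voice`; `interval_s` defaults to 0 here only so the demo runs instantly, and in real use you would poll every few seconds:

```python
import time

def wait_until_ready(query, voice_id: str, timeout_s: float = 300.0,
                     interval_s: float = 0.0) -> str:
    """Poll until the voice leaves DEPLOYING.

    query(voice_id) stands in for service.query_voice(voice_id=...).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = query(voice_id)["status"]
        if status != "DEPLOYING":
            return status  # OK or UNDEPLOYED
        time.sleep(interval_s)
    raise TimeoutError(f"voice {voice_id} still DEPLOYING after {timeout_s}s")

# Simulated query that becomes ready on the third poll.
states = iter(["DEPLOYING", "DEPLOYING", "OK"])
print(wait_until_ready(lambda _vid: {"status": next(states)}, "yourVoiceId"))  # OK
```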

Response parameters

For parameter descriptions, see the List voices API.

Java SDK

Interface description

/**
 * Query details for a specific voice
 *
 * @param voiceId ID of the voice to query
 * @return Voice Voice details, including status, creation time, audio link, etc.
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException

Request example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        Voice voice = service.queryVoice(voiceId);
        
        logger.info("Query successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voice Details: {}", new Gson().toJson(voice));
    }
}

Response example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3.5-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}

Response parameters

For parameter descriptions, see the List voices API.

Update voice (voice cloning only)

Updates an existing voice with a new audio file.

Important

This feature is not supported for voice design.

RESTful API

  • URL and Request headers are the same as the Create voice API

  • Request body

    The request body contains all parameters. Omit optional fields as needed:

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "update_voice",
            "voice_id": "yourVoiceId",
            "url": "https://yourAudioFileUrl"
        }
    }
  • Request parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: update_voice.

    voice_id

    string

    -

    Supported

    Voice to update.

    url

    string

    -

    Supported

    URL of the audio file to update the voice. The URL must be publicly accessible.

  • Response parameters

    View response example

    {
        "output": {},
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }
  • Sample code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "update_voice",
            "voice_id": "yourVoiceId",
            "url": "https://yourAudioFileUrl"
        }
    }'

Python SDK

Interface description

def update_voice(self, voice_id: str, url: str) -> None:
    '''
    Update a voice
    param: voice_id Voice ID
    param: url URL of the audio file for voice cloning
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.update_voice(
    voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
    url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")

Java SDK

Interface description

/**
 * Update a voice
 *
 * @param voiceId Voice to update
 * @param url URL of the audio file for voice cloning
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public void updateVoice(String voiceId, String url)
    throws NoApiKeyException, InputRequiredException

Request example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String fileUrl = "https://your-audio-file-url";  // Replace with your actual value
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Update voice
        service.updateVoice(voiceId, fileUrl);
        logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
    }
}

Delete voice

Deletes a voice you no longer need, freeing up quota. This action cannot be undone.

RESTful API

  • URL and Request headers are the same as the Create voice API

  • Request body

    The request body contains all parameters. Omit optional fields as needed:

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "delete_voice",
            "voice_id": "yourVoiceID"
        }
    }
  • Request parameters

    Parameter | Type | Default | Required | Description
    --------- | ---- | ------- | -------- | -----------
    model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
    action | string | - | Yes | Action type. Fixed value: delete_voice.
    voice_id | string | - | Yes | ID of the voice to delete.

  • Response parameters

    View response example

    {
        "output": {},
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }
  • Sample code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If the API key isn’t in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "delete_voice",
            "voice_id": "yourVoiceID"
        }
    }'

Python SDK

Interface description

def delete_voice(self, voice_id: str) -> None:
    '''
    Delete a voice
    param: voice_id Voice to delete
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")

Java SDK

Interface description

/**
 * Delete a voice
 *
 * @param voiceId Voice to delete
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException 

Request example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Delete voice
        service.deleteVoice(voiceId);
        logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
    }
}

Quota and cleanup

  • Total limit: 1000 voices

    The API does not return the total voice count directly. To count your voices, list them and count the results yourself.
  • Automatic cleanup: If a voice has not been used for any speech synthesis requests in the past year, the system automatically deletes it.
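Because no endpoint returns the total directly, you can page through the list results and count them. Below is a minimal sketch of the pagination logic only; the `fetch_page` callback is an illustrative stand-in for a thin wrapper around the SDK's list call:

```python
def count_voices(fetch_page, page_size=10):
    """Count voices by paging until a short (possibly empty) page.

    fetch_page(page_index, page_size) must return the list of voices on
    that page, and an empty list once past the end.
    """
    total = 0
    page_index = 0
    while True:
        page = fetch_page(page_index, page_size)
        total += len(page)
        if len(page) < page_size:  # last page reached
            return total
        page_index += 1


if __name__ == "__main__":
    # Fake backend with 23 voices to demonstrate the loop.
    voices = [f"cosyvoice-v3-plus-myvoice-{i:03d}" for i in range(23)]
    fake_fetch = lambda i, n: voices[i * n:(i + 1) * n]
    print(count_voices(fake_fetch))  # 23
```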

Billing

  • Voice cloning/design: Creating, querying, updating, and deleting voices is free.

  • Speech synthesis using custom voices: Billed based on the number of text characters. For more information, see Real-time speech synthesis – CosyVoice.

Copyright and legality

You are responsible for ensuring that you own, or have the legal right to use, any voice you provide. Read the terms of service.

Error codes

If you encounter an error, see Error messages for troubleshooting.

FAQ

Features

Q: How do I adjust the speed and volume of a custom voice?

The same way you adjust a preset voice. Pass the corresponding parameters when calling the speech synthesis API. For example, use speech_rate (Python) or speechRate (Java) to adjust speed, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).

Q: How do I call the API using languages other than Java and Python (such as Go, C#, or Node.js)?

For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.

Troubleshooting

If an API call returns an error code, troubleshoot using the information in Error codes.

Q: What should I do if the synthesized audio from a cloned voice contains extra content?

If you find extra characters or noise in the synthesized audio from a cloned voice, follow these steps to troubleshoot:

  1. Check the source audio quality

    The quality of the cloned audio directly affects the synthesis result. Ensure the source audio meets these requirements:

    • No background noise or static

    • Clear sound quality (sample rate ≥ 16 kHz recommended)

    • Audio format: WAV is better than MP3 (avoid lossy compression)

    • Mono (stereo may cause interference)

    • No silent segments or long pauses

    • Moderate speech rate (a fast rate affects feature extraction)

  2. Check the input text

    Confirm the input text does not contain special symbols or markers:

    • Avoid special symbols such as **, "", and ''

    • Unless the symbols are needed for LaTeX formulas, preprocess the text to filter them out.

  3. Verify voice cloning parameters

    Ensure the language parameter (language_hints/languageHints) is set correctly when you create the voice.

  4. Try cloning again

    Use a higher-quality source audio file to clone the voice again and test the result.

  5. Compare with system voices

    Test the same text with a preset system voice to confirm if the issue is specific to the cloned voice.

Q: What should I do if the audio generated from a cloned voice is silent?

  1. Check voice status

    Call the Query specific voice API to check if the voice status is OK.

  2. Check model version consistency

    Ensure the target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example:

    • When you clone the voice, set target_model to cosyvoice-v3-plus.

    • When you synthesize speech, set model to cosyvoice-v3-plus as well.

  3. Verify source audio quality

    Check if the source audio used for cloning meets the voice cloning input audio format requirements:

    • Audio duration: 10–20 seconds

    • Clear sound quality

    • No background noise

  4. Check request parameters

    Confirm the voice parameter is set to the cloned voice's ID during speech synthesis.

Q: What should I do if the synthesized speech from a cloned voice is unstable or incomplete?

If the synthesized speech from a cloned voice has these issues:

  • Incomplete playback; only part of the text is read

  • Unstable synthesis quality; sometimes good, sometimes bad

  • Abnormal pauses or silent segments in the audio

Possible cause: The source audio quality does not meet requirements.

Solution: Verify that the source audio meets the following requirements; if it does not, rerecord it following the Recording Guide.

  • Check audio continuity: Ensure the speech in the source audio is continuous. Avoid long pauses or silent segments (over 2 seconds). Obvious blank segments can cause the model to treat silence or noise as part of the voice's features, affecting the result.

  • Check the speech activity ratio: Ensure that active speech makes up more than 60% of the total audio duration. Too much background noise or non-speech segments will interfere with voice feature extraction.

  • Verify audio quality details:

    • Audio duration: 10–20 seconds (15 seconds recommended)

    • Clear pronunciation and steady speech rate

    • No background noise, echo, or static

    • Concentrated speech energy with no long silent segments

Q: Why can't I find the VoiceEnrollmentService class?

Your SDK version is too old. Install the latest SDK.

Q: What should I do if the voice cloning result is poor, with noise or unclear audio?

This is usually caused by low-quality input audio. Rerecord and upload the audio, strictly following the Recording Guide.

Q: Why is there a long silence at the beginning or an abnormal total duration when I synthesize very short text (like a single word) with a cloned voice?

The voice cloning model learns pauses and rhythm from the sample audio. If the original recording has a long initial silence or pause, the synthesis result retains a similar pattern. For single words or very short text, this silence ratio is amplified, making it seem like the audio is long but mostly silent. To avoid this, trim long silences from sample audio. Use complete sentences or longer text for synthesis. If you must synthesize a single word, add some context before or after it.
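Trimming the long leading and trailing silences mentioned above can also be done on the raw samples before upload. A minimal sketch, with an illustrative amplitude threshold:

```python
def trim_silence(samples, threshold=500):
    """Drop leading/trailing samples whose amplitude is below threshold.

    samples: mono PCM amplitudes as a list of ints.
    """
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```

This removes only the edges; pauses inside the recording are left intact, which is usually what you want for natural rhythm.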