
Alibaba Cloud Model Studio:CosyVoice Voice Cloning and Voice Design API

Last Updated:Mar 06, 2026

The CosyVoice voice cloning service uses generative large speech models to create highly similar, natural-sounding custom voices from just 10–20 seconds of sample audio, with no traditional training required. Voice design generates custom voices from text descriptions and supports multilingual, multidimensional definitions of voice characteristics. Use cases include ad narration, character voice creation, and audiobook production. Voice cloning (or voice design) and speech synthesis are two sequential steps: you first create a voice, then use it for synthesis. This document covers the parameters and interfaces for voice cloning and voice design. For speech synthesis, see Real-time Speech Synthesis – CosyVoice/Sambert.

User Guide: For model introductions and selection guidance, see Real-time Speech Synthesis – CosyVoice/Sambert.

Important
  • This document covers only the CosyVoice voice cloning and voice design APIs. If you use Qwen models, see Voice Cloning (Qwen) and Voice Design (Qwen).

  • CosyVoice voice design uses the FunAudioGen-VD model.

Supported Models

  • Voice cloning:

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash

    • cosyvoice-v3-plus, cosyvoice-v3-flash

    • cosyvoice-v2

  • Voice design:

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash

    • cosyvoice-v3-plus, cosyvoice-v3-flash

Supported Languages

  • Voice cloning: Depends on the target speech synthesis model (specified by the target_model/targetModel parameter):

    • cosyvoice-v2: Chinese (Mandarin), English

    • cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese

    • cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan, Hubei, Minnan, Ningxia, Shaanxi, Shandong, Shanghai, Sichuan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese

    Voice cloning does not currently support Spanish, Italian, or other languages.

  • Voice design: Chinese, English.
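For client-side pre-checks, the language lists above can be captured in a small lookup table keyed by the ISO-style codes used later by the language_hints parameter. A minimal sketch; the table and helper name are illustrative, not part of the API:

```python
# Language support per target synthesis model, transcribed from the lists above.
# Dialects (Cantonese, Sichuan, etc.) fall under "zh" for language purposes.
SUPPORTED_LANGUAGES = {
    "cosyvoice-v2": {"zh", "en"},
    "cosyvoice-v3-plus": {"zh", "en", "fr", "de", "ja", "ko", "ru"},
    "cosyvoice-v3-flash": {"zh", "en", "fr", "de", "ja", "ko", "ru", "pt", "th", "id", "vi"},
    "cosyvoice-v3.5-plus": {"zh", "en", "fr", "de", "ja", "ko", "ru", "pt", "th", "id", "vi"},
    "cosyvoice-v3.5-flash": {"zh", "en", "fr", "de", "ja", "ko", "ru", "pt", "th", "id", "vi"},
}

def supports_language(target_model: str, lang: str) -> bool:
    """Return True if the language code is listed for the given target model."""
    return lang in SUPPORTED_LANGUAGES.get(target_model, set())

print(supports_language("cosyvoice-v3-plus", "th"))   # False
print(supports_language("cosyvoice-v3.5-plus", "th")) # True
```

Checking locally avoids a failed API round trip when a request targets an unsupported model/language pair.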

Quick Start: From Voice Cloning to Speech Synthesis

Voice cloning and speech synthesis are two independent yet closely related steps. Follow the “create first, use later” workflow:

  1. Prepare an audio recording file

    Upload an audio file that meets the input audio format requirements for voice cloning to a publicly accessible location, such as Alibaba Cloud OSS. You must ensure the URL is publicly accessible.

  2. Create a voice

    Call the Create voice API. You must specify target_model/targetModel to declare which speech synthesis model will drive the created voice.

    If you already have a created voice (check using the List voices API), skip this step and go to the next one.

  3. Use the voice for speech synthesis

    After successfully creating a voice using the Create voice API, the system returns a voice_id/voiceID:

    • You can use this voice_id/voiceID directly as the voice parameter in the speech synthesis API or in SDKs for text-to-speech.

    • It supports multiple invocation modes: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model used during synthesis must match the target_model/targetModel used when creating the voice. Otherwise, synthesis fails.

Sample code:

import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Environment preparation
# We recommend configuring your API key as an environment variable
# API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you haven’t configured an environment variable, replace the next line with: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# This is the Singapore-region WebSocket URL. If you use a Beijing-region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# This is the Singapore-region HTTP URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'


# 2. Define cloning parameters
TARGET_MODEL = "cosyvoice-v3.5-plus" 
# Give your voice a meaningful prefix
VOICE_PREFIX = "myvoice" # Letters and numbers only, up to 10 characters.
# Publicly accessible audio URL
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav" # Example URL. Replace with your own.

# 3. Create a voice (asynchronous task)
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise e
# 4. Poll for voice status
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10 # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
        status = voice_info.get("status")
    except Exception as e:
        # Transient query errors: log and retry on the next interval
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
        continue
    print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")

    if status == "OK":
        print("Voice is ready for synthesis.")
        break
    elif status == "UNDEPLOYED":
        # Raised outside the try block above so the failure is not swallowed by the retry handler
        raise RuntimeError(f"Voice processing failed with status: {status}. Check audio quality or contact support.")
    # Keep waiting for intermediate statuses like "DEPLOYING"
    time.sleep(poll_interval)
else:
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations! You’ve successfully cloned and synthesized your own voice!"
    
    # The call() method returns binary audio data
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")

    # 6. Save audio file
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")

except Exception as e:
    print(f"Error during speech synthesis: {e}")

Quick Start: From Voice Design to Speech Synthesis

Voice design and speech synthesis are two independent but closely linked steps. Follow the “create first, use later” workflow:

  1. Prepare the voice description and preview text for voice design.

    • Voice prompt (voice_prompt): Defines the target voice’s characteristics. For more information, see “Voice design: How to write high-quality voice prompts?”.

    • Preview text (preview_text): The text read aloud in the preview audio (such as “Hello everyone, welcome to listen.”).

  2. Call the Create voice API to create a custom voice and obtain the voice name and preview audio.

    You must specify target_model to declare which speech synthesis model will drive the created voice.

    You can listen to the preview audio to check if it meets expectations. If it meets expectations, proceed to the next step. Otherwise, redesign it.

    If you already have a created voice (check using the List voices API), skip this step and go to the next one.

  3. Use the voice for speech synthesis

    After successfully creating a voice using the Create voice API, the system returns a voice_id/voiceID:

    • You can use this voice_id/voiceID directly as the voice parameter in the speech synthesis API or in SDKs for text-to-speech.

    • It supports multiple invocation modes: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model used during synthesis must match the target_model/targetModel used when creating the voice. Otherwise, synthesis fails.

Sample code:

  1. Create a custom voice and preview it. If you are satisfied, proceed. Otherwise, recreate it.

    Python

    import requests
    import base64
    import os
    
    def create_voice_and_play():
        # API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If you haven’t configured an environment variable, replace the next line with: api_key = "sk-xxx"
        api_key = os.getenv("DASHSCOPE_API_KEY")
        
        if not api_key:
            print("Error: DASHSCOPE_API_KEY environment variable not found. Please set your API key first.")
            return None, None, None
        
        # Prepare request data
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        
        data = {
            "model": "voice-enrollment",
            "input": {
                "action": "create_voice",
                "target_model": "cosyvoice-v3.5-plus",
                "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
                "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
                "prefix": "announcer"
            },
            "parameters": {
                "sample_rate": 24000,
                "response_format": "wav"
            }
        }
        
        # This is the Singapore-region URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
        url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
        
        try:
            # Send request
            response = requests.post(
                url,
                headers=headers,
                json=data,
                timeout=60  # Add timeout setting
            )
            
            if response.status_code == 200:
                result = response.json()
                
                # Get voice ID
                voice_id = result["output"]["voice_id"]
                print(f"Voice ID: {voice_id}")
                
                # Get preview audio data
                base64_audio = result["output"]["preview_audio"]["data"]
                
                # Decode Base64 audio data
                audio_bytes = base64.b64decode(base64_audio)
                
                # Save audio file locally
                filename = f"{voice_id}_preview.wav"
                
                # Write audio data to local file
                with open(filename, 'wb') as f:
                    f.write(audio_bytes)
                
                print(f"Audio saved to local file: {filename}")
                print(f"File path: {os.path.abspath(filename)}")
                
                return voice_id, audio_bytes, filename
            else:
                print(f"Request failed. Status code: {response.status_code}")
                print(f"Response: {response.text}")
                return None, None, None
                
        except requests.exceptions.RequestException as e:
            print(f"Network request error: {e}")
            return None, None, None
        except KeyError as e:
            print(f"Response format error. Missing required field: {e}")
            print(f"Response: {response.text if 'response' in locals() else 'No response'}")
            return None, None, None
        except Exception as e:
            print(f"Unknown error: {e}")
            return None, None, None
    
    if __name__ == "__main__":
        print("Starting voice creation...")
        voice_id, audio_data, saved_filename = create_voice_and_play()
        
        if voice_id:
            print(f"\nSuccessfully created voice '{voice_id}'")
            print(f"Audio file saved: '{saved_filename}'")
            print(f"File size: {os.path.getsize(saved_filename)} bytes")
        else:
            print("\nVoice creation failed")

    Java

    This example requires the Gson dependency. If you use Maven or Gradle, add it as follows:

    Maven

    Add the following to your pom.xml:

    <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.13.1</version>
    </dependency>

    Gradle

    Add the following to your build.gradle:

    // https://mvnrepository.com/artifact/com.google.code.gson/gson
    implementation("com.google.code.gson:gson:2.13.1")

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;
    
    public class Main {
        public static void main(String[] args) {
            Main example = new Main();
            example.createVoice();
        }
    
        public void createVoice() {
            // API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
            // If you haven’t configured an environment variable, replace the next line with: String apiKey = "sk-xxx"
            String apiKey = System.getenv("DASHSCOPE_API_KEY");
    
            // Create JSON request body string
            String jsonBody = "{\n" +
                    "    \"model\": \"voice-enrollment\",\n" +
                    "    \"input\": {\n" +
                    "        \"action\": \"create_voice\",\n" +
                    "        \"target_model\": \"cosyvoice-v3.5-plus\",\n" +
                    "        \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                    "        \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                    "        \"prefix\": \"announcer\"\n" +
                    "    },\n" +
                    "    \"parameters\": {\n" +
                    "        \"sample_rate\": 24000,\n" +
                    "        \"response_format\": \"wav\"\n" +
                    "    }\n" +
                    "}";
    
            HttpURLConnection connection = null;
            try {
                // This is the Singapore-region URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
                URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
                connection = (HttpURLConnection) url.openConnection();
    
                // Set request method and headers
                connection.setRequestMethod("POST");
                connection.setRequestProperty("Authorization", "Bearer " + apiKey);
                connection.setRequestProperty("Content-Type", "application/json");
                connection.setDoOutput(true);
                connection.setDoInput(true);
    
                // Send request body
                try (OutputStream os = connection.getOutputStream()) {
                    byte[] input = jsonBody.getBytes("UTF-8");
                    os.write(input, 0, input.length);
                    os.flush();
                }
    
                // Get response
                int responseCode = connection.getResponseCode();
                if (responseCode == HttpURLConnection.HTTP_OK) {
                    // Read response content
                    StringBuilder response = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            response.append(responseLine.trim());
                        }
                    }
    
                    // Parse JSON response
                    JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                    JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                    JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
    
                    // Get voice ID
                    String voiceId = outputObj.get("voice_id").getAsString();
                    System.out.println("Voice ID: " + voiceId);
    
                    // Get Base64-encoded audio data
                    String base64Audio = previewAudioObj.get("data").getAsString();
    
                    // Decode Base64 audio data
                    byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
    
                    // Save audio to local file
                    String filename = voiceId + "_preview.wav";
                    saveAudioToFile(audioBytes, filename);
    
                    System.out.println("Audio saved to local file: " + filename);
    
                } else {
                    // Read error response
                    StringBuilder errorResponse = new StringBuilder();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                        String responseLine;
                        while ((responseLine = br.readLine()) != null) {
                            errorResponse.append(responseLine.trim());
                        }
                    }
    
                    System.out.println("Request failed. Status code: " + responseCode);
                    System.out.println("Error response: " + errorResponse.toString());
                }
    
            } catch (Exception e) {
                System.err.println("Request error: " + e.getMessage());
                e.printStackTrace();
            } finally {
                if (connection != null) {
                    connection.disconnect();
                }
            }
        }
    
        private void saveAudioToFile(byte[] audioBytes, String filename) {
            try {
                File file = new File(filename);
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audioBytes);
                }
                System.out.println("Audio saved to: " + file.getAbsolutePath());
            } catch (IOException e) {
                System.err.println("Error saving audio file: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
  2. Use the custom voice created in the previous step for speech synthesis.

    This example builds on the non-streaming call example. You must replace the voice parameter with the custom voice ID generated by voice design.

    Key rule: The model used for voice design (target_model) must match the model used for speech synthesis (model). Otherwise, synthesis fails.

    Python

    # coding=utf-8
    
    import dashscope
    from dashscope.audio.tts_v2 import *
    import os
    
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
    
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
    
    # The same model must be used for voice design and speech synthesis.
    model = "cosyvoice-v3.5-plus"
    # Replace the voice parameter with the custom voice generated from voice design.
    voice = "your_voice_id"
    
    # Instantiate SpeechSynthesizer and pass request parameters such as model and voice to the constructor.
    synthesizer = SpeechSynthesizer(model=model, voice=voice)
    # Send the text to be synthesized to obtain the binary audio.
    audio = synthesizer.call("How is the weather today?")
    # When you send text for the first time, a WebSocket connection needs to be established. Therefore, the first-package latency includes the connection establishment time.
    print('[Metric] Request ID: {}, First-package latency: {} ms'.format(
        synthesizer.get_last_request_id(),
        synthesizer.get_first_package_delay()))
    
    # Save the audio to a local file.
    with open('output.mp3', 'wb') as f:
        f.write(audio)

    Java

    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
    import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
    import com.alibaba.dashscope.utils.Constants;
    
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    
    public class Main {
        // Use the same model for voice design and speech synthesis
        private static String model = "cosyvoice-v3.5-plus";
        // Replace the voice parameter with the custom voice ID generated by voice design
        private static String voice = "your_voice_id";
    
        public static void streamAudioDataToSpeaker() {
            // Request parameters
            SpeechSynthesisParam param =
                    SpeechSynthesisParam.builder()
                            // API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                            // If you haven’t configured an environment variable, replace the next line with: .apiKey("sk-xxx")
                            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                            .model(model) // Model
                            .voice(voice) // Voice
                            .build();
    
            // Sync mode: disable callback (second parameter is null)
            SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
            ByteBuffer audio = null;
            try {
                // Block until audio returns
                audio = synthesizer.call("How is the weather today?");
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // Close WebSocket connection after task ends
                synthesizer.getDuplexApi().close(1000, "bye");
            }
            if (audio != null) {
                // Save audio data to local file “output.mp3”
                File file = new File("output.mp3");
                // The first packet delay includes WebSocket connection setup time
                System.out.println(
                        "[Metric] requestId: "
                                + synthesizer.getLastRequestId()
                                + ", first packet delay (ms): "
                                + synthesizer.getFirstPackageDelay());
                try (FileOutputStream fos = new FileOutputStream(file)) {
                    fos.write(audio.array());
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }
    
        public static void main(String[] args) {
            // This is the Singapore-region URL. If you use a Beijing-region model, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
            Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
            streamAudioDataToSpeaker();
            System.exit(0);
        }
    }

API Reference

Use the same Alibaba Cloud account for all API operations.

Important

The Java and Python DashScope SDKs do not support voice design. For voice design, use the RESTful API.

Create Voice

RESTful API

  • URL

    Mainland China:

    POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

    International:

    POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
  • Request Headers

    Parameter

    Type

    Required

    Description

    Authorization

    string

    Yes

    Authentication token. Format: Bearer <your_api_key>. Replace "<your_api_key>" with your actual API key.

    Content-Type

    string

    Yes

    Media type of data in the request body. Fixed value: application/json.

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs.

    Important

    Note the difference between these parameters:

    • model: Voice cloning/design model. Fixed value: voice-enrollment

    • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

    Voice Cloning

    {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "prefix": "myvoice",
            "url": "https://yourAudioFileUrl",
            "language_hints": ["zh"]
        }
    }

    Voice Design

    {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer",
            "language_hints": ["zh"]
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }
  • Request Parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: create_voice.

    target_model

    string

    -

    Yes

    Speech synthesis model that drives the voice (see Supported Models).

    Must match the speech synthesis model used later. Otherwise, synthesis fails.

    url

    string

    -

    Conditional

    Important

    Required only for voice cloning

    Publicly accessible URL of the audio file used for voice cloning.

    For audio format details, see the input audio format requirements for voice cloning.

    For recording guidance, see Recording Guide.

    voice_prompt

    string

    -

    Conditional

    Important

    Required only for voice design

    Voice description. Maximum length: 500 characters.

    Chinese and English only.

    For guidance on writing voice descriptions, see "Voice design: How to write high-quality voice prompts?".

    preview_text

    string

    -

    Conditional

    Important

    Required only for voice design

    Text for the preview audio. Maximum length: 200 characters.

    Supported languages: Chinese (zh), English (en).

    prefix

    string

    -

    Yes

    Name for the voice (letters and numbers only; up to 10 characters). Use identifiers related to role or scenario.

    This keyword appears in the final voice name. For example, if the keyword is "announcer", the final voice names are:
    Voice cloning: cosyvoice-v3.5-plus-announcer-8aae0c0397fa408ca60c29cf******
    Voice design: cosyvoice-v3.5-plus-vd-announcer-8aae0c0397fa408ca60c29cf******

    language_hints

    array[string]

    ["zh"]

    No

    Specifies the language of the input: for voice cloning, the language of the sample audio used to extract target timbre features; for voice design, the language preference of the generated voice. Available only for the cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Note: This parameter is an array, but only the first element is processed. Pass only one value.

    Functionality:

    Voice Cloning

    This parameter helps the model identify the language of the sample audio (original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores this hint and automatically detects the language based on the audio content.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice Design

    Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.

    If used, the language must match the preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

    max_prompt_audio_length

    float

    10.0

    No

    Important

    Available only for voice cloning

    Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.

    Valid range: [3.0, 30.0].

    enable_preprocess

    boolean

    false

    No

    Important

    Available only for voice cloning

    Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.

    Valid values:

    • true: Enable

    • false: Disable

    sample_rate

    int

    24000

    No

    Important

    Available only for voice design

    Sample rate (Hz) of the preview audio generated by voice design.

    Valid values:

    • 16000

    • 24000

    • 48000

    response_format

    string

    wav

    No

    Important

    Available only for voice design

    Format of the preview audio generated by voice design.

    Valid values:

    • pcm

    • wav

    • mp3

  • Response Parameters

    View response examples

    Voice Cloning

    {
        "output": {
            "voice_id": "yourVoiceId"
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Voice Design

    {
        "output": {
            "preview_audio": {
                "data": "{base64_encoded_audio}",
                "sample_rate": 24000,
                "response_format": "wav"
            },
            "target_model": "cosyvoice-v3.5-plus",
            "voice_id": "yourVoice"
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Key parameters:

    Parameter

    Type

    Description

    voice_id

    string

    Voice ID. Use directly as the voice parameter in the speech synthesis API.

    data

    string

    Preview audio data generated by voice design, returned as a Base64-encoded string.

    sample_rate

    int

    Sample rate (Hz) of the preview audio generated by voice design. Matches the sample rate used when creating the voice. Default: 24000 Hz if unspecified.

    response_format

    string

    Format of the preview audio generated by voice design. Matches the format used when creating the voice. Default: wav if unspecified.

    target_model

    string

    Speech synthesis model that drives the voice (see Supported Models).

    Must match the speech synthesis model used later. Otherwise, synthesis fails.

    request_id

    string

    Request ID.

    count

    integer

    Number of "create voice" operations in this request.

    Always 1 for voice creation.
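The `data` field in the voice-design response is Base64-encoded, so it must be decoded before the preview audio can be saved or played. A minimal sketch (the helper name is ours, not part of the SDK):

```python
import base64

def save_preview_audio(data_b64: str, path: str) -> int:
    """Decode the Base64 `data` field from a voice-design response and
    write the raw audio bytes to `path`. Returns the number of bytes written."""
    audio_bytes = base64.b64decode(data_b64)
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)

# Typical use with a parsed response dict:
# save_preview_audio(resp["output"]["preview_audio"]["data"], "preview.wav")
```

Note that for `wav` and `mp3` the decoded bytes are a complete file, while `pcm` is headerless sample data at the requested sample rate.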

  • Sample Code

    Important

    Note the difference between these parameters:

    • model: Voice cloning/design model. Fixed value: voice-enrollment

    • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

    Voice Cloning

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "prefix": "myvoice",
            "url": "https://yourAudioFileUrl"
        }
    }'

    Voice Design

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }'

Python SDK

Interface Description

Before using this interface, install the latest DashScope SDK.

def create_voice(self, target_model: str, prefix: str, url: str, language_hints: List[str] = None) -> str:
    '''
    Create a new custom voice.
    param: target_model Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
    param: prefix Name for the voice (letters, numbers, and underscores only; up to 10 characters). Use identifiers related to role or scenario. This keyword appears in the cloned voice name. Format: model-name-prefix-unique-id, e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx.
    param: url Publicly accessible URL of the audio file used for voice cloning.
    param: language_hints Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
                Helps the model identify the language of the reference audio (original sample), improving voice feature extraction and cloning quality.
                If the language hint does not match the actual audio (e.g., "en" for Chinese audio), the system ignores the hint and detects the language automatically.
                Valid values (by model):
                    cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
                    cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
                This parameter is an array, but only the first element is processed. Pass only one value.
    param: max_prompt_audio_length Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
                Valid range: [3.0, 30.0].
    param: enable_preprocess Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
    return: voice_id Voice ID. Use directly as the voice parameter in the speech synthesis API.
    '''
Important
  • target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

  • language_hints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Functionality:

    Voice Cloning

    This parameter helps the model identify the language of the sample audio (original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores this hint and automatically detects the language based on the audio content.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice Design

    Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.

    If used, the language must match the preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Avoid frequent calls. Each call creates a new voice. After reaching your quota limit, you cannot create more.
voice_id = service.create_voice(
    target_model='cosyvoice-v3.5-plus',
    prefix='myvoice',
    url='https://your-audio-file-url',  # keep the trailing comma if you uncomment the options below
    # language_hints=['zh'],
    # max_prompt_audio_length=10.0,
    # enable_preprocess=False
)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")
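Because an invalid prefix fails only at request time, it can help to check it locally before calling create_voice. A hypothetical helper (not part of the DashScope SDK) enforcing the documented rule, letters, numbers, and underscores only, up to 10 characters:

```python
import re

# Documented prefix rule: letters, numbers, and underscores only; up to 10 characters.
_PREFIX_RE = re.compile(r"^[A-Za-z0-9_]{1,10}$")

def is_valid_prefix(prefix: str) -> bool:
    """Return True if `prefix` satisfies the documented naming rule."""
    return _PREFIX_RE.fullmatch(prefix) is not None
```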

Java SDK

Interface Description

Before using this interface, install the latest DashScope SDK.

/**
 * Create a new custom voice.
 *
 * @param targetModel Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
 * @param prefix Name for the voice (letters, numbers, and underscores only; up to 10 characters). Use identifiers related to role or scenario. This keyword appears in the cloned voice name. Format: model-name-prefix-unique-id, e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx.
 * @param url Publicly accessible URL of the audio file used for voice cloning.
 * @param customParam Custom parameters. Specify languageHints and maxPromptAudioLength here.
 *              languageHints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
 *                  Helps the model identify the language of the reference audio (original sample), improving voice feature extraction and cloning quality.
 *                  If the language hint does not match the actual audio (e.g., "en" for Chinese audio), the system ignores the hint and detects the language automatically.
 *                  Valid values (by model):
 *                      cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
 *                      cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
 *                  Only the first element is processed. Pass only one value.
 *              maxPromptAudioLength: Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
 *                  Valid range: [3.0, 30.0].
 *              enable_preprocess: Configure this parameter using the generic parameter "parameter". Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
 * @return Voice New voice. Call Voice.getVoiceId() to get the voice ID. Use directly as the voice parameter in the speech synthesis API.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
Important
  • targetModel: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.

  • languageHints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.

    Functionality:

    Voice Cloning

    This parameter helps the model identify the language of the sample audio (original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores this hint and automatically detects the language based on the audio content.

    Valid values (by model):

    • cosyvoice-v3-plus:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

    • cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:

      • zh: Chinese (default)

      • en: English

      • fr: French

      • de: German

      • ja: Japanese

      • ko: Korean

      • ru: Russian

      • pt: Portuguese

      • th: Thai

      • id: Indonesian

      • vi: Vietnamese

    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.

    Voice Design

    Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.

    If used, the language must match the preview_text language.

    Valid values:

    • zh: Chinese (default)

    • en: English

Request Example

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Collections;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        String targetModel = "cosyvoice-v3.5-plus";
        String prefix = "myvoice";
        String fileUrl = "https://your-audio-file-url";
        String cloneModelName = "voice-enrollment";

        try {
            VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
            Voice myVoice = service.createVoice(
                    targetModel,
                    prefix,
                    fileUrl,
                    VoiceEnrollmentParam.builder()
                            .model(cloneModelName)
                            .languageHints(Collections.singletonList("zh"))
                            // .maxPromptAudioLength(10.0f)
                            // .parameter("enable_preprocess", false)
                            .build());

            logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
            logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
        } catch (Exception e) {
            logger.error("Failed to create voice", e);
        }
    }
}

List Voices

Query the list of created voices with pagination.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs.

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "list_voice",
        "prefix": "announcer",
            "page_size": 10,
            "page_index": 0
        }
    }
  • Request Parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: list_voice.

    prefix

    string

    -

    No

    Same prefix used when creating the voice. Letters, numbers, and underscores only; up to 10 characters.

    page_index

    integer

    0

    No

    Page index. Must be greater than or equal to 0.

    page_size

    integer

    10

    No

    Number of items per page. Valid range: [0, 1000].

  • Response Parameters

    View response examples

    Voice Cloning

    {
        "output": {
            "voice_list": [
                {
                    "gmt_create": "2024-12-11 13:38:02",
                    "voice_id": "yourVoiceId",
                    "gmt_modified": "2024-12-11 13:38:02",
                    "status": "OK"
                }
            ]
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Voice Design

    {
        "output": {
            "voice_list": [
                {
                    "gmt_create": "2025-12-10 17:04:54",
                    "gmt_modified": "2025-12-10 17:04:54",
                    "preview_text": "Dear listeners, hello everyone. Welcome to today's show.",
                    "target_model": "cosyvoice-v3.5-plus",
                    "voice_id": "yourVoice1",
                    "status": "OK",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary; deep and magnetic, steady pace"
                },
                {
                    "gmt_create": "2025-12-10 15:31:35",
                    "gmt_modified": "2025-12-10 15:31:35",
                    "language": "zh",
                    "preview_text": "Dear listeners, hello everyone",
                    "target_model": "cosyvoice-v3.5-plus",
                    "voice_id": "yourVoice2",
                    "status": "OK",
                    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary."
                }
            ]
        },
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }

    Key parameters:

    Parameter

    Type

    Description

    voice_id

    string

    Voice ID. Use directly as the voice parameter in the speech synthesis API.

    target_model

    string

    Speech synthesis model that drives the voice (see Supported Models).

    Must match the speech synthesis model used later. Otherwise, synthesis fails.

    gmt_create

    string

    Time the voice was created.

    gmt_modified

    string

    Time the voice was last modified.

    voice_prompt

    string

    Voice description.

    preview_text

    string

    Preview text.

    request_id

    string

    Request ID.

    status

    string

    Voice status:

    • DEPLOYING: Under review

    • OK: Approved and ready to use

    • UNDEPLOYED: Rejected and unavailable

  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "list_voice",
            "prefix": "announcer",
            "page_size": 10,
            "page_index": 0
        }
    }'

Python SDK

Interface Description

def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
    '''
    Query all created voices
    param: prefix Custom prefix for the voice (letters, numbers, and underscores only; up to 10 characters).
    param: page_index Page index to query
    param: page_size Page size to query
    return: List[dict] Voice list containing ID, creation time, modification time, and status for each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
    Voice statuses:
        DEPLOYING: Under review
        OK: Approved and ready to use
        UNDEPLOYED: Rejected and unavailable
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Filter by prefix, or set to None to query all
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")

Response Example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]

Response Parameters

Parameter

Type

Description

voice_id

string

Voice ID. Use directly as the voice parameter in the speech synthesis API.

target_model

string

Speech synthesis model that drives the voice (see Supported Models).

Must match the speech synthesis model used later. Otherwise, synthesis fails.

gmt_create

string

Time the voice was created.

gmt_modified

string

Time the voice was last modified.

voice_prompt

string

Voice description.

preview_text

string

Preview text.

request_id

string

Request ID.

status

string

Voice status:

  • DEPLOYING: Under review

  • OK: Approved and ready to use

  • UNDEPLOYED: Rejected and unavailable
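To collect every voice across pages, keep fetching until a page comes back shorter than the page size. A sketch with the page fetcher injected as a callable, so it works with `list_voices` or any stub (the helper name is ours, not SDK API):

```python
def iter_all_voices(fetch_page, page_size=10):
    """Yield every voice, walking pages until one comes back short or empty.

    `fetch_page(page_index, page_size)` returns the voice list for one page,
    e.g. lambda i, n: service.list_voices(prefix='myvoice', page_index=i, page_size=n).
    """
    page_index = 0
    while True:
        page = fetch_page(page_index, page_size)
        yield from page
        if len(page) < page_size:   # short or empty page: no more data
            break
        page_index += 1
```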

Java SDK

Interface Description

// Voice statuses:
//        DEPLOYING: Under review
//        OK: Approved and ready to use
//        UNDEPLOYED: Rejected and unavailable
/**
 * Query all created voices. Default page index is 0, default page size is 10.
 *
 * @param prefix Custom prefix for the voice (letters, numbers, and underscores only; up to 10 characters). Can be null.
 * @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException 

/**
 * Query all created voices.
 *
 * @param prefix Custom prefix for the voice (letters, numbers, and underscores only; up to 10 characters).
 * @param pageIndex Page index to query.
 * @param pageSize Page size to query.
 * @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
 * @throws NoApiKeyException If the API key is empty.
 * @throws InputRequiredException If a required parameter is empty.
 */
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException

Request Example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String prefix = "myvoice"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Query voices
        Voice[] voices = service.listVoice(prefix, 0, 10);
        logger.info("List successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voices Details: {}", new Gson().toJson(voices));
    }
}

Response Example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]

Response Parameters

Parameter

Type

Description

voice_id

string

Voice ID. Use directly as the voice parameter in the speech synthesis API.

target_model

string

Speech synthesis model that drives the voice (see Supported Models).

Must match the speech synthesis model used later. Otherwise, synthesis fails.

gmt_create

string

Time the voice was created.

gmt_modified

string

Time the voice was last modified.

voice_prompt

string

Voice description.

preview_text

string

Preview text.

request_id

string

Request ID.

status

string

Voice status:

  • DEPLOYING: Under review

  • OK: Approved and ready to use

  • UNDEPLOYED: Rejected and unavailable

Query Specific Voice

Query detailed information about a specific voice by its voice ID.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs.

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "query_voice",
            "voice_id": "yourVoiceID"
        }
    }
  • Request Parameters

    Parameter

    Type

    Default

    Required

    Description

    model

    string

    -

    Yes

    Voice cloning/design model. Fixed value: voice-enrollment.

    action

    string

    -

    Yes

    Action type. Fixed value: query_voice.

    voice_id

    string

    -

    Yes

    ID of the voice to query.

  • Response Parameters

    View response examples

    Voice Cloning

    {
        "output": {
            "gmt_create": "2024-12-11 13:38:02",
            "resource_link": "https://yourAudioFileUrl",
            "target_model": "cosyvoice-v3.5-plus",
            "gmt_modified": "2024-12-11 13:38:02",
            "status": "OK"
        },
        "usage": {
            "count": 1
        },
        "request_id": "2450f969-d9ea-9483-bafc-************"
    }

    Voice Design

    {
        "output": {
            "gmt_create": "2025-12-10 14:54:09",
            "gmt_modified": "2025-12-10 17:47:48",
            "preview_text": "Dear listeners, hello everyone",
            "target_model": "cosyvoice-v3.5-plus",
            "status": "OK",
            "voice_id": "yourVoice",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary."
        },
        "usage": {},
        "request_id": "yourRequestId"
    }

    For parameter descriptions, see the List Voices API.

  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "query_voice",
            "voice_id": "yourVoiceID"
        }
    }'

Python SDK

Interface Description

def query_voice(self, voice_id: str) -> List[str]:
    '''
    Query details for a specific voice
    param: voice_id ID of the voice to query
    return: List[str] Voice details, including status, creation time, audio link, etc.
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'

voice_details = service.query_voice(voice_id=voice_id)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")

Response Example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3.5-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}
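A voice in the DEPLOYING state is not yet usable for synthesis. A polling sketch built on any query callable that returns a mapping with a `status` field, as in the response example above (the helper and its defaults are ours, not SDK API):

```python
import time

def wait_until_ready(query, voice_id, timeout_s=600, interval_s=10):
    """Poll `query(voice_id)` until the voice leaves the DEPLOYING state.

    Returns the final status: "OK" (approved) or "UNDEPLOYED" (rejected).
    Raises TimeoutError if the voice is still DEPLOYING after `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = query(voice_id)["status"]
        if status != "DEPLOYING":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"voice {voice_id} still DEPLOYING after {timeout_s}s")
        time.sleep(interval_s)
```

For example, `query` could wrap `VoiceEnrollmentService().query_voice` if its result exposes the `status` field shown above.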

Response Parameters

For parameter descriptions, see the List Voices API.

Java SDK

Interface Description

/**
 * Query details for a specific voice
 *
 * @param voiceId ID of the voice to query
 * @return Voice Voice details, including status, creation time, audio link, etc.
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException

Request Example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        Voice voice = service.queryVoice(voiceId);
        
        logger.info("Query successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voice Details: {}", new Gson().toJson(voice));
    }
}

Response Example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3.5-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}

Response Parameters

For parameter descriptions, see the List Voices API.

Update Voice (Voice Cloning Only)

Update an existing voice with a new audio file.

Important

This feature is not supported for voice design.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The request body contains all parameters. Optional fields can be omitted based on your business needs:

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "update_voice",
            "voice_id": "yourVoiceId",
            "url": "https://yourAudioFileUrl"
        }
    }
  • Request Parameters

    Parameter | Type | Default | Required | Description
    model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
    action | string | - | Yes | Action type. Fixed value: update_voice.
    voice_id | string | - | Yes | ID of the voice to update.
    url | string | - | Yes | URL of the audio file used to update the voice. The URL must be publicly accessible.

  • Response Parameters

    View response example

    {
        "output": {},
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }
  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "update_voice",
            "voice_id": "yourVoiceId",
            "url": "https://yourAudioFileUrl"
        }
    }'
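For languages without an official SDK, you can assemble the same request yourself. The following is a hedged Python sketch that builds the endpoint, headers, and body for the update_voice action; the helper name `build_voice_request` is my own, not part of any SDK, and the URL shown is the Singapore-region endpoint.

```python
import json

def build_voice_request(action: str, api_key: str, **fields) -> dict:
    """Assemble the pieces of a voice-enrollment RESTful call.

    Illustrative helper (not part of any SDK): `action` is one of the
    actions in this document (create_voice, update_voice, delete_voice,
    ...); extra keyword fields (voice_id, url, ...) go into "input".
    """
    return {
        # Singapore-region endpoint; use the Beijing URL for Beijing-region models.
        "url": "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": "voice-enrollment",  # fixed value, do not change
            "input": {"action": action, **fields},
        }),
    }

req = build_voice_request(
    "update_voice",
    api_key="your-api-key",
    voice_id="yourVoiceId",
    url="https://yourAudioFileUrl",
)
# POST req["body"] to req["url"] with req["headers"] using any HTTP client.
```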

Python SDK

Interface Description

def update_voice(self, voice_id: str, url: str) -> None:
    '''
    Update a voice
    param: voice_id Voice ID
    param: url URL of the audio file for voice cloning
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.update_voice(
    voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
    url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")

Java SDK

Interface Description

/**
 * Update a voice
 *
 * @param voiceId Voice to update
 * @param url URL of the audio file for voice cloning
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public void updateVoice(String voiceId, String url)
    throws NoApiKeyException, InputRequiredException

Request Example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String fileUrl = "https://your-audio-file-url";  // Replace with your actual value
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Update voice
        service.updateVoice(voiceId, fileUrl);
        logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
    }
}

Delete Voice

To free up quota, delete voices you no longer need. This action is irreversible.

RESTful API

  • URL and Request Headers are the same as the Create Voice API

  • Request Body

    The full request body is shown below. All of the following fields are required:

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    {
        "model": "voice-enrollment",
        "input": {
            "action": "delete_voice",
            "voice_id": "yourVoiceID"
        }
    }
  • Request Parameters

    Parameter | Type | Default | Required | Description
    model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
    action | string | - | Yes | Action type. Fixed value: delete_voice.
    voice_id | string | - | Yes | ID of the voice to delete.

  • Response Parameters

    View response example

    {
        "output": {},
        "usage": {
            "count": 1
        },
        "request_id": "yourRequestId"
    }
  • Sample Code

    Important

    model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.

    If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.

    # ======= Important Notice =======
    # The following is the URL for the Singapore region. If you use the Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    # The API keys for Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # === Delete this comment before running ===
    
    curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
    -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "voice-enrollment",
        "input": {
            "action": "delete_voice",
            "voice_id": "yourVoiceID"
        }
    }'

Python SDK

Interface Description

def delete_voice(self, voice_id: str) -> None:
    '''
    Delete a voice
    param: voice_id Voice to delete
    '''

Request Example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")

Java SDK

Interface Description

/**
 * Delete a voice
 *
 * @param voiceId Voice to delete
 * @throws NoApiKeyException If the API key is empty
 * @throws InputRequiredException If a required parameter is empty
 */
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException 

Request Example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you haven't set the environment variable, replace this with your API key
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Delete voice
        service.deleteVoice(voiceId);
        logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
    }
}

Voice Quotas and Automatic Cleanup Rules

  • Total limit: 1000 voices

    The API does not return your current voice count directly. To check how many voices you have, query your voice list and count the results yourself.
  • Automatic cleanup: If a voice has not been used for any speech synthesis requests in the past year, the system automatically deletes it.
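Counting your voices therefore means paging through the voice list until a short page is returned. Below is a minimal sketch of that paging loop; `fetch_page` is a placeholder for a thin wrapper around your actual voice-list call, and the demo uses an in-memory stand-in instead of the real service.

```python
def count_voices(fetch_page, page_size: int = 100) -> int:
    """Count voices by paging until a short (or empty) page is returned.

    `fetch_page(page_index, page_size)` is assumed to return the list of
    voices on that page; this helper only implements the paging loop.
    """
    total, page_index = 0, 0
    while True:
        page = fetch_page(page_index, page_size)
        total += len(page)
        if len(page) < page_size:  # last page reached
            return total
        page_index += 1

# Demo with an in-memory stand-in for the service: 250 fake voice IDs.
fake_store = [f"voice-{i}" for i in range(250)]

def fake_fetch(page_index, page_size):
    start = page_index * page_size
    return fake_store[start:start + page_size]

print(count_voices(fake_fetch))  # 250
```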

Billing

  • Voice cloning/design: Creating, querying, updating, and deleting voices is free.

  • Speech synthesis using custom voices: Billed based on the number of text characters. For more information, see Real-time Speech Synthesis – CosyVoice/Sambert.

Copyright and Legality

You are responsible for the ownership and legal right to use any voice you provide. Read the Terms of Service.

Error Codes

If you encounter an error, see Error Messages for troubleshooting.

FAQ

Features

Q: How do I adjust the speed and volume of a custom voice?

Adjust them the same way you adjust a preset voice. Pass the corresponding parameters when calling the speech synthesis API. For example, use speech_rate (Python) or speechRate (Java) to adjust speed, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).

Q: How do I call the API using languages other than Java and Python (such as Go, C#, or Node.js)?

For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.

Troubleshooting

If you encounter code errors, troubleshoot using the information in Error Codes.

Q: What should I do if the synthesized audio from a cloned voice contains extra content?

If you find extra characters or noise in the synthesized audio from a cloned voice, follow these steps to troubleshoot:

  1. Check the source audio quality

    The quality of the cloned audio directly affects the synthesis result. Ensure the source audio meets these requirements:

    • No background noise or static

    • Clear sound quality (sample rate ≥ 16 kHz recommended)

    • Audio format: WAV is better than MP3 (avoid lossy compression)

    • Mono (stereo may cause interference)

    • No silent segments or long pauses

    • Moderate speech rate (a fast rate affects feature extraction)

  2. Check the input text

    Confirm the input text does not contain special symbols or markers:

    • Avoid special symbols such as **, "", and ''

    • Unless the text intentionally contains LaTeX formulas, preprocess it to filter out such symbols.

  3. Verify voice cloning parameters

    Ensure the language parameter (language_hints/languageHints) is set correctly when creating the voice.

  4. Try cloning again

    Use a higher-quality source audio file to clone the voice again and test the result.

  5. Compare with system voices

    Test the same text with a preset system voice to confirm if the issue is specific to the cloned voice.
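The text preprocessing in step 2 can be sketched with a small filter. This is illustrative only, not part of the API: it strips Markdown emphasis markers and quote characters and collapses leftover whitespace; skip it for text that intentionally contains LaTeX formulas.

```python
import re

def sanitize_tts_text(text: str) -> str:
    """Illustrative pre-processing pass (not part of the API): removes
    Markdown emphasis markers and straight/curly quote characters that
    can leak into synthesized audio, then collapses extra whitespace."""
    text = re.sub(r"\*{1,2}", "", text)                        # ** and * markers
    text = re.sub(r"[\"'\u201c\u201d\u2018\u2019]", "", text)  # straight and curly quotes
    return re.sub(r"\s{2,}", " ", text).strip()                # tidy leftover whitespace

print(sanitize_tts_text('He said **"hello"** to everyone.'))  # He said hello to everyone.
```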

Q: How do I troubleshoot if the audio generated from a cloned voice is silent?

  1. Check Voice Status

    Call the Query Specific Voice API to check if the voice status is OK.

  2. Check for model version consistency

    Ensure the target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example:

    • If you cloned the voice with target_model set to cosyvoice-v3-plus,

    • you must also specify cosyvoice-v3-plus as the model during synthesis.

  3. Verify source audio quality

    Check if the source audio used for cloning meets the voice cloning input audio format requirements:

    • Audio duration: 10–20 seconds

    • Clear sound quality

    • No background noise

  4. Check request parameters

    Confirm the voice parameter is set to the cloned voice's ID during speech synthesis.
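The model-consistency check in step 2 can be automated with a simple prefix test. This assumes the voice-ID naming convention seen in this document's request examples (IDs beginning with the target model, e.g. cosyvoice-v3-plus-myvoice-xxx); when in doubt, confirm via the target_model field returned by the Query Specific Voice API instead.

```python
def model_matches_voice(voice_id: str, synthesis_model: str) -> bool:
    """Heuristic check that a cloned voice was enrolled for the model now
    being used for synthesis. Assumes voice IDs start with the
    target_model used at enrollment (as in this document's examples);
    verify the convention for your account before relying on it."""
    return voice_id.startswith(synthesis_model + "-")

print(model_matches_voice("cosyvoice-v3-plus-myvoice-xxx", "cosyvoice-v3-plus"))   # True
print(model_matches_voice("cosyvoice-v3-plus-myvoice-xxx", "cosyvoice-v3-flash"))  # False
```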

Q: What should I do if the synthesized speech from a cloned voice is unstable or incomplete?

If the synthesized speech from a cloned voice has these issues:

  • Incomplete playback; only part of the text is read

  • Unstable synthesis quality; sometimes good, sometimes bad

  • Abnormal pauses or silent segments in the audio

Possible cause: The source audio quality does not meet requirements.

Solution: Check whether the source audio meets the following requirements; if it does not, rerecord the audio following the Recording Guide.

  • Check audio continuity: Ensure the speech in the source audio is continuous. Avoid long pauses or silent segments (over 2 seconds). Obvious blank segments can cause the model to treat silence or noise as part of the voice's features, affecting the result.

  • Check the speech activity ratio: Ensure that active speech makes up more than 60% of the total audio duration. Too much background noise or non-speech segments will interfere with voice feature extraction.

  • Verify audio quality details:

    • Audio duration: 10–20 seconds (15 seconds recommended)

    • Clear pronunciation and steady speech rate

    • No background noise, echo, or static

    • Concentrated speech energy with no long silent segments
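The speech-activity-ratio check above can be approximated locally before you upload a sample. The following is a rough energy-based sketch using only the Python standard library; the frame size and RMS threshold are my own illustrative choices, and a production check would use a proper voice-activity detector.

```python
import math
import struct
import wave

def speech_activity_ratio(path: str, frame_ms: int = 30, rms_threshold: int = 500) -> float:
    """Rough energy-based estimate of the active-speech ratio of a mono
    16-bit PCM WAV file. Flags each frame whose RMS energy exceeds a
    fixed threshold; a real check would use a proper VAD."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2, "expect mono 16-bit PCM"
        rate = w.getframerate()
        samples = struct.unpack(f"<{w.getnframes()}h", w.readframes(w.getnframes()))
    frame_len = max(1, rate * frame_ms // 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = sum(
        1 for f in frames
        if math.sqrt(sum(s * s for s in f) / len(f)) > rms_threshold
    )
    return active / len(frames) if frames else 0.0
```

A ratio below roughly 0.6 suggests the sample contains too much silence or noise for reliable feature extraction.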

Q: Why can't I find the VoiceEnrollmentService class?

Your SDK version is too old. Install the latest SDK.

Q: What should I do if the voice cloning result is poor, with noise or unclear audio?

This is usually because of low-quality input audio. Rerecord and upload the audio, strictly following the Recording Guide.

Q: Why is there a long silence at the beginning or an abnormal total duration when I synthesize very short text (like a single word) with a cloned voice?

The voice cloning model learns the pauses and rhythm from the sample audio. If the original recording has a long initial silence or pause, the synthesis result might retain a similar pattern. For single words or very short text, this silence ratio is amplified, making it seem like "the audio is long but mostly silent." Avoid long silences when recording sample audio. Use complete sentences or longer text for synthesis. If you must synthesize a single word, add some context before or after it, or use a homophone to avoid extreme cases.