
Alibaba Cloud Model Studio: CosyVoice voice cloning API

Last Updated: Feb 11, 2026

CosyVoice voice cloning creates a custom voice from a 10- to 20-second audio sample. This document covers voice cloning parameters and API details. For speech synthesis, see Real-time speech synthesis - CosyVoice.

User guide: For model introductions and selection recommendations, see Real-time speech synthesis - CosyVoice.

Important

This document covers CosyVoice voice cloning API. For Qwen models, see Voice cloning (Qwen).

Audio requirements

High-quality input audio is essential for excellent cloning results.

Supported formats: WAV (16-bit), MP3, M4A

Audio duration: Recommended 10 to 20 seconds; maximum 60 seconds.

File size: ≤ 10 MB

Sample rate: ≥ 16 kHz

Sound channel: Mono or stereo. For stereo audio, only the first channel is processed; make sure that channel contains a valid human voice.

Content: The audio must contain at least 5 seconds of continuous, clear reading without background sound, and any remaining pauses must be short (≤ 2 seconds). The entire segment must be free of background music, noise, and other voices. Use normal speaking audio; do not upload songs or singing, or the cloned voice may be inaccurate or unusable.

Language: Varies with the speech synthesis model that drives the voice (specified by the target_model/targetModel parameter):

  • cosyvoice-v2: Chinese (Mandarin), English

  • cosyvoice-v3-flash, cosyvoice-v3-plus: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian

Voice cloning only supports the languages listed above (Mandarin Chinese and listed dialects, English, French, German, Japanese, Korean, and Russian). Other languages such as Spanish and Italian are not supported.
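The duration, size, and sample-rate limits above can be checked locally before upload. The following is a minimal sketch for WAV input using Python's standard wave module; the function name and constants are illustrative, not part of the API, and content-level requirements (background noise, other voices) still need a separate check:

```python
import os
import wave

MAX_BYTES = 10 * 1024 * 1024   # file size <= 10 MB
MIN_RATE = 16_000              # sample rate >= 16 kHz
MIN_SECONDS, MAX_SECONDS = 10, 60

def check_wav_sample(path):
    """Return a list of requirement violations for a local WAV file (empty list = looks OK)."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file larger than 10 MB")
    with wave.open(path, "rb") as w:
        if w.getframerate() < MIN_RATE:
            problems.append("sample rate below 16 kHz")
        if w.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        seconds = w.getnframes() / w.getframerate()
        if not MIN_SECONDS <= seconds <= MAX_SECONDS:
            problems.append(f"duration {seconds:.1f}s outside the 10-60s window")
    return problems
```

Run this on the sample before uploading; an empty list means the file passes the mechanical checks.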

Getting started: From cloning to synthesis


1. Workflow

Voice cloning and speech synthesis are two closely related but separate steps following a "create first, then use" flow:

  1. Create a voice

    Call the Create a voice API and upload an audio segment. The system analyzes the audio and creates a unique cloned voice. Specify target_model / targetModel to declare the target speech synthesis model for the voice.

    If you already have a created voice (check by calling the Query the voice list API), skip this step.

  2. Use the voice for speech synthesis.

    After creating a voice using the Create a voice API, the system returns a voice_id/voiceId:

    • This voice_id/voiceId can be used directly as the voice parameter in the speech synthesis API or SDKs for subsequent text-to-speech.

    • Supports multiple call methods: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.

    • The speech synthesis model must match the target_model/targetModel specified when creating the voice. Otherwise, synthesis fails.

2. Model configuration and preparations

Select a suitable model and complete the preparations.

Model configuration

Important

In international deployment mode (Singapore region), cosyvoice-v3-plus and cosyvoice-v3-flash do not support voice cloning. Select other models.

When cloning a voice, specify these two models:

  • Voice cloning model: voice-enrollment

  • Speech synthesis model to drive the voice:

    For best results, use cosyvoice-v3-plus if resources and budget allow.

    • cosyvoice-v3-plus: The best sound quality and expressiveness when budget allows.

    • cosyvoice-v3-flash: Balances performance and cost for high overall value.

    • cosyvoice-v2: For compatibility with older versions or low-requirement scenarios.

Preparations

  1. Get an API key: For security, set the API key as an environment variable instead of hardcoding it.

  2. Install the SDK: Make sure you have installed the latest version of the DashScope SDK.

  3. Prepare an audio URL: Upload an audio file that meets the audio requirements to a publicly accessible location, such as Object Storage Service (OSS), and confirm that the URL is publicly accessible.
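The preparation steps can be sketched as shell commands. The key and URL below are placeholders, and the pip command assumes the Python SDK:

```shell
# 1. Set the API key as an environment variable (replace with your own key).
export DASHSCOPE_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# 2. Install or upgrade the DashScope SDK (Python shown here).
pip install -U dashscope

# 3. Confirm the audio URL is publicly accessible (expect HTTP 200).
curl -I "https://your-audio-file-url"
```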

3. End-to-end example: From cloning to synthesis

The following example shows how to use a custom voice generated by voice cloning in speech synthesis to achieve output highly similar to the original voice.

  • Key principle: When cloning a voice, the target_model (the speech synthesis model that drives the voice) must match the speech synthesis model specified in subsequent speech synthesis API calls. Otherwise, synthesis will fail.

  • Note: Replace AUDIO_URL in the example with your actual audio URL.

import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Prepare the environment
# Set the API key as an environment variable.
# export DASHSCOPE_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# 2. Define cloning parameters
TARGET_MODEL = "cosyvoice-v3-plus"
# Give the voice a recognizable prefix.
VOICE_PREFIX = "myvoice"  # Only digits and lowercase letters are allowed, up to 10 characters.
# Publicly accessible audio URL
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav"  # Example URL; replace with your own.

# 3. Create the voice (asynchronous task)
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise e
# 4. Poll for voice status
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10  # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
    except Exception as e:
        # Transient query errors: log and retry.
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
        continue
    status = voice_info.get("status")
    print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")
    if status == "OK":
        print("Voice is ready for synthesis.")
        break
    if status == "UNDEPLOYED":
        # Raised outside the try block so the error is not swallowed by the retry logic.
        raise RuntimeError(f"Voice processing failed with status: {status}. Check audio quality or contact support.")
    # For intermediate statuses such as "DEPLOYING", continue to wait.
    time.sleep(poll_interval)
else:
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations, you have successfully cloned and synthesized your own voice!"
    
    # The call() method returns binary audio data.
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")

    # 6. Save the audio file
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")

except Exception as e:
    print(f"Error during speech synthesis: {e}")

API reference

Use the same account for all voice cloning and synthesis operations, regardless of which API you call.

Create a voice

Uploads an audio file for cloning to create a custom voice.

Python SDK

API description

def create_voice(self, target_model: str, prefix: str, url: str, language_hints: List[str] = None) -> str:
    '''
    Creates a new custom voice.
    param: target_model The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. Recommended models are cosyvoice-v3-flash or cosyvoice-v3-plus.
    param: prefix A recognizable name for the voice (only digits, uppercase and lowercase letters, and underscores are allowed, up to 10 characters). We recommend using an identifier related to the role or scenario. This keyword appears in the cloned voice name. Generated voice name format: model_name-prefix-unique_identifier, such as cosyvoice-v3-plus-myvoice-xxxxxxxx.
    param: url The URL of the audio file for voice cloning. The URL must be publicly accessible.
    param: language_hints Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.
            It helps the model identify the language of the sample audio to extract features more accurately and improve cloning results.
            If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from audio content.
            Valid values: zh (default), en, fr, de, ja, ko, ru. This parameter is an array, but the current version processes only the first element. Pass only one value.
    return: voice_id The voice ID. It can be used directly as the voice parameter in the speech synthesis API.
    '''
Important
  • target_model: The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail.

  • language_hints: Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.

    Description: This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results. If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from audio content.

    Valid values:

    • zh: Chinese (default)

    • en: English

    • fr: French

    • de: German

    • ja: Japanese

    • ko: Korean

    • ru: Russian

    For Chinese dialects (such as Northeastern or Cantonese), set language_hints to zh. Control the dialect style in subsequent speech synthesis calls through text content or parameters such as instruct.

    Note: This parameter is an array, but the current version processes only the first element. Pass only one value.
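As a quick local sanity check, the prefix rule and the generated voice name format described above can be expressed in a few lines. This is a sketch: PREFIX_RE and split_voice_id are illustrative helpers, and the server-side validation may be stricter:

```python
import re

# Prefix rule from the description above: digits, upper/lowercase letters,
# and underscores, up to 10 characters.
PREFIX_RE = re.compile(r"^[A-Za-z0-9_]{1,10}$")

def split_voice_id(voice_id: str, prefix: str):
    """Split model_name-prefix-unique_identifier into its parts, given a known prefix."""
    model, sep, unique_id = voice_id.partition(f"-{prefix}-")
    if not sep:
        raise ValueError(f"voice_id does not embed prefix {prefix!r}")
    return model, prefix, unique_id
```

For example, split_voice_id("cosyvoice-v3-plus-myvoice-xxxxxxxx", "myvoice") separates the target model name from the unique identifier.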

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Avoid frequent calls. Each call creates a new voice. You cannot create more voices after you reach the quota limit.
voice_id = service.create_voice(
    target_model='cosyvoice-v3-plus',
    prefix='myvoice',
    url='https://your-audio-file-url',
    language_hints=['zh']
)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")

Java SDK

API description

/**
 * Creates a new custom voice.
 *
 * @param targetModel The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. Recommended models are cosyvoice-v3-flash or cosyvoice-v3-plus.
 * @param prefix A recognizable name for the voice (only digits, uppercase and lowercase letters, and underscores are allowed, up to 10 characters). We recommend using an identifier related to the role or scenario. This keyword appears in the cloned voice name. Generated voice name format: model_name-prefix-unique_identifier, such as cosyvoice-v3-plus-myvoice-xxxxxxxx.
 * @param url The URL of the audio file for voice cloning. The URL must be publicly accessible.
 * @param customParam Custom parameters. You can specify languageHints here.
 *                  languageHints specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.
 *                  This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results.
 *                  If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from audio content.
 *                  Valid values: zh (default), en, fr, de, ja, ko, ru. This parameter is an array, but the current version processes only the first element. Pass only one value.
 * @return Voice The newly created voice. You can get the voice ID using the getVoiceId method of the Voice object. The voice ID can be used directly as the voice parameter in the speech synthesis API.
 * @throws NoApiKeyException if the API key is empty.
 * @throws InputRequiredException if a required parameter is empty.
 */
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
Important
  • targetModel: The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail.

  • languageHints: Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.

    Description: This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results. If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from audio content.

    Valid values:

    • zh: Chinese (default)

    • en: English

    • fr: French

    • de: German

    • ja: Japanese

    • ko: Korean

    • ru: Russian

    For Chinese dialects (such as Northeastern or Cantonese), set languageHints to zh. Control the dialect style in subsequent speech synthesis calls through text content or parameters such as instruct.

    Note: This parameter is an array, but the current version processes only the first element. Pass only one value.

Request example

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Collections;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        String targetModel = "cosyvoice-v3-plus";
        String prefix = "myvoice";
        String fileUrl = "https://your-audio-file-url";
        String cloneModelName = "voice-enrollment";

        try {
            VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
            Voice myVoice = service.createVoice(
                    targetModel,
                    prefix,
                    fileUrl,
                    VoiceEnrollmentParam.builder()
                    .model(cloneModelName)
                    .languageHints(Collections.singletonList("zh")).build());

            logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
            logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
        } catch (Exception e) {
            logger.error("Failed to create voice", e);
        }
    }
}

RESTful API

Basic information

URL

The Chinese mainland:

https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

International:

https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

Request method

POST

Request header

Authorization: Bearer {api-key} // Replace with your API key
Content-Type: application/json

Message body

The message body contains all request parameters. Optional fields can be omitted as needed:

Important
  • model: The voice cloning model. Set to voice-enrollment.

  • target_model: The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail.

  • language_hints: Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.

    Description: This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results. If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from audio content.

    Valid values:

    • zh: Chinese (default)

    • en: English

    • fr: French

    • de: German

    • ja: Japanese

    • ko: Korean

    • ru: Russian

    For Chinese dialects (such as Northeastern or Cantonese), set language_hints to zh. Control the dialect style in subsequent speech synthesis calls through text content or parameters such as instruct.

    Note: This parameter is an array, but the current version processes only the first element. Pass only one value.

{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3-plus",
        "prefix": "myvoice",
        "url": "https://yourAudioFileUrl",
        "language_hints": ["zh"]
    }
}

Request parameters

Request example:

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3-plus",
        "prefix": "myvoice",
        "url": "https://yourAudioFileUrl",
        "language_hints": ["zh"]
    }
}'

  • model (string, required): The voice cloning model. Set to voice-enrollment.

  • action (string, required): The operation type. Set to create_voice.

  • target_model (string, required): A speech synthesis model that controls voice timbre. We recommend cosyvoice-v3-flash or cosyvoice-v3-plus. This must match the model used in subsequent speech synthesis API calls; otherwise, synthesis fails.

  • prefix (string, required): A recognizable name for the voice (only digits, uppercase and lowercase letters, and underscores are allowed, up to 10 characters). We recommend an identifier related to the role or scenario. This keyword appears in the cloned voice name. Generated voice name format: model_name-prefix-unique_identifier, such as cosyvoice-v3-plus-myvoice-xxxxxxxx.

  • url (string, required): The URL of the audio file for voice cloning. Must be publicly accessible.

  • language_hints (array[string], optional, default ["zh"]): Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models. It helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results. If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from the audio content.

    Valid values: zh (Chinese, default), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian).

    For Chinese dialects (such as Northeastern or Cantonese), set language_hints to zh. Control the dialect style in subsequent speech synthesis calls through text content or parameters such as instruct.

    Note: This parameter is an array, but the current version processes only the first element. Pass only one value.
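The same request can be issued from Python without the SDK. Below is a sketch using only the standard library; build_create_voice_payload and create_voice_rest are illustrative helper names, and the endpoint is the Chinese mainland URL from above:

```python
import json
import os
import urllib.request

ENDPOINT = "https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization"

def build_create_voice_payload(target_model, prefix, url, language_hints=("zh",)):
    """Assemble the create_voice message body documented above."""
    return {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": target_model,
            "prefix": prefix,
            "url": url,
            # Array parameter; only the first element is processed.
            "language_hints": list(language_hints),
        },
    }

def create_voice_rest(payload):
    """POST the payload and return the new voice_id from the response."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["output"]["voice_id"]
```

Calling create_voice_rest(build_create_voice_payload(...)) mirrors the curl example; replace the audio URL with your own before running.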

Response parameters

Response example:

{
    "output": {
        "voice_id": "yourVoiceId"
    },
    "usage": {
        "count": 1
    },
    "request_id": "yourRequestId"
}

  • voice_id (string): The voice ID. It can be used directly as the voice parameter in the speech synthesis API.

Query the voice list

Performs a paged query of created voices.

Python SDK

API description

def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
    '''
    Queries all created voices.
    param: prefix The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters.
    param: page_index The page index for the query.
    param: page_size The page size for the query.
    return: List[dict] A list of voices, including the ID, creation time, modification time, and status of each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
    Three voice statuses:
        DEPLOYING: Under review
        OK: Approved and ready to use
        UNDEPLOYED: Rejected and not usable
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()

# Filter by prefix, or set to None to query all.
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")

Response example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]

Response parameters

  • voice_id (string): The voice ID.

  • gmt_create (string): The time when the voice was created.

  • gmt_modified (string): The time when the voice was modified.

  • status (string): The voice status: DEPLOYING (under review), OK (approved and ready to use), UNDEPLOYED (rejected and not usable).
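Because results are paged (default page size 10), collecting every voice takes a loop over page_index. A small, SDK-agnostic sketch follows; iter_all_voices is an illustrative helper, and the page-fetch callable stands in for service.list_voices:

```python
def iter_all_voices(fetch_page, page_size=10):
    """Yield every voice across pages.

    fetch_page(page_index, page_size) must return one page as a list of dicts,
    e.g. lambda i, n: service.list_voices(prefix='myvoice', page_index=i, page_size=n).
    Iteration stops at the first short or empty page.
    """
    page_index = 0
    while True:
        page = fetch_page(page_index, page_size)
        yield from page
        if len(page) < page_size:
            break
        page_index += 1
```

Filtering the result, for example keeping only entries whose status is "OK", yields the voices that are ready for synthesis.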

Java SDK

API description

// Three voice statuses:
//        DEPLOYING: Under review
//        OK: Approved and ready to use
//        UNDEPLOYED: Rejected and not usable
/**
 * Queries all created voices. The default page index is 0, and the default page size is 10.
 *
 * @param prefix The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters. Can be null.
 * @return Voice[] An array of Voice objects. The Voice object encapsulates the ID, creation time, modification time, and status of the voice.
 * @throws NoApiKeyException if the API key is empty.
 * @throws InputRequiredException if a required parameter is empty.
 */
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException 

/**
 * Queries all created voices.
 *
 * @param prefix The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters.
 * @param pageIndex The page index for the query.
 * @param pageSize The page size for the query.
 * @return Voice[] An array of Voice objects. The Voice object encapsulates the ID, creation time, modification time, and status of the voice.
 * @throws NoApiKeyException if the API key is empty.
 * @throws InputRequiredException if a required parameter is empty.
 */
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException

Request example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you have not configured an environment variable, replace this with your API key.
    private static String prefix = "myvoice"; // Replace this with the actual value.
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Query voices
        Voice[] voices = service.listVoice(prefix, 0, 10);
        logger.info("List successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voices Details: {}", new Gson().toJson(voices));
    }
}

Response example

[
    {
        "gmt_create": "2024-09-13 11:29:41",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 11:29:41",
        "status": "OK"
    },
    {
        "gmt_create": "2024-09-13 13:22:38",
        "voice_id": "yourVoiceId",
        "gmt_modified": "2024-09-13 13:22:38",
        "status": "OK"
    }
]

Response parameters

  • voice_id (string): The voice ID.

  • gmt_create (string): The time when the voice was created.

  • gmt_modified (string): The time when the voice was modified.

  • status (string): The voice status: DEPLOYING (under review), OK (approved and ready to use), UNDEPLOYED (rejected and not usable).

RESTful API

Basic information

URL

The Chinese mainland:

https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

International:

https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

Request method

POST

Request header

Authorization: Bearer {api-key} // Replace with your API key
Content-Type: application/json

Message body

The message body contains all request parameters. Optional fields can be omitted as needed:

Important

The model is the voice cloning model. Set it to voice-enrollment.

{
    "model": "voice-enrollment",
    "input": {
        "action": "list_voice",
        "prefix": "myvoice",
        "page_index": 0,
        "page_size": 10
    }
}

Request parameters

Request example:

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "list_voice",
        "prefix": "myvoice",
        "page_index": 0,
        "page_size": 10
    }
}'

  • model (string, required): The voice cloning model. Set to voice-enrollment.

  • action (string, required): The operation type. Set to list_voice.

  • prefix (string, optional, default null): The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters.

  • page_index (integer, optional, default 0): The page index, starting from 0.

  • page_size (integer, optional, default 10): The number of entries on each page.

Response parameters

Response example:

{
    "output": {
        "voice_list": [
            {
                "gmt_create": "2024-12-11 13:38:02",
                "voice_id": "yourVoiceId",
                "gmt_modified": "2024-12-11 13:38:02",
                "status": "OK"
            }
        ]
    },
    "usage": {
        "count": 1
    },
    "request_id": "yourRequestId"
}

  • voice_id (string): The voice ID.

  • gmt_create (string): The time when the voice was created.

  • gmt_modified (string): The time when the voice was modified.

  • status (string): The voice status: DEPLOYING (under review), OK (approved and ready to use), UNDEPLOYED (rejected and not usable).

Query a specific voice

Gets details of a specific voice.

Python SDK

API description

def query_voice(self, voice_id: str) -> dict:
    '''
    Queries details of a specific voice.
    param: voice_id The ID of the voice to query.
    return: dict Voice details, including status, creation time, audio link, and more.
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'

voice_details = service.query_voice(voice_id=voice_id)

print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")

Response example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}

Response parameters

  • resource_link (string): The URL of the audio that was cloned.

  • target_model (string): The speech synthesis model that controls voice timbre. We recommend cosyvoice-v3-flash or cosyvoice-v3-plus. This must match the model used in subsequent speech synthesis API calls; otherwise, synthesis fails.

  • gmt_create (string): The time when the voice was created.

  • gmt_modified (string): The time when the voice was modified.

  • status (string): The voice status: DEPLOYING (under review), OK (approved and ready to use), UNDEPLOYED (rejected and not usable).

Java SDK

API description

/**
 * Queries details of a specific voice.
 *
 * @param voiceId The ID of the voice to query.
 * @return Voice Voice details, including status, creation time, audio link, and more.
 * @throws NoApiKeyException if the API key is empty.
 * @throws InputRequiredException if a required parameter is empty.
 */
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException

Request example

You need to import the third-party library com.google.gson.Gson.

import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you have not configured an environment variable, replace this with your API key.
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace this with the actual value.
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        Voice voice = service.queryVoice(voiceId);
        
        logger.info("Query successful. Request ID: {}", service.getLastRequestId());
        logger.info("Voice Details: {}", new Gson().toJson(voice));
    }
}

Response example

{
    "gmt_create": "2024-09-13 11:29:41",
    "resource_link": "https://yourAudioFileUrl",
    "target_model": "cosyvoice-v3-plus",
    "gmt_modified": "2024-09-13 11:29:41",
    "status": "OK"
}

Response parameters

| Parameter | Type | Description |
| --- | --- | --- |
| resource_link | string | The URL of the source audio used for cloning. |
| target_model | string | The speech synthesis model that drives the voice; cosyvoice-v3-flash or cosyvoice-v3-plus is recommended. This must match the model used in subsequent speech synthesis API calls, or synthesis fails. |
| gmt_create | string | The time when the voice was created. |
| gmt_modified | string | The time when the voice was last modified. |
| status | string | The voice status. DEPLOYING: under review. OK: approved and ready to use. UNDEPLOYED: rejected and not usable. |

RESTful API

Basic information

URL

The Chinese mainland:

https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

International:

https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

Request method

POST

Request header

Authorization: Bearer {api-key} // Replace with your API key
Content-Type: application/json

Message body

The message body containing all request parameters is as follows. Optional fields can be omitted as needed:

Important

The model parameter specifies the voice cloning model. Set it to voice-enrollment.

{
    "model": "voice-enrollment",
    "input": {
        "action": "query_voice",
        "voice_id": "yourVoiceId"
    }
}

Request parameters

Request example

Important

The model parameter specifies the voice cloning model. Set it to voice-enrollment.

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "query_voice",
        "voice_id": "yourVoiceId"
    }
}'

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | Yes | The voice cloning model. Set to voice-enrollment. |
| action | string | - | Yes | The operation type. Set to query_voice. |
| voice_id | string | - | Yes | The ID of the voice to query. |

Response parameters

Response example

{
    "output": {
        "gmt_create": "2024-12-11 13:38:02",
        "resource_link": "https://yourAudioFileUrl",
        "target_model": "cosyvoice-v3-plus",
        "gmt_modified": "2024-12-11 13:38:02",
        "status": "OK"
    },
    "usage": {
        "count": 1
    },
    "request_id": "2450f969-d9ea-9483-bafc-************"
}

| Parameter | Type | Description |
| --- | --- | --- |
| resource_link | string | The URL of the source audio used for cloning. |
| target_model | string | The speech synthesis model that drives the voice; cosyvoice-v3-flash or cosyvoice-v3-plus is recommended. This must match the model used in subsequent speech synthesis API calls, or synthesis fails. |
| gmt_create | string | The time when the voice was created. |
| gmt_modified | string | The time when the voice was last modified. |
| status | string | The voice status. DEPLOYING: under review. OK: approved and ready to use. UNDEPLOYED: rejected and not usable. |
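
For languages without an official SDK, the RESTful endpoint above can be called with plain HTTP. The following is a minimal Python sketch using only the standard library; the endpoint, header, and message body follow the documentation above, and error handling is simplified:

```python
import json
import os
import urllib.request

ENDPOINT = "https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization"

def build_query_payload(voice_id: str) -> dict:
    """Message body for the query_voice action, as documented above."""
    return {
        "model": "voice-enrollment",
        "input": {"action": "query_voice", "voice_id": voice_id},
    }

def query_voice(voice_id: str) -> dict:
    """POST the query and return the "output" object of the response."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_query_payload(voice_id)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["output"]
```

The same payload shape works for the update_voice and delete_voice actions by changing the "action" field and input parameters accordingly.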

Update a voice

Updates an existing voice with new audio.

Python SDK

API description

def update_voice(self, voice_id: str, url: str) -> None:
    '''
    Updates a voice.
    param: voice_id The ID of the voice.
    param: url The URL of the audio file for voice cloning.
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.update_voice(
    voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
    url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")

Java SDK

API description

/**
 * Updates a voice.
 *
 * @param voiceId The voice to update.
 * @param url The URL of the audio file for voice cloning.
 * @throws NoApiKeyException if the API key is empty.
 * @throws InputRequiredException if a required parameter is empty.
 */
public void updateVoice(String voiceId, String url)
    throws NoApiKeyException, InputRequiredException

Request example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you have not configured an environment variable, replace this with your API key.
    private static String fileUrl = "https://your-audio-file-url";  // Replace this with the actual value.
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace this with the actual value.
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Update the voice
        service.updateVoice(voiceId, fileUrl);
        logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
    }
}

RESTful API

Basic information

URL

The Chinese mainland:

https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

International:

https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

Request method

POST

Request header

Authorization: Bearer {api-key} // Replace with your API key
Content-Type: application/json

Message body

The message body containing all request parameters is as follows. Optional fields can be omitted as needed:

Important

The model parameter specifies the voice cloning model. Set it to voice-enrollment.

{
    "model": "voice-enrollment",
    "input": {
        "action": "update_voice",
        "voice_id": "yourVoiceId",
        "url": "https://yourAudioFileUrl"
    }
}

Request parameters

Request example

Important

The model parameter specifies the voice cloning model. Set it to voice-enrollment.

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "update_voice",
        "voice_id": "yourVoiceId",
        "url": "https://yourAudioFileUrl"
    }
}'

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | Yes | The voice cloning model. Set to voice-enrollment. |
| action | string | - | Yes | The operation type. Set to update_voice. |
| voice_id | string | - | Yes | The ID of the voice to update. |
| url | string | - | Yes | The URL of the audio file used to update the voice. Must be publicly accessible. For how to record audio, see Recording Guide. |

Response example

{
    "output": {},
    "usage": {
        "count": 1
    },
    "request_id": "yourRequestId"
}

Delete a voice

Deletes a voice that is no longer needed, releasing its quota. This operation is irreversible.

Python SDK

API description

def delete_voice(self, voice_id: str) -> None:
    '''
    Deletes a voice.
    param: voice_id The voice to delete.
    '''

Request example

from dashscope.audio.tts_v2 import VoiceEnrollmentService

service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")

Java SDK

API description

/**
 * Deletes a voice.
 *
 * @param voiceId The voice to delete.
 * @throws NoApiKeyException if the API key is empty.
 * @throws InputRequiredException if a required parameter is empty.
 */
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException 

Request example

import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    public static String apiKey = System.getenv("DASHSCOPE_API_KEY");  // If you have not configured an environment variable, replace this with your API key.
    private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace this with the actual value.
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    
    public static void main(String[] args)
            throws NoApiKeyException, InputRequiredException {
        VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
        // Delete the voice
        service.deleteVoice(voiceId);
        logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
    }
}

RESTful API

Basic information

URL

The Chinese mainland:

https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization

International:

https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

Request method

POST

Request header

Authorization: Bearer {api-key} // Replace with your API key
Content-Type: application/json

Message body

The message body containing all request parameters is as follows. Optional fields can be omitted as needed:

Important

The model parameter specifies the voice cloning model. Set it to voice-enrollment.

{
    "model": "voice-enrollment",
    "input": {
        "action": "delete_voice",
        "voice_id": "yourVoiceId"
    }
}

Request parameters

Request example

Important

The model parameter specifies the voice cloning model. Set it to voice-enrollment.

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "delete_voice",
        "voice_id": "yourVoiceId"
    }
}'

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | Yes | The voice cloning model. Set to voice-enrollment. |
| action | string | - | Yes | The operation type. Set to delete_voice. |
| voice_id | string | - | Yes | The ID of the voice to delete. |

Response example

{
    "output": {},
    "usage": {
        "count": 1
    },
    "request_id": "yourRequestId"
}

Voice quota and cleanup rules

  • Total limit: 1,000 voices per account

    The API does not return a voice count directly. To count your voices, page through the results of the Query the voice list API.
  • Automatic cleanup: If a voice is not used in any speech synthesis request for one year, the system deletes it automatically.
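
Counting voices by paging can be sketched as follows. The helper below is generic: it accepts any page-fetching callable, so it can wrap the Python SDK's list_voices (check your SDK version for the exact signature) or the RESTful list action:

```python
from typing import Callable, List

def count_items(fetch_page: Callable[[int, int], List], page_size: int = 100) -> int:
    """Count items by requesting pages until a short or empty page arrives.

    fetch_page(page_index, page_size) must return one page as a list.
    """
    total, page_index = 0, 0
    while True:
        page = fetch_page(page_index, page_size) or []
        total += len(page)
        if len(page) < page_size:
            return total
        page_index += 1

# Hypothetical usage with the DashScope Python SDK (the list_voices
# signature is assumed from the "Query the voice list" section):
# from dashscope.audio.tts_v2 import VoiceEnrollmentService
# service = VoiceEnrollmentService()
# n = count_items(lambda i, s: service.list_voices(page_index=i, page_size=s))
```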

Billing

  • Voice cloning: Creating, querying, updating, and deleting voices are free of charge.

  • Using cloned voices for speech synthesis: Billed on a pay-as-you-go basis (by the number of text characters). For more information, see Real-time speech synthesis - CosyVoice.

Copyright and legality

You are responsible for ensuring that you own, or have the legal right to use, the voices you provide. Read the Terms of Service.

Error codes

If you encounter an error, see Error messages for troubleshooting.

FAQ

Features

Q: How do I adjust the speech rate and volume of a custom voice?

The process is the same as for preset voices. When calling the speech synthesis API, pass the corresponding parameters, such as speech_rate (Python) or speechRate (Java) to adjust speech rate, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).

Q: How can I make calls in other languages such as Go, C#, and Node.js, besides Java and Python?

For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.

Troubleshooting

If you encounter an error code, troubleshoot based on the information in Error codes.

Q: What should I do if the synthesized audio from a cloned voice contains extra content?

If you find that output audio synthesized from a cloned voice contains extra characters or noise beyond the input text, follow these steps to troubleshoot:

  1. Check source audio quality

    The quality of the cloned audio directly affects synthesis results. Ensure the source audio meets these requirements:

    • No background noise or static

    • Clear sound quality (recommended sample rate ≥ 16 kHz)

    • Audio format: WAV is preferred over MP3 (avoids lossy-compression artifacts)

    • Mono (stereo may introduce interference)

    • No silent segments or long pauses

    • Moderate speech rate (a fast speech rate affects feature extraction)

  2. Check the input text

    Confirm the input text does not contain special symbols or markers:

    • Avoid special symbols such as **, "", or ''.

    • Unless such symbols are intentional (for example, wrapping LaTeX formulas), pre-process the text to filter them out.

  3. Verify voice cloning parameters

    When creating a voice, ensure language parameters (language_hints/languageHints) are set correctly.

  4. Try cloning again

    Use higher-quality source audio to clone the voice again and test the synthesis result.

  5. Compare with system voices

    Test the same text with a preset system voice to confirm if the issue is specific to the cloned voice.
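
Step 2 above (filtering special symbols from the input text) can be sketched as follows. The symbol set is illustrative, not part of the API; extend the pattern to whatever markers appear in your content:

```python
import re

# Hypothetical pre-processing: strip Markdown bold markers and stray
# straight/curly quotes before sending text to speech synthesis.
_SPECIAL = re.compile(r"\*\*|[\u201c\u201d\u2018\u2019\"']")

def sanitize_tts_text(text: str) -> str:
    """Remove special symbols and collapse the whitespace they leave."""
    cleaned = _SPECIAL.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```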

Q: How do I troubleshoot when the audio generated from a cloned voice has no sound?

  1. Confirm the voice status

    Call the Query a specific voice API to check if the voice status is OK.

  2. Check for model version consistency

    Ensure the target_model parameter used for voice cloning matches the model parameter used for speech synthesis. For example, if you used cosyvoice-v3-plus for cloning, you must also use cosyvoice-v3-plus for synthesis.

  3. Verify source audio quality

    Check if the source audio used for voice cloning meets the audio requirements:

    • Audio duration: 10 to 20 seconds

    • Clear sound quality

    • No background noise

  4. Check request parameters

    Confirm that the voice request parameter for speech synthesis is set to the ID of the cloned voice.
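
Step 1 above (confirming the voice status) can be automated with a small polling helper. This wrapper is illustrative, not part of the SDK; pass it any callable that returns the current status string, for example one that calls the query API:

```python
import time

def wait_until_ready(get_status, timeout_s: float = 300.0,
                     interval_s: float = 5.0) -> bool:
    """Poll get_status() until the voice is usable.

    Returns True once the status is "OK", and False if the voice is
    rejected ("UNDEPLOYED") or the timeout elapses.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "OK":
            return True
        if status == "UNDEPLOYED":  # rejected; retrying will not help
            return False
        time.sleep(interval_s)
    return False
```

With the Python SDK from the query section, a possible call is wait_until_ready(lambda: service.query_voice(voice_id=voice_id)["status"]), assuming query_voice returns the fields shown in the response example.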

Q: What should I do if the synthesis result is unstable or the speech is incomplete after voice cloning?

If synthesized speech from a cloned voice has these issues:

  • Incomplete speech playback, where only part of the text is read.

  • Unstable synthesis results, varying between good and bad.

  • Abnormal pauses or silent segments in the speech.

Possible cause: Source audio quality does not meet requirements.

Solution: Check if the source audio meets the following requirements. We recommend re-recording according to the Recording Guide.

  • Check audio continuity: Ensure speech content in the source audio is continuous. Avoid long pauses or silent segments (over 2 s). If the audio contains significant blank segments, the model may treat silence or noise as part of voice features, affecting generation results.

  • Check speech activity ratio: Ensure valid speech accounts for more than 60% of total audio duration. Excessive background noise or non-speech segments can interfere with voice feature extraction.

  • Verify audio quality details:

    • Audio duration: 10 to 20 seconds (15 seconds is recommended)

    • Clear pronunciation and steady speech rate

    • No background noise, echo, or static

    • Concentrated speech energy with no long silent segments

Q: Why can't I find the VoiceEnrollmentService class?

The SDK version is too old. Install the latest version of DashScope SDK.

Q: What should I do if the voice cloning result is poor, with noise or unclear audio?

This is usually caused by low-quality input audio. Re-record the audio strictly following the Recording Guide, then upload it again.

Q: Why is there a long silence at the beginning or abnormal audio duration when using a cloned voice to synthesize very short text (such as a single word)?

The voice cloning model learns pauses and rhythm from the sample audio. If the original recording contains long initial silences or pauses, the synthesized result may retain a similar pattern. For a single word or very short text, this silence ratio is amplified, making the audio seem long but mostly silent.

To avoid this, do not leave long silences when recording sample audio, and synthesize complete sentences or longer text where possible. If you must synthesize a single word, add context before or after it, or use a homophone to avoid extreme cases.