CosyVoice voice cloning creates a custom voice from a 10- to 20-second audio sample. This document covers voice cloning parameters and API details. For speech synthesis, see Real-time speech synthesis - CosyVoice.
User guide: For model introductions and selection recommendations, see Real-time speech synthesis - CosyVoice.
This document covers CosyVoice voice cloning API. For Qwen models, see Voice cloning (Qwen).
Audio requirements
High-quality input audio is essential for excellent cloning results.
Item | Requirement |
Supported formats | WAV (16-bit), MP3, M4A |
Audio duration | Recommended: 10 to 20 seconds. Maximum: 60 seconds. |
File size | ≤ 10 MB |
Sample rate | ≥ 16 kHz |
Sound channel | Mono or stereo. For stereo audio, only the first channel is processed. Ensure it contains a valid human voice. |
Content | Audio must contain at least 5 seconds of continuous, clear reading without background sound. The rest can only have short pauses (≤ 2 seconds). The entire segment should be free of background music, noise, or other voices. Use normal speaking audio. Do not upload songs or singing to ensure accurate, usable cloning. |
Language | Varies depending on the speech synthesis model that drives the voice (specified by the target_model/targetModel parameter). Voice cloning supports only Mandarin Chinese and its listed dialects, English, French, German, Japanese, Korean, and Russian. Other languages, such as Spanish and Italian, are not supported. |
Getting started: From cloning to synthesis
1. Workflow
Voice cloning and speech synthesis are two closely related but separate steps following a "create first, then use" flow:
Create a voice
Call the Create a voice API and upload an audio segment. The system analyzes the audio and creates a unique cloned voice. Specify target_model/targetModel to declare the target speech synthesis model for the voice.
If you already have a created voice (check by calling the Query the voice list API), skip this step.
Use the voice for speech synthesis
After you create a voice using the Create a voice API, the system returns a voice_id/voiceID.
This voice_id/voiceID can be used directly as the voice parameter in the speech synthesis API or SDKs for subsequent text-to-speech.
Multiple call methods are supported: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.
The speech synthesis model must match the target_model/targetModel specified when creating the voice. Otherwise, synthesis fails.
2. Model configuration and preparations
Select a suitable model and complete the preparations.
Model configuration
In international deployment mode (Singapore region), cosyvoice-v3-plus and cosyvoice-v3-flash do not support voice cloning. Select other models.
When cloning a voice, specify these two models:
Voice cloning model: voice-enrollment
Speech synthesis model to drive the voice:
For best results, use cosyvoice-v3-plus if resources and budget allow.
Version | Scenarios |
cosyvoice-v3-plus | For the best sound quality and expressiveness with a sufficient budget |
cosyvoice-v3-flash | Balances performance and cost for high overall value |
cosyvoice-v2 | For compatibility with older versions or low-requirement scenarios |
Preparations
Get an API key : For security, set the API key as an environment variable rather than hard-coding it.
Install the SDK : Ensure you have installed the latest version of the DashScope SDK.
Prepare an audio URL : Upload an audio file meeting the audio requirements to a publicly accessible location, such as Object Storage Service (OSS), and ensure the URL is publicly accessible.
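Before submitting the audio URL, you can verify that it is reachable without credentials. This is a best-effort sketch using only the standard library; the helper name is our own, and a 2xx or 3xx response is treated as accessible (some servers reject HEAD requests, in which case fall back to a manual check):

```python
import urllib.request
import urllib.error

def is_publicly_accessible(url: str, timeout: float = 10.0) -> bool:
    """Best-effort check that a URL responds without authentication."""
    try:
        # HEAD avoids downloading the audio body; some servers may reject it.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, ValueError):
        # Covers connection failures, HTTP errors (401/403/404), and bad URLs.
        return False
```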
3. End-to-end example: From cloning to synthesis
The following example shows how to use a custom voice generated by voice cloning in speech synthesis to achieve output highly similar to the original voice.
Key principle: When cloning a voice, the target_model (the speech synthesis model that drives the voice) must match the model specified in subsequent speech synthesis API calls. Otherwise, synthesis fails.
Note: Replace AUDIO_URL in the example with your actual audio URL.
import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Prepare the environment
# Set the API key as an environment variable:
# export DASHSCOPE_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# 2. Define cloning parameters
TARGET_MODEL = "cosyvoice-v3-plus"
# Give the voice a meaningful prefix.
VOICE_PREFIX = "myvoice"  # Only digits and lowercase letters are allowed, fewer than 10 characters.
# Publicly accessible audio URL
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav"  # Example URL, replace it with your own.

# 3. Create the voice (asynchronous task)
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise

# 4. Poll for voice status
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10  # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
        status = voice_info.get("status")
        print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")
    except Exception as e:
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
        continue
    if status == "OK":
        print("Voice is ready for synthesis.")
        break
    if status == "UNDEPLOYED":
        # Terminal failure: raise outside the try block so the error is not
        # swallowed by the polling exception handler.
        raise RuntimeError(f"Voice processing failed with status: {status}. "
                           "Check the audio quality or contact support.")
    # For intermediate statuses such as "DEPLOYING", continue to wait.
    time.sleep(poll_interval)
else:
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations, you have successfully cloned and synthesized your own voice!"
    # The call() method returns binary audio data.
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")
    # 6. Save the audio file
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")
except Exception as e:
    print(f"Error during speech synthesis: {e}")
API reference
When using different APIs, make sure to use the same account for all operations.
Create a voice
Uploads an audio file for cloning to create a custom voice.
Python SDK
API description
def create_voice(self, target_model: str, prefix: str, url: str, language_hints: List[str] = None) -> str:
'''
Creates a new custom voice.
param: target_model The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. Recommended models are cosyvoice-v3-flash or cosyvoice-v3-plus.
param: prefix A recognizable name for the voice (only digits, uppercase and lowercase letters, and underscores are allowed, up to 10 characters). We recommend using an identifier related to the role or scenario. This keyword appears in the cloned voice name. Generated voice name format: model_name-prefix-unique_identifier, such as cosyvoice-v3-plus-myvoice-xxxxxxxx.
param: url The URL of the audio file for voice cloning. The URL must be publicly accessible.
param: language_hints Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.
It helps the model identify the language of the sample audio to extract features more accurately and improve cloning results.
If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from audio content.
Valid values: zh (default), en, fr, de, ja, ko, ru. This parameter is an array, but the current version processes only the first element. Pass only one value.
return: voice_id The voice ID. It can be used directly as the voice parameter in the speech synthesis API.
'''
target_model: The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis fails.
language_hints: Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.
Description: This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results. If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from the audio content.
Valid values:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
For Chinese dialects (such as Northeastern or Cantonese), set language_hints to zh. Control the dialect style in subsequent speech synthesis calls through text content or parameters such as instruct.
Note: This parameter is an array, but the current version processes only the first element. Pass only one value.
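This document states the prefix rule in two slightly different ways (digits, letters, and underscores up to 10 characters in the API description above, versus digits and lowercase letters elsewhere). A conservative client-side check that satisfies both variants can be sketched as follows; `is_valid_prefix` is a hypothetical helper, not part of the SDK:

```python
import re

# Conservative rule satisfying both prefix descriptions in this document:
# digits and lowercase letters only, 1 to 10 characters.
PREFIX_RE = re.compile(r"[0-9a-z]{1,10}")

def is_valid_prefix(prefix: str) -> bool:
    """Validate a voice prefix before calling create_voice."""
    return PREFIX_RE.fullmatch(prefix) is not None
```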
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Avoid frequent calls. Each call creates a new voice. You cannot create more voices after you reach the quota limit.
voice_id = service.create_voice(
target_model='cosyvoice-v3-plus',
prefix='myvoice',
url='https://your-audio-file-url',
language_hints=['zh']
)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")
Java SDK
API description
/**
* Creates a new custom voice.
*
* @param targetModel The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. Recommended models are cosyvoice-v3-flash or cosyvoice-v3-plus.
* @param prefix A recognizable name for the voice (only digits, uppercase and lowercase letters, and underscores are allowed, up to 10 characters). We recommend using an identifier related to the role or scenario. This keyword appears in the cloned voice name. Generated voice name format: model_name-prefix-unique_identifier, such as cosyvoice-v3-plus-myvoice-xxxxxxxx.
* @param url The URL of the audio file for voice cloning. The URL must be publicly accessible.
* @param customParam Custom parameters. You can specify languageHints here.
* languageHints specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.
* This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results.
* If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from audio content.
* Valid values: zh (default), en, fr, de, ja, ko, ru. This parameter is an array, but the current version processes only the first element. Pass only one value.
* @return Voice The newly created voice. You can get the voice ID using the getVoiceId method of the Voice object. The voice ID can be used directly as the voice parameter in the speech synthesis API.
* @throws NoApiKeyException if the API key is empty.
* @throws InputRequiredException if a required parameter is empty.
*/
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
targetModel: The speech synthesis model for the voice. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis fails.
languageHints: Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models.
Description: This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results. If the language hint does not match the actual audio language (for example, setting en for Chinese audio), the system ignores the hint and automatically detects the language from the audio content.
Valid values:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
For Chinese dialects (such as Northeastern or Cantonese), set languageHints to zh. Control the dialect style in subsequent speech synthesis calls through text content or parameters such as instruct.
Note: This parameter is an array, but the current version processes only the first element. Pass only one value.
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Collections;
public class Main {
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args) {
String apiKey = System.getenv("DASHSCOPE_API_KEY");
String targetModel = "cosyvoice-v3-plus";
String prefix = "myvoice";
String fileUrl = "https://your-audio-file-url";
String cloneModelName = "voice-enrollment";
try {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice myVoice = service.createVoice(
targetModel,
prefix,
fileUrl,
VoiceEnrollmentParam.builder()
.model(cloneModelName)
.languageHints(Collections.singletonList("zh")).build());
logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
} catch (Exception e) {
logger.error("Failed to create voice", e);
}
}
}
RESTful API
Basic information
URL | The Chinese mainland: International: |
Request method | POST |
Request header | |
Message body | The message body containing all request parameters. Optional fields can be omitted as needed: Important
|
Request parameters
Parameter | Type | Default | Required | Description |
model | string | - | Yes | The voice cloning model. Set to |
action | string | - | Yes | The operation type. Set to |
target_model | string | - | Yes | A speech synthesis model that controls voice timbre. We recommend cosyvoice-v3-flash or cosyvoice-v3-plus. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. |
prefix | string | - | Yes | A recognizable name for the voice (only digits, uppercase and lowercase letters, and underscores are allowed, up to 10 characters). We recommend using an identifier related to the role or scenario. This keyword appears in the cloned voice name. Generated voice name format: |
url | string | - | Yes | The URL of the audio file for voice cloning. Must be publicly accessible. |
language_hints | array[string] | ["zh"] | No | Specifies the language of the sample audio for extracting voice features. This parameter applies only to the cosyvoice-v3-flash and cosyvoice-v3-plus models. Description: This parameter helps the model identify the language of the sample audio (original reference audio) to extract voice features more accurately and improve cloning results. If the language hint does not match the actual audio language (for example, setting Valid values:
For Chinese dialects (such as Northeastern or Cantonese), set Note: This parameter is an array, but the current version processes only the first element. Pass only one value. |
Response parameters
Parameter | Type | Description |
voice_id | string | The voice ID. It can be used directly as the |
Query the voice list
Performs a paged query of created voices.
Python SDK
API description
def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
'''
Queries all created voices.
param: prefix The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters.
param: page_index The page index for the query.
param: page_size The page size for the query.
return: List[dict] A list of voices, including the ID, creation time, modification time, and status of each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
Three voice statuses:
DEPLOYING: Under review
OK: Approved and ready to use
UNDEPLOYED: Rejected and not usable
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Filter by prefix, or set to None to query all.
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")
Response example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response parameters
Parameter | Type | Description |
voice_id | string | The voice ID. |
gmt_create | string | The time when the voice was created. |
gmt_modified | string | The time when the voice was modified. |
status | string | The voice status:
|
Java SDK
API description
// Three voice statuses:
// DEPLOYING: Under review
// OK: Approved and ready to use
// UNDEPLOYED: Rejected and not usable
/**
* Queries all created voices. The default page index is 0, and the default page size is 10.
*
* @param prefix The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters. Can be null.
* @return Voice[] An array of Voice objects. The Voice object encapsulates the ID, creation time, modification time, and status of the voice.
* @throws NoApiKeyException if the API key is empty.
* @throws InputRequiredException if a required parameter is empty.
*/
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException
/**
* Queries all created voices.
*
* @param prefix The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters.
* @param pageIndex The page index for the query.
* @param pageSize The page size for the query.
* @return Voice[] An array of Voice objects. The Voice object encapsulates the ID, creation time, modification time, and status of the voice.
* @throws NoApiKeyException if the API key is empty.
* @throws InputRequiredException if a required parameter is empty.
*/
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException
Request example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you have not configured an environment variable, replace this with your API key.
private static String prefix = "myvoice"; // Replace this with the actual value.
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Query voices
Voice[] voices = service.listVoice(prefix, 0, 10);
logger.info("List successful. Request ID: {}", service.getLastRequestId());
logger.info("Voices Details: {}", new Gson().toJson(voices));
}
}
Response example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response parameters
Parameter | Type | Description |
voice_id | string | The voice ID. |
gmt_create | string | The time when the voice was created. |
gmt_modified | string | The time when the voice was modified. |
status | string | The voice status:
|
RESTful API
Basic information
URL | The Chinese mainland: International: |
Request method | POST |
Request header | |
Message body | The message body containing all request parameters. Optional fields can be omitted as needed: Important The |
Request parameters
Parameter | Type | Default | Required | Description |
model | string | - | Yes | The voice cloning model. Set to |
action | string | - | Yes | The operation type. Set to |
prefix | string | null | No | The custom prefix of the voice. Only digits and lowercase letters are allowed, up to 10 characters. |
page_index | integer | 0 | No | The page number index, starting from 0. |
page_size | integer | 10 | No | The number of data entries on each page. |
Response parameters
Parameter | Type | Description |
voice_id | string | The voice ID. |
gmt_create | string | The time when the voice was created. |
gmt_modified | string | The time when the voice was modified. |
status | string | The voice status:
|
Query a specific voice
Gets details of a specific voice.
Python SDK
API description
def query_voice(self, voice_id: str) -> List[str]:
'''
Queries details of a specific voice.
param: voice_id The ID of the voice to query.
return: List[str] Voice details, including status, creation time, audio link, and more.
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'
voice_details = service.query_voice(voice_id=voice_id)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")
Response example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
Response parameters
Parameter | Type | Description |
resource_link | string | The URL of the audio that was cloned. |
target_model | string | A speech synthesis model that controls voice timbre. We recommend cosyvoice-v3-flash or cosyvoice-v3-plus. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. |
gmt_create | string | The time when the voice was created. |
gmt_modified | string | The time when the voice was modified. |
status | string | The voice status:
|
Java SDK
API description
/**
* Queries details of a specific voice.
*
* @param voiceId The ID of the voice to query.
* @return Voice Voice details, including status, creation time, audio link, and more.
* @throws NoApiKeyException if the API key is empty.
* @throws InputRequiredException if a required parameter is empty.
*/
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you have not configured an environment variable, replace this with your API key.
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace this with the actual value.
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice voice = service.queryVoice(voiceId);
logger.info("Query successful. Request ID: {}", service.getLastRequestId());
logger.info("Voice Details: {}", new Gson().toJson(voice));
}
}
Response example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
Response parameters
Parameter | Type | Description |
resource_link | string | The URL of the audio that was cloned. |
target_model | string | A speech synthesis model that controls voice timbre. We recommend cosyvoice-v3-flash or cosyvoice-v3-plus. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. |
gmt_create | string | The time when the voice was created. |
gmt_modified | string | The time when the voice was modified. |
status | string | The voice status:
|
RESTful API
Basic information
URL | The Chinese mainland: International: |
Request method | POST |
Request header | |
Message body | The message body containing all request parameters is as follows. Optional fields can be omitted as needed: Important The |
Request parameters
Parameter | Type | Default | Required | Description |
model | string | - | Yes | The voice cloning model. Set to |
action | string | - | Yes | The operation type. Set to |
voice_id | string | - | Yes | The ID of the voice to query. |
Response parameters
Parameter | Type | Description |
resource_link | string | The URL of the audio that was cloned. |
target_model | string | A speech synthesis model that controls voice timbre. We recommend cosyvoice-v3-flash or cosyvoice-v3-plus. This must match the model used in subsequent speech synthesis API calls, otherwise synthesis will fail. |
gmt_create | string | The time when the voice was created. |
gmt_modified | string | The time when the voice was modified. |
status | string | The voice status:
|
Update a voice
Updates an existing voice with new audio.
Python SDK
API description
def update_voice(self, voice_id: str, url: str) -> None:
'''
Updates a voice.
param: voice_id The ID of the voice.
param: url The URL of the audio file for voice cloning.
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.update_voice(
voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")
Java SDK
API description
/**
* Updates a voice.
*
* @param voiceId The voice to update.
* @param url The URL of the audio file for voice cloning.
* @throws NoApiKeyException if the API key is empty.
* @throws InputRequiredException if a required parameter is empty.
*/
public void updateVoice(String voiceId, String url)
throws NoApiKeyException, InputRequiredException
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you have not configured an environment variable, replace this with your API key.
private static String fileUrl = "https://your-audio-file-url"; // Replace this with the actual value.
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace this with the actual value.
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Update the voice
service.updateVoice(voiceId, fileUrl);
logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
}
}
RESTful API
Basic information
URL | The Chinese mainland: International: |
Request method | POST |
Request header | |
Message body | The message body containing all request parameters. Optional fields can be omitted as needed: Important The |
Request parameters
Parameter | Type | Default | Required | Description |
model | string | - | Yes | The voice cloning model. Set to |
action | string | - | Yes | The operation type. Set to |
voice_id | string | - | Yes | The ID of the voice to update. |
url | string | - | Yes | The URL of the audio file to update the voice. Must be publicly accessible. For how to record audio, see Recording Guide. |
Delete a voice
Deletes a voice that is no longer needed to release quota. This operation is irreversible.
Python SDK
API description
def delete_voice(self, voice_id: str) -> None:
'''
Deletes a voice.
param: voice_id The voice to delete.
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")
Java SDK
API description
/**
* Deletes a voice.
*
* @param voiceId The voice to delete.
* @throws NoApiKeyException if the API key is empty.
* @throws InputRequiredException if a required parameter is empty.
*/
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you have not configured an environment variable, replace this with your API key.
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace this with the actual value.
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Delete the voice
service.deleteVoice(voiceId);
logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
}
}
RESTful API
Basic information
URL | The Chinese mainland: International: |
Request method | POST |
Request header | |
Message body | The message body containing all request parameters. Optional fields can be omitted as needed: Important The |
Request parameters
Parameter | Type | Default | Required | Description |
model | string | - | Yes | The voice cloning model. Set to |
action | string | - | Yes | The operation type. Set to |
voice_id | string | - | Yes | The ID of the voice to delete. |
Voice quota and cleanup rules
Total limit: 1000 voices per account
The current API does not provide a feature to query voice count. You can count voices by calling the Query the voice list API.
Automatic cleanup: If a voice has not been used for any speech synthesis requests in the past year, the system will automatically delete it.
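Since there is no dedicated count API, you can total the pages returned by the voice-list query. The sketch below takes the page-fetching function as an argument so it works with either SDK or the RESTful API; `count_voices` and `list_page` are our own names, not part of any SDK:

```python
def count_voices(list_page, page_size=50):
    """Count enrolled voices by paging until a short (or empty) page is returned.

    list_page(page_index, page_size) must return the list of voices on that page,
    e.g. a wrapper around VoiceEnrollmentService.list_voices.
    """
    total, page_index = 0, 0
    while True:
        page = list_page(page_index, page_size)
        total += len(page)
        if len(page) < page_size:  # last page reached
            return total
        page_index += 1

# Usage with the Python SDK (assumed installed and configured):
# service = VoiceEnrollmentService()
# n = count_voices(lambda i, s: service.list_voices(page_index=i, page_size=s))
```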
Billing
Voice cloning: Creating, querying, updating, and deleting voices are free of charge.
Using cloned voices for speech synthesis: Billed on a pay-as-you-go basis (by the number of text characters). For more information, see Real-time speech synthesis - CosyVoice.
Copyright and legality
You are responsible for the ownership and legal right to use the voices you provide. Read the Terms of Service.
Error codes
If you encounter an error, see Error messages for troubleshooting.
FAQ
Features
Q: How do I adjust the speech rate and volume of a custom voice?
The process is the same as for preset voices. When calling the speech synthesis API, pass the corresponding parameters, such as speech_rate (Python) or speechRate (Java) to adjust speech rate, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).
Q: How can I make calls in other languages such as Go, C#, and Node.js, besides Java and Python?
For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.
Troubleshooting
If you encounter a code error, troubleshoot based on the information in Error codes.
Q: What should I do if the synthesized audio from a cloned voice contains extra content?
If you find that output audio synthesized from a cloned voice contains extra characters or noise beyond the input text, follow these steps to troubleshoot:
Check source audio quality
The quality of the cloned audio directly affects synthesis results. Ensure the source audio meets these requirements:
No background noise or static
Clear sound quality (recommended sample rate ≥ 16 kHz)
Audio format: WAV format is better than MP3 (to avoid lossy compression)
Mono (stereo may introduce interference)
No silent segments or long pauses
Moderate speech rate (a fast speech rate affects feature extraction)
Check the input text
Confirm the input text does not contain special symbols or markers:
Avoid special symbols such as **, "", or ''. If these are not used to wrap LaTeX formulas, pre-process the text to filter them out.
Verify voice cloning parameters
When creating a voice, ensure the language parameters (language_hints/languageHints) are set correctly.
Try cloning again
Use higher-quality source audio to clone the voice again and test the synthesis result.
Compare with system voices
Test the same text with a preset system voice to confirm if the issue is specific to the cloned voice.
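The "Check the input text" step above can be scripted as a pre-processing pass. A minimal sketch; the symbol list is illustrative, not exhaustive, so extend it to match your own input:

```python
import re

# Markup-like symbols that commonly leak into synthesized audio.
# Illustrative list: add or remove characters to suit your text.
_MARKUP = re.compile(r'(\*\*|[""''"\'])')

def sanitize_tts_text(text: str) -> str:
    """Strip markup-like symbols before sending text to speech synthesis.

    Skip this step if the symbols wrap LaTeX formulas you want preserved.
    """
    return _MARKUP.sub("", text)
```

Run every piece of input text through a filter like this before synthesis, rather than cleaning outputs after the fact.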
Q: How do I troubleshoot when the audio generated from a cloned voice has no sound?
Confirm voice status
Call the Query a specific voice API to check whether the voice status is OK.
Check for model version consistency
Ensure the target_model parameter used for voice cloning matches the model parameter used for speech synthesis. For example, if you use cosyvoice-v3-plus for cloning, you must also use cosyvoice-v3-plus for synthesis.
Verify source audio quality
Check if the source audio used for voice cloning meets the audio requirements:
Audio duration: 10 to 20 seconds
Clear sound quality
No background noise
Check request parameters
Confirm that the voice request parameter for speech synthesis is set to the ID of the cloned voice.
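The model-consistency and request-parameter checks above can be automated with a small guard. A minimal sketch, assuming you record each voice's target_model at creation time; the in-memory registry here is hypothetical, and you could instead look the value up via the Query a specific voice API:

```python
# Hypothetical registry: record each voice's target model when you create it.
voice_registry: dict[str, str] = {}

def register_voice(voice_id: str, target_model: str) -> None:
    """Remember which speech synthesis model a cloned voice was created for."""
    voice_registry[voice_id] = target_model

def assert_model_matches(voice_id: str, synthesis_model: str) -> None:
    """Fail fast on the cloning/synthesis model mismatch described above."""
    target = voice_registry.get(voice_id)
    if target is not None and target != synthesis_model:
        raise ValueError(
            f"voice {voice_id!r} was cloned with target_model {target!r}; "
            f"synthesize with the same model, not {synthesis_model!r}"
        )
```

Calling `assert_model_matches` just before every synthesis request turns a silent no-audio failure into an explicit error.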
Q: What should I do if the synthesis result is unstable or the speech is incomplete after voice cloning?
If synthesized speech from a cloned voice has these issues:
Incomplete speech playback, where only part of the text is read.
Unstable synthesis results, varying between good and bad.
Abnormal pauses or silent segments in the speech.
Possible cause: Source audio quality does not meet requirements.
Solution: Check whether the source audio meets the following requirements. We recommend re-recording according to the Recording guide.
Check audio continuity: Ensure speech content in the source audio is continuous. Avoid long pauses or silent segments (over 2 s). If the audio contains significant blank segments, the model may treat silence or noise as part of voice features, affecting generation results.
Check speech activity ratio: Ensure valid speech accounts for more than 60% of total audio duration. Excessive background noise or non-speech segments can interfere with voice feature extraction.
Verify audio quality details:
Audio duration: 10 to 20 seconds (15 seconds is recommended)
Clear pronunciation and steady speech rate
No background noise, echo, or static
Concentrated speech energy with no long silent segments
Q: Why can't I find the VoiceEnrollmentService class?
Your DashScope SDK version is too old. Upgrade to the latest version.
Q: What should I do if the voice cloning result is poor, with noise or unclear audio?
This is usually caused by low-quality input audio. Strictly follow the Recording guide to re-record and upload the audio.
Q: Why is there a long silence at the beginning or abnormal audio duration when using a cloned voice to synthesize very short text (such as a single word)?
The voice cloning model learns pauses and rhythm from sample audio. If the original recording contains long initial silences or pauses, the synthesized result may retain a similar pattern. For single words or very short text, this silence ratio is amplified, making the audio seem long but mostly silent. Avoid long silences when recording sample audio. When synthesizing, use complete sentences or longer text. If you must synthesize a single word, add context before or after it, or use a homophone to avoid extreme cases.