CosyVoice voice cloning creates natural-sounding custom voices from 10-20 seconds of audio with no traditional training required. Voice design generates custom voices from text descriptions with support for multilingual and multidimensional voice features. This topic covers voice cloning/design APIs and parameters. For speech synthesis, see Real-time speech synthesis – CosyVoice.
User guide: For model overviews and selection recommendations, see Real-time speech synthesis – CosyVoice.
- CosyVoice voice design is powered by the FunAudioGen-VD model.
- Voice designs created with identical prompts may produce different voices. We recommend generating multiple results and selecting the best one.
- This topic describes CosyVoice voice cloning/design APIs. If you use Qwen models, see Qwen voice cloning and Qwen voice design.
Supported models
Voice cloning:
- cosyvoice-v3.5-plus, cosyvoice-v3.5-flash
- cosyvoice-v3-plus, cosyvoice-v3-flash
- cosyvoice-v2
Voice design:
- cosyvoice-v3.5-plus, cosyvoice-v3.5-flash
- cosyvoice-v3-plus, cosyvoice-v3-flash
Note:
- cosyvoice-v3.5-plus and cosyvoice-v3.5-flash are available only in the Chinese Mainland deployment mode (Beijing region).
- In the International deployment mode (Singapore region), cosyvoice-v3-plus and cosyvoice-v3-flash do not support voice cloning or voice design. Choose another model instead.
Supported languages
Voice cloning: Depends on the speech synthesis model that determines the voice (specified by the target_model/targetModel parameter):
- cosyvoice-v2: Chinese (Mandarin), English
- cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeast dialect, Gansu dialect, Guizhou dialect, Henan dialect, Hubei dialect, Jiangxi dialect, Minnan dialect, Ningxia dialect, Shanxi dialect, Shaanxi dialect, Shandong dialect, Shanghai dialect, Sichuan dialect, Tianjin dialect, Yunnan dialect), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese
- cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, Russian
- cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan dialect, Hubei dialect, Minnan dialect, Ningxia dialect, Shaanxi dialect, Shandong dialect, Shanghai dialect, Sichuan dialect), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese
Voice cloning does not currently support other languages (Spanish, Italian, etc.).
Voice design: Chinese, English.
Getting started: from voice cloning to speech synthesis
Voice cloning and speech synthesis are two separate but related steps that follow a "create then use" workflow:
Prepare an audio recording file.
Upload an audio file that meets the requirements specified in Voice cloning: Input audio formats to a publicly accessible location, such as Object Storage Service (OSS), and ensure the URL is publicly accessible.
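Before calling the API, a quick local sanity check on the URL can catch obvious mistakes early. The sketch below (a hypothetical helper, not part of the DashScope SDK) validates only the URL shape; it does not confirm the file is actually reachable, so fetch the URL yourself to verify public access:

```python
from urllib.parse import urlparse

def looks_like_public_url(url: str) -> bool:
    """Basic sanity check: the URL uses HTTP(S) and names a host.

    This does not verify reachability; issue an HTTP request (for example
    a HEAD request) to confirm the audio file is publicly accessible.
    """
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(looks_like_public_url("https://example.com/sample.wav"))  # True
print(looks_like_public_url("file:///local/sample.wav"))        # False
```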
Create a voice.
Call the Create voice API. Specify target_model or targetModel to define the speech synthesis model to be used with the created voice.
If you already have a voice (you can check by calling the Query voice list API), you can skip this step and proceed to the next one.
Use the voice for speech synthesis.
After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID. This voice_id or voiceID can be used as the voice parameter in the speech synthesis API or the language SDKs for subsequent text-to-speech conversion. Multiple invocation modes are supported, including non-streaming, unidirectional streaming, and bidirectional streaming synthesis.
The speech synthesis model specified for synthesis must match the target_model or targetModel used when creating the voice, or synthesis will fail.
Sample code:
import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer

# 1. Prepare the environment.
# We recommend that you configure the API key using an environment variable.
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")

# The following is the WebSocket URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# The following is the HTTP URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# 2. Define the cloning parameters.
TARGET_MODEL = "cosyvoice-v3.5-plus"
# Give the voice a meaningful prefix. Only digits and lowercase letters are allowed; the prefix must be less than 10 characters in length.
VOICE_PREFIX = "myvoice"
# A publicly accessible audio URL. This is a sample URL; replace it with your own.
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav"

# 3. Create a voice (asynchronous task).
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise

# 4. Poll for the voice status.
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10  # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
        status = voice_info.get("status")
        print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")
        if status == "OK":
            print("Voice is ready for synthesis.")
            break
        elif status == "UNDEPLOYED":
            raise RuntimeError(f"Voice processing failed with status: {status}. Check the audio quality or contact support.")
        # For intermediate statuses such as "DEPLOYING", continue to wait.
        time.sleep(poll_interval)
    except RuntimeError:
        # Propagate the failure instead of retrying a voice that cannot deploy.
        raise
    except Exception as e:
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
else:
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")

# 5. Use the cloned voice for speech synthesis.
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations, you have successfully cloned and synthesized your own voice!"
    # The call() method returns binary audio data.
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")
    # 6. Save the audio file.
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")
except Exception as e:
    print(f"Error during speech synthesis: {e}")

Getting started: from voice design to speech synthesis
Voice design and speech synthesis are two separate but related steps that follow a "create then use" workflow:
Prepare the voice description and preview text for voice design.
- Voice description (voice_prompt): Describes the features of the target voice. For more information, see Voice design: Write high-quality voice descriptions.
- Preview text (preview_text): The content that the target voice will read for the preview audio, for example, "Hello everyone, welcome."
Call the Create voice API to create a custom voice and retrieve the voice name and preview audio.
Specify target_model to define the speech synthesis model to be used with the created voice.
Listen to the preview audio to determine whether it meets your expectations. If it does, proceed to the next step. Otherwise, redesign the voice.
If you already have a voice (you can check by calling the Query voice list API), you can skip this step and proceed to the next one.
Use the voice for speech synthesis.
After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID. This voice_id or voiceID can be used directly as the voice parameter in the speech synthesis API or the language SDKs for subsequent text-to-speech conversion. Multiple invocation modes are supported, including non-streaming, unidirectional streaming, and bidirectional streaming synthesis.
The speech synthesis model specified during synthesis must match the target_model or targetModel used when creating the voice, or synthesis will fail.
Sample code:
Generate a custom voice and preview the result. If you are satisfied with the result, proceed to the next step. Otherwise, regenerate the voice.
Python
import requests
import base64
import os

def create_voice_and_play():
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured environment variables, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")
    if not api_key:
        print("Error: The DASHSCOPE_API_KEY environment variable is not found. Set the API key.")
        return None, None, None

    # Prepare the request data.
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }

    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"

    try:
        # Send the request with a timeout setting.
        response = requests.post(url, headers=headers, json=data, timeout=60)
        if response.status_code == 200:
            result = response.json()
            # Get the voice ID.
            voice_id = result["output"]["voice_id"]
            print(f"Voice ID: {voice_id}")
            # Get the preview audio data and decode it from Base64.
            base64_audio = result["output"]["preview_audio"]["data"]
            audio_bytes = base64.b64decode(base64_audio)
            # Save the audio data to a local file.
            filename = f"{voice_id}_preview.wav"
            with open(filename, 'wb') as f:
                f.write(audio_bytes)
            print(f"The audio is saved to the local file: {filename}")
            print(f"File path: {os.path.abspath(filename)}")
            return voice_id, audio_bytes, filename
        else:
            print(f"Request failed. Status code: {response.status_code}")
            print(f"Response content: {response.text}")
            return None, None, None
    except requests.exceptions.RequestException as e:
        print(f"A network request error occurred: {e}")
        return None, None, None
    except KeyError as e:
        print(f"The response data is in an invalid format. The required field is missing: {e}")
        print(f"Response content: {response.text if 'response' in locals() else 'No response'}")
        return None, None, None
    except Exception as e:
        print(f"An unknown error occurred: {e}")
        return None, None, None

if __name__ == "__main__":
    print("Creating the voice...")
    voice_id, audio_data, saved_filename = create_voice_and_play()
    if voice_id:
        print(f"\nVoice '{voice_id}' is created.")
        print(f"The audio file is saved: '{saved_filename}'")
        print(f"File size: {os.path.getsize(saved_filename)} bytes")
    else:
        print("\nFailed to create the voice.")

Java
You need to import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:
Maven
Add the following to your pom.xml file:

<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>

Gradle
Add the following to your build.gradle file:

// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class Main {
    public static void main(String[] args) {
        Main example = new Main();
        example.createVoice();
    }

    public void createVoice() {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured environment variables, replace the following line with your Model Studio API key: String apiKey = "sk-xxx";
        String apiKey = System.getenv("DASHSCOPE_API_KEY");

        // Create a JSON request body string.
        String jsonBody = "{\n" +
                "  \"model\": \"voice-enrollment\",\n" +
                "  \"input\": {\n" +
                "    \"action\": \"create_voice\",\n" +
                "    \"target_model\": \"cosyvoice-v3.5-plus\",\n" +
                "    \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                "    \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                "    \"prefix\": \"announcer\"\n" +
                "  },\n" +
                "  \"parameters\": {\n" +
                "    \"sample_rate\": 24000,\n" +
                "    \"response_format\": \"wav\"\n" +
                "  }\n" +
                "}";

        HttpURLConnection connection = null;
        try {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
            URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
            connection = (HttpURLConnection) url.openConnection();

            // Set the request method and headers.
            connection.setRequestMethod("POST");
            connection.setRequestProperty("Authorization", "Bearer " + apiKey);
            connection.setRequestProperty("Content-Type", "application/json");
            connection.setDoOutput(true);
            connection.setDoInput(true);

            // Send the request body.
            try (OutputStream os = connection.getOutputStream()) {
                byte[] input = jsonBody.getBytes("UTF-8");
                os.write(input, 0, input.length);
                os.flush();
            }

            // Get the response.
            int responseCode = connection.getResponseCode();
            if (responseCode == HttpURLConnection.HTTP_OK) {
                // Read the response content.
                StringBuilder response = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                    String responseLine;
                    while ((responseLine = br.readLine()) != null) {
                        response.append(responseLine.trim());
                    }
                }

                // Parse the JSON response.
                JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");

                // Get the voice name.
                String voiceId = outputObj.get("voice_id").getAsString();
                System.out.println("Voice ID: " + voiceId);

                // Get the Base64-encoded audio data and decode it.
                String base64Audio = previewAudioObj.get("data").getAsString();
                byte[] audioBytes = Base64.getDecoder().decode(base64Audio);

                // Save the audio to a local file.
                String filename = voiceId + "_preview.wav";
                saveAudioToFile(audioBytes, filename);
                System.out.println("The audio is saved to the local file: " + filename);
            } else {
                // Read the error response.
                StringBuilder errorResponse = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                    String responseLine;
                    while ((responseLine = br.readLine()) != null) {
                        errorResponse.append(responseLine.trim());
                    }
                }
                System.out.println("Request failed. Status code: " + responseCode);
                System.out.println("Error response: " + errorResponse.toString());
            }
        } catch (Exception e) {
            System.err.println("A request error occurred: " + e.getMessage());
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    private void saveAudioToFile(byte[] audioBytes, String filename) {
        try {
            File file = new File(filename);
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audioBytes);
            }
            System.out.println("The audio is saved to: " + file.getAbsolutePath());
        } catch (IOException e) {
            System.err.println("An error occurred while saving the audio file: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Use the custom voice you generated in the previous step for speech synthesis.
This step references the non-streaming call example code. Replace the voice parameter with the custom voice generated through voice design.
Key principle: The model used for voice design (target_model) must match the model used for subsequent speech synthesis (model), or synthesis will fail.
Python
# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import *

# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured environment variables, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Use the same model for voice design and speech synthesis.
model = "cosyvoice-v3.5-plus"
# Replace the voice parameter with the custom voice generated by voice design.
voice = "your_voice"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized and get the binary audio.
audio = synthesizer.call("How is the weather today?")
# The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
print('[Metric] Request ID: {}, First packet delay: {} ms'.format(
    synthesizer.get_last_request_id(), synthesizer.get_first_package_delay()))
# Save the audio locally.
with open('output.mp3', 'wb') as f:
    f.write(audio)

Java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Use the same model for voice design and speech synthesis.
    private static String model = "cosyvoice-v3.5-plus";
    // Replace the voice parameter with the custom voice generated by voice design.
    private static String voice = "your_voice_id";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured environment variables, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(model) // Model
                .voice(voice) // Voice
                .build();

        // Synchronous mode: Disable callback (second parameter is null).
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until audio returns.
            audio = synthesizer.call("How is the weather today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close the WebSocket connection when the task ends.
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save the audio data to the local file "output.mp3".
            File file = new File("output.mp3");
            // The first time you send text, a WebSocket connection is established. The first packet delay includes the connection establishment time.
            System.out.println(
                    "[Metric] Request ID: " + synthesizer.getLastRequestId()
                            + ", First packet delay (ms): " + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
API reference
Use the same Alibaba Cloud account for all API operations.
The Java and Python DashScope SDKs do not support voice design. For voice design, use the RESTful API.
Create voice
RESTful API
URL
Chinese Mainland:
POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
International:
POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
Request headers
- Authorization (string, required): Authentication token. Format: Bearer <your_api_key>. Replace <your_api_key> with your actual API key.
- Content-Type (string, required): Media type of data in the request body. Fixed value: application/json.
Request body
The request body contains all parameters (omit optional fields as needed).
Important: Note the difference between these parameters:
- model: Voice cloning/design model. Fixed value: voice-enrollment
- target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
Voice cloning

{
  "model": "voice-enrollment",
  "input": {
    "action": "create_voice",
    "target_model": "cosyvoice-v3.5-plus",
    "prefix": "myvoice",
    "url": "https://yourAudioFileUrl",
    "language_hints": ["zh"]
  }
}

Voice design

{
  "model": "voice-enrollment",
  "input": {
    "action": "create_voice",
    "target_model": "cosyvoice-v3.5-plus",
    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
    "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
    "prefix": "announcer",
    "language_hints": ["zh"]
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}
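The two request bodies differ only in their input fields. The helper below (a hypothetical convenience function, not part of any SDK) builds either payload and raises before the HTTP call if a required field for the chosen mode is missing:

```python
def build_create_voice_payload(target_model: str, prefix: str, *,
                               url: str = None,
                               voice_prompt: str = None,
                               preview_text: str = None,
                               language_hints=None) -> dict:
    """Build the request body for the create_voice action.

    Voice cloning requires `url`; voice design requires both
    `voice_prompt` and `preview_text`.
    """
    body = {"model": "voice-enrollment",
            "input": {"action": "create_voice",
                      "target_model": target_model,
                      "prefix": prefix}}
    if url is not None:  # voice cloning
        body["input"]["url"] = url
    elif voice_prompt is not None and preview_text is not None:  # voice design
        body["input"]["voice_prompt"] = voice_prompt
        body["input"]["preview_text"] = preview_text
    else:
        raise ValueError("Provide url (cloning) or voice_prompt and preview_text (design).")
    if language_hints:
        # Only the first element of language_hints is processed by the service.
        body["input"]["language_hints"] = language_hints[:1]
    return body

payload = build_create_voice_payload("cosyvoice-v3.5-plus", "myvoice",
                                     url="https://yourAudioFileUrl",
                                     language_hints=["zh"])
print(payload["input"]["url"])
```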
Request parameters
- model (string, required): Voice cloning/design model. Fixed value: voice-enrollment.
- action (string, required): Action type. Fixed value: create_voice.
- target_model (string, required): Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.
- url (string): Required only for voice cloning. Publicly accessible URL of the audio file used for voice cloning. For audio format details, see Voice cloning: Input audio formats. For recording guidance, see Recording Guide.
- voice_prompt (string): Required only for voice design. Voice description. Maximum length: 500 characters. Chinese and English only.
- preview_text (string): Required only for voice design. Text for the preview audio. Maximum length: 200 characters. Supported languages: Chinese (zh), English (en).
- prefix (string): The voice name keyword (letters and digits only, max 10 characters). Use role or scenario identifiers. This keyword appears in the final voice name. For example, if the keyword is "announcer", the final voice names are:
  - Voice cloning: cosyvoice-v3.5-plus-announcer-8aae0c0397fa408ca60c29cf******
  - Voice design: cosyvoice-v3.5-plus-vd-announcer-8aae0c0397fa408ca60c29cf******
- language_hints (array[string], default ["zh"]): Specifies the sample audio language for voice feature extraction. Available for cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models. This parameter is an array, but only the first element is processed; pass only one value.
  Functionality:
  - Voice cloning: Helps identify the sample audio language to improve voice feature extraction and cloning quality. If the hint does not match the actual audio language (for example, en for Chinese audio), the system ignores it and auto-detects the language.
    Valid values (by model):
    - cosyvoice-v3-plus: zh (Chinese, default), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian)
    - cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (Chinese, default), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian), pt (Portuguese), th (Thai), id (Indonesian), vi (Vietnamese)
    For Chinese dialects (for example, Northeastern or Cantonese), set language_hints to zh. Control the dialect style in speech synthesis using the text content or the instruct parameter.
  - Voice design: Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match the preview_text language. Valid values: zh (Chinese, default), en (English).
- max_prompt_audio_length (float, default 10.0, optional): The maximum reference audio duration (seconds) after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash. Valid range: [3.0, 30.0].
  Important:
  - This parameter is available only for voice cloning scenarios.
  - Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
- enable_preprocess (boolean, default false, optional): Enables audio preprocessing (noise reduction, audio enhancement, volume normalization) before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash. Valid values: true (enable), false (disable).
  Important:
  - This parameter is available only for voice cloning scenarios.
  - If the audio has background noise, we recommend enabling this parameter; otherwise, noise may appear at sentence breaks in the synthesized speech.
  - For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
- sample_rate (int, default 24000): Available only for voice design. The preview audio sample rate (Hz). Valid values: 16000, 24000, 48000.
- response_format (string, default wav): Available only for voice design. The preview audio format. Valid values: pcm, wav, mp3.
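Several of the constraints above can be checked client-side before the HTTP call. A sketch (using the stricter prefix rule of lowercase letters and digits, since the examples in this topic vary slightly; adjust the pattern if your target model accepts more characters):

```python
import re

def validate_clone_params(prefix: str, max_prompt_audio_length: float = 10.0) -> list:
    """Return a list of problems; an empty list means the parameters look valid."""
    problems = []
    # Stricter documented rule: lowercase letters and digits, at most 10 characters.
    if not re.fullmatch(r"[a-z0-9]{1,10}", prefix):
        problems.append("prefix must be 1-10 lowercase letters or digits")
    # Documented valid range for the reference audio duration.
    if not 3.0 <= max_prompt_audio_length <= 30.0:
        problems.append("max_prompt_audio_length must be in [3.0, 30.0] seconds")
    return problems

print(validate_clone_params("myvoice"))         # []
print(validate_clone_params("MyVoice!", 45.0))  # two problems reported
```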
Response parameters
Key parameters:
- voice_id (string): Voice ID. Use directly as the voice parameter in the speech synthesis API.
- data (string): The preview audio data from voice design (Base64-encoded).
- sample_rate (int): The preview audio sample rate (Hz) from voice design. This value matches the creation request. Default: 24000 Hz.
- response_format (string): The preview audio format from voice design. This value matches the creation request. Default: wav.
- target_model (string): Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail.
- request_id (string): Request ID.
- count (integer): Number of "create voice" operations in this request. Always 1 for voice creation.
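The data field holds Base64-encoded audio bytes; decoding it and writing the raw bytes yields a playable preview file. A minimal sketch (the response is stubbed here with a locally generated silent WAV, since a doc example cannot call the live API; the field names mirror the response structure above):

```python
import base64
import io
import wave

def save_preview_audio(output: dict, filename: str) -> str:
    """Decode the Base64 preview_audio.data field and write it to disk."""
    audio_bytes = base64.b64decode(output["preview_audio"]["data"])
    with open(filename, "wb") as f:
        f.write(audio_bytes)
    return filename

# Stand-in for a real response payload: a 0.1-second silent mono WAV at 24 kHz.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 16-bit PCM
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 2400)
fake_output = {
    "voice_id": "cosyvoice-v3.5-plus-vd-demo-xxxx",
    "preview_audio": {"data": base64.b64encode(buf.getvalue()).decode()},
}

path = save_preview_audio(fake_output, "preview.wav")
with open(path, "rb") as f:
    header = f.read(12)
print(header[:4])  # b'RIFF' for a valid WAV container
```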
Sample code
Important: Note the difference between these parameters:
- model: Voice cloning/design model. Fixed value: voice-enrollment
- target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
Voice cloning
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

# This is the Singapore region URL. For the Beijing region, use https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization with a Beijing-region API key.
# Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "create_voice",
      "target_model": "cosyvoice-v3.5-plus",
      "prefix": "myvoice",
      "url": "https://yourAudioFileUrl"
    }
  }'

Voice design
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.

# This is the Singapore region URL. For the Beijing region, use https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization with a Beijing-region API key.
# Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "create_voice",
      "target_model": "cosyvoice-v3.5-plus",
      "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
      "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
      "prefix": "announcer"
    },
    "parameters": {
      "sample_rate": 24000,
      "response_format": "wav"
    }
  }'
Python SDK
Interface description
Before using this API, install the latest DashScope SDK.
def create_voice(self, target_model: str, prefix: str, url: str,
                 language_hints: List[str] = None,
                 max_prompt_audio_length: float = None,
                 enable_preprocess: bool = None) -> str:
    '''
    Create a new custom voice.
    param: target_model Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
    param: prefix The voice name (letters, numbers, underscores only; max 10 characters). Use role or scenario identifiers. Final voice name format: model-name-prefix-unique-id (e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx).
    param: url Publicly accessible URL of the audio file used for voice cloning.
    param: language_hints The reference audio language for voice feature extraction. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus.
        Helps identify the sample audio language to improve voice feature extraction and cloning quality.
        If the hint does not match the actual audio, the system ignores it and auto-detects the language.
        Valid values (by model):
        cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
        cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
        This parameter is an array, but only the first element is processed. Pass only one value.
    param: max_prompt_audio_length The maximum reference audio duration (seconds) after preprocessing for voice cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
        Valid range: [3.0, 30.0]. Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
    param: enable_preprocess Enables audio preprocessing. When enabled, the system performs noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
        If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.
        For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
    return: voice_id The voice ID. Use directly as the voice parameter in the speech synthesis API.
    '''
- target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
- language_hints: Language of the reference audio used to extract voice features. Applies only to the cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
  Functionality:
  - Voice cloning: Helps identify the sample audio language to improve voice feature extraction and cloning quality. If the hint doesn't match the actual audio language (e.g., en for Chinese audio), the system ignores it and auto-detects the language.
    Valid values (by model):
    - cosyvoice-v3-plus: zh: Chinese (default), en: English, fr: French, de: German, ja: Japanese, ko: Korean, ru: Russian
    - cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh: Chinese (default), en: English, fr: French, de: German, ja: Japanese, ko: Korean, ru: Russian, pt: Portuguese, th: Thai, id: Indonesian, vi: Vietnamese
    For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control the dialect style in speech synthesis using the text content or the instruct parameter.
  - Voice design: Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match the preview_text language. Valid values: zh: Chinese (default), en: English.
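The per-model hint sets above can be checked client-side before calling create_voice. A minimal sketch; VALID_HINTS and check_language_hint are illustrative helpers, not part of the SDK:

```python
# Valid language_hints values per model, as listed above (illustrative helper).
_V3_HINTS = {"zh", "en", "fr", "de", "ja", "ko", "ru"}
_V35_HINTS = _V3_HINTS | {"pt", "th", "id", "vi"}
VALID_HINTS = {
    "cosyvoice-v3-plus": _V3_HINTS,
    "cosyvoice-v3.5-plus": _V35_HINTS,
    "cosyvoice-v3.5-flash": _V35_HINTS,
    "cosyvoice-v3-flash": _V35_HINTS,
}

def check_language_hint(target_model: str, hint: str = "zh") -> list:
    """Return a one-element language_hints list, or raise ValueError if
    the hint is not supported by the given target_model."""
    valid = VALID_HINTS.get(target_model)
    if valid is None:
        raise ValueError(f"{target_model} does not support language_hints")
    if hint not in valid:
        raise ValueError(
            f"{hint!r} is not valid for {target_model}; choose from {sorted(valid)}"
        )
    return [hint]  # the API processes only the first element of the array
```

For example, `check_language_hint("cosyvoice-v3-plus", "fr")` returns `["fr"]`, ready to pass as `language_hints`.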
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Avoid frequent calls. Each call creates a new voice. After reaching your quota limit, you cannot create more.
voice_id = service.create_voice(
target_model='cosyvoice-v3.5-plus',
prefix='myvoice',
url='https://your-audio-file-url'
# language_hints=['zh'],
# max_prompt_audio_length=10.0,
# enable_preprocess=False
)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")
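A newly created voice may spend time in the DEPLOYING status before it becomes usable (see the status values under List voices). A hedged polling sketch; wait_until_ready is a hypothetical helper, and query_fn stands for any callable that returns the voice-details dict, such as a wrapper around the query_voice method described later:

```python
import time

def wait_until_ready(query_fn, voice_id, timeout=300, interval=5):
    """Poll until the voice status becomes OK.
    query_fn is any callable returning a dict with a 'status' key
    (e.g. a wrapper around VoiceEnrollmentService.query_voice)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        details = query_fn(voice_id)
        status = details.get("status")
        if status == "OK":
            return details
        if status == "UNDEPLOYED":
            raise RuntimeError(f"Voice {voice_id} was rejected")
        time.sleep(interval)
    raise TimeoutError(f"Voice {voice_id} not ready after {timeout}s")
```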
Java SDK
Interface description
Before using this API, install the latest DashScope SDK.
/**
* Create a new custom voice.
*
* @param targetModel Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
 * @param prefix The custom voice prefix (letters, numbers, underscores only; max 10 characters). Use role or scenario identifiers. The generated voice ID has the format model-name-prefix-unique-id (e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx).
* @param url Publicly accessible URL of the audio file used for voice cloning.
* @param customParam Custom parameters. Specify languageHints and maxPromptAudioLength here.
* languageHints: The reference audio language for voice feature extraction. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus.
* Helps identify sample audio language to improve voice feature extraction and cloning quality.
* If hint doesn't match actual audio, the system ignores it and auto-detects the language.
* Valid values (by model):
* cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
* cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
* Only the first element is processed. Pass only one value.
 * maxPromptAudioLength: The maximum reference audio duration (seconds) after preprocessing for voice cloning. Applies only to the cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
* Valid range: [3.0, 30.0]. Longer audio produces better results. For optimal voice reproduction, use at least 20 seconds of audio.
* enable_preprocess: Configured through the generic parameter. Enables audio preprocessing. When enabled, the system performs noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
* If the audio has background noise, we recommend enabling this parameter. Otherwise, noise may appear at sentence breaks in the synthesized speech.
* For quiet environments, we recommend disabling this parameter to maximize voice reproduction quality.
* @return Voice New voice. Call Voice.getVoiceId() to get the voice ID. Use directly as the voice parameter in the speech synthesis API.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
- targetModel: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
- languageHints: Language of the reference audio used to extract voice features. Applies only to the cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
  Functionality:
  - Voice cloning: Helps identify the sample audio language to improve voice feature extraction and cloning quality. If the hint doesn't match the actual audio language (e.g., en for Chinese audio), the system ignores it and auto-detects the language.
    Valid values (by model):
    - cosyvoice-v3-plus: zh: Chinese (default), en: English, fr: French, de: German, ja: Japanese, ko: Korean, ru: Russian
    - cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh: Chinese (default), en: English, fr: French, de: German, ja: Japanese, ko: Korean, ru: Russian, pt: Portuguese, th: Thai, id: Indonesian, vi: Vietnamese
    For Chinese dialects (e.g., Northeastern, Cantonese), set languageHints to zh. Control the dialect style in speech synthesis using the text content or the instruct parameter.
  - Voice design: Specifies the language preference for the generated voice. Affects pronunciation and language features. Must match the preview_text language. Valid values: zh: Chinese (default), en: English.
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Collections;
public class Main {
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args) {
String apiKey = System.getenv("DASHSCOPE_API_KEY");
String targetModel = "cosyvoice-v3.5-plus";
String prefix = "myvoice";
String fileUrl = "https://your-audio-file-url";
String cloneModelName = "voice-enrollment";
try {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice myVoice = service.createVoice(
targetModel,
prefix,
fileUrl,
VoiceEnrollmentParam.builder()
.model(cloneModelName)
.languageHints(Collections.singletonList("zh"))
// .maxPromptAudioLength(10.0f)
// .parameter("enable_preprocess", false)
.build());
logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
} catch (Exception e) {
logger.error("Failed to create voice", e);
}
}
}
List voices
Query created voices with pagination.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed.
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "list_voice",
    "prefix": "announcer",
    "page_size": 10,
    "page_index": 0
  }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | Yes | Action type. Fixed value: list_voice. |
| prefix | string | - | No | The same prefix used when creating the voice (letters/numbers only, max 10 characters). |
| page_index | integer | 0 | No | Page index (≥ 0). |
| page_size | integer | 10 | No | The number of items per page. Valid range: [0, 1000]. |
-
Response parameters
Key parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API. |
| target_model | string | Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail. |
| gmt_create | string | Time the voice was created. |
| gmt_modified | string | Time the voice was last modified. |
| voice_prompt | string | Voice description. |
| preview_text | string | Preview text. |
| request_id | string | Request ID. |
| status | string | Voice status: DEPLOYING (under review), OK (ready to use), UNDEPLOYED (unavailable). |
-
Sample code
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# This is the Singapore region URL. For the Beijing region, use dashscope.aliyuncs.com with a different regional API key.
# Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "list_voice",
      "prefix": "announcer",
      "page_size": 10,
      "page_index": 0
    }
  }'
Python SDK
Interface description
def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
'''
Query all created voices
param: prefix Custom prefix for the voice (lowercase letters and numbers only; fewer than 10 characters).
param: page_index Page index to query
param: page_size Page size to query
return: List[dict] Voice list containing ID, creation time, modification time, and status for each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
Voice statuses:
DEPLOYING: Under review
OK: Approved and ready to use
UNDEPLOYED: Rejected and unavailable
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Filter by prefix, or set to None to query all
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")
Response example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API. |
| target_model | string | Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail. |
| gmt_create | string | Time the voice was created. |
| gmt_modified | string | Time the voice was last modified. |
| voice_prompt | string | Voice description. |
| preview_text | string | Preview text. |
| request_id | string | Request ID. |
| status | string | Voice status: DEPLOYING (under review), OK (ready to use), UNDEPLOYED (unavailable). |
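Because list results are paginated, collecting every voice takes a loop. A minimal sketch; list_all_voices is an illustrative helper built on the list_voices method shown above, not an SDK method:

```python
def list_all_voices(service, prefix=None, page_size=50):
    """Collect voices across all pages. Stops when a page returns
    fewer items than page_size. `service` is assumed to expose a
    list_voices(prefix, page_index, page_size) method like the
    DashScope VoiceEnrollmentService shown above."""
    voices, page_index = [], 0
    while True:
        page = service.list_voices(
            prefix=prefix, page_index=page_index, page_size=page_size
        )
        voices.extend(page)
        if len(page) < page_size:  # short (or empty) page means we are done
            return voices
        page_index += 1
```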
Java SDK
Interface description
// Voice statuses:
// DEPLOYING: Under review
// OK: Approved and ready to use
// UNDEPLOYED: Rejected and unavailable
/**
* Query all created voices. Default page index is 0, default page size is 10.
*
 * @param prefix Custom prefix for the voice (lowercase letters and numbers only; fewer than 10 characters). Can be null.
* @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException
/**
* Query all created voices.
*
 * @param prefix Custom prefix for the voice (lowercase letters and numbers only; fewer than 10 characters).
* @param pageIndex Page index to query.
* @param pageSize Page size to query.
* @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException
Request example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String prefix = "myvoice"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Query voices
Voice[] voices = service.listVoice(prefix, 0, 10);
logger.info("List successful. Request ID: {}", service.getLastRequestId());
logger.info("Voices Details: {}", new Gson().toJson(voices));
}
}
Response example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API. |
| target_model | string | Speech synthesis model that drives the voice (see Supported models). Must match the synthesis model used later or synthesis will fail. |
| gmt_create | string | Time the voice was created. |
| gmt_modified | string | Time the voice was last modified. |
| voice_prompt | string | Voice description. |
| preview_text | string | Preview text. |
| request_id | string | Request ID. |
| status | string | Voice status: DEPLOYING (under review), OK (ready to use), UNDEPLOYED (unavailable). |
Query specific voice
Get detailed information about a specific voice by name.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed.
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "query_voice",
    "voice_id": "yourVoiceID"
  }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | Yes | Action type. Fixed value: query_voice. |
| voice_id | string | - | Yes | ID of the voice to query. |
-
Response parameters
For parameter descriptions, see the List voices API.
-
Sample code
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# The following is the URL for the Singapore region. If you use a Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "query_voice",
      "voice_id": "yourVoiceID"
    }
  }'
Python SDK
Interface description
def query_voice(self, voice_id: str) -> List[str]:
'''
Query details for a specific voice
param: voice_id ID of the voice to query
return: List[str] Voice details, including status, creation time, audio link, etc.
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'
voice_details = service.query_voice(voice_id=voice_id)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")
Response example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3.5-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
Response parameters
For parameter descriptions, see the List voices API.
Java SDK
Interface description
/**
* Query details for a specific voice
*
* @param voiceId ID of the voice to query
* @return Voice Voice details, including status, creation time, audio link, etc.
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice voice = service.queryVoice(voiceId);
logger.info("Query successful. Request ID: {}", service.getLastRequestId());
logger.info("Voice Details: {}", new Gson().toJson(voice));
}
}
Response example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3.5-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
Response parameters
For parameter descriptions, see the List voices API.
Update voice (voice cloning only)
Updates an existing voice with a new audio file.
This feature is not supported for voice design.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed:
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "update_voice",
    "voice_id": "yourVoiceId",
    "url": "https://yourAudioFileUrl"
  }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | Yes | Action type. Fixed value: update_voice. |
| voice_id | string | - | Yes | Voice to update. |
| url | string | - | Yes | URL of the audio file to update the voice. The URL must be publicly accessible. |
-
Response parameters
-
Sample code
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# The following is the URL for the Singapore region. If you use a Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "update_voice",
      "voice_id": "yourVoiceId",
      "url": "https://yourAudioFileUrl"
    }
  }'
Python SDK
Interface description
def update_voice(self, voice_id: str, url: str) -> None:
'''
Update a voice
param: voice_id Voice ID
param: url URL of the audio file for voice cloning
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.update_voice(
voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")
Java SDK
Interface description
/**
* Update a voice
*
* @param voiceId Voice to update
* @param url URL of the audio file for voice cloning
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public void updateVoice(String voiceId, String url)
throws NoApiKeyException, InputRequiredException
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String fileUrl = "https://your-audio-file-url"; // Replace with your actual value
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Update voice
service.updateVoice(voiceId, fileUrl);
logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
}
}
Delete voice
Deletes a voice that you no longer need, to free up quota. This action cannot be undone.
RESTful API
-
URL and Request headers are the same as the Create voice API
-
Request body
The request body contains all parameters. Omit optional fields as needed:
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "delete_voice",
    "voice_id": "yourVoiceID"
  }
}
-
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment. |
| action | string | - | Yes | Action type. Fixed value: delete_voice. |
| voice_id | string | - | Yes | Voice to delete. |
-
Response parameters
-
Sample code
Important: model is the voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If the API key isn't in an environment variable, replace $DASHSCOPE_API_KEY with your actual key.
# The following is the URL for the Singapore region. If you use a Beijing region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "delete_voice",
      "voice_id": "yourVoiceID"
    }
  }'
Python SDK
Interface description
def delete_voice(self, voice_id: str) -> None:
'''
Delete a voice
param: voice_id Voice to delete
'''
Request example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")
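Since each account has a voice quota, it can be useful to periodically remove voices that failed review. A hedged sketch combining the list_voices and delete_voice methods shown above; purge_unusable is an illustrative helper, not an SDK method:

```python
def purge_unusable(service, prefix=None):
    """Delete voices whose status is UNDEPLOYED to free quota.
    Checks only the first page of up to 100 results; combine with a
    pagination loop for larger accounts. Returns deleted voice IDs."""
    removed = []
    for v in service.list_voices(prefix=prefix, page_index=0, page_size=100):
        if v.get("status") == "UNDEPLOYED":
            service.delete_voice(voice_id=v["voice_id"])
            removed.append(v["voice_id"])
    return removed
```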
Java SDK
Interface description
/**
* Delete a voice
*
* @param voiceId Voice to delete
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Delete voice
service.deleteVoice(voiceId);
logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
}
}
Quota and cleanup
Billing
-
Voice cloning/design: Creating, querying, updating, and deleting voices is free.
-
Speech synthesis using custom voices: Billed based on the number of text characters. For more information, see Real-time speech synthesis – CosyVoice.
Copyright and legality
You are responsible for ensuring that you own, or have the legal right to use, any voice you provide. Read the terms of service.
Error codes
If you encounter an error, see Error messages for troubleshooting.
FAQ
Features
Q: How do I adjust the speed and volume of a custom voice?
The same way you adjust a preset voice. Pass the corresponding parameters when calling the speech synthesis API. For example, use speech_rate (Python) or speechRate (Java) to adjust speed, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).
Q: How do I call the API using languages other than Java and Python (such as Go, C#, or Node.js)?
For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.
Troubleshooting
If you encounter code errors, troubleshoot using the information in Error codes.
Q: What should I do if the synthesized audio from a cloned voice contains extra content?
If you find extra characters or noise in the synthesized audio from a cloned voice, follow these steps to troubleshoot:
-
Check the source audio quality
The quality of the cloned audio directly affects the synthesis result. Ensure the source audio meets these requirements:
-
No background noise or static
-
Clear sound quality (sample rate ≥ 16 kHz recommended)
-
Audio format: WAV is better than MP3 (avoid lossy compression)
-
Mono (stereo may cause interference)
-
No silent segments or long pauses
-
Moderate speech rate (a fast rate affects feature extraction)
-
-
Check the input text
Confirm the input text does not contain special symbols or markers:
-
Avoid special symbols such as
**, "", and ''
Unless the text is intended as LaTeX formulas, preprocess it to filter out special symbols.
-
-
Verify voice cloning parameters
Ensure the language parameter (
language_hints/languageHints) is set correctly when creating the voice.
-
Try cloning again
Use a higher-quality source audio file to clone the voice again and test the result.
-
Compare with system voices
Test the same text with a preset system voice to confirm if the issue is specific to the cloned voice.
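The source-audio checks in step 1 can be partly automated. A minimal sketch using Python's standard wave module; it assumes the sample is a WAV file and covers only the mechanically checkable properties (channels, sample rate, duration):

```python
import wave

def check_sample(path, min_sec=10.0, max_sec=20.0):
    """Basic sanity checks on a voice-cloning sample WAV:
    mono, sample rate >= 16 kHz, duration in the recommended range.
    Returns a list of problems (empty means the checks passed)."""
    problems = []
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getframerate() < 16000:
            problems.append("sample rate below 16 kHz")
        if not (min_sec <= duration <= max_sec):
            problems.append(
                f"duration {duration:.1f}s outside [{min_sec}, {max_sec}]s"
            )
    return problems
```

Note this cannot detect background noise or long internal pauses; it only catches the format-level issues listed above.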
Q: What should I do if the audio generated from a cloned voice is silent?
-
Check voice status
Call the Query specific voice API to check if the voice
status is OK.
Check model version consistency
Ensure the
target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example:
-
When you clone the voice, use
cosyvoice-v3-plus. -
You must also use
cosyvoice-v3-plus for synthesis.
-
-
Verify source audio quality
Check if the source audio used for cloning meets the voice cloning input audio format requirements:
-
Audio duration: 10–20 seconds
-
Clear sound quality
-
No background noise
-
-
Check request parameters
Confirm the
voice parameter is set to the cloned voice's ID during speech synthesis.
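The model-consistency check in step 2 can be encoded as a guard before synthesis. A minimal sketch; check_model_match is an illustrative helper, and voice_details is the dict returned by the query-voice API described above:

```python
def check_model_match(voice_details: dict, synthesis_model: str) -> None:
    """Raise if the voice's target_model differs from the model passed
    to speech synthesis (a common cause of failed or silent output)."""
    target = voice_details.get("target_model")
    if target != synthesis_model:
        raise ValueError(
            f"voice was cloned for {target!r} but synthesis uses "
            f"{synthesis_model!r}; the two must match"
        )
```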
Q: What should I do if the synthesized speech from a cloned voice is unstable or incomplete?
If the synthesized speech from a cloned voice has these issues:
-
Incomplete playback; only part of the text is read
-
Unstable synthesis quality; sometimes good, sometimes bad
-
Abnormal pauses or silent segments in the audio
Possible cause: The source audio quality does not meet requirements.
Solution: Verify that the source audio meets the following requirements. Rerecord the audio following the Recording Guide.
-
Check audio continuity: Ensure the speech in the source audio is continuous. Avoid long pauses or silent segments (over 2 seconds). Obvious blank segments can cause the model to treat silence or noise as part of the voice's features, affecting the result.
-
Check the speech activity ratio: Ensure that active speech makes up more than 60% of the total audio duration. Too much background noise or non-speech segments will interfere with voice feature extraction.
-
Verify audio quality details:
-
Audio duration: 10–20 seconds (15 seconds recommended)
-
Clear pronunciation and steady speech rate
-
No background noise, echo, or static
-
Concentrated speech energy with no long silent segments
-
Q: Why can't I find the VoiceEnrollmentService class?
Your SDK version is too old. Install the latest SDK.
Q: What should I do if the voice cloning result is poor, with noise or unclear audio?
This is usually caused by low-quality input audio. Rerecord and upload the audio, strictly following the Recording Guide.
Q: Why is there a long silence at the beginning or an abnormal total duration when I synthesize very short text (like a single word) with a cloned voice?
The voice cloning model learns pauses and rhythm from the sample audio. If the original recording has a long initial silence or pause, the synthesis result retains a similar pattern. For single words or very short text, this silence ratio is amplified, making it seem like the audio is long but mostly silent. To avoid this, trim long silences from sample audio. Use complete sentences or longer text for synthesis. If you must synthesize a single word, add some context before or after it.