The CosyVoice voice cloning service uses generative large speech models to create highly similar and natural custom voices from just 10–20 seconds of audio samples, with no traditional training required. Voice design generates custom voices from text descriptions and supports multilingual, multidimensional voice feature definitions. Use cases include ad narration, character voice creation, and audiobook production. Voice cloning (or voice design) and speech synthesis are two sequential steps: first create a voice, then use it for synthesis. This document covers the parameters and interfaces for voice cloning and voice design. For speech synthesis, see Real-time Speech Synthesis – CosyVoice/Sambert.
User Guide: For model introductions and selection guidance, see Real-time Speech Synthesis – CosyVoice/Sambert.
This document covers only the CosyVoice voice cloning and voice design APIs. If you use Qwen models, see Voice Cloning (Qwen) and Voice Design (Qwen).
CosyVoice voice design uses the FunAudioGen-VD model.
Supported Models
Voice cloning:
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash
cosyvoice-v3-plus, cosyvoice-v3-flash
cosyvoice-v2
Voice design:
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash
cosyvoice-v3-plus, cosyvoice-v3-flash
cosyvoice-v3.5-plus and cosyvoice-v3.5-flash are available only in the China Mainland deployment mode (Beijing region).
In the International deployment mode (Singapore region), cosyvoice-v3-plus and cosyvoice-v3-flash do not support voice cloning or voice design. Choose another model.
Supported Languages
Voice cloning: Depends on the target speech synthesis model (specified by the target_model/targetModel parameter):
cosyvoice-v2: Chinese (Mandarin), English
cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghai, Sichuan, Tianjin, Yunnan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese
cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, Russian
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash: Chinese (Mandarin, Cantonese, Henan, Hubei, Minnan, Ningxia, Shaanxi, Shandong, Shanghai, Sichuan), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese
Voice cloning does not currently support Spanish, Italian, or other languages.
Voice design: Chinese, English.
Quick Start: From Voice Cloning to Speech Synthesis
Voice cloning and speech synthesis are two independent yet closely related steps. Follow the “create first, use later” workflow:
Prepare an audio recording file
Upload an audio file that meets the input audio format requirements for voice cloning to a publicly accessible location, such as Alibaba Cloud OSS. Make sure the URL is publicly accessible.
Create a voice
Call the Create voice API. You must specify target_model/targetModel to declare which speech synthesis model will drive the created voice.
If you already have a created voice (check using the List voices API), skip this step and go to the next one.
Use the voice for speech synthesis
After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID:
You can use this voice_id/voiceID directly as the voice parameter in the speech synthesis API or in SDKs for text-to-speech.
It supports multiple invocation modes: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.
The speech synthesis model used during synthesis must match the target_model/targetModel used when creating the voice. Otherwise, synthesis fails.
Sample code:
import os
import time
import dashscope
from dashscope.audio.tts_v2 import VoiceEnrollmentService, SpeechSynthesizer
# 1. Environment preparation
# We recommend configuring your API key as an environment variable
# API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# If you haven’t configured an environment variable, replace the next line with: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
if not dashscope.api_key:
    raise ValueError("DASHSCOPE_API_KEY environment variable not set.")
# This is the Singapore-region WebSocket URL. If you use a Beijing-region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# This is the Singapore-region HTTP URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# 2. Define cloning parameters
TARGET_MODEL = "cosyvoice-v3.5-plus"
# Give your voice a meaningful prefix
VOICE_PREFIX = "myvoice"  # Letters and numbers only. Up to 10 characters.
# Publicly accessible audio URL
AUDIO_URL = "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/cosyvoice/cosyvoice-zeroshot-sample.wav" # Example URL. Replace with your own.
# 3. Create a voice (asynchronous task)
print("--- Step 1: Creating voice enrollment ---")
service = VoiceEnrollmentService()
try:
    voice_id = service.create_voice(
        target_model=TARGET_MODEL,
        prefix=VOICE_PREFIX,
        url=AUDIO_URL
    )
    print(f"Voice enrollment submitted successfully. Request ID: {service.get_last_request_id()}")
    print(f"Generated Voice ID: {voice_id}")
except Exception as e:
    print(f"Error during voice creation: {e}")
    raise e
# 4. Poll for voice status
print("\n--- Step 2: Polling for voice status ---")
max_attempts = 30
poll_interval = 10 # seconds
for attempt in range(max_attempts):
    try:
        voice_info = service.query_voice(voice_id=voice_id)
        status = voice_info.get("status")
    except Exception as e:
        print(f"Error during status polling: {e}")
        time.sleep(poll_interval)
        continue
    print(f"Attempt {attempt + 1}/{max_attempts}: Voice status is '{status}'")
    if status == "OK":
        print("Voice is ready for synthesis.")
        break
    if status == "UNDEPLOYED":
        # A failed voice never becomes ready, so stop polling immediately.
        # Note: raising outside the try block keeps this error from being swallowed by the polling handler.
        raise RuntimeError(f"Voice processing failed with status: {status}. Check the audio quality or contact support.")
    # Keep waiting for intermediate statuses such as "DEPLOYING"
    time.sleep(poll_interval)
else:
    raise RuntimeError("Polling timed out. The voice is not ready after several attempts.")
# 5. Use the cloned voice for speech synthesis
print("\n--- Step 3: Synthesizing speech with the new voice ---")
try:
    synthesizer = SpeechSynthesizer(model=TARGET_MODEL, voice=voice_id)
    text_to_synthesize = "Congratulations! You’ve successfully cloned and synthesized your own voice!"
    # The call() method returns binary audio data
    audio_data = synthesizer.call(text_to_synthesize)
    print(f"Speech synthesis successful. Request ID: {synthesizer.get_last_request_id()}")
    # 6. Save audio file
    output_file = "my_custom_voice_output.mp3"
    with open(output_file, "wb") as f:
        f.write(audio_data)
    print(f"Audio saved to {output_file}")
except Exception as e:
    print(f"Error during speech synthesis: {e}")
Quick Start: From Voice Design to Speech Synthesis
Voice design and speech synthesis are two independent but closely linked steps. Follow the “create first, use later” workflow:
Prepare the voice description and preview text for voice design.
Voice prompt (voice_prompt): Defines the target voice’s characteristics. For more information, see “Voice design: How to write high-quality voice prompts?”.
Preview text (preview_text): The text read aloud in the preview audio (such as “Hello everyone, welcome to listen.”).
Call the Create voice API to create a custom voice and obtain the voice name and preview audio.
You must specify target_model to declare which speech synthesis model will drive the created voice.
You can listen to the preview audio to check whether it meets expectations. If it does, proceed to the next step. Otherwise, redesign the voice.
If you already have a created voice (check using the List voices API), skip this step and go to the next one.
Using Voices for Speech Synthesis
After you successfully create a voice using the Create voice API, the system returns a voice_id/voiceID:
You can use this voice_id/voiceID directly as the voice parameter in the speech synthesis API or in SDKs for text-to-speech.
It supports multiple invocation modes: non-streaming, unidirectional streaming, and bidirectional streaming synthesis.
The speech synthesis model used during synthesis must match the target_model/targetModel used when creating the voice. Otherwise, synthesis fails.
Sample code:
Create a custom voice and preview it. If you are satisfied, proceed. Otherwise, recreate it.
Python
import requests
import base64
import os

def create_voice_and_play():
    # API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If you haven’t configured an environment variable, replace the next line with: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")
    if not api_key:
        print("Error: DASHSCOPE_API_KEY environment variable not found. Please set your API key first.")
        return None, None, None
    # Prepare request data
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": "cosyvoice-v3.5-plus",
            "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
            "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
            "prefix": "announcer"
        },
        "parameters": {
            "sample_rate": 24000,
            "response_format": "wav"
        }
    }
    # This is the Singapore-region URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
    try:
        # Send request
        response = requests.post(
            url,
            headers=headers,
            json=data,
            timeout=60  # Add timeout setting
        )
        if response.status_code == 200:
            result = response.json()
            # Get voice ID
            voice_id = result["output"]["voice_id"]
            print(f"Voice ID: {voice_id}")
            # Get preview audio data
            base64_audio = result["output"]["preview_audio"]["data"]
            # Decode Base64 audio data
            audio_bytes = base64.b64decode(base64_audio)
            # Save audio file locally
            filename = f"{voice_id}_preview.wav"
            # Write audio data to local file
            with open(filename, 'wb') as f:
                f.write(audio_bytes)
            print(f"Audio saved to local file: {filename}")
            print(f"File path: {os.path.abspath(filename)}")
            return voice_id, audio_bytes, filename
        else:
            print(f"Request failed. Status code: {response.status_code}")
            print(f"Response: {response.text}")
            return None, None, None
    except requests.exceptions.RequestException as e:
        print(f"Network request error: {e}")
        return None, None, None
    except KeyError as e:
        print(f"Response format error. Missing required field: {e}")
        print(f"Response: {response.text if 'response' in locals() else 'No response'}")
        return None, None, None
    except Exception as e:
        print(f"Unknown error: {e}")
        return None, None, None

if __name__ == "__main__":
    print("Starting voice creation...")
    voice_id, audio_data, saved_filename = create_voice_and_play()
    if voice_id:
        print(f"\nSuccessfully created voice '{voice_id}'")
        print(f"Audio file saved: '{saved_filename}'")
        print(f"File size: {os.path.getsize(saved_filename)} bytes")
    else:
        print("\nVoice creation failed")
Java
You need to import the Gson dependency. If you use Maven or Gradle, add the dependency as follows:
Maven
Add the following to your pom.xml:
<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.13.1</version>
</dependency>
Gradle
Add the following to your build.gradle:
// https://mvnrepository.com/artifact/com.google.code.gson/gson
implementation("com.google.code.gson:gson:2.13.1")

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class Main {
    public static void main(String[] args) {
        Main example = new Main();
        example.createVoice();
    }

    public void createVoice() {
        // API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        // If you haven’t configured an environment variable, replace the next line with: String apiKey = "sk-xxx";
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        // Create JSON request body string
        String jsonBody = "{\n" +
                "  \"model\": \"voice-enrollment\",\n" +
                "  \"input\": {\n" +
                "    \"action\": \"create_voice\",\n" +
                "    \"target_model\": \"cosyvoice-v3.5-plus\",\n" +
                "    \"voice_prompt\": \"A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.\",\n" +
                "    \"preview_text\": \"Dear listeners, hello everyone. Welcome to the evening news.\",\n" +
                "    \"prefix\": \"announcer\"\n" +
                "  },\n" +
                "  \"parameters\": {\n" +
                "    \"sample_rate\": 24000,\n" +
                "    \"response_format\": \"wav\"\n" +
                "  }\n" +
                "}";
        HttpURLConnection connection = null;
        try {
            // This is the Singapore-region URL. If you use a Beijing-region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
            URL url = new URL("https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization");
            connection = (HttpURLConnection) url.openConnection();
            // Set request method and headers
            connection.setRequestMethod("POST");
            connection.setRequestProperty("Authorization", "Bearer " + apiKey);
            connection.setRequestProperty("Content-Type", "application/json");
            connection.setDoOutput(true);
            connection.setDoInput(true);
            // Send request body
            try (OutputStream os = connection.getOutputStream()) {
                byte[] input = jsonBody.getBytes("UTF-8");
                os.write(input, 0, input.length);
                os.flush();
            }
            // Get response
            int responseCode = connection.getResponseCode();
            if (responseCode == HttpURLConnection.HTTP_OK) {
                // Read response content
                StringBuilder response = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
                    String responseLine;
                    while ((responseLine = br.readLine()) != null) {
                        response.append(responseLine.trim());
                    }
                }
                // Parse JSON response
                JsonObject jsonResponse = JsonParser.parseString(response.toString()).getAsJsonObject();
                JsonObject outputObj = jsonResponse.getAsJsonObject("output");
                JsonObject previewAudioObj = outputObj.getAsJsonObject("preview_audio");
                // Get voice ID
                String voiceId = outputObj.get("voice_id").getAsString();
                System.out.println("Voice ID: " + voiceId);
                // Get Base64-encoded audio data
                String base64Audio = previewAudioObj.get("data").getAsString();
                // Decode Base64 audio data
                byte[] audioBytes = Base64.getDecoder().decode(base64Audio);
                // Save audio to local file
                String filename = voiceId + "_preview.wav";
                saveAudioToFile(audioBytes, filename);
                System.out.println("Audio saved to local file: " + filename);
            } else {
                // Read error response
                StringBuilder errorResponse = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(connection.getErrorStream(), "UTF-8"))) {
                    String responseLine;
                    while ((responseLine = br.readLine()) != null) {
                        errorResponse.append(responseLine.trim());
                    }
                }
                System.out.println("Request failed. Status code: " + responseCode);
                System.out.println("Error response: " + errorResponse.toString());
            }
        } catch (Exception e) {
            System.err.println("Request error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    private void saveAudioToFile(byte[] audioBytes, String filename) {
        try {
            File file = new File(filename);
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audioBytes);
            }
            System.out.println("Audio saved to: " + file.getAbsolutePath());
        } catch (IOException e) {
            System.err.println("Error saving audio file: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Use the custom voice created in the previous step for speech synthesis.
This example builds on the non-streaming call example. You must replace the voice parameter with the custom voice ID generated by voice design.
Key rule: The model used for voice design (target_model) must match the model used for speech synthesis (model). Otherwise, synthesis fails.
Python
# coding=utf-8
import os
import dashscope
from dashscope.audio.tts_v2 import *

# The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured the environment variable, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# The same model must be used for voice design and speech synthesis.
model = "cosyvoice-v3.5-plus"
# Replace the voice parameter with the custom voice generated from voice design.
voice = "your_voice"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice to the constructor.
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send the text to be synthesized to obtain the binary audio.
audio = synthesizer.call("How is the weather today?")
# When you send text for the first time, a WebSocket connection needs to be established.
# Therefore, the first-package latency includes the connection establishment time.
print('[Metric] Request ID: {}, First-package latency: {} ms'.format(
    synthesizer.get_last_request_id(), synthesizer.get_first_package_delay()))
# Save the audio to a local file.
with open('output.mp3', 'wb') as f:
    f.write(audio)
Java
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.utils.Constants;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Main {
    // Use the same model for voice design and speech synthesis
    private static String model = "cosyvoice-v3.5-plus";
    // Replace the voice parameter with the custom voice ID generated by voice design
    private static String voice = "your_voice_id";

    public static void streamAudioDataToSpeaker() {
        // Request parameters
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                // API keys differ between the Singapore and Beijing regions. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
                // If you haven’t configured an environment variable, replace the next line with: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(model) // Model
                .voice(voice) // Voice
                .build();
        // Sync mode: disable callback (second parameter is null)
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = null;
        try {
            // Block until audio returns
            audio = synthesizer.call("How is the weather today?");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Close WebSocket connection after task ends
            synthesizer.getDuplexApi().close(1000, "bye");
        }
        if (audio != null) {
            // Save audio data to local file "output.mp3"
            File file = new File("output.mp3");
            // The first packet delay includes WebSocket connection setup time
            System.out.println(
                    "[Metric] requestId: " + synthesizer.getLastRequestId()
                            + ", first packet delay (ms): " + synthesizer.getFirstPackageDelay());
            try (FileOutputStream fos = new FileOutputStream(file)) {
                fos.write(audio.array());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        // This is the Singapore-region URL. If you use a Beijing-region model, replace it with: wss://dashscope.aliyuncs.com/api-ws/v1/inference
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
API Reference
Use the same Alibaba Cloud account for all API operations.
The Java and Python DashScope SDKs do not support voice design. For voice design, use the RESTful API.
Create Voice
RESTful API
URL
China Mainland:
POST https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
International:
POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
Request Headers
Parameter
Type
Required
Description
Authorization
string
Authentication token. Format: Bearer <your_api_key>. Replace <your_api_key> with your actual API key.
Content-Type
string
Media type of data in the request body. Fixed value: application/json.
Request Body
The request body contains all parameters. Optional fields can be omitted based on your business needs.
Important: Note the difference between these parameters:
model: Voice cloning/design model. Fixed value: voice-enrollment.
target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
Voice Cloning
{
  "model": "voice-enrollment",
  "input": {
    "action": "create_voice",
    "target_model": "cosyvoice-v3.5-plus",
    "prefix": "myvoice",
    "url": "https://yourAudioFileUrl",
    "language_hints": ["zh"]
  }
}
Voice Design
{
  "model": "voice-enrollment",
  "input": {
    "action": "create_voice",
    "target_model": "cosyvoice-v3.5-plus",
    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
    "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
    "prefix": "announcer",
    "language_hints": ["zh"]
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}
Request Parameters
Parameter
Type
Default
Required
Description
model
string
-
Voice cloning/design model. Fixed value: voice-enrollment.
action
string
-
Action type. Fixed value: create_voice.
target_model
string
-
Speech synthesis model that drives the voice (see Supported Models).
Must match the speech synthesis model used later. Otherwise, synthesis fails.
url
string
-
Important: Required only for voice cloning.
Publicly accessible URL of the audio file used for voice cloning.
For audio format details, see the input audio format requirements.
For recording guidance, see Recording Guide.
voice_prompt
string
-
Important: Required only for voice design.
Voice description. Maximum length: 500 characters.
Chinese and English only.
preview_text
string
-
Important: Required only for voice design.
Text for the preview audio. Maximum length: 200 characters.
Supported languages: Chinese (zh), English (en).
prefix
string
-
Name for the voice (letters and numbers only; up to 10 characters). Use identifiers related to role or scenario.
This keyword appears in the final voice name. For example, if the keyword is "announcer", the final voice names are:
Voice cloning: cosyvoice-v3.5-plus-announcer-8aae0c0397fa408ca60c29cf******
Voice design: cosyvoice-v3.5-plus-vd-announcer-8aae0c0397fa408ca60c29cf******
language_hints
array[string]
["zh"]
You can specify the language of the sample audio used to extract target timbre features. This option is only available for the cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
Note: This parameter is an array, but only the first element is processed. Pass only one value.
Functionality:
Voice Cloning
This parameter helps the model identify the language of the sample audio (the original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores the hint and automatically detects the language from the audio content.
Valid values (by model):
cosyvoice-v3-plus:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
pt: Portuguese
th: Thai
id: Indonesian
vi: Vietnamese
For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control the dialect style in speech synthesis using the text content or the instruct parameter.
Voice Design
Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.
If used, the language must match the preview_text language.
Valid values:
zh: Chinese (default)
en: English
max_prompt_audio_length
float
10.0
No
Important: Available only for voice cloning.
Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
Valid range: [3.0, 30.0].
enable_preprocess
boolean
false
No
Important: Available only for voice cloning.
Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
Valid values:
true: Enable
false: Disable
sample_rate
int
24000
Important: Available only for voice design.
Sample rate (Hz) of the preview audio generated by voice design.
Valid values:
16000
24000
48000
response_format
string
wav
Important: Available only for voice design.
Format of the preview audio generated by voice design.
Valid values:
pcm
wav
mp3
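The request parameters above carry several client-checkable constraints: the prefix allows letters and numbers only, up to 10 characters, and language_hints is an array of which only the first element is processed. As a hedged sketch (these helper functions are hypothetical, not part of the DashScope SDK), you could validate both before sending the request:

```python
import re

# Hypothetical client-side helpers mirroring the documented constraints.
PREFIX_PATTERN = re.compile(r"^[A-Za-z0-9]{1,10}$")  # letters and numbers only, up to 10 chars

def is_valid_prefix(prefix: str) -> bool:
    """Check the documented prefix rule for the RESTful API."""
    return bool(PREFIX_PATTERN.fullmatch(prefix))

def normalize_language_hints(hints):
    """The service reads only the first element, so keep a single-element list."""
    if not hints:
        return ["zh"]  # documented default
    return [hints[0]]
```

Rejecting a bad prefix locally avoids a round trip that would otherwise fail server-side, and normalizing language_hints makes the "only the first element is processed" behavior explicit in your own code.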
Response Parameters
Key parameters:
Parameter
Type
Description
voice_id
string
Voice ID. Use it directly as the voice parameter in the speech synthesis API.
data
string
Preview audio data generated by voice design, returned as a Base64-encoded string.
sample_rate
int
Sample rate (Hz) of the preview audio generated by voice design. Matches the sample rate used when creating the voice. Default: 24000 Hz if unspecified.
response_format
string
Format of the preview audio generated by voice design. Matches the format used when creating the voice. Default: wav if unspecified.
target_model
string
Speech synthesis model that drives the voice (see Supported Models).
Must match the speech synthesis model used later. Otherwise, synthesis fails.
request_id
string
Request ID.
count
integer
Number of "create voice" operations in this request.
Always 1 for voice creation.
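Tying the response fields above together, here is a minimal sketch (not part of the official SDK) that extracts voice_id from a parsed create-voice response body and, for voice design, decodes the Base64 preview audio. The dict shape follows the response tables above; the HTTP call itself is omitted:

```python
import base64

def parse_create_voice_response(body: dict):
    """Return (voice_id, preview_audio_bytes); the audio is None for voice cloning."""
    output = body["output"]
    voice_id = output["voice_id"]
    preview = output.get("preview_audio")  # present only for voice design
    audio_bytes = base64.b64decode(preview["data"]) if preview else None
    return voice_id, audio_bytes
```

For voice design, write the returned bytes to a file with the extension matching the response_format you requested (e.g., .wav) to play the preview.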
Sample Code
Important: Note the difference between these parameters:
model: Voice cloning/design model. Fixed value: voice-enrollment.
target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
Voice Cloning
If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.
# ======= Important Notice =======
# The following is the URL for the Singapore region. If you use a Beijing-region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "create_voice",
      "target_model": "cosyvoice-v3.5-plus",
      "prefix": "myvoice",
      "url": "https://yourAudioFileUrl"
    }
  }'
Voice Design
If you haven’t configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.
# ======= Important Notice =======
# The following is the URL for the Singapore region. If you use a Beijing-region model, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "create_voice",
      "target_model": "cosyvoice-v3.5-plus",
      "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
      "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
      "prefix": "announcer"
    },
    "parameters": {
      "sample_rate": 24000,
      "response_format": "wav"
    }
  }'
Python SDK
Interface Description
Before using this interface, install the latest DashScope SDK.
def create_voice(self, target_model: str, prefix: str, url: str, language_hints: List[str] = None) -> str:
'''
Create a new custom voice.
param: target_model Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
param: prefix Name for the voice (letters, numbers, and underscores only; up to 10 characters). Use identifiers related to role or scenario. This keyword appears in the cloned voice name. Format: model-name-prefix-unique-id, e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx.
param: url Publicly accessible URL of the audio file used for voice cloning.
param: language_hints Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
Helps the model identify the language of the reference audio (original sample), improving voice feature extraction and cloning quality.
If the language hint does not match the actual audio (e.g., "en" for Chinese audio), the system ignores the hint and detects the language automatically.
Valid values (by model):
cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
This parameter is an array, but only the first element is processed. Pass only one value.
param: max_prompt_audio_length Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
Valid range: [3.0, 30.0].
param: enable_preprocess Enable audio preprocessing. When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
return: voice_id Voice ID. Use directly as the voice parameter in the speech synthesis API.
'''
target_model: Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
language_hints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
Functionality:
Voice Cloning
This parameter helps the model identify the language of the sample audio (original reference audio), so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying
enfor Chinese audio), the system ignores this hint and automatically detects the language based on the audio content.Valid values (by model):
cosyvoice-v3-plus:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
pt: Portuguese
th: Thai
id: Indonesian
vi: Vietnamese
For Chinese dialects (e.g., Northeastern, Cantonese), set language_hints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.
Voice Design
Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.
If used, the language must match the preview_text language.
Valid values:
zh: Chinese (default)
en: English
Request Example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Avoid frequent calls. Each call creates a new voice. After reaching your quota limit, you cannot create more.
voice_id = service.create_voice(
target_model='cosyvoice-v3.5-plus',
prefix='myvoice',
url='https://your-audio-file-url'
# language_hints=['zh'],
# max_prompt_audio_length=10.0,
# enable_preprocess=False
)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice ID: {voice_id}")
Java SDK
Interface Description
Before using this interface, install the latest DashScope SDK.
/**
* Create a new custom voice.
*
* @param targetModel Speech synthesis model that drives the voice. Must match the speech synthesis model used later. Otherwise, synthesis fails.
* @param prefix Name for the voice (letters, numbers, and underscores only; up to 10 characters). Use identifiers related to role or scenario. This keyword appears in the cloned voice name. Format: model-name-prefix-unique-id, e.g., cosyvoice-v3-plus-myvoice-xxxxxxxx.
* @param url Publicly accessible URL of the audio file used for voice cloning.
* @param customParam Custom parameters. Specify languageHints and maxPromptAudioLength here.
* languageHints: Language of the reference audio used to extract voice features. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
* Helps the model identify the language of the reference audio (original sample), improving voice feature extraction and cloning quality.
* If the language hint does not match the actual audio (e.g., "en" for Chinese audio), the system ignores the hint and detects the language automatically.
* Valid values (by model):
* cosyvoice-v3-plus: zh (default), en, fr, de, ja, ko, ru.
* cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash: zh (default), en, fr, de, ja, ko, ru, pt, th, id, vi.
* Only the first element is processed. Pass only one value.
* maxPromptAudioLength: Maximum duration (in seconds) of the reference audio used for voice cloning after preprocessing. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
* Valid range: [3.0, 30.0].
* enable_preprocess: Enable audio preprocessing. Set it through the generic parameter method, for example .parameter("enable_preprocess", false). When enabled, the system applies noise reduction, audio enhancement, and volume normalization before cloning. Applies only to cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash models.
* @return Voice New voice. Call Voice.getVoiceId() to get the voice ID. Use directly as the voice parameter in the speech synthesis API.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice createVoice(String targetModel, String prefix, String url, VoiceEnrollmentParam customParam) throws NoApiKeyException, InputRequiredException
Parameter details:
targetModel: Speech synthesis model that drives the voice. Must match the speech synthesis model used later; otherwise, synthesis fails.
languageHints: Language of the reference audio used to extract voice features. Applies only to the cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash, and cosyvoice-v3-plus models.
Voice Cloning
This parameter helps the model identify the language of the sample audio (original reference audio) so that it can more accurately extract voice characteristics and improve voice cloning quality. If the specified language hint does not match the actual audio language (for example, specifying en for Chinese audio), the system ignores the hint and automatically detects the language from the audio content.
Valid values (by model):
cosyvoice-v3-plus:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-flash:
zh: Chinese (default)
en: English
fr: French
de: German
ja: Japanese
ko: Korean
ru: Russian
pt: Portuguese
th: Thai
id: Indonesian
vi: Vietnamese
For Chinese dialects (e.g., Northeastern, Cantonese), set languageHints to zh. Control dialect style in speech synthesis using text content or the instruct parameter.
Voice Design
Specifies the language preference for the generated voice. Affects language features and pronunciation. Choose the language code matching your use case.
If used, the language must match the preview_text language.
Valid values:
zh: Chinese (default)
en: English
Request Example
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentParam;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Collections;
public class Main {
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args) {
String apiKey = System.getenv("DASHSCOPE_API_KEY");
String targetModel = "cosyvoice-v3.5-plus";
String prefix = "myvoice";
String fileUrl = "https://your-audio-file-url";
String cloneModelName = "voice-enrollment";
try {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice myVoice = service.createVoice(
targetModel,
prefix,
fileUrl,
VoiceEnrollmentParam.builder()
.model(cloneModelName)
.languageHints(Collections.singletonList("zh"))
// .maxPromptAudioLength(10.0f)
// .parameter("enable_preprocess", false)
.build());
logger.info("Voice creation submitted. Request ID: {}", service.getLastRequestId());
logger.info("Generated Voice ID: {}", myVoice.getVoiceId());
} catch (Exception e) {
logger.error("Failed to create voice", e);
}
}
}
List Voices
Query the list of created voices with pagination.
RESTful API
URL and Request Headers are the same as the Create Voice API
Request Body
The request body contains all parameters. Optional fields can be omitted based on your business needs.
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "list_voice",
    "prefix": "announcer",
    "page_size": 10,
    "page_index": 0
  }
}
Request Parameters
Parameter | Type | Default | Required | Description
model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
action | string | - | Yes | Action type. Fixed value: list_voice.
prefix | string | - | No | Same prefix used when creating the voice. Letters and numbers only; up to 10 characters.
page_index | integer | 0 | No | Page index. Must be greater than or equal to 0.
page_size | integer | 10 | No | Number of items per page. Valid range: [0, 1000].
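Because the RESTful request body is plain JSON, any language can assemble it directly. A minimal Python sketch using only the standard library, with the prefix and paging constraints from the table above enforced before sending (the function name is illustrative):

```python
import json
import re

def build_list_voice_body(prefix=None, page_index=0, page_size=10):
    """Assemble the list_voice request body for the voice-enrollment model."""
    # prefix: letters and numbers only, up to 10 characters (per the table above)
    if prefix is not None and not re.fullmatch(r"[A-Za-z0-9]{1,10}", prefix):
        raise ValueError("prefix must be letters/numbers only, up to 10 characters")
    if page_index < 0:
        raise ValueError("page_index must be >= 0")
    if not 0 <= page_size <= 1000:
        raise ValueError("page_size must be in [0, 1000]")
    payload = {
        "model": "voice-enrollment",  # fixed value, do not change
        "input": {
            "action": "list_voice",
            "page_index": page_index,
            "page_size": page_size,
        },
    }
    if prefix is not None:
        payload["input"]["prefix"] = prefix
    return json.dumps(payload)
```

The returned string can be posted as-is to the customization endpoint with the usual Authorization and Content-Type headers.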
Response Parameters
Key parameters:
Parameter | Type | Description
voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API.
target_model | string | Speech synthesis model that drives the voice (see Supported Models). Must match the speech synthesis model used later; otherwise, synthesis fails.
gmt_create | string | Time the voice was created.
gmt_modified | string | Time the voice was last modified.
voice_prompt | string | Voice description.
preview_text | string | Preview text.
request_id | string | Request ID.
status | string | Voice status: DEPLOYING (under review), OK (approved and ready to use), UNDEPLOYED (rejected and unavailable).
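Because only voices whose status is OK can be used for synthesis, a client will typically filter the list response before picking a voice. A minimal sketch over response rows shaped like the ones above (the helper name is illustrative):

```python
def usable_voice_ids(voices):
    """Return the IDs of voices that passed review and are ready for synthesis."""
    return [v["voice_id"] for v in voices if v.get("status") == "OK"]

voices = [
    {"voice_id": "cosyvoice-v3-plus-myvoice-aaa", "status": "OK"},
    {"voice_id": "cosyvoice-v3-plus-myvoice-bbb", "status": "DEPLOYING"},
    {"voice_id": "cosyvoice-v3-plus-myvoice-ccc", "status": "UNDEPLOYED"},
]
print(usable_voice_ids(voices))  # only the approved voice remains
```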
Sample Code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If you haven't configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.
# ======= Important Notice =======
# The following URL is for the Singapore region. If you use a Beijing region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "list_voice",
      "prefix": "announcer",
      "page_size": 10,
      "page_index": 0
    }
  }'
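Since the Beijing and Singapore deployments use different endpoints (and different API keys), it can help to keep the URL choice in one place. A small sketch using the two endpoints named in the sample comments above; the region keys are illustrative:

```python
# Endpoints taken from the sample comments; region names are illustrative.
ENDPOINTS = {
    "beijing": "https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization",
    "singapore": "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
}

def customization_url(region: str) -> str:
    """Pick the voice-enrollment endpoint for the deployment region."""
    try:
        return ENDPOINTS[region.lower()]
    except KeyError:
        raise ValueError(f"unknown region: {region!r}") from None
```

Remember that the API key must come from the same region as the endpoint.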
Python SDK
Interface Description
def list_voices(self, prefix=None, page_index: int = 0, page_size: int = 10) -> List[dict]:
'''
Query all created voices
param: prefix Custom prefix for the voice (letters and numbers only; up to 10 characters).
param: page_index Page index to query
param: page_size Page size to query
return: List[dict] Voice list containing ID, creation time, modification time, and status for each voice. Format: [{'gmt_create': '2025-10-09 14:51:01', 'gmt_modified': '2025-10-09 14:51:07', 'status': 'OK', 'voice_id': 'cosyvoice-v3-myvoice-xxx'}]
Voice statuses:
DEPLOYING: Under review
OK: Approved and ready to use
UNDEPLOYED: Rejected and unavailable
'''
Request Example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
# Filter by prefix, or set to None to query all
voices = service.list_voices(prefix='myvoice', page_index=0, page_size=10)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Found voices: {voices}")
Response Example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response Parameters
Parameter | Type | Description
voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API.
target_model | string | Speech synthesis model that drives the voice (see Supported Models). Must match the speech synthesis model used later; otherwise, synthesis fails.
gmt_create | string | Time the voice was created.
gmt_modified | string | Time the voice was last modified.
voice_prompt | string | Voice description.
preview_text | string | Preview text.
request_id | string | Request ID.
status | string | Voice status: DEPLOYING (under review), OK (approved and ready to use), UNDEPLOYED (rejected and unavailable).
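Listing is paginated, so collecting every voice means requesting pages until one comes back short. A sketch with a stand-in fetch function; in practice fetch_page would wrap service.list_voices(prefix, page_index, page_size):

```python
def iter_all_voices(fetch_page, page_size=10):
    """Yield every voice by walking pages until one returns fewer than page_size items."""
    page_index = 0
    while True:
        page = fetch_page(page_index=page_index, page_size=page_size)
        yield from page
        if len(page) < page_size:  # a short page means we reached the end
            return
        page_index += 1
```

Stopping on a short page avoids one extra empty request in the common case while still terminating when a page boundary falls exactly at the end.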
Java SDK
Interface Description
// Voice statuses:
// DEPLOYING: Under review
// OK: Approved and ready to use
// UNDEPLOYED: Rejected and unavailable
/**
* Query all created voices. Default page index is 0, default page size is 10.
*
* @param prefix Custom prefix for the voice (letters and numbers only; up to 10 characters). Can be null.
* @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice[] listVoice(String prefix) throws NoApiKeyException, InputRequiredException
/**
* Query all created voices.
*
* @param prefix Custom prefix for the voice (letters and numbers only; up to 10 characters).
* @param pageIndex Page index to query.
* @param pageSize Page size to query.
* @return Voice[] Array of Voice objects. Voice encapsulates the voice's ID, creation time, modification time, and status.
* @throws NoApiKeyException If the API key is empty.
* @throws InputRequiredException If a required parameter is empty.
*/
public Voice[] listVoice(String prefix, int pageIndex, int pageSize) throws NoApiKeyException, InputRequiredException
Request Example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String prefix = "myvoice"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Query voices
Voice[] voices = service.listVoice(prefix, 0, 10);
logger.info("List successful. Request ID: {}", service.getLastRequestId());
logger.info("Voices Details: {}", new Gson().toJson(voices));
}
}
Response Example
[
{
"gmt_create": "2024-09-13 11:29:41",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
},
{
"gmt_create": "2024-09-13 13:22:38",
"voice_id": "yourVoiceId",
"gmt_modified": "2024-09-13 13:22:38",
"status": "OK"
}
]
Response Parameters
Parameter | Type | Description
voice_id | string | Voice ID. Use directly as the voice parameter in the speech synthesis API.
target_model | string | Speech synthesis model that drives the voice (see Supported Models). Must match the speech synthesis model used later; otherwise, synthesis fails.
gmt_create | string | Time the voice was created.
gmt_modified | string | Time the voice was last modified.
voice_prompt | string | Voice description.
preview_text | string | Preview text.
request_id | string | Request ID.
status | string | Voice status: DEPLOYING (under review), OK (approved and ready to use), UNDEPLOYED (rejected and unavailable).
Query Specific Voice
You can retrieve detailed information about a specific voice by its name.
RESTful API
URL and Request Headers are the same as the Create Voice API
Request Body
The request body contains all parameters. Optional fields can be omitted based on your business needs.
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "query_voice",
    "voice_id": "yourVoiceID"
  }
}
Request Parameters
Parameter | Type | Default | Required | Description
model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
action | string | - | Yes | Action type. Fixed value: query_voice.
voice_id | string | - | Yes | ID of the voice to query.
Response Parameters
For parameter descriptions, see the List Voices API.
Sample Code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If you haven't configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.
# ======= Important Notice =======
# The following URL is for the Singapore region. If you use a Beijing region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "query_voice",
      "voice_id": "yourVoiceID"
    }
  }'
Python SDK
Interface Description
def query_voice(self, voice_id: str) -> List[str]:
'''
Query details for a specific voice
param: voice_id ID of the voice to query
return: List[str] Voice details, including status, creation time, audio link, etc.
'''
Request Example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
voice_id = 'cosyvoice-v3-plus-myvoice-xxxxxxxx'
voice_details = service.query_voice(voice_id=voice_id)
print(f"Request ID: {service.get_last_request_id()}")
print(f"Voice Details: {voice_details}")
Response Example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3.5-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
Response Parameters
For parameter descriptions, see the List Voices API.
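A newly created or updated voice starts in the DEPLOYING status and only becomes usable once review finishes, so callers often poll this query interface before synthesizing. A sketch with a stand-in query function; in practice query would wrap service.query_voice(voice_id):

```python
import time

def wait_until_ok(query, voice_id, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll until the voice status is OK; fail fast on UNDEPLOYED or timeout."""
    waited = 0.0
    while True:
        status = query(voice_id).get("status")
        if status == "OK":
            return True
        if status == "UNDEPLOYED":
            raise RuntimeError(f"voice {voice_id} was rejected by review")
        if waited >= timeout:
            raise TimeoutError(f"voice {voice_id} still {status} after {timeout}s")
        sleep(interval)
        waited += interval
```

The sleep parameter is injectable only to make the loop easy to test; the timeout and interval values are illustrative defaults.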
Java SDK
Interface Description
/**
* Query details for a specific voice
*
* @param voiceId ID of the voice to query
* @return Voice Voice details, including status, creation time, audio link, etc.
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public Voice queryVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request Example
You need to import the third-party library com.google.gson.Gson.
import com.alibaba.dashscope.audio.ttsv2.enrollment.Voice;
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.Gson;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
Voice voice = service.queryVoice(voiceId);
logger.info("Query successful. Request ID: {}", service.getLastRequestId());
logger.info("Voice Details: {}", new Gson().toJson(voice));
}
}
Response Example
{
"gmt_create": "2024-09-13 11:29:41",
"resource_link": "https://yourAudioFileUrl",
"target_model": "cosyvoice-v3.5-plus",
"gmt_modified": "2024-09-13 11:29:41",
"status": "OK"
}
Response Parameters
For parameter descriptions, see the List Voices API.
Update Voice (Voice Cloning Only)
Update an existing voice with a new audio file.
This feature is not supported for voice design.
RESTful API
URL and Request Headers are the same as the Create Voice API
Request Body
The request body contains all parameters. Optional fields can be omitted based on your business needs:
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "update_voice",
    "voice_id": "yourVoiceId",
    "url": "https://yourAudioFileUrl"
  }
}
Request Parameters
Parameter | Type | Default | Required | Description
model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
action | string | - | Yes | Action type. Fixed value: update_voice.
voice_id | string | - | Yes | ID of the voice to update.
url | string | - | Yes | URL of the audio file used to update the voice. The URL must be publicly accessible.
Response Parameters
Sample Code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If you haven't configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.
# ======= Important Notice =======
# The following URL is for the Singapore region. If you use a Beijing region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "update_voice",
      "voice_id": "yourVoiceId",
      "url": "https://yourAudioFileUrl"
    }
  }'
Python SDK
Interface Description
def update_voice(self, voice_id: str, url: str) -> None:
'''
Update a voice
param: voice_id Voice ID
param: url URL of the audio file for voice cloning
'''
Request Example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.update_voice(
voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx',
url='https://your-new-audio-file-url'
)
print(f"Update submitted. Request ID: {service.get_last_request_id()}")
Java SDK
Interface Description
/**
* Update a voice
*
* @param voiceId Voice to update
* @param url URL of the audio file for voice cloning
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public void updateVoice(String voiceId, String url)
throws NoApiKeyException, InputRequiredException
Request Example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String fileUrl = "https://your-audio-file-url"; // Replace with your actual value
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Update voice
service.updateVoice(voiceId, fileUrl);
logger.info("Update submitted. Request ID: {}", service.getLastRequestId());
}
}
Delete Voice
Delete a voice you no longer need to free up your quota. This action is irreversible.
RESTful API
URL and Request Headers are the same as the Create Voice API
Request Body
The request body contains all parameters. Optional fields can be omitted based on your business needs:
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
{
  "model": "voice-enrollment",
  "input": {
    "action": "delete_voice",
    "voice_id": "yourVoiceID"
  }
}
Request Parameters
Parameter | Type | Default | Required | Description
model | string | - | Yes | Voice cloning/design model. Fixed value: voice-enrollment.
action | string | - | Yes | Action type. Fixed value: delete_voice.
voice_id | string | - | Yes | ID of the voice to delete.
Response Parameters
Sample Code
Important
model: Voice cloning/design model. Fixed value: voice-enrollment. Do not change.
If you haven't configured your API key in an environment variable, replace $DASHSCOPE_API_KEY in the sample with your actual API key.
# ======= Important Notice =======
# The following URL is for the Singapore region. If you use a Beijing region model, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# The API keys for the Singapore and Beijing regions differ. Get your API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "delete_voice",
      "voice_id": "yourVoiceID"
    }
  }'
Python SDK
Interface Description
def delete_voice(self, voice_id: str) -> None:
'''
Delete a voice
param: voice_id Voice to delete
'''
Request Example
from dashscope.audio.tts_v2 import VoiceEnrollmentService
service = VoiceEnrollmentService()
service.delete_voice(voice_id='cosyvoice-v3-plus-myvoice-xxxxxxxx')
print(f"Deletion submitted. Request ID: {service.get_last_request_id()}")
Java SDK
Interface Description
/**
* Delete a voice
*
* @param voiceId Voice to delete
* @throws NoApiKeyException If the API key is empty
* @throws InputRequiredException If a required parameter is empty
*/
public void deleteVoice(String voiceId) throws NoApiKeyException, InputRequiredException
Request Example
import com.alibaba.dashscope.audio.ttsv2.enrollment.VoiceEnrollmentService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Main {
public static String apiKey = System.getenv("DASHSCOPE_API_KEY"); // If you haven't set the environment variable, replace this with your API key
private static String voiceId = "cosyvoice-v3-plus-myvoice-xxx"; // Replace with your actual value
private static final Logger logger = LoggerFactory.getLogger(Main.class);
public static void main(String[] args)
throws NoApiKeyException, InputRequiredException {
VoiceEnrollmentService service = new VoiceEnrollmentService(apiKey);
// Delete voice
service.deleteVoice(voiceId);
logger.info("Deletion submitted. Request ID: {}", service.getLastRequestId());
}
}
Voice Quotas and Automatic Cleanup Rules
Billing
Voice cloning/design: Creating, querying, updating, and deleting voices is free.
Speech synthesis using custom voices: Billed based on the number of text characters. For more information, see Real-time Speech Synthesis – CosyVoice/Sambert.
Copyright and Legality
You are responsible for the ownership and legal right to use any voice you provide. Read the Terms of Service.
Error Codes
If you encounter an error, see Error Messages for troubleshooting.
FAQ
Features
Q: How do I adjust the speed and volume of a custom voice?
Adjust them the same way you adjust a preset voice. Pass the corresponding parameters when calling the speech synthesis API. For example, use speech_rate (Python) or speechRate (Java) to adjust speed, and volume to adjust volume. For more information, see the speech synthesis API documentation (Java SDK/Python SDK/WebSocket API).
Q: How do I call the API using languages other than Java and Python (such as Go, C#, or Node.js)?
For voice management, use the RESTful API provided in this document. For speech synthesis, use the WebSocket API and pass the cloned voice_id as the voice parameter.
Troubleshooting
If you encounter code errors, troubleshoot using the information in Error Codes.
Q: What should I do if the synthesized audio from a cloned voice contains extra content?
If you find extra characters or noise in the synthesized audio from a cloned voice, follow these steps to troubleshoot:
Check the source audio quality
The quality of the cloned audio directly affects the synthesis result. Ensure the source audio meets these requirements:
No background noise or static
Clear sound quality (sample rate ≥ 16 kHz recommended)
Audio format: WAV is better than MP3 (avoid lossy compression)
Mono (stereo may cause interference)
No silent segments or long pauses
Moderate speech rate (a fast rate affects feature extraction)
Check the input text
Confirm the input text does not contain special symbols or markers:
Avoid special symbols such as **, "", and ''.
Unless they are intended for LaTeX formulas, preprocess the text to filter out special symbols.
Verify voice cloning parameters
Ensure the language parameter (language_hints/languageHints) was set correctly when cloning the voice.
Try cloning again
Use a higher-quality source audio file to clone the voice again and test the result.
Compare with system voices
Test the same text with a preset system voice to confirm if the issue is specific to the cloned voice.
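The symbol filtering described above can be a simple pre-pass over the input text. A minimal sketch; the exact symbol set here is an assumption, so extend it for your own content:

```python
import re

def strip_markup_symbols(text: str) -> str:
    """Remove symbols such as **, paired quotes, and curly quotes that can
    leak into synthesized audio as extra content."""
    cleaned = re.sub(r"\*\*|''|\"\"|[“”‘’]", "", text)
    # collapse any double spaces left behind by the removals
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Note that single apostrophes in contractions are deliberately left untouched, since removing them would change pronunciation.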
Q: How do I troubleshoot if the audio generated from a cloned voice is silent?
Check voice status
Call the Query Specific Voice API to check whether the voice status is OK.
Check for model version consistency
Ensure the target_model parameter used for voice cloning exactly matches the model parameter used for speech synthesis. For example, if you cloned the voice with cosyvoice-v3-plus, you must also use cosyvoice-v3-plus for synthesis.
Verify source audio quality
Check if the source audio used for cloning meets the voice cloning input audio format requirements:
Audio duration: 10–20 seconds
Clear sound quality
No background noise
Check request parameters
Confirm the voice parameter is set to the cloned voice's ID during speech synthesis.
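Because voice IDs follow the model-name-prefix-unique-id format, a quick local check can catch a model mismatch before any synthesis request is sent. A sketch (note this prefix test is approximate for model names that are prefixes of one another, such as cosyvoice-v3 vs. cosyvoice-v3-plus):

```python
def voice_matches_model(voice_id: str, synthesis_model: str) -> bool:
    """Voice IDs are 'model-prefix-id', so the ID should start with the
    synthesis model name followed by a hyphen."""
    return voice_id.startswith(synthesis_model + "-")
```

If this returns False, the target_model used at cloning time does not match the synthesis model, which is a common cause of silent or failed synthesis.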
Q: What should I do if the synthesized speech from a cloned voice is unstable or incomplete?
If the synthesized speech from a cloned voice has these issues:
Incomplete playback; only part of the text is read
Unstable synthesis quality; sometimes good, sometimes bad
Abnormal pauses or silent segments in the audio
Possible cause: The source audio quality does not meet requirements.
Solution: Check if the source audio meets the following requirements. Rerecord the audio following the Recording Guide.
Check audio continuity: Ensure the speech in the source audio is continuous. Avoid long pauses or silent segments (over 2 seconds). Obvious blank segments can cause the model to treat silence or noise as part of the voice's features, affecting the result.
Check the speech activity ratio: Ensure that active speech makes up more than 60% of the total audio duration. Too much background noise or non-speech segments will interfere with voice feature extraction.
Verify audio quality details:
Audio duration: 10–20 seconds (15 seconds recommended)
Clear pronunciation and steady speech rate
No background noise, echo, or static
Concentrated speech energy with no long silent segments
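The speech-activity check above can be roughly approximated with a frame-energy threshold over raw PCM samples. A toy sketch on a plain list of samples; a real check should use a proper voice-activity detector, and the frame size and threshold here are illustrative:

```python
def speech_activity_ratio(samples, frame_size=160, threshold=500.0):
    """Fraction of frames whose mean absolute amplitude exceeds the threshold."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    frames = [f for f in frames if f]  # drop a possible empty tail
    if not frames:
        return 0.0
    active = sum(1 for f in frames if sum(abs(s) for s in f) / len(f) > threshold)
    return active / len(frames)
```

Per the checklist above, a ratio below roughly 0.6 suggests the recording has too much silence or noise and should be redone.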
Q: Why can't I find the VoiceEnrollmentService class?
Your SDK version is too old. Install the latest SDK.
Q: What should I do if the voice cloning result is poor, with noise or unclear audio?
This is usually because of low-quality input audio. Rerecord and upload the audio, strictly following the Recording Guide.
Q: Why is there a long silence at the beginning or an abnormal total duration when I synthesize very short text (like a single word) with a cloned voice?
The voice cloning model learns the pauses and rhythm from the sample audio. If the original recording has a long initial silence or pause, the synthesis result might retain a similar pattern. For single words or very short text, this silence ratio is amplified, making it seem like "the audio is long but mostly silent." Avoid long silences when recording sample audio. Use complete sentences or longer text for synthesis. If you must synthesize a single word, add some context before or after it, or use a homophone to avoid extreme cases.