Voice Cloning creates a highly realistic custom voice from a 10- to 20-second audio sample, with no model training required.
Overview
Voice Cloning creates a custom voice from an audio sample as short as 10 to 20 seconds — no model training required. Use it to build personalized voice assistants, branded audio broadcasts, and custom narration.
Model Studio supports Voice Cloning through the following model families:
- CosyVoice: Create voices through the DashScope SDK or HTTP API. Supports real-time speech synthesis. Available in the China (Beijing) and Singapore regions.
- Qwen-TTS: Create voices through the HTTP API. Supports both real-time and non-real-time speech synthesis. Available in the China (Beijing) and Singapore regions.
For a detailed comparison and guidance on choosing a model family, see Speech synthesis.
Prerequisites
- If you call the API through the DashScope SDK, install the latest SDK.
- Prepare an audio file that meets the Audio requirements.
Quick start
Voice cloning involves three steps:
- Prepare the audio: Prepare an audio file that meets the Audio requirements.
- Create a voice: Call the Voice Cloning API to upload the audio. The system extracts the voice characteristics and generates a custom voice. Specify the target speech synthesis model in the target_model parameter when creating the voice.
- Synthesize speech with the voice: Call the speech synthesis API and pass the voice ID returned during voice creation. The model used for synthesis must match the target_model value set when you created the voice; otherwise, synthesis fails.
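The audio field in the create-voice request is a base64 data URI. As a minimal sketch, independent of any SDK (the MIME type must match your file format), building one looks like:

```python
import base64
import pathlib

def to_data_uri(file_path: str, mime_type: str = "audio/mpeg") -> str:
    """Encode a local audio file as the base64 data URI expected by the create-voice request."""
    audio_bytes = pathlib.Path(file_path).read_bytes()
    encoded = base64.b64encode(audio_bytes).decode()
    return f"data:{mime_type};base64,{encoded}"
```

For Qwen-TTS, pass the returned string as the audio data field of the request; CosyVoice instead references the audio by a publicly accessible URL.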
Qwen-TTS voice cloning
The following example shows the complete voice cloning workflow: upload an audio file to create a voice, then use that voice for speech synthesis. The example uses the local audio file voice.mp3. Replace voice.mp3 with the actual path to your audio file before running the code.
The target_model set during voice creation must exactly match the model used for speech synthesis. Otherwise, synthesis fails.
Python
import os
import requests
import base64
import pathlib

import dashscope

# ======= Constants =======
DEFAULT_TARGET_MODEL = "qwen3-tts-vc-2026-01-22"  # Use the same model for both voice cloning and speech synthesis
DEFAULT_PREFERRED_NAME = "guanyu"
DEFAULT_AUDIO_MIME_TYPE = "audio/mpeg"
VOICE_FILE_PATH = "voice.mp3"  # Relative path to the local audio file used for voice cloning


def create_voice(file_path: str,
                 target_model: str = DEFAULT_TARGET_MODEL,
                 preferred_name: str = DEFAULT_PREFERRED_NAME,
                 audio_mime_type: str = DEFAULT_AUDIO_MIME_TYPE) -> str:
    """Create a custom voice and return the voice parameter."""
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")

    file_path_obj = pathlib.Path(file_path)
    if not file_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
    data_uri = f"data:{audio_mime_type};base64,{base64_str}"

    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
    payload = {
        "model": "qwen-voice-enrollment",  # Do not modify this value
        "input": {
            "action": "create",
            "target_model": target_model,
            "preferred_name": preferred_name,
            "audio": {"data": data_uri}
        }
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    resp = requests.post(url, json=payload, headers=headers)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to create voice: {resp.status_code}, {resp.text}")
    try:
        return resp.json()["output"]["voice"]
    except (KeyError, ValueError) as e:
        raise RuntimeError(f"Failed to parse voice response: {e}")


if __name__ == '__main__':
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
    dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
    text = "How is the weather today?"
    response = dashscope.MultiModalConversation.call(
        model=DEFAULT_TARGET_MODEL,
        # The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text,
        voice=create_voice(VOICE_FILE_PATH),  # Pass the custom voice generated by cloning
        stream=False
    )
    print(response)
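The synthesized audio is typically returned by reference rather than inline; in Qwen-TTS responses the URL usually sits at output.audio.url, but that path is an assumption here, so print the response and adjust if your model or snapshot differs. A sketch for saving the result:

```python
import requests

def extract_audio_url(response: dict) -> str:
    """Return the synthesized audio URL from a response dict.

    The output.audio.url path is an assumption based on typical Qwen-TTS
    responses; inspect the printed response if this raises KeyError.
    """
    return response["output"]["audio"]["url"]

def save_audio(response: dict, out_path: str = "output.wav") -> None:
    """Download the synthesized audio to a local file (needs network access)."""
    audio = requests.get(extract_audio_url(response), timeout=30)
    audio.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(audio.content)
```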
cURL
Voice cloning with cURL is a two-step process: create a voice, then use it to synthesize speech.
Step 1: Create a voice
# Replace voice.mp3 with the actual path to your audio file
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# Note: API Keys differ between the Singapore and Beijing regions. To obtain an API Key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# The -i flag applies to macOS. On Linux, use: AUDIO_BASE64=$(base64 -w 0 voice.mp3) so the output is not line-wrapped, which would corrupt the data URI
AUDIO_BASE64=$(base64 -i voice.mp3)
curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-voice-enrollment",
    "input": {
      "action": "create",
      "target_model": "qwen3-tts-vc-2026-01-22",
      "preferred_name": "guanyu",
      "audio": {"data": "data:audio/mpeg;base64,'$AUDIO_BASE64'"}
    }
  }'
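The create call returns JSON whose output.voice field is the voice ID you need in step 2 (the same path the Python example parses). A small sketch for capturing it without extra tooling; the sample response below is illustrative:

```shell
# Substitute the actual output of the curl command above for this sample response.
RESPONSE='{"output":{"voice":"guanyu-xxxx"},"request_id":"example"}'
# Pull output.voice out of the JSON with the Python standard library.
VOICE_ID=$(printf '%s' "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["output"]["voice"])')
echo "$VOICE_ID"
```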
Step 2: Synthesize speech with the cloned voice
Replace voice in the following request with the value returned in the previous step.
# Replace YOUR_VOICE_ID with the voice value returned in the previous step
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-tts-vc-2026-01-22",
    "input": {
      "text": "How is the weather today?",
      "voice": "YOUR_VOICE_ID"
    }
  }'
CosyVoice voice cloning
CosyVoice voice cloning uses a dedicated Voice Cloning API. The workflow is the same: create a voice, then synthesize speech with it.
CosyVoice voice cloning is available only in the China (Beijing) region (v3.5/v3/v2 series) and the Singapore region (v3 series).
Step 1: Create a voice
Call the Voice Cloning API to upload an audio file and create a voice. The url parameter is the accessible URL of the audio file; prefix sets a prefix for the voice name.
# Replace url with the publicly accessible URL of your audio file
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# To obtain an API Key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "create_voice",
      "target_model": "cosyvoice-v3-plus",
      "prefix": "myvoice",
      "url": "https://your-audio-url.wav",
      "language_hints": ["en"]
    }
  }'
Step 2: Synthesize speech with the cloned voice
Replace voice in the following request with the value returned in the previous step.
# Replace YOUR_VOICE_ID with the voice value returned in the previous step
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cosyvoice-v3-plus",
    "input": {
      "text": "How is the weather today?",
      "voice": "YOUR_VOICE_ID",
      "format": "wav",
      "sample_rate": 24000
    }
  }'
Audio requirements
The quality of the input audio directly affects the cloning result. Each model family has different audio requirements. Prepare your audio sample according to the requirements of your target model.
CosyVoice
| Item | Requirement |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Duration | 10 to 20 seconds recommended. Maximum 60 seconds. |
| File size | 10 MB or less |
| Sample rate | 16 kHz or higher |
| Channels | Mono or stereo. For stereo audio, only the first channel is processed. Make sure the first channel contains valid speech. |
| Content | The audio must contain at least 5 seconds of continuous, clear speech. Brief pauses in the remaining portion must not exceed 2 seconds. Avoid background music, ambient noise, or other voices. Use normal-speed spoken audio; don't upload songs or singing. |
| Supported languages | Varies by the speech synthesis model specified through the target_model parameter. |
Qwen-TTS
| Item | Requirement |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Duration | 10 to 20 seconds recommended. Maximum 60 seconds. |
| File size | Less than 10 MB |
| Sample rate | 24 kHz or higher |
| Channels | Mono |
| Content | The audio must contain at least 3 seconds of continuous, clear speech. Brief pauses in the remaining portion must not exceed 2 seconds. Avoid background music, ambient noise, or other voices. Use normal-speed spoken audio; don't upload songs or singing. |
| Supported languages | Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian |
MiniMax
| Item | Requirement |
| --- | --- |
| Supported formats | MP3, M4A, WAV |
| Duration | At least 10 seconds. Maximum 5 minutes. |
| File size | 20 MB or less |
| Content | The audio must contain continuous, clear speech with no background sound. Pauses must not exceed 2 seconds. Avoid background music, ambient noise, or other voices throughout the recording. Use normal-speed spoken audio. Don't upload songs or singing recordings. |
| Supported languages | No restrictions |
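Several of these limits can be pre-checked locally for WAV files with Python's standard wave module. The thresholds below encode the CosyVoice row as an illustration; this is a rough sanity check under those assumptions, not an official validator, and it cannot verify speech content:

```python
import pathlib
import wave

# Thresholds taken from the CosyVoice table above; adjust for Qwen-TTS
# (sample rate >= 24 kHz, mono, file size < 10 MB).
MAX_BYTES = 10 * 1024 * 1024
MIN_SAMPLE_RATE = 16_000
MAX_SECONDS = 60

def check_wav(file_path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    if pathlib.Path(file_path).stat().st_size > MAX_BYTES:
        problems.append("file larger than 10 MB")
    with wave.open(file_path, "rb") as w:
        if w.getframerate() < MIN_SAMPLE_RATE:
            problems.append("sample rate below 16 kHz")
        if w.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        if w.getnchannels() > 2:
            problems.append("more than two channels")
        duration = w.getnframes() / w.getframerate()
        if duration > MAX_SECONDS:
            problems.append("longer than 60 seconds")
        if duration < 5:
            problems.append("shorter than the 5 seconds of required speech")
    return problems
```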
For best cloning results, follow the Recording tips when preparing your audio sample.
Recording tips
High-quality input audio produces better cloning results. The following sections provide guidance on recording equipment, environment, script content, and recording workflow.
Recording equipment
Use a smartphone, digital voice recorder, or professional recording device. For best results, use a device with a sample rate of 24 kHz or higher.
Recording environment
Venue
- Record in a small, enclosed space of 10 square meters or less.
- Prefer a room with sound-absorbing materials such as acoustic foam, carpet, or curtains.
- Avoid open halls, conference rooms, classrooms, and other spaces with high reverberation.
Noise control
- Outdoor noise: Close doors and windows to block traffic, construction, and other external sounds.
- Indoor noise: Turn off air conditioners, fans, fluorescent light ballasts, and other appliances. To identify hidden noise sources, record a few seconds of ambient sound and play it back at higher volume.
Reverberation control
- Reverberation blurs the sound and reduces clarity.
- Reduce reflections from smooth surfaces: close curtains, open closet doors, and drape clothing or blankets over desks and cabinets.
- Use irregularly shaped objects such as bookshelves and upholstered furniture to diffuse sound.
Recording script
- No specific content restrictions. Match the script to the target use case when possible.
- Avoid short phrases such as "Hello" or "Yes." Use complete sentences.
- Keep the content coherent and avoid frequent pauses. Aim for at least 3 seconds of continuous speech without interruption.
- Maintain a consistent pace throughout, including the beginning and end of the recording. Speaking too fast at the start or finish may cause stuttering in the synthesized speech.
- Include natural emotional expression, such as warmth, friendliness, or seriousness. Avoid robotic delivery.
- Don't include sensitive content such as political, sexual, or violent material. This causes the cloning request to fail.
Recording workflow
The following example uses a typical bedroom as the recording space:
- Close all doors and windows to block external noise.
- Turn off air conditioners, fans, and other appliances.
- Close the curtains to reduce glass reflections.
- Drape clothing or a blanket over the desk to reduce surface reflections.
- Review the script, decide on a tone and persona, then record naturally.
- Hold the recording device about 10 cm from your mouth to avoid plosive distortion or a weak signal.
Manage custom voices
After creating a voice with Qwen-TTS or CosyVoice, you can query and manage your voices through the API.
- List voices: Get a list of all custom voices under your account.
- Get voice details: View details of a specific voice, such as the creation time and the bound speech synthesis model.
- Delete voices: Delete custom voices you no longer need to free up quota.
For API endpoints and parameter details, see API reference.
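For CosyVoice, these management calls reuse the voice-enrollment endpoint with different action values. The action and parameter names below are assumptions modeled on the create_voice action shown earlier; confirm them against the API reference before use. A sketch of the request bodies:

```python
def enrollment_payload(action: str, **fields) -> dict:
    """Build a CosyVoice voice-enrollment request body for a given action.

    Action and field names other than create_voice are assumptions; check
    the API reference for the exact values.
    """
    return {"model": "voice-enrollment", "input": {"action": action, **fields}}

# Hypothetical management bodies, POSTed to the same customization endpoint:
list_body = enrollment_payload("list_voice", prefix="myvoice")
query_body = enrollment_payload("query_voice", voice_id="YOUR_VOICE_ID")
delete_body = enrollment_payload("delete_voice", voice_id="YOUR_VOICE_ID")
```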
Supported models
Available models vary by deployment region:
International
If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.
Use a Singapore-region API Key when calling the following models:
- CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
- Qwen-TTS:
  - Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
  - Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
Chinese mainland
If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).
Use a China (Beijing)-region API Key when calling the following models:
- CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
- Qwen-TTS:
  - Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
  - Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
API reference
FAQ
Q: Can I use a created voice with different speech synthesis models?
No. A voice is bound to a specific speech synthesis model through the target_model parameter during voice creation and can't be used across models. To use the same audio recording with multiple models, create a separate voice for each model.
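The one-voice-per-model rule can be handled with a small helper that clones the same audio once per target model; create_voice here stands for whichever creation call you use, such as the one in the Python example above:

```python
def clone_for_models(create_voice, audio, target_models):
    """Create one custom voice per target model and return {model: voice_id}.

    create_voice is any callable taking (audio, target_model=...) and
    returning a voice ID; this helper only organizes the results.
    """
    return {model: create_voice(audio, target_model=model) for model in target_models}
```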
Q: How long does a cloned voice remain valid?
Voices created with Qwen-TTS and CosyVoice are valid indefinitely by default. However, the system may remove voices that haven't been used for an extended period. Save your voice IDs and use the query API to verify that a voice is still available when needed.
Q: Does poor audio quality affect the cloning result?
Yes. The quality of the input audio directly affects the cloning result. Background noise, reverberation, and overlapping voices all reduce the similarity and naturalness of the cloned voice. Follow the Audio requirements and Recording tips when preparing your audio sample.