Voice Cloning creates a highly realistic custom voice from a 10- to 20-second audio sample, with no model training required.
Overview
Voice Cloning lets you build personalized voice assistants, branded audio broadcasts, and custom narration.
Model Studio supports Voice Cloning through the following model families:
- CosyVoice: Create voices through the DashScope SDK or HTTP API. Supports real-time speech synthesis. Available in the China (Beijing) and Singapore regions.
- Qwen-TTS: Create voices through the HTTP API. Supports both real-time and non-real-time speech synthesis. Available in the China (Beijing) and Singapore regions.
For a detailed comparison and guidance on choosing a model family, see Speech synthesis.
Prerequisites
- If you call the API through the DashScope SDK, install the latest SDK.
- Prepare an audio file that meets the Audio requirements.
Quick start
Voice cloning involves three steps:
- Prepare the audio: Prepare an audio file that meets the Audio requirements.
- Create a voice: Call the Voice Cloning API to upload the audio and create a voice. In the target_model parameter, specify the speech synthesis model to bind to the voice.
- Synthesize speech: Call the speech synthesis API and pass the voice ID returned when you created the voice.
CosyVoice voice cloning
CosyVoice voice cloning is available in the China (Beijing) region (v3.5, v3, v2, and v1 series) and the Singapore region (v3 series only).
Step 1: Create a voice
Call the Voice Cloning API to upload an audio file and create a voice. The url parameter is the accessible URL of the audio file; prefix sets a prefix for the voice name.
Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
curl -X POST 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/audio/tts/customization' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "create_voice",
      "target_model": "cosyvoice-v3-plus",
      "prefix": "myvoice",
      "url": "https://your-audio-url.wav"
    }
  }'
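The same request can be assembled in Python. The following is a minimal sketch that only builds the request with the standard library and does not send it. The function name build_create_voice_request and its parameters are illustrative and not part of the DashScope SDK; the endpoint and payload fields are copied from the cURL example above.

```python
import json
import os
import urllib.request


def build_create_voice_request(workspace_id, api_key, audio_url,
                               prefix="myvoice",
                               target_model="cosyvoice-v3-plus"):
    """Build (but do not send) the POST request shown in the cURL example."""
    # Singapore region URL; the URL varies by region.
    url = (f"https://{workspace_id}.ap-southeast-1.maas.aliyuncs.com"
           "/api/v1/services/audio/tts/customization")
    payload = {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": target_model,
            "prefix": prefix,
            "url": audio_url,
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_voice_request(
        workspace_id="your-workspace-id",  # placeholder value
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        audio_url="https://your-audio-url.wav",
    )
    # Sending is left commented out so the sketch has no side effects:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
    print(req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the payload, in particular target_model, before any voice is created.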
Step 2: Synthesize speech with the cloned voice
Replace voice_id in the following code with the value returned in the previous step.
# coding=utf-8
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer
import os
# The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If the environment variable is not set, replace the following line with your Model Studio API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
dashscope.base_websocket_api_url='wss://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api-ws/v1/inference'
# Use the same model for both voice cloning and speech synthesis
model = "cosyvoice-v3-plus"
# Replace the voice parameter with the custom voice generated by cloning
voice = "voice_id"
# Create a SpeechSynthesizer instance with the model and voice parameters
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send text for synthesis and get binary audio
audio = synthesizer.call("How is the weather today?")
# The first call incurs extra latency for establishing the WebSocket connection
print('[Metric] requestId: {}, first packet latency: {} ms'.format(
    synthesizer.get_last_request_id(),
    synthesizer.get_first_package_delay()))
# Save the audio to a local file
with open('output.mp3', 'wb') as f:
    f.write(audio)
Qwen-TTS voice cloning
The example uses the local audio file voice.mp3. Before running the code, replace voice.mp3 with the path to your audio file.
The target_model set during voice creation must exactly match the model used for speech synthesis. Otherwise, synthesis fails.
Python
import os
import requests
import base64
import pathlib
import dashscope
# ======= Constants =======
DEFAULT_TARGET_MODEL = "qwen3-tts-vc-2026-01-22" # Use the same model for both voice cloning and speech synthesis
DEFAULT_PREFERRED_NAME = "guanyu"
DEFAULT_AUDIO_MIME_TYPE = "audio/mpeg"
VOICE_FILE_PATH = "voice.mp3" # Relative path to the local audio file used for voice cloning
def create_voice(file_path: str,
                 target_model: str = DEFAULT_TARGET_MODEL,
                 preferred_name: str = DEFAULT_PREFERRED_NAME,
                 audio_mime_type: str = DEFAULT_AUDIO_MIME_TYPE) -> str:
    """
    Create a custom voice and return the voice parameter.
    """
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")
    file_path_obj = pathlib.Path(file_path)
    if not file_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    # Encode the audio file as a Base64 data URI
    base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
    data_uri = f"data:{audio_mime_type};base64,{base64_str}"
    # Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
    url = "https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/audio/tts/customization"
    payload = {
        "model": "qwen-voice-enrollment",  # Do not modify this value
        "input": {
            "action": "create",
            "target_model": target_model,
            "preferred_name": preferred_name,
            "audio": {"data": data_uri}
        }
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    resp = requests.post(url, json=payload, headers=headers)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to create voice: {resp.status_code}, {resp.text}")
    try:
        return resp.json()["output"]["voice"]
    except (KeyError, ValueError) as e:
        raise RuntimeError(f"Failed to parse voice response: {e}")


if __name__ == '__main__':
    # Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
    dashscope.base_http_api_url = 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1'
    text = "How is the weather today?"
    response = dashscope.MultiModalConversation.call(
        model=DEFAULT_TARGET_MODEL,
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text,
        voice=create_voice(VOICE_FILE_PATH),  # Replace the voice parameter with the custom voice generated by cloning
        stream=False
    )
    print(response)
cURL
Replace the value of data with the location of your audio file. The example below uses a URL; the Python example above passes a Base64 data URI instead.
Step 1: Create a voice
Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
curl -X POST 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/audio/tts/customization' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-voice-enrollment",
    "input": {
      "action": "create",
      "target_model": "qwen3-tts-vc-2026-01-22",
      "preferred_name": "guanyu",
      "audio": {
        "data": "https://xxx.wav"
      }
    }
  }'
Step 2: Synthesize speech with the cloned voice
Replace YOUR_VOICE_ID with the voice value from the previous step's response.
Singapore region URL. Replace WorkspaceId with your actual workspace ID. The URL varies by region.
curl -X POST 'https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-tts-vc-2026-01-22",
    "input": {
      "text": "How is the weather today?",
      "voice": "YOUR_VOICE_ID"
    }
  }'
Audio requirements
The quality of the input audio directly affects the cloning result. Each model family has different audio requirements. Prepare your audio sample according to the requirements of your target model.
CosyVoice
| Item | Requirement |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Duration | 10 to 20 seconds recommended. Maximum 60 seconds. |
| File size | 10 MB or less |
| Sample rate | 16 kHz or higher |
| Channels | Mono or stereo. For stereo audio, only the first channel is processed. Make sure the first channel contains valid speech. |
| Content | The audio must contain at least 5 seconds of continuous, clear speech. Brief pauses in the remaining portion must not exceed 2 seconds. Avoid background music, ambient noise, or other voices. Use normal-speed spoken audio; don't upload songs or singing. |
| Supported languages | Varies by the speech synthesis model specified when the voice is created. |
MiniMax
| Item | Requirement |
| --- | --- |
| Supported formats | MP3, M4A, WAV |
| Duration | At least 10 seconds. Maximum 5 minutes. |
| File size | 20 MB or less |
| Content | The audio must contain continuous, clear speech with no background sound. Pauses must not exceed 2 seconds. Avoid background music, ambient noise, or other voices throughout the recording. Use normal-speed spoken audio. Don't upload songs or singing recordings. |
| Supported languages | No restrictions |
Qwen-TTS
| Item | Requirement |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Duration | 10 to 20 seconds recommended. Maximum 60 seconds. |
| File size | Less than 10 MB |
| Sample rate | 24 kHz or higher |
| Channels | Mono |
| Content | The audio must contain at least 3 seconds of continuous, clear speech. Brief pauses in the remaining portion must not exceed 2 seconds. Avoid background music, ambient noise, or other voices. Use normal-speed spoken audio; don't upload songs or singing. |
| Supported languages | Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian |
For best cloning results, follow the Recording tips when preparing your audio sample.
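The numeric limits in the CosyVoice and Qwen-TTS tables can be checked locally before uploading. The sketch below covers WAV files only, because it relies on Python's standard wave module. The names LIMITS and check_wav are illustrative, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption. It does not check speech content, pauses, or background noise.

```python
import os
import wave

MB = 1024 * 1024  # assumption: "MB" in the tables is read as 1024 * 1024 bytes

LIMITS = {
    # CosyVoice: 16 kHz or higher, mono or stereo, 10 MB or less
    "cosyvoice": {"min_rate": 16000, "channels": (1, 2),
                  "size_ok": lambda n: n <= 10 * MB},
    # Qwen-TTS: 24 kHz or higher, mono only, less than 10 MB
    "qwen-tts": {"min_rate": 24000, "channels": (1,),
                 "size_ok": lambda n: n < 10 * MB},
}


def check_wav(path, family):
    """Return (errors, warnings) for a WAV file checked against one model family."""
    limits = LIMITS[family]
    errors, warnings = [], []
    if not limits["size_ok"](os.path.getsize(path)):
        errors.append("file size exceeds the 10 MB limit")
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            errors.append("sample width is not 16-bit")
        if wav.getframerate() < limits["min_rate"]:
            errors.append(f"sample rate is below {limits['min_rate']} Hz")
        if wav.getnchannels() not in limits["channels"]:
            errors.append("unsupported channel count")
        seconds = wav.getnframes() / wav.getframerate()
    if seconds > 60:
        errors.append("duration exceeds 60 seconds")
    elif not 10 <= seconds <= 20:
        warnings.append("duration is outside the recommended 10 to 20 seconds")
    return errors, warnings
```

For example, check_wav("voice.wav", "qwen-tts") returns two lists: hard limit violations and softer recommendations. An empty errors list only means the file passed these mechanical checks.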
Recording tips
High-quality input audio produces better cloning results.
Recording equipment
Use a smartphone, digital voice recorder, or professional recording device. For best results, use a device with a sample rate of 24 kHz or higher.
Recording environment
Venue
- Record in a small, enclosed space of 10 square meters or less.
- Prefer a room with sound-absorbing materials such as acoustic foam, carpet, or curtains.
- Avoid open halls, conference rooms, classrooms, and other spaces with high reverberation.
Noise control
- Outdoor noise: Close doors and windows to block traffic, construction, and other external sounds.
- Indoor noise: Turn off air conditioners, fans, fluorescent light ballasts, and other appliances. To identify hidden noise sources, record a few seconds of ambient sound and play it back at higher volume.
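Besides listening to the ambient recording, you can measure it. The following sketch computes the RMS level of a 16-bit PCM WAV file in dBFS, so that two ambient recordings (for example, before and after switching off an appliance) can be compared numerically. The function name rms_dbfs is illustrative, and no particular threshold is implied by this guide.

```python
import array
import math
import sys
import wave


def rms_dbfs(path):
    """Return the RMS level of a 16-bit PCM WAV file in dBFS (0 dBFS = full scale)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        raise ValueError("file contains no audio frames")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / 32768 ** 2)
```

A lower (more negative) value for the ambient recording indicates a quieter room.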
Reverberation control
- Reverberation blurs the sound and reduces clarity.
- Reduce reflections from smooth surfaces: close curtains, open closet doors, and drape clothing or blankets over desks and cabinets.
- Use irregularly shaped objects such as bookshelves and upholstered furniture to diffuse sound.
Recording script
- No specific content restrictions. Match the script to the target use case when possible.
- Avoid short phrases such as "Hello" or "Yes." Use complete sentences.
- Keep the content coherent and avoid frequent pauses. Aim for at least 3 seconds of continuous speech without interruption.
- Maintain a consistent pace throughout the recording. Speaking too fast at the start or finish may cause stuttering in the synthesized speech.
- Include natural emotional expression, such as warmth, friendliness, or seriousness. Avoid robotic delivery.
- Don't include sensitive content such as political, sexual, or violent material. This causes the cloning request to fail.
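The advice on continuous speech and pauses can be roughly checked on a finished recording. The sketch below splits a 16-bit mono WAV file into short frames, treats a frame as silent when its peak amplitude is below a threshold, and reports the longest speech run and the longest pause. The function name, the 20 ms frame length, and the threshold of 500 are illustrative assumptions, not values defined by the service; a simple amplitude gate is only an approximation of real speech detection.

```python
import array
import sys
import wave


def speech_and_pause_runs(path, silence_threshold=500, frame_ms=20):
    """Return (longest_speech_s, longest_pause_s) for a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM audio")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    frame_len = max(1, rate * frame_ms // 1000)
    longest = {True: 0, False: 0}  # keyed by "is this run silent?"
    state, run = None, 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        silent = max(abs(s) for s in frame) < silence_threshold
        if silent == state:
            run += len(frame)          # extend the current run
        else:
            state, run = silent, len(frame)  # start a new run
        longest[silent] = max(longest[silent], run)
    return longest[False] / rate, longest[True] / rate
```

A recording whose longest pause is above 2 seconds, or whose longest speech run is shorter than the minimum listed in the Audio requirements, is worth re-recording before upload.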
Recording workflow
The following example uses a typical bedroom as the recording space. Complete the noise reduction and reverberation control steps described above, then:
- Review the script, decide on a tone and persona, then record naturally.
- Hold the recording device about 10 cm from your mouth to avoid plosive distortion or a weak signal.
Manage custom voices
After creating a voice with Qwen-TTS or CosyVoice, you can query and manage your voices through the API.
- List voices: Get a list of all custom voices under your account.
- Get voice details: View details of a specific voice, such as the creation time and the bound speech synthesis model.
- Delete voices: Delete custom voices you no longer need to free up quota.
For API endpoints and parameter details, see API reference.
Quota and billing
Voice quota and automatic cleanup
Total voice limit: Each Alibaba Cloud Model Studio account has a separate limit of 1,000 custom voices for CosyVoice and 1,000 for Qwen-TTS. The two quotas are counted independently.
Automatic cleanup: If a voice isn't used in any speech synthesis request for one year, the system automatically deletes it.
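Because unused voices are deleted after one year, it can help to track last-use times on your side. The sketch below flags voices that are approaching the limit. It assumes your application records the time of each synthesis request itself; the function name, the 30-day warning window, and the approximation of one year as 365 days are all illustrative choices.

```python
from datetime import datetime, timedelta, timezone

UNUSED_LIMIT = timedelta(days=365)  # "one year", approximated as 365 days


def voices_at_risk(last_used, now=None, warn_before=timedelta(days=30)):
    """Return voice IDs whose last use is within warn_before of the one-year limit.

    last_used maps a voice ID to the timezone-aware datetime of its most
    recent synthesis request, as recorded by your own application.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - (UNUSED_LIMIT - warn_before)
    return sorted(v for v, used in last_used.items() if used <= cutoff)
```

Running a short synthesis request with each flagged voice, or deleting the ones you no longer need, keeps the voice list under your control rather than leaving it to automatic cleanup.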
Billing rules
- CosyVoice: Voice creation is free.
- Qwen-TTS: Each voice creation costs USD 0.01. Failed creations aren't charged.
Free quota (Singapore region only):
- You get 1,000 free voice creations during the first 90 days after activating Alibaba Cloud Model Studio.
- Failed creations don't consume the free quota.
- Deleting a voice doesn't restore the free quota.
- After the free quota is used up or the 90-day window expires, voice creation is billed at USD 0.01 per voice.
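The billing rules above reduce to simple arithmetic. The following sketch estimates the charge for a batch of successful Qwen-TTS voice creations. The function and parameter names are hypothetical, amounts are kept in US cents to avoid floating-point rounding, and treating day 90 as still inside the free window is an assumption. Actual charges are determined by your bill, not by this estimate.

```python
FREE_CREATIONS = 1000    # free voice creations (Singapore region only)
FREE_WINDOW_DAYS = 90    # counted from activation of Model Studio
PRICE_CENTS = 1          # USD 0.01 per successful creation


def billed_cents(successful_creations, free_already_used=0,
                 days_since_activation=0, singapore=True):
    """Estimate the charge, in US cents, for successful Qwen-TTS voice creations.

    Failed creations are neither charged nor counted against the free quota,
    so only successful creations are passed in.
    """
    if successful_creations < 0 or free_already_used < 0:
        raise ValueError("counts must be non-negative")
    free_left = 0
    if singapore and days_since_activation <= FREE_WINDOW_DAYS:
        free_left = max(0, FREE_CREATIONS - free_already_used)
    billable = max(0, successful_creations - free_left)
    return billable * PRICE_CENTS
```

For example, 1,200 successful creations in a new Singapore account use the 1,000 free creations and are billed for the remaining 200, which is 200 cents (USD 2.00).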
Supported models and regions
Singapore
Use a Singapore-region API Key when calling the following models:
- CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
- Qwen-TTS:
  - Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
  - Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
China (Beijing)
Use a China (Beijing)-region API Key when calling the following models:
- CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
- Qwen-TTS:
  - Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
  - Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
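The region lists above can be encoded as a lookup table so that an unsupported combination is caught before a request is sent. This is a client-side sketch: the region keys "singapore" and "beijing" and the function name are illustrative, and the model names are copied from the lists in this section, which may change over time.

```python
_QWEN_TTS = {
    "qwen3-tts-vc-realtime-2026-01-15",
    "qwen3-tts-vc-realtime-2025-11-27",
    "qwen3-tts-vc-2026-01-22",
}

SUPPORTED_MODELS = {
    "singapore": {"cosyvoice-v3-plus", "cosyvoice-v3-flash"} | _QWEN_TTS,
    "beijing": {
        "cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash",
        "cosyvoice-v3-plus", "cosyvoice-v3-flash", "cosyvoice-v2",
    } | _QWEN_TTS,
}


def check_target_model(region, model):
    """Raise ValueError if the model is not listed for the region; return the model otherwise."""
    if region not in SUPPORTED_MODELS:
        raise ValueError(f"unknown region: {region}")
    if model not in SUPPORTED_MODELS[region]:
        raise ValueError(f"{model} is not listed for region {region}")
    return model
```

Calling check_target_model("singapore", "cosyvoice-v3.5-plus") raises ValueError, because the v3.5 series is listed for China (Beijing) only.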
API reference
FAQ
Q: Can I use a created voice with different speech synthesis models?
No. A voice is bound to a specific speech synthesis model through the target_model parameter during voice creation and can't be used across models. To use the same audio recording with multiple models, create a separate voice for each model.
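One way to avoid mismatches is to record the bound model when each voice is created and check it before synthesis. The class below is a hypothetical client-side helper, not part of the DashScope SDK, and the voice IDs in the usage note are made up.

```python
class VoiceRegistry:
    """Remember which speech synthesis model each cloned voice is bound to."""

    def __init__(self):
        self._model_by_voice = {}

    def register(self, voice_id, target_model):
        """Record the target_model used when the voice was created."""
        self._model_by_voice[voice_id] = target_model

    def check(self, voice_id, synthesis_model):
        """Return voice_id if it is bound to synthesis_model; raise otherwise."""
        bound = self._model_by_voice.get(voice_id)
        if bound is None:
            raise KeyError(f"unknown voice: {voice_id}")
        if bound != synthesis_model:
            raise ValueError(
                f"voice {voice_id} is bound to {bound}, not {synthesis_model}")
        return voice_id
```

For example, after registry.register("voice-a", "cosyvoice-v3-plus"), a call to registry.check("voice-a", "qwen3-tts-vc-2026-01-22") raises ValueError locally instead of failing at the synthesis API.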
Q: How long does a cloned voice remain valid?
Voices created with Qwen-TTS and CosyVoice have no fixed expiration date. However, if a voice isn't used in any speech synthesis request for one year, the system automatically deletes it. For details, see Voice quota and automatic cleanup. Save your voice IDs and use the query API to check whether a voice is still available.
Q: Does poor audio quality affect the cloning result?
Yes. The quality of the input audio directly affects the cloning result. Background noise, reverberation, and overlapping voices all reduce the similarity and naturalness of the cloned voice. Follow the Audio requirements and Recording tips when preparing your audio sample.