Voice Cloning creates a highly realistic custom voice from a 10- to 20-second audio sample, with no model training required.
Overview
Voice Cloning creates a custom voice from an audio sample as short as 10 to 20 seconds — no model training required. Use it to build personalized voice assistants, branded audio broadcasts, and custom narration.
Model Studio supports Voice Cloning through the following model families:
- CosyVoice: Create voices through the DashScope SDK or HTTP API. Supports real-time speech synthesis. Available in the China (Beijing) and Singapore regions.
- Qwen-TTS: Create voices through the HTTP API. Supports both real-time and non-real-time speech synthesis. Available in the China (Beijing) and Singapore regions.
For a detailed comparison and guidance on choosing a model family, see Speech synthesis.
Prerequisites
- If you call the API through the DashScope SDK, install the latest SDK.
- Prepare an audio file that meets the Audio requirements.
Quick start
Voice cloning involves three steps:
- Prepare the audio: Prepare an audio file that meets the Audio requirements.
- Create a voice: Call the Voice Cloning API to upload the audio. The system extracts the voice characteristics and generates a custom voice. Specify the target speech synthesis model in the target_model parameter when creating the voice.
- Synthesize speech with the voice: Call the speech synthesis API and pass the voice ID returned during voice creation. The model used for synthesis must match the target_model value set when you created the voice; otherwise, synthesis fails.
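The audio field in the create-voice request is a base64 data URI. As a minimal sketch, independent of any SDK (the MIME type must match your file format), building one looks like:

```python
import base64
import pathlib

def to_data_uri(file_path: str, mime_type: str = "audio/mpeg") -> str:
    """Encode a local audio file as the base64 data URI expected by the create-voice request."""
    audio_bytes = pathlib.Path(file_path).read_bytes()
    encoded = base64.b64encode(audio_bytes).decode()
    return f"data:{mime_type};base64,{encoded}"
```

For Qwen-TTS, pass the returned string as the audio data field of the request; CosyVoice instead references the audio by a publicly accessible URL.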
Qwen-TTS voice cloning
The following example shows the complete voice cloning workflow: upload an audio file to create a voice, then use that voice for speech synthesis. The example uses the local audio file voice.mp3. Replace voice.mp3 with the actual path to your audio file before running the code.
The target_model set during voice creation must exactly match the model used for speech synthesis. Otherwise, synthesis fails.
Python
import os
import requests
import base64
import pathlib

import dashscope

# ======= Constants =======
DEFAULT_TARGET_MODEL = "qwen3-tts-vc-2026-01-22"  # Use the same model for both voice cloning and speech synthesis
DEFAULT_PREFERRED_NAME = "guanyu"
DEFAULT_AUDIO_MIME_TYPE = "audio/mpeg"
VOICE_FILE_PATH = "voice.mp3"  # Relative path to the local audio file used for voice cloning


def create_voice(file_path: str,
                 target_model: str = DEFAULT_TARGET_MODEL,
                 preferred_name: str = DEFAULT_PREFERRED_NAME,
                 audio_mime_type: str = DEFAULT_AUDIO_MIME_TYPE) -> str:
    """Create a custom voice and return the voice parameter."""
    # The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")

    file_path_obj = pathlib.Path(file_path)
    if not file_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
    data_uri = f"data:{audio_mime_type};base64,{base64_str}"

    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
    payload = {
        "model": "qwen-voice-enrollment",  # Do not modify this value
        "input": {
            "action": "create",
            "target_model": target_model,
            "preferred_name": preferred_name,
            "audio": {"data": data_uri}
        }
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    resp = requests.post(url, json=payload, headers=headers)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to create voice: {resp.status_code}, {resp.text}")
    try:
        return resp.json()["output"]["voice"]
    except (KeyError, ValueError) as e:
        raise RuntimeError(f"Failed to parse voice response: {e}")


if __name__ == '__main__':
    # The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
    dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
    text = "How is the weather today?"
    response = dashscope.MultiModalConversation.call(
        model=DEFAULT_TARGET_MODEL,
        # The API keys for the Singapore and Beijing regions are different. To obtain an API key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text,
        voice=create_voice(VOICE_FILE_PATH),  # Pass the custom voice generated by cloning
        stream=False
    )
    print(response)
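The synthesized audio is typically returned by reference rather than inline; in Qwen-TTS responses the URL usually sits at output.audio.url, but that path is an assumption here, so print the response and adjust if your model or snapshot differs. A sketch for saving the result:

```python
import requests

def extract_audio_url(response: dict) -> str:
    """Return the synthesized audio URL from a response dict.

    The output.audio.url path is an assumption based on typical Qwen-TTS
    responses; inspect the printed response if this raises KeyError.
    """
    return response["output"]["audio"]["url"]

def save_audio(response: dict, out_path: str = "output.wav") -> None:
    """Download the synthesized audio to a local file (needs network access)."""
    audio = requests.get(extract_audio_url(response), timeout=30)
    audio.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(audio.content)
```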
cURL
Voice cloning with cURL is a two-step process: create a voice, then use it to synthesize speech.
Step 1: Create a voice
# Replace voice.mp3 with the actual path to your audio file
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# Note: API Keys differ between the Singapore and Beijing regions. To obtain an API Key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
# The -i flag applies to macOS. On Linux, use: AUDIO_BASE64=$(base64 -w 0 voice.mp3) so the output is not line-wrapped, which would corrupt the data URI
AUDIO_BASE64=$(base64 -i voice.mp3)
curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-voice-enrollment",
    "input": {
      "action": "create",
      "target_model": "qwen3-tts-vc-2026-01-22",
      "preferred_name": "guanyu",
      "audio": {"data": "data:audio/mpeg;base64,'$AUDIO_BASE64'"}
    }
  }'
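The create call returns JSON whose output.voice field is the voice ID you need in step 2 (the same path the Python example parses). A small sketch for capturing it without extra tooling; the sample response below is illustrative:

```shell
# Substitute the actual output of the curl command above for this sample response.
RESPONSE='{"output":{"voice":"guanyu-xxxx"},"request_id":"example"}'
# Pull output.voice out of the JSON with the Python standard library.
VOICE_ID=$(printf '%s' "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["output"]["voice"])')
echo "$VOICE_ID"
```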
Step 2: Synthesize speech with the cloned voice
Replace voice in the following request with the value returned in the previous step.
# Replace YOUR_VOICE_ID with the voice value returned in the previous step
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-tts-vc-2026-01-22",
    "input": {
      "text": "How is the weather today?",
      "voice": "YOUR_VOICE_ID"
    }
  }'
CosyVoice voice cloning
CosyVoice voice cloning uses a dedicated Voice Cloning API. The workflow is the same: create a voice, then synthesize speech with it.
CosyVoice voice cloning is available only in the China (Beijing) region (v3.5/v3/v2 series) and the Singapore region (v3 series).
Step 1: Create a voice
Call the Voice Cloning API to upload an audio file and create a voice. The url parameter is the accessible URL of the audio file; prefix sets a prefix for the voice name.
# Replace url with the publicly accessible URL of your audio file
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# To obtain an API Key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-enrollment",
    "input": {
      "action": "create_voice",
      "target_model": "cosyvoice-v3-plus",
      "prefix": "myvoice",
      "url": "https://your-audio-url.wav",
      "language_hints": ["en"]
    }
  }'
Step 2: Synthesize speech with the cloned voice
Replace voice in the following request with the value returned in the previous step.
# Replace YOUR_VOICE_ID with the voice value returned in the previous step
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cosyvoice-v3-plus",
    "input": {
      "text": "How is the weather today?",
      "voice": "YOUR_VOICE_ID",
      "format": "wav",
      "sample_rate": 24000
    }
  }'
Audio requirements
The quality of the input audio directly affects the cloning result. Each model family has different audio requirements. Prepare your audio sample according to the requirements of your target model.
CosyVoice
| Item | Requirement |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Duration | 10 to 20 seconds recommended. Maximum 60 seconds. |
| File size | 10 MB or less |
| Sample rate | 16 kHz or higher |
| Channels | Mono or stereo. For stereo audio, only the first channel is processed. Make sure the first channel contains valid speech. |
| Content | The audio must contain at least 5 seconds of continuous, clear speech. Brief pauses in the remaining portion must not exceed 2 seconds. Avoid background music, ambient noise, or other voices. Use normal-speed spoken audio; don't upload songs or singing. |
| Supported languages | Varies by the speech synthesis model specified through the target_model parameter. |
Qwen-TTS
| Item | Requirement |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Duration | 10 to 20 seconds recommended. Maximum 60 seconds. |
| File size | Less than 10 MB |
| Sample rate | 24 kHz or higher |
| Channels | Mono |
| Content | The audio must contain at least 3 seconds of continuous, clear speech. Brief pauses in the remaining portion must not exceed 2 seconds. Avoid background music, ambient noise, or other voices. Use normal-speed spoken audio; don't upload songs or singing. |
| Supported languages | Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian |
MiniMax
| Item | Requirement |
| --- | --- |
| Supported formats | MP3, M4A, WAV |
| Duration | At least 10 seconds. Maximum 5 minutes. |
| File size | 20 MB or less |
| Content | The audio must contain continuous, clear speech with no background sound. Pauses must not exceed 2 seconds. Avoid background music, ambient noise, or other voices throughout the recording. Use normal-speed spoken audio. Don't upload songs or singing recordings. |
| Supported languages | No restrictions |
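Several of these limits can be pre-checked locally for WAV files with Python's standard wave module. The thresholds below encode the CosyVoice row as an illustration; this is a rough sanity check under those assumptions, not an official validator, and it cannot verify speech content:

```python
import pathlib
import wave

# Thresholds taken from the CosyVoice table above; adjust for Qwen-TTS
# (sample rate >= 24 kHz, mono, file size < 10 MB).
MAX_BYTES = 10 * 1024 * 1024
MIN_SAMPLE_RATE = 16_000
MAX_SECONDS = 60

def check_wav(file_path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    if pathlib.Path(file_path).stat().st_size > MAX_BYTES:
        problems.append("file larger than 10 MB")
    with wave.open(file_path, "rb") as w:
        if w.getframerate() < MIN_SAMPLE_RATE:
            problems.append("sample rate below 16 kHz")
        if w.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        if w.getnchannels() > 2:
            problems.append("more than two channels")
        duration = w.getnframes() / w.getframerate()
        if duration > MAX_SECONDS:
            problems.append("longer than 60 seconds")
        if duration < 5:
            problems.append("shorter than the 5 seconds of required speech")
    return problems
```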
For best cloning results, follow the Recording tips when preparing your audio sample.
Recording tips
High-quality input audio produces better cloning results. The following sections provide guidance on recording equipment, environment, script content, and recording workflow.
Recording equipment
Use a smartphone, digital voice recorder, or professional recording device. For best results, use a device with a sample rate of 24 kHz or higher.
Recording environment
Venue
- Record in a small, enclosed space of 10 square meters or less.
- Prefer a room with sound-absorbing materials such as acoustic foam, carpet, or curtains.
- Avoid open halls, conference rooms, classrooms, and other spaces with high reverberation.
Noise control
- Outdoor noise: Close doors and windows to block traffic, construction, and other external sounds.
- Indoor noise: Turn off air conditioners, fans, fluorescent light ballasts, and other appliances. To identify hidden noise sources, record a few seconds of ambient sound and play it back at higher volume.
Reverberation control
- Reverberation blurs the sound and reduces clarity.
- Reduce reflections from smooth surfaces: close curtains, open closet doors, and drape clothing or blankets over desks and cabinets.
- Use irregularly shaped objects such as bookshelves and upholstered furniture to diffuse sound.
Recording script
- No specific content restrictions. Match the script to the target use case when possible.
- Avoid short phrases such as "Hello" or "Yes." Use complete sentences.
- Keep the content coherent and avoid frequent pauses. Aim for at least 3 seconds of continuous speech without interruption.
- Maintain a consistent pace throughout, including the beginning and end of the recording. Speaking too fast at the start or finish may cause stuttering in the synthesized speech.
- Include natural emotional expression, such as warmth, friendliness, or seriousness. Avoid robotic delivery.
- Don't include sensitive content such as political, sexual, or violent material. This causes the cloning request to fail.
Recording workflow
The following example uses a typical bedroom as the recording space:
- Close all doors and windows to block external noise.
- Turn off air conditioners, fans, and other appliances.
- Close the curtains to reduce glass reflections.
- Drape clothing or a blanket over the desk to reduce surface reflections.
- Review the script, decide on a tone and persona, then record naturally.
- Hold the recording device about 10 cm from your mouth to avoid plosive distortion or a weak signal.
Manage custom voices
After creating a voice with Qwen-TTS or CosyVoice, you can query and manage your voices through the API.
- List voices: Get a list of all custom voices under your account.
- Get voice details: View details of a specific voice, such as the creation time and the bound speech synthesis model.
- Delete voices: Delete custom voices you no longer need to free up quota.
For API endpoints and parameter details, see API reference.
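For CosyVoice, these management calls reuse the voice-enrollment endpoint with different action values. The action and parameter names below are assumptions modeled on the create_voice action shown earlier; confirm them against the API reference before use. A sketch of the request bodies:

```python
def enrollment_payload(action: str, **fields) -> dict:
    """Build a CosyVoice voice-enrollment request body for a given action.

    Action and field names other than create_voice are assumptions; check
    the API reference for the exact values.
    """
    return {"model": "voice-enrollment", "input": {"action": action, **fields}}

# Hypothetical management bodies, POSTed to the same customization endpoint:
list_body = enrollment_payload("list_voice", prefix="myvoice")
query_body = enrollment_payload("query_voice", voice_id="YOUR_VOICE_ID")
delete_body = enrollment_payload("delete_voice", voice_id="YOUR_VOICE_ID")
```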
Supported models
Available models vary by deployment region:
International
If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.
Use a Singapore-region API Key when calling the following models:
- CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
- Qwen-TTS:
  - Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
  - Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
Chinese mainland
If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).
Use a China (Beijing)-region API Key when calling the following models:
- CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash, cosyvoice-v2
- Qwen-TTS:
  - Qwen3-TTS-VC-Realtime: qwen3-tts-vc-realtime-2026-01-15 (latest snapshot), qwen3-tts-vc-realtime-2025-11-27 (snapshot)
  - Qwen3-TTS-VC: qwen3-tts-vc-2026-01-22 (latest snapshot)
API reference
FAQ
Q: Can I use a created voice with different speech synthesis models?
No. A voice is bound to a specific speech synthesis model through the target_model parameter during voice creation and can't be used across models. To use the same audio recording with multiple models, create a separate voice for each model.
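The one-voice-per-model rule can be handled with a small helper that clones the same audio once per target model; create_voice here stands for whichever creation call you use, such as the one in the Python example above:

```python
def clone_for_models(create_voice, audio, target_models):
    """Create one custom voice per target model and return {model: voice_id}.

    create_voice is any callable taking (audio, target_model=...) and
    returning a voice ID; this helper only organizes the results.
    """
    return {model: create_voice(audio, target_model=model) for model in target_models}
```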
Q: How long does a cloned voice remain valid?
Voices created with Qwen-TTS and CosyVoice are valid indefinitely by default. However, the system may remove voices that haven't been used for an extended period. Save your voice IDs and use the query API to verify that a voice is still available when needed.
Q: Does poor audio quality affect the cloning result?
Yes. The quality of the input audio directly affects the cloning result. Background noise, reverberation, and overlapping voices all reduce the similarity and naturalness of the cloned voice. Follow the Audio requirements and Recording tips when preparing your audio sample.