Voice Design

Voice Design lets you create custom voices from voice descriptions alone—no audio samples required. For example, you can describe "a gentle young female voice with moderate speaking rate, suitable for audiobook narration," and the system generates a voice that matches your description.

Overview

Use Voice Design for rapid prototyping, creative content production, game character voices, and other scenarios where you have no recorded audio to work from.

Model Studio provides Voice Design through the following model families:

CosyVoice: Create voices through the HTTP API. Supports speech synthesis. Available in the Beijing and Singapore regions.
Qwen-TTS: Create voices through the HTTP API. Supports real-time and non-real-time speech synthesis. Available in the Beijing and Singapore regions.

If you have audio samples and want to replicate a specific person's voice, see Voice cloning. For a detailed comparison and selection guide across model families, see Speech synthesis.

Prerequisites

Configure an API key and set it as the DASHSCOPE_API_KEY environment variable, which the examples below read.
If you call the API through the DashScope SDK, install the latest SDK.

Quick start

The Voice Design workflow follows three steps: describe, create, and use.

Write a voice description: Describe the voice characteristics you want in natural language. For detailed guidelines, see Write voice descriptions.
Create a voice: Call the Voice Design API. The system generates a voice based on your description and returns preview audio. Review the preview before using the voice.
Synthesize speech with the voice: Call the speech synthesis API, passing in the voice ID to generate speech.

Qwen-TTS Voice Design

The following example covers the complete Voice Design workflow: create a voice, listen to the preview audio, and use the voice to synthesize speech.

Note

The Voice Design service returns preview audio. Listen to it first and make sure it meets your expectations before using the voice for speech synthesis. This helps keep API costs down.

Python

import os
import requests
import dashscope

# ======= Constants =======
DEFAULT_TARGET_MODEL = "qwen3-tts-vd-2026-01-26"  # Use the same model for both voice design and speech synthesis
DEFAULT_PREFERRED_NAME = "custom_voice"

# Voice description: describe the desired voice characteristics in natural language
VOICE_PROMPT = "A young, lively female voice with a fast speaking pace and a noticeable rising intonation, suitable for introducing fashion products."

def create_voice_by_design(voice_prompt: str,
                           target_model: str = DEFAULT_TARGET_MODEL,
                           preferred_name: str = DEFAULT_PREFERRED_NAME) -> str:
    """
    Create a voice from a natural-language description and return its voice ID.
    """
    # The API keys for the Singapore and Beijing regions are different. Obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
    # If no environment variable is configured, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY")

    # The following URL is for the Singapore region. If you use models in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
    url = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
    payload = {
        "model": "qwen-voice-enrollment", # Do not modify this value
        "input": {
            "action": "create",
            "target_model": target_model,
            "preferred_name": preferred_name,
            "voice_prompt": voice_prompt
        }
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # Set a timeout so the call cannot hang indefinitely
    resp = requests.post(url, json=payload, headers=headers, timeout=60)
    if resp.status_code != 200:
        raise RuntimeError(f"Failed to create voice: {resp.status_code}, {resp.text}")

    result = resp.json()
    # Treat a missing or null "output" field as an empty object
    output = result.get("output") or {}
    preview_audio = output.get("preview_audio")
    if preview_audio:
        print(f"Preview audio URL: {preview_audio}")

    voice = output.get("voice")
    if not voice:
        raise RuntimeError(f"Response does not contain a voice ID: {result}")
    return voice

if __name__ == '__main__':
    # The following URL is for the Singapore region. If you use models in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
    dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

    voice_id = create_voice_by_design(VOICE_PROMPT)
    print(f"Created voice ID: {voice_id}")

    text = "Hello everyone, welcome to our live stream! The product we are recommending today is truly amazing."
    response = dashscope.MultiModalConversation.call(
        model=DEFAULT_TARGET_MODEL,
        # The API keys for the Singapore and Beijing regions are different. Obtain an API key: https://www.alibabacloud.com/help/zh/model-studio/get-api-key
        # If no environment variable is configured, replace the following line with your Model Studio API key: api_key = "sk-xxx"
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text,
        voice=voice_id,
        stream=False
    )
    print(response)

cURL

Step 1: Create a voice

# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# Note: API Keys differ between the Singapore and Beijing regions. To obtain an API Key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-voice-enrollment",
    "input": {
        "action": "create",
        "target_model": "qwen3-tts-vd-2026-01-26",
        "preferred_name": "custom_voice",
        "voice_prompt": "A composed middle-aged male announcer with a deep, rich, magnetic voice, a steady speaking rate, and clear articulation, suitable for news broadcasting or documentary commentary."
    }
}'

Step 2: Synthesize speech

In the following request, replace YOUR_VOICE_ID with the voice value returned in the previous step.

# Replace YOUR_VOICE_ID with the voice value returned in the previous step
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-tts-vd-2026-01-26",
    "input": {
        "text": "Dear listeners, hello everyone. Welcome to the evening news.",
        "voice": "YOUR_VOICE_ID"
    }
}'

CosyVoice Voice Design

CosyVoice also creates voices from voice descriptions. It follows the same workflow as Qwen-TTS Voice Design.

Important

CosyVoice Voice Design is available only in the Beijing region (v3.5 series) and the Singapore region (v3 series).

Step 1: Create a voice

Call the Voice Cloning/Design API. Pass the voice description in the voice_prompt parameter and specify the text for the preview audio in the preview_text parameter.

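# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/customization
# cosyvoice-v3.5-plus is listed for the Beijing region only. In the Singapore region, set target_model to a v3 series model such as cosyvoice-v3-plus.
# To obtain an API Key, visit: https://www.alibabacloud.com/help/zh/model-studio/get-api-key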

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3.5-plus",
        "voice_prompt": "A composed middle-aged male announcer with a deep, rich, magnetic voice, a steady speaking rate, and clear articulation, suitable for news broadcasting or documentary commentary.",
        "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
        "prefix": "announcer",
        "language_hints": ["zh"]
    },
    "parameters": {
        "sample_rate": 24000,
        "response_format": "wav"
    }
}'

Step 2: Synthesize speech

In the following request, replace YOUR_VOICE_ID with the voice value returned in the previous step.

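# Replace YOUR_VOICE_ID with the voice value returned in the previous step
# The URL below points to the Singapore region. If you are using a model in the China (Beijing) region, replace it with: https://dashscope.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer
# cosyvoice-v3.5-plus is listed for the Beijing region only. In the Singapore region, use the v3 series model that you set as target_model in Step 1.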

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "cosyvoice-v3.5-plus",
    "input": {
      "text": "Dear listeners, hello everyone. Welcome to the evening news.",
      "voice": "YOUR_VOICE_ID",
      "format": "wav",
      "sample_rate": 24000
    }
}'
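
The two cURL requests above can also be sent from Python. The following is a minimal sketch that uses only the standard library and mirrors the endpoints and request fields shown above. The helper names (build_create_payload, build_synthesis_payload, post_json) are hypothetical and not part of any SDK. The sketch returns the raw JSON response rather than assuming a response field name, and it does not send any request until you call post_json yourself.

```python
import json
import os
import urllib.request

# Singapore endpoint from the cURL examples above.
# For the Beijing region, use https://dashscope.aliyuncs.com/api/v1/services/audio/tts
BASE_URL = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts"


def build_create_payload(voice_prompt: str, preview_text: str,
                         prefix: str, target_model: str) -> dict:
    """Build the Step 1 request body (create a voice from a description)."""
    return {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": target_model,
            "voice_prompt": voice_prompt,
            "preview_text": preview_text,
            "prefix": prefix,
        },
        "parameters": {"sample_rate": 24000, "response_format": "wav"},
    }


def build_synthesis_payload(text: str, voice: str, target_model: str) -> dict:
    """Build the Step 2 request body (synthesize speech with the voice)."""
    return {
        "model": target_model,
        "input": {
            "text": text,
            "voice": voice,
            "format": "wav",
            "sample_rate": 24000,
        },
    }


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to BASE_URL/path and return the parsed JSON response."""
    api_key = os.getenv("DASHSCOPE_API_KEY")
    if not api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    request = urllib.request.Request(
        f"{BASE_URL}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (sends real requests, so it is left commented out):
# created = post_json("customization", build_create_payload(
#     "Calm middle-aged male voice, slow speaking rate",
#     "Dear listeners, hello everyone.", "announcer", "cosyvoice-v3-plus"))
# print(created)  # inspect the response to find the returned voice value
```

The example usage passes cosyvoice-v3-plus because the supported scope on this page lists it for the Singapore region.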

Write voice descriptions

The voice description (voice_prompt) determines the quality of the generated voice. A clear, specific description helps the model produce a voice that closely matches your intent.

Requirements and limitations

Length limit: The maximum length of voice_prompt varies by model: up to 500 characters for CosyVoice and up to 2,048 characters for Qwen-TTS.
Supported languages: Voice descriptions support Chinese and English only.
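
The length limits above can be checked on the client before you call the API. This is a minimal sketch: the function name and the family labels used as dictionary keys are hypothetical, and it assumes the service counts characters the way Python's len() does (Unicode code points), which this page does not confirm.

```python
# Limits copied from the list above; the keys are illustrative labels, not API values.
MAX_PROMPT_CHARS = {"cosyvoice": 500, "qwen-tts": 2048}


def check_voice_prompt(voice_prompt: str, family: str) -> None:
    """Raise ValueError if the description is empty or exceeds the family's limit."""
    limit = MAX_PROMPT_CHARS[family]
    if not voice_prompt.strip():
        raise ValueError("voice_prompt is empty")
    if len(voice_prompt) > limit:
        raise ValueError(
            f"voice_prompt has {len(voice_prompt)} characters; "
            f"the {family} limit is {limit}"
        )


check_voice_prompt("Calm middle-aged male voice, slow speaking rate", "cosyvoice")
```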

Key principles

Be specific, not vague: Use descriptive terms for voice characteristics, such as "deep," "crisp," or "fast-paced." Avoid subjective or vague terms like "pleasant" or "ordinary."
Be multi-dimensional: A good description covers multiple dimensions (such as gender, age, and emotion). A description like "female voice" alone is too broad to produce a distinctive voice.
Be objective, not subjective: Focus on objective, measurable characteristics. For example, use "high-pitched with an energetic tone" instead of "my favorite voice."
Be original, not imitative: Describe the voice characteristics rather than requesting imitation of a specific person (such as a celebrity or actor). The model doesn't support imitation, and such requests may raise copyright concerns.
Be concise, not redundant: Make every word count. Avoid repeating synonyms or adding unnecessary modifiers.
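
As a rough aid for the first and third principles, a description can be scanned for the vague or subjective terms named above before it is submitted. This is only an illustrative sketch: the term list contains just the examples from this page, the function name is hypothetical, and a clean result does not guarantee a good description.

```python
import re

# Only the vague or subjective examples mentioned in the principles above.
VAGUE_TERMS = ("pleasant", "ordinary", "my favorite")


def find_vague_terms(voice_prompt: str) -> list[str]:
    """Return the listed vague terms that appear as whole words in the description."""
    lowered = voice_prompt.lower()
    return [
        term for term in VAGUE_TERMS
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]


print(find_vague_terms("A pleasant female voice"))
print(find_vague_terms("Deep, crisp male voice with a fast speaking rate"))
```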

Description dimensions

Combine the following dimensions to describe a voice. The more dimensions you include, the more accurate the result.

Dimension	Examples
Gender	Male, female, neutral
Age	Child (5-12), teenager (13-18), young adult (19-35), middle-aged (36-55), elderly (over 55)
Pitch	High, mid, low, slightly high, slightly low
Speaking rate	Fast, moderate, slow, slightly fast, slightly slow
Emotion	Cheerful, calm, gentle, serious, lively, composed, soothing
Characteristics	Magnetic, crisp, husky, smooth, sweet, rich, powerful
Use case	News broadcasting, advertisement voice-over, audiobook, animated character, voice assistant, documentary narration
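
One way to make sure a description covers several of the dimensions in the table is to assemble it from named fields. The sketch below is illustrative only: the VoiceSpec class and its field names are hypothetical, and the API itself accepts any free-form text in voice_prompt.

```python
from dataclasses import dataclass


@dataclass
class VoiceSpec:
    """One field per dimension in the table above; empty fields are skipped."""
    gender: str
    age: str
    pitch: str = ""
    speaking_rate: str = ""
    emotion: str = ""
    characteristics: str = ""
    use_case: str = ""

    def to_prompt(self) -> str:
        # Lead with emotion, age, and gender, then append the optional dimensions.
        head = f"{self.emotion} {self.age} {self.gender} voice".split()
        parts = [" ".join(head).capitalize()]
        if self.pitch:
            parts.append(f"{self.pitch} pitch")
        if self.speaking_rate:
            parts.append(f"{self.speaking_rate} speaking rate")
        if self.characteristics:
            parts.append(f"{self.characteristics} timbre")
        if self.use_case:
            parts.append(f"suitable for {self.use_case}")
        return ", ".join(parts)


spec = VoiceSpec(
    gender="male", age="middle-aged", pitch="low", speaking_rate="slow",
    emotion="calm", characteristics="magnetic", use_case="documentary narration",
)
print(spec.to_prompt())
# Calm middle-aged male voice, low pitch, slow speaking rate, magnetic timbre, suitable for documentary narration
```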

Examples

Standard broadcasting style: clear and precise pronunciation, well-articulated
Lively young female voice with a fast speaking rate, noticeable rising intonation, suitable for fashion product introductions
Calm middle-aged male with a slow speaking rate, deep and magnetic voice, suitable for news broadcasting or documentary narration
Gentle and intellectual female, around 30 years old, even-toned, suitable for audiobook narration
Cute child's voice, approximately an 8-year-old girl, slightly childish tone, suitable for animated character voice-over

Manage custom voices

Voices created through Voice Design and Voice Cloning share the same management API. You can list voices, view voice details, or delete voices that are no longer needed.

For API endpoints and parameter details, see API reference.

Supported scope

Model availability varies by deployment region.

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

To call the following models, select an API key from the Singapore region:

CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash
Qwen-TTS:
- Qwen3-TTS-VD-Realtime: qwen3-tts-vd-realtime-2026-01-15 (latest snapshot), qwen3-tts-vd-realtime-2025-12-16 (snapshot)
- Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

To call the following models, select an API key from the Beijing region:

CosyVoice: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash
Qwen-TTS:
- Qwen3-TTS-VD-Realtime: qwen3-tts-vd-realtime-2026-01-15 (latest snapshot), qwen3-tts-vd-realtime-2025-12-16 (snapshot)
- Qwen3-TTS-VD: qwen3-tts-vd-2026-01-26 (latest snapshot)

Note

CosyVoice Voice Design is powered by the FunAudioGen-VD model.
The same voice description may produce slightly different voices each time. Generate multiple voices and listen to each one to find the best match.
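
Because API keys, base URLs, and model availability all depend on the region, it can help to keep them in one lookup. The following sketch copies the model IDs and base URLs from this page. The dictionary layout and the resolve_base_url function are hypothetical, and the lists may need updating as new snapshots are released.

```python
# Model IDs shared by both regions, as listed above.
_COMMON_MODELS = {
    "cosyvoice-v3-plus",
    "cosyvoice-v3-flash",
    "qwen3-tts-vd-realtime-2026-01-15",
    "qwen3-tts-vd-realtime-2025-12-16",
    "qwen3-tts-vd-2026-01-26",
}

REGIONS = {
    "singapore": {
        "base_url": "https://dashscope-intl.aliyuncs.com/api/v1",
        "models": set(_COMMON_MODELS),
    },
    "beijing": {
        "base_url": "https://dashscope.aliyuncs.com/api/v1",
        "models": _COMMON_MODELS | {"cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash"},
    },
}


def resolve_base_url(region: str, model: str) -> str:
    """Return the base URL for a region, or raise if the model is not listed there."""
    config = REGIONS[region]
    if model not in config["models"]:
        raise ValueError(f"{model} is not listed for the {region} region")
    return config["base_url"]


print(resolve_base_url("beijing", "cosyvoice-v3.5-plus"))
```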

API reference

Voice Design

FAQ

Q: Does the same voice description always produce the same voice?

Not necessarily. Voice Design is a generative process, so the same description may produce slightly different voices each time. Generate multiple voices, listen to each one, and choose the one that sounds best.
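
Generating several candidates from one description can be wrapped in a small loop. The sketch below is an assumption-laden illustration: create_candidates is a hypothetical helper, and create_voice stands for any function that takes a description and a preferred name and returns a voice ID, for example a thin wrapper around the create_voice_by_design function from the Qwen-TTS example. Each creation is a separate API call, so keep the count small and delete the voices you do not keep.

```python
from typing import Callable, Dict


def create_candidates(create_voice: Callable[[str, str], str],
                      voice_prompt: str,
                      count: int = 3,
                      name_prefix: str = "candidate") -> Dict[str, str]:
    """Create several voices from the same description and map each name to its voice ID."""
    voices = {}
    for index in range(1, count + 1):
        name = f"{name_prefix}_{index}"
        voices[name] = create_voice(voice_prompt, name)
    return voices


# Example with a stand-in function; replace it with a real API wrapper such as:
#   lambda prompt, name: create_voice_by_design(prompt, preferred_name=name)
def fake_create_voice(prompt: str, name: str) -> str:
    return f"voice-{name}"


print(create_candidates(fake_create_voice, "Calm middle-aged male voice", count=2))
```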

Q: What languages can I use for the voice description?

Currently, the voice description (voice_prompt) supports Chinese and English only. However, the generated voice can synthesize speech in multiple languages.

Q: What's the difference between Voice Design and Voice Cloning?

Voice Design creates a voice from scratch using a voice description, with no audio samples required. It's ideal for designing entirely new voice identities. Voice Cloning replicates a voice from real audio samples and is best for reproducing a specific person's voice. For details, see Voice cloning.