CosyVoice voice cloning Python SDK - Alibaba Cloud Model Studio

CosyVoice voice cloning is accessible through the DashScope Python SDK.

Service endpoint

The SDK uses the China (Beijing) region endpoint by default. To switch to a different region, set dashscope.base_http_api_url before you initialize the client.

International

If you select the International deployment scope, model inference compute resources are dynamically scheduled worldwide, excluding the Chinese mainland. Static data is stored in your selected region. Supported region: Singapore.

https://dashscope-intl.aliyuncs.com/api/v1

Chinese mainland

If you select the Chinese mainland deployment scope, model inference compute resources are restricted to the Chinese mainland. Static data is stored in your selected region. Supported region: China (Beijing).

https://dashscope.aliyuncs.com/api/v1

Switch to the Singapore region:

import dashscope

# Set at the beginning of your code
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

Note:

API keys differ between regions. Use the API key that corresponds to the target region.
The region setting is global and affects all DashScope SDK API calls.
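
The two notes above can be combined into one small setup step. The following is a minimal sketch, not part of the SDK: the helper name configure_region, the scope labels "intl" and "cn", and the two environment variable names are hypothetical. It only assigns the module-level attributes base_http_api_url and api_key on whatever object is passed in as sdk.

```python
from typing import Mapping

# Endpoints as listed above; the scope keys "intl" and "cn" are hypothetical labels.
ENDPOINTS = {
    "intl": "https://dashscope-intl.aliyuncs.com/api/v1",
    "cn": "https://dashscope.aliyuncs.com/api/v1",
}

# Hypothetical environment variable names: one API key per region.
KEY_ENV_VARS = {
    "intl": "DASHSCOPE_API_KEY_INTL",
    "cn": "DASHSCOPE_API_KEY_CN",
}


def configure_region(sdk, scope: str, env: Mapping[str, str]) -> None:
    """Point the SDK module at one region and load that region's API key."""
    if scope not in ENDPOINTS:
        raise ValueError(f"unknown scope {scope!r}; expected one of {sorted(ENDPOINTS)}")
    key_name = KEY_ENV_VARS[scope]
    if not env.get(key_name):
        raise KeyError(f"missing API key: set {key_name}")
    # Both attributes are module-level, so the change affects every later SDK call.
    sdk.base_http_api_url = ENDPOINTS[scope]
    sdk.api_key = env[key_name]
```

With the real SDK, this would be called once at startup as configure_region(dashscope, "intl", os.environ), after import dashscope and import os. Keeping the endpoint and the key in one place makes it harder to pair a Singapore endpoint with a Beijing key by accident.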

VoiceEnrollmentService class

Package path: dashscope.audio.tts_v2.VoiceEnrollmentService

Purpose: Manage the lifecycle of CosyVoice cloned voices, including creating, querying, updating, and deleting voices.

Constructor

VoiceEnrollmentService()

create_voice() — Create a voice

Method signature:

def create_voice(self, target_model: str, prefix: str, url: str,
                 language_hints: List[str] = None,
                 max_prompt_audio_length: float = None,
                 enable_preprocess: bool = None) -> str

Parameters:

Parameter	Type	Required	Description
target_model	str	Yes	The text-to-speech (TTS) model that drives the cloned voice. It must match the model you specify when calling the TTS API; otherwise, synthesis fails.
prefix	str	Yes	A prefix for the voice name. Only alphanumeric characters are allowed, with a maximum length of 10 characters. The resulting voice name follows this format: `{target_model}-{prefix}-{unique_id}`.
url	str	Yes	The URL of the audio file for voice cloning. The URL must be publicly accessible.
language_hints	List[str]	No	Important: Applies only to CosyVoice voice cloning (when model is `voice-enrollment`). Supported only by cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, and cosyvoice-v3-flash. Helps the model identify the language of the sample audio so that it can extract voice features more accurately and improve cloning quality. If the specified language does not match the actual audio language (for example, `en` for audio in Chinese), the system ignores the value and detects the language automatically. The parameter is an array, but the current version processes only the first element. Valid values for cosyvoice-v3-plus: zh (Chinese), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian). Valid values for cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash: zh (Chinese), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian), pt (Portuguese), th (Thai), id (Indonesian), vi (Vietnamese). Default: ["zh"].
max_prompt_audio_length	float	No	Important: Applies only to CosyVoice voice cloning (when model is `voice-enrollment`). Supported only by cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash. The maximum duration, in seconds, of the reference audio after preprocessing. Valid values: [3.0, 30.0]. Longer durations produce better results. Default: 10.0.
enable_preprocess	bool	No	Important: Applies only to CosyVoice voice cloning (when model is `voice-enrollment`). Supported only by cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, and cosyvoice-v3-flash. Whether to enable audio preprocessing (noise reduction, audio enhancement, and volume normalization). Enable it for recordings with background noise. Disable it for recordings made in quiet environments to preserve the original voice characteristics. Default: false.
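
The per-model constraints on the three optional parameters can be checked locally before a request is sent. The sketch below encodes only the values listed in the table; the helper name optional_kwargs is hypothetical and not part of the SDK. It accepts a single language hint because the current version processes only the first array element.

```python
from typing import Any, Dict, Optional

_BASE_HINTS = {"zh", "en", "fr", "de", "ja", "ko", "ru"}
_EXTENDED_HINTS = _BASE_HINTS | {"pt", "th", "id", "vi"}

# Valid language hints per model, as listed in the parameter table.
LANGUAGE_HINTS = {
    "cosyvoice-v3-plus": _BASE_HINTS,
    "cosyvoice-v3.5-plus": _EXTENDED_HINTS,
    "cosyvoice-v3.5-flash": _EXTENDED_HINTS,
    "cosyvoice-v3-flash": _EXTENDED_HINTS,
}

# Models that accept max_prompt_audio_length and enable_preprocess.
PREPROCESS_MODELS = {"cosyvoice-v3.5-plus", "cosyvoice-v3.5-flash", "cosyvoice-v3-flash"}


def optional_kwargs(target_model: str,
                    language_hint: Optional[str] = None,
                    max_prompt_audio_length: Optional[float] = None,
                    enable_preprocess: Optional[bool] = None) -> Dict[str, Any]:
    """Validate the optional parameters and return them as keyword arguments."""
    kwargs: Dict[str, Any] = {}
    if language_hint is not None:
        allowed = LANGUAGE_HINTS.get(target_model)
        if allowed is None:
            raise ValueError(f"{target_model} does not support language_hints")
        if language_hint not in allowed:
            raise ValueError(f"{language_hint!r} is not a valid hint for {target_model}")
        kwargs["language_hints"] = [language_hint]  # only the first element is used
    if max_prompt_audio_length is not None:
        if target_model not in PREPROCESS_MODELS:
            raise ValueError(f"{target_model} does not support max_prompt_audio_length")
        if not 3.0 <= max_prompt_audio_length <= 30.0:
            raise ValueError("max_prompt_audio_length must be within [3.0, 30.0]")
        kwargs["max_prompt_audio_length"] = float(max_prompt_audio_length)
    if enable_preprocess is not None:
        if target_model not in PREPROCESS_MODELS:
            raise ValueError(f"{target_model} does not support enable_preprocess")
        kwargs["enable_preprocess"] = bool(enable_preprocess)
    return kwargs
```

The returned dictionary contains only the parameters that were explicitly set, so it can be unpacked into a create_voice() call without overriding the service defaults.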

Return value: str — The voice ID (voice_id).
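
A call to create_voice() might be wrapped as follows. This is a sketch under stated assumptions: the wrapper name create_cloned_voice is hypothetical, and "alphanumeric" is interpreted here as ASCII letters and digits. The service argument is expected to be a VoiceEnrollmentService instance, but any object with a compatible create_voice() method works.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits; maximum 10 characters.
_PREFIX_RE = re.compile(r"[A-Za-z0-9]{1,10}")


def create_cloned_voice(service, target_model: str, prefix: str, url: str, **options) -> str:
    """Check the prefix locally, then create the voice and return its voice ID."""
    if not _PREFIX_RE.fullmatch(prefix):
        raise ValueError("prefix must be 1 to 10 alphanumeric characters")
    # Keyword arguments follow the create_voice() signature shown above.
    return service.create_voice(
        target_model=target_model,
        prefix=prefix,
        url=url,
        **options,
    )
```

With the real SDK, service would be created with VoiceEnrollmentService() after importing it from dashscope.audio.tts_v2. The returned voice ID must later be used with the same model that was passed as target_model.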

list_voice() — List voices

Method signature:

def list_voice(self, prefix: str = None, page_index: int = 0, page_size: int = 10) -> list

Parameters:

Parameter	Type	Required	Description
prefix	str	No	Filter by voice name prefix.
page_index	int	No	Page index. Default: 0.
page_size	int	No	Number of entries per page. Default: 10.

Return value: list — A list of voices.
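
Because list_voice() returns one page at a time, collecting every voice requires a loop over page_index. The sketch below is one way to do that; the helper name iter_voices is hypothetical, and it assumes that a page shorter than page_size (or an empty page) marks the end of the results, which the text above does not state explicitly.

```python
from typing import Any, Iterator, Optional


def iter_voices(service, prefix: Optional[str] = None, page_size: int = 10) -> Iterator[Any]:
    """Yield every entry returned by list_voice(), one page at a time."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    page_index = 0  # the default in the signature suggests pages start at 0
    while True:
        page = service.list_voice(prefix=prefix, page_index=page_index, page_size=page_size)
        if not page:
            return
        yield from page
        # Assumption: a short page means there are no further pages.
        if len(page) < page_size:
            return
        page_index += 1
```

The entries are yielded unchanged, so the sketch makes no assumption about the fields each entry contains.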

query_voice() — Query voice details

Method signature:

def query_voice(self, voice_id: str) -> dict

Parameters:

Parameter	Type	Required	Description
voice_id	str	Yes	The voice ID to query.

Return value: dict — Voice details.
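
The fields of the returned dict are not listed here, so the following sketch leaves their interpretation to the caller. It polls query_voice() until a caller-supplied predicate accepts the result. The helper name wait_for_voice and the idea that a voice may need time before it is usable are assumptions, not documented behavior.

```python
import time
from typing import Callable


def wait_for_voice(service, voice_id: str, is_ready: Callable[[dict], bool],
                   timeout: float = 60.0, interval: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll query_voice() until is_ready(details) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        details = service.query_voice(voice_id)
        if is_ready(details):
            return details
        if time.monotonic() >= deadline:
            raise TimeoutError(f"voice {voice_id} was not ready after {timeout} seconds")
        sleep(interval)
```

A caller would pass a predicate that matches the actual response, for example lambda d: d.get("status") == "OK", where the key "status" and the value "OK" are hypothetical and should be checked against a real response first.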

update_voice() — Update a voice

Method signature:

def update_voice(self, voice_id: str, url: str, language_hints: List[str] = None,
                 max_prompt_audio_length: float = None, enable_preprocess: bool = None) -> None

Parameters:

Parameter	Type	Required	Description
voice_id	str	Yes	The voice ID to update.
url	str	Yes	The new audio file URL.
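language_hints	List[str]	No	Language hints for the new sample audio. See create_voice() for the valid values.
max_prompt_audio_length	float	No	Maximum duration, in seconds, of the reference audio. See create_voice() for the valid range.
enable_preprocess	bool	No	Whether to enable audio preprocessing. See create_voice() for details.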
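
Since update_voice() returns None, a caller that wants to see the result has to query the voice afterwards. The sketch below chains the two calls; the helper name update_and_fetch is hypothetical. The URL check only rejects malformed values and cannot prove that the file is publicly accessible.

```python
from urllib.parse import urlparse


def update_and_fetch(service, voice_id: str, url: str, **options) -> dict:
    """Replace the audio for an existing voice, then return its current details."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an HTTP(S) URL: {url!r}")
    # Optional parameters are passed through unchanged (see the signature above).
    service.update_voice(voice_id=voice_id, url=url, **options)
    return service.query_voice(voice_id)
```

Options such as language_hints=["en"] or enable_preprocess=True can be passed as extra keyword arguments.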

delete_voice() — Delete a voice

Method signature:

def delete_voice(self, voice_id: str) -> None

Parameters:

Parameter	Type	Required	Description
voice_id	str	Yes	The voice ID to delete.
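
When several voices have to be removed, delete_voice() can be called in a loop that keeps going after a failure. This is a minimal sketch with a hypothetical helper name, delete_voices. The exception types raised by the SDK are not documented here, so the sketch catches Exception broadly and reports each failure to the caller.

```python
from typing import Dict, Iterable


def delete_voices(service, voice_ids: Iterable[str]) -> Dict[str, Exception]:
    """Delete each voice ID and return a mapping of failed IDs to their errors."""
    failures: Dict[str, Exception] = {}
    for voice_id in voice_ids:
        try:
            service.delete_voice(voice_id)
        except Exception as exc:  # broad on purpose: SDK error types are not listed here
            failures[voice_id] = exc
    return failures
```

An empty result means every deletion call returned without raising. Failed IDs can be retried or logged by the caller.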