
Platform for AI: Quickly deploy a CosyVoice 2.0 WebUI service

Last Updated: Dec 08, 2025

CosyVoice 2.0 is a next-generation, high-fidelity speech synthesis model developed by Alibaba DAMO Academy. It supports voice cloning, which lets you clone a target voice from a prompt audio clip of less than 30 seconds, as well as cross-lingual voice replication. CosyVoice 2.0 is suitable for scenarios such as customer service conversations, audiobook narration, and short video voiceovers. Elastic Algorithm Service (EAS) on Alibaba Cloud Platform for AI (PAI) provides a visual WebUI built on this model, so you can quickly deploy a cloud-based speech inference service. This topic describes how to deploy a CosyVoice 2.0 service on PAI-EAS and use the inference service to generate audio.

Background information

CosyVoice 2.0 is designed to create natural, friendly, and expressive AI voices. Trained on large-scale speech corpora with fine-grained prosody modeling, CosyVoice 2.0 achieves a level of expressiveness comparable to human streamers. Whether it is a warm greeting in customer service or an emotional reading of audio content, CosyVoice 2.0 can generate warm and natural speech. This eliminates the cold, synthetic feel and provides a more emotionally engaging listening experience.

CosyVoice 2.0 has the following advantages:

  • Natural and friendly voice: Avoids a robotic tone by emulating the rhythm, emotion, and prosody of human speech.

  • Adaptable to multiple scenarios: Supports customer service conversations, audiobook narration, short video voiceovers, e-commerce voice recommendations, and more.

  • High efficiency and low latency: Lightweight cloud deployment for fast generation of fluent speech.

  • Highly controllable: Supports adjustments for tone and emotion, and provides character customization to create a unique brand voice.

The CosyVoice 2.0 WebUI service deployed in this topic is for trial purposes only. You can also use the high-concurrency version of CosyVoice 2.0 for high-performance inference. For more information, see Quickly deploy a high-performance service with a separate frontend and backend.

Limitations

The Pre-trained Voice inference mode is not currently supported.

Billing

When you deploy the CosyVoice 2.0 image service, you are charged only for the computing resources and the system disk. If you no longer need the service, click Stop in its Operation column to prevent further charges. For more information about billing, see Billing of Elastic Algorithm Service (EAS).

Deploy the CosyVoice 2.0 service

Method 1: Scenario-based deployment (recommended)

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click Deploy CosyVoice for AI Speech Generation.

  3. Configure the following key parameters:

    Basic Information

    • Version selection: Select Standard Edition.

    Environment Information

    • Image Version: Select an image based on the resource type. In this example, cosyvoice-webui:0.2.0-pytorch2.3.1-gpu-py310-cu128-ubuntu22.04 is selected.

      Note: Due to rapid iterations, select the latest image version during deployment.

    • Command: After you select an image version, the system automatically configures the run command /bin/bash /tmp/entry.sh --action=start_webui --port=9000 --data_dir=/mnt/data/ --model_dir=/nasmnt/models/pretrained_models/CosyVoice2-0.5B/ --ttsfrd_dir=/nasmnt/models/pretrained_models/CosyVoice-ttsfrd/ --workers 1. The parameters are described as follows:

      • --port: The service port number. It must be the same as the port number configured for the EAS service.

      • --data_dir: The mount directory for storing reference audio and models. The default value is /mnt/data. If you mount a storage volume, this path must be consistent with the mount path set in Storage Mount.

      • --model_dir: The model loading directory.

      • --workers: The number of workers for the built-in frontend service. If not specified, the system configures this parameter automatically based on the resource specifications. To access the WebUI page from a browser, set this parameter to --workers 1.

      The following parameter is also supported:

      • --gpu_memory_utilization: Sets the upper limit for GPU memory utilization.

    • Port Number: After you select an image version, the system automatically sets the port number to 9000. You do not need to modify it.

    Resource Information

    • Resource Type: In this example, Public Resources is selected. You can also select other resource types as needed.

    • Instances: Set this parameter to 1.

    • Deployment Resources: The resource specification must be a GPU-accelerated instance type, such as ecs.gn8is.4xlarge or ml.gu8is.c16m128.1-gu60.

    • Configure system disk: The image file is large. To prevent service deployment failure due to insufficient storage space, set the system disk size to 100 GiB. If you do not set the size manually, the EAS backend allocates 100 GiB of storage space to the CosyVoice 2.0 scenario by default.

    Network Information

    • VPC configuration: Optional. To access the service through a VPC direct connection or to configure public network access for the service, you must configure a virtual private cloud (VPC). Select a VPC, a vSwitch, and a security group from the drop-down lists. For information about how to create them, see Create and manage a VPC and Manage security groups.

  4. After you configure the parameters, click Deploy.

    Pulling the image takes about 5 to 10 minutes. The service is deployed when the Service Status changes to Running.
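Once the service status is Running, you can confirm that the service is reachable before opening the WebUI. The sketch below, using only the Python standard library, builds an authenticated GET request against the service endpoint. The endpoint URL and token shown here are placeholders, and are assumed to come from the service's invocation information in the console; EAS services authenticate calls through a token passed in the Authorization header.

```python
import urllib.request


def build_health_check(endpoint: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for the EAS service root.

    Both the endpoint URL and the token are copied from the
    service's invocation information in the PAI console.
    """
    return urllib.request.Request(
        endpoint,
        headers={"Authorization": token},
        method="GET",
    )


# Placeholders -- replace with the values shown for your service.
req = build_health_check(
    "http://<service-name>.<region>.pai-eas.aliyuncs.com/",
    "<your-service-token>",
)
# Calling urllib.request.urlopen(req) should succeed once the
# service status is Running.
```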

Method 2: Custom deployment

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following key parameters. For more information about the other parameters, see Custom deployment.

    Environment Information

    • Deployment Method: Select Image-based Deployment and select the Enable Web App check box.

    • Image Configuration: From the Alibaba Cloud Image list, select cosyvoice-webui > cosyvoice-webui:0.2.0-pytorch2.3.1-gpu-py310-cu128-ubuntu22.04.

      Note: Due to rapid iterations, select the latest image version during deployment.

    • Command: After you select an image, the system automatically configures the run command /bin/bash /tmp/entry.sh --action=start_webui --port=9000 --data_dir=/mnt/data/ --model_dir=/nasmnt/models/pretrained_models/CosyVoice2-0.5B/ --ttsfrd_dir=/nasmnt/models/pretrained_models/CosyVoice-ttsfrd/ --workers 1. The parameters are described as follows:

      • --port: The service port number. It must be the same as the port number configured for the EAS service.

      • --data_dir: The mount directory for storing reference audio and models. The default value is /mnt/data. If you mount a storage volume, this path must be consistent with the mount path set in Storage Mount.

      • --model_dir: The model loading directory.

      • --workers: The number of workers for the built-in frontend service. If not specified, the system configures this parameter automatically based on the resource specifications. To access the WebUI page from a browser, set this parameter to --workers 1.

      The following parameter is also supported:

      • --gpu_memory_utilization: Sets the upper limit for GPU memory utilization.

    • Port Number: After you select an image, the system automatically sets the port number to 9000. You do not need to modify it.

    Resource Information

    • Resource Type: In this example, Public Resources is selected. You can also select other resource types as needed.

    • Instances: Set this parameter to 1.

    • Deployment Resources: The resource specification must be a GPU-accelerated instance type, such as ecs.gn8is.4xlarge or ml.gu8is.c16m128.1-gu60.

    • Configure system disk: The image file is large. To prevent service deployment failure due to insufficient storage space, set the system disk size to 100 GiB. If you do not set the size manually, the EAS backend allocates 100 GiB of storage space to the CosyVoice 2.0 scenario by default.

    Network Information

    • VPC configuration: Optional. To access the service through a VPC direct connection or to configure public network access for the service, you must configure a VPC. Select a VPC, a vSwitch, and a security group from the drop-down lists. For information about how to create them, see Create and manage a VPC and Manage security groups.

  4. After you configure the parameters, click Deploy.

    Pulling the image takes about 5 to 10 minutes. The service is deployed when the Service Status changes to Running.

Generate audio with the inference service

Call the service using an API operation

You can also call the service using an API operation to generate audio. For more information, see API operations.
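As a rough illustration of the calling pattern, the sketch below builds a JSON POST request to the service endpoint with the EAS authorization token, using only the Python standard library. The endpoint, token, and the payload field name (text) are hypothetical placeholders; the actual request schema is defined in the API operations reference for this service.

```python
import json
import urllib.request


def build_tts_request(endpoint, token, payload):
    """Build an authenticated POST request to the CosyVoice EAS service.

    The JSON body here is illustrative only -- consult the API
    operations reference for the real request schema.
    """
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={
            "Authorization": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical payload field -- replace per the API reference.
req = build_tts_request(
    "http://<service-name>.<region>.pai-eas.aliyuncs.com/",
    "<your-service-token>",
    {"text": "Hello, world."},
)
# Calling urllib.request.urlopen(req).read() would return the
# response body from the deployed service.
```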