Platform for AI: Deploy MoE models using expert parallelism and Prefill-Decode separation

Last Updated: Nov 04, 2025

Mixture-of-experts (MoE) models use a sparse activation mechanism to reach trillion-parameter scales while reducing computational costs. However, this design poses challenges for traditional inference deployment. Expert parallelism (EP) is a distributed parallelism strategy designed for MoE models: it places different experts on separate GPUs and dynamically routes requests to them. This resolves GPU memory bottlenecks, improves parallel computing performance, and significantly lowers deployment costs. This topic describes how to enable EP and Prefill-Decode (PD) separation for MoE models on Platform for AI (PAI) Elastic Algorithm Service (EAS) to achieve higher inference throughput and better cost-effectiveness.
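Expert parallelism builds on the routing step inside every MoE layer. The following minimal sketch is for illustration only and is not part of any PAI or vLLM API (route_tokens, gate_weights, and top_k are made-up names): a gating network scores each token against all experts and keeps only the top-k, so compute stays sparse; under EP, each expert resides on its own GPU and tokens are dispatched to the devices that host their selected experts.

    # Illustrative sketch of MoE expert routing (not PAI/EAS code).
    import numpy as np

    def route_tokens(token_states, gate_weights, top_k=2):
        # token_states: (num_tokens, hidden), gate_weights: (hidden, num_experts)
        logits = token_states @ gate_weights                  # gating scores per token and expert
        top_experts = np.argsort(-logits, axis=1)[:, :top_k]  # keep only the top-k experts per token
        return top_experts                                    # dispatch targets (expert index -> hosting GPU)

    tokens = np.random.randn(4, 16)    # 4 tokens, hidden size 16
    gate = np.random.randn(16, 8)      # gating weights for 8 experts
    print(route_tokens(tokens, gate))  # e.g. [[3 5] [0 7] ...]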

Solution architecture

Alibaba Cloud Platform for AI (PAI) provides Elastic Algorithm Service (EAS), which supports production-grade EP deployment. EAS integrates technologies such as PD separation, large-scale EP, computation-communication co-optimization, and multi-token prediction (MTP) to deliver multi-dimensional, joint optimization.


Benefits:

  • One-click deployment: EAS provides EP deployment templates with built-in images, optional resources, and run commands. This simplifies complex distributed deployments into a wizard-based process and removes the need to manage the underlying implementation.

  • Aggregated service management: You can independently monitor, scale, and manage the lifecycle of sub-services, such as Prefill, Decode, and the LLM intelligent router, from a unified view.

Deploy an EP service

This section uses the DeepSeek-R1-0528-PAI-optimized model as an example. This PAI-optimized model provides higher throughput and lower latency. Follow these steps:

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click LLM Deployment.

  3. In the Model Configuration section, select the public model DeepSeek-R1-0528-PAI-optimized.


  4. Set the Inference Engine to vLLM and the Deployment Template to EP+PD Separation-PAI Optimized.


  5. Configure deployment resources for the Prefill and Decode services. You can select public resources or a resource quota.

    • Public resources: Suitable for quick trials and development testing. The available specifications are ml.gu8tea.8.48xlarge or ml.gu8tef.8.46xlarge.

    • Resource quota: Recommended for production environments to ensure resource stability and isolation. You cannot select this option if no resource quotas are available.


  6. (Optional) Adjust deployment parameters to optimize performance.

    • Number of Instances: Adjust the number of instances for Prefill and Decode to change the PD ratio. The default number of instances in the deployment template is 1.

    • Parallelism parameters: You can adjust the parallelism parameters for the Prefill and Decode services, such as EP_SIZE, DP_SIZE, and TP_SIZE, in the environment variables. The deployment template sets the default value of TP_SIZE for Prefill to 8, and the default values of EP_SIZE and DP_SIZE for Decode to 8 (see the sketch after the following note).

      Note

      To protect the model weights of DeepSeek-R1-0528-PAI-optimized, the platform does not expose the run command for the inference engine. You can modify important parameters using environment variables.

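      The sketch below only restates the template defaults above as key-value pairs for quick reference. The variable names (TP_SIZE, EP_SIZE, DP_SIZE) come from the deployment template; grouping them into prefill_env and decode_env dictionaries is purely illustrative and does not correspond to any PAI setting.

          # Template defaults, restated for reference (illustrative grouping only).
          prefill_env = {"TP_SIZE": "8"}                  # Prefill: tensor parallel degree 8
          decode_env = {"EP_SIZE": "8", "DP_SIZE": "8"}   # Decode: expert and data parallel degree 8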

  7. Click Deploy and wait for the service to start. This process takes about 40 minutes.

  8. Verify the service status. After the deployment is complete, go to the Online Debugging tab on the service details page to test whether the service is running correctly.

    Note

    For more information about API calls and third-party application integration, see Call an LLM service.

    Construct a request that follows the OpenAI format. Append /v1/chat/completions to the URL path. The request body is as follows:

    {
        "model": "",
        "messages": [
            {
                "role": "user",
                "content": "Hello!"
            }
        ],
        "max_tokens": 1024
    }

    Click Send Request. A response status of 200 and a successful answer from the model indicate that the service is running correctly.

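    Outside the console, you can send the same request from code. The following Python sketch assumes an OpenAI-compatible endpoint; <EAS_SERVICE_URL> and <EAS_TOKEN> are placeholders for the endpoint and token of your service, and the exact header format should be confirmed in Call an LLM service.

        # Minimal sketch of calling the service programmatically (placeholders, not real values).
        import requests

        url = "<EAS_SERVICE_URL>/v1/chat/completions"   # service endpoint + OpenAI-style path
        headers = {
            "Content-Type": "application/json",
            "Authorization": "<EAS_TOKEN>",             # service token; confirm the header format in Call an LLM service
        }
        payload = {
            "model": "",
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 1024,
        }
        resp = requests.post(url, headers=headers, json=payload)
        print(resp.status_code)   # 200 indicates the service responded successfully
        print(resp.json()["choices"][0]["message"]["content"])  # OpenAI-style response body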

Manage an EP service

  1. On the service list page, click a service name to go to its details page for fine-grained management. This page provides views for the overall aggregated service and for sub-services, such as Prefill, Decode, and the LLM intelligent router.


  2. You can view service monitoring data and logs, and configure auto-scaling policies.
