
Platform For AI: Deploy MoE models with expert parallelism

Last Updated: Mar 11, 2026

Deploy Mixture-of-Experts (MoE) models on PAI Elastic Algorithm Service (EAS) with expert parallelism (EP) and Prefill-Decode (PD) separation to achieve higher throughput and lower costs.

Architecture

PAI Elastic Algorithm Service (EAS) supports production-grade expert parallelism (EP) deployment by integrating Prefill-Decode (PD) separation, large-scale EP, computation-communication co-optimization, and multi-token prediction (MTP).
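PD separation routes the two phases of a request to different instance pools because prefill is compute-bound while decode is memory-bandwidth-bound. A toy illustration of the split, not the actual EAS implementation:

```python
# Toy illustration of Prefill-Decode separation (not the EAS implementation).
# The prefill pool processes the full prompt once to build the KV cache;
# the decode pool then generates tokens one at a time against that cache.

def prefill(prompt_tokens):
    """Compute-bound phase: process all prompt tokens in one pass."""
    kv_cache = [("k", t) for t in prompt_tokens]  # stand-in for attention KV pairs
    first_token = f"tok_after_{prompt_tokens[-1]}"
    return kv_cache, first_token

def decode(kv_cache, last_token, steps):
    """Memory-bandwidth-bound phase: one token per step, reusing the cache."""
    out = [last_token]
    for _ in range(steps):
        kv_cache.append(("k", out[-1]))
        out.append(f"tok_after_{out[-1]}")
    return out

cache, first = prefill(["hello", "world"])
tokens = decode(cache, first, steps=3)
```

Running the two phases on separate instance pools lets each be scaled and batched according to its own bottleneck, which is what the PD ratio in the deployment steps below controls.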


Key benefits:

  • One-click deployment: EAS provides EP deployment templates with built-in images, optional resources, and run commands, simplifying complex distributed deployments into a wizard-based process.

  • Unified service management: Independently monitor, scale, and manage sub-services (Prefill, Decode, and LLM intelligent router) from a single view.

Deploy EP service

This section uses DeepSeek-R1-0528-PAI-optimized as an example. This PAI-optimized model provides higher throughput and lower latency.

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click LLM Deployment.

  3. In the Model Settings section, select the public model DeepSeek-R1-0528-PAI-optimized.


  4. Set Inference Engine to vLLM and Deployment Template to EP+PD Separation-PAI Optimized.


  5. Configure deployment resources for the Prefill and Decode services. Select public resources or a resource quota.

    • Public resources: Suitable for quick trials and development testing. The available specifications are ml.gu8tea.8.48xlarge and ml.gu8tef.8.46xlarge.

    • Resource quota: Recommended for production environments to ensure resource stability and isolation. This option is unavailable if no resource quota has been configured.


  6. (Optional) Adjust deployment parameters to optimize performance.

    • Number of instances: Adjust the number of instances for Prefill and Decode to change the PD ratio. Default is 1 instance per service.

    • Parallelism parameters: Adjust parallelism parameters for Prefill and Decode services (EP_SIZE, DP_SIZE, TP_SIZE) in environment variables. Default values: TP_SIZE=8 for Prefill, EP_SIZE=8 and DP_SIZE=8 for Decode.

      Note

      To protect model weights of DeepSeek-R1-0528-PAI-optimized, the platform does not expose the inference engine run command. Modify important parameters using environment variables.

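These defaults can be sanity-checked for mutual consistency before deploying. A minimal sketch, assuming one 8-GPU instance per sub-service and the common vLLM convention that TP_SIZE × DP_SIZE equals the per-instance GPU count; validate_parallel_config is a hypothetical helper, not a PAI or vLLM API:

```python
# Hypothetical sanity check for EP/DP/TP settings; not a PAI or vLLM API.
# Assumes one 8-GPU instance per sub-service (e.g. ml.gu8tea.8.48xlarge).
def validate_parallel_config(gpus_per_instance, tp_size=1, dp_size=1, ep_size=1):
    world_size = tp_size * dp_size
    if world_size != gpus_per_instance:
        raise ValueError(
            f"TP_SIZE * DP_SIZE = {world_size}, expected {gpus_per_instance}"
        )
    if gpus_per_instance % ep_size != 0:
        raise ValueError(f"EP_SIZE = {ep_size} must divide the GPU count")
    return True

# Template defaults: TP_SIZE=8 for Prefill; EP_SIZE=8 and DP_SIZE=8 for Decode.
validate_parallel_config(8, tp_size=8)             # Prefill instance
validate_parallel_config(8, dp_size=8, ep_size=8)  # Decode instance
```

Checking the values this way before saving the environment variables avoids a failed rollout caused by an inconsistent parallelism layout.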

  7. Click Deploy and wait for the service to start. Deployment takes approximately 40 minutes.

  8. Verify the service status. After deployment completes, go to the Online Debugging tab on the service details page to test whether the service runs correctly.

    Note

    For more information about API calls and third-party application integration, see Call an LLM service.

    Construct a request following the OpenAI format. Append /v1/chat/completions to the URL path. Request body:

    {
        "model": "",
        "messages": [
            {
                "role": "user",
                "content": "Hello!"
            }
        ],
        "max_tokens": 1024
    }

    Click Send Request. A response status of 200 and a successful model answer indicate the service runs correctly.

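The same request can also be sent programmatically once the service is running. A minimal sketch using only the Python standard library; the endpoint and token below are placeholders that must be replaced with the values from your service's invocation information:

```python
import json
from urllib import request

# Placeholders: copy the real endpoint and token from your service's
# invocation information in the EAS console.
EAS_ENDPOINT = "http://<service-endpoint>"
EAS_TOKEN = "<service-token>"

def build_chat_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-format chat completion payload (model left empty,
    matching the console example above)."""
    return {
        "model": "",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed reply."""
    req = request.Request(
        EAS_ENDPOINT + "/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Authorization": EAS_TOKEN, "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

As in the console check, a 200 response containing a model answer indicates the service is healthy.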

Manage EP service

  1. On the service list page, click a service name to open its details page for fine-grained management. This page provides views for the overall aggregated service and for sub-services (Prefill, Decode, and LLM intelligent router).


  2. View service monitoring data and logs, and configure auto-scaling policies.
