
Platform For AI: Multi-machine distributed inference

Last Updated: Dec 04, 2025

Ultra-large-scale MoE models, such as DeepSeek-R1 671B, exceed the capacity of a single machine. To address this, Elastic Algorithm Service (EAS) provides a multi-machine distributed inference solution that allows a single service instance to span multiple machines, enabling the efficient deployment and operation of ultra-large-scale models. This topic describes how to configure multi-machine distributed inference.

Usage notes

Some SGLang and vLLM images officially provided by EAS or Model Gallery natively support distributed inference. To deploy distributed inference with a custom image, you must follow the networking specifications of the distributed inference framework and the basic paradigms of distributed processing. For more information, see How it works.

How it works

The following content introduces the basic concepts and principles of distributed inference:

Instance unit

Compared with standard EAS inference services, distributed inference services introduce the concept of an instance unit (referred to as a Unit below). Within a Unit, instances coordinate over high-performance network communication, using patterns such as tensor parallelism (TP) and pipeline parallelism (PP), to process requests together. Instances within a Unit are stateful, whereas different Units are completely symmetric and stateless.

Instance number

Each instance in a Unit is assigned an instance number through environment variables (for the environment variables available to each instance in a Unit and their descriptions, see Appendix). Instances are numbered sequentially, so the number can be used to control which tasks each instance performs.
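
For example, a custom startup command can branch on the instance number so that instance 0 and the other instances run different startup logic. The following is a minimal sketch; start_head.sh and start_worker.sh are hypothetical placeholders for your own scripts:

# Minimal sketch: branch on the instance number in a custom startup command.
# RANK_ID is injected by EAS (see Appendix); the two scripts below are hypothetical.
if [ "${RANK_ID}" = "0" ]; then
  ./start_head.sh      # instance 0: coordinator that receives user traffic
else
  ./start_worker.sh    # other instances: join the distributed group as workers
fi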

Traffic handling

By default, the entire Unit receives traffic only through instance 0 (the instance whose RANK_ID is 0). The system's service discovery mechanism distributes user traffic to instance 0 of each Unit, and the traffic is then processed in a distributed manner within that Unit. Different Units handle user traffic independently without interference.

Rolling updates

During rolling updates, the Unit is rebuilt as a whole, with all instances in the new Unit created in parallel. After all instances in the new Unit are ready, traffic is first diverted from the Unit to be deleted, and then all instances in that Unit are deleted.

Lifecycle

Unit rebuilding

When a Unit is rebuilt entirely, all instances in the old Unit are deleted in parallel, and all instances in the new Unit are created in parallel, without special handling based on instance numbers.

Instance rebuilding

By default, the lifecycle of each instance in a Unit aligns with that of instance 0. When instance 0 is rebuilt, it triggers the rebuilding of all other instances in the Unit. When a non-0 instance is rebuilt, other instances in the Unit are not affected.

Distributed fault tolerance

  • Instance exception handling mechanism

    • When an exception is detected in an instance of a distributed service, the system automatically triggers the restart of all instances in the Unit.

    • Functionality: Effectively resolves cluster state inconsistencies caused by the failure of a single instance, ensuring that all instances restart from a completely reset runtime environment.

  • Coordinated recovery mechanism

    • After instances restart, the system waits for all instances in the Unit to reach a consistent ready state (implemented through a synchronization barrier), and only starts business processes after all instances are ready.

    • Functionality:

      • Prevents failures when establishing NCCL communication groups that are caused by inconsistent instance states.

      • Ensures strict startup synchronization of all nodes participating in distributed inference tasks.

The distributed fault tolerance feature is disabled by default. To enable it, configure unit.guard, as shown in the following example:

{
  "unit": {
    "size": 2,
    "guard": true
  }
}

Configure multi-machine distributed inference

EAS custom deployment

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).

    • Deploy Service: On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

    • Update Service: On the Inference Service tab, find the service you want to operate in the service list, and then click Update in the Actions column.

  2. Configure the following key parameters. For information about the other parameters, see Custom deployment.

    • In the Environment Information section, configure the image and command:

      • Image Configuration: Choose Alibaba Cloud Image and select vllm:0.7.1 or sglang:0.4.1.

      • Command: The system automatically sets the command after you select the image. No modification is necessary.

    • In the Resource Configuration section, turn on the Distributed Inference option and configure the following key parameters:

      • Machines for Each Instance: The number of machines for a single model inference instance. The minimum value is 2.

      • RDMA Network: Specifies whether to enable an RDMA network to ensure efficient network connectivity between machines. Note: Currently, only services deployed using Lingjun resources can use the RDMA network.

  3. Click Deploy or Update.
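
If you manage the service through its JSON configuration instead of the console, the Machines for Each Instance option appears to correspond to the unit.size field shown in the unit.guard example earlier in this topic. The following minimal sketch shows only this part; all other required service fields are omitted:

{
  "unit": {
    "size": 2
  }
}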

Deploy in Model Gallery with one click

Distributed inference is only supported when the Deployment Method is SGLang Accelerate Deployment or vLLM Accelerate Deployment.

For models with a large number of parameters, when you deploy a model service in Model Gallery with one click and select SGLang Accelerate Deployment or vLLM Accelerate Deployment as the deployment method, EAS automatically enables Distributed Inference. You can click Modify Configurations to adjust the number of machines used by a single instance.

Appendix

When you deploy a distributed inference service, the instances typically need to form a distributed group (for example, by using Torch distributed or Ray). When a VPC or RDMA is configured, each instance has multiple network cards, so you must specify which network card to use for this communication.

  • When RDMA is configured, the RDMA network card (net0) is used by default.

  • When RDMA is not configured, the network card corresponding to the user-configured VPC (eth1) is used.

The related configuration is passed to each instance through the following environment variables, which you can use in your startup commands:

  • RANK_ID: The instance number, starting from 0 and incrementing. Example value: 0

  • COMM_IFNAME: The default network card used for networking. When RDMA is configured, the value is net0. When RDMA is not configured, the value is eth1 (the network card where the user-configured VPC is located). Example value: net0

  • RANK_IP: The IP address used for networking, which is the IP address of the COMM_IFNAME network card. Example value: 11.*.*.*

  • MASTER_ADDRESS: The IP address of the instance whose RANK_ID is 0, that is, the IP address of the COMM_IFNAME network card of instance 0. Example value: 11.*.*.*
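
For example, a custom startup command can pass these variables to a distributed launcher such as torchrun. The following is a minimal sketch that assumes 2 machines per Unit and 8 GPUs per machine; the master port, GPU count, and the entry script server.py are placeholder assumptions:

# Minimal sketch: start a torch.distributed job with the EAS-injected variables.
# Assumes 2 machines per Unit and 8 GPUs per machine; server.py is a hypothetical entry script.
export NCCL_SOCKET_IFNAME=${COMM_IFNAME}   # bind NCCL traffic to the network card selected by EAS
export GLOO_SOCKET_IFNAME=${COMM_IFNAME}

torchrun \
  --nnodes=2 \
  --node_rank=${RANK_ID} \
  --master_addr=${MASTER_ADDRESS} \
  --master_port=29500 \
  --nproc_per_node=8 \
  server.py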