Platform For AI: Multi-machine distributed inference

Last Updated: Jan 21, 2026

Ultra-large-scale Mixture of Experts (MoE) models, such as DeepSeek-R1 671B, exceed the capacity of individual devices. To address this, Elastic Algorithm Service (EAS) provides a multi-machine distributed inference solution that allows a single service instance to span multiple machines, enabling efficient deployment and operation of ultra-large-scale models.

Usage notes

Some SGLang and vLLM images officially provided by EAS or Model Gallery natively support distributed inference. To deploy distributed inference with a custom image, the image must follow the networking conventions of the distributed inference framework and the basic distributed processing paradigm. For more information, see How it works.

How it works

This section covers the basic concepts and principles of distributed inference.

Instance unit

Compared with standard EAS inference services, distributed inference services introduce the concept of instance unit (referred to as Unit below). Within a Unit, instances coordinate through high-performance network communication using patterns such as tensor parallelism (TP) and pipeline parallelism (PP) to complete request processing. Instances within a Unit are stateful, while different Units are completely symmetric and stateless.
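
As an illustrative, hedged sketch (the machine and GPU counts below are assumptions for the example, not EAS defaults), the following Python snippet shows one common way a Unit's GPUs can be split between tensor and pipeline parallelism so that a single model replica spans all machines in the Unit:

# Illustrative assumption: a Unit of 2 machines with 8 GPUs each.
machines_per_unit = 2
gpus_per_machine = 8
total_gpus = machines_per_unit * gpus_per_machine

# One common layout: tensor parallelism within a machine,
# pipeline parallelism across machines (other splits are possible).
tensor_parallel_size = gpus_per_machine      # 8-way TP inside each machine
pipeline_parallel_size = machines_per_unit   # 2 pipeline stages, one per machine

# The product of the parallel degrees must cover every GPU in the Unit.
assert tensor_parallel_size * pipeline_parallel_size == total_gpus
print(f"TP={tensor_parallel_size}, PP={pipeline_parallel_size}, total GPUs={total_gpus}")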

Instance number

Each instance in a Unit is assigned an instance number through environment variables (for details, see Appendix). Instances are numbered sequentially starting from 0, and the number determines which tasks each instance performs.
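
For example, a startup script can read the instance number from the RANK_ID environment variable described in the Appendix and branch accordingly. This is a minimal, hedged sketch; start_master and start_worker are hypothetical placeholders, not EAS or framework APIs:

import os

# RANK_ID is injected into every instance of a Unit (see Appendix).
rank_id = int(os.environ.get("RANK_ID", "0"))

def start_master():
    # Hypothetical placeholder: start the head process that receives traffic.
    print("instance 0: starting head process")

def start_worker(rank_id):
    # Hypothetical placeholder: join the distributed group created by instance 0.
    print(f"instance {rank_id}: starting worker process")

if rank_id == 0:
    start_master()
else:
    start_worker(rank_id)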

Traffic handling

By default, the entire Unit receives traffic only through instance 0 (the instance whose RANK_ID is 0). The system's service discovery mechanism distributes user traffic across the instance 0 of each Unit, and each Unit then processes its requests in a distributed manner internally. Different Units handle user traffic independently without interference.

Rolling updates

During rolling updates, the Unit is rebuilt as a whole, with all instances in the new Unit created in parallel. After all instances in the new Unit are ready, traffic is first diverted from the Unit to be deleted, and then all instances in that Unit are deleted.

Lifecycle

Unit rebuilding

When a Unit is rebuilt entirely, all instances in the old Unit are deleted in parallel, and all instances in the new Unit are created in parallel, without special handling based on instance numbers.

Instance rebuilding

By default, the lifecycle of each instance in a Unit aligns with that of instance 0. When instance 0 is rebuilt, it triggers the rebuilding of all other instances in the Unit. When a non-0 instance is rebuilt, other instances in the Unit are not affected.

Distributed fault tolerance

  • Instance exception handling mechanism

    • When an exception is detected in an instance of a distributed service, the system automatically triggers the restart of all instances in the Unit.

    • This effectively resolves cluster state inconsistencies caused by single points of failure, ensuring that all instances have completely reset runtime environments.

  • Coordinated recovery mechanism

    • After instances restart, the system waits for all instances in the Unit to reach a consistent ready state (implemented through a synchronization barrier; see the sketch after the configuration example below), and only starts business processes after all instances are ready.

    • This ensures that:

      • Process-group formation in the NVIDIA Collective Communications Library (NCCL) does not fail because of inconsistent instance states.

      • All nodes participating in the distributed inference task start up in strict synchronization.

Distributed fault tolerance is disabled by default. To enable it, set unit.guard to true:

{
  "unit": {
    "size": 2,
    "guard": true
  }
}
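
The synchronization barrier described above is implemented by the platform, but the same pattern can be sketched at the application level. The following is a hedged illustration only, assuming a torch.distributed process group has already been initialized (for example, as shown in the Appendix):

import torch.distributed as dist

def wait_until_all_instances_ready():
    # Block until every rank in the process group reaches this point,
    # so no instance starts NCCL collectives against peers that are
    # still restarting. EAS performs an equivalent platform-side barrier
    # when unit.guard is enabled.
    dist.barrier()

# Typical usage after model loading and engine setup:
# wait_until_all_instances_ready()
# start_serving()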

Configure multi-machine distributed inference

EAS custom deployment

  1. Log on to the PAI console. Select a region at the top of the page, then select the desired workspace and click Enter Elastic Algorithm Service (EAS).

    • Deploy Service: On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

    • Update Service: On the Inference Service tab, find your target service in the list and click Update in the Actions column.

  2. Configure the following key parameters. For information about the other parameters, see Custom deployment.

    • In the Environment Information section, configure the image and command:

      • Image Configuration: Choose Alibaba Cloud Image and select vllm:0.7.1 or sglang:0.4.1.

      • Command: Automatically set when you select the image. No changes required.

    • In the Resource Configuration section, turn on the Distributed Inference option and configure the following key parameters:

      • Machines for Each Instance: The number of machines for a single inference instance. Minimum: 2.

      • RDMA Network: Enables high-speed RDMA networking between machines for optimal performance. Note: Currently, only services deployed using Lingjun resources can use the RDMA network.

  3. Click Deploy or Update.

One-click deployment in Model Gallery

Distributed inference is supported only when the Deployment Method is SGLang Accelerate Deployment or vLLM Accelerate Deployment.

For models with large parameter counts, when you deploy a model service in Model Gallery with one click, select SGLang Accelerate Deployment or vLLM Accelerate Deployment as the deployment method. EAS then automatically enables Distributed Inference. Click Modify Configurations to adjust the number of machines for a single instance.

Appendix

Distributed inference services typically need to set up inter-node communication (for example, with torch.distributed or Ray). When VPC or RDMA is configured, each instance has multiple network cards, so you must specify which card to use for inter-node communication.

  • When RDMA is configured, the RDMA network card (net0) is used by default.

  • When RDMA is not configured, the network card corresponding to the user-configured VPC (eth1) is used.

Related configurations are passed through environment variables for use in startup commands:

  • RANK_ID: The instance number, starting from 0 and incrementing. Example value: 0

  • COMM_IFNAME: The default network card for inter-node communication. When RDMA is configured, the value is net0. When RDMA is not configured, the value is eth1 (the network card where the user-configured VPC is located). Example value: net0

  • RANK_IP: The IP address of this instance's COMM_IFNAME network card, used for inter-node communication. Example value: 11.*.*.*

  • MASTER_ADDRESS: The IP address of the COMM_IFNAME network card on instance 0 (the master node). Example value: 11.*.*.*
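
As a hedged example of how a custom image might consume these variables, the snippet below maps them onto a torch.distributed initialization and points NCCL at the network card named by COMM_IFNAME. It assumes one process per instance for simplicity; the world_size and port values are user-supplied assumptions, not variables injected by EAS:

import os
import torch.distributed as dist

def init_process_group_from_eas(world_size, port=29500):
    # RANK_ID and MASTER_ADDRESS are injected by EAS (see the table above).
    rank = int(os.environ["RANK_ID"])
    master_address = os.environ["MASTER_ADDRESS"]

    # Make NCCL (and Gloo) use the card selected for inter-node traffic.
    ifname = os.environ.get("COMM_IFNAME", "eth0")
    os.environ.setdefault("NCCL_SOCKET_IFNAME", ifname)
    os.environ.setdefault("GLOO_SOCKET_IFNAME", ifname)

    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_address}:{port}",
        rank=rank,
        world_size=world_size,
    )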