Ultra-large-scale Mixture of Experts (MoE) models, such as DeepSeek-R1 671B, exceed the capacity of individual devices. To address this, Elastic Algorithm Service (EAS) provides a multi-machine distributed inference solution that allows a single service instance to span multiple machines, enabling efficient deployment and operation of ultra-large-scale models.
Usage notes
Some SGLang and vLLM images officially provided by EAS or Model Gallery natively support distributed inference. To deploy distributed inference using a custom image, follow the networking specifications of the distributed inference framework and the basic paradigms of distributed processing. For more information, see How it works.
How it works
This section covers the basic concepts and principles of distributed inference.
Instance unit
Compared with standard EAS inference services, distributed inference services introduce the concept of instance unit (referred to as Unit below). Within a Unit, instances coordinate through high-performance network communication using patterns such as tensor parallelism (TP) and pipeline parallelism (PP) to complete request processing. Instances within a Unit are stateful, while different Units are completely symmetric and stateless.
Instance number
Each instance in a Unit is assigned an instance number through environment variables (for details, see the Appendix). Instances are numbered sequentially starting from 0, and this number determines which tasks each instance executes.
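For example, a custom startup script can read this number from the environment and decide what each machine runs. The following Python sketch only illustrates the pattern; the print statements are placeholders for whatever your inference framework actually launches:

import os

# RANK_ID is injected by EAS for every instance in the Unit (see the Appendix).
rank_id = int(os.environ.get("RANK_ID", "0"))

if rank_id == 0:
    # Instance 0 acts as the Unit's entry point (see Traffic handling below).
    print("rank 0: start the inference frontend here")
else:
    # Other instances join as workers for tensor or pipeline parallelism.
    print(f"rank {rank_id}: start a worker process here")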
Traffic handling
By default, the entire Unit receives traffic only through instance 0 (the instance whose RANK_ID is 0). The system's service discovery mechanism routes user traffic to instance 0 of each Unit, which then processes it in a distributed manner within its Unit. Different Units handle user traffic independently without interference.
Rolling updates
During rolling updates, the Unit is rebuilt as a whole, with all instances in the new Unit created in parallel. After all instances in the new Unit are ready, traffic is first diverted from the Unit to be deleted, and then all instances in that Unit are deleted.
Lifecycle
Unit rebuilding
When a Unit is rebuilt entirely, all instances in the old Unit are deleted in parallel, and all instances in the new Unit are created in parallel, without special handling based on instance numbers.
Instance rebuilding
By default, the lifecycle of each instance in a Unit aligns with that of instance 0. When instance 0 is rebuilt, it triggers the rebuilding of all other instances in the Unit. When a non-0 instance is rebuilt, other instances in the Unit are not affected.
Distributed fault tolerance
Instance exception handling mechanism
When an exception is detected in an instance of a distributed service, the system automatically triggers the restart of all instances in the Unit.
This resolves cluster state inconsistencies caused by the failure of a single instance and ensures that every instance restarts with a completely reset runtime environment.
Coordinated recovery mechanism
After instances restart, the system waits for all instances in the Unit to reach a consistent ready state (implemented through a synchronization barrier), and only starts business processes after all instances are ready.
This mechanism:
Prevents process group setup failures during NVIDIA Collective Communications Library (NCCL) initialization that are caused by inconsistent instance states.
Ensures strict startup synchronization across all nodes participating in distributed inference tasks.
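Conceptually, the barrier behaves like a rendezvous point that no instance may pass until every instance has reached it. The following Python sketch is a single-process analogy using threads; it only illustrates the barrier pattern and is not how EAS actually coordinates separate machines:

import threading

NUM_INSTANCES = 4  # hypothetical Unit size, for illustration only
barrier = threading.Barrier(NUM_INSTANCES)

def instance(rank: int) -> None:
    print(f"instance {rank}: restarted, waiting for peers")
    barrier.wait()  # blocks until all NUM_INSTANCES participants arrive
    print(f"instance {rank}: all peers ready, starting business process")

threads = [threading.Thread(target=instance, args=(r,)) for r in range(NUM_INSTANCES)]
for t in threads:
    t.start()
for t in threads:
    t.join()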
Distributed fault tolerance is disabled by default. To enable it, set unit.guard to true:
{
  "unit": {
    "size": 2,
    "guard": true
  }
}
Configure multi-machine distributed inference
EAS custom deployment
Log on to the PAI console. Select a region at the top of the page, then select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Deploy Service: On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
Update Service: On the Inference Service tab, find your target service in the list and click Update in the Actions column.
Configure the following key parameters. For information about the other parameters, see Custom deployment.
In the Environment Information section, configure the image and command:
Image Configuration: Choose Alibaba Cloud Image and select vllm:0.7.1 or sglang:0.4.1.

Command: Automatically set when you select the image. No changes required.
In the Resource Configuration section, turn on the Distributed Inference option and configure the following key parameters:

Parameter | Description |
Machines for Each Instance | The number of machines for a single inference instance. Minimum: 2. |
RDMA Network | Enables high-speed RDMA networking between machines for optimal performance. Note: Currently, only services deployed using Lingjun resources can use the RDMA network. |
Click Deploy or Update.
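If you deploy or update the service with a JSON configuration instead of the console form, the distributed settings correspond to the unit block shown in the fault-tolerance section above. Mapping Machines for Each Instance to unit.size is an assumption based on that snippet:

{
  "unit": {
    "size": 2
  }
}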
One-click deployment in Model Gallery
Distributed inference is supported only when the Deployment Method is SGLang Accelerate Deployment or vLLM Accelerate Deployment.
For models with large parameter counts, when you deploy a model service in Model Gallery with one click, select SGLang Accelerate Deployment or vLLM Accelerate Deployment as the deployment method. EAS then automatically enables Distributed Inference. Click Modify Configurations to adjust the number of machines for a single instance.
Appendix
Distributed inference services typically need to perform networking operations (for example, through the Torch distributed or Ray frameworks). When a VPC or RDMA is configured, each instance has multiple network cards, so you must specify which card to use for inter-node communication.
When RDMA is configured, the RDMA network card (net0) is used by default.
When RDMA is not configured, the network card corresponding to the user-configured VPC (eth1) is used.
Related configurations are passed through environment variables for use in startup commands:
Environment variable | Description | Example value |
RANK_ID | Instance number, starting from 0 and incrementing. | 0 |
COMM_IFNAME | Default network card for inter-node communication: net0 when RDMA is configured, or eth1 (the network card of the user-configured VPC) when RDMA is not configured. | net0 |
RANK_IP | IP address of this instance's COMM_IFNAME network card, used for inter-node communication. | 11.*.*.* |
MASTER_ADDRESS | IP address of the COMM_IFNAME network card on instance 0 (the master node). | 11.*.*.* |
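As an illustration of how these variables are typically consumed, the following Python sketch sets up a Torch distributed process group across the machines of a Unit. It is only a sketch under assumptions: the world size must match Machines for Each Instance, port 29500 is an arbitrary example, and in practice the vLLM or SGLang images handle this wiring for you:

import os
import torch.distributed as dist

# Bind collective communication to the network card EAS designates for
# inter-node traffic (net0 with RDMA, eth1 otherwise).
ifname = os.environ.get("COMM_IFNAME", "eth1")
os.environ["NCCL_SOCKET_IFNAME"] = ifname
os.environ["GLOO_SOCKET_IFNAME"] = ifname

rank = int(os.environ["RANK_ID"])
master = os.environ["MASTER_ADDRESS"]
world_size = 2  # assumption: must equal the configured Machines for Each Instance

dist.init_process_group(
    backend="gloo",  # "nccl" would be used for GPU collectives; gloo keeps the sketch CPU-only
    init_method=f"tcp://{master}:29500",  # arbitrary example port
    rank=rank,
    world_size=world_size,
)
print(f"rank {rank} ({os.environ['RANK_IP']}) joined the group via master {master}")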