Ultra-large-scale MoE models, such as DeepSeek-R1 671B, exceed the capacity of a single machine. To address this, Elastic Algorithm Service (EAS) provides a multi-machine distributed inference solution that allows a single service instance to span multiple machines, enabling the efficient deployment and operation of ultra-large-scale models. This topic describes how to configure multi-machine distributed inference.
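As a rough, back-of-the-envelope illustration of why such models do not fit on one machine (the FP8 weight precision matches DeepSeek-R1's native format, but the per-GPU memory figure below is an assumption, not an EAS specification):

```python
# Illustrative memory estimate only; adjust the figures for your own hardware.
params = 671e9                      # DeepSeek-R1: ~671 billion parameters
bytes_per_param = 1                 # FP8 weights (use 2 for BF16/FP16)
weight_gb = params * bytes_per_param / 1e9
print(f"Model weights alone: ~{weight_gb:.0f} GB")        # ~671 GB

machine_hbm_gb = 8 * 80             # assumed single machine: 8 GPUs x 80 GB HBM
print(f"One machine's GPU memory: {machine_hbm_gb} GB")   # 640 GB

# The weights alone already exceed a single machine's GPU memory, and the
# KV cache and activations need additional space, so the model must be
# sharded across two or more machines.
```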
Limitations
The feature is available only when deploying services with images that use distributed inference logic. EAS provides official images that support this logic: vllm:0.7.1 and sglang:0.4.1.
How it works
A multi-machine distributed inference service is largely similar to a standard inference service. The key differences and similarities are as follows:
Differences: Multi-machine distributed inference introduces the instance group concept:
How it works: An instance group uses high-performance network communication to coordinate request processing through parallel patterns such as tensor parallelism (TP) and pipeline parallelism (PP); see the sketch after this list. Each instance in a group receives a rank number as an environment variable (see the Appendix), which determines the tasks assigned to that instance.
Traffic allocation: An instance group receives traffic only through its rank-0 instance (the instance whose RANK_ID is 0). The service discovery mechanism distributes request traffic across the rank-0 instances of the different instance groups.
Lifecycle management: By default, the lifecycle of an instance group follows that of its rank-0 instance. Rebuilding the rank-0 instance triggers the reconstruction of all other instances in the group.
Similarities: Like standard service instances, instance groups support rolling updates. During a rolling update, the entire instance group is rebuilt: all instances of the new group are created at the same time, and once they are ready, the system first diverts traffic away from the group to be deleted and then removes all of its instances.
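As a toy illustration of the tensor-parallel (TP) pattern mentioned above, the following NumPy sketch splits a linear layer's weight matrix column-wise across two ranks and then gathers the partial results. NumPy stands in here for the actual cross-machine GPU communication; this is not how the vLLM or SGLang images are implemented internally.

```python
import numpy as np

# Toy tensor parallelism (TP): each rank in a 2-machine instance group holds
# one column shard of a linear layer's weight matrix.
world_size = 2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # activations: batch x hidden
w = rng.standard_normal((8, 16))     # full weight: hidden x output

shards = np.split(w, world_size, axis=1)    # one column shard per rank
partials = [x @ shard for shard in shards]  # each rank computes its slice locally
y = np.concatenate(partials, axis=1)        # "all-gather" along the output dimension

assert np.allclose(y, x @ w)  # sharded result matches the single-device result
```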
Configure multi-machine distributed inference
EAS custom deployment
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Create a service: On the Inference Service tab, click Deploy Service. On the Deploy Service page, click Custom Deployment in the Custom Model Deployment section.
Update a service: On the Inference Service tab, find the desired service and click Update in the Actions column.
Configure the following key parameters. For information about the other parameters, see Deploy a model service in the PAI console.
In the Environment Information section:
Image Configuration: Choose Alibaba Cloud Image and select vllm:0.7.1 or sglang:0.4.1.
Command: The system automatically sets the command after you select the image. No modification is necessary.
In the Resource Deployment section, turn on the Distributed Inference option and configure the following key parameters:
Parameter | Description |
Machines for Each Instance | The number of machines used by a single model inference instance. The minimum value is 2. |
RDMA Network | Turns on the RDMA network to ensure efficient network connectivity between machines. Note: Currently, only services deployed on Lingjun resources can use the RDMA network. |
Click Deploy or Update.
Deploy in Model Gallery with one click
Distributed inference is only supported when the Deployment Method is SGLang Accelerate Deployment or vLLM Accelerate Deployment.
For models with a large number of parameters, select the SGLang Accelerate Deployment or vLLM Accelerate Deployment method, and Distributed Inference is enabled automatically.
You can click Modify Configurations to adjust the number of machines for a single instance.
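After the service is deployed, you can send a test request to verify that it works. The following sketch assumes that the vLLM or SGLang image exposes an OpenAI-compatible API and that the service token is passed as the API key; the endpoint URL, token, and model name are placeholders that you must replace with the values shown on your service's details page.

```python
from openai import OpenAI

# Placeholders: replace with your service's endpoint, token, and served model name.
client = OpenAI(
    base_url="<EAS_SERVICE_ENDPOINT>/v1",  # endpoint from the EAS service details page
    api_key="<EAS_SERVICE_TOKEN>",         # service token used for authentication
)

response = client.chat.completions.create(
    model="<MODEL_NAME>",                  # model name as served by vLLM/SGLang
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```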
Appendix
The following table describes the environment variables that are injected into the instances of an instance group:
Environment variable | Description | Example value |
RANK_ID | The instance ID within the instance group. | 0 |
MASTER_ADDRESS | The IP address of the instance whose RANK_ID is 0. | 11.*.*.* |
COMM_IFNAME | The name of the network interface card used by the instance. | net0 |
RANK_IP | The IP address of the instance. | 11.*.*.* |
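As an illustration of how an image with distributed inference logic might consume these variables, the sketch below reads them and assembles a multi-node SGLang launch command. The launch flags follow SGLang's multi-node interface, but the model path, port, parallelism degree, and the use of NCCL_SOCKET_IFNAME/GLOO_SOCKET_IFNAME are assumptions; the actual startup command baked into the official EAS images may differ.

```python
import os
import subprocess

# Environment variables injected by EAS into each instance of the group.
rank = os.environ["RANK_ID"]            # "0" on the rank-0 instance
master = os.environ["MASTER_ADDRESS"]   # IP address of the rank-0 instance
ifname = os.environ.get("COMM_IFNAME")  # NIC used for inter-machine communication

# Assumption: point the collective-communication backends at the injected NIC.
if ifname:
    os.environ.setdefault("NCCL_SOCKET_IFNAME", ifname)
    os.environ.setdefault("GLOO_SOCKET_IFNAME", ifname)

# Hypothetical 2-machine launch; the real command in the EAS image may differ.
subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "/models/DeepSeek-R1",    # assumed model location
    "--tp", "16",                             # tensor parallelism across both machines
    "--nnodes", "2",
    "--node-rank", rank,
    "--dist-init-addr", f"{master}:5000",     # assumed coordination port
], check=True)
```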