Multi-node GPU workloads demand network throughput and latency that TCP/IP cannot reliably deliver at scale. ACS (Alibaba Cloud Container Compute Service) provides an RDMA-enabled high-performance network that reduces communication overhead for distributed training and inference. Add a single pod label to enable RDMA, configure the NCCL environment variables for your GPU model, and the pod is ready.
## How it works
TCP/IP introduces overhead that compounds as the number of GPUs grows: data copying between kernel and user space, complex protocol stack processing, flow control algorithms, and frequent context switching.
RDMA (Remote Direct Memory Access) bypasses these costs using zero-copy transfers and kernel bypass. The result is lower latency, higher throughput, and lower CPU usage—benefits that are most significant in multi-node GPU collective operations.
To enable RDMA on a pod in ACS, add the label alibabacloud.com/hpn-type: "rdma" to the pod spec. ACS automatically attaches an RDMA network interface card (NIC) to the container. Configure the NCCL environment variables for your GPU model, and the pod is ready for distributed training or inference.
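In a pod template, the label placement looks like the following minimal fragment. This is an illustrative sketch that assumes the GU8TF model; the complete Deployment example appears later in this topic.

```yaml
# Minimal sketch of the relevant pod-template labels. GU8TF is an example;
# pick a model from the table below.
metadata:
  labels:
    alibabacloud.com/compute-class: gpu
    alibabacloud.com/gpu-model-series: "GU8TF"
    alibabacloud.com/hpn-type: "rdma"   # ACS attaches the RDMA NIC
```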
## GPU models that support RDMA
ACS supports multiple GPU models. The following table lists the models that support RDMA and their constraints.
| GPU model | compute-class | RDMA constraint | RDMA NIC type |
|---|---|---|---|
| GU8TF | gpu | 8-card pods only | Type 1 |
| GU8TEF | gpu | 8-card pods only | Type 1 |
| GX8SF | gpu | 8-card pods only | Type 1 |
| P16EN | gpu | 16-card pods only | Type 2 |
| gpu-hpn | — | 1, 2, 4, 8, and 16-card pods | — |
The RDMA NIC type determines which NCCL environment variables you need to configure.
## NCCL configuration
ACS GPUs use two RDMA NIC types, each requiring different NCCL settings.
### Type 1 NIC (GU8TF, GU8TEF, GX8SF)

```shell
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5
export NCCL_DEBUG=INFO
```
### Type 2 NIC (P16EN)

```shell
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=
export NCCL_DEBUG=INFO
```
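The model-to-settings mapping can be captured in a small helper script. This is an illustrative sketch, not an ACS-provided tool: the `nccl_env` function name is our own, and the mapping follows the table above.

```shell
#!/bin/sh
# nccl_env: print the NCCL settings for a given GPU model.
# Hypothetical helper; the model-to-NIC-type mapping follows the table above.
nccl_env() {
  case "$1" in
    GU8TF|GU8TEF|GX8SF)  # Type 1 NIC
      printf 'NCCL_SOCKET_IFNAME=eth0\nNCCL_IB_HCA=mlx5\nNCCL_DEBUG=INFO\n'
      ;;
    P16EN)               # Type 2 NIC
      printf 'NCCL_IB_DISABLE=1\nNCCL_SOCKET_IFNAME=eth0\nNCCL_IB_HCA=\nNCCL_DEBUG=INFO\n'
      ;;
    *)
      echo "unsupported GPU model: $1" >&2
      return 1
      ;;
  esac
}

# Example: emit the Type 1 settings for GU8TF.
nccl_env GU8TF
```

In an init script you could `eval` the output after prefixing each line with `export`, or copy the values into the pod's `env` section as shown in the Deployment example later in this topic.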
### Environment variables

| Variable | Description |
|---|---|
| `NCCL_SOCKET_IFNAME` | The network interface that NCCL uses to establish connections. Use `eth0` in ACS. |
| `NCCL_IB_HCA` | Specifies which Host Channel Adapter (HCA) interfaces to use for RDMA communication. Set to `mlx5` for Type 1 NICs. For P16EN (Type 2), leave this blank. |
| `NCCL_IB_DISABLE` | Specifies whether to disable IB/RoCE networks and use IP sockets instead. A value of `1` disables IB/RoCE. Set to `1` for P16EN. |
| `NCCL_DEBUG` | Controls the output level of NCCL debug logs. |
For other NCCL environment variables, see the NCCL documentation.
## Deploy a GPU pod with RDMA

### Prerequisites

Before you begin, ensure that you have:

- An ACS cluster with a GPU model that supports RDMA (see GPU models that support RDMA).
- `kubectl` configured to connect to your cluster.
### Deploy the pod

1. Create a file named `dep-demo-hpn-gpu.yaml` with the following content. The example uses GU8TF (Type 1 NIC). Adjust `alibabacloud.com/gpu-model-series` and the NCCL variables for your GPU model.

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: dep-demo-hpn-gpu
     labels:
       app: demo-hpn-gpu
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: demo-hpn-gpu
     template:
       metadata:
         labels:
           app: demo-hpn-gpu
           alibabacloud.com/compute-class: gpu
           alibabacloud.com/compute-qos: default
           # Specify the GPU model. Change this value as needed.
           alibabacloud.com/gpu-model-series: "GU8TF"
           alibabacloud.com/hpn-type: "rdma"
       spec:
         containers:
         - name: demo
           image: registry-cn-wulanchabu.ack.aliyuncs.com/acs/stress:v1.0.4
           command:
           - "sleep"
           - "1000h"
           env:
           - name: NCCL_SOCKET_IFNAME
             value: "eth0"
           - name: NCCL_IB_HCA
             value: "mlx5"
           - name: NCCL_DEBUG
             value: "INFO"
           resources:
             requests:
               cpu: 128
               memory: 512Gi
               nvidia.com/gpu: 8
             limits:
               cpu: 128
               memory: 512Gi
               nvidia.com/gpu: 8
   ```

2. Apply the manifest.

   ```shell
   kubectl apply -f dep-demo-hpn-gpu.yaml
   ```

3. Wait for the pod to reach the `Running` state.

   ```shell
   kubectl get pod | grep dep-demo-hpn-gpu
   ```

   The pod is ready when the output shows `1/1 Running`:

   ```
   dep-demo-hpn-gpu-5d9xxxxxb6-xxxxx   1/1   Running   0   25m16s
   ```
## Verify RDMA is active
Check that the RDMA NIC was attached to the container:
```shell
kubectl exec -it deploy/dep-demo-hpn-gpu -- ifconfig | grep hpn -A 8
```
The output shows the `hpn0` interface with `UP BROADCAST RUNNING MULTICAST`, confirming the RDMA NIC is active:

```
hpn0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet6 addr: xxxx::x:xxxx:xxxx:xxx/xx Scope:Link
          inet6 addr: xxxx:xxx:xxx:x:x:xxxx:x:xxx/xxx Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:xxxx  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:xx errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:x (892.0 B)
```
If hpn0 does not appear, check that alibabacloud.com/hpn-type: "rdma" is present in the pod labels and that the GPU model supports RDMA with the requested card count.
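A quick local sanity check before reapplying is to grep the manifest for the label. This sketch assumes the manifest file name used in the deployment steps above.

```shell
# Verify the manifest itself carries the RDMA label before applying it.
# Assumes the file name from the deployment step above.
if grep -q 'alibabacloud.com/hpn-type: "rdma"' dep-demo-hpn-gpu.yaml; then
  echo "RDMA label present"
else
  echo "RDMA label missing from manifest" >&2
fi
```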