Multi-node GPU workloads demand network throughput and latency that TCP/IP cannot reliably deliver at scale. ACS (Alibaba Cloud Container Compute Service) provides an RDMA-enabled high-performance network that reduces communication overhead for distributed training and inference. Add a single pod label to enable RDMA, configure the NCCL environment variables for your GPU model, and the pod is ready.
## How it works
TCP/IP introduces overhead that compounds as the number of GPUs grows: data copying between kernel and user space, complex protocol stack processing, flow control algorithms, and frequent context switching.
RDMA (Remote Direct Memory Access) bypasses these costs using zero-copy transfers and kernel bypass. The result is lower latency, higher throughput, and lower CPU usage—benefits that are most significant in multi-node GPU collective operations.
To enable RDMA on a pod in ACS, add the label alibabacloud.com/hpn-type: "rdma" to the pod spec. ACS automatically attaches an RDMA network interface card (NIC) to the container. Configure the NCCL environment variables for your GPU model, and the pod is ready for distributed training or inference.
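In a pod template, the label placement looks like the following minimal fragment. This is an illustrative sketch that assumes the GU8TF model; the complete Deployment example appears later in this topic.

```yaml
# Minimal sketch of the relevant pod-template labels. GU8TF is an example;
# pick a model from the table below.
metadata:
  labels:
    alibabacloud.com/compute-class: gpu
    alibabacloud.com/gpu-model-series: "GU8TF"
    alibabacloud.com/hpn-type: "rdma"   # ACS attaches the RDMA NIC
```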
## GPU models that support RDMA
ACS supports multiple GPU models. The following table lists the models that support RDMA and their constraints.
| GPU model | compute-class | RDMA constraint | RDMA NIC type |
|---|---|---|---|
| GU8TF | gpu | 8-card pods only | Type 1 |
| GU8TEF | gpu | 8-card pods only | Type 1 |
| GX8SF | gpu | 8-card pods only | Type 1 |
| P16EN | gpu | 16-card pods only | Type 2 |
| gpu-hpn | — | 1, 2, 4, 8, and 16-card pods | — |
The RDMA NIC type determines which NCCL environment variables you need to configure.
## NCCL configuration
ACS GPUs use two RDMA NIC types, each requiring different NCCL settings.
### Type 1 NIC (GU8TF, GU8TEF, GX8SF)

```shell
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5
export NCCL_DEBUG=INFO
```
### Type 2 NIC (P16EN)

```shell
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=
export NCCL_DEBUG=INFO
```
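The model-to-settings mapping can be captured in a small helper script. This is an illustrative sketch, not an ACS-provided tool: the `nccl_env` function name is our own, and the mapping follows the table above.

```shell
#!/bin/sh
# nccl_env: print the NCCL settings for a given GPU model.
# Hypothetical helper; the model-to-NIC-type mapping follows the table above.
nccl_env() {
  case "$1" in
    GU8TF|GU8TEF|GX8SF)  # Type 1 NIC
      printf 'NCCL_SOCKET_IFNAME=eth0\nNCCL_IB_HCA=mlx5\nNCCL_DEBUG=INFO\n'
      ;;
    P16EN)               # Type 2 NIC
      printf 'NCCL_IB_DISABLE=1\nNCCL_SOCKET_IFNAME=eth0\nNCCL_IB_HCA=\nNCCL_DEBUG=INFO\n'
      ;;
    *)
      echo "unsupported GPU model: $1" >&2
      return 1
      ;;
  esac
}

# Example: emit the Type 1 settings for GU8TF.
nccl_env GU8TF
```

In an init script you could `eval` the output after prefixing each line with `export`, or copy the values into the pod's `env` section as shown in the Deployment example later in this topic.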
### Environment variables

| Variable | Description |
|---|---|
| `NCCL_SOCKET_IFNAME` | The network interface that NCCL uses to establish connections. Use `eth0` in ACS. |
| `NCCL_IB_HCA` | Specifies which Host Channel Adapter (HCA) interfaces to use for RDMA communication. Set to `mlx5` for Type 1 NICs. For P16EN (Type 2), leave this blank. |
| `NCCL_IB_DISABLE` | Specifies whether to disable IB/RoCE networks and use IP sockets instead. A value of `1` disables IB/RoCE. Set to `1` for P16EN. |
| `NCCL_DEBUG` | Controls the output level of NCCL debug logs. |
For other NCCL environment variables, see the NCCL documentation.
## Deploy a GPU pod with RDMA

### Prerequisites

Before you begin, ensure that you have:

- An ACS cluster with a GPU model that supports RDMA (see GPU models that support RDMA).
- `kubectl` configured to connect to your cluster.
### Deploy the pod

1. Create a file named `dep-demo-hpn-gpu.yaml` with the following content. The example uses GU8TF (Type 1 NIC). Adjust `alibabacloud.com/gpu-model-series` and the NCCL variables for your GPU model.

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: dep-demo-hpn-gpu
     labels:
       app: demo-hpn-gpu
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: demo-hpn-gpu
     template:
       metadata:
         labels:
           app: demo-hpn-gpu
           alibabacloud.com/compute-class: gpu
           alibabacloud.com/compute-qos: default
           # Specify the GPU model. Change this value as needed.
           alibabacloud.com/gpu-model-series: "GU8TF"
           alibabacloud.com/hpn-type: "rdma"
       spec:
         containers:
         - name: demo
           image: registry-cn-wulanchabu.ack.aliyuncs.com/acs/stress:v1.0.4
           command:
           - "sleep"
           - "1000h"
           env:
           - name: NCCL_SOCKET_IFNAME
             value: "eth0"
           - name: NCCL_IB_HCA
             value: "mlx5"
           - name: NCCL_DEBUG
             value: "INFO"
           resources:
             requests:
               cpu: 128
               memory: 512Gi
               nvidia.com/gpu: 8
             limits:
               cpu: 128
               memory: 512Gi
               nvidia.com/gpu: 8
   ```

2. Apply the manifest.

   ```shell
   kubectl apply -f dep-demo-hpn-gpu.yaml
   ```

3. Wait for the pod to reach the `Running` state.

   ```shell
   kubectl get pod | grep dep-demo-hpn-gpu
   ```

   The pod is ready when the output shows `1/1 Running`:

   ```
   dep-demo-hpn-gpu-5d9xxxxxb6-xxxxx   1/1   Running   0   25m16s
   ```
## Verify RDMA is active
Check that the RDMA NIC was attached to the container:
```shell
kubectl exec -it deploy/dep-demo-hpn-gpu -- ifconfig | grep hpn -A 8
```
The output shows the `hpn0` interface with `UP BROADCAST RUNNING MULTICAST`, confirming the RDMA NIC is active:

```
hpn0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet6 addr: xxxx::x:xxxx:xxxx:xxx/xx Scope:Link
          inet6 addr: xxxx:xxx:xxx:x:x:xxxx:x:xxx/xxx Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:xxxx  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:xx errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:x (892.0 B)
```
If hpn0 does not appear, check that alibabacloud.com/hpn-type: "rdma" is present in the pod labels and that the GPU model supports RDMA with the requested card count.
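A quick local sanity check before reapplying is to grep the manifest for the label. This sketch assumes the manifest file name used in the deployment steps above.

```shell
# Verify the manifest itself carries the RDMA label before applying it.
# Assumes the file name from the deployment step above.
if grep -q 'alibabacloud.com/hpn-type: "rdma"' dep-demo-hpn-gpu.yaml; then
  echo "RDMA label present"
else
  echo "RDMA label missing from manifest" >&2
fi
```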