Container Service for Kubernetes: Run GDR applications on eRDMA nodes in ACK clusters

Last Updated: Mar 26, 2026

Distributed GPU training jobs use the NVIDIA Collective Communication Library (NCCL) to exchange gradients across nodes. Without a high-speed, low-latency interconnect, cross-node communication becomes a bottleneck that limits training throughput. GPUDirect RDMA (GDR) lets Remote Direct Memory Access (RDMA)-capable network devices read and write GPU memory directly, so data moves between GPUs on different nodes without being staged through CPU-managed host memory. This topic explains how to run GDR workloads on elastic RDMA (eRDMA) nodes in Container Service for Kubernetes (ACK) clusters, using eRDMA as the transport layer.

Prerequisites

Before you begin, ensure that you have:

Submit an MPI job with eRDMA

Use Arena to submit a Message Passing Interface (MPI) job that requests an eRDMA device. The job runs a PyTorch AllReduce benchmark using Horovod with NCCL as the communication backend. The --device=aliyun/erdma=1 flag requests one eRDMA interface for the job, and --hostNetwork true ensures that NCCL can reach the eRDMA device directly.

arena submit mpijob \
  --name=mpi-allreduce-sync-erdma \
  --device=aliyun/erdma=1 \
  -e NCCL_DEBUG=TRACE \
  -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
  -e OMPI_ALLOW_RUN_AS_ROOT=1 \
  --gpus=8 \
  --memory=16Gi \
  --hostNetwork true \
  --cpu=4 \
  --workers=2 \
  --image=registry.cn-beijing.aliyuncs.com/acs/horovod:0.28.1-tf2.9.2-torch1.12.1-py3.8-erdma \
  --toleration all \
  "mpirun -np 2 \
  --allow-run-as-root \
  --mca btl_tcp_if_include bond0 \
  --mca oob_tcp_if_include bond0 \
  --mca pml ob1 \
  --mca btl ^openib \
  python /examples/pytorch/pytorch_synthetic_benchmark.py"

After the job starts, check the logs to confirm that NCCL detected the eRDMA device. Look for lines similar to the following:

iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO NET/IB : Using [0]rocep26s0:1/RoCE ; OOB eth0:192.168.8.128<0>
iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO Using network IB
iZ2zeg0kcgyxepyc5r63kgZ:18:27 [1] NCCL INFO NET/IB : Using [0]rocep26s0:1/RoCE ; OOB eth0:192.168.8.128<0>
iZ2zeg0kcgyxepyc5r63kgZ:18:27 [1] NCCL INFO Using network IB

The NET/IB and Using network IB entries confirm that NCCL detected the eRDMA device (rocep26s0) and is using it for inter-node communication through its InfiniBand (IB) verbs transport, with the device reported in RDMA over Converged Ethernet (RoCE) mode. If these lines are absent, NCCL has most likely fallen back to TCP sockets. In that case, verify that the eRDMA Controller is running and that the pod received the aliyun/erdma device resource.
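The log check described above can be automated. The following is a minimal sketch, assuming the job logs have already been saved as text and that the NET/IB lines follow the format shown in the sample output; other NCCL versions may format these lines differently. The function names find_ib_devices and uses_ib_network are hypothetical helpers, not part of NCCL or Arena.

```python
import re

# Patterns are derived from the sample log lines in this topic (an assumption);
# adjust them if your NCCL version prints a different format.
NET_IB = re.compile(r"NCCL INFO NET/IB : Using (?P<devs>.*?) ; OOB (?P<oob>\S+)")
DEVICE = re.compile(r"\[\d+\](?P<name>[^:\s]+):(?P<port>\d+)/(?P<link>\w+)")


def find_ib_devices(log_text):
    """Return sorted (device, link layer) pairs that NCCL reports under NET/IB."""
    found = set()
    for line in log_text.splitlines():
        match = NET_IB.search(line)
        if match is None:
            continue
        for dev in DEVICE.finditer(match.group("devs")):
            found.add((dev.group("name"), dev.group("link")))
    return sorted(found)


def uses_ib_network(log_text):
    """Return True if any rank logged 'Using network IB'."""
    return any(
        "NCCL INFO Using network IB" in line for line in log_text.splitlines()
    )


if __name__ == "__main__":
    sample = (
        "iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO NET/IB : "
        "Using [0]rocep26s0:1/RoCE ; OOB eth0:192.168.8.128<0>\n"
        "iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO Using network IB\n"
    )
    print(find_ib_devices(sample), uses_ib_network(sample))
```

An empty device list together with a False result from uses_ib_network suggests that the job did not select the eRDMA path.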

Verify eRDMA interfaces on the host

ibv_devinfo lists all Elastic RDMA Interfaces (ERIs) on a host along with their transport type and port states. Run it on a node to confirm that the eRDMA devices are registered with the kernel driver and that at least one is in an active state.

$ ibv_devinfo
hca_id:	rocep156s0
transport:			eRDMA
fw_ver:				0.2.0
node_guid:			0216:3eff:fe2c:b8f3
sys_image_guid:			0216:3eff:fe2c:b8f3
vendor_id:			0x1ded
vendor_part_id:			4223
hw_ver:				0x0
phys_port_cnt:			1
	port:	1
		state:			PORT_DOWN (1)
		max_mtu:		1024 (3)
		active_mtu:		1024 (3)
		sm_lid:			0
		port_lid:		0
		port_lmc:		0x00
		link_layer:		Ethernet

hca_id:	rocep26s0
transport:			eRDMA
fw_ver:				0.2.0
node_guid:			0216:3eff:fe10:f8b0
sys_image_guid:			0216:3eff:fe10:f8b0
vendor_id:			0x1ded
vendor_part_id:			4223
hw_ver:				0x0
phys_port_cnt:			1
	port:	1
		state:			PORT_ACTIVE (4)
		max_mtu:		1024 (3)
		active_mtu:		1024 (3)
		sm_lid:			0
		port_lid:		0
		port_lmc:		0x00
		link_layer:		Ethernet

The output shows two eRDMA devices:

Device        State              Meaning
rocep26s0     PORT_ACTIVE (4)    Ready to carry traffic; NCCL uses this interface
rocep156s0    PORT_DOWN (1)      Not currently active; not used by NCCL

At least one device must show PORT_ACTIVE (4) for the job to use eRDMA. If all devices show PORT_DOWN, check that the eRDMA Controller is running and that the node has active eRDMA links.
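The port-state check can also be scripted. The following is a minimal sketch, assuming the ibv_devinfo output uses the hca_id and state field layout shown above. The function names port_states and has_active_port are hypothetical helpers introduced for illustration.

```python
import re


def port_states(devinfo_text):
    """Map each hca_id in ibv_devinfo output to the list of its port states."""
    states = {}
    current = None
    for line in devinfo_text.splitlines():
        hca = re.match(r"\s*hca_id:\s*(\S+)", line)
        if hca:
            current = hca.group(1)
            states[current] = []
            continue
        state = re.match(r"\s*state:\s*(\S+)", line)
        if state and current is not None:
            # Keep only the symbolic name, for example PORT_ACTIVE.
            states[current].append(state.group(1))
    return states


def has_active_port(devinfo_text):
    """Return True if at least one device reports a PORT_ACTIVE port."""
    return any(
        "PORT_ACTIVE" in ports for ports in port_states(devinfo_text).values()
    )


if __name__ == "__main__":
    sample = (
        "hca_id:\trocep156s0\n"
        "\tport:\t1\n"
        "\t\tstate:\t\t\tPORT_DOWN (1)\n"
        "\n"
        "hca_id:\trocep26s0\n"
        "\tport:\t1\n"
        "\t\tstate:\t\t\tPORT_ACTIVE (4)\n"
    )
    print(port_states(sample), has_active_port(sample))
```

On a node, the input text could come from subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout, provided that ibv_devinfo is installed and on the PATH.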

Monitor eRDMA traffic in real time

Use eadm to confirm that eRDMA traffic is flowing during a job. This verifies that the job is using the eRDMA network path rather than falling back to TCP.

$ eadm stat -d rocep26s0 -l
Monitoring rocep26s0...    (press CTRL-C to stop)

 15:59:56  rx:           0 B/s     0 p/s          tx:           0 B/s     0 p/s


 rocep26s0  /  traffic statistics

                          rx         |       tx
--------------------------------------+------------------
  bytes                    11.06 KiB  |       11.18 KiB
--------------------------------------+------------------
          max            15.43 KiB/s  |     15.10 KiB/s
      average             4.03 KiB/s  |      4.07 KiB/s
          min                  0 B/s  |           0 B/s
--------------------------------------+------------------
  packets                    8406769  |         8546764
--------------------------------------+------------------
          max              38990 p/s  |       37488 p/s
      average               2988 p/s  |        3038 p/s
          min                  0 p/s  |           0 p/s
--------------------------------------+------------------
  time                 33.78 minutes

Non-zero rx and tx bytes and packets confirm that the job is sending and receiving data over the eRDMA interface. If both rx and tx remain at 0 B/s throughout the job, the workload is not using eRDMA. Check the NCCL logs for the NET/IB entries described in "Submit an MPI job with eRDMA".
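To check for traffic without watching the terminal, the per-second rate lines can be parsed. The following is a minimal sketch, assuming the live lines printed by eadm stat keep the "rx: ... B/s ... p/s tx: ... B/s ... p/s" layout shown above. Only the B and KiB units appear in the sample; the MiB and GiB entries are an assumption. The function names parse_rate_line and traffic_seen are hypothetical helpers.

```python
import re

# B and KiB appear in the sample output; MiB and GiB are assumed to follow
# the same naming convention.
UNIT_BYTES = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}

RATE_LINE = re.compile(
    r"rx:\s*(?P<rx>[\d.]+)\s*(?P<rx_unit>[A-Za-z]+)/s\s+(?P<rx_pps>\d+)\s*p/s\s+"
    r"tx:\s*(?P<tx>[\d.]+)\s*(?P<tx_unit>[A-Za-z]+)/s\s+(?P<tx_pps>\d+)\s*p/s"
)


def parse_rate_line(line):
    """Parse one live rate line; return None if the line is not a rate line."""
    match = RATE_LINE.search(line)
    if match is None:
        return None
    rx_scale = UNIT_BYTES.get(match.group("rx_unit"))
    tx_scale = UNIT_BYTES.get(match.group("tx_unit"))
    if rx_scale is None or tx_scale is None:
        return None  # Unknown unit: do not guess.
    return {
        "rx_bytes_per_s": float(match.group("rx")) * rx_scale,
        "rx_packets_per_s": int(match.group("rx_pps")),
        "tx_bytes_per_s": float(match.group("tx")) * tx_scale,
        "tx_packets_per_s": int(match.group("tx_pps")),
    }


def traffic_seen(lines):
    """Return True if any rate line shows a non-zero rx or tx byte rate."""
    for line in lines:
        sample = parse_rate_line(line)
        if sample and (
            sample["rx_bytes_per_s"] > 0 or sample["tx_bytes_per_s"] > 0
        ):
            return True
    return False


if __name__ == "__main__":
    idle = " 15:59:56  rx:           0 B/s     0 p/s          tx:           0 B/s     0 p/s"
    print(parse_rate_line(idle), traffic_seen([idle]))
```

If traffic_seen returns False for lines captured over the whole run, that matches the all-zero case described above.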