In multi-node GPU training, network communication latency is often the primary bottleneck. Elastic Remote Direct Memory Access (eRDMA) provides kernel-bypass, low-latency data transfer between nodes, directly improving gradient synchronization speed and overall training throughput.
This topic walks you through installing the eRDMA controller, enabling eRDMA on GPU nodes, submitting a PyTorch distributed training job via Arena, and verifying that eRDMA acceleration is active.
Prerequisites
Before you begin, make sure you have:
An ACK managed cluster. eRDMA is not supported on dedicated clusters.
Nodes using eRDMA-compatible instance types: gn8is, ebmgn8is, gn8v, or ebmgn8v
The Arena component of the Cloud-native AI Suite installed in the cluster
The Arena client configured on your local machine
kubectl installed and connected to the cluster via kubeconfig
Step 1: Install the ACK eRDMA Controller add-on
If your cluster uses the Terway network plugin, configure a whitelist for Terway before installing this add-on. This prevents Terway from modifying the eRDMA NIC. For details, see Configure a whitelist for ENIs.
On the Clusters page, click your cluster name. In the left navigation pane, click Add-ons.
On the Component Management page, click the Network tab, find ACK eRDMA Controller, and follow the prompts to configure and install it.
| Configuration item | Description |
|---|---|
| preferDriver (Driver type) | The eRDMA driver mode for your nodes. Options: default (default mode), compat (RoCE-compatible mode), ofed (OFED-based mode, recommended for GPU-accelerated instances). For details, see Enable eRDMA. |
| Assign all eRDMA devices on a node to a pod | True: assigns all eRDMA devices on the node to the pod. False: assigns one eRDMA device per pod based on NUMA topology. Requires the CPU Static Policy to be enabled on the node. For details, see Create and manage node pools. |
When a node has multiple NICs, the ACK eRDMA Controller assigns a lower route priority (200 by default) to eRDMA NIC routes than to other NICs in the same CIDR block. If you manually configure NICs after installation, avoid route conflicts.
After installation completes, go to Workloads > Pods, set the namespace to ack-erdma-controller, and confirm that the controller pod is in Running state.
Step 2: Enable and verify eRDMA on nodes
Verify that the eRDMA device is active. On the Clusters page, click your cluster name, go to Workloads > Pods, and open a terminal on an eRDMA-enabled node. Run:
ibv_devinfo
In the output, look for the erdma_0 device with port_state: PORT_ACTIVE. This confirms eRDMA is enabled on the node.
Verify that the node exposes eRDMA as a schedulable resource:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
Expected output:
{"aliyun/erdma":"200","cpu":"15890m","ephemeral-storage":"243149919035","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"128290128Ki","nvidia.com/gpu":"1","pods":"64"}
Confirm two fields:
"nvidia.com/gpu":"1" means the node has 1 GPU available for scheduling
"aliyun/erdma":"200" means up to 200 pods on this node can request eRDMA resources
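The allocatable check above can also be scripted. The following is a minimal Python sketch, assuming the JSON string printed by the kubectl command has already been captured. The function name check_allocatable is hypothetical, and the sample string is the expected output shown above.

```python
import json

# Sample copied from the expected output above. In practice this string would
# be the captured output of the kubectl command.
ALLOCATABLE_JSON = (
    '{"aliyun/erdma":"200","cpu":"15890m","ephemeral-storage":"243149919035",'
    '"hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"128290128Ki",'
    '"nvidia.com/gpu":"1","pods":"64"}'
)


def check_allocatable(raw, required=("aliyun/erdma", "nvidia.com/gpu")):
    """Return {resource: count} for the required resources.

    Raises ValueError if a required resource is missing or has a zero count.
    """
    allocatable = json.loads(raw)
    found = {}
    for name in required:
        value = int(allocatable.get(name, "0"))
        if value <= 0:
            raise ValueError(f"node does not expose schedulable resource {name!r}")
        found[name] = value
    return found


if __name__ == "__main__":
    print(check_allocatable(ALLOCATABLE_JSON))
```

A ValueError for aliyun/erdma points back to Step 1: the controller pod is probably not running on that node.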
Step 3: Submit a PyTorch distributed training job
The example below uses a prebuilt image for quick evaluation. For production use, build your own image that includes the required RDMA user-mode libraries. See Build a production container image with eRDMA support.
arena submit pytorch \
--name=pytorch-mnist \
--namespace=default \
--workers=2 \
--gpus=1 \
--device=aliyun/erdma=1 \
--nproc-per-node=1 \
--env NCCL_NET_GDR_LEVEL=1 \
--env NCCL_P2P_LEVEL=5 \
--env NCCL_DEBUG=INFO \
--env NCCL_SOCKET_IFNAME=eth0 \
--env NCCL_ALGO=Ring \
--clean-task-policy=None \
--working-dir=/root \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example \
--image-pull-policy=Always \
--tensorboard \
--logdir=/workspace/logs \
"torchrun /workspace/arena/examples/pytorch/mnist/main.py \
--epochs 10000 \
--backend nccl \
--data /workspace \
--dir /workspace/logs"
The key parameter is --device=aliyun/erdma=1, which requests one eRDMA resource per worker and makes it available inside the container. Without this parameter, NCCL will not use eRDMA even if the node supports it.
For a full parameter reference, see Arena job parameters and NCCL environment variables.
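If you submit jobs from scripts, it can help to assemble the command programmatically so the eRDMA device request is never omitted. The following is a minimal Python sketch that only builds the argument list. The helper name build_arena_submit is hypothetical, and it uses only a subset of the flags from the example above.

```python
import shlex


def build_arena_submit(name, workers, gpus, image, env, command, erdma_devices=1):
    """Assemble an `arena submit pytorch` argument list.

    The eRDMA device request is always included, and --nproc-per-node is
    matched to --gpus, as the parameter table recommends.
    """
    argv = [
        "arena", "submit", "pytorch",
        f"--name={name}",
        f"--workers={workers}",
        f"--gpus={gpus}",
        f"--device=aliyun/erdma={erdma_devices}",
        f"--nproc-per-node={gpus}",
        f"--image={image}",
    ]
    for key, value in env.items():
        argv += ["--env", f"{key}={value}"]
    argv.append(command)  # the training command is the final positional argument
    return argv


if __name__ == "__main__":
    argv = build_arena_submit(
        name="pytorch-mnist",
        workers=2,
        gpus=1,
        image="kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example",
        env={"NCCL_NET_GDR_LEVEL": "1", "NCCL_SOCKET_IFNAME": "eth0"},
        command="torchrun /workspace/arena/examples/pytorch/mnist/main.py --backend nccl",
    )
    # Print a shell-quoted command line for review before running it.
    print(shlex.join(argv))
```

The sketch prints the command rather than executing it, so you can review the flags before running it.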
Step 4: Verify the training job
Check that the master pod logs show training progress:
kubectl logs pytorch-mnist-master-0 | head -n 50
If logs contain Train Epoch, the job is running correctly.
Confirm eRDMA is active in the NCCL communication path. Look for the following line in the master pod logs:
NET/IB : Using [0]erdma_0:1/RoCE
This line confirms that:
NCCL detected the eRDMA NIC (erdma_0)
Communication uses RoCE (RDMA over Converged Ethernet)
Node-to-node gradient synchronization is using eRDMA acceleration
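To automate this log check, for example in a smoke test, you can scan the saved pod logs for the NET/IB line. The following is a minimal Python sketch. The function names are hypothetical, and the pattern is based only on the log line shown in this step, so other NCCL versions may format the line differently.

```python
import re

# Matches device entries such as "[0]erdma_0:1/RoCE" on a NET/IB log line.
DEVICE_PATTERN = re.compile(
    r"\[(?P<index>\d+)\](?P<device>[\w.-]+):(?P<port>\d+)/(?P<link>\w+)"
)


def find_net_ib_devices(log_text):
    """Return (device, port, link_layer) tuples from every NET/IB log line."""
    devices = []
    for line in log_text.splitlines():
        if "NET/IB" not in line:
            continue
        for match in DEVICE_PATTERN.finditer(line):
            devices.append((match["device"], int(match["port"]), match["link"]))
    return devices


def erdma_in_use(log_text):
    """True if NCCL reports at least one erdma_* device using RoCE."""
    return any(
        device.startswith("erdma_") and link == "RoCE"
        for device, _port, link in find_net_ib_devices(log_text)
    )


if __name__ == "__main__":
    sample = "Train Epoch\nNET/IB : Using [0]erdma_0:1/RoCE\n"
    print(find_net_ib_devices(sample), erdma_in_use(sample))
```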
Step 5: Monitor eRDMA NIC traffic
Open a terminal on a training node (via Workloads > Pods > Actions > Terminal) and run:
eadm stat -d erdma_0 -l
During active training, the output shows continuous bidirectional traffic, for example:
rx (download): 914.89 MiB/s 150423 p/s; tx (upload): 914.66 MiB/s 147128 p/s
Sustained upload and download traffic at this level indicates that gradient synchronization between nodes is flowing through eRDMA.
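If you want to record these rates over time, the output line can be parsed. The following is a minimal Python sketch, assuming the output keeps the format of the example line above. The function names and the 100 MiB/s threshold are hypothetical, and support for KiB and GiB units is an assumption.

```python
import re

# Matches one direction, for example "rx (download): 914.89 MiB/s 150423 p/s".
RATE_PATTERN = re.compile(
    r"(?P<direction>rx|tx) \(\w+\):\s*(?P<rate>[\d.]+) (?P<unit>[KMG]iB)/s\s+(?P<pps>\d+) p/s"
)
# Assumption: the tool may print other binary units on quieter or busier links.
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_eadm_stat(line):
    """Return {"rx": {...}, "tx": {...}} with throughput in MiB/s and packets/s."""
    stats = {}
    for match in RATE_PATTERN.finditer(line):
        stats[match["direction"]] = {
            "mib_per_s": float(match["rate"]) * UNIT_TO_MIB[match["unit"]],
            "packets_per_s": int(match["pps"]),
        }
    return stats


def has_bidirectional_traffic(stats, min_mib_per_s=100.0):
    """True if both directions exceed the (hypothetical) threshold."""
    return all(
        direction in stats and stats[direction]["mib_per_s"] >= min_mib_per_s
        for direction in ("rx", "tx")
    )


if __name__ == "__main__":
    line = "rx (download): 914.89 MiB/s 150423 p/s; tx (upload): 914.66 MiB/s 147128 p/s"
    parsed = parse_eadm_stat(line)
    print(parsed, has_bidirectional_traffic(parsed))
```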
To analyze training performance at the Torch, Python, and CUDA kernel levels, use the AI Profiling feature in ACK. AI Profiling identifies compute, communication, and memory bottlenecks in your training jobs.
Clean up resources
After training completes, delete the job:
arena delete pytorch-mnist -n default
Expected output:
INFO[0001] The training job pytorch-mnist has been deleted successfully
This removes all pods and Kubernetes resources created by the job. TensorBoard logs in persistent storage are not deleted.
Arena job parameters
| Parameter | Description | Recommended value |
|---|---|---|
| --name | Job name. Must be unique within the cluster. | Use a name that reflects the task or team context. |
| --namespace | Kubernetes namespace for the job. | Isolate by team or project. |
| --workers | Total number of workers, including the master. | 2 for one master and one worker. |
| --gpus | Number of GPUs per worker. | Set based on model size and GPU memory requirements. |
| --device=aliyun/erdma=1 | Required to enable eRDMA. Requests one eRDMA resource per worker. | Always set to 1 when using eRDMA. |
| --nproc-per-node | Number of training processes per node. | Match to --gpus. |
| --clean-task-policy | Pod cleanup policy after job completion. | None to retain pods for log inspection. |
| --env | Environment variables passed to training containers. | Configure NCCL parameters as needed. |
NCCL environment variables
NCCL environment variables fall into two categories: system configuration parameters that are safe to keep in production scripts, and debugging parameters that should be removed or reduced in production.
System configuration (safe for production):
| Variable | Description | Recommended value |
|---|---|---|
| NCCL_SOCKET_IFNAME | Network interface for NCCL coordination traffic. | eth0 (default NIC in ACK clusters) |
| NCCL_NET_GDR_LEVEL | GPU Direct RDMA policy level (0–5). | 1 (PIX level) for eRDMA |
| NCCL_P2P_LEVEL | Peer-to-peer communication policy (0–5). | 5 (SYS level) for cross-node communication |
| NCCL_ALGO | Collective communication algorithm. | Ring or Tree based on your network topology |
| NCCL_IB_HCA | Name of the RDMA NIC to use. | Run ibstat to list available NIC names. |
Debugging (use only when troubleshooting, remove in production):
| Variable | Description | Recommended value |
|---|---|---|
| NCCL_DEBUG | Log verbosity level. | INFO for debugging; switch to WARN in production. |
| NCCL_IB_TIMEOUT | InfiniBand communication timeout. Default: 22. | Increase to 23 if NCCL timeout errors occur. |
| NCCL_IB_RETRY_CNT | Retry count for failed InfiniBand operations. | Set to 10 for unstable networks. |
For the full NCCL environment variable reference, see the NCCL documentation.
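The split between the two categories can be applied mechanically when promoting a job from troubleshooting to production. The following is a minimal Python sketch of that idea. The function name production_env is hypothetical, and the policy it applies (drop the debugging variables, keep NCCL_DEBUG at WARN) is one reading of the tables above.

```python
# Variables listed under "Debugging" in the tables above.
DEBUG_ONLY = {"NCCL_DEBUG", "NCCL_IB_TIMEOUT", "NCCL_IB_RETRY_CNT"}


def production_env(env):
    """Return a copy of env with debugging variables removed.

    If NCCL_DEBUG was set, it is kept but lowered to WARN instead of dropped.
    """
    cleaned = {key: value for key, value in env.items() if key not in DEBUG_ONLY}
    if "NCCL_DEBUG" in env:
        cleaned["NCCL_DEBUG"] = "WARN"
    return cleaned


if __name__ == "__main__":
    troubleshooting_env = {
        "NCCL_NET_GDR_LEVEL": "1",
        "NCCL_P2P_LEVEL": "5",
        "NCCL_DEBUG": "INFO",
        "NCCL_SOCKET_IFNAME": "eth0",
        "NCCL_ALGO": "Ring",
        "NCCL_IB_TIMEOUT": "23",
    }
    for key, value in production_env(troubleshooting_env).items():
        print(f"--env {key}={value}")
```

If your cluster needs the longer timeout permanently, remove NCCL_IB_TIMEOUT from DEBUG_ONLY rather than re-adding it by hand on every submit.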
FAQ
The log does not show NET/IB : Using erdma_0
This almost always means eRDMA was not mounted into the container. The most common cause is a missing --device=aliyun/erdma=1 in the submit command. Add it and resubmit.
If the parameter is present, verify at the node level:
Check that the node exposes eRDMA resources:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
The output must include "aliyun/erdma". If it does not, the ACK eRDMA Controller is not running correctly. Check the pod status in the ack-erdma-controller namespace.
Check that the eRDMA device is healthy on the node:
ibv_devinfo
Look for erdma_0 with port_state: PORT_ACTIVE.
Check that the container image includes the required RDMA user-mode libraries: libibverbs1, ibverbs-providers, and librdmacm1. If not, build a new image. See Build a production container image with eRDMA support.
NCCL timeout errors occur during training
Timeout errors usually mean network instability or that the default timeout is too short for your cluster. Increase the timeout and retry limit:
--env NCCL_IB_TIMEOUT=23 \
--env NCCL_IB_RETRY_CNT=10
If errors persist, check network connectivity between nodes and verify eRDMA device status with ibv_devinfo.
How do I measure whether eRDMA actually improves training performance?
Submit two otherwise identical jobs, one with --device=aliyun/erdma=1 and one without, then compare total completion time for the same number of training steps. For a more detailed breakdown, use the AI Profiling tool to measure the fraction of time spent in communication versus compute.
eRDMA delivers the most noticeable performance gains in multi-node scenarios (two or more nodes) with large models (billions of parameters or more).
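The comparison itself is simple arithmetic. The following is a minimal Python sketch of how the two measurements could be summarized. The function names are hypothetical, and the timings in the usage example are made-up placeholders, not measured results.

```python
def speedup(baseline_seconds, erdma_seconds):
    """Ratio of completion times; values above 1.0 mean the eRDMA run was faster."""
    if baseline_seconds <= 0 or erdma_seconds <= 0:
        raise ValueError("completion times must be positive")
    return baseline_seconds / erdma_seconds


def communication_fraction(communication_seconds, total_seconds):
    """Share of wall-clock time spent in communication, between 0.0 and 1.0."""
    if total_seconds <= 0 or not 0 <= communication_seconds <= total_seconds:
        raise ValueError("expected 0 <= communication_seconds <= total_seconds")
    return communication_seconds / total_seconds


if __name__ == "__main__":
    # Placeholder timings for illustration only; substitute your own measurements.
    baseline, with_erdma = 1200.0, 1000.0
    print(f"speedup: {speedup(baseline, with_erdma):.2f}x")
    print(f"communication share: {communication_fraction(300.0, baseline):.0%}")
```

Run each configuration more than once where possible, since a single pair of runs can be dominated by image pulls or data loading rather than communication.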
Arena reports insufficient resources when submitting a job
Run kubectl get nodes to check node status and available resources. If a node shows insufficient aliyun/erdma or nvidia.com/gpu capacity, either add nodes or reduce resource requests.
If resources appear available but pods still fail to schedule, run:
kubectl describe pod <pod-name>
The Events section shows the specific scheduling failure reason, such as a missing toleration or unsatisfied node selector.
Build a production container image with eRDMA support
For production, build a container image that installs RDMA user-mode libraries from the Alibaba Cloud eRDMA APT repository. The following Dockerfile builds on the official PyTorch image and includes the eRDMA user-mode libraries and the MNIST training example. The first stage uses a here-document inside a RUN instruction, so build it with BuildKit enabled.
For more details, see Enable eRDMA in containers (Docker).
ARG PYTORCH_IMAGE=docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
FROM docker.io/library/alpine:3.22.1 AS downloader
WORKDIR /workspace
RUN apk add git wget gzip
RUN mkdir -p /workspace/MNIST/raw && \
cat <<EOF > /workspace/MNIST/raw/checksums.md5
f68b3c2dcbeaaa9fbdd348bbdeb94873  train-images-idx3-ubyte.gz
d53e105ee54ea40749a09fcbcd1e9432  train-labels-idx1-ubyte.gz
9fb629c4189551a2d022fa330f9573f3  t10k-images-idx3-ubyte.gz
ec29112dd5afa0611ce80d1b7f02629c  t10k-labels-idx1-ubyte.gz
EOF
RUN cd /workspace/MNIST/raw && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz && \
md5sum -c checksums.md5 && \
rm checksums.md5 && \
gzip -d *.gz
RUN git clone https://github.com/kubeflow/arena.git -b v0.15.2
FROM ${PYTORCH_IMAGE}
WORKDIR /workspace
COPY --from=downloader /workspace .
RUN set -eux && \
apt update && \
apt install -y wget gpg && \
wget -qO - https://mirrors.aliyun.com/erdma/GPGKEY | gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg && \
echo "deb [ ] https://mirrors.aliyun.com/erdma/apt/ubuntu jammy/erdma main" > /etc/apt/sources.list.d/erdma.list && \
apt update && \
apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1 && \
pip install --no-cache-dir -r /workspace/arena/examples/pytorch/mnist/requirements.txt && \
rm -rf /var/lib/apt/lists/*