Container Service for Kubernetes: Submit eRDMA-accelerated PyTorch training jobs

Last Updated: Mar 26, 2026

In multi-node GPU training, network communication latency is often the primary bottleneck. Elastic Remote Direct Memory Access (eRDMA) provides kernel-bypass, low-latency data transfer between nodes, which speeds up gradient synchronization and improves overall training throughput.

This topic walks you through installing the eRDMA controller, enabling eRDMA on GPU nodes, submitting a PyTorch distributed training job via Arena, and verifying that eRDMA acceleration is active.

Prerequisites

Before you begin, make sure you have:

Step 1: Install the ACK eRDMA Controller add-on

If your cluster uses the Terway network plugin, configure a whitelist for Terway before installing this add-on. This prevents Terway from modifying the eRDMA NIC. For details, see Configure a whitelist for ENIs.
  1. On the Clusters page, click your cluster name. In the left navigation pane, click Add-ons.

  2. On the Component Management page, click the Network tab, find ACK eRDMA Controller, and follow the prompts to configure and install it.

    When a node has multiple NICs, the ACK eRDMA Controller assigns a lower route priority (200 by default) to eRDMA NIC routes than to other NICs in the same CIDR block. If you manually configure NICs after installation, avoid route conflicts.
    Configuration item | Description
    preferDriver (Driver type) | The eRDMA driver mode for your nodes. Options: default (default mode), compat (RoCE-compatible mode), ofed (OFED-based mode, recommended for GPU-accelerated instances). For details, see Enable eRDMA.
    Assign all eRDMA devices on a node to a pod | True: assigns all eRDMA devices on the node to the pod. False: assigns one eRDMA device per pod based on NUMA topology. Requires the CPU Static Policy to be enabled on the node. For details, see Create and manage node pools.
  3. After installation completes, go to Workloads > Pods, set the namespace to ack-erdma-controller, and confirm that the controller pod is in Running state.
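The console check in step 3 can also be run from a terminal. The following is a minimal sketch, assuming kubectl is configured for the cluster. The helper name all_pods_running is hypothetical, and it expects the output of kubectl get pods --no-headers on standard input.

```shell
# Hypothetical helper: succeeds only if at least one pod line is read from
# stdin and every line reports STATUS "Running" (the third column of
# `kubectl get pods --no-headers`).
all_pods_running() {
    awk 'NF { seen = 1; if ($3 != "Running") bad = 1 }
         END { exit !(seen && !bad) }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get pods -n ack-erdma-controller --no-headers | all_pods_running \
#       && echo "ACK eRDMA Controller pods are running"
```

An empty pod list is treated as a failure, so a missing installation is not mistaken for a healthy one.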

Step 2: Enable and verify eRDMA on nodes

  1. Enable eRDMA on your GPU instances.

  2. Verify that the eRDMA device is active. On the Clusters page, click your cluster name, go to Workloads > Pods, and open a terminal on an eRDMA-enabled node. Run:

    ibv_devinfo

    In the output, look for the erdma_0 device with port_state: PORT_ACTIVE. This confirms eRDMA is enabled on the node.

  3. Verify that the node exposes eRDMA as a schedulable resource:

    kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

    Expected output:

    {"aliyun/erdma":"200","cpu":"15890m","ephemeral-storage":"243149919035","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"128290128Ki","nvidia.com/gpu":"1","pods":"64"}

    Confirm two fields:

    • "nvidia.com/gpu":"1" means the node has 1 GPU available for scheduling

    • "aliyun/erdma":"200" means up to 200 pods on this node can request eRDMA resources
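Both node checks in this step can be scripted. The sketch below is illustrative only: the helper names erdma_port_active and allocatable_value are hypothetical, and the first helper assumes that ibv_devinfo starts each device section with an hca_id line and prints PORT_ACTIVE somewhere inside the section of an active device.

```shell
# Hypothetical helper: reads `ibv_devinfo` output on stdin and succeeds if the
# section for the given device (default: erdma_0) contains PORT_ACTIVE.
erdma_port_active() {
    awk -v dev="${1:-erdma_0}" '
        /hca_id/      { current = $2 }              # start of a device section
        /PORT_ACTIVE/ { if (current == dev) ok = 1 }
        END           { exit !ok }'
}

# Hypothetical helper: reads the allocatable JSON on stdin and prints the
# integer value of one resource key, for example aliyun/erdma.
allocatable_value() {
    key=$(printf '%s' "$1" | sed 's|[./]|\\&|g')    # escape . and / for sed
    sed -n "s/.*\"$key\":\"\([0-9][0-9]*\)\".*/\1/p"
}

# Usage (on the node, and from a machine with kubectl access):
#   ibv_devinfo | erdma_port_active && echo "erdma_0 is active"
#   kubectl get node <node-name> -o jsonpath='{.status.allocatable}' \
#       | allocatable_value aliyun/erdma
```

allocatable_value prints nothing when the key is absent, which is the case to watch for when the controller is not running on the node.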

Step 3: Submit a PyTorch distributed training job

The example below uses a prebuilt image for quick evaluation. For production use, build your own image that includes the required RDMA user-mode libraries. See Build a production container image with eRDMA support.

arena submit pytorch \
    --name=pytorch-mnist \
    --namespace=default \
    --workers=2 \
    --gpus=1 \
    --device=aliyun/erdma=1 \
    --nproc-per-node=1 \
    --env NCCL_NET_GDR_LEVEL=1 \
    --env NCCL_P2P_LEVEL=5 \
    --env NCCL_DEBUG=INFO \
    --env NCCL_SOCKET_IFNAME=eth0 \
    --env NCCL_ALGO=Ring \
    --clean-task-policy=None \
    --working-dir=/root \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example \
    --image-pull-policy=Always \
    --tensorboard \
    --logdir=/workspace/logs \
    "torchrun /workspace/arena/examples/pytorch/mnist/main.py \
        --epochs 10000 \
        --backend nccl \
        --data /workspace \
        --dir /workspace/logs"

The key parameter is --device=aliyun/erdma=1, which requests one eRDMA resource per worker and makes it available inside the container. Without this parameter, NCCL will not use eRDMA even if the node supports it.

For a full parameter reference, see Arena job parameters and NCCL environment variables.

Step 4: Verify the training job

  1. Check that the master pod logs show training progress:

    kubectl logs pytorch-mnist-master-0 | head -n 50

    If logs contain Train Epoch, the job is running correctly.

  2. Confirm eRDMA is active in the NCCL communication path. Look for the following line in the master pod logs:

    NET/IB : Using [0]erdma_0:1/RoCE

    This line confirms that:

    • NCCL detected the eRDMA NIC (erdma_0)

    • Communication uses RoCE (RDMA over Converged Ethernet)

    • Node-to-node gradient synchronization is using eRDMA acceleration
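The two log checks in this step can be combined into one pass over the logs. This is a rough sketch rather than an official tool: the helper name logs_show_erdma_training is hypothetical, and it only looks for the Train Epoch text and the NET/IB line pattern described above.

```shell
# Hypothetical helper: reads pod logs on stdin and succeeds only if they show
# both training progress and NCCL selecting an eRDMA device over NET/IB.
logs_show_erdma_training() {
    awk '
        /Train Epoch/                            { progress = 1 }
        /NET\/IB : Using \[[0-9]+\]erdma_[0-9]+/ { erdma = 1 }
        END {
            if (!progress) print "no Train Epoch lines found"
            if (!erdma)    print "no NET/IB line naming an erdma device found"
            exit !(progress && erdma)
        }'
}

# Usage (assumes kubectl access and the job name from Step 3):
#   kubectl logs pytorch-mnist-master-0 | logs_show_erdma_training
```

The NET/IB line is only logged when NCCL_DEBUG is set to INFO or a more verbose level, as in the submit command in Step 3.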

Step 5: Monitor eRDMA NIC traffic

Open a terminal on a training node (via Workloads > Pods > Actions > Terminal) and run:

eadm stat -d erdma_0 -l

During active training, the output shows continuous bidirectional traffic, for example:

rx (download): 914.89 MiB/s 150423 p/s; tx (upload): 914.66 MiB/s 147128 p/s

Sustained upload and download traffic at this level indicates that gradient synchronization between nodes is flowing through eRDMA.
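To turn this visual check into a pass/fail test, the traffic line can be parsed. The sketch below assumes the line format shown in the example output above. The helper name erdma_traffic_above and any threshold you pass to it are hypothetical, and lines reported in a unit other than MiB/s are ignored.

```shell
# Hypothetical helper: reads `eadm stat` lines on stdin and succeeds if any
# line reports both rx and tx at or above the given threshold in MiB/s.
erdma_traffic_above() {
    awk -v min="$1" '
        {
            rx = tx = -1
            for (i = 1; i <= NF; i++) {
                # The value is two fields after "rx"/"tx"; the unit is three.
                if ($i == "rx" && $(i + 3) == "MiB/s") rx = $(i + 2)
                if ($i == "tx" && $(i + 3) == "MiB/s") tx = $(i + 2)
            }
            if (rx + 0 >= min + 0 && tx + 0 >= min + 0) ok = 1
        }
        END { exit !ok }'
}

# Usage (on a training node; 100 is an arbitrary example threshold):
#   eadm stat -d erdma_0 -l | head -n 20 | erdma_traffic_above 100
```

A suitable threshold depends on model size and batch size, so treat any fixed number as a starting point rather than a pass criterion from the product documentation.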

To analyze training performance at the Torch, Python, and CUDA kernel levels, use the AI Profiling feature in ACK. AI Profiling identifies compute, communication, and memory bottlenecks in your training jobs.

Clean up resources

After training completes, delete the job:

arena delete pytorch-mnist -n default

Expected output:

INFO[0001] The training job pytorch-mnist has been deleted successfully

This removes all pods and Kubernetes resources created by the job. TensorBoard logs in persistent storage are not deleted.

Arena job parameters

Parameter | Description | Recommended value
--name | Job name. Must be unique within the cluster. | Use a name that reflects the task or team context.
--namespace | Kubernetes namespace for the job. | Isolate by team or project.
--workers | Total number of workers, including the master. | 2 for one master and one worker.
--gpus | Number of GPUs per worker. | Set based on model size and GPU memory requirements.
--device=aliyun/erdma=1 | Required to enable eRDMA. Requests one eRDMA resource per worker. | Always set to 1 when using eRDMA.
--nproc-per-node | Number of training processes per node. | Match to --gpus.
--clean-task-policy | Pod cleanup policy after job completion. | None to retain pods for log inspection.
--env | Environment variables passed to training containers. | Configure NCCL parameters as needed.

NCCL environment variables

NCCL environment variables fall into two categories: system configuration parameters that are safe to keep in production scripts, and debugging parameters that should be removed or reduced in production.

System configuration (safe for production):

Variable | Description | Recommended value
NCCL_SOCKET_IFNAME | Network interface for NCCL coordination traffic. | eth0 (default NIC in ACK clusters)
NCCL_NET_GDR_LEVEL | GPU Direct RDMA policy level (0–5). | 1 (PIX level) for eRDMA
NCCL_P2P_LEVEL | Peer-to-peer communication policy (0–5). | 5 (SYS level) for cross-node communication
NCCL_ALGO | Collective communication algorithm. | Ring or Tree based on your network topology
NCCL_IB_HCA | Name of the RDMA NIC to use. | Run ibstat to list available NIC names.

Debugging (use only when troubleshooting, remove in production):

Variable | Description | Recommended value
NCCL_DEBUG | Log verbosity level. | INFO for debugging; switch to WARN in production.
NCCL_IB_TIMEOUT | InfiniBand communication timeout. Default: 22. | Increase to 23 if NCCL timeout errors occur.
NCCL_IB_RETRY_CNT | Retry count for failed InfiniBand operations. | Set to 10 for unstable networks.

For the full NCCL environment variable reference, see the NCCL documentation.
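One way to keep debugging settings out of production submissions is to generate the --env flags from a single place. The following is a sketch under that assumption: the helper name nccl_env_flags and its debug/production modes are hypothetical, while the variable values are the ones recommended in the tables above.

```shell
# Hypothetical helper: prints the --env flags for `arena submit` on one line.
# Mode "debug" keeps verbose NCCL logging; any other mode uses WARN.
nccl_env_flags() {
    if [ "$1" = "debug" ]; then level=INFO; else level=WARN; fi
    printf '%s ' \
        "--env NCCL_SOCKET_IFNAME=eth0" \
        "--env NCCL_NET_GDR_LEVEL=1" \
        "--env NCCL_P2P_LEVEL=5" \
        "--env NCCL_ALGO=Ring" \
        "--env NCCL_DEBUG=$level"
    printf '\n'
}

# Usage (the unquoted substitution is intentional so the flags word-split):
#   arena submit pytorch --name=pytorch-mnist ... $(nccl_env_flags production) ...
```

Because NCCL_DEBUG=WARN suppresses the NET/IB line used in Step 4, run at least one job in debug mode to confirm that eRDMA is selected before switching modes.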

FAQ

The log does not show NET/IB : Using erdma_0

This almost always means eRDMA was not mounted into the container. The most common cause is a missing --device=aliyun/erdma=1 in the submit command — add it and resubmit.

If the parameter is present, verify at the node level:

  1. Check that the node exposes eRDMA resources:

    kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

    The output must include "aliyun/erdma". If it does not, the ACK eRDMA Controller is not running correctly — check the pod status in the ack-erdma-controller namespace.

  2. Check that the eRDMA device is healthy on the node:

    ibv_devinfo

    Look for erdma_0 with port_state: PORT_ACTIVE.

  3. Check that the container image includes the required RDMA user-mode libraries: libibverbs1, ibverbs-providers, and librdmacm1. If not, build a new image. See Build a production container image with eRDMA support.
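The library check in item 3 can be run inside the container. This sketch assumes a Debian- or Ubuntu-based image where dpkg-query is available, and the helper name missing_rdma_libs is hypothetical.

```shell
# Hypothetical helper: reads installed package names on stdin (one per line)
# and prints each required RDMA user-mode library that is not in the list.
missing_rdma_libs() {
    installed=$(cat)
    for pkg in libibverbs1 ibverbs-providers librdmacm1; do
        printf '%s\n' "$installed" | grep -qx -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# Usage inside the training container:
#   dpkg-query -W -f='${Package}\n' | missing_rdma_libs
```

Empty output means all three packages are installed. Any package name printed needs to be added to the image.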

NCCL timeout errors occur during training

Timeout errors usually mean network instability or that the default timeout is too short for your cluster. Increase the timeout and retry limit:

--env NCCL_IB_TIMEOUT=23 \
--env NCCL_IB_RETRY_CNT=10

If errors persist, check network connectivity between nodes and verify eRDMA device status with ibv_devinfo.

How do I measure whether eRDMA actually improves training performance?

Submit two otherwise identical jobs — one with --device=aliyun/erdma=1 and one without — then compare total completion time for the same number of training steps. For a more detailed breakdown, use the AI Profiling tool to measure the fraction of time spent in communication versus compute.

eRDMA delivers the most noticeable performance gains in multi-node scenarios (two or more nodes) with large models (billions of parameters or more).
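Once both jobs have finished, the comparison reduces to a ratio of completion times. A minimal sketch follows. The helper name erdma_speedup is hypothetical, and the timings in the usage line are made-up placeholders, not measured results.

```shell
# Hypothetical helper: takes the wall-clock seconds of the baseline job and of
# the eRDMA job (same number of training steps) and prints the speedup factor.
erdma_speedup() {
    awk -v base="$1" -v erdma="$2" 'BEGIN {
        if (erdma <= 0) exit 1              # avoid division by zero
        printf "%.2f\n", base / erdma
    }'
}

# Usage with placeholder timings (baseline 600 s, eRDMA 480 s):
#   erdma_speedup 600 480
```

A value above 1.00 means the eRDMA job finished faster. Repeating each run a few times helps separate the effect of eRDMA from run-to-run variation.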

Arena reports insufficient resources when submitting a job

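Run kubectl get nodes to check node status, then run kubectl describe node <node-name> to compare the node's allocatable resources with what running pods have already requested. If a node has insufficient aliyun/erdma or nvidia.com/gpu capacity, either add nodes or reduce resource requests.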

If resources appear available but pods still fail to schedule, run:

kubectl describe pod <pod-name>

The Events section shows the specific scheduling failure reason, such as a missing toleration or unsatisfied node selector.
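To see at a glance which nodes can host a worker at all, you can list the allocatable GPU and eRDMA resources per node. The sketch below is illustrative: the helper name nodes_with_gpu_and_erdma is hypothetical, and it expects rows of the form NAME GPU ERDMA. Allocatable values do not subtract resources already requested by running pods, so a listed node may still be full.

```shell
# Hypothetical helper: reads "NAME GPU ERDMA" rows on stdin and prints the
# names of nodes that expose at least one GPU and one eRDMA resource.
# Non-numeric cells such as <none> count as zero.
nodes_with_gpu_and_erdma() {
    awk '$2 + 0 >= 1 && $3 + 0 >= 1 { print $1 }'
}

# Usage (assumes kubectl access; the dot in nvidia.com must be escaped):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,ERDMA:.status.allocatable.aliyun/erdma' \
#     | nodes_with_gpu_and_erdma
```

If this prints no nodes, the problem is missing capacity or a missing eRDMA device plugin rather than a scheduling constraint.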

Build a production container image with eRDMA support

For production, build a container image that installs RDMA user-mode libraries from the Alibaba Cloud eRDMA APT repository. The following Dockerfile builds on the official PyTorch image and includes the eRDMA drivers and the MNIST training example.

For more details, see Enable eRDMA in containers (Docker).

ARG PYTORCH_IMAGE=docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

FROM docker.io/library/alpine:3.22.1 AS downloader

WORKDIR /workspace

RUN apk add git wget gzip

RUN mkdir -p /workspace/MNIST/raw && \
    cat <<EOF > /workspace/MNIST/raw/checksums.md5
f68b3c2dcbeaaa9fbdd348bbdeb94873  train-images-idx3-ubyte.gz
d53e105ee54ea40749a09fcbcd1e9432  train-labels-idx1-ubyte.gz
9fb629c4189551a2d022fa330f9573f3  t10k-images-idx3-ubyte.gz
ec29112dd5afa0611ce80d1b7f02629c  t10k-labels-idx1-ubyte.gz
EOF

RUN cd /workspace/MNIST/raw && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz && \
    md5sum -c checksums.md5 && \
    rm checksums.md5 && \
    gzip -d *.gz

RUN git clone https://github.com/kubeflow/arena.git -b v0.15.2

FROM ${PYTORCH_IMAGE}

WORKDIR /workspace

COPY --from=downloader /workspace .

RUN set -eux && \
    apt update && \
    apt install -y wget gpg && \
    wget -qO - https://mirrors.aliyun.com/erdma/GPGKEY | gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg && \
    echo "deb [ ] https://mirrors.aliyun.com/erdma/apt/ubuntu jammy/erdma main" > /etc/apt/sources.list.d/erdma.list && \
    apt update && \
    apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1 && \
    pip install --no-cache-dir -r /workspace/arena/examples/pytorch/mnist/requirements.txt && \
    rm -rf /var/lib/apt/lists/*