Submit PyTorch distributed training jobs with eRDMA acceleration using Arena - Container Service for Kubernetes

In multi-node GPU training, network communication latency can degrade overall performance. To reduce model training time, use Arena to submit PyTorch distributed training jobs and enable eRDMA network acceleration. This provides low-latency, high-throughput node-to-node communication, improves training efficiency, and increases cluster utilization.

Applicable scenarios

Cluster type: ACK managed clusters only.
eRDMA instance types:
- gn8is, ebmgn8is, gn8v, ebmgn8v

Step 1: Install the ACK eRDMA Controller component

Install the ACK eRDMA Controller component.

Note

If your cluster uses the Terway network plugin, you must also configure a whitelist for Terway. This prevents the Terway component from modifying the eRDMA NIC. For more information, see Configure a whitelist for ENIs.
When a node has multiple NICs, the ACK eRDMA Controller assigns a lower route priority to the routes for the attached eRDMA NICs than to the routes for other NICs in the same CIDR block. The default route priority is 200. If you manually configure NICs after you install the ACK eRDMA Controller, you must avoid route conflicts.

On the Clusters page, click the name of your cluster. In the navigation pane on the left, click Add-ons.

On the Add-ons page, click the Network tab, find the ACK eRDMA Controller component, and then follow the prompts to configure and install the component.

Configuration item	Description
preferDriver (Driver Type)	The eRDMA driver type used on the cluster nodes. Valid values: default (the default driver mode), compat (the RoCE-compatible driver mode), and ofed (the OFED-based driver mode, suitable for GPU-accelerated instance types). For more information about driver types, see Enable eRDMA.
Assign all eRDMA devices on a node to a pod	Valid values: True (selected) assigns all eRDMA devices on the node to the pod. False (cleared) assigns one eRDMA device to the pod based on the NUMA topology; the node must have the CPU Static Policy enabled to ensure fixed NUMA allocation for pods and devices. For information about how to configure the CPU Policy, see Create and manage node pools.

After the installation is complete, in the navigation pane on the left, choose Workloads > Pods. Then, set the namespace to ack-erdma-controller and check the pod status to confirm that the component is running as expected.
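
The same status check can be scripted from a machine that has kubectl access to the cluster. The following is a minimal sketch rather than part of the official procedure: the helper name all_pods_running is hypothetical, and it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, and so on).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin.
# Succeeds (exit 0) only if at least one pod is listed and the STATUS column
# (third field) of every pod is "Running".
all_pods_running() {
  awk '
    NF >= 3 { total++; if ($3 != "Running") not_running++ }
    END     { exit !(total > 0 && not_running == 0) }
  '
}
```

For example, kubectl get pods -n ack-erdma-controller --no-headers | all_pods_running && echo "controller is running" prints the message only when every pod in the namespace is in the Running state.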

Step 2: Enable and verify eRDMA on nodes

Enable eRDMA on nodes and confirm that GPUs and eRDMA resources are correctly detected and schedulable.

Enable eRDMA on GPU instances.
On the Clusters page, click the target cluster name. In the navigation pane on the left, choose Workloads > Pods. In the row of the created pod, click Actions, then click Terminal. Log on to an eRDMA-enabled node and run the following command to view eRDMA device information.
```
ibv_devinfo
```
The expected output shows the erdma_0 device with port status PORT_ACTIVE. This confirms that eRDMA is enabled.
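
If you want to check the port state in a script instead of reading the output by eye, a small filter over the ibv_devinfo output is enough. This is a sketch under assumptions: the helper name erdma_port_active is hypothetical, and it relies on the usual ibv_devinfo layout in which each device starts with an hca_id: line and each port reports a state: line.

```shell
# Hypothetical helper: reads `ibv_devinfo` output on stdin and succeeds only
# if the named device (default: erdma_0) has a port in state PORT_ACTIVE.
erdma_port_active() {
  awk -v dev="${1:-erdma_0}" '
    $1 == "hca_id:" { current = $2 }
    $1 == "state:" && current == dev && $2 == "PORT_ACTIVE" { found = 1 }
    END { exit !found }
  '
}
```

On the node, ibv_devinfo | erdma_port_active && echo "erdma_0 is active" prints the message only when the device is present and its port is active. Pass a different device name as the first argument to check another NIC.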

View allocatable resources on the node.

```
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
```

Expected output:

```
{"aliyun/erdma":"200","cpu":"15890m","ephemeral-storage":"243149919035","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"128290128Ki","nvidia.com/gpu":"1","pods":"64"}
```

"nvidia.com/gpu":"1": Number of available GPUs is 1.
"aliyun/erdma":"200": Up to 200 pods can use eRDMA.

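To read a single resource from the allocatable output in a script, you can extract one key at a time. The sketch below is not a general JSON parser: the helper name allocatable_value is hypothetical, and it assumes the flat object with quoted string values shown in the expected output above.

```shell
# Hypothetical helper: reads the flat allocatable JSON object on stdin and
# prints the value of the key given as the first argument.
allocatable_value() {
  key=$1
  # Put each "key":"value" pair on its own line, then print the matching value.
  tr ',{}' '\n\n\n' | sed -n "s|^\"$key\":\"\(.*\)\"\$|\1|p"
}
```

For example, kubectl get node <node-name> -o jsonpath='{.status.allocatable}' | allocatable_value aliyun/erdma prints the eRDMA value, and an empty result suggests that the node does not expose the aliyun/erdma resource.
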
Step 3: Submit training jobs using Arena

Submit PyTorch distributed training jobs using Arena and request eRDMA resources to accelerate node-to-node communication.

Install the Arena component of the Cloud-native AI Suite.

Configure the Arena client and submit training jobs. For more parameters, see Key Arena job parameters.

This topic uses the prebuilt image kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example for demonstration. This image is for quick evaluation only. For production environments, see How do I build a container image that supports eRDMA in production?

```
arena submit pytorch \
    --name=pytorch-mnist \
    --namespace=default \
    --workers=2 \
    --gpus=1 \
    --device=aliyun/erdma=1 \
    --nproc-per-node=1 \
    --env NCCL_NET_GDR_LEVEL=1 \
    --env NCCL_P2P_LEVEL=5 \
    --env NCCL_DEBUG=INFO \
    --env NCCL_SOCKET_IFNAME=eth0 \
    --env NCCL_ALGO=Ring \
    --clean-task-policy=None \
    --working-dir=/root \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example \
    --image-pull-policy=Always \
    --tensorboard \
    --logdir=/workspace/logs \
    "torchrun /workspace/arena/examples/pytorch/mnist/main.py \
        --epochs 10000 \
        --backend nccl \
        --data /workspace \
        --dir /workspace/logs"
```

Step 4: Verify training job status

Confirm the training job starts successfully and verify eRDMA network acceleration is active.

Get the cluster kubeconfig and connect to the cluster using kubectl.
View the logs of the master pod. If the logs contain Train Epoch lines, the training job is running normally.
```
kubectl logs pytorch-mnist-master-0 | head -n 50
```
Verify eRDMA is active. In the master pod logs, look for NET/IB : Using [0]erdma_0:1/RoCE. If found, this means:
- NCCL detects the eRDMA NIC (erdma_0).
- Communication uses RoCE (RDMA over Converged Ethernet).
- Node-to-node communication in distributed training uses eRDMA acceleration.
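
The log check above can be turned into a pass/fail test. The following sketch only wraps grep around the log pattern quoted in this step; the helper name nccl_uses_erdma is hypothetical, and the check works only if NCCL_DEBUG=INFO is set so that NCCL prints its transport selection.

```shell
# Hypothetical helper: reads pod logs on stdin and succeeds if NCCL reported
# that its IB/RoCE transport selected an eRDMA device (erdma_0, erdma_1, ...).
nccl_uses_erdma() {
  grep -Eq 'NET/IB : Using .*erdma_[0-9]+'
}
```

For example, kubectl logs pytorch-mnist-master-0 | nccl_uses_erdma && echo "eRDMA is in use" prints the message only when the expected line is present in the master pod logs.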

Step 5: Monitor eRDMA NIC traffic

Monitor eRDMA NIC traffic in real time to verify data transfer.

On the Clusters page, click the target cluster name. In the navigation pane on the left, choose Workloads > Pods. In the row of the created pod, click Actions, then click Terminal. Log on to a training node and run the following command to monitor eRDMA NIC traffic.
```
eadm stat -d erdma_0 -l
```
The expected output is rx (download): 914.89 MiB/s 150423 p/s; tx (upload): 914.66 MiB/s 147128 p/s. Continuous bidirectional upload and download traffic during training indicates gradient synchronization and other communication operations between nodes.
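
To log the throughput figures over time, you can reduce each sample to its rx and tx rates. This is a sketch under a stated assumption: it expects lines in the form quoted above (rx (download): ... MiB/s and tx (upload): ... MiB/s), and the actual layout printed by eadm on your nodes may differ. The helper name erdma_rates is hypothetical.

```shell
# Hypothetical helper: reads eadm stat output on stdin and prints one line per
# direction in the form "<rx|tx> <rate in MiB/s>".
erdma_rates() {
  grep -Eo '(rx|tx) \([a-z]+\): [0-9.]+ MiB/s' | awk '{ print $1, $3 }'
}
```

Piping the sample line from this step through erdma_rates yields rx 914.89 and tx 914.66 on separate lines. Rates that stay at zero during training suggest that traffic is not going through the eRDMA NIC.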

ACK clusters support AI Profiling. This feature analyzes training performance at the Torch, Python, and CUDA kernel levels. Use the Profiling tool to identify performance bottlenecks in training jobs, including compute, communication, and memory usage.

Clean up resources

After the training job completes, run the following command to delete it:

```
arena delete pytorch-mnist -n default
```

This deletes all pods and resources associated with the job but does not delete persistent data such as TensorBoard logs.

Expected output:

INFO[0001] The training job pytorch-mnist has been deleted successfully

FAQ

Why does the log not show `NET/IB : Using erdma_0`?

Possible causes include the following:

You did not add the --device=aliyun/erdma=1 parameter in the submit command.
The node does not support eRDMA or the eRDMA Controller add-on is not installed correctly.
The container image lacks RDMA user-mode libraries (such as libibverbs and librdmacm).

Solutions:

Run kubectl get node <node-name> -o jsonpath='{.status.allocatable}' to confirm the node has aliyun/erdma resources.
Log on to the node and run ibv_devinfo to check if the eRDMA device works.
Check whether the container image includes required RDMA libraries.
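
One way to check the image for the RDMA libraries is to inspect the dynamic linker cache inside the container. The sketch below assumes a glibc-based image where ldconfig -p is available; the helper name missing_rdma_libs is hypothetical.

```shell
# Hypothetical helper: reads `ldconfig -p` output on stdin and prints the name
# of each required RDMA user-mode library that is NOT listed.
# Prints nothing when both libraries are present.
missing_rdma_libs() {
  listing=$(cat)
  for lib in libibverbs librdmacm; do
    printf '%s\n' "$listing" | grep -q "${lib}\.so" || printf '%s\n' "$lib"
  done
}
```

Inside the training container, ldconfig -p | missing_rdma_libs prints the libraries that still need to be installed. An empty result means that both libibverbs and librdmacm are visible to the dynamic linker.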

What do I do if NCCL timeout errors occur during training?

NCCL timeouts usually result from unstable networks or incorrect configurations. We recommend:

Increase the timeout value: Set --env NCCL_IB_TIMEOUT=23. The default depends on the NCCL version, so check the NCCL documentation for the release in your image.
Increase retry attempts: Set --env NCCL_IB_RETRY_CNT=10.
Check network connectivity between nodes and eRDMA device status.
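
When choosing a timeout value, it helps to know that NCCL_IB_TIMEOUT is an exponent, not a number of seconds. According to the NCCL documentation, the timeout is computed as 4.096 µs × 2^value, so each increment doubles the wait. The following sketch converts a value to seconds; the helper name ib_timeout_seconds is hypothetical.

```shell
# Hypothetical helper: converts an NCCL_IB_TIMEOUT value to seconds using
# the formula 4.096 microseconds * 2^value.
ib_timeout_seconds() {
  awk -v t="$1" 'BEGIN { printf "%.2f\n", 4.096e-6 * 2 ^ t }'
}
```

For example, ib_timeout_seconds 23 prints 34.36, which is twice the 17.18 seconds that a value of 22 gives.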

How do I determine whether eRDMA actually improves training performance?

Compare results as follows:

Submit two training jobs: one with eRDMA (add --device=aliyun/erdma=1) and one without.
Compare completion times for the same number of training steps.
Analyze communication time percentage using the AI Profiling tool.

eRDMA delivers the most noticeable performance gains in multi-node (two or more nodes) scenarios with large models (billions of parameters or more).
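
Once both jobs have run the same number of steps, the comparison reduces to a ratio of the two completion times. The sketch below is only an illustration: the helper name speedup is hypothetical, and the durations in the example are placeholders, not measured results.

```shell
# Hypothetical helper: prints how many times faster the eRDMA run is.
# $1 = completion time without eRDMA, $2 = completion time with eRDMA
# (same unit for both, for example seconds).
speedup() {
  awk -v base="$1" -v accel="$2" 'BEGIN {
    if (accel <= 0) exit 1   # reject zero or negative durations
    printf "%.2f\n", base / accel
  }'
}
```

For example, speedup 120 80 prints 1.50. A value close to 1.00 suggests that communication is not the bottleneck for that model and cluster size.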

What do I do if Arena fails to submit a job and reports insufficient resources?

Possible causes include the following:

Insufficient GPU or eRDMA resources.
Nodes do not meet scheduling requirements (such as node selectors or taint tolerations).

Solutions:

Run kubectl get nodes to view node status and available resources.
Run kubectl describe pod <pod-name> to view detailed reasons for pod scheduling failure.
Adjust resource requests or add nodes as needed.

Key Arena job parameters

Parameter	Description	Recommended Configuration
`--name`	Job name. Must be unique across the cluster.	Use a name that reflects business context.
`--namespace`	Namespace for the job.	Isolate by team or project.
`--workers`	Number of worker nodes (including the master).	Set to 2 for one master and one worker.
`--gpus`	Number of GPUs per worker.	Set based on model size and GPU memory requirements.
`--device=aliyun/erdma=1`	Key parameter to enable eRDMA. Assigns one eRDMA resource per worker.	Must be set to 1 to enable eRDMA.
`--nproc-per-node`	Number of training processes per node.	Usually matches `--gpus`.
`--clean-task-policy`	Pod cleanup policy.	Set to `None` to retain pods for log inspection.
`--env`	Environment variables.	Configure NCCL communication parameters.

NCCL environment variables (to optimize distributed communication):

Environment variable	Description	Recommended configuration
`NCCL_SOCKET_IFNAME`	Name of the network interface used for communication.	Set to `eth0` (default NIC for ACK clusters).
`NCCL_NET_GDR_LEVEL`	GPU Direct RDMA policy level (0–5).	For eRDMA, set to `1` (PIX level).
`NCCL_P2P_LEVEL`	Peer-to-peer communication policy (0–5).	Set to `5` (SYS level) to allow GPU peer-to-peer transfers across any topology distance within a node, including across NUMA nodes.
`NCCL_ALGO`	Collective communication algorithm.	Choose `Ring`, `Tree`, or another algorithm based on your network topology.
`NCCL_DEBUG`	Log level.	Set to `INFO` for debugging. In production, use `WARN`.
`NCCL_IB_HCA`	Name of the RDMA NIC to use.	Run `ibstat` to list available NIC names.
`NCCL_IB_TIMEOUT`	InfiniBand communication timeout.	Adjust based on network latency.
`NCCL_IB_RETRY_CNT`	Number of retries for failed InfiniBand communication.	Increase for unstable networks.

For more NCCL environment variables, see the NCCL documentation.
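
The integer levels used for NCCL_NET_GDR_LEVEL and NCCL_P2P_LEVEL are legacy aliases for named topology levels. As a quick reference, the sketch below maps an integer to its name as described in the NCCL documentation, which lists 0 to 4 as LOC, PIX, PXB, PHB, and SYS and treats larger values as SYS. The helper name nccl_level_name is hypothetical. The NCCL documentation recommends the string names over integers, so verify the mapping against the NCCL version in your image.

```shell
# Hypothetical helper: prints the topology level name for a legacy integer
# value of NCCL_NET_GDR_LEVEL or NCCL_P2P_LEVEL.
nccl_level_name() {
  case "$1" in
    ''|*[!0-9]*) echo "not a non-negative integer: $1" >&2; return 1 ;;
    0) echo LOC ;;
    1) echo PIX ;;
    2) echo PXB ;;
    3) echo PHB ;;
    *) echo SYS ;;   # 4 and any larger value are interpreted as SYS
  esac
}
```

For example, nccl_level_name 1 prints PIX and nccl_level_name 5 prints SYS, which matches the values used in the submit command in this topic.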

How do I build a container image that supports eRDMA in production?

To use eRDMA, install RDMA user-mode libraries and drivers in the container. The following Dockerfile example builds on the official PyTorch image and integrates eRDMA drivers and the MNIST training example.

For more information about building container images that support eRDMA in production, see Enable eRDMA in containers (Docker).

```
# Base image for the final stage. Override it with --build-arg if needed.
ARG PYTORCH_IMAGE=docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

# Stage 1: download the MNIST dataset and the Arena examples.
FROM docker.io/library/alpine:3.22.1 AS downloader

WORKDIR /workspace

RUN apk add --no-cache git wget gzip

# Write the expected MD5 checksums. The heredoc requires BuildKit.
# md5sum -c expects two spaces between the checksum and the file name.
RUN mkdir -p /workspace/MNIST/raw && \
    cat <<EOF > /workspace/MNIST/raw/checksums.md5
f68b3c2dcbeaaa9fbdd348bbdeb94873  train-images-idx3-ubyte.gz
d53e105ee54ea40749a09fcbcd1e9432  train-labels-idx1-ubyte.gz
9fb629c4189551a2d022fa330f9573f3  t10k-images-idx3-ubyte.gz
ec29112dd5afa0611ce80d1b7f02629c  t10k-labels-idx1-ubyte.gz
EOF

# Download the dataset, verify the checksums, and decompress the files.
RUN cd /workspace/MNIST/raw && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz && \
    wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz && \
    md5sum -c checksums.md5 && \
    rm checksums.md5 && \
    gzip -d *.gz

RUN git clone https://github.com/kubeflow/arena.git -b v0.15.2

# Stage 2: build the training image on top of the PyTorch image.
FROM ${PYTORCH_IMAGE}

WORKDIR /workspace

COPY --from=downloader /workspace .

# Add the eRDMA APT repository, install the RDMA user-mode libraries,
# and install the Python dependencies of the MNIST example.
RUN set -eux && \
    apt update && \
    apt install -y wget gpg && \
    wget -qO - https://mirrors.aliyun.com/erdma/GPGKEY | gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg && \
    echo "deb [ ] https://mirrors.aliyun.com/erdma/apt/ubuntu jammy/erdma main" > /etc/apt/sources.list.d/erdma.list && \
    apt update && \
    apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1 && \
    pip install --no-cache-dir -r /workspace/arena/examples/pytorch/mnist/requirements.txt && \
    rm -rf /var/lib/apt/lists/*
```