In multi-node GPU training, network communication latency can become a performance bottleneck. To shorten model training cycles, you can use Arena to submit PyTorch distributed jobs and enable elastic Remote Direct Memory Access (eRDMA) for network acceleration. This setup provides low-latency, high-throughput communication between nodes, which improves training efficiency and cluster utilization.
Applicability
Cluster type: Only ACK managed clusters are supported.
Supported instance types for elastic Remote Direct Memory Access (eRDMA):
gn8is, ebmgn8is, gn8v, and ebmgn8v
Step 1: Install ACK eRDMA Controller
You can perform the following steps to install ACK eRDMA Controller.
If your ACK cluster uses Terway, configure an elastic network interface (ENI) filter for Terway to prevent Terway from modifying the eRDMA ENIs. For more information, see Configure an ENI filter.
If a node has multiple ENIs, ACK eRDMA Controller configures routes for the additional eRDMA ENIs with a lower priority than the routes for ENIs in the same CIDR block. The default routing priority is 200. If you need to manually configure ENIs after you install ACK eRDMA Controller, make sure that no routing conflicts occur.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, click Add-ons.
On the Add-ons page, click the Networking tab, find ACK eRDMA Controller, and follow the instructions on the page to configure and install the component.
| Parameter | Description |
| --- | --- |
| preferDriver (Driver type) | The type of the eRDMA driver used on the cluster nodes. Valid values: `default` (the default driver mode), `compat` (a driver mode that is compatible with RDMA over Converged Ethernet (RoCE)), and `ofed` (an OFED-based driver mode, which is applicable to GPU models). For more information about the driver types, see Enable eRDMA. |
| Specifies whether to assign all eRDMA devices of nodes to pods | If you select this check box (True), all eRDMA devices on the node are assigned to the pod. If you do not select it (False), each pod is assigned an eRDMA device based on the non-uniform memory access (NUMA) topology. In this case, you must enable the static CPU policy for the node so that pods and devices can be allocated on the same NUMA node. For more information about how to configure CPU policies, see Create and manage a node pool. |
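If you want to confirm which driver is active on a node after the component is installed, you can check the eRDMA kernel module and the RDMA devices directly on the node. This is an optional check; `lsmod` and `ibv_devices` (from the `ibverbs-utils` package) are generic RDMA tools and are not part of ACK eRDMA Controller.

```bash
# Optional check on an eRDMA-capable node:
lsmod | grep erdma   # the eRDMA kernel module should be loaded
ibv_devices          # lists RDMA devices; erdma_0 should appear
```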
In the left-side navigation pane, choose Workloads > Pods. On the Pods page, select the ack-erdma-controller namespace to view the status of pods and ensure that the component runs as expected.
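You can also check the component pods from the command line. The namespace below is the one mentioned above; the exact pod names vary by cluster.

```bash
# Verify that the ACK eRDMA Controller pods are in the Running state
kubectl get pods -n ack-erdma-controller
```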
Step 2: Enable and verify eRDMA on nodes
Enable eRDMA on the nodes and verify that the GPU and eRDMA resources on the nodes are detected and available for scheduling.
On the Clusters page, click the name of the target cluster. In the navigation pane on the left, choose . In the row of the pod you created, click Terminal in the Actions column. Log on to a node that supports eRDMA and run the following command to view eRDMA device information.
```bash
ibv_devinfo
```

If the output shows the `erdma_0` device and its port status is `PORT_ACTIVE`, the eRDMA device is enabled.

View the allocatable resources of the node:

```bash
kubectl get node <node_name> -o jsonpath='{.status.allocatable}'
```

Sample output:

```
{"aliyun/erdma":"200","cpu":"15890m","ephemeral-storage":"243149919035","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"128290128Ki","nvidia.com/gpu":"1","pods":"64"}
```

- `"nvidia.com/gpu":"1"`: indicates that one GPU is available.
- `"aliyun/erdma":"200"`: indicates that up to 200 pods can use eRDMA on the node.
Step 3: Submit a training job using Arena
Use Arena to submit a PyTorch distributed training job and request eRDMA resources to accelerate communication between nodes.
Configure the Arena client and submit a training job. For more information about the parameters, see Core job parameters.
This topic uses the pre-built image `kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example` for demonstration purposes. This image is for trial use only. For production environments, see How do I build a container image that supports eRDMA for a production environment?

```bash
arena submit pytorch \
    --name=pytorch-mnist \
    --namespace=default \
    --workers=2 \
    --gpus=1 \
    --device=aliyun/erdma=1 \
    --nproc-per-node=1 \
    --env NCCL_NET_GDR_LEVEL=1 \
    --env NCCL_P2P_LEVEL=5 \
    --env NCCL_DEBUG=INFO \
    --env NCCL_SOCKET_IFNAME=eth0 \
    --env NCCL_ALGO=Ring \
    --clean-task-policy=None \
    --working-dir=/root \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example \
    --image-pull-policy=Always \
    --tensorboard \
    --logdir=/workspace/logs \
    "torchrun /workspace/arena/examples/pytorch/mnist/main.py \
      --epochs 10000 \
      --backend nccl \
      --data /workspace \
      --dir /workspace/logs"
```
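After you submit the job, you can check its status with the standard Arena commands. The job name matches the `--name` value used above.

```bash
# View the status and pods of the submitted job
arena get pytorch-mnist -n default
```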
Step 4: Verify the training job status
Confirm that the training job has started and that the eRDMA network is active.
Obtain the KubeConfig of a cluster and use kubectl to connect to the cluster.
View the logs of the Master pod:

```bash
kubectl logs pytorch-mnist-master-0 | head -n 50
```

If the training log shows `Train Epoch` information, the training job is running correctly.

Verify that eRDMA is active. Search the logs of the Master pod for the message `NET/IB : Using [0]erdma_0:1/RoCE`. The presence of this line indicates the following:

- NCCL has successfully detected the eRDMA network interface card (`erdma_0`).
- The RoCE (RDMA over Converged Ethernet) protocol is used for communication.
- Communication between nodes for distributed training is accelerated by eRDMA.
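To locate the eRDMA-related lines quickly, you can filter the Master pod logs. The pod name follows the naming convention shown above.

```bash
# Search the Master pod logs for the NCCL network selection message
kubectl logs pytorch-mnist-master-0 | grep -E "NET/IB|erdma"
```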
Step 5: Monitor eRDMA network interface card traffic
Monitor the network traffic of the eRDMA network interface card in real time to verify data transmission.
On the Clusters page, click the name of the target cluster. In the navigation pane on the left, choose . In the row of the pod you created, click Terminal in the Actions column. Log on to a training node and run the following command to monitor the eRDMA network interface card traffic.
```bash
eadm stat -d erdma_0 -l
```

The expected output is similar to `rx (download): 914.89 MiB/s 150423 p/s; tx (upload): 914.66 MiB/s 147128 p/s`. If you see continuous bidirectional traffic during training, communication operations such as gradient synchronization are occurring between nodes. For instructions on how to analyze training performance, see .
ACK clusters provide the AI Profiling feature, which lets you analyze training performance at multiple levels, such as Torch, Python, and CUDA Kernel. You can use the profiling tool to analyze performance bottlenecks in the training job, including computation, communication, and memory usage.
Clean up resources
After the training job is complete, run the following command to delete the job:
This operation deletes all pods and resources related to the job but does not delete persistent data, such as TensorBoard logs.
```bash
arena delete pytorch-mnist -n default
```

Sample output:

```
INFO[0001] The training job pytorch-mnist has been deleted successfully
```

FAQ
Why doesn't the log show NET/IB : Using erdma_0?
Possible reasons include the following:
- The `--device=aliyun/erdma=1` parameter was not added to the submission command.
- The node does not support eRDMA, or the ACK eRDMA Controller component is not installed correctly.
- The container image is missing RDMA user mode libraries, such as `libibverbs` and `librdmacm`.
Solutions:
- Run `kubectl get node <node_name> -o jsonpath='{.status.allocatable}'` to confirm that the node has the `aliyun/erdma` resource.
- Log on to the node and run `ibv_devinfo` to check whether the eRDMA device is functioning correctly.
- Verify that the container image contains the necessary RDMA libraries.
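To check the RDMA user mode libraries from inside a running training container, you can run a quick lookup. This sketch assumes the Master pod name from the example in this topic; adjust the pod name as needed.

```bash
# Check whether the RDMA user mode libraries are present inside the container
kubectl exec pytorch-mnist-master-0 -- sh -c 'ldconfig -p | grep -E "libibverbs|librdmacm"'
```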
What should I do if an NCCL timeout error occurs during training?
NCCL timeouts are usually caused by network instability or incorrect configuration. To resolve this issue, you can try the following:
- Increase the timeout period: set `--env NCCL_IB_TIMEOUT=23` (the default is 22).
- Increase the number of retries: set `--env NCCL_IB_RETRY_CNT=10`.
- Check the network connectivity between nodes and the status of the eRDMA devices.
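For example, the two variables can be added to the submission command from Step 3. The following command is a condensed sketch; in practice, keep all the flags that you used in Step 3 and only add the two NCCL variables.

```bash
arena submit pytorch \
    --name=pytorch-mnist \
    --workers=2 \
    --gpus=1 \
    --device=aliyun/erdma=1 \
    --env NCCL_SOCKET_IFNAME=eth0 \
    --env NCCL_IB_TIMEOUT=23 \
    --env NCCL_IB_RETRY_CNT=10 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example \
    "torchrun /workspace/arena/examples/pytorch/mnist/main.py --backend nccl --data /workspace"
```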
How can I determine if eRDMA has improved training performance?
You can compare the performance by following these steps:
1. Submit two training jobs: one that uses eRDMA and one that does not. You can differentiate them by adding or omitting the `--device=aliyun/erdma=1` parameter.
2. Compare the completion time for the same number of training steps.
3. Use the AI Profiling tool to analyze the proportion of time spent on communication.
Typically, the performance improvement from eRDMA is most significant in scenarios with multiple nodes (more than two) and large models (billions of parameters or more).
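For example, the baseline (non-eRDMA) job can reuse the command from Step 3 with a different name and without the `--device` flag. The following is a condensed sketch (the job name pytorch-mnist-tcp is only an example); keep the other flags identical to the eRDMA job so that the comparison is fair.

```bash
arena submit pytorch \
    --name=pytorch-mnist-tcp \
    --workers=2 \
    --gpus=1 \
    --env NCCL_SOCKET_IFNAME=eth0 \
    --env NCCL_DEBUG=INFO \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/pytorch:mnist-example \
    "torchrun /workspace/arena/examples/pytorch/mnist/main.py --backend nccl --data /workspace"
```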
What should I do if an Arena job submission fails with an insufficient resources error?
Possible reasons include the following:
- Insufficient GPU or eRDMA resources.
- The node does not meet the scheduling conditions, such as node selectors or taint tolerations.
Solutions:
- Run `kubectl get nodes` to view the node status and available resources.
- Run `kubectl describe pod <pod_name>` to view the detailed reason for the pod scheduling failure.
- Adjust the resource requests or add more nodes to meet the requirements.
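To see why scheduling failed and how much of each node's capacity is already allocated, the following commands are usually enough. They only use standard kubectl output; replace the placeholders with your pod and node names.

```bash
# Scheduling events of the pending pod
kubectl describe pod <pod_name> | grep -A 10 "Events"
# Resources already allocated on a node, including GPU and eRDMA resources
kubectl describe node <node_name> | grep -A 10 "Allocated resources"
```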
Core job parameters
| Parameter | Description | Recommended configuration |
| --- | --- | --- |
| `--name` | The job name, which must be unique within the cluster. | Use a business-specific name. |
| `--namespace` | The namespace to which the job belongs. | Isolate by team or project. |
| `--workers` | The number of worker nodes, including the Master node. | Set to 2 for one Master and one worker. |
| `--gpus` | The number of GPUs used by each worker. | Set based on the model size and GPU memory requirements. |
| `--device=aliyun/erdma=1` | The key parameter to enable eRDMA. It allocates one eRDMA resource to each worker. | Must be set to 1 to enable eRDMA. |
| `--nproc-per-node` | The number of training processes to start on each node. | Usually set to the same value as `--gpus`. |
| `--clean-task-policy` | The pod cleanup policy. | Set to `None` to retain the pods after the job finishes. |
| `--env` | Environment variable. | Used to configure NCCL communication parameters. |
NCCL environment variable configuration (for optimizing distributed communication):
| Environment variable | Description | Recommended configuration |
| --- | --- | --- |
| `NCCL_SOCKET_IFNAME` | Specifies the network interface card for communication. | Set to `eth0`. |
| `NCCL_NET_GDR_LEVEL` | The GPU Direct RDMA policy level (0-5). | In eRDMA scenarios, set this to `1`. |
| `NCCL_P2P_LEVEL` | The peer-to-peer communication policy (0-5). | Set to `5`. |
| `NCCL_ALGO` | The collective communication algorithm. | Options include `Ring` and `Tree`. |
| `NCCL_DEBUG` | The log level. | Set to `INFO` to view detailed communication logs. |
| `NCCL_IB_HCA` | Specifies the RDMA network interface card to use. | Use the eRDMA device, such as `erdma_0`. |
| `NCCL_IB_TIMEOUT` | The InfiniBand (IB) communication timeout period. | Adjust based on network latency. |
| `NCCL_IB_RETRY_CNT` | The number of retries for failed IB communication. | You can increase this value for unstable networks. |
For more information about NCCL environment variables, see the official NCCL documentation.
How do I build a container image that supports eRDMA for a production environment?
To use eRDMA, you must install the RDMA user mode libraries and the eRDMA user-space driver in the container. The following Dockerfile example is based on the official PyTorch image and integrates the eRDMA user mode libraries and the MNIST training example.
Build a production container image that supports eRDMA. For more information, see Enable eRDMA in a container (Docker).
```dockerfile
ARG PYTORCH_IMAGE=docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
FROM docker.io/library/alpine:3.22.1 AS downloader
WORKDIR /workspace
RUN apk add git wget gzip
RUN mkdir -p /workspace/MNIST/raw && \
cat <<EOF > /workspace/MNIST/raw/checksums.md5
f68b3c2dcbeaaa9fbdd348bbdeb94873 train-images-idx3-ubyte.gz
d53e105ee54ea40749a09fcbcd1e9432 train-labels-idx1-ubyte.gz
9fb629c4189551a2d022fa330f9573f3 t10k-images-idx3-ubyte.gz
ec29112dd5afa0611ce80d1b7f02629c t10k-labels-idx1-ubyte.gz
EOF
RUN cd /workspace/MNIST/raw && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz && \
wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz && \
md5sum -c checksums.md5 && \
rm checksums.md5 && \
gzip -d *.gz
RUN git clone https://github.com/kubeflow/arena.git -b v0.15.2
FROM ${PYTORCH_IMAGE}
WORKDIR /workspace
COPY --from=downloader /workspace .
RUN set -eux && \
apt update && \
apt install -y wget gpg && \
wget -qO - https://mirrors.aliyun.com/erdma/GPGKEY | gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg && \
echo "deb [ ] https://mirrors.aliyun.com/erdma/apt/ubuntu jammy/erdma main" > /etc/apt/sources.list.d/erdma.list && \
apt update && \
apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1 && \
pip install --no-cache-dir -r /workspace/arena/examples/pytorch/mnist/requirements.txt && \
rm -rf /var/lib/apt/lists/*
```
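After the image is built, push it to your own image repository and reference it in the `--image` parameter of the `arena submit` command. The registry address below is a placeholder; replace it with your own repository.

```bash
# Build and push the production image (replace the registry and repository with your own)
docker build -t <your-registry>/<namespace>/pytorch-erdma:mnist-example .
docker push <your-registry>/<namespace>/pytorch-erdma:mnist-example
```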