GPUDirect RDMA (GDR) is an NVIDIA technology for high-performance computing and deep learning. It allows GPUs to exchange data directly with other devices that support Remote Direct Memory Access (RDMA), such as other GPUs or accelerators, without involving the CPU. This topic describes how to run GDR applications on elastic RDMA (eRDMA) nodes in Container Service for Kubernetes (ACK) clusters.
Prerequisites
Arena is installed in hostNetwork mode. For more information, see Configure the Arena client.
A kubectl client is connected to the cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
ACK eRDMA Controller is installed and configured on nodes. For more information, see Use eRDMA to accelerate container networking.
Procedure
Use Arena to submit an MPI job that runs a PyTorch synthetic benchmark.
arena submit mpijob \
    --name=mpi-allreduce-sync-erdma \
    --device=aliyun/erdma=1 \
    -e NCCL_DEBUG=TRACE \
    -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
    -e OMPI_ALLOW_RUN_AS_ROOT=1 \
    --gpus=8 \
    --memory=16Gi \
    --hostNetwork true \
    --cpu=4 \
    --workers=2 \
    --image=registry.cn-beijing.aliyuncs.com/acs/horovod:0.28.1-tf2.9.2-torch1.12.1-py3.8-erdma \
    --toleration all \
    "mpirun -np 2 \
    --allow-run-as-root \
    --mca btl_tcp_if_include bond0 \
    --mca oob_tcp_if_include bond0 \
    --mca pml ob1 \
    --mca btl ^openib \
    python /examples/pytorch/pytorch_synthetic_benchmark.py"
Expected output:
iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO NET/IB : Using [0]rocep26s0:1/RoCE ; OOB eth0:192.168.8.128<0>
iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO Using network IB
iZ2zeg0kcgyxepyc5r63kgZ:18:27 [1] NCCL INFO NET/IB : Using [0]rocep26s0:1/RoCE ; OOB eth0:192.168.8.128<0>
iZ2zeg0kcgyxepyc5r63kgZ:18:27 [1] NCCL INFO Using network IB
The job log shows that the NVIDIA Collective Communication Library (NCCL) identifies the eRDMA device rocep26s0 during initialization. The device runs in RoCE mode, and NCCL uses the InfiniBand (IB) network transport over eRDMA for communication.
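To confirm which devices NCCL selected without reading the full trace, the NET/IB lines can be filtered out of the job log. The following is a minimal sketch, not part of the official procedure. The helper name nccl_ib_devices is hypothetical, and the sketch assumes that the log lines follow the format shown in the expected output. Retrieving the log with arena logs is also an assumption about your setup.

```shell
# Hypothetical helper: reads an NCCL log on stdin and prints the unique
# device names that appear on "NET/IB : Using" lines in RoCE mode.
# Assumes each device is logged as "[index]name:port/RoCE".
nccl_ib_devices() {
  awk '/NET\/IB : Using/ {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^\[[0-9]+\][^:]+:[0-9]+\/RoCE$/) {
        dev = $i
        sub(/^\[[0-9]+\]/, "", dev)   # drop the "[index]" prefix
        sub(/:.*/, "", dev)           # drop the ":port/RoCE" suffix
        print dev
      }
    }
  }' | sort -u
}

# Usage (assumption: the job log is available through "arena logs"):
#   arena logs mpi-allreduce-sync-erdma | nccl_ib_devices
```

If the function prints nothing, no line in the log matched this format, which suggests that NCCL did not select a RoCE device.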
Query information about eRDMA interfaces (ERIs) on the host.
$ ibv_devinfo
hca_id: rocep156s0
        transport:                      eRDMA
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe2c:b8f3
        sys_image_guid:                 0216:3eff:fe2c:b8f3
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                1024 (3)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: rocep26s0
        transport:                      eRDMA
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe10:f8b0
        sys_image_guid:                 0216:3eff:fe10:f8b0
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                1024 (3)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
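In the output above, only rocep26s0 has a port in the PORT_ACTIVE state, and it is the device that NCCL reports in the job log. On hosts with several ERIs, the active devices can be picked out with a small filter. This is a minimal sketch. The function name active_rdma_devices is hypothetical, and it assumes the hca_id: and state: field layout shown in the ibv_devinfo output.

```shell
# Hypothetical helper: reads ibv_devinfo output on stdin and prints the
# hca_id of every device that has a port in the PORT_ACTIVE state.
active_rdma_devices() {
  awk '
    $1 == "hca_id:" { dev = $2 }                          # remember the current device
    $1 == "state:" && $2 == "PORT_ACTIVE" { print dev }   # report it once per active port
  '
}

# Usage on the host:
#   ibv_devinfo | active_rdma_devices
```

A device with more than one active port is printed once per port. Append sort -u to the pipeline if unique names are needed.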
Use eadm to monitor eRDMA traffic on the host.
$ eadm stat -d rocep26s0 -l
Monitoring rocep26s0...    (press CTRL-C to stop)

15:59:56   rx:        0 B/s     0 p/s          tx:        0 B/s     0 p/s

 rocep26s0  /  traffic statistics

                           rx         |       tx
--------------------------------------+------------------
  bytes                 11.06 KiB     |    11.18 KiB
--------------------------------------+------------------
          max           15.43 KiB/s   |    15.10 KiB/s
      average            4.03 KiB/s   |     4.07 KiB/s
          min               0 B/s     |        0 B/s
--------------------------------------+------------------
  packets                 8406769     |      8546764
--------------------------------------+------------------
          max             38990 p/s   |      37488 p/s
      average              2988 p/s   |       3038 p/s
          min                 0 p/s   |          0 p/s
--------------------------------------+------------------
  time                  33.78 minutes
The preceding output indicates that eRDMA traffic is detected in real time.