This document describes how to configure and use Remote Direct Memory Access (RDMA) in an ACK Lingjun cluster for high-performance container network communication. RDMA significantly reduces network latency and increases throughput, making it ideal for scenarios that demand high network performance, such as high-performance computing (HPC), AI training, and distributed storage.
Introduction to RDMA
Remote Direct Memory Access (RDMA) is a technology that reduces the latency caused by server-side data processing during network transfers. It moves data directly from the memory of one computer into the memory of another without involving either operating system, which enables high-throughput, low-latency network communication. This makes RDMA ideal for large-scale parallel computing clusters. By bypassing the operating system, RDMA eliminates the overhead of extra memory copies and context switching. As a result, it consumes few compute resources, saves memory bandwidth and CPU cycles, and improves application performance.
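As a quick illustration of what "RDMA-capable" means in practice, the user-space tools from the rdma-core and iproute2 packages can list the host channel adapters (HCAs) visible on a node. This is a hedged sketch, not part of the ACK setup steps: it assumes these packages are installed on the node and that at least one RDMA NIC is present, and the device name mlx5_0 is only an example.

```shell
# List RDMA-capable devices visible to the verbs library (rdma-core).
ibv_devices

# Show detailed attributes of one device, including port state and
# firmware version. Substitute the device name reported by ibv_devices
# for the example name mlx5_0.
ibv_devinfo -d mlx5_0

# Show RDMA link state via iproute2.
rdma link show
```

If `ibv_devices` prints no devices, the node either has no RDMA NIC or is missing the required drivers, and the device plugin will not advertise an `rdma/hca` resource for it.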
Use RDMA on ACK Lingjun nodes
Confirm that the RDMA Device Plugin is running correctly on each Lingjun node that is equipped with RDMA.
```
# kubectl get ds ack-rdma-dp-ds -n kube-system
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ack-rdma-dp-ds   2         2         2       2            2           <none>          xxh
```

Check whether the node has the rdma/hca resource.

```
# kubectl get node e01-cn-xxxx -oyaml
...
allocatable:
  cpu: 189280m
  ephemeral-storage: "3401372677838"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 2063229768Ki
  nvidia.com/gpu: "8"
  pods: "64"
  rdma/hca: 1k
capacity:
  cpu: "192"
  ephemeral-storage: 3690725568Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 2112881480Ki
  nvidia.com/gpu: "8"
  pods: "64"
  rdma/hca: 1k
...
```

Apply the following YAML configuration to request the rdma/hca resource for the pod:

- Set the resource request to rdma/hca: 1.
- Check that the pod has hostNetwork: true. This setting allows pods on a Lingjun node to use the RDMA feature.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hps-benchmark
spec:
  parallelism: 1
  template:
    spec:
      containers:
        - name: hps-benchmark
          image: <YOUR_IMAGE> # Replace with your actual registry address
          command:
            - sh
            - -c
            - |
              python /workspace/wdl_8gpu_outbrain.py
          resources:
            limits:
              nvidia.com/gpu: 8
              rdma/hca: 1
          workingDir: /root
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      restartPolicy: Never
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
      hostNetwork: true
      tolerations:
        - operator: Exists
```
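After saving the manifest, the Job can be submitted and checked with standard kubectl commands. This is a hedged sketch: the file name job.yaml is an assumption, the job-name label is the one the Kubernetes Job controller adds automatically to the pods it creates, and the ibv_devinfo check works only if the container image includes the rdma-core tools.

```shell
# Submit the Job (the file name job.yaml is an example).
kubectl apply -f job.yaml

# Find the pod created by the Job; the job-name label is added
# automatically by the Job controller.
kubectl get pods -l job-name=hps-benchmark -o wide

# Optional sanity check: if the image ships the rdma-core tools,
# verify that the RDMA device is visible inside the pod.
# Replace <POD_NAME> with the pod name from the previous command.
kubectl exec <POD_NAME> -- ibv_devinfo
```

Because the pod runs with hostNetwork: true, the devices reported inside the pod should match those visible on the Lingjun node itself.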