This document describes how to configure and use Remote Direct Memory Access (RDMA) in an ACK Lingjun cluster for high-performance container networking. RDMA reduces network latency and increases throughput. It is ideal for demanding scenarios such as high-performance computing (HPC), AI training, and distributed storage.
Introduction to RDMA
RDMA was created to address data processing latency on servers during network transfers. It moves data directly from the memory of one computer to another without involving the operating system of either computer. This enables high-throughput, low-latency networking, which is ideal for large-scale parallel computing clusters. By bypassing the operating system, RDMA eliminates the overhead from data copies and context switches. This conserves memory bandwidth and CPU cycles, and improves application performance.
Use RDMA on ACK Lingjun nodes
Confirm that the RDMA Device Plugin is running on each RDMA-enabled Lingjun node.
    kubectl get ds ack-rdma-dp-ds -n kube-system

    NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    ack-rdma-dp-ds   2         2         2       2            2           <none>          xxh

Verify that the node has the rdma/hca resource.

    kubectl get node e01-cn-xxxx -oyaml

    ...
      allocatable:
        cpu: 189280m
        ephemeral-storage: "3401372677838"
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 2063229768Ki
        nvidia.com/gpu: "8"
        pods: "64"
        rdma/hca: 1k
      capacity:
        cpu: "192"
        ephemeral-storage: 3690725568Ki
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 2112881480Ki
        nvidia.com/gpu: "8"
        pods: "64"
        rdma/hca: 1k
    ...

Apply the following YAML file to request the rdma/hca resource for the pod.
Set the request for rdma/hca to 1.
Verify that hostNetwork: true is set. Pods must use the host network to access the RDMA feature on a Lingjun node.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: hps-benchmark
    spec:
      parallelism: 1
      template:
        spec:
          containers:
          - name: hps-benchmark
            image: <YOUR_IMAGE> # Replace with your actual registry address
            command:
            - sh
            - -c
            - |
              python /workspace/wdl_8gpu_outbrain.py
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/hca: 1
            workingDir: /root
            volumeMounts:
            - name: shm
              mountPath: /dev/shm
          restartPolicy: Never
          volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
          hostNetwork: true
          tolerations:
          - operator: Exists
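The two manual checks in the steps above (the node advertises rdma/hca, and the pod both requests it and uses the host network) could also be scripted. The following is a minimal Python sketch, not part of the ACK tooling: the helper names parse_count, node_has_rdma, and pod_spec_problems are hypothetical, and the quantity parser handles only plain integers and the decimal suffixes k, M, and G, which is enough for values such as the 1k shown above.

```python
import json

# Decimal suffixes used by Kubernetes quantities; only the subset needed here (assumption).
_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_count(quantity):
    """Convert a count-style Kubernetes quantity such as "8" or "1k" to an int."""
    text = str(quantity)
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def node_has_rdma(node, resource="rdma/hca"):
    """Return True if the node object advertises at least one allocatable unit."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return parse_count(allocatable.get(resource, "0")) > 0


def pod_spec_problems(pod_spec, resource="rdma/hca"):
    """List the RDMA-related settings missing from a pod spec (empty if none)."""
    problems = []
    if pod_spec.get("hostNetwork") is not True:
        problems.append("hostNetwork is not true")
    limits = [
        c.get("resources", {}).get("limits", {}).get(resource)
        for c in pod_spec.get("containers", [])
    ]
    if not any(v is not None and parse_count(v) >= 1 for v in limits):
        problems.append(f"no container sets a {resource} limit")
    return problems


if __name__ == "__main__":
    # Trimmed stand-ins for a node object in JSON form and for the Job's pod spec.
    node = json.loads(
        '{"status": {"allocatable": {"nvidia.com/gpu": "8", "rdma/hca": "1k"}}}'
    )
    pod_spec = {
        "hostNetwork": True,
        "containers": [
            {
                "name": "hps-benchmark",
                "resources": {"limits": {"nvidia.com/gpu": 8, "rdma/hca": 1}},
            }
        ],
    }
    print(node_has_rdma(node))
    print(pod_spec_problems(pod_spec))
```

To run the check against a real node, the sample dictionary could be replaced with the parsed output of kubectl get node <node-name> -o json; the pod spec corresponds to spec.template.spec of the Job manifest.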