Container Service for Kubernetes: Use RDMA on ACK Lingjun nodes

Last Updated: Dec 25, 2025

This document describes how to configure and use Remote Direct Memory Access (RDMA) in an ACK Lingjun cluster for high-performance container networking. RDMA reduces network latency and increases throughput, which makes it well suited to demanding workloads such as high-performance computing (HPC), AI training, and distributed storage.

Introduction to RDMA

RDMA was designed to reduce the server-side processing latency incurred during network transfers. It moves data directly from the memory of one computer to the memory of another without involving the operating system of either computer. Bypassing the operating system avoids the overhead of data copies and context switches, which conserves memory bandwidth and CPU cycles and improves application performance. The resulting high-throughput, low-latency networking is well suited to large-scale parallel computing clusters.

Use RDMA on ACK Lingjun nodes

  1. Confirm that the RDMA Device Plugin is running on each RDMA-enabled Lingjun node.

    kubectl get ds ack-rdma-dp-ds -n kube-system

    Expected output:

    NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    ack-rdma-dp-ds   2         2         2       2            2           <none>          xxh

  2. Verify that the node reports the rdma/hca resource. Replace e01-cn-xxxx with the name of your node.

    kubectl get node e01-cn-xxxx -o yaml
    ...
      allocatable:
        cpu: 189280m
        ephemeral-storage: "3401372677838"
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 2063229768Ki
        nvidia.com/gpu: "8"
        pods: "64"
        rdma/hca: 1k
      capacity:
        cpu: "192"
        ephemeral-storage: 3690725568Ki
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 2112881480Ki
        nvidia.com/gpu: "8"
        pods: "64"
        rdma/hca: 1k
    ...
  3. Apply the following YAML file to create a Job whose pod requests the rdma/hca resource.

    • Set rdma/hca to 1 under resources.limits.

    • Set hostNetwork: true. Pods must use the host network to access the RDMA feature on a Lingjun node.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: hps-benchmark
    spec:
      parallelism: 1
      template:
        spec:
          containers:
          - name: hps-benchmark
            image: <YOUR_IMAGE> # Replace with the address of your container image
            command:
            - sh
            - -c
            - |
              python /workspace/wdl_8gpu_outbrain.py
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/hca: 1 # Request the RDMA resource advertised by the node
            workingDir: /root
            volumeMounts:
              - name: shm
                mountPath: /dev/shm
          restartPolicy: Never
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
          hostNetwork: true # Required: pods reach RDMA on a Lingjun node through the host network
          tolerations:
            - operator: Exists # Tolerate all node taints
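
The checks in steps 1 to 3 can also be scripted. The following Python sketch is one possible way to do that and is not part of the ACK tooling. The helper names (parse_quantity, daemonset_ready, node_has_rdma, job_requests_rdma, kubectl_json) are hypothetical, and the sketch assumes that kubectl is on the PATH and configured for the cluster. Only kubectl_json contacts the cluster; the other functions work on plain dictionaries shaped like the output of kubectl with -o json.

```python
import json
import subprocess

# Decimal and binary suffixes of Kubernetes resource quantities.
# Fractional suffixes such as "m" (milli) are deliberately not handled here.
_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9,
             "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(value):
    """Convert an integer quantity such as 1, "8", or "1k" to an int."""
    text = str(value)
    # Check two-character suffixes first so "Gi" is not read as "G".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * _SUFFIXES[suffix]
    return int(text)


def daemonset_ready(ds):
    """Step 1: every scheduled device plugin pod is Ready."""
    status = ds.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    return desired > 0 and status.get("numberReady", 0) == desired


def node_has_rdma(node):
    """Step 2: the node advertises a non-zero rdma/hca allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return parse_quantity(allocatable.get("rdma/hca", "0")) > 0


def job_requests_rdma(job):
    """Step 3: the Job uses the host network and limits rdma/hca."""
    pod_spec = job["spec"]["template"]["spec"]
    uses_host_network = pod_spec.get("hostNetwork", False) is True
    requests_hca = any(
        parse_quantity(
            c.get("resources", {}).get("limits", {}).get("rdma/hca", 0)
        ) > 0
        for c in pod_spec.get("containers", [])
    )
    return uses_host_network and requests_hca


def kubectl_json(*args):
    """Run kubectl with -o json and return the parsed object.

    Requires cluster access; raises CalledProcessError if kubectl fails.
    """
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

With cluster access, daemonset_ready(kubectl_json("get", "ds", "ack-rdma-dp-ds", "-n", "kube-system")) covers step 1 and node_has_rdma(kubectl_json("get", "node", "e01-cn-xxxx")) covers step 2. The Job check takes the manifest as a dictionary, so it can run before the Job is applied.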