
Container Service for Kubernetes: Use RDMA on ACK Lingjun pods

Last Updated: Dec 24, 2025

This document describes how to configure and use Remote Direct Memory Access (RDMA) in an ACK Lingjun cluster for high-performance container network communication. RDMA significantly reduces network latency and increases throughput, which makes it ideal for scenarios that demand high network performance, such as high-performance computing (HPC), AI training, and distributed storage.

Introduction to RDMA

Remote Direct Memory Access (RDMA) is a technology that reduces the latency caused by server-side data processing during network transfers. It moves data directly from the memory of one computer to the memory of another without involving either operating system, which enables high-throughput, low-latency network communication and makes RDMA ideal for large-scale parallel compute clusters. By bypassing the operating system, RDMA eliminates the overhead of extra memory copies and context switching. This process consumes few compute resources, saves memory bandwidth and CPU cycles, and improves application performance.
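For example, on a host with RDMA-capable network adapters, you can list the RDMA devices and inspect their attributes with the user-space utilities from the rdma-core package (a minimal check, assuming rdma-core is installed on the node):

    # ibv_devices
    # ibv_devinfo

ibv_devices prints each RDMA device and its node GUID, and ibv_devinfo prints detailed attributes such as the firmware version, ports, and link layer.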

Use RDMA on ACK Lingjun nodes

  1. Confirm that the RDMA Device Plugin is running correctly on each Lingjun node that is equipped with RDMA devices.

    # kubectl get ds ack-rdma-dp-ds -n kube-system
    NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    ack-rdma-dp-ds   2         2         2       2            2           <none>          xxh
  2. Check whether the node has the rdma/hca resource.

    # kubectl get node e01-cn-xxxx -oyaml
    ...
      allocatable:
        cpu: 189280m
        ephemeral-storage: "3401372677838"
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 2063229768Ki
        nvidia.com/gpu: "8"
        pods: "64"
        rdma/hca: 1k
      capacity:
        cpu: "192"
        ephemeral-storage: 3690725568Ki
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 2112881480Ki
        nvidia.com/gpu: "8"
        pods: "64"
        rdma/hca: 1k
    ...
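    To read the value directly, you can also query the allocatable resources with a JSONPath expression. The bracket notation is needed because the resource name contains a slash (a quick check, reusing the node name from the command above):

    # kubectl get node e01-cn-xxxx -o jsonpath="{.status.allocatable['rdma/hca']}"
    1k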
  3. Apply the following YAML configuration to request the rdma/hca resource for the pod.

    • Set the resource request to rdma/hca: 1.

    • Make sure that the pod specification sets hostNetwork: true. This setting is required for pods on a Lingjun node to use RDMA.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: hps-benchmark
    spec:
      parallelism: 1
      template:
        spec:
          containers:
          - name: hps-benchmark
            image: <YOUR_IMAGE> # Replace with the actual address of your image
            command:
            - sh
            - -c
            - |
              python /workspace/wdl_8gpu_outbrain.py
            resources:
              limits:
                nvidia.com/gpu: 8
                rdma/hca: 1
            workingDir: /root
            volumeMounts:
              - name: shm
                mountPath: /dev/shm
          restartPolicy: Never
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
          hostNetwork: true
          tolerations:
            - operator: Exists
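
After the Job is created, you can confirm that the pod is running and that the RDMA devices are visible inside the container. The following sketch assumes that the manifest above is saved as hps-benchmark.yaml and that the container image includes the rdma-core utilities:

    # kubectl apply -f hps-benchmark.yaml
    # kubectl get pods -l job-name=hps-benchmark
    # kubectl exec <POD_NAME> -- ibv_devinfo

If ibv_devinfo lists the expected host channel adapters (HCAs), RDMA is available to the pod.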