Container Service for Kubernetes: Use RDMA networks for pods on Node Lingjun

Last Updated: Mar 26, 2026

Remote Direct Memory Access (RDMA) transfers data directly between the memory of two machines, bypassing the operating system on both sides. This eliminates memory replication and context switching overhead, reducing CPU usage and improving network throughput and latency for AI training, high-performance computing (HPC), and distributed storage workloads.

This guide explains how to install the RDMA device plugin on Node Lingjun and deploy pods that use RDMA.

Choose your networking mode

Before you start, decide which networking mode your pods will use. This determines whether you need IPv6 on the underlying Lingjun bare metal cluster.

  • hostNetwork mode: The pod shares the host node's network stack. IPv6 is not required on the Lingjun bare metal cluster.

  • non-hostNetwork mode: The pod has its own IP address. IPv6 is required on the Lingjun bare metal cluster.

If your pods use non-hostNetwork mode, the Lingjun bare metal cluster that hosts Node Lingjun must be configured with IPv6. Select IPv6 mode when creating that cluster. To enable this, submit a ticket to contact the Lingjun team.
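For reference, the mode is controlled by the hostNetwork field in the pod spec. The minimal pod sketch below (the pod name and image are placeholders, not part of this guide's example) shows where the field goes; omitting the field leaves the pod in non-hostNetwork mode because it defaults to false:

apiVersion: v1
kind: Pod
metadata:
  name: network-mode-example               # placeholder name for illustration
spec:
  hostNetwork: true                        # hostNetwork mode; omit this field
                                           # (default false) for non-hostNetwork mode
  containers:
  - name: app
    image: registry.example.com/app:latest # placeholder image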

How it works

The ACK RDMA device plugin (ack-rdma-device-plugin) runs as a DaemonSet on each RDMA-enabled Node Lingjun. It registers the rdma/hca extended resource with Kubernetes so that pods can request RDMA access through standard resource limits.
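As a quick way to see this registration, you can list the rdma/hca quantity that each node advertises. This is a generic kubectl query, not specific to the plugin; nodes that do not run the plugin typically show an empty value:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.rdma/hca}{"\n"}{end}'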

Prerequisites

Before you begin, make sure you have:

  • An ACK managed cluster Pro with at least one Node Lingjun that has RDMA hardware

  • (For non-hostNetwork mode) A Lingjun bare metal cluster configured with IPv6

Install the RDMA device plugin

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.

  2. On the Add-ons page, click the Others tab. Find ack-rdma-device-plugin and install it. Configure the following parameter during installation:

    Important

    If you enable RDMA for non-hostNetwork mode but the Lingjun bare metal cluster does not use IPv6, the RDMA configuration does not take effect.

    Parameter: Enable RDMA for non-hostNetwork
    Description: Controls which pods can use RDMA. Set to False (cleared) to restrict RDMA to pods in hostNetwork mode. Set to True (selected) to also allow pods in non-hostNetwork mode, which requires the Lingjun bare metal cluster to use IPv6.

Verify the setup

After installation, run the following checks to confirm that RDMA is available on your nodes.

Check that the DaemonSet is running

kubectl get ds ack-rdma-dp-ds -n kube-system

Expected output:

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ack-rdma-dp-ds   2         2         2       2            2           <none>          xxh

All RDMA-enabled nodes appear in the DESIRED, CURRENT, and READY counts. If READY is lower than DESIRED, check the DaemonSet pod logs for errors.
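To find the plugin pods and inspect their logs, you can run the following commands. The pod name in the second command is a placeholder; copy the actual name from the first command's output:

kubectl get pods -n kube-system -o wide | grep ack-rdma-dp
kubectl logs -n kube-system <ack-rdma-dp-ds-pod-name>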

Check that nodes expose the RDMA resource

kubectl get node e01-cn-xxxx -o yaml

Look for rdma/hca under both allocatable and capacity. A value of 1k (the Kubernetes quantity notation for 1,000) confirms that RDMA resources are registered and available for pods to request.

Expected output:

...
  allocatable:
    cpu: 189280m
    ephemeral-storage: "3401372677838"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 2063229768Ki
    nvidia.com/gpu: "8"
    pods: "64"
    rdma/hca: 1k
  capacity:
    cpu: "192"
    ephemeral-storage: 3690725568Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 2112881480Ki
    nvidia.com/gpu: "8"
    pods: "64"
    rdma/hca: 1k
...
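If you only need the rdma/hca lines rather than the full node YAML, a grep over kubectl describe node shows the same information (the node name is the same placeholder as above):

kubectl describe node e01-cn-xxxx | grep rdma/hca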

Deploy a pod with RDMA

To give a pod access to RDMA, request the rdma/hca resource in its limits. A request of rdma/hca: 1 is sufficient.

Important

The pod requires two Linux capabilities (IPC_LOCK and SYS_RESOURCE) and a shared memory volume at /dev/shm. IPC_LOCK allows memory locking for zero-copy transfers. SYS_RESOURCE allows the process to set resource limits. Without these, RDMA operations will fail or perform poorly.

The following Job manifest shows a complete RDMA configuration:

apiVersion: batch/v1
kind: Job
metadata:
  name: hps-benchmark
spec:
  parallelism: 1
  template:
    spec:
      containers:
      - name: hps-benchmark
        image: **            # Replace ** with your container image
        command:
        - sh
        - -c
        - |
          python /workspace/wdl_8gpu_outbrain.py
        resources:
          limits:
            nvidia.com/gpu: 8
            rdma/hca: 1        # Request one RDMA HCA device
        workingDir: /root
        volumeMounts:
          - name: shm
            mountPath: /dev/shm  # Shared memory required for RDMA operations
        securityContext:
          capabilities:
            add:
            - SYS_RESOURCE       # Allows the process to set resource limits (required for RDMA)
            - IPC_LOCK           # Allows memory locking (required for zero-copy RDMA transfers)
      restartPolicy: Never
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
      hostNetwork: true          # Remove this line if using non-hostNetwork mode with IPv6
      tolerations:
        - operator: Exists

If you enabled RDMA for non-hostNetwork mode, remove the hostNetwork: true line. Pods without that field use non-hostNetwork mode and can still access RDMA through the plugin.
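Once the Job's pod is running, you can optionally confirm that the RDMA devices are visible from inside the container. The check below assumes the image ships the rdma-core user-space tools (it uses ibv_devinfo); it looks up the pod through the job-name label that Kubernetes adds to Job pods:

POD=$(kubectl get pods -l job-name=hps-benchmark -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$POD" -- ibv_devinfo

If the command lists one or more HCAs, the pod has access to the RDMA devices.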