Remote Direct Memory Access (RDMA) transfers data directly between the memory of two machines, bypassing the operating system on both sides. This eliminates memory copies and context-switch overhead, reducing CPU usage and improving network throughput and latency for AI training, high-performance computing (HPC), and distributed storage workloads.
This guide explains how to install the RDMA device plugin on Node Lingjun and deploy pods that use RDMA.
## Choose your networking mode
Before you start, decide which networking mode your pods will use. This determines whether you need IPv6 on the underlying Lingjun bare metal cluster.
| Mode | Description | IPv6 required on Lingjun bare metal cluster |
|---|---|---|
| hostNetwork mode | The pod shares the host node's network stack | No |
| non-hostNetwork mode | The pod has its own IP address | Yes |
If your pods use non-hostNetwork mode, the Lingjun bare metal cluster that hosts Node Lingjun must be configured with IPv6. Select IPv6 mode when creating that cluster. To enable IPv6 mode, submit a ticket to contact the Lingjun team.
## How it works
The ACK RDMA device plugin (ack-rdma-device-plugin) runs as a DaemonSet on each RDMA-enabled Node Lingjun. It registers the rdma/hca extended resource with Kubernetes so that pods can request RDMA access through standard resource limits.
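Once the plugin has registered the resource, a pod opts in by setting a limit on rdma/hca, the same way it would request any other extended resource. A minimal sketch (the pod name and image below are placeholders, not values from this guide):

```yaml
# Minimal sketch: a pod requesting one RDMA HCA through the
# rdma/hca extended resource registered by ack-rdma-device-plugin.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test        # placeholder name
spec:
  containers:
  - name: app
    image: your-image    # placeholder image
    resources:
      limits:
        rdma/hca: 1      # scheduled only onto nodes that expose rdma/hca
```

The scheduler treats rdma/hca like any other countable resource, so pods that request it are placed only on RDMA-enabled nodes. A complete, production-oriented manifest (including the required capabilities and shared memory volume) is shown later in this guide.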
## Prerequisites
Before you begin, make sure you have:
- An ACK managed cluster Pro with at least one Node Lingjun that has RDMA hardware
- (For non-hostNetwork mode) A Lingjun bare metal cluster configured with IPv6
## Install the RDMA device plugin
1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.
2. On the Add-ons page, click the Others tab. Find ack-rdma-device-plugin and install it. Configure the following parameter during installation:

   Important: If you enable RDMA for non-hostNetwork mode but the Lingjun bare metal cluster does not use IPv6, the RDMA configuration does not take effect.

   | Parameter | Description |
   |---|---|
   | Enable RDMA for non-hostNetwork | Controls which pods can use RDMA. Set to False (cleared) to restrict RDMA to pods in hostNetwork mode. Set to True (selected) to also allow pods in non-hostNetwork mode; this requires the Lingjun bare metal cluster to use IPv6. |
## Verify the setup
After installation, run the following checks to confirm that RDMA is available on your nodes.
### Check that the DaemonSet is running
```shell
kubectl get ds ack-rdma-dp-ds -n kube-system
```
Expected output:
```
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ack-rdma-dp-ds   2         2         2       2            2           <none>          xxh
```
All RDMA-enabled nodes appear in the DESIRED, CURRENT, and READY counts. If READY is lower than DESIRED, check the DaemonSet pod logs for errors.
### Check that nodes expose the RDMA resource
```shell
kubectl get node e01-cn-xxxx -o yaml
```
Look for rdma/hca in both allocatable and capacity. A value of 1k confirms that RDMA resources are registered and available for pods to request.
Expected output:
```yaml
...
allocatable:
  cpu: 189280m
  ephemeral-storage: "3401372677838"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 2063229768Ki
  nvidia.com/gpu: "8"
  pods: "64"
  rdma/hca: 1k
capacity:
  cpu: "192"
  ephemeral-storage: 3690725568Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 2112881480Ki
  nvidia.com/gpu: "8"
  pods: "64"
  rdma/hca: 1k
...
```
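Instead of scanning the full node object, you can filter the output for just the rdma/hca entries (e01-cn-xxxx is the placeholder node name used earlier in this guide):

```shell
# Print only the rdma/hca lines from the node object;
# one match under allocatable and one under capacity, both with
# the value 1k, confirm that the resource is registered.
kubectl get node e01-cn-xxxx -o yaml | grep 'rdma/hca'
```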
## Deploy a pod with RDMA
To give a pod access to RDMA, request the rdma/hca resource in its limits. A request of rdma/hca: 1 is sufficient.
The pod requires two Linux capabilities (IPC_LOCK and SYS_RESOURCE) and a shared memory volume at /dev/shm. IPC_LOCK allows memory locking for zero-copy transfers. SYS_RESOURCE allows the process to set resource limits. Without these, RDMA operations will fail or perform poorly.
The following Job manifest shows a complete RDMA configuration:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hps-benchmark
spec:
  parallelism: 1
  template:
    spec:
      containers:
      - name: hps-benchmark
        image: **
        command:
        - sh
        - -c
        - |
          python /workspace/wdl_8gpu_outbrain.py
        resources:
          limits:
            nvidia.com/gpu: 8
            rdma/hca: 1            # Request one RDMA HCA device
        workingDir: /root
        volumeMounts:
        - name: shm
          mountPath: /dev/shm      # Shared memory required for RDMA operations
        securityContext:
          capabilities:
            add:
            - SYS_RESOURCE         # Allows the process to set resource limits (required for RDMA)
            - IPC_LOCK             # Allows memory locking (required for zero-copy RDMA transfers)
      restartPolicy: Never
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 8Gi
      hostNetwork: true            # Remove this line if using non-hostNetwork mode with IPv6
      tolerations:
      - operator: Exists
```
If you enabled RDMA for non-hostNetwork mode, remove the hostNetwork: true line. Pods without that field use non-hostNetwork mode and can still access RDMA through the plugin.
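To confirm that RDMA devices are actually visible inside a running pod, you can exec into it and query the host channel adapters with the rdma-core userspace tools. This is a sketch: it assumes the container image ships ibv_devices and ibv_devinfo (many AI and HPC images do; if yours does not, install rdma-core or use an image that includes it), and <pod-name> is a placeholder.

```shell
# List the RDMA devices visible inside the pod
# (<pod-name> is a placeholder; requires rdma-core tools in the image).
kubectl exec -it <pod-name> -- ibv_devices

# Check the port state; a state of PORT_ACTIVE indicates a usable RDMA link.
kubectl exec -it <pod-name> -- ibv_devinfo | grep 'state:'
```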