
Container Compute Service: Run applications using high-performance RDMA networks

Last Updated: Mar 26, 2026

Multi-node GPU workloads demand network throughput and latency that TCP/IP cannot reliably deliver at scale. ACS (Alibaba Cloud Container Compute Service) provides an RDMA-enabled high-performance network that reduces communication overhead for distributed training and inference. Add a single pod label to enable RDMA, configure the NCCL environment variables for your GPU model, and the pod is ready.

How it works

TCP/IP introduces overhead that compounds as the number of GPUs grows: data copying between kernel and user space, complex protocol stack processing, flow control algorithms, and frequent context switching.

RDMA (Remote Direct Memory Access) bypasses these costs using zero-copy transfers and kernel bypass. The result is lower latency, higher throughput, and lower CPU usage—benefits that are most significant in multi-node GPU collective operations.

To enable RDMA on a pod in ACS, add the label alibabacloud.com/hpn-type: "rdma" to the pod spec. ACS automatically attaches an RDMA network interface card (NIC) to the container. Configure the NCCL environment variables for your GPU model, and the pod is ready for distributed training or inference.
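For example, enabling RDMA requires only this label in the pod template (a minimal fragment; a full Deployment manifest appears later in this topic):

```yaml
metadata:
  labels:
    # ACS attaches an RDMA NIC (hpn0) to containers in pods with this label.
    alibabacloud.com/hpn-type: "rdma"
```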

GPU models that support RDMA

ACS supports multiple GPU models. The following table lists the models that support RDMA and their constraints.

GPU model   compute-class   RDMA constraint                     RDMA NIC type
GU8TF       gpu             8-card pods only                    Type 1
GU8TEF      gpu             8-card pods only                    Type 1
GX8SF       gpu             8-card pods only                    Type 1
P16EN       gpu             16-card pods only                   Type 2
P16EN       gpu-hpn         1-, 2-, 4-, 8-, and 16-card pods    Type 2

The RDMA NIC type determines which NCCL environment variables you need to configure.

NCCL configuration

ACS GPUs use two RDMA NIC types, each requiring different NCCL settings.

Type 1 NIC (GU8TF, GU8TEF, GX8SF)

export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=mlx5
export NCCL_DEBUG=INFO

Type 2 NIC (P16EN)

export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=
export NCCL_DEBUG=INFO
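The two variants above can be wrapped in a small helper so that a launch script picks the correct settings from the GPU model. This is a sketch, not an ACS-provided tool; `nccl_env_for_model` is a hypothetical name:

```shell
#!/bin/sh
# Sketch: print the NCCL settings for a given GPU model, based on the
# RDMA NIC type table above. nccl_env_for_model is a hypothetical helper.
nccl_env_for_model() {
  case "$1" in
    GU8TF|GU8TEF|GX8SF)
      # Type 1 NIC: RDMA over IB/RoCE through the mlx5 HCA.
      echo 'export NCCL_SOCKET_IFNAME=eth0'
      echo 'export NCCL_IB_HCA=mlx5'
      echo 'export NCCL_DEBUG=INFO'
      ;;
    P16EN)
      # Type 2 NIC: IB/RoCE disabled, NCCL_IB_HCA left blank.
      echo 'export NCCL_IB_DISABLE=1'
      echo 'export NCCL_SOCKET_IFNAME=eth0'
      echo 'export NCCL_IB_HCA='
      echo 'export NCCL_DEBUG=INFO'
      ;;
    *)
      echo "unsupported GPU model: $1" >&2
      return 1
      ;;
  esac
}

# Example: print the Type 1 settings.
nccl_env_for_model GU8TF
```

In a launch script you would typically `eval "$(nccl_env_for_model GU8TF)"` before starting the training process.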

Environment variables

Variable             Description
NCCL_SOCKET_IFNAME   The network interface that NCCL uses to establish connections. Use eth0 in ACS.
NCCL_IB_HCA          The Host Channel Adapter (HCA) interfaces to use for RDMA communication. Set to mlx5 for Type 1 NICs. For P16EN (Type 2), leave this blank.
NCCL_IB_DISABLE      Whether to disable the IB/RoCE transport and use IP sockets instead. A value of 1 disables IB/RoCE. Set to 1 for P16EN.
NCCL_DEBUG           The output level of NCCL debug logs.

For other NCCL environment variables, see the NCCL documentation.

Deploy a GPU pod with RDMA

Prerequisites

Before you begin, ensure that you have:

  - An ACS cluster, with a kubectl client configured to connect to it.
  - A GPU model that supports RDMA with the card count you plan to request (see the table above).

Deploy the pod

  1. Create a file named dep-demo-hpn-gpu.yaml with the following content. The example uses GU8TF (Type 1 NIC). Adjust alibabacloud.com/gpu-model-series and the NCCL variables for your GPU model.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: dep-demo-hpn-gpu
      labels:
        app: demo-hpn-gpu
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: demo-hpn-gpu
      template:
        metadata:
          labels:
            app: demo-hpn-gpu
            alibabacloud.com/compute-class: gpu
            alibabacloud.com/compute-qos: default
            # Specify the GPU model. Change this value as needed.
            alibabacloud.com/gpu-model-series: "GU8TF"
            alibabacloud.com/hpn-type: "rdma"
        spec:
          containers:
          - name: demo
            image: registry-cn-wulanchabu.ack.aliyuncs.com/acs/stress:v1.0.4
            command:
            - "sleep"
            - "1000h"
            env:
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
            - name: NCCL_IB_HCA
              value: "mlx5"
            - name: NCCL_DEBUG
              value: "INFO"
            resources:
              requests:
                cpu: 128
                memory: 512Gi
                nvidia.com/gpu: 8
              limits:
                cpu: 128
                memory: 512Gi
                nvidia.com/gpu: 8
  2. Apply the manifest.

    kubectl apply -f dep-demo-hpn-gpu.yaml
  3. Wait for the pod to reach the Running state.

    kubectl get pod | grep dep-demo-hpn-gpu

    The pod is ready when the output shows 1/1 Running:

    dep-demo-hpn-gpu-5d9xxxxxb6-xxxxx   1/1     Running   0          25m

Verify RDMA is active

Check that the RDMA NIC was attached to the container:

kubectl exec -it deploy/dep-demo-hpn-gpu -- ifconfig | grep hpn -A 8

The output shows the hpn0 interface with UP BROADCAST RUNNING MULTICAST, confirming the RDMA NIC is active:

hpn0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet6 addr: xxxx::x:xxxx:xxxx:xxx/xx Scope:Link
          inet6 addr: xxxx:xxx:xxx:x:x:xxxx:x:xxx/xxx Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:xxxx  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:xx errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:x (892.0 B)

If hpn0 does not appear, check that alibabacloud.com/hpn-type: "rdma" is present in the pod labels and that the GPU model supports RDMA with the requested card count.
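This check can also be scripted. The sketch below greps an ifconfig dump for an active hpn0 interface; the canned sample text here stands in for the live output of the kubectl exec command above:

```shell
#!/bin/sh
# Sketch: verify that hpn0 exists and is running in an ifconfig dump.
# ifconfig_out is a canned sample standing in for the real output of:
#   kubectl exec -it deploy/dep-demo-hpn-gpu -- ifconfig
ifconfig_out='hpn0      Link encap:Ethernet  HWaddr 00:00:00:00:00:00
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1'

if printf '%s\n' "$ifconfig_out" | grep -q '^hpn0'; then
  if printf '%s\n' "$ifconfig_out" | grep -q 'UP BROADCAST RUNNING'; then
    echo 'hpn0 is active'
  else
    echo 'hpn0 exists but is not running'
  fi
else
  echo 'hpn0 not found: check the hpn-type label and GPU model constraints'
fi
```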