
Container Service for Kubernetes: Run applications with high-performance RDMA networks

Last Updated: Apr 27, 2025

In large-scale AI computing applications, communication efficiency between tasks must be optimized to fully leverage GPU computing power. The integration of ACK One registered clusters with Alibaba Cloud Container Compute Service (ACS) enables a low-latency and high-throughput Remote Direct Memory Access (RDMA) network service. This topic demonstrates how to deploy applications using this high-performance RDMA network.

Introduction

TCP/IP is the mainstream network communication protocol used by many applications. With the emergence of AI technologies, a growing number of applications demand higher network performance.

However, TCP/IP has the following limitations:

  • Complex protocol stack and traffic control algorithms

  • High overhead from data copies

  • Frequent context switches

As a result, TCP/IP network performance has become a bottleneck that limits these applications.

RDMA addresses the preceding issues. Compared with TCP/IP, RDMA provides zero-copy data transfer and kernel bypass, which eliminate data copies and frequent context switches. This reduces latency and CPU usage and increases throughput.

ACS allows you to run an application on an RDMA network by adding the following label to the pod template in your YAML file:

...
labels:
  alibabacloud.com/hpn-type: "rdma"
...
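
After you deploy a workload with this label (see the procedure below), the label also appears on the pods created from the pod template. As an illustrative check that is not part of the official procedure, you can list all pods that requested an RDMA network by filtering on the label:

    # List pods that requested an RDMA network through the hpn-type label.
    kubectl get pods -l alibabacloud.com/hpn-type=rdma -o wide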

GPU models that support RDMA

ACS offers multiple GPU options. If you require high-performance RDMA network capabilities, deploy your application on the 8th-generation GPU A card type. To verify compatibility with other GPU models, submit a ticket to contact support.

Prerequisites

Procedure

  1. Create a file named dep-demo-hpn-gpu.yaml and add the following to it:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: dep-demo-hpn-gpu
      labels:
        app: demo-hpn-gpu
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: demo-hpn-gpu
      template:
        metadata:
          labels:
            app: demo-hpn-gpu
            alibabacloud.com/acs: "true" # Use the compute power of ACS
            alibabacloud.com/compute-class: gpu # Use the GPU compute class
            alibabacloud.com/compute-qos: default # Use the default QoS class
            # Set the GPU model to example-model. The value is for reference only.
            alibabacloud.com/gpu-model-series: "example-model"
            alibabacloud.com/hpn-type: "rdma" # Request a high-performance RDMA network
        spec:
          containers:
          - name: demo
            image: registry.cn-wulanchabu.aliyuncs.com/acs/stress:v1.0.4
            command:
            - "sleep"
            - "1000h"
            resources:
              requests:
                cpu: 128
                memory: 512Gi
                nvidia.com/gpu: 8
              limits:
                cpu: 128
                memory: 512Gi
                nvidia.com/gpu: 8
  2. Run the following command to deploy the application:

    kubectl apply -f dep-demo-hpn-gpu.yaml
  3. Run the following command to view the RDMA network interface card (NIC) information. Replace dep-demo-hpn-gpu-xxxxx-xxx with the actual pod name:

    kubectl exec -it dep-demo-hpn-gpu-xxxxx-xxx -- ifconfig | grep hpn -A 8

    Expected results:

    hpn0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx  
              inet6 addr: xxxx::x:xxxx:xxxx:xxx/xx Scope:Link
              inet6 addr: xxxx:xxx:xxx:x:x:xxxx:x:xxx/xxx Scope:Global
              UP BROADCAST RUNNING MULTICAST  MTU:xxxx  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:xx errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:0 (0.0 B)  TX bytes:x (892.0 B)

    The output shows an hpn0 interface, which indicates that the pod is configured with an RDMA NIC.
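
Optionally, you can also confirm that the RDMA (verbs) device itself is visible inside the pod. The following sketch uses the ibv_devinfo utility from the rdma-core package and assumes that the utility is installed in the container image; replace the pod name placeholder with the actual pod name:

    # List the RDMA (verbs) devices visible inside the pod.
    kubectl exec -it dep-demo-hpn-gpu-xxxxx-xxx -- ibv_devinfo

If at least one device is listed, RDMA-aware libraries such as NCCL can use the verbs interface directly.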