In large-scale AI computing applications, communication efficiency between tasks must be optimized to fully leverage GPU computing power. The integration of ACK One registered clusters with Alibaba Cloud Container Compute Service (ACS) enables a low-latency and high-throughput Remote Direct Memory Access (RDMA) network service. This topic demonstrates how to deploy applications using this high-performance RDMA network.
Introduction
TCP/IP is the mainstream network communication protocol used by most applications. With the rise of AI workloads, a growing number of applications demand higher network performance.
However, TCP/IP has inherent limits:
- A complex protocol stack and congestion control algorithms
- High data copy overhead
- Frequent context switches
As a result, TCP/IP network performance has become a bottleneck for these applications.
RDMA helps address the preceding issues. Compared with TCP/IP, RDMA provides zero copy and kernel bypass, which eliminate redundant data copies and frequent context switches. This reduces latency and CPU usage while increasing throughput.
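The zero-copy idea can be illustrated in plain Python: slicing a `bytes` object allocates a new copy, while a `memoryview` slice references the same underlying buffer. This is loosely analogous to how RDMA writes directly into registered application memory instead of staging data through intermediate kernel buffers. (This is an analogy only, not RDMA code.)

```python
# Analogy only: bytes slicing copies data, while memoryview slicing
# gives zero-copy access to the same underlying buffer.
payload = bytes(range(256)) * 1024      # a 256 KiB buffer

copied = payload[0:4096]                # allocates a new bytes object (a copy)
view = memoryview(payload)[0:4096]      # references the original buffer (no copy)

print(view.obj is payload)              # True: the view shares payload's memory
print(bytes(view) == copied)            # True: same contents either way
```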
ACS lets you run an application on an RDMA network by adding the following label to the pod template in its YAML manifest:
```yaml
...
  labels:
    alibabacloud.com/hpn-type: "rdma"
...
```
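As a quick pre-deployment sanity check, you can verify that a pod template carries this label before applying it. A minimal Python sketch (the manifest dict and helper name are illustrative, not part of any ACS tooling):

```python
# Minimal sketch: check whether a pod template requests the ACS RDMA network.
RDMA_LABEL = "alibabacloud.com/hpn-type"

def uses_rdma(pod_template: dict) -> bool:
    """Return True if the pod template carries the RDMA label set to "rdma"."""
    labels = pod_template.get("metadata", {}).get("labels", {})
    return labels.get(RDMA_LABEL) == "rdma"

# Illustrative pod template fragment, not a complete Deployment spec.
template = {
    "metadata": {
        "labels": {
            "alibabacloud.com/acs": "true",
            "alibabacloud.com/hpn-type": "rdma",
        }
    }
}

print(uses_rdma(template))  # True when the label is present and set to "rdma"
```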
GPU models that support RDMA
ACS offers multiple GPU options. If you require high-performance RDMA network capabilities, deploy the 8th-generation GPU model. To verify compatibility with other GPU models, submit a ticket to contact technical support.
Prerequisites
An ACK One registered cluster is created and connected to a data center or a Kubernetes cluster of another cloud service provider (Kubernetes 1.24 or later is recommended).
The ACK virtual node component is installed and the version of the component is 2.13.0 or later. For more information, see Grant RAM permissions to ack-virtual-node and Install ack-virtual-node.
Procedure
Create a file named dep-demo-hpn-gpu.yaml and add the following content to it:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dep-demo-hpn-gpu
  labels:
    app: demo-hpn-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-hpn-gpu
  template:
    metadata:
      labels:
        app: demo-hpn-gpu
        alibabacloud.com/acs: "true"        # Use the compute power of ACS
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/compute-qos: default
        # Set the GPU model to example-model. The value is for reference only.
        alibabacloud.com/gpu-model-series: "example-model"
        alibabacloud.com/hpn-type: "rdma"
    spec:
      containers:
      - name: demo
        image: registry.cn-wulanchabu.aliyuncs.com/acs/stress:v1.0.4
        command:
        - "sleep"
        - "1000h"
        resources:
          requests:
            cpu: 128
            memory: 512Gi
            nvidia.com/gpu: 8
          limits:
            cpu: 128
            memory: 512Gi
            nvidia.com/gpu: 8
```

Run the following command to deploy the application:
```shell
kubectl apply -f dep-demo-hpn-gpu.yaml
```

Run the following command to view the network interface card (NIC) information of the RDMA network:
```shell
kubectl exec -it dep-demo-hpn-gpu-xxxxx-xxx -- ifconfig | grep hpn -A 8
```

Expected results:
```
hpn0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx
          inet6 addr: xxxx::x:xxxx:xxxx:xxx/xx Scope:Link
          inet6 addr: xxxx:xxx:xxx:x:x:xxxx:x:xxx/xxx Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:xxxx  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:xx errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:x (892.0 B)
```

The output indicates that the pod is configured with an RDMA NIC.
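If you want to automate this verification, you can scan the ifconfig output for hpn interfaces instead of reading it by eye. A small Python sketch (the sample output below is abbreviated and illustrative):

```python
import re

def find_hpn_nics(ifconfig_output: str) -> list[str]:
    """Return the names of RDMA (hpn*) interfaces found in ifconfig output."""
    return re.findall(r"^(hpn\d+)\b", ifconfig_output, flags=re.MULTILINE)

# Abbreviated, illustrative sample of the expected output shown above.
sample = (
    "eth0      Link encap:Ethernet  HWaddr 00:00:00:00:00:00\n"
    "hpn0      Link encap:Ethernet  HWaddr 00:00:00:00:00:00\n"
    "          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1\n"
)

print(find_hpn_nics(sample))  # ['hpn0']
```

An empty list means no RDMA NIC was attached, which usually indicates a missing or misspelled `alibabacloud.com/hpn-type` label.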