PAI-Lingjun AI Computing Service (Lingjun) provides the high-performance computing power required by compute-intensive workloads such as AI and high-performance computing (HPC). Lingjun pools computing power at scale to meet the heterogeneous computing requirements of industries such as autonomous driving, scientific research, finance, and biopharmaceutical R&D. This topic describes the features of Lingjun.
High-speed RDMA network architecture
Alibaba Group has conducted dedicated research on Remote Direct Memory Access (RDMA) since 2016 to improve data transmission performance. Alibaba Group has deployed a high-speed RDMA network in its large-scale data centers, which reduces latency by 90% and supports Alibaba Cloud services and internal Alibaba Group services such as high-performance storage and AI computing.
Backed by its large-scale RDMA deployment practice, Alibaba Cloud independently developed a high-performance RDMA protocol and the High Precision Congestion Control (HPCC) algorithm, which performs congestion control through collaboration between hosts and the network. Alibaba Cloud also offloads protocol processing to intelligent network interface controllers (NICs). Together, these measures reduce end-to-end network latency, improve network I/O throughput, and mitigate the performance losses that traditional network exceptions, such as network faults and black holes, cause in upper-layer applications.
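To illustrate the congestion-control idea, the published HPCC design adjusts each sender's window multiplicatively toward a target link utilization, plus a small additive increase. The following is a minimal, simplified sketch of that update rule; the function and parameter names (`eta`, `w_ai`) are illustrative and do not reflect Alibaba Cloud's actual implementation.

```python
def hpcc_window_update(window, utilization, eta=0.95, w_ai=1.0):
    """Simplified HPCC-style window update.

    Scales the congestion window toward the target link utilization
    eta (shrinks the window when the measured utilization exceeds the
    target, grows it otherwise), then adds a small additive increase
    w_ai so flows keep probing for bandwidth.
    """
    if utilization <= 0:
        raise ValueError("utilization must be positive")
    return window / (utilization / eta) + w_ai
```

For example, at twice the target utilization the window is roughly halved, while at half the target it roughly doubles, so queues drain quickly before they build up.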
High-performance ACCL
Lingjun supports the high-performance Alibaba Collective Communication Library (ACCL). ACCL works with hardware such as vSwitches to provide congestion-free, high-performance communication for AI clusters that contain tens of thousands of GPUs.
In AI clusters, inter-node communication is a major source of latency. To prevent network congestion, Lingjun combines the high-speed RDMA network with appropriate communication scheduling. ACCL implements intelligent matching between GPUs and NICs, automatic identification of the physical topology inside and across nodes, and topology-aware scheduling algorithms. This eliminates network congestion, streamlines network communication, and improves the elasticity of distributed training systems. For an AI cluster that contains tens of thousands of GPUs, the linear scaling efficiency can exceed 80%. For an AI cluster that contains hundreds of GPUs, the linear scaling efficiency can exceed 95%, which meets the requirements of more than 80% of business scenarios.
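The GPU-to-NIC matching mentioned above can be sketched as a simple topology-aware assignment: pair each GPU with a NIC on the same PCIe switch whenever one is free, so traffic avoids crossing extra hops. The topology representation and function name below are illustrative assumptions, not ACCL's actual interface.

```python
def match_gpus_to_nics(gpu_switch, nic_switch):
    """Topology-aware GPU-to-NIC matching sketch.

    gpu_switch maps GPU id -> PCIe switch id; nic_switch maps NIC
    name -> PCIe switch id. Returns {gpu: nic}, preferring a NIC on
    the same switch as the GPU and falling back to any free NIC.
    """
    # Group NICs by the PCIe switch they hang off.
    nics_by_switch = {}
    for nic, sw in nic_switch.items():
        nics_by_switch.setdefault(sw, []).append(nic)

    assignment, used = {}, set()
    for gpu, sw in sorted(gpu_switch.items()):
        local = [n for n in nics_by_switch.get(sw, []) if n not in used]
        candidates = local or [n for n in nic_switch if n not in used]
        nic = candidates[0]
        assignment[gpu] = nic
        used.add(nic)
    return assignment
```

In a real system, the topology would be discovered automatically (for example, from the PCIe hierarchy) rather than passed in by hand.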
High-performance KSpeed for data preloading acceleration
Building on the high-performance RDMA network and ACCL, Lingjun provides KSpeed, a high-performance data preloading acceleration service that optimizes data I/O.
Compute-storage separation architectures are widely used in AI, HPC, and big data scenarios. However, loading large amounts of training data creates efficiency bottlenecks. Alibaba Cloud uses KSpeed to improve data I/O performance.
For example, in specific scenarios, data loading can account for more than 60% of the total training time. KSpeed proactively preloads data into memory, which reduces data loading to less than 10% of the total training time and roughly doubles the computing performance per unit of time.
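The preloading idea above can be sketched as a background thread that reads samples ahead of the consumer into an in-memory buffer, so I/O overlaps with computation instead of blocking it. The class and its interface are illustrative assumptions, not KSpeed's actual API.

```python
import queue
import threading


class Preloader:
    """Minimal data-preloading sketch (not KSpeed's real interface).

    A daemon thread pulls items from the source iterable into a
    bounded queue; the consumer iterates over the buffered items
    while the thread keeps fetching ahead.
    """

    def __init__(self, source, depth=4):
        self._buf = queue.Queue(maxsize=depth)
        self._thread = threading.Thread(
            target=self._fill, args=(iter(source),), daemon=True)
        self._thread.start()

    def _fill(self, it):
        for item in it:
            self._buf.put(item)       # blocks when the buffer is full
        self._buf.put(StopIteration)  # sentinel: source is exhausted

    def __iter__(self):
        while True:
            item = self._buf.get()
            if item is StopIteration:
                return
            yield item
```

A bounded queue depth keeps memory use predictable while still hiding most of the I/O latency behind computation.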
eGPU for GPU-accelerated containers
To address issues such as oversized AI tasks, high GPU hardware costs, and low GPU utilization in your business scenarios, Lingjun supports eGPU, a GPU virtualization technology that effectively improves the GPU utilization of AI clusters. eGPU provides the following benefits:
Supports GPU isolation by GPU memory and computing power.
Supports multiple specifications.
Supports dynamic creation and destruction.
Supports hot upgrades.
Supports user-mode technologies to ensure high reliability.
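The memory and compute isolation above can be illustrated with a minimal sketch that carves virtual GPU slices with explicit quotas out of one physical GPU. The data model and names below are assumptions for illustration only; they are not Lingjun's actual eGPU interface.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalGPU:
    """Illustrative model of a physical GPU hosting eGPU-style slices."""
    mem_gib: int
    compute_pct: int = 100
    slices: list = field(default_factory=list)

    def carve(self, mem_gib, compute_pct):
        """Create a vGPU slice if enough memory and compute remain.

        Each slice is isolated by a GPU memory quota (GiB) and a
        compute quota (percentage of the device), mirroring the
        memory/compute isolation described above.
        """
        used_mem = sum(s["mem_gib"] for s in self.slices)
        used_cu = sum(s["compute_pct"] for s in self.slices)
        if (used_mem + mem_gib > self.mem_gib
                or used_cu + compute_pct > self.compute_pct):
            raise ValueError("insufficient GPU resources")
        s = {"mem_gib": mem_gib, "compute_pct": compute_pct}
        self.slices.append(s)
        return s
```

Enforcing both quotas at allocation time is what lets several tenants share one device without interfering with each other's memory or compute budget.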