PAI-Lingjun AI Computing Service (Lingjun) organizes compute resources in a three-level hierarchy: a cluster contains one or more node groups, and each node group consists of individual nodes. The Lingjun optimization suite runs across the cluster to accelerate large-scale parallel training.
Key concepts
Cluster
A cluster is a collection of high-performance heterogeneous accelerated compute nodes, all equipped with the Lingjun optimization suite. Nodes within a cluster communicate over a high-speed, low-latency remote direct memory access (RDMA) network with 800 Gbit/s bandwidth.
Lingjun gives you two deployment options: use the native physical cluster services on their own, or combine them with other Alibaba Cloud services.
Node group
A node group is a subset of the nodes in a cluster. In most cases, a node group consists of one or more nodes that share the same specifications or features. For example, you can group all GU100 nodes in a cluster together.
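The following minimal sketch illustrates the cluster, node group, and node hierarchy and the GU100 grouping example. It is not the Lingjun API. The class names (Cluster, NodeGroup, Node), the group_by_spec helper, and the node names and the "other-spec" label are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    name: str  # hypothetical node identifier
    spec: str  # specification label, for example "GU100"


@dataclass
class NodeGroup:
    spec: str  # specification shared by every node in the group
    nodes: List[Node] = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    node_groups: Dict[str, NodeGroup] = field(default_factory=dict)


def group_by_spec(cluster_name: str, nodes: List[Node]) -> Cluster:
    """Build a cluster in which each node group holds nodes of one specification."""
    cluster = Cluster(name=cluster_name)
    for node in nodes:
        # Create the group for this specification on first use, then add the node.
        group = cluster.node_groups.setdefault(node.spec, NodeGroup(spec=node.spec))
        group.nodes.append(node)
    return cluster


# Illustrative data only: two GU100 nodes and one node with a different specification.
example_nodes = [
    Node("node-1", "GU100"),
    Node("node-2", "GU100"),
    Node("node-3", "other-spec"),
]
example_cluster = group_by_spec("demo-cluster", example_nodes)
```

In this sketch, example_cluster ends up with two node groups: one that holds both GU100 nodes and one that holds the remaining node. Keying groups by specification mirrors the common case described above; a real cluster may group nodes by other features.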
Node
A node is a high-performance GPU server accelerated by the Lingjun optimization suite. When you create a cluster, you select the operating system for its nodes. CentOS 7.9 is supported.
Lingjun optimization suite
The Lingjun optimization suite is a software layer for clusters running large-scale parallel computing workloads. It includes four components:
Data loading optimization
Collective communication optimization
Computing resource optimization
Network optimization