PAI-Lingjun AI Computing Key Architecture Concepts and Terms - Platform for AI - Alibaba Cloud

PAI-Lingjun AI Computing Service (Lingjun) organizes compute resources in a three-level hierarchy: a cluster contains one or more node groups, and each node group consists of individual nodes. A Lingjun optimization suite runs across the cluster to accelerate large-scale parallel training.

Key concepts

Cluster

A cluster is a collection of high-performance heterogeneous accelerated compute nodes, all equipped with the Lingjun optimization suite. Nodes within a cluster communicate over a high-speed, low-latency remote direct memory access (RDMA) network with 800 Gbit/s bandwidth.
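To put the quoted bandwidth in perspective, the following minimal sketch estimates an ideal transfer time at 800 Gbit/s. The function name and the 100 GB payload are illustrative assumptions. The calculation ignores protocol overhead, congestion, and topology, so treat the result as a theoretical lower bound rather than a measured figure.

```python
LINK_GBIT_PER_S = 800  # bandwidth quoted in the text above


def ideal_transfer_seconds(payload_gigabytes: float,
                           link_gbit_per_s: float = LINK_GBIT_PER_S) -> float:
    """Return the best-case time to move a payload over the link.

    Hypothetical helper: converts gigabytes to gigabits (x8) and divides
    by the link rate. Real transfers will be slower.
    """
    payload_gigabits = payload_gigabytes * 8
    return payload_gigabits / link_gbit_per_s


# Assumed example payload: 100 GB of gradients or checkpoint data.
print(ideal_transfer_seconds(100))  # prints 1.0
```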

Lingjun gives you two deployment options: use the native physical cluster services on their own, or combine them with other Alibaba Cloud services.

Node group

A node group is a subset of nodes within a cluster. In most cases, a node group consists of one or more nodes that share the same specifications or features. For example, you can group all GU100 nodes in a cluster together.

Node

Nodes are high-performance GPU servers accelerated by the Lingjun optimization suite. When you create a cluster, you select the operating system for its nodes. CentOS 7.9 is supported.
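The cluster, node group, and node hierarchy described above can be sketched as a small data model. This is an illustration only, not the Lingjun API: the class names, field names, the `group_by_spec` helper, and the sample node names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    spec: str                 # for example "GU100"; hypothetical field
    os: str = "CentOS 7.9"    # operating system selected at cluster creation


@dataclass
class NodeGroup:
    name: str
    nodes: list[Node] = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    node_groups: list[NodeGroup] = field(default_factory=list)

    def node_count(self) -> int:
        """Total number of nodes across all node groups."""
        return sum(len(group.nodes) for group in self.node_groups)


def group_by_spec(cluster_name: str, nodes: list[Node]) -> Cluster:
    """Build a cluster whose node groups each hold nodes of one spec."""
    by_spec: dict[str, list[Node]] = defaultdict(list)
    for node in nodes:
        by_spec[node.spec].append(node)
    groups = [
        NodeGroup(name=f"{spec}-group", nodes=members)
        for spec, members in by_spec.items()
    ]
    return Cluster(name=cluster_name, node_groups=groups)


# Assumed sample data: two GU100 nodes and one node of another spec.
nodes = [
    Node("node-1", "GU100"),
    Node("node-2", "GU100"),
    Node("node-3", "other-spec"),
]
cluster = group_by_spec("demo-cluster", nodes)
```

In this sketch, `cluster.node_groups` holds one group per specification, which mirrors the example of grouping all GU100 nodes together.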

Lingjun optimization suite

The Lingjun optimization suite is a software layer for clusters running large-scale parallel computing workloads. It includes four components:

Data loading optimization
Collective communication optimization
Computing resource optimization
Network optimization