Platform for AI: What is PAI-Lingjun AI Computing Service?

Last Updated: Dec 24, 2024

PAI-Lingjun AI Computing Service (Lingjun) is a large-scale, high-density computing service that provides heterogeneous computing capabilities for high-performance AI training and computing. Lingjun is mainly used in large-scale distributed AI R&D scenarios, such as image recognition, natural language processing (NLP), search-based ad recommendation, and general large language models (LLMs). It is suitable for industries such as autonomous driving, financial risk control, pharmaceutical R&D, scientific intelligence, the metaverse, the Internet, and independent software vendors (ISVs). You are charged only for the resources that are consumed by AI training. You can use highly scalable, high-performance, and cost-effective intelligent computing infrastructure without the need to build, tune, or maintain complex compute nodes, storage systems, and Remote Direct Memory Access (RDMA) networks.

Architecture

(Figure: Lingjun architecture diagram)
  • Lingjun is a computing cluster service that integrates software and hardware. The hardware includes servers, networks, and storage systems. Lingjun delivers and manages the hardware as clusters. The software includes computing resource management and O&M, AI acceleration kits, cloud-native task management, and a comprehensive AI development platform. Lingjun supports common AI frameworks such as PyTorch and TensorFlow.

  • The underlying core hardware components of Lingjun consist of Panjiu servers and high-performance RDMA networks.

    • The Panjiu servers, which are developed by Alibaba Cloud, are extensively optimized for Lingjun to ensure hardware performance.

    • The networks support common fat-tree topologies and multiple communication protocols, such as TCP/IP and RDMA. The 25 Gbit/s and 100 Gbit/s networks of Lingjun are independently built: the 25 Gbit/s network is used for in-band management of servers, and the 100 Gbit/s network uses multiple network interface controllers (NICs) for efficient communication in AI training services. To improve network availability, Lingjun supports dual-uplink networking, in which each NIC is connected to two vSwitches through two ports. If the connection to one vSwitch fails, the connection to the other vSwitch takes over automatically.

  • The software architecture consists, from the bottom up, of resource management, a computing acceleration library, machine learning and deep learning frameworks, a development environment, and task management.

    • In terms of resource management, Lingjun uses Docker containers to partition and schedule resources and is compatible with orchestration tools such as Kubernetes.

    • In terms of system O&M and monitoring, Lingjun uses Apsara Infrastructure Management Framework of Alibaba Group to monitor the underlying resources and the status of clusters in real time.

    • The acceleration library is deeply customized and optimized for the communication of Lingjun clusters.

    • The computing system allows you to submit tasks and view task logs in the console, and supports mainstream AI computing frameworks such as PyTorch and TensorFlow. A minimal training sketch follows this list.
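
To make the task model concrete, the following is a minimal PyTorch DistributedDataParallel (DDP) training sketch of the kind of job such a cluster runs. The model, data, and hyperparameters are hypothetical placeholders, and the script uses only the standard torch.distributed API, not any Lingjun-specific interface; launch it with torchrun (for example, torchrun --nproc_per_node=8 train.py).

```python
# Minimal DDP training sketch (hypothetical model and data, standard PyTorch APIs).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")  # RDMA-capable collective backend
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 1024, device="cuda")  # placeholder batch
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```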

Why Lingjun?

You can use Lingjun to easily build AI clusters that have the following benefits:

  • Compute as a Service. Lingjun clusters provide high-performance, highly elastic heterogeneous computing services and can be scaled up to tens of thousands of GPUs. The network bandwidth of a single cluster reaches 4 Pbit/s, with latency as low as 2 microseconds.

  • High resource utilization. Resource utilization is increased by up to three times, and parallel computing efficiency reaches more than 90%.

  • Unified computing power pool. Lingjun clusters support centralized allocation and scheduling of computing power in AI and high performance computing scenarios.

  • Computing power management and monitoring. Lingjun provides an O&M and management platform that is deeply customized for heterogeneous computing power. The platform implements comprehensive monitoring and management of heterogeneous computing power, pooled resources, and efficiency.

Benefits

  • Accelerated AI innovation. End-to-end performance is improved, and the iteration efficiency of compute-intensive projects can be more than doubled.

  • Maximized return on investment (ROI). The efficient pooling and scheduling of heterogeneous computing power ensure that each computing resource is fully utilized. The resource utilization is improved by three times.

  • Adaptation to all business scales. Lingjun can provide the computing power that is required for simulations of large models and large-scale projects. This prevents innovation from being limited by computing power.

  • Visualization and controllability. Lingjun helps you manage the allocation of heterogeneous computing power in an easy manner. You can use Lingjun to continuously monitor and optimize the use of your computing power.

Scenarios

Lingjun is mainly used in large-scale distributed AI R&D scenarios, such as image recognition, NLP, search-based ad recommendation, and general LLMs. Lingjun is suitable for industries such as autonomous driving, financial risk control, pharmaceutical R&D, scientific intelligence, the metaverse, the Internet, and independent software vendors (ISVs).

  • Large-scale distributed training

    • Computing system with an ultra-large number of GPUs

      The peer-to-peer network architecture and pooled resources can be used with Machine Learning Platform for AI (PAI). Lingjun supports a variety of training frameworks, such as PyTorch, TensorFlow, Caffe, Keras, XGBoost, and Apache MXNet, and can meet the requirements of various AI training and inference services.

    • AI infrastructure

      • Smooth scale-up. Lingjun can meet GPU requirements at different scales and supports smooth scale-up to linearly improve computing performance.

      • Intelligent data acceleration. Lingjun provides intelligent data acceleration for AI training scenarios by prefetching the data that is required for training, which improves training efficiency. A generic prefetching sketch appears at the end of this section.

      • Improved resource utilization. Lingjun supports fine-grained management of heterogeneous resources to improve resource turnover efficiency.

  • Autonomous driving

    • Rich deployment and scheduling policies

      Lingjun supports multiple GPU scheduling policies to ensure efficient execution of training tasks. Lingjun uses Cloud Parallel File Storage (CPFS) and the RDMA network architecture to ensure high-performance data provision and computing I/O. Lingjun can also use the tiered storage feature of Object Storage Service (OSS) to store archived data, which reduces storage costs.

    • Support for both training and simulation

      Lingjun intelligently provides pooled computing power and supports both training and simulation scenarios. This improves iteration efficiency and reduces data migration costs in collaborative workflows.

  • Scientific intelligence

    • Expanded upper limit of innovation

      Based on the ultra-large high-speed RDMA networks and communication flow control technologies for data centers, Lingjun reduces the latency of end-to-end communication to microseconds. Based on the ultra-large linear elasticity, a Lingjun cluster can be scaled up to support tens of thousands of GPUs for parallel computing.

    • Integrated ecosystems and expanded boundaries of innovation

      Lingjun supports the centralized scheduling of high performance computing and AI tasks, provides a unified and collaborative base for scientific research and AI, and facilitates the integration of technologies and ecosystems.

    • Cloud scientific research and inclusive computing power

      Lingjun supports cloud-native and containerized AI and high performance application ecosystems, deep resource sharing, and inclusive intelligent computing power.
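
As referenced in the intelligent data acceleration item above, the effect of prefetching training data can be illustrated with a standard PyTorch DataLoader, which overlaps data loading with GPU computation by using background workers. This is a generic sketch with a hypothetical dataset; it does not use KSpeed or any other Lingjun-specific API.

```python
# Generic data-prefetching illustration (not the KSpeed API): background
# workers load and prefetch batches so the GPU does not wait on storage I/O.
import torch
from torch.utils.data import DataLoader, Dataset

class RandomImageDataset(Dataset):
    """Hypothetical dataset standing in for training data on shared storage."""
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 10

def main():
    loader = DataLoader(
        RandomImageDataset(),
        batch_size=64,
        num_workers=8,      # parallel loader processes
        prefetch_factor=4,  # batches each worker prepares in advance
        pin_memory=True,    # page-locked buffers speed up host-to-GPU copies
    )
    for images, labels in loader:
        pass  # a training step would consume the prefetched batch here

if __name__ == "__main__":
    main()
```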

Features

  • High-speed RDMA network architecture. Alibaba Group has invested in dedicated RDMA research since 2016.

    Alibaba Group has built high-speed networks in large-scale data centers. Based on large-scale RDMA deployment practices, Alibaba Cloud independently developed a high-performance RDMA protocol and the High Precision Congestion Control (HPCC) algorithm, which relies on collaboration between hosts and the network. Alibaba Cloud also offloads protocol processing to smart NICs in hardware. This reduces end-to-end network latency, improves network I/O throughput, and effectively reduces or prevents performance losses of upper-layer applications that are caused by network exceptions such as faults and blackholes.

  • High-performance Alibaba Collective Communication Library (ACCL). Lingjun supports the high-performance ACCL, which can be used together with hardware such as vSwitches to provide congestion-free, high-performance communication for AI clusters that contain tens of thousands of GPUs. Alibaba Cloud uses ACCL to implement intelligent matching of GPUs and NICs, automatic identification of physical topologies inside and outside nodes, and topology-aware scheduling algorithms. This eliminates network congestion, accelerates network communication, and improves the elasticity of distributed training systems. A Lingjun cluster that contains tens of thousands of GPUs can utilize over 80% of the linear cluster capability; a cluster that contains hundreds of GPUs can effectively use over 95% of its computing power, which meets the requirements of more than 80% of business scenarios. A generic all-reduce sketch appears at the end of this section.

  • High-performance KSpeed for data preloading acceleration. Based on high-performance RDMA networks and ACCL, Lingjun provides the high-performance KSpeed service to accelerate data preloading and intelligently optimize data I/O. Compute-storage separation architectures are widely used in AI, high performance computing, and big data scenarios, but loading large amounts of training data causes efficiency bottlenecks. Alibaba Cloud uses KSpeed to improve data I/O performance by orders of magnitude.

  • eGPU for virtualization of GPU-accelerated containers. To resolve issues that occur in real business scenarios, such as excessively large AI tasks, high GPU hardware costs, and low GPU utilization, Lingjun supports eGPU, a GPU virtualization technology that can effectively improve the GPU utilization of AI clusters. eGPU has the following benefits:

    • GPU isolation based on video memory and computing power.

    • Multiple specifications.

    • Dynamic creation and destruction.

    • Hot upgrade.

    • User-mode technologies for higher reliability.
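
As referenced in the ACCL item above, the core primitive that a collective communication library accelerates is the all-reduce operation used for gradient synchronization. The following sketch demonstrates all-reduce with PyTorch's generic torch.distributed API over the NCCL backend; it does not use ACCL itself, whose API is not described in this document.

```python
# All-reduce demo with standard PyTorch collectives (not the ACCL API).
# Launch with: torchrun --nproc_per_node=4 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes its own tensor; after all_reduce, every rank
    # holds the element-wise sum across all ranks.
    t = torch.full((4,), float(rank), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")  # same summed values on every rank

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```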

Limits on Lingjun networks

| Item | Limit | Method to increase the quota |
| --- | --- | --- |
| Maximum number of Lingjun Virtual Private Datacenters (VPDs) that can be created by using a single Alibaba Cloud account in the same region | 8 | For more information, see Manage quotas. |
| Maximum number of Lingjun subnets that can be created in a single Lingjun VPD | 16 | For more information, see Manage quotas. |
| Maximum number of nodes that can be deployed in a single Lingjun subnet | 1,000 | N/A |
| Maximum number of nodes that can be deployed in a single Lingjun VPD | 1,000 | N/A |
| CIDR blocks that can be configured for Lingjun VPDs and Lingjun subnets | Custom CIDR blocks other than 100.64.0.0/10, 224.0.0.0/4, 127.0.0.0/8, 169.254.0.0/16, and their subnets | N/A |
| Maximum number of Lingjun connection instances that can be created by using a single Alibaba Cloud account in the same region | 16 | N/A |
| Maximum number of IPv4 routes that can be learned from the Alibaba Cloud public cloud by a single Lingjun connection instance | 50 | N/A |
| Maximum number of IPv6 routes that can be learned from the Alibaba Cloud public cloud by a single Lingjun connection instance | 25 | N/A |
| Maximum number of Lingjun Hub instances that can be created by using a single Alibaba Cloud account in the same region | 4 | For more information, see Manage quotas. |
| Maximum number of Lingjun Hub instances that can be connected to a single Lingjun VPD | 1 | For more information, see Manage quotas. |
| Maximum number of Lingjun Hub instances that can be connected to a single Lingjun connection instance | 1 | For more information, see Manage quotas. |
| Maximum number of Lingjun connection instances that can be connected to a single Lingjun Hub instance | 32 | For more information, see Manage quotas. |
| Maximum number of nodes in all Lingjun VPDs that are supported by a single Lingjun Hub instance in the same region | 2,000 | N/A |
| Maximum number of routing policy entries that can be configured for a single Lingjun Hub instance | 100 | N/A |
| Maximum number of secondary private IP addresses that are supported by a single Lingjun NIC | 3 | For more information, see Manage quotas. |
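
The CIDR restriction in the table can be checked locally before you create a Lingjun VPD. The following sketch uses Python's standard ipaddress module with the reserved ranges listed above; it is a local validation aid, and the console performs its own authoritative checks.

```python
# Check whether a proposed Lingjun VPD or subnet CIDR block overlaps the
# reserved ranges listed in the limits table (local validation sketch only).
import ipaddress

RESERVED = [
    ipaddress.ip_network(cidr)
    for cidr in ("100.64.0.0/10", "224.0.0.0/4", "127.0.0.0/8", "169.254.0.0/16")
]

def is_valid_vpd_cidr(cidr: str) -> bool:
    """Return True if the CIDR block does not overlap any reserved range."""
    net = ipaddress.ip_network(cidr, strict=True)
    return not any(net.overlaps(reserved) for reserved in RESERVED)

print(is_valid_vpd_cidr("192.168.0.0/16"))  # True: no overlap
print(is_valid_vpd_cidr("100.100.0.0/16"))  # False: inside 100.64.0.0/10
```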
