PAI-Lingjun AI Computing Service (Lingjun) is a PaaS product designed for large-scale deep learning and integrated intelligent computing. Built on deeply integrated and optimized software and hardware, it provides high-performance heterogeneous computing power. Lingjun offers full-process AI engineering capabilities with core advantages of high performance, high efficiency, and high utilization, meeting the needs of demanding computing scenarios such as large language model training, autonomous driving, basic scientific research, and finance.
Large-scale distributed training
Supports AI research and development scenarios with a serverless architecture. Lingjun can handle training tasks for large-scale models such as GPT-3 (175 billion parameters), M6 (trillions of parameters), PLUG, and STAR. It provides deeply optimized intelligent computing services for application fields such as graphics and image processing (for example, AIGC image generation), natural language processing (for example, AIGC text generation), and audio and video, delivering efficient and predictable training services that accelerate model iteration.
Ten-thousand-GPU-level linear expansion: Supports AI training workloads of different scales with point-to-point communication latency as low as 2 microseconds. Computing resources scale out smoothly and performance expands near-linearly (see the distributed training sketch after this list).
Ultra-high throughput and IOPS: For AI training scenarios, data is preloaded into persistent storage to meet the high bandwidth requirements for loading and writing data during training, improving training efficiency (a data-loading sketch follows this list).
High resource utilization: Through fine-grained slicing and scheduling of GPU resources, Lingjun supports collaborative development. This technology has been validated at scale during the Double 11 Shopping Festival, improving resource utilization by up to three times.
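To make the scale-out story concrete, the following is a minimal sketch of a data-parallel training script using PyTorch DistributedDataParallel with the NCCL backend. The model, hyperparameters, and launch setup are illustrative placeholders, not a Lingjun-specific API.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Rank and device assignment are injected by the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")  # NCCL can use RDMA transports
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A tiny stand-in model; a real job would build the full network here.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across every GPU in the job
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with a standard launcher such as `torchrun`, the same script runs unchanged on one node or many; the gradient all-reduce in the backward pass is where a low-latency interconnect pays off.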
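The throughput point can likewise be sketched at the framework level: overlapping storage reads with GPU compute using worker processes, pinned memory, and prefetching. The synthetic dataset below is a stand-in for data preloaded onto fast persistent storage.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticDataset(Dataset):
    """Stand-in for samples preloaded onto fast persistent storage."""

    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 1000


loader = DataLoader(
    SyntheticDataset(),
    batch_size=256,
    num_workers=8,      # parallel readers keep the GPU fed
    pin_memory=True,    # enables asynchronous host-to-device copies
    prefetch_factor=4,  # each worker keeps several batches in flight
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)  # overlaps the copy with compute
    # ... forward/backward pass would go here ...
    break
```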
Autonomous driving
Provides a one-stop training and simulation platform that supports full-scenario applications. Through various GPU resource scheduling strategies, RDMA networks, and the CPFS storage system, Lingjun ensures efficient data processing and ample computing power. The platform also emphasizes data security and compliance, offering rich deployment and scheduling strategies that improve iteration efficiency and reduce data migration costs.
Efficient training and simulation support
Provides a unified platform for training and simulation, simplifying the development process. Through various GPU resource scheduling strategies, Lingjun ensures efficient execution of training tasks.
Combining CPFS with an RDMA network architecture ensures high-bandwidth delivery of training data and strong compute I/O performance. Meanwhile, tiered storage through OSS reduces the cost of storing archived data (see the sketch below).
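As a rough sketch of the tiered-storage idea using the public oss2 Python SDK, a finished training artifact can be written to OSS with a colder storage class. The credentials, endpoint, bucket, and file paths below are placeholders for illustration.

```python
import oss2

# Placeholder credentials and bucket; fill in real values for your account.
auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-training-data")

# Hot data stays on CPFS for training; finished artifacts can be archived
# to OSS with a colder storage class to cut long-term storage cost.
with open("checkpoints/epoch_final.pt", "rb") as f:
    bucket.put_object(
        "archive/epoch_final.pt",
        f,
        headers={"x-oss-storage-class": "Archive"},  # or "IA" for infrequent access
    )
```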
Comprehensive security and compliance assurance
The platform supports various autonomous driving application scenarios while meeting security and compliance requirements. Lingjun integrates Data Security Center, Cloud Firewall, Bastionhost, Encryption Service, SSL encryption, RAM, and Database Audit to secure data and applications.
High resource utilization and flexible expansion
Through fine-grained slicing and scheduling of GPU resources, Lingjun supports collaborative development, improving resource utilization by up to three times (a fractional-GPU sketch follows). Elastic expansion with cloud resources is optional and can be enabled on demand, ensuring flexible resource management, improving iteration efficiency, and reducing data migration costs.
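One way fine-grained GPU sharing can look at the framework level is a per-process memory cap. The sketch below uses PyTorch's public API as a generic illustration and does not depict Lingjun's internal scheduling mechanism.

```python
import torch

# Cap this process at roughly 25% of device 0's memory so that several
# development jobs can share one physical GPU. The caching allocator
# raises an out-of-memory error if the process exceeds its slice.
torch.cuda.set_per_process_memory_fraction(0.25, device=0)

x = torch.randn(4096, 4096, device="cuda:0")  # fits comfortably in the slice
print(torch.cuda.memory_allocated(0) / 1024**2, "MiB allocated")
```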
Scientific research
Through ultra-large-scale integrated computing power, Lingjun achieves unified deployment and scheduling of deep learning and high-performance computing tasks. It provides standardized computing services for fields such as basic scientific research, drug development, and engineering simulation. This promotes paradigm innovation and efficiency improvement while facilitating the deep integration of the AI and high-performance computing (HPC) development ecosystems.
Promoting new paradigms in scientific research
By supporting cloud-native, containerized AI and HPC application ecosystems, it provides unified computing services for fields such as basic scientific research, new drug development, and new materials research. It supports cross-region and cross-team collaboration, improves resource utilization, and promotes the integration of technology ecosystems, enhancing collaborative effects.
Building a large scientific research platform
Utilizing RDMA technology and Alibaba Cloud's high-performance communication library, it constructs a low-latency, high-bandwidth network environment. It optimizes communication for AI and HPC applications, achieving point-to-point communication latency as low as 2 microseconds and supporting parallel computing across tens of thousands of nodes, providing efficient intelligent computing services for large-scale scientific computing (a latency-measurement sketch follows).
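Figures like the quoted point-to-point latency are typically obtained from a ping-pong microbenchmark. The sketch below measures average one-way latency between two ranks with torch.distributed send/recv; it is a generic illustration, not Lingjun's benchmarking tool, and production RDMA measurements usually rely on dedicated tools.

```python
import time

import torch
import torch.distributed as dist


def pingpong(iters: int = 1000):
    # Assumes exactly two ranks, launched with a standard launcher.
    dist.init_process_group(backend="nccl")  # NCCL rides on RDMA where available
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    buf = torch.zeros(1, device="cuda")  # smallest message, to expose latency

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            dist.send(buf, dst=1)
            dist.recv(buf, src=1)
        else:
            dist.recv(buf, src=0)
            dist.send(buf, dst=0)
    torch.cuda.synchronize()

    if rank == 0:
        # One round trip is two messages, so halve the per-iteration time.
        usec = (time.perf_counter() - start) / iters / 2 * 1e6
        print(f"average one-way latency ~ {usec:.1f} us")
    dist.destroy_process_group()


if __name__ == "__main__":
    pingpong()
```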