PAI-Lingjun AI Computing Service (Lingjun) is a PaaS product designed for large-scale deep learning and integrated intelligent computing. Built on deeply integrated and optimized software and hardware, it provides high-performance heterogeneous computing power. Lingjun offers full-process AI engineering capabilities with core advantages of high performance, high efficiency, and high utilization, meeting the needs of demanding computing scenarios such as large language model training, autonomous driving, basic scientific research, and finance.
Large-scale distributed training
Supports AI research and development scenarios with a serverless architecture. Lingjun can handle training tasks for large-scale models such as GPT-3 (175 billion parameters), M6 (trillions of parameters), PLUG, and STAR. It provides deeply optimized intelligent computing services for application fields such as graphics and image processing (for example, AIGC image generation), natural language processing (for example, AIGC text generation), and audio and video, delivering efficient and predictable training services that accelerate model iteration.
Ten-thousand-GPU-level linear expansion: Supports AI training workloads of different scales with point-to-point communication latency as low as 2 microseconds. Computing resources scale out smoothly and performance expands near-linearly (see the distributed training sketch after this list).
Ultra-high throughput and IOPS: For AI training scenarios, data is preloaded into persistent storage to meet the high bandwidth requirements for loading and writing data during training, improving training efficiency (a data-loading sketch follows this list).
High resource utilization: Through fine-grained slicing and scheduling of GPU resources, Lingjun supports collaborative development. This technology has been validated at scale during the Double 11 Shopping Festival, improving resource utilization by up to three times.
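To make the scale-out story concrete, the following is a minimal sketch of a data-parallel training script using PyTorch DistributedDataParallel with the NCCL backend. The model, hyperparameters, and launch setup are illustrative placeholders, not a Lingjun-specific API.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Rank and device assignment are injected by the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")  # NCCL can use RDMA transports
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A tiny stand-in model; a real job would build the full network here.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across every GPU in the job
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with a standard launcher such as `torchrun`, the same script runs unchanged on one node or many; the gradient all-reduce in the backward pass is where a low-latency interconnect pays off.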
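The throughput point can likewise be sketched at the framework level: overlapping storage reads with GPU compute using worker processes, pinned memory, and prefetching. The synthetic dataset below is a stand-in for data preloaded onto fast persistent storage.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticDataset(Dataset):
    """Stand-in for samples preloaded onto fast persistent storage."""

    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 1000


loader = DataLoader(
    SyntheticDataset(),
    batch_size=256,
    num_workers=8,      # parallel readers keep the GPU fed
    pin_memory=True,    # enables asynchronous host-to-device copies
    prefetch_factor=4,  # each worker keeps several batches in flight
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)  # overlaps the copy with compute
    # ... forward/backward pass would go here ...
    break
```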
Autonomous driving
Provides a one-stop training and simulation platform that supports full-scenario applications. Through various GPU resource scheduling strategies, RDMA networks, and the CPFS storage system, Lingjun ensures efficient data processing and ample computing power. The platform also emphasizes data security and compliance, offering rich deployment and scheduling strategies that improve iteration efficiency and reduce data migration costs.
Efficient training and simulation support
Provides a unified platform for training and simulation, simplifying the development process. Through various GPU resource scheduling strategies, Lingjun ensures efficient execution of training tasks.
Combining CPFS with an RDMA network architecture ensures high-bandwidth delivery of training data and strong compute I/O performance. Meanwhile, tiered storage through OSS reduces the cost of storing archived data (see the sketch below).
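As a rough sketch of the tiered-storage idea using the public oss2 Python SDK, a finished training artifact can be written to OSS with a colder storage class. The credentials, endpoint, bucket, and file paths below are placeholders for illustration.

```python
import oss2

# Placeholder credentials and bucket; fill in real values for your account.
auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-training-data")

# Hot data stays on CPFS for training; finished artifacts can be archived
# to OSS with a colder storage class to cut long-term storage cost.
with open("checkpoints/epoch_final.pt", "rb") as f:
    bucket.put_object(
        "archive/epoch_final.pt",
        f,
        headers={"x-oss-storage-class": "Archive"},  # or "IA" for infrequent access
    )
```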
Comprehensive security and compliance assurance
The platform supports various autonomous driving application scenarios while meeting security and compliance requirements. Lingjun integrates Data Security Center, Cloud Firewall, Bastionhost, Encryption Service, SSL encryption, RAM, and Database Audit to secure data and applications.
High resource utilization and flexible expansion
Through fine-grained slicing and scheduling of GPU resources, Lingjun supports collaborative development, improving resource utilization by up to three times (a fractional-GPU sketch follows). Elastic expansion with cloud resources is optional and can be enabled on demand, ensuring flexible resource management, improving iteration efficiency, and reducing data migration costs.
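One way fine-grained GPU sharing can look at the framework level is a per-process memory cap. The sketch below uses PyTorch's public API as a generic illustration and does not depict Lingjun's internal scheduling mechanism.

```python
import torch

# Cap this process at roughly 25% of device 0's memory so that several
# development jobs can share one physical GPU. The caching allocator
# raises an out-of-memory error if the process exceeds its slice.
torch.cuda.set_per_process_memory_fraction(0.25, device=0)

x = torch.randn(4096, 4096, device="cuda:0")  # fits comfortably in the slice
print(torch.cuda.memory_allocated(0) / 1024**2, "MiB allocated")
```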
Scientific research
Through ultra-large-scale integrated computing power, Lingjun achieves unified deployment and scheduling of deep learning and high-performance computing tasks. It provides standardized computing services for fields such as basic scientific research, drug development, and engineering simulation. This promotes paradigm innovation and efficiency improvement while facilitating the deep integration of the AI and high-performance computing (HPC) development ecosystems.
Promoting new paradigms in scientific research
By supporting cloud-native, containerized AI and HPC application ecosystems, it provides unified computing services for fields such as basic scientific research, new drug development, and new materials research. It supports cross-region and cross-team collaboration, improves resource utilization, and promotes the integration of technology ecosystems, enhancing collaborative effects.
Building a large scientific research platform
Utilizing RDMA technology and Alibaba Cloud's high-performance communication library, it constructs a low-latency, high-bandwidth network environment. It optimizes communication for AI and HPC applications, achieving point-to-point communication latency as low as 2 microseconds and supporting parallel computing across tens of thousands of nodes, providing efficient intelligent computing services for large-scale scientific computing (a latency-measurement sketch follows).
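Figures like the quoted point-to-point latency are typically obtained from a ping-pong microbenchmark. The sketch below measures average one-way latency between two ranks with torch.distributed send/recv; it is a generic illustration, not Lingjun's benchmarking tool, and production RDMA measurements usually rely on dedicated tools.

```python
import time

import torch
import torch.distributed as dist


def pingpong(iters: int = 1000):
    # Assumes exactly two ranks, launched with a standard launcher.
    dist.init_process_group(backend="nccl")  # NCCL rides on RDMA where available
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    buf = torch.zeros(1, device="cuda")  # smallest message, to expose latency

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            dist.send(buf, dst=1)
            dist.recv(buf, src=1)
        else:
            dist.recv(buf, src=0)
            dist.send(buf, dst=0)
    torch.cuda.synchronize()

    if rank == 0:
        # One round trip is two messages, so halve the per-iteration time.
        usec = (time.perf_counter() - start) / iters / 2 * 1e6
        print(f"average one-way latency ~ {usec:.1f} us")
    dist.destroy_process_group()


if __name__ == "__main__":
    pingpong()
```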