Platform for AI: Use cases

Last Updated: Apr 01, 2026

PAI-Lingjun AI Computing Service (Lingjun) is a platform as a service (PaaS) product for large-scale deep learning and integrated intelligent computing. Built on optimized hardware-software integration, Lingjun delivers heterogeneous computing with high performance, high efficiency, and high utilization. It supports workloads that demand extreme scale, from large language model (LLM) training and autonomous driving simulation to scientific research and financial computing.

Large-scale distributed training

Train foundation models at scale on a serverless architecture. Lingjun supports training runs for GPT-3 (175 billion parameters), M6 (trillion parameters), PLUG, and STAR. Application domains include graphic and image processing (image generation for AI-generated content, or AIGC), natural language processing (NLP) for AIGC text generation, and audio and video processing.

  • Ten-thousand-GPU-level linear scale-out: Scale training from small experiments to ten-thousand-GPU clusters without reconfiguring your setup. Point-to-point communication latency stays as low as 2 microseconds, and throughput scales linearly as you add resources.

  • High-throughput storage for training data: Data is preloaded into persistent storage ahead of training. This satisfies the high-bandwidth requirements of data loading and writing during training, so that storage does not become the bottleneck.

  • Up to 3x improvement in resource utilization: Fine-grained GPU slicing and scheduling let multiple teams share the same cluster concurrently. This approach has been validated at the scale of the Double 11 Shopping Festival.
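
One way to check the linear scale-out claim against your own training runs is to compare measured throughput with ideal linear scaling from a small baseline. The following is a minimal sketch, not a Lingjun API: the function name scaling_efficiency and all sample measurements are hypothetical and chosen only for illustration.

```python
def scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Return measured throughput as a fraction of ideal linear scaling.

    A value of 1.0 means throughput grew exactly in proportion to GPU count.
    """
    if base_gpus <= 0 or gpus <= 0 or base_throughput <= 0:
        raise ValueError("GPU counts and baseline throughput must be positive")
    ideal = base_throughput * (gpus / base_gpus)
    return throughput / ideal


# Hypothetical measurements: (GPU count, training samples per second).
runs = [(8, 1000.0), (64, 7800.0), (512, 60000.0)]
base_gpus, base_tp = runs[0]

# Map each GPU count to its efficiency relative to the smallest run.
efficiencies = {
    gpus: scaling_efficiency(base_gpus, base_tp, gpus, tp) for gpus, tp in runs
}

for gpus, eff in efficiencies.items():
    print(f"{gpus:>4} GPUs: {eff:.1%} of linear scaling")
```

An efficiency that stays close to 1.0 as the GPU count grows indicates near-linear scaling. A steady decline usually points to communication or data-loading overhead.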

Autonomous driving

Run training and simulation on a one-stop platform that supports full-scenario applications. Lingjun combines GPU resource scheduling, Remote Direct Memory Access (RDMA) networking, and Cloud Parallel File Storage (CPFS) to keep both compute and data pipelines running at full speed.

  • One-stop training and simulation platform: A single platform handles both model training and simulation workloads, eliminating the overhead of managing separate environments. GPU scheduling strategies keep training tasks running efficiently across full-scenario applications.

  • High-bandwidth storage and network: CPFS paired with RDMA networking delivers the throughput and I/O performance that training data pipelines require. Tiered storage through Object Storage Service (OSS) reduces the cost of archiving historical simulation data.

  • Security and compliance built in: The platform includes Data Security Center, Cloud Firewall, Bastionhost, Encryption Service, SSL encryption, Resource Access Management (RAM), and Database Audit — covering data protection, access control, and audit requirements for regulated autonomous driving applications.

  • Elastic resource management: Fine-grained GPU slicing supports multi-team collaboration with up to a threefold improvement in resource utilization. Cloud resources scale on demand, cutting data migration costs and accelerating iteration cycles.
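
To see why GPU slicing can raise utilization, compare giving every job a dedicated GPU with packing fractional requests onto shared GPUs. The sketch below is an illustration under stated assumptions, not Lingjun's scheduler: it uses a generic first-fit-decreasing bin-packing heuristic, and the function names and job sizes (expressed as a percentage of one GPU) are hypothetical.

```python
GPU_CAPACITY = 100  # one whole GPU, expressed in percent


def gpus_needed_whole(requests):
    """Dedicated allocation: every job occupies a full GPU."""
    return len(requests)


def gpus_needed_sliced(requests):
    """Shared allocation: pack jobs onto GPUs with first-fit decreasing."""
    free = []  # remaining capacity of each GPU already in use
    for req in sorted(requests, reverse=True):
        if not 0 < req <= GPU_CAPACITY:
            raise ValueError("each request must be between 1 and 100 percent")
        for i, capacity in enumerate(free):
            if req <= capacity:
                free[i] = capacity - req
                break
        else:
            # No existing GPU has room, so bring another GPU into use.
            free.append(GPU_CAPACITY - req)
    return len(free)


def utilization(requests, gpu_count):
    """Fraction of allocated GPU capacity that jobs actually use."""
    return sum(requests) / (gpu_count * GPU_CAPACITY)


# Hypothetical per-job GPU demand from several teams, in percent of one GPU.
jobs = [60, 40, 30, 30, 20]

whole = gpus_needed_whole(jobs)
sliced = gpus_needed_sliced(jobs)
print(f"dedicated: {whole} GPUs, utilization {utilization(jobs, whole):.0%}")
print(f"sliced:    {sliced} GPUs, utilization {utilization(jobs, sliced):.0%}")
```

The gain depends entirely on the job mix. Workloads that already saturate whole GPUs leave nothing to reclaim, while many small jobs benefit the most.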

Scientific research

Run deep learning and high-performance computing (HPC) workloads on a shared, ultra-large-scale compute infrastructure. Lingjun provides standardized computing services for basic scientific research, drug development, and engineering simulation, bringing AI and HPC ecosystems together on a single platform.

  • AI and HPC on a unified platform: Cloud-native and containerized AI and HPC application ecosystems run side by side, with unified scheduling across both workload types. Built-in support for cross-regional and cross-team collaboration, in fields such as new drug development and new materials research, improves resource utilization and removes ecosystem silos.

  • Low-latency, high-bandwidth network fabric: RDMA technology and Alibaba Cloud's high-performance communication library deliver point-to-point communication latency as low as 2 microseconds. The network supports parallel computing across tens of thousands of nodes — enough for the most demanding large-scale scientific computing workloads.
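
The effect of low point-to-point latency is easiest to see with the common latency-bandwidth (alpha-beta) cost model, in which the time to send a message is the fixed latency plus the message size divided by the bandwidth. The sketch below is a rough estimate only. The 2-microsecond latency comes from the figure above, while the function name transfer_time_us and the 100 Gbit/s link bandwidth are assumptions made for illustration, not published Lingjun specifications.

```python
def transfer_time_us(message_bytes, latency_us=2.0, bandwidth_gbps=100.0):
    """Estimate point-to-point transfer time in microseconds.

    Uses the alpha-beta model: time = latency + size / bandwidth.
    bandwidth_gbps is an assumed link speed in gigabits per second.
    """
    if message_bytes < 0 or latency_us < 0 or bandwidth_gbps <= 0:
        raise ValueError("size and latency must be >= 0, bandwidth > 0")
    bytes_per_us = bandwidth_gbps * 1e9 / 8 / 1e6  # Gbit/s to bytes per microsecond
    return latency_us + message_bytes / bytes_per_us


# Compare how much of the total time is fixed latency for different sizes.
for size in (1_024, 1_048_576, 1_073_741_824):
    total = transfer_time_us(size)
    latency_share = 2.0 / total
    print(f"{size:>13} bytes: {total:12.2f} us, latency share {latency_share:.1%}")
```

Under this model, small messages are dominated by latency and large messages by bandwidth. This is why low latency matters most for the frequent, small synchronization messages in tightly coupled parallel jobs.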