Platform for AI: Common scenarios

Last Updated: Mar 12, 2024

PAI-Lingjun AI Computing Service (Lingjun) is mainly used in large-scale distributed AI R&D scenarios, such as graphics and image recognition, natural language processing (NLP), search, recommendation, advertising, and general-purpose large models. It is suitable for industries such as autonomous driving, financial risk control, biopharmaceutical R&D, scientific intelligence, the metaverse, the Internet, and independent software vendors (ISVs).

Large-scale distributed training

  • High-performance AI evolution base

    Lingjun is used to build computing systems with an ultra-large number of GPUs. Its peer-to-peer network architecture and pooled resources can be used together with Platform for AI (PAI). Lingjun supports a variety of training frameworks, such as PyTorch, TensorFlow, Caffe, Keras, XGBoost, and Apache MXNet, and can meet the requirements of a wide range of AI training and inference services.

  • AI infrastructure

    • Smooth scale-up. Lingjun can meet GPU requirements at different scales and supports smooth scale-up, so that computing performance improves nearly linearly as resources are added.

    • Intelligent data acceleration. Lingjun provides intelligent data acceleration for AI training scenarios by prefetching the data required for training, which improves training efficiency.

    • Higher resource utilization. Lingjun supports fine-grained management of heterogeneous resources, which improves resource utilization and turnover.
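The intelligent data acceleration described above relies on prefetching: loading upcoming training data in the background so that computation never waits on I/O. The following Python sketch illustrates the general technique only; it is not Lingjun's implementation, and the `load_batch` callable and buffer depth are illustrative assumptions.

```python
import queue
import threading

def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches while a background thread loads the next ones.

    load_batch: callable that fetches batch i (e.g. from remote storage).
    depth: how many batches to keep prefetched ahead of the consumer.
    """
    buf = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def producer():
        for i in range(num_batches):
            buf.put(load_batch(i))  # blocks when the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            break
        yield item
```

While the consumer processes batch i, the producer thread is already fetching batches i+1 and i+2, so slow storage reads overlap with GPU computation instead of stalling it.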

Autonomous driving

  • Support for training and simulation on one platform

  • Support for all scenarios and assurance of security compliance

    • Various deployment and scheduling policies

      Lingjun supports multiple GPU scheduling policies to ensure efficient execution of training tasks.

    • Data storage with high performance and high throughput

      Lingjun uses Cloud Parallel File Storage (CPFS) and the remote direct memory access (RDMA) network architecture to ensure high-performance data provision and computing I/O. Lingjun can also use the tiered storage feature of Object Storage Service (OSS) to store archived data, which reduces storage costs.

    • Support for both training and simulation

      Lingjun provides integrated computing power for both training and simulation scenarios. This improves iteration efficiency and reduces the cost of migrating data between the two workloads when they run in collaboration.
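Lingjun's actual GPU scheduling policies are not detailed in this document. As an illustration of the simplest possible policy, the following Python sketch implements first-fit placement of jobs onto nodes; the job and node names are hypothetical.

```python
def first_fit_schedule(jobs, nodes):
    """Assign each job to the first node with enough free GPUs.

    jobs: list of (job_name, gpus_needed) tuples, in submission order.
    nodes: dict mapping node name -> free GPU count.
    Returns a dict mapping job_name -> node name (or None if no node fits).
    """
    free = dict(nodes)  # copy so the caller's view is not mutated
    placement = {}
    for name, need in jobs:
        placement[name] = None
        for node, avail in free.items():
            if avail >= need:
                free[node] = avail - need
                placement[name] = node
                break
    return placement
```

Real cluster schedulers layer priorities, gang scheduling, and topology awareness on top of a placement core like this; first-fit is shown only because it is the easiest policy to state precisely.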

Scientific intelligence

  • Integrated computing power to support diversified innovation

  • Deepened and expanded boundaries of innovation

    • Expanded upper limit of innovation

      Built on ultra-large high-speed RDMA networks and data center communication flow control technologies, Lingjun reduces end-to-end communication latency to microseconds. With near-linear elasticity, a Lingjun cluster can be scaled up to support tens of thousands of GPUs for parallel computing.

    • Integrated ecosystems and expanded boundaries of innovation

      Lingjun supports the centralized scheduling of high-performance computing and AI tasks, provides a unified and collaborative base for scientific research and AI, and facilitates the integration of technologies and ecosystems.

    • Cloud scientific research and inclusive computing power

      Lingjun supports cloud-native, containerized AI and high-performance computing application ecosystems, enabling deep resource sharing and broadly accessible intelligent computing power.