This topic describes the architecture of Platform for AI (PAI).

The PAI architecture consists of four layers:

Basic resources layer (computing resources and infrastructure):
- Infrastructure: provides CPUs, GPUs, high-speed Remote Direct Memory Access (RDMA) networks, and Container Service for Kubernetes (ACK). A sketch of checking the GPUs visible inside a training container follows this layer.
- Computing resources: includes cloud-native computing resources (Lingjun resources and general computing resources) and big data engine resources (MaxCompute and Flink).
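
The following is a minimal sketch of how a job scheduled on these resources can confirm which accelerators its container actually sees. It assumes the container image ships with PyTorch; the code is generic PyTorch, not a PAI-specific API:

```python
# Minimal sketch: report the accelerators visible inside a training container.
# Assumes PyTorch is installed in the image; this is generic PyTorch, not a PAI API.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPU visible; running on CPU.")
```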
Platform and tools layer (Lingjun AI Computing Service and AI frameworks):
- AI frameworks: supports mainstream frameworks and techniques such as Alink, TensorFlow, PyTorch, Megatron, DeepSpeed, and Reinforcement Learning from Human Feedback (RLHF). A minimal training sketch follows this layer.
- Optimization and acceleration: provides Dataset Acceleration (DatasetAcc), Training Acceleration (TorchAcc), Parallel Training (EPL), Inference Acceleration (BladeLLM), Automatic Fault-tolerant Training (AIMaster), and Training Snapshot (EasyCkpt).
- End-to-end machine learning tools:
  - Data preparation: provides the iTAG data annotation service and dataset management features.
  - Model development and training: provides tools such as Machine Learning Designer, Data Science Workshop (DSW), Deep Learning Containers (DLC), and FeatureStore.
  - Model deployment: Elastic Algorithm Service (EAS) deploys trained models as online inference services. A sketch of invoking a deployed service also follows this layer.
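
The following is a minimal sketch of the kind of self-contained PyTorch script that can be submitted to DLC as a training job. The periodic checkpoint only illustrates the idea behind training snapshots and is not the EasyCkpt API; the model, data, and file path are placeholder assumptions:

```python
# train.py: minimal PyTorch training loop of the kind submitted to DLC.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 10)  # stand-in for a real training dataset
    y = torch.randn(32, 1)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 50 == 0:
        # Periodic snapshot so an interrupted job can resume;
        # illustrates the concept behind EasyCkpt, not its API.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   "checkpoint.pt")
```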
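
After a model is deployed through EAS, it is exposed as an HTTP service. The sketch below calls such a service with the standard requests library; the endpoint URL, service name, token, and request body are hypothetical placeholders that in practice come from the EAS console and the model's own input format:

```python
# Minimal sketch: invoke a model deployed as an EAS online service.
import requests

# Hypothetical placeholders; real values come from the EAS console.
ENDPOINT = "http://example.<region>.pai-eas.aliyuncs.com/api/predict/my_model"
TOKEN = "<service-token>"

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": TOKEN},
    json={"inputs": [[0.1] * 10]},  # request body is model-specific
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```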
Application layer (model services): PAI integrates with model services and application platforms such as the ModelScope community, PAI-DashScope, third-party Model-as-a-Service (MaaS) platforms, and Alibaba Cloud Model Studio.
Business layer (scenario-based solutions): PAI provides scenario-based solutions for fields such as autonomous driving, AI for Science, financial risk control, and intelligent recommendation. For example, Alibaba Group's internal systems for search, recommendation, and financial services use PAI for data mining.