PAI ensures secure and reliable AI workloads through zone isolation, automatic fault tolerance, computing power health checks, and infrastructure monitoring.
Inter-zone fault isolation
A zone is a physical area within a region with an independent power supply and network.
Zones within the same region are connected by low-latency internal networks. PAI isolates faults between zones, so a failure in one zone does not affect others. Each region is independent, and zones across different regions are isolated.
Elastic automatic fault tolerance
PAI provides fault tolerance monitoring through AIMaster, a task-level component. When fault tolerance monitoring is enabled for a task, an AIMaster instance starts and runs alongside the other task instances to monitor the task, detect faults, and control resources. For more information, see AIMaster: The elastic automatic fault tolerance engine.
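The general pattern of a monitor that runs alongside task instances can be illustrated with a minimal Python sketch. This is not AIMaster's implementation or interface: the `Worker` and `Monitor` classes, the heartbeat mechanism, the 30-second timeout, and the restart budget are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical record for one task instance being monitored."""
    name: str
    last_heartbeat: float  # seconds, on the monitor's clock
    restarts: int = 0


@dataclass
class Monitor:
    """Illustrative supervisor; not AIMaster's real interface."""
    timeout: float = 30.0   # assumed: silence longer than this counts as a fault
    max_restarts: int = 3   # assumed retry budget per worker
    workers: dict = field(default_factory=dict)

    def heartbeat(self, name: str, now: float) -> None:
        """Record that a worker reported in at time `now`."""
        worker = self.workers.get(name)
        if worker is None:
            self.workers[name] = Worker(name, now)
        else:
            worker.last_heartbeat = now

    def check(self, now: float) -> dict:
        """Return the action chosen for each worker at time `now`."""
        actions = {}
        for worker in self.workers.values():
            if now - worker.last_heartbeat <= self.timeout:
                actions[worker.name] = "healthy"
            elif worker.restarts < self.max_restarts:
                worker.restarts += 1
                worker.last_heartbeat = now  # fresh window after a restart
                actions[worker.name] = "restart"
            else:
                actions[worker.name] = "fail_task"
        return actions


monitor = Monitor(timeout=30.0, max_restarts=1)
monitor.heartbeat("worker-0", now=0.0)
monitor.heartbeat("worker-1", now=0.0)
monitor.heartbeat("worker-0", now=40.0)  # worker-1 stays silent
first = monitor.check(now=45.0)
second = monitor.check(now=100.0)
```

In this sketch a silent worker is restarted until its retry budget is used up, after which the monitor gives up on the task. A real engine would likely distinguish recoverable from unrecoverable errors rather than rely on a timeout alone.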
Computing power health check
DLC provides sanity checks to verify the health and performance of computing resources for distributed training tasks. Enable sanity checks when creating a DLC training task. The checks inspect all training resources, automatically isolate faulty nodes, and trigger background automated O&M processes, reducing early-stage failures and improving training success rates. Upon completion, the system generates a GPU computing power and communication performance report to help identify and locate performance-degrading elements and improve problem diagnosis efficiency. For detailed instructions, see SanityCheck: Computing power health check.
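As a rough illustration of how a pre-training health check might separate healthy nodes from faulty ones, the sketch below compares each node's benchmark score against the median of the group. It does not reflect how DLC sanity checks are actually implemented: the `screen_nodes` function, the node names, the scores, and the 80% cutoff are hypothetical.

```python
from statistics import median


def screen_nodes(throughput, min_ratio=0.8):
    """Split nodes into healthy and isolated lists.

    throughput: mapping of node name -> measured benchmark score
                (None means the node did not complete the benchmark).
    min_ratio:  assumed cutoff; nodes scoring below min_ratio * median
                are isolated.
    """
    completed = [v for v in throughput.values() if v is not None]
    if not completed:
        # No node finished the benchmark, so nothing can be trusted.
        return [], sorted(throughput)
    baseline = median(completed)
    healthy, isolated = [], []
    for node, score in sorted(throughput.items()):
        if score is not None and score >= min_ratio * baseline:
            healthy.append(node)
        else:
            isolated.append(node)
    return healthy, isolated


# Hypothetical measurements; node-d failed to report a score.
measurements = {
    "node-a": 100.0,
    "node-b": 98.0,
    "node-c": 55.0,
    "node-d": None,
}
healthy, isolated = screen_nodes(measurements)
```

Using the median as the baseline keeps a single slow node from dragging down the reference value, which is one reasonable design choice among several.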
Infrastructure monitoring
PAI integrates with Cloud Monitor to build and strengthen security defense systems. Related topics:
- Model inference monitoring (EAS): View EAS Cloud Monitor events.