Platform For AI: Infrastructure security

Last Updated: Jun 17, 2026

PAI ensures secure and reliable AI workloads through zone isolation, automatic fault tolerance, computing power health checks, and infrastructure monitoring.

Inter-zone fault isolation

A zone is a physical area within a region with an independent power supply and network.

Zones within the same region are connected by low-latency internal networks. PAI isolates faults between zones, so a failure in one zone does not affect others. Each region is independent, and zones across different regions are isolated.

Elastic automatic fault tolerance

PAI provides fault tolerance monitoring through AIMaster, a task-level component. When enabled for a task, an AIMaster instance starts and runs alongside other task instances to monitor tasks, detect faults, and control resources. For more information, see AIMaster: The elastic automatic fault tolerance engine.
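The monitor, detect, and control loop described above can be illustrated with a small, generic sketch. This is not AIMaster's implementation or API: the `Worker` and `Monitor` classes, the `max_restarts` budget, and the restart policy are all hypothetical names and assumptions chosen only to show the general pattern of a task-level supervisor running alongside task instances.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for one task instance."""
    name: str
    healthy: bool = True
    restarts: int = 0


@dataclass
class Monitor:
    """Hypothetical task-level supervisor (not the AIMaster API)."""
    workers: list
    max_restarts: int = 2  # assumed per-worker restart budget
    log: list = field(default_factory=list)

    def poll(self) -> bool:
        """Inspect every worker once and restart unhealthy ones within budget.

        Returns True if all workers are healthy after this pass.
        """
        for w in self.workers:
            if w.healthy:
                continue
            if w.restarts < self.max_restarts:
                w.restarts += 1
                w.healthy = True  # stands in for rescheduling the instance
                self.log.append(f"restarted {w.name}")
            else:
                self.log.append(f"gave up on {w.name}")
        return all(w.healthy for w in self.workers)
```

In this sketch, a worker that keeps failing is restarted until its budget is spent, after which the supervisor reports the task as unhealthy instead of retrying indefinitely.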

Computing power health check

DLC provides sanity checks that verify the health and performance of the computing resources used by distributed training tasks. You enable sanity checks when you create a DLC training task. The checks inspect all training resources, automatically isolate faulty nodes, and trigger automated O&M processes in the background, which reduces early-stage failures and improves training success rates. When the checks complete, the system generates a report on GPU computing power and communication performance that helps you locate the components that degrade performance and speeds up diagnosis. For detailed instructions, see SanityCheck: Computing power health check.
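As a rough illustration of the idea, the sketch below classifies nodes from two measurements and separates the ones that fall short. It is a simplified model under stated assumptions, not the DLC SanityCheck implementation: the `NodeResult` type, the `sanity_check` function, the metric names, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResult:
    """Hypothetical per-node measurements from a pre-training check."""
    node: str
    gpu_tflops: float  # measured compute throughput
    bus_gbps: float    # measured collective-communication bandwidth


def sanity_check(results, min_tflops, min_gbps):
    """Split nodes into a healthy list and an isolated dict with reasons.

    Thresholds are supplied by the caller; the values are assumptions,
    not documented PAI defaults.
    """
    healthy, isolated = [], {}
    for r in results:
        reasons = []
        if r.gpu_tflops < min_tflops:
            reasons.append("low compute")
        if r.bus_gbps < min_gbps:
            reasons.append("low bandwidth")
        if reasons:
            isolated[r.node] = reasons
        else:
            healthy.append(r.node)
    return healthy, isolated
```

Only the nodes in the healthy list would go on to run the training task in this model, and the isolated dict plays the role of a minimal report that records why each node was excluded.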

Infrastructure monitoring

PAI integrates with Cloud Monitor to build and strengthen security defense systems. Related topics: