Platform for AI: Infrastructure security

Last Updated: Mar 06, 2026

Platform for AI (PAI) infrastructure security covers fault isolation between zones, elastic automatic fault tolerance, computing power health checks, and infrastructure monitoring.

Inter-zone fault isolation

A zone is a physical area within a region with an independent power supply and network.

Zones within the same region are connected by low-latency internal networks. PAI implements fault isolation between zones, so a failure in one zone does not affect operations in other zones. Regions are independent of each other, and zones in different regions are fully isolated from one another.

Elastic automatic fault tolerance

PAI provides fault tolerance monitoring based on AIMaster, a task-level component. When you enable AIMaster for a task, an AIMaster instance starts and runs alongside the task's other instances to monitor the task, detect faults, and control resources. For more information, see AIMaster: The elastic automatic fault tolerance engine.
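The monitor, detect, and control pattern described above can be illustrated with a minimal sketch. This is not AIMaster's implementation or API: the `Worker` and `Status` types, the `supervise` function, and the `max_restarts` policy are all hypothetical names chosen for illustration, assuming a supervisor that restarts failed workers up to a fixed limit.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


@dataclass
class Worker:
    # Hypothetical view of one task instance as seen by a supervisor.
    name: str
    status: Status
    restarts: int = 0


def supervise(workers: list[Worker], max_restarts: int = 3) -> dict[str, str]:
    """Return one decision per worker: 'keep', 'restart', or 'abort'."""
    decisions: dict[str, str] = {}
    for w in workers:
        if w.status is not Status.FAILED:
            # Running or finished workers need no action.
            decisions[w.name] = "keep"
        elif w.restarts < max_restarts:
            # Fault detected and restart budget remains: retry the worker.
            w.restarts += 1
            decisions[w.name] = "restart"
        else:
            # Restart budget exhausted: give up on this worker.
            decisions[w.name] = "abort"
    return decisions
```

A real fault tolerance engine would also classify errors and reschedule resources; the sketch only shows the decision step.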

Computing power health check

DLC provides sanity checks for AI training that verify the health and performance of the computing resources used by distributed training tasks. You enable this feature when you create a DLC training task. A sanity check inspects all resources allocated to the training task, automatically isolates faulty nodes, and triggers automated O&M processes in the background. This catches problems before training starts and improves the training success rate. When the check completes, the system generates a report on GPU computing power and communication performance. The report helps you identify and locate the components that degrade performance, which speeds up problem diagnosis. For detailed instructions, see SanityCheck: Computing power health check.
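As a rough illustration of the check, isolate, and report flow described above, the following sketch flags nodes whose measurements fall below a threshold. It does not reflect how DLC's SanityCheck is implemented: `NodeResult`, `sanity_check`, the two metrics, and the threshold values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeResult:
    # Hypothetical per-node measurements collected before training starts.
    node: str
    gpu_tflops: float  # measured compute throughput
    comm_gbps: float   # measured inter-node bandwidth


def sanity_check(results, min_tflops, min_gbps):
    """Split nodes into healthy and isolated, recording why each was isolated."""
    healthy, isolated = [], []
    for r in results:
        reasons = []
        if r.gpu_tflops < min_tflops:
            reasons.append("low compute")
        if r.comm_gbps < min_gbps:
            reasons.append("low bandwidth")
        if reasons:
            # Faulty node: keep it out of the training job.
            isolated.append((r.node, reasons))
        else:
            healthy.append(r.node)
    return {"healthy": healthy, "isolated": isolated}
```

For example, calling `sanity_check` with three nodes where one is slow on compute and another on bandwidth returns a report that lists one healthy node and two isolated nodes with their reasons.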

Infrastructure monitoring

PAI integrates with Cloud Monitor, which you can use to build and strengthen your security defense system.