Cloud Config: Enable AIMaster-based fault tolerance monitoring for PAI distributed training

Last Updated: Nov 21, 2025

A PAI Deep Learning Containers (DLC) job is considered compliant if AIMaster-based fault tolerance monitoring is enabled for it. This rule does not apply if no training jobs exist.

Risk level

The default risk level is High.

You can change the risk level as needed.

Detection logic

  • A PAI Deep Learning Containers (DLC) job is considered compliant if AIMaster-based fault tolerance monitoring is enabled for it.

  • If no training jobs exist, this rule does not apply.
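
The detection logic above can be sketched in a few lines of Python. This is only an illustration, not the Cloud Config implementation: the TrainingJob class, its aimaster_monitoring_enabled field, and the helper names are hypothetical, and the result strings are assumed labels for the three outcomes. The sketch also assumes that a workspace is reported as non-compliant as soon as one of its jobs lacks monitoring, which the rule description does not state explicitly.

```python
from dataclasses import dataclass
from typing import List

# Assumed labels for the three possible outcomes of the rule.
COMPLIANT = "COMPLIANT"
NON_COMPLIANT = "NON_COMPLIANT"
NOT_APPLICABLE = "NOT_APPLICABLE"


@dataclass
class TrainingJob:
    """Hypothetical stand-in for a PAI DLC job record."""
    job_id: str
    # Hypothetical flag: True if AIMaster-based fault tolerance
    # monitoring is enabled for this job.
    aimaster_monitoring_enabled: bool


def evaluate_job(job: TrainingJob) -> str:
    """A job is compliant only if AIMaster monitoring is enabled."""
    return COMPLIANT if job.aimaster_monitoring_enabled else NON_COMPLIANT


def evaluate_workspace(jobs: List[TrainingJob]) -> str:
    """Aggregate job results for one workspace (assumed aggregation)."""
    if not jobs:
        # No training jobs exist, so the rule does not apply.
        return NOT_APPLICABLE
    if all(evaluate_job(job) == COMPLIANT for job in jobs):
        return COMPLIANT
    return NON_COMPLIANT
```

With this sketch, evaluate_workspace([]) returns NOT_APPLICABLE, and a workspace that contains one job without monitoring returns NON_COMPLIANT.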

Rule details

Parameter                   Description
Rule name                   Enable AIMaster-based fault tolerance monitoring for PAI distributed training
Rule identifier             pai-dlc-error-monitoring-ai-master-enabled
Tag                         [PAIWorkspace]
Automatic remediation       Not supported
Rule trigger                Periodic, every 24 hours
Supported resource types    [ACS::PAIWorkspace::Workspace]
Input parameters            None

Remediation guide

For more information about remediation, see AIMaster: Elastic Automatic Fault Tolerance Engine.