Platform For AI: SanityCheck: Health check

Last Updated: May 20, 2026

SanityCheck is a computing power health check feature provided by Deep Learning Containers (DLC). Before a distributed training job starts, or after it restarts, SanityCheck inspects every compute node in the job, automatically isolates faulty nodes, and triggers an Operations and Maintenance (O&M) process in the background. After the check completes, the system generates a report on GPU computing power and communication performance to help you identify and locate performance bottlenecks.

SanityCheck addresses two common failure patterns in AI training:

  • Faulty resources that waste GPU time: A job can fail to start training even after spending minutes loading model checkpoints, requiring manual investigation and resubmission.

  • Hard-to-locate performance degradation: Slow nodes can silently degrade training throughput, with no fast way to pinpoint which node is responsible or verify baseline GPU performance.

Limits

  • Supported job type: PyTorch only

  • Supported resource type: Lingjun Intelligent Computing resources only

  • GPU count must be configured at the instance level

  • Lingjun Intelligent Computing resources are available to allowlisted users only. Contact your account manager to request access.

Enable health check

Prerequisites

Before you begin, make sure you have:

  • Access to the PAI console with Lingjun Intelligent Computing resources on your allowlist

  • A Lingjun resource quota. See Create a resource quota if you need to create one.

Enable health check in the console

  1. Create a DLC job in the PAI console.

  2. In the Resource information section, set the following parameters:

    Parameter | Value
    Resource type | Lingjun Intelligence Resources
    Source | Resource Quota
    Resource quota | Select an existing Lingjun resource quota
    Framework | PyTorch
    Job resource | GPU (number of cards), configured at the instance level
  3. In the Fault tolerance and diagnosis section, enable the Health check switch.

  4. Configure the health check parameters:

    Parameter | Description
    Check time | When the health check runs. Before job runs (default): the check runs after the job acquires resources and before your training code executes. After job restarts: the check runs each time AIMaster restarts the job after a failure. This option requires Automatic fault tolerance to be enabled; see AIMaster: Elastic automatic fault tolerance engine.
    Check items | The checks to run, drawn from four categories: computing performance, node communication, compute-communication interference, and model simulation. GPU GEMM and All-Reduce are enabled by default. You can search for specific items, select them individually, or choose a preset template. For descriptions and estimated durations, see Check item descriptions.
    Maximum check duration | Maximum runtime for the health check. Default: 60 minutes. If the check times out, the exception handling policy is triggered.
    Exception handling policy | What happens when the check detects a faulty or suspicious node. Stop job: terminates the job and marks it as Check failed. Choose this if you only need detection without automatic retry. Blacklist and rerun: blocks the node, restarts the job, and reruns the check until it passes or the maximum restart count is exceeded. Choose this if you want automatic recovery.
    Maximum restart count | Applies only when Exception handling policy is set to Blacklist and rerun. Default: 10. If the number of restarts exceeds this limit, the job fails automatically.
    Other configurations | Advanced parameter settings. Leave blank by default.
  5. Complete the remaining job configuration and click Submit. After the job is created, the system checks the health and availability of all compute nodes. This process may take several minutes.
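
The parameters and defaults described above can be captured in a small data structure, which may help when reviewing settings before you submit a job. The following is a minimal sketch only: the class name, field names, and option strings are hypothetical and mirror the console labels; they are not part of any PAI SDK or API. The check time is modelled as a single choice for simplicity.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical identifiers that mirror the console labels; not a PAI SDK.
CHECK_TIMES = ("before_job_runs", "after_job_restarts")
POLICIES = ("stop_job", "blacklist_and_rerun")


@dataclass
class HealthCheckSettings:
    policy: str                          # no default is documented, so it is required here
    check_time: str = "before_job_runs"  # documented default
    check_items: List[str] = field(
        default_factory=lambda: ["GPU GEMM", "All-Reduce"]  # documented defaults
    )
    max_check_minutes: int = 60          # documented default
    max_restarts: int = 10               # documented default
    fault_tolerance_enabled: bool = False

    def problems(self) -> List[str]:
        """Return a list of inconsistencies; an empty list means none were found."""
        found = []
        if self.policy not in POLICIES:
            found.append(f"unknown exception handling policy: {self.policy}")
        if self.check_time not in CHECK_TIMES:
            found.append(f"unknown check time: {self.check_time}")
        # Documented rule: After job restarts needs Automatic fault tolerance.
        if self.check_time == "after_job_restarts" and not self.fault_tolerance_enabled:
            found.append("After job restarts requires Automatic fault tolerance")
        # Sanity checks assumed for this sketch, not stated in the documentation.
        if self.max_check_minutes <= 0:
            found.append("maximum check duration must be positive")
        if self.policy == "blacklist_and_rerun" and self.max_restarts < 0:
            found.append("maximum restart count cannot be negative")
        return found


settings = HealthCheckSettings(policy="stop_job")
print(settings.problems())  # an empty list: the defaults are consistent
```

Keeping the rule about Automatic fault tolerance next to the defaults makes it harder to select After job restarts without enabling its prerequisite.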

View check results

Health check statuses

Status | Meaning
Checking | Health check is in progress
Check failed | An abnormal node was detected, or the check timed out
Check passed | All checks passed; the job enters Running status
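
To see how these statuses relate to the two exception handling policies, the following sketch simulates the check loop. It is an illustration under stated assumptions, not SanityCheck's actual implementation: the function and type names are hypothetical, timeouts are not modelled, blocked nodes are simply dropped rather than replaced from the quota, and reporting Check failed when the restart limit is reached is an assumption (the documentation only says the job fails).

```python
from typing import Callable, List, NamedTuple


class CheckResult(NamedTuple):
    status: str          # "Check passed" or "Check failed"
    flagged: List[str]   # every node flagged as faulty across all passes
    restarts: int        # job restarts triggered by the policy


def simulate_health_check(
    nodes: List[str],
    find_faulty: Callable[[List[str]], List[str]],  # hypothetical stand-in for the check items
    policy: str,                                    # "stop_job" or "blacklist_and_rerun"
    max_restarts: int = 10,
) -> CheckResult:
    remaining = list(nodes)
    flagged: List[str] = []
    restarts = 0
    while True:
        # The job would show "Checking" while this pass runs.
        faulty = find_faulty(remaining)
        if not faulty:
            return CheckResult("Check passed", flagged, restarts)
        flagged.extend(faulty)
        # Stop job ends here; Blacklist and rerun ends once no restarts are left.
        if policy == "stop_job" or restarts >= max_restarts:
            return CheckResult("Check failed", flagged, restarts)
        # Blacklist and rerun: block the faulty nodes, restart, and check again.
        remaining = [n for n in remaining if n not in faulty]
        restarts += 1


bad_nodes = {"node-2"}
detect = lambda ns: [n for n in ns if n in bad_nodes]
print(simulate_health_check(["node-1", "node-2", "node-3"], detect, "blacklist_and_rerun"))
print(simulate_health_check(["node-1", "node-2", "node-3"], detect, "stop_job"))
```

With Blacklist and rerun, the first call blocks node-2, restarts once, and passes on the second pass. With Stop job, the second call ends at the first detection without restarting.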

View results in the console

On the DLC job details page, go to the Events tab and click Sanity check to view the check progress and results.

Click the Restart records tab to view the number of restarts, restart reasons, and restart results.

Configure notifications

To receive alerts when a health check fails, create a notification rule in your PAI workspace:

  1. Go to the event notification settings in your PAI workspace.

  2. Set Event type to DLC Jobs > Automatic fault tolerance.

  3. Configure the remaining notification parameters. See Notifications for details.

For instructions on creating notification rules, see Event notification settings.

Appendix: Check item descriptions

Note

Estimated durations are based on two machines and are for reference only. The actual duration may vary.

Category | Check item | What it detects | Estimated duration
Computing performance | GPU GEMM | GPU GEMM performance. Identifies faulty GPUs (computation errors or hangs) and slow nodes (low TFLOPS). | 1 minute
Computing performance | GPU Kernel Launch | GPU kernel launch latency. Identifies faulty nodes (kernel launch errors or hangs) and slow nodes (long kernel launch time). | 1 minute
Node communication | All-Reduce | Inter-node communication performance. Identifies communication fault nodes (errors or hangs) and slow communication nodes (low bandwidth). | 5 minutes
Node communication | All-to-All | Inter-node communication performance across all-to-all patterns. Identifies communication faults and slow nodes. | 5 minutes
Node communication | All-Gather | Inter-node communication performance in all-gather operations. Identifies communication faults and slow nodes. | 5 minutes
Node communication | Multi-All-Reduce | Performance of multiple concurrent All-Reduce operations. Identifies communication faults and slow nodes. | 5 minutes
Node communication | Network Connectivity | Network connectivity of head and tail nodes. Identifies nodes with abnormal communication connectivity. | 2 minutes
Compute-communication interference | MatMul/All-Reduce Overlap | Single-node performance when computation and communication kernels overlap. Identifies faulty nodes (overlap errors or hangs) and slow nodes (long overlap time). | 1 minute
Model simulation | Mini GPT | AI system reliability using a small GPT model simulation. Identifies faulty nodes (abnormal training loss, hangs, or errors) and slow nodes (long per-step time). | 1 minute
Model simulation | Megatron GPT | AI system reliability using a Megatron GPT model simulation. Identifies faulty nodes and slow nodes. | 5 minutes
Model simulation | ResNet | AI system reliability using a ResNet model simulation. Identifies faulty nodes and slow nodes. | 2 minutes
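
When you select check items, it can help to compare their combined estimated duration with the Maximum check duration setting. The sketch below does this using the reference durations from the table. It assumes the selected items run one after another, which the documentation does not state, and the estimates are based on two machines, so treat the result as a rough lower bound rather than a prediction. The function names are hypothetical.

```python
from typing import Iterable

# Reference durations in minutes, copied from the check item table (two-machine estimates).
ESTIMATED_MINUTES = {
    "GPU GEMM": 1,
    "GPU Kernel Launch": 1,
    "All-Reduce": 5,
    "All-to-All": 5,
    "All-Gather": 5,
    "Multi-All-Reduce": 5,
    "Network Connectivity": 2,
    "MatMul/All-Reduce Overlap": 1,
    "Mini GPT": 1,
    "Megatron GPT": 5,
    "ResNet": 2,
}


def estimated_total_minutes(items: Iterable[str]) -> int:
    """Sum the reference durations, assuming the items run sequentially."""
    items = list(items)
    unknown = [name for name in items if name not in ESTIMATED_MINUTES]
    if unknown:
        raise ValueError(f"unknown check items: {unknown}")
    return sum(ESTIMATED_MINUTES[name] for name in items)


def fits_budget(items: Iterable[str], max_check_minutes: int = 60) -> bool:
    """Compare the estimate with the Maximum check duration (default: 60 minutes)."""
    return estimated_total_minutes(items) <= max_check_minutes


print(estimated_total_minutes(["GPU GEMM", "All-Reduce"]))  # 6 (the default items)
print(estimated_total_minutes(ESTIMATED_MINUTES))           # 33 (every item)
```

Under these assumptions, even the full suite fits within the default 60-minute limit on two machines. Larger jobs may take longer, so leave headroom when lowering the limit.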