SanityCheck is a computing power health check feature provided by Deep Learning Containers (DLC). Before a distributed training job starts — or after it restarts — SanityCheck inspects every compute node in the job, automatically isolates faulty nodes, and triggers an Operations and Maintenance (O&M) process in the background. After the check completes, the system generates a report on GPU computing power and communication performance to help you identify and locate performance bottlenecks.
SanityCheck addresses two common failure patterns in AI training:
- Faulty resources that waste GPU time: A job can fail to start training even after spending minutes loading model checkpoints, requiring manual investigation and resubmission.
- Hard-to-locate performance degradation: Slow nodes can silently degrade training throughput, with no fast way to pinpoint which node is responsible or verify baseline GPU performance.
Limits
- Supported job type: PyTorch only
- Supported resource type: Lingjun Intelligent Computing resources only
- GPU count must be configured at the instance level
- Lingjun Intelligent Computing resources are available to allowlisted users only. Contact your account manager to request access.
Enable health check
Prerequisites
Before you begin, make sure you have:
- Access to the PAI console with Lingjun Intelligent Computing resources on your allowlist
- A Lingjun resource quota. See Create a resource quota if you need to create one.
Enable health check in the console
- Create a DLC job in the PAI console.
- In the Resource information section, set the following parameters:

  | Parameter | Value |
  |---|---|
  | Resource type | Lingjun Intelligence Resources |
  | Source | Resource Quota |
  | Resource quota | Select an existing Lingjun resource quota |
  | Framework | PyTorch |
  | Job resource | GPU (number of cards), configured at the instance level |

- In the Fault tolerance and diagnosis section, enable the Health check switch.
- Configure the health check parameters:

  | Parameter | Description |
  |---|---|
  | Check time | When the health check runs. Before job runs (default): runs the check after the job acquires resources and before your training code executes. After job restarts: runs the check each time AIMaster restarts the job after a failure. This option requires Automatic fault tolerance to be enabled; see AIMaster: Elastic automatic fault tolerance engine. |
  | Check items | The validation suite includes four categories: compute performance, node communication, compute-communication interference, and model simulation. GPU GEMM and All-Reduce are enabled by default. Search for specific items, select them individually, or choose a preset template. For descriptions and estimated durations, see Check item descriptions. |
  | Maximum check duration | Maximum runtime for the health check. Default: 60 minutes. If the check times out, the exception handling policy is triggered. |
  | Exception handling policy | What happens when the check detects a faulty or suspicious node. Stop job: terminates the job and marks it as Check failed. Blacklist and rerun (recommended if you want automatic recovery): blocks the node, restarts the job, and reruns the check until it passes. Choose Blacklist and rerun to recover automatically; choose Stop job if you only need detection without automatic retry. |
  | Maximum restart count | Applies when Exception handling policy is set to Blacklist and rerun. Default: 10. If restarts exceed this limit, the job automatically fails. |
  | Other configurations | Leave blank by default. Supports advanced parameter settings. |

- Complete the remaining job configuration and click Submit. After the job is created, the system checks the health and availability of all compute nodes. This process may take several minutes.
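The interaction between Exception handling policy and Maximum restart count can be pictured with a small simulation. This is only an illustrative sketch, not the DLC implementation: `run_with_policy`, `CheckOutcome`, and the `find_faulty` callback are hypothetical names, and the model simply drops blocked nodes rather than replacing them with healthy ones from the quota.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class CheckOutcome:
    status: str                      # "Check passed" or "Check failed"
    restarts: int                    # restarts performed before the final result
    blacklisted: Set[str] = field(default_factory=set)


def run_with_policy(
    nodes: List[str],
    find_faulty: Callable[[List[str]], Set[str]],  # hypothetical stand-in for the health check
    policy: str = "Blacklist and rerun",
    max_restarts: int = 10,                        # documented default
) -> CheckOutcome:
    """Simulate how the two policies react to faulty nodes (assumed behavior)."""
    blacklisted: Set[str] = set()
    restarts = 0
    while True:
        active = [n for n in nodes if n not in blacklisted]
        faulty = find_faulty(active)
        if not faulty:
            return CheckOutcome("Check passed", restarts, blacklisted)
        if policy == "Stop job":
            # Detection only: the job ends as soon as a bad node is found.
            return CheckOutcome("Check failed", restarts, blacklisted)
        blacklisted |= faulty
        if restarts >= max_restarts:
            # Another restart would exceed the limit, so the job fails.
            return CheckOutcome("Check failed", restarts, blacklisted)
        restarts += 1


# Example: one bad node out of three, default policy.
bad_nodes = {"node-2"}
outcome = run_with_policy(
    ["node-0", "node-1", "node-2"],
    lambda active: bad_nodes & set(active),
)
print(outcome.status, outcome.restarts, sorted(outcome.blacklisted))
# Check passed 1 ['node-2']
```

With Stop job, the same input ends immediately as Check failed with zero restarts, which matches the "detection without automatic retry" use case described in the table.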
View check results
Health check statuses
| Status | Meaning |
|---|---|
| Checking | Health check is in progress |
| Check failed | An abnormal node was detected, or the check timed out |
| Check passed | All checks passed; the job enters Running status |
View results in the console
On the DLC job details page, go to the Events tab and click Sanity check to view the check progress and results.
Click the Restart records tab to view the number of restarts, restart reasons, and restart results.
Configure notifications
To receive alerts when a health check fails, create a notification rule in your PAI workspace:
- Go to the event notification settings in your PAI workspace.
- Set Event type to DLC Jobs > Automatic fault tolerance.
- Configure the remaining notification parameters. See Notifications for details.
For instructions on creating notification rules, see Event notification settings.
Appendix: Check item descriptions
Estimated durations are based on two machines and are for reference only. The actual duration may vary.
| Category | Check item | What it detects | Estimated duration |
|---|---|---|---|
| Computing performance | GPU GEMM | GPU GEMM performance. Identifies faulty GPUs (computation errors or hangs) and slow nodes (low TFLOPS). | 1 minute |
| Computing performance | GPU Kernel Launch | GPU kernel launch latency. Identifies faulty nodes (kernel launch errors or hangs) and slow nodes (long kernel launch time). | 1 minute |
| Node communication | All-Reduce | Inter-node communication performance. Identifies communication fault nodes (errors or hangs) and slow communication nodes (low bandwidth). | 5 minutes |
| Node communication | All-to-All | Inter-node communication performance across all-to-all patterns. Identifies communication faults and slow nodes. | 5 minutes |
| Node communication | All-Gather | Inter-node communication performance in all-gather operations. Identifies communication faults and slow nodes. | 5 minutes |
| Node communication | Multi-All-Reduce | Performance of multiple concurrent All-Reduce operations. Identifies communication faults and slow nodes. | 5 minutes |
| Node communication | Network Connectivity | Network connectivity of head and tail nodes. Identifies nodes with abnormal communication connectivity. | 2 minutes |
| Compute-communication interference | MatMul/All-Reduce Overlap | Single-node performance when computation and communication kernels overlap. Identifies faulty nodes (overlap errors or hangs) and slow nodes (long overlap time). | 1 minute |
| Model simulation | Mini GPT | AI system reliability using a small GPT model simulation. Identifies faulty nodes (abnormal training loss, hangs, or errors) and slow nodes (long per-step time). | 1 minute |
| Model simulation | Megatron GPT | AI system reliability using a Megatron GPT model simulation. Identifies faulty nodes and slow nodes. | 5 minutes |
| Model simulation | ResNet | AI system reliability using a ResNet model simulation. Identifies faulty nodes and slow nodes. | 2 minutes |
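When choosing check items, it can help to compare the summed estimates against the Maximum check duration. The sketch below copies the two-machine reference values from the table above and assumes the selected items run one after another, which this page does not state; the helper names are hypothetical, and real durations on larger jobs may differ.

```python
# Estimated minutes per check item, copied from the table above
# (two-machine reference values, not guarantees).
ESTIMATED_MINUTES = {
    "GPU GEMM": 1,
    "GPU Kernel Launch": 1,
    "All-Reduce": 5,
    "All-to-All": 5,
    "All-Gather": 5,
    "Multi-All-Reduce": 5,
    "Network Connectivity": 2,
    "MatMul/All-Reduce Overlap": 1,
    "Mini GPT": 1,
    "Megatron GPT": 5,
    "ResNet": 2,
}

DEFAULT_MAX_CHECK_MINUTES = 60  # documented default for Maximum check duration


def estimate_minutes(selected):
    """Sum the reference estimates, assuming the items run sequentially."""
    unknown = [name for name in selected if name not in ESTIMATED_MINUTES]
    if unknown:
        raise KeyError(f"Unknown check items: {unknown}")
    return sum(ESTIMATED_MINUTES[name] for name in selected)


def fits_budget(selected, max_minutes=DEFAULT_MAX_CHECK_MINUTES):
    """Return True if the summed estimate is within the configured maximum."""
    return estimate_minutes(selected) <= max_minutes


default_items = ["GPU GEMM", "All-Reduce"]        # enabled by default
print(estimate_minutes(default_items))            # 6
print(estimate_minutes(list(ESTIMATED_MINUTES)))  # 33
print(fits_budget(list(ESTIMATED_MINUTES)))       # True
```

Under these assumptions, even the full suite (about 33 minutes) stays below the 60-minute default, so a timeout with the default setting more likely points to a hang than to an oversized selection.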