Enable health checks when you create a DLC job - Platform For AI

This topic describes how to use the SanityCheck feature in DLC.

Overview

In AI training scenarios, you may encounter the following issues:

Resource failures that interrupt jobs and waste GPU resources: A job might spend significant time on initialization, such as loading a model checkpoint, only to fail due to faulty resources. Investigating the issue and resubmitting the job wastes GPU resources.
Insufficient tools for performance diagnosis and testing: If model training performance degrades during a job, a slow node might be the cause, but it can be difficult to identify quickly. Additionally, users often lack convenient and reliable benchmarks for testing the GPU compute and communication performance of machines in a resource group.

To address these issues, DLC provides the SanityCheck feature to check the health and performance of compute resources for distributed training jobs. You can enable this feature when you create a DLC training job. SanityCheck performs a comprehensive check on the training resources, automatically isolates faulty nodes, and triggers automated backend maintenance workflows. This process reduces the likelihood of issues during the initial phase of training and improves the job success rate. After the checks are complete, SanityCheck generates a report on GPU compute and communication performance. The report helps you identify factors that may degrade training performance, improving overall diagnostic efficiency.

Limitations

Currently, this feature is available only for PyTorch training jobs that use Lingjun intelligent computing resources. These jobs must use all GPUs on each allocated machine. Lingjun intelligent computing resources are available only to allowlisted users. To request access, contact your account manager.

Enable health checks

Use the console

When you create a DLC training job in the PAI console, you can enable health checks by configuring the following key parameters. After you create the job, the system checks resource health and availability and then reports the results.

The key parameters are described as follows:

In the Resource Information section:

Parameter	Description
Resource Type	Select Lingjun Intelligence Resources.
Source	Select Resource Quota.
Resource Quota	Select an existing resource quota for Lingjun intelligent computing resources. For information about how to create a resource quota, see Create a resource quota.
Framework	Select PyTorch.
Job Resource	The job must be configured to use all GPUs on each machine.

In the Fault Tolerance and Diagnosis section, turn on the Health Check switch and configure the following parameters:

Parameter	Description
Check Time	Before Job Runs (default): Runs a pre-check on compute nodes after the system allocates resources to the job and before it executes your code. After Job Restarts: Runs a health check after AIMaster automatic fault tolerance restarts a job due to an exception. Before Job Runs + After Job Restarts: Runs a health check both before the job runs and after the job restarts. Note: The After Job Restarts and Before Job Runs + After Job Restarts options both require the Automatic Fault Tolerance switch to be turned on. For more information, see AIMaster: An elastic and automatic fault tolerance engine.
Check Items	The check items are grouped into four categories: compute performance check, node communication check, compute and communication overlap check, and model simulation verification. For more information about the check items and recommended scenarios, see Appendix: Check items. By default, GPU GEMM (for checking GPU GEMM performance) and All-Reduce (for checking node communication performance and identifying slow or faulty nodes) are enabled. You can search for or select check items from a list. You can also use a quick configuration template to select a predefined set of check items.
Maximum Check Duration	The maximum time allowed for the health check. The default is 60 minutes. If the check times out, the system applies the configured Exception Handling Policy.
Exception Handling Policy	If a health check fails, the system handles the job according to the selected policy: End job: If a faulty or suspicious node is identified, the job is terminated and marked as Check Failed. Add to blocklist and rerun: If a faulty or suspicious node is identified, the system automatically adds the node to a blocklist, restarts the job, and reruns the checks until all checks pass or the maximum restart count is exceeded.
Maximum Restart Count	When Exception Handling Policy is set to Add to blocklist and rerun, you can configure the maximum number of restarts. The default value is 10. If the maximum number of restarts is exceeded, the job automatically fails.
Other Configurations	This parameter is empty by default. You can use it to configure advanced parameters.
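
The dependencies between these parameters can be sketched in Python. This is an illustrative model only: the class and field names (HealthCheckSettings, problems, effective_max_restarts) are hypothetical and are not part of any DLC or PAI SDK. The sketch encodes only the rules stated in the tables above: the defaults for Check Time, Check Items, Maximum Check Duration, and Maximum Restart Count, the Automatic Fault Tolerance requirement, and the fact that the restart limit applies only to the Add to blocklist and rerun policy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CheckTime(Enum):
    BEFORE_JOB_RUNS = "Before Job Runs"
    AFTER_JOB_RESTARTS = "After Job Restarts"
    BOTH = "Before Job Runs + After Job Restarts"


class ExceptionPolicy(Enum):
    END_JOB = "End job"
    BLOCKLIST_AND_RERUN = "Add to blocklist and rerun"


@dataclass
class HealthCheckSettings:
    # Hypothetical container that mirrors the console fields; not a DLC API type.
    policy: ExceptionPolicy  # no default is assumed here; choose one explicitly
    check_time: CheckTime = CheckTime.BEFORE_JOB_RUNS
    check_items: List[str] = field(default_factory=lambda: ["GPU GEMM", "All-Reduce"])
    max_check_minutes: int = 60
    max_restart_count: int = 10
    automatic_fault_tolerance: bool = False

    def problems(self) -> List[str]:
        """Return violations of the rules stated in the parameter tables."""
        found = []
        runs_after_restart = self.check_time in (CheckTime.AFTER_JOB_RESTARTS, CheckTime.BOTH)
        if runs_after_restart and not self.automatic_fault_tolerance:
            found.append(
                f"'{self.check_time.value}' requires the Automatic Fault Tolerance switch"
            )
        return found

    def effective_max_restarts(self) -> Optional[int]:
        """The restart limit applies only to the blocklist-and-rerun policy."""
        if self.policy is ExceptionPolicy.BLOCKLIST_AND_RERUN:
            return self.max_restart_count
        return None


settings = HealthCheckSettings(
    policy=ExceptionPolicy.BLOCKLIST_AND_RERUN,
    check_time=CheckTime.BOTH,
)
for problem in settings.problems():
    print(problem)
```

In this example, the settings request a check after job restarts without turning on automatic fault tolerance, so problems() reports one violation.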

View check results

Health check status

The health check process for a DLC job includes the following statuses:

Checking: The health check is in progress.
Check Failed: The check fails if a faulty node is detected or if the check times out.
Check Passed: After all health checks pass, the job's status changes to Running.
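
The way these statuses interact with the exception handling policy can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the actual DLC scheduler logic: the function name run_health_check and the find_bad_nodes callback are hypothetical, check timeouts are not modeled, and blocklisted nodes are simply dropped rather than replaced with healthy nodes from the resource quota.

```python
from typing import Callable, Iterable, Set, Tuple


def run_health_check(
    nodes: Iterable[str],
    find_bad_nodes: Callable[[Set[str]], Set[str]],
    rerun_on_failure: bool,
    max_restart_count: int = 10,
) -> Tuple[str, Set[str], int]:
    """Simulate the check loop and return (final status, blocklist, restarts).

    rerun_on_failure=False models the "End job" policy;
    rerun_on_failure=True models "Add to blocklist and rerun".
    """
    active = set(nodes)
    blocklist: Set[str] = set()
    restarts = 0
    while True:
        # Status while this call runs: Checking
        bad = find_bad_nodes(active)
        if not bad:
            return "Check Passed", blocklist, restarts
        if not rerun_on_failure:
            return "Check Failed", blocklist, restarts
        if restarts >= max_restart_count:
            # Another restart would exceed the limit, so the job fails.
            return "Check Failed", blocklist, restarts
        blocklist |= bad
        active -= bad
        restarts += 1


# Hypothetical cluster in which node "n2" is faulty.
faulty = {"n2"}
status, blocked, restarts = run_health_check(
    ["n1", "n2", "n3"],
    find_bad_nodes=lambda active: active & faulty,
    rerun_on_failure=True,
)
print(status, sorted(blocked), restarts)
```

With the rerun policy, the faulty node is blocklisted after the first pass and the second pass succeeds. With rerun_on_failure=False, the same input ends in Check Failed without any restart.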

View health check results

Use the console

On the details page of a DLC job, go to the Event tab and click Health Check to view the check progress and results.

The health check includes items such as Preparing Check Environment, GPU GEMM, GPU Kernel Launch, All-Reduce-Single-Node, MatMul/All-Reduce Overlap, and Mini GPT-Single-Node. A green check mark indicates that an item has passed.

Click the Restart History tab to view the number of restarts, the restart reason, and the result.

Configure message notifications

You can create a notification rule in the event notification settings of your PAI workspace. For Event Type, select DLC Job > Automatic Fault Tolerance. For information about how to configure other parameters, see Message Notification. The system sends a notification if a health check fails.

Note

For instructions on how to create message notification rules in a workspace, see Event Notification Settings.

Appendix: Check items

Note

The estimated check durations are based on a two-machine setup and are for reference only. Actual times may vary.

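Category	Check item	Description (scenarios)	Estimated duration
Compute performance check	GPU GEMM	Tests GPU GEMM performance. This check can identify: Faulty GPUs: compute errors or hangs. Slow nodes: low compute TFLOPS.	1 minute
Compute performance check	GPU Kernel Launch	Tests the latency of GPU kernel launches. This check can identify: Faulty nodes: kernel launch errors or hangs. Slow nodes: high kernel launch latency.	1 minute
Node communication check	All-Reduce	Tests node communication performance and identifies slow or faulty communication nodes. For various collective communication patterns, it identifies: Faulty communication nodes: communication errors or hangs. Slow communication nodes: low communication bandwidth.	5 minutes per collective communication check
Node communication check	All-to-All	Tests node communication performance and identifies slow or faulty communication nodes. For various collective communication patterns, it identifies: Faulty communication nodes: communication errors or hangs. Slow communication nodes: low communication bandwidth.	5 minutes per collective communication check
Node communication check	All-Gather	Tests node communication performance and identifies slow or faulty communication nodes. For various collective communication patterns, it identifies: Faulty communication nodes: communication errors or hangs. Slow communication nodes: low communication bandwidth.	5 minutes per collective communication check
Node communication check	Multi-All-Reduce	Tests node communication performance and identifies slow or faulty communication nodes. For various collective communication patterns, it identifies: Faulty communication nodes: communication errors or hangs. Slow communication nodes: low communication bandwidth.	5 minutes per collective communication check
Node communication check	Network Connectivity	Tests network connectivity for the head or tail nodes to identify nodes with connectivity issues.	2 minutes
Compute and communication overlap check	MatMul/All-Reduce Overlap	Tests single-node performance when communication and compute kernels overlap. This check can identify: Faulty nodes: overlap computation errors or hangs. Slow nodes: high latency for overlapped computations.	1 minute
Model simulation verification	Mini GPT	Verifies the reliability of the AI system by using model simulation. It identifies: Faulty nodes: training loss anomalies, training hangs, or training errors. Slow nodes: high latency per training step.	1 minute
Model simulation verification	Megatron GPT	Verifies the reliability of the AI system by using model simulation. It identifies: Faulty nodes: training loss anomalies, training hangs, or training errors. Slow nodes: high latency per training step.	5 minutes
Model simulation verification	ResNet	Verifies the reliability of the AI system by using model simulation. It identifies: Faulty nodes: training loss anomalies, training hangs, or training errors. Slow nodes: high latency per training step.	2 minutes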
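
When you select many check items, it can help to compare their combined estimated duration with the Maximum Check Duration (60 minutes by default). The following Python sketch does this with the estimates from the table above. It assumes that the selected checks run one after another so that their durations add up, which this topic does not state, and the helper names estimate_minutes and fits_budget are hypothetical. Because the estimates are based on a two-machine setup, treat the result as a rough lower bound for larger jobs.

```python
# Estimated minutes per check item, copied from the appendix table
# (two-machine setup, for reference only).
ESTIMATED_MINUTES = {
    "GPU GEMM": 1,
    "GPU Kernel Launch": 1,
    "All-Reduce": 5,
    "All-to-All": 5,
    "All-Gather": 5,
    "Multi-All-Reduce": 5,
    "Network Connectivity": 2,
    "MatMul/All-Reduce Overlap": 1,
    "Mini GPT": 1,
    "Megatron GPT": 5,
    "ResNet": 2,
}


def estimate_minutes(selected):
    """Sum the estimated durations, assuming the checks run sequentially."""
    unknown = [item for item in selected if item not in ESTIMATED_MINUTES]
    if unknown:
        raise KeyError(f"no estimate for check items: {unknown}")
    return sum(ESTIMATED_MINUTES[item] for item in selected)


def fits_budget(selected, max_check_minutes=60):
    """Return True if the estimate is within the Maximum Check Duration."""
    return estimate_minutes(selected) <= max_check_minutes


default_items = ["GPU GEMM", "All-Reduce"]
print(estimate_minutes(default_items))        # 6
print(estimate_minutes(ESTIMATED_MINUTES))    # 33 for every item in the table
print(fits_budget(ESTIMATED_MINUTES))         # True with the 60-minute default
```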