Enable sanity check when you create a DLC job - Platform for AI

This topic describes how to use the sanity check feature provided by Deep Learning Containers (DLC).

Overview

You may encounter the following issues when you run a DLC job in Platform for AI (PAI):

The job fails because of a resource failure after it loads model checkpoints or performs other initialization operations. You must troubleshoot the failure before you resubmit the job, which wastes GPU resources.
Slow nodes degrade training performance while the job is running, but the cause is hard to locate quickly. It is also hard to test the GPU computing power and communication performance of instances in the resource group because no convenient and reliable benchmark is available.

To address these issues, DLC provides the sanity check feature, which checks the health status and performance of the computing resources used to run distributed training jobs. You can enable sanity check when you create a DLC job. The system checks the resources related to the training job, automatically isolates faulty nodes, and triggers an automated O&M process in the background. This reduces failures in the early stage of a training job and increases the likelihood that the job succeeds. After the sanity check is complete, the system generates a test report on the computing power and communication performance of the related GPUs. You can use the report to identify and locate potential risks that may degrade training performance and handle them efficiently.

Limits

You can enable sanity check only for DLC jobs that run on intelligent computing LINGJUN resources in the China (Ulanqab) region.
You can enable sanity check only for PyTorch jobs that use at least one GPU.
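
The limits above can be expressed as a simple pre-submission check. The following is a minimal sketch, not a PAI API: the function name, parameter names, and the string labels for the framework, resource type, and region are hypothetical and chosen only to mirror the wording of the limits.

```python
def can_enable_sanity_check(framework: str, gpu_count: int,
                            resource_type: str, region: str) -> bool:
    """Return True if a job meets the documented limits for sanity check.

    All labels below are illustrative assumptions, not PAI identifiers.
    """
    return (
        framework == "PyTorch"              # only PyTorch jobs are supported
        and gpu_count >= 1                  # the job must use at least one GPU
        and resource_type == "LINGJUN"      # intelligent computing LINGJUN resources
        and region == "China (Ulanqab)"     # only this region is supported
    )


if __name__ == "__main__":
    print(can_enable_sanity_check("PyTorch", 8, "LINGJUN", "China (Ulanqab)"))
    print(can_enable_sanity_check("PyTorch", 0, "LINGJUN", "China (Ulanqab)"))
```

A check like this could run in a submission script to catch an unsupported configuration before the job is created.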

Enable sanity check

Enable sanity check in the PAI console

When you create a DLC job in the PAI console, you can enable Sanity Check in the Resource Configuration section and configure the related parameters. For more information, see Submit training jobs. After you enable sanity check and submit a training job, the system takes some time to check the health status and availability of the resources and then provides a check report.

The following table describes key parameters.

Parameter	Description
Check Time	The time at which the sanity check runs. Before the Job Runs (default): After the job obtains the resources, the system checks the health status of the computing power and then runs the job. After Fault Tolerance Occurs: After the system restarts a failed job, the system runs the sanity check first. Note: This option is available only if you enable the Automatic Fault Tolerance feature.
Maximum Check Duration	The maximum duration for which a sanity check can run. Default value: 30 minutes. If the sanity check runs longer than this duration, the action specified by Timeout Action is triggered.
Timeout Action	The action that the system takes after a sanity check times out. Stop Job (default): The system stops the job. The status of the job changes to Check Failed. Suspend Job: The system suspends the job. The job remains in the Checking state and waits for manual intervention or system instructions on the next operation.
Other Settings	This parameter is empty by default.
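
The interaction between Maximum Check Duration and Timeout Action can be illustrated with a short sketch. This is a hypothetical model for illustration only: the class, field, and function names are assumptions and are not part of any PAI SDK. The default values and status names are taken from the table above.

```python
from dataclasses import dataclass

# Hypothetical constants that mirror the console labels.
STOP_JOB = "Stop Job"
SUSPEND_JOB = "Suspend Job"


@dataclass
class SanityCheckSettings:
    """Illustrative container for the console parameters and their defaults."""
    check_time: str = "Before the Job Runs"
    max_check_minutes: int = 30
    timeout_action: str = STOP_JOB


def status_while_checking(elapsed_minutes: float,
                          settings: SanityCheckSettings) -> str:
    """Return the job status for a check that has not finished yet."""
    if elapsed_minutes <= settings.max_check_minutes:
        return "Checking"        # still within the allowed duration
    if settings.timeout_action == STOP_JOB:
        return "Check Failed"    # the job is stopped after the timeout
    if settings.timeout_action == SUSPEND_JOB:
        return "Checking"        # the job is suspended and waits for intervention
    raise ValueError(f"unknown timeout action: {settings.timeout_action!r}")


if __name__ == "__main__":
    print(status_while_checking(45, SanityCheckSettings()))
    print(status_while_checking(45, SanityCheckSettings(timeout_action=SUSPEND_JOB)))
```

With the default settings, a check that is still running after 45 minutes ends in Check Failed, whereas with Suspend Job the job stays in Checking until someone intervenes.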

View the check results

Sanity check status

The DLC job may be in one of the following statuses during a sanity check:

Checking: The sanity check on computing power is in progress.
Check Failed: The sanity check detected issues, or the check timed out and Timeout Action is set to Stop Job.
Check Passed: The job passed the sanity check and then enters the Running state.

View the results of a sanity check

View the results in the PAI console

On the Events tab of the DLC job details page, click Sanity Check to view the check results.

Configure an event rule

To receive a notification when a job fails the sanity check, create an event notification rule on the Events tab of a PAI workspace. Set Event Type to DLC Job and Automatic Fault Tolerance. For more information about other parameters, see Create a notification rule.

Note

For more information about configuring a notification, see Workspace notification.