Enable fault tolerance monitoring when you create a DLC job - Platform For AI

This topic describes how to use the AIMaster-based fault tolerance monitoring feature provided by DLC.

Background

Deep learning is widely used. As models and datasets become larger, distributed training has become a common practice. However, an increase in the number of job instances also increases the likelihood of software and hardware exceptions, which can cause the job to fail.

To ensure the stable operation of large-scale, distributed deep learning jobs, DLC provides an AIMaster-based fault tolerance monitoring feature. AIMaster is a job-level component. When enabled, an AIMaster instance runs alongside the other job instances to perform job monitoring, fault tolerance assessment, and resource control.

Limitations

AIMaster currently supports the following frameworks: PyTorch, MPI, TensorFlow, and ElasticBatch.

Step 1: Configure fault tolerance parameters

This section describes all parameters for the fault tolerance monitoring feature. You can use the sample configurations to help plan your settings. When you enable the feature, you can specify these parameters in the Other Configuration section.

Parameters

Category	Feature	Parameter	Description	Default
General configuration	Job execution mode	--job-execution-mode	Specifies the execution mode of the job. Valid values: Sync: Synchronous job. Async: Asynchronous job. The fault tolerance behavior for a retriable error varies by job type: For a synchronous job, the entire job is restarted. For an asynchronous job, worker instances are independent. Only the failed instance is restarted, and other instances are not affected.	Sync
	Job restart settings	--enable-job-restart	Specifies whether the job restarts when a fault tolerance condition is met or a runtime exception is encountered. Valid values: False: The job does not restart. True: The job restarts.	False
	Job restart settings	--max-num-of-job-restart	Specifies the maximum number of times the job can restart. If this number is exceeded, the job is marked as failed.	3
Runtime configuration (Note: Applies to scenarios where no instance fails.)	Job hang detection	--enable-job-hang-detection	Specifies whether to enable hang detection for a running job. This feature supports only synchronous jobs. Valid values: False: Disables the feature. True: Enables the feature. If the stdout and stderr logs of all instances are not updated within the specified time, a job restart is triggered.	False
		--job-hang-interval	Specifies the duration in seconds that a job can be inactive before it is considered to hang. The value must be a positive integer. If the inactive duration exceeds this value, the job is marked as abnormal and a restart is triggered.	1800
		--enable-c4d-hang-detection	Specifies whether to enable C4D (Calibrating Collective Communication over Converged ethernet - Diagnosis) detection. This feature helps you quickly diagnose and locate slow nodes and faulty nodes that cause a job to hang. Note: This parameter takes effect only when --enable-job-hang-detection is enabled.	False
	Hang detection on job exit	--enable-job-exit-hang-detection	Specifies whether to enable hang detection when a job is about to exit. This feature supports only synchronous jobs. Valid values: False: Disables the feature. True: Enables the feature. If a job does not exit within the specified period after one of its instances completes, a job restart is triggered.	False
	Hang detection on job exit	--job-exit-hang-interval	Specifies the duration in seconds that a job can be inactive during its exit process. The value must be a positive integer. If the inactive duration on exit exceeds this value, the job is marked as abnormal and a restart is triggered.	600
Fault tolerance configuration (Note: Applies to scenarios where an instance fails.)	Fault tolerance policy	--fault-tolerant-policy	Specifies the fault tolerance policy. Valid values: OnFailure: When a job exception occurs: For an asynchronous job, the failed instance is unconditionally restarted. For a synchronous job, the job is unconditionally restarted. ExitCodeAndErrorMsg: When a job exception occurs, the system evaluates the exit code and error message of the failed instance. For more information, see Step 3: Configure advanced features. If the retry conditions are met: For an asynchronous job, the failed instance is restarted. For a synchronous job, the job is restarted. Never: No action is taken. The job is marked as failed.	ExitCodeAndErrorMsg
	Maximum occurrences of the same error	--max-num-of-same-error	Specifies the maximum number of times the same error can occur on a single instance. If the error count exceeds this value, the job is marked as failed.	10
	Maximum tolerated failure rate	--max-tolerated-failure-rate	Sets the maximum tolerated failure rate. If the percentage of failed instances exceeds this value, the job is marked as failed. The default value -1 indicates that this feature is disabled. Example: If you set this parameter to 0.3, the job is marked as failed if more than 30% of the worker instances fail.	-1
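
The --max-tolerated-failure-rate rule can be read as a simple ratio check. The following sketch only illustrates that reading and is not AIMaster's implementation: the function name exceeds_failure_rate is hypothetical, and it assumes the rate is compared with the number of failed worker instances divided by the total number of worker instances.

```python
def exceeds_failure_rate(failed_workers: int, total_workers: int,
                         max_tolerated_failure_rate: float = -1) -> bool:
    """Return True if the job would be marked as failed under the rule above."""
    # -1 is the documented default and disables the check.
    if max_tolerated_failure_rate == -1:
        return False
    if total_workers <= 0:
        raise ValueError("total_workers must be positive")
    # The job fails only when the failed share is strictly above the limit.
    return failed_workers / total_workers > max_tolerated_failure_rate


# With a rate of 0.3 and 10 workers, 3 failures (30%) are tolerated, 4 are not.
print(exceeds_failure_rate(3, 10, 0.3))  # False
print(exceeds_failure_rate(4, 10, 0.3))  # True
```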

Sample configurations

The following examples show common parameter configurations for different training jobs.

Synchronous training job (common in PyTorch jobs)
The job restarts if an instance encounters an exception and the fault tolerance conditions are met.
```
--job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
```
Asynchronous training job (common in TensorFlow jobs)
For a retriable error, only the failed worker instance is restarted. By default, the job does not restart if a PS or Chief instance fails. To enable job restart, set --enable-job-restart to True.
```
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
```
Offline inference job (common in ElasticBatch jobs)
Instances are independent, which is similar to an asynchronous job. If an instance fails, only that instance is restarted.
```
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
```
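
If you generate these strings from a script, a small helper can keep the flags readable. The following is a minimal sketch: build_monitoring_args is a hypothetical helper name, not part of any SDK, and it simply joins the parameter names and values from the table above in the --name=value form used by the samples.

```python
def build_monitoring_args(options: dict) -> str:
    """Join option names and values into a single '--name=value' string."""
    parts = []
    for name, value in options.items():
        # The parameter table writes Booleans as True/False, which is also
        # how Python renders them with str().
        parts.append(f"--{name}={value}")
    return " ".join(parts)


# Reproduces the synchronous training sample shown above.
sync_args = build_monitoring_args({
    "job-execution-mode": "Sync",
    "enable-job-restart": True,
    "max-num-of-job-restart": 3,
    "fault-tolerant-policy": "ExitCodeAndErrorMsg",
})
print(sync_args)
```

The resulting string has the same form as the samples above, so it can be pasted into the console text box or passed as the monitoring arguments when you submit a job with an SDK.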

Step 2: Enable fault tolerance monitoring

You can enable the fault tolerance monitoring feature in the DLC console or by using an SDK when you submit a training job.

Enable in the console

When you submit a DLC training job in the console, you can turn on the Automatic Fault Tolerance switch in the Fault Tolerance and Diagnosis section and configure other parameters. For more information, see Create a training job. Enabling this feature starts an AIMaster role to monitor the job throughout its lifecycle and perform fault tolerance actions when errors occur.

The following items describe the settings:

You can specify other parameters in the Other Configuration text box. For more information about the parameters, see Step 1: Configure fault tolerance parameters.
After you enable Hang Detection, you can enable C4D Detection. C4D (Calibrating Collective Communication over Converged ethernet - Diagnosis) is a proprietary diagnostic tool developed by Alibaba Cloud to identify issues such as slow performance or hangs in large model training jobs. For more information, see Use C4D.
Note
- C4D depends on ACCL, a high-performance collective communication library developed by Alibaba Cloud. Make sure that ACCL is installed. For more information, see ACCL: Alibaba Cloud high-performance collective communication library.
- Currently, you can use the C4D Detection feature for DLC jobs that run on Lingjun AI Computing Service.
After you enable Hang Detection, you can use the call stack snapshot analysis tool to locate the specific line of code where a job hangs. This requires dedicated configuration for the hang detection threshold. For more information, see Use the call stack snapshot analysis tool.

Enable using the DLC SDK

Use the Go SDK

Enable the fault tolerance monitoring feature when you submit a job by using the Go SDK.

createJobRequest := &client.CreateJobRequest{}
settings := &client.JobSettings{
    EnableErrorMonitoringInAIMaster: tea.Bool(true),
    ErrorMonitoringArgs: tea.String("--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=3600"),
}
createJobRequest.SetSettings(settings)

Parameters:

EnableErrorMonitoringInAIMaster: Specifies whether to enable the fault tolerance monitoring feature.
ErrorMonitoringArgs: Specifies other parameters for fault tolerance monitoring.

Use the Python SDK

Enable the fault tolerance monitoring feature when you submit a job by using the Python SDK.

from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings

settings = JobSettings(
    enable_error_monitoring_in_aimaster = True,
    error_monitoring_args = "--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=30"
)
create_job_req = CreateJobRequest(
    ...
    settings = settings,
)

Parameters:

enable_error_monitoring_in_aimaster: Specifies whether to enable the fault tolerance monitoring feature.
error_monitoring_args: Specifies other parameters for fault tolerance monitoring.

Step 3: Configure advanced features

You can use the following advanced features to customize fault tolerance monitoring based on the requirements of your job.

Configure fault tolerance notifications

After you enable fault tolerance monitoring for a job, you can configure notifications for fault tolerance events. On the Workspace Details page, choose Configure Workspace > Configure Event Notification, click Create Event Rule, and set the event type to DLC task > Automatic Fault Tolerance. For more information, see Workspace Event Center.

If a training job encounters an exception, such as a NaN loss value, you can use the AIMaster SDK in your code to send a custom notification message.

Note

To use this feature, you must install the AIMaster wheel package. For more information, see FAQ.

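import math

from aimaster import job_monitor as jm

job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())

...

# NaN never compares equal to any value, so test for it with math.isnan().
# This assumes that loss is a Python float (for a tensor, use loss.item()).
if math.isnan(loss) and rank == 0:
    st = job_monitor_client.send_custom_message(content="The training loss for the job is NaN.")
    if not st.ok():
        print('failed to send message, error %s' % st.to_string())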

Configure custom fault tolerance keywords

The fault tolerance monitoring feature includes a built-in monitoring module for common retriable errors. If you want fault tolerance actions to be triggered when specific keywords appear in the logs of a failed instance, you can configure the keywords in your code. Once configured, the fault tolerance monitoring module scans the tail-end logs of the failed instance for matching keywords.

Note

The fault tolerance policy must be set to ExitCodeAndErrorMsg.

Example of configuring custom fault tolerance keywords for a PyTorch job

from aimaster import job_monitor as jm

jm_config_params = {}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

The monitor.set_retryable_errors function sets the custom fault tolerance keywords.

Example of configuring custom fault tolerance keywords for a TensorFlow job

from aimaster import job_monitor as jm

jm_config_params = {}
jm_config = jm.TFConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
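
To make the matching step concrete, the following sketch shows one way that keyword scanning over the tail of a log could work. It is an illustration only and not the AIMaster implementation: the function name find_retryable_keyword, the plain substring match, the 100-line tail, and the sample log text are all assumptions.

```python
def find_retryable_keyword(log_text, keywords, tail_lines=100):
    """Return the first keyword found in the last `tail_lines` lines, else None."""
    # Only the end of the log is inspected, mirroring the "tail-end logs" wording.
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    for keyword in keywords:
        if keyword in tail:
            return keyword
    return None


# Hypothetical log content, for illustration only.
sample_log = "step 10 ok\nstep 11 ok\nRuntimeError: connect timeout\n"
keywords = ["connect timeout", "error_yyy", "error_zzz"]
print(find_retryable_keyword(sample_log, keywords))  # connect timeout
```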

Configure staged job hang detection

By default, hang detection settings apply to the entire job lifecycle. However, jobs run in stages. For example, during the initialization stage, nodes may take a long time to establish communication, whereas during the training stage, logs are updated more frequently. To quickly detect hung nodes during the training process, DLC provides a staged job hang detection feature. This feature allows you to configure different hang detection intervals for different training stages. The following code provides an example.

monitor.reset_config(**jm_config_params)

# Example:
#     monitor.reset_config(job_hang_interval=10)
#     or
#     config_params = {"job_hang_interval": 10, }
#     monitor.reset_config(**config_params)

The following code provides an example of how to configure staged job hang detection for a PyTorch job.

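import functools

import torch
import torch.distributed as dist
from aimaster import job_monitor as jm

GLOBAL_HANG_INTERVAL = 1800  # Global 30-minute detection.

jm_config_params = {
    "job_hang_interval": GLOBAL_HANG_INTERVAL
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)

dist.init_process_group('nccl')

...

# Define these two helper functions on top of the AIMaster SDK.
# Afterwards, you only need to add a decorator to your own functions.
def reset_hang_detect(hang_seconds):
    jm_config_params = {
        "job_hang_interval": hang_seconds
    }
    monitor.reset_config(**jm_config_params)

# A minimal sketch of the decorator: apply the interval while the decorated
# function runs, then restore the global value when it returns.
def hang_detect(interval):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            reset_hang_detect(interval)
            try:
                return func(*args, **kwargs)
            finally:
                reset_hang_detect(GLOBAL_HANG_INTERVAL)
        return wrapper
    return decorator

@hang_detect(180) # Reset hang detection to 3 minutes, for this function scope only.
def train(epoch):
    ...

@hang_detect(-1) # Temporarily disable hang detection, for this function scope only.
def test(epoch):
    ...

for epoch in range(0, 100):
    train(epoch)
    test(epoch)
    scheduler.step()  # The scheduler is created in the setup code elided above.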
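
The same scoping can also be expressed as a context manager, which is convenient when a stage is a block of code rather than a whole function. The following is a minimal sketch under stated assumptions: StubMonitor is a hypothetical stand-in that only records calls so that the example runs without the AIMaster SDK, hang_interval is a hypothetical helper name, and the only SDK call it mirrors is reset_config(job_hang_interval=...) from the example above.

```python
from contextlib import contextmanager


class StubMonitor:
    """Hypothetical stand-in for jm.Monitor that only records reset_config calls."""

    def __init__(self):
        self.calls = []

    def reset_config(self, **params):
        self.calls.append(params)


@contextmanager
def hang_interval(monitor, seconds, default_seconds=1800):
    """Apply a hang interval inside the with-block, then restore the default."""
    monitor.reset_config(job_hang_interval=seconds)
    try:
        yield
    finally:
        # Restore the job-wide interval even if the block raises.
        monitor.reset_config(job_hang_interval=default_seconds)


monitor = StubMonitor()
with hang_interval(monitor, 180):
    pass  # The training stage would run here.
print(monitor.calls)
```

In a real job, you would pass the jm.Monitor instance created earlier instead of the stub.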

Use C4D

C4D (Calibrating Collective Communication over Converged ethernet - Diagnosis) is a proprietary tool developed by Alibaba Cloud to diagnose issues such as slow performance or hangs in large model training jobs. C4D depends on ACCL, a high-performance collective communication library developed by Alibaba Cloud. Make sure that ACCL is installed and the environment variables are correctly configured. For more information, see ACCL: Alibaba Cloud high-performance collective communication library. Currently, you can use the C4D detection feature for DLC jobs that run on Lingjun AI Computing Service.

How it works

C4D aggregates status information from all nodes in a job to determine whether any node has issues in the communication layer or outside the communication layer. The following figure shows the system architecture.

Parameters

After you enable C4D Detection, you can configure the following parameters in the Other Configuration text box.

Parameter	Description	Example
--c4d-log-level	Sets the C4D output log level. Valid values: Info, Warning (default), and Error. The default value is Warning, which outputs logs at the Warning and Error levels. We recommend that you use the default value for normal operations. To troubleshoot performance issues, you can set the value to Info.	--c4d-log-level=Info
--c4d-common-envs	Sets environment variables for C4D execution. Use the format k1=v1,k2=v2 and separate multiple variables with a comma (,). By default, this parameter is empty. The following environment variables are available: C4D_HANG_TIMEOUT: The duration in microseconds that a job can hang before a Warning is triggered. Default value: 10000000 (10 seconds). C4D_HANG_TIMES: The number of hang occurrences that trigger an Error log and the automatic node isolation logic. This variable is used with C4D_HANG_TIMEOUT. Default value: 18 (by default, a hang of 3 minutes triggers automatic node isolation). C4D_CONN_BW_CHECK_PERIOD: The interval for bandwidth checks. Default value: 10 seconds. C4D_RUNTIME_LOG_LEVEL: The C4D runtime log level. Valid values: TRACE, DEBUG, INFO (default), WARNING, ERROR, and FATAL. C4D_ENABLE_STATS_OUTPUT: Specifies whether to output C4D-related statistics. Valid values: TRUE and FALSE (default).	--c4d-common-envs=C4D_HANG_TIMEOUT=1,C4D_HANG_TIMES=2
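
The defaults above imply that the delay before automatic node isolation is C4D_HANG_TIMEOUT multiplied by C4D_HANG_TIMES (10 seconds × 18 = 3 minutes). The following sketch works through that arithmetic and formats a --c4d-common-envs value. Both function names are hypothetical, and the 5-second and 12-occurrence values are illustrative, not recommendations.

```python
def c4d_isolation_delay_seconds(hang_timeout_us=10_000_000, hang_times=18):
    """Hang duration, in seconds, after which an Error log would be raised."""
    # C4D_HANG_TIMEOUT is given in microseconds.
    return hang_timeout_us * hang_times / 1_000_000


def c4d_common_envs(envs):
    """Format environment variables as --c4d-common-envs=k1=v1,k2=v2."""
    return "--c4d-common-envs=" + ",".join(f"{k}={v}" for k, v in envs.items())


print(c4d_isolation_delay_seconds())  # 180.0 (the documented 3-minute default)

# Illustrative values: warn after 5 seconds, isolate after 12 occurrences (60 s).
custom = {"C4D_HANG_TIMEOUT": 5_000_000, "C4D_HANG_TIMES": 12}
print(c4d_common_envs(custom))
print(c4d_isolation_delay_seconds(5_000_000, 12))  # 60.0
```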

For Error-level logs, AIMaster automatically isolates the corresponding node and restarts the job. The following table describes the handling logic for each log level.

Log level	Description	Action
Error	By default, a communication-layer hang that exceeds three minutes causes the job to fail. You can modify this default by configuring the C4D_HANG_TIMEOUT and C4D_HANG_TIMES parameters.	AIMaster automatically isolates the node that is reported in the log.
Warning	By default, a communication-layer hang that exceeds 10 seconds affects performance but does not cause the job to fail. You can modify this default by configuring the C4D_HANG_TIMEOUT parameter.	The node that is reported in the log is not automatically isolated and requires manual confirmation.
Warning	A non-communication-layer hang that exceeds 10 seconds may cause the job to fail.	The node that is reported in the log is not automatically isolated and requires manual confirmation.
Info	Communication-layer slowness and non-communication-layer slowness.	These diagnostic logs are primarily for performance issues and require manual confirmation.

If you find that a DLC job runs slowly or hangs, go to the DLC job list and click the job name to go to the job details page. In the Instance section, view the AIMaster node log to see the C4D diagnostic results. For more information about the diagnostic results, see Diagnostic results.

Diagnostic results

RankCommHang: Indicates that a node has a hang in the communication layer.
RankNonCommHang: Indicates that a node has a hang outside the communication layer, for example, in the computing process.
RankCommSlow: Indicates that a node has slow performance in the communication layer.
RankNonCommSlow: Indicates that a node has slow performance outside the communication layer.

Use the call stack snapshot analysis tool

Job hangs are a common type of failure in large model training. A typical example is an NCCL hang, which generates a "Watchdog caught collective operation timeout" log when the job fails. To help you quickly identify the root cause of job hangs, we developed a call stack snapshot analysis tool. Follow these steps to use the tool:

Step 1: Install pystack or py-spy

Check whether pystack or py-spy is installed in your container image. If not, you must install one of them. The following command provides an example on how to install pystack.

pip install pystack -i https://mirrors.cloud.aliyuncs.com/pypi/simple/ --trusted-host mirrors.cloud.aliyuncs.com

Step 2: Enable the hang detection switch

For information about how to enable the switch, see Enable in the console. After you enable the Hang Detection switch, you must configure an appropriate value for the hang detection threshold to use the call stack snapshot analysis tool. First, check the timeout period of your job. You can usually find this information in the error log that is generated after the job hangs. The following code provides a log sample.

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2143, OpType=ALLREDUCE, NumelIn=659, NumelOut=659, Timeout(ms)=600000) ran for 600535 milliseconds before timing out

The Timeout field in this error log is 600000 milliseconds, which means that the timeout period for the job is 600 seconds (10 minutes). In this case, we recommend that you set the hang detection threshold to 450 seconds. If the Timeout value in the error log is 1800 seconds, we recommend that you set the hang detection threshold to 1500 seconds. As a rule of thumb, set the hang detection threshold about 150 to 300 seconds below the Timeout value, as in these two examples, so that AIMaster detects the hang before the job times out.
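
The following sketch shows how the timeout could be read from such a log line and turned into a candidate threshold. It is an illustration only: both function names are hypothetical, and the 150-second default margin is an assumption taken from the 600-second example (the 1800-second example corresponds to a margin of 300 seconds).

```python
import re


def nccl_timeout_seconds(log_line):
    """Extract the Timeout(ms) value from a watchdog log line, in seconds."""
    match = re.search(r"Timeout\(ms\)=(\d+)", log_line)
    return int(match.group(1)) / 1000 if match else None


def suggested_hang_interval(timeout_seconds, margin_seconds=150):
    """Candidate hang detection threshold: the timeout minus a safety margin."""
    return int(timeout_seconds - margin_seconds)


log_line = (
    "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2143, "
    "OpType=ALLREDUCE, NumelIn=659, NumelOut=659, Timeout(ms)=600000) "
    "ran for 600535 milliseconds before timing out"
)
timeout = nccl_timeout_seconds(log_line)
print(timeout, suggested_hang_interval(timeout))  # 600.0 450
print(suggested_hang_interval(1800, margin_seconds=300))  # 1500
```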

After you configure the hang detection feature as described in the preceding steps, AIMaster automatically collects and analyzes the call stack of the job process when a hang occurs. You can view the analysis result in the AIMaster node log. The following figure shows a sample analysis result of a call stack that is generated after a job hangs.

In the analysis result, the stack field shows the specific call stack, the threads field shows the associated threads, and the count field shows the number of threads that have the same call stack. Stacks with a count of 1 are highly likely to be the cause of the hang and require close inspection.
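
The idea behind the count field can be sketched as grouping threads by identical call stack and flagging the stacks that appear only once. This is not the tool's implementation: the function names, the thread names, and the stack frames below are hypothetical and serve only to illustrate the stack, threads, and count fields.

```python
from collections import defaultdict


def group_stacks(thread_stacks):
    """Group thread names by identical call stack."""
    groups = defaultdict(list)
    for thread, stack in thread_stacks.items():
        groups[tuple(stack)].append(thread)
    return [
        {"stack": list(stack), "threads": threads, "count": len(threads)}
        for stack, threads in groups.items()
    ]


def suspects(groups):
    """Stacks seen on exactly one thread are the first candidates to inspect."""
    return [group for group in groups if group["count"] == 1]


# Hypothetical snapshot: three ranks wait in a collective, one is elsewhere.
thread_stacks = {
    "rank0": ["train", "all_reduce"],
    "rank1": ["train", "all_reduce"],
    "rank2": ["train", "all_reduce"],
    "rank3": ["train", "load_batch"],
}
for group in suspects(group_stacks(thread_stacks)):
    print(group["threads"], group["stack"])  # ['rank3'] ['train', 'load_batch']
```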

Step 4: View restart reasons

View restart attempts: Job restart information is organized by attempt. On the Job Overview page, you can click to expand the details of an attempt to view information such as the time consumed in each stage. This helps you better understand the job's status.
View restart history: You can click the number of restarts or the Restart records tab to view restart information, including the restart reason, restart result, and time consumed by the restart.
Perform the following steps:
- In the Restart records list, click Description to view detailed information about a restart, including the Restarts, Restart Time, Node Name, Instance Name, Error Code, Error Message, and Error Source.
- Click View Aggregation Fault Details to expand the details of all restart records.

FAQ

Q: How do I install the AIMaster SDK?

Run the following command to install the corresponding wheel package based on your Python version.

# Python 3.6
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl

# Python 3.8
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl

# Python 3.10
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
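
If you build images for several Python versions, you can select the wheel programmatically. The following is a minimal sketch: aimaster_wheel_url is a hypothetical helper name, and it covers only the three wheels listed above.

```python
import sys

BASE_URL = "http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/"

# Wheel file names copied from the commands above, keyed by (major, minor).
WHEELS = {
    (3, 6): "pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl",
    (3, 8): "pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl",
    (3, 10): "pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl",
}


def aimaster_wheel_url(version=None):
    """Return the wheel URL for a (major, minor) Python version."""
    key = tuple(version) if version is not None else tuple(sys.version_info[:2])
    if key not in WHEELS:
        raise ValueError(f"no AIMaster wheel listed for Python {key[0]}.{key[1]}")
    return BASE_URL + WHEELS[key]


print(aimaster_wheel_url((3, 8)))
```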