Deep Learning Container (DLC) provides a fault tolerance monitoring feature powered by AIMaster. This topic describes how to enable and configure the feature.
Background information
As the scale of models and data increases, deep learning jobs are increasingly run in a distributed manner. As the number of instances that run a job grows, occasional exceptions in the underlying software stack or hardware environment are more likely to terminate the job.
To ensure that large-scale distributed deep learning jobs run reliably, DLC provides a fault tolerance monitoring feature powered by AIMaster. AIMaster is a job-level component. When you use AIMaster for a DLC job, an AIMaster instance is launched to run concurrently with the other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.
Step 1: Configure additional parameters of fault tolerance monitoring
Refer to the following sections to determine the additional parameters that you want to configure. When you enable the fault tolerance monitoring feature in the subsequent step, you can specify these additional parameters in the Other Configuration field based on your business requirements.
Parameter description
Category | Configuration | Parameter | Description | Default value |
General configuration | Job type | --job-execution-mode | The type of the DLC job. Valid values: Sync and Async. The fault tolerance policy varies based on the job type. For example, when fault tolerance is triggered, a synchronous job is restarted as a whole, whereas only the failed instance of an asynchronous job is restarted. | Sync |
General configuration | Job restart | --enable-job-restart | Specifies whether to restart the job when a fault tolerance condition is triggered or a runtime exception is detected. Valid values: True and False. | False |
General configuration | Job restart | --max-num-of-job-restart | The maximum number of attempts to restart the job. If this limit is exceeded, the job is reported as failed. | 3 |
Runtime configuration (Note: this configuration takes effect only if all instances of the job run as expected.) | Hang detection for running jobs | --enable-job-hang-detection | Specifies whether to enable hang detection while the job is running. This parameter is valid only for synchronous jobs. Valid values: True and False. | False |
Runtime configuration | Hang detection for running jobs | --job-hang-interval | The maximum duration for which the job can be non-responsive. Valid values: positive integers. Unit: seconds. If the job remains non-responsive for longer than this value, the job is reported as hung and is restarted. | 1800 |
Runtime configuration | Hang detection for running jobs | --enable-c4d-hang-detection | Specifies whether to enable the C4D detection feature, which quickly diagnoses and identifies slow nodes during job running and faulty nodes that cause job hang issues. Note: this parameter takes effect only when it is used together with the --enable-job-hang-detection parameter. | |
Runtime configuration | Hang detection for exiting jobs | --enable-job-exit-hang-detection | Specifies whether to enable hang detection while the job is exiting. This parameter is valid only for synchronous jobs. Valid values: True and False. | False |
Runtime configuration | Hang detection for exiting jobs | --job-exit-hang-interval | The maximum duration for which the job can be non-responsive while it is exiting. Valid values: positive integers. Unit: seconds. If the job fails to exit within this duration, the job is reported as hung and is restarted. | 600 |
Fault tolerance configuration (Note: this configuration takes effect only if an instance of the job fails to run.) | Fault tolerance policy | --fault-tolerant-policy | The policy that determines how fault tolerance is triggered when an instance fails. Valid values include OnFailure and ExitCodeAndErrorMsg, which are used in the sample configurations below. | ExitCodeAndErrorMsg |
Fault tolerance configuration | Maximum error occurrences | --max-num-of-same-error | The maximum number of times the same error can occur on a single instance. If an error occurs more times than this value, the job is reported as failed. | 10 |
Fault tolerance configuration | Maximum fault tolerance rate | --max-tolerated-failure-rate | The maximum percentage of instances that are allowed to fail. If the percentage of failed instances exceeds this value, the job is reported as failed. For example, a value of 0.3 indicates that the job is reported as failed if more than 30% of the job instances fail. A value of -1 indicates that this parameter is disabled. | -1 |
Sample configurations
This section provides examples of common configurations for different types of DLC jobs.
Synchronous training jobs (such as PyTorch jobs)
If a job instance is abnormal and meets the fault tolerance conditions, the job is restarted.
--job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
Asynchronous training jobs (such as TensorFlow jobs)
If a retriable error occurs on a worker instance, only that worker instance is restarted. If an error occurs on a parameter server or chief instance, the job is not restarted by default. To restart the job in this scenario, set the --enable-job-restart parameter to True.
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
Offline inference jobs (such as ElasticBatch jobs)
The instances of an offline inference job are independent of each other, similar to an asynchronous training job. If a job instance is unexpectedly terminated, only that instance is restarted.
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
Step 2: Enable fault tolerance monitoring
When you submit a DLC job, you can enable the fault tolerance monitoring feature by using the PAI console or DLC SDK.
Use the PAI console
When you create a DLC job in the PAI console, you can turn on Automatic Fault Tolerance in the Fault Tolerance and Diagnosis section and configure the parameters. For more information, see Submit training jobs. After you configure the parameters, an AIMaster instance is launched when you run the DLC job. The AIMaster instance monitors the job progress and manages fault tolerance when errors occur.
You can configure additional parameters in the Other Configuration field. For more information, see Step 1: Configure additional parameters of fault tolerance monitoring.
You can turn on Hanging Detection to enable the C4D detection feature. C4D, self-developed by Alibaba Cloud, is a diagnostic tool designed to diagnose and identify slow nodes during large model training or faulty nodes that cause job hang issues.
Note: C4D depends on ACCL, an in-house High Performance Network (HPN) communication library provided by Alibaba Cloud. Make sure that ACCL is installed.
You can enable the C4D detection feature when you select Lingjun Resources while submitting the DLC job.
Use DLC SDK
Use DLC SDK for Go
Sample code:
createJobRequest := &client.CreateJobRequest{}
settings := &client.JobSettings{
    EnableErrorMonitoringInAIMaster: tea.Bool(true),
    ErrorMonitoringArgs:             tea.String("--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=3600"),
}
createJobRequest.SetSettings(settings)
Parameters:
EnableErrorMonitoringInAIMaster: specifies whether to enable the fault tolerance monitoring feature.
ErrorMonitoringArgs: additional parameters that you can configure for the fault tolerance monitoring feature.
Use DLC SDK for Python
Sample code:
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings

settings = JobSettings(
    enable_error_monitoring_in_aimaster=True,
    error_monitoring_args="--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=30"
)
create_job_req = CreateJobRequest(
    ...
    settings=settings,
)
Parameters:
enable_error_monitoring_in_aimaster: specifies whether to enable the fault tolerance monitoring feature.
error_monitoring_args: additional parameters that you can configure for the fault tolerance monitoring feature.
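For reference, the following is a minimal sketch of how the request might be submitted with the DLC Python SDK client. The client initialization (Config fields, endpoint) and the fields other than settings are based on the general Alibaba Cloud SDK pattern rather than on this topic, so treat them as assumptions and adapt them to your environment.
# A minimal sketch. The endpoint, credential handling, and job fields other
# than settings are assumptions; replace them with values for your environment.
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings

client = Client(Config(
    access_key_id='<your-access-key-id>',
    access_key_secret='<your-access-key-secret>',
    endpoint='pai-dlc.cn-hangzhou.aliyuncs.com',  # assumed regional endpoint
))

settings = JobSettings(
    enable_error_monitoring_in_aimaster=True,
    error_monitoring_args='--job-execution-mode=Sync --enable-job-restart=True',
)
create_job_req = CreateJobRequest(
    display_name='fault-tolerant-job',  # other required job fields are omitted here
    settings=settings,
)
response = client.create_job(create_job_req)
print(response.body.job_id)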
Step 3: Configure enhanced fault tolerance monitoring
You can configure the following features of enhanced fault tolerance monitoring based on your business requirements.
Fault tolerance notification
If you want to receive notifications about fault tolerance events, such as a job restart, configure a notification rule by performing the following steps: On the Workspace Details page, choose Workspace notification. On the Configure Workspace page, click Create Event Rule. In the panel that appears, select DLC Jobs > Automatic Fault Tolerance from the Event Type drop-down list.
You can also use AIMaster SDK to configure custom notifications when an error occurs, such as when the loss function returns NaN. To configure custom notifications, you must install the wheel package of AIMaster. For more information, see the "FAQ" section of this topic. Sample code:
import math

from aimaster import job_monitor as jm

job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())
...
# loss and rank come from your training loop. Send a custom notification
# from rank 0 when the loss becomes NaN.
if math.isnan(float(loss)) and rank == 0:
    st = job_monitor_client.send_custom_message(content="The loss function returns NaN")
    if not st.ok():
        print('failed to send message, error %s' % st.to_string())
Custom keywords
The fault tolerance monitoring feature provides a built-in module that can automatically detect common retriable errors. If you want to trigger fault tolerance when error logs carry information that matches specific keywords, you can use the following methods to configure custom keywords. After you configure custom keywords, the module scans the tail logs of failed instances for key information that matches your custom keywords.
You must set the --fault-tolerant-policy parameter to ExitCodeAndErrorMsg.
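For example, when you submit such a job, the Other Configuration field (or the error_monitoring_args parameter in the SDK) might contain the following flags, reusing the synchronous sample from Step 1; the exact combination is illustrative:
--job-execution-mode=Sync --enable-job-restart=True --fault-tolerant-policy=ExitCodeAndErrorMsg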
Sample configuration for PyTorch jobs
from aimaster import job_monitor as jm

jm_config_params = {}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
In the preceding code, the custom keywords are configured by using the monitor.set_retryable_errors function.
Sample configuration for TensorFlow jobs
from aimaster import job_monitor as jm

jm_config_params = {}
jm_config = jm.TFConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
Fine-grained hang detection
By default, hang detection is configured for the entire job runtime. However, you may need different configurations at different runtime stages. For example, during job initialization, it takes a long time to establish the connections between instances, whereas the execution of the job is faster. DLC allows you to configure the hang detection policy based on the current status of the job by using the following methods:
monitor.reset_config(**jm_config_params)
# Example:
# monitor.reset_config(job_hang_interval=10)
# or
# config_params = {"job_hang_interval": 10, }
# monitor.reset_config(**config_params)
Sample configuration for a PyTorch job
import torch
import torch.distributed as dist
from aimaster import job_monitor as jm
jm_config_params = {
"job_hang_interval": 1800 # Configure a global hang detection policy. This value specifies that the maximum duration during which the job can be non-responsive is 30 minutes.
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
dist.init_process_group('nccl')
...
# Helper functions built on monitor.reset_config. hang_detect is a decorator:
# it applies a temporary hang detection interval for the scope of the
# decorated function and then restores the global interval.
def reset_hang_detect(hang_seconds):
    monitor.reset_config(job_hang_interval=hang_seconds)

def hang_detect(interval):
    def decorator(func):
        def wrapper(*args, **kwargs):
            reset_hang_detect(interval)
            try:
                return func(*args, **kwargs)
            finally:
                reset_hang_detect(1800)  # restore the global interval
        return wrapper
    return decorator

@hang_detect(180)  # reset hang detection to 3 minutes, only for the scope of this function
def train(epoch):
    ...

@hang_detect(-1)  # disable hang detection temporarily, only for the scope of this function
def test(epoch):
    ...

for epoch in range(0, 100):
    train(epoch)
    test(epoch)
    scheduler.step()  # learning-rate scheduler defined elsewhere in the training script
C4D
C4D, self-developed by Alibaba Cloud, is a diagnostic tool designed to diagnose and identify slow nodes during large model training or faulty nodes that cause job hang issues. C4D depends on ACCL, an in-house HPN communication library provided by Alibaba Cloud. Make sure that ACCL is installed. You can enable the C4D detection feature when you select Lingjun Resources while submitting the DLC job.
Feature introduction
C4D aggregates the statuses of all nodes in a job, and then analyzes whether a communication or non-communication issue occurs on specific nodes. The following figure shows the system architecture.
Parameter description
After you enable the C4D detection feature, you can configure the parameters in the Other Configuration field. The following table describes the parameters.
Parameter | Description | Example |
--c4d-log-level | The level of the logs generated by C4D. Valid values include Info, Warning, and Error. The default value Warning indicates that logs of the Warning and Error levels are generated. We recommend that you use the default value under normal operating conditions. To troubleshoot performance issues, set the value to Info. | --c4d-log-level=Info |
--c4d-common-envs | The environment variables that are used when C4D runs. | |
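For example, to surface performance diagnostics while troubleshooting, you might combine the C4D parameters with the hang detection flags from Step 1 in the Other Configuration field. The following combination is illustrative:
--enable-job-hang-detection=True --enable-c4d-hang-detection=True --c4d-log-level=Info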
For logs of the Error level, AIMaster automatically isolates the related nodes and restarts the job. The following table describes the processing logic for logs at each level.
Log level | Description | Action |
Error | By default, the job is suspended for more than 3s due to communication issues, and the job may fail. You can modify the default values of the C4D_HANG_TIMEOUT and C4D_HANG_TIMES parameters to adjust this threshold. | AIMaster automatically isolates the nodes reported in the log and restarts the job. |
Warning | By default, the job is suspended for more than 10s due to communication issues. Performance is affected, but the job does not fail. You can modify the default value of the C4D_HANG_TIMEOUT parameter to adjust this threshold. | AIMaster requires manual confirmation before it isolates the nodes reported in the log. |
Warning | The job is suspended for more than 10s due to non-communication issues, and the job may fail. | AIMaster requires manual confirmation before it isolates the nodes reported in the log. |
Info | Slow nodes exist due to communication or non-communication issues. | Logs of this level focus on performance issues. Manual confirmation is required. |
If a slow node exists or the job is suspended while a DLC job is running, you can obtain the C4D diagnosis result: click the name of the job in the job list, and in the Instance section of the Overview tab, view the log of the AIMaster node.
Sample diagnosis results
RankCommHang indicates that a job is suspended due to communication issues.
RankNonCommHang indicates that a job is suspended due to non-communication issues. For example, compute nodes are faulty, causing hang issues.
RankCommSlow indicates that a slow node exists due to communication issues.
RankNonCommSlow indicates that a slow node exists due to non-communication issues.
FAQ
How do I install AIMaster SDK?
Install AIMaster SDK by using the wheel package that matches your Python version.
# Python 3.6
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl
# Python 3.8
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl
# Python 3.10
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
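After installation, you can optionally run a quick sanity check to confirm that the SDK is importable; this step is not required by the topic:
python -c "from aimaster import job_monitor as jm; print(jm.__name__)"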