Deep Learning Container (DLC) provides a fault tolerance monitoring feature that is powered by AIMaster. This topic describes how to enable and configure the feature.
Background information
As models and datasets grow, deep learning tasks are increasingly run in a distributed manner. The more instances a task runs on, the more likely it is that an occasional exception in the underlying software stack or hardware environment terminates the task.
To ensure that large-scale distributed deep learning tasks run reliably, DLC provides a fault tolerance monitoring feature that is powered by AIMaster. AIMaster is a job-level component. When you use AIMaster for a DLC job, an AIMaster instance is launched and runs concurrently with the other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.
Procedure
Step 1: Configure additional parameters of fault tolerance monitoring
Refer to the parameter description and sample configurations to configure additional parameters of fault tolerance monitoring based on your business requirements.
Step 2: Enable and configure fault tolerance monitoring
When you submit a DLC job, enable and configure the fault tolerance monitoring feature by using the Platform for AI (PAI) console or DLC SDK.
Step 3: Configure enhanced fault tolerance monitoring
If the configuration in the previous step cannot meet your business requirements, you can use AIMaster SDK to configure enhanced fault tolerance monitoring and add custom keywords. After you add custom keywords, AIMaster automatically scans the logs of faulty nodes and triggers fault tolerance if the logs contain information that matches your custom keywords. Enhanced fault tolerance monitoring also sends notifications when fault tolerance is triggered.
Step 1: Configure additional parameters of fault tolerance monitoring
Refer to the following sections to determine the additional parameters that you want to configure. When you enable the fault tolerance monitoring feature in the subsequent step, you can specify these additional parameters in the Other Configuration field based on your business requirements.
Parameter description
The following table describes the additional parameters that you can configure for fault tolerance monitoring.
Category | Configuration | Parameter | Description | Default value |
General configuration | Job type | --job-execution-mode | The type of the DLC job. Values used in this topic: Sync (synchronous jobs) and Async (asynchronous jobs). The fault tolerance policy varies based on the job type. For example, a synchronous job is restarted as a whole, whereas only the failed worker instance of an asynchronous job is restarted. | Sync |
General configuration | Job restart | --enable-job-restart | Specifies whether to restart the job when a fault tolerance condition is triggered or a runtime exception is detected. Valid values: True and False. | False |
General configuration | Job restart | --max-num-of-job-restart | The maximum number of attempts to restart the job. If this limit is exceeded, the job is reported as failed. | 3 |
Runtime configuration (Note: This configuration takes effect only if all instances of the job run as expected.) | Hang detection for running jobs | --enable-job-hang-detection | Specifies whether to enable hang detection when the job is running. This parameter is valid only for synchronous jobs. Valid values: True and False. | False |
Runtime configuration | Hang detection for running jobs | --job-hang-interval | The maximum duration during which the job can be non-responsive. Valid values: positive integers. Unit: seconds. If the job remains non-responsive for a duration longer than this value, the job is reported as hung and is restarted. | 1800 |
Runtime configuration | Hang detection for exiting jobs | --enable-job-exit-hang-detection | Specifies whether to enable hang detection when the job is exiting. This parameter is valid only for synchronous jobs. Valid values: True and False. | False |
Runtime configuration | Hang detection for exiting jobs | --job-exit-hang-interval | The maximum duration during which the job can be non-responsive when the job is exiting. Valid values: positive integers. Unit: seconds. If the job fails to exit within this duration, the job is reported as hung and is restarted. | 600 |
Fault tolerance configuration (Note: This configuration takes effect only if an instance of the job fails to run.) | Fault tolerance policy | --fault-tolerant-policy | The fault tolerance policy. Values used in this topic: Never, OnFailure, and ExitCodeAndErrorMsg. | Never |
Fault tolerance configuration | Maximum error occurrences | --max-num-of-same-error | The maximum number of times an error can occur on a single instance. If an error occurs more times than this value, the job is reported as failed. | 10 |
Fault tolerance configuration | Maximum fault tolerance rate | --max-tolerated-failure-rate | The maximum proportion of failed instances. If the proportion of failed instances exceeds this value, the job is reported as failed. A value of -1 disables the parameter. For example, a value of 0.3 indicates that if more than 30% of the job instances fail, the job is reported as failed. | -1 |
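The meaning of --max-tolerated-failure-rate can be read as a simple threshold check. The following Python sketch only illustrates the documented semantics. It is not AIMaster's implementation, and the function name exceeds_failure_rate is hypothetical.

```python
def exceeds_failure_rate(failed: int, total: int, max_rate: float = -1) -> bool:
    """Return True if the share of failed instances is above max_rate.

    Mirrors the documented meaning of --max-tolerated-failure-rate:
    -1 disables the check, and 0.3 means "more than 30% of instances failed".
    """
    if max_rate < 0 or total <= 0:
        # The check is disabled, or there is nothing to measure.
        return False
    return failed / total > max_rate


# With a limit of 0.3, 4 failed instances out of 10 exceed the limit.
print(exceeds_failure_rate(4, 10, 0.3))
```

Under this reading, exactly 30% of instances failing does not exceed a limit of 0.3, because the description says "more than 30%".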
Sample configurations
This section provides examples of common configurations for different types of DLC jobs.
Synchronous training jobs (such as PyTorch jobs)
If a job instance is unexpectedly terminated and the exit code or error log meets the fault tolerance conditions, such as preemption, the job is restarted.
--job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
Asynchronous training jobs (such as TensorFlow jobs)
If a retriable error occurs on a worker instance of the job, the worker instance is restarted. If an error occurs on a parameter server or chief instance, the job is not allowed to restart. To restart the job in the preceding scenario, set the --enable-job-restart parameter to True.
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
Offline inference jobs (such as ElasticBatch jobs)
The instances of an offline inference job are independent from each other, which is similar to an asynchronous job. If a job instance is unexpectedly terminated, only the instance is restarted.
--job-execution-mode=Async --fault-tolerant-policy=OnFailure
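If you generate these configuration strings from code, a small helper can keep them consistent. The following is a minimal sketch, assuming the --name=value form shown in the samples above. build_monitoring_args is a hypothetical helper and is not part of DLC SDK or AIMaster SDK.

```python
def build_monitoring_args(params: dict) -> str:
    """Join parameter names and values into a '--name=value ...' string."""
    parts = []
    for name, value in params.items():
        # Accept names with or without the leading dashes.
        flag = name if name.startswith("--") else "--" + name
        parts.append(f"{flag}={value}")
    return " ".join(parts)


# Reproduces the sample configuration for synchronous training jobs.
sync_args = build_monitoring_args({
    "job-execution-mode": "Sync",
    "enable-job-restart": True,
    "max-num-of-job-restart": 3,
    "fault-tolerant-policy": "ExitCodeAndErrorMsg",
})
print(sync_args)
```

The resulting string can be pasted into the Other Configuration field or passed as the additional parameters when you submit the job by using an SDK.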
Step 2: Enable fault tolerance monitoring
When you submit a DLC job, you can enable the fault tolerance monitoring feature by using the PAI console or DLC SDK.
Use the PAI console
When you create a DLC job in the PAI console, you can turn on Automatic Fault Tolerance in the Resource Configuration section and configure the parameters. For more information, see Submit training jobs. After you configure the parameters, an AIMaster instance is launched when you run the DLC job. The AIMaster instance monitors the job progress and manages fault tolerance when errors occur.
For information about how to configure the Other Configuration field, see the "Step 1: Configure additional parameters of fault tolerance monitoring" section of this topic.
Use DLC SDK
Use DLC SDK for Go
Sample code:
createJobRequest := &client.CreateJobRequest{}
settings := &client.JobSettings{
	EnableErrorMonitoringInAIMaster: tea.Bool(true),
	ErrorMonitoringArgs:             tea.String("--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=3600"),
}
createJobRequest.SetSettings(settings)
Parameters:
EnableErrorMonitoringInAIMaster: specifies whether to enable the fault tolerance monitoring feature.
ErrorMonitoringArgs: additional parameters that you can configure for the fault tolerance monitoring feature.
Use DLC SDK for Python
Sample code:
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings

settings = JobSettings(
    enable_error_monitoring_in_aimaster=True,
    error_monitoring_args="--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=30",
)
create_job_req = CreateJobRequest(
    # ... (other request fields)
    settings=settings,
)
Parameters:
enable_error_monitoring_in_aimaster: specifies whether to enable the fault tolerance monitoring feature.
error_monitoring_args: additional parameters that you can configure for the fault tolerance monitoring feature.
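Because the additional parameters are passed as one free-form string, a typo in a parameter name is easy to miss. The following sketch checks a string against the parameter names listed in Step 1 before you submit the job. It assumes the --name=value form used throughout this topic. parse_monitoring_args and KNOWN_PARAMETERS are hypothetical names and are not part of DLC SDK.

```python
# Parameter names taken from the parameter description table in Step 1.
KNOWN_PARAMETERS = {
    "--job-execution-mode",
    "--enable-job-restart",
    "--max-num-of-job-restart",
    "--enable-job-hang-detection",
    "--job-hang-interval",
    "--enable-job-exit-hang-detection",
    "--job-exit-hang-interval",
    "--fault-tolerant-policy",
    "--max-num-of-same-error",
    "--max-tolerated-failure-rate",
}


def parse_monitoring_args(args: str) -> dict:
    """Split a '--name=value ...' string and reject unknown or malformed items."""
    parsed = {}
    for token in args.split():
        name, sep, value = token.partition("=")
        if not sep or not value:
            raise ValueError(f"expected --name=value, got {token!r}")
        if name not in KNOWN_PARAMETERS:
            raise ValueError(f"unknown parameter: {name}")
        parsed[name] = value
    return parsed


parsed = parse_monitoring_args(
    "--job-execution-mode=Sync --enable-job-restart=True "
    "--enable-job-hang-detection=True --job-hang-interval=30"
)
print(parsed["--job-hang-interval"])
```

The check is purely local. It does not confirm that the service accepts a given value, only that each name matches one documented in this topic.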
Step 3: Configure enhanced fault tolerance monitoring
You can configure the following features of enhanced fault tolerance monitoring based on your business requirements.
Fault tolerance notification
If you want to receive notifications about fault tolerance events, such as a job restart, you can configure a rule for such notifications by performing the following steps: Go to the Workspace Details page, click the Events tab, and then click Create Event Rule. In the panel that appears, select DLC Jobs > Automatic Fault Tolerance from the Event Type drop-down list. For more information, see Workspace notification.
You can also use AIMaster SDK to send custom notifications when an error occurs, such as when the loss function returns NaN. To configure custom notifications, you must install the wheel package of AIMaster. For more information, see the "FAQ" section of this topic. Sample code:
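import math

from aimaster import job_monitor as jm

job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())
...
# NaN never compares equal to itself, so use math.isnan instead of ==.
# loss is expected to be a Python float here, for example loss.item().
if math.isnan(loss) and rank == 0:
    st = job_monitor_client.send_custom_message(content="The loss function returns NaN")
    if not st.ok():
        print('failed to send message, error %s' % st.to_string())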
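An equality comparison cannot detect a NaN loss, because NaN is not equal to any value, including itself. The following minimal sketch shows the check in isolation. should_notify is a hypothetical helper, not part of AIMaster SDK, and it assumes that the loss is a plain Python float.

```python
import math


def should_notify(loss: float, rank: int) -> bool:
    """Notify only from rank 0, and only when the loss is NaN."""
    return rank == 0 and math.isnan(loss)


nan_loss = float("nan")

# An equality test against NaN is always False, even for NaN itself.
print(nan_loss == float("nan"))

# math.isnan is the reliable check for a Python float.
print(should_notify(nan_loss, rank=0))
```

Restricting the notification to rank 0 avoids sending one message per instance when every instance observes the same NaN loss.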
Custom keywords
The fault tolerance monitoring feature provides a built-in module that can automatically detect common retriable errors. If you want to trigger fault tolerance when error logs carry information that matches specific keywords, you can use the following methods to configure custom keywords. After you configure custom keywords, the module scans the tail logs of failed instances for key information that matches your custom keywords.
You must set the --fault-tolerant-policy parameter to ExitCodeAndErrorMsg.
Sample configuration for PyTorch jobs
from aimaster import job_monitor as jm

jm_config_params = {}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
In the preceding code, the custom keywords are configured by using the monitor.set_retryable_errors function.
Sample configuration for TensorFlow jobs
from aimaster import job_monitor as jm

jm_config_params = {}
jm_config = jm.TFConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
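To reason about which keywords to choose, it can help to picture the scan as a search over the last lines of a failed instance's log. The following sketch is only an illustration of that idea. The case-sensitive substring match, the tail_lines default, and the function name find_retryable_keyword are all assumptions, because this topic does not specify how AIMaster matches keywords.

```python
def find_retryable_keyword(log_text, keywords, tail_lines=100):
    """Return the first keyword found in the last `tail_lines` lines, else None."""
    if tail_lines <= 0:
        return None
    # Keep only the tail of the log, which is where the failure usually shows up.
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    for keyword in keywords:
        if keyword in tail:
            return keyword
    return None


sample_log = "step 10 ok\nstep 11 ok\nNCCL error: connect timeout on rank 3\n"
print(find_retryable_keyword(sample_log, ["connect timeout", "error_yyy"]))
```

Under these assumptions, a keyword that appears only early in a long log is not found, so prefer keywords that appear in the final error message.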
Fine-grained hang detection
By default, hang detection is configured for the entire job runtime. However, you may need different configurations at different runtime stages. For example, when a large-scale job is initializing, it takes a long time to establish the connections between instances, whereas the execution of the job is faster. DLC allows you to configure the hang detection policy based on the current status of the job by using the following methods:
monitor.reset_config(**jm_config_params)
# Example:
# monitor.reset_config(job_hang_interval=10)
# or
# config_params = {"job_hang_interval": 10, }
# monitor.reset_config(**config_params)
Sample configuration for a PyTorch job
import functools

import torch
import torch.distributed as dist
from aimaster import job_monitor as jm

jm_config_params = {
    # Global hang detection policy: the job can be non-responsive for at most 30 minutes.
    "job_hang_interval": 1800
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)
dist.init_process_group('nccl')
...

# The following two helpers are implemented on top of AIMaster SDK.
# You only need to add the decorator to your own functions.
def reset_hang_detect(hang_seconds):
    # Apply a new hang detection interval to the running monitor.
    monitor.reset_config(job_hang_interval=hang_seconds)

def hang_detect(interval):
    # Use `interval` while the decorated function runs, then restore the global interval.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            reset_hang_detect(interval)
            try:
                return func(*args, **kwargs)
            finally:
                reset_hang_detect(jm_config_params["job_hang_interval"])
        return wrapper
    return decorator

@hang_detect(180)  # Reset hang detection to 3 minutes, only within the function scope.
def train(epoch):
    ...

@hang_detect(-1)  # Temporarily disable hang detection, only within the function scope.
def test(epoch):
    ...

for epoch in range(0, 100):
    train(epoch)
    test(epoch)
    scheduler.step()  # scheduler is created in the setup code omitted above.
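If you prefer to scope a hang detection interval to a block of code rather than a whole function, the same reset_config call can be wrapped in a context manager. The following is a sketch under stated assumptions: RecordingMonitor is a hypothetical stand-in that only records calls so that the example runs without AIMaster SDK, and hang_interval is a hypothetical helper. Only reset_config(job_hang_interval=...) comes from this topic.

```python
from contextlib import contextmanager


class RecordingMonitor:
    """Hypothetical stand-in for jm.Monitor that only records reset_config calls."""

    def __init__(self):
        self.calls = []

    def reset_config(self, **params):
        self.calls.append(params)


@contextmanager
def hang_interval(monitor, seconds, default_seconds=1800):
    """Use `seconds` inside the with-block, then restore `default_seconds`."""
    monitor.reset_config(job_hang_interval=seconds)
    try:
        yield
    finally:
        # Restore the global interval even if the block raises an exception.
        monitor.reset_config(job_hang_interval=default_seconds)


monitor = RecordingMonitor()
with hang_interval(monitor, 180):
    pass  # A training step would run here.
print(monitor.calls)
```

With the real SDK, you would pass the jm.Monitor instance instead of the stand-in. The try/finally ensures that the global interval is restored when the block exits, including on errors.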
FAQ
How do I install AIMaster SDK?
Install AIMaster SDK by using the wheel package that matches your Python version.
# py36
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl
# py38
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl
# py310
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
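When images with different Python versions share one setup script, the matching wheel can be selected programmatically. The following sketch maps a Python version to one of the three wheel URLs listed above. wheel_url is a hypothetical helper, and versions other than 3.6, 3.8, and 3.10 are rejected because this topic lists no wheel for them.

```python
import sys

BASE_URL = "http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster"

# Wheel file names copied from the pip commands in this FAQ.
WHEELS = {
    (3, 6): "pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl",
    (3, 8): "pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl",
    (3, 10): "pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl",
}


def wheel_url(version=None):
    """Return the wheel URL for a (major, minor) version.

    Defaults to the version of the running interpreter.
    """
    key = tuple(version) if version is not None else tuple(sys.version_info[:2])
    if key not in WHEELS:
        raise ValueError(f"no AIMaster wheel listed for Python {key[0]}.{key[1]}")
    return f"{BASE_URL}/{WHEELS[key]}"


print(wheel_url((3, 8)))
```

The returned URL can then be passed to pip install -U in the same way as in the commands above.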