Platform for AI - AIMaster: Elastic fault tolerance engine

Last Updated: Apr 14, 2025

Deep Learning Container (DLC) provides the fault tolerance monitoring feature that is empowered by AIMaster. This topic describes how to enable and configure the feature.

Background information

As the scale of models and data increases, deep learning jobs are frequently run in a distributed manner. When a deep learning job is run on a larger number of instances, occasional exceptions in the underlying software stack or hardware environment can terminate the job.

To ensure that large-scale deep learning jobs are reliably run in a distributed manner, DLC provides the fault tolerance monitoring feature that is empowered by AIMaster. AIMaster is a job-level component. When you use AIMaster for a DLC job, an AIMaster instance is launched to run concurrently with other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.

Step 1: Configure additional parameters of fault tolerance monitoring

Refer to the following sections to determine the additional parameters that you want to configure. When you enable the fault tolerance monitoring feature in the subsequent step, you can specify these additional parameters in the Other Configuration field based on your business requirements.

Parameter description

General configuration

  • Job type: --job-execution-mode

    The type of the DLC job. Valid values:

      • Sync: a synchronous job.

      • Async: an asynchronous job.

    The fault tolerance policy varies based on the job type. For example:

      • When a retriable error occurs, a synchronous job requires a full restart.

      • When a retriable error occurs, an asynchronous job requires restarting only the failed instance.

    Default value: Sync

  • Job restart: --enable-job-restart

    Specifies whether to restart the job when a fault tolerance condition is triggered or a runtime exception is detected. Valid values:

      • False

      • True

    Default value: False

  • Job restart: --max-num-of-job-restart

    The maximum number of attempts to restart the job. If this limit is exceeded, the job is reported as failed.

    Default value: 3

Runtime configuration

Note

This configuration takes effect only if all instances of the job run as expected.

  • Hang detection for running jobs: --enable-job-hang-detection

    Specifies whether to enable hang detection while the job is running. This parameter is valid only for synchronous jobs. Valid values:

      • False: disables hang detection.

      • True: enables hang detection. If the standard output (STDOUT) and standard error (STDERR) logs of all instances are not updated within the duration that you specify for the --job-hang-interval parameter, the job is reported as hung and is restarted.

    Default value: False

  • Hang detection for running jobs: --job-hang-interval

    The maximum duration during which the job can be non-responsive. Valid values: positive integers. Unit: seconds. If the job remains non-responsive for a duration longer than this value, the job is reported as hung and is restarted.

    Default value: 1800

  • Hang detection for running jobs: --enable-c4d-hang-detection

    Specifies whether to enable the C4D detection feature. This feature is used to quickly diagnose and identify slow nodes during job running and faulty nodes that cause job hang issues.

    Note

    This parameter takes effect only when it is used together with the --enable-job-hang-detection parameter.

  • Hang detection for exiting jobs: --enable-job-exit-hang-detection

    Specifies whether to enable hang detection when the job is exiting. This parameter is valid only for synchronous jobs. Valid values:

      • False: disables hang detection.

      • True: enables hang detection. If an instance runs as expected but the corresponding job fails to exit within the duration that you specify for the --job-exit-hang-interval parameter, the job is reported as hung and is restarted.

    Default value: False

  • Hang detection for exiting jobs: --job-exit-hang-interval

    The maximum duration during which the job can be non-responsive when the job is exiting. Valid values: positive integers. Unit: seconds. If the job fails to exit within this duration, the job is reported as hung and is restarted.

    Default value: 600

Fault tolerance configuration

Note

This configuration takes effect only if an instance of the job fails to run.

  • Fault tolerance policy: --fault-tolerant-policy

    Valid values:

      • OnFailure:

        • If an asynchronous job encounters an error, the failed instance is unconditionally restarted.

        • If a synchronous job encounters an error, the job is unconditionally restarted.

      • ExitCodeAndErrorMsg: checks whether the retry condition is met based on the exit code and error log of the failed instance. For more information, see the "Step 3: Configure enhanced fault tolerance monitoring" section of this topic.

        • If the retry condition is met and the job is an asynchronous job, the failed instance is restarted.

        • If the retry condition is met and the job is a synchronous job, the job is restarted.

      • Never: reports the job as failed without an attempt to restart the job.

    Default value: ExitCodeAndErrorMsg

  • Maximum error occurrences: --max-num-of-same-error

    The maximum number of times the same error can occur on a single instance. If an error occurs more times than this value, the job is reported as failed.

    Default value: 10

  • Maximum fault tolerance rate: --max-tolerated-failure-rate

    The maximum percentage of failed instances. If the percentage of failed instances exceeds this value, the job is reported as failed. For example, a value of 0.3 indicates that if more than 30% of the job instances fail, the job is reported as failed.

    Default value: -1, which indicates that this check is disabled.

Sample configurations

This section provides examples of common configurations for different types of DLC jobs.

  • Synchronous training jobs (such as PyTorch jobs)

    If a job instance is abnormal and meets the fault tolerance conditions, the job is restarted.

    --job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
  • Asynchronous training jobs (such as TensorFlow jobs)

    If a retriable error occurs on a worker instance, the worker instance is restarted. If an error occurs on a parameter server or chief instance, the job is not restarted by default. To allow the job to restart in this scenario, set the --enable-job-restart parameter to True.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure
  • Offline inference jobs (such as ElasticBatch jobs)

    The instances of an offline inference job are independent of each other, similar to the instances of an asynchronous job. If a job instance is unexpectedly terminated, only that instance is restarted.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure

Step 2: Enable fault tolerance monitoring

When you submit a DLC job, you can enable the fault tolerance monitoring feature by using the PAI console or DLC SDK.

Use the PAI console

When you create a DLC job in the PAI console, you can turn on Automatic Fault Tolerance in the Fault Tolerance and Diagnosis section and configure the parameters. For more information, see Submit training jobs. After you configure the parameters, an AIMaster instance is launched when you run the DLC job. The AIMaster instance monitors the job progress and manages fault tolerance when errors occur.


  • You can configure additional parameters in the Other Configuration field. For more information, see Step 1: Configure additional parameters of fault tolerance monitoring.

  • You can turn on Hanging Detection to enable the C4D detection feature. C4D is a diagnostic tool developed by Alibaba Cloud that identifies slow nodes during large model training and faulty nodes that cause job hang issues.

    Note
    • C4D depends on ACCL, an in-house High Performance Network (HPN) communication library provided by Alibaba Cloud. Make sure that you have installed ACCL.

    • You can enable the C4D detection feature when you select Lingjun Resources during submission of a DLC job.

Use DLC SDK

  • Use DLC SDK for Go

    Sample code:

    createJobRequest := &client.CreateJobRequest{}
    settings := &client.JobSettings{
        EnableErrorMonitoringInAIMaster: tea.Bool(true),
        ErrorMonitoringArgs: tea.String("--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=3600"),
    }
    createJobRequest.SetSettings(settings)

    Parameters:

    • EnableErrorMonitoringInAIMaster: specifies whether to enable the fault tolerance monitoring feature.

    • ErrorMonitoringArgs: additional parameters that you can configure for the fault tolerance monitoring feature.

  • Use DLC SDK for Python

    Sample code:

    from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings
    
    settings = JobSettings(
        enable_error_monitoring_in_aimaster = True,
        error_monitoring_args = "--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=30"
    )
    create_job_req = CreateJobRequest(
        ...
        settings = settings,
    )

    Parameters:

    • enable_error_monitoring_in_aimaster: specifies whether to enable the fault tolerance monitoring feature.

    • error_monitoring_args: additional parameters that you can configure for the fault tolerance monitoring feature.
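
    After you construct the request and settings, submit the job through the DLC client. The following is a minimal sketch, not an authoritative implementation: it assumes the standard Alibaba Cloud Python SDK pattern (a Config from alibabacloud_tea_openapi and a Client with a create_job method in alibabacloud_pai_dlc20201203), and the credentials, region, endpoint, and remaining job fields are placeholders.

    # A minimal sketch of submitting a job with fault tolerance monitoring enabled.
    # The client class, create_job method, endpoint, and credential handling are
    # assumptions based on the standard Alibaba Cloud Python SDK pattern; adapt
    # them to your environment.
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings
    
    client = Client(Config(
        access_key_id='<your-access-key-id>',          # placeholder credential
        access_key_secret='<your-access-key-secret>',  # placeholder credential
        region_id='cn-hangzhou',                       # placeholder region
        endpoint='pai-dlc.cn-hangzhou.aliyuncs.com',   # placeholder endpoint; use the DLC endpoint for your region
    ))
    
    settings = JobSettings(
        enable_error_monitoring_in_aimaster=True,
        error_monitoring_args="--job-execution-mode=Sync --enable-job-restart=True",
    )
    create_job_req = CreateJobRequest(
        # Other required job fields (display name, job type, job specs, workspace
        # ID, and so on) are omitted here.
        settings=settings,
    )
    response = client.create_job(create_job_req)
    print(response.body.job_id)  # the response is assumed to expose the ID of the new job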

Step 3: Configure enhanced fault tolerance monitoring

You can configure the following features of enhanced fault tolerance monitoring based on your business requirements.

Fault tolerance notification

If you want to receive notifications about fault tolerance events, such as a job restart, you can configure a rule for such notifications by performing the following steps: On the Workspace Details page, choose Configure Workspace > Configure Event Notification. On the Configure Workspace page, click Create Event Rule. In the panel that appears, select DLC Jobs > Automatic Fault Tolerance from the Event Type drop-down list. For more information, see Workspace notification.

You can also use AIMaster SDK to configure custom notifications when an error occurs, such as when the loss function returns NaN. Sample code:

Note

To configure custom notifications, you must install the wheel package of AIMaster. For more information, see the "FAQ" section of this topic.

import math

from aimaster import job_monitor as jm

job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())

...

# Send a custom notification from rank 0 when the loss becomes NaN.
if math.isnan(loss) and rank == 0:
    st = job_monitor_client.send_custom_message(content="The loss function returns NaN")
    if not st.ok():
        print('failed to send message, error %s' % st.to_string())

Custom keywords

The fault tolerance monitoring feature provides a built-in module that can automatically detect common retriable errors. If you want to trigger fault tolerance when error logs carry information that matches specific keywords, you can use the following methods to configure custom keywords. After you configure custom keywords, the module scans the tail logs of failed instances for key information that matches your custom keywords.

Note

You must set the --fault-tolerant-policy parameter to ExitCodeAndErrorMsg.

  • Sample configuration for PyTorch jobs

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.PyTorchConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

    In the preceding code, the custom keywords are configured by using the monitor.set_retryable_errors function.

  • Sample configuration for TensorFlow jobs

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.TFConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

Fine-grained hang detection

By default, hang detection applies one policy to the entire job runtime. However, you may need different settings at different stages. For example, establishing connections between instances during job initialization can take a long time, whereas later stages respond much faster. DLC allows you to adjust the hang detection policy based on the current stage of the job, as shown in the following examples:

monitor.reset_config(jm_config_params)

# Example:
#     monitor.reset_config(job_hang_interval=10)
#     or
#     config_params = {"job_hang_interval": 10, }
#     monitor.reset_config(**config_params)

Sample configuration for a PyTorch job

import torch
import torch.distributed as dist
from aimaster import job_monitor as jm

jm_config_params = {
    "job_hang_interval": 1800 # Configure a global hang detection policy. This value specifies that the maximum duration during which the job can be non-responsive is 30 minutes. 
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)

dist.init_process_group('nccl')

...

# The helpers below are a sketch of how to scope a hang detection policy to a
# single function: reset_hang_detect() updates the policy through the AIMaster
# SDK, and the hang_detect() decorator applies the tightened policy only while
# the decorated function runs and then restores the global policy.
def reset_hang_detect(hang_seconds):
    monitor.reset_config(job_hang_interval=hang_seconds)

def hang_detect(interval):
    def decorator(func):
        def wrapper(*args, **kwargs):
            reset_hang_detect(interval)   # apply the function-scope policy
            try:
                return func(*args, **kwargs)
            finally:
                reset_hang_detect(1800)   # restore the global policy
        return wrapper
    return decorator

@hang_detect(180)  # tighten hang detection to 3 minutes, only for the function scope
def train(epoch):
    ...

@hang_detect(-1)   # temporarily disable hang detection, only for the function scope
def test(epoch):
    ...

for epoch in range(0, 100):
    train(epoch)
    test(epoch)
    scheduler.step()  # scheduler is assumed to be defined elsewhere in the training script

C4D

C4D is a diagnostic tool developed by Alibaba Cloud that identifies slow nodes during large model training and faulty nodes that cause job hang issues. C4D depends on ACCL, an in-house HPN communication library provided by Alibaba Cloud. Make sure that you have installed ACCL. You can enable the C4D detection feature when you select Lingjun Resources during submission of a DLC job.

Feature introduction

C4D aggregates the statuses of all nodes in a job, and then analyzes and determines whether a communication or non-communication issue occurs on some nodes. The following figure shows the system architecture.

(Figure: C4D system architecture)

Parameter description

After you enable the C4D detection feature, you can configure the parameters in the Other Configuration field. The following table describes the parameters.

Parameter

Description

Example

--c4d-log-level

The level of the log generated by C4D. Valid values:

  • Info

  • Warning (default)

  • Error

The default value Warning indicates that the logs of the Warning and Error levels are generated. We recommend that you use the default value under normal operating conditions. To troubleshoot performance issues, set the value to Info.

--c4d-log-level=Info

--c4d-common-envs

The environment variables for C4D running, in the k1=v1,k2=v2 format. Separate multiple environment variables with commas (,). By default, the parameter is left empty. The following environment variables are supported:

  • C4D_HANG_TIMEOUT: the time duration for which a job is suspended before Warning is triggered. Default value: 10000000. Unit: μs.

  • C4D_HANG_TIMES: the number of times for which a job is suspended. If a job is suspended for a specific number of times, the log of the Error level is recorded and then automatic isolation of nodes is triggered. This parameter must be used together with the C4D_HANG_TIMEOUT parameter. Default value: 18. The default value 18 indicates that automatic isolation of nodes is triggered after a job is suspended for 3 minutes.

  • C4D_CONN_BW_CHECK_PERIOD: the time interval at which the bandwidth is checked. Default value: 10s.

  • C4D_RUNTIME_LOG_LEVEL: the log level during C4D running. Valid values:

    • TRACE

    • DEBUG

    • INFO (default)

    • WARNING

    • ERROR

    • FATAL

  • C4D_ENABLE_STATS_OUTPUT: specifies whether to generate C4D-related statistics. Valid values:

    • TRUE

    • FALSE (default)

--c4d-common-envs=C4D_HANG_TIMEOUT=1,C4D_HANG_TIMES=2
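
For example, a hypothetical Other Configuration value that enables detailed C4D logging and uses shorter check intervals might look like the following. The numbers are chosen only for illustration: 5000000 μs is 5 seconds per check, and 36 consecutive checks is about 3 minutes before isolation is triggered.

--c4d-log-level=Info --c4d-common-envs=C4D_HANG_TIMEOUT=5000000,C4D_HANG_TIMES=36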

For logs of the Error level, AIMaster automatically isolates the related nodes and restarts the job. The following table describes the processing logic for logs at each level.

  • Error

    Description: By default, an Error-level log is recorded if a job is suspended for more than 3 minutes due to communication issues. A job in this state may fail. To adjust this threshold, modify the default values of the C4D_HANG_TIMEOUT and C4D_HANG_TIMES environment variables.

    Action: AIMaster automatically isolates the nodes reported in the log and restarts the job.

  • Warn

    Description: By default, a Warn-level log is recorded if a job is suspended for more than 10 seconds due to communication issues. Performance is affected, but the job does not fail. To adjust this threshold, modify the default value of the C4D_HANG_TIMEOUT environment variable. A Warn-level log is also recorded if a job is suspended for more than 10 seconds due to non-communication issues; in this case, the job may fail.

    Action: AIMaster requires manual confirmation before it isolates the nodes reported in the log.

  • Info

    Description: Slow nodes exist due to communication or non-communication issues.

    Action: Logs of this level focus on performance issues. Manual confirmation is required.

If a slow node exists or the job is suspended while a DLC job is running, you can obtain the C4D diagnosis result. Click the name of the job in the job list. In the Instance section of the Overview tab, view the log of the AIMaster node to obtain the C4D diagnosis result.

Sample diagnosis results

  • RankCommHang indicates that a job is suspended due to communication issues.

  • RankNonCommHang indicates that a job is suspended due to non-communication issues. For example, compute nodes are faulty, causing hang issues.

  • RankCommSlow indicates that a slow node exists due to communication issues.

  • RankNonCommSlow indicates that a slow node exists due to non-communication issues.

FAQ

How do I install AIMaster SDK?

Install the AIMaster SDK by using the wheel package that matches your Python version.

# Python 3.6
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl

# Python 3.8
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl

# Python 3.10
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
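
After installation, you can run a quick check to confirm that the SDK is importable. This is a minimal sketch; it assumes only the aimaster.job_monitor module used in the samples in this topic.

# Minimal check that the AIMaster SDK is importable after installation.
from aimaster import job_monitor as jm

print("AIMaster job_monitor module loaded:", jm.__name__)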