Platform For AI: AIMaster: An elastic and automatic fault tolerance engine

Last Updated: May 10, 2026

AIMaster improves the stability and continuity of large-scale distributed deep learning jobs. It addresses issues such as software and hardware exceptions, job hangs, and instance failures by providing job monitoring, fault-tolerance decisions, and resource control.

Background

Deep learning is widely used, and as models and datasets scale up, distributed training has become common practice. The more instances a job runs, the more likely it is that a software or hardware exception on one of them causes the whole job to fail.

To ensure stable operation for large-scale distributed deep learning jobs, DLC provides an AIMaster-based fault tolerance monitoring feature. AIMaster is a job-level component. When you enable this feature, an AIMaster instance runs alongside your job's other instances to provide job monitoring, fault-tolerance decisions, and resource control.

Limitations

AIMaster currently supports the following frameworks: PyTorch, MPI, TensorFlow, and ElasticBatch.

Step 1: Enable fault tolerance monitoring

You can enable the fault tolerance monitoring feature in the console or by using an SDK when you submit a DLC training job.

In the console

When you submit a DLC training job in the console, go to the Fault Tolerance and Diagnosis section, turn on the Automatic Fault Tolerance switch, and configure additional parameters. For more information, see Create a training job. DLC then starts an extra AIMaster role to monitor the job throughout its lifecycle and perform fault tolerance when errors occur.

Details:

  • You can configure additional parameters in the Other Configuration text box. For parameter details, see Appendix: Fault tolerance parameters.

  • After you enable Hanging Detection, you can enable the C4D Detection feature. C4D (Calibrating Collective Communication over Converged ethernet - Diagnosis) is a diagnostic tool developed by Alibaba Cloud for diagnosing slow or hung jobs in large model training. For more information, see Use C4D.

  • After you enable Hanging Detection, you can use the function call stack snapshot analysis tool to locate the exact line of code where a job hang occurs. You must configure the hang detection threshold for this tool to work correctly. For more information, see Analyze call stack snapshots.

Via the DLC SDK

  • Use the Go SDK

    Enable fault tolerance monitoring when you submit a job by using the Go SDK.

    createJobRequest := &client.CreateJobRequest{}
    settings := &client.JobSettings{
        EnableErrorMonitoringInAIMaster: tea.Bool(true),
        ErrorMonitoringArgs: tea.String("--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=3600"),
    }
    createJobRequest.SetSettings(settings)

    Parameters:

    • EnableErrorMonitoringInAIMaster: Specifies whether to enable the fault tolerance monitoring feature.

    • ErrorMonitoringArgs: Specifies additional parameters for fault tolerance monitoring.

  • Use the Python SDK

    Enable fault tolerance monitoring when you submit a job by using the Python SDK.

    from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings
    
    settings = JobSettings(
        enable_error_monitoring_in_aimaster = True,
        error_monitoring_args = "--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=30"
    )
    create_job_req = CreateJobRequest(
        ...
        settings = settings,
    )

    Parameters:

    • enable_error_monitoring_in_aimaster: Specifies whether to enable the fault tolerance monitoring feature.

    • error_monitoring_args: Specifies additional parameters for fault tolerance monitoring.
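The value passed as ErrorMonitoringArgs (Go) or error_monitoring_args (Python) is a single string of space-separated --name=value flags. As a minimal sketch, the string can be assembled from a dictionary so that each flag is set in one place. The helper name build_monitoring_args is hypothetical and is not part of the DLC SDK.

```python
def build_monitoring_args(params):
    """Join {flag_name: value} pairs into a '--flag=value' string.

    Hypothetical helper, not part of the DLC SDK.
    """
    # Booleans render as True/False, matching the flag style used above.
    return " ".join(f"--{name}={value}" for name, value in params.items())


args = build_monitoring_args({
    "job-execution-mode": "Sync",
    "enable-job-restart": True,
    "enable-job-hang-detection": True,
    "job-hang-interval": 1800,
})
print(args)
```

The resulting string can then be passed as error_monitoring_args when you construct JobSettings.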

Step 2: Configure advanced features

Select from the following advanced features based on your monitoring needs.

Configure fault tolerance notifications

After you enable fault tolerance monitoring for a job, you can configure notifications for fault tolerance events. On the Workspace Details page, choose Configure Workspace > Configure Event Notification. Then, click Create Event Rule and set the event type to DLC task > Automatic Fault Tolerance. For more information, see Workspace Event Center.

When a training job encounters an exception, such as a NaN loss, you can use the AIMaster SDK in your code to send a custom notification message:

Note

To use this feature, you must install the AIMaster wheel package. For more information, see FAQ.

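import math

from aimaster import job_monitor as jm

job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())

...

# NaN never compares equal to any value, so test for it with math.isnan().
# loss is expected to be a Python float here (for example, loss.item()).
if math.isnan(loss) and rank == 0:
    st = job_monitor_client.send_custom_message(content="The training loss of the job is NaN.")
    if not st.ok():
        print('failed to send message, error %s' % st.to_string())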

Configure custom retryable error keywords

Fault tolerance monitoring includes built-in detection for common retryable errors. If you want AIMaster to perform fault tolerance when specific keywords appear in the logs of a failed instance, you can configure them in your code. After configuration, the monitoring module scans the end of the failed instance's log for these keywords.

Note

The fault tolerance policy must be set to ExitCodeAndErrorMsg.

  • Example of configuring custom retryable error keywords for a PyTorch job

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.PyTorchConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

    The list passed to monitor.set_retryable_errors contains the custom retryable error keywords.

  • Example of configuring custom retryable error keywords for a TensorFlow job

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.TFConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])
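To illustrate what scanning the end of a log for keywords means, the following sketch checks only the last lines of a log for any configured keyword. It is not AIMaster's implementation: the function name find_retryable_keyword and the tail_lines window of 50 lines are assumptions made for this example.

```python
def find_retryable_keyword(log_text, keywords, tail_lines=50):
    """Return the first keyword found in the last `tail_lines` lines, else None.

    Illustrative only; the window size is an assumption, not an AIMaster setting.
    """
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    for keyword in keywords:
        if keyword in tail:
            return keyword
    return None


sample_log = "step 100 ok\nstep 101 ok\nRuntimeError: connect timeout while reaching peer\n"
print(find_retryable_keyword(sample_log, ["connect timeout", "error_yyy"]))
```

Because only the tail is inspected, a keyword that appears early in a long log does not match in this sketch, which is why keywords should be taken from the final error message of the failed instance.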

Configure staged job hang detection

By default, the hang detection configuration applies to the entire job. However, jobs often run in distinct stages. For example, node communication during initialization may take longer than during the training stage, where logs update more frequently. To quickly detect a job hang during the training process, DLC provides a staged hang detection feature. This allows you to configure different hang detection intervals for different stages of the job. Configure it as follows:

monitor.reset_config(**jm_config_params)

# Example:
#     monitor.reset_config(job_hang_interval=10)
#     or
#     config_params = {"job_hang_interval": 10}
#     monitor.reset_config(**config_params)

The following is an example of staged hang detection for a PyTorch job.

import functools

import torch
import torch.distributed as dist
from aimaster import job_monitor as jm

GLOBAL_HANG_INTERVAL = 1800  # Global 30-minute detection.

jm_config_params = {
    "job_hang_interval": GLOBAL_HANG_INTERVAL
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)

dist.init_process_group('nccl')

...

# These two helpers wrap monitor.reset_config.
# You only need to add the decorator to your own functions.
def reset_hang_detect(hang_seconds):
    monitor.reset_config(job_hang_interval=hang_seconds)

def hang_detect(interval):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            reset_hang_detect(interval)
            try:
                return func(*args, **kwargs)
            finally:
                # Restore the global interval when the function returns.
                reset_hang_detect(GLOBAL_HANG_INTERVAL)
        return wrapper
    return decorator

@hang_detect(180)  # Reset hang detection to 3 minutes, for this function's scope only.
def train(epoch):
    ...

@hang_detect(-1)  # Disable hang detection temporarily, for this function's scope only.
def test(epoch):
    ...

for epoch in range(0, 100):
    train(epoch)
    test(epoch)
    scheduler.step()  # scheduler is created in the setup elided above

Use C4D

C4D (Calibrating Collective Communication over Converged ethernet - Diagnosis) is a tool developed by Alibaba Cloud for diagnosing slow or hung jobs in large model training. C4D depends on the Alibaba Cloud high-performance collective communication library (ACCL). Ensure that ACCL is installed and that the environment variables are correctly configured. For more information, see ACCL: Alibaba Cloud high-performance collective communication library. Currently, you can use the C4D detection feature when you select Lingjun AI Computing Service for a DLC job.

Feature overview

C4D aggregates status information from all nodes in a job to determine whether a node has issues at the communication layer or elsewhere.

All parameters

After you enable the C4D detection feature, you can configure the following parameters in the Other Configurations text box:

Parameter: --c4d-log-level

Description: Sets the C4D output log level. Valid values:

  • Info

  • Warning (default)

  • Error

The default value is Warning, which outputs logs at the Warning and Error levels. We recommend using the default value for normal operation. To troubleshoot performance issues, you can set the level to Info.

Example value: --c4d-log-level=Info

Parameter: --c4d-common-envs

Description: Sets the environment variables for C4D execution. The format is k1=v1,k2=v2, where multiple variables are separated by a comma (,). This parameter is empty by default. The optional environment variables are as follows:

  • C4D_HANG_TIMEOUT: The duration that a job can hang before a warning is logged. Default: 10,000,000 microseconds (10 seconds).

  • C4D_HANG_TIMES: The number of job hangs that must occur before an error log is recorded, which then triggers the automatic node isolation logic. Used with C4D_HANG_TIMEOUT. Default: 18. By default, a 3-minute hang triggers automatic node isolation.

  • C4D_CONN_BW_CHECK_PERIOD: The time interval for bandwidth checks. Default: 10 seconds.

  • C4D_RUNTIME_LOG_LEVEL: Specifies the C4D runtime log level. Valid values:

    • TRACE

    • DEBUG

    • INFO (default)

    • WARNING

    • ERROR

    • FATAL

  • C4D_ENABLE_STATS_OUTPUT: Specifies whether to output C4D-related statistics. Valid values:

    • TRUE

    • FALSE (default)

Example value: --c4d-common-envs=C4D_HANG_TIMEOUT=1,C4D_HANG_TIMES=2
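The default values of C4D_HANG_TIMEOUT and C4D_HANG_TIMES combine as 10 seconds × 18 = 180 seconds, which matches the 3-minute default for node isolation. The sketch below reproduces that arithmetic and formats a --c4d-common-envs value. It assumes that C4D_HANG_TIMEOUT is expressed in microseconds and that the isolation delay is the product of the two settings. Both function names are hypothetical, and the values 5000000 and 12 are arbitrary examples.

```python
def c4d_isolation_seconds(hang_timeout_us=10_000_000, hang_times=18):
    """Seconds of continuous hang before an Error log is recorded.

    Assumes the delay is C4D_HANG_TIMEOUT (microseconds) x C4D_HANG_TIMES.
    """
    return hang_timeout_us * hang_times / 1_000_000


def build_c4d_common_envs(envs):
    """Format {name: value} as the k1=v1,k2=v2 value of --c4d-common-envs."""
    return "--c4d-common-envs=" + ",".join(f"{k}={v}" for k, v in envs.items())


print(c4d_isolation_seconds())  # 180.0 with the default values
print(build_c4d_common_envs({"C4D_HANG_TIMEOUT": 5_000_000, "C4D_HANG_TIMES": 12}))
```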

For error-level logs, AIMaster automatically isolates the corresponding node and restarts the job. The handling logic for each log level is as follows:

Log level: Error

Description: By default, a communication-layer job hang that exceeds three minutes causes the job to fail. You can modify this default by configuring the C4D_HANG_TIMEOUT and C4D_HANG_TIMES parameters.

Actions: AIMaster automatically isolates the node reported in the log.

Log level: Warning

Description: By default, a communication-layer job hang that exceeds 10 seconds affects performance but does not cause the job to fail. You can modify this default by configuring the C4D_HANG_TIMEOUT parameter.

Actions: No automatic node isolation. Manual confirmation is required.

Description: A non-communication-layer job hang that exceeds 10 seconds might cause the job to fail.

Actions: No automatic node isolation. Manual confirmation is required.

Log level: Info

Description: Communication-layer and non-communication-layer slowness.

Actions: These diagnostic logs are primarily for performance issues and require manual confirmation.

If you find that a DLC job is running slowly or is hung, go to the DLC job list and click the job name to open the job overview page. In the Instances section, view the AIMaster node log to see the C4D diagnostic results. For more information about the diagnosis results, see Diagnostic result examples.

Diagnostic result examples

  • RankCommHang: indicates that a node has a hang in the communication layer.

  • RankNonCommHang: indicates that a node has a hang outside the communication layer, such as a hang in the compute process.

  • RankCommSlow: indicates that a node is slow in the communication layer.

  • RankNonCommSlow: indicates that a node is slow outside the communication layer.

Analyze call stack snapshots

A common failure in large model training is a job hang. One frequent type is an NCCL hang, which typically generates a log entry like "Watchdog caught collective operation timeout". To help you quickly identify the root cause of job hangs, we developed the function call stack snapshot analysis tool. Follow these steps to use it:

Step 1: Install pystack or py-spy

First, confirm whether pystack or py-spy is installed in your container image. If not, you must install one. For example, use the following command to install pystack:

pip install pystack -i https://mirrors.cloud.aliyuncs.com/pypi/simple/ --trusted-host mirrors.cloud.aliyuncs.com

Step 2: Enable hang detection

For instructions on how to enable hang detection, see In the console. For the function call stack snapshot analysis tool to work correctly, you must also set an appropriate value for the hang detection threshold. First, determine the timeout value for your model job. You can usually find this in the error log after a job hangs. For example:

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2143, OpType=ALLREDUCE, NumelIn=659, NumelOut=659, Timeout(ms)=600000) ran for 600535 milliseconds before timing out

From the Timeout field in this error log, you can see that the job's timeout is 600,000 milliseconds (600 seconds, or 10 minutes). In this case, we recommend setting the hang detection threshold to 450 seconds. If the Timeout value in your log is 1,800 seconds, we recommend setting the threshold to 1,500 seconds. As a general rule, set the hang detection threshold a few minutes below the job's timeout value (150 seconds below in the first example, 300 seconds below in the second), so that the hang is detected before the collective operation itself times out.
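As a rough sketch of this calculation, the following code extracts the Timeout(ms) value from a watchdog log line and subtracts a safety margin. The function names and the default margin of 150 seconds (taken from the 600-second example) are assumptions for illustration; choose the margin that suits your job.

```python
import re


def nccl_timeout_seconds(log_line):
    """Extract the Timeout(ms) value from a watchdog log line, in seconds."""
    match = re.search(r"Timeout\(ms\)=(\d+)", log_line)
    return int(match.group(1)) / 1000 if match else None


def suggest_hang_interval(timeout_seconds, margin_seconds=150):
    """Hypothetical helper: hang detection threshold = timeout minus a margin."""
    return max(1, int(timeout_seconds - margin_seconds))


line = ("Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2143, "
        "OpType=ALLREDUCE, NumelIn=659, NumelOut=659, Timeout(ms)=600000) "
        "ran for 600535 milliseconds before timing out")
timeout = nccl_timeout_seconds(line)
print(timeout, suggest_hang_interval(timeout))  # 600.0 450
```

The suggested value can be used as the hang detection threshold, for example as --job-hang-interval=450.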

After you correctly configure hang detection, AIMaster automatically collects and analyzes the function call stack of the job process when a hang occurs. You can view the analysis result in the AIMaster node log. The following is a sample analysis result from the tool after a job hang:

[Image: sample analysis result]

In the analysis result, the stack field shows the function call stack, the threads field lists the threads where this stack occurred, and the count field indicates the number of threads with this stack. Stacks with a count of 1 are highly likely to be the cause of the job hang and should be investigated first.
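The idea behind the count field can be sketched as follows: identical stacks are grouped, and the rarest groups are listed first because an outlier is the most likely culprit. This is not the tool's implementation. The input format (a mapping from thread ID to a list of frame names) and the function name group_stacks are assumptions made for this example.

```python
from collections import defaultdict


def group_stacks(thread_stacks):
    """Group thread IDs by identical call stack and list the rarest stacks first.

    Illustrative only; the input format is hypothetical.
    """
    groups = defaultdict(list)
    for thread_id, stack in thread_stacks.items():
        groups[tuple(stack)].append(thread_id)
    report = [
        {"stack": list(stack), "threads": sorted(threads), "count": len(threads)}
        for stack, threads in groups.items()
    ]
    return sorted(report, key=lambda entry: entry["count"])


stacks = {
    0: ["train", "all_reduce"],
    1: ["train", "all_reduce"],
    2: ["train", "all_reduce"],
    3: ["train", "load_batch"],
}
for entry in group_stacks(stacks):
    print(entry["count"], entry["threads"], entry["stack"])
```

In this made-up input, thread 3 is the only one not waiting in all_reduce, so its stack has a count of 1 and appears first.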

Step 3: View restart reasons

  • View restart rounds: Job restart information is organized by round. On the job details page, you can expand a round's details to view information such as the time spent in each stage. This helps you understand the job's execution status more accurately.

  • View restart history: You can click the number of restarts or the Restart records tab to view relevant restart information, including the restart reason, result, and duration.

    Procedure:

    • In the Restart records list, click Description to view detailed information for a specific restart, including the Restarts, Restart Time, Node Name, Instance Name, Error Code, Error Message, and Error Source.

    • Click View Aggregation Fault Details to expand the full list of details for all restart records.

Appendix: Fault tolerance parameters

This section describes all parameters for the fault tolerance monitoring feature. You can refer to the common parameter configuration examples to plan your settings. When you enable fault tolerance monitoring, you can configure these parameters in the Other Configuration section as needed.

All parameters

General configuration

Feature: Job execution mode

Parameter: --job-execution-mode

Description: The execution mode of the job. Valid values:

  • Sync: A synchronous job.

  • Async: An asynchronous job.

Fault tolerance behavior differs by job type for retryable errors:

  • For a synchronous job, the entire job restarts.

  • For an asynchronous job, worker instances are independent. Only the failed instance restarts, and other instances are not affected.

Default: Sync

Feature: Job restart settings

Parameter: --enable-job-restart

Description: Specifies whether to allow the job to restart when fault tolerance conditions are met or a runtime exception is detected. Valid values:

  • False: The job does not restart.

  • True: The job restarts.

Default: False

Parameter: --max-num-of-job-restart

Description: The maximum number of times the job can restart. If this number is exceeded, the job fails.

Default: 3

Runtime configuration

Note

Applies to scenarios where no instances have failed.

Feature: Job hang detection

Parameter: --enable-job-hang-detection

Description: Specifies whether to enable runtime hang detection for the job. This feature supports only synchronous jobs. Valid values:

  • False: Disables the feature.

  • True: Enables the feature. If the stdout and stderr logs of all instances are not updated within a specified time, the job restarts.

Default: False

Parameter: --job-hang-interval

Description: The duration in seconds that a job can be suspended. This must be a positive integer.

If the suspension duration exceeds this value, the job is considered abnormal and restarts.

Default: 1800

Parameter: --enable-c4d-hang-detection

Description: Specifies whether to enable C4D detection to quickly diagnose and locate slow nodes and faulty nodes that cause a job hang during execution.

Note

This parameter takes effect only when the --enable-job-hang-detection parameter is also enabled.

Default: False

Feature: Job exit hang detection

Parameter: --enable-job-exit-hang-detection

Description: Specifies whether to enable hang detection during job exit. This feature supports only synchronous jobs. Valid values:

  • False: Disables the feature.

  • True: Enables the feature. After an instance of the job succeeds, if the job does not finish within the specified time, it restarts.

Default: False

Parameter: --job-exit-hang-interval

Description: The duration in seconds that a job can be suspended during exit. This must be a positive integer.

If the exit duration exceeds this value, the job is considered abnormal and restarts.

Default: 600

Fault tolerance configuration

Note

Applies to scenarios where at least one instance has failed.

Feature: Fault tolerance policy

Parameter: --fault-tolerant-policy

Description: The fault tolerance policy. Valid values:

  • OnFailure: When a job exception occurs:

    • For an asynchronous job, the failed instance is unconditionally restarted.

    • For a synchronous job, the entire job is unconditionally restarted.

  • ExitCodeAndErrorMsg: When a job exception occurs, the system evaluates the exit code and error message of the failed instance. For more information, see Step 2: Configure advanced features. If the retry condition is met:

    • For an asynchronous job, the failed instance is restarted.

    • For a synchronous job, the entire job is restarted.

  • Never: No action is taken. The job fails.

Default: ExitCodeAndErrorMsg

Feature: Maximum occurrences of the same error

Parameter: --max-num-of-same-error

Description: The maximum number of times the same error can occur on a single instance.

If the error count exceeds this value, the job fails.

Default: 10

Feature: Maximum tolerated failure rate

Parameter: --max-tolerated-failure-rate

Description: The maximum tolerated failure rate. If the proportion of failed instances exceeds this value, the job fails.

The default value of -1 disables this feature. For example, a value of 0.3 means the job fails if more than 30% of workers encounter an error.

Default: -1
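The semantics of --max-tolerated-failure-rate can be illustrated with a short check. This is a sketch of the rule as described above, not AIMaster's code. The function name is hypothetical, and the strict "greater than" comparison is an assumption based on the wording "more than 30%".

```python
def exceeds_failure_rate(failed, total, max_rate=-1):
    """Return True if the share of failed instances is above max_rate.

    A negative max_rate (the default, -1) disables the check.
    """
    if max_rate < 0 or total <= 0:
        return False
    return failed / total > max_rate


print(exceeds_failure_rate(3, 10, 0.3))  # False: exactly 30% is not more than 30%
print(exceeds_failure_rate(4, 10, 0.3))  # True: 40% is above the limit
print(exceeds_failure_rate(9, 10))       # False: the default of -1 disables the check
```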

Parameter configuration examples

The following examples show common parameter configurations for different training jobs.

  • Synchronous training job (common for PyTorch jobs)

    When an instance fails and meets the fault tolerance conditions, the job restarts.

    --job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
  • Asynchronous training job (common for TensorFlow jobs)

    For retryable errors, failed Worker instances are restarted. If a PS or Chief instance fails, the job is not restarted by default. To enable job restarts, set --enable-job-restart=True.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure
  • Offline inference job (common for ElasticBatch jobs)

    Instances are independent, similar to an asynchronous job. When an instance fails, only that instance restarts.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure

FAQ

Q: How do I install the AIMaster SDK?

Use the command that corresponds to your Python version to install the wheel package.

# Python 3.6
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl

# Python 3.8
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl

# Python 3.10
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
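If you install the package from a script, the matching wheel can be selected from the running interpreter's version. The following sketch only covers the three wheels listed above. The function name aimaster_wheel_url is hypothetical, and versions without a listed wheel return None.

```python
import sys

BASE_URL = "http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster"

# ABI tags of the wheels listed above, keyed by (major, minor) Python version.
WHEEL_TAGS = {
    (3, 6): "cp36-cp36m",
    (3, 8): "cp38-cp38",
    (3, 10): "cp310-cp310",
}


def aimaster_wheel_url(version=None):
    """Return the wheel URL for a (major, minor) version, or None if not listed."""
    key = tuple(version) if version is not None else tuple(sys.version_info[:2])
    tag = WHEEL_TAGS.get(key)
    if tag is None:
        return None
    return f"{BASE_URL}/pai_aimaster-1.2.1-{tag}-linux_x86_64.whl"


print(aimaster_wheel_url((3, 8)))
```

Pass the returned URL to pip install -U, as in the commands above.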