Platform For AI: AIMaster: Elastic fault tolerance engine

Last Updated: Mar 04, 2024

Deep Learning Container (DLC) provides the fault tolerance monitoring feature that is empowered by AIMaster. This topic describes how to enable and configure the feature.

Background information

As the scale of models and data increases, deep learning tasks are frequently run in a distributed manner. When a task runs on a large number of instances, an occasional exception in the underlying software stack or hardware environment can terminate the entire task.

To ensure that large-scale deep learning tasks are reliably run in a distributed manner, DLC provides the fault tolerance monitoring feature that is empowered by AIMaster. AIMaster is a job-level component. When you use AIMaster for a DLC job, an AIMaster instance is launched to run concurrently with other job instances. The AIMaster instance monitors the job progress and manages fault tolerance and resource allocation.

Procedure

  1. Step 1: Configure additional parameters of fault tolerance monitoring

    Refer to the parameter description and sample configurations to configure additional parameters of fault tolerance monitoring based on your business requirements.

  2. Step 2: Enable and configure fault tolerance monitoring

    When you submit a DLC job, enable and configure the fault tolerance monitoring feature by using the Platform for AI (PAI) console or the DLC SDK.

  3. Step 3: Configure enhanced fault tolerance monitoring

    If the configuration in the previous step cannot meet your business requirements, you can configure enhanced fault tolerance monitoring to add custom keywords by using AIMaster SDK. After you add custom keywords, AIMaster automatically scans the logs of faulty nodes and triggers fault tolerance if the logs carry information that matches your custom keywords. Enhanced fault tolerance monitoring also sends notifications when fault tolerance is triggered.

Step 1: Configure additional parameters of fault tolerance monitoring

Refer to the following sections to determine the additional parameters that you want to configure. When you enable the fault tolerance monitoring feature in the subsequent step, you can specify these additional parameters in the Other Configuration field based on your business requirements.

Parameter description

The following list describes the additional parameters that you can configure for fault tolerance monitoring, grouped by configuration category.

General configuration

  • Job type: --job-execution-mode

    The type of the DLC job. Valid values:

    • Sync: a synchronous job.

    • Async: an asynchronous job.

    The fault tolerance policy varies based on the job type. For example, when a retriable error occurs, a synchronous job requires a full restart, whereas an asynchronous job restarts only the failed instance.

    Default value: Sync

  • Job restart: --enable-job-restart

    Specifies whether to restart the job when a fault tolerance condition is triggered or a runtime exception is detected. Valid values: True and False.

    Default value: False

  • Job restart: --max-num-of-job-restart

    The maximum number of attempts to restart the job. If this limit is exceeded, the job is reported as failed.

    Default value: 3

Runtime configuration

Note: The following parameters take effect only if all instances of the job run as expected.

  • Hang detection for running jobs: --enable-job-hang-detection

    Specifies whether to enable hang detection when the job is running. This parameter is valid only for synchronous jobs. Valid values:

    • False: disables hang detection.

    • True: enables hang detection. If the standard output (STDOUT) and standard error (STDERR) logs of all instances are not updated within the duration that you specify for the --job-hang-interval parameter, the job is reported as hung and is restarted.

    Default value: False

  • Hang detection for running jobs: --job-hang-interval

    The maximum duration during which the job can be non-responsive. Valid values: positive integers. Unit: seconds. If the job remains non-responsive for a duration longer than this value, the job is reported as hung and is restarted.

    Default value: 1800

  • Hang detection for exiting jobs: --enable-job-exit-hang-detection

    Specifies whether to enable hang detection when the job is exiting. This parameter is valid only for synchronous jobs. Valid values:

    • False: disables hang detection.

    • True: enables hang detection. If an instance runs as expected but the job fails to exit within the duration that you specify for the --job-exit-hang-interval parameter, the job is reported as hung and is restarted.

    Default value: False

  • Hang detection for exiting jobs: --job-exit-hang-interval

    The maximum duration during which the job can be non-responsive when the job is exiting. Valid values: positive integers. Unit: seconds. If the job fails to exit within this duration, the job is reported as hung and is restarted.

    Default value: 600

Fault tolerance configuration

Note: The following parameters take effect only if an instance of the job fails to run.

  • Fault tolerance policy: --fault-tolerant-policy

    Valid values:

    • OnFailure: unconditionally restarts the failed instance for an asynchronous job, or the entire job for a synchronous job.

    • ExitCodeAndErrorMsg: checks whether the retry condition is met based on the exit code and error log of the failed instance. For more information, see the "Step 3: Configure enhanced fault tolerance monitoring" section of this topic. If the retry condition is met, the failed instance is restarted for an asynchronous job, and the entire job is restarted for a synchronous job.

    • Never: reports the job as failed without an attempt to restart the job.

    Default value: Never

  • Maximum error occurrences: --max-num-of-same-error

    The maximum number of times an error can occur on a single instance. If an error occurs more times than this value, the job is reported as failed.

    Default value: 10

  • Maximum fault tolerance rate: --max-tolerated-failure-rate

    The maximum percentage of failed instances. If the percentage of failed instances exceeds this value, the job is reported as failed. For example, a value of 0.3 indicates that if more than 30% of the job instances fail, the job is reported as failed.

    Default value: -1 (the parameter is disabled)

Sample configurations

This section provides examples of common configurations for different types of DLC jobs.

  • Synchronous training jobs (such as PyTorch jobs)

    If a job instance is unexpectedly terminated and the exit code or error log meets the fault tolerance conditions, such as preemption, the job is restarted.

    --job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg
  • Asynchronous training jobs (such as TensorFlow jobs)

    If a retriable error occurs on a worker instance of the job, only that worker instance is restarted. If an error occurs on a parameter server or chief instance, the job is not restarted by default. To allow the job to restart in that scenario, set the --enable-job-restart parameter to True.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure
  • Offline inference jobs (such as ElasticBatch jobs)

    The instances of an offline inference job are independent from each other, which is similar to an asynchronous job. If a job instance is unexpectedly terminated, only the instance is restarted.

    --job-execution-mode=Async --fault-tolerant-policy=OnFailure
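
All additional parameters are eventually passed to AIMaster as a single space-separated string, for example in the Other Configuration field or in the SDK parameters described in Step 2. The following minimal sketch shows one way to assemble such a string in Python; the build_error_monitoring_args helper is purely illustrative and is not part of any SDK.

def build_error_monitoring_args(**flags):
    """Illustrative helper (not part of any SDK): build the space-separated
    additional-parameter string for fault tolerance monitoring."""
    parts = []
    for name, value in flags.items():
        # Convert Python-style names such as job_execution_mode into --job-execution-mode.
        parts.append("--%s=%s" % (name.replace("_", "-"), value))
    return " ".join(parts)

# Example: the synchronous training job configuration shown above.
args = build_error_monitoring_args(
    job_execution_mode="Sync",
    enable_job_restart=True,
    max_num_of_job_restart=3,
    fault_tolerant_policy="ExitCodeAndErrorMsg",
)
print(args)
# --job-execution-mode=Sync --enable-job-restart=True --max-num-of-job-restart=3 --fault-tolerant-policy=ExitCodeAndErrorMsg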

Step 2: Enable and configure fault tolerance monitoring

When you submit a DLC job, you can enable the fault tolerance monitoring feature by using the PAI console or DLC SDK.

Use the PAI console

When you create a DLC job in the PAI console, you can turn on Automatic Fault Tolerance in the Resource Configuration section and configure the parameters. For more information, see Submit training jobs. After you configure the parameters, an AIMaster instance is launched when you run the DLC job. The AIMaster instance monitors the job progress and manages fault tolerance when errors occur.

For information about how to configure the Other Configuration field, see the "Step 1: Configure additional parameters of fault tolerance monitoring" section of this topic.

Use DLC SDK

  • Use DLC SDK for Go

    Sample code:

    createJobRequest := &client.CreateJobRequest{}
    settings := &client.JobSettings{
        EnableErrorMonitoringInAIMaster: tea.Bool(true),
        ErrorMonitoringArgs: tea.String("--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=3600"),
    }
    createJobRequest.SetSettings(settings)

    Parameters:

    • EnableErrorMonitoringInAIMaster: specifies whether to enable the fault tolerance monitoring feature.

    • ErrorMonitoringArgs: additional parameters that you can configure for the fault tolerance monitoring feature.

  • Use DLC SDK for Python

    Sample code:

    from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings
    
    settings = JobSettings(
        enable_error_monitoring_in_aimaster = True,
        error_monitoring_args = "--job-execution-mode=Sync --enable-job-restart=True --enable-job-hang-detection=True --job-hang-interval=30"
    )
    create_job_req = CreateJobRequest(
        ...
        settings = settings,
    )

    Parameters:

    • enable_error_monitoring_in_aimaster: specifies whether to enable the fault tolerance monitoring feature.

    • error_monitoring_args: additional parameters that you can configure for the fault tolerance monitoring feature.
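
    The following hedged sketch shows how these settings might be attached to a full job submission by using the DLC client from the same SDK. The endpoint, credentials, and the other CreateJobRequest fields are placeholders, and the exact fields depend on your job; treat this as a sketch rather than a complete submission script.

    # A minimal sketch, assuming the standard Alibaba Cloud SDK client pattern.
    # The endpoint, credentials, and omitted job fields are placeholders.
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSettings

    client = Client(Config(
        access_key_id='<your-access-key-id>',
        access_key_secret='<your-access-key-secret>',
        region_id='cn-hangzhou',
        endpoint='pai-dlc.cn-hangzhou.aliyuncs.com',
    ))

    settings = JobSettings(
        enable_error_monitoring_in_aimaster=True,
        error_monitoring_args="--job-execution-mode=Sync --enable-job-restart=True",
    )

    create_job_req = CreateJobRequest(
        # Other job fields (display name, job specs, and so on) are omitted here.
        settings=settings,
    )

    response = client.create_job(create_job_req)
    print(response.body.job_id)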

Step 3: Configure enhanced fault tolerance monitoring

You can configure the following features of enhanced fault tolerance monitoring based on your business requirements.

Fault tolerance notification

If you want to receive notifications about fault tolerance events, such as a job restart, you can configure a rule for such notifications by performing the following steps: Go to the Workspace Details page, click the Events tab, and then click Create Event Rule. In the panel that appears, select DLC Jobs > Automatic Fault Tolerance from the Event Type drop-down list. For more information, see Workspace notification.

You can also use AIMaster SDK to configure custom notifications when an error occurs, such as when the loss function returns NaN. Sample code:

Note

To configure custom notifications, you must install the wheel package of AIMaster. For more information, see the "FAQ" section of this topic.

import math

from aimaster import job_monitor as jm

job_monitor_client = jm.Monitor(config=jm.PyTorchConfig())

...

# Send a custom notification from rank 0 when the loss becomes NaN.
if math.isnan(loss) and rank == 0:
    st = job_monitor_client.send_custom_message(content="The loss function returns NaN")
    if not st.ok():
        print('failed to send message, error %s' % st.to_string())

Custom keywords

The fault tolerance monitoring feature provides a built-in module that can automatically detect common retriable errors. If you want to trigger fault tolerance when error logs carry information that matches specific keywords, you can use the following methods to configure custom keywords. After you configure custom keywords, the module scans the tail logs of failed instances for key information that matches your custom keywords.

Note

You must set the --fault-tolerant-policy parameter to ExitCodeAndErrorMsg.

  • Sample configuration for PyTorch jobs

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.PyTorchConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

    In the preceding code, the custom keywords are configured by using the monitor.set_retryable_errors function.

  • Sample configuration for TensorFlow jobs

    from aimaster import job_monitor as jm
    
    jm_config_params = {}
    jm_config = jm.TFConfig(**jm_config_params)
    monitor = jm.Monitor(config=jm_config)
    monitor.set_retryable_errors(["connect timeout", "error_yyy", "error_zzz"])

Fine-grained hang detection

By default, hang detection is configured for the entire job runtime. However, you may need different configurations at different runtime stages. For example, when a large-scale job is initializing, it takes a long time to establish the connections between instances, whereas the execution of the job is faster. DLC allows you to configure the hang detection policy based on the current status of the job by using the following methods:

monitor.reset_config(jm_config_params)

# Example:
#     monitor.reset_config(job_hang_interval=10)
#     or
#     config_params = {"job_hang_interval": 10, }
#     monitor.reset_config(**config_params)

Sample configuration for a PyTorch job

import torch
import torch.distributed as dist
from aimaster import job_monitor as jm

jm_config_params = {
    "job_hang_interval": 1800  # Global hang detection policy: the job can be non-responsive for at most 30 minutes.
}
jm_config = jm.PyTorchConfig(**jm_config_params)
monitor = jm.Monitor(config=jm_config)

dist.init_process_group('nccl')

...

def reset_hang_detect(hang_seconds):
    # Update the hang detection interval at runtime. A value of -1 disables hang detection.
    monitor.reset_config(job_hang_interval=hang_seconds)

def hang_detect(interval):
    # Decorator factory: apply the specified hang detection interval only for the
    # scope of the decorated function, then restore the global interval (1800 seconds).
    def decorator(func):
        def wrapper(*args, **kwargs):
            reset_hang_detect(interval)
            try:
                return func(*args, **kwargs)
            finally:
                reset_hang_detect(1800)
        return wrapper
    return decorator

@hang_detect(180)  # Reset hang detection to 3 minutes, only for the scope of train().
def train(epoch):
    ...

@hang_detect(-1)  # Disable hang detection temporarily, only for the scope of test().
def test(epoch):
    ...

for epoch in range(0, 100):
    train(epoch)
    test(epoch)
    scheduler.step()  # Learning rate scheduler assumed to be defined elsewhere.

FAQ

How do I install AIMaster SDK?

Install AIMaster SDK by using the wheel package that matches your Python version.

# py36
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl

# py38
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl

# py310
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
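
After installation, a quick import check is enough to confirm that the SDK is available in your environment. The following snippet only verifies that the job_monitor module used in this topic can be imported; it does not start any monitoring.

# Sanity check after installing the AIMaster wheel.
from aimaster import job_monitor as jm

print("AIMaster job_monitor module:", jm.__name__)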