All Products
Search
Document Center

Platform For AI:Compute monitoring and fault tolerance

Last Updated:Apr 01, 2026

EAS automatically checks GPU compute power and node communication health for large-scale distributed deployments.

Applicable scope

This feature applies to multi-node distributed inference services deployed on Lingjun resources.

Key concepts

  • Detection timing:

    • Before instance startup: Runs before the application in the service instance (Pod) starts. Detects hardware or network issues to prevent startup failures.

    • During instance runtime: Runs as a background process alongside the service.

  • Check item:

    • Before instance startup: Compute performance check, node communication check, and cross-check for computing and communication.

    • During instance runtime: Only C4D (GPU health check).

    • For details about check items, see Appendix: Check items.

  • Abnormal state handling:

    • Instance startup failure: If an issue is detected, the system terminates the current instance startup.

    • No action: The system only records an event without taking other action.

Procedure

Enable and configure compute monitoring

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. In the Features section, under Stability Guarantee, enable Compute monitoring & fault tolerance. In the panel that appears, configure the check parameters. To use a JSON file instead, see Appendix: JSON file parameters.

    Note

    Both "Before running" and "Instance running" checks can be added.

    • Configure pre-run check (optional):

      • Detection timing: Select Before running.

      • Check item: Select check items as needed, such as Run Compute Performance Check and Run Node Communication Check. By default, GPU GEMM, All-Reduce (single-node), and All-Reduce (between two nodes) are enabled.

      • Set an appropriate timeout based on estimated durations in Check items. Checks run sequentially. The default timeout is 5 minutes. A check that exceeds this time is treated as a failure.

      • Abnormal state handling: The default is Instance startup failed. Select Rebuild Instance based on your disaster recovery policy.

    • Configure in-run check (optional):

      • Detection timing: Select Instance running.

      • Check item: Only C4D is available.

      • Abnormal state handling: Only Ignore is available.

View check results

After configuring this feature, view the check reports in either of the following ways:

  • Method 1: From the instance list

    1. On the service details page, go to the Overview tab.

    2. In Service Instance, find the target instance and click View results in the Actions column.image

  • Method 2: From deployment events

    1. On the service details page, go to the Deployment Events tab.

    2. Find an event of type SanityCheckSucceeded or SanityCheckFailed and click View results in the Actions column.image

The Health check results drawer appears. View detailed reports for each check item.

FAQ

What are the common causes for an All-Reduce check failure?

An All-Reduce check failure usually indicates network communication issues such as high latency, severe packet loss, or incorrect Remote Direct Memory Access (RDMA) configuration between nodes. Use the detailed data in the report to troubleshoot nodes with slow communication.

Appendix: Check items

Check item

Description

Estimated duration

Before instance startup

Compute performance check

GPU GEMM

Checks GPU GEMM performance to identify:

  • Faulty GPUs: computation errors or hangs

  • Slow nodes: low TFLOPS

1 minute

GPU Kernel Launch

Checks GPU kernel launch latency to identify:

  • Faulty nodes: kernel launch errors or hangs

  • Slow nodes: long kernel launch time

1 minute

Node communication check

All-Reduce

Checks node communication performance across different patterns to identify:

  • Faulty nodes: communication errors or hangs

  • Slow nodes: low communication bandwidth

Per collective communication check:

5 minutes

All-to-All

All-Gather

Multi-All-Reduce

PyTorch-Gloo

Checks node communication through PyTorch Gloo to identify faulty nodes.

1 minute

Network Connectivity

Checks network connectivity of head or tail nodes to identify connectivity issues.

2 minutes

Cross-check for computing and communication

MatMul/All-Reduce Overlap

Checks single-node performance when communication and computation kernels overlap to identify:

  • Faulty nodes: overlap computation errors or hangs

  • Slow nodes: long overlap computation time

1 minute

During instance runtime

C4D

Checks GPU health during instance runtime.

Appendix: JSON file parameters

Configuration example

{
    "aimaster": {
        "runtime_check": {
            "fail_action": "retain",
            "micro_benchmarks": "c4d"
        },
        "sanity_check": {
            "fail_action": "retain",
            "micro_benchmarks": "gemm_flops,all_reduce_1,all_reduce_2,kernel_launch,all_reduce,all_to_all_2,all_gather_2,all_gather,multi_all_reduce_2,multi_all_reduce,pytorch_gloo_2,network_connectivity,comp_comm_overlap",
            "timeout": 100
        }
    }
}

Parameters

Parameter

Description

aimaster

runtime_check

Checks performed during instance runtime.

fail_action

Action to take when an abnormal state is detected.

micro_benchmarks

Check item. Valid value: C4D.

sanity_check

Checks performed before instance startup.

fail_action

Action to take when an abnormal state is detected.

micro_benchmarks

Check items. Separate multiple items with commas.

timeout

Maximum check duration, in minutes.