Configure GPU Health Checks and Auto Fault Recovery - Platform For AI - Alibaba Cloud - Platform For AI

Applicable scope

This feature applies to multi-node distributed inference services deployed on Lingjun resources.

Key concepts

Detection timing:
- Before instance startup: Runs before the application in the service instance (Pod) starts. Detects hardware or network issues to prevent startup failures.
- During instance runtime: Runs as a background process alongside the service.
Check item:
- Before instance startup: Compute performance check, node communication check, and cross-check for computing and communication.
- During instance runtime: Only C4D (GPU health check).
- For details about check items, see Appendix: Check items.
Abnormal state handling:
- Instance startup failure: If an issue is detected, the system terminates the current instance startup.
- No action: The system only records an event without taking other action.

Procedure

Enable and configure compute monitoring

Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
In the Features section, under Stability Guarantee, enable Compute monitoring & fault tolerance. In the panel that appears, configure the check parameters. To use a JSON file instead, see Appendix: JSON file parameters.

Note
Both "Before running" and "Instance running" checks can be added.
- Configure pre-run check (optional):
  - Detection timing: Select Before running.
  - Check item: Select check items as needed, such as Run Compute Performance Check and Run Node Communication Check. By default, GPU GEMM, All-Reduce (single-node), and All-Reduce (between two nodes) are enabled.
  - Set an appropriate timeout based on estimated durations in Check items. Checks run sequentially. The default timeout is 5 minutes. A check that exceeds this time is treated as a failure.
  - Abnormal state handling: The default is Instance startup failed. Select Rebuild Instance based on your disaster recovery policy.
- Configure in-run check (optional):
  - Detection timing: Select Instance running.
  - Check item: Only C4D is available.
  - Abnormal state handling: Only Ignore is available.

View check results

After configuring this feature, view the check reports in either of the following ways:

Method 1: From the instance list
1. On the service details page, go to the Overview tab.
2. In Service Instance, find the target instance and click View results in the Actions column.
Method 2: From deployment events
1. On the service details page, go to the Deployment Events tab.
2. Find an event of type SanityCheckSucceeded or SanityCheckFailed and click View results in the Actions column.

The Health check results drawer appears. View detailed reports for each check item.

FAQ

What are the common causes for an All-Reduce check failure?

An All-Reduce check failure usually indicates network communication issues such as high latency, severe packet loss, or incorrect Remote Direct Memory Access (RDMA) configuration between nodes. Use the detailed data in the report to troubleshoot nodes with slow communication.

Appendix: Check items

Check item		Description	Estimated duration
Before instance startup
Compute performance check	GPU GEMM	Checks GPU GEMM performance to identify: Faulty GPUs: computation errors or hangs Slow nodes: low TFLOPS	1 minute
Compute performance check	GPU Kernel Launch	Checks GPU kernel launch latency to identify: Faulty nodes: kernel launch errors or hangs Slow nodes: long kernel launch time	1 minute
Node communication check	All-Reduce	Checks node communication performance across different patterns to identify: Faulty nodes: communication errors or hangs Slow nodes: low communication bandwidth	Per collective communication check: 5 minutes
	All-to-All
	All-Gather
	Multi-All-Reduce
	PyTorch-Gloo	Checks node communication through PyTorch Gloo to identify faulty nodes.	1 minute
	Network Connectivity	Checks network connectivity of head or tail nodes to identify connectivity issues.	2 minutes
Cross-check for computing and communication	MatMul/All-Reduce Overlap	Checks single-node performance when communication and computation kernels overlap to identify: Faulty nodes: overlap computation errors or hangs Slow nodes: long overlap computation time	1 minute
During instance runtime
C4D		Checks GPU health during instance runtime.

Appendix: JSON file parameters

Configuration example

{
    "aimaster": {
        "runtime_check": {
            "fail_action": "retain",
            "micro_benchmarks": "c4d"
        },
        "sanity_check": {
            "fail_action": "retain",
            "micro_benchmarks": "gemm_flops,all_reduce_1,all_reduce_2,kernel_launch,all_reduce,all_to_all_2,all_gather_2,all_gather,multi_all_reduce_2,multi_all_reduce,pytorch_gloo_2,network_connectivity,comp_comm_overlap",
            "timeout": 100
        }
    }
}

Parameters

Parameter			Description
aimaster	runtime_check Checks performed during instance runtime.	fail_action	Action to take when an abnormal state is detected.
	runtime_check Checks performed during instance runtime.	micro_benchmarks	Check item. Valid value: C4D.
	sanity_check Checks performed before instance startup.	fail_action	Action to take when an abnormal state is detected.
		micro_benchmarks	Check items. Separate multiple items with commas.
		timeout	Maximum check duration, in minutes.