During AI training, a GPU node fault can cause tasks to hang or terminate abnormally, which affects task efficiency. Alibaba Cloud Container Service for Kubernetes (ACS) provides fault monitoring for GPU-HPN nodes. If a node becomes abnormal, ACS marks the node and reports the issue through Kubernetes events and conditions. This topic describes the fault metrics, data retrieval methods, and recovery process for GPU-HPN nodes.
Node fault handling process
ACS continuously runs health checks on GPU-HPN nodes. If a node fails, follow this process to repair the faulty node using the ACS self-healing mechanism.
Fault reporting
ACS reports the cause of the fault through events and node conditions. It also adds a taint to the node to prevent new pods from being scheduled to it.
Drain faulty node
After you receive a fault notification, you must promptly evict the pods from the faulty node. You can use acs-instance-helper to automatically evict pods. For more information, see Configure automatic rotation for instances with hardware exceptions.
Repair faulty node
After the faulty node is drained, ACS automatically starts the repair process.
Node self-recovery
After ACS repairs the node, the related taints and conditions on the node are automatically restored to a normal state. New pods can then be scheduled to the node.
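The end state of this process is a node without the anomaly taint, which the scheduler will again consider for new pods. As an illustration, a script could verify schedulability as follows (the helper name is hypothetical; the taint key is the one ACS uses, described later in this topic):

```python
# Illustrative check: a node accepts new pods once the ACS anomaly
# taint has been removed after self-healing.
ANOMALY_TAINT = "alibabacloud.com/node-anomaly"

def accepts_new_pods(node):
    # The node is represented as the dict returned by the Kubernetes API.
    taints = node.get("spec", {}).get("taints", [])
    return not any(
        t.get("key") == ANOMALY_TAINT and t.get("effect") == "NoSchedule"
        for t in taints
    )

faulty = {"spec": {"taints": [{"key": ANOMALY_TAINT, "effect": "NoSchedule"}]}}
healed = {"spec": {"taints": []}}
print(accepts_new_pods(faulty), accepts_new_pods(healed))  # False True
```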
Fault notifications
To ensure prompt handling of faults, configure alert conditions using kube-eventer. For more information, see kube-eventer.
When a GPU-HPN node is faulty, ACS provides fault information using conditions on the Node object and events on the corresponding pods.
Node
An anomaly label is added to the node for filtering and viewing.
```yaml
metadata:
  labels:
    alibabacloud.com/node-anomaly: "true"
```

A taint named `alibabacloud.com/node-anomaly` is added to the node. By default, newly submitted pods are not scheduled to this node.

```yaml
spec:
  taints:
  - effect: NoSchedule
    key: alibabacloud.com/node-anomaly
    timeAdded: "2024-10-16T06:09:27Z"
```

Detailed fault information is recorded in the node's `conditions` field, in a condition with the type `NodeAnomaly`.

```yaml
status:
  conditions:
  - lastHeartbeatTime: "2024-10-16T06:09:31Z"
    lastTransitionTime: "2024-10-16T06:09:31Z"
    message: The node has encountered an anomaly.
    reason: NodeBroken
    status: "True"
    type: NodeAnomaly
```

The fields in the condition are as follows:

| Field | Description | When it is updated |
| --- | --- | --- |
| type | `NodeAnomaly`. Indicates a node anomaly. | Static field. Does not change during the node lifecycle. |
| status | Indicates whether a fault exists. `True`: A fault exists. `False`: No fault exists. | Updated when the node's fault status changes. |
| reason | Fault type. It can be one of the following: `NodeBroken`: The entire GPU-HPN node has failed. `GPUCardBroken`: A GPU card-level failure has occurred. `NodeMaintenance`: The GPU-HPN node is undergoing a system upgrade or O&M. | Updated when the node's fault status changes. |
| message | Records detailed information about the fault. | Updated when the node's fault status changes. |
| lastTransitionTime | The time when the fault status last changed. | Updated when the node's fault status changes. |
| lastHeartbeatTime | The regularly updated heartbeat time. | Updated when the node's fault status changes, or when more than five minutes have passed since the last update. |
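A monitoring script might read this condition from the Node object. The following is a minimal Python sketch (helper names are illustrative, not part of ACS; the node is represented as the dict returned by the Kubernetes API):

```python
# Illustrative helpers: locate the NodeAnomaly condition on a Node object
# and decide whether the node is currently faulty.
def get_node_anomaly(node):
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "NodeAnomaly":
            return cond
    return None

def is_faulty(node):
    cond = get_node_anomaly(node)
    return cond is not None and cond.get("status") == "True"

# Sample node shaped like the YAML above.
node = {
    "metadata": {"labels": {"alibabacloud.com/node-anomaly": "true"}},
    "status": {"conditions": [{
        "lastHeartbeatTime": "2024-10-16T06:09:31Z",
        "lastTransitionTime": "2024-10-16T06:09:31Z",
        "message": "The node has encountered an anomaly.",
        "reason": "NodeBroken",
        "status": "True",
        "type": "NodeAnomaly",
    }]},
}

print(is_faulty(node), get_node_anomaly(node)["reason"])  # True NodeBroken
```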
Detailed self-healing information is recorded in the node's `conditions` field, in a condition with the type `FaultHealing`. This condition is updated as the self-healing process progresses.

```yaml
status:
  conditions:
  - lastHeartbeatTime: "2025-03-24T11:14:48Z"
    lastTransitionTime: "2025-03-24T11:14:48Z"
    message: node fault healing success
    reason: Success
    status: "False"
    type: FaultHealing
```

The fields in the condition are as follows:

| Field | Description |
| --- | --- |
| type | `FaultHealing`. Indicates that the node is in the self-healing process. |
| status | Indicates whether the node is currently self-healing. `True`: Self-healing is in progress. `False`: Self-healing is complete. |
| reason | The node's self-healing status. `Success`, `Finished`: Self-healing is complete. `InProgress`: Self-healing is in progress. `Failed`: Self-healing failed. |
| message | Records detailed information about the self-healing progress. |
| lastTransitionTime | Updated when the self-healing progress changes. |
| lastHeartbeatTime | Updated when the self-healing progress changes. |
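The `status` and `reason` combinations above can be folded into a single healing state. A minimal sketch, with an illustrative helper name:

```python
# Illustrative helper: summarize a FaultHealing condition into one state.
HEALING_DONE = {"Success", "Finished"}

def healing_state(condition):
    if condition.get("type") != "FaultHealing":
        raise ValueError("not a FaultHealing condition")
    if condition.get("status") == "True":
        return "in-progress"
    reason = condition.get("reason")
    if reason in HEALING_DONE:
        return "complete"
    if reason == "Failed":
        return "failed"
    return "unknown"

cond = {
    "type": "FaultHealing",
    "status": "False",
    "reason": "Success",
    "message": "node fault healing success",
}
print(healing_state(cond))  # complete
```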
Pod Event
A Warning event is generated for pods that are running on the faulty node.
After you receive the fault information, evict the pods from the faulty node as soon as possible. ACS automatically starts the node repair and self-healing process after all pods are evicted. You can use acs-instance-helper to automatically evict pods. For more information, see Configure automatic rotation for instances with hardware exceptions.
```yaml
reason: NodeBroken
type: Warning
message: 'The pod is proposed to be evicted at 2024-10-16 07:21:54 +0000 UTC, reason: xxx'
```

The following table describes the fields in the event.
| Field | Description |
| --- | --- |
| type | Static field. The value is `Warning`. |
| reason | Fault type. It can be one of the following: `NodeBroken`, `GPUCardBroken`, `NodeMaintenance`. |
| message | Records detailed information about the fault and the proposed eviction time. |
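To act on these events programmatically, a monitoring script might filter a pod's events by the documented fault reasons. A minimal Python sketch (the function name is illustrative, and the assumption that event reasons match the node fault types listed above is drawn from this topic's tables):

```python
# Illustrative filter: keep only Warning events whose reason is one of
# the documented GPU-HPN fault types.
FAULT_REASONS = {"NodeBroken", "GPUCardBroken", "NodeMaintenance"}

def fault_events(events):
    # Each event is represented as a dict with the fields documented above.
    return [
        e for e in events
        if e.get("type") == "Warning" and e.get("reason") in FAULT_REASONS
    ]

events = [
    {"type": "Normal", "reason": "Scheduled", "message": "..."},
    {"type": "Warning", "reason": "NodeBroken",
     "message": "The pod is proposed to be evicted at 2024-10-16 07:21:54 +0000 UTC, reason: xxx"},
]
print(len(fault_events(events)))  # 1
```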