During AI training, a GPU node fault can cause tasks to hang or terminate abnormally, which affects task efficiency. Alibaba Cloud Container Service for Kubernetes (ACS) provides fault monitoring for GPU-HPN nodes. If a node becomes abnormal, ACS marks the node and reports the issue through Kubernetes events and conditions. This topic describes the fault metrics, data retrieval methods, and recovery process for GPU-HPN nodes.
Node fault handling process
ACS continuously runs health checks on GPU-HPN nodes. If a node fails, follow this process to repair the faulty node using the ACS self-healing mechanism.
Fault reporting
ACS reports the cause of the fault through events and node conditions. It also adds a taint to the node to prevent new pods from being scheduled to it.
Drain faulty node
After you receive a fault notification, you must promptly evict the pods from the faulty node. You can use acs-instance-helper to automatically evict pods. For more information, see Configure automatic rotation for instances with hardware exceptions.
Repair faulty node
After the faulty node is drained, ACS automatically starts the repair process.
Node self-recovery
After ACS repairs the node, the related taints and conditions on the node are automatically restored to a normal state. New pods can then be scheduled to the node.
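The end state of this process is a node without the anomaly taint, which the scheduler will again consider for new pods. As an illustration, a script could verify schedulability as follows (the helper name is hypothetical; the taint key is the one ACS uses, described later in this topic):

```python
# Illustrative check: a node accepts new pods once the ACS anomaly
# taint has been removed after self-healing.
ANOMALY_TAINT = "alibabacloud.com/node-anomaly"

def accepts_new_pods(node):
    # The node is represented as the dict returned by the Kubernetes API.
    taints = node.get("spec", {}).get("taints", [])
    return not any(
        t.get("key") == ANOMALY_TAINT and t.get("effect") == "NoSchedule"
        for t in taints
    )

faulty = {"spec": {"taints": [{"key": ANOMALY_TAINT, "effect": "NoSchedule"}]}}
healed = {"spec": {"taints": []}}
print(accepts_new_pods(faulty), accepts_new_pods(healed))  # False True
```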
Fault notifications
To ensure prompt handling of faults, configure alert conditions using kube-eventer. For more information, see kube-eventer.
When a GPU-HPN node is faulty, ACS provides fault information using conditions on the Node object and events on the corresponding pods.
Node
An anomaly label is added to the node for filtering and viewing.
```yaml
metadata:
  labels:
    alibabacloud.com/node-anomaly: "true"
```

A taint named `alibabacloud.com/node-anomaly` is added to the node. By default, newly submitted pods are not scheduled to this node.

```yaml
spec:
  taints:
  - effect: NoSchedule
    key: alibabacloud.com/node-anomaly
    timeAdded: "2024-10-16T06:09:27Z"
```

Detailed fault information is recorded in the node's `conditions` field, in a condition with the type `NodeAnomaly`.

```yaml
status:
  conditions:
  - lastHeartbeatTime: "2024-10-16T06:09:31Z"
    lastTransitionTime: "2024-10-16T06:09:31Z"
    message: The node has encountered an anomaly.
    reason: NodeBroken
    status: "True"
    type: NodeAnomaly
```

The fields in the condition are as follows:

| Field | Description | When it is updated |
| --- | --- | --- |
| type | `NodeAnomaly`. Indicates a node anomaly. | Static field. Does not change during the node lifecycle. |
| status | Indicates whether a fault exists. `True`: A fault exists. `False`: No fault exists. | Updated when the node's fault status changes. |
| reason | Fault type. It can be one of the following: `NodeBroken`: The entire GPU-HPN node has failed. `GPUCardBroken`: A GPU card-level failure has occurred. `NodeMaintenance`: The GPU-HPN node is undergoing a system upgrade or O&M. | Updated when the node's fault status changes. |
| message | Records detailed information about the fault. | Updated when the node's fault status changes. |
| lastTransitionTime | The time when the fault status last changed. | Updated when the node's fault status changes. |
| lastHeartbeatTime | The regularly updated heartbeat time. | Updated when the node's fault status changes, or when more than five minutes have passed since the last update. |
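A monitoring script might read this condition from the Node object. The following is a minimal Python sketch (helper names are illustrative, not part of ACS; the node is represented as the dict returned by the Kubernetes API):

```python
# Illustrative helpers: locate the NodeAnomaly condition on a Node object
# and decide whether the node is currently faulty.
def get_node_anomaly(node):
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "NodeAnomaly":
            return cond
    return None

def is_faulty(node):
    cond = get_node_anomaly(node)
    return cond is not None and cond.get("status") == "True"

# Sample node shaped like the YAML above.
node = {
    "metadata": {"labels": {"alibabacloud.com/node-anomaly": "true"}},
    "status": {"conditions": [{
        "lastHeartbeatTime": "2024-10-16T06:09:31Z",
        "lastTransitionTime": "2024-10-16T06:09:31Z",
        "message": "The node has encountered an anomaly.",
        "reason": "NodeBroken",
        "status": "True",
        "type": "NodeAnomaly",
    }]},
}

print(is_faulty(node), get_node_anomaly(node)["reason"])  # True NodeBroken
```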
Detailed self-healing information is recorded in the node's `conditions` field, in a condition with the type `FaultHealing`. This condition is updated as the self-healing process progresses.

```yaml
status:
  conditions:
  - lastHeartbeatTime: "2025-03-24T11:14:48Z"
    lastTransitionTime: "2025-03-24T11:14:48Z"
    message: node fault healing success
    reason: Success
    status: "False"
    type: FaultHealing
```

The fields in the condition are as follows:

| Field | Description |
| --- | --- |
| type | `FaultHealing`. Indicates that the node is in the self-healing process. |
| status | Indicates whether the node is currently self-healing. `True`: Self-healing is in progress. `False`: Self-healing is complete. |
| reason | The node's self-healing status. `Success`, `Finished`: Self-healing is complete. `InProgress`: Self-healing is in progress. `Failed`: Self-healing failed. |
| message | Records detailed information about the self-healing progress. |
| lastTransitionTime | Updated when the self-healing progress changes. |
| lastHeartbeatTime | Updated when the self-healing progress changes. |
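The `status` and `reason` combinations above can be folded into a single healing state. A minimal sketch, with an illustrative helper name:

```python
# Illustrative helper: summarize a FaultHealing condition into one state.
HEALING_DONE = {"Success", "Finished"}

def healing_state(condition):
    if condition.get("type") != "FaultHealing":
        raise ValueError("not a FaultHealing condition")
    if condition.get("status") == "True":
        return "in-progress"
    reason = condition.get("reason")
    if reason in HEALING_DONE:
        return "complete"
    if reason == "Failed":
        return "failed"
    return "unknown"

cond = {
    "type": "FaultHealing",
    "status": "False",
    "reason": "Success",
    "message": "node fault healing success",
}
print(healing_state(cond))  # complete
```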
Pod Event
A Warning event is generated for pods that are running on the faulty node.
After you receive the fault information, evict the pods from the faulty node as soon as possible. ACS automatically starts the node repair and self-healing process after all pods are evicted. You can use acs-instance-helper to automatically evict pods. For more information, see Configure automatic rotation for instances with hardware exceptions.
```yaml
reason: NodeBroken
type: Warning
message: 'The pod is proposed to be evicted at 2024-10-16 07:21:54 +0000 UTC, reason: xxx'
```

The following table describes the fields in the event.
| Field | Description |
| --- | --- |
| type | Static field. The value is `Warning`. |
| reason | Fault type. It can be one of the following: `NodeBroken`, `GPUCardBroken`, `NodeMaintenance`. |
| message | Records detailed information about the fault and the proposed eviction time. |
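To act on these events programmatically, a monitoring script might filter a pod's events by the documented fault reasons. A minimal Python sketch (the function name is illustrative, and the assumption that event reasons match the node fault types listed above is drawn from this topic's tables):

```python
# Illustrative filter: keep only Warning events whose reason is one of
# the documented GPU-HPN fault types.
FAULT_REASONS = {"NodeBroken", "GPUCardBroken", "NodeMaintenance"}

def fault_events(events):
    # Each event is represented as a dict with the fields documented above.
    return [
        e for e in events
        if e.get("type") == "Warning" and e.get("reason") in FAULT_REASONS
    ]

events = [
    {"type": "Normal", "reason": "Scheduled", "message": "..."},
    {"type": "Warning", "reason": "NodeBroken",
     "message": "The pod is proposed to be evicted at 2024-10-16 07:21:54 +0000 UTC, reason: xxx"},
]
print(len(fault_events(events)))  # 1
```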