
Container Compute Service: Monitor and recover from GPU-HPN node faults

Last Updated: Feb 11, 2026

During AI training, a GPU node fault can cause tasks to hang or terminate abnormally, which affects task efficiency. Alibaba Cloud Container Service for Kubernetes (ACS) provides fault monitoring for GPU-HPN nodes. If a node becomes abnormal, ACS marks the node and reports the issue through Kubernetes events and conditions. This topic describes the fault metrics, data retrieval methods, and recovery process for GPU-HPN nodes.

Node fault handling process

ACS continuously runs health checks on GPU-HPN nodes. If a node becomes faulty, follow this process to repair it using the ACS self-healing mechanism.

  1. Fault notification

    ACS reports the cause of the fault through events and node conditions. It also adds a taint to the node to prevent new pods from being scheduled to it.

  2. Drain faulty node

    After you receive a fault notification, you must promptly evict the pods from the faulty node. You can use acs-instance-helper to automatically evict pods. For more information, see Configure automatic rotation for instances with hardware exceptions.

  3. Repair faulty node

    After the faulty node is drained, ACS automatically starts the repair process.

  4. Node self-recovery

    After ACS repairs the node, the related taints and conditions on the node are automatically restored to a normal state. New pods can then be scheduled to the node.
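The fault markers described in step 1 can be inspected programmatically. The following is a minimal sketch, assuming a Node object parsed from `kubectl get node <name> -o json`; the `node_fault_state` helper is illustrative and not part of ACS, but the label, taint, and condition names are the ones documented in the sections below.

```python
def node_fault_state(node: dict) -> dict:
    """Summarize the GPU-HPN fault state of a Node object
    (parsed from `kubectl get node <name> -o json`)."""
    meta = node.get("metadata", {})
    spec = node.get("spec", {})
    status = node.get("status", {})

    # 1. Anomaly label that ACS adds for filtering and viewing.
    labeled = meta.get("labels", {}).get("alibabacloud.com/node-anomaly") == "true"

    # 2. Taint that keeps newly submitted pods off the node.
    tainted = any(
        t.get("key") == "alibabacloud.com/node-anomaly"
        for t in spec.get("taints", [])
    )

    # 3. NodeAnomaly condition carrying the fault reason
    #    (NodeBroken, GPUCardBroken, or NodeMaintenance).
    anomaly = next(
        (c for c in status.get("conditions", []) if c.get("type") == "NodeAnomaly"),
        None,
    )
    faulty = anomaly is not None and anomaly.get("status") == "True"

    return {
        "labeled": labeled,
        "tainted": tainted,
        "faulty": faulty,
        "reason": anomaly.get("reason") if faulty else None,
    }
```

A monitoring loop could call this helper on each GPU-HPN node and page the on-call operator whenever `faulty` is `True`.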

Fault notifications

Important

To ensure prompt handling of faults, configure alert conditions using kube-eventer. For more information, see kube-eventer.

When a GPU-HPN node is faulty, ACS provides fault information using conditions on the Node object and events on the corresponding pods.

Node

  • An anomaly label is added to the node for filtering and viewing.

    metadata:
      labels:
        alibabacloud.com/node-anomaly: "true"
  • A taint named alibabacloud.com/node-anomaly is added to the node. By default, newly submitted pods are not scheduled to this node.

    spec:
      taints:
      - effect: NoSchedule
        key: alibabacloud.com/node-anomaly
        timeAdded: "2024-10-16T06:09:27Z"
  • Detailed fault information is recorded in the node's conditions field in a condition with the type `NodeAnomaly`.

    status:
      conditions:
      - lastHeartbeatTime: "2024-10-16T06:09:31Z"
        lastTransitionTime: "2024-10-16T06:09:31Z"
        message: The node has encountered an anomaly.
        reason: NodeBroken
        status: "True"
        type: NodeAnomaly

    The fields in conditions are as follows:

    | Field | Description | When it is updated |
    | --- | --- | --- |
    | type | `NodeAnomaly`. Indicates a node anomaly. | Static field. Does not change during the node lifecycle. |
    | status | Indicates whether a fault exists. `True`: a fault exists. `False`: no fault exists. | Updated when the node's fault status changes. |
    | reason | Fault type. One of: `NodeBroken` (the entire GPU-HPN node has failed), `GPUCardBroken` (a GPU card-level failure has occurred), `NodeMaintenance` (the node is undergoing a system upgrade or O&M). | Updated when the node's fault status changes. |
    | message | Detailed information about the fault. | Updated when the node's fault status changes. |
    | lastTransitionTime | The time when the fault status last changed. | Updated when the node's fault status changes. |
    | lastHeartbeatTime | The regularly updated heartbeat time. | Updated when the fault status changes, or when more than five minutes have passed since the last update. |

  • Detailed self-healing information is recorded in the node's conditions field in a condition with the type `FaultHealing`. This condition is updated as the self-healing process progresses.

    status:
      conditions:
      - lastHeartbeatTime: "2025-03-24T11:14:48Z"
        lastTransitionTime: "2025-03-24T11:14:48Z"
        message: node fault healing success
        reason: Success
        status: "False"
        type: FaultHealing

    The fields in conditions are as follows:

    | Field | Description |
    | --- | --- |
    | type | `FaultHealing`. Indicates that the node is in the self-healing process. |
    | status | Indicates whether the node is currently self-healing. `True`: self-healing is in progress. `False`: self-healing is complete. |
    | reason | The node's self-healing status. `Success` or `Finished`: self-healing is complete. `InProgress`: self-healing is in progress. `Failed`: self-healing failed. |
    | message | Detailed information about the self-healing progress. |
    | lastTransitionTime | Updated when the self-healing progress changes. |
    | lastHeartbeatTime | Updated when the self-healing progress changes. |
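Reading the `NodeAnomaly` and `FaultHealing` conditions together tells you where a node is in its lifecycle. The following is a minimal sketch of that interpretation, assuming a Node object parsed from `kubectl get node <name> -o json`; the `healing_status` helper and its return strings are illustrative, while the condition types and reason values are the ones documented above.

```python
def healing_status(node: dict) -> str:
    """Classify a node's state from its FaultHealing and NodeAnomaly
    conditions, using the reason values documented above."""
    conds = {c.get("type"): c for c in node.get("status", {}).get("conditions", [])}
    healing = conds.get("FaultHealing")
    anomaly = conds.get("NodeAnomaly")

    # Self-healing is running: wait for it to finish.
    if healing and healing.get("status") == "True":
        return "healing-in-progress"
    # Self-healing ended with reason Failed: escalate to support.
    if healing and healing.get("reason") == "Failed":
        return "healing-failed"
    # Fault reported but self-healing has not started (pods may still
    # need to be evicted before ACS begins the repair).
    if anomaly and anomaly.get("status") == "True":
        return "faulty"
    # Conditions cleared or absent: the node is schedulable again.
    return "healthy"
```

Such a classifier is useful in an operations dashboard: nodes stuck in `faulty` usually still have pods that must be drained before repair can start.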

Pod Event

A Warning event is generated for pods that are running on the faulty node.

Important

After you receive the fault information, evict the pods from the faulty node as soon as possible. ACS automatically starts the node repair and self-healing process after all pods are evicted. You can use acs-instance-helper to automatically evict pods. For more information, see Configure automatic rotation for instances with hardware exceptions.

reason: NodeBroken
type: Warning
message: 'The pod is proposed to be evicted at 2024-10-16 07:21:54 +0000 UTC, reason: xxx'

The following table describes the fields in the event.

| Field | Description |
| --- | --- |
| type | Static field. The value is `Warning`. |
| reason | Fault type. One of: `NodeBroken` (the entire GPU-HPN node has failed), `GPUCardBroken` (a GPU card-level failure has occurred), `NodeMaintenance` (the GPU-HPN node is undergoing a system upgrade or O&M). |
| message | Detailed information about the fault, including the proposed eviction time. |
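These Warning events can be used to find the pods that need eviction. The following is a minimal sketch, assuming Event objects parsed from `kubectl get events -o json` (the `.items` array); the `pods_to_evict` helper is illustrative, while the `type` and `reason` values match the table above.

```python
# Fault reasons documented for GPU-HPN nodes.
FAULT_REASONS = {"NodeBroken", "GPUCardBroken", "NodeMaintenance"}

def pods_to_evict(events: list) -> list:
    """Return the names of pods that received a GPU-HPN fault
    Warning event, deduplicated and sorted."""
    names = set()
    for e in events:
        obj = e.get("involvedObject", {})
        if (
            e.get("type") == "Warning"
            and e.get("reason") in FAULT_REASONS
            and obj.get("kind") == "Pod"
        ):
            names.add(obj.get("name"))
    return sorted(names)
```

The resulting pod list is what acs-instance-helper would act on when it rotates instances automatically; running this check yourself is only needed if you evict pods manually.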