
Container Service for Kubernetes:Enable node auto-healing

Last Updated:Mar 26, 2026

Node auto repair monitors node health in managed node pools and triggers self-healing tasks when issues are detected. When ACK detects a node failure, it automatically repairs faulty system components. For hardware-level failures, it can also restart the node or initiate hardware repair. This reduces manual intervention for common failure scenarios.

Auto repair cannot resolve all failures. Some severe or complex issues still require manual intervention.

How it works

The fault detection, notification, and repair workflow:

  1. Fault detection — ACK uses the ack-node-problem-detector (NPD) add-on to check for node exceptions. When a node becomes unhealthy and remains in that state beyond a specified threshold, ACK treats it as a failure.

  2. Notification — When a fault is detected, ACK generates a node condition and a Kubernetes event. Configure alerts in the Event Center to receive notifications.

  3. (Exclusive GPUs) Fault isolation — After a GPU exception is detected, ACK isolates the faulty GPU card. For details, see GPU exception detection and automatic isolation.

  4. Repair — ACK takes different actions depending on whether the failure is a system/Kubernetes component issue or a node instance issue:

    | Failure type | Repair process |
    | --- | --- |
    | System and Kubernetes add-on exceptions | 1. ACK repairs the faulty system and Kubernetes add-ons — for example, restarting the kubelet or container runtime. 2. If Reboot Node on System/Kubernetes Component Failure is enabled and the initial repair fails, ACK marks the node as unschedulable, drains it, restarts it, and makes it schedulable again. See System and Kubernetes add-on exceptions. |
    | Node instance exceptions | 1. ACK adds a taint to the faulty node. 2. If Repair Nodes Only After Acquiring Permissions is enabled, ACK waits for your authorization. 3. ACK drains the node and performs a repair action — restarting the node or initiating hardware repair. 4. When the node returns to normal, ACK removes the taint. See Node instance exceptions. |

Repair serialization:

  • If a cluster has multiple node pools, ACK repairs them serially, one node pool at a time.

  • If a node pool has multiple unhealthy nodes, ACK repairs them serially, one by one. If a node fails to heal, ACK stops the auto repair process for all other faulty nodes in that node pool.
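The serialization rules above can be sketched as a small simulation. The pool and node names and the `try_repair` callback are hypothetical illustrations of the ordering and stop-on-failure behavior, not ACK code.

```python
# Sketch of repair serialization: pools are processed one at a time, nodes
# within a pool one by one, and a failed repair stops that pool's queue.
def repair_cluster(node_pools, try_repair):
    results = {}
    for pool, faulty_nodes in node_pools.items():   # serial across node pools
        for node in faulty_nodes:                   # serial within a node pool
            if try_repair(node):
                results[node] = "Normal"
            else:
                results[node] = "Recovery failed"
                # Remaining faulty nodes in this pool are skipped until the
                # failed node is fixed manually.
                break
    return results

# Example: node-b fails to heal, so node-c in the same pool is never attempted.
outcome = repair_cluster(
    {"pool-1": ["node-a", "node-b", "node-c"], "pool-2": ["node-d"]},
    try_repair=lambda node: node != "node-b",
)
```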

Prerequisites

Before you begin, ensure that you have:

  • An ACK managed cluster (auto repair is not available for other cluster types)

  • A managed node pool or Lingjun node pool (auto repair is not supported for other node pool types)

  • The NPD add-on installed and the Event Center configured — see Event monitoring

  • (For node instance exception repair) NPD version 1.2.26 or later, and allowlist access — submit a ticket to request access. NPD version 1.2.26 is in phased release.

  • (For Lingjun node pools) Allowlist access — submit a ticket to request access.

After enabling node auto repair, enable alert management and activate the Cluster Node auto repair Alert Rule Set and Cluster GPU Monitoring Alert Rule Set in the Event Center. These alert rule sets are in phased release and may not be visible yet. For details, see Container Service Alert Management.

Limitations

Auto repair does not trigger in these scenarios:

  • Node pool type: Auto repair is available only for managed node pools and Lingjun node pools in ACK managed clusters.

  • Unsupported node conditions: Auto repair covers only the specific system component and node instance conditions listed in the trigger tables below.

  • Repair chain stops on failure: If a node in a node pool fails to heal, ACK stops auto repair for all other faulty nodes in that node pool until the original fault is resolved.

  • Recovery failed nodes: A node in Recovery failed status will not trigger another auto repair until the underlying fault is resolved manually.

  • Already-unschedulable nodes: If a node was already unschedulable before repair started, ACK will not automatically make it schedulable after the repair completes.

Enable node auto repair

Enable and configure node auto repair through the managed node pool settings. The steps are the same for Lingjun node pools.

New node pools

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster to manage. In the left navigation pane, choose Nodes > Node Pools.

  3. On the Node Pools page, click Create Node Pool. In the Configure Managed Node Pool section, select Custom Node Management. Enable Auto Repair and configure the Reboot Node on System/Kubernetes Component Failure option. Follow the on-screen instructions to complete the node pool creation. For a full description of all configuration options, see Create and manage a node pool.


Existing node pools

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster to manage. In the left navigation pane, choose Nodes > Node Pools.

  3. In the node pool list, find the target node pool. In the Actions column, choose Enable Managed Node Pool (for a regular node pool) or Configure Managed Node Pool (for a managed node pool). In the Configure Managed Node Pool section, select Custom Node Management. Enable Auto Repair and configure the Reboot Node on System/Kubernetes Component Failure option. Follow the on-screen instructions to submit the configuration. For a full description of all configuration options, see Create and manage a node pool.


System and Kubernetes add-on exceptions

Repair process

ACK initiates a repair task based on the node's condition. Run kubectl describe node to view the node's conditions.
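A minimal sketch of inspecting those conditions, assuming the JSON shape produced by `kubectl get node <node-name> -o json`; the sample document below is hypothetical data, not output from a real cluster.

```python
import json

def unhealthy_conditions(node_json):
    """List the problem conditions on a node. For `Ready`, a status other
    than "True" is bad; for problem-type conditions such as RuntimeOffline,
    a status of "True" is bad."""
    bad = []
    for cond in node_json["status"]["conditions"]:
        if cond["type"] == "Ready":
            if cond["status"] != "True":
                bad.append(cond["type"])
        elif cond["status"] == "True":
            bad.append(cond["type"])
    return bad

# Hypothetical node status, shaped like `kubectl get node -o json` output.
sample = json.loads("""{"status": {"conditions": [
    {"type": "Ready", "status": "False"},
    {"type": "RuntimeOffline", "status": "True"},
    {"type": "NTPProblem", "status": "False"}
]}}""")
```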

When an exception persists beyond the threshold for that condition, ACK starts the following repair process:

  1. ACK attempts to repair the faulty system and Kubernetes add-ons — for example, by restarting the kubelet or container runtime.

  2. If Reboot Node on System/Kubernetes Component Failure is enabled and the initial repair actions fail:

    1. ACK marks the faulty node as unschedulable.

    2. ACK drains the node. The drain operation times out after 30 minutes. ACK evicts Pods while respecting configured Pod Disruption Budgets (PDBs). To maintain high availability, deploy workloads with multiple replicas across different nodes and configure PDBs for critical services. If the drain fails, ACK still proceeds with the subsequent steps.

    3. ACK restarts the node.

    4. When the node returns to normal, ACK makes it schedulable again. Exception: If the node was already unschedulable before repair started, ACK will not automatically make it schedulable after the repair.
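The four steps above can be walked through as a simulation. The node dict and the `drain` callback are simplified stand-ins for ACK internals, not a real API; the point is the ordering and the already-unschedulable exception.

```python
DRAIN_TIMEOUT_MINUTES = 30  # drain timeout stated in the process above

def reboot_repair(node, drain):
    """Cordon, drain, restart, then uncordon, unless the node was already
    unschedulable before the repair started."""
    was_unschedulable = node["unschedulable"]
    node["unschedulable"] = True                       # 1. cordon
    try:
        drain(node, timeout_minutes=DRAIN_TIMEOUT_MINUTES)  # 2. drain
    except RuntimeError:
        pass                                           # a failed drain does not stop the repair
    node["restarted"] = True                           # 3. restart the node
    if not was_unschedulable:
        node["unschedulable"] = False                  # 4. restore schedulability
    return node

repaired = reboot_repair(
    {"name": "node-a", "unschedulable": False, "restarted": False},
    drain=lambda node, timeout_minutes: None,          # pretend the drain succeeds
)
```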

Node conditions that trigger auto repair

| Node condition | Description | Risk level | Threshold | Repair action |
| --- | --- | --- | --- | --- |
| KubeletNotReady(KubeletHung) | The kubelet stopped unexpectedly, causing the node to report NotReady. | High | 180s | 1. Restart the kubelet. 2. If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| KubeletNotReady(PLEG) | The Pod Lifecycle Event Generator (PLEG) health check failed, causing the node to report NotReady. | Medium | 180s | 1. Restart containerd or Docker. 2. Restart the kubelet. 3. If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| KubeletNotReady(SandboxError) | A PodSandbox was not found, preventing the kubelet from starting correctly. | High | 180s | 1. Delete the corresponding sandbox container. 2. Restart the kubelet. |
| RuntimeOffline | containerd or Docker stopped, making the node unavailable. | High | 90s | 1. Restart containerd or Docker. 2. If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| NTPProblem | The time synchronization service (ntpd or chronyd) is abnormal. | High | 10s | Restart ntpd or chronyd. |
| SystemdOffline | The systemd state is abnormal, preventing containers from starting or stopping. | High | 90s | If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| ReadonlyFilesystem | The node's filesystem became read-only. | High | 90s | If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
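The per-condition thresholds lend themselves to a table-driven check. The condition names and thresholds below come from this document; the check itself is an illustrative sketch, not ACK's detector.

```python
# Trigger thresholds in seconds, per node condition.
THRESHOLDS_S = {
    "KubeletNotReady(KubeletHung)": 180,
    "KubeletNotReady(PLEG)": 180,
    "KubeletNotReady(SandboxError)": 180,
    "RuntimeOffline": 90,
    "NTPProblem": 10,
    "SystemdOffline": 90,
    "ReadonlyFilesystem": 90,
}

def should_repair(condition, unhealthy_seconds):
    """Trigger a repair once a known condition has persisted past its
    threshold; unknown conditions never trigger auto repair."""
    threshold = THRESHOLDS_S.get(condition)
    return threshold is not None and unhealthy_seconds >= threshold
```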

Node instance exceptions

Important

Complete all steps in Prerequisites before proceeding.

Repair process

Important

For Lingjun nodes that require hardware repair, the repair process redeploys the node and erases all data on its local disks. Enable Repair Nodes Only After Acquiring Permissions for the node pool so you can back up data before authorizing the repair.

ACK automatically triggers the following repair process 5 minutes after a node instance exception is detected:

  1. ACK adds the following taint to the faulty node:

    • Key: alibabacloud.com/node-needrepair

    • Value: Unschedulable

    • Effect: NoSchedule

  2. If Repair Nodes Only After Acquiring Permissions is enabled, ACK pauses and waits for your authorization. Enable this option if you need to handle workloads on the unhealthy node or back up data before repair begins.

    1. ACK adds the label alibabacloud.com/node-needrepair=Inquiring to the faulty node.

    2. Handle the Pods on the node or back up your data. Then authorize the repair by either deleting the alibabacloud.com/node-needrepair label or setting its value to Approved (alibabacloud.com/node-needrepair=Approved).

    3. ACK proceeds with the next steps after receiving your authorization.

  3. ACK drains the node. The drain operation times out after 30 minutes. ACK evicts Pods while respecting configured PDBs. Deploy workloads with multiple replicas across different nodes and configure PDBs for critical services to maintain high availability. If the drain fails, ACK still proceeds with the subsequent steps.

  4. ACK performs a repair action — restarting the node or initiating hardware repair.

  5. ACK checks whether the node has returned to normal:

    • If the fault is resolved, ACK removes the taint and the node returns to a normal state.

    • If the fault persists or the repair fails, the taint is not removed. ACK periodically sends event notifications. View the events for troubleshooting or submit a ticket.

  6. (For exclusive GPUs) When the GPU card returns to normal, ACK removes its isolation.
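The authorization handshake in step 2 can be sketched as follows. The label key and values come from this document; the helper function is illustrative, not ACK code.

```python
REPAIR_LABEL = "alibabacloud.com/node-needrepair"

def repair_authorized(labels):
    """Repair may proceed when the label is absent (deleted by the user) or
    explicitly set to "Approved"; "Inquiring" means ACK is still waiting."""
    value = labels.get(REPAIR_LABEL)
    return value is None or value == "Approved"

# To authorize the repair from the command line, you would run:
#   kubectl label node <node-name> alibabacloud.com/node-needrepair=Approved --overwrite
```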

Important

After a successful hardware repair on a Lingjun node, manually remove the node from the node pool, then re-add the repaired device as an existing node. For details, see Remove a node and Add an existing Lingjun node.

Node conditions that trigger auto repair

| Node condition | Description | Repair action |
| --- | --- | --- |
| NvidiaXID74Error | Fatal NVLink hardware error. | Repair hardware |
| NvidiaXID79Error | GPU has fallen off the bus and is no longer detectable by the system. | Repair hardware |
| NvidiaRemappingRowsFailed | GPU failed to perform row remapping. | Repair hardware |
| NvidiaDeviceLost | GPU has fallen off the bus or become inaccessible. | Repair hardware |
| NvidiaInfoRomCorrupted | The infoROM is corrupted. | Repair hardware |
| NvidiaPowerCableErr | External power cables are not properly attached. | Repair hardware |
| NvidiaXID95Error | Uncontained ECC error — all applications on the GPU are affected. The GPU must be reset before applications can restart. | Restart node |
| NvidiaXID48Error | Double Bit ECC error (DBE) — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID119Error | Timeout waiting for the GSP core to respond to an RPC message. | Restart node |
| NvidiaXID140Error | Unrecovered ECC error — uncorrectable errors detected in GPU memory, requiring a GPU reset. | Restart node |
| NvidiaXID120Error | Error in code running on the GPU's GSP core. | Restart node |
| NvidiaPendingRetiredPages | GPU has pending retired pages that require a GPU reset to take effect. | Restart node |
| NvidiaRemappingRowsRequireReset | Uncorrectable, uncontained error requiring a GPU reset for recovery. | Restart node |
| NvidiaXID44Error | Graphics Engine fault during a context switch — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID61Error | Internal micro-controller breakpoint or warning — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID62Error | Internal micro-controller halt — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID69Error | Graphics Engine class error — uncorrectable, requires a GPU reset or node restart. | Restart node |

For details on these conditions — including the node conditions they generate and whether they produce an event — see GPU exception detection and automatic isolation.

Node status during repair

| Status | Meaning |
| --- | --- |
| Repairing | A repair task is in progress. |
| Normal | The repair completed and the fault was resolved. |
| Recovery failed | The repair completed but the fault persists. The node will not trigger another auto repair until the underlying fault is resolved. |

Monitor repair events

When ACK triggers a node auto repair operation, it logs events in the Event Center. On the cluster's details page, choose Operations > Event Center to view recovery records and the specific actions taken. Subscribe to these events as described in Event monitoring.

| Event | Level | Description |
| --- | --- | --- |
| NodeRepairStart | Normal | Node auto repair has started. |
| NodeRepairAction | Normal | A repair action was performed, such as restarting the kubelet. |
| NodeRepairSucceed | Normal | Node auto repair succeeded. |
| NodeRepairFailed | Warning | Node auto repair failed. See FAQ. |
| NodeRepairIgnore | Normal | Node auto repair was skipped because the ECS instance was not in a running state. |
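As a sketch, the event levels above can drive a simple filter that surfaces failed repairs. The event dicts are a simplified stand-in for the objects returned by `kubectl get events -o json`, not their real schema.

```python
# Event reasons and levels from the table above.
REPAIR_EVENT_LEVELS = {
    "NodeRepairStart": "Normal",
    "NodeRepairAction": "Normal",
    "NodeRepairSucceed": "Normal",
    "NodeRepairFailed": "Warning",
    "NodeRepairIgnore": "Normal",
}

def failed_repairs(events):
    """Return the nodes whose auto repair emitted a Warning-level event."""
    return [e["node"] for e in events
            if REPAIR_EVENT_LEVELS.get(e["reason"]) == "Warning"]

alerts = failed_repairs([
    {"node": "node-a", "reason": "NodeRepairSucceed"},
    {"node": "node-b", "reason": "NodeRepairFailed"},
])
```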

FAQ

What should I do if node auto repair fails?

Auto repair cannot resolve all failures due to the complexity of some fault scenarios. When a repair task fails or the fault persists after the task completes, ACK sets the node status to Recovery failed and stops auto repair for other faulty nodes in the same node pool until the original fault is resolved. Submit a ticket to contact technical support.

What's next