
Container Service for Kubernetes:Enable node auto-healing

Last Updated:Mar 26, 2026

Node auto repair monitors node health in managed node pools and triggers self-healing tasks when issues are detected. When ACK detects a node failure, it automatically repairs faulty system components. For hardware-level failures, it can also restart the node or initiate hardware repair. This reduces manual intervention for common failure scenarios.

Auto repair cannot resolve all failures. Some severe or complex issues still require manual intervention.

How it works

The fault detection, notification, and repair workflow:

  1. Fault detection — ACK uses the ack-node-problem-detector (NPD) add-on to check for node exceptions. When a node becomes unhealthy and remains in that state beyond a specified threshold, ACK treats it as a failure.

  2. Notification — When a fault is detected, ACK generates a node condition and a Kubernetes event. Configure alerts in the Event Center to receive notifications.

  3. (Exclusive GPUs) Fault isolation — After a GPU exception is detected, ACK isolates the faulty GPU card. For details, see GPU exception detection and automatic isolation.

  4. Repair — ACK takes different actions depending on whether the failure is a system/Kubernetes component issue or a node instance issue:

    | Failure type | Repair process |
    | --- | --- |
    | System and Kubernetes add-on exceptions | 1. ACK repairs the faulty system and Kubernetes add-ons — for example, restarting the kubelet or container runtime. 2. If Reboot Node on System/Kubernetes Component Failure is enabled and the initial repair fails, ACK marks the node as unschedulable, drains it, restarts it, and makes it schedulable again. See System and Kubernetes add-on exceptions. |
    | Node instance exceptions | 1. ACK adds a taint to the faulty node. 2. If Repair Nodes Only After Acquiring Permissions is enabled, ACK waits for your authorization. 3. ACK drains the node and performs a repair action — restarting the node or initiating hardware repair. 4. When the node returns to normal, ACK removes the taint. See Node instance exceptions. |

Repair serialization:

  • If a cluster has multiple node pools, ACK repairs them serially, one node pool at a time.

  • If a node pool has multiple unhealthy nodes, ACK repairs them serially, one by one. If a node fails to heal, ACK stops the auto repair process for all other faulty nodes in that node pool.
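The serialization rules above can be sketched as a small simulation. The pool and node names and the `try_repair` callback are hypothetical illustrations of the ordering and stop-on-failure behavior, not ACK code.

```python
# Sketch of repair serialization: pools are processed one at a time, nodes
# within a pool one by one, and a failed repair stops that pool's queue.
def repair_cluster(node_pools, try_repair):
    results = {}
    for pool, faulty_nodes in node_pools.items():   # serial across node pools
        for node in faulty_nodes:                   # serial within a node pool
            if try_repair(node):
                results[node] = "Normal"
            else:
                results[node] = "Recovery failed"
                # Remaining faulty nodes in this pool are skipped until the
                # failed node is fixed manually.
                break
    return results

# Example: node-b fails to heal, so node-c in the same pool is never attempted.
outcome = repair_cluster(
    {"pool-1": ["node-a", "node-b", "node-c"], "pool-2": ["node-d"]},
    try_repair=lambda node: node != "node-b",
)
```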

Prerequisites

Before you begin, ensure that you have:

  • An ACK managed cluster (auto repair is not available for other cluster types)

  • A managed node pool or Lingjun node pool (auto repair is not supported for other node pool types)

  • The NPD add-on installed and the Event Center configured — see Event monitoring

  • (For node instance exception repair) NPD version 1.2.26 or later, and allowlist access — submit a ticket to request access. NPD version 1.2.26 is in phased release.

  • (For Lingjun node pools) Allowlist access — submit a ticket to request access.

After enabling node auto repair, enable alert management and activate the Cluster Node auto repair Alert Rule Set and Cluster GPU Monitoring Alert Rule Set in the Event Center. These alert rule sets are in phased release and may not be visible yet. For details, see Container Service Alert Management.

Limitations

Auto repair does not trigger in these scenarios:

  • Node pool type: Auto repair is available only for managed node pools and Lingjun node pools in ACK managed clusters.

  • Unsupported node conditions: Auto repair covers only the specific system component and node instance conditions listed in the trigger tables below.

  • Repair chain stops on failure: If a node in a node pool fails to heal, ACK stops auto repair for all other faulty nodes in that node pool until the original fault is resolved.

  • Recovery failed nodes: A node in Recovery failed status will not trigger another auto repair until the underlying fault is resolved manually.

  • Already-unschedulable nodes: If a node was already unschedulable before repair started, ACK will not automatically make it schedulable after the repair completes.

Enable node auto repair

Enable and configure node auto repair through the managed node pool settings. The steps are the same for Lingjun node pools.

New node pools

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster to manage. In the left navigation pane, choose Nodes > Node Pools.

  3. On the Node Pools page, click Create Node Pool. In the Configure Managed Node Pool section, select Custom Node Management. Enable Auto Repair and configure the Reboot Node on System/Kubernetes Component Failure option. Follow the on-screen instructions to complete the node pool creation. For a full description of all configuration options, see Create and manage a node pool.


Existing node pools

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster to manage. In the left navigation pane, choose Nodes > Node Pools.

  3. In the node pool list, find the target node pool. In the Actions column, choose Enable Managed Node Pool (for a regular node pool) or Configure Managed Node Pool (for a managed node pool). In the Configure Managed Node Pool section, select Custom Node Management. Enable Auto Repair and configure the Reboot Node on System/Kubernetes Component Failure option. Follow the on-screen instructions to submit the configuration. For a full description of all configuration options, see Create and manage a node pool.


System and Kubernetes add-on exceptions

Repair process

ACK initiates a repair task based on the node's condition. Run kubectl describe node to view the node's conditions.
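A minimal sketch of inspecting those conditions, assuming the JSON shape produced by `kubectl get node <node-name> -o json`; the sample document below is hypothetical data, not output from a real cluster.

```python
import json

def unhealthy_conditions(node_json):
    """List the problem conditions on a node. For `Ready`, a status other
    than "True" is bad; for problem-type conditions such as RuntimeOffline,
    a status of "True" is bad."""
    bad = []
    for cond in node_json["status"]["conditions"]:
        if cond["type"] == "Ready":
            if cond["status"] != "True":
                bad.append(cond["type"])
        elif cond["status"] == "True":
            bad.append(cond["type"])
    return bad

# Hypothetical node status, shaped like `kubectl get node -o json` output.
sample = json.loads("""{"status": {"conditions": [
    {"type": "Ready", "status": "False"},
    {"type": "RuntimeOffline", "status": "True"},
    {"type": "NTPProblem", "status": "False"}
]}}""")
```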

When an exception persists beyond the threshold for that condition, ACK starts the following repair process:

  1. ACK attempts to repair the faulty system and Kubernetes add-ons — for example, by restarting the kubelet or container runtime.

  2. If Reboot Node on System/Kubernetes Component Failure is enabled and the initial repair actions fail:

    1. ACK marks the faulty node as unschedulable.

    2. ACK drains the node. The drain operation times out after 30 minutes. ACK evicts Pods while respecting configured Pod Disruption Budgets (PDBs). To maintain high availability, deploy workloads with multiple replicas across different nodes and configure PDBs for critical services. If the drain fails, ACK still proceeds with the subsequent steps.

    3. ACK restarts the node.

    4. When the node returns to normal, ACK makes it schedulable again. Exception: If the node was already unschedulable before repair started, ACK will not automatically make it schedulable after the repair.
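The four steps above can be walked through as a simulation. The node dict and the `drain` callback are simplified stand-ins for ACK internals, not a real API; the point is the ordering and the already-unschedulable exception.

```python
DRAIN_TIMEOUT_MINUTES = 30  # drain timeout stated in the process above

def reboot_repair(node, drain):
    """Cordon, drain, restart, then uncordon, unless the node was already
    unschedulable before the repair started."""
    was_unschedulable = node["unschedulable"]
    node["unschedulable"] = True                       # 1. cordon
    try:
        drain(node, timeout_minutes=DRAIN_TIMEOUT_MINUTES)  # 2. drain
    except RuntimeError:
        pass                                           # a failed drain does not stop the repair
    node["restarted"] = True                           # 3. restart the node
    if not was_unschedulable:
        node["unschedulable"] = False                  # 4. restore schedulability
    return node

repaired = reboot_repair(
    {"name": "node-a", "unschedulable": False, "restarted": False},
    drain=lambda node, timeout_minutes: None,          # pretend the drain succeeds
)
```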

Node conditions that trigger auto repair

| Node condition | Description | Risk level | Threshold | Repair action |
| --- | --- | --- | --- | --- |
| KubeletNotReady(KubeletHung) | The kubelet stopped unexpectedly, causing the node to report NotReady. | High | 180s | 1. Restart the kubelet. 2. If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| KubeletNotReady(PLEG) | The Pod Lifecycle Event Generator (PLEG) health check failed, causing the node to report NotReady. | Medium | 180s | 1. Restart containerd or Docker. 2. Restart the kubelet. 3. If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| KubeletNotReady(SandboxError) | A PodSandbox was not found, preventing the kubelet from starting correctly. | High | 180s | 1. Delete the corresponding sandbox container. 2. Restart the kubelet. |
| RuntimeOffline | containerd or Docker stopped, making the node unavailable. | High | 90s | 1. Restart containerd or Docker. 2. If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| NTPProblem | The time synchronization service (ntpd or chronyd) is abnormal. | High | 10s | Restart ntpd or chronyd. |
| SystemdOffline | The systemd state is abnormal, preventing containers from starting or stopping. | High | 90s | If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
| ReadonlyFilesystem | The node's filesystem became read-only. | High | 90s | If Reboot Node on System/Kubernetes Component Failure is enabled, restart the ECS instance. |
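The per-condition thresholds lend themselves to a table-driven check. The condition names and thresholds below come from this document; the check itself is an illustrative sketch, not ACK's detector.

```python
# Trigger thresholds in seconds, per node condition.
THRESHOLDS_S = {
    "KubeletNotReady(KubeletHung)": 180,
    "KubeletNotReady(PLEG)": 180,
    "KubeletNotReady(SandboxError)": 180,
    "RuntimeOffline": 90,
    "NTPProblem": 10,
    "SystemdOffline": 90,
    "ReadonlyFilesystem": 90,
}

def should_repair(condition, unhealthy_seconds):
    """Trigger a repair once a known condition has persisted past its
    threshold; unknown conditions never trigger auto repair."""
    threshold = THRESHOLDS_S.get(condition)
    return threshold is not None and unhealthy_seconds >= threshold
```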

Node instance exceptions

Important

Complete all steps in Prerequisites before proceeding.

Repair process

Important

For Lingjun nodes that require hardware repair, the repair process redeploys the node and erases all data on its local disks. Enable Repair Nodes Only After Acquiring Permissions for the node pool so you can back up data before authorizing the repair.

ACK automatically triggers the following repair process 5 minutes after a node instance exception is detected:

  1. ACK adds the following taint to the faulty node:

    • Key: alibabacloud.com/node-needrepair

    • Value: Unschedulable

    • Effect: NoSchedule

  2. If Repair Nodes Only After Acquiring Permissions is enabled, ACK pauses and waits for your authorization. Enable this option if you need to handle workloads on the unhealthy node or back up data before repair begins.

    1. ACK adds the label alibabacloud.com/node-needrepair=Inquiring to the faulty node.

    2. Handle the Pods on the node or back up your data. Then authorize the repair by either deleting the alibabacloud.com/node-needrepair label or setting its value to Approved (alibabacloud.com/node-needrepair=Approved).

    3. ACK proceeds with the next steps after receiving your authorization.

  3. ACK drains the node. The drain operation times out after 30 minutes. ACK evicts Pods while respecting configured PDBs. Deploy workloads with multiple replicas across different nodes and configure PDBs for critical services to maintain high availability. If the drain fails, ACK still proceeds with the subsequent steps.

  4. ACK performs a repair action — restarting the node or initiating hardware repair.

  5. ACK checks whether the node has returned to normal:

    • If the fault is resolved, ACK removes the taint and the node returns to a normal state.

    • If the fault persists or the repair fails, the taint is not removed. ACK periodically sends event notifications. View the events for troubleshooting or submit a ticket.

  6. (For exclusive GPUs) When the GPU card returns to normal, ACK removes its isolation.
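The authorization handshake in step 2 can be sketched as follows. The label key and values come from this document; the helper function is illustrative, not ACK code.

```python
REPAIR_LABEL = "alibabacloud.com/node-needrepair"

def repair_authorized(labels):
    """Repair may proceed when the label is absent (deleted by the user) or
    explicitly set to "Approved"; "Inquiring" means ACK is still waiting."""
    value = labels.get(REPAIR_LABEL)
    return value is None or value == "Approved"

# To authorize the repair from the command line, you would run:
#   kubectl label node <node-name> alibabacloud.com/node-needrepair=Approved --overwrite
```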

Important

After a successful hardware repair on a Lingjun node, manually remove the node from the node pool, then re-add the repaired device as an existing node. For details, see Remove a node and Add an existing Lingjun node.

Node conditions that trigger auto repair

| Node condition | Description | Repair action |
| --- | --- | --- |
| NvidiaXID74Error | Fatal NVLink hardware error. | Repair hardware |
| NvidiaXID79Error | GPU has fallen off the bus and is no longer detectable by the system. | Repair hardware |
| NvidiaRemappingRowsFailed | GPU failed to perform row remapping. | Repair hardware |
| NvidiaDeviceLost | GPU has fallen off the bus or become inaccessible. | Repair hardware |
| NvidiaInfoRomCorrupted | The infoROM is corrupted. | Repair hardware |
| NvidiaPowerCableErr | External power cables are not properly attached. | Repair hardware |
| NvidiaXID95Error | Uncontained ECC error — all applications on the GPU are affected. The GPU must be reset before applications can restart. | Restart node |
| NvidiaXID48Error | Double Bit ECC error (DBE) — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID119Error | Timeout waiting for the GSP core to respond to an RPC message. | Restart node |
| NvidiaXID140Error | Unrecovered ECC error — uncorrectable errors detected in GPU memory, requiring a GPU reset. | Restart node |
| NvidiaXID120Error | Error in code running on the GPU's GSP core. | Restart node |
| NvidiaPendingRetiredPages | GPU has pending retired pages that require a GPU reset to take effect. | Restart node |
| NvidiaRemappingRowsRequireReset | Uncorrectable, uncontained error requiring a GPU reset for recovery. | Restart node |
| NvidiaXID44Error | Graphics Engine fault during a context switch — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID61Error | Internal micro-controller breakpoint or warning — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID62Error | Internal micro-controller halt — uncorrectable, requires a GPU reset or node restart. | Restart node |
| NvidiaXID69Error | Graphics Engine class error — uncorrectable, requires a GPU reset or node restart. | Restart node |

For details on these conditions — including the node conditions they generate and whether they produce an event — see GPU exception detection and automatic isolation.

Node status during repair

| Status | Meaning |
| --- | --- |
| Repairing | A repair task is in progress. |
| Normal | The repair completed and the fault was resolved. |
| Recovery failed | The repair completed but the fault persists. The node will not trigger another auto repair until the underlying fault is resolved. |

Monitor repair events

When ACK triggers a node auto repair operation, it logs events in the Event Center. On the cluster's details page, choose Operations > Event Center to view recovery records and the specific actions taken. Subscribe to these events as described in Event monitoring.

| Event | Level | Description |
| --- | --- | --- |
| NodeRepairStart | Normal | Node auto repair has started. |
| NodeRepairAction | Normal | A repair action was performed, such as restarting the kubelet. |
| NodeRepairSucceed | Normal | Node auto repair succeeded. |
| NodeRepairFailed | Warning | Node auto repair failed. See FAQ. |
| NodeRepairIgnore | Normal | Node auto repair was skipped because the ECS instance was not in a running state. |
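As a sketch, the event levels above can drive a simple filter that surfaces failed repairs. The event dicts are a simplified stand-in for the objects returned by `kubectl get events -o json`, not their real schema.

```python
# Event reasons and levels from the table above.
REPAIR_EVENT_LEVELS = {
    "NodeRepairStart": "Normal",
    "NodeRepairAction": "Normal",
    "NodeRepairSucceed": "Normal",
    "NodeRepairFailed": "Warning",
    "NodeRepairIgnore": "Normal",
}

def failed_repairs(events):
    """Return the nodes whose auto repair emitted a Warning-level event."""
    return [e["node"] for e in events
            if REPAIR_EVENT_LEVELS.get(e["reason"]) == "Warning"]

alerts = failed_repairs([
    {"node": "node-a", "reason": "NodeRepairSucceed"},
    {"node": "node-b", "reason": "NodeRepairFailed"},
])
```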

FAQ

What should I do if node auto repair fails?

Auto repair cannot resolve all failures due to the complexity of some fault scenarios. When a repair task fails or the fault persists after the task completes, ACK sets the node status to Recovery failed and stops auto repair for other faulty nodes in the same node pool until the original fault is resolved. Submit a ticket to contact technical support.

What's next