
Container Service for Kubernetes: Enable auto repair for nodes

Last Updated: Feb 17, 2025

After you enable the managed node pool feature for a node pool, you can enable the auto repair feature. Container Service for Kubernetes (ACK) then monitors the status of nodes and runs auto repair tasks when nodes do not work as expected, which simplifies node O&M. Because faults can be complex, auto repair cannot fix every fault, and you must manually repair specific complex faults.

Prerequisites

The Kubernetes event center is enabled to receive alert events from node pools and the ack-node-problem-detector component is installed to detect node exceptions. For more information, see Event monitoring.

Procedure

You can configure the Managed Node Pool parameter for a new node pool or an existing node pool. If you turn on the Restart Faulty Node switch, the auto repair process may involve operations such as draining nodes and replacing disks. We recommend that you store business data on disks.

Enable auto repair when you create a node pool

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. On the Node Pools page, click Create Node Pool, select the Managed Node Pool checkbox, select the Restart Faulty Node checkbox based on your business requirements, and then follow the instructions to create a node pool.

    For more information about the parameters, see Create and manage a node pool.

Enable auto repair for an existing node pool

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. In the node pool list, find the node pool that you want to manage and choose More > Configure Managed Node Pool in the Actions column. On the page that appears, select Managed Node Pool and select Restart Faulty Node based on your business requirements.

Conditions that trigger auto repair

ACK determines whether to run auto repair tasks based on the status (condition) of nodes. To check the status of a node, run the kubectl describe node command and check the Conditions field in the command output. If a node remains in an abnormal state for longer than the threshold, ACK automatically runs auto repair tasks on the node.
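The condition check described above can also be scripted. The following is a minimal sketch, not part of ACK: it assumes you have saved the output of kubectl get node <node-name> -o json and loaded it into a dictionary, and that the items in the trigger table (for example NTPProblem) appear as condition types on the node. The function name abnormal_conditions and the sample data are hypothetical.

```python
from __future__ import annotations


def abnormal_conditions(node: dict) -> list[dict]:
    """Return the conditions of a Node object that look abnormal.

    Ready is healthy only when its status is "True". Every other
    condition is treated as a problem condition that is abnormal when
    its status is "True". A status of "Unknown" on a non-Ready condition
    is not flagged by this sketch.
    """
    abnormal = []
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready":
            is_abnormal = cond["status"] != "True"
        else:
            is_abnormal = cond["status"] == "True"
        if is_abnormal:
            abnormal.append(cond)
    return abnormal


# Hypothetical sample, shaped like `kubectl get node <node-name> -o json`.
sample_node = {
    "metadata": {"name": "example-node"},
    "status": {
        "conditions": [
            {"type": "NTPProblem", "status": "True", "reason": "NTPIsDown"},
            {"type": "MemoryPressure", "status": "False", "reason": "KubeletHasSufficientMemory"},
            {"type": "Ready", "status": "False", "reason": "KubeletNotReady"},
        ]
    },
}

for cond in abnormal_conditions(sample_node):
    print(cond["type"], cond["status"], cond.get("reason", ""))
```

To use this against a real node, load the saved JSON file with json.load and pass the result to abnormal_conditions in place of the sample.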

Important

The node pool monitors the running status of nodes. If the status of a node is not reported for more than 10 minutes or a node is in the NotReady state, ACK restarts the node to repair the faults on the node. In this case, the pods on the node are restarted.

The following table describes the trigger conditions.

| Item | Description | Severity | Threshold | Operation |
| --- | --- | --- | --- | --- |
| KubeletNotReady(KubeletHung) | The node is in the NotReady state because the kubelet stopped running. | High | 180s | 1. Restart the kubelet. 2. Restart the Elastic Compute Service (ECS) instance if Restart Faulty Node is selected. |
| KubeletNotReady(PLEG) | The node is in the NotReady state because the Pod Lifecycle Event Generator (PLEG) module failed to pass health checks. | Medium | 180s | 1. Restart containerd or Docker. 2. Restart the kubelet. 3. Restart the Elastic Compute Service (ECS) instance if Restart Faulty Node is selected. |
| KubeletNotReady(SandboxError) | The kubelet cannot be started because no sandboxed pod was found. | High | 180s | 1. Delete the sandboxed pod. 2. Restart the kubelet. |
| RuntimeOffline | Docker or containerd stopped running and the node is unavailable. | High | 90s | 1. Restart containerd or Docker. 2. Restart the Elastic Compute Service (ECS) instance if Restart Faulty Node is enabled. |
| NTPProblem | The time synchronization service (ntpd or chronyd) is in an abnormal state. | High | 10s | Restart ntpd or chronyd. |
| SystemdOffline | Systemd is in an abnormal state and cannot launch or destroy containers. | High | 90s | Restart the Elastic Compute Service (ECS) instance if Restart Faulty Node is selected. |
| ReadonlyFilesystem | The node file system is read-only. | High | 90s | Restart the Elastic Compute Service (ECS) instance if Restart Faulty Node is selected. |
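To make the trigger conditions easier to reason about, the table can be expressed as data. The following sketch is illustrative only and does not reflect how ACK implements auto repair. The names RepairRule, RULES, and planned_operations are hypothetical, and the sketch assumes that "exceeds the threshold" means strictly longer than the listed number of seconds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RepairRule:
    threshold_s: int                # threshold from the table, in seconds
    operations: tuple[str, ...]     # operations that always run, in order
    restart_ecs_if_selected: bool   # ECS restart depends on Restart Faulty Node


# Values copied from the trigger condition table.
RULES = {
    "KubeletNotReady(KubeletHung)": RepairRule(180, ("Restart the kubelet",), True),
    "KubeletNotReady(PLEG)": RepairRule(
        180, ("Restart containerd or Docker", "Restart the kubelet"), True
    ),
    "KubeletNotReady(SandboxError)": RepairRule(
        180, ("Delete the sandboxed pod", "Restart the kubelet"), False
    ),
    "RuntimeOffline": RepairRule(90, ("Restart containerd or Docker",), True),
    "NTPProblem": RepairRule(10, ("Restart ntpd or chronyd",), False),
    "SystemdOffline": RepairRule(90, (), True),
    "ReadonlyFilesystem": RepairRule(90, (), True),
}


def planned_operations(item: str, abnormal_for_s: float, restart_faulty_node: bool) -> list[str]:
    """Return the operations the table lists for a node in this condition.

    An empty list means that no operation applies, either because the
    threshold has not been exceeded or because the only listed operation
    requires Restart Faulty Node, which is not selected.
    """
    rule = RULES[item]
    if abnormal_for_s <= rule.threshold_s:
        return []
    ops = list(rule.operations)
    if rule.restart_ecs_if_selected and restart_faulty_node:
        ops.append("Restart the ECS instance")
    return ops


print(planned_operations("KubeletNotReady(PLEG)", 200, restart_faulty_node=True))
print(planned_operations("SystemdOffline", 120, restart_faulty_node=False))
```

The second call returns an empty list: for SystemdOffline and ReadonlyFilesystem, the table lists only an ECS restart, so nothing is listed for those conditions unless Restart Faulty Node is selected.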

Auto repair events

After auto repair is triggered, ACK writes the relevant events to the Kubernetes event center. To view the repair records and operations in the Kubernetes event center, go to the cluster details page and choose Operations > Event Center in the left-side navigation pane. For more information about the events, see Event monitoring.

| Cause | Level | Description |
| --- | --- | --- |
| NodeRepairStart | Normal | The system starts to repair the node. |
| NodeRepairAction | Normal | The repair operation on the node, such as restarting the kubelet. |
| NodeRepairSucceed | Normal | The node recovers from exceptions after the repair operation is complete. |
| NodeRepairFailed | Warning | The node fails to recover from exceptions after the repair operation is complete. For more information, see FAQ. |
| NodeRepairIgnore | Normal | The node skips the repair operation. If an ECS node is not in the Running state, the system does not perform auto repair operations on the node. |
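If you export events for offline review, the repair events can be grouped per node. The following is a rough sketch under stated assumptions: it assumes that the Cause column corresponds to the reason field of a Kubernetes Event object, that the Level column corresponds to the type field, and that the node name is available in involvedObject.name. The function name repair_history and the sample events are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict

# Event causes listed in the table above.
REPAIR_REASONS = {
    "NodeRepairStart",
    "NodeRepairAction",
    "NodeRepairSucceed",
    "NodeRepairFailed",
    "NodeRepairIgnore",
}


def repair_history(events: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Group auto repair events by node as (level, cause) pairs.

    Events with any other reason are ignored. Input order is preserved.
    """
    history: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for event in events:
        reason = event.get("reason")
        if reason in REPAIR_REASONS:
            node = event.get("involvedObject", {}).get("name", "<unknown>")
            history[node].append((event.get("type", ""), reason))
    return dict(history)


# Hypothetical events, shaped like items in `kubectl get events -o json`.
sample_events = [
    {"reason": "NodeRepairStart", "type": "Normal", "involvedObject": {"name": "node-a"}},
    {"reason": "Pulled", "type": "Normal", "involvedObject": {"name": "some-pod"}},
    {"reason": "NodeRepairAction", "type": "Normal", "involvedObject": {"name": "node-a"}},
    {"reason": "NodeRepairFailed", "type": "Warning", "involvedObject": {"name": "node-a"}},
]

for node, entries in repair_history(sample_events).items():
    print(node, entries)
```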

Execution rules and process

The following section describes the complete auto repair process.

  1. If a node enters an abnormal state and remains in the state for a period of time, ACK determines that the node is in the Error state.

  2. After a node is considered to be in the Error state, ACK runs specific auto repair tasks to fix the exceptions and generates events.

    • The node is in the Repairing state while the auto repair task is running.

    • If the node exceptions are fixed after the auto repair tasks are complete, the node enters the Normal state.

    • If the node exceptions persist after the repair tasks are complete, the node enters the Failed to Recover state.

      If a node fails to be repaired, ACK no longer triggers auto repair tasks for the node. Auto repair resumes for the node only after the node recovers from the exception.

  • If a cluster contains multiple node pools, auto repair tasks can run on the nodes of different node pools in parallel.

  • If exceptions occur on multiple nodes in a node pool, ACK runs auto repair tasks on the nodes in sequence. If ACK fails to repair one of the nodes, ACK stops running auto repair tasks on the remaining nodes.
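The state transitions and the in-sequence rule for a single node pool can be summarized in a short simulation. This is a sketch for illustration, not ACK's implementation: the names NodeState and repair_node_pool are hypothetical, and try_repair stands in for whatever repair task runs on a node, returning True if the node recovers.

```python
from __future__ import annotations

from enum import Enum
from typing import Callable, Iterable


class NodeState(Enum):
    NORMAL = "Normal"
    ERROR = "Error"
    REPAIRING = "Repairing"
    FAILED_TO_RECOVER = "Failed to Recover"


def repair_node_pool(
    error_nodes: Iterable[str],
    try_repair: Callable[[str], bool],
) -> dict[str, NodeState]:
    """Repair the Error-state nodes of one node pool in sequence.

    Processing stops at the first node that fails to recover, so the
    remaining nodes stay in the Error state.
    """
    states = {node: NodeState.ERROR for node in error_nodes}
    for node in states:
        states[node] = NodeState.REPAIRING
        if try_repair(node):
            states[node] = NodeState.NORMAL
        else:
            states[node] = NodeState.FAILED_TO_RECOVER
            break  # do not repair the remaining nodes in this pool
    return states


# Hypothetical run: the repair succeeds on every node except node-b.
result = repair_node_pool(
    ["node-a", "node-b", "node-c"],
    try_repair=lambda node: node != "node-b",
)
for node, state in result.items():
    print(node, state.value)
```

In this run, node-a returns to Normal, node-b ends in Failed to Recover, and node-c is left in the Error state because the sequence stops after the failure. Separate node pools would each get their own call, which matches the rule that node pools are handled independently.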

FAQ

What do I do if the node fails to be automatically repaired?

The auto repair feature may fail to fix complicated node exceptions in specific scenarios. If an auto repair task on a node fails or the node exception persists after an auto repair task is complete, the node enters the Failed to Recover state.

If ACK fails to repair a node in a node pool, ACK stops running auto repair tasks on all nodes in the node pool until the node recovers from exceptions. To contact technical support, submit a ticket.
