When exceptions occur on a node in a managed node pool, Container Service for Kubernetes (ACK) automatically repairs the node. This keeps the nodes in an ACK cluster running as expected. After you create a managed node pool or convert a node pool into a managed node pool, auto repair is automatically enabled for the nodes in the managed node pool. ACK monitors the status of nodes in managed node pools and runs different repair tasks to fix different node exceptions. This topic describes the conditions that trigger auto repair and the auto repair procedure.

Prerequisites

  • A managed node pool is created or a node pool is converted into a managed node pool. For more information, see Work with managed node pools.
  • The Kubernetes event center is enabled. For more information, see Event monitoring.

Conditions to trigger auto repair

Notice: When ACK repairs a node, it may perform operations such as node draining and system disk replacement. To avoid data loss, we recommend that you store data on data disks.
ACK determines whether to run auto repair tasks based on the status of nodes. To check the status of a node, run the kubectl describe node command and check the Conditions section of the output. If a node remains in an abnormal state for longer than the threshold, ACK automatically runs repair tasks on the node. The following table describes the trigger conditions.
| Status | Example | Threshold (duration of a node exception that triggers auto repair) |
|---|---|---|
| NotReady | kubelet processes not responding | 180s |
| DockerHung | dockerd not responding | 90s |
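
The threshold check described above can be sketched in Python. This is an illustration only, not ACK's actual implementation. It assumes a node object shaped like the output of kubectl get node -o json, where the standard Ready condition carries a lastTransitionTime timestamp. The helper names (parse_k8s_time, not_ready_seconds, exceeds_threshold) are hypothetical.

```python
from datetime import datetime, timezone

# Thresholds from the table above, in seconds.
REPAIR_THRESHOLDS = {"NotReady": 180, "DockerHung": 90}


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2024-01-01T00:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def not_ready_seconds(node: dict, now: datetime) -> float:
    """Return how long the node has been NotReady, or 0.0 if it is Ready."""
    for cond in node.get("status", {}).get("conditions", []):
        # A Ready condition whose status is False or Unknown means NotReady.
        if cond.get("type") == "Ready" and cond.get("status") != "True":
            since = parse_k8s_time(cond["lastTransitionTime"])
            return (now - since).total_seconds()
    return 0.0


def exceeds_threshold(status: str, abnormal_seconds: float) -> bool:
    """True if the exception has lasted longer than the threshold for its status."""
    threshold = REPAIR_THRESHOLDS.get(status)
    return threshold is not None and abnormal_seconds > threshold
```

For example, a node whose Ready condition changed to False 200 seconds ago exceeds the 180-second NotReady threshold, while one that changed 100 seconds ago does not.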

Procedure

The auto repair feature includes the following phases: diagnose node exceptions, determine whether to trigger auto repair, and run auto repair tasks.

Note: Node diagnostics are based on data provided by node-problem-detector (NPD) and the Kubernetes event center. Before you use the auto repair feature, make sure that NPD is installed and the Kubernetes event center is enabled.
During a complete auto repair procedure, a node transitions among the following states:
  • Normal: The node runs without exceptions.
  • Error: Exceptions occur on the node.
  • Failed to Recover: The node fails to recover from the exceptions after the auto repair tasks are completed.
[Figure: Node auto repair]
  1. If a node enters an abnormal state and stays in that state for longer than the threshold, ACK determines that the node is in the Error state.
  2. After ACK determines that a node is in the Error state, it runs specific auto repair tasks to fix the exceptions and generates events.
    • If the node exceptions are fixed after the repair tasks are completed, the node changes to the Normal state.
    • If the node exceptions persist after the repair tasks are completed, the node changes to the Failed to Recover state.
Note
  • If exceptions occur on multiple nodes in a node pool, ACK runs auto repair tasks on the nodes in sequence. If ACK fails to repair one of the nodes, ACK stops running auto repair tasks on the remaining nodes.
  • If node exceptions occur in multiple node pools in a cluster, ACK runs auto repair tasks on the node pools in parallel.
  • If ACK fails to repair a node, ACK disables auto repair for the node until the node recovers from exceptions.
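
The state transitions and the in-sequence repair behavior within a node pool can be modeled with a short Python sketch. It only restates the rules above and is not ACK's implementation. NodeState, repair_node, repair_pool, and the try_repair callback are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Callable, Dict, List


class NodeState(Enum):
    NORMAL = "Normal"
    ERROR = "Error"
    FAILED_TO_RECOVER = "Failed to Recover"


def repair_node(repair_succeeded: bool) -> NodeState:
    """State of an Error node after the repair tasks are completed."""
    if repair_succeeded:
        return NodeState.NORMAL
    return NodeState.FAILED_TO_RECOVER


def repair_pool(
    error_nodes: List[str], try_repair: Callable[[str], bool]
) -> Dict[str, NodeState]:
    """Repair the Error nodes of one node pool in sequence.

    Repair stops at the first node that fails to recover, so the
    remaining nodes stay in the Error state.
    """
    states = {name: NodeState.ERROR for name in error_nodes}
    for name in error_nodes:
        states[name] = repair_node(try_repair(name))
        if states[name] is NodeState.FAILED_TO_RECOVER:
            break  # no repair is attempted on the remaining nodes
    return states
```

Because node pools are repaired in parallel, a caller would invoke repair_pool once per node pool. The sketch does not model that concurrency.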

Auto repair events

After auto repair is triggered, ACK writes the relevant events to the Kubernetes event center. You can view the repair records and operations in the Kubernetes event center.
| Cause | Event level | Event description |
|---|---|---|
| NodeRepairStart | Normal | The system starts to repair the node. |
| NodeRepairAction | Normal | The repair operation performed on the node, for example, restarting kubelet. |
| NodeRepairSucceed | Normal | The node recovers from exceptions after the repair operation is completed. |
| NodeRepairFailed | Warning | The node fails to recover from exceptions after the repair operation is completed. For more information, see FAQ. |
| NodeRepairIgnore | Normal | The repair operation is ignored on the node. If an Elastic Compute Service (ECS) node is not in the Running state, the system does not perform auto repair operations on the node. |
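
If you export events for your own tooling, the repair events listed above can be filtered with a few lines of Python. The following is a minimal sketch. It assumes each event is a dictionary shaped like a Kubernetes Event object, and that the Cause column corresponds to the event's reason field; that mapping is an assumption, not something this topic confirms. The function names are hypothetical.

```python
from typing import List

# Causes listed in the auto repair events table.
REPAIR_REASONS = {
    "NodeRepairStart",
    "NodeRepairAction",
    "NodeRepairSucceed",
    "NodeRepairFailed",
    "NodeRepairIgnore",
}


def repair_events(events: List[dict]) -> List[dict]:
    """Keep only the events whose reason is one of the auto repair causes."""
    return [e for e in events if e.get("reason") in REPAIR_REASONS]


def failed_nodes(events: List[dict]) -> List[str]:
    """Return the sorted names of nodes that have a NodeRepairFailed event."""
    names = {
        e.get("involvedObject", {}).get("name", "")
        for e in events
        if e.get("reason") == "NodeRepairFailed"
    }
    return sorted(n for n in names if n)
```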

FAQ

What do I do if ACK fails to repair a node?

The auto repair feature may fail to fix complicated node exceptions in some scenarios. If an auto repair task fails on a node or the node exception persists after an auto repair task is completed, the node enters the Failed to Recover state.

If ACK fails to repair a node in a node pool, it disables auto repair for all nodes in the node pool until the node recovers from exceptions. In this case, submit a ticket to request manual repair of the node exceptions.

How do I disable auto repair for a node?

To disable auto repair for a node in a node pool, add the following label to the node:
alibabacloud.com/repair.policy=disable
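
One way to apply the label is the standard kubectl label command. The Python sketch below builds that command and optionally runs it. It assumes kubectl is installed and configured for the cluster. The node name is a placeholder, and the function names are hypothetical.

```python
import subprocess
from typing import List

# Label from this topic that disables auto repair for a node.
REPAIR_POLICY_LABEL = "alibabacloud.com/repair.policy=disable"


def build_label_command(node_name: str) -> List[str]:
    """Build the kubectl command that adds the label to the given node."""
    return ["kubectl", "label", "node", node_name, REPAIR_POLICY_LABEL]


def disable_auto_repair(node_name: str) -> None:
    """Run the command; raises CalledProcessError if kubectl fails."""
    subprocess.run(build_label_command(node_name), check=True)
```

Calling build_label_command with a placeholder name such as my-node produces the equivalent of running kubectl label node my-node alibabacloud.com/repair.policy=disable in a shell.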