After you enable the managed node pool feature for a node pool, you can enable the auto repair feature. Container Service for Kubernetes (ACK) monitors the status of nodes and runs auto repair tasks when nodes do not work as expected. This simplifies node O&M. Because faults can be complex, auto repair cannot fix every fault, and you must manually repair specific complex faults.
Prerequisites
The Kubernetes event center is enabled to receive alert events from node pools and the ack-node-problem-detector component is installed to detect node exceptions. For more information, see Event monitoring.
Procedure
You can configure the Managed Node Pool parameter for a new node pool or an existing node pool. If you turn on the Restart Faulty Node switch, operations such as draining and replacing disks may be involved in the auto repair process. We recommend that you store business data on disks.
Enable auto repair when you create a node pool
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose .
On the Node Pools page, click Create Node Pool, select the Managed Node Pool checkbox, select the Restart Faulty Node checkbox based on your business requirements, and then follow the instructions to create a node pool.
For more information about the parameters, see Create and manage a node pool.
Enable auto repair for an existing node pool
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose .
In the node pool list, find the node pool that you want to manage and choose More > Configure Managed Node Pool in the Actions column. On the page that appears, select Managed Node Pool and select Restart Faulty Node based on your business requirements.
Conditions that trigger auto repair
ACK determines whether to run auto repair tasks based on the status (condition) of nodes. To check the status of a node, run the kubectl describe node command and check the value of the condition field in the command output. If a node remains in an abnormal state for longer than the threshold, ACK automatically runs auto repair tasks on the node.
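The condition check described above can also be scripted. The following is a minimal sketch, not part of ACK: it assumes the node object was exported with kubectl get node -o json and loaded into a Python dict, and it assumes that Ready is the only condition type whose healthy value is "True". The function name abnormal_conditions and the sample node node-1 are hypothetical.

```python
# Condition types for which status "True" means healthy. For every other
# type (for example MemoryPressure), "True" is assumed to signal a problem.
HEALTHY_WHEN_TRUE = {"Ready"}


def abnormal_conditions(node: dict) -> list[dict]:
    """Return the conditions of a Node object that indicate a problem."""
    abnormal = []
    for cond in node.get("status", {}).get("conditions", []):
        healthy_value = "True" if cond["type"] in HEALTHY_WHEN_TRUE else "False"
        # "Unknown" never matches the healthy value, so it is reported too.
        if cond["status"] != healthy_value:
            abnormal.append(cond)
    return abnormal


# Hypothetical sample, shaped like the JSON form of a Node object.
SAMPLE_NODE = {
    "metadata": {"name": "node-1"},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False",
             "reason": "KubeletHasSufficientMemory"},
            {"type": "Ready", "status": "False", "reason": "KubeletNotReady"},
        ]
    },
}

for cond in abnormal_conditions(SAMPLE_NODE):
    print(cond["type"], cond["status"], cond.get("reason", ""))
```

Custom conditions reported by ack-node-problem-detector may follow a different convention, so treat the healthy-value rule as an assumption to verify against your own cluster.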
The node pool monitors the running status of nodes. If the status of a node is not reported for more than 10 minutes or a node is in the NotReady state, ACK restarts the node to repair the faults on the node. In this case, the pods on the node are restarted.
The following table describes the trigger conditions.
Item | Description | Severity | Threshold | Operation |
KubeletNotReady(KubeletHung) | The node is in the NotReady state because the kubelet stopped running. | High | 180s | |
KubeletNotReady(PLEG) | The node is in the NotReady state because the Pod Lifecycle Event Generator (PLEG) module failed to pass health checks. | Medium | 180s | |
KubeletNotReady(SandboxError) | The kubelet cannot be started because no sandboxed pod was found. | High | 180s | |
RuntimeOffline | Docker or containerd stopped running and the node is unavailable. | High | 90s | |
NTPProblem | The time synchronization service (ntpd or chronyd) is in an abnormal state. | High | 10s | Restart ntpd or chronyd. |
SystemdOffline | Systemd is in an abnormal state and cannot launch or destroy containers. | High | 90s | Restart the Elastic Compute Service (ECS) instance if Restart Faulty Node is selected. |
ReadonlyFilesystem | The node file system is read-only. | High | 90s | Restart the Elastic Compute Service (ECS) instance if Restart Faulty Node is selected. |
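To make the threshold rule concrete, the following sketch encodes the thresholds from the table and checks whether an abnormal state has lasted long enough to trigger auto repair. It is an illustrative model only and does not reflect how ACK implements the check; the names REPAIR_THRESHOLDS and repair_due are hypothetical.

```python
# Thresholds in seconds, copied from the table above.
REPAIR_THRESHOLDS = {
    "KubeletNotReady(KubeletHung)": 180,
    "KubeletNotReady(PLEG)": 180,
    "KubeletNotReady(SandboxError)": 180,
    "RuntimeOffline": 90,
    "NTPProblem": 10,
    "SystemdOffline": 90,
    "ReadonlyFilesystem": 90,
}


def repair_due(item: str, abnormal_seconds: float) -> bool:
    """Return True when the abnormal state has outlasted the item's threshold."""
    threshold = REPAIR_THRESHOLDS.get(item)
    if threshold is None:
        return False  # Not a listed trigger condition.
    return abnormal_seconds > threshold


print(repair_due("NTPProblem", 15))      # Lasted longer than 10s.
print(repair_due("RuntimeOffline", 60))  # Still within the 90s threshold.
```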
Auto repair events
After auto repair is triggered, ACK writes the relevant events to the Kubernetes event center. To view the repair records and operations, go to the cluster details page and open the Kubernetes event center from the left-side navigation pane. For more information about the events, see Event monitoring. The following table describes the events.
Cause | Level | Description |
NodeRepairStart | Normal | The system starts to repair the node. |
NodeRepairAction | Normal | The repair operation on the node, such as restarting the kubelet. |
NodeRepairSucceed | Normal | The node recovers from exceptions after the repair operation is complete. |
NodeRepairFailed | Warning | The node fails to recover from exceptions after the repair operation is complete. For more information, see FAQ. |
NodeRepairIgnore | Normal | The node skips the repair operation. If an ECS node is not in the Running state, the system does not perform auto repair operations on the node. |
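If the events are also exported as a Kubernetes event list (for example, with kubectl get events -o json), the repair history of one node can be filtered with a few lines of code. This sketch assumes that the Cause column maps to the reason field of an event and the Level column maps to its type field; that mapping, the function name repair_events, and the sample data are assumptions, not something this topic confirms.

```python
# Event causes listed in the table above.
REPAIR_REASONS = {
    "NodeRepairStart",
    "NodeRepairAction",
    "NodeRepairSucceed",
    "NodeRepairFailed",
    "NodeRepairIgnore",
}


def repair_events(event_list: dict, node_name: str) -> list[tuple[str, str, str]]:
    """Return (reason, type, message) for auto repair events of one node."""
    rows = []
    for ev in event_list.get("items", []):
        obj = ev.get("involvedObject", {})
        if (
            ev.get("reason") in REPAIR_REASONS
            and obj.get("kind") == "Node"
            and obj.get("name") == node_name
        ):
            rows.append((ev["reason"], ev.get("type", ""), ev.get("message", "")))
    return rows


# Hypothetical sample, shaped like an event list in JSON form.
SAMPLE_EVENTS = {
    "items": [
        {"reason": "NodeRepairStart", "type": "Normal", "message": "start repair",
         "involvedObject": {"kind": "Node", "name": "node-1"}},
        {"reason": "Pulled", "type": "Normal", "message": "image pulled",
         "involvedObject": {"kind": "Pod", "name": "web-0"}},
        {"reason": "NodeRepairFailed", "type": "Warning", "message": "repair failed",
         "involvedObject": {"kind": "Node", "name": "node-1"}},
    ]
}

for reason, level, message in repair_events(SAMPLE_EVENTS, "node-1"):
    print(reason, level, message)
```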
Execution rules and process
The following section describes the complete auto repair process.
If a node enters an abnormal state and remains in the state for a period of time, ACK determines that the node is in the Error state.
After a node is considered in the Error state, ACK runs specific auto repair tasks to fix the exceptions and generates events.
The node is in the Repairing state while the auto repair task is running.
If the node exceptions are fixed after the auto repair tasks are complete, the node enters the Normal state.
If the node exceptions persist after the repair tasks are complete, the node enters the Failed to Recover state.
If a node fails to be repaired, ACK no longer triggers auto repair tasks for the node. Auto repair resumes for the node only after the node recovers from the exception.
If a cluster contains multiple node pools, ACK can run auto repair tasks on the nodes in different node pools in parallel.
If exceptions occur on multiple nodes in a node pool, ACK runs auto repair tasks on the nodes in sequence. If ACK fails to repair one of the nodes, ACK stops running auto repair tasks on the remaining nodes.
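The rules above can be summarized as a small state model. The following sketch is a simplified illustration for a single node pool, assuming the four states named in this section and a caller-supplied repair callback that returns True on success. It is not ACK's implementation, and NodeState and repair_pool are hypothetical names.

```python
from enum import Enum
from typing import Callable


class NodeState(Enum):
    NORMAL = "Normal"
    ERROR = "Error"
    REPAIRING = "Repairing"
    FAILED_TO_RECOVER = "Failed to Recover"


def repair_pool(
    states: dict[str, NodeState], repair: Callable[[str], bool]
) -> dict[str, NodeState]:
    """Repair the Error nodes of one node pool in sequence.

    Processing stops at the first node that fails to recover, and nothing
    runs while the pool already contains a Failed to Recover node.
    """
    result = dict(states)
    if NodeState.FAILED_TO_RECOVER in result.values():
        return result  # Auto repair is paused for the whole pool.
    for name, state in states.items():
        if state is not NodeState.ERROR:
            continue
        result[name] = NodeState.REPAIRING  # The repair task is running.
        if repair(name):
            result[name] = NodeState.NORMAL
        else:
            result[name] = NodeState.FAILED_TO_RECOVER
            break  # Remaining nodes are not repaired.
    return result


pool = {"a": NodeState.ERROR, "b": NodeState.ERROR, "c": NodeState.ERROR}
after = repair_pool(pool, lambda name: name != "b")  # Repair of "b" fails.
for name, state in after.items():
    print(name, state.value)
```

In this model, node c stays in the Error state because the failure on node b stops the sequence, which mirrors the rule for multiple faulty nodes in one node pool.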
FAQ
What do I do if the node fails to be automatically repaired?
The auto repair feature may fail to fix complicated node exceptions in specific scenarios. If an auto repair task on a node fails or the node exception persists after an auto repair task is complete, the node enters the Failed to Recover state.
If ACK fails to repair a node in a node pool, ACK stops running auto repair tasks on all nodes in the node pool until the node recovers from exceptions. To contact technical support, submit a ticket.
References
If you want to remove a faulty node and add a new node to resolve the issue, you must use a standardized process in the ACK console to remove and add the node. This prevents unexpected errors. For more information, see Remove a node and Add existing ECS instances to an ACK cluster.
You can enable auto updates for clusters. For more information, see Automatically update a cluster.