Container Service for Kubernetes: ack-node-repairer

Last Updated: Apr 26, 2024

When Node Problem Detector (NPD) detects a node exception, it generates a node event or node condition and reports it to the Container Service for Kubernetes (ACK) cluster. ACK Node Repairer watches the events and conditions on each node and fixes the issues according to its configuration. This topic describes how to install and configure ACK Node Repairer.

Prerequisites

Background Information

ACK Node Repairer ships with a predefined list of common node exceptions and the actions that fix them. When a node exception occurs, ACK Node Repairer automatically triggers the corresponding action on the node. After the exception is fixed, NPD automatically updates the status of the node exception, which closes the loop between detection and repair. O&M engineers can also define their own node exceptions and the actions that fix them.

NPD is a tool for diagnosing Kubernetes nodes. NPD detects node exceptions and generates node events when the following exceptions are detected: Docker engine hangs, Linux kernel hangs, outbound traffic anomalies, and file descriptor anomalies.
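To see what NPD has reported for a node, the node's conditions and recent events can be inspected with kubectl. The following is a minimal sketch, assuming kubectl access to the cluster. The function names and the NODE_NAME placeholder are illustrative and not part of ACK.

```shell
# Print "<condition type>=<status>" for every condition on one node.
list_node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
}

# List events whose involved object is the given node.
list_node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1"
}

# Usage (NODE_NAME is a placeholder for a real node name):
#   list_node_conditions NODE_NAME
#   list_node_events NODE_NAME
```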

Install ack-node-repairer

Install the ack-node-repairer component before you use ACK Node Repairer.

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. On the Marketplace page, search for and select ack-node-repairer.

  3. On the component details page, click Deploy in the upper-right corner.

  4. In the Deploy panel, select a cluster and a namespace, and then click Next.

  5. In the Parameters step, select the latest chart version, specify the AccessKey pair in the Parameters section, and then click OK.

    Specify the AccessKey pair in the following parameters:

    • accessKey: set in nodeRepairer.accessKey

    • accessSecret: set in nodeRepairer.accessSecret
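Once the deployment finishes, one way to confirm the installation from the command line is to check that the custom resource types used later in this topic are registered. This is a sketch that assumes kubectl access to the cluster and that the custom resource definitions follow the standard plural.group naming. check_repairer_installed is an illustrative name.

```shell
# Succeeds only if both custom resource definitions exist in the cluster.
check_repairer_installed() {
  kubectl get crd \
    noderepairers.nodes.alibabacloud.com \
    nodejobs.nodes.alibabacloud.com
}

# Usage:
#   check_repairer_installed && echo "ack-node-repairer CRDs found"
```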

Configure ack-node-repairer

After ACK Node Repairer is installed, all auto repairing operations that the current version supports are enabled by default. You can configure, enable, or disable auto repairing for each type of node exception. The following example shows how to configure ack-node-repairer to automatically fix a Network Time Protocol (NTP) service failure.

  1. View the YAML file of default-node-repairer.

    After ACK Node Repairer is installed, a resource object of type noderepairers.nodes.alibabacloud.com named default-node-repairer is automatically created in the kube-system namespace. This resource object defines the node exceptions that ACK Node Repairer monitors and the actions that fix them. Run the following command to view its YAML content:

    kubectl -n kube-system get noderepairers.nodes.alibabacloud.com default-node-repairer -o yaml

  2. Modify the configurations of default-node-repairer.

    In the spec.rules field, add the detector parameter to detect NTPProblem issues and add the healers parameter to fix NTP issues. The following code block is an example:

    spec:
      rules:
      # The detector matches the NTPProblem node condition; the healers entry names the node operation that fixes it.
      - detector:
          conditionType: NTPProblem
          type: conditionType
          paused: false
        healers:
        - nodeOperation: restart-ntpd
          type: nodejob

    Note

    To configure auto repairing for each node exception, you must associate the node condition with the action that is performed to fix the node exception. The rules.detector.conditionType parameter specifies the node condition. If you set rules.detector.paused to true, auto repairing is disabled for this type of node condition.

    After you complete the preceding steps, whenever an NTP issue occurs on a node in the cluster, ACK Node Repairer automatically runs the systemctl restart chronyd.service command on the node through CloudOps Orchestration Service (OOS) to restart the NTP service on the node.
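The two steps above can be sketched as shell helpers: one prints each rule's condition type and paused flag, and the other opens the resource for editing. The field paths follow the example rule shown in step 2, and the function and variable names are illustrative.

```shell
# Resource type and name, as used in step 1.
REPAIRER_TYPE=noderepairers.nodes.alibabacloud.com
REPAIRER_NAME=default-node-repairer

# Print "<conditionType> paused=<value>" for each rule in spec.rules.
list_repair_rules() {
  kubectl -n kube-system get "$REPAIRER_TYPE" "$REPAIRER_NAME" \
    -o jsonpath='{range .spec.rules[*]}{.detector.conditionType}{" paused="}{.detector.paused}{"\n"}{end}'
}

# Open the resource in the default editor to add or change rules.
edit_repair_rules() {
  kubectl -n kube-system edit "$REPAIRER_TYPE" "$REPAIRER_NAME"
}

# Usage:
#   edit_repair_rules    # add the NTPProblem rule under spec.rules
#   list_repair_rules    # confirm the rule is present and not paused
```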

Records of auto repairing events and results

A resource object of type nodejobs.nodes.alibabacloud.com is automatically created in the kube-system namespace to record each auto repairing event. Run the following command to view the content of this resource object. The Status field in the output shows the auto repairing result.

kubectl -n kube-system get nodejobs.nodes.alibabacloud.com {nodejob_cr_name} -o yaml
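If the name of a record is not known, listing all records first helps. The sketch below assumes that the result is stored under the resource's .status field, which is inferred from the Status field mentioned above and not confirmed by this topic. The function names are illustrative.

```shell
# List all auto repairing records in the kube-system namespace.
list_node_jobs() {
  kubectl -n kube-system get nodejobs.nodes.alibabacloud.com
}

# Print only the status section of one record (assumed to live at .status).
node_job_status() {
  kubectl -n kube-system get nodejobs.nodes.alibabacloud.com "$1" \
    -o jsonpath='{.status}'
}

# Usage (replace the placeholder with a name printed by list_node_jobs):
#   list_node_jobs
#   node_job_status {nodejob_cr_name}
```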