Elastic Compute Service: Automatically monitor and respond to ECS system events to automate O&M operations such as troubleshooting and dynamic scheduling

Last Updated: Sep 12, 2024

Alibaba Cloud provides Elastic Compute Service (ECS) system events to record and communicate resource information, such as the startup, stopping, and expiration of ECS instances and the task executions on ECS instances. In scenarios in which large-scale clusters are involved and resources are scheduled in real time, you can monitor and respond to ECS system events by using the ecs-tool-event plug-in of Cloud Assistant to automate O&M operations, such as troubleshooting and dynamic scheduling.

Note
  • ECS system events are defined by Alibaba Cloud to record and communicate resource information, such as the execution status of O&M tasks, resource exceptions, and resource status changes. For information about the categories and details of ECS system events, see Overview of ECS system events.

  • Cloud Assistant provides plug-ins that allow you to perform complex configurations by running simple commands and improve O&M efficiency. For more information, see Overview of Cloud Assistant and Use Cloud Assistant plug-ins.

Mechanism

You can monitor and respond to ECS system events in the ECS console or by calling ECS API operations. However, both methods have limitations.

  • If you monitor or respond to ECS system events in the ECS console, you cannot automate the response to the events. You must manually perform the operations. As a result, when ECS system events are generated for multiple ECS instances, you may overlook events.

  • If you monitor or respond to ECS system events by calling ECS API operations, you must develop programs that integrate the API operations, which requires development effort and ongoing maintenance costs.

To resolve the preceding issues, Alibaba Cloud provides the ecs-tool-event plug-in of Cloud Assistant that requests ECS system events from MetaServer every minute and stores the events as logs in operating systems. This way, you can collect system event logs from operating systems and monitor and respond to ECS system events based on the logs without the need to develop additional programs. For example, if you have Kubernetes automated O&M capabilities, you can collect log streams from the host_event.log file to adapt your O&M system.
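Because the events land as plain log lines, any tool that can read files can consume them. As a minimal illustration (this parser is not part of the plug-in), the following Python sketch turns one host_event.log line into structured fields, assuming the log format described later in this topic:

```python
import re
from typing import Optional

# Matches lines such as:
# 2024-01-08 17:02:01 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z
LINE_RE = re.compile(
    r"^(?P<logged_at>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - WARNING - "
    r"Ecs event type is: (?P<event_type>[^,]+),"
    r"event status is: (?P<status>[^,]+),"
    r"action ISO 8601 time is (?P<action_time>\S+)$"
)

def parse_event(line: str) -> Optional[dict]:
    """Parse one host_event.log line; return a field dict or None on mismatch."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None
```

An O&M system could feed each parsed dict into its own scheduling or alerting logic.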

Procedure

Important
  • Make sure that Cloud Assistant Agent is installed on your ECS instances. For more information, see Install Cloud Assistant Agent.

  • Make sure that you have the root permissions required to start, stop, and check the status of Cloud Assistant plug-ins.

  1. Log on to an ECS instance and start the ecs-tool-event plug-in.

    After you start the ecs-tool-event plug-in, the plug-in requests ECS system events from MetaServer every minute and stores the events as logs in the operating system of the ECS instance.

    sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start
    Note

    After you start the ecs-tool-event plug-in, you can run the ls /var/log command to verify that the host_event.log file is automatically generated.

    • Log path: /var/log/host_event.log

    • Log format:

      %Y-%m-%d %H:%M:%S - WARNING - Ecs event type is: ${Event type}, event status is: ${Event status}, action ISO 8601 time is ${Execution time in ISO 8601}

      Example:

      2024-01-08 17:02:01 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z

  2. Check the status of the ecs-tool-event plug-in.

    sudo acs-plugin-manager --status
  3. Collect log streams from the host_event.log file to adapt your O&M system based on your business scenarios.

    Example: Automatically respond to ECS system events when ECS instances are used as nodes in a Kubernetes cluster

  4. (Optional) If you no longer need to respond to ECS system events, remove the ecs-tool-event plug-in.

    sudo acs-plugin-manager --remove --plugin ecs-tool-event

Example: Automatically respond to ECS system events when ECS instances are used as nodes in a Kubernetes cluster

Scenario

If exceptions occur on an ECS instance that serves as a node in a Kubernetes cluster, such as an unexpected restart, an out-of-memory condition, or an operating system error, online service stability may be affected. You must monitor and respond to the exception events to identify and troubleshoot the exceptions at the earliest opportunity. The ecs-tool-event plug-in converts ECS system events into operating system logs. You can then monitor and respond to the events by using the open source Node Problem Detector (NPD), Draino, and Autoscaler components provided by the Kubernetes community, without the need to develop additional programs. This improves the stability and reliability of the Kubernetes cluster.

NPD, Draino, and Autoscaler

  • NPD: is an open source component provided by the Kubernetes community that monitors the health of nodes and detects node issues, such as hardware and network issues. For more information, see the official NPD documentation.

  • Draino: functions as a controller in Kubernetes to monitor all nodes in a Kubernetes cluster and migrate pods from unhealthy nodes to healthy nodes. For more information, see the official Draino documentation.

  • Autoscaler: is an open source component provided by the Kubernetes community that automatically adjusts the number of nodes in a Kubernetes cluster and monitors the nodes to ensure that sufficient resources are available to allow all nodes to run and that no idle nodes exist. For more information, see the official Autoscaler documentation.

Architecture

The following workflow explains how the ecs-tool-event plug-in is used to automatically respond to ECS system events when ECS instances are used as nodes in a Kubernetes cluster. The following figure shows the architecture.

  1. The ecs-tool-event plug-in of Cloud Assistant requests ECS system events from MetaServer every minute and stores the events as logs in the following path in the operating system: /var/log/host_event.log.

  2. NPD collects system event logs and reports issues to the API server.

  3. Draino receives Kubernetes events (ECS system events) from the API server, evicts pods from unhealthy nodes, and then migrates the pods to healthy nodes.

  4. After pods are evicted from an unhealthy node, take the node out of service based on your business scenario, or use Autoscaler to automatically release the node and add a new ECS instance as a node to the Kubernetes cluster.

(Architecture diagram)
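Conceptually, NPD applies rule patterns to each log line and reports matches either as temporary Kubernetes events or as permanent node conditions that Draino acts on. The following simplified Python sketch mimics that mapping; the function and rule table are illustrative only, not NPD code, and the catch-all temporary rule is a deliberate simplification:

```python
import re

# Simplified rule table in the spirit of the NPD configuration in Step 2:
# a "permanent" rule sets a node condition that Draino reacts to, while a
# "temporary" rule only emits a Kubernetes event.
RULES = [
    ("permanent", "HostEventRebootAfter48",
     r"Ecs event type is: SystemMaintenance\.Reboot,event status is: (Scheduled|Inquiring)"),
    ("temporary", "HostEventTemporary",
     r"Ecs event type is: "),  # catch-all for illustration
]

def classify(line: str):
    """Return (rule_type, name) for the first matching rule, or None."""
    for rule_type, name, pattern in RULES:
        if re.search(pattern, line):
            return rule_type, name
    return None
```

A scheduled maintenance reboot thus surfaces as a permanent condition, while other event lines only produce transient events.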

Procedure

Step 1: Start the ecs-tool-event plug-in on a node

Log on to a node (ECS instance) in the Kubernetes cluster and start the ecs-tool-event plug-in.

Important

In actual scenarios, you must start the ecs-tool-event plug-in on all nodes in the Kubernetes cluster. You can use Cloud Assistant to batch run the following command on multiple ECS instances to start the ecs-tool-event plug-in. For more information, see Create and run a command.

sudo acs-plugin-manager --exec --plugin=ecs-tool-event --params --start

After you start the ecs-tool-event plug-in, the plug-in automatically converts ECS system events into logs and stores the logs in the operating system of the node.

Step 2: Configure NPD and Draino for the Kubernetes cluster

  1. Log on to any node in the Kubernetes cluster.

  2. Configure NPD, which takes effect across the entire Kubernetes cluster.

    1. Configure the following NPD files.

      Note

      For information about how to configure NPD files, see the official NPD documentation.

      • node-problem-detector-config.yaml: is used to define the metrics that NPD monitors, such as system logs.

      • node-problem-detector.yaml: is used to define how NPD operates in a Kubernetes cluster.

      • rbac.yaml: is used to define the permissions granted to NPD in a Kubernetes cluster.

        Add the NPD files to an ECS instance on which NPD is not configured

        Add the preceding NPD files to the ECS instance.

        node-problem-detector-config.yaml

        apiVersion: v1
        data:
          kernel-monitor.json: |
            {
                "plugin": "kmsg",
                "logPath": "/dev/kmsg",
                "lookback": "5m",
                "bufferSize": 10,
                "source": "kernel-monitor",
                "conditions": [
                    {
                        "type": "KernelDeadlock",
                        "reason": "KernelHasNoDeadlock",
                        "message": "kernel has no deadlock"
                    },
                    {
                        "type": "ReadonlyFilesystem",
                        "reason": "FilesystemIsNotReadOnly",
                        "message": "Filesystem is not read-only"
                    }
                ],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "OOMKilling",
                        "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
                    },
                    {
                        "type": "temporary",
                        "reason": "TaskHung",
                        "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                    },
                    {
                        "type": "temporary",
                        "reason": "UnregisterNetDevice",
                        "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                    },
                    {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                    },
                    {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                    },
                    {
                                "type": "temporary",
                                "reason": "MemoryReadError",
                                "pattern": "CE memory read error .*"
                    },
                    {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "DockerHung",
                        "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                    },
                    {
                        "type": "permanent",
                        "condition": "ReadonlyFilesystem",
                        "reason": "FilesystemIsReadOnly",
                        "pattern": "Remounting filesystem read-only"
                    }
                ]
            }
          host_event.json: |
            {
                "plugin": "filelog",                     
                "pluginConfig": {
                    "timestamp": "^.{19}",
                    "message": "Ecs event type is: .*",
                    "timestampFormat": "2006-01-02 15:04:05"
                },
                "logPath": "/var/log/host_event.log",   
                "lookback": "5m",
                "bufferSize": 10,
                "source": "host-event",                     
                "conditions": [
                    {
                        "type": "HostEventRebootAfter48",       
                        "reason": "HostEventWillRebootAfter48",
                        "message": "The Host Is Running In Good Condition"
                    }
                ],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "HostEventRebootAfter48temporary",
                        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                    },
                    {
                        "type": "permanent",
                        "condition": "HostEventRebootAfter48", 
                        "reason": "HostEventRebootAfter48Permanent",
                        "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                    }
                ]
            }
        
          docker-monitor.json: |
            {
                "plugin": "journald",
                "pluginConfig": {
                    "source": "dockerd"
                },
                "logPath": "/var/log/journal",
                "lookback": "5m",
                "bufferSize": 10,
                "source": "docker-monitor",
                "conditions": [],
                "rules": [
                    {
                        "type": "temporary",
                        "reason": "CorruptDockerImage",
                        "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*"
                    }
                ]
            }
        kind: ConfigMap
        metadata:
          name: node-problem-detector-config
          namespace: kube-system

        node-problem-detector.yaml

        apiVersion: apps/v1
        kind: DaemonSet
        metadata:
          name: node-problem-detector
          namespace: kube-system
          labels:
            app: node-problem-detector
        spec:
          selector:
            matchLabels:
              app: node-problem-detector
          template:
            metadata:
              labels:
                app: node-problem-detector
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: kubernetes.io/os
                            operator: In
                            values:
                              - linux
              containers:
              - name: node-problem-detector
                command:
                - /node-problem-detector
                - --logtostderr
                - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/host_event.json
                image: cncamp/node-problem-detector:v0.8.10
                resources:
                  limits:
                    cpu: 10m
                    memory: 80Mi
                  requests:
                    cpu: 10m
                    memory: 80Mi
                imagePullPolicy: Always
                securityContext:
                  privileged: true
                env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
                volumeMounts:
                - name: log
                  mountPath: /var/log
                  readOnly: true
                - name: kmsg
                  mountPath: /dev/kmsg
                  readOnly: true
                # Make sure node problem detector is in the same timezone
                # with the host.
                - name: localtime
                  mountPath: /etc/localtime
                  readOnly: true
                - name: config
                  mountPath: /config
                  readOnly: true
              serviceAccountName: node-problem-detector
              volumes:
              - name: log
                # Config `log` to your system log directory
                hostPath:
                  path: /var/log/
              - name: kmsg
                hostPath:
                  path: /dev/kmsg
              - name: localtime
                hostPath:
                  path: /etc/localtime
              - name: config
                configMap:
                  name: node-problem-detector-config
                  items:
                  - key: kernel-monitor.json
                    path: kernel-monitor.json
                  - key: docker-monitor.json
                    path: docker-monitor.json
                  - key: host_event.json
                    path: host_event.json
              tolerations:
                - effect: NoSchedule
                  operator: Exists
                - effect: NoExecute
                  operator: Exists

        rbac.yaml

        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: node-problem-detector
          namespace: kube-system
        
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRoleBinding
        metadata:
          name: npd-binding
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: system:node-problem-detector
        subjects:
          - kind: ServiceAccount
            name: node-problem-detector
            namespace: kube-system

        Modify the NPD files on an ECS instance on which NPD is configured

        • Add the log monitoring settings of the host_event.json file to the node-problem-detector-config.yaml file. The following sample code provides an example of the log monitoring settings. The inline # comments are explanatory only and must be removed before you apply the file, because JSON does not support comments:

          ...
          
          host_event.json: |
              {
                  "plugin": "filelog",   # Specify the plug-in that is used to collect logs. Set this parameter to filelog.       
                  "pluginConfig": {
                      "timestamp": "^.{19}",
                      "message": "Ecs event type is: .*",
                      "timestampFormat": "2006-01-02 15:04:05"
                  },
                  "logPath": "/var/log/host_event.log",    # Specify the path in which you want to store system event logs. Set this parameter to /var/log/host_event.log.
                  "lookback": "5m",
                  "bufferSize": 10,
                  "source": "host-event",                     
                  "conditions": [
                      {
                          "type": "HostEventRebootAfter48",    # Specify an event name, which is used in the Draino configuration.
                          "reason": "HostEventWillRebootAfter48",
                          "message": "The Host Is Running In Good Condition"
                      }
                  ],
                  "rules": [
                      {
                          "type": "temporary",
                          "reason": "HostEventRebootAfter48temporary",
                          "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                      },
                      {
                          "type": "permanent",
                          "condition": "HostEventRebootAfter48", 
                          "reason": "HostEventRebootAfter48Permanent",
                          "pattern": "Ecs event type is: SystemMaintenance.Reboot,event status is: Scheduled.*|Ecs event type is: SystemMaintenance.Reboot,event status is: Inquiring.*"
                      }
                  ]
              }
          
          ...
        • Configure the node-problem-detector.yaml file.

          • Add /config/host_event.json to the - --config.system-log-monitor line to enable NPD to monitor system event logs. The following sample code provides an example of the configuration:

            containers:
                  - name: node-problem-detector
                    command:
                     ...
                    - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/host_event.json
            
          • Add the host_event.json entries below the items: line in the - name: config section, as shown in the following sample code:

            ...
            - name: config
                    configMap:
                      name: node-problem-detector-config
                      items:
                      - key: kernel-monitor.json
                        path: kernel-monitor.json
                      - key: docker-monitor.json
                        path: docker-monitor.json
                      - key: host_event.json     # Add this line.
                        path: host_event.json    # Add this line.
            ...
    2. Run the following commands for the files to take effect:

      sudo kubectl create -f rbac.yaml
      sudo kubectl create -f node-problem-detector-config.yaml
      sudo kubectl create -f node-problem-detector.yaml
    3. Run the following command to check whether the NPD configuration takes effect:

      sudo kubectl describe nodes

      If a HostEventRebootAfter48 entry is added to the Conditions section of the command output, the NPD configuration is complete and works as expected. If the entry does not appear, wait 3 to 5 minutes and check again.

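As an offline sanity check, the filelog pluginConfig settings above can be exercised against a sample log line. The following Python sketch is illustrative only; NPD itself uses Go's regexp engine and time layouts, but these particular patterns behave the same in Python:

```python
import re
from datetime import datetime

# A sample line in the format written by the ecs-tool-event plug-in.
sample = ("2024-01-08 17:02:01 - WARNING - Ecs event type is: "
          "SystemMaintenance.Reboot,event status is: Scheduled,"
          "action ISO 8601 time is 2024-01-10T11:49:28Z")

# "timestamp": "^.{19}" -- the first 19 characters are the timestamp.
ts = re.match(r"^.{19}", sample).group(0)

# Go's reference layout "2006-01-02 15:04:05" corresponds to the
# strptime format "%Y-%m-%d %H:%M:%S".
parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

# "message": "Ecs event type is: .*" -- extract the message body.
msg = re.search(r"Ecs event type is: .*", sample).group(0)
```

If the timestamp fails to parse or the message pattern does not match, NPD would not report the event, so checking the patterns this way before deployment can save a debugging cycle.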

  3. Configure Draino, which takes effect across the entire Kubernetes cluster.

    1. Install Draino or modify the Draino configuration.

      Install Draino on an ECS instance on which Draino is not installed

      Add the following YAML file to the ECS instance:

      draino.yaml

      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        labels: {component: draino}
        name: draino
        namespace: kube-system
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        labels: {component: draino}
        name: draino
      rules:
      - apiGroups: ['']
        resources: [events]
        verbs: [create, patch, update]
      - apiGroups: ['']
        resources: [nodes]
        verbs: [get, watch, list, update]
      - apiGroups: ['']
        resources: [nodes/status]
        verbs: [patch]
      - apiGroups: ['']
        resources: [pods]
        verbs: [get, watch, list]
      - apiGroups: ['']
        resources: [pods/eviction]
        verbs: [create]
      - apiGroups: [extensions]
        resources: [daemonsets]
        verbs: [get, watch, list]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        labels: {component: draino}
        name: draino
      roleRef: {apiGroup: rbac.authorization.k8s.io, kind: ClusterRole, name: draino}
      subjects:
      - {kind: ServiceAccount, name: draino, namespace: kube-system}
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels: {component: draino}
        name: draino
        namespace: kube-system
      spec:
        # Draino does not currently support locking/master election, so you should
        # only run one draino at a time. Draino won't start draining nodes immediately
        # so it's usually safe for multiple drainos to exist for a brief period of
        # time.
        replicas: 1
        selector:
          matchLabels: {component: draino}
        template:
          metadata:
            labels: {component: draino}
            name: draino
            namespace: kube-system
          spec:
            containers:
            - name: draino
              image: planetlabs/draino:dbadb44
              # You'll want to change these labels and conditions to suit your deployment.
              command:
              - /draino
              - --debug
              - --evict-daemonset-pods
              - --evict-emptydir-pods
              - --evict-unreplicated-pods
              - KernelDeadlock
              - OutOfDisk
              - HostEventRebootAfter48
              # - ReadonlyFilesystem
              # - MemoryPressure
              # - DiskPressure
              # - PIDPressure
              livenessProbe:
                httpGet: {path: /healthz, port: 10002}
                initialDelaySeconds: 30
            serviceAccountName: draino

      Modify the Draino configuration on an ECS instance on which Draino is installed

      Open the Draino configuration file and find the containers: section. Then, add the event name that you specified in the node-problem-detector-config.yaml file in Step 2, as shown in the following code. In this example, the event name is HostEventRebootAfter48.

      containers:
            - name: draino
              image: planetlabs/draino:dbadb44
              # You'll want to change these labels and conditions to suit your deployment.
              command:
              - /draino
              - --debug
              ......
              - KernelDeadlock
              - OutOfDisk
              - HostEventRebootAfter48  # Add this line.  
    2. Run the following command for the Draino configuration to take effect:

      sudo kubectl create -f draino.yaml

Step 3: Take unhealthy nodes out of service and add new nodes

After pods are evicted from an unhealthy node, take the node out of service based on your business scenario, or use Autoscaler to automatically release the node and add a new ECS instance as a node to the Kubernetes cluster. For information about how to use Autoscaler, see the official Autoscaler documentation.

Verify the result

  1. Log on to a node and run the following command to simulate a log of an ECS system event.

    Important

    Replace the time in the command with the current system time.

    echo '2024-02-23 12:29:29 - WARNING - Ecs event type is: InstanceFailure.Reboot,event status is: Executed,action ISO 8601 time is 2023-12-27T11:49:28Z' | sudo tee /var/log/host_event.log
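If you script the simulation, the timestamp can be generated automatically instead of edited by hand. The following Python helper is hypothetical and only illustrates building a line in the expected format:

```python
from datetime import datetime, timezone

def simulated_event_line(event_type: str = "InstanceFailure.Reboot",
                         status: str = "Executed") -> str:
    """Build a simulated host_event.log line that uses the current time."""
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    action = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (f"{now} - WARNING - Ecs event type is: {event_type},"
            f"event status is: {status},action ISO 8601 time is {action}")
```

You can then pipe the generated line into the log file, for example with sudo tee /var/log/host_event.log.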
  2. Run the following command to check the node status:

    sudo kubectl describe nodes

    If the node status is Unschedulable in the command output, the Cloud Assistant plug-in detected the ECS system event, NPD generated a Kubernetes event based on it, and Draino cordoned the node as expected.
