
Container Service for Kubernetes:Alert management

Last Updated:Apr 09, 2025

Container Service for Kubernetes (ACK) provides an alert management feature that allows you to centrally configure alerting for containers. You can configure alert rules to receive notifications when a service exception occurs or when metrics exceed thresholds, covering key metrics of basic cluster resources, core cluster components, and applications. You can also modify the default alert rules of a cluster by deploying CustomResourceDefinitions (CRDs) in the cluster. This helps you promptly detect abnormal changes in the cluster.

Index

  • Enable alert management

  • Configure alert rules

  • FAQ

Prerequisites

Billing

Alerts are sent by Simple Log Service, Managed Service for Prometheus, and CloudMonitor. These monitoring services may charge additional fees for sending notifications such as text messages and emails. Before you enable the alert management feature, check the source of each alert item in the default alert rule template and activate the required services. The billing details are as follows:

Simple Log Service

  Configuration requirements: Enable event monitoring. Event monitoring is automatically enabled when you enable the alert management feature.

  Billing details: Billable items of the pay-by-feature billing method.

Managed Service for Prometheus

  Configuration requirements: Enable Managed Service for Prometheus monitoring for the cluster.

  Billing details: Free of charge.

CloudMonitor

  Configuration requirements: Enable the CloudMonitor monitoring feature for the ACK cluster.

  Billing details: Pay-as-you-go.

Enable alert management

After you enable the alert management feature, you can configure metric-based alerts for specific resources in the cluster and automatically receive alert notifications when exceptions occur. This helps you efficiently manage and maintain your cluster and ensure service stability. For more information about resource alerts, see Default alert rule template.

ACK managed cluster

You can enable alert management for an existing cluster or when you create a cluster.

Enable alert management for an existing cluster

If you have an existing cluster, you can enable alert management by performing the following steps:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alert Configuration page, click Start Installation. The console automatically checks the prerequisites and installs or upgrades the required components.

  4. After the installation and upgrade are complete, configure alerts on the Alert Configuration page.

    Alert Rule Management: Turn on Enabled to enable the corresponding alert rule set. Click Edit Notification Object to set the associated notification object.

    Alert History: You can view the latest 100 alert records sent within the last day. Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration. Click Troubleshoot to quickly locate the resource page where the exception (event or metric) occurred.

    Contact Management: You can create, edit, or delete alert contacts. Contacts can be notified by text message, email, or chatbot (DingTalk, WeCom, or Lark). Before a contact can receive alert messages, you must first verify the contact information in the CloudMonitor console under Alert Service > Alert Contact. Contact synchronization is also supported. If the verification information expires, delete the corresponding contact in CloudMonitor and refresh the Contacts page. For information about chatbot notification settings, see DingTalk Robot, WeCom Robot, and Lark Robot.

    Contact Group Management: You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration.

Enable alert management when you create a cluster

On the Component Configurations page when you create a cluster, in the Alert Configuration section, select Use Default Alert Template To Configure Alerts and select an Alert Contact Group. For more information, see Create an ACK managed cluster.


After you enable alert management when you create a cluster, the system enables the default alert rules and sends alert notifications to the default contact group. You can also Modify an alert contact or alert contact group.

ACK dedicated cluster

For an ACK dedicated cluster, you must first authorize the worker Resource Access Management (RAM) role, and then enable the default alert rules.

Authorize the worker RAM role

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, click Cluster Information.

  3. On the Cluster Information page, in the Cluster Resources section, copy the name of the Worker RAM Role and click its link to go to the RAM console. Then, authorize the worker RAM role.

    1. Create the following custom policy. For more information, see Create a custom policy on the JSON tab.

      {
          "Action": [
              "log:*",
              "arms:*",
              "cms:*",
              "cs:UpdateContactGroup"
          ],
          "Resource": [
              "*"
          ],
          "Effect": "Allow"
      }
    2. On the Roles page, search for the worker RAM role and grant the custom policy that you created to the worker RAM role. For more information, see Method 1: Grant permissions to a RAM role by clicking Grant Permission on the Roles page.
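
    If you prefer the command line, the same grant can be performed with the Alibaba Cloud CLI. The following is a sketch, not a definitive procedure: the policy name AckAlertWorkerPolicy is a placeholder, policy.json is assumed to contain the JSON document above, and you must substitute the actual name of your worker RAM role.

    ```shell
    # Create the custom policy from the JSON document above (saved locally as policy.json).
    aliyun ram CreatePolicy \
      --PolicyName AckAlertWorkerPolicy \
      --PolicyDocument "$(cat policy.json)"

    # Attach the custom policy to the worker RAM role.
    aliyun ram AttachPolicyToRole \
      --PolicyType Custom \
      --PolicyName AckAlertWorkerPolicy \
      --RoleName <your-worker-ram-role-name>
    ```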

  4. Check the component logs to verify that the permissions are granted.

    1. In the left-side navigation pane of the details page of the cluster, choose Workloads > Deployments.

    2. Set Namespace to kube-system and click alicloud-monitor-controller in the Deployments list.

    3. Click the Logs tab to view the pod logs that indicate successful authorization.

Enable the default alert rules

  1. In the left-side navigation pane of the details page of the cluster, choose Operations > Alert Configuration.

  2. On the Alert Configuration page, configure the following alert information.

    Alert Rule Management: Turn on Enabled to enable the corresponding alert rule set. Click Edit Notification Object to set the associated notification object.

    Alert History: You can view the latest 100 alert records sent within the last day. Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration. Click Troubleshoot to quickly locate the resource page where the exception (event or metric) occurred.

    Contact Management: You can create, edit, or delete alert contacts. Contacts can be notified by text message, email, or chatbot (DingTalk, WeCom, or Lark). Before a contact can receive alert messages, you must first verify the contact information in the CloudMonitor console under Alert Service > Alert Contact. Contact synchronization is also supported. If the verification information expires, delete the corresponding contact in CloudMonitor and refresh the Contacts page. For information about chatbot notification settings, see DingTalk Robot, WeCom Robot, and Lark Robot.

    Contact Group Management: You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration.

Configure alert rules by using CRDs

When the alerting feature is enabled, the system automatically creates an AckAlertRule object in the kube-system namespace. The AckAlertRule object contains the default alert rule template. You can modify the AckAlertRule object to customize the default alert rules based on your business requirements.

Procedure

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alert Rule Management tab, click Edit Alert Configuration in the upper-right corner. Then, click Actions > YAML in the row of the target rule to view the AckAlertRule resource configuration of the current cluster.

  4. Refer to the description of the default alert rule template and modify the sample YAML file.

    Example:

    apiVersion: alert.alibabacloud.com/v1beta1
    kind: AckAlertRule
    metadata:
      name: default
    spec:
      groups:
        # The following is a sample configuration of a cluster event alert rule.
        - name: pod-exceptions                             # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
          rules:
            - name: pod-oom                                # The name of the alert rule.
              type: event                                  # The type of the alert rule (Rule_Type). Valid values: event and metric-cms.
              expression: sls.app.ack.pod.oom              # The expression of the alert rule. When the rule type is event, the value of the expression is the value of Rule_Expression_Id in the default alert rule template described in this topic.
              enable: enable                               # The status of the alert rule. Valid values: enable and disable.
            - name: pod-failed
              type: event
              expression: sls.app.ack.pod.failed
              enable: enable
        # The following is a sample configuration of a cluster basic resource alert rule.
        - name: res-exceptions                              # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
          rules:
            - name: node_cpu_util_high                      # The name of the alert rule.
              type: metric-cms                              # The type of the alert rule (Rule_Type). Valid values: event and metric-cms.
              expression: cms.host.cpu.utilization          # The expression of the alert rule. When the rule type is metric-cms, the value of the expression is the value of Rule_Expression_Id in the default alert rule template described in this topic.
              contactGroups:                                # The alert contact group configuration that is mapped to the alert rule. The configuration is generated by the ACK console. The same contact is used for the same account. The contact can be reused in multiple clusters.
              enable: enable                                # The status of the alert rule. Valid values: enable and disable.
              thresholds:                                   # The threshold of the alert rule. For more information, see the section about how to change the threshold of an alert rule.            
                - key: CMS_ESCALATIONS_CRITICAL_Threshold
                  unit: percent
                  value: '1'
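
    After the alert components are installed, the rule set can also be inspected and edited directly with kubectl. This is a sketch under the assumption that the CRD's plural resource name is ackalertrules (derived from the AckAlertRule kind shown above) and that your kubeconfig points to the cluster:

    ```shell
    # View the current alert rule configuration.
    kubectl get ackalertrules.alert.alibabacloud.com default -n kube-system -o yaml

    # Edit the rules in place; the alert components synchronize the changes
    # to the corresponding monitoring services.
    kubectl edit ackalertrules.alert.alibabacloud.com default -n kube-system
    ```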

Example: Modify the threshold of a cluster basic resource alert rule by using a CRD

In the default alert rule template, the Rule_Type of the cluster resource exception alert rule set is metric-cms, and these rules are synchronized from the basic resource alert rules of CloudMonitor. In this example, the thresholds parameter is added to the CRD entry that corresponds to the Node - CPU usage alert rule to configure the threshold, the number of retries, and the silence period of the basic monitoring alert rule.

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following is a sample configuration of a cluster basic resource alert rule.
    - name: res-exceptions                                        # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: node_cpu_util_high                                # The name of the alert rule.
          type: metric-cms                                        # The type of the alert rule (Rule_Type). Valid values: event and metric-cms.
          expression: cms.host.cpu.utilization                    # The expression of the alert rule. When the rule type is metric-cms, the value of the expression is the value of Rule_Expression_Id in the default alert rule template described in this topic.
          contactGroups:                                          # The alert contact group configuration that is mapped to the alert rule. The configuration is generated by the ACK console. The same contact is used for the same account. The contact can be reused in multiple clusters.
          enable: enable                                          # The status of the alert rule. Valid values: enable and disable.
          thresholds:                                             # The threshold of the alert rule. For more information, see how to configure alert rules by using CRDs.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '1'  
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: '3'  
            - key: CMS_RULE_SILENCE_SEC
              value: '900'  
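A manifest like the one above can be applied with kubectl. This is a sketch that assumes the manifest is saved locally as ack-alert-rule.yaml and that your kubeconfig points to the cluster:

```shell
# Apply the customized thresholds; the existing default AckAlertRule object
# in kube-system is updated in place.
kubectl apply -n kube-system -f ack-alert-rule.yaml
```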

CMS_ESCALATIONS_CRITICAL_Threshold (required)

  The alert threshold. If this parameter is not configured, the rule fails to be synchronized and is disabled.

  • unit: the unit. Valid values: percent, count, and qps.

  • value: the threshold.

  Default value: the value specified in the default alert rule template.

CMS_ESCALATIONS_CRITICAL_Times (optional)

  The number of times that the threshold must be consecutively exceeded before an alert is triggered. If this parameter is not configured, the default value is used. Default value: 3.

CMS_RULE_SILENCE_SEC (optional)

  The silence period (in seconds) after an alert is triggered, which prevents frequent alerting. If this parameter is not configured, the default value is used. Default value: 900.

Default alert rule template

The following tables describe the default alert rule template.

Error event set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Error event

An alert is triggered when an error occurs in the cluster.

Simple Log Service

event

error-event

sls.app.ack.error

Warning event set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Warning event

An alert is triggered when a warning occurs in the cluster, except for warnings that can be ignored.

Simple Log Service

event

warn-event

sls.app.ack.warn

Cluster core component exception alert rule set (ACK managed cluster)

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Cluster API Server unavailability

An alert is triggered when the API Server becomes unavailable, which may limit the cluster management functions.

Managed Service for Prometheus

metric-prometheus

apiserver-unhealthy

prom.apiserver.notHealthy.down

Cluster etcd unavailability

An alert is triggered when etcd becomes unavailable, which affects the status of the entire cluster.

Managed Service for Prometheus

metric-prometheus

etcd-unhealthy

prom.etcd.notHealthy.down

Cluster kube-scheduler unavailability

An alert is triggered when kube-scheduler, which is responsible for pod scheduling, becomes unavailable. This may prevent new pods from starting properly.

Managed Service for Prometheus

metric-prometheus

scheduler-unhealthy

prom.scheduler.notHealthy.down

Cluster KCM unavailability

An alert is triggered when Kubernetes Controller Manager (KCM), which manages control loops, becomes unavailable. This affects the automatic repair and resource adjustment mechanisms of the cluster.

Managed Service for Prometheus

metric-prometheus

kcm-unhealthy

prom.kcm.notHealthy.down

Cluster cloud-controller-manager unavailability

An alert is triggered when cloud-controller-manager, which manages the lifecycle of external cloud service components, becomes unavailable. This may affect the dynamic adjustment functions of services.

Managed Service for Prometheus

metric-prometheus

ccm-unhealthy

prom.ccm.notHealthy.down

CoreDNS Unavailability - Number of requests drops to 0

An alert is triggered when CoreDNS, which is the DNS service of the cluster, becomes unavailable. This affects service discovery and domain name resolution.

Managed Service for Prometheus

metric-prometheus

coredns-unhealthy-requestdown

prom.coredns.notHealthy.requestdown

CoreDNS Unavailability - Panics

An alert is triggered when a panic error occurs in CoreDNS. You must immediately analyze the logs for diagnosis.

Managed Service for Prometheus

metric-prometheus

coredns-unhealthy-panic

prom.coredns.notHealthy.panic

High error request rate of cluster Ingress

An alert is triggered when the error rate of HTTP requests processed by the Ingress controller is high, which may affect the accessibility of services.

Managed Service for Prometheus

metric-prometheus

ingress-err-request

prom.ingress.request.errorRateHigh

Cluster Ingress Controller certificate expiration

An alert is triggered when an SSL certificate is about to expire, which will cause HTTPS requests to fail. You must update the certificate in advance.

Managed Service for Prometheus

metric-prometheus

ingress-ssl-expire

prom.ingress.ssl.expire

Pod Pending cumulative count > 1000

An alert is triggered when too many pods in the cluster remain in the Pending state, which may be due to insufficient resources or unreasonable scheduling policies.

Managed Service for Prometheus

metric-prometheus

pod-pending-accumulate

prom.pod.pending.accumulate

High response time of cluster API Server Mutating Admission Webhook

An alert is triggered when the response time of the Mutating Admission Webhook is too slow, which affects the efficiency of resource creation and modification.

Managed Service for Prometheus

metric-prometheus

apiserver-admit-rt-high

prom.apiserver.mutating.webhook.rt.high

High response time of cluster API Server Validating Admission Webhook

An alert is triggered when the response time of the Validating Admission Webhook is too slow, which may cause configuration changes to be delayed.

Managed Service for Prometheus

metric-prometheus

apiserver-validate-rt-high

prom.apiserver.validation.webhook.rt.high

OOM errors in cluster control plane components

An alert is triggered when a core component of the cluster experiences an out of memory error. You need to perform a detailed investigation of the anomaly to prevent service failure.

Simple Log Service

event

ack-controlplane-oom

sls.app.ack.controlplane.pod.oom

Cluster node pool operations event alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Node self-healing failures

An alert is triggered when a node self-healing process fails. You need to immediately understand the cause and fix the issue to ensure high availability.

Simple Log Service

event

node-repair_failed

sls.app.ack.rc.node_repair_failed

Node CVE fix failures

An alert is triggered when an important CVE fix fails. The security of the cluster may be affected, requiring urgent assessment and remediation.

Simple Log Service

event

nodepool-cve-fix-failed

sls.app.ack.rc.node_vulnerability_fix_failed

Node pool CVE fix successes

An alert is triggered when a CVE fix is successfully applied, which reduces the security risk of known vulnerabilities.

Simple Log Service

event

nodepool-cve-fix-succ

sls.app.ack.rc.node_vulnerability_fix_succeed

Node pool CVE automatic fix skipped

An alert is triggered when an automatic fix is skipped, which may be due to compatibility or specific configuration issues. You need to confirm whether the security policy is reasonable.

Simple Log Service

event

nodepool-cve-fix-skip

sls.app.ack.rc.node_vulnerability_fix_skipped

Node pool kubelet parameter configuration failures

An alert is triggered when the kubelet configuration fails to be updated, which may affect node performance and resource scheduling.

Simple Log Service

event

nodepool-kubelet-cfg-failed

sls.app.ack.rc.node_kubelet_config_failed

Node pool kubelet parameter configuration successes

An alert is triggered when a new kubelet configuration is successfully applied. You need to confirm that the configuration is effective and meets expectations.

Simple Log Service

event

nodepool-kubelet-config-succ

sls.app.ack.rc.node_kubelet_config_succeed

Node pool kubelet upgrade failures

An alert is triggered when a kubelet upgrade fails, which may affect the stability and functionality of the cluster. You need to confirm the upgrade process and configuration.

Simple Log Service

event

nodepool-k-c-upgrade-failed

sls.app.ack.rc.node_kubelet_config_upgrade_failed

Node pool kubelet upgrade successes

An alert is triggered when a kubelet upgrade is successful. After confirming the upgrade success, ensure that the kubelet version meets the cluster and application requirements.

Simple Log Service

event

nodepool-k-c-upgrade-succ

sls.app.ack.rc.kubelet_upgrade_succeed

Node pool runtime upgrade successes

An alert is triggered when the container runtime in a node pool is successfully upgraded.

Simple Log Service

event

nodepool-runtime-upgrade-succ

sls.app.ack.rc.runtime_upgrade_succeed

Node pool runtime upgrade failures

An alert is triggered when the container runtime in a node pool fails to be upgraded.

Simple Log Service

event

nodepool-runtime-upgrade-fail

sls.app.ack.rc.runtime_upgrade_failed

Node pool OS image upgrade successes

An alert is triggered when the operating system image in a node pool is successfully upgraded.

Simple Log Service

event

nodepool-os-upgrade-succ

sls.app.ack.rc.os_image_upgrade_succeed

Node pool OS image upgrade failures

An alert is triggered when the operating system image in a node pool fails to be upgraded.

Simple Log Service

event

nodepool-os-upgrade-failed

sls.app.ack.rc.os_image_upgrade_failed

Lingjun node pool configuration change successes

An alert is triggered when the configuration of a Lingjun node pool is successfully changed.

Simple Log Service

event

nodepool-lingjun-config-succ

sls.app.ack.rc.lingjun_configuration_apply_succeed

Lingjun node pool configuration change failures

An alert is triggered when the configuration of a Lingjun node pool fails to be changed.

Simple Log Service

event

nodepool-lingjun-cfg-failed

sls.app.ack.rc.lingjun_configuration_apply_failed

Cluster node exception alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Cluster node docker process exceptions

An alert is triggered when the dockerd or containerd runtime on a node in the cluster encounters an exception.

Simple Log Service

event

docker-hang

sls.app.ack.docker.hang

Evictions in the cluster

An alert is triggered when a pod is evicted.

Simple Log Service

event

eviction-event

sls.app.ack.eviction

GPU Xid errors

An alert is triggered when a GPU Xid error occurs.

Simple Log Service

event

gpu-xid-error

sls.app.ack.gpu.xid_error

Node changes to the unschedulable state

An alert is triggered when the status of a node changes to unschedulable.

Simple Log Service

event

node-down

sls.app.ack.node.down

Node restarts

An alert is triggered when a node restarts.

Simple Log Service

event

node-restart

sls.app.ack.node.restart

NTP service failures on nodes

An alert is triggered when the Network Time Protocol (NTP) service fails.

Simple Log Service

event

node-ntp-down

sls.app.ack.ntp.down

PLEG errors on nodes

An alert is triggered when a Pod Lifecycle Event Generator (PLEG) error occurs on a node.

Simple Log Service

event

node-pleg-error

sls.app.ack.node.pleg_error

Process errors on nodes

An alert is triggered when a process error occurs on a node.

Simple Log Service

event

ps-hang

sls.app.ack.ps.hang

Excessive file handles on nodes

An alert is triggered when the number of file handles on a node is excessive.

Simple Log Service

event

node-fd-pressure

sls.app.ack.node.fd_pressure

Excessive processes on nodes

An alert is triggered when the number of processes on a node is excessive.

Simple Log Service

event

node-pid-pressure

sls.app.ack.node.pid_pressure

Node deletion failures

An alert is triggered when a node fails to be deleted. Submit a ticket to contact the ACK team.

Simple Log Service

event

node-del-err

sls.app.ack.ccm.del_node_failed

Node adding failures

An alert is triggered when a node fails to be added to the cluster. Submit a ticket to contact the ACK team.

Simple Log Service

event

node-add-err

sls.app.ack.ccm.add_node_failed

Command execution failures in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-run-cmd-err

sls.app.ack.nlc.run_command_fail

Node removal failures in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-empty-cmd

sls.app.ack.nlc.empty_task_cmd

Unimplemented URL mode in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-url-m-unimp

sls.app.ack.nlc.url_mode_unimpl

Unknown repairing operations in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-opt-no-found

sls.app.ack.nlc.op_not_found

Node draining and removal failures in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-des-node-err

sls.app.ack.nlc.destroy_node_fail

Node draining failures in managed node pools

An alert is triggered when a node in a managed node pool fails to be drained. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-drain-node-err

sls.app.ack.nlc.drain_node_fail

ECS restart timeouts in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-restart-ecs-wait

sls.app.ack.nlc.restart_ecs_wait_fail

ECS restart failures in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-restart-ecs-err

sls.app.ack.nlc.restart_ecs_fail

ECS reset failures in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-reset-ecs-err

sls.app.ack.nlc.reset_ecs_fail

Auto-repair task failures in managed node pools

An alert is triggered when a node pool error occurs. Submit a ticket to contact the ACK team.

Simple Log Service

event

nlc-sel-repair-err

sls.app.ack.nlc.repair_fail

Cluster resource exception alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Node - CPU usage ≥ 85%

An alert is triggered when the CPU usage of a node exceeds the threshold. The default threshold is 85%.

If the percentage of available CPU resources is less than 15%, the CPU resources reserved for components may become insufficient. For more information, see Resource reservation policy. Consequently, CPU throttling may be frequently triggered and processes may respond slowly. We recommend that you optimize the CPU usage or adjust the threshold at the earliest opportunity.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

node_cpu_util_high

cms.host.cpu.utilization

Node - Memory usage ≥ 85%

An alert is triggered when the memory usage of a node exceeds the threshold. The default threshold is 85%.

If the percentage of available memory resources is less than 15%, the memory resources reserved for components may become insufficient. For more information, see Resource reservation policy. In this scenario, kubelet forcibly evicts pods from the node. We recommend that you optimize the memory usage or adjust the threshold at the earliest opportunity.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

node_mem_util_high

cms.host.memory.utilization

Node - Disk usage ≥ 85%

An alert is triggered when the disk usage of a node exceeds the threshold. The default threshold is 85%.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

node_disk_util_high

cms.host.disk.utilization

Node - Usage of outbound public bandwidth ≥ 85%

An alert is triggered when the usage of the outbound public bandwidth of a node exceeds the threshold. The default threshold is 85%.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

node_public_net_util_high

cms.host.public.network.utilization

Node - Inode usage ≥ 85%

An alert is triggered when the inode usage of a node exceeds the threshold. The default threshold is 85%.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

node_fs_inode_util_high

cms.host.fs.inode.utilization

Resources - QPS usage of an SLB instance ≥ 85%

An alert is triggered when the QPS usage of a Server Load Balancer (SLB) instance exceeds the threshold. The default threshold is 85%.

Note

The SLB instance refers to the SLB instance that is associated with API Server or Ingress.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

slb_qps_util_high

cms.slb.qps.utilization

Resources - Usage of SLB outbound bandwidth ≥ 85%

An alert is triggered when the usage of the outbound bandwidth of an SLB instance exceeds the threshold. The default threshold is 85%.

Note

The SLB instance refers to the SLB instance that is associated with API Server or Ingress.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

slb_traff_tx_util_high

cms.slb.traffic.tx.utilization

Resources - Usage of the maximum connections of an SLB instance ≥ 85%

An alert is triggered when the usage of the maximum number of connections of an SLB instance exceeds the threshold. The default threshold is 85%.

Note

The SLB instance refers to the SLB instance that is associated with API Server or Ingress.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

slb_max_con_util_high

cms.slb.max.connection.utilization

Resources - Dropped connections per second on the listeners of an SLB instance ≥ 1

An alert is triggered when the number of connections dropped per second by the listeners of an SLB instance stays at or above the threshold. The default threshold is 1.

Note

The SLB instance refers to the SLB instance that is associated with API Server or Ingress.

For information about how to adjust the threshold, see Example: Modify the threshold of a cluster basic resource alert rule by using a CRD.

CloudMonitor

metric-cms

slb_drop_con_high

cms.slb.drop.connection

Insufficient node disk space

An alert is triggered when the disk space of a node is insufficient.

Simple Log Service

event

node-disk-pressure

sls.app.ack.node.disk_pressure

Insufficient node resources for scheduling

An alert is triggered when a node has insufficient resources for scheduling.

Simple Log Service

event

node-res-insufficient

sls.app.ack.resource.insufficient

Insufficient node IP addresses

An alert is triggered when node IP addresses are insufficient.

Simple Log Service

event

node-ip-pressure

sls.app.ack.ip.not_enough

Disk usage exceeds the threshold

An alert is triggered when the usage of a disk exceeds the specified threshold. You can check the usage of a disk that is mounted to your cluster.

Simple Log Service

event

disk_space_press

sls.app.ack.csi.no_enough_disk_space
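The CloudMonitor-based thresholds above (rule type metric-cms) are adjusted by editing the AckAlertRule CRD in the cluster. The following is a minimal sketch only; the group name and threshold field names are assumptions inferred from the default rule template, so verify them against the actual resource (for example, with kubectl get ackalertrule default -n kube-system -o yaml) before applying changes:

```yaml
# Hedged sketch: raise the threshold of the slb_max_con_util_high rule
# from the default 85% to 90%. Group and field names are assumptions;
# check the existing AckAlertRule resource in your cluster first.
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
  namespace: kube-system
spec:
  groups:
    - name: slb-abnormal                      # rule set containing the SLB rules (assumed)
      rules:
        - name: slb_max_con_util_high         # ACK_CR_Rule_Name from the table above
          type: metric-cms
          enable: enable
          thresholds:
            - key: CMS_RULE_TARGET_THRESHOLD  # assumed threshold key
              value: "90"
```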

ACK control plane operations notification alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

ACK cluster task notifications

An alert is triggered to record control plane operations and notify you of related plans and changes.

Simple Log Service

event

ack-system-event-info

sls.app.ack.system_events.task.info

ACK cluster task failure notifications

An alert is triggered when a cluster operation fails. You need to pay attention to this alert and investigate the cause in a timely manner.

Simple Log Service

event

ack-system-event-error

sls.app.ack.system_events.task.error

Cluster auto scaling alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Auto scaling - Scale-out nodes

An alert is triggered when nodes are automatically scaled out to handle increased load requests.

Simple Log Service

event

autoscaler-scaleup

sls.app.ack.autoscaler.scaleup_group

Auto scaling - Scale-in nodes

An alert is triggered when nodes are automatically scaled in to save resources after the load decreases.

Simple Log Service

event

autoscaler-scaledown

sls.app.ack.autoscaler.scaledown

Auto scaling - Scale-out timeouts

An alert is triggered when a scale-out operation times out, which may indicate insufficient resources or inappropriate policies.

Simple Log Service

event

autoscaler-scaleup-timeout

sls.app.ack.autoscaler.scaleup_timeout

Auto scaling - Scale-in empty nodes

An alert is triggered when inactive nodes are identified and cleaned up to optimize resource usage.

Simple Log Service

event

autoscaler-scaledown-empty

sls.app.ack.autoscaler.scaledown_empty

Auto scaling - Scale-out node failures

An alert is triggered when a scale-out operation fails. You need to immediately analyze the cause and adjust the resource policy.

Simple Log Service

event

autoscaler-up-group-failed

sls.app.ack.autoscaler.scaleup_group_failed

Auto scaling - Cluster unhealthy

An alert is triggered when the cluster becomes unhealthy during a scaling operation. This requires immediate handling.

Simple Log Service

event

autoscaler-cluster-unhealthy

sls.app.ack.autoscaler.cluster_unhealthy

Auto scaling - Delete nodes that have been started for a long time

An alert is triggered when nodes that have remained in the started state beyond the timeout period are deleted to reclaim resources.

Simple Log Service

event

autoscaler-del-started

sls.app.ack.autoscaler.delete_started_timeout

Auto scaling - Delete unregistered nodes

An alert is triggered when unregistered nodes are removed to free cluster resources.

Simple Log Service

event

autoscaler-del-unregistered

sls.app.ack.autoscaler.delete_unregistered

Auto scaling - Scale-in failures

An alert is triggered when a scale-in operation fails, which may lead to resource waste and uneven load distribution.

Simple Log Service

event

autoscaler-scale-down-failed

sls.app.ack.autoscaler.scaledown_failed

Auto scaling - Deleted nodes not drained

An alert is triggered when pods on a node that is being deleted by an auto scaling operation fail to be evicted or migrated.

Simple Log Service

event

autoscaler-instance-expired

sls.app.ack.autoscaler.instance_expired

Cluster application workload alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Job execution failures

An alert is triggered when a Job fails during execution. Submit a ticket to contact the ACK team.

Managed Service for Prometheus

metric-prometheus

job-failed

prom.job.failed

Deployment pod anomalies

An alert is triggered when the number of available replicas of a Deployment is insufficient, which may cause the service to be unavailable or partially unavailable. Submit a ticket to contact the ACK team.

Managed Service for Prometheus

metric-prometheus

deployment-rep-err

prom.deployment.replicaError

DaemonSet pod anomalies

An alert is triggered when some pods of a DaemonSet are in an abnormal state (such as failing to start or crashing), which affects the expected behavior or services of the node. Submit a ticket to contact the ACK team.

Managed Service for Prometheus

metric-prometheus

daemonset-status-err

prom.daemonset.scheduledError

DaemonSet pod scheduling errors

An alert is triggered when a DaemonSet fails to schedule some or all nodes, which may be due to resource constraints or inappropriate scheduling policies. Submit a ticket to contact the ACK team.

Managed Service for Prometheus

metric-prometheus

daemonset-misscheduled

prom.daemonset.misscheduled

Cluster pod exception alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Pod OOM errors

An alert is triggered when an out of memory (OOM) error occurs in a pod. Submit a ticket to contact the ACK team.

Simple Log Service

event

pod-oom

sls.app.ack.pod.oom

Pod launch failures

An alert is triggered when a pod fails to start. Submit a ticket to contact the ACK team.

Simple Log Service

event

pod-failed

sls.app.ack.pod.failed

Abnormal pod status

An alert is triggered when a pod is in an unhealthy state (such as Pending, Failed, or Unknown). Submit a ticket to contact the ACK team.

Managed Service for Prometheus

metric-prometheus

pod-status-err

prom.pod.status.notHealthy

Pod restart failures

An alert is triggered when a pod frequently fails to start, enters the CrashLoopBackOff state, or experiences other startup failures. Submit a ticket to contact the ACK team.

Managed Service for Prometheus

metric-prometheus

pod-crashloop

prom.pod.status.crashLooping

Cluster storage exception event alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Cloud disk size less than the 20 GiB minimum

ACK does not allow you to mount a disk of less than 20 GiB. You can check the sizes of the disks that are attached to your cluster.

Simple Log Service

event

csi_invalid_size

sls.app.ack.csi.invalid_disk_size

Subscription cloud disks cannot be mounted

ACK does not allow you to mount a subscription disk. You can check the billing methods of the disks that are attached to your cluster.

Simple Log Service

event

csi_not_portable

sls.app.ack.csi.disk_not_portable

Mount target unmount failures because the mount target is in use

An alert is triggered when resources have not been fully released or active processes are still accessing the mount target. Submit a ticket to contact the ACK team.

Simple Log Service

event

csi_device_busy

sls.app.ack.csi.deivce_busy

No available cloud disk

An alert is triggered when no disk is available while a disk is being mounted to the cluster. Submit a ticket to contact the ACK team.

Simple Log Service

event

csi_no_ava_disk

sls.app.ack.csi.no_ava_disk

I/O hangs on cloud disks

An alert is triggered when I/O hangs occur on a disk. Submit a ticket to contact the ACK team.

Simple Log Service

event

csi_disk_iohang

sls.app.ack.csi.disk_iohang

Slow I/O rate of PVC used to mount cloud disks

An alert is triggered when the I/O of a disk that is mounted by using a persistent volume claim (PVC) is slow. Submit a ticket to contact the ACK team.

Simple Log Service

event

csi_latency_high

sls.app.ack.csi.latency_too_high

PV anomalies

An alert is triggered when a persistent volume (PV) experiences an anomaly. Submit a ticket to contact the ACK team.

Managed Service for Prometheus

metric-prometheus

pv-failed

prom.pv.failed

Cluster network exception event alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Multiple route tables in a VPC

An alert is triggered when multiple route tables exist in a VPC, which may lead to complex network configuration or route conflicts. You need to optimize the network structure in a timely manner. Submit a ticket to contact the ACK team.

Simple Log Service

event

ccm-vpc-multi-route-err

sls.app.ack.ccm.describe_route_tables_failed

No available SLB instance

An alert is triggered when an SLB instance fails to be created. Submit a ticket to contact the ACK team.

Simple Log Service

event

slb-no-ava

sls.app.ack.ccm.no_ava_slb

SLB instance update failures

An alert is triggered when an SLB instance fails to be updated. Submit a ticket to contact the ACK team.

Simple Log Service

event

slb-sync-err

sls.app.ack.ccm.sync_slb_failed

SLB instance deletion failures

An alert is triggered when an SLB instance fails to be deleted. Submit a ticket to contact the ACK team.

Simple Log Service

event

slb-del-err

sls.app.ack.ccm.del_slb_failed

Route creation failures

An alert is triggered when a cluster fails to create a route in the virtual private cloud (VPC). Submit a ticket to contact the ACK team.

Simple Log Service

event

route-create-err

sls.app.ack.ccm.create_route_failed

Route synchronization failures

An alert is triggered when a cluster fails to update a route in the VPC. Submit a ticket to contact the ACK team.

Simple Log Service

event

route-sync-err

sls.app.ack.ccm.sync_route_failed

Invalid Terway resources

An alert is triggered when a Terway resource is invalid. Submit a ticket to contact the ACK team.

Simple Log Service

event

terway-invalid-res

sls.app.ack.terway.invalid_resource

IP allocation failures of Terway

An alert is triggered when an IP address fails to be allocated in Terway mode. Submit a ticket to contact the ACK team.

Simple Log Service

event

terway-alloc-ip-err

sls.app.ack.terway.alloc_ip_fail

Ingress bandwidth configuration parsing failures

An alert is triggered when the bandwidth configuration of an Ingress fails to be parsed. Submit a ticket to contact the ACK team.

Simple Log Service

event

terway-parse-err

sls.app.ack.terway.parse_fail

Network resource allocation failures of Terway

An alert is triggered when a network resource fails to be allocated in Terway mode. Submit a ticket to contact the ACK team.

Simple Log Service

event

terway-alloc-res-err

sls.app.ack.terway.allocate_failure

Network resource reclaiming failures of Terway

An alert is triggered when a network resource fails to be reclaimed in Terway mode. Submit a ticket to contact the ACK team.

Simple Log Service

event

terway-dispose-err

sls.app.ack.terway.dispose_failure

Terway virtual mode changes

An alert is triggered when the Terway virtual mode is changed. Submit a ticket to contact the ACK team.

Simple Log Service

event

terway-virt-mod-err

sls.app.ack.terway.virtual_mode_change

Terway triggers pod IP configuration check

An alert is triggered when Terway triggers a pod IP configuration check.

Simple Log Service

event

terway-ip-check

sls.app.ack.terway.config_check

Ingress configuration reload failures

An alert is triggered when the configuration of an Ingress fails to be reloaded. In this case, check whether the Ingress configuration is valid. Submit a ticket to contact the ACK team.

Simple Log Service

event

ingress-reload-err

sls.app.ack.ingress.err_reload_nginx

Cluster important audit operation alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Container login or command execution operations in the cluster

An alert is triggered when a user logs on to a container or executes a command, which may be routine maintenance or abnormal activity. Audit these operations for tracking and security detection.

Simple Log Service

event

audit-at-command

sls.app.k8s.audit.at.command

Node schedulability status changes in the cluster

An alert is triggered when the schedulability status of a node changes, which affects service efficiency and resource load. You need to promptly confirm the intent of the change and verify its effect.

Simple Log Service

event

audit-cordon-switch

sls.app.k8s.audit.at.cordon.uncordon

Resource deletion operations in the cluster

An alert is triggered when a resource is deleted, which may be planned or abnormal behavior. We recommend that you audit the operation to prevent risks.

Simple Log Service

event

audit-resource-delete

sls.app.k8s.audit.at.delete

Node draining or eviction behaviors in the cluster

An alert is triggered when a node is drained or pods are evicted, which reflects node load pressure or policy execution. You need to confirm the necessity and impact.

Simple Log Service

event

audit-drain-eviction

sls.app.k8s.audit.at.drain.eviction

Public network login behaviors in the cluster

An alert is triggered when a user logs on over the public network, which may pose security risks. You need to confirm that the logon is expected and review the access permission configuration.

Simple Log Service

event

audit-internet-login

sls.app.k8s.audit.at.internet.login

Node label updates in the cluster

An alert is triggered when node labels are updated. Node labels are used to distinguish and manage node resources, and their correctness affects O&M efficiency.

Simple Log Service

event

audit-node-label-update

sls.app.k8s.audit.at.label

Node taint updates in the cluster

An alert is triggered when node taints are changed, which affects scheduling policies and toleration mechanisms. You need to verify and audit the configuration change.

Simple Log Service

event

audit-node-taint-update

sls.app.k8s.audit.at.taint

Resource modification operations in the cluster

An alert is triggered when resource configurations are modified, which may indicate adjustments to application policies. You need to verify that the changes meet business objectives.

Simple Log Service

event

audit-resource-update

sls.app.k8s.audit.at.update

Cluster security exception events

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

High-risk configurations detected in inspections

An alert is triggered when a high-risk configuration is detected during a cluster inspection. Submit a ticket to contact the ACK team.

Simple Log Service

event

si-c-a-risk

sls.app.ack.si.config_audit_high_risk

Cluster inspection exception event alert rule set

Alert item

Rule description

Alert source

Rule_Type

ACK_CR_Rule_Name

SLS_Event_ID

Anomalies detected in cluster inspection

An alert is triggered when the automatic inspection mechanism detects potential anomalies. You need to analyze the specific issue and adjust your routine maintenance strategy accordingly. Submit a ticket to contact the ACK team.

Simple Log Service

event

cis-sched-failed

sls.app.ack.cis.schedule_task_failed

FAQ

The alert rule fails to be synchronized and the error message "The Project does not exist : k8s-log-xxx" is returned

  • Issue:

    The alert rule synchronization status in the alert center shows the error message The Project does not exist : k8s-log-xxx.

  • Cause:

    You did not create an event center in Simple Log Service for your cluster.

  • Solution:

    1. In the Simple Log Service console, check whether you have reached the quota limit. For more information about resources, see Basic resources.

      1. If you have reached the quota limit, delete unnecessary projects or submit a ticket to request an increase in the project resource quota limit. For information about how to delete a project, see Manage projects.

      2. If you have not reached the quota limit, perform the following steps.

    2. Reinstall ack-node-problem-detector.

      When you reinstall the component, a default project named k8s-log-xxxxxx is created.

      1. Uninstall ack-node-problem-detector.

        1. In the left-side navigation pane of the details page of the target cluster in the ACK console, choose Operations > Components.

        2. Click the Logs & Monitoring tab. In the ack-node-problem-detector card, click Uninstall. In the dialog box that appears, click OK.

      2. After the uninstallation is complete, install ack-node-problem-detector.

        1. In the left-side navigation pane, choose Operations > Alert Configuration.

        2. On the Alert Configuration page, click Start Installation. The console automatically creates a project and installs or upgrades the required components.

    3. On the Alert Configuration page, turn off the switch in the Enabled column for the corresponding alert rule set. Wait until Alert Rule Status changes to Rule Disabled, and then turn on the switch to retry.
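Turning the rule set off and on in the console corresponds to toggling the enable state of the affected rules in the cluster's AckAlertRule CRD. As a hedged illustration only (the enable field name and its enable/disable values are assumptions; confirm against the actual resource in your cluster), the same retry can be sketched as:

```yaml
# Hypothetical excerpt of the default AckAlertRule resource in kube-system.
# Set enable to "disable", wait for the rule status to change, then set it
# back to "enable" to trigger a fresh synchronization.
spec:
  groups:
    - name: error-events                 # example rule set name (assumed)
      rules:
        - name: ack-system-event-error   # ACK_CR_Rule_Name of the rule to retry
          type: event
          enable: disable                # flip back to "enable" to retry
```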

The alert rule fails to be synchronized and an error message similar to "this rule have no xxx contact groups reference" is returned

  • Issue:

    The alert rule fails to be synchronized and an error message similar to "this rule have no xxx contact groups reference" is returned.

  • Cause:

    No contact group subscribes to the alert rule.

  • Solution:

    1. Create a contact group and add contacts.

    2. Click Edit Notification Object on the right side of the corresponding alert rule set and configure a contact group that subscribes to the alert rule set.