Container Service for Kubernetes (ACK) provides the alert management feature, which allows you to centrally manage alerts that are triggered in different scenarios. You can configure alert rules to get notified when a service exception occurs or when key metrics of basic cluster resources, core cluster components, or applications exceed their thresholds. You can enable alerting when you create a cluster. ACK also allows you to deploy CustomResourceDefinitions (CRDs) in a cluster to configure and manage alert rules. This topic describes the use scenarios of alert management, how to configure alert rules, and how to enable alert management for an ACK dedicated cluster.

Background information

The alert management feature allows you to manage the following types of alerts:
  • Alerts that are triggered by events of cluster exceptions. The event data is synchronized from the event center of ACK. For more information, see Event monitoring.
  • Alerts that are triggered when the key metrics of basic cluster resources exceed thresholds. The metrics are synchronized from CloudMonitor. For more information, see Monitor basic resources.

Scenarios

ACK allows you to centrally configure alert rules and manage alerts in various scenarios. The alert management feature is commonly used in the following scenarios:

  • Cluster O&M
    You can configure alert rules to detect exceptions in cluster management, storage, networks, and elastic scaling at the earliest opportunity. Examples:
    • Use the alert rule set for resource exceptions to get notified when key metrics of basic cluster resources, such as CPU usage, memory usage, and network latency, exceed the specified thresholds. If you receive alert notifications, you can take measures to ensure cluster stability.
    • Use the alert rule set for cluster exceptions to get notified of node or container exceptions. Alerts are triggered upon events such as Docker process exceptions, node process exceptions, or pod restart failures.
    • Use the alert rule set for storage exceptions to get notified of storage changes and exceptions.
    • Use the alert rule set for network exceptions to get notified of network changes and exceptions.
    • Use the alert rule set for O&M exceptions to get notified of changes and exceptions that are related to cluster control.
  • Application development

    You can configure alert rules to get notified of exceptions and abnormal metrics of the applications that run in the cluster at the earliest opportunity, for example, when the replicated pods of an application encounter exceptions or when the CPU or memory usage of a Deployment exceeds the thresholds. You can use the default alert rule template to quickly set up such alerts. For example, you can configure and enable the alert rule set for pod exceptions to get notified of exceptions in the pods of your application.

  • Application management

    To get notified of the issues that occur throughout the lifecycle of an application, we recommend that you take note of application health, capacity planning, cluster stability, exceptions, and errors. You can configure and enable the alert rule set for critical events to get notified of warnings and errors in the cluster. You can configure and enable the alert rule set for resource exceptions to get notified of abnormal resource usage in the cluster and optimize capacity planning.

  • Multi-cluster management

    When you manage multiple clusters, you may find it a complex task to configure and synchronize alert rules across the clusters. ACK allows you to deploy CRDs in the cluster to manage alert rules. You can configure the same CRDs to conveniently synchronize alert rules across multiple clusters.
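    For example, if you keep the same AckAlertRule manifest in a file, you can apply it to each cluster by switching kubeconfig contexts. The file name and context names in the following sketch are placeholders:

      # Apply the same AckAlertRule manifest (ack-alert-rules.yaml is a placeholder file name) to several clusters.
      # "cluster-a" and "cluster-b" are placeholders for your own kubeconfig contexts.
      kubectl --context cluster-a -n kube-system apply -f ack-alert-rules.yaml
      kubectl --context cluster-b -n kube-system apply -f ack-alert-rules.yaml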

Install and update the components

Before you enable alerting, the ACK console automatically checks whether the required services are activated and whether the required components are installed and up to date.

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane, choose Operations > Alerts.
  5. On the Alerts page, the console automatically checks whether the following conditions are met.
    If not all conditions are met, follow the on-screen instructions to install or update the required components.
    • Log Service is activated. If Log Service is not activated, log on to the Log Service console and follow the on-screen instructions to activate the service.
      Note For more information about the billing rules of Log Service, see Billable items.
    • Event Center is installed. For more information, see Event monitoring.
    • The alicloud-monitor-controller component is updated to the latest version. For more information, see alicloud-monitor-controller.
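After the checks pass, you can optionally verify from the command line that the related components are running in the cluster. The following commands are a minimal sketch; the deployment and component names are taken from the components referenced in this topic and may differ in your cluster:

    # Check the alert controller deployment in the kube-system namespace.
    kubectl -n kube-system get deployment alicloud-monitor-controller
    # Check the event center component (ack-node-problem-detector), which is referenced in the FAQ of this topic.
    kubectl -n kube-system get pods | grep node-problem-detector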

Set up alerting

ACK managed clusters and ACK dedicated clusters support the alerting feature.

Step 1: Enable the default alert rules

  • When you create an ACK managed cluster, select Use Default Alert Rule Template and specify an alert contact group.
    After you select this option, the system automatically creates default alert rules and sends alert notifications to the specified contact group.

    For more information, see Create an ACK managed cluster.

  • To set up alerting for an existing cluster, you can enable alert rules for the cluster.
    1. In the left-side navigation pane, choose Operations > Alerts.
    2. On the Alert Rules tab, select an alert rule set and turn on Status to enable the alert rule set.
    For more information, see Step 2: Configure alert rules.

Step 2: Configure alert rules

After you create an ACK managed cluster or an ACK dedicated cluster, you can manage alert rules, alert contacts, and alert contact groups.

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane, choose Operations > Alerts. The Alerts page provides the following features:
    • Alert Rules
      • By default, ACK provides an alert rule template that you can use to generate alerts based on exceptions and metrics.
      • Alert rules are classified into several alert rule sets. You can enable or disable an alert rule set and configure multiple alert contact groups for it.
      • An alert rule set contains multiple alert rules, and each alert rule corresponds to an alert item. You can create a YAML file to configure multiple alert rule sets in a cluster. You can also modify the YAML file to update alert rules.
      • For more information about how to configure alert rules by using a YAML file, see Configure alert rules by using CRDs.
      • For more information about the default alert rule template, see Default alert rule template.
    • Alert History: You can view up to 100 historical alerts. You can select an alert and click the link in the Alert Rule column to view rule details in the monitoring system, or click Details to go to the resource page on which the alert was triggered. An alert may be triggered by an exception or an abnormal metric.
    • Alert Contacts: You can create, edit, or delete alert contacts.
      The alert rule set for resource exceptions includes alert rules for basic node resources. Before an alert contact can receive alerts on basic cluster resources, the mobile phone number and email address of the contact must be verified in the CloudMonitor console. You can view and update information about an alert contact in the CloudMonitor console. If the verification has expired, delete the contact in the CloudMonitor console, and then refresh the Alert Contacts page in the ACK console.
    • Alert Contact Groups: You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration.
  5. On the Alert Rules tab, click Modify Contacts to specify the contact groups to which the alerts are sent. You can turn on or turn off Status to enable or disable the alert rule set.

Configure alert rules by using CRDs

If you enable the alerting feature, the system automatically creates a resource object of the AckAlertRule type in the kube-system namespace. The resource object contains the default alert rule template. You can use the resource object to configure alert rule sets.

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane, choose Operations > Alerts.
  5. In the upper-right corner of the Alert Rules tab, click Configure Alert Rule. You can view the configuration of the AckAlertRule object and modify the YAML file to update the configuration.
    Example:
    apiVersion: alert.alibabacloud.com/v1beta1
    kind: AckAlertRule
    metadata:
      name: default
    spec:
      groups:
        # The following code is a sample alert rule based on cluster events. 
        - name: pod-exceptions                             # The name of the alert rule set. This parameter corresponds to the Group_Name field in the alert rule template. 
          rules:
            - name: pod-oom                                # The name of the alert rule. 
              type: event                                  # The type of the alert rule (corresponds to the Rule_Type parameter). Valid values: event and metric-cms. 
              expression: sls.app.ack.pod.oom              # The alert rule expression. If you set the rule type to event, the expression is set to the value of Rule_Expression_Id in the default alert rule template. 
              enable: enable                               # The status of the alert rule. Valid values: enable and disable. 
            - name: pod-failed
              type: event
              expression: sls.app.ack.pod.failed
              enable: enable
        # The following code is a sample alert rule for basic cluster resources. 
        - name: res-exceptions                              # The name of the alert rule set. This parameter corresponds to the Group_Name field in the alert rule template. 
          rules:
            - name: node_cpu_util_high                      # The name of the alert rule. 
              type: metric-cms                              # The type of the alert rule (corresponds to the Rule_Type parameter). Valid values: event and metric-cms. 
              expression: cms.host.cpu.utilization          # The alert rule expression, which is set to the value of Rule_Expression_Id in the default alert rule template. 
              contactGroups:                                # The contact group that is associated with the alert rule. The contacts created by an Alibaba Cloud account are shared by all clusters within the account. 
              enable: enable                                # The status of the alert rule. Valid values: enable and disable. 
              thresholds:                                   # The alert threshold. For more information, see the "Modify the alert threshold for basic cluster resources" section of this topic.             
                - key: CMS_ESCALATIONS_CRITICAL_Threshold
                  unit: percent
                  value: '1'
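As an alternative to the console, you can view and edit the AckAlertRule object with kubectl. The resource name ackalertrule in the following sketch is an assumption based on the kind shown above; confirm the exact name with kubectl api-resources:

    # Confirm the resource name that the AckAlertRule CRD registers (assumed: ackalertrules).
    kubectl api-resources | grep -i alertrule
    # View the default alert rule configuration in the kube-system namespace.
    kubectl -n kube-system get ackalertrule default -o yaml
    # Apply an edited copy of the configuration, for example one saved locally as ack-alert-rules.yaml.
    kubectl -n kube-system apply -f ack-alert-rules.yaml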

Default alert rule template

ACK creates the default alert rules when either of the following conditions is met:
  • The default alert rules are enabled.
  • You go to the Alert Rules tab for the first time and the default alert rules are not enabled.
The following list describes the default alert rules, grouped by alert rule set. For each rule, the rule type (Rule_Type), the rule name that is used in the AckAlertRule object (ACK_CR_Rule_Name), and the event or metric ID (SLS_Event_ID) are listed.

Alert rule set for critical events in the cluster
  • Errors: An alert is triggered when an error occurs in the cluster.
    Rule_Type: event | ACK_CR_Rule_Name: error-event | SLS_Event_ID: sls.app.ack.error
  • Warnings: An alert is triggered when a warning occurs in the cluster, except for warnings that can be ignored.
    Rule_Type: event | ACK_CR_Rule_Name: warn-event | SLS_Event_ID: sls.app.ack.warn

Alert rule set for cluster exceptions
  • Docker process exceptions on nodes: An alert is triggered when a dockerd exception or a containerd exception occurs on a node.
    Rule_Type: event | ACK_CR_Rule_Name: docker-hang | SLS_Event_ID: sls.app.ack.docker.hang
  • Evictions in the cluster: An alert is triggered when a pod is evicted.
    Rule_Type: event | ACK_CR_Rule_Name: eviction-event | SLS_Event_ID: sls.app.ack.eviction
  • GPU XID errors: An alert is triggered when a GPU XID error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: gpu-xid-error | SLS_Event_ID: sls.app.ack.gpu.xid_error
  • Node changes to the unschedulable state: An alert is triggered when the status of a node changes to unschedulable.
    Rule_Type: event | ACK_CR_Rule_Name: node-down | SLS_Event_ID: sls.app.ack.node.down
  • Node restarts: An alert is triggered when a node restarts.
    Rule_Type: event | ACK_CR_Rule_Name: node-restart | SLS_Event_ID: sls.app.ack.node.restart
  • NTP service failures on nodes: An alert is triggered when the Network Time Protocol (NTP) service fails on a node.
    Rule_Type: event | ACK_CR_Rule_Name: node-ntp-down | SLS_Event_ID: sls.app.ack.ntp.down
  • PLEG errors on nodes: An alert is triggered when a Pod Lifecycle Event Generator (PLEG) error occurs on a node.
    Rule_Type: event | ACK_CR_Rule_Name: node-pleg-error | SLS_Event_ID: sls.app.ack.node.pleg_error
  • Process errors on nodes: An alert is triggered when a process error occurs on a node.
    Rule_Type: event | ACK_CR_Rule_Name: ps-hang | SLS_Event_ID: sls.app.ack.ps.hang

Alert rule set for resource exceptions
For the rules of the metric-cms type in this set, you can adjust the thresholds as described in Modify the alert threshold for basic cluster resources.
  • Node - CPU usage ≥ 85%: An alert is triggered when the CPU usage of a node exceeds the threshold. The default threshold is 85%. If the percentage of available CPU resources is less than 15%, kubelet forcibly evicts pods from the node.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: node_cpu_util_high | SLS_Event_ID: cms.host.cpu.utilization
  • Node - Memory usage ≥ 85%: An alert is triggered when the memory usage of a node exceeds the threshold. The default threshold is 85%. If the percentage of available memory is less than 15%, kubelet forcibly evicts pods from the node.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: node_mem_util_high | SLS_Event_ID: cms.host.memory.utilization
  • Node - Disk usage ≥ 85%: An alert is triggered when the disk usage of a node exceeds the threshold. The default threshold is 85%.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: node_disk_util_high | SLS_Event_ID: cms.host.disk.utilization
  • Node - Usage of outbound public bandwidth ≥ 85%: An alert is triggered when the usage of the outbound public bandwidth of a node exceeds the threshold. The default threshold is 85%.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: node_public_net_util_high | SLS_Event_ID: cms.host.public.network.utilization
  • Node - Inode usage ≥ 85%: An alert is triggered when the inode usage of a node exceeds the threshold. The default threshold is 85%.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: node_fs_inode_util_high | SLS_Event_ID: cms.host.fs.inode.utilization
  • Resources - QPS usage of an SLB instance ≥ 85%: An alert is triggered when the queries-per-second (QPS) usage of a Server Load Balancer (SLB) instance exceeds the threshold. The default threshold is 85%. This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: slb_qps_util_high | SLS_Event_ID: cms.slb.qps.utilization
  • Resources - Usage of SLB outbound bandwidth ≥ 85%: An alert is triggered when the usage of the outbound bandwidth of an SLB instance exceeds the threshold. The default threshold is 85%. This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: slb_traff_tx_util_high | SLS_Event_ID: cms.slb.traffic.tx.utilization
  • Resources - Usage of the maximum connections of an SLB instance ≥ 85%: An alert is triggered when the usage of the maximum number of connections of an SLB instance exceeds the threshold. The default threshold is 85%. This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: slb_max_con_util_high | SLS_Event_ID: cms.slb.max.connection.utilization
  • Resources - Connection drops per second of the listeners of an SLB instance remain ≥ 1: An alert is triggered when the number of connections dropped per second by the listeners of an SLB instance remains at 1 or more. The default threshold is 1. This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses.
    Rule_Type: metric-cms | ACK_CR_Rule_Name: slb_drop_con_high | SLS_Event_ID: cms.slb.drop.connection
  • Excessive file handles on nodes: An alert is triggered when excessive file handles exist on a node.
    Rule_Type: event | ACK_CR_Rule_Name: node-fd-pressure | SLS_Event_ID: sls.app.ack.node.fd_pressure
  • Insufficient node disk space: An alert is triggered when the disk space of a node is insufficient.
    Rule_Type: event | ACK_CR_Rule_Name: node-disk-pressure | SLS_Event_ID: sls.app.ack.node.disk_pressure
  • Excessive processes on nodes: An alert is triggered when excessive processes run on a node.
    Rule_Type: event | ACK_CR_Rule_Name: node-pid-pressure | SLS_Event_ID: sls.app.ack.node.pid_pressure
  • Insufficient node resources for scheduling: An alert is triggered when a node has insufficient resources for scheduling.
    Rule_Type: event | ACK_CR_Rule_Name: node-res-insufficient | SLS_Event_ID: sls.app.ack.resource.insufficient
  • Insufficient node IP addresses: An alert is triggered when node IP addresses are insufficient.
    Rule_Type: event | ACK_CR_Rule_Name: node-ip-pressure | SLS_Event_ID: sls.app.ack.ip.not_enough

Alert rule set for pod exceptions
  • Pod OOM errors: An alert is triggered when an out of memory (OOM) error occurs in a pod.
    Rule_Type: event | ACK_CR_Rule_Name: pod-oom | SLS_Event_ID: sls.app.ack.pod.oom
  • Pod restart failures: An alert is triggered when a pod fails to restart.
    Rule_Type: event | ACK_CR_Rule_Name: pod-failed | SLS_Event_ID: sls.app.ack.pod.failed
  • Image pull failures: An alert is triggered when an image fails to be pulled.
    Rule_Type: event | ACK_CR_Rule_Name: image-pull-back-off | SLS_Event_ID: sls.app.ack.image.pull_back_off

Alert rule set for O&M exceptions
For all of the following rules, Submit a ticket to contact the ACK technical team if an alert is triggered.
  • No available SLB instance: An alert is triggered when an SLB instance fails to be created.
    Rule_Type: event | ACK_CR_Rule_Name: slb-no-ava | SLS_Event_ID: sls.app.ack.ccm.no_ava_slb
  • SLB instance update failures: An alert is triggered when an SLB instance fails to be updated.
    Rule_Type: event | ACK_CR_Rule_Name: slb-sync-err | SLS_Event_ID: sls.app.ack.ccm.sync_slb_failed
  • SLB instance deletion failures: An alert is triggered when an SLB instance fails to be deleted.
    Rule_Type: event | ACK_CR_Rule_Name: slb-del-err | SLS_Event_ID: sls.app.ack.ccm.del_slb_failed
  • Node deletion failures: An alert is triggered when a node fails to be deleted.
    Rule_Type: event | ACK_CR_Rule_Name: node-del-err | SLS_Event_ID: sls.app.ack.ccm.del_node_failed
  • Node adding failures: An alert is triggered when a node fails to be added to the cluster.
    Rule_Type: event | ACK_CR_Rule_Name: node-add-err | SLS_Event_ID: sls.app.ack.ccm.add_node_failed
  • Route creation failures: An alert is triggered when a cluster fails to create a route in the virtual private cloud (VPC).
    Rule_Type: event | ACK_CR_Rule_Name: route-create-err | SLS_Event_ID: sls.app.ack.ccm.create_route_failed
  • Route update failures: An alert is triggered when a cluster fails to update the routes of the VPC.
    Rule_Type: event | ACK_CR_Rule_Name: route-sync-err | SLS_Event_ID: sls.app.ack.ccm.sync_route_failed
  • Command execution failures in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-run-cmd-err | SLS_Event_ID: sls.app.ack.nlc.run_command_fail
  • Node removal failures in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-empty-cmd | SLS_Event_ID: sls.app.ack.nlc.empty_task_cmd
  • Unimplemented URL mode in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-url-m-unimp | SLS_Event_ID: sls.app.ack.nlc.url_mode_unimpl
  • Unknown repairing operations in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-opt-no-found | SLS_Event_ID: sls.app.ack.nlc.op_not_found
  • Node draining and removal failures in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-des-node-err | SLS_Event_ID: sls.app.ack.nlc.destroy_node_fail
  • Node draining failures in managed node pools: An alert is triggered when a node in a managed node pool fails to be drained.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-drain-node-err | SLS_Event_ID: sls.app.ack.nlc.drain_node_fail
  • ECS restart timeouts in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-restart-ecs-wait | SLS_Event_ID: sls.app.ack.nlc.restart_ecs_wait_fail
  • ECS restart failures in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-restart-ecs-err | SLS_Event_ID: sls.app.ack.nlc.restart_ecs_fail
  • ECS reset failures in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-reset-ecs-err | SLS_Event_ID: sls.app.ack.nlc.reset_ecs_fail
  • Auto-repair task failures in managed node pools: An alert is triggered when a node pool error occurs.
    Rule_Type: event | ACK_CR_Rule_Name: nlc-sel-repair-err | SLS_Event_ID: sls.app.ack.nlc.repair_fail

Alert rule set for network exceptions
  • Invalid Terway resources: An alert is triggered when a Terway resource is invalid. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: terway-invalid-res | SLS_Event_ID: sls.app.ack.terway.invalid_resource
  • IP allocation failures of Terway: An alert is triggered when an IP address fails to be allocated in Terway mode. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: terway-alloc-ip-err | SLS_Event_ID: sls.app.ack.terway.alloc_ip_fail
  • Ingress bandwidth configuration parsing failures: An alert is triggered when the bandwidth configuration of an Ingress fails to be parsed. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: terway-parse-err | SLS_Event_ID: sls.app.ack.terway.parse_fail
  • Network resource allocation failures of Terway: An alert is triggered when a network resource fails to be allocated in Terway mode. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: terway-alloc-res-err | SLS_Event_ID: sls.app.ack.terway.allocate_failure
  • Network resource reclaiming failures of Terway: An alert is triggered when a network resource fails to be reclaimed in Terway mode. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: terway-dispose-err | SLS_Event_ID: sls.app.ack.terway.dispose_failure
  • Terway virtual mode changes: An alert is triggered when the Terway virtual mode is changed.
    Rule_Type: event | ACK_CR_Rule_Name: terway-virt-mod-err | SLS_Event_ID: sls.app.ack.terway.virtual_mode_change
  • Pod IP checks executed by Terway: An alert is triggered when a pod IP address is checked in Terway mode.
    Rule_Type: event | ACK_CR_Rule_Name: terway-ip-check | SLS_Event_ID: sls.app.ack.terway.config_check
  • Ingress configuration reload failures: An alert is triggered when the configuration of an Ingress fails to be reloaded. In this case, check whether the Ingress configuration is valid.
    Rule_Type: event | ACK_CR_Rule_Name: ingress-reload-err | SLS_Event_ID: sls.app.ack.ingress.err_reload_nginx

Alert rule set for storage exceptions
  • Cloud disk size less than 20 GiB: ACK does not allow you to mount a cloud disk of less than 20 GiB. You can check the sizes of the disks that are attached to your cluster.
    Rule_Type: event | ACK_CR_Rule_Name: csi_invalid_size | SLS_Event_ID: sls.app.ack.csi.invalid_disk_size
  • Subscription cloud disks cannot be mounted: ACK does not allow you to mount a subscription cloud disk. You can check the billing methods of the disks that are attached to your cluster.
    Rule_Type: event | ACK_CR_Rule_Name: csi_not_portable | SLS_Event_ID: sls.app.ack.csi.disk_not_portable
  • Mount target unmounting failures because the mount target is in use: An alert is triggered when an unmount operation fails because the mount target is in use.
    Rule_Type: event | ACK_CR_Rule_Name: csi_device_busy | SLS_Event_ID: sls.app.ack.csi.deivce_busy
  • No available cloud disk: An alert is triggered when no cloud disk is available. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: csi_no_ava_disk | SLS_Event_ID: sls.app.ack.csi.no_ava_disk
  • I/O hangs of cloud disks: An alert is triggered when I/O hangs occur on a cloud disk. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: csi_disk_iohang | SLS_Event_ID: sls.app.ack.csi.disk_iohang
  • Slow I/O of cloud disks mounted by using PVCs: An alert is triggered when the I/O of a cloud disk that is mounted by using a persistent volume claim (PVC) is slow. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: csi_latency_high | SLS_Event_ID: sls.app.ack.csi.latency_too_high
  • Disk usage exceeds the threshold: An alert is triggered when the usage of a disk exceeds the specified threshold. You can check the usage of the disks that are mounted to your cluster.
    Rule_Type: event | ACK_CR_Rule_Name: disk_space_press | SLS_Event_ID: sls.app.ack.csi.no_enough_disk_space

Alert rule set for cluster security events
  • High-risk configurations detected in inspections: An alert is triggered when a high-risk configuration is detected during a cluster inspection. In this case, Submit a ticket to contact the ACK technical team.
    Rule_Type: event | ACK_CR_Rule_Name: si-c-a-risk | SLS_Event_ID: sls.app.ack.si.config_audit_high_risk
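The ACK_CR_Rule_Name and SLS_Event_ID values map to the name and expression fields of a rule in the AckAlertRule object that is described in Configure alert rules by using CRDs. For example, a rule entry for the GPU XID error rule could look like the following sketch. The group name cluster-error is a placeholder; reuse the group name that the default template creates in your cluster:

    - name: cluster-error                            # Placeholder group name; reuse the group name from the default template.
      rules:
        - name: gpu-xid-error                        # ACK_CR_Rule_Name from the list above.
          type: event
          expression: sls.app.ack.gpu.xid_error      # SLS_Event_ID from the list above.
          enable: enable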

Enable alert management for an ACK dedicated cluster

Before you can enable the alert management feature for a dedicated cluster, you must grant the required permissions to the cluster.
Note The system automatically grants ACK managed clusters the permissions to access resources that are related to the alerting feature of Log Service.

Grant a dedicated cluster the permissions to access resources that are related to the alerting feature of Log Service and Application Real-Time Monitoring Service (ARMS) Prometheus. For more information, see Use custom policies to grant permissions to a RAM user and Overview.

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. On the Cluster Information page, click the Cluster Resources tab and then click the link to the right of Worker RAM Role to go to the Resource Access Management (RAM) console.
  5. On the RAM Roles page, click the Permissions tab. Select the worker RAM role and click the link in the Policy column.
  6. On the Policy Document tab, click Modify Policy Document. In the Modify Policy Document panel, add the following statement to the Statement array of the policy document. A sketch of the resulting policy document appears after this procedure.
    {
        "Action": [
            "log:*",
            "arms:*",
            "cms:*",
            "cs:UpdateContactGroup"
        ],
        "Resource": [
            "*"
        ],
        "Effect": "Allow"
    }
  7. Click OK.
  8. Check the log to verify that the permissions are granted.
    1. In the left-side navigation pane of the cluster details page in the Container Service for Kubernetes (ACK) console, choose Workloads > Deployments.
    2. Set Namespace to kube-system, find alicloud-monitor-controller in the Deployments list, and then click the link in the Name column.
    3. Click the Logs tab and verify that log content indicating successful authorization is displayed.
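The statement that you add in step 6 becomes an element of the Statement array in the existing policy document. Assuming the standard RAM policy structure, the resulting document is shaped like the following sketch. Any statements that already exist in the worker role policy are kept and are omitted here:

    {
      "Version": "1",
      "Statement": [
        {
          "Action": [
            "log:*",
            "arms:*",
            "cms:*",
            "cs:UpdateContactGroup"
          ],
          "Resource": [
            "*"
          ],
          "Effect": "Allow"
        }
      ]
    }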

Modify the alert threshold for basic cluster resources

If Rule_Type is set to metric-cms for an alert rule, the metrics are synchronized from CloudMonitor. You can modify the alert threshold of the alert rule by using a CRD object. For more information, see Configure alert rules by using CRDs.

In this example, the CPU usage alert rule is modified by using a CRD object. The thresholds parameter is used to specify the alert threshold, the number of times that the CPU usage exceeds the threshold, and the silence period.

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following code is a sample alert rule for basic cluster resources. 
    - name: res-exceptions                                        # The name of the alert rule set. This parameter corresponds to the Group_Name field in the alert rule template. 
      rules:
        - name: node_cpu_util_high                                # The name of the alert rule. 
          type: metric-cms                                        # The type of the alert rule. Valid values: event and metric-cms. 
          expression: cms.host.cpu.utilization                    # The alert rule expression, which is set to the value of Rule_Expression_Id in the default alert rule template. 
          contactGroups:                                          # The contact group associated with the alert rule. You can add contact groups in the ACK console. The contacts created by an Alibaba Cloud account are shared by all clusters within the account. 
          enable: enable                                          # The status of the alert rule. Valid values: enable and disable. 
          thresholds:                                             # The alert threshold. For more information, see Configure alert rules by using CRDs. 
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '1'  
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: '3'  
            - key: CMS_RULE_SILENCE_SEC
              value: '900'  
The parameters in the thresholds section are described as follows:
  • CMS_ESCALATIONS_CRITICAL_Threshold: The alert threshold.
    • unit: The unit of the threshold. Valid values: percent, count, and qps.
    • value: The value of the threshold.
    This parameter is required. If you leave this parameter empty, the modification does not take effect and the alert rule is disabled.
    Default value: the same as the default value specified in the default alert rule template.
  • CMS_ESCALATIONS_CRITICAL_Times: The number of times that the alert threshold must be exceeded before an alert is triggered. This parameter is optional. If you leave this parameter empty, the default value is used.
    Default value: 3.
  • CMS_RULE_SILENCE_SEC: The silence period after an alert is triggered. This parameter is used to prevent frequent alerting. Unit: seconds. This parameter is optional. If you leave this parameter empty, the default value is used.
    Default value: 900.

View alert rules in the corresponding monitoring system

After you enable alert rules in alert management, the status of the enabled alert rules is updated on the Alert Rules tab. You can click Advanced Settings to go to the rule details page of the corresponding monitoring system, such as ARMS, Log Service, or CloudMonitor. You can also go to the console of the corresponding monitoring system and view the details of the alert rules.

  • View the alert rules in Log Service:
    1. Log on to the Log Service console.
    2. Find and click the Log Service project that is used by the cluster. The default name of the Log Service project used by the cluster is k8s-log-<Cluster ID>.
    3. In the left-side navigation pane, click the alert icon. On the Alert Center page, click the Alert Rules/Incidents tab to view the rules that are enabled. Select ACK for Type to filter the default alert rules that are updated by the alert management feature of ACK.
      Note You must make sure that the alert rules of alert management are enabled. Otherwise, the ACK option is not displayed in the Type field.
  • View the alert rules in CloudMonitor:
    1. Log on to the CloudMonitor console.
    2. In the left-side navigation pane, click Application Groups. On the Application Groups page, click the Resource Tag Rules tab.
    3. In the Rule Description column, find the rule whose description contains the resource tag key ack.aliyun.com and a tag value that is equal to the cluster ID.
  • View the alert rules in ARMS Prometheus: Log on to the ARMS console and view the alert rules that are updated by the alert management feature of ACK. The name of each alert rule displayed in the ARMS Prometheus console is in the <Alert rule name>_<Cluster name> format.

FAQ

What do I do if I fail to update an alert rule and the following error message is returned: The Project does not exist : k8s-log-xxx?

Symptom:

The system returns the following error message: The Project does not exist : k8s-log-xxx.

Cause:

You did not create an event center in Log Service for your cluster.

Solution:

  1. Go to the Log Service console. Check whether the number of projects has reached the quota limit. If the quota limit is reached, delete unneeded projects or Submit a ticket to apply for a quota increase. For more information about how to delete a Log Service project, see Manage a project.
  2. Reinstall ack-node-problem-detector.
    1. In the left-side navigation pane of the cluster details page in the Container Service for Kubernetes (ACK) console, choose Applications > Helm.
    2. If you want to reinstall ack-node-problem-detector by using a YAML file, perform the following steps to obtain a copy of the YAML template of ack-node-problem-detector:

      On the Helm page, find ack-node-problem-detector and click Update in the Actions column. After ack-node-problem-detector is updated, click View Details in the Actions column. On the details page of ack-node-problem-detector, select a resource and click View in YAML to copy the YAML content to your on-premises machine. Perform the same operation for each resource to obtain a copy of the YAML template.

    3. On the Helm page, select ack-node-problem-detector and click Delete in the Actions column.
    4. In the left-side navigation pane of the details page, choose Operations > Add-ons.
    5. Click the Log and Monitoring tab, find ack-node-problem-detector, and then click Install.

      In the Note message, confirm the versions of the plug-ins and click OK. After ack-node-problem-detector is installed, the word "Installed" and the version information are displayed in the ack-node-problem-detector section.
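After the reinstallation, you can optionally confirm from the command line that the ack-node-problem-detector workloads are running. The kube-system namespace is assumed here:

    # Verify that the ack-node-problem-detector pods are running after the reinstallation.
    kubectl -n kube-system get pods | grep node-problem-detector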

What do I do if I fail to update an alert rule because no contact group subscribes to the alert rule?

Symptom:

The following error message is returned: this rule have no xxx contact groups reference.

Cause:

No contact group subscribes to the alert rule.

Solution:

  1. Create a contact group and add contacts.
  2. Find the alert rule and click Modify Contacts. In the Modify Contacts panel, add the contact group that you created as the subscriber.

For more information, see Set up alerting.