Container Service for Kubernetes (ACK) allows you to configure alerts to centrally manage exceptions in the cluster and provides various metrics for different scenarios. You can deploy Custom Resource Definitions (CRDs) in a cluster to configure and manage alert rules. This topic describes how to set up alerting and configure alert rules for a registered external Kubernetes cluster.
Prerequisites
- A cluster registration proxy is created and a self-managed Kubernetes cluster is connected to the cluster registration proxy. For more information, see Create a cluster registration proxy and register an on-premises cluster.
- A kubectl client is connected to the self-managed cluster. For more information, see Connect to Kubernetes clusters by using kubectl.
Scenarios
ACK allows you to configure and manage alerts in a centralized manner to monitor various scenarios. The alerting feature is commonly used in the following scenarios:
- Cluster O&M
You can configure alerts to detect exceptions in cluster management, storage, networks, and elastic scaling as early as possible. For example, you can enable Alert Rule Set for Node Exceptions to monitor all nodes or specific nodes in the cluster, Alert Rule Set for Storage Exceptions to monitor changes and exceptions in cluster storage, Alert Rule Set for Network Exceptions to monitor changes and exceptions in cluster networks, and Alert Rule Set for O&M Exceptions to monitor changes and exceptions in cluster management operations.
- Application development
You can configure alerts to detect exceptions and abnormal metrics of applications running in the cluster as early as possible. For example, you can detect exceptions of pod replicas and check whether the CPU or memory usage of a Deployment exceeds its threshold. You can use the default alert template to quickly set up notifications about pod replica exceptions. For instance, you can configure and enable Alert Rule Set for Pod Exceptions to monitor exceptions in the pods of your application.
- Application management
To monitor the issues that occur throughout the lifecycle of an application, we recommend that you pay attention to application health, capacity planning, cluster stability, exceptions, and errors. You can configure and enable Alert Rule Set for Critical Events to monitor warnings and errors in the cluster. You can configure and enable Alert Rule Set for Resource Exceptions to monitor resource usage in the cluster and optimize capacity planning.
- Multi-cluster management
When you manage multiple clusters, you may find it a complex task to configure and synchronize alert rules across the clusters. ACK allows you to deploy CRDs in the cluster to manage alert rules. You can configure the same CRDs to conveniently synchronize alert rules across multiple clusters.
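As a rough illustration of synchronizing one rule definition across clusters, the sketch below builds a `kubectl apply` command for each kubeconfig context. The context names and the manifest file name are hypothetical placeholders, and the sketch assumes that every cluster is reachable through a context in the local kubeconfig file. It only constructs the commands and leaves execution to the caller.

```python
import shlex


def build_apply_commands(contexts, manifest_path):
    """Return one `kubectl apply` argument list per kubeconfig context."""
    commands = []
    for context in contexts:
        # --context selects the cluster; apply -f submits the manifest file.
        commands.append(
            ["kubectl", "--context", context, "apply", "-f", manifest_path]
        )
    return commands


# Hypothetical context names and file name, for illustration only.
commands = build_apply_commands(
    ["registered-cluster-a", "registered-cluster-b"], "ack-alert-rules.yaml"
)
for command in commands:
    print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` once the printed commands look right for your environment.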
Configure the cloud monitoring component in the registered external cluster
Step 1: Configure RAM permissions for the component
Before you install the component in an external cluster, you must set an AccessKey pair that grants the external cluster the permissions to access Alibaba Cloud resources. To do this, first create a Resource Access Management (RAM) user and grant the RAM user the permissions to access Alibaba Cloud resources, and then set the AccessKey pair.
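One common way to hand credentials to a component running in a cluster is a Kubernetes Secret. The sketch below builds such a manifest for an AccessKey pair. It is an assumption that the component reads the pair from a Secret: the Secret name `example-addon-secret`, the data key names, and the credential values are all hypothetical, so check the component's installation instructions for the names it actually expects.

```python
import base64
import json


def build_access_key_secret(name, namespace, access_key_id, access_key_secret):
    """Build a Kubernetes Secret manifest that holds an AccessKey pair."""

    def encode(value):
        # Values under `data` in a Secret must be base64-encoded strings.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            # Key names are assumptions; use the names the component expects.
            "access-key-id": encode(access_key_id),
            "access-key-secret": encode(access_key_secret),
        },
    }


# Placeholder values only; never commit a real AccessKey pair.
secret = build_access_key_secret(
    "example-addon-secret", "kube-system", "EXAMPLE_ID", "EXAMPLE_SECRET"
)
print(json.dumps(secret, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed manifest can be saved to a file and applied directly.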
Step 2: Install and upgrade the component
The console automatically checks whether the alerting configuration meets the requirements and guides you to activate, install, or upgrade the component.
Set up alerting
Step 1: Enable the default alert rules
Step 2: Configure alert rules
Configure alert rules by using CRDs
When the alerting feature is enabled, the system automatically creates a resource object of the AckAlertRule type in the kube-system namespace. This resource object contains the default alert template. You can use this resource object to configure alert rule sets.
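To make the idea of editing this resource object concrete, the sketch below switches a single rule on or off in an in-memory copy of an AckAlertRule object. The object shape shown here (`spec.groups[].rules[]` with `name`, `expression`, and `enable` fields, the group name `pod-exceptions`, and the object name `default`) is an assumption made for illustration. Inspect the object in your cluster for the real field names before adapting it.

```python
import copy


def set_rule_enabled(resource, rule_name, enabled):
    """Return a copy of the resource with one rule switched on or off."""
    updated = copy.deepcopy(resource)
    for group in updated["spec"]["groups"]:
        for rule in group["rules"]:
            if rule["name"] == rule_name:
                rule["enable"] = "enable" if enabled else "disable"
                return updated
    raise KeyError(f"rule not found: {rule_name}")


# Assumed shape of an AckAlertRule object; the field names are illustrative.
resource = {
    "kind": "AckAlertRule",
    "metadata": {"name": "default", "namespace": "kube-system"},
    "spec": {
        "groups": [
            {
                "name": "pod-exceptions",
                "rules": [
                    {"name": "pod-oom", "expression": "sls.app.ack.pod.oom", "enable": "enable"},
                    {"name": "pod-failed", "expression": "sls.app.ack.pod.failed", "enable": "enable"},
                ],
            }
        ]
    },
}

updated = set_rule_enabled(resource, "pod-failed", False)
print(updated["spec"]["groups"][0]["rules"][1]["enable"])
```

The function works on a deep copy so that the original object stays available for comparison before any change is written back to the cluster.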
The default alert template is created in either of the following cases:
- The default alert rules are enabled.
- You go to the Alert Rules tab for the first time and default alert rules are not enabled.
Alert Rule Set | Alert Rule | ACK_CR_Rule_Name | SLS_Event_ID |
---|---|---|---|
Alert Rule Set for Critical Events | Errors | error-event | sls.app.ack.error |
Alert Rule Set for Critical Events | Warnings | warn-event | sls.app.ack.warn |
Alert Rule Set for Node Exceptions | Docker process exceptions on nodes | docker-hang | sls.app.ack.docker.hang |
Alert Rule Set for Node Exceptions | Evictions | eviction-event | sls.app.ack.eviction |
Alert Rule Set for Node Exceptions | GPU XID errors | gpu-xid-error | sls.app.ack.gpu.xid_error |
Alert Rule Set for Node Exceptions | Node restarts | node-restart | sls.app.ack.node.restart |
Alert Rule Set for Node Exceptions | Network Time Protocol (NTP) service failures on nodes | node-ntp-down | sls.app.ack.ntp.down |
Alert Rule Set for Node Exceptions | Pod Lifecycle Event Generator (PLEG) errors on nodes | node-pleg-error | sls.app.ack.node.pleg_error |
Alert Rule Set for Node Exceptions | Process errors on nodes | ps-hang | sls.app.ack.ps.hang |
Alert Rule Set for Resource Exceptions | Excess file handles on nodes | node-fd-pressure | sls.app.ack.node.fd_pressure |
Alert Rule Set for Resource Exceptions | Insufficient node disk space | node-disk-pressure | sls.app.ack.node.disk_pressure |
Alert Rule Set for Resource Exceptions | Excessive processes on nodes | node-pid-pressure | sls.app.ack.node.pid_pressure |
Alert Rule Set for Resource Exceptions | Insufficient node resources for scheduling | node-res-insufficient | sls.app.ack.resource.insufficient |
Alert Rule Set for Resource Exceptions | Insufficient node IP addresses | node-ip-pressure | sls.app.ack.ip.not_enough |
Alert Rule Set for Pod Exceptions | Pod out-of-memory (OOM) errors | pod-oom | sls.app.ack.pod.oom |
Alert Rule Set for Pod Exceptions | Pod restart failures | pod-failed | sls.app.ack.pod.failed |
Alert Rule Set for Pod Exceptions | Image pull failures | image-pull-back-off | sls.app.ack.image.pull_back_off |
Alert Rule Set for O&M Exceptions | No available Server Load Balancer (SLB) instance | slb-no-ava | sls.app.ack.ccm.no_ava_slb |
Alert Rule Set for O&M Exceptions | SLB instance update failures | slb-sync-err | sls.app.ack.ccm.sync_slb_failed |
Alert Rule Set for O&M Exceptions | SLB instance deletion failures | slb-del-err | sls.app.ack.ccm.del_slb_failed |
Alert Rule Set for O&M Exceptions | Node deletion failures | node-del-err | sls.app.ack.ccm.del_node_failed |
Alert Rule Set for O&M Exceptions | Node addition failures | node-add-err | sls.app.ack.ccm.add_node_failed |
Alert Rule Set for O&M Exceptions | Route creation failures | route-create-err | sls.app.ack.ccm.create_route_failed |
Alert Rule Set for O&M Exceptions | Route update failures | route-sync-err | sls.app.ack.ccm.sync_route_failed |
Alert Rule Set for O&M Exceptions | High-risk configurations detected in inspections | si-c-a-risk | sls.app.ack.si.config_audit_high_risk |
Alert Rule Set for O&M Exceptions | Command execution failures in managed node pools | nlc-run-cmd-err | sls.app.ack.nlc.run_command_fail |
Alert Rule Set for O&M Exceptions | No command provided in managed node pools | nlc-empty-cmd | sls.app.ack.nlc.empty_task_cmd |
Alert Rule Set for O&M Exceptions | URL mode not implemented in managed node pools | nlc-url-m-unimp | sls.app.ack.nlc.url_mode_unimpl |
Alert Rule Set for O&M Exceptions | Unknown repair operations in managed node pools | nlc-opt-no-found | sls.app.ack.nlc.op_not_found |
Alert Rule Set for O&M Exceptions | Node draining and removal failures in managed node pools | nlc-des-node-err | sls.app.ack.nlc.destroy_node_fail |
Alert Rule Set for O&M Exceptions | Node draining failures in managed node pools | nlc-drain-node-err | sls.app.ack.nlc.drain_node_fail |
Alert Rule Set for O&M Exceptions | Elastic Compute Service (ECS) restart timeouts in managed node pools | nlc-restart-ecs-wait | sls.app.ack.nlc.restart_ecs_wait_fail |
Alert Rule Set for O&M Exceptions | ECS restart failures in managed node pools | nlc-restart-ecs-err | sls.app.ack.nlc.restart_ecs_fail |
Alert Rule Set for O&M Exceptions | ECS reset failures in managed node pools | nlc-reset-ecs-err | sls.app.ack.nlc.reset_ecs_fail |
Alert Rule Set for O&M Exceptions | Auto-repair task failures in managed node pools | nlc-sel-repair-err | sls.app.ack.nlc.repair_fail |
Alert Rule Set for Network Exceptions | Invalid Terway resources | terway-invalid-res | sls.app.ack.terway.invalid_resource |
Alert Rule Set for Network Exceptions | IP allocation failures of Terway | terway-alloc-ip-err | sls.app.ack.terway.alloc_ip_fail |
Alert Rule Set for Network Exceptions | Ingress bandwidth configuration parsing failures | terway-parse-err | sls.app.ack.terway.parse_fail |
Alert Rule Set for Network Exceptions | Network resource allocation failures of Terway | terway-alloc-res-err | sls.app.ack.terway.allocate_failure |
Alert Rule Set for Network Exceptions | Network resource reclaim failures of Terway | terway-dispose-err | sls.app.ack.terway.dispose_failure |
Alert Rule Set for Network Exceptions | Terway virtual mode changes | terway-virt-mod-err | sls.app.ack.terway.virtual_mode_change |
Alert Rule Set for Network Exceptions | Pod IP checks executed by Terway | terway-ip-check | sls.app.ack.terway.config_check |
Alert Rule Set for Network Exceptions | Ingress configuration reload failures | ingress-reload-err | sls.app.ack.ingress.err_reload_nginx |
Alert Rule Set for Storage Exceptions | Disk size is less than 20 GiB | csi_invalid_size | sls.app.ack.csi.invalid_disk_size |
Alert Rule Set for Storage Exceptions | Subscription disks cannot be mounted | csi_not_portable | sls.app.ack.csi.disk_not_portable |
Alert Rule Set for Storage Exceptions | Unmount failures occur because the mount target is in use | csi_device_busy | sls.app.ack.csi.deivce_busy |
Alert Rule Set for Storage Exceptions | No disks are available | csi_no_ava_disk | sls.app.ack.csi.no_ava_disk |
Alert Rule Set for Storage Exceptions | I/O hangs of cloud disks | csi_disk_iohang | sls.app.ack.csi.disk_iohang |
Alert Rule Set for Storage Exceptions | Slow I/O of the underlying disks of persistent volume claims (PVCs) | csi_latency_high | sls.app.ack.csi.latency_too_high |
Alert Rule Set for Storage Exceptions | Disk usage exceeds the threshold | disk_space_press | sls.app.ack.csi.no_enough_disk_space |
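When you work with these identifiers in scripts, it can help to keep the mapping between ACK_CR_Rule_Name and SLS_Event_ID in one place. The sketch below groups a few rows copied from the table by rule set and looks up a rule name from an event ID. The helper names `group_event_ids` and `find_rule_name` are hypothetical, and only four rows are included to keep the example short.

```python
from collections import defaultdict

# A few rows copied from the table: (rule set, ACK_CR_Rule_Name, SLS_Event_ID).
ROWS = [
    ("Alert Rule Set for Critical Events", "error-event", "sls.app.ack.error"),
    ("Alert Rule Set for Critical Events", "warn-event", "sls.app.ack.warn"),
    ("Alert Rule Set for Pod Exceptions", "pod-oom", "sls.app.ack.pod.oom"),
    ("Alert Rule Set for Pod Exceptions", "pod-failed", "sls.app.ack.pod.failed"),
]


def group_event_ids(rows):
    """Map each rule set to a {rule name: event ID} dictionary."""
    grouped = defaultdict(dict)
    for rule_set, rule_name, event_id in rows:
        grouped[rule_set][rule_name] = event_id
    return dict(grouped)


def find_rule_name(rows, event_id):
    """Return the ACK_CR_Rule_Name for an SLS event ID, or None."""
    for _, rule_name, candidate in rows:
        if candidate == event_id:
            return rule_name
    return None


grouped = group_event_ids(ROWS)
print(sorted(grouped["Alert Rule Set for Pod Exceptions"]))
print(find_rule_name(ROWS, "sls.app.ack.warn"))
```

A reverse lookup like `find_rule_name` is useful when an alert notification carries only the event ID and you need to find the matching rule in the CRD.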