
Container Service for Kubernetes:Configuring alert rules using Prometheus

Last Updated: Mar 26, 2026

ACK clusters support both Alibaba Cloud Prometheus and open source Prometheus. If the pre-configured metrics don't cover your monitoring needs, write custom PromQL expressions to create alert rules for cluster nodes, hosts, container replicas, and workloads. An alert rule fires when a metric crosses a threshold or a condition is met.

Prerequisites

Before you begin, ensure that you have:

Configure alert rules with custom PromQL

Both Alibaba Cloud Prometheus and open source Prometheus support custom PromQL-based alert rules. When an alert rule's conditions are met, the system generates an alert event and sends a notification.

Alibaba Cloud Prometheus

To create a Prometheus alert rule using custom PromQL, see Create a Prometheus alert rule.

Open source Prometheus

  1. Configure an alert notification policy. Open source Prometheus supports webhooks, DingTalk robots, and email. Set the notification method by configuring the receiver parameter in the ack-prometheus-operator application. For details, see Alerting configuration.

  2. Create an alert rule. Deploy a PrometheusRule custom resource in the cluster to define alert rules. For reference, see Deploying Prometheus rules. The following example triggers an alert when node CPU usage exceeds 90% over a 2-minute window. The expr field specifies the PromQL expression and trigger condition for each alert rule.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        # Labels must match ruleSelector.matchLabels in the Prometheus CRD.
        prometheus: example
        role: alert-rules
      name: prometheus-example-rules
    spec:
      groups:
      - name: example.rules
        rules:
        - alert: ExampleAlert
          # expr: PromQL query and trigger condition.
          # Refer to the PromQL configuration column in the alert rule tables below.
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
  3. Verify that the alert rule is active.

    1. Forward the Prometheus service to port 9090 on your local machine:

        kubectl port-forward svc/ack-prometheus-operator-prometheus 9090:9090 -n monitoring

    2. Open localhost:9090 in your browser.

    3. At the top of the page, choose Status > Rules. If the target alert rule appears on the Rules page, the rule is active.
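
Beyond the UI, the same check can be scripted against the Prometheus HTTP API (GET /api/v1/rules on the forwarded port). The sketch below parses a sample payload of the shape that endpoint returns; the find_rule helper is illustrative, not part of any ACK tooling:

```python
import json

# Sample payload in the shape returned by GET /api/v1/rules
# (fetch the real one with: curl -s http://localhost:9090/api/v1/rules).
# The group and rule names mirror the PrometheusRule example above.
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "groups": [
      {
        "name": "example.rules",
        "rules": [
          {"name": "ExampleAlert", "type": "alerting", "state": "inactive"}
        ]
      }
    ]
  }
}
""")

def find_rule(payload, rule_name):
    """Return the rule entry for rule_name, or None if it is not loaded."""
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("name") == rule_name:
                return rule
    return None

rule = find_rule(sample_response, "ExampleAlert")
print(rule is not None)   # True: the rule is loaded
print(rule["state"])      # "inactive" until the expr condition is met
```

If find_rule returns None for a rule you just deployed, check that the PrometheusRule labels match ruleSelector.matchLabels in the Prometheus CRD, as noted in the example manifest.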

Alert rule reference

Based on operations and maintenance (O&M) experience across clusters and applications, ACK provides the following recommended Prometheus alert rule configurations. These rules cover cluster stability, node anomalies, node resource usage, container replica anomalies, workload anomalies, storage exceptions, and network exceptions.

Alert rules use the following severity levels:

  • Critical: The issue affects the cluster, application, or your business. Requires immediate attention.

  • Warning: The issue affects the cluster, application, or your business. Investigate as soon as possible.

  • Normal: The alert relates to an important feature change.

The Rule description entries refer to the Alert Rules tab on the Alerts page as the operation entry point. To update alert rules, log on to the Container Service for Kubernetes (ACK) console, click the target cluster name in the Clusters list, choose Operations > Alerts in the left navigation pane, and then click the Alert Rules tab.

Abnormal container replicas

Abnormal pod status

  • Severity: Critical
  • PromQL: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0
  • Threshold: > 0
  • Window: 5 min
  • Rule description: Fires if a pod has an abnormal status (Pending, Unknown, or Failed) within the last 5 minutes. In the operation entry point, click Alert Rule Set for Pod Exceptions and configure the Pod anomaly alert rule. See Manage alerts in ACK.
  • Common troubleshooting: See Troubleshoot pod exceptions.

Pod startup failed

  • Severity: Critical
  • PromQL: sum_over_time(increase(kube_pod_container_status_restarts_total{}[1m])[5m:1m]) > 3
  • Threshold: > 3 restarts
  • Window: 5 min
  • Rule description: Fires if a pod fails to start more than 3 times within the last 5 minutes. In the operation entry point, click Alert Rule Set for Pod Exceptions and configure the Pod startup failures alert rule. See Manage alerts in ACK.
  • Common troubleshooting: See Troubleshoot pod exceptions.

More than 1,000 pods failed to be scheduled

  • Severity: Critical
  • PromQL: sum(sum(max_over_time(kube_pod_status_phase{phase=~"Pending"}[5m])) by (pod)) > 1000
  • Threshold: > 1,000 pods
  • Window: 5 min
  • Rule description: Fires when more than 1,000 pods are in the Pending state due to scheduling failures within the last 5 minutes. This may indicate excessive scheduling pressure in a large-scale cluster.
  • Common troubleshooting: ACK managed cluster Pro Edition provides enhanced scheduling features and an SLA (service-level agreement). See Overview of ACK managed cluster Pro Edition.

Frequent container CPU throttling

  • Severity: Warning
  • PromQL: rate(container_cpu_cfs_throttled_seconds_total[3m]) * 100 > 25
  • Threshold: > 25% throttled time
  • Window: 3 min
  • Rule description: Fires when throttled CPU time exceeds 25% of total CPU time within the last 3 minutes. CPU throttling reduces the time slices allocated to processes, which can increase process runtime and slow application logic.
  • Common troubleshooting: Check whether the CPU resource limit for the pod is set too low. Use the CPU Burst policy to reduce throttling; see Enable the CPU Burst policy. On multi-core nodes, use CPU topology-aware scheduling to maximize fragmented CPU resources; see Enable CPU topology-aware scheduling.

Pod CPU usage > 85% of limit

  • Severity: Warning
  • PromQL: (sum(irate(container_cpu_usage_seconds_total{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}[1m])) by (namespace,pod) / sum(container_spec_cpu_quota{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}/100000) by (namespace,pod) * 100 <= 100 or on() vector(0)) >= 85
  • Threshold: >= 85% of the pod limit
  • Window: 1 min
  • Rule description: Fires when pod CPU usage exceeds 85% of its limit in a specified namespace or for a specified pod. Has no effect if the pod has no limit configured. The 85% threshold is the recommended default; adjust it as needed. To filter by pod or namespace, replace pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*" with actual values; delete the filter to query all pods.
  • Common troubleshooting: High CPU usage can cause throttling and reduce time slice allocation. Check whether the CPU resource limit is too low. See Enable the CPU Burst policy and Enable CPU topology-aware scheduling.

Pod memory usage > 85% of limit

  • Severity: Warning
  • PromQL: ((sum(container_memory_working_set_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod,namespace) / sum(container_spec_memory_limit_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod, namespace) * 100) <= 100 or on() vector(0)) >= 85
  • Threshold: >= 85% of the pod limit
  • Rule description: Fires when pod memory usage exceeds 85% of its limit. Has no effect if the pod has no limit configured. The 85% threshold is the recommended default; adjust it as needed.
  • Common troubleshooting: High memory usage can cause the pod to be terminated by the OOM (out-of-memory) killer, leading to a restart. Check whether the memory resource limit is too low. Use the resource profiling feature to set an appropriate memory limit; see Resource profiling.

Abnormal workloads

Deployment replica mismatch

  • Severity: Critical
  • PromQL: kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{}
  • Threshold: any mismatch
  • Rule description: Fires when the number of available replicas for a Deployment does not match the desired count. In the operation entry point, click Alert Rule Set for Workload Exceptions and set the Deployment pod anomaly alert rule. See Manage alerts in ACK.
  • Common troubleshooting: See Troubleshoot pod exceptions.

DaemonSet replica mismatch

  • Severity: Critical
  • PromQL: ((100 - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{} * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{})) > 0
  • Threshold: > 0
  • Rule description: Fires when the number of available replicas for a DaemonSet does not match the desired count. In the operation entry point, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod anomaly alert rule. See Manage alerts in ACK.
  • Common troubleshooting: See Troubleshoot pod exceptions.

DaemonSet scheduling error

  • Severity: Critical
  • PromQL: kube_daemonset_status_number_misscheduled{} > 0
  • Threshold: > 0
  • Rule description: Fires when a DaemonSet replica is scheduled to a node it should not run on. In the operation entry point, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod scheduling errors alert rule. See Manage alerts in ACK.
  • Common troubleshooting: See Troubleshoot pod exceptions.

Job failed

  • Severity: Critical
  • PromQL: kube_job_status_failed{} > 0
  • Threshold: > 0
  • Rule description: Fires when a Job fails to execute. In the operation entry point, click Alert Rule Set for Workload Exceptions and set the Job execution failures alert rule. See Manage alerts in ACK.
  • Common troubleshooting: Check the logs of the failed pod under the Job to get error details. See Troubleshoot pod exceptions.

Storage exceptions

PersistentVolume (PV) status abnormal

  • Severity: Critical
  • PromQL: kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0
  • Threshold: > 0
  • Rule description: Fires when a PV enters a Failed or Pending state. In the operation entry point, click Alert Rule Set for Storage Exceptions and set the PV anomaly alert rule. See Manage alerts in ACK.
  • Common troubleshooting: See the disk mounting section in FAQ about disk PVs.

Host disk usage > 85%

  • Severity: Critical
  • PromQL: (100 - node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) >= 85
  • Threshold: >= 85%
  • Rule description: Fires when the free space on a disk block device of a node falls below 15% (usage reaches 85% or more). In the operation entry point, click Alert Rule Set for Resource Exceptions and set the Node - Disk usage >= 85% alert rule. See Manage alerts in ACK.
  • Common troubleshooting: Scale out the node or expand its disk. See FAQ about disk PVs.

Abnormal node status

Node NotReady for 3 minutes

  • Severity: Critical
  • PromQL: (sum(max_over_time(kube_node_status_condition{condition="Ready",status="true"}[3m]) <= 0) by (node)) or (absent(kube_node_status_condition{condition="Ready",status="true"})) > 0
  • Threshold: > 0
  • Window: 3 min
  • Rule description: Fires when a cluster node remains in the NotReady status for 3 minutes. In the operation entry point, click Alert Rule Set for Node Exceptions and set the Node changes to the unschedulable state alert rule. See Manage alerts in ACK.
  • Common troubleshooting: Determine whether the NotReady status is expected (for example, node replacement or planned maintenance). If not expected, check whether application pods are affected and evict them if necessary. Check node conditions for common causes such as memory pressure or a full disk.

Abnormal host resource usage

Host resource metrics measure the resources of the physical or virtual machine on which the node runs. Usage = (resource usage of all processes on the host) / (maximum host capacity).
Host memory usage > 85%

  • Severity: Warning
  • PromQL: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 85
  • Threshold: >= 85%
  • Rule description: Fires when host memory usage exceeds 85%. In the operation entry point, click Alert Rule Set for Resource Exceptions and configure the Node - Memory usage >= 85% alert rule. See Manage alerts in ACK. The 85% threshold is the recommended default; adjust it as needed. Note: The rules in the ACK alert configuration are provided by CloudMonitor, and their metrics are consistent with the corresponding Prometheus rule metrics.
  • Common troubleshooting: Release resources: use the cost analysis feature to check whether pods are occupying schedulable resources unreasonably (see Enable the cost analysis feature), and use resource profiling to right-size memory requests (see Resource profiling). Plan capacity and scale out nodes; see Scale nodes in an ACK cluster.

Host memory usage > 90%

  • Severity: Critical
  • PromQL: (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 90
  • Threshold: >= 90%
  • Rule description: Fires when host memory usage exceeds 90%.
  • Common troubleshooting: Release resources: use the cost analysis feature (see Enable the cost analysis feature) and resource profiling (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.

Host CPU usage > 85%

  • Severity: Warning
  • PromQL: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 85
  • Threshold: >= 85%
  • Window: 2 min
  • Rule description: Fires when host CPU usage exceeds 85%. In the operation entry point, click Alert Rule Set for Resource Exceptions and configure the Node - CPU usage >= 85% alert rule. See Manage alerts in ACK. The 85% threshold is the recommended default; adjust it as needed. Note: The rule in the ACK alerting configuration uses CloudMonitor ECS monitoring metrics, which are equivalent to this Prometheus rule's metrics.
  • Common troubleshooting: Release resources: use the cost analysis feature (see Enable the cost analysis feature) and resource profiling (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.

Host CPU usage > 90%

  • Severity: Critical
  • PromQL: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 90
  • Threshold: >= 90%
  • Window: 2 min
  • Rule description: Fires when host CPU usage exceeds 90%.
  • Common troubleshooting: Release resources: use the cost analysis feature (see Enable the cost analysis feature) and resource profiling (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.
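
All four host-level rules above share the same arithmetic: usage = 100 - available / total * 100, compared against the 85% and 90% thresholds. A quick worked example of the memory variant:

```python
def host_usage_percent(available_bytes: float, total_bytes: float) -> float:
    """Usage formula shared by the host-level rules above:
    100 - available / total * 100."""
    return 100.0 - available_bytes / total_bytes * 100.0

# Example: 2 GiB available out of 16 GiB total -> 87.5% used.
usage = host_usage_percent(2 * 1024**3, 16 * 1024**3)
print(usage)        # 87.5
print(usage >= 85)  # True: the Warning rule (>= 85%) would fire
print(usage >= 90)  # False: the Critical rule (>= 90%) would not
```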

Abnormal node resources

Node resource metrics measure container resource consumption relative to the node's allocatable capacity, not the physical machine capacity.

  • Consumed resources (numerator): Total resources used by all containers on the node, including working set memory, page cache allocated to containers, and more.
  • Allocatable resources (denominator): Resources available for containers after subtracting resources reserved for the node (container engine layer). See Node resource reservation policy.

Pod scheduling is based on resource requests, not actual usage.

Node CPU usage > 85%

  • Severity: Warning
  • PromQL: sum(irate(container_cpu_usage_seconds_total{pod!=""}[1m])) by (node) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85
  • Threshold: >= 85%
  • Window: 1 min
  • Rule description: Fires when node CPU usage exceeds 85% of allocatable resources. Formula: node resource usage / total allocatable resources on the node.
  • Common troubleshooting: Release resources: use the cost analysis feature (see Enable the cost analysis feature) and resource profiling to distribute pods across nodes (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.

Node CPU resource allocation rate > 85%

  • Severity: Normal
  • PromQL: (sum(sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready{condition="true"}) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85
  • Threshold: >= 85%
  • Rule description: Fires when the CPU resource allocation rate exceeds 85% of allocatable resources. Formula: total resource requests of scheduled pods / total allocatable resources on the node. The node has insufficient resources to schedule additional pods.
  • Common troubleshooting: Check for resource waste where actual usage is far below requested resources: use the cost analysis feature (see Enable the cost analysis feature) and resource profiling (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.

Node CPU oversold rate > 300%

  • Severity: Warning
  • PromQL: (sum(sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready{condition="true"}) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 300
  • Threshold: >= 300%
  • Rule description: Fires when the CPU oversold rate exceeds 300% of allocatable resources. Formula: total resource limits of scheduled pods / total allocatable resources on the node. The 300% threshold is the recommended default; adjust it as needed. Total CPU limits on the node far exceed allocatable resources. During traffic peaks, resource contention and throttling can slow process responses.
  • Common troubleshooting: Use the cost analysis feature (see Enable the cost analysis feature) and resource profiling to right-size CPU requests and limits (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.

Node memory usage > 85%

  • Severity: Warning
  • PromQL: sum(container_memory_working_set_bytes{pod!=""}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85
  • Threshold: >= 85%
  • Rule description: Fires when node memory usage exceeds 85% of allocatable resources. Formula: node resource usage / total allocatable resources on the node.
  • Common troubleshooting: Release resources: use the cost analysis feature (see Enable the cost analysis feature) and resource profiling to distribute pods across nodes (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.

Node memory resource allocation rate > 85%

  • Severity: Normal
  • PromQL: (sum(sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready{condition="true"}) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85
  • Threshold: >= 85%
  • Rule description: Fires when the memory resource allocation rate exceeds 85% of allocatable resources. Formula: total resource requests of scheduled pods / total allocatable resources on the node. The node has insufficient resources to schedule additional pods.
  • Common troubleshooting: Check for resource waste where actual usage is far below requested resources: use the cost analysis feature (see Enable the cost analysis feature) and resource profiling (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.

Node memory oversold rate > 300%

  • Severity: Warning
  • PromQL: (sum(sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready{condition="true"}) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 300
  • Threshold: >= 300%
  • Rule description: Fires when the memory oversold rate exceeds 300% of allocatable resources. Formula: total resource limits of scheduled pods / total allocatable resources on the node. The 300% threshold is the recommended default; adjust it as needed. Total memory limits on the node far exceed allocatable resources. During traffic peaks, memory can reach the node limit, triggering a node OOM event that kills processes and disrupts workloads.
  • Common troubleshooting: Use the cost analysis feature (see Enable the cost analysis feature) and resource profiling to right-size memory requests and limits (see Resource profiling). Scale out nodes; see Scale nodes in an ACK cluster.
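
The allocation-rate and oversold-rate rules above differ only in their numerator (requests versus limits) over the same denominator (node allocatable capacity). A worked example with illustrative numbers:

```python
def percent_of_allocatable(total: float, allocatable: float) -> float:
    """Shared shape of the node-level rules above:
    sum over scheduled pods / node allocatable * 100."""
    return total / allocatable * 100.0

# Illustrative numbers: a node with 4 allocatable CPU cores where the
# scheduled pods request 3.6 cores in total and set limits of 14 cores.
allocation_rate = percent_of_allocatable(3.6, 4.0)  # requests / allocatable
oversold_rate = percent_of_allocatable(14.0, 4.0)   # limits / allocatable

print(allocation_rate)  # 90.0 -> the allocation-rate rule (>= 85%) fires
print(oversold_rate)    # 350.0 -> the oversold-rate rule (>= 300%) fires
```

A high allocation rate with low actual usage points to inflated requests, which is exactly the waste pattern the troubleshooting guidance suggests checking with cost analysis and resource profiling.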

Network exceptions

CoreDNS request count drops to zero

  • Severity: Critical
  • PromQL: (sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0)
  • Threshold: <= 0
  • Window: 1 min
  • Rule description: Detectable only in ACK managed clusters (Pro and Basic editions).
  • Common troubleshooting: Check whether the CoreDNS pods in the cluster are running normally.

CoreDNS panic exception

  • Severity: Critical
  • PromQL: sum(rate(coredns_panic_count_total{}[3m])) > 0
  • Threshold: > 0
  • Window: 3 min
  • Rule description: Detectable only in ACK managed clusters (Pro and Basic editions).
  • Common troubleshooting: Check whether the CoreDNS pods in the cluster are running normally.

Ingress controller certificate expiring within 14 days

  • Severity: Warning
  • PromQL: ((nginx_ingress_controller_ssl_expire_time_seconds - time()) / 24 / 3600) < 14
  • Threshold: < 14 days
  • Rule description: Requires the ACK Ingress controller component to be installed with the Ingress feature enabled.
  • Common troubleshooting: Reissue the Ingress controller certificate.
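
The certificate rule above converts an expiry timestamp into remaining days before comparing against 14. The same arithmetic in a small sketch (the timestamps are illustrative):

```python
def days_until_expiry(expire_time_seconds, now):
    """Mirror of the PromQL above:
    (nginx_ingress_controller_ssl_expire_time_seconds - time()) / 24 / 3600."""
    return (expire_time_seconds - now) / 24 / 3600

# Illustrative: a certificate expiring 10 days from a fixed "now".
now = 1_700_000_000
expire = now + 10 * 24 * 3600
remaining = days_until_expiry(expire, now)
print(remaining)       # 10.0
print(remaining < 14)  # True: the Warning rule (< 14 days) would fire
```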

Auto Scaling exceptions

HPA replica count at maximum

  • Severity: Warning
  • PromQL: max(kube_horizontalpodautoscaler_spec_max_replicas) by (namespace, horizontalpodautoscaler) - max(kube_horizontalpodautoscaler_status_current_replicas) by (namespace, horizontalpodautoscaler) <= 0
  • Threshold: <= 0 (difference)
  • Rule description: Fires when the Horizontal Pod Autoscaler (HPA) current replica count reaches the configured maximum. Note: Enable horizontalpodautoscaler-related metrics in Alibaba Cloud Prometheus first; these metrics are disabled by default and are free of charge.
  • Common troubleshooting: Check whether the HPA policy meets your expectations. If the workload remains high, increase the maxReplicas value or optimize application performance.
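
The HPA rule reduces to a simple comparison of the configured maximum against the current replica count; a minimal sketch:

```python
def hpa_at_max(max_replicas: int, current_replicas: int) -> bool:
    """Mirror of the PromQL above: fires when max - current <= 0."""
    return max_replicas - current_replicas <= 0

print(hpa_at_max(10, 10))  # True: scaled out to the configured maximum
print(hpa_at_max(10, 4))   # False: scaling headroom remains
```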

What's next