Container Service for Kubernetes: Best practices for configuring alert rules in Prometheus

Last Updated: Jul 26, 2023

Container Service for Kubernetes (ACK) supports open source Prometheus and Alibaba Cloud Prometheus Service. This topic describes how to configure alert rules when you use either of them to monitor ACK clusters, and lists the alert rules that ACK provides.

Configure alert rules in Prometheus

Use custom PromQL statements to configure alert rules in Prometheus Service

For more information about how to use custom PromQL statements to configure alert rules in Prometheus Service, see Create an alert rule for a Prometheus instance.

Use custom PromQL statements to configure alert rules in open source Prometheus

  1. Configure alert notification policies.

    Open source Prometheus supports various alert notification methods, including webhook URLs, DingTalk chatbots, and emails. To set the notification method, configure the receiver parameter in the configurations of the ack-prometheus-operator application. For more information, see Use open source Prometheus to monitor an ACK cluster.

  2. Create an alert rule.

    Create a PrometheusRule custom resource (CR) to define an alert rule. PrometheusRule is a CustomResourceDefinition (CRD) provided by the Prometheus Operator. For more information, see Deploying Prometheus Rules.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        # The labels must match the ruleSelector -> matchLabels parameter of the Prometheus custom resource.
        prometheus: example
        role: alert-rules
      name: prometheus-example-rules
    spec:
      groups:
      - name: example.rules
        rules:
        - alert: ExampleAlert
          # The expr parameter is the PromQL expression that defines the query and the trigger condition. For examples, see the PromQL statement column of the alert rule table in this topic.
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
  3. Check whether the alert rule takes effect.

    1. Run the following command to map Prometheus in the cluster to local port 9090:

      kubectl port-forward svc/ack-prometheus-operator-prometheus 9090:9090 -n monitoring
    2. Enter localhost:9090 into the address bar of your web browser to go to the Prometheus Server console.

    3. In the upper part of the Prometheus Server console, choose Status > Rules.

      On the Rules page, you can view alert rules. If the alert rule that you created is displayed on the Rules page, the alert rule has taken effect.
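The example rule in step 2 contains only the alert and expr fields. A Prometheus alerting rule can also carry a for duration, labels, and annotations. The following manifest is a minimal sketch that extends the same rule. The 5-minute duration, the severity label, and the annotation text are illustrative assumptions, not values prescribed by ACK.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # Same labels as in step 2; they must match ruleSelector -> matchLabels.
    prometheus: example
    role: alert-rules
  name: prometheus-example-rules
spec:
  groups:
  - name: example.rules
    rules:
    - alert: ExampleAlert
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
      # Assumption: fire only after the condition has held for 5 minutes.
      for: 5m
      labels:
        # Assumption: a severity label that notification routing can match on.
        severity: critical
      annotations:
        # Assumption: example wording; $labels and $value are filled in by Prometheus.
        summary: "CPU utilization on {{ $labels.instance }} is above 90%"
        description: "Current value: {{ $value }}"
```

Without a for duration, the alert fires as soon as a single evaluation of the expression returns a result, which can be noisy for short spikes.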

Alert rules

Based on experience in operating and maintaining clusters and applications, ACK provides a set of Prometheus alert rules. You can use these rules to identify cluster stability issues, node exceptions, node resource utilization issues, pod errors, workload errors, storage errors, and network errors.

The alert rules are classified into the following severities based on the level of impact caused by pod errors and workload errors:

  • Critical: Clusters, applications, and even business are affected. You need to troubleshoot the issues immediately.

  • Warning: Clusters, applications, and even business are affected. You need to troubleshoot the issues at the earliest opportunity.

  • Normal: Important feature changes are involved.
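In open source Prometheus, severities like these are commonly mapped to different notification paths in the Alertmanager configuration. The following fragment is a sketch under several assumptions: each alert rule carries a severity label with the value critical or warning, the receiver names and URLs are placeholders, and the Alertmanager version supports the matchers field (0.22 or later; older versions use match instead).

```yaml
route:
  # Alerts that match no child route, such as Normal ones, go to the default receiver.
  receiver: default-webhook
  group_by: ['alertname', 'namespace']
  routes:
  - receiver: oncall-webhook
    matchers:
    - severity="critical"
    # Assumption: repeat critical notifications every hour until resolved.
    repeat_interval: 1h
  - receiver: default-webhook
    matchers:
    - severity="warning"
    repeat_interval: 12h
receivers:
- name: default-webhook
  webhook_configs:
  - url: http://example.invalid/hooks/default   # placeholder URL
- name: oncall-webhook
  webhook_configs:
  - url: http://example.invalid/hooks/oncall    # placeholder URL
```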

The PromQL statements in the PromQL statement column are written for Prometheus Service. To use these statements in open source Prometheus, delete label matchers such as job="_kube-state-metrics".
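As an illustration of this adjustment, the following rule group fragment shows one expression in both forms. It is only a sketch: the group and alert names are hypothetical, and in practice you would keep only the variant that matches your environment.

```yaml
groups:
- name: job-label-example   # hypothetical group name
  rules:
  # Prometheus Service form: the selector carries the job label matcher.
  - alert: JobFailedPrometheusService
    expr: kube_job_status_failed{job="_kube-state-metrics"} > 0
  # Open source Prometheus form: the job label matcher is deleted.
  - alert: JobFailedOpenSource
    expr: kube_job_status_failed > 0
```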

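Each rule below is listed with the following fields, in this order: Category, Item, Severity, PromQL statement, Description, and SOP for handling alerts. The category is stated once for each group of rules.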

Pod errors

Abnormal pod status

Critical

min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0

This rule triggers alerts if abnormal pod status is detected within the last 5 minutes.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Pod Exceptions and set the Pod anomaly alert rule. For more information, see Alert management.

For more information about how to troubleshoot abnormal pod status, see Pod troubleshooting.

Pod launch failures

Critical

sum_over_time(increase(kube_pod_container_status_restarts_total{}[1m])[5m:1m]) > 3

This rule triggers alerts if the number of pod launch failures exceeds 3 within the last 5 minutes.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Pod Exceptions and set the Pod startup failures alert rule. For more information, see Alert management.

For more information about how to troubleshoot pod launch failures, see Pod troubleshooting.

Over 1,000 pending pods

Critical

sum(sum(max_over_time(kube_pod_status_phase{ phase=~"Pending"}[5m])) by (pod)) > 1000

This rule triggers alerts if the number of pending pods exceeds 1,000 within the last 5 minutes.

This issue occurs because the specifications of the cluster cannot meet the requirements for scheduling more than 1,000 pods. ACK Pro clusters provide enhanced capabilities for scheduling pods and are covered by SLAs. We recommend that you upgrade the cluster to an ACK Pro cluster. For more information, see Overview of ACK Pro clusters.

Frequent CPU throttling

Warning

rate(container_cpu_cfs_throttled_seconds_total[3m]) * 100 > 25

CPU throttling is frequently enforced on pods. This rule triggers alerts if the percentage of throttled CPU time slices within the last 3 minutes exceeds 25%.

CPU throttling limits the CPU time slices that the processes in pods can use. This reduces the uptime of processes in the pods and may slow down the processes in the pods.

If this issue occurs, check whether the CPU limit of the pod is set to a small value. To resolve this issue, we recommend that you enable CPU Burst. For more information, see CPU Burst. If your cluster contains multi-core ECS instances, we recommend that you enable topology-aware CPU scheduling to maximize the utilization of CPU fragments. For more information, see Topology-aware CPU scheduling.
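The statement for this rule is based on throttled seconds. cAdvisor also exposes the period counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total, whose ratio gives the share of CFS periods in which a container was throttled. The following rule fragment is a sketch of that alternative, not the rule that ACK ships. The alert name, the for duration, and the severity label are assumptions; the 25% threshold and the 3-minute window mirror the rule above.

```yaml
- alert: ContainerCpuThrottlingHigh   # hypothetical name
  expr: |
    sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[3m]))
      /
    sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[3m]))
      * 100 > 25
  # Assumption: require the condition to hold for 5 minutes before firing.
  for: 5m
  labels:
    severity: warning
```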

Workload exceptions

Deployment pod anomalies

Critical

kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{}

This rule triggers alerts if the number of replicated pods created by a Deployment is less than the specified value.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Workload Exceptions and set the Deployment pod anomaly alert rule. For more information, see Alert management.

Check whether pods that are provisioned by Deployments fail to be launched.

  • If pods that are provisioned by Deployments fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If pods that are provisioned by Deployments are successfully launched or are in a normal state, submit a ticket for assistance. Specify the details about the issue and the cluster ID in the ticket.

DaemonSet pod anomalies

Critical

((100 - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{} * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{})) > 0

This rule triggers alerts if the number of replicated pods created by a DaemonSet is less than the specified value.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod anomaly alert rule. For more information, see Alert management.

Check whether pods that are provisioned by DaemonSets fail to be launched.

  • If pods that are provisioned by DaemonSets fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If pods that are provisioned by DaemonSets are successfully launched or are in a normal state, submit a ticket for assistance. Specify the details about the issue and the cluster ID in the ticket.

DaemonSet pod scheduling errors

Critical

kube_daemonset_status_number_misscheduled{} > 0

This rule triggers alerts if scheduling errors occur on the pods that are provisioned by a DaemonSet.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod scheduling errors alert rule. For more information, see Alert management.

Check whether pods that are provisioned by DaemonSets fail to be launched.

  • If pods that are provisioned by DaemonSets fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If pods that are provisioned by DaemonSets are successfully launched or are in a normal state, submit a ticket for assistance. Specify the details about the issue and the cluster ID in the ticket.

Job execution failures

Critical

kube_job_status_failed{} > 0

This rule triggers alerts if a Job fails to be executed.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Workload Exceptions and set the Job execution failures alert rule. For more information, see Alert management.

Check the logs of the pods created by the Job.

  • If pods that are provisioned by the Job fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If pods that are provisioned by the Job are successfully launched or are in a normal state, submit a ticket for assistance. Specify the details about the issue and the cluster ID in the ticket.

Storage errors

PV anomalies

Critical

kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0

This rule triggers alerts if a persistent volume (PV) is in an abnormal state.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Storage Exceptions and set the PV anomaly alert rule. For more information, see Alert management.

For more information about how to troubleshoot PV anomalies, see the disk mounting section in FAQ about disk volumes.

Disk space less than 10%

Critical

((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) < 10

This rule triggers alerts if the free space of a disk is less than 10%.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Workload Exceptions and set the Node - Disk usage ≥ 85% alert rule. For more information, see Alert management.

Add nodes and disks. For more information, see the disk mounting section in FAQ about disk volumes.

Node anomalies

Node remaining in the NotReady state for 3 minutes

Critical

(sum(max_over_time(kube_node_status_condition{condition="Ready",status="true"}[3m]) <= 0) by (node)) or (absent(kube_node_status_condition{condition="Ready",status="true"})) > 0

This rule triggers alerts if a node remains in the NotReady state for 3 minutes.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Node Exceptions and set the Node changes to the unschedulable state alert rule. For more information, see Alert management.

  • Check whether the node is being replaced, removed, or manually set to unavailable.

    Important

    If the node is not in any of the preceding states, you must evict the pods on the node to avoid business interruptions.

  • Check the node conditions to identify the cause. For example, check whether the memory resources or disk space on the node is insufficient.
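In open source Prometheus, the "remains in the NotReady state for 3 minutes" condition can also be expressed with the for field of an alerting rule instead of a max_over_time window. The following rule fragment is a simplified sketch, not the rule that ACK ships: the alert name, label, and annotation are assumptions, and unlike the statement above it does not cover the case in which the metric is absent altogether.

```yaml
- alert: NodeNotReady   # hypothetical name
  # The Ready/true series reported by kube-state-metrics is 0 while a node is not ready.
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  # Fire only if the node stays in this state for 3 minutes.
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} has been NotReady for 3 minutes"
```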

Excessively high resource utilization of nodes

Memory utilization higher than 80%

Warning

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20

This rule triggers alerts if the memory utilization of a node exceeds 80%.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Resource Exceptions and set the Node - Memory usage ≥ 85% alert rule. For more information, see Alert management.

  • Release resources.

    We recommend that you use the cost insights feature of ACK to check whether pods occupy schedulable memory resources and whether the memory requests of pods are set to a relatively high value. For more information, see Enable cost insights. We recommend that you use the resource profiling feature of ACK to configure memory requests for pods. For more information, see Resource profiling.

  • Plan the memory capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Memory utilization higher than 90%

Critical

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10

This rule triggers alerts if the memory utilization of a node exceeds 90%.

  • Release resources.

    We recommend that you use the cost insights feature of ACK to check whether pods occupy schedulable memory resources and whether the memory requests of pods are set to a relatively high value. For more information, see Enable cost insights. We recommend that you use the resource profiling feature of ACK to configure memory requests for pods. For more information, see Resource profiling.

  • Plan the memory capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Excessively high resource utilization of nodes

CPU utilization higher than 80%

Warning

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80

This rule triggers alerts if the CPU utilization of a node exceeds 80%.

On the Alert Rules tab of the Alerts page in the ACK console, click Alert Rule Set for Resource Exceptions and set the Node - CPU usage ≥ 85% alert rule. For more information, see Alert management.

  • Release resources.

    We recommend that you use the cost insights feature of ACK to check whether pods occupy schedulable CPU resources and whether the CPU requests of pods are set to a relatively high value. For more information, see Enable cost insights. We recommend that you use the resource profiling feature of ACK to configure CPU requests for pods. For more information, see Resource profiling.

  • Plan the CPU capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

CPU utilization higher than 90%

Critical

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90

This rule triggers alerts if the CPU utilization of a node exceeds 90%.

  • Release resources.

    We recommend that you use the cost insights feature of ACK to check whether pods occupy schedulable CPU resources and whether the CPU requests of pods are set to a relatively high value. For more information, see Enable cost insights. We recommend that you use the resource profiling feature of ACK to configure CPU requests for pods. For more information, see Resource profiling.

  • Plan the CPU capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.
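The warning and critical thresholds in this category can be kept together in one rule group. The following fragment is a sketch that uses the two memory statements from the table. The group name, alert names, and for durations are assumptions.

```yaml
groups:
- name: node-memory.rules   # hypothetical group name
  rules:
  - alert: NodeMemoryUtilizationHigh
    # Less than 20% of memory available, that is, utilization above 80%.
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20
    for: 5m
    labels:
      severity: warning
  - alert: NodeMemoryUtilizationCritical
    # Less than 10% of memory available, that is, utilization above 90%.
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: critical
```

With this layout, a node above 90% utilization triggers both alerts, because the warning condition is also true. If that is too noisy, an Alertmanager inhibit rule can suppress the warning while the critical alert is firing.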

Network errors

CoreDNS Unavailability - Number of requests drops to 0

Critical

(sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0)

This rule applies only to ACK Pro clusters and ACK basic clusters.

Check whether CoreDNS pods in the cluster run as expected.

CoreDNS Unavailability - Panics

Critical

sum(rate(coredns_panic_count_total{}[3m])) > 0

This rule applies only to ACK Pro clusters and ACK basic clusters.

Check whether CoreDNS pods in the cluster run as expected. If CoreDNS pods in the cluster do not run as expected, submit a ticket.

Ingress controller certificates about to expire

Warning

((nginx_ingress_controller_ssl_expire_time_seconds - time()) / 24 / 3600) < 14

You must create Ingresses and install the ACK Ingress controller.

Issue the Ingress controller certificates again.

Scaling issues

Maximum number of pods in the HPA configuration reached

Warning

max(kube_horizontalpodautoscaler_spec_max_replicas) by (namespace, horizontalpodautoscaler) - max(kube_horizontalpodautoscaler_status_current_replicas) by (namespace, horizontalpodautoscaler) <= 0

You need to enable the horizontalpodautoscaler metric in Application Real-Time Monitoring Service (ARMS) Prometheus. This metric is disabled by default and is free of charge.

Check whether the HPA scaling policy meets the requirements.

References