Prometheus Monitoring alert rules include Application Real-Time Monitoring Service (ARMS) alert rules, Kubernetes alert rules, MongoDB alert rules, MySQL alert rules, NGINX alert rules, and Redis alert rules.

ARMS alert rules

Name Expression Data collection time (Unit: minutes) Trigger condition
PodCpu75 100 * (sum(rate(container_cpu_usage_seconds_total[1m])) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_cpu_cores, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75 7 The CPU utilization of a pod is greater than 75%.
PodMemory75 100 * (sum(container_memory_working_set_bytes) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_memory_bytes, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75 5 The memory usage of a pod is greater than 75%.
pod_status_no_running sum (kube_pod_status_phase{phase!="Running"}) by (pod,phase) 5 A pod is not running.
PodMem4GbRestart (sum (container_memory_working_set_bytes{id!="/"})by (pod_name,container_name) /1024/1024/1024)>4 5 The memory usage of a pod exceeds 4 GB.
PodRestart sum (increase (kube_pod_container_status_restarts_total{}[2m])) by (namespace,pod) >0 5 A pod is restarted.
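
The percentage arithmetic behind PodCpu75 and PodMemory75 can be illustrated with a minimal Python sketch. It is not how Prometheus evaluates the rules: the real rate() function also handles counter resets and extrapolation, and the function names and sample values below are hypothetical.

```python
def cpu_utilization_percent(usage_start_s, usage_end_s, window_s, limit_cores):
    """Approximate 100 * rate(cpu_seconds_counter[window]) / cpu_limit for one pod."""
    cores_used = (usage_end_s - usage_start_s) / window_s  # average cores in use
    return 100.0 * cores_used / limit_cores


def memory_utilization_percent(working_set_bytes, limit_bytes):
    """Approximate 100 * working_set / memory_limit for one pod."""
    return 100.0 * working_set_bytes / limit_bytes


# Hypothetical pod: the CPU counter grew by 48 CPU-seconds in 60 s, limit is 1 core.
cpu_percent = cpu_utilization_percent(1000.0, 1048.0, 60.0, 1.0)
# Hypothetical pod: 3 GiB working set against a 4 GiB limit.
mem_percent = memory_utilization_percent(3 * 1024**3, 4 * 1024**3)

# Both rules use a strict comparison, so a value of exactly 75 does not fire.
print("PodCpu75 fires:", cpu_percent > 75)
print("PodMemory75 fires:", mem_percent > 75)
```

In this sketch the CPU rule fires at roughly 80% utilization, while the memory rule does not fire because the usage sits exactly at the threshold.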

Kubernetes alert rules

Name Expression Data collection time (Unit: minutes) Trigger condition
KubeStateMetricsListErrors (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01 15 More than 1% of kube-state-metrics list operations fail.
KubeStateMetricsWatchErrors (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01 15 More than 1% of kube-state-metrics watch operations fail.
NodeFilesystemAlmostOutOfSpace ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 A node file system is running out of space.
NodeFilesystemSpaceFillingUp ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 A node file system is about to be fully occupied.
NodeFilesystemFilesFillingUp ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 The inodes of a node file system are about to be used up.
NodeFilesystemAlmostOutOfFiles ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 A node file system has almost no free inodes left.
NodeNetworkReceiveErrs increase(node_network_receive_errs_total[2m]) > 10 60 Network receive errors occur on a node.
NodeNetworkTransmitErrs increase(node_network_transmit_errs_total[2m]) > 10 60 Network transmit errors occur on a node.
NodeHighNumberConntrackEntriesUsed (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75 None A large number of conntrack entries are used.
NodeClockSkewDetected ( node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0 ) or ( node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0 ) 10 Clock skew is detected on a node.
NodeClockNotSynchronising min_over_time(node_timex_sync_status[5m]) == 0 10 The clock of a node is not synchronizing.
KubePodCrashLooping rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0 15 A pod is crash looping.
KubePodNotReady sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0 15 A pod is not ready.
KubeDeploymentGenerationMismatch kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"} 15 The observed generation of a Deployment does not match its metadata generation.
KubeDeploymentReplicasMismatch ( kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"} ) and ( changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 ) 15 Deployment replicas do not match.
KubeStatefulSetReplicasMismatch ( kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"} ) and ( changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 ) 15 StatefulSet replicas do not match.
KubeStatefulSetGenerationMismatch kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"} 15 The observed generation of a StatefulSet does not match its metadata generation.
KubeStatefulSetUpdateNotRolledOut max without (revision) ( kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"} ) * ( kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"} ) 15 A StatefulSet update is not rolled out.
KubeDaemonSetRolloutStuck kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00 15 A DaemonSet rollout is stuck.
KubeContainerWaiting sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0 60 A container is waiting.
KubeDaemonSetNotScheduled kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0 10 A DaemonSet is not scheduled.
KubeDaemonSetMisScheduled kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0 15 A DaemonSet is misscheduled.
KubeCronJobRunning time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600 60 A cron job takes more than 1 hour to complete.
KubeJobCompletion kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0 60 A job has not completed.
KubeJobFailed kube_job_failed{job="kube-state-metrics"} > 0 15 A job failed.
KubeHpaReplicasMismatch (kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 15 Horizontal Pod Autoscaler (HPA) replicas do not match.
KubeHpaMaxedOut kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"} 15 The maximum number of HPA replicas is reached.
KubeCPUOvercommit sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{}) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores) 5 The CPU is overcommitted.
KubeMemoryOvercommit sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{}) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes) 5 The memory is overcommitted.
KubeCPUQuotaOvercommit sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 5 The CPU quota is overcommitted.
KubeMemoryQuotaOvercommit sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"}) / sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5 5 The memory quota is overcommitted.
KubeQuotaExceeded kube_resourcequota{job="kube-state-metrics", type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0) > 0.90 15 The quota is exceeded.
CPUThrottlingHigh sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 25 / 100 ) 15 The CPU of a container is throttled in more than 25% of the CFS periods.
KubePersistentVolumeFillingUp kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet", metrics_path="/metrics"} < 0.03 1 The volume capacity is insufficient.
KubePersistentVolumeErrors kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0 5 A persistent volume is in the Failed or Pending phase.
KubeVersionMismatch count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1 15 The versions of Kubernetes components do not match.
KubeClientErrors (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job) / sum(rate(rest_client_requests_total[5m])) by (instance, job)) > 0.01 15 An error occurs to the client.
KubeAPIErrorBudgetBurn sum(apiserver_request:burnrate1h) > (14.40 * 0.01000) and sum(apiserver_request:burnrate5m) > (14.40 * 0.01000) 2 The API server error budget is consumed too fast.
KubeAPILatencyHigh ( cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} > on (verb) group_left() ( avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) + 2*stddev by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) ) ) > on (verb) group_left() 1.2 * avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) and on (verb,resource) cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99"} > 1 5 The API latency is high.
KubeAPIErrorsHigh sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05 10 Excessive API errors occur.
KubeClientCertificateExpiration apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 None A client certificate is about to expire in less than 7 days.
AggregatedAPIErrors sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2 None An error occurs to the aggregated API.
AggregatedAPIDown sum by(name, namespace)(sum_over_time(aggregator_unavailable_apiservice[5m])) > 0 5 The aggregated API is offline.
KubeAPIDown absent(up{job="apiserver"} == 1) 15 The API server is offline.
KubeNodeNotReady kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0 15 A node is not ready.
KubeNodeUnreachable kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1 2 A node is unreachable.
KubeletTooManyPods max(max(kubelet_running_pod_count{job="kubelet", metrics_path="/metrics"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by(node) > 0.95 15 The number of pods on a node exceeds 95% of the pod capacity of the node.
KubeNodeReadinessFlapping sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2 15 The readiness status changes frequently.
KubeletPlegDurationHigh node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10 5 The relist operation of the pod lifecycle event generator (PLEG) takes a long time.
KubeletPodStartUpLatencyHigh histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m])) by (instance, le)) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60 15 The startup latency of a pod is high.
KubeletDown absent(up{job="kubelet", metrics_path="/metrics"} == 1) 15 The kubelet is offline.
KubeSchedulerDown absent(up{job="kube-scheduler"} == 1) 15 The Kubernetes scheduler is offline.
KubeControllerManagerDown absent(up{job="kube-controller-manager"} == 1) 15 The controller manager is offline.
TargetDown 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10 10 The target is offline.
NodeNetworkInterfaceFlapping changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2 2 The network interface status changes frequently.
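
The threshold used by KubeCPUOvercommit and KubeMemoryOvercommit, (count - 1) / count, equals the share of allocatable capacity that would remain after one node is lost, assuming nodes of roughly equal size. The following Python sketch illustrates that reading. The function name and the sample cluster are hypothetical and are not part of the rule definitions.

```python
def overcommit_fires(requested, allocatable_per_node):
    """Mirror sum(requests) / sum(allocatable) > (count - 1) / count."""
    node_count = len(allocatable_per_node)
    if node_count == 0:
        raise ValueError("at least one node is required")
    total_allocatable = sum(allocatable_per_node)
    # Fraction of capacity left if a single node fails (equal-sized nodes assumed).
    threshold = (node_count - 1) / node_count
    return requested / total_allocatable > threshold


# Hypothetical cluster: three nodes with 4 allocatable CPU cores each.
nodes = [4.0, 4.0, 4.0]

# 9 of 12 cores requested: 0.75 is above the 2/3 threshold, so the rule fires.
fires_at_9 = overcommit_fires(9.0, nodes)
# 7 of 12 cores requested: about 0.58 is below the threshold, so it does not fire.
fires_at_7 = overcommit_fires(7.0, nodes)

print(fires_at_9, fires_at_7)
```

With a single node the threshold is 0, so any nonzero request satisfies the condition.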

MongoDB alert rules

Name Expression Data collection time (Unit: minutes) Trigger condition
MongodbReplicationLag avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10 5 The replication latency is high.
MongodbReplicationHeadroom (avg(mongodb_replset_oplog_tail_timestamp - mongodb_replset_oplog_head_timestamp) - (avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}))) <= 0 5 The replication margin is insufficient.
MongodbReplicationStatus3 mongodb_replset_member_state == 3 5 The replication status is 3.
MongodbReplicationStatus6 mongodb_replset_member_state == 6 5 The replication status is 6.
MongodbReplicationStatus8 mongodb_replset_member_state == 8 5 The replication status is 8.
MongodbReplicationStatus10 mongodb_replset_member_state == 10 5 The replication status is 10.
MongodbNumberCursorsOpen mongodb_metrics_cursor_open{state="total_open"} > 10000 5 Excessive cursors exist.
MongodbCursorsTimeouts sum(increase(mongodb_metrics_cursor_timed_out_total[10m])) > 100 5 Excessive cursors time out.
MongodbTooManyConnections mongodb_connections{state="current"} > 500 5 Excessive connections exist.
MongodbVirtualMemoryUsage (sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3 5 The virtual memory usage is high.
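
MongodbReplicationLag and MongodbReplicationHeadroom are related: the headroom expression subtracts the replication lag from the time span covered by the oplog. A minimal Python sketch of that arithmetic follows. It uses single timestamps in place of the avg() aggregations, and the function names and sample values are hypothetical.

```python
def replication_lag_s(primary_optime_s, secondary_optime_s):
    """Seconds by which the secondary trails the primary."""
    return primary_optime_s - secondary_optime_s


def replication_headroom_s(oplog_tail_s, oplog_head_s, lag_s):
    """Oplog time span minus the lag, following the rule expression."""
    return (oplog_tail_s - oplog_head_s) - lag_s


# Hypothetical timestamps in seconds since the epoch.
lag = replication_lag_s(1_700_000_050, 1_700_000_020)
headroom = replication_headroom_s(1_700_000_050, 1_699_996_450, lag)

print("MongodbReplicationLag fires:", lag > 10)
print("MongodbReplicationHeadroom fires:", headroom <= 0)
```

In this sample the secondary is 30 seconds behind, so the lag rule fires, but the oplog still spans an hour, so the headroom rule does not.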

MySQL alert rules

Name Expression Data collection time (Unit: minutes) Trigger condition
MySQL is down mysql_up == 0 1 MySQL is offline.
open files high mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75 1 Excessive files are opened.
Read buffer size is bigger than max. allowed packet size mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet 1 The size of the read buffer exceeds the maximum allowed packet size.
Sort buffer possibly missconfigured mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024 1 A configuration error may exist in the sort buffer.
Thread stack size is too small mysql_global_variables_thread_stack <196608 1 The thread stack size is small.
Used more than 80% of max connections limited mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8 1 More than 80% of the maximum number of connections have been used.
InnoDB Force Recovery is enabled mysql_global_variables_innodb_force_recovery != 0 1 Force recovery is enabled.
InnoDB Log File size is too small mysql_global_variables_innodb_log_file_size < 16777216 1 The log file size is small.
InnoDB Flush Log at Transaction Commit mysql_global_variables_innodb_flush_log_at_trx_commit != 1 1 Logs are not flushed to disk at every transaction commit.
Table definition cache too small mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache 1 The number of cached table definitions is small.
Table open cache too small mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100 1 The number of cached open tables is small.
Thread stack size is possibly too small mysql_global_variables_thread_stack < 262144 1 The thread stack size may be small.
InnoDB Buffer Pool Instances is too small mysql_global_variables_innodb_buffer_pool_instances == 1 1 The number of instances in the buffer pool is small.
InnoDB Plugin is enabled mysql_global_variables_ignore_builtin_innodb == 1 1 The plug-in is enabled.
Binary Log is disabled mysql_global_variables_log_bin != 1 1 Binary logs are disabled.
Binlog Cache size too small mysql_global_variables_binlog_cache_size < 1048576 1 The cache size is small.
Binlog Statement Cache size too small mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0 1 The statement cache size is small.
Binlog Transaction Cache size too small mysql_global_variables_binlog_cache_size <1048576 1 The transaction cache size is small.
Sync Binlog is enabled mysql_global_variables_sync_binlog == 1 1 Binary log synchronization to disk (sync_binlog) is enabled.
IO thread stopped mysql_slave_status_slave_io_running != 1 1 I/O threads are stopped.
SQL thread stopped mysql_slave_status_slave_sql_running == 0 1 SQL threads are stopped.
Mysql_Too_Many_Connections rate(mysql_global_status_threads_connected[5m])>200 5 Excessive connections exist.
Mysql_Too_Many_slow_queries rate(mysql_global_status_slow_queries[5m])>3 5 Excessive slow queries exist.
Slave lagging behind Master rate(mysql_slave_status_seconds_behind_master[1m]) >30 1 The secondary node lags behind the primary node.
Slave is NOT read only (Please ignore this warning indicator.) mysql_global_variables_read_only != 0 1 The secondary node is not read-only.
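
Several MySQL thresholds above are written as raw byte counts. The Python sketch below converts them to binary units for easier reading. The helper name human_size is hypothetical, and the values are copied from the rule expressions.

```python
def human_size(n_bytes):
    """Render a byte count as whole MiB or KiB when it divides evenly."""
    for unit, factor in (("MiB", 1024**2), ("KiB", 1024)):
        if n_bytes >= factor and n_bytes % factor == 0:
            return f"{n_bytes // factor} {unit}"
    return f"{n_bytes} B"


# Byte thresholds taken from the MySQL rule expressions.
thresholds = {
    "thread_stack (too small)": 196608,
    "thread_stack (possibly too small)": 262144,
    "innodb_log_file_size": 16777216,
    "binlog_cache_size": 1048576,
    "innodb_sort_buffer_size": 256 * 1024,
    "read_buffer_size": 4 * 1024 * 1024,
}

for name, value in thresholds.items():
    print(f"{name}: {value} bytes = {human_size(value)}")
```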

NGINX alert rules

Name Expression Data collection time (Unit: minutes) Trigger condition
NginxHighHttp4xxErrorRate sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 5 The rate of HTTP 4xx errors is high.
NginxHighHttp5xxErrorRate sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 5 The rate of HTTP 5xx errors is high.
NginxLatencyHigh histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10 5 The latency is high.
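
The two error-rate rules select status codes with a regular expression and compare the matching share of requests with 5%. The Python sketch below reproduces that calculation on request counts from one window. The ratio of counts equals the ratio of rates when both cover the same window. The function name and sample counts are hypothetical. Prometheus anchors label regular expressions at both ends, which re.fullmatch imitates here.

```python
import re


def error_rate_percent(requests_per_status, status_pattern):
    """Share of requests, in percent, whose status label fully matches the pattern."""
    total = sum(requests_per_status.values())
    if total == 0:
        # Simplification: PromQL would yield NaN here, which never exceeds a threshold.
        return 0.0
    matched = sum(
        count
        for status, count in requests_per_status.items()
        if re.fullmatch(status_pattern, status)
    )
    return 100.0 * matched / total


# Hypothetical request counts per status code within one window.
window = {"200": 90, "404": 6, "500": 4}

rate_4xx = error_rate_percent(window, r"^4..")
rate_5xx = error_rate_percent(window, r"^5..")

print("NginxHighHttp4xxErrorRate fires:", rate_4xx > 5)
print("NginxHighHttp5xxErrorRate fires:", rate_5xx > 5)
```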

Redis alert rules

Name Expression Data collection time (Unit: minutes) Trigger condition
RedisDown redis_up == 0 5 Redis is offline.
RedisMissingMaster count(redis_instance_info{role="master"}) == 0 5 The primary node is missing.
RedisTooManyMasters count(redis_instance_info{role="master"}) > 1 5 Excessive primary nodes exist.
RedisDisconnectedSlaves count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1 5 Secondary nodes are disconnected.
RedisReplicationBroken delta(redis_connected_slaves[1m]) < 0 5 The replication is interrupted.
RedisClusterFlapping changes(redis_connected_slaves[5m]) > 2 5 Changes are detected in the connection to replica nodes.
RedisMissingBackup time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24 5 No backup has been saved for more than 24 hours.
RedisOutOfMemory redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 5 The memory is insufficient.
RedisTooManyConnections redis_connected_clients > 100 5 Excessive connections exist.
RedisNotEnoughConnections redis_connected_clients < 5 5 Connections are insufficient.
RedisRejectedConnections increase(redis_rejected_connections_total[1m]) > 0 5 Connections are rejected.
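
As a rough illustration, the following Python sketch checks a single metrics sample against four of the threshold-based Redis rules. It is only a sketch of the comparisons: it ignores the data collection time, and the function name and sample values are hypothetical.

```python
def redis_alerts(sample, now_s):
    """Return the names of the threshold rules that the sample satisfies."""
    fired = []
    used_percent = (
        sample["redis_memory_used_bytes"]
        / sample["redis_total_system_memory_bytes"]
        * 100
    )
    if used_percent > 90:
        fired.append("RedisOutOfMemory")
    if now_s - sample["redis_rdb_last_save_timestamp_seconds"] > 60 * 60 * 24:
        fired.append("RedisMissingBackup")
    if sample["redis_connected_clients"] > 100:
        fired.append("RedisTooManyConnections")
    if sample["redis_connected_clients"] < 5:
        fired.append("RedisNotEnoughConnections")
    return fired


# Hypothetical sample: 15 GiB of 16 GiB used, last save two days ago, 3 clients.
now = 1_700_000_000
sample = {
    "redis_memory_used_bytes": 15 * 1024**3,
    "redis_total_system_memory_bytes": 16 * 1024**3,
    "redis_rdb_last_save_timestamp_seconds": now - 2 * 24 * 60 * 60,
    "redis_connected_clients": 3,
}

fired = redis_alerts(sample, now)
print(fired)
```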