KubeStateMetricsListErrors |
(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m]))
/ sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01
|
15 |
Errors are occurring in kube-state-metrics list operations. |
KubeStateMetricsWatchErrors |
(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m]))
/ sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01
|
15 |
Errors are occurring in kube-state-metrics watch operations. |
NodeFilesystemAlmostOutOfSpace |
( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}
* 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
|
60 |
A node file system is running out of space. |
NodeFilesystemSpaceFillingUp |
( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}
* 100 < 40 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h],
24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
|
60 |
A node file system is predicted to run out of space within 24 hours. |
NodeFilesystemFilesFillingUp |
( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""}
* 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h],
24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
|
60 |
A node file system is predicted to run out of inodes within 24 hours. |
NodeFilesystemAlmostOutOfFiles |
( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""}
* 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
|
60 |
A node file system is almost out of inodes. |
NodeNetworkReceiveErrs |
increase(node_network_receive_errs_total[2m]) > 10 |
60 |
A node network interface is reporting receive errors. |
NodeNetworkTransmitErrs |
increase(node_network_transmit_errs_total[2m]) > 10 |
60 |
A node network interface is reporting transmit errors. |
NodeHighNumberConntrackEntriesUsed |
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75 |
None |
A large number of conntrack entries are used. |
NodeClockSkewDetected |
( node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0 )
or ( node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <=
0 )
|
10 |
Clock skew is detected on a node. |
NodeClockNotSynchronising |
min_over_time(node_timex_sync_status[5m]) == 0 |
10 |
The node clock is not synchronising. |
KubePodCrashLooping |
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60
* 5 > 0
|
15 |
A pod container is crash-looping (restarting repeatedly). |
KubePodNotReady |
sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",
phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace,
pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0
|
15 |
A pod is not ready. |
KubeDeploymentGenerationMismatch |
kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"} |
15 |
The Deployment generation does not match its observed generation. |
KubeDeploymentReplicasMismatch |
( kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}
) and ( changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m])
== 0 )
|
15 |
The Deployment has not reached the expected number of replicas. |
KubeStatefulSetReplicasMismatch |
( kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"}
) and ( changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m])
== 0 )
|
15 |
The StatefulSet has not reached the expected number of replicas. |
KubeStatefulSetGenerationMismatch |
kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"} |
15 |
The StatefulSet generation does not match its observed generation. |
KubeStatefulSetUpdateNotRolledOut |
max without (revision) ( kube_statefulset_status_current_revision{job="kube-state-metrics"}
unless kube_statefulset_status_update_revision{job="kube-state-metrics"} ) * ( kube_statefulset_replicas{job="kube-state-metrics"}
!= kube_statefulset_status_replicas_updated{job="kube-state-metrics"} )
|
15 |
A StatefulSet update has not been rolled out. |
KubeDaemonSetRolloutStuck |
kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
< 1.00
|
15 |
A DaemonSet rollout is stuck. |
KubeContainerWaiting |
sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"})
> 0
|
60 |
A container is waiting. |
KubeDaemonSetNotScheduled |
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"}
> 0
|
10 |
Some pods of a DaemonSet are not scheduled. |
KubeDaemonSetMisScheduled |
kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0 |
15 |
A DaemonSet is misscheduled. |
KubeCronJobRunning |
time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600 |
60 |
A cron job takes more than 1 hour to complete. |
KubeJobCompletion |
kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"}
> 0
|
60 |
A Job has not completed. |
KubeJobFailed |
kube_job_failed{job="kube-state-metrics"} > 0 |
15 |
A job failed. |
KubeHpaReplicasMismatch |
(kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"})
and changes(kube_hpa_status_current_replicas[15m]) == 0
|
15 |
Horizontal Pod Autoscaler (HPA) replicas do not match the desired count. |
KubeHpaMaxedOut |
kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"} |
15 |
The maximum number of HPA replicas is reached. |
KubeCPUOvercommit |
sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{}) / sum(kube_node_status_allocatable_cpu_cores)
> (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
|
5 |
The CPU is overcommitted. |
KubeCPUQuotaOvercommit |
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"}) / sum(kube_node_status_allocatable_cpu_cores)
> 1.5
|
5 |
The CPU quota is overcommitted. |
KubeMemoryQuotaOvercommit |
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
/ sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5
|
5 |
The memory quota is overcommitted. |
KubeQuotaExceeded |
kube_resourcequota{job="kube-state-metrics", type="used"} / ignoring(instance, job,
type) (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0) > 0.90
|
15 |
A namespace is using more than 90% of its resource quota. |
CPUThrottlingHigh |
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container,
pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container,
pod, namespace) > ( 25 / 100 )
|
15 |
Containers are experiencing a high rate of CPU throttling. |
KubePersistentVolumeFillingUp |
kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet",
metrics_path="/metrics"} < 0.03
|
1 |
A PersistentVolume has less than 3% of its capacity available. |
KubePersistentVolumeErrors |
kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"}
> 0
|
5 |
A PersistentVolume is in the Failed or Pending phase. |
KubeVersionMismatch |
count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*")))
> 1
|
15 |
Kubernetes component versions do not match. |
KubeClientErrors |
(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job) / sum(rate(rest_client_requests_total[5m]))
by (instance, job)) > 0.01
|
15 |
A Kubernetes API client is experiencing a high error rate. |
KubeAPIErrorBudgetBurn |
sum(apiserver_request:burnrate1h) > (14.40 * 0.01000) and sum(apiserver_request:burnrate5m)
> (14.40 * 0.01000)
|
2 |
The API server is burning its error budget too quickly. |
KubeAPILatencyHigh |
( cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} > on (verb) group_left()
( avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"}
>= 0) + 2*stddev by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"}
>= 0) ) ) > on (verb) group_left() 1.2 * avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"}
>= 0) and on (verb,resource) cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99"}
> 1
|
5 |
The API latency is high. |
KubeAPIErrorsHigh |
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
/ sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb)
> 0.05
|
10 |
The API server is returning errors for more than 5% of requests to a resource. |
KubeClientCertificateExpiration |
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job)
histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])))
< 604800
|
None |
A client certificate is about to expire (in less than 7 days). |
AggregatedAPIErrors |
sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2 |
None |
An aggregated API has reported errors. |
AggregatedAPIDown |
sum by(name, namespace)(sum_over_time(aggregator_unavailable_apiservice[5m])) > 0 |
5 |
The aggregated API is offline. |
KubeAPIDown |
absent(up{job="apiserver"} == 1) |
15 |
The Kubernetes API server is offline. |
KubeNodeNotReady |
kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"}
== 0
|
15 |
A node is not ready. |
KubeNodeUnreachable |
kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"}
== 1
|
2 |
A node is unreachable. |
KubeletTooManyPods |
max(max(kubelet_running_pod_count{job="kubelet", metrics_path="/metrics"}) by(instance)
* on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by(node)
> 0.95
|
15 |
The kubelet is running close to its pod capacity. |
KubeNodeReadinessFlapping |
sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by
(node) > 2
|
15 |
The readiness status changes frequently. |
KubeletPlegDurationHigh |
node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"}
>= 10
|
5 |
The Pod Lifecycle Event Generator (PLEG) relist duration is high. |
KubeletPodStartUpLatencyHigh |
histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",
metrics_path="/metrics"}[5m])) by (instance, le)) * on(instance) group_left(node)
kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60
|
15 |
The startup latency of a pod is high. |
KubeletDown |
absent(up{job="kubelet", metrics_path="/metrics"} == 1) |
15 |
The kubelet is offline. |
KubeSchedulerDown |
absent(up{job="kube-scheduler"} == 1) |
15 |
The Kubernetes scheduler is offline. |
KubeControllerManagerDown |
absent(up{job="kube-controller-manager"} == 1) |
15 |
The controller manager is offline. |
TargetDown |
100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace,
service)) > 10
|
10 |
More than 10% of a service's scrape targets are offline. |
NodeNetworkInterfaceFlapping |
changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2 |
2 |
The network interface status changes frequently. |
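Several of the filesystem rules above (NodeFilesystemSpaceFillingUp, NodeFilesystemFilesFillingUp) rely on PromQL's predict_linear(), which fits a least-squares line to the samples in the lookback window and extrapolates it forward; the alert fires when the value extrapolated 24 hours out drops below zero. A minimal Python sketch of that logic, using synthetic sample data (the timestamps and values are illustrative, not taken from any real node):

```python
def predict_linear(samples, horizon_seconds, now=None):
    """Least-squares linear extrapolation, mirroring PromQL's predict_linear().

    samples: list of (unix_timestamp, value) pairs from the lookback window.
    horizon_seconds: how far past `now` to extrapolate.
    Returns the predicted value at now + horizon_seconds.
    """
    if now is None:
        now = samples[-1][0]
    n = len(samples)
    # Use timestamps relative to `now` for numerical stability.
    xs = [t - now for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var              # bytes per second
    intercept = mean_y - slope * mean_x  # fitted value at `now`
    return intercept + slope * horizon_seconds

# Free bytes sampled hourly over a 6-hour window, shrinking by 1 GiB/hour:
# 10 GiB free six hours ago, 4 GiB free now.
GiB = 1024 ** 3
now = 1_700_000_000
samples = [(now - (6 - i) * 3600, (10 - i) * GiB) for i in range(7)]
predicted = predict_linear(samples, 24 * 60 * 60, now=now)
# Losing 1 GiB/hour from 4 GiB free, the volume runs dry well within 24 h,
# so the extrapolated value is negative and the `< 0` alert condition holds.
```

The rule combines this prediction with a current-usage check (less than 40% free) and a read-only filter, so it only fires on writable volumes that are both low on space and trending toward full, rather than on every volume with a briefly negative slope.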