Simple Log Service: Built-in alert monitoring rules for ACK

Last Updated: Aug 29, 2023

This topic describes the built-in alert monitoring rules that are used to monitor Container Service for Kubernetes (ACK) clusters.

After you enable the alerting feature in the ACK console, ACK saves the event logs of ACK clusters to a Logstore named k8s-event in a specified Simple Log Service project. ACK also synchronizes the built-in alert monitoring rules to the Simple Log Service Alert Center, which uses them to monitor the Logstore.

The following table describes the built-in alert monitoring rules for ACK. You can also create custom alert monitoring rules based on your business requirements. For more information, see Create an alert monitoring rule for logs.
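Most of the built-in query statements in this topic share one shape: a search clause, a pipe, and a SELECT that aggregates messages, projects a few label columns, and counts rows per group. The following Python sketch shows one way to generate a query in that shape as a starting point for a custom rule. The helper name build_event_query is hypothetical and is not part of any Simple Log Service SDK, and the FailedMount reason in the usage lines is used only for illustration.

```python
def build_event_query(search, label_fields):
    """Build a query string in the shape used by the built-in rules.

    search: the search clause placed before the pipe,
            for example 'eventId.reason : AddNodeFailed'.
    label_fields: dict that maps an output alias to a log field,
            for example {"namespace": "eventId.metadata.namespace"}.
    This helper is illustrative only; it just formats a string.
    """
    if not label_fields:
        raise ValueError("at least one label field is required")

    # Aggregate all event messages of a group into one array.
    columns = ['ARRAY_AGG("eventId.message") as message']
    # Project each log field under its alias; field names contain dots,
    # so they are wrapped in double quotation marks.
    for alias, field in label_fields.items():
        columns.append(f'"{field}" as {alias}')
    # Count the events in each group; the trigger condition tests this value.
    columns.append("COUNT(*) as cnt")

    group_by = ", ".join(label_fields)
    return f'{search} | SELECT {", ".join(columns)} from log GROUP by {group_by}'


# Usage: a custom query grouped by namespace and object name.
custom_query = build_event_query(
    "eventId.reason : FailedMount",
    {
        "namespace": "eventId.metadata.namespace",
        "object_name": "eventId.involvedObject.name",
    },
)
print(custom_query)
```

The aliases passed in label_fields are the same names that a rule would list in its Group Evaluation setting, so keeping them in one dictionary helps the query and the labels stay in sync.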

Alert monitoring rule ID

Alert monitoring rule name

Description

Query statement

Trigger Condition parameter setting

Group Evaluation parameter setting

sls_app_ack_ccm_at_add_node_fail

Failed to Add Node

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes add node failed" event exists.

eventId.reason : AddNodeFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_ccm_at_create_route_fail

Failed to Create Route

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes create route failed" event exists.

eventId.reason : CreateRouteFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_ccm_at_del_node_fail

Failed to Delete Node

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes delete node failed" event exists.

eventId.reason : DeleteNodeFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_ccm_at_del_slb_fail

Failed to Delete LoadBalancer

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes slb delete failed" event exists.

eventId.reason : DeleteLoadBalancerFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_ccm_at_no_ava_slb

No Available LoadBalancer

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes slb not available" event exists.

eventId.reason : UnAvailableLoadBalancer | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_ccm_at_sync_route_fail

Failed to Sync Route

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes sync route failed" event exists.

eventId.reason : SyncRouteFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_ccm_at_sync_slb_fail

Failed to Sync LoadBalancer

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes slb sync failed" event exists.

eventId.reason : SyncLoadBalancerFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_csi_at_device_busy

Failed to Unmount the Mount Point

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes csi disk device busy" event exists.

eventId.reason : DeviceBusy | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_csi_at_disk_iohang

Cloud Disk IOHang

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes csi ioHang" event exists.

eventId.reason : DeviceBusy | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_csi_at_disk_no_portable

Container Data Volume not Support Monthly and Annual Cloud Disk

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes csi not portable" event exists.

eventId.reason : ProvisioningFailed and eventId.message : DiskNotPortable| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_csi_at_invalid_disk_size

Invalid Cloud Disk Size, less than 20Gi

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes csi invalid disk size" event exists.

eventId.reason : ProvisioningFailed and eventId.message : InvalidDiskSize| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_csi_at_latency_too_high

SlowIO Occurs in Disk-Bound PVC

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes csi pvc latency load too high" event exists.

eventId.reason : LatencyTooHigh | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_csi_at_no_ava_disk

No Available Cloud Disk

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes csi no available disk" event exists.

eventId.reason : ResourceInvalid and eventId.message : "get disk"| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_csi_at_no_enough_disk_space

Disk Capacity Exceeds Threshold

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes csi not enough disk space" event exists.

eventId.reason : NotEnoughDiskSpace| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_common_at_docker_hung

Exception in Docker Process of Cluster Node

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node docker hang" event exists.

eventId.reason:DockerHung or eventId.reason: "docker daemon is offline" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_common_err

K8s Common Error Alert

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes cluster error event" event exists.

level : Error | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name

sls_app_ack_common_at_eviction

Cluster Eviction Event

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes eviction event" event exists.

* | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log where "eventId.reason" like '%Evict%' GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_gpu_xid_error

Cluster GPU XID Error Event

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes gpu xid error event" event exists.

eventId.reason : NodeHasNvidiaXidError | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_k8s_image_pull_fail

Failed to Pull Cluster Image

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes image pull back off event" event exists.

eventId.reason : Failed and eventId.message : ImagePullBackOff | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name

data matches the expression, cnt > 0

Custom Label: namespace, pod_name, and node_name

sls_app_ack_ingress_at_err_reload_nginx

Failed to Reload Ingress Configuration

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes ingress reload config error" event exists.

eventId.reason : RELOAD and eventId.message : "Error reloading NGINX" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name

data matches the expression, cnt > 0

Custom Label: namespace, pod_name, and node_name

sls_app_ack_common_at_k8s_no_ip

Cluster Node IP Resources Insufficient

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes ip not enough event" event exists.

InvalidVSwitchId.IpNotEnough or IpNotEnough | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_destr_node_fail

Node Pool NLC Failed to Destroy Nodes

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc destroy node failed" event exists.

eventId.reason : "NLC.Task.DestroyNode.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_drain_node_fail

Node Pool NLC Failed to Drain Nodes

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc drain node failed" event exists.

eventId.reason : "NLC.Task.DrainNode.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_emp_task_cmd

Node Pool NLC Empty Task Command

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc delete node failed: EmptyTaskCommand" event exists.

eventId.reason : "NLC.Task.EmptyTaskCommand" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_op_not_found

Node Pool NLC Task Operation Not Found

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc delete node failed: Task.Operation.NotFound" event exists.

eventId.reason : "NLC.Task.Operation.NotFound" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_repair_fail

Node Pool NLC Self-Repair Failed

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc self repair failed" event exists.

eventId.reason : "NLC.AutoRepairTask.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_reset_ecs_fail

Node Pool NLC Reset ECS Failed

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc reset ecs failed" event exists.

eventId.reason : "NLC.Task.ResetECS.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_restart_ecs_fail

Node Pool NLC Restart ECS Failed

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc restart ecs failed" event exists.

eventId.reason : "NLC.Task.RestartECS.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_restart_ecs_wait_fail

Node Pool NLC Restart ECS Wait Timeout

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc restart ecs wait timeout" event exists.

eventId.reason : "NLC.Task.RestartECS.WaitNodeReady.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_runcommand_fail

Node Pool NLC Run Command Failed

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc run command failed" event exists.

eventId.reason : "NLC.Task.RunCommand.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_nlc_at_url_mode_unimpl

Node Pool NLC Unimplemented Task Mode

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pool nlc delete node failed: Task.URL.Mode.Unimplemented" event exists.

eventId.reason : "NLC.Task.URL.Mode.Unimplemented" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_k8s_no_disk

Cluster Node Disk Space Insufficient

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node disk pressure event" event exists.

eventId.reason : NodeHasDiskPressure | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_node_down

Cluster Node Down

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node down event" event exists.

eventId.reason: NodeNotReady and eventId.message: "status is now: NodeNotReady" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_node_fd_pressure

Cluster Node Too Many File Handlers

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node fd pressure event" event exists.

eventId.reason : NodeHasFDPressure | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_node_pid_pressure

Cluster Node Too Many Processes

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pid pressure event" event exists.

eventId.reason : PIDPressure or eventId.reason : NodeHasPIDPressure | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_k8s_pleg_warn

Cluster Node PLEG Error

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node pleg error event" event exists.

eventId.message : "PLEG is not healthy" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_node_restart

Cluster Node Restart

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node restart event" event exists.

eventId.reason : NodeRebooted or eventId.reason : Rebooted | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_k8s_time_sync_err

Cluster Node NTP Down

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node ntp down" event exists.

eventId.reason : NTPIsDown | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_k8s_pod_start_fail

Cluster Pod Start Failed

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes pod start failed event" event exists.

eventId.reason : Failed and eventId.involvedObject.kind : Pod not eventId.message : ImagePullBackOff | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name

data matches the expression, cnt > 0

Custom Label: namespace, pod_name, and node_name

sls_app_ack_common_at_k8s_pod_oom

Cluster Pod OOM

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes pod oom event" event exists.

eventId.reason : PodOOMKilling | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name

data matches the expression, cnt > 0

Custom Label: namespace, pod_name, and node_name

sls_app_ack_common_at_k8s_ps_hung

Cluster Node Process Hang

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes ps process hang event" event exists.

eventId.reason : PSProcessIsHung | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_no_resource

Cluster Node Resource Insufficient

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes node resource insufficient" event exists.

eventId.reason : FailedScheduling and Insufficient | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name

data matches the expression, cnt > 0

Custom Label: namespace, pod_name, and node_name

sls_app_ack_si_at_conf_high_risk

High Risk Found By Config Audit

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes high risks have been found after running config audit" event exists.

eventId.reason : SecurityInspectorConfigAuditHighRiskFound | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name

data matches the expression, cnt > 0

Custom Label: namespace, pod_name, and node_name

sls_app_ack_terway_at_alloc_ip_fail

Terway Allocate IP Error

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes terway allocate ip error" event exists.

eventId.reason : AllocIPFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_terway_at_allocate_fail

Terway Allocate Resource Error

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes allocate resource error" event exists.

eventId.reason : AllocResourceFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_terway_at_config_check

Terway Trigger Pod IP Config Check

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes terway execute pod ip config check" event exists.

eventId.reason : ConfigCheck | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name

data matches the expression, cnt > 0

Custom Label: namespace, pod_name, and node_name

sls_app_ack_terway_at_dispose_fail

Terway Dispose Resource Error

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes dispose resource error" event exists.

eventId.reason : DisposeResourceFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_terway_at_invalid_resource

Terway Invalid Resource

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes terway have invalid resource" event exists.

eventId.reason : ResourceInvalid | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_terway_at_parse_fail

Failed to Parse Ingress Bandwidth Config

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes terway parse k8s.aliyun.com/ingress-bandwidth annotation error" event exists.

eventId.reason : ParseFailed and eventId.message : "Parse ingress bandwidth failed"| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_terway_at_vir_mode_change

Terway Virtual Mode Change

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes virtual mode changed" event exists.

eventId.reason : VirtualModeChanged | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name

data matches the expression, cnt > 0

Custom Label: namespace and node_name

sls_app_ack_common_at_common_warn

K8s Common Warn Alert

This rule checks data every 5 minutes. The trigger condition is that the "kubernetes cluster warn event" event exists.

level : Warning and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name

data matches the expression, cnt > 0

Custom Label: namespace, kind, and object_name
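
The rules above combine three settings: a query that groups events and counts them, a trigger condition of cnt > 0, and custom labels that identify each group. The following Python sketch emulates that combination locally to show how one alert per label group could result. It is a simplified model and not how Simple Log Service is implemented: it uses exact string equality on eventId.reason in place of the search syntax, the function name evaluate_rule is hypothetical, and the sample events are made up.

```python
from collections import defaultdict


def evaluate_rule(events, reason, label_fields):
    """Emulate one check of a rule over a batch of event records.

    events: list of dicts keyed by flattened field names such as
            "eventId.reason" (sample data, not real logs).
    reason: value that "eventId.reason" must equal (a simplification
            of the search clause before the pipe).
    label_fields: dict that maps a label alias to a log field.
    Returns one row per label group whose cnt is greater than 0.
    """
    groups = defaultdict(list)
    for event in events:
        if event.get("eventId.reason") != reason:
            continue
        # The GROUP by columns form the key of each result row.
        key = tuple(event.get(field) for field in label_fields.values())
        groups[key].append(event.get("eventId.message"))

    alerts = []
    for key, messages in groups.items():
        row = dict(zip(label_fields.keys(), key))
        row["message"] = messages      # counterpart of ARRAY_AGG
        row["cnt"] = len(messages)     # counterpart of COUNT(*)
        if row["cnt"] > 0:             # trigger condition: cnt > 0
            alerts.append(row)
    return alerts


# Made-up sample events for illustration only.
sample_events = [
    {"eventId.reason": "NodeHasDiskPressure", "eventId.message": "disk pressure",
     "eventId.metadata.namespace": "default", "eventId.involvedObject.name": "node-a"},
    {"eventId.reason": "NodeHasDiskPressure", "eventId.message": "disk pressure again",
     "eventId.metadata.namespace": "default", "eventId.involvedObject.name": "node-a"},
    {"eventId.reason": "NodeHasDiskPressure", "eventId.message": "disk pressure",
     "eventId.metadata.namespace": "default", "eventId.involvedObject.name": "node-b"},
    {"eventId.reason": "NTPIsDown", "eventId.message": "ntp down",
     "eventId.metadata.namespace": "default", "eventId.involvedObject.name": "node-a"},
]

labels = {
    "namespace": "eventId.metadata.namespace",
    "node_name": "eventId.involvedObject.name",
}
alerts = evaluate_rule(sample_events, "NodeHasDiskPressure", labels)
for alert in alerts:
    print(alert["namespace"], alert["node_name"], alert["cnt"])
```

Because every matching event lands in some group, each returned row has a cnt of at least 1. This is why a condition as simple as cnt > 0 is enough to express "the event exists" for each namespace and node combination.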