
Container Service for Kubernetes: Container Service alert management

Last Updated: Feb 28, 2026

When cluster issues go undetected, they can escalate into outages, resource exhaustion, or security incidents. ACK alert management monitors your cluster for anomalous events, resource utilization thresholds, and core component health, and notifies your team through configurable channels. You customize thresholds and notification targets through an AckAlertRule CustomResourceDefinition (CRD).

How it works

ACK alert management collects data from three sources, evaluates rules against that data, and generates alerts when conditions are met:

| Data source | What it monitors | Billing |
| --- | --- | --- |
| Simple Log Service (SLS) | Cluster events, including pod failures, node issues, scaling operations, and audit trails. See Default alert rule templates for the complete list. | Pay-by-feature |
| Managed Service for Prometheus | Core component health, including API server, etcd, kube-scheduler, CoreDNS, and Ingress. See Default alert rule templates for the complete list. | Free of charge |
| CloudMonitor | Resource metrics, including CPU, memory, disk, bandwidth, and SLB utilization. See Default alert rule templates for the complete list. | Pay-as-you-go |

Phone call and text message notifications incur additional fees.

When you enable alert management, ACK creates an AckAlertRule CRD in the kube-system namespace with default alert rule templates. An alert fires when a rule condition is met, and notifications are sent to the configured contact groups.

Prerequisites

Before you enable alert management, activate the required service for each data source you plan to use: Simple Log Service (SLS), Managed Service for Prometheus, and CloudMonitor.

Enable alert management for ACK managed clusters

Enable for an existing cluster

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alerts page, follow the on-screen instructions to install or upgrade the components.

  4. After the installation or upgrade is complete, go to the Alerts page to configure alert rules and contacts. For details on each tab, see Manage alerts in the console.

Enable during cluster creation

  1. On the Component Configurations page of the cluster creation wizard, select Use Default Alert Rule Template for Alerts, then select a contact group from Select Alert Contact Group.


  2. Complete the remaining cluster creation steps. For more information, see Create an ACK managed cluster.

After cluster creation, the system enables the default alert rules and sends notifications to the default alert contact group. To update contacts later, see Modify alert contacts or alert contact groups.

Enable alert management for ACK dedicated clusters

For ACK dedicated clusters, grant permissions to the worker RAM role before you enable alert rules.

Step 1: Grant permissions to the worker RAM role

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster and click its name. In the left-side pane, click Cluster Information.

  3. In the Cluster Resources section, copy the name of the Worker RAM Role and click the link to open the Resource Access Management (RAM) console.

  4. Create a custom policy with the following permissions. For more information, see Create a custom policy on the JSON tab.

    This policy grants broad permissions for simplicity. In a production environment, follow the principle of least privilege and grant only the required permissions.
       {
           "Action": [
               "log:*",
               "arms:*",
               "cms:*",
               "cs:UpdateContactGroup"
           ],
           "Resource": [
               "*"
           ],
           "Effect": "Allow"
       }
  5. Attach the custom policy to the worker RAM role. For more information, see Grant permissions to a RAM role.
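As a sketch of a tighter alternative to the wildcard policy above, you could scope each service to the operations the alert components actually perform. The action names below are illustrative assumptions, not an official minimal policy; verify them against the RAM action list for each service before use:

```json
{
    "Action": [
        "log:Get*",
        "log:List*",
        "arms:Describe*",
        "cms:Describe*",
        "cms:PutResourceMetricRule",
        "cs:UpdateContactGroup"
    ],
    "Resource": [
        "*"
    ],
    "Effect": "Allow"
}
```

If rule sync fails after narrowing the policy, check the alicloud-monitor-controller logs for the denied action and add it back.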

Step 2: Verify permissions

  1. In the left-side pane of the cluster management page, choose Workloads > Deployments.

  2. Set Namespace to kube-system and click the name of the alicloud-monitor-controller application.

  3. Click the Logs tab and check the pod logs to confirm that the authorization was successful.

Step 3: Enable default alert rules

  1. In the left-side pane, choose Operations > Alerts.

  2. On the Alerts page, configure alert rules and contacts. For details on each tab, see Manage alerts in the console.

Manage alerts in the console

The Alerts page has four tabs:

Alert Rules

  • Status: Toggle alert rule sets on or off.

  • Modify Contacts: Assign a contact group for alert notifications.

Only contact groups can be selected as notification targets. To notify a single person, create a group containing only that contact. Create contacts and contact groups before you configure alert rules.

Alert History

View the most recent 100 alert records from the last 24 hours.

  • Click the link in the Alert Rule column to view the detailed rule configuration in the corresponding monitoring system.

  • Click Details to locate the resource where the anomaly occurred.

  • Click Intelligent Analytics for AI-assisted root cause analysis and troubleshooting guidance.

Alert Contacts

Create, edit, or delete alert contacts. Supported notification methods:

| Method | Details |
| --- | --- |
| Phone / text message | Set a mobile number. Only verified mobile numbers can receive phone call notifications. For verification steps, see Verify a mobile number. |
| Email | Set an email address for the contact. |
| Chat robots | DingTalk Robot, WeCom Robot, and Lark Robot. For DingTalk robots, add the security keywords: Alerting, Dispatch. |
Before you configure email and robot notifications, verify them in the CloudMonitor console under Alert Service > Alert Contacts.

Alert Contact Groups

Create, edit, or delete contact groups. Contact groups are the only selectable notification targets when you Modify Contacts for an alert rule set.

If no contact group exists, the console creates a default group based on your Alibaba Cloud account information.

Customize alert rules

After you enable alert management, an AckAlertRule CRD resource named default is created in the kube-system namespace. Modify this CRD to customize alert thresholds, enable or disable individual rules, and set contact groups.

Console

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster and click its name. In the left-side pane, choose Operations > Alerts.

  3. On the Alert Rules tab, click Configure Alert Rule in the upper-right corner. Click YAML in the Actions column of the target rule to view the AckAlertRule configuration.

  4. Modify the YAML as needed. See AckAlertRule YAML reference for the full specification.

kubectl

Run the following command to edit the AckAlertRule resource directly:

kubectl edit ackalertrules default -n kube-system

Modify the YAML and save. See AckAlertRule YAML reference for the full specification.
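For example, to turn off a single event rule without deleting it, you could set its enable field to disable. The group and rule names below follow the default template shown in the YAML reference; treat the snippet as a sketch:

```yaml
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
  namespace: kube-system
spec:
  groups:
    - name: pod-exceptions
      rules:
        - name: pod-oom
          type: event
          expression: sls.app.ack.pod.oom
          enable: disable        # Turn this rule off; other rules keep their settings
```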

AckAlertRule YAML reference

The following example shows the CRD structure with two rule groups, one event-based and one metric-based:

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # Event-based alert rule group
    - name: pod-exceptions
      rules:
        - name: pod-oom
          type: event                              # Rule type: event, metric-cms, or metric-prometheus
          expression: sls.app.ack.pod.oom          # Maps to SLS_Event_ID in the default templates
          enable: enable                           # Valid values: enable, disable

        - name: pod-failed
          type: event
          expression: sls.app.ack.pod.failed
          enable: enable

    # Metric-based alert rule group (CloudMonitor)
    - name: res-exceptions
      rules:
        - name: node_cpu_util_high
          type: metric-cms                         # Rule type: event, metric-cms, or metric-prometheus
          expression: cms.host.cpu.utilization      # Maps to the CloudMonitor metric
          contactGroups:                            # Contact group (managed by ACK console, shared across clusters)
          enable: enable
          thresholds:
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '85'                          # CPU utilization threshold (default: 85%)
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: '3'                           # Alert triggers after 3 consecutive breaches
            - key: CMS_RULE_SILENCE_SEC
              value: '900'                         # 900-second silence period between repeat alerts

In this example, an alert fires when node CPU utilization exceeds 85% for three consecutive checks and the previous alert was triggered more than 900 seconds ago.

Threshold parameters

Threshold parameters apply to metric-cms type rules:

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| CMS_ESCALATIONS_CRITICAL_Threshold | Yes | The alert threshold. If omitted, the rule fails to sync and is disabled. Set unit to percent, count, or qps, and set value to the threshold number. | Depends on the template |
| CMS_ESCALATIONS_CRITICAL_Times | No | Number of consecutive threshold breaches required to trigger the alert. | 3 |
| CMS_RULE_SILENCE_SEC | No | Silence period in seconds after an alert fires. Subsequent alerts for the same rule are suppressed during this period to prevent alert fatigue. | 900 |
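Putting these parameters together: lowering the node memory rule to fire at 90% after two consecutive breaches, with a 30-minute silence period, might look like the following rule fragment. The values are illustrative, not recommendations:

```yaml
- name: node_mem_util_high
  type: metric-cms
  expression: cms.host.memory.utilization
  enable: enable
  thresholds:
    - key: CMS_ESCALATIONS_CRITICAL_Threshold
      unit: percent
      value: '90'          # Fire at 90% memory utilization (illustrative)
    - key: CMS_ESCALATIONS_CRITICAL_Times
      value: '2'           # After 2 consecutive breaches
    - key: CMS_RULE_SILENCE_SEC
      value: '1800'        # 30-minute silence between repeat alerts
```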

Default alert rule templates

Alert rules are synced from SLS, Managed Service for Prometheus, and CloudMonitor. On the Alerts page, click Advanced Settings in the Actions column to view each rule's configuration.

The following sections list all default rules grouped by category.

Error events

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Error event | Triggered by all Error-level anomalous events in the cluster. | SLS | event | error-event | sls.app.ack.error |

Warning events

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Warn event | Triggered by key Warn-level anomalous events, excluding some ignorable events. | SLS | event | warn-event | sls.app.ack.warn |

Core component anomalies (ACK managed clusters)

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| API server unavailable | The API server becomes unavailable, which may limit cluster management features. | Prometheus | metric-prometheus | apiserver-unhealthy | prom.apiserver.notHealthy.down |
| etcd unavailable | etcd unavailability affects the status of the entire cluster. | Prometheus | metric-prometheus | etcd-unhealthy | prom.etcd.notHealthy.down |
| kube-scheduler unavailable | The scheduler is unavailable. New pods may fail to start. | Prometheus | metric-prometheus | scheduler-unhealthy | prom.scheduler.notHealthy.down |
| kube-controller-manager (KCM) unavailable | Control loop anomalies affect automatic repair and resource adjustment. | Prometheus | metric-prometheus | kcm-unhealthy | prom.kcm.notHealthy.down |
| cloud-controller-manager unavailable | Lifecycle management of cloud service components may be affected. | Prometheus | metric-prometheus | ccm-unhealthy | prom.ccm.notHealthy.down |
| CoreDNS request drops to zero | CoreDNS anomalies affect service discovery and domain name resolution. | Prometheus | metric-prometheus | coredns-unhealthy-requestdown | prom.coredns.notHealthy.requestdown |
| CoreDNS panic error | A panic error occurs in CoreDNS. Analyze the logs immediately. | Prometheus | metric-prometheus | coredns-unhealthy-panic | prom.coredns.notHealthy.panic |
| High Ingress error rate | High HTTPS error rate from the Ingress controller may affect service accessibility. | Prometheus | metric-prometheus | ingress-err-request | prom.ingress.request.errorRateHigh |
| Ingress SSL certificate expiring | An expired SSL certificate causes HTTPS requests to fail. Renew in advance. | Prometheus | metric-prometheus | ingress-ssl-expire | prom.ingress.ssl.expire |
| Pending pods > 1,000 | Too many pods in Pending state. May indicate insufficient resources or scheduling issues. | Prometheus | metric-prometheus | pod-pending-accumulate | prom.pod.pending.accumulate |
| High mutating admission webhook RT | Slow mutating admission webhook affects resource creation and modification. | Prometheus | metric-prometheus | apiserver-admit-rt-high | prom.apiserver.mutating.webhook.rt.high |
| High validating admission webhook RT | Slow validating admission webhook may delay configuration changes. | Prometheus | metric-prometheus | apiserver-validate-rt-high | prom.apiserver.validation.webhook.rt.high |
| Control plane OOM | An out-of-memory (OOM) error occurs in a core cluster component. Investigate immediately. | SLS | event | ack-controlplane-oom | sls.app.ack.controlplane.pod.oom |

Node pool operations

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Node auto-healing fails | Identify the cause and fix the issue to maintain high availability. | SLS | event | node-repair_failed | sls.app.ack.rc.node_repair_failed |
| Node CVE fix fails | Cluster security may be affected. Assess and fix urgently. | SLS | event | nodepool-cve-fix-failed | sls.app.ack.rc.node_vulnerability_fix_failed |
| Node pool CVE fix succeeds | A CVE fix was applied successfully, reducing security risks. | SLS | event | nodepool-cve-fix-succ | sls.app.ack.rc.node_vulnerability_fix_succeed |
| Node pool CVE auto-fix skipped | Auto-fix was skipped, possibly due to compatibility issues. Check the security policy. | SLS | event | nodepool-cve-fix-skip | sls.app.ack.rc.node_vulnerability_fix_skipped |
| kubelet config update fails | The kubelet configuration fails to update, which may affect node performance. | SLS | event | nodepool-kubelet-cfg-failed | sls.app.ack.rc.node_kubelet_config_failed |
| kubelet config update succeeds | Confirm the new kubelet configuration takes effect. | SLS | event | nodepool-kubelet-config-succ | sls.app.ack.rc.node_kubelet_config_succeed |
| kubelet upgrade fails | May affect cluster stability. Confirm the upgrade process and configuration. | SLS | event | nodepool-k-c-upgrade-failed | sls.app.ack.rc.node_kubelet_config_upgrade_failed |
| kubelet upgrade succeeds | Verify the kubelet version meets cluster and application requirements. | SLS | event | nodepool-k-c-upgrade-succ | sls.app.ack.rc.kubelet_upgrade_succeed |
| Runtime upgrade succeeds | The container runtime in the node pool was upgraded successfully. | SLS | event | nodepool-runtime-upgrade-succ | sls.app.ack.rc.runtime_upgrade_succeed |
| Runtime upgrade fails | The container runtime in the node pool failed to upgrade. | SLS | event | nodepool-runtime-upgrade-fail | sls.app.ack.rc.runtime_upgrade_failed |
| OS image upgrade succeeds | The operating system image in the node pool was upgraded successfully. | SLS | event | nodepool-os-upgrade-succ | sls.app.ack.rc.os_image_upgrade_succeed |
| OS image upgrade fails | The operating system image in the node pool failed to upgrade. | SLS | event | nodepool-os-upgrade-failed | sls.app.ack.rc.os_image_upgrade_failed |
| Lingjun pool config change succeeds | The Node Lingjun pool configuration was changed successfully. | SLS | event | nodepool-lingjun-config-succ | sls.app.ack.rc.lingjun_configuration_apply_succeed |
| Lingjun pool config change fails | The Node Lingjun pool configuration failed to change. | SLS | event | nodepool-lingjun-cfg-failed | sls.app.ack.rc.lingjun_configuration_apply_failed |

Node anomalies

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Docker process anomaly | The Dockerd or containerd runtime on a cluster node is abnormal. | SLS | event | docker-hang | sls.app.ack.docker.hang |
| Eviction event | An eviction event occurs in the cluster. | SLS | event | eviction-event | sls.app.ack.eviction |
| GPU XID error | An anomalous GPU XID event occurs in the cluster. | SLS | event | gpu-xid-error | sls.app.ack.gpu.xid_error |
| Node goes offline | A node in the cluster goes offline. | SLS | event | node-down | sls.app.ack.node.down |
| Node restarts | A node in the cluster restarts. | SLS | event | node-restart | sls.app.ack.node.restart |
| NTP service anomaly | The time synchronization service on a cluster node is abnormal. | SLS | event | node-ntp-down | sls.app.ack.ntp.down |
| PLEG anomaly | The Pod Lifecycle Event Generator (PLEG) on a cluster node is abnormal. | SLS | event | node-pleg-error | sls.app.ack.node.pleg_error |
| Process anomaly | The number of processes on a cluster node is abnormal. | SLS | event | ps-hang | sls.app.ack.ps.hang |
| Too many file handles | The number of file handles on the node is too large. | SLS | event | node-fd-pressure | sls.app.ack.node.fd_pressure |
| Too many processes | The number of processes on the cluster node is too large. | SLS | event | node-pid-pressure | sls.app.ack.node.pid_pressure |
| Failed to delete a node | The cluster failed to delete a node. | SLS | event | node-del-err | sls.app.ack.ccm.del_node_failed |
| Failed to add a node | The cluster failed to add a node. | SLS | event | node-add-err | sls.app.ack.ccm.add_node_failed |
| Managed node pool command execution fails | Command execution failed in a managed node pool. | SLS | event | nlc-run-cmd-err | sls.app.ack.nlc.run_command_fail |
| Empty task command in managed node pool | No specific command is provided for the task in the managed node pool. | SLS | event | nlc-empty-cmd | sls.app.ack.nlc.empty_task_cmd |
| Unimplemented task mode in managed node pool | An unimplemented task mode occurs in the managed node pool. | SLS | event | nlc-url-m-unimp | sls.app.ack.nlc.url_mode_unimpl |
| Unknown repair operation in managed node pool | An unknown repair operation occurs in the managed node pool. | SLS | event | nlc-opt-no-found | sls.app.ack.nlc.op_not_found |
| Error destroying managed node pool node | An error occurred while destroying a node in the managed node pool. | SLS | event | nlc-des-node-err | sls.app.ack.nlc.destroy_node_fail |
| Failed to drain managed node pool node | Anomalous draining event in a managed node pool. | SLS | event | nlc-drain-node-err | sls.app.ack.nlc.drain_node_fail |
| Restarted ECS instance not reaching desired state | A restarted ECS instance in a managed node pool does not reach the desired state. | SLS | event | nlc-restart-ecs-wait | sls.app.ack.nlc.restart_ecs_wait_fail |
| Failed to restart ECS instance in managed node pool | An ECS instance in a managed node pool failed to restart. | SLS | event | nlc-restart-ecs-err | sls.app.ack.nlc.restart_ecs_fail |
| Failed to reset ECS instance in managed node pool | An ECS instance in a managed node pool failed to reset. | SLS | event | nlc-reset-ecs-err | sls.app.ack.nlc.reset_ecs_fail |
| Self-healing task fails in managed node pool | A self-healing task failed in a managed node pool. | SLS | event | nlc-sel-repair-err | sls.app.ack.nlc.repair_fail |

Resource utilization

| Alert | Default threshold | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- | --- |
| Node CPU utilization high | >= 85% | Remaining resources below 15% may exceed the container engine's CPU reservation and cause CPU throttling. See Node resource reservation policy. | CloudMonitor | metric-cms | node_cpu_util_high | cms.host.cpu.utilization |
| Node memory utilization high | >= 85% | Remaining resources below 15% will exceed the container engine's memory reservation, triggering kubelet forced eviction. See Node resource reservation policy. | CloudMonitor | metric-cms | node_mem_util_high | cms.host.memory.utilization |
| Node disk usage high | >= 85% | Disk usage on a node instance exceeds the threshold. | CloudMonitor | metric-cms | node_disk_util_high | cms.host.disk.utilization |
| Node outbound bandwidth high | >= 85% | Outbound Internet bandwidth of a node instance exceeds the threshold. | CloudMonitor | metric-cms | node_public_net_util_high | cms.host.public.network.utilization |
| Node inode usage high | >= 85% | Inode usage on a node instance exceeds the threshold. | CloudMonitor | metric-cms | node_fs_inode_util_high | cms.host.fs.inode.utilization |
| SLB Layer 7 QPS high | >= 85% | QPS of an SLB instance (API server or Ingress-associated) exceeds the threshold. | CloudMonitor | metric-cms | slb_qps_util_high | cms.slb.qps.utilization |
| SLB outbound bandwidth high | >= 85% | Outbound bandwidth of an SLB instance (API server or Ingress-associated) exceeds the threshold. | CloudMonitor | metric-cms | slb_traff_tx_util_high | cms.slb.traffic.tx.utilization |
| SLB max connections high | >= 85% | Maximum connection usage of an SLB instance (API server or Ingress-associated) exceeds the threshold. | CloudMonitor | metric-cms | slb_max_con_util_high | cms.slb.max.connection.utilization |
| SLB dropped connections | >= 1/sec | Dropped connections per second for an SLB listener (API server or Ingress-associated) continuously exceeds the threshold. | CloudMonitor | metric-cms | slb_drop_con_high | cms.slb.drop.connection |
| Insufficient disk space | -- | Insufficient disk space on a node. | SLS | event | node-disk-pressure | sls.app.ack.node.disk_pressure |
| Insufficient scheduling resources | -- | No available scheduling resources in the cluster. | SLS | event | node-res-insufficient | sls.app.ack.resource.insufficient |
| Insufficient IP resources | -- | Insufficient IP resources in the cluster. | SLS | event | node-ip-pressure | sls.app.ack.ip.not_enough |
| Disk usage exceeds threshold | -- | Cluster disk usage exceeds the threshold. Check disk usage. | SLS | event | disk_space_press | sls.app.ack.csi.no_enough_disk_space |

Control plane operations

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Cluster task notification | Records control plane plans and changes. | SLS | event | ack-system-event-info | sls.app.ack.system_events.task.info |
| Cluster task failure | A cluster operation failed. Investigate promptly. | SLS | event | ack-system-event-error | sls.app.ack.system_events.task.error |

Auto scaling events

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Scale-out | Nodes are scaled out to handle increased load. | SLS | event | autoscaler-scaleup | sls.app.ack.autoscaler.scaleup_group |
| Scale-in | Nodes are scaled in as load decreases. | SLS | event | autoscaler-scaledown | sls.app.ack.autoscaler.scaledown |
| Scale-out timeout | May indicate insufficient resources or an improper policy. | SLS | event | autoscaler-scaleup-timeout | sls.app.ack.autoscaler.scaleup_timeout |
| Scale-in of empty nodes | Inactive nodes are cleaned up to optimize resource usage. | SLS | event | autoscaler-scaledown-empty | sls.app.ack.autoscaler.scaledown_empty |
| Scale-out fails | Analyze the cause and adjust the resource policy. | SLS | event | autoscaler-up-group-failed | sls.app.ack.autoscaler.scaleup_group_failed |
| Unhealthy cluster (auto scaling) | Unhealthy cluster status due to auto scaling. Handle promptly. | SLS | event | autoscaler-cluster-unhealthy | sls.app.ack.autoscaler.cluster_unhealthy |
| Deletion of nodes that fail to start | Invalid nodes are cleaned up to reclaim resources. | SLS | event | autoscaler-del-started | sls.app.ack.autoscaler.delete_started_timeout |
| Deletion of unregistered nodes | Redundant nodes are processed to optimize cluster resources. | SLS | event | autoscaler-del-unregistered | sls.app.ack.autoscaler.delete_unregistered |
| Scale-in fails | May lead to resource waste and uneven load distribution. | SLS | event | autoscaler-scale-down-failed | sls.app.ack.autoscaler.scaledown_failed |
| Node deleted before drain | When auto scaling deletes a node, pods fail to be evicted or migrated. | SLS | event | autoscaler-instance-expired | sls.app.ack.autoscaler.instance_expired |

Workload anomalies

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Job fails to run | A Job fails during execution. | Prometheus | metric-prometheus | job-failed | prom.job.failed |
| Deployment replica anomaly | Insufficient available replicas in a Deployment may cause full or partial service outage. | Prometheus | metric-prometheus | deployment-rep-err | prom.deployment.replicaError |
| DaemonSet status anomaly | Some DaemonSet replicas are in an abnormal state (failing to start or crashing). | Prometheus | metric-prometheus | daemonset-status-err | prom.daemonset.scheduledError |
| DaemonSet scheduling anomaly | A DaemonSet fails to be scheduled to some or all nodes, possibly due to resource constraints. | Prometheus | metric-prometheus | daemonset-misscheduled | prom.daemonset.misscheduled |

Pod anomalies

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Pod OOM | An out-of-memory (OOM) error occurs in a pod or a process within it. | SLS | event | pod-oom | sls.app.ack.pod.oom |
| Pod fails to start | A pod in the cluster fails to start. | SLS | event | pod-failed | sls.app.ack.pod.failed |
| Unhealthy pod status | A pod is in an unhealthy state (Pending, Failed, or Unknown). | Prometheus | metric-prometheus | pod-status-err | prom.pod.status.notHealthy |
| Pod CrashLoopBackOff | A pod frequently fails to start and enters the CrashLoopBackOff state. | Prometheus | metric-prometheus | pod-crashloop | prom.pod.status.crashLooping |

Storage anomalies

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Disk capacity < 20 GiB | Disks smaller than 20 GiB cannot be attached. Check the disk capacity. | SLS | event | csi_invalid_size | sls.app.ack.csi.invalid_disk_size |
| Subscription disk not supported | Subscription disks cannot be used as container volumes. Check the billing method. | SLS | event | csi_not_portable | sls.app.ack.csi.disk_not_portable |
| Failed to unmount (device busy) | The resource has not been fully released, or an active process is accessing the mount target. | SLS | event | csi_device_busy | sls.app.ack.csi.deivce_busy |
| No available disks | No available disks can be attached to the cluster storage. | SLS | event | csi_no_ava_disk | sls.app.ack.csi.no_ava_disk |
| Disk IOHang | An IOHang anomaly occurs in the cluster. | SLS | event | csi_disk_iohang | sls.app.ack.csi.disk_iohang |
| Slow I/O on PVC-bound disk | A slow I/O anomaly occurs on the PVC bound to the cluster disk. | SLS | event | csi_latency_high | sls.app.ack.csi.latency_too_high |
| PersistentVolume (PV) failed | An anomaly occurs on a PV in the cluster. | Prometheus | metric-prometheus | pv-failed | prom.pv.failed |

Network anomalies

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Multiple VPC route tables | May complicate network configuration or cause route conflicts. | SLS | event | ccm-vpc-multi-route-err | sls.app.ack.ccm.describe_route_tables_failed |
| No available SLB instances | The cluster cannot create an SLB instance. | SLS | event | slb-no-ava | sls.app.ack.ccm.no_ava_slb |
| SLB sync failure | The cluster failed to sync the created SLB instance. | SLS | event | slb-sync-err | sls.app.ack.ccm.sync_slb_failed |
| SLB deletion failure | The cluster failed to delete the SLB instance. | SLS | event | slb-del-err | sls.app.ack.ccm.del_slb_failed |
| Route creation failure | The cluster failed to create a VPC network route. | SLS | event | route-create-err | sls.app.ack.ccm.create_route_failed |
| Route sync failure | The cluster failed to sync a VPC network route. | SLS | event | route-sync-err | sls.app.ack.ccm.sync_route_failed |
| Invalid Terway resource | An invalid Terway network resource in the cluster. | SLS | event | terway-invalid-res | sls.app.ack.terway.invalid_resource |
| Terway IP allocation failure | Terway failed to assign an IP address. | SLS | event | terway-alloc-ip-err | sls.app.ack.terway.alloc_ip_fail |
| Ingress bandwidth config parse failure | A configuration parsing error for the cluster Ingress network. | SLS | event | terway-parse-err | sls.app.ack.terway.parse_fail |
| Terway resource allocation failure | Terway network resources failed to be allocated. | SLS | event | terway-alloc-res-err | sls.app.ack.terway.allocate_failure |
| Terway resource reclaim failure | Terway network resources failed to be reclaimed. | SLS | event | terway-dispose-err | sls.app.ack.terway.dispose_failure |
| Terway virtual mode change | A change in the virtual mode of the cluster Terway network. | SLS | event | terway-virt-mod-err | sls.app.ack.terway.virtual_mode_change |
| Terway pod IP config check | Terway triggered a pod IP configuration check. | SLS | event | terway-ip-check | sls.app.ack.terway.config_check |
| Ingress reload failure | The cluster Ingress configuration failed to reload. Check if the Ingress configuration is correct. | SLS | event | ingress-reload-err | sls.app.ack.ingress.err_reload_nginx |

Audit operations

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Container exec / command execution | A user logs on to a container or runs a command in the cluster. Track for security auditing. | SLS | event | audit-at-command | sls.app.k8s.audit.at.command |
| Node scheduling status change | Node scheduling status changed, affecting service efficiency and resource load. | SLS | event | audit-cordon-switch | sls.app.k8s.audit.at.cordon.uncordon |
| Resource deletion | A resource was deleted. Audit the operation to prevent risks. | SLS | event | audit-resource-delete | sls.app.k8s.audit.at.delete |
| Node drain / eviction | Reflects node load pressure or policy execution. Confirm necessity and impact. | SLS | event | audit-drain-eviction | sls.app.k8s.audit.at.drain.eviction |
| Internet logon | Logging on from the Internet may pose security risks. Verify logon and access permissions. | SLS | event | audit-internet-login | sls.app.k8s.audit.at.internet.login |
| Node label update | Label updates affect node resource management. Verify correctness. | SLS | event | audit-node-label-update | sls.app.k8s.audit.at.label |
| Node taint update | Taint changes affect scheduling policies and toleration mechanisms. | SLS | event | audit-node-taint-update | sls.app.k8s.audit.at.taint |
| Resource modification | Real-time resource modifications may indicate policy adjustments. Verify alignment with business objectives. | SLS | event | audit-resource-update | sls.app.k8s.audit.at.update |

Security anomalies

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| High-risk configuration found | A cluster security inspection has found a high-risk configuration. | SLS | event | si-c-a-risk | sls.app.ack.si.config_audit_high_risk |

Cluster inspection anomalies

| Alert | Description | Source | Rule type | CRD rule name | Event ID |
| --- | --- | --- | --- | --- | --- |
| Inspection finds anomaly | The automatic inspection captured a potential anomaly. Analyze the specific issue. | SLS | event | cis-sched-failed | sls.app.ack.cis.schedule_task_failed |

Troubleshooting

Pod eviction triggered by disk pressure

Alert message:

(combined from similar events): Failed to garbage collect required amount of images.
Attempted to free XXXX bytes, but only found 0 bytes eligible to free

Symptoms: The pod status is Evicted. The node reports disk pressure (The node had condition: [DiskPressure].).

Cause: When node disk usage reaches the eviction threshold (default: 85%), the kubelet performs pressure-based eviction and garbage collection to reclaim unused image files. Run df -h on the target node to check disk usage.

Solution:

  1. Log on to the target node (containerd runtime) and remove unused container images:

       crictl rmi --prune
  2. Clean up logs on the node, or resize the node disk to add capacity.

  3. Adjust the kubelet image garbage collection and eviction thresholds.
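If you manage kubelet configuration through the node pool, the relevant KubeletConfiguration fields are the image garbage collection and eviction thresholds. The values below are the upstream Kubernetes defaults, shown for illustration rather than as a recommendation:

```yaml
# KubeletConfiguration fragment (illustrative values)
imageGCHighThresholdPercent: 85   # Start image GC when disk usage exceeds 85%
imageGCLowThresholdPercent: 80    # Stop image GC once usage drops below 80%
evictionHard:
  nodefs.available: "10%"         # Evict pods when available disk falls below 10%
```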

Best practices:

  • For nodes that frequently encounter this issue, assess the actual storage needs and plan resource requests and node disk capacity accordingly.

  • Regularly monitor storage usage with the Node Storage Dashboard to identify and address potential issues early.

Pod OOMKilling

Alert message:

pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx

Symptoms: The pod status is abnormal, and the event details contain PodOOMKilling.

Cause: OOM events can occur at two levels:

  • Container cgroup-level OOM: The actual memory usage of a pod exceeds its memory limits. The Kubernetes cgroup forcibly terminates the pod.

  • Node-level OOM: Too many pods without resource limits (requests/limits) are running on a node, or non-Kubernetes processes consume excessive memory.

Diagnosis: Log on to the target node and run:

dmesg -T | grep -i "memory"

If the output contains out_of_memory, an OOM event occurred. If the log also contains Memory cgroup, it is a container cgroup-level OOM. Otherwise, it is a node-level OOM.

Solution:

For container cgroup-level OOM, check whether the pod's memory limits match its actual usage and adjust them or fix the memory leak. For node-level OOM, set resource requests and limits on pods to control workload density on the node. For more information, see Causes and solutions for OOM Killer.

Pod CrashLoopBackOff

When a process in a pod exits unexpectedly, the container is restarted according to the pod's restart policy. If the pod still fails to reach the desired state after repeated restarts, its status changes to CrashLoopBackOff.

Diagnosis:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster and click its name. In the left-side pane, choose Workloads > Pods.

  3. Find the abnormal pod and click View Details in the Actions column.

  4. Check the Events section for abnormal event descriptions.

  5. Click the Logs tab to view process-level errors.

    If the pod has been restarted, select Show the log of the last container exit to view the logs of the previous container.
    The console displays a maximum of 500 recent log entries. To view more historical logs, set up a log persistence solution for unified collection and storage.
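The event and log checks above can also be run from the command line. The namespace and pod name below are placeholders; substitute your own:

```shell
# Recent events and status for the pod (placeholder names)
kubectl describe pod my-pod -n my-namespace

# Logs of the current container
kubectl logs my-pod -n my-namespace

# Logs of the previous container instance, after a restart
kubectl logs my-pod -n my-namespace --previous
```

The --previous flag is the CLI equivalent of "Show the log of the last container exit" in the console.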