When cluster issues go undetected, they can escalate into outages, resource exhaustion, or security incidents. ACK alert management monitors your cluster for anomalous events, resource utilization thresholds, and core component health, and notifies your team through configurable channels. You customize thresholds and notification targets through an AckAlertRule CustomResourceDefinition (CRD).
How it works
ACK alert management collects data from three sources, evaluates rules against that data, and generates alerts when conditions are met:
| Data source | What it monitors | Billing |
|---|---|---|
| Simple Log Service (SLS) | Cluster events, including pod failures, node issues, scaling operations, and audit trails. See Default alert rule templates for the complete list. | Pay-by-feature |
| Managed Service for Prometheus | Core component health, including API server, etcd, kube-scheduler, CoreDNS, and Ingress. See Default alert rule templates for the complete list. | Free of charge |
| CloudMonitor | Resource metrics, including CPU, memory, disk, bandwidth, and SLB utilization. See Default alert rule templates for the complete list. | Pay-as-you-go |
Phone call and text message notifications incur additional fees.
When you enable alert management, ACK creates an AckAlertRule CRD in the kube-system namespace with default alert rule templates. An alert fires when a rule condition is met, and notifications are sent to the configured contact groups.
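For orientation, the resource has roughly the following shape. This is a sketch only: the apiVersion and the group/rule field names are assumptions, and the AckAlertRule YAML reference is authoritative.

```yaml
# Sketch of the default AckAlertRule resource in kube-system.
# apiVersion and the group/rule field names are assumptions,
# not the confirmed schema.
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
  namespace: kube-system
spec:
  groups:
    - name: pod-events          # one group per rule set shown in the console
      rules:
        - name: pod-oom         # an SLS event-based rule from the default template
          type: event
          enable: enable
```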
Prerequisites
Before you enable alert management, activate the required services for each data source:
SLS event monitoring -- Enable event monitoring. Event monitoring is enabled by default when you enable alert management.
Prometheus monitoring -- Configure Prometheus monitoring for your cluster.
CloudMonitor -- Enable the Cloud Monitor feature for your cluster.
Enable alert management for ACK managed clusters
Enable for an existing cluster
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left-side pane, choose Operations > Alerts.
On the Alerts page, follow the on-screen instructions to install or upgrade the components.
After the installation or upgrade is complete, configure alert rules and contacts on the Alerts page. For details on each tab, see Manage alerts in the console.
Enable during cluster creation
On the Component Configurations page of the cluster creation wizard, select Use Default Alert Rule Template for Alerts, then select a contact group from Select Alert Contact Group.

Complete the remaining cluster creation steps. For more information, see Create an ACK managed cluster.
After cluster creation, the system enables the default alert rules and sends notifications to the default alert contact group. To update contacts later, see Modify alert contacts or alert contact groups.
Enable alert management for ACK dedicated clusters
For ACK dedicated clusters, grant permissions to the worker RAM role before you enable alert rules.
Step 1: Grant permissions to the worker RAM role
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left-side pane, click Cluster Information.
In the Cluster Resources section, copy the name of the Worker RAM Role and click the link to open the Resource Access Management (RAM) console.
Create a custom policy with the following permissions. For more information, see Create a custom policy on the JSON tab.
This policy grants broad permissions for simplicity. In a production environment, follow the principle of least privilege and grant only the required permissions.
```json
{
  "Action": [
    "log:*",
    "arms:*",
    "cms:*",
    "cs:UpdateContactGroup"
  ],
  "Resource": ["*"],
  "Effect": "Allow"
}
```

Attach the custom policy to the worker RAM role. For more information, see Grant permissions to a RAM role.
Step 2: Verify permissions
In the left-side pane of the cluster management page, choose Workloads > Deployments.
Set Namespace to `kube-system` and click the name of the `alicloud-monitor-controller` application.
Click the Logs tab. The pod logs indicate that the authorization was successful.
Step 3: Enable default alert rules
In the left-side pane, choose Operations > Alerts.
On the Alerts page, configure alert rules and contacts. For details on each tab, see Manage alerts in the console.
Manage alerts in the console
The Alerts page has four tabs:
Alert Rules
Status: Toggle alert rule sets on or off.
Modify Contacts: Assign a contact group for alert notifications.
Only contact groups can be selected as notification targets. To notify a single person, create a group containing only that contact. Create contacts and contact groups before you configure alert rules.
Alert History
View the most recent 100 alert records from the last 24 hours.
Click the link in the Alert Rule column to view the detailed rule configuration in the corresponding monitoring system.
Click Details to locate the resource where the anomaly occurred.
Click Intelligent Analytics for AI-assisted root cause analysis and troubleshooting guidance.
Alert Contacts
Create, edit, or delete alert contacts. Supported notification methods:
| Method | Details |
|---|---|
| Phone / text message | Set a mobile number. Only verified mobile numbers can receive phone call notifications. For verification steps, see Verify a mobile number. |
| Email | Set an email address for the contact. |
| Chat robots | DingTalk Robot, WeCom Robot, and Lark Robot. |
For DingTalk robots, add the security keywords: Alerting, Dispatch.
Before you configure email and robot notifications, verify them in the CloudMonitor console under Alert Service > Alert Contacts.
Alert Contact Groups
Create, edit, or delete contact groups. Contact groups are the only selectable notification targets when you Modify Contacts for an alert rule set.
If no contact group exists, the console creates a default group based on your Alibaba Cloud account information.
Customize alert rules
After you enable alert management, an AckAlertRule CRD resource named default is created in the kube-system namespace. Modify this CRD to customize alert thresholds, enable or disable individual rules, and set contact groups.
Console
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left-side pane, choose Operations > Alerts.
On the Alert Rules tab, click Configure Alert Rule in the upper-right corner. Click YAML in the Actions column of the target rule to view the AckAlertRule configuration.
Modify the YAML as needed. See AckAlertRule YAML reference for the full specification.
kubectl
Run the following command to edit the AckAlertRule resource directly:

```shell
kubectl edit ackalertrules default -n kube-system
```

Modify the YAML and save. See AckAlertRule YAML reference for the full specification.
Threshold parameters
Threshold parameters apply to metric-cms type rules:
| Parameter | Required | Description | Default |
|---|---|---|---|
| `CMS_ESCALATIONS_CRITICAL_Threshold` | Yes | The alert threshold. If omitted, the rule fails to sync and is disabled. Set unit to percent, count, or qps, and set value to the threshold number. | Depends on the template |
| `CMS_ESCALATIONS_CRITICAL_Times` | No | Number of consecutive threshold breaches required to trigger the alert. | 3 |
| `CMS_RULE_SILENCE_SEC` | No | Silence period in seconds after an alert fires. Subsequent alerts for the same rule are suppressed during this period to prevent alert fatigue. | 900 |
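Combined, these parameters appear as threshold entries in a rule definition. The following is a minimal sketch; only the `CMS_*` parameter names come from the table above, and the surrounding group/rule/threshold field names are assumptions, so check the AckAlertRule YAML reference before applying anything like it.

```yaml
# Illustrative metric-cms rule fragment. The structure is an
# assumption; only the CMS_* keys come from the parameter table.
spec:
  groups:
    - name: node-resource-alerts        # hypothetical group name
      rules:
        - name: node_disk_util_high     # default rule also mentioned in Troubleshooting
          type: metric-cms
          enable: enable
          thresholds:
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: "85"
            - key: CMS_ESCALATIONS_CRITICAL_Times   # optional, default 3
              value: "5"
            - key: CMS_RULE_SILENCE_SEC             # optional, default 900
              value: "1800"
```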
Default alert rule templates
Alert rules are synced from SLS, Managed Service for Prometheus, and CloudMonitor. On the Alerts page, click Advanced Settings in the Actions column to view each rule's configuration.
The following sections list all default rules grouped by category.
Troubleshooting
Pod eviction triggered by disk pressure
Alert message:

```
(combined from similar events): Failed to garbage collect required amount of images. Attempted to free XXXX bytes, but only found 0 bytes eligible to free
```

Symptoms: The pod status is Evicted. The node reports disk pressure (The node had condition: [DiskPressure].).
Cause: When node disk usage reaches the eviction threshold (default: 85%), the kubelet performs pressure-based eviction and garbage collection to reclaim unused image files. Run df -h on the target node to check disk usage.
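The threshold check described above can be sketched as a small shell helper. The function name and output strings are ours; the 85% default comes from the eviction threshold mentioned in the cause, and the `df` usage shown in the comment is one way to obtain the percentage.

```shell
# Warn when a filesystem's usage crosses the kubelet eviction
# threshold (default 85%, per the cause description above).
check_disk_pressure() {
  used_pct="$1"           # integer percentage, e.g. 87
  threshold="${2:-85}"    # override to match a customized threshold
  if [ "$used_pct" -ge "$threshold" ]; then
    echo "disk pressure: ${used_pct}% >= ${threshold}% (eviction likely)"
  else
    echo "ok: ${used_pct}% < ${threshold}%"
  fi
}

check_disk_pressure 87
# -> disk pressure: 87% >= 85% (eviction likely)

# Usage on a node: feed in the usage of the kubelet root filesystem.
#   pct=$(df --output=pcent /var/lib/kubelet | tail -1 | tr -dc '0-9')
#   check_disk_pressure "$pct"
```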
Solution:
Log on to the target node (containerd runtime) and remove unused container images:

```shell
crictl rmi --prune
```

Clean up logs or resize the node disk:
Create a snapshot backup of the data disk or system disk, then delete files or folders that are no longer needed. For more information, see Resolve full disk space issues on Linux instances.
Scale out the system disk or data disk of the target node to increase storage capacity.
Adjust thresholds:
Adjust the kubelet image garbage collection threshold to reduce pod evictions. See Customize kubelet configurations for a node pool.
Modify the alert threshold in the `node_disk_util_high` alert rule in the YAML configuration. See Customize alert rules.
Best practices:
For nodes that frequently encounter this issue, assess the actual storage needs and plan resource requests and node disk capacity accordingly.
Regularly monitor storage usage with the Node Storage Dashboard to identify and address potential issues early.
Pod OOMKilling
Alert message:

```
pod was OOM killed. node:xxx pod:xxx namespace:xxx uuid:xxx
```

Symptoms: The pod status is abnormal, and the event details contain PodOOMKilling.
Cause: OOM events can occur at two levels:
Container cgroup-level OOM: The actual memory usage of a pod exceeds its memory limits, and the kernel OOM killer terminates a process in the pod's cgroup.
Node-level OOM: Too many pods without resource limits (requests/limits) are running on a node, or non-Kubernetes processes consume excessive memory.
Diagnosis: Log on to the target node and run:

```shell
dmesg -T | grep -i "memory"
```

If the output contains out_of_memory, an OOM event occurred. If the log also contains Memory cgroup, it is a container cgroup-level OOM. Otherwise, it is a node-level OOM.
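The classification logic above can be sketched as a helper that inspects the filtered log text. The function name and output strings are ours, and the sample lines are illustrative, not guaranteed kernel output; the two markers (out_of_memory, Memory cgroup) come from the diagnosis description.

```shell
# Classify the OOM type from kernel log text on stdin, using the
# markers described above: "out_of_memory" means an OOM occurred;
# "Memory cgroup" indicates it was container cgroup-level.
classify_oom() {
  log=$(cat)
  case "$log" in
    *out_of_memory*)
      case "$log" in
        *"Memory cgroup"*) echo "container cgroup-level OOM" ;;
        *)                 echo "node-level OOM" ;;
      esac ;;
    *) echo "no OOM event found" ;;
  esac
}

# Illustrative log lines; a real log would come from:
#   dmesg -T | grep -i "memory" | classify_oom
printf '%s\n' 'out_of_memory+0x1a/0x2b' 'Memory cgroup out of memory' | classify_oom
# -> container cgroup-level OOM
```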
Solution:
Container cgroup-level OOM:
Increase the pod's memory limits. Keep actual usage below 80% of the specified limits. See Manage pods and Upgrade or downgrade node resources.
Enable resource profiling to get recommended configurations for container requests and limits.
Node-level OOM:
Scale out the memory resources of the node or distribute workloads across more nodes. See Upgrade or downgrade node resources and Schedule applications to specific nodes.
Identify pods with high memory usage on the node and set reasonable memory limits.
For more information, see Causes and solutions for OOM Killer.
Pod CrashLoopBackOff
When a process in a pod exits unexpectedly, Kubernetes restarts the container. If the pod still fails to reach the desired state after multiple restarts, its status changes to CrashLoopBackOff.
Diagnosis:
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster and click its name. In the left-side pane, choose Workloads > Pods.
Find the abnormal pod and click View Details in the Actions column.
Check the Events section for abnormal event descriptions.
Click the Logs tab to view process-level errors.
If the pod has been restarted, select Show the log of the last container exit to view the logs of the previous container.
The console displays a maximum of 500 recent log entries. To view more historical logs, set up a log persistence solution for unified collection and storage.