Container Service for Kubernetes (ACK) provides the alert management feature that allows you to centrally configure alerting for containers. You can configure alert rules to receive notifications when a service exception occurs or when metrics exceed thresholds, including key metrics of basic cluster resources, metrics of core cluster components, and application metrics. You can modify the default alert rules of a cluster by deploying CustomResourceDefinitions (CRDs) in the cluster. This enables you to detect abnormal changes in the cluster.
Prerequisites
Only ACK managed clusters and ACK dedicated clusters are supported.
ACK Serverless clusters require you to enable alerting in the corresponding monitoring instance. For more information, see Create an alert rule for a Prometheus instance.
Simple Log Service is activated. You must log on to the Simple Log Service console and follow the instructions to activate Simple Log Service.
Managed Service for Prometheus is activated. For more information, see Managed Service for Prometheus instance billing.
Billing
Alerts are sent by Simple Log Service, Managed Service for Prometheus, and CloudMonitor. Additional fees may be charged for sending notifications such as text messages and emails from these monitoring services. The following table describes the billing details. Before you enable the alert management feature, you can check the source of each alert item in the default alert rule template and activate the required services.
Alert source | Configuration requirements | Billing details |
Simple Log Service | Enable event monitoring. Event monitoring is automatically enabled when you enable the alert management feature. | |
Managed Service for Prometheus | Configure Prometheus Service monitoring for the cluster. | Free of charge |
CloudMonitor | Enable the monitoring feature of CloudMonitor for an ACK cluster. |
Enable alert management
After you enable the alert management feature, you can configure metric-based alerts for specific resources in the cluster and automatically receive alert notifications when exceptions occur. This helps you efficiently manage and maintain your cluster and ensure service stability. For more information about resource alerts, see Default alert rule template.
ACK managed cluster
You can enable alert management for an existing cluster or when you create a cluster.
Enable alert management for an existing cluster
If you have an existing cluster, you can enable alert management by performing the following steps:
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Alert Configuration.
On the Alert Configuration page, click Start Installation. The console automatically checks the prerequisites, and installs and upgrades the components.
After the installation and upgrade are complete, configure alerts on the Alert Configuration page.
Tab | Description |
Alert Rule Management | Turn on Enabled to enable the corresponding alert rule set. Click Edit Notification Object to set the associated notification object. |
Alert History | You can view the latest 100 alert records sent within the last day. Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration. Click Troubleshoot to quickly locate the resource page where the exception occurred (event or metric exception). |
Contact Management | You can create, edit, or delete alert contacts. Supported contact methods are text messages, emails, and robots. Before a contact can receive alert messages, you must verify the contact method in the CloudMonitor console. Contact synchronization is also supported. If the verification information expires, delete the corresponding contact in CloudMonitor and refresh the contacts page. For information about how to configure robot notification objects, see DingTalk Robot, WeCom Robot, and Lark Robot. |
Contact Group Management | You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration. |
Enable alert management when you create a cluster
On the Component Configurations page when you create a cluster, select Alert Configuration > Use Default Alert Template To Configure Alerts and select Alert Contact Group. For more information, see Create an ACK managed cluster.
After you enable alert management when you create a cluster, the system enables the default alert rules and sends alert notifications to the default contact group. You can also modify an alert contact or alert contact group.
ACK dedicated cluster
For an ACK dedicated cluster, you must first authorize the worker Resource Access Management (RAM) role, and then enable the default alert rules.
Authorize the worker RAM role
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, click Cluster Information.
On the Cluster Information page, in the Cluster Resources section, copy the name on the right side of Worker RAM Role and click the link to go to the RAM console to authorize the worker RAM role.
Create the following custom policy. For more information, see Create a custom policy on the JSON tab.
{
  "Action": [
    "log:*",
    "arms:*",
    "cms:*",
    "cs:UpdateContactGroup"
  ],
  "Resource": [
    "*"
  ],
  "Effect": "Allow"
}
On the Roles page, search for the worker RAM role and grant the custom policy that you created to the worker RAM role. For more information, see Method 1: Grant permissions to a RAM role by clicking Grant Permission on the Roles page.
Check the component logs to verify that the permissions are granted.
In the left-side navigation pane of the details page of the cluster, open the stateless application list.
Set Namespace to kube-system, and click the name of alicloud-monitor-controller in the stateless application list.
Click the Logs tab to view the pod logs that indicate successful authorization.
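The policy shown in the authorization steps above is a single statement. As a minimal sketch, the following Python snippet wraps that statement in a complete policy document and prints it as JSON. The "Version": "1" and "Statement" envelope is an assumption based on the usual RAM policy layout and is not shown in this topic, and build_policy_document is a hypothetical helper name.

```python
import json

# Statement copied from the authorization step in this topic.
ALERT_STATEMENT = {
    "Action": ["log:*", "arms:*", "cms:*", "cs:UpdateContactGroup"],
    "Resource": ["*"],
    "Effect": "Allow",
}


def build_policy_document(statement):
    """Wrap one statement in a policy document and return it as JSON text."""
    # Assumption: the JSON tab expects the usual RAM envelope with
    # "Version": "1" and a "Statement" list. Adjust if your console differs.
    document = {"Version": "1", "Statement": [statement]}
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    print(build_policy_document(ALERT_STATEMENT))
```

The printed text can be compared with what the RAM console shows on the JSON tab before you save the policy.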
Enable the default alert rules
In the left-side navigation pane of the details page of the cluster, choose Operations > Alert Configuration.
On the Alert Configuration page, configure the following alert information.
Tab
Description
Alert Rule Management
Turn on Enabled to enable the corresponding alert rule set. Click Edit Notification Object to set the associated notification object.
Alert History
You can view the latest 100 historical records sent within the last day. Click a link in the Alert Rule column to go to the corresponding monitoring system and view the detailed rule configuration. Click Troubleshoot to quickly locate the resource page where the exception occurred (event or metric exception).
Contact Management
You can create, edit, or delete alert contacts.
Contact methods can be set through text messages, mailboxes, and robot types. You need to authenticate them first in the CloudMonitor console under to receive alert messages. Contact synchronization is also supported. If the authentication information expires, you can delete the corresponding contact in CloudMonitor and refresh the contacts page. For notification object robot type settings, see DingTalk Robot, WeCom Robot, and Lark Robot.
Contact Group Management
You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration.
Configure alert rules by using CRDs
When the alerting feature is enabled, the system automatically creates an AckAlertRule object in the kube-system namespace. The AckAlertRule object contains the default alert rule template. You can modify the AckAlertRule object to customize the default alert rules based on your business requirements.
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Alert Configuration.
On the Alert Rule Management tab, click Edit Alert Configuration in the upper-right corner. Then, click Actions > YAML in the row of the target rule to view the AckAlertRule resource configuration of the current cluster.
Refer to the description of the default alert rule template and modify the sample YAML file.
Example:
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following is a sample configuration of a cluster event alert rule.
    - name: pod-exceptions # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: pod-oom # The name of the alert rule.
          type: event # The type of the alert rule (Rule_Type). Valid values: event and metric-cms.
          expression: sls.app.ack.pod.oom # The expression of the alert rule. When the rule type is event, the value of the expression is the value of Rule_Expression_Id in the default alert rule template described in this topic.
          enable: enable # The status of the alert rule. Valid values: enable and disable.
        - name: pod-failed
          type: event
          expression: sls.app.ack.pod.failed
          enable: enable
    # The following is a sample configuration of a cluster basic resource alert rule.
    - name: res-exceptions # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: node_cpu_util_high # The name of the alert rule.
          type: metric-cms # The type of the alert rule (Rule_Type). Valid values: event and metric-cms.
          expression: cms.host.cpu.utilization # The expression of the alert rule. When the rule type is metric-cms, the value of the expression is the value of Rule_Expression_Id in the default alert rule template described in this topic.
          contactGroups: # The alert contact group configuration that is mapped to the alert rule. The configuration is generated by the ACK console. The same contact is used for the same account. The contact can be reused in multiple clusters.
          enable: enable # The status of the alert rule. Valid values: enable and disable.
          thresholds: # The threshold of the alert rule. For more information, see the section about how to change the threshold of an alert rule.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '1'
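Before you apply a modified AckAlertRule, it can help to check the fields whose valid values are listed in the comments of the sample. The following is a minimal sketch that assumes the YAML has already been parsed into a Python dictionary (for example, with a YAML library of your choice). The function name find_rule_problems and the checks themselves are illustrative and are not part of ACK.

```python
# Valid values taken from the comments in the sample AckAlertRule.
VALID_TYPES = {"event", "metric-cms"}
VALID_STATES = {"enable", "disable"}

# A dictionary that mirrors part of the sample YAML in this topic.
SAMPLE = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default"},
    "spec": {
        "groups": [
            {
                "name": "pod-exceptions",
                "rules": [
                    {"name": "pod-oom", "type": "event",
                     "expression": "sls.app.ack.pod.oom", "enable": "enable"},
                    {"name": "pod-failed", "type": "event",
                     "expression": "sls.app.ack.pod.failed", "enable": "enable"},
                ],
            }
        ]
    },
}


def find_rule_problems(resource):
    """Return a list of human-readable problems found in the resource."""
    problems = []
    if resource.get("kind") != "AckAlertRule":
        problems.append("kind must be AckAlertRule")
    for group in resource.get("spec", {}).get("groups", []):
        group_name = group.get("name", "<unnamed group>")
        for rule in group.get("rules", []):
            where = "{}/{}".format(group_name, rule.get("name", "<unnamed rule>"))
            if rule.get("type") not in VALID_TYPES:
                problems.append(where + ": type must be event or metric-cms")
            if rule.get("enable") not in VALID_STATES:
                problems.append(where + ": enable must be enable or disable")
            if not rule.get("expression"):
                problems.append(where + ": expression is missing")
    return problems


if __name__ == "__main__":
    for problem in find_rule_problems(SAMPLE):
        print(problem)
```

An empty result only means that these fields look consistent with the sample. Whether the rule synchronizes is still decided by the alert rule status shown on the Alert Configuration page.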
Example: Modify the threshold of a cluster basic resource alert rule by using a CRD
According to the default alert rule template, the cluster resource exception alert rule set has a Rule_Type of metric-cms, and its alert rules are synchronized from the basic resource alert rules of CloudMonitor. In this example, the thresholds parameter is added to the CRD that corresponds to the Node - CPU usage alert rule to configure the threshold, the number of retries, and the silence period of the basic monitoring alert rule.
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following is a sample configuration of a cluster basic resource alert rule.
    - name: res-exceptions # The name of the alert rule group, which corresponds to the Group_Name field in the alert template.
      rules:
        - name: node_cpu_util_high # The name of the alert rule.
          type: metric-cms # The type of the alert rule (Rule_Type). Valid values: event and metric-cms.
          expression: cms.host.cpu.utilization # The expression of the alert rule. When the rule type is metric-cms, the value of the expression is the value of Rule_Expression_Id in the default alert rule template described in this topic.
          contactGroups: # The alert contact group configuration that is mapped to the alert rule. The configuration is generated by the ACK console. The same contact is used for the same account. The contact can be reused in multiple clusters.
          enable: enable # The status of the alert rule. Valid values: enable and disable.
          thresholds: # The threshold of the alert rule. For more information, see how to configure alert rules by using CRDs.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '1'
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: '3'
            - key: CMS_RULE_SILENCE_SEC
              value: '900'
Parameter | Required | Description | Default value |
CMS_ESCALATIONS_CRITICAL_Threshold | Yes | The alert threshold. If this parameter is not configured, the rule fails to be synchronized and is disabled. | The default value is the same as the default value specified in the default alert rule template. |
CMS_ESCALATIONS_CRITICAL_Times | No | The number of times that the alert threshold is exceeded before an alert is triggered. If this parameter is not configured, the default value is used. | 3 |
CMS_RULE_SILENCE_SEC | No | The silence period after an alert is triggered. This parameter is used to prevent frequent alerting. Unit: seconds. If this parameter is not configured, the default value is used. | 900 |
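The three threshold parameters can also be assembled programmatically before you paste them into the CRD. The following is a minimal sketch that assumes the AckAlertRule has been parsed into a Python dictionary. build_thresholds and set_rule_thresholds are hypothetical helper names. The defaults of 3 and 900 come from the table above, and values are written as strings to match the sample YAML.

```python
DEFAULT_TIMES = 3          # default for CMS_ESCALATIONS_CRITICAL_Times
DEFAULT_SILENCE_SEC = 900  # default for CMS_RULE_SILENCE_SEC, in seconds


def build_thresholds(threshold, unit="percent",
                     times=DEFAULT_TIMES, silence_sec=DEFAULT_SILENCE_SEC):
    """Return a thresholds list shaped like the one in the sample YAML."""
    if threshold is None:
        # The threshold is required; without it the rule fails to synchronize.
        raise ValueError("CMS_ESCALATIONS_CRITICAL_Threshold is required")
    return [
        {"key": "CMS_ESCALATIONS_CRITICAL_Threshold",
         "unit": unit, "value": str(threshold)},
        {"key": "CMS_ESCALATIONS_CRITICAL_Times", "value": str(times)},
        {"key": "CMS_RULE_SILENCE_SEC", "value": str(silence_sec)},
    ]


def set_rule_thresholds(resource, group_name, rule_name, thresholds):
    """Attach thresholds to the named rule in place and return that rule."""
    for group in resource["spec"]["groups"]:
        if group["name"] != group_name:
            continue
        for rule in group["rules"]:
            if rule["name"] == rule_name:
                rule["thresholds"] = thresholds
                return rule
    raise KeyError("{}/{} not found".format(group_name, rule_name))


if __name__ == "__main__":
    resource = {"spec": {"groups": [
        {"name": "res-exceptions",
         "rules": [{"name": "node_cpu_util_high", "type": "metric-cms"}]},
    ]}}
    # 85 is an arbitrary illustrative threshold, not a recommended value.
    set_rule_thresholds(resource, "res-exceptions", "node_cpu_util_high",
                        build_thresholds(85))
    print(resource)
```

Raising an error when the threshold is missing mirrors the behavior described in the table, where a rule without a threshold fails to be synchronized and is disabled.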
Default alert rule template
The following tables describe the default alert rule template.
FAQ
The alert rule fails to be synchronized and the error message "The Project does not exist : k8s-log-xxx" is returned
Issue:
The alert rule synchronization status in the alert center shows the error message "The Project does not exist : k8s-log-xxx".
Cause:
No event center has been created in Simple Log Service for your cluster.
Solution:
In the Simple Log Service console, check whether you have reached the quota limit. For more information about resources, see Basic resources.
If you have reached the quota limit, delete unnecessary projects or submit a ticket to request an increase in the project resource quota limit. For information about how to delete a project, see Manage projects.
If you have not reached the quota limit, perform the following steps.
Reinstall ack-node-problem-detector.
When you reinstall the component, a default project named k8s-log-xxxxxx is created.
Uninstall ack-node-problem-detector.
In the left-side navigation pane of the details page of the target cluster in the ACK console, go to the page on which the cluster components are managed.
Click the Logs & Monitoring tab. In the ack-node-problem-detector card, click Uninstall. In the dialog box that appears, click OK.
After the uninstallation is complete, install ack-node-problem-detector.
In the left-side navigation pane, choose Operations > Alert Configuration.
On the Alert Configuration page, click Start Installation. The console automatically creates a project and installs and upgrades the components.
On the Alert Configuration page, turn off the switch in the Enabled column for the corresponding alert rule set. Wait until Alert Rule Status changes to Rule Disabled, and then turn on the switch to retry.
The alert rule fails to be synchronized and an error message similar to "this rule have no xxx contact groups reference" is returned
Issue:
The alert rule fails to be synchronized and an error message similar to "this rule have no xxx contact groups reference" is returned.
Cause:
No contact group subscribes to the alert rule.
Solution:
Create a contact group and add contacts.
Click Edit Notification Object on the right side of the corresponding alert rule set and configure a contact group that subscribes to the alert rule set.