Realtime Compute for Apache Flink allows you to use CloudMonitor or Managed Service for Prometheus of Application Real-Time Monitoring Service (ARMS) to implement deployment monitoring and alerting. CloudMonitor is free of charge. You can configure alert rules for metrics or subscribe to event-triggered alerts. This helps you detect and handle exceptions at the earliest opportunity. This topic describes how to configure alert rules by using CloudMonitor or Managed Service for Prometheus of ARMS.
Limits
You cannot configure alert rules for Realtime Compute for Apache Flink deployments that are deployed in session clusters.
You cannot configure alert rules for batch deployments of Realtime Compute for Apache Flink.
Configuration guide
CloudMonitor: You must go to the CloudMonitor console to configure alert rules. For more information, see Configure alert rules in the CloudMonitor console.
Managed Service for Prometheus of ARMS:
You can configure alert rules and create alert templates in the development console of Realtime Compute for Apache Flink. You can configure alert rules for only seven metrics, including deployment failure events. For more information, see Configure alert rules in the development console of Realtime Compute for Apache Flink.
You can use the PromQL syntax in the ARMS console to configure alert rules for other metrics. For more information, see Configure alert rules in the ARMS console.
You can subscribe to event-triggered alerts in the CloudMonitor console. You can subscribe to alerts only for the Elastic Compute Service (ECS) failure handling events and the ECS proactive O&M events. You can also subscribe to alerts for deployment failure events in the development console of Realtime Compute for Apache Flink. For more information about how to subscribe to event-triggered alerts, see Subscribe to event-triggered alerts.
Configure alert rules in the CloudMonitor console
To configure alert rules for metrics or subscribe to event-triggered alerts in the CloudMonitor console, you must use the Alibaba Cloud account that was used to purchase the workspace, or a RAM user or RAM role within that account that has permissions on the namespaces.
Subscribe to alerts for metrics
On the Deployments page, click the name of the desired deployment.
On the page that appears, click the Metrics tab. In the upper-right corner of the Metrics tab, click Subscribe to indicator alerts to go to the CloudMonitor console.
In the Configure Rule Description panel of the CloudMonitor console, configure the parameters and click OK.
You can configure an alert rule for multiple metrics at a time. You can configure the namespace and deploymentId parameters for a metric to specify the monitoring scope. Set the namespace parameter to the name of the Realtime Compute for Apache Flink namespace and the deploymentId parameter to the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page. For more information about other alert parameters, see Create an alert rule.
Note: The namespace drop-down list displays only the namespaces in which monitoring data is generated, and the deploymentId drop-down list displays only the IDs of the deployments in which monitoring data is generated. If no values can be selected, you can manually specify a value for the namespace or deploymentId parameter.
In the Create Alert Rule panel, configure other alert parameters.
If you set the Resource Range parameter to Instances, the value of the Associated Resources parameter is the ID of the workspace. You cannot change the value of the Associated Resources parameter after the alert rule is created. For more information about how to obtain the workspace ID, see How do I view information about a workspace, such as the workspace ID? For more information about other alert parameters, see Create an alert rule.
Click Confirm.
Subscribe to event-triggered alerts
You can configure alert conditions to subscribe to events. Events to which you can subscribe include system events and threshold-triggered events generated by alert rules.
On the Deployments page, click the name of the desired deployment.
On the page that appears, click the Metrics tab. In the upper-right corner of the Metrics tab, click Subscribe to event alerts to go to the CloudMonitor console.
On the Create Subscription Policy page, configure the parameters.
For more information about the parameters, see Manage event subscription policies (recommended).
Products: Select Flink.
Event name: Select JOB_FAILED if you set the Subscription Type parameter to System events. If you select the JOB_FAILED event, you can select only Critical for the Event Level parameter.
Application grouping: Select a Realtime Compute for Apache Flink namespace.
Event Resources: If you set the Subscription Type parameter to System events, enter the ID of the deployment in the Event Resources field to configure event-triggered alerts for the deployment. The ID of the deployment is the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page.
Configure alert rules in the development console of Realtime Compute for Apache Flink
In the development console of Realtime Compute for Apache Flink, you can view only the alert events that were generated within the last 48 hours. To view earlier alert events, go to the Alert Management page in the ARMS console.
Create an alert rule
Go to the Alarm tab.
Log on to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
On the Deployments page, click the name of the desired deployment.
Click the Alarm tab.
On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose the option for creating a rule from the Add Rule menu.
You can also create an alert rule by using an alert template on the Configurations page. To create an alert rule by using an alert template, choose Add Rule > Create Rule by Template, click the name of the desired template, and then perform the subsequent steps. For more information about how to create an alert template, see Create an alert template.
In the Create Rule panel, configure the parameters. The following table describes the parameters.
Section
Parameter
Description
Rule
Name
The name must be 3 to 64 characters in length, and can contain lowercase letters, digits, and underscores (_). The name must start with a letter.
Description
The description of the alert rule.
Content
The conditions that trigger an alert. After you create the conditions, Realtime Compute for Apache Flink compares the values of specified metrics with the thresholds that are specified in the conditions at the interval you specify. If one of the conditions resolves to true, an alert is triggered.
A condition consists of the following items:
Metric:
Restart Count in 1 Minute: the number of times that the JobManager restarts deployments in one minute.
Checkpoint Count in 5 Minutes: the number of times that checkpointing succeeds in five minutes.
Emit Delay: the processing delay. This parameter specifies the difference between the time when data is generated and the time when data leaves the source operator. Unit: seconds.
Important: The time when the data is generated depends on the timestamp that is recorded in the external system. If no timestamp is recorded in the external system or the timestamp that is recorded when data is written to the external system is incorrect, the value of the Emit Delay parameter is invalid and cannot be used to determine the true processing delay.
IN RPS: the number of input data records per second.
OUT RPS: the number of output data records per second.
Source Idle Time: the duration for which data is not processed in the source. Unit: milliseconds.
Job Failed: The deployment fails.
Time Interval: the interval within which data of a metric is collected every minute. Realtime Compute for Apache Flink obtains data of the metric within the last interval and compares the obtained data with the specified threshold. If the historical data meets the specified conditions of the alert rule, an alert is triggered.
Comparator: The greater-than-or-equal-to sign (>=) and the less-than-or-equal-to sign (<=) are supported.
Thresholds: the value with which the value of a metric is compared.
If you set the Comparator parameter to the greater-than-or-equal-to sign (>=), the maximum value of the metric on the vertical axis within the last interval is used. If the maximum value of the metric within the last interval is greater than or equal to the threshold, an alert is triggered.
If you set the Comparator parameter to the less-than-or-equal-to sign (<=), the minimum value of the metric on the vertical axis within the last interval is used. If the minimum value of the metric within the last interval is less than or equal to the threshold, an alert is triggered.
For example, you can set the Time Interval parameter to 5 minutes, the Comparator parameter to the less-than-or-equal-to sign (<=), and the Thresholds parameter to 2. In this case, Realtime Compute for Apache Flink obtains the values of a metric within the last 5 minutes on the vertical axis and compares the minimum value of the metric with the specified threshold. If the minimum value of the metric within the specified interval is less than or equal to the threshold, an alert is triggered.
Effective Period
The time period during which the alert rule is effective. If you do not specify a time period, all alert rules are effective throughout the day. For example, you can specify a time period from 09:00 to 18:00.
Alarm Rate
The interval at which an alert is reported. Unit: minutes. You can set this parameter to a value in a range from 1 minute to 1440 minutes (24 hours).
Notification
Notification
Valid values:
DingTalk
Email
SMS
Webhook
Phone
Important: Make sure that the contact you added can receive alert notifications. Otherwise, alert notifications cannot be sent.
Notification object
The contacts to which alert notifications are sent. You can select multiple contacts. You can directly select or search for a contact. You must manage contacts before you select contacts.
To manage contacts, perform the following operations: Click Notification object management on the right side of Notification object. In the Edit Contact Group dialog box, click Edit in the Actions column on the Contact Group, Contact, Webhook, and DingTalk tabs separately, edit information, and then click Save.
For more information about how to add a webhook and a DingTalk chatbot, see FAQ.
Alarm Noise Reduction
After you click Advanced Settings, you can turn on Alarm Noise Reduction.
After you turn on Alarm Noise Reduction, the system does not send alert notifications if a deployment quickly resumes after a short failover. For example, in cluster scheduling or automatic tuning scenarios, a deployment may perform a failover for a short period of time. The system sends alert notifications only when the specified threshold condition is continuously met.
No Data Alarms
After you click Advanced Settings, you can turn on No Data Alarms and specify the time period during which no data is generated.
After you turn on this switch, data that is monitored based on codeless tracking is reported. If no data is reported during the specified time period, the system sends an alert notification. In most cases, if an issue, such as an exception of the JobManager, abnormal deployment cancellation, or an exception of the report trace, occurs, data that is monitored based on codeless tracking is reported.
Click OK.
After you create an alert rule, the rule is immediately effective. You can stop, edit, or delete the alert rule in the alert rule list.
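The comparator semantics described for the Content parameter can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used by the service: the function name condition_triggers and the sample values are hypothetical, and the sketch assumes that the metric values for the last Time Interval are already available as a list.

```python
from typing import Sequence


def condition_triggers(samples: Sequence[float], comparator: str, threshold: float) -> bool:
    """Evaluate one alert condition over the metric values of the last Time Interval.

    ">=" compares the maximum value in the window with the threshold.
    "<=" compares the minimum value in the window with the threshold.
    """
    if not samples:
        # Assumption of this sketch: an empty window never triggers here,
        # because missing data is handled by the separate No Data Alarms switch.
        return False
    if comparator == ">=":
        return max(samples) >= threshold
    if comparator == "<=":
        return min(samples) <= threshold
    raise ValueError(f"unsupported comparator: {comparator!r}")


# Example from the text: Time Interval = 5 minutes, Comparator = "<=", Thresholds = 2.
# The values below are made-up samples for illustration only.
values_last_5_minutes = [4, 3, 2, 5, 4]
print(condition_triggers(values_last_5_minutes, "<=", 2))  # minimum is 2, so True
print(condition_triggers(values_last_5_minutes, ">=", 6))  # maximum is 5, so False
```

Because the minimum or maximum of the window is used instead of the latest value, a single sample that crosses the threshold within the interval is enough to trigger the condition.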
Create an alert template
Go to the Create Rule Template panel.
You can use one of the following methods to go to the Create Rule Template panel:
Go to the Configurations page.
Log on to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, click Configurations.
On the Alarm Templates tab, click Add Alarm Template.
Go to the Deployments page.
Log on to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, click Deployments. On the Deployments page, click the name of the desired deployment.
Click the Alarm tab.
On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Create Rule by Template > Add Rule Template.
In the Create Rule Template panel, configure the parameters for the alert template.
Section
Parameter
Description
Rule
Name
The name must be 3 to 64 characters in length, and can contain lowercase letters, digits, and underscores (_). The name must start with a letter.
Description
The description of the alert rule.
Content
The conditions that trigger an alert. After you create the conditions, Realtime Compute for Apache Flink compares the values of specified metrics with the thresholds that are specified in the conditions at the interval you specify. If one of the conditions resolves to true, an alert is triggered.
A condition consists of the following items:
Metric:
Restart Count in 1 Minute: the number of times that the JobManager restarts deployments in one minute.
Checkpoint Count in 5 Minutes: the number of times that checkpointing succeeds in five minutes.
Emit Delay: the processing delay. This parameter specifies the difference between the time when data is generated and the time when data leaves the source operator. Unit: seconds.
Important: The time when the data is generated depends on the timestamp that is recorded in the external system. If no timestamp is recorded in the external system or the timestamp that is recorded when data is written to the external system is incorrect, the value of the Emit Delay parameter is invalid and cannot be used to determine the true processing delay.
IN RPS: the number of input data records per second.
OUT RPS: the number of output data records per second.
Source Idle Time: the duration for which data is not processed in the source. Unit: milliseconds.
Job Failed: The deployment fails.
Time Interval: the interval within which data of a metric is collected every minute. Realtime Compute for Apache Flink obtains data of the metric within the last interval and compares the obtained data with the specified threshold. If the historical data meets the specified conditions of the alert rule, an alert is triggered.
Comparator: The greater-than-or-equal-to sign (>=) and the less-than-or-equal-to sign (<=) are supported.
Thresholds: the value with which the value of a metric is compared.
If you set the Comparator parameter to the greater-than-or-equal-to sign (>=), the maximum value of the metric on the vertical axis within the last interval is used. If the maximum value of the metric within the last interval is greater than or equal to the threshold, an alert is triggered.
If you set the Comparator parameter to the less-than-or-equal-to sign (<=), the minimum value of the metric on the vertical axis within the last interval is used. If the minimum value of the metric within the last interval is less than or equal to the threshold, an alert is triggered.
For example, you can set the Time Interval parameter to 5 minutes, the Comparator parameter to the less-than-or-equal-to sign (<=), and the Thresholds parameter to 2. In this case, Realtime Compute for Apache Flink obtains the values of a metric within the last 5 minutes on the vertical axis and compares the minimum value of the metric with the specified threshold. If the minimum value of the metric within the specified interval is less than or equal to the threshold, an alert is triggered.
Effective Period
The time period during which the alert rule is effective. If you do not specify a time period, all alert rules are effective throughout the day. For example, you can specify a time period from 09:00 to 18:00.
Alarm Rate
The interval at which an alert is reported. Unit: minutes. You can set this parameter to a value in a range from 1 minute to 1440 minutes (24 hours).
Notification
Notification
Valid values:
DingTalk
Email
SMS
Webhook
Phone
Important: Make sure that the contact you added can receive alert notifications. Otherwise, alert notifications cannot be sent.
Notification object
The contacts to which alert notifications are sent. You can select multiple contacts. You can directly select or search for a contact. You must manage contacts before you select contacts.
To manage contacts, perform the following operations: Click Notification object management on the right side of Notification object. In the Edit Contact Group dialog box, click Edit in the Actions column on the Contact Group, Contact, Webhook, and DingTalk tabs separately, edit information, and then click Save.
For more information about how to add a webhook and a DingTalk chatbot, see FAQ.
Alarm Noise Reduction
After you click Advanced Settings, you can turn on Alarm Noise Reduction.
After you turn on Alarm Noise Reduction, the system does not send alert notifications if a deployment quickly resumes after a short failover. For example, in cluster scheduling or automatic tuning scenarios, a deployment may perform a failover for a short period of time. The system sends alert notifications only when the specified threshold condition is continuously met.
No Data Alarms
After you click Advanced Settings, you can turn on No Data Alarms and specify the time period during which no data is generated.
After you turn on this switch, data that is monitored based on codeless tracking is reported. If no data is reported during the specified time period, the system sends an alert notification. In most cases, if an issue, such as an exception of the JobManager, abnormal deployment cancellation, or an exception of the report trace, occurs, data that is monitored based on codeless tracking is reported.
Click OK.
After you create an alert template, you can edit the template or delete the template from the alert template list.
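If you generate rule or template names in scripts before you enter them in the console, the naming rule for the Name parameter can be checked in advance. The following Python sketch is only an illustration: the function name is_valid_rule_name is hypothetical, and the sketch assumes that "must start with a letter" means a lowercase letter, because uppercase letters are not listed as allowed characters.

```python
import re

# 3 to 64 characters in total: one leading lowercase letter,
# followed by 2 to 63 lowercase letters, digits, or underscores.
_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]{2,63}")


def is_valid_rule_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_rule_name("restart_alert_01"))  # True
print(is_valid_rule_name("1st_rule"))          # False: starts with a digit
print(is_valid_rule_name("ab"))                # False: shorter than 3 characters
print(is_valid_rule_name("Restart_Alert"))     # False: contains uppercase letters
```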
Configure alert rules in the ARMS console
If you use a RAM user or RAM role to access Realtime Compute for Apache Flink, the RAM user or RAM role must have the permissions to access ARMS. For more information, see Overview.
Log on to the Realtime Compute for Apache Flink console.
Find the workspace that you want to manage and, in the Actions column, choose the option that opens the ARMS console.
On the page that appears, view the workspace name, workspace ID, and the name of the Prometheus instance that corresponds to the workspace.
Click the icon in the upper-left corner.
Create an alert rule or an alert template.
Create an alert rule: In the left-side navigation pane, go to the Prometheus Alert Rules page and click Create Prometheus Alert Rule. For more information, see Create an alert rule for a Prometheus instance.
Create an alert template: In the left-side navigation pane, go to the Prometheus Alert Rule Templates page and click Create Prometheus Alert Rule Template. For more information, see Create and manage an alert rule template.
FAQ
How do I add a DingTalk chatbot in the development console of Realtime Compute for Apache Flink?
Add a custom DingTalk chatbot and obtain the webhook URL of the chatbot. For more information, see Add a custom DingTalk chatbot and obtain the webhook URL.
Important: To ensure that you receive alerts from a DingTalk chatbot, select at least Custom Keywords in the Security Settings section of the Add Robot dialog box, and configure Alarm as a keyword.
Add a notification object.
On the Deployments page, click the name of the desired deployment. On the page that appears, click the Alarm tab.
On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose the option for creating a rule from the Add Rule menu, or choose Add Rule > Create Rule by Template > Add Rule Template.
In the Create Rule or Create Rule Template panel, click Notification object management.
In the Edit Contact Group dialog box, click the DingTalk tab. On the DingTalk tab, click Add DingTalk.
In the Add DingTalk dialog box, configure the Name and URL parameters and click Submit.
Go back to the Create Rule or Create Rule Template panel in Step 2. Select DingTalk for Notification and select the related DingTalk chatbot from the Notification object drop-down list.
For more information about how to configure other parameters for an alert rule, see Create an alert rule.
Click OK.
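If you want to test the chatbot manually before you rely on it for alerts, the keyword requirement above can be checked when you build a test message. The following Python sketch makes several assumptions: the function name build_dingtalk_text_message is hypothetical, the JSON shape follows the text message format commonly used by custom DingTalk chatbots (verify it against the DingTalk documentation), and the keyword match is assumed to be case-sensitive. The sketch only builds the request body and does not send anything.

```python
import json

# The custom keyword configured in the Security Settings section.
REQUIRED_KEYWORD = "Alarm"


def build_dingtalk_text_message(content: str) -> str:
    """Build the JSON body of a text message and check the custom keyword first."""
    if REQUIRED_KEYWORD not in content:
        # Assumption: a message without the configured keyword is not delivered.
        raise ValueError(f"message must contain the keyword {REQUIRED_KEYWORD!r}")
    return json.dumps({"msgtype": "text", "text": {"content": content}})


body = build_dingtalk_text_message("Alarm: test message for deployment alerts")
print(body)
```

You would send the resulting body in a POST request to the webhook URL that you obtained when you added the chatbot.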
How do I add a webhook in the development console of Realtime Compute for Apache Flink?
In the Create Rule Template panel or the Create Rule panel, click Notification object management.
In the Edit Contact Group dialog box, click the Webhook tab. On the Webhook tab, click Add Webhook.
In the Add Webhook dialog box, configure the following parameters.
Name: Required. The name of the webhook that you want to add.
URL: Required. The webhook URL.
Headers: Optional. The request headers that store cookies and tokens. The format is key: value.
Note: Make sure that a space exists after the colon (:) between the key and the value.
Params: Optional. The request parameters, in the key: value format.
Note: Make sure that a space exists after the colon (:) between the key and the value.
Body: Required. The request body that is used to store the POST request parameters and parameter data. You can use the $content placeholder in the request body. $content represents the actual alert message.
Click OK.
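To check Headers, Params, and Body values before you enter them, the key: value format and the $content placeholder can be reproduced locally. The following Python sketch is an approximation, not the console's actual processing: the function names parse_key_value_lines and render_body are hypothetical, and the sketch assumes one key: value pair per line.

```python
from string import Template


def parse_key_value_lines(text: str) -> dict:
    """Parse lines in the 'key: value' format; a space after the colon is required."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, separator, value = line.partition(": ")
        if not separator:
            raise ValueError(f"expected 'key: value' with a space after the colon: {line!r}")
        result[key.strip()] = value.strip()
    return result


def render_body(body_template: str, alert_message: str) -> str:
    """Replace the $content placeholder with the alert message."""
    # safe_substitute leaves any other "$" sequences in the template untouched.
    return Template(body_template).safe_substitute(content=alert_message)


headers = parse_key_value_lines("Content-Type: application/json\nAuthorization: Bearer abc")
print(headers)
print(render_body('{"text": "$content"}', "Job Failed"))
```

If the body is JSON, keep in mind that this sketch inserts the message as-is, so a message that contains double quotation marks would need to be escaped by the receiving side or by your own tooling.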
References
For more information about the differences in features and costs between CloudMonitor and Managed Service for Prometheus of ARMS, see Comparison between CloudMonitor and Managed Service for Prometheus of ARMS.
For more information about the metrics supported by Realtime Compute for Apache Flink, see Metrics.
If you no longer require Managed Service for Prometheus of ARMS for a workspace or a metric for a deployment in Realtime Compute for Apache Flink, you can disable Managed Service for Prometheus of ARMS for the workspace or discard the metric for the deployment. This helps reduce costs. You can restore the metric that you discard based on your business requirements. For more information, see Discard or restore metrics.