Realtime Compute for Apache Flink: Configure alert rules

Last Updated: Aug 05, 2024

Realtime Compute for Apache Flink allows you to use CloudMonitor or Managed Service for Prometheus of Application Real-Time Monitoring Service (ARMS) to implement deployment monitoring and alerting. CloudMonitor is free of charge. You can configure alert rules for metrics or subscribe to event-triggered alerts. This helps you detect and handle exceptions at the earliest opportunity. This topic describes how to configure alert rules by using CloudMonitor or Managed Service for Prometheus of ARMS.

Limits

  • You cannot configure alert rules for Realtime Compute for Apache Flink deployments that are deployed in session clusters.

  • You cannot configure alert rules for batch deployments of Realtime Compute for Apache Flink.

Configuration guide

Configure alert rules in the CloudMonitor console

Important

Only the following identities can configure alert rules for metrics or subscribe to event-triggered alerts in the CloudMonitor console: the Alibaba Cloud account that was used to purchase the workspace, and the RAM users and RAM roles within that account that have permissions on the namespaces.

Subscribe to alerts for metrics

  1. On the Deployments page, click the name of the desired deployment.

  2. On the page that appears, click the Metrics tab. In the upper-right corner of the Metrics tab, click Subscribe to indicator alerts to go to the CloudMonitor console.

  3. In the Configure Rule Description panel of the CloudMonitor console, configure the parameters and click OK.

    You can configure an alert rule for multiple metrics at a time. You can configure the namespace and deploymentId parameters for a metric to specify the monitoring scope. Set the namespace parameter to the name of the Realtime Compute for Apache Flink namespace and the deploymentId parameter to the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page. For more information about other alert parameters, see Create an alert rule.

    Note

    The namespace drop-down list displays only the namespaces in which the monitoring data is generated and the deploymentId drop-down list displays only the IDs of the deployments in which the monitoring data is generated. You can manually specify values for the namespace or deploymentId parameter if no values can be selected.

  4. In the Create Alert Rule panel, configure other alert parameters.

    If you set the Resource Range parameter to Instances, the value of the Associated Resources parameter is the ID of the workspace. You cannot change the value of the Associated Resources parameter after the alert rule is created. For more information about how to obtain the workspace ID, see How do I view information about a workspace, such as the workspace ID? For more information about other alert parameters, see Create an alert rule.

  5. Click Confirm.

Subscribe to event-triggered alerts

You can configure alert conditions to subscribe to events. Events to which you can subscribe include system events and threshold-triggered events generated by alert rules.

  1. On the Deployments page, click the name of the desired deployment.

  2. On the page that appears, click the Metrics tab. In the upper-right corner of the Metrics tab, click Subscribe to event alerts to go to the CloudMonitor console.

  3. On the Create Subscription Policy page, configure the parameters.

    For more information about the parameters, see Manage event subscription policies (recommended).

    • Products: Select Flink.

    • Event name: Select JOB_FAILED if you set the Subscription Type parameter to System events. If you select the JOB_FAILED event, you can select only Critical for the Event Level parameter.  

    • Application grouping: Select a Realtime Compute for Apache Flink namespace.

    • Event Resources: If you set the Subscription Type parameter to System events, enter the ID of the deployment in the Event Resources field to configure event-triggered alerts for the deployment. The ID of the deployment is the value of the Deployment ID parameter in the Basic section of the Configuration tab on the Deployments page.

Configure alert rules in the development console of Realtime Compute for Apache Flink

Note

You can view only the alert events generated within the last 48 hours in the development console of Realtime Compute for Apache Flink. To view earlier alert events, go to the Alert Management page in the ARMS console.

Create an alert rule

  1. Go to the Alarm tab.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. On the Deployments page, click the name of the desired deployment.

    4. Click the Alarm tab.

  2. On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Custom Rule.

    You can also create an alert rule by using an alert template on the Configurations page. To create an alert rule by using an alert template, choose Add Rule > Create Rule by Template, click the name of the desired template, and then perform the subsequent steps. For more information about how to create an alert template, see Create an alert template.

  3. In the Create Rule panel, configure the parameters. The following table describes the parameters.

    Section

    Parameter

    Description

    Rule

    Name

    The name must be 3 to 64 characters in length, and can contain lowercase letters, digits, and underscores (_). The name must start with a letter.

    Description

    The description of the alert rule.

    Content

    The conditions that trigger an alert. After you create the conditions, Realtime Compute for Apache Flink compares the values of specified metrics with the thresholds that are specified in the conditions at the interval you specify. If one of the conditions resolves to true, an alert is triggered.

    A condition consists of the following items:

    • Metric:

      • Restart Count in 1 Minute: the number of times that the JobManager restarts deployments in one minute.

      • Checkpoint Count in 5 Minutes: the number of times that checkpointing succeeds in five minutes.

      • Emit Delay: the processing delay. This parameter specifies the difference between the time when data is generated and the time when data leaves the source operator. Unit: seconds.

        Important

        The time when the data is generated depends on the timestamp that is recorded in the external system. If no timestamp is recorded in the external system or the timestamp that is recorded when data is written to the external system is incorrect, the value of the Emit Delay parameter is invalid and cannot be used to determine the true processing delay.

      • IN RPS: the number of input data records per second.

      • OUT RPS: the number of output data records per second.

      • Source Idle Time: the duration for which data is not processed in the source. Unit: milliseconds.

      • Job Failed: The deployment fails.

    • Time Interval: the interval within which data of a metric is collected every minute. Realtime Compute for Apache Flink obtains data of the metric within the last interval and compares the obtained data with the specified threshold. If the historical data meets the specified conditions of the alert rule, an alert is triggered.

    • Comparator: The greater-than-or-equal-to sign (>=) and the less-than-or-equal-to sign (<=) are supported.

    • Thresholds: the value against which the value of a metric is compared.

      • If you set the Comparator parameter to the greater-than-or-equal-to sign (>=), the maximum value of the metric on the vertical axis within the last interval is used. If the maximum value of the metric within the last interval is greater than or equal to the threshold, an alert is triggered.

      • If you set the Comparator parameter to the less-than-or-equal-to sign (<=), the minimum value of the metric on the vertical axis within the last interval is used. If the minimum value of the metric within the last interval is less than or equal to the threshold, an alert is triggered.

    For example, you can set the Time Interval parameter to 5 minutes, the Comparator parameter to the less-than-or-equal-to sign (<=), and the Thresholds parameter to 2. In this case, Realtime Compute for Apache Flink obtains the values of a metric within the last 5 minutes on the vertical axis and compares the minimum value of the metric with the specified threshold. If the minimum value of the metric within the specified interval is less than or equal to the threshold, an alert is triggered.

    Effective Period

    The time period during which the alert rule is effective. If you do not specify a time period, the alert rule is effective throughout the day. For example, you can specify a time period from 09:00 to 18:00.

    Alarm Rate

    The interval at which an alert is reported. Unit: minutes. You can set this parameter to a value in a range from 1 minute to 1440 minutes (24 hours).

    Notification

    Notification

    Valid values:

    • DingTalk

    • Email

    • SMS

    • Webhook

    • Phone

    Important

    Make sure that the contact you added can receive alert notifications. Otherwise, alert notifications cannot be sent.

    Notification object

    The contacts to which alert notifications are sent. You can select multiple contacts. You can directly select or search for a contact. You must manage contacts before you select contacts.

    To manage contacts, perform the following operations: Click Notification object management on the right side of Notification object. In the Edit Contact Group dialog box, click Edit in the Actions column on the Contact Group, Contact, Webhook, and DingTalk tabs separately, edit information, and then click Save.

    For more information about how to add a webhook and a DingTalk chatbot, see FAQ.

    Alarm Noise Reduction

    After you click Advanced Settings, you can turn on Alarm Noise Reduction.

    After you turn on Alarm Noise Reduction, the system does not send alert notifications if a deployment can quickly resume after a short-period failover. For example, in a cluster scheduling or automatic tuning scenario, a deployment may perform a failover for a short period of time. The system sends alert notifications only when the specified threshold condition is continuously met.

    No Data Alarms

    After you click Advanced Settings, you can turn on No Data Alarms and specify the time period during which no data is generated.

    After you turn on this switch, data that is monitored based on codeless tracking is reported. If no data is reported during the specified time period, the system sends an alert notification. In most cases, if an issue, such as an exception of the JobManager, abnormal deployment cancellation, or an exception of the report trace, occurs, data that is monitored based on codeless tracking is reported.

  4. Click OK.

    After you create an alert rule, the rule is immediately effective. You can stop, edit, or delete the alert rule in the alert rule list.
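
The comparison logic described for the Content parameter can be illustrated with a minimal sketch. It is not the service's implementation: the function names condition_met and rule_triggered are hypothetical, and the choice to treat an empty window as "not met" is an assumption. The sketch only mirrors the documented behavior, where >= uses the maximum value within the last interval, <= uses the minimum value, and an alert is triggered if any one condition is true.

```python
from typing import Iterable, Sequence


def condition_met(samples: Sequence[float], comparator: str, threshold: float) -> bool:
    """Evaluate one condition over the metric samples from the last interval.

    Hypothetical helper: mirrors the documented rule only.
    """
    if not samples:
        # Assumption: with no samples in the window, the condition is not met.
        return False
    if comparator == ">=":
        # The maximum value within the interval is compared with the threshold.
        return max(samples) >= threshold
    if comparator == "<=":
        # The minimum value within the interval is compared with the threshold.
        return min(samples) <= threshold
    raise ValueError(f"unsupported comparator: {comparator!r}")


def rule_triggered(condition_results: Iterable[bool]) -> bool:
    """An alert is triggered if at least one condition resolves to true."""
    return any(condition_results)


# Example from the text: Time Interval = 5 minutes, Comparator = <=, Thresholds = 2.
# The sample values below are made up for illustration.
last_five_minutes = [3, 4, 2, 5, 3]
print(condition_met(last_five_minutes, "<=", 2))
```

With these made-up samples, the minimum value is 2, which is less than or equal to the threshold, so the condition is met.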

Create an alert template

  1. Go to the Create Rule Template panel.

    You can use one of the following methods to go to the Create Rule Template panel:

    • Go to the Configurations page.

      1. Log on to the Realtime Compute for Apache Flink console.

      2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

      3. In the left-side navigation pane, click Configurations.

      4. On the Alarm Templates tab, click Add Alarm Template.

    • Go to the Deployments page.

      1. Log on to the Realtime Compute for Apache Flink console.

      2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

      3. In the left-side navigation pane, click Deployments. On the Deployments page, click the name of the desired deployment.

      4. Click the Alarm tab.

      5. On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Create Rule by Template > Add Rule Template.

  2. In the Create Rule Template panel, configure the parameters for the alert template.

    Section

    Parameter

    Description

    Rule

    Name

    The name must be 3 to 64 characters in length, and can contain lowercase letters, digits, and underscores (_). The name must start with a letter.

    Description

    The description of the alert rule.

    Content

    The conditions that trigger an alert. After you create the conditions, Realtime Compute for Apache Flink compares the values of specified metrics with the thresholds that are specified in the conditions at the interval you specify. If one of the conditions resolves to true, an alert is triggered.

    A condition consists of the following items:

    • Metric:

      • Restart Count in 1 Minute: the number of times that the JobManager restarts deployments in one minute.

      • Checkpoint Count in 5 Minutes: the number of times that checkpointing succeeds in five minutes.

      • Emit Delay: the processing delay. This parameter specifies the difference between the time when data is generated and the time when data leaves the source operator. Unit: seconds.

        Important

        The time when the data is generated depends on the timestamp that is recorded in the external system. If no timestamp is recorded in the external system or the timestamp that is recorded when data is written to the external system is incorrect, the value of the Emit Delay parameter is invalid and cannot be used to determine the true processing delay.

      • IN RPS: the number of input data records per second.

      • OUT RPS: the number of output data records per second.

      • Source Idle Time: the duration for which data is not processed in the source. Unit: milliseconds.

      • Job Failed: The deployment fails.

    • Time Interval: the interval within which data of a metric is collected every minute. Realtime Compute for Apache Flink obtains data of the metric within the last interval and compares the obtained data with the specified threshold. If the historical data meets the specified conditions of the alert rule, an alert is triggered.

    • Comparator: The greater-than-or-equal-to sign (>=) and the less-than-or-equal-to sign (<=) are supported.

    • Thresholds: the value against which the value of a metric is compared.

      • If you set the Comparator parameter to the greater-than-or-equal-to sign (>=), the maximum value of the metric on the vertical axis within the last interval is used. If the maximum value of the metric within the last interval is greater than or equal to the threshold, an alert is triggered.

      • If you set the Comparator parameter to the less-than-or-equal-to sign (<=), the minimum value of the metric on the vertical axis within the last interval is used. If the minimum value of the metric within the last interval is less than or equal to the threshold, an alert is triggered.

    For example, you can set the Time Interval parameter to 5 minutes, the Comparator parameter to the less-than-or-equal-to sign (<=), and the Thresholds parameter to 2. In this case, Realtime Compute for Apache Flink obtains the values of a metric within the last 5 minutes on the vertical axis and compares the minimum value of the metric with the specified threshold. If the minimum value of the metric within the specified interval is less than or equal to the threshold, an alert is triggered.

    Effective Period

    The time period during which the alert rule is effective. If you do not specify a time period, the alert rule is effective throughout the day. For example, you can specify a time period from 09:00 to 18:00.

    Alarm Rate

    The interval at which an alert is reported. Unit: minutes. You can set this parameter to a value in a range from 1 minute to 1440 minutes (24 hours).

    Notification

    Notification

    Valid values:

    • DingTalk

    • Email

    • SMS

    • Webhook

    • Phone

    Important

    Make sure that the contact you added can receive alert notifications. Otherwise, alert notifications cannot be sent.

    Notification object

    The contacts to which alert notifications are sent. You can select multiple contacts. You can directly select or search for a contact. You must manage contacts before you select contacts.

    To manage contacts, perform the following operations: Click Notification object management on the right side of Notification object. In the Edit Contact Group dialog box, click Edit in the Actions column on the Contact Group, Contact, Webhook, and DingTalk tabs separately, edit information, and then click Save.

    For more information about how to add a webhook and a DingTalk chatbot, see FAQ.

    Alarm Noise Reduction

    After you click Advanced Settings, you can turn on Alarm Noise Reduction.

    After you turn on Alarm Noise Reduction, the system does not send alert notifications if a deployment can quickly resume after a short-period failover. For example, in a cluster scheduling or automatic tuning scenario, a deployment may perform a failover for a short period of time. The system sends alert notifications only when the specified threshold condition is continuously met.

    No Data Alarms

    After you click Advanced Settings, you can turn on No Data Alarms and specify the time period during which no data is generated.

    After you turn on this switch, data that is monitored based on codeless tracking is reported. If no data is reported during the specified time period, the system sends an alert notification. In most cases, if an issue, such as an exception of the JobManager, abnormal deployment cancellation, or an exception of the report trace, occurs, data that is monitored based on codeless tracking is reported.

  3. Click OK.

    After you create an alert template, you can edit the template or delete the template from the alert template list.
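
The naming constraint for alert rules and alert templates can be checked locally before you enter a name in the console. The following is a minimal sketch, not the console's own validation: the function name is_valid_rule_name is hypothetical, and the sketch assumes that the leading letter must also be lowercase, which the text implies but does not state explicitly.

```python
import re

# 3 to 64 characters in total: one leading letter plus 2 to 63 further characters.
# Assumption: the leading letter is lowercase, like the other allowed letters.
RULE_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]{2,63}")


def is_valid_rule_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming constraint."""
    return RULE_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["checkpoint_alert_1", "ab", "1st_rule", "Restart-Alert"]:
    print(candidate, is_valid_rule_name(candidate))
```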

Configure alert rules in the ARMS console

Note

If you use a RAM user or RAM role to access Realtime Compute for Apache Flink, the RAM user or RAM role must have the permissions to access ARMS. For more information, see Overview.

  1. Log on to the Realtime Compute for Apache Flink console.

  2. Find the workspace that you want to manage and choose More > Monitoring Indicator Configuration in the Actions column to go to the ARMS console.

    On the page that appears, view the workspace name, workspace ID, and the name of the Prometheus instance that corresponds to the workspace.

  3. Click the image.png icon in the upper-left corner.

  4. Create an alert rule or an alert template.

    • Create an alert rule: In the left-side navigation pane, choose Managed Service for Prometheus > Prometheus Alert Rules. On the Prometheus Alert Rules page, click Create Prometheus Alert Rule. For more information, see Create an alert rule for a Prometheus instance.

    • Create an alert template: In the left-side navigation pane, choose Managed Service for Prometheus > Prometheus Alert Rule Template. On the Prometheus Alert Rule Templates page, click Create Prometheus Alert Rule Template. For more information, see Create and manage an alert rule template.

FAQ

How do I add a DingTalk chatbot in the development console of Realtime Compute for Apache Flink?

  1. Add a custom DingTalk chatbot and obtain the webhook URL of the chatbot. For more information, see Add a custom DingTalk chatbot and obtain the webhook URL.

    Important

    To ensure that you receive alerts from a DingTalk chatbot, select at least Custom Keywords in the Security Settings section of the Add Robot dialog box, and configure Alarm as a keyword.

  2. Add a notification object.

    1. On the Deployments page, click the name of the desired deployment. On the page that appears, click the Alarm tab.

    2. On the Alarm tab, click the Alarm Rules tab. In the upper-right corner of the Alarm Rules tab, choose Add Rule > Custom Rule or choose Add Rule > Create Rule by Template > Add Rule Template.

    3. In the Create Rule or Create Rule Template panel, click Notification object management.

  3. In the Edit Contact Group dialog box, click the DingTalk tab. On the DingTalk tab, click Add DingTalk.

    In the Add DingTalk dialog box, configure the Name and URL parameters and click Submit.

  4. Go back to the Create Rule or Create Rule Template panel in Step 2. Select DingTalk for Notification and select the related DingTalk chatbot from the Notification object drop-down list.

    For more information about how to configure other parameters for an alert rule, see Create an alert rule.

  5. Click OK.
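
Before you rely on the chatbot for alerts, it can help to confirm that a test message passes the Custom Keywords check. The following minimal sketch builds such a message. It assumes the JSON text message format of DingTalk custom chatbots (msgtype set to text, with the message under text.content), so verify the format against the DingTalk documentation. The function name build_dingtalk_text_message is hypothetical, and the simple substring check is only an approximation of the keyword check that DingTalk performs.

```python
import json

# The keyword configured in the Security Settings section, as described above.
KEYWORD = "Alarm"


def build_dingtalk_text_message(content: str, keyword: str = KEYWORD) -> str:
    """Build a JSON text message for a custom DingTalk chatbot.

    Hypothetical helper. Raises ValueError if the content does not contain
    the configured keyword, because the chatbot would then reject the message.
    """
    if keyword not in content:
        raise ValueError(f"message must contain the keyword {keyword!r}")
    return json.dumps({"msgtype": "text", "text": {"content": content}})


payload = build_dingtalk_text_message("Alarm: test message for the Flink deployment")
print(payload)
```

To test the chatbot, you can send the resulting payload to the webhook URL obtained in Step 1 in an HTTP POST request with the Content-Type header set to application/json.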

How do I add a webhook in the development console of Realtime Compute for Apache Flink?

  1. In the Create Rule Template panel or the Create Rule panel, click Notification object management.

  2. In the Edit Contact Group dialog box, click the Webhook tab. On the Webhook tab, click Add Webhook.

  3. In the Add Webhook dialog box, configure the following parameters.

    • Name: Required. The name of the webhook that you want to add.

    • URL: Required. The webhook URL.

    • Headers: Optional. The request headers that store cookies and tokens. The format is key: value.

      Note

      Make sure that a space exists after the colon (:) between the key and the value.

    • Params: Optional. The request parameters that are in the key: value format.

      Note

      Make sure that a space exists after the colon (:) between the key and the value.

    • Body: Required. The request body that is used to store the POST request parameters and parameter data.

      You can use the $content placeholder in the request body. $content represents the actual alert message.

  4. Click OK.
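
The key: value format and the $content placeholder can be previewed locally with a minimal sketch. It does not reproduce how the console processes these fields: the function names parse_key_value_lines and render_body are hypothetical, and the sketch assumes that each header or parameter is entered on its own line.

```python
from string import Template
from typing import Dict


def parse_key_value_lines(text: str) -> Dict[str, str]:
    """Parse entries in the 'key: value' format, one entry per line (assumed).

    A space is required after the colon, as the notes above point out.
    """
    pairs: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # Skip blank lines.
        key, separator, value = line.partition(": ")
        if not separator:
            raise ValueError(f"expected 'key: value' with a space after the colon: {line!r}")
        pairs[key.strip()] = value.strip()
    return pairs


def render_body(body_template: str, alert_message: str) -> str:
    """Replace the $content placeholder with the alert message."""
    # safe_substitute leaves any other $-prefixed text in the body untouched.
    return Template(body_template).safe_substitute(content=alert_message)


headers = parse_key_value_lines("Authorization: Bearer example-token\nContent-Type: application/json")
body = render_body('{"text": "$content"}', "Job failed")
print(headers)
print(body)
```

If the request body is JSON, keep in mind that an alert message containing double quotation marks or line breaks can produce an invalid body after substitution, so the receiving service should be tested with realistic messages.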

References

  • Realtime Compute for Apache Flink allows you to use CloudMonitor or Managed Service for Prometheus of ARMS to implement deployment monitoring and alerting. CloudMonitor is free of charge. For more information about the differences in features and costs between CloudMonitor and Managed Service for Prometheus of ARMS, see Comparison between CloudMonitor and Managed Service for Prometheus of ARMS.

  • For more information about the metrics supported by Realtime Compute for Apache Flink, see Metrics.

  • If you no longer require Managed Service for Prometheus of ARMS for a workspace or a metric for a deployment in Realtime Compute for Apache Flink, you can disable Managed Service for Prometheus of ARMS for the workspace or discard the metric for the deployment. This helps reduce costs. You can restore the metric that you discard based on your business requirements. For more information, see Discard or restore metrics.