Realtime Compute for Apache Flink: Configure monitoring and alerting

Last Updated: Dec 17, 2025

Realtime Compute for Apache Flink supports monitoring and alerting through CloudMonitor (a free monitoring service) or Application Real-Time Monitoring Service (ARMS). You can configure metric-based alerts for jobs, event-based alerts for jobs, and workflow alerts to detect and handle exceptions promptly. This topic describes how to configure monitoring and alerting for different monitoring services.

Limits

  • You cannot configure monitoring and alerting for Flink jobs that are deployed to session clusters.

  • You cannot configure monitoring and alerting for batch jobs.

  • ARMS does not support workflow alerts. However, you can configure them using CloudMonitor (a free monitoring service).

Configuration guide

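Select a configuration method based on the monitoring service type of your workspace. To check the type, see How do I check the monitoring service type for a workspace? in the FAQ.

  • CloudMonitor: See Configure monitoring and alerting with CloudMonitor.

  • ARMS: See Configure monitoring and alerting with ARMS.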

Switch the monitoring service type

You can switch the monitoring service type to meet different business requirements.

In the Realtime Compute for Apache Flink console, click More in the Actions column of the target workspace to switch to another monitoring service.

Note

Read the notes about switching the service type carefully. You can proceed with the configuration only after selecting the confirmation checkbox.

Configure monitoring and alerting with CloudMonitor

Metric-based alerting

Important

Only the Alibaba Cloud account that purchased the workspace and the RAM users and RAM roles that have permissions on the project can configure alerts on CloudMonitor.

  1. Log on to the CloudMonitor console.

  2. In the left navigation pane, choose Alerts > Alert Rules.

  3. Click Create Alert Rule to configure the parameters.

    The parameters are described as follows:

    • Product: Realtime Compute for Apache Flink.

    • Resource Range: Select Instances. The alert rule applies to the specified Realtime Compute for Apache Flink workspace.

    • Associated Resources: Click Add Instance, select the workspace in the destination region (How do I view information such as the workspace ID?), and then click OK.

    • Rule Description: Click Add Rule > Simple Metric or Combined Metrics to go to the Set Rule Description panel.

      In the Dimension section, you can configure namespace (Flink project name) and deploymentID (the Deployment ID on the Deployment Details tab of the Flink job) to specify the job to monitor.

      Note

      • If no data is available in the namespace and deploymentID drop-down lists, you can enter the information manually.

      • If you leave these parameters empty, all jobs in all projects are monitored.

Event-based alerting (including workflows)

Important

Only the Alibaba Cloud account that purchased the workspace and the RAM users and RAM roles that have permissions on the project can configure alerts or subscribe to events on CloudMonitor.

Job event alerting

You can subscribe to system event alerts for jobs by configuring conditions. You can also configure event alerts in batches.

  1. Log on to the CloudMonitor console.

  2. In the navigation pane on the left, choose Event Center > Event Subscription.

  3. On the Subscription Policy tab, click Create Subscription Policy.

  4. On the Create Subscription Policy page, you can configure the parameters. For more information about the parameters, see Manage event subscriptions (recommended).

    • Subscription Type: System Event.

    • For Product, select Realtime Compute for Apache Flink.

    • For Event Name, select an event. The supported events are Job running failed (not supported when the workspace uses the ARMS monitoring service), Post-processing for ECS breakdown, and Impact of proactive O&M on ECS.

    • Event Content: You can specify a job or configure batch alerts by entering the following Flink information in the event content.

      • Workspace ID: Configures event alerts for all jobs in all projects within the target workspace. For more information about how to view the workspace ID, see How do I view information such as the workspace ID?.

      • Project name: Configures event alerts for all jobs in the target project.

      • Deployment name: The name of the deployment to which the event alert applies. Separate multiple names with a comma (,). If your account has multiple deployments with the same name, use the deployment ID instead.

      • Deployment ID: The ID of the deployment to which the event alert applies. Separate multiple IDs with a comma (,). You can find this ID under Deployment Job ID on the Deployment Details tab of a Flink job.

    Note

    If you do not set Application Group, Event Content, or Event Resource, the subscription applies to all workspaces in your account.

Workflow event alerting

You can configure conditions to subscribe to system event alerts for Flink workflows. You can also configure event alerts in batches. For more information about workflows, see Manage workflows.

  1. Obtain the resource ID of the workflow node.

    1. Log on to the CloudMonitor console.

    2. In the left navigation pane, choose Event Center > System Event.

    3. On the Event Monitoring tab, set Product to Realtime Compute for Apache Flink and Event Name to Workflow task status change. Then, click Search.

    4. Filter for the resource ID of the workflow node.

      The resource format is acs:flink:cn-hangzhou:<AlibabaCloudAccountID>:resourceId/workspaceId/<workspaceId-namespaceId>#workflowDefinitionName/<workflowDefinitionName>#taskDefinitionName/<taskDefinitionName>. You can also use this format to directly construct the resource ID for your workflow node.

      The placeholders in the format are described as follows:

      • <AlibabaCloudAccountID>: The ID of the Alibaba Cloud account that owns the Flink workspace.

      • <workspaceId-namespaceId>: Consists of the workspaceId and namespaceId, joined by a hyphen (-).

        • workspaceId: The workspace ID. For more information, see Workspace management and operations.

        • namespaceId: Your namespace name.

      • <workflowDefinitionName>: The workflow name.

      • <taskDefinitionName>: The workflow node name.

      Note

      Workflow status change events may take a few minutes to appear in CloudMonitor.

  2. Subscribe to event notifications.

    1. In the navigation pane on the left, choose Event Center > Event Subscription.

    2. On the Subscription Policy tab, click Create Subscription Policy.

    3. On the Create Subscription Policy page, set the parameters for the policy. For more information about the parameters, see Manage event subscriptions (recommended).

      • Subscription Type: System Event.

      • For Product, select Realtime Compute for Apache Flink.

      • For Event Name, select Workflow task status change.

      • Event Content: You can enter values such as toState: FAILED (the workflow failed), toState: SUCCESS (the workflow succeeded), or fromState: SCHEDULED, toState: RUNNING (the workflow transitions from Scheduled to Running).

      • Event resources: Enter the resource ID of the workflow from Step 1. If you enter multiple IDs, separate them with a comma (,).

      • Event Type, Event Level, or Application Group: Leave these parameters unset.
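
The resource ID format from Step 1 and the comma-separated Event resources field from Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and all sample values (account ID, workspace ID, namespace, workflow, and node names) are hypothetical, and treating the cn-hangzhou segment as a replaceable region ID is an assumption that the format above does not confirm.

```python
def workflow_resource_id(account_id, workspace_id, namespace,
                         workflow_name, task_name, region="cn-hangzhou"):
    """Build a workflow node resource ID following the documented format.

    The region parameter is an assumption; the documented format shows
    only cn-hangzhou in that position.
    """
    return (
        f"acs:flink:{region}:{account_id}:resourceId/workspaceId/"
        f"{workspace_id}-{namespace}"
        f"#workflowDefinitionName/{workflow_name}"
        f"#taskDefinitionName/{task_name}"
    )


def event_resources_field(resource_ids):
    """Join several resource IDs with commas for the Event resources field."""
    return ",".join(resource_ids)


# Hypothetical sample values, for illustration only.
ids = [
    workflow_resource_id("1234567890123456", "ws123", "default",
                         "daily_etl", "load_orders"),
    workflow_resource_id("1234567890123456", "ws123", "default",
                         "daily_etl", "publish_report"),
]
print(event_resources_field(ids))
```

Compare a constructed ID with one shown on the Event Monitoring tab before you rely on it in a subscription policy.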

Configure monitoring and alerting with ARMS

Metric-based alerting

Note

Multi-metric monitoring in ARMS is supported only through custom PromQL. For a simpler configuration, you can use metric-based alerting with CloudMonitor.

Single-job configuration (Realtime Compute console)

You can create new alert rules for a target job. Alternatively, you can create an alert rule template and use it to create alerts for a target job. Using a template speeds up the configuration of monitoring and alerting.

Note

The Realtime Compute development console displays only alert events from the last 48 hours. To view older alert events, go to Alert Management in the ARMS console.

  1. Go to the alert configuration page.

    1. Log in to the Realtime Compute for Apache Flink console and click Console in the Actions column of the target workspace.

    2. Go to Operation Center > Job O&M and click the name of the target job.

    3. Click the Alert Configuration tab.

  2. On the Alert Rules tab, select Add Alert Rule > Custom Rule.

    You can also choose Add Alert Rule > Rule Template. This lets you use a template to create an alert rule directly or with minor modifications, which accelerates the configuration of monitoring and alerting.

  3. Enter the alert rule information.

    The parameters are listed by category. Each parameter name is followed by its description.

    Category: Rule Details

    Rule Name

    Must start with a letter and contain only lowercase letters, digits, and underscores (_). The length must be 3 to 64 characters.

    Description

    The remarks for the rule.

    Content

    Configure the conditions that trigger an alert. After configuration, the specified metric value is compared with the threshold at a specified interval. If the condition is met, an alert is automatically triggered.

    • Metric:

      • Restart Count in 1 Minute: The number of JobManager restarts within 1 minute. Unit: times.

      • Checkpoint Count in 5 Minutes: The number of successful checkpoints within 5 minutes. Unit: count.

      • Emit Delay: The business latency, which is the time difference between when data is generated and when it leaves the Source operator. Unit: seconds.

        Important

        The data generation time depends on the timestamp recorded in the external system. If the external system has no timestamp, or if the timestamp is written incorrectly when data is written to the external system, the Emit Delay value is inaccurate and cannot reflect the true latency. We recommend configuring multi-metric alerts to determine the real event. For more information, see Recommended configurations and templates for monitoring and alerting.

      • IN RPS: The number of input records per second. Unit: records per second.

      • OUT RPS: The number of output records per second. Unit: records per second.

      • Source Idle Time: The duration that the source has not processed data. Unit: milliseconds.

      • Job Failed: The job failed.

    • Time Difference: The length of the historical data time window that the system queries backward from the current time during each check. Unit: minutes.

    • Operator: Supports >= and <=.

    • Threshold: The value to compare with the metric.

      • If you select >= as the operator, the MAX value on the vertical axis is used. If the maximum value within the time difference is greater than or equal to the threshold, the alert rule is triggered.

      • If you select <= as the operator, the MIN value on the vertical axis is used. If the minimum value within the time difference is less than or equal to the threshold, the alert rule is triggered.

    Assume you are monitoring the "Checkpoint Count in 5 Minutes" metric, with a time difference of 10 minutes, a threshold of 2, and the "<=" operator.

    The system will check the historical data of the last 10 minutes every minute to see if the number of successful checkpoints in any "5-minute period" within those 10 minutes is less than or equal to 2. If so, an alert is triggered.

    Effective Period

    The effective period for alert monitoring. You can specify it to be effective only during the day (9:00 to 18:00). By default, it is effective all day.

    Alert Frequency

    Limits how often alerts are sent: at most one alert is sent within the specified number of consecutive minutes. Valid values: 1 to 1440 minutes (24 hours).

    Category: Notification Method

    Notification Method

    You can select multiple notification methods. The supported methods are:

    • DingTalk

    • Email

    • SMS: Text message.

    • Webhook: A web service address.

    • Phone: Phone call.

      Ensure that the recipient's phone number is verified. Otherwise, phone notifications are not delivered. To verify a number, click Notification Recipient Management. On the Contacts tab, if the Unverified tag appears in the Phone column for the target contact, click the tag to complete verification.

    Important

    Ensure that you have created and added available notification recipients. Otherwise, the alert notification fails. For example, if you select DingTalk as the notification method, you must also add a notification recipient of the DingTalk Robot type.

    Notification Recipient

    You can select or search for multiple notification recipients. Before you select a recipient, you must create one by clicking Notification Recipient Management on the right. For more information, see Workspace and namespace management.

    Alert Denoising

    Click Advanced Configuration and turn on the Alert Denoising switch.

    After you turn on the Alert Denoising switch, alerts are not sent when the job can recover quickly, such as during short-term failovers triggered by cluster scheduling or auto-tuning. Alerts are sent only when the configured threshold condition is met continuously.

    No-data Alert

    Click Advanced Configuration, turn on the No-data Alert switch, and enter the duration for continuous no-data.

    After you enable this feature, the system detects when no monitoring data is reported. If no data is reported within the specified duration, an alert is triggered. This usually happens because of JobManager exceptions, abnormal job stops, or failures in the metric reporting link.

  4. Click OK.

    The saved alert rule is enabled by default and appears in the alert rule list. From the list, you can stop, edit, or delete the rule.
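
The threshold comparison described in step 3 can be sketched as follows. This illustrates the documented MAX/MIN logic only and is not the service's implementation; the function name rule_triggered and the one-sample-per-minute data are assumptions made for the example.

```python
def rule_triggered(samples, operator, threshold):
    """Evaluate one check of an alert rule.

    samples: metric values observed within the Time Difference window.
    operator: ">=" compares the window maximum, "<=" the window minimum.
    """
    if not samples:
        # No data in the window; this case belongs to the No-data Alert option.
        return False
    if operator == ">=":
        return max(samples) >= threshold
    if operator == "<=":
        return min(samples) <= threshold
    raise ValueError(f"unsupported operator: {operator!r}")


# "Checkpoint Count in 5 Minutes" sampled once per minute over a
# 10-minute time difference (hypothetical values).
degraded = [5, 5, 4, 3, 2, 4, 5, 5, 5, 5]
healthy = [5, 5, 4, 3, 3, 4, 5, 5, 5, 5]

print(rule_triggered(degraded, "<=", 2))  # True: the minimum is 2
print(rule_triggered(healthy, "<=", 2))   # False: the minimum is 3
```

Because a single low sample anywhere in the window triggers the rule, a longer time difference makes a "<=" rule more sensitive to brief dips.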

Single-job or multi-job configuration (ARMS console)

  1. Log on to the Realtime Compute for Apache Flink console.

  2. In the Actions column for the target workspace, choose More > Monitoring Metrics Configuration to navigate to the ARMS console.

    The workspace name, workspace ID, and the corresponding Prometheus instance name are displayed at the top.

  3. To create an alert rule, click Alert Rules in the navigation pane on the left.

    • Check Type: Supports metric-based alerting through static thresholds and custom PromQL (excluding alert metrics already supported by Flink).

    • The filter conditions support batch alert configuration. For Namespace, you can enter a project name or select All to apply the configuration to all projects in the workspace. For Deployment, you can enter the Deployment ID of a target job or select All to apply the configuration to all jobs in the project. The deployment ID is available on the Deployment Details tab of the Flink job.

    For more information about other configuration parameters, see Create a Prometheus alert rule. You can also create a Prometheus alert rule template. For more information, see Create a Prometheus alert rule template.

Event-based alerting

Only job failure events are supported. You can configure this by selecting the Job Failed rule in Metric-based alerting. Other event alerts are not supported. To configure other event alerts, use Event-based alerting with CloudMonitor.

FAQ

How do I check the monitoring service type for a workspace?

You select the monitoring service type when you create a workspace. After a workspace is created, go to the Operation Center > Job O&M page and click the name of the target job. If the Alert Configuration tab is displayed, the workspace uses the pay-as-you-go Prometheus monitoring service (ARMS). Otherwise, it uses the free monitoring service (CloudMonitor).

How do I add a DingTalk robot for alerts in the Realtime Compute development console?

  1. Add a custom DingTalk robot and obtain its webhook address. For more information, see Add a custom DingTalk robot and obtain its webhook address.

    Important

    In the Security Settings section, select at least Custom Keywords and set at least one keyword to Alert to receive alert information.

  2. Add a notification recipient.

    1. On the Operation Center > Job O&M page, click the name of the target job, and then click the Alert Configuration tab.

    2. Choose Add Alert Rule > Custom Rule or Rule Template.

    3. On the Create Rule or Create Alert Rule Template page, click Notification Recipient Management.

  3. On the DingTalk Robot tab, click Add DingTalk Robot.

    Enter the DingTalk robot's Name and Address, and then click Submit.

  4. Return to the Create Rule or Create Alert Rule Template page from Step 2. Set Notification Method to DingTalk and Notification Recipient to the corresponding DingTalk robot.

    For descriptions of other parameters in the alert rule, see Single-job configuration (Realtime Compute console).

  5. Click OK.
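
To check that the robot from Step 1 accepts messages before you attach it to an alert rule, you can post a test message that contains the custom keyword. The sketch below assumes the text message format of DingTalk custom robots (a JSON body with msgtype and text.content) and the keyword Alert from the Security Settings; the function names are hypothetical, and the exact keyword matching behavior of DingTalk is not confirmed here.

```python
import json
import urllib.request


def build_text_message(content, keyword="Alert"):
    """Return a text-message payload whose content includes the keyword."""
    if keyword not in content:
        # Prepend the keyword so that the keyword-based security check can pass.
        content = f"{keyword}: {content}"
    return {"msgtype": "text", "text": {"content": content}}


def send_test_message(webhook_url, content):
    """POST a test message to the robot's webhook address."""
    data = json.dumps(build_text_message(content)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


# Build the payload only; call send_test_message with your own webhook address.
print(json.dumps(build_text_message("connectivity test")))
```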

How do I create a webhook in the Realtime Compute development console?

  1. On the Alert Template or Rule Information page, click Notification Recipient Management.

  2. On the Webhook tab, click Create Webhook.

  3. On the Create Webhook page, you can enter the webhook information.

    • Name: Required. The webhook name.

    • URL: Required. The endpoint of the web service.

    • Headers: Optional. The request headers, used to store cookie and token information. The format is key: value. Ensure there is a space after the colon between the key and the value.

    • Params: Optional. The request parameters. The format is key: value. Ensure there is a space after the colon between the key and the value.

    • Body: Required. The request body, used to store POST parameters and data. You can use the $content placeholder in the Body string to output the alert content.

  4. Click OK.
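
The key: value format and the $content placeholder can be checked locally before you save the webhook. The sketch below is an approximation, not the console's actual parser: the function names are hypothetical, one pair per line is an assumption, and string.Template is used only because its $name syntax matches the placeholder.

```python
from string import Template


def parse_pairs(text):
    """Parse 'key: value' lines and require the space after the colon."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(": ")
        if not sep or not key:
            raise ValueError(
                f"expected 'key: value' with a space after the colon: {line!r}"
            )
        pairs[key] = value
    return pairs


def render_body(body_template, alert_content):
    """Replace the $content placeholder with the alert content."""
    # safe_substitute leaves any other $-prefixed text untouched.
    return Template(body_template).safe_substitute(content=alert_content)


headers = parse_pairs("Authorization: Bearer abc\nX-Env: prod")
body = render_body('{"msg": "$content"}', "job failed")
print(headers)
print(body)
```

The substitution does not escape the alert content, so a JSON body can become invalid if the content contains quotation marks. The receiving service should tolerate this or the body should use a plain-text format.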

References

  • Realtime Compute for Apache Flink supports CloudMonitor (a free monitoring service) or Prometheus Service of ARMS for job monitoring and alerting. For a comparison of features, costs, and other details, see Comparison of alerting features between CloudMonitor and ARMS.

  • ARMS supports features such as alert escalation and scheduling. For more information, see Escalation Policy and related tutorials.

  • CloudMonitor supports receiving alert notifications through DingTalk groups, Lark groups, and other methods. For more information about the configuration methods, see Alert notification methods.

  • For more information about the supported monitoring metrics, see Monitoring metrics.

  • You can disable monitoring and alerting or discard specific metrics (when using ARMS monitoring and alerting) to save costs. You can resume metric collection later if needed. For more information, see Discard or restore monitoring metrics.