Realtime Compute for Apache Flink supports monitoring and alerting through CloudMonitor (a free monitoring service) or Application Real-Time Monitoring Service (ARMS). You can configure metric-based alerts for jobs, event-based alerts for jobs, and workflow alerts to detect and handle exceptions promptly. This topic describes how to configure monitoring and alerting for different monitoring services.
Limits
You cannot configure monitoring and alerting for Flink jobs that are deployed to session clusters.
You cannot configure monitoring and alerting for batch jobs.
ARMS does not support workflow alerts. However, you can configure them using CloudMonitor (a free monitoring service).
Configuration guide
Select a configuration method based on the monitoring service type of your workspace (Check the monitoring service type):
CloudMonitor (a free monitoring service)
Metric-based alerting: Configure alerts based on changes in metric values, such as CPU, latency, and data volume. You can apply this to single jobs or multiple jobs in batches.
Event-based alerting (including workflows): Configure alerts based on event occurrences, such as a job failure. You can apply this to job and workflow events.
ARMS monitoring service
Metric-based alerting: Configure alerts based on changes in metric values. Six core metrics are supported. You can apply this to single jobs or multiple jobs in batches.
Event-based alerting: Only job failure events are supported. To configure alerts for other events, see Event-based alerting with CloudMonitor.
Switch the monitoring service type
You can switch the monitoring service type to meet different business requirements.
In the Realtime Compute for Apache Flink console, click More in the Actions column of the target workspace to switch to another monitoring service.
Read the notes about switching the service type carefully. You can proceed with the configuration only after selecting the confirmation checkbox.
Configure monitoring and alerting with CloudMonitor
Metric-based alerting
Only the Alibaba Cloud account that purchased the workspace and the RAM users and RAM roles that have permissions on the project can configure alerts on CloudMonitor.
Log on to the CloudMonitor console.
In the left navigation pane, choose .
Click Create Alert Rule to configure the parameters.
Product: Realtime Compute for Apache Flink.
Resource Range: Select Instances. The alert rule applies to the specified Realtime Compute for Apache Flink workspace.
Associated Resources: Click Add Instance, select the workspace in the destination region (How do I view information such as the workspace ID?), and then click OK.
Rule Description: Click Add Rule > Simple Metric or Combined Metrics to go to the Set Rule Description panel.

In the Dimension section, you can configure namespace (Flink project name) and deploymentID (the Deployment ID on the Deployment Details tab of the Flink job) to specify the job to monitor.
Note: If no data is available in the namespace and deploymentID drop-down lists, you can enter the information manually.
If you leave these parameters empty, all jobs in all projects are monitored.
Note: In a production environment, single-metric alerts can easily lead to false positives or false negatives. Combined-metric alerts better reflect real business exceptions. For more information, see Recommended configurations and templates for monitoring and alerting.
For more information about other parameters, see Create an alert rule.
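The scoping behavior of the namespace and deploymentID dimensions can be sketched as follows. This is only an illustration of the rule described above, not CloudMonitor's implementation: the job records, the dictionary keys, and the function name jobs_in_scope are hypothetical.

```python
def jobs_in_scope(jobs, namespace=None, deployment_id=None):
    """Return the jobs an alert rule would cover for the given dimensions.

    `jobs` is a hypothetical list of dicts with "namespace" and
    "deploymentId" keys. A dimension left as None is treated as "empty",
    which the documentation describes as matching everything.
    """
    return [
        job
        for job in jobs
        if (namespace is None or job["namespace"] == namespace)
        and (deployment_id is None or job["deploymentId"] == deployment_id)
    ]


# Hypothetical inventory of jobs in one workspace.
jobs = [
    {"namespace": "default", "deploymentId": "dep-001"},
    {"namespace": "default", "deploymentId": "dep-002"},
    {"namespace": "analytics", "deploymentId": "dep-003"},
]

all_jobs = jobs_in_scope(jobs)                            # both dimensions empty
one_project = jobs_in_scope(jobs, namespace="default")    # one project
one_job = jobs_in_scope(jobs, "default", "dep-002")       # one job
```

Leaving both dimensions empty selects every job, which is why an unscoped rule can generate alerts for projects you did not intend to monitor.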
Event-based alerting (including workflows)
Only the Alibaba Cloud account that purchased the workspace and the RAM users and RAM roles that have permissions on the project can configure alerts or subscribe to events on CloudMonitor.
Job event alerting
You can subscribe to system event alerts for jobs by configuring conditions. You can also configure event alerts in batches.
Log on to the CloudMonitor console.
In the navigation pane on the left, choose .
On the Subscription Policy tab, click Create Subscription Policy.
On the Create Subscription Policy page, you can configure the parameters. For more information about the parameters, see Manage event subscriptions (recommended).

Subscription Type: System Event.
For Product, select Realtime Compute for Apache Flink.
The supported events are Job running failed (not supported when using the ARMS monitoring service), Post-processing for ECS breakdown, and Impact of proactive O&M on ECS.
Event Content: You can specify a job or configure batch alerts by entering the following Flink information in the event content.
Workspace ID: Configures event alerts for all jobs in all projects within the target workspace. For more information about how to view the workspace ID, see How do I view information such as the workspace ID?.
Project name: Configures event alerts for all jobs in the target project.
Deployment name: Enter the name of the deployment for event alerting. Separate multiple names with a comma (,). If your account has deployments with the same name, use the Deployment ID instead.
Deployment ID: The ID of the deployment to which the event alert applies. Separate multiple IDs with a comma (,). You can find this ID under Deployment Job ID on the Deployment Details tab of a Flink job.
Note: If you do not set Application Group, Event Content, or Event Resource, the subscription applies to all workspaces in your account.
Workflow event alerting
You can configure conditions to subscribe to system event alerts for Flink workflows. You can also configure event alerts in batches. For more information about workflows, see Manage workflows.
Obtain the resource ID of the workflow node.
Log on to the CloudMonitor console.
In the left navigation pane, choose .
On the Event Monitoring tab, set Product to Realtime Compute for Apache Flink and Event Name to Workflow task status change. Then, click Search.
Filter for the resource ID of the workflow node.

The resource format is:
acs:flink:cn-hangzhou:<AlibabaCloudAccountID>:resourceId/workspaceId/<workspaceId-namespaceId>#workflowDefinitionName/<workflowDefinitionName>#taskDefinitionName/<taskDefinitionName>
You can also use this format to directly construct the resource ID for your workflow node.
Parameter
Description
<AlibabaCloudAccountID>: The ID of the Alibaba Cloud account that owns the Flink workspace.
<workspaceId-namespaceId>: Consists of the workspaceId and namespaceId, joined by a hyphen (-).
workspaceId: The workspace ID. For more information, see Workspace management and operations.
namespaceId: Your namespace name.
<workflowDefinitionName>: The workflow name.
<taskDefinitionName>: The workflow node name.
Note: Workflow status change events may take a few minutes to appear in CloudMonitor.
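If you construct the resource ID yourself, a small helper such as the following can reduce typing mistakes. It is a minimal sketch that only assembles the format shown above. The function name and all sample values are hypothetical, and the region parameter assumes that the cn-hangzhou segment of the format is the region of the workspace.

```python
def workflow_node_resource_id(account_id, workspace_id, namespace,
                              workflow_name, task_name,
                              region="cn-hangzhou"):
    """Assemble the resource ID of a workflow node.

    Assumption: the "cn-hangzhou" segment in the documented format is the
    workspace region, so it is exposed here as a parameter.
    """
    parts = {
        "account_id": account_id,
        "workspace_id": workspace_id,
        "namespace": namespace,
        "workflow_name": workflow_name,
        "task_name": task_name,
        "region": region,
    }
    # Reject empty segments early, because they would produce a malformed ID.
    for label, value in parts.items():
        if not value:
            raise ValueError(f"{label} must not be empty")

    return (
        f"acs:flink:{region}:{account_id}:resourceId/workspaceId/"
        f"{workspace_id}-{namespace}"
        f"#workflowDefinitionName/{workflow_name}"
        f"#taskDefinitionName/{task_name}"
    )


# Hypothetical values, for illustration only.
resource_id = workflow_node_resource_id(
    account_id="1234567890",
    workspace_id="ab12cd34",
    namespace="default",
    workflow_name="daily_etl",
    task_name="load_orders",
)
print(resource_id)
```

Compare the result with a resource ID returned by the Event Monitoring search before you rely on it in a subscription policy.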
Subscribe to event notifications.
In the navigation pane on the left, choose .
On the Subscription Policy tab, click Create Subscription Policy.
On the Create Subscription Policy page, set the parameters for the policy. For more information about the parameters, see Manage event subscriptions (recommended).
Subscription Type: System Event.
For Product, select Realtime Compute for Apache Flink.
For Event Name, select Workflow task status change.
Event Content: You can enter values such as toState: FAILED (the workflow failed), toState: SUCCESS (the workflow succeeded), or fromState: SCHEDULED, toState: RUNNING (the workflow transitions from Scheduled to Running).
Event resources: Enter the resource ID of the workflow from Step 1. If you enter multiple IDs, separate them with a comma (,).
Event Type, Event Level, or Application Group: Leave these parameters unset.
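The state filters above can be read as conditions on the fromState and toState fields of a status change event. The following sketch illustrates that reading only. The event dictionaries and the function name matches_state_filter are hypothetical, and CloudMonitor's actual matching of Event Content is not documented here.

```python
def matches_state_filter(event, from_state=None, to_state=None):
    """Return True if a status change event satisfies the given filter.

    `event` is a hypothetical dict with "fromState" and "toState" keys.
    A filter field left as None places no constraint on the event.
    """
    if from_state is not None and event.get("fromState") != from_state:
        return False
    if to_state is not None and event.get("toState") != to_state:
        return False
    return True


# Hypothetical events, for illustration only.
started = {"fromState": "SCHEDULED", "toState": "RUNNING"}
failed = {"fromState": "RUNNING", "toState": "FAILED"}

failure_alert = matches_state_filter(failed, to_state="FAILED")
start_alert = matches_state_filter(started, from_state="SCHEDULED", to_state="RUNNING")
no_alert = matches_state_filter(started, to_state="FAILED")
```

A filter on toState alone covers every transition into that state, while adding fromState narrows the subscription to one specific transition.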
Configure monitoring and alerting with ARMS
Metric-based alerting
Multi-metric monitoring in ARMS is supported only through custom PromQL. For a simpler configuration, you can use metric-based alerting with CloudMonitor.
Single-job configuration (Realtime Compute console)
You can create new alert rules for a target job. Alternatively, you can create an alert rule template and use it to create alerts for a target job. Using a template speeds up the configuration of monitoring and alerting.
The Realtime Compute development console displays only alert events from the last 48 hours. To view older alert events, go to Alert Management in the ARMS console.
Go to the alert configuration page.
Log on to the Realtime Compute for Apache Flink console and click Console in the Actions column of the target workspace.
Go to and click the name of the target job.
Click the Alert Configuration tab.
On the Alert Rules tab, select .
You can also choose . This lets you use a template to create an alert rule directly or with minor modifications, which accelerates the configuration of monitoring and alerting.
Enter the alert rule information.
Category
Parameter
Description
Rule Details
Rule Name
Must start with a letter and contain only lowercase letters, digits, and underscores (_). The length must be 3 to 64 characters.
Description
The remarks for the rule.
Content
Configure the conditions that trigger an alert. After configuration, the specified metric value is compared with the threshold at a specified interval. If the condition is met, an alert is automatically triggered.
Metric:
Restart Count in 1 Minute: The number of JobManager restarts within 1 minute. Unit: times.
Checkpoint Count in 5 Minutes: The number of successful checkpoints within 5 minutes. Unit: count.
Emit Delay: The business latency, which is the time difference between when data is generated and when it leaves the Source operator. Unit: seconds.
Important: The data generation time depends on the timestamp recorded in the external system. If the external system has no timestamp, or if the timestamp is written incorrectly when data is written to the external system, the Emit Delay value is inaccurate and cannot reflect the true latency. We recommend configuring multi-metric alerts to confirm that a real exception has occurred. For more information, see Recommended configurations and templates for monitoring and alerting.
IN RPS: The number of input records per second. Unit: records per second.
OUT RPS: The number of output records per second. Unit: records per second.
Source Idle Time: The duration that the source has not processed data. Unit: milliseconds.
Job Failed: The job failed.
Time Difference: The length of the historical data time window that the system queries backward from the current time during each check. Unit: minutes.
Operator: Supports >= and <=.
Threshold: The value to compare with the metric.
If you select >= as the operator, the MAX value on the vertical axis is used. If the maximum value within the time difference is greater than or equal to the threshold, the alert rule is triggered.
If you select <= as the operator, the MIN value on the vertical axis is used. If the minimum value within the time difference is less than or equal to the threshold, the alert rule is triggered.
Assume you are monitoring the "Checkpoint Count in 5 Minutes" metric, with a time difference of 10 minutes, a threshold of 2, and the "<=" operator.
The system will check the historical data of the last 10 minutes every minute to see if the number of successful checkpoints in any "5-minute period" within those 10 minutes is less than or equal to 2. If so, an alert is triggered.
Effective Period
The effective period for alert monitoring. You can specify it to be effective only during the day (9:00 to 18:00). By default, it is effective all day.
Alert Frequency
The minimum interval between repeated alerts. At most one alert is sent within the specified number of minutes. Valid values: 1 to 1440 minutes (24 hours).
Notification Method
Notification Method
You can select multiple notification methods. The supported methods are:
DingTalk: DingTalk message.
Email: Email message.
SMS: Text message.
Webhook: Web service address.
Phone: Phone call.
Ensure that the recipient's phone number is verified. Otherwise, phone notifications are not delivered. To verify a number, click Notification Recipient Management below. If the Unverified tag appears in the Phone column for the target contact on the Contacts tab, click the tag to complete verification.

Important: Ensure that you have created and added available notification recipients. Otherwise, the alert notification will fail. For example, if you select DingTalk as the notification method, add a notification recipient of the DingTalk Robot type.
Notification Recipient
You can select or search for multiple notification recipients. Before you select a recipient, you must create one by clicking Notification Recipient Management on the right. For more information, see Workspace and namespace management.
Alert Denoising
Click Advanced Configuration and turn on the Alert Denoising switch.
After you turn on the Alert Denoising switch, alerts will not be sent for scenarios where the job can recover quickly (such as short-term failovers triggered by cluster scheduling or auto-tuning). Alerts are sent only when your set threshold condition is continuously met.
No-data Alert
Click Advanced Configuration, turn on the No-data Alert switch, and enter the duration for continuous no-data.
After you enable this feature, an alert is triggered if no monitoring data is reported within the specified duration. Missing data is usually caused by JobManager exceptions, abnormal job stops, or failures in the reporting link.
Click OK.
The saved alert rule is enabled by default and appears in the alert rule list. From the list, you can stop, edit, or delete the rule.
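The operator and threshold semantics described in the Content parameter can be sketched as follows. This is a simplified model of the documented behavior, not the ARMS implementation. The function name should_trigger and the sample values are hypothetical, and returning False for an empty window is an assumption (missing data is handled by the separate No-data Alert option).

```python
def should_trigger(samples, operator, threshold):
    """Evaluate one check of an alert rule.

    `samples` holds the metric values observed within the Time Difference
    window. With ">=" the maximum is compared with the threshold, and with
    "<=" the minimum is compared, as described in the parameter table.
    """
    if not samples:
        # Assumption: an empty window does not trigger a threshold alert.
        return False
    if operator == ">=":
        return max(samples) >= threshold
    if operator == "<=":
        return min(samples) <= threshold
    raise ValueError(f"unsupported operator: {operator!r}")


# Hypothetical per-minute values of "Checkpoint Count in 5 Minutes"
# over a 10-minute Time Difference window.
window = [5, 5, 4, 3, 2, 3, 4, 5, 5, 5]

# The minimum in the window is 2, so "<= 2" is met and the rule triggers.
checkpoint_alert = should_trigger(window, "<=", 2)
```

Because the extreme value in the window decides the result, a single bad sample keeps the condition met until it ages out of the Time Difference window.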
Single-job or multi-job configuration (ARMS console)
Log on to the Realtime Compute for Apache Flink console.
In the Actions column for the target workspace, choose to navigate to the ARMS console.
The workspace name, workspace ID, and the corresponding Prometheus instance name are displayed at the top.

To create an alert rule, click Alert Rules in the navigation pane on the left.

Check Type: Supports metric-based alerting through static thresholds and custom PromQL (excluding alert metrics already supported by Flink).
The filter conditions support batch alert configuration. For Namespace, you can enter a project name or select All to apply the configuration to all projects in the workspace. For Deployment, you can enter the Deployment ID of a target job or select All to apply the configuration to all jobs in the project. The deployment ID is available on the Deployment Details tab of the Flink job.
For more information about other configuration parameters, see Create a Prometheus alert rule. You can also create a Prometheus alert rule template. For more information, see Create a Prometheus alert rule template.
Event-based alerting
Only job failure events are supported. You can configure this by selecting the Job Failed rule in Metric-based alerting. Other event alerts are not supported. To configure other event alerts, use Event-based alerting with CloudMonitor.
FAQ
How do I add a DingTalk robot for alerts in the Realtime Compute development console?
How do I create a webhook in the Realtime Compute development console?
References
Realtime Compute for Apache Flink supports CloudMonitor (a free monitoring service) or Prometheus Service of ARMS for job monitoring and alerting. For a comparison of features, costs, and other details, see Comparison of alerting features between CloudMonitor and ARMS.
ARMS supports features such as alert escalation and scheduling. For more information, see Escalation Policy and related tutorials.
CloudMonitor supports receiving alert notifications through DingTalk groups, Lark groups, and other methods. For more information about the configuration methods, see Alert notification methods.
For more information about the supported monitoring metrics, see Monitoring metrics.
You can disable monitoring and alerting or discard specific metrics (when using ARMS monitoring and alerting) to save costs. You can resume metric collection later if needed. For more information, see Discard or restore monitoring metrics.

