This document describes key alert metrics, recommended alert configurations, and operations and maintenance (O&M) examples for Realtime Compute for Apache Flink. You can use this guide to monitor system performance and diagnose faults more effectively.
Prerequisites
For more information, see Configure monitoring and alerting. You can choose the configuration method that corresponds to the monitoring service used by your workspace.
Multi-metric monitoring in ARMS requires custom PromQL statements. If you need a simpler configuration, you can use Cloud Monitor to configure alerts instead.
Recommended alert rule configurations
| Scenario | Combined metric/Event name | Rule configuration | Level | Action |
| --- | --- | --- | --- | --- |
| Job availability | Job status event | = FAILED (event alerting) | P0 | 1. Check whether the restart policy is misconfigured. We recommend the default configuration. 2. Determine whether the cause is the restart policy or an abnormal JobManager or TaskManager. 3. Restore the job from the latest snapshot or successful checkpoint. |
| Job stability | Overview/Number of error recoveries per minute for the job | ≥ 1 for 1 consecutive period | P0 | 1. Identify the problem. 2. Restore the job from the latest snapshot or successful checkpoint. |
| Job stability | Number of successful checkpoints (5-minute cumulative) | ≤ 0 for 1 consecutive period | P0 | 1. See System checkpoints to troubleshoot the root cause of the failure. 2. Identify the problem. 3. Dynamically update the configuration or restore the job from the latest successful checkpoint. |
| Data timeliness | Overview/Business latency && Records in from source per second | Maximum latency ≥ 180000 and input records > 0 for 3 consecutive periods | P1 | 1. See Metric description to investigate the cause of the latency. 2. Take action based on the cause. |
| Data timeliness | Overview/Records in from source per second && Source Raw Data Timestamp | Input records ≤ 0 (business-dependent) and maximum idle time ≥ 60000 for 5 consecutive periods | P1 | 1. Check taskmanager.log, flame graphs, and upstream service metrics to determine whether the problem is no upstream data, throttling, an error, or a stalled thread stack. 2. Take action based on the cause. |
| Data timeliness | Overview/Records out to sink per second | ≤ 0 for 5 consecutive periods | P1 | 1. Confirm whether data reaches the sink operator. 2. Confirm whether the sink can write to the external system. 3. As a temporary measure, enable dual-write to a backup storage system. |
| Resource bottlenecks | CPU/CPU utilization of a single TM | ≥ 85% for 10 consecutive periods | P2 | 1. Use flame graphs or the Flink UI to locate the hotspot operator. 2. Increase the parallelism of the bottleneck operator, or allocate more CPU cores to the TaskManager. |
| Resource bottlenecks | TM heap memory used | ≥ 90% for 10 consecutive periods | P2 | 1. Check GC logs to identify the problem. 2. Take action based on the cause: increase the heap size, or increase the parallelism to reduce the data volume per slot. |
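All of the "for N consecutive periods" rules above follow the same evaluation pattern: a threshold predicate must hold for every sample in the most recent N periods. A minimal illustrative sketch in Python (the sample values and function names are assumptions for illustration, not the ARMS or Cloud Monitor implementation):

```python
from typing import Callable, List


def breaches(samples: List[float],
             predicate: Callable[[float], bool],
             periods: int) -> bool:
    """Return True if `predicate` holds for the last `periods` consecutive
    samples, mirroring a "for N consecutive periods" alert rule."""
    if len(samples) < periods:
        return False  # not enough data to satisfy N consecutive periods
    return all(predicate(v) for v in samples[-periods:])


# Example: "Records out to sink per second <= 0 for 5 consecutive periods"
sink_rps = [120.0, 80.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(breaches(sink_rps, lambda v: v <= 0, periods=5))  # True -> raise a P1 alert
```

Requiring N consecutive breaching periods, rather than a single sample, suppresses alerts caused by momentary metric spikes or gaps.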
Job availability
Job failure alert
Development console (ARMS)
1. Log on to the Realtime Compute for Apache Flink console. In the Actions column of your workspace, click Console.
2. On the page, click the target job.
3. Click the Alert Configuration tab.

Cloud Monitor
1. Log on to the Cloud Monitor console.
2. In the left-side navigation pane, choose .
3. On the Subscription Policy tab, click Create Subscription Policy.
4. On the Create Subscription Policy page, configure the parameters. For more information, see Manage event subscriptions (Recommended).
Job stability
Prevent frequent JobManager restarts
- Metric: Number of error recoveries per minute for the job
- Rule: Send an alert if the job restarts within 1 minute.
- Recommended configuration:
  - Number of error recoveries per minute for the job: Metric value >= 1
  - Period: 1 minute
  - Notification: Phone call, text message, email, and WebHook (Critical)

Ensure checkpoint success rate
- Metric: Number of completed checkpoints per minute
- Rule: Send an alert if no checkpoint is completed within 5 minutes.
- Recommended configuration:
  - Number of completed checkpoints per minute: Metric value <= 0
  - Period: 5 minutes
  - Notification: Phone call, text message, email, and WebHook (Critical)
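The checkpoint rule fires when the cumulative completed-checkpoint count stops growing over the 5-minute window. A hedged sketch of that logic (the sample counter values are made up; Flink's REST API exposes a comparable cumulative `counts.completed` field, but this is not the monitoring service's implementation):

```python
from typing import List


def no_checkpoint_completed(completed_counts: List[int], window: int = 5) -> bool:
    """`completed_counts` holds one cumulative completed-checkpoint count per
    minute. Returns True if the count did not grow over the last `window`
    minutes, i.e. "completed checkpoints per minute <= 0" for the window."""
    if len(completed_counts) < window + 1:
        return False  # not enough history to cover the window
    return completed_counts[-1] - completed_counts[-(window + 1)] <= 0


samples = [40, 41, 42, 42, 42, 42, 42, 42]
print(no_checkpoint_completed(samples))  # True -> no checkpoint completed in 5 min
```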
Data timeliness
Ensure SLA for latency
- Metrics:
  - Business latency
  - Records in from source per second
- Rule: Generate an alert if data is being received and the business latency exceeds 5 minutes. You can adjust the threshold and alert level as needed.
- Recommended configuration:
  - Business latency: Maximum >= 300000
  - Records in from source per second: Metric value > 0
  - Period: 5 minutes
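Both conditions in this rule must hold at the same time. A minimal sketch of the combined evaluation for one period (the thresholds are the milliseconds values from the configuration above; the function name and sample inputs are assumptions for illustration):

```python
LATENCY_THRESHOLD_MS = 300_000  # business latency SLA: 5 minutes


def latency_alert(latency_ms_max: float, records_in_per_sec: float) -> bool:
    """Alert only when data is still arriving AND latency breaches the SLA.
    Without the input-records condition, an idle source would keep paging
    on a stale latency reading even though nothing is being processed."""
    return records_in_per_sec > 0 and latency_ms_max >= LATENCY_THRESHOLD_MS


print(latency_alert(420_000, 35.0))  # True: data flowing, SLA breached
print(latency_alert(420_000, 0.0))   # False: no inbound data, do not page
```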
Upstream data stream interruption detection
- Metrics:
  - Records in from source per second
  - Age of unprocessed source data
- Rule: Generate an alert if no data is being received and the age of unprocessed source data exceeds 1 minute. You can adjust the threshold and alert level as needed.
- Recommended configuration:
  - Records in from source per second: Metric value <= 0
  - Age of unprocessed source data: Maximum > 60000
  - Period: 5 minutes
No downstream data output detection
- Metric: Records out to sink per second
- Rule: Generate an alert if there is no data output for more than 5 minutes. You can adjust the threshold and alert level as needed.
- Recommended configuration:
  - Records out to sink per second: Metric value <= 0
  - Period: 5 minutes
Resource performance bottlenecks
CPU performance bottlenecks
- Metric: CPU utilization of a single TM
- Rule: Send an alert if CPU utilization is greater than 85% for more than 10 minutes.
- Recommended configuration:
  - CPU utilization of a single TM: Maximum >= 85
  - Period: 10 minutes

Memory performance bottlenecks
- Metric: TM heap memory usage
- Rule: Send an alert if heap memory usage is greater than 90% for more than 10 minutes.
- Recommended configuration:
  - TM heap memory usage: Maximum >= Threshold (90%)
    Determine this threshold based on the heap memory usage found on the page. For example, if the usage is 194 MB / 413 MB, set the threshold to 372 MB (90% of 413 MB).
  - Period: 10 minutes
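The absolute heap threshold in the example above is plain arithmetic on the configured maximum heap size. A small sketch of the calculation (the 413 MB heap size is just the example value from this section, not a default):

```python
def heap_alert_threshold_mb(max_heap_mb: float, ratio: float = 0.9) -> int:
    """Convert a percentage-based heap rule into an absolute threshold in MB."""
    return round(max_heap_mb * ratio)


print(heap_alert_threshold_mb(413))  # 372, matching the 372 MB in the example
```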