Monitor Flink Jobs Early to Catch Failures via Alert Rules - Realtime Compute for Apache Flink

Prerequisites

Before you begin, complete the setup described in Configure monitoring and alerting. Choose the monitoring tool that matches your workspace configuration.

Multi-metric alerting in ARMS (Application Real-Time Monitoring Service) requires custom PromQL. For simpler setup, use CloudMonitor instead.

Recommended alert rules

The following table summarizes the alerts covered in this guide. Configure them in priority order: P0 alerts indicate immediate job impact, P1 alerts indicate data timeliness problems, and P2 alerts indicate emerging resource pressure.

<table> <thead> <tr> <td> Scenario </td> <td> Metric or event </td> <td> Trigger condition </td> <td> Level </td> <td> Action </td> </tr> </thead> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup> <tbody> <tr> <td> <a href="#e7130392f32oe">Job failure</a> </td> <td> Job status event </td> <td> = FAILED (event alerting) </td> <td> P0 </td> <td> 1. Check whether the restart strategy is misconfigured. Use the default settings unless you have a specific reason to override them. 2. Determine whether the failure is caused by the restart strategy or by an abnormal JobManager or TaskManager. 3. Restore the job from the latest snapshot or successful checkpoint. </td> </tr> <tr> <td> <a href="#0056509ea2937">Failover surge</a> </td> <td> Overview/Number of error recoveries per minute for the job </td> <td> ≥ 1 for 1 consecutive period </td> <td> P0 </td> <td> 1. Identify the root cause. <ul> <li> Analyze failover, JobManager, and TaskManager logs. </li> <li> Ignore: Infrequent, auto-recoverable machine failures. </li> <li> Fix: Code bugs, resource bottlenecks, or configuration errors. </li> </ul> 2. Restore the job from the latest snapshot or successful checkpoint. </td> </tr> <tr> <td> <a href="#5435fcd4abo9e">Consecutive checkpoint failures</a> </td> <td> Number of successful checkpoints (5 min cumulative) </td> <td> ≤ 0 for 1 consecutive period </td> <td> P0 </td> <td> 1. See <a href="https://www.alibabacloud.com/help/en/document_detail/414257.html#caf8e65a8awv1">System checkpoints</a> to identify the root cause. 2. Act on the cause. <ul> <li> Configuration issue (such as timeout): Adjust the checkpoint settings. </li> <li> Resource pressure (such as backpressure): Use <a href="https://www.alibabacloud.com/help/en/document_detail/2536572.html">dynamic scaling</a> to add resources to the backpressured operator. </li> </ul> 3. 
Dynamically update the configuration or restore the job from the latest successful checkpoint. </td> </tr> <tr> <td> <a href="#c339292e4e6p7">High business latency (with incoming data)</a> </td> <td> Overview/Business latency and Records in from source per second </td> <td> Maximum latency ≥ 180000 ms and input records > 0, for 3 consecutive periods </td> <td> P1 </td> <td> 1. See <a href="https://www.alibabacloud.com/help/en/document_detail/2543043.html">Metric description</a> to investigate the cause. <ul> <li> Data plane: Are event timestamps out of order? </li> <li> Traffic: Is there an upstream surge or downstream backpressure? </li> </ul> 2. Act on the cause. <ul> <li> Internal: Adjust connector WITH parameters and scale out the bottleneck operator. </li> <li> External: Optimize external service settings, such as adjusting throttling policies or increasing connection limits. </li> </ul> </td> </tr> <tr> <td> <a href="#ff3815bfc11gq">Upstream data interruption</a> </td> <td> Overview/Records in from source per second and Source Raw Data Timestamp </td> <td> Input records ≤ 0 (business-dependent) and maximum idle time ≥ 60000 ms, for 5 consecutive periods </td> <td> P1 </td> <td> 1. Check taskmanager.log, flame graphs, and upstream service metrics to confirm the cause: no upstream data, throttling, an error, or a stalled thread stack. 2. Act on the cause. <ul> <li> Connector issue: Adjust connector parameters such as timeout or concurrency, or add TaskManager resources. </li> <li> Upstream or downstream service issue: Notify the upstream team to investigate. </li> <li> Flink internal bottleneck (such as backpressure or a freeze): Resolve the root cause first, then restart the job from the latest checkpoint. </li> </ul> </td> </tr> <tr> <td> <a href="#10fd3e302b76j">No downstream output</a> </td> <td> Overview/Records out to sink per second </td> <td> ≤ 0 for 5 consecutive periods </td> <td> P1 </td> <td> 1. Confirm whether data reaches the sink operator. 
<ul> <li> Business logic filtering: Check logs or metrics to determine whether all input was filtered out. </li> <li> Late data discard: Check watermark and window settings to determine whether data was discarded as late arrivals. </li> </ul> 2. Confirm whether the sink can write to the external system. <ul> <li> Connection layer: Is the connection pool full? Is network connectivity normal? </li> <li> Target system: Does the downstream database or service have a locked table, insufficient disk space, write throttling, or other errors? </li> </ul> 3. As a temporary measure, enable dual-write to a backup storage system. </td> </tr> <tr> <td> <a href="#1e5b7fcf1cdsj">CPU bottleneck</a> </td> <td> CPU/CPU utilization of a single TM </td> <td> ≥ 85% for 10 consecutive periods </td> <td> P2 </td> <td> 1. Use flame graphs or the Flink UI to locate the hot-spot operator. <ul> <li> Business logic: Check for complex calculations, JSON parsing, or inefficient user-defined functions (UDFs). </li> <li> Data skew: Check whether a hot-spot key is overloading a single task with excessive data volume. </li> <li> Insufficient resources: Determine whether the current degree of parallelism and TaskManager resources can handle the traffic, and whether there is severe backpressure. </li> <li> Frequent GC: Check logs or JVM metrics to determine whether memory pressure is triggering frequent Full GCs. </li> </ul> 2. Increase the degree of parallelism for the bottleneck operator, or allocate more CPU cores to the TaskManager. </td> </tr> <tr> <td> <a href="#5d68057bafc8w">Memory bottleneck</a> </td> <td> TM heap memory used </td> <td> ≥ 90% for 10 consecutive periods </td> <td> P2 </td> <td> 1. Check GC logs to identify the problem type. <ul> <li> Memory leak: Heap memory does not return to baseline after GC and the baseline keeps rising. </li> <li> Insufficient capacity: Heap usage stays consistently high, triggering frequent Full GCs and degrading performance. 
</li> <li> Sudden OutOfMemoryError (OOM): Memory fills instantly when processing a specific record or batch. </li> </ul> 2. Increase heap size or increase the degree of parallelism to reduce the data volume per slot. </td> </tr> </tbody> </table>
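When several alerts fire at once, the levels in the table suggest a handling order. The following Python sketch shows one way to sort firing alerts by level. The `FiringAlert` class and `triage_order` function are hypothetical names used only for illustration; they are not part of any Flink, ARMS, or CloudMonitor API.

```python
from dataclasses import dataclass

# Levels as used in the table; a lower rank is handled first.
LEVEL_RANK = {"P0": 0, "P1": 1, "P2": 2}


@dataclass(frozen=True)
class FiringAlert:
    scenario: str  # scenario name from the table (illustrative)
    level: str     # "P0", "P1", or "P2"


def triage_order(alerts):
    """Return the alerts sorted so that P0 precedes P1 and P1 precedes P2.

    sorted() is stable, so alerts of the same level keep their input order.
    """
    return sorted(alerts, key=lambda alert: LEVEL_RANK[alert.level])


if __name__ == "__main__":
    firing = [
        FiringAlert("Memory bottleneck", "P2"),
        FiringAlert("No downstream output", "P1"),
        FiringAlert("Job failure", "P0"),
    ]
    for alert in triage_order(firing):
        print(alert.level, alert.scenario)
```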

Job availability

Job failure alert

Configure a P0 alert that fires immediately when a job transitions to the FAILED state, so you can restore service before the failure cascades.

Metric: Job status event = FAILED

Notification: Phone call, text message, email, and webhook (Critical)

When this alert fires:

1. Check whether the restart strategy is misconfigured. Use the default settings unless you have a specific reason to override them.
2. Determine whether the failure is caused by the restart strategy or by an abnormal JobManager or TaskManager.
3. Restore the job from the latest snapshot or successful checkpoint.

Configure in ARMS

1. Log in to the Realtime Compute for Apache Flink console. In the Actions column of your workspace, click Console.
2. On the Operation Center > Job O&M page, click the target job.
3. Click the Alert Configuration tab.

Configure in CloudMonitor

1. Log in to the CloudMonitor console.
2. In the left navigation pane, choose Event Center > Event Subscription.
3. On the Subscription Policy tab, click Create Subscription Policy.
4. Configure the parameters. For details, see Manage event subscriptions (Recommended).

Failover surge

Frequent restarts indicate an underlying problem (a code bug, a resource bottleneck, or a configuration error) that auto-recovery cannot fix. Configure this alert to catch surge patterns before they exhaust retry budgets.

Metric: Number of error recoveries per minute for the job

Recommended configuration:

Metric value ≥ 1
Period: 1 minute
Notification: Phone call, text message, email, and webhook (Critical)

When this alert fires:

1. Identify the root cause by analyzing failover, JobManager, and TaskManager logs.
   - Ignore: Infrequent, auto-recoverable machine failures.
   - Fix: Code bugs, resource bottlenecks, or configuration errors.
2. Restore the job from the latest snapshot or successful checkpoint.

Consecutive checkpoint failures

When no checkpoint succeeds within 5 minutes, the job has no recent recovery point.

Metric: Number of completed checkpoints per minute

Recommended configuration:

Metric value ≤ 0
Period: 5 minutes
Notification: Phone call, text message, email, and webhook (Critical)
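The trigger condition above can be sketched as a check over a sliding window. This is a minimal illustration, assuming one sample per minute of the completed-checkpoint count. The function name and the input format are hypothetical and do not correspond to an ARMS or CloudMonitor API.

```python
def no_recent_checkpoint(completed_per_minute, window_minutes=5):
    """Return True if no checkpoint completed in the last `window_minutes`.

    `completed_per_minute` is a list of per-minute completed-checkpoint
    counts, oldest first (assumed input format). With fewer samples than
    the window, the function returns False rather than guessing.
    """
    if len(completed_per_minute) < window_minutes:
        return False
    recent = completed_per_minute[-window_minutes:]
    return sum(recent) <= 0


if __name__ == "__main__":
    print(no_recent_checkpoint([1, 1, 0, 0, 0, 0, 0]))  # True
    print(no_recent_checkpoint([0, 0, 0, 1, 0]))        # False
```

Summing over the window rather than testing each minute avoids false alarms for jobs whose checkpoint interval is longer than one minute.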

When this alert fires:

1. See System checkpoints to identify the root cause.
2. Act on the cause:
   - Configuration issue (such as timeout): Adjust the checkpoint settings.
   - Backpressure: Use dynamic scaling to add resources to the backpressured operator.
3. Update the configuration dynamically or restore the job from the latest successful checkpoint.

Data timeliness

High business latency (with incoming data)

Alert when the job is receiving data but processing lags behind by more than 5 minutes. Adjust the threshold and alert level to match your SLA.

Metrics: Business latency, Records in from source per second

Recommended configuration:

Business latency Maximum ≥ 300000
Records in from source per second Metric value > 0
Period: 5 minutes
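The two conditions in this configuration are combined with a logical AND. The following Python sketch illustrates why the input-rate condition matters: an idle job can report a high latency value without actually being behind. The function and constant names are hypothetical, and the latency value is assumed to be in milliseconds (300000 ms equals 5 minutes).

```python
LATENCY_THRESHOLD_MS = 300_000  # 5 minutes; adjust to match your SLA


def latency_alert(max_latency_ms, records_in_per_sec):
    """Return True only if latency is high AND the source is receiving data."""
    receiving_data = records_in_per_sec > 0
    lagging = max_latency_ms >= LATENCY_THRESHOLD_MS
    return receiving_data and lagging


if __name__ == "__main__":
    print(latency_alert(420_000, 1500))  # True: data is arriving and the job lags
    print(latency_alert(420_000, 0))     # False: no incoming data
    print(latency_alert(120_000, 1500))  # False: latency is within the threshold
```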

When this alert fires:

1. See Metric description to investigate the cause.
   - Data plane: Are event timestamps out of order?
   - Traffic: Is there an upstream surge or downstream backpressure?
2. Act on the cause:
   - Internal: Adjust connector WITH parameters and scale out the bottleneck operator.
   - External: Optimize external service settings, such as adjusting throttling policies or increasing connection limits.

Upstream data interruption

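Alert when the source receives no records and the age of unprocessed data at the source exceeds 60 seconds (60000 ms) over a 5-minute period. Adjust the thresholds and alert level to match your traffic pattern, because some jobs legitimately have quiet periods.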

Metrics: Records in from source per second, Age of unprocessed data at the source

Recommended configuration:

Records in from source per second Metric value ≤ 0
Age of unprocessed data at the source Maximum > 60000
Period: 5 minutes
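As a rough sketch of this rule, the following Python code requires both conditions to hold in every one of the most recent periods before reporting an interruption. The `SourceSample` type, the field names, and the assumption of one sample per one-minute period are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class SourceSample(NamedTuple):
    records_in_per_sec: float  # input rate observed in the period
    idle_ms: float             # source-side age or idle value in milliseconds


def source_interrupted(samples, periods=5, idle_threshold_ms=60_000):
    """Return True if the last `periods` samples all show no input
    and an idle value above the threshold. Samples are oldest first."""
    if len(samples) < periods:
        return False
    return all(
        s.records_in_per_sec <= 0 and s.idle_ms > idle_threshold_ms
        for s in samples[-periods:]
    )


if __name__ == "__main__":
    stalled = [SourceSample(0, 70_000 + i * 60_000) for i in range(5)]
    print(source_interrupted(stalled))  # True

    # One period with input breaks the streak.
    recovering = stalled[:4] + [SourceSample(25, 0)]
    print(source_interrupted(recovering))  # False
```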

When this alert fires:

1. Check taskmanager.log, flame graphs, and upstream service metrics to confirm the cause: no upstream data, throttling, an error, or a stalled thread stack.
2. Act on the cause:
   - Connector issue: Adjust connector parameters such as timeout or concurrency, or add TaskManager resources.
   - Upstream or downstream service issue: Notify the upstream team to investigate.
   - Flink internal bottleneck (such as backpressure or a freeze): Resolve the root cause first, then restart the job from the latest checkpoint.

No downstream output

Alert when the sink stops emitting data for more than 5 minutes. This can indicate a business logic issue, a late-data configuration problem, or a failure in the downstream system.

Metric: Records out to sink per second

Recommended configuration:

Metric value ≤ 0
Period: 5 minutes

When this alert fires:

1. Confirm whether data reaches the sink operator:
   - Business logic filtering: Check logs or metrics to determine whether all input was filtered out for not meeting conditions.
   - Late data discard: Check watermark and window settings to determine whether data was discarded as late arrivals.
2. Confirm whether the sink can write to the external system:
   - Connection layer: Is the connection pool full? Is network connectivity normal?
   - Target system: Does the downstream database or service have a locked table, insufficient disk space, write throttling, or other errors?
3. As a temporary measure, enable dual-write to a backup storage system.

Resource performance bottlenecks

CPU bottleneck

Alert when a single TaskManager's CPU utilization stays at or above 85% for 10 consecutive minutes.

Metric: CPU utilization of a single TM

Recommended configuration:

Maximum ≥ 85
Period: 10 minutes
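The "consecutive periods" condition means a single spike does not trigger the alert. The following Python sketch shows that behavior, assuming one sample per one-minute period holding the maximum CPU utilization (in percent) of a single TaskManager. The function name and input format are hypothetical.

```python
def breaches_consecutively(samples, threshold, periods):
    """Return True if each of the last `periods` samples is >= threshold.

    `samples` is ordered oldest first. With fewer samples than `periods`,
    the function returns False.
    """
    if len(samples) < periods:
        return False
    return all(value >= threshold for value in samples[-periods:])


if __name__ == "__main__":
    sustained = [60, 88, 90, 91, 87, 95, 93, 89, 86, 92, 90]
    spike = [60, 62, 99, 61, 60, 63, 60, 62, 61, 60, 59]

    print(breaches_consecutively(sustained, threshold=85, periods=10))  # True
    print(breaches_consecutively(spike, threshold=85, periods=10))      # False
```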

When this alert fires:

1. Use flame graphs or the Flink UI to locate the hot-spot operator.
   - Business logic: Check for complex calculations, JSON parsing, or inefficient UDFs.
   - Data skew: Check whether a hot-spot key is overloading a single task with excessive data volume.
   - Insufficient resources: Determine whether the current degree of parallelism and TaskManager resources can handle the traffic, and whether there is severe backpressure.
   - Frequent GC: Check logs or JVM metrics to determine whether memory pressure is triggering frequent Full GCs that consume CPU.
2. Increase the degree of parallelism for the bottleneck operator, or allocate more CPU cores to the TaskManager.

Memory bottleneck

Alert when a TaskManager's heap memory usage stays at or above 90% of capacity for 10 consecutive minutes. Derive the absolute threshold from the actual heap size shown on the Job O&M > Job Log page. For example, if the usage reads 194 MB / 413 MB, set the threshold to 372 MB (90% of 413 MB, rounded to the nearest megabyte).

Metric: TM heap memory usage

Recommended configuration:

Maximum ≥ Threshold (90% of total heap)
Period: 10 minutes
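The threshold arithmetic can be reproduced with a few lines of Python. This is only a convenience sketch; the function name is hypothetical, and the 413 MB value is the example total heap size used in this section.

```python
def heap_alert_threshold_mb(total_heap_mb, fraction=0.9):
    """Return the absolute alert threshold in MB, rounded to the nearest MB."""
    return round(total_heap_mb * fraction)


if __name__ == "__main__":
    # Example from this section: usage reads 194 MB / 413 MB.
    print(heap_alert_threshold_mb(413))  # 372
```

Recompute the threshold whenever you change the TaskManager memory configuration, because the absolute value no longer corresponds to 90% after a resize.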

When this alert fires:

1. Check GC logs to identify the problem type:
   - Memory leak: Heap does not return to baseline after GC and the baseline keeps rising.
   - Insufficient capacity: Heap usage stays consistently high, triggering frequent Full GCs and degrading performance.
   - Sudden OutOfMemoryError (OOM): Memory fills instantly when processing a specific record or batch.
2. Increase heap size or increase the degree of parallelism to reduce data volume per slot.
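The first two patterns above can be told apart from heap usage measured right after each GC. The following Python sketch is a rough heuristic, not a diagnostic tool: the function name, the minimum of three samples, and the 0.9 cutoff are assumptions made for illustration, and the input is assumed to be post-GC heap usage in MB, oldest first, extracted from GC logs by some other means.

```python
def classify_heap_pattern(post_gc_heap_mb, total_heap_mb, high_fraction=0.9):
    """Roughly classify a series of post-GC heap readings (oldest first)."""
    if len(post_gc_heap_mb) < 3:
        return "insufficient data"

    # Leak signature: the post-GC baseline rises after every collection.
    pairs = zip(post_gc_heap_mb, post_gc_heap_mb[1:])
    if all(later > earlier for earlier, later in pairs):
        return "possible memory leak"

    # Capacity signature: even the lowest post-GC reading stays near the limit.
    if min(post_gc_heap_mb) >= high_fraction * total_heap_mb:
        return "possible insufficient capacity"

    return "no clear pattern"


if __name__ == "__main__":
    print(classify_heap_pattern([100, 150, 210, 260], 413))  # possible memory leak
    print(classify_heap_pattern([380, 385, 380, 390], 413))  # possible insufficient capacity
    print(classify_heap_pattern([120, 110, 125, 115], 413))  # no clear pattern
```

A sudden OOM on a specific record or batch does not show up in this kind of trend and has to be identified from the exception and the input that triggered it.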