
Realtime Compute for Apache Flink:Recommended monitoring configurations

Last Updated: Mar 09, 2026

This topic provides key alert metrics, recommended configurations, and operations and maintenance (O&M) examples for Realtime Compute for Apache Flink. Use this guide to monitor system performance and diagnose faults.

Prerequisites

For more information, see Configure monitoring and alerting. You can choose the configuration method that corresponds to the monitoring service used by your workspace.

Note

Multi-metric monitoring in ARMS requires custom PromQL. If you need a simpler configuration, you can still use Cloud Monitor to configure alerts.
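As a sketch of what such a custom PromQL rule can look like, the following expression combines a latency metric with an input-rate metric so that the alert fires only when latency is high while data is still arriving. The metric names follow the open source Flink Prometheus reporter naming and are assumptions; verify them against the metrics actually exposed in your ARMS workspace.

```promql
# Fire only when the maximum event-time lag exceeds 3 minutes (180000 ms)
# AND the job is still receiving records. Metric names are assumptions;
# check the names your workspace actually exposes.
max by (job_name) (flink_taskmanager_job_task_operator_currentEmitEventTimeLag) > 180000
and
sum by (job_name) (flink_taskmanager_job_task_numRecordsInPerSecond) > 0
```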

Recommended alert rule configurations

Each scenario below lists the combined metric or event to monitor, the recommended rule configuration, the alert level, and the actions to take.

Job failure alert

  • Metric/Event: Job status event

  • Rule: Status = FAILED (event alerting)

  • Level: P0

  • Actions:

    1. Check whether the restart policy is misconfigured. We recommend using the default configuration.

    2. Determine whether the failure was caused by the restart policy or by an abnormal JobManager or TaskManager.

    3. Restore the job from the latest snapshot or successful checkpoint.
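For reference, a fixed-delay restart strategy in the Flink configuration can look like the following. This is a sketch using open source Flink keys; the exact keys and defaults depend on your engine version, so verify them before use.

```yaml
# Fixed-delay restart strategy: retry up to 10 times, 30 s apart.
# Keys follow open source Flink; verify them for your engine version.
restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30 s
```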

Failover surge

  • Metric: Overview/Number of error recoveries per minute for the job

  • Rule: ≥ 1 for 1 consecutive period

  • Level: P0

  • Actions:

    1. Identify the problem by analyzing the failover, JobManager, and TaskManager logs to find the root cause.

      • Ignore: infrequent, automatically recoverable machine failures.

      • Fix: code bugs, resource bottlenecks, or configuration errors.

    2. Restore the job from the latest snapshot or successful checkpoint.

Consecutive checkpoint failures

  • Metric: Number of successful checkpoints (5-minute cumulative)

  • Rule: ≤ 0 for 1 consecutive period

  • Level: P0

  • Actions:

    1. See System checkpoints to troubleshoot the root cause of the failure.

    2. Identify the problem.

      • Parameter issue (such as a timeout): adjust the checkpoint configuration.

      • Resource issue (such as backpressure): use dynamic scaling to add resources to the operator under backpressure.

    3. Dynamically update the configuration, or restore the job from the latest successful checkpoint.
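If the failures stem from checkpoint parameters, the settings to revisit look roughly like the following. This is a sketch using open source Flink configuration keys and illustrative values; verify both against your engine version and workload.

```yaml
# Checkpoint tuning sketch (open source Flink keys; values are illustrative).
execution.checkpointing.interval: 3min                   # how often checkpoints start
execution.checkpointing.timeout: 10min                   # fail a checkpoint that runs longer
execution.checkpointing.min-pause: 1min                  # pause between consecutive checkpoints
execution.checkpointing.tolerable-failed-checkpoints: 3  # fail the job only after 3 in a row
```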

High business latency (with data)

  • Metrics: Overview/Business latency and Records in from source per second

  • Rule: Maximum latency ≥ 180000 (3 minutes) and input records > 0, for 3 consecutive periods

  • Level: P1

  • Actions:

    1. See Metric description to investigate the cause of the latency.

      • Data plane: are event times out of order?

      • Traffic level: is there an upstream traffic surge or downstream backpressure?

    2. Take action based on the cause.

      • Internal: adjust connector WITH parameters and scale out the bottleneck operator.

      • External: optimize external service configurations, such as adjusting throttling policies or increasing the number of connections.
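As an illustration of adjusting connector WITH parameters, the following Flink SQL enlarges write batches on a JDBC sink to relieve backpressure-induced latency. The table name, URL, and option values are hypothetical, and the available WITH options vary by connector; check your connector's documentation.

```sql
-- Hypothetical sink table: larger, less frequent flushes reduce per-row
-- overhead on the downstream database. Values are illustrative only.
CREATE TABLE orders_sink (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/db',
  'table-name' = 'orders',
  'sink.buffer-flush.max-rows' = '500',
  'sink.buffer-flush.interval' = '1s'
);
```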

Upstream data stream interruption detection

  • Metrics: Overview/Records in from source per second and Source Raw Data Timestamp

  • Rule: Input records ≤ 0 (business-dependent) and maximum idle time ≥ 60000 (1 minute), for 5 consecutive periods

  • Level: P1

  • Actions:

    1. Check taskmanager.log, flame graphs, and upstream service metrics to determine whether the cause is missing upstream data, throttling, an error, or a stalled thread stack.

    2. Take action based on the cause.

      • Connector issue: optimize connector parameters such as timeout or concurrency, or add more TaskManager resources.

      • Upstream or downstream service issue: notify the upstream business owner.

      • Flink internal bottleneck (such as backpressure or a system freeze): first resolve the root cause of the bottleneck, such as a downstream issue, and then restart the job from the latest checkpoint.
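One Flink-internal cause worth ruling out is that the stream is not actually interrupted, but an idle source partition is holding back watermarks and stalling downstream output. Assuming a SQL job, the relevant setting is sketched below; verify the key for your engine version.

```yaml
# Mark source partitions idle after 30 s of no data so watermarks
# keep advancing and downstream output does not stall.
table.exec.source.idle-timeout: 30 s
```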

No downstream data output detection

  • Metric: Overview/Records out to sink per second

  • Rule: ≤ 0 for 5 consecutive periods

  • Level: P1

  • Actions:

    1. Confirm whether data reaches the sink operator.

      • Business logic filtering: use logs or metrics to check whether all input data was filtered out for not meeting conditions.

      • Late data discard: check the watermark and window configurations to confirm whether data was discarded for arriving late.

    2. Confirm whether the sink can write to the external system.

      • Connection layer: is the sink connection pool full? Is network connectivity normal?

      • Target system layer: does the downstream database or service have a locked table, insufficient disk space, write throttling, or other errors?

    3. As a temporary measure, enable dual-write to a backup storage system.
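Dual-write can be sketched in Flink SQL with a statement set, which runs both INSERTs in a single job. The table names are hypothetical; the source and both sinks must already be declared.

```sql
-- Sketch of dual-write: the same stream goes to the primary sink and a
-- backup sink. Table names are illustrative placeholders.
BEGIN STATEMENT SET;
INSERT INTO primary_sink SELECT * FROM events_source;
INSERT INTO backup_sink  SELECT * FROM events_source;
END;
```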

CPU performance bottlenecks

  • Metric: CPU/CPU utilization of a single TM

  • Rule: ≥ 85% for 10 consecutive periods

  • Level: P2

  • Actions:

    1. Use flame graphs or the Flink UI to locate the hotspot operator.

      • Business logic: check for complex calculations, JSON parsing, or inefficient user-defined functions (UDFs).

      • Data skew: check whether a hot key is overloading a single task with excessive data volume.

      • Insufficient resources: can the current parallelism and TaskManager resources handle the data traffic? Is there severe backpressure?

      • Frequent GC: use logs or JVM metrics to check whether memory pressure is causing frequent full GCs, which consume a large amount of CPU.

    2. Increase the parallelism of the bottleneck operator, or allocate more CPU cores to the TaskManager.
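At the configuration level, the scaling knobs look roughly like the following. These are open source Flink keys with illustrative values; whether taskmanager.cpu.cores takes effect depends on the deployment mode, and in Realtime Compute the same resources are usually adjusted in the console, so treat this as a sketch.

```yaml
# Resource sketch: raise default parallelism and give each TaskManager
# more slots and CPU. Values are illustrative, not recommendations.
parallelism.default: 8
taskmanager.numberOfTaskSlots: 4
taskmanager.cpu.cores: 4
```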

Memory performance bottlenecks

  • Metric: TM heap memory used

  • Rule: ≥ 90% for 10 consecutive periods

  • Level: P2

  • Actions:

    1. Check the GC logs to identify the problem.

      • Memory leak: use the Flink UI or monitoring to check whether heap memory fails to return to a normal baseline after GC and the baseline keeps rising.

      • Insufficient capacity: heap memory usage is consistently high, frequently triggering full GCs and degrading performance.

      • Sudden OOM: memory fills instantly while processing a specific record or batch, directly causing an OutOfMemoryError.

    2. Take action based on the cause: increase the heap size, or increase the parallelism to reduce the data volume per slot.
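Increasing the heap can be done by growing the TaskManager's overall memory or by pinning the task heap explicitly. A sketch using open source Flink memory keys, with illustrative sizes:

```yaml
# Memory sketch: grow total TaskManager memory, or set the task heap
# explicitly. Verify keys and sizes against your engine version.
taskmanager.memory.process.size: 4096 m
taskmanager.memory.task.heap.size: 2048 m
```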

Job availability

Job failure alert

Development console (ARMS)

  1. Log on to the Realtime Compute for Apache Flink console. In the Actions column of your workspace, click Console.

  2. On the Operation Center > Job O&M page, click the target job.

  3. Click the Alert Configuration tab.


Cloud Monitor

  1. Log on to the Cloud Monitor console.

  2. In the navigation pane on the left, choose Event Center > Event Subscription.

  3. On the Subscription Policy tab, click Create Subscription Policy.

  4. On the Create Subscription Policy page, configure the parameters. For more information, see Manage event subscriptions (Recommended).


Job stability

Prevent frequent JobManager restarts

  • Metric: Number of error recoveries per minute for the job

  • Rule: Send an alert if the job recovers from an error one or more times within 1 minute.

  • Recommended configuration:

    • Number of error recoveries per minute for the job

      Metric value >= 1

    • Period: 1 minute

    • Notification: Phone call, text message, email, and WebHook (Critical)

Ensure checkpoint success rate

  • Metric: Number of completed checkpoints per minute

  • Rule: Send an alert if no checkpoint is completed within 5 minutes.

  • Recommended configuration:

    • Number of completed checkpoints per minute

    • Metric value <= 0

    • Period: 5 minutes

    • Notification: Phone call, text message, email, and WebHook (Critical)

Data timeliness

Ensure SLA for latency

  • Metrics:

    • Business latency

    • Records in from source per second

  • Rule: Generate an alert if data is being received and the business latency exceeds 5 minutes. You can adjust the threshold and alert level as needed.

  • Recommended configuration:

    • Business latency

      Maximum >= 300000

    • Records in from source per second

      Metric value > 0

    • Period: 5 minutes

Upstream data stream interruption detection

  • Metrics:

    • Records in from source per second

    • Age of unprocessed source data

  • Rule: An alert is triggered if no data is being received and the age of unprocessed source data exceeds 1 minute (the threshold and alert level are configurable).

  • Recommended configuration:

    • Records in from source per second

      Metric value <= 0

    • Age of unprocessed source data

      Maximum > 60000

    • Period: 5 minutes

No downstream data output detection

  • Metric: Records out to sink per second

  • Rule: Generate an alert if there is no data output for more than 5 minutes. You can adjust the threshold and alert level as needed.

  • Recommended configuration:

    • Records out to sink per second

      Metric value <= 0

    • Period: 5 minutes

Resource performance bottlenecks

CPU performance bottlenecks

  • Metric: Single TM CPU utilization

  • Rule: Alert if CPU utilization is greater than 85% for more than 10 minutes.

  • Recommended configuration:

    • CPU utilization of a single TM

      Maximum >= 85

    • Period: 10 minutes

Memory performance bottlenecks

  • Metric: TM heap memory usage

  • Rule: Alert if heap memory usage is greater than 90% for more than 10 minutes.

  • Recommended configuration:

    • TM heap memory usage

      Maximum >= threshold (90% of the configured heap memory)

      Determine this threshold based on the heap memory usage found on the Job O&M > Job Log page. For example, if the usage is 194 MB / 413 MB, set the threshold to 372 MB (90% of 413 MB).


    • Period: 10 minutes