
Realtime Compute for Apache Flink: Best practices for monitoring and alerting

Last Updated: Nov 20, 2025

This document lists key alert metrics, recommended configurations, and best practices to help you effectively monitor job performance and diagnose issues.

Prerequisites

See Configure alert rules and choose the appropriate configuration method based on your workspace's monitoring service.


Recommended alert rules

Each use case below lists the metrics or events to monitor, the recommended rule configuration, the severity level, and the recommended actions to take when the alert fires.

Job failures

Event: JOB_FAILED (system event)

Rule configuration: Triggered when the event occurs

Severity: P0

Recommended actions:

1. Check the configured restart policy (see the restart-strategy sketch after this list).
2. Determine the cause of the failure (e.g., JobManager/TaskManager issue or a deliberate stop).
3. Restore the job from the latest savepoint or checkpoint.
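
The following minimal DataStream API sketch shows how a fixed-delay restart policy can be set with the open source Flink API. The retry count, delay, and placeholder pipeline are illustrative assumptions; SQL deployments typically set the equivalent restart-strategy options in the deployment's runtime configuration instead.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartPolicySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Restart the job at most 3 times, waiting 10 seconds between attempts.
        // Once the attempts are exhausted, the job transitions to FAILED, which is
        // when a JOB_FAILED alert such as the one above becomes relevant.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        // Placeholder pipeline so the sketch runs end to end.
        env.fromElements("a", "b", "c").print();
        env.execute("restart-policy-sketch");
    }
}
```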

Frequent restarts

Metric: Overview > NumOfRestart

Rule configuration:

  • ≥ 1 restart

  • 1 consecutive period

Severity: P0

Recommended actions:

1. Identify the cause.

  • Analyze failover, JobManager, and TaskManager logs to find the cause of the failure.

  • Ignore infrequent, auto-recoverable machine failures.

  • Fix code bugs, resource bottlenecks, or configuration errors.

2. Recover the job from the latest savepoint or checkpoint.

Consecutive checkpoint failures

Metric: Checkpoint > NumOfCheckpoints

Rule configuration:

  • ≤ 0 checkpoints

  • 1 consecutive period

Severity: P0

Recommended actions:

1. See System checkpoints to troubleshoot the cause of checkpoint failures.

2. Identify the issue.

  • Parameter issues (e.g., timeout): Adjust the checkpoint configuration (see the checkpoint tuning sketch after this list).

  • Resource issues (e.g., backpressure): Use dynamic scaling to add resources to the affected operator.

3. Dynamically update the configuration or restore the job from the latest checkpoint.
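
For the parameter-tuning path, here is a minimal sketch that uses the open source Flink CheckpointConfig API. The interval, timeout, pause, and tolerance values are placeholder assumptions; in Realtime Compute for Apache Flink, the same options are usually set in the deployment's runtime configuration.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Give slow checkpoints more time before they are declared failed.
        checkpointConfig.setCheckpointTimeout(600_000);
        // Leave at least 30 seconds between the end of one checkpoint and the start of the next.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000);
        // Tolerate a few consecutive checkpoint failures before failing the whole job.
        checkpointConfig.setTolerableCheckpointFailureNumber(3);

        // Placeholder pipeline so the sketch runs end to end.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-tuning-sketch");
    }
}
```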

Latency

Metrics: Overview > CurrentEmitEventTimeLag and NumOfRecordsInFromSourcePerSecond

Rule configuration:

  • Maximum latency ≥ 180,000 ms

  • Input records > 0

  • 3 consecutive periods

Severity: P1

Recommended actions:

1. See Metric description to investigate the cause of the latency.

  • Data-related: Check if event times are out of order.

  • Traffic-related: Check if there is an upstream data surge or backpressure from a downstream system.

2. Take action based on the cause.

  • Internal cause: Adjust connector options or scale up the bottleneck operator.

  • External cause: Optimize external service configurations (e.g., adjust the traffic throttling strategy, increase connection counts).

Upstream data flow interruptions

Metrics: Overview > NumOfRecordsInFromSourcePerSecond and SourceIdleTime

Rule configuration:

  • Input records ≤ 0 (depending on your business logic)

  • Maximum source idle time ≥ 60,000 ms

  • 5 consecutive periods

Severity: P1

Recommended actions:

1. Check taskmanager.log, flame graphs, and upstream metrics to identify the cause, such as a stalled thread, an upstream issue, rate throttling, or an error.

2. Take action based on the cause.

  • Connector issues: Adjust connector options (e.g., timeout, concurrency) or increase TaskManager resources.

  • Upstream/downstream service issues: Notify the upstream service owner to resolve the issue.

  • Internal Flink bottlenecks (e.g., backpressure or system freeze): First, resolve the root cause of the bottleneck (e.g., fix the downstream issue), then restart the job from the latest checkpoint.

Output issues

Metric: Overview > NumOfRecordsOutToSinkPerSecond

Rule configuration:

  • Output records ≤ 0

  • 5 consecutive periods

Severity: P1

Recommended actions:

1. Verify whether data is reaching the sink operator.

  • Data filtering: Check logs/metrics for any data filtering that might be dropping records.

  • Late data handling: Review watermark and window configurations to ensure late data isn't being dropped (see the watermark sketch after this list).

2. Verify that the sink operator is sending data to the destination.

  • Connection: Check for full connection pools. Assess network connection stability.

  • Destination system: Investigate downstream system for table locks, disk space, write throttling, or other errors.

3. Temporarily enable dual-write to a backup system.
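
The watermark sketch below illustrates, with the open source Flink DataStream API, where out-of-orderness tolerance and window lateness are configured so that late records are not silently dropped. The SensorReading type, field names, and durations are illustrative assumptions, not part of the product.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LateDataSketch {

    // Illustrative record type (an assumption for this sketch).
    public static class SensorReading {
        public String sensorId;
        public long eventTimeMillis;
        public double value;

        public SensorReading() {}

        public SensorReading(String sensorId, long eventTimeMillis, double value) {
            this.sensorId = sensorId;
            this.eventTimeMillis = eventTimeMillis;
            this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new SensorReading("sensor-1", 1_000L, 21.5),
                new SensorReading("sensor-1", 61_000L, 22.0))
            // Tolerate events that arrive up to 30 seconds out of order
            // before the watermark passes them by.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((reading, ts) -> reading.eventTimeMillis))
            .keyBy(reading -> reading.sensorId)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            // Keep window state for 5 extra minutes so late records still update
            // the result instead of being dropped silently.
            .allowedLateness(Time.minutes(5))
            .sum("value")
            .print();

        env.execute("late-data-sketch");
    }
}
```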

CPU performance bottlenecks

Metric: CPU > TMCPUUsage

Rule configuration:

  • TM CPU usage ≥ 85%

  • 10 consecutive periods

Severity: P2

Recommended actions:

1. Use the flame graph or Flink UI to identify the hotspot operator.

  • Business logic: Check for complex computations, JSON parsing, or inefficient UDFs.

  • Data skew: Determine if a hot key is overloading a single task due to an excessive volume of data.

  • Insufficient resources: Check whether the parallelism and TaskManager resources are sufficient, and investigate how severe the backpressure is.

  • Frequent GC: Check logs or JVM metrics to see if memory pressure is causing frequent full GCs, which consume significant CPU.

2. Increase the parallelism of the bottleneck operator (see the sketch below) or allocate more vCPUs to TaskManagers.
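
As a reference for scaling only the hotspot, the minimal sketch below raises the parallelism of a single operator with the open source Flink DataStream API. The ExpensiveParse function, operator name, and parallelism values are illustrative assumptions; SQL deployments usually adjust operator parallelism through the deployment's resource configuration instead.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HotspotParallelismSketch {

    // Stand-in for an expensive transformation such as JSON parsing or a heavy UDF.
    public static class ExpensiveParse implements MapFunction<String, String> {
        @Override
        public String map(String value) {
            return value.toUpperCase();
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Default parallelism for the rest of the job.
        env.setParallelism(2);

        env.fromElements("a", "b", "c")
            .map(new ExpensiveParse())
            .name("expensive-parse")
            // Give only the hotspot operator more parallel subtasks.
            .setParallelism(8)
            .print();

        env.execute("hotspot-parallelism-sketch");
    }
}
```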

Memory performance bottlenecks

Metric: Memory > TMHeapMemoryUsed

Rule configuration:

  • TM heap memory usage ≥ 90%

  • 10 consecutive periods

Severity: P2

Recommended actions:

1. Analyze GC logs to identify the problem.

  • Memory leak: In the Flink UI or monitoring dashboards, check if heap memory usage fails to return to its baseline after GC and continues to rise.

  • Insufficient capacity: Check whether heap memory usage stays persistently high. High usage can trigger frequent full GCs, degrading performance.

  • Sudden OOM: Check whether memory is exhausted suddenly when a specific record or batch of data is processed, directly causing an out-of-memory error.

2. Take action based on the cause: increase the heap size (see the memory-options sketch below) or increase the parallelism.
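
In Realtime Compute for Apache Flink, TaskManager memory is normally increased through the deployment's resource settings. Purely as a rough illustration of what "increase the heap size" corresponds to in open source Flink, the sketch below prints two related memory options with placeholder sizes.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;

public class TaskManagerMemorySketch {
    public static void main(String[] args) {
        // Placeholder sizes; choose values that fit your workload and cluster.
        Configuration conf = new Configuration();

        // Total memory of each TaskManager process (heap, managed, network, overhead).
        conf.set(TaskManagerOptions.TOTAL_PROCESS_MEMORY, MemorySize.parse("4g"));
        // JVM heap available to operators and user code on each TaskManager.
        conf.set(TaskManagerOptions.TASK_HEAP_MEMORY, MemorySize.parse("2g"));

        conf.toMap().forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```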

Job availability

Create alert rules for job failures

Prometheus

  1. Log on to the Realtime Compute for Apache Flink console. Find your workspace and click Console in the Actions column.

  2. In the Development Console, navigate to O&M > Deployments, and click the name of your job deployment.

  3. Click the Alarm tab. Switch to the Alarm Rules subtab, and click Add Rule > Custom Rule. In the Create Rule panel, for Content, select Job Failed from the Metric dropdown list.


Cloud Monitor

  1. Log on to the Cloud Monitor console.

  2. In the left navigation pane, choose Event Center > Event Subscription.

  3. On the Subscription Policy tab, click Create Subscription Policy.

  4. On the Create Subscription Policy page, configure the policy. For more information, see Manage event subscription policies (recommended).


Job stability

Frequent restarts

  • Metric: NumOfRestart

  • Rule description: Alert if the job restarts one or more times within a minute.

  • Recommended configuration:

    • NumOfRestart

      Metric Value >= 1

    • Period: 1 Consecutive Cycle (1 Cycle = 1 Minute)

    • Notification: Phone Call + SMS Message + Email + Webhook (Critical)

Consecutive checkpoint failures

  • Metric: NumOfCheckpoints

  • Rule description: Alert if no checkpoint succeeds for 5 minutes.

  • Recommended configuration:

    • NumOfCheckpoints

      Metric Value <= 0

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

    • Notification: Phone Call + SMS Message + Email + Webhook (Critical)

Data timeliness

Latency

  • Metrics:

    • CurrentEmitEventTimeLag

    • NumOfRecordsInFromSourcePerSecond

  • Rule description: Alert if data is coming in and business latency exceeds 5 minutes. Choose a threshold and alert level as needed.

  • Recommended configuration:

    • CurrentEmitEventTimeLag

      Maximum Value >= 300000

    • NumOfRecordsInFromSourcePerSecond

      Metric Value > 0

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

Upstream data flow interruptions

  • Metrics:

    • NumOfRecordsInFromSourcePerSecond

    • SourceIdleTime

  • Rule description: Alert if the source reads no data and remains idle for more than 1 minute, sustained over 5 consecutive minutes. Choose a threshold and alert level as needed.

  • Recommended configuration:

    • NumOfRecordsInFromSourcePerSecond

      Metric Value <= 0

    • SourceIdleTime

      Maximum Value > 60000

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

Output issues

  • Metric: NumOfRecordsOutToSinkPerSecond

  • Rule: Alert if there is no data output for more than 5 minutes. Choose a threshold and alert level as needed.

  • Recommended configuration:

    • NumOfRecordsOutToSinkPerSecond

      Metric Value <= 0

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

Resource performance bottlenecks

CPU performance bottlenecks

  • Metric: TMCPUUsage

  • Rule: Alert if CPU utilization exceeds 85% for more than 10 minutes.

  • Recommended configuration:

    • TMCPUUsage

      Maximum Value >= 85

    • Period: 10 Consecutive Cycles (1 Cycle = 1 Minute)

Memory performance bottlenecks

  • Metric: TMHeapMemoryUsed

  • Rule: Alert if heap memory usage exceeds 90% for more than 10 minutes.

  • Recommended configuration:

    • TMHeapMemoryUsed

      Maximum Value >= Threshold (90%)

      Determine this threshold based on JVM Heap on the O&M > Deployments page. For example, if usage is 194 MB / 413 MB, set the threshold to 372 MB (90% of 413 MB); the sketch after this list reproduces the calculation.


    • Period: 10 Consecutive Cycles (1 Cycle = 1 Minute)
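
The threshold arithmetic above can be reproduced in a few lines. The 413 MB total heap is the example value from this section, and the 90% factor matches the rule description.

```java
public class HeapAlertThresholdSketch {
    public static void main(String[] args) {
        // Total JVM heap shown on the O&M > Deployments page (example value from above).
        double totalHeapMb = 413.0;
        // Alert when used heap exceeds 90% of the total.
        double thresholdMb = totalHeapMb * 0.9;
        System.out.printf("TMHeapMemoryUsed alert threshold: about %.0f MB%n", thresholdMb);
    }
}
```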