
Realtime Compute for Apache Flink: Best practices for monitoring and alerting

Last Updated: Nov 20, 2025

This document lists key alert metrics, recommended configurations, and best practices to help you effectively monitor job performance and diagnose issues.

Prerequisites

See Configure alert rules and choose the appropriate configuration method based on your workspace's monitoring service.


Recommended alert rules

Each use case below lists the metrics or events to monitor, the recommended rule configuration, the severity level, and the recommended actions to take when the alert fires.

Job failures

Event: JOB_FAILED (system event)

Rule configuration: Triggered when the event occurs

Severity: P0

Recommended actions:

1. Check the configured restart policy (see the restart-strategy sketch after this list).
2. Determine the cause of the failure (e.g., JobManager/TaskManager issue or a deliberate stop).
3. Restore the job from the latest savepoint or checkpoint.
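
The following minimal DataStream API sketch shows how a fixed-delay restart policy can be set with the open source Flink API. The retry count, delay, and placeholder pipeline are illustrative assumptions; SQL deployments typically set the equivalent restart-strategy options in the deployment's runtime configuration instead.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartPolicySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Restart the job at most 3 times, waiting 10 seconds between attempts.
        // Once the attempts are exhausted, the job transitions to FAILED, which is
        // when a JOB_FAILED alert such as the one above becomes relevant.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        // Placeholder pipeline so the sketch runs end to end.
        env.fromElements("a", "b", "c").print();
        env.execute("restart-policy-sketch");
    }
}
```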

Frequent restarts

Metric: Overview > NumOfRestart

Rule configuration:

  • ≥ 1 restart

  • 1 consecutive period

Severity: P0

Recommended actions:

1. Identify the cause.

  • Analyze failover, JobManager, and TaskManager logs to find the cause of the failure.

  • Ignore infrequent, auto-recoverable machine failures.

  • Fix code bugs, resource bottlenecks, or configuration errors.

2. Recover the job from the latest savepoint or checkpoint.

Consecutive checkpoint failures

Metric: Checkpoint > NumOfCheckpoints

Rule configuration:

  • ≤ 0 checkpoints

  • 1 consecutive period

Severity: P0

Recommended actions:

1. See System checkpoints to troubleshoot the cause of checkpoint failures.

2. Identify the issue.

  • Parameter issues (e.g., timeout): Adjust the checkpoint configuration (see the checkpoint tuning sketch after this list).

  • Resource issues (e.g., backpressure): Use dynamic scaling to add resources to the affected operator.

3. Dynamically update the configuration or restore the job from the latest checkpoint.
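
For the parameter-tuning path, here is a minimal sketch that uses the open source Flink CheckpointConfig API. The interval, timeout, pause, and tolerance values are placeholder assumptions; in Realtime Compute for Apache Flink, the same options are usually set in the deployment's runtime configuration.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Give slow checkpoints more time before they are declared failed.
        checkpointConfig.setCheckpointTimeout(600_000);
        // Leave at least 30 seconds between the end of one checkpoint and the start of the next.
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000);
        // Tolerate a few consecutive checkpoint failures before failing the whole job.
        checkpointConfig.setTolerableCheckpointFailureNumber(3);

        // Placeholder pipeline so the sketch runs end to end.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-tuning-sketch");
    }
}
```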

Latency

Metrics: Overview > CurrentEmitEventTimeLag and NumOfRecordsInFromSourcePerSecond

Rule configuration:

  • Maximum latency ≥ 180,000 ms

  • Input records > 0

  • 3 consecutive periods

Severity: P1

Recommended actions:

1. See Metric description to investigate the cause of the latency.

  • Data-related: Check if event times are out of order.

  • Traffic-related: Check if there is an upstream data surge or backpressure from a downstream system.

2. Take action based on the cause.

  • Internal cause: Adjust connector options or scale up the bottleneck operator.

  • External cause: Optimize external service configurations (e.g., adjust the traffic throttling strategy, increase connection counts).

Upstream data flow interruptions

Metrics: Overview > NumOfRecordsInFromSourcePerSecond and SourceIdleTime

Rule configuration:

  • Input records ≤ 0 (depending on your business logic)

  • Maximum source idle time ≥ 60,000 ms

  • 5 consecutive periods

Severity: P1

Recommended actions:

1. Check taskmanager.log, flame graphs, and upstream metrics to identify the cause, such as a stalled thread, an upstream issue, rate throttling, or an error.

2. Take action based on the cause.

  • Connector issues: Adjust connector options (e.g., timeout, concurrency) or increase TaskManager resources.

  • Upstream/downstream service issues: Notify the upstream service owner to resolve the issue.

  • Internal Flink bottlenecks (e.g., backpressure or system freeze): First, resolve the root cause of the bottleneck (e.g., fix the downstream issue), then restart the job from the latest checkpoint.

Output issues

Metric: Overview > NumOfRecordsOutToSinkPerSecond

Rule configuration:

  • Output records ≤ 0

  • 5 consecutive periods

Severity: P1

Recommended actions:

1. Verify whether data is reaching the sink operator.

  • Data filtering: Check logs/metrics for any data filtering that might be dropping records.

  • Late data handling: Review watermark and window configurations to ensure late data isn't being dropped (see the watermark sketch after this list).

2. Verify that the sink operator is sending data to the destination.

  • Connection: Check for full connection pools. Assess network connection stability.

  • Destination system: Investigate downstream system for table locks, disk space, write throttling, or other errors.

3. Temporarily enable dual-write to a backup system.
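
The watermark sketch below illustrates, with the open source Flink DataStream API, where out-of-orderness tolerance and window lateness are configured so that late records are not silently dropped. The SensorReading type, field names, and durations are illustrative assumptions, not part of the product.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LateDataSketch {

    // Illustrative record type (an assumption for this sketch).
    public static class SensorReading {
        public String sensorId;
        public long eventTimeMillis;
        public double value;

        public SensorReading() {}

        public SensorReading(String sensorId, long eventTimeMillis, double value) {
            this.sensorId = sensorId;
            this.eventTimeMillis = eventTimeMillis;
            this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new SensorReading("sensor-1", 1_000L, 21.5),
                new SensorReading("sensor-1", 61_000L, 22.0))
            // Tolerate events that arrive up to 30 seconds out of order
            // before the watermark passes them by.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((reading, ts) -> reading.eventTimeMillis))
            .keyBy(reading -> reading.sensorId)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            // Keep window state for 5 extra minutes so late records still update
            // the result instead of being dropped silently.
            .allowedLateness(Time.minutes(5))
            .sum("value")
            .print();

        env.execute("late-data-sketch");
    }
}
```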

CPU performance bottlenecks

Metric: CPU > TMCPUUsage

Rule configuration:

  • TM CPU usage ≥ 85%

  • 10 consecutive periods

Severity: P2

Recommended actions:

1. Use the flame graph or Flink UI to identify the hotspot operator.

  • Business logic: Check for complex computations, JSON parsing, or inefficient UDFs.

  • Data skew: Determine if a hot key is overloading a single task due to an excessive volume of data.

  • Insufficient resources: Check whether the parallelism and TaskManager resources are sufficient, and investigate how severe the backpressure is.

  • Frequent GC: Check logs or JVM metrics to see if memory pressure is causing frequent full GCs, which consume significant CPU.

2. Increase the parallelism of the bottleneck operator (see the sketch below) or allocate more vCPUs to TaskManagers.
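
As a reference for scaling only the hotspot, the minimal sketch below raises the parallelism of a single operator with the open source Flink DataStream API. The ExpensiveParse function, operator name, and parallelism values are illustrative assumptions; SQL deployments usually adjust operator parallelism through the deployment's resource configuration instead.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HotspotParallelismSketch {

    // Stand-in for an expensive transformation such as JSON parsing or a heavy UDF.
    public static class ExpensiveParse implements MapFunction<String, String> {
        @Override
        public String map(String value) {
            return value.toUpperCase();
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Default parallelism for the rest of the job.
        env.setParallelism(2);

        env.fromElements("a", "b", "c")
            .map(new ExpensiveParse())
            .name("expensive-parse")
            // Give only the hotspot operator more parallel subtasks.
            .setParallelism(8)
            .print();

        env.execute("hotspot-parallelism-sketch");
    }
}
```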

Memory performance bottlenecks

Metric: Memory > TMHeapMemoryUsed

Rule configuration:

  • TM heap memory usage ≥ 90%

  • 10 consecutive periods

Severity: P2

Recommended actions:

1. Analyze GC logs to identify the problem.

  • Memory leak: In the Flink UI or monitoring dashboards, check if heap memory usage fails to return to its baseline after GC and continues to rise.

  • Insufficient capacity: Check whether heap memory usage stays persistently high. High usage can trigger frequent full GCs, degrading performance.

  • Sudden OOM: Check whether memory is exhausted suddenly when a specific record or batch of data is processed, directly causing an out-of-memory error.

2. Take action based on the cause: increase the heap size (see the memory-options sketch below) or increase the parallelism.
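
In Realtime Compute for Apache Flink, TaskManager memory is normally increased through the deployment's resource settings. Purely as a rough illustration of what "increase the heap size" corresponds to in open source Flink, the sketch below prints two related memory options with placeholder sizes.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;

public class TaskManagerMemorySketch {
    public static void main(String[] args) {
        // Placeholder sizes; choose values that fit your workload and cluster.
        Configuration conf = new Configuration();

        // Total memory of each TaskManager process (heap, managed, network, overhead).
        conf.set(TaskManagerOptions.TOTAL_PROCESS_MEMORY, MemorySize.parse("4g"));
        // JVM heap available to operators and user code on each TaskManager.
        conf.set(TaskManagerOptions.TASK_HEAP_MEMORY, MemorySize.parse("2g"));

        conf.toMap().forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```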

Job availability

Create alert rules for job failures

Prometheus

  1. Log on to the Realtime Compute for Apache Flink console. Find your workspace and click Console in the Actions column.

  2. In the Development Console, navigate to O&M > Deployments, and click the name of your job deployment.

  3. Click the Alarm tab. Switch to the Alarm Rules subtab, and click Add Rule > Custom Rule. In the Create Rule panel, for Content, select Job Failed from the Metric dropdown list.


Cloud Monitor

  1. Log on to the Cloud Monitor console.

  2. In the left navigation pane, choose Event Center > Event Subscription.

  3. On the Subscription Policy tab, click Create Subscription Policy.

  4. On the Create Subscription Policy page, configure the policy. For more information, see Manage event subscription policies (recommended).


Job stability

Frequent restarts

  • Metric: NumOfRestart

  • Rule description: Alert if the job restarts one or more times within a minute.

  • Recommended configuration:

    • NumOfRestart

      Metric Value >= 1

    • Period: 1 Consecutive Cycle (1 Cycle = 1 Minute)

    • Notification: Phone Call + SMS Message + Email + Webhook (Critical)

Consecutive checkpoint failures

  • Metric: NumOfCheckpoints

  • Rule description: Alert if no checkpoint succeeds for 5 minutes.

  • Recommended configuration:

    • NumOfCheckpoints

      Metric Value <= 0

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

    • Notification: Phone Call + SMS Message + Email + Webhook (Critical)

Data timeliness

Latency

  • Metrics:

    • CurrentEmitEventTimeLag

    • NumOfRecordsInFromSourcePerSecond

  • Rule description: Alert if data is coming in and business latency exceeds 5 minutes. Choose a threshold and alert level as needed.

  • Recommended configuration:

    • CurrentEmitEventTimeLag

      Maximum Value >= 300000

    • NumOfRecordsInFromSourcePerSecond

      Metric Value > 0

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

Upstream data flow interruptions

  • Metrics:

    • NumOfRecordsInFromSourcePerSecond

    • SourceIdleTime

  • Rule description: Alert if the source reads no data and remains idle for more than 1 minute, sustained over 5 consecutive minutes. Choose a threshold and alert level as needed.

  • Recommended configuration:

    • NumOfRecordsInFromSourcePerSecond

      Metric Value <= 0

    • SourceIdleTime

      Maximum Value > 60000

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

Output issues

  • Metric: NumOfRecordsOutToSinkPerSecond

  • Rule: Alert if there is no data output for more than 5 minutes. Choose a threshold and alert level as needed.

  • Recommended configuration:

    • NumOfRecordsOutToSinkPerSecond

      Metric Value <= 0

    • Period: 5 Consecutive Cycles (1 Cycle = 1 Minute)

Resource performance bottlenecks

CPU performance bottlenecks

  • Metric: TMCPUUsage

  • Rule: Alert if CPU utilization exceeds 85% for more than 10 minutes.

  • Recommended configuration:

    • TMCPUUsage

      Maximum Value >= 85

    • Period: 10 Consecutive Cycles (1 Cycle = 1 Minute)

Memory performance bottlenecks

  • Metric: TMHeapMemoryUsed

  • Rule: Alert if heap memory usage exceeds 90% for more than 10 minutes.

  • Recommended configuration:

    • TMHeapMemoryUsed

      Maximum Value >= Threshold (90%)

      Determine this threshold based on JVM Heap on the O&M > Deployments page. For example, if usage is 194 MB / 413 MB, set the threshold to 372 MB (90% of 413 MB); the sketch after this list reproduces the calculation.


    • Period: 10 Consecutive Cycles (1 Cycle = 1 Minute)
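
The threshold arithmetic above can be reproduced in a few lines. The 413 MB total heap is the example value from this section, and the 90% factor matches the rule description.

```java
public class HeapAlertThresholdSketch {
    public static void main(String[] args) {
        // Total JVM heap shown on the O&M > Deployments page (example value from above).
        double totalHeapMb = 413.0;
        // Alert when used heap exceeds 90% of the total.
        double thresholdMb = totalHeapMb * 0.9;
        System.out.printf("TMHeapMemoryUsed alert threshold: about %.0f MB%n", thresholdMb);
    }
}
```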