Realtime Compute for Apache Flink exposes metrics across job, operator, and system dimensions. Use these metrics to monitor job health, diagnose performance bottlenecks, and configure alerts. This reference covers each metric's definition, unit, and supported connectors, along with guidance for diagnosing common scenarios.
Usage notes
Data discrepancies between Cloud Monitor and the Flink console
Display differences
The Flink console queries metrics via Prometheus Query Language (PromQL) and displays only the maximum latency. In real-time computing scenarios, average latency can mask critical issues such as data skew or single-partition blocking. Only the maximum value is operationally meaningful for troubleshooting.
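The effect is easy to see with made-up numbers. The sketch below (all values are illustrative, not measured) shows a job where 100 subtasks are healthy and one partition is blocked: the average still looks acceptable while the maximum exposes the problem.

```python
# Illustrative numbers only: 100 healthy subtasks plus one blocked partition.
latencies_ms = [100] * 100 + [60_000]

avg = sum(latencies_ms) / len(latencies_ms)  # roughly 693 ms: looks acceptable
worst = max(latencies_ms)                    # 60,000 ms: exposes the blocked partition

print(f"avg={avg:.0f} ms, max={worst} ms")
```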
Value drift
Cloud Monitor uses a pre-aggregation mechanism to calculate metrics. Due to differences in aggregation windows, sampling intervals, or calculation logic, the maximum value in Cloud Monitor may differ slightly from the real-time value in the Flink console. When troubleshooting, treat the Flink console value as the source of truth.
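A toy model makes the drift concrete. The sketch below (window size and sample values are assumptions for illustration, not Cloud Monitor's actual aggregation parameters) averages raw samples into fixed windows before taking the maximum, so a short spike that the Flink console shows at full height is diluted in the pre-aggregated view.

```python
# Toy model of pre-aggregation drift: raw per-second samples vs. the max of
# per-window averages. Window size and sample values are made up.
raw = [100, 120, 110, 5000, 130, 115]  # one short latency spike
window = 3                              # samples per aggregation window (assumed)

window_avgs = [
    sum(raw[i:i + window]) / window for i in range(0, len(raw), window)
]

console_max = max(raw)           # 5000: the raw spike is visible
monitor_max = max(window_avgs)   # ~1748: averaging dilutes the spike
```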
Emit Delay and watermark configuration
Diagnose common scenarios
Metrics reflect the current state of individual components. For root cause analysis, always combine metric data with the BackPressure panel and Thread Dump in the Flink UI.
Operator backpressure
Downstream operators cannot keep up, causing the source to slow its ingestion rate.
How to detect: Check the backpressure monitoring panel in the Flink UI.
Metric signals:
- `sourceIdleTime` increases periodically
- `currentFetchEventTimeLag` and `currentEmitEventTimeLag` increase continuously
- In severe cases, `sourceIdleTime` increases without interruption
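A "continuously increasing" lag series can be checked mechanically. The helper below is a minimal sketch (the function name, sampling cadence, and minimum sample count are assumptions, not part of any Flink API): it flags a series of successive lag readings that rises at every step.

```python
def is_rising(samples, min_points=3):
    """Return True if a lag series (e.g., successive currentEmitEventTimeLag
    readings) increases at every step -- one rough signal of sustained
    backpressure. min_points is an illustrative threshold."""
    if len(samples) < min_points:
        return False
    return all(b > a for a, b in zip(samples, samples[1:]))

is_rising([1200, 1500, 2100, 3400])  # -> True
is_rising([1200, 900, 1500, 800])    # -> False
```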
Source performance bottleneck
The source is reading at maximum throughput but still cannot meet downstream demand.
How to detect: The backpressure panel in the Flink UI shows no backpressure, yet lag metrics keep growing.
Metric signals:
- `sourceIdleTime` stays near 0, indicating the source is running at full capacity
- `currentFetchEventTimeLag` and `currentEmitEventTimeLag` are both high and similar in value
Data skew or empty partitions
Data is unevenly distributed across upstream Kafka partitions, or some partitions are empty.
How to detect: Compare `sourceIdleTime` across source subtasks.
Metric signals:
- One source subtask has a significantly higher `sourceIdleTime` than the others, indicating that its parallelism slots are idle
High end-to-end latency
The overall job latency is high, and you need to determine whether the bottleneck is inside the source or in the external system.
How to detect: Analyze `sourceIdleTime`, the lag difference, and `pendingRecords` together.
| Signal | Interpretation |
|---|---|
| `sourceIdleTime` is high | The external system is producing data slowly; Flink is not the bottleneck. |
| `currentEmitEventTimeLag - currentFetchEventTimeLag` is small | Bottleneck is in network I/O bandwidth or source parallelism (insufficient pull capacity). |
| `currentEmitEventTimeLag - currentFetchEventTimeLag` is large | Bottleneck is in data parsing or downstream backpressure (insufficient processing capacity). |
| `pendingRecords` is high | A large volume of data is accumulating in the external system. |
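The table's triage logic can be expressed as a small decision function. This is a sketch only: the function name and every threshold are illustrative placeholders (not product defaults), and real diagnosis should still cross-check the BackPressure panel and Thread Dump.

```python
def diagnose(source_idle_ms, fetch_lag_ms, emit_lag_ms, pending_records):
    """Rough triage mirroring the signal table above.
    All thresholds are illustrative placeholders, not product defaults."""
    findings = []
    if source_idle_ms > 10_000:
        # Source sits idle: the external system is the slow side.
        findings.append("external system producing slowly; Flink is not the bottleneck")
    elif emit_lag_ms - fetch_lag_ms < 1_000:
        # Small gap between emit and fetch lag: data leaves as fast as it arrives.
        findings.append("check network I/O bandwidth or source parallelism (pull capacity)")
    else:
        # Large gap: records arrive but stall before being emitted downstream.
        findings.append("check data parsing or downstream backpressure (processing capacity)")
    if pending_records > 1_000_000:
        findings.append("large backlog accumulating in the external system")
    return findings
```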
Overview metrics
Overview metrics are grouped into system checkpoint, state, I/O, and watermark categories. The watermark metrics are listed below.
Watermark
| Metric | Definition | Details | Unit | Supported connectors |
|---|---|---|---|---|
| Task InputWatermark | Timestamp of the latest watermark received by each task | Indicates data receiving latency at the TaskManager (TM) level. | None | Not applicable |
| watermarkLag | Watermark latency, calculated as the current system time minus the watermark | Use this metric to determine job latency at the subtask level. | ms | Kafka, RocketMQ, SLS, DataHub, Hologres (Binlog Source) |