
Realtime Compute for Apache Flink:Monitoring metrics

Last Updated: Mar 26, 2026

Realtime Compute for Apache Flink exposes metrics across job, operator, and system dimensions. Use these metrics to monitor job health, diagnose performance bottlenecks, and configure alerts. This reference covers each metric's definition, unit, and supported connectors, along with guidance for diagnosing common scenarios.

Usage notes

Data discrepancies between Cloud Monitor and the Flink console

Display differences

The Flink console queries metrics via Prometheus Query Language (PromQL) and displays only the maximum latency. In real-time computing scenarios, average latency can mask critical issues such as data skew or single-partition blocking. Only the maximum value is operationally meaningful for troubleshooting.

Value drift

Cloud Monitor uses a pre-aggregation mechanism to calculate metrics. Due to differences in aggregation windows, sampling intervals, or calculation logic, the maximum value in Cloud Monitor may differ slightly from the real-time value in the Flink console. When troubleshooting, treat the Flink console value as the source of truth.

Emit Delay and watermark configuration

How Emit Delay is calculated

Emit Delay measures data freshness, not system processing speed. The formula is:

Delay = Current system time - Logical time field in the data payload (for example, PriceData.time)

The metric is high if the data itself is old, or if the system is waiting for watermark alignment before emitting output.
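As a minimal sketch, the calculation amounts to a subtraction. The name `event_time_ms` below is a hypothetical stand-in for whatever logical time field your payload carries (such as PriceData.time):

```python
import time

def emit_delay_ms(event_time_ms, now_ms=None):
    """Emit Delay = current system time - logical event time in the payload.

    A large value means the data itself is old or output is being held
    back for watermark alignment; it does not by itself mean the job is
    processing slowly.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - event_time_ms

# A record whose event time is 5 seconds before the observation time:
delay = emit_delay_ms(event_time_ms=1_700_000_000_000, now_ms=1_700_000_005_000)
print(delay)  # 5000
```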

When Emit Delay is high but the system is healthy

If your business logic depends on watermarks for correctness but the data source has inherent latency — such as slow instrumentation reporting or historical backfill — a high Emit Delay is expected and does not indicate a fault. Typical symptoms:

  • Monitoring alerts show high latency

  • Kafka Consumer Lag is near 0

  • CPU utilization is low

In this case, monitor Kafka Consumer Lag instead. As long as the lag does not grow continuously, the system is processing data normally.
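The rule of thumb above (alert only when lag keeps growing, not when it is merely high) can be sketched as a small check over recent lag samples; the function name and window size are illustrative:

```python
def lag_is_growing(samples, min_points=5):
    """Return True if Kafka consumer lag increases strictly across the
    last `min_points` samples, i.e. the job is genuinely falling behind.

    High Emit Delay with flat or oscillating lag and low CPU usually
    means old event times in the data, not a processing fault.
    """
    if len(samples) < min_points:
        return False  # not enough history to decide
    window = samples[-min_points:]
    return all(b > a for a, b in zip(window, window[1:]))

print(lag_is_growing([3, 2, 4, 3, 2, 4]))      # False: oscillating near zero
print(lag_is_growing([10, 40, 90, 160, 250]))  # True: monotonic growth
```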

When to reduce Emit Delay

If your use case prioritizes real-time output over strict event-time ordering — for example, dashboards or risk control where minor out-of-order events are acceptable — a large watermark tolerance window directly adds output latency. To reduce it:

  • Switch to processing time semantics, or

  • Set the watermark waiting threshold to 0

After either change, Emit Delay drops to near the physical processing latency, and records are emitted as they arrive.
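As a rough back-of-the-envelope model (not Flink's exact internal accounting), the watermark tolerance window adds directly onto the physical processing latency:

```python
def expected_emit_delay_ms(processing_latency_ms, watermark_tolerance_ms):
    """Rough model: output is held until the watermark passes the record's
    timestamp, so the out-of-orderness tolerance adds directly to the
    physical processing latency."""
    return processing_latency_ms + watermark_tolerance_ms

# With a 30 s tolerance, a job with 200 ms physical latency still
# reports roughly 30.2 s of Emit Delay:
print(expected_emit_delay_ms(200, 30_000))  # 30200
# Setting the tolerance to 0 drops it back to the physical latency:
print(expected_emit_delay_ms(200, 0))       # 200
```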

Diagnose common scenarios

Metrics reflect the current state of individual components. For root cause analysis, always combine metric data with the BackPressure panel and Thread Dump in the Flink UI.

Operator backpressure

Downstream operators cannot keep up, causing the source to slow its ingestion rate.

How to detect: Check the backpressure monitoring panel in the Flink UI.

Metric signals:

  • sourceIdleTime increases periodically

  • currentFetchEventTimeLag and currentEmitEventTimeLag increase continuously

  • In severe cases, sourceIdleTime increases without interruption

Source performance bottleneck

The source is reading at maximum throughput but still cannot meet downstream demand.

How to detect: The backpressure panel shows no backpressure anywhere in the job, yet latency remains high.

Metric signals:

  • sourceIdleTime stays near 0, indicating the source is running at full capacity

  • currentFetchEventTimeLag and currentEmitEventTimeLag are both high and similar in value

Data skew or empty partitions

Data is unevenly distributed across upstream Kafka partitions, or some partitions are empty.

How to detect: Compare sourceIdleTime across source subtasks.

Metric signals:

  • One source subtask has a significantly higher sourceIdleTime than the others, indicating that the partitions assigned to it are empty or sparse and the subtask sits idle

High end-to-end latency

The overall job latency is high and you need to locate whether the bottleneck is inside the source or in the external system.

How to detect: Analyze sourceIdleTime, the lag difference, and pendingRecords together.

Signal Interpretation
sourceIdleTime is high: The external system is producing data slowly; Flink is not the bottleneck.
currentEmitEventTimeLag - currentFetchEventTimeLag is small: The bottleneck is network I/O bandwidth or source parallelism (insufficient pull capacity).
currentEmitEventTimeLag - currentFetchEventTimeLag is large: The bottleneck is data parsing or downstream backpressure (insufficient processing capacity).
pendingRecords is high: A large volume of data is accumulating in the external system.
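The decision table above can be sketched as a small classifier. The threshold values here are illustrative placeholders, not Flink defaults, and should be tuned per job:

```python
def classify_bottleneck(fetch_lag_ms, emit_lag_ms, source_idle_ms,
                        pending_records,
                        gap_threshold_ms=1_000, idle_threshold_ms=5_000):
    """Apply the interpretation table: compare the emit/fetch lag gap,
    source idle time, and upstream backlog to localize the bottleneck."""
    findings = []
    if source_idle_ms > idle_threshold_ms:
        findings.append("sourceIdleTime high: external system producing "
                        "slowly; Flink is not the bottleneck")
    if emit_lag_ms - fetch_lag_ms <= gap_threshold_ms:
        findings.append("small lag gap: insufficient pull capacity "
                        "(network I/O or source parallelism)")
    else:
        findings.append("large lag gap: insufficient processing capacity "
                        "(parsing or downstream backpressure)")
    if pending_records > 0:
        findings.append("pendingRecords high: data accumulating "
                        "in the external system")
    return findings

# Fetch and emit lag both high and close together -> pull-side bottleneck:
print(classify_bottleneck(59_500, 60_000, 0, 0))
```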

Overview metrics

These metrics apply at the job or operator level.

Metric Definition Details Unit Supported connectors
Num of Restarts Number of times the job restarted due to errors Counts error-triggered restarts, excluding JobManager (JM) failovers. Use this to track job availability. Count All
currentEmitEventTimeLag Business latency Measures time from when an event was generated (event time) to when it is emitted downstream. A high value indicates latency in data pulling or processing. ms Kafka, RocketMQ, Simple Log Service (SLS), DataHub, PostgreSQL CDC, Hologres (Binlog Source)
currentFetchEventTimeLag Transmission latency Measures time from when an event was generated to when it is fetched from the external system. A high value indicates network I/O or upstream system latency. Compare with currentEmitEventTimeLag to isolate the bottleneck: a small difference means insufficient pull capacity; a large difference means the job cannot process data fast enough. ms Kafka, RocketMQ, SLS, DataHub, PostgreSQL CDC, Hologres (Binlog Source)
numRecordsIn Total input records across all operators If this value stops increasing for a prolonged period, the upstream may have been fully consumed. Check the upstream data source. Count All built-in connectors
numRecordsOut Total output records across all operators If this value stops increasing, a logic error in the job may be dropping all records. Check the job code. Count All built-in connectors
numRecordsInOfSource Input records at the source operator Use this to observe upstream data input volume at the source level. Count Kafka, MaxCompute, Incremental MaxCompute, RocketMQ, SLS, DataHub, Elasticsearch, Hologres
numRecordsOutOfSink Output records from the sink operator Use this to observe downstream data output volume. Count Kafka, SLS, DataHub, Hologres, HBase, Tablestore, Redis
numRecordsInPerSecond Input records per second (entire job) Use this to monitor overall ingestion throughput and observe how performance changes under varying input loads. records/s All built-in connectors
numRecordsOutPerSecond Output records per second (entire job) Use this to monitor overall output throughput and observe how performance changes under varying output loads. records/s All built-in connectors
numRecordsInOfSourcePerSecond (IN RPS) Input records per second at the source Measures the ingestion rate per source. If this value drops to 0, the upstream may be fully consumed or output is blocked. Check the upstream data source. records/s Kafka, MaxCompute, Incremental MaxCompute, RocketMQ, SLS, DataHub, Elasticsearch, Hologres
numRecordsOutOfSinkPerSecond (OUT RPS) Output records per second at the sink Measures the output rate per sink. If this value drops to 0, a job logic error may be filtering out all records. Check the job code. records/s Kafka, MaxCompute, Incremental MaxCompute, SLS, DataHub, Hologres, HBase, Tablestore, Redis
pendingRecords Unread records in the external system The number of records in the external system that the source has not yet fetched. A high and growing value indicates accumulation upstream. Count Kafka, Elasticsearch
sourceIdleTime Source idle duration Indicates how long the source has had no data to process. A persistently high value means the external system is producing data slowly. ms Kafka, RocketMQ, PostgreSQL CDC, Hologres (Binlog Source)

System checkpoints

Metric Definition Details Unit
Num of Checkpoints Number of checkpoints Use this to set up checkpoint alerts and track checkpoint health over time. Count
lastCheckpointDuration Duration of the most recent checkpoint A long or timed-out checkpoint may indicate large state, network issues, unaligned barriers, or data backpressure. Monitor both the absolute value and its rate of change over time. ms
lastCheckpointSize Uploaded size of the most recent checkpoint Use this to analyze checkpoint performance when bottlenecks occur. Bytes
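The guidance above for lastCheckpointDuration (watch both the absolute value and its rate of change) can be sketched as follows; the timeout and growth thresholds are hypothetical examples, not Flink defaults:

```python
def checkpoint_alerts(durations_ms, timeout_ms=600_000, growth_factor=2.0):
    """Flag the two checkpoint failure modes: an absolute duration
    approaching the timeout, and a duration trending sharply upward
    versus the recent average (e.g. growing state or backpressure)."""
    alerts = []
    last = durations_ms[-1]
    if last >= 0.8 * timeout_ms:
        alerts.append("lastCheckpointDuration is near the checkpoint timeout")
    if len(durations_ms) >= 2:
        baseline = sum(durations_ms[:-1]) / (len(durations_ms) - 1)
        if baseline > 0 and last / baseline >= growth_factor:
            alerts.append("checkpoint duration rising sharply vs recent average")
    return alerts

# Stable history, then one checkpoint ~4.5x the recent average:
print(checkpoint_alerts([10_000, 12_000, 11_000, 50_000]))
```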

State

State latency metrics are disabled by default. To enable them, set state.backend.latency-track.keyed-state-enabled: true in the advanced Flink configurations. Enabling these metrics may affect job runtime performance.
Metric Definition Unit Minimum version
State Clear Latency Maximum latency of a single state clear operation ns VVR 4.0.0
Value State Latency Maximum latency of a single Value State access ns VVR 4.0.0
Aggregating State Latency Maximum latency of a single Aggregating State access ns VVR 4.0.0
Reducing State Latency Maximum latency of a single Reducing State access ns VVR 4.0.0
Map State Latency Maximum latency of a single Map State access ns VVR 4.0.0
List State Latency Maximum latency of a single List State access ns VVR 4.0.0
Sorted Map State Latency Maximum latency of a single Sorted Map State access ns VVR 4.0.0
State Size Size of state data in memory Bytes VVR 4.0.12
State File Size Size of the state data file on local disk Bytes VVR 4.0.13

I/O

Metric Definition Details Unit Supported connectors
numBytesIn Total input bytes Monitor input throughput from upstream. Bytes Kafka, MaxCompute, Incremental MaxCompute, RocketMQ
numBytesInPerSecond Input bytes per second Monitor the input stream rate from upstream. Bytes/s Kafka, MaxCompute, Incremental MaxCompute, RocketMQ
numBytesOut Total output bytes Monitor output throughput. Bytes Kafka, RocketMQ, DataHub, HBase
numBytesOutPerSecond Output bytes per second Monitor the output stream rate. Bytes/s Kafka, RocketMQ, DataHub, HBase
Task numRecords I/O Total records received and sent by each subtask Use this to determine whether the job has an I/O bottleneck. Count Kafka, MaxCompute, Incremental MaxCompute, SLS, DataHub, Elasticsearch, Hologres, HBase, Tablestore, Redis
Task numRecords I/O PerSecond Records received and sent by each subtask per second Use this to assess the severity of any I/O bottleneck. records/s Kafka, MaxCompute, Incremental MaxCompute, SLS, DataHub, Elasticsearch, Hologres, HBase, Tablestore, Redis
currentSendTime Time to send the most recent record to the downstream system A high value indicates slow output from the subtask. ms Kafka, MaxCompute, Incremental MaxCompute, RocketMQ, SLS, DataHub, Hologres (JDBC and RPC modes only; not supported in BHClient mode), HBase, Tablestore, Redis

Watermark

Metric Definition Details Unit Supported connectors
Task InputWatermark Timestamp of the latest watermark received by each task Indicates data receiving latency at the TaskManager (TM) level. None Not applicable
watermarkLag Watermark latency Use this to determine job latency at the subtask level. ms Kafka, RocketMQ, SLS, DataHub, Hologres (Binlog Source)

CPU

Metric Definition Details Unit
JM CPU Usage CPU usage of a single JobManager Reflects Flink's CPU time-slice consumption. 100% means one core is fully utilized; 400% means four cores are fully utilized. If persistently above 100%, the CPU is saturated. A high system load with low CPU usage may indicate many processes in uninterruptible sleep due to frequent I/O operations. Requires VVR 6.0.6 or later. None
TM CPU Usage CPU usage of a single TaskManager Same interpretation as JM CPU Usage. 100% means one core is fully utilized; 400% means four cores are fully utilized. None
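The percentage-to-cores interpretation above can be sketched as a quick conversion; the function names are illustrative:

```python
def cpu_cores_used(cpu_usage_percent):
    """Convert the JM/TM CPU Usage metric to cores: 100% means one
    fully utilized core, 400% means four fully utilized cores."""
    return cpu_usage_percent / 100.0

def is_cpu_saturated(cpu_usage_percent, allocated_cores):
    """Saturated when usage persistently meets or exceeds the allocation."""
    return cpu_cores_used(cpu_usage_percent) >= allocated_cores

print(cpu_cores_used(250))       # 2.5 cores busy
print(is_cpu_saturated(400, 4))  # True
```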

Memory

Metric Definition Unit
JM Heap Memory Heap memory usage of the JobManager Bytes
JM NonHeap Memory Non-heap memory usage of the JobManager Bytes
TM Heap Memory Heap memory usage of the TaskManager Bytes
TM nonHeap Memory Non-heap memory usage of the TaskManager Bytes
TM Mem (RSS) Resident Set Size (RSS) of the TaskManager process, as reported by Linux Bytes

JVM

Metric Definition Details Unit
JM Threads Number of JM threads Excessive threads consume memory and reduce job stability. Count
TM Threads Number of TM threads Excessive threads consume memory and reduce job stability. Count
JM GC Count Number of JM garbage collection (GC) events Frequent GC events consume memory and degrade job performance. Use this to diagnose job-level failures. Count
JM GC Time Duration of each JM GC event Long GC pauses degrade job performance. Use this to diagnose job-level failures. ms
TM GC Count Number of TM GC events Frequent GC events consume memory and degrade job performance. Use this to diagnose task-level failures. Count
TM GC Time Duration of each TM GC event Long GC pauses degrade job performance. Use this to diagnose task-level failures. ms
JM ClassLoader Total classes loaded or unloaded by the Java Virtual Machine (JVM) of the JM since startup An excessive number of class load/unload operations consumes memory and degrades job performance. None
TM ClassLoader Total classes loaded or unloaded by the JVM of the TM since startup An excessive number of class load/unload operations consumes memory and degrades job performance. None

Connector - MySQL

Requires VVR 8.0.9 or later.

Metric Definition Unit Use case
isSnapshotting Whether the job is in the full snapshot phase (1 = yes) None Determine the current processing phase
isBinlogReading Whether the job is in the incremental binary log (binlog) reading phase (1 = yes) None Determine the current processing phase
Num of remaining tables Tables waiting to be processed in the snapshot phase Count Track remaining work in the snapshot phase
Num of snapshotted tables Tables already processed in the snapshot phase Count Track completed work in the snapshot phase
Num of remaining SnapshotSplits Snapshot splits waiting to be processed Count Track remaining splits in the snapshot phase
Num of processed SnapshotSplits Snapshot splits already processed Count Track completed splits in the snapshot phase
currentFetchEventTimeLag Latency between data generation and when it is read from the database ms Monitor binary log read latency
currentReadTimestampMs Timestamp of the most recent record read ms Track the progress of data ingestion
numRecordsIn Total records read Count Monitor overall data volume processed
numSnapshotRecords Records processed in the snapshot phase Count Monitor data volume processed during the snapshot phase
numRecordsInPerTable Records read per table Count Monitor per-table data volume
numSnapshotRecordsPerTable Records processed per table in the snapshot phase Count Monitor per-table snapshot data volume

Connector - Kafka

Metric Definition Unit Use case Version limitations
commitsSucceeded Successful offset commits Count Verify that offset commits are succeeding VVR 8.0.9 or later
commitsFailed Failed offset commits Count Detect offset commit failures VVR 8.0.9 or later
Fetch Rate Data fetch frequency times/s Analyze pull latency and throughput
Fetch Latency Avg Average latency of fetch operations ms Analyze pull latency and throughput
Fetch Size Avg Average bytes fetched per request Bytes Analyze pull latency and throughput
Avg Records In Per-Request Average records per fetch request Count Analyze pull latency and throughput
currentSendTime Time to send the most recent record to the downstream system ms Analyze write latency and throughput
batchSizeAvg Average bytes per write batch Bytes Analyze write latency and throughput
requestLatencyAvg Average write request latency ms Analyze write latency and throughput
requestsInFlight Number of in-flight write requests Count Analyze write latency and throughput
recordsPerRequestAvg Average records per write request Count Analyze write latency and throughput
recordSizeAvg Average message size Bytes Analyze write latency and throughput

Connector - Paimon

Metric Definition Unit Use case Version limitations
Number of Writers Number of active writer instances Count A high count can reduce write efficiency and increase memory consumption. Review whether the bucket count or partition key configuration is appropriate. VVR 8.0.9 or later
Max Compaction Thread Busy Maximum compaction thread busy ratio in the last minute, across all active buckets Ratio Indicates peak pressure on small file compaction. A value near 1.0 means compaction threads are saturated.
Average Compaction Thread Busy Average compaction thread busy ratio in the last minute, across all active buckets Ratio Indicates average pressure on small file compaction.
Max Number of Level 0 Files Maximum Level 0 file count across all active buckets Count Applies to primary key tables only. A high and growing value means compaction cannot keep up with write throughput.
Average Number of Level 0 Files Average Level 0 file count across all active buckets Count Applies to primary key tables only. Use this alongside the maximum to understand compaction health.
Last Commit Duration Duration of the most recent commit ms A long commit duration may indicate too many buckets being written simultaneously.
Number of Partitions Last Committed Partitions written in the most recent commit Count A large number can reduce write efficiency and increase memory consumption.
Number of Buckets Last Committed Buckets written in the most recent commit Count A large number can reduce write efficiency and increase memory consumption.
Used Write Buffer Write buffer memory currently in use across all TaskManagers Bytes This buffer occupies Java heap memory. If set too large, it may cause an out-of-memory (OOM) error.
Total Write Buffer Total write buffer memory allocated across all TaskManagers Bytes This buffer occupies Java heap memory. If set too large, it may cause an OOM error.

Data ingestion

These metrics apply to MySQL Change Data Capture (CDC) jobs. They are identical to the MySQL connector metrics above. Requires VVR 8.0.9 or later.

Metric Definition Unit Use case
isSnapshotting Whether the job is in the full snapshot phase (1 = yes) None Determine the current processing phase
isBinlogReading Whether the job is in the incremental binlog reading phase (1 = yes) None Determine the current processing phase
Num of remaining tables Tables waiting to be processed in the snapshot phase Count Track remaining work in the snapshot phase
Num of snapshotted tables Tables already processed in the snapshot phase Count Track completed work in the snapshot phase
Num of remaining SnapshotSplits Snapshot splits waiting to be processed Count Track remaining splits in the snapshot phase
Num of processed SnapshotSplits Snapshot splits already processed Count Track completed splits in the snapshot phase
currentFetchEventTimeLag Latency between data generation and when it is read from the database ms Monitor binary log read latency
currentReadTimestampMs Timestamp of the most recent record read ms Track the progress of data ingestion
numRecordsIn Total records read Count Monitor overall data volume processed
numRecordsInPerTable Records read per table Count Monitor per-table data volume
numSnapshotRecords Records processed in the snapshot phase Count Monitor data volume processed during the snapshot phase
numSnapshotRecordsPerTable Records processed per table in the snapshot phase Count Monitor per-table snapshot data volume