This topic describes the metrics for fully managed Flink.
Notes
Data discrepancies between CloudMonitor and the Flink console
- Differences in displayed dimensions
  The Flink console uses PromQL queries to display only the maximum latency. In real-time computing scenarios, average latency can mask serious issues like data skew or single-partition blockages. Therefore, only the maximum latency provides valuable operational insights.
- Value discrepancies
  CloudMonitor uses a pre-aggregation mechanism to compute metrics. The "maximum" value in CloudMonitor may differ slightly from the real-time value in the Flink console due to differences in aggregation windows, sampling timestamps, or calculation logic. For troubleshooting, use the Flink console data as the source of truth.
Data latency and watermark configuration
- Latency calculation logic
  The current monitoring metric Emit Delay is calculated based on event time, using the following formula:
  Delay = Current System Time - Logical time field in the data record (e.g., PriceData.time)
  This means the metric reflects data freshness, not the system's processing speed. The metric is high when the source data is old or when the system pauses output to align watermarks.
- Recommendations
  Scenario 1: Your business logic relies on watermarks for correctness, but the source data is old
  - Typical situations:
    - Upstream data delivery is inherently delayed (e.g., slow event reporting).
    - You are running a backfill to process data from a previous day.
    - The business logic requires watermarks to handle out-of-order events, so they cannot be disabled.
  - Symptom: Monitoring alerts show high latency, but the Kafka consumer group has no lag (lag ≈ 0) and the CPU load is low.
  - Recommendations:
    - Ignore this latency metric: In this case, a high delay is expected because it reflects the age of the data. It does not indicate a system failure.
    - Switch to a different metric: Monitor the Kafka consumer lag instead. If the consumer lag does not continuously increase, the system has sufficient processing capacity and requires no intervention.
  Scenario 2: You require low latency and can tolerate minor out-of-order events or data loss
  - Typical situations:
    - For applications like large-screen dashboards or real-time risk control, watermark-induced waiting slows down output.
    - The business logic is more concerned with when the data is received (processing time) than with the timestamp inside the data record (event time).
  - Symptom: The data stream is real-time, but because the watermark is configured with a large tolerance window (e.g., a 10-second lateness allowance), output is delayed by 10 seconds.
  - Recommendations:
    - Remove or disable watermarks: Switch to using processing time for calculations, or set the watermark wait threshold to 0.
    - Expected result: The latency metric will drop significantly, approaching the actual processing time. Data is processed upon arrival, without waiting for alignment.
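The Emit Delay formula and the Scenario 1 guidance can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the console's implementation: the function names emit_delay_ms and lag_is_growing, the timestamps, and the lag samples are all hypothetical, and it assumes you collect Kafka consumer lag samples yourself.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence


def emit_delay_ms(event_time: datetime, now: datetime) -> int:
    """Delay = current system time - event time carried in the record."""
    return (now - event_time) // timedelta(milliseconds=1)


def lag_is_growing(lag_samples: Sequence[int]) -> bool:
    """True only if every consumer lag sample is larger than the previous one."""
    return len(lag_samples) >= 2 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )


# Hypothetical backfill: the record was produced exactly one day ago.
now = datetime(2024, 1, 2, 12, 0, 0, tzinfo=timezone.utc)
event_time = now - timedelta(days=1)
delay = emit_delay_ms(event_time, now)  # 86_400_000 ms, because the data is old

# Lag that hovers around zero suggests the job keeps up despite the high delay.
needs_attention = lag_is_growing([5, 0, 3, 0])
print(delay, needs_attention)
```

In this sketch the delay is a full day even though the job is healthy, which is why consumer lag is the more useful signal during a backfill.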
Metric characteristics
Metrics only reflect the current state of a component and are insufficient to determine an issue's root cause. For a comprehensive diagnosis, always use the Flink UI's backpressure monitor and other tools.
1. Operator backpressure
- Symptom: Downstream operators cannot process data fast enough, so the source reduces its emission rate.
- How to identify: Use the Flink UI's backpressure monitor to identify this issue.
- Metric characteristics:
  - sourceIdleTime periodically increases.
  - currentFetchEventTimeLag and currentEmitEventTimeLag continuously increase.
  - Extreme case: If an operator is completely stuck, sourceIdleTime will increase continuously.
2. Source performance bottleneck
- Symptom: The source reads at maximum speed but cannot meet data processing demands.
- How to identify: No backpressure is detected in the job.
- Metric characteristics:
  - sourceIdleTime remains at a very low value (indicating the source is operating at full capacity).
  - currentFetchEventTimeLag and currentEmitEventTimeLag are similar and remain high.
3. Data skew or empty partitions
- Symptom: Data distribution is uneven across upstream Kafka partitions, or some partitions are empty.
- How to identify: Compare metrics across different source subtasks.
- Metric characteristics:
  - The sourceIdleTime for a specific source subtask is significantly higher than for others, indicating that this parallel instance is idle.
4. Data latency
- Symptom: The overall job latency is high. You must determine if the bottleneck is the source or an external system.
- How to identify: Analyze the idle time, the difference between lag metrics, and the backlog size in combination.
- Metric characteristics:
  - High sourceIdleTime: This indicates the source is idle, which usually means the data production rate from the external system is low, not that Flink is processing slowly.
  - Lag difference analysis: Compare the difference between currentEmitEventTimeLag and currentFetchEventTimeLag. This difference represents the time data spends within the source operator:
    - Small difference (close metric values): This indicates insufficient fetch capacity. The bottleneck is typically network I/O bandwidth or insufficient source parallelism.
    - Large difference: This indicates insufficient processing capacity. The bottleneck is typically inefficient data parsing or backpressure from downstream operators.
  - pendingRecords (if supported by the connector): This metric directly reflects the external backlog. A higher value indicates a more severe data backlog in the external system.
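The data latency checks above can be expressed as a small decision helper. The sketch below is only a rough triage aid: the SourceMetrics class, the diagnose_latency function, and both threshold defaults are hypothetical values chosen for illustration, not numbers from the product.

```python
from dataclasses import dataclass


@dataclass
class SourceMetrics:
    source_idle_time_ms: float  # sourceIdleTime
    fetch_lag_ms: float         # currentFetchEventTimeLag
    emit_lag_ms: float          # currentEmitEventTimeLag


def diagnose_latency(
    m: SourceMetrics,
    idle_threshold_ms: float = 60_000,  # assumed value, tune per job
    gap_threshold_ms: float = 5_000,    # assumed value, tune per job
) -> str:
    # A mostly idle source points at a slow producer, not at Flink.
    if m.source_idle_time_ms >= idle_threshold_ms:
        return "upstream_idle"
    # The gap approximates the time data spends inside the source operator.
    gap_ms = m.emit_lag_ms - m.fetch_lag_ms
    if gap_ms <= gap_threshold_ms:
        return "fetch_bound"       # network I/O or source parallelism
    return "processing_bound"      # parsing cost or downstream backpressure


print(diagnose_latency(SourceMetrics(10, 2_000, 95_000)))
```

Treat the returned label as a hint only. As noted at the start of this section, metrics alone are insufficient to determine a root cause, so confirm the result with the Flink UI's backpressure monitor.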
Overview
| Metric | Description | Details | Unit | Supported connectors |
| --- | --- | --- | --- | --- |
| Number of restarts | The number of times the deployment restarted after an error. | This metric excludes restarts caused by a JobManager (JM) failover. Use this metric to monitor the deployment's availability and status. | Count | N/A |
| Current emit event time lag | The data processing latency. | A high value indicates latency in data fetching or processing. | milliseconds (ms) | |
| Current fetch event time lag | The data fetching latency from the upstream system. | A high value indicates data fetching latency. Check your network I/O and upstream systems. Comparing this metric with | milliseconds (ms) | |
| numRecordsIn | The total number of records received by all operators. | If the | Count | All built-in connectors. |
| numRecordsOut | The total number of records emitted. | If the | Count | All built-in connectors. |
| numRecordsInOfSource | The number of records ingested by the source operator. | Use this metric to monitor data input from the upstream source. | Count | |
| numRecordsOutOfSink | The total number of records emitted by the sink operator. | Use this metric to monitor data output to the downstream sink. | Count | |
| numRecordsInPerSecond | The number of records ingested per second across the entire data stream. | Use this metric to monitor the processing speed of the entire data stream. For example, you can use | records/second | All built-in connectors. |
| numRecordsOutPerSecond | The number of records emitted per second across the entire data stream. | Use this metric to measure the output speed of the entire data stream. For example, you can use | records/second | All connectors. |
| numRecordsInOfSourcePerSecond (IN RPS) | The number of records ingested per second by each source. | Use this metric to measure the record generation rate of each source. For example, in a data stream with multiple sources, you can use this metric to understand the ingestion rate of each source and tune the data stream for better performance. This metric is also useful for monitoring and alerting. A value of 0 indicates that the upstream system has stopped producing data or that consumption is blocked, which prevents output. Verify that the upstream source is still producing data. | records/second | |
| numRecordsOutOfSinkPerSecond (OUT RPS) | The number of records emitted per second by each sink. | Use this metric to measure the output rate of each sink. For example, in a data stream with multiple sinks, you can use this metric to understand the output speed of each sink and tune the data stream for better performance. This metric is useful for monitoring and alerting. A value of 0 suggests a possible error in the deployment's code logic that is filtering out all data. Review the code logic. | records/second | |
| pendingRecords | The number of records in the external system that the source operator has not yet pulled. | This metric shows the number of records in the external system that the source operator has not yet pulled. | Count | |
| sourceIdleTime | The duration for which the source operator is idle. | This metric indicates whether the source is idle. A high value suggests that the data production rate in the external system is low. | milliseconds (ms) | |
| busyTimePerSecond | The amount of time a Task is busy each second. | The number of milliseconds per second that a Task thread is busy processing data. The value ranges from 0 to 1,000. A higher value indicates the Task is under a heavier load. Use this metric to identify performance bottlenecks, assess resource utilization, and guide auto-tuning. | milliseconds (ms) | N/A |
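Because busyTimePerSecond ranges from 0 to 1,000 milliseconds per second, it converts directly into a load ratio. The following sketch shows that conversion and picks the most loaded task from a set of samples. The helper names and the task names in the sample data are hypothetical.

```python
from typing import Dict, Tuple


def busy_ratio(busy_ms_per_second: float) -> float:
    """Convert busyTimePerSecond (0 to 1,000 ms) into a 0.0 to 1.0 load ratio."""
    if not 0 <= busy_ms_per_second <= 1000:
        raise ValueError("busyTimePerSecond must be between 0 and 1000")
    return busy_ms_per_second / 1000.0


def busiest_task(samples: Dict[str, float]) -> Tuple[str, float]:
    """Return the task with the highest busyTimePerSecond and its load ratio."""
    name = max(samples, key=lambda task: samples[task])
    return name, busy_ratio(samples[name])


# Hypothetical per-task samples, in milliseconds per second.
samples = {"source": 120.0, "window-agg": 950.0, "sink": 300.0}
print(busiest_task(samples))
```

A task whose ratio stays close to 1.0 is a likely bottleneck candidate and a reasonable first place to look when tuning parallelism.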
Checkpoints
| Metric | Description | Details | Unit |
| --- | --- | --- | --- |
| Number of checkpoints | The total number of checkpoints. | Provides an overview of checkpoint status to help you configure alerts. | Count |
| lastCheckpointDuration | The duration of the most recent checkpoint. | A long duration or a timeout can be caused by a large state size, temporary network issues, unaligned barriers, or backpressure. | milliseconds (ms) |
| lastCheckpointSize | The size of the most recent checkpoint. | Indicates the size of the last uploaded checkpoint. Use this metric to analyze performance when a bottleneck occurs. | Bytes |
State
Latency state metrics are disabled by default. To use these metrics, set state.backend.latency-track.keyed-state-enabled: true in the additional Flink configurations. Enabling these metrics can impact the runtime performance of your deployment.
| Metric | Description | Details | Unit | Supported version |
| --- | --- | --- | --- | --- |
| State Clear Latency | The maximum latency of a single state clear operation. | Use this metric to monitor the performance of state clear operations. | nanoseconds (ns) | Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.0 or later. |
| ValueState Latency | The maximum latency of a single ValueState access operation. | Use this metric to monitor the performance of ValueState access. | nanoseconds (ns) | |
| AggregatingState Latency | The maximum latency of a single AggregatingState access operation. | Use this metric to monitor the performance of AggregatingState access. | nanoseconds (ns) | |
| ReducingState Latency | The maximum latency of a single ReducingState access operation. | Use this metric to monitor the performance of ReducingState access. | nanoseconds (ns) | |
| MapState Latency | The maximum latency of a single MapState access operation. | Use this metric to monitor the performance of MapState access. | nanoseconds (ns) | |
| ListState Latency | The maximum latency of a single ListState access operation. | Use this metric to monitor the performance of ListState access. | nanoseconds (ns) | |
| SortedMapState Latency | The maximum latency of a single SortedMapState access operation. | Use this metric to monitor the performance of SortedMapState access. | nanoseconds (ns) | |
| State Size | The size of the state data. | Use this metric to: | Bytes | Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.12 or later. |
| State File Size | The size of the state data file. | Use this metric to: | Bytes | Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.13 or later. |
I/O
| Metric | Description | Details | Unit | Supported connectors |
| --- | --- | --- | --- | --- |
| numBytesIn | The total number of input bytes. | Use this metric to monitor input throughput from the source and track deployment traffic. | Bytes | |
| numBytesInPerSecond | The total number of input bytes per second. | Use this metric to monitor the input rate from the source and track deployment traffic. | Bytes/s | |
| numBytesOut | The total number of output bytes. | Use this metric to monitor output throughput to the sink and track deployment traffic. | Bytes | |
| numBytesOutPerSecond | The total number of output bytes per second. | Use this metric to monitor the output rate to the sink and track deployment traffic. | Bytes/s | |
| Task numRecords I/O | The total number of records received and emitted by each subtask. | Use this metric to identify potential I/O bottlenecks. | Records | |
| Task numRecords I/O PerSecond | The total number of records received and emitted by each subtask per second. | Use this metric to identify I/O bottlenecks and assess their severity based on the processing rate. | Records/s | |
| currentSendTime | The time each subtask takes to send the latest record to the sink. | A high value for this metric indicates that the subtask output is too slow. | Milliseconds (ms) | |
Watermark
| Metric | Description | Usage | Unit | Supported connector |
| --- | --- | --- | --- | --- |
| Task InputWatermark | The time each task receives the latest watermark. | Use this metric to monitor the progress of data arriving at a TaskManager. | N/A | Not connector-specific. |
| watermarkLag | The difference between wall-clock time and the watermark's event time. | Use this metric to determine the processing latency at the subtask level. | ms | |
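The watermarkLag definition can be made concrete with a short calculation. The sketch below assumes a simplified bounded-out-of-orderness watermark (newest event time minus a fixed tolerance). Both function names and the timestamp are hypothetical, and a real watermark strategy may differ in detail.

```python
def bounded_watermark_ms(max_event_time_ms: int, tolerance_ms: int) -> int:
    """Simplified watermark: newest event time seen so far minus the tolerance."""
    return max_event_time_ms - tolerance_ms


def watermark_lag_ms(wall_clock_ms: int, watermark_ms: int) -> int:
    """watermarkLag = wall-clock time - the watermark's event time."""
    return wall_clock_ms - watermark_ms


now_ms = 1_700_000_000_000  # hypothetical wall-clock time in epoch milliseconds
newest_event_ms = now_ms    # assume events arrive in real time

# A 10-second tolerance keeps the watermark 10 seconds behind the newest event.
lag_with_tolerance = watermark_lag_ms(
    now_ms, bounded_watermark_ms(newest_event_ms, 10_000)
)
# With a tolerance of 0, the watermark follows the newest event directly.
lag_without_tolerance = watermark_lag_ms(
    now_ms, bounded_watermark_ms(newest_event_ms, 0)
)
print(lag_with_tolerance, lag_without_tolerance)
```

Under these assumptions the lag equals the configured tolerance even when data arrives in real time, which matches the 10-second output delay described for Scenario 2 in the notes.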
CPU
| Metric | Description | Details | Unit |
| --- | --- | --- | --- |
| JobManager CPU usage | CPU utilization of a JobManager. | This metric shows the percentage of CPU time slices used by Flink. A value of 100% means one CPU core is fully utilized, and 400% means four cores are fully utilized. If this value consistently exceeds 100%, the JobManager is CPU-bound. High load with low CPU utilization may indicate an excessive number of processes in an uninterruptible sleep state due to frequent read and write operations. Note: This metric is available only for Realtime Compute for Apache Flink VVR 6.0.6 and later. | N/A |
| TaskManager CPU usage | CPU utilization of a TaskManager. | This metric shows the percentage of CPU time slices used by Flink. A value of 100% means one CPU core is fully utilized, and 400% means four cores are fully utilized. If this value consistently exceeds 100%, the TaskManager is CPU-bound. High load with low CPU utilization may indicate an excessive number of processes in an uninterruptible sleep state due to frequent read and write operations. | N/A |
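The CPU usage percentages map to cores by dividing by 100, and the "consistently exceeds 100%" rule of thumb can be checked over a window of samples. The sketch below follows the table's wording. The function names, the default threshold parameter, and the sample windows are hypothetical.

```python
from typing import Sequence


def cores_in_use(cpu_usage_percent: float) -> float:
    """100% corresponds to one fully utilized core, 400% to four."""
    return cpu_usage_percent / 100.0


def is_cpu_bound(
    samples_percent: Sequence[float], threshold_percent: float = 100.0
) -> bool:
    """True only if every sample in the window exceeds the threshold."""
    return len(samples_percent) > 0 and all(
        sample > threshold_percent for sample in samples_percent
    )


# Hypothetical sample windows of CPU usage percentages.
sustained = [130.0, 150.0, 180.0]
spiky = [130.0, 80.0, 150.0]
print(cores_in_use(400.0), is_cpu_bound(sustained), is_cpu_bound(spiky))
```

Requiring every sample to exceed the threshold is one possible reading of "consistently". A percentile or moving average over a longer window would be a reasonable alternative.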
Memory
| Metric | Description | Details | Unit |
| --- | --- | --- | --- |
| JM heap memory | The heap memory of the JobManager. | Tracks changes in JobManager heap memory. | Bytes |
| JM non-heap memory | The non-heap memory of the JobManager. | Tracks changes in JobManager non-heap memory. | Bytes |
| TM heap memory | The heap memory of the TaskManager. | Tracks changes in TaskManager heap memory. | Bytes |
| TM non-heap memory | The non-heap memory of the TaskManager. | Tracks changes in TaskManager non-heap memory. | Bytes |
| TM Mem (RSS) | The Resident Set Size (RSS) of the TaskManager process, as reported by the operating system. | Monitors the total physical memory usage of the TaskManager process. | Bytes |
JVM
| Metric | Description | Details | Unit |
| --- | --- | --- | --- |
| JM Threads | The number of JobManager threads. | Too many JobManager threads can consume excessive memory, reducing job stability. | Count |
| TM Threads | The number of TaskManager threads. | Too many TaskManager threads can consume excessive memory, reducing job stability. | Count |
| JM GC Count | The number of garbage collection (GC) events for the JobManager. | Frequent garbage collection events can consume excessive memory and degrade job performance. Use this metric to diagnose job-level failures. | Count |
| JM GC Time | The duration of each garbage collection event for the JobManager. | Long garbage collection pauses can consume excessive memory and degrade job performance. Use this metric to diagnose job-level failures. | Milliseconds (ms) |
| TM GC Count | The number of garbage collection events for the TaskManager. | Frequent garbage collection events can consume excessive memory and degrade job performance. Use this metric to diagnose task-level failures. | Count |
| TM GC Time | The duration of each garbage collection event for the TaskManager. | Long garbage collection pauses can consume excessive memory and degrade job performance. Use this metric to diagnose task-level failures. | Milliseconds (ms) |
| JM ClassLoader | The total number of classes loaded or unloaded by the JobManager's JVM since startup. | A high volume of class loading and unloading in the JobManager's JVM can consume excessive memory and degrade job performance. | N/A |
| TM ClassLoader | The total number of classes loaded or unloaded by the TaskManager's JVM since startup. | A high volume of class loading and unloading in the TaskManager's JVM can consume excessive memory and degrade job performance. | N/A |
MySQL connector
| Metric | Description | Unit | Scenario | Supported version |
| --- | --- | --- | --- | --- |
| isSnapshotting | Indicates whether the job is in the snapshot phase (value = 1). | N/A | Check if the job is in the snapshot phase. | Realtime Compute for Apache Flink versions 8.0.9 and later. |
| isBinlogReading | Indicates whether the job is in the incremental phase (value = 1). | N/A | Check if the job is in the incremental phase. | |
| Number of remaining tables | The number of tables waiting to be processed in the snapshot phase. | Count | Check the number of unprocessed tables. | |
| Number of snapshotted tables | The number of processed tables in the snapshot phase. | Count | Check the number of processed tables. | |
| Number of remaining SnapshotSplits | The number of splits waiting to be processed in the snapshot phase. | Count | Check the number of unprocessed splits. | |
| Number of processed SnapshotSplits | The number of processed splits in the snapshot phase. | Count | Check the number of processed splits. | |
| currentFetchEventTimeLag | The latency between when an event is created in the database and when it is read by the connector. | ms | Check the latency of reading binlogs from the database. | |
| currentReadTimestampMs | The timestamp of the most recent data record that is read. | ms | Check the timestamp of the most recently read data record. | |
| numRecordsIn | The total number of data records read. | Count | Check the total number of data records read. | |
| numSnapshotRecords | The number of data records processed in the snapshot phase. | Count | Check the number of data records processed in the snapshot phase. | |
| numRecordsInPerTable | The number of data records read from each table. | Count | Check the number of data records read from each table. | |
| numSnapshotRecordsPerTable | The number of data records processed for each table during the snapshot phase. | Count | Check the number of data records processed for each table in the snapshot phase. | |
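The split counters and phase flags in the MySQL connector table can be combined into a rough progress estimate. The sketch below is an illustration only: the function names are hypothetical, and it assumes the total number of splits equals processed plus remaining at the moment of sampling, which may change while tables are still being split.

```python
from typing import Optional


def snapshot_progress(
    processed_splits: int, remaining_splits: int
) -> Optional[float]:
    """Estimated fraction of snapshot splits done, or None if no splits are known."""
    total = processed_splits + remaining_splits
    if total == 0:
        return None
    return processed_splits / total


def current_phase(is_snapshotting: int, is_binlog_reading: int) -> str:
    """Map the isSnapshotting and isBinlogReading flags (value = 1) to a phase name."""
    if is_snapshotting == 1:
        return "snapshot"
    if is_binlog_reading == 1:
        return "incremental"
    return "unknown"


# Hypothetical sample: 30 splits processed, 10 still waiting.
print(current_phase(1, 0), snapshot_progress(30, 10))
```

The estimate is most meaningful while current_phase returns "snapshot". After the job moves to the incremental phase, currentFetchEventTimeLag is the more relevant metric.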
Kafka connector
| Metric | Description | Unit | Scenario | Supported version |
| --- | --- | --- | --- | --- |
| commitsSucceeded | The total number of successful offset commits. | Count | Verifies that offset commits succeed. | Realtime Compute for Apache Flink VVR 8.0.9 or later. |
| commitsFailed | The total number of failed offset commits. | Count | Identifies issues with offset commits. | |
| Fetch Rate | The average number of fetch requests per second. | Count/s | Use to monitor the data fetch rate and identify potential latency issues. | |
| Fetch Latency Avg | The average latency of fetch operations. | Milliseconds | A high value may indicate a network bottleneck or a slow Kafka broker. | |
| Fetch Size Avg | The average number of bytes per fetch request. | Bytes | Use to analyze data fetch throughput and efficiency. | |
| Avg Records In Per-Request | The average number of records per fetch request. | Count | Use to analyze the efficiency of record batching within fetch requests. | |
| currentSendTime | The event-time timestamp of the last record processed by the connector. | N/A | Use to monitor consumption progress. | |
| batchSizeAvg | The average number of bytes per batch. | Bytes | Use to analyze data write latency and throughput. | |
| requestLatencyAvg | The average latency of data write requests. | Milliseconds | Use to assess data write performance. | |
| requestsInFlight | The number of data write requests currently in progress. | N/A | A high value can indicate a bottleneck in the sink system. | |
| recordsPerRequestAvg | The average number of records in each data write request. | Count | Use to evaluate batching efficiency and data write throughput. | |
| recordSizeAvg | The average record size in bytes. | Bytes | Use to analyze data write throughput and efficiency. | |
Paimon connector
| Metric | Description | Unit | Scenario | Supported version |
| --- | --- | --- | --- | --- |
| Number of Writers | The number of active | Count | A high number of writers can degrade write performance and increase memory consumption. If this value is high, check whether your | Realtime Compute for Apache Flink VVR 8.0.9 or later. |
| Max Compaction Thread Busy | The maximum busy ratio of | Ratio | This metric reflects the | |
| Average Compaction Thread Busy | The average busy ratio of | Ratio | This metric reflects the average | |
| Max Number of Level 0 Files | The maximum number of level-0 files. | Count | For a | |
| Average Number of Level 0 Files | The average number of level-0 files. | Count | For a | |
| Last Commit Duration | The duration of the last commit. | Milliseconds | If the duration is excessively long, check whether data is being written to too many | |
| Number of Partitions Last Committed | The number of | Count | A high number of partitions in a single commit can degrade write performance and increase memory consumption. Check whether your | |
| Number of Buckets Last Committed | The number of | Count | A high number of buckets in a single commit can degrade write performance and increase memory consumption. Check whether your | |
| Used Write Buffer | The amount of write buffer memory in use. | Bytes | This buffer consumes | |
| Total Write Buffer | The total allocated write buffer memory. | Bytes | This buffer consumes | |
Data ingestion
| Metric | Description | Unit | Scenario | Supported version |
| --- | --- | --- | --- | --- |
| isSnapshotting | Indicates if the job is in the snapshot phase. A value of 1 means the job is in this phase. | N/A | Determines the current processing phase of the job. | Realtime Compute for Apache Flink VVR 8.0.9 or later. |
| isBinlogReading | Indicates if the job is in the incremental phase. A value of 1 means the job is in this phase. | N/A | Determines the current processing phase of the job. | |
| Number of remaining tables | The number of tables waiting to be processed in the snapshot phase. | Tables | Monitors the queue of tables for snapshot processing. | |
| Number of snapshotted tables | The number of tables processed in the snapshot phase. | Tables | Monitors the count of completed table snapshots. | |
| Number of remaining SnapshotSplits | The number of splits waiting to be processed in the snapshot phase. | Splits | Monitors the queue of data splits for snapshot processing. | |
| Number of processed SnapshotSplits | The number of splits processed in the snapshot phase. | Splits | Monitors the count of completed data splits from the snapshot phase. | |
| currentFetchEventTimeLag | The latency between when an event is created in the database and when it is read by the connector. | ms | Measures the data ingestion latency from the database's binary log. | |
| currentReadTimestampMs | The timestamp of the most recently read data record. | ms | Identifies the point-in-time of the latest ingested record. | |
| numRecordsIn | The total number of data records read. | Records | Tracks the total number of data records read by the source. | |
| numRecordsInPerTable | The number of data records read from each table. | Records | Tracks the total number of data records read from each table. | |
| numSnapshotRecords | The number of data records processed during the snapshot phase. | Records | Monitors the total records processed during the snapshot phase. | |
| numSnapshotRecordsPerTable | The number of data records processed for each table during the snapshot phase. | Records | Monitors the per-table record count processed during the snapshot phase. | |