Metrics - Realtime Compute for Apache Flink - Alibaba Cloud Documentation Center

This topic describes the metrics for fully managed Flink.

Notes

Data discrepancies between CloudMonitor and the Flink console

Differences in displayed dimensions
The Flink console uses PromQL queries to display only the maximum latency. In real-time computing scenarios, average latency can mask serious issues like data skew or single-partition blockages. Therefore, only the maximum latency provides valuable operational insights.
Value discrepancies
CloudMonitor uses a pre-aggregation mechanism to compute metrics. The "maximum" value in CloudMonitor may differ slightly from the real-time value in the Flink console due to differences in aggregation windows, sampling timestamps, or calculation logic. For troubleshooting, use the Flink console data as the source of truth.

Data latency and watermark configuration

Latency calculation logic
The current monitoring metric Emit Delay is calculated based on event time, using the following formula:

Delay = Current System Time - Logical time field in the data record (e.g., PriceData.time)

This means the metric reflects data freshness, not the system's processing speed. This metric is high when the source data is old or when the system pauses output to align watermarks.
Recommendations

Scenario 1: Your business logic relies on watermarks for correctness, but the source data is old
- Typical situations:
  - Upstream data delivery is inherently delayed (e.g., slow event reporting).
  - You are running a backfill to process data from a previous day.
  - The business logic requires watermarks to handle out-of-order events, so they cannot be disabled.
- Phenomenon: Monitoring alerts show high latency, but the Kafka consumer group has no lag (lag ≈ 0) and the CPU load is low.
- Recommendations:
  1. Ignore this latency metric: In this case, a high delay is expected because it reflects the age of the data. It does not indicate a system failure.
  2. Switch to a different metric: Monitor the Kafka consumer lag instead. If the consumer lag does not continuously increase, the system has sufficient processing capacity and requires no intervention.
Scenario 2: You require low latency and can tolerate minor out-of-order events or data loss
- Typical situations:
  - For applications like large-screen dashboards or real-time risk control, watermark-induced waiting slows down output.
  - The business logic is more concerned with when the data is received (processing time) than with the timestamp inside the data record (event time).
- Phenomenon: The data stream is real-time, but because the watermark is configured with a large tolerance window (e.g., a 10-second lateness allowance), output is delayed by 10 seconds.
- Recommendations:
  1. Remove or disable watermarks: Switch to using processing time for calculations, or set the watermark wait threshold to 0.
  2. Expected result: The latency metric will drop significantly, approaching the actual processing time. Data is processed upon arrival, without waiting for alignment.

Metric characteristics

Metrics only reflect the current state of a component and are insufficient to determine an issue's root cause. For a comprehensive diagnosis, always use the Flink UI's backpressure monitor and other tools.

1. Operator backpressure

Symptom: Downstream operators cannot process data fast enough, so the source reduces its emission rate.

How to identify: Use the Flink UI's backpressure monitor to identify this issue.
Metric characteristics:
- sourceIdleTime periodically increases.
- currentFetchEventTimeLag and currentEmitEventTimeLag continuously increase.
- Extreme case: If an operator is completely stuck, sourceIdleTime will increase continuously.

2. Source performance bottleneck

Symptom: The source reads at maximum speed but cannot meet data processing demands.

How to identify: No backpressure is detected in the job.
Metric characteristics:
- sourceIdleTime remains at a very low value (indicating the source is operating at full capacity).
- currentFetchEventTimeLag and currentEmitEventTimeLag are similar and remain high.

3. Data skew or empty partitions

Symptom: Data distribution is uneven across upstream Kafka partitions, or some partitions are empty.

How to identify: Compare metrics across different source subtasks.
Metric characteristics:
- The sourceIdleTime for a specific source subtask is significantly higher than for others, indicating that this parallel instance is idle.

4. Data latency

Symptom: The overall job latency is high. You must determine if the bottleneck is the source or an external system.

How to identify: Analyze the idle time, the difference between lag metrics, and the backlog size in combination.
Metric characteristics:
- High sourceIdleTime:
  This indicates the source is idle, which usually means the data production rate from the external system is low, not that Flink is processing slowly.
- Lag difference analysis:
  Compare the difference between currentEmitEventTimeLag and currentFetchEventTimeLag. This difference represents the time data spends within the source operator:
  - Small difference (close metric values): This indicates insufficient fetch capacity. The bottleneck is typically network I/O bandwidth or insufficient source parallelism.
  - Large difference: This indicates insufficient processing capacity. The bottleneck is typically inefficient data parsing or backpressure from downstream operators.
- pendingRecords (if supported by the connector):
  This metric directly reflects the external backlog. A higher value indicates a more severe data backlog in the external system.

Overview

Metric	Description	Details	Unit	Supported connectors
Number of restarts	The number of times the deployment restarted after an error.	The number of times the deployment restarted due to an error. This metric excludes restarts caused by a JobManager (JM) failover. Use this metric to monitor the deployment's availability and status.	Count	N/A
Current emit event time lag	The data processing latency.	A high value indicates latency in data fetching or processing.	milliseconds (ms)	Kafka ApsaraMQ for RocketMQ Simple Log Service DataHub Postgres Change Data Capture (CDC) Hologres (Binlog Source)
Current fetch event time lag	The data fetching latency from the upstream system.	A high value indicates data fetching latency. Check your network I/O and upstream systems. Comparing this metric with `currentEmitEventTimeLag` helps you analyze the source's processing capacity. The difference between the two represents the time data spends in the source operator. If the two lags are very close, it suggests that the source has insufficient capacity to pull data from the external system, likely due to network I/O or parallelism limitations. A large gap between the two lags suggests that the deployment's processing capacity is insufficient, causing data to build up in the source operator. On the deployment details page, go to the Status Overview tab and use the BackPressure page to locate the problematic vertex. Then, go to the Thread Dump page to analyze the stack and identify the bottleneck.	milliseconds (ms)	Kafka ApsaraMQ for RocketMQ Simple Log Service DataHub Postgres Change Data Capture (CDC) Hologres (Binlog Source)
numRecordsIn	The total number of records received by all operators.	If the `numRecordsIn` value for a specific operator does not increase for a long period, it may indicate a problem with the upstream data flow. Check the upstream source and operators.	Count	All built-in connectors.
numRecordsOut	The total number of records emitted.	If the `numRecordsOut` value for a specific operator does not increase for a long period, it may indicate an error in the deployment's code logic that is causing records to be dropped. Review the code logic.	Count	All built-in connectors.
numRecordsInOfSource	The number of records ingested by the source operator.	Use this metric to monitor data input from the upstream source.	Count	Kafka MaxCompute Incremental MaxCompute ApsaraMQ for RocketMQ Simple Log Service DataHub Elasticsearch Hologres
numRecordsOutOfSink	The total number of records emitted by the sink operator.	Use this metric to monitor data output to the downstream sink.	Count	Kafka Simple Log Service DataHub Hologres ApsaraDB for HBase Tablestore ApsaraDB for Redis
numRecordsInPerSecond	The number of records ingested per second across the entire data stream.	Use this metric to monitor the processing speed of the entire data stream. For example, you can use `numRecordsInPerSecond` to observe whether the overall processing speed meets expected levels and how performance varies with different input loads.	records/second	All built-in connectors.
numRecordsOutPerSecond	The number of records emitted per second across the entire data stream.	Use this metric to measure the output speed of the entire data stream. For example, you can use `numRecordsOutPerSecond` to observe whether the overall output speed meets your expectations and how performance changes under different output loads.	records/second	All connectors.
numRecordsInOfSourcePerSecond (IN RPS)	The number of records ingested per second by each source.	Use this metric to measure the record generation rate of each source. For example, in a data stream with multiple sources, you can use this metric to understand the ingestion rate of each source and tune the data stream for better performance. This metric is also useful for monitoring and alerting. A value of 0 indicates that the upstream system has stopped producing data or that consumption is blocked, which prevents output. Verify that the upstream source is still producing data.	records/second	Kafka MaxCompute Incremental MaxCompute ApsaraMQ for RocketMQ Simple Log Service DataHub Elasticsearch Hologres
numRecordsOutOfSinkPerSecond (OUT RPS)	The number of records emitted per second by each sink.	Use this metric to measure the output rate of each sink. For example, in a data stream with multiple sinks, you can use this metric to understand the output speed of each sink and tune the data stream for better performance. This metric is useful for monitoring and alerting. A value of 0 suggests a possible error in the deployment's code logic that is filtering out all data. Review the code logic.	records/second	Kafka MaxCompute Incremental MaxCompute Simple Log Service DataHub Hologres ApsaraDB for HBase Tablestore ApsaraDB for Redis
pendingRecords	The number of records in the external system that the source operator has not yet pulled.	This metric shows the number of records in the external system that the source operator has not yet pulled.	Count	Kafka Elasticsearch
sourceIdleTime	The duration for which the source operator is idle.	This metric indicates whether the source is idle. A high value suggests that the data production rate in the external system is low.	milliseconds (ms)	Kafka ApsaraMQ for RocketMQ Postgres Change Data Capture (CDC) Hologres (Binlog Source)
busyTimePerSecond	The amount of time a Task is busy each second.	The number of milliseconds per second that a Task thread is busy processing data. The value ranges from 0 to 1,000. A higher value indicates the Task is under a heavier load. Use this metric to identify performance bottlenecks, assess resource utilization, and guide auto-tuning.	milliseconds (ms)	N/A

Checkpoints

Metric	Description	Details	Unit
Number of checkpoints	The total number of checkpoints.	Provides an overview of checkpoint status to help you configure alerts.	Count
lastCheckpointDuration	The duration of the most recent checkpoint.	A long duration or a timeout can be caused by a large state size, temporary network issues, unaligned barriers, or backpressure.	milliseconds (ms)
lastCheckpointSize	The size of the most recent checkpoint.	Indicates the size of the last uploaded checkpoint. Use this metric to analyze performance when a bottleneck occurs.	Bytes

State

Note

Latency state metrics are disabled by default. To use these metrics, set state.backend.latency-track.keyed-state-enabled: true in the additional Flink configurations. Enabling these metrics can impact the runtime performance of your deployment.

Metric	Description	Description	Unit	Supported version
State Clear Latency	The maximum latency of a single state clear operation.	Use this metric to monitor the performance of state clear operations.	nanoseconds (ns)	Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.0 or later.
ValueState Latency	The maximum latency of a single ValueState access operation.	Use this metric to monitor the performance of ValueState access.	nanoseconds (ns)
AggregatingState Latency	The maximum latency of a single AggregatingState access operation.	Use this metric to monitor the performance of AggregatingState access.	nanoseconds (ns)
ReducingState Latency	The maximum latency of a single ReducingState access operation.	Use this metric to monitor the performance of ReducingState access.	nanoseconds (ns)
MapState Latency	The maximum latency of a single MapState access operation.	Use this metric to monitor the performance of MapState access.	nanoseconds (ns)
ListState Latency	The maximum latency of a single ListState access operation.	Use this metric to monitor the performance of ListState access.	nanoseconds (ns)
SortedMapState Latency	The maximum latency of a single SortedMapState access operation.	Use this metric to monitor the performance of SortedMapState access.	nanoseconds (ns)
State Size	The size of the state data.	Use this metric to: Identify current or potential state bottlenecks on nodes. Verify that the time to live (TTL) configuration is working as expected.	Bytes	Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.12 or later.
State File Size	The size of the state data file.	Use this metric to: Monitor the local disk space that state data consumes and take action if usage is high. Determine if excessively large state data is causing insufficient local disk space.	Bytes	Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.13 or later.

I/O

Metric	Description	Details	Unit	Supported connectors
numBytesIn	The total number of input bytes.	Use this metric to monitor input throughput from the source and track deployment traffic.	Bytes	Kafka MaxCompute Incremental MaxCompute ApsaraMQ for RocketMQ
numBytesInPerSecond	The total number of input bytes per second.	Use this metric to monitor the input rate from the source and track deployment traffic.	Bytes/s	Kafka MaxCompute Incremental MaxCompute ApsaraMQ for RocketMQ
numBytesOut	The total number of output bytes.	Use this metric to monitor output throughput to the sink and track deployment traffic.	Bytes	Kafka ApsaraMQ for RocketMQ DataHub ApsaraDB for HBase
numBytesOutPerSecond	The total number of output bytes per second.	Use this metric to monitor the output rate to the sink and track deployment traffic.	Bytes/s	Kafka ApsaraMQ for RocketMQ DataHub ApsaraDB for HBase
Task numRecords I/O	The total number of records received and emitted by each subtask.	Use this metric to identify potential I/O bottlenecks.	Records	Kafka MaxCompute Incremental MaxCompute Simple Log Service DataHub Elasticsearch Hologres ApsaraDB for HBase Tablestore ApsaraDB for Redis
Task numRecords I/O PerSecond	The total number of records received and emitted by each subtask per second.	Use this metric to identify I/O bottlenecks and assess their severity based on the processing rate.	Records/s	Kafka MaxCompute Incremental MaxCompute Simple Log Service DataHub Elasticsearch Hologres ApsaraDB for HBase Tablestore ApsaraDB for Redis
currentSendTime	The time each subtask takes to send the latest record to the sink.	A high value for this metric indicates that the subtask output is too slow.	Milliseconds (ms)	Kafka MaxCompute Incremental MaxCompute ApsaraMQ for RocketMQ Simple Log Service DataHub Hologres Note Supported in JDBC mode and RPC mode. Not supported in BHClient mode. ApsaraDB for HBase Tablestore ApsaraDB for Redis

Watermark

Metric	Description	Usage	Unit	Supported connector
Task InputWatermark	The time each task receives the latest watermark.	Use this metric to monitor the progress of data arriving at a TaskManager.	N/A	Not connector-specific.
watermarkLag	The difference between wall-clock time and the watermark's event time.	Use this metric to determine the processing latency at the subtask level.	ms	Kafka ApsaraMQ for RocketMQ Simple Log Service DataHub Hologres (binlog source)

CPU

Metric	Description	Description	Unit
JobManager CPU usage	CPU utilization of a JobManager.	This metric shows the percentage of CPU time slices used by Flink. A value of 100% means one CPU core is fully utilized, and 400% means four cores are fully utilized. If this value consistently exceeds 100%, the JobManager is CPU-bound. High load with low CPU utilization may indicate an excessive number of processes in an uninterruptible sleep state due to frequent read and write operations. Note This metric is available only for Realtime Compute for Apache Flink VVR 6.0.6 and later.	N/A
TaskManager CPU usage	CPU utilization of a TaskManager.	This metric shows the percentage of CPU time slices used by Flink. A value of 100% means one CPU core is fully utilized, and 400% means four cores are fully utilized. If this value consistently exceeds 100%, the TaskManager is CPU-bound. High load with low CPU utilization may indicate an excessive number of processes in an uninterruptible sleep state due to frequent read and write operations.	N/A

Memory

Metric	Description	Description	Unit
JM heap memory	The heap memory of the JobManager.	Tracks changes in JobManager heap memory.	Bytes
JM non-heap memory	The non-heap memory of the JobManager.	Tracks changes in JobManager non-heap memory.	Bytes
TM heap memory	The heap memory of the TaskManager.	Tracks changes in TaskManager heap memory.	Bytes
TM non-heap memory	The non-heap memory of the TaskManager.	Tracks changes in TaskManager non-heap memory.	Bytes
TM Mem (RSS)	The Resident Set Size (RSS) of the TaskManager process, as reported by the operating system.	Monitors the total physical memory usage of the TaskManager process.	Bytes

JVM

Metric	Description	Details	Unit
JM Threads	The number of JobManager threads.	Too many JobManager threads can consume excessive memory, reducing job stability.	Count
TM Threads	The number of TaskManager threads.	Too many TaskManager threads can consume excessive memory, reducing job stability.	Count
JM GC Count	The number of garbage collection (GC) events for the JobManager.	Frequent garbage collection events can consume excessive memory and degrade job performance. Use this metric to diagnose job-level failures.	Count
JM GC Time	The duration of each garbage collection event for the JobManager.	Long garbage collection pauses can consume excessive memory and degrade job performance. Use this metric to diagnose job-level failures.	Milliseconds (ms)
TM GC Count	The number of garbage collection events for the TaskManager.	Frequent garbage collection events can consume excessive memory and degrade job performance. Use this metric to diagnose task-level failures.	Count
TM GC Time	The duration of each garbage collection event for the TaskManager.	Long garbage collection pauses can consume excessive memory and degrade job performance. Use this metric to diagnose task-level failures.	Milliseconds (ms)
JM ClassLoader	The total number of classes loaded or unloaded by the JobManager's JVM since startup.	A high volume of class loading and unloading in the JobManager's JVM can consume excessive memory and degrade job performance.	N/A
TM ClassLoader	The total number of classes loaded or unloaded by the TaskManager's JVM since startup.	A high volume of class loading and unloading in the TaskManager's JVM can consume excessive memory and degrade job performance.	N/A

MySQL connector

Metric	Description	Unit	Scenario	Supported version
isSnapshotting	Indicates whether the job is in the snapshot phase (value = 1).	N/A	Check if the job is in the snapshot phase.	Realtime Compute for Apache Flink versions 8.0.9 and later.
isBinlogReading	Indicates whether the job is in the incremental phase (value = 1).	N/A	Check if the job is in the incremental phase.
Number of remaining tables	The number of tables waiting to be processed in the snapshot phase.	Count	Check the number of unprocessed tables.
Number of snapshotted tables	The number of processed tables in the snapshot phase.	Count	Check the number of processed tables.
Number of remaining SnapshotSplits	The number of splits waiting to be processed in the snapshot phase.	Count	Check the number of unprocessed splits.
Number of processed SnapshotSplits	The number of processed splits in the snapshot phase.	Count	Check the number of processed splits.
currentFetchEventTimeLag	The latency between when an event is created in the database and when it is read by the connector.	ms	Check the latency of reading binlogs from the database.
currentReadTimestampMs	The timestamp of the most recent data record that is read.	ms	Check the timestamp of the most recently read data record.
numRecordsIn	The total number of data records read.	Count	Check the total number of data records read.
numSnapshotRecords	The number of data records processed in the snapshot phase.	Count	Check the number of data records processed in the snapshot phase.
numRecordsInPerTable	The number of data records read from each table.	Count	Check the number of data records read from each table.
numSnapshotRecordsPerTable	The number of data records processed for each table during the snapshot phase.	Count	Check the number of data records processed for each table in the snapshot phase.

Connector - Kafka

Metric	Description	Unit	Scenario	Supported version
commitsSucceeded	The total number of successful offset commits.	Count	Verifies that offset commits succeed.	Realtime Compute for Apache Flink VVR 8.0.9 or later.
commitsFailed	The total number of failed offset commits.	Count	Identifies issues with offset commits.
Fetch Rate	The average number of fetch requests per second.	Count/s	Use to monitor the data fetch rate and identify potential latency issues.
Fetch Latency Avg	The average latency of fetch operations.	Milliseconds	A high value may indicate a network bottleneck or a slow Kafka broker.
Fetch Size Avg	The average number of bytes per fetch request.	Bytes	Use to analyze data fetch throughput and efficiency.
Avg Records In Per-Request	The average number of records per fetch request.	Count	Use to analyze the efficiency of record batching within fetch requests.
currentSendTime	The event-time timestamp of the last record processed by the connector.	N/A	Use to monitor consumption progress.
batchSizeAvg	The average number of bytes per batch.	Bytes	Use to analyze data write latency and throughput.
requestLatencyAvg	The average latency of data write requests.	Milliseconds	Use to assess data write performance.
requestsInFlight	The number of data write requests currently in progress.	N/A	A high value can indicate a bottleneck in the sink system.
recordsPerRequestAvg	The average number of records in each data write request.	Count	Use to evaluate batching efficiency and data write throughput.
recordSizeAvg	The average record size in bytes.	Bytes	Use to analyze data write throughput and efficiency.

Paimon connector

Metric	Description	Unit	Scenario	Supported version
Number of Writers	The number of active `writer instance`s.	Count	A high number of writers can degrade write performance and increase memory consumption. If this value is high, check whether your `bucket` count and `partition key` settings are appropriate.	Realtime Compute for Apache Flink VVR 8.0.9 or later.
Max Compaction Thread Busy	The maximum busy ratio of `compaction` threads.	Ratio	This metric reflects the `compaction` pressure. A value close to 100% indicates that compaction is a bottleneck, which may slow down data writes.
Average Compaction Thread Busy	The average busy ratio of `compaction` threads.	Ratio	This metric reflects the average `compaction` pressure across all buckets. A high value suggests that overall compaction performance is slow.
Max Number of Level 0 Files	The maximum number of level-0 files.	Count	For a `primary key table`, a high number of level-0 files (small files) indicates that `compaction` is not keeping pace with the write speed.
Average Number of Level 0 Files	The average number of level-0 files.	Count	For a `primary key table`, a high average number of level-0 files (small files) indicates that overall `compaction` is not keeping pace with the write speed.
Last Commit Duration	The duration of the last commit.	Milliseconds	If the duration is excessively long, check whether data is being written to too many `bucket`s simultaneously.
Number of Partitions Last Committed	The number of `partition`s written in the last commit.	Count	A high number of partitions in a single commit can degrade write performance and increase memory consumption. Check whether your `bucket` count or `partition key` settings are appropriate.
Number of Buckets Last Committed	The number of `bucket`s written in the last commit.	Count	A high number of buckets in a single commit can degrade write performance and increase memory consumption. Check whether your `bucket` count or `partition key` settings are appropriate.
Used Write Buffer	The amount of write buffer memory in use.	Bytes	This buffer consumes `Java heap memory` on all `TaskManager`s. A persistently high value can lead to an Out of Memory (OOM) error.
Total Write Buffer	The total allocated write buffer memory.	Bytes	This buffer consumes `Java heap memory` on all `TaskManager`s. Setting this value too high can exhaust available memory and lead to an Out of Memory (OOM) error.

Data ingestion

Metric	Description	Unit	Scenario	Supported version
isSnapshotting	Indicates if the job is in the snapshot phase. A value of 1 means the job is in this phase.	N/A	Determines the current processing phase of the job.	Realtime Compute for Apache Flink VVR 8.0.9 or later.
isBinlogReading	Indicates if the job is in the incremental phase. A value of 1 means the job is in this phase.	N/A	Determines the current processing phase of the job.
Number of remaining tables	The number of tables waiting to be processed in the snapshot phase.	Tables	Monitors the queue of tables for snapshot processing.
Number of snapshotted tables	The number of tables processed in the snapshot phase.	Tables	Monitors the count of completed table snapshots.
Number of remaining SnapshotSplits	The number of splits waiting to be processed in the snapshot phase.	Splits	Monitors the queue of data splits for snapshot processing.
Number of processed SnapshotSplits	The number of splits processed in the snapshot phase.	Splits	Monitors the count of completed data splits from the snapshot phase.
currentFetchEventTimeLag	The latency between when an event is created in the database and when it is read by the connector.	ms	Measures the data ingestion latency from the database's binary log.
currentReadTimestampMs	The timestamp of the most recently read data record.	ms	Identifies the point-in-time of the latest ingested record.
numRecordsIn	The total number of data records read.	Records	Tracks the total number of data records read by the source.
numRecordsInPerTable	The number of data records read from each table.	Records	Tracks the total number of data records read from each table.
numSnapshotRecords	The number of data records processed during the snapshot phase.	Records	Monitors the total records processed during the snapshot phase.
numSnapshotRecordsPerTable	The number of data records processed for each table during the snapshot phase.	Records	Monitors the per-table record count processed during the snapshot phase.