All Products
Search
Document Center

Realtime Compute for Apache Flink:Metrics

Last Updated:Apr 29, 2026

This topic describes the metrics for fully managed Flink.

Notes

Data discrepancies between CloudMonitor and the Flink console

  1. Differences in displayed dimensions
    The Flink console uses PromQL queries to display only the maximum latency. In real-time computing scenarios, average latency can mask serious issues like data skew or single-partition blockages. Therefore, only the maximum latency provides valuable operational insights.

  2. Value discrepancies
    CloudMonitor uses a pre-aggregation mechanism to compute metrics. The "maximum" value in CloudMonitor may differ slightly from the real-time value in the Flink console due to differences in aggregation windows, sampling timestamps, or calculation logic. For troubleshooting, use the Flink console data as the source of truth.

Data latency and watermark configuration

  1. Latency calculation logic
    The current monitoring metric Emit Delay is calculated based on event time, using the following formula:

    Delay = Current System Time - Logical time field in the data record (e.g., PriceData.time)

    This means the metric reflects data freshness, not the system's processing speed. This metric is high when the source data is old or when the system pauses output to align watermarks.

  2. Recommendations

    Scenario 1: Your business logic relies on watermarks for correctness, but the source data is old

    • Typical situations:

      • Upstream data delivery is inherently delayed (e.g., slow event reporting).

      • You are running a backfill to process data from a previous day.

      • The business logic requires watermarks to handle out-of-order events, so they cannot be disabled.

    • Phenomenon: Monitoring alerts show high latency, but the Kafka consumer group has no lag (lag ≈ 0) and the CPU load is low.

    • Recommendations:

      1. Ignore this latency metric: In this case, a high delay is expected because it reflects the age of the data. It does not indicate a system failure.

      2. Switch to a different metric: Monitor the Kafka consumer lag instead. If the consumer lag does not continuously increase, the system has sufficient processing capacity and requires no intervention.

    Scenario 2: You require low latency and can tolerate minor out-of-order events or data loss

    • Typical situations:

      • For applications like large-screen dashboards or real-time risk control, watermark-induced waiting slows down output.

      • The business logic is more concerned with when the data is received (processing time) than with the timestamp inside the data record (event time).

    • Phenomenon: The data stream is real-time, but because the watermark is configured with a large tolerance window (e.g., a 10-second lateness allowance), output is delayed by 10 seconds.

    • Recommendations:

      1. Remove or disable watermarks: Switch to using processing time for calculations, or set the watermark wait threshold to 0.

      2. Expected result: The latency metric will drop significantly, approaching the actual processing time. Data is processed upon arrival, without waiting for alignment.

Metric characteristics

Metrics only reflect the current state of a component and are insufficient to determine an issue's root cause. For a comprehensive diagnosis, always use the Flink UI's backpressure monitor and other tools.

1. Operator backpressure

Symptom: Downstream operators cannot process data fast enough, so the source reduces its emission rate.

  • How to identify: Use the Flink UI's backpressure monitor to identify this issue.

  • Metric characteristics:

    • sourceIdleTime periodically increases.

    • currentFetchEventTimeLag and currentEmitEventTimeLag continuously increase.

    • Extreme case: If an operator is completely stuck, sourceIdleTime will increase continuously.

2. Source performance bottleneck

Symptom: The source reads at maximum speed but cannot meet data processing demands.

  • How to identify: No backpressure is detected in the job.

  • Metric characteristics:

    • sourceIdleTime remains at a very low value (indicating the source is operating at full capacity).

    • currentFetchEventTimeLag and currentEmitEventTimeLag are similar and remain high.

3. Data skew or empty partitions

Symptom: Data distribution is uneven across upstream Kafka partitions, or some partitions are empty.

  • How to identify: Compare metrics across different source subtasks.

  • Metric characteristics:

    • The sourceIdleTime for a specific source subtask is significantly higher than for others, indicating that this parallel instance is idle.

4. Data latency

Symptom: The overall job latency is high. You must determine if the bottleneck is the source or an external system.

  • How to identify: Analyze the idle time, the difference between lag metrics, and the backlog size in combination.

  • Metric characteristics:

    • High sourceIdleTime:
      This indicates the source is idle, which usually means the data production rate from the external system is low, not that Flink is processing slowly.

    • Lag difference analysis:
      Compare the difference between currentEmitEventTimeLag and currentFetchEventTimeLag. This difference represents the time data spends within the source operator:

      • Small difference (close metric values): This indicates insufficient fetch capacity. The bottleneck is typically network I/O bandwidth or insufficient source parallelism.

      • Large difference: This indicates insufficient processing capacity. The bottleneck is typically inefficient data parsing or backpressure from downstream operators.

    • pendingRecords (if supported by the connector):
      This metric directly reflects the external backlog. A higher value indicates a more severe data backlog in the external system.

Overview

Metric

Description

Details

Unit

Supported connectors

Number of restarts

The number of times the deployment restarted after an error.

The number of times the deployment restarted due to an error. This metric excludes restarts caused by a JobManager (JM) failover. Use this metric to monitor the deployment's availability and status.

Count

N/A

Current emit event time lag

The data processing latency.

A high value indicates latency in data fetching or processing.

milliseconds (ms)

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Postgres Change Data Capture (CDC)

  • Hologres (Binlog Source)

Current fetch event time lag

The data fetching latency from the upstream system.

A high value indicates data fetching latency. Check your network I/O and upstream systems. Comparing this metric with currentEmitEventTimeLag helps you analyze the source's processing capacity. The difference between the two represents the time data spends in the source operator.

  • If the two lags are very close, it suggests that the source has insufficient capacity to pull data from the external system, likely due to network I/O or parallelism limitations.

  • A large gap between the two lags suggests that the deployment's processing capacity is insufficient, causing data to build up in the source operator. On the deployment details page, go to the Status Overview tab and use the BackPressure page to locate the problematic vertex. Then, go to the Thread Dump page to analyze the stack and identify the bottleneck.

milliseconds (ms)

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Postgres Change Data Capture (CDC)

  • Hologres (Binlog Source)

numRecordsIn

The total number of records received by all operators.

If the numRecordsIn value for a specific operator does not increase for a long period, it may indicate a problem with the upstream data flow. Check the upstream source and operators.

Count

All built-in connectors.

numRecordsOut

The total number of records emitted.

If the numRecordsOut value for a specific operator does not increase for a long period, it may indicate an error in the deployment's code logic that is causing records to be dropped. Review the code logic.

Count

All built-in connectors.

numRecordsInOfSource

The number of records ingested by the source operator.

Use this metric to monitor data input from the upstream source.

Count

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Elasticsearch

  • Hologres

numRecordsOutOfSink

The total number of records emitted by the sink operator.

Use this metric to monitor data output to the downstream sink.

Count

  • Kafka

  • Simple Log Service

  • DataHub

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

numRecordsInPerSecond

The number of records ingested per second across the entire data stream.

Use this metric to monitor the processing speed of the entire data stream. For example, you can use numRecordsInPerSecond to observe whether the overall processing speed meets expected levels and how performance varies with different input loads.

records/second

All built-in connectors.

numRecordsOutPerSecond

The number of records emitted per second across the entire data stream.

Use this metric to measure the output speed of the entire data stream.

For example, you can use numRecordsOutPerSecond to observe whether the overall output speed meets your expectations and how performance changes under different output loads.

records/second

All connectors.

numRecordsInOfSourcePerSecond (IN RPS)

The number of records ingested per second by each source.

Use this metric to measure the record generation rate of each source. For example, in a data stream with multiple sources, you can use this metric to understand the ingestion rate of each source and tune the data stream for better performance. This metric is also useful for monitoring and alerting.

A value of 0 indicates that the upstream system has stopped producing data or that consumption is blocked, which prevents output. Verify that the upstream source is still producing data.

records/second

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Elasticsearch

  • Hologres

numRecordsOutOfSinkPerSecond (OUT RPS)

The number of records emitted per second by each sink.

Use this metric to measure the output rate of each sink. For example, in a data stream with multiple sinks, you can use this metric to understand the output speed of each sink and tune the data stream for better performance.

This metric is useful for monitoring and alerting. A value of 0 suggests a possible error in the deployment's code logic that is filtering out all data. Review the code logic.

records/second

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

pendingRecords

The number of records in the external system that the source operator has not yet pulled.

This metric shows the number of records in the external system that the source operator has not yet pulled.

Count

  • Kafka

  • Elasticsearch

sourceIdleTime

The duration for which the source operator is idle.

This metric indicates whether the source is idle. A high value suggests that the data production rate in the external system is low.

milliseconds (ms)

  • Kafka

  • ApsaraMQ for RocketMQ

  • Postgres Change Data Capture (CDC)

  • Hologres (Binlog Source)

busyTimePerSecond

The amount of time a Task is busy each second.

The number of milliseconds per second that a Task thread is busy processing data. The value ranges from 0 to 1,000. A higher value indicates the Task is under a heavier load. Use this metric to identify performance bottlenecks, assess resource utilization, and guide auto-tuning.

milliseconds (ms)

N/A

Checkpoints

Metric

Description

Details

Unit

Number of checkpoints

The total number of checkpoints.

Provides an overview of checkpoint status to help you configure alerts.

Count

lastCheckpointDuration

The duration of the most recent checkpoint.

A long duration or a timeout can be caused by a large state size, temporary network issues, unaligned barriers, or backpressure.

milliseconds (ms)

lastCheckpointSize

The size of the most recent checkpoint.

Indicates the size of the last uploaded checkpoint. Use this metric to analyze performance when a bottleneck occurs.

Bytes

State

Note

Latency state metrics are disabled by default. To use these metrics, set state.backend.latency-track.keyed-state-enabled: true in the additional Flink configurations. Enabling these metrics can impact the runtime performance of your deployment.

Metric

Description

Description

Unit

Supported version

State Clear Latency

The maximum latency of a single state clear operation.

Use this metric to monitor the performance of state clear operations.

nanoseconds (ns)

Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.0 or later.

ValueState Latency

The maximum latency of a single ValueState access operation.

Use this metric to monitor the performance of ValueState access.

nanoseconds (ns)

AggregatingState Latency

The maximum latency of a single AggregatingState access operation.

Use this metric to monitor the performance of AggregatingState access.

nanoseconds (ns)

ReducingState Latency

The maximum latency of a single ReducingState access operation.

Use this metric to monitor the performance of ReducingState access.

nanoseconds (ns)

MapState Latency

The maximum latency of a single MapState access operation.

Use this metric to monitor the performance of MapState access.

nanoseconds (ns)

ListState Latency

The maximum latency of a single ListState access operation.

Use this metric to monitor the performance of ListState access.

nanoseconds (ns)

SortedMapState Latency

The maximum latency of a single SortedMapState access operation.

Use this metric to monitor the performance of SortedMapState access.

nanoseconds (ns)

State Size

The size of the state data.

Use this metric to:

  • Identify current or potential state bottlenecks on nodes.

  • Verify that the time to live (TTL) configuration is working as expected.

Bytes

Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.12 or later.

State File Size

The size of the state data file.

Use this metric to:

  • Monitor the local disk space that state data consumes and take action if usage is high.

  • Determine if excessively large state data is causing insufficient local disk space.

Bytes

Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 4.0.13 or later.

I/O

Metric

Description

Details

Unit

Supported connectors

numBytesIn

The total number of input bytes.

Use this metric to monitor input throughput from the source and track deployment traffic.

Bytes

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

numBytesInPerSecond

The total number of input bytes per second.

Use this metric to monitor the input rate from the source and track deployment traffic.

Bytes/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

numBytesOut

The total number of output bytes.

Use this metric to monitor output throughput to the sink and track deployment traffic.

Bytes

  • Kafka

  • ApsaraMQ for RocketMQ

  • DataHub

  • ApsaraDB for HBase

numBytesOutPerSecond

The total number of output bytes per second.

Use this metric to monitor the output rate to the sink and track deployment traffic.

Bytes/s

  • Kafka

  • ApsaraMQ for RocketMQ

  • DataHub

  • ApsaraDB for HBase

Task numRecords I/O

The total number of records received and emitted by each subtask.

Use this metric to identify potential I/O bottlenecks.

Records

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • Elasticsearch

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

Task numRecords I/O PerSecond

The total number of records received and emitted by each subtask per second.

Use this metric to identify I/O bottlenecks and assess their severity based on the processing rate.

Records/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • Elasticsearch

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

currentSendTime

The time each subtask takes to send the latest record to the sink.

A high value for this metric indicates that the subtask output is too slow.

Milliseconds (ms)

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Hologres

    Note

    Supported in JDBC mode and RPC mode. Not supported in BHClient mode.

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

Watermark

Metric

Description

Usage

Unit

Supported connector

Task InputWatermark

The time each task receives the latest watermark.

Use this metric to monitor the progress of data arriving at a TaskManager.

N/A

Not connector-specific.

watermarkLag

The difference between wall-clock time and the watermark's event time.

Use this metric to determine the processing latency at the subtask level.

ms

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Hologres (binlog source)

CPU

Metric

Description

Description

Unit

JobManager CPU usage

CPU utilization of a JobManager.

This metric shows the percentage of CPU time slices used by Flink. A value of 100% means one CPU core is fully utilized, and 400% means four cores are fully utilized. If this value consistently exceeds 100%, the JobManager is CPU-bound. High load with low CPU utilization may indicate an excessive number of processes in an uninterruptible sleep state due to frequent read and write operations.

Note

This metric is available only for Realtime Compute for Apache Flink VVR 6.0.6 and later.

N/A

TaskManager CPU usage

CPU utilization of a TaskManager.

This metric shows the percentage of CPU time slices used by Flink. A value of 100% means one CPU core is fully utilized, and 400% means four cores are fully utilized. If this value consistently exceeds 100%, the TaskManager is CPU-bound. High load with low CPU utilization may indicate an excessive number of processes in an uninterruptible sleep state due to frequent read and write operations.

N/A

Memory

Metric

Description

Description

Unit

JM heap memory

The heap memory of the JobManager.

Tracks changes in JobManager heap memory.

Bytes

JM non-heap memory

The non-heap memory of the JobManager.

Tracks changes in JobManager non-heap memory.

Bytes

TM heap memory

The heap memory of the TaskManager.

Tracks changes in TaskManager heap memory.

Bytes

TM non-heap memory

The non-heap memory of the TaskManager.

Tracks changes in TaskManager non-heap memory.

Bytes

TM Mem (RSS)

The Resident Set Size (RSS) of the TaskManager process, as reported by the operating system.

Monitors the total physical memory usage of the TaskManager process.

Bytes

JVM

Metric

Description

Details

Unit

JM Threads

The number of JobManager threads.

Too many JobManager threads can consume excessive memory, reducing job stability.

Count

TM Threads

The number of TaskManager threads.

Too many TaskManager threads can consume excessive memory, reducing job stability.

Count

JM GC Count

The number of garbage collection (GC) events for the JobManager.

Frequent garbage collection events can consume excessive memory and degrade job performance. Use this metric to diagnose job-level failures.

Count

JM GC Time

The duration of each garbage collection event for the JobManager.

Long garbage collection pauses can consume excessive memory and degrade job performance. Use this metric to diagnose job-level failures.

Milliseconds (ms)

TM GC Count

The number of garbage collection events for the TaskManager.

Frequent garbage collection events can consume excessive memory and degrade job performance. Use this metric to diagnose task-level failures.

Count

TM GC Time

The duration of each garbage collection event for the TaskManager.

Long garbage collection pauses can consume excessive memory and degrade job performance. Use this metric to diagnose task-level failures.

Milliseconds (ms)

JM ClassLoader

The total number of classes loaded or unloaded by the JobManager's JVM since startup.

A high volume of class loading and unloading in the JobManager's JVM can consume excessive memory and degrade job performance.

N/A

TM ClassLoader

The total number of classes loaded or unloaded by the TaskManager's JVM since startup.

A high volume of class loading and unloading in the TaskManager's JVM can consume excessive memory and degrade job performance.

N/A

MySQL connector

Metric

Description

Unit

Scenario

Supported version

isSnapshotting

Indicates whether the job is in the snapshot phase (value = 1).

N/A

Check if the job is in the snapshot phase.

Realtime Compute for Apache Flink versions 8.0.9 and later.

isBinlogReading

Indicates whether the job is in the incremental phase (value = 1).

N/A

Check if the job is in the incremental phase.

Number of remaining tables

The number of tables waiting to be processed in the snapshot phase.

Count

Check the number of unprocessed tables.

Number of snapshotted tables

The number of processed tables in the snapshot phase.

Count

Check the number of processed tables.

Number of remaining SnapshotSplits

The number of splits waiting to be processed in the snapshot phase.

Count

Check the number of unprocessed splits.

Number of processed SnapshotSplits

The number of processed splits in the snapshot phase.

Count

Check the number of processed splits.

currentFetchEventTimeLag

The latency between when an event is created in the database and when it is read by the connector.

ms

Check the latency of reading binlogs from the database.

currentReadTimestampMs

The timestamp of the most recent data record that is read.

ms

Check the timestamp of the most recently read data record.

numRecordsIn

The total number of data records read.

Count

Check the total number of data records read.

numSnapshotRecords

The number of data records processed in the snapshot phase.

Count

Check the number of data records processed in the snapshot phase.

numRecordsInPerTable

The number of data records read from each table.

Count

Check the number of data records read from each table.

numSnapshotRecordsPerTable

The number of data records processed for each table during the snapshot phase.

Count

Check the number of data records processed for each table in the snapshot phase.

Connector - Kafka

Metric

Description

Unit

Scenario

Supported version

commitsSucceeded

The total number of successful offset commits.

Count

Verifies that offset commits succeed.

Realtime Compute for Apache Flink VVR 8.0.9 or later.

commitsFailed

The total number of failed offset commits.

Count

Identifies issues with offset commits.

Fetch Rate

The average number of fetch requests per second.

Count/s

Use to monitor the data fetch rate and identify potential latency issues.

Fetch Latency Avg

The average latency of fetch operations.

Milliseconds

A high value may indicate a network bottleneck or a slow Kafka broker.

Fetch Size Avg

The average number of bytes per fetch request.

Bytes

Use to analyze data fetch throughput and efficiency.

Avg Records In Per-Request

The average number of records per fetch request.

Count

Use to analyze the efficiency of record batching within fetch requests.

currentSendTime

The event-time timestamp of the last record processed by the connector.

N/A

Use to monitor consumption progress.

batchSizeAvg

The average number of bytes per batch.

Bytes

Use to analyze data write latency and throughput.

requestLatencyAvg

The average latency of data write requests.

Milliseconds

Use to assess data write performance.

requestsInFlight

The number of data write requests currently in progress.

N/A

A high value can indicate a bottleneck in the sink system.

recordsPerRequestAvg

The average number of records in each data write request.

Count

Use to evaluate batching efficiency and data write throughput.

recordSizeAvg

The average record size in bytes.

Bytes

Use to analyze data write throughput and efficiency.

Paimon connector

Metric

Description

Unit

Scenario

Supported version

Number of Writers

The number of active writer instances.

Count

A high number of writers can degrade write performance and increase memory consumption. If this value is high, check whether your bucket count and partition key settings are appropriate.

Realtime Compute for Apache Flink VVR 8.0.9 or later.

Max Compaction Thread Busy

The maximum busy ratio of compaction threads.

Ratio

This metric reflects the compaction pressure. A value close to 100% indicates that compaction is a bottleneck, which may slow down data writes.

Average Compaction Thread Busy

The average busy ratio of compaction threads.

Ratio

This metric reflects the average compaction pressure across all buckets. A high value suggests that overall compaction performance is slow.

Max Number of Level 0 Files

The maximum number of level-0 files.

Count

For a primary key table, a high number of level-0 files (small files) indicates that compaction is not keeping pace with the write speed.

Average Number of Level 0 Files

The average number of level-0 files.

Count

For a primary key table, a high average number of level-0 files (small files) indicates that overall compaction is not keeping pace with the write speed.

Last Commit Duration

The duration of the last commit.

Milliseconds

If the duration is excessively long, check whether data is being written to too many buckets simultaneously.

Number of Partitions Last Committed

The number of partitions written in the last commit.

Count

A high number of partitions in a single commit can degrade write performance and increase memory consumption. Check whether your bucket count or partition key settings are appropriate.

Number of Buckets Last Committed

The number of buckets written in the last commit.

Count

A high number of buckets in a single commit can degrade write performance and increase memory consumption. Check whether your bucket count or partition key settings are appropriate.

Used Write Buffer

The amount of write buffer memory in use.

Bytes

This buffer consumes Java heap memory on all TaskManagers. A persistently high value can lead to an Out of Memory (OOM) error.

Total Write Buffer

The total allocated write buffer memory.

Bytes

This buffer consumes Java heap memory on all TaskManagers. Setting this value too high can exhaust available memory and lead to an Out of Memory (OOM) error.

Data ingestion

Metric

Description

Unit

Scenario

Supported version

isSnapshotting

Indicates if the job is in the snapshot phase. A value of 1 means the job is in this phase.

N/A

Determines the current processing phase of the job.

Realtime Compute for Apache Flink VVR 8.0.9 or later.

isBinlogReading

Indicates if the job is in the incremental phase. A value of 1 means the job is in this phase.

N/A

Determines the current processing phase of the job.

Number of remaining tables

The number of tables waiting to be processed in the snapshot phase.

Tables

Monitors the queue of tables for snapshot processing.

Number of snapshotted tables

The number of tables processed in the snapshot phase.

Tables

Monitors the count of completed table snapshots.

Number of remaining SnapshotSplits

The number of splits waiting to be processed in the snapshot phase.

Splits

Monitors the queue of data splits for snapshot processing.

Number of processed SnapshotSplits

The number of splits processed in the snapshot phase.

Splits

Monitors the count of completed data splits from the snapshot phase.

currentFetchEventTimeLag

The latency between when an event is created in the database and when it is read by the connector.

ms

Measures the data ingestion latency from the database's binary log.

currentReadTimestampMs

The timestamp of the most recently read data record.

ms

Identifies the point-in-time of the latest ingested record.

numRecordsIn

The total number of data records read.

Records

Tracks the total number of data records read by the source.

numRecordsInPerTable

The number of data records read from each table.

Records

Tracks the total number of data records read from each table.

numSnapshotRecords

The number of data records processed during the snapshot phase.

Records

Monitors the total records processed during the snapshot phase.

numSnapshotRecordsPerTable

The number of data records processed for each table during the snapshot phase.

Records

Monitors the per-table record count processed during the snapshot phase.