
Realtime Compute for Apache Flink:Metrics

Last Updated: Nov 09, 2023

This topic describes the metrics that are supported by fully managed Flink.

Precautions

  • Metrics that are reported by a source reflect only the current situation of the source and cannot be used on their own to identify the root cause of an issue. Use auxiliary metrics or tools to identify the root cause. The following list describes how the metrics behave in specific scenarios.

    • Scenario: An operator in a deployment has backpressure.

      Description: The backpressure detection feature provided by the Flink UI, rather than metrics, is the most direct way to detect backpressure. If backpressure exists, the rate at which the source sends data to downstream operators decreases. In this case, the value of the sourceIdleTime metric may increase periodically, and the values of the currentFetchEventTimeLag and currentEmitEventTimeLag metrics may continuously increase. In extreme cases, such as when an operator is stuck, the value of the sourceIdleTime metric continuously increases.

    • Scenario: The source has a performance bottleneck.

      Description: If only the throughput of the source is insufficient, no backpressure can be detected in your deployment. The sourceIdleTime metric remains small because the source keeps running. The values of the currentFetchEventTimeLag and currentEmitEventTimeLag metrics are large and close to each other.

    • Scenario: Data skew occurs upstream, or a partition is empty.

      Description: If data skew occurs or a partition is empty, one or more sources are idle. In this case, the value of the sourceIdleTime metric for those sources is large.

  • If the latency of a deployment is high, you can use the following metrics to analyze the data processing capabilities of fully managed Flink and the amount of data retained in the external system. A sketch that applies this analysis follows the list.

    • sourceIdleTime: Indicates whether the source is idle. If the value of this metric is large, the rate at which your data is generated in the external system is low.

    • currentFetchEventTimeLag and currentEmitEventTimeLag: Indicate the latency at which fully managed Flink processes data. You can analyze the data processing capabilities of a source based on the difference between the values of the two metrics, which indicates the duration for which the data is retained in the source.

      • If the difference between the values of the two metrics is small, the source does not efficiently pull data from the external system because of issues related to network I/O or parallelism.

      • If the difference between the values of the two metrics is large, the source does not efficiently process data because of issues related to data parsing, parallelism, or backpressure.

    • pendingRecords: Indicates the amount of data that is retained in the external system.
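The analysis above can also be expressed as a small piece of code. The following Java sketch is only an illustration: the metric values are assumed to have been collected elsewhere (for example, from the metrics page of the console), and the class name, method, and thresholds are hypothetical.

```java
/**
 * A minimal sketch of the latency analysis described above.
 * The metric values and thresholds are hypothetical; in practice you would
 * read them from your monitoring system or a metric reporter.
 */
public class SourceLatencyAnalysis {

    // Hypothetical threshold: lags or idle times above this value (ms) count as "large".
    private static final long LARGE_MS = 60_000L;

    static String analyze(long sourceIdleTimeMs,
                          long currentFetchEventTimeLagMs,
                          long currentEmitEventTimeLagMs) {
        // Time the data is retained inside the source before it is emitted downstream.
        long retainedInSourceMs = currentEmitEventTimeLagMs - currentFetchEventTimeLagMs;

        if (sourceIdleTimeMs > LARGE_MS) {
            return "Source is mostly idle: data is generated slowly in the external system, "
                    + "or data skew / empty partitions leave this source with nothing to read.";
        }
        if (currentEmitEventTimeLagMs > LARGE_MS && retainedInSourceMs < LARGE_MS / 10) {
            return "Both lags are large and close to each other: the source cannot pull data "
                    + "from the external system fast enough (network I/O or parallelism).";
        }
        if (retainedInSourceMs >= LARGE_MS / 10) {
            return "Data is retained inside the source: check data parsing, parallelism, "
                    + "or backpressure from downstream operators.";
        }
        return "No obvious source-side latency issue.";
    }

    public static void main(String[] args) {
        // Example values in milliseconds (hypothetical).
        System.out.println(analyze(0L, 120_000L, 125_000L));
    }
}
```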

Overview

Each entry below lists the metric name, followed by its definition, description, unit, and the connectors that support it.

Num of Restarts

The number of times that a deployment is restarted when a deployment failover occurs.

This metric counts the number of times that the deployment is restarted because of failovers. Restarts that occur during a JobManager failover are not included. This metric is used to check the availability and status of the deployment.

N/A

N/A

currentEmitEventTimeLag

The processing latency.

If the value of this metric is large, a data latency may occur in the deployment when the system pulls data or processes data.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Postgres Change Data Capture (CDC)

  • Hologres (Hologres binlog source table)

currentFetchEventTimeLag

The transmission latency.

If the value of this metric is large, a data latency may occur in the deployment when the system pulls data. In this case, you must check the information about the network I/O or the source. You can analyze the data processing capabilities of a source based on the difference between the values of this metric and the currentEmitEventTimeLag metric. The difference between the values of the two metrics indicates the duration for which the data is retained in the source:

  • If the difference between the values of the two metrics is small, the source does not efficiently pull data from the external system because of issues related to network I/O or parallelism.

  • If the difference between the values of the two metrics is large, the processing capability of the deployment is insufficient, which causes data to be retained in the source. To locate the bottleneck, perform the following steps: On the Deployments page, click the name of the desired deployment. On the deployment details page, click the Status tab, and then click the value in the Name column. On the page that appears, click the BackPressure tab to locate the vertex that causes the issue. Then, click Dump in the Thread Dump column to open the Thread Dump tab and analyze the stack that has a performance bottleneck.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Postgres CDC

  • Hologres (Hologres binlog source table)

numRecordsIn

The total number of input data records of all operators.

If the value of this metric does not increase for a long period of time for an operator, data may be missing from the source. In this case, you must check the data of the source.

N/A

All built-in connectors

numRecordsOut

The total number of output data records.

If the value of this metric does not increase for a long period of time for an operator, an error may occur in the code logic of the deployment and data is missing. In this case, you must check the code logic of the deployment.

N/A

All built-in connectors

numRecordsInOfSource

The total number of data records that flow into the source operator.

This metric is used to check the number of data records that flow into the source.

N/A

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSink

The total number of output data records in the sink.

This metric is used to check the number of data records that are exported by the sink.

N/A

  • Kafka

  • Simple Log Service

  • DataHub

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

numRecordsInPerSecond

The number of input data records per second for all data streams.

This metric is used to monitor the overall data stream processing speed. For example, the value of the numRecordsInPerSecond metric helps you determine whether the overall data stream processing speed meets the expected requirements and how the deployment performance changes under different input data loads.

Count/s

All built-in connectors

numRecordsOutPerSecond

The number of output data records per second for all data streams.

This metric is used to monitor the number of output data records per second for all data streams and therefore the overall data stream output speed.

For example, the value of the numRecordsOutPerSecond metric helps you determine whether the overall data stream output speed meets the expected requirements and how the deployment performance changes under different output data loads.

Count/s

All connectors

numRecordsInOfSourcePerSecond (IN RPS)

The number of input data records per second in a source.

This metric is used to monitor the number of input data records per second in a source and monitor the speed at which data records are generated in the source. For example, the number of data records that can be generated varies based on the type of the data source. The value of the numRecordsInOfSourcePerSecond metric helps you determine the speed at which data records are generated in a source and adjust data streams to improve performance. This metric is also used for monitoring and alerting.

If the value of this metric is 0, data may be missing from the source. In this case, you must check whether data output is blocked because the data of the source is not consumed.

Count/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

numRecordsOutOfSinkPerSecond (OUT RPS)

The number of output data records per second in a sink.

This metric is used to monitor the number of output data records per second in a sink and monitor the speed at which data records are exported from the sink. For example, the number of data records that can be exported varies based on the type of the sink.

The value of the numRecordsOutOfSinkPerSecond metric helps you determine the speed at which data records are exported from a sink and adjust data streams to improve performance. This metric is also used for monitoring and alerting. If the value of this metric is 0, all data may have been filtered out because of a defect in the code logic of the deployment. In this case, you must check the code logic of the deployment.

Count/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

pendingRecords

The number of data records that are not read by the source.

This metric is used to check the number of data records that are not pulled by the source from the external system.

N/A

  • Kafka

  • ElasticSearch

sourceIdleTime

The duration for which data is not processed in the source.

This metric indicates whether the source is idle. If the value of this metric is large, your data is generated at a low speed in the external system.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Postgres CDC

  • Hologres (Hologres binlog source table)
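For reference, the source metrics above can also be read programmatically. The following Java sketch uses the job-vertex metrics endpoint of the open-source Flink REST API; whether and how that endpoint is reachable in fully managed Flink is not covered here, and the address, job ID, and vertex ID are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Sketch: list the metrics of a source vertex through the open-source Flink REST API.
 * All identifiers below are placeholders; availability of the endpoint in a managed
 * environment is an assumption, not a guarantee.
 */
public class FetchSourceMetrics {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8081";     // JobManager REST address (placeholder)
        String jobId = "<job-id>";                  // placeholder
        String vertexId = "<source-vertex-id>";     // placeholder

        // Without the "get" parameter, the endpoint returns the available metric IDs for the
        // vertex. The IDs are prefixed with the subtask index and operator name, so inspect the
        // list first and then request the IDs that end with currentFetchEventTimeLag,
        // currentEmitEventTimeLag, or sourceIdleTime via "?get=<id1>,<id2>".
        String url = base + "/jobs/" + jobId + "/vertices/" + vertexId + "/metrics";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The response is a JSON array of {"id": "...", "value": "..."} entries.
        System.out.println(response.body());
    }
}
```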

Checkpoint

Each entry below lists the metric name, followed by its definition, description, and unit.

Num of Checkpoints

The number of checkpoints.

This metric is used to obtain the overview of checkpoints and configure alerts for checkpoints.

N/A

lastCheckpointDuration

The duration of the last checkpoint.

If the checkpoint takes a long time to complete or times out, the possible causes include excessively large state data, a temporary network error, barriers that are not aligned in time, or backpressure. A checkpoint configuration sketch follows this section.

Milliseconds

lastCheckpointSize

The size of the last checkpoint.

This metric is used to view the size of the last checkpoint that is uploaded. You can use this value to analyze checkpoint performance when a checkpoint bottleneck occurs.

Byte

lastCheckpointFullSize

Note

Only Realtime Compute for Apache Flink that uses VVR 6.0 or later supports this metric.

The full size of the last checkpoint.

This metric is used to obtain the actual size of the current checkpoint in remote storage.

Byte
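The causes listed for lastCheckpointDuration (large state, barrier alignment, backpressure) are usually addressed through checkpoint configuration. The following sketch shows, under the assumption that you build the job with the open-source DataStream API, how the checkpoint interval, timeout, and unaligned checkpoints could be configured; the values are arbitrary examples, not recommendations.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds (example value).
        env.enableCheckpointing(60_000L);

        CheckpointConfig config = env.getCheckpointConfig();
        // Fail the checkpoint if it does not complete within 10 minutes (example value).
        config.setCheckpointTimeout(600_000L);
        // Unaligned checkpoints can shorten checkpoint duration when barriers are delayed by backpressure.
        config.enableUnalignedCheckpoints();

        // ... define sources, transformations, and sinks here, then:
        // env.execute("checkpoint-tuning-sketch");
    }
}
```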

State

Note

State access latency metrics are not collected by default. To enable them, set state.backend.latency-track.keyed-state-enabled to true in the Additional Configuration section of the Advanced tab on the Draft Editor page in the console of fully managed Flink. After you enable these metrics, the performance of the deployment may be affected while it is running.
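For reference, the following sketch shows a roughly equivalent way to set the same configuration key when a job is built with the open-source DataStream API outside the console; the key name comes from the note above, and the rest is illustrative.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableStateLatencyTracking {
    public static void main(String[] args) {
        // Same key as in the note above; in the console it is added as
        // "state.backend.latency-track.keyed-state-enabled: true" under Additional Configuration.
        Configuration conf = new Configuration();
        conf.setString("state.backend.latency-track.keyed-state-enabled", "true");

        // Pass the configuration to a local execution environment (illustrative only).
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // ... build the job as usual; state access latency metrics are then reported,
        // at the cost of some runtime overhead.
    }
}
```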

Each entry below lists the metric name, followed by its definition, description, unit, and the supported version.

State Clear Latency

The maximum latency in a state data cleanup.

You can view the performance of state data cleanup.

Nanosecond

Only Realtime Compute for Apache Flink that uses VVR 4.0.0 or later supports these metrics.

Value State Latency

The maximum latency in single ValueState access.

You can view the ValueState access performance.

Nanosecond

Aggregating State Latency

The maximum latency in single AggregatingState access.

You can view the AggregatingState access performance.

Nanosecond

Reducing State Latency

The maximum latency in single ReducingState access.

You can view the ReducingState access performance.

Nanosecond

Map State Latency

The maximum latency in single MapState access.

You can view the MapState access performance.

Nanosecond

List State Latency

The maximum latency in single ListState access.

You can view the ListState access performance.

Nanosecond

Sorted Map State Latency

The maximum latency in single SortedMapState access.

You can view the SortedMapState access performance.

Nanosecond

State Size

The size of the state data.

This metric helps you perform the following operations:

  • Identify the nodes in which state data bottlenecks occur, or identify such nodes in advance.

  • Check whether the time to live (TTL) of state data takes effect. A TTL configuration sketch follows this section.

Byte

Only Realtime Compute for Apache Flink that uses VVR 4.0.12 or later supports this metric.

State File Size

The size of the state data file.

This metric helps you perform the following operations:

  • Check the size of the state data file in the local disk. You can take actions in advance if the size is large.

  • Determine whether the state data is excessively large if the local disk space is insufficient.

Byte

Only Realtime Compute for Apache Flink that uses VVR 4.0.13 or later supports this metric.
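The State Size description above mentions checking whether the time to live (TTL) of state data takes effect. As a point of reference, the following sketch shows how TTL is commonly attached to keyed state with the open-source DataStream API; the one-day TTL and the descriptor name are arbitrary examples.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class StateTtlSketch {
    public static void main(String[] args) {
        // Expire state entries one day after they were last written (example value).
        StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.days(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        // Attach the TTL configuration to a state descriptor used inside a keyed function.
        ValueStateDescriptor<Long> counter = new ValueStateDescriptor<>("event-count", Long.class);
        counter.enableTimeToLive(ttlConfig);

        // With TTL in effect, expired entries are eventually cleaned up, which should be
        // visible as a bounded or shrinking State Size metric.
    }
}
```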

CEP

Each entry below lists the metric name, followed by its definition, description, unit, supported connectors, and the supported version.

patternMatchedTimes

The number of times the pattern is matched.

This metric is used to evaluate whether the matching effect meets your expectations.

N/A

All connectors

Only Realtime Compute for Apache Flink that uses VVR 6.0.1 or later supports these metrics.

patternMatchingAvgTime

The average duration for which the pattern is matched.

This metric is used to evaluate whether the matching performance meets your expectations.

Microsecond

numLateRecordsDropped

The number of data records that are discarded because they arrive late.

This metric is used to evaluate how out of order the event data is and whether the watermark configuration is appropriate.

N/A

IO

Each entry below lists the metric name, followed by its definition, description, unit, and the connectors that support it.

numBytesIn

The total number of input bytes.

This metric is used to check the size of the input data records of the source. This can help observe the deployment throughput.

Byte

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

numBytesInPerSecond

The total number of input bytes per second.

This metric is used to check the rate at which data flows into the source. This can help observe the deployment throughput.

Byte/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

numBytesOut

The total number of output bytes.

This metric is used to check the size of the output data records of the source. This can help observe the deployment throughput.

Byte

  • Kafka

  • ApsaraMQ for RocketMQ

  • DataHub

  • ApsaraDB for HBase

numBytesOutPerSecond

The total number of output bytes per second.

This metric is used to check the rate at which data is exported by the source. This can help observe the deployment throughput.

Byte/s

  • Kafka

  • ApsaraMQ for RocketMQ

  • DataHub

  • ApsaraDB for HBase

Task numRecords I/O

The total number of data records that flow into each subtask and data records that are exported by each subtask.

This metric is used to check whether I/O bottlenecks exist in the deployment.

N/A

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

Task numRecords I/O PerSecond

The total number of data records that flow into each subtask and data records that are exported by each subtask per second.

This metric is used to check whether I/O bottlenecks exist in the deployment and determine the severity of the I/O bottlenecks based on the input and output rate of each subtask.

Count/s

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • Simple Log Service

  • DataHub

  • ElasticSearch

  • Hologres

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

currentSendTime

The duration that is required by each subtask to export the last data record to the sink.

If the value of this metric is large, the rate at which data records are exported by each subtask to the sink is excessively slow.

Milliseconds

  • Kafka

  • MaxCompute

  • Incremental MaxCompute

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Hologres

    Note

    When you write data to Hologres in remote procedure call (RPC) mode or by using a Java Database Connectivity (JDBC) driver, the Hologres connector supports this metric. When you write data to Hologres in BHClient mode, the Hologres connector does not support this metric.

  • ApsaraDB for HBase

  • Tablestore

  • ApsaraDB for Redis

Watermark

Each entry below lists the metric name, followed by its definition, description, unit, and the connectors that support it.

Task InputWatermark

The time when each task receives the latest watermark.

This metric is used to check the latency of data receiving by TaskManagers.

N/A

N/A

watermarkLag

The latency of watermarks.

This metric is used to determine the latency of subtasks.

Milliseconds

  • Kafka

  • ApsaraMQ for RocketMQ

  • Simple Log Service

  • DataHub

  • Hologres (Hologres binlog source table)
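The watermarkLag metric measures the latency of watermarks, and the numLateRecordsDropped metric in the CEP section counts records that are discarded because they arrive later than the watermark allows. As a point of reference, the following sketch shows a common watermark configuration in the open-source DataStream API; the five-second out-of-orderness bound and the event type are illustrative only.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class WatermarkConfigSketch {

    /** Minimal event type used only for this example. */
    public static class Event {
        public long timestampMillis;
        public String payload;
    }

    public static WatermarkStrategy<Event> strategy() {
        // Allow events to arrive up to 5 seconds out of order (example value).
        // A bound that is too small inflates numLateRecordsDropped; a bound that is
        // too large inflates watermarkLag and end-to-end latency.
        return WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTimestamp) -> event.timestampMillis);
    }
}
```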

JobManager resources

Each entry below lists the metric name, followed by its definition, description, and unit.

JM CPU Load

The CPU load of the JobManager.

If the value of this metric is greater than 100% for a long period of time, the CPU of the JobManager is busy and the CPU load is high. This may affect system performance and cause issues such as stuttering and slow responses.

Note

Realtime Compute for Apache Flink that uses VVR 6.0.6 or later does not support this metric. You can view the JM CPU Usage metric to monitor the CPU utilization of the JobManager.

N/A

JM CPU Usage

The CPU utilization of the JobManager.

This metric indicates the utilization of CPU time slices that are occupied by fully managed Flink. If the value of this metric is 100%, one CPU core is used. If the value of this metric is 400%, four CPU cores are used. If the value of this metric is greater than 100% for a long period of time, the CPU of the JobManager is busy. If the CPU load is high but the CPU utilization is low, a large number of processes that are in the uninterruptible sleep state may be running because of frequent read and write operations.

Note

Only Realtime Compute for Apache Flink that uses VVR 6.0.6 or later supports this metric.

N/A

JM Heap Memory

The heap memory of the JobManager.

This metric is used to check the change in the heap memory of the JobManager.

Byte

JM nonHeap Memory

The non-heap memory of the JobManager.

This metric is used to check the change in the non-heap memory of the JobManager.

Byte

JM Threads

The number of threads of the JobManager.

A large number of threads in the JobManager occupy excessive memory space, which reduces the stability of the deployment.

N/A

JM GC Count

The number of times that garbage collection (GC) of the JobManager occurs.

If GC of the JobManager occurs a large number of times, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

Times

JM GC Time

The duration for which each GC of the JobManager lasts.

If each GC lasts for a long period of time, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

Milliseconds

JM ClassLoader/ClassUnLoader

The total number of classes that are loaded or unloaded after the Java Virtual Machine (JVM) where the JobManager resides is created.

After the JVM where the JobManager resides is created, if the total number of the classes that are loaded or unloaded is excessively large, excessive memory space is occupied. This affects the deployment performance.

N/A

TaskManager resources

Each entry below lists the metric name, followed by its definition, description, and unit.

TM CPU Load

The CPU load of a TaskManager.

This metric indicates the number of processes that are being run by the CPU plus the number of processes that are waiting for the CPU. In most cases, this metric indicates how busy the CPU is. The value of this metric is related to the number of CPU cores that are used. In fully managed Flink, the CPU load is calculated by using the following formula: CPU load = CPU utilization/Number of CPU cores. For example, if the CPU utilization of a TaskManager is 300% and the TaskManager uses four CPU cores, the CPU load is 0.75. If the value of this metric remains high for a long period of time, CPU processing may be blocked.

Note

Realtime Compute for Apache Flink that uses VVR 6.0.6 or later does not support this metric. You can view the TM CPU Usage metric to monitor the CPU utilization of a TaskManager.

N/A

TM CPU Usage

The CPU utilization of a TaskManager.

This metric indicates the utilization of CPU time slices that are occupied by fully managed Flink. If the value of this metric is 100%, one CPU core is used. If the value of this metric is 400%, four CPU cores are used. If the value of this metric is greater than 100% for a long period of time, the CPU of the TaskManager is busy. If the CPU load is high but the CPU utilization is low, a large number of processes that are in the uninterruptible sleep state may be running because of frequent read and write operations.

N/A

TM Heap Memory

The heap memory of a TaskManager.

This metric is used to check the change in the heap memory of the TaskManager.

Byte

TM nonHeap Memory

The non-heap memory of a TaskManager.

This metric is used to check the change in the non-heap memory of the TaskManager.

Byte

TM Mem (RSS)

The resident set size (RSS) of the entire TaskManager process, as reported by Linux.

This metric is used to check the change of the memory of the process.

Byte

TM Threads

The number of threads of a TaskManager.

A large number of threads in the TaskManager occupy excessive memory space, which reduces the stability of the deployment.

N/A

TM GC Count

The number of times that GC of a TaskManager occurs.

If GC of the TaskManager occurs a large number of times, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

N/A

TM GC Time

The duration for which each GC of a TaskManager lasts.

If each GC lasts for a long period of time, excessive memory space is occupied. This affects the deployment performance. This metric is used to diagnose deployments and handle deployment faults.

Milliseconds

TM ClassLoader/ClassUnLoader

The total number of classes that are loaded or unloaded after the JVM where a TaskManager resides is created.

After the JVM where the TaskManager resides is created, if the total number of the classes that are loaded or unloaded is excessively large, excessive memory space is occupied. This affects the deployment performance.

N/A