Fully managed Flink allows you to view the metrics of a deployment so that you can check whether the deployment processes data as expected. This topic describes how to view the metrics of a deployment, the metrics that are supported by Flink sources and sinks, and the connectors that support metric reporting.

Background information

You can view the metrics of a deployment in the console of fully managed Flink. You can also use one of the following methods to view the metrics:
  • Use the self-managed Prometheus service to view the metrics.
    If network connectivity is established between fully managed Flink and the self-managed Prometheus service, you must add the following configuration to the Additional Configuration section on the Advanced tab of the Draft Editor page in the console of fully managed Flink. A sample Prometheus scrape configuration is provided after this list:
    metrics.reporters: promgatewayappmgr
    metrics.reporter.promgatewayappmgr.groupingKey: 'deploymentName={{deploymentName}};deploymentId={{deploymentId}};jobId={{jobId}}'
    metrics.reporter.promgatewayappmgr.jobName: '{{deploymentName}}'
    metrics.reporter.promgatewayappmgr.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
    metrics.reporter.promgatewayappmgr.host: pushgateway host
    metrics.reporter.promgatewayappmgr.port: pushgateway port
    Note
    • In the configuration, you must replace pushgateway host with the hostname of your Pushgateway and pushgateway port with the port number of your Pushgateway. The system automatically replaces the values of the deploymentName, deploymentId, and jobId parameters.
    • The network connection that is established between fully managed Flink and the self-managed Prometheus service must meet the following requirements:
      • If the self-managed Prometheus service resides in the same virtual private cloud (VPC) as fully managed Flink, make sure that the security group rules of the Prometheus service allow access from the CIDR block of fully managed Flink.
      • If the self-managed Prometheus service resides in a different VPC from fully managed Flink but the self-managed Prometheus service uses a public endpoint, you must configure Internet access for fully managed Flink. For more information, see Reference.
      • If the self-managed Prometheus service resides in a different VPC from fully managed Flink but the self-managed Prometheus service uses only the endpoint of the VPC, you must establish a connection between the VPCs. For more information, see Reference.
  • Call Application Real-Time Monitoring Service (ARMS) API operations to obtain the metrics and integrate the metrics into your platform.

    For more information about ARMS API operations, see List of API operations by feature. For more information about operator-related metrics, see Operator Metrics.
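The configuration in the first method pushes metrics to a Prometheus Pushgateway, and your self-managed Prometheus service then scrapes the Pushgateway. The following snippet is a minimal sketch of the corresponding scrape configuration in prometheus.yml. The job name flink-pushgateway and the target pushgateway-host:9091 are placeholders used for illustration; replace them with the values of your own Pushgateway. The honor_labels setting keeps the deploymentName, deploymentId, and jobId labels that the reporter attaches through the grouping key.
  scrape_configs:
    - job_name: 'flink-pushgateway'          # placeholder job name
      honor_labels: true                     # keep the labels pushed by the Flink reporter
      static_configs:
        - targets: ['pushgateway-host:9091'] # replace with the host and port of your Pushgateway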

Precautions

  • Metrics that are reported by a source reflect only the current state of the source and cannot by themselves identify the root cause of an issue. You must combine them with auxiliary metrics or tools. The following scenarios describe how to interpret the metrics:
    • Scenario: An operator in a deployment has backpressure. The backpressure detection feature provided by the Flink UI, rather than metrics, is the most direct way to detect backpressure. If backpressure exists, the rate at which the source sends data to downstream operators decreases. In this case, the value of the sourceIdleTime metric may periodically increase, and the values of the currentFetchEventTimeLag and currentEmitEventTimeLag metrics may continuously increase. In extreme cases, such as when an operator is stuck, the value of the sourceIdleTime metric continuously increases.
    • Scenario: The source has a performance bottleneck. If only the throughput of the source is insufficient, no backpressure is detected in the deployment. The sourceIdleTime metric remains at a small value because the source keeps running, and the values of the currentFetchEventTimeLag and currentEmitEventTimeLag metrics are large and close to each other.
    • Scenario: Data skew occurs upstream, or a partition is empty. In this case, one or more sources are idle, and the value of the sourceIdleTime metric for these sources is large.
  • If the latency of a deployment is high, you can use the following metrics to analyze the data processing capabilities of fully managed Flink and the amount of data that is retained in the external system. A sample Prometheus alerting rule based on these metrics is provided after this list.
    • sourceIdleTime: Indicates whether the source is idle. If the value of this metric is large, your data is generated in the external system at a low rate.
    • currentFetchEventTimeLag and currentEmitEventTimeLag: Indicate the latency when fully managed Flink processes data. You can analyze the data processing capabilities of a source based on the difference between the values of the two metrics, which indicates the duration for which data is retained inside the source.
      • If the difference is small, the source does not efficiently pull data from the external system because of issues related to network I/O or parallelism.
      • If the difference is large, the source does not efficiently process data because of issues related to data parsing, parallelism, or backpressure.
    • pendingRecords: Indicates the amount of data that is retained in the external system.
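If you report these metrics to a self-managed Prometheus service as described in the Background information section, you can also alert on the duration for which data is retained inside the source, that is, the difference between currentEmitEventTimeLag and currentFetchEventTimeLag. The following rule file is a minimal sketch. The metric names follow the default naming of Flink operator metrics in Prometheus; verify the actual names in your Prometheus service, because they depend on your reporter and scope configuration. The 60000 millisecond threshold is only an example value.
  groups:
    - name: flink-source-latency             # example rule group name
      rules:
        - alert: SourceRetentionTooHigh
          # Time spent inside the source: emit lag minus fetch lag, in milliseconds.
          expr: >
            flink_taskmanager_job_task_operator_currentEmitEventTimeLag
              - flink_taskmanager_job_task_operator_currentFetchEventTimeLag > 60000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Data is retained in the source for more than 60 seconds; check data parsing, parallelism, and backpressure."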

Procedure

  1. Log on to the Realtime Compute for Apache Flink console.
  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  3. In the left-side navigation pane, choose Applications > Deployments.
  4. Click the name of the desired deployment.
  5. Click Metrics.
  6. View the metrics of the deployment.
    For more information about the metrics that are supported by sources and sinks and the connectors that are used to report metrics, see Metrics.

Metrics

  • Overview
    Metric Description Details Unit Supported connector Supported version
    Num of Restarts The number of times that a deployment is restarted. This metric is used to check the availability and status of the deployment. N/A N/A Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 2.0.0 or later supports this metric.
    currentEmitEventTimeLag The processing latency. If the value of this metric is large, a data latency may occur in the deployment when the system pulls data or processes data. Millisecond
    • Kafka
    • Message Queue for Apache RocketMQ
    • Log Service
    • DataHub
    • PostgreSQL CDC
    • Hologres (Hologres binlog source table)
    Only Realtime Compute for Apache Flink that uses VVR 2.1.2 or later supports this metric.
    currentFetchEventTimeLag The transmission latency. If the value of this metric is large, a data latency may occur in the deployment when the system pulls data. In this case, you must check the information about the network I/O or the source. You can analyze the data processing capabilities of a source based on the difference between the values of this metric and the currentEmitEventTimeLag metric. The difference indicates the duration for which the data is retained in the source.
    • If the difference between the values of the two metrics is small, the source does not efficiently pull data from the external system because of issues related to network I/O or parallelism.
    • If the difference between the values of the two metrics is large, the source does not efficiently process data because of issues related to data parsing, parallelism, or backpressure.
    Millisecond
    • Kafka
    • Message Queue for Apache RocketMQ
    • Log Service
    • DataHub
    • PostgreSQL CDC
    • Hologres (Hologres binlog source table)
    numRecordsIn The total number of input data records of all operators. If the value of this metric does not increase for a long period of time for an operator, data may be missing from the source. In this case, you must check the data of the source. N/A All connectors
    numRecordsOut The total number of output data records. If the value of this metric does not increase for a long period of time for an operator, an error may occur in the code logic of the deployment and data is missing. In this case, you must check the code logic of the deployment. N/A All connectors
    numRecordsIn of Source The total number of data records that flow into the source operator. This metric is used to check the number of data records that flow into the source. N/A
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Message Queue for Apache RocketMQ
    • Log Service
    • DataHub
    • Elasticsearch
    • Hologres
    numRecordsOut of Source The total number of data records that are exported by the source. This metric is used to check the number of data records that are exported by the source. N/A
    • Kafka
    • Log Service
    • DataHub
    • Hologres
    • ApsaraDB for HBase
    • Tablestore
    • ApsaraDB for Redis
    numRecordsInPerSecond The number of input data records per second. If the value of this metric is 0 for an operator, data from the source may be missing. In this case, check whether the data of the source is being consumed. If the data of the source is not consumed, no data can be exported to the sink. Count/s All connectors
    numRecordsOutPerSecond The number of output data records per second. If the value of this metric is 0 for an operator, the code logic of the deployment may be invalid and all data is filtered out. In this case, you must check the code logic of the deployment. Count/s All connectors
    numRecordsInofSourcePerSecond The number of input data records per second in the source. This metric is used for monitoring and alerting. If the value of this metric is 0, data may be missing from the source. In this case, you must check the data of the source. Count/s
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Message Queue for Apache RocketMQ
    • Log Service
    • DataHub
    • Elasticsearch
    • Hologres
    numRecordsOutofSourcePerSecond The number of output data records per second in the source. This metric is used for monitoring and alerting. If the value of this metric is 0, the code logic of the deployment may be invalid and all data is filtered out. In this case, you must check the code logic of the deployment. Count/s
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Log Service
    • DataHub
    • Hologres
    • ApsaraDB for HBase
    • Tablestore
    • ApsaraDB for Redis
    pendingRecords The number of data records that are not read by the source. This metric is used to check the number of data records that are not pulled by the source from the external system. N/A
    • Kafka
    • Elasticsearch
    sourceIdleTime The duration for which data is not processed in the source. This metric specifies whether the source is idle. If the value of this metric is large, your data is generated at a low speed in the external system. Millisecond
    • Kafka
    • Message Queue for Apache RocketMQ
    • PostgreSQL CDC
    • Hologres (Hologres binlog source table)
  • Checkpoint
    Note Only Realtime Compute for Apache Flink that uses VVR 2.0.0 or later supports these metrics.
    Metric Description Details Unit
    Num of Checkpoints The number of checkpoints. This metric is used to obtain the overview of checkpoints and configure alerts for checkpoints. N/A
    lastCheckpointDuration The duration of the last checkpoint. If a checkpoint takes a long period of time or times out, the possible causes include the following: the storage space occupied by state data is excessively large, a temporary network error occurs, barriers are not aligned, or backpressure exists. Millisecond
    lastCheckpointSize The size of the last checkpoint. This metric is used to view the size of the last checkpoint that is uploaded. You can analyze the performance of checkpoints when a bottleneck issue occurs for the checkpoints based on the value of this metric. Byte
    lastCheckpointFullSize
    Note Only Realtime Compute for Apache Flink that uses VVR 6.0 or later supports this metric.
    The full size of the last checkpoint. This metric is used to obtain the actual size of the last checkpoint in remote storage. Byte
  • State
    Note The metrics related to state access latency must be explicitly enabled before they are collected. To enable them, set state.backend.latency-track.keyed-state-enabled to true in the Additional Configuration section on the Advanced tab of the Draft Editor page. A sample configuration is provided at the end of this topic. After you enable the metrics related to state access latency, the performance of deployments may be affected.
    Metric Description Details Unit Supported version
    State Clear Latency The maximum latency in a state data cleanup. You can view the performance of state data cleanup. Nanosecond Only Realtime Compute for Apache Flink that uses VVR 4.0.0 or later supports these metrics.
    Value State Latency The maximum latency in single ValueState access. You can view the ValueState access performance. Nanosecond
    Aggregating State Latency The maximum latency in single AggregatingState access. You can view the AggregatingState access performance. Nanosecond
    Reducing State Latency The maximum latency in single ReducingState access. You can view the ReducingState access performance. Nanosecond
    Map State Latency The maximum latency in single MapState access. You can view the MapState access performance. Nanosecond
    List State Latency The maximum latency in single ListState access. You can view the ListState access performance. Nanosecond
    Sorted Map State Latency The maximum latency in single SortedMapState access. You can view the SortedMapState access performance. Nanosecond
    State Size The size of the state data. This metric helps you perform the following operations:
    • Identify the nodes in which state data bottlenecks occur, or identify such nodes in advance.
    • Check whether the time to live (TTL) of state data takes effect.
    Byte Only Realtime Compute for Apache Flink that uses VVR 4.0.12 or later supports this metric.
    State File Size The size of the state data file. This metric helps you perform the following operations:
    • Check the size of the state data file in the local disk. You can take actions in advance if the size is large.
    • Determine whether the state data is excessively large if the local disk space is insufficient.
    Byte Only Realtime Compute for Apache Flink that uses VVR 4.0.13 or later supports this metric.
  • I/O
    Note Only Realtime Compute for Apache Flink that uses VVR 2.1.2 or later supports these metrics.
    Metric Description Details Unit Supported connector
    numBytesIn The total number of input bytes. This metric is used to check the size of the input data records of the source. This can help observe the deployment throughput. Byte
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Message Queue for Apache RocketMQ
    numBytesInPerSecond The total number of input bytes per second. This metric is used to check the rate at which data flows into the source. This can help observe the deployment throughput. Byte/s
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Message Queue for Apache RocketMQ
    numBytesOut The total number of output bytes. This metric is used to check the size of the output data records of the source. This can help observe the deployment throughput. Byte
    • Kafka
    • Message Queue for Apache RocketMQ
    • DataHub
    • ApsaraDB for HBase
    numBytesOutPerSecond The total number of output bytes per second. This metric is used to check the rate at which data is exported by the source. This can help observe the deployment throughput. Byte/s
    • Kafka
    • Message Queue for Apache RocketMQ
    • DataHub
    • ApsaraDB for HBase
    Task numRecords I/O The total number of data records that flow into each subtask and data records that are exported by each subtask. This metric is used to check whether I/O bottlenecks exist in the deployment. N/A
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Log Service
    • DataHub
    • Elasticsearch
    • Hologres
    • ApsaraDB for HBase
    • Tablestore
    • ApsaraDB for Redis
    Task numRecords I/O PerSecond The total number of data records that flow into each subtask and data records that are exported by each subtask per second. This metric is used to check whether I/O bottlenecks exist in the deployment and determine the severity of the I/O bottlenecks based on the input and output rate of each subtask. Count/s
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Log Service
    • DataHub
    • Elasticsearch
    • Hologres
    • ApsaraDB for HBase
    • Tablestore
    • ApsaraDB for Redis
    currentSendTime The duration that is required by each subtask to export the last data record to the sink. If the value of this metric is large, the rate at which each subtask exports data records to the sink is excessively slow. Millisecond
    • Kafka
    • MaxCompute
    • Incremental MaxCompute
    • Message Queue for Apache RocketMQ
    • Log Service
    • DataHub
    • Hologres
      Note When you write data to Hologres in remote procedure call (RPC) mode or by using a Java Database Connectivity (JDBC) driver, the Hologres connector supports this metric. When you write data to Hologres in BHClient mode, the Hologres connector does not support this metric.
    • ApsaraDB for HBase
    • Tablestore
    • ApsaraDB for Redis
  • Watermark
    Metric Description Details Unit Supported connector Supported version
    Task InputWatermark The time when each task receives the latest watermark. This metric is used to check the latency of data receiving by TaskManagers. N/A N/A Only Realtime Compute for Apache Flink that uses VVR 2.0.0 or later supports this metric.
    watermarkLag The latency of watermarks. This metric is used to determine the latency of subtasks. Millisecond
    • Kafka
    • Message Queue for Apache RocketMQ
    • Log Service
    • DataHub
    • Hologres (Hologres binlog source table)
    Only Realtime Compute for Apache Flink that uses VVR 2.1.2 or later supports this metric.
  • JobManager
    Note Only Realtime Compute for Apache Flink that uses VVR 2.0.0 or later supports these metrics.
    Metric Description Details Unit
    JM CPU Load The CPU load of the JobManager. If the value of this metric is greater than 100% for a long period of time, the CPU of the JobManager is busy and the CPU load is high. This may affect the system performance. As a result, issues such as system stuttering and slow response occur. N/A
    JM Heap Memory The heap memory of the JobManager. This metric is used to check the change in the heap memory of the JobManager. Byte
    JM nonHeap Memory The non-heap memory of the JobManager. This metric is used to check the change in the non-heap memory of the JobManager. Byte
    JM Threads The number of threads of the JobManager. A large number of threads of the JobManager occupies excessive memory space. This reduces the deployment stability. N/A
    JM GC Count The number of times that garbage collection (GC) of the JobManager occurs. If GC of the JobManager occurs a large number of times, excessive memory space is occupied. This affects the deployment performance. This metric helps you diagnose deployments and handle deployment faults. Times
    JM GC Time The duration for which each GC of the JobManager lasts. If each GC lasts for a long period of time, excessive memory space is occupied. This affects the deployment performance. This metric helps you diagnose deployments and handle deployment faults. Millisecond
    JM ClassLoader/ClassUnLoader The total number of classes that are loaded or unloaded after the Java Virtual Machine (JVM) where the JobManager resides is created. After the JVM where the JobManager resides is created, if the total number of the classes that are loaded or unloaded is excessively large, excessive memory space is occupied. This affects the deployment performance. N/A
  • TaskManager
    Note Only Realtime Compute for Apache Flink that uses VVR 2.0.0 or later supports these metrics.
    Metric Description Details Unit
    TM CPU Load The CPU load of a TaskManager. If the value of this metric is greater than 100% for a long period of time, the CPU of the TaskManager is busy and the CPU load is high. This may affect the system performance. As a result, issues such as system stuttering and slow response occur. N/A
    TM CPU Usage The CPU utilization of a TaskManager. If the value of this metric is greater than 100% for a long period of time, the CPU of the TaskManager is busy. If the CPU load is high but the CPU utilization is low, a large number of processes that are in the uninterruptible sleep state may be running due to frequent read and write operations. N/A
    TM Heap Memory The heap memory of a TaskManager. This metric is used to check the change in the heap memory of the TaskManager. Byte
    TM nonHeap Memory The non-heap memory of a TaskManager. This metric is used to check the change in the non-heap memory of the TaskManager. Byte
    TM Mem (RSS) The resident set size (RSS) of the entire TaskManager process, as reported by Linux. This metric is used to check the change in the memory usage of the process. Byte
    TM Threads The number of threads of a TaskManager. A large number of threads of the TaskManager occupies excessive memory space. This reduces the deployment stability. N/A
    TM GC Count The number of times that GC of a TaskManager occurs. If GC of a TaskManager occurs a large number of times, excessive memory space is occupied. This affects the deployment performance. This metric helps you diagnose deployments and handle deployment faults. N/A
    TM GC Time The duration for which each GC of a TaskManager lasts. If each GC lasts for a long period of time, excessive memory space is occupied. This affects the deployment performance. This metric helps you diagnose deployments and handle deployment faults. Millisecond
    TM ClassLoader/ClassUnLoader The total number of classes that are loaded or unloaded after the JVM where a TaskManager resides is created. After the JVM where the TaskManager resides is created, if the total number of the classes that are loaded or unloaded is excessively large, excessive memory space is occupied. This affects the deployment performance. N/A
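As mentioned in the note of the State section, the state access latency metrics must be explicitly enabled. The following snippet is a minimal sketch of the configuration that you can add to the Additional Configuration section on the Advanced tab of the Draft Editor page. Only the first option is required to report the metrics; the sample-interval and history-size options are optional tuning parameters from open source Flink, and the values shown are their defaults.
  state.backend.latency-track.keyed-state-enabled: true # required to report the state access latency metrics
  state.backend.latency-track.sample-interval: 100      # track the latency of every 100th state access request
  state.backend.latency-track.history-size: 128         # number of recent latency samples kept per state access operation
Because latency tracking samples state accesses on the hot path, enable these metrics only while you diagnose state performance and disable them afterwards.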