Fully managed Flink allows you to view the metrics of a job so that you can check whether the job processes data as expected. This topic describes how to view the metrics of a job, the metrics that are supported by Flink sources and sinks, and the connectors that report these metrics.

Background information

You can view the metrics of a job in the console of fully managed Flink. You can also use one of the following methods to view the metrics:
  • Use the self-managed Prometheus service to view the metrics.
    If your Pushgateway is reachable from your fully managed Flink workspace over the network, add the following configuration to the Additional Configuration section on the Advanced tab of the Draft Editor page of the job in the console of fully managed Flink:
    metrics.reporters: promgatewayappmgr
    metrics.reporter.promgatewayappmgr.groupingKey: deploymentName={{deploymentName}};deploymentId={{deploymentId}};jobId={{jobId}}
    metrics.reporter.promgatewayappmgr.jobName: '{{deploymentName}}'
    metrics.reporter.promgatewayappmgr.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
    metrics.reporter.promgatewayappmgr.host: ${your pushgateway host}
    metrics.reporter.promgatewayappmgr.port: ${your pushgateway port}
    Note Replace ${your pushgateway host} and ${your pushgateway port} with the hostname and port number of your Pushgateway. For an example of a Prometheus configuration that scrapes the pushed metrics, see the sketch after this list.
  • Call Application Real-Time Monitoring Service (ARMS) API operations to obtain the metrics and integrate the metrics into your platform.

    For more information about ARMS API operations, see List of API operations by feature. For more information about operator-related metrics, see Operator Metrics.
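
After the job pushes metrics to the Pushgateway, your self-managed Prometheus service must also scrape the Pushgateway. The following snippet is a minimal sketch of a prometheus.yml scrape configuration, not a setting of fully managed Flink; the job name is arbitrary, and the placeholders must be replaced with the host and port of your Pushgateway.

    # Hypothetical prometheus.yml fragment. Replace the target with the address of the
    # Pushgateway that the reporter configuration above pushes metrics to.
    scrape_configs:
      - job_name: 'flink-pushgateway'
        honor_labels: true    # keep the deploymentName, deploymentId, and jobId grouping keys
        static_configs:
          - targets: ['${your pushgateway host}:${your pushgateway port}']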

Precautions

  • Metrics that are reported by a source reflect only the current state of the source and cannot identify the root cause of an issue on their own. You must use auxiliary metrics or tools to find the root cause. The following scenarios describe how to interpret the metrics.
    An operator in the job has backpressure: The backpressure detection feature of the Flink UI, rather than metrics, is the most direct way to detect backpressure. If backpressure exists, the rate at which the source sends data to downstream operators decreases. In this case, the value of the sourceIdleTime metric may increase periodically, and the values of the currentFetchEventTimeLag and currentEmitEventTimeLag metrics may increase continuously. In extreme cases, such as when an operator is stuck, the value of the sourceIdleTime metric increases continuously.
    The source has a performance bottleneck: If only the throughput of the source is insufficient, no backpressure is detected in the job. The sourceIdleTime metric stays at a small value because the source keeps running, and the currentFetchEventTimeLag and currentEmitEventTimeLag metrics are both large and close to each other.
    Data skew occurs upstream, or a partition is empty: If data skew occurs or a partition is empty, one or more sources are idle. In this case, the value of the sourceIdleTime metric for those sources is large.
  • If the latency of a job is high, you can use the following metrics to analyze the data processing capability of fully managed Flink and the amount of data that is retained in the external system. An example that computes the difference between the two lag metrics in Prometheus is shown after this list.
    sourceIdleTime: Indicates whether the source is idle. A large value means that data is generated in the external system at a low rate.
    currentFetchEventTimeLag and currentEmitEventTimeLag: Indicate the latency with which fully managed Flink processes data. The difference between the two values is the time that data is retained inside the source and can be used to analyze the processing capability of the source.
    • If the difference is small, the source does not pull data from the external system efficiently. Check for issues related to network I/O or insufficient parallelism.
    • If the difference is large, the source does not process data efficiently. Check for issues related to data parsing, insufficient parallelism, or backpressure.
    pendingBytes and pendingRecords: Indicate the amount of data that is retained in the external system and has not been pulled by the source.
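
If you report metrics to a self-managed Prometheus service as described in the Background information section, you can precompute the difference between currentEmitEventTimeLag and currentFetchEventTimeLag with a recording rule. The following snippet is a minimal sketch; the flink_taskmanager_job_task_operator_* metric names are assumptions based on the default naming of the Prometheus reporter, so verify the exact names in your Prometheus instance before you use the rule.

    # Hypothetical Prometheus recording rule file (rules.yml). The metric names are
    # assumptions based on the default operator metric scope of the Prometheus reporter;
    # verify the names that your deployment actually exports.
    groups:
      - name: flink-source-latency
        rules:
          # Difference between the two lag metrics: the time that data spends inside the source.
          - record: flink:source_internal_lag_milliseconds
            expr: >
              flink_taskmanager_job_task_operator_currentEmitEventTimeLag
              - flink_taskmanager_job_task_operator_currentFetchEventTimeLag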

Procedure

  1. Log on to the Realtime Compute for Apache Flink console.
  2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  3. In the left-side navigation pane, choose Applications > Deployments.
  4. Click the name of the desired job.
  5. Click Metrics.
  6. View the metrics of the job.
    For more information about the metrics that are supported by sources and sinks and the connectors that are used to report metrics, see Metrics.

Metrics

This section describes the metrics that are supported by fully managed Flink powered by Ververica Runtime (VVR) 2.1.3 or later.
  • Source
    numBytesIn (Counter, bytes): The total number of bytes that flow into the source.
    numBytesInPerSecond (Meter, bytes/s): The rate at which bytes flow into the source.
    numRecordsIn (Counter, records): The total number of records that flow into the source.
    numRecordsInPerSecond (Meter, records/s): The rate at which records flow into the source.
    numRecordsInErrors (Counter, records): The total number of records that the source fails to process.
    currentFetchEventTimeLag (Gauge, milliseconds): The time when the source reads the data (FetchTime) minus the event time of the data (EventTime). This metric indicates how long data is retained in the external system. For example, when fully managed Flink reads data from Kafka, this metric is the difference between the time when the Kafka consumer pulls the data (FetchTime) and the time when the data was written to Kafka (EventTime).
    currentEmitEventTimeLag (Gauge, milliseconds): The time when the data leaves the source (EmitTime) minus the event time of the data (EventTime). This metric indicates how long data is retained in the external system and in the source. For example, when fully managed Flink reads data from Kafka, this metric is the difference between the time when the data leaves the Kafka source (EmitTime) and the time when the data was written to Kafka (EventTime).
    watermarkLag (Gauge, milliseconds): The current time minus the watermark time of the data. This metric indicates the latency of the watermark.
    sourceIdleTime (Gauge, milliseconds): The current time minus the time when the source last processed data. This metric indicates how long the source has been idle.
    pendingBytes (Gauge, bytes): The number of bytes that the source has not yet pulled from the external system.
    pendingRecords (Gauge, records): The number of records that the source has not yet pulled from the external system.
    Notice Some sources do not report the preceding metrics. A metric is reported only when it is supported by both the external system and the source. For more information about these metrics, see Standardize Connector Metrics.
    Table 1. Connectors that are used to report metrics

    The check sign (√) indicates that the metric is reported. The cross sign (x) indicates that the metric is not reported.

    Connector numBytesIn numBytesInPerSecond numRecordsIn numRecordsInPerSecond numRecordsInErrors currentFetchEventTimeLag currentEmitEventTimeLag watermarkLag sourceIdleTime pendingBytes pendingRecords
    Kafka x x
    MaxCompute x x x x x x x
    Incremental MaxCompute x x x x x x x
    Message Queue for Apache RocketMQ x x
    Log Service x x x x x x
    DataHub x x x x x x
    Elasticsearch x x x x x x x x
    Hologres x x x x x x x x x
    ApsaraDB for HBase x x x x x x x x x x x
    Tablestore x x x x x x x x x x x
    Phoenix x x x x x x x x x x x
    ApsaraDB for Redis x x x x x x x x x x x
  • Sink
    Table 2. Sink metrics
    numBytesOut (Counter, bytes): The total number of bytes that the sink writes to the external system.
    numBytesOutPerSecond (Meter, bytes/s): The rate at which the sink writes bytes to the external system.
    numRecordsOut (Counter, records): The total number of records that the sink writes to the external system.
    numRecordsOutPerSecond (Meter, records/s): The rate at which the sink writes records to the external system.
    numRecordsOutErrors (Counter, records): The total number of records that the sink fails to write. For an example of alerting on this metric, see the sketch after Table 3.
    currentSendTime (Gauge, milliseconds): The time that the sink takes to write the most recent record to the external system.
    Table 3. Connectors that are used to report metrics

    The check sign (√) indicates that the metric is reported. The cross sign (x) indicates that the metric is not reported.

    Connector numBytesOut numBytesOutPerSecond numRecordsOut numRecordsOutPerSecond numRecordsOutErrors currentSendTime
    Kafka
    MaxCompute x x x x
    Incremental MaxCompute x x x x
    Message Queue for Apache RocketMQ x x x
    Log Service x x x
    DataHub x
    Elasticsearch x x x x x x
    Hologres x x x
    ApsaraDB for HBase x
    Tablestore x x x
    Phoenix x x x
    ApsaraDB for Redis x x x
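
As an example of how the sink metrics can be used, the following snippet is a minimal sketch of a Prometheus alerting rule that fires when a sink keeps failing to write records. It assumes that you report metrics to a self-managed Prometheus service as described in the Background information section; the metric name and the deploymentName label are assumptions based on the reporter configuration and grouping keys shown above, so adjust them to your environment.

    # Hypothetical Prometheus alerting rule file (alerts.yml). The metric name and the
    # deploymentName label are assumptions; verify them against the metrics that your setup exports.
    groups:
      - name: flink-sink-errors
        rules:
          - alert: FlinkSinkRecordErrors
            # numRecordsOutErrors is a counter, so increase() returns the number of
            # records that the sink failed to write during the last 5 minutes.
            expr: increase(flink_taskmanager_job_task_operator_numRecordsOutErrors[5m]) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Sink of deployment {{ $labels.deploymentName }} reported failed records in the last 5 minutes'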