Realtime Compute for Apache Flink shows core metrics of your job on the Curve Charts tab to help you diagnose the status of the job.

- The metrics are displayed only when a Realtime Compute for Apache Flink job is in the running state. If the job is in the suspended or terminated state, the metrics of this job are not displayed.
- The metrics are collected and displayed on the Curve Charts tab only after the job has been running for more than one minute. As a result, the data displayed in the curve charts lags slightly behind the actual state of the job.
Go to the Curve Charts tab
- Go to the Job Administration page in the Realtime Compute for Apache Flink console.
  - Log on to the Realtime Compute development platform.
  - In the top navigation bar, click Administration.
  - On the Jobs page that appears, find the job that you want to view and click its name in the Job Name column.
- In the upper part of the Job Administration page, click the Curve Charts tab.
Overview
- Failover
The failover curve chart displays the frequency of failovers that are caused by errors or exceptions for the current job. The failover rate is the total number of failovers that occurred within the last minute divided by 60. For example, if one failover occurred within the last minute, the failover rate is 1/60 ≈ 0.01667. A short sketch of this calculation follows this list.
- Delay
To help you assess full-link timeliness and job performance, Realtime Compute for Apache Flink provides the following latency metrics (a sketch that works through these formulas with sample timestamps follows this list):
- Processing Delay: Processing delay = Current system time - Event time of the last data record that has been processed. If no more data enters the upstream storage system, the processing delay gradually increases as the system time continues to move forward.
- Data Pending Time: Data pending time = Time when the data enters Realtime Compute for Apache Flink - Event time. Even if no more data enters the upstream storage system, the data pending time does not increase. This metric is used to assess whether the Realtime Compute for Apache Flink job has backpressure.
- Data Arrival Interval: Data arrival interval = Processing delay - Data pending time. If the Realtime Compute for Apache Flink job has no backpressure, the data pending time is short and stable, and this metric reflects how sparsely data arrives at the data source. If the job has backpressure, the data pending time is long or unstable, and this metric cannot be used as a reliable reference.
Note
- Realtime Compute for Apache Flink uses a distributed computing framework. Each of the preceding three latency metrics is first calculated for each shard or partition of the data source, and only the maximum value among all shards or partitions is reported to the development platform of Realtime Compute for Apache Flink. Therefore, the aggregated values that are displayed on the development platform do not necessarily satisfy the formula Data arrival interval = Processing delay - Data pending time.
- If no more data enters a shard or a partition of the data source, the processing delay gradually increases.
- Input TPS of Each Source
This chart displays statistics on all streaming data input of a Realtime Compute for Apache Flink job. The chart records the number of blocks that are read from the source table per second, which helps you obtain the transactions per second (TPS) of a data storage system. Different from TPS, the records per second (RPS) metric indicates the number of records that are read from the source table per second. These records are resolved from the blocks. For example, if five log groups are read from Log Service per second, the value of TPS is 5. If eight log records are resolved from each log group, a total of 40 log records are resolved per second, so the value of RPS is 40. This arithmetic is repeated in a sketch after this list.
- Data Output of Each Sink
This chart displays statistics on all output data of a Realtime Compute for Apache Flink job. This helps you obtain the RPS of a data storage system. In most cases, if no data output is detected during system operations and maintenance (O&M), you must check the input of the upstream storage system and the output of the downstream storage system.
- Input RPS of Each Source
This chart displays statistics on all input streaming data of a Realtime Compute for Apache Flink job. This helps you obtain the RPS of a data storage system. If no data output is detected during system O&M, you must check the RPS to determine whether the input data from the upstream storage system is normal.
- Input BPS of Each Source
This chart displays statistics on all input streaming data of a Realtime Compute for Apache Flink job. This chart records the traffic that is used to read the input source table per second. This helps you obtain the bytes per second (BPS) of the traffic.
- Dirty Data from Each Source
This chart displays the number of dirty data records in the data source of a Realtime Compute for Apache Flink job in different time periods.
- Auto Scaling Successes and Failures
This chart displays the number of successful and failed auto scaling operations.
Notice: This curve chart applies only to Realtime Compute for Apache Flink versions later than V3.0.0.
- CPUs Consumed By Auto Scaling
This chart displays the number of CPUs that are consumed when auto scaling is performed.
Notice: This curve chart applies only to Realtime Compute for Apache Flink versions later than V3.0.0.
- Memory Consumed by Auto Scaling
This chart displays the memory space that is consumed when auto scaling is performed.
Notice: This curve chart applies only to Realtime Compute for Apache Flink versions later than V3.0.0.
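The following sketch works through the failover rate formula from the Failover description above. The failover count is a hypothetical value used only to illustrate the arithmetic.

```java
public class FailoverRateExample {
    public static void main(String[] args) {
        // Hypothetical value: one failover observed within the last minute.
        int failoversInLastMinute = 1;

        // Failover rate = number of failovers within the last minute / 60.
        double failoverRate = failoversInLastMinute / 60.0;

        System.out.printf("Failover rate: %.5f%n", failoverRate); // prints 0.01667
    }
}
```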
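The next sketch works through the three latency formulas from the Delay description with sample timestamps. The timestamps are hypothetical and only illustrate how the metrics relate to one another.

```java
import java.time.Duration;
import java.time.Instant;

public class DelayMetricsExample {
    public static void main(String[] args) {
        // Hypothetical timestamps for a single data record.
        Instant eventTime   = Instant.parse("2024-01-01T08:00:00Z"); // event time of the record
        Instant arrivalTime = Instant.parse("2024-01-01T08:00:03Z"); // time the record entered Realtime Compute for Apache Flink
        Instant systemTime  = Instant.parse("2024-01-01T08:00:05Z"); // current system time

        // Processing delay = current system time - event time of the last processed record.
        Duration processingDelay = Duration.between(eventTime, systemTime);     // 5 seconds

        // Data pending time = time when the data entered the system - event time.
        Duration dataPendingTime = Duration.between(eventTime, arrivalTime);    // 3 seconds

        // Data arrival interval = processing delay - data pending time.
        Duration dataArrivalInterval = processingDelay.minus(dataPendingTime);  // 2 seconds

        System.out.println("Processing delay:      " + processingDelay.getSeconds() + "s");
        System.out.println("Data pending time:     " + dataPendingTime.getSeconds() + "s");
        System.out.println("Data arrival interval: " + dataArrivalInterval.getSeconds() + "s");
    }
}
```

Because the charts report the maximum value across shards or partitions, the aggregated curves do not necessarily satisfy the last equation, as the note above explains.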
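Finally, this sketch restates the Log Service example from the Input TPS of Each Source description. The values are the ones given in that example.

```java
public class TpsRpsExample {
    public static void main(String[] args) {
        // Values from the Log Service example: 5 log groups (blocks) read per second,
        // 8 log records resolved from each log group.
        int blocksReadPerSecond = 5;   // TPS: blocks read from the source table per second
        int recordsPerBlock     = 8;   // records resolved from each block

        // RPS: records resolved from the blocks per second.
        int recordsPerSecond = blocksReadPerSecond * recordsPerBlock;

        System.out.println("TPS = " + blocksReadPerSecond); // 5
        System.out.println("RPS = " + recordsPerSecond);    // 40
    }
}
```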
Advanced View
Alibaba Cloud Realtime Compute for Apache Flink provides a fault tolerance mechanism that allows you to restore data streams and keep the state of the application consistent. The mechanism continuously creates consistent snapshots of the distributed data streams and the related operator states. These snapshots serve as consistency checkpoints to which the system can fall back if a failure occurs.
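As a point of reference for the checkpoint charts below, the following sketch shows how checkpointing is typically enabled in an open source Apache Flink DataStream program. The interval, timeout, and pause values are illustrative assumptions; the sketch only illustrates the concepts that the charts measure and is not the way to configure a Realtime Compute for Apache Flink job.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 60 seconds.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Fail a checkpoint that does not complete within 10 minutes
        // (compare with the Checkpoint Duration chart).
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // Leave at least 30 seconds between the end of one checkpoint and the start of the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // Define sources, transformations, and sinks here, then call:
        // env.execute("checkpointed job");
    }
}
```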
Checkpoint
Curve chart | Description |
---|---|
Checkpoint Duration | Displays the time that is consumed to create a checkpoint. Unit: milliseconds. |
Checkpoint Size | Displays the memory size that is required to create a checkpoint. |
Checkpoint Alignment Time | Displays the time that all data streams take to flow from the upstream nodes to the node on which a checkpoint is created. A sink operator is the destination of the DAG stream. When a sink operator receives Barrier n from all of its input streams, it acknowledges Snapshot n to the checkpoint coordinator. After all sink operators acknowledge Snapshot n, the snapshot is considered complete. The time that this process consumes is the checkpoint alignment time. |
Checkpoint Count | Displays the number of checkpoints within a specific period of time. |
Get | Displays the longest duration for which a subtask performs a GET operation on RocksDB within a specific period of time. |
Put | Displays the longest duration for which a subtask performs a PUT operation on RocksDB within a specific period of time. |
Seek | Displays the longest duration for which a subtask performs a SEEK operation on RocksDB within a specific period of time. |
State Size | Displays the state size of the job within a specific period of time. If the size increases at a high rate, we recommend that you check for potential issues in the job. |
GMS GC Time | Displays the duration for which the underlying container of the job performs garbage collection. |
GMS GC Rate | Displays the frequency at which the underlying container of the job performs garbage collection. |
Watermark
Curve chart | Description |
---|---|
Watermark Delay | Displays the difference between the watermark time and the system time. |
Dropped Records per Second | Displays the number of data records that are dropped per second. If a data record arrives at the window after the watermark time, the data record is dropped. |
Dropped Records | Displays the total number of dropped data records. If a data record arrives at the window after the watermark time, the data record is dropped. |
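For context on the dropped-records charts, the following sketch shows one common way late records end up being dropped in an open source Apache Flink DataStream program: once the watermark has passed the end of a window, records with earlier event times no longer enter that window. The five-second out-of-orderness bound, the tuple layout, and the method name are illustrative assumptions, not values taken from this page.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LateDataExample {
    // Assumes an existing stream of (key, event-time-in-millis) pairs; the names are hypothetical.
    public static DataStream<Tuple2<String, Long>> windowedCounts(DataStream<Tuple2<String, Long>> events) {
        return events
                // Watermark = maximum observed event time - 5 seconds (illustrative bound).
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((event, timestamp) -> event.f1))
                .keyBy(event -> event.f0)
                // A record whose event time falls in a one-minute window that the watermark
                // has already passed is dropped and counted in the Dropped Records charts.
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .sum(1);
    }
}
```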
Delay
Top 15 Source Subtasks with the Longest Processing Delay
Throughput
Curve chart | Description |
---|---|
Task Input TPS | Displays the data input status of all the tasks in a job. |
Task Output TPS | Displays the data output status of all the tasks in a job. |
Queue
Curve chart | Description |
---|---|
Input Queue Usage | Displays the data input queue of all the tasks in a job. |
Output Queue Usage | Displays the data output queue of all the tasks in a job. |
Tracing
Curve chart | Description |
---|---|
Time Used In Processing Per Second | Displays the duration for which a task processes data per second. |
Time Used In Waiting Output Per Second | Displays the duration for which a task waits for output data per second. |
Task Latency Histogram Mean | Displays the latency of each task. |
Wait Output Histogram Mean | Displays the duration for which each task waits for output. |
Wait Input Histogram Mean | Displays the duration for which each task waits for input. |
Partition Latency Mean | Displays the latency of concurrent tasks in each partition. |
Process
Curve chart | Description |
---|---|
Process Memory RSS | Displays the memory usage of each process. |
CPU Usage | Displays the CPU utilization of each process. |
JVM
Curve chart | Description |
---|---|
Memory Heap Used | Displays the Java Virtual Machine (JVM) heap memory usage of a job. |
Memory Non-Heap Used | Displays the JVM non-heap memory usage of a job. |
Thread Count | Displays the number of threads in a job. |
GC(CMS) | Displays the number of times that a job completes garbage collection (GC). |