In Dataphin, you can analyze runtime metrics for real-time instances of the open-source Flink real-time computing engine. Runtime analysis lets you refresh real-time instance information and view data on failures, backpressure, Sink output, and Checkpoint failures.
Permission description
You must have project space permissions to view the runtime analysis of a project.
Access to the Apache Flink Dashboard requires a username and password. The credential prompt is available to super administrators, system administrators, task owners, and project operation and maintenance personnel.
Runtime analysis entry
From the Dataphin home page, click Development in the top menu bar. The data development page opens by default.
To view runtime analysis, follow these steps:
Click Operation and Maintenance -> select the project (in Dev-Prod mode, also select the environment) -> click Real-time Instance -> click the runtime analysis icon of the target instance.
View runtime analysis
The runtime analysis page displays the running status of various metrics as illustrated below:

| Feature | Description |
| --- | --- |
| ① Time Area Selection | Select the time range for which metric data is displayed. |
| ② Go To Apache Flink Dashboard and Refresh | Open the Apache Flink Dashboard for the instance, or refresh the displayed metric data. |
| ③ Real-time monitoring metrics | For Flink SQL or Flink Datastream tasks, you can view metrics such as overview, Checkpoint, IO, Watermark, CPU, memory, and JVM. For detailed descriptions of each metric, see Real-time monitoring metrics description. |
| ④ Metric data statistics | View the data status of each metric within the selected time period. |
Metric statistics description
If the selected time interval is less than or equal to 6 hours, you can view all data points collected every minute.
For time intervals greater than 6 hours and up to 24 hours, data points are collected every 5 minutes starting from the hour, with each point representing the count for the preceding 5 minutes.
For time intervals exceeding 24 hours, data points are collected every 10 minutes starting from the hour, with each point representing the count for the preceding 10 minutes.
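As a quick illustration of the bucketing rule above, the following sketch maps a selected time range to its sampling interval. The function and its use of Python's timedelta are illustrative only, not Dataphin product code.

```python
from datetime import timedelta

def sampling_interval(selected_range: timedelta) -> timedelta:
    """Map a selected time range to the sampling interval described above
    (an illustrative sketch, not Dataphin product code)."""
    if selected_range <= timedelta(hours=6):
        return timedelta(minutes=1)   # every per-minute data point is shown
    if selected_range <= timedelta(hours=24):
        return timedelta(minutes=5)   # 5-minute points, aligned to the hour
    return timedelta(minutes=10)      # 10-minute points beyond 24 hours

print(sampling_interval(timedelta(hours=12)))  # 0:05:00
```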
Real-time monitoring metrics description
Overview
| Monitoring metrics | Description | Unit |
| --- | --- | --- |
| Number of Failures | The count of task failures occurring within the selected time period. | Times |
| Backpressure Condition | Indicates whether the task experiences backpressure during the current time period. Backpressure occurs when a task generates data faster than the downstream task can process it. | Boolean |
| Data Output of Each Sink | The throughput of each sink, measured in transactions per second (TPS). | TPS |
| Number of Checkpoint Failures | The count of Checkpoint failures for tasks within the current time period. | Times |
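If you also have direct access to the Flink cluster, the same backpressure signal can be inspected through Flink's REST API. The sketch below is a minimal example: the REST address, job ID, and vertex ID are placeholders, and the exact response field name differs across Flink versions.

```python
import json
import urllib.request

# Placeholders: replace with your Flink REST address, job ID, and vertex ID.
FLINK_REST = "http://localhost:8081"
JOB_ID = "<job-id>"
VERTEX_ID = "<vertex-id>"

def vertex_backpressure(rest: str, job_id: str, vertex_id: str) -> dict:
    """Fetch backpressure statistics for one job vertex from Flink's REST API."""
    url = f"{rest}/jobs/{job_id}/vertices/{vertex_id}/backpressure"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

stats = vertex_backpressure(FLINK_REST, JOB_ID, VERTEX_ID)
# Newer Flink releases report "backpressureLevel"; older ones use "backpressure-level".
print("backpressure level:", stats.get("backpressureLevel") or stats.get("backpressure-level"))
```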
Checkpoint
| Monitoring metrics | Subtype | Description | Unit |
| --- | --- | --- | --- |
| Total number of Checkpoints (Num of Checkpoints) | Checkpoint total count (totalNumberOfCheckpoints) | The total number of task Checkpoints within the specified time period. | Count |
| | Number of failed Checkpoints (numberOfFailedCheckpoints) | The number of task Checkpoints that failed during the current time period. | Count |
| | Number of completed Checkpoints (numberOfCompletedCheckpoints) | The number of task Checkpoints completed within the current time period. | Count |
| | Number of in-progress Checkpoints (numberOfInProgressCheckpoints) | The number of task Checkpoints currently in progress. | Count |
| Duration of the most recent Checkpoint (lastCheckpointDuration) | Duration of the last Checkpoint (lastCheckpointDuration) | The time taken to complete the most recent task Checkpoint. A long duration or a timeout can result from a large state size, transient network problems, barrier misalignment, or data backpressure. | Milliseconds (ms) |
| Last Checkpoint size (lastCheckpointSize) | Size of the last Checkpoint (lastCheckpointSize) | The size of the most recently uploaded Checkpoint, useful for analyzing performance bottlenecks. | Bytes (Byte) |
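The counters above correspond to what open-source Flink exposes through its checkpoint statistics endpoint, so they can also be pulled programmatically. A minimal query sketch follows; the REST address and job ID are placeholders, and the response layout may vary slightly by Flink version.

```python
import json
import urllib.request

FLINK_REST = "http://localhost:8081"   # placeholder Flink REST address
JOB_ID = "<job-id>"                    # placeholder job ID

def checkpoint_counts(rest: str, job_id: str) -> dict:
    """Return the job's checkpoint counters (total, in_progress, completed, failed, ...)."""
    url = f"{rest}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["counts"]

counts = checkpoint_counts(FLINK_REST, JOB_ID)
print("failed checkpoints:", counts.get("failed"))
```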
IO
| Monitoring metrics | Meaning | Subtype | Description | Unit |
| --- | --- | --- | --- | --- |
| Input rate: total bytes received per second (numBytesInPerSecond) | Provides insight into the upstream flow rate, aiding in the analysis of job traffic performance. | Local bytes read per second (numBytesInLocalPerSecond) | The number of bytes read from local sources each second. | Bytes/s |
| | | Remote bytes read per second (numBytesInRemotePerSecond) | The number of bytes read from remote sources each second. | Bytes/s |
| | | Local network buffers read per second (numBuffersInLocalPerSecond) | The number of network buffers read from local sources each second. | Count/s |
| | | Remote network buffers read per second (numBuffersInRemotePerSecond) | The number of network buffers read from remote sources each second. | Count/s |
| Output rate: total bytes sent per second (numBytesOutPerSecond) | Provides insight into the task's output throughput, aiding in the analysis of job traffic performance. | Bytes sent per second (numBytesOutPerSecond) | The number of bytes sent each second. | Bytes/s |
| | | Network buffers sent per second (numBuffersOutPerSecond) | The number of network buffers sent each second. | Count/s |
| Subtask I/O: records processed per second (Task numRecords I/O PerSecond) | Enables identification of potential I/O bottlenecks in the job and assessment of their severity. | Records received per second (numRecordsInPerSecond) | The number of records received each second. | Count/s |
| | | Records sent per second (numRecordsOutPerSecond) | The number of records sent each second. | Count/s |
| Subtask I/O: total records processed (Task numRecords I/O) | Enables identification of potential I/O bottlenecks in the job. | Total records received (numRecordsIn) | The total number of records received. | Count |
| | | Total records sent (numRecordsOut) | The total number of records sent. | Count |
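To pinpoint which stage causes an I/O bottleneck, the same record-rate metrics can be read per vertex from Flink's REST API. The sketch below is illustrative: the REST address and job ID are placeholders, and endpoint paths and response shapes depend on your Flink version.

```python
import json
import urllib.request

FLINK_REST = "http://localhost:8081"   # placeholder Flink REST address
JOB_ID = "<job-id>"                    # placeholder job ID

def print_vertex_record_rates(rest: str, job_id: str) -> None:
    """List the job's vertices, then fetch the aggregated per-subtask record rates
    for each one so the slowest stage stands out."""
    with urllib.request.urlopen(f"{rest}/jobs/{job_id}") as resp:
        vertices = json.load(resp)["vertices"]
    for vertex in vertices:
        url = (f"{rest}/jobs/{job_id}/vertices/{vertex['id']}/subtasks/metrics"
               "?get=numRecordsInPerSecond,numRecordsOutPerSecond")
        with urllib.request.urlopen(url) as resp:
            print(vertex["name"], json.load(resp))

print_vertex_record_rates(FLINK_REST, JOB_ID)
```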
Watermark
| Monitoring metrics | Description | Unit |
| --- | --- | --- |
| Last Watermark timestamp for each task (Task InputWatermark) | The time at which each task received its last Watermark, reflecting the data reception delay of the TaskManager (TM). | Milliseconds (ms) |
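The lag implied by this metric is simply the wall-clock time minus the last received Watermark. The helper below is an illustrative sketch; the function name is an assumption, not a Dataphin API.

```python
import time

def watermark_delay_ms(task_input_watermark_ms: int) -> int:
    """Approximate event-time lag for a task: current wall-clock time minus the last
    Watermark it received, both in epoch milliseconds. A large or steadily growing
    value suggests the task is falling behind its sources."""
    return int(time.time() * 1000) - task_input_watermark_ms

# Example: a Watermark received 30 seconds ago implies roughly 30,000 ms of lag.
print(watermark_delay_ms(int(time.time() * 1000) - 30_000))
```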
CPU
| Monitoring metrics | Description | Unit |
| --- | --- | --- |
| Single JM CPU utilization (JM CPU Load) | The CPU load of a single JobManager (JM). A value consistently above 100% indicates high CPU load, which can cause performance issues such as system lag and delayed responses. | % |
| Single TM CPU utilization (TM CPU Load) | The share of CPU time slices used by a single TaskManager (TM) in Flink. A value of 100% means one core is fully used, and 400% means four cores are fully used. Values persistently above 100% indicate high CPU load. Low CPU utilization combined with high system load may be caused by heavy read/write activity that leaves many threads in uninterruptible sleep. | % |
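Because 100% corresponds to one fully used core, a quick way to read TM CPU Load is to convert it into equivalent cores and compare it with the number of allocated cores. The helper below is an illustrative sketch; the 80% warning threshold and the function name are assumptions, not Dataphin defaults.

```python
def tm_cpu_summary(tm_cpu_load_percent: float, allocated_cores: int) -> str:
    """Interpret TM CPU Load, where 100% equals one fully used core."""
    cores_used = tm_cpu_load_percent / 100.0
    utilization = cores_used / allocated_cores
    status = "high CPU load" if utilization > 0.8 else "normal"
    return f"{cores_used:.1f}/{allocated_cores} cores in use ({utilization:.0%}, {status})"

print(tm_cpu_summary(320.0, 4))  # 3.2/4 cores in use (80%, normal)
```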
Memory
| Monitoring metrics | Subtype | Description | Unit |
| --- | --- | --- | --- |
| JM heap memory (JM Heap Memory) | JM heap memory used (JM Heap Memory Used) | The amount of heap memory currently in use by the JobManager. | Bytes |
| | JM heap memory committed (JM Heap Memory Committed) | The amount of heap memory guaranteed to be available to the JobManager by the JVM. | Bytes |
| | JM heap memory max (JM Heap Memory Max) | The maximum amount of heap memory that can be used by the JobManager. | Bytes |
| JM non-heap memory (JM NonHeap Memory) | JM non-heap memory used (JM NonHeap Memory Used) | The amount of non-heap memory currently in use by the JobManager. | Bytes |
| | JM non-heap memory committed (JM NonHeap Memory Committed) | The amount of non-heap memory guaranteed to be available to the JobManager by the JVM. | Bytes |
| | JM non-heap memory max (JM NonHeap Memory Max) | The maximum amount of non-heap memory that can be used by the JobManager. | Bytes |
| TM heap memory (TM Heap Memory) | TM heap memory used (TM Heap Memory Used) | The amount of heap memory currently in use by the TaskManager. | Bytes |
| | TM heap memory committed (TM Heap Memory Committed) | The amount of heap memory guaranteed to be available to the TaskManager by the JVM. | Bytes |
| | TM heap memory max (TM Heap Memory Max) | The maximum amount of heap memory that can be used by the TaskManager. | Bytes |
| TM non-heap memory (TM NonHeap Memory) | TM non-heap memory used (TM NonHeap Memory Used) | The amount of non-heap memory currently in use by the TaskManager. | Bytes |
| | TM non-heap memory committed (TM NonHeap Memory Committed) | The amount of non-heap memory guaranteed to be available to the TaskManager by the JVM. | Bytes |
| | TM non-heap memory max (TM NonHeap Memory Max) | The maximum amount of non-heap memory that can be used by the TaskManager. | Bytes |
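A common way to act on these values is to track the ratio of used to maximum heap. The helper below is an illustrative sketch; the function name, example figures, and the idea of alerting near 100% are assumptions, not Dataphin behavior.

```python
def heap_utilization(used_bytes: int, max_bytes: int) -> float:
    """Fraction of the configured heap currently in use (Heap Memory Used / Heap Memory Max)."""
    return used_bytes / max_bytes

# Example: about 3.2 GiB used out of a 4 GiB maximum heap.
ratio = heap_utilization(used_bytes=3_435_973_837, max_bytes=4_294_967_296)
print(f"TM heap utilization: {ratio:.0%}")  # 80%; sustained values near 100% often
                                            # precede OutOfMemory errors or heavy GC
```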
JVM
| Monitoring metrics | Description | Unit |
| --- | --- | --- |
| JM active threads (JM Threads) | The number of active threads in the JobManager (JM). Excessive JM threads can consume significant memory, compromising job stability. | Count |
| TM active threads (TM Threads) | The number of active threads in the TaskManager (TM), aggregated by TM, with each TM displayed on a separate line. | Count |
| JM young generation garbage collector runtime (JM GC Time) | The runtime of the young generation garbage collector in the JobManager (JM). Prolonged garbage collection usually indicates excessive memory usage and can impact job performance. This metric is useful for diagnosing job-level issues. | Milliseconds (ms) |
| TM young generation garbage collector runtime (TM GC Time) | The runtime of the young generation garbage collector in the TaskManager (TM). Prolonged garbage collection usually indicates excessive memory usage and can impact job performance. This metric is useful for diagnosing job-level issues. | Milliseconds (ms) |
| JM young generation garbage collector count (JM GC Count) | The number of young generation garbage collections in the JobManager (JM). Frequent garbage collection usually indicates heavy memory usage and can affect job performance. This metric is useful for diagnosing job-level issues. | Count |
| TM young generation garbage collector count (TM GC Count) | The number of young generation garbage collections in the TaskManager (TM). Frequent garbage collection usually indicates heavy memory usage and can affect job performance. This metric is useful for diagnosing task-level issues. | Count |
| Total classes loaded by TM since JVM startup (TM ClassLoader) | The total number of classes loaded by the TaskManager (TM) since the Java Virtual Machine (JVM) started. A large number of classes loaded or unloaded can consume significant memory, impacting job performance. | Count |
| Total classes loaded by JM since JVM startup (JM ClassLoader) | The total number of classes loaded by the JobManager (JM) since the Java Virtual Machine (JVM) started. A large number of classes loaded or unloaded can consume significant memory, impacting job performance. | Count |
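GC Time and GC Count are cumulative, so the useful signal is how fast they grow. The sketch below derives a GC overhead ratio from two readings of GC Time; the function name, sample values, and window length are illustrative assumptions.

```python
def gc_overhead(gc_time_ms_start: int, gc_time_ms_end: int, window_ms: int) -> float:
    """Fraction of a sampling window spent in young-generation GC, derived from two
    readings of the cumulative GC Time metric (illustrative helper)."""
    return (gc_time_ms_end - gc_time_ms_start) / window_ms

# Example: GC Time grew by 12 s over a 5-minute window -> 4% of time spent in GC.
print(f"{gc_overhead(90_000, 102_000, 300_000):.0%}")
```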
