Monitor Flink Real-Time Instance Runtime & Diagnose Issues

Permissions

To view the running analysis of a project, you must have project space permissions.
To access the Apache Flink Dashboard, you must provide a username and password. The following roles can view the username and password hints: Super Administrator, System Administrator, task owner, and project O&M engineer.

Access running analysis

In the top navigation bar of the Dataphin homepage, choose Development > Task O&M.
In the left-side navigation pane, choose Instance O&M > Real-time Instance. On the Real-time Instance page, click the icon in the Actions column for the target instance.

View running analysis

The running analysis page provides the following features:

Feature	Description
① Time range selector	Quick options: You can select Last 10 minutes, Last 1 hour, Last 6 hours, Last 1 day, or Last 1 week. The default is Last 1 hour. Custom time range: Specify a custom time range within the last 7 days.
② Go to Apache Flink Dashboard and Refresh	Go to Apache Flink Dashboard: Available only for Flink engine versions 1.14 and later. Click Go to Apache Flink Dashboard to open the dashboard. Note: To view the Apache Flink Dashboard, ensure that your VPC networks can communicate with each other. Refresh: Manually refresh the displayed data.
③ Real-time monitoring metrics	For Flink SQL or Flink DataStream tasks, you can view the following metrics: overview, checkpoint, IO, watermark, CPU, memory, and JVM. For details about each metric, see Real-time monitoring metrics.
④ Metric data statistics	Displays the data for each metric within the selected time range.

Metric aggregation rules

If the selected time range is 6 hours or less, all data points collected every minute are displayed.
If the selected time range is greater than 6 hours and less than or equal to 24 hours, data points are displayed every 5 minutes, starting from the top of the hour. Each data point aggregates data from the preceding 5 minutes.
If the selected time range is greater than 24 hours, data points are displayed every 10 minutes, starting from the top of the hour. Each data point aggregates data from the preceding 10 minutes.
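The aggregation rules above can be summarized as a small lookup. The following Python sketch is illustrative only: the function name aggregation_interval_minutes is hypothetical and is not part of Dataphin.

```python
from datetime import timedelta


def aggregation_interval_minutes(time_range: timedelta) -> int:
    """Return the spacing, in minutes, between displayed data points.

    Hypothetical helper that mirrors the documented rules; it is not a
    Dataphin API.
    """
    if time_range <= timedelta(hours=6):
        return 1   # every per-minute data point is displayed
    if time_range <= timedelta(hours=24):
        return 5   # one point per 5 minutes, aggregating the preceding 5 minutes
    return 10      # one point per 10 minutes, aggregating the preceding 10 minutes
```

For example, the default Last 1 hour range maps to 1-minute points, and Last 1 week maps to 10-minute points.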

Real-time monitoring metrics

Overview

Metric	Description	Unit
Number of failures	The number of times the task failed within the selected time range.	Count
Backpressure status	Whether backpressure exists for the task within the selected time range. Backpressure occurs when an upstream task produces data faster than a downstream task can consume it.	Yes or No
Data output per sink	The data output rate for each sink, in transactions per second (TPS).	Count/s
Number of failed checkpoints	The number of failed checkpoint operations for the task within the selected time range.	Count

Checkpoint

Metric	Sub-metric	Description	Unit
Number of checkpoints (Num of Checkpoints)	Total number of checkpoints (totalNumberOfCheckpoints)	The total number of checkpoints for the task within the selected time range.	Count
	Total number of failed checkpoints (numberOfFailedCheckpoints)	The number of failed checkpoints for the task within the selected time range.	Count
	Total number of completed checkpoints (numberOfCompletedCheckpoints)	The number of completed checkpoints for the task within the selected time range.	Count
	Total number of in-progress checkpoints (numberOfInProgressCheckpoints)	The number of in-progress checkpoints for the task within the selected time range.	Count
Last checkpoint duration (lastCheckpointDuration)	Last checkpoint duration (lastCheckpointDuration)	The duration of the last checkpoint for the task. A large state size, temporary network issues, unaligned barriers, or data backpressure can cause a long checkpoint duration or a timeout.	ms
Last checkpoint size (lastCheckpointSize)	Last checkpoint size (lastCheckpointSize)	The size of the last checkpoint for the task. Use this metric to analyze checkpoint performance when bottlenecks occur.	byte
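When reading the checkpoint counts, it can help to relate failed checkpoints to the total. The following Python sketch computes that ratio from values you read off the page. The function name is hypothetical, and the ratio is a derived value, not a metric that Dataphin reports.

```python
def checkpoint_failure_ratio(total_checkpoints: int, failed_checkpoints: int) -> float:
    """Return the fraction of checkpoints that failed in the selected range.

    Hypothetical helper: the inputs correspond to totalNumberOfCheckpoints
    and numberOfFailedCheckpoints as read from the page.
    """
    if total_checkpoints == 0:
        return 0.0  # no checkpoints were triggered, so nothing failed
    return failed_checkpoints / total_checkpoints
```

A ratio that stays above zero across several time ranges suggests checking the last checkpoint duration and size for the causes listed in the table.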

IO

Metric group	Group description	Metric	Metric description	Unit
Total input bytes per second (numBytesInPerSecond)	Use this metric group to monitor upstream data flow and job traffic performance.	Bytes read locally per second (numBytesInLocalPerSecond)	The number of bytes read locally per second.	byte/s
		Bytes read remotely per second (numBytesInRemotePerSecond)	The number of bytes read remotely per second.	byte/s
		Bytes from local network buffers per second (numBuffersInLocalPerSecond)	The number of bytes read from local network buffers per second.	byte/s
		Bytes from remote network buffers per second (numBuffersInRemotePerSecond)	The number of bytes read from remote network buffers per second.	byte/s
Total output bytes per second (numBytesOutPerSecond)	Use this metric group to monitor upstream throughput and job traffic performance.	Output bytes per second (numBytesOutPerSecond)	The number of bytes output per second.	byte/s
		Output buffer bytes per second (numBuffersOutPerSecond)	The number of bytes output from network buffers per second.	byte/s
Records I/O per second per subtask (Task numRecordsI/OPerSecond)	Use this metric to identify IO bottlenecks and assess their severity based on the rate.	Records received per second (numRecordsInPerSecond)	The number of records received per second.	Count/s
		Records sent per second (numRecordsOutPerSecond)	The number of records sent per second.	Count/s
Total records I/O per subtask (Task numRecordsI/O)	Use this metric to identify IO bottlenecks in the job.	Total records received (numRecordsIn)	The total number of records received.	Count
		Total records sent (numRecordsOut)	The total number of records sent.	Count
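The total record counts and the per-second rates are related: a rate is the change in a total over an interval. The following Python sketch shows that arithmetic for two readings of a total such as numRecordsIn. The function name is hypothetical, and the sketch assumes the total only increases between the two readings.

```python
def records_per_second(count_start: int, count_end: int, seconds: float) -> float:
    """Return the average record rate between two readings of a total.

    Hypothetical helper: count_start and count_end are two readings of a
    total such as numRecordsIn, taken `seconds` apart.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds
```

Comparing the rate derived for received records with the rate for sent records on the same subtask can help locate where throughput drops.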

Watermark

Metric	Description	Unit
Task input watermark (Task InputWatermark)	The timestamp of the latest watermark received by each task. This metric indicates data reception latency at the TaskManager (TM).	ms
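To turn the watermark timestamp into a latency figure, subtract it from the current time. The following Python sketch assumes the watermark is expressed in epoch milliseconds and that event time roughly tracks wall-clock time. Both are assumptions about your job, and the function name is hypothetical.

```python
import time
from typing import Optional


def watermark_lag_ms(watermark_ms: int, now_ms: Optional[int] = None) -> int:
    """Return how far the watermark trails the current time, in milliseconds.

    Hypothetical helper. Assumes watermark_ms is an epoch timestamp in
    milliseconds; now_ms defaults to the current wall-clock time.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - watermark_ms
```

A lag that keeps growing over the selected time range suggests that the task is falling behind its input.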

CPU

Metric	Description	Unit
JM CPU load (JM CPU Load)	The CPU utilization of a single JobManager (JM). A value consistently above 100% indicates high CPU load, which may cause sluggishness or long response times.	Count
TM CPU load (TM CPU Load)	The CPU utilization of a single TaskManager (TM), reflecting Flink's consumption of CPU time slices. For a single-core CPU, 100% indicates full utilization; for a four-core CPU, 400% indicates full utilization. A value consistently above 100% indicates high CPU load. If the load is high but CPU utilization is low, excessive processes in an uninterruptible sleep state from frequent read/write operations may be the cause.	Count
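Because full utilization scales with the number of cores (100% for one core, 400% for four cores), it can be easier to compare TMs after dividing by the core count. The following Python sketch shows that normalization. The function name is hypothetical, and you must supply the core count yourself because this page does not display it.

```python
def normalized_cpu_utilization(cpu_load_percent: float, cores: int) -> float:
    """Return CPU load as a percentage of the total capacity of all cores.

    Hypothetical helper: cpu_load_percent is the value read from the page,
    where 100 means one fully used core.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return cpu_load_percent / cores
```

For example, a reading of 400 on a four-core TM normalizes to 100, which means that all four cores are fully used.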

Memory

Metric	Sub-metric	Description	Unit
JM heap memory (JM Heap Memory)	Used (JM Heap Memory Used)	The amount of used JM heap memory.	byte
	Committed (JM Heap Memory Committed)	The amount of committed JM heap memory.	byte
	Max (JM Heap Memory Max)	The maximum available JM heap memory.	byte
JM non-heap memory (JM NonHeap Memory)	Used (JM NonHeap Memory Used)	The amount of used JM non-heap memory.	byte
	Committed (JM NonHeap Memory Committed)	The amount of committed JM non-heap memory.	byte
	Max (JM NonHeap Memory Max)	The maximum available JM non-heap memory.	byte
TM heap memory (TM Heap Memory)	Used (TM Heap Memory Used)	The amount of used TM heap memory.	byte
	Committed (TM Heap Memory Committed)	The amount of committed TM heap memory.	byte
	Max (TM Heap Memory Max)	The maximum available TM heap memory.	byte
TM non-heap memory (TM NonHeap Memory)	Used (TM NonHeap Memory Used)	The amount of used TM non-heap memory.	byte
	Committed (TM NonHeap Memory Committed)	The amount of committed TM non-heap memory.	byte
	Max (TM NonHeap Memory Max)	The maximum available TM non-heap memory.	byte
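The Used and Max values are easiest to interpret as a ratio. The following Python sketch computes it. The function name is hypothetical. The sketch returns None when Max is not positive, because a JVM can report -1 when no maximum is defined for a memory area, which is common for non-heap memory.

```python
from typing import Optional


def memory_usage_ratio(used_bytes: int, max_bytes: int) -> Optional[float]:
    """Return used memory as a fraction of the maximum, or None if undefined.

    Hypothetical helper: the inputs are the Used and Max values of the
    same memory area, both in bytes.
    """
    if max_bytes <= 0:
        return None  # the JVM reported no defined maximum
    return used_bytes / max_bytes
```

A heap ratio that stays close to 1 is worth comparing against the GC time and GC count metrics in the JVM section.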

JVM

Metric	Description	Unit
Total active JM threads (JM Threads)	The total number of active JM threads. An excessive thread count can consume too much memory and reduce job stability.	Count
Total active TM threads (TM Threads)	The total number of active TM threads, aggregated by TM. Each TM is represented by a separate line on the graph.	Count
JM young generation GC time (JM GC Time)	The runtime of the young generation garbage collector for the JM. Long GC pauses can consume significant memory and affect job performance. Use this metric to diagnose job-level failures.	ms
TM young generation GC time (TM GC Time)	The runtime of the young generation garbage collector for the TM. Long GC pauses can consume significant memory and affect job performance. Use this metric to diagnose job-level failures.	ms
JM young generation GC count (JM GC Count)	The number of times the young generation garbage collector ran for the JM. Frequent GC events can consume significant memory and degrade job performance. Use this metric to diagnose job-level failures.	Count
TM young generation GC count (TM GC Count)	The number of times the young generation garbage collector ran for the TM. Frequent GC events can consume significant memory and degrade job performance. Use this metric to diagnose task-level failures.	Count
TM classes loaded (TM ClassLoader)	The total number of classes loaded by the TM's JVM since startup. A high number of loaded or unloaded classes can consume excessive memory and degrade job performance.	Count
JM classes loaded (JM ClassLoader)	The total number of classes loaded by the JM's JVM since startup. A high number of loaded or unloaded classes can consume excessive memory and degrade job performance.	Count
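GC time and GC count can be combined into an average duration per collection. The following Python sketch shows the arithmetic. The function name is hypothetical, and the sketch assumes that both values cover the same interval.

```python
def average_gc_time_ms(gc_time_ms: float, gc_count: int) -> float:
    """Return the average young generation GC duration in milliseconds.

    Hypothetical helper: gc_time_ms and gc_count must describe the same
    component (JM or a single TM) over the same interval.
    """
    if gc_count == 0:
        return 0.0  # no collections ran in the interval
    return gc_time_ms / gc_count
```

A rising average points to longer individual collections, whereas a rising count with a stable average points to more frequent collections.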