Dataphin: View running analysis

Last Updated: Jun 23, 2026

For real-time instances running on the open-source Flink engine, Dataphin provides running analysis to help you analyze and refresh instance information and monitor key metrics such as failure count, backpressure status, data output per sink, and checkpoint failures.

Permissions

  • To view the running analysis of a project, you must have project space permissions.

  • To access the Apache Flink Dashboard, you must provide a username and password. The following roles can view the username and password hints: Super Administrator, System Administrator, task owner, and project O&M engineer.

Access running analysis

  1. In the top navigation bar of the Dataphin homepage, choose Development > Task O&M.

  2. In the left-side navigation pane, choose Instance O&M > Real-time Instance. On the Real-time Instance page, click the running analysis icon in the Actions column of the target instance.

View running analysis

The running analysis page displays the following metrics:

Feature

Description

① Time range selector

  • Quick options: You can select Last 10 minutes, Last 1 hour, Last 6 hours, Last 1 day, or Last 1 week. The default is Last 1 hour.

  • Custom time range: Specify a custom time range within the last 7 days.

Go to Apache Flink Dashboard and Refresh

  • Go to Apache Flink Dashboard: Available only for Flink engine versions 1.14 and later. Click Go to Apache Flink Dashboard to open the dashboard.

    Note

    To view the Apache Flink Dashboard, ensure that your VPC networks can communicate with each other.

  • Refresh: Manually refresh the displayed data.

Real-time monitoring metrics

For Flink SQL or Flink DataStream tasks, you can view metrics in the following categories: overview, checkpoint, IO, watermark, CPU, memory, and JVM. For details about each metric, see the Real-time monitoring metrics section.

Metric data statistics

Displays the data for each metric within the selected time range.

Metric aggregation rules

  • If the selected time range is 6 hours or less, every collected data point is displayed. Data points are collected once per minute.

  • If the selected time range is greater than 6 hours and less than or equal to 24 hours, data points are displayed every 5 minutes, starting from the top of the hour. Each data point aggregates data from the preceding 5 minutes.

  • If the selected time range is greater than 24 hours, data points are displayed every 10 minutes, starting from the top of the hour. Each data point aggregates data from the preceding 10 minutes.
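The aggregation rules above can be sketched as a small Python helper. This is an illustrative sketch, not Dataphin's implementation: the function names `aggregation_interval` and `bucket_end` are hypothetical, and assigning a sample that falls exactly on a boundary to that boundary's data point is an assumption the rules do not state. The sketch also does not model how values inside a window are combined, because the page does not specify it.

```python
from datetime import datetime, timedelta


def aggregation_interval(time_range: timedelta) -> timedelta:
    """Return the spacing between displayed data points for a selected time range."""
    if time_range <= timedelta(0):
        raise ValueError("time range must be positive")
    if time_range <= timedelta(hours=6):
        return timedelta(minutes=1)   # every per-minute point is shown
    if time_range <= timedelta(hours=24):
        return timedelta(minutes=5)   # one point per 5-minute window
    return timedelta(minutes=10)      # one point per 10-minute window


def bucket_end(sample: datetime, interval: timedelta) -> datetime:
    """Return the displayed data point that covers a sample.

    Windows are aligned to the top of the hour, and each point covers the
    preceding interval. Boundary handling is an assumption (see lead-in).
    """
    hour_start = sample.replace(minute=0, second=0, microsecond=0)
    offset = sample - hour_start
    steps = -(-offset // interval)    # ceiling division on timedeltas
    return hour_start + steps * interval
```

For example, with a 12-hour range the interval is 5 minutes, so a sample collected at 10:03 would be folded into the data point shown at 10:05.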

Real-time monitoring metrics

Overview

| Metric | Description | Unit |
| --- | --- | --- |
| Number of failures | The number of times the task failed within the selected time range. | Count |
| Backpressure status | Whether backpressure exists for the task within the selected time range. Backpressure occurs when an upstream task produces data faster than a downstream task can consume it. | Yes or No |
| Data output per sink | The data output rate for each sink, in transactions per second (TPS). | Count/s |
| Number of failed checkpoints | The number of failed checkpoint operations for the task within the selected time range. | Count |

Checkpoint

| Metric | Sub-metric | Description | Unit |
| --- | --- | --- | --- |
| Number of checkpoints (Num of Checkpoints) | Total number of checkpoints (totalNumberOfCheckpoints) | The total number of checkpoints for the task within the selected time range. | Count |
|  | Total number of failed checkpoints (numberOfFailedCheckpoints) | The number of failed checkpoints for the task within the selected time range. | Count |
|  | Total number of completed checkpoints (numberOfCompletedCheckpoints) | The number of completed checkpoints for the task within the selected time range. | Count |
|  | Total number of in-progress checkpoints (numberOfInProgressCheckpoints) | The number of in-progress checkpoints for the task within the selected time range. | Count |
| Last checkpoint duration (lastCheckpointDuration) | Last checkpoint duration (lastCheckpointDuration) | The duration of the last checkpoint for the task. A large state size, temporary network issues, barriers that are slow to align, or data backpressure can cause a long checkpoint duration or a timeout. | ms |
| Last checkpoint size (lastCheckpointSize) | Last checkpoint size (lastCheckpointSize) | The size of the last checkpoint for the task. Use this metric to analyze checkpoint performance when bottlenecks occur. | byte |
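When reading the checkpoint counters together, a failure ratio is often easier to reason about than the raw counts. The following Python sketch shows one way to derive it from values read off the page. The `CheckpointCounts` type and `failure_ratio` function are hypothetical names introduced for illustration, and excluding in-progress checkpoints from the denominator is an assumption of this sketch, not a rule that Dataphin documents.

```python
from dataclasses import dataclass


@dataclass
class CheckpointCounts:
    """Counter values as read from the Checkpoint metrics (hypothetical container)."""
    total: int        # totalNumberOfCheckpoints
    failed: int       # numberOfFailedCheckpoints
    completed: int    # numberOfCompletedCheckpoints
    in_progress: int  # numberOfInProgressCheckpoints


def failure_ratio(counts: CheckpointCounts) -> float:
    """Share of finished checkpoints that failed, in the range 0.0 to 1.0.

    In-progress checkpoints are ignored because their outcome is not known yet.
    """
    finished = counts.failed + counts.completed
    if finished == 0:
        return 0.0
    return counts.failed / finished


sample = CheckpointCounts(total=21, failed=2, completed=18, in_progress=1)
ratio = failure_ratio(sample)
```

A ratio that keeps rising across refreshes is a reasonable cue to look at the last checkpoint duration and size next.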

IO

| Metric group | Group description | Metric | Metric description | Unit |
| --- | --- | --- | --- | --- |
| Total input bytes per second (numBytesInPerSecond) | Use this metric group to monitor upstream data flow and job traffic performance. | Bytes read locally per second (numBytesInLocalPerSecond) | The number of bytes read locally per second. | byte/s |
|  |  | Bytes read remotely per second (numBytesInRemotePerSecond) | The number of bytes read remotely per second. | byte/s |
|  |  | Bytes from local network buffers per second (numBuffersInLocalPerSecond) | The number of bytes read from local network buffers per second. | byte/s |
|  |  | Bytes from remote network buffers per second (numBuffersInRemotePerSecond) | The number of bytes read from remote network buffers per second. | byte/s |
| Total output bytes per second (numBytesOutPerSecond) | Use this metric group to monitor output throughput and job traffic performance. | Output bytes per second (numBytesOutPerSecond) | The number of bytes output per second. | byte/s |
|  |  | Output buffer bytes per second (numBuffersOutPerSecond) | The number of bytes output from network buffers per second. | byte/s |
| Records I/O per second per subtask (Task numRecordsI/OPerSecond) | Use this metric group to identify IO bottlenecks and assess their severity based on the rate. | Records received per second (numRecordsInPerSecond) | The number of records received per second. | Count/s |
|  |  | Records sent per second (numRecordsOutPerSecond) | The number of records sent per second. | Count/s |
| Total records I/O per subtask (Task numRecordsI/O) | Use this metric group to identify IO bottlenecks in the job. | Total records received (numRecordsIn) | The total number of records received. | Count |
|  |  | Total records sent (numRecordsOut) | The total number of records sent. | Count |
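The per-second record metrics and the cumulative record totals describe the same flow from two angles. As a rough illustration of how they relate, the Python sketch below estimates an average rate from two readings of a cumulative counter such as numRecordsIn. The function name `records_per_second` is hypothetical, and Flink computes its own per-second values internally, so treat the result as an approximation for cross-checking rather than a reproduction of the displayed rate.

```python
from typing import Optional


def records_per_second(prev_total: int, curr_total: int,
                       elapsed_seconds: float) -> Optional[float]:
    """Average record rate between two readings of a cumulative counter.

    Returns None when the counter went backwards, which this sketch treats
    as a counter reset (for example, after a task restart).
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        return None
    return delta / elapsed_seconds


# Two readings of numRecordsIn taken 60 seconds apart.
rate = records_per_second(prev_total=1000, curr_total=1600, elapsed_seconds=60)
```

If the received rate of a subtask stays well above its sent rate while backpressure is reported, that subtask is a reasonable first place to look for the bottleneck.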

Watermark

| Metric | Description | Unit |
| --- | --- | --- |
| Task input watermark (Task InputWatermark) | The timestamp of the latest watermark received by each task. This metric indicates data reception latency at the TaskManager (TM). | ms |
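Because the watermark is reported as a timestamp, the latency it indicates is the gap between that timestamp and the current time. The Python sketch below shows that subtraction. The function name `watermark_lag_ms` is hypothetical, and the handling of a very large negative value as "no watermark received yet" is an assumption based on Flink initializing watermarks to the minimum 64-bit integer.

```python
from typing import Optional

# Assumed placeholder reported before any watermark arrives (Java Long.MIN_VALUE).
NO_WATERMARK = -(2 ** 63)


def watermark_lag_ms(watermark_ms: int, now_ms: int) -> Optional[int]:
    """Milliseconds by which the task input watermark trails the given clock time.

    Returns None when no watermark has been received yet.
    """
    if watermark_ms == NO_WATERMARK:
        return None
    return now_ms - watermark_ms


lag = watermark_lag_ms(watermark_ms=1_000_000, now_ms=1_004_500)
```

A lag that grows steadily over the selected time range suggests that event time is falling behind, which is worth checking against the backpressure status.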

CPU

| Metric | Description | Unit |
| --- | --- | --- |
| JM CPU load (JM CPU Load) | The CPU utilization of a single JobManager (JM). A value consistently above 100% indicates high CPU load, which may cause sluggishness or long response times. | % |
| TM CPU load (TM CPU Load) | The CPU utilization of a single TaskManager (TM), reflecting Flink's consumption of CPU time slices. For a single-core CPU, 100% indicates full utilization; for a four-core CPU, 400% indicates full utilization. A value consistently above 100% indicates high CPU load. If the load is high but CPU utilization is low, the cause may be too many processes in an uninterruptible sleep state because of frequent read/write operations. | % |
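Since full utilization scales with the number of cores (100% for one core, 400% for four), it can help to normalize the displayed load before comparing TaskManagers of different sizes. The Python sketch below does that division. The function name `cpu_saturation` is hypothetical, and the core count has to come from your own resource configuration because the page does not display it.

```python
def cpu_saturation(load_percent: float, cores: int) -> float:
    """Fraction of total CPU capacity in use, where 1.0 means every core is busy.

    load_percent is the displayed CPU load, where 100 equals one fully used core.
    """
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return load_percent / (100.0 * cores)


full = cpu_saturation(load_percent=400, cores=4)     # all four cores busy
partial = cpu_saturation(load_percent=100, cores=4)  # one of four cores busy
```

Read this way, a load of 100% on a four-core TaskManager uses a quarter of its capacity, while the same reading on a single-core TaskManager means it is fully utilized.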

Memory

| Metric | Sub-metric | Description | Unit |
| --- | --- | --- | --- |
| JM heap memory (JM Heap Memory) | Used (JM Heap Memory Used) | The amount of used JM heap memory. | byte |
|  | Committed (JM Heap Memory Committed) | The amount of committed JM heap memory. | byte |
|  | Max (JM Heap Memory Max) | The maximum available JM heap memory. | byte |
| JM non-heap memory (JM NonHeap Memory) | Used (JM NonHeap Memory Used) | The amount of used JM non-heap memory. | byte |
|  | Committed (JM NonHeap Memory Committed) | The amount of committed JM non-heap memory. | byte |
|  | Max (JM NonHeap Memory Max) | The maximum available JM non-heap memory. | byte |
| TM heap memory (TM Heap Memory) | Used (TM Heap Memory Used) | The amount of used TM heap memory. | byte |
|  | Committed (TM Heap Memory Committed) | The amount of committed TM heap memory. | byte |
|  | Max (TM Heap Memory Max) | The maximum available TM heap memory. | byte |
| TM non-heap memory (TM NonHeap Memory) | Used (TM NonHeap Memory Used) | The amount of used TM non-heap memory. | byte |
|  | Committed (TM NonHeap Memory Committed) | The amount of committed TM non-heap memory. | byte |
|  | Max (TM NonHeap Memory Max) | The maximum available TM non-heap memory. | byte |
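The Used, Committed, and Max values are most informative when read as a ratio. The Python sketch below computes how full a memory area is from those three readings. The function name `memory_usage_ratio` is hypothetical, and falling back to the committed size when the maximum is not a positive number is an assumption of this sketch (a JVM can report an undefined maximum as -1, which is common for non-heap memory).

```python
def memory_usage_ratio(used: int, committed: int, maximum: int) -> float:
    """Fraction of a memory area in use, based on byte values read from the page.

    Uses the maximum as the limit when it is defined; otherwise falls back
    to the committed size (assumption, see lead-in).
    """
    limit = maximum if maximum > 0 else committed
    if limit <= 0:
        raise ValueError("no usable memory limit available")
    return used / limit


heap_ratio = memory_usage_ratio(used=512, committed=1024, maximum=2048)
non_heap_ratio = memory_usage_ratio(used=512, committed=1024, maximum=-1)
```

A heap ratio that stays close to 1.0 over the selected time range is worth comparing against the GC time and GC count metrics in the JVM category.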

JVM

| Metric | Description | Unit |
| --- | --- | --- |
| Total active JM threads (JM Threads) | The total number of active JM threads. An excessive thread count can consume too much memory and reduce job stability. | Count |
| Total active TM threads (TM Threads) | The total number of active TM threads, aggregated by TM. Each TM is represented by a separate line on the graph. | Count |
| JM young generation GC time (JM GC Time) | The runtime of the young generation garbage collector for the JM. Long GC pauses can consume significant memory and affect job performance. Use this metric to diagnose job-level failures. | ms |
| TM young generation GC time (TM GC Time) | The runtime of the young generation garbage collector for the TM. Long GC pauses can consume significant memory and affect job performance. Use this metric to diagnose job-level failures. | ms |
| JM young generation GC count (JM GC Count) | The number of times the young generation garbage collector ran for the JM. Frequent GC events can consume significant memory and degrade job performance. Use this metric to diagnose job-level failures. | Count |
| TM young generation GC count (TM GC Count) | The number of times the young generation garbage collector ran for the TM. Frequent GC events can consume significant memory and degrade job performance. Use this metric to diagnose task-level failures. | Count |
| TM classes loaded (TM ClassLoader) | The total number of classes loaded by the TM's JVM since startup. A high number of loaded or unloaded classes can consume excessive memory and degrade job performance. | Count |
| JM classes loaded (JM ClassLoader) | The total number of classes loaded by the JM's JVM since startup. A high number of loaded or unloaded classes can consume excessive memory and degrade job performance. | Count |