Alibaba Cloud Realtime Compute provides an overview page of the core metrics of the current job. You can use the curve charts to quickly diagnose the running status of a job. In the future, Realtime Compute will provide more in-depth intelligent analysis algorithms based on the job status to help you perform intelligent and automated diagnosis.

The following figure shows the curve chart of a metric.
Note
  • The metrics are displayed only when the job is in the running state. When the job is in the suspended or stopped state, the metrics are not displayed.
  • The metrics are asynchronously collected at the backend. Therefore, data latency exists. The metrics are collected and displayed on the curve charts only after the job has been running for more than 1 minute.

Go to the Curve Charts page

  1. Go to the Job Administration page.
    1. Log on to the Realtime Compute Console.
    2. In the top navigation bar, click Administration.
    3. In the Jobs section, click the target job name under the Job Name field.
  2. At the top of the Job Administration page, click Curve Charts.

Overview

  • Failover

    The Failover curve chart displays the frequency of failovers caused by errors or exceptions in the current job. The failover rate is the failover count accumulated within the minute before the current failover point in time, divided by 60. For example, if one failover occurred in the last minute, the failover rate is 1/60 ≈ 0.01667. For a worked example of this calculation and of the latency metrics below, see the sketch at the end of this section.

  • Delay

    To help you better understand the full-link timeliness and job performance, Realtime Compute provides the following three latency metrics:

    • Processing Delay: Processing delay = Current system time - Event time of the last data record processed by the system. If no more data enters the input storage systems, the processing delay gradually increases as the system time moves forward.
    • Data Pending Time: Data pending time = Time when data enters Flink - Event time. Even if no more data enters input storage systems, the data pending time does not increase. Generally, the data pending time is used to assess whether the current Realtime Compute job has back pressure.
    • Data Arrival Interval: Data arrival interval = Processing delay - Data pending time. When the Realtime Compute job has no back pressure (that is, the data pending time is small and stable), the data arrival interval can reflect the degree of data sparsity between data sources. When the Realtime Compute job has back pressure (that is, the data pending time is large or unstable), this metric has no reference value.
    Note
    • Realtime Compute uses a distributed computing framework. The preceding latency metrics are calculated for each shard or partition of the source, and the maximum value among all partitions is reported to the front-end page. Therefore, the data arrival interval displayed on the front-end page is not exactly the value you would obtain by applying the formula to the displayed processing delay and data pending time.
    • If no more data enters a shard or partition of the source, the processing delay increases gradually.
    • In the current implementation of the underlying algorithm, the data arrival interval is reported as 0 if it is less than 10 seconds.
  • Input TPS of Each Source

    This chart reflects the statistics on all stream data input of a Realtime Compute job. It records the number of blocks read from the source table per second, which helps you understand the Transactions Per Second (TPS) of the data storage systems. Unlike TPS, Records Per Second (RPS) measures the number of records parsed from those blocks per second. Take Log Service as an example: if five LogGroups are read per second, the TPS is 5; if eight log records are parsed from each LogGroup, a total of 40 log records are parsed per second, and the RPS is 40.

  • Data Output of Each Sink

    This chart reflects the statistics on all data output of a Realtime Compute job. This helps you understand the RPS of data storage systems. Generally, if you cannot detect data output during system O&M, you must check both the input and output storage systems.

  • Input RPS of Each Source

    This chart reflects the statistics on all stream data input of a Realtime Compute job. This helps you understand the RPS of data storage systems. Generally, if you cannot detect data output during system O&M, you must check the RPS to determine whether the input from the data source is normal.

  • Input BPS of Each Source: This chart reflects the statistics on all stream data input of a Realtime Compute job. It records the number of bytes read from the input source table per second, which helps you understand the traffic in Bytes Per Second (BPS).
  • Dirty Data from Each Source: This chart reflects the number of dirty data records in the source of a Realtime Compute job in various time periods.
  • Auto Scaling Successes and Failures
    Note This curve chart is only applicable to Realtime Compute V3.0.0 or later.
    This chart reflects the respective number of successful and failed executions of auto scaling.
  • CPUs Consumed By Auto Scaling
    Note This curve chart is only applicable to Realtime Compute V3.0.0 or later.
    This chart reflects the amount of CPU consumed when auto scaling is executed.
  • Memory Consumed By Auto Scaling
    Note This curve chart is only applicable to Realtime Compute V3.0.0 or later.
    This chart reflects the amount of memory consumed when auto scaling is executed.
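
The failover rate, the three latency metrics, and the TPS/RPS relationship described above are simple functions of counters and timestamps. The following sketch shows how they could be computed for a single source partition. It is an illustrative example only (the function and variable names are hypothetical), not Realtime Compute's internal implementation; in the product, the per-partition values are aggregated (the maximum is reported) before they are displayed.

  import time

  def failover_rate(failovers_in_last_minute):
      # Failover rate = failover count accumulated in the last minute / 60.
      # One failover in the last minute gives 1 / 60 = 0.01667.
      return failovers_in_last_minute / 60.0

  def delay_metrics(event_time, ingest_time, now=None):
      # event_time:  event time of the last record processed (seconds since epoch)
      # ingest_time: time at which that record entered Flink (seconds since epoch)
      now = time.time() if now is None else now
      processing_delay = now - event_time                          # Processing Delay
      data_pending_time = ingest_time - event_time                 # Data Pending Time
      data_arrival_interval = processing_delay - data_pending_time # Data Arrival Interval
      return processing_delay, data_pending_time, data_arrival_interval

  def rps_from_tps(blocks_per_second, records_per_block):
      # For block-based sources such as Log Service, TPS counts blocks
      # (LogGroups) read per second and RPS counts the records parsed
      # from those blocks per second.
      return blocks_per_second * records_per_block

  # Example: the last record carries an event time 30 s in the past and
  # entered Flink 5 s ago; one failover occurred in the last minute;
  # 5 LogGroups are read per second with 8 log records each.
  now = time.time()
  print(failover_rate(1))                       # 1/60, about 0.01667
  print(delay_metrics(now - 30, now - 5, now))  # about (30.0, 25.0, 5.0)
  print(rps_from_tps(5, 8))                     # 40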

Advanced view

Alibaba Cloud Realtime Compute provides a fault tolerance mechanism that allows you to restore data streams and ensure that they remain consistent with the application. At the core of this mechanism is the creation of consistent snapshots of the distributed data streams and their states. These snapshots act as consistency checkpoints to which the system can fall back when a failure occurs.

One of the core concepts of distributed snapshots is the barrier. Barriers are inserted into data streams and flow downstream together with the records. A barrier never overtakes records, and the records flow strictly in line. A barrier divides a data stream into two parts: one part enters the current snapshot and the other enters the next snapshot. Each barrier carries a snapshot ID. Data that flows before a barrier is included in the snapshot that corresponds to that barrier. Barriers are lightweight and do not interfere with the processing of data streams. Multiple barriers from different snapshots can exist in the same data stream at the same time, which means that multiple snapshots can be created concurrently.
Barriers are injected into data streams at the source. The point where the barrier for snapshot n is injected is the position in the source stream up to which the snapshot covers the data. This point is denoted Sn. The barriers then flow downstream. When an intermediate operator has received the barrier for snapshot n from all of its input streams, it emits the barrier for snapshot n to all of its output streams. When a sink operator (the destination of the DAG stream) has received the barrier for snapshot n from all of its input streams, it acknowledges to the checkpoint coordinator that snapshot n has been created. After all sink operators report that snapshot n has been created, the snapshot is considered created.
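
The barrier-alignment logic can be illustrated with a minimal sketch. The following example is conceptual only (the class and method names are hypothetical); it is not Realtime Compute's or Apache Flink's actual checkpointing code.

  class AlignmentTracker:
      # Conceptual sketch: an operator forwards the barrier for snapshot n
      # only after it has received that barrier from every input channel.
      # The time spent waiting for the slowest channel is what the
      # Checkpoint Alignment Time chart reflects.

      def __init__(self, num_input_channels):
          self.num_input_channels = num_input_channels
          self.seen = {}  # snapshot_id -> set of channels that delivered its barrier

      def on_barrier(self, snapshot_id, channel):
          channels = self.seen.setdefault(snapshot_id, set())
          channels.add(channel)
          if len(channels) == self.num_input_channels:
              # All inputs are aligned: the operator can snapshot its state and
              # emit the barrier for snapshot_id to all of its output streams.
              del self.seen[snapshot_id]
              return True
          return False  # still waiting for barriers from other input channels

  # Example: an operator with three input channels aligning snapshot 7.
  tracker = AlignmentTracker(num_input_channels=3)
  print(tracker.on_barrier(7, channel=0))  # False
  print(tracker.on_barrier(7, channel=1))  # False
  print(tracker.on_barrier(7, channel=2))  # True: snapshot 7 is aligned here

The following curve charts are provided for checkpoint metrics.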
  • Checkpoint Duration: This chart displays the time consumed for creating a checkpoint, in milliseconds.
  • Checkpoint Size: This chart displays the memory size required for creating a checkpoint.
  • Checkpoint Alignment Time: This chart displays the time required for all data streams to flow from the input nodes to the node on which a checkpoint is created. As described above, a snapshot is considered created only after every sink operator has received the barrier for that snapshot from all of its input streams and has reported this to the checkpoint coordinator. This duration is known as the checkpoint alignment time.
  • Checkpoint Count: This chart displays the number of checkpoints within a specified time period.
  • Get: This chart displays the longest time that a subtask spends on a GET operation on RocksDB within a specified time period.
  • Put: This chart displays the longest time that a subtask spends on a PUT operation on RocksDB within a specified time period.
  • Seek: This chart displays the longest time that a subtask spends on a SEEK operation on RocksDB within a specified time period.
  • State Size: This chart displays the state size of the job. If the size increases excessively fast, we recommend that you check for and resolve potential issues.
  • GMS GC Time: This chart displays the time that the underlying container of the job spends on garbage collection.
  • GMS GC Rate: This chart displays the frequency at which the underlying container of the job performs garbage collection.

WaterMark

  • WaterMark Delay: This chart displays the difference between the watermark time and the system time.
  • Dropped Records per Second: A data record is dropped if it arrives at the window after the watermark time. This chart displays the number of data records dropped per second.
  • Dropped Records: A data record is dropped if it arrives at the window after the watermark time. This chart displays the total number of dropped data records.
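
As a simplified illustration of the watermark delay and the drop rule above (the function names are hypothetical and this is not the engine's actual windowing logic):

  def watermark_delay(system_time, watermark):
      # WaterMark Delay = system time - watermark time.
      return system_time - watermark

  def is_dropped(event_time, watermark):
      # A record whose event time is already behind the current watermark
      # arrives at its window after the watermark time and is dropped; the
      # Dropped Records charts count such records.
      return event_time < watermark

  print(watermark_delay(system_time=1_000_060, watermark=1_000_000))  # 60
  print(is_dropped(event_time=999_990, watermark=1_000_000))          # True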

Delay

The Top 15 Source Subtasks with the Longest Processing Delay chart displays the processing latency of each source subtask.

Throughput

  • Task Input TPS: This chart displays the data input status of all tasks in the job.
  • Task Output TPS: This chart displays the data output status of all tasks in the job.

Queue

  • Input Queue Usage: This chart displays the usage of the data input queue of all tasks in the job.
  • Output Queue Usage: This chart displays the usage of the data output queue of all tasks in the job.

Tracing

  • Time Used In Processing Per Second: This chart displays the time that a task spends on processing data per second.
  • Time Used In Waiting Output Per Second: This chart displays the time that a task spends on waiting for the output per second.
  • Task Latency Histogram Mean: This chart displays the latency of each task.
  • Wait Output Histogram Mean: This chart displays the time that each task spends on waiting for the output.
  • Wait Input Histogram Mean: This chart displays the time that each task spends on waiting for the input.
  • Partition Latency Mean: This chart displays the latency of concurrent tasks in each partition.

Process

  • Process Memory RSS: This chart displays the memory usage of each process.
  • CPU Usage: This chart displays the CPU usage of each process.

JVM

  • Memory Heap Used: This chart displays the Java Virtual Machine (JVM) heap memory usage of the job.
  • Memory Non-Heap Used: This chart displays the JVM non-heap memory usage of the job.
  • Thread Count: This chart displays the number of threads for the job.
  • GC (GMS): This chart displays the number of times that the job completes garbage collection.