Realtime Compute shows the core metrics of your job on the Curve Charts tab to help you diagnose the running status of the job. In the future, Realtime Compute will provide more in-depth analysis algorithms for intelligent, automated diagnosis based on job states.

The following figure shows the curve chart of a metric.
Note
  • The metrics are displayed only when a job is in the running state. If the job is suspended or stopped, its metrics are not displayed.
  • The metrics are collected and displayed on the Curve Charts tab only after the job has been running for more than one minute. As a result, the data presented in the curve charts is slightly delayed.

Go to the Curve Charts page

  1. Go to the Job Administration page.
    1. Log on to the Realtime Compute development platform.
    2. In the top navigation bar, click Administration.
    3. On the Jobs page that appears, click the name of the target job in the Job Name column.
  2. At the top of the Job Administration page, click the Curve Charts tab.

Overview

  • Failover

    This chart displays the frequency of failovers caused by errors or exceptions in the current job. To calculate the failover rate, divide the number of failovers that occurred within the preceding minute by 60. For example, if one failover occurred in the last minute, the failover rate is 0.01667 (1/60 ≈ 0.01667). The rate-arithmetic sketch after this list reproduces this calculation.

  • Delay
    To help you better understand full-link timeliness and job performance, Realtime Compute provides the following three latency metrics (a sketch that illustrates these formulas follows this list):
    • Processing Delay: Processing delay = Current system time - Event time of the last processed data record. If no more data enters the upstream storage systems, the processing delay gradually increases as the system time moves forward.
    • Data Pending Time: Data pending time = Time when the data enters Realtime Compute - Event time. If no more data enters the upstream storage systems, the data pending time does not increase. This metric is used to assess whether the current Realtime Compute job has backpressure.
    • Data Arrival Interval: Data arrival interval = Processing delay - Data pending time. If the job has no backpressure, the data pending time is small and stable, and this metric reflects how sparse the data in the data source is. If the job has backpressure, the data pending time is large or unstable, and this metric is not a reliable reference.
    Note
    • Realtime Compute uses a distributed computing framework. The preceding latency metrics obtain values for each shard or partition of the data source and then report the maximum values among all partitions to the Realtime Compute development platform. Therefore, the aggregated data arrival interval displayed on the development platform is not exactly the same as that obtained based on the formula: Data arrival interval = Processing delay - Data pending time.
    • If no more data enters a shard or partition of the data source, the processing delay increases gradually.
  • Input TPS of Each Source

    This chart reflects statistics on all streaming data input of a Realtime Compute job. It records the number of blocks read from the source table per second, which helps you understand the transactions per second (TPS) of a data storage system. In contrast, records per second (RPS) indicates the number of records read from the source table per second; these records are resolved from the blocks. Take Log Service as an example: if five log groups are read per second, the TPS is 5, and if eight log records are resolved from each log group, a total of 40 log records are resolved, so the RPS is 40. The rate-arithmetic sketch after this list reproduces this example.

  • Data Output of Each Sink

    This chart reflects statistics on all data output (not just streaming data output) of a Realtime Compute job. It helps you understand the RPS of a data storage system. If no data output is detected during system O&M, you must check both the input of the upstream storage system and the output of the downstream storage system.

  • Input RPS of Each Source

    This chart reflects statistics on all streaming data input of a Realtime Compute job. It helps you understand the RPS of a data storage system. If no data output is detected during system O&M, you must check the RPS to determine whether the input from the data source is normal.

  • Input BPS of Each Source

    This chart reflects statistics on all streaming data input of a Realtime Compute job. It records the traffic used to read the input source table per second. This helps you understand the bytes per second (BPS) of the traffic.

  • Dirty Data from Each Source

    This chart reflects the number of dirty data records in the data source of a Realtime Compute job in different time periods.

  • Auto Scaling Successes and Failures
    This chart reflects the numbers of times auto scaling succeeded and failed.
    Note This curve chart is only supported in Realtime Compute V3.0.0 and later.
  • CPUs Consumed By Auto Scaling
    This chart reflects the amount of CPU consumed when auto scaling is executed.
    Note This curve chart is only supported in Realtime Compute V3.0.0 and later.
  • Memory Consumed By Auto Scaling
    This chart reflects the amount of memory consumed when auto scaling is executed.
    Note This curve chart is only supported in Realtime Compute V3.0.0 and later.
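
The relationships among the three latency metrics, and the way each metric is reported as the maximum value across all shards or partitions, can be summarized in a few lines of code. The following Java sketch is purely illustrative; the class, field, and method names are hypothetical and are not part of the Realtime Compute API.

    import java.util.List;

    // Hypothetical sketch of the three latency metrics defined in the Delay chart.
    public class DelayMetricsSketch {

        // Latency sample reported for one shard or partition of the data source.
        static class PartitionSample {
            final long eventTime;   // event time of the last processed record, in milliseconds
            final long ingestTime;  // time at which that record entered Realtime Compute, in milliseconds

            PartitionSample(long eventTime, long ingestTime) {
                this.eventTime = eventTime;
                this.ingestTime = ingestTime;
            }

            // Processing delay = current system time - event time of the last processed record.
            long processingDelay(long now) { return now - eventTime; }

            // Data pending time = time the data entered Realtime Compute - event time.
            long dataPendingTime() { return ingestTime - eventTime; }

            // Data arrival interval = processing delay - data pending time.
            long dataArrivalInterval(long now) { return processingDelay(now) - dataPendingTime(); }
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            List<PartitionSample> partitions = List.of(
                    new PartitionSample(now - 30_000, now - 5_000),   // partition 0
                    new PartitionSample(now - 120_000, now - 90_000)  // partition 1
            );

            // Each metric is reported as the maximum value among all partitions, which is why the
            // displayed Data Arrival Interval may not exactly satisfy the formula above.
            long processingDelay = partitions.stream().mapToLong(p -> p.processingDelay(now)).max().orElse(0);
            long pendingTime     = partitions.stream().mapToLong(PartitionSample::dataPendingTime).max().orElse(0);
            long arrivalInterval = partitions.stream().mapToLong(p -> p.dataArrivalInterval(now)).max().orElse(0);

            System.out.printf("Processing Delay: %d ms, Data Pending Time: %d ms, Data Arrival Interval: %d ms%n",
                    processingDelay, pendingTime, arrivalInterval);
        }
    }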
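
The failover rate and the TPS-to-RPS relationship described in this list are simple arithmetic. The following Java sketch reproduces the worked examples above (one failover in the last minute; five Log Service log groups per second with eight records each); the numbers and names are illustrative only.

    // Hypothetical sketch that reproduces the failover-rate and TPS/RPS examples above.
    public class RateArithmeticSketch {
        public static void main(String[] args) {
            // Failover rate = failovers in the preceding minute / 60.
            int failoversInLastMinute = 1;
            double failoverRate = failoversInLastMinute / 60.0;
            System.out.printf("Failover rate: %.5f%n", failoverRate);  // prints 0.01667

            // TPS counts the blocks (for example, Log Service log groups) read per second;
            // RPS counts the records resolved from those blocks.
            int logGroupsPerSecond = 5;   // TPS
            int recordsPerLogGroup = 8;
            int recordsPerSecond = logGroupsPerSecond * recordsPerLogGroup;
            System.out.println("TPS: " + logGroupsPerSecond + ", RPS: " + recordsPerSecond);  // TPS: 5, RPS: 40
        }
    }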

Advanced view

Alibaba Cloud Realtime Compute provides a fault tolerance mechanism that allows you to restore data streams and consistently recover the state of streaming applications. This mechanism creates consistent snapshots of the distributed data streams and their states. These snapshots act as consistency checkpoints to which the system can fall back if a failure occurs.

One of the core concepts of distributed snapshots is the barrier. Barriers are inserted into the data streams and flow downstream with the records. Barriers do not overtake records; the records flow strictly in line. A barrier divides a data stream into two parts: one enters the current snapshot and the other enters the next snapshot. Each barrier carries a snapshot ID, and the data that flows before a barrier is included in that snapshot. Barriers are lightweight and do not interfere with the processing of the data streams. Multiple barriers from different snapshots can co-exist in the same data stream, which allows multiple snapshots to be created concurrently.
Barriers are injected into the data streams at the source. The point at which the barrier for snapshot n is injected, denoted Sn, is the position in the source stream up to which the snapshot covers the data. The barriers then flow downstream. When an intermediate operator has received the barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n to all of its output streams. When a sink operator (the destination of the DAG stream) has received barrier n from all of its input streams, it acknowledges to the checkpoint coordinator that snapshot n is created. After all sink operators have acknowledged snapshot n, the snapshot is considered completed.
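
The barrier alignment just described can be sketched in a few lines of code. This is a conceptual illustration only, not the actual implementation used by Realtime Compute or Apache Flink; the class and method names are hypothetical.

    import java.util.HashSet;
    import java.util.Set;

    // Conceptual sketch: an operator forwards the barrier for snapshot n only after it has
    // received that barrier on every input channel, then snapshots its own state.
    public class BarrierAlignerSketch {
        private final int numInputChannels;
        private final Set<Integer> channelsWithBarrier = new HashSet<>();
        private long currentSnapshotId = -1;

        public BarrierAlignerSketch(int numInputChannels) {
            this.numInputChannels = numInputChannels;
        }

        // Called when the barrier for snapshotId arrives on the given input channel.
        public void onBarrier(long snapshotId, int channel) {
            if (snapshotId != currentSnapshotId) {
                currentSnapshotId = snapshotId;
                channelsWithBarrier.clear();
            }
            channelsWithBarrier.add(channel);
            if (channelsWithBarrier.size() == numInputChannels) {
                // All input streams have delivered the barrier: snapshot the operator state,
                // emit the barrier to all output streams, and (for a sink operator) acknowledge
                // the snapshot to the checkpoint coordinator.
                System.out.println("Snapshot " + snapshotId + " aligned; barrier forwarded downstream.");
                channelsWithBarrier.clear();
            }
        }

        public static void main(String[] args) {
            BarrierAlignerSketch operator = new BarrierAlignerSketch(2);
            operator.onBarrier(7, 0);  // barrier for snapshot 7 arrives on channel 0: keep waiting
            operator.onBarrier(7, 1);  // barrier arrives on channel 1: alignment is complete
        }
    }

The following curve charts are provided for checkpoint metrics: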
  • Checkpoint Duration: Displays the time consumed to create a checkpoint. Unit: milliseconds.
  • Checkpoint Size: Displays the memory size required to create a checkpoint.
  • Checkpoint Alignment Time: Displays the duration required for all data streams to flow from the upstream nodes to the node on which a checkpoint is created. This duration is known as the checkpoint alignment time.
  • Checkpoint Count: Displays the number of checkpoints within a specific period of time.
  • Get: Displays the longest time that a subtask spends performing a GET operation on RocksDB within a specific period of time.
  • Put: Displays the longest time that a subtask spends performing a PUT operation on RocksDB within a specific period of time.
  • Seek: Displays the longest time that a subtask spends performing a SEEK operation on RocksDB within a specific period of time.
  • State Size: Displays the state size of the job within a specific period of time. If the size increases too fast, we recommend that you check the job for potential issues.
  • CMS GC Time: Displays the time that the underlying container of the job spends on garbage collection.
  • CMS GC Rate: Displays the frequency at which the underlying container of the job performs garbage collection.

WaterMark

  • WaterMark Delay: Displays the difference between the watermark time and the system time.
  • Dropped Records per Second: Displays the number of data records dropped per second. If a data record arrives at the window after the watermark time, it is dropped.
  • Dropped Records: Displays the total number of dropped data records. If a data record arrives at the window after the watermark time, it is dropped.
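
The WaterMark Delay formula and the check behind the dropped-record counts can be sketched as follows. This Java sketch uses assumed values, and the variable names are hypothetical.

    // Hypothetical sketch of the WaterMark Delay formula and the late-record check.
    public class WatermarkSketch {
        public static void main(String[] args) {
            long systemTime = System.currentTimeMillis();
            long watermark = systemTime - 15_000;  // assumed current watermark (event time), 15 seconds behind

            // WaterMark Delay = system time - watermark time.
            long watermarkDelay = systemTime - watermark;
            System.out.println("WaterMark Delay: " + watermarkDelay + " ms");

            // A record whose event time is not later than the current watermark arrives at its
            // window after the watermark has passed, so it is counted as a dropped record.
            long recordEventTime = watermark - 1_000;
            boolean dropped = recordEventTime <= watermark;
            System.out.println("Dropped: " + dropped);
        }
    }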

Delay

  • Top 15 Source Subtasks with the Longest Processing Delay: Displays the processing delay of each source subtask.

Throughput

  • Task Input TPS: Displays the data input status of all tasks in a job.
  • Task Output TPS: Displays the data output status of all tasks in a job.

Queue

  • Input Queue Usage: Displays the input queue usage of all tasks in a job.
  • Output Queue Usage: Displays the output queue usage of all tasks in a job.

Tracing

  • Time Used In Processing Per Second: Displays the time that a task spends processing data per second.
  • Time Used In Waiting Output Per Second: Displays the time that a task spends waiting for output per second.
  • Task Latency Histogram Mean: Displays the average latency of each task.
  • Wait Output Histogram Mean: Displays the average time that each task spends waiting for output.
  • Wait Input Histogram Mean: Displays the average time that each task spends waiting for input.
  • Partition Latency Mean: Displays the average latency of concurrent tasks in each partition.

Process

  • Process Memory RSS: Displays the memory usage of each process.
  • CPU Usage: Displays the CPU utilization of each process.

JVM

  • Memory Heap Used: Displays the Java Virtual Machine (JVM) heap memory usage of a job.
  • Memory Non-Heap Used: Displays the JVM non-heap memory usage of a job.
  • Thread Count: Displays the number of threads in a job.
  • GC (CMS): Displays the number of times that a job completes garbage collection.