Backpressure is an important concept in stream processing. If a downstream operator or storage system cannot process data as fast as it arrives, Realtime Compute signals upstream to stop sending data so that no data is lost. This condition is called backpressure. This topic describes typical backpressure scenarios and how to optimize a job in each scenario.
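
The general idea of backpressure can be illustrated with a bounded buffer between a fast producer and a slow consumer. The following Python sketch is a simplified model, not Realtime Compute's implementation: `queue.Queue` with a `maxsize` stands in for a network buffer, and `run_pipeline`, `BUFFER_SIZE`, and `consumer_delay` are hypothetical names chosen for illustration. When the buffer is full, `put()` blocks, so the producer is throttled to the consumer's pace instead of dropping records.

```python
import queue
import threading
import time
from typing import List, Tuple

BUFFER_SIZE = 4  # hypothetical capacity of the buffer between the two stages


def run_pipeline(num_records: int, consumer_delay: float) -> Tuple[List[int], float]:
    """Push records through a bounded buffer to a deliberately slow consumer.

    Returns the records the consumer received and the total time the producer
    spent blocked on a full buffer.
    """
    buffer = queue.Queue(maxsize=BUFFER_SIZE)
    received: List[int] = []

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is None:  # sentinel: the producer is done
                break
            time.sleep(consumer_delay)  # simulate a slow downstream sink
            received.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    blocked = 0.0
    for record in range(num_records):
        start = time.monotonic()
        buffer.put(record)  # blocks while the buffer is full: backpressure
        blocked += time.monotonic() - start
    buffer.put(None)
    worker.join()
    return received, blocked


records, waited = run_pipeline(num_records=20, consumer_delay=0.01)
print(len(records), records == list(range(20)))  # 20 True
print(f"producer blocked for {waited:.2f}s")
```

Every record arrives in order; the cost of a slow consumer shows up as time the producer spends waiting, which is the latency symptom that backpressure produces in a real job.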

Backpressure detection mechanism

A job backpressure detection mechanism is provided in Realtime Compute versions later than V3.0.0. With this mechanism, Realtime Compute detects congestion in the output network buffer of a vertex to determine whether the vertex has backpressure. A vertex is a group of operators that are chained together. To check a job for backpressure, perform the following steps:
  1. Open the target job.
    1. Log on to the Realtime Compute development platform.
    2. In the top navigation bar, click Administration.
    3. On the Jobs page that appears, click the name of the target job in the Job Name column.
  2. In the left-side navigation pane, click the running job that you want to check for backpressure. In the Vertex Topology section of the Overview tab that appears, click the blue border of the vertex that you want to check.
  3. In the right-side pane, click the BackPressure tab and view the backpressure status in the Status column.
    • If high with a red indicator is displayed, the vertex has backpressure.
    • If ok with a green indicator is displayed, the vertex does not have backpressure.
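
The detection idea described above can be sketched as follows. This is a simplified model under stated assumptions, not Realtime Compute's actual code: the probe callback `is_buffer_congested`, the sample count, and the `HIGH_RATIO` threshold of 0.5 are hypothetical (the threshold is borrowed from the boundary above which Apache Flink's web UI reports HIGH). The sketch also models why a vertex without an output network buffer cannot be classified.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold: report HIGH when more than half of the samples found
# the output buffer congested. The value Realtime Compute uses is an assumption.
HIGH_RATIO = 0.5


@dataclass
class Vertex:
    name: str
    # False for the last vertex, which writes directly to downstream storage.
    has_output_buffer: bool


def backpressure_status(
    vertex: Vertex,
    is_buffer_congested: Callable[[], bool],
    samples: int = 100,
) -> Optional[str]:
    """Classify a vertex as "HIGH" or "OK" by sampling its output buffer.

    Returns None when the vertex has no output network buffer, because
    congestion cannot be observed there.
    """
    if not vertex.has_output_buffer:
        return None
    congested = sum(1 for _ in range(samples) if is_buffer_congested())
    ratio = congested / samples
    return "HIGH" if ratio > HIGH_RATIO else "OK"


# Usage with stubbed probes (a real probe would inspect the buffer state).
source = Vertex("Vertex 0", has_output_buffer=True)
sink = Vertex("Vertex 1", has_output_buffer=False)
print(backpressure_status(source, lambda: True))   # HIGH
print(backpressure_status(source, lambda: False))  # OK
print(backpressure_status(sink, lambda: True))     # None
```

Sampling a ratio rather than a single reading keeps a brief, transient buffer stall from being reported as sustained backpressure.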

Backpressure scenarios and optimization ideas

Note: In the following vertex topology diagrams, a green vertex indicates that no backpressure is detected, and a red vertex indicates that backpressure is detected.
  • Scenario 1: Only one vertex exists and no backpressure is detected.

    Due to how Flink works, the output of the last vertex has no network buffer, and data is written directly to downstream storage systems. As a result, backpressure cannot be detected on the last vertex, or on the only vertex of a single-vertex job. Therefore, a green vertex in this diagram does not mean that the job is free of backpressure. To determine whether and where backpressure exists, you must split the operators in Vertex 0. For more information about how to split operators, see Resource parameters.

  • Scenario 2: Multiple vertices exist and backpressure is detected on the second-to-last vertex.
    This vertex topology diagram shows that Vertex 1 has backpressure, which indicates that Vertex 2 has a performance bottleneck. Check the names of the operators in Vertex 2 to determine which actions to take.
    • If Vertex 2 only writes data to downstream storage systems, the backpressure is likely caused by slow writes. We recommend that you increase the parallelism of Vertex 2 or set the batchsize parameter for the result table. For more information, see Upstream and downstream data storage parameters.
    • If Vertex 2 performs operations other than writing data to downstream storage systems, you must split the operators for those operations to locate the bottleneck. For more information about how to split operators, see Resource parameters.
  • Scenario 3: Multiple vertices exist and backpressure is detected on a vertex other than the second-to-last vertex.
    This vertex topology diagram shows that Vertex 0 has backpressure, which indicates that Vertex 1 has a performance bottleneck. Check the names of the operators in Vertex 1 to determine which actions to take. Common operations in this scenario and the corresponding optimization methods are as follows:
    • GROUP BY operation: You can increase the parallelism or set the miniBatch parameter to optimize the state operation. For more information, see Job parameters.
    • JOIN operation with a dimension table: You can increase the parallelism or configure a cache policy for the dimension table. For more information, see the documentation for the corresponding dimension table type.
    • User-defined extension (UDX) operation: You can increase the parallelism or optimize the related UDX code.
  • Scenario 4: Multiple vertices exist and none of them has backpressure.
    This vertex topology diagram shows that Vertex 0 may have a performance bottleneck. Check the names of the operators in Vertex 0 to determine which actions to take.
    • If Vertex 0 only reads data from the source table, the high latency is caused by slow reads from the source rather than by a performance bottleneck in Realtime Compute. In this case, you can increase the parallelism of the source operator or set the batchsize parameter for reading source data. For more information, see Upstream and downstream data storage parameters.
      Note: The parallelism of the source operator cannot be greater than the number of shards in the upstream storage system.
    • If Vertex 0 performs operations other than reading data from the source table, we recommend that you first split the operators for those other operations. For more information about how to split operators, see Resource parameters.
  • Scenario 5: Backpressure is detected on a vertex, but not on any of its parallel downstream vertices.

    This vertex topology diagram shows that Vertex 0 has backpressure, but it cannot be determined whether Vertex 1 or Vertex 2 has the performance bottleneck. You can make a preliminary judgment based on the IN_Q metric of Vertex 1 and Vertex 2: a vertex whose IN_Q stays at 100% for a long period of time may have a performance bottleneck. To confirm where the bottleneck is, you must split the operators of that vertex. For more information about how to split operators, see Resource parameters.
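
The reasoning in Scenarios 1 through 5 can be summarized as a small decision helper. This is a sketch under assumptions: `suspect_in_chain` and `suspect_branches` are hypothetical names, statuses are plain strings ordered from source to sink, and when several vertices report HIGH (backpressure propagates upstream) the helper treats the vertex right after the most downstream HIGH vertex as the suspect. `suspect_branches` approximates "IN_Q stays at 100% for a long period of time" as "every IN_Q reading in the observed window is 100%".

```python
from typing import Dict, List, Optional, Sequence, Tuple


def suspect_in_chain(statuses: Sequence[str]) -> Tuple[Optional[int], str]:
    """Locate the suspected bottleneck in a linear chain of vertices.

    statuses[i] is "HIGH" or "OK" for Vertex i, ordered from source to sink.
    Returns (index of the suspected vertex or None, suggested next step).
    """
    if not statuses:
        raise ValueError("the topology has no vertices")
    if statuses[-1] == "HIGH":
        # The last vertex has no output network buffer, so it never reports HIGH.
        raise ValueError("the last vertex cannot report backpressure")
    high = [i for i, status in enumerate(statuses) if status == "HIGH"]
    if high:
        # The vertex right after the most downstream HIGH vertex is the suspect.
        suspect = high[-1] + 1
        if suspect == len(statuses) - 1:
            # Scenario 2: the bottleneck is the last vertex.
            return suspect, ("if it only writes to the sink, increase parallelism "
                             "or set batchsize; otherwise split its operators")
        # Scenario 3: the bottleneck is an intermediate vertex.
        return suspect, "check GROUP BY, dimension table JOIN, and UDX operators"
    if len(statuses) == 1:
        # Scenario 1: nothing can be detected on a single vertex.
        return None, "split the operators so that backpressure can be detected"
    # Scenario 4: no backpressure anywhere, so suspect the source vertex.
    return 0, ("if it only reads the source, increase source parallelism or set "
               "batchsize; otherwise split the other operators")


def suspect_branches(in_q_samples: Dict[str, Sequence[float]]) -> List[str]:
    """Scenario 5: return parallel downstream vertices whose IN_Q stayed at 100%.

    in_q_samples maps a vertex name to IN_Q readings (in percent) collected
    over an observation window chosen by the caller.
    """
    return [
        name
        for name, samples in in_q_samples.items()
        if samples and min(samples) >= 100.0
    ]


print(suspect_in_chain(["OK", "HIGH", "OK"])[0])  # Scenario 2 -> 2
print(suspect_in_chain(["HIGH", "OK", "OK"])[0])  # Scenario 3 -> 1
print(suspect_in_chain(["OK"])[0])                # Scenario 1 -> None
print(suspect_in_chain(["OK", "OK", "OK"])[0])    # Scenario 4 -> 0
print(suspect_branches({"Vertex 1": [100, 100, 100], "Vertex 2": [40, 100, 70]}))
```

The helper only narrows down where to look; as the scenarios above describe, confirming the bottleneck still requires checking the operator names and, where needed, splitting operators.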