Backpressure is an important concept in streaming data exchange (shuffle). If the processing capability of a downstream storage system is insufficient, Realtime Compute notifies the upstream systems to stop sending data so that no data is lost. This situation is called backpressure. This topic describes typical backpressure scenarios and optimization ideas.
Backpressure detection mechanism
- Log on to the Realtime Compute development platform.
- In the top navigation bar, click Administration.
- On the Jobs page that appears, click the name of the target job in the Job Name column.
- In the left-side navigation pane, click the running job for which you want to check backpressure. In the Vertex Topology section of the Overview tab that appears, click the blue border of the vertex that you want to check.
- In the right-side pane, click the BackPressure tab and view the backpressure status in the Status column.
- If high with a red indicator is displayed, the vertex has backpressure.
- If ok with a green indicator is displayed, the vertex does not have backpressure.
Backpressure scenarios and optimization ideas
- Scenario 1: Only one vertex exists and no backpressure is detected.
Due to the design of Flink, no network buffer is configured on the output of the last vertex, and data from that vertex is written directly into the downstream storage system. As a result, backpressure cannot be detected on a job that has only one vertex, or on the last vertex of a job. Therefore, a vertex topology that shows no backpressure does not necessarily mean that the job has none. To determine whether and where backpressure exists, you must split the operators in Vertex 0. For more information about how to split operators, see Resource parameters.
- Scenario 2: Multiple vertices exist and backpressure is detected on the second-to-last vertex.
This vertex topology diagram shows that Vertex 1 has backpressure and Vertex 2 has a performance bottleneck. You can check the operator names in Vertex 2 to determine the actions that you can take.
- If only write operations into the downstream storage system are involved, the backpressure may be caused by a slow writing speed. We recommend that you increase the parallelism for Vertex 2 or set the batchsize parameter for the result table (see the example after this list). For more information, see Upstream and downstream data storage parameters.
- If operations other than write operations into the downstream storage system are involved, split the operators that correspond to those operations for a further check. For more information about how to split operators, see Resource parameters.
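For reference, the following minimal sketch shows what a batched result table DDL might look like. The RDS connector, the placeholder connection values, and the batchSize parameter name are assumptions for illustration; the exact option names depend on the downstream storage system that you use.

```sql
-- Hypothetical result table DDL. The connector type and the batchSize
-- parameter name are illustrative; check the documentation of your
-- downstream storage system for the exact option names.
CREATE TABLE rds_output (
  id   BIGINT,
  cnt  BIGINT,
  PRIMARY KEY (id)
) WITH (
  type = 'rds',
  url = '<yourDatabaseURL>',
  tableName = '<yourTableName>',
  userName = '<yourUserName>',
  password = '<yourPassword>',
  batchSize = '4096'  -- buffer and write records in larger batches to relieve the write bottleneck
);
```

Writing in larger batches reduces the number of round trips to the downstream storage system, which is often enough to remove the bottleneck without increasing the parallelism of Vertex 2.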
- Scenario 3: Multiple vertices exist and backpressure is detected on a vertex other than the second-to-last vertex.
This vertex topology diagram shows that Vertex 0 has backpressure and Vertex 1 has a performance bottleneck. You can check the operator names in Vertex 1 to determine the actions that you can take. The common operations and related optimization methods in this scenario are as follows:
- GROUP BY operation: You can increase the parallelism or set the miniBatch parameter to optimize the state operation. For more information, see Job parameters.
- JOIN operation on dimension tables: You can increase the parallelism or set a cache policy for the dimension tables (see the example after this list). For more information, see the relevant dimension table documents.
- User-defined extension (UDX) operation: You can increase the parallelism or optimize the related UDX code.
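As a reference, the following sketch shows a dimension table DDL with an LRU cache policy. The RDS connector, the placeholder connection values, and the cache, cacheSize, and cacheTTLMs parameter names are assumptions based on common Realtime Compute dimension table options; verify them against the documentation of the dimension table type that you use.

```sql
-- Hypothetical dimension table DDL with a cache policy. Parameter names are
-- illustrative and may differ depending on the dimension table type.
CREATE TABLE dim_product (
  product_id    BIGINT,
  product_name  VARCHAR,
  PRIMARY KEY (product_id),
  PERIOD FOR SYSTEM_TIME  -- declares this table as a dimension table
) WITH (
  type = 'rds',
  url = '<yourDatabaseURL>',
  tableName = '<yourDimTableName>',
  userName = '<yourUserName>',
  password = '<yourPassword>',
  cache = 'LRU',          -- cache lookup results instead of querying the database for every record
  cacheSize = '10000',
  cacheTTLMs = '60000'
);
```

Caching lookups reduces the number of synchronous requests that the JOIN operator sends to the dimension table, which often resolves the bottleneck without increasing the parallelism.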
- Scenario 4: Multiple vertices exist and no backpressure is detected on any of the vertices.
This vertex topology diagram shows that Vertex 0 has a potential performance bottleneck. You can check the operator names in Vertex 0 to determine the actions that you can take.
- If only read operations from the source table are involved, a slow reading speed causes high latency, but Realtime Compute itself does not have a performance bottleneck. In this case, you can increase the parallelism of the source operator or set the batchsize parameter for reading the source data (see the example after this list). For more information, see Upstream and downstream data storage parameters.
  Note: The parallelism of the source operator cannot be greater than the number of shards of the upstream storage system.
- If operations other than read operations from the source table are involved, we recommend that you first split the operators involved in those operations. For more information about how to split operators, see Resource parameters.
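For reference, the following minimal sketch shows a source table DDL with a batch read setting. The DataHub connector, the placeholder connection values, and the batchReadSize parameter name are assumptions for illustration; the exact option name depends on the upstream storage system that you use, and the parallelism of the source operator is configured in the resource settings rather than in the DDL.

```sql
-- Hypothetical source table DDL. The connector type and the batchReadSize
-- parameter name are illustrative; the actual option depends on your
-- upstream storage system.
CREATE TABLE datahub_input (
  user_id     BIGINT,
  event_time  VARCHAR
) WITH (
  type = 'datahub',
  endPoint = '<yourEndpoint>',
  project = '<yourProjectName>',
  topic = '<yourTopicName>',
  accessId = '<yourAccessId>',
  accessKey = '<yourAccessKey>',
  batchReadSize = '500'  -- fetch more records per read so that the source keeps up with the upstream system
);
```

Keep the preceding note in mind: source subtasks beyond the shard count of the upstream storage system stay idle, so increase the number of shards first if you need higher source parallelism.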
- Scenario 5: Backpressure is detected on a vertex, but no backpressure is detected on its subsequent parallel vertices.
This vertex topology diagram shows that Vertex 0 has backpressure but whether Vertex 1 or Vertex 2 has a performance bottleneck cannot be determined. You can preliminarily determine the vertex where a performance bottleneck exists based on the IN_Q metric of Vertex 1 and Vertex 2. The vertex whose IN_Q remains 100% for a long period of time may have a performance bottleneck. To further determine where the performance bottleneck exists, you must split the operators of the vertex. For more information about how to split operators, see Resource parameters.