When you restart a deployment from a checkpoint or savepoint, the state data is downloaded from remote storage to rebuild the state engine. Because this involves extensive I/O operations, the download can become a bottleneck and significantly delay startup. This topic describes how to identify and remove performance bottlenecks during deployment startup and scaling.
Investigation steps
Perform the following steps to identify a bottleneck during deployment startup or scaling:
1. Use diagnostic tools to analyze operator status: Use tools such as thread dumps, thread activity analysis, and flame graphs to check whether an operator thread is stalled during the initialization phase, especially in operations on the state backend, such as GeminiStateBackend. For information about how to use the diagnostic tools, see Monitor deployment performance.
2. Identify the cause: If a stateful operator remains in the initialization phase for a long time, the bottleneck is likely caused by state data download or recovery.
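As a rough aid for the first step, a text thread dump can be scanned for operator threads whose stacks pass through state-restore code. The following Python sketch assumes a jstack-style dump format. The keyword list, the sample dump, and every class name in it are hypothetical; replace them with frames that actually appear in your own dumps.

```python
import re
from typing import List, Sequence, Tuple

# A jstack-style thread header starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')

# Hypothetical keywords that hint at state restore work; adjust to your dumps.
DEFAULT_KEYWORDS = ("restore", "initializeState", "StateBackend")


def parse_thread_dump(dump: str) -> List[Tuple[str, List[str]]]:
    """Split a text thread dump into (thread name, stack frames) pairs."""
    threads: List[Tuple[str, List[str]]] = []
    for line in dump.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            threads.append((header.group("name"), []))
        elif threads and line.strip().startswith("at "):
            # Drop the leading "at " and keep the frame text.
            threads[-1][1].append(line.strip()[3:])
    return threads


def threads_in_state_restore(
    dump: str, keywords: Sequence[str] = DEFAULT_KEYWORDS
) -> List[Tuple[str, str]]:
    """Return (thread name, first matching frame) for each thread whose
    stack contains one of the keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    hits: List[Tuple[str, str]] = []
    for name, frames in parse_thread_dump(dump):
        for frame in frames:
            if any(k in frame.lower() for k in lowered):
                hits.append((name, frame))
                break
    return hits


# Made-up sample dump for illustration only; not output from a real deployment.
SAMPLE_DUMP = '''\
"KeyedProcess -> Sink (1/4)#0" #87 prio=5 os_prio=0 tid=0x1 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at org.example.state.RemoteFileDownloader.download(RemoteFileDownloader.java:58)
    at org.example.state.ExampleStateBackend.restore(ExampleStateBackend.java:210)
    at org.example.runtime.OperatorInit.initializeState(OperatorInit.java:77)

"Source: Kafka (1/4)#0" #88 prio=5 os_prio=0 tid=0x3 nid=0x4 waiting on condition
   java.lang.Thread.State: WAITING
    at sun.misc.Unsafe.park(Native Method)
    at org.example.runtime.Mailbox.take(Mailbox.java:149)
'''

if __name__ == "__main__":
    for thread_name, frame in threads_in_state_restore(SAMPLE_DUMP):
        print(f"{thread_name}: {frame}")
```

If the same operator threads show up in several dumps taken a few seconds apart, that is consistent with the state download or recovery bottleneck described in the second step. A single dump is only a snapshot, so treat one match as a hint rather than a diagnosis.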
Tuning methods
The following table describes the methods you can use to improve startup and scaling speed.
Method | Description | Configuration | Usage notes |
Configure dynamic scaling | This method allows you to dynamically update TaskManager parameters to reduce the service interruption time caused by deployment startup and cancellation. | For more information, see Dynamically update the parameter configuration for dynamic scaling. | Dynamic parameter update is an experimental feature and may cause service interruptions. Compared with traditional methods, this feature significantly reduces the interruption time to between 5 seconds and 1 minute. The actual interruption time varies based on factors such as deployment topology and state size. |
Enable local recovery | This method allows snapshots to be stored locally as a backup. As a result, less data is downloaded from remote storage, which accelerates the recovery process. If the local disk space is sufficient, this is the most suitable option. | Add the corresponding configuration. For more information, see the "How do I configure parameters for deployment running?" section of the Reference topic. | |
Use the lazy loading and delayed pruning features of GeminiStateBackend | GeminiStateBackend is an enterprise-class state backend developed by Alibaba Cloud. It allows a large-state deployment to start quickly and resume real-time data processing by downloading only the necessary metadata. The system then uses asynchronous download and intelligent pruning to efficiently process remote checkpoint files. This significantly reduces the interruption time and improves efficiency by over 90%. For more information, see GeminiStateBackend. | Add the corresponding configuration. For more information, see the "How do I configure parameters for deployment running?" section of the Reference topic. Note: This configuration is supported only in VVR 6.0.6 or later. | After the deployment is restarted, a temporary performance drop occurs due to the asynchronous download of state files. Performance gradually improves as the state files are fully restored. |
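To get a feel for why local recovery and lazy loading shorten startup, the back-of-the-envelope sketch below estimates how long a deployment waits on remote downloads before it can start processing. It is a simplified model, not a description of how the platform schedules downloads. The function name, the state size, the throughput, and the fractions are all hypothetical values chosen for illustration.

```python
def blocking_download_seconds(
    state_gib: float,
    throughput_mib_s: float,
    fraction_before_start: float = 1.0,
) -> float:
    """Estimate the seconds spent downloading state before processing starts.

    state_gib: total state size in GiB.
    throughput_mib_s: sustained download throughput from remote storage, MiB/s.
    fraction_before_start: share of the state that must be fetched remotely
        before the deployment can start (1.0 = everything).
    """
    if throughput_mib_s <= 0:
        raise ValueError("throughput_mib_s must be positive")
    if not 0.0 <= fraction_before_start <= 1.0:
        raise ValueError("fraction_before_start must be between 0 and 1")
    return state_gib * 1024 * fraction_before_start / throughput_mib_s


if __name__ == "__main__":
    # Hypothetical deployment: 200 GiB of state, 400 MiB/s from remote storage.
    full = blocking_download_seconds(200, 400)
    # Assumption: local snapshots cover 90% of the state, so 10% is remote.
    local = blocking_download_seconds(200, 400, fraction_before_start=0.10)
    # Assumption: lazy loading needs about 1% (metadata) before starting.
    lazy = blocking_download_seconds(200, 400, fraction_before_start=0.01)
    print(f"full remote download: {full:.0f} s")
    print(f"with local recovery:  {local:.0f} s")
    print(f"with lazy loading:    {lazy:.0f} s")
```

The model ignores state rebuild time, disk speed, and the temporary performance drop while lazily loaded files are still arriving, so use it only to compare options in relative terms.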
References
For information about the issues caused by large state size and the tuning workflow, see Performance tuning for large-state deployments.
Flink SQL uses an optimizer to select stateful operators based on parameter configurations and SQL statements. A basic understanding of the underlying mechanisms is necessary to optimize the performance of stateful computation over massive data. For more information, see Control state size to reduce backpressure in SQL deployments.
The Apache Flink DataStream API allows you to manage the state size in a flexible manner. For more information, see Control state size to reduce backpressure using the Datastream API.
For information about how to diagnose and prevent checkpoint and savepoint timeouts, see Diagnose and prevent checkpoint and savepoint timeout.