State management is a complex and critical issue that affects performance, stability, and resource utilization in Apache Flink. This topic describes the principles and strategies of state management. This topic also describes how to use the features of Realtime Compute for Apache Flink to optimize the performance and stability of large-state deployments.
What is State
Apache Flink is an open source framework for real-time processing and analysis of data streams. A core concept of Apache Flink is State, which refers to the information that is maintained across multiple events. Similar to the role of memory in a computer, State allows operators to keep track of historical data when processing unbounded streams. State can be saved as key-value pairs or more complex data types, such as lists, arrays, and custom objects. Accessing and updating state data are essential to complex stream processing.
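To make the idea of state concrete, the following minimal sketch models keyed state in plain Java. It is not Flink API code: the class `KeyedCountSketch` and its methods are hypothetical names chosen for illustration, and the in-memory map stands in for what a state backend would store. It only shows how an operator can keep a per-key value across events.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of keyed state: one counter per key, kept across events.
// Assumption: this map stands in for the storage that a state backend provides.
class KeyedCountSketch {
    // Hypothetical stand-in for keyed state: key -> number of events seen so far.
    private final Map<String, Long> countPerKey = new HashMap<>();

    // Processes one event and returns the updated count for its key.
    long processEvent(String key) {
        long updated = countPerKey.getOrDefault(key, 0L) + 1;
        countPerKey.put(key, updated);
        return updated;
    }

    // Number of keys currently held in state; this grows with key cardinality.
    int stateEntries() {
        return countPerKey.size();
    }

    public static void main(String[] args) {
        KeyedCountSketch operator = new KeyedCountSketch();
        for (String user : new String[] {"alice", "bob", "alice"}) {
            System.out.println(user + " -> " + operator.processEvent(user));
        }
        System.out.println("entries in state: " + operator.stateEntries());
    }
}
```

Because the map never removes keys, the number of entries only grows as new keys arrive, which is one way state size can increase over time.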
State management is an important feature of the development console of Realtime Compute for Apache Flink. This feature allows the system to automatically manage the checkpoints of a deployment and minimize the required storage space without affecting the availability of the deployment. This feature also allows you to efficiently manage savepoints and share them between deployments. This is of great practical value for A/B testing and active/standby failover of large-state deployments.
Issues caused by large state size
Maintaining a large-state deployment is challenging. As the state size increases over time, multiple performance issues occur.
Backpressure and performance decrease
Backpressure occurs when the I/O overhead of state access grows to the point where stateful operators can no longer keep up with the input rate. Processing latency then increases, and throughput, measured in transactions per second (TPS), decreases.
Low resource utilization
In most cases, stateful operators leave a large amount of CPU capacity idle. The larger the state, the more resources are wasted.
Checkpoint and savepoint timeout
When the state size is large, checkpoints and savepoints are more likely to time out. As a result, a restarted deployment needs more time to catch up on unprocessed data, and the latency of end-to-end exactly-once processing increases.
Slow startup and scaling
When you start or scale a deployment, each operator needs to recover its local data based on the state data of the entire deployment. The time this takes is proportional to the state size, so large-state deployments experience longer service interruptions.
Workflow of tuning large-state deployments
To prevent the preceding issues, take the following steps to optimize the state management of large-state deployments:
Identify potential bottlenecks
Use diagnostic tools to understand the current running status of a deployment and determine whether the performance bottlenecks are caused by improper state management. For information about how to use the diagnostic tools, see Monitor deployment performance.
Use the latest engine version
Realtime Compute for Apache Flink continues to optimize the state module of the Ververica Runtime (VVR) engine. In most cases, the latest engine version provides the best performance. VVR is an enterprise-class engine that is fully compatible with Apache Flink. VVR also includes GeminiStateBackend, which significantly improves the performance of state access, checkpointing, and state recovery. GeminiStateBackend tunes its parameters automatically, which eliminates the need for manual configuration. Before you make other optimizations, make sure that you use the latest VVR version. For more information, see GeminiStateBackend, Configurations of GeminiStateBackend, and Upgrade the engine version of a deployment.
Apply different tuning strategies for different issues
Runtime performance decrease (Backpressure)
To address this issue, take the following steps in order: optimize the SQL queries, adjust the time-to-live (TTL) to reduce the amount of state data, and then adjust the memory allocation and parallelism settings to reduce disk I/O. For more information, see Control state size to reduce backpressure in SQL deployments and Control state size to reduce backpressure using the DataStream API.
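The effect of a TTL on state size can be illustrated with a small plain-Java model. This is a sketch under stated assumptions, not the expiry mechanism of Flink or GeminiStateBackend: the class `TtlStateSketch`, its methods, and the rule that the timer resets on every write are all hypothetical choices made for illustration.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Plain-Java model of state with a time-to-live (TTL).
// Assumption: an entry expires once ttlMillis have passed since its last write.
class TtlStateSketch {
    // A stored value together with the time it was last written.
    private static final class TimedValue {
        final String value;
        final long lastUpdateMillis;

        TimedValue(String value, long lastUpdateMillis) {
            this.value = value;
            this.lastUpdateMillis = lastUpdateMillis;
        }
    }

    private final long ttlMillis;
    private final Map<String, TimedValue> state = new HashMap<>();

    TtlStateSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Writes a value and resets the TTL timer for the key.
    void put(String key, String value, long nowMillis) {
        state.put(key, new TimedValue(value, nowMillis));
    }

    // Returns the value, or null if the key is absent or its entry has expired.
    String get(String key, long nowMillis) {
        TimedValue entry = state.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis - entry.lastUpdateMillis >= ttlMillis) {
            state.remove(key); // drop the expired entry on access
            return null;
        }
        return entry.value;
    }

    // Removes every expired entry and returns how many were dropped.
    int cleanUp(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, TimedValue>> it = state.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue().lastUpdateMillis >= ttlMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return state.size();
    }
}
```

A shorter TTL removes entries sooner and keeps the state smaller, but any result that depends on an expired entry can no longer be computed from state, so the TTL should match how long the business logic actually needs the data.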
Checkpoint or savepoint timeout
To address this issue, take the following steps in order: optimize the runtime performance to reduce backpressure, optimize the performance of the synchronous phase, adjust the parallelism to reduce the state size of each subtask, and then consider using native-format savepoints to improve efficiency. For more information, see Diagnose and prevent checkpoint and savepoint timeout.
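The reasoning behind adjusting parallelism can be sketched as back-of-the-envelope arithmetic. The following plain-Java example is a rough model only: it assumes that state is spread evenly across subtasks, that each subtask uploads its full state during a checkpoint, and that every subtask has the same upload bandwidth. The class name, method names, and all numbers are hypothetical.

```java
// Rough estimate of how parallelism affects per-subtask state size and
// checkpoint upload time. Assumptions: even key distribution, full (not
// incremental) upload, and a fixed upload bandwidth per subtask.
class CheckpointEstimateSketch {
    // State size that each subtask holds when state is spread evenly.
    static double stateBytesPerSubtask(double totalStateBytes, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        return totalStateBytes / parallelism;
    }

    // Time for one subtask to upload its share of the state, in seconds.
    static double uploadSeconds(double totalStateBytes, int parallelism,
                                double bytesPerSecondPerSubtask) {
        return stateBytesPerSubtask(totalStateBytes, parallelism) / bytesPerSecondPerSubtask;
    }

    // True if the estimated upload time exceeds the checkpoint timeout.
    static boolean likelyToTimeOut(double totalStateBytes, int parallelism,
                                   double bytesPerSecondPerSubtask, double timeoutSeconds) {
        return uploadSeconds(totalStateBytes, parallelism, bytesPerSecondPerSubtask) > timeoutSeconds;
    }

    public static void main(String[] args) {
        double totalState = 400e9; // hypothetical: 400 GB of total state
        double bandwidth = 100e6;  // hypothetical: 100 MB/s per subtask
        double timeout = 600;      // hypothetical: 10-minute checkpoint timeout
        for (int parallelism : new int[] {4, 16}) {
            System.out.println("parallelism " + parallelism + ": "
                + uploadSeconds(totalState, parallelism, bandwidth) + " s, likely timeout: "
                + likelyToTimeOut(totalState, parallelism, bandwidth, timeout));
        }
    }
}
```

In practice, skewed keys, incremental checkpoints, and backpressure all change these numbers, so the estimate is useful only for judging whether a parallelism change moves the per-subtask state size in the right direction.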
Slow startup and scaling
If the local disk space is sufficient, you can enable the local recovery feature. You can also use the lazy loading and delayed pruning features of GeminiStateBackend to quickly start and scale a deployment. For more information, see Improve startup and scaling speed.