State management affects performance, stability, and resource utilization. Improper state management may lead to system crashes. The Datastream API allows you to manage the state size in a flexible manner. This topic describes how to control the state size of a deployment built with the Datastream API.
Background information
Apache Flink supports two types of state: operator state and keyed state. Using the keyed state is more likely to result in a large state size. To address this issue, you can use the DataStream API, such as the ValueState, ListState, and MapState interfaces, to manage the keyed state. You can also configure state time-to-live (TTL) to expire state data. For more information, see Using Keyed State .
Diagnostic tools
Backpressure is an indicator of performance bottlenecks in Apache Flink. In most cases, backpressure occurs because the state size continues to increase and exceeds the allocated memory size. In this case, the state backend moves infrequently used state data to the disk storage. However, accessing data in the disk storage is significantly slower than accessing data in the memory. If an operator frequently reads state data from the disk, the data latency significantly increases. This results in a performance bottleneck.
To identify whether backpressure is caused by a large state size, you need to thoroughly analyze the running status of the deployment and operators. You can use monitoring and diagnostic tools to identify and resolve performance issues caused by state access.
The following table describes the diagnostic tools on the development console of Realtime Compute for Apache Flink. You can use these tools together with the intelligent deployment diagnostics and automatic tuning features to facilitate performance tuning for large-state deployments.
Tool | Description | Usage |
Thread dumps | Check whether the operator thread is mainly accessing the state data at the current time. | |
Thread activity | Check whether the operator thread is mainly accessing the state data during a specific period. | |
Flame graphs | Check whether most of the CPU time is used to access the state data in a specific period. | |
Runtime metrics monitoring | View related metrics to check the state size and I/O overhead. | Click the deployment you want to view on the Deployments page, click the Status tab, and then click Metrics in the Actions section. You can view the following metrics:
|
Tuning methods
Reevaluate your application design
The design of state storage and management in your application is crucial. To prevent unbounded state growth, store only the required information in the state.
Configure TTL to reduce state size
Apache Flink provides various methods to help you manage the lifecycle of state data. For example, you can invoke the setTTL() method of the ValueStateDescriptor interface to automatically expire and clear state data. You can also invoke the clear() and remove() methods to delete the state data for records that you no longer need. This helps control the state size.
Use timers for state cleanup
You can use timers to periodically trigger state cleanup. This ensures that expired state data is removed in a timely manner and prevents unbounded state growth. In addition, you can manage the lifecycle of the state data in a fine-grained manner.
Monitor performance, generate logs, and analyze state files
Monitor the performance metrics related to the state size and the state backend to detect exceptions in a timely manner. Generate detailed logs to facilitate troubleshooting. Regularly analyze historical state files to identify patterns and potential risks and use this information to optimize state management.
Minimize disk reads
Reduce the number of disk reads and optimize memory allocation to improve system performance.
Optimize memory allocation
Allocate more resources to managed memory without affecting the total system resources. This effectively improves memory utilization, thereby reducing access to the disk. Before you use this method, ensure that the memory resources are sufficient for other parts of the system.
Add memory resources
Add memory resources and allocate more managed memory to the state storage engine. This improves memory utilization and reduces access to the disk. You can use this method in the expert mode of resource configuration to implement fine-grained resource allocation and achieve optimal performance.
Increase the parallelism
A higher parallelism results in a smaller state size for each subtask, thereby reducing the amount of data written to disk. This method effectively reduces disk I/O operations and improves data processing efficiency.
The following table describes how to use the preceding methods in different scenarios.
Scenarios | Method | Operation | Precautions |
Other memory resources, such as heap memory, are sufficient | Allocate more resources to managed memory. | Configure the | Make sure that other memory resources are sufficient. Otherwise, full garbage collection (Full GC) may occur frequently, which degrades performance. |
All scenarios | Add memory resources. | N/A | |
Increase the parallelism. |
References
For information about the issues caused by a large state size and the tuning workflow, see Performance tuning for large-state deployments.
Flink SQL utilizes an optimizer to select stateful operators based on parameter configurations and SQL statements. A basic understanding of the underlying mechanisms is necessary to optimize the performance of stateful computation over massive data. For more information, see Control state size to reduce backpressure in SQL deployments.
For information about how to diagnose and prevent checkpoint and savepoint timeout, see Diagnose and prevent checkpoint and savepoint timeout.
For information about how to identify and remove performance bottlenecks during deployment startup and scaling, see Improve startup and scaling speed.