Cut Flink Restart Time Under 1 Min with State Backend Tuning - Realtime Compute for Apache Flink

When a deployment restarts from a checkpoint or savepoint, every stateful operator must download its state data from remote storage before it can begin processing. For large-state deployments, this remote I/O is the primary cause of long service interruptions during restarts, failovers, and rescaling operations.

This topic describes how to identify whether slow state recovery is the bottleneck and how to choose a tuning method based on your scenario, Ververica Runtime (VVR) version, and local disk availability.

Identify the bottleneck

Use thread dumps, thread activity analysis, and flame graphs to check whether operator threads are blocked during the initialization phase, particularly in state backend operations such as those of GeminiStateBackend. If a stateful operator remains stuck in the initialization phase for an extended period, slow state download or recovery is the likely bottleneck.

For instructions on using these diagnostic tools, see Monitor deployment performance.
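
As a rough illustration of the check described above, the following Python sketch scans a jstack-style thread dump for threads whose stack frames look like state restore work. The marker substrings, the sample dump, and the class name `HypotheticalStateBackend` are assumptions made for this example. They are not an official list of frame names, so adjust them to what your own dumps show.

```python
import re

# Assumed markers: substrings that suggest a thread is busy restoring state.
# These are illustrative guesses, not names confirmed by the product docs.
RESTORE_MARKERS = ("StateBackend", "restore", "initializeState")


def find_restoring_threads(thread_dump, markers=RESTORE_MARKERS):
    """Return the names of threads whose stack contains any marker substring."""
    hits = []
    # jstack-style dumps separate threads with blank lines, and each block
    # starts with the quoted thread name.
    for block in re.split(r"\n\s*\n", thread_dump.strip()):
        header = re.match(r'\s*"([^"]+)"', block)
        if not header:
            continue
        frames = [ln for ln in block.splitlines() if ln.strip().startswith("at ")]
        if any(marker in frame for frame in frames for marker in markers):
            hits.append(header.group(1))
    return hits


# Hypothetical dump with one operator thread inside a restore call.
SAMPLE_DUMP = '''
"KeyedProcess (1/4)#0" #71 prio=5 os_prio=0 tid=0x1 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at com.example.HypotheticalStateBackend.restore(HypotheticalStateBackend.java:42)

"Source: events (1/4)#0" #72 prio=5 os_prio=0 tid=0x3 nid=0x4 waiting on condition
   java.lang.Thread.State: TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
'''

if __name__ == "__main__":
    print(find_restoring_threads(SAMPLE_DUMP))
```

If the same operator threads appear in several dumps taken a few seconds apart, that is consistent with the "stuck in the initialization phase" symptom. A single dump on its own is weaker evidence.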

Tuning methods

The following table summarizes the available methods. Choose based on your scenario, VVR version, and disk availability.

Method	Best for	VVR requirement
Dynamic scaling	Reducing interruption time during parameter updates and restarts	None (experimental)
Local recovery	Failovers and dynamic parameter updates when local disk space is sufficient	VVR 8.0.8 or later (experimental)
GeminiStateBackend lazy loading	Large-state deployments where startup time scales with state size	VVR 6.0.6 or later
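
The selection logic in the table can be sketched as a small Python helper. This is only one reading of the table. The scenario labels (`failover`, `parameter_update`, `manual_restart`), the `large_state` flag, and the plain `major.minor.patch` version format are assumptions introduced for this example.

```python
def parse_version(text):
    """Turn a version string such as '8.0.8' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def candidate_methods(vvr_version, scenario, local_disk_sufficient, large_state):
    """List the tuning methods from the table that fit the given situation.

    scenario is a hypothetical label: 'failover', 'parameter_update',
    or 'manual_restart'.
    """
    version = parse_version(vvr_version)
    methods = []
    # Dynamic scaling has no VVR requirement in the table.
    if scenario == "parameter_update":
        methods.append("dynamic scaling")
    # Local recovery: failovers and dynamic parameter updates only,
    # needs local disk space, VVR 8.0.8 or later.
    if (scenario in ("failover", "parameter_update")
            and local_disk_sufficient
            and version >= (8, 0, 8)):
        methods.append("local recovery")
    # Lazy loading: large state, VVR 6.0.6 or later.
    if large_state and version >= (6, 0, 6):
        methods.append("GeminiStateBackend lazy loading")
    return methods


if __name__ == "__main__":
    print(candidate_methods("8.0.9", "failover", True, True))
    print(candidate_methods("6.0.7", "manual_restart", True, True))
```

The helper returns every method that applies rather than a single winner, because the methods address different causes of interruption and the table does not rank them.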

Configure dynamic scaling

Dynamic scaling lets you update TaskManager parameters dynamically, which avoids much of the service interruption caused by canceling and restarting a deployment. Compared with a traditional cancel-and-restart, dynamic scaling reduces the interruption time to between 5 seconds and 1 minute.

Dynamic scaling is an experimental feature and may cause service interruptions. Actual interruption time varies based on deployment topology and state size.

For configuration details, see Dynamically update the parameter configuration for dynamic scaling.

Enable local recovery

Local recovery keeps a backup copy of each snapshot on local disk. During recovery, state can be read from the local copy, so less data is downloaded from remote storage and recovery completes faster. If local disk space is sufficient, this is the most suitable option.

Add the following configuration:

state.backend.local-recovery: true

For configuration instructions, see the "How do I configure parameters for deployment running?" section of the Reference topic.

Limitations:

Local recovery applies only to failovers and dynamic parameter updates. It does not take effect when you manually cancel and restart a deployment.
Local recovery is an experimental feature. We recommend that you enable it only in VVR 8.0.8 or later.
Additional local disk space is required to store the secondary snapshot copies.
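
Before you enable local recovery, it can help to check whether a TaskManager has enough free disk for the extra snapshot copies. The Python sketch below is a back-of-the-envelope check only. The function names and the safety factor of 1.5 are assumptions made for this example, not a sizing rule from the product documentation, and you need to supply your own estimate of the state size held by one TaskManager.

```python
import shutil

GIB = 1024 ** 3


def free_bytes(path):
    """Free space, in bytes, on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def has_local_recovery_headroom(free, state_bytes_on_tm, safety_factor=1.5):
    """Return True if free space covers the local snapshot copy with some margin.

    safety_factor is an assumed margin for this sketch, not a documented value.
    """
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")
    return free >= state_bytes_on_tm * safety_factor


if __name__ == "__main__":
    # Hypothetical estimate: 100 GiB of state on this TaskManager.
    estimated_state = 100 * GIB
    print(has_local_recovery_headroom(free_bytes("."), estimated_state))
```

Run the check against the directory that actually holds the local state, which depends on how your deployment is configured.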

Use GeminiStateBackend lazy loading and delayed pruning

GeminiStateBackend is an enterprise-class state backend developed by Alibaba Cloud. With lazy loading, a large-state deployment downloads only the necessary metadata at startup, so it can begin processing data quickly. The system then uses asynchronous download and intelligent pruning to process the remote checkpoint files in the background. This reduces interruption time by over 90% for large-state deployments.

Add the following configuration:

state.backend.gemini.file.cache.download.type: LazyDownloadOnRestore

This configuration is supported only in VVR 6.0.6 or later.

For configuration instructions, see the "How do I configure parameters for deployment running?" section of the Reference topic.

For more information about GeminiStateBackend, see GeminiStateBackend.

Trade-off: After a restart, performance drops temporarily while the background download is in progress. Performance recovers gradually as the remaining state files are downloaded.
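
The idea behind lazy loading, and the reason for the temporary slowdown, can be illustrated with a toy Python model. This is a conceptual sketch only and does not reflect how GeminiStateBackend is implemented. The class `LazyRestoredState` and the file names are hypothetical, and pruning is not modeled.

```python
class LazyRestoredState:
    """Toy model: files are fetched on first access or by a background step."""

    def __init__(self, remote_files):
        # remote_files maps file name -> bytes and stands in for remote storage.
        self._remote = remote_files
        self._local = {}
        # Only the file list (the "metadata") is known at startup,
        # so processing can begin before any file is downloaded.
        self._pending = list(remote_files)
        self.remote_reads = 0

    def read(self, name):
        """Read a file, downloading it first if it is not yet local."""
        if name not in self._local:
            self._download(name)  # slow path: this read waits on remote I/O
        return self._local[name]

    def background_step(self):
        """Download one pending file. Return False when nothing is left."""
        while self._pending:
            name = self._pending.pop(0)
            if name not in self._local:
                self._download(name)
                return True
        return False

    def _download(self, name):
        self._local[name] = self._remote[name]
        self.remote_reads += 1


if __name__ == "__main__":
    state = LazyRestoredState({"sst-1": b"a", "sst-2": b"b", "sst-3": b"c"})
    state.read("sst-2")            # first access goes to remote storage
    state.read("sst-2")            # second access is served locally
    while state.background_step():
        pass                       # the background download fills in the rest
    print(state.remote_reads)
```

In this model, reads that arrive before the background download reaches a file pay the remote I/O cost, which mirrors the temporary performance drop described above.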

What's next

For an overview of large-state performance issues and the recommended tuning workflow, see Performance tuning for large-state deployments.
To reduce backpressure caused by large state in Flink SQL jobs, see Control state size to reduce backpressure in SQL deployments.
To manage state size with the DataStream API, see Control state size to reduce backpressure using the DataStream API.
To diagnose and prevent checkpoint and savepoint timeouts, see Diagnose and prevent checkpoint and savepoint timeout.