Fix Checkpoint and Savepoint Timeouts for Large State Jobs - Realtime Compute for Apache Flink

Checkpoint phases

Realtime Compute for Apache Flink uses the Chandy-Lamport algorithm to take consistent snapshots of job state, which is what makes recovery from a checkpoint or savepoint reliable. Every checkpoint or savepoint goes through two phases:

Synchronous phase -- The system waits for barriers to align across operators. Barriers are a special type of data record passed between operators. The time required for barrier alignment is proportional to the arrival delay of data records.
Asynchronous phase -- Each operator uploads its local state to a remote persistent storage system. Upload time is proportional to the state size.
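
The two phases can be read as a rough additive model: time spent waiting for barriers plus time spent uploading state. The sketch below only illustrates that reading. The function name, its parameters, and the sample numbers are hypothetical and do not come from the product.

```python
def estimate_checkpoint_seconds(barrier_delay_s: float,
                                state_bytes: int,
                                upload_bytes_per_s: float) -> dict:
    """Rough per-subtask estimate: barrier wait plus state upload time."""
    if upload_bytes_per_s <= 0:
        raise ValueError("upload_bytes_per_s must be positive")
    sync_s = barrier_delay_s                    # synchronous phase: waiting for barriers
    async_s = state_bytes / upload_bytes_per_s  # asynchronous phase: uploading state
    return {"sync_s": sync_s, "async_s": async_s, "total_s": sync_s + async_s}


# Hypothetical numbers: 30 s barrier delay, 20 GiB of state, 100 MiB/s upload.
estimate = estimate_checkpoint_seconds(30.0, 20 * 1024**3, 100 * 1024**2)
print(estimate)
```

In this simplified model, a timeout can be caused by either term alone, which is why the sections below first identify the slow phase and then tune it.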

Important

Backpressure slows down barrier propagation during the synchronous phase, which can directly cause checkpoint and savepoint timeouts. Resolve backpressure before investigating timeout issues. For guidance, see Control state size to reduce backpressure in SQL deployments and Control state size to reduce backpressure using the DataStream API.

Identify the bottleneck phase

After resolving backpressure, if checkpoints or savepoints still time out, use the following tools to determine whether the bottleneck is in the synchronous phase or the asynchronous phase.

Checkpoint history UI

Navigate to O&M > Deployments, click the target deployment name, then go to Logs > Checkpoints > Checkpoints History. This view provides deployment-level, operator-level, and subtask-level metrics.

Identify the operators where checkpoints are timing out or still in progress, then check the following metrics:

Metric	Description
Sync Duration	Total time spent in the synchronous phase, including snapshotting operator state.
Alignment Duration	Time between processing the first and the last checkpoint barrier. High values indicate uneven data distribution across input channels.
Async Duration	Time spent uploading state to remote storage.
Checkpointed Data Size	Volume of state data written during the checkpoint.

How to interpret the results:

High Sync Duration or Alignment Duration -- The bottleneck is in the synchronous phase. Barriers are traveling slowly through the job graph, typically due to residual backpressure or channel skew.
High Async Duration or Checkpointed Data Size -- The bottleneck is in the asynchronous phase. The state is too large to upload within the timeout window.
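
The interpretation rules above can be sketched as a small comparison. This is a minimal illustration that assumes the metric values were copied by hand from the Checkpoints History view and converted to seconds. The class, the function, and the choice to compare the larger of Sync Duration and Alignment Duration against Async Duration are assumptions, not product behavior.

```python
from dataclasses import dataclass


@dataclass
class SubtaskStats:
    """Metrics copied by hand from the Checkpoints History view (seconds)."""
    name: str
    sync_duration_s: float
    alignment_duration_s: float
    async_duration_s: float


def classify_bottleneck(stats: SubtaskStats) -> str:
    """Return the phase that dominates this subtask's checkpoint time."""
    # Use the larger value so that alignment time is not counted twice
    # if the sync figure already contains it (an assumption of this sketch).
    sync_side = max(stats.sync_duration_s, stats.alignment_duration_s)
    async_side = stats.async_duration_s
    if sync_side == 0 and async_side == 0:
        return "unknown"
    return "synchronous" if sync_side >= async_side else "asynchronous"


# Hypothetical subtasks, not real measurements.
skewed_join = SubtaskStats("join", 2.0, 540.0, 35.0)
large_agg = SubtaskStats("aggregate", 1.0, 3.0, 580.0)
print(classify_bottleneck(skewed_join), classify_bottleneck(large_agg))
```

A real deployment can show high values on both sides for different operators, in which case the methods for both phases apply.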

Checkpoint metrics

Navigate to O&M > Deployments, click the target deployment name, then go to the Logs tab and click Alarm. The lastCheckpointDuration and lastCheckpointSize metrics provide a coarse-grained view of historical checkpoint performance, useful for spotting trends over time.
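
One way to quantify a trend is to fit a straight line through recent lastCheckpointDuration samples. The sketch below assumes the samples have already been exported as a plain list of seconds in chronological order. The export step and the function name are hypothetical.

```python
def duration_trend(samples_s: list[float]) -> float:
    """Least-squares slope of checkpoint duration, in seconds per checkpoint."""
    n = len(samples_s)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2            # mean of the sample indexes 0..n-1
    mean_y = sum(samples_s) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_s))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical samples: each checkpoint takes 10 s longer than the previous one.
slope = duration_trend([10.0, 20.0, 30.0, 40.0])
print(slope)
```

A persistently positive slope suggests that state size or barrier delay is growing and that the timeout will eventually be reached, even if current checkpoints still complete.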

Tune checkpoint and savepoint performance

Before applying any tuning method, make sure the deployment's runtime performance meets expectations, because poor runtime performance amplifies checkpoint issues. After optimizing runtime performance, apply one or more of the following methods based on the bottleneck phase.

These methods are not mutually exclusive. Combine them if the deployment experiences bottlenecks in both phases.

Use unaligned checkpoints and buffer debloating

Property	Details
When to use	Checkpoint or savepoint timeout caused by the synchronous phase
How it helps	Eliminates the need for barrier alignment, resolving timeout issues related to slow or skewed barriers. Effective for deployments of all sizes.
Configuration	See Checkpointing under backpressure in the Apache Flink documentation.

Unaligned checkpoints have specific limitations. Review the Limitations section in the Apache Flink documentation before enabling this feature.
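
As a rough illustration, the helper below renders the kind of key-value settings involved. The option keys are the ones listed in the Apache Flink configuration reference, but key names have changed between Flink releases, the 30s value is only a sample, and where these lines are entered for a deployment is not covered here. Treat all of it as an assumption to verify against your engine version.

```python
# Option keys as listed in the Apache Flink configuration reference.
# Key names differ between Flink releases; verify them for your engine version.
UNALIGNED_CHECKPOINT_OPTIONS = {
    # Allow barriers to overtake buffered records instead of waiting for alignment.
    "execution.checkpointing.unaligned": "true",
    # Start aligned and switch to unaligned after this delay (sample value).
    "execution.checkpointing.aligned-checkpoint-timeout": "30s",
    # Shrink in-flight network buffers so barriers have less data ahead of them.
    "taskmanager.network.memory.buffer-debloat.enabled": "true",
}


def render_options(options: dict[str, str]) -> str:
    """Format the options as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in options.items())


rendered = render_options(UNALIGNED_CHECKPOINT_OPTIONS)
print(rendered)
```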

Increase parallelism

Property	Details
When to use	Checkpoint or savepoint timeout caused by the asynchronous phase
How it helps	Distributes state data across more parallel tasks, reducing the amount of data each task uploads during the asynchronous phase.
Configuration	Adjust parallelism using the basic or expert mode of resource configuration. See Configure resources for a deployment.
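
To get a feel for how much parallelism is needed, the sketch below estimates the smallest parallelism at which each subtask can upload its share of state within the timeout. It assumes that state is spread evenly across subtasks (no key skew) and that every subtask uploads at the same throughput. The function name, parameters, and sample numbers are hypothetical.

```python
import math


def min_parallelism_for_timeout(total_state_bytes: int,
                                upload_bytes_per_s: float,
                                timeout_s: float,
                                sync_budget_s: float = 0.0) -> int:
    """Smallest parallelism whose per-subtask upload fits in the timeout."""
    async_budget_s = timeout_s - sync_budget_s  # time left for the upload
    if async_budget_s <= 0 or upload_bytes_per_s <= 0:
        raise ValueError("timeout and throughput must leave a positive budget")
    max_bytes_per_subtask = upload_bytes_per_s * async_budget_s
    return max(1, math.ceil(total_state_bytes / max_bytes_per_subtask))


# Hypothetical numbers: 600 GiB of state, 50 MiB/s per subtask, 600 s timeout.
needed = min_parallelism_for_timeout(600 * 1024**3, 50 * 1024**2, 600.0)
print(needed)
```

With skewed keys, the largest subtask determines the upload time, so the real requirement can be higher than this estimate.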

Use the native format for savepoints

Property	Details
When to use	Savepoint timeout caused by the asynchronous phase
How it helps	The native format generates savepoints faster and consumes less storage space than the standard format.
Configuration	Create a savepoint in the native format for a running deployment. See the "Manually create a savepoint" section of Status set management.

Important

Native-format savepoints do not guarantee compatibility across major Flink versions. If cross-version compatibility is required, use the standard format instead.

References

Performance tuning for large-state deployments -- Covers issues caused by large state size and the overall tuning workflow.
Control state size to reduce backpressure in SQL deployments -- Explains how the Flink SQL optimizer selects stateful operators, and how to tune stateful computation over large datasets.
Control state size to reduce backpressure using the DataStream API -- Covers flexible state size management with the DataStream API.
Improve startup and scaling speed -- When restarting a deployment from a checkpoint or savepoint, state data is downloaded from remote storage to restore the state engine. This process can become an efficiency bottleneck. See this topic for guidance on identifying and removing performance bottlenecks during deployment startup and scaling.