Checkpoint and savepoint timeouts occur during one of two phases of the checkpointing process: slow barrier alignment (the synchronous phase) or slow state upload (the asynchronous phase). This topic explains how to identify the bottleneck phase and apply the appropriate tuning method.
Checkpoint phases
Realtime Compute for Apache Flink creates consistent state snapshots using asynchronous barrier snapshotting, a variant of the Chandy-Lamport algorithm, ensuring data consistency and reliability. Every checkpoint or savepoint goes through two phases:
Synchronous phase -- The system waits for checkpoint barriers to align across operators. Barriers are special records injected into the data stream and passed between operators. Because barriers travel with regular data records, alignment time grows with any delay in record processing.
Asynchronous phase -- Each operator uploads its local state to a remote persistent storage system. Upload time is proportional to the state size.
Backpressure slows down barrier propagation during the synchronous phase, which directly causes checkpoint and savepoint timeout. Resolve backpressure before investigating timeout issues. For guidance, see Control state size to reduce backpressure in SQL deployments and Control state size to reduce backpressure using the DataStream API.
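Before tuning, it can help to confirm how often checkpoints start and how long they are allowed to run. As a hedged illustration, the relevant open-source Apache Flink configuration keys look like the following; the values are examples, and key availability and defaults may differ in Realtime Compute for Apache Flink:

```yaml
# Example values only; tune to your workload.
execution.checkpointing.interval: 3min   # how often a new checkpoint is triggered
execution.checkpointing.timeout: 10min   # a checkpoint that runs longer is aborted
```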
Identify the bottleneck phase
After resolving backpressure, if checkpoints or savepoints still time out, use the following tools to determine whether the bottleneck is in the synchronous phase or the asynchronous phase.
Checkpoint history UI
Navigate to O&M > Deployments, click the target deployment name, then go to Logs > Checkpoints > Checkpoints History. This view provides deployment-level, operator-level, and subtask-level metrics.

Identify the operators where checkpoints are timing out or still in progress, then check the following metrics:
| Metric | Description |
|---|---|
| Sync Duration | Total time spent in the synchronous phase, including snapshotting operator state. |
| Alignment Duration | Time between processing the first and the last checkpoint barrier at an operator. High values indicate that barriers arrive at very different times on different input channels, typically because of data skew or uneven backpressure. |
| Async Duration | Time spent uploading state to remote storage. |
| Checkpointed Data Size | Volume of state data written during the checkpoint. |
How to interpret the results:
High Sync Duration or Alignment Duration -- The bottleneck is in the synchronous phase. Barriers are traveling slowly through the job graph, typically due to residual backpressure or channel skew.
High Async Duration or Checkpointed Data Size -- The bottleneck is in the asynchronous phase. The state is too large to upload within the timeout window.
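The interpretation rule above can be sketched as a small decision helper. This is an illustrative sketch, not a product API: the function name, arguments, and the simple "largest phase wins" comparison are assumptions made for the example.

```python
def bottleneck_phase(sync_ms: int, alignment_ms: int, async_ms: int) -> str:
    """Estimate which checkpoint phase dominates, from Checkpoints History metrics.

    sync_ms      -- Sync Duration (snapshotting operator state)
    alignment_ms -- Alignment Duration (waiting for barriers)
    async_ms     -- Async Duration (uploading state to remote storage)
    """
    # Sync Duration and Alignment Duration both belong to the synchronous phase.
    synchronous_ms = sync_ms + alignment_ms
    if synchronous_ms >= async_ms:
        return "synchronous"   # consider unaligned checkpoints / buffer debloating
    return "asynchronous"      # consider higher parallelism / native savepoints

# A checkpoint that spent 90 s waiting for barriers is alignment-bound:
print(bottleneck_phase(sync_ms=500, alignment_ms=90_000, async_ms=4_000))
# → synchronous
```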
Checkpoint metrics
Navigate to O&M > Deployments, click the target deployment name, then go to the Logs tab and click Alarm. The lastCheckpointDuration and lastCheckpointSize metrics provide a coarse-grained view of historical checkpoint performance, useful for spotting trends over time.
Tune checkpoint and savepoint performance
Before applying any tuning method, make sure the deployment runtime performance meets expectations. Poor runtime performance amplifies checkpoint issues. After optimizing runtime performance, apply one or more of the following methods based on the bottleneck phase.
These methods are not mutually exclusive. Combine them if the deployment experiences bottlenecks in both phases.
Use unaligned checkpoints and buffer debloating
| Property | Details |
|---|---|
| When to use | Checkpoint or savepoint timeout caused by the synchronous phase |
| How it helps | Eliminates the need for barrier alignment, resolving timeout issues related to slow or skewed barriers. Effective for deployments of all sizes. |
| Configuration | See Checkpointing under backpressure in the Apache Flink documentation. |
Unaligned checkpoints have specific limitations. Review the Limitations section in the Apache Flink documentation before enabling this feature.
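For reference, the open-source Apache Flink keys that enable these features look like the following. This is a hedged sketch: key names and availability depend on the Flink version (buffer debloating requires Flink 1.14 or later), and the managed service may expose them differently:

```yaml
# Enable unaligned checkpoints so barriers can overtake buffered records.
execution.checkpointing.unaligned: true
# Optionally fall back to unaligned mode only after alignment stalls this long.
execution.checkpointing.aligned-checkpoint-timeout: 30s
# Shrink in-flight buffers under backpressure so barriers travel faster.
taskmanager.network.memory.buffer-debloat.enabled: true
```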
Increase parallelism
| Property | Details |
|---|---|
| When to use | Checkpoint or savepoint timeout caused by the asynchronous phase |
| How it helps | Distributes state data across more parallel tasks, reducing the amount of data each task uploads during the asynchronous phase. |
| Configuration | Adjust parallelism using the basic or expert mode of resource configuration. See Configure resources for a deployment. |
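Why higher parallelism helps can be shown with back-of-the-envelope arithmetic: if state is evenly distributed and each subtask uploads at a fixed bandwidth, per-subtask upload time falls linearly with parallelism. The function and the bandwidth figure below are illustrative assumptions, not measured values:

```python
def async_upload_seconds(state_gib: float, parallelism: int,
                         mib_per_sec_per_task: float = 100.0) -> float:
    """Rough per-subtask upload time for the asynchronous phase."""
    per_task_mib = state_gib * 1024 / parallelism  # even state distribution assumed
    return per_task_mib / mib_per_sec_per_task

# Doubling parallelism halves the per-subtask upload time for 200 GiB of state:
print(async_upload_seconds(state_gib=200, parallelism=4))  # → 512.0 seconds
print(async_upload_seconds(state_gib=200, parallelism=8))  # → 256.0 seconds
```

In practice the speedup is sublinear: skewed state distribution and shared network or storage bandwidth limit the gain.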
Use the native format for savepoints
| Property | Details |
|---|---|
| When to use | Savepoint timeout caused by the asynchronous phase |
| How it helps | The native format generates savepoints faster and consumes less storage space than the standard format. |
| Configuration | Create a savepoint in the native format for a running deployment. See the "Manually create a savepoint" section of Status set management. |
Native-format savepoints do not guarantee compatibility across major Flink versions. If cross-version compatibility is required, use the standard format instead.
References
Performance tuning for large-state deployments -- Covers issues caused by large state size and the overall tuning workflow.
Control state size to reduce backpressure in SQL deployments -- Explains how the Flink SQL optimizer selects stateful operators, and how to tune stateful computation over large datasets.
Control state size to reduce backpressure using the DataStream API -- Covers flexible state size management with the DataStream API.
Improve startup and scaling speed -- When restarting a deployment from a checkpoint or savepoint, state data is downloaded from remote storage to restore the state engine. This process can become an efficiency bottleneck. See this topic for guidance on identifying and removing performance bottlenecks during deployment startup and scaling.