Realtime Compute for Apache Flink: Diagnose and prevent checkpoint and savepoint timeout

Last Updated: Feb 27, 2026

A checkpoint or savepoint timeout originates in one of the two phases of the checkpointing process: either barrier alignment is slow (synchronous phase) or state upload is slow (asynchronous phase). This topic explains how to identify the bottleneck phase and apply the appropriate tuning method.

Checkpoint phases

Realtime Compute for Apache Flink uses the Chandy-Lamport algorithm for state management, ensuring data consistency and reliability. Every checkpoint or savepoint goes through two phases:

  1. Synchronous phase -- The system waits for barriers to align across operators. Barriers are a special type of data record passed between operators. The time required for barrier alignment is proportional to the arrival delay of data records.

  2. Asynchronous phase -- Each operator uploads its local state to a remote persistent storage system. Upload time is proportional to the state size.

Important

Backpressure slows down barrier propagation during the synchronous phase, which directly causes checkpoint and savepoint timeout. Resolve backpressure before investigating timeout issues. For guidance, see Control state size to reduce backpressure in SQL deployments and Control state size to reduce backpressure using the DataStream API.

Identify the bottleneck phase

After resolving backpressure, if checkpoints or savepoints still time out, use the following tools to determine whether the bottleneck is in the synchronous phase or the asynchronous phase.

Checkpoint history UI

Navigate to O&M > Deployments, click the target deployment name, then go to Logs > Checkpoints > Checkpoints History. This view provides deployment-level, operator-level, and subtask-level metrics.

Identify the operators where checkpoints are timing out or still in progress, then check the following metrics:

| Metric | Description |
| --- | --- |
| Sync Duration | Total time spent in the synchronous phase, including snapshotting operator state. |
| Alignment Duration | Time between processing the first and the last checkpoint barrier. High values indicate uneven data distribution across input channels. |
| Async Duration | Time spent uploading state to remote storage. |
| Checkpointed Data Size | Volume of state data written during the checkpoint. |

How to interpret the results:

  • High Sync Duration or Alignment Duration -- The bottleneck is in the synchronous phase. Barriers are traveling slowly through the job graph, typically due to residual backpressure or channel skew.

  • High Async Duration or Checkpointed Data Size -- The bottleneck is in the asynchronous phase. The state is too large to upload within the timeout window.
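The interpretation rules above can be sketched as a small helper. This is only an illustration, not part of the product: the class name, the function name, and the 50% threshold are hypothetical, and the metric values are assumed to be copied by hand from the Checkpoints History view. Checkpointed Data Size is left out because what counts as "large" depends on the upload throughput of the storage system.

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpointStats:
    """Durations (in milliseconds) read from the Checkpoints History view."""
    sync_duration_ms: int
    alignment_duration_ms: int
    async_duration_ms: int


def classify_bottleneck(stats: SubtaskCheckpointStats,
                        timeout_ms: int,
                        dominant_share: float = 0.5) -> str:
    """Return 'synchronous', 'asynchronous', 'both', or 'unclear'.

    dominant_share is a hypothetical threshold: a phase is flagged when it
    consumes at least this fraction of the checkpoint timeout.
    """
    limit = dominant_share * timeout_ms
    # Sync Duration and Alignment Duration both point at the synchronous phase.
    sync_slow = max(stats.sync_duration_ms, stats.alignment_duration_ms) >= limit
    async_slow = stats.async_duration_ms >= limit
    if sync_slow and async_slow:
        return "both"
    if sync_slow:
        return "synchronous"
    if async_slow:
        return "asynchronous"
    return "unclear"


if __name__ == "__main__":
    # Example: a subtask that spends 7 of the 10 timeout minutes aligning barriers.
    skewed = SubtaskCheckpointStats(sync_duration_ms=2_000,
                                    alignment_duration_ms=420_000,
                                    async_duration_ms=15_000)
    print(classify_bottleneck(skewed, timeout_ms=600_000))  # synchronous
```

A result of "both" suggests combining the tuning methods described later in this topic.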

Checkpoint metrics

Navigate to O&M > Deployments, click the target deployment name, then go to the Logs tab and click Alarm. The lastCheckpointDuration and lastCheckpointSize metrics provide a coarse-grained view of historical checkpoint performance, useful for spotting trends over time.
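One way to quantify a trend in lastCheckpointDuration is a least-squares slope over recent samples. The following is a minimal sketch that assumes the metric values have been exported by hand into a list ordered from oldest to newest; the function name is hypothetical and no console API is involved.

```python
def duration_trend_ms_per_checkpoint(durations_ms: list[float]) -> float:
    """Least-squares slope of checkpoint duration against checkpoint index.

    A positive result means each checkpoint takes longer, on average,
    than the one before it.
    """
    n = len(durations_ms)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(durations_ms) / n
    numerator = sum((i - mean_x) * (d - mean_y)
                    for i, d in enumerate(durations_ms))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical samples: duration grows by one second per checkpoint.
    print(duration_trend_ms_per_checkpoint([1000, 2000, 3000, 4000]))  # 1000.0
```

A steadily positive slope is a hint that durations will eventually reach the timeout, even if current checkpoints still complete.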

Tune checkpoint and savepoint performance

Before applying any tuning method, make sure the deployment runtime performance meets expectations. Poor runtime performance amplifies checkpoint issues. After optimizing runtime performance, apply one or more of the following methods based on the bottleneck phase.

These methods are not mutually exclusive. Combine them if the deployment experiences bottlenecks in both phases.

Use unaligned checkpoints and buffer debloating

| Property | Details |
| --- | --- |
| When to use | Checkpoint or savepoint timeout caused by the synchronous phase |
| How it helps | Eliminates the need for barrier alignment, resolving timeout issues related to slow or skewed barriers. Effective for deployments of all sizes. |
| Configuration | See Checkpointing under backpressure in the Apache Flink documentation. |

Unaligned checkpoints have specific limitations. Review the Limitations section in the Apache Flink documentation before enabling this feature.
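As a rough illustration of what the configuration might look like, the sketch below renders two settings as `key: value` lines. The key names are the ones used by open-source Apache Flink as far as I am aware; they are an assumption here, can differ between Flink versions (newer releases rename the unaligned switch), and should be verified against the documentation for the engine version of the deployment. The helper names are hypothetical.

```python
def backpressure_checkpoint_config() -> dict[str, str]:
    """Settings for unaligned checkpoints and buffer debloating.

    Key names are assumed from open-source Apache Flink; verify them
    for the engine version in use before applying.
    """
    return {
        # Let barriers overtake buffered records instead of waiting for alignment.
        "execution.checkpointing.unaligned": "true",
        # Shrink in-flight network buffers so barriers have less data ahead of them.
        "taskmanager.network.memory.buffer-debloat.enabled": "true",
    }


def render_config(config: dict[str, str]) -> str:
    """Render the settings as one 'key: value' line per entry."""
    return "\n".join(f"{key}: {value}" for key, value in config.items())


if __name__ == "__main__":
    print(render_config(backpressure_checkpoint_config()))
```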

Increase parallelism

| Property | Details |
| --- | --- |
| When to use | Checkpoint or savepoint timeout caused by the asynchronous phase |
| How it helps | Distributes state data across more parallel tasks, reducing the amount of data each task uploads during the asynchronous phase. |
| Configuration | Adjust parallelism using the basic or expert mode of resource configuration. See Configure resources for a deployment. |
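The effect of parallelism on upload time can be estimated with simple arithmetic. The sketch below assumes that state is spread evenly across subtasks (no key skew) and that each subtask uploads at a constant rate; both are simplifications, and all function names and figures are hypothetical.

```python
import math


def estimated_upload_seconds(total_state_bytes: int,
                             parallelism: int,
                             upload_bytes_per_second: float) -> float:
    """Upload time per subtask, assuming evenly distributed state."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    per_subtask_bytes = total_state_bytes / parallelism
    return per_subtask_bytes / upload_bytes_per_second


def min_parallelism_for_budget(total_state_bytes: int,
                               upload_bytes_per_second: float,
                               async_budget_seconds: float) -> int:
    """Smallest parallelism whose estimated upload fits the time budget."""
    needed = total_state_bytes / (upload_bytes_per_second * async_budget_seconds)
    return max(1, math.ceil(needed))


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    # Hypothetical job: 600 GiB of state, 50 MiB/s per subtask, 300 s budget.
    print(min_parallelism_for_budget(600 * GIB, 50 * MIB, 300))  # 41
```

In practice, skewed keys leave some subtasks with far more state than the average, so treat the result as a lower bound rather than a target.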

Use the native format for savepoints

| Property | Details |
| --- | --- |
| When to use | Savepoint timeout caused by the asynchronous phase |
| How it helps | The native format generates savepoints faster and consumes less storage space than the standard format. |
| Configuration | Create a savepoint in the native format for a running deployment. See the "Manually create a savepoint" section of Status set management. |

Important

Native-format savepoints do not guarantee compatibility across major Flink versions. If cross-version compatibility is required, use the standard format instead.

References