Realtime Compute for Apache Flink: Diagnose and prevent checkpoint and savepoint timeout

Last Updated: Sep 20, 2024

This topic describes how to diagnose and prevent checkpoint and savepoint timeout.

Background information

Apache Flink uses the Chandy-Lamport algorithm to facilitate state management and ensure data consistency and reliability. The execution of a checkpoint or savepoint includes two phases:

  1. Synchronous phase: In this phase, the system waits for the barriers to align and maintains the required resources. Barriers are a special type of data record that can be passed between operators. The time required for barrier alignment is proportional to the arrival delay of data records.

  2. Asynchronous phase: In this phase, operators upload their local state to a remote persistent storage system. The upload time is proportional to the state size.
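
For reference, the following minimal sketch shows how a checkpoint interval and a checkpoint timeout can be set with the DataStream API. The interval, timeout, and pause values are illustrative assumptions and must be adapted to your deployment; in Realtime Compute for Apache Flink, the same settings can also be supplied through the deployment configuration.

```java
import java.time.Duration;

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 3 minutes (illustrative value).
        env.enableCheckpointing(Duration.ofMinutes(3).toMillis());

        CheckpointConfig config = env.getCheckpointConfig();
        // Abort a checkpoint that does not complete within 10 minutes (illustrative value).
        config.setCheckpointTimeout(Duration.ofMinutes(10).toMillis());
        // Leave at least 1 minute between the end of one checkpoint and the start of the next.
        config.setMinPauseBetweenCheckpoints(Duration.ofMinutes(1).toMillis());

        // Placeholder pipeline; replace with your own sources, operators, and sinks.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-setup-sketch");
    }
}
```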

Note

Backpressure slows down barrier alignment in the synchronous phase, which causes checkpoints and savepoints to time out. Before you address checkpoint and savepoint timeout issues, make sure that backpressure has been resolved; doing so alone often improves efficiency and stability. For more information, see Control state size to reduce backpressure in SQL deployments and Control state size to reduce backpressure using the DataStream API.

Diagnostic tools

If checkpoints and savepoints still time out after you resolve the backpressure issue, analyze the alignment time in the synchronous phase and the upload time in the asynchronous phase.

Checkpoint UI

On the O&M > Deployments page, find the deployment that you want to manage and click its name. On the page that appears, click the Logs tab and choose Checkpoints > Checkpoints History. You can view deployment-level, operator-level, and subtask-level metrics to analyze the causes of timeout.

Figure: Diagnostic method for checkpoint and savepoint timeout

Take note of the operators for which checkpoints time out or are still in progress, and monitor the following metrics (a programmatic alternative is sketched after this list):

  • Sync Duration and Alignment Duration: High values for these metrics indicate that the bottleneck is in the synchronous phase.

  • Async Duration and Checkpointed Data Size: High values for these metrics indicate that the bottleneck is in the asynchronous phase.
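
If the Flink REST API is reachable for your deployment, which may not be the case in every managed environment, the same statistics can also be pulled programmatically. The endpoint URL and job ID in the following sketch are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointHistorySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder values: substitute the REST endpoint of your JobManager and your job ID.
        String restEndpoint = "http://localhost:8081";
        String jobId = "<job-id>";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(restEndpoint + "/jobs/" + jobId + "/checkpoints"))
                .GET()
                .build();

        // The JSON response contains the checkpoint history, including the end-to-end
        // duration and checkpointed data size of each checkpoint. Per-subtask sync,
        // async, and alignment durations are available under
        // /jobs/<job-id>/checkpoints/details/<checkpoint-id>.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```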

Checkpoint metrics

On the O&M > Deployments page, find the deployment that you want to manage and click its name. On the page that appears, click the Logs tab, and then click the Alarm tab. You can view the lastCheckpointDuration and lastCheckpointSize metrics to analyze the duration and size of historical checkpoints at a coarse granularity.

Tuning methods

Before you tune checkpoint and savepoint performance, make sure that the runtime performance of the deployment meets your expectations. If it does not, optimize the deployment first based on the optimization guideline. After the runtime performance is acceptable, you can use the methods described in the following table to improve the efficiency of generating checkpoints and savepoints.

| Method | Description | Scenario | Configuration | Usage notes |
| --- | --- | --- | --- | --- |
| Use unaligned checkpoints and buffer debloating | This method effectively resolves timeout issues related to alignment and is suitable for deployments of all sizes. | Checkpoint and savepoint timeout in the synchronous phase | For more information about how to configure the parameters, see Checkpointing under backpressure. A configuration sketch also follows this table. | For more information about the relevant usage notes, see Limitations. |
| Increase the parallelism | This method reduces the amount of state data of each parallel task to accelerate the asynchronous phase. | Checkpoint and savepoint timeout in the asynchronous phase | Use the basic or expert mode of resource configuration. For more information, see Configure resources for a deployment. | N/A |
| Use the native format for savepoints | Compared with the standard format, the native format is generated faster and consumes less storage space. | Savepoint timeout in the asynchronous phase | Create a savepoint in the native format for a running deployment. For more information, see the "Manually create a savepoint" section of the Status set management topic. | Compatibility across major versions is not guaranteed. |
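
As a reference for the first method, the following minimal sketch shows how unaligned checkpoints can be enabled with the DataStream API. The aligned checkpoint timeout value and the buffer debloating key in the comments are illustrative assumptions, and the exact options can vary across Flink versions, so treat Checkpointing under backpressure as the authoritative reference.

```java
import java.time.Duration;

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(Duration.ofMinutes(3).toMillis());

        CheckpointConfig config = env.getCheckpointConfig();
        // Let checkpoint barriers overtake buffered records so that alignment under
        // backpressure no longer blocks the synchronous phase.
        config.enableUnalignedCheckpoints();
        // Start with aligned checkpoints and fall back to the unaligned mode only if
        // alignment takes longer than 30 seconds (illustrative value).
        config.setAlignedCheckpointTimeout(Duration.ofSeconds(30));

        // Buffer debloating is a TaskManager-level setting rather than a job API call.
        // It is typically enabled in the deployment configuration, for example:
        //   taskmanager.network.memory.buffer-debloat.enabled: true

        // Placeholder pipeline; replace with your own sources, operators, and sinks.
        env.fromElements(1, 2, 3).print();
        env.execute("unaligned-checkpoint-sketch");
    }
}
```

Starting with aligned checkpoints and switching to the unaligned mode only after the timeout limits the amount of in-flight data that unaligned checkpoints must persist.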

References

  • For information about the issues caused by large state size and the tuning workflow, see Performance tuning for large-state deployments.

  • Flink SQL uses an optimizer to select stateful operators based on parameter configurations and SQL statements. A basic understanding of the underlying mechanisms is necessary to optimize the performance of stateful computation over massive data. For more information, see Control state size to reduce backpressure in SQL deployments.

  • The Apache Flink DataStream API allows you to manage state size in a flexible manner. For more information, see Control state size to reduce backpressure using the DataStream API.

  • When you restart a deployment based on a checkpoint or savepoint, the state data is downloaded from the remote storage to restore the state engine. This process can be an efficiency bottleneck and may cause significant delays. For information about how to identify and remove performance bottlenecks during deployment startup and scaling, see Improve startup and scaling speed.