Realtime Compute for Apache Flink: Improve startup and scaling speed

Last Updated: Sep 23, 2024

When you restart a deployment from a checkpoint or savepoint, state data is downloaded from remote storage to rebuild the state engine. Because this process is I/O intensive, it can become a bottleneck and significantly delay startup. This topic describes how to identify and remove performance bottlenecks during deployment startup and scaling.

Investigation steps

Perform the following steps to identify a bottleneck during deployment startup or scaling:

  1. Use diagnostic tools to analyze operator status: Use thread dumps, thread activity analysis, and flame graphs to check whether an operator thread is stuck in the initialization phase, especially in operations on the state backend, such as GeminiStateBackend. For information about how to use the diagnostic tools, see Monitor deployment performance.

  2. Identify the cause: If a stateful operator remains in the initialization phase for a long period of time, the bottleneck is likely to be caused by state data download or recovery.
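
The two steps above can be partly automated once you have a thread dump as text. The following is a minimal sketch, assuming a jstack-style dump in which each thread starts with a quoted name and each stack frame starts with "at". The hint substrings, the function name find_restoring_threads, and the class names in the sample dump are illustrative assumptions, not actual GeminiStateBackend frame names.

```python
import re

# Substrings that suggest a thread is busy restoring state. These are
# illustrative guesses, not an official list of frame names.
RESTORE_HINTS = ("restore", "StateBackend", "download")


def find_restoring_threads(dump_text, hints=RESTORE_HINTS):
    """Return {thread_name: [matching frames]} for threads whose stack mentions a hint."""
    results = {}
    current = None
    for line in dump_text.splitlines():
        # A thread section starts with the thread name in double quotes.
        header = re.match(r'^"([^"]+)"', line)
        if header:
            current = header.group(1)
            continue
        stripped = line.strip()
        # Stack frames look like "at package.Class.method(File.java:123)".
        if current and stripped.startswith("at "):
            frame = stripped[3:]
            if any(hint.lower() in frame.lower() for hint in hints):
                results.setdefault(current, []).append(frame)
    return results


# Hypothetical dump used only to exercise the parser.
SAMPLE_DUMP = '''
"Join -> Sink (1/4)#0" #87 prio=5 os_prio=0 tid=0x1 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
\tat java.io.FileInputStream.readBytes(Native Method)
\tat com.example.state.ExampleStateBackend.restoreFromRemote(ExampleStateBackend.java:120)
\tat org.example.runtime.Task.initializeState(Task.java:45)

"Source (1/4)#0" #88 prio=5 os_prio=0 tid=0x3 nid=0x4 waiting on condition
   java.lang.Thread.State: WAITING (parking)
\tat sun.misc.Unsafe.park(Native Method)
'''

for name, frames in find_restoring_threads(SAMPLE_DUMP).items():
    print(name)
    for frame in frames:
        print("  " + frame)
```

If the same stateful operator threads show up in several dumps taken a few seconds apart, that supports the conclusion in step 2 that state download or recovery is the bottleneck. Adjust the hints to match the frames you actually see in your own dumps.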

Tuning methods

The following methods can improve startup and scaling speed. Each entry lists the method, followed by its description, configuration, and usage notes.

Configure dynamic scaling

Description: This method allows you to dynamically update TaskManager parameters to reduce the service interruption time caused by deployment startup and cancellation.

Configuration: For more information, see Dynamically update the parameter configuration for dynamic scaling.

Usage notes: Dynamic parameter update is an experimental feature and may cause service interruptions. Compared with traditional methods, this feature significantly reduces the interruption time to between 5 seconds and 1 minute. The actual interruption time varies based on factors such as deployment topology and state size.

Enable local recovery

Description: This method stores snapshots locally as a backup. As a result, a smaller amount of data is downloaded from remote storage, which accelerates the recovery process. If local disk space is sufficient, this is the most suitable option.

Configuration: Add the state.backend.local-recovery: true configuration. For more information, see the "How do I configure parameters for deployment running?" section of the Reference topic.

Usage notes:

  • We recommend that you enable this experimental feature in Ververica Runtime (VVR) 8.0.8 or later.

  • This method applies only to failovers or dynamic parameter updates. If you manually cancel and restart a deployment, local recovery does not take effect.

  • Additional local disk space is required.

Use the lazy loading and delayed pruning features of GeminiStateBackend

Description: GeminiStateBackend is an enterprise-class state backend developed by Alibaba Cloud. GeminiStateBackend allows a large-state deployment to quickly start by downloading only the necessary metadata to achieve real-time data processing. Then, the system uses asynchronous download and intelligent pruning to efficiently process remote checkpoint files. This significantly reduces the interruption time and improves the efficiency by over 90%. For more information, see GeminiStateBackend.

Configuration: Add the state.backend.gemini.file.cache.download.type: LazyDownloadOnRestore configuration. For more information, see the "How do I configure parameters for deployment running?" section of the Reference topic.

Usage notes:

  • This configuration is supported only in VVR 6.0.6 or later.

  • After the deployment is restarted, a temporary performance drop occurs due to the asynchronous download of state files. Performance gradually improves as the state files are fully restored.
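
The version notes for local recovery and lazy loading can be combined into a small helper that lists which of the two settings apply to a given engine version. This is a sketch under stated assumptions: the function names are hypothetical, the version is assumed to be a plain dotted string such as 8.0.8, and the deployment is assumed to use GeminiStateBackend. This topic does not state whether the two settings are meant to be enabled together, so treat the combined output as a starting point to verify.

```python
def parse_version(text):
    """Turn '8.0.8' into (8, 0, 8) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Minimum VVR versions taken from the usage notes in this topic.
LOCAL_RECOVERY_MIN = (8, 0, 8)  # recommended version, not a hard requirement
LAZY_DOWNLOAD_MIN = (6, 0, 6)   # setting is supported only from this version


def startup_tuning_options(vvr_version):
    """Return the settings from this topic that apply to the given VVR version."""
    version = parse_version(vvr_version)
    options = {}
    if version >= LOCAL_RECOVERY_MIN:
        options["state.backend.local-recovery"] = "true"
    if version >= LAZY_DOWNLOAD_MIN:
        options["state.backend.gemini.file.cache.download.type"] = "LazyDownloadOnRestore"
    return options


def render(options):
    """Format the options as 'key: value' lines for the deployment configuration."""
    return "\n".join(f"{key}: {value}" for key, value in options.items())


print(render(startup_tuning_options("8.0.8")))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "8.0.10" would sort before "8.0.8".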

References