This topic describes how to configure quick task restart. This feature helps reduce the impact on a deployment if a failover occurs.
Background Information
This feature is in invitational preview. Exercise caution when you use this feature in the production environment. If you encounter any issue, submit a ticket for technical support.
In most cases, if a task in a Flink streaming deployment fails, all tasks in the same pipeline region perform a failover to ensure data consistency. After the failover, the source node resumes consuming data from the previous checkpoint. For some deployments, however, the restarted tasks must also download large source files or state data. If the parallelism of a deployment is high, its tasks may take a long time to perform the failover. As a result, the deployment may be delayed or blocked, cannot consume data for a period of time, and may take a long time to recover.
To resolve this issue, you can configure quick task restart. After you configure quick task restart, only the failed task is restarted when a task fails. This prevents the source from re-consuming data from the previous checkpoint because a task on a non-source node failed. It also shortens the period during which a deployment cannot consume data due to task cancellation, restart, or data tracking. When the parallelism of a deployment is high, quick task restart also relieves the pressure that task scheduling and initialization place on the cluster. Together, these effects reduce the impact of failovers on deployments.
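The difference between the two behaviors can be pictured with a toy model. The following Python sketch is illustrative only: the task names and the tasks_to_restart function are hypothetical and are not part of Flink or of this feature. It only counts which tasks would be restarted in each mode.

```python
def tasks_to_restart(region_tasks, failed_task, quick_restart):
    """Return the set of tasks restarted after `failed_task` fails.

    region_tasks: names of all tasks in the failed task's pipeline region.
    quick_restart: True models quick task restart, False models the
    default behavior in which the whole pipeline region fails over.
    """
    if failed_task not in region_tasks:
        raise ValueError(f"unknown task: {failed_task}")
    if quick_restart:
        # Only the failed task is restarted.
        return {failed_task}
    # Every task in the pipeline region is restarted.
    return set(region_tasks)


# Hypothetical pipeline region with two sources, two maps, and one sink.
region = ["source-0", "source-1", "map-0", "map-1", "sink-0"]
default_restart = tasks_to_restart(region, "map-1", quick_restart=False)
quick = tasks_to_restart(region, "map-1", quick_restart=True)
print(len(default_restart), len(quick))
```

In this model, a failure of map-1 restarts all five tasks by default, including both sources, but only map-1 when quick task restart is configured.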
The quick task restart feature supports two data consistency semantics: APPROXIMATE and AT_LEAST_ONCE. The APPROXIMATE semantics ensures neither data integrity nor data deduplication, but does not incur performance overhead. The AT_LEAST_ONCE semantics ensures data integrity but not data deduplication. It incurs additional performance overhead and requires Object Storage Service (OSS) for data storage.
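The trade-off between the two semantics can be summarized as a small decision helper. This is a minimal sketch based only on the properties described above. The choose_semantics function and its parameters are hypothetical and are not part of the product.

```python
def choose_semantics(requires_data_integrity, oss_available):
    """Pick a quick task restart semantics from two yes/no requirements.

    requires_data_integrity: True if no data may be lost after a restart.
    oss_available: True if OSS can be used for data storage.
    """
    if not requires_data_integrity:
        # APPROXIMATE ensures neither integrity nor deduplication,
        # but it adds no performance overhead.
        return "APPROXIMATE"
    if not oss_available:
        raise ValueError("AT_LEAST_ONCE requires OSS for data storage")
    # AT_LEAST_ONCE ensures integrity (duplicates are still possible)
    # at the cost of additional performance overhead.
    return "AT_LEAST_ONCE"


print(choose_semantics(requires_data_integrity=True, oss_available=True))
```

Neither semantics deduplicates data, so downstream consumers must tolerate duplicates in both cases.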
Limits
You cannot configure quick task restart for streaming deployments that use finite data sources.
You cannot use quick task restart together with unaligned checkpoints.
You cannot configure quick task restart for batch deployments.
In the same session cluster, you cannot run deployments for which quick task restart is configured at the same time as deployments for which quick task restart is not configured.
If an operator in a deployment uses the prepareSnapshotPreBarrier method or checkpoint-related information is sent when a deployment runs, you cannot use the AT_LEAST_ONCE semantics.
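The limits above can be expressed as a pre-flight checklist. The following Python sketch is an illustration under stated assumptions: DeploymentProfile and limit_violations are hypothetical names, and the fields must be filled in by hand because the sketch does not inspect a real deployment.

```python
from dataclasses import dataclass


@dataclass
class DeploymentProfile:
    """Hypothetical hand-written summary of a deployment (not a real API)."""
    is_batch: bool = False
    has_finite_source: bool = False
    unaligned_checkpoints: bool = False
    shares_session_with_unconfigured: bool = False
    uses_prepare_snapshot_pre_barrier: bool = False
    sends_checkpoint_info_at_runtime: bool = False


def limit_violations(profile, semantics):
    """Return a list of the documented limits that the profile violates."""
    if semantics not in ("APPROXIMATE", "AT_LEAST_ONCE"):
        raise ValueError(f"unknown semantics: {semantics}")
    problems = []
    if profile.has_finite_source:
        problems.append("streaming deployment uses a finite data source")
    if profile.unaligned_checkpoints:
        problems.append("unaligned checkpoints are enabled")
    if profile.is_batch:
        problems.append("batch deployments are not supported")
    if profile.shares_session_with_unconfigured:
        problems.append("session cluster mixes configured and unconfigured deployments")
    # The last limit applies only to the AT_LEAST_ONCE semantics.
    if semantics == "AT_LEAST_ONCE" and (
        profile.uses_prepare_snapshot_pre_barrier
        or profile.sends_checkpoint_info_at_runtime
    ):
        problems.append("operator behavior is incompatible with AT_LEAST_ONCE")
    return problems


profile = DeploymentProfile(uses_prepare_snapshot_pre_barrier=True)
print(limit_violations(profile, "AT_LEAST_ONCE"))
```

An empty list means that none of the limits listed in this section apply to the described deployment.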
Precautions
Semantics | Precautions |
APPROXIMATE |
|
AT_LEAST_ONCE |
|
Procedure
Go to the entry point to configure quick task restart.
Log on to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, click Deployments. On the Deployments page, click the name of the desired deployment.
In the upper-right corner of the Parameters section on the Configuration tab, click Edit.
In the Other Configuration field, add the following code.
Code that is required to be added for the APPROXIMATE semantics:
individual-task-failover.enabled: enabled_approximate
shuffle-service-factory.class: org.apache.flink.runtime.io.network.IndividualRecoverableNettyShuffleServiceFactory
Code that is required to be added for the AT_LEAST_ONCE semantics:
individual-task-failover.enabled: enabled
shuffle-service-factory.class: org.apache.flink.runtime.io.network.IndividualRecoverableNettyShuffleServiceFactory
individual-task-failover.intermediate-checkpointing.interval: <interval in milliseconds>
classloader.check-leaked-classloader: false
The individual-task-failover.intermediate-checkpointing.interval parameter specifies the period of time in which a task of a deployment refreshes its state. Unit: milliseconds. We recommend that you set this parameter to a value ranging from one tenth to one fifth of the checkpointing period.
If your deployment is deployed in a session cluster, you must also add the preceding code to the configuration of the session cluster. For more information about how to configure a deployment in a session cluster, see Step 1: Create a session cluster.
In the upper-right corner of the Parameters section, click Save.
In the upper part of the Deployments page, click Cancel to stop the deployment.
On the Deployments page, find the deployment that you want to start and click Start in the Actions column.
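The settings that are added to the Other Configuration field in the preceding procedure can also be generated programmatically. The following Python sketch is a minimal example: the function names are hypothetical, and the choice of one tenth of the checkpointing period for the intermediate checkpointing interval is an assumption that simply picks the lower end of the recommended range.

```python
SHUFFLE_FACTORY = (
    "org.apache.flink.runtime.io.network."
    "IndividualRecoverableNettyShuffleServiceFactory"
)


def recommended_interval_range_ms(checkpoint_interval_ms):
    """Return (one tenth, one fifth) of the checkpointing period in ms."""
    if checkpoint_interval_ms <= 0:
        raise ValueError("checkpoint interval must be positive")
    return checkpoint_interval_ms // 10, checkpoint_interval_ms // 5


def quick_restart_config(semantics, checkpoint_interval_ms=None):
    """Render the key-value lines for the Other Configuration field."""
    if semantics == "APPROXIMATE":
        entries = {
            "individual-task-failover.enabled": "enabled_approximate",
            "shuffle-service-factory.class": SHUFFLE_FACTORY,
        }
    elif semantics == "AT_LEAST_ONCE":
        if checkpoint_interval_ms is None:
            raise ValueError("AT_LEAST_ONCE needs the checkpointing period")
        # Assumption: use the lower end (one tenth) of the recommended range.
        low, _high = recommended_interval_range_ms(checkpoint_interval_ms)
        entries = {
            "individual-task-failover.enabled": "enabled",
            "shuffle-service-factory.class": SHUFFLE_FACTORY,
            "individual-task-failover.intermediate-checkpointing.interval": low,
            "classloader.check-leaked-classloader": "false",
        }
    else:
        raise ValueError(f"unknown semantics: {semantics}")
    return "\n".join(f"{key}: {value}" for key, value in entries.items())


# Example: a checkpointing period of 3 minutes (180000 ms).
print(quick_restart_config("AT_LEAST_ONCE", checkpoint_interval_ms=180000))
```

For a checkpointing period of 180000 milliseconds, the recommended range is 18000 to 36000 milliseconds, and this sketch emits 18000. If the deployment runs in a session cluster, the same lines must also be added to the configuration of the session cluster.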