
Realtime Compute for Apache Flink: Configure automatic tuning

Last Updated: May 08, 2025

Realtime Compute for Apache Flink supports two automatic tuning modes: Autopilot and scheduled tuning. This topic describes how to configure automatic tuning and precautions to take note of.

Background information

In most cases, a large amount of time is required for tuning the performance of a job. For example, before a new job is started, you must assign resources and set the parallelism and the number and size of TaskManagers. During job running, you may need to adjust the assigned resources to maximize their utilization. When a job is back pressured or has increased latency, you have to adjust job configurations. To alleviate the burden of manual tuning, Realtime Compute for Apache Flink offers the automatic tuning feature. The following table briefly describes the automatic tuning modes, which you can select as needed:

Tuning mode: Autopilot

Scenarios: After a job that uses 30 compute units (CUs) runs for a period of time, the CPU utilization and memory usage are sometimes excessively low even though the source has no latency or backpressure. If you do not want to adjust resources manually, you can enable Autopilot so that the system adjusts them automatically: if the resource usage is low, the system downgrades the resource configuration; if the resource usage reaches the specified threshold, the system upgrades the resource configuration.

Benefits:

  • Helps you adjust the deployment parallelism and resource configuration based on your business requirements.

  • Globally optimizes your deployment, which helps handle performance issues such as low throughput, upstream and downstream backpressure, and wasted resources.

Reference: Enable and configure Autopilot

Tuning mode: Scheduled tuning

Scenarios: A scheduled tuning plan maps time points to resource configurations and can contain multiple such mappings. To use one, you must know the resource usage during each period of time and configure resources based on the characteristics of your business during that period. For example, if the business peak hours in a day are from 09:00:00 to 19:00:00 and the off-peak hours are from 19:00:00 to 09:00:00 of the next day, you can enable scheduled tuning to use 30 CUs for your job during peak hours and 10 CUs during off-peak hours.

Reference: Configure and apply a scheduled plan

Limits

  • A maximum of 20 resource plans can be created.

  • You cannot modify the job parallelism if you have enabled unaligned checkpointing.

  • Autopilot is not supported for jobs running on session clusters.

  • Automatic tuning is not supported for jobs developed using YAML.

  • You cannot use Autopilot and scheduled tuning at the same time. If you want to change the tuning mode, you must first disable the tuning mode in use.

  • Scheduled tuning plans are also mutually exclusive. You can apply only one scheduled tuning plan at a time. If you want to change the scheduled tuning plan, you must first stop the plan in use.

Usage notes

  • Automatic tuning may cause job restarts, momentarily halting data consumption.

    Note

    In Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 8.0.1 or later, after automatic tuning is enabled, the system first attempts to dynamically update the resource configurations. Only if this proves ineffective does it restart the deployment. Depending on the job status and business logic, the service downtime during a dynamic update is generally 30% to 98% shorter than that of a deployment restart. Currently, only the job parallelism can be updated dynamically. For more information, see Dynamically update the parameter configuration for dynamic scaling.

  • Do not hardcode the parallelism of jobs developed in SQL with a custom connector or of jobs that use the DataStream API. Otherwise, automatic tuning configurations do not take effect. (A sketch after this list shows what to avoid in a DataStream job.)

  • Autopilot cannot resolve all performance bottlenecks of streaming jobs.

    A performance bottleneck of a streaming job can occur on Flink or any upstream or downstream system. If it occurs on Flink, Autopilot can remove it if certain conditions are met, such as smooth traffic changes, no data skew, and linear operator throughput scaling with increased job parallelism. If your business logic deviates from the preceding conditions, some issues like the following may occur:

    • The job parallelism cannot be modified, or your job cannot reach a normal state and is repeatedly restarted.

    • The performance of user-defined scalar functions (UDFs), user-defined aggregate functions (UDAFs), or user-defined table-valued functions (UDTFs) deteriorates.

  • Autopilot cannot identify issues that occur on external systems. Therefore, you need to troubleshoot them.

    If an external system fails or responds slowly, your job parallelism will increase, which leads to additional load on the external system, potentially causing it to break down. The following common issues may occur on external systems:

    • DataHub partitions are insufficient or the throughput of ApsaraMQ for RocketMQ is low.

    • The performance of the sink operator is low.

    • A deadlock occurs on an ApsaraDB RDS database.

  • When changes are made, the system compares pre- and post-change resource configurations to determine the adjustment method.

    If CPU or memory resource modifications are involved, the system restarts the deployment for the changes to take effect. This may lead to service downtime, data recovery delays, and startup failures due to insufficient resources. If only the job parallelism is changed, dynamic update is implemented to reduce service downtime. For more information, see Dynamically update the parameter configuration for dynamic scaling.
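The following PyFlink sketch illustrates the hardcoded-parallelism pitfall mentioned above; the same applies to the Java DataStream API. The job itself is hypothetical; the point is that explicit set_parallelism() calls override whatever Autopilot or a scheduled plan decides.

```python
# Hypothetical PyFlink job skeleton. Explicit parallelism settings like the
# commented-out calls below pin the parallelism and prevent automatic tuning
# from taking effect.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Avoid: pins the whole job to a fixed parallelism of 4.
# env.set_parallelism(4)

ds = env.from_collection([1, 2, 3, 4, 5])

# Avoid: pins this operator to a fixed parallelism of 2.
# mapped = ds.map(lambda x: x * 2).set_parallelism(2)

mapped = ds.map(lambda x: x * 2)  # let the platform choose and adjust parallelism
mapped.print()

env.execute("autopilot-friendly-job")
```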

Enable and configure Autopilot

Autopilot strategies

Strategy: (Recommended) Adaptive strategy

Scenarios: Ideal for workloads with significant or frequent resource fluctuations, high sensitivity to latency, long-running tasks, concurrent tasks, or issues like data skew or uneven load.

Benefits: Flink dynamically modifies a job's resource configurations based on the real-time resource usage and delay metrics. The system can quickly respond to the demands of the job and improves the efficiency and adaptability of resource configuration.

Strategy: Stable strategy

Scenarios: Ideal for regular, scheduled, or long-running workloads with stable resource needs, costly restart processes, and high stability demands.

Benefits: Flink aims for a fixed or scheduled resource plan that accommodates a job's entire running cycle. This strategy minimizes the impact of job restarts, ensuring stable operation while reducing unnecessary changes and fluctuations. Eventually, the job's resource configurations reach convergence.

Note

Flink dynamically adjusts resources only when it identifies configurations better suited for the job's entire running cycle. After resource configurations become stable, Flink outputs a resource plan for you to save and apply. For more information, see Save a resource plan.

Procedure

  1. Go to the Autopilot configuration page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. Find the target workspace and click Console in the Actions column.

    3. In the left-side navigation pane, choose O&M > Deployments. Then click the name of the target deployment.

    4. On the Resources tab, select the Autopilot Mode subtab.

  2. Turn on Autopilot.

    Autopilot Mode Applying is displayed below the Resources tab. If you want to disable Autopilot, you can switch off Autopilot or click Turn Off Autopilot.

  3. In the Configurations section, click Edit. Choose the desired Autopilot strategy and modify the configurations.

    (Recommended) Adaptive strategy

    Item

    Description

    Max CPU

    Specifies the maximum number of CPU cores a job can utilize, with a default limit of 64 cores.

    Max Memory

    Specifies the maximum amount of memory a job can utilize, with a default limit of 256 GiB.

    Max Parallelism

    Specifies the maximum parallelism of a job, with a default limit of 2048.

    Note

    When Flink is used with a message queue service, such as ApsaraMQ for Kafka, Message Queue, or Simple Log Service, the maximum parallelism of a job is constrained by the number of partitions in these services. If the maximum parallelism exceeds the number of partitions, the system automatically adjusts it to match the partition count.

    Min Parallelism

    Specifies the minimum parallelism of a job. Default value: 1.

    Scale Up Rules

    Specifies the conditions for scaling up. Whenever any condition is met, a scale-up is triggered. You can click Disable to deactivate conditions as needed.

    • Delays stay above a threshold for a period of time.

    • An operator's average busy rate stays above a threshold (%) for a period of time.

    • A TaskManager's memory utilization stays above a threshold (%) for a period of time.

    • An out of memory (OOM) error occurs.

    • The percentage of time spent in garbage collection per second by the TaskManager or JobManager stays above a threshold (%) for a period of time.

    Note
    • You can set the thresholds based on historical statistics or leave the default values unchanged. If neither approach meets your needs, start with loose thresholds and adjust them later. The range of a percentage threshold is 0-100%.

    • Setting an appropriate duration for exceeding a threshold is key to filtering out brief fluctuations. This prevents frequent scale-ups due to transient anomalies and ensures that a scale-up occurs only when it is genuinely needed. Choose the unit for the duration as needed. (The sketch after this strategy's settings illustrates how such a sustained-threshold condition can be read.)

    • You do not need to specify a count for the OOM condition; simply enable or disable it.

    Scale Down Rules

    Specifies the conditions for scaling down. Whenever any condition is met, a scale-down is triggered. You can click Disable to deactivate conditions as needed. Thresholds are measured in percentage, and the unit for duration can be chosen as needed.

    • An operator's average busy rate stays below a threshold (%) for a period of time.

    • A TaskManager's memory utilization stays below a threshold (%) for a period of time.

    Advanced Rules

    Advanced rules are undergoing testing and are not yet generally available. If you want to set advanced rules, contact us.
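    To make the scale-up and scale-down rules above concrete, the following sketch shows one way to read a condition of the form "a metric stays above a threshold for a period of time". It is an illustration only, not the platform's actual evaluation logic; the function name and sampling model are assumptions.

```python
# Illustrative only: a sustained-threshold check. A single spike does not
# satisfy the condition; every sample within the configured duration must
# exceed the threshold.
def sustained_above(samples, threshold, duration_s, sample_interval_s):
    """samples: recent metric values (e.g. busy rate in %), oldest first,
    collected every sample_interval_s seconds."""
    needed = max(1, duration_s // sample_interval_s)
    if len(samples) < needed:
        return False  # not enough history yet to decide
    return all(value > threshold for value in samples[-needed:])

# Example: busy rate sampled once a minute; rule = "above 80% for 5 minutes".
busy_rate = [70, 85, 88, 90, 86, 92]
print(sustained_above(busy_rate, threshold=80, duration_s=300, sample_interval_s=60))  # True
```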

    Stable strategy

    Item

    Description

    Cooldown Minutes

    The time interval between restarts caused by Autopilot. Default value: 10 minutes.

    Max CPU

    Specifies the maximum number of CPU cores a job can utilize, with a default limit of 16 cores.

    Max Memory

    Specifies the maximum amount of memory a job can utilize, with a default limit of 64 GiB.

    Max Delay

    Specifies the maximum delay that is allowed. Default value: 1 minute.

    More Configurations

    The following parameters can be configured. (A summary sketch with the documented defaults follows this list.)

    • mem.scale-down.interval: the minimum interval between memory scale-downs triggered by Autopilot.

      Default value: 4 hours. This means that when memory utilization stays below the configured threshold, the system adjusts the memory allocation or recommends a memory scale-down at most once every four hours to optimize resource utilization.

    • parallelism.scale.max: the maximum parallelism.

      Default value: -1. This value indicates that the maximum parallelism is not limited.

      Note

      When Flink is used with a message queue service, such as ApsaraMQ for Kafka, Message Queue, or Simple Log Service, the maximum parallelism of a job is constrained by the number of partitions in these services. If the maximum parallelism exceeds the number of partitions, the system automatically adjusts it to match the partition count.

    • parallelism.scale.min: the minimum parallelism.

      Default value: 1. This value indicates that the minimum parallelism is 1.

    • delay-detector.scale-up.threshold: the maximum delay that is allowed. The throughput of a job is measured based on the delay of source data consumption.

      Default value: 1 minute. If insufficient data processing capacity leads to a delay of 1 minute or more, the system will increase the job parallelism or unchain operators to increase the throughput; or it will recommend a scale-up.

    • slot-usage-detector.scale-up.threshold: the upper threshold for the slot usage of non-source operators. If the slot usage stays above the threshold for a certain period of time, Flink increases the parallelism to improve processing capacity. Default value: 0.8.

    • slot-usage-detector.scale-down.threshold: the lower threshold for the slot usage of non-source operators. If the slot usage stays below the threshold for a certain period of time, Flink reduces the parallelism to improve resource usage. Default value: 0.2.

    • slot-usage-detector.scale-up.sample-interval: the interval at which the slot usage is sampled. This parameter is used to calculate the average slot usage.

      Default value: 3 minutes. This parameter is used together with the slot-usage-detector.scale-up.threshold and slot-usage-detector.scale-down.threshold parameters. If the average slot usage within a 3-minute period is greater than 0.8, a scale-up is performed. If the average slot usage within a 3-minute period is less than 0.2, a scale-down is performed.

    • resources.memory-scale-up.max: the maximum memory size of a TaskManager and the JobManager.

      Default value: 16 GiB. When Autopilot adjusts memory or increases the parallelism, the memory of a TaskManager or the JobManager cannot exceed 16 GiB.
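    For reference, the sketch below collects the parameters above with their documented default values. The Python dict is only a reading aid; the string value formats (such as "4h" or "1min") are assumptions and are not guaranteed to match what the More Configurations field accepts.

```python
# Hypothetical summary of the stable-strategy parameters described above.
# Defaults are the documented ones; the string value formats are assumptions.
stable_strategy_defaults = {
    "mem.scale-down.interval": "4h",                         # min interval between memory scale-downs
    "parallelism.scale.max": -1,                             # -1 = maximum parallelism not limited
    "parallelism.scale.min": 1,                              # minimum parallelism
    "delay-detector.scale-up.threshold": "1min",             # max allowed source-consumption delay
    "slot-usage-detector.scale-up.threshold": 0.8,           # scale up above this slot usage
    "slot-usage-detector.scale-down.threshold": 0.2,         # scale down below this slot usage
    "slot-usage-detector.scale-up.sample-interval": "3min",  # slot usage sampling interval
    "resources.memory-scale-up.max": "16Gi",                 # per TaskManager/JobManager memory cap
}

for key, value in stable_strategy_defaults.items():
    print(f"{key} = {value}")
```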

  4. Click Save.

Save a resource plan

After a job using Autopilot's stable strategy becomes stable, the system outputs a fixed or scheduled resource plan for you to view, analyze, save, or apply. The following table describes the details of these types of plans.

Fixed resource plan

Description: A single resource configuration that does not contain the time dimension.

Remarks: To view and apply a generated fixed resource plan, perform the following steps: In the Autopilot Mode Applying banner, click Details. On the panel that appears, set Recommended Plan to Specified resource and click Save. In the dialog box that appears, click Confirm. After you click Confirm, the deployment's resource configurations are updated with the saved values and are applied the next time the deployment is started.

Scheduled plan (in public preview)

Description: Contains time periods and the resource configuration for each time period.

Remarks: You can save a scheduled plan and apply it. For more information, see Save and apply a scheduled plan. After the scheduled plan is applied, the tuning mode is automatically switched to scheduled tuning, and resources are not adjusted after the job runs stably.

Configure and apply a scheduled plan

Procedure

Create and apply a scheduled plan

  1. Go to the scheduled plan configuration page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. Find the target workspace and click Console in the Actions column.

    3. In the left-side navigation pane, choose O&M > Deployments. Then click the name of the target deployment.

    4. On the Resources tab, select the Scheduled Mode subtab.

  2. Click New Plan.

  3. In the Resource Setting section of the New Plan panel, configure the scheduled plan:

    • Trigger Period: You can select No Repeat, Every Day, Every Week, or Every Month from the drop-down list. If you set this parameter to Every Week or Every Month, you must specify the dates for the plan to take effect.

    • Trigger Time: Specify the time at which you want the plan to take effect.

    • Mode: Select Basic or Expert as needed. For more information, see Configure resources for a deployment.

    • Other parameters: See Parameters.

  4. (Optional) Click New Resource Setting Period to add another time period in the scheduled plan, and set trigger time and resource configurations.

    You can configure resource tuning for multiple time periods in the same scheduled tuning plan.

    Important

    The interval between trigger times must exceed 30 minutes. (A simple check of this rule is sketched after this procedure.)

  5. Find the target scheduled plan in the Resource Plans section of the Scheduled Mode subtab, and click Apply in the Actions column.
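The 30-minute rule for trigger times can be checked with a small helper like the following. This is an illustrative sketch, not part of the product; day wrap-around (a gap that spans midnight) is ignored for simplicity.

```python
# Illustrative helper: verify that the trigger times of the resource setting
# periods in one scheduled plan are more than 30 minutes apart.
from datetime import datetime, timedelta

def trigger_times_valid(trigger_times, min_gap=timedelta(minutes=30)):
    """trigger_times: 'HH:MM:SS' strings, one per resource setting period."""
    times = sorted(datetime.strptime(t, "%H:%M:%S") for t in trigger_times)
    return all(later - earlier > min_gap for earlier, later in zip(times, times[1:]))

print(trigger_times_valid(["09:00:00", "19:00:00"]))  # True: 10 hours apart
print(trigger_times_valid(["09:00:00", "09:20:00"]))  # False: only 20 minutes apart
```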

Save and apply a scheduled plan

After a job that uses Autopilot's stable strategy becomes stable, the system automatically generates a scheduled plan, which you can view, analyze, save, or apply.

  1. Go to the Autopilot configuration page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. Find the target workspace and click Console in the Actions column.

    3. In the left-side navigation pane, choose O&M > Deployments. Then click the name of the target deployment.

    4. Click the Resources tab.

  2. In the Autopilot Mode Applying banner, click Details. On the panel that appears, set Recommended Plan to Scheduled plan.

  3. Configure the scheduled plan.

    1. Specify the Max Change Count parameter: the maximum number of changes that can be applied to the scheduled plan. Valid values: 2 to 5.

    2. Click Merge time periods: time periods are merged based on the maximum number of changes you specified. You must scale resources up or down before merging to meet your business requirements.

  4. View and modify the merged resource configurations. For more information, see Configure resources for a deployment.

  5. Click Save in the lower-left corner.

  6. In the dialog box that appears, specify the Scheduled plan name parameter or select Apply this plan immediately and click Confirm.

    After the scheduled plan is applied, the tuning mode is automatically switched to scheduled tuning and resources are not adjusted after the job runs stably.

Example

In this example, business peak hours are from 09:00:00 to 19:00:00 every day and 30 CUs are assigned. Off-peak hours are from 19:00:00 to 09:00:00 of the next day and 10 CUs are assigned. The following figure shows the resource plan configurations.

[Figure: resource plan configurations for the peak and off-peak periods]
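For readability, the same plan can be written down as data. The structure and field names below are hypothetical; the actual plan is configured on the Scheduled Mode subtab as described above.

```python
# Hypothetical representation of the plan above: each entry says which resource
# configuration takes effect at which trigger time, repeated every day.
scheduled_plan = {
    "trigger_period": "Every Day",
    "resource_setting_periods": [
        {"trigger_time": "09:00:00", "resources_cu": 30},  # peak hours start
        {"trigger_time": "19:00:00", "resources_cu": 10},  # off-peak hours start
    ],
}

for period in scheduled_plan["resource_setting_periods"]:
    print(f"from {period['trigger_time']}: {period['resources_cu']} CUs")
```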

References

  • The intelligent deployment diagnostics feature can help you monitor the health status of your deployments and ensure the stability and reliability of your business. For more information, see Perform intelligent deployment diagnostics.

  • You can use deployment configurations and Flink SQL optimization to improve the performance of Flink SQL deployments. For more information, see Optimize Flink SQL.