By default, Autopilot runs in monitoring mode for fully managed Flink. This topic describes how to configure Autopilot and provides usage notes for the configuration process.
Background information
In most cases, job tuning is time-consuming. For example, when you publish a job, you must configure job information such as resources, job parallelism, and the number and size of TaskManagers. When a job is running, you must adjust its resources to maximize resource utilization, and reconfigure the job if backpressure occurs or latency increases.
If the performance of each operator and of the upstream and downstream systems of a streaming job meets your business requirements and remains stable, the Autopilot feature of fully managed Flink can fine-tune job parallelism and resource configurations to globally optimize the job. This feature helps resolve performance issues such as low job throughput, backpressure across the entire pipeline, and poor resource utilization.
Limits
- Autopilot is not supported for jobs that are deployed in session clusters.
- Autopilot cannot resolve the performance bottlenecks of streaming jobs.
The tuning policies assume specific scenarios: traffic changes smoothly, no data skew exists, and the throughput of each operator scales linearly as job parallelism increases. If your business logic deviates significantly from these assumptions, the following issues may occur:
- Parallelism adjustments cannot be triggered, or the job cannot reach a normal state and is repeatedly restarted.
- The performance of user-defined scalar functions (UDFs), user-defined aggregate functions (UDAFs), or user-defined table-valued functions (UDTFs) deteriorates.
If a performance bottleneck occurs in your job, change the Mode parameter of Autopilot to Monitoring and tune the job manually.
- Autopilot cannot identify issues that occur on external systems.
If an external system fails or access to the external system is slow, job parallelism increases. This places additional load on the external system and can cause the external service to break down. Common issues on external systems:
- The number of DataHub partitions is insufficient or the throughput of Message Queue for Apache RocketMQ is low.
- The performance of the sink node is low.
- A deadlock occurs on an ApsaraDB RDS database.
If these issues occur, you must troubleshoot the external systems yourself.
Usage notes
- After Autopilot triggers a tuning action for a job, the job must be restarted. During the restart, the job temporarily stops processing data.
- By default, Autopilot is triggered at an interval of 10 minutes. You can change the interval by configuring the cooldown.minutes parameter.
- If you write your job code by using the DataStream API or Table API, make sure that the parallelism is not specified in the job code. If it is, Autopilot does not take effect and the resources of the job cannot be adjusted.
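The cooldown behavior described above can be sketched as follows. This is an illustrative model only, not Autopilot's actual implementation; the function name and gating logic are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Default interval between Autopilot tuning actions (cooldown.minutes).
COOLDOWN_MINUTES = 10

def may_trigger_tuning(last_action_at, now, cooldown_minutes=COOLDOWN_MINUTES):
    """Return True if enough time has passed since the last tuning action."""
    if last_action_at is None:  # no tuning action has happened yet
        return True
    return now - last_action_at >= timedelta(minutes=cooldown_minutes)

last = datetime(2024, 1, 1, 12, 0)
print(may_trigger_tuning(last, datetime(2024, 1, 1, 12, 5)))   # within cooldown -> False
print(may_trigger_tuning(last, datetime(2024, 1, 1, 12, 10)))  # cooldown elapsed -> True
```

Raising cooldown.minutes trades responsiveness for stability: fewer restarts, but slower reaction to load changes.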
Default tuning
- Adjusts the parallelism of jobs to meet throughput requirements. After Autopilot is enabled, the system monitors the delay of source data consumption, the actual CPU utilization of TaskManagers, and the data processing capability of each operator, and adjusts job parallelism based on the following rules:
- If the job delay does not exceed 60 seconds (the default threshold), the system does not proactively increase job parallelism.
- If the job delay exceeds 60 seconds, the system determines whether to increase job parallelism based on the following conditions:
- If the job delay is decreasing, the system does not adjust the parallelism of jobs.
- If the job delay increases continuously for three minutes (the default value), the system adjusts job parallelism to twice the processing capability of the current actual transactions per second (TPS), up to the maximum number of compute units (CUs). By default, the maximum is 64 CUs.
- If the delay metric does not exist for the job, the system adjusts job parallelism based on the following conditions:
- If a vertex node spends more than 80% of its time processing data for six consecutive minutes, the system increases job parallelism to reduce the value of slot-utilization to 50%. The total resources cannot exceed the specified maximum number of CUs, which is 64 by default.
- If the average CPU utilization of all TaskManagers exceeds 80% in six consecutive minutes, the system increases the parallelism of jobs to reduce the average CPU utilization to 50%.
- If the maximum CPU utilization of all TaskManagers stays below 20% for 24 consecutive hours and the data processing time ratio of each vertex node is less than 20%, the system decreases job parallelism to raise the CPU utilization and the actual data processing time of the vertex nodes to 50%.
- Monitors the memory usage and failovers of the job to adjust the memory configurations of the job. The system adjusts the memory based on the following rules:
- If the JobManager encounters frequent garbage collections (GCs) or an out of memory (OOM) error, the system increases the memory size of the JobManager. By default, the memory size of the JobManager can be increased to a maximum of 16 GiB.
- If a TaskManager encounters frequent GCs, an OOM error, or a HeartbeatTimeout error, the system increases the memory size of the TaskManager. By default, the memory size of a TaskManager can be increased to a maximum of 16 GiB.
- If the memory usage of a TaskManager exceeds 95%, the system increases the memory size of the TaskManager.
- If the memory usage of a TaskManager is less than 30% in 24 consecutive hours, the system reduces the memory size of the TaskManager.
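The decision rules above can be summarized in a small sketch. This is not the actual Autopilot implementation; the function names, the per-slot-throughput model, and the memory growth and shrink factors are assumptions made for illustration. Only the thresholds (60 seconds, 3 minutes, 64 CUs, 95%, 30%, 16 GiB) come from the documented defaults.

```python
import math

MAX_CUS = 64            # default maximum number of CUs
DELAY_THRESHOLD_S = 60  # default job-delay threshold in seconds

def scaled_parallelism(parallelism, delay_s, delay_rising_min,
                       actual_tps, per_slot_tps):
    """Apply the delay-based scale-up rule, capped at MAX_CUS."""
    if delay_s <= DELAY_THRESHOLD_S or delay_rising_min < 3:
        return parallelism  # delay within bounds, or not rising long enough
    # Target capacity: twice the current actual TPS (per-slot TPS is assumed).
    needed = math.ceil(2 * actual_tps / per_slot_tps)
    return min(MAX_CUS, max(parallelism, needed))

def scaled_tm_memory_gib(memory_gib, usage_ratio, gc_or_oom):
    """Apply the memory rules: grow under pressure (cap 16 GiB), shrink when idle."""
    if gc_or_oom or usage_ratio > 0.95:
        return min(16, memory_gib * 2)   # growth factor is an assumption
    if usage_ratio < 0.30:
        return max(1, memory_gib // 2)   # shrink factor is an assumption
    return memory_gib

print(scaled_parallelism(4, delay_s=120, delay_rising_min=5,
                         actual_tps=3000, per_slot_tps=500))        # -> 12
print(scaled_tm_memory_gib(4, usage_ratio=0.97, gc_or_oom=False))   # -> 8
```

Note how both directions are rate-limited in practice: scale-down requires 24 hours of low utilization, while scale-up reacts within minutes.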
Configure Autopilot
- Go to the Deployments page.
  - Log on to the Realtime Compute for Apache Flink console.
  - On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
  - In the left-side navigation pane, choose .
- Configure Autopilot.
- Click Save.
Mappings between parameters of Ververica Platform (VVP) 2.5.X and VVP 2.4.X
| VVP 2.4.X | VVP 2.5.X |
| --- | --- |
| cpu-based.scale-up.threshold | tm-cpu-usage-detector.scale-up.threshold |
| cpu-based.scale-down.threshold | tm-cpu-usage-detector.scale-down.threshold |
| cpu-based.scale-up.window-size.min | tm-cpu-usage-detector.scale-up.sample-interval |
| cpu-based.scale-down.window-size.min | tm-cpu-usage-detector.scale-down.sample-interval |
| source-delay-based.threshold | delay-detector.scale-up.threshold |
| source-delay-based.scale-up.window-size.min | delay-detector.scale-up.sample-interval |
| slot-utilization-based.threshold | slot-usage-detector.scale-down.threshold |
| slot-utilization-based.scale-down.window-size.min | slot-usage-detector.scale-down.sample-interval |
| slot-usage-detector.scale-up.threshold | slot-usage-detector.scale-up.threshold |
| slot-usage-detector.scale-up.sample-interval | slot-usage-detector.scale-up.sample-interval |
| memory-utilization-based.memory-usage-max.threshold | tm-memory-usage-detector.scale-up.threshold |
| memory-utilization-based.memory-usage-min.threshold | tm-memory-usage-detector.scale-down.threshold |
| memory-utilization-based.memory-usage.target-utilization | memory-utilization-based.memory-usage.target-utilization |
| memory-utilization-based.scale-up.window-size.min | Not supported |
| memory-utilization-based.scale-down.window-size.min | tm-memory-usage-detector.scale-down.sample-interval |
| memory-utilization-based.gc-ratio.threshold | Not supported |
| memory-utilization-based.gc-time-longest-ms.threshold | Not supported |
| memory-utilization-based.gc-time-ms-per-second.threshold | jm-gc-detector.gc-time-ms-per-second.threshold and tm-gc-detector.gc-time-ms-per-second.threshold |
| memory-utilization-based.memory-scale-up.max | Not supported |
| memory-utilization-based.memory-scale-up.ratio | memory-utilization-based.memory-scale-up.ratio |
| job-exception-based.oom-exception.memory-scale-up.ratio | job-exception-based.oom-exception.memory-scale-up.ratio |
| job-exception-based.oom-exception.memory-scale-up.max | Not supported |
| job-exception-based.oom-exception.include-tm-timeout | Not supported |
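If you maintain Autopilot settings as key-value pairs, the mapping table above can be applied mechanically when upgrading. The helper below is a hypothetical sketch, not an official migration tool; the `migrate` function and its pass-through behavior for unknown keys are assumptions. Note that memory-utilization-based.gc-time-ms-per-second.threshold maps to two VVP 2.5.X keys and is left out of the dictionary; set jm-gc-detector.gc-time-ms-per-second.threshold and tm-gc-detector.gc-time-ms-per-second.threshold manually.

```python
# Mapping of VVP 2.4.X keys to VVP 2.5.X keys, taken from the table above.
# Keys with no 2.5.X equivalent map to None (not supported).
PARAM_MAP = {
    "cpu-based.scale-up.threshold": "tm-cpu-usage-detector.scale-up.threshold",
    "cpu-based.scale-down.threshold": "tm-cpu-usage-detector.scale-down.threshold",
    "cpu-based.scale-up.window-size.min": "tm-cpu-usage-detector.scale-up.sample-interval",
    "cpu-based.scale-down.window-size.min": "tm-cpu-usage-detector.scale-down.sample-interval",
    "source-delay-based.threshold": "delay-detector.scale-up.threshold",
    "source-delay-based.scale-up.window-size.min": "delay-detector.scale-up.sample-interval",
    "slot-utilization-based.threshold": "slot-usage-detector.scale-down.threshold",
    "slot-utilization-based.scale-down.window-size.min": "slot-usage-detector.scale-down.sample-interval",
    "memory-utilization-based.memory-usage-max.threshold": "tm-memory-usage-detector.scale-up.threshold",
    "memory-utilization-based.memory-usage-min.threshold": "tm-memory-usage-detector.scale-down.threshold",
    "memory-utilization-based.scale-up.window-size.min": None,
    "memory-utilization-based.scale-down.window-size.min": "tm-memory-usage-detector.scale-down.sample-interval",
    "memory-utilization-based.gc-ratio.threshold": None,
    "memory-utilization-based.gc-time-longest-ms.threshold": None,
    "memory-utilization-based.memory-scale-up.max": None,
    "job-exception-based.oom-exception.memory-scale-up.max": None,
    "job-exception-based.oom-exception.include-tm-timeout": None,
}

def migrate(config):
    """Rename 2.4.X keys to their 2.5.X names, dropping unsupported ones."""
    migrated = {}
    for key, value in config.items():
        new_key = PARAM_MAP.get(key, key)  # unchanged/unknown keys pass through
        if new_key is not None:
            migrated[new_key] = value
    return migrated

print(migrate({"cpu-based.scale-up.threshold": 0.8,
               "memory-utilization-based.gc-ratio.threshold": 0.2}))
```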