By default, Autopilot is enabled in monitoring mode for fully managed Flink. This topic describes how to configure Autopilot and provides usage notes.

Background information

In most cases, job tuning takes a large amount of time. For example, when you publish a job, you must configure its resources, such as the parallelism and the number and size of TaskManagers. When the job is running, you must adjust its resources to maximize resource utilization, and you must adjust the job configurations again if backpressure occurs or the latency increases.

If the performance of each operator and of the upstream and downstream systems of your streaming job meets your business requirements and remains stable, the Autopilot feature of fully managed Flink fine-tunes the parallelism and resource configurations of the job to optimize it globally. This feature resolves performance issues such as low job throughput, backpressure across the entire pipeline, and poor resource utilization.

Limits

  • Autopilot is not supported for jobs that are deployed in session clusters.
  • Autopilot cannot resolve the performance bottlenecks of streaming jobs.
    The tuning policies assume specific scenarios: for example, the traffic changes smoothly, no data skew exists, and the throughput of each operator scales linearly as the parallelism of jobs increases. If your business logic deviates significantly from these assumptions, the following issues may occur:
    • The operation to modify the parallelism of jobs cannot be triggered, or your job cannot reach a normal state and is repeatedly restarted.
    • The performance of user-defined scalar functions (UDFs), user-defined aggregate functions (UDAFs), or user-defined table-valued functions (UDTFs) deteriorates.

    If a performance bottleneck issue occurs in your job, you need to change the Mode parameter of Autopilot to Monitoring and manually tune the job.

  • Autopilot cannot identify issues that occur on external systems.
    If an external system fails or access to the external system takes a long time, Autopilot increases the parallelism of the job. This increases the load on the external system and may eventually cause the external service to break down. Common issues that occur on external systems:
    • The number of DataHub partitions is insufficient or the throughput of Message Queue for Apache RocketMQ is low.
    • The performance of the sink node is low.
    • A deadlock occurs on an ApsaraDB RDS database.

    If these issues occur, you need to troubleshoot them.

Usage notes

  • After a tuning operation is triggered by Autopilot, the job must be restarted for the new configurations to take effect. During the restart, the job temporarily stops processing data.
  • By default, the interval at which Autopilot is triggered is 10 minutes. You can configure the cooldown.minutes parameter to modify the interval.
  • If you write your job code by using the DataStream API or Table API, make sure that the Parallelism parameter is not specified in the job code. If this parameter is specified, the Autopilot feature does not take effect, and the resources of the job cannot be adjusted.
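
The following DataStream snippet is a minimal sketch of this rule. The class name and the in-memory source are hypothetical and used only for illustration; the point is that hard-coding the parallelism, for example by calling setParallelism in the job code, prevents Autopilot from adjusting it.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AutopilotFriendlyJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Do NOT hard-code the parallelism. Calls such as env.setParallelism(4) or a
        // per-operator .setParallelism(...) fix the parallelism in the job graph, so
        // Autopilot cannot adjust the resources of the job.

        env.fromElements("a", "b", "c")   // hypothetical source used only for illustration
           .print();

        env.execute("autopilot-friendly-job");
    }
}
```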

Default tuning

If Autopilot is enabled, the system automatically performs the following operations to tune the resource configurations:
  • Adjusts the parallelism of the job to meet its throughput requirements. After Autopilot is enabled, the system monitors the consumption delay of the source data, the actual CPU utilization of TaskManagers, and the data processing capability of each operator, and adjusts the parallelism of the job based on the following rules:
    • If the job delay does not exceed 60s (the default value), the system does not proactively increase the parallelism of jobs.
    • If the job delay exceeds 60s, the system determines whether to increase the parallelism of jobs based on the following conditions:
      • If the job delay is decreasing, the system does not adjust the parallelism of jobs.
      • If the job delay has increased continuously for three minutes (the default value), the system increases the parallelism so that the processing capacity of the job is approximately twice the current actual transactions per second (TPS). The parallelism cannot cause the job to exceed the maximum number of compute units (CUs), which is 64 by default. See the sketch after this list.
    • If the delay metric does not exist for the job, the system adjusts the parallelism of jobs based on the following conditions:
      • If a vertex node spends more than 80% of its time processing data for six consecutive minutes, the system increases the parallelism of jobs to reduce the slot utilization of the node to 50%. The number of CUs that are used cannot exceed the specified maximum number of CUs, which is 64 by default.
      • If the average CPU utilization of all TaskManagers exceeds 80% for six consecutive minutes, the system increases the parallelism of jobs to reduce the average CPU utilization to 50%.
    • If the maximum CPU utilization of all TaskManagers stays below 20% for 24 consecutive hours and the vertex nodes spend less than 20% of their time processing data, the system decreases the parallelism of jobs to raise the CPU utilization and the data processing time of the vertex nodes to 50%.
  • Monitors the memory usage and failovers of the job to adjust the memory configurations of the job. The system adjusts the memory configurations based on the following rules:
    • If JobManager encounters frequent garbage collections (GCs) or an out of memory (OOM) error, the system increases the memory size of JobManager. By default, the memory size can be adjusted up to 16 GiB.
    • If TaskManagers encounter frequent GCs, an OOM error, or a HeartbeatTimeout error, the system increases the memory size of TaskManagers. By default, the memory size can be adjusted up to 16 GiB.
    • If the memory usage of TaskManagers exceeds 95%, the system increases the memory size of TaskManagers.
    • If the memory usage of TaskManagers stays below 30% for 24 consecutive hours, the system reduces the memory size of TaskManagers.
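
The delay-based scale-up rule can be illustrated with the following minimal sketch. It assumes that one unit of parallelism corresponds to roughly one CU and that throughput scales linearly with parallelism; the class and method names are made up, and the actual Autopilot logic is internal to fully managed Flink and may differ.

```java
// Hypothetical sketch of the delay-based scale-up rule, not the actual Autopilot implementation.
public final class ScaleUpSketch {

    /**
     * When the job delay has kept increasing for the configured window (3 minutes by default),
     * target roughly twice the current processing capacity by doubling the parallelism,
     * capped by resources.cu.max (64 by default). Assumes linear scaling and roughly one CU
     * per unit of parallelism.
     */
    static int scaleUpParallelism(int currentParallelism, int maxCu) {
        return Math.min(currentParallelism * 2, maxCu);
    }

    public static void main(String[] args) {
        System.out.println(scaleUpParallelism(12, 64)); // 24
        System.out.println(scaleUpParallelism(40, 64)); // capped at 64
    }
}
```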

Configure Autopilot

  1. Go to the Deployments page.
    1. Log on to the Realtime Compute for Apache Flink console.
    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
    3. In the left-side navigation pane, choose Applications > Deployments.
  2. Configure Autopilot.
    1. Click the name of the job that you want to optimize.
    2. Click the Autopilot tab.
    3. Click the Configuration tab.
    4. By default, Autopilot is enabled and Mode is set to Monitoring. You can set Mode to Disabled to disable Autopilot, or set Mode to Active to have the recommended resource configurations take effect automatically.
      • Monitoring (default value)
        In this mode, the system provides recommended tuning configurations for your job based on the tuning policy that you select, but you must manually make the configurations take effect. You can use one of the following methods:
        • On the Configuration tab, set Mode to Active.
        • On the Status tab, perform the following operations as prompted:
          • If the Apply button appears on the Status tab, click the button. The job is automatically restarted and the configurations take effect.
          • If the Apply button does not appear on the Status tab, manually modify the related configurations on the Advanced tab of the Draft Editor page as prompted, and then publish and restart the job.
      • Active

        If you set Mode to Active, Autopilot is enabled. The system provides the recommended tuning configurations for your job based on the tuning policy that you selected and automatically restarts the job to make the configurations take effect. No further action is required.

      • Disabled

        If you set Mode to Disabled, Autopilot is disabled.

    5. Optional: After you set Mode to Monitoring or Active, configure the following parameters. A hypothetical sketch of how the main thresholds are used is provided after this procedure.
      Parameter | Description | Default value
      cooldown.minutes | The minimum interval at which Autopilot automatically triggers a job restart. | 10 (unit: minutes)
      resources.cu.max | The maximum number of CUs that can be used. | 64 (unit: CUs)
      parallelism.scale-up.interval | The minimum interval at which Autopilot triggers an increase of the parallelism. | 6 (unit: minutes)
      parallelism.scale-down.interval | The minimum interval at which Autopilot triggers a decrease of the parallelism. | 24 (unit: hours)
      mem.scale-down.interval | The minimum interval at which Autopilot triggers a decrease of the memory size. | 24 (unit: hours)
      parallelism.scale.max | The maximum parallelism that can be reached when the parallelism is increased. | -1 (the maximum parallelism is not limited)
      parallelism.scale.min | The minimum parallelism that can be reached when the parallelism is decreased. | 1
      delay-detector.scale-up.threshold | The maximum delay that is allowed. | 1 (unit: minutes)
      slot-usage-detector.scale-up.threshold | If the percentage of the data processing time of a vertex node is greater than this value, an increase of the parallelism is triggered. | 0.8
      slot-usage-detector.scale-down.threshold | If the percentage of the data processing time of a vertex node is less than this value, a decrease of the parallelism is triggered. | 0.2
      tm-cpu-usage-detector.scale-up.threshold | The CPU utilization threshold that triggers an increase of the parallelism. | 0.8
      tm-cpu-usage-detector.scale-down.threshold | The CPU utilization threshold that triggers a decrease of the parallelism. | 0.2
      resources.memory-scale-up.max | The maximum memory size of TaskManagers and the JobManager. | 16 (unit: GiB)
  3. Click Save.
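
The following snippet is a hypothetical sketch of how the slot-usage and TaskManager CPU thresholds listed above relate to the scale decisions described in the Default tuning section. The class, method, and variable names are made up, the duration conditions (six minutes and 24 hours) are omitted, and the actual detector logic is internal to fully managed Flink.

```java
// Hypothetical sketch of the threshold semantics; not the actual Autopilot implementation.
public final class ThresholdSketch {

    /** Decision based on the default thresholds listed in the table above. */
    static String decide(double slotUsage, double tmCpuUsage) {
        // slot-usage-detector.scale-up.threshold and tm-cpu-usage-detector.scale-up.threshold (0.8)
        if (slotUsage > 0.8 || tmCpuUsage > 0.8) {
            return "scale up parallelism";
        }
        // slot-usage-detector.scale-down.threshold and tm-cpu-usage-detector.scale-down.threshold (0.2)
        if (slotUsage < 0.2 && tmCpuUsage < 0.2) {
            return "scale down parallelism";
        }
        return "no change";
    }

    public static void main(String[] args) {
        System.out.println(decide(0.85, 0.40)); // scale up parallelism
        System.out.println(decide(0.15, 0.10)); // scale down parallelism
        System.out.println(decide(0.50, 0.50)); // no change
    }
}
```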

Mappings between parameters of VVP 2.5.X and VVP 2.4.X

The parameters of VVP 2.5.X are compatible with most of the parameters of VVP 2.4.X. The following table lists their mappings.
VVP 2.4.X | VVP 2.5.X
cpu-based.scale-up.threshold | tm-cpu-usage-detector.scale-up.threshold
cpu-based.scale-down.threshold | tm-cpu-usage-detector.scale-down.threshold
cpu-based.scale-up.window-size.min | tm-cpu-usage-detector.scale-up.sample-interval
cpu-based.scale-down.window-size.min | tm-cpu-usage-detector.scale-down.sample-interval
source-delay-based.threshold | delay-detector.scale-up.threshold
source-delay-based.scale-up.window-size.min | delay-detector.scale-up.sample-interval
slot-utilization-based.threshold | slot-usage-detector.scale-down.threshold
slot-utilization-based.scale-down.window-size.min | slot-usage-detector.scale-down.sample-interval
slot-usage-detector.scale-up.threshold | slot-usage-detector.scale-up.threshold
slot-usage-detector.scale-up.sample-interval | slot-usage-detector.scale-up.sample-interval
memory-utilization-based.memory-usage-max.threshold | tm-memory-usage-detector.scale-up.threshold
memory-utilization-based.memory-usage-min.threshold | tm-memory-usage-detector.scale-down.threshold
memory-utilization-based.memory-usage.target-utilization | memory-utilization-based.memory-usage.target-utilization
memory-utilization-based.scale-up.window-size.min | Not supported
memory-utilization-based.scale-down.window-size.min | tm-memory-usage-detector.scale-down.sample-interval
memory-utilization-based.gc-ratio.threshold | Not supported
memory-utilization-based.gc-time-longest-ms.threshold | Not supported
memory-utilization-based.gc-time-ms-per-second.threshold | jm-gc-detector.gc-time-ms-per-second.threshold and tm-gc-detector.gc-time-ms-per-second.threshold
memory-utilization-based.memory-scale-up.max | Not supported
memory-utilization-based.memory-scale-up.ratio | memory-utilization-based.memory-scale-up.ratio
job-exception-based.oom-exception.memory-scale-up.ratio | job-exception-based.oom-exception.memory-scale-up.ratio
job-exception-based.oom-exception.memory-scale-up.max | Not supported
job-exception-based.oom-exception.include-tm-timeout | Not supported