How to Use AutoPilot to Automatically Tune Flink Jobs

1. Introduction to AutoPilot

1.1 Problems solved by AutoPilot

AutoPilot aims to solve two major problems in developing and operating Flink jobs.

First, job tuning is difficult, so development and operations costs are high.

Flink jobs usually run for a long time, and their data traffic changes over time, so the resources a job needs also change. To keep a job stable over the long term, we usually have to retune it continuously.

Flink SQL greatly simplifies job development but makes job tuning harder, because SQL users usually do not understand the underlying implementation.

Second, resource utilization is low, so the cost of running jobs is high.

Without dynamic resource optimization, a job usually has to be configured for its peak resource demand. Over the long run, resource utilization during off-peak periods is therefore relatively low, which makes the job expensive to run.
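To make the cost argument concrete, here is a small sketch with made-up numbers. The hourly demand figures are illustrative assumptions, not measurements. It compares a job sized for its peak all day with an idealized job whose resources follow demand exactly.

```python
# Hypothetical demand over one day, in slots actually needed per hour.
# These numbers are illustrative only.
hourly_demand = [2, 2, 2, 2, 3, 4, 6, 8, 10, 10, 8, 6,
                 6, 6, 8, 10, 10, 8, 6, 4, 3, 2, 2, 2]


def average_utilization(demand, provisioned):
    """Fraction of the provisioned slots that is used on average."""
    return sum(demand) / (provisioned * len(demand))


# Static sizing: provision for the peak hour and keep that size all day.
peak = max(hourly_demand)
static_slot_hours = peak * len(hourly_demand)
static_utilization = average_utilization(hourly_demand, peak)

# Idealized dynamic sizing: provisioned slot-hours follow demand exactly.
dynamic_slot_hours = sum(hourly_demand)

print(f"static utilization: {static_utilization:.0%}")
print(f"slot-hours, static vs. dynamic: {static_slot_hours} vs. {dynamic_slot_hours}")
```

Real savings are smaller than this ideal, because every adjustment restarts the job and resources cannot track demand perfectly.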

In short, AutoPilot aims to lower the barrier to using Flink through automatic, adaptive resource tuning, while also reducing the cost of using Flink.

1.2 System Architecture of AutoPilot

AutoPilot is part of Flink's management and control service and consists of two parts: anomaly detection and anomaly resolution. Anomaly detection subscribes to the event information of Flink jobs in real time and analyzes it statistically to identify abnormal states caused by resource problems. When an anomaly occurs, automatic resource tuning is triggered to resolve it. Anomaly resolution works by dynamically updating the job's resource configuration parameters. After the parameters are updated, another component of the management and control service, APP Manager, automatically restarts the Flink job so that the latest configuration takes effect.
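The detect, resolve and restart flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the event fields (`resource_related`, `kind`), the doubling rules, and the `AppManager` stand-in are all hypothetical and do not reflect AutoPilot's real interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class JobConfig:
    parallelism: int
    tm_memory_mb: int


@dataclass
class AppManager:
    """Hypothetical stand-in for APP Manager: records job restarts."""
    restarts: list = field(default_factory=list)

    def restart_with(self, job_id, config):
        self.restarts.append((job_id, config))


def detect_anomaly(events):
    """Anomaly detection: return the first resource-related event, if any."""
    for event in events:
        if event.get("resource_related"):
            return event
    return None


def resolve(config, anomaly):
    """Anomaly resolution: derive an updated resource configuration (toy rules)."""
    if anomaly["kind"] == "out_of_memory":
        return JobConfig(config.parallelism, config.tm_memory_mb * 2)
    if anomaly["kind"] == "backlog":
        return JobConfig(config.parallelism * 2, config.tm_memory_mb)
    return config


def autopilot_step(job_id, config, events, app_manager):
    """One pass: detect, update the configuration, restart only if it changed."""
    anomaly = detect_anomaly(events)
    if anomaly is None:
        return config
    new_config = resolve(config, anomaly)
    if new_config != config:
        app_manager.restart_with(job_id, new_config)
    return new_config
```

Note that the restart happens only when the configuration actually changes, which matches the description that a restart is what makes a new configuration take effect.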

Currently, AutoPilot supports three main functions:

Dynamically adjust job parallelism (concurrency) according to the job's actual load, and adjust the number of TMs (TaskManagers) accordingly, so that job resources follow changes in traffic;

Dynamically adjust TM resources according to TM memory utilization, so that the memory of each TM stays at a reasonable level;

Automatically identify job exceptions caused by resource problems, and dynamically adjust TM resources to keep jobs stable.
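As a rough illustration of how the number of TMs follows from parallelism, the sketch below assumes Flink's default slot sharing, under which a job needs as many slots as its highest operator parallelism. `slots_per_tm` stands for the `taskmanager.numberOfTaskSlots` setting. How AutoPilot itself computes the TM count is not described in this article, so treat this as an assumption.

```python
def required_task_managers(parallelism, slots_per_tm):
    """Smallest number of TMs whose slots can host `parallelism` subtasks."""
    if parallelism < 1 or slots_per_tm < 1:
        raise ValueError("parallelism and slots_per_tm must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (parallelism + slots_per_tm - 1) // slots_per_tm


for p in (4, 8, 9):
    print(p, "->", required_task_managers(p, slots_per_tm=4))
```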

2. AutoPilot Practical Demonstration

2.1 How to configure AutoPilot for a job

AutoPilot can be configured independently for each job, and its configuration can be updated dynamically without affecting the running job.

■ AutoPilot provides three modes:

Disabled mode (the default): AutoPilot does not monitor the job status;
Active mode: AutoPilot monitors the job status and automatically updates the job's parameter configuration when necessary;
Monitoring mode: AutoPilot monitors the job status and suggests configuration updates when it identifies job anomalies, but the user must confirm and manually trigger the update.
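The difference between the three modes can be summarized in a few lines. The mode names come from the text above. The function and its return values are hypothetical and only describe what happens to a new configuration recommendation in each mode.

```python
from enum import Enum


class Mode(Enum):
    DISABLED = "Disabled"
    ACTIVE = "Active"
    MONITORING = "Monitoring"


def handle_recommendation(mode, user_confirmed=False):
    """Describe what happens to a new configuration recommendation."""
    if mode is Mode.DISABLED:
        # The job is not monitored, so no recommendation is produced at all.
        return "not_monitored"
    if mode is Mode.ACTIVE:
        # AutoPilot updates the job configuration by itself.
        return "applied_automatically"
    # Monitoring: the suggestion waits until the user triggers the update.
    if user_confirmed:
        return "applied_after_confirmation"
    return "shown_as_suggestion"


for mode in Mode:
    print(mode.value, "->", handle_recommendation(mode))
```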

■ AutoPilot provides five strategies:

CPU-based strategy: adjusts parallelism dynamically, mainly based on the actual CPU utilization of the TMs. This is a typical elastic scaling strategy. High CPU utilization means the job is busy, so AutoPilot increases the job's parallelism to reduce the load on each TM. Low CPU utilization means the TMs are relatively idle, so the parallelism can be reduced to release excess resources;

Source-delay-based strategy: decides whether parallelism needs to be adjusted, mainly based on the delay metrics of the source. This strategy currently supports only two sources, SLS and DataHub. The community is working on standardizing connector metrics in FLIP-33. Once that is complete, this strategy will support more sources;

Slot-utilization-based strategy: decides whether parallelism needs to be reduced, mainly based on the slot utilization of the tasks. Unlike CPU utilization, slot utilization also counts the time a task spends in I/O wait or sleep logic, so it measures utilization more accurately. However, this strategy relies on utilization statistics from the source node, which depend on FLIP-27, so it will not take effect until FLIP-27 is fully completed;

Memory-utilization-based strategy: decides whether to adjust the TM memory size, mainly based on the TM's actual memory utilization and GC metrics. When the overall memory utilization of a TM is low and there is no severe GC, the memory size can be reduced. When the memory utilization of a TM is already high, or GC is severe, the memory of a single TM can be increased so that the tasks running on it stay in a relatively healthy state;

Job-exception-based strategy: automatically identifies job exceptions caused by resource problems. When such exceptions are identified, AutoPilot automatically increases the memory of a single TM to resolve them and keep the job stable.
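The decision rules of the CPU-based and memory-utilization-based strategies can be sketched as simple threshold checks. The threshold values below (0.8, 0.2, 0.9, 0.3) are assumptions chosen for illustration, and the real strategies evaluate metrics over a period of time rather than a single sample.

```python
def cpu_based_decision(cpu_utilization, high=0.8, low=0.2):
    """Scale parallelism up when TMs are busy, down when they are idle."""
    if cpu_utilization > high:
        return "increase_parallelism"
    if cpu_utilization < low:
        return "decrease_parallelism"
    return "keep"


def memory_based_decision(memory_utilization, gc_severe, high=0.9, low=0.3):
    """Grow TM memory under pressure, shrink it only when clearly idle."""
    if gc_severe or memory_utilization > high:
        return "increase_tm_memory"
    if memory_utilization < low:
        # Reached only when GC is not severe, as the strategy requires.
        return "decrease_tm_memory"
    return "keep"


print(cpu_based_decision(0.95))
print(memory_based_decision(0.25, gc_severe=True))
```

Checking `gc_severe` first means that low memory utilization never shrinks a TM that is already struggling with garbage collection.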

■ AutoPilot cooldown

When AutoPilot is in Active or Monitoring mode, you need to configure its cooldown time. The cooldown time is the minimum interval between two rescales. When AutoPilot triggers a rescale, the job is restarted, and during the restart the job state has to be initialized and warmed up. This period must be excluded so that the AutoPilot strategies do not make wrong judgments based on it. Generally, the larger the state, the longer the job takes to initialize and warm up, so the cooldown time should be set somewhat longer to keep AutoPilot working properly.
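The effect of the cooldown can be sketched as a gate in front of the strategy evaluation. The sample format, the CPU threshold and the function names are assumptions made for this example.

```python
def in_cooldown(now, last_rescale_at, cooldown_seconds):
    """True while the job is still warming up after the previous rescale."""
    return last_rescale_at is not None and now - last_rescale_at < cooldown_seconds


def rescale_times(samples, cooldown_seconds, cpu_high=0.8):
    """samples: (timestamp_seconds, cpu_utilization) pairs in time order.

    Returns the timestamps at which a rescale would be triggered.
    """
    last_rescale_at = None
    triggered = []
    for timestamp, cpu in samples:
        if in_cooldown(timestamp, last_rescale_at, cooldown_seconds):
            continue  # metrics from the warm-up period are ignored
        if cpu > cpu_high:
            triggered.append(timestamp)
            last_rescale_at = timestamp
    return triggered


samples = [(0, 0.9), (60, 0.95), (300, 0.9), (600, 0.9), (700, 0.5)]
print(rescale_times(samples, cooldown_seconds=600))
```

In this sample the high readings at 60 s and 300 s fall inside the cooldown window and are ignored, so the job is not restarted again while it is still warming up.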

■ AutoPilot custom parameters

The behavior of each AutoPilot strategy can be controlled individually through user-defined parameters to suit special jobs. For example, for a job with many I/O operations, if the cpu-based strategy is enabled, its trigger threshold needs to be lowered to match the job's actual CPU usage.
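The sketch below shows the idea of overriding a default threshold for an I/O-heavy job. The parameter names and default values are hypothetical, since this article does not list AutoPilot's real custom parameter keys.

```python
# Hypothetical defaults; the real parameter names and values may differ.
DEFAULT_PARAMS = {
    "cpu_scale_up_threshold": 0.8,
    "cpu_scale_down_threshold": 0.2,
}


def effective_params(custom_params):
    """Merge user-defined parameters over the defaults and validate them."""
    unknown = set(custom_params) - set(DEFAULT_PARAMS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    merged = {**DEFAULT_PARAMS, **custom_params}
    low = merged["cpu_scale_down_threshold"]
    high = merged["cpu_scale_up_threshold"]
    if not 0 < low < high <= 1:
        raise ValueError("thresholds must satisfy 0 < down < up <= 1")
    return merged


# An I/O-heavy job rarely reaches high CPU usage, so trigger scale-up earlier.
io_heavy = effective_params({"cpu_scale_up_threshold": 0.5})
cpu_now = 0.6
print("default triggers:", cpu_now > DEFAULT_PARAMS["cpu_scale_up_threshold"])
print("custom triggers:", cpu_now > io_heavy["cpu_scale_up_threshold"])
```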

■ Practical demonstration

To configure AutoPilot for a job that has already been created, open the AutoPilot tab on the job details page. AutoPilot is not started by default. To start it, switch the AutoPilot mode from Disabled to Active or Monitoring on this tab.

After selecting the mode, select the desired strategies and the cooldown time, and fill in any custom parameters in the custom configuration field. Then save the configuration. AutoPilot now monitors the job status and optimizes resources automatically.

To turn off AutoPilot for a job, switch the mode to Disabled on the configuration page and save. None of these AutoPilot operations affects the normal operation of the job.

2.2 How to check the running status of AutoPilot

After AutoPilot starts, you can view its current running status on the auto-tuning status page. The status information consists of two parts:

The first is the latest recommended job configuration. When AutoPilot is in Monitoring mode, any new configuration recommendation is displayed on this page, and you can manually trigger the configuration update from here.

The second is the job status information monitored by each enabled AutoPilot strategy. This information explains why AutoPilot currently needs to update the configuration, and it can also assist manual tuning or code optimization.

2.3 How to view AutoPilot historical information

While AutoPilot is running, every modification it makes to the job configuration is saved as an event, so that users can later review AutoPilot's behavior and analyze job traffic. To view these events, filter for events of the AutoPilot type in "Running Events".

3. How to choose the AutoPilot strategy

3.1 General-scenario jobs

For jobs in general scenarios, the default strategy configuration is sufficient. Under the default configuration, an adjustment of parallelism is triggered when the CPU utilization of a TM stays relatively high for a long time; an adjustment of single-TM memory is triggered when the TM's memory usage is high or low; and an adjustment of TM resources is triggered promptly when a resource-related exception occurs in the job. This configuration is essentially the same as the usual auto-tuning configuration in elastic computing, so it is simple and easy to understand.

3.2 High-priority, delay-sensitive jobs

For these jobs, it is recommended to use Monitoring mode rather than Active mode. Once AutoPilot triggers tuning for such a job, the job is restarted, which may affect the business. In Monitoring mode, you can review the recommendations at regular intervals to check whether any configuration needs optimizing, and apply the update manually at an appropriate time.

3.3 Jobs using SLS or DataHub

For these jobs, the combination source-delay-based + slot-utilization-based + memory-utilization-based + job-exception-based is recommended. With this combination, parallelism tuning works better and the algorithm converges faster.

4. Precautions for using AutoPilot

■ First, AutoPilot changes parallelism through the job's default parallelism, so the parallelism must not be set explicitly in the job code. Otherwise, dynamic adjustment cannot take effect.

■ Second, after AutoPilot triggers an update, the console automatically restarts the job, which temporarily stops data processing. For jitter-sensitive jobs, Monitoring mode is recommended to avoid impact on the business.

■ Third, the AutoPilot strategies make certain assumptions about the job's data model:

Job traffic must change smoothly and without data skew, so that the resources the job needs can be estimated from its runtime statistics over a short period before the current time;

The job's data must not be skewed, and the throughput of each operator must scale linearly with parallelism. The job's throughput after a parallelism adjustment can then be estimated from its current throughput, which determines how much the parallelism needs to be adjusted.

When a job deviates seriously from these assumptions, problems may occur: the job may be abnormal without AutoPilot triggering an adjustment, or AutoPilot may trigger adjustments but the algorithm fails to converge, so the job stays in an abnormal state and restarts repeatedly. In that case, turn off AutoPilot and tune the job manually to bring it back to a healthy state.
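The linear-scaling assumption above can be written as a short estimate. This is a sketch of the reasoning only, not AutoPilot's actual algorithm, and the function and parameter names are made up for illustration. It assumes every subtask handles an equal share of the load, which is exactly what data skew would break.

```python
import math


def estimate_parallelism(current_parallelism, current_throughput,
                         target_throughput, max_parallelism=None):
    """Estimate the parallelism needed to reach target_throughput.

    Assumes throughput scales linearly with parallelism and that load is
    spread evenly across subtasks (no data skew).
    """
    if current_parallelism < 1 or current_throughput <= 0 or target_throughput <= 0:
        raise ValueError("parallelism and throughputs must be positive")
    per_subtask = current_throughput / current_parallelism
    needed = max(1, math.ceil(target_throughput / per_subtask))
    if max_parallelism is not None:
        needed = min(needed, max_parallelism)
    return needed


# 4 subtasks currently process 40,000 records/s in total.
print(estimate_parallelism(4, 40_000, 100_000))   # traffic grows
print(estimate_parallelism(8, 80_000, 20_000))    # traffic shrinks
```

If one key dominates the traffic, adding subtasks does not raise the throughput of the hot subtask, so an estimate like this keeps overshooting and the adjustment never converges.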
