This topic describes how to configure Autopilot and what to consider when you use it.
Background information
In most cases, job tuning consumes a large amount of time. For example, when you publish a job, you must decide how to configure its resources, such as the parallelism and the number and size of TaskManagers. When the job is running, you must also decide how to adjust its resources to maximize resource utilization, and how to change its configuration if backpressure occurs or latency increases.
Provided that the performance of each operator and of the upstream and downstream storage systems of a streaming job is up to standard and stable, the Autopilot feature of fully managed Flink helps you adjust the parallelism and resource configuration of the job more appropriately and optimize the job globally. Autopilot resolves common performance tuning problems, such as insufficient job throughput, backpressure across the entire pipeline, and wasted resources. However, Autopilot cannot detect or resolve the following types of issues:
- Performance issues of upstream and downstream storage
- Upstream storage: for example, insufficient DataHub partitions or insufficient throughput of Message Queue (MQ)
- Downstream storage: for example, sink performance issues or ApsaraDB RDS deadlocks
- Performance issues of user-defined functions (UDFs)
For example, performance issues of scalar UDFs, user-defined aggregate functions (UDAFs), or user-defined table-valued functions (UDTFs)
Prerequisites
- The Parallelism parameter of the job is not specified in job code.
If you develop the job by using the DataStream API or the Table API, make sure that the parallelism is not specified in the job code. Otherwise, Autopilot cannot adjust the resources of the job.
- The Number of Task Managers parameter is not specified in the console or in code.
If the Number of Task Managers parameter is specified in the console or in code, Autopilot cannot work as expected.
- The taskmanager.numberOfTaskSlots parameter is not specified in the YAML configuration file.
- The Upgrade Strategy parameter is not set to None.
If the Upgrade Strategy parameter is set to None, the system does not automatically restart the job after its configuration is modified. As a result, the changes made by Autopilot cannot take effect.
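The first prerequisite can be illustrated with a minimal DataStream job skeleton. This is a sketch, not code from the product documentation; the job name and the sample pipeline are placeholders, and the point is only what is left unset:

```java
// Minimal sketch of an Autopilot-friendly DataStream job.
// Note what is ABSENT: no call to env.setParallelism(...) and no
// per-operator setParallelism(...), so Autopilot remains free to
// choose and later rescale the parallelism.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AutopilotFriendlyJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hard-coding env.setParallelism(4) here would pin the job's
        // parallelism and prevent Autopilot from adjusting resources.
        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)
           .print();

        env.execute("autopilot-friendly-job");
    }
}
```

The same applies to the Table API: avoid setting a fixed parallelism anywhere in the job code.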
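To illustrate the third prerequisite, the YAML configuration of the deployment should not pin the slot count. The fragment below is a sketch; `state.backend` is shown only as an example of an unrelated setting that is safe to keep:

```yaml
# Settings unrelated to scaling, such as the state backend, are safe to keep.
state.backend: rocksdb

# Do NOT set the following key. A fixed slot count prevents Autopilot
# from adjusting TaskManager resources:
# taskmanager.numberOfTaskSlots: 4
```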
Considerations
- After Autopilot is triggered, the job must be restarted and therefore temporarily stops processing data. How the job is restarted is determined by the Upgrade Strategy parameter. We recommend the Stateless upgrade strategy.
- Autopilot policies rest on specific assumptions about how jobs behave: traffic changes smoothly, no data skew exists, and the throughput of each operator increases linearly with the parallelism. If your business logic deviates significantly from these assumptions, job exceptions may occur: for example, a parallelism change may fail to be triggered, the job may fail to return to a normal state, or the job may restart continuously. In this case, switch Autopilot to Monitoring or Disabled mode and tune the job manually.
- Autopilot cannot identify issues in external systems. If such issues occur, you must resolve them yourself. For example, if an external system fails or access to it slows down, Autopilot may increase the parallelism of the job. This puts more pressure on the external system and may eventually cause it to break down.
- Autopilot is not supported for jobs that are deployed in Session clusters.