[Best Practice] Instance Plan Management Released by AnalyticDB for PostgreSQL

Background

In August 2022, the cloud-native data warehouse AnalyticDB for PostgreSQL released manual start and pause together with per-second billing. Computing resources are not billed while an instance is paused, which helps users save costs. In September 2022, the plan management function entered public preview to automate these operations. It supports planned start and pause of instances and planned scale-out and scale-in of computing nodes, so users can schedule the resource usage of an instance over time and reduce costs.

Implementation Plan

Overall Architecture

Building on the technical framework of manual start and pause, a planned task management module is added at the execution layer for planned tasks. A timing task scheduler is also introduced to trigger planned tasks at regular intervals.

Planned Task Management

Planned Task Model

Specified Time Plan

Executes a planned task (start, pause, scale-out, or scale-in) once at a specified time, which suits temporary or urgent needs. For example, a user who wants to run a one-off batch of jobs can scale out for the run and scale in immediately afterward, so resources are used only when they are needed.

Periodic Execution Plan

Executes planned tasks (start, pause, scale-out, and scale-in) on a fixed cycle, which suits regular workloads such as offline batch jobs. For example, computing resources are scaled out for the batch run every early morning and scaled in after the run finishes.

Planned Task Scheduling

Planned tasks are associated with instances, and multiple planned tasks can be created for one instance. Every minute, the timing task scheduler calls the task execution operation of the business controller. The business controller queries all planned tasks that are due and runs them on an asynchronous thread pool. Because the scheduler runs on a one-minute cycle, a task may in theory start up to one minute late.
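The scheduling loop described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the PlannedTask fields, the find_due_tasks, execute, and tick functions, and the instance IDs are all hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PlannedTask:
    # Hypothetical fields; the article does not describe the real schema.
    task_id: str
    instance_id: str
    action: str        # "start", "pause", "scale-out", or "scale-in"
    run_at: datetime   # planned execution time in UTC


def find_due_tasks(tasks, now):
    """Return the tasks whose planned time has arrived."""
    return [t for t in tasks if t.run_at <= now]


def execute(task):
    # Placeholder for the real operation against the instance.
    return f"{task.action} on {task.instance_id}"


def tick(tasks, now, pool):
    """One scheduler tick: query due tasks and run them on the thread pool."""
    futures = [pool.submit(execute, t) for t in find_due_tasks(tasks, now)]
    return [f.result() for f in futures]


if __name__ == "__main__":
    now = datetime(2022, 9, 1, 2, 0, 30, tzinfo=timezone.utc)
    tasks = [
        # Due 30 seconds ago: picked up by this tick, 30 seconds late.
        PlannedTask("t1", "gp-demo-1", "scale-out", now - timedelta(seconds=30)),
        # Not due yet: left for a later tick.
        PlannedTask("t2", "gp-demo-2", "pause", now + timedelta(minutes=5)),
    ]
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(tick(tasks, now, pool))
```

Because a task becomes visible only at the first tick after its planned time, the worst case in this sketch is a task that becomes due just after a tick and waits almost a full minute for the next one.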

State Machine of Planned Tasks

Planned tasks have the following states, and the state transitions are shown in the following figure:

Pending: The initial state of a plan, waiting to be scheduled for execution.
Running: The plan is being executed.
Finished: A task scheduled for a specified time was executed successfully and is complete.
Success: A periodic task was executed successfully in the current cycle.
Failure: The task failed to execute and is waiting to be retried.
Discard: The number of failed retries reached a certain threshold, so the execution is abandoned and no further retries are made.
Cancel: The plan is paused.
Deleted: The plan is deleted.
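One way to encode these states is an enum plus a table of allowed transitions. The state names below come from the list above, but the transition table is an assumption inferred from the state descriptions, and the retry threshold max_retries is a made-up parameter. Treat this as a sketch rather than the product's actual state machine.

```python
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    FINISHED = "Finished"
    SUCCESS = "Success"
    FAILURE = "Failure"
    DISCARD = "Discard"
    CANCEL = "Cancel"
    DELETED = "Deleted"


# Assumed transitions, inferred from the state descriptions only.
ALLOWED = {
    State.PENDING: {State.RUNNING, State.CANCEL, State.DELETED},
    State.RUNNING: {State.FINISHED, State.SUCCESS, State.FAILURE},
    State.SUCCESS: {State.RUNNING, State.CANCEL, State.DELETED},  # next cycle
    State.FAILURE: {State.RUNNING, State.DISCARD, State.DELETED},  # retry or give up
    State.CANCEL: {State.PENDING, State.DELETED},                  # re-enable or delete
    State.FINISHED: {State.DELETED},
    State.DISCARD: {State.DELETED},
    State.DELETED: set(),
}


def transition(current, target):
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def next_after_failure(retries_done, max_retries=3):
    """Retry until the (hypothetical) threshold is reached, then discard."""
    return State.RUNNING if retries_done < max_retries else State.DISCARD
```

Keeping the transitions in one table makes illegal moves, such as running a deleted plan, fail loudly instead of silently corrupting the plan status.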

Challenges Posed by Planned Tasks

Planned tasks let users plan resource usage over time: instances are automatically scaled in or paused during business troughs and automatically started or scaled out during business peaks, which can reduce user costs substantially. However, planned tasks also place higher demands on the automated O&M of the product:

Planned tasks must be executed on time.
The success rate of planned tasks must be high.
Product cost must not increase.

How to Ensure an On-Time Plan Execution

To ensure that planned tasks are executed on time, we focus on two things: the timing task scheduler and monitoring alerts.

The execution logic of the timing task scheduler is kept simple so that no additional logic introduces scheduling delays. The timing framework itself raises alerts when a task execution fails or is delayed.
In addition to the delay alert in the plan management module, a separate inspection module regularly checks the tasks that are due. If a planned task is delayed by more than a certain threshold, an alert is raised.
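The inspection check described above can be sketched as a small function that compares planned times with the current time. The function name find_delayed, the dictionary layout, and the two-minute threshold are assumptions made for illustration, since the article does not state the real threshold.

```python
from datetime import datetime, timedelta, timezone


def find_delayed(pending, now, threshold=timedelta(minutes=2)):
    """Return IDs of tasks that are overdue by more than the threshold.

    pending maps a task ID to its planned UTC time and contains only
    tasks that have not started yet. The threshold value is hypothetical.
    """
    return sorted(
        task_id
        for task_id, planned in pending.items()
        if now - planned > threshold
    )


if __name__ == "__main__":
    now = datetime(2022, 9, 1, 2, 5, tzinfo=timezone.utc)
    pending = {
        "t1": now - timedelta(minutes=5),   # overdue beyond the threshold
        "t2": now - timedelta(seconds=30),  # within normal scheduler delay
        "t3": now + timedelta(minutes=1),   # not due yet
    }
    for task_id in find_delayed(pending, now):
        print(f"ALERT: planned task {task_id} is delayed")
```

Setting the threshold above the scheduler's one-minute cycle avoids alerting on delays that are expected by design.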

How to Ensure the Success Rate

Planned tasks fall into two types. The first type, such as pause and scale-in, does not depend on underlying resources. These tasks depend mainly on the runtime environment during execution, and the probability of failure is small. Even if such a task fails, the impact on users is limited as long as it is handled within a reasonable time. The second type, start and scale-out, depends on both the current runtime environment and the watermark of the underlying resource pool. If these tasks run late or fail, they affect the user's business and can easily cause production failures. Ensuring the success rate of these tasks is therefore our primary concern.

The underlying layer of an AnalyticDB for PostgreSQL Serverless instance is deployed in resource pool mode, which speeds up instance creation and elasticity and enables scale-in and scale-out within seconds. However, the resource pool must maintain a certain resource buffer, and if the sales rate of those resources does not rise, the business cost does. For planned tasks, keeping the resources of a paused or scaled-in instance guarantees capacity when the instance is started or scaled out again, but it undoubtedly increases the cost of the product. If the resources are released, how can we ensure that resources are available when the instance is started or scaled out?

We adopt a resource pool mode with separate hot and cold pools. The hot pool holds schedulable resources and does not maintain a resource buffer. The cold pool holds stopped ECS instances that are pre-installed with business components. Their computing resources are not charged, so only the system disk cost is incurred. When the hot pool runs short, resources from the cold pool are popped up into the hot pool without business awareness. The hot and cold resource pool mode works as follows:

The auto-scaling controller produces one prediction of resource sales volume every five minutes and proactively keeps the total amount of resources in the hot pool at the predicted level.
When the resource scheduler evaluates resources and the currently schedulable nodes cannot meet the request, it evaluates the cold pool as well. For the cold pool, it only checks whether ECS instances of the required specifications can be newly purchased (the same logic as resource evaluation for a single tenant). If the cold pool can meet the request, the resource evaluation succeeds.
When the resource scheduler receives a request to create an instance, it schedules the instance in the hot pool. If the resources are insufficient, it notifies the cluster-autoscaler of the total amount of missing resources.
After receiving the pop-up event, the auto-scaling controller converts the resource amount into a number of ECS nodes and moves those nodes from the cold pool to the hot pool.
The business layer automatically retries when instance creation fails. As long as the cold pool resources are popped up successfully within the retry interval, the instance is created without business awareness.
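The pop-up and retry steps above can be illustrated with a simplified, synchronous model. This is a sketch under several assumptions: resources are counted in cores, every cold node has the same size, and the ResourcePools class with its method names is hypothetical. In the real system the pop-up happens asynchronously, between retries of the business layer.

```python
import math
from dataclasses import dataclass


@dataclass
class ResourcePools:
    hot_cores: int       # schedulable capacity in the hot pool
    cold_nodes: int      # stopped ECS nodes waiting in the cold pool
    cores_per_node: int  # hypothetical uniform node size

    def nodes_needed(self, missing_cores):
        """Convert a resource shortfall into a whole number of ECS nodes."""
        return math.ceil(missing_cores / self.cores_per_node)

    def pop_up(self, missing_cores):
        """Move enough cold nodes into the hot pool to cover the shortfall."""
        n = min(self.nodes_needed(missing_cores), self.cold_nodes)
        self.cold_nodes -= n
        self.hot_cores += n * self.cores_per_node
        return n

    def try_schedule(self, cores):
        """Schedule from the hot pool; return the shortfall (0 on success)."""
        if cores <= self.hot_cores:
            self.hot_cores -= cores
            return 0
        return cores - self.hot_cores


def create_instance(pools, cores, max_attempts=3):
    """Business-layer retry: on a shortfall, request a pop-up and try again.

    Returns the attempt number that succeeded, or None if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        missing = pools.try_schedule(cores)
        if missing == 0:
            return attempt
        pools.pop_up(missing)
    return None


if __name__ == "__main__":
    pools = ResourcePools(hot_cores=8, cold_nodes=4, cores_per_node=16)
    # First attempt is short by 32 cores, so 2 cold nodes are popped up.
    print(create_instance(pools, cores=40), pools)
```

Rounding the shortfall up to whole nodes means the hot pool may briefly hold slightly more capacity than requested, which the next prediction cycle can correct.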

Best Practices

After purchasing a Serverless instance, users can perform the following operations to create a planned task and view the execution records in the Event Center.

Note: Currently, Serverless instances support planned tasks only on a pay-as-you-go basis.

Click the link to purchase Serverless Instance: Pay-As-You-Go Trial

Create a Specified Time Plan

Log on to the AnalyticDB for PostgreSQL console. On the Instance Details page, click Plan Management and then click Create a Planned Task:

Note: The specified run time is in UTC, so convert it from your local time. After the planned task is created, its details, including the plan status and planned execution time, can be viewed on the Plan List page.

Create a Periodic Execution Plan

Log on to the AnalyticDB for PostgreSQL console. On the Instance Details page, click Plan Management and then click Create a Planned Task:

Note: Cron expressions are evaluated in UTC, so convert them from your local time. After the planned task is created, its details, including the plan status and planned execution time, can be viewed on the Plan List page.
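The UTC conversion mentioned in the notes can be checked with a few lines of Python. The helper names are hypothetical, the fixed UTC offset ignores daylight saving time, and the five-field cron layout (minute, hour, day of month, month, day of week) is an assumption, so confirm the exact cron syntax that the console accepts.

```python
from datetime import datetime, timedelta, timezone


def local_to_utc(local_time, utc_offset_hours):
    """Interpret a naive local time at a fixed UTC offset and convert to UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local_time.replace(tzinfo=tz).astimezone(timezone.utc)


def daily_cron_in_utc(hour, minute, utc_offset_hours):
    """Build a daily five-field cron string in UTC from a local time of day."""
    # The calendar date is arbitrary; only the time of day matters here.
    utc = local_to_utc(datetime(2022, 9, 1, hour, minute), utc_offset_hours)
    return f"{utc.minute} {utc.hour} * * *"


if __name__ == "__main__":
    # A one-time plan at 02:00 local time in a UTC+8 region.
    print(local_to_utc(datetime(2022, 9, 1, 2, 0), 8).isoformat())
    # A daily plan at 02:00 local time in a UTC+8 region.
    print(daily_cron_in_utc(2, 0, 8))
```

When the conversion crosses midnight, as in this UTC+8 example, the UTC date falls on the previous day, so any day-of-week or day-of-month field in a cron expression must be shifted as well.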

Edit Planned Tasks

A planned task that has been created can be edited to change its name, description, and run time.

Disable Planned Tasks

If users do not want a planned task to run for the time being, they can temporarily disable it. A disabled planned task has the status Disabled and is not executed.

Enable Planned Tasks

Users can re-enable disabled plans when needed.

Delete Planned Tasks

If users no longer want to execute the planned task, they can delete it.

After a planned task is deleted, it no longer appears in the plan list.

View Time Change Records

Operations on a planned task are displayed as Notification class events, which lets you trace the change history. The execution results of planned tasks are also displayed as Notification class events.

Summary

Reducing costs and increasing efficiency has always been a goal we share with our users. From manual start and pause to per-second billing and planned tasks, we continue to polish our product and strive to provide users with cost-effective and easy-to-use cloud-native data warehouse products.
