The Principle of the Elasticity Technology of the Cloud-native Data Warehouse AnalyticDB

This article introduces the AnalyticDB MySQL Multi-Cluster elastic model for automatic and intelligent scaling to better fit business loads, make full use of resources, and maximize benefits.

By Guyue

Overview

In the context of a global economic slowdown and the intensified focus on digital infrastructure among businesses, the imperative of maximizing benefits is inescapable. Alibaba Cloud's cloud-native data warehouse, AnalyticDB for MySQL Data Lakehouse Edition (hereinafter referred to as AnalyticDB for MySQL), featured a scheduled elasticity function from the outset. This functionality allowed businesses to routinely scale computing resources up or down to reduce costs. A year on, AnalyticDB for MySQL has introduced a multi-cluster elastic resource model in response to user needs. This model adapts to user demand, automatically configures resources, and incrementally improves performance, further aiding users in reducing expenses and enhancing computational efficiency.

Introduction to the Elastic Model

There are two types of elastic models: the Min-Max elastic model and the Multi-Cluster elastic model.

Min-Max Elastic Model

The resources available to a single SQL statement can be scaled between minimum and maximum values. This model is suitable for ETL scenarios, aiming to enhance the performance of individual SQL statements. For example, if the minimum value is set to 16 cores and the maximum value is set to 32 cores, the resources utilized by a single SQL statement will fall within the range of 16 to 32 cores.

Multi-Cluster Elastic Model

This model scales resources at the cluster level. In this model, the resources available to a single SQL statement are confined to a single cluster, and resources are isolated between clusters. It is suitable for online analysis and interactive analytic scenarios, improving SQL concurrency.

For instance, let's assume the size of a single cluster is 16 cores, and the minimum and maximum number of clusters are set to 1 and 2, respectively. In this case, each SQL statement can utilize 16 cores, which corresponds to a single cluster. As the SQL concurrency increases or decreases, the number of clusters can automatically adjust between 1 and 2. Importantly, SQL statements running on Cluster 1 will not affect those running on Cluster 2.

Advantages of the Multi-Cluster Elastic Model

Before the introduction of the multi-cluster elastic model, AnalyticDB for MySQL utilized the min-max elastic model to scale and release resources through scheduled elasticity. However, this model had certain limitations:

• Usability: Users could only scale resources within specified time periods based on business requirements. The elastic plan took effect at longer intervals, with a minimum interval of 10 minutes.

• Performance: Small queries were easily impacted by larger queries. The query concurrency remained unchanged regardless of the resource group size, leading to potential query queuing.

• Cost: Users needed to specify scaling for fixed periods, which made it difficult to adapt to real-time business loads and to scale in quickly during periods of low demand.

To address these challenges and cater to high-concurrency real-time analysis scenarios, AnalyticDB for MySQL introduced the multi-cluster elastic model. This model is employed within the online resource groups of AnalyticDB for MySQL, with each resource group consisting of one or more clusters. Compared to the min-max elastic model, the multi-cluster elastic model offers improved usability, performance, and cost-effectiveness.

Usability

The multi-cluster elastic model automatically adjusts the number of clusters based on the load of the current instance, allowing for more flexible response to different business traffic. Users no longer need to manually set the resource scaling time; they only need to define the upper and lower limits of the number of clusters and the size of each cluster.

Performance

Clusters are isolated from each other, so a single SQL statement affects only the cluster it runs on and does not impact the execution of SQL statements in other clusters. This prevents large SQL statements from interfering with smaller ones and from degrading overall query performance within the resource group. Experimental results have shown that query concurrency improves linearly as the number of clusters increases. Compared to the min-max elastic model, the multi-cluster elastic model can achieve a 28% increase in query concurrency with the same computing resources.

Cost

The multi-cluster elastic model dynamically selects the optimal number of clusters based on user load to handle business peaks and troughs. To better illustrate the cost benefits of the multi-cluster elastic model compared to the min-max elasticity model, let's consider a practical example:

• 12:00 a.m. to 7:00 a.m.: During this period, the customer service experiences off-peak hours. The user utilizes scheduled elasticity to decrease computing resources to 48 ACU.

• 7:00 a.m. to 12:00 a.m. (midnight): The customer service enters an intermittent peak period. The user uses scheduled elasticity to maintain computing resources at 192 ACU to handle potential peaks at any given time.

• The total amount of resources used throughout the day is 3600 ACU-hours (48 ACU × 7 hours + 192 ACU × 17 hours).

Note: ACU (AnalyticDB Compute Unit) is the smallest unit for resource allocation in AnalyticDB for MySQL. One ACU is approximately equivalent to 1 core and 4 GB of memory.

After applying the multi-cluster elastic model:

• 12:00 a.m. to 7:00 a.m.: The user reduces the cluster size to 48 ACU, maintaining consistency with the scheduled elasticity scenario.

• 7:00 a.m. to 12:00 a.m. (midnight): The user changes the cluster size to 96 ACU and sets the minimum and maximum number of clusters to 1 and 2, respectively.

• The total amount of resources used throughout the day is 2208 ACU-hours, resulting in approximately 38.7% cost savings.
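The totals in this example can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes the peak window lasts 17 hours (7:00 a.m. to midnight), which is what the stated total of 3600 implies, and it assumes the second 96 ACU cluster is active for 2.5 hours in total. The 2.5-hour figure is not given in the article; it is chosen here because it reproduces the stated total of 2208. The function names are hypothetical.

```python
# Worked arithmetic for the cost example, in ACU-hours.
# Assumption: the off-peak window is 7 hours and the peak window is 17 hours.
OFF_PEAK_HOURS = 7
PEAK_HOURS = 17


def scheduled_cost(off_peak_acu: float, peak_acu: float) -> float:
    """ACU-hours when resources are held at a fixed size in each window."""
    return off_peak_acu * OFF_PEAK_HOURS + peak_acu * PEAK_HOURS


def multi_cluster_cost(off_peak_acu: float, cluster_acu: float,
                       extra_cluster_hours: float) -> float:
    """ACU-hours with one cluster always on during the peak window and a
    second cluster of the same size active for extra_cluster_hours."""
    base = off_peak_acu * OFF_PEAK_HOURS + cluster_acu * PEAK_HOURS
    return base + cluster_acu * extra_cluster_hours


scheduled = scheduled_cost(48, 192)
# Assumption: the second cluster runs for 2.5 hours in total.
multi = multi_cluster_cost(48, 96, 2.5)
savings = (scheduled - multi) / scheduled

print(scheduled, multi, f"{savings:.1%}")
```

Under these assumptions, the saving comes entirely from the peak window: the second cluster is paid for only while it is actually running, instead of for all 17 hours.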

It can be observed that compared to scheduled elasticity, the multi-cluster elastic model reduces the number of clusters and cuts costs for users during low-demand periods. When a business peak arrives, the number of clusters increases accordingly to handle the increased workload.

In general, the multi-cluster elastic model is more suitable for high-concurrency real-time analysis scenarios compared to the min-max elastic model. It offers improved ease of use, higher performance, and lower costs. Next, let's delve into the technical architecture of the multi-cluster model and how it achieves accurate, fast, and efficient scaling.

Multi-Cluster Technology Architecture

When designing the multi-cluster technology architecture, we focused on three core goals: accuracy, speed, and efficiency of scaling. The diagram below illustrates the overall architecture of the AnalyticDB MySQL multi-cluster model, which can be divided into three layers:

• Access layer: This layer delivers user queries to specific resource groups and then distributes them to specific clusters for execution based on the cluster's load.

• Execution layer: Within each resource group, multiple clusters of the same size are created, and each query is executed on only one of the clusters.

• Decision layer: This layer continuously monitors the load of AnalyticDB for MySQL resources in real time to make informed decisions about the scaling of multi-cluster resource groups.

Fast Scaling: Real-time Data Link

To ensure timely scale-out and quick response to user queries, AnalyticDB for MySQL has implemented a region-level metric collection system. The instances of AnalyticDB for MySQL are modified to update internal business metrics (such as the number of queued queries and CPU usage) in real time. These metrics are gathered by the metric collection process and stored in central storage. The current end-to-end delay of the data link is approximately 10 seconds.

Accurate Scaling: Stable Scaling Policies

Instances of AnalyticDB for MySQL are complex systems comprising access layer nodes, compute nodes, and storage nodes. When making scaling decisions for computing resources, the following challenges are encountered:

• How to identify bottlenecks in external components in multi-component interactive systems?

• With numerous instance metrics available, which metrics should be used to make scaling decisions?

• How to prevent short-term fluctuations in metrics from causing inaccurate scaling policies?

• How to determine the reasonableness of scaling decisions?

To address these challenges, we divide the entire cluster number calculation into three stages: decision-making, execution, and feedback.

Decision-making

Bottleneck Identification

In our decision-making system, metrics are divided into two categories.

• Positive metrics provide feedback on the load status of the target component, which helps in making scaling decisions.

• Negative metrics provide feedback on the load status of components other than the target component, helping to identify external bottlenecks.

When the negative metrics reach the specified threshold, it indicates that scaling out the compute nodes for the AnalyticDB for MySQL instance may not accelerate queries or improve concurrency. In such cases, the decision-making system will raise an alarm until the bottleneck is resolved.
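The positive/negative split can be sketched as a simple gate in front of the scaling decision. This is a minimal illustration, not the actual decision-making system: the `Metric` type, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    threshold: float
    positive: bool  # True: describes the target component (compute nodes)


def external_bottlenecks(metrics: List[Metric]) -> List[str]:
    """Names of negative metrics that have reached their thresholds."""
    return [m.name for m in metrics
            if not m.positive and m.value >= m.threshold]


def decide(metrics: List[Metric]) -> str:
    """Raise an alarm instead of scaling when another component is the bottleneck."""
    blocked = external_bottlenecks(metrics)
    if blocked:
        return "alarm: external bottleneck in " + ", ".join(blocked)
    return "evaluate scaling"


# Hypothetical metric names and thresholds, for illustration only.
sample = [
    Metric("compute_cpu_usage", 85.0, 70.0, positive=True),
    Metric("storage_node_cpu_usage", 95.0, 90.0, positive=False),
]
print(decide(sample))
```

The point of the gate is ordering: negative metrics are checked first, so a saturated storage or access layer never triggers a compute scale-out that could not help.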

Estimation of the Cluster Number

After analyzing the load of an AnalyticDB for MySQL instance, we have found that scaling cannot be determined by a single metric. The metrics used for scaling decisions vary depending on the type of user load. The main metrics used include user CPU utilization, user memory usage, and the number of queued user queries.

During the scaling estimation process, we calculate the number of candidate clusters based on each metric and select the largest number of clusters among all metrics as the final candidate.

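• For metrics related to the cluster (such as CPU usage), the following formula can be used:

[Formula not reproduced: candidate cluster number from cluster-level metrics]

• For metrics related to the resource group but not the cluster (such as the number of queued queries and concurrency), the following formula can be used:

[Formula not reproduced: candidate cluster number from resource-group-level metrics]

• Finally, the largest of all the calculated candidate cluster numbers is taken:

target cluster number = max(candidate cluster numbers over all metrics)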
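The per-metric formulas are not shown in this text, so the following is only a sketch under an assumed proportional rule, similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler. The target values, the function names, and the clamping to the configured minimum and maximum are assumptions. Only the final step, taking the largest candidate, comes directly from the description above.

```python
import math


def candidate_from_cluster_metric(current_clusters: int, observed: float,
                                  target: float) -> int:
    """Assumed rule for per-cluster metrics (e.g. average CPU usage in %):
    scale the current cluster count by observed / target."""
    return math.ceil(current_clusters * observed / target)


def candidate_from_group_metric(observed_total: float,
                                target_per_cluster: float) -> int:
    """Assumed rule for resource-group metrics (e.g. queued queries):
    clusters needed so that each one handles at most the target amount."""
    return math.ceil(observed_total / target_per_cluster)


def estimate_clusters(current_clusters: int, cpu_usage: float, cpu_target: float,
                      queued: int, queued_target_per_cluster: int,
                      min_clusters: int, max_clusters: int) -> int:
    candidates = [
        candidate_from_cluster_metric(current_clusters, cpu_usage, cpu_target),
        candidate_from_group_metric(queued, queued_target_per_cluster),
    ]
    # From the article: take the largest candidate across all metrics.
    # Assumption: the result is clamped to the user-configured limits.
    return max(min_clusters, min(max_clusters, max(candidates)))


print(estimate_clusters(2, 90, 60, 17, 4, min_clusters=1, max_clusters=8))
```

Taking the maximum means the most constrained resource drives the decision: in the call above, CPU alone would ask for 3 clusters, while the queue length asks for 5.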

Stability Window

To keep short-term metric jitter (caused, for example, by fluctuations in instance load) from triggering unnecessary scaling, a stability window algorithm can be used.

During the stability window, the estimated number of clusters each time is recorded, and the current number of clusters is compared with the number recommended in the stability window.

a) If the current cluster number is greater than or equal to the window min and less than or equal to the window max, it remains unchanged.

b) If the current cluster number is less than the window min, it should be scaled out to the window min.

c) If the current cluster number is greater than the window max, it should be scaled in to the window max.
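The three rules amount to clamping the current cluster number into the range of recent estimates. A minimal sketch follows, assuming the window holds a fixed number of the most recent estimates. The article does not state the window length or whether it is time-based, and the class name is hypothetical.

```python
from collections import deque


class StabilityWindow:
    """Keeps the most recent cluster-number estimates.

    Assumption: the window is a fixed count of samples.
    """

    def __init__(self, size: int) -> None:
        self._estimates = deque(maxlen=size)

    def record(self, estimate: int) -> None:
        self._estimates.append(estimate)

    def target(self, current: int) -> int:
        """Apply the three rules to the current cluster number."""
        if not self._estimates:
            return current
        window_min = min(self._estimates)
        window_max = max(self._estimates)
        if current < window_min:
            return window_min  # every recent estimate wants more: scale out
        if current > window_max:
            return window_max  # every recent estimate wants fewer: scale in
        return current         # estimates disagree or match: stay put


window = StabilityWindow(size=3)
for estimate in (3, 1, 2):
    window.record(estimate)
print(window.target(current=1))
```

A change happens only when every estimate in the window agrees on the direction, so a single spike or dip in the metrics cannot move the cluster number on its own.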

Execution

The clusters in the resource groups and the Kubernetes custom resources in AnalyticDB for MySQL are managed by the in-house operator. These custom resources implement the Kubernetes scale subresource. After the decision-making system makes a decision, it sends the target cluster number to the custom operator through the Kubernetes scale API for scaling.
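For a custom resource that enables the scale subresource, Kubernetes exposes a `/scale` endpoint that accepts a patch to `spec.replicas`. The sketch below only builds the request path and a JSON merge patch body; it does not contact a cluster. The API group, version, resource plural, namespace, and object name are hypothetical, since the article does not name the actual custom resources.

```python
import json


def scale_subresource_path(group: str, version: str, namespace: str,
                           plural: str, name: str) -> str:
    """REST path of the scale subresource for a namespaced custom resource."""
    return (f"/apis/{group}/{version}/namespaces/{namespace}"
            f"/{plural}/{name}/scale")


def scale_patch_body(target_clusters: int) -> str:
    """JSON merge patch (Content-Type: application/merge-patch+json) that
    sets the desired count on the Scale object."""
    if target_clusters < 0:
        raise ValueError("target_clusters must be non-negative")
    return json.dumps({"spec": {"replicas": target_clusters}})


# Hypothetical identifiers, for illustration only.
path = scale_subresource_path("example.adb.io", "v1", "adb-instance",
                              "resourcegroups", "online-rg")
body = scale_patch_body(2)
print("PATCH", path)
print(body)
```

Going through the scale subresource keeps the decision-making system decoupled from the resource schema: it needs to know only the target number, while the operator decides how to create or release clusters.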

Feedback: Effectiveness Evaluation

After the scale-out (or scale-in) is completed, the decision-making system records the value of the metrics before the scale-out (or scale-in) and keeps observing the metrics of user load.

• Scale-out evaluation: The decision-making system continuously observes whether the QPS of user queries increases, or the RT of user queries decreases, in proportion to the cluster ratio after the scale-out. If no significant improvement is detected, the scale-out is considered invalid. The decision-making system will restore the cluster number to the original number, stop the scale-out decision, and send an alert.

• Scale-in evaluation: The decision-making system continuously observes whether the QPS and RT of user queries change significantly after the scale-in. If the changes exceed a certain threshold, the system will restore the cluster number to the original number, stop the scale-in decision, and send an alert.
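The two evaluations can be sketched as simple checks on the metrics recorded before and after the change. The thresholds below (`min_fraction` and `max_change`) are hypothetical, as the article does not give the actual values. The sketch also treats only degradation (lower QPS, higher RT) as a significant change after a scale-in, which is an assumption.

```python
def scale_out_effective(before_qps: float, after_qps: float,
                        before_rt: float, after_rt: float,
                        old_clusters: int, new_clusters: int,
                        min_fraction: float = 0.5) -> bool:
    """True if QPS rose or RT fell by at least min_fraction of the gain
    an ideal scale-out in proportion to the cluster ratio would give."""
    ratio = new_clusters / old_clusters
    expected_gain = 1 + (ratio - 1) * min_fraction
    qps_goal = before_qps * expected_gain
    rt_goal = before_rt / expected_gain
    return after_qps >= qps_goal or after_rt <= rt_goal


def scale_in_safe(before_qps: float, after_qps: float,
                  before_rt: float, after_rt: float,
                  max_change: float = 0.2) -> bool:
    """True if QPS did not drop and RT did not rise by more than max_change."""
    qps_drop = (before_qps - after_qps) / before_qps
    rt_rise = (after_rt - before_rt) / before_rt
    return qps_drop <= max_change and rt_rise <= max_change


# Scaling from 1 to 2 clusters: QPS went from 100 to 160, RT unchanged.
print(scale_out_effective(100, 160, 200, 200, 1, 2))
# Scaling in: QPS fell from 100 to 60, which exceeds the allowed change.
print(scale_in_safe(100, 60, 200, 210))
```

When either check returns False, the description above calls for restoring the original cluster number, pausing further decisions in that direction, and sending an alert.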

Good Scaling: Routing Policy Based on Load Balancing

After the number of clusters increases, users do not need to specify the cluster to which queries are routed. AnalyticDB for MySQL automatically routes queries to the cluster with the minimum load based on the load balancing algorithm.

The following figure shows the routing based on cluster load balancing.

When Q5 comes, the load of Cluster 0 is 2, the load of Cluster 1 is 1, and the load of Cluster 2 is 1. In this case, Q5 is preferentially allocated to Cluster 1 for execution.
When Q6 comes, the load of Cluster 0 is 2, the load of Cluster 1 is 2, and the load of Cluster 2 is still 1. In this case, Q6 is allocated to Cluster 2 for execution.
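The Q5 and Q6 example can be reproduced with a least-load picker. This is a minimal sketch: it assumes that a cluster's load is its number of running queries and that ties go to the lowest cluster index, which matches the example but is not stated explicitly. The function name is hypothetical.

```python
from typing import List


def pick_cluster(loads: List[int]) -> int:
    """Index of the least-loaded cluster; ties go to the lowest index."""
    return min(range(len(loads)), key=lambda i: loads[i])


# Loads of Cluster 0, Cluster 1, and Cluster 2 when Q5 arrives.
loads = [2, 1, 1]

q5_cluster = pick_cluster(loads)
loads[q5_cluster] += 1  # Q5 starts running on the chosen cluster

q6_cluster = pick_cluster(loads)
loads[q6_cluster] += 1  # Q6 starts running on the chosen cluster

print(q5_cluster, q6_cluster, loads)
```

After both queries are placed, every cluster carries a load of 2, which is the balancing effect the routing policy aims for.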

Summary

To better fit the business load, make full use of resources, and maximize benefits, we have introduced the AnalyticDB MySQL Multi-Cluster elastic model to complete automatic and intelligent scaling. The AnalyticDB MySQL Multi-Cluster model has the following characteristics:

• Cost: Automatic scaling in and out to fit business loads means lower cost compared with fixed resources in a single-cluster resource group.

• Query performance: Query concurrency increases linearly with the number of clusters, and query isolation is superior to that of the min-max resource group model.

• Automatic elasticity: No manual operation is needed to adjust the size of resource groups.

In the future, we will continue to improve the model in the following areas:

• Proactive elasticity: Minimize query latency based on predictive proactive elasticity.

• Load decoupling: Use WorkLoadManager to automatically redirect large queries to offline resource groups to reduce resource contention in online resource groups.

• Elasticity efficiency: Speed up scaling with a hot pool of resource pods.

• Performance display: Visualize performance optimization and cost savings.
