The contributors to this paper include: Chen Jie (nicknamed Ai'ao at Alibaba), Zhang Ping (Chiyuan), Qiao Honglin (Honglin), Tan Jian (Tanjian), Qixing, and Zhang Tieying (Tieying).
The VLDB 2019 conference (the 45th International Conference on Very Large Data Bases) was held in Los Angeles, California from August 26 to 30, 2019. At this conference, multiple papers from the Alibaba Cloud database product team were selected into the Research Track and the Industrial Track.
This article provides a detailed overview of "iBTune: Individualized Buffer Tuning for Large-scale Cloud Databases", a shortlisted paper in the Research Track.
About five or six years ago, the Alibaba database team began applying the experience of database administrators to products in order to provide more efficient and intelligent database services for business development. The cloud database administration platform has provided self-service intelligent diagnosis and tuning services since 2014 and, after four years of continuous investment, evolved into the next-generation Self-Driving Database Platform (SDDP) in 2018.
SDDP is an intelligent database platform that gives the databases running on it self-driving capabilities: these databases can independently perceive their state, make decisions, recover, and optimize themselves, with the goal of providing continuous service that is imperceptible to users. In addition to the core capabilities of automatic diagnosis, recovery, and optimization, SDDP provides a range of other capabilities, such as physical resource management, instance lifecycle management, security, and auto scaling.
At the end of 2017, SDDP enabled fully automated end-to-end optimization for all database instances, covering automated optimization of slow SQL queries, storage space, and parameters for large databases. This article describes the automated parameter tuning for large databases.
In recent years, intelligent database tuning technology has primarily focused on tuning database parameters with data-driven and machine learning algorithms. This presents major technical challenges: how do we automatically configure millions of database instances running different services so that each meets its service level agreement (SLA) requirements at the lowest possible resource cost? Solving this problem is of great value to cloud service providers.
The academic community has researched intelligent database tuning only in the past few years. For example, Carnegie Mellon University developed OtterTune, an automatic tuning tool; however, it relies on previous manual tuning experience and cannot scale to large databases. Microsoft Azure SQL Database and AWS have also invested in intelligent database tuning but, at the time of writing, had not published papers or released products.
Alibaba has been exploring intelligent database parameter tuning since 2018. Multiple teams cooperated on technical breakthroughs, from problem definition, core algorithm design, evaluation, and improvement through to end-to-end automated tuning, and finally deployed tuning at scale.
The paper "iBTune: Individualized Buffer Tuning for Large-scale Cloud Databases" was selected for inclusion in the Research Track at the VLDB 2019 conference, marking an important milestone for Alibaba's database intelligence business. The paper was co-written by Tan Jian, Tieying, Feidao, Ai'ao, Qixing, Chiyuan, Honglin, Shiyue, Mingsong, and Zhang Rui.
This paper proposes innovative ideas for intelligent database parameter tuning, which have been extensively implemented on about 10,000 instances within the Alibaba Group and helped reduce memory resource usage by about 12%. This makes Alibaba the only company in the industry to extensively implement database parameter tuning.
Parameter tuning is an important means of database tuning, and it becomes increasingly complex as the number of database parameters increases. For example, the latest version of MySQL provides more than 500 parameters, and even PostgreSQL provides about 290 parameters. Database tuning focuses on parameters that impact performance. Among these parameters, the buffer pool settings have the greatest impact.
Currently, Alibaba deploys multiple database instances on the same host. Hosts often run short of memory while CPU and storage resources sit idle, so server resources are wasted, and memory shortages become the bottleneck that prevents further increases in the deployment density of database instances.
We therefore focused our intelligent parameter tuning efforts on the buffer pool, because it consumes the most memory. Optimizing the buffer pool configuration is critical both for lowering overall server costs and for maintaining database instance performance.
For large databases, the challenge is how to set the buffer pool size for each database instance to an appropriate value in order to minimize the actual buffer pool size without compromising instance performance. Static configuration templates are typically used to set database instance parameters, which enables unified management and control of traditional large databases.
Alibaba provides database instances with 10 buffer pool specifications for business partners to choose from. When applying for instances, developers often pick the default or a larger specification without estimating the buffer pool size their services actually require. This results in serious resource waste.
The diversity and constant changes of services make it impossible for DBAs to manually tune the parameters of large databases. This prompted us to investigate the use of data-driven and machine learning algorithms to dynamically tune the database buffer pool based on changes in the database load and performance.
The buffer pool size and hit ratio are directly related. If we can find a relationship of the form buffer pool size = f(hit ratio) and determine a hit ratio that is acceptable to the business partner or database administrator, we may be able to reduce the buffer pool size without affecting the service.
The cache research literature for operating systems contains a great deal of work on the relationship between cache size and hit ratio. Some of this research shows that, in the long tail, the relationship follows a power law, that is, miss ratio ≈ α × (buffer pool size)^(−β), where α and β are workload-dependent constants.
We performed stress testing on different buffer pool configurations in major online transaction processing (OLTP) scenarios within Alibaba, such as adding items to the shopping cart and transaction payment, using Frodo, a tool developed by Alibaba DBAs. The test results confirmed that the MySQL buffer pool follows the assumed power law in the long tail.
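As a minimal sketch of the idea, the power-law relationship can be fitted in log-log space and then inverted to compute a smaller buffer pool size for a tolerable target miss ratio. The function names and the least-squares fit below are illustrative, not the paper's exact procedure:

```python
import math

def fit_power_law(sizes, miss_ratios):
    """Fit miss_ratio = alpha * size**(-beta) by least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(m) for m in miss_ratios]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    beta = -slope                     # power-law exponent
    alpha = math.exp(my + beta * mx)  # scale constant
    return alpha, beta

def target_buffer_size(cur_size, cur_mr, target_mr, beta):
    """Invert the power law: new_size = cur_size * (cur_mr / target_mr)**(1/beta)."""
    return cur_size * (cur_mr / target_mr) ** (1.0 / beta)
```

Given stress-test measurements at several buffer pool sizes, `fit_power_law` estimates the exponent, and `target_buffer_size` then shrinks the pool just enough to stay within the tolerable miss ratio.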
Alibaba runs more than 30,000 primary database instances. Among them, we look for instances similar to a target instance and use the miss ratios of those similar instances to determine a tolerable miss ratio for the target.
We selected features such as CPU usage, logical read, I/O read, miss ratio, and response time as the performance metrics to describe a service workload. We also selected statistical items such as mean, median, 70th percentile, and 90th percentile to measure the values of these features.
To reduce the impact of the difference between weekday and weekend performance, we selected performance data that spans four weeks for similarity calculation. The following figure shows examples of two pairs of similar instances.
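A minimal sketch of how such a similarity search might work: summarize each metric's four-week time series with the statistics named above, concatenate them into a feature vector per instance, and compare vectors. Cosine similarity and the nearest-rank percentile here are illustrative choices, not necessarily the paper's exact measure:

```python
import math

def feature_vector(samples):
    """Summarize one metric's multi-week time series with the statistics the
    text mentions: mean, median, 70th and 90th percentiles (nearest-rank)."""
    s = sorted(samples)
    n = len(s)
    pick = lambda p: s[min(n - 1, int(p * n))]
    return [sum(s) / n, pick(0.5), pick(0.7), pick(0.9)]

def similarity(vec_a, vec_b):
    """Cosine similarity between two instances' concatenated feature vectors."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b)
```

In practice, one vector per metric (CPU usage, logical reads, I/O reads, miss ratio, response time) would be concatenated before comparison, and the most similar instances' miss ratios would define the target's tolerable miss ratio.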
Now, we can calculate the buffer pool size by using a formula based on the defined parameters and target miss ratio. The next challenge lies in applying this algorithm in the engineering process.
From the perspective of business partners and database administrators, however, the hit ratio does not directly reflect the service impact on databases and cannot be used to measure the stability of database services. Given stability requirements, we could not deploy the algorithm to production services without a way to quantify its service impact.
After repeated discussions with database administrators and business partners, we found that the database response time is very important to database administrators and business partners, particularly the maximum response time of database instances when interacting with applications.
We can quantify the service impact after buffer pool tuning if we can predict the worst value (upper bound) of the response time of database instances. This removes a difficulty in buffer pool tuning and makes it possible to apply the algorithm in the production environment. Therefore, we used the algorithm to predict the response time upper bound for database instances.
We proposed a pairwise deep neural network (DNN) model for response time (RT) prediction, as shown in the following figure.
The DNN is a fully connected network, with ReLU as the activation function. Hidden Layer 1, Hidden Layer 2, and Hidden Layer 3 have 100, 50, and 50 nodes, respectively.
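The architecture described above can be sketched as a forward pass. This is an illustration only: the weights are randomly initialized rather than trained, the input width of 10 features is an assumption, and the pairwise comparison of two instances is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(0.0, x)  # activation function used in the paper

def init_layer(n_in, n_out):
    # Untrained random weights, for shape illustration only.
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

N_FEATURES = 10  # assumed input width; the actual feature count may differ
# Three fully connected hidden layers of 100, 50, and 50 units, as described.
layers = [init_layer(N_FEATURES, 100), init_layer(100, 50), init_layer(50, 50)]
out_w, out_b = init_layer(50, 1)  # single output: predicted RT upper bound

def forward(x):
    h = x
    for w, b in layers:
        h = relu(h @ w + b)
    return float((h @ out_w + out_b)[0])  # scalar response time prediction
```

A trained version of this network would map an instance's workload features to its predicted response time upper bound.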
In the RT prediction experiment, we compared regression algorithms such as linear regression (LR), XGBoost, random sample consensus (RANSAC), decision tree (DTree), elastic net (ENet), adaptive boosting (AdaBoost), gradient boosted decision trees (GBDT), k-neighbors regression (KNR), bagging regressor (BR), extremely randomized trees regressor (ETR), random forests (RF), and sparse subspace clustering (SSC). We also evaluated deep learning models: an instance-to-vector (I2V) DNN model, which adds an embedding layer, and the pairwise DNN model.
The following figure shows the structure of the I2V-DNN model.
To prove the universality of the algorithm, we selected 1,000 instances from several major database business scenarios within Alibaba, including examples with different read/write ratios, such as read-only, write-only, and read/write-balanced instances.
We evaluated algorithm performance by using the following three metrics:
AMRAE measures the percentage of errors in the response time prediction results, MAE measures the average error in response time prediction, and UMAE measures the underestimation of response time predictions.
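The three metrics can be sketched as below. These formulas are assumptions consistent with the descriptions above (relative error percentage, average error, and error over under-estimated points), not necessarily the paper's exact definitions:

```python
def amrae(y_true, y_pred):
    """Average relative absolute error as a percentage (assumed formula,
    consistent with 'percentage of errors' in the text)."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error of the RT predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def umae(y_true, y_pred):
    """Mean error over under-estimated points only (pred < true); under-
    estimating the RT upper bound is the riskier direction for tuning."""
    under = [t - p for t, p in zip(y_true, y_pred) if p < t]
    return sum(under) / len(under) if under else 0.0
```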
The following figure compares the response time prediction results based on the experiment data.
As shown in the preceding figure, the pairwise DNN algorithm surpasses the other algorithms in AMRAE and also performs best on the other metrics. Therefore, we selected it for response time prediction.
To intuitively show the changes before and after buffer pool tuning, we selected 10 instances and measured their database metrics before and after tuning, as shown in the following figure.
As shown in the preceding figure, the response time of the instances (except Instance 1) does not differ much before and after buffer pool tuning. Queries per second and CPU usage also remain nearly unchanged, indicating that the instances consume almost the same compute resources, while their memory usage drops to varying degrees.
Instance 1 experiences a large increase in response time after tuning. This instance normally has extremely low queries per second, and a single query accounts for most of its processing time. After tuning, this query behaves differently: the numbers of logical and physical reads increase significantly, which further raises the average response time. However, the absolute response time after tuning remains small and no slow SQL queries occur, which is acceptable to the service; therefore, no rollback was triggered.
We implemented an end-to-end algorithm application process, from data collection to execution of buffer pool tuning commands. The system has four modules: metric collection, data processing, decision making, and execution. The following figure shows the module design.
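The four-module flow can be sketched as a simple loop. All names below are illustrative stand-ins for the platform's real interfaces, with stubbed data in place of the monitoring system:

```python
def collect_metrics(inst):
    # Metric collection: in production these come from the monitoring system.
    return {"miss_ratio": 0.002, "buffer_pool_gb": 16.0}

def aggregate(metrics):
    # Data processing: clean and summarize multi-week metric streams.
    return metrics

def decide(features, target_mr=0.004, beta=1.0):
    # Decision making: shrink only when the current miss ratio is below the
    # tolerable target derived from similar instances (power-law inversion).
    if features["miss_ratio"] >= target_mr:
        return None
    return features["buffer_pool_gb"] * \
        (features["miss_ratio"] / target_mr) ** (1.0 / beta)

actions = {}
def execute(inst, new_size_gb):
    # Execution: stand-in for issuing the buffer pool resize command.
    actions[inst] = new_size_gb

def tuning_pipeline(instances):
    for inst in instances:
        features = aggregate(collect_metrics(inst))
        new_size = decide(features)
        if new_size is not None:
            execute(inst, new_size)
```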
Downgrading the buffer pool configuration affects stability: improper operations may increase the workload of database administrators and even cause service failures. The buffer pool tuning project therefore posed major challenges for the database administrators and the colleagues responsible for service stability in the related business units.
We have taken a series of measures to ensure service stability, such as automatically rolling back instances whose performance degrades after tuning.
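The rollback safeguard mentioned in the Instance 1 example can be sketched as a simple check: compare the observed post-tuning response time with the predicted upper bound and revert if it is exceeded or if slow SQL queries appear. The function name and inputs are illustrative assumptions:

```python
def should_roll_back(observed_rt_ms, predicted_rt_upper_ms, slow_queries=0):
    """Revert a buffer pool downgrade when the observed response time exceeds
    the predicted upper bound, or when slow SQL queries appear."""
    return observed_rt_ms > predicted_rt_upper_ms or slow_queries > 0
```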
After algorithm exploration and end-to-end automated buffer pool tuning, about 10,000 instances within Alibaba were tuned in FY2019, which reduced overall memory usage from 217 TB to 190 TB and saved 12.44% (27 TB) of memory resources.