Database administrators and application developers have all experienced SQL optimization. SQL statements running in databases are varied and continuous. Dynamic changes make SQL optimization necessary. Such changes include fast business iteration, change of data distribution characteristics, hot-data changes, and database version upgrades. For these changes, similar processes trigger different challenges each time.
It is not enough to analyze slow query logs for abnormal SQL queries. How do we accurately identify bottlenecks based on database experts' experience or tools and obtain rectification and optimization recommendations? How do we comprehensively evaluate the optimization effect and impact (including the adverse impact), evaluate the impact on related SQL, and write operations to ensure pre-release security? How do we select phased release policies and change windows for complex deployments, such as large-scale database sharding or table sharding, to promote online changes securely and stably? How do we continuously track the effects to ensure that everything is problem-free?
We need to consider two important time points (as shown in the following figure.) In a simple slow SQL query trend, T1 represents the time point when we find out the performance of the database instance is abnormal. From this point on, we start to optimize slow SQL queries. T2 represents the time point when the optimization is completed and the instance restores. In traditional optimization, this process is manually driven and usually has the following shortcomings:
In addition to the preceding two issues, we are faced with two more serious challenges:
Traditional methods are manually driven and have inherent limitations in the preceding two challenges. As a result, developers are often driven by faults and busy addressing faults. With large-scale business development and instance growth, these issues will be enlarged and probably cannot be resolved if more labor is invested.
Automatic SQL optimization is one of the core services of Alibaba Cloud Database Autonomy Service (DAS). It uses autonomous self-optimization to ensure SQL optimization closure.
As shown in the preceding figure, its closure capabilities include:
1) Detects abnormal workloads, identifies database business changes, and quickly identifies and locates abnormal SQL queries. For example, SQL exceptions include additional slow SQL queries, SQL queries that cause performance deterioration, and inefficient SQL queries.
2) Calls the SQL diagnosis and optimization service for abnormal SQL queries to generate optimization recommendations, for example, optimal indexing, SQL statement rewriting, and engine recommendations.
3) Evaluates optimization recommendation risks, generates a phase-release plan, and orchestrates optimization tasks based on database instance workloads and profiles.
4) Selects O&M windows and makes online changes based on the phased-release plan. In the current phase, indexes can be automatically changed online.
5) Tracks the multidimensional optimization effect for online changes and continuously and comprehensively evaluates performance regression risks. If the optimization meets the expectation, optimization benefit is automatically calculated. If the optimization fails to meet the expectation, it is automatically rolled back.
This closed automatic optimization loop converts manual and passive optimization to intelligent, proactive, and continuous optimization, ensuring unattended SQL optimization. With this closed-loop, you have a group of database experts who ensure that your database system always runs optimally 24/7.
To ensure unattended SQL optimization, we must address multiple challenges:
Accuracy: An exception detection mechanism is required to accurately identify the optimization time and locate abnormal SQL queries.
Professionalism: Professional and effective optimization diagnosis is required.
Security: The security of online changes must be controllable.
Comprehensiveness: The optimization effect needs to be monitored comprehensively from multiple dimensions and evaluated promptly to ensure security.
Association: Complex online problems sometimes require integrated management. For example, automatic SQL throttling and SQL optimization of DAS needs to be associated with sudden, malignant, slow SQL queries to solve both symptoms and root causes.
Scale: A scalable service architecture needs to be built to support large-scale automatic optimization for hundreds, thousands, or millions of SQL queries. The following interprets the DAS solution from multiple dimensions.
DAS automatic SQL optimization is a data-driven closed-loop. The preceding figure briefly illustrates the whole process.
SQL optimization can discover SQL exceptions in various scenarios:
The automatic SQL optimization service periodically analyzes slow SQL queries of instances offline in the O&M window period and initiates SQL optimization.
The workload exception detection algorithm will detect SQL queries with deteriorating performance promptly and trigger automatic SQL optimization. For complex online issues, automatic SQL optimization and automatic SQL throttling of DAS will be associated to initiate automatic SQL optimization.
With online and offline business SQL queries, the database workload and data amount change. Existing indexes cannot match business performance requirements or initiate instance workload diagnosis and optimization.
The DAS SQL diagnosis and optimization service provide strong support for automatic SQL optimization. It uses the cost-based model to formulate optimization measures, which is the same as the database optimizer. It quantifies all possible recommendation options based on the execution cost and makes reliable recommendations. The service has run stably in Alibaba for over three years with a daily diagnosis volume of about 50,000. It also supports SQL optimization for the business applications of the group. In the past three years, the SQL diagnosis success rate remained over 98%, and the recommendation rate for slow SQL queries was more than 75%.
To ensure secure changes, security check before changes, phased change policy, and performance tracking after changes are required.
Security check: To reduce risks, changes only occur in the O&M window period. We will also check the latency for data synchronization between the primary and secondary databases, instance workload, and table space. Changes are made only if these indicators are within the secure scope.
Phased change policy: In a large-scale database or table sharding scenarios, a phased change plan is automatically generated to reduce risks. During the change process, the system will monitor the latency for data synchronization between the primary and secondary instances. If the latency exceeds the threshold, all index change tasks of the database will be stopped, and only one change task can be executed in each database.
Effect evaluation: The effect evaluation algorithm tracks the performance of optimized SQL queries and related SQL templates to prevent faults due to performance deterioration. Based on the decision tree model, the performance tracking algorithm compares the performance indicators of the SQL template before and after optimization to comprehensively determine whether the SQL template has experienced performance deterioration at that time. Business is often changed by day and the default tracking time is 24 hours. If rollback is not performed, optimization is successful and the optimization benefit is calculated.
The DAS automatic SQL optimization service ran stably without faults in Alibaba for about two years before being released in the cloud. As of April 2020, the service has optimized more than 42 million slow SQL queries, and the slow SQL query rate of the group network dropped by about 92%. More importantly, the automatic SQL optimization service has built an effective proactive analysis and feedback system. Online failures and rollback examples during automatic optimization are automatically accumulated to the example system. The service continuously drives automatic optimization closure and fast iteration of the diagnosis service.
Alibaba Cloud DAS has recently launched this new feature. It is now available for a free trial by clicking here.
ApsaraDB - March 19, 2020
Alibaba Cloud MaxCompute - May 5, 2019
yzq1989 - April 10, 2020
ApsaraDB - July 8, 2020
hyj1991 - June 20, 2019
Alibaba Cloud New Products - June 1, 2020
ApsaraDB for HBase is a NoSQL database engine that is highly optimized and 100% compatible with the community edition of HBase.Learn More
A financial-grade distributed relational database that features high stability, high scalability, and high performance.Learn More
A database engine fully compatible with Apache Cassandra with enterprise-level SLA assurance.Learn More
TSDB is a stable, reliable, and cost-effective online high-performance time series database service.Learn More
More Posts by ApsaraDB