By Yunji, senior database R&D engineer from Alibaba Cloud ApsaraDB
In this article, we discuss how Alibaba Cloud Database Autonomy Service (DAS) provides auto-SQL throttling to handle service exceptions and database crashes effectively with quick turnaround time.
As the technology matures, especially with the popularization of cloud databases, database systems are becoming increasingly stable and easier to maintain. Many tasks such as version upgrades and instance migration are now completed automatically without affecting upper-layer services. Even if a hardware device or the network is abnormal, the preventive maintenance inspection (PMI) system immediately migrates data and restarts services to ensure service stability.
However, almost all the existing automated solutions focus on ensuring server-end stability. Issues caused by service-end exceptions, such as slow SQL statements caused by service changes and traffic bursts, still require human intervention. These solutions are unable to rapidly handle service exceptions and prevent systems from crashing.
There is a dramatic increase in the number of SQL statements executed in parallel, for example, in the case of cache penetration, abnormal calls, and major promotions.
For example, SQL statements querying for information about an account that involves a large number of transactions in the order data tend to slow down the database.
If a large number of SQL statements are executed on new database tables that are not indexed, the system will slow down and normal services will be affected.
Conventional solutions such as killing problematic SQL statements, restarting the database, or switching to a secondary database cannot solve these issues.
Database administrators (DBAs) are not usually able to do anything other than watch the database crash or call developers for help.
If you are familiar with the Alibaba Cloud products, you probably know that ApsaraDB for RDS for MySQL 5.6 and later versions effectively solves these issues through SQL throttling.
SQL throttling specifies the maximum degree of parallelism for executing any SQL statement in the database. By throttling the degree of parallelism for abnormal SQL statements, the database responds to service requests effectively while ensuring most of the services operate normally. In other words, the normal operation of the majority of services is guaranteed by compromising a minority of them.
This feature has been put into use for nearly three years. Many users have the following questions while using it:
1) How do I detect an exception immediately when it occurs?
2) How do I identify SQL statements that need to be throttled when an exception occurs?
3) How do I extract keywords for SQL statements that need to be throttled to minimize the impact on services when facilitating service recovery?
4) How do I quickly verify whether a throttling operation has been executed properly?
In addition to these questions, you may face some other problems. For example, when an emergency occurs, your DBA cannot be reached or does not have immediate access to a computer; there is too much information to analyze, or operation errors are introduced during troubleshooting.
Therefore, you need to automate the whole process including detecting exceptions, locating abnormal SQL statements, throttling SQL statements, and tracking and rolling back operations.
The automated SQL throttling solution was then developed and has been running on Alibaba Group's internal databases for more than two years. It was officially made available to the public as part of an Alibaba Cloud product in February 2020. To try this feature, go to the Alibaba Cloud Database Autonomy Service (DAS) product page.
The following diagram shows the overall implementation process:
This module is enabled by default for hosts and engines of ApsaraDB Relational Database Service (RDS) instances. These performance metrics include CPU usage, input/output operations per second (IOPS), queries per second (QPS), and active sessions. These real-time metrics serve as the basis for all subsequent analysis and actions.
This module uses historical performance data of a database instance to train a machine learning model offline. Then it uses this model to detect exceptions based on real-time performance metrics. This method detects exceptions much sooner, before receiving the threshold-based alerts. This module will be described in detail later in another article.
This module subscribes to exception events of a database instance and collects session information the moment when an exception occurs. Then it analyzes the collected information together with statistics of all the SQL statements in SQL performance_schema (generated through SQL audit) to find the root cause of the exception.
We categorize common root causes into four types:
When the root cause of an instance exception falls into the second or third type of SQL statements described above, DAS automatically extracts keywords from problematic SQL statements to throttle these SQL statements. This operation requires user authorization. The biggest challenge is to extract the right keywords from SQL statements to implement precise throttling without unnecessarily interrupting other services.
If an abnormal SQL statement that needs to be throttled is detected, the next step is to extract keywords from this SQL statement. Ideally, the keywords should be unique, and only the abnormal SQL statements are throttled without affecting other SQL statements. DAS supports two throttling modes: SQL template-based throttling and SQL text-based throttling. Do not confuse them with each other.
For SQL template-based throttling, if an SQL statement includes ID information contained in the template, the ID information has a higher priority. Such ID information includes the SQL ID automatically generated by the database middleware based on a template, and the hint information added to the SQL template by the developer.
The benefit of using an ID as a keyword is that it helps to ensure the template is unique and the SQL statements based on other templates are not affected. The drawback is that if a problematic SQL statement does not include ID information (for example, when it is manually executed from a command line), this SQL statement can still be executed when a throttling rule against the corresponding ID is enabled.
If an SQL statement does not include template ID information, text information needs to be extracted. Assume that we have obtained an SQL template during the analysis process. As shown in the following snippet, Template 1 and Template 2 can be respectively obtained through a calculation based on SQL text 1 and SQL text 2. To throttle SQL statements based on Template 1, the most comprehensive keywords that we can use are select, id, name, age, from, students, where, and name.
The benefit of using these keywords when throttling SQL statements is that all SQL statements that contain these keywords are throttled regardless of the connection mode that was used to send these SQL statements. The drawback is that some SQL statements may be mistakenly throttled. For example, Template 2 includes all keywords of Template 1.
/* SQL 文本1 */ select id,name,age from students where name='张三'; /* SQL 模板 1 */ select id,name,age from students where name = ? /* SQL 文本2 */ select id,name,age from students where name='张三' and sid='唯一ID'; /*SQL 模板 2*/ select id,name,age from students where name=? and sid=?
When an abnormal SQL statement that can be optimized is identified through root cause analysis, the SQL statement is also sent to the auto-optimization module when throttling is enabled, for example, to automatically create an index in the corresponding table. This module will be described in detail later in another article.
Continuous tracking is performed after auto-throttling. If the traffic sent to the database is not reduced or is not reduced as much as expected, DAS automatically rolls back the throttling operation and starts another round of root cause analysis.
This feature has been running steadily for more than two years on Alibaba Group's databases. It detects exceptions within a minute post their occurrence, identifies the root causes within 5 minutes, and recovers the service within 10 minutes, effectively reducing database faults.
This feature is now available in Alibaba Cloud DAS.
We will post a series of articles on a weekly basis to introduce Alibaba Cloud DAS features in detail, such as AutoScale, exception detection, auto-SQL optimization, workload-based SQL review, intelligent stress testing, and intelligent parameter tuning. Please stay tuned.
Visit Alibaba Cloud Database Autonomy Service (DAS) to learn more about its pricing and capabilities.
Alibaba Cloud New Products - June 1, 2020
ApsaraDB - June 29, 2020
ApsaraDB - July 8, 2020
ApsaraDB - June 27, 2022
Alibaba Cloud New Products - June 1, 2020
Alibaba Cloud New Products - June 1, 2020
Self-driving Database Platform: Self-repair, Self-optimization, and Self-securityLearn More
Fully managed and less trouble database servicesLearn More
Migrating to fully managed cloud databases brings a host of benefits including scalability, reliability, and cost efficiency.Learn More
Mitigate the scalability problem of single machine relational databases for large-scale online databases.Learn More
More Posts by Alibaba Cloud New Products