Database Autonomy Service (DAS) helps enterprises reduce database administration costs by up to 90% and operations and maintenance (O&M) risks by 80%, so that you can focus on business innovation and rapid business growth. In this topic, Double 11 cases are used to describe the six core autonomy features of DAS: 24/7 real-time anomaly detection, fault self-healing, automatic SQL optimization, automatic parameter tuning, auto scaling, and intelligent stress testing.

24/7 real-time anomaly detection

DAS provides 24/7 real-time anomaly detection by using machine learning algorithms to detect database workload anomalies in real time. Compared with traditional threshold-based alerting, in which failures are detected only after they occur, this mechanism detects database anomalies in a timely manner. DAS collects various data, such as hundreds of database performance metrics and full SQL query logs, and processes and stores large amounts of this data both online and offline. Machine learning and prediction algorithms trained on this data implement continuous model training, real-time model prediction, and real-time anomaly detection and analysis for database instances. Compared with the traditional rule-based and threshold-based method, the real-time anomaly detection feature brings the following benefits:
  • Wide detection scope. In addition to monitored metrics, DAS monitors items such as SQL statements, logs, and locks.
  • Near-real-time detection. Anomalies are detected in near real time, whereas in the traditional method anomalies are detected only after they occur.
  • AI-driven detection that is triggered by anomalies. The 24/7 real-time anomaly detection feature is driven by anomalies rather than by faults.
  • Identification of periodic anomalies, adaptation to different service characteristics, and prediction capability.

The anomaly detection feature uses multiple time series characteristics, such as glitches, periodicity, trends, and mean offsets, to accurately and automatically identify general workload anomalies. After an anomaly is identified, the feature triggers root cause-based global diagnostics and analysis, followed by failure recovery and optimization.
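To make the mean-offset case concrete, the following is a minimal sketch of one way such an anomaly can be flagged with a rolling z-score. This is an illustration only, not the DAS algorithm; the window size and threshold are arbitrary example values.

```python
# Illustrative sketch: flagging a mean-offset anomaly in a metric series
# with a rolling z-score. DAS's production models are ML-based and richer;
# the window and threshold here are arbitrary choices for the example.

def detect_mean_offset(series, window=5, threshold=3.0):
    """Return indices where a point deviates from the trailing window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mean = sum(hist) / window
        var = sum((x - mean) ** 2 for x in hist) / window
        std = var ** 0.5 or 1e-9  # avoid division by zero on a flat history
        if abs(series[i] - mean) / std > threshold:
            anomalies.append(i)
    return anomalies

# A flat CPU-utilization series with a sudden sustained jump at index 8.
cpu = [30, 31, 30, 29, 30, 31, 30, 30, 95, 96, 97]
print(detect_mean_offset(cpu))  # [8]
```

A real detector would also model periodicity and trend, as described above, rather than only deviation from a local mean.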

Automatic self-healing

The 24/7 real-time anomaly detection feature ensures that database instance anomalies are detected in real time. DAS then automatically analyzes root causes, stops operations that may adversely affect the system, or repairs the system, so that your database recovers automatically and the impact on your business is reduced. The following case study describes automatic SQL throttling during Double 11.

At 12:31:00 on November 5, 2020, the number of active sessions and the CPU utilization of a database instance managed by DAS surged. At 12:33:00, the DAS anomaly detection center confirmed that the increase was caused by a database exception rather than transient jitter. This exception triggered automatic SQL throttling and root cause diagnostics. At 12:34:00, the diagnostics were complete, and two SQL statements that caused the exception were identified in the diagnosis result. Automatic SQL throttling was initiated immediately after the statements were detected, and the number of active sessions began to decrease. After the problematic SQL statements that were already running finished executing, the number of active sessions recovered in a short time and the CPU utilization returned to a normal value. This entire process meets the 1-5-10 requirement of the self-healing capability: anomalies are detected within 1 minute, located within 5 minutes, and handled within 10 minutes.
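A throttling step like the one in this case has to pick which statements to restrict. The following sketch shows one plausible selection heuristic: throttle the smallest set of statements that accounts for most of the CPU load. The field names and the 60% threshold are hypothetical, not the DAS implementation.

```python
# Illustrative sketch (not the DAS implementation): given per-SQL statistics
# collected during an anomaly, pick the statements responsible for most of
# the load as throttling candidates. Field names are hypothetical.

def throttle_candidates(sql_stats, cpu_share_threshold=0.6):
    """Sort statements by CPU time and return the smallest prefix that
    accounts for at least `cpu_share_threshold` of total CPU."""
    total = sum(s["cpu_ms"] for s in sql_stats)
    picked, covered = [], 0.0
    for stmt in sorted(sql_stats, key=lambda s: s["cpu_ms"], reverse=True):
        picked.append(stmt["sql_id"])
        covered += stmt["cpu_ms"] / total
        if covered >= cpu_share_threshold:
            break
    return picked

stats = [
    {"sql_id": "Q1", "cpu_ms": 52000},  # e.g. an unindexed full-table scan
    {"sql_id": "Q2", "cpu_ms": 31000},  # e.g. an expensive join
    {"sql_id": "Q3", "cpu_ms": 9000},
    {"sql_id": "Q4", "cpu_ms": 8000},
]
print(throttle_candidates(stats))  # ['Q1', 'Q2']
```

With these example numbers, two statements cover 83% of the CPU time, matching the shape of the case study in which two statements were throttled.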

Automatic SQL optimization

DAS continuously performs SQL review and optimization for databases based on global workloads and actual business scenarios. In our optimization experience, about 80% of database issues can be solved by SQL optimization. However, SQL optimization is a complicated process that requires expert database knowledge and experience, and because the SQL workload keeps changing, it is also a time-consuming and heavy task. This makes SQL optimization a demanding job with high expertise requirements and costs. DAS works like a professional DBA to take care of your databases around the clock, which makes SQL optimization more reliable. In addition, compared with the traditional method, the SQL diagnostics capability of DAS has the following technical characteristics:
  • The SQL diagnostics capability uses an external cost-based model to provide index and statement rewriting recommendations and to identify performance bottlenecks. This eliminates the defects of the rigid rule-based traditional method, such as inappropriate recommendations and the inability to quantify performance improvements.
  • The capability is validated against a formal signature database of test cases, feedback that is automatically extracted from online use cases, and the diverse application scenarios of Alibaba.
  • SQL optimization is implemented based on the global workload and its characteristics, such as the execution frequency and read/write ratio of SQL statements. This minimizes the defects of optimization that considers only part of the workload.
The following example shows a case study of automatic SQL optimization that was implemented during Double 11. On November 7, DAS detected a load anomaly caused by slow SQL statements by using the load anomaly detection feature. This anomaly automatically triggered a closed loop for SQL optimization. After the optimized SQL statements were published, the optimization effect was tracked for 24 consecutive hours to evaluate the optimization benefits, and the effect was significant. Before the optimization, the average number of scanned rows was 148,889.198 and the average response time (RT) was 505.561 milliseconds. After the optimization, the average number of scanned rows was 12.132, about one ten-thousandth of the value before the optimization, and the average RT decreased to 0.471 milliseconds, about one thousandth of the value before the optimization.
  • Average RT and the number of scanned rows before the automatic SQL optimization
  • Average RT and the number of scanned rows after the automatic SQL optimization
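The improvement ratios quoted in the case study can be reproduced directly from the reported averages:

```python
# Reproducing the improvement ratios from the averages reported in the
# case study above (values taken from the text, units: rows and ms).

before = {"scanned_rows": 148889.198, "rt_ms": 505.561}
after = {"scanned_rows": 12.132, "rt_ms": 0.471}

rows_ratio = after["scanned_rows"] / before["scanned_rows"]
rt_ratio = after["rt_ms"] / before["rt_ms"]

print(f"scanned rows after/before: {rows_ratio:.6f}")  # roughly 1/10,000
print(f"average RT after/before:   {rt_ratio:.6f}")    # roughly 1/1,000
```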

Auto scaling

Alibaba Cloud databases provide a range of computing specifications and storage capacities for you to choose from. When the scale of your business workload changes, your database can be elastically scaled out or in. For cloud-native applications, the database automatically determines the most appropriate specifications based on changes in your business workload, so that the minimum amount of resources is used to provide the capacity that your business requires. The time series forecasting feature of DAS, implemented by using AI, automatically models and forecasts the business pattern and capacity usage of a database. This helps to implement on-demand (or forecast-based) automatic scaling in a timely manner.

The auto scaling feature of DAS implements a complete closed data loop. The loop consists of the following modules: performance data collection, the decision center, algorithm models, specification recommendation, management and execution, and task tracking and evaluation.
  • The performance data collection module collects real-time performance data of instances, such as various performance metrics, specification configuration information, and information about running sessions.
  • The decision center module provides a global view based on information such as current performance data and the instance session list, so that global autonomy is implemented based on root causes. For example, DAS may implement SQL throttling to solve an issue of insufficient computing resources; if the trend shows that business traffic is genuinely surging, the auto scaling process proceeds instead.
  • The algorithm-model module is the core of the DAS auto scaling service. It detects business load anomalies and recommends capacity specifications for database instances. This solves the core issues, such as the difficulty of selecting the scaling time, scaling mode, and computing specifications.
  • The specification recommendation and validation module generates specific recommended specifications and checks whether the recommended specifications are applicable to the deployment types and actual running environments of database instances. This module also repeatedly checks whether the recommended specifications can be purchased in the current region. This ensures that the recommended specifications can be used on the management side.
  • The management and execution module distributes and implements tasks by using the generated recommended specifications.
  • The status tracking module measures and tracks performance changes of database instances before and after the specifications are changed.
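The hand-off between throttling and scaling described for the decision center can be sketched as a small decision function. The field names, thresholds, and trend labels below are hypothetical illustrations, not the DAS API.

```python
# Illustrative sketch of the decision-center logic described above.
# Thresholds and labels are hypothetical, not DAS internals.

def scaling_decision(cpu_pct, anomaly_confirmed, traffic_trend):
    """traffic_trend: 'rising', 'flat', or 'falling' from the forecast model."""
    if not anomaly_confirmed or cpu_pct < 80:
        return "no_action"
    if traffic_trend == "rising":
        # Genuine business growth: hand off to specification recommendation.
        return "scale_out"
    # Load is high but traffic is not growing: likely bad SQL, throttle first.
    return "throttle"

print(scaling_decision(95, True, "rising"))   # scale_out
print(scaling_decision(95, True, "flat"))     # throttle
print(scaling_decision(40, False, "rising"))  # no_action
```

The point of the sketch is the ordering: root cause analysis decides between throttling and scaling before any specification is recommended, validated, and executed by the downstream modules.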

The following case study describes auto scaling that was implemented during Double 11. For a PolarDB instance that was connected to DAS, the business traffic of the user kept increasing, and the CPU utilization of the instance rose and remained high. DAS accurately identified the anomaly by using the auto scaling algorithm and automatically added two read-only nodes to the instance, which decreased the CPU utilization. After the CPU utilization remained in this state for two hours, the instance traffic continued to increase and the auto scaling feature was triggered again. This time, the feature upgraded the instance specification from 4 cores and 8 GB to 8 cores and 16 GB. After that, the instance ran as expected for more than 10 hours. This ensured normal operation during peak hours.

Intelligent stress testing

The intelligent stress testing feature of DAS allows you to evaluate the required database specification and capacity before you deploy your service to the cloud or before business promotions. The auto scaling capability allows you to automatically trigger scale-in and scale-out based on the specified performance threshold of your database or the built-in intelligent policies of DAS. This way, the workload of specification evaluation and management is decreased.

Most traditional stress testing solutions are implemented by using existing stress testing tools, such as Sysbench and TPC-C. The biggest problem is that the SQL statements that correspond to these stress testing tools greatly differ from those in actual business scenarios. The stress testing results cannot reflect the actual performance and stability of your business. The intelligent stress testing feature of DAS is implemented by using the workload of the actual business. Therefore, the stress testing results reflect the performance and stability changes of the services for different workloads. For this purpose, the intelligent stress testing feature must overcome the following challenges:
  • Generating SQL statements when full collection is impractical: long-term stress testing, such as 24/7 testing for evaluating business stability, must be possible even when the complete set of SQL statements cannot be collected, because SQL statement collection incurs time and storage costs. Given only a sample of SQL statements, DAS needs to generate statements that match your business.
  • Playing back instance traffic with concurrent threads: DAS must reproduce the concurrency of the actual business, and it provides playback speed options, such as 2x and 10x, as well as peak stress testing. The speed 2x indicates that the playback is two times faster than the original speed, and 10x indicates ten times faster.

DAS automatically learns the business model to generate realistic business workloads for the entire stress testing period. In addition, DAS provides more stress testing scenarios to help you overcome challenges in scenarios such as business promotions and database selection.
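The speed-multiplied playback described above (2x, 10x) amounts to replaying a captured trace while compressing the original inter-arrival gaps. The following is a minimal sketch under that assumption; `execute` is a stand-in callback for a real database client, not a DAS API.

```python
# Illustrative sketch of speed-multiplied traffic playback (2x, 10x):
# replay captured queries while compressing the original arrival gaps.
# `execute` is a placeholder for a real database client call.

import time

def replay(trace, execute, speed=1.0):
    """trace: list of (offset_seconds, sql) captured from production.
    At speed=2.0 the whole trace is replayed in half the original time."""
    start = time.monotonic()
    for offset, sql in trace:
        target = offset / speed  # when this query is due at the chosen speed
        delay = target - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        execute(sql)

executed = []
trace = [(0.0, "SELECT 1"), (0.2, "SELECT 2"), (0.4, "SELECT 3")]
replay(trace, executed.append, speed=10.0)  # ~0.4 s of traffic in ~0.04 s
print(executed)
```

A production implementation would additionally run many such loops on concurrent threads to match the captured session concurrency, which this single-threaded sketch omits.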

Automatic parameter tuning

Databases have hundreds of parameters and cover a wide range of business scenarios, so it is difficult to manually tune the parameters to their optimal settings. To solve this problem, DAS combines machine learning and intelligent stress testing to automatically recommend the optimal parameter template for each database instance. Common databases deployed on the cloud, such as MySQL and PolarDB, have hundreds of parameters, and each parameter can take tens to tens of thousands or even hundreds of thousands of values. Configuring parameters is therefore equivalent to searching a high-dimensional space for values that optimize objectives such as transactions per second (TPS) or latency, for example, increasing TPS while decreasing latency. DBAs typically configure parameters based on experience or use default values. However, the optimal parameters vary by workload and hardware, so even experienced DBAs cannot guarantee that their configurations are optimal. In addition, many small and medium-sized enterprises that use cloud services do not have O&M personnel and are unable to configure and tune parameters at all. Therefore, database parameter configuration faces the following challenges:
  • Experience-based parameter tuning cannot ensure that the resulting parameters are optimal.
  • Workloads of cloud services are diverse, and the optimal parameters vary by workload.
  • For heterogeneous hardware specifications, the optimal parameters vary by specification.

DAS treats parameter tuning as a black-box optimization problem and implements iterative learning by using efficient machine learning techniques. DAS increases TPS by 15% to 55% for different workloads. When tuning from scratch, the entire process takes about 100 iteration steps over three to five hours. DAS also learns the characteristics of various workloads, hardware specifications, and parameters offline by using the rich workload data and hardware infrastructure of Alibaba Group. Then, DAS matches against this partially ordered set and applies meta learning to generate online models that fit a specified workload. After a few iterations, the parameter values for that workload can be learned; ideal values are generated in about 10 to 30 steps within less than one hour.
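The black-box loop has a simple shape: propose a configuration, measure the objective under stress testing, keep the best. The toy sketch below uses plain random search against a synthetic TPS function purely to show that loop; DAS itself uses meta learning over Alibaba's workload corpus, and the parameter names and objective here are invented for the example.

```python
# Illustrative sketch of black-box parameter tuning as an iterative
# propose-measure-keep loop. The objective, parameter names, and random
# search strategy are toy stand-ins, not the DAS algorithm.

import random

def synthetic_tps(buffer_pool_mb, io_threads):
    # Toy objective: TPS peaks at buffer_pool_mb=4096, io_threads=8.
    return 10000 - abs(buffer_pool_mb - 4096) - 200 * abs(io_threads - 8)

def tune(objective, iterations=100, seed=0):
    """Random search: sample configurations and keep the best-scoring one.
    In practice each evaluation would be a stress-test run, not a function call."""
    rng = random.Random(seed)
    best_cfg, best_tps = None, float("-inf")
    for _ in range(iterations):
        cfg = {"buffer_pool_mb": rng.randrange(256, 16384),
               "io_threads": rng.randrange(1, 65)}
        tps = objective(**cfg)
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
    return best_cfg, best_tps

cfg, tps = tune(synthetic_tps)
print(cfg, tps)
```

The reason meta learning matters is visible even in this toy: each "evaluation" is really a stress-test run costing minutes, so reducing 100 blind iterations to 10 to 30 informed ones is what brings tuning under one hour.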