Database Autonomy Service (DAS) helps enterprises save up to 90% of database administration costs and reduce operations and maintenance (O&M) risks by 80%. This lets you focus on business innovation and sustain rapid business growth. This topic uses Double 11 cases to describe the six core autonomy features of DAS: 24/7 real-time anomaly detection, fault self-healing, automatic optimization, automatic parameter tuning, auto scaling, and intelligent stress testing.
24/7 real-time anomaly detection
- Wide detection scope. In addition to the monitored metrics, DAS monitors items such as SQL statements, logs, and locks.
- Near-real-time detection. Anomalies are detected in near real time. In the traditional method, anomalies are detected only after the fact.
- AI-based, anomaly-driven detection. The 24/7 real-time anomaly detection feature is driven by anomalies rather than by faults.
- Identification of periodic patterns, adaptation to different service characteristics, and prediction capability.
The anomaly detection feature automatically and accurately identifies common workload anomalies, such as glitches, periodic patterns, trend changes, and mean shifts, by using multiple time series characteristics. After an anomaly is identified, the feature triggers global, root-cause-based diagnostics and analysis, followed by failure recovery and optimization.
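To make two of the anomaly types named above concrete, the following is a minimal sketch of glitch and mean-shift detection on a metric series. It is an illustration only and is not the DAS algorithm: the trailing-window z-score, the median comparison, and every name and default (`find_spikes`, `has_mean_shift`, `window`, `threshold`, `min_sigma`) are assumptions made for this example.

```python
from statistics import mean, median, stdev

def find_spikes(series, window=5, threshold=3.0, min_sigma=1.0):
    """Return indexes whose value lies more than `threshold` sigmas from the trailing window."""
    spikes = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        # Floor sigma so that a perfectly flat history does not flag tiny changes.
        sigma = max(stdev(history), min_sigma)
        if abs(series[i] - mean(history)) / sigma > threshold:
            spikes.append(i)
    return spikes

def has_mean_shift(series, window=5, threshold=3.0, min_sigma=1.0):
    """Compare the median of the latest window with the mean of the window before it."""
    if len(series) < 2 * window:
        return False
    baseline = series[-2 * window:-window]
    recent = series[-window:]
    sigma = max(stdev(baseline), min_sigma)
    # The median keeps a single glitch from being mistaken for a level change.
    return abs(median(recent) - mean(baseline)) / sigma > threshold

# Hypothetical CPU utilization samples (percent), one per sampling interval.
glitch = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
shifted = [20, 21, 19, 20, 22, 40, 41, 39, 40, 42]

print(find_spikes(glitch))       # index of the isolated spike
print(has_mean_shift(glitch))    # a single glitch is not a level change
print(has_mean_shift(shifted))   # a sustained jump is
```

A production detector would also need to model periodicity and trend, which this sketch leaves out.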
The 24/7 real-time anomaly detection feature ensures that database instance anomalies are detected in real time. DAS then automatically analyzes root causes and either stops operations that may adversely affect the system or repairs the system. This helps your database recover automatically and reduces the impact on your business.
Case study: automatic SQL throttling during Double 11
At 12:31:00 on November 5, 2020, the number of active sessions and the CPU utilization of a DAS instance surged. At 12:33:00, the DAS anomaly detection center confirmed that the increase was caused by a database exception instead of jitter. This exception triggered automatic SQL throttling and root cause diagnostics. At 12:34:00, the diagnostics were complete, and the diagnosis result identified two SQL statements that caused the exception. Automatic SQL throttling was initiated immediately after the SQL statements were identified, and the number of active sessions began to decrease. After the problematic SQL statements that were already running finished, the number of active sessions recovered in a short time and the CPU utilization returned to a normal value.
This entire process meets the 1-5-10 requirement of the self-healing capability: anomalies are detected within 1 minute, located within 5 minutes, and handled within 10 minutes.
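The throttling step in this case can be pictured as a concurrency cap applied to every statement that shares a template with a problematic SQL statement. The sketch below is an assumption-laden illustration, not the DAS implementation: the regex-based `sql_template` normalization and the `SqlThrottle` class are hypothetical, and the class is not thread-safe.

```python
import re
from collections import defaultdict

def sql_template(sql):
    """Collapse literals so that statements differing only in values share one template."""
    sql = re.sub(r"'[^']*'", "?", sql)    # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)    # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()

class SqlThrottle:
    """Caps concurrent executions per SQL template (illustrative, not thread-safe)."""

    def __init__(self):
        self.limits = {}                  # template -> max concurrent executions
        self.running = defaultdict(int)   # template -> executions in flight

    def limit(self, sql, max_concurrency):
        self.limits[sql_template(sql)] = max_concurrency

    def try_acquire(self, sql):
        key = sql_template(sql)
        cap = self.limits.get(key)
        if cap is not None and self.running[key] >= cap:
            return False                  # throttled: reject or queue the statement
        self.running[key] += 1
        return True

    def release(self, sql):
        key = sql_template(sql)
        if self.running[key] > 0:
            self.running[key] -= 1

throttle = SqlThrottle()
# Suppose diagnostics identified this statement shape as a root cause.
throttle.limit("SELECT * FROM orders WHERE id = 1", max_concurrency=1)
```

Keying the limit on a template rather than on the exact text means that the same statement with different parameter values is throttled too, while unrelated statements pass through untouched.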
External automatic SQL optimization
- The SQL diagnostics capability uses an external cost-based model to identify performance bottlenecks and to recommend indexes and statement rewrites. This avoids the defects of the traditional rigid rule-based method, such as inappropriate recommendations and the inability to quantify the performance improvement.
- DAS draws on the following: a formal signature database for test cases, feedback information that is automatically extracted from online use cases, and the diversified application scenarios of Alibaba.
- SQL optimization is based on the global workload and its characteristics, such as the execution frequency and read/write ratio of SQL statements. This avoids the defects of optimization that considers only part of the workload.
- Average response time (RT) and the number of scanned rows before automatic SQL optimization
- Average response time (RT) and the number of scanned rows after automatic SQL optimization
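The point about optimizing against the global workload can be illustrated with a small scoring sketch. It assumes a deliberately simple cost model, which is not the DAS cost-based model: each candidate index saves a number of scanned rows per read execution and adds a fixed maintenance cost per write execution. The `Statement` type, `score_index`, and all figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    name: str
    executions: int    # executions in the observation window
    is_write: bool

def score_index(workload, rows_saved, write_penalty):
    """Net benefit: frequency-weighted rows saved on reads minus maintenance cost on writes."""
    score = 0
    for stmt in workload:
        if stmt.is_write:
            score -= stmt.executions * write_penalty
        else:
            score += stmt.executions * rows_saved.get(stmt.name, 0)
    return score

workload = [
    Statement("q_lookup", executions=1000, is_write=False),
    Statement("q_report", executions=5, is_write=False),
    Statement("ins_order", executions=2000, is_write=True),
]

# Index A helps the frequent lookup a little; index B helps the rare report a lot.
score_a = score_index(workload, {"q_lookup": 500}, write_penalty=1)
score_b = score_index(workload, {"q_report": 10000}, write_penalty=1)
```

Judged on a single statement, index B looks better (10,000 rows saved per execution against 500). Weighted by execution frequency and write overhead across the whole workload, index A wins, which is the kind of conclusion that a partial view of the workload would miss.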
Auto scaling
Alibaba Cloud databases offer a range of computing specifications and storage capacities to choose from. When the scale of your business workload changes, your database can be elastically scaled out or in. For cloud native applications, the database automatically determines the most appropriate specifications based on changes in your business workload, so that the capacity your business requires is provided with the minimum amount of resources. The AI-based time series forecasting feature of DAS automatically calculates and forecasts the business model and capacity usage of a database, which makes timely on-demand (or forecast-based) auto scaling possible.
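As a rough picture of forecast-based scaling, the sketch below fits a straight line to recent utilization samples and estimates how many sampling intervals remain before a threshold is crossed. A least-squares linear trend is a stand-in chosen for brevity; it is not the AI-based forecasting that DAS uses, and `fit_trend`, `steps_until`, and the sample values are assumptions.

```python
import math

def fit_trend(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ...; needs at least 2 points."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def steps_until(values, threshold):
    """Intervals from the latest sample until the trend reaches `threshold`, or None."""
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None                       # flat or falling: no crossing is forecast
    crossing = (threshold - intercept) / slope
    return max(0, math.ceil(crossing - (len(values) - 1)))

cpu = [40, 45, 50, 55, 60]                # hypothetical CPU utilization (percent)
lead_time = steps_until(cpu, threshold=80)
```

A scaling action can then be scheduled when the lead time drops below the time that a specification change takes to complete.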
- The performance data collection module collects real-time performance data of instances, such as performance metrics, specification configuration information, and information about running sessions.
- The decision center module derives a global trend from information such as the current performance data and the instance session list, so that global autonomy is implemented based on root causes. For example, DAS implements SQL throttling to resolve insufficient computing resources. If the trend shows that business traffic is surging, the auto scaling process proceeds.
- The algorithm models are the core of the DAS auto scaling service. They detect business load anomalies and recommend capacity specifications for database instances. This addresses core difficulties such as selecting the scaling time, the scaling mode, and the computing specifications.
- The specification recommendation and validation module generates specific recommended specifications and checks whether they are applicable to the deployment types and actual running environments of database instances. This module also double-checks whether the recommended specifications can be purchased in the current region. This ensures that the recommended specifications can be used on the management side.
- The management and execution module distributes and implements tasks by using the generated recommended specifications.
- The status tracking module measures and tracks performance changes of database instances before and after the specifications are changed.
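The hand-off between the decision center and the specification validation module can be sketched as two small functions. This is a simplified illustration under stated assumptions, not the DAS decision logic: the 80% CPU threshold, the action names, and the `(cores, memory_gb)` specification tuples are all hypothetical.

```python
def decide_action(cpu_utilization, traffic_growing, problematic_sql_found,
                  cpu_threshold=0.8):
    """Choose a remedy from the observed symptoms (illustrative rules only)."""
    if cpu_utilization < cpu_threshold:
        return "none"
    if traffic_growing:
        return "scale"            # genuine business growth: add capacity
    if problematic_sql_found:
        return "throttle_sql"     # resource shortage caused by specific statements
    return "diagnose"             # high load without a known cause

def recommend_spec(required_cores, purchasable_specs):
    """Pick the smallest purchasable (cores, memory_gb) spec that covers the need."""
    fitting = [spec for spec in purchasable_specs if spec[0] >= required_cores]
    return min(fitting) if fitting else None

# Specifications assumed to be purchasable in the current region.
specs = [(4, 8), (8, 16), (16, 32)]
```

Returning `None` when no purchasable specification fits mirrors the validation step: a recommendation that cannot be bought in the region should not reach the execution module.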
Case study: auto scaling during Double 11
The business traffic of a PolarDB instance that was connected to DAS kept increasing, and the CPU utilization of the instance rose to a high level. DAS accurately identified the instance anomaly by using the auto scaling algorithm and automatically added two read-only nodes to the instance, which brought the CPU utilization down. After the CPU utilization had stayed at this lower level for two hours, the instance traffic continued to increase and the auto scaling feature was triggered again. This time, it upgraded the instance specification from 4 cores and 8 GB to 8 cores and 16 GB. After that, the instance ran as expected for more than 10 hours, which kept the instance running normally during peak hours.
Intelligent stress testing
The intelligent stress testing feature of DAS allows you to evaluate the required database specification and capacity before you deploy your service to the cloud or before business promotions. The auto scaling capability allows you to automatically trigger scale-in and scale-out based on the specified performance threshold of your database or the built-in intelligent policies of DAS. This way, the workload of specification evaluation and management is decreased.
- Long-term stress testing, such as 24/7 stress testing for evaluating business stability, can be implemented even when a large number of SQL statements cannot be collected. SQL statement collection incurs time and storage costs. When only some SQL statements are provided, DAS needs to generate SQL statements that meet your business requirements.
- Concurrent playback of instance traffic: DAS must keep the concurrency consistent with that of the actual business, and it provides playback speed options, such as 2x and 10x, as well as peak stress testing. At 2x, traffic is played back at two times the normal speed. At 10x, traffic is played back at ten times the normal speed.
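The playback speed options can be illustrated by how a captured traffic trace is rescheduled. The sketch below assumes a hypothetical trace format of `(captured_offset_seconds, session_id, sql)` tuples; it is not the DAS traffic format. Dividing the offsets by the speed factor compresses the timeline, and grouping by session keeps one replay stream per captured session so that concurrency matches the original.

```python
from collections import defaultdict

def replay_schedule(events, speed=1.0):
    """Shift offsets to start at 0 and compress them by `speed` (2.0 means 2x playback)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not events:
        return []
    start = min(offset for offset, _, _ in events)
    return [((offset - start) / speed, session, sql)
            for offset, session, sql in sorted(events)]

def by_session(schedule):
    """Group statements per session; each group would be replayed by its own thread."""
    sessions = defaultdict(list)
    for offset, session, sql in schedule:
        sessions[session].append((offset, sql))
    return dict(sessions)

captured = [
    (10.0, "s1", "SELECT 1"),
    (12.0, "s2", "SELECT 2"),
    (14.0, "s1", "SELECT 3"),
]
schedule_2x = replay_schedule(captured, speed=2.0)
streams = by_session(schedule_2x)
```

In a real replayer, each stream would run in its own worker thread that sleeps until the scheduled offset before it issues the statement.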
DAS automatically learns the business model to generate a realistic business workload that covers the stress testing period. In addition, DAS provides more stress testing scenarios to help you overcome challenges in scenarios such as business promotions and database selection.
Automatic parameter tuning
- Experience-based parameter tuning cannot ensure that parameters are valid.
- Workloads of cloud services are diverse, and parameters vary based on workloads.
- For heterogeneous hardware (specifications), the parameters vary based on hardware specifications.
DAS treats parameter tuning as a black-box optimization problem and implements iterative learning by using efficient machine learning techniques. DAS increases TPS by 15% to 55% for different workloads. The entire process takes about 100 iteration steps and three to five hours. DAS also uses the rich workload data and hardware infrastructure of Alibaba Group to learn offline the characteristics and parameters of various workloads and hardware specifications. Then, DAS matches the partially ordered set and implements meta learning to generate online models that fit the specified workloads. With these models, the parameter values that suit the specified workloads can be learned after a few iterations: the ideal parameter values are generated in about 10 to 30 steps and in less than one hour.
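The black-box framing means that the tuner only observes the measured throughput for a given set of parameter values, with no knowledge of how the database turns parameters into performance. The sketch below shows that loop with plain random search, the simplest black-box optimizer. It is a stand-in for the machine learning techniques that DAS uses, not a description of them. The parameter names, their ranges, and the `toy_tps` function, which replaces a real benchmark run, are hypothetical.

```python
import random

def tune(measure_tps, bounds, iterations=100, seed=0):
    """Random search: sample parameter sets, keep the one with the highest measured TPS."""
    rng = random.Random(seed)
    best_params, best_tps = None, float("-inf")
    for _ in range(iterations):
        candidate = {name: rng.uniform(low, high)
                     for name, (low, high) in bounds.items()}
        tps = measure_tps(candidate)      # in practice: apply parameters, run the workload
        if tps > best_tps:
            best_params, best_tps = candidate, tps
    return best_params, best_tps

def toy_tps(params):
    """Stand-in for a benchmark; peaks at 1000 when buffer_pool_gb=12 and io_threads=8."""
    return (1000
            - 5 * (params["buffer_pool_gb"] - 12) ** 2
            - 10 * (params["io_threads"] - 8) ** 2)

bounds = {"buffer_pool_gb": (1, 16), "io_threads": (1, 16)}
best_params, best_tps = tune(toy_tps, bounds, iterations=100)
```

Each iteration costs one benchmark run, which is why the number of steps matters so much. Starting from a model that was learned offline on similar workloads, as the meta learning approach does, is what reduces the step count from about 100 to between 10 and 30.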