×
Community Blog Intelligent Routing of Memory - Overloading Queries in Serverless Data Warehouses

Intelligent Routing of Memory - Overloading Queries in Serverless Data Warehouses

This article introduces intelligent routing that detects and reroutes memory‑overloading queries in serverless data warehouses.

In our previous blog post, we provided a brief introduction to Intelligent Query Routing. This time, we shall delve into the technical details of it.

Introduction

Data warehouses, as core data infrastructures, are foundational for many enterprise applications. While these systems are designed for high-performance and scalability, a recurring challenge remains: managing memory resources under heavy query workloads. Memory-Overloading (MO) Queries, can cause a common Out-of-Memory (OOM) exception, which severely impact system stability and performance. In a typical business environment, mixed workloads like large ETL tasks, intensive queries for BI reports, and ad-hoc data explorations can push database memory to its limits, leading to OOM errors. When OOM occurs, the impact on business is significant:

  • Service Disruption: Critical data query services are interrupted, stalling business decisions and operations.
  • Resource Waste: Queries consume vast computational resources only to fail, resulting in unnecessary costs.
  • Operational Burden: The operations team faces immense pressure, spending valuable time manually troubleshooting, analyzing logs, and restarting services, a process that is both inefficient and reactive.

Traditional methods for resolving these issues, such as manually scaling instances or adjusting configurations, are often insufficient. These solutions lack the elasticity to handle unpredictable workloads, leading to slow performance due to resource contention, and they require constant, manual intervention. Simply put, traditional approaches can't keep up with the dynamic needs of modern data workloads.

To address this critical issue, AnalyticDB for MySQL [1] introduces the new intelligent query routing capability. This solution is designed to fundamentally change how you handle memory overloads, providing an elastic, smart, and worry-free experience for your database operations.

What is Intelligent Query Routing

image
Figure 1. The intelligent query routing framework of AnalyticDB for MySQL

As shown in Figure 1, query routing provides users the ability to deliver specific queries from an interactive resource group to another designated resource group. Query routing operates between the query optimizer and executor, intercepting queries before execution to predict potential memory overload. Large queries can cause resource contention, affecting other queries and potentially leading to system instability. To avoid this, you can use the query routing feature to send these queries to a dedicated resource group, ensuring they don't interfere with other workloads. This feature is particularly useful for the following scenarios: 1) Queries that consume a lot of memory and are prone to OOM errors. 2) Ad-hoc queries that occasionally scan large amounts of data. 3) Heavy ETL (Extract, Transform, Load) queries that are less sensitive to execution time.

Previously AnalyticDB for MySQL provides users the ability to manually route queries to different different resource groups. Users define specific rules in Workload Management (WLM) based on their own application characteristics, for example:

  • Peak Memory Usage: define a rule based on memory consumption, like routing queries with peak memory usage over 10 GB.
  • Scan Size: the intuition is that workloads processing large amount of data are supposed to consume lots of computing resources

While manual routing offers the ability to precisely control routing rules and resource management, its main drawbacks are the needs of deep workload understanding and complicated case-by-case tuning, which are both time-consuming and labor-intensive. Furthermore, it cannot effectively handle new or ad-hoc workloads, potentially leading to performance and stability issues during unexpected peaks. As the business grows, the maintenance costs become prohibitively high.

Automated routing is designed for hands-free optimization. The system uses built-in, pre-defined logic to automatically route problematic queries. The system intelligently detects and automatically routes any query that fails due to insufficient memory (e.g., an OOM error). The primary advantage of automated routing is simplicity. It’s a proactive that uses the system's own intelligence to handle common performance bottlenecks without any manual intervention.

Ruled-based Query Routing VS. Intelligent Query Routing

Existing query routing solutions primarily relied on ruled-based query routing, which is a preset, static distribution mechanism. Intelligent query routing is a major upgrade to this traditional ruled-based routing, as it no longer relies solely on fixed rules. Instead, it integrates AI technologies. The table below provides a detailed comparison of the two routing solutions.

Feature Ruled-based Query Routing Intelligent Query Routing
Operating Principle Relies on preset, static rules, such as:
● If a query's peak memory usage over 10 GB.
Based on AI models that predicts a query's risk in real time, making dynamic decisions based on historical data.
Decision Basis Fixed, hard-coded rules that struggle to handle complex and evolving query patterns and cluster states. A multi-dimensional feature vector that combines query plan statistics (e.g., operator cardinalities) and real-time cluster metrics (e.g., memory utilization).
Misprediction Correction Unable to self-correct. Poorly set rules can lead to numerous mispredictions, impacting system performance. Features a misprediction correction mechanism. The system maintains an index of past failures and automatically flags new, similar queries as high-risk.
Adaptability Low. Rules require manual adjustment and lack adaptability to new query patterns or changes in the cluster environment. High. It can use a global or localized model for different clusters and dynamically adjust policies with self-tuning quota management to adapt to unique cluster characteristics.
Scalability Poor scalability. Manual modification and maintenance of routing rules are required whenever the business or cluster changes. Highly scalable. The AI model can automatically learn and adapt to new query patterns, reducing the need for manual intervention.

The Intelligent Query Routing Solution

Intelligent query routing is an automatic AI-driven query routing solution. Next, we summarize three design goals from the cases of abnormal memory overload we have handled in history to better guide the design of query routing schemes.

Design Goals of Intelligent Query Routing: Through collaborative analysis of practical use cases and in-depth discussions with the AnalyticDB engineering team, we identified three critical design goals to address the challenges of handling MO queries in cloud-native environments. These goals emerged from evaluating real-world scenarios where MO queries caused resource contention, performance degradation, and system instability across heterogeneous database clusters. The following structured approach outlines our prioritized goals:

  • High Prediction Precision (Goal 1):

    • Combine historical query logs with real-time monitoring data to train lightweight models (such as feature engineering-based classifiers) to recognize MO queries.
  • Interpretability and Efficiency (Goal 2):

    • Develop a visual interface to show the user why the MO query is blocked (for example, "this query involves a full table scan and the expected memory consumption exceeds the threshold").
    • The ideal solution should provide low-latency prediction services to avoid blocking query execution.
  • Adaptivity (Goal 3):

    • Given the significant differences in query patterns across various database clusters, an effective solution must be highly adaptable to each cluster's unique characteristics. This flexibility is crucial for accurately distinguishing between non-MO and MO queries, which in turn significantly reduces the number of mispredictions.

This ideal framework balances proactive detection, explainable decision-making, and infrastructure adaptability to ensure both system stability and user productivity in large-scale analytics workloads.

Key Features

Our solution fundamentally shifts the approach to database management from reactive error handling to proactive risk prevention. By leveraging an intelligent, four-module system, our solution intercepts and mitigates memory overload risks before they can impact your business.

  • Intelligent Identification: Our system is located between the query optimizer and the execution engine. It predicts a query's risk of memory overload by creating a multi-dimensional feature vector, which combines query plan statistics (e.g., operator cardinalities) and real-time cluster metrics (e.g., memory utilization).
  • Predictive Analysis: A hybrid model evaluates queries that pass an rule-based filter, ensuring precision and robustness.

    • Global & Local Models: For optimal performance, clusters with limited data use a global model, while more mature clusters use localized models.
    • Misprediction Correction: To prevent false negatives, the system maintains an index of past failures. If a new query resembles a previous one, it's flagged as high-risk, bypassing the model for an extra layer of protection.
  • Dynamic Routing: Queries flagged as high-risk for memory overload are automatically and transparently rerouted. They are sent to a dedicated, memory-rich serverless cluster instead of the original one. This ensures risk isolation and elastic scalability, preventing a single problematic query from impacting other workloads and providing the necessary resources for successful execution.
  • Self-Tuning Quota Management: This module dynamically adjusts quotas based on a query's prediction confidence and past false negatives. This smart system ensures high-confidence predictions are prioritized while borderline queries are managed under a limited quota, which prevents excessive resource usage and maintains workload stability.

By integrating these four modules, our solution automates resource allocation and reduces operational complexity, ensuring your critical queries are always executed successfully. This not only provides database stability but also saves costs.

Use Cases

In this following case study, we reframe two typical database performance issues and their solutions from a customer's business perspective to highlight the critical value of system optimization for business continuity.

Case Study 1: Query Timeouts During Peak Season at an IoT Company. During a peak season, an IoT company's AnalyticDB system experienced query timeouts. The root cause was the large query itself repeatedly failed with an OOM error after running for approximately three minutes each time. This led to a 20% failure rate for other critical queries and caused the large query itself to repeatedly fail, resulting in disruption to core services.

Case Study 2: Query Slowdown and Errors in a Customer's Big Data BI Instance. A media company's database was experiencing frequent query slowdowns and errors, disrupting their daily operations. The problem wasn't a sudden traffic spike, but rather a recurring issue where various types of queries—including those with long execution times, large data scans, or complex JOIN and AGG operations—were repeatedly failing with OOM errors. These failed queries wasted computing resources, causing other tasks to slow down as they waited for resources.

After enabling the intelligent query routing, both of the above cases were well resolved, saving resources and costs. In the first case, query routing module rerouted MO queries into a lower‑impact queue and prevented the cascade of timeouts and wasted retries, avoiding the direct compute waste for that event and roughly an order‑of‑magnitude larger cluster‑level peak waste that would otherwise have occurred.

In the second case, recurring‑OOM scenario, the system automatically recognized multiple high‑risk patterns (long runs, join/aggregate memory spikes, large scans) and routed those queries to safer execution paths. Ten recurring OOM failures that previously consumed compute without producing results were isolated and no longer repeatedly impacted the cluster; this removed the repeated waste from those failed runs and reduced ongoing interference with other workloads.

A broader review of OOM queries reveals a substantial drain on resources. Analysis of our production workloads indicates that intelligent query routing could save up to $1 million annually by avoiding failures and retries caused by memory-overloading queries.

Conclusion

Intelligent query routing addresses the root causes of a class of disruptive queries: it detects heavy or high‑risk queries early (long runs, large scans, memory‑intensive joins/aggregations) and routes them away from contention‑sensitive pools. That prevents OOM failures, timeouts, and the cascade of retries that waste CPU, memory and I/O while delivering no results. By reducing peak contention it also improves cluster stability and protects other users’ workloads during busy periods.

For users, this means fewer manual interventions and less firefighting. Engineers no longer need to pause or re‑submit large jobs by hand, nor to constantly tune ad‑hoc rules; queries that would have caused failures are automatically steered into safer execution paths.

From a product perspective, intelligent query routing combines runtime signals and historical behavior to classify risk, applies routing policies or degraded execution paths, and supports safeguards such as whitelisting critical queries and tuning thresholds. It also emits metrics (failure rate, retry count, CPU/I/O impact) so teams can monitor effectiveness and iteratively refine policies.

For implementation details, configuration options, and rollout guidance, see the product documentation: https://www.alibabacloud.com/help/en/analyticdb/analyticdb-for-mysql/user-guide/jobs-delivery

References

[1] https://www.alibabacloud.com/en/product/analyticdb-for-mysql

0 1 0
Share on

ApsaraDB

548 posts | 174 followers

You may also like

Comments