In our previous blog post, we provided a brief introduction to Intelligent Query Routing. This time, we shall delve into the technical details of it.
Data warehouses, as core data infrastructures, are foundational for many enterprise applications. While these systems are designed for high-performance and scalability, a recurring challenge remains: managing memory resources under heavy query workloads. Memory-Overloading (MO) Queries, can cause a common Out-of-Memory (OOM) exception, which severely impact system stability and performance. In a typical business environment, mixed workloads like large ETL tasks, intensive queries for BI reports, and ad-hoc data explorations can push database memory to its limits, leading to OOM errors. When OOM occurs, the impact on business is significant:
Traditional methods for resolving these issues, such as manually scaling instances or adjusting configurations, are often insufficient. These solutions lack the elasticity to handle unpredictable workloads, leading to slow performance due to resource contention, and they require constant, manual intervention. Simply put, traditional approaches can't keep up with the dynamic needs of modern data workloads.
To address this critical issue, AnalyticDB for MySQL [1] introduces the new intelligent query routing capability. This solution is designed to fundamentally change how you handle memory overloads, providing an elastic, smart, and worry-free experience for your database operations.

Figure 1. The intelligent query routing framework of AnalyticDB for MySQL
As shown in Figure 1, query routing provides users the ability to deliver specific queries from an interactive resource group to another designated resource group. Query routing operates between the query optimizer and executor, intercepting queries before execution to predict potential memory overload. Large queries can cause resource contention, affecting other queries and potentially leading to system instability. To avoid this, you can use the query routing feature to send these queries to a dedicated resource group, ensuring they don't interfere with other workloads. This feature is particularly useful for the following scenarios: 1) Queries that consume a lot of memory and are prone to OOM errors. 2) Ad-hoc queries that occasionally scan large amounts of data. 3) Heavy ETL (Extract, Transform, Load) queries that are less sensitive to execution time.
Previously AnalyticDB for MySQL provides users the ability to manually route queries to different different resource groups. Users define specific rules in Workload Management (WLM) based on their own application characteristics, for example:
While manual routing offers the ability to precisely control routing rules and resource management, its main drawbacks are the needs of deep workload understanding and complicated case-by-case tuning, which are both time-consuming and labor-intensive. Furthermore, it cannot effectively handle new or ad-hoc workloads, potentially leading to performance and stability issues during unexpected peaks. As the business grows, the maintenance costs become prohibitively high.
Automated routing is designed for hands-free optimization. The system uses built-in, pre-defined logic to automatically route problematic queries. The system intelligently detects and automatically routes any query that fails due to insufficient memory (e.g., an OOM error). The primary advantage of automated routing is simplicity. It’s a proactive that uses the system's own intelligence to handle common performance bottlenecks without any manual intervention.
Existing query routing solutions primarily relied on ruled-based query routing, which is a preset, static distribution mechanism. Intelligent query routing is a major upgrade to this traditional ruled-based routing, as it no longer relies solely on fixed rules. Instead, it integrates AI technologies. The table below provides a detailed comparison of the two routing solutions.
| Feature | Ruled-based Query Routing | Intelligent Query Routing |
| Operating Principle | Relies on preset, static rules, such as: ● If a query's peak memory usage over 10 GB. |
Based on AI models that predicts a query's risk in real time, making dynamic decisions based on historical data. |
| Decision Basis | Fixed, hard-coded rules that struggle to handle complex and evolving query patterns and cluster states. | A multi-dimensional feature vector that combines query plan statistics (e.g., operator cardinalities) and real-time cluster metrics (e.g., memory utilization). |
| Misprediction Correction | Unable to self-correct. Poorly set rules can lead to numerous mispredictions, impacting system performance. | Features a misprediction correction mechanism. The system maintains an index of past failures and automatically flags new, similar queries as high-risk. |
| Adaptability | Low. Rules require manual adjustment and lack adaptability to new query patterns or changes in the cluster environment. | High. It can use a global or localized model for different clusters and dynamically adjust policies with self-tuning quota management to adapt to unique cluster characteristics. |
| Scalability | Poor scalability. Manual modification and maintenance of routing rules are required whenever the business or cluster changes. | Highly scalable. The AI model can automatically learn and adapt to new query patterns, reducing the need for manual intervention. |
Intelligent query routing is an automatic AI-driven query routing solution. Next, we summarize three design goals from the cases of abnormal memory overload we have handled in history to better guide the design of query routing schemes.
Design Goals of Intelligent Query Routing: Through collaborative analysis of practical use cases and in-depth discussions with the AnalyticDB engineering team, we identified three critical design goals to address the challenges of handling MO queries in cloud-native environments. These goals emerged from evaluating real-world scenarios where MO queries caused resource contention, performance degradation, and system instability across heterogeneous database clusters. The following structured approach outlines our prioritized goals:
High Prediction Precision (Goal 1):
Interpretability and Efficiency (Goal 2):
Adaptivity (Goal 3):
This ideal framework balances proactive detection, explainable decision-making, and infrastructure adaptability to ensure both system stability and user productivity in large-scale analytics workloads.
Our solution fundamentally shifts the approach to database management from reactive error handling to proactive risk prevention. By leveraging an intelligent, four-module system, our solution intercepts and mitigates memory overload risks before they can impact your business.
Predictive Analysis: A hybrid model evaluates queries that pass an rule-based filter, ensuring precision and robustness.
By integrating these four modules, our solution automates resource allocation and reduces operational complexity, ensuring your critical queries are always executed successfully. This not only provides database stability but also saves costs.
In this following case study, we reframe two typical database performance issues and their solutions from a customer's business perspective to highlight the critical value of system optimization for business continuity.
Case Study 1: Query Timeouts During Peak Season at an IoT Company. During a peak season, an IoT company's AnalyticDB system experienced query timeouts. The root cause was the large query itself repeatedly failed with an OOM error after running for approximately three minutes each time. This led to a 20% failure rate for other critical queries and caused the large query itself to repeatedly fail, resulting in disruption to core services.
Case Study 2: Query Slowdown and Errors in a Customer's Big Data BI Instance. A media company's database was experiencing frequent query slowdowns and errors, disrupting their daily operations. The problem wasn't a sudden traffic spike, but rather a recurring issue where various types of queries—including those with long execution times, large data scans, or complex JOIN and AGG operations—were repeatedly failing with OOM errors. These failed queries wasted computing resources, causing other tasks to slow down as they waited for resources.
After enabling the intelligent query routing, both of the above cases were well resolved, saving resources and costs. In the first case, query routing module rerouted MO queries into a lower‑impact queue and prevented the cascade of timeouts and wasted retries, avoiding the direct compute waste for that event and roughly an order‑of‑magnitude larger cluster‑level peak waste that would otherwise have occurred.
In the second case, recurring‑OOM scenario, the system automatically recognized multiple high‑risk patterns (long runs, join/aggregate memory spikes, large scans) and routed those queries to safer execution paths. Ten recurring OOM failures that previously consumed compute without producing results were isolated and no longer repeatedly impacted the cluster; this removed the repeated waste from those failed runs and reduced ongoing interference with other workloads.
A broader review of OOM queries reveals a substantial drain on resources. Analysis of our production workloads indicates that intelligent query routing could save up to $1 million annually by avoiding failures and retries caused by memory-overloading queries.
Intelligent query routing addresses the root causes of a class of disruptive queries: it detects heavy or high‑risk queries early (long runs, large scans, memory‑intensive joins/aggregations) and routes them away from contention‑sensitive pools. That prevents OOM failures, timeouts, and the cascade of retries that waste CPU, memory and I/O while delivering no results. By reducing peak contention it also improves cluster stability and protects other users’ workloads during busy periods.
For users, this means fewer manual interventions and less firefighting. Engineers no longer need to pause or re‑submit large jobs by hand, nor to constantly tune ad‑hoc rules; queries that would have caused failures are automatically steered into safer execution paths.
From a product perspective, intelligent query routing combines runtime signals and historical behavior to classify risk, applies routing policies or degraded execution paths, and supports safeguards such as whitelisting critical queries and tuning thresholds. It also emits metrics (failure rate, retry count, CPU/I/O impact) so teams can monitor effectiveness and iteratively refine policies.
For implementation details, configuration options, and rollout guidance, see the product documentation: https://www.alibabacloud.com/help/en/analyticdb/analyticdb-for-mysql/user-guide/jobs-delivery
[1] https://www.alibabacloud.com/en/product/analyticdb-for-mysql
Introducing Intelligent Query Routing on AnalyticDB for Always Online Analytics - No Crash!
[Infographic] Highlights | Database New Features in August 2025
ApsaraDB - September 24, 2025
Alibaba Clouder - December 13, 2019
ApsaraDB - April 10, 2024
Alibaba Cloud MaxCompute - December 8, 2020
Data Geek - November 11, 2024
Apache Flink Community China - January 11, 2022
Hologres
A real-time data warehouse for serving and analytics which is compatible with PostgreSQL.
Learn More
Tair
Tair is a Redis-compatible in-memory database service that provides a variety of data structures and enterprise-level capabilities.
Learn More
Personalized Content Recommendation Solution
Help media companies build a discovery service for their customers to find the most appropriate content.
Learn More
Time Series Database (TSDB)
TSDB is a stable, reliable, and cost-effective online high-performance time series database service.
Learn MoreMore Posts by ApsaraDB