Use the paging cache feature to improve the performance of deep paged queries - AnalyticDB

AnalyticDB for MySQL provides the paging cache feature. You can use the paging cache feature to improve the efficiency of paged queries that involve large amounts of data in the LIMIT, OFFSET, and ORDER BY clauses. This helps resolve the performance issues and resource bottlenecks of deep paged queries. This topic describes how to use the paging cache feature to optimize the performance of paged queries.

Prerequisites

An AnalyticDB for MySQL cluster of V3.2.3 or later is created.

Note

To view and update the minor version of an AnalyticDB for MySQL cluster, log on to the AnalyticDB for MySQL console and go to the Configuration Information section of the Cluster Information page.

Overview

Performance issues caused by deep paged queries

Users on e-commerce platforms want to sort products based on the sales volume or comprehensive scores to prioritize quality products. In most cases, applications display results by using the pagination method to accelerate the page response and prevent performance issues caused by loading large amounts of data at a time.

A common method in databases is to sort data based on a column and perform paged queries by using the LIMIT and OFFSET clauses. The LIMIT clause specifies the number of data entries displayed per page, and the OFFSET clause specifies the start offset of data per page. For example, you want to display 100 data entries per page. To query data on the first page, execute the following statement: SELECT * FROM t_order ORDER BY id LIMIT 0, 100. To query data on page 10,001, execute the following statement: SELECT * FROM t_order ORDER BY id LIMIT 1000000, 100.

When you perform paged queries on large amounts of data, a deep paged query may significantly degrade the database performance due to the overheads of global sorting and frequent lookup operations. This issue is more complex in AnalyticDB for MySQL. To reduce the amount of data shuffled between different nodes, each storage node performs a topN calculation and the Gather node aggregates the execution results of all storage nodes to obtain the final result set.

To ensure that the final result set is correct, the system must assign the rewritten SQL statements to each storage node for execution. For example, to query the data entries on page 10,001, each storage node receives the following SQL statement: SELECT * FROM t_order ORDER BY id LIMIT 0, 1000100. As a result, the Gather node must sort 1,000,100 × Number of storage nodes data entries to obtain 100 entries.

The amount of sorted data linearly increases with the pagination depth, and the performance sharply degrades. This imposes a significant burden on memory and CPU resources and increases the risk of out of memory (OOM) errors. This is the performance issue caused by deep paged queries.

To resolve the performance issues caused by deep paged queries, AnalyticDB for MySQL provides the paging cache feature. The first time you initiate a paged query, the system queries data from the database and stores the query results in a temporary cache table. When you initiate subsequent paged queries that share the same SQL pattern, the system reads data from the temporary cache table to prevent repeated sorting operations. This effectively resolves the performance issue caused by deep paged queries and prevents OOM errors caused by the ORDER BY clause. AnalyticDB for MySQL automatically clears the cached data that is no longer required based on the eviction policy to ensure that resources are properly utilized.

The paging cache feature is suitable for the following scenarios:

Export of large amounts of data
When you export large amounts of data, the client may become unstable. In this case, the system performs paged queries to pull results in batches. If you use the LIMIT and OFFSET clauses, the distributed environment does not guarantee that the data is processed in a specific order. Each query on the data of the same page may return different results. In this case, you must use the ORDER BY clause to prevent duplicate or missing data. The paging cache feature allows you to eliminate the unnecessary ORDER BY clause. This significantly improves query performance and decreases the risk of OOM errors.
Paged query of full data
The full query results are cached in the hot storage space of AnalyticDB for MySQL. This ensures data availability and significantly increases the query speed. You can use the paging cache feature to obtain and display information by pages.
Concurrency control for business reports
If multiple queries are concurrently initiated on the same report, the independent requests increase workloads on the cluster and result in data inconsistency. You can use the paging cache feature to perform a single query within a time window to improve query performance and cluster stability.

Usage

Configure a cache database

Before you use the paging cache feature to cache query results, we recommend that you specify a database to store temporary cache tables for paged queries. If you do not specify a database, temporary cache tables are stored in the connected internal database. When you enable the paging cache feature, a temporary cache table is automatically created.

Note

You cannot specify an external database as the cache database.

For example, you can specify the paging_cache database as the cache database. You can also specify other databases.

SET ADB_CONFIG PAGING_CACHE_SCHEMA=paging_cache;

Enable the paging cache feature for paged queries

If multiple paged queries share the same SQL pattern, you can add a hint to the SQL statements to improve performance. The first time you initiate a paged query that contains a hint, the system creates a temporary cache table to store the paged query results. When you initiate subsequent paged queries that share the same SQL pattern, you can add the same hint to the SQL statements. This way, the system reads data from the temporary cache table without the need for repeated database access.

Limits

After you eliminate the LIMIT and OFFSET clauses, the number of data entries on which you want to perform paged queries must be less than 100 million.

Note

If the number of data entries to be queried exceeds 100 million, Submit a ticket to adjust the upper limit of the number of data entries.

Methods to enable the paging cache feature

You can use one of the following hints to enable the paging cache feature:

paging_id=<paging_id>
The paging_id parameter specifies the cache table that is created for a set of paged queries that share the same SQL pattern but involve different values in the LIMIT and OFFSET clauses. The client must generate a unique ID to uniquely identify a cache table for a set of paged queries.
- If the specified paging_id parameter does not exist, a cache table is created.
- If the queried paging_id parameter exists and the SQL pattern involved in the query matches the SQL pattern that corresponds to the paging_id parameter, the query hits the cache.
- If the queried paging_id parameter exists but the SQL pattern involved in the query does not match the SQL pattern that corresponds to the paging_id parameter, an error occurs. You can check whether the paging_id parameter is already used. For more information, see the "Query information about cache tables" section of this topic.
Note
The paging_id parameter must meet the following naming conventions: The parameter name must be 1 to 127 characters in length and can contain letters, digits, and underscores (_). The parameter name must start with a letter or an underscore (_). The parameter name cannot contain single quotation marks ('), double quotation marks ("), exclamation points (!), or spaces. The parameter name cannot be an SQL reserved word.
For example, you can set the paging ID to paging123 for the result of a set of paged queries that use the paging cache feature.
```
/*paging_id=paging123*/ SELECT * FROM t_order ORDER BY id LIMIT 0, 100;
```
paging_cache_enabled=true
This method does not require frequent modifications of hints. The server uses the SQL pattern that eliminates the LIMIT and OFFSET clauses to generate a paging ID to identify a set of paged queries.
The dependency on the SQL pattern matching limits the flexibility of this method. If no cache table exists for the paged queries that share the same SQL pattern excluding the LIMIT and OFFSET clauses, a cache table is created. If a cache table exists for the paged queries that share the same SQL pattern excluding the LIMIT and OFFSET clauses, the cache table is used to query data.
Sample statement:
```
/*paging_cache_enabled=true*/ SELECT * FROM t_order ORDER BY id LIMIT 0, 100;
```

If a cache table fails to be created, you must clear the cache data of paged queries and re-initiate a paged query to create a cache table.

Query information about cache tables

You can query information about all cache tables of paged queries within the current cluster, such as the paging ID, cache size, and cache status.

SELECT * FROM INFORMATION_SCHEMA.KEPLER_PAGING_CACHE_STATUS_MERGED;

Specify the maximum number of cache tables

You can specify the maximum number of cache tables within the current cluster. Default value: 100.

SET ADB_CONFIG PAGING_CACHE_MAX_TABLE_COUNT=100;

If you attempt to create a cache table when the total number of cache tables exceeds the limit, an error occurs. Sample error message:

Paging cache count exceeds the limit. Please clean up unused caches or increase the related parameter using SET ADB_CONFIG PAGING_CACHE_MAX_TABLE_COUNT=xxx.

Clear the cache data that is no longer required or increase the maximum number of cache tables based on the error message.

Specify a validity period for a cache table

You can specify a validity period for a cache table. Unit: seconds. After the validity period ends, the cache table becomes invalid. When you initiate subsequent paged queries that share the same SQL pattern, the system performs database access and updates the cache table. In most cases, you can specify a validity period for a cache table in report concurrency control scenarios.

For example, you can use the /*paging_cache_enabled=true, paging_cache_validity_interval=300*/ hint to allow a cache table to be valid within 300 seconds after creation. Sample statement:

/*paging_cache_enabled=true, paging_cache_validity_interval=300*/ SELECT * FROM t_order ORDER BY id LIMIT 0, 100;

Clear the cache data of paged queries

When you use the paging cache feature to cache paged query results, the cache data is cached in the hot storage space of AnalyticDB for MySQL. If specific cache data is no longer required, you can clear the cache data to increase the available storage.

Manual clearing

Cache data of paged queries specified by an SQL pattern
Use the /*paging_cache_enabled=true, invalidate_paging_cache=true*/ hint to clear the cache data of paged queries specified by an SQL pattern.
Sample statement:
```
/*paging_cache_enabled=true,invalidate_paging_cache=true*/ SELECT * FROM t_order ORDER BY id LIMIT 0, 100;
```
Cache data of paged queries specified by the paging_id parameter
Clear the cache data of paged queries specified by the paging_id parameter.
Sample statement:
```
CLEAN_PAGING_CACHE paging123;
```
Note
For information about how to obtain the value of the paging_id parameter, see the "Query information about cache tables" section of this topic.

Automatic clearing

You can specify a cache expiration time to clear the cache data of paged queries that is not accessed within the specified time range. Default value: 600. Unit: seconds. By default, the cache data that is not accessed within 10 minutes is automatically cleared.

SET ADB_CONFIG PAGING_CACHE_EXPIRATION_TIME=600;

Disable the paging cache feature for a cluster

After you disable the paging cache feature for a cluster, all hints related to this feature become invalid, and paged queries are processed by using the original query methods.

SET ADB_CONFIG PAGING_CACHE_ENABLE=false;

After you disable the paging cache feature, you can execute the SHOW ADB_CONFIG KEY=PAGING_CACHE_ENABLE; statement to check whether the configuration takes effect.

Common errors and troubleshooting

Paging cache prepare failed, and cache is not available

Error message:

Paging cache prepare failed, and cache is not available. Please use /*paging_cache_enabled=true,invalidate_paging_cache=true*/ to clean the unavailable cache or set a specific pagingId with /*paging_id=xxx*/ to gen a new cache. Note that the old and new cache data may be inconsistent.

Cause: When you use the paging cache feature to query data, exceptions may occur, such as node restarting and scaling. If a paged query hits a cache table that fails to be created, the server does not perform database access or automatically re-create a cache table. The server throws an error.

Solution: To ensure data consistency, perform the following operations: In data export scenarios, we recommend that you clear the exported data and the unavailable cache data and then re-initiate a paged query to re-create a cache table. In other scenarios, we recommend that you clear the unavailable cache data or specify a new value for the paging_id parameter and then re-initiate a paged query to re-create a cache table.

Performance comparison

Use a TPC-H data set of 100 GB to evaluate the optimization effect of the paging cache feature for paged queries in data export scenarios.

This test involves 1 million entries of exported data. Each page contains 100,000 entries. Execute the following statements to perform paged queried on the first page:

-- General paged query without using the paging cache feature
SELECT * FROM lineitem ORDER BY l_orderkey,l_linenumber LIMIT 0,100000;

-- Paged query by using the paging cache feature (ORDER BY eliminated in the data export scenario)
/*paging_cache_enabled=true*/ SELECT * FROM lineitem LIMIT 0,100000;

Test result:

The query requests are executed in single concurrency mode. During the data export process, the average response time of general paged queries is 54,391 ms. After the paging cache feature is enabled, the average response time is 525 ms. The performance is improved by approximately 103 times. The CPU utilization and the memory usage are significantly decreased.

The paging cache feature significantly reduces the response time of paged queries during the data export process and effectively decreases the consumption of CPU and memory resources.