When multiple analysts run queries against the same OSS dataset simultaneously, every query fetches data directly from OSS — resulting in high latency and congested bandwidth. Lake cache eliminates repeated OSS fetches by caching frequently accessed objects on dedicated NVMe SSDs, delivering millisecond read latency and bandwidth that scales linearly with cache size.
Prerequisites
Before you begin, ensure that you have:
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster
How it works
When a Spark job reads from OSS with lake cache enabled:
The lake cache client forwards the read request to the master node to fetch object metadata.
The master node returns the metadata to the client.
The client uses the metadata to request the actual objects from worker nodes.
If the objects are already cached on the worker nodes, they are returned immediately.
If not, the objects are fetched from OSS, returned to the client, and stored in the cache for future reads.
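The read path above can be sketched as a read-through cache. This is a minimal illustration of the miss-then-populate behavior; the class and function names are hypothetical and do not correspond to the actual lake cache client API:

```python
class WorkerNode:
    """Hypothetical worker node holding a local NVMe cache."""

    def __init__(self, oss_fetch):
        self.cache = {}            # object key -> object bytes
        self.oss_fetch = oss_fetch # fallback fetch from OSS

    def read(self, key):
        # Cache hit: return the object immediately from the local SSD.
        if key in self.cache:
            return self.cache[key], "hit"
        # Cache miss: fetch from OSS, store for future reads, then return.
        data = self.oss_fetch(key)
        self.cache[key] = data
        return data, "miss"

# Simulated OSS backend.
oss = {"part-0001.parquet": b"columnar data"}
worker = WorkerNode(oss_fetch=lambda k: oss[k])

# The first read misses and populates the cache; the second read hits.
_, first = worker.read("part-0001.parquet")
_, second = worker.read("part-0001.parquet")
print(first, second)  # miss hit
```

This is also why first-time reads see no speedup: the object must travel from OSS once before subsequent reads can be served from the SSD cache.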
Data consistency is maintained automatically: when OSS objects are updated, lake cache detects the change and refreshes the cache. Queries always read the latest data.
Performance
Bandwidth and latency
| Metric | Details |
|---|---|
| Read latency | Millisecond-level (NVMe SSD) |
| Cache bandwidth | 5 Gbit/s per TB of cache size |
| Maximum burst throughput | Hundreds of Gbit/s |
| Cache size range | 10 GB–200,000 GB |
Bandwidth example: With a 10 TB cache, the read bandwidth is 5 × 10 = 50 Gbit/s (approximately 6.25 GB/s). With a 20 TB cache, the read bandwidth doubles to 100 Gbit/s. Bandwidth scales linearly as you increase the cache size, unconstrained by standard OSS bandwidth limits.
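The scaling rule can be expressed directly as arithmetic, using the 5 Gbit/s per TB figure from the table above (a sketch; the function names are illustrative):

```python
GBIT_PER_TB = 5  # cache bandwidth per TB of cache size (from the table above)

def cache_bandwidth_gbit(cache_tb):
    """Read bandwidth in Gbit/s for a given cache size in TB."""
    return GBIT_PER_TB * cache_tb

def gbit_to_gbyte(gbit):
    """Convert Gbit/s to GB/s (8 bits per byte)."""
    return gbit / 8

print(cache_bandwidth_gbit(10))                 # 50  (Gbit/s for a 10 TB cache)
print(gbit_to_gbyte(cache_bandwidth_gbit(10)))  # 6.25 (GB/s)
print(cache_bandwidth_gbit(20))                 # 100 (Gbit/s for a 20 TB cache)
```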
High throughput density: Lake cache can deliver high throughput even for small datasets, which meets burst read requirements for small amounts of hot data.
If you need a cache size larger than 200,000 GB, submit a ticket.
Cache eviction policy
When the cache reaches its size limit, lake cache uses the Least Recently Used (LRU) eviction policy: infrequently accessed objects are removed first, and frequently accessed objects are retained. To prevent eviction of objects you want to keep, increase the cache size.
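The LRU behavior can be illustrated with a minimal sketch. This is not the actual lake cache implementation; for simplicity, capacity is counted in objects here rather than bytes:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key at capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # key -> object, least recently used first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            # Evict the least recently used object.
            self.items.popitem(last=False)

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now most recently used
cache.put("c", 3)  # evicts "b", the least recently used object
print(sorted(cache.items))  # ['a', 'c']
```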
TPC-H benchmark results
The following test uses TPC-H queries to measure the impact of lake cache on OSS read performance. With lake cache enabled, query execution was 2.7 times faster than with direct OSS access.
| Configuration | Cache size | Dataset size | Spark resource specs | Execution time |
|---|---|---|---|---|
| Lake cache enabled | 12 TB | 10 TB | 2 cores, 8 GB (medium) | 7,219s |
| Direct OSS access | None | 10 TB | 2 cores, 8 GB (medium) | 19,578s |
Billing
After you enable lake cache, you are charged for the used cache space on a pay-as-you-go basis. For pricing details, see Pricing for Enterprise Edition and Basic Edition and Pricing for Data Lakehouse Edition.
Limitations
Lake cache is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), Singapore, US (Virginia), and Indonesia (Jakarta). To use lake cache in other regions, submit a ticket.
If a hardware fault occurs on the cache nodes, queries continue to run but may slow down while data is re-fetched from OSS. Performance recovers automatically after the cache is repopulated.
When the cache reaches its size limit, infrequently accessed objects are replaced by more frequently accessed ones. Increase the cache size to prevent replacement.
Enable lake cache
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters.
On the Enterprise Edition, Basic Edition, or Data Lakehouse Edition tab, find the cluster and click the cluster ID.
On the Cluster Information page, go to the Configuration Information section and click Configure next to Lake Cache.
In the Lake Cache dialog box, turn on Lake Cache and specify a cache size. If an error occurs when specifying the cache size, submit a ticket.
Click OK.
After enabling lake cache, open the Lake Cache dialog box again to confirm the configured cache size.
Use lake cache in Spark jobs
After enabling lake cache, set the spark.adb.lakecache.enabled parameter to true in your Spark job to activate OSS read acceleration.
Spark SQL
```sql
-- Enable lake cache for this session
SET spark.adb.lakecache.enabled=true;
-- Run your queries
SHOW databases;
```
Spark JAR
Pass the parameter in the job configuration:
```json
{
  "comments": [
    "Enable lake cache for OSS read acceleration."
  ],
  "args": ["oss://testBucketName/data/readme.txt"],
  "name": "spark-oss-test",
  "file": "oss://testBucketName/data/example.py",
  "conf": {
    "spark.adb.lakecache.enabled": "true"
  }
}
```
To use lake cache with the XIHE engine, submit a ticket.
When lake cache does not accelerate a query
Lake cache does not speed up a query when any of the following conditions apply:
The Spark job does not have spark.adb.lakecache.enabled=true set.
The cluster is in a region where lake cache is not supported.
The data is being read for the first time — it has not yet been cached. Performance improves on subsequent reads.
A hardware fault has occurred on cache nodes and data is being re-fetched from OSS (temporary; recovers automatically).
The cache space is full and the objects needed by the query have been evicted. Increase the cache size to retain more hot data.
Monitor lake cache
After enabling lake cache, check whether your Spark jobs are using the cache and review usage metrics in CloudMonitor.
Log on to the CloudMonitor console.
In the left-side navigation pane, choose Cloud Resource Monitoring > Cloud Service Monitoring.
Hover over the AnalyticDB for MySQL card and click AnalyticDB for MySQL 3.0 - Data Lakehouse Edition.
Find the cluster and click Monitoring Charts in the Actions column.
Click the LakeCache Metrics tab to view cache details.
The following metrics are available:
| Metric | Description |
|---|---|
| LakeCache Cache Hit Ratio(%) | Percentage of read requests served from cache. Formula: reads from cache / total reads. A higher ratio means more OSS traffic is being avoided. |
| LakeCache Cache Usage(B) | Amount of cache space currently in use, in bytes. |
| Total Amount of Historical Cumulative Read Data of LakeCache(B) | Total data read from the cache since it was enabled, in bytes. |
If the cache hit ratio is low despite repeated queries over the same data, consider increasing the cache size to retain more objects between reads.
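The hit-ratio formula from the table can be computed from raw read counters like so (a sketch; the function name is illustrative, not a CloudMonitor API):

```python
def cache_hit_ratio(reads_from_cache, total_reads):
    """Percentage of read requests served from cache: reads from cache / total reads."""
    if total_reads == 0:
        return 0.0
    return 100.0 * reads_from_cache / total_reads

# Example: 850 of 1,000 reads were served from the cache.
print(cache_hit_ratio(850, 1000))  # 85.0
```

A higher ratio means more reads avoided OSS; a persistently low ratio under a repetitive workload suggests objects are being evicted between reads, which a larger cache size can address.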