An overview of caching in StarRocks, covering its basic concepts, how it works, and its role in improving system performance. - E-MapReduce

This topic describes the architecture of StarRocks' cache management and the applicable scenarios for each cache type, to help you select the right caching solution for your business needs.

Features

StarRocks provides multiple caching mechanisms that improve query performance by keeping hot data in the memory or on the local disks of BE and CN nodes. This reduces repeated reads from remote storage, such as HDFS and object storage.

Cache types

Cache type	Use cases	Default state	Available since
Shared-data Data Cache	Accelerates queries on internal tables in shared-data (serverless) instances.	Enabled by default	v3.1.7 / v3.2.3
Data lake Data Cache	Accelerates queries on external tables from an External Catalog (such as Hive, Iceberg, and Hudi).	Enabled by default since v3.3.0	v2.5
Index Cache	Caches indexes for shared-data instances, ideal for scenarios where disk capacity is insufficient to cache the full dataset.	Enabled by default	v3.3.13

Note

Since v3.4.0, queries on internal tables in shared-data instances and queries on data lakes share the same Data Cache instance, eliminating the need for separate configurations.

Recommendations

Shared-data instance: Use the shared-data Data Cache. It automatically loads data on demand from remote storage to the local cache, requiring no extra configuration.
Data lake external tables: Use the data lake Data Cache. It supports caching remote files in formats like Parquet and ORC, making it ideal for scenarios that involve repeated scans of large tables, such as ad-hoc analytics and report queries.
Insufficient disk capacity for the full dataset: Use the Index Cache. It caches only indexes, so it improves query performance with low disk overhead.
Preloading hot data: Use Data Cache warmup (CACHE SELECT) to load specific data into the cache in advance and avoid the performance impact of a cold start.
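
The preloading recommendation can be illustrated with a minimal sketch of CACHE SELECT. The catalog, database, table, and column names below (hive_catalog, sales_db, orders, order_id, amount, order_date) are hypothetical placeholders, not names from this topic. The statement form and its supported options can vary by StarRocks version, so check the CACHE SELECT reference for your version before relying on it.

```sql
-- Hypothetical names throughout: hive_catalog, sales_db, orders, and the columns below.

-- Load every column of one external table into the Data Cache.
CACHE SELECT * FROM hive_catalog.sales_db.orders;

-- Load only the columns and rows that upcoming queries are expected to read.
CACHE SELECT order_id, amount
FROM hive_catalog.sales_db.orders
WHERE order_date >= '2024-01-01';
```

Restricting the column list and adding a filter, as in the second statement, is likely the better default when cache capacity is limited, because it loads only the data that later queries are expected to read.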