All Products
Search
Document Center

E-MapReduce:Accelerate data lake queries using materialized views

Last Updated:Mar 26, 2026

StarRocks' asynchronous materialized views store pre-computed query results from External Catalog tables and transparently rewrite incoming queries to use those results — eliminating redundant computation across repeated query patterns. This topic explains when to use materialized views, how to create them on External Catalogs, and how to apply them to common data lake acceleration scenarios.

StarRocks supports building asynchronous materialized views based on External Catalogs, including Hive Catalog, Iceberg Catalog, Hudi Catalog, Java Database Connectivity (JDBC) Catalog, and Paimon Catalog. Common scenarios include:

  • Transparent acceleration for data lake reports: When a slow query is detected, create a targeted materialized view. The query rewrite capability then transparently accelerates upper-layer queries without requiring any changes to business application logic or query statements.

  • Incremental computation for joining real-time and offline data: If the real-time fact table is a StarRocks local table and dimension tables are stored in a data lake, build a materialized view to join the local table with External Catalog tables for simple incremental computation.

  • Quickly building a metric layer: Use materialized views for data pre-aggregation and roll-ups to create a lightweight metric layer. The auto-refresh feature further reduces metric calculation complexity.

When to use materialized views

Three acceleration options are available for data lake queries: Data Cache, materialized views, and local tables. The right choice depends on your query patterns and freshness requirements.

Data Cache Materialized view Local table
How data is loaded Automatically cached when a query runs Automatically refreshed on a schedule or trigger Manually maintained import tasks
What is stored Raw data blocks (LRU eviction, no computation results) Pre-computed query results Full table data per the table definition
Query performance Data Cache ≤ Materialized view = Local table
Query statement changes needed None — on-the-fly computation after a cache hit None — query rewrite reuses pre-computed results Required — queries must target the local table

Materialized views offer three advantages over the other two options:

  • Local storage acceleration: Indexes, partitioning, bucketing, and Colocate Groups apply to stored results, improving performance beyond what Data Cache achieves.

  • No import task maintenance: Automatic refresh tasks keep data up to date. For Hive, Iceberg, and Paimon Catalogs, partition-level incremental refresh avoids full refreshes.

  • Transparent query rewrite: Queries are rewritten automatically to use the materialized view — no changes to application query statements needed.

Use materialized views when:

  • Query performance still does not meet latency or concurrency requirements even with Data Cache enabled

  • Queries have reusable patterns — fixed aggregations or repeated Join structures

  • Data is organized in partitions with high aggregation (for example, daily rollups)

Use Data Cache when:

  • Queries scan arbitrary data without reusable patterns

  • Remote storage experiences instability that could affect access

Catalog support matrix

Materialized view capabilities vary by catalog type. The following tables summarize refresh and query rewrite support for each External Catalog.

Refresh support

Catalog Refresh type Version Notes
Hive Catalog Incremental (partition-level) All supported versions Requires enabling Hive metadata cache refresh; StarRocks polls HMS or AWS Glue
Iceberg Catalog Incremental (partition-level) v3.1.4+ Iceberg V1 tables only
JDBC Catalog Incremental (partition-level) v3.1.4+ MySQL Range partitions only
Paimon Catalog Incremental (partition-level) v3.2.1+
Hudi Catalog Full refresh only All supported versions Partition-level change detection is not supported

Query rewrite support

Catalog Query rewrite enabled by default How to enable
Hive Catalog Yes No action needed
Hudi Catalog No Set "force_external_table_query_rewrite" = "true" at MV creation
JDBC Catalog No Set "force_external_table_query_rewrite" = "true" at MV creation
Iceberg Catalog Requires enabling Set "force_external_table_query_rewrite" = "true" at MV creation
Paimon Catalog Requires enabling Set "force_external_table_query_rewrite" = "true" at MV creation

Limitations

Before creating materialized views, be aware of these constraints:

  • No LIMIT or ORDER BY in MV definitions: Including LIMIT or ORDER BY in the materialized view's defining query prevents the MV from being used for query rewrite.

  • Partition alignment: For partition-based incremental refresh, the materialized view's partition key must be included in the base table's partition key.

  • Data consistency: Asynchronous materialized views are eventually consistent with base table data, not real-time. To tolerate a controlled degree of staleness during query rewrite, set mv_rewrite_staleness_second at MV creation.

  • Iceberg Partition Transforms: From v3.2.3, partitioned materialized views on Iceberg tables using Partition Transforms are supported. Supported transforms: identity, year, month, day, and hour only.

  • Complex queries: For complex defining queries, split the query and build multiple simple nested materialized views instead. Nested materialized views adapt to a wider range of query patterns.

  • `bitmap_union` rewrite: When using bitmap_union(to_bitmap(col)) for count(distinct) rewrite, the column used in to_bitmap() must be of BIGINT type (for example, lo_orderkey must be BIGINT).

Create a materialized view on an External Catalog

Creating a materialized view on an External Catalog table follows the same syntax as creating one on a StarRocks local table. Two configuration decisions are specific to External Catalogs: the refresh policy and query rewrite enablement.

Choose a refresh policy

For catalogs that support partition-level detection (Hive, Iceberg v3.1.4+, JDBC v3.1.4+, Paimon v3.2.1+), configure an incremental refresh policy so that only changed partitions are refreshed. This reduces resource consumption and helps ensure data consistency during query rewrite — if the base table partition has changed since the last refresh, the query is not rewritten to use the stale materialized view.

For Hudi Catalog, any triggered refresh performs a full refresh of the entire materialized view.

Iceberg Partition Transforms example

The following example creates a materialized view aligned to the day Partition Transform of an Iceberg table:

-- Definition of the Iceberg table.
CREATE TABLE spark_catalog.test_db.iceberg_sample_datetime_day (
  id         BIGINT,
  data       STRING,
  category   STRING,
  ts         TIMESTAMP)
USING iceberg
PARTITIONED BY (days(ts))

-- Create a materialized view based on the preceding Iceberg table.
CREATE MATERIALIZED VIEW `test_iceberg_datetime_day_mv` (`id`, `data`, `category`, `ts`)
PARTITION BY (`ts`)
DISTRIBUTED BY HASH(`id`)
REFRESH MANUAL
AS
SELECT
  `iceberg_sample_datetime_day`.`id`,
  `iceberg_sample_datetime_day`.`data`,
  `iceberg_sample_datetime_day`.`category`,
  `iceberg_sample_datetime_day`.`ts`
FROM `iceberg`.`test`.`iceberg_sample_datetime_day`;

Configure Hive metadata cache refresh

To enable partition-level change detection for Hive Catalog, turn on the Hive metadata cache refresh feature. When enabled, StarRocks periodically polls the Hive Metastore service (HMS) or AWS Glue to detect changes in frequently queried partitions.

Use ADMIN SET FRONTEND CONFIG to set the following FE dynamic configuration items:

ADMIN SET FRONTEND CONFIG ("key" = "value")
Configuration item Default Description
enable_background_refresh_connector_metadata true (v3.0) / false (v2.5) Enables periodic Hive metadata cache refresh. StarRocks polls HMS or AWS Glue for data updates in frequently accessed Hive External Catalogs.
background_refresh_metadata_interval_millis 600000 (10 minutes) Interval between consecutive Hive metadata cache refreshes. Unit: milliseconds.
background_refresh_metadata_time_secs_since_last_access_secs 86400 (24 hours) If a Hive Catalog is not accessed for longer than this period, its metadata cache refresh stops. Unit: seconds.

Enable query rewrite

For Hive Catalog materialized views, query rewrite is enabled by default. For Hudi and JDBC Catalog materialized views, query rewrite is disabled by default because strong data consistency cannot be guaranteed. For all External Catalog types other than Hive, set force_external_table_query_rewrite to "true" at creation time:

CREATE MATERIALIZED VIEW ex_mv_par_tbl
PARTITION BY emp_date
DISTRIBUTED BY hash(empid)
PROPERTIES (
"force_external_table_query_rewrite" = "true"
)
AS
SELECT empid, deptno, emp_date
FROM `hudi_catalog`.`emp_db`.`emps_par_tbl`
WHERE empid < 5;

Best practices

Use the following workflow to identify and accelerate slow data lake queries:

  1. Identify slow queries: Analyze the Audit Log or large query logs to find high-latency or high-resource queries.

  2. Profile the bottleneck: Run Query Profile on the slow query to determine which stage — Hash Join, aggregation, or scan — consumes the most time.

  3. Identify reusable patterns: Look for fixed Join structures, repeated aggregations, or shared filter conditions across multiple queries.

  4. Create a targeted materialized view: Build a materialized view that captures the expensive, reusable computation. Start simple; use nested materialized views for complex query shapes.

  5. Verify rewrite: Run EXPLAIN on the original query after MV creation to confirm that the query planner rewrites it to use the materialized view.

The following cases use the Hive catalog hive.ssb_1g_csv to demonstrate the most common patterns.

Case 1: Accelerate Join computations

When Query Profile shows that Hash Join between a large fact table and dimension tables dominates execution time, pre-join the tables in a materialized view.

The following three queries all join lineorder against various dimension tables on lo_orderdate:

--Q1
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE
    lo_orderdate = d_datekey
    AND d_year = 1993
    AND lo_discount BETWEEN 1 AND 3
    AND lo_quantity < 25;

--Q2
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE
    lo_orderdate = d_datekey
    AND d_yearmonth = 'Jan1994'
    AND lo_discount BETWEEN 4 AND 6
    AND lo_quantity BETWEEN 26 AND 35;

--Q3
SELECT SUM(lo_revenue), d_year, p_brand
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates, hive.ssb_1g_csv.part, hive.ssb_1g_csv.supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228'
    AND s_region = 'ASIA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

Q1 and Q2 aggregate after joining lineorder and dates. Q3 aggregates after joining all four tables. Use StarRocks' View Delta Join rewrite capability to build one materialized view joining all four tables — Q1 and Q2 can rewrite against a subset of its columns:

CREATE MATERIALIZED VIEW lineorder_flat_mv
DISTRIBUTED BY HASH(LO_ORDERDATE, LO_ORDERKEY) BUCKETS 48
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
    -- Specify the unique constraint.
    "unique_constraints" = "
    hive.ssb_1g_csv.supplier.s_suppkey;
    hive.ssb_1g_csv.part.p_partkey;
    hive.ssb_1g_csv.dates.d_datekey",
    -- Specify the foreign key constraint.
    "foreign_key_constraints" = "
    hive.ssb_1g_csv.lineorder(lo_partkey) REFERENCES hive.ssb_1g_csv.part(p_partkey);
    hive.ssb_1g_csv.lineorder(lo_suppkey) REFERENCES hive.ssb_1g_csv.supplier(s_suppkey);
    hive.ssb_1g_csv.lineorder(lo_orderdate) REFERENCES hive.ssb_1g_csv.dates(d_datekey)",
    -- Enable query rewrite.
    "force_external_table_query_rewrite" = "TRUE"
)
AS SELECT
       l.LO_ORDERDATE AS LO_ORDERDATE,
       l.LO_ORDERKEY AS LO_ORDERKEY,
       l.LO_PARTKEY AS LO_PARTKEY,
       l.LO_SUPPKEY AS LO_SUPPKEY,
       l.LO_QUANTITY AS LO_QUANTITY,
       l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
       l.LO_DISCOUNT AS LO_DISCOUNT,
       l.LO_REVENUE AS LO_REVENUE,
       s.S_REGION AS S_REGION,
       p.P_BRAND AS P_BRAND,
       d.D_YEAR AS D_YEAR,
       d.D_YEARMONTH AS D_YEARMONTH
   FROM hive.ssb_1g_csv.lineorder AS l
            INNER JOIN hive.ssb_1g_csv.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
            INNER JOIN hive.ssb_1g_csv.part AS p ON p.P_PARTKEY = l.LO_PARTKEY
            INNER JOIN hive.ssb_1g_csv.dates AS d ON l.LO_ORDERDATE = d.D_DATEKEY;

Case 2: Accelerate aggregations

When Query Profile shows that an AGGREGATE node is the bottleneck, pre-aggregate the data in a materialized view.

Single-table aggregation

Q4 calculates distinct daily order counts — a resource-intensive count(distinct) operation:

--Q4
SELECT
lo_orderdate, count(distinct lo_orderkey)
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate
ORDER BY lo_orderdate limit 100;

Two materialized view approaches are available. The first stores the exact aggregate; the second uses bitmap_union for better rewrite flexibility (requires lo_orderkey to be BIGINT):

-- Option 1: Direct count(distinct)
CREATE MATERIALIZED VIEW mv_2_1
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT
lo_orderdate, count(distinct lo_orderkey)
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate;

-- Option 2: bitmap_union for rewrite (lo_orderkey must be BIGINT)
CREATE MATERIALIZED VIEW mv_2_2
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT
lo_orderdate, bitmap_union(to_bitmap(lo_orderkey))
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate;
Do not add LIMIT or ORDER BY to materialized view definitions — both prevent query rewrite.

Multi-table aggregation

Build a nested materialized view on top of an existing join MV to further aggregate the results. Based on lineorder_flat_mv from Case 1, the following MV accelerates Q1 and Q2, which share a similar aggregation pattern:

CREATE MATERIALIZED VIEW mv_2_3
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT
lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth, SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder_flat_mv
GROUP BY lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth;

Alternatively, combine the join and aggregation in a single MV. This approach uses less storage but has fewer query rewrite opportunities because the computation is more specific:

CREATE MATERIALIZED VIEW mv_2_4
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
    "force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth, SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE lo_orderdate = d_datekey
GROUP BY lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth;

Case 3: Accelerate post-aggregation Joins

Some queries aggregate one table first and then join the result against another table. To cover this shape with query rewrite, split the computation across two nested materialized views.

Q5 aggregates the customer table and then joins the result with lineorder:

--Q5
SELECT * FROM  (
    SELECT
      l.lo_orderkey, l.lo_orderdate, c.c_custkey, c_region, sum(l.lo_revenue)
    FROM
      hive.ssb_1g_csv.lineorder l
      INNER JOIN (
        SELECT distinct c_custkey, c_region
        from
          hive.ssb_1g_csv.customer
        WHERE
          c_region IN ('ASIA', 'AMERICA')
      ) c ON l.lo_custkey = c.c_custkey
      GROUP BY  l.lo_orderkey, l.lo_orderdate, c.c_custkey, c_region
  ) c1
WHERE
  lo_orderdate = '19970503'

Similar queries may use different filter values on c_region and lo_orderdate. Build two materialized views — one for the aggregation on customer and one for the join — so the rewrite covers varied filter conditions:

-- mv_3_1: pre-aggregate the customer table
CREATE MATERIALIZED VIEW mv_3_1
DISTRIBUTED BY HASH(c_custkey)
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
    "force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT distinct c_custkey, c_region from hive.ssb_1g_csv.customer;

-- mv_3_2: join lineorder against mv_3_1 and aggregate
CREATE MATERIALIZED VIEW mv_3_2
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
    "force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT l.lo_orderdate, l.lo_orderkey, mv.c_custkey, mv.c_region, sum(l.lo_revenue)
FROM hive.ssb_1g_csv.lineorder l
INNER JOIN mv_3_1 mv
ON l.lo_custkey = mv.c_custkey
GROUP BY l.lo_orderkey, l.lo_orderdate, mv.c_custkey, mv.c_region;

Case 4: Hot and cold data separation

When recent data (last 3 days) is written directly to StarRocks and older data is stored in Hive, but queries span the past 7 days, use a materialized view to bring the Hive data into StarRocks' local storage for the relevant time window:

CREATE MATERIALIZED VIEW mv_4_1
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT lo_orderkey, lo_orderdate, lo_revenue
FROM hive.ssb_1g_csv.lineorder
WHERE lo_orderdate<=current_date()
AND lo_orderdate>=date_add(current_date(), INTERVAL -4 DAY);

Build additional views or materialized views on top of mv_4_1 to encapsulate upper-layer business logic.