
E-MapReduce:Accelerate data lake queries using materialized views

Last Updated: Aug 27, 2025

StarRocks provides an out-of-the-box data lake query feature, which is ideal for exploratory query analysis. You can use asynchronous materialized views to achieve higher concurrency and better performance for reports and applications on your data lake. This topic describes how to use asynchronous materialized views in StarRocks to optimize data lake query performance.

Scenarios

StarRocks supports building asynchronous materialized views based on External Catalogs, such as Hive Catalog, Iceberg Catalog, Hudi Catalog, Java Database Connectivity (JDBC) Catalog, and Paimon Catalog. Materialized views based on External Catalogs are suitable for the following scenarios.

  • Transparent acceleration for data lake reports

    To ensure the query performance of data lake reports, data engineers often need to work closely with data analysts to design the build logic for the report acceleration layer. If the requirements for the acceleration layer are updated, they must update the build logic, execution plan, and query statement accordingly. The query rewrite capability of materialized views lets you make the report acceleration process transparent to users. When a slow query is detected, data engineers can analyze its pattern and create a materialized view as needed. The materialized view then intelligently rewrites and transparently accelerates the upper-layer query. This process rapidly improves query performance without requiring modifications to the logic or query statements of business applications.

  • Incremental computation for joining real-time and offline data

    Suppose your business application needs to join real-time data in StarRocks local tables with historical data in a data lake for incremental computation. In this case, materialized views can provide a simple solution. For example, if the real-time fact table is a local table in StarRocks and the dimension tables are stored in a data lake, you can easily perform incremental computation by building a materialized view to join the local table with tables from external data sources.

  • Quickly building a metric layer

    Calculating and processing metrics can be challenging when dealing with high-dimensional data. You can use materialized views for data pre-aggregation and roll-ups to create a relatively lightweight metric layer. Additionally, you can leverage the auto-refresh feature of materialized views to further reduce the complexity of metric calculation.

Feature comparison

Materialized views, Data Cache, and local tables in StarRocks are all effective methods to achieve significant query performance improvements. The following table compares their main differences.

| Comparison item | Data Cache | Materialized view | Local table |
| --- | --- | --- | --- |
| Data import and update | Queries automatically trigger data caching. | Refresh tasks are automatically triggered. | Supports various import methods, but requires manual maintenance of import tasks. |
| Data cache granularity | Supports block-level data caching; follows the LRU cache eviction mechanism; does not cache computation results. | Stores pre-computed query results. | Stores data based on the table definition. |
| Query performance | Data Cache ≤ Materialized view = Local table | | |
| Query statement | No need to modify query statements for data lake data; once a query hits the cache, on-the-fly computation is performed. | No need to modify query statements for data lake data; leverages query rewrite to reuse pre-computed results. | Requires modifying query statements to query local tables. |

Compared to directly querying data lake data or importing data into local tables, materialized views offer several unique advantages:

  • Local storage acceleration: Materialized views can leverage StarRocks' local storage acceleration advantages, such as indexes, partitioning and bucketing, and Colocate Groups. This results in better query performance than directly querying data from the data lake.

  • No need to maintain loading tasks: Materialized views transparently update data through automatic refresh tasks, which eliminates the need to maintain import tasks. Additionally, materialized views based on Hive, Iceberg, and Paimon Catalogs can detect data changes and perform incremental refreshes at the partition level.

  • Smart query rewrite: Queries can be transparently rewritten to use materialized views, which accelerates queries without modifying the query statements used by applications.
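To confirm that a query is transparently rewritten, you can inspect its execution plan with EXPLAIN: if the rewrite succeeds, the plan scans the materialized view instead of the external base table. A minimal sketch (the materialized view name in the comment is a hypothetical example):

```sql
-- If query rewrite succeeds, the plan produced below scans the
-- materialized view (for example, a hypothetical lineorder_agg_mv)
-- rather than the external Hive table.
EXPLAIN
SELECT lo_orderdate, SUM(lo_revenue)
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate;
```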

Usage recommendations

Use materialized views in the following situations:

  • Query performance still does not meet your requirements for query latency and concurrency, even with Data Cache enabled.

  • Queries involve reusable parts, such as fixed aggregation methods or Join patterns.

  • Data is organized in partitions, and queries have a high degree of aggregation (for example, daily aggregation).

Use Data Cache for acceleration in the following situations:

  • Queries do not have many reusable parts and may involve any data in the data lake.

  • The remote storage experiences significant fluctuations or instability, which could potentially impact access.

Create a materialized view based on an External Catalog

Creating a materialized view on a table in an External Catalog is similar to creating one on a StarRocks local table. You only need to set a refresh policy suitable for your data source and, if necessary, manually enable the query rewrite feature for External Catalog materialized views.

Choose a suitable refresh policy

Currently, StarRocks cannot detect partition-level data changes in Hudi Catalogs. Therefore, once a refresh task is triggered, a full refresh is performed.

For Hive Catalogs, Iceberg Catalogs (from v3.1.4), JDBC Catalogs (from v3.1.4, and only MySQL Range partitions are supported), and Paimon Catalogs (from v3.2.1), StarRocks supports detecting partition-level data changes. As a result, StarRocks can:

  • Refresh only the partitions where data has changed. This avoids full refreshes and reduces the resource consumption caused by the refresh process.

  • Ensure data consistency to some extent during query rewrite. If the base table in the data lake undergoes data changes, the query will not be rewritten to use the materialized view.

Note

You can still choose to tolerate a certain degree of data inconsistency by setting the property mv_rewrite_staleness_second when you create the materialized view.
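For example, the following sketch tolerates up to 300 seconds of staleness during query rewrite (the property value and the aggregation query are chosen for illustration only, reusing the example catalog from the cases below):

```sql
CREATE MATERIALIZED VIEW hive_mv_with_staleness
DISTRIBUTED BY HASH(lo_orderdate)
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
    -- Queries can still be rewritten to this materialized view if its
    -- data is no more than 300 seconds behind the base table.
    "mv_rewrite_staleness_second" = "300"
)
AS
SELECT lo_orderdate, SUM(lo_revenue) AS total_revenue
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate;
```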

Note that for partition-based refreshes, the partition key of the materialized view must be included in the partition key of the base table.

From v3.2.3, StarRocks supports creating partitioned materialized views on Iceberg tables that use Partition Transforms. The materialized view will be partitioned based on the transformed columns. Currently, only Iceberg tables that use the identity, year, month, day, or hour Transform are supported.

The following example shows the definition of an Iceberg table that uses the day Transform and a partition-aligned materialized view created on that table:

-- Definition of the Iceberg table.
CREATE TABLE spark_catalog.test_db.iceberg_sample_datetime_day (
  id         BIGINT,
  data       STRING,
  category   STRING,
  ts         TIMESTAMP)
USING iceberg
PARTITIONED BY (days(ts))

-- Create a materialized view based on the preceding Iceberg table.
CREATE MATERIALIZED VIEW `test_iceberg_datetime_day_mv` (`id`, `data`, `category`, `ts`)
PARTITION BY (`ts`)
DISTRIBUTED BY HASH(`id`)
REFRESH MANUAL
AS 
SELECT 
  `iceberg_sample_datetime_day`.`id`, 
  `iceberg_sample_datetime_day`.`data`, 
  `iceberg_sample_datetime_day`.`category`, 
  `iceberg_sample_datetime_day`.`ts`
FROM `iceberg`.`test`.`iceberg_sample_datetime_day`;

For Hive Catalogs, you can enable the Hive metadata cache refresh feature to allow StarRocks to detect data changes at the partition level. After this feature is enabled, StarRocks periodically accesses the Hive Metastore service (HMS) or AWS Glue to check the metadata information of frequently queried hot data.

Configuration items

To enable the Hive metadata cache refresh feature, you can use ADMIN SET FRONTEND CONFIG to set the following FE dynamic configuration items. The syntax is as follows.

ADMIN SET FRONTEND CONFIG ("key" = "value")
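For example, the following statements enable periodic metadata refresh and shorten the refresh interval to 5 minutes (the interval value is for illustration only):

```sql
-- Enable periodic refresh of the Hive metadata cache.
ADMIN SET FRONTEND CONFIG ("enable_background_refresh_connector_metadata" = "true");
-- Refresh the metadata cache every 5 minutes (300,000 milliseconds).
ADMIN SET FRONTEND CONFIG ("background_refresh_metadata_interval_millis" = "300000");
```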

| Configuration name | Default | Description |
| --- | --- | --- |
| enable_background_refresh_connector_metadata | true for v3.0, false for v2.5 | Specifies whether to enable periodic refresh of the Hive metadata cache. If enabled, StarRocks polls the metadata service (HMS or AWS Glue) of the Hive cluster and refreshes the metadata cache of frequently accessed Hive external data catalogs to detect data updates. true enables the feature, and false disables it. |
| background_refresh_metadata_interval_millis | 600000 (10 minutes) | The interval between two consecutive Hive metadata cache refreshes. Unit: milliseconds. |
| background_refresh_metadata_time_secs_since_last_access_secs | 86400 (24 hours) | The expiration time for a Hive metadata cache refresh task. For a Hive Catalog that has been accessed, if it is not accessed again for longer than this duration, StarRocks stops refreshing its metadata cache. For a Hive Catalog that has never been accessed, StarRocks does not refresh its metadata cache. Unit: seconds. |

For Iceberg Catalogs, from v3.1.4, StarRocks supports detecting partition-level data changes. Currently, only Iceberg V1 tables are supported.

Enable query rewrite for External Catalog materialized views

Because strong data consistency cannot be guaranteed, StarRocks disables the query rewrite feature by default for materialized views built on Hudi and JDBC Catalogs. You can enable it by setting the property force_external_table_query_rewrite to true when you create the materialized view. For materialized views created on tables in a Hive Catalog, the query rewrite feature is enabled by default.

In scenarios that involve query rewrite, if you use a very complex query statement to build the materialized view, we recommend that you split the statement and build multiple simple materialized views in a nested manner. Nested materialized views are more flexible and can adapt to a wider range of query patterns.

The following is an example.

CREATE MATERIALIZED VIEW ex_mv_par_tbl
PARTITION BY emp_date
DISTRIBUTED BY hash(empid)
PROPERTIES (
"force_external_table_query_rewrite" = "true"
) 
AS
SELECT empid, deptno, emp_date
FROM `hudi_catalog`.`emp_db`.`emps_par_tbl`
WHERE empid < 5;

Best practices

In business scenarios, you can identify slow and resource-intensive queries by analyzing the Audit Log or large query logs. You can also use Query Profile to pinpoint the specific stage where a query is slow. The following sections provide instructions and examples for improving data lake query performance with materialized views.

Case 1: Accelerate Join computations in the data lake

You can use materialized views to accelerate Join queries in the data lake.

Assume the following queries on a Hive catalog are slow queries.

--Q1
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE
    lo_orderdate = d_datekey
    AND d_year = 1993
    AND lo_discount BETWEEN 1 AND 3
    AND lo_quantity < 25;

--Q2
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE
    lo_orderdate = d_datekey
    AND d_yearmonth = 'Jan1994'
    AND lo_discount BETWEEN 4 AND 6
    AND lo_quantity BETWEEN 26 AND 35;

--Q3 
SELECT SUM(lo_revenue), d_year, p_brand
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates, hive.ssb_1g_csv.part, hive.ssb_1g_csv.supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228'
    AND s_region = 'ASIA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

By analyzing the query profiles, you might notice that most of the query execution time is spent on the Hash Join between the table lineorder and other dimension tables on the column lo_orderdate.

Here, Q1 and Q2 perform aggregation after joining lineorder and dates, while Q3 performs aggregation after joining lineorder, dates, part, and supplier.

Therefore, you can leverage StarRocks' View Delta Join rewrite capability to build a materialized view that joins lineorder, dates, part, and supplier.

CREATE MATERIALIZED VIEW lineorder_flat_mv
DISTRIBUTED BY HASH(LO_ORDERDATE, LO_ORDERKEY) BUCKETS 48
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
PROPERTIES (
    -- Specify the UNIQUE constraint.
    "unique_constraints" = "
    hive.ssb_1g_csv.supplier.s_suppkey;
    hive.ssb_1g_csv.part.p_partkey;
    hive.ssb_1g_csv.dates.d_datekey",
    -- Specify the foreign key constraint.
    "foreign_key_constraints" = "
    hive.ssb_1g_csv.lineorder(lo_partkey) REFERENCES hive.ssb_1g_csv.part(p_partkey);
    hive.ssb_1g_csv.lineorder(lo_suppkey) REFERENCES hive.ssb_1g_csv.supplier(s_suppkey);
    hive.ssb_1g_csv.lineorder(lo_orderdate) REFERENCES hive.ssb_1g_csv.dates(d_datekey)",
    -- Enable query rewrite.
    "force_external_table_query_rewrite" = "TRUE"
)
AS SELECT
       l.LO_ORDERDATE AS LO_ORDERDATE,
       l.LO_ORDERKEY AS LO_ORDERKEY,
       l.LO_PARTKEY AS LO_PARTKEY,
       l.LO_SUPPKEY AS LO_SUPPKEY,
       l.LO_QUANTITY AS LO_QUANTITY,
       l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
       l.LO_DISCOUNT AS LO_DISCOUNT,
       l.LO_REVENUE AS LO_REVENUE,
       s.S_REGION AS S_REGION,
       p.P_BRAND AS P_BRAND,
       d.D_YEAR AS D_YEAR,
       d.D_YEARMONTH AS D_YEARMONTH
   FROM hive.ssb_1g_csv.lineorder AS l
            INNER JOIN hive.ssb_1g_csv.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
            INNER JOIN hive.ssb_1g_csv.part AS p ON p.P_PARTKEY = l.LO_PARTKEY
            INNER JOIN hive.ssb_1g_csv.dates AS d ON l.LO_ORDERDATE = d.D_DATEKEY;

Case 2: Accelerate aggregations and post-join aggregations in the data lake

Materialized views can be used to accelerate aggregate queries, whether on a single table or involving multiple tables.

  • Single-table aggregate query

    For a typical single-table query, if the Query Profile shows that the AGGREGATE node consumes a significant amount of time, you can build a materialized view that uses common aggregate operators. Assume the following query is slow.

    --Q4
    SELECT
    lo_orderdate, count(distinct lo_orderkey)
    FROM hive.ssb_1g_csv.lineorder
    GROUP BY lo_orderdate
    ORDER BY lo_orderdate LIMIT 100;

    Q4 is a query that calculates the number of distinct daily orders. Because count distinct is resource-intensive, you can create the following two types of materialized views:

    CREATE MATERIALIZED VIEW mv_2_1 
    DISTRIBUTED BY HASH(lo_orderdate)
    PARTITION BY LO_ORDERDATE
    REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
    AS 
    SELECT
    lo_orderdate, count(distinct lo_orderkey)
    FROM hive.ssb_1g_csv.lineorder
    GROUP BY lo_orderdate;
    
    CREATE MATERIALIZED VIEW mv_2_2 
    DISTRIBUTED BY HASH(lo_orderdate)
    PARTITION BY LO_ORDERDATE
    REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
    AS 
    SELECT
    -- lo_orderkey must be of BIGINT type to be used for query rewrite.
    lo_orderdate, bitmap_union(to_bitmap(lo_orderkey))
    FROM hive.ssb_1g_csv.lineorder
    GROUP BY lo_orderdate;

    Note

    Do not create materialized views with LIMIT and ORDER BY clauses here to avoid rewrite failures.

  • Multi-table aggregate query

    In scenarios that involve aggregating join results, you can create a nested materialized view on an existing materialized view that joins multiple tables to further aggregate the results. For example, based on the example in Case 1, you can create the following materialized view to accelerate Q1 and Q2 because their aggregation patterns are similar.

    CREATE MATERIALIZED VIEW mv_2_3
    DISTRIBUTED BY HASH(lo_orderdate)
    PARTITION BY LO_ORDERDATE
    REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
    AS 
    SELECT
    lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth, SUM(lo_extendedprice * lo_discount) AS REVENUE
    FROM lineorder_flat_mv
    GROUP BY lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth;

    You can also perform both joins and aggregations in a single materialized view. These types of materialized views have fewer opportunities for query rewrites because the calculations are more specific. However, they use less storage space after aggregation. You can choose the best approach for your specific scenario.

    CREATE MATERIALIZED VIEW mv_2_4
    DISTRIBUTED BY HASH(lo_orderdate)
    PARTITION BY LO_ORDERDATE
    REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
    PROPERTIES (
        "force_external_table_query_rewrite" = "TRUE"
    )
    AS
    SELECT lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth, SUM(lo_extendedprice * lo_discount) AS REVENUE
    FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
    WHERE lo_orderdate = d_datekey
    GROUP BY lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth;

Case 3: Accelerate post-aggregation Join computations in the data lake

In some cases, you may need to first perform an aggregation computation on one table and then perform a Join query with other tables. To fully leverage StarRocks' query rewrite feature, we recommend that you build nested materialized views. The following is an example.

--Q5
SELECT *
FROM (
    SELECT
        l.lo_orderkey, l.lo_orderdate, c.c_custkey, c_region, SUM(l.lo_revenue)
    FROM hive.ssb_1g_csv.lineorder l
    INNER JOIN (
        SELECT DISTINCT c_custkey, c_region
        FROM hive.ssb_1g_csv.customer
        WHERE c_region IN ('ASIA', 'AMERICA')
    ) c ON l.lo_custkey = c.c_custkey
    GROUP BY l.lo_orderkey, l.lo_orderdate, c.c_custkey, c_region
) c1
WHERE lo_orderdate = '19970503';

Q5 first performs an aggregation on the customer table, and then performs a Join and aggregation on the lineorder table. Similar queries may involve different filter conditions on c_region and lo_orderdate. To leverage the query rewrite feature, you can create two materialized views, one for aggregation and another for the join.

--mv_3_1
CREATE MATERIALIZED VIEW mv_3_1
DISTRIBUTED BY HASH(c_custkey)
REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
PROPERTIES (
    "force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT DISTINCT c_custkey, c_region FROM hive.ssb_1g_csv.customer;

--mv_3_2
CREATE MATERIALIZED VIEW mv_3_2
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
PROPERTIES (
    "force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT l.lo_orderdate, l.lo_orderkey, mv.c_custkey, mv.c_region, sum(l.lo_revenue)
FROM hive.ssb_1g_csv.lineorder l 
INNER JOIN mv_3_1 mv
ON l.lo_custkey = mv.c_custkey
GROUP BY l.lo_orderkey, l.lo_orderdate, mv.c_custkey, mv.c_region;

Case 4: Hot and cold data separation for real-time and historical data in the data lake

Consider a scenario where new data from the last three days is written directly to StarRocks, while data older than three days is batch-written to Hive. However, queries may still need to access data from the past seven days. In this case, you can use a materialized view to create a simple data expiration model.

CREATE MATERIALIZED VIEW mv_4_1 
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY) 
AS 
SELECT lo_orderkey, lo_orderdate, lo_revenue
FROM hive.ssb_1g_csv.lineorder
WHERE lo_orderdate <= current_date()
AND lo_orderdate >= date_add(current_date(), INTERVAL -4 DAY);

You can further build views or materialized views based on the upper-layer business logic to encapsulate computations.
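For example, a logical view can stitch the two layers together so that applications query a single object. The sketch below assumes a hypothetical StarRocks local table lineorder_realtime that holds the most recent data; adjust the names and date boundaries to match your schema:

```sql
CREATE VIEW lineorder_7d AS
SELECT lo_orderkey, lo_orderdate, lo_revenue
FROM lineorder_realtime            -- hypothetical local table with recent data
WHERE lo_orderdate > date_add(current_date(), INTERVAL -3 DAY)
UNION ALL
SELECT lo_orderkey, lo_orderdate, lo_revenue
FROM mv_4_1                        -- materialized view over historical Hive data
WHERE lo_orderdate <= date_add(current_date(), INTERVAL -3 DAY);
```

Applications can then query lineorder_7d directly, without needing to know where the hot/cold boundary lies.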