StarRocks provides an out-of-the-box data lake query feature, which is ideal for exploratory query analysis. You can use asynchronous materialized views to achieve higher concurrency and better performance for reports and applications on your data lake. This topic describes how to use asynchronous materialized views in StarRocks to optimize data lake query performance.
Scenarios
StarRocks supports building asynchronous materialized views based on External Catalogs, such as Hive Catalog, Iceberg Catalog, Hudi Catalog, Java Database Connectivity (JDBC) Catalog, and Paimon Catalog. Materialized views based on External Catalogs are suitable for the following scenarios.
Transparent acceleration for data lake reports
To ensure the query performance of data lake reports, data engineers often need to work closely with data analysts to design the build logic for the report acceleration layer. If the requirements for the acceleration layer are updated, they must update the build logic, execution plan, and query statement accordingly. The query rewrite capability of materialized views lets you make the report acceleration process transparent to users. When a slow query is detected, data engineers can analyze its pattern and create a materialized view as needed. The materialized view then intelligently rewrites and transparently accelerates the upper-layer query. This process rapidly improves query performance without requiring modifications to the logic or query statements of business applications.
Incremental computation for joining real-time and offline data
Suppose your business application needs to join real-time data in StarRocks local tables with historical data in a data lake for incremental computation. In this case, materialized views can provide a simple solution. For example, if the real-time fact table is a local table in StarRocks and the dimension tables are stored in a data lake, you can easily perform incremental computation by building a materialized view to join the local table with tables from external data sources.
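The pattern above can be sketched as a materialized view that joins a local fact table with a lake-resident dimension table. This is a minimal, hypothetical sketch: the table names (orders_rt, hive_catalog.crm.customer_dim) and columns are illustrative, not from the original text.

```sql
-- Hypothetical sketch: orders_rt is a real-time local StarRocks table;
-- hive_catalog.crm.customer_dim is a dimension table in a Hive Catalog.
CREATE MATERIALIZED VIEW orders_enriched_mv
DISTRIBUTED BY HASH(order_id)
REFRESH ASYNC EVERY(INTERVAL 10 MINUTE)
AS
SELECT o.order_id, o.order_ts, o.amount, c.region, c.segment
FROM orders_rt AS o
JOIN hive_catalog.crm.customer_dim AS c ON o.cust_id = c.cust_id;
```

Each refresh re-computes the join incrementally at the granularity your refresh policy and partitioning allow, so the application queries the materialized view instead of maintaining its own join pipeline.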
Quickly building a metric layer
Calculating and processing metrics can be challenging when dealing with high-dimensional data. You can use materialized views for data pre-aggregation and roll-ups to create a relatively lightweight metric layer. Additionally, you can leverage the auto-refresh feature of materialized views to further reduce the complexity of metric calculation.
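As a sketch of such a metric layer, the following materialized view pre-aggregates a lake fact table into daily metrics. All table and column names here are hypothetical placeholders.

```sql
-- Hypothetical sketch of a lightweight metric-layer materialized view:
-- daily revenue and order counts rolled up from a fact table in the lake.
CREATE MATERIALIZED VIEW daily_sales_metrics_mv
DISTRIBUTED BY HASH(sales_date)
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT
    sales_date,
    region,
    SUM(amount) AS revenue,
    COUNT(*)    AS order_cnt
FROM hive_catalog.sales.fact_orders
GROUP BY sales_date, region;
```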
Feature comparison
Materialized views, Data Cache, and local tables in StarRocks are all effective methods to achieve significant query performance improvements. The following table compares their main differences.
| Comparison item | Data Cache | Materialized view | Local table |
| --- | --- | --- | --- |
| Data import and update | Queries automatically trigger data caching | Refresh tasks are triggered automatically | Supports various import methods but requires manual maintenance of import tasks |
| Data cache granularity | Supports block-level data caching; follows the LRU cache eviction mechanism; does not cache computation results | Stores pre-computed query results | Stores data based on the table definition |
| Query performance | Data Cache ≤ Materialized view = Local table | | |
| Query statement | No need to modify query statements against the data lake | No need to modify query statements against the data lake | Requires modifying query statements to query local tables |
Compared to directly querying data lake data or importing data into local tables, materialized views offer several unique advantages:
Local storage acceleration: Materialized views can leverage StarRocks' local storage acceleration advantages, such as indexes, partitioning and bucketing, and Colocate Groups. This results in better query performance than directly querying data from the data lake.
No need to maintain loading tasks: Materialized views transparently update data through automatic refresh tasks, which eliminates the need to maintain import tasks. Additionally, materialized views based on Hive, Iceberg, and Paimon Catalogs can detect data changes and perform incremental refreshes at the partition level.
Smart query rewrite: Queries can be transparently rewritten to use materialized views, which accelerates queries without modifying the query statements used by applications.
Usage recommendations
Use materialized views in the following situations:
Query performance still does not meet your requirements for query latency and concurrency, even with Data Cache enabled.
Queries involve reusable parts, such as fixed aggregation methods or Join patterns.
Data is organized in partitions, and queries have a high degree of aggregation (for example, daily aggregation).
Use Data Cache for acceleration in the following situations:
Queries do not have many reusable parts and may involve any data in the data lake.
The remote storage experiences significant fluctuations or instability, which could potentially impact access.
Create a materialized view based on an External Catalog
Creating a materialized view on a table in an External Catalog is similar to creating one on a StarRocks local table. You only need to set a suitable refresh policy for your data source and manually enable the query rewrite feature for External Catalog materialized views.
Choose a suitable refresh policy
Currently, StarRocks cannot detect partition-level data changes in Hudi Catalogs. Therefore, once a refresh task is triggered, a full refresh is performed.
For Hive Catalogs, Iceberg Catalogs (from v3.1.4), JDBC Catalogs (from v3.1.4, and only MySQL Range partitions are supported), and Paimon Catalogs (from v3.2.1), StarRocks supports detecting partition-level data changes. As a result, StarRocks can:
Refresh only the partitions where data has changed. This avoids full refreshes and reduces the resource consumption caused by the refresh process.
Ensure data consistency to some extent during query rewrite. If the base table in the data lake undergoes data changes, the query will not be rewritten to use the materialized view.
You can still choose to tolerate a certain degree of data inconsistency by setting the property mv_rewrite_staleness_second when you create the materialized view.
Note that for partition-based refreshes, the partition key of the materialized view must be included in the partition key of the base table.
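The mv_rewrite_staleness_second property mentioned above can be set at creation time. In this sketch the catalog, table, and column names are hypothetical; only the property name comes from the text.

```sql
-- Hypothetical sketch: allow query rewrite to use this materialized view
-- even if the base table changed within the last 600 seconds.
CREATE MATERIALIZED VIEW hive_sales_mv
DISTRIBUTED BY HASH(order_id)
REFRESH ASYNC EVERY(INTERVAL 1 HOUR)
PROPERTIES (
    "mv_rewrite_staleness_second" = "600"
)
AS
SELECT order_id, order_date, amount
FROM hive_catalog.sales.orders;
```

A larger staleness window trades data freshness for more rewrite opportunities; set it according to how much lag your reports can tolerate.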
From v3.2.3, StarRocks supports creating partitioned materialized views on Iceberg tables that use Partition Transforms. The materialized view will be partitioned based on the transformed columns. Currently, only Iceberg tables that use the identity, year, month, day, or hour Transform are supported.
The following example shows the definition of an Iceberg table that uses the day Transform and a partition-aligned materialized view created on that table:
-- Definition of the Iceberg table.
CREATE TABLE spark_catalog.test_db.iceberg_sample_datetime_day (
id BIGINT,
data STRING,
category STRING,
ts TIMESTAMP)
USING iceberg
PARTITIONED BY (days(ts));
-- Create a materialized view based on the preceding Iceberg table.
CREATE MATERIALIZED VIEW `test_iceberg_datetime_day_mv` (`id`, `data`, `category`, `ts`)
PARTITION BY (`ts`)
DISTRIBUTED BY HASH(`id`)
REFRESH MANUAL
AS
SELECT
`iceberg_sample_datetime_day`.`id`,
`iceberg_sample_datetime_day`.`data`,
`iceberg_sample_datetime_day`.`category`,
`iceberg_sample_datetime_day`.`ts`
FROM `iceberg`.`test`.`iceberg_sample_datetime_day`;

For Hive Catalogs, you can enable the Hive metadata cache refresh feature to allow StarRocks to detect data changes at the partition level. After this feature is enabled, StarRocks periodically accesses the Hive Metastore service (HMS) or AWS Glue to check the metadata information of frequently queried hot data.
Configuration items
To enable the Hive metadata cache refresh feature, you can use ADMIN SET FRONTEND CONFIG to set the following FE dynamic configuration items. The syntax is as follows.
ADMIN SET FRONTEND CONFIG ("key" = "value");

| Configuration item | Default | Description |
| --- | --- | --- |
| enable_background_refresh_connector_metadata | true in v3.0 and later; false in v2.5 | Specifies whether to enable periodic refresh of the Hive metadata cache. If enabled, StarRocks polls the metadata service (HMS or AWS Glue) of the Hive cluster and refreshes the metadata cache of frequently accessed Hive external data catalogs to detect data updates. |
| background_refresh_metadata_interval_millis | 600000 (10 minutes) | The interval between two consecutive Hive metadata cache refreshes. Unit: milliseconds. |
| background_refresh_metadata_time_secs_since_last_access_secs | 86400 (24 hours) | The expiration time for the Hive metadata cache refresh task. For a Hive Catalog that has been accessed, if it has not been accessed for longer than this time, StarRocks stops refreshing its metadata cache. For a Hive Catalog that has never been accessed, StarRocks does not refresh its metadata cache. Unit: seconds. |
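For example, the refresh interval can be shortened with a single ADMIN SET FRONTEND CONFIG statement. The configuration name used here, background_refresh_metadata_interval_millis, is the FE item that controls this interval in recent StarRocks versions; verify it against your deployment before applying.

```sql
-- Refresh the Hive metadata cache every 5 minutes instead of the default 10.
ADMIN SET FRONTEND CONFIG ("background_refresh_metadata_interval_millis" = "300000");
```

Note that ADMIN SET FRONTEND CONFIG changes a dynamic FE parameter at runtime; persist it in the FE configuration file if it must survive a restart.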
For Iceberg Catalogs, from v3.1.4, StarRocks supports detecting partition-level data changes. Currently, only Iceberg V1 tables are supported.
Enable query rewrite for External Catalog materialized views
Because strong data consistency cannot be guaranteed, StarRocks disables the query rewrite feature for Hudi and JDBC Catalog materialized views by default. You can enable it by setting the property force_external_table_query_rewrite to true when you create the materialized view. For materialized views created on tables in a Hive Catalog, the query rewrite feature is enabled by default.

In scenarios that involve query rewrite, if you use a very complex query statement to build a materialized view, we recommend that you split the query statement and build multiple simple materialized views in a nested manner. Nested materialized views are more flexible and can adapt to a wider range of query patterns.
The following is an example.
CREATE MATERIALIZED VIEW ex_mv_par_tbl
PARTITION BY emp_date
DISTRIBUTED BY hash(empid)
PROPERTIES (
"force_external_table_query_rewrite" = "true"
)
AS
SELECT empid, deptno, emp_date
FROM `hudi_catalog`.`emp_db`.`emps_par_tbl`
WHERE empid < 5;

Best practices
In business scenarios, you can identify slow and resource-intensive queries by analyzing the Audit Log or large query logs. You can also use Query Profile to pinpoint the specific stage where a query is slow. The following sections provide instructions and examples for improving data lake query performance with materialized views.
Case 1: Accelerate Join computations in the data lake
You can use materialized views to accelerate Join queries in the data lake.
Assume the following queries on a Hive catalog are slow queries.
--Q1
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE
lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25;
--Q2
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE
lo_orderdate = d_datekey
AND d_yearmonth = 'Jan1994'
AND lo_discount BETWEEN 4 AND 6
AND lo_quantity BETWEEN 26 AND 35;
--Q3
SELECT SUM(lo_revenue), d_year, p_brand
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates, hive.ssb_1g_csv.part, hive.ssb_1g_csv.supplier
WHERE
lo_orderdate = d_datekey
AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey
AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228'
AND s_region = 'ASIA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

By analyzing the query profiles, you might notice that most of the query execution time is spent on the Hash Join between the table lineorder and other dimension tables on the column lo_orderdate.
Here, Q1 and Q2 perform aggregation after joining lineorder and dates, while Q3 performs aggregation after joining lineorder, dates, part, and supplier.
Therefore, you can leverage StarRocks' View Delta Join rewrite capability to build a materialized view that joins lineorder, dates, part, and supplier.
CREATE MATERIALIZED VIEW lineorder_flat_mv
DISTRIBUTED BY HASH(LO_ORDERDATE, LO_ORDERKEY) BUCKETS 48
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
-- Specify the UNIQUE constraint.
"unique_constraints" = "
hive.ssb_1g_csv.supplier.s_suppkey;
hive.ssb_1g_csv.part.p_partkey;
hive.ssb_1g_csv.dates.d_datekey",
-- Specify the foreign key constraint.
"foreign_key_constraints" = "
hive.ssb_1g_csv.lineorder(lo_partkey) REFERENCES hive.ssb_1g_csv.part(p_partkey);
hive.ssb_1g_csv.lineorder(lo_suppkey) REFERENCES hive.ssb_1g_csv.supplier(s_suppkey);
hive.ssb_1g_csv.lineorder(lo_orderdate) REFERENCES hive.ssb_1g_csv.dates(d_datekey)",
-- Enable query rewrite.
"force_external_table_query_rewrite" = "TRUE"
)
AS SELECT
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
s.S_REGION AS S_REGION,
p.P_BRAND AS P_BRAND,
d.D_YEAR AS D_YEAR,
d.D_YEARMONTH AS D_YEARMONTH
FROM hive.ssb_1g_csv.lineorder AS l
INNER JOIN hive.ssb_1g_csv.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN hive.ssb_1g_csv.part AS p ON p.P_PARTKEY = l.LO_PARTKEY
INNER JOIN hive.ssb_1g_csv.dates AS d ON l.LO_ORDERDATE = d.D_DATEKEY;

Case 2: Accelerate aggregations and post-join aggregations in the data lake
Materialized views can be used to accelerate aggregate queries, whether on a single table or involving multiple tables.
Single-table aggregate query
For a typical single-table query, if the Query Profile shows that the AGGREGATE node consumes a significant amount of time, you can build a materialized view that uses common aggregate operators. Assume the following query is slow.
--Q4
SELECT lo_orderdate, count(distinct lo_orderkey)
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate
ORDER BY lo_orderdate
LIMIT 100;

Q4 is a query that calculates the number of distinct daily orders. Because count distinct is resource-intensive, you can create the following two types of materialized views for acceleration:
CREATE MATERIALIZED VIEW mv_2_1
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT lo_orderdate, count(distinct lo_orderkey)
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate;

CREATE MATERIALIZED VIEW mv_2_2
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT
-- lo_orderkey must be of BIGINT type to be used for query rewrite.
lo_orderdate, bitmap_union(to_bitmap(lo_orderkey))
FROM hive.ssb_1g_csv.lineorder
GROUP BY lo_orderdate;

Note: Do not create materialized views with LIMIT and ORDER BY clauses here, to avoid query rewrite failures.
Multi-table aggregate query
In scenarios that involve aggregating join results, you can create a nested materialized view on an existing materialized view that joins multiple tables to further aggregate the results. For example, based on the example in Case 1, you can create the following materialized view to accelerate Q1 and Q2 because their aggregation patterns are similar.
CREATE MATERIALIZED VIEW mv_2_3
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth, SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder_flat_mv
GROUP BY lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth;

You can also perform both joins and aggregations in a single materialized view. Such materialized views have fewer opportunities for query rewrite because their calculations are more specific, but they use less storage space after aggregation. You can choose the best approach for your specific scenario.
CREATE MATERIALIZED VIEW mv_2_4
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
    "force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth, SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM hive.ssb_1g_csv.lineorder, hive.ssb_1g_csv.dates
WHERE lo_orderdate = d_datekey
GROUP BY lo_orderdate, lo_discount, lo_quantity, d_year, d_yearmonth;
Case 3: Accelerate post-aggregation Join computations in the data lake
In some cases, you may need to first perform an aggregation computation on one table and then perform a Join query with other tables. To fully leverage StarRocks' query rewrite feature, we recommend that you build nested materialized views. The following is an example.
--Q5
SELECT * FROM (
SELECT
l.lo_orderkey, l.lo_orderdate, c.c_custkey, c_region, sum(l.lo_revenue)
FROM
hive.ssb_1g_csv.lineorder l
INNER JOIN (
SELECT distinct c_custkey, c_region
from
hive.ssb_1g_csv.customer
WHERE
c_region IN ('ASIA', 'AMERICA')
) c ON l.lo_custkey = c.c_custkey
GROUP BY l.lo_orderkey, l.lo_orderdate, c.c_custkey, c_region
) c1
WHERE
lo_orderdate = '19970503';
--mv_3_1
CREATE MATERIALIZED VIEW mv_3_1
DISTRIBUTED BY HASH(c_custkey)
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
"force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT distinct c_custkey, c_region from hive.ssb_1g_csv.customer;
--mv_3_2
CREATE MATERIALIZED VIEW mv_3_2
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
PROPERTIES (
"force_external_table_query_rewrite" = "TRUE"
)
AS
SELECT l.lo_orderdate, l.lo_orderkey, mv.c_custkey, mv.c_region, sum(l.lo_revenue)
FROM hive.ssb_1g_csv.lineorder l
INNER JOIN mv_3_1 mv
ON l.lo_custkey = mv.c_custkey
GROUP BY l.lo_orderkey, l.lo_orderdate, mv.c_custkey, mv.c_region;

Case 4: Hot and cold data separation for real-time and historical data in the data lake
Consider a scenario where new data from the last three days is written directly to StarRocks, while data older than three days is batch-written to Hive. However, queries may still need to access data from the past seven days. In this case, you can use a materialized view to create a simple data expiration model.
CREATE MATERIALIZED VIEW mv_4_1
DISTRIBUTED BY HASH(lo_orderdate)
PARTITION BY LO_ORDERDATE
REFRESH ASYNC EVERY(INTERVAL 1 DAY)
AS
SELECT lo_orderkey, lo_orderdate, lo_revenue
FROM hive.ssb_1g_csv.lineorder
WHERE lo_orderdate <= current_date()
AND lo_orderdate >= date_add(current_date(), INTERVAL -4 DAY);
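The encapsulation described above can be sketched as a view that stitches the two tiers together: the local real-time table serves the most recent days, and the materialized view mv_4_1 serves the older lake data. The local table name lineorder_rt and the three-day cutoff are illustrative assumptions.

```sql
-- Hypothetical sketch: serve the last 7 days through one view.
-- lineorder_rt is an assumed local StarRocks table holding the newest data;
-- mv_4_1 is the materialized view over Hive data defined above.
CREATE VIEW lineorder_last_7_days AS
SELECT lo_orderkey, lo_orderdate, lo_revenue
FROM lineorder_rt                                        -- hot tier: local real-time data
WHERE lo_orderdate > date_add(current_date(), INTERVAL -3 DAY)
UNION ALL
SELECT lo_orderkey, lo_orderdate, lo_revenue
FROM mv_4_1                                              -- cold tier: refreshed lake data
WHERE lo_orderdate <= date_add(current_date(), INTERVAL -3 DAY);
```

Applications query lineorder_last_7_days without needing to know where each day's data physically lives.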