Use materialized views for data modeling - E-MapReduce - Alibaba Cloud Documentation Center

Data modeling through StarRocks asynchronous materialized views enables you to simplify pipeline management and improve the efficiency and flexibility of data modeling through declarative modeling language. This topic describes how to use StarRocks asynchronous materialized views for data modeling.

Background

Data modeling is the process of data cleansing, layering, aggregation, and association through appropriate methods. When the original data quality is too low, there are too many complex metrics, or the query cost is too high because of lack of aggregation, you can model the original data to obtain easy-to-understand and usable data results.

However, a common contradiction in real-world data modeling is that the modeling process can hardly keep pace with business development, and it is difficult to measure the return on investment of data modeling work. Although modeling methods are simple, they require business experts to have a solid background in data organization and governance to process and organize data, which is a complex process. In the early stages of business, decision makers typically do not invest sufficient resources in data modeling, and it is difficult to see the value that data modeling can bring. Additionally, because business models may change rapidly, the modeling methods themselves also need to be continuously iterated and evolved. Therefore, many data analysts tend not to use data modeling and directly use raw data, which inevitably leads to issues with data quality and query performance. When the need for modeling arises, they encounter the problem that the data usage patterns have already taken shape and are difficult to restructure.

Using StarRocks materialized views for data modeling can effectively solve the above problems. StarRocks asynchronous materialized views have the following capabilities:

Simplify data warehouse architecture: Because StarRocks can provide a one-stop data governance experience, you do not need to maintain other data processing systems or components, saving human and physical resources used to maintain these systems.
Simplify modeling experience: Any data analyst with only basic SQL knowledge can use StarRocks for data modeling without the need for professional data engineers.
Simplify system maintenance: StarRocks asynchronous materialized views can automatically manage the levels and dependencies between data, without requiring the entire data platform to handle this task.

In practice, you can combine StarRocks logical views and asynchronous materialized views for data modeling, as shown below:

Use views to associate real-time data with dimension data, and use materialized views to associate historical data in the data lake with dimension data. At the same time, perform necessary data cleansing and business semantic mapping to obtain an Intermediate Layer that reflects business semantic detailed data.
In the Application Layer, perform data Join, Agg, Union, and Window calculations for different business scenarios, generating views for real-time links and materialized views for near real-time links.
On the Application side, select the appropriate analytical data storage (ADS) for query analysis based on your timeliness and performance requirements, serving real-time dashboards, near real-time BI, Ad hoc queries, and scheduled reports.

Capabilities of asynchronous materialized views

StarRocks asynchronous materialized views have the following atomic capabilities to help with data modeling:

Auto-refresh: Materialized views can be automatically refreshed after data is imported into base tables. You do not need to maintain scheduling tasks externally.
Partition refresh: Through reports with time series attributes, near real-time computing can be achieved through partition refresh.
Collaborative use with views: By collaboratively using materialized views and logical views, you can implement multi-layer modeling, thereby achieving reuse of the intermediate layer and simplification of data models.
Schema Change: You can change the computation results through simple SQL statements without modifying complex data pipelines.

With these features, you can design comprehensive and flexible data models to meet various business requirements and scenarios.

Auto-refresh

When creating an asynchronous materialized view, you can use the REFRESH clause to specify the refresh strategy. Currently, StarRocks asynchronous materialized views support the following refresh strategies:

Auto-refresh (REFRESH ASYNC): A refresh task is triggered whenever data in the base table changes. Data dependencies are automatically managed by the materialized view.
Scheduled refresh (REFRESH ASYNC EVERY (INTERVAL <refresh_interval>)): Refresh tasks are triggered periodically, such as every minute, day, or month. If there are no data changes in the base table, no refresh task will be triggered.
Manual refresh (REFRESH MANUAL): You can only trigger refresh tasks by manually executing REFRESH MATERIALIZED VIEW. You can use this refresh strategy if you trigger refresh tasks through an external scheduling framework.

The syntax is as follows:

CREATE MATERIALIZED VIEW <db_name>.<mv_name>
REFRESH 
    [ ASYNC | 
      ASYNC [START <time>] EVERY(<interval>) | 
      MANUAL
    ]
AS <query>

Partition refresh

When creating an asynchronous materialized view, you can use the PARTITION BY clause to associate the partitions of the base table with the partitions of the materialized view, thereby achieving refresh at the partition granularity.

PARTITION BY <column>: You can reference the same partition column as the base table for the materialized view, so that the base table and materialized view use the same partition granularity.
PARTITION BY date_trunc(<column>): You can use the date_trunc function to roll up based on the base table partition column, thereby creating different granularity partition strategies for the materialized view.
PARTITION BY { time_slice | date_slice }(<column>): Compared to date_trunc, time_slice and date_slice provide more flexible time granularity adjustment, allowing for more fine-grained control of time partition granularity.

The syntax is as follows.

CREATE MATERIALIZED VIEW <db_name>.<mv_name>
REFRESH ASYNC
PARTITION BY 
    [
        <base_table_column> | 
        date_trunc(<granularity>, <base_table_column>) |
        time_slice(<base_table_column>, <granularity>) | 
        date_slice(<base_table_column>, <granularity>)
    ]
AS <query>

Collaborative use with views

You can create materialized views based on views. In this case, the materialized view can be automatically refreshed when data changes in the base tables referenced by the view.
You can create materialized views based on other materialized views, enabling multi-level cascading refresh.
You can create views based on materialized views, equivalent to creating based on regular tables.

Schema change

Asynchronous materialized views can be atomically replaced using the ALTER MATERIALIZED VIEW SWAP statement. You can create a new materialized view, add new columns or change column types, and then replace the old materialized view with the new one.
Views can be directly modified using the ALTER VIEW statement.
Regular tables in StarRocks can be modified using SWAP or ALTER operations.
When base tables (which can be materialized views, views, or regular tables) change, cascading changes will be triggered in the corresponding materialized views.

Hierarchical modeling

In real business scenarios, there are various forms of data sources, including real-time detailed data, dimension data, and archived data in data lakes. On the other hand, business requirements need diverse analytical methods, such as real-time dashboards, near real-time BI queries, analyst Ad hoc queries, and scheduled reports. Different scenarios have different requirements, some need flexibility, some prioritize performance, while others emphasize cost-effectiveness.

Clearly, a single solution cannot adequately meet such diverse requirements. StarRocks can efficiently meet these requirements by combining views and materialized views. Because views do not maintain physical data, each time a view is queried, the query is parsed and executed according to the view's definition. In contrast, materialized views store pre-computed results, which can avoid the overhead of repeated execution. Views are suitable for expressing business semantics and simplifying SQL complexity, but cannot reduce the overhead of query execution. On the other hand, materialized views optimize query performance by pre-computing and storing query results, suitable for scenarios that simplify ETL Pipeline.

Differences between views and materialized views

Comparison item	View	Materialized view
Scenarios	Business modeling, data governance	Data modeling, transparent acceleration, data lakehouse
Storage overheads	Does not store data, no storage overhead	Stores pre-computed results, has additional storage cost
Update overhead	No update overhead	Has update overhead when base table data is updated
Performance benefits	No pre-computation, no performance benefit	Pre-computed results, accelerates queries
Data timeliness	Returns the latest data when querying the view	Results are pre-computed, may have data latency
Dependencies	References base tables by name, view becomes invalid if base table name changes	References base tables by ID, base table name changes do not affect materialized view availability
Syntax	CREATE VIEW	CREATE MATERIALIZED VIEW
Modification syntax	ALTER VIEW	ALTER MATERIALIZED VIEW

Modify views and base tables

You can use the following statements to modify your views, materialized views, and base tables.

-- Modify base table.
ALTER TABLE <db_name>.<table_name> ADD COLUMN <column_desc>;

-- Atomically replace base table.
ALTER TABLE <db_name>.<table1> SWAP WITH <table2>;

-- Modify view definition.
ALTER VIEW <db_name>.<view_name> AS <query>;

-- Atomically replace materialized view (swaps the names of two materialized views without modifying their data).
ALTER MATERIALIZED VIEW <db_name>.<mv1> SWAP WITH <mv2>;

-- Re-enable materialized view.
ALTER MATERIALIZED VIEW <db_name>.<mv_name> ACTIVE;

Schema Change follows these principles:

Table renaming and atomic replacement operations will cause dependent materialized views to become Inactive. For Schema Change operations, materialized views will only become Inactive when the base table columns they depend on undergo Schema Change.
Changes to view definitions will cause dependent materialized views to become Inactive.
Atomic replacement of materialized views will cause dependent nested materialized views to become Inactive.
The Inactive state propagates upward in a cascading manner until there are no more materialized view dependencies.
Inactive materialized views cannot be refreshed or used for automatic query rewriting.
Inactive materialized views can still be directly queried, but data consistency cannot be guaranteed until they are made Active.

Because data consistency cannot be guaranteed for Inactive materialized views, you can use the following methods to fix them:

Manual repair: You can manually repair Inactive materialized views by executing ALTER MATERIALIZED VIEW <mv_name> ACTIVE. This statement will attempt to rebuild based on the original SQL definition of the materialized view. Note that for rebuilding to succeed, the SQL definition must still be valid after the underlying Schema Change, otherwise the operation will fail.
Repair during refresh: StarRocks will automatically attempt to execute the above repair command when refreshing materialized views, rebuilding the materialized view before refreshing.
Automatic repair: StarRocks will attempt to automatically repair Inactive materialized views. You can disable this feature by running ADMIN SET FRONTEND CONFIG('enable_mv_automatic_active_check'='false').

Partition modeling

In addition to hierarchical modeling, partition modeling is also an important aspect of data modeling. Data modeling often involves associating data according to business semantics and setting the time-to-live (TTL) of data based on timeliness requirements. Partition modeling plays an important role in this process. Different data association methods will produce different modeling approaches, such as star schema and snowflake schema. These models have one thing in common: they all use fact tables and dimension tables. Some business scenarios require multiple large fact tables, while others involve complex dimension tables and association relationships. StarRocks materialized views support fact table partition association, meaning that fact tables are partitioned, and the Join results of materialized views are partitioned in the same way.

As shown in the above figure, materialized views are used to associate fact tables with multiple dimension tables:

You need to specify the partition key of a specific base table (usually the fact table) in the partition key of the materialized view to achieve partition association for the materialized view (PARTITION BY fact_tbl.col). A materialized view can only have partition association with one base table.
When data in a certain partition of the associated base table changes, the corresponding partition in the materialized view will be refreshed, without affecting other partitions.
When other non-associated base tables change, the entire materialized view will be refreshed by default. However, you can choose to ignore data changes in certain non-associated tables so that the materialized view is not refreshed when data in these tables changes.

Use cases

Partition association can support various business scenarios:

Fact table updates: You can partition fact tables to fine granularity levels, such as by day or by hour. After the fact table is updated, the corresponding partitions in the materialized view will be automatically refreshed.
Dimension table updates: Typically, data updates in dimension tables will cause all associated results to be refreshed, which is costly. You can choose to ignore data updates in certain dimension tables to avoid refreshing the entire materialized view, or you can specify a time range so that only partitions within that time range are refreshed.
Automatic refresh of external tables: In external data sources such as Apache Hive or Apache Iceberg, data often changes at the partition granularity. StarRocks materialized views can subscribe to partition-level data updates in external tables and only refresh the corresponding partitions of the materialized view.
TTL: When setting partition strategies for materialized views, you can set the number of recent partitions to retain, thereby only keeping the latest data. The corresponding business scenario has high requirements for data timeliness, for example, analysts only need to query the latest data within a certain time window without needing to retain all historical data.

You can use multiple parameters to control refresh behavior:

partition_refresh_number: The number of partitions to refresh in each refresh operation.
partition_ttl_number: The number of recent partitions to retain.
excluded_trigger_tables: Tables to ignore to avoid triggering automatic refresh.
auto_refresh_partitions_limit: The number of partitions to refresh in each automatic refresh operation.

Limitations

Partitioned materialized views have the following limitations:

Only supports creating partitioned materialized views based on partitioned tables.
Only supports DATE and DATETIME partition column types, not STRING type.
Only supports using date_trunc, time_slice, and date_slice for partition roll-up.
Only supports a single column as the partition column, not multiple partition columns.