By Yunxi from Alibaba Cloud Database OLAP Product Department
In the computing field, precomputing and caching are two common means to improving performance and reducing costs. For frequently repeated requests, if they can be answered by caching, it is much faster than computing for new results or reading from slower data storage. In addition, caching allows fewer system resources to be consumed. In the database field, materialized view is a natural representation of precomputing and caching.
This article describes the definition of materialized view and ways to rewrite queries based on it.
In this first section, a brief introduction to materialized views, as well as the purpose of query rewrite based on them, will be presented. The second section will introduce the matching and rewriting process of query optimizers with materialized views. Finally, several implementation methods of query rewriting and their advantages and disadvantages will be introduced.
A Materialized view is a special table that pre-calculates and stores the query results. "Materialized" is relative to "normal". A normal view provides ease-of-use and flexibility features, but it does not speed up data access. A Materialized view is the cache of views. It, instead of building and computing datasets in runtime, pre-computes, stores, and optimizes data access when created, and automatically refreshes to ensure real-time data availability.
For data warehouses, the most important feature of materialized views is query acceleration. There are a large number of complex queries on large tables in the data warehouse, which can consume a lot of resources and time. Materialized views can eliminate the overhead of expensive joins and aggregations by responding to queries by pre-computed results. Thus, they greatly improve query processing time and reduce system load. Materialized views are especially useful for queries that can be anticipated and repeatedly use the same subquery results.
To realize the potential of materialized views, three issues need to be solved:
This article describes the technical implementation of query acceleration by using materialized views from the query optimizer perspective.
Directly querying a materialized view reduces the query processing time, but the query statement must be modified. An important issue in using materialized views to accelerate queries is how to automatically respond to queries by materialized views systematically and automatically. With this transparent rewriting, materialized views can be added or deleted in the same way as an index, without affecting existing SQL statements.
Query rewriting makes materialized views widely used:
To achieve more extensive rewriting, query rewriting is usually integrated into the optimizer rules with several benefits.
First, query rewriting can use optimizer other rules to converts queries into standard and unified forms, simplifying the matching process and increasing the scope of rewriting. Column elimination, predicate pushdown, and subquery "unnesting" are important rules. Subquery unnesting allows materialized views to rewrite and accelerate queries that contain correlated subqueries.
Second, the optimizer can recursively determine whether each subtree is rewritten by a view. Each view produces multiple rewriting operations for each subtree. Different parts of a query statement may be rewritten by different views. Finally, all the rewriting operations are input into the cost-based selector, and the optimal query plan is selected together with the original query.
The query rewriting algorithm only needs to consider the given query expression and view to determine whether the query expression can be calculated from the view. Then, construct an equivalent compensation expression from the view, which is equivalent to the original query expression. The range of query rewriting should be as large as possible. The goal of query rewriting is to rewrite a large number of queries by a small number of materialized views. It is up to the optimizer to finally select an optimal query plan.
The optimizer rewrites a query in a variety of ways. In the simplest case, the query of the materialized view exactly matches the query. The number of queries that meet this query rewriting type is very small. So, the optimizer attempts to construct an equivalent expression by using various rules for more general matching.
The query rewriting check consists of two steps – rewriting match check and equivalent expression construction. For a query to be responded to by a view, the following two conditions must be met:
In some cases, the view may be rewritten even if the conditions are not met. For example, if a view contains part of the data needed for a query, users can use the materialized view to respond to part of the query and compute the remaining data from the raw data. This part of the rewriting check is introduced in advanced rewriting rules.
Specifically, the rewriting check performs a JOIN check and an output check in sequence. If a query or view contains Grouping By and Aggregation, an additional check will be performed. If necessary, the view will be further aggregated.
When the join relationship between the tables of the query and the view is the same, the view may contain all the rows and columns required by the query. A simple method is to consider only INNER JOIN queries without subqueries, in which case it is sufficient to compare only the query table and the quantity to check whether they are identical. Otherwise, check whether the join relationship contained in the relational algebraic tree of the query and view is equivalent to a set of rules. A more common method is to build a Join Graph that takes relationships as nodes and joins as edges. The condition for INNER JOIN is expressed as unbound edges, while the condition for OUTER JOIN is expressed as directed edges. Join Graph is built by querying a relational algebraic tree. By comparing Join Graphs, the materialized view and query can be checked whether they contain the same join relationship.
For example, in the following query
select c_custkey, c_name, l_orderkey, l_partkey, l_quantity from ( select l_orderkey,l_partkey, l_quantity from lineitem where l_orderkey < 1000 and l_shipdate = l_commitdate ) subquery join orders on subquery.l_orderkey = o_orderkey left join customer on o_custkey = c_custkey
For materialized views
create materialized view mview1 enable query rewrite as select * from lineitem join customer On subquery.l_orderkey = o_orderkey left outer join orders On o_custkey = c_custkey
Queries and materialized views have the same join relationship, and queries can be rewritten.
select c_custkey, c_name, l_orderkey,l_partkey, l_quantity from mview1 where l_orderkey < 1000 and l_shipdate = l_commitdate
The output check ensures that the materialized view has enough data to respond to the query, which includes three steps:
Query rewriting allows users to extend the scope of rewriting by querying the equivalent relation. The equivalence relation can be derived from the equivalence condition or the algebraic relation because of the transferability of the equivalence condition. For example, there are two functions,
A=date_format(now(), '%Y%m%D') and
B=date_format(now(), '%Y%m%D'). The function returns A=B since the deterministic features of the function. If the INNER JOIN condition B=C existed, then A=B=C can be further obtained.
There may be some other conditions from which the equivalence relation can be derived. For example, the equivalent relation of C = A + B can be derived from A=2, B=3, and C=5. Compared with equivalent relations, finding such relations makes the search process more complicated, and users cannot implement all the possible equal relationships. So, such expressions are rarely considered. In the equivalence relation derivation, only equivalent conditions are searched. Even if some rewriting opportunities may be missed, such a shallow search can ensure its speed.
In order to ensure that the columns output by the query and the columns required by the compensation predicate can be computed from the view, a method to determine whether the expression from the query is equal to the expression in the view is needed, or can be calculated from it. Expression check cannot be implemented purely syntactically. The fact that two expressions or symbols have the same text does not mean that they are related equally. For example, the existence of aliases breaks syntax-based checks.
The expression check must be derived from the correspondence between the equivalence relation and the table. If all tables in a view and query are unique, columns from the same table are equivalent. If a view and query appear in multiple tables, namely SELF JOIN, there are multiple possibilities for mapping a query to a view table, and each scenario requires a rewriting attempt. With the column mappings and above equivalence relation between the view and the query, the algebra system determines whether the expression from the query is equal to the expression in the view or whether it can be computed from it.
Existed heuristic rules that allow one expression to be computed from another expression. For example, x is computed from x + 1, or AVG(x) is computed from SUM(x) and COUNT(x) based on the arithmetic rule. There are also some Function Dependency rules, such as time functions, users can compute a year based on the returned day result.
The materialized view requires the existence of all the required rows of the query in the materialized view, which means that the range of the view's predicates is greater than or equal to the range of the query. Use Wq for the query predicate and Wv for the view predicate.
Wq => Wv needs to be checked, where
=> indicates that Wq satisfies the meaning of Wv.
The optimizer extracts all predicates and converts them to the format of the configuration file (CNF), that is,
W = P1^P2^...^Pn. Predicates are further divided into
W= PE^PR^PU according to equivalent predicates, range predicates and residual predicates. PE, PR, and PU represent equivalent predicates, range predicates, and residual predicates, respectively. Predicate check becomes PEq ^
PRq ^ PUq => PEv ^ PRv ^ PUv. Due to the nature of orthogonality, the problem is finally decomposed into three types of check -
PEq => PEv, PEq ^ PRq => PRv, and
PEq ^ PUq => PUv.
Equivalent predicates of queries are used to derive equivalent relations. Any predicate in the query where the view does not satisfy the equivalence relation forms a compensation predicate, while a view with an equivalence relation that is not satisfied by the query cannot be rewritten.
Range predicates can be stored in the optimizer in the form of range to facilitate the computing of different sets. For each range predicate in the query, if the view has a corresponding range exact match, compensation is not required. Otherwise, the view range must be greater than the range of the query, and a compensation predicate is formed through the difference set.
The query expression and the predicate that attempts to remain together form a remaining predicate PE, which can only be used for an exact match. Any predicate in the query that does not match the view can be formed as a compensation predicate, while the query cannot be rewritten if it does not match the view
All compensation predicates need to be checked by expression to ensure that they can be computed from the output of the view.
For example, query
select l_orderkey, o_custkey, l_partkey,l_quantity*l_extendedprice from lineitem, orders, part where l_orderkey = o_orderkey and l_partkey = p_partkey and l_partkey >= 150 and l_partkey <= 160 and o_custkey = 123 and o_orderdate = l_shipdate and p_name like '%abc%' and l_quantity*l_extendedprice > 100
create materialized view mview2 Enable Query Rewrite as select l_orderkey, o_custkey, l_partkey,l_shipdate, o_orderdate,l_quantity*l_extendedprice as gross_revenue from lineitem, orders, part where l_orderkey = o_orderkey and l_partkey = p_partkey and p_partkey >= 150 and o_custkey >= 50 and o_custkey <= 500 and p_name like '%abc%'
The query can be rewritten as
select l_orderkey, o_custkey, l_partkey, gross_revenue from mview2 where l_partkey <= 160 and o_custkey = 123 and o_orderdate = l_shipdate and gross_revenue > 100
Additional checks are required if the view and query feature a Group By function or Aggregation function:
For example, query
select c_nationkey, sum(l_quantity*l_extendedprice) from lineitem, orders, customer where l_orderkey = o_orderkey and o_custkey = c_custkey group by c_nationkey
create materialized view mview3 enable Query rewrite as select o_custkey, count_big(*) as cnt, sum(l_quantity*l_extendedprice) as revenue from lineitem, orders where l_orderkey = o_orderkey group by o_custkey
The query can be rewritten as
select c_nationkey, sum(revenue) from customer join mview3 on c_custkey = o_custkey group by c_nationkey
If the grouping list is different, only the outermost case of Group By in the query can be rewritten, otherwise, further aggregation is not available and the second rewriting rule cannot be satisfied. Only specific aggregate functions support further aggregation. Common aggregate functions include MIN, MAX, SUM, and COUNT.
For example, the following query
select o_custkey, count(*) as cnt, sum(l_quantity*l_extendedprice) as revenue from lineitem, orders where l_orderkey = o_orderkey group by o_custkey
create materialized view mview4 enable Query rewrite as select o_custkey, l_partkey, count(*) as cnt, sum(l_quantity*l_extendedprice) as revenue from lineitem, orders where l_orderkey = o_orderkey group by o_custkey, l_partkey
The query can be rewritten as
select c_nationkey, sum(cnt) as cnt, sum(revenue) as revenue from mview4 where c_custkey = o_custkey group by o_custkey
If the aggregate function does not exist in the view, it can be computed according to some rules, such as AVG(x) from SUM(x) and COUNT(x), SUM(x + y) from SUM(x) + SUM(y). These dependent expressions rewrite the heuristic rules in checks.
If materialized views and queries do not meet the rewriting rules in the previous section, other rules can be used for conversion.
If only INNER JOIN is considered, there are two cases of inconsistency between the join relationship of views and queries: queries contain more joins than views, or views contain more joins
If a query contains more tables than a view, the missing JOIN can simply be added to the view to make it meet the rewriting rules. The JOIN condition will be added correctly by rewriting the predicate, and the optimizer can also adjust the plan with subsequent rules.
The JOIN compensation allows the optimizer to match and rewrite earlier to reduce the number of rewriting attempts.
For example, the following query
select c_custkey, c_name, l_orderkey,l_partkey, l_quantity From lineitem, customer, orders Where l_orderkey = o_orderkey And o_custkey = c_custkey Where l_orderkey between 1000 and 1500 And l_shipdate = l_commitdate
Create materialized view mview5 Enable query rewrite as Select l_orderkey,l_partkey, l_quantity From lineitem, orders Where l_orderkey = o_orderkey And o_orderkey >= 500
The query can be rewritten as
Select l_orderkey, l_partkey, l_quantity From mview5 join customer on o_custkey = c_custkey Where l_orderkey between 1000 and 1500 And l_shipdate = l_commitdate
If a JOIN appears in the view but does not appear in the query, users can try to apply the JOIN elimination rules. JOIN elimination is a common method in optimizer optimization. If the JOIN meets the following five conditions, table T1 will not change at the time of joining table T2:
This means that the join relationship between table T1 and table T2 does not affect the result of T1, while T2 in the view can be ignored in the Join Graph. The existence of invariant join allows a larger union or superset to be created on the underlying materialized view, thus allowing materialized views to contain greater pre-computing power and to rewrite more queries.
For example, query
Select l_orderkey, l_partkey, l_quantity From lineitem Where l_orderkey between 1000 and 1500 And l_shipdate = l_commitdate
Create materialized view mview6 Enable query rewrite as Select c_custkey, c_name, l_orderkey,l_partkey, l_quantity From lineitem, orders, customer Where l_orderkey = o_orderkey And o_custkey = c_custkey And o_orderkey >= 500
The query can be rewritten as
select l_orderkey, l_partkey, l_quantity from mview6 where l_orderkey between 1000 and 1500 and l_shipdate = l_commitdate
If OUTER JOIN exists, and the join relationship between the view and the query is inconsistent, the JOIN derivation can be applied to compute the JOIN in the query from the JOIN in the materialized view again. For example, the results of INNER JOIN and ANTI JOIN can be computed from the results of the LEFT OUTER JOIN statement in the materialized view. INNER JOIN can further compute SEMI JOIN. By doing so, materialized views can be used to filter certain rows to respond to different queries with different join relationships.
The JOIN derivation extends the rewriting range allowing the optimizer to combine materialized view-based rewriting rules with unnesting rules to rewrite queries with statements such as IN and EXISTS. Moreover, It is also possible to avoid adjustments to the JOIN by other optimization rules, such as EliminateOuterJoin and PredicatePushDown that may optimize OUTER JOIN to INNER JOIN.
A Materialized view contains only a portion of the data required for a query, which can also be used for query rewriting. In many scenarios, materialized views may and cannot store all data.
A typical case is that data is constantly written, but the writing operation takes place only recently. Creating a fully materialized view on a continuously refreshed table may result in high refresh costs due to frequent refreshes of data inserted into the view, or even the view may become completely unavailable due to continuous refreshes. The View may even be completely unavailable due to the continuous refresh. A better approach is to create a refreshed materialized view with T+1 that stores constant data, reducing the cost of refreshes.
Another common case is that the data warehouse stores all the data, while the query focuses on the data of recent months. Since the cost of creating a materialized view of all the data is too high, materialized views with data of recent months can also rewrite most queries.
Union rewriting will try to respond to some specific queries with views, reducing the amount of data computed in real time. In addition, this rewriting can be further applied to Aggregation rewriting that supports responding to aggregated queries with partial data from materialized views.
For example, query
select l_orderkey, l_partkey, l_quantity from lineitem where l_orderkey l_orderkey > 500 and l_orderkey <= 1500 and l_shipdate = l_commitdate
create materialized view mview8 enable query rewrite as select l_orderkey, l_partkey, l_quantity from lineitem where l_orderkey > 1000 and l_shipdate = l_commitdate
Rewriting result shown as
select l_orderkey, l_partkey, l_quantity from lineitem where l_orderkey > 500 and l_orderkey <= 1000 and l_shipdate = l_commitdate union select o.shippriority from mview8 where l_orderkey > 1000 and l_orderkey <= 1500
View rewriting can be implemented in three ways:
Text matching or syntax matching is the simplest way for rewriting. Compare the text of the query with that of the materialized view or the syntax tree, exact matches can be rewritten. However, this rewriting can only match the complete query statement or sub-statement, and subtle differences will cause the query to be unavailable for rewriting, so the scope of application is very limited. Syntax-based rewriting is simple but efficient, and the cost of rewriting is minimal.
The rule-based rewriting is the same as other optimizer rules. Write different rules for queries and views of different patterns to find the equivalent replacement relational tree.
The simplest rule is to directly compare the plans of the subquery and the view. If they are the same, the plan can be rewritten. Advanced rewriting rules do not need the materialized view to be equivalent to the replaced plan. They will try to compute the compensation predicates and create equivalent query expressions. For example, in the Join rewriting, compare whether a subexpression of a JOIN query is the same as or can be computed from that of a view JOIN. Each subexpression in the JOIN statement has a mapping relationship. Finally, check whether the compensation expression can be computed from the view.
Rule-based rewriting can achieve a large amount of rewriting. It is relatively simple to implement, and the rewriting and matching speed is fast, but it also has some limitations. This rewriting relies on conversion rules to find equivalence relationships. Therefore, a complex view rewriting needs to be implemented by exhausting all possible transformation relations, while no complex view can exhaust all the equivalence relations. For example, when many joins or complex project relationships remain, the scope of rule-based rewriting depends on the number of rules.
Contrary to rule-based rewriting, structure-based rewriting matches and rewrites based on a set of rules by extracting query features. The optimizer expresses the query in the standard form of Join-Select-Project-GroupBy (SPJG). It extracts the five expressions of Join, Projects, Filters, Grouping and Aggregations in the query, and matches and rewrites the expression corresponding to the materialized view respectively.
The structure-based rewriting was described systematically by Microsoft in its 2001 SIGMOD paper Optimizing Queries Using Materialized Views: A Practical, Scalable Solution. This method can be rewritten, including any query method such as Join, Filter, and Project, and a series of steps are used to match, obtaining the compensation expressions. Moreover, queries containing Aggregation can be further rewritten to add Aggregation nodes to return additional summarized results when needed.
The structure-based rewriting is easy to expand. For example, OUTER JOIN rewriting and subquery can complete almost all rewriting operations. However, the search cost is high, especially when the query is complex and the number of rewriting attempts is large.
AnalyticDB is a proprietary PB-level cloud-native data warehouse developed by Alibaba that has been validated for ultra-large scale and core business.
The current version of AnalyticDB supports scheduled full refreshes and query rewriting. Furthermore, it fully implements text-based matching and structure-based matching, enabling rewriting of arbitrary queries containing Join, Filter, Project, and Group By, and supporting advanced rewriting rules such as Aggregation Rollup, Union, Subquery, and OUTER JOIN.
In the coming releases, AnalyticDB will introduce rule-based matching to improve the efficiency of rewriting and expand more rewriting approaches in the future, such as rewriting support of Grouping sets, so that materialized views can have the cube capability of traditional data warehouses.
Alibaba Clouder - January 15, 2018
Alibaba EMR - August 28, 2019
AdrianW - August 22, 2019
ApsaraDB - July 7, 2021
Alibaba Clouder - December 21, 2020
ApsaraDB - April 28, 2020
ApsaraDB for HBase is a NoSQL database engine that is highly optimized and 100% compatible with the community edition of HBase.Learn More
Protect, backup, and restore your data assets on the cloud with Alibaba Cloud database services.Learn More
TSDB is a stable, reliable, and cost-effective online high-performance time series database service.Learn More
Alibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.Learn More
More Posts by ApsaraDB