Use runtime filters in ApsaraDB for SelectDB - ApsaraDB for SelectDB

This topic describes how to use the runtime filter feature provided by ApsaraDB for SelectDB to help you optimize join performance.

Overview

Runtime filters are dynamically generated for certain join queries at runtime to reduce the amount of data that is scanned or computed and prevent unnecessary I/O or network transmission. This way, the queries are accelerated. For more information about the design, implementation, and performance of runtime filters, see ISSUE 6116.

Terms

Left table: the table on the left of a join query. This table is used for the probe operation. You can use the join reorder feature to adjust the order.
Right table: the table on the right of a join query. This table is used for the build operation. You can use the join reorder feature to adjust the order.
Fragment: a part of an SQL statement. A frontend (FE) node divides an SQL statement into fragments and delivers the fragments to the backend (BE) nodes in a distributed cluster for execution. The fragments are executed on the BE nodes, and the results are aggregated and returned to the FE node.

How it works

A runtime filter is dynamically generated during query planning. The HashJoinNode converts the right table of a join to a filter condition and pushes the filter condition down to the OlapScanNode. Then, the OlapScanNode prunes the left table based on the filter condition during the table scan. This significantly reduces the data to be read and computed during the query and improves query performance.

For example, a join query is performed between Table T1 and Table T2 in HashJoin mode. T1 is a fact table with 1,000,000 rows of data. T2 is a dimension table with 200 rows of data. The following diagram shows the amount of data that needs to be scanned in a regular hash join.

|          >      HashJoinNode      <
|         |                          |
|         | 1000000                  | 200
|         |                          |
|   OlapScanNode              OlapScanNode
|         ^                          ^   
|         | 1000000                  | 200
|        T1                          T2
|

As shown in the preceding diagram, a large amount of data in Table T1 needs to be scanned and a large amount of hash join computing is performed.

If the system proactively sends the data records scanned in Table T2 to the HashJoinNode, the HashJoinNode can generate a filter condition based on the data in Table T2, such as the maximum or minimum value of the data in Table T2, or create a bloom filter, and then send the filter condition to the OlapScanNode that is waiting to scan Table T1. The OlapScanNode applies this filter condition and sends the filtered data to the HashJoinNode. This way, the number of hash table probe operations and network overhead are reduced. This filter condition is called a runtime filter. The following diagram shows the result.

|          >      HashJoinNode     <
|         |                         |
|         | 6000                    | 200
|         |                         |
|   OlapScanNode              OlapScanNode
|         ^                         ^   
|         | 1000000                 | 200
|        T1                         T2
|

If you can further push down runtime filters to the storage engine, you can use indexes to prune data in some cases. This significantly reduces the amount of data that is actually read and the scan time. The following diagram shows the result.

|          >      HashJoinNode     <
|         |                         |
|         | 6000                    | 200
|         |                         |
|   OlapScanNode              OlapScanNode
|         ^                         ^   
|         | 6000                    | 200
|        T1                        T2
|

Based on the preceding analysis, it can be found that runtime filters are different from predicate pushdown and partition pruning. A runtime filter is a filter condition that is dynamically generated at runtime. The JOIN ON clause is parsed to determine the filter expression during a query, and the expression is broadcast to the OlapScanNode that is reading the left table. This reduces the amount of data that is read and computed during the query and significantly improves query performance.

Runtime filter types

ApsaraDB for SelectDB provides the following types of runtime filters:

IN predicate: The HashSet structure is used to implement IN predicates, which are then pushed down to the OlapScanNode. The advantage of IN predicates is that the filtering is effective and fast. In terms of disadvantages, IN predicates apply only to broadcast joins and become invalid if the size of the data in the right table exceeds the threshold. The threshold is 1,024 in ApsaraDB for SelectDB. If the number of rows in the right table exceeds 1024, an IN predicate becomes invalid.
Bloom filter: A bloom filter is created based on the data in the hash table, and then pushed down to the OlapScanNode. A bloom filter is characterized by its versatility, applies to various data types, and provides good performance. The disadvantages are that the configuration is complex and the computing overhead is high.
MinMax filter: After a range is specified based on the data in the right table, the range can be pushed down to the OlapScanNode as a MinMax filter. The advantage of MinMax filters is that the overhead is relatively low. The disadvantage is that MinMax filters do not work well on non-numeric columns.

Scenarios

Runtime filters are used to optimize joins between large tables and small tables. If the amount of data in the left table is too small or the amount of data in the right table is too large, runtime filters may not work as expected. Runtime filters apply to scenarios that meet the following requirements:

The left table is large, and the right table is small. This is because creating a runtime filter incurs computing costs, including memory overhead.
The results of the join between the left table and the right table are few. This indicates that a runtime filter can filter out most of the data in the left table.

Usage

Use runtime filters in queries

By default, the runtime filter feature is enabled in ApsaraDB for SelectDB. When ApsaraDB for SelectDB processes user queries, ApsaraDB for SelectDB automatically generates IN predicates and bloom filters based on the tables and query statements for query optimization.

Runtime filter parameters

Parameter	Description
runtime_filter_mode	The pushdown policy of the runtime filter. Valid values: OFF, LOCAL, and GLOBAL. Default value: GLOBAL.
runtime_filter_type	The type of the runtime filter. In most cases, you need to adjust only this parameter and use the default values for the other parameters. Valid values: IN, BLOOM_FILTER, MIN_MAX, IN_OR_BLOOM_FILTER, BITMAP_FILTER. Default value: IN_OR_BLOOM_FILTER. In some cases, you can specify BLOOM_FILTER, MIN_MAX, and IN at the same time to improve performance.
runtime_filter_wait_time_ms	The maximum duration that the OlapScanNode for the left table waits for each runtime filter. Unit: ms. Default value: 1000.
runtime_filters_max_num	The maximum number of Bloom filters that can be applied to each query. Default value: 10.
runtime_bloom_filter_min_size	The minimum length of a bloom filter. Default value: 1048576, which is equal to 1 MiB.
runtime_bloom_filter_max_size	The maximum length of a bloom filter. Default value: 16777216, which is equal to 16 MiB.
runtime_bloom_filter_size	The default length of a bloom filter. Default value: 2097152, which is equal to 2 MiB.
runtime_filter_max_in_num	The threshold for determining not to generate an IN predicate. If the number of rows in the right table is greater than this threshold, no IN predicate is generated. Default value: 1024.

runtime_filter_mode

This parameter specifies the transmission range of the runtime filter between the smallest units of query execution.

Valid values: numeric values 0, 1, and 2 or corresponding mnemonic strings OFF, LOCAL, and GLOBAL. Default value: 2 (GLOBAL).

LOCAL: This policy is relatively conservative. A runtime filter that is created can be used only in the same fragment on the same smallest unit of query execution. In this case, the runtime filter producer, which is the HashJoinNode that creates the filter, and the runtime filter consumer, which is the OlapScanNode that uses the filter, are in the same fragment. This policy is generally used for scenarios such as a regular broadcast join.
GLOBAL: This policy is relatively aggressive. This policy allows you to merge runtime filters and transfer them to different fragments on different execution units over the network. For example, the producer and consumer of runtime filters can be in different fragments during a shuffle join.

In addition to scenarios to which the LOCAL policy is applicable, the GLOBAL policy can be used to optimize queries in a wider range of scenarios. However, in some shuffle joins, the overhead of generating and merging runtime filters outweighs the performance boost for queries. In this case, you can change the policy to LOCAL. If the performance of join queries in a cluster cannot benefit from runtime filters, you can set this parameter to OFF to disable the runtime filter feature.

For more information about the reasons and policies for merging runtime filters when you create and apply runtime filters in different fragments, see ISSUE 6116.

runtime_filter_type

The type of the runtime filter.

Valid values: numeric values 1, 2, 4, 8, and 16 or corresponding mnemonic strings IN, BLOOM_FILTER, MIN_MAX, IN_OR_BLOOM_FILTER, and BITMAP_FILTER. Default value: 8 (IN_OR_BLOOM_FILTER). Use commas (,) to separate multiple values and enclose them in quotation marks ("). You can also add up multiple numeric values that represent types. Example:

set runtime_filter_type="BLOOM_FILTER,IN,MIN_MAX";

The preceding setting is equivalent to the following setting:

set runtime_filter_type=7;

The following table describes the runtime filter types.

Value	Description
IN	An IN predicate is created based on all values of the key column specified in the JOIN ON clause in the right table. The IN predicate is used to filter data in the left table. The overhead for creating and applying an IN predicate is lower than that for creating and applying a bloom filter. The performance is improved if the size of data in the right table is small. A merge method has been implemented for IN predicates. If you specify an IN predicate and other filters at the same time and the number of rows in the right table does not exceed the value of the `runtime_filter_max_in_num` parameters, the system attempts to remove other filters. This is because an IN predicate is a precise filter and can effectively filter data without other filters. However, other filters are removed only if the runtime filter producer and consumer are in the same fragment.
BLOOM_FILTER	A bloom filter has a certain false positive rate that results in less filtered data than expected. However, this does not cause inaccurate results. In most cases, a bloom filter can improve performance or have no significant impact on performance. In a small number of cases, a bloom filter results in performance deterioration. The overhead of creating and applying a bloom filter is high. As a result, a bloom filter may result in performance deterioration if the filtering rate is low or the size of data in the left table is small. Only a bloom filter that is applied to the key column in the left table can be pushed down to the storage engine. In addition, test results show that the performance often deteriorates if a bloom filter is not pushed down to the storage engine. The short circuit logic is available for a bloom filter only if filtering by expression is used on the OlapScanNode. If the false positive rate is too high, a bloom filter is not used. Therefore, no short circuit logic is available after a bloom filter is pushed down to the storage engine. As a result, performance may deteriorate if the filtering rate is low.
MIN_MAX	A MinMax filter is used to filter data that is smaller than the minimum value and greater than the maximum value. The filtering effect of a MinMax filter is related to the type of the key column in the JOIN ON clause and the distribution of data in the left and right tables. If the type of the key column in the JOIN ON clause is a numeric type, such as INT, BIGINT, or DOUBLE, and the maximum and minimum values in the left and right tables are the same, a MinMax filter has no effect. If the maximum value in the right table is less than the minimum value in the left table, or the minimum value in the right table is greater than the maximum value in the left table, a MinMax filter provides the best effect. If the type of the key column in the JOIN ON clause is a non-numeric type, such as VARCHAR, a MinMax Filter often results in performance deterioration.
IN_OR_BLOOM_FILTER	The system automatically determines whether to use an IN predicate or a bloom filter based on the actual number of rows in the right table during execution. By default, if the number of rows in the right table is less than 102,400, an IN predicate is used. Otherwise, a bloom filter is used. You can adjust the threshold by modifying the `runtime_filter_max_in_num` parameter in session variables.
BITMAP_FILTER	A bitmap filter is used only if a bitmap column is returned by the IN subquery. Bitmap filters are supported only in vectorized engines.

runtime_filter_wait_time_ms

The waiting duration for runtime filters.

Valid values: integers. Unit: ms. Default value: 1000.

After the runtime filter feature is enabled, the OlapScanNode for the left table waits for a period of time for each runtime filter that is assigned to the OlapScanNode before data scan. For example, if the OlapScanNode is assigned three runtime filters, the OlapScanNode waits up to 3,000 ms.

It takes time to create and merge runtime filters. The OlapScanNode attempts to push the runtime filters that arrive within the waiting duration down to the storage engine. After the duration elapses, the OlapScanNode uses the runtime filters that have arrived to scan data.

If a runtime filter arrives after the OlapScanNode starts scanning, the OlapScanNode does not push the runtime filter down to the storage engine. Instead, the OlapScanNode performs filtering by expression on the data that has been scanned from the storage engine based on the runtime filter. The runtime filter is not applied to data that has been scanned. This way, the size of the intermediate data obtained is larger than the optimal size, but severe cracking can be prevented.

If a cluster is busy and many resource-intensive or time-consuming queries are executed in the cluster, you can prolong the waiting duration to prevent complex queries from missing optimization opportunities. If the cluster load is light and many small queries that take only a few seconds are executed in the cluster, you can reduce the waiting duration to prevent an increase of 1s latency for each query.

runtime_filters_max_num

The maximum number of bloom filters that can be applied to each query.

Valid values: integers. Default value: 10.

Only the number of bloom filters is limited because bloom filters are more costly to create and apply than MinMax filters or IN predicates.

If the number of bloom filters that are generated exceeds the upper limit, bloom filters with greater selectivity are retained. Bloom filters with greater selectivity are expected to filter more rows. This parameter prevents bloom filters from consuming too much memory overhead, which may result in potential issues.

Selectivity = (HashJoinNode cardinality/HashJoinNode left child cardinality)
-- The cardinality obtained by the FE node is inaccurate. The selectivity calculated for Bloom filters deviates from the actual selectivity. As a result, Bloom filters may be randomly retained.

Note

You need to adjust this parameter only when you tune some time-consuming join queries between large tables.

Bloom filter length-related parameters

The runtime_bloom_filter_min_size, runtime_bloom_filter_max_size, and runtime_bloom_filter_size parameters are used to determine the size of the data structure used by bloom filters, in bytes.

Valid values: integers.

The bloom filters created by the HashJoinNode must be of the same length before they can be merged. Therefore, the FE node calculates the length of bloom filters during query planning.

If the cardinality in the statistics of the right table can be obtained, the system attempts to estimate the optimal size of bloom filters based on the cardinality and rounds the size to the nearest power of 2, which is the log value with base 2. If the cardinality of the right table cannot be obtained, the default bloom filter length specified by the runtime_bloom_filter_size parameter is used. The runtime_bloom_filter_min_size and runtime_bloom_filter_max_size parameters are used to limit the minimum and maximum length of the bloom filters that are eventually used.

Bloom filters with a greater size are more efficient at processing high-cardinality input sets, but consume more memory. For example, if you need to filter high-cardinality columns, such as columns that contain millions of different values, you can increase the value of the runtime_bloom_filter_size parameter and perform benchmark testing. This helps make the filtering by bloom filters more accurate and obtain expected improvement in performance.

The effectiveness of bloom filters depends on the data distribution of the query. Therefore, you need to adjust the length of bloom filters only for specific queries. You do not need to globally adjust the length. You need to adjust these parameters only when you tune some time-consuming join queries between large tables.

View the runtime filters generated for queries

You can execute the EXPLAIN statement to view the query plan of a query. The query plan contains information about the JOIN ON clause used by each fragment and the comments about the runtime filters generated and used in fragments. You can then determine whether the runtime filters are applied to the required JOIN ON clauses.

Sample comment about a runtime filter that is generated in a fragment: runtime filters: filter_id[type] <- table.column.
Sample comment about a runtime filter that is used in a fragment: runtime filters: filter_id[type] -> table.column.

In the following example, a runtime filter whose ID is RF000 is used for the query.

CREATE TABLE test (t1 INT) DISTRIBUTED BY HASH (t1) BUCKETS 2;
INSERT INTO test VALUES (1), (2), (3), (4);

CREATE TABLE test2 (t2 INT) DISTRIBUTED BY HASH (t2) BUCKETS 2;
INSERT INTO test2 VALUES (3), (4), (5);

EXPLAIN SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2;
+-------------------------------------------------------------------+
| Explain String                                                    |
+-------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                   |
|  OUTPUT EXPRS:`t1`                                                |
|                                                                   |
|   4:EXCHANGE                                                      |
|                                                                   |
| PLAN FRAGMENT 1                                                   |
|  OUTPUT EXPRS:                                                    |
|   PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test`.`t1`  |
|                                                                   |
|   2:HASH JOIN                                                     |
|   |  join op: INNER JOIN (BUCKET_SHUFFLE)                         |
|   |  equal join conjunct: `test`.`t1` = `test2`.`t2`              |
|   |  runtime filters: RF000[in] <- `test2`.`t2`                   |
|   |                                                               |
|   |----3:EXCHANGE                                                 |
|   |                                                               |
|   0:OlapScanNode                                                  |
|      TABLE: test                                                  |
|      runtime filters: RF000[in] -> `test`.`t1`                    |
|                                                                   |
| PLAN FRAGMENT 2                                                   |
|  OUTPUT EXPRS:                                                    |
|   PARTITION: HASH_PARTITIONED: `default_cluster:ssb`.`test2`.`t2` |
|                                                                   |
|   1:OlapScanNode                                                  |
|      TABLE: test2                                                 |
+-------------------------------------------------------------------+
-- The lines that contain "runtime filters" show that an IN predicate whose ID is RF000 is generated by 2:HASH JOIN of PLAN FRAGMENT 1. 
-- The values of `test2`.`t2` are known only at runtime. 
-- The IN predicate is used by 0:OlapScanNode to filter out unnecessary data when the node reads `test`.`t1`. 

SELECT t1 FROM test JOIN test2 where test.t1 = test2.t2; 
-- Returns two rows of results: [3, 4].

-- You can view how the query is internally performed by using the profile of the query. You must run the set enable_profile=true; command to enable the profile. 
-- The profile includes whether each runtime filter is pushed down, the waiting time, and the total duration from the time when the OLAP_SCAN_NODE is prepared to the time when the runtime filter is received. 
RuntimeFilter:in:
    -  HasPushDownToEngine:  true
    -  AWaitTimeCost:  0ns
    -  EffectTimeCost:  2.76ms

-- In addition, you can view the filtering effect and time consumed after the runtime filter is pushed down in the OLAP_SCAN_NODE section of the profile. 
    -  RowsVectorPredFiltered:  9.320008M  (9320008)
    -  VectorPredEvalTime:  364.39ms

Planning rules for runtime filters

Runtime filters can be generated only for the equivalent conditions in JOIN ON clauses. NULL-safe conditions are not included because null values in the left table may be filtered out.
Runtime filters cannot be pushed down to the left tables of left outer joins, full outer joins, or anti-joins.
The source expression or target expression cannot be a constant.
The source expression cannot be equivalent to the target expression.
The type of the source expression cannot be equivalent to HLL or BITMAP.
Runtime filters can be pushed down only to the OlapScanNode.
The target expression cannot contain NULL-checking expressions, such as COALESCE, IFNULL, or CASE. If the JOIN ON clause of another upper-layer join contains NULL-checking expressions and a runtime filter is generated, the result may be incorrect after the runtime filter for this outer join is pushed down to the left table of the outer join.
An equivalent column must exist in the original table for the column in the target expression.
Column conduction is not supported in the following scenarios:
- If the JOIN ON clause contains A.k = B.k and B.k = C.k, C.k can be pushed down only to B.k, rather than A.k.
- If the JOIN ON clause contains A.a + B.b = C.c and A.a and B.a are equivalent columns so that A.a can be conducted to B.a, A.a can be replaced with B.a, and the system attempts to push the runtime filter to B. If A.a and B.a are not equivalent columns, the runtime filter cannot be pushed down to B because the target expression must be bound to only one left table.
The types of the source expression and target expression must be the same because bloom filters are hash-based filters. If the types are different, the system attempts to convert the type of the target expression into that of the source expression.
Runtime filters generated by PlanNode.Conjuncts cannot be pushed down. Unlike eqJoinConjuncts and otherJoinConjuncts of the HashJoinNode, PlanNode.Conjuncts generates runtime filters that may lead to incorrect results based on tests. For example, if an IN subquery is converted into a join, the automatically generated JOIN ON clause is saved in PlanNode.Conjuncts. In this case, a runtime filter may result in missing rows in the results.