
MaxCompute:Bloom filter index (Beta)

Last Updated: Feb 17, 2025

A Bloom filter is an efficient probabilistic data structure. MaxCompute allows you to use Bloom filter indexes to perform large-scale point queries. This reduces unnecessary data scanning during queries and improves the overall query efficiency and performance. This topic describes how to use Bloom filter indexes and provides examples.

Background information

Large-scale point queries are common in data warehouse scenarios: users specify filter conditions on arbitrary columns to retrieve the data that matches the conditions. In big data scenarios, the matching rows may be scattered across a large number of files, so high-performance large-scale point queries require powerful retrieval capabilities.

MaxCompute reads data based on tables or partitions. Predicate pushdown at the storage layer filters data based on the local metadata of files in tables or partitions. You can execute the ANALYZE statement to obtain the local metadata. If data is sourced from multiple files, data filtering based on the local metadata, such as data filtering based on the minimum or maximum value in a specific column, has no obvious effect.

If the columns to be queried are fixed, you can use a clustered table for filtering. You can use the clustering key as a filter condition to quickly filter out the buckets and data in buckets that do not need to be read to speed up queries. If you perform operations such as aggregation based on condition fields or perform association with other tables, you can use the shuffle removal capability provided by clustering to further leverage the advantages of clustered tables. However, clustered tables still have the following drawbacks:

  • For hash-clustered tables, data can be filtered only if the query condition contains all the clustering keys. For range-clustered tables, filtering is effective only if the query condition contains a left-to-right prefix of the clustering keys; otherwise, the filtering effect is limited.

  • If the query condition does not contain a clustering key, data cannot be filtered. Therefore, clustered tables are not useful for queries without fixed conditions.

  • When you write data, you must shuffle data based on specific fields. This increases costs. If you encounter skewed keys, long tails may occur in tasks.

To overcome the drawbacks, MaxCompute introduces Bloom filter indexes for scenarios in which large-scale point queries are performed.

Prerequisites

  • A MaxCompute project is created. For more information, see Create a MaxCompute project.

  • Schema evolution is enabled for the project. If it is not, run the setproject odps.schema.evolution.enable=true; command at the project level before you proceed. Otherwise, subsequent operations fail with an error similar to Failed to run ddltask - Schema evolution DDLs is not enabled in project:default.

Feature description

A point query essentially checks whether an element exists in a set, which is exactly the question a Bloom filter answers efficiently. Therefore, Bloom filter indexes are used in both database and data lake technologies to support finer-grained data or file pruning.

A database index, such as a B-tree or R-tree index, is used to locate a specific row of data. In big data scenarios, more lightweight indexes are required to reduce index creation and maintenance overhead. Space-efficient and query-efficient Bloom filter indexes are well suited to file pruning in point query scenarios, which is why MaxCompute introduces them.
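As a rough illustration of the underlying data structure, the following is a minimal Bloom filter sketch (illustrative only, not MaxCompute's implementation). Each element is hashed to several bit positions; membership tests can return false positives but never false negatives, which is what makes the structure safe for pruning:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k seeded hashes over a fixed-size bit array."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit array, initially all zeros

    def _positions(self, item: str):
        # Derive k bit positions from independently seeded hashes.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=1 << 16, num_hashes=5)
for key in ("392", "100", "7"):
    bf.add(key)

print(bf.might_contain("392"))   # True: the key was inserted
print(bf.might_contain("9999"))  # almost certainly False, so a scan can be skipped
```

Because an absent answer is always exact, a query engine can safely skip any file whose filter reports "definitely absent".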

Compared with clustered tables, Bloom filter indexes have the following advantages:

  • High efficiency: Bloom filter indexes can filter out unnecessary data at minimal cost.

  • High extensibility: You can create a Bloom filter index on one or more columns of a table. You can also create a Bloom filter index for non-clustering keys and use the Bloom filter index with a clustered index.

  • Obvious filtering effect: Bloom filter indexes deliver an obvious data filtering effect in high-cardinality, data-intensive scenarios.

Advantages and disadvantages

  • Advantages

    • High efficiency: The resource consumption of insert and query operations is lower than that of ordinary indexes.

    • Space saving: Bit arrays are used. For example, a bit array with 2^32 = 4,294,967,296 bits occupies only 512 MB of memory.

      Note

      The memory occupied by the bit array is calculated as follows: 4,294,967,296 / 8 / 1024 / 1024 = 512 MB.

  • Disadvantages

    A Bloom filter has a nonzero false positive probability (FPP): an element that is not in the set may be incorrectly reported as present. In most scenarios, however, the resource consumption of reading the resulting extra files does not affect overall efficiency, and false positives do not affect the accuracy of the final business results.
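The 512 MB figure and the space/accuracy trade-off both follow from standard Bloom filter arithmetic. A quick check using the textbook sizing formulas (generic formulas, not MaxCompute internals):

```python
import math

# A 2^32-bit array occupies 2^32 / 8 bytes = 512 MB.
bits = 2 ** 32
mb = bits / 8 / 1024 / 1024
print(mb)  # 512.0

# Standard sizing: for n items and target false positive probability p,
# the optimal bit count m and number of hash functions k are
#   m = -n * ln(p) / (ln 2)^2,   k = (m / n) * ln 2
n, p = 1_000_000, 0.03
m = -n * math.log(p) / math.log(2) ** 2
k = round(m / n * math.log(2))
print(round(m / 8 / 1024 / 1024, 2), "MB,", k, "hash functions")  # about 0.87 MB, 5
```

One million items at a 3% FPP thus need well under 1 MB of filter space, which is why per-file filters stay cheap to store.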

Scenarios

  • If you use one or more columns in a table as a condition to perform point query filtering, you can create a Bloom filter index on the query columns that have an obvious filtering effect.

  • If you perform point query filtering on fields in clustered tables except for clustering keys, you can create a Bloom filter index on the query columns.

  • After data in a clustered table is sorted by using the clustering key, sorted key, or Zorder function, you can create a Bloom filter index on the inserted columns to further filter data.

Limits

  • Bloom filter indexes support only equality operators, such as = and IN. They are not supported for range queries that involve operators such as >, >=, <, and <=, or for queries that involve IS NULL and IS NOT NULL.

  • The filtering effect depends on the data distribution. If the data distribution is discrete, the data filtering effect is not obvious even if you use Bloom filter indexes.

  • Bloom filter indexes are not supported on DECIMAL, INTERVAL_DAY_TIME, INTERVAL_YEAR_MONTH, or complex types such as STRUCT, MAP, ARRAY, and JSON.

Billing

  • After Bloom filter indexes are created, the indexes occupy storage space. The index storage is billed for the storage usage on Apsara Distributed File System (Pangu). Therefore, the index storage is billed as standard storage.

  • The creation and use of indexes trigger additional computing workloads. For a query, the system calculates the information about the file where the data to be queried is stored based on the index file. Then, the system further filters input data based on the index calculation result in the query execution planning stage to reduce the amount of input data, and obtains the final calculation result. Both index calculation and queries incur computing fees.

  • Bloom filter indexes can speed up jobs that consume subscription resources. The resources used to create indexes and run index calculations are also subscription resources, so the feature does not increase the costs of subscription jobs.

  • SQL operations trigger index-related tasks. The following table describes the index-related tasks that are triggered by SQL operations and how they are billed.

    | SQL operation | Triggered task | Input data of the index-related task | Billing |
    | --- | --- | --- | --- |
    | CREATE | DDL operations do not trigger index creation tasks. | N/A | No fees are incurred. |
    | REBUILD | An index is created or rebuilt. If an index already exists, the REBUILD operation rebuilds it; otherwise, the operation creates it. | When an index is rebuilt, the index columns and the other columns are read and written at the same time. The input data amount is therefore the amount of full data after filtering based on the query condition of the rebuilt index. | Index task: Unit price of a pay-as-you-go SQL job × Complexity (1) × Amount of input data of the index-related task. SQL query: the pay-as-you-go billing rule for a common SQL job applies. |
    | INSERT | An index is created for the inserted data, and the SELECT statement is executed for the query. | The data amount of the index columns in the table that is queried by the SELECT statement. | Same as REBUILD. |
    | SELECT | The query task that uses index calculation generates the information used to filter data in the subsequent SELECT statement, and the SELECT statement is executed based on the created index. | The data amount of the index file. | Same as REBUILD. |
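As a hypothetical worked example of the index billing formula (Unit price of a pay-as-you-go SQL job × Complexity × Amount of input data), where the unit price and the data volume are made-up illustration values; consult the official MaxCompute pricing documentation for real rates:

```python
# Hypothetical walk-through of the index billing formula. The unit price
# (0.3) and the 10 GB input volume are invented for illustration only.
unit_price_per_gb = 0.3   # hypothetical currency units per GB of input data
complexity = 1            # the formula above uses a complexity factor of 1
input_gb = 10             # input data read by the index-related task

cost = unit_price_per_gb * complexity * input_gb
print(cost)  # 3.0
```

The key point is that the index task is metered on the input data it actually reads, so selective rebuilds cost proportionally less than full-table ones.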

Instructions

This section describes how to generate, use, query, and drop a Bloom filter index and how to modify the property of a Bloom filter index.

Generate a Bloom filter index

Create a Bloom filter index

Syntax:

CREATE BLOOMFILTER INDEX <index_name>
ON TABLE <table_name>
FOR COLUMNS(<column_name>)
IDXPROPERTIES('numitems'='xxx', 'fpp' = 'xx')
[COMMENT 'idxcomment']
;

Parameters:

  • index_name: the name of the index.

  • table_name: the name of the table on which the index is created.

  • column_name: the name of the column on which you want to create the index.

  • numitems: the estimated number of elements stored in the Bloom filter. This value reflects the capacity of the filter and ensures that sufficient memory is allocated for the expected number of elements when the filter is created. It determines the total number of bits in the Bloom filter and is critical for filter quality.

    • An excessively large value leaves the bit array sparsely populated, which wastes disk space and degrades query performance.

    • An excessively small value leaves the bit array overly populated, which increases the FPP of the Bloom filter.

    The value must be greater than 0 and cannot exceed 10,000,000. You can estimate it from the number of distinct values in the index column.

  • fpp: the false positive probability. Valid values: (0,1]. A smaller value yields a more accurate Bloom filter but higher storage usage. We recommend that you set this parameter to 0.1.

Note

You can create a Bloom filter index on only one column of a table at a time. To index multiple columns, create a separate Bloom filter index for each column.
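To see why numitems matters, the standard Bloom filter FPP formula (generic math, not necessarily MaxCompute's exact implementation) shows how the false positive rate degrades once the filter holds more elements than it was sized for:

```python
import math

# Size a filter for numitems=1_000_000 at the recommended fpp=0.1.
numitems, fpp = 1_000_000, 0.1
m = -numitems * math.log(fpp) / math.log(2) ** 2   # bits allocated
k = round(m / numitems * math.log(2))              # number of hash functions

def actual_fpp(n_inserted: float) -> float:
    # Expected false positive probability after inserting n items:
    #   p = (1 - e^(-k*n/m)) ** k
    return (1 - math.exp(-k * n_inserted / m)) ** k

print(round(actual_fpp(1_000_000), 3))  # 0.101: close to the 0.1 target
print(round(actual_fpp(3_000_000), 3))  # 0.608: an overfilled filter filters poorly
```

Setting numitems near the number of distinct values in the index column keeps the realized FPP close to the configured fpp.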

Merge Bloom filter indexes

You do not need to perform any additional operations on incremental data. You need only to execute the data insertion statement to merge Bloom filter indexes.

Syntax:

INSERT OVERWRITE TABLE <table_name> [PARTITION <partition_spec>]
SELECT ...;

Parameters:

  • table_name: the name of the table to which the index belongs.

  • partition_spec: the partitions into which you want to insert data. Only constants are supported; expressions, such as functions, are not. Format: (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...).

If you insert data into a table that has a Bloom filter index, the system incrementally generates local Bloom filter index data for each data file and supports predicate pushdown at the storage layer. Then, the system starts BloomfilterAutoMergeTask to automatically merge the local indexes into a new Bloom filter index file. This allows the system to filter data in the planning stage and allocate only the resources that the task actually requires. If the following keywords appear on the Json Summary tab in LogView, the Bloom filter indexes are successfully merged.

You can view the time taken to merge Bloom filter indexes on the SubStatusHistory tab of LogView.

Note
  • The data write task is completed as expected regardless of whether the merging operation succeeds.

  • If Bloom filter indexes fail to be merged due to system reasons, the new data cannot be filtered during the planning stage, but the local Bloom filter index data of the new files still supports predicate pushdown at the storage layer. You can check the execution plan to determine whether the Bloom filter index is used; if it is not, rebuild the index on the affected partitions.

  • Dynamic partitions do not support automatic merging. You need to manually rebuild the index on the updated partitions after data is written.

  • If the status of the automatic merging task shown in the preceding figure is Failed, you must explicitly execute the following merging statement. You can rebuild the index for only one partition at a time.

    ALTER TABLE <table_name> [PARTITION <partition_spec>] REBUILD BLOOMFILTER INDEX;
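Conceptually, Bloom filters built with identical size and hash functions can be merged by a bitwise OR of their bit arrays, which is what makes an automatic merge task feasible (a toy sketch of the union property, not MaxCompute's actual merge logic):

```python
# Two Bloom filters built with the same size and hash functions merge by
# bitwise OR: a bit set in either input is set in the union, so the merged
# filter answers "maybe" for every element inserted into either one.
old_bits = bytearray(b"\x01\x08\x00\x40")  # toy 32-bit filter for existing data
new_bits = bytearray(b"\x00\x08\x02\x00")  # toy filter for newly inserted rows

merged = bytearray(a | b for a, b in zip(old_bits, new_bits))
print(merged.hex())  # union of the two bit arrays
```

Because the merge is a pure bit-level union, it never loses membership information from either input filter.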

Use a Bloom filter index

Before you query data, run the following command to enable the Bloom filter index feature:

SET odps.sql.enable.bloom.filter.index=true;
Note
  • After the Bloom filter index feature is enabled, an additional job is added during queries to perform file pruning, which determines the set of files that contain the data the task needs to read.

  • If you do not set the odps.sql.enable.bloom.filter.index parameter, file pruning cannot be performed in the planning stage. However, the system can still use predicate pushdown at the storage layer to filter data based on the local Bloom filter index data. During the beta phase, the default value of the odps.sql.enable.bloom.filter.index parameter is false. After the beta phase, the default value may be changed to true based on actual online usage.

  • In some cases, data filtering based on a Bloom filter index does not have an obvious effect. For example, if the result data is scattered across many files, file pruning cannot effectively reduce the number of files to read. In this case, setting this parameter to true still executes the additional index lookup task, which may degrade overall task performance. In that situation, set the parameter to false.
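The planning-stage pruning described above can be pictured as follows: each data file carries a small Bloom filter over the indexed column, and a point query scans only the files whose filter answers "maybe" (a self-contained sketch with hypothetical file names, not MaxCompute's planner):

```python
import hashlib

NUM_BITS, NUM_HASHES = 1 << 12, 4

def positions(item: str):
    # Derive k bit positions from independently seeded hashes.
    for seed in range(NUM_HASHES):
        digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
        yield int(digest, 16) % NUM_BITS

class FileWithIndex:
    """A data file plus its local Bloom filter over one column (illustrative)."""
    def __init__(self, name, keys):
        self.name, self.keys = name, keys
        self.bits = bytearray(NUM_BITS // 8)
        for key in keys:
            for pos in positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in positions(key))

# Ten files, each holding 100 consecutive keys (hypothetical layout).
files = [FileWithIndex(f"part-{i}", [str(k) for k in range(i * 100, i * 100 + 100)])
         for i in range(10)]

# Point query key = "392": only files whose filter says "maybe" are scanned.
to_scan = [f.name for f in files if f.might_contain("392")]
print(to_scan)  # contains "part-3"; nearly all other files are pruned
```

When the matching keys are instead scattered across every file, every filter answers "maybe" and the lookup adds cost without pruning anything, which is exactly the situation the note above warns about.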

The following code shows an example of how to make a Bloom filter index take effect.

SET odps.sql.enable.bloom.filter.index=true;
SELECT * FROM bloomfilter_index_test WHERE key=392 AND value="val_392";

In the LogView of this query, job_1 is the index job that uses the Bloom filter index for file pruning. On the Summary tab, the table whose name is suffixed with bf is a virtual table that corresponds to the Bloom filter index file.

Note

If a large amount of data exists, the size of the generated Bloom filter index file may be excessively large. In this case, MaxCompute starts a distributed job for file pruning.

Query the Bloom filter index of a table

SHOW INDEXES ON <table_name>;

Parameters:

table_name: the name of the table to which the index belongs.

Drop a Bloom filter index

DROP INDEX [IF EXISTS] <idx_name> ON TABLE <table_name>;

Parameters:

  • idx_name: the name of the index.

  • table_name: the name of the table to which the index belongs.

Modify the property of a Bloom filter index

ALTER INDEX <idx_name> ON <table_name> 
SET IDXPROPERTIES(['comment' = 'a'], ['fpp' = '0.01']);

Parameters:

  • idx_name: the name of the index.

  • table_name: the name of the table to which the index belongs.

Examples

Example 1: Create a Bloom filter index based on a common partitioned table

  1. Prepare data.

    SET odps.namespace.schema=true;
    SELECT * FROM bigdata_public_dataset.TPCDS_10G.call_center;
  2. Create a partitioned table named call_center_test.

    CREATE TABLE IF NOT EXISTS call_center_test(
        cc_call_center_sk BIGINT NOT NULL, cc_call_center_id CHAR(16) NOT NULL, 
        cc_rec_start_date DATE, cc_rec_end_date DATE, 
        cc_closed_date_sk BIGINT, cc_open_date_sk BIGINT, 
        cc_name VARCHAR(50), cc_class VARCHAR(50), 
        cc_employees BIGINT, cc_sq_ft BIGINT, 
        cc_hours CHAR(20), cc_manager VARCHAR(40), 
        cc_mkt_id BIGINT, cc_mkt_class CHAR(50), 
        cc_mkt_desc VARCHAR(100), cc_market_manager VARCHAR(40), 
        cc_division BIGINT, cc_division_name VARCHAR(50), 
        cc_company BIGINT, cc_company_name CHAR(50), 
        cc_street_number CHAR(10), cc_street_name VARCHAR(60), 
        cc_street_type CHAR(15), cc_suite_number CHAR(10), 
        cc_city VARCHAR(60), cc_county VARCHAR(30), 
        cc_state CHAR(2), cc_zip CHAR(10), 
        cc_country VARCHAR(20), cc_gmt_offset DECIMAL(5,2), 
        cc_tax_percentage DECIMAL(5,2)
    )
    PARTITIONED BY (ds STRING )
    ;
  3. Create an index.

    CREATE BLOOMFILTER INDEX call_center_test_idx01 
    ON TABLE call_center_test 
    FOR COLUMNS(cc_call_center_sk) 
    IDXPROPERTIES('fpp' = '0.03', 'numitems'='1000000') 
    COMMENT 'cc_call_center_sk index';
  4. Import data.

    Import data in the bigdata_public_dataset.TPCDS_10G.call_center table of the public dataset to the call_center_test table.

    SET odps.namespace.schema=true;
    INSERT OVERWRITE TABLE call_center_test PARTITION (ds='20241115') SELECT * FROM bigdata_public_dataset.TPCDS_10G.call_center LIMIT 10000;

    If keywords that indicate a successful merge appear on the Json Summary tab in LogView, the Bloom filter indexes are successfully merged.

  5. Query data.

    SET odps.sql.enable.bloom.filter.index=true;
    SELECT * FROM call_center_test WHERE cc_call_center_sk =10 AND ds='20241115';

    The following result is returned:

    +-------------------+-------------------+-------------------+-----------------+-------------------+-----------------+---------------+------------+--------------+------------+------------+----------------+------------+----------------------------+---------------------------------------------------------------------------------------+-------------------+-------------+------------------+------------+-----------------+------------------+----------------+----------------+-----------------+------------+---------------+------------+------------+---------------+---------------+-------------------+------------+
    | cc_call_center_sk | cc_call_center_id | cc_rec_start_date | cc_rec_end_date | cc_closed_date_sk | cc_open_date_sk | cc_name       | cc_class   | cc_employees | cc_sq_ft   | cc_hours   | cc_manager     | cc_mkt_id  | cc_mkt_class               | cc_mkt_desc                                                                           | cc_market_manager | cc_division | cc_division_name | cc_company | cc_company_name | cc_street_number | cc_street_name | cc_street_type | cc_suite_number | cc_city    | cc_county     | cc_state   | cc_zip     | cc_country    | cc_gmt_offset | cc_tax_percentage | ds         |
    +-------------------+-------------------+-------------------+-----------------+-------------------+-----------------+---------------+------------+--------------+------------+------------+----------------+------------+----------------------------+---------------------------------------------------------------------------------------+-------------------+-------------+------------------+------------+-----------------+------------------+----------------+----------------+-----------------+------------+---------------+------------+------------+---------------+---------------+-------------------+------------+
    | 10                | AAAAAAAAKAAAAAAA  | 1998-01-01        | 2000-01-01      | NULL              | 2451050         | Hawaii/Alaska | large      | 187          | 95744      | 8AM-8AM    | Gregory Altman | 2          | Just back responses ought  | As existing eyebrows miss as the matters. Realistic stories may not face almost by a  | James Mcdonald    | 3           | pri              | 3          | pri             | 457              | 1st            | Boulevard      | Suite B         | Midway     | Walker County | AL         | 31904      | United States | -6            | 0.02              | 20241115   |
    +-------------------+-------------------+-------------------+-----------------+-------------------+-----------------+---------------+------------+--------------+------------+------------+----------------+------------+----------------------------+---------------------------------------------------------------------------------------+-------------------+-------------+------------------+------------+-----------------+------------------+----------------+----------------+-----------------+------------+---------------+------------+------------+---------------+---------------+-------------------+------------+

    The LogView output shows that the Bloom filter index has taken effect. The table whose name is suffixed with bf is a virtual table that corresponds to the Bloom filter index file.

  6. Query the Bloom filter index of the table.

    SHOW INDEXES ON call_center_test;

    The following result is returned:

    ID = 20241115093930589g9biyii****
    {"Indexes": [{
                "id": "aabdaeb10a7b4e99a94716dabad8****",
                "indexColumns": [{"name": "cc_call_center_sk"}],
                "name": "call_center_test_idx01",
                "properties": {
                    "comment": "cc_call_center_sk index",
                    "fpp": "0.03",
                    "numitems": "1000000"},
                "type": "BLOOMFILTER"}]}
    
    OK
  7. Modify the property of a Bloom filter index.

    Based on the information returned in the previous step, the original property value of numitems is 1000000. Execute the following statements to change its property value to 10000 and view the change result.

    -- Change the property value.
    ALTER INDEX call_center_test_idx01 ON call_center_test SET IDXPROPERTIES('fpp' = '0.03', 'numitems'='10000');
    
    -- View the property.
    SHOW INDEXES ON call_center_test;

    The returned result shows the updated property values.

Example 2: Create a Bloom filter index based on a hash-clustered partitioned table

  1. Prepare data.

    1. Create a table named scope_tmp.

      CREATE TABLE IF NOT EXISTS scope_tmp(
          phone STRING, 
          card STRING, 
          machine STRING, 
          geohash STRING);
    2. Run the Tunnel command on the MaxCompute client (odpscmd) to upload data to the scope_tmp table. The following sample command is used to upload the scope2.csv file in the bin directory on the MaxCompute client.

      tunnel upload scope2.csv scope_tmp;
  2. Create a partitioned table named scope_hash_pt.

    CREATE TABLE scope_hash_pt (
        phone STRING, 
        card STRING, 
        machine STRING,
        geohash STRING 
    )
    PARTITIONED BY (ds STRING)
    CLUSTERED BY (phone) 
    SORTED BY (card) 
    INTO 512 BUCKETS;
  3. Create an index for the scope_hash_pt partitioned table.

    CREATE BLOOMFILTER INDEX scope_hash_pt_index01 
    ON TABLE scope_hash_pt 
    FOR COLUMNS(card) 
    IDXPROPERTIES('fpp' = '0.03', 'numitems'='1000000') 
    COMMENT 'card index';
  4. Import data.

    INSERT OVERWRITE TABLE scope_hash_pt PARTITION (ds='20241115') SELECT * FROM scope_tmp;

    If keywords that indicate a successful merge appear on the Json Summary tab in LogView, the Bloom filter indexes are successfully merged.

  5. Query the data of the scope_hash_pt table.

    SET odps.sql.enable.bloom.filter.index=true;
    SELECT * FROM scope_hash_pt WHERE card='073415764266290' and ds='20241115';

    The following result is returned:

    +-------------+-----------------+----------------+------------+------------+
    | phone       | card            | machine        | geohash    | ds         |
    +-------------+-----------------+----------------+------------+------------+
    | 1576426**** | 073415764266290 | 51133960245770 | fWbDDsf    | 20241115   |
    +-------------+-----------------+----------------+------------+------------+

    The LogView output shows that the Bloom filter index has taken effect. The table whose name is suffixed with bf is a virtual table that corresponds to the Bloom filter index file.

Example 3: Create a Bloom filter index based on Zorder data sorting of a partitioned table

  1. Prepare data. For more information, see Step 1 in Example 2: Create a Bloom filter index based on a hash-clustered partitioned table.

  2. Create a partitioned table named scope_zorder_pt.

    CREATE TABLE scope_zorder_pt(
        phone STRING, 
        card STRING, 
        machine STRING, 
        geohash STRING, 
        zvalue BIGINT
    )
    PARTITIONED BY (ds STRING)
    ;
  3. Create an index.

    CREATE BLOOMFILTER INDEX scope_zorder_pt_index01 
    ON TABLE scope_zorder_pt 
    FOR COLUMNS (card) 
    IDXPROPERTIES('fpp' = '0.05', 'numitems'='1000000')
    COMMENT 'idxcomment';
  4. Import data.

    1. Download the odps-zorder-1.0-SNAPSHOT.jar and odps-zorder-1.0-SNAPSHOT-jar-with-dependencies.jar packages and save them to D:\ on your computer.

    2. Run the following commands to upload the two JAR packages as resources:

      -- Add resources.
      ADD JAR D:\odps-zorder-1.0-SNAPSHOT.jar;
      ADD JAR D:\odps-zorder-1.0-SNAPSHOT-jar-with-dependencies.jar;
    3. Execute the following statement to create a function:

      CREATE FUNCTION zorder AS 'com.aliyun.odps.zorder.evaluateZValue2WithSize' USING 'odps-zorder-1.0-SNAPSHOT-jar-with-dependencies.jar';
    4. Use the function to write data to the scope_zorder_pt partitioned table.

      -- Set the odps.sql.validate.orderby.limit parameter to false.
      SET odps.sql.validate.orderby.limit=false;
      
      -- Write data.
      INSERT OVERWRITE TABLE scope_zorder_pt PARTITION (ds='20241115') SELECT *,zorder(HASH(phone), 100000000, HASH(card), 100000000) AS zvalue FROM scope_tmp ORDER BY zvalue;

      If keywords that indicate a successful merge appear on the Json Summary tab in LogView, the Bloom filter indexes are successfully merged.

  5. Query data.

    SET odps.sql.enable.bloom.filter.index=true;  
    SELECT * FROM scope_zorder_pt WHERE card='073415764266290' AND ds='20241115';

    The following result is returned:

    +-------------+-----------------+----------------+------------+---------------------+------------+
    | phone       | card            | machine        | geohash    | zvalue              | ds         |
    +-------------+-----------------+----------------+------------+---------------------+------------+
    | 1576426**** | 073415764266290 | 51133960245770 | fWbDDsf    | 3590549286038929408 | 20241115   |
    +-------------+-----------------+----------------+------------+---------------------+------------+

    The LogView output shows that the Bloom filter index has taken effect. The table whose name is suffixed with bf is a virtual table that corresponds to the Bloom filter index file.
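For reference, Z-order sorting works by interleaving the bits of several column values so that rows close in any of the columns end up close in the sort order. The following sketch shows the bit-interleaving idea; the zorder UDF used above (evaluateZValue2WithSize) is assumed to compute something conceptually similar, possibly with different hashing and scaling:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` of x and y into a single Z-order value.

    Illustrative sketch of Z-order encoding only; it is not the exact
    logic of the zorder UDF used in the example above.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bit i of x goes to even position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y goes to odd position 2i+1
    return z

# Rows with similar values in either column get nearby z-values, so sorting
# by z-value clusters similar rows into the same files. That locality is
# what makes the per-file Bloom filters selective after the sort.
print(interleave_bits(0b0011, 0b0101))  # 39
```

Sorting the table by such a z-value (as in the ORDER BY zvalue step above) groups related rows, so a point query on either column touches few files.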