MaxCompute:Bloom filter index (Beta)

Last Updated:Mar 26, 2026

When a large-scale point query runs against a table whose matching rows are scattered across many files, MaxCompute must scan all of those files to find the results. Bloom filter indexes let MaxCompute skip files that provably do not contain the target values at query planning time, reducing I/O and improving query performance without requiring data redistribution.

Prerequisites

Before you begin, ensure that you have:

  • A MaxCompute project. See Create a MaxCompute project.

  • Schema evolution enabled for the project. To enable it, run the following command at the project level:

    setproject odps.schema.evolution.enable=true;

    If schema evolution is not enabled, subsequent DDL operations fail with an error similar to:

    Failed to run ddltask - Schema evolution DDLs is not enabled in project:default

How it works

A point query checks whether specific values exist in a dataset. In big data scenarios, matching rows may be spread across many files, making it expensive to scan all of them.

A Bloom filter index is a compact probabilistic data structure stored alongside each data file. When you run a point query, MaxCompute consults the index at query planning time to identify which files definitely do not contain the target values and skips them — this is called file pruning. For files that pass the filter, standard predicate pushdown at the storage layer applies as a second pass.

This two-stage approach works at the file level, not the row level, so it is lightweight to build and maintain.

Compared with min/max statistics, Bloom filter indexes provide meaningful pruning even for high-cardinality columns where min and max values span the entire file range and offer no filtering benefit.
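The pruning idea can be sketched in a few lines of Python. This is an illustrative model only, not MaxCompute's implementation; the filter size `M`, hash count `K`, and the per-file layout are arbitrary choices for the demo:

```python
from hashlib import blake2b

M, K = 1024, 3  # bits per filter and number of hash functions (illustrative values)

def positions(value, m=M, k=K):
    # Derive k bit positions from k salted hashes of the value.
    return [int.from_bytes(blake2b(f"{i}:{value}".encode(), digest_size=8).digest(),
                           "big") % m
            for i in range(k)]

def build_filter(values):
    bits = [0] * M
    for v in values:
        for p in positions(v):
            bits[p] = 1
    return bits

def might_contain(bits, value):
    # False means the value is definitely absent; True means "maybe present".
    return all(bits[p] for p in positions(value))

# Three "files"; the target key 42 lives only in f2.
files = {"f1": [1, 2, 3], "f2": [42, 43], "f3": [7, 8, 9]}
filters = {name: build_filter(vals) for name, vals in files.items()}

# Planning-time pruning: scan only files whose filter might contain 42.
to_scan = [name for name, bits in filters.items() if might_contain(bits, 42)]
print(to_scan)  # f2 always survives; f1 and f3 are pruned barring a false positive
```

Note that min/max statistics would prune nothing here if each file's values spanned the full key range; the Bloom filter prunes regardless of how the values are ordered.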

Compared with clustered tables, Bloom filter indexes offer:

  • High efficiency: Bloom filter indexes can filter out unnecessary data at minimal cost. The resource consumption of insert and query operations is lower than that of ordinary indexes.

  • Flexibility: Index any column, including non-clustering keys, and combine with a clustered index.

  • No write-time shuffle cost: No data redistribution is required during writes.

  • Effective on high-cardinality columns: Works well when data values are widely distributed across files.

Trade-off: Bloom filters have a configurable false positive probability (FPP). A false positive causes MaxCompute to read a file that turns out not to contain the target value, but it never produces incorrect query results — only minor extra I/O in rare cases.

Space efficiency: Bloom filter indexes use bit arrays. A bit array of 2^32 = 4,294,967,296 bits occupies only 512 MB (4,294,967,296 / 8 / 1024 / 1024 = 512 MB).
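The arithmetic generalizes. A sketch using the textbook Bloom filter sizing formula m = -n·ln(p)/(ln 2)² (standard theory, not an official MaxCompute sizing rule):

```python
import math

def bloom_bits(n, p):
    # Textbook formula: bits needed to hold n items at false positive probability p.
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

# The 512 MB figure from the text: a 2^32-bit array occupies 2^32 / 8 bytes.
print(2**32 // 8 // 1024 // 1024)  # 512 (MB)

# A filter for 1,000,000 distinct values at fpp = 0.03 needs ~7.3 million bits,
# i.e. under 1 MB.
bits = bloom_bits(1_000_000, 0.03)
print(bits, f"{bits / 8 / 1024 / 1024:.2f} MB")
```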

Use cases

Bloom filter indexes are effective when:

  • One or more columns are used as equality filter conditions (=, IN) in large-scale point queries.

  • Columns in a clustered table other than the clustering keys are used as filter conditions.

  • Data is sorted by a clustering key, sorted key, or Zorder function, and you want additional file pruning on the sorted columns.

Limitations

  • Supported predicates: = and IN only. Range operators (>, >=, <, <=) and IS NULL / IS NOT NULL are not supported.

  • Filtering effectiveness depends on data distribution. Bloom filters have little benefit when the data distribution is not discrete (low-cardinality columns where most files contain all values).

  • Unsupported column types: DECIMAL, INTERVAL_DAY_TIME, INTERVAL_YEAR_MONTH, STRUCT, MAP, ARRAY, and JSON.

  • One Bloom filter index per column per CREATE statement. To index multiple columns, run separate CREATE BLOOMFILTER INDEX statements.

  • Dynamic partitions do not support automatic index merging. Rebuild the index manually after writing to dynamic partitions.

Billing

Index storage is billed as standard storage on Apsara Distributed File System (Pangu).

SQL operation | Index task triggered | Billing
------------- | -------------------- | -------
CREATE | No index creation task (DDL only) | No charge
REBUILD | Index creation or rebuild | Pay-as-you-go: unit price × complexity factor (1) × input data size
INSERT | Index built for the inserted data; a SELECT job runs to build the index | Based on the index column data size
SELECT | An index query task runs before the main query | Based on the index file size

For subscription resources, the resources used for index creation and index queries are covered by the subscription. Commercialization of Bloom filter indexes does not increase the cost of subscription jobs.

Manage Bloom filter indexes

Create a Bloom filter index

CREATE BLOOMFILTER INDEX <index_name>
ON TABLE <table_name>
FOR COLUMNS(<column_name>)
IDXPROPERTIES('numitems'='<estimated_distinct_values>', 'fpp'='<false_positive_probability>')
[COMMENT 'idxcomment']
;
Parameter | Description
--------- | -----------
index_name | The name of the index.
table_name | The name of the table.
column_name | The name of the column to index.
numitems | The estimated number of distinct values in the index column, which determines the size of the bit array. Must be greater than 0 and at most 10,000,000. Set this as close as possible to the actual distinct value count: too high wastes disk space; too low increases the false positive probability.
fpp | The false positive probability. Valid range: (0, 1]. Start with 0.1. Lower values reduce false positives but increase storage usage.

To index multiple columns, run a separate CREATE BLOOMFILTER INDEX statement for each column.
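To see why an accurate numitems matters, the textbook approximation p ≈ (1 − e^(−kn/m))^k can estimate how the real false positive rate degrades when more distinct values land in the filter than it was sized for. This is illustrative math only, not a description of MaxCompute's internal sizing:

```python
import math

def sized_filter(n_estimated, fpp):
    # Size a filter for n_estimated items at the requested fpp (textbook formulas).
    m = math.ceil(-n_estimated * math.log(fpp) / math.log(2) ** 2)  # bits
    k = max(1, round(m / n_estimated * math.log(2)))                # hash count
    return m, k

def actual_fpp(m, k, n_actual):
    # Approximate false positive rate once n_actual items are inserted.
    return (1 - math.exp(-k * n_actual / m)) ** k

m, k = sized_filter(100_000, 0.1)            # sized for 100k values at fpp 0.1
print(round(actual_fpp(m, k, 100_000), 3))   # close to the requested 0.1
print(round(actual_fpp(m, k, 1_000_000), 3)) # far worse with 10x more values
```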

Rebuild an index

Rebuild an index after a merge failure or on a specific partition:

ALTER TABLE <table_name> [PARTITION <partition_spec>] REBUILD BLOOMFILTER INDEX;

Only one partition can be rebuilt at a time.

REBUILD triggers an index job billed at the pay-as-you-go rate.

Enable Bloom filter indexes for queries

Bloom filter indexes are disabled by default during the Beta period. After the Beta period ends, the default may be changed to true based on online usage. To enable file pruning at query planning time, set the following flag before your query:

SET odps.sql.enable.bloom.filter.index=true;
SELECT * FROM <table_name> WHERE <indexed_column> = <value>;

When this flag is enabled, MaxCompute runs an index job before the main query to prune files. The pruned files are those that definitely do not contain the target values.

If result data is scattered across many files, file pruning may not reduce the file count enough to offset the overhead of the index job. In that case, set odps.sql.enable.bloom.filter.index=false.

After a query runs with the index enabled:

  • In LogView, the first job (job_1) is the index job that performs file pruning.

  • On the Summary tab, the virtual table suffixed with bf corresponds to the Bloom filter index file.

If the dataset is large, MaxCompute runs the index job as a distributed job.

View indexes on a table

SHOW INDEXES ON <table_name>;

Drop an index

DROP INDEX [IF EXISTS] <index_name> ON TABLE <table_name>;

Modify index properties

ALTER INDEX <index_name> ON <table_name>
SET IDXPROPERTIES(['comment' = '<new_comment>'], ['fpp' = '<new_fpp>'], ['numitems' = '<new_numitems>']);

How indexes are merged after data writes

When you insert data into a table that has a Bloom filter index, the system incrementally generates a local Bloom filter index for each new data file. This local index supports predicate pushdown at the storage layer immediately.

In parallel, MaxCompute starts BloomfilterAutoMergeTask to merge all local index files into a consolidated index file. The merged index enables planning-stage file pruning, which is more efficient for large datasets.
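Why merging is cheap can be shown with a toy model: Bloom filters built with identical size and hash functions combine by bitwise OR, and the merged filter answers membership for the union of their inputs. This sketches the general principle only, not the internals of BloomfilterAutoMergeTask:

```python
from hashlib import blake2b

M, K = 256, 3  # shared filter size and hash count (illustrative values)

def positions(value):
    # Derive K bit positions from K salted hashes of the value.
    return [int.from_bytes(blake2b(f"{i}:{value}".encode(), digest_size=8).digest(),
                           "big") % M
            for i in range(K)]

def build(values):
    bits = 0
    for v in values:
        for p in positions(v):
            bits |= 1 << p
    return bits

def might_contain(bits, value):
    return all(bits >> p & 1 for p in positions(value))

local_a = build(["k1", "k2"])  # local index for the first new data file
local_b = build(["k3"])        # local index for the second
merged = local_a | local_b     # consolidation is a single bitwise OR

print(all(might_contain(merged, key) for key in ["k1", "k2", "k3"]))  # True
```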

INSERT OVERWRITE TABLE <table_name> [PARTITION <partition_spec>]
SELECT ...
  • partition_spec format: (partition_col1 = value1, partition_col2 = value2, ...). Only constant values are supported — functions and expressions are not allowed.

Merge behavior:

  • If the merge succeeds, the keywords for merge completion appear on the Json Summary tab in LogView. Merge time is visible on the SubStatusHistory tab.

  • The data write task completes regardless of whether the merge succeeds.

  • If the merge fails (visible in the BloomfilterAutoMergeTask status in LogView), planning-stage file pruning is unavailable for the new data. The local index still supports storage-layer predicate pushdown. Run REBUILD BLOOMFILTER INDEX on the affected partition to restore planning-stage pruning.

  • Dynamic partitions do not support automatic merging. After writing to dynamic partitions, manually rebuild the index on each updated partition.

Examples

Example 1: Create a Bloom filter index on a common partitioned table

This example uses the TPCDS public dataset to demonstrate the full workflow: create a table, create an index, import data, and query with the index active.

  1. Check the source data.

    SET odps.namespace.schema=true;
    SELECT * FROM bigdata_public_dataset.TPCDS_10G.call_center;
  2. Create a partitioned table named call_center_test.

    CREATE TABLE IF NOT EXISTS call_center_test(
        cc_call_center_sk BIGINT NOT NULL, cc_call_center_id CHAR(16) NOT NULL,
        cc_rec_start_date DATE, cc_rec_end_date DATE,
        cc_closed_date_sk BIGINT, cc_open_date_sk BIGINT,
        cc_name VARCHAR(50), cc_class VARCHAR(50),
        cc_employees BIGINT, cc_sq_ft BIGINT,
        cc_hours CHAR(20), cc_manager VARCHAR(40),
        cc_mkt_id BIGINT, cc_mkt_class CHAR(50),
        cc_mkt_desc VARCHAR(100), cc_market_manager VARCHAR(40),
        cc_division BIGINT, cc_division_name VARCHAR(50),
        cc_company BIGINT, cc_company_name CHAR(50),
        cc_street_number CHAR(10), cc_street_name VARCHAR(60),
        cc_street_type CHAR(15), cc_suite_number CHAR(10),
        cc_city VARCHAR(60), cc_county VARCHAR(30),
        cc_state CHAR(2), cc_zip CHAR(10),
        cc_country VARCHAR(20), cc_gmt_offset DECIMAL(5,2),
        cc_tax_percentage DECIMAL(5,2)
    )
    PARTITIONED BY (ds STRING)
    ;
  3. Create a Bloom filter index on the cc_call_center_sk column.

    CREATE BLOOMFILTER INDEX call_center_test_idx01
    ON TABLE call_center_test
    FOR COLUMNS(cc_call_center_sk)
    IDXPROPERTIES('fpp' = '0.03', 'numitems'='1000000')
    COMMENT 'cc_call_center_sk index';
  4. Import data from the public dataset.

    SET odps.namespace.schema=true;
    INSERT OVERWRITE TABLE call_center_test PARTITION (ds='20241115')
    SELECT * FROM bigdata_public_dataset.TPCDS_10G.call_center LIMIT 10000;

    If the Bloom filter indexes are successfully merged, the merge completion keywords appear on the Json Summary tab in LogView.


  5. Query with the index enabled.

    SET odps.sql.enable.bloom.filter.index=true;
    SELECT * FROM call_center_test WHERE cc_call_center_sk = 10 AND ds='20241115';

    Expected output:

    +-------------------+-------------------+-------------------+-----------------+-------------------+-----------------+---------------+------------+--------------+------------+------------+----------------+------------+----------------------------+---------------------------------------------------------------------------------------+-------------------+-------------+------------------+------------+-----------------+------------------+----------------+----------------+-----------------+------------+---------------+------------+------------+---------------+---------------+-------------------+------------+
    | cc_call_center_sk | cc_call_center_id | cc_rec_start_date | cc_rec_end_date | cc_closed_date_sk | cc_open_date_sk | cc_name       | cc_class   | cc_employees | cc_sq_ft   | cc_hours   | cc_manager     | cc_mkt_id  | cc_mkt_class               | cc_mkt_desc                                                                           | cc_market_manager | cc_division | cc_division_name | cc_company | cc_company_name | cc_street_number | cc_street_name | cc_street_type | cc_suite_number | cc_city    | cc_county     | cc_state   | cc_zip     | cc_country    | cc_gmt_offset | cc_tax_percentage | ds         |
    +-------------------+-------------------+-------------------+-----------------+-------------------+-----------------+---------------+------------+--------------+------------+------------+----------------+------------+----------------------------+---------------------------------------------------------------------------------------+-------------------+-------------+------------------+------------+-----------------+------------------+----------------+----------------+-----------------+------------+---------------+------------+------------+---------------+---------------+-------------------+------------+
    | 10                | AAAAAAAAKAAAAAAA  | 1998-01-01        | 2000-01-01      | NULL              | 2451050         | Hawaii/Alaska | large      | 187          | 95744      | 8AM-8AM    | Gregory Altman | 2          | Just back responses ought  | As existing eyebrows miss as the matters. Realistic stories may not face almost by a  | James Mcdonald    | 3           | pri              | 3          | pri             | 457              | 1st            | Boulevard      | Suite B         | Midway     | Walker County | AL         | 31904      | United States | -6            | 0.02              | 20241115   |
    +-------------------+-------------------+-------------------+-----------------+-------------------+-----------------+---------------+------------+--------------+------------+------------+----------------+------------+----------------------------+---------------------------------------------------------------------------------------+-------------------+-------------+------------------+------------+-----------------+------------------+----------------+----------------+-----------------+------------+---------------+------------+------------+---------------+---------------+-------------------+------------+

    In LogView, the virtual table suffixed with bf confirms the Bloom filter index was used.


  6. View the index.

    SHOW INDEXES ON call_center_test;

    Expected output:

    ID = 20241115093930589g9biyii****
    {"Indexes": [{
                "id": "aabdaeb10a7b4e99a94716dabad8****",
                "indexColumns": [{"name": "cc_call_center_sk"}],
                "name": "call_center_test_idx01",
                "properties": {
                    "comment": "cc_call_center_sk index",
                    "fpp": "0.03",
                    "numitems": "1000000"},
                "type": "BLOOMFILTER"}]}
    OK
  7. Modify index properties. Change numitems from 1000000 to 10000 and verify the change.

    -- Update the property.
    ALTER INDEX call_center_test_idx01 ON call_center_test
    SET IDXPROPERTIES('fpp' = '0.03', 'numitems'='10000');
    
    -- Verify the change.
    SHOW INDEXES ON call_center_test;


Example 2: Create a Bloom filter index on a hash-clustered partitioned table

This example indexes a non-clustering-key column (card) in a hash-clustered table, demonstrating that Bloom filter indexes work independently of the clustering key.

  1. Create a staging table and upload data.

    1. Create a table named scope_tmp.

      CREATE TABLE IF NOT EXISTS scope_tmp(
          phone STRING,
          card STRING,
          machine STRING,
          geohash STRING
      );

    2. Upload the scope2.csv file by using the Tunnel command on the MaxCompute client (odpscmd). Run the following command from the bin directory of the client:

      tunnel upload scope2.csv scope_tmp;

  2. Create a hash-clustered partitioned table named scope_hash_pt.

    CREATE TABLE scope_hash_pt (
        phone STRING,
        card STRING,
        machine STRING,
        geohash STRING
    )
    PARTITIONED BY (ds STRING)
    CLUSTERED BY (phone)
    SORTED BY (card)
    INTO 512 BUCKETS;
  3. Create a Bloom filter index on the card column.

    CREATE BLOOMFILTER INDEX scope_hash_pt_index01
    ON TABLE scope_hash_pt
    FOR COLUMNS(card)
    IDXPROPERTIES('fpp' = '0.03', 'numitems'='1000000')
    COMMENT 'card index';
  4. Import data.

    INSERT OVERWRITE TABLE scope_hash_pt PARTITION (ds='20241115')
    SELECT * FROM scope_tmp;

    If the indexes are successfully merged, the merge completion keywords appear on the Json Summary tab in LogView.


  5. Query with the index enabled.

    SET odps.sql.enable.bloom.filter.index=true;
    SELECT * FROM scope_hash_pt WHERE card='073415764266290' AND ds='20241115';

    Expected output:

    +-------------+-----------------+----------------+------------+------------+
    | phone       | card            | machine        | geohash    | ds         |
    +-------------+-----------------+----------------+------------+------------+
    | 1576426**** | 073415764266290 | 51133960245770 | fWbDDsf    | 20241115   |
    +-------------+-----------------+----------------+------------+------------+

    In LogView, the virtual table suffixed with bf confirms the Bloom filter index was used.


Example 3: Create a Bloom filter index on a Zorder-sorted partitioned table

This example uses Zorder sorting to co-locate related data before creating a Bloom filter index, combining both techniques for maximum pruning effectiveness.
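Z-order sorting interleaves the bits of several keys so that rows close in any one key tend to be stored near each other, which concentrates each value into fewer files and makes Bloom filter pruning more effective. A toy Morton-interleaving sketch in Python (the actual zorder UDF used in this example may differ in detail):

```python
def zvalue(a, b, bits=16):
    # Interleave the low `bits` bits of a and b: a on even positions, b on odd.
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i)       # even bit positions from a
        z |= ((b >> i) & 1) << (2 * i + 1)   # odd bit positions from b
    return z

rows = [(3, 7), (3, 6), (12, 1), (2, 7)]
print(sorted(rows, key=lambda r: zvalue(*r)))
# [(3, 6), (2, 7), (3, 7), (12, 1)] -- pairs close in both keys sort together
```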

  1. Prepare data by following Step 1 of Example 2.

  2. Create a partitioned table named scope_zorder_pt with an extra zvalue column for storing Zorder values.

    CREATE TABLE scope_zorder_pt(
        phone STRING,
        card STRING,
        machine STRING,
        geohash STRING,
        zvalue BIGINT
    )
    PARTITIONED BY (ds STRING)
    ;
  3. Create a Bloom filter index on the card column.

    CREATE BLOOMFILTER INDEX scope_zorder_pt_index01
    ON TABLE scope_zorder_pt
    FOR COLUMNS(card)
    IDXPROPERTIES('fpp' = '0.05', 'numitems'='1000000')
    COMMENT 'idxcomment';
  4. Import data with Zorder sorting.

    1. Download the following JAR packages and save them to D:\ on your computer:

    2. Add the JAR packages as resources.

      ADD JAR D:\odps-zorder-1.0-SNAPSHOT.jar;
      ADD JAR D:\odps-zorder-1.0-SNAPSHOT-jar-with-dependencies.jar;
    3. Create the Zorder UDF.

      CREATE FUNCTION zorder AS 'com.aliyun.odps.zorder.evaluateZValue2WithSize'
      USING 'odps-zorder-1.0-SNAPSHOT-jar-with-dependencies.jar';
    4. Insert data sorted by Zorder value.

      -- Required to disable the ORDER BY row limit check.
      SET odps.sql.validate.orderby.limit=false;
      
      INSERT OVERWRITE TABLE scope_zorder_pt PARTITION (ds='20241115')
      SELECT *, zorder(HASH(phone), 100000000, HASH(card), 100000000) AS zvalue
      FROM scope_tmp
      ORDER BY zvalue;

      If the indexes are successfully merged, the merge completion keywords appear on the Json Summary tab in LogView.


  5. Query with the index enabled.

    SET odps.sql.enable.bloom.filter.index=true;
    SELECT * FROM scope_zorder_pt WHERE card='073415764266290' AND ds='20241115';

    Expected output:

    +-------------+-----------------+----------------+------------+---------------------+------------+
    | phone       | card            | machine        | geohash    | zvalue              | ds         |
    +-------------+-----------------+----------------+------------+---------------------+------------+
    | 1576426**** | 073415764266290 | 51133960245770 | fWbDDsf    | 3590549286038929408 | 20241115   |
    +-------------+-----------------+----------------+------------+---------------------+------------+

    In LogView, the virtual table suffixed with bf confirms the Bloom filter index was used.


What's next