All Products
Search
Document Center

ApsaraDB for SelectDB:Precise deduplication with BITMAP

Last Updated:Mar 30, 2026

When your dataset reaches hundreds of millions of rows, COUNT(DISTINCT ...) becomes slow. All distinct values from every node are shuffled to a single top-level node for the final merge—a bottleneck that grows with data volume and bitmap size. ApsaraDB for SelectDB solves this with orthogonal bitmaps: the data is bucketed so that each node can compute its own intersection or union independently, and only the reduced results flow to the top-level node.

How orthogonal bitmaps work

Standard COUNT(DISTINCT ...) executes in two stages:

Scan (Node 1)              Scan (Node 2)
+---------+---------+      +---------+---------+
| user_id | tag     |      | user_id | tag     |
+---------+---------+      +---------+---------+
| 1       | A       |      | 3       | A       |
| 2       | B       |      | 4       | C       |
+---------+---------+      +---------+---------+
         |                          |
         +----------+  +-----------+
                    |  |
             Top-level node
         Receives ALL distinct values
         from every node, merges them
         (bottleneck: I/O + single-node compute)

When a bitmap exceeds 1 GB, this final merge step becomes the bottleneck—both in network I/O and in single-node compute pressure.

ApsaraDB for SelectDB avoids this bottleneck using a hid (hash ID) column. During table creation, user IDs are assigned to buckets using a hash algorithm. Bitmap values in different buckets are orthogonal (non-overlapping), so each bucket computes its own result independently:

Stage 1 (distributed): each bucket computes intersection/union locally
Stage 2 (merge): top-level node combines per-bucket results (small data only)

This eliminates the single-node bottleneck and scales to billions of rows.

Limitations

Before you start, note the following constraints:

  • Aggregate Key model required: The bitmap column must be in a table using the Aggregate Key model, with BITMAP_UNION as the aggregation function.

  • Partitioned tables not supported: Orthogonal bitmap functions cannot be used on partitioned tables. Partition boundaries break the orthogonality guarantee between buckets, making results unreliable.

  • `hid` cardinality requirement: The cardinality of the hid column must be at least five times the number of buckets to ensure balanced data distribution after hash bucketing.

Set up bitmap deduplication

Step 1: Create the table

Create a table using the Aggregate Key model with a bitmap Value field and a hid column for hash bucketing.

CREATE TABLE `user_tag_bitmap` (
  `tag`     BIGINT(20)  NULL COMMENT "User tag",
  `hid`     SMALLINT(6) NULL COMMENT "Bucket ID",
  `user_id` BITMAP BITMAP_UNION NULL COMMENT "User bitmap"
) ENGINE=OLAP
AGGREGATE KEY(`tag`, `hid`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`hid`) BUCKETS 3;

The hid column divides the user_id range into groups, with each group assigned to a single bucket. This ensures bitmap values across buckets are orthogonal.

Note

Set the cardinality of hid to at least five times the bucket count. For example, with 3 buckets, hid should have at least 15 distinct values.

Step 2: Import data

Use a LOAD LABEL statement to load data. The ceil(tmp_user_id/500) expression computes the hid value for each row, and to_bitmap(tmp_user_id) converts the raw integer into a bitmap entry.

LOAD LABEL user_tag_bitmap_test
(
    DATA INFILE('hdfs://abc')
    INTO TABLE user_tag_bitmap
    COLUMNS TERMINATED BY ','
    (tmp_tag, tmp_user_id)
    SET (
        tag     = tmp_tag,
        hid     = ceil(tmp_user_id/500),
        user_id = to_bitmap(tmp_user_id)
    )
)
Note

The divisor 500 is not fixed. Adjust it based on the expected range of user IDs and your target number of distinct hid values.

The source data format is two comma-separated columns—tag and user ID:

11111111,1
11111112,2
11111113,3
11111114,4

During import, SelectDB assigns user IDs in the range [1, 5000000) the same hid value and places them in the same bucket. Bitmap values across buckets remain orthogonal, which reduces the compute cost for intersection and union operations.

Step 3: Query with orthogonal bitmap functions

After importing data, use the orthogonal bitmap functions described in the next section. For a full list of standard bitmap functions, see the SQL reference.

Orthogonal bitmap functions

The following five functions support bitmap queries in orthogonal scenarios. Functions that return a count are direct substitutes for the corresponding COUNT(DISTINCT ...) patterns, but execute in a distributed manner.

Function Returns Use when
bitmap_orthogonal_intersect Bitmap Computing the intersection bitmap across tags
orthogonal_bitmap_intersect_count Count Counting users at the intersection of multiple tags
orthogonal_bitmap_union_count Count Counting total distinct users across tags
orthogonal_bitmap_expr_calculate Bitmap Evaluating complex set expressions (AND, OR, NOT)
orthogonal_bitmap_expr_calculate_count Count Counting results of complex set expressions
Note

Orthogonal bitmap functions cannot be used on partitioned tables. Because partitions are orthogonal to each other, data between partitions is not guaranteed to be orthogonal, and results would be unreliable.

bitmap_orthogonal_intersect

Computes the intersection of bitmap values across the specified filter values and returns the result as a bitmap.

Syntax

bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values)

Parameters

Parameter Description
bitmap_column The bitmap column to aggregate.
column_to_filter The dimension column used for filtering.
filter_values A variable-length list of values to filter the dimension column.

Example

SELECT BITMAP_COUNT(bitmap_orthogonal_intersect(user_id, tag, 13080800, 11110200))
FROM user_tag_bitmap
WHERE tag IN (13080800, 11110200);

Equivalent to:

-- Returns the count of users who appear in ALL specified tags
SELECT COUNT(DISTINCT user_id)
FROM user_tag_bitmap
WHERE tag IN (13080800, 11110200)
  AND <user_id appears in every tag>;

During execution, Stage 1 filters rows by the specified tags, then computes the bitmap intersection across all matching tags per bucket. Stage 2 iteratively merges the per-bucket intersection bitmaps into the final result.

orthogonal_bitmap_intersect_count

Computes the intersection of bitmap values and returns the count directly. The semantics match intersect_count, but execution is distributed across buckets.The syntax is the same as that of the `intersect_count` function, but the implementation is different.The syntax is the same as that of the `bitmap_union_count` function, but the implementation is different.

Syntax

orthogonal_bitmap_intersect_count(bitmap_column, column_to_filter, filter_values)

Parameters

Parameter Description
bitmap_column The bitmap column to aggregate.
column_to_filter The dimension column used for filtering.
filter_values A variable-length list of values to filter the dimension column.

Example

SELECT orthogonal_bitmap_intersect_count(user_id, tag, 1150000, 1150001, 390006)
FROM user_tag_bitmap
WHERE tag IN (1150000, 1150001, 390006);

During execution, Stage 1 filters by the specified tags, computes the bitmap intersection across all matching tags, and counts the result per bucket. Stage 2 sums all per-bucket counts into the final total.

orthogonal_bitmap_union_count

Computes the union of bitmap values and returns the count. The semantics match BITMAP_UNION_COUNT, but execution is distributed across buckets.

Syntax

orthogonal_bitmap_union_count(bitmap_column)

Parameters

Parameter Description
bitmap_column The bitmap column whose values are unioned and counted.

Example

SELECT orthogonal_bitmap_union_count(user_id)
FROM user_tag_bitmap
WHERE tag IN (1150000, 1150001, 390006);

Equivalent to:

SELECT COUNT(DISTINCT user_id)
FROM user_tag_bitmap
WHERE tag IN (1150000, 1150001, 390006);

During execution, Stage 1 computes the bitmap union per bucket and counts the result. Stage 2 sums all per-bucket counts into the final total.

orthogonal_bitmap_expr_calculate

Evaluates a set expression across bitmap columns and returns the result as a bitmap. Use this function when you need the bitmap itself for further processing, rather than just a count.

Syntax

orthogonal_bitmap_expr_calculate(bitmap_column, filter_column, input_string)

Parameters

Parameter Description
bitmap_column The bitmap column to aggregate.
filter_column The dimension column used for filtering (the key column for calculation).
input_string A set expression string over the key column values.

input_string supports the following operators:

Operator Operation
& Intersection
| Union
- Difference
^ Exclusive XOR
\ Escape character

Example

SELECT orthogonal_bitmap_expr_calculate(
    user_id,
    tag,
    '(833736|999777)&(1308083|231207)&(1000|20000-30000)'
)
FROM user_tag_bitmap
WHERE tag IN (833736, 999777, 130808, 231207, 1000, 20000, 30000);

During execution, Stage 1 parses input_string to determine the tag filter, filters data, and applies the bitmap expression to the filtered rows per bucket. Stage 2 computes the union of all per-bucket bitmap results and returns the final bitmap.

orthogonal_bitmap_expr_calculate_count

Evaluates a set expression across bitmap columns and returns the count. The syntax and parameters are identical to orthogonal_bitmap_expr_calculate.

Syntax

orthogonal_bitmap_expr_calculate_count(bitmap_column, filter_column, input_string)

Parameters

Parameter Description
bitmap_column The bitmap column to aggregate.
filter_column The dimension column used for filtering (the key column for calculation).
input_string A set expression string over the key column values. Supports &, |, -, ^, and \.

Example

SELECT orthogonal_bitmap_expr_calculate_count(
    user_id,
    tag,
    '(833736|999777)&(1308083|231207)&(1000|20000-30000)'
)
FROM user_tag_bitmap
WHERE tag IN (833736, 999777, 130808, 231207, 1000, 20000, 30000);

During execution, Stage 1 parses input_string, filters data, applies the bitmap expression, and counts the result per bucket. Stage 2 computes the union of all per-bucket bitmap results and returns the final count.

What's next

  • SQL reference — full list of bitmap functions in the Bitmap functions section