When your dataset reaches hundreds of millions of rows, COUNT(DISTINCT ...) becomes slow. All distinct values from every node are shuffled to a single top-level node for the final merge—a bottleneck that grows with data volume and bitmap size. ApsaraDB for SelectDB solves this with orthogonal bitmaps: the data is bucketed so that each node can compute its own intersection or union independently, and only the reduced results flow to the top-level node.
How orthogonal bitmaps work
Standard COUNT(DISTINCT ...) executes in two stages:
Scan (Node 1) Scan (Node 2)
+---------+---------+ +---------+---------+
| user_id | tag | | user_id | tag |
+---------+---------+ +---------+---------+
| 1 | A | | 3 | A |
| 2 | B | | 4 | C |
+---------+---------+ +---------+---------+
| |
+----------+ +-----------+
| |
Top-level node
Receives ALL distinct values
from every node, merges them
(bottleneck: I/O + single-node compute)
When a bitmap exceeds 1 GB, this final merge step becomes the bottleneck—both in network I/O and in single-node compute pressure.
ApsaraDB for SelectDB avoids this bottleneck using a hid (hash ID) column. During table creation, user IDs are assigned to buckets using a hash algorithm. Bitmap values in different buckets are orthogonal (non-overlapping), so each bucket computes its own result independently:
Stage 1 (distributed): each bucket computes intersection/union locally
Stage 2 (merge): top-level node combines per-bucket results (small data only)
This eliminates the single-node bottleneck and scales to billions of rows.
Limitations
Before you start, note the following constraints:
-
Aggregate Key model required: The bitmap column must be in a table using the Aggregate Key model, with
BITMAP_UNIONas the aggregation function. -
Partitioned tables not supported: Orthogonal bitmap functions cannot be used on partitioned tables. Partition boundaries break the orthogonality guarantee between buckets, making results unreliable.
-
`hid` cardinality requirement: The cardinality of the
hidcolumn must be at least five times the number of buckets to ensure balanced data distribution after hash bucketing.
Set up bitmap deduplication
Step 1: Create the table
Create a table using the Aggregate Key model with a bitmap Value field and a hid column for hash bucketing.
CREATE TABLE `user_tag_bitmap` (
`tag` BIGINT(20) NULL COMMENT "User tag",
`hid` SMALLINT(6) NULL COMMENT "Bucket ID",
`user_id` BITMAP BITMAP_UNION NULL COMMENT "User bitmap"
) ENGINE=OLAP
AGGREGATE KEY(`tag`, `hid`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`hid`) BUCKETS 3;
The hid column divides the user_id range into groups, with each group assigned to a single bucket. This ensures bitmap values across buckets are orthogonal.
Set the cardinality of hid to at least five times the bucket count. For example, with 3 buckets, hid should have at least 15 distinct values.
Step 2: Import data
Use a LOAD LABEL statement to load data. The ceil(tmp_user_id/500) expression computes the hid value for each row, and to_bitmap(tmp_user_id) converts the raw integer into a bitmap entry.
LOAD LABEL user_tag_bitmap_test
(
DATA INFILE('hdfs://abc')
INTO TABLE user_tag_bitmap
COLUMNS TERMINATED BY ','
(tmp_tag, tmp_user_id)
SET (
tag = tmp_tag,
hid = ceil(tmp_user_id/500),
user_id = to_bitmap(tmp_user_id)
)
)
The divisor 500 is not fixed. Adjust it based on the expected range of user IDs and your target number of distinct hid values.
The source data format is two comma-separated columns—tag and user ID:
11111111,1
11111112,2
11111113,3
11111114,4
During import, SelectDB assigns user IDs in the range [1, 5000000) the same hid value and places them in the same bucket. Bitmap values across buckets remain orthogonal, which reduces the compute cost for intersection and union operations.
Step 3: Query with orthogonal bitmap functions
After importing data, use the orthogonal bitmap functions described in the next section. For a full list of standard bitmap functions, see the SQL reference.
Orthogonal bitmap functions
The following five functions support bitmap queries in orthogonal scenarios. Functions that return a count are direct substitutes for the corresponding COUNT(DISTINCT ...) patterns, but execute in a distributed manner.
| Function | Returns | Use when |
|---|---|---|
bitmap_orthogonal_intersect |
Bitmap | Computing the intersection bitmap across tags |
orthogonal_bitmap_intersect_count |
Count | Counting users at the intersection of multiple tags |
orthogonal_bitmap_union_count |
Count | Counting total distinct users across tags |
orthogonal_bitmap_expr_calculate |
Bitmap | Evaluating complex set expressions (AND, OR, NOT) |
orthogonal_bitmap_expr_calculate_count |
Count | Counting results of complex set expressions |
Orthogonal bitmap functions cannot be used on partitioned tables. Because partitions are orthogonal to each other, data between partitions is not guaranteed to be orthogonal, and results would be unreliable.
bitmap_orthogonal_intersect
Computes the intersection of bitmap values across the specified filter values and returns the result as a bitmap.
Syntax
bitmap_orthogonal_intersect(bitmap_column, column_to_filter, filter_values)
Parameters
| Parameter | Description |
|---|---|
bitmap_column |
The bitmap column to aggregate. |
column_to_filter |
The dimension column used for filtering. |
filter_values |
A variable-length list of values to filter the dimension column. |
Example
SELECT BITMAP_COUNT(bitmap_orthogonal_intersect(user_id, tag, 13080800, 11110200))
FROM user_tag_bitmap
WHERE tag IN (13080800, 11110200);
Equivalent to:
-- Returns the count of users who appear in ALL specified tags
SELECT COUNT(DISTINCT user_id)
FROM user_tag_bitmap
WHERE tag IN (13080800, 11110200)
AND <user_id appears in every tag>;
During execution, Stage 1 filters rows by the specified tags, then computes the bitmap intersection across all matching tags per bucket. Stage 2 iteratively merges the per-bucket intersection bitmaps into the final result.
orthogonal_bitmap_intersect_count
Computes the intersection of bitmap values and returns the count directly. The semantics match intersect_count, but execution is distributed across buckets.The syntax is the same as that of the `intersect_count` function, but the implementation is different.The syntax is the same as that of the `bitmap_union_count` function, but the implementation is different.
Syntax
orthogonal_bitmap_intersect_count(bitmap_column, column_to_filter, filter_values)
Parameters
| Parameter | Description |
|---|---|
bitmap_column |
The bitmap column to aggregate. |
column_to_filter |
The dimension column used for filtering. |
filter_values |
A variable-length list of values to filter the dimension column. |
Example
SELECT orthogonal_bitmap_intersect_count(user_id, tag, 1150000, 1150001, 390006)
FROM user_tag_bitmap
WHERE tag IN (1150000, 1150001, 390006);
During execution, Stage 1 filters by the specified tags, computes the bitmap intersection across all matching tags, and counts the result per bucket. Stage 2 sums all per-bucket counts into the final total.
orthogonal_bitmap_union_count
Computes the union of bitmap values and returns the count. The semantics match BITMAP_UNION_COUNT, but execution is distributed across buckets.
Syntax
orthogonal_bitmap_union_count(bitmap_column)
Parameters
| Parameter | Description |
|---|---|
bitmap_column |
The bitmap column whose values are unioned and counted. |
Example
SELECT orthogonal_bitmap_union_count(user_id)
FROM user_tag_bitmap
WHERE tag IN (1150000, 1150001, 390006);
Equivalent to:
SELECT COUNT(DISTINCT user_id)
FROM user_tag_bitmap
WHERE tag IN (1150000, 1150001, 390006);
During execution, Stage 1 computes the bitmap union per bucket and counts the result. Stage 2 sums all per-bucket counts into the final total.
orthogonal_bitmap_expr_calculate
Evaluates a set expression across bitmap columns and returns the result as a bitmap. Use this function when you need the bitmap itself for further processing, rather than just a count.
Syntax
orthogonal_bitmap_expr_calculate(bitmap_column, filter_column, input_string)
Parameters
| Parameter | Description |
|---|---|
bitmap_column |
The bitmap column to aggregate. |
filter_column |
The dimension column used for filtering (the key column for calculation). |
input_string |
A set expression string over the key column values. |
input_string supports the following operators:
| Operator | Operation |
|---|---|
& |
Intersection |
| |
Union |
- |
Difference |
^ |
Exclusive XOR |
\ |
Escape character |
Example
SELECT orthogonal_bitmap_expr_calculate(
user_id,
tag,
'(833736|999777)&(1308083|231207)&(1000|20000-30000)'
)
FROM user_tag_bitmap
WHERE tag IN (833736, 999777, 130808, 231207, 1000, 20000, 30000);
During execution, Stage 1 parses input_string to determine the tag filter, filters data, and applies the bitmap expression to the filtered rows per bucket. Stage 2 computes the union of all per-bucket bitmap results and returns the final bitmap.
orthogonal_bitmap_expr_calculate_count
Evaluates a set expression across bitmap columns and returns the count. The syntax and parameters are identical to orthogonal_bitmap_expr_calculate.
Syntax
orthogonal_bitmap_expr_calculate_count(bitmap_column, filter_column, input_string)
Parameters
| Parameter | Description |
|---|---|
bitmap_column |
The bitmap column to aggregate. |
filter_column |
The dimension column used for filtering (the key column for calculation). |
input_string |
A set expression string over the key column values. Supports &, |, -, ^, and \. |
Example
SELECT orthogonal_bitmap_expr_calculate_count(
user_id,
tag,
'(833736|999777)&(1308083|231207)&(1000|20000-30000)'
)
FROM user_tag_bitmap
WHERE tag IN (833736, 999777, 130808, 231207, 1000, 20000, 30000);
During execution, Stage 1 parses input_string, filters data, applies the bitmap expression, and counts the result per bucket. Stage 2 computes the union of all per-bucket bitmap results and returns the final count.
What's next
-
SQL reference — full list of bitmap functions in the Bitmap functions section