All Products
Search
Document Center

Hologres:Analyze large user properties with roaring bitmaps

Last Updated:Mar 26, 2026

When a user profiling system has more than 1,000 tags, wide tables become impractical — update performance degrades as columns accumulate. Roaring bitmaps offer a better approach: encode user IDs as integers, store them in compact bitmap structures, and replace expensive JOIN operations with fast bitwise AND, OR, and NOT operations. For large-scale user property analysis, Hologres can respond to queries in sub-seconds using this approach.

Hologres is compatible with the PostgreSQL ecosystem and natively supports the roaringbitmap extension.

When to use roaring bitmaps

Roaring bitmaps are effective in two scenarios:

  • Large-scale tag filtering: Replace JOIN operations across many large tables with BITMAP_AND intersection operations. The bitmap extension uses Single Instruction Multiple Data (SIMD) optimization to improve CPU utilization by 1% to 2%.

  • Deduplication at scale: Bitmaps have intrinsic deduplication built in, eliminating the overhead of unique vector (UV) computing.

Property tags vs. action tags

Tags in a user profiling system fall into two categories with different characteristics that determine whether bitmap storage makes sense:

Tag typeDescriptionExamplesSuitable for bitmaps?
Property tagsStable user attributes filtered by precise conditionsGender, province, marital statusYes — high compression ratio, efficient bitwise operations
Action tagsBehavioral data organized by time, frequently updated, requires range scanning and aggregate filteringPage views, purchases, active logonsNo — low compression ratio, data changes constantly

Why action tags are not suitable: Fact tables are organized by time (for example, one partition per day), and each day's slice contains limited rows. Compressing sparse per-day slices into bitmaps wastes storage due to row-oriented overhead. Action data also changes frequently, and the [option→bitmap] storage structure cannot identify which records need updating. For action data queries, Hologres uses its original columnar storage format instead.

Prerequisites

Before you begin, make sure you have:

  • A Hologres instance

  • The roaringbitmap extension enabled (run CREATE EXTENSION roaringbitmap;)

  • Source data tables with user IDs and tag columns

Important

Roaring bitmaps store only integers. If your user IDs are strings, encode them to integers before building bitmap tables. See Step 1: Encode user IDs for the encoding approach and trade-offs.

How it works

The bitmap approach has three layers:

  1. Encoding: Map string user IDs to integers using a dictionary table with a BIGSERIAL auto-increment column.

  2. Bitmap table design: Split tag tables by column into bitmap tables. Each bitmap table stores (tag_value, bucket, bitmap) — where bucket = encode_uid / 65536 distributes the data evenly across nodes.

  3. Querying: Combine bitmaps from multiple tables using RB_AND, RB_OR, and rb_cardinality to count matching users without JOIN or deduplication overhead.

Segmenting bitmaps into 65,536 buckets is critical for even resource distribution. Without segmentation, a column like gender (only Male and Female) produces just two bitmaps, concentrating compute and storage on two nodes.

Build a property tag bitmap system

The following steps walk through a complete example using two source tables:

  • dws_userbase — basic user properties (province, gender)

  • dws_shop_cust — shop customer types (shop_id, cust_type)

Property tag table

Step 1: Encode user IDs

Bitmaps require integer input. Create a dictionary table using BIGSERIAL to map string user IDs to sequential integers:

-- Create a UID dictionary table
CREATE TABLE dws_uid_dict (
    encode_uid bigserial,
    uid text primary key
);

-- Populate from your source tag table
INSERT INTO dws_uid_dict(uid)
SELECT uid
FROM dws_userbase ON CONFLICT DO NOTHING;

This produces compact, sequential integer IDs — efficient to store as bitmaps. Sparse integer ranges consume significantly more space than dense ranges.

Bitmap

When to encode:

User ID typeRecommendation
String user IDsEncode (required)
Integer user IDs with infrequent ID restorationEncode (recommended)
Integer user IDs that require frequent restoration (for example, real-time output of user details)Skip encoding — decoding back requires an extra JOIN at query time

Step 2: Design bitmap tables

Split tag columns into separate bitmap tables. Each table uses bucket as the distribution key so queries run in parallel across nodes.

Important

Do not split correlated columns into separate bitmap tables. If cust_type is a metric for shop_id (meaning a user's customer type only makes sense in the context of a specific shop), splitting them produces incorrect results. Combine correlated columns into one bitmap table.

Correct approach — combined correlated columns:

Bitwise operations

What goes wrong when splitting correlated columns:

Column splitting

For the dws_shop_cust table, cust_type depends on shop_id. Splitting them into separate bitmap tables and filtering [shop_id = A] & [cust_type = Fresh] returns UID [1] — but that user is not actually a Fresh customer of shop A. Use a combined bitmap table instead.

Create the bitmap tables:

-- Enable the roaringbitmap extension
CREATE EXTENSION roaringbitmap;

-- Bitmap table: province property
BEGIN;
CREATE TABLE rb_dws_userbase_province (
    province text,
    bucket   int,
    bitmap   roaringbitmap
);
CALL set_table_property('rb_dws_userbase_province', 'distribution_key', 'bucket');
END;

-- Bitmap table: shop + customer type (correlated columns combined)
BEGIN;
CREATE TABLE rb_dws_shop_cust_sid_ctype (
    shop_id   text,
    cust_type text,
    bucket    int,
    bitmap    roaringbitmap
);
CALL set_table_property('rb_dws_shop_cust_sid_ctype', 'distribution_key', 'bucket');
END;

Step 3: Write data into bitmap tables

Use encode_uid / 65536 to assign each user to a bucket, then aggregate with rb_build_agg:

-- Write property bitmap data
INSERT INTO rb_dws_userbase_province
SELECT province,
       encode_uid / 65536 AS bucket,
       rb_build_agg(b.encode_uid) AS bitmap
FROM dws_userbase a
JOIN dws_uid_dict b ON a.uid = b.uid
GROUP BY province, bucket;

-- Write shop + customer type bitmap data
INSERT INTO rb_dws_shop_cust_sid_ctype
SELECT shop_id,
       cust_type,
       encode_uid / 65536 AS bucket,
       rb_build_agg(b.encode_uid) AS bitmap
FROM dws_shop_cust a
JOIN dws_uid_dict b ON a.uid = b.uid
GROUP BY shop_id, cust_type, bucket;

Step 4: Query with bitmap operations

To count users matching [shop_id = A] & [cust_type = Fresh] & [province = Beijing], combine bitmaps from both tables with RB_AND and count with RB_CARDINALITY:

SELECT SUM(RB_CARDINALITY(rb_and(ub.bitmap, uc.bitmap)))
FROM
    (SELECT rb_or_agg(bitmap) AS bitmap,
            bucket
     FROM rb_dws_userbase_province
     WHERE province = 'Beijing'
     GROUP BY bucket) ub
JOIN
    (SELECT rb_or_agg(bitmap) AS bitmap,
            bucket
     FROM rb_dws_shop_cust_sid_ctype
     WHERE shop_id = 'A'
       AND cust_type = 'Fresh'
     GROUP BY bucket) uc ON ub.bucket = uc.bucket;

Each subquery retrieves matching buckets and merges within-bucket bitmaps using rb_or_agg. The outer query intersects the two result sets per bucket and sums the cardinalities.

Handle action tags

For action data (behavioral fact tables), skip the bitmap conversion. Store the fact table in Hologres using columnar orientation, with clustering_key on shop_id and encode_uid for efficient filtering.

Create the action table and its encoded version:

-- Source action table
BEGIN;
CREATE TABLE dws_usershop_behavior (
    uid      int  not null,
    shop_id  text not null,
    pv_cnt   int,
    trd_cnt  int,
    ds       integer not null
);
CALL set_table_property('dws_usershop_behavior', 'distribution_key', 'uid');
COMMIT;

-- Encoded action table (bucket-distributed for local join)
BEGIN;
CREATE TABLE dws_usershop_behavior_bucket (
    encode_uid int  not null,
    shop_id    text not null,
    pv_cnt     int,
    trd_cnt    int,
    ds         int  not null,
    bucket     int
);
CALL set_table_property('dws_usershop_behavior_bucket', 'orientation', 'column');
CALL set_table_property('dws_usershop_behavior_bucket', 'distribution_key', 'bucket');
CALL set_table_property('dws_usershop_behavior_bucket', 'clustering_key', 'shop_id,encode_uid');
COMMIT;

-- Write fact data
INSERT INTO dws_usershop_behavior_bucket
SELECT encode_uid,
       shop_id,
       pv_cnt,
       trd_cnt,
       ds,
       encode_uid / 65536
FROM dws_usershop_behavior a
JOIN dws_uid_dict b ON a.uid = b.uid;

To query users matching [province = Beijing] & [shop_id = A, no purchase in 7 days], generate a bitmap from the filtered action table and join it with the property bitmap index. Because both tables use bucket as their distribution key, Hologres executes this as a local join:

SELECT sum(rb_cardinality(bitmap)) AS cnt
FROM
    (SELECT rb_and(ub.bitmap, us.bitmap) AS bitmap,
            ub.bucket
     FROM
         (SELECT rb_or_agg(bitmap) AS bitmap,
                 bucket
          FROM rb_dws_userbase_province
          WHERE province = 'Beijing'
          GROUP BY bucket) AS ub
     JOIN
         (SELECT rb_build_agg(encode_uid) AS bitmap,
                 bucket
          FROM
              (SELECT encode_uid,
                      bucket
               FROM dws_usershop_behavior_bucket
               WHERE shop_id = 'A'
                 AND ds > to_char(current_date - 7, 'YYYYMMdd')::int
               GROUP BY encode_uid, bucket
               HAVING sum(trd_cnt) = 0) tmp
          GROUP BY bucket) us ON ub.bucket = us.bucket) r;

Load bitmap data offline

Process bitmap data offline to avoid impact on production queries. Both MaxCompute and Hive support loading data into Hologres via foreign tables.

Build bitmaps in MaxCompute

MaxCompute requires a JAR-based user-defined function (UDF) for bitmap operations. All steps mirror the online approach: encode UIDs, assign buckets, and aggregate into bitmaps.

-- Select the MaxCompute project
USE bitmap_demo;

-- Create source tables
CREATE TABLE mc_dws_uid_dict (
    encode_uid bigint,
    bucket     bigint,
    uid        string
);

CREATE TABLE mc_dws_userbase (
    uid       string,
    province  string,
    gender    string,
    marriaged string
);

-- Encode UID values (incremental — skips already-encoded UIDs)
WITH uids_to_encode AS (
    SELECT DISTINCT(ub.uid),
           CAST(ub.uid / 65536 AS BIGINT) AS bucket
    FROM mc_dws_userbase ub
    LEFT JOIN mc_dws_uid_dict d ON ub.uid = d.uid
    WHERE d.uid IS NULL
),
uids_bucket_encode_offset AS (
    SELECT bucket,
           sum(cnt) OVER (ORDER BY bucket ASC) - cnt AS bucket_offset
    FROM (
        SELECT count(1) AS cnt, bucket
        FROM uids_to_encode
        GROUP BY bucket
    ) x
),
dict_used_id_offset AS (
    SELECT max(encode_uid) AS used_id_offset FROM mc_dws_uid_dict
)
-- New encode_uid = max existing + bucket offset + row number within bucket
INSERT INTO mc_dws_uid_dict
SELECT COALESCE((SELECT used_id_offset FROM dict_used_id_offset), 0) + bucket_offset + rn,
       bucket,
       uid
FROM (
    SELECT row_number() OVER (PARTITION BY ub.bucket ORDER BY ub.uid) AS rn,
           ub.bucket,
           bo.bucket_offset,
           uid
    FROM uids_to_encode ub
    JOIN uids_bucket_encode_offset bo ON ub.bucket = bo.bucket
) j;

-- Register bitmap UDFs from the mc-bitmap JAR
ADD JAR function_jar_dir/mc-bitmap-functions.jar AS mc_bitmap_func.jar -f;
CREATE FUNCTION mc_rb_cardinality AS com.alibaba.hologres.RbCardinalityUDF USING mc_bitmap_func.jar;
CREATE FUNCTION mc_rb_build_agg AS com.alibaba.hologres.RbBuildAggUDAF USING mc_bitmap_func.jar;

-- Create and populate the bitmap table
CREATE TABLE mc_rb_dws_userbase_province (
    province string,
    bucket   int,
    bitmap   string
);

INSERT INTO mc_rb_dws_userbase_province
SELECT province,
       b.bucket_num,
       mc_rb_build_agg(b.encode_uid) AS bitmap
FROM mc_dws_userbase a
JOIN mc_dws_uid_dict b ON a.uid = b.uid
GROUP BY province, b.bucket_num;

Download the mc-bitmap JAR to use bitmap functions in MaxCompute.

Import into Hologres

Create a foreign table pointing to the MaxCompute bitmap table, then insert into the Hologres bitmap table using roaringbitmap_text to convert the serialized string format:

-- Create a MaxCompute foreign table in Hologres
CREATE TABLE mc_rb_dws_userbase_province (
    province text,
    bucket   int,
    bitmap   text
) SERVER odps_server OPTIONS (project_name 'bitmap_demo', table_name 'mc_rb_dws_userbase_province');

-- Write bitmap data from MaxCompute into Hologres
INSERT INTO rb_dws_userbase_province
SELECT province,
       bucket::INT,
       roaringbitmap_text(bitmap, FALSE)
FROM mc_rb_dws_userbase_province;

After the import completes, run the same bitwise queries described in Step 4: Query with bitmap operations.

For a detailed walkthrough of offline UV calculation, see Batch UV calculation.

Real-time bitmap updates with Flink

For real-time scenarios, use Hologres with Flink to maintain bitmap data as user actions arrive:

  1. Use the UID dictionary table as a Flink dimension table. Use INSERT ON CONFLICT in Hologres to add new user IDs as they appear, then join the dimension table with the incoming action stream in Flink.

  2. Aggregate the joined stream by tag using Roaring bitmaps.

  3. Write the resulting bitmaps into the bitmap table in Hologres.

What's next