All Products
Search
Document Center

Hologres:BSI (beta)

Last Updated:Nov 22, 2023

If a large number of attribute tags and behavior tags exist in user profile analysis scenarios, the roaring bitmap algorithm has significant limitations. Hologres provides the bit-sliced index (BSI) algorithm to eliminate the limitations and retain the benefits of the roaring bitmap algorithm. This topic describes the best practices for implementing tag computing in Hologres by using BSI.

Background information

The roaring bitmap algorithm is used in user profile analysis scenarios. The algorithm creates an index for a tag table, encodes user IDs, and then saves the encoded user IDs in the bitmap format. This way, the relational operation is converted into the intersection, union, and difference operations of bitmaps. This accelerates real-time computing. The roaring bitmap algorithm has limitations in the following scenarios:

  • Multi-tag joint queries: The roaring bitmap algorithm is more suitable for attribute tag-based queries and can be used only for fixed tag-based queries. If you use the roaring bitmap algorithm in joint queries that are based on multiple behavior tags, such as the numeric tags page views (PVs), order amount, and playback duration, you must backtrack the detailed tables. This increases the detailed data drilldown cost.

  • High-cardinality tag-based queries: If a large amount of data exists after deduplication is performed based on a tag, the storage space that is used for storing roaring bitmaps significantly increases, and the query performance decreases.

The following section describes how to use the BSI algorithm and specific functions to eliminate the preceding limitations. For more information about the functions, see BSI functions (beta).

  • In multi-tag joint queries, data with numeric behavior tags is pre-calculated and compressed for storage by using the BSI algorithm to ensure accuracy. Joined queries on detailed tables are not required. This way, efficient association analysis based on attribute tags and behavior tags is implemented.

  • In high-cardinality behavior tag-based queries, the BSI algorithm is used to store the behavior tag values of all users whose cid values of the INT type are within a specific range by converting the tag values into a maximum of 32-bit slices. Binary conversion and intersection, union, and difference operations of roaring bitmaps are performed for quick data computing to achieve compressed storage and low-latency queries on high-cardinality behavior tag values.

Profile analysis solution

In this example, two user tag tables exist. The dws_userbase table contains basic attribute tags of users, such as the province and gender. The usershop_behavior table contains user behavior tags, such as the Gross Merchandise Volume (GMV).

  • The following figure shows the raw data.image.png

  • The following figure shows the bitmaps of basic user attributes and uids in the rb_tag table. In this table, the tag_name column contains tags such as province and gender.

    image.png

  • The following figure shows the bit slices that are generated based on user behavior tag values and uids by using the BSI algorithm. The tag values are converted into binary values, and the uid bitmaps of the four slices are recorded in the bsi_gmv table.image.png

In this solution, uids and related behavior tag values are compressed into BSIs. Tag-based fast computing is achieved by using the BSI algorithm and AND, OR, and NOT operations on roaring bitmaps. Examples:

  • Use the BSI algorithm to perform the sum operation on behavior tag values of identified user groups. The sum operation is converted into the bitmap intersection operation on each slice.image.png

  • Use the BSI algorithm to obtain the top K values based on the behavior tags of identified user groups. The global sorting is converted into the bitmap intersection operation on slices from higher bits to lower bits.image.png

Basic practices of profile analysis

BSI tables

Table name

Field

Description

dws_userbase

(uid int, province text, gender text)

The table that contains the original attribute tags of users, which is the same as the table in the wide table solution.

dws_uid_dict

(encode_uid serial, uid int)

The uid dictionary encoding table, which is the same as the table in the roaring bitmap solution.

usershop_behavior

(uid int, gmv int)

The original behavior tag table that contains behavior tags of users, such as the GMV.

rb_tag

(tag_name text, tag_val text, bitmap roaringbitmap)

The attribute tag table that is based on the roaring bitmap algorithm.

bsi_gmv

(gmv_bsi bsi)

The detailed GMV metric table that is based on the BSI algorithm.

Data definition language (DDL) statements that are used to create the preceding tables:

  • Create the dws_userbase table.

    CREATE TABLE dws_userbase (
        uid int NOT NULL PRIMARY KEY,
        province text,
        gender text
        ... -- Other attribute columns.
    )
    WITH (
        distribution_key = 'uid'
    );
  • Create the dws_uid_dict table.

    CREATE TABLE dws_uid_dict (
        encode_uid serial,
        uid int PRIMARY KEY
    );
  • Create the usershop_behavior table.

    CREATE TABLE usershop_behavior (
        uid int NOT NULL,
        gmv int
    )
    WITH (
        distribution_key = 'uid'
    );
  • Create the rb_tag table.

    CREATE TABLE rb_tag (
        tag_name text,
        tag_val text,
        bitmap roaringbitmap
    );
  • Create the bsi_gmv table.

    CREATE TABLE bsi_gmv (
        gmv_bsi bsi
    );

Import data

  • Generate roaring bitmaps of attribute tag values based on the dws_userbase and dws_uid_dict tables, and store the roaring bitmaps in different buckets.

    INSERT INTO rb_tag
    SELECT
        'province',
        province,
        rb_build_agg (b.encode_uid) AS bitmap
    FROM
        dws_userbase a
        JOIN dws_uid_dict b ON a.uid = b.uid
    GROUP BY
        province;
    
    INSERT INTO rb_tag
    SELECT
        'gender',
        gender,
        rb_build_agg (b.encode_uid) AS bitmap
    FROM
        dws_userbase a
        JOIN dws_uid_dict b ON a.uid = b.uid
    GROUP BY
        gender;
  • Generate BSI data of behavior tag values based on the usershop_behavior and dws_uid_dict tables, and store the table data in different buckets.

    INSERT INTO bsi_gmv
    SELECT
        bsi_build(array_agg(b.encode_uid),array_agg(a.gmv)) AS bitmap
    FROM
        usershop_behavior a
        JOIN dws_uid_dict b ON a.uid = b.uid
    ;

Profile analysis

You can use the BSI algorithm to perform association analysis based on user attribute tags and behavior tags. For example, you can gain insights into the behavior tags of identified user groups or filter user groups by behavior tag.

User group identification and behavior tag analysis

  • Query the total GMV and average GMV of male users from the Guangdong province.

    • Use the BSI and roaring bitmap algorithms.

      SELECT 
          sum(kv[1]) AS total_gmv, -- Total GMV.
          sum(kv[1])/sum(kv[2]) AS avg_gmv -- Average GMV.
      FROM (
          SELECT 
      	    bsi_sum(t1.gmv_bsi,t2.crowd) AS kv
          FROM
              bsi_gmv t1,
              (SELECT rb_and(a.bitmap,b.bitmap) AS crowd FROM
                  (SELECT bitmap FROM rb_tag WHERE tag_name='gender' AND tag_val='Male ') a, -- Male users.
                  (SELECT bitmap FROM rb_tag WHERE tag_name='province' AND tag_val ='Guangdong') b -- Users from the Guangdong province.
              ) t2 
          ) t;
    • Use the dws_userbase and usershop_behavior tables.

      SELECT
          sum(b.gmv) AS total_gmv,
          avg(b.gmv) AS avg_gmv
      FROM
          dws_userbase a
          JOIN usershop_behavior b ON a.uid = b.uid
      WHERE
          a.province = 'Guangdong'
          AND a.gender = 'Male';
  • Query the consumption amount distribution of male users from the Guangdong province.

    • Use the BSI and roaring bitmap algorithms: Use the bsi_stat function to define boundary value arrays. This accelerates queries in multiple ranges.

      SELECT
          bsi_stat('{100,300,500}', filter_bsi)
      FROM (
          SELECT
              bsi_filter(t1.gmv_bsi,t2.crowd) AS filter_bsi
          FROM 
              bsi_gmv t1,
              (SELECT rb_and(a.bitmap,b.bitmap) AS crowd FROM
                  (SELECT bitmap FROM rb_tag WHERE tag_name='gender' AND tag_val='Male ') a, -- Male users.
                  (SELECT bitmap FROM rb_tag WHERE tag_name='province' AND tag_val = 'Guangdong') b -- Users from the Guangdong province.
              ) t2
          ) t;
    • Use the dws_userbase and usershop_behavior tables: You can use only the CASE WHEN syntax.

      SELECT
          CASE
              WHEN gmv >= 0 AND gmv <= 100 THEN '0-100'
              WHEN gmv > 100 AND gmv <= 300 THEN '100-300'
              WHEN gmv > 300 AND gmv <= 500 THEN '300-500'
              WHEN gmv > 500 THEN '>500'
          END AS gmv_range,
          COUNT(*) AS user_count
      FROM 
          dws_userbase a
          JOIN usershop_behavior b ON a.uid = b.uid
      WHERE a.province = 'Guangdong'
          AND a.gender = 'Male'
      GROUP BY gmv_range
      ORDER BY gmv_range;
  • Query the top K consumption amounts of male users from the Guangdong province on the previous day.

    • Use the BSI and roaring bitmap algorithms.

      SELECT
          rb_to_array(bsi_topk(filter_bsi,10))
      FROM (
          SELECT
              bsi_filter(t1.gmv_bsi,t2.crowd) AS filter_bsi
          FROM 
              bsi_gmv t1,
              (SELECT rb_and(a.bitmap,b.bitmap) AS crowd FROM
                  (SELECT bitmap FROM rb_tag WHERE tag_name='gender' AND tag_val='Male ') a, -- Male users.
                  (SELECT bitmap FROM rb_tag WHERE tag_name='province' AND tag_val ='Guangdong') b -- Users from the Guangdong province.
              ) t2
          ) t;
    • Use the dws_userbase and usershop_behavior tables to query data.

      SELECT
          b.uid,
          b.gmv
      FROM
          dws_userbase a
          JOIN usershop_behavior b ON a.uid = b.uid
      WHERE
          a.province = 'Guangdong'
          AND a.gender = 'Male'
      ORDER BY gmv DESC
      LIMIT 10;

User group identification based on behavior tags

  • Identify users whose consumption amounts are greater than 1,000.

    • Use the BSI and roaring bitmap algorithms.

      SELECT
          rb_to_array(bsi_gt(gmv_bsi, 1000)) AS crowd
      FROM
          bsi_gmv;
    • Use the dws_userbase and usershop_behavior tables.

      SELECT
          array_agg(uid)
      FROM
          usershop_behavior
      WHERE
          gmv > 800;

Advanced practices of profile analysis: bucketing

In the preceding example, the dws_userbase table is compressed into a roaring bitmap table based on the province and gender columns, and the usershop_behavior table is compressed into a BSI table. The compressed roaring bitmap and BSI tables are distributed only on specific nodes of an instance. As a result, computing and storage resources are not evenly distributed, and the instance resources are not fully used. To resolve this issue, you can split roaring bitmap and BSI tables into multiple segments and distribute the segments on the instance to improve execution concurrency. In this example, the bitmap and BSI tables are split into 65,536 segments.

BSI tables

Table name

Field

Description

dws_userbase

(uid int, province text, gender text)

The table that contains the original attribute tags of users, which is the same as the table in the basic practices of profile analysis.

dws_uid_dict

(encode_uid serial, uid int)

The uid dictionary encoding table, which is the same as the table in the basic practices of profile analysis.

usershop_behavior

(uid int, category text, gmv int, ds date)

The original behavior tag table.

Compared with the table in the basic practices of profile analysis, the category and ds fields are added to this table.

rb_tag

(tag_name text, tag_val text, bucket int, bitmap roaringbitmap)

The attribute tag table that is based on the roaring bitmap algorithm.

Compared with the table in the basic practices of profile analysis, the bucket field is added to this table.

bsi_gmv

(ds text, category text, bucket int, gmv_bsi bsi)

The detailed GMV metric table that is based on the BSI algorithm.

Compared with the table in the basic practices of profile analysis, the category, ds, and bucket fields are added to this table.

DDL statements that are used to create the preceding tables:

  • Create the rb_tag table and store data in different buckets.

    CREATE TABLE rb_tag (
        tag_name text,
        tag_val text,
        bucket int,
        bitmap roaringbitmap
    )
    WITH (
        distribution_key = 'bucket ' -- Use the bucket ID as the distribution key.
    );
  • Create the bsi_gmv table and store data in different buckets.

    CREATE TABLE bsi_gmv (
        category text,
        bucket int,
        gmv_bsi bsi,
        ds date
    )
    WITH (
        distribution_key = 'bucket' -- Use the bucket ID as the distribution key.
    );

Import data

  • Generate roaring bitmaps of attribute tag values based on the dws_userbase and dws_uid_dict tables, and store the roaring bitmaps in different buckets.

    INSERT INTO rb_tag
    SELECT
        'province',
        province,
        encode_uid / 65536 AS "bucket",
        rb_build_agg (b.encode_uid) AS bitmap
    FROM
        dws_userbase a
        JOIN dws_uid_dict b ON a.uid = b.uid
    GROUP BY
        province,
        "bucket";
    
    INSERT INTO rb_tag
    SELECT
        'gender',
        gender,
        encode_uid / 65536 AS "bucket",
        rb_build_agg (b.encode_uid) AS bitmap
    FROM
        dws_userbase a
        JOIN dws_uid_dict b ON a.uid = b.uid
    GROUP BY
        gender,
        "bucket";
  • Generate BSI data of behavior tag values based on the usershop_behavior and dws_uid_dict tables, and store the table data in different buckets.

    INSERT INTO bsi_gmv
    SELECT
        a.category,
        b.encode_uid / 65536 AS "bucket",
        bsi_build(array_agg(b.encode_uid),array_agg(a.gmv)) AS bitmap,
        a.ds
    FROM
        usershop_behavior a
        JOIN dws_uid_dict b ON a.uid = b.uid
    WHERE
        ds = CURRENT_DATE - interval '1 day'
    GROUP BY
        category,
        "bucket",
        ds;

Profile analysis

You can use the BSI algorithm to perform association analysis based on user attribute tags and behavior tags. For example, you can gain insights into the behavior tags of identified user groups, filter user groups by behavior tag, and distribute user data into different buckets to accelerate data computing.

User group identification and behavior tag analysis

  • Query the total GMV and average GMV of male users from the Guangdong province in the 3C category on the previous day.

    SELECT 
        sum(kv[1]) AS total_gmv, -- Total GMV.
        sum(kv[1])/sum(kv[2]) AS avg_gmv -- Average GMV.
    FROM (
        SELECT 
    	    bsi_sum(t1.gmv_bsi,t2.crowd) AS kv, 
            t1.bucket
        FROM
            (SELECT gmv_bsi, bucket FROM bsi_gmv WHERE category = '3C' AND ds = CURRENT_DATE - interval '1 day') t1
        JOIN
            (SELECT rb_and(a.bitmap,b.bitmap) AS crowd, a.bucket FROM
                (SELECT bitmap, bucket FROM rb_tag WHERE tag_name='gender' AND tag_val='Male ') a -- Male users.
             JOIN
                (SELECT bitmap, bucket FROM rb_tag WHERE tag_name = 'province' AND tag_val = 'Guangdong') b -- Users from the Guangdong province.
             ON a.bucket = b.bucket
            ) t2 
        ON t1.bucket = t2.bucket
        ) t;
  • Query the consumption amount distribution of male users from the Guangdong province in the 3C category on the previous day.

    SELECT
        bsi_stat('{100,300,500}', bsi_add_agg(filter_bsi))
    FROM (
        SELECT
            bsi_filter(t1.gmv_bsi,t2.crowd) AS filter_bsi,
            t1.bucket
        FROM 
            (SELECT gmv_bsi, bucket FROM bsi_gmv WHERE category = '3C' AND ds = CURRENT_DATE - interval '1 day') t1
        JOIN
            (SELECT rb_and(a.bitmap,b.bitmap) AS crowd, a.bucket FROM
                (SELECT bitmap, bucket FROM rb_tag WHERE tag_name='gender' AND tag_val = 'Male ') a -- Male users.
             JOIN
                (SELECT bitmap, bucket FROM rb_tag WHERE tag_name='province' AND tag_val = 'Guangdong') b -- Users from the Guangdong province.
             ON a.bucket = b.bucket
            ) t2
        ON t1.bucket = t2.bucket
        ) t;
  • Query the top K consumption amounts of male users from the Guangdong province on the previous day.

    SELECT
        bsi_topk(bsi_add_agg(filter_bsi),10)
    FROM (
        SELECT
            bsi_filter(t1.gmv_bsi,t2.crowd) AS filter_bsi,
            t1.bucket
        FROM 
            (SELECT bsi_add_agg(gmv_bsi) AS gmv_bsi, bucket FROM bsi_gmv WHERE ds = CURRENT_DATE - interval '1 day' GROUP BY bucket) t1
        JOIN
            (SELECT rb_and(a.bitmap,b.bitmap) AS crowd, a.bucket FROM
                (SELECT bitmap, bucket FROM rb_tag WHERE tag_name = 'gender' AND tag_val = 'Male ') a -- Male users.
             JOIN
                (SELECT bitmap, bucket FROM rb_tag WHERE tag_name = 'province' AND tag_val = 'Guangdong') b -- Users from the Guangdong province.
             ON a.bucket = b.bucket
            ) t2
        ON t1.bucket = t2.bucket
        ) t;

User group identification based on behavior tags

  • Identify users whose consumption amounts are greater than 1,000 in the 3C category in the previous month.

    SELECT
        rb_to_array(bsi_gt(bsi_add_agg(gmv_bsi), 1000)) AS crowd
    FROM
        bsi_gmv
    WHERE
        category = '3C'
        AND ds BETWEEN CURRENT_DATE - interval '30 day' AND CURRENT_DATE - interval '1 day';