In scenarios in which more than 1,000 tags are used, the solution of using wide tables for tag computing is not suitable. This is because the update efficiency decreases when the number of columns increases. This topic describes how to perform tag computing and profile analysis in such scenarios.
Background information
Hologres is compatible with the PostgreSQL ecosystem and supports Roaring bitmap functions. Indexes are created for tag tables. User IDs are encoded and stored as bitmaps. Relational operations are converted into intersection, union, and difference operations of bitmaps to improve the performance of real-time computing. In scenarios that require analysis of large amounts of user properties, Roaring bitmaps can be used to respond to queries within sub-seconds.
Scenarios
Roaring bitmaps are suitable for the following scenarios:
Scenarios in which large amounts of tags are used: In such scenarios, JOIN operations are required for many large tables. The
BITMAP_ANDfunction can be used to replace JOIN operations to reduce memory consumption. The bitmap plug-in repository can improve the CPU utilization by 1% to 2% by means of Single Instruction Multiple Data (SIMD)-based optimization.Scenarios that involves large amounts of data and requires deduplication: Bitmaps provide intrinsic deduplication capabilities to prevent unique vector (UV) computing and memory overheads.
Tag types
Tags in profiling systems can be classified into the following types. Different types of tags use different computing modes. Specific types of tags must be converted into and stored as bitmaps.
Property tags: Property tags describe user properties such as gender, province, and marriage status. Property tags are stable and can be filtered based on precise filter conditions. For these tags, the bitmap compression ratio is high, and bitmaps are suitable for related operations.
Action tags: Action tags describe action characteristics of users and depict what users do at specific points in time. User actions include page views, purchases, and active logons. Action data is frequently updated and requires range scanning and aggregate filtering. For action tags, the bitmap compression ratio is low, and bitmaps are not suitable for related operations.
Property tags
Property tags describe user properties. Property tags are stable and can be filtered based on precise filter conditions. Bitmaps can be used for efficient compression and operations.
Solution
Assume that a data management platform (DMP) contains two property tag tables, as shown in the following figure.
The dws_userbase table describes basic user properties, and the dws_usercate_prefer table describes user preferences. If you want to obtain the number of users who meet the
[province = Beijing] & [cate_prefer = Fashion]filter condition, you can perform association, filtering, and deduplication operations. However, in scenarios that involve large amounts of data, association and deduplication operations may bring heavy performance burdens.The bitmap-based optimization solution uses bitmap tables that contain pre-created tags to reduce the costs of ad hoc operations. In this example, data in the preceding tables are split by column to create two bitmap tables shown in the following figure. Then, a bitwise AND operation is performed to obtain the users who meet the preceding filter condition.
The rb_dws_userbase_province table describes the bitmap relationship between the province and uid columns, and the rb_dws_usercate_prefer_cprefer table describes the bitmap relationship between the cate_prefer and uid columns. However, the preceding solution has problems. When columns have hierarchical relationships, such splitting and operations may cause computing errors, as shown in the following figure.
Data in the dws_shop_cust table that describes the information about fresh, existing, and potential customers is split by column. The rb_dws_shop_cust_shop_id bitmap table that describes shop IDs and the rb_dws_shop_cust_cust_typebitmap table that describes customer types are created. If you filter the customers who meet the[shop_id = A] & [cust_type = Fresh]filter condition, you obtain a result of uid[1]. However, a uid column value of 1 that corresponds to a cust_type column value of Fresh does not exist. This is because the cust_type and shop_id columns are correlated. In a data warehouse model, cust_type is a metric for shop_id and cannot be used independently. You must use shop_id in combination with cust_type to create the rb_dws_shop_cust_sid_ctype bitmap table to prevent such error.You must compress the uid values into bitmaps and then perform bitwise AND, OR, and NOT operations to compute tags.
Procedure
Encode user information
User IDs may be strings. However, bitmaps contain only integers. Therefore, you must create a table that contains an auto-increment field by using the SERIAL or BIGSERIAL data type before you can encode uid values of the string type into integers. For more information, see PostgreSQL SERIAL.
--Create a dictionary table. CREATE TABLE dws_uid_dict ( encode_uid bigserial, uid text primary key ); --Insert uid values from the tag table. INSERT INTO dws_uid_dict(uid) SELECT uid FROM dws_userbase ON conflict DO NOTHING;Encoded user IDs maintain continuity and can be easily stored as bitmaps. The following figure shows an example in which bitmap2 contains sparse data and delivers storage efficiency much lower than bitmap1. Therefore, if user IDs are encoded, storage costs can be reduced and computing efficiency can be improved.

Sparse numeric data can be encoded, but additional performance overheads may occur. For example, advertising DMPs not only require high-performance profiling, but also require real-time output of user details. If real-time output of user details is necessary, user ID tables must be joined to restore encoded user IDs, which causes additional performance overheads. You must determine whether to encode user IDs based on your specific scenario. We have different recommendations for different cases.
For user IDs of the string type, we recommend that you encode them.
For user IDs of the integer type that require frequent restoration of encoded user IDs, we do not recommend that you encode them.
For user IDs of the integer type that do not require frequent restoration of encoded user IDs, we recommend that you encode them.
Process and query bitmaps
Split the dws_userbase and dws_shop_cust tables into one bitmap table that contains the province and gender columns. The gender column values contain only Male and Female. Compressed bitmaps can be distributed only on two nodes in a cluster. As a result, computing and storage resources are not evenly distributed and the cluster resources are not fully used. In this case, the bitmaps must be split into multiple segments and distributed in the cluster for concurrent execution. For example, you can execute the following SQL statements to split the bitmaps into 65,536 segments:
-- Create a wide table named dws_userbase. BEGIN; CREATE TABLE dws_shop_cust ( uid text not null primary key, shop_id text, cust_type text ); call set_table_property('dws_shop_cust', 'distribution_key', 'uid'); END; -- Create a bitmap extension. CREATE EXTENSION roaringbitmap; BEGIN; CREATE TABLE rb_dws_userbase_province ( province text, bucket int, bitmap roaringbitmap ); call set_table_property('rb_dws_userbase_province', 'distribution_key', 'bucket'); END; BEGIN; CREATE TABLE rb_dws_shop_cust_sid_ctype ( shop_id text, cust_type text, bucket int, bitmap roaringbitmap ); call set_table_property('rb_dws_shop_cust_sid_ctype', 'distribution_key', 'bucket'); END; -- Write data into the bitmap table. INSERT INTO rb_dws_userbase_province SELECT province, encode_uid / 65536 as "bucket", rb_build_agg(b.encode_uid) AS bitmap FROM dws_userbase a join dws_uid_dict b on a.uid = b.uid GROUP BY province, "bucket"; INSERT INTO rb_dws_shop_cust_sid_ctype SELECT shop_id, cust_type, encode_uid / 65536 AS "bucket", rb_build_agg(b.encode_uid) AS bitmap FROM dws_shop_cust a JOIN dws_uid_dict b ON a.uid = b.uid GROUP BY shop_id, cust_type, "bucket";If you want to obtain the users that meet the
[shop_id = A] & [cust_type = Fresh] & [province = Beijing]filter condition, you can perform related AND, OR, and NOT operations on the bitmaps. You can execute the following SQL statements:SELECT SUM(RB_CARDINALITY(rb_and(ub.bitmap, uc.bitmap))) FROM (SELECT rb_or_agg(bitmap) AS bitmap, bucket FROM rb_dws_userbase_province WHERE province = 'Beijing' GROUP BY bucket) ub JOIN (SELECT rb_or_agg(bitmap) AS bitmap, bucket FROM rb_dws_shop_cust_sid_ctype WHERE shop_id = 'A' AND cust_type = 'Fresh' GROUP BY bucket) uc ON ub.bucket = uc.bucket;
Action tags
Typically, fact tables are organized by time. For example, user action tables are organized by day. User data for a specific day contains only limited entries. If such data is compressed into bitmaps, row-oriented storage overheads may cause storage space to be wasted. In addition, in typical computing modes of fact tables, data of multiple days must be aggregated for filtering. If bitmaps are used, they must be expanded before they can be aggregated for operations. Such data frequently changes and requires real-time update. In the [option->bitmap] storage structure, data that needs to be updated cannot be identified. Therefore, bitmaps are not suitable for action data, aggregation, or real-time update.
In scenarios that involve action tags, Hologres can use the original storage format. When fact tables and property tables need to be joined, bitmaps can be generated for the filter results of fact tables and then joined with the bitmap indexes of property tables. Because bitmap index tables use bucket as the distribution key, local join operations can improve the join performance.
You can execute the following SQL statements to obtain users that meet the [province=Beijing] & [shop_id=A AND No purchase for 7 days] filter condition.
-- Create an action table.
BEGIN;
CREATE TABLE dws_usershop_behavior
(
uid int not null,
shop_id text not null,
pv_cnt int,
trd_cnt int,
ds integer not null
);
call set_table_property('dws_usershop_behavior', 'distribution_key', 'uid');
COMMIT;
-- Encode the action table.
BEGIN;
CREATE TABLE dws_usershop_behavior_bucket
(
encode_uid int not null,
shop_id text not null,
pv_cnt int,
trd_cnt int,
ds int not null,
bucket int
);
CALL set_table_property('dws_usershop_behavior_bucket', 'orientation', 'column');
call set_table_property('dws_usershop_behavior_bucket', 'distribution_key', 'bucket');
CALL set_table_property('dws_usershop_behavior_bucket', 'clustering_key', 'shop_id,encode_uid');
COMMIT;
-- Write fact data.
INSERT INTO dws_usershop_behavior_bucket
SELECT *,
encode_uid,
shop_id,
pv_cnt,
trd_cnt,
encode_uid / 65536
FROM dws_usershop_behavior a JOIN dws_uid_dictionary b
on a.uid = b.uid;
-- Join fact data and property data.
SELECT sum(rb_cardinality(bitmap)) AS cnt
FROM
(SELECT rb_and(ub.bitmap, us.bitmap) AS bitmap,
ub.bucket
FROM
(SELECT rb_or_agg(bitmap) AS bitmap,
bucket
FROM rb_dws_userbase_province
WHERE province = 'Beijing'
GROUP BY bucket) AS ub
JOIN
(SELECT rb_build_agg(uid) AS bitmap,
bucket
FROM
(SELECT uid,
bucket
FROM dws_usershop_behavior_bucket
WHERE shop_id = 'A' AND ds > to_char(current_date-7, 'YYYYMMdd')::int
GROUP BY uid,
bucket HAVING sum(trd_cnt) = 0) tmp
GROUP BY bucket) us ON ub.bucket = us.bucket) rOffline processing of bitmaps
You can choose to process bitmap data offline to prevent bitmap data computing from affecting your business. You can load data from foreign tables in MaxCompute or Hive. Bitmaps are processed in similar manners both online and offline. Data can be generated through encoding and aggregation. The following code provides an example on how to create bitmap data offline in MaxCompute.
-- Select a project.
USE bitmap_demo;
-- Create a source table.
CREATE TABLE mc_dws_uid_dict (
encode_uid bigint,
bucket bigint,
uid string
);
CREATE TABLE mc_dws_userbase
(
uid string,
province string,
gender string,
marriaged string
);
-- Encode the uid values.
-- Calculate the new uid values.
WITH uids_to_encode AS
(SELECT DISTINCT(ub.uid),
CAST(ub.uid / 65336 AS BIGINT) AS bucket
FROM mc_dws_userbase ub
LEFT JOIN mc_dws_uid_dict d ON ub.uid = d.uid
WHERE d.uid IS NULL),
-- Calculate the number of uids to be encoded in each bucket. Use the SUM function to obtain the bucket offset.
uids_bucket_encode_offset AS
(SELECT bucket,
sum(cnt) over (ORDER BY bucket ASC) - cnt AS bucket_offset
FROM
(SELECT count(1) AS cnt,
bucket
FROM uids_to_encode
GROUP BY bucket) x),
-- Calculate the maximum number of encoded uids.
dict_used_id_offset AS
(SELECT max(encode_uid) AS used_id_offset FROM mc_dws_uid_dict)
-- New uids = Maximum number of encoded uids + Bucket offset + Row number
INSERT INTO mc_dws_uid_dict
SELECT
COALESCE((SELECT used_id_offset FROM dict_used_id_offset),0) + bucket_offset + rn,
bucket,
uid
FROM
(SELECT row_number() OVER (partition BY ub.bucket ORDER BY ub.uid) AS rn,
ub.bucket,
bo.bucket_offset,
uid
FROM uids_to_encode ub
JOIN uids_bucket_encode_offset bo ON ub.bucket = bo.bucket) j
-- Create bitmap-related functions.
add jar function_jar_dir/mc-bitmap-functions.jar as mc_bitmap_func.jar -f;
create function mc_rb_cardinality as com.alibaba.hologres.RbCardinalityUDF using mc_bitmap_func.jar;
create function mc_rb_build_agg as com.alibaba.hologres.RbBuildAggUDAF using mc_bitmap_func.jar;
-- Create a bitmap table and write data to the table.
CREATE TABLE mc_rb_dws_userbase_province
(
province string,
bucket int,
bitmap string
);
INSERT INTO mc_rb_dws_userbase_province
SELECT province,
b.bucket_num,
mc_rb_build_agg(b.encode_uid) AS bitmap
FROM mc_dws_userbase a join mc_dws_uid_dict b on a.uid = b.uid
GROUP BY province, b.bucket_num;Execute the following statements in Hologres:
-- Create a MaxCompute table.
CREATE TABLE mc_rb_dws_userbase_province (
province text,
bucket int,
bitmap text
) server odps_server options(project_name 'bitmap_demo', table_name 'mc_rb_dws_userbase_province');
-- Write bitmap data from the MaxCompute table to Hologres.
INSERT INTO rb_dws_userbase_province
SELECT province,
bucket::INT,
roaringbitmap_text(bitmap, FALSE)
FROM mc_rb_dws_userbase_province;After the preceding steps are performed, data is loaded to Hologres. You can then perform bitwise operations to speed up queries.
You can use mc-bitmap to compute bitmap data in MaxCompute.
For more information about the offline processing of bitmaps, see Batch UV calculation.
Real-time processing of bitmaps
In real-time computing scenarios, you can use Hologres in combination with Flink to perform real-time deduplication for user tags based on Roaring bitmaps. Perform the following steps:
Use a user ID dictionary table as the dimension table and use the INSERT ON CONFLICT statement of Hologres to add user IDs. Then, join the dimension and action tables in Flink.
Aggregate the resulting table with Roaring bitmaps by tag.
Write the resulting bitmaps into the bitmap table in Hologres.