Hologres provides a near real-time unique visitor (UV) computing solution that is based on pre-aggregation. This solution allows you to perform UV computing on hundreds of millions of data records at a high level of queries per second (QPS). In this solution, you can use roaring bitmaps and the periodic scheduling method to pre-aggregate data and count the number of UVs for a custom period of time.
Description
If your business involves a large amount of data and requires high QPS and low latency, you can use roaring bitmaps of Hologres and the periodic scheduling method to pre-aggregate data. This allows you to count the number of UVs for a custom period of time.
Advantages and disadvantages
Advantages: This solution delivers excellent computing performance and supports UV computing with high QPS and low latency based on accurate cardinality estimation. It also allows you to configure a custom period of time.
Disadvantages: You need to perform a pre-calculation and periodically update data in aggregation tables, which increases the workload of maintenance tasks.
Scenarios: This solution is suitable for UV computing on hundreds of millions of data records with high QPS for a custom and long period of time.
The methods for using roaring bitmaps vary based on data types and use scenarios. You can select a method based on your business requirements.
Method 1: UV computing for an extended period of time based on an INT type field: This method is suitable for scenarios in which data is filtered based on a single tag and then accurate cardinality estimation is performed on a field of the INT type in the result data. Cardinality refers to the number of distinct values.
Method 2: UV computing for an extended period of time based on a TEXT type field: This method is suitable for scenarios in which data is filtered based on a single tag and then accurate cardinality estimation is performed on a field of the TEXT type in the result data. In this method, mapping tables are required.
Method 3 (advanced method): Bucket-based UV computing for an extended period of time: This method is mostly used in profile analysis scenarios. In this method, intersection, union, and XOR operations are performed based on multiple tags or properties to determine the user cardinality. In addition, user data is distributed in buckets to support parallel queries. This way, user data is highly compressed, I/O operations on user data are reduced, and the computing efficiency is improved.
Roaring bitmaps can be used for UV computing and other accurate cardinality estimation scenarios, such as livestream e-commerce scenarios in which the number of products and the number of brands need to be displayed on dashboards.
For more information about roaring bitmap functions, see Roaring bitmap functions.
Method 1: UV computing for an extended period of time based on an INT type field by using roaring bitmaps
Scenarios
This method is suitable for scenarios in which UV computing on hundreds of millions of data records for a custom and long period of time needs to be performed with high QPS based on user IDs (UIDs) of the INT type.
Procedure
This section describes the procedure used to implement this method.
Step 1: Create a user fact table to store detailed data of all business dimensions.
Step 2: Aggregate the fact table based on basic dimensions that are specified in the GROUP BY clause and store the aggregated UIDs as roaring bitmaps in an aggregation sink table.
Step 3: Query the aggregation sink table based on query dimensions. In this step, perform the OR operation on roaring bitmap fields for deduplication. This way, you can count the number of UVs and count the number of records in the aggregation sink table to obtain the number of page views (PVs). The results are returned in subseconds.
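The three steps can be sketched outside of Hologres as a minimal Python model. Plain Python sets stand in for roaring bitmaps (set union plays the role of RB_OR_AGG, len() the role of RB_CARDINALITY), and the rows and field values are invented for illustration:

```python
# A sketch of the three steps, with Python sets standing in for roaring bitmaps.
from collections import defaultdict

# Step 1: fact rows of the form (uid, country, prov, city, ymd).
fact = [
    (1, "CN", "ZJ", "HZ", "20240329"),
    (2, "CN", "ZJ", "HZ", "20240329"),
    (1, "CN", "ZJ", "HZ", "20240330"),  # the same user on the next day
    (3, "CN", "JS", "NJ", "20240330"),
]

# Step 2: pre-aggregate per (country, prov, city, ymd): a uid "bitmap" plus a PV count.
agg = defaultdict(lambda: {"uids": set(), "pv": 0})
for uid, country, prov, city, ymd in fact:
    key = (country, prov, city, ymd)
    agg[key]["uids"].add(uid)
    agg[key]["pv"] += 1

# Step 3: query a date range: OR (union) the small per-key bitmaps to deduplicate users.
def uv_pv(agg, start, end):
    uids, pv = set(), 0
    for (country, prov, city, ymd), v in agg.items():
        if start <= ymd <= end:
            uids |= v["uids"]  # RB_OR_AGG
            pv += v["pv"]      # SUM(pv)
    return len(uids), pv       # (RB_CARDINALITY, PV)

print(uv_pv(agg, "20240329", "20240330"))  # uid 1 is counted only once across both days
```

In Hologres, the per-key bitmaps are compressed roaring bitmaps, so the query in step 3 scans only the small aggregation table rather than the fact table.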
Step 1: Prepare basic data
Create a roaring bitmap extension.
Before you use roaring bitmaps, you must create a roaring bitmap extension. An extension is created at the database level. For each database, you need to create an extension only once. Syntax:
Note: You must load the roaring bitmap extension in the public schema.

```sql
CREATE EXTENSION IF NOT EXISTS roaringbitmap;
```

Create a user fact table.
Create a user fact table to store complete and detailed user data. In most cases, fact tables are large-size tables that store complete and detailed user data. We recommend that you create a fact table partitioned by day to facilitate data writes and updates.
```sql
BEGIN;
CREATE TABLE IF NOT EXISTS ods_app_detail (
    uid int,
    country text,
    prov text,
    city text,
    channel text,
    operator text,
    brand text,
    ip text,
    click_time text,
    year text,
    month text,
    day text,
    ymd text NOT NULL
)
PARTITION BY LIST (ymd); -- Configure partitions by day because the fact table contains a large amount of data.
CALL set_table_property('ods_app_detail', 'orientation', 'column');
CALL set_table_property('ods_app_detail', 'bitmap_columns', 'country,prov,city,channel,operator,brand,ip,click_time,year,month,day,ymd');
-- Specify a distribution key for the table based on real-time query requirements. Data is distributed to shards based on the distribution key.
CALL set_table_property('ods_app_detail', 'distribution_key', 'uid');
-- Specify fields that can be used in the WHERE clause. We recommend that you configure a field that contains the year, month, and date information as the clustering key and event time column.
CALL set_table_property('ods_app_detail', 'clustering_key', 'ymd');
CALL set_table_property('ods_app_detail', 'event_time_column', 'ymd');
COMMIT;
```

Create an aggregation sink table for roaring bitmaps.
Create an aggregation sink table to store aggregated roaring bitmaps.
Only millions of data records are stored in the aggregation sink table for each day after the aggregation. We recommend that you create a non-partitioned aggregation sink table to facilitate data write-back. You can also create an aggregation sink table that is partitioned by month or quarter. If you configure partitions by day, the amount of data in each partition is small, and a large number of small files are generated due to partitioning. This increases the memory usage.
Create the aggregation sink table based on dimensions such as country, prov, and city, and configure dimension fields as distribution key fields. This facilitates dimension-based queries. If more than three fields are specified in the GROUP BY clause, we recommend that you configure the most frequently used fields as distribution key fields.
Configure the query dimension fields and the date field as primary key fields to prevent data from being repeatedly inserted.
Configure the date field as the clustering key and event time column to facilitate data filtering.
Sample code:
```sql
BEGIN;
CREATE TABLE dws_app_rb (
    rb_uid roaringbitmap, -- UV computing.
    country text,
    prov text,
    city text,
    ymd text NOT NULL, -- The date field.
    pv integer, -- PV computing.
    PRIMARY KEY (country, prov, city, ymd) -- Configure query dimension fields and the date field as primary key fields to prevent data from being repeatedly inserted.
);
CALL set_table_property('dws_app_rb', 'orientation', 'column');
-- Configure the date field as the clustering key and event time column to facilitate data filtering.
CALL set_table_property('dws_app_rb', 'clustering_key', 'ymd');
CALL set_table_property('dws_app_rb', 'event_time_column', 'ymd');
-- Configure fields that are specified in the GROUP BY clause as the distribution key fields.
CALL set_table_property('dws_app_rb', 'distribution_key', 'country,prov,city');
END;
```
Step 2: Build the aggregation sink table with roaring bitmaps.
After you create the fact table, you can build the aggregation sink table by writing result data to the aggregation sink table. Sample code:
```sql
-- Example: Query data that was generated in the specified six-month period from the fact table, convert the queried data into roaring bitmaps, and write the roaring bitmaps to the aggregation sink table.
INSERT INTO dws_app_rb
SELECT
    RB_BUILD_AGG(uid),
    country,
    prov,
    city,
    ymd,
    COUNT(1)
FROM ods_app_detail
WHERE ymd >= '20231201' AND ymd <= '20240502'
GROUP BY country, prov, city, ymd;
```

If the fact table is updated, you can update incremental data or full data in the aggregation sink table based on the update granularity of the fact table. The following table describes the differences between incremental update and full update.
| Update mode of the aggregation sink table | Description | Sample code |
| --- | --- | --- |
| Incremental update | If data is updated in the fact table on a regular basis, such as when only the partition that stores data of the previous day is updated, you can execute the INSERT statement to write incremental data to the aggregation sink table. | Write only incremental data: `INSERT INTO dws_app_rb SELECT RB_BUILD_AGG(uid), country, prov, city, ymd, COUNT(1) FROM ods_app_detail WHERE ymd = '20240503' GROUP BY country, prov, city, ymd;` |
| Full update | If data is not regularly updated in the fact table and you cannot quickly identify incremental data, you can update the aggregation sink table by using the full data write-back method. | Execute the INSERT OVERWRITE statement to update all data in the aggregation sink table. |
Step 3: Count the number of UVs in a custom and long period of time
You can count the number of UVs and the number of PVs in a custom period of time based on a custom combination of basic dimensions. In most cases, query results are returned in milliseconds. This method is more efficient than executing traditional SQL statements and also supports high QPS. Examples:
Count the number of UVs and the number of PVs on a specific day.

```sql
SELECT RB_CARDINALITY(RB_OR_AGG(rb_uid)) AS uv, country, prov, city, SUM(pv) AS pv
FROM dws_app_rb
WHERE ymd = '20240329'
GROUP BY country, prov, city;
```

Count the number of UVs and the number of PVs in a specific month.

```sql
SELECT RB_CARDINALITY(RB_OR_AGG(rb_uid)) AS uv, country, prov, city, SUM(pv) AS pv
FROM dws_app_rb
WHERE ymd >= '20240301' AND ymd <= '20240331'
GROUP BY country, prov, city;
```
Method 2: UV computing for an extended period based on a TEXT type field and a mapping table by using roaring bitmaps
In actual business scenarios, the ID fields of most tables are of the TEXT type. However, roaring bitmaps do not support the TEXT type. Therefore, you need to use fields of the SERIAL type to build a mapping table. This allows you to perform efficient UV computing by using roaring bitmaps.
Scenarios
This method is suitable for scenarios in which UV computing on hundreds of millions of data records for a custom and long period of time needs to be performed with high QPS based on a field of the TEXT type. In this method, accurate cardinality estimation is performed on the field of the TEXT type.
Procedure
This section describes the procedure used to implement this method.
Step 1: Create a user fact table to store detailed data of all business dimensions. Create a UID mapping table to store the UIDs of historical users and the mapped 32-bit integers.
Step 2: Join the fact table and the UID mapping table, perform the GROUP BY operation on the data based on the most fine-grained basic dimension, and aggregate all data of the previous day into UIDs based on the largest query dimension. Then, store the UIDs as roaring bitmaps in the aggregation sink table. The aggregation sink table stores millions of data records for each day.
Step 3: Query the aggregation sink table based on query dimensions. In this step, perform the OR operation on roaring bitmap fields for deduplication. This way, you can count the number of UVs and count the number of records in the aggregation sink table to obtain the number of PVs. The results are returned in subseconds.
Step 1: Prepare basic data
Create a roaring bitmap extension.
Before you use roaring bitmaps, you must create a roaring bitmap extension. An extension is created at the database level. For each database, you need to create an extension only once. Syntax:
Note: You must load the roaring bitmap extension in the public schema.

```sql
CREATE EXTENSION IF NOT EXISTS roaringbitmap;
```

Create a user fact table.
Create a user fact table to store complete and detailed user data. In most cases, fact tables are large-size tables that store complete and detailed user data. We recommend that you create a fact table partitioned by day to facilitate data writes and updates.
```sql
BEGIN;
CREATE TABLE IF NOT EXISTS ods_app_detail (
    uid text,
    country text,
    prov text,
    city text,
    channel text,
    operator text,
    brand text,
    ip text,
    click_time text,
    year text,
    month text,
    day text,
    ymd text NOT NULL
)
PARTITION BY LIST (ymd); -- Configure partitions by day because the fact table contains a large amount of data.
CALL set_table_property('ods_app_detail', 'orientation', 'column');
CALL set_table_property('ods_app_detail', 'bitmap_columns', 'country,prov,city,channel,operator,brand,ip,click_time,year,month,day,ymd');
-- Specify a distribution key for the table based on real-time query requirements. Data is distributed to shards based on the distribution key.
CALL set_table_property('ods_app_detail', 'distribution_key', 'uid');
-- Specify fields that can be used in the WHERE clause. We recommend that you configure a field that contains the year, month, and date information as the clustering key and event time column.
CALL set_table_property('ods_app_detail', 'clustering_key', 'ymd');
CALL set_table_property('ods_app_detail', 'event_time_column', 'ymd');
COMMIT;
```

Create an aggregation sink table for roaring bitmaps.
Create an aggregation sink table to store aggregated roaring bitmaps.
Only millions of data records are stored in the aggregation sink table for each day after the aggregation. We recommend that you create a non-partitioned aggregation sink table to facilitate data write-back. You can also create an aggregation sink table that is partitioned by month or quarter. If you configure partitions by day, the amount of data in each partition is small, and a large number of small files are generated due to partitioning. This increases the memory usage.
Create the aggregation sink table based on dimensions such as country, prov, and city, and configure dimension fields as distribution key fields. This facilitates dimension-based queries. If more than three fields are specified in the GROUP BY clause, we recommend that you configure the most frequently used fields as distribution key fields.
Configure the query dimension fields and the date field as primary key fields to prevent data from being repeatedly inserted.
Configure the date field as the clustering key and event time column to facilitate data filtering.
Sample code:
```sql
BEGIN;
CREATE TABLE dws_app_rb (
    rb_uid roaringbitmap, -- UV computing.
    country text,
    prov text,
    city text,
    ymd text NOT NULL, -- The date field.
    pv integer, -- PV computing.
    PRIMARY KEY (country, prov, city, ymd) -- Configure query dimension fields and the date field as primary key fields to prevent data from being repeatedly inserted.
);
CALL set_table_property('dws_app_rb', 'orientation', 'column');
-- Configure the date field as the clustering key and event time column to facilitate data filtering.
CALL set_table_property('dws_app_rb', 'clustering_key', 'ymd');
CALL set_table_property('dws_app_rb', 'event_time_column', 'ymd');
-- Configure fields that are specified in the GROUP BY clause as the distribution key fields.
CALL set_table_property('dws_app_rb', 'distribution_key', 'country,prov,city');
END;
```

Create a UID mapping table.
UIDs stored in roaring bitmaps must be 32-bit integers and consecutive UIDs are preferred. However, UIDs collected in business systems or tracking points are usually of the TEXT type. Therefore, you need to create a UID mapping table that contains a column of the SERIAL type. The column consists of auto-increment 32-bit integers. This way, UID mappings are automatically managed and remain stable.
```sql
BEGIN;
CREATE TABLE uid_mapping (
    uid text NOT NULL,
    uid_int32 serial,
    PRIMARY KEY (uid)
);
-- Configure the UID column as the clustering key and distribution key to quickly find the 32-bit integers that correspond to the UIDs.
CALL set_table_property('uid_mapping', 'clustering_key', 'uid');
CALL set_table_property('uid_mapping', 'distribution_key', 'uid');
CALL set_table_property('uid_mapping', 'orientation', 'row');
COMMIT;
```
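The role of the uid_mapping table can be sketched with a hypothetical in-memory Python model (not part of Hologres): each TEXT UID is assigned a stable 32-bit integer the first time it is seen, mimicking the SERIAL column, and re-inserting an existing UID changes nothing, mimicking the INSERT ... ON CONFLICT DO NOTHING pattern used later in this method.

```python
# A hypothetical sketch of the uid_mapping table's behavior.
class UidMapping:
    def __init__(self):
        self._map = {}  # uid (text) -> uid_int32
        self._next = 1  # a SERIAL column starts at 1

    def upsert(self, uid: str) -> int:
        if uid not in self._map:        # ON CONFLICT DO NOTHING
            self._map[uid] = self._next
            self._next += 1
        return self._map[uid]

m = UidMapping()
ids = [m.upsert(u) for u in ["userA", "userB", "userA"]]
print(ids)  # userA keeps its original integer on re-insert
```

The stability matters: if a UID were remapped to a different integer, previously built roaring bitmaps would refer to the wrong user.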
Step 2: Build roaring bitmaps and import them to the aggregation sink table
Import data to and update the UID mapping table.
Fully initialize the UID mapping table.
Import all UID data to the UID mapping table and initialize the data. You can also filter data based on the date field and import the obtained data to the UID mapping table. In this example, the UID data in a specified six-month period is imported.
```sql
-- Import data in the specified six-month period to the UID mapping table.
INSERT INTO uid_mapping (uid)
SELECT DISTINCT uid
FROM ods_app_detail
WHERE ymd >= '20231201' AND ymd <= '20240502';
```

Verify data in the UID mapping table.
After you import data to the UID mapping table, check whether the data in the UID mapping table is consistent with the data in the fact table.
```sql
-- Verify the data.
SELECT COUNT(*) FROM uid_mapping;
```

Verify the continuity of values in the column of the SERIAL type. In specific scenarios, the TRUNCATE and INSERT operations are performed to initialize the UID mapping table multiple times. However, the TRUNCATE operation does not reset the sequence that backs the column of the SERIAL type. As a result, values of the SERIAL type are wasted, and after data is imported multiple times, values in the column of the SERIAL type can exceed the 32-bit range.

```sql
-- Verify the continuity of values in the column of the SERIAL type.
SELECT MAX(uid_int32), MIN(uid_int32) FROM uid_mapping;
```

Update the UID mapping table.
If UIDs in the fact table are updated, you must update the data in the UID mapping table at the earliest opportunity. If only the UIDs on a specific day are updated in the fact table, you can execute the INSERT ON CONFLICT statement to update the UID mapping table. If the updated UIDs cannot be identified, you can perform full data write-back, which is time-consuming if a large amount of data is involved.
In this example, the INSERT ON CONFLICT DO NOTHING statement is executed to update data of the previous day in the UID mapping table. DO NOTHING ensures that data is updated but not repeatedly written.
```sql
-- Update data of the previous day in the UID mapping table.
INSERT INTO uid_mapping (uid)
SELECT DISTINCT uid
FROM ods_app_detail
WHERE ymd = '20240503'
ON CONFLICT DO NOTHING;
```
Import data into and update the aggregation sink table.
After you update the UID mapping table, perform the following steps to aggregate data and insert the aggregated results to the aggregation sink table:
Join the fact table and the UID mapping table to obtain the aggregation conditions and the 32-bit UIDs in the uid_int32 column.

Aggregate data based on the aggregation conditions, and insert the aggregated data as roaring bitmaps into the aggregation sink table. You can aggregate data by day or month based on your business requirements.
You need to only aggregate data once and store the aggregation result in the aggregation sink table. The number of data records in the aggregation sink table equals the number of UVs.
Execute the following statement to insert data into the aggregation sink table: In this example, data in the specified six-month period is aggregated to the aggregation sink table.
```sql
WITH aggregation_src AS (
    SELECT B.uid_int32, A.country, A.prov, A.city, A.ymd
    FROM ods_app_detail A
    INNER JOIN uid_mapping B ON A.uid = B.uid
    WHERE A.ymd >= '20231201' AND A.ymd <= '20240502'
)
INSERT INTO dws_app_rb
SELECT
    RB_BUILD_AGG(uid_int32),
    country,
    prov,
    city,
    ymd,
    COUNT(1)
FROM aggregation_src
GROUP BY country, prov, city, ymd;
```

Update the aggregation sink table.
If the fact table and UID mapping table are updated, you must also update the aggregation sink table. You can use the incremental update or full update mode to update the aggregation sink table based on the following rules:
Incremental update: If data is updated in the fact table on a regular basis, such as each day, you can execute the INSERT statement to write incremental data in the latest partition to the aggregation sink table.
Full update: If data is not regularly updated in the fact table, and you cannot identify incremental data, you can execute the INSERT OVERWRITE statement to update the aggregation sink table by using the full data write-back method.
Step 3: Count the number of UVs in a custom and long period of time
You can count the number of UVs and the number of PVs in a custom period of time based on a custom combination of basic dimensions. In most cases, query results are returned in milliseconds. This method is more efficient than executing traditional SQL statements and also supports high QPS. Examples:
Count the number of UVs and the number of PVs on a specific day.
```sql
SELECT country, prov, city, RB_CARDINALITY(RB_OR_AGG(rb_uid)) AS uv, SUM(pv) AS pv
FROM dws_app_rb
WHERE ymd = '20240329'
GROUP BY country, prov, city;
```

Count the number of UVs and the number of PVs in a specific month.

```sql
-- Count the number of UVs and the number of PVs in a specific month.
SELECT country, prov, city, RB_CARDINALITY(RB_OR_AGG(rb_uid)) AS uv, SUM(pv) AS pv
FROM public.dws_app_rb
WHERE ymd >= '20240301' AND ymd <= '20240331'
GROUP BY country, prov, city;
```
Method 3 (advanced method): Bucket-based UV computing for an extended period by using roaring bitmaps
In actual scenarios, you may need to join multiple large tables to count the cardinality. For example, in profile analysis, you may need to join tag tables and property tables to count the user cardinality by performing intersection, union, and XOR operations, or you may need to join behavior tables and property tables to count the behavior cardinality. You can use roaring bitmaps of Hologres together with buckets to perform efficient cardinality computing. You can use buckets to split bitmaps into multiple segments and distribute the segments to allow parallel computing. This way, user data is highly compressed, I/O operations are reduced, and the computing efficiency is improved.
Scenarios
This method is suitable for UV computing on hundreds of millions of data records in a custom and long period of time. In most cases, the length of integers in the column based on which accurate deduplication is performed is long, which may exceed 32 bits, or the column has low cardinality, such as the gender column. This method is commonly used for user cardinality computing based on multiple tags.
How it works
Data is divided and stored in different buckets. This way, data is distributed to different shards. The query performance is affected by the bucketing method and bucket quantity.
Bucketing method: A common and efficient method is to calculate the most significant and least significant bits. The least significant bits are stored as bitmaps, and the most significant bits are stored as bucket IDs.
Bucket quantity: Generally, the number of shards in a single table group of an instance does not exceed 256. For 32-bit integers, the most significant 8 bits with int_value>>24 are stored as bucket IDs, and the least significant 24 bits are stored as bitmaps. This way, data is evenly distributed to and orderly stored in shards, which facilitates parallel processing.
Recommended bucketing formulas:
Method 1: If a table group contains n shards, the number of buckets is n. The number of bits required for n buckets is the round-up result of
log2(n). If the number of shards is less than or equal to 256, 256 buckets are recommended. If the number of shards is greater than 256, calculate the result based on the formula.Method 2: If UIDs are small numbers, the most significant bits that are calculated by using Method 1 may be 0, which means all data is stored in the same bucket. In this case, we recommend that you use
(i>>16)%256to calculate the bucket ID. The number of buckets remains unchanged, and 65,536 values are stored in one bucket. This does not negatively affect the storage efficiency.
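Assuming 256 buckets, the two formulas can be sketched in Python as plain bit arithmetic (the function names are illustrative, not part of Hologres):

```python
# A sketch of the two recommended bucketing formulas, assuming 256 buckets.
def bucket_high_bits(uid: int):
    # Formula 1: the most significant 8 bits of a 32-bit uid become the
    # bucket ID; the least significant 24 bits are kept in that bucket's bitmap.
    return uid >> 24, uid & 0xFFFFFF

def bucket_mid_bits(uid: int):
    # Formula 2: for small uids whose high 8 bits are all 0, spread every
    # 65,536 consecutive uids into one of 256 buckets instead.
    return (uid >> 16) % 256

# A small uid lands in bucket 0 under formula 1 but is spread out by formula 2.
print(bucket_high_bits(70000), bucket_mid_bits(70000))
```

Both formulas are stable per UID, so the same user always lands in the same bucket and per-bucket AND/OR operations across tables remain correct.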
Procedure
The following sections describe how to use this method.
In this example, the UID field is of the INT type. If the UID field is of the TEXT type, you can create a UID mapping table based on Method 2 to build roaring bitmaps. The bucketing method is the same.
Step 1: Prepare basic data
Create a roaring bitmap extension.
Before you use roaring bitmaps, you must create a roaring bitmap extension. An extension is created at the database level. For each database, you need to create an extension only once. Syntax:
Note: You must load the roaring bitmap extension in the public schema.

```sql
CREATE EXTENSION IF NOT EXISTS roaringbitmap;
```

Create a user behavior table and a property table.
Buckets are usually used for profile analysis by joining multiple tables. In this example, two tables are required. One user behavior table is used to store complete and detailed user behavior, and one property table is used to store user properties, such as gender and age. You can join the two tables to count the user cardinality based on a custom dimension.
Create a user behavior table named ods_user_behaviour_detail.
```sql
-- Create a user behavior table to store detailed behavior data.
BEGIN;
CREATE TABLE IF NOT EXISTS ods_user_behaviour_detail (
    uid int,
    operator text,
    channel text,
    shop_id text,
    time text,
    ymd text NOT NULL
);
CALL set_table_property('ods_user_behaviour_detail', 'orientation', 'column');
-- Specify a distribution key for the table based on real-time query requirements. Data is distributed to shards based on the distribution key.
CALL set_table_property('ods_user_behaviour_detail', 'distribution_key', 'uid');
-- Specify fields that can be used in the WHERE clause. We recommend that you configure a field that contains the year, month, and date information as the clustering key and event time column.
CALL set_table_property('ods_user_behaviour_detail', 'clustering_key', 'ymd');
CALL set_table_property('ods_user_behaviour_detail', 'event_time_column', 'ymd');
COMMIT;
```

Create a user property table named dim_userbase.
```sql
-- Create a user property table.
BEGIN;
CREATE TABLE IF NOT EXISTS dim_userbase (
    uid int,
    age text,
    gender text,
    country text,
    prov text,
    city text
);
CALL set_table_property('dim_userbase', 'orientation', 'column');
CALL set_table_property('dim_userbase', 'distribution_key', 'uid');
COMMIT;
```
Create aggregation sink tables with roaring bitmaps.
Create an aggregation sink table to store aggregated roaring bitmaps.
Only millions of data records are stored in the aggregation sink table for each day after the aggregation. We recommend that you create a non-partitioned aggregation sink table to facilitate data write-back. You can also create an aggregation sink table that is partitioned by month or quarter. If you configure partitions by day, the amount of data in each partition is small, and a large number of small files are generated due to partitioning. This increases the memory usage.
Create the aggregation sink table based on dimensions such as country, prov, and city, and configure dimension fields as distribution key fields. This facilitates dimension-based queries. If more than three fields are specified in the GROUP BY clause, we recommend that you configure the most frequently used fields as distribution key fields.
Configure the query dimension fields and the date field as primary key fields to prevent data from being repeatedly inserted.
Configure the date field as the clustering key and event time column to facilitate data filtering.
Create an aggregation sink table named dws_user_behaviour_rb.
```sql
-- Create an aggregation sink table for the user behavior table.
BEGIN;
CREATE TABLE dws_user_behaviour_rb (
    rb_uid roaringbitmap, -- UV computing.
    bucket int NOT NULL, -- Bucketing field.
    operator text,
    channel text,
    shop_id text,
    time text,
    ymd text NOT NULL,
    PRIMARY KEY (operator, channel, shop_id, time, ymd, bucket) -- Specify query dimension fields and the date field as primary key fields to prevent data from being repeatedly inserted.
);
CALL set_table_property('dws_user_behaviour_rb', 'orientation', 'column');
-- Configure the date field as the clustering key and event time column to facilitate data filtering.
CALL set_table_property('dws_user_behaviour_rb', 'clustering_key', 'ymd');
CALL set_table_property('dws_user_behaviour_rb', 'event_time_column', 'ymd');
-- Configure the bucketing field as the distribution key.
CALL set_table_property('dws_user_behaviour_rb', 'distribution_key', 'bucket');
END;
```

Create an aggregation sink table named dim_userbase_rb.
```sql
-- Create an aggregation sink table for the user property table.
BEGIN;
CREATE TABLE IF NOT EXISTS dim_userbase_rb (
    rb_uid roaringbitmap, -- UV computing.
    bucket int NOT NULL, -- Bucketing field.
    age text,
    gender text,
    country text,
    prov text,
    city text,
    PRIMARY KEY (age, gender, country, prov, city, bucket) -- Specify query dimension fields as primary key fields to prevent data from being repeatedly inserted.
);
CALL set_table_property('dim_userbase_rb', 'orientation', 'column');
-- Specify the bucketing field as the distribution key to allow local joins.
CALL set_table_property('dim_userbase_rb', 'distribution_key', 'bucket');
COMMIT;
```
Step 2: Build roaring bitmaps and import them to the aggregation sink tables
After you create fact tables, you must write data from fact tables to the aggregation sink tables and store UIDs in the bucketing field. Configure buckets based on the amount of data and data distribution characteristics of your business. In this example, UID data is evenly distributed to 256 buckets. This way, UID data in the same bucket is distributed on the same shard. This allows parallel computing and achieves high-performance queries. Sample code:
Import data from the user behavior table to the aggregation sink table.
```sql
-- Write data from the fact table to the aggregation sink table.
INSERT INTO dws_user_behaviour_rb
SELECT
    RB_BUILD_AGG(uid),
    uid >> 24 AS bucket, -- Shift 24 bits to the right. This way, the most significant 8 bits are used as the bucket IDs and the least significant 24 bits are stored as bitmaps.
    operator,
    channel,
    shop_id,
    time,
    ymd
FROM ods_user_behaviour_detail
WHERE ymd >= '20231201' AND ymd <= '20240503'
GROUP BY operator, channel, shop_id, time, ymd, bucket;
```

Write data from the property table to the aggregation sink table.
```sql
-- Write data from the property table to the aggregation sink table.
INSERT INTO dim_userbase_rb
SELECT
    RB_BUILD_AGG(uid),
    uid >> 24 AS bucket, -- Shift 24 bits to the right. This way, the most significant 8 bits are used as the bucket IDs and the least significant 24 bits are stored as bitmaps.
    age,
    gender,
    country,
    prov,
    city
FROM dim_userbase
GROUP BY age, gender, country, prov, city, bucket;
```
If fact tables are updated, you can update incremental data or full data in the aggregation sink tables based on the update granularity of the fact tables.
Incremental update: If data is updated in a fact table on a regular basis, such as when only the partition that stores data of the previous day is updated, you can execute the INSERT statement to write incremental data to the aggregation sink table.
Full update: If data is not regularly updated in a fact table and you cannot quickly identify incremental data, you can update the aggregation sink table by using the full data write-back method.
Step 3: Count the number of UVs in a custom and long period of time
You can count the number of UVs and the number of PVs in a custom period of time based on a custom combination of basic dimensions. This method is more efficient than executing traditional SQL statements. Sample code:
```sql
-- Query the number of users that meet the following conditions: gender = 'man' AND country = 'Beijing' AND operator = 'Purchase' AND shop_id = '1'.
SELECT SUM(RB_CARDINALITY(RB_AND(t1.rb_uid, t2.rb_uid)))
FROM (
    SELECT RB_OR_AGG(rb_uid) AS rb_uid, bucket
    FROM dws_user_behaviour_rb
    WHERE operator = 'Purchase'
      AND shop_id = '1'
      AND ymd = '20240501'
    GROUP BY bucket
) t1
JOIN (
    SELECT RB_OR_AGG(rb_uid) AS rb_uid, bucket
    FROM dim_userbase_rb
    WHERE gender = 'man'
      AND country = 'Beijing'
    GROUP BY bucket
) t2 ON t1.bucket = t2.bucket;
```
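The shape of this query can be modeled in Python, with sets standing in for roaring bitmaps and made-up UIDs: bitmaps are OR-merged within each bucket per table, AND-intersected across tables for matching buckets, and the per-bucket cardinalities are summed. Because bucketing partitions the UID space, the sum equals the global intersection count.

```python
# A sketch of the per-bucket intersection: RB_OR_AGG within a bucket,
# RB_AND across tables, SUM(RB_CARDINALITY(...)) over buckets.
from collections import defaultdict

def to_buckets(uids):
    buckets = defaultdict(set)
    for uid in uids:
        # High 8 bits select the bucket; low 24 bits go into its "bitmap".
        buckets[uid >> 24].add(uid & 0xFFFFFF)
    return buckets

behaviour = to_buckets({1, 2, 5, 0x01000003})            # e.g. purchased in shop 1
profile = to_buckets({2, 5, 7, 0x01000003, 0x02000001})  # e.g. men in Beijing

# JOIN ... ON t1.bucket = t2.bucket, then sum the intersection sizes.
count = sum(len(behaviour[b] & profile[b])
            for b in behaviour.keys() & profile.keys())
print(count)  # equals the size of the global uid intersection, here 3
```

Each bucket's intersection can be computed on its own shard, which is what makes the real query parallel.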