This topic describes the bitmap-based deduplication feature provided by ApsaraDB for SelectDB to help you perform data deduplication and accelerate queries.
Overview
ApsaraDB for SelectDB supports the bitmap data type. In the Aggregate key model, the Value field of the bitmap type can be used with the intersection and union aggregate function of collections for accurate data deduplication.
The bitmap-based deduplication feature in traditional data warehouses has poor performance when it processes intersections and unions for bitmaps that have hundreds of millions of elements or more. This is caused by the following reasons: First, when the size of the bitmap is extremely large (exceeding 1 GB), the processing time for network and disk I/O becomes significantly prolonged. Second, after data scanning is complete in a cluster, all the data is transmitted to a single top-level node to perform the union calculations, which places a considerable load on that node.
ApsaraDB for SelectDB is optimized for distributed environments, that is, it can divide the bitmap column values by bucket by using the hash algorithm. This way, different values are stored in different buckets. This ensures that bitmap values in different buckets are orthogonal, thus significantly reducing the cardinality. In the case of query, the orthogonal bitmap in different buckets is firstly aggregated and calculated, and then the top-level node combines the aggregated calculated values and outputs them. This greatly improves the computing efficiency and eliminates the bottleneck caused by single-node computing at the top level.
Examples
Create a table that uses the Aggregate key model and has a Value field of the bitmap data type. Use an intersection or union function as the aggregation function, such as bitmap_union. The hid column is created to indicate bucket IDs. The following code provides an example:
NoteAppropriately configure the hid column and the number of buckets. The cardinality of the hid column needs to be at least five times the number of buckets to ensure a balanced distribution of data after hash bucketing.
CREATE TABLE `user_tag_Bitmap` ( `tag` bigint(20) NULL COMMENT "User tag", `hid` smallint(6) NULL COMMENT "Bucket ID", `user_id` Bitmap Bitmap_UNION NULL COMMENT "" ) ENGINE=OLAP AGGREGATE KEY(`tag`, `hid`) COMMENT "OLAP" DISTRIBUTED BY HASH(`hid`) BUCKETS 3The hid column is added to the table schema to indicate the value range of the user_id column as a hash bucket column.
Import data.
LOAD LABEL user_tag_Bitmap_test ( DATA INFILE('hdfs://abc') INTO TABLE user_tag_Bitmap COLUMNS TERMINATED BY ',' (tmp_tag, tmp_user_id) SET ( tag = tmp_tag, hid = ceil(tmp_user_id/500), user_id = to_Bitmap(tmp_user_id) ) ) # Note: The number 500 is not fixed. You can adjust the number based on your business requirements.The data imported is in the following format. The two columns specify the tag and user ID.
11111111,1 11111112,2 11111113,3 11111114,4 ...During the data import, ApsaraDB for SelectDB cuts the range of bitmap values in the user_id column. For example, the HID value of the user ID in the range of [1,5000000) is the same, and the row with the same HID value is allocated to a bucket. This way, the bitmap value in each bucket is orthogonal. The orthogonal feature of bitmap values between buckets can be used, which greatly reduces the computing resource usage for intersection and union of collections.
After data is imported, you can call bitmap functions to perform bitmap-related operations. For more information about the bitmap functions that you can call, see the "Bitmap functions" section in the SQL referencence.
Orthogonal bitmap functions
The following functions are available for bitmap queries in orthogonal scenarios: Bitmap_orthogonal_intersect, orthogonal_Bitmap_intersect_count, orthogonal_Bitmap_union_count, orthogonal_Bitmap_expr_calculate, and orthogonal_Bitmap_expr_calculate_count. The following section describes the preceding functions.
The orthogonal bitmap function cannot be used in the partitioned table. Since the partitions of the partitioned table are orthogonal, the data between partitions cannot be guaranteed to be orthogonal. As a result, the calculation result cannot be estimated.
Bitmap_orthogonal_intersect
Calculates the intersection of bitmaps.
Syntax
orthogonal_Bitmap_intersect(Bitmap_column, column_to_filter, filter_values)Parameters
Parameter | Description |
Bitmap_column | The name of the bitmap column. |
column_to_filter | The name of the dimension column that is used for filtering. |
filter_values | A variable-length list of values used to filter the specified dimension column. |
Example
SELECT Bitmap_COUNT(orthogonal_Bitmap_intersect(user_id, tag, 13080800, 11110200)) FROM user_tag_Bitmap WHERE tag IN (13080800, 11110200);During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB filters the tags based on filter_values, and then calculates the intersection of the bitmap values of all aggregated tags. In the second stage, ApsaraDB for SelectDB iteratively calculates the union of all the bitmap values obtained from the first stage.
orthogonal_Bitmap_intersect_count
Calculates the intersection of bitmaps and counts the intersection results. The syntax is the same as that for intersect_count, but the implementation is different.
Syntax
orthogonal_Bitmap_intersect_count(Bitmap_column, column_to_filter, filter_values)Parameters
Parameter | Description |
Bitmap_column | The name of the bitmap column. |
column_to_filter | The name of the dimension column that is used for filtering. |
filter_values | A variable-length list of values used to filter the specified dimension column. |
Example
SELECT orthogonal_Bitmap_intersect_count(user_id, tag, 1150000, 1150001, 390006) FROM user_tag_Bitmap WHERE tag IN (1150000, 1150001, 390006);During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB filters the tags based on filter_values, then calculates the intersection of the bitmap values of all aggregated tags, and finally counts the intersection results. In the second stage, ApsaraDB for SelectDB iteratively calculates the sum of all the count values obtained from the first stage.
orthogonal_Bitmap_union_count
Calculates the union of the bitmap values and counts the union results. The syntax is the same as that for Bitmap_union_count, but the implementation is different.
Syntax
orthogonal_Bitmap_union_count(Bitmap_column)Parameters
Parameter | Description |
Bitmap_column | The name of the bitmap column in which the union of the values is to be calculated for counting. |
Example
SELECT orthogonal_Bitmap_union_count(user_id) FROM user_tag_Bitmap WHERE tag IN (1150000, 1150001, 390006);During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB calculates the union of the bitmap values, and then counts the union results. In the second stage, ApsaraDB for SelectDB iteratively calculates the sum of all the count values obtained from the first stage.
orthogonal_Bitmap_expr_calculate
Calculates the bitmap intersection, union, and difference by using an expression.
Syntax
orthogonal_Bitmap_expr_calculate(Bitmap_column, filter_column, input_string)Parameters
Parameter | Description |
Bitmap_column | The name of the bitmap column in which the union of the values is to be calculated for counting. |
filter_column | The dimension column that is used for filtering, that is, the key column for calculation. |
input_string | The expression string for calculating the bitmap intersection, union, and difference based on the key column. The following calculators are supported by expressions:
|
Example
SELECT orthogonal_Bitmap_expr_calculate(user_id, tag, '(833736|999777)&(1308083|231207)&(1000|20000-30000)') FROM user_tag_Bitmap WHERE tag IN (833736,999777,130808,231207,1000,20000,30000);During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB parses input_string to obtain the tag filter condition, then filters data, and finally performs bitmap calculation on the filtered data based on input_string. In the second stage, ApsaraDB for SelectDB calculates the union of all bitmap values obtained from the first stage and returns the final bitmap results.
orthogonal_Bitmap_expr_calculate_count
Calculates the bitmap intersection, union, and difference by using an expression and counts the calculation results. The syntax and parameters are the same as those for orthogonal_Bitmap_expr_calculate.
Syntax
orthogonal_Bitmap_expr_calculate_count(Bitmap_column, filter_column, input_string)Parameters
Parameter | Description |
Bitmap_column | The name of the bitmap column in which the union of the values is to be calculated for counting. |
filter_column | The dimension column that is used for filtering, that is, the key column for calculation. |
input_string | The expression string for calculating the bitmap intersection, union, and difference based on the key column. The following calculators are supported by expressions:
|
Example
SELECT orthogonal_Bitmap_expr_calculate_count(user_id, tag, '(833736|999777)&(1308083|231207)&(1000|20000-30000)') FROM user_tag_Bitmap WHERE tag IN (833736,999777,130808,231207,1000,20000,30000);During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB parses input_string to obtain the tag filter condition, then filters data and performs bitmap calculation on the filtered data based on input_string, and finally counts the calculation results. In the second stage, ApsaraDB for SelectDB calculates the union of all bitmap values obtained from the first stage and returns the final bitmap results.