All Products
Search
Document Center

ApsaraDB for SelectDB:Bitmap-based accurate deduplication

Last Updated:Dec 11, 2024

This topic describes the bitmap-based deduplication feature provided by ApsaraDB for SelectDB to help you perform data deduplication and accelerate queries.

Overview

ApsaraDB for SelectDB supports the bitmap data type. In the Aggregate key model, the Value field of the bitmap type can be used with the intersection and union aggregate function of collections for accurate data deduplication.

The bitmap-based deduplication feature in traditional data warehouses has poor performance when it processes intersections and unions for bitmaps that have hundreds of millions of elements or more. This is caused by the following reasons: First, when the size of the bitmap is extremely large (exceeding 1 GB), the processing time for network and disk I/O becomes significantly prolonged. Second, after data scanning is complete in a cluster, all the data is transmitted to a single top-level node to perform the union calculations, which places a considerable load on that node.

ApsaraDB for SelectDB is optimized for distributed environments, that is, it can divide the bitmap column values by bucket by using the hash algorithm. This way, different values are stored in different buckets. This ensures that bitmap values in different buckets are orthogonal, thus significantly reducing the cardinality. In the case of query, the orthogonal bitmap in different buckets is firstly aggregated and calculated, and then the top-level node combines the aggregated calculated values and outputs them. This greatly improves the computing efficiency and eliminates the bottleneck caused by single-node computing at the top level.

Examples

  1. Create a table that uses the Aggregate key model and has a Value field of the bitmap data type. Use an intersection or union function as the aggregation function, such as bitmap_union. The hid column is created to indicate bucket IDs. The following code provides an example:

    Note

    Appropriately configure the hid column and the number of buckets. The cardinality of the hid column needs to be at least five times the number of buckets to ensure a balanced distribution of data after hash bucketing.

    CREATE TABLE `user_tag_Bitmap` (
      `tag` bigint(20) NULL COMMENT "User tag",
      `hid` smallint(6) NULL COMMENT "Bucket ID",
      `user_id` Bitmap Bitmap_UNION NULL COMMENT ""
    ) ENGINE=OLAP
    AGGREGATE KEY(`tag`, `hid`)
    COMMENT "OLAP"
    DISTRIBUTED BY HASH(`hid`) BUCKETS 3

    The hid column is added to the table schema to indicate the value range of the user_id column as a hash bucket column.

  2. Import data.

    LOAD LABEL user_tag_Bitmap_test
    (
        DATA INFILE('hdfs://abc')
        INTO TABLE user_tag_Bitmap
        COLUMNS TERMINATED BY ','
        (tmp_tag, tmp_user_id)
        SET (
            tag = tmp_tag,
            hid = ceil(tmp_user_id/500),
            user_id = to_Bitmap(tmp_user_id)
        )
    )
    # Note: The number 500 is not fixed. You can adjust the number based on your business requirements.
  3. The data imported is in the following format. The two columns specify the tag and user ID.

    11111111,1
    11111112,2
    11111113,3
    11111114,4
    ...

    During the data import, ApsaraDB for SelectDB cuts the range of bitmap values in the user_id column. For example, the HID value of the user ID in the range of [1,5000000) is the same, and the row with the same HID value is allocated to a bucket. This way, the bitmap value in each bucket is orthogonal. The orthogonal feature of bitmap values between buckets can be used, which greatly reduces the computing resource usage for intersection and union of collections.

  4. After data is imported, you can call bitmap functions to perform bitmap-related operations. For more information about the bitmap functions that you can call, see the "Bitmap functions" section in the SQL referencence.

Orthogonal bitmap functions

The following functions are available for bitmap queries in orthogonal scenarios: Bitmap_orthogonal_intersect, orthogonal_Bitmap_intersect_count, orthogonal_Bitmap_union_count, orthogonal_Bitmap_expr_calculate, and orthogonal_Bitmap_expr_calculate_count. The following section describes the preceding functions.

Note

The orthogonal bitmap function cannot be used in the partitioned table. Since the partitions of the partitioned table are orthogonal, the data between partitions cannot be guaranteed to be orthogonal. As a result, the calculation result cannot be estimated.

Bitmap_orthogonal_intersect

Calculates the intersection of bitmaps.

Syntax

orthogonal_Bitmap_intersect(Bitmap_column, column_to_filter, filter_values)

Parameters

Parameter

Description

Bitmap_column

The name of the bitmap column.

column_to_filter

The name of the dimension column that is used for filtering.

filter_values

A variable-length list of values used to filter the specified dimension column.

Example

SELECT Bitmap_COUNT(orthogonal_Bitmap_intersect(user_id, tag, 13080800, 11110200)) FROM user_tag_Bitmap WHERE tag IN (13080800, 11110200);

During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB filters the tags based on filter_values, and then calculates the intersection of the bitmap values of all aggregated tags. In the second stage, ApsaraDB for SelectDB iteratively calculates the union of all the bitmap values obtained from the first stage.

orthogonal_Bitmap_intersect_count

Calculates the intersection of bitmaps and counts the intersection results. The syntax is the same as that for intersect_count, but the implementation is different.

Syntax

orthogonal_Bitmap_intersect_count(Bitmap_column, column_to_filter, filter_values)

Parameters

Parameter

Description

Bitmap_column

The name of the bitmap column.

column_to_filter

The name of the dimension column that is used for filtering.

filter_values

A variable-length list of values used to filter the specified dimension column.

Example

SELECT orthogonal_Bitmap_intersect_count(user_id, tag, 1150000, 1150001, 390006) FROM user_tag_Bitmap WHERE tag IN (1150000, 1150001, 390006);

During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB filters the tags based on filter_values, then calculates the intersection of the bitmap values of all aggregated tags, and finally counts the intersection results. In the second stage, ApsaraDB for SelectDB iteratively calculates the sum of all the count values obtained from the first stage.

orthogonal_Bitmap_union_count

Calculates the union of the bitmap values and counts the union results. The syntax is the same as that for Bitmap_union_count, but the implementation is different.

Syntax

orthogonal_Bitmap_union_count(Bitmap_column)

Parameters

Parameter

Description

Bitmap_column

The name of the bitmap column in which the union of the values is to be calculated for counting.

Example

SELECT orthogonal_Bitmap_union_count(user_id) FROM user_tag_Bitmap WHERE tag IN (1150000, 1150001, 390006);

During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB calculates the union of the bitmap values, and then counts the union results. In the second stage, ApsaraDB for SelectDB iteratively calculates the sum of all the count values obtained from the first stage.

orthogonal_Bitmap_expr_calculate

Calculates the bitmap intersection, union, and difference by using an expression.

Syntax

orthogonal_Bitmap_expr_calculate(Bitmap_column, filter_column, input_string)

Parameters

Parameter

Description

Bitmap_column

The name of the bitmap column in which the union of the values is to be calculated for counting.

filter_column

The dimension column that is used for filtering, that is, the key column for calculation.

input_string

The expression string for calculating the bitmap intersection, union, and difference based on the key column.

The following calculators are supported by expressions:

& represents intersection calculation, | represents union calculation,-represents difference calculation, ^ represents exclusive XOR calculation, and \represents an escape character.

Example

SELECT orthogonal_Bitmap_expr_calculate(user_id, tag, '(833736|999777)&(1308083|231207)&(1000|20000-30000)') FROM user_tag_Bitmap WHERE tag IN (833736,999777,130808,231207,1000,20000,30000);

During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB parses input_string to obtain the tag filter condition, then filters data, and finally performs bitmap calculation on the filtered data based on input_string. In the second stage, ApsaraDB for SelectDB calculates the union of all bitmap values obtained from the first stage and returns the final bitmap results.

orthogonal_Bitmap_expr_calculate_count

Calculates the bitmap intersection, union, and difference by using an expression and counts the calculation results. The syntax and parameters are the same as those for orthogonal_Bitmap_expr_calculate.

Syntax

orthogonal_Bitmap_expr_calculate_count(Bitmap_column, filter_column, input_string)

Parameters

Parameter

Description

Bitmap_column

The name of the bitmap column in which the union of the values is to be calculated for counting.

filter_column

The dimension column that is used for filtering, that is, the key column for calculation.

input_string

The expression string for calculating the bitmap intersection, union, and difference based on the key column.

The following calculators are supported by expressions:

& represents intersection calculation, | represents union calculation,-represents difference calculation, ^ represents exclusive XOR calculation, and \represents an escape character.

Example

SELECT orthogonal_Bitmap_expr_calculate_count(user_id, tag, '(833736|999777)&(1308083|231207)&(1000|20000-30000)') FROM user_tag_Bitmap WHERE tag IN (833736,999777,130808,231207,1000,20000,30000);

During query execution, aggregation is performed in two stages. In the first stage, ApsaraDB for SelectDB parses input_string to obtain the tag filter condition, then filters data and performs bitmap calculation on the filtered data based on input_string, and finally counts the calculation results. In the second stage, ApsaraDB for SelectDB calculates the union of all bitmap values obtained from the first stage and returns the final bitmap results.