Remove duplicates by using the HLL feature in ApsaraDB for SelectDB - ApsaraDB for SelectDB

This topic describes how to use the HyperLogLog (HLL) feature provided by ApsaraDB for SelectDB to remove duplicates and accelerate queries.

Overview

In real business scenarios, the pressure on data deduplication increases as the amount of business data grows. If the amount of data reaches a specific scale, the cost of accurate deduplication also increases. If acceptable, you can use an approximate deduplication algorithm to quickly remove duplicates and reduce the pressure on calculation.

ApsaraDB for SelectDB provides the HLL algorithm to help you implement approximate deduplication. HLL supports excellent space complexity that is represented in the O(log(logn)) format and time complexity that is represented in the O(n) format. The error rate of the calculation results can be controlled at about 1% to 2%. The error rate depends on the size of the dataset and the hash function that is used.

How HLL works

The HLL algorithm is an upgraded version of the LogLog algorithm and supports approximate deduplication. Its mathematical basis is the Bernoulli test.

Mathematical basis

A coin has two sides. The possibility that the front or back side of the coin appears is 50% after the coin is flipped. Keep flipping the coin until the front side of the coin appears, which is recorded as a full Bernoulli test. This way, if n Bernoulli tests are performed, the front side of the coin appears n times.

For example, the coin is flipped k times in a Bernoulli test. The number of times the coin is flipped in the first Bernoulli test is k1. In the nth Bernoulli test, the number of times the coin is flipped is kn. The value of n increases with the number of Bernoulli tests. Among the Bernoulli tests, the maximum number of times the coin is flipped can be k, which indicates that the front side of the coin appears after the coin is flipped k times. The maximum number of times the coin is flipped is named k_max. The following conclusions can be drawn based on the Bernoulli tests:

kn is not greater than k_max.
kn is equal to k_max in at least one Bernoulli test.

Combined with the maximum likelihood estimation (MLE) method, the following estimated correlation exists between n and k_max: n = 2^k_max. If you record only k_max, you can estimate the total number of data records, which is the cardinality.

Estimation error

If the following results are obtained:

In the first test, the coin is flipped three times until the front side appears. In this case, k = 3 and n = 1.
In the second test, the coin is flipped twice until the front side appears. In this case, k = 2 and n = 2.
In the third test, the coin is flipped six times until the front side appears. In this case, k = 6 and n = 3.
...
In the nth test, the coin is flipped 12 times until the front side appears. In this case, n = 2^12.

Based on the first three tests, k_max = 6 and n = 3. If you apply the values to the estimation formula, the calculated value of 2^6 is not equal to 3. This indicates that if the number of tests is small, the error rate of this estimation method is high.

The three tests compose one round of estimation. If only one round of estimation is performed and n is large, the estimation error rate is relatively reduced, but still not low.

HLL functions

The HLL feature is an engineering implementation based on the HLL algorithm. HLL functions are used to save the intermediate results during the HLL-based calculation process. HLL can only be used as the type of a value column in a table to continuously reduce the amount of data by aggregating data. This accelerates queries. The error rate of the estimation results that are obtained by using the HLL feature reaches about 1%. An HLL column is generated by using other columns or imported data. When data is imported, the hll_hash function is used to specify the column that is used to generate the HLL column in the data. An HLL function is often used to replace the COUNT DISTINCT function and is used together with the ROLLUP function to quickly calculate the number of unique visitors (UVs).

The following HLL functions are provided:

HLL_UNION_AGG(hll)
This function is an aggregate function and is used to calculate the estimated cardinality for all data that meets the conditions.
HLL_CARDINALITY(hll)
This function is used to calculate the estimated cardinality for a single HLL column.
HLL_HASH(column_name)
This function is used to generate an HLL column when data is inserted or imported. For more information about how to use this function, see the Example section of this topic.

Example

Create a table that contains an HLL column.
```
CREATE TABLE test_hll(
    dt date,
    id int,
    name char(10),
    province char(10),
    os char(10),
    pv hll hll_union
)
Aggregate KEY (dt,id,name,province,os)
distributed by hash(id) buckets 10
PROPERTIES(
    "replication_num" = "1",
    "in_memory"="false"
);
```
Important
Take note of the following items when you use HLL functions:
- When you use HLL functions to remove duplicates from a column, you must set the type of the column to HLL and the aggregation type of the column to HLL_UNION in the CREATE TABLE statement.
- An HLL column cannot be used as a key column.
- You do not need to specify the length or default value for an HLL column. The length is specified by the system based on the data aggregation degree.

Import data.

Import the following sample data from the test_hll.csv file:

2022-05-05,10001,Test 01,Beijing,windows
2022-05-05,10002,Test 01,Beijing,linux
2022-05-05,10003,Test 01,Beijing,macos
2022-05-05,10004,Test 01,Hebei,windows
2022-05-06,10001,Test 01,Shanghai,windows
2022-05-06,10002,Test 01,Shanghai,linux
2022-05-06,10003,Test 01,Jiangsu,macos
2022-05-06,10004,Test 01,Shaanxi,windows

Import data by using one of the following methods:

Import data by using Stream Load. Sample code:

# curl --location-trusted -u root: -H "label:label_test_hll_load" -H "column_separator:," -H "columns:dt,id,name,province,os,pv=hll_hash(id)" -T test_hll.csv http://127.0.0.1:8030/api/demo/test_hll/_stream_load

{
    "TxnId": 693,
    "Label": "label_test_hll_load",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 8,
    "NumberLoadedRows": 8,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 320,
    "LoadTimeMs": 23,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 1,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 9,
    "CommitAndPublishTimeMs": 11
}

Import data by using Broker Load. Sample code:

LOAD LABEL demo.test_hlllabel
 (
    DATA INFILE("hdfs://hdfs_host:hdfs_port/user/doris_test_hll/data/input/file")
    INTO TABLE `test_hll`
    COLUMNS TERMINATED BY ","
    (dt,id,name,province,os)
    SET (
      pv = HLL_HASH(id)
    )
 );

Query data.

You cannot directly query the original values of the HLL column. You can query the original values of the HLL column only by using the HLL_UNION_AGG function.

Calculate the total number of page views (PVs). Sample code:

SELECT HLL_UNION_AGG(pv) FROM test_hll;
+---------------------+
| hll_union_agg(`pv`) |
+---------------------+
|                   4 |
+---------------------+
1 row in set (0.00 sec)

The results returned are equivalent to those returned by the (DISTINCT pv) function.

SELECT COUNT(DISTINCT pv) FROM test_hll;
+----------------------+
| count(DISTINCT `pv`) |
+----------------------+
|                    4 |
+----------------------+
1 row in set (0.01 sec)

Calculate the PVs of a day. Sample code:

SELECT HLL_UNION_AGG(pv) FROM test_hll GROUP BY dt;
+---------------------+
| hll_union_agg(`pv`) |
+---------------------+
|                   4 |
|                   4 |
+---------------------+
2 rows in set (0.01 sec)