HyperLogLog++ functions are approximate aggregate functions that use a small amount of memory to quickly remove duplicates from large datasets and accelerate queries.
Background
HyperLogLog (HLL) is an efficient algorithm for approximate deduplication. It suits scenarios that do not require exact counts, such as page view (PV) and unique visitor (UV) statistics, and can serve as a lightweight alternative to COUNT(DISTINCT). Unlike precise deduplication methods such as Bitmap, HLL uses an internal, fixed-size data structure called a sketch, whose memory usage does not grow with the data volume. Adding a new value requires only a single hash calculation. In big data scenarios, the deduplication error of HLL is typically 1% or less, which provides a balance between efficiency and usability.
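To make the fixed-size sketch and the single hash per value concrete, the following is a minimal Python sketch of classic HyperLogLog. It is an illustration only and is not MaxCompute's implementation: the precision `P`, the use of SHA-256 as the hash, and the helper names `new_sketch`, `add`, and `estimate` are all assumptions made for this example, and the refinements that HLL++ adds (such as a sparse representation and empirical bias correction) are omitted.

```python
import hashlib
import math

P = 12        # assumed precision: number of hash bits used as the register index
M = 1 << P    # number of registers; this size never changes as data is added


def new_sketch():
    """Return an empty sketch: M one-byte registers, all zero."""
    return bytearray(M)


def add(sketch, value):
    """Fold one value into the sketch using a single 64-bit hash."""
    digest = hashlib.sha256(str(value).encode()).digest()
    h = int.from_bytes(digest[:8], "big")        # take 64 bits of the hash
    idx = h >> (64 - P)                          # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)             # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1      # position of the leftmost 1-bit
    if rank > sketch[idx]:
        sketch[idx] = rank                       # keep the maximum rank seen


def estimate(sketch):
    """Return the approximate number of distinct values added so far."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction: fall back to linear counting.
        return M * math.log(M / zeros)
    return raw


sketch = new_sketch()
for i in range(50000):
    add(sketch, f"user{i}")
    add(sketch, f"user{i}")   # a duplicate hashes to the same register and rank
approx = estimate(sketch)
```

With these assumed settings the sketch stays at 4,096 bytes no matter how many values are added, and the expected relative standard error of classic HLL is about 1.04 / sqrt(M), or roughly 1.6% here. Raising `P` trades more memory for a smaller error.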
For basic approximate deduplication, MaxCompute provides the APPROX_DISTINCT aggregate function. As business scenarios become more diverse, users increasingly need to store or reuse the intermediate sketch data structure, not just output the final deduplication result. To support this, MaxCompute offers a complete set of HyperLogLog++ functions with an optimized underlying algorithm. These functions reduce memory usage and improve estimation accuracy, providing better support for complex analysis scenarios.
The following are two typical use cases for HLL:
Repeated queries over time: You can persist the HLL sketch that is generated each day. Subsequent calculations only need to process the new data for the current day and merge it with the historical sketch. This eliminates the need to rescan all historical data, which significantly improves query efficiency.
Combined deduplication across multiple similar columns: You can build and save a sketch for each column and then merge these sketches directly. This allows deduplication results to be reused efficiently and greatly reduces computing overhead.
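Both use cases rely on sketches being mergeable. The following Python sketch illustrates the idea under stated assumptions: it uses a toy classic-HLL register layout rather than MaxCompute's actual BINARY format, and `build_sketch`, `merge`, the precision `P`, and the sample user IDs are hypothetical names and data chosen for this example. Merging takes the register-wise maximum, so the merged sketch is identical to one built from the combined raw data.

```python
import hashlib

P = 10        # assumed precision for this illustration
M = 1 << P    # number of registers in every sketch


def build_sketch(values):
    """Build a toy HLL sketch (M one-byte registers) from an iterable."""
    regs = bytearray(M)
    for v in values:
        digest = hashlib.sha256(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - P)                       # register index
        rest = h & ((1 << (64 - P)) - 1)          # remaining hash bits
        rank = (64 - P) - rest.bit_length() + 1   # leftmost 1-bit position
        regs[idx] = max(regs[idx], rank)
    return regs


def merge(a, b):
    """Merge two sketches of the same precision by register-wise maximum."""
    return bytearray(max(x, y) for x, y in zip(a, b))


# Hypothetical daily data with overlapping visitors.
day1 = [f"user{i}" for i in range(0, 6000)]
day2 = [f"user{i}" for i in range(4000, 10000)]

history = build_sketch(day1)      # persisted sketch from earlier data
today = build_sketch(day2)        # sketch over only the new day's data
merged = merge(history, today)    # no rescan of day1 is needed

# The merged sketch equals the sketch built from all raw data at once.
same_as_full_scan = merged == build_sketch(day1 + day2)
```

The same property applies to the multi-column case: sketches built separately per column can be merged to estimate the distinct count across all of those columns, provided they were built with the same precision and hash.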
Function list
MaxCompute SQL supports the following HyperLogLog++ functions.
Function | Feature |
HLL_COUNT_INIT | Aggregates values of the same type into a new HLL++ sketch. |
HLL_COUNT_MERGE_PARTIAL | Merges multiple HLL++ sketches of the same storage class into a new sketch. |
HLL_COUNT_EXTRACT | Calculates the cardinality estimation from an HLL++ sketch. |
HLL_COUNT_MERGE | Merges multiple HLL++ sketches of the same storage class into a new sketch and returns the cardinality estimation of the merged sketch. |
Notes
The BINARY data used by the HLL_COUNT_EXTRACT, HLL_COUNT_MERGE, and HLL_COUNT_MERGE_PARTIAL functions must be generated by the HLL_COUNT_INIT function. Data from other systems or methods cannot be used.