Unique visitor analytics

User behavior analysis often requires filtering hundreds of millions or even billions of users by specific tags to obtain metrics. Hologres provides multiple solutions for accurate, high-performance unique visitor (UV) computing at scale.

Background information

A unique visitor (UV) is an individual user who visits a website at least once. After accurate deduplication, the UV count serves as a key metric over a given time period. For example, during a large e-commerce promotion, sellers calculate item page view (IPV) UVs, add-to-cart UVs, and purchase UVs to adjust operation policies in real time. UV counts can be computed over fixed intervals such as one day, seven days, or one month, or over custom periods such as half a year.

UV computing varies in dimensions and data volume depending on business requirements. Common requirements include:

- Hundreds of millions of records across more than 10 dimensions per day, with flexible dimension combinations for queries.
- Fine-grained, real-time queries over custom long periods, beyond standard day, week, month, or year intervals.
- Accurate user deduplication.

For these complex UV computing scenarios, pre-calculation systems such as Apache Kylin, or a Flink-MySQL solution with fixed dimension combinations, are commonly used. Extract, transform, and load (ETL) tasks update the pre-calculated results periodically to improve computing efficiency. However, this approach has the following disadvantages:

- A large number of dimensions requires extensive storage space and long pre-calculation time.
- Accurate deduplication is resource-intensive and prone to out-of-memory (OOM) errors.
- Data cannot be processed in real time. UV computing supports only fixed time periods, not custom ones.
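The fixed-period limitation follows from the fact that distinct counts are not additive: a visitor who appears on two days is counted in each daily UV, so adding up pre-calculated daily results overstates the UV for the combined period. The following minimal Python sketch illustrates this with invented visitor IDs. The dates, IDs, and function names are assumptions made for the example.

```python
# Hypothetical visitor IDs per day; the data is invented for illustration.
daily_visitors = {
    "2024-01-01": {"u1", "u2", "u3"},
    "2024-01-02": {"u2", "u3", "u4"},
    "2024-01-03": {"u1", "u5"},
}


def sum_of_daily_uv(days):
    """Add up per-day UV counts, as a fixed-period result table would store them."""
    return sum(len(daily_visitors[d]) for d in days)


def exact_uv(days):
    """Deduplicate across the whole period by merging the per-day ID sets."""
    merged = set()
    for d in days:
        merged |= daily_visitors[d]
    return len(merged)


period = ["2024-01-01", "2024-01-02", "2024-01-03"]
print(sum_of_daily_uv(period))  # 8: u1, u2, and u3 are each counted twice
print(exact_uv(period))         # 5
```

A custom period can therefore be answered only from the raw records or from a structure that can be merged before counting, not from stored counts alone.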

Alternatively, you can use COUNT DISTINCT on fact tables to compute UVs over custom time periods. However, most systems struggle with large fact tables: queries slow down over long periods, and high-concurrency access is not supported.

About the Hologres solution

Hologres is a real-time data warehouse for hybrid serving and analytical processing (HSAP). It uses a distributed architecture, supports real-time data writes, and analyzes petabytes of data with high concurrency and low latency. Hologres is compatible with the PostgreSQL protocol, allowing you to use existing tools for data analysis.

For UV-based behavior analysis, Hologres supports the COUNT DISTINCT function on fact tables with superior query performance. It also offers a roaring bitmap-based pre-aggregation solution. Together, these capabilities enable accurate UV counting across hundreds of millions of records.

COUNT DISTINCT

COUNT DISTINCT follows native PostgreSQL syntax and is optimized in Hologres for multiple scenarios: single-field deduplication, multi-field deduplication, skewed data deduplication, and deduplication without GROUP BY fields. You can query wide fact tables directly with COUNT DISTINCT for improved performance.

Roaring bitmap

Roaring bitmaps are compressed bitmaps designed for indexing. Their data compression and deduplication capabilities make them well suited for UV computing in big data scenarios. Key characteristics:

- The 32-bit integer space is divided into 2¹⁶ chunks, one for each possible value of the 16 most significant bits. Within a chunk, the 16 least significant bits of an integer select a single bit. Chunks are allocated only for value ranges that actually contain data, and the space a chunk uses depends on the values stored in it.
- In a densely populated chunk, each integer is represented by a single bit instead of 32 bits, which significantly compresses the data.
- Roaring bitmaps support bitwise operations such as AND and OR, which deduplicate and combine sets of users directly.

For more information about how to use roaring bitmaps, see roaringbitmap.
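The chunked layout described above can be sketched in a few lines of Python. This is a toy model for illustration only, not the actual Roaring format or the Hologres implementation: the class name `TinyRoaring` and its methods are hypothetical, and each chunk is simply a Python integer used as a 2¹⁶-bit bitmap.

```python
class TinyRoaring:
    """Toy model of the chunked layout; not the real Roaring format."""

    def __init__(self):
        # Maps the 16 most significant bits to a Python int used as a bitmap.
        self.chunks = {}

    def add(self, value):
        if not 0 <= value < 2**32:
            raise ValueError("value must fit in 32 bits")
        high, low = value >> 16, value & 0xFFFF
        # Setting an already-set bit changes nothing, so duplicates vanish.
        self.chunks[high] = self.chunks.get(high, 0) | (1 << low)

    def __contains__(self, value):
        high, low = value >> 16, value & 0xFFFF
        return bool((self.chunks.get(high, 0) >> low) & 1)

    def cardinality(self):
        return sum(bin(bits).count("1") for bits in self.chunks.values())

    def union(self, other):
        result = TinyRoaring()
        for high in self.chunks.keys() | other.chunks.keys():
            result.chunks[high] = self.chunks.get(high, 0) | other.chunks.get(high, 0)
        return result


a = TinyRoaring()
for v in (1, 2, 2, 65536, 65537):   # 2 is added twice on purpose
    a.add(v)
b = TinyRoaring()
for v in (2, 3, 4_000_000_000):
    b.add(v)

print(a.cardinality())            # 4
print(len(a.chunks))              # 2 chunks: high bits 0 and 1
print(a.union(b).cardinality())   # 6
```

Only chunks that hold at least one value exist in the dictionary, which mirrors why sparse ID ranges cost little space.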

Hologres provides four UV computing solutions for different requirements. Select a solution based on your data volume and timeliness needs.

UV computing in ad hoc queries

- Description

  Without pre-aggregation, Hologres uses the COUNT DISTINCT function to count UVs directly from fact tables over a specified time period. For more information, see Real-time unique visitor analytics on small dataset.
- Advantages

  Supports real-time, custom time periods for UV and page view (PV) computing without pre-calculation or scheduling. The COUNT DISTINCT function is optimized in Hologres for significantly improved computing efficiency.
- Disadvantages

  Computing efficiency and queries per second (QPS) may decrease as data volume grows.
- Scenarios

  Suitable for flexible UV computing on tens of millions of records.
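As a rough illustration of the ad hoc approach, the following Python sketch runs a COUNT DISTINCT query over a custom date range. It uses an in-memory SQLite database purely as a stand-in so that the example is runnable without a Hologres instance. The table `visit_log` and its columns are hypothetical, and nothing here reflects Hologres-specific optimizations.

```python
import sqlite3

# In-memory SQLite stands in for the fact table; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit_log (ds TEXT, item_id TEXT, uid TEXT)")
conn.executemany(
    "INSERT INTO visit_log VALUES (?, ?, ?)",
    [
        ("2024-01-01", "item_a", "u1"),
        ("2024-01-01", "item_a", "u1"),  # repeat visit by the same user
        ("2024-01-01", "item_b", "u2"),
        ("2024-01-02", "item_a", "u2"),
        ("2024-01-03", "item_a", "u3"),
        ("2024-01-04", "item_b", "u1"),
    ],
)


def uv_pv_by_item(start, end):
    """Return (item_id, uv, pv) rows for an arbitrary inclusive date range."""
    rows = conn.execute(
        """
        SELECT item_id, COUNT(DISTINCT uid) AS uv, COUNT(*) AS pv
        FROM visit_log
        WHERE ds BETWEEN ? AND ?
        GROUP BY item_id
        ORDER BY item_id
        """,
        (start, end),
    )
    return rows.fetchall()


print(uv_pv_by_item("2024-01-01", "2024-01-03"))
# [('item_a', 3, 4), ('item_b', 1, 1)]
```

Because the query reads the fact table directly, the date range can change with every request, at the cost of scanning every row in that range.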

Near-real-time UV computing based on pre-aggregation

- Description

  Uses Hologres roaring bitmaps with periodic scheduling for pre-aggregation, enabling flexible UV computing over custom time periods. For more information, see Near-real-time UV computing based on pre-aggregation.
- Advantages

  Delivers excellent computing performance with high QPS, low latency, and exact deduplicated counts. Supports custom time periods.
- Disadvantages

  Requires pre-calculation and periodic updates to aggregation tables, which increases maintenance overhead.
- Scenarios

  Suitable for UV computing on hundreds of millions of records with high QPS over custom time periods.
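The idea behind this solution can be sketched as follows: a scheduled job folds new events into one bitmap per day and dimension value, and a query merges the bitmaps that fall inside the requested period before counting. In this minimal Python sketch, plain sets of integer user IDs stand in for roaring bitmaps, and the `province` dimension, the function names, and the data are assumptions made for the example.

```python
from collections import defaultdict

# (day, province) -> set of integer user IDs; each set stands in for one bitmap.
agg = defaultdict(set)


def run_batch(events):
    """Periodic job: fold a batch of (day, province, uid) events into agg."""
    for day, province, uid in events:
        agg[(day, province)].add(uid)


def uv(start, end, province=None):
    """Exact UV for an inclusive date range, optionally filtered by province."""
    merged = set()
    for (day, prov), users in agg.items():
        if start <= day <= end and province in (None, prov):
            merged |= users  # plays the role of a bitmap OR
    return len(merged)


run_batch([
    ("2024-01-01", "zhejiang", 1),
    ("2024-01-01", "zhejiang", 2),
    ("2024-01-01", "beijing", 3),
])
run_batch([
    ("2024-01-02", "zhejiang", 2),
    ("2024-01-02", "beijing", 1),
    ("2024-01-03", "beijing", 4),
])

print(uv("2024-01-01", "2024-01-02"))             # 3
print(uv("2024-01-01", "2024-01-03", "beijing"))  # 3
```

Queries touch only the aggregated rows, whose number depends on days and dimension values rather than on the number of raw events.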

Offline UV computing by Hologres and MaxCompute based on pre-aggregation

- Description

  For ultra-large datasets, you can build bitmap indexes offline in MaxCompute and query aggregation result tables by business dimensions in Hologres to achieve sub-second UV computing. You only need to pre-aggregate once at the finest granularity. For more information, see Analyze large user properties with roaring bitmaps.
- Advantages

  Delivers high efficiency for offline UV computing on ultra-large datasets.
- Disadvantages

  Requires user-defined function (UDF) development experience in MaxCompute to build bitmap indexes.
- Scenarios

  Suitable for UV computing on ultra-large datasets.
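Pre-aggregating once at the finest granularity is sufficient because bitmaps for coarser dimension combinations can be derived by combining the finest-granularity bitmaps with OR. The following Python sketch models this under stated assumptions: each bitmap is a Python integer whose bit i is set when user i visited, and the dimensions `day`, `channel`, and `device` as well as all data are hypothetical.

```python
def build_bitmap(uids):
    """Build an integer bitmap whose bit i is set for every user ID i."""
    bits = 0
    for uid in uids:
        bits |= 1 << uid
    return bits


# Finest-granularity result table: one bitmap per (day, channel, device).
DIMENSIONS = ("day", "channel", "device")
finest = {
    ("2024-01-01", "search", "ios"): build_bitmap([1, 2]),
    ("2024-01-01", "search", "android"): build_bitmap([2, 3]),
    ("2024-01-01", "ads", "ios"): build_bitmap([1, 4]),
    ("2024-01-02", "search", "ios"): build_bitmap([5]),
}


def rollup_uv(group_by):
    """UV per combination of the requested dimensions, derived from finest rows."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    merged = {}
    for key, bits in finest.items():
        group = tuple(key[i] for i in idx)
        merged[group] = merged.get(group, 0) | bits  # bitmap OR
    return {group: bin(bits).count("1") for group, bits in merged.items()}


print(rollup_uv(["channel"]))  # {('search',): 4, ('ads',): 2}
print(rollup_uv([]))           # {(): 5}
```

Any subset of the dimensions can be queried this way without a separate pre-calculation per combination.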

Real-time UV computing by Hologres and Flink based on pre-aggregation

- Description

  For scenarios that demand high timeliness, such as real-time UV display on dashboards, Flink joins the event stream with dimension tables in Hologres and performs real-time user tag-based deduplication using roaring bitmaps. For more information, see Real-time UV counting with Hologres and Flink using pre-aggregation.
- Advantages

  Enables fine-grained, real-time UV and PV computing with simple data processing, flexible dimension combinations, and high timeliness.
- Disadvantages

  Involves Flink window functions and requires Flink development experience.
- Scenarios

  Suitable for real-time UV computing, such as live dashboard displays during large-scale sales promotions.
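The data flow of this solution can be approximated in plain Python, without Flink or Hologres: each incoming event is enriched through a dimension table lookup, assigned to a time window, and merged into the deduplicated state for that window and tag. Everything in this sketch is an assumption made for illustration, including the layout of the dimension table, the 60-second tumbling window, and the function names. Sets of integer IDs stand in for roaring bitmaps.

```python
from collections import defaultdict

# Hypothetical dimension table: external user ID -> (integer ID, user tag).
dim_users = {
    "alice": (1, "vip"),
    "bob": (2, "regular"),
    "carol": (3, "vip"),
}

WINDOW_SECONDS = 60  # assumed tumbling window size
# (window start, tag) -> set of integer IDs; each set stands in for a bitmap.
state = defaultdict(set)


def on_event(ts, user_id):
    """Handle one event: dimension lookup, then update the window's state."""
    row = dim_users.get(user_id)
    if row is None:
        return  # unknown user; a real job would need a policy for this case
    int_id, tag = row
    window_start = ts - ts % WINDOW_SECONDS
    state[(window_start, tag)].add(int_id)


def uv(window_start, tag):
    """Current UV for one window and tag; safe to call while events arrive."""
    return len(state.get((window_start, tag), ()))


for ts, user in [(0, "alice"), (10, "alice"), (20, "carol"),
                 (30, "bob"), (65, "alice"), (70, "dave")]:
    on_event(ts, user)

print(uv(0, "vip"))       # 2
print(uv(60, "vip"))      # 1
print(uv(60, "regular"))  # 0
```

Because the state is updated per event, a dashboard can read the current UV of an open window at any time instead of waiting for a scheduled batch.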