Hologres: User behavior analysis

Last Updated: Dec 18, 2024

In most user behavior analysis and user identification scenarios, you need to filter hundreds of millions or even billions of users to obtain metric data that has specific tags. This topic describes how to perform user behavior analysis in Hologres.

Background information

A unique visitor (UV) is an individual user who visits a website at least once. The number of UVs is a metric that is computed for a specific period of time after accurate deduplication, and it is commonly used in behavior analysis. For example, in a large e-commerce promotion, the seller needs to calculate the number of item page view (IPV) UVs, the number of UVs that add items to shopping carts, and the number of UVs that make purchases. The seller can then adjust operation policies promptly based on the UV information to achieve sales goals. You can obtain the number of UVs in a fixed period of time, such as one day, seven days, or one month, based on your business requirements. You can also obtain the number of UVs in a custom period of time, such as half a year.

When you count the number of UVs, the computing dimension and amount of data vary based on your business requirements. In most cases, the following requirements are involved:

  • Your business involves hundreds of millions of data records in more than 10 dimensions each day. You want to flexibly combine the dimensions to query data.

  • In addition to data queries by day, week, month, or year, you want to query data in a more fine-grained manner for a specific long period in real time.

  • You want to accurately deduplicate users.

In the preceding complex UV computing scenarios, a pre-calculation system, such as Apache Kylin, or the Flink-MySQL solution with a fixed combination of dimensions is commonly used. Extract, transform, and load (ETL) tasks are used to perform periodic updates. This helps improve the computing efficiency. However, the following disadvantages exist:

  • If UV computing involves a large number of dimensions, a large volume of storage space is required, and the pre-calculation time is long.

  • Accurate deduplication consumes a large amount of resources. As a result, out-of-memory (OOM) issues are prone to occur.

  • Data cannot be processed in real time, and the timeliness requirement cannot be met. UV computing can be performed only for a fixed period of time but not a custom period of time.
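Two of these limitations can be illustrated with a small sketch. The dimension names and visitor IDs below are hypothetical. The subset count is only an upper bound on what a pre-calculation system might materialize, because real systems usually prune combinations.

```python
# Hypothetical dimensions; the workloads described above may have 10 or more.
dimensions = ["country", "device", "channel", "category"]


def count_dimension_combinations(dims):
    """Number of non-empty dimension subsets that full pre-calculation could cover."""
    return 2 ** len(dims) - 1


# Hypothetical visitors per day. Pre-computed daily UV counts cannot simply be
# added to get a multi-day UV, because the same visitor may appear on several days.
daily_visitors = {
    "2024-12-01": {"u1", "u2", "u3"},
    "2024-12-02": {"u2", "u3", "u4"},
}
sum_of_daily_uv = sum(len(users) for users in daily_visitors.values())
true_two_day_uv = len(set().union(*daily_visitors.values()))

print(count_dimension_combinations(dimensions))  # 15
print(sum_of_daily_uv, true_two_day_uv)          # 6 4
```

Because per-period counts are not additive, a system that stores only fixed-period counts cannot answer a query for an arbitrary period without going back to the detail data.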

You can also use the COUNT DISTINCT function to count the number of UVs directly in fact tables. This allows you to configure a custom period of time for UV computing. However, most systems cannot efficiently compute over the large amount of data in fact tables. If the computing period is long, queries slow down. Queries on fact tables also do not handle high concurrency well.

About the Hologres solution

Hologres is a real-time data warehouse for hybrid serving and analytical processing (HSAP). Hologres uses a distributed architecture, supports real-time data writes, and can analyze and process petabytes of data with high concurrency and low latency. Hologres is compatible with the PostgreSQL protocol and allows you to use existing tools for data analysis.

In UV-based behavior analysis scenarios, Hologres allows you to use the COUNT DISTINCT function to query data directly from fact tables, and optimizes this function for query performance. Hologres also supports the roaring bitmap-based pre-aggregation solution. These capabilities enable Hologres to accurately count the number of UVs over hundreds of millions of data records.

  • COUNT DISTINCT

    The COUNT DISTINCT function follows the native PostgreSQL syntax. The COUNT DISTINCT function is optimized in Hologres to support multiple scenarios, such as deduplication based on one field, deduplication based on multiple fields, deduplication on skewed data, and deduplication when no fields are specified in the GROUP BY clause. You can directly use the COUNT DISTINCT function to query data from wide fact tables. This improves the query performance.

  • Roaring bitmap

    Roaring bitmaps are compressed bitmaps for indexing. The data compression and deduplication features of roaring bitmaps are suitable for UV computing in big data scenarios. Roaring bitmaps have the following characteristics:

    • In a roaring bitmap, 32-bit integers are partitioned into 2^16 chunks. Each chunk corresponds to one value of the 16 most significant bits of the integers. The 16 least significant bits of a 32-bit integer are mapped to a single bit within its chunk. The capacity of a single chunk is determined by the existing maximum value in the chunk.

    • A roaring bitmap uses one bit to represent a 32-bit integer. This significantly compresses data.

    • Roaring bitmaps provide bitwise operations for deduplication.

    For more information about how to use roaring bitmaps, see Roaring bitmap functions.
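The chunk layout described above can be sketched in a few lines. This is a toy illustration, not the Hologres implementation: the class name `SimpleRoaringBitmap` is hypothetical, and each chunk is stored as a plain Python integer used as a bitmap, whereas production roaring bitmap libraries pick a more compact container per chunk.

```python
class SimpleRoaringBitmap:
    """Toy two-level bitmap for unsigned 32-bit integers (illustration only)."""

    def __init__(self):
        # chunk key (16 most significant bits) -> int used as a 65,536-bit bitmap
        self.chunks = {}

    def add(self, value):
        if not 0 <= value < 2**32:
            raise ValueError("value must fit in an unsigned 32-bit integer")
        key, low = value >> 16, value & 0xFFFF
        # Setting an already-set bit is a no-op, which is what deduplicates.
        self.chunks[key] = self.chunks.get(key, 0) | (1 << low)

    def __contains__(self, value):
        key, low = value >> 16, value & 0xFFFF
        return bool((self.chunks.get(key, 0) >> low) & 1)

    def cardinality(self):
        # Count the set bits across all chunks.
        return sum(bin(bits).count("1") for bits in self.chunks.values())

    def union(self, other):
        # Bitwise OR of matching chunks merges two sets without double counting.
        result = SimpleRoaringBitmap()
        for key in self.chunks.keys() | other.chunks.keys():
            result.chunks[key] = self.chunks.get(key, 0) | other.chunks.get(key, 0)
        return result


day1 = SimpleRoaringBitmap()
for uid in (1, 2, 70000, 2):      # the repeated 2 is deduplicated
    day1.add(uid)

day2 = SimpleRoaringBitmap()
for uid in (2, 3, 70000):
    day2.add(uid)

both_days = day1.union(day2)
print(day1.cardinality(), both_days.cardinality())  # 3 4
```

In this sketch, 70000 lands in chunk 1 because its 16 most significant bits equal 1, while the values 1, 2, and 3 share chunk 0.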

The following sections describe the four solutions for UV computing to meet different requirements. You can select a solution based on the amount of your business data and your timeliness requirements.

UV computing in ad hoc queries

  • Description

    Pre-aggregation is not used. Hologres allows you to specify a time period and use the COUNT DISTINCT function to count the number of UVs in fact tables. For more information, see UV computing solution in ad hoc queries.

    • Advantages

      This solution meets real-time requirements and allows you to specify a time period for UV computing. UV and page view (PV) computing in ad hoc queries is performed without the need for pre-calculation and scheduling. The COUNT DISTINCT function for UV computing is optimized in Hologres to significantly improve the computing efficiency.

    • Disadvantages

      If the amount of data increases, the computing efficiency and the supported queries per second (QPS) may decrease.

  • Scenarios

    This solution is suitable for flexible UV computing of tens of millions of data records.
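As a minimal sketch of the ad hoc approach, the following example counts UVs and PVs for a caller-supplied time period with COUNT DISTINCT. It runs against an in-memory SQLite database purely so that the example is self-contained; SQLite stands in for Hologres here, and the table name `page_events` and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: one row per page view.
conn.execute("CREATE TABLE page_events (ds TEXT, uid TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_events VALUES (?, ?, ?)",
    [
        ("2024-12-01", "u1", "home"),
        ("2024-12-01", "u2", "item"),
        ("2024-12-02", "u1", "item"),
        ("2024-12-03", "u3", "cart"),
        ("2024-12-03", "u1", "home"),
    ],
)


def uv_and_pv(connection, start, end):
    """Return (UV, PV) for the inclusive date range, computed on the fact table."""
    row = connection.execute(
        "SELECT COUNT(DISTINCT uid), COUNT(*) "
        "FROM page_events WHERE ds BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0], row[1]


print(uv_and_pv(conn, "2024-12-01", "2024-12-02"))  # (2, 3)
print(uv_and_pv(conn, "2024-12-01", "2024-12-03"))  # (3, 5)
```

No aggregation table or scheduled job is involved: every query scans the detail rows for the requested period, which is why cost grows with the data volume.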

Near-real-time UV computing based on pre-aggregation

  • Description

    Based on roaring bitmaps of Hologres, this solution performs pre-aggregation by using periodic scheduling to allow flexible UV computing in a custom period of time. For more information, see Near-real-time UV computing based on pre-aggregation.

    • Advantages

      This solution delivers excellent computing performance and supports UV computing with high QPS and low latency based on accurate cardinality calculation. It also allows you to configure a custom period of time.

    • Disadvantages

      You need to perform pre-calculation and periodically update data in aggregation tables, which increases the workload of maintenance tasks.

  • Scenarios

    This solution is suitable for UV computing of hundreds of millions of data records with high QPS in a custom time period.

Offline UV computing by Hologres and MaxCompute based on pre-aggregation

  • Description

    If you want to count the number of UVs on an ultra-large amount of data, you can build bitmap indexes offline in MaxCompute and query data from aggregation result tables based on business dimensions in Hologres. This achieves UV computing with sub-second latency. In this solution, you only need to perform pre-aggregation once at the finest granularity, which generates a single pre-aggregation result table. The powerful computing capability of Hologres helps implement offline UV computing on an ultra-large amount of data. For more information, see Roaring bitmaps.

    • Advantages

      This solution delivers high efficiency in offline UV computing on an ultra-large amount of data.

    • Disadvantages

      This solution requires user-defined function (UDF) development experience in MaxCompute to build bitmap indexes.

  • Scenarios

    This solution is suitable for UV computing on an ultra-large amount of data.
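The following sketch shows why a single result table at the finest granularity is enough: coarser groupings can be derived at query time by merging bitmaps. Python sets stand in for roaring bitmaps, the table is a hand-written dictionary rather than the output of a MaxCompute job, and the dimension names and the `rollup_uv` function are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("ds", "country", "device")

# Hypothetical finest-granularity result table:
# one bitmap (here, a set of integer user IDs) per dimension combination.
finest = {
    ("2024-12-01", "CN", "ios"): {1, 2},
    ("2024-12-01", "CN", "android"): {2, 3},
    ("2024-12-01", "US", "ios"): {4},
    ("2024-12-02", "CN", "ios"): {1, 5},
}


def rollup_uv(table, group_by):
    """Merge finest-granularity bitmaps up to the requested dimensions."""
    positions = [DIMENSIONS.index(name) for name in group_by]
    merged = defaultdict(set)
    for key, users in table.items():
        merged[tuple(key[i] for i in positions)] |= users
    return {group: len(users) for group, users in merged.items()}


print(rollup_uv(finest, ("country",)))  # {('CN',): 4, ('US',): 1}
print(rollup_uv(finest, ("ds",)))       # {('2024-12-01',): 4, ('2024-12-02',): 2}
print(rollup_uv(finest, ()))            # {(): 5}
```

User 2 appears under both `ios` and `android` on the same day, yet is counted once in every roll-up, which would not hold if the table stored counts instead of bitmaps.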

Real-time UV computing by Hologres and Flink based on pre-aggregation

  • Description

    You can use this solution to meet strict timeliness requirements for UV computing in scenarios such as real-time UV display on dashboards. In real-time UV computing scenarios, Flink joins the data stream with dimension tables in Hologres. User tag-based deduplication is performed in real time by using roaring bitmaps. For more information, see Use Hologres and Flink to count the number of UVs in real time based on pre-aggregation.

    • Advantages

      This solution implements real-time UV and PV computing in a more fine-grained manner. This solution supports simple data processing and flexible dimension computing, and delivers high timeliness.

    • Disadvantages

      This solution involves window functions of Flink and requires Flink development experience.

  • Scenarios

    This solution is suitable for real-time UV computing scenarios, such as real-time data display on dashboards during large-scale sales promotions.
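The flow can be approximated in plain Python to show its moving parts. This is a sketch under several assumptions and does not use Flink or Hologres: the dictionary `uid_mapping` is a hypothetical stand-in for a dimension table that maps user IDs to integers, the 60-second tumbling window is an arbitrary choice, and Python sets stand in for roaring bitmaps.

```python
from collections import defaultdict

# Hypothetical dimension table: string user ID -> integer ID suitable for a bitmap.
uid_mapping = {"alice": 1, "bob": 2, "carol": 3}

WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(timestamp):
    """Align an epoch-seconds timestamp to the start of its tumbling window."""
    return timestamp - timestamp % WINDOW_SECONDS


def process_stream(stream):
    """Consume (epoch_seconds, user_id) events and keep one bitmap per window."""
    windows = defaultdict(set)
    for timestamp, user_id in stream:
        int_id = uid_mapping.get(user_id)  # lookup against the dimension table
        if int_id is None:
            continue  # simplification: a real job would have to assign a new ID
        windows[window_start(timestamp)].add(int_id)
    return dict(windows)


def uv_between(windows, start, end):
    """Merge the window bitmaps with start <= window start < end and count users."""
    merged = set()
    for begin, users in windows.items():
        if start <= begin < end:
            merged |= users
    return len(merged)


stream = [(0, "alice"), (10, "bob"), (30, "alice"), (65, "bob"), (70, "carol"), (130, "alice")]
per_window = process_stream(stream)
print(uv_between(per_window, 0, 60))   # 2
print(uv_between(per_window, 0, 180))  # 3
```

Each window holds a small bitmap that is updated as events arrive, so a dashboard query only merges a handful of window bitmaps instead of rescanning the event stream.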