This topic uses e-commerce as an example to describe how to use HyperLogLog to precompute data in AnalyticDB for PostgreSQL. Based on the precomputation results, you can view data in multiple dimensions within milliseconds. For information about how to use HyperLogLog, see Use HyperLogLog.

Best practices

The methods provided in this topic are used in the following best practices. If you are familiar with these methods, you can directly apply them to your business based on the best practices.

  • AnalyticDB for PostgreSQL precomputes data based on query requests. With the precomputation results, you only need to specify the filter criteria for a data view query and AnalyticDB for PostgreSQL responds to your query for data in all dimensions within 100 milliseconds.

  • AnalyticDB for PostgreSQL executes a GROUPING SETS statement to collect data statistics in multiple dimensions simultaneously. This reduces repetitive data scanning and computing workloads and increases processing efficiency.

  • AnalyticDB for PostgreSQL records user IDs in each dimension as an array. This enables you to not only view data but also spot specific users.

  • AnalyticDB for PostgreSQL uses hll fields to store estimated data statistics for complex views. For example, the values of multiple hll fields can be united to analyze unique visitors (UVs) and identify new UVs.

  • AnalyticDB for PostgreSQL implements stream processing. It uses PipelineDB to compute data and obtain statistics in real time.

  • AnalyticDB for PostgreSQL stores intermediate data (also referred to as raw data) in Alibaba Cloud Object Storage Service (OSS) by using an OSS_FDW external table. This way, AnalyticDB for PostgreSQL does not need to store the intermediate data itself but only reads it for precomputation. This reduces the volume of data written into AnalyticDB for PostgreSQL and the storage space occupied.

  • AnalyticDB for PostgreSQL uses level-1 and level-2 partitions of Greenplum to break a data view query into smaller units. It uses indexed tags to further narrow down the scope of searches. By doing so, AnalyticDB for PostgreSQL responds to your query for data in any dimension and among any volume within 100 milliseconds.

  • AnalyticDB for PostgreSQL uses column-oriented storage to increase the data compression ratio and reduce the storage space occupied by data statistics.

Background information

Typical e-commerce data view businesses use tags such as brand IDs, sales territory IDs, user IDs, and timestamps to compute and analyze data. These tags may be classified in various dimensions. For example, tag 1 is classified into a gender category, tag 2 is classified into an age category, and tag 3 is classified into a hobby category.

A customer is likely to view the user base of their own brand. A typical example is to view the number of users in each sales territory (or channel), in each time range, and in each dimension.

Preparations

The example in this section is for reference only. Make preparations based on your actual business needs. For this example, define the following data structures:

  • t1: the IDs of active users in a sales territory (or channel) on a specific day.

    t1 (
    uid,       -- The ID of an active user.
    groupid,   -- The ID of the sales territory (or channel).
    day        -- The date.
    )
  • t2: the user base of a brand.

    t2 (
    uid,    -- The ID of a user.
    brand   -- The ID of the brand.
    )
  • t3: the tags of a user.

    t3 (
    uid,    -- The ID of the user.
    tag1,   -- Tag 1 bound to the user, for example, a hobby tag.
    tag2,   -- Tag 2 bound to the user, for example, a gender tag.
    tag3,   -- Tag 3 bound to the user, for example, an age tag.
    ...
    )
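
The pseudo-structures above could be created as in the following DDL sketch. The integer types and the distribution key are assumptions; distributing all three tables by uid keeps the joins on uid collocated on each segment.

```sql
-- Minimal DDL sketch for t1, t2, and t3. Types and the
-- distribution key (uid) are assumptions, not part of the topic.
create table t1 (
  uid     int,   -- The ID of an active user.
  groupid int,   -- The ID of the sales territory (or channel).
  day     date   -- The date.
) distributed by (uid);

create table t2 (
  uid   int,   -- The ID of a user.
  brand int    -- The ID of the brand.
) distributed by (uid);

create table t3 (
  uid  int,   -- The ID of the user.
  tag1 int,   -- Tag 1 bound to the user, for example, a hobby tag.
  tag2 int,   -- Tag 2 bound to the user, for example, a gender tag.
  tag3 int    -- Tag 3 bound to the user, for example, an age tag.
) distributed by (uid);
```

Choosing uid as the distribution key for all three tables avoids data redistribution when they are joined on uid.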

Based on the defined data structures, you can view data in brand, sales territory (or channel), tag, and date dimensions. Example:

select
  'hobby' as tag,
  t3.tag1 as tag_value,
  count(1) as cnt
from
  t1,
  t2,
  t3
where
  t1.uid = t3.uid
  and t1.uid = t2.uid
  and t2.brand = ?
  and t1.groupid = ?
  and t1.day = '2017-06-25'
group by t3.tag1

This type of query requires heavy computing workloads. In addition, you may need to compare and analyze data in different dimensions. Therefore, we recommend that you use precomputation to optimize data retrieval.

Data retrieval optimization based on precomputation

Use the following methods to optimize data retrieval and expedite queries:

  • Use column-oriented storage if your instance runs Greenplum.
  • Partition each table by day and then further partition it by brand and group ID.
  • Choose random distribution for tables.
  • Create an independent index for each tag? field.

After the preceding optimization, AnalyticDB for PostgreSQL can respond to a data view query within 100 milliseconds regardless of the data volume.

The following results are expected after precomputation:

t_result (
  day,      -- The date.
  brand,    -- The ID of the brand.
  groupid,  -- The ID of the sales territory (channel) or shop.
  tag1,     -- Tag 1 bound to users.
  tag2,     -- Tag 2 bound to users.
  tag3,     -- Tag 3 bound to users.
  ...,      -- Tag n bound to users.
  cnt,      -- The number of users.
  uids,     -- The user IDs in an array. This field is optional. If you do not need specific user IDs, you can choose not to store them.
  hll_uids  -- The hll value that stores hashed user IDs for HyperLogLog-based estimation.
)
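
The optimization items listed earlier (column-oriented storage, level-1 and level-2 partitions, random distribution, and per-tag indexes) could be applied to t_result as in the following Greenplum DDL sketch. The column types, compression options, partition range, and brand values are assumptions.

```sql
-- Sketch of t_result with the described optimizations applied.
-- Types, compression options, the partition range, and the brand
-- values in the subpartition template are assumptions.
create table t_result (
  day      date,
  brand    int,
  groupid  int,
  tag1     int,
  tag2     int,
  tag3     int,
  cnt      bigint,
  uids     int[],
  hll_uids hll
)
with (appendonly=true, orientation=column, compresstype=zlib)  -- column-oriented storage
distributed randomly
partition by range (day)            -- level-1 partition: day
subpartition by list (brand)        -- level-2 partition: brand (hypothetical values)
subpartition template (
  subpartition brand_1 values (1),
  subpartition brand_2 values (2),
  default subpartition other_brands
)
(
  start (date '2017-06-01') inclusive
  end   (date '2017-07-01') exclusive
  every (interval '1 day')
);

-- An independent index on each tag field further narrows the
-- search scope after partition pruning.
create index idx_t_result_tag1 on t_result (tag1);
create index idx_t_result_tag2 on t_result (tag2);
create index idx_t_result_tag3 on t_result (tag3);
```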

With the precomputation results, you can simplify your query as follows:

select
  day, brand, groupid, 'tag?' as tag, cnt, uids, hll_uids
from t_result
where
  day = ?
  and brand = ?
  and groupid = ?
  and tag? = ?

In the preceding example, AnalyticDB for PostgreSQL filters data in partitions based on the first three filter criteria, day, brand, and groupid, and then obtains the expected data based on the tag? field.

Precomputation enables AnalyticDB for PostgreSQL to analyze data in more complex dimensions with fewer computing workloads. For example, AnalyticDB for PostgreSQL can identify the users that differ between two days or the users to whom more than one tag is bound.
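
For example, the users to whom two specific tags are both bound can be spotted by intersecting the uids arrays of two precomputed rows. The query below is a sketch with hypothetical filter values, and it uses arr_overlap, the intersection function defined later in this topic.

```sql
-- Hypothetical: spot users bound to both tag1 = 1 and tag2 = 2
-- for one day, brand, and group. Each grouping-sets row in t_result
-- carries one tag dimension, so two rows are intersected.
select arr_overlap(a.uids, b.uids) as common_uids
from t_result a, t_result b
where a.day = '2017-06-25' and b.day = a.day
  and a.brand = 1 and b.brand = a.brand
  and a.groupid = 1 and b.groupid = a.groupid
  and a.tag1 = 1
  and b.tag2 = 2;
```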

Precomputation methods

Execute the following SQL statements to precompute data:

select
  t1.day,
  t2.brand,
  t1.groupid,
  t3.tag1,
  t3.tag2,
  t3.tag3,
  ...
  count(1) as cnt,
  array_agg(uid) as uids,
  -- Aggregate user IDs into an array.
  hll_add_agg(hll_hash_integer(uid)) as hll_uids
  -- Convert user IDs into hash values and aggregate them into an hll value.
from
  t1,
  t2,
  t3
where
  t1.uid = t3.uid
  and t1.uid = t2.uid
group by
  t1.day,
  t2.brand,
  t1.groupid,
  grouping sets (
  -- Execute one GROUPING SETS statement to precompute data in multiple dimensions. This way, you do not need to execute multiple SQL statements, each of which precomputes data in one dimension. After you execute a GROUPING SETS statement, AnalyticDB for PostgreSQL scans data only once and then generates statistics based on each specified tag.
    (t3.tag1),
    (t3.tag2),
    (t3.tag3),
    (...),
    (t3.tagn)
  )

Data view queries based on precomputation results

For complex queries, AnalyticDB for PostgreSQL performs array-based logical operations to obtain a set of user IDs.

Supported array-based logical operations

AnalyticDB for PostgreSQL supports the following array-based logical operations:

  • Obtain the values that are in array 1 but are not in array 2.

    create or replace function arr_miner(anyarray, anyarray) returns anyarray as $$
    select array(select * from (select unnest($1) except select unnest($2)) t group by 1);
    $$ language sql strict;
  • Obtain the values that are in both arrays 1 and 2.

    create or replace function arr_overlap(anyarray, anyarray) returns anyarray as $$
    select array(select * from (select unnest($1) intersect select unnest($2)) t group by 1);
    $$ language sql strict;
  • Obtain the union of the values in arrays 1 and 2.

    create or replace function arr_merge(anyarray, anyarray) returns anyarray as $$
    select array(select unnest(array_cat($1,$2)) group by 1);
    $$ language sql strict;
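
As a quick check, the three functions can be exercised on literal arrays. Because the definitions deduplicate with GROUP BY, the order of elements in the results is not guaranteed.

```sql
select arr_miner(array[1,2,3], array[2,3]);    -- values only in the first array: {1}
select arr_overlap(array[1,2,3], array[2,4]);  -- values present in both arrays: {2}
select arr_merge(array[1,2], array[2,3]);      -- deduplicated union, in some order
```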

Examples

The user ID set as of June 24, 2017 is uid1[], and the user ID set since June 25, 2017 when a promotion started is uid2[]. You can use the following expression to identify the new users attracted by the promotion:

arr_miner(uid2[], uid1[])

Value-based logical operations by using HyperLogLog

AnalyticDB for PostgreSQL supports the following value-based logical operations:

  • Obtain the estimated number of unique values in an hll field:

    hll_cardinality(users)
  • Merge the values of two hll fields into a single hll value:

    hll_union()

Examples

The HLL user ID set as of June 24, 2017 is uid1_hll, and the HLL user ID set since June 25, 2017 when a promotion started is uid2_hll. You can use the following expression to estimate the number of new users attracted by the promotion. The two hll values are united first because subtracting the two cardinalities directly would miscount users that appear in both sets:

hll_cardinality(hll_union(uid1_hll, uid2_hll)) - hll_cardinality(uid1_hll)
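
In a full query, this expression could be applied as follows. The table uv_daily and its columns are hypothetical and stand for one hll value stored per day.

```sql
-- Hypothetical table uv_daily(day date, uids_hll hll), one hll value per day.
-- New users = |uid1 ∪ uid2| - |uid1|: unite the two hll values first,
-- then subtract the cardinality of the earlier set.
select hll_cardinality(hll_union(a.uids_hll, b.uids_hll))
     - hll_cardinality(a.uids_hll) as new_uv
from uv_daily a, uv_daily b
where a.day = date '2017-06-24'
  and b.day = date '2017-06-25';
```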

Precomputation scheduling

Before precomputation is introduced, AnalyticDB for PostgreSQL returns query results by using joins. After precomputation is introduced, AnalyticDB for PostgreSQL precomputes data to generate statistics. Each precomputation must be scheduled based on the data source and data aggregation method. AnalyticDB for PostgreSQL supports two data aggregation methods: stream computing and batch computing.

Day-level precomputation

In day-level precomputation, historical statistics are not updated; only new statistics are added. The new statistics must be written into the t_result table in real time.

insert into t_result
select
  t1.day,
  t2.brand,
  t1.groupid,
  t3.tag1,
  t3.tag2,
  t3.tag3,
  ...
  count(1) as cnt,
  array_agg(uid) as uids,
  hll_add_agg(hll_hash_integer(uid)) as hll_uids
from
  t1,
  t2,
  t3
where
  t1.uid = t3.uid
  and t1.uid = t2.uid
group by
  t1.day,
  t2.brand,
  t1.groupid,
  grouping sets (
    (t3.tag1),
    (t3.tag2),
    (t3.tag3),
    (...),
    (t3.tagn)
  )

Statistics aggregation

If data is precomputed at the day level but you want to query data of a specific month or year, AnalyticDB for PostgreSQL aggregates precomputed day-level statistics to respond to your query. AnalyticDB for PostgreSQL supports asynchronous aggregation. The data view returned to you is the aggregation results generated by AnalyticDB for PostgreSQL.

t_result_month (
  month,    -- yyyy-mm
  brand,    -- The ID of the brand.
  groupid,  -- The ID of the sales territory (channel) or shop.
  tag1,     -- Tag 1 bound to users.
  tag2,     -- Tag 2 bound to users.
  tag3,     -- Tag 3 bound to users.
  ...       -- Tag n bound to users.
  cnt,      -- The number of users.
  uids,     -- The user IDs in an array. This field is optional. If you do not want user IDs, you can choose not to save them.
  hll_uids  -- The user IDs estimated by HyperLogLog.
)

To aggregate data from multiple arrays, define the following function:

postgres=# create aggregate arragg (anyarray) ( sfunc=arr_merge, stype=anyarray);
CREATE AGGREGATE
postgres=# select arragg(c1) from (values (array[1,2,3]),(array[2,5,6])) t (c1);
   arragg
-------------
 {6,3,2,1,5}
(1 row)

For example, execute the following SQL statements to aggregate data at the month level:

select
  to_char(day, 'yyyy-mm'),
  brand,
  groupid,
  tag1,
  tag2,
  tag3,
  ...
  array_length(arragg(uids), 1) as cnt,
  arragg(uids) as uids,
  hll_union_agg(hll_uids) as hll_uids
from t_result
group by
  to_char(day, 'yyyy-mm'),
  brand,
  groupid,
  tag1,
  tag2,
  tag3,
  ...

Based on the preceding example, you can obtain statistics aggregated at the year level.
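
For example, the month-level statement can be adapted to the year level by changing the format string passed to to_char. The sketch below is restricted to tag1 for brevity.

```sql
-- Year-level aggregation of day-level precomputation results,
-- shown for the tag1 dimension only.
select
  to_char(day, 'yyyy') as year,
  brand,
  groupid,
  tag1,
  array_length(arragg(uids), 1) as cnt,
  arragg(uids) as uids,
  hll_union_agg(hll_uids) as hll_uids
from t_result
group by
  to_char(day, 'yyyy'),
  brand,
  groupid,
  tag1;
```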

Stream computing

Stream computing is used to aggregate statistics in real time. If the data volume is large, you can use the partition key to distribute data to different stream compute nodes. After the stream compute nodes complete processing, they send the processing results to AnalyticDB for PostgreSQL (based on Greenplum Database).
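
As a rough sketch of this pattern, PipelineDB lets you declare a stream and a continuous view over it. The stream, view, and column names below are hypothetical; PipelineDB maintains count(distinct ...) in a continuous view incrementally by using HyperLogLog internally.

```sql
-- Hypothetical PipelineDB sketch: a stream of user events and a
-- continuous view that keeps per-day, per-group unique-user counts.
create stream user_events (uid integer, groupid integer, day date);

create continuous view daily_uv as
select day, groupid, count(distinct uid) as uv
from user_events
group by day, groupid;

-- Events are fed with ordinary inserts, for example:
-- insert into user_events (uid, groupid, day) values (1, 10, '2017-06-25');
```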