Use the hll plug-in to recommend related content based on user preferences - ApsaraDB RDS

This topic describes how to use the hll plug-in to recommend related content based on user preferences.

Prerequisites

Background information

A recommendation system can be used to increase user stickiness and conversion rates in the following scenarios:

E-commerce websites recommend products based on user purchase habits.
Music websites recommend music based on user listening preferences.
News websites recommend news based on user browsing habits.
Application websites recommend applications based on user download and usage habits.

In this topic, a music website is used as an example to describe how to design a recommendation system and elaborate on the differences between the conventional design and the design that is based on HHL and similarity computing of PostgreSQL.

Design background

When a user (UID) listens to a song (vid), the recommendation system binds tags to that song. A song can have multiple tags, which forms the following mapping:
```
uid ->> tags ->> musics    
```
The recommendation system ranks the tags for a user based on the number of songs associated with each tag to determine the popularity of the tags:
```
tag(count distinct music)    
...   
```
The recommendation system obtains the top five tags and their weights:
```
tag1:40%    
tag2:20%    
tag3:15%    
tag4:15%    
tag5:10%  
```
The recommendation system excludes songs that the user has already listened to from the tag's song library. Then, the system recommends new songs based on their recommendation weights, such as in descending order of playback counts.

Design based on HLL similarity computing

You can use the hll plug-in to store the IDs (vids) of songs that a user (UID) has listened to. Compared with the conventional design, this design has the following advantages:

It stores a small amount of data using approximate hll hash aggregations instead of actual values.
It provides high query efficiency. It supports indexes and does not require calculations, which allows for millisecond-level responses.
It supports operations such as hash union and add. These operations are suitable for sliding window calculations and can meet more business requirements.

Note

For more information about how to use the hll plug-in, see Cardinality estimation (hll).

Connect to the ApsaraDB RDS for PostgreSQL instance and create a test table. For each tag, the table stores an hll that contains the hash values of the IDs of songs that a user has listened to.

CREATE TABLE t_like (
    uid           INT,
    tagid         INT,            -- Tag
    w1 hll, w1_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Monday.   
    w2 hll, w2_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Tuesday.
    w3 hll, w3_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Wednesday.    
    w4 hll, w4_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Thursday.    
    w5 hll, w5_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Friday.    
    w6 hll, w6_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Saturday.    
    w7 hll, w7_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Sunday.
    whole hll,                     -- All
    PRIMARY KEY (uid, tagid)
);

Note

The w1 to w7 fields are set as needed. If you are only concerned with data for a single day, you only need to set one field.

When a user listens to a song, the information is written to the field that corresponds to the current date. If the field already has a value and was last modified on a previous day, the value is overwritten. Otherwise, a hash value is appended.

This operation uses the INSERT INTO ON CONFLICT syntax. The following code provides an example:

-- Insert the hash value of the user's viewing history.
INSERT INTO t_like (
    uid,
    tagid,
    w5,
    w5_mod_time,
    whole
)
VALUES (
    1,                                   -- uid
    200,                                 -- Tag ID
    hll_hash_integer(12346) || hll_empty(),  -- The vid of the song that was watched (multiple can be concatenated).
    NOW(),
    hll_hash_integer(12346) || hll_empty()   -- The vid of the song that was watched.
)
ON CONFLICT (uid, tagid) DO UPDATE
SET
    w5 =
        CASE
            WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE
            THEN EXCLUDED.w5
            ELSE hll_union(COALESCE(t_like.w5, hll_empty()),
                           EXCLUDED.w5)
        END,
    w5_mod_time = EXCLUDED.w5_mod_time,
    whole       = hll_union(COALESCE(t_like.whole, hll_empty()),
                            EXCLUDED.whole)
WHERE
    hll_union(COALESCE(t_like.w5,   hll_empty()), EXCLUDED.w5)    <> COALESCE(t_like.w5,   hll_empty())
    OR
    hll_union(COALESCE(t_like.whole,hll_empty()), EXCLUDED.whole) <> COALESCE(t_like.whole,hll_empty());

Note

In actual business scenarios, you can also perform batch merge updates. You can perform aggregate updates for a single tag of a single user and use hll_union to reduce the update rate.

Query the top 10 tags for the user with UID 1 (uid=1) from the last two days. The following code provides an example:

SELECT
    tagid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    ) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;

The following result is returned:

 tagid | vids     
-------+------    
   200 |    2    
(1 row)

Run the following command to create an index.

CREATE INDEX idx_t_like_1
ON t_like (
    uid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    )
);

Run the following command to view the execution plan.

EXPLAIN
SELECT
    tagid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    ) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;
                                        QUERY PLAN                                             
-------------------------------------------------------------------------------------------    
 Limit  (cost=0.11..0.15 rows=1 width=12)    
   ->  Index Scan Backward using idx_t_like_1 on t_like  (cost=0.11..0.15 rows=1 width=12)    
         Index Cond: (uid = 1)    
(3 rows)

The following result is returned:

                                        QUERY PLAN                                             
-------------------------------------------------------------------------------------------    
 Limit  (cost=0.11..0.15 rows=1 width=12)    
   ->  Index Scan Backward using idx_t_like_1 on t_like  (cost=0.11..0.15 rows=1 width=12)    
         Index Cond: (uid = 1)    
(3 rows)

Use pgbench to insert data into the test table. This topic uses an ECS instance that is in the same VPC as the ApsaraDB RDS for PostgreSQL instance as an example.

Note

pgbench is a simple tool for running benchmark tests on PostgreSQL. Make sure that the PostgreSQL client is installed on the destination ECS instance. For more information about this command, see the official PostgreSQL documentation.

On the ECS instance, run the vi test.sql command to create a test SQL file named test.sql and insert the following content.

\set uid    random(1, 50000)
\set tagid  random(1, 5000)
\set vid    random(1, 10000000)

INSERT INTO t_like (
    uid,
    tagid,
    w5,
    w5_mod_time,
    whole
)
VALUES (
    :uid,
    :tagid,
    hll_hash_integer(:vid) || hll_empty(),
    NOW(),
    hll_hash_integer(:vid) || hll_empty()
)
ON CONFLICT (uid, tagid) DO UPDATE
SET
    w5 =
        CASE
            WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE
            THEN EXCLUDED.w5
            ELSE hll_union(
                     COALESCE(t_like.w5, hll_empty()),
                     EXCLUDED.w5
                 )
        END,
    w5_mod_time = EXCLUDED.w5_mod_time,
    whole       = hll_union(
                     COALESCE(t_like.whole, hll_empty()),
                     EXCLUDED.whole
                 )
WHERE
      hll_union(COALESCE(t_like.w5,   hll_empty()), EXCLUDED.w5)   
    <> COALESCE(t_like.w5,   hll_empty())
   OR
      hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole)
    <> COALESCE(t_like.whole, hll_empty());

Run the following command to insert test data.

pgbench \
  -M prepared \
  -n \
  -r \
  -P 1 \
  -c 32 \
  -j 32 \
  -T 120 \
  -f ./test.sql \
  -h pgm-****.pg.rds.aliyuncs.com \   # ApsaraDB RDS for PostgreSQL endpoint
  -p 5432 \                           # Port
  -U testdbuser \                     # Database account
  testdb                              # Destination database name

The following result is returned:

transaction type: ./test.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 32    
number of threads: 32    
duration: 120 s    
number of transactions actually processed: 24636321    
latency average = 0.156 ms    
latency stddev = 0.339 ms    
tps = 205301.110313 (including connections establishing)    
tps = 205354.851711 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set uid random(1,5000000)    
         0.001  \set tagid random(1,5000)    
         0.000  \set vid random(1,10000000)    
         0.154  insert into t_like (

Connect to the ApsaraDB RDS for PostgreSQL instance and run the following command to query the tag popularity ranking for a specific UID.

SELECT
    tagid,
    hll_cardinality(
        hll_union(
            COALESCE(w4, hll_empty()),
            COALESCE(w5, hll_empty())
        )
    ) AS vids
FROM t_like
WHERE uid = 1
ORDER BY 2 DESC
LIMIT 10;

The following result is returned:

 tagid | vids     
-------+------    
   200 |    2    
  1413 |    1    
  1996 |    1    
  2642 |    1    
  3664 |    1    
  4340 |    1    
(6 rows)    
    
Time: 0.688 ms

Note

The response time is 0.688 ms.

Other requirement examples

Filter out songs that the user has already listened to. For example, you can determine whether a song ID (vid) is in the hash.

Inexact operation

Exclusions:

SELECT
    whole || hll_hash_integer(1) = whole
FROM t_like
WHERE uid = 1
  AND tagid = 200;  -- Returns false, indicating that vid:1 is not included.    
 
 ?column?     
----------    
 f    
(1 row)

This includes the following:

SELECT
    whole || hll_hash_integer(12345) = whole
FROM t_like
WHERE uid = 1
  AND tagid = 200;    -- Returns true, indicating that vid:12345 is included.
      
 ?column?     
----------    
 t    
(1 row)

Exact operation

Create a test table for songs that have been listened to. The following code provides an example:

CREATE TABLE t_like_lossless (
    uid INT,
    vid INT,
    PRIMARY KEY (uid, vid)
);

Conventional design

This design is applicable to all databases. However, its disadvantage is that it may require aggregate queries when the data volume is large, which can lead to lower efficiency.

Connect to the ApsaraDB RDS for PostgreSQL instance and create a test table. The following code provides an example of the table creation statement:

CREATE TABLE t_like (
    uid       INT,          -- User ID
    tagid     INT,          -- Song tag ID
    vid       INT,          -- Song ID
    mod_time  TIMESTAMP,    -- The time of the last update. An update is triggered only when more than one day has passed since the last update.
    PRIMARY KEY (uid, tagid, vid)
);

Use pgbench to insert data into the test table. This topic uses an ECS instance that is in the same VPC as the ApsaraDB RDS for PostgreSQL instance as an example.

Note

On the ECS instance, run the vi test.sql command to create a test SQL file named test.sql and insert the following content.

\set uid    random(1, 50000)
\set tagid  random(1, 5000)
\set vid    random(1, 10000000)

INSERT INTO t_like
VALUES (:uid, :tagid, :vid, NOW())
ON CONFLICT (uid, tagid, vid) DO UPDATE
    SET mod_time = EXCLUDED.mod_time
WHERE
    EXCLUDED.mod_time - t_like.mod_time > INTERVAL '1 day';

Run the following command to insert test data.

pgbench \
  -M prepared \
  -n \
  -r \
  -P 1 \
  -c 32 \
  -j 32 \
  -T 120 \
  -f ./test.sql \
  -h pgm-****.pg.rds.aliyuncs.com \   # ApsaraDB RDS for PostgreSQL endpoint
  -p 5432 \                           # Port
  -U testdbuser \                     # Database account
  testdb                              # Destination database name

Connect to the ApsaraDB RDS for PostgreSQL instance and run the following command to obtain statistics for the top 10 tags from the last day.

SELECT
    tagid,
    COUNT(*)
FROM t_like
WHERE uid = 1
  AND NOW() - mod_time < INTERVAL '1 day'
GROUP BY tagid
ORDER BY COUNT(*) DESC
LIMIT 10;

The following sample result is returned:

 tagid | count   
-------+-------  
  2519 |     4  
  3049 |     4  
  3648 |     4  
  1777 |     3  
  1352 |     3  
  1491 |     3  
  1064 |     3  
   572 |     3  
   692 |     3  
   301 |     3  
(10 rows)  
  
Time: 3.947 ms