
ApsaraDB RDS: User preference recommendation system (PostgreSQL similarity computing application)

Last Updated: Jul 26, 2025

This topic describes how to use the hll plug-in to recommend related content based on user preferences.

Prerequisites

The hll extension is created in the destination database of the ApsaraDB RDS for PostgreSQL instance. For more information, see Cardinality estimation (hll).

Background information

A recommendation system can be used to increase user stickiness and conversion rates in the following scenarios:

  • E-commerce websites recommend products based on user purchase habits.

  • Music websites recommend music based on user listening preferences.

  • News websites recommend news based on user browsing habits.

  • Application websites recommend applications based on user download and usage habits.

In this topic, a music website is used as an example to describe how to design a recommendation system and to explain the differences between the conventional design and a design that is based on the HLL similarity computing capability of PostgreSQL.

Design background

  1. When a user (UID) listens to a song (vid), the recommendation system binds tags to that song. A song can have multiple tags, which forms the following mapping:

    uid ->> tags ->> musics    
  2. The recommendation system ranks the tags for a user based on the number of songs associated with each tag to determine the popularity of the tags:

    tag(count distinct music)    
    ...   
  3. The recommendation system obtains the top five tags and their weights:

    tag1:40%    
    tag2:20%    
    tag3:15%    
    tag4:15%    
    tag5:10%  
  4. The recommendation system excludes songs that the user has already listened to from the tag's song library. Then, the system recommends new songs based on their recommendation weights, such as in descending order of playback counts.
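The recommendation flow in the preceding steps can be sketched in SQL. The tables t_music (the song library with tags and playback counts) and t_listen (a user's listening history) and their columns are hypothetical and are used only for illustration:

```sql
-- For one of the user's top tags (tagid = 200), recommend songs that
-- the user (uid = 1) has not listened to yet, in descending order of
-- playback count.
SELECT m.vid
FROM t_music m
WHERE m.tagid = 200
  AND NOT EXISTS (
      SELECT 1
      FROM t_listen l
      WHERE l.uid = 1
        AND l.vid = m.vid        -- exclude songs already listened to
  )
ORDER BY m.play_count DESC       -- recommendation weight: playback count
LIMIT 10;
```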

Design based on HLL similarity computing

You can use the hll plug-in to store the IDs (vids) of songs that a user (UID) has listened to. Compared with the conventional design, this design has the following advantages:

  • It stores a small amount of data using approximate hll hash aggregations instead of actual values.

  • It provides high query efficiency. It supports indexes and does not require calculations, which allows for millisecond-level responses.

  • It supports operations such as hash union and add. These operations are suitable for sliding window calculations and can meet more business requirements.
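For example, because hll values can be merged with hll_union, the daily fields of the t_like table that is created in the following procedure can be combined over any window of days. The following sketch estimates the number of distinct songs per tag over the full week:

```sql
-- Estimate the number of distinct songs per tag over a 7-day window
-- by merging the daily hll fields w1 to w7.
SELECT
    tagid,
    hll_cardinality(
        hll_union(
            hll_union(
                hll_union(COALESCE(w1, hll_empty()), COALESCE(w2, hll_empty())),
                hll_union(COALESCE(w3, hll_empty()), COALESCE(w4, hll_empty()))
            ),
            hll_union(
                hll_union(COALESCE(w5, hll_empty()), COALESCE(w6, hll_empty())),
                COALESCE(w7, hll_empty())
            )
        )
    ) AS distinct_songs_7d
FROM t_like
WHERE uid = 1;
```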

Note

For more information about how to use the hll plug-in, see Cardinality estimation (hll).
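The examples in this topic require the hll extension in the destination database. If the extension is not created yet, you can create it as follows:

```sql
-- Create the hll extension in the current database.
CREATE EXTENSION IF NOT EXISTS hll;
```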

  1. Connect to the ApsaraDB RDS for PostgreSQL instance and create a test table. For each tag, the table stores an hll that contains the hash values of the IDs of songs that a user has listened to.

    CREATE TABLE t_like (
        uid           INT,
        tagid         INT,            -- Tag
        w1 hll, w1_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Monday.   
        w2 hll, w2_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Tuesday.
        w3 hll, w3_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Wednesday.    
        w4 hll, w4_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Thursday.    
        w5 hll, w5_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Friday.    
        w6 hll, w6_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Saturday.    
        w7 hll, w7_mod_time TIMESTAMP, -- The hash composed of the vids of songs listened to on Sunday.
        whole hll,                     -- All
        PRIMARY KEY (uid, tagid)
    );
    Note

    The w1 to w7 fields are optional. If you are concerned only with data for a single day, you need to set only one field.

  2. When a user listens to a song, the information is written to the field that corresponds to the current date. If the field already has a value and was last modified on a previous day, the value is overwritten. Otherwise, a hash value is appended.

    This operation uses the INSERT INTO ON CONFLICT syntax. The following code provides an example:

    -- Insert the hash values of the songs that the user has listened to.
    INSERT INTO t_like (
        uid,
        tagid,
        w5,
        w5_mod_time,
        whole
    )
    VALUES (
        1,                                   -- uid
        200,                                 -- Tag ID
        hll_hash_integer(12346) || hll_empty(),  -- The vid of the song that was listened to (multiple hashes can be concatenated).
        NOW(),
        hll_hash_integer(12346) || hll_empty()   -- The vid of the song that was listened to.
    )
    ON CONFLICT (uid, tagid) DO UPDATE
    SET
        w5 =
            CASE
                WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE
                THEN EXCLUDED.w5
                ELSE hll_union(COALESCE(t_like.w5, hll_empty()),
                               EXCLUDED.w5)
            END,
        w5_mod_time = EXCLUDED.w5_mod_time,
        whole       = hll_union(COALESCE(t_like.whole, hll_empty()),
                                EXCLUDED.whole)
    WHERE
        hll_union(COALESCE(t_like.w5,   hll_empty()), EXCLUDED.w5)    <> COALESCE(t_like.w5,   hll_empty())
        OR
        hll_union(COALESCE(t_like.whole,hll_empty()), EXCLUDED.whole) <> COALESCE(t_like.whole,hll_empty());
    Note

    In actual business scenarios, you can also perform batch merge updates: aggregate the plays for a single tag of a single user and use hll_union to reduce the update frequency.
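    Such a batch merge can be sketched with the hll_add_agg aggregate that the hll plug-in provides. The staging table t_play_stage below is hypothetical, and the sketch omits the day-rollover CASE logic of the preceding statement for brevity:

```sql
-- Hypothetical staging table that buffers raw plays before merging.
CREATE TABLE t_play_stage (
    uid   INT,
    tagid INT,
    vid   INT
);

-- Merge all buffered plays for each (uid, tagid) in a single statement.
INSERT INTO t_like (uid, tagid, w5, w5_mod_time, whole)
SELECT
    uid,
    tagid,
    hll_add_agg(hll_hash_integer(vid)),  -- aggregate all vids into one hll
    NOW(),
    hll_add_agg(hll_hash_integer(vid))
FROM t_play_stage
GROUP BY uid, tagid
ON CONFLICT (uid, tagid) DO UPDATE
SET
    w5          = hll_union(COALESCE(t_like.w5, hll_empty()), EXCLUDED.w5),
    w5_mod_time = EXCLUDED.w5_mod_time,
    whole       = hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole);
```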

  3. Query the top 10 tags for the user with UID 1 (uid=1) from the last two days. The following code provides an example:

    SELECT
        tagid,
        hll_cardinality(
            hll_union(
                COALESCE(w4, hll_empty()),
                COALESCE(w5, hll_empty())
            )
        ) AS vids
    FROM t_like
    WHERE uid = 1
    ORDER BY 2 DESC
    LIMIT 10;    

    The following result is returned:

     tagid | vids     
    -------+------    
       200 |    2    
    (1 row)   
  4. Run the following command to create an index.

    CREATE INDEX idx_t_like_1
    ON t_like (
        uid,
        hll_cardinality(
            hll_union(
                COALESCE(w4, hll_empty()),
                COALESCE(w5, hll_empty())
            )
        )
    );
  5. Run the following command to view the execution plan.

    EXPLAIN
    SELECT
        tagid,
        hll_cardinality(
            hll_union(
                COALESCE(w4, hll_empty()),
                COALESCE(w5, hll_empty())
            )
        ) AS vids
    FROM t_like
    WHERE uid = 1
    ORDER BY 2 DESC
    LIMIT 10;

    The following result is returned:

                                            QUERY PLAN                                             
    -------------------------------------------------------------------------------------------    
     Limit  (cost=0.11..0.15 rows=1 width=12)    
       ->  Index Scan Backward using idx_t_like_1 on t_like  (cost=0.11..0.15 rows=1 width=12)    
             Index Cond: (uid = 1)    
    (3 rows)    
  6. Use pgbench to insert data into the test table. This topic uses an ECS instance that is in the same VPC as the ApsaraDB RDS for PostgreSQL instance as an example.

    Note

    pgbench is a simple tool for running benchmark tests on PostgreSQL. Make sure that the PostgreSQL client is installed on the destination ECS instance. For more information about this command, see the official PostgreSQL documentation.

    1. On the ECS instance, run the vi test.sql command to create a test SQL file named test.sql and insert the following content.

      \set uid    random(1, 50000)
      \set tagid  random(1, 5000)
      \set vid    random(1, 10000000)
      
      INSERT INTO t_like (
          uid,
          tagid,
          w5,
          w5_mod_time,
          whole
      )
      VALUES (
          :uid,
          :tagid,
          hll_hash_integer(:vid) || hll_empty(),
          NOW(),
          hll_hash_integer(:vid) || hll_empty()
      )
      ON CONFLICT (uid, tagid) DO UPDATE
      SET
          w5 =
              CASE
                  WHEN DATE(t_like.w5_mod_time) <> CURRENT_DATE
                  THEN EXCLUDED.w5
                  ELSE hll_union(
                           COALESCE(t_like.w5, hll_empty()),
                           EXCLUDED.w5
                       )
              END,
          w5_mod_time = EXCLUDED.w5_mod_time,
          whole       = hll_union(
                           COALESCE(t_like.whole, hll_empty()),
                           EXCLUDED.whole
                       )
      WHERE
            hll_union(COALESCE(t_like.w5,   hll_empty()), EXCLUDED.w5)   
          <> COALESCE(t_like.w5,   hll_empty())
         OR
            hll_union(COALESCE(t_like.whole, hll_empty()), EXCLUDED.whole)
          <> COALESCE(t_like.whole, hll_empty());
    2. Run the following command to insert test data.

      pgbench \
        -M prepared \
        -n \
        -r \
        -P 1 \
        -c 32 \
        -j 32 \
        -T 120 \
        -f ./test.sql \
        -h pgm-****.pg.rds.aliyuncs.com \
        -p 5432 \
        -U testdbuser \
        testdb

      In this command, -h specifies the endpoint of the ApsaraDB RDS for PostgreSQL instance, -p specifies the port, -U specifies the database account, and testdb is the name of the destination database. Replace these values with the values of your instance.

      The following result is returned:

      transaction type: ./test.sql    
      scaling factor: 1    
      query mode: prepared    
      number of clients: 32    
      number of threads: 32    
      duration: 120 s    
      number of transactions actually processed: 24636321    
      latency average = 0.156 ms    
      latency stddev = 0.339 ms    
      tps = 205301.110313 (including connections establishing)    
      tps = 205354.851711 (excluding connections establishing)    
      statement latencies in milliseconds:    
               0.001  \set uid random(1,50000)    
               0.001  \set tagid random(1,5000)    
               0.000  \set vid random(1,10000000)    
               0.154  insert into t_like (    
  7. Connect to the ApsaraDB RDS for PostgreSQL instance and run the following command to query the tag popularity ranking for a specific UID.

    SELECT
        tagid,
        hll_cardinality(
            hll_union(
                COALESCE(w4, hll_empty()),
                COALESCE(w5, hll_empty())
            )
        ) AS vids
    FROM t_like
    WHERE uid = 1
    ORDER BY 2 DESC
    LIMIT 10;      

    The following result is returned:

     tagid | vids     
    -------+------    
       200 |    2    
      1413 |    1    
      1996 |    1    
      2642 |    1    
      3664 |    1    
      4340 |    1    
    (6 rows)    
        
    Time: 0.688 ms    
    Note

    The response time is 0.688 ms.

Other requirement examples

Filter out songs that the user has already listened to. For example, you can determine whether a song ID (vid) is in the hash.

Inexact operation

  • Check that a vid is not included:

    SELECT
        whole || hll_hash_integer(1) = whole
    FROM t_like
    WHERE uid = 1
      AND tagid = 200;  -- Returns false, indicating that vid:1 is not included.    
     
     ?column?     
    ----------    
     f    
    (1 row)  
  • Check that a vid is included:

    SELECT
        whole || hll_hash_integer(12345) = whole
    FROM t_like
    WHERE uid = 1
      AND tagid = 200;    -- Returns true, indicating that vid:12345 is included.
          
     ?column?     
    ----------    
     t    
    (1 row)    

Exact operation

Create a test table for songs that have been listened to. The following code provides an example:

CREATE TABLE t_like_lossless (
    uid INT,
    vid INT,
    PRIMARY KEY (uid, vid)
); 
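With this table, membership can be checked exactly instead of approximately. A minimal sketch:

```sql
-- Record a play exactly (duplicates are ignored by the primary key).
INSERT INTO t_like_lossless (uid, vid)
VALUES (1, 12345)
ON CONFLICT (uid, vid) DO NOTHING;

-- Exact membership check: has uid 1 listened to vid 12345?
SELECT EXISTS (
    SELECT 1
    FROM t_like_lossless
    WHERE uid = 1
      AND vid = 12345
);
```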

Conventional design

This design is applicable to all databases. However, when the data volume is large, it requires aggregate queries, which reduces query efficiency.

  1. Connect to the ApsaraDB RDS for PostgreSQL instance and create a test table. The following code provides an example of the table creation statement:

    CREATE TABLE t_like (
        uid       INT,          -- User ID
        tagid     INT,          -- Song tag ID
        vid       INT,          -- Song ID
        mod_time  TIMESTAMP,    -- The time of the last update. An update is triggered only when more than one day has passed since the last update.
        PRIMARY KEY (uid, tagid, vid)
    );    
  2. Use pgbench to insert data into the test table. This topic uses an ECS instance that is in the same VPC as the ApsaraDB RDS for PostgreSQL instance as an example.

    Note

    pgbench is a simple tool for running benchmark tests on PostgreSQL. Make sure that the PostgreSQL client is installed on the destination ECS instance. For more information about this command, see the official PostgreSQL documentation.

    1. On the ECS instance, run the vi test.sql command to create a test SQL file named test.sql and insert the following content.

      \set uid    random(1, 50000)
      \set tagid  random(1, 5000)
      \set vid    random(1, 10000000)
      
      INSERT INTO t_like
      VALUES (:uid, :tagid, :vid, NOW())
      ON CONFLICT (uid, tagid, vid) DO UPDATE
          SET mod_time = EXCLUDED.mod_time
      WHERE
          EXCLUDED.mod_time - t_like.mod_time > INTERVAL '1 day';
    2. Run the following command to insert test data.

      pgbench \
        -M prepared \
        -n \
        -r \
        -P 1 \
        -c 32 \
        -j 32 \
        -T 120 \
        -f ./test.sql \
        -h pgm-****.pg.rds.aliyuncs.com \
        -p 5432 \
        -U testdbuser \
        testdb

      In this command, -h specifies the endpoint of the ApsaraDB RDS for PostgreSQL instance, -p specifies the port, -U specifies the database account, and testdb is the name of the destination database. Replace these values with the values of your instance.
  3. Connect to the ApsaraDB RDS for PostgreSQL instance and run the following command to obtain statistics for the top 10 tags from the last day.

    SELECT
        tagid,
        COUNT(*)
    FROM t_like
    WHERE uid = 1
      AND NOW() - mod_time < INTERVAL '1 day'
    GROUP BY tagid
    ORDER BY COUNT(*) DESC
    LIMIT 10;

    The following sample result is returned:

     tagid | count   
    -------+-------  
      2519 |     4  
      3049 |     4  
      3648 |     4  
      1777 |     3  
      1352 |     3  
      1491 |     3  
      1064 |     3  
       572 |     3  
       692 |     3  
       301 |     3  
    (10 rows)  
      
    Time: 3.947 ms